
Machine Learning Project Problem 1 Jupyter Notebook PDF

This document summarizes the steps taken to import libraries, read in an election dataset, clean the data, and analyze the dataset. Key steps include: 1) Importing machine learning libraries. 2) Reading in an election dataset with 1525 rows and 9 columns from an Excel file. 3) Checking for missing data (none found) and data types. 4) Removing 8 duplicate rows from the dataset. 5) Analyzing categorical variables to find unique values and frequencies.

Uploaded by

sonali Pradhan

3/6/22, 10:44 PM MACHINE LEARNING_PROJECT-PROBLEM 1 - Jupyter Notebook



Importing required Libraries

In [1]: 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style
%matplotlib inline
import seaborn as sns; sns.set() # for plot styling
from scipy import stats
import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import dendrogram,linkage,fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix,
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics,model_selection
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from scipy.stats import zscore
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
import warnings
warnings.filterwarnings('ignore')

localhost:8888/notebooks/Desktop/MACHINE LEARNING_PROJECT-PROBLEM 1.ipynb 1/85



In [2]: 

Elect_df= pd.read_excel("Election_Data.xlsx",sheet_name="Election_Dataset_Two Classes",inde


Elect_df.head()

Out[2]:

vote age economic.cond.national economic.cond.household Blair Hague Europe politic

1 Labour 43 3 3 4 1 2

2 Labour 36 4 4 4 4 5

3 Labour 35 4 4 5 2 3

4 Labour 24 4 2 2 1 4

5 Labour 41 2 2 1 1 6

In [3]: 

# The shape attribute gives the number of rows and columns in a DataFrame.


print('The dataset has {} rows and {} columns'.format(Elect_df.shape[0],Elect_df.shape[1]))

The dataset has 1525 rows and 9 columns

In [4]: 

# Checking Data info


Elect_df.info();

<class 'pandas.core.frame.DataFrame'>

Int64Index: 1525 entries, 1 to 1525
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   vote                     1525 non-null   object
 1   age                      1525 non-null   int64
 2   economic.cond.national   1525 non-null   int64
 3   economic.cond.household  1525 non-null   int64
 4   Blair                    1525 non-null   int64
 5   Hague                    1525 non-null   int64
 6   Europe                   1525 non-null   int64
 7   political.knowledge      1525 non-null   int64
 8   gender                   1525 non-null   object
dtypes: int64(7), object(2)
memory usage: 119.1+ KB

In [5]: 

# Handling missing data


# Test whether there is any null value in our dataset or not. We can do this using isnull()
Elect_df.isnull().sum()
print("There are", Elect_df.isnull().values.sum(),"Missing Values in dataset")

There are 0 Missing Values in dataset


In [6]: 

cat=[]
num=[]
for i in Elect_df.columns:
    if Elect_df[i].dtype=="object":
        cat.append(i)
    else:
        num.append(i)
print(cat)
print(num)

['vote', 'gender']

['age', 'economic.cond.national', 'economic.cond.household', 'Blair', 'Hague', 'Europe', 'political.knowledge']
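
The same split into categorical and numeric columns can also be sketched with select_dtypes instead of an explicit loop. The small frame below is a hypothetical stand-in for Elect_df; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for Elect_df (three rows, four of the nine columns).
toy_df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 36, 35],
    "Blair": [4, 4, 5],
    "gender": ["female", "male", "male"],
})

# Numeric columns, and every column that is not numeric.
num_cols = toy_df.select_dtypes(include="number").columns.tolist()
cat_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print(cat_cols)
print(num_cols)
```

Selecting on "number" rather than on "object" avoids depending on how a given pandas version stores strings.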

In [7]: 

for variable in cat:
    print(variable,":", sum(Elect_df[variable] == '?'))

vote : 0

gender : 0

In [8]: 

Elect_df[num].describe().T

Out[8]:

                          count       mean        std   min   25%   50%   75%   max
age                      1525.0  54.182295  15.711209  24.0  41.0  53.0  67.0  93.0
economic.cond.national   1525.0   3.245902   0.880969   1.0   3.0   3.0   4.0   5.0
economic.cond.household  1525.0   3.140328   0.929951   1.0   3.0   3.0   4.0   5.0
Blair                    1525.0   3.334426   1.174824   1.0   2.0   4.0   4.0   5.0
Hague                    1525.0   2.746885   1.230703   1.0   2.0   2.0   4.0   5.0
Europe                   1525.0   6.728525   3.297538   1.0   4.0   6.0  10.0  11.0
political.knowledge      1525.0   1.542295   1.083315   0.0   0.0   2.0   2.0   3.0

In [9]: 

Elect_df[cat].describe().T

Out[9]:

       count unique     top  freq
vote    1525      2  Labour  1063
gender  1525      2  female   812


In [10]: 

# Checking for Duplicates


dups=Elect_df.duplicated()
print("Total no of duplicate values = %d" % (dups.sum()))
Elect_df[dups]

Total no of duplicate values = 8

Out[10]:

vote age economic.cond.national economic.cond.household Blair Hague Europ

68 Labour 35 4 4 5 2

627 Labour 39 3 4 4 2

871 Labour 38 2 4 2 2

984 Conservative 74 4 3 2 4

1155 Conservative 53 3 4 2 2

1237 Labour 36 3 3 2 2

1245 Labour 29 4 4 4 2

1439 Labour 40 4 3 4 2

Removing Duplicate Data

In [11]: 

Elect_df.drop_duplicates(inplace=True)

In [12]: 

Elect_df.shape

Out[12]:

(1517, 9)
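
The drop from 1525 to 1517 rows matches the 8 duplicates reported above. As a minimal sketch of what duplicated() and drop_duplicates() do, here is a hypothetical four-row frame (not the election data) in which one row repeats:

```python
import pandas as pd

# Hypothetical toy frame: the second row is an exact copy of the first.
toy_df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Conservative", "Labour"],
    "age": [35, 35, 74, 40],
})

dups = toy_df.duplicated()          # True for second and later copies of a row
deduped = toy_df.drop_duplicates()  # keeps the first occurrence by default

print("duplicates flagged:", int(dups.sum()))
print("rows before/after:", len(toy_df), len(deduped))
```

Because only the later copies are flagged, the number of flagged rows equals the number of rows removed.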

unique values for categorical variables


In [13]: 

### unique values for categorical variables


for column in Elect_df.columns:
    if Elect_df[column].dtype == 'object':
        print(column.upper(),': ',Elect_df[column].nunique())
        print(Elect_df[column].value_counts().sort_values())
        print('\n')

VOTE :  2
Conservative     460
Labour          1057
Name: vote, dtype: int64


GENDER :  2
male      709
female    808
Name: gender, dtype: int64

In [14]: 

# Checking the Skewness in data


# Restrict to the numeric columns; recent pandas versions raise on object columns
Elect_df[num].skew(axis=0,skipna=True)

Out[14]:

age                        0.139800
economic.cond.national    -0.238474
economic.cond.household   -0.144148
Blair                     -0.539514
Hague                      0.146191
Europe                    -0.141891
political.knowledge       -0.422928
dtype: float64
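
To make the sign of these values easier to read, here is a small sketch of the bias-adjusted sample skewness. To my understanding this is the estimator pandas uses for skew(), but treat that as an assumption; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness: sqrt(n(n-1))/(n-2) * m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)   # second central moment
    m3 = np.mean(d ** 3)   # third central moment
    return np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

print(sample_skewness([1, 2, 3, 4, 5]))      # symmetric data
print(sample_skewness([1, 4, 4, 5, 5, 5]))   # long tail to the left
print(sample_skewness([1, 1, 1, 2, 2, 5]))   # long tail to the right
```

A negative value, as for Blair and political.knowledge, means the longer tail is on the low side of the scale.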

Univariate Analysis


In [15]: 

a=1
plt.figure(figsize=(15,112))
# Three panels per numeric column: distribution plot, histogram, boxplot
for i in Elect_df.columns:
    if Elect_df[i].dtype != 'object':
        plt.subplot(21,3,a)
        sns.distplot(Elect_df[i])
        plt.title("Distribution plot for:" + i)
        plt.subplot(21,3,a+1)
        sns.histplot(Elect_df[i])
        plt.title("Histogram for:" + i)
        plt.subplot(21,3,a+2)
        sns.boxplot(x=Elect_df[i])
        plt.title("Boxplot for:" + i)
        a+=3


Bivariate and Multivariate Analysis


In [16]: 

fig, (ax1,ax2,ax3,ax4)=plt.subplots(1,4,figsize=(16,5))
fig2, (ax5,ax6,ax7)=plt.subplots(1,3,figsize=(12,5))

# One strip plot per numeric variable against vote; x and y are passed by
# keyword because recent seaborn versions no longer accept them positionally
for ax, col in zip((ax1,ax2,ax3,ax4,ax5,ax6,ax7), num):
    sns.stripplot(x=Elect_df["vote"], y=Elect_df[col], jitter=True, ax=ax)
    ax.set_xlabel('vote', fontsize=15)
    ax.set_title(col + ' by vote', fontsize=15)
    ax.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()


In [17]: 

fig, (ax1,ax2,ax3,ax4)=plt.subplots(1,4,figsize=(16,5))
fig2, (ax5,ax6,ax7)=plt.subplots(1,3,figsize=(12,5))

# One strip plot per numeric variable against gender; x and y are passed by
# keyword because recent seaborn versions no longer accept them positionally
for ax, col in zip((ax1,ax2,ax3,ax4,ax5,ax6,ax7), num):
    sns.stripplot(x=Elect_df["gender"], y=Elect_df[col], jitter=True, ax=ax)
    ax.set_xlabel('gender', fontsize=15)
    ax.set_title(col + ' by gender', fontsize=15)
    ax.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()


Check for Data Distribution w.r.t Vote

In [18]: 

### Data Distribution


plt.figure(figsize=(24,8))
sns.pairplot(Elect_df,hue='vote');

<Figure size 1728x576 with 0 Axes>


In [19]: 

# Correlation matrix of the numeric columns
Elect_df[num].corr()

Out[19]:

age economic.cond.national economic.cond.household Bla

age 1.000000 0.018687 -0.038868 0.03208

economic.cond.national 0.018687 1.000000 0.347687 0.32614

economic.cond.household -0.038868 0.347687 1.000000 0.21582

Blair 0.032084 0.326141 0.215822 1.00000

Hague 0.031144 -0.200790 -0.100392 -0.24350

Europe 0.064562 -0.209150 -0.112897 -0.29594

political.knowledge -0.046598 -0.023510 -0.038528 -0.02129

In [20]: 

# plot the correlation coefficients as a heatmap


plt.subplots(figsize=(15,10))
sns.heatmap(Elect_df[num].corr(), annot=True, fmt='.2f', cmap='Blues', vmax=1, vmin=-1);

Check for Outliers


In [21]: 

#Check for presence of outliers


plt.figure(figsize=(15,10))
Elect_df[num].boxplot(patch_artist = True, color='red',notch=True)
plt.title('Rectangular box plot')
plt.show();

Most of the numerical columns show almost no outliers; the only ones appear in the
economic.cond.national and economic.cond.household variables. In Gaussian Naive Bayes, outliers
distort the fitted Gaussian distribution because they shift the mean and inflate the variance. So,
depending on the use case, it can make sense to treat the outliers.

In [22]: 

print('Range of values: ', Elect_df['economic.cond.national'].max()-Elect_df['economic.cond

Range of values: 4


In [23]: 

#Central values
print('Minimum value economic.cond.national: ', Elect_df['economic.cond.national'].min())
print('Maximum economic.cond.national: ',Elect_df['economic.cond.national'].max())
print('Mean value economic.cond.national: ', Elect_df['economic.cond.national'].mean())
print('Median value economic.cond.national: ',Elect_df['economic.cond.national'].median())
print('Standard deviation economic.cond.national: ', Elect_df['economic.cond.national'].std
print('Null values economic.cond.national: ',Elect_df['economic.cond.national'].isnull().an

Minimum value economic.cond.national: 1

Maximum economic.cond.national: 5

Mean value economic.cond.national: 3.245220830586684

Median value economic.cond.national: 3.0

Standard deviation economic.cond.national: 0.8817924638047195

Null values economic.cond.national: False

In [24]: 

#Quartiles

Q1=Elect_df['economic.cond.national'].quantile(q=0.25)
Q3=Elect_df['economic.cond.national'].quantile(q=0.75)
print('economic.cond.national - 1st Quartile (Q1) is: ', Q1)
print('economic.cond.national - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of economic.cond.national is ', stats.iqr(Elect_df['econom

economic.cond.national - 1st Quartile (Q1) is: 3.0

economic.cond.national - 3rd Quartile (Q3) is: 4.0

Interquartile range (IQR) of economic.cond.national is 1.0

In [25]: 

#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outlier fence for economic.cond.national: ', L_outliers)
print('Upper outlier fence for economic.cond.national: ', U_outliers)

Lower outlier fence for economic.cond.national: 1.5

Upper outlier fence for economic.cond.national: 5.5

In [26]: 

print('Number of outliers in economic.cond.national upper : ', Elect_df[Elect_df['economic.


print('Number of outliers in economic.cond.national lower : ', Elect_df[Elect_df['economic.
print('% of Outlier in economic.cond.national upper: ',round(Elect_df[Elect_df['economic.co
print('% of Outlier in economic.cond.national lower: ',round(Elect_df[Elect_df['economic.co

Number of outliers in economic.cond.national upper : 0

Number of outliers in economic.cond.national lower : 1517

% of Outlier in economic.cond.national upper: 0 %

% of Outlier in economic.cond.national lower: 100 %
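
The reported 100 % of lower outliers does not agree with the fences computed above: with Q1 = 3.0 and a lower fence of 1.5, only ratings of 1 can fall below it. The cell that produced the counts is cut off in this export, so the cause cannot be confirmed here. A minimal sketch of the intended count, on hypothetical ratings chosen to have the same quartiles as economic.cond.national:

```python
import numpy as np

# Hypothetical ratings on a 1-5 scale (not the election data).
ratings = np.array([1, 2, 3, 3, 3, 3, 4, 4, 4, 5])

q1, q3 = np.percentile(ratings, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A lower outlier lies strictly below the lower fence,
# an upper outlier strictly above the upper fence.
n_lower = int((ratings < lower_fence).sum())
n_upper = int((ratings > upper_fence).sum())

print("fences:", lower_fence, upper_fence)
print("lower outliers:", n_lower, "upper outliers:", n_upper)
```

On the real column the same comparison would count only the respondents who answered 1.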


Outlier Treatment

In [27]: 

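def remove_outlier(col):
    # Returns the lower and upper IQR fences for a column; it does not remove anything itself.
    # np.percentile does not need sorted input, so no sorting is done here.
    Q1,Q3=np.percentile(col,[25,75])
    IQR=Q3-Q1
    lower_range= Q1-(1.5 * IQR)
    upper_range= Q3+(1.5 * IQR)
    return lower_range, upper_range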

In [28]: 

lr,ur=remove_outlier(Elect_df["economic.cond.national"])
Elect_df["economic.cond.national"]=np.where(Elect_df["economic.cond.national"]>ur,ur,Elect_
Elect_df["economic.cond.national"]=np.where(Elect_df["economic.cond.national"]<lr,lr,Elect_
lr,ur=remove_outlier(Elect_df["economic.cond.household"])
Elect_df["economic.cond.household"]=np.where(Elect_df["economic.cond.household"]>ur,ur,Elec
Elect_df["economic.cond.household"]=np.where(Elect_df["economic.cond.household"]<lr,lr,Elec
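
The two np.where calls per column cap values at the fences rather than dropping rows. Assuming the truncated lines above do exactly that, the same capping can be sketched with np.clip; the ratings and the fence values 1.5 and 5.5 below are illustrative.

```python
import numpy as np

# Hypothetical ratings and the fences computed earlier for economic.cond.national.
ratings = np.array([1, 2, 3, 4, 5])
lr, ur = 1.5, 5.5

# Two-step capping in the style of the notebook cell.
capped_where = np.where(ratings > ur, ur, ratings)
capped_where = np.where(capped_where < lr, lr, capped_where)

# Equivalent one-step capping.
capped_clip = np.clip(ratings, lr, ur)

print(capped_where)
print(capped_clip)
```

Capping an integer column at a fractional fence turns it into a float column, which would explain why the two economic variables show as 3.0, 4.0 and so on after this step.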

In [29]: 

#Check for presence of outliers


plt.figure(figsize=(15,10))
Elect_df[num].boxplot(patch_artist = True, color='red',notch=True)
plt.title('Rectangular box plot')
plt.show();

Get_dummies of the object variables


In [30]: 

cat

Out[30]:

['vote', 'gender']

In [31]: 

cat1 = ['vote', 'gender']

drop_first=True drops the first level of each categorical variable, so a variable with k levels
produces k-1 dummy columns. Keeping all k columns would make them perfectly collinear, because
they always sum to 1. This multicollinearity is known as the dummy variable trap.

In [32]: 

df=pd.get_dummies(Elect_df, columns=cat1,drop_first=True)
df.head()

Out[32]:

age economic.cond.national economic.cond.household Blair Hague Europe political.knowl

1 43 3.0 3.0 4 1 2

2 36 4.0 4.0 4 4 5

3 35 4.0 4.0 5 2 3

4 24 4.0 2.0 2 1 4

5 41 2.0 2.0 1 1 6
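
As a minimal illustration of the effect of drop_first, the sketch below encodes a hypothetical three-row frame with and without it. The column names follow the notebook; the rows are made up.

```python
import pandas as pd

# Hypothetical toy frame with the two categorical columns of the dataset.
toy_df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "male"],
})

full = pd.get_dummies(toy_df, columns=["vote", "gender"])
reduced = pd.get_dummies(toy_df, columns=["vote", "gender"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# Without drop_first, the two vote columns always add up to 1 (perfect collinearity).
vote_sum = full["vote_Conservative"].astype(int) + full["vote_Labour"].astype(int)
print(vote_sum.tolist())
```

With drop_first=True the alphabetically first level of each variable (Conservative, female) becomes the reference category, which is why the notebook ends up with vote_Labour and gender_male.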

In [33]: 

# Copy all the predictor variables into X dataframe


X=df.drop('vote_Labour',axis=1)
# Copy target into the y dataframe.
y=df['vote_Labour']

In [34]: 

# Var prior to scaling


X.var()

Out[34]:

age                        246.544655
economic.cond.national       0.728713
economic.cond.household      0.785491
Blair                        1.380089
Hague                        1.519005
Europe                      10.883687
political.knowledge          1.175961
gender_male                  0.249099
dtype: float64


In [35]: 

# Data prior to scaling


plt.plot(X)
plt.title('Data prior to scaling ', fontsize=15)
plt.show()

Is Scaling necessary here or not?


In [36]: 

# Scaling the attributes.


X[['age','economic.cond.national','economic.cond.household','Blair','Hague','Europe','polit

In [37]: 

# Var post scaling


X.var()

Out[37]:

age                        1.00066
economic.cond.national     1.00066
economic.cond.household    1.00066
Blair                      1.00066
Hague                      1.00066
Europe                     1.00066
political.knowledge        1.00066
gender_male                1.00066
dtype: float64
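
The post-scaling variance is 1.00066 rather than exactly 1 because of two different conventions. This sketch assumes the truncated scaling cell standardises with the population standard deviation (ddof=0), as StandardScaler and scipy.stats.zscore do by default, while DataFrame.var() reports the sample variance (ddof=1). The random data is a placeholder; only n = 1517 comes from the notebook.

```python
import numpy as np

n = 1517                               # rows left after removing duplicates
rng = np.random.default_rng(0)
x = rng.normal(size=n)                 # placeholder data, any column would do

z = (x - x.mean()) / x.std(ddof=0)     # standardise with the population std
sample_var = z.var(ddof=1)             # sample variance, as DataFrame.var() uses

print(round(sample_var, 5))
print(round(n / (n - 1), 5))
```

The ratio n / (n - 1) = 1517 / 1516 is the same for every scaled column, which is why all rows of the output show the same value.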


In [38]: 

# Data post scaling


plt.plot(X)
plt.title('Data post scaling ', fontsize=15)
plt.show()

In [39]: 

X.head()

Out[39]:

age economic.cond.national economic.cond.household Blair Hague Europe

1 -0.716161 -0.301648 -0.179682 0.565802 -1.419969 -1.437338

2 -1.162118 0.870183 0.949003 0.565802 1.014951 -0.527684

3 -1.225827 0.870183 0.949003 1.417312 -0.608329 -1.134120

4 -1.926617 0.870183 -1.308366 -1.137217 -1.419969 -0.830902

5 -0.843577 -1.473479 -1.308366 -1.988727 -1.419969 -0.224465


In [40]: 

y.head()

Out[40]:

1 1

2 1

3 1

4 1

5 1

Name: vote_Labour, dtype: uint8

Train-Test Split: split X and y into training and test sets in a 70:30 ratio with random_state=1

In [41]: 

# Split X and y into training and test set in 70:30 ratio


X_train,X_test, y_train, y_test=train_test_split(X,y,test_size=0.30, random_state=1)

In [42]: 

print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('y_train',y_train.shape)
print('y_test',y_test.shape)

X_train (1061, 8)

X_test (456, 8)

y_train (1061,)

y_test (456,)
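
The 1061 / 456 split follows from the row count. To my understanding, train_test_split rounds a fractional test share up and gives the remainder to the training set; the arithmetic below reproduces the shapes under that assumption.

```python
import math

n_samples = 1517     # rows after removing duplicates
test_size = 0.30

n_test = math.ceil(test_size * n_samples)   # 455.1 rounds up
n_train = n_samples - n_test

print("train rows:", n_train, "test rows:", n_test)
```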

In [43]: 

Logistic_model = LogisticRegression(solver='newton-cg',max_iter=10000,penalty='none',verbos
Logistic_model.fit(X_train, y_train)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.

[Parallel(n_jobs=2)]: Done 1 out of 1 | elapsed: 1.1s finished

Out[43]:

LogisticRegression(max_iter=10000, n_jobs=2, penalty='none', solver='newton-cg',
                   verbose=True)

The LogisticRegression classifier has now been built and trained on the training data with the fit() method.
The model is ready to make predictions: call the predict() method with the test-set features as its
argument.


In [44]: 

## Performance metrics on train data set


y_train_predict=Logistic_model.predict(X_train)
Logistic_model_score_train=Logistic_model.score(X_train,y_train) ## Accuracy
print("The Logistic Regression Model Score on train data set is %.3f " % Logistic_model_sco
print(metrics.confusion_matrix(y_train,y_train_predict)) ## Confusion Matrix
print(metrics.classification_report(y_train,y_train_predict)) ## Classification r

The Logistic Regression Model Score on train data set is 0.834

[[197 110]
 [ 66 688]]
              precision    recall  f1-score   support

           0       0.75      0.64      0.69       307
           1       0.86      0.91      0.89       754

    accuracy                           0.83      1061
   macro avg       0.81      0.78      0.79      1061
weighted avg       0.83      0.83      0.83      1061
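
Every figure in the classification report can be derived from the confusion matrix printed above it. A short sketch using the train-set matrix, with rows as true labels and columns as predicted labels (the layout scikit-learn uses), and class 1 standing for Labour:

```python
# Train-set confusion matrix from the output above.
cm = [[197, 110],
      [66, 688]]
tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)

precision_1 = tp / (tp + fp)     # of all predicted Labour, share that is Labour
recall_1 = tp / (tp + fn)        # of all true Labour, share that was found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(round(accuracy, 3))
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print(round(precision_0, 2), round(recall_0, 2))
```

The lower recall for class 0 reflects the 110 Conservative voters the model labels as Labour.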

In [45]: 

# Get the confusion matrix on the train data


sns.heatmap((metrics.confusion_matrix(y_train,Logistic_model.predict(X_train))),annot=True,
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()


In [46]: 

## Performance metrics on test data set


y_test_predict=Logistic_model.predict(X_test)
Logistic_model_score_test=Logistic_model.score(X_test,y_test) ## Accuracy
print("The Logistic Regression Model Score on test data set is %.3f " % Logistic_model_sco
print(metrics.confusion_matrix(y_test,y_test_predict)) ## Confusion Matrix
print(metrics.classification_report(y_test,y_test_predict)) ## Classification re

The Logistic Regression Model Score on test data set is 0.829

[[111  42]
 [ 36 267]]
              precision    recall  f1-score   support

           0       0.76      0.73      0.74       153
           1       0.86      0.88      0.87       303

    accuracy                           0.83       456
   macro avg       0.81      0.80      0.81       456
weighted avg       0.83      0.83      0.83       456

In [47]: 

# Get the confusion matrix on the test data


sns.heatmap((metrics.confusion_matrix(y_test,Logistic_model.predict(X_test))),annot=True,fm
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()

Training Data and Test Data Confusion Matrix Comparison


In [48]: 

f,a = plt.subplots(1,2,sharex=True,sharey=True,squeeze=False)

#Plotting confusion matrix for the different models for the Training Data

plot_0 = sns.heatmap((metrics.confusion_matrix(y_train,y_train_predict)),annot=True,fmt='.5
a[0][0].set_title('Training Data')

plot_1 = sns.heatmap((metrics.confusion_matrix(y_test,y_test_predict)),annot=True,fmt='.5g'
a[0][1].set_title('Test Data');

Training Data and Test Data Classification Report Comparison


In [49]: 

print('Classification Report of the training data:\n\n',metrics.classification_report(y_tra


print('Classification Report of the test data:\n\n',metrics.classification_report(y_test,y_

Classification Report of the training data:

               precision    recall  f1-score   support

           0       0.75      0.64      0.69       307
           1       0.86      0.91      0.89       754

    accuracy                           0.83      1061
   macro avg       0.81      0.78      0.79      1061
weighted avg       0.83      0.83      0.83      1061

Classification Report of the test data:

               precision    recall  f1-score   support

           0       0.76      0.73      0.74       153
           1       0.86      0.88      0.87       303

    accuracy                           0.83       456
   macro avg       0.81      0.80      0.81       456
weighted avg       0.83      0.83      0.83       456

1- Applying GridSearchCV for Logistic Regression

In [50]: 

grid={'penalty':['l2','none','l1','elasticnet'],
'solver':['liblinear','lbfgs','newton-cg'],
'tol':[0.0001,0.00001],
'max_iter': [10000, 5000,15000]}
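
Not every penalty in this grid is accepted by every solver. The sketch below enumerates the grid and filters it with a support table that is an assumption about the scikit-learn release used here (liblinear: l1 and l2; lbfgs and newton-cg: l2 and none; elasticnet only with saga, which is not in the grid). Check the LogisticRegression documentation for your version; newer releases spell the unpenalised option as None rather than 'none'.

```python
from itertools import product

grid = {
    "penalty": ["l2", "none", "l1", "elasticnet"],
    "solver": ["liblinear", "lbfgs", "newton-cg"],
    "tol": [0.0001, 0.00001],
    "max_iter": [10000, 5000, 15000],
}

# Assumption: penalties each solver accepts (see the lead-in).
SUPPORTED = {
    "liblinear": {"l1", "l2"},
    "lbfgs": {"l2", "none"},
    "newton-cg": {"l2", "none"},
}

# Every combination GridSearchCV would try, then only the ones that can fit.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
valid = [c for c in candidates if c["penalty"] in SUPPORTED[c["solver"]]]

fits_per_candidate = 10 * 3   # 10 folds repeated 3 times, as in the next cell
print(len(candidates), "candidates,", len(valid), "valid")
print(len(candidates) * fits_per_candidate, "fits in total")
```

Under this assumption half of the candidates fail to fit. With the default error_score, GridSearchCV scores such fits as NaN and warns, so they cannot be selected, but they still cost time; restricting the grid to valid pairs avoids that.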


In [51]: 

from sklearn.model_selection import RepeatedStratifiedKFold


cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator = Logistic_model, param_grid = grid, cv = cv, n_jobs=2
grid_search.fit(X_train, y_train)

[LibLinear]

Out[51]:

GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1),
             estimator=LogisticRegression(max_iter=10000, n_jobs=2,
                                          penalty='none', solver='newton-cg',
                                          verbose=True),
             n_jobs=2,
             param_grid={'max_iter': [10000, 5000, 15000],
                         'penalty': ['l2', 'none', 'l1', 'elasticnet'],
                         'solver': ['liblinear', 'lbfgs', 'newton-cg'],
                         'tol': [0.0001, 1e-05]},
             scoring='f1')

In [52]: 

print(grid_search.best_params_,'\n')
print(grid_search.best_estimator_)

{'max_iter': 10000, 'penalty': 'l2', 'solver': 'liblinear', 'tol': 0.0001}

LogisticRegression(max_iter=10000, n_jobs=2, solver='liblinear', verbose=True)

In [53]: 

best_model_lr = grid_search.best_estimator_

In [54]: 

# Predictions on the training and test sets

ytrain_predict_lr = best_model_lr.predict(X_train)
ytest_predict_lr = best_model_lr.predict(X_test)


In [55]: 

## Getting the probabilities on the test set

ytest_predict_prob=best_model_lr.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()

Out[55]:

0 1

0 0.428858 0.571142

1 0.155518 0.844482

2 0.006996 0.993004

3 0.839503 0.160497

4 0.066109 0.933891
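
Each row holds the predicted probabilities for class 0 (Conservative) and class 1 (Labour) and sums to 1. For a binary model, the label returned by predict() corresponds to the column with the larger probability, which is the same as thresholding the class-1 column at 0.5. A sketch using the five rows shown above:

```python
import numpy as np

# The first five rows of predict_proba output printed above.
proba = np.array([
    [0.428858, 0.571142],
    [0.155518, 0.844482],
    [0.006996, 0.993004],
    [0.839503, 0.160497],
    [0.066109, 0.933891],
])

labels_argmax = proba.argmax(axis=1)                    # most probable class
labels_threshold = (proba[:, 1] >= 0.5).astype(int)     # class 1 if P(class 1) >= 0.5

print(labels_argmax.tolist())
print(labels_threshold.tolist())
```

The first row is a close call (0.57 for Labour); moving the threshold away from 0.5 would change how such cases are labelled.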

Model Evaluation for Train Data

In [56]: 

print("The Best Logistic Regression Model Score on train data set post tuning is %.3f " % b

The Best Logistic Regression Model Score on train data set post tuning is 0.834

In [57]: 

# Get the confusion matrix on the train data


confusion_matrix(y_train,best_model_lr.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lr.predict(X_train)),annot=True,fmt='.5g',c
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()


In [58]: 

# predict probabilities
probs = best_model_lr.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LR Tuned Model train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='green')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for LR Tuned Model train data set",fontsize=14,color = 'red');

The ROC_AUC score for LR Tuned Model train data set 0.89
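
One way to read an AUC of 0.89: if one Labour voter and one Conservative voter are picked at random, the model gives the Labour voter the higher class-1 probability about 89 % of the time. The sketch below computes AUC directly from that pairwise definition. The function name and the four-point example are hypothetical, and the double loop is only meant for small inputs.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy labels and scores: three of the four positive/negative pairs are ordered correctly.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

A model that ranks every positive above every negative scores 1.0, and random scores give about 0.5, which is the dashed diagonal in the plot.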

Model Evaluation for Test Data

In [59]: 

print("The Best Logistic Regression Model Score on test data set post tuning is %.3f " % b

The Best Logistic Regression Model Score on test data set post tuning is 0.829


In [60]: 

# Get the confusion matrix on the test data


confusion_matrix(y_test,best_model_lr.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lr.predict(X_test)),annot=True,fmt='.5g',cma
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()


In [61]: 

# predict probabilities
probs = best_model_lr.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for LR Tuned Model test data set %.2f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='green')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for LR Tuned Model test data set",fontsize=14,color = 'red');

The ROC_AUC score for LR Tuned Model test data set 0.88


In [62]: 

print('Classification Report of the training data:\n\n',classification_report(y_train, ytra


print('Classification Report of the test data:\n\n',classification_report(y_test, ytest_pre

Classification Report of the training data:

               precision    recall  f1-score   support

           0       0.75      0.64      0.69       307
           1       0.86      0.91      0.89       754

    accuracy                           0.83      1061
   macro avg       0.81      0.78      0.79      1061
weighted avg       0.83      0.83      0.83      1061

Classification Report of the test data:

               precision    recall  f1-score   support

           0       0.76      0.73      0.74       153
           1       0.86      0.88      0.87       303

    accuracy                           0.83       456
   macro avg       0.81      0.80      0.81       456
weighted avg       0.83      0.83      0.83       456

In [63]: 

(best_model_lr.score(X_train, y_train)-best_model_lr.score(X_test, y_test))

Out[63]:

0.00517138746961654
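
This gap of about 0.5 percentage points between train and test accuracy can be reproduced from counts. The tuned model's confusion matrices are only shown as heatmaps, so the sketch below assumes they match the matrices printed for the untuned logistic regression, which is plausible because the classification reports are identical.

```python
# Correct predictions taken from the printed confusion matrices (assumed to
# also hold for the tuned model, see the lead-in).
train_correct, train_total = 197 + 688, 1061
test_correct, test_total = 111 + 267, 456

train_acc = train_correct / train_total
test_acc = test_correct / test_total
gap = train_acc - test_acc

print(round(train_acc, 3), round(test_acc, 3))
print(gap)
```

A gap this small suggests the model is neither overfitting nor underfitting to any notable degree.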

LDA (linear discriminant analysis)


In [64]: 

LDA_model=LinearDiscriminantAnalysis()
LDA_model.fit(X_train,y_train)

Out[64]:

LinearDiscriminantAnalysis()


In [65]: 

## Performance metrics on train data set


y_train_predict=LDA_model.predict(X_train)
LDA_model_score_train=LDA_model.score(X_train,y_train)
print("The LDA Model Score on train data set is %.3f " % LDA_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))

The LDA Model Score on train data set is 0.834

[[200 107]
 [ 69 685]]
              precision    recall  f1-score   support

           0       0.74      0.65      0.69       307
           1       0.86      0.91      0.89       754

    accuracy                           0.83      1061
   macro avg       0.80      0.78      0.79      1061
weighted avg       0.83      0.83      0.83      1061

In [66]: 

# Get the confusion matrix on the train data


sns.heatmap((metrics.confusion_matrix(y_train,LDA_model.predict(X_train))),annot=True,fmt='
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()


In [67]: 

#Performance Matrix on test data set


y_test_predict=LDA_model.predict(X_test)
LDA_model_score_test=LDA_model.score(X_test,y_test)
print("The LDA Model Score on test data set is %.3f " % LDA_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))

The LDA Model Score on test data set is 0.831

[[111 42]

[ 35 268]]

precision recall f1-score support

0 0.76 0.73 0.74 153

1 0.86 0.88 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.81 456

weighted avg 0.83 0.83 0.83 456

In [68]: 

# Get the confusion matrix on the test data


sns.heatmap((metrics.confusion_matrix(y_test,LDA_model.predict(X_test))),annot=True,fmt='.5
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()

Applying GridSearchCV for LDA

In [69]: 

grid_lda ={'solver' :['svd', 'lsqr', 'eigen']}


grid_search_lda = GridSearchCV(estimator = LDA_model, param_grid = grid_lda, cv = cv, n_job
grid_search_lda.fit(X_train, y_train)
best_model_lda = grid_search_lda.best_estimator_
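The grid above varies only the solver. For the 'lsqr' and 'eigen' solvers, scikit-learn's LinearDiscriminantAnalysis also accepts a shrinkage parameter, which could be searched in the same way. A hedged sketch on synthetic data: make_classification stands in for the election data, and cv=5 is an assumption because the notebook's cv object is defined elsewhere.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the notebook's X_train / y_train (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_redundant=0, random_state=1)

# 'svd' does not support shrinkage, so it gets its own sub-grid.
param_grid = [
    {"solver": ["svd"]},
    {"solver": ["lsqr", "eigen"], "shrinkage": [None, "auto", 0.1, 0.5]},
]

search = GridSearchCV(LinearDiscriminantAnalysis(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Splitting the grid into a list of dictionaries keeps invalid combinations (svd with shrinkage) out of the search.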


Model Evaluation for Train Data

In [70]: 

ytrain_predict_lda = best_model_lda.predict(X_train)
ytest_predict_lda= best_model_lda.predict(X_test)

In [71]: 

## Getting the probabilities on the test set


ytest_predict_prob=best_model_lda.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()

Out[71]:

0 1

0 0.466328 0.533672

1 0.137291 0.862709

2 0.005950 0.994050

3 0.866706 0.133294

4 0.053474 0.946526

In [72]: 

#### Model Evaluation for Train Data


print("The Best LDA Model Score on train data set post tuning is %.3f " % best_model_lda.sc

The Best LDA Model Score on train data set post tuning is 0.835


In [73]: 

# Get the confusion matrix on the train data


confusion_matrix(y_train,best_model_lda.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lda.predict(X_train)),annot=True,fmt='.5g',
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()


In [74]: 

# predict probabilities
probs = best_model_lda.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LDA Tuned Model train data set %.3f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for LDA Tuned Model train data set",fontsize=14,color = 'red');

The ROC_AUC score for LDA Tuned Model train data set 0.890
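The AUC reported here is the area under the same curve that roc_curve traces. A small sketch, on synthetic data standing in for the notebook's training set (an assumption), showing that integrating the curve with sklearn.metrics.auc gives the same number as roc_auc_score:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Synthetic stand-in for X_train / y_train (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_redundant=0, random_state=1)

model = LinearDiscriminantAnalysis().fit(X_demo, y_demo)
scores = model.predict_proba(X_demo)[:, 1]   # probability of the positive class

# Points of the ROC curve, then the trapezoidal area under them.
fpr, tpr, thresholds = roc_curve(y_demo, scores)
area_from_curve = auc(fpr, tpr)

# The same quantity computed directly from labels and scores.
area_direct = roc_auc_score(y_demo, scores)

print(round(area_from_curve, 3), round(area_direct, 3))
```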

Model Evaluation for Test Data

In [75]: 

#### Model Evaluation for Test Data


print("The Best LDA Model Score on test data post tuning set is %.3f " % best_model_lda.sco

The Best LDA Model Score on test data post tuning set is 0.831


In [76]: 

# Get the confusion matrix on the Test data


confusion_matrix(y_test,best_model_lda.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lda.predict(X_test)),annot=True,fmt='.5g',cm
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()


In [77]: 

# predict probabilities
probs = best_model_lda.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for LDA Tuned Model test data set %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for LDA Tuned Model test data set",fontsize=14,color = 'red');

The ROC_AUC score for LDA Tuned Model test data set 0.888


In [78]: 

### Classification of Best LDA Model on Train and Test Data


print('Classification Report of the training data:\n\n',classification_report(y_train, ytra
print('Classification Report of the test data:\n\n',classification_report(y_test, ytest_pre

Classification Report of the training data:

precision recall f1-score support

0 0.74 0.65 0.70 307

1 0.87 0.91 0.89 754

accuracy 0.84 1061

macro avg 0.81 0.78 0.79 1061

weighted avg 0.83 0.84 0.83 1061

Classification Report of the test data:

precision recall f1-score support

0 0.76 0.73 0.74 153

1 0.86 0.88 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.81 456

weighted avg 0.83 0.83 0.83 456

In [79]: 

(best_model_lda.score(X_train, y_train)-best_model_lda.score(X_test, y_test))*100

Out[79]:

0.3920912082279293

KNN Model

Good KNN performance usually requires preprocessing the data so that all variables are similarly scaled and centered.
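One way to apply that preprocessing is to put a scaler and the classifier in a single pipeline, so the scaling is learned from the training rows only. A minimal sketch, assuming synthetic data in place of the notebook's X_train and X_test; the inflated first feature is there only to create a scale mismatch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the election data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_demo[:, 0] *= 1000  # put one feature on a much larger scale than the others
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# The scaler is fit on the training rows only, then reused unchanged on the test rows.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

# Same classifier without scaling, for comparison.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

print("scaled:", round(scaled_knn.score(X_te, y_te), 3))
print("raw:   ", round(raw_knn.score(X_te, y_te), 3))
```

Without scaling, the Euclidean distance is dominated by whichever feature has the largest range, so the other features contribute little to the neighbour search.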

In [80]: 

KNN_model=KNeighborsClassifier()
KNN_model.fit(X_train,y_train)

Out[80]:

KNeighborsClassifier()


In [81]: 

## Performance Matrix on train data set


y_train_predict = KNN_model.predict(X_train)
KNN_model_score_train=KNN_model.score(X_train, y_train)
print("The KNN Model Score on Train data %.3f " % KNN_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The KNN Model Score on Train data 0.857

[[217 90]

[ 62 692]]

precision recall f1-score support

0 0.78 0.71 0.74 307

1 0.88 0.92 0.90 754

accuracy 0.86 1061

macro avg 0.83 0.81 0.82 1061

weighted avg 0.85 0.86 0.85 1061

In [82]: 

## Performance Matrix on test data set


y_test_predict = KNN_model.predict(X_test)
KNN_model_score_test = KNN_model.score(X_test, y_test)
print("The KNN Model Score on Test data %.3f " % KNN_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The KNN Model Score on Test data 0.827

[[109 44]

[ 35 268]]

precision recall f1-score support

0 0.76 0.71 0.73 153

1 0.86 0.88 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.80 456

weighted avg 0.82 0.83 0.83 456

Run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the optimal number of neighbours using the misclassification error.

Misclassification error (MCE) = 1 - test accuracy score. Calculate the MCE for each model with neighbours = 1, 3, 5, ..., 19 and pick the model with the lowest MCE.
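The same MCE definition can also be applied to cross-validated accuracy on the training rows, which leaves the test set untouched while K is being chosen. This is a sketch of that variant, not the procedure used in this notebook; the synthetic data and cv=5 are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)

k_values = list(range(1, 20, 2))   # 1, 3, 5, ..., 19
mce_cv = []
for k in k_values:
    # Mean accuracy over 5 folds, converted to a misclassification error.
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
                          cv=5, scoring="accuracy")
    mce_cv.append(1 - acc.mean())

# K with the lowest cross-validated MCE.
best_k = k_values[int(np.argmin(mce_cv))]
print(best_k, round(min(mce_cv), 3))
```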


In [83]: 

# empty list that will hold accuracy scores


ac_scores = []

# perform accuracy metrics for values from 1,3,5....19


for k in range(1,20,2):
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
# evaluate test accuracy
scores = knn.score(X_test, y_test)
ac_scores.append(scores)

# changing to misclassification error


MCE = [1 - x for x in ac_scores]
MCE

Out[83]:

[0.2149122807017544,

0.19736842105263153,

0.17324561403508776,

0.1842105263157895,

0.18201754385964908,

0.17105263157894735,

0.17763157894736847,

0.16885964912280704,

0.16666666666666663,

0.17105263157894735]

Plot misclassification error vs k (with k value on X-axis)

In [84]: 

# plot misclassification error vs k


plt.plot(range(1,20,2), MCE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.title("Misclassification error Vs K Value",fontsize=14,color = 'red');
plt.show()


In the MCE list above, K = 11 gives a test error of about 0.171, close to the minimum of about 0.167 at K = 17. We will build the model with k=11

In [85]: 

from sklearn.neighbors import KNeighborsClassifier


KNN_model_1=KNeighborsClassifier(n_neighbors= 11)
KNN_model_1.fit(X_train,y_train)

Out[85]:

KNeighborsClassifier(n_neighbors=11)

Performance Matrix of KNN New Model on train data set

In [86]: 

## Performance Matrix on train data set


y_train_predict = KNN_model_1.predict(X_train)
KNN_model_score_train_New=KNN_model_1.score(X_train, y_train)
print("The KNN Model Score on Train data %.3f " % KNN_model_score_train_New)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The KNN Model Score on Train data 0.843

[[206 101]

[ 66 688]]

precision recall f1-score support

0 0.76 0.67 0.71 307

1 0.87 0.91 0.89 754

accuracy 0.84 1061

macro avg 0.81 0.79 0.80 1061

weighted avg 0.84 0.84 0.84 1061


In [87]: 

# Get the confusion matrix on the train data


sns.heatmap((metrics.confusion_matrix(y_train,KNN_model_1.predict(X_train))),annot=True,fmt
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('KNN-Confusion Matrix-Train Data')
plt.show()


In [88]: 

# predict probabilities
probs = KNN_model_1.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for KNN train data set %.3f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for KNN train data set",fontsize=14,color = 'red');

The ROC_AUC score for KNN train data set 0.911

Performance Matrix of KNN New Model on test data set

In [89]: 

## Performance Matrix on test data set


y_test_predict = KNN_model_1.predict(X_test)
KNN_model_score_test_New = KNN_model_1.score(X_test, y_test)
print("The KNN Model Score on Test data %.3f " % KNN_model_score_test_New)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The KNN Model Score on Test data 0.829

[[105 48]

[ 30 273]]

precision recall f1-score support

0 0.78 0.69 0.73 153

1 0.85 0.90 0.88 303

accuracy 0.83 456

macro avg 0.81 0.79 0.80 456

weighted avg 0.83 0.83 0.83 456


In [90]: 

# Get the confusion matrix on the test data


sns.heatmap((metrics.confusion_matrix(y_test,KNN_model_1.predict(X_test))),annot=True,fmt='
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('KNN-Confusion Matrix-Test Data')
plt.show()


In [91]: 

# predict probabilities
probs = KNN_model_1.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for KNN test data set %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for KNN test data set",fontsize=14,color = 'red');

The ROC_AUC score for KNN test data set 0.889

Naive Bayes
In [92]: 

NB_model=GaussianNB()
NB_model.fit(X_train, y_train)

Out[92]:

GaussianNB()

The GaussianNB classifier is now built and trained on the training data with the fit() method. The model is ready to make predictions: call the predict() method with the test set features as its argument.
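The fit-then-predict sequence described above can be sketched end to end as follows. The data here is synthetic (an assumption); class_prior_ and theta_ are the fitted class priors and per-class feature means that GaussianNB exposes after training.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the notebook's features and target (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=1)

nb = GaussianNB()
nb.fit(X_tr, y_tr)                 # estimates class priors and per-class Gaussians

y_pred = nb.predict(X_te)          # hard class labels for the test features
proba = nb.predict_proba(X_te)     # one probability column per class

print(nb.class_prior_, nb.theta_.shape, y_pred[:5], proba[:2].round(3))
```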


In [93]: 

#Performance Matrix on train data set


y_train_predict=NB_model.predict(X_train)
Naive_Bayes_model_score_train=NB_model.score(X_train, y_train) ## Accur
print("The Naive Bayes Model Score on train data is %.3f " % Naive_Bayes_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict)) ## confusion_matrix
print(metrics.classification_report(y_train,y_train_predict)) ## classification_report

The Naive Bayes Model Score on train data is 0.834

[[212 95]

[ 81 673]]

precision recall f1-score support

0 0.72 0.69 0.71 307

1 0.88 0.89 0.88 754

accuracy 0.83 1061

macro avg 0.80 0.79 0.80 1061

weighted avg 0.83 0.83 0.83 1061

In [94]: 

# Get the confusion matrix on the train data


sns.heatmap((metrics.confusion_matrix(y_train,NB_model.predict(X_train))),annot=True,fmt='.
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('NB-Confusion Matrix-Train Data')
plt.show()


In [95]: 

## Performance Matrix on test data set


y_test_predict = NB_model.predict(X_test)
Naive_Bayes_model_score_test=NB_model.score(X_test, y_test) ## Accuracy
print("The Naive Bayes Model Score on test data is %.3f " % Naive_Bayes_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict)) ## confusion_matrix
print(metrics.classification_report(y_test, y_test_predict)) ## classification_report

The Naive Bayes Model Score on test data is 0.822

[[112 41]

[ 40 263]]

precision recall f1-score support

0 0.74 0.73 0.73 153

1 0.87 0.87 0.87 303

accuracy 0.82 456

macro avg 0.80 0.80 0.80 456

weighted avg 0.82 0.82 0.82 456

In [96]: 

# Get the confusion matrix on the test data


sns.heatmap((metrics.confusion_matrix(y_test,NB_model.predict(X_test))),annot=True,fmt='.5g
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('NB-Confusion Matrix-Test Data')
plt.show()

Naive Bayes with SMOTE


In [97]: 

from imblearn.over_sampling import SMOTE


#SMOTE is only applied on the train data set
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())


In [98]: 

X_train.shape

Out[98]:

(1061, 8)

In [99]: 

## Let's check the shape after SMOTE


X_train_res.shape

Out[99]:

(1508, 8)
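The new row count follows from the class supports in the earlier reports: 307 rows of class 0 and 754 rows of class 1. With its default strategy SMOTE oversamples the minority class until both classes have 754 rows. The sketch below checks that arithmetic and illustrates the interpolation idea behind SMOTE with NumPy and NearestNeighbors. It is a simplified illustration, not imblearn's implementation, and the minority rows are random placeholders (an assumption).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

n_minority, n_majority = 307, 754          # class supports from the training reports
n_synthetic = n_majority - n_minority      # rows SMOTE has to create
n_total = n_minority + n_majority + n_synthetic

rng = np.random.default_rng(2)
X_min = rng.normal(size=(n_minority, 8))   # placeholder minority rows (assumption)

# For each minority row, find its 5 nearest minority neighbours
# (6 are requested because column 0 is the row itself).
nn = NearestNeighbors(n_neighbors=6).fit(X_min)
neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]

base = rng.integers(0, n_minority, size=n_synthetic)              # starting row
picked = neighbours[base, rng.integers(0, 5, size=n_synthetic)]   # one of its neighbours
gap = rng.random((n_synthetic, 1))                                # position along the segment

# Each synthetic row lies on the line segment between a row and its neighbour.
X_new = X_min[base] + gap * (X_min[picked] - X_min[base])

print(n_synthetic, n_total, X_new.shape)
```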

In [100]: 

NB_SM_model = GaussianNB()
NB_SM_model.fit(X_train_res, y_train_res)

Out[100]:

GaussianNB()

In [101]: 

## Performance Matrix on train data set with SMOTE


y_train_predict = NB_SM_model.predict(X_train_res)
SMOTE_model_score_train = NB_SM_model.score(X_train_res, y_train_res)
print("The SMOTE Model Score for train data set is %.3f " % SMOTE_model_score_train)
print(metrics.confusion_matrix(y_train_res, y_train_predict))
print(metrics.classification_report(y_train_res ,y_train_predict))

The SMOTE Model Score for train data set is 0.822

[[616 138]

[131 623]]

precision recall f1-score support

0 0.82 0.82 0.82 754

1 0.82 0.83 0.82 754

accuracy 0.82 1508

macro avg 0.82 0.82 0.82 1508

weighted avg 0.82 0.82 0.82 1508


In [102]: 

# Get the confusion matrix on the train data


sns.heatmap((metrics.confusion_matrix(y_train_res,NB_SM_model.predict(X_train_res))),annot=
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()

ROC_AUC Curve for Naive Bayes with SMOTE Model on train data set


In [103]: 

probs = NB_SM_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Naive Bayes with SMOTE train data set %.3f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for Naive Bayes with SMOTE train data set",fontsize=14,color = 're

The ROC_AUC score for Naive Bayes with SMOTE train data set 0.887

In [104]: 

## Performance Matrix on test data set


y_test_predict = NB_SM_model.predict(X_test)
SMOTE_model_score_test = NB_SM_model.score(X_test, y_test)
print("The SMOTE Model Score for test data set is %.3f " % SMOTE_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The SMOTE Model Score for test data set is 0.809

[[125 28]

[ 59 244]]

precision recall f1-score support

0 0.68 0.82 0.74 153

1 0.90 0.81 0.85 303

accuracy 0.81 456

macro avg 0.79 0.81 0.80 456

weighted avg 0.82 0.81 0.81 456


In [105]: 

# Get the confusion matrix on the test data


sns.heatmap((metrics.confusion_matrix(y_test,NB_SM_model.predict(X_test))),annot=True,fmt='
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()

ROC_AUC Curve for Naive Bayes with SMOTE Model on test data set


In [106]: 

probs_test = NB_SM_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Naive Bayes with SMOTE test data set %.3f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Naive Bayes with SMOTE test data set",fontsize=14,color = 'red');

The ROC_AUC score for Naive Bayes with SMOTE test data set 0.876

Random Forest
In [107]: 

RF_model=RandomForestClassifier(n_estimators=100,random_state=1)
RF_model.fit(X_train, y_train)

Out[107]:

RandomForestClassifier(random_state=1)


In [108]: 

## Performance Matrix on train data set


y_train_predict = RF_model.predict(X_train)
RF_model_score_train =RF_model.score(X_train, y_train)
print("The random Forest Score on train data is %.2f " % RF_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The random Forest Score on train data is 1.00

[[307 0]

[ 0 754]]

precision recall f1-score support

0 1.00 1.00 1.00 307

1 1.00 1.00 1.00 754

accuracy 1.00 1061

macro avg 1.00 1.00 1.00 1061

weighted avg 1.00 1.00 1.00 1061

In [109]: 

# Get the confusion matrix on the train data


sns.heatmap((metrics.confusion_matrix(y_train,RF_model.predict(X_train))),annot=True,fmt='.
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()


In [110]: 

## Performance Matrix on test data set


y_test_predict = RF_model.predict(X_test)
RF_model_score_test = RF_model.score(X_test, y_test)
print("The random Forest Score on test data is %.3f " % RF_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The random Forest Score on test data is 0.831

[[104 49]

[ 28 275]]

precision recall f1-score support

0 0.79 0.68 0.73 153

1 0.85 0.91 0.88 303

accuracy 0.83 456

macro avg 0.82 0.79 0.80 456

weighted avg 0.83 0.83 0.83 456

In [111]: 

# Get the confusion matrix on the test data


sns.heatmap((metrics.confusion_matrix(y_test,RF_model.predict(X_test))),annot=True,fmt='.5g
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()

In [112]: 

(RF_model_score_train-RF_model_score_test)*100

Out[112]:

16.885964912280706
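A gap of about 17 points between a training score of 1.00 and a test score of 0.831 indicates that the fully grown forest memorises the training rows, so the training score says little about generalisation. One option is the out-of-bag (OOB) score, which estimates accuracy from the training data alone. The sketch below uses synthetic data, and the max_depth and min_samples_leaf values are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy stand-in for the election data (assumption).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, flip_y=0.1,
                                     random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Fully grown trees, as in the notebook, plus the out-of-bag estimate.
full_rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                                 random_state=1).fit(X_tr, y_tr)

# Shallower trees with larger leaves (illustrative values).
small_rf = RandomForestClassifier(n_estimators=100, max_depth=4, min_samples_leaf=5,
                                  oob_score=True, random_state=1).fit(X_tr, y_tr)

for name, rf in [("full", full_rf), ("constrained", small_rf)]:
    print(name,
          "train:", round(rf.score(X_tr, y_tr), 3),
          "oob:", round(rf.oob_score_, 3),
          "test:", round(rf.score(X_te, y_te), 3))
```

Each tree is scored only on the rows left out of its bootstrap sample, so the OOB score is usually much closer to the test score than the training score is.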


Bagging
In [113]: 

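# Note: the base learner is a random forest, not a single decision tree (CART)
cart=RandomForestClassifier()
# In scikit-learn 1.2 and later, the base_estimator parameter is named estimator
Bagging_model=BaggingClassifier(base_estimator=cart,n_estimators=100, random_state=1)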
Bagging_model.fit(X_train,y_train)

Out[113]:

BaggingClassifier(base_estimator=RandomForestClassifier(), n_estimators=100,

random_state=1)

In [114]: 

## Performance Matrix on train data set


y_train_predict=Bagging_model.predict(X_train)
Bagging_model_score_train=Bagging_model.score(X_train,y_train)
print("The Bagging Model Score for train data set is %.2f " % Bagging_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))

The Bagging Model Score for train data set is 0.97

[[278 29]

[ 5 749]]

precision recall f1-score support

0 0.98 0.91 0.94 307

1 0.96 0.99 0.98 754

accuracy 0.97 1061

macro avg 0.97 0.95 0.96 1061

weighted avg 0.97 0.97 0.97 1061


In [115]: 

# Get the confusion matrix on the train data


sns.heatmap((metrics.confusion_matrix(y_train,Bagging_model.predict(X_train))),annot=True,f
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Bagging-Confusion Matrix-Train Data')
plt.show()

In [116]: 

## Performance Matrix on test data set


y_test_predict=Bagging_model.predict(X_test)
Bagging_model_score_test=Bagging_model.score(X_test,y_test)
print("The Bagging Model Score for test data set is %.2f " % Bagging_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))

The Bagging Model Score for test data set is 0.83

[[104 49]

[ 29 274]]

precision recall f1-score support

0 0.78 0.68 0.73 153

1 0.85 0.90 0.88 303

accuracy 0.83 456

macro avg 0.82 0.79 0.80 456

weighted avg 0.83 0.83 0.83 456


In [117]: 

# Get the confusion matrix on the test data


sns.heatmap((metrics.confusion_matrix(y_test,Bagging_model.predict(X_test))),annot=True,fmt
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Bagging-Confusion Matrix-Test Data')
plt.show()

In [118]: 

(Bagging_model_score_train-Bagging_model_score_test)

Out[118]:

0.13900739123964478
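The bagging model above wraps a random forest, which is itself a bagged ensemble of trees. A more common setup uses a single decision tree as the base learner. A hedged sketch on synthetic data: the base learner is passed positionally because its keyword is base_estimator in the scikit-learn version used in this notebook and estimator from version 1.2 onward.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the election data (assumption).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, flip_y=0.1,
                                     random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# 100 decision trees, each trained on its own bootstrap sample of the training rows.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=1),
                                 n_estimators=100, random_state=1)
bagged_trees.fit(X_tr, y_tr)

train_score = bagged_trees.score(X_tr, y_tr)
test_score = bagged_trees.score(X_te, y_te)
print(round(train_score, 3), round(test_score, 3), round(train_score - test_score, 3))
```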

Boosting

Ada Boost


In [119]: 

ADB_model=AdaBoostClassifier(n_estimators=100,random_state=1)
ADB_model.fit(X_train,y_train)

Out[119]:

AdaBoostClassifier(n_estimators=100, random_state=1)

In [120]: 

## Performance Matrix on train data set


y_train_predict=ADB_model.predict(X_train)
ADB_model_score_train=ADB_model.score(X_train,y_train)
print("The ADA boost Model Score for train data set is %.3f " % ADB_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))

The ADA boost Model Score for train data set is 0.850

[[214 93]

[ 66 688]]

precision recall f1-score support

0 0.76 0.70 0.73 307

1 0.88 0.91 0.90 754

accuracy 0.85 1061

macro avg 0.82 0.80 0.81 1061

weighted avg 0.85 0.85 0.85 1061

In [121]: 

# Get the confusion matrix on the train data


sns.heatmap((metrics.confusion_matrix(y_train,ADB_model.predict(X_train))),annot=True,fmt='
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('ADA Boost-Confusion Matrix-Train Data')
plt.show()


In [122]: 

## Performance Matrix on test data set


y_test_predict = ADB_model.predict(X_test)
ADB_model_score_test = ADB_model.score(X_test, y_test)
print("The ADA boost Model Score for test data set is %.3f " % ADB_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The ADA boost Model Score for test data set is 0.814

[[103 50]

[ 35 268]]

precision recall f1-score support

0 0.75 0.67 0.71 153

1 0.84 0.88 0.86 303

accuracy 0.81 456

macro avg 0.79 0.78 0.79 456

weighted avg 0.81 0.81 0.81 456

In [123]: 

# Get the confusion matrix on the test data


sns.heatmap((metrics.confusion_matrix(y_test,ADB_model.predict(X_test))),annot=True,fmt='.5
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('ADA boost-Confusion Matrix-Test Data')
plt.show()

In [124]: 

(ADB_model_score_train-ADB_model_score_test)*100

Out[124]:

3.654488483225027

Gradient Boosting
In [125]: 

gbc_model=GradientBoostingClassifier(random_state=1)
gbc_model.fit(X_train, y_train)

Out[125]:

GradientBoostingClassifier(random_state=1)

In [126]: 

## Performance Matrix on train data set


y_train_predict = gbc_model.predict(X_train)
gbc_model_score_train = gbc_model.score(X_train, y_train)
print("The Gradient Boosting Score for train data set is %.2f " % gbc_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The Gradient Boosting Score for train data set is 0.89

[[239 68]

[ 46 708]]

precision recall f1-score support

0 0.84 0.78 0.81 307

1 0.91 0.94 0.93 754

accuracy 0.89 1061

macro avg 0.88 0.86 0.87 1061

weighted avg 0.89 0.89 0.89 1061

In [127]: 

# Get the confusion matrix on the train data


sns.heatmap((metrics.confusion_matrix(y_train,gbc_model.predict(X_train))),annot=True,fmt='
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Gradient Boost-Confusion Matrix-Train Data')
plt.show()


In [128]: 

## Performance Matrix on test data set


y_test_predict = gbc_model.predict(X_test)
gbc_model_score_test = gbc_model.score(X_test, y_test)
print("The Gradient Boosting Score for test data set is %.2f " % gbc_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The Gradient Boosting Score for test data set is 0.84

[[105 48]

[ 27 276]]

precision recall f1-score support

0 0.80 0.69 0.74 153

1 0.85 0.91 0.88 303

accuracy 0.84 456

macro avg 0.82 0.80 0.81 456

weighted avg 0.83 0.84 0.83 456

In [129]: 

# Get the confusion matrix on the test data


sns.heatmap((metrics.confusion_matrix(y_test,gbc_model.predict(X_test))),annot=True,fmt='.5
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Gradient Boost-Confusion Matrix-Test Data')
plt.show()

In [130]: 

(gbc_model_score_train-gbc_model_score_test)*100

Out[130]:

5.702787836698253
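To see how a train/test gap like this develops as trees are added, scikit-learn's staged_predict yields the predictions after each boosting stage. A minimal sketch on synthetic data (an assumption); the held-out split here plays the role of the notebook's test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy stand-in for the election data (assumption).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, flip_y=0.1,
                                     random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

gbc = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)

# Accuracy after 1, 2, ..., n_estimators boosting stages.
train_acc = [accuracy_score(y_tr, p) for p in gbc.staged_predict(X_tr)]
test_acc = [accuracy_score(y_te, p) for p in gbc.staged_predict(X_te)]

# Stage at which held-out accuracy peaks, and the gap at the final stage.
best_stage = int(np.argmax(test_acc)) + 1
final_gap = (train_acc[-1] - test_acc[-1]) * 100
print(best_stage, round(final_gap, 2))
```

If held-out accuracy peaks well before the last stage, a smaller n_estimators or a lower learning_rate is worth trying.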

Performance Matrix of Logistic Regression on train data set


In [131]: 

## Performance Matrix on train data set


print("The Best Logistic Regression Model Score on train data set is %.2f " % best_model_lr
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lr.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lr.predict(X_train)),annot=True, fmt='.5g',
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LR-Confusion Matrix-Train Data')
plt.show()

The Best Logistic Regression Model Score on train data set is 0.83

ROC_AUC Curve for Logistic Regression on train data set


In [132]: 

# predict probabilities
probs = best_model_lr.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print('The ROC_AUC score for Logistic Regression Train data set: %.3f' % auc)
# calculate ROC curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Logistic Regression Train data set",fontsize=14,color = 'red');

The ROC_AUC score for Logistic Regression Train data set: 0.890

Performance Matrix of Logistic Regression on test data set


In [133]: 

## Performance Matrix on test data set


print("The Best Logistic Regression Model Score on test data set is %.2f " % best_model_lr
# Get the confusion matrix on the test data
confusion_matrix(y_test,best_model_lr.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lr.predict(X_test)),annot=True, fmt='.5g', c
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LR-Confusion Matrix-test data')
plt.show()

The Best Logistic Regression Model Score on test data set is 0.83

ROC_AUC Curve for Logistic Regression on test data set


In [134]: 

# predict probabilities
probs = best_model_lr.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('The ROC_AUC score for Logistic Regression Test data set : %.3f' % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Logistic Regression Test data set ",fontsize=14,color = 'red');

The ROC_AUC score for Logistic Regression Test data set : 0.883

Performance Matrix of LDA (linear discriminant analysis) on train data set

In [135]: 

## Performance Matrix on train data set


print("The Best LDA Model Score on train data set is %.2f " % best_model_lda.score(X_train,
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lda.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lda.predict(X_train)),annot=True, fmt='.5g'
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LDA-Confusion Matrix-Train data')
plt.show()

The Best LDA Model Score on train data set is 0.84

ROC_AUC Curve for LDA (linear discriminant analysis) on train data set

In [136]: 

# predict probabilities
probs = best_model_lda.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LDA Train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for LDA Train data set",fontsize=14,color = 'red');

The ROC_AUC score for LDA Train data set 0.89

Performance Matrix of LDA (linear discriminant analysis) on test data set

In [137]: 

#Performance Matrix on test data set


print("The Best LDA Model Score on test data set is %.2f " % best_model_lda.score(X_test, y
# Get the confusion matrix on the Test data
confusion_matrix(y_test,best_model_lda.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lda.predict(X_test)),annot=True, fmt='.5g',
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LDA-Confusion Matrix-Test Data')
plt.show()

The Best LDA Model Score on test data set is 0.83

ROC_AUC Curve for LDA (linear discriminant analysis) on test data set

In [138]: 

probs = best_model_lda.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
print("The ROC_AUC score for LDA Test data set is %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for LDA Test data set",fontsize=14,color = 'red');

AUC: 0.888

The ROC_AUC score for LDA Test data set is 0.888

Performance Matrix of KNN on train data set

In [139]: 

## Performance Matrix on train data set


y_train_predict = KNN_model_1.predict(X_train)
KNN_model_score_train_New=KNN_model_1.score(X_train, y_train)
print("The KNN Model Score on Train data %.2f " % KNN_model_score_train_New)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The KNN Model Score on Train data 0.84

[[206 101]
 [ 66 688]]

              precision    recall  f1-score   support

           0       0.76      0.67      0.71       307
           1       0.87      0.91      0.89       754

    accuracy                           0.84      1061
   macro avg       0.81      0.79      0.80      1061
weighted avg       0.84      0.84      0.84      1061
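
The accuracy in the report can be recovered from the confusion matrix printed above. The following is a minimal sketch in plain Python, relying on scikit-learn's convention that rows of the matrix are true classes and columns are predicted classes. The variable names are illustrative and do not appear in the notebook.

```python
# KNN train confusion matrix as printed above
# (rows: true class, columns: predicted class).
cm = [[206, 101],
      [66, 688]]

tn, fp = cm[0]   # true class 0: correctly predicted 0, wrongly predicted 1
fn, tp = cm[1]   # true class 1: wrongly predicted 0, correctly predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
print("Correct predictions: %d of %d" % (tn + tp, total))
print("Accuracy: %.2f" % accuracy)
```

The diagonal holds the correct predictions, so 894 of the 1061 training rows are classified correctly, which matches the reported score of 0.84.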

ROC_AUC Curve for KNN on train data set

In [140]: 

# predict probabilities
probs = KNN_model_1.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for KNN train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for KNN train data set",fontsize=14,color = 'red');

The ROC_AUC score for KNN train data set 0.91

Performance Matrix of KNN on test data set

In [141]: 

## Performance Matrix on test data set


y_test_predict = KNN_model_1.predict(X_test)
KNN_model_score_test_New = KNN_model_1.score(X_test, y_test)
print("The KNN Model Score on Test data %.2f " % KNN_model_score_test_New)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The KNN Model Score on Test data 0.83

[[105  48]
 [ 30 273]]

              precision    recall  f1-score   support

           0       0.78      0.69      0.73       153
           1       0.85      0.90      0.88       303

    accuracy                           0.83       456
   macro avg       0.81      0.79      0.80       456
weighted avg       0.83      0.83      0.83       456

ROC_AUC Curve for KNN on test data set

In [142]: 

# predict probabilities
probs = KNN_model_1.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for KNN test data set %.2f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for KNN test data set",fontsize=14,color = 'red');

The ROC_AUC score for KNN test data set 0.89

Performance Matrix of Naive Bayes with SMOTE on train data set

In [143]: 

## Performance Matrix on train data set with SMOTE


y_train_predict = NB_SM_model.predict(X_train_res)
SMOTE_model_score_train = NB_SM_model.score(X_train_res, y_train_res)
print("The SMOTE Model Score for train data set is %.2f " % SMOTE_model_score_train)
print(metrics.confusion_matrix(y_train_res, y_train_predict))
print(metrics.classification_report(y_train_res ,y_train_predict))

The SMOTE Model Score for train data set is 0.82

[[616 138]
 [131 623]]

              precision    recall  f1-score   support

           0       0.82      0.82      0.82       754
           1       0.82      0.83      0.82       754

    accuracy                           0.82      1508
   macro avg       0.82      0.82      0.82      1508
weighted avg       0.82      0.82      0.82      1508
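
The supports in this report (754 rows per class) differ from those in the other train reports (307 and 754) because the SMOTE-resampled training set is balanced. The sketch below works out how many synthetic minority rows that implies and illustrates the interpolation idea behind SMOTE: a new point is placed on the line segment between a minority sample and one of its minority-class neighbours. The helper `interpolate` and the two-feature points are hypothetical and stand in for whatever resampling call the notebook actually used.

```python
import random

# Class supports reported in the notebook's train reports.
minority_before, majority = 307, 754
minority_after = majority          # classes are balanced after resampling
synthetic_rows = minority_after - minority_before
print("Synthetic minority rows added:", synthetic_rows)

def interpolate(sample, neighbour, rng):
    """Return a point on the segment between a minority sample and a neighbour."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(sample, neighbour)]

rng = random.Random(0)
# Hypothetical two-feature minority points, for illustration only.
point = [2.0, 5.0]
neighbour = [4.0, 1.0]
new_point = interpolate(point, neighbour, rng)
print("Synthetic point:", new_point)
```

Because the synthetic rows are built from existing training rows, scores on the resampled training set are not directly comparable with scores on the untouched test set.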

ROC_AUC Curve for Naive Bayes with SMOTE Model on train data set

In [144]: 

probs = NB_SM_model.predict_proba(X_train_res)
probs = probs[:, 1]
auc = roc_auc_score(y_train_res, probs)
print("The ROC_AUC score for Naive Bayes with SMOTE train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train_res, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Naive Bayes with SMOTE train data set",fontsize=14,color = 'red');

The ROC_AUC score for Naive Bayes with SMOTE train data set 0.90

Performance Matrix of Naive Bayes with SMOTE on test data set

In [145]: 

## Performance Matrix on test data set


y_test_predict = NB_SM_model.predict(X_test)
SMOTE_model_score_test = NB_SM_model.score(X_test, y_test)
print("The SMOTE Model Score for test data set is %.2f " % SMOTE_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The SMOTE Model Score for test data set is 0.81

[[125  28]
 [ 59 244]]

              precision    recall  f1-score   support

           0       0.68      0.82      0.74       153
           1       0.90      0.81      0.85       303

    accuracy                           0.81       456
   macro avg       0.79      0.81      0.80       456
weighted avg       0.82      0.81      0.81       456
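
The `macro avg` and `weighted avg` rows differ noticeably in this report because the two classes have different supports. As a rough sketch, the precision column can be reproduced from the confusion matrix above: the macro average is the plain mean of the per-class values, while the weighted average weights each class by its support. The variable names below are illustrative only.

```python
# Naive Bayes (SMOTE) test confusion matrix as printed above
# (rows: true class, columns: predicted class).
cm = [[125, 28],
      [59, 244]]

support = [sum(row) for row in cm]                    # true rows per class
predicted = [cm[0][k] + cm[1][k] for k in range(2)]   # predicted rows per class
precision = [cm[k][k] / predicted[k] for k in range(2)]

macro_precision = sum(precision) / len(precision)
weighted_precision = sum(p * s for p, s in zip(precision, support)) / sum(support)

print("Per-class precision: %.2f, %.2f" % tuple(precision))
print("Macro avg precision: %.2f" % macro_precision)
print("Weighted avg precision: %.2f" % weighted_precision)
```

The larger class 1 pulls the weighted average towards its own precision, which is why the weighted value sits above the macro value.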

ROC_AUC Curve for Naive Bayes with SMOTE Model on test data set

In [146]: 

probs_test = NB_SM_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Naive Bayes with SMOTE Model on test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Naive Bayes with SMOTE Model on test data set",fontsize=14,color =

The ROC_AUC score for Naive Bayes with SMOTE Model on test data set 0.88

Performance Matrix of Random Forest on train data set

In [147]: 

## Performance Matrix on train data set


y_train_predict = RF_model.predict(X_train)
RF_model_score_train =RF_model.score(X_train, y_train)
print("The Random Forest Score on train data is %.2f " % RF_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The Random Forest Score on train data is 1.00

[[307   0]
 [  0 754]]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       307
           1       1.00      1.00      1.00       754

    accuracy                           1.00      1061
   macro avg       1.00      1.00      1.00      1061
weighted avg       1.00      1.00      1.00      1061

In [148]: 

Recall=(754/(0+754))
print("Random Forest-Train Data Set-Recall for class 1 is %.2f " % Recall)

Random Forest-Train Data Set-Recall for class 1 is 1.00

ROC_AUC Curve for Random Forest on train data set

In [149]: 

probs = RF_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The AUC_ROC score for Random Forest train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Random Forest train data",fontsize=14,color = 'red');

The AUC_ROC score for Random Forest train data set 1.00

Performance Matrix of Random Forest on test data set

In [150]: 

## Performance Matrix on test data set


y_test_predict = RF_model.predict(X_test)
RF_model_score_test = RF_model.score(X_test, y_test)
print("The Random Forest Score on test data is %.2f " % RF_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The Random Forest Score on test data is 0.83

[[104  49]
 [ 28 275]]

              precision    recall  f1-score   support

           0       0.79      0.68      0.73       153
           1       0.85      0.91      0.88       303

    accuracy                           0.83       456
   macro avg       0.82      0.79      0.80       456
weighted avg       0.83      0.83      0.83       456

In [151]: 

Recall=(275/(28+275))
print("Random Forest-Test Data Set-Recall for class 1 is %.2f " % Recall)

Random Forest-Test Data Set-Recall for class 1 is 0.91
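
The recall above is typed in by hand from the confusion matrix. A small helper can derive precision, recall and F1 for any class from the matrix itself, which avoids copying numbers between cells. This is a sketch in plain Python; `class_metrics` is a hypothetical helper, not something defined in the notebook, and scikit-learn's `recall_score` could be applied to `y_test` and `y_test_predict` instead.

```python
def class_metrics(cm, cls):
    """Precision, recall and F1 for class `cls` from a square confusion matrix
    whose rows are true classes and whose columns are predicted classes."""
    tp = cm[cls][cls]
    actual = sum(cm[cls])                    # all rows truly in this class
    predicted = sum(row[cls] for row in cm)  # all rows predicted as this class
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Random Forest test confusion matrix as printed above.
rf_test_cm = [[104, 49],
              [28, 275]]
rf_precision, rf_recall, rf_f1 = class_metrics(rf_test_cm, 1)
print("Class 1 precision %.2f, recall %.2f, F1 %.2f"
      % (rf_precision, rf_recall, rf_f1))
```

The three values round to 0.85, 0.91 and 0.88, in agreement with the class 1 row of the classification report.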

ROC_AUC Curve for Random Forest on test data set

In [152]: 

probs_test = RF_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for Random Forest test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Random Forest Test data set",fontsize=14,color = 'red');

The AUC_ROC score for Random Forest test data set 0.90

Performance Matrix of Bagging on train data set

In [153]: 

## Performance Matrix on train data set


y_train_predict=Bagging_model.predict(X_train)
Bagging_model_score_train=Bagging_model.score(X_train,y_train)
print("The Bagging Model Score for train data set is %.2f " % Bagging_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))

The Bagging Model Score for train data set is 0.97

[[278  29]
 [  5 749]]

              precision    recall  f1-score   support

           0       0.98      0.91      0.94       307
           1       0.96      0.99      0.98       754

    accuracy                           0.97      1061
   macro avg       0.97      0.95      0.96      1061
weighted avg       0.97      0.97      0.97      1061

ROC_AUC Curve for Bagging on train data set

In [154]: 

probs = Bagging_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Bagging train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Bagging Train data set",fontsize=14,color = 'red');

The ROC_AUC score for Bagging train data set 1.00

Performance Matrix of Bagging on test data set

In [155]: 

## Performance Matrix on test data set


y_test_predict=Bagging_model.predict(X_test)
Bagging_model_score_test=Bagging_model.score(X_test,y_test)
print("The Bagging Model Score for test data set is %.2f " % Bagging_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))

The Bagging Model Score for test data set is 0.83

[[104  49]
 [ 29 274]]

              precision    recall  f1-score   support

           0       0.78      0.68      0.73       153
           1       0.85      0.90      0.88       303

    accuracy                           0.83       456
   macro avg       0.82      0.79      0.80       456
weighted avg       0.83      0.83      0.83       456

ROC_AUC Curve for Bagging on test data set

In [156]: 

probs_test = Bagging_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for Bagging test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Bagging Test data set",fontsize=14,color = 'red');

The AUC_ROC score for Bagging test data set 0.90

Performance Matrix of Ada Boost on train data set

In [157]: 

## Performance Matrix on train data set


y_train_predict=ADB_model.predict(X_train)
ADB_model_score_train=ADB_model.score(X_train,y_train)
print("The ADA boost Model Score for train data set is %.3f " % ADB_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))

The ADA boost Model Score for train data set is 0.850

[[214  93]
 [ 66 688]]

              precision    recall  f1-score   support

           0       0.76      0.70      0.73       307
           1       0.88      0.91      0.90       754

    accuracy                           0.85      1061
   macro avg       0.82      0.80      0.81      1061
weighted avg       0.85      0.85      0.85      1061

ROC_AUC Curve for Ada Boost on train data set

In [158]: 

probs = ADB_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The AUC_ROC score for ADB Model train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for ADB Model train data set",fontsize=14,color = 'red');

The AUC_ROC score for ADB Model train data set 0.91

Performance Matrix of Ada Boost on test data set

In [159]: 

## Performance Matrix on test data set


y_test_predict = ADB_model.predict(X_test)
ADB_model_score_test = ADB_model.score(X_test, y_test)
print("The ADA boost Model Score for test data set is %.2f " % ADB_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The ADA boost Model Score for test data set is 0.81

[[103  50]
 [ 35 268]]

              precision    recall  f1-score   support

           0       0.75      0.67      0.71       153
           1       0.84      0.88      0.86       303

    accuracy                           0.81       456
   macro avg       0.79      0.78      0.79       456
weighted avg       0.81      0.81      0.81       456

ROC_AUC Curve for Ada Boost on test data set

In [160]: 

probs_test = ADB_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for ADB Model test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for ADB Model test data set",fontsize=14,color = 'red');

The AUC_ROC score for ADB Model test data set 0.88

Performance Matrix of Gradient Boosting on train data set

In [161]: 

## Performance Matrix on train data set


y_train_predict = gbc_model.predict(X_train)
gbc_model_score_train = gbc_model.score(X_train, y_train)
print("The Gradient Boosting Score for train data set is %.3f " % gbc_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The Gradient Boosting Score for train data set is 0.893

[[239  68]
 [ 46 708]]

              precision    recall  f1-score   support

           0       0.84      0.78      0.81       307
           1       0.91      0.94      0.93       754

    accuracy                           0.89      1061
   macro avg       0.88      0.86      0.87      1061
weighted avg       0.89      0.89      0.89      1061

In [162]: 

Recall=(708/(46+708))
print("Gradient Boosting-Train Data Set-Recall for class 1 is %.3f " % Recall)

Gradient Boosting-Train Data Set-Recall for class 1 is 0.939

ROC_AUC Curve for Gradient Boosting on train data set

In [163]: 

probs = gbc_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Gradient Boosting train data set %.3f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Gradient Boosting train data set",fontsize=14,color = 'red');

The ROC_AUC score for Gradient Boosting train data set 0.951

Performance Matrix of Gradient Boosting on test data set

In [164]: 

## Performance Matrix on test data set


y_test_predict = gbc_model.predict(X_test)
gbc_model_score_test = gbc_model.score(X_test, y_test)
print("The Gradient Boosting Score for test data set is %.3f " % gbc_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The Gradient Boosting Score for test data set is 0.836

[[105  48]
 [ 27 276]]

              precision    recall  f1-score   support

           0       0.80      0.69      0.74       153
           1       0.85      0.91      0.88       303

    accuracy                           0.84       456
   macro avg       0.82      0.80      0.81       456
weighted avg       0.83      0.84      0.83       456

In [165]: 

Recall=(276/(27+276))
print("Gradient Boosting-Test Data Set-Recall for class 1 is %.3f " % Recall)

Gradient Boosting-Test Data Set-Recall for class 1 is 0.911

ROC_AUC Curve for Gradient Boosting on test data set

In [166]: 

probs_test = gbc_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Gradient Boosting test data set %.3f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Gradient Boosting test data set",fontsize=14,color = 'red');

The ROC_AUC score for Gradient Boosting test data set 0.899

Comparison of Different Models


In [168]: 

print("The Logistic Regression Model Score Post Tuning on train data set is %.3f " % best_m
print("The Logistic Regression Model Score Post Tuning on test data set is %.3f " % best_m
print("The LDA Model Score Post Tuning on train data set is %.3f " % best_model_lda.score(X
print("The LDA Model Score Post Tuning on test data set is %.3f " % best_model_lda.score(X
print("The KNN Model Score Post Tuning on Train data %.3f " % KNN_model_1.score(X_train, y_
print("The KNN Model Score Post Tuning on Test data %.3f " % KNN_model_1.score(X_test, y_te
print("The Naive Bayes Model Score Post Tuning on train data is %.3f " % NB_SM_model.score(
print("The Naive Bayes Model Score Post Tuning on test data is %.3f " % NB_SM_model.score(X

The Logistic Regression Model Score Post Tuning on train data set is 0.834

The Logistic Regression Model Score Post Tuning on test data set is 0.829

The LDA Model Score Post Tuning on train data set is 0.835

The LDA Model Score Post Tuning on test data set is 0.831

The KNN Model Score Post Tuning on Train data 0.843

The KNN Model Score Post Tuning on Test data 0.829

The Naive Bayes Model Score Post Tuning on train data is 0.822

The Naive Bayes Model Score Post Tuning on test data is 0.809

In [169]: 

print("Difference between train and test scores of LR Model is %.5f " % (best_model_lr.score(X_tr

Difference between train and test scores of LR Model is 0.00517

In [170]:

print("Difference between train and test scores of LDA Model is %.5f " % (best_model_lda.score(X_t

Difference between train and test scores of LDA Model is 0.00392

In [171]:

print("Difference between train and test scores of KNN Model is %.5f " % (KNN_model_1.score(X

Difference between train and test scores of KNN Model is 0.01365

In [172]:

print("Difference between train and test scores of Naive Bayes Model is %.5f " % (NB_SM_model.score(X_

Difference between train and test scores of Naive Bayes Model is 0.01241
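
The four values printed above cover only the first four models. As a rough sketch, the same train/test gap can be tabulated for all eight models from the accuracy scores already reported in this notebook. The numbers below are copied from the printed outputs at whatever precision each cell used, so the gaps are approximate; the dictionary and variable names are illustrative.

```python
# Accuracy scores as printed in the notebook output above: (train, test).
reported_scores = {
    "Logistic Regression": (0.834, 0.829),
    "LDA": (0.835, 0.831),
    "KNN": (0.843, 0.829),
    "Naive Bayes (SMOTE)": (0.822, 0.809),
    "Random Forest": (1.00, 0.83),
    "Bagging": (0.97, 0.83),
    "Ada Boost": (0.850, 0.81),
    "Gradient Boosting": (0.893, 0.836),
}

# Train score minus test score for each model.
gaps = {name: train - test for name, (train, test) in reported_scores.items()}

# Print the models from the largest gap to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    train, test = reported_scores[name]
    print("%-22s train %.3f  test %.3f  gap %.3f" % (name, train, test, gap))

largest_gap_model = max(gaps, key=gaps.get)
```

On these reported figures the tree ensembles (Random Forest and Bagging) show by far the largest gaps, while all eight models land at a test accuracy of roughly 0.81 to 0.84.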

Cross Validation
In [173]: 

from sklearn.model_selection import cross_val_score

In [174]: 

scores = cross_val_score(best_model_lda, X_train, y_train, cv=10)


scores

Out[174]:

array([0.78504673, 0.77358491, 0.83962264, 0.85849057, 0.85849057,
       0.8490566 , 0.81132075, 0.8490566 , 0.81132075, 0.82075472])

In [175]: 

scores = cross_val_score(best_model_lda, X_test, y_test, cv=10)


scores

Out[175]:

array([0.80434783, 0.76086957, 0.86956522, 0.82608696, 0.89130435,
       0.86956522, 0.93333333, 0.84444444, 0.75555556, 0.84444444])
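
Ten fold scores are easier to compare once they are reduced to a mean and a spread. The sketch below does this with the standard library for the two arrays printed above; the list names are illustrative, and the values are copied from `Out[174]` and `Out[175]`.

```python
from statistics import mean, stdev

# Fold accuracies copied from Out[174] and Out[175] above.
cv_train = [0.78504673, 0.77358491, 0.83962264, 0.85849057, 0.85849057,
            0.8490566, 0.81132075, 0.8490566, 0.81132075, 0.82075472]
cv_test = [0.80434783, 0.76086957, 0.86956522, 0.82608696, 0.89130435,
           0.86956522, 0.93333333, 0.84444444, 0.75555556, 0.84444444]

# Summarise each set of folds by its mean, sample standard deviation and range.
for label, folds in (("train", cv_train), ("test", cv_test)):
    print("LDA 10-fold CV on %s data: mean %.3f, std %.3f, min %.3f, max %.3f"
          % (label, mean(folds), stdev(folds), min(folds), max(folds)))
```

The fold means (about 0.83 on the train data and 0.84 on the test data) are close to the single-split LDA scores reported earlier. The test folds are smaller (about 45 rows each), which is consistent with their wider range.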

--------------------------------------END OF PROBLEM 1--------------------------------------
