Machine Learning Project Problem 1 Jupyter Notebook PDF

This document summarizes the steps taken to import libraries, read in an election dataset, clean the data, and analyze the dataset. Key steps include: 1) Importing machine learning libraries. 2) Reading in an election dataset with 1525 rows and 9 columns from an Excel file. 3) Checking for missing data (none found) and data types. 4) Removing 8 duplicate rows from the dataset. 5) Analyzing categorical variables to find unique values and frequencies.

Uploaded by

sonali Pradhan

3/6/22, 10:44 PM MACHINE LEARNING_PROJECT-PROBLEM 1 - Jupyter Notebook

Importing required Libraries

In [1]: 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style
%matplotlib inline
import seaborn as sns; sns.set() # for plot styling
from scipy import stats
import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import dendrogram,linkage,fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix,
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics,model_selection
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from scipy.stats import zscore
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
import warnings
warnings.filterwarnings('ignore')

localhost:8888/notebooks/Desktop/MACHINE LEARNING_PROJECT-PROBLEM 1.ipynb 1/85

In [2]: 

Elect_df= pd.read_excel("Election_Data.xlsx",sheet_name="Election_Dataset_Two Classes",inde

Elect_df.head()

Out[2]:

vote age economic.cond.national economic.cond.household Blair Hague Europe politic

1 Labour 43 3 3 4 1 2

2 Labour 36 4 4 4 4 5

3 Labour 35 4 4 5 2 3

4 Labour 24 4 2 2 1 4

5 Labour 41 2 2 1 1 6

In [3]: 

# The shape attribute gives the number of rows and columns in a dataframe.

print('The dataset has {} rows and {} columns'.format(Elect_df.shape[0],Elect_df.shape[1]))

The dataset has 1525 rows and 9 columns

In [4]: 

# Checking Data info

Elect_df.info();

Int64Index: 1525 entries, 1 to 1525

Data columns (total 9 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 vote 1525 non-null object

1 age 1525 non-null int64

2 economic.cond.national 1525 non-null int64

3 economic.cond.household 1525 non-null int64

4 Blair 1525 non-null int64

5 Hague 1525 non-null int64

6 Europe 1525 non-null int64

7 political.knowledge 1525 non-null int64

8 gender 1525 non-null object

dtypes: int64(7), object(2)

memory usage: 119.1+ KB

In [5]: 

# Handling missing data

# Test whether there is any null value in our dataset or not. We can do this using isnull()
Elect_df.isnull().sum()
print("There are", Elect_df.isnull().values.sum(),"Missing Values in dataset")

There are 0 Missing Values in dataset

In [6]: 

cat=[]
num=[]
for i in Elect_df.columns:
    if Elect_df[i].dtype=="object":
        cat.append(i)
    else:
        num.append(i)
print(cat)
print(num)

['vote', 'gender']

['age', 'economic.cond.national', 'economic.cond.household', 'Blair', 'Hague', 'Europe', 'political.knowledge']

In [7]: 

for variable in cat:
    print(variable,":", sum(Elect_df[variable] == '?'))

vote : 0

gender : 0

In [8]: 

Elect_df[num].describe().T

Out[8]:

count mean std min 25% 50% 75% max

age 1525.0 54.182295 15.711209 24.0 41.0 53.0 67.0 93.0

economic.cond.national 1525.0 3.245902 0.880969 1.0 3.0 3.0 4.0 5.0

economic.cond.household 1525.0 3.140328 0.929951 1.0 3.0 3.0 4.0 5.0

Blair 1525.0 3.334426 1.174824 1.0 2.0 4.0 4.0 5.0

Hague 1525.0 2.746885 1.230703 1.0 2.0 2.0 4.0 5.0

Europe 1525.0 6.728525 3.297538 1.0 4.0 6.0 10.0 11.0

political.knowledge 1525.0 1.542295 1.083315 0.0 0.0 2.0 2.0 3.0

In [9]: 

Elect_df[cat].describe().T

Out[9]:

count unique top freq

vote 1525 2 Labour 1063

gender 1525 2 female 812

In [10]: 

# Checking for Duplicates

dups=Elect_df.duplicated()
print("Total no of duplicate values = %d" % (dups.sum()))
Elect_df[dups]

Total no of duplicate values = 8

Out[10]:

vote age economic.cond.national economic.cond.household Blair Hague Europ

68 Labour 35 4 4 5 2

627 Labour 39 3 4 4 2

871 Labour 38 2 4 2 2

984 Conservative 74 4 3 2 4

1155 Conservative 53 3 4 2 2

1237 Labour 36 3 3 2 2

1245 Labour 29 4 4 4 2

1439 Labour 40 4 3 4 2

Removing Duplicate Data

In [11]: 

Elect_df.drop_duplicates(inplace=True)

In [12]: 

Elect_df.shape

Out[12]:

(1517, 9)

unique values for categorical variables

In [13]: 

### unique values for categorical variables

for column in Elect_df.columns:
    if Elect_df[column].dtype == 'object':
        print(column.upper(),': ',Elect_df[column].nunique())
        print(Elect_df[column].value_counts().sort_values())
        print('\n')

VOTE : 2

Conservative 460

Labour 1057

Name: vote, dtype: int64

GENDER : 2

male 709

female 808

Name: gender, dtype: int64

In [14]: 

# Checking the Skewness in data

Elect_df.skew(axis=0,skipna=True)

Out[14]:

age 0.139800

economic.cond.national -0.238474

economic.cond.household -0.144148

Blair -0.539514

Hague 0.146191

Europe -0.141891

political.knowledge -0.422928

dtype: float64
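
The figures above are sample skewness values. As a hedged illustration of what such a number measures, the sketch below computes the adjusted Fisher-Pearson coefficient in plain Python, which is assumed here to be the bias-corrected estimator behind the pandas `skew` method. The helper name `sample_skewness` and the toy lists are made up for this example and are not part of the notebook.

```python
import math

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness (bias-corrected), needs n >= 3."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    if m2 == 0:
        return 0.0                                   # constant data has no skew
    g1 = m3 / m2 ** 1.5                              # biased estimator
    return math.sqrt(n * (n - 1)) / (n - 2) * g1     # bias correction

# Hypothetical toy samples
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 10]
left_tailed = [-10, -2, -2, -1, -1]

print(sample_skewness(symmetric))          # zero for symmetric data
print(sample_skewness(right_tailed) > 0)   # long right tail gives positive skew
print(sample_skewness(left_tailed) < 0)    # long left tail gives negative skew
```

Read this way, a negative value such as the -0.54 reported for Blair points to a longer left tail, and all seven columns stay well inside the range of -1 to 1.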

Univariate Analysis

In [15]: 

a=1
plt.figure(figsize=(15,112))
for i in Elect_df.columns:
    if Elect_df[i].dtype != 'object':
        plt.subplot(21,3,a)
        sns.distplot(Elect_df[i])
        plt.title("Distribution plot for:" + i)
        plt.subplot(21,3,a+1)
        sns.histplot(Elect_df[i])
        plt.title("Histogram for:" + i)
        plt.subplot(21,3,a+2)
        sns.boxplot(Elect_df[i])
        plt.title("Boxplot for:" + i)
        a+=3

Bivariate and Multivariate Analysis

In [16]: 

fig, (ax1,ax2,ax3,ax4)=plt.subplots(1,4,figsize=(16,5))
fig, (ax5,ax6,ax7)=plt.subplots(1,3,figsize=(12,5))

sns.stripplot(Elect_df["vote"], Elect_df['age'],orient='v',jitter=True,ax=ax1)
ax1.set_xlabel('vote', fontsize=15)
ax1.set_title('Distribution of vote', fontsize=15)
ax1.tick_params(labelsize=15)

sns.stripplot(Elect_df["vote"], Elect_df['economic.cond.national'], jitter=True, ax=ax2)

ax2.set_xlabel('Vote', fontsize=15)
ax2.set_title('Distribution of Vote', fontsize=15)
ax2.tick_params(labelsize=15)

sns.stripplot(Elect_df["vote"], Elect_df['economic.cond.household'], jitter=True, ax=ax3)

ax3.set_xlabel('vote', fontsize=15)
ax3.set_title('Distribution of vote', fontsize=15)
ax3.tick_params(labelsize=15)

sns.stripplot(Elect_df["vote"], Elect_df['Blair'], jitter=True, ax=ax4)

ax4.set_xlabel('vote', fontsize=15)
ax4.set_title('Distribution of vote', fontsize=15)
ax4.tick_params(labelsize=15)

sns.stripplot(Elect_df["vote"], Elect_df['Hague'], jitter=True, ax=ax5)

ax5.set_xlabel('vote', fontsize=15)
ax5.set_title('Distribution of vote', fontsize=15)
ax5.tick_params(labelsize=15)

sns.stripplot(Elect_df["vote"], Elect_df['Europe'], jitter=True, ax=ax6)

ax6.set_xlabel('vote', fontsize=15)
ax6.set_title('Distribution of vote', fontsize=15)
ax6.tick_params(labelsize=15)

sns.stripplot(Elect_df["vote"], Elect_df['political.knowledge'], jitter=True, ax=ax7)

ax7.set_xlabel('vote', fontsize=15)
ax7.set_title('Distribution of vote', fontsize=15)
ax7.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

In [17]: 

fig, (ax1,ax2,ax3,ax4)=plt.subplots(1,4,figsize=(16,5))
fig, (ax5,ax6,ax7)=plt.subplots(1,3,figsize=(12,5))

sns.stripplot(Elect_df["gender"], Elect_df['age'],orient='v',jitter=True,ax=ax1)
ax1.set_xlabel('gender', fontsize=15)
ax1.set_title('Distribution of gender', fontsize=15)
ax1.tick_params(labelsize=15)

sns.stripplot(Elect_df["gender"], Elect_df['economic.cond.national'], jitter=True, ax=ax2)

ax2.set_xlabel('gender', fontsize=15)
ax2.set_title('Distribution of gender', fontsize=15)
ax2.tick_params(labelsize=15)

sns.stripplot(Elect_df["gender"], Elect_df['economic.cond.household'], jitter=True, ax=ax3)

ax3.set_xlabel('gender', fontsize=15)
ax3.set_title('Distribution of gender', fontsize=15)
ax3.tick_params(labelsize=15)

sns.stripplot(Elect_df["gender"], Elect_df['Blair'], jitter=True, ax=ax4)

ax4.set_xlabel('gender', fontsize=15)
ax4.set_title('Distribution of gender', fontsize=15)
ax4.tick_params(labelsize=15)

sns.stripplot(Elect_df["gender"], Elect_df['Hague'], jitter=True, ax=ax5)

ax5.set_xlabel('gender', fontsize=15)
ax5.set_title('Distribution of gender', fontsize=15)
ax5.tick_params(labelsize=15)

sns.stripplot(Elect_df["gender"], Elect_df['Europe'], jitter=True, ax=ax6)

ax6.set_xlabel('gender', fontsize=15)
ax6.set_title('Distribution of gender', fontsize=15)
ax6.tick_params(labelsize=15)

sns.stripplot(Elect_df["gender"], Elect_df['political.knowledge'], jitter=True, ax=ax7)

ax7.set_xlabel('gender', fontsize=15)
ax7.set_title('Distribution of gender', fontsize=15)
ax7.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

Check for Data Distribution w.r.t Vote

In [18]: 

### Data Distribution

plt.figure(figsize=(24,8))
sns.pairplot(Elect_df,hue='vote');

<Figure size 1728x576 with 0 Axes>

In [19]: 

#correlation matrix
Elect_df.corr()

Out[19]:

age economic.cond.national economic.cond.household Bla

age 1.000000 0.018687 -0.038868 0.03208

economic.cond.national 0.018687 1.000000 0.347687 0.32614

economic.cond.household -0.038868 0.347687 1.000000 0.21582

Blair 0.032084 0.326141 0.215822 1.00000

Hague 0.031144 -0.200790 -0.100392 -0.24350

Europe 0.064562 -0.209150 -0.112897 -0.29594

political.knowledge -0.046598 -0.023510 -0.038528 -0.02129

In [20]: 

# plot the correlation coefficients as a heatmap

plt.subplots(figsize=(15,10))
sns.heatmap(Elect_df.corr(), annot=True, fmt='.2f', cmap='Blues', vmax=1, vmin=-1);

Check for Outliers

In [21]: 

#Check for presence of outliers

plt.figure(figsize=(15,10))
Elect_df[num].boxplot(patch_artist = True, color='red',notch=True)
plt.title('Rectangular box plot')
plt.show();

Most numerical columns show no outliers; the box plot flags points only in the economic.cond.national
and economic.cond.household variables. In Gaussian Naive Bayes, outliers distort the mean and variance
estimated for each class and therefore the shape of the fitted Gaussian distribution. Depending on the
use case, it can make sense to treat these outliers.

In [22]: 

print('Range of values: ', Elect_df['economic.cond.national'].max()-Elect_df['economic.cond

Range of values: 4

In [23]: 

#Central values
print('Minimum value economic.cond.national: ', Elect_df['economic.cond.national'].min())
print('Maximum economic.cond.national: ',Elect_df['economic.cond.national'].max())
print('Mean value economic.cond.national: ', Elect_df['economic.cond.national'].mean())
print('Median value economic.cond.national: ',Elect_df['economic.cond.national'].median())
print('Standard deviation economic.cond.national: ', Elect_df['economic.cond.national'].std
print('Null values economic.cond.national: ',Elect_df['economic.cond.national'].isnull().an

Minimum value economic.cond.national: 1

Maximum economic.cond.national: 5

Mean value economic.cond.national: 3.245220830586684

Median value economic.cond.national: 3.0

Standard deviation economic.cond.national: 0.8817924638047195

Null values economic.cond.national: False

In [24]: 

#Quartiles

Q1=Elect_df['economic.cond.national'].quantile(q=0.25)
Q3=Elect_df['economic.cond.national'].quantile(q=0.75)
print('economic.cond.national - 1st Quartile (Q1) is: ', Q1)
print('economic.cond.national - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of economic.cond.national is ', stats.iqr(Elect_df['econom

economic.cond.national - 1st Quartile (Q1) is: 3.0

economic.cond.national - 3rd Quartile (Q3) is: 4.0

Interquartile range (IQR) of economic.cond.national is 1.0

In [25]: 

#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outlier bound for economic.cond.national: ', L_outliers)
print('Upper outlier bound for economic.cond.national: ', U_outliers)

Lower outlier bound for economic.cond.national: 1.5

Upper outlier bound for economic.cond.national: 5.5

In [26]: 

print('Number of outliers in economic.cond.national upper : ', Elect_df[Elect_df['economic.

print('Number of outliers in economic.cond.national lower : ', Elect_df[Elect_df['economic.
print('% of Outlier in economic.cond.national upper: ',round(Elect_df[Elect_df['economic.co
print('% of Outlier in economic.cond.national lower: ',round(Elect_df[Elect_df['economic.co

Number of outliers in economic.cond.national upper : 0

Number of outliers in economic.cond.national lower : 1517

% of Outlier in economic.cond.national upper: 0 %

% of Outlier in economic.cond.national lower: 100 %
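
With Q1 = 3, Q3 = 4 and bounds at 1.5 and 5.5, only values below 1.5 or above 5.5 can be outliers, so a lower count equal to the full row count (1517) suggests that the truncated cell above is not comparing against the lower bound. The following is a minimal sketch of the intended count using only the standard library. The list `ratings` and the helper `iqr_fences` are hypothetical and stand in for the election column, whose true outlier count cannot be recovered from the printout.

```python
from statistics import quantiles

def iqr_fences(values):
    """Return (lower, upper) bounds: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical 1-to-5 survey ratings (not the election data)
ratings = [1, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5]
lower, upper = iqr_fences(ratings)

n_lower = sum(v < lower for v in ratings)   # strictly below the lower bound
n_upper = sum(v > upper for v in ratings)   # strictly above the upper bound

print("bounds:", lower, upper)
print("lower outliers:", n_lower, "upper outliers:", n_upper)
print("share flagged: {:.1%}".format((n_lower + n_upper) / len(ratings)))
```

On a 1-to-5 scale with these bounds, only the value 1 can ever be flagged, so the lower count should equal the number of respondents who answered 1.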

Outlier Treatment

In [27]: 

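def remove_outlier(col):
    # Returns the IQR-based lower and upper bounds used for capping.
    # np.percentile does not need pre-sorted input, so no sorting is done here.
    Q1,Q3=np.percentile(col,[25,75])
    IQR=Q3-Q1
    lower_range= Q1-(1.5 * IQR)
    upper_range= Q3+(1.5 * IQR)
    return lower_range, upper_range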

In [28]: 

lr,ur=remove_outlier(Elect_df["economic.cond.national"])
Elect_df["economic.cond.national"]=np.where(Elect_df["economic.cond.national"]>ur,ur,Elect_
Elect_df["economic.cond.national"]=np.where(Elect_df["economic.cond.national"]<lr,lr,Elect_
lr,ur=remove_outlier(Elect_df["economic.cond.household"])
Elect_df["economic.cond.household"]=np.where(Elect_df["economic.cond.household"]>ur,ur,Elec
Elect_df["economic.cond.household"]=np.where(Elect_df["economic.cond.household"]<lr,lr,Elec
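
The paired `np.where` calls cap values at the IQR bounds rather than dropping rows. A compact sketch of the same capping idea, assuming NumPy is available and using `np.clip` on a hypothetical array (the name `cap_to_fences` and the sample values are made up for this example, not taken from the notebook):

```python
import numpy as np

def cap_to_fences(values):
    """Cap values to [Q1 - 1.5*IQR, Q3 + 1.5*IQR] instead of removing them."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return np.clip(values, q1 - 1.5 * iqr, q3 + 1.5 * iqr)

ratings = [1, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5]   # hypothetical 1-to-5 ratings
capped = cap_to_fences(ratings)

# The single 1 is raised to the lower bound; nothing exceeds the upper bound
print(capped.min(), capped.max())
print(len(capped) == len(ratings))            # no rows are dropped
```

Because the bounds are floats, a capped integer column becomes float-typed, which is consistent with the 3.0 and 4.0 values shown for the two economic columns after encoding.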

In [29]: 

#Check for presence of outliers

plt.figure(figsize=(15,10))
Elect_df[num].boxplot(patch_artist = True, color='red',notch=True)
plt.title('Rectangular box plot')
plt.show();

Get_dummies of the object variables

In [30]: 

cat

Out[30]:

['vote', 'gender']

In [31]: 

cat1 = ['vote', 'gender']

drop_first=True drops the first level of each categorical variable, so a variable with k levels yields
k-1 dummy columns. Keeping all k columns would make them perfectly collinear (they always sum to 1),
which is the dummy variable trap.
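
To make the collinearity concrete, here is a small sketch that assumes NumPy and a hypothetical five-row gender column. It builds the design matrix with and without the redundant dummy and compares the matrix rank with the number of columns; the variable names are illustrative only.

```python
import numpy as np

gender = ["female", "male", "male", "female", "male"]   # hypothetical sample

is_male = np.array([g == "male" for g in gender], dtype=float)
is_female = 1.0 - is_male
intercept = np.ones_like(is_male)

# Both dummies plus an intercept: is_male + is_female == intercept,
# so one column is redundant and the matrix is rank deficient.
full = np.column_stack([intercept, is_female, is_male])
# Dropping the first level keeps only independent columns.
reduced = np.column_stack([intercept, is_male])

print(np.linalg.matrix_rank(full), full.shape[1])        # rank below column count
print(np.linalg.matrix_rank(reduced), reduced.shape[1])  # rank equals column count
```

This matters mostly for models that fit an intercept and estimate coefficients, such as unpenalized logistic regression.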

In [32]: 

df=pd.get_dummies(Elect_df, columns=cat1,drop_first=True)
df.head()

Out[32]:

age economic.cond.national economic.cond.household Blair Hague Europe political.knowl

1 43 3.0 3.0 4 1 2

2 36 4.0 4.0 4 4 5

3 35 4.0 4.0 5 2 3

4 24 4.0 2.0 2 1 4

5 41 2.0 2.0 1 1 6

In [33]: 

# Copy all the predictor variables into X dataframe

X=df.drop('vote_Labour',axis=1)
# Copy target into the y dataframe.
y=df['vote_Labour']

In [34]: 

# Var prior to scaling

X.var()

Out[34]:

age 246.544655

economic.cond.national 0.728713

economic.cond.household 0.785491

Blair 1.380089

Hague 1.519005

Europe 10.883687

political.knowledge 1.175961

gender_male 0.249099

dtype: float64

In [35]: 

# Data prior to scaling

plt.plot(X)
plt.title('Data prior to scaling ', fontsize=15)
plt.show()

Is Scaling necessary here or not?

In [36]: 

# Scaling the attributes.

X[['age','economic.cond.national','economic.cond.household','Blair','Hague','Europe','polit

In [37]: 

# Var post scaling

X.var()

Out[37]:

age 1.00066

economic.cond.national 1.00066

economic.cond.household 1.00066

Blair 1.00066

Hague 1.00066

Europe 1.00066

political.knowledge 1.00066

gender_male 1.00066

dtype: float64
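
The post-scaling variance is 1.00066 rather than exactly 1. One plausible explanation, assuming the truncated cell standardizes with the population standard deviation (the default for both `StandardScaler` and `scipy.stats.zscore`) while `X.var()` uses the sample formula with n - 1 in the denominator, is that the ratio equals n / (n - 1) = 1517 / 1516 ≈ 1.00066. The sketch below checks that arithmetic with the standard library on synthetic data; the generated values are an assumption, and only the row count 1517 comes from the notebook.

```python
import random
from statistics import fmean, pstdev, variance

random.seed(0)
n = 1517                                        # row count after de-duplication
x = [random.gauss(54, 15) for _ in range(n)]    # synthetic stand-in column

mu, sigma = fmean(x), pstdev(x)                 # population std (divides by n)
z = [(v - mu) / sigma for v in x]               # z-score standardization

sample_var = variance(z)                        # sample variance (divides by n - 1)
print(round(sample_var, 5), round(n / (n - 1), 5))
```

The same ratio appears for every column, including gender_male, which suggests that the dummy column was standardized along with the numeric ones.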

In [38]: 

# Data post scaling

plt.plot(X)
plt.title('Data post scaling ', fontsize=15)
plt.show()

In [39]: 

X.head()

Out[39]:

age economic.cond.national economic.cond.household Blair Hague Europe

1 -0.716161 -0.301648 -0.179682 0.565802 -1.419969 -1.437338

2 -1.162118 0.870183 0.949003 0.565802 1.014951 -0.527684

3 -1.225827 0.870183 0.949003 1.417312 -0.608329 -1.134120

4 -1.926617 0.870183 -1.308366 -1.137217 -1.419969 -0.830902

5 -0.843577 -1.473479 -1.308366 -1.988727 -1.419969 -0.224465

In [40]: 

y.head()

Out[40]:

1 1

2 1

3 1

4 1

5 1

Name: vote_Labour, dtype: uint8

Train-Test Split

Split X and y into training and test sets in a 70:30 ratio with random_state=1

In [41]: 

# Split X and y into training and test set in 70:30 ratio

X_train,X_test, y_train, y_test=train_test_split(X,y,test_size=0.30, random_state=1)

In [42]: 

print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('y_train',y_train.shape)
print('y_test',y_test.shape)

X_train (1061, 8)

X_test (456, 8)

y_train (1061,)

y_test (456,)
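
These shapes follow from applying the 30% test fraction to 1517 rows. As a small worked check, assuming that scikit-learn rounds a fractional test size up and gives the remainder to the training set (my understanding of `train_test_split`, not something the notebook states):

```python
import math

n_rows = 1517          # rows left after dropping duplicates
test_size = 0.30

n_test = math.ceil(test_size * n_rows)   # 455.1 rounds up to 456
n_train = n_rows - n_test                # remainder goes to training

print(n_train, n_test)
```

The result matches the 1061 training rows and 456 test rows printed above.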

In [43]: 

Logistic_model = LogisticRegression(solver='newton-cg',max_iter=10000,penalty='none',verbos
Logistic_model.fit(X_train, y_train)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.

[Parallel(n_jobs=2)]: Done 1 out of 1 | elapsed: 1.1s finished

Out[43]:

LogisticRegression(max_iter=10000, n_jobs=2, penalty='none', solver='newton-cg',
                   verbose=True)

The LogisticRegression classifier is now built and trained on the training data with the fit() method.
The fitted model is ready to make predictions: the predict() method takes a feature matrix (here the
training or test set features) and returns the predicted class labels.

In [44]: 

## Performance metrics on train data set

y_train_predict=Logistic_model.predict(X_train)
Logistic_model_score_train=Logistic_model.score(X_train,y_train) ## Accuracy
print("The Logistic Regression Model Score on train data set is %.3f " % Logistic_model_sco
print(metrics.confusion_matrix(y_train,y_train_predict)) ## Confusion Matrix
print(metrics.classification_report(y_train,y_train_predict)) ## Classification r

The Logistic Regression Model Score on train data set is 0.834

[[197 110]

[ 66 688]]

precision recall f1-score support

0 0.75 0.64 0.69 307

1 0.86 0.91 0.89 754

accuracy 0.83 1061

macro avg 0.81 0.78 0.79 1061

weighted avg 0.83 0.83 0.83 1061
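
Every figure in this report can be derived from the confusion matrix printed above it. The sketch below redoes that arithmetic in plain Python as a cross-check; the matrix values are copied from the training output, while the helper name `class_metrics` is introduced here only for illustration.

```python
# Confusion matrix from the training output: rows = true, columns = predicted
cm = [[197, 110],
      [66, 688]]

def class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum: predicted as k
    actual_k = sum(cm[k])                     # row sum: truly k (the support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print("accuracy %.3f" % accuracy)
for k in (0, 1):
    print(k, "precision %.2f recall %.2f f1 %.2f" % class_metrics(cm, k))
```

The lower recall for class 0 (Conservative) comes from the 110 Conservative voters predicted as Labour, out of a support of 307.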

In [45]: 

# Get the confusion matrix on the train data

sns.heatmap((metrics.confusion_matrix(y_train,Logistic_model.predict(X_train))),annot=True,
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()

In [46]: 

## Performance metrics on test data set

y_test_predict=Logistic_model.predict(X_test)
Logistic_model_score_test=Logistic_model.score(X_test,y_test) ## Accuracy
print("The Logistic Regression Model Score on test data set is %.3f " % Logistic_model_sco
print(metrics.confusion_matrix(y_test,y_test_predict)) ## Confusion Matrix
print(metrics.classification_report(y_test,y_test_predict)) ## Classification re

The Logistic Regression Model Score on test data set is 0.829

[[111 42]

[ 36 267]]

precision recall f1-score support

0 0.76 0.73 0.74 153

1 0.86 0.88 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.81 456

weighted avg 0.83 0.83 0.83 456

In [47]: 

# Get the confusion matrix on the test data

sns.heatmap((metrics.confusion_matrix(y_test,Logistic_model.predict(X_test))),annot=True,fm
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()

Training Data and Test Data Confusion Matrix Comparison

In [48]: 

f,a = plt.subplots(1,2,sharex=True,sharey=True,squeeze=False)

#Plotting confusion matrix for the different models for the Training Data

plot_0 = sns.heatmap((metrics.confusion_matrix(y_train,y_train_predict)),annot=True,fmt='.5
a[0][0].set_title('Training Data')

plot_1 = sns.heatmap((metrics.confusion_matrix(y_test,y_test_predict)),annot=True,fmt='.5g'
a[0][1].set_title('Test Data');

Training Data and Test Data Classification Report Comparison

In [49]: 

print('Classification Report of the training data:\n\n',metrics.classification_report(y_tra

print('Classification Report of the test data:\n\n',metrics.classification_report(y_test,y_

Classification Report of the training data:

precision recall f1-score support

0 0.75 0.64 0.69 307

1 0.86 0.91 0.89 754

accuracy 0.83 1061

macro avg 0.81 0.78 0.79 1061

weighted avg 0.83 0.83 0.83 1061

Classification Report of the test data:

precision recall f1-score support

0 0.76 0.73 0.74 153

1 0.86 0.88 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.81 456

weighted avg 0.83 0.83 0.83 456

1- Applying GridSearchCV for Logistic Regression

In [50]: 

grid={'penalty':['l2','none','l1','elasticnet'],
'solver':['liblinear','lbfgs','newton-cg'],
'tol':[0.0001,0.00001],
'max_iter': [10000, 5000,15000]}
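
Not every penalty in this grid is supported by every solver. As I understand scikit-learn's `LogisticRegression`, `liblinear` accepts `l1` and `l2`, `lbfgs` and `newton-cg` accept `l2` and no penalty, and `elasticnet` needs the `saga` solver, which is not in the grid. The sketch below uses a hand-written compatibility table (`SUPPORTED` is an assumption encoded for this example, not read from the library) to count how many of the grid's combinations can actually be fitted.

```python
from itertools import product

grid = {
    "penalty": ["l2", "none", "l1", "elasticnet"],
    "solver": ["liblinear", "lbfgs", "newton-cg"],
    "tol": [0.0001, 0.00001],
    "max_iter": [10000, 5000, 15000],
}

# Hand-written compatibility table (an assumption, see the text above)
SUPPORTED = {
    "liblinear": {"l1", "l2"},
    "lbfgs": {"l2", "none"},
    "newton-cg": {"l2", "none"},
}

keys = list(grid)
candidates = [dict(zip(keys, combo)) for combo in product(*grid.values())]
valid = [c for c in candidates if c["penalty"] in SUPPORTED[c["solver"]]]

print(len(candidates), "candidates,", len(valid), "with a supported pairing")
```

Under that assumption half of the candidates cannot be fitted. To my knowledge the search scores such failures as NaN by default instead of stopping, so they are wasted fits and never win.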

In [51]: 

from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator = Logistic_model, param_grid = grid, cv = cv, n_jobs=2
grid_search.fit(X_train, y_train)

[LibLinear]

Out[51]:

GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1),
             estimator=LogisticRegression(max_iter=10000, n_jobs=2,
                                          penalty='none', solver='newton-cg',
                                          verbose=True),
             n_jobs=2,
             param_grid={'max_iter': [10000, 5000, 15000],
                         'penalty': ['l2', 'none', 'l1', 'elasticnet'],
                         'solver': ['liblinear', 'lbfgs', 'newton-cg'],
                         'tol': [0.0001, 1e-05]},
             scoring='f1')

In [52]: 

print(grid_search.best_params_,'\n')
print(grid_search.best_estimator_)

{'max_iter': 10000, 'penalty': 'l2', 'solver': 'liblinear', 'tol': 0.0001}

LogisticRegression(max_iter=10000, n_jobs=2, solver='liblinear', verbose=Tru

In [53]: 

best_model_lr = grid_search.best_estimator_

In [54]: 

# Predictions on the training and test sets

ytrain_predict_lr = best_model_lr.predict(X_train)
ytest_predict_lr = best_model_lr.predict(X_test)

In [55]: 

## Getting the probabilities on the test set

ytest_predict_prob=best_model_lr.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()

Out[55]:

0 1

0 0.428858 0.571142

1 0.155518 0.844482

2 0.006996 0.993004

3 0.839503 0.160497

4 0.066109 0.933891

Model Evaluation for Train Data

In [56]: 

print("The Best Logistic Regression Model Score on train data set post tuning is %.3f " % b

The Best Logistic Regression Model Score on train data set post tuning is 0.834

In [57]: 

# Get the confusion matrix on the train data

confusion_matrix(y_train,best_model_lr.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lr.predict(X_train)),annot=True,fmt='.5g',c
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()

In [58]: 

# predict probabilities
probs = best_model_lr.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LR Tuned Model train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='green')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for LR Tuned Model train data set",fontsize=14,color = 'red');

The ROC_AUC score for LR Tuned Model train data set 0.89
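
The AUC reported above can be read as the probability that a randomly chosen positive case (Labour, coded 1) receives a higher predicted probability than a randomly chosen negative case. A minimal pure-Python sketch of that pairwise definition, on made-up labels and scores (`pairwise_auc`, `y_true`, and `scores` are illustrative names, not from the notebook):

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0]                 # hypothetical labels
scores = [0.9, 0.8, 0.3, 0.4, 0.1]       # hypothetical predicted probabilities
print(pairwise_auc(y_true, scores))      # 5 of the 6 pairs are ordered correctly
```

The double loop is quadratic in the number of samples, so it is meant for understanding the metric rather than for replacing `roc_auc_score`.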

Model Evaluation for Test Data

In [59]: 

print("The Best Logistic Regression Model Score on test data set post tuning is %.3f " % b

The Best Logistic Regression Model Score on test data set post tuning is 0.829

In [60]: 

# Get the confusion matrix on the test data

confusion_matrix(y_test,best_model_lr.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lr.predict(X_test)),annot=True,fmt='.5g',cma
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

In [61]: 

# predict probabilities
probs = best_model_lr.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for LR Tuned Model test data set %.2f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='green')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for LR Tuned Model test data set",fontsize=14,color = 'red');

The ROC_AUC score for LR Tuned Model test data set 0.88

In [62]: 

print('Classification Report of the training data:\n\n',classification_report(y_train, ytra

print('Classification Report of the test data:\n\n',classification_report(y_test, ytest_pre

Classification Report of the training data:

precision recall f1-score support

0 0.75 0.64 0.69 307

1 0.86 0.91 0.89 754

accuracy 0.83 1061

macro avg 0.81 0.78 0.79 1061

weighted avg 0.83 0.83 0.83 1061

Classification Report of the test data:

precision recall f1-score support

0 0.76 0.73 0.74 153

1 0.86 0.88 0.87 303

accuracy 0.83 456

macro avg 0.81 0.80 0.81 456

weighted avg 0.83 0.83 0.83 456

In [63]: 

(best_model_lr.score(X_train, y_train)-best_model_lr.score(X_test, y_test))

Out[63]:

0.00517138746961654

LDA (linear discriminant analysis)

In [64]: 

LDA_model=LinearDiscriminantAnalysis()
LDA_model.fit(X_train,y_train)

Out[64]:

LinearDiscriminantAnalysis()

In [65]: 

## Performance metrics on train data set

y_train_predict=LDA_model.predict(X_train)
LDA_model_score_train=LDA_model.score(X_train,y_train)
print("The LDA Model Score on train data set is %.3f " % LDA_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))

The LDA Model Score on train data set is 0.834

[[200 107]

[ 69 685]]

precision recall f1-score support

0 0.74 0.65 0.69 307

1 0.86 0.91 0.89 754

accuracy 0.83 1061

macro avg 0.80 0.78 0.79 1061

weighted avg 0.83 0.83 0.83 1061

In [66]: 

# Get the confusion matrix on the train data

sns.heatmap((metrics.confusion_matrix(y_train,LDA_model.predict(X_train))),annot=True,fmt='
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()

In [67]: 

#Performance metrics on test data set

y_test_predict=LDA_model.predict(X_test)
LDA_model_score_test=LDA_model.score(X_test,y_test)
print("The LDA Model Score on test data set is %.3f " % LDA_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))

The LDA Model Score on test data set is 0.831

[[111  42]
 [ 35 268]]
              precision    recall  f1-score   support

           0       0.76      0.73      0.74       153
           1       0.86      0.88      0.87       303

    accuracy                           0.83       456
   macro avg       0.81      0.80      0.81       456
weighted avg       0.83      0.83      0.83       456

In [68]: 

# Get the confusion matrix on the test data

sns.heatmap((metrics.confusion_matrix(y_test,LDA_model.predict(X_test))),annot=True,fmt='.5
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()

Applying GridSearchCV for LDA

In [69]: 

grid_lda ={'solver' :['svd', 'lsqr', 'eigen']}

grid_search_lda = GridSearchCV(estimator = LDA_model, param_grid = grid_lda, cv = cv, n_job
grid_search_lda.fit(X_train, y_train)
best_model_lda = grid_search_lda.best_estimator_
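
The `GridSearchCV` call above is cut off in this export, and `cv` is defined elsewhere in the notebook. As a self-contained illustration of the same idea, the sketch below searches the three LDA solvers on synthetic data. The 5-fold stratified split, the accuracy scoring and all `*_demo` names are assumptions made for this example, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the election data (8 features, binary target).
# n_redundant=0 keeps the covariance matrix non-singular for the 'eigen' solver.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0, random_state=1
)

# Assumed cross-validation scheme; the notebook's own `cv` object is not shown here.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search_demo = GridSearchCV(
    estimator=LinearDiscriminantAnalysis(),
    param_grid={"solver": ["svd", "lsqr", "eigen"]},
    cv=cv_demo,
    scoring="accuracy",
)
search_demo.fit(X_demo, y_demo)

best_solver = search_demo.best_params_["solver"]
print(best_solver, round(search_demo.best_score_, 3))
```

Without shrinkage the three solvers usually give very similar fits, which would be consistent with the small change in score reported after tuning.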

Model Evaluation for Train Data

In [70]: 

ytrain_predict_lda = best_model_lda.predict(X_train)
ytest_predict_lda= best_model_lda.predict(X_test)

In [71]: 

## Getting the probabilities on the test set

ytest_predict_prob=best_model_lda.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()

Out[71]:

          0         1
0  0.466328  0.533672
1  0.137291  0.862709
2  0.005950  0.994050
3  0.866706  0.133294
4  0.053474  0.946526

In [72]: 

#### Model Evaluation for Train Data

print("The Best LDA Model Score on train data set post tuning is %.3f " % best_model_lda.sc

The Best LDA Model Score on train data set post tuning is 0.835

In [73]: 

# Get the confusion matrix on the train data

confusion_matrix(y_train,best_model_lda.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lda.predict(X_train)),annot=True,fmt='.5g',
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()

In [74]: 

# predict probabilities
probs = best_model_lda.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LDA Tuned Model train data set %.3f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for LDA Tuned Model train data set",fontsize=14,color = 'red');

The ROC_AUC score for LDA Tuned Model train data set 0.890

Model Evaluation for Test Data

In [75]: 

#### Model Evaluation for Test Data

print("The Best LDA Model Score on test data set post tuning is %.3f " % best_model_lda.sco

The Best LDA Model Score on test data set post tuning is 0.831

In [76]: 

# Get the confusion matrix on the Test data

confusion_matrix(y_test,best_model_lda.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lda.predict(X_test)),annot=True,fmt='.5g',cm
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()

In [77]: 

# predict probabilities
probs = best_model_lda.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for LDA Tuned Model test data set %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for LDA Tuned Model test data set",fontsize=14,color = 'red');

The ROC_AUC score for LDA Tuned Model test data set 0.888
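
One way to read an AUC such as 0.888: it is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case, with ties counted as half. The sketch below computes that pairwise quantity directly. The toy labels and scores are hypothetical and do not come from the election data, and `pairwise_auc` is an illustrative helper, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy example: two negatives, two positives
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]

auc_toy = pairwise_auc(y_toy, p_toy)
print(auc_toy)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

The quadratic loop is fine for a handful of rows; `roc_auc_score` is the practical choice on real data.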

In [78]: 

### Classification of Best LDA Model on Train and Test Data

print('Classification Report of the training data:\n\n',classification_report(y_train, ytra
print('Classification Report of the test data:\n\n',classification_report(y_test, ytest_pre

Classification Report of the training data:

               precision    recall  f1-score   support

           0       0.74      0.65      0.70       307
           1       0.87      0.91      0.89       754

    accuracy                           0.84      1061
   macro avg       0.81      0.78      0.79      1061
weighted avg       0.83      0.84      0.83      1061

Classification Report of the test data:

               precision    recall  f1-score   support

           0       0.76      0.73      0.74       153
           1       0.86      0.88      0.87       303

    accuracy                           0.83       456
   macro avg       0.81      0.80      0.81       456
weighted avg       0.83      0.83      0.83       456

In [79]: 

(best_model_lda.score(X_train, y_train)-best_model_lda.score(X_test, y_test))*100

Out[79]:

0.3920912082279293

KNN Model

KNN is distance-based, so good performance usually requires preprocessing the data so that all
variables are centered and on a similar scale.
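
To illustrate the point about scaling, here is a minimal sketch that wraps `StandardScaler` and `KNeighborsClassifier` in a pipeline, so the scaler is fitted on the training split only. The data are synthetic, and `X_demo`, `y_demo` and `scaled_knn` are illustrative names; this is not the notebook's own preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the election dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0, random_state=1
)
X_demo[:, 0] *= 1000  # exaggerate one feature's scale so it would dominate raw distances

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# The scaler learns mean and standard deviation from X_tr only,
# then applies the same transform when scoring on X_te.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

acc_scaled = scaled_knn.score(X_te, y_te)
print(round(acc_scaled, 3))
```

Keeping the scaler inside the pipeline also means any later cross-validation or grid search refits it per fold.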

In [80]: 

KNN_model=KNeighborsClassifier()
KNN_model.fit(X_train,y_train)

Out[80]:

KNeighborsClassifier()

In [81]: 

## Performance Matrix on train data set

y_train_predict = KNN_model.predict(X_train)
KNN_model_score_train=KNN_model.score(X_train, y_train)
print("The KNN Model Score on Train data %.3f " % KNN_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The KNN Model Score on Train data 0.857

[[217  90]
 [ 62 692]]
              precision    recall  f1-score   support

           0       0.78      0.71      0.74       307
           1       0.88      0.92      0.90       754

    accuracy                           0.86      1061
   macro avg       0.83      0.81      0.82      1061
weighted avg       0.85      0.86      0.85      1061

In [82]: 

## Performance Matrix on test data set

y_test_predict = KNN_model.predict(X_test)
KNN_model_score_test = KNN_model.score(X_test, y_test)
print("The KNN Model Score on Test data %.3f " % KNN_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The KNN Model Score on Test data 0.827

[[109  44]
 [ 35 268]]
              precision    recall  f1-score   support

           0       0.76      0.71      0.73       153
           1       0.86      0.88      0.87       303

    accuracy                           0.83       456
   macro avg       0.81      0.80      0.80       456
weighted avg       0.82      0.83      0.83       456

Run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the optimal number of
neighbours using the misclassification error.

Misclassification error (MCE) = 1 - test accuracy score. Calculate the MCE for each model with
neighbours = 1, 3, 5, ..., 19 and pick the model with the lowest MCE.

In [83]: 

# empty list that will hold accuracy scores

ac_scores = []

# perform accuracy metrics for values from 1,3,5....19

for k in range(1,20,2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # evaluate test accuracy
    scores = knn.score(X_test, y_test)
    ac_scores.append(scores)

# changing to misclassification error

MCE = [1 - x for x in ac_scores]
MCE

Out[83]:

[0.2149122807017544,
 0.19736842105263153,
 0.17324561403508776,
 0.1842105263157895,
 0.18201754385964908,
 0.17105263157894735,
 0.17763157894736847,
 0.16885964912280704,
 0.16666666666666663,
 0.17105263157894735]
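
The lowest MCE can be picked out programmatically rather than by eye. A minimal sketch, with the values copied from the output above; `ks`, `mce_values` and `best_k` are illustrative names that do not appear in the notebook.

```python
# k values tried in the loop: 1, 3, ..., 19
ks = list(range(1, 20, 2))

# MCE values copied from Out[83], in the same order as ks
mce_values = [
    0.2149122807017544,
    0.19736842105263153,
    0.17324561403508776,
    0.1842105263157895,
    0.18201754385964908,
    0.17105263157894735,
    0.17763157894736847,
    0.16885964912280704,
    0.16666666666666663,
    0.17105263157894735,
]

# Index of the smallest error, mapped back to its k
best_index = min(range(len(mce_values)), key=mce_values.__getitem__)
best_k = ks[best_index]

print(best_k, round(mce_values[best_index], 4))  # 17 0.1667
```

The differences between the larger k values are small (about 0.167 to 0.178), so a choice among them is sensitive to the particular test split.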

Plot misclassification error vs k (with k value on X-axis)

In [84]: 

# plot misclassification error vs k

plt.plot(range(1,20,2), MCE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.title("Misclassification error Vs K Value",fontsize=14,color = 'red');
plt.show()

K = 11 gives a test MCE of 0.171, one of the lowest in the list; K = 15 (0.169) and K = 17 (0.167) are marginally lower. We will build the model with k=11.

In [85]: 

from sklearn.neighbors import KNeighborsClassifier

KNN_model_1=KNeighborsClassifier(n_neighbors= 11)
KNN_model_1.fit(X_train,y_train)

Out[85]:

KNeighborsClassifier(n_neighbors=11)

Performance Matrix of KNN New Model on train data set

In [86]: 

## Performance Matrix on train data set

y_train_predict = KNN_model_1.predict(X_train)
KNN_model_score_train_New=KNN_model_1.score(X_train, y_train)
print("The KNN Model Score on Train data %.3f " % KNN_model_score_train_New)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The KNN Model Score on Train data 0.843

[[206 101]
 [ 66 688]]
              precision    recall  f1-score   support

           0       0.76      0.67      0.71       307
           1       0.87      0.91      0.89       754

    accuracy                           0.84      1061
   macro avg       0.81      0.79      0.80      1061
weighted avg       0.84      0.84      0.84      1061

In [87]: 

# Get the confusion matrix on the train data

sns.heatmap((metrics.confusion_matrix(y_train,KNN_model_1.predict(X_train))),annot=True,fmt
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('KNN-Confusion Matrix-Train Data')
plt.show()

In [88]: 

# predict probabilities
probs = KNN_model_1.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for KNN train data set %.3f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for KNN train data set",fontsize=14,color = 'red');

The ROC_AUC score for KNN train data set 0.911

Performance Matrix of KNN New Model on test data set

In [89]: 

## Performance Matrix on test data set

y_test_predict = KNN_model_1.predict(X_test)
KNN_model_score_test_New = KNN_model_1.score(X_test, y_test)
print("The KNN Model Score on Test data %.3f " % KNN_model_score_test_New)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The KNN Model Score on Test data 0.829

[[105  48]
 [ 30 273]]
              precision    recall  f1-score   support

           0       0.78      0.69      0.73       153
           1       0.85      0.90      0.88       303

    accuracy                           0.83       456
   macro avg       0.81      0.79      0.80       456
weighted avg       0.83      0.83      0.83       456

In [90]: 

# Get the confusion matrix on the test data

sns.heatmap((metrics.confusion_matrix(y_test,KNN_model_1.predict(X_test))),annot=True,fmt='
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('KNN-Confusion Matrix-Test Data')
plt.show()

In [91]: 

# predict probabilities
probs = KNN_model_1.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for KNN test data set %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for KNN test data set",fontsize=14,color = 'red');

The ROC_AUC score for KNN test data set 0.889

Naive Bayes
In [92]: 

NB_model=GaussianNB()
NB_model.fit(X_train, y_train)

Out[92]:

GaussianNB()

The GaussianNB classifier is now built and has been trained on the training data with the fit() method.
The model is ready to make predictions: call the predict() method with the test set features as its
argument.
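
The fit-then-predict pattern described above can be sketched on its own as follows. The data are synthetic and every `*_demo` name is illustrative; only `GaussianNB`, `fit`, `predict` and `predict_proba` correspond to what the notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (hypothetical, not the election dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

nb_demo = GaussianNB()
nb_demo.fit(X_tr, y_tr)                  # estimates per-class feature means and variances

pred_demo = nb_demo.predict(X_te)        # hard class labels
proba_demo = nb_demo.predict_proba(X_te) # one probability column per class

print(pred_demo[:5])
print(proba_demo[:5].round(3))
```

`predict` returns one label per test row, while `predict_proba` returns the class probabilities that the ROC curves in this notebook are built from.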

In [93]: 

#Performance Matrix on train data set

y_train_predict=NB_model.predict(X_train)
Naive_Bayes_model_score_train=NB_model.score(X_train, y_train) ## Accur
print("The Naive Bayes Model Score on train data is %.3f " % Naive_Bayes_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict)) ## confusion_matrix
print(metrics.classification_report(y_train,y_train_predict)) ## classification_report

The Naive Bayes Model Score on train data is 0.834

[[212  95]
 [ 81 673]]
              precision    recall  f1-score   support

           0       0.72      0.69      0.71       307
           1       0.88      0.89      0.88       754

    accuracy                           0.83      1061
   macro avg       0.80      0.79      0.80      1061
weighted avg       0.83      0.83      0.83      1061

In [94]: 

# Get the confusion matrix on the train data

sns.heatmap((metrics.confusion_matrix(y_train,NB_model.predict(X_train))),annot=True,fmt='.
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('NB-Confusion Matrix-Train Data')
plt.show()

In [95]: 

## Performance Matrix on test data set

y_test_predict = NB_model.predict(X_test)
Naive_Bayes_model_score_test=NB_model.score(X_test, y_test) ## Accuracy
print("The Naive Bayes Model Score on test data is %.3f " % Naive_Bayes_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict)) ## confusion_matrix
print(metrics.classification_report(y_test, y_test_predict)) ## classification_report

The Naive Bayes Model Score on test data is 0.822

[[112  41]
 [ 40 263]]
              precision    recall  f1-score   support

           0       0.74      0.73      0.73       153
           1       0.87      0.87      0.87       303

    accuracy                           0.82       456
   macro avg       0.80      0.80      0.80       456
weighted avg       0.82      0.82      0.82       456

In [96]: 

# Get the confusion matrix on the test data

sns.heatmap((metrics.confusion_matrix(y_test,NB_model.predict(X_test))),annot=True,fmt='.5g
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('NB-Confusion Matrix-Test Data')
plt.show()

Naive Bayes with SMOTE

In [97]: 

from imblearn.over_sampling import SMOTE

#SMOTE is only applied on the train data set
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())

In [98]: 

X_train.shape

Out[98]:

(1061, 8)

In [99]: 

## Let's check the shape after SMOTE

X_train_res.shape

Out[99]:

(1508, 8)
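
The jump from 1061 to 1508 rows is what one would expect if SMOTE oversamples the minority class until it matches the majority class, which is its default behaviour as far as I understand it. The arithmetic below uses the train class counts from the support column of the classification reports (307 for class 0, 754 for class 1). It only reproduces the row counts and does not run SMOTE itself.

```python
from collections import Counter

# Train class counts before resampling (from the classification reports)
before = Counter({0: 307, 1: 754})

# Assumed behaviour: every class is brought up to the majority-class count
majority = max(before.values())
after = Counter({label: majority for label in before})

synthetic_rows = sum(after.values()) - sum(before.values())

print(sum(before.values()), sum(after.values()))  # 1061 1508
print(synthetic_rows)                             # 447 synthetic class-0 rows
```

This agrees with the 754/754 support shown for the resampled train set in the SMOTE classification report.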

In [100]: 

NB_SM_model = GaussianNB()
NB_SM_model.fit(X_train_res, y_train_res)

Out[100]:

GaussianNB()

In [101]: 

## Performance Matrix on train data set with SMOTE

y_train_predict = NB_SM_model.predict(X_train_res)
SMOTE_model_score_train = NB_SM_model.score(X_train_res, y_train_res)
print("The SMOTE Model Score for train data set is %.3f " % SMOTE_model_score_train)
print(metrics.confusion_matrix(y_train_res, y_train_predict))
print(metrics.classification_report(y_train_res ,y_train_predict))

The SMOTE Model Score for train data set is 0.822

[[616 138]
 [131 623]]
              precision    recall  f1-score   support

           0       0.82      0.82      0.82       754
           1       0.82      0.83      0.82       754

    accuracy                           0.82      1508
   macro avg       0.82      0.82      0.82      1508
weighted avg       0.82      0.82      0.82      1508

In [102]: 

# Get the confusion matrix on the train data

sns.heatmap((metrics.confusion_matrix(y_train_res,NB_SM_model.predict(X_train_res))),annot=
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()

ROC_AUC Curve for Naive Bayes with SMOTE Model on train data set

In [103]: 

probs = NB_SM_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Naive Bayes with SMOTE train data set %.3f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Naive Bayes with SMOTE train data set",fontsize=14,color = 're

The ROC_AUC score for Naive Bayes with SMOTE train data set 0.887

In [104]: 

## Performance Matrix on test data set

y_test_predict = NB_SM_model.predict(X_test)
SMOTE_model_score_test = NB_SM_model.score(X_test, y_test)
print("The SMOTE Model Score for test data set is %.3f " % SMOTE_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The SMOTE Model Score for test data set is 0.809

[[125  28]
 [ 59 244]]
              precision    recall  f1-score   support

           0       0.68      0.82      0.74       153
           1       0.90      0.81      0.85       303

    accuracy                           0.81       456
   macro avg       0.79      0.81      0.80       456
weighted avg       0.82      0.81      0.81       456

In [105]: 

# Get the confusion matrix on the test data

sns.heatmap((metrics.confusion_matrix(y_test,NB_SM_model.predict(X_test))),annot=True,fmt='
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()

ROC_AUC Curve for Naive Bayes with SMOTE Model on test data set

In [106]: 

probs_test = NB_SM_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Naive Bayes with SMOTE test data set %.3f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Naive Bayes with SMOTE test data set",fontsize=14,color = 'red');

The ROC_AUC score for Naive Bayes with SMOTE test data set 0.876

Random Forest
In [107]: 

RF_model=RandomForestClassifier(n_estimators=100,random_state=1)
RF_model.fit(X_train, y_train)

Out[107]:

RandomForestClassifier(random_state=1)

In [108]: 

## Performance Matrix on train data set

y_train_predict = RF_model.predict(X_train)
RF_model_score_train =RF_model.score(X_train, y_train)
print("The random Forest Score on train data is %.2f " % RF_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The random Forest Score on train data is 1.00

[[307   0]
 [  0 754]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       307
           1       1.00      1.00      1.00       754

    accuracy                           1.00      1061
   macro avg       1.00      1.00      1.00      1061
weighted avg       1.00      1.00      1.00      1061

In [109]: 

# Get the confusion matrix on the train data

sns.heatmap((metrics.confusion_matrix(y_train,RF_model.predict(X_train))),annot=True,fmt='.
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()

In [110]: 

## Performance Matrix on test data set

y_test_predict = RF_model.predict(X_test)
RF_model_score_test = RF_model.score(X_test, y_test)
print("The random Forest Score on test data is %.3f " % RF_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The random Forest Score on test data is 0.831

[[104  49]
 [ 28 275]]
              precision    recall  f1-score   support

           0       0.79      0.68      0.73       153
           1       0.85      0.91      0.88       303

    accuracy                           0.83       456
   macro avg       0.82      0.79      0.80       456
weighted avg       0.83      0.83      0.83       456

In [111]: 

# Get the confusion matrix on the test data

sns.heatmap((metrics.confusion_matrix(y_test,RF_model.predict(X_test))),annot=True,fmt='.5g
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()

In [112]: 

(RF_model_score_train-RF_model_score_test)*100

Out[112]:

16.885964912280706

Bagging
In [113]: 

# Note: despite its name, `cart` is a random forest, so this bags 100 random forests
cart=RandomForestClassifier()
Bagging_model=BaggingClassifier(base_estimator=cart,n_estimators=100, random_state=1)
Bagging_model.fit(X_train,y_train)

Out[113]:

BaggingClassifier(base_estimator=RandomForestClassifier(), n_estimators=100,
                  random_state=1)

In [114]: 

## Performance Matrix on train data set

y_train_predict=Bagging_model.predict(X_train)
Bagging_model_score_train=Bagging_model.score(X_train,y_train)
print("The Bagging Model Score for train data set is %.2f " % Bagging_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))

The Bagging Model Score for train data set is 0.97

[[278  29]
 [  5 749]]
              precision    recall  f1-score   support

           0       0.98      0.91      0.94       307
           1       0.96      0.99      0.98       754

    accuracy                           0.97      1061
   macro avg       0.97      0.95      0.96      1061
weighted avg       0.97      0.97      0.97      1061

In [115]: 

# Get the confusion matrix on the train data

sns.heatmap((metrics.confusion_matrix(y_train,Bagging_model.predict(X_train))),annot=True,f
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Bagging-Confusion Matrix-Train Data')
plt.show()

In [116]: 

## Performance Matrix on test data set

y_test_predict=Bagging_model.predict(X_test)
Bagging_model_score_test=Bagging_model.score(X_test,y_test)
print("The Bagging Model Score for test data set is %.2f " % Bagging_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))

The Bagging Model Score for test data set is 0.83

[[104  49]
 [ 29 274]]
              precision    recall  f1-score   support

           0       0.78      0.68      0.73       153
           1       0.85      0.90      0.88       303

    accuracy                           0.83       456
   macro avg       0.82      0.79      0.80       456
weighted avg       0.83      0.83      0.83       456

In [117]: 

# Get the confusion matrix on the test data

sns.heatmap((metrics.confusion_matrix(y_test,Bagging_model.predict(X_test))),annot=True,fmt
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Bagging-Confusion Matrix-Test Data')
plt.show()

In [118]: 

(Bagging_model_score_train-Bagging_model_score_test)

Out[118]:

0.13900739123964478

Boosting

Ada Boost

In [119]: 

ADB_model=AdaBoostClassifier(n_estimators=100,random_state=1)
ADB_model.fit(X_train,y_train)

Out[119]:

AdaBoostClassifier(n_estimators=100, random_state=1)

In [120]: 

## Performance Matrix on train data set

y_train_predict=ADB_model.predict(X_train)
ADB_model_score_train=ADB_model.score(X_train,y_train)
print("The ADA boost Model Score for train data set is %.3f " % ADB_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))

The ADA boost Model Score for train data set is 0.850

[[214  93]
 [ 66 688]]
              precision    recall  f1-score   support

           0       0.76      0.70      0.73       307
           1       0.88      0.91      0.90       754

    accuracy                           0.85      1061
   macro avg       0.82      0.80      0.81      1061
weighted avg       0.85      0.85      0.85      1061

In [121]: 

# Get the confusion matrix on the train data

sns.heatmap((metrics.confusion_matrix(y_train,ADB_model.predict(X_train))),annot=True,fmt='
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('ADA Boost-Confusion Matrix-Train Data')
plt.show()

In [122]: 

## Performance Matrix on test data set

y_test_predict = ADB_model.predict(X_test)
ADB_model_score_test = ADB_model.score(X_test, y_test)
print("The ADA boost Model Score for test data set is %.3f " % ADB_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The ADA boost Model Score for test data set is 0.814

[[103  50]
 [ 35 268]]
              precision    recall  f1-score   support

           0       0.75      0.67      0.71       153
           1       0.84      0.88      0.86       303

    accuracy                           0.81       456
   macro avg       0.79      0.78      0.79       456
weighted avg       0.81      0.81      0.81       456

In [123]: 

# Get the confusion matrix on the test data

sns.heatmap((metrics.confusion_matrix(y_test,ADB_model.predict(X_test))),annot=True,fmt='.5
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('ADA boost-Confusion Matrix-Test Data')
plt.show()

In [124]: 

(ADB_model_score_train-ADB_model_score_test)*100

Out[124]:

3.654488483225027

Gradient Boosting
In [125]: 

gbc_model=GradientBoostingClassifier(random_state=1)
gbc_model.fit(X_train, y_train)

Out[125]:

GradientBoostingClassifier(random_state=1)

In [126]: 

## Performance Matrix on train data set

y_train_predict = gbc_model.predict(X_train)
gbc_model_score_train = gbc_model.score(X_train, y_train)
print("The Gradient Boosting Score for train data set is %.2f " % gbc_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The Gradient Boosting Score for train data set is 0.89

[[239  68]
 [ 46 708]]
              precision    recall  f1-score   support

           0       0.84      0.78      0.81       307
           1       0.91      0.94      0.93       754

    accuracy                           0.89      1061
   macro avg       0.88      0.86      0.87      1061
weighted avg       0.89      0.89      0.89      1061

In [127]: 

# Get the confusion matrix on the train data

sns.heatmap((metrics.confusion_matrix(y_train,gbc_model.predict(X_train))),annot=True,fmt='
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Gradient Boost-Confusion Matrix-Train Data')
plt.show()

In [128]: 

## Performance Matrix on test data set

y_test_predict = gbc_model.predict(X_test)
gbc_model_score_test = gbc_model.score(X_test, y_test)
print("The Gradient Boosting Score for test data set is %.2f " % gbc_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The Gradient Boosting Score for test data set is 0.84

[[105  48]
 [ 27 276]]
              precision    recall  f1-score   support

           0       0.80      0.69      0.74       153
           1       0.85      0.91      0.88       303

    accuracy                           0.84       456
   macro avg       0.82      0.80      0.81       456
weighted avg       0.83      0.84      0.83       456

In [129]: 

# Get the confusion matrix on the test data

sns.heatmap((metrics.confusion_matrix(y_test,gbc_model.predict(X_test))),annot=True,fmt='.5
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Gradient Boost-Confusion Matrix-Test Data')
plt.show()

In [130]: 

(gbc_model_score_train-gbc_model_score_test)*100

Out[130]:

5.702787836698253
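
The train-minus-test accuracy gaps printed so far use mixed units: some cells multiply by 100 and some do not. The sketch below puts the printed values on one scale (percentage points) and sorts them, as a rough indicator of how much each model overfits. The numbers are copied from the Out[...] cells; the dictionary layout and names are illustrative.

```python
# Gaps exactly as printed in the notebook, with the unit each cell used
reported = {
    "Logistic Regression (best_model_lr)": (0.00517138746961654, "fraction"),
    "LDA (best_model_lda)": (0.3920912082279293, "percent"),
    "Random Forest": (16.885964912280706, "percent"),
    "Bagging": (0.13900739123964478, "fraction"),
    "AdaBoost": (3.654488483225027, "percent"),
    "Gradient Boosting": (5.702787836698253, "percent"),
}

# Convert everything to percentage points
gap_points = {
    name: value * 100 if unit == "fraction" else value
    for name, (value, unit) in reported.items()
}

# Smallest gap first
ranked = sorted(gap_points, key=gap_points.get)
for name in ranked:
    print(f"{name:38s} {gap_points[name]:6.2f}")
```

On these figures the linear models show gaps under one percentage point, while Random Forest and Bagging show the largest gaps at roughly 17 and 14 points.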

Performance Matrix of Logistic Regression on train data set

In [131]: 

## Performance Matrix on train data set

print("The Best Logistic Regression Model Score on train data set is %.2f " % best_model_lr
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lr.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lr.predict(X_train)),annot=True, fmt='.5g',
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LR-Confusion Matrix-Train Data')
plt.show()

The Best Logistic Regression Model Score on train data set is 0.83

ROC_AUC Curve for Logistic Regression on train data set

In [132]: 

# predict probabilities
probs = best_model_lr.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print('The ROC_AUC score for Logistic Regression Train data set: %.3f' % auc)
# calculate ROC curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Logistic Regression Train data set",fontsize=14,color = 'red');

The ROC_AUC score for Logistic Regression Train data set: 0.890

Performance Matrix of Logistic Regression on test data set

In [133]: 

## Performance Matrix on test data set

print("The Best Logistic Regression Model Score on test data set is %.2f " % best_model_lr
# Get the confusion matrix on the test data
confusion_matrix(y_test,best_model_lr.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lr.predict(X_test)),annot=True, fmt='.5g', c
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LR-Confusion Matrix-Test Data')
plt.show()

The Best Logistic Regression Model Score on test data set is 0.83

ROC_AUC Curve for Logistic Regression on test data set

In [134]: 

# predict probabilities
probs = best_model_lr.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('The ROC_AUC score for Logistic Regression Test data set : %.3f' % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Logistic Regression Test data set ",fontsize=14,color = 'red');

The ROC_AUC score for Logistic Regression Test data set : 0.883

Performance Matrix of LDA (linear discriminant analysis) on train data set

In [135]: 

## Performance Matrix on train data set

print("The Best LDA Model Score on train data set is %.2f " % best_model_lda.score(X_train,
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lda.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lda.predict(X_train)),annot=True, fmt='.5g'
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LDA-Confusion Matrix-Train data')
plt.show()

The Best LDA Model Score on train data set is 0.84

ROC_AUC Curve for LDA (linear discriminant analysis) on train data set

In [136]: 

# predict probabilities
probs = best_model_lda.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LDA Train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for LDA Train data set",fontsize=14,color = 'red');

The ROC_AUC score for LDA Train data set 0.89

Performance Matrix of LDA (linear discriminant analysis) on test data set

In [137]: 

#Performance Matrix on test data set

print("The Best LDA Model Score on test data set is %.2f " % best_model_lda.score(X_test, y
# Get the confusion matrix on the Test data
confusion_matrix(y_test,best_model_lda.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lda.predict(X_test)),annot=True, fmt='.5g',
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LDA-Confusion Matrix-Test Data')
plt.show()

The Best LDA Model Score on test data set is 0.83

ROC_AUC Curve for LDA (linear discriminant analysis) on test data set

In [138]: 

probs = best_model_lda.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
print("The ROC_AUC score for LDA Test data set is %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for LDA Test data set",fontsize=14,color = 'red');

AUC: 0.888

The ROC_AUC score for LDA Test data set is 0.888

Performance Matrix of KNN on train data set

In [139]: 

## Performance Matrix on train data set

y_train_predict = KNN_model_1.predict(X_train)
KNN_model_score_train_New=KNN_model_1.score(X_train, y_train)
print("The KNN Model Score on Train data %.2f " % KNN_model_score_train_New)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The KNN Model Score on Train data 0.84

[[206 101]
 [ 66 688]]
              precision    recall  f1-score   support

           0       0.76      0.67      0.71       307
           1       0.87      0.91      0.89       754

    accuracy                           0.84      1061
   macro avg       0.81      0.79      0.80      1061
weighted avg       0.84      0.84      0.84      1061

ROC_AUC Curve for KNN on train data set

In [140]: 

# predict probabilities
probs = KNN_model_1.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for KNN train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for KNN train data set",fontsize=14,color = 'red');

The ROC_AUC score for KNN train data set 0.91

Performance Matrix of KNN on test data set

In [141]: 

## Performance Matrix on test data set

y_test_predict = KNN_model_1.predict(X_test)
KNN_model_score_test_New = KNN_model_1.score(X_test, y_test)
print("The KNN Model Score on Test data %.2f " % KNN_model_score_test_New)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The KNN Model Score on Test data 0.83

[[105  48]
 [ 30 273]]
              precision    recall  f1-score   support

           0       0.78      0.69      0.73       153
           1       0.85      0.90      0.88       303

    accuracy                           0.83       456
   macro avg       0.81      0.79      0.80       456
weighted avg       0.83      0.83      0.83       456

ROC_AUC Curve for KNN on test data set

In [142]: 

# predict probabilities
probs = KNN_model_1.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for KNN test data set %.2f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for KNN test data set",fontsize=14,color = 'red');

The ROC_AUC score for KNN test data set 0.89

Performance Matrix of Naive Bayes with SMOTE on train data set

In [143]: 

## Performance Matrix on train data set with SMOTE

y_train_predict = NB_SM_model.predict(X_train_res)
SMOTE_model_score_train = NB_SM_model.score(X_train_res, y_train_res)
print("The SMOTE Model Score for train data set is %.2f " % SMOTE_model_score_train)
print(metrics.confusion_matrix(y_train_res, y_train_predict))
print(metrics.classification_report(y_train_res ,y_train_predict))

The SMOTE Model Score for train data set is 0.82

[[616 138]
 [131 623]]
              precision    recall  f1-score   support

           0       0.82      0.82      0.82       754
           1       0.82      0.83      0.82       754

    accuracy                           0.82      1508
   macro avg       0.82      0.82      0.82      1508
weighted avg       0.82      0.82      0.82      1508

ROC_AUC Curve for Naive Bayes with SMOTE Model on train data set

In [144]: 

probs = NB_SM_model.predict_proba(X_train_res)
probs = probs[:, 1]
auc = roc_auc_score(y_train_res, probs)
print("The ROC_AUC score for Naive Bayes with SMOTE train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train_res, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Naive Bayes with SMOTE train data set",fontsize=14,color = 'red');

The ROC_AUC score for Naive Bayes with SMOTE train data set 0.90

Performance Matrix of Naive Bayes with SMOTE on test data set

In [145]: 

## Performance Matrix on test data set

y_test_predict = NB_SM_model.predict(X_test)
SMOTE_model_score_test = NB_SM_model.score(X_test, y_test)
print("The SMOTE Model Score for test data set is %.2f " % SMOTE_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The SMOTE Model Score for test data set is 0.81

[[125  28]
 [ 59 244]]
              precision    recall  f1-score   support

           0       0.68      0.82      0.74       153
           1       0.90      0.81      0.85       303

    accuracy                           0.81       456
   macro avg       0.79      0.81      0.80       456
weighted avg       0.82      0.81      0.81       456

ROC_AUC Curve for Naive Bayes with SMOTE Model on test data set

In [146]: 

probs_test = NB_SM_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Naive Bayes with SMOTE Model on test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Naive Bayes with SMOTE Model on test data set",fontsize=14,color =

The ROC_AUC score for Naive Bayes with SMOTE Model on test data set 0.88

Performance Matrix of Random Forest on train data set

In [147]: 

## Performance Matrix on train data set

The random Forest Score on train data is 1.00

[[307   0]
 [  0 754]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       307
           1       1.00      1.00      1.00       754

    accuracy                           1.00      1061
   macro avg       1.00      1.00      1.00      1061
weighted avg       1.00      1.00      1.00      1061

In [148]: 

Recall=(754/(0+754))
print("Random Forest-Train Data Set-Recall for class 1 is %.2f " % Recall)

Random Forest-Train Data Set-Recall for class 1 is 1.00

ROC_AUC Curve for Random Forest on train data set

In [149]: 

probs = RF_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The AUC_ROC score for Random Forest train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Random Forest train data",fontsize=14,color = 'red');

The AUC_ROC score for Random Forest train data set 1.00

Performance Matrix of Random Forest on test data set

In [150]: 

## Performance Matrix on test data set

y_test_predict = RF_model.predict(X_test)
RF_model_score_test = RF_model.score(X_test, y_test)
print("The random Forest Score on test data is %.2f " % RF_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The random Forest Score on test data is 0.83

[[104  49]
 [ 28 275]]
              precision    recall  f1-score   support

           0       0.79      0.68      0.73       153
           1       0.85      0.91      0.88       303

    accuracy                           0.83       456
   macro avg       0.82      0.79      0.80       456
weighted avg       0.83      0.83      0.83       456

In [151]: 

Recall=(275/(28+275))
print("Random Forest-Test Data Set-Recall for class 1 is %.2f " % Recall)

Random Forest-Test Data Set-Recall for class 1 is 0.91
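
The recall above is typed in by hand from the confusion matrix. As a minimal sketch, the same figures can be derived programmatically from any 2x2 matrix laid out the way `confusion_matrix` returns it (rows are true labels, columns are predicted labels). The helper name `binary_metrics` is an assumption introduced here for illustration; the matrix values are the Random Forest test results printed above.

```python
def binary_metrics(cm):
    """Per-class precision/recall/F1 and overall accuracy from a 2x2 matrix.

    cm[i][j] is the number of samples with true label i predicted as label j.
    """
    total = sum(sum(row) for row in cm)
    report = {"accuracy": (cm[0][0] + cm[1][1]) / total}
    for k in (0, 1):
        tp = cm[k][k]
        predicted_k = cm[0][k] + cm[1][k]  # column sum: everything predicted as k
        actual_k = cm[k][0] + cm[k][1]     # row sum: everything truly k
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[k] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Random Forest confusion matrix on the test data, copied from the output above.
rf_test_cm = [[104, 49], [28, 275]]
rf_report = binary_metrics(rf_test_cm)
print("Random Forest-Test Data Set-Recall for class 1 is %.2f " % rf_report[1]["recall"])
print("Random Forest-Test Data Set-Accuracy is %.2f " % rf_report["accuracy"])
```

Deriving the numbers from the matrix avoids transcription slips when the model is refit and the counts change.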

ROC_AUC Curve for Random Forest on test data set

In [152]: 

probs_test = RF_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for Random Forest test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Random Forest Test data set",fontsize=14,color = 'red');

The AUC_ROC score for Random Forest test data set 0.90

Performance Matrix of Bagging on train data set

In [153]: 

## Performance Matrix on train data set

The Bagging Model Score for train data set is 0.97

[[278  29]
 [  5 749]]
              precision    recall  f1-score   support

           0       0.98      0.91      0.94       307
           1       0.96      0.99      0.98       754

    accuracy                           0.97      1061
   macro avg       0.97      0.95      0.96      1061
weighted avg       0.97      0.97      0.97      1061

ROC_AUC Curve for Bagging on train data set

In [154]: 

probs = Bagging_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Bagging train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Bagging Train data set",fontsize=14,color = 'red');

The ROC_AUC score for Bagging train data set 1.00

Performance Matrix of Bagging on test data set

In [155]: 

## Performance Matrix on test data set

The Bagging Model Score for test data set is 0.83

[[104  49]
 [ 29 274]]
              precision    recall  f1-score   support

           0       0.78      0.68      0.73       153
           1       0.85      0.90      0.88       303

    accuracy                           0.83       456
   macro avg       0.82      0.79      0.80       456
weighted avg       0.83      0.83      0.83       456

ROC_AUC Curve for Bagging on test data set

In [156]: 

probs_test = Bagging_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for Bagging test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Bagging Test data set",fontsize=14,color = 'red');

The AUC_ROC score for Bagging test data set 0.90

Performance Matrix of Ada Boost on train data set

In [157]: 

## Performance Matrix on train data set

The ADA boost Model Score for train data set is 0.850

[[214  93]
 [ 66 688]]
              precision    recall  f1-score   support

           0       0.76      0.70      0.73       307
           1       0.88      0.91      0.90       754

    accuracy                           0.85      1061
   macro avg       0.82      0.80      0.81      1061
weighted avg       0.85      0.85      0.85      1061

ROC_AUC Curve for Ada Boost on train data set

In [158]: 

probs = ADB_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The AUC_ROC score for ADB Model train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for ADB Model train data set",fontsize=14,color = 'red');

The AUC_ROC score for ADB Model train data set 0.91

Performance Matrix of Ada Boost on test data set

In [159]: 

## Performance Matrix on test data set

y_test_predict = ADB_model.predict(X_test)
ADB_model_score_test = ADB_model.score(X_test, y_test)
print("The ADA boost Model Score for test data set is %.2f " % ADB_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The ADA boost Model Score for test data set is 0.81

[[103  50]
 [ 35 268]]
              precision    recall  f1-score   support

           0       0.75      0.67      0.71       153
           1       0.84      0.88      0.86       303

    accuracy                           0.81       456
   macro avg       0.79      0.78      0.79       456
weighted avg       0.81      0.81      0.81       456

ROC_AUC Curve for Ada Boost on test data set

In [160]: 

probs_test = ADB_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for ADB Model test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for ADB Model test data set",fontsize=14,color = 'red');

The AUC_ROC score for ADB Model test data set 0.88

Performance Matrix of Gradient Boosting on train data set

In [161]: 

## Performance Matrix on train data set

y_train_predict = gbc_model.predict(X_train)
gbc_model_score_train = gbc_model.score(X_train, y_train)
print("The Gradient Boosting Score for train data set is %.3f " % gbc_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

The Gradient Boosting Score for train data set is 0.893

[[239  68]
 [ 46 708]]
              precision    recall  f1-score   support

           0       0.84      0.78      0.81       307
           1       0.91      0.94      0.93       754

    accuracy                           0.89      1061
   macro avg       0.88      0.86      0.87      1061
weighted avg       0.89      0.89      0.89      1061

In [162]: 

Recall=(708/(46+708))
print("Gradient Boosting-Train Data Set-Recall for class 1 is %.3f " % Recall)

Gradient Boosting-Train Data Set-Recall for class 1 is 0.939

ROC_AUC Curve for Gradient Boosting on train data set

In [163]: 

probs = gbc_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Gradient Boosting train data set %.3f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Gradient Boosting train data set",fontsize=14,color = 'red');

The ROC_AUC score for Gradient Boosting train data set 0.951

Performance Matrix of Gradient Boosting on test data set

In [164]: 

## Performance Matrix on test data set

y_test_predict = gbc_model.predict(X_test)
gbc_model_score_test = gbc_model.score(X_test, y_test)
print("The Gradient Boosting Score for test data set is %.3f " % gbc_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

The Gradient Boosting Score for test data set is 0.836

[[105  48]
 [ 27 276]]
              precision    recall  f1-score   support

           0       0.80      0.69      0.74       153
           1       0.85      0.91      0.88       303

    accuracy                           0.84       456
   macro avg       0.82      0.80      0.81       456
weighted avg       0.83      0.84      0.83       456

In [165]: 

Recall=(276/(27+276))
print("Gradient Boosting-Test Data Set-Recall for class 1 is %.3f " % Recall)

Gradient Boosting-Test Data Set-Recall for class 1 is 0.911

ROC_AUC Curve for Gradient Boosting on test data set

In [166]: 

probs_test = gbc_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Gradient Boosting test data set %.3f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Gradient Boosting test data set",fontsize=14,color = 'red');

The ROC_AUC score for Gradient Boosting test data set 0.899

Comparison of Different Models

In [168]: 

print("The Logistic Regression Model Score Post Tuning on train data set is %.3f " % best_m
print("The Logistic Regression Model Score Post Tuning on test data set is %.3f " % best_m
print("The LDA Model Score Post Tuning on train data set is %.3f " % best_model_lda.score(X
print("The LDA Model Score Post Tuning on test data set is %.3f " % best_model_lda.score(X
print("The KNN Model Score Post Tuning on Train data %.3f " % KNN_model_1.score(X_train, y_
print("The KNN Model Score Post Tuning on Test data %.3f " % KNN_model_1.score(X_test, y_te
print("The Naive Bayes Model Score Post Tuning on train data is %.3f " % NB_SM_model.score(
print("The Naive Bayes Model Score Post Tuning on test data is %.3f " % NB_SM_model.score(X

The Logistic Regression Model Score Post Tuning on train data set is 0.834

The Logistic Regression Model Score Post Tuning on test data set is 0.829

The LDA Model Score Post Tuning on train data set is 0.835

The LDA Model Score Post Tuning on test data set is 0.831

The KNN Model Score Post Tuning on Train data 0.843

The KNN Model Score Post Tuning on Test data 0.829

The Naive Bayes Model Score Post Tuning on train data is 0.822

The Naive Bayes Model Score Post Tuning on test data is 0.809

In [169]: 

print("Difference between train and test scores of LR Model is %.5f " % (best_model_lr.score(X_tr

Difference between train and test scores of LR Model is 0.00517

In [170]: 

print("Difference between train and test scores of LDA Model is %.5f " % (best_model_lda.score(X_t

Difference between train and test scores of LDA Model is 0.00392

In [171]: 

print("Difference between train and test scores of KNN Model is %.5f " % (KNN_model_1.score(X

Difference between train and test scores of KNN Model is 0.01365

In [172]: 

print("Difference between train and test scores of Naive Bayes Model is %.5f " % (NB_SM_model.score(X_

Difference between train and test scores of Naive Bayes Model is 0.01241
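
The train-test gaps above cover only four of the eight models. As a rough sketch, the same comparison can be tabulated for every model from the accuracy scores already printed in this notebook. The dictionary name `model_scores` is an assumption for illustration, and because the inputs are the rounded values shown in the outputs, the gaps differ slightly from the five-decimal figures printed by the cells above.

```python
# (train accuracy, test accuracy) as printed in the performance cells above.
model_scores = {
    "Logistic Regression": (0.834, 0.829),
    "LDA": (0.835, 0.831),
    "KNN": (0.843, 0.829),
    "Naive Bayes (SMOTE)": (0.822, 0.809),
    "Random Forest": (1.00, 0.83),
    "Bagging": (0.97, 0.83),
    "Ada Boost": (0.850, 0.81),
    "Gradient Boosting": (0.893, 0.836),
}

# Gap between train and test accuracy; a large gap suggests overfitting.
gaps = {name: train - test for name, (train, test) in model_scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    train, test = model_scores[name]
    print("%-20s train %.3f  test %.3f  gap %.3f" % (name, train, test, gap))

largest_gap_model = max(gaps, key=gaps.get)
smallest_gap_model = min(gaps, key=gaps.get)
```

Sorting by gap puts the models that generalize most consistently at the top and the ones that fit the training data far better than the test data at the bottom.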

Cross Validation
In [173]: 

from sklearn.model_selection import cross_val_score

In [174]: 

scores = cross_val_score(best_model_lda, X_train, y_train, cv=10)

scores

Out[174]:

array([0.78504673, 0.77358491, 0.83962264, 0.85849057, 0.85849057,
       0.8490566 , 0.81132075, 0.8490566 , 0.81132075, 0.82075472])

In [175]: 

scores = cross_val_score(best_model_lda, X_test, y_test, cv=10)

scores

Out[175]:

array([0.80434783, 0.76086957, 0.86956522, 0.82608696, 0.89130435,
       0.86956522, 0.93333333, 0.84444444, 0.75555556, 0.84444444])
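
Ten fold scores are easier to compare once reduced to a mean and a spread. The sketch below summarizes the two arrays printed above using only the standard library; the variable names are assumptions for illustration, and the values are copied from `Out[174]` and `Out[175]`. Note that the second run cross-validates on `X_test`, so it fits fresh models on folds of the held-out data rather than scoring the model trained on `X_train`.

```python
from statistics import mean, pstdev

# Fold accuracies copied from Out[174] (10-fold CV on the train split).
cv_train = [0.78504673, 0.77358491, 0.83962264, 0.85849057, 0.85849057,
            0.8490566, 0.81132075, 0.8490566, 0.81132075, 0.82075472]
# Fold accuracies copied from Out[175] (10-fold CV on the test split).
cv_test = [0.80434783, 0.76086957, 0.86956522, 0.82608696, 0.89130435,
           0.86956522, 0.93333333, 0.84444444, 0.75555556, 0.84444444]

train_mean, train_std = mean(cv_train), pstdev(cv_train)
test_mean, test_std = mean(cv_test), pstdev(cv_test)

print("LDA 10-fold CV on train split: mean %.3f, std %.3f" % (train_mean, train_std))
print("LDA 10-fold CV on test split:  mean %.3f, std %.3f" % (test_mean, test_std))
```

`pstdev` is the population standard deviation, which matches what `scores.std()` returns by default on a NumPy array.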

--------------------------------------END OF PROBLEM 1--------------------------------------
