Statistics Project 6

This document contains code to analyze a dataset of holiday packages. It loads the dataset, explores the distributions of and relationships between variables through univariate and bivariate analysis, and checks for missing data, duplicates, the distribution of categorical variables, and skewness. Visualizations include histograms, boxplots, countplots, and swarm plots relating categorical and continuous variables. Outliers are then treated, categorical variables are encoded, the data is split 70:30 into train and test sets, and Logistic Regression and LDA (linear discriminant analysis) models are built and compared.


In [1]:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split,GridSearchCV

from sklearn.linear_model import LogisticRegression

from sklearn import metrics

from sklearn.metrics import roc_auc_score, roc_curve, classification_report, confusion_matrix, plot_confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn import metrics,model_selection

from sklearn.preprocessing import scale

from warnings import filterwarnings

filterwarnings('ignore')

In [3]:

df = pd.read_csv('Holiday_Package.csv')

In [4]:

df.head()

Out[4]:

   Unnamed: 0 Holliday_Package  Salary  age  educ  no_young_children  no_older_children foreign
0           1               no   48412   30     8                  1                  1      no
1           2              yes   37207   45     8                  0                  1      no
2           3               no   58022   46     9                  0                  0      no
3           4               no   66503   31    11                  2                  0      no
4           5               no   66734   44    12                  0                  2      no

In [5]:

df.tail()

Out[5]:

     Unnamed: 0 Holliday_Package  Salary  age  educ  no_young_children  no_older_children
867         868               no   40030   24     4                  2                  1
868         869              yes   32137   48     8                  0                  0
869         870               no   25178   24     6                  2                  0
870         871              yes   55958   41    10                  0                  1
871         872               no   74659   51    10                  0                  0

https://htmtopdf.herokuapp.com/ipynbviewer/temp/c9ffe7dee6cf683104ff5b70752d8eb0/Holiday_Package.html?t=1625308272464 1/48
7/3/2021 temp-162530817097237750

In [6]:

df.shape

Out[6]:

(872, 8)

In [7]:

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 872 entries, 0 to 871

Data columns (total 8 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Unnamed: 0 872 non-null int64

1 Holliday_Package 872 non-null object

2 Salary 872 non-null int64

3 age 872 non-null int64

4 educ 872 non-null int64

5 no_young_children 872 non-null int64

6 no_older_children 872 non-null int64

7 foreign 872 non-null object

dtypes: int64(6), object(2)

memory usage: 54.6+ KB

Data Description
In [8]:

df.describe(include ='all').T

Out[8]:

                   count unique top freq      mean      std   min     25%      50%
Unnamed: 0           872    NaN NaN  NaN     436.5  251.869     1  218.75    436.5
Holliday_Package     872      2  no  471       NaN      NaN   NaN     NaN      NaN
Salary               872    NaN NaN  NaN   47729.2  23418.7  1322   35324  41903.5
age                  872    NaN NaN  NaN   39.9553  10.5517    20      32       39
educ                 872    NaN NaN  NaN   9.30734  3.03626     1       8        9
no_young_children    872    NaN NaN  NaN  0.311927  0.61287     0       0        0
no_older_children    872    NaN NaN  NaN  0.982798  1.08679     0       0        1
foreign              872      2  no  656       NaN      NaN   NaN     NaN      NaN

Null value check


In [9]:

df.isnull().sum()

Out[9]:

Unnamed: 0 0

Holliday_Package 0

Salary 0

age 0

educ 0

no_young_children 0

no_older_children 0

foreign 0

dtype: int64

Check for duplicates in data

In [10]:

dups = df.duplicated()

print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 0

Unique values for categorical variables

In [11]:

for column in df.columns:
    if df[column].dtype == 'object':
        print(column.upper(), ': ', df[column].nunique())
        print(df[column].value_counts().sort_values())
        print('\n')

HOLLIDAY_PACKAGE : 2

yes 401

no 471

Name: Holliday_Package, dtype: int64

FOREIGN : 2

yes 216

no 656

Name: foreign, dtype: int64

df.Holliday_Package.value_counts(1)

Univariate / Bivariate analysis


In [12]:

fig, axes = plt.subplots(nrows=5,ncols=2)

fig.set_size_inches(15, 23)

a = sns.distplot(df['Salary'], kde_kws = {'bw' : 1}, ax=axes[0][0])

a.set_title("Salary Distribution",fontsize=10)

a = sns.boxplot(df['Salary'] , orient = "v" , ax=axes[0][1])

a.set_title("Salary Distribution",fontsize=15)

a = sns.distplot(df['age'], kde_kws = {'bw' : 1}, ax=axes[1][0])

a.set_title("age Distribution",fontsize=10)

a = sns.boxplot(df['age'] , orient = "v" , ax=axes[1][1])

a.set_title("age Distribution",fontsize=10)

a = sns.distplot(df['educ'], kde_kws = {'bw' : 1}, ax=axes[2][0])

a.set_title("educ Distribution",fontsize=10)

a = sns.boxplot(df['educ'] , orient = "v" , ax=axes[2][1])

a.set_title("educ Distribution",fontsize=10)

a = sns.distplot(df['no_young_children'], kde_kws = {'bw' : 1}, ax=axes[3][0])

a.set_title("no_young_children Distribution",fontsize=10)

a = sns.boxplot(df['no_young_children'] , orient = "v" , ax=axes[3][1])

a.set_title("no_young_children Distribution",fontsize=10)

a = sns.distplot(df['no_older_children'], kde_kws = {'bw' : 1}, ax=axes[4][0])

a.set_title("no_older_children Distribution",fontsize=10)

a = sns.boxplot(df['no_older_children'] , orient = "v" , ax=axes[4][1])

a.set_title("no_older_children Distribution",fontsize=10)


Out[12]:

Text(0.5, 1.0, 'no_older_children Distribution')


In [13]:

df.columns

Out[13]:

Index(['Unnamed: 0', 'Holliday_Package', 'Salary', 'age', 'educ',

'no_young_children', 'no_older_children', 'foreign'],

dtype='object')

In [14]:

df.skew()

Out[14]:

Unnamed: 0 0.000000

Salary 3.103216

age 0.146412

educ -0.045501

no_young_children 1.946515

no_older_children 0.953951

dtype: float64

Categorical Variables


In [15]:

sns.countplot(x="foreign", data=df, color="c")

Out[15]:

<AxesSubplot:xlabel='foreign', ylabel='count'>

In [16]:

sns.countplot(x="Holliday_Package", data=df, color="c")

Out[16]:

<AxesSubplot:xlabel='Holliday_Package', ylabel='count'>


In [17]:

sns.catplot(x="Holliday_Package", y="Salary",kind="swarm",data=df)

Out[17]:

<seaborn.axisgrid.FacetGrid at 0x2591280b700>

In [18]:

sns.catplot(x="Holliday_Package", y="age",kind="swarm",data=df)

Out[18]:

<seaborn.axisgrid.FacetGrid at 0x259121c45b0>


In [19]:

sns.catplot(x="Holliday_Package", y="educ",kind="swarm",data=df)

Out[19]:

<seaborn.axisgrid.FacetGrid at 0x2591280bca0>

In [21]:

sns.catplot(x="Holliday_Package", y="no_young_children",kind="swarm",data=df)

Out[21]:

<seaborn.axisgrid.FacetGrid at 0x259122bae20>


In [22]:

sns.catplot(x="Holliday_Package", y="no_older_children",kind="swarm",data=df)

Out[22]:

<seaborn.axisgrid.FacetGrid at 0x25912222760>

In [23]:

sns.scatterplot(data = df, x='age',y='Salary', hue = 'Holliday_Package')

Out[23]:

<AxesSubplot:xlabel='age', ylabel='Salary'>


In [24]:

sns.lmplot(x="age", y="Salary", hue="Holliday_Package", data=df,

palette="Set1")

Out[24]:

<seaborn.axisgrid.FacetGrid at 0x2591231ca30>


In [25]:

sns.lmplot(x="educ", y="Salary", hue="Holliday_Package", data=df,

palette="Set1")

Out[25]:

<seaborn.axisgrid.FacetGrid at 0x259124438e0>

In [26]:

sns.scatterplot(data = df, x='educ',y='Salary', hue = 'Holliday_Package')

Out[26]:

<AxesSubplot:xlabel='educ', ylabel='Salary'>


In [27]:

sns.scatterplot(data = df, x='no_young_children',y='age', hue = 'Holliday_Package')

Out[27]:

<AxesSubplot:xlabel='no_young_children', ylabel='age'>

In [28]:

sns.lmplot(x="age", y="no_young_children", hue="Holliday_Package", data=df,

palette="Set1")

Out[28]:

<seaborn.axisgrid.FacetGrid at 0x25912509580>


In [29]:

sns.scatterplot(data = df, x='no_older_children',y='age', hue = 'Holliday_Package')

Out[29]:

<AxesSubplot:xlabel='no_older_children', ylabel='age'>

In [30]:

sns.lmplot(x="age", y="no_older_children", hue="Holliday_Package", data=df,

palette="Set1")

Out[30]:

<seaborn.axisgrid.FacetGrid at 0x259125e3a60>


In [31]:

cols = ['Salary' ,'age', 'educ', 'no_young_children', 'no_older_children']

for i in cols:
    sns.boxplot(df[i])
    plt.show()


In [32]:

df.columns

Out[32]:

Index(['Unnamed: 0', 'Holliday_Package', 'Salary', 'age', 'educ',

'no_young_children', 'no_older_children', 'foreign'],

dtype='object')

Data Distribution


In [33]:

# Pairplot using sns

sns.pairplot(df ,diag_kind='kde' ,hue='Holliday_Package');

Checking for Correlations


In [34]:

df_cor = df.corr()

plt.figure(figsize=(8,6))

sns.heatmap(df_cor, annot=True, fmt = '.2f', cmap='coolwarm')

Out[34]:

<AxesSubplot:>


In [35]:

df.isnull().sum()

Out[35]:

Unnamed: 0 0

Holliday_Package 0

Salary 0

age 0

educ 0

no_young_children 0

no_older_children 0

foreign 0

dtype: int64

Treating Outliers

In [36]:

cont=df.dtypes[(df.dtypes!='uint8') & (df.dtypes!='object')].index

In [37]:

def remove_outlier(col):
    sorted(col)
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

In [41]:

for column in df[cont].columns:
    lr, ur = remove_outlier(df[column])
    df[column] = np.where(df[column] > ur, ur, df[column])
    df[column] = np.where(df[column] < lr, lr, df[column])
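The same capping can also be written more compactly with pandas clip, which applies both bounds in one call. A minimal sketch of that alternative (not used in this notebook), reusing the remove_outlier helper and the cont column index defined above:

# Alternative to the np.where loop above: cap each continuous column with Series.clip,
# using the same IQR-based bounds returned by remove_outlier.
for column in df[cont].columns:
    lr, ur = remove_outlier(df[column])
    df[column] = df[column].clip(lower=lr, upper=ur)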


In [42]:

cols = ['Salary' ,'age', 'educ', 'no_young_children', 'no_older_children']

for i in cols:
    sns.boxplot(df[i])
    plt.show()


In [40]:

df.head()

Out[40]:

   Unnamed: 0 Holliday_Package   Salary   age  educ  no_young_children  no_older_children foreign
0         1.0               no  48412.0  30.0   8.0                0.0                1.0      no
1         2.0              yes  37207.0  45.0   8.0                0.0                1.0      no
2         3.0               no  58022.0  46.0   9.0                0.0                0.0      no
3         4.0               no  66503.0  31.0  11.0                0.0                0.0      no
4         5.0               no  66734.0  44.0  12.0                0.0                2.0      no

In [43]:

# drop the id column as it is useless for the model

df1 = df.drop(columns=['Unnamed: 0'], axis=1)

In [44]:

df1.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 872 entries, 0 to 871

Data columns (total 7 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Holliday_Package 872 non-null object

1 Salary 872 non-null float64

2 age 872 non-null float64

3 educ 872 non-null float64

4 no_young_children 872 non-null float64

5 no_older_children 872 non-null float64

6 foreign 872 non-null object

dtypes: float64(5), object(2)

memory usage: 47.8+ KB

In [45]:

df2 = df1.copy()

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

Converting categorical to dummy variables in data

In [46]:

data = pd.get_dummies(df2, columns=['Holliday_Package','foreign'], drop_first = True)


In [47]:

data.head()

Out[47]:

    Salary   age  educ  no_young_children  no_older_children  Holliday_Package_yes  foreign_yes
0  48412.0  30.0   8.0                0.0                1.0                     0            0
1  37207.0  45.0   8.0                0.0                1.0                     1            0
2  58022.0  46.0   9.0                0.0                0.0                     0            0
3  66503.0  31.0  11.0                0.0                0.0                     0            0
4  66734.0  44.0  12.0                0.0                2.0                     0            0

In [48]:

data.columns

Out[48]:

Index(['Salary', 'age', 'educ', 'no_young_children', 'no_older_children',

'Holliday_Package_yes', 'foreign_yes'],

dtype='object')

Train/Test split

In [49]:

# Copy all the predictor variables into X dataframe

X = data.drop('Holliday_Package_yes', axis=1)

# Copy target into the y dataframe.

y = data['Holliday_Package_yes']

In [50]:

# Split X and y into training and test set in 70:30 ratio

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)

In [51]:

y_train.value_counts(1)

Out[51]:

0 0.539344

1 0.460656

Name: Holliday_Package_yes, dtype: float64

Applying GridSearchCV for Logistic Regression


In [53]:

grid = {'penalty': ['l1', 'l2', 'none'],
        'solver': ['lbfgs', 'liblinear'],
        'tol': [0.0001, 0.000001]}

In [54]:

model = LogisticRegression(max_iter=100000,n_jobs=2)

In [55]:

grid_search = GridSearchCV(estimator=model, param_grid=grid, cv=3, n_jobs=-1, scoring='f1')

In [56]:

grid_search.fit(X_train, y_train)

Out[56]:

GridSearchCV(cv=3, estimator=LogisticRegression(max_iter=100000, n_jobs=2),
             n_jobs=-1,
             param_grid={'penalty': ['l1', 'l2', 'none'],
                         'solver': ['lbfgs', 'liblinear'],
                         'tol': [0.0001, 1e-06]},
             scoring='f1')

In [57]:

print(grid_search.best_params_,'\n')

print(grid_search.best_estimator_)

{'penalty': 'l2', 'solver': 'liblinear', 'tol': 1e-06}

LogisticRegression(max_iter=100000, n_jobs=2, solver='liblinear', tol=1e-06)

In [58]:

best_model = grid_search.best_estimator_

In [59]:

# Prediction on the training set

ytrain_predict = best_model.predict(X_train)

ytest_predict = best_model.predict(X_test)


In [60]:

ytrain_predict

Out[60]:

array([1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,

1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1,

0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,

1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0,

1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,

1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,

0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,

1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,

0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1,

1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,

0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1,

0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,

0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1,

0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,

1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,

1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1,

0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,

0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1,

0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1,

0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,

0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,

0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0,

0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0], dtype=uint8)

In [61]:

## Getting the probabilities on the test set

ytest_predict_prob=best_model.predict_proba(X_test)

pd.DataFrame(ytest_predict_prob).head()

Out[61]:

0 1

0 0.636523 0.363477

1 0.576651 0.423349

2 0.650835 0.349165

3 0.568064 0.431936

4 0.536356 0.463644


In [62]:

## Confusion matrix on the training data

plot_confusion_matrix(best_model,X_train,y_train)

print(classification_report(y_train, ytrain_predict),'\n');

precision recall f1-score support

0 0.63 0.79 0.70 329

1 0.65 0.45 0.53 281

accuracy 0.63 610

macro avg 0.64 0.62 0.62 610

weighted avg 0.64 0.63 0.62 610


In [63]:

## Confusion matrix on the test data

plot_confusion_matrix(best_model,X_test,y_test)

print(classification_report(y_test, ytest_predict),'\n');

precision recall f1-score support

0 0.64 0.83 0.72 142

1 0.69 0.45 0.55 120

accuracy 0.66 262

macro avg 0.67 0.64 0.63 262

weighted avg 0.66 0.66 0.64 262

In [64]:

# Accuracy - Training Data

lr_train_acc = best_model.score(X_train, y_train)

lr_train_acc

Out[64]:

0.6344262295081967

AUC and ROC for the training data


In [65]:

# predict probabilities

probs = best_model.predict_proba(X_train)

# keep probabilities for the positive outcome only

probs = probs[:, 1]

# calculate AUC

lr_train_auc = roc_auc_score(y_train, probs)

print('AUC: %.3f' % lr_train_auc)

# calculate roc curve

train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)

plt.plot([0, 1], [0, 1], linestyle='--')

# plot the roc curve for the model

plt.plot(train_fpr, train_tpr);

AUC: 0.661

In [66]:

# Accuracy - Test Data

lr_test_acc = best_model.score(X_test, y_test)

lr_test_acc

Out[66]:

0.6564885496183206


In [67]:

# predict probabilities

probs = best_model.predict_proba(X_test)

# keep probabilities for the positive outcome only

probs = probs[:, 1]

# calculate AUC

lr_test_auc = roc_auc_score(y_test, probs)

print('AUC: %.3f' % lr_test_auc)

# calculate roc curve

test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)

plt.plot([0, 1], [0, 1], linestyle='--')

# plot the roc curve for the model

plt.plot(test_fpr, test_tpr);

AUC: 0.675

In [68]:

lr_metrics=classification_report(y_train, ytrain_predict,output_dict=True)

df=pd.DataFrame(lr_metrics).transpose()

lr_train_f1=round(df.loc["1"][2],2)

lr_train_recall=round(df.loc["1"][1],2)

lr_train_precision=round(df.loc["1"][0],2)

print ('lr_train_precision ',lr_train_precision)

print ('lr_train_recall ',lr_train_recall)

print ('lr_train_f1 ',lr_train_f1)

lr_train_precision 0.65

lr_train_recall 0.45

lr_train_f1 0.53


In [69]:

lr_metrics=classification_report(y_test, ytest_predict,output_dict=True)

df=pd.DataFrame(lr_metrics).transpose()

lr_test_f1=round(df.loc["1"][2],2)

lr_test_recall=round(df.loc["1"][1],2)

lr_test_precision=round(df.loc["1"][0],2)

print ('lr_test_precision ',lr_test_precision)

print ('lr_test_recall ',lr_test_recall)

print ('lr_test_f1 ',lr_test_f1)

lr_test_precision 0.69

lr_test_recall 0.45

lr_test_f1 0.55

LDA MODEL

In [70]:

df1.head()

Out[70]:

Holliday_Package Salary age educ no_young_children no_older_children foreign

0 no 48412.0 30.0 8.0 0.0 1.0 no

1 yes 37207.0 45.0 8.0 0.0 1.0 no

2 no 58022.0 46.0 9.0 0.0 0.0 no

3 no 66503.0 31.0 11.0 0.0 0.0 no

4 no 66734.0 44.0 12.0 0.0 2.0 no

In [71]:

df1.shape

Out[71]:

(872, 7)

In [72]:

df1.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 872 entries, 0 to 871

Data columns (total 7 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Holliday_Package 872 non-null object

1 Salary 872 non-null float64

2 age 872 non-null float64

3 educ 872 non-null float64

4 no_young_children 872 non-null float64

5 no_older_children 872 non-null float64

6 foreign 872 non-null object

dtypes: float64(5), object(2)

memory usage: 47.8+ KB


In [73]:

for feature in df1.columns:
    if df1[feature].dtype == 'object':
        print('\n')
        print('feature:', feature)
        print(pd.Categorical(df1[feature].unique()))
        print(pd.Categorical(df1[feature].unique()).codes)
        df1[feature] = pd.Categorical(df1[feature]).codes

feature: Holliday_Package

['no', 'yes']

Categories (2, object): ['no', 'yes']

[0 1]

feature: foreign

['no', 'yes']

Categories (2, object): ['no', 'yes']

[0 1]
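Both branches of the analysis therefore encode the two categorical columns the same way: get_dummies with drop_first=True (used for the logistic regression data) and pd.Categorical(...).codes (used here) both map 'no' to 0 and 'yes' to 1. A quick optional sanity check, assuming df1 and data from the earlier cells are still in memory:

# Optional sanity check (not in the original notebook): the label codes used for the
# LDA branch should match the drop_first dummy encoding used for logistic regression.
print((df1['foreign'].values == data['foreign_yes'].values).all())
print((df1['Holliday_Package'].values == data['Holliday_Package_yes'].values).all())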


In [74]:

cols = ['Salary' ,'age', 'educ', 'no_young_children', 'no_older_children']

for i in cols:
    sns.boxplot(df1[i])
    plt.show()


In [75]:

df1.head()

Out[75]:

Holliday_Package Salary age educ no_young_children no_older_children foreign

0 0 48412.0 30.0 8.0 0.0 1.0 0

1 1 37207.0 45.0 8.0 0.0 1.0 0

2 0 58022.0 46.0 9.0 0.0 0.0 0

3 0 66503.0 31.0 11.0 0.0 0.0 0

4 0 66734.0 44.0 12.0 0.0 2.0 0

In [76]:

df1.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 872 entries, 0 to 871

Data columns (total 7 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Holliday_Package 872 non-null int8

1 Salary 872 non-null float64

2 age 872 non-null float64

3 educ 872 non-null float64

4 no_young_children 872 non-null float64

5 no_older_children 872 non-null float64

6 foreign 872 non-null int8

dtypes: float64(5), int8(2)

memory usage: 35.9 KB

In [77]:

X = df1.drop('Holliday_Package',axis=1)

Y = df1.pop('Holliday_Package')


In [78]:

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.30, random_state=1, stratify=Y)

In [79]:

#Build LDA Model

clf = LinearDiscriminantAnalysis()

model=clf.fit(X_train,Y_train)

In [80]:

# Training Data Class Prediction with a cut-off value of 0.5

pred_class_train = model.predict(X_train)

# Test Data Class Prediction with a cut-off value of 0.5

pred_class_test = model.predict(X_test)

In [81]:

pred_class_test

Out[81]:

array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,

0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,

0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,

0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,

0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,

0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,

1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0,

0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,

0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,

1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],

dtype=int8)

In [82]:

# Training Data Probability Prediction

pred_prob_train = model.predict_proba(X_train)

# Test Data Probability Prediction

pred_prob_test = model.predict_proba(X_test)

In [83]:

lda_train_acc = model.score(X_train,Y_train)

lda_train_acc

Out[83]:

0.6327868852459017


In [84]:

print(classification_report(Y_train, pred_class_train))

precision recall f1-score support

0 0.62 0.80 0.70 329

1 0.65 0.44 0.52 281

accuracy 0.63 610

macro avg 0.64 0.62 0.61 610

weighted avg 0.64 0.63 0.62 610

In [85]:

confusion_matrix(Y_train, pred_class_train)

Out[85]:

array([[263, 66],

[158, 123]], dtype=int64)

In [86]:

lda_test_acc = model.score(X_test,Y_test)

lda_test_acc

Out[86]:

0.6564885496183206

In [87]:

print(classification_report(Y_test, pred_class_test))

precision recall f1-score support

0 0.64 0.83 0.72 142

1 0.69 0.45 0.55 120

accuracy 0.66 262

macro avg 0.67 0.64 0.63 262

weighted avg 0.66 0.66 0.64 262

In [88]:

confusion_matrix(Y_test, pred_class_test)

Out[88]:

array([[118, 24],

[ 66, 54]], dtype=int64)


In [89]:

for j in np.arange(0.1, 1, 0.1):
    custom_prob = j  # cut-off value for this iteration
    custom_cutoff_data = []  # predicted classes at this cut-off
    for i in range(0, len(Y_train)):  # loop over the training data
        # assign class 1 if the predicted probability of class 1 exceeds the cut-off, else 0
        if np.array(pred_prob_train[:, 1])[i] > custom_prob:
            a = 1
        else:
            a = 0
        custom_cutoff_data.append(a)
    print(round(j, 3), '\n')
    print('Accuracy Score', round(metrics.accuracy_score(Y_train, custom_cutoff_data), 4))
    print('F1 Score', round(metrics.f1_score(Y_train, custom_cutoff_data), 4), '\n')
    plt.figure(figsize=(6, 4))
    print('Confusion Matrix')
    sns.heatmap(metrics.confusion_matrix(Y_train, custom_cutoff_data), annot=True, fmt='.4g')
    plt.show()


0.1

Accuracy Score 0.4607

F1 Score 0.6308

Confusion Matrix

0.2

Accuracy Score 0.4738

F1 Score 0.6365

Confusion Matrix


0.3

Accuracy Score 0.5344

F1 Score 0.6485

Confusion Matrix

0.4

Accuracy Score 0.5787

F1 Score 0.6088

Confusion Matrix

0.5

Accuracy Score 0.6328

F1 Score 0.5234

Confusion Matrix


0.6

Accuracy Score 0.6213

F1 Score 0.446

Confusion Matrix

0.7

Accuracy Score 0.5869

F1 Score 0.2455

Confusion Matrix


0.8

Accuracy Score 0.541

F1 Score 0.0071

Confusion Matrix

0.9

Accuracy Score 0.5393

F1 Score 0.0

Confusion Matrix
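The row-by-row loop above can also be written as a single vectorized comparison per cut-off. A minimal sketch of that compact form, assuming pred_prob_train and Y_train from the cells above:

# Vectorized form of the cut-off sweep above: threshold the class-1 probabilities
# directly instead of looping over individual rows.
for j in np.arange(0.1, 1, 0.1):
    preds = (pred_prob_train[:, 1] > j).astype(int)
    print(round(j, 1),
          'Accuracy', round(metrics.accuracy_score(Y_train, preds), 4),
          'F1', round(metrics.f1_score(Y_train, preds), 4))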


In [90]:

# AUC and ROC for the training data

# calculate AUC

lda_train_auc = metrics.roc_auc_score(Y_train,pred_prob_train[:,1])

print('AUC for the Training Data: %.3f' % lda_train_auc)

# calculate roc curve

fpr, tpr, thresholds = metrics.roc_curve(Y_train,pred_prob_train[:,1])

plt.plot([0, 1], [0, 1], linestyle='--')

# plot the roc curve for the model

plt.plot(fpr, tpr, marker='.',label = 'Training Data')

# AUC and ROC for the test data

# calculate AUC

lda_test_auc = metrics.roc_auc_score(Y_test,pred_prob_test[:,1])

print('AUC for the Test Data: %.3f' % lda_test_auc)

# calculate roc curve

fpr, tpr, thresholds = metrics.roc_curve(Y_test,pred_prob_test[:,1])

plt.plot([0, 1], [0, 1], linestyle='--')

# plot the roc curve for the model

plt.plot(fpr, tpr, marker='.',label='Test Data')

# show the plot

plt.legend(loc='best')

plt.show()

AUC for the Training Data: 0.661

AUC for the Test Data: 0.675
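A possible extension, not part of the original notebook: the ROC curve already provides every candidate cut-off, so an operating point can be picked directly, for example by maximizing Youden's J statistic (TPR minus FPR) on the training data:

# Hedged sketch: choose the training-data cut-off that maximizes Youden's J = TPR - FPR,
# reusing Y_train and pred_prob_train from the earlier cells.
fpr_tr, tpr_tr, thr_tr = metrics.roc_curve(Y_train, pred_prob_train[:, 1])
best = np.argmax(tpr_tr - fpr_tr)
print('Cut-off maximizing Youden J: %.3f (TPR=%.3f, FPR=%.3f)'
      % (thr_tr[best], tpr_tr[best], fpr_tr[best]))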


In [91]:

lda_metrics=classification_report(Y_train, pred_class_train,output_dict=True)

df=pd.DataFrame(lda_metrics).transpose()

lda_train_f1=round(df.loc["1"][2],2)

lda_train_recall=round(df.loc["1"][1],2)

lda_train_precision=round(df.loc["1"][0],2)

print ('lda_train_precision ',lda_train_precision)

print ('lda_train_recall ',lda_train_recall)

print ('lda_train_f1 ',lda_train_f1)

lda_train_precision 0.65

lda_train_recall 0.44

lda_train_f1 0.52

In [92]:

lda_metrics=classification_report(Y_test, pred_class_test,output_dict=True)

df=pd.DataFrame(lda_metrics).transpose()

lda_test_f1=round(df.loc["1"][2],2)

lda_test_recall=round(df.loc["1"][1],2)

lda_test_precision=round(df.loc["1"][0],2)

print ('lda_test_precision ',lda_test_precision)

print ('lda_test_recall ',lda_test_recall)

print ('lda_test_f1 ',lda_test_f1)

lda_test_precision 0.69

lda_test_recall 0.45

lda_test_f1 0.55

In [93]:

index=['Accuracy', 'AUC', 'Recall','Precision','F1 Score']

data = pd.DataFrame({'LR Train': [lr_train_acc, lr_train_auc, lr_train_recall, lr_train_precision, lr_train_f1],
                     'LR Test': [lr_test_acc, lr_test_auc, lr_test_recall, lr_test_precision, lr_test_f1],
                     'LDA Train': [lda_train_acc, lda_train_auc, lda_train_recall, lda_train_precision, lda_train_f1],
                     'LDA Test': [lda_test_acc, lda_test_auc, lda_test_recall, lda_test_precision, lda_test_f1]},
                    index=index)

round(data, 2)

Out[93]:

           LR Train  LR Test  LDA Train  LDA Test
Accuracy       0.63     0.66       0.63      0.66
AUC            0.66     0.68       0.66      0.68
Recall         0.45     0.45       0.44      0.45
Precision      0.65     0.69       0.65      0.69
F1 Score       0.53     0.55       0.52      0.55

In [ ]:
