Water Quality 1673157384

Uploaded by

morgan.mbatia

water-quality - Jupyter Notebook 26/12/22, 10:02 PM

Water Quality Analysis and Classification

In [1]: # importing the libraries
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
import plotly.figure_factory as ff
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')
%matplotlib inline

Data Collection ¶
In [2]: #loading water dataset in pandas
data=pd.read_csv('water_potability.csv')

#check first five rows of the dataset

data.head()

Out[2]:
         ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  Organic_carbon
0       NaN  204.890455  20791.318981     7.300212  368.516441    564.308654       10.379783
1  3.716080  129.422921  18630.057858     6.635246         NaN    592.885359       15.180013
2  8.099124  224.236259  19909.541732     9.275884         NaN    418.606213       16.868637
3  8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516       18.436524
4  9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813       11.558279

EDA

http://localhost:8888/notebooks/Documents/Machine-Learning-Projects/Water%20Quality%20Classification/water-quality.ipynb Page 1 of 30

In [3]: # summary statistics of the dataset

data.describe()

Out[3]:
                 ph      Hardness        Solids   Chloramines       Sulfate  Conductivity  Organic
count   2785.000000   3276.000000   3276.000000   3276.000000   2495.000000   3276.000000
mean       7.080795    196.369496  22014.092526      7.122277    333.775777    426.205111
std        1.594320     32.879761   8768.570828      1.583085     41.416840     80.824064
min        0.000000     47.432000    320.942611      0.352000    129.000000    181.483754
25%        6.093092    176.850538  15666.690297      6.127421    307.699498    365.734414
50%        7.036752    196.967627  20927.833607      7.130299    333.073546    421.884968
75%        8.062066    216.667456  27332.762127      8.114887    359.950170    481.792304
max       14.000000    323.124000  61227.196008     13.127000    481.030642    753.342620

In [4]: # information about the dataset

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ph 2785 non-null float64
1 Hardness 3276 non-null float64
2 Solids 3276 non-null float64
3 Chloramines 3276 non-null float64
4 Sulfate 2495 non-null float64
5 Conductivity 3276 non-null float64
6 Organic_carbon 3276 non-null float64
7 Trihalomethanes 3114 non-null float64
8 Turbidity 3276 non-null float64
9 Potability 3276 non-null int64
dtypes: float64(9), int64(1)
memory usage: 256.1 KB

In [5]: print('There are {} data points and {} features in the data'.format(data.shape[0], data.shape[1]))

There are 3276 data points and 10 features in the data

Null Values


In [6]: print('There are {} data points and {} features in the data'.format(

Out[6]: <AxesSubplot:>

In [7]: for i in data.columns:
    if data[i].isnull().sum()>0:
        print("There are {} null values in {} column".format(data[i].isnull().sum(), i))

There are 491 null values in ph column
There are 781 null values in Sulfate column
There are 162 null values in Trihalomethanes column
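
Raw counts are easier to judge as a share of the rows. The following is a minimal sketch, assuming a hypothetical helper `null_summary` (not part of the notebook) and a small made-up frame used only for illustration:

```python
import numpy as np
import pandas as pd

def null_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    return summary[summary["n_missing"] > 0]

# toy frame for illustration only, not the water dataset
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, np.nan],
    "Hardness": [200.0, 180.0, 190.0, 210.0],
})
result = null_summary(toy)
print(result)
```

Applied to the counts printed above, 491 of 3276 rows is roughly 15% for ph, 781 is roughly 24% for Sulfate, and 162 is roughly 5% for Trihalomethanes.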

Handling Null Values

In [8]: #describe of ph value

data['ph'].describe()

Out[8]: count 2785.000000

mean 7.080795
std 1.594320
min 0.000000
25% 6.093092
50% 7.036752
75% 8.062066
max 14.000000
Name: ph, dtype: float64


Filling the missing values by mean

In [9]: #filling missing value by mean()

data['ph_mean']=data['ph'].fillna(data['ph'].mean())

In [10]: #check missing values

data['ph_mean'].isnull().sum()

Out[10]: 0

In [11]: #plot dataset ph

fig = plt.figure()
ax = fig.add_subplot(111)
data['ph'].plot(kind='kde', ax=ax)
data.ph_mean.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
plt.show()

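Mean imputation distorts the distribution: the red (mean-filled) curve has a sharper peak around the mean than the original ph curve.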
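
Filling with the mean leaves the mean unchanged but shrinks the spread, because every imputed point sits exactly at the centre. A minimal sketch on synthetic data (not the water dataset; the sample size, seed and 15% missing rate are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=7.0, scale=1.6, size=1000))

# knock out 15% of the entries at random
missing_idx = rng.choice(values.index, size=150, replace=False)
with_gaps = values.copy()
with_gaps.loc[missing_idx] = np.nan

mean_filled = with_gaps.fillna(with_gaps.mean())

std_observed = with_gaps.std()   # NaNs are skipped by default
std_filled = mean_filled.std()
print(std_observed, std_filled)
```

The standard deviation of the filled column comes out smaller than that of the observed values, which is the numeric counterpart of the narrower KDE curve.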

Filling the data with random values

In [12]: def impute_nan(df,variable):
    df[variable+"_random"]=df[variable]
    ## draw one random observed value for each missing entry
    random_sample=df[variable].dropna().sample(df[variable].isnull().sum())
    ## pandas needs matching indexes to assign the sample to the missing rows
    random_sample.index=df[df[variable].isnull()].index
    df.loc[df[variable].isnull(),variable+'_random']=random_sample


In [13]:

#fill missing values

impute_nan(data,"ph")

In [14]: #plot ph value

fig = plt.figure()
ax = fig.add_subplot(111)
data['ph'].plot(kind='kde', ax=ax)
data.ph_random.plot(kind='kde', ax=ax, color='green')
data.ph_mean.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
plt.show()

Random-sample imputation (green) follows the original ph distribution closely, unlike mean imputation (red).
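
The same idea can be written as a function that returns a new Series instead of adding a column in place. This is only a sketch: `random_sample_impute` and its `seed` parameter are hypothetical names, and sampling with replacement is an assumption made so the draw also works when few values are observed.

```python
import numpy as np
import pandas as pd

def random_sample_impute(series, seed=0):
    """Return a copy of `series` with NaNs replaced by values drawn from its observed entries."""
    filled = series.copy()
    missing = filled.isnull()
    n_missing = int(missing.sum())
    if n_missing == 0:
        return filled
    # replace=True so the draw also works when more values are missing than observed
    draws = series.dropna().sample(n_missing, replace=True, random_state=seed)
    filled.loc[missing] = draws.to_numpy()  # positional assignment, no index alignment needed
    return filled

# toy values for illustration only
toy = pd.Series([6.5, np.nan, 7.2, 8.1, np.nan, 7.0])
imputed = random_sample_impute(toy)
print(imputed)
```

Fixing the seed makes the imputed column reproducible between runs, which helps when comparing model scores later.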

In [15]: #fill sulfate

impute_nan(data,"Sulfate")


In [16]: #plot sulfate value

fig = plt.figure()
ax = fig.add_subplot(111)
data['Sulfate'].plot(kind='kde', ax=ax)
data["Sulfate_random"].plot(kind='kde', ax=ax, color='green')
#data.ph_mean.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
plt.show()

In [17]: #fill Trihalomethanes

impute_nan(data,"Trihalomethanes")

In [18]: #plot Trihalomethanes value

fig = plt.figure()
ax = fig.add_subplot(111)
data['Trihalomethanes'].plot(kind='kde', ax=ax)
data.Trihalomethanes_random.plot(kind='kde', ax=ax, color='green')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
plt.show()


In [19]: # drop the original columns and the mean-imputed ph column

data=data.drop(['ph','Sulfate','Trihalomethanes','ph_mean'],axis=1)

In [20]: #check missing value after fill and drop values

data.isnull().sum()

Out[20]: Hardness 0
Solids 0
Chloramines 0
Conductivity 0
Organic_carbon 0
Turbidity 0
Potability 0
ph_random 0
Sulfate_random 0
Trihalomethanes_random 0
dtype: int64

Check for Correlation


In [21]: #check correlation of the dataset

plt.figure(figsize=(20, 17))
matrix = np.triu(data.corr())
sns.heatmap(data.corr(), annot=True,linewidth=.8, mask=matrix, cmap

No pair of columns in the data is strongly correlated.
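
To back a heatmap reading with numbers, the strongest pairwise correlations can be listed directly. A minimal sketch, assuming a hypothetical helper `top_correlations` and synthetic columns (not the water dataset):

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().dropna().sort_values(ascending=False).head(n)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent noise
})
pairs = top_correlations(toy)
print(pairs)
```

Calling `top_correlations(data)` on the notebook's frame would show how far the largest absolute coefficient is from 1.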


In [22]: #plot pairplot of the dataset

sns.pairplot(data, hue="Potability", palette="husl");


In [23]: #plot dist of the dataset

def distributionPlot(data):
    """
    Creates distribution plot.
    """
    fig = plt.figure(figsize=(20, 20))
    n_rows = int(np.ceil(len(data.columns)/3))  # add_subplot needs integer grid sizes
    for i in tqdm_notebook(range(0, len(data.columns))):
        fig.add_subplot(n_rows, 3, i+1)
        sns.distplot(
            data.iloc[:, i], color="lightcoral", rug=True)
    fig.tight_layout(pad=3.0)

plot_data = data.drop(['Potability'], axis =1)
distributionPlot(plot_data)

0%| | 0/9 [00:00<?, ?it/s]

Hardness

In [24]: # summary statistics for Hardness

data['Hardness'].describe()

Out[24]: count 3276.000000

mean 196.369496
std 32.879761
min 47.432000
25% 176.850538
50% 196.967627
75% 216.667456
max 323.124000
Name: Hardness, dtype: float64

In [25]: #plot Hardness values

plt.figure(figsize = (16, 7))
sns.distplot(data['Hardness'])
plt.title('Distribution Plot of Hardness\n', fontsize = 20)
plt.show()

In [26]: # basic scatter plot

hardness_sorted = data.sort_values('Hardness')  # sort whole rows so each point keeps its own Potability
fig = px.scatter(hardness_sorted, x=np.arange(len(hardness_sorted)), y='Hardness',
                 color='Potability',
                 labels={'x': "Count"},
                 template = 'plotly_dark')
fig.update_layout(title='Hardness wrt Potability')
fig.show()


In [27]:

#plot histogram
px.histogram(data_frame = data, x = 'Hardness', nbins = 10, color = 'Potability',
             template = 'plotly_dark')

Solids
In [28]: #check Solids describe
data['Solids'].describe()

Out[28]: count 3276.000000

mean 22014.092526
std 8768.570828
min 320.942611
25% 15666.690297
50% 20927.833607
75% 27332.762127
max 61227.196008
Name: Solids, dtype: float64

In [29]: #plot Solids

plt.figure(figsize = (16, 7))
sns.distplot(data['Solids'])
plt.title('Distribution Plot of Solids\n', fontsize = 20)
plt.show()


In [30]: #plot scatter values

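solids_sorted = data.sort_values("Solids")  # sort whole rows so each facet keeps the right samples
fig = px.scatter(solids_sorted, x="Solids", y=np.arange(len(solids_sorted)),
                 facet_row="Potability")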
fig.show()

In [31]: px.histogram(data_frame = data, x = 'Solids', nbins = 10, color = 'Potability',
             template = 'plotly_dark')

In [32]: # basic scatter plot

solids_sorted = data.sort_values('Solids')  # sort whole rows so each point keeps its own Potability
fig = px.scatter(solids_sorted, x=np.arange(len(solids_sorted)), y='Solids',
                 color='Potability',
                 labels={'x': "Count"},
                 color_continuous_scale=px.colors.sequential.tempo,
                 template = 'plotly_dark')
fig.update_layout(title='Solids wrt Potability')
fig.show()

Chloramines
In [33]: data['Chloramines'].describe()

Out[33]: count 3276.000000

mean 7.122277
std 1.583085
min 0.352000
25% 6.127421
50% 7.130299
75% 8.114887
max 13.127000
Name: Chloramines, dtype: float64


In [34]: plt.figure(figsize = (16, 7))

sns.distplot(data['Chloramines'])
plt.title('Distribution Plot of Chloramines\n', fontsize = 20)
plt.show()

In [35]: fig = px.line(x=range(data['Chloramines'].count()), y=sorted(data['Chloramines']), labels={
             'x': "Count",
             'y': "Chloramines",
             'color':'Potability'
         }, template = 'plotly_dark')
fig.update_layout(title='Chloramines wrt Potability')
fig.show()

In [36]: fig = px.box(x = 'Chloramines', data_frame = data, template = 'plotly_dark')

fig.update_layout(title='Chloramines')
fig.show()

Conductivity
In [37]: data["Conductivity"].describe()

Out[37]: count 3276.000000

mean 426.205111
std 80.824064
min 181.483754
25% 365.734414
50% 421.884968
75% 481.792304
max 753.342620
Name: Conductivity, dtype: float64


In [38]: plt.figure(figsize = (16, 7))

sns.distplot(data['Conductivity'])
plt.title('Distribution Plot of Conductivity\n', fontsize = 20)
plt.show()

In [39]: conductivity_sorted = data.sort_values('Conductivity')  # sort whole rows so each bar keeps its own Potability
fig = px.bar(conductivity_sorted, x=np.arange(len(conductivity_sorted)),
             y='Conductivity',
             color='Potability',
             labels={'x': "Count"},
             template = 'plotly_dark')
fig.update_layout(title='Conductivity wrt Potability')
fig.show()

In [40]:
group_labels = ['distplot'] # name of the dataset

fig = ff.create_distplot([data['Conductivity']], group_labels)

fig.show()

Organic_carbon


In [41]: data['Organic_carbon'].describe()

Out[41]: count 3276.000000

mean 14.284970
std 3.308162
min 2.200000
25% 12.065801
50% 14.218338
75% 16.557652
max 28.300000
Name: Organic_carbon, dtype: float64

In [42]:
group_labels = ['Organic_carbon'] # name of the dataset

fig = ff.create_distplot([data['Organic_carbon']], group_labels)

fig.show()

In [43]: dt_5=data[data['Organic_carbon']<5]
dt_5_10=data[(data['Organic_carbon']>=5)&(data['Organic_carbon']<10)]
dt_10_15=data[(data['Organic_carbon']>=10)&(data['Organic_carbon']<15)]
dt_15_20=data[(data['Organic_carbon']>=15)&(data['Organic_carbon']<20)]
dt_20_25=data[(data['Organic_carbon']>=20)&(data['Organic_carbon']<25)]
dt_25=data[(data['Organic_carbon']>=25)]

# one label per range, in the same order as the counts below
x_range = ['<5', '5-10', '10-15', '15-20', '20-25', '25+']
y_count = [len(dt_5), len(dt_5_10), len(dt_10_15), len(dt_15_20), len(dt_20_25), len(dt_25)]

px.bar(x = x_range, y = y_count, color = x_range, template = 'plotly_dark',
       title = 'Number of samples per Organic_carbon range')
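
The manual range filters can also be expressed with `pd.cut`, which assigns every value to exactly one bin. A minimal sketch with made-up values; the bin edges mirror the ranges used in the cell, and the label strings are an assumption:

```python
import numpy as np
import pandas as pd

# toy Organic_carbon values for illustration only
organic_carbon = pd.Series([2.2, 7.5, 12.0, 14.2, 16.5, 22.0, 28.3])

edges = [0, 5, 10, 15, 20, 25, np.inf]
names = ['<5', '5-10', '10-15', '15-20', '20-25', '25+']

# right=False makes each bin closed on the left: [0, 5), [5, 10), ...
binned = pd.cut(organic_carbon, bins=edges, labels=names, right=False)
counts = binned.value_counts(sort=False)
print(counts)
```

Because the edges are listed once, the labels and counts cannot drift out of step the way two hand-written lists can.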


In [44]: sns.catplot(x = 'Organic_carbon', y = 'Organic_carbon', hue = 'Potability', data = data,
            height = 5, aspect = 2)
plt.show()

Turbidity
In [45]: data['Turbidity'].describe()

Out[45]: count 3276.000000

mean 3.966786
std 0.780382
min 1.450000
25% 3.439711
50% 3.955028
75% 4.500320
max 6.739000
Name: Turbidity, dtype: float64

In [46]:
group_labels = ['Turbidity'] # name of the dataset

fig = ff.create_distplot([data['Turbidity']], group_labels)

fig.show()

In [47]: data['turbid_class']=data['Turbidity'].astype(int)

In [48]: data['turbid_class'].unique()

Out[48]: array([2, 4, 3, 5, 6, 1])
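
`astype(int)` drops the fractional part, so for these positive readings class k collects the turbidity values in [k, k+1). A small sketch with made-up readings shows that this matches flooring:

```python
import numpy as np
import pandas as pd

turbidity = pd.Series([1.45, 2.96, 3.0, 3.99, 6.739])  # toy readings, not the dataset

as_int = turbidity.astype(int)                 # truncates toward zero
floored = np.floor(turbidity).astype(int)      # identical for non-negative values
print(as_int.tolist())
```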


In [49]: px.scatter(data_frame = data, x = 'Turbidity', y = 'turbid_class')

In [50]: fig = px.pie(data,

values=data['turbid_class'].value_counts(),
names=data['turbid_class'].value_counts().keys(),
)
fig.update_layout(
title='turbid_class',
template = 'plotly_dark'
)
fig.show()

In [51]: data=data.drop(['turbid_class'],axis=1)

ph_random
In [52]: data['ph_random'].describe()

Out[52]: count 3276.000000

mean 7.071639
std 1.607991
min 0.000000
25% 6.081460
50% 7.029490
75% 8.063147
max 14.000000
Name: ph_random, dtype: float64

In [53]:
group_labels = ['ph_random'] # name of the dataset

fig = ff.create_distplot([data['ph_random']], group_labels)

fig.show()

In [54]: px.histogram(data_frame = data, x = 'ph_random', nbins = 10, color = 'Potability',
             template = 'plotly_dark')

In [55]: ph_sorted = data.sort_values("ph_random")  # sort whole rows so each facet keeps the right samples
fig = px.scatter(ph_sorted, x="ph_random", y=np.arange(len(ph_sorted)),
                 facet_row="Potability")
fig.show()

Sulfate_random


In [56]: data['Sulfate_random'].describe()

Out[56]: count 3276.000000

mean 333.430954
std 41.026947
min 129.000000
25% 307.523159
50% 332.879578
75% 359.710517
max 481.030642
Name: Sulfate_random, dtype: float64

In [57]: group_labels = ['distplot'] # name of the dataset

fig = ff.create_distplot([data['Sulfate_random']], group_labels)

fig.show()

In [58]: sns.catplot(x = 'Sulfate_random', y = 'Sulfate_random', hue = 'Potability', data = data,
            height = 5, aspect = 2)
plt.show()

Trihalomethanes_random


In [59]: data['Trihalomethanes_random'].describe()

Out[59]: count 3276.000000

mean 66.419200
std 16.184832
min 0.738000
25% 55.861675
50% 66.639068
75% 77.384166
max 124.000000
Name: Trihalomethanes_random, dtype: float64

In [60]:
group_labels = ['Trihalomethanes_random'] # name of the dataset

fig = ff.create_distplot([data['Trihalomethanes_random']], group_labels)

fig.show()

In [61]: fig = px.box(x = 'Trihalomethanes_random', data_frame = data, template = 'plotly_dark')

fig.update_layout(title='Trihalomethanes_random')
fig.show()

In [62]: fig = px.line(x=range(data['Trihalomethanes_random'].count()), y=sorted(data['Trihalomethanes_random']), labels={
             'x': "Count",
             'y': "Trihalomethanes",
             'color':'Potability'
         }, template = 'plotly_dark')
fig.update_layout(title='Trihalomethane wrt Potability')
fig.show()

Potability
In [63]: data['Potability'].describe()

Out[63]: count 3276.000000

mean 0.390110
std 0.487849
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 1.000000
Name: Potability, dtype: float64
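
The mean of a 0/1 label is the share of positive samples, so about 39% of the samples are potable. A classifier that always predicts the majority class sets a simple baseline for the accuracies reported in the modeling section. A short sketch of that arithmetic, using the mean from the output above:

```python
# Share of potable samples, taken from the mean reported by describe() above
potable_share = 0.390110

# Always predicting the majority class is right whenever a sample belongs to that class
majority_baseline = max(potable_share, 1 - potable_share)
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

This figure is computed on the full dataset; the class balance of a particular test split can differ slightly.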


In [64]: px.histogram(data_frame = data, x = 'Potability', color = 'Potability',
             template = 'plotly_dark')

In [65]: fig = px.pie(data,

values=data['Potability'].value_counts(),
names=data['Potability'].value_counts().keys(),
)
fig.update_layout(
title='Potability',
template = 'plotly_dark'
)
fig.show()

Data Preprocessing
In [66]: from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [67]: X=data.drop(['Potability'],axis=1)
y=data['Potability']

Because the features are on very different scales (Solids reaches tens of thousands while Turbidity stays below 7), we standardize them with StandardScaler.

In [68]: scaler = StandardScaler()

x=scaler.fit_transform(X)

In [69]: # split the data to train and test set

x_train,x_test,y_train,y_test = train_test_split(x,y,train_size=0.85)

print("training data shape:-{} labels{} ".format(x_train.shape,y_train.shape))

print("testing data shape:-{} labels{} ".format(x_test.shape,y_test.shape))

training data shape:-(2784, 9) labels(2784,)

testing data shape:-(492, 9) labels(492,)
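
The scaler in In [68] is fitted on all rows before the split, so statistics of the test rows influence the scaling. A common alternative is to split first and fit the scaler on the training part only. The sketch below illustrates that order on synthetic data; the array names, sizes and seed are assumptions, not the notebook's values:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# synthetic features on very different scales, plus synthetic 0/1 labels
X_demo = rng.normal(loc=[200.0, 22000.0], scale=[30.0, 8000.0], size=(400, 2))
y_demo = rng.integers(0, 2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.85, random_state=0)

scaler_demo = StandardScaler().fit(X_tr)    # statistics come from the training split only
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)   # test data reuses the training mean and std
print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

Passing `random_state` to `train_test_split` also makes the split, and therefore the reported accuracies, reproducible.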

Modeling

Logistic Regression


In [70]: from sklearn.linear_model import LogisticRegression

log = LogisticRegression(random_state=0).fit(x_train, y_train)
log.score(x_test, y_test)

Out[70]: 0.6219512195121951

In [71]: # Confusion matrix

from sklearn.metrics import confusion_matrix
# Make Predictions
pred1=log.predict(x_test)
plt.title("Confusion Matrix testing data")
# fmt='d' prints the counts as integers; no legend is needed for a heatmap
sns.heatmap(confusion_matrix(y_test,pred1),annot=True,fmt='d',cbar=False)
plt.show()
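
The entries of a 2x2 confusion matrix translate directly into accuracy, precision and recall. scikit-learn puts true classes in rows and predicted classes in columns, which gives the layout `[[TN, FP], [FN, TP]]`. A minimal sketch with a hypothetical helper `binary_metrics` and made-up counts (not the notebook's results):

```python
import numpy as np

def binary_metrics(cm):
    """Compute accuracy, precision and recall from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# made-up counts for illustration
example_cm = [[50, 10], [20, 20]]
accuracy, precision, recall = binary_metrics(example_cm)
print(accuracy, precision, recall)
```

With an imbalanced label such as Potability, recall on the positive class often says more about a model than accuracy alone.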

K Nearest Neighbours

In [72]: from sklearn.neighbors import KNeighborsClassifier

In [73]: knn = KNeighborsClassifier(n_neighbors=2)

# Train the model using the training sets
knn.fit(x_train,y_train)

#Predict Output
predicted= knn.predict(x_test) # 0: not potable, 1: potable


In [74]: # Confusion matrix

from sklearn.metrics import confusion_matrix
# Make Predictions
pred1=knn.predict(x_test)
plt.title("Confusion Matrix testing data")
# fmt='d' prints the counts as integers; no legend is needed for a heatmap
sns.heatmap(confusion_matrix(y_test,pred1),annot=True,fmt='d',cbar=False)
plt.show()
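
The KNN cell fixes `n_neighbors=2` and does not print an accuracy. One way to choose k is to compare cross-validated accuracy over a few candidates. The sketch below does this on synthetic two-class data; the data, the candidate values and the fold count are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# synthetic two-class data: class 1 is shifted by +1.5 in both features
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(150, 2)),
    rng.normal(1.5, 1.0, size=(150, 2)),
])
y_demo = np.array([0] * 150 + [1] * 150)

cv_scores = {}
for k in (1, 3, 5, 11, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy over 5 stratified folds
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```

On the notebook's data the same loop would run on `x_train` and `y_train`, keeping the test set untouched until the final check.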

SVM

In [75]: from sklearn import svm

from sklearn.metrics import accuracy_score

In [76]: svmc = svm.SVC()

svmc.fit(x_train, y_train)

y_pred = svmc.predict(x_test)
print(accuracy_score(y_test,y_pred))

0.6808943089430894


In [77]: # Confusion matrix

from sklearn.metrics import confusion_matrix
# Make Predictions
pred1=svmc.predict(x_test)
plt.title("Confusion Matrix testing data")
# fmt='d' prints the counts as integers; no legend is needed for a heatmap
sns.heatmap(confusion_matrix(y_test,pred1),annot=True,fmt='d',cbar=False)
plt.show()

Decision Tree

In [78]: from sklearn import tree

from sklearn.metrics import accuracy_score

In [79]: tre = tree.DecisionTreeClassifier()

tre = tre.fit(x_train, y_train)

y_pred = tre.predict(x_test)
print(accuracy_score(y_test,y_pred))

0.5487804878048781


In [80]: # Confusion matrix

from sklearn.metrics import confusion_matrix
# Make Predictions
pred1=tre.predict(x_test)
plt.title("Confusion Matrix testing data")
# fmt='d' prints the counts as integers; no legend is needed for a heatmap
sns.heatmap(confusion_matrix(y_test,pred1),annot=True,fmt='d',cbar=False)
plt.show()

Random Forest

In [81]: from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

In [82]: # create the model

model_rf = RandomForestClassifier(n_estimators=500, oob_score=True)

# fitting the model

model_rf=model_rf.fit(x_train, y_train)

y_pred = model_rf.predict(x_test)
print(accuracy_score(y_test,y_pred))

0.6788617886178862
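
With `oob_score=True`, a fitted forest exposes `oob_score_`, an accuracy estimate computed from the samples each tree did not see in its bootstrap draw. A minimal sketch on synthetic data (the data, tree count and seed are assumptions, not the notebook's settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)  # label depends on the first two features

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)

oob_accuracy = forest.oob_score_
print(f"out-of-bag accuracy: {oob_accuracy:.3f}")
```

For the notebook's model, `model_rf.oob_score_` would give a second estimate to compare with the test accuracy without using the test set.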


In [83]: # Confusion matrix

from sklearn.metrics import confusion_matrix
# Make Predictions
pred1=model_rf.predict(x_test)
plt.title("Confusion Matrix testing data")
# fmt='d' prints the counts as integers; no legend is needed for a heatmap
sns.heatmap(confusion_matrix(y_test,pred1),annot=True,fmt='d',cbar=False)
plt.show()

XG Boost


In [84]: from xgboost import XGBClassifier

xgb = XGBClassifier(colsample_bylevel= 0.9,
                    colsample_bytree = 0.8,
                    gamma=0.99,
                    max_depth= 5,
                    min_child_weight= 1,
                    n_estimators= 8,
                    nthread= 5,
                    random_state= 0,
                    )
xgb.fit(x_train,y_train)

[14:31:40] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

Out[84]: XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.9,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0.99, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=8, n_jobs=5, nthread=5, num_parallel_tree=1,
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)

In [85]: print('Accuracy of XGBoost classifier on training set: {:.2f}'

.format(xgb.score(x_train, y_train)))
print('Accuracy of XGBoost classifier on test set: {:.2f}'
.format(xgb.score(x_test, y_test)))

Accuracy of XGBoost classifier on training set: 0.72

Accuracy of XGBoost classifier on test set: 0.63


In [86]: from sklearn.metrics import confusion_matrix

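# predict with the XGBoost model; y_pred otherwise still holds the random forest output
y_pred = xgb.predict(x_test)
conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)

plt.figure(figsize = (15, 8))
sns.set(font_scale=1.4) # for label size
sns.heatmap(conf_matrix, annot=True, fmt='d', annot_kws={"size": 16},cbar=False)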
plt.title("Test Confusion Matrix")
plt.xlabel("Predicted class")
plt.ylabel("Actual class")
plt.savefig('conf_test.png')
plt.show()

SVM tuned


In [87]: from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV
svc=SVC()
param_grid={'C':[1.2,1.5,2.2,3.5,3.2,4.1],'kernel':['linear', 'poly', 'rbf', 'sigmoid'],
            'degree':[1,2,4,8,10],'gamma':['scale','auto']}
gridsearch=GridSearchCV(svc,param_grid=param_grid,n_jobs=-1,verbose=4,cv=3)
gridsearch.fit(x_train,y_train)

Fitting 3 folds for each of 240 candidates, totalling 720 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 2.9s
[Parallel(n_jobs=-1)]: Done 90 tasks | elapsed: 7.5s
[Parallel(n_jobs=-1)]: Done 213 tasks | elapsed: 15.3s
[Parallel(n_jobs=-1)]: Done 384 tasks | elapsed: 26.8s
[Parallel(n_jobs=-1)]: Done 605 tasks | elapsed: 42.5s
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 51.0s finished

Out[87]: GridSearchCV(cv=3, estimator=SVC(), n_jobs=-1,
             param_grid={'C': [1.2, 1.5, 2.2, 3.5, 3.2, 4.1],
                         'degree': [1, 2, 4, 8, 10], 'gamma': ['scale', 'auto'],
                         'kernel': ['linear', 'poly', 'rbf', 'sigmoid']},
             verbose=4)


In [88]: y_pred=gridsearch.predict(x_test)
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
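
The 240 candidates reported by the grid search come from crossing every `C`, `kernel`, `degree` and `gamma` value, although `degree` only affects the `poly` kernel and `gamma` is ignored by `linear`. The sketch below uses scikit-learn's `ParameterGrid` to count candidates for the full grid and for a grid split per kernel; the split is a suggestion for illustration, not the setup used in the notebook:

```python
from sklearn.model_selection import ParameterGrid

C_values = [1.2, 1.5, 2.2, 3.5, 3.2, 4.1]

# grid as shown in the GridSearchCV output: every combination is tried
full_grid = {
    'C': C_values,
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': [1, 2, 4, 8, 10],
    'gamma': ['scale', 'auto'],
}

# one sub-grid per kernel family, listing only the parameters that kernel uses
split_grid = [
    {'C': C_values, 'kernel': ['linear']},
    {'C': C_values, 'kernel': ['poly'], 'degree': [1, 2, 4, 8, 10], 'gamma': ['scale', 'auto']},
    {'C': C_values, 'kernel': ['rbf', 'sigmoid'], 'gamma': ['scale', 'auto']},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```

`GridSearchCV` accepts a list of dictionaries as `param_grid`, so passing `split_grid` would fit fewer candidates while covering the same distinct models. After fitting, `gridsearch.best_params_` and `gridsearch.best_score_` report the selected configuration and its cross-validated accuracy.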
