27 ShivangiSrivastava ML Lab
ETCS - 454
EXPERIMENT - 1
AIM: Study and implement the Naive Bayes learner using WEKA (Breast cancer data file)
THEORY:
Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence
among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features,
the classifier treats each of them as contributing independently to the probability that this fruit
is an apple, which is why it is known as ‘Naive’.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Despite
its simplicity, Naive Bayes often performs competitively with more sophisticated classification
methods.
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and
P(x|c):
P(c|x) = P(x|c) * P(c) / P(x)
In this equation:
● P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
● P(c) is the prior probability of class.
● P(x|c) is the likelihood which is the probability of predictor given class.
● P(x) is the prior probability of predictor.
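As a small numerical check of the quantities listed above, the following sketch computes the posterior for two classes under the naive independence assumption. All probabilities are hypothetical values chosen only for illustration; they do not come from the breast cancer data.

```python
# Hypothetical numbers, chosen only to illustrate the calculation.
priors = {"apple": 0.3, "other": 0.7}            # P(c)
# P(feature | c) for each feature, treated as independent given the class
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9},
    "other": {"red": 0.2, "round": 0.5},
}


def posterior(priors, likelihoods, observed):
    """Return P(c | x) for every class c under the naive independence assumption."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for feature in observed:
            p *= likelihoods[c][feature]   # P(x | c) is a product of per-feature terms
        joint[c] = p                        # P(x | c) * P(c)
    evidence = sum(joint.values())          # P(x), summed over all classes
    return {c: p / evidence for c, p in joint.items()}


post = posterior(priors, likelihoods, ["red", "round"])
print(post)
```

With these made-up values the posterior for "apple" is 0.216 / 0.286, roughly 0.755, so the classifier would predict "apple".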
First, we use the data mining tool WEKA to train the classifier and evaluate its predictions. Here,
we use 10-fold cross-validation on the training data to estimate the performance of the learned
model. The results are as follows:
Relation: breast
Instances: 683
Attributes: 10
a b <-- classified as
425 19 | a=2
5 234 | b=4
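The overall accuracy implied by the confusion matrix above can be worked out directly. The sketch below only re-uses the four counts reported by WEKA; the variable names are illustrative.

```python
# Confusion matrix from the WEKA run: rows = actual class, columns = predicted class
confusion = [
    [425, 19],   # actual class 2
    [5, 234],    # actual class 4
]

total = sum(sum(row) for row in confusion)                       # all classified instances
correct = sum(confusion[i][i] for i in range(len(confusion)))    # diagonal = correct predictions
accuracy = correct / total

print(f"{correct} of {total} instances correct, accuracy = {accuracy:.4f}")
# prints: 659 of 683 instances correct, accuracy = 0.9649
```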
EXPERIMENT - 2
AIM: Estimate the accuracy of the decision tree classifier on the breast cancer dataset using 5-fold
cross-validation. (You need to choose the appropriate options for missing values.)
CODE:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# X (feature matrix) and Y (class labels) are assumed to be already loaded from the dataset
# Split the dataset into the training set and the test set (75% / 25%)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
# Feature scaling: fit the scaler on the training set only, then apply it to both sets
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
OUTPUT:
Confusion matrix: [[86 4]
[ 2 51]]
Mean of accuracies: 0.9200820793433653
Standard deviation of accuracies: 0.03203202972210602
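The listing above stops before the classifier itself. A minimal sketch of the 5-fold cross-validation step is given below, assuming scikit-learn's bundled Wisconsin diagnostic dataset (load_breast_cancer) as a stand-in for the lab's data file; the numbers it prints are therefore not expected to match the output above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the data stands in for the lab's file.
# This copy has no missing values, so no imputation step is needed here.
X, y = load_breast_cancer(return_X_y=True)

clf = DecisionTreeClassifier(random_state=0)

# cross_val_score trains and tests the tree once per fold and returns one accuracy per fold
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean of accuracies:", scores.mean())
print("Standard deviation of accuracies:", scores.std())
```

Decision trees split on thresholds of individual features, so feature scaling is not required for this step, although it does no harm.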
EXPERIMENT - 3
AIM: Estimate the precision, recall, accuracy, and F-measure of the decision tree classifier on
the text classification task for each of the 10 categories using 10-fold cross-validation.
INTRODUCTION:
Text classification is one of the key techniques in text mining to categorize the documents in a
supervised manner. The processing of text classification involves two main problems: the
extraction of feature terms that become effective keywords in the training phase and then the
actual classification of the document using these feature terms in the test phase. This text
classification task has numerous applications such as automated indexing of scientific articles
according to predefined thesauri of technical terms, routing of customer email in a customer
service department, filing patents into patent directories, automated population of hierarchical
catalogues of Web resources, selective dissemination of information to consumers, identification
of document genre, or detection and identification of criminal activities for military, police, or
secret service environments. Text classification can also be used for document filtering
and routing to topic specific processing mechanisms such as information extraction and machine
translation.
TP = true positives: number of examples predicted positive that are actually positive
FP = false positives: number of examples predicted positive that are actually negative
TN = true negatives: number of examples predicted negative that are actually negative
FN = false negatives: number of examples predicted negative that are actually positive
The Recall is the proportion of examples which were classified as class x, among all examples
which truly have class x: Recall = TP/(TP+FN).
The Precision is the proportion of the examples which truly have class x among all those which
were classified as class x: Precision = TP/(TP+FP).
The F-Measure is simply 2*Precision*Recall/(Precision+Recall), a combined measure for precision
and recall.
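The definitions above translate directly into a few small functions. This is a sketch with hypothetical counts; returning 0.0 when a denominator is zero is a convention assumed here, not something the report specifies.

```python
def precision(tp, fp):
    """TP / (TP + FP); taken as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); taken as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall: 2*P*R / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp, fp, tn, fn):
    """Fraction of all examples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts, for illustration only
tp, fp, tn, fn = 40, 10, 45, 5
p = precision(tp, fp)
r = recall(tp, fn)
f = f_measure(p, r)
acc = accuracy(tp, fp, tn, fn)
print(p, r, f, acc)
```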
Data file:
@relation textclass1
@data
politics ministers,
performance, sports
election, politics
poll,performance,
politics
ball,performance, sports
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: textclass1
Instances: 13
Attributes: text1, text2, news
Test mode: 10-fold cross-validation
------------------
Number of Leaves
a b <-- classified as
4 3 | a = politics
6 0 | b = sports
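The per-category figures asked for in the aim can be derived from the confusion matrix reported above. The sketch below uses NumPy and only the four counts from the WEKA output; treating an undefined ratio (zero denominator) as 0 is an assumption made for this example.

```python
import numpy as np

labels = ["politics", "sports"]
# Rows = actual class, columns = predicted class (counts from the WEKA output)
cm = np.array([[4, 3],
               [6, 0]])

tp = np.diag(cm).astype(float)      # correctly classified examples per class
fp = cm.sum(axis=0) - tp            # predicted as the class but actually another class
fn = cm.sum(axis=1) - tp            # actually the class but predicted as another class


def safe_divide(num, den):
    """Element-wise num / den, leaving 0.0 wherever the denominator is zero."""
    out = np.zeros_like(num, dtype=float)
    np.divide(num, den, out=out, where=den != 0)
    return out


precision = safe_divide(tp, tp + fp)
recall = safe_divide(tp, tp + fn)
f_measure = safe_divide(2 * precision * recall, precision + recall)
accuracy = tp.sum() / cm.sum()

for name, p, r, f in zip(labels, precision, recall, f_measure):
    print(f"{name}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
print(f"accuracy={accuracy:.3f}")
```

For these counts the politics class has precision 4/10 and recall 4/7, the sports class has no correct predictions at all, and the overall accuracy is 4/13.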
EXPERIMENT - 4
CODE:
import re
import string
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from sklearn import preprocessing
nltk.download('stopwords')
df = pd.read_csv('../input/email-classification-nlp/SMS_train.csv', encoding='unicode_escape')
def process_mail(mail):
    """Process mail function.
    Input:
        mail: a string containing message body
    Output:
        mail_clean: a list of words containing the processed body
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # tokenize the message: lowercase it, strip @handles, shorten repeated characters
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    mail_tokens = tokenizer.tokenize(mail)
    mail_clean = []
    for word in mail_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)  # stemming word
            mail_clean.append(stem_word)
    return mail_clean
OUTPUT:
Bag of words: Message_body Label
0 [rofl, true, name] 0
1 [guy, act, like, i'd, interest, buy, so... 0
2 [piti, mood, ..., suggest] 0
3 [ü, b, go, esplanad, fr, home] 0
4 [2nd, time, tri, 2, contact, u, u, £, 750, pou... 1
Confusion matrix: [[162 0]
[ 13 17]]
Accuracy is 0.9322916666666666
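The "Bag of words" output above shows each message as a list of stemmed tokens. One way such token lists can be turned into count vectors for a classifier is sketched below using only the standard library; the three token lists are hypothetical and the function name to_count_vector is illustrative, not taken from the lab code.

```python
from collections import Counter

# Token lists in the shape returned by process_mail (hypothetical messages)
docs = [
    ["free", "prize", "call", "now"],
    ["call", "home", "later"],
    ["free", "free", "entri", "prize"],
]

# Vocabulary: every distinct token, sorted so that column positions are stable
vocab = sorted({tok for doc in docs for tok in doc})


def to_count_vector(tokens):
    """Return how often each vocabulary word occurs in one message."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]


matrix = [to_count_vector(doc) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row of the matrix has one column per vocabulary word, which is the input format that count-based classifiers such as multinomial Naive Bayes expect.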
EXPERIMENT - 5
AIM: Develop a machine learning method to predict stock prices based on past price variation.
CODE:
import numpy as np
import pandas as pd
import math
import sklearn
import sklearn.preprocessing
import datetime
import os
import matplotlib.pyplot as plt
import tensorflow as tf
df.describe()
df.info()
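# function to create train, validation, test data given stock data and sequence length
def load_data(stock, seq_len):
    data_raw = stock.to_numpy()  # convert to numpy array (as_matrix() was removed in pandas 1.0)
    data = []
    # collect every window of seq_len consecutive rows
    # (this step is missing from the listing and is assumed here)
    for index in range(len(data_raw) - seq_len):
        data.append(data_raw[index: index + seq_len])
    data = np.array(data)
    valid_set_size = int(np.round(valid_set_size_percentage/100*data.shape[0]))
    test_set_size = int(np.round(test_set_size_percentage/100*data.shape[0]))
    train_set_size = data.shape[0] - (valid_set_size + test_set_size)
    # the first seq_len-1 steps of each window are the input, the last step is the target
    x_train = data[:train_set_size,:-1,:]
    y_train = data[:train_set_size,-1,:]
    x_valid = data[train_set_size:train_set_size+valid_set_size,:-1,:]
    y_valid = data[train_set_size:train_set_size+valid_set_size,-1,:]
    x_test = data[train_set_size+valid_set_size:,:-1,:]
    y_test = data[train_set_size+valid_set_size:,-1,:]
    return [x_train, y_train, x_valid, y_valid, x_test, y_test]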
cols = list(df_stock.columns.values)
print('df_stock.columns.values = ', cols)
# normalize stock
df_stock_norm = df_stock.copy()
df_stock_norm = normalize_data(df_stock_norm)
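# shuffled index order used to draw training batches
index_in_epoch = 0
perm_array = np.arange(x_train.shape[0])
np.random.shuffle(perm_array)
# function to get the next batch
# (the body is reconstructed around the surviving lines and is an assumption)
def get_next_batch(batch_size):
    global index_in_epoch, perm_array
    start = index_in_epoch
    index_in_epoch += batch_size
    if index_in_epoch > x_train.shape[0]:
        np.random.shuffle(perm_array)  # reshuffle at the start of each new epoch
        start = 0
        index_in_epoch = batch_size
    end = index_in_epoch
    return x_train[perm_array[start:end]], y_train[perm_array[start:end]]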
# parameters
n_steps = seq_len-1
n_inputs = 4
n_neurons = 200
n_outputs = 4
n_layers = 2
learning_rate = 0.001
batch_size = 50
n_epochs = 100
train_set_size = x_train.shape[0]
test_set_size = x_test.shape[0]
tf.reset_default_graph()
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
# run graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for iteration in range(int(n_epochs*train_set_size/batch_size)):
        x_batch, y_batch = get_next_batch(batch_size) # fetch the next training batch
        sess.run(training_op, feed_dict={X: x_batch, y: y_batch})
        if iteration % int(5*train_set_size/batch_size) == 0:
            mse_train = loss.eval(feed_dict={X: x_train, y: y_train})
            mse_valid = loss.eval(feed_dict={X: x_valid, y: y_valid})
            print('%.2f epochs: MSE train/valid = %.6f/%.6f'%(
                iteration*batch_size/train_set_size, mse_train, mse_valid))
y_train.shape
corr_price_development_train = np.sum(np.equal(np.sign(y_train[:,1]-y_train[:,0]),
np.sign(y_train_pred[:,1]-y_train_pred[:,0])).astype(int)) / y_train.shape[0]
corr_price_development_valid = np.sum(np.equal(np.sign(y_valid[:,1]-y_valid[:,0]),
np.sign(y_valid_pred[:,1]-y_valid_pred[:,0])).astype(int)) / y_valid.shape[0]
corr_price_development_test = np.sum(np.equal(np.sign(y_test[:,1]-y_test[:,0]),
np.sign(y_test_pred[:,1]-y_test_pred[:,0])).astype(int)) / y_test.shape[0]
print('correct sign prediction for close - open price for train/valid/test: %.2f/%.2f/%.2f'%(
corr_price_development_train, corr_price_development_valid, corr_price_development_test))
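The sign-agreement measure computed in the last lines of the listing can be isolated into a small helper. This is a sketch assuming, as in the listing, that column 0 holds the open price and column 1 the close price; the function name direction_accuracy and the arrays are hypothetical.

```python
import numpy as np


def direction_accuracy(y_true, y_pred):
    """Fraction of rows where sign(close - open) agrees between truth and prediction.

    Assumes column 0 is the open price and column 1 is the close price.
    """
    true_move = np.sign(y_true[:, 1] - y_true[:, 0])
    pred_move = np.sign(y_pred[:, 1] - y_pred[:, 0])
    return float(np.mean(true_move == pred_move))


# Hypothetical normalised [open, close] pairs, for illustration only
y_true = np.array([[0.50, 0.55], [0.60, 0.58], [0.40, 0.45], [0.70, 0.65]])
y_pred = np.array([[0.51, 0.54], [0.59, 0.60], [0.41, 0.44], [0.69, 0.66]])

score = direction_accuracy(y_true, y_pred)
print(score)  # 0.75: the predicted direction is wrong only for the second row
```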
OUTPUT:
Index: 851264 entries, 2016-01-05 to 2016-12-30
Data columns (total 6 columns):
symbol 851264 non-null object
open 851264 non-null float64
close 851264 non-null float64
low 851264 non-null float64
high 851264 non-null float64
volume 851264 non-null float64
dtypes: float64(5), object(1)
memory usage: 45.5+ MB
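The windowing and splitting performed by load_data can be tried on a toy array to see the resulting shapes. The sketch below is self-contained; the 10% validation and test shares and the function names are assumptions for illustration, since the listing does not show the values used.

```python
import numpy as np


def make_windows(series, seq_len):
    """Stack every run of seq_len consecutive rows: shape (n_windows, seq_len, n_features)."""
    return np.array([series[i:i + seq_len] for i in range(len(series) - seq_len)])


def split_windows(windows, valid_pct=10, test_pct=10):
    """Chronological train/validation/test split (percentages are assumed values)."""
    n = windows.shape[0]
    n_valid = int(np.round(valid_pct / 100 * n))
    n_test = int(np.round(test_pct / 100 * n))
    n_train = n - (n_valid + n_test)
    # The first seq_len-1 steps of each window are the input; the last step is the target
    x, y = windows[:, :-1, :], windows[:, -1, :]
    return ((x[:n_train], y[:n_train]),
            (x[n_train:n_train + n_valid], y[n_train:n_train + n_valid]),
            (x[n_train + n_valid:], y[n_train + n_valid:]))


# Toy data: 30 "days" with 4 features each (stand-in for open, close, low, high)
prices = np.arange(120, dtype=float).reshape(30, 4)
windows = make_windows(prices, seq_len=5)
(train_x, train_y), (valid_x, valid_y), (test_x, test_y) = split_windows(windows)

print(windows.shape, train_x.shape, train_y.shape, valid_x.shape, test_x.shape)
```

The split is chronological rather than shuffled, so the validation and test windows always come after the training windows in time.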
EXPERIMENT - 6
AIM: Develop a machine learning method to predict how people would rate movies, books, etc.
CODE:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df.Age.plot.hist(bins=25)
plt.title("Distribution of users' ages")
plt.ylabel('count of users')
plt.xlabel('Age')
groupedby_movieName = df.groupby('MovieName')
groupedby_rating = df.groupby('Ratings')
groupedby_uid = df.groupby('UserID')
movies = df.groupby('MovieName').size().sort_values(ascending=True)[:1000]
ToyStory_data = groupedby_movieName.get_group('Toy Story 2 (1999)')
ToyStory_data.shape
#Find and visualize the user rating of the movie “Toy Story”
plt.figure(figsize=(10,10))
plt.scatter(ToyStory_data['MovieName'],ToyStory_data['Ratings'])
plt.title('Plot showing the user rating of the movie “Toy Story”')
plt.show()
#Find and visualize the viewership of the movie “Toy Story” by age group
ToyStory_data[['MovieName','age_group']]
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(train, train_labels)
Y_pred = decision_tree.predict(test)
# score() is evaluated on the training data here, so this is the training accuracy
acc_decision_tree = round(decision_tree.score(train, train_labels) * 100, 2)
print("The accuracy of Decision tree algorithm is ",acc_decision_tree)
OUTPUT:
The accuracy of Decision tree algorithm is 98.54
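Because the score above is computed on the same data the tree was fitted on, it is worth comparing it with accuracy on held-out data. The sketch below illustrates the difference on synthetic data from scikit-learn's make_classification; the data and variable names are assumptions and do not reproduce the ratings dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the lab's ratings data is not reproduced here)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)   # accuracy on data the tree has already seen
test_acc = tree.score(X_test, y_test)      # accuracy on held-out data

print("Training accuracy:", round(train_acc * 100, 2))
print("Test accuracy:", round(test_acc * 100, 2))
```

An unpruned decision tree can usually memorise its training set, so the held-out figure is the more informative estimate of how well ratings would be predicted for new users.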
EXPERIMENT - 7
AIM: Develop a machine learning method to cluster gene expression data, and study how to modify
existing methods to solve the problem better.
CODE:
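import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from collections import Counter
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix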
#Load dataset
Train_Data = pd.read_csv("gene-expression/data_set_ALL_AML_train.csv")
Test_Data = pd.read_csv("gene-expression/data_set_ALL_AML_independent.csv")
labels = pd.read_csv("gene-expression/actual.csv", index_col = 'patient')
Train_Data.head()
df_all["patient"] = pd.to_numeric(patients)
labels["cancer"]= pd.get_dummies(labels.cancer, drop_first=True)
Data.head()
Data['cancer'].value_counts()
plt.figure(figsize=(4,8))
colors = ["AML", "ALL"]
sns.countplot(x='cancer', data=Data, palette = "Set1")
plt.title('Class Distributions \n (0: AML || 1: ALL)', fontsize=14)
print(X)
print(y)
#feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
X_train.shape
pca = PCA()
pca.fit_transform(X_train)
total = sum(pca.explained_variance_)
k = 0
current_variance = 0
# add components until they explain at least 90% of the total variance
while current_variance/total < 0.90:
    current_variance += pca.explained_variance_[k]
    k = k + 1
print(k, " features explain around 90% of the variance. From 7129 features to ", k, ", not too bad.", sep='')
pca = PCA(n_components=k)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
var_exp = pca.explained_variance_ratio_.cumsum()
var_exp = var_exp*100
plt.bar(range(k), var_exp,color = 'r')
pca.n_components_
pca3 = PCA(n_components=3).fit(X_train)
X_train_reduced = pca3.transform(X_train)
plt.clf()
fig = plt.figure(1, figsize=(10,6))
# request a 3D subplot; recent Matplotlib versions no longer attach Axes3D(fig) to the figure
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=-150, azim=110)
ax.scatter(X_train_reduced[:, 0], X_train_reduced[:, 1], X_train_reduced[:, 2], c = y_train,
           cmap='coolwarm', linewidths=10)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])
print("Before Upsampling:-")
print(Counter(y_train))
print("After Upsampling:-")
print(Counter(y_train_ov))
# do a grid search
svc_params = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
0.9]}]
prediction=svc_model.predict(X_test_pca)
acc_svc = accuracy_score(prediction,y_test)
print('The accuracy of SVM is', acc_svc)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))
#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True, cmap='Greens', fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)
knn_param = {
"n_neighbors": [i for i in range(1,30,5)],
"weights": ["uniform", "distance"],
"algorithm": ["ball_tree", "kd_tree", "brute"],
"leaf_size": [1, 10, 30],
"p": [1,2]
}
search = GridSearchCV(KNeighborsClassifier(), knn_param, n_jobs=-1, verbose=1)
search.fit(X_train_ov, y_train_ov)
knn_model.fit(X_train_ov,y_train_ov)
prediction=knn_model.predict(X_test_pca)
acc_knn = accuracy_score(prediction,y_test)
print('The accuracy of K-NN is', acc_knn)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))
#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True, cmap='Greens', fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)
log_model = GridSearchCV(estimator=LogisticRegression(solver='liblinear'),
param_grid=log_grid,
cv=3,
scoring='accuracy')
log_model.fit(X_train_ov, y_train_ov)
#Logistic Regression
lr_model = LogisticRegression(C=0.001, solver='liblinear')
lr_model.fit(X_train_ov,y_train_ov)
prediction=lr_model.predict(X_test_pca)
acc_log = accuracy_score(prediction,y_test)
print('Validation accuracy of Logistic Regression is', acc_log)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))
#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True,cmap="Greens",fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)
decision_search.fit(X_train_ov, y_train_ov)
ds_model.fit(X_train_ov,y_train_ov)
prediction=ds_model.predict(X_test_pca)
acc_decision_tree = accuracy_score(prediction,y_test)
print('Validation accuracy of Decision Tree is', acc_decision_tree)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))
#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True,cmap="Greens",fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)
#Random forest
rf_model = RandomForestClassifier(bootstrap=False, max_features=0.6, min_samples_leaf=8,
min_samples_split=3, n_estimators=70)
rf_model.fit(X_train_ov,y_train_ov)
prediction=rf_model.predict(X_test_pca)
acc_random_forest = accuracy_score(prediction,y_test)
print('Validation accuracy of RandomForest Classifier is', acc_random_forest)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))
#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True,cmap="Greens",fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)
#XGBoost
xgb_model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.001, max_delta_step=0, max_depth=3,
min_child_weight=1, monotone_constraints='()',
n_estimators=40, n_jobs=0, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
xgb_model.fit(X_train_ov,y_train_ov)
prediction=xgb_model.predict(X_test_pca)
acc_xgb = accuracy_score(prediction,y_test)
print('Validation accuracy of XG Boost is', acc_xgb)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))
#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True,cmap="Greens",fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)
nb_model.fit(X_train_ov,y_train_ov)
prediction=nb_model.predict(X_test_pca)
acc_nb = accuracy_score(prediction,y_test)
print('Validation accuracy of Naive Bayes is', acc_nb)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))
#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True,cmap="Greens",fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)
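models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 'Decision Tree',
              'Random Forest', 'XG Boost', 'Naive Bayes'],
    # scores listed in the same order as the model names
    'Score': [acc_svc, acc_knn, acc_log, acc_decision_tree,
              acc_random_forest, acc_xgb, acc_nb]})
models.sort_values(by='Score', ascending=False)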
OUTPUT:
Model Score
6 Naive Bayes 0.944444
2 Logistic Regression 0.833333
1 KNN 0.722222
3 Decision Tree 0.722222
4 Random Forest 0.722222
5 XG Boost 0.722222
0 Support Vector Machines 0.666667
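The component-selection loop in the listing (keep adding principal components until 90% of the variance is explained) can also be written with a cumulative sum. The sketch below uses NumPy only and hypothetical random data; the function name components_for_variance is illustrative and not part of the lab code.

```python
import numpy as np


def components_for_variance(X, threshold=0.90):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches the threshold."""
    Xc = X - X.mean(axis=0)                                  # centre each feature
    singular_values = np.linalg.svd(Xc, compute_uv=False)    # sorted, largest first
    explained = singular_values ** 2 / (X.shape[0] - 1)      # variance per component
    cumulative = np.cumsum(explained) / explained.sum()
    k = int(np.searchsorted(cumulative, threshold) + 1)      # first index reaching threshold
    return k, cumulative


rng = np.random.default_rng(0)
# Hypothetical data: 40 samples, 50 features driven by 5 latent factors plus small noise
latent = rng.normal(size=(40, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.05 * rng.normal(size=(40, 50))

k, cumulative = components_for_variance(X)
print(k, "components explain at least 90% of the variance")
```

scikit-learn's PCA can perform the same selection when n_components is given as a fraction, for example PCA(n_components=0.90).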
EXPERIMENT - 8
AIM: Select 2 datasets. Each dataset should contain examples from multiple classes. For training
purposes, assume that the class label of each example is unknown (if it is known, ignore it).
Implement the K-means algorithm and apply it to the data you selected. Evaluate performance by
measuring the sum of Euclidean distance of each example from its class center. Test the
performance of the algorithm as a function of the parameter k.
CODE:
Dataset used: Mall_Customers.csv
OUTPUT:
Dataset used: Iris.csv
#Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()
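Since the K-means implementation itself is not shown, a minimal NumPy sketch of the algorithm follows. It evaluates the measure named in the aim (the sum of Euclidean distances of each example to its cluster center) rather than the squared-distance WCSS plotted above, and it runs on hypothetical synthetic blobs instead of Mall_Customers.csv or Iris.csv. The function name, seeds and data are all assumptions.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns centers, labels and the sum of
    Euclidean distances of every point to its assigned center."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every center: shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    cost = dists[np.arange(len(X)), labels].sum()
    return centers, labels, cost


# Hypothetical data: three well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in blob_centers])

# Performance as a function of k
costs = {k: kmeans(X, k)[2] for k in range(1, 7)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```

Plotting costs against k gives the same kind of elbow curve as the WCSS plot: the total distance drops sharply until k reaches the number of natural groups in the data and flattens afterwards.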