
MACHINE LEARNING LAB

ETCS - 454

Submitted to:
Dr Koyel Datta Gupta

Submitted by:
Name: Shivangi Srivastava
Serial No: 27
Enrollment No: 05815002717
Class: CSE - 2

Maharaja Surajmal Institute of Technology (Affiliated to G.G.S.I.P.U.)


June 2021
EXPERIMENT - 1

AIM: Study and implement the Naive Bayes learner using WEKA (Breast cancer data file)

THEORY:
Naive Bayes is a classification technique based on Bayes' theorem with an assumption of
independence among predictors: a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features,
all of these properties independently contribute to the probability that this fruit is an apple and
that is why it is known as ‘Naive’.

The Naive Bayes model is easy to build and particularly useful for very large data sets. Despite
its simplicity, Naive Bayes can perform surprisingly well compared with far more sophisticated
classification methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and
P(x|c):

P(c|x) = P(x|c) * P(c) / P(x)

Above,
● P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
● P(c) is the prior probability of class.
● P(x|c) is the likelihood which is the probability of predictor given class.
● P(x) is the prior probability of predictor.
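
For concreteness, the same experiment can be reproduced outside WEKA. Below is a minimal
sketch using scikit-learn's GaussianNB with 10-fold cross-validation; the file name and the
'Class' column label are assumptions about how the breast cancer data is stored, and WEKA's
NaiveBayes may give slightly different numbers because of its density-estimation defaults.

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# hypothetical CSV export of the same breast cancer ARFF file
df = pd.read_csv('breast-cancer-wisconsin.csv').dropna()
X, y = df.drop(columns=['Class']), df['Class']

# 10-fold cross-validation, mirroring the WEKA test mode
scores = cross_val_score(GaussianNB(), X, y, cv=10)
print('Mean accuracy: %.4f' % scores.mean())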

First we use the data mining tool WEKA to classify the training data. Here, we use 10-fold
cross-validation on the training data to learn the model and estimate its performance. The
results are as follows:

Relation: breast

Instances: 683

Attributes: 10

Test mode: 10-fold cross-validation

Time taken to build model: 0.08 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 659 96.4861%

Incorrectly Classified Instances 24 3.5139%

Kappa statistic 0.9238

K&B Relative Info Score 62650.9331 %

K&B Information Score 585.4063 bits 0.8571 bits/instance


Class complexity | order 0 637.9242 bits 0.934 bits/instance
Class complexity | scheme 1877.4218 bits 2.7488 bits/instance
Complexity improvement (Sf) -1239.4976 bits -1.8148 bits/instance
Mean absolute error 0.0362
Root mean squared error 0.1869

Relative absolute error 7.950%


Root relative squared error 39.192%

Total Number of Instances 683

=== Confusion Matrix ===

a b <-- classified as
425 19 | a=2

5 234 | b=4
EXPERIMENT - 2

AIM: Estimate the accuracy of the decision tree classifier on the breast cancer dataset using
5-fold cross-validation. (You need to choose the appropriate options for missing values.)

CODE:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('breast-cancer-wisconsin-data/data.csv')
Y = dataset.diagnosis
cols_to_drop = ['Unnamed: 32','id','diagnosis'] # empty column, identifier, and label
X = dataset.drop(cols_to_drop, axis=1)
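
# This Kaggle version of the dataset has no missing values once the empty
# 'Unnamed: 32' column is dropped, but the aim asks for appropriate missing-value
# options; a hedged sketch of mean imputation, in case NaNs were present:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean') # replace any NaN with the column mean
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)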

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Decision Tree Classification to the Training set


from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)
# Predicting the Test set results
Y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
print("Confusion matrix: ",cm)

# Applying 5-fold Cross Validation


from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = Y_train, cv = 5)
print("Mean of accuracies: ",accuracies.mean())
print("Standard deviation of accuracies: ",accuracies.std())

OUTPUT:
Confusion matrix: [[86 4]
[ 2 51]]
Mean of accuracies: 0.9200820793433653
Standard deviation of accuracies: 0.03203202972210602
EXPERIMENT - 3

AIM: Estimate the precision, recall, accuracy, and F-measure of the decision tree classifier on
the text classification task for each of the 10 categories using 10-fold cross-validation.

INTRODUCTION:

Text classification is one of the key techniques in text mining for categorizing documents in a
supervised manner. It involves two main problems: extracting the feature terms that become
effective keywords in the training phase, and then classifying the document using these feature
terms in the test phase. The task has numerous applications, such as automated indexing of
scientific articles according to predefined thesauri of technical terms, routing of customer
email in a customer service department, filing patents into patent directories, automated
population of hierarchical catalogues of Web resources, selective dissemination of information
to consumers, identification of document genre, and detection of criminal activity for military,
police, or secret service environments. Text classification can also be used for document
filtering and for routing documents to topic-specific processing mechanisms such as information
extraction and machine translation.

TP = true positives: number of examples predicted positive that are actually positive

FP = false positives: number of examples predicted positive that are actually negative

TN = true negatives: number of examples predicted negative that are actually negative

FN = false negatives: number of examples predicted negative that are actually positive

Recall is also referred to as the true positive rate or sensitivity. The True Positive (TP) rate
is the proportion of examples classified as class x among all examples that truly have class x,
i.e. how much of the class was captured.

Precision is the proportion of examples that truly have class x among all those classified as
class x.

The F-Measure is 2*Precision*Recall/(Precision+Recall), a combined measure of precision and
recall.

These measures are useful for comparing classifiers.
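
As an aside, the same per-class measures can be computed for this toy data with scikit-learn; a
hedged sketch in which DecisionTreeClassifier stands in for WEKA's J48, and plain 10-fold KFold
replaces stratified folds (stratification is impossible with so few instances per class):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import classification_report

# the 13 instances from the ARFF file below, as (text1, text2, news)
rows = [('ball','wicket','sports'), ('goal','ball','sports'),
        ('party','poll','politics'), ('poll','election','politics'),
        ('ministers','election','politics'), ('medals','performance','sports'),
        ('ball','party','sports'), ('goal','wicket','sports'),
        ('ministers','party','politics'), ('party','election','politics'),
        ('goal','election','politics'), ('poll','performance','politics'),
        ('ball','performance','sports')]
df = pd.DataFrame(rows, columns=['text1', 'text2', 'news'])
X = pd.get_dummies(df[['text1', 'text2']]) # one-hot encode the keywords
y = df['news']

pred = cross_val_predict(DecisionTreeClassifier(criterion='entropy', random_state=0),
                         X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(classification_report(y, pred)) # per-class precision, recall, F1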

DataFile

@relation textclass1

@attribute text1 {ball,goal,medals,party,poll,ministers}

@attribute text2 {wicket,ball,poll,election,performance,party}

@attribute news {politics, sports}

@data
ball, wicket, sports
goal, ball, sports
party, poll, politics
poll, election, politics
ministers, election, politics
medals, performance, sports
ball, party, sports
goal, wicket, sports
ministers, party, politics
party, election, politics
goal, election, politics
poll, performance, politics
ball, performance, sports

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: textclass1
Instances: 13
Attributes: 3
            text1
            text2
            news
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree

------------------

text1 = ball: sports (3.0)
text1 = goal: sports (3.0/1.0)
text1 = medals: sports (1.0)
text1 = party: politics (2.0)
text1 = poll: politics (2.0)
text1 = ministers: politics (2.0)

Number of Leaves : 6

Size of the tree : 7

Time taken to build model: 0 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 4 30.7692 %

Incorrectly Classified Instances 9 69.2308 %

Kappa statistic -0.4444

Mean absolute error 0.5192

Root mean squared error 0.6517

Relative absolute error 100.5319 %

Root relative squared error 125.8013 %

Total Number of Instances 13


=== Confusion Matrix ===

 a b <-- classified as
 4 3 | a = politics
 6 0 | b = sports
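
From this matrix the per-class measures follow directly: for politics, precision = 4/(4+6) =
0.40 and recall = 4/(4+3) ≈ 0.57, giving an F-measure of about 0.47; for sports, precision =
0/(0+3) = 0 and recall = 0/6 = 0, so its F-measure is 0. With only 13 instances, 10-fold
cross-validation leaves roughly one instance per test fold, which explains the unstable results.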
EXPERIMENT - 4

AIM: Develop a machine learning method to classify your incoming mails.

CODE:
import numpy as np
import pandas as pd
from sklearn import preprocessing
import nltk
nltk.download('stopwords')

import re
import string

from nltk.corpus import stopwords


from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

from sklearn.linear_model import LogisticRegression


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('../input/email-classification-nlp/SMS_train.csv', encoding='unicode_escape')

df = df.drop(['S. No.'], axis=1) # dropping unnecessary column


label_encoder = preprocessing.LabelEncoder() # label encoding for 'Label' column
df['Label'] = label_encoder.fit_transform(df['Label'])
df.isnull().any() # checking for null values if any

def process_mail(mail):
    """Process mail function.
    Input:
        mail: a string containing the message body
    Output:
        mail_clean: a list of words containing the processed body
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # tokenize the message
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    mail_tokens = tokenizer.tokenize(mail)
    mail_clean = []
    for word in mail_tokens:
        if (word not in stopwords_english and   # remove stopwords
                word not in string.punctuation): # remove punctuation
            stem_word = stemmer.stem(word)       # stem the word
            mail_clean.append(stem_word)
    return mail_clean
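
# A quick illustrative call (the message text is made up; exact tokens and
# stems depend on the NLTK version):
print(process_mail("WINNER!! You have won a free prize, call now!"))
# e.g. ['winner', 'won', 'free', 'prize', 'call'] -- stopwords and punctuation
# removed, remaining words stemmed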

# using the process_mail function for:


# 1. Removing stop words
# 2. Tokenization
# 3. Stemming
A = []
a = df['Message_body']
for i in a:
    i = process_mail(i)
    A.append(i)
df['Message_body'] = A
print("Bag of words: ",df.head())
cv = CountVectorizer(max_features=1500, analyzer='word', lowercase=False)

df['Message_body'] = df['Message_body'].apply(lambda x: " ".join(x) )


X = cv.fit_transform(df['Message_body'])
y = pd.DataFrame(df['Label'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("ROC AUC score: ", roc_auc_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix: ",cm)
print("Accuracy is ",(cm[0][0]+cm[1][1])/(cm[0][0]+cm[0][1]+cm[1][0]+cm[1][1]))

OUTPUT:
Bag of words: Message_body Label
0 [rofl, true, name] 0
1 [guy, act, like, i'd, interest, buy, so... 0
2 [piti, mood, ..., suggest] 0
3 [ü, b, go, esplanad, fr, home] 0
4 [2nd, time, tri, 2, contact, u, u, £, 750, pou... 1
Confusion matrix: [[162 0]
[ 13 17]]
Accuracy is 0.9322916666666666
EXPERIMENT - 5

AIM: Develop a machine learning method to predict stock prices based on past price variation.

CODE:
import numpy as np
import pandas as pd
import math
import sklearn
import sklearn.preprocessing
import datetime
import os
import matplotlib.pyplot as plt
import tensorflow as tf

# split data in 80%/10%/10% train/validation/test sets


valid_set_size_percentage = 10
test_set_size_percentage = 10

#display parent directory and working directory


print(os.path.dirname(os.getcwd())+':', os.listdir(os.path.dirname(os.getcwd())));
print(os.getcwd()+':', os.listdir(os.getcwd()));

# import all stock prices


df = pd.read_csv("../input/prices-split-adjusted.csv", index_col = 0)
df.info()
df.head()

# number of different stocks


print('\n number of different stocks: ', len(list(set(df.symbol))))
print(list(set(df.symbol))[:10])
df.tail()

df.describe()

df.info()

# function for min-max normalization of stock


def normalize_data(df):
    min_max_scaler = sklearn.preprocessing.MinMaxScaler()
    df['open'] = min_max_scaler.fit_transform(df.open.values.reshape(-1,1))
    df['high'] = min_max_scaler.fit_transform(df.high.values.reshape(-1,1))
    df['low'] = min_max_scaler.fit_transform(df.low.values.reshape(-1,1))
    df['close'] = min_max_scaler.fit_transform(df['close'].values.reshape(-1,1))
    return df

# function to create train, validation, test data given stock data and sequence length
def load_data(stock, seq_len):
    data_raw = stock.values # convert to numpy array (as_matrix() was removed in pandas 1.0)
    data = []

    # create all possible sequences of length seq_len
    for index in range(len(data_raw) - seq_len):
        data.append(data_raw[index: index + seq_len])

    data = np.array(data)
    valid_set_size = int(np.round(valid_set_size_percentage/100*data.shape[0]))
    test_set_size = int(np.round(test_set_size_percentage/100*data.shape[0]))
    train_set_size = data.shape[0] - (valid_set_size + test_set_size)

    x_train = data[:train_set_size,:-1,:]
    y_train = data[:train_set_size,-1,:]

    x_valid = data[train_set_size:train_set_size+valid_set_size,:-1,:]
    y_valid = data[train_set_size:train_set_size+valid_set_size,-1,:]

    x_test = data[train_set_size+valid_set_size:,:-1,:]
    y_test = data[train_set_size+valid_set_size:,-1,:]

    return [x_train, y_train, x_valid, y_valid, x_test, y_test]
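
To make the windowing concrete: with seq_len = 20, each sample is a window of 20 consecutive
trading days; the first 19 rows of a window form the input sequence and the 20th row is the
prediction target, which is why x_train below has shape (samples, 19, 4) and y_train has shape
(samples, 4).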

# choose one stock


df_stock = df[df.symbol == 'EQIX'].copy()
df_stock.drop(['symbol'], axis=1, inplace=True)
df_stock.drop(['volume'], axis=1, inplace=True)

cols = list(df_stock.columns.values)
print('df_stock.columns.values = ', cols)

# normalize stock
df_stock_norm = df_stock.copy()
df_stock_norm = normalize_data(df_stock_norm)

# create train, test data


seq_len = 20 # choose sequence length
x_train, y_train, x_valid, y_valid, x_test, y_test = load_data(df_stock_norm, seq_len)
print('x_train.shape = ',x_train.shape)
print('y_train.shape = ', y_train.shape)
print('x_valid.shape = ',x_valid.shape)
print('y_valid.shape = ', y_valid.shape)
print('x_test.shape = ', x_test.shape)
print('y_test.shape = ',y_test.shape)
## Basic Cell RNN in tensorflow

index_in_epoch = 0;
perm_array = np.arange(x_train.shape[0])
np.random.shuffle(perm_array)

# function to get the next batch


def get_next_batch(batch_size):
    global index_in_epoch, x_train, perm_array
    start = index_in_epoch
    index_in_epoch += batch_size

    if index_in_epoch > x_train.shape[0]:
        np.random.shuffle(perm_array) # shuffle permutation array
        start = 0 # start next epoch
        index_in_epoch = batch_size

    end = index_in_epoch
    return x_train[perm_array[start:end]], y_train[perm_array[start:end]]

# parameters
n_steps = seq_len-1
n_inputs = 4
n_neurons = 200
n_outputs = 4
n_layers = 2
learning_rate = 0.001
batch_size = 50
n_epochs = 100
train_set_size = x_train.shape[0]
test_set_size = x_test.shape[0]

tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])


y = tf.placeholder(tf.float32, [None, n_outputs])

# use Basic RNN Cell


layers = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.elu)
for layer in range(n_layers)]

# use Basic LSTM Cell


#layers = [tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons, activation=tf.nn.elu)
# for layer in range(n_layers)]

# use LSTM Cell with peephole connections


#layers = [tf.contrib.rnn.LSTMCell(num_units=n_neurons,
# activation=tf.nn.leaky_relu, use_peepholes = True)
# for layer in range(n_layers)]

# use GRU cell


#layers = [tf.contrib.rnn.GRUCell(num_units=n_neurons, activation=tf.nn.leaky_relu)
# for layer in range(n_layers)]

multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)

stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])


stacked_outputs = tf.layers.dense(stacked_rnn_outputs, n_outputs)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])
outputs = outputs[:,n_steps-1,:] # keep only last output of sequence
loss = tf.reduce_mean(tf.square(outputs - y)) # loss function = mean squared error
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

# run graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for iteration in range(int(n_epochs*train_set_size/batch_size)):
        x_batch, y_batch = get_next_batch(batch_size) # fetch the next training batch
        sess.run(training_op, feed_dict={X: x_batch, y: y_batch})
        if iteration % int(5*train_set_size/batch_size) == 0:
            mse_train = loss.eval(feed_dict={X: x_train, y: y_train})
            mse_valid = loss.eval(feed_dict={X: x_valid, y: y_valid})
            print('%.2f epochs: MSE train/valid = %.6f/%.6f'%(
                iteration*batch_size/train_set_size, mse_train, mse_valid))

    y_train_pred = sess.run(outputs, feed_dict={X: x_train})
    y_valid_pred = sess.run(outputs, feed_dict={X: x_valid})
    y_test_pred = sess.run(outputs, feed_dict={X: x_test})

y_train.shape

corr_price_development_train = np.sum(np.equal(np.sign(y_train[:,1]-y_train[:,0]),
np.sign(y_train_pred[:,1]-y_train_pred[:,0])).astype(int)) / y_train.shape[0]
corr_price_development_valid = np.sum(np.equal(np.sign(y_valid[:,1]-y_valid[:,0]),
np.sign(y_valid_pred[:,1]-y_valid_pred[:,0])).astype(int)) / y_valid.shape[0]
corr_price_development_test = np.sum(np.equal(np.sign(y_test[:,1]-y_test[:,0]),
np.sign(y_test_pred[:,1]-y_test_pred[:,0])).astype(int)) / y_test.shape[0]

print('correct sign prediction for close - open price for train/valid/test: %.2f/%.2f/%.2f'%(
corr_price_development_train, corr_price_development_valid, corr_price_development_test))
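
Here y[:,1] - y[:,0] is the close-minus-open change of the target day (the columns are
['open', 'close', 'low', 'high']), so these three numbers measure how often the model predicts
the correct direction of the daily move on the train, validation, and test sets.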

OUTPUT:
Index: 851264 entries, 2016-01-05 to 2016-12-30
Data columns (total 6 columns):
symbol 851264 non-null object
open 851264 non-null float64
close 851264 non-null float64
low 851264 non-null float64
high 851264 non-null float64
volume 851264 non-null float64
dtypes: float64(5), object(1)
memory usage: 45.5+ MB

number of different stocks: 501


['SLG', 'SNI', 'DLR', 'PG', 'O', 'BLK', 'FCX', 'WLTW', 'SHW', 'UPS']
<class 'pandas.core.frame.DataFrame'>
Index: 851264 entries, 2016-01-05 to 2016-12-30
Data columns (total 6 columns):
symbol 851264 non-null object
open 851264 non-null float64
close 851264 non-null float64
low 851264 non-null float64
high 851264 non-null float64
volume 851264 non-null float64
dtypes: float64(5), object(1)
memory usage: 45.5+ MB
df_stock.columns.values = ['open', 'close', 'low', 'high']
x_train.shape = (1394, 19, 4)
y_train.shape = (1394, 4)
x_valid.shape = (174, 19, 4)
y_valid.shape = (174, 4)
x_test.shape = (174, 19, 4)
y_test.shape = (174, 4)
0.00 epochs: MSE train/valid = 0.060335/0.047891
4.99 epochs: MSE train/valid = 0.000148/0.000607
9.97 epochs: MSE train/valid = 0.000132/0.000617
14.96 epochs: MSE train/valid = 0.000130/0.000761
19.94 epochs: MSE train/valid = 0.000106/0.000351
24.93 epochs: MSE train/valid = 0.000097/0.000467
29.91 epochs: MSE train/valid = 0.000095/0.000437
34.90 epochs: MSE train/valid = 0.000115/0.000442
39.89 epochs: MSE train/valid = 0.000076/0.000353
44.87 epochs: MSE train/valid = 0.000078/0.000339
49.86 epochs: MSE train/valid = 0.000068/0.000222
54.84 epochs: MSE train/valid = 0.000099/0.000282
59.83 epochs: MSE train/valid = 0.000082/0.000232
64.81 epochs: MSE train/valid = 0.000068/0.000311
69.80 epochs: MSE train/valid = 0.000061/0.000202
74.78 epochs: MSE train/valid = 0.000078/0.000312
79.77 epochs: MSE train/valid = 0.000066/0.000246
84.76 epochs: MSE train/valid = 0.000063/0.000202
89.74 epochs: MSE train/valid = 0.000062/0.000251
94.73 epochs: MSE train/valid = 0.000069/0.000271
99.71 epochs: MSE train/valid = 0.000081/0.000203
correct sign prediction for close - open price for train/valid/test: 0.72/0.47/0.41
EXPERIMENT - 6

AIM: Develop a machine learning method to predict how people would rate movies, books, etc.

CODE:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split


from sklearn.tree import DecisionTreeClassifier

#Data acquisition of the movies dataset


df_movie=pd.read_csv('../input/movies.dat', sep = '::', engine='python')
df_movie.columns =['MovieIDs','MovieName','Category']
df_movie.dropna(inplace=True)
df_movie.head()

#Data acquisition of the rating dataset


df_rating = pd.read_csv("../input/ratings.dat",sep='::', engine='python')
df_rating.columns =['ID','MovieID','Ratings','TimeStamp']
df_rating.dropna(inplace=True)
df_rating.head()

#Data acquisition of the users dataset


df_user = pd.read_csv("../input/users.dat",sep='::',engine='python')
df_user.columns =['UserID','Gender','Age','Occupation','Zip-code']
df_user.dropna(inplace=True)
df_user.head()

df = pd.concat([df_movie, df_rating,df_user], axis=1)


df.head()

#Visualize user age distribution


df['Age'].value_counts().plot(kind='barh',alpha=0.7,figsize=(10,10))
plt.show()

df.Age.plot.hist(bins=25)
plt.title("Distribution of users' ages")
plt.ylabel('count of users')
plt.xlabel('Age')

labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']


df['age_group'] = pd.cut(df.Age, range(0, 81, 10), right=False, labels=labels)
df[['Age', 'age_group']].drop_duplicates()[:10]

#Visualize overall rating by users


df['Ratings'].value_counts().plot(kind='bar',alpha=0.7,figsize=(10,10))
plt.show()

groupedby_movieName = df.groupby('MovieName')
groupedby_rating = df.groupby('Ratings')
groupedby_uid = df.groupby('UserID')

movies = df.groupby('MovieName').size().sort_values(ascending=True)[:1000]
ToyStory_data = groupedby_movieName.get_group('Toy Story 2 (1999)')
ToyStory_data.shape

#Find and visualize the user rating of the movie “Toy Story”
plt.figure(figsize=(10,10))
plt.scatter(ToyStory_data['MovieName'],ToyStory_data['Ratings'])
plt.title('Plot showing the user rating of the movie “Toy Story”')
plt.show()

#Find and visualize the viewership of the movie “Toy Story” by age group
ToyStory_data[['MovieName','age_group']]

#Find and visualize the top 25 movies by viewership rating


top_25 = df[:25] # first 25 records; a true top-25 by rating would sort by 'Ratings' first
top_25['Ratings'].value_counts().plot(kind='barh',alpha=0.6,figsize=(7,7))
plt.show()

#Visualize the rating data by user of user id = 2696


userid_2696 = groupedby_uid.get_group(2696)
userid_2696[['UserID','Ratings']]

#First 500 extracted records


first_500 = df[:500].dropna()

#Use the following features:movie id,age,occupation


features = first_500[['MovieID','Age','Occupation']].values

#Use rating as label


labels = first_500[['Ratings']].values
#Create train and test data set
train, test, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=42)

#Create a histogram for movie id
df.MovieID.plot.hist(bins=25)
plt.title("MovieID distribution")
plt.ylabel('count')
plt.xlabel('MovieID')

#Create a histogram for age
df.Age.plot.hist(bins=25)
plt.title("Age distribution")
plt.ylabel('count')
plt.xlabel('Age')

#Create a histogram for occupation
df.Occupation.plot.hist(bins=25)
plt.title("Occupation distribution")
plt.ylabel('count')
plt.xlabel('Occupation')

# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(train, train_labels)
Y_pred = decision_tree.predict(test)
acc_decision_tree = round(decision_tree.score(train, train_labels) * 100, 2)
print("The accuracy of Decision tree algorithm is ",acc_decision_tree)

OUTPUT:
The accuracy of Decision tree algorithm is 98.54
EXPERIMENT - 7

AIM: Develop a machine learning method to cluster gene expression data, and explore how to
modify existing methods to solve the problem better.

CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier


from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb

from sklearn.model_selection import train_test_split


from sklearn.model_selection import GridSearchCV

from sklearn.metrics import (recall_score, precision_score, classification_report,
                             accuracy_score, confusion_matrix, roc_curve, auc,
                             plot_confusion_matrix)
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from scipy import ndimage
import seaborn as sns

#Load dataset
Train_Data = pd.read_csv("gene-expression/data_set_ALL_AML_train.csv")
Test_Data = pd.read_csv("gene-expression/data_set_ALL_AML_independent.csv")
labels = pd.read_csv("gene-expression/actual.csv", index_col = 'patient')

Train_Data.head()

#check for nulls


print(Train_Data.isna().sum().max())
print(Test_Data.isna().sum().max())

#drop 'call' columns


cols = [col for col in Test_Data.columns if 'call' in col]
test = Test_Data.drop(cols, axis=1)
cols = [col for col in Train_Data.columns if 'call' in col]
train = Train_Data.drop(cols, axis=1)

#Join all the data


patients = [str(i) for i in range(1, 73, 1)]
df_all = pd.concat([train, test], axis = 1)[patients]

#transpose rows and columns


df_all = df_all.T

df_all["patient"] = pd.to_numeric(patients)
labels["cancer"]= pd.get_dummies(labels.cancer, drop_first=True)

# add the cancer column to train data


Data = pd.merge(df_all, labels, on="patient")

Data.head()

Data['cancer'].value_counts()
plt.figure(figsize=(4,8))
colors = ["AML", "ALL"]
sns.countplot('cancer', data=Data, palette = "Set1")
plt.title('Class Distributions \n (0: AML || 1: ALL)', fontsize=14)

#X -> matrix of independent variable


#y -> vector of dependent variable
X, y = Data.drop(columns=["cancer"]), Data["cancer"]

print(X)
print(y)

#split the dataset


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.25, random_state= 0)

#feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

X_train.shape

pca = PCA()
pca.fit_transform(X_train)

total = sum(pca.explained_variance_)
k=0
current_variance = 0
while current_variance/total < 0.90:
    current_variance += pca.explained_variance_[k]
    k = k + 1
print(k, " features explain around 90% of the variance. From 7129 features to ", k,
      ", not too bad.", sep='')

pca = PCA(n_components=k)
X_train_pca = pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
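
# Equivalently, scikit-learn's PCA accepts a variance fraction directly, so the
# manual loop above can be collapsed; a small sketch:
pca = PCA(n_components=0.90)             # keep enough components for 90% variance
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
print(pca.n_components_)                 # essentially the same k as the loop above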

var_exp = pca.explained_variance_ratio_.cumsum()
var_exp = var_exp*100
plt.bar(range(k), var_exp,color = 'r')

pca.n_components_

from mpl_toolkits.mplot3d import Axes3D

pca3 = PCA(n_components=3).fit(X_train)
X_train_reduced = pca3.transform(X_train)

plt.clf()
fig = plt.figure(1, figsize=(10,6))
ax = Axes3D(fig, elev=-150, azim=110,)
ax.scatter(X_train_reduced[:, 0], X_train_reduced[:, 1], X_train_reduced[:, 2], c = y_train,
cmap='coolwarm', linewidths=10)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.w_xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.w_yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.w_zaxis.set_ticklabels([])

from sklearn.utils import resample


from collections import Counter

print("Before Upsampling:-")
print(Counter(y_train))

from imblearn.over_sampling import SMOTE


oversample = SMOTE()
X_train_ov, y_train_ov = oversample.fit_resample(X_train_pca,y_train)

print("After Upsampling:-")
print(Counter(y_train_ov))

# do a grid search
svc_params = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
              {'C': [1, 10, 100, 1000], 'kernel': ['rbf'],
               'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]

search = GridSearchCV(SVC(), svc_params, n_jobs=-1, verbose=1)


search.fit(X_train_ov, y_train_ov)

best_accuracy = search.best_score_ #to get best score


best_parameters = search.best_params_ #to get best parameters
# select best svc
best_svc = search.best_estimator_
best_svc

#build SVM model with best parameters


svc_model = SVC(C=1, kernel='linear',probability=True)
svc_model.fit(X_train_ov, y_train_ov)

prediction=svc_model.predict(X_test_pca)

acc_svc = accuracy_score(prediction,y_test)
print('The accuracy of SVM is', acc_svc)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))

#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True, cmap='Greens', fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)

#ROC curve and Area under the curve plotting


predicting_probabilites = svc_model.predict_proba(X_test_pca)[:,1]
fpr,tpr,thresholds = roc_curve(y_test,predicting_probabilites)
plt.subplot(222)
plt.plot(fpr,tpr,label = ("Area_under the curve :",auc(fpr,tpr)),color = "r")
plt.plot([1,0],[1,0],linestyle = "dashed",color ="k")
plt.legend(loc = "best")
plt.title("ROC - CURVE & AREA UNDER CURVE",fontsize=20)

knn_param = {
"n_neighbors": [i for i in range(1,30,5)],
"weights": ["uniform", "distance"],
"algorithm": ["ball_tree", "kd_tree", "brute"],
"leaf_size": [1, 10, 30],
"p": [1,2]
}
search = GridSearchCV(KNeighborsClassifier(), knn_param, n_jobs=-1, verbose=1)
search.fit(X_train_ov, y_train_ov)

best_accuracy = search.best_score_ #to get best score


best_parameters = search.best_params_ #to get best parameters
# select best svc
best_knn = search.best_estimator_
best_knn

knn_model = KNeighborsClassifier(algorithm='ball_tree', leaf_size=1, n_neighbors=6,
                                 weights='distance')

knn_model.fit(X_train_ov,y_train_ov)
prediction=knn_model.predict(X_test_pca)

acc_knn = accuracy_score(prediction,y_test)
print('The accuracy of K-NN is', acc_knn)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))

#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True, cmap='Greens', fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)

#ROC curve and Area under the curve plotting


predicting_probabilites = knn_model.predict_proba(X_test_pca)[:,1]
fpr,tpr,thresholds = roc_curve(y_test,predicting_probabilites)
plt.subplot(222)
plt.plot(fpr,tpr,label = ("Area_under the curve :",auc(fpr,tpr)),color = "r")
plt.plot([1,0],[1,0],linestyle = "dashed",color ="k")
plt.legend(loc = "best")
plt.title("ROC - CURVE & AREA UNDER CURVE",fontsize=20)

log_grid = {'C': [1e-03, 1e-2, 1e-1, 1, 10],
            'penalty': ['l1', 'l2']}

log_model = GridSearchCV(estimator=LogisticRegression(solver='liblinear'),
param_grid=log_grid,
cv=3,
scoring='accuracy')
log_model.fit(X_train_ov, y_train_ov)

best_accuracy = log_model.best_score_ #to get best score


best_parameters = log_model.best_params_ #to get best parameters
# select best svc
best_lr = log_model.best_estimator_
best_lr

#Logistic Regression
lr_model = LogisticRegression(C=0.001, solver='liblinear')

lr_model.fit(X_train_ov,y_train_ov)

prediction=lr_model.predict(X_test_pca)

acc_log = accuracy_score(prediction,y_test)
print('Validation accuracy of Logistic Regression is', acc_log)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))
#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True,cmap="Greens",fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)

#ROC curve and Area under the curve plotting


predicting_probabilites = lr_model.predict_proba(X_test_pca)[:,1]
fpr,tpr,thresholds = roc_curve(y_test,predicting_probabilites)
plt.subplot(222)
plt.plot(fpr,tpr,label = ("Area_under the curve :",auc(fpr,tpr)),color = "r")
plt.plot([1,0],[1,0],linestyle = "dashed",color ="k")
plt.legend(loc = "best")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC - CURVE & AREA UNDER CURVE",fontsize=20)

params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4, 5, 6],
          'max_depth': [3, 4, 5, 6, 7, 8]}
decision_search = GridSearchCV(DecisionTreeClassifier(random_state=42), params, verbose=1,
                               cv=3)

decision_search.fit(X_train_ov, y_train_ov)

best_accuracy = decision_search.best_score_ #to get best score


best_parameters = decision_search.best_params_ #to get best parameters
# select best svc
best_ds = decision_search.best_estimator_
best_ds
#Decision Tree
ds_model = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=3, random_state=42)

ds_model.fit(X_train_ov,y_train_ov)

prediction=ds_model.predict(X_test_pca)

acc_decision_tree = accuracy_score(prediction,y_test)
print('Validation accuracy of Decision Tree is', acc_decision_tree)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))

#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True,cmap="Greens",fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)

#ROC curve and Area under the curve plotting


predicting_probabilites = ds_model.predict_proba(X_test_pca)[:,1]
fpr,tpr,thresholds = roc_curve(y_test,predicting_probabilites)
plt.subplot(222)
plt.plot(fpr,tpr,label = ("Area_under the curve :",auc(fpr,tpr)),color = "r")
plt.plot([1,0],[1,0],linestyle = "dashed",color ="k")
plt.legend(loc = "best")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC - CURVE & AREA UNDER CURVE",fontsize=20)

# Hyperparameters search grid


rf_param_grid = {'bootstrap': [False, True],
'n_estimators': [60, 70, 80, 90, 100],
'max_features': [0.6, 0.65, 0.7, 0.75, 0.8],
'min_samples_leaf': [8, 10, 12, 14],
'min_samples_split': [3, 5, 7]
}

# Create the GridSearchCV object


rf_search = GridSearchCV(estimator=RandomForestClassifier(), param_grid=rf_param_grid,
cv=3, scoring='accuracy')
rf_search.fit(X_train_ov, y_train_ov)

best_accuracy = rf_search.best_score_ #to get best score


best_parameters = rf_search.best_params_ #to get best parameters
# select best svc
best_rf = rf_search.best_estimator_
best_rf

#Random forest
rf_model = RandomForestClassifier(bootstrap=False, max_features=0.6, min_samples_leaf=8,
min_samples_split=3, n_estimators=70)

rf_model.fit(X_train_ov,y_train_ov)

prediction=rf_model.predict(X_test_pca)

acc_random_forest = accuracy_score(prediction,y_test)
print('Validation accuracy of RandomForest Classifier is', acc_random_forest)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))

#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True,cmap="Greens",fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)

#ROC curve and Area under the curve plotting


predicting_probabilites = rf_model.predict_proba(X_test_pca)[:,1]
fpr,tpr,thresholds = roc_curve(y_test,predicting_probabilites)
plt.subplot(222)
plt.plot(fpr,tpr,label = ("Area_under the curve :",auc(fpr,tpr)),color = "r")
plt.plot([1,0],[1,0],linestyle = "dashed",color ="k")
plt.legend(loc = "best")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC - CURVE & AREA UNDER CURVE",fontsize=20)

xgb_grid_params = {'max_depth': [3, 4, 5, 6, 7, 8, 10, 12],
                   'min_child_weight': [1, 2, 4, 6, 8, 10, 12, 15],
                   'n_estimators': [40, 50, 60, 70, 80, 90, 100, 110, 120, 130],
                   'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2, 0.3]}

# Create the GridSearchCV object


xgb_search = GridSearchCV(estimator=xgb.XGBClassifier(), param_grid=xgb_grid_params,
cv=3, scoring='accuracy')
xgb_search.fit(X_train_ov, y_train_ov)

best_accuracy = xgb_search.best_score_ #to get best score


best_parameters = xgb_search.best_params_ #to get best parameters
# select best svc
best_xgb = xgb_search.best_estimator_
best_xgb

#XB Boost
xgb_model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.001, max_delta_step=0, max_depth=3,
min_child_weight=1, monotone_constraints='()',
n_estimators=40, n_jobs=0, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)

xgb_model.fit(X_train_ov,y_train_ov)

prediction=xgb_model.predict(X_test_pca)

acc_xgb = accuracy_score(prediction,y_test)
print('Validation accuracy of XG Boost is', acc_xgb)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))

#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True,cmap="Greens",fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)

#ROC curve and Area under the curve plotting


predicting_probabilites = xgb_model.predict_proba(X_test_pca)[:,1]
fpr,tpr,thresholds = roc_curve(y_test,predicting_probabilites)
plt.subplot(222)
plt.plot(fpr,tpr,label = ("Area_under the curve :",auc(fpr,tpr)),color = "r")
plt.plot([1,0],[1,0],linestyle = "dashed",color ="k")
plt.legend(loc = "best")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC - CURVE & AREA UNDER CURVE",fontsize=20)

from sklearn.naive_bayes import GaussianNB


# In the case of Gaussian naive Bayes there is no hyper-parameter to tune, so
# there is nothing to grid-search over.
nb_model = GaussianNB()

nb_model.fit(X_train_ov,y_train_ov)

prediction=nb_model.predict(X_test_pca)

acc_nb = accuracy_score(prediction,y_test)
print('Validation accuracy of Naive Bayes is', acc_nb)
print ("\nClassification report :\n",(classification_report(y_test,prediction)))

#Confusion matrix
plt.figure(figsize=(13,10))
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test,prediction),annot=True,cmap="Greens",fmt =
"d",linecolor="k",linewidths=3)
plt.title("CONFUSION MATRIX",fontsize=20)

#ROC curve and Area under the curve plotting


predicting_probabilites = nb_model.predict_proba(X_test_pca)[:,1]
fpr,tpr,thresholds = roc_curve(y_test,predicting_probabilites)
plt.subplot(222)
plt.plot(fpr,tpr,label = ("Area_under the curve :",auc(fpr,tpr)),color = "r")
plt.plot([1,0],[1,0],linestyle = "dashed",color ="k")
plt.legend(loc = "best")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC - CURVE & AREA UNDER CURVE",fontsize=20)

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 'Decision Tree',
              'Random Forest', 'XG Boost', 'Naive Bayes'],
    'Score': [acc_svc, acc_knn, acc_log, acc_decision_tree,
              acc_random_forest, acc_xgb, acc_nb]})
models.sort_values(by='Score', ascending=False)

OUTPUT:
Model Score
6 Naive Bayes 0.944444
2 Logistic Regression 0.833333
1 KNN 0.722222
3 Decision Tree 0.722222
4 Random Forest 0.722222
5 XG Boost 0.722222
0 Support Vector Machines 0.666667
EXPERIMENT - 8

AIM: Select 2 datasets. Each dataset should contain examples from multiple classes. For training
purposes, assume that the class label of each example is unknown (if it is known, ignore it).
Implement the K-means algorithm and apply it to the data you selected. Evaluate performance by
measuring the sum of Euclidean distance of each example from its class center. Test the
performance of the algorithm as a function of the parameter k.

CODE:
Dataset used: Mall_Customers.csv

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

# Using the elbow method to find the optimal number of clusters


from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Fitting K-Means to the dataset


kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
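
# The aim evaluates performance as the sum of Euclidean distances of each
# example from its cluster centre. kmeans.inertia_ (used in the elbow plot
# above) is the sum of *squared* distances, so a small sketch of the plain
# Euclidean sum, which can be re-run for each k to test performance as a
# function of k:
distances = np.linalg.norm(X - kmeans.cluster_centers_[y_kmeans], axis=1)
print('Sum of Euclidean distances for k = 5: ', distances.sum())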

# Visualising the clusters


plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow',
label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

OUTPUT: (elbow-method plot and customer-cluster scatter plot)

Dataset used: Iris.csv

#Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#Importing the Iris dataset with pandas


dataset = pd.read_csv('../input/Iris.csv')
x = dataset.iloc[:, [1, 2, 3, 4]].values

#Finding the optimum number of clusters for k-means classification


from sklearn.cluster import KMeans
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10,
                    random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

#Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()

#Applying kmeans to the dataset / Creating the kmeans classifier


kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10,
                random_state = 0)
y_kmeans = kmeans.fit_predict(x)

#Visualising the clusters


plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Iris-virginica')

#Plotting the centroids of the clusters


plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 100, c = 'yellow', label = 'Centroids')
plt.legend()
OUTPUT: (elbow-method plot and iris-cluster scatter plot with centroids)
