
Lab Manual

III B Tech – II SEMESTER

MACHINE LEARNING (R18)

(CS604PC)
DEPARTMENT VISION AND MISSION

Department Vision: To produce technically competent professionals through quality education in cutting-edge technologies, grounded in professional ethics.

Department Mission:

M1: To impart quality technical education in the design and implementation of IT applications through innovative teaching-learning practices.

M2: To inculcate professional behavior, strong ethical values, and research capabilities.

M3: To educate students to be effective problem solvers with social sensitivity, for the betterment of society and humanity as a whole.

COURSE OUTCOMES
CO1 Develop skills in data extraction and manipulation using Python.

CO2 Apply Machine Learning and Text Classification to model relationships between variables.

CO3 Analyze credit-worthiness classification data and calculate unconditional and conditional
probabilities using Python.
PROGRAM OUTCOMES (POs)
PO-1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
PO-2 Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO-3 Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
PO-4 Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis
of the information to provide valid conclusions.
PO-5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
PO-6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant
to the professional engineering practice.
PO-7 Environment and sustainability: Understand the impact of professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
PO-8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of engineering practice.
PO-9 Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO-10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give and
receive clear instructions.
PO-11 Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
PO-12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
PROGRAM SPECIFIC OUTCOMES

PSO 1: Apply acquired knowledge of programming languages, data structures, algorithms, and standard software engineering principles to devise effective solutions for intricate computational issues.

PSO 2: Design and develop efficient web and mobile based applications under realistic constraints.

PSO 3: Apply core and advanced concepts of database management systems, data mining, and machine learning to devise engineering solutions for practical problems.

PROGRAM EDUCATIONAL OBJECTIVES

PEO 1: Demonstrate proficiency in fundamental concepts and advanced technologies of computer science to succeed in their careers and/or obtain a higher degree.

PEO 2: Analyze complex computing problems in multidisciplinary areas and creatively solve them.

PEO 3: Recognize ethical dilemmas in the work environment and apply a professional code of ethics.
LIST OF EXPERIMENTS

1. The probability that it is Friday and that a student is absent is 3%. Since there are 5 school days in a week, the probability that it is Friday is 20%. What is the probability that a student is absent given that today is Friday? Apply Bayes' rule in Python to get the result. (Ans: 15%)

2. Extract the data from a database using Python.

3. Implement k-nearest neighbors classification using Python.

4. Given the following data, which specify classifications for nine combinations of VAR1 and VAR2, predict a classification for a case where VAR1=0.906 and VAR2=0.606, using the result of k-means clustering with 3 means (i.e., 3 centroids).

   VAR1    VAR2    CLASS
   1.713   1.586   0
   0.180   1.786   1
   0.353   1.240   1
   0.940   1.566   0
   1.486   0.759   1
   1.266   1.106   0
   1.540   0.419   1
   0.459   1.799   1
   0.773   0.186   1

5. The following training examples map descriptions of individuals onto high, medium, and low credit-worthiness.

   medium skiing design single twenties no -> highRisk
   high golf trading married forties yes -> lowRisk
   low speedway transport married thirties yes -> medRisk
   medium football banking single thirties yes -> lowRisk
   high flying media married fifties yes -> highRisk
   low football security single twenties no -> medRisk
   medium golf media single thirties yes -> medRisk
   medium golf transport married forties yes -> lowRisk
   high skiing banking single thirties yes -> highRisk
   low golf unemployed married forties yes -> highRisk

   Input attributes are (from left to right) income, recreation, job, status, age-group, home-owner. Find the unconditional probability of `golf' and the conditional probability of `single' given `medRisk' in the dataset.

6. Implement linear regression using Python.

7. Implement Naïve Bayes' theorem to classify English text.

8. Implement an algorithm to demonstrate the significance of the genetic algorithm.

9. Implement the finite words classification system using the back-propagation algorithm.

ADDITIONAL PROGRAMS

10. Apply a Hierarchical Clustering algorithm to cluster a set of data stored in a .CSV file. Use the same data set for clustering using the k-means algorithm. Compare the results of these two algorithms and comment on the quality of clustering.

11. Write a Python program to implement Logistic Regression.
1. The probability that it is Friday and that a student is absent is 3%. Since there are 5 school days in a week, the probability that it is Friday is 20%. What is the probability that a student is absent given that today is Friday? Apply Bayes' rule in Python to get the result. (Ans: 15%)
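
By the conditional-probability form of Bayes' rule, P(Absent | Friday) = P(Friday and Absent) / P(Friday) = 0.03 / 0.20 = 0.15, i.e. 15%. The program below computes exactly this ratio.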

Source Code:
PFIA=float(input('Enter probability that it is Friday and that a student is absent='))
PF=float(input('probability that it is Friday='))
PABF=PFIA/PF
print('probability that a student is absent given that today is Friday using conditional probabilities=',PABF)

Output:

Enter probability that it is Friday and that a student is absent=0.03
probability that it is Friday=0.2
probability that a student is absent given that today is Friday using conditional probabilities= 0.15
2. Extract the data from a database using Python.

Source Code:
import mysql.connector

# connect to the MySQL server and create a database
mydb=mysql.connector.connect(host="localhost",user="root",password="password")
print(mydb)
cur=mydb.cursor()
cur.execute("CREATE DATABASE COLLEGE")

# reconnect to the new database and create a table
mydb=mysql.connector.connect(host="localhost",user="root",password="password",database="college")
cur=mydb.cursor()
s="CREATE TABLE student(rollno integer(4), name varchar(20))"
cur.execute(s)

# insert two records
s="INSERT INTO student(rollno,name) VALUES(%s,%s)"
a1=[(1,"Suresh"),(2,"Ramesh")]
cur.executemany(s,a1)
mydb.commit()
print("Done")

# extract the data
s="SELECT * from student"
cur.execute(s)
result=cur.fetchall()
for rec in result:
    print(rec)
Output:
(1, 'Suresh')
(2, 'Ramesh')
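
An optional extension (not part of the prescribed program): rows can also be filtered with a parameterized query, which keeps user input out of the SQL string. This sketch assumes the student table created above.

cur.execute("SELECT * FROM student WHERE rollno = %s", (1,))
print(cur.fetchone())   # expected: (1, 'Suresh')
mydb.close()            # release the connection when done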

3. Implement k-nearest neighbors classification using Python.

Source Code:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

irisData=load_iris()
x=irisData.data
y=irisData.target
print(irisData.feature_names)
print(irisData.target_names)
print('\nfirst 10 rows of x:\n',x[:10])
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
knn=KNeighborsClassifier(n_neighbors=2)
knn.fit(x_train,y_train)
knn.predict([[3.2,5.4,4.1,2.5]])

Output:

array([1])
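
An optional check (not part of the prescribed output): the held-out split can be used to estimate accuracy, since knn.score reports the mean accuracy on the given data.

print(knn.score(x_test,y_test))   # fraction of correct predictions on the 20% test split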

4. Given the following data, which specify classifications for nine combinations of VAR1 and VAR2, predict a classification for a case where VAR1=0.906 and VAR2=0.606, using the result of k-means clustering with 3 means (i.e., 3 centroids).
VAR1 VAR2 CLASS
1.713 1.586 0
0.180 1.786 1
0.353 1.240 1
0.940 1.566 0
1.486 0.759 1
1.266 1.106 0
1.540 0.419 1
0.459 1.799 1
0.773 0.186 1

Source Code:

from sklearn.cluster import KMeans
import numpy as np

x=np.array([[1.713,1.586],[0.180,1.786],[0.353,1.240],[0.940,1.566],[1.486,0.759],[1.266,1.106],[1.540,0.419],[0.459,1.799],[0.773,0.186]])
y=np.array([0,1,1,0,1,0,1,1,1])  # class labels from the table (not used by k-means, which is unsupervised)
kmeans=KMeans(n_clusters=3,random_state=0).fit(x)
kmeans.predict([[0.906,0.606]])

Output:
array([0])
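
To inspect how the nine points were grouped (an optional addition, not in the prescribed output), the fitted model exposes the assigned labels and the three centroids.

print(kmeans.labels_)           # cluster index assigned to each of the nine points
print(kmeans.cluster_centers_)  # coordinates of the 3 centroids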

5. The following training examples map descriptions of individuals onto high, medium, and low credit-worthiness.

medium skiing design single twenties no -> highRisk
high golf trading married forties yes -> lowRisk
low speedway transport married thirties yes -> medRisk
medium football banking single thirties yes -> lowRisk
high flying media married fifties yes -> highRisk
low football security single twenties no -> medRisk
medium golf media single thirties yes -> medRisk
medium golf transport married forties yes -> lowRisk
high skiing banking single thirties yes -> highRisk
low golf unemployed married forties yes -> highRisk

Input attributes are (from left to right) income, recreation, job, status, age-group, and home-owner. Find the unconditional probability of `golf' and the conditional probability of `single' given `medRisk' in the dataset.

Source Code:

totalRecords=10
numGolfRecords=4
unConditionalprobGolf=numGolfRecords/totalRecords
print("Unconditional probability of golf: = {}".format(unConditionalprobGolf))

# conditional probability of 'single' given 'medRisk'
numMedRiskSingle=2
numMedRisk=3
probMedRiskSingle=numMedRiskSingle/totalRecords
probMedRisk=numMedRisk/totalRecords
conditionalProb=probMedRiskSingle/probMedRisk
print("Conditional probability of single given medRisk: = {}".format(conditionalProb))

Output:
Unconditional probability of golf: = 0.4
Conditional probability of single given medRisk: = 0.6666666666666667
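
The counts above (4 golf records, 3 medRisk records, 2 of which are single) are read off the table by hand. As a sanity check, a short sketch (assuming the ten records encoded as tuples) derives the same numbers programmatically:

records=[
    ("medium","skiing","design","single","twenties","no","highRisk"),
    ("high","golf","trading","married","forties","yes","lowRisk"),
    ("low","speedway","transport","married","thirties","yes","medRisk"),
    ("medium","football","banking","single","thirties","yes","lowRisk"),
    ("high","flying","media","married","fifties","yes","highRisk"),
    ("low","football","security","single","twenties","no","medRisk"),
    ("medium","golf","media","single","thirties","yes","medRisk"),
    ("medium","golf","transport","married","forties","yes","lowRisk"),
    ("high","skiing","banking","single","thirties","yes","highRisk"),
    ("low","golf","unemployed","married","forties","yes","highRisk"),
]
golf=sum(1 for r in records if r[1]=="golf")
medRisk=[r for r in records if r[6]=="medRisk"]
singleMed=sum(1 for r in medRisk if r[3]=="single")
print("P(golf) =",golf/len(records))                 # 0.4
print("P(single|medRisk) =",singleMed/len(medRisk))  # 0.666...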

6. Implement linear regression using Python.

Source Code:

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color = "g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output:
Estimated coefficients:

b_0 = 1.2363636363636363

b_1 = 1.1696969696969697
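
As an optional cross-check (not part of the prescribed program), the same coefficients can be recovered with scikit-learn's LinearRegression; its intercept_ and coef_ should match b_0 and b_1 above.

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)  # sklearn expects a 2-D feature matrix
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
reg = LinearRegression().fit(x, y)
print(reg.intercept_, reg.coef_[0])   # expected: 1.2363..., 1.1696...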

7. Implement Naïve Bayes' theorem to classify English text.

Source Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score,recall_score

msg = pd.read_csv('document.csv', names=['message', 'label'])
print("Total Instances of Dataset: ", msg.shape[0])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})

X = msg.message
y = msg.labelnum
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names())  # use get_feature_names_out() on newer scikit-learn

clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)

print('Accuracy Metrics:')
print('Accuracy:', accuracy_score(ytest, pred))
print('Recall:', recall_score(ytest, pred))
print('Precision:', precision_score(ytest, pred))
print('Confusion Matrix:\n', confusion_matrix(ytest, pred))

document.csv:

I love this sandwich, pos
This is an amazing place, pos
I feel very good about these beers, pos
This is my best work, pos
What an awesome view, pos
I do not like this restaurant, neg
I am tired of this stuff, neg
I can't deal with this, neg
He is my sworn enemy, neg
My boss is horrible, neg
This is an awesome place, pos
I do not like the taste of this juice, neg
I love to dance, pos
I am sick and tired of this place, neg
What a great holiday, pos
That is a bad locality to stay, neg
We will have good fun tomorrow, pos
I went to my enemy's house today, neg

Output:
Total Instances of Dataset: 18
Accuracy Metrics:
Accuracy: 0.6
Recall: 0.6666666666666666
Precision: 0.66666666666666
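
As a quick usage check (an optional addition, not in the prescribed output), the trained classifier can label a new sentence once it is vectorized with the same fitted CountVectorizer.

test_msg=['I love this place']
test_dm=count_v.transform(test_msg)   # reuse the fitted vectorizer
print(clf.predict(test_dm))           # 1 -> pos, 0 -> neg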

8. Implement an algorithm to demonstrate the significance of the genetic
algorithm.

Source Code:
# Python3 program to create a target string, starting from a
# random string, using a Genetic Algorithm
import random

# Number of individuals in each generation
POPULATION_SIZE = 100

# Valid genes
GENES = '''abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP
QRSTUVWXYZ 1234567890, .-;:_!"#%&/()=?@${[]}'''

# Target string to be generated
TARGET = "Ravindra Raman Cholla"

class Individual(object):
    '''Class representing an individual in the population'''
    def __init__(self, chromosome):
        self.chromosome = chromosome
        self.fitness = self.cal_fitness()

    @classmethod
    def mutated_genes(self):
        '''create a random gene for mutation'''
        global GENES
        gene = random.choice(GENES)
        return gene

    @classmethod
    def create_gnome(self):
        '''create a chromosome, i.e. a string of genes'''
        global TARGET
        gnome_len = len(TARGET)
        return [self.mutated_genes() for _ in range(gnome_len)]

    def mate(self, par2):
        '''Perform mating and produce new offspring'''
        # chromosome for offspring
        child_chromosome = []
        for gp1, gp2 in zip(self.chromosome, par2.chromosome):
            # random probability
            prob = random.random()
            # if prob is less than 0.45, insert gene from parent 1
            if prob < 0.45:
                child_chromosome.append(gp1)
            # if prob is between 0.45 and 0.90, insert gene from parent 2
            elif prob < 0.90:
                child_chromosome.append(gp2)
            # otherwise insert a random gene (mutate), to maintain diversity
            else:
                child_chromosome.append(self.mutated_genes())
        # create a new Individual (offspring) from the generated chromosome
        return Individual(child_chromosome)

    def cal_fitness(self):
        '''Calculate fitness score: the number of characters
        in the string which differ from the target string.'''
        global TARGET
        fitness = 0
        for gs, gt in zip(self.chromosome, TARGET):
            if gs != gt:
                fitness += 1
        return fitness

# Driver code
def main():
    global POPULATION_SIZE
    # current generation
    generation = 1
    found = False
    population = []

    # create initial population
    for _ in range(POPULATION_SIZE):
        gnome = Individual.create_gnome()
        population.append(Individual(gnome))

    while not found:
        # sort the population in increasing order of fitness score
        population = sorted(population, key=lambda x: x.fitness)

        # if the individual with the lowest fitness score has fitness 0,
        # we have reached the target: break the loop
        if population[0].fitness <= 0:
            found = True
            break

        # otherwise generate new offspring for the new generation
        new_generation = []

        # Perform elitism: 10% of the fittest population
        # goes straight to the next generation
        s = int((10*POPULATION_SIZE)/100)
        new_generation.extend(population[:s])

        # From the 50% fittest population, individuals
        # mate to produce offspring
        s = int((90*POPULATION_SIZE)/100)
        for _ in range(s):
            parent1 = random.choice(population[:50])
            parent2 = random.choice(population[:50])
            child = parent1.mate(parent2)
            new_generation.append(child)

        population = new_generation

        print("Generation: {}\tString: {}\tFitness: {}".format(
            generation,
            "".join(population[0].chromosome),
            population[0].fitness))
        generation += 1

    print("Generation: {}\tString: {}\tFitness: {}".format(
        generation,
        "".join(population[0].chromosome),
        population[0].fitness))

if __name__ == '__main__':
    main()

Output:
Generation: 1 String: qRIS Fitness: 3
Generation: 2 String: qRIS Fitness: 3
Generation: 3 String: qRIS Fitness: 3
Generation: 4 String: NR:n Fitness: 2
Generation: 5 String: NR:n Fitness: 2
Generation: 6 String: NRCn Fitness: 1
Generation: 7 String: NRCn Fitness: 1
Generation: 8 String: NRCn Fitness: 1
Generation: 9 String: NRCn Fitness: 1
Generation: 10 String: NRCn Fitness: 1
Generation: 11 String: NRCn Fitness: 1
Generation: 12 String: NRCn Fitness: 1
Generation: 13 String: NRCn Fitness: 1
Generation: 14 String: NRCM Fitness: 0

9. Implement the finite words classification system using the back-propagation algorithm.

Source code:
# Back-propagation algorithm
import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X/np.amax(X, axis=0)  # normalize by the column-wise maximum of X
y = y/100

# Sigmoid function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# Derivative of the sigmoid function
def derivatives_sigmoid(x):
    return x * (1 - x)

# Variable initialization
epoch = 5                 # setting training iterations
lr = 0.1                  # setting learning rate
inputlayer_neurons = 2    # number of features in the data set
hiddenlayer_neurons = 3   # number of hidden layer neurons
output_neurons = 1        # number of neurons at the output layer

# Weight and bias initialization
# (draws numbers uniformly at random in [0, 1) with the given shape)
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    # Forward propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)

    # Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)  # how much the hidden layer weights contributed to the error
    d_hiddenlayer = EH * hiddengrad
    wout += hlayer_act.T.dot(d_output) * lr  # dot product of next-layer error and current-layer output
    wh += X.T.dot(d_hiddenlayer) * lr

    print("-----------Epoch-", i+1, "Starts----------")
    print("Input: \n" + str(X))
    print("Actual Output: \n" + str(y))
    print("Predicted Output: \n", output)
    print("-----------Epoch-", i+1, "Ends----------\n")

print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)

Output:
Epoch- 1 Starts
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.7014538 ]
[0.68028913]
[0.69778034]]
Epoch- 1 Ends

Epoch- 2 Starts
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.70594364]
[0.68437414]
[0.70224318]]
Epoch- 2 Ends

Epoch- 3 Starts
Input:

[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.71026427]
[0.68831216]
[0.70653908]]
Epoch- 3 Ends

Epoch- 4 Starts
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.71442409]
[0.69211026]
[0.71067628]]
Epoch- 4 Ends

Epoch- 5 Starts
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.71843105]
[0.69577512]
[0.71466255]]
Epoch- 5 Ends

Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]

[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.71843105]
[0.69577512]
[0.71466255]]
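
Note that the listing above updates only the weight matrices wh and wout; the bias vectors bh and bout are initialized but never adjusted. A common extension (an optional change; it would shift the printed numbers slightly) also applies the gradients to the biases inside the training loop:

# inside the training loop, after the weight updates:
bout += np.sum(d_output, axis=0, keepdims=True) * lr       # output-layer bias update
bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * lr    # hidden-layer bias update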

ADDITIONAL PROGRAMS:

1. Apply a Hierarchical Clustering algorithm to cluster a set of data stored in a .CSV file. Use the same data set for clustering using the k-means algorithm. Compare the results of these two algorithms and comment on the quality of clustering.

Source code:
Source code:
# Importing libraries
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score
from scipy.spatial import distance  # to calculate distances
from google.colab import files
from IPython.display import Image
import scipy.cluster.hierarchy as sch

#uploaded = files.upload()
print('HIERARCHICAL CLUSTERING ALGORITHM STEPS :')
Image('F037A50D-FDEF-4187-962D-CB74158F4FFC.png')
print('HIERARCHICAL CLUSTERING ALGORITHM TYPE :')
Image('207B916C-6140-4038-AA53-0027F4E66C3C.png')
print('DIFFERENT TYPES OF LINKAGE :')
Image('093C32DB-A817-4026-889C-CF26F7240891.jpeg')

# Let's take a small example
x_axis = np.array([1,2,3,18,19,20])
y_axis = np.array([1,2,3,18,19,20])
data = pd.DataFrame({'x':x_axis, 'y':y_axis})
plt.plot()
plt.xlim([0,21])
plt.ylim([0,21])
plt.title('Dataset')
plt.scatter(x_axis, y_axis)
plt.show()

print('DENDROGRAM EXAMPLE :')
Image('70CA363E-F7FA-4C9D-871F-4D3FC35EC0B7.png')
print('DENDROGRAM OVERVIEW :')
Image('EE76D8A0-92FA-4705-B1F6-748F38D83112.png')

# Dendrogram (Median Linkage)
Z = sch.linkage(data, method='median')
plt.figure(figsize=(10,4))
den = sch.dendrogram(Z)
plt.title('Dendrogram (Median Linkage) for the clustering of the dataset')
plt.xlabel('Data Points Number')
plt.ylabel('Euclidean distance in the space with other variables')

# Dendrogram (Complete Linkage)
Z = sch.linkage(data, method='complete')
plt.figure(figsize=(10,4))
den = sch.dendrogram(Z)
plt.title('Dendrogram (Complete Linkage) for the clustering of the dataset')
plt.xlabel('Data Points Number')
plt.ylabel('Euclidean distance in the space with other variables')

# Dendrogram (Average Linkage)
Z = sch.linkage(data, method='average')
plt.figure(figsize=(10,4))
den = sch.dendrogram(Z)
plt.title('Dendrogram (Average Linkage) for the clustering of the dataset')
plt.xlabel('Data Points Number')
plt.ylabel('Euclidean distance in the space with other variables')

# Building an Agglomerative Clustering model: initialise the model.
# We analyse the dendrograms created above and
# decide that we will be making 2 clusters for this dataset.
cluster_H = AgglomerativeClustering(n_clusters=2, linkage='average')

# Model fit
model_clt = cluster_H.fit(data)
print(model_clt)
print('\n')
data['clusters'] = model_clt.labels_
print('Clusters assigned to each datapoint, cluster = 2 :')
print(data['clusters'])

# Silhouette Score
data = pd.DataFrame({'x':x_axis, 'y':y_axis})
for k in range(2,6):  # maximum range is 6, as the dataset contains only 6 points
    cluster_H = AgglomerativeClustering(n_clusters=k, linkage='average')
    model_clt = cluster_H.fit(data)
    label = model_clt.labels_
    sil_coeff = silhouette_score(data, label, metric='euclidean')
    print('For cluster= {}, Silhouette Coefficient is {}'.format(k, sil_coeff))
print('\n')
print('For Cluster = 2, it has highest Silhouette Value. So Number of Clusters = 2')

# Let's take another example: the IRIS dataset
# Loading the dataset
iris = datasets.load_iris()
iris_data = pd.DataFrame(iris.data)
iris_data.columns = iris.feature_names
iris_data['Type'] = iris.target
iris_data.head()

iris_X = iris_data.iloc[:, [0, 1, 2, 3]].values
print(iris_X[:5,:])  # printing the first 5 rows
iris_Y = iris_data['Type']
iris_Y = np.array(iris_Y)
print(iris_Y)

# Frequency count of the output clusters
unique, counts = np.unique(iris_Y, return_counts=True)
freq_1 = dict(zip(unique, counts))
freq_1

# Filtering Setosa
Setosa = iris_data['Type'] == 0
print("Filtering Setosa, True means it is Setosa and False means non-Setosa")
print(Setosa.head())
print("Top 6 rows of Setosa")
Setosa_v2 = iris_data[Setosa]
print(Setosa_v2[Setosa_v2.columns[0:2]].head())
print("Last 6 rows of Setosa")
print(Setosa_v2[Setosa_v2.columns[0:2]].tail())

# Filtering Setosa for 2D plot
print("Setosa for 2D plot")
print("X axis points")
print(iris_X[iris_Y == 0, 0])
print("Y axis points")
print(iris_X[iris_Y == 0, 1])
print('\n')
# For Setosa in the Target column, i.e. iris_Y = 0
plt.scatter(iris_X[iris_Y == 0, 0], iris_X[iris_Y == 0, 1], s=80, c='orange', label='Iris-setosa')
plt.xlim([4.5,8])
plt.ylim([2,4.5])

# Filtering Versicolour
Versi = iris_data['Type'] == 1
print("Filtering Versicolour, True means it is Versicolour and False means non-Versicolour")
print(Versi.head())
print("Top 6 rows of Versicolour")
Versi_v2 = iris_data[Versi]
print(Versi_v2[Versi_v2.columns[0:2]].head())
print("Last 6 rows of Versicolour")
print(Versi_v2[Versi_v2.columns[0:2]].tail())

# Filtering Versicolour for 2D plot
print("Versicolour for 2D plot")
print("X axis points")
print(iris_X[iris_Y == 1, 0])
print("Y axis points")
print(iris_X[iris_Y == 1, 1])
print('\n')
plt.scatter(iris_X[iris_Y == 1, 0], iris_X[iris_Y == 1, 1], s=80, c='yellow', label='Iris-versicolour')
plt.xlim([4.5,8])
plt.ylim([2,4.5])

# Filtering Virginica
Virginica = iris_data['Type'] == 2
print("Filtering Virginica, True means it is Virginica and False means non-Virginica")
print(Virginica.head())
print("Top 6 rows of Virginica")
Virginica_v2 = iris_data[Virginica]
print(Virginica_v2[Virginica_v2.columns[0:2]].head())
print("Last 6 rows of Virginica")
print(Virginica_v2[Virginica_v2.columns[0:2]].tail())

# Filtering Virginica for 2D plot
print("Virginica for 2D plot")
print("X axis points")
print(iris_X[iris_Y == 2, 0])
print("Y axis points")
print(iris_X[iris_Y == 2, 1])
print('\n')
plt.scatter(iris_X[iris_Y == 2, 0], iris_X[iris_Y == 2, 1], s=80, c='green', label='Iris-virginica')
plt.xlim([4.5,8])
plt.ylim([2,4.5])

plt.scatter(iris_X[iris_Y == 0, 0], iris_X[iris_Y == 0, 1], s=80, c='orange', label='Iris-setosa')
plt.scatter(iris_X[iris_Y == 1, 0], iris_X[iris_Y == 1, 1], s=80, c='yellow', label='Iris-versicolour')
plt.scatter(iris_X[iris_Y == 2, 0], iris_X[iris_Y == 2, 1], s=80, c='green', label='Iris-virginica')
plt.legend()

iris_X_1 = iris_data[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
iris_X_1.head()

Z = sch.linkage(iris_X_1, method='median')
plt.figure(figsize=(20,7))
den = sch.dendrogram(Z)
plt.title('Dendrogram for the clustering of the dataset iris')
plt.xlabel('Type')
plt.ylabel('Euclidean distance in the space with other variables')

Z = sch.linkage(iris_X_1, method='single')
plt.figure(figsize=(20,7))
den = sch.dendrogram(Z)
plt.title('Dendrogram for the clustering of the dataset iris')
plt.xlabel('Type')
plt.ylabel('Euclidean distance in the space with other variables')

Z = sch.linkage(iris_X_1, method='complete')
plt.figure(figsize=(20,7))
den = sch.dendrogram(Z)
plt.title('Dendrogram for the clustering of the dataset iris')
plt.xlabel('Type')
plt.ylabel('Euclidean distance in the space with other variables')

Z = sch.linkage(iris_X_1, method='average')
plt.figure(figsize=(20,7))
den = sch.dendrogram(Z)
plt.title('Dendrogram for the clustering of the dataset iris')
plt.xlabel('Type')
plt.ylabel('Euclidean distance in the space with other variables')

cluster_H = AgglomerativeClustering(n_clusters=3, linkage='average')

# Fitting the model.
# After building the Agglomerative clustering model, we fit the iris data set.
# Note that only the independent variables from the iris dataset
# are taken into account for the purpose of clustering.
model_clt = cluster_H.fit(iris_X_1)
model_clt
print('Output clusters are')
pred1 = model_clt.labels_
print(pred1)

# Frequency count of the output clusters
unique, counts = np.unique(pred1, return_counts=True)
print(dict(zip(unique, counts)))
print('Original Cluster')
print(freq_1)

# Frequency count of the output clusters
unique, counts = np.unique(pred1, return_counts=True)
print('Hierarchical Clustering Output Cluster')
print(dict(zip(unique, counts)))

# Silhouette Score
print('Silhouette Score for 3 Clusters')
print(silhouette_score(iris_X, pred1))
print('\n')

# In the above output we got the labels '0', '1' and '2'.
# For a better understanding, we can visualize these clusters.
# We use the class labels found above and visualise how the clusters have formed.
plt.scatter(iris_X[pred1 == 0, 0], iris_X[pred1 == 0, 1], s=80, c='orange', label='Iris-setosa')
plt.scatter(iris_X[pred1 == 1, 0], iris_X[pred1 == 1, 1], s=80, c='yellow', label='Iris-versicolour')
plt.scatter(iris_X[pred1 == 2, 0], iris_X[pred1 == 2, 1], s=80, c='green', label='Iris-virginica')
plt.legend()

for k in range(2,10):
    cluster_H = AgglomerativeClustering(n_clusters=k, linkage='average')
    model_clt = cluster_H.fit(iris_X)
    label = model_clt.labels_
    sil_coeff = silhouette_score(iris_X, label, metric='euclidean')
    print('For cluster= {}, Silhouette Coefficient is {}'.format(k, sil_coeff))
print('\n')
print('For Cluster = 2, it has highest Silhouette Value')
print('But according to Visualization and data, Number of Clusters is 3')

Output:
Original Cluster
{0: 50, 1: 50, 2: 50}
Hierarchical Clustering Output Cluster
{0: 64, 1: 50, 2: 36}
Silhouette Score for 3 Clusters
0.5541608580282847
For cluster= 2, Silhouette Coefficient is 0.6867350732769776
For cluster= 3, Silhouette Coefficient is 0.5541608580282847
For cluster= 4, Silhouette Coefficient is 0.4719936084994249
For cluster= 5, Silhouette Coefficient is 0.4306699739542549

For cluster= 6, Silhouette Coefficient is 0.3419903827982995
For cluster= 7, Silhouette Coefficient is 0.3707424079292066
For cluster= 8, Silhouette Coefficient is 0.3658753388418643
For cluster= 9, Silhouette Coefficient is 0.3166806903618151
For Cluster = 2, it has highest Silhouette Value
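
The listing above covers only the hierarchical half of the task. A minimal sketch of the k-means side of the comparison (assuming iris_X_1 and pred1 from the code above) runs KMeans with 3 clusters on the same features and compares silhouette scores:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0)
pred_km = kmeans.fit_predict(iris_X_1)   # cluster labels from k-means
print('k-means silhouette:', silhouette_score(iris_X_1, pred_km))
print('hierarchical silhouette:', silhouette_score(iris_X_1, pred1))
# Comparing the two silhouette values (and the per-cluster counts) gives
# a concrete basis for commenting on the quality of the two clusterings.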

2. Write a Python Program to implement Logistic Regression.
Objective: To implement Logistic Regression.
Outcome: Students will be able to implement the Logistic Regression method.
Input: User_Data.csv
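
Logistic regression models the probability of the positive class with the sigmoid function, p = 1 / (1 + e^-(b0 + b1*x1 + b2*x2)) for the two features used here (Age and Estimated Salary); an observation is classified as 1 when p exceeds 0.5. The program below standardizes the features, fits this model, and evaluates it with a confusion matrix.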

Source code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv('...\\User_Data.csv')

# input
x = dataset.iloc[:, [2, 3]].values
# output
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.25, random_state = 0)

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(xtrain)
xtest = sc_x.transform(xtest)
print(xtrain[0:10, :])
Output :
[[ 0.58164944 -0.88670699]
[-0.60673761 1.46173768]
[-0.01254409 -0.5677824 ]
[-0.60673761 1.89663484]
[ 1.37390747 -1.40858358]
[ 1.47293972 0.99784738]
[ 0.08648817 -0.79972756]
[-0.01254409 -0.24885782]
[-0.21060859 -0.5677824 ]
[-0.21060859 -0.19087153]]

Here we can see that the Age and Estimated Salary feature values are scaled and now lie in the range -1 to 1. Hence, each feature will contribute equally to decision making, i.e. finalizing the hypothesis.

Finally, we are training our Logistic Regression model:


from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(xtrain, ytrain)
After training the model, it is time to use it to make predictions on the testing data.
y_pred = classifier.predict(xtest)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ytest, y_pred)
print ("Confusion Matrix : \n", cm)

Output:
Confusion Matrix :
[[65 3]
[ 8 24]]
Out of 100 test samples:
TruePositive + TrueNegative = 65 + 24 = 89 correct predictions
FalsePositive + FalseNegative = 3 + 8 = 11 incorrect predictions
Performance measure – Accuracy:
from sklearn.metrics import accuracy_score
print ("Accuracy : ", accuracy_score(ytest, y_pred))
Output:
Accuracy : 0.89
Visualizing the performance of our model.
from matplotlib.colors import ListedColormap
X_set, y_set = xtest, ytest
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1,
stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(
np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

