ML LAB MANUAL - BCSL606
DEPARTMENT OF
INFORMATION SCIENCE AND ENGINEERING
Prepared By:
1. PREREQUISITES:
2. BASE COURSE:
Machine Learning (BCS602)
3. COURSE OUTCOMES:
At the end of the course, the student will be able to:
CO1 Apply appropriate data sets to the machine learning algorithms to predict the target.
CO2 Analyze the machine learning algorithms for different numbers of training examples, various numbers of epochs and hyperparameters.
CO3 Evaluate machine learning algorithms to select the appropriate algorithm for a given problem in different contexts.
CO4 Create Python or Java programs to implement the A*, AO*, Find-S, Candidate Elimination, ID3, BPN, Naive Bayesian classifier, KNN and K-Means algorithms.
CO5 Use modern tools such as the Windows/Linux operating systems to develop and test machine learning programs using the Python/Java languages.
4. RESOURCES REQUIRED:
Hardware resources
Desktop PC
Windows / Linux operating system
Software resources
Python
Anaconda IDE
Datasets from standard repositories (Ex: https://archive.ics.uci.edu/ml/datasets.php)
6. GENERAL INSTRUCTIONS:
Implement the programs in a Python editor such as Spyder or Jupyter and demonstrate them.
7. CONTENTS:
1. Implement and evaluate AI and ML algorithms in the Python programming language.
2. Data sets can be taken from standard repositories or constructed by the students.
Exp. No. | Title of the Experiment | RBT Level | CO
1 | Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset. | L3 | 1,2,3,4
2 | Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset. | L3 | 1,2,3,4
3 | Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2. | L3 | 1,2,3,4
4 | For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples. | L3 | 1,2,3,4
5 | Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following on the generated dataset: (1) label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5) then xi ∈ Class1, else xi ∈ Class2; (2) classify the remaining points x51, ..., x100 using KNN, for k = 1, 2, 3, 4, 5, 20, 30. | L3 | 1,2,3,4
6 | Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs. | L3 | 1,2,3,4
7 | Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing dataset for Linear Regression and the Auto MPG dataset (for vehicle fuel efficiency prediction) for Polynomial Regression. | L3 | 1,2,3,4
8 | Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer data set for building the decision tree and apply this knowledge to classify a new sample. | L3 | 1,2,3,4
9 | Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face data set for training. Compute the accuracy of the classifier, considering a few test data sets. | L3 | 1,2,3,4
10 | Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result. | L3 | 1,2,3,4
Course outcomes: The students should be able to:
1. Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
2. Demonstrate similarity-based learning methods and perform regression analysis.
3. Apply appropriate data sets to the Machine Learning algorithms.
4. Identify and apply Machine Learning algorithms to solve real world problems.
8. REFERENCE:
1. https://www.drssridhar.com/?page_id=1053
2. https://www.universitiespress.com/resources?id=9789393330697
3. https://onlinecourses.nptel.ac.in/noc23_cs18/preview
C. EVALUATION SCHEME
General rubrics suggested for SEE are mentioned here: write-up 20%, conduction procedure and result 60%, and viva-voce 20% of the maximum marks. SEE for practical shall be evaluated for 100 marks and the scored marks shall be scaled down to 50 marks (however, based on course type, rubrics shall be decided by the examiners).
Experiment distribution
o For laboratories having only one part: Students are allowed to pick one
experiment from the lot with equal opportunity.
o For laboratories having PART A and PART B: Students are allowed to pick one
experiment from PART A and one experiment from PART B, with equal
opportunity.
Change of experiment is allowed only once and 15% of Marks allotted to the procedure
part are to be made zero.
Marks Distribution (subject to change in accordance with university regulations)
a) For laboratories having only one part – Write-up + Execution + Viva-Voce:
20+60+20 = 100 Marks
b) For laboratories having PART A and PART B
Procedure + Execution + Viva = 20 + 60 + 20 = 100 Marks
1. EXPERIMENT NO: 1
3. LEARNING OBJECTIVES:
4. AIM: Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify any
outliers. Use California Housing dataset.
5. THEORY
1. Longitude: A measure of how far west a house is; a higher value is farther west
2. Latitude: A measure of how far north a house is; a higher value is farther north
3. Housing Median Age: Median age of a house within a block; a lower number is a newer
building
4. Total Rooms: Total number of rooms within a block
5. Total Bedrooms: Total number of bedrooms within a block
6. Population: Total number of people residing within a block
7. Households: Total number of households, a group of people residing within a home unit, for a
block
8. Median Income: Median income for households within a block of houses (measured in tens of
thousands of US Dollars)
9. Median House Value: Median house value for households within a block (measured in US
Dollars)
10. Ocean Proximity: Location of the house w.r.t ocean/sea
The target variable is the median house value for California districts, expressed in hundreds of
thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census block group. A block
group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a
block group typically has a population of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average number of rooms and
bedrooms in this dataset are provided per household, these columns may take surprisingly large
values for block groups with few households and many empty houses, such as vacation resorts.
6. PROCEDURE / PROGRAMME:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a DataFrame (8 features + MedHouseVal target)
housing_df = fetch_california_housing(as_frame=True).frame
numerical_features = housing_df.select_dtypes(include=[np.number]).columns

# Plot histograms of every numerical feature
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(housing_df[feature], kde=True, bins=30, color='blue')
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

# Detect outliers in each feature with the 1.5 * IQR rule
print("Outliers Detection:")
outliers_summary = {}
for feature in numerical_features:
    Q1 = housing_df[feature].quantile(0.25)
    Q3 = housing_df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = housing_df[(housing_df[feature] < lower_bound) |
                          (housing_df[feature] > upper_bound)]
    outliers_summary[feature] = len(outliers)
    print(f"{feature}: {len(outliers)} outliers")

# Summary statistics of the dataset
print("\nDataset Summary:")
print(housing_df.describe())
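The listing above covers the histograms and the outlier counts; a minimal sketch of the box-plot portion required by the aim (reusing housing_df and numerical_features from the listing above) might be:

# Box plots for all numerical features to make the outliers visible
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=housing_df[feature], color='orange')
    plt.title(f'Box Plot of {feature}')
plt.tight_layout()
plt.show()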
OUTPUT:
Outliers Detection:
MedInc: 681 outliers
HouseAge: 0 outliers
AveRooms: 511 outliers
AveBedrms: 1424 outliers
Population: 1196 outliers
AveOccup: 711 outliers
Latitude: 0 outliers
Longitude: 0 outliers
MedHouseVal: 1071 outliers
Dataset Summary:
MedInc HouseAge AveRooms AveBedrms Population \
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744
std 1.899822 12.585558 2.474173 0.473911 1132.462122
min 0.499900 1.000000 0.846154 0.333333 3.000000
25% 2.563400 18.000000 4.440716 1.006079 787.000000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000
max 15.000100 52.000000 141.909091 34.066667 35682.000000
1. EXPERIMENT NO: 2
3. LEARNING OBJECTIVES:
4. AIM: Develop a program to compute the correlation matrix to understand the relationships
between pairs of features. Visualize the correlation matrix using a heatmap to know which variables
have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships
between features. Use California Housing dataset.
5. THEORY
1. Longitude: A measure of how far west a house is; a higher value is farther west
2. Latitude: A measure of how far north a house is; a higher value is farther north
3. Housing Median Age: Median age of a house within a block; a lower number is a newer building
4. Total Rooms: Total number of rooms within a block
5. Total Bedrooms: Total number of bedrooms within a block
6. Population: Total number of people residing within a block
7. Households: Total number of households, a group of people residing within a home unit, for a
block
8. Median Income: Median income for households within a block of houses (measured in tens of
thousands of US Dollars)
9. Median House Value: Median house value for households within a block (measured in US
Dollars)
10. Ocean Proximity: Location of the house w.r.t ocean/sea
The target variable is the median house value for California districts, expressed in hundreds of
thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census block group. A block
group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a
block group typically has a population of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average number of rooms and
bedrooms in this dataset are provided per household, these columns may take surprisingly large
values for block groups with few households and many empty houses, such as vacation resorts.
6. PROCEDURE / PROGRAMME:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
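The listing above stops at the imports. A minimal sketch of the remaining steps (assuming the scikit-learn California Housing frame; the sample size for the pair plot is an arbitrary choice to keep plotting fast) might look like:

housing_df = fetch_california_housing(as_frame=True).frame

# Correlation matrix of all numerical features
corr_matrix = housing_df.corr()

# Heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix - California Housing Dataset')
plt.show()

# Pair plot of pairwise feature relationships (a random sample keeps it fast)
sns.pairplot(housing_df.sample(500, random_state=42), diag_kind='kde')
plt.show()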
OUTPUT:
1. EXPERIMENT NO: 3
3. LEARNING OBJECTIVES:
4. AIM: Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
5. THEORY
The advancements in Data Science and Machine Learning have made it possible for us to
solve several complex regression and classification problems. However, the performance of all these
ML models depends on the data fed to them. Thus, it is imperative that we provide our ML models
with an optimal dataset. One might think that the more features we provide to our model, the better it becomes; however, this is not the case. If we feed our model a dataset with an excessively large number of features/columns, it gives rise to overfitting, wherein the model starts getting influenced by outlier values and noise. The difficulties that arise from working in such a high-dimensional feature space are collectively called the Curse of Dimensionality.
Model performance typically improves as dimensions are added only up to an optimal number of dimensions, beyond which it starts decreasing.
One of the most common ways to accomplish Dimensionality Reduction is Feature Extraction,
wherein we reduce the number of dimensions by mapping a higher dimensional feature space to a
lower-dimensional feature space. The most popular technique of Feature Extraction is Principal
Component Analysis (PCA).
As stated earlier, Principal Component Analysis is a technique of feature extraction that maps a
higher dimensional feature space to a lower-dimensional feature space. While reducing the number
of dimensions, PCA ensures that maximum information of the original dataset is retained in the
dataset with the reduced number of dimensions and that the correlation between the newly obtained Principal
Components is minimum. The new features obtained after applying PCA are called Principal
Components and are denoted as PCi (i=1,2,3…n). Here, (Principal Component-1) PC1 captures the
maximum information of the original dataset, followed by PC2, then PC3 and so on.
The following bar graph depicts the amount of Explained Variance captured by various Principal
Components. (The Explained Variance defines the amount of information captured by the Principal
Components).
6. PROCEDURE / PROGRAMME :
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
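Only the imports are shown above. A minimal sketch of the remaining steps (the plotting choices are illustrative) might be:

iris = load_iris()
X, y = iris.data, iris.target

# Reduce the 4 original features to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Scatter plot of the two principal components, coloured by species
for label, name in enumerate(iris.target_names):
    plt.scatter(X_pca[y == label, 0], X_pca[y == label, 1], label=name)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of the Iris Dataset (4 features reduced to 2)')
plt.legend()
plt.show()

print('Explained variance ratio:', pca.explained_variance_ratio_)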
OUTPUT:
1. EXPERIMENT NO: 4
3. LEARNING OBJECTIVES:
a. Make use of Data sets in implementing the machine learning algorithms.
b. Implement ML concepts and algorithms in Python
4. AIM: For a given set of training data examples stored in a .CSV file, implement and demonstrate
the Find-S algorithm to output a description of the set of all hypotheses consistent with the
training examples.
5. THEORY:
Find-S Algorithm
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x:
   For each attribute constraint a_i in h:
      If the constraint a_i in h is satisfied by x, then do nothing;
      Else replace a_i in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.
• Find-S is guaranteed to output the most specific hypothesis within H that is consistent with the positive training examples.
• Notice that negative examples are ignored.
DATA SETS
6. PROCEDURE / PROGRAMME :
import numpy as np
import pandas as pd
data = pd.read_csv('finds.csv')
print('Data', data)

def train(concepts, target):
    # Initialize with the first training example (assumed positive) as the most specific hypothesis
    specific_h = concepts[0]
    print('\nspecific1\n', specific_h)
    for i, h in enumerate(concepts):
        print('i', i)
        print('h', h)
        if target[i] == "Yes":            # negative examples are ignored
            for x in range(len(specific_h)):
                print('x', x)
                print('specific', specific_h)
                if h[x] == specific_h[x]:
                    pass
                else:
                    specific_h[x] = "?"   # generalize the attribute that disagrees
    return specific_h

concepts = np.array(data.iloc[:, 0:-1])
target = np.array(data.iloc[:, -1])
print('\nConcept\n', concepts)
print('Target', target)
print(train(concepts, target))
OUTPUT:
Concept
[['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
['Sunny' 'Warm' 'High' 'Strong' 'Warm' 'Same']
['Rainy' 'Cold' 'High' 'Strong' 'Warm' 'Change']
['Sunny' 'Warm' 'High' 'Strong' 'Cool' 'Change']]
Target ['Yes' 'Yes' 'No' 'Yes']
specific1
['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
i 0
h ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 0
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 1
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 2
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 3
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 4
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 5
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
i 1
h ['Sunny' 'Warm' 'High' 'Strong' 'Warm' 'Same']
x 0
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 1
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 2
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 3
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 4
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 5
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
i 2
h ['Rainy' 'Cold' 'High' 'Strong' 'Warm' 'Change']
i 3
h ['Sunny' 'Warm' 'High' 'Strong' 'Cool' 'Change']
x 0
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 1
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 2
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 3
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 4
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 5
specific ['Sunny' 'Warm' '?' 'Strong' '?' 'Same']
1. EXPERIMENT NO: 5
3. LEARNING OBJECTIVES:
4. AIM: Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following on the generated dataset:
1. Label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5), then xi ∈ Class1, else xi ∈ Class2
2. Classify the remaining points, x51, ..., x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30
5. THEORY:
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what is nearby. Imagine a streaming service wants to predict whether a new user is likely to cancel their subscription (churn) based on their age. It checks the ages of its existing users and whether they churned or stayed. If most of the K users closest in age to the new user canceled their subscription, KNN predicts that the new user might churn too. The key idea is that users with similar ages tend to behave similarly, and KNN uses this closeness to make decisions.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and performs the computation only at classification time.
As an example, consider a set of data points with two features, plotted with red diamonds for Category 1 and blue squares for Category 2 (figure not reproduced here). A new point is classified as Category 2 when most of its closest neighbours are blue squares; KNN assigns the category based on the majority of nearby points:
1. The red diamonds represent Category 1 and the blue squares represent Category 2.
2. The new data point checks its closest neighbours (the circled points).
3. Since the majority of its closest neighbours are blue squares (Category 2), KNN predicts that the new data point belongs to Category 2.
KNN works by using proximity and majority voting to make predictions.
In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the algorithm how
many nearby points (neighbours) to look at when it makes a decision.
Example:
Imagine you are trying to decide which fruit an unknown fruit is, based on its shape and size. You compare it with the k most similar fruits you already know and take the majority label.
Cross-Validation: A robust method for selecting the best k is cross-validation (note that the number of folds is unrelated to KNN's k). This involves splitting the data into several subsets, training the model on some subsets and testing it on the remaining one, and repeating this for each subset. The value of k that gives the highest average validation accuracy is usually the best choice.
Elbow Method: In the elbow method we plot the model's error rate or accuracy for different values of k. As we increase k, the error usually decreases at first; after a certain point it decreases much more slowly. The point where the curve forms an "elbow" is considered the best k.
Odd Values for k: It is also recommended to choose an odd value for k, especially in classification tasks, to avoid ties when deciding the majority class.
KNN uses distance metrics to identify the nearest neighbours; these neighbours are then used for classification and regression tasks.
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or
space. You can think of it like the shortest path you would walk if you were to go directly
from one point to another.
d(x, X_i) = √( Σ_{j=1}^{d} (x_j - X_ij)² )
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and
vertical lines (like a grid or city streets). It’s also called “taxicab distance” because a taxi can
only drive along the grid-like streets of a city.
d(x, y) = Σ_{i=1}^{n} |x_i - y_i|
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and
Manhattan distances as special cases.
d(x, y) = ( Σ_{i=1}^{n} |x_i - y_i|^p )^(1/p)
From the formula above we can say that when p = 2 then it is the same as the formula for the
Euclidean distance and when p = 1 then we obtain the formula for the Manhattan distance.
So, you can think of Minkowski as a flexible distance formula that can look like either
Manhattan or Euclidean distance depending on the value of p
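As a quick illustration (the two vectors are arbitrary), the three metrics can be computed with NumPy as follows:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # p = 2  -> about 3.61
manhattan = np.sum(np.abs(x - y))                   # p = 1  -> 5.0
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)   # general p -> about 3.27

print(euclidean, manhattan, minkowski)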
Thе K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity where it
predicts the label or value of a new data point by considering the labels or values of its K nearest
neighbors in the training dataset.
In regression, the algorithm still looks for the K closest points. But instead of voting for a
class in classification, it takes the average of the values of those K neighbors. This average
is the predicted value for the new point for the algorithm.
A typical illustration shows how a test point is classified based on its nearest neighbours: the algorithm identifies the closest k data points (k = 5 in that illustration) and assigns the test point the majority class label among them.
6. PROCEDURE / PROGRAMME :
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# Generate 100 random values in [0, 1]
data = np.random.rand(100)

# Label the first 50 points: Class1 if x <= 0.5, else Class2
labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

def knn_classifier(train_data, train_labels, test_point, k):
    # Distance of the test point to every training point (1-D data, so absolute difference)
    distances = [(abs(test_point - train_data[i]), train_labels[i])
                 for i in range(len(train_data))]
    distances.sort(key=lambda x: x[0])
    k_nearest_neighbors = distances[:k]
    k_nearest_labels = [label for _, label in k_nearest_neighbors]
    # Majority vote among the k nearest neighbours
    return Counter(k_nearest_labels).most_common(1)[0][0]

train_data = data[:50]
train_labels = labels
test_data = data[50:]

k_values = [1, 2, 3, 4, 5, 20, 30]
results = {}

for k in k_values:
    print(f"Results for k = {k}:")
    classified_labels = [knn_classifier(train_data, train_labels, test_point, k)
                         for test_point in test_data]
    results[k] = classified_labels
    for i, label in enumerate(classified_labels):
        print(f"Point x{i + 51} (value: {test_data[i]:.4f}) is classified as {label}")
    print("\n")

print("Classification complete.\n")

for k in k_values:
    classified_labels = results[k]
    class1_points = [test_data[i] for i in range(len(test_data))
                     if classified_labels[i] == "Class1"]
    class2_points = [test_data[i] for i in range(len(test_data))
                     if classified_labels[i] == "Class2"]

    plt.figure(figsize=(10, 6))
    plt.scatter(train_data, [0] * len(train_data),
                c=["blue" if label == "Class1" else "red" for label in train_labels],
                label="Training Data", marker="o")
    plt.scatter(class1_points, [1] * len(class1_points), c="blue",
                label="Class1 (Test)", marker="x")
    plt.scatter(class2_points, [1] * len(class2_points), c="red",
                label="Class2 (Test)", marker="x")
    plt.title(f"KNN Classification Results for k = {k}")
    plt.legend()
    plt.show()
OUTPUT:
Results for k = 1:
Results for k = 2:
Point x51 (value: 0.4186) is classified as Class1
Point x52 (value: 0.1913) is classified as Class1
Point x53 (value: 0.9719) is classified as Class2
Point x54 (value: 0.6504) is classified as Class2
Point x55 (value: 0.2149) is classified as Class1
Point x56 (value: 0.0625) is classified as Class1
Point x57 (value: 0.8785) is classified as Class2
Point x58 (value: 0.7059) is classified as Class2
Point x59 (value: 0.6395) is classified as Class2
Point x60 (value: 0.3241) is classified as Class1
Point x61 (value: 0.0987) is classified as Class1
Point x62 (value: 0.1907) is classified as Class1
Point x63 (value: 0.1081) is classified as Class1
Point x64 (value: 0.5463) is classified as Class2
Point x65 (value: 0.5245) is classified as Class2
Point x66 (value: 0.0095) is classified as Class1
Point x67 (value: 0.1940) is classified as Class1
Point x68 (value: 0.7450) is classified as Class2
Point x69 (value: 0.0305) is classified as Class1
Point x70 (value: 0.0046) is classified as Class1
Point x71 (value: 0.4473) is classified as Class1
Point x72 (value: 0.0449) is classified as Class1
Point x73 (value: 0.5532) is classified as Class2
Point x74 (value: 0.7819) is classified as Class2
Point x75 (value: 0.7890) is classified as Class2
Point x76 (value: 0.8762) is classified as Class2
Point x77 (value: 0.8628) is classified as Class2
Point x78 (value: 0.9900) is classified as Class2
Point x79 (value: 0.7665) is classified as Class2
Point x80 (value: 0.4851) is classified as Class1
Point x81 (value: 0.5881) is classified as Class2
Point x82 (value: 0.9204) is classified as Class2
Point x83 (value: 0.4165) is classified as Class1
Point x84 (value: 0.4188) is classified as Class1
Point x85 (value: 0.9696) is classified as Class2
Point x86 (value: 0.1754) is classified as Class1
Point x87 (value: 0.2621) is classified as Class1
Point x88 (value: 0.3443) is classified as Class1
Point x89 (value: 0.5252) is classified as Class2
Point x90 (value: 0.2649) is classified as Class1
Results for k = 3:
Point x51 (value: 0.4186) is classified as Class1
Point x52 (value: 0.1913) is classified as Class1
Point x53 (value: 0.9719) is classified as Class2
Point x54 (value: 0.6504) is classified as Class2
Point x55 (value: 0.2149) is classified as Class1
Point x56 (value: 0.0625) is classified as Class1
Point x57 (value: 0.8785) is classified as Class2
Point x58 (value: 0.7059) is classified as Class2
Point x59 (value: 0.6395) is classified as Class2
Point x60 (value: 0.3241) is classified as Class1
Point x61 (value: 0.0987) is classified as Class1
Point x62 (value: 0.1907) is classified as Class1
Point x63 (value: 0.1081) is classified as Class1
Point x64 (value: 0.5463) is classified as Class2
Point x65 (value: 0.5245) is classified as Class2
Point x66 (value: 0.0095) is classified as Class1
Point x67 (value: 0.1940) is classified as Class1
Point x68 (value: 0.7450) is classified as Class2
Point x69 (value: 0.0305) is classified as Class1
Point x70 (value: 0.0046) is classified as Class1
Point x71 (value: 0.4473) is classified as Class1
Point x72 (value: 0.0449) is classified as Class1
Point x73 (value: 0.5532) is classified as Class2
Point x74 (value: 0.7819) is classified as Class2
Point x75 (value: 0.7890) is classified as Class2
Point x76 (value: 0.8762) is classified as Class2
Point x77 (value: 0.8628) is classified as Class2
Point x78 (value: 0.9900) is classified as Class2
Point x79 (value: 0.7665) is classified as Class2
Point x80 (value: 0.4851) is classified as Class1
Point x81 (value: 0.5881) is classified as Class2
Point x82 (value: 0.9204) is classified as Class2
Point x83 (value: 0.4165) is classified as Class1
Point x84 (value: 0.4188) is classified as Class1
Results for k = 4:
Point x51 (value: 0.4186) is classified as Class1
Point x52 (value: 0.1913) is classified as Class1
Point x53 (value: 0.9719) is classified as Class2
Point x54 (value: 0.6504) is classified as Class2
Point x55 (value: 0.2149) is classified as Class1
Point x56 (value: 0.0625) is classified as Class1
Point x57 (value: 0.8785) is classified as Class2
Point x58 (value: 0.7059) is classified as Class2
Point x59 (value: 0.6395) is classified as Class2
Point x60 (value: 0.3241) is classified as Class1
Point x61 (value: 0.0987) is classified as Class1
Point x62 (value: 0.1907) is classified as Class1
Point x63 (value: 0.1081) is classified as Class1
Point x64 (value: 0.5463) is classified as Class2
Point x65 (value: 0.5245) is classified as Class2
Point x66 (value: 0.0095) is classified as Class1
Point x67 (value: 0.1940) is classified as Class1
Point x68 (value: 0.7450) is classified as Class2
Point x69 (value: 0.0305) is classified as Class1
Point x70 (value: 0.0046) is classified as Class1
Point x71 (value: 0.4473) is classified as Class1
Point x72 (value: 0.0449) is classified as Class1
Point x73 (value: 0.5532) is classified as Class2
Point x74 (value: 0.7819) is classified as Class2
Point x75 (value: 0.7890) is classified as Class2
Point x76 (value: 0.8762) is classified as Class2
Point x77 (value: 0.8628) is classified as Class2
Point x78 (value: 0.9900) is classified as Class2
Results for k = 5:
Point x51 (value: 0.4186) is classified as Class1
Point x52 (value: 0.1913) is classified as Class1
Point x53 (value: 0.9719) is classified as Class2
Point x54 (value: 0.6504) is classified as Class2
Point x55 (value: 0.2149) is classified as Class1
Point x56 (value: 0.0625) is classified as Class1
Point x57 (value: 0.8785) is classified as Class2
Point x58 (value: 0.7059) is classified as Class2
Point x59 (value: 0.6395) is classified as Class2
Point x60 (value: 0.3241) is classified as Class1
Point x61 (value: 0.0987) is classified as Class1
Point x62 (value: 0.1907) is classified as Class1
Point x63 (value: 0.1081) is classified as Class1
Point x64 (value: 0.5463) is classified as Class2
Point x65 (value: 0.5245) is classified as Class2
Point x66 (value: 0.0095) is classified as Class1
Point x67 (value: 0.1940) is classified as Class1
Point x68 (value: 0.7450) is classified as Class2
Point x69 (value: 0.0305) is classified as Class1
Point x70 (value: 0.0046) is classified as Class1
Point x71 (value: 0.4473) is classified as Class1
Point x72 (value: 0.0449) is classified as Class1
Classification complete.
1. EXPERIMENT NO: 6
3. LEARNING OBJECTIVES:
1. Make use of Data sets in implementing the machine learning algorithms.
2. Implement ML concepts and algorithms in Python
4. AIM: Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.
5. THEORY:
Given a dataset X, y, we attempt to find a linear model h(x) that minimizes residual
sum of squared errors. The solution is given by Normal equations.
Linear model can only fit a straight line, however, it can be empowered by polynomial
features to get more powerful models. Still, we have to decide and fix the number and
types of features ahead.
Alternate approach is given by locally weighted regression.
Given a dataset X, y, we attempt to find a model h(x) that minimizes residual
sum of weighted squared errors.
The weights are given by a kernel function, which can be chosen arbitrarily; here a Gaussian kernel is used.
The solution is very similar to Normal equations, we only need to insert diagonal
weight matrix W.
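For reference, the quantities described above can be written explicitly (a standard formulation; x0 denotes the query point and k the kernel bandwidth used in the program below):

w_i = exp( -(x_i - x0)² / (2k²) )        (Gaussian kernel weight of training point x_i)
W = diag(w_1, ..., w_m)                  (diagonal weight matrix)
β(x0) = (XᵀWX)⁻¹ XᵀWy                    (weighted normal equations)
ŷ(x0) = x0 · β(x0)                       (local prediction at the query point)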
Sample rows from the tips dataset (Program6_dataset_tips.csv):

total_bill  tip    sex     smoker  day  time    size
16.99       1.01   Female  No      Sun  Dinner  2
10.34       1.66   Male    No      Sun  Dinner  3
21.01       3.50   Male    No      Sun  Dinner  3
23.68       3.31   Male    No      Sun  Dinner  2
24.59       3.61   Female  No      Sun  Dinner  4
25.29       4.71   Male    No      Sun  Dinner  4
8.77        2.00   Male    No      Sun  Dinner  2
26.88       3.12   Male    No      Sun  Dinner  4
15.04       1.96   Male    No      Sun  Dinner  2
14.78       3.23   Male    No      Sun  Dinner  2
10.27       1.71   Male    No      Sun  Dinner  2
35.26       5.00   Female  No      Sun  Dinner  4
15.42       1.57   Male    No      Sun  Dinner  2
18.43       3.00   Male    No      Sun  Dinner  4
14.83       3.02   Female  No      Sun  Dinner  2
21.58       3.92   Male    No      Sun  Dinner  2
6. PROCEDURE / PROGRAMME
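The listing below calls two helper functions, kernel and localWeight, that are not reproduced in this manual, and assumes the usual imports. A minimal sketch of these missing pieces, written to match the theory above (Gaussian kernel weights and the weighted normal equations), is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def kernel(point, xmat, k):
    # Gaussian kernel: the weight of each training point falls off with its distance from the query point
    m, n = np.shape(xmat)
    weights = np.mat(np.eye(m))
    for j in range(m):
        diff = point - xmat[j]
        weights[j, j] = np.exp(diff * diff.T / (-2.0 * k ** 2))
    return weights

def localWeight(point, xmat, ymat, k):
    # Weighted normal equations: beta = (X^T W X)^-1 X^T W y
    wei = kernel(point, xmat, k)
    W = (xmat.T * (wei * xmat)).I * (xmat.T * (wei * ymat.T))
    return W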
def localWeightRegression(xmat, ymat, k):
    m, n = np.shape(xmat)
    ypred = np.zeros(m)
    for i in range(m):
        ypred[i] = xmat[i] * localWeight(xmat[i], xmat, ymat, k)
    return ypred

def graphPlot(X, ypred):
    sortindex = X[:, 1].argsort(0)  # argsort gives the indices that sort by bill amount
    xsort = X[sortindex][:, 0]
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(bill, tip, color='green')
    ax.plot(xsort[:, 1], ypred[sortindex], color='red', linewidth=5)
    plt.xlabel('Total bill')
    plt.ylabel('Tip')
    plt.show()

# load data points
data = pd.read_csv('Program6_dataset_tips.csv')
bill = np.array(data.total_bill)  # we use only the bill amount and tip columns
tip = np.array(data.tip)
mbill = np.mat(bill)  # np.mat converts the 1-D array into a 2-D matrix
mtip = np.mat(tip)
m = np.shape(mbill)[1]
one = np.mat(np.ones(m))
X = np.hstack((one.T, mbill.T))  # 244 rows, 2 cols; increase k to get a smoother curve
ypred = localWeightRegression(X, mtip, 9)
graphPlot(X, ypred)
OUTPUT:
1. EXPERIMENT NO: 7
3. LEARNING OBJECTIVES:
1. Make use of Data sets in implementing the machine learning algorithms.
2. Implement ML concepts and algorithms in Python
4. AIM: Develop a program to demonstrate the working of Linear Regression and Polynomial
Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for
vehicle fuel efficiency prediction) for Polynomial Regression.
5. THEORY:
6. PROCEDURE / PROGRAMME
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression_california():
    housing = fetch_california_housing(as_frame=True)
    X = housing.data[["AveRooms"]]
    y = housing.target
    # Train/test split (the 80/20 split and random_state are assumed values)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.plot(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Average number of rooms (AveRooms)")
    plt.ylabel("Median value of homes ($100,000)")
    plt.title("Linear Regression - California Housing Dataset")
    plt.legend()
    plt.show()

    print("Linear Regression - MSE:", mean_squared_error(y_test, y_pred))
    print("Linear Regression - R^2:", r2_score(y_test, y_pred))

def polynomial_regression_auto_mpg():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
    # The file ends each row with a quoted car-name field; naming it keeps the numeric columns aligned
    column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                    "acceleration", "model_year", "origin", "car_name"]
    data = pd.read_csv(url, sep='\s+', names=column_names, na_values="?")
    data = data.dropna()

    X = data["displacement"].values.reshape(-1, 1)
    y = data["mpg"].values
    # Train/test split (the 80/20 split and random_state are assumed values)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    poly_model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
    poly_model.fit(X_train, y_train)
    y_pred = poly_model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.scatter(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Displacement")
    plt.ylabel("Miles per gallon (mpg)")
    plt.title("Polynomial Regression - Auto MPG Dataset")
    plt.legend()
    plt.show()

    print("Polynomial Regression - MSE:", mean_squared_error(y_test, y_pred))
    print("Polynomial Regression - R^2:", r2_score(y_test, y_pred))

linear_regression_california()
polynomial_regression_auto_mpg()
OUTPUT
1. EXPERIMENT NO: 8
3. LEARNING OBJECTIVES:
1. Make use of Data sets in implementing the machine learning algorithms.
2. Implement ML concepts and algorithms in Python
4. AIM: Develop a program to demonstrate the working of the decision tree algorithm. Use Breast
Cancer Data set for building the decision tree and apply this knowledge to classify a new sample.
5. THEORY: Decision Tree
A decision tree is a simple diagram that shows different choices and their possible results, helping you make decisions easily. This section covers what decision trees are, how they work, their advantages and disadvantages, and their applications.
A decision tree is a graphical representation of different options for solving a problem and shows how different factors are related. It has a hierarchical tree structure that starts with one main question at the top, called a node, which branches out into different possible outcomes, where:
Root Node: the starting point that represents the entire dataset.
Branches: the lines that connect nodes; they show the flow from one decision to another.
Internal Nodes: points where decisions are made based on the input features.
Leaf Nodes: the terminal nodes at the end of branches that represent final outcomes or predictions.
They also support decision-making by visualizing outcomes. You can quickly evaluate
and compare the “branches” to determine which course of action is best for you.
Now, let us take an example to understand the decision tree. Imagine you want to decide whether to drink coffee based on the time of day and how tired you feel. First the tree checks the time of day; if it is morning, it asks whether you are tired. If you are tired, the tree suggests drinking coffee; if not, it says there is no need. Similarly, in the afternoon the tree again asks whether you are tired: if you are, it recommends drinking coffee; if not, it concludes no coffee is needed.
We have mainly two types of decision tree based on the nature of the target variable:
classification trees and regression trees.
Classification trees: They are designed to predict categorical outcomes means they classify data
into different classes. They can determine whether an email is “spam” or “not spam” based on
various features of the email.
Regression trees: These are used when the target variable is continuous. They predict numerical values rather than categories. For example, a regression tree can estimate the price of a house based on its size, location, and other features.
A decision tree starts working from a main question known as the root node. This question is derived from the features of the dataset and serves as the starting point for decision-making.
From the root node, the tree asks a series of yes/no questions. Each question is designed to
split the data into subsets based on specific attributes. For example if the first question is “Is it
raining?” the answer will determine which branch of the tree to follow. Depending on the
response to each question you follow different branches. If your answer is “Yes,” you might
proceed down one path if “No,” you will take another path.
This branching continues through a sequence of decisions. As you follow each branch,
you get more questions that break the data into smaller groups. This step-by-step process
continues until you have no more helpful questions.
You reach the end of a branch where you find the final outcome or decision; it could be a class label (for classification trees) or a numerical value (for regression trees). Decision trees do, however, have some limitations:
Overfitting: Overfitting occurs when a decision tree captures noise and details in the training data and therefore performs poorly on new data.
Instability: Instability means that the model can be unreliable; slight variations in the input can lead to significant differences in predictions.
Bias towards Features with More Levels: Decision trees can become biased towards features with many categories, focusing too much on them during decision-making. This can cause the model to miss other important features, leading to less accurate predictions.
6. PROCEDURE / PROGRAMME
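Only the tree-plotting portion of the program is reproduced below. A minimal sketch of the setup it assumes (loading the data, training a DecisionTreeClassifier, and classifying one held-out sample as the "new sample" required by the aim; the split ratio and random_state are assumptions) is:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
print("Accuracy on test data:", accuracy_score(y_test, clf.predict(X_test)))

# Classify a "new" sample (here: the first held-out test sample, for illustration)
new_sample = X_test[0].reshape(1, -1)
prediction = clf.predict(new_sample)
print("Predicted class for the new sample:", data.target_names[prediction][0])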
X = data.data
y = data.target
plt.figure(figsize=(12,8))
tree.plot_tree(clf, filled=True,
feature_names=data.feature_names, class_names=data.target_names)
plt.title("Decision Tree - Breast Cancer Dataset")
plt.show()
OUTPUT:
1. EXPERIMENT NO: 9
3. LEARNING OBJECTIVES:
1. Make use of Data sets in implementing the machine learning algorithms.
2. Implement ML concepts and algorithms in Python
4. AIM: Develop a program to implement the Naive Bayesian classifier considering Olivetti Face
Data set for training. Compute the accuracy of the classifier, considering a few test data sets.
5. THEORY:
The main idea behind the Naive Bayes classifier is to use Bayes’ Theorem to classify data
based on the probabilities of different classes given the features of the data. It is used mostly
in high-dimensional text classification.
The Naive Bayes classifier is a simple probabilistic classifier with a small number of parameters, which makes it possible to build ML models that predict faster than many other classification algorithms.
It is called "naive" because it assumes that each feature in the model is independent of the other features. In other words, each feature contributes to the prediction with no relation to the others.
The Naive Bayes algorithm is used in spam filtering, sentiment analysis, classifying articles and many more applications.
The "Bayes" part of the name refers to its basis in Bayes' Theorem.
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. More specifically:
Feature independence: This means that when we are trying to classify something, we
assume that each feature (or piece of information) in the data does not affect any other
feature.
Continuous features are normally distributed: If a feature is continuous, then it is
assumed to be normally distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it is
assumed to have a multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to the
prediction of the class label.
No missing data: The data should not contain any missing values.
The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption is never exactly correct, but it often works well in practice. Before moving to the formula for Naive Bayes, it is important to recall Bayes' theorem.
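For reference, Bayes' theorem and the resulting Naive Bayes decision rule can be written as (standard formulation):

P(y | X) = P(X | y) · P(y) / P(X)

and, with the feature-independence assumption, for features x1, ..., xn:

P(y | x1, ..., xn) ∝ P(y) · Π_{i=1}^{n} P(xi | y)

The predicted class is the value of y that maximizes this product; in Gaussian Naive Bayes each P(xi | y) is modelled by a normal distribution fitted per class.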
Gaussian Naive Bayes: In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution; when plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.
Multinomial Naive Bayes: Multinomial Naive Bayes is used when features represent the frequency
of terms (such as word counts) in a document. It is commonly applied in text classification, where
term frequencies are important.
Bernoulli Naive Bayes: Bernoulli Naive Bayes deals with binary features, where each feature
indicates whether a word appears or not in a document. It is suited for scenarios where the presence
or absence of terms is more relevant than their frequency. Both models are widely used in document
classification tasks.
6. PROCEDURE / PROGRAMME:
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
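# NOTE: the original listing omits the data loading and train/test split assumed by the
# code below; a minimal sketch (the split ratio and random_state are assumptions):
faces = fetch_olivetti_faces(shuffle=True, random_state=42)
X, y = faces.data, faces.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
# Accuracy can also be reported once y_pred is computed below, for example:
# print("Accuracy:", accuracy_score(y_test, y_pred))
# print("Cross-validated accuracy:", cross_val_score(GaussianNB(), X, y, cv=5).mean())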
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=1))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
plt.show()
OUTPUT:
Classification Report:
precision recall f1-score support
[[2 0 0 ... 0 0 0]
[0 2 0 ... 0 0 0]
[0 0 2 ... 0 0 1]
...
[0 0 0 ... 1 0 0]
[0 0 0 ... 0 3 0]
[0 0 0 ... 0 0 5]]
1. EXPERIMENT NO: 10
3. LEARNING OBJECTIVES:
1. Make use of Data sets in implementing the machine learning algorithms.
2. Implement ML concepts and algorithms in Python
4. AIM: Develop a program to implement k-means clustering using Wisconsin Breast Cancer data
set and visualize the clustering result.
5. THEORY: k-means clustering
K-means clustering is a technique used to organize data into groups based on their similarity. For example, an online store can use K-Means to group customers based on purchase frequency and spending, creating segments like Budget Shoppers, Frequent Buyers and Big Spenders for personalized marketing.
The algorithm works by first randomly picking some central points called centroids; each data point is then assigned to the closest centroid, forming a cluster. After all the points are assigned to a cluster, the centroids are updated by computing the average position of the points in each cluster. This process repeats until the centroids stop changing, which yields the final clusters. The goal of clustering is to divide the data points into clusters so that similar data points belong to the same group.
We are given a data set of items with certain features and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the K-
means algorithm. ‘K’ in the name of the algorithm represents the number of groups/clusters we
want to classify our items into.
The algorithm will categorize the items into k groups or clusters of similarity. To calculate
that similarity, we will use the Euclidean distance as a measurement. The algorithm works as
follows:
1. First, we randomly initialize k points, called means or cluster centroids.
2. We categorize each item to its closest mean, and we update the mean’s coordinates,
which are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our
clusters.
The "points" mentioned above are called means because they are the mean values of the items assigned to them. To initialize these means, we have several options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize them at random values between the boundaries of the data set: for example, if a feature x takes values in [0, 3], we initialize the means with values for x drawn from [0, 3]. A minimal from-scratch sketch of these steps follows below.
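A minimal from-scratch sketch of the steps described above (illustrative only; the experiment itself uses scikit-learn's KMeans, and the example data is made up):

import numpy as np

def kmeans_simple(X, k, iters=100, seed=0):
    # 1. Randomly initialize k means by picking k items from the data set
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign each item to its closest mean (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        clusters = dists.argmin(axis=1)
        # Update each mean to the average of the items assigned to it
        new_means = np.array([X[clusters == j].mean(axis=0) if np.any(clusters == j)
                              else means[j] for j in range(k)])
        # 3. Stop when the means no longer change (or after the given number of iterations)
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, clusters

# Example: two obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
means, clusters = kmeans_simple(X, k=2)
print(means)
print(clusters)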
6. PROCEDURE / PROGRAMME
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix,
classification_report
data = load_breast_cancer()
X = data.data
y = data.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
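# NOTE: the original listing omits fitting the clustering model assumed by the code
# below; a minimal sketch (n_clusters=2 matches the two diagnosis classes;
# random_state and n_init are assumptions):
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X_scaled)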
print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
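# NOTE: the DataFrame used by the plots below is not constructed in the original
# listing; a minimal sketch:
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y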
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster',
palette='Set1', s=100, edgecolor='black', alpha=0.7)
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label',
palette='coolwarm', s=100, edgecolor='black', alpha=0.7)
plt.title('True Labels of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="True Label")
plt.show()
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster',
palette='Set1', s=100, edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red',
marker='X', label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()
OUTPUT:
Confusion Matrix:
[[175 37]
[ 13 344]]
Classification Report:
precision recall f1-score support