ML LAB MANUAL - BCSL606
DEPARTMENT OF
INFORMATION SCIENCE AND ENGINEERING
Prepared By:
1. PREREQUISITES:
2. BASE COURSE:
Machine Learning (BCS602)
3. COURSE OUTCOMES:
At the end of the course, the student will be able to:
CO1 Apply appropriate data sets to the machine learning algorithms to predict the target.
CO2 Analyze the machine learning algorithms for different numbers of training examples, various numbers of epochs and hyperparameters.
CO3 Evaluate machine learning algorithms to select the appropriate algorithm for a given problem in different contexts.
CO4 Create Python or Java programs to implement the A*, AO*, Find-S, Candidate Elimination, ID3, BPN, Naive Bayesian classifier, KNN and K-Means algorithms.
CO5 Use modern tools such as the Windows/Linux operating systems to develop and test machine learning programs using the Python/Java languages.
4. RESOURCES REQUIRED:
Hardware resources
Desktop PC
Windows / Linux operating system
Software resources
Python
Anaconda IDE
Datasets from standard repositories (Ex: https://archive.ics.uci.edu/ml/datasets.php)
6. GENERAL INSTRUCTIONS:
Implement the programs in a Python editor such as Spyder or Jupyter and demonstrate them.
7. CONTENTS:
1. Implement and evaluate AI and ML algorithms in the Python programming language.
2. Data sets can be taken from standard repositories or constructed by the students.
Exp. No. | Title of the Experiment | RBT Level | CO
1 | Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset. | L3 | 1,2,3,4
2 | Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset. | L3 | 1,2,3,4
3 | Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2. | L3 | 1,2,3,4
4 | For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples. | L3 | 1,2,3,4
5 | Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following on the generated dataset: (1) label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5) then xi ∈ Class1, else xi ∈ Class2; (2) classify the remaining points x51, ..., x100 using KNN, for k = 1, 2, 3, 4, 5, 20, 30. | L3 | 1,2,3,4
6 | Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs. | L3 | 1,2,3,4
7 | Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing dataset for Linear Regression and the Auto MPG dataset (for vehicle fuel efficiency prediction) for Polynomial Regression. | L3 | 1,2,3,4
8 | Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer data set for building the decision tree and apply this knowledge to classify a new sample. | L3 | 1,2,3,4
9 | Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face data set for training. Compute the accuracy of the classifier, considering a few test data sets. | L3 | 1,2,3,4
10 | Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result. | L3 | 1,2,3,4
Course outcomes: The students should be able to:
1. Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
2. Demonstrate similarity-based learning methods and perform regression analysis.
3. Apply appropriate data sets to the Machine Learning algorithms.
4. Identify and apply Machine Learning algorithms to solve real world problems.
8. REFERENCE:
1. https://www.drssridhar.com/?page_id=1053
2. https://www.universitiespress.com/resources?id=9789393330697
3. https://onlinecourses.nptel.ac.in/noc23_cs18/preview
C. EVALUATION SCHEME
General rubrics suggested for SEE are mentioned here: write-up 20%, conduction procedure and result 60%, and viva-voce 20% of the maximum marks. SEE for practical shall be evaluated for 100 marks and the scored marks shall be scaled down to 50 marks (however, based on course type, rubrics shall be decided by the examiners).
Experiment distribution
o For laboratories having only one part: Students are allowed to pick one
experiment from the lot with equal opportunity.
o For laboratories having PART A and PART B: Students are allowed to pick one
experiment from PART A and one experiment from PART B, with equal
opportunity.
Change of experiment is allowed only once and 15% of Marks allotted to the procedure
part are to be made zero.
Marks Distribution (subject to change in accordance with university regulations)
a) For laboratories having only one part – Write-up + Execution + Viva-Voce:
20+60+20 = 100 Marks
b) For laboratories having PART A and PART B
Procedure + Execution + Viva = 20 + 60 + 20 = 100 Marks
1. EXPERIMENT NO: 1
3. LEARNING OBJECTIVES:
4. AIM: Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify any
outliers. Use California Housing dataset.
5. THEORY
1. Longitude: A measure of how far west a house is; a higher value is farther west
2. Latitude: A measure of how far north a house is; a higher value is farther north
3. Housing Median Age: Median age of a house within a block; a lower number is a newer
building
4. Total Rooms: Total number of rooms within a block
5. Total Bedrooms: Total number of bedrooms within a block
6. Population: Total number of people residing within a block
7. Households: Total number of households, a group of people residing within a home unit, for a
block
8. Median Income: Median income for households within a block of houses (measured in tens of
thousands of US Dollars)
9. Median House Value: Median house value for households within a block (measured in US
Dollars)
10. Ocean Proximity: Location of the house w.r.t ocean/sea
The target variable is the median house value for California districts, expressed in hundreds of
thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census block group. A block
group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a
block group typically has a population of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average number of rooms and
bedrooms in this dataset are provided per household, these columns may take surprisingly large
values for block groups with few households and many empty houses, such as vacation resorts.
6. PROCEDURE / PROGRAMME:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a DataFrame (8 features + MedHouseVal target)
housing_df = fetch_california_housing(as_frame=True).frame
numerical_features = housing_df.select_dtypes(include=[np.number]).columns

# Plot histograms of every numerical feature
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(housing_df[feature], kde=True, bins=30, color='blue')
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

# Detect outliers in each feature with the 1.5 * IQR rule
print("Outliers Detection:")
outliers_summary = {}
for feature in numerical_features:
    Q1 = housing_df[feature].quantile(0.25)
    Q3 = housing_df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = housing_df[(housing_df[feature] < lower_bound) |
                          (housing_df[feature] > upper_bound)]
    outliers_summary[feature] = len(outliers)
    print(f"{feature}: {len(outliers)} outliers")

# Summary statistics of the dataset
print("\nDataset Summary:")
print(housing_df.describe())
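The listing above covers the histograms and the outlier counts; a minimal sketch of the box-plot portion required by the aim (reusing housing_df and numerical_features from the listing above) might be:

# Box plots for all numerical features to make the outliers visible
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=housing_df[feature], color='orange')
    plt.title(f'Box Plot of {feature}')
plt.tight_layout()
plt.show()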
OUTPUT:
Outliers Detection:
MedInc: 681 outliers
HouseAge: 0 outliers
AveRooms: 511 outliers
AveBedrms: 1424 outliers
Population: 1196 outliers
AveOccup: 711 outliers
Latitude: 0 outliers
Longitude: 0 outliers
MedHouseVal: 1071 outliers
Dataset Summary:
MedInc HouseAge AveRooms AveBedrms Population \
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744
std 1.899822 12.585558 2.474173 0.473911 1132.462122
min 0.499900 1.000000 0.846154 0.333333 3.000000
25% 2.563400 18.000000 4.440716 1.006079 787.000000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000
max 15.000100 52.000000 141.909091 34.066667 35682.000000
1. EXPERIMENT NO: 2
3. LEARNING OBJECTIVES:
4. AIM: Develop a program to compute the correlation matrix to understand the relationships
between pairs of features. Visualize the correlation matrix using a heatmap to know which variables
have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships
between features. Use California Housing dataset.
5. THEORY
1. Longitude: A measure of how far west a house is; a higher value is farther west
2. Latitude: A measure of how far north a house is; a higher value is farther north
3. Housing Median Age: Median age of a house within a block; a lower number is a newer building
4. Total Rooms: Total number of rooms within a block
5. Total Bedrooms: Total number of bedrooms within a block
6. Population: Total number of people residing within a block
7. Households: Total number of households, a group of people residing within a home unit, for a
block
8. Median Income: Median income for households within a block of houses (measured in tens of
thousands of US Dollars)
9. Median House Value: Median house value for households within a block (measured in US
Dollars)
10. Ocean Proximity: Location of the house w.r.t ocean/sea
The target variable is the median house value for California districts, expressed in hundreds of
thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census block group. A block
group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a
block group typically has a population of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average number of rooms and
bedrooms in this dataset are provided per household, these columns may take surprisingly large
values for block groups with few households and many empty houses, such as vacation resorts.
6. PROCEDURE / PROGRAMME:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
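The listing above stops at the imports. A minimal sketch of the remaining steps (assuming the scikit-learn California Housing frame; the sample size for the pair plot is an arbitrary choice to keep plotting fast) might look like:

housing_df = fetch_california_housing(as_frame=True).frame

# Correlation matrix of all numerical features
corr_matrix = housing_df.corr()

# Heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix - California Housing Dataset')
plt.show()

# Pair plot of pairwise feature relationships (a random sample keeps it fast)
sns.pairplot(housing_df.sample(500, random_state=42), diag_kind='kde')
plt.show()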
OUTPUT:
1. EXPERIMENT NO: 3
3. LEARNING OBJECTIVES:
4. AIM: Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
5. THEORY
The advancements in Data Science and Machine Learning have made it possible for us to
solve several complex regression and classification problems. However, the performance of all these
ML models depends on the data fed to them. Thus, it is imperative that we provide our ML models
with an optimal dataset. One might think that the more features we provide to our model, the better it becomes; however, this is not the case. If we feed our model a dataset with an excessively large number of features/columns, it gives rise to overfitting, wherein the model starts getting influenced by outlier values and noise. The difficulties that arise from working in such a high-dimensional feature space are collectively called the Curse of Dimensionality.
Model performance typically improves as dimensions are added only up to an optimal number of dimensions, beyond which it starts decreasing.
One of the most common ways to accomplish Dimensionality Reduction is Feature Extraction,
wherein we reduce the number of dimensions by mapping a higher dimensional feature space to a
lower-dimensional feature space. The most popular technique of Feature Extraction is Principal
Component Analysis (PCA).
As stated earlier, Principal Component Analysis is a technique of feature extraction that maps a
higher dimensional feature space to a lower-dimensional feature space. While reducing the number
of dimensions, PCA ensures that maximum information of the original dataset is retained in the
dataset with the reduced number of dimensions and that the correlation between the newly obtained Principal
Components is minimum. The new features obtained after applying PCA are called Principal
Components and are denoted as PCi (i=1,2,3…n). Here, (Principal Component-1) PC1 captures the
maximum information of the original dataset, followed by PC2, then PC3 and so on.
The following bar graph depicts the amount of Explained Variance captured by various Principal
Components. (The Explained Variance defines the amount of information captured by the Principal
Components).
6. PROCEDURE / PROGRAMME :
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
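Only the imports are shown above. A minimal sketch of the remaining steps (the plotting choices are illustrative) might be:

iris = load_iris()
X, y = iris.data, iris.target

# Reduce the 4 original features to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Scatter plot of the two principal components, coloured by species
for label, name in enumerate(iris.target_names):
    plt.scatter(X_pca[y == label, 0], X_pca[y == label, 1], label=name)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of the Iris Dataset (4 features reduced to 2)')
plt.legend()
plt.show()

print('Explained variance ratio:', pca.explained_variance_ratio_)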
OUTPUT:
1. EXPERIMENT NO: 4
3. LEARNING OBJECTIVES:
a. Make use of Data sets in implementing the machine learning algorithms.
b. Implement ML concepts and algorithms in Python
4. AIM: For a given set of training data examples stored in a .CSV file, implement and demonstrate
the Find-S algorithm to output a description of the set of all hypotheses consistent with the
training examples.
5. THEORY:
Find-S Algorithm
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x:
   For each attribute constraint a_i in h:
      If the constraint a_i in h is satisfied by x, then do nothing;
      Else replace a_i in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.
• Find-S is guaranteed to output the most specific hypothesis within H that is consistent with the positive training examples.
• Notice that negative examples are ignored.
DATA SETS
6. PROCEDURE / PROGRAMME :
import numpy as np
import pandas as pd
data = pd.read_csv('finds.csv')
print('Data', data)

def train(concepts, target):
    # Initialize with the first training example (assumed positive) as the most specific hypothesis
    specific_h = concepts[0]
    print('\nspecific1\n', specific_h)
    for i, h in enumerate(concepts):
        print('i', i)
        print('h', h)
        if target[i] == "Yes":            # negative examples are ignored
            for x in range(len(specific_h)):
                print('x', x)
                print('specific', specific_h)
                if h[x] == specific_h[x]:
                    pass
                else:
                    specific_h[x] = "?"   # generalize the attribute that disagrees
    return specific_h

concepts = np.array(data.iloc[:, 0:-1])
target = np.array(data.iloc[:, -1])
print('\nConcept\n', concepts)
print('Target', target)
print(train(concepts, target))
OUTPUT:
Concept
[['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
['Sunny' 'Warm' 'High' 'Strong' 'Warm' 'Same']
['Rainy' 'Cold' 'High' 'Strong' 'Warm' 'Change']
['Sunny' 'Warm' 'High' 'Strong' 'Cool' 'Change']]
Target ['Yes' 'Yes' 'No' 'Yes']
specific1
['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
i 0
h ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 0
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 1
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 2
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 3
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 4
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 5
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
i 1
h ['Sunny' 'Warm' 'High' 'Strong' 'Warm' 'Same']
x 0
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 1
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 2
specific ['Sunny' 'Warm' 'Normal' 'Strong' 'Warm' 'Same']
x 3
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 4
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 5
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
i 2
h ['Rainy' 'Cold' 'High' 'Strong' 'Warm' 'Change']
i 3
h ['Sunny' 'Warm' 'High' 'Strong' 'Cool' 'Change']
x 0
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 1
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 2
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 3
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 4
specific ['Sunny' 'Warm' '?' 'Strong' 'Warm' 'Same']
x 5
specific ['Sunny' 'Warm' '?' 'Strong' '?' 'Same']
1. EXPERIMENT NO: 5
3. LEARNING OBJECTIVES:
4. AIM: Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following on the generated dataset:
1. Label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5), then xi ∈ Class1, else xi ∈ Class2
2. Classify the remaining points, x51, ..., x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30
5. THEORY:
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what is nearby. Imagine a streaming service wants to predict whether a new user is likely to cancel their subscription (churn) based on their age. It checks the ages of its existing users and whether they churned or stayed. If most of the K users closest in age to the new user canceled their subscription, KNN predicts that the new user might churn too. The key idea is that users with similar ages tend to behave similarly, and KNN uses this closeness to make decisions.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and performs the computation only at classification time.
As an example, consider a set of data points with two features, plotted with red diamonds for Category 1 and blue squares for Category 2 (figure not reproduced here). A new point is classified as Category 2 when most of its closest neighbours are blue squares; KNN assigns the category based on the majority of nearby points:
1. The red diamonds represent Category 1 and the blue squares represent Category 2.
2. The new data point checks its closest neighbours (the circled points).
3. Since the majority of its closest neighbours are blue squares (Category 2), KNN predicts that the new data point belongs to Category 2.
KNN works by using proximity and majority voting to make predictions.
In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the algorithm how
many nearby points (neighbours) to look at when it makes a decision.
Example:
Imagine you are trying to decide which fruit an unknown fruit is, based on its shape and size. You compare it with the k most similar fruits you already know and take the majority label.
Cross-Validation: A robust method for selecting the best k is cross-validation (note that the number of folds is unrelated to KNN's k). This involves splitting the data into several subsets, training the model on some subsets and testing it on the remaining one, and repeating this for each subset. The value of k that gives the highest average validation accuracy is usually the best choice.
Elbow Method: In the elbow method we plot the model's error rate or accuracy for different values of k. As we increase k, the error usually decreases at first; after a certain point it decreases much more slowly. The point where the curve forms an "elbow" is considered the best k.
Odd Values for k: It is also recommended to choose an odd value for k, especially in classification tasks, to avoid ties when deciding the majority class.
KNN uses distance metrics to identify the nearest neighbours; these neighbours are then used for classification and regression tasks.
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or
space. You can think of it like the shortest path you would walk if you were to go directly
from one point to another.
d(x, X_i) = √( Σ_{j=1}^{d} (x_j - X_ij)² )
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and
vertical lines (like a grid or city streets). It’s also called “taxicab distance” because a taxi can
only drive along the grid-like streets of a city.
d(x, y) = Σ_{i=1}^{n} |x_i - y_i|
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and
Manhattan distances as special cases.
d(x, y) = ( Σ_{i=1}^{n} |x_i - y_i|^p )^(1/p)
From the formula above we can say that when p = 2 then it is the same as the formula for the
Euclidean distance and when p = 1 then we obtain the formula for the Manhattan distance.
So, you can think of Minkowski as a flexible distance formula that can look like either
Manhattan or Euclidean distance depending on the value of p
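As a quick illustration (the two vectors are arbitrary), the three metrics can be computed with NumPy as follows:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # p = 2  -> about 3.61
manhattan = np.sum(np.abs(x - y))                   # p = 1  -> 5.0
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)   # general p -> about 3.27

print(euclidean, manhattan, minkowski)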
Thе K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity where it
predicts the label or value of a new data point by considering the labels or values of its K nearest
neighbors in the training dataset.
In regression, the algorithm still looks for the K closest points. But instead of voting for a
class in classification, it takes the average of the values of those K neighbors. This average
is the predicted value for the new point for the algorithm.
A typical illustration shows how a test point is classified based on its nearest neighbours: the algorithm identifies the closest k data points (k = 5 in that illustration) and assigns the test point the majority class label among them.
6. PROCEDURE / PROGRAMME :
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# Generate 100 random values in [0, 1]
data = np.random.rand(100)

# Label the first 50 points: Class1 if x <= 0.5, else Class2
labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

def knn_classifier(train_data, train_labels, test_point, k):
    # Distance of the test point to every training point (1-D data, so absolute difference)
    distances = [(abs(test_point - train_data[i]), train_labels[i])
                 for i in range(len(train_data))]
    distances.sort(key=lambda x: x[0])
    k_nearest_neighbors = distances[:k]
    k_nearest_labels = [label for _, label in k_nearest_neighbors]
    # Majority vote among the k nearest neighbours
    return Counter(k_nearest_labels).most_common(1)[0][0]

train_data = data[:50]
train_labels = labels
test_data = data[50:]

k_values = [1, 2, 3, 4, 5, 20, 30]
results = {}

for k in k_values:
    print(f"Results for k = {k}:")
    classified_labels = [knn_classifier(train_data, train_labels, test_point, k)
                         for test_point in test_data]
    results[k] = classified_labels
    for i, label in enumerate(classified_labels):
        print(f"Point x{i + 51} (value: {test_data[i]:.4f}) is classified as {label}")
    print("\n")

print("Classification complete.\n")

for k in k_values:
    classified_labels = results[k]
    class1_points = [test_data[i] for i in range(len(test_data))
                     if classified_labels[i] == "Class1"]
    class2_points = [test_data[i] for i in range(len(test_data))
                     if classified_labels[i] == "Class2"]

    plt.figure(figsize=(10, 6))
    plt.scatter(train_data, [0] * len(train_data),
                c=["blue" if label == "Class1" else "red" for label in train_labels],
                label="Training Data", marker="o")
    plt.scatter(class1_points, [1] * len(class1_points), c="blue",
                label="Class1 (Test)", marker="x")
    plt.scatter(class2_points, [1] * len(class2_points), c="red",
                label="Class2 (Test)", marker="x")
    plt.title(f"KNN Classification Results for k = {k}")
    plt.legend()
    plt.show()
OUTPUT:
Results for k = 1:
Results for k = 2:
Point x51 (value: 0.4186) is classified as Class1
Point x52 (value: 0.1913) is classified as Class1
Point x53 (value: 0.9719) is classified as Class2
Point x54 (value: 0.6504) is classified as Class2
Point x55 (value: 0.2149) is classified as Class1
Point x56 (value: 0.0625) is classified as Class1
Point x57 (value: 0.8785) is classified as Class2
Point x58 (value: 0.7059) is classified as Class2
Point x59 (value: 0.6395) is classified as Class2
Point x60 (value: 0.3241) is classified as Class1
Point x61 (value: 0.0987) is classified as Class1
Point x62 (value: 0.1907) is classified as Class1
Point x63 (value: 0.1081) is classified as Class1
Point x64 (value: 0.5463) is classified as Class2
Point x65 (value: 0.5245) is classified as Class2
Point x66 (value: 0.0095) is classified as Class1
Point x67 (value: 0.1940) is classified as Class1
Point x68 (value: 0.7450) is classified as Class2
Point x69 (value: 0.0305) is classified as Class1
Point x70 (value: 0.0046) is classified as Class1
Point x71 (value: 0.4473) is classified as Class1
Point x72 (value: 0.0449) is classified as Class1
Point x73 (value: 0.5532) is classified as Class2
Point x74 (value: 0.7819) is classified as Class2
Point x75 (value: 0.7890) is classified as Class2
Point x76 (value: 0.8762) is classified as Class2
Point x77 (value: 0.8628) is classified as Class2
Point x78 (value: 0.9900) is classified as Class2
Point x79 (value: 0.7665) is classified as Class2
Point x80 (value: 0.4851) is classified as Class1
Point x81 (value: 0.5881) is classified as Class2
Point x82 (value: 0.9204) is classified as Class2
Point x83 (value: 0.4165) is classified as Class1
Point x84 (value: 0.4188) is classified as Class1
Point x85 (value: 0.9696) is classified as Class2
Point x86 (value: 0.1754) is classified as Class1
Point x87 (value: 0.2621) is classified as Class1
Point x88 (value: 0.3443) is classified as Class1
Point x89 (value: 0.5252) is classified as Class2
Point x90 (value: 0.2649) is classified as Class1
Results for k = 3:
Point x51 (value: 0.4186) is classified as Class1
Point x52 (value: 0.1913) is classified as Class1
Point x53 (value: 0.9719) is classified as Class2
Point x54 (value: 0.6504) is classified as Class2
Point x55 (value: 0.2149) is classified as Class1
Point x56 (value: 0.0625) is classified as Class1
Point x57 (value: 0.8785) is classified as Class2
Point x58 (value: 0.7059) is classified as Class2
Point x59 (value: 0.6395) is classified as Class2
Point x60 (value: 0.3241) is classified as Class1
Point x61 (value: 0.0987) is classified as Class1
Point x62 (value: 0.1907) is classified as Class1
Point x63 (value: 0.1081) is classified as Class1
Point x64 (value: 0.5463) is classified as Class2
Point x65 (value: 0.5245) is classified as Class2
Point x66 (value: 0.0095) is classified as Class1
Point x67 (value: 0.1940) is classified as Class1
Point x68 (value: 0.7450) is classified as Class2
Point x69 (value: 0.0305) is classified as Class1
Point x70 (value: 0.0046) is classified as Class1
Point x71 (value: 0.4473) is classified as Class1
Point x72 (value: 0.0449) is classified as Class1
Point x73 (value: 0.5532) is classified as Class2
Point x74 (value: 0.7819) is classified as Class2
Point x75 (value: 0.7890) is classified as Class2
Point x76 (value: 0.8762) is classified as Class2
Point x77 (value: 0.8628) is classified as Class2
Point x78 (value: 0.9900) is classified as Class2
Point x79 (value: 0.7665) is classified as Class2
Point x80 (value: 0.4851) is classified as Class1
Point x81 (value: 0.5881) is classified as Class2
Point x82 (value: 0.9204) is classified as Class2
Point x83 (value: 0.4165) is classified as Class1
Point x84 (value: 0.4188) is classified as Class1
Results for k = 4:
Point x51 (value: 0.4186) is classified as Class1
Point x52 (value: 0.1913) is classified as Class1
Point x53 (value: 0.9719) is classified as Class2
Point x54 (value: 0.6504) is classified as Class2
Point x55 (value: 0.2149) is classified as Class1
Point x56 (value: 0.0625) is classified as Class1
Point x57 (value: 0.8785) is classified as Class2
Point x58 (value: 0.7059) is classified as Class2
Point x59 (value: 0.6395) is classified as Class2
Point x60 (value: 0.3241) is classified as Class1
Point x61 (value: 0.0987) is classified as Class1
Point x62 (value: 0.1907) is classified as Class1
Point x63 (value: 0.1081) is classified as Class1
Point x64 (value: 0.5463) is classified as Class2
Point x65 (value: 0.5245) is classified as Class2
Point x66 (value: 0.0095) is classified as Class1
Point x67 (value: 0.1940) is classified as Class1
Point x68 (value: 0.7450) is classified as Class2
Point x69 (value: 0.0305) is classified as Class1
Point x70 (value: 0.0046) is classified as Class1
Point x71 (value: 0.4473) is classified as Class1
Point x72 (value: 0.0449) is classified as Class1
Point x73 (value: 0.5532) is classified as Class2
Point x74 (value: 0.7819) is classified as Class2
Point x75 (value: 0.7890) is classified as Class2
Point x76 (value: 0.8762) is classified as Class2
Point x77 (value: 0.8628) is classified as Class2
Point x78 (value: 0.9900) is classified as Class2
Results for k = 5:
Point x51 (value: 0.4186) is classified as Class1
Point x52 (value: 0.1913) is classified as Class1
Point x53 (value: 0.9719) is classified as Class2
Point x54 (value: 0.6504) is classified as Class2
Point x55 (value: 0.2149) is classified as Class1
Point x56 (value: 0.0625) is classified as Class1
Point x57 (value: 0.8785) is classified as Class2
Point x58 (value: 0.7059) is classified as Class2
Point x59 (value: 0.6395) is classified as Class2
Point x60 (value: 0.3241) is classified as Class1
Point x61 (value: 0.0987) is classified as Class1
Point x62 (value: 0.1907) is classified as Class1
Point x63 (value: 0.1081) is classified as Class1
Point x64 (value: 0.5463) is classified as Class2
Point x65 (value: 0.5245) is classified as Class2
Point x66 (value: 0.0095) is classified as Class1
Point x67 (value: 0.1940) is classified as Class1
Point x68 (value: 0.7450) is classified as Class2
Point x69 (value: 0.0305) is classified as Class1
Point x70 (value: 0.0046) is classified as Class1
Point x71 (value: 0.4473) is classified as Class1
Point x72 (value: 0.0449) is classified as Class1
Classification complete.
1. EXPERIMENT NO: 6
3. LEARNING OBJECTIVES:
1. Make use of Data sets in implementing the machine learning algorithms.
2. Implement ML concepts and algorithms in Python
4. AIM: Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.
5. THEORY:
Given a dataset X, y, we attempt to find a linear model h(x) that minimizes residual
sum of squared errors. The solution is given by Normal equations.
Linear model can only fit a straight line, however, it can be empowered by polynomial
features to get more powerful models. Still, we have to decide and fix the number and
types of features ahead.
Alternate approach is given by locally weighted regression.
Given a dataset X, y, we attempt to find a model h(x) that minimizes residual
sum of weighted squared errors.
The weights are given by a kernel function, which can be chosen arbitrarily; here a Gaussian kernel is used.
The solution is very similar to Normal equations, we only need to insert diagonal
weight matrix W.
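For reference, the quantities described above can be written explicitly (a standard formulation; x0 denotes the query point and k the kernel bandwidth used in the program below):

w_i = exp( -(x_i - x0)² / (2k²) )        (Gaussian kernel weight of training point x_i)
W = diag(w_1, ..., w_m)                  (diagonal weight matrix)
β(x0) = (XᵀWX)⁻¹ XᵀWy                    (weighted normal equations)
ŷ(x0) = x0 · β(x0)                       (local prediction at the query point)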
Sample rows from the tips dataset (Program6_dataset_tips.csv):

total_bill  tip    sex     smoker  day  time    size
16.99       1.01   Female  No      Sun  Dinner  2
10.34       1.66   Male    No      Sun  Dinner  3
21.01       3.50   Male    No      Sun  Dinner  3
23.68       3.31   Male    No      Sun  Dinner  2
24.59       3.61   Female  No      Sun  Dinner  4
25.29       4.71   Male    No      Sun  Dinner  4
8.77        2.00   Male    No      Sun  Dinner  2
26.88       3.12   Male    No      Sun  Dinner  4
15.04       1.96   Male    No      Sun  Dinner  2
14.78       3.23   Male    No      Sun  Dinner  2
10.27       1.71   Male    No      Sun  Dinner  2
35.26       5.00   Female  No      Sun  Dinner  4
15.42       1.57   Male    No      Sun  Dinner  2
18.43       3.00   Male    No      Sun  Dinner  4
14.83       3.02   Female  No      Sun  Dinner  2
21.58       3.92   Male    No      Sun  Dinner  2
6. PROCEDURE / PROGRAMME
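The listing below calls two helper functions, kernel and localWeight, that are not reproduced in this manual, and assumes the usual imports. A minimal sketch of these missing pieces, written to match the theory above (Gaussian kernel weights and the weighted normal equations), is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def kernel(point, xmat, k):
    # Gaussian kernel: the weight of each training point falls off with its distance from the query point
    m, n = np.shape(xmat)
    weights = np.mat(np.eye(m))
    for j in range(m):
        diff = point - xmat[j]
        weights[j, j] = np.exp(diff * diff.T / (-2.0 * k ** 2))
    return weights

def localWeight(point, xmat, ymat, k):
    # Weighted normal equations: beta = (X^T W X)^-1 X^T W y
    wei = kernel(point, xmat, k)
    W = (xmat.T * (wei * xmat)).I * (xmat.T * (wei * ymat.T))
    return W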
def localWeightRegression(xmat, ymat, k):
    m, n = np.shape(xmat)
    ypred = np.zeros(m)
    for i in range(m):
        ypred[i] = xmat[i] * localWeight(xmat[i], xmat, ymat, k)
    return ypred

def graphPlot(X, ypred):
    sortindex = X[:, 1].argsort(0)  # argsort gives the indices that sort by bill amount
    xsort = X[sortindex][:, 0]
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(bill, tip, color='green')
    ax.plot(xsort[:, 1], ypred[sortindex], color='red', linewidth=5)
    plt.xlabel('Total bill')
    plt.ylabel('Tip')
    plt.show()

# load data points
data = pd.read_csv('Program6_dataset_tips.csv')
bill = np.array(data.total_bill)  # we use only the bill amount and tip columns
tip = np.array(data.tip)
mbill = np.mat(bill)  # np.mat converts the 1-D array into a 2-D matrix
mtip = np.mat(tip)
m = np.shape(mbill)[1]
one = np.mat(np.ones(m))
X = np.hstack((one.T, mbill.T))  # 244 rows, 2 cols; increase k to get a smoother curve
ypred = localWeightRegression(X, mtip, 9)
graphPlot(X, ypred)
OUTPUT:
1. EXPERIMENT NO: 7
3. LEARNING OBJECTIVES:
1. Make use of Data sets in implementing the machine learning algorithms.
2. Implement ML concepts and algorithms in Python
4. AIM: Develop a program to demonstrate the working of Linear Regression and Polynomial
Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for
vehicle fuel efficiency prediction) for Polynomial Regression.
5. THEORY:
6. PROCEDURE / PROGRAMME
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression_california():
    housing = fetch_california_housing(as_frame=True)
    X = housing.data[["AveRooms"]]
    y = housing.target
    # Train/test split (the 80/20 split and random_state are assumed values)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.plot(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Average number of rooms (AveRooms)")
    plt.ylabel("Median value of homes ($100,000)")
    plt.title("Linear Regression - California Housing Dataset")
    plt.legend()
    plt.show()

    print("Linear Regression - MSE:", mean_squared_error(y_test, y_pred))
    print("Linear Regression - R^2:", r2_score(y_test, y_pred))

def polynomial_regression_auto_mpg():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
    # The file ends each row with a quoted car-name field; naming it keeps the numeric columns aligned
    column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                    "acceleration", "model_year", "origin", "car_name"]
    data = pd.read_csv(url, sep='\s+', names=column_names, na_values="?")
    data = data.dropna()

    X = data["displacement"].values.reshape(-1, 1)
    y = data["mpg"].values
    # Train/test split (the 80/20 split and random_state are assumed values)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    poly_model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
    poly_model.fit(X_train, y_train)
    y_pred = poly_model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.scatter(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Displacement")
    plt.ylabel("Miles per gallon (mpg)")
    plt.title("Polynomial Regression - Auto MPG Dataset")
    plt.legend()
    plt.show()

    print("Polynomial Regression - MSE:", mean_squared_error(y_test, y_pred))
    print("Polynomial Regression - R^2:", r2_score(y_test, y_pred))

linear_regression_california()
polynomial_regression_auto_mpg()
OUTPUT
1. EXPERIMENT NO: 8
3. LEARNING OBJECTIVES:
1. Make use of Data sets in implementing the machine learning algorithms.
2. Implement ML concepts and algorithms in Python
4. AIM: Develop a program to demonstrate the working of the decision tree algorithm. Use Breast
Cancer Data set for building the decision tree and apply this knowledge to classify a new sample.
5. THEORY: Decision Tree
A decision tree is a simple diagram that shows different choices and their possible results, helping you make decisions easily. This section covers what decision trees are, how they work, their advantages and disadvantages, and their applications.
A decision tree is a graphical representation of different options for solving a problem and shows how different factors are related. It has a hierarchical tree structure that starts with one main question at the top, called a node, which branches out into different possible outcomes, where:
Root Node: the starting point that represents the entire dataset.
Branches: the lines that connect nodes; they show the flow from one decision to another.
Internal Nodes: points where decisions are made based on the input features.
Leaf Nodes: the terminal nodes at the end of branches that represent final outcomes or predictions.
They also support decision-making by visualizing outcomes. You can quickly evaluate
and compare the “branches” to determine which course of action is best for you.
Now, let us take an example to understand the decision tree. Imagine you want to decide whether to drink coffee based on the time of day and how tired you feel. First the tree checks the time of day; if it is morning, it asks whether you are tired. If you are tired, the tree suggests drinking coffee; if not, it says there is no need. Similarly, in the afternoon the tree again asks whether you are tired: if you are, it recommends drinking coffee; if not, it concludes no coffee is needed.
We have mainly two types of decision tree based on the nature of the target variable:
classification trees and regression trees.
Classification trees: They are designed to predict categorical outcomes means they classify data
into different classes. They can determine whether an email is “spam” or “not spam” based on
various features of the email.
Regression trees: These are used when the target variable is continuous. They predict numerical values rather than categories. For example, a regression tree can estimate the price of a house based on its size, location, and other features.
A decision tree starts working from a main question known as the root node. This question is derived from the features of the dataset and serves as the starting point for decision-making.
From the root node, the tree asks a series of yes/no questions. Each question is designed to
split the data into subsets based on specific attributes. For example if the first question is “Is it
raining?” the answer will determine which branch of the tree to follow. Depending on the
response to each question you follow different branches. If your answer is “Yes,” you might
proceed down one path if “No,” you will take another path.
This branching continues through a sequence of decisions. As you follow each branch,
you get more questions that break the data into smaller groups. This step-by-step process
continues until you have no more helpful questions.
You reach the end of a branch where you find the final outcome or decision; it could be a class label (for classification trees) or a numerical value (for regression trees). Decision trees do, however, have some limitations:
Overfitting: Overfitting occurs when a decision tree captures noise and details in the training data and therefore performs poorly on new data.
Instability: Instability means that the model can be unreliable; slight variations in the input can lead to significant differences in predictions.
Bias towards Features with More Levels: Decision trees can become biased towards features with many categories, focusing too much on them during decision-making. This can cause the model to miss other important features, leading to less accurate predictions.
6. PROCEDURE / PROGRAMME
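Only the tree-plotting portion of the program is reproduced below. A minimal sketch of the setup it assumes (loading the data, training a DecisionTreeClassifier, and classifying one held-out sample as the "new sample" required by the aim; the split ratio and random_state are assumptions) is:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
print("Accuracy on test data:", accuracy_score(y_test, clf.predict(X_test)))

# Classify a "new" sample (here: the first held-out test sample, for illustration)
new_sample = X_test[0].reshape(1, -1)
prediction = clf.predict(new_sample)
print("Predicted class for the new sample:", data.target_names[prediction][0])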
X = data.data
y = data.target
plt.figure(figsize=(12,8))
tree.plot_tree(clf, filled=True,
feature_names=data.feature_names, class_names=data.target_names)
plt.title("Decision Tree - Breast Cancer Dataset")
plt.show()
OUTPUT:
1. EXPERIMENT NO: 9
3. LEARNING OBJECTIVES:
1. Make use of Data sets in implementing the machine learning algorithms.
2. Implement ML concepts and algorithms in Python
4. AIM: Develop a program to implement the Naive Bayesian classifier considering Olivetti Face
Data set for training. Compute the accuracy of the classifier, considering a few test data sets.
5. THEORY:
The main idea behind the Naive Bayes classifier is to use Bayes’ Theorem to classify data
based on the probabilities of different classes given the features of the data. It is used mostly
in high-dimensional text classification.
The Naive Bayes classifier is a simple probabilistic classifier with a small number of parameters, which makes it possible to build ML models that predict faster than many other classification algorithms.
It is called "naive" because it assumes that each feature in the model is independent of the other features. In other words, each feature contributes to the prediction with no relation to the others.
The Naive Bayes algorithm is used in spam filtering, sentiment analysis, classifying articles and many more applications.
The "Bayes" part of the name refers to its basis in Bayes' Theorem.
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. More specifically:
Feature independence: This means that when we are trying to classify something, we
assume that each feature (or piece of information) in the data does not affect any other
feature.
Continuous features are normally distributed: If a feature is continuous, then it is
assumed to be normally distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it is
assumed to have a multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to the
prediction of the class label.
No missing data: The data should not contain any missing values.
The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption is never exactly correct, but it often works well in practice. Before moving to the formula for Naive Bayes, it is important to recall Bayes' theorem.
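For reference, Bayes' theorem and the resulting Naive Bayes decision rule can be written as (standard formulation):

P(y | X) = P(X | y) · P(y) / P(X)

and, with the feature-independence assumption, for features x1, ..., xn:

P(y | x1, ..., xn) ∝ P(y) · Π_{i=1}^{n} P(xi | y)

The predicted class is the value of y that maximizes this product; in Gaussian Naive Bayes each P(xi | y) is modelled by a normal distribution fitted per class.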
Gaussian Naive Bayes: In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution; when plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.
Multinomial Naive Bayes: Multinomial Naive Bayes is used when features represent the frequency
of terms (such as word counts) in a document. It is commonly applied in text classification, where
term frequencies are important.
Bernoulli Naive Bayes: Bernoulli Naive Bayes deals with binary features, where each feature
indicates whether a word appears or not in a document. It is suited for scenarios where the presence
or absence of terms is more relevant than their frequency. Both models are widely used in document
classification tasks.
6. PROCEDURE / PROGRAMME:
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
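# NOTE: the original listing omits the data loading and train/test split assumed by the
# code below; a minimal sketch (the split ratio and random_state are assumptions):
faces = fetch_olivetti_faces(shuffle=True, random_state=42)
X, y = faces.data, faces.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
# Accuracy can also be reported once y_pred is computed below, for example:
# print("Accuracy:", accuracy_score(y_test, y_pred))
# print("Cross-validated accuracy:", cross_val_score(GaussianNB(), X, y, cv=5).mean())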
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=1))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
plt.show()
OUTPUT:
Classification Report:
precision recall f1-score support
[[2 0 0 ... 0 0 0]
[0 2 0 ... 0 0 0]
[0 0 2 ... 0 0 1]
...
[0 0 0 ... 1 0 0]
[0 0 0 ... 0 3 0]
[0 0 0 ... 0 0 5]]
1. EXPERIMENT NO: 10
3. LEARNING OBJECTIVES:
1. Make use of Data sets in implementing the machine learning algorithms.
2. Implement ML concepts and algorithms in Python
4. AIM: Develop a program to implement k-means clustering using Wisconsin Breast Cancer data
set and visualize the clustering result.
5. THEORY: k-means clustering
K-means clustering is a technique used to organize data into groups based on their similarity. For example, an online store can use K-Means to group customers based on purchase frequency and spending, creating segments like Budget Shoppers, Frequent Buyers and Big Spenders for personalized marketing.
The algorithm works by first randomly picking some central points called centroids; each data point is then assigned to the closest centroid, forming a cluster. After all the points are assigned to a cluster, the centroids are updated by computing the average position of the points in each cluster. This process repeats until the centroids stop changing, which yields the final clusters. The goal of clustering is to divide the data points into clusters so that similar data points belong to the same group.
We are given a data set of items with certain features and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the K-
means algorithm. ‘K’ in the name of the algorithm represents the number of groups/clusters we
want to classify our items into.
The algorithm will categorize the items into k groups or clusters of similarity. To calculate
that similarity, we will use the Euclidean distance as a measurement. The algorithm works as
follows:
1. First, we randomly initialize k points, called means or cluster centroids.
2. We categorize each item to its closest mean, and we update the mean’s coordinates,
which are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our
clusters.
The "points" mentioned above are called means because they are the mean values of the items assigned to them. To initialize these means, we have several options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize them at random values between the boundaries of the data set: for example, if a feature x takes values in [0, 3], we initialize the means with values for x drawn from [0, 3]. A minimal from-scratch sketch of these steps follows below.
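A minimal from-scratch sketch of the steps described above (illustrative only; the experiment itself uses scikit-learn's KMeans, and the example data is made up):

import numpy as np

def kmeans_simple(X, k, iters=100, seed=0):
    # 1. Randomly initialize k means by picking k items from the data set
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign each item to its closest mean (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        clusters = dists.argmin(axis=1)
        # Update each mean to the average of the items assigned to it
        new_means = np.array([X[clusters == j].mean(axis=0) if np.any(clusters == j)
                              else means[j] for j in range(k)])
        # 3. Stop when the means no longer change (or after the given number of iterations)
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, clusters

# Example: two obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
means, clusters = kmeans_simple(X, k=2)
print(means)
print(clusters)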
6. PROCEDURE / PROGRAMME
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix,
classification_report
data = load_breast_cancer()
X = data.data
y = data.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
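# NOTE: the original listing omits fitting the clustering model assumed by the code
# below; a minimal sketch (n_clusters=2 matches the two diagnosis classes;
# random_state and n_init are assumptions):
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X_scaled)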
print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
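# NOTE: the DataFrame used by the plots below is not constructed in the original
# listing; a minimal sketch:
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y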
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster',
palette='Set1', s=100, edgecolor='black', alpha=0.7)
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label',
palette='coolwarm', s=100, edgecolor='black', alpha=0.7)
plt.title('True Labels of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="True Label")
plt.show()
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster',
palette='Set1', s=100, edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red',
marker='X', label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()
OUTPUT:
Confusion Matrix:
[[175 37]
[ 13 344]]
Classification Report:
precision recall f1-score support