ML File Fnail Merged

The document is a lab file for the Machine Learning course (SE-206n) at Delhi Technological University, detailing the vision, mission, educational objectives, and specific outcomes of the Software Engineering department. It includes a comprehensive index of experiments focusing on Python programming, data preprocessing, and various machine learning algorithms, along with theoretical explanations of essential programming concepts and libraries. Additionally, it outlines the use of popular Python libraries such as NumPy, Pandas, and Matplotlib for data analysis and visualization.

Uploaded by

Aradhay Jain
MACHINE LEARNING (SE-206n) LAB FILE

Subject Code: SE206n


Subject Name: Machine Learning
Branch: Software Engineering
Year: 2nd year/4th Semester

Submitted By:
Aradhay Jain
23/SE/030

Submitted to:
Dr. Shweta Meena
Assistant Professor
Department of Software Engineering

Delhi Technological University


Shahbad Daulatpur, Main Bawana Road, Delhi-110042
Department of Software Engineering

Vision
The vision of the Department of Software Engineering is “To be one of the premium departments in
the world by focusing on innovative research and quality education”.

Mission
The Mission of the Department of Software Engineering is as follows:

1. To inculcate theoretical and practical knowledge of software development and engineering
practices according to global industry standards in the students of the department.

2. To develop an innovative mind-set among the students through various research initiatives,
opportunities, collaborative projects, and experiential learning programs offered to the students.

3. To develop industry-ready software engineers by imparting knowledge through skill-based and
technical trainings in collaboration with software industries.

4. To instill an attitude for social good by providing opportunities to work with community-sensitive
organizations.

5. To foster personal, professional, and research ethics in the students.


Program Educational Objective
PEO 1: Apply the knowledge of software engineering principles and paradigms in the design of
system components and processes that meet the specific needs of the industry.

PEO 2: Use the techniques, skills and CASE tools necessary for engineering practice and coordinate
the construction, deployment and maintenance of software systems.

PEO 3: Ability to apply mathematical foundations and principles for modelling engineering
problems.

PEO 4: Use research-based knowledge and tools for the analysis and interpretation of data to
synthesize information for obtaining valid conclusions.

Program Specific Outcomes (PSOs)


PSO1: Design, analyse and develop a solution to the existing engineering problems.

PSO2: Specify, design, develop and maintain usable systems that behave reliably and efficiently.

PSO3: Develop systems that would perform tasks related to Research, Education and Training
and/or E-governance.

Program Outcomes (POs)

Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals,
and an engineering specialization to the solution of complex engineering problems.

Problem analysis: Identify, formulate, review research literature, & analyse complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences.

Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.

Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of information.

Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools to complex engineering activities with understanding of the limitations.

The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent relevant responsibilities.
Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.

Ethics: Apply ethical principles and commit to professional ethics, responsibilities and norms.

Individual and team work: Function effectively as an individual, and as a responsible member or
dynamic leader in diverse teams, and in multidisciplinary, multicultural, & collaborative professional
settings.

Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports
and design documentation, make effective presentations, and give and receive clear instructions.

Project management and finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary, multicultural, and collaborative professional settings.

Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
INDEX

S.No. | Experiment | Date | Signature
1. | Exploring and demonstrating Python. | 10/01/2025 |
2. | Perform Data Preprocessing like outlier detection, handling missing value, analyzing redundancy and normalization on different datasets. | 17/01/2025 |
3. | Write a program to implement Linear Regression using any appropriate dataset. | 24/01/2025 |
4. | Write a program to exhibit the working of the decision tree based ID3 algorithm. With the help of an appropriate data set, build the decision tree and classify a new sample. | 31/01/2025 |
5. | Write a program to demonstrate the working of the decision tree based C4.5 algorithm. With the help of the data set used in the above experiment, build the decision tree and classify a new sample. | 07/02/2025 |
6. | Write a program to demonstrate the working of decision tree based CART algorithm. Build the decision tree and classify a new sample using a suitable dataset. Compare the performance with that of ID3, C4.5, and CART in terms of accuracy, recall, precision and sensitivity. | 14/02/2025 |
7. | Build an Artificial Neural Network by implementing the Back propagation algorithm and test the same using appropriate data sets. | 21/03/2025 |
8. | Write a program to implement the Naïve Bayesian classifier for appropriate dataset and compute the performance measures of the model. | 28/03/2025 |
9. | Write a program to implement k-Nearest Neighbor algorithm to classify any dataset of your choice. Print both correct and wrong predictions. | 04/04/2025 |
10. | Apply k-Means clustering algorithm on suitable datasets and comment on the quality of clustering. | 04/04/2025 |
11. | Write a program to implement ensemble algorithm- AdaBoost and Bagging using appropriate dataset and evaluate their performance on that dataset. | 11/04/2025 |
12. | Select any two datasets based on their statistics and perform comparison among all implemented algorithms using them. | 11/04/2025 |
13. | Conduct survey of at least 5 different machine learning tools available. | 11/04/2025 |

EXPERIMENT-1(a)
AIM: To explore basic Python programming concepts such as classes and objects, data structures,
functions and exception handling.

THEORY:
1. Classes and Objects

• Class: A blueprint or template that defines the attributes and methods for objects. For
example, a “Car” class defines characteristics like colour and speed and behaviours like
driving and braking.
• Object: A specific instance of a class. For instance, a red Toyota Corolla is an object of the
“Car” class. Each object has unique values but follows the class structure.
• Purpose: Classes and objects are key to Object-Oriented Programming (OOP), helping
structure code in a modular and reusable way while modelling real-world entities.
2. Data Structures
Data structures organize and store data efficiently:

• List: An ordered, mutable collection of items, useful for storing sequences like numbers or
names.
• Tuple: Similar to a list but immutable, ideal for protecting data from changes.
• Set: An unordered collection of unique items, automatically removing duplicates, useful
for membership tests.
• Dictionary: Stores data in key-value pairs, useful for fast lookups and scenarios where data
is labelled (like configurations).
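The four built-in collections described above can be tried out with a short sketch. The sample values (scores, point, config) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch of Python's four core built-in collections.
# All sample values below are made up for demonstration.

# List: ordered and mutable
scores = [70, 85, 85, 90]
scores.append(95)             # lists can grow in place

# Tuple: ordered but immutable
point = (3, 4)
tuple_is_immutable = False
try:
    point[0] = 10             # tuples do not support item assignment
except TypeError:
    tuple_is_immutable = True

# Set: unique items only, fast membership tests
unique_scores = set(scores)   # the duplicate 85 collapses to one entry
has_90 = 90 in unique_scores

# Dictionary: key-value pairs with fast lookup by key
config = {"epochs": 10, "batch_size": 32}
learning_rate = config.get("learning_rate", 0.01)  # default when key is absent

print(scores)
print(sorted(unique_scores))
print(has_90, tuple_is_immutable, learning_rate)
```

The set drops the repeated 85 automatically, and dict.get() avoids a KeyError when a label is missing.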
3. Functions

• A function is a reusable block of code that performs a specific task, making code modular
and easier to manage. Functions take parameters as input and return values as output.
• Benefits: Functions promote modularity, reusability, clarity, and make it easier to test and
debug code.
4. Exception Handling

• Exception: An error that occurs during program execution, such as dividing by zero or
missing a file.
• Exception Handling: A way to manage errors without crashing the program, allowing it to
handle the error and continue running (e.g., by displaying a user-friendly message).
• Importance: It prevents crashes, helps in debugging, and improves the user experience by
managing errors gracefully.

CODE:
#CLASS IMPLEMENTATION
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def bark(self):
        print(f"{self.name} says woof!")

dog1 = Dog("Buddy", 3)
print(dog1.name)  # Output: Buddy
dog1.bark()  # Output: Buddy says woof!

#FUNCTIONS:
def calculate_average(numbers):
    try:
        if not numbers:
            raise ValueError("The list is empty. Cannot calculate average.")
        total = sum(numbers)
        count = len(numbers)
        return total / count
    except TypeError:
        # sum() raises TypeError when the list holds non-numeric items
        raise ValueError("All elements must be numeric.")

try:
    data = [10, 20, 30, 40]
    print(f"Average: {calculate_average(data)}")  # Output: Average: 25.0
    print(f"Average: {calculate_average([])}")  # Raises ValueError
except ValueError as e:
    print(f"Error: {e}")

#EXCEPTION HANDLING
def read_number_from_file(filename):
    try:
        with open(filename) as f:
            content = f.read()
        number = int(content)
        return number ** 2
    except FileNotFoundError:
        print(f"Error: The file '{filename}' was not found.")
    except ValueError:
        print("Error: The file does not contain a valid integer.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

# The function returns None when any of the errors above is handled
result = read_number_from_file("number.txt")
if result is not None:
    print(f"Square: {result}")
#DATA STRUCTURES:
#Stack
class Stack:
    def __init__(self):
        self.items = []

    def is_empty(self):
        return len(self.items) == 0

    def push(self, item):
        self.items.append(item)

    def pop(self):
        if not self.is_empty():
            return self.items.pop()
        else:
            return "Stack is empty"

    def peek(self):
        if not self.is_empty():
            return self.items[-1]
        else:
            return "Stack is empty"

    def size(self):
        return len(self.items)

# Example usage
stack = Stack()
stack.push(10)
stack.push(20)
stack.push(30)
pop_item = stack.pop()
peek_item = stack.peek()
stack_size = stack.size()
stack_items = stack.items
#Queue
from collections import deque

class Queue:
    def __init__(self):
        self.items = deque()

    def is_empty(self):
        return len(self.items) == 0

    def enqueue(self, item):
        self.items.append(item)

    def dequeue(self):
        if not self.is_empty():
            return self.items.popleft()
        else:
            return "Queue is empty"

    def size(self):
        return len(self.items)

# Example usage
queue = Queue()
queue.enqueue(10)
queue.enqueue(20)
queue.enqueue(30)
dequeue_item = queue.dequeue()
queue_size = queue.size()
queue_items = list(queue.items)
#List
# Named CustomList to match the usage below and to avoid confusion with the built-in list
class CustomList:
    def __init__(self):
        self.items = []

    def append(self, item):
        self.items.append(item)

    def remove(self, item):
        if item in self.items:
            self.items.remove(item)
        else:
            return "Item not found"

    def get(self, index):
        if 0 <= index < len(self.items):
            return self.items[index]
        else:
            return "Index out of range"

    def size(self):
        return len(self.items)

    def display(self):
        return self.items

# Example usage
custom_list = CustomList()
custom_list.append(10)
custom_list.append(20)
custom_list.append(30)
item_at_1 = custom_list.get(1)
custom_list.remove(20)
size_after_removal = custom_list.size()
list_items = custom_list.display()
#DICTIONARY
student = {"name": "Alice", "age": 21, "major": "Computer Science"}
# Using the dict() constructor
grades = dict(math=90, science=85, english=92)
print(student["name"])
print(student.get("age"))
print(student.get("GPA", "Not found"))

OUTPUT:
[Output screenshots not reproduced: Classes and Objects; Functions; Error Handling; Data Structures (Stack, Queue, List, Dictionary)]

LEARNING
In this experiment, we learnt about various important concepts such as data structures, exception
handling, functions and classes and objects.

EXPERIMENT-1(b)
AIM: To explore Python libraries such as NumPy, Pandas and Matplotlib, and to demonstrate how
these libraries help in analysing and visualizing a dataset.
DATASET
The HCV dataset is a real-world and frequently used dataset in healthcare and machine learning,
especially for classification problems. It contains multiple patient records, categorized into
different liver disease classes like Fibrosis, Cirrhosis, and Suspect Blood Donor. Each record
includes ten numerical attributes: ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, and
PROT. The objective is to use these features to correctly predict the disease category for each
patient. This dataset is popular because it’s concise, real, and well-structured, making it ideal for
beginners to explore preprocessing, visualization, and classification techniques like k-nearest
neighbors, random forests, and naive Bayes.

THEORY:
Python comes with a rich standard library and a vast ecosystem of third-party packages, enabling
development in various fields such as web development, data science, machine learning,
automation, game development, and more. Its user-friendly syntax and strong community support
have made Python one of the most popular programming languages in the world.
Libraries:
1. NumPy
Features & Usability:
•Numerical Operations: Provides an efficient multi-dimensional array object (ndarray)
and functions for operating on these arrays.
•Performance: Implements vectorized operations written in C, providing significant
speed advantages over native Python loops.
•Broad Support: Forms the base for many other scientific libraries.
Example Use-Cases: Array manipulations, mathematical computations, and as a
foundation for other packages.
2. Pandas
Features & Usability:
•Data Structures: Offers DataFrame and Series data structures that simplify data
manipulation and analysis.

•Data Cleaning & Transformation: Provides robust tools for reading/writing data,
handling missing values, and grouping/merging datasets.
•Time Series Analysis: Offers extensive functionality to work with date/time data.
Example Use-Cases: Data cleaning, exploration, and preprocessing tasks, especially in
data science and machine learning.
3. Matplotlib
Features & Usability:
•Data Visualization: A comprehensive 2D plotting library that allows the creation of
static, animated, and interactive visualizations.
•Flexibility: Highly customizable and integrates well with other Python libraries such
as NumPy and Pandas.
•Publication Quality: Generates high-quality figures that can be used in academic and
professional presentations.
Example Use-Cases: Plotting line charts, histograms, scatter plots, and more complex
visualizations such as 3D graphs.
4. Keras
Features & Usability:
• User-Friendly API: A high-level neural network API that runs on top of TensorFlow,
making it simpler to build and prototype deep learning models.
• Modularity: Offers easy-to-use building blocks for designing neural networks,
including layers, optimizers, and activation functions.
•Rapid Prototyping: Ideal for quick experimentation with deep neural network
architectures.
Example Use-Cases: Prototyping and deploying neural network models for image
classification, natural language processing, and more.
5. TensorFlow
Features & Usability:
•Comprehensive Ecosystem: An end-to-end platform for machine learning that supports
production-level deployment as well as research.
•Flexibility and Performance: Allows the creation of custom neural network
architectures with an extensive library of tools and resources.

•Scalability: Supports distributed computing and can run on CPUs, GPUs, and TPUs.
Example Use-Cases: Large-scale machine learning projects, from training deep neural
networks to deploying machine learning models in production.
6. PyBrain
Features & Usability:
• Machine Learning Framework: Provides tools for both supervised and reinforcement
learning with a modular design.
• Neural Networks: Supports feedforward and recurrent neural networks with
customizable architecture.
• Training Algorithms: Includes standard training techniques like backpropagation and
gradient descent.
• Easy Prototyping: Simple and intuitive API, suitable for educational purposes and
small-scale experimentation.
• Dataset Handling: Offers built-in tools for managing datasets and preparing them for
training.
Example Use-Cases:
Building classification models, function approximation tasks, reinforcement learning
agents (e.g., maze solvers), and simple neural network prototypes for research or
education.
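
As a small illustration of the vectorized operations mentioned under NumPy above, the sketch below standardizes an array in a single expression and checks it against an explicit Python loop. The measurement values are hypothetical, not taken from the HCV dataset.

```python
import numpy as np

# Hypothetical sample data: five made-up measurements.
values = np.array([12.0, 18.5, 25.0, 31.5, 40.0])

# Vectorized standardization: one expression, no explicit Python loop.
# np.std uses the population formula (ddof=0) by default.
standardized = (values - values.mean()) / values.std()

# Equivalent pure-Python loop, shown only for comparison.
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
standardized_loop = [(v - mean) / std for v in values]

print(np.allclose(standardized, standardized_loop))
```

Both versions compute the same numbers; the vectorized form is shorter and pushes the loop into compiled code.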

CODE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('hcvdat0.csv')

# Pandas: drop rows with missing values; copy so a new column can be added safely
df_clean = df.dropna().copy()

# NumPy: summary statistics of the Age column
ages = df_clean['Age'].values
mean_age = np.mean(ages)
std_age = np.std(ages)
print(f"Mean Age: {mean_age:.2f}, Std Age: {std_age:.2f}")

# Matplotlib: histogram of the age distribution
plt.figure(figsize=(12, 7))
plt.hist(ages, bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Age', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Age Distribution', fontsize=18)
plt.show()

# Encode the target labels as integers and standardize the features
label_encoder = LabelEncoder()
df_clean['Category_enc'] = label_encoder.fit_transform(df_clean['Category'])
X = df_clean[['Age', 'ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT']]
y = df_clean['Category_enc']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Keras: small feed-forward classifier with one output unit per class
model = keras.Sequential([
    keras.layers.Dense(32, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(len(np.unique(y)), activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.2f}")

OUTPUT:

LEARNING:
In this experiment, we learnt about various important libraries such as NumPy, pandas and
matplotlib.

EXPERIMENT-2
AIM: Perform Data Preprocessing like outlier detection, handling missing value, analyzing
redundancy and normalization on different datasets.
THEORY:
Data Preprocessing is a crucial step in any machine learning or data analysis project. It involves
preparing raw data to make it suitable for modeling by cleaning, transforming, and organizing it.
1. Outlier Detection
▪ Outliers are data points that significantly differ from other observations. They can be
caused by errors, variability in measurement, or rare events.
▪ Why it's important: Outliers can distort statistical analyses and model performance.
▪ Detection methods:
o Statistical techniques like Z-score or IQR (Interquartile Range).
o Visualization tools such as box plots and scatter plots.
▪ Action: Depending on the context, outliers may be removed, capped, or corrected.
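The two statistical detection rules named above can be sketched with the standard library alone. The data and the Z-score cut-off of 2.0 are assumptions made for this illustration; 3.0 is another common choice.

```python
import statistics

# Hypothetical measurements with one obviously extreme value (made-up data).
data = [10, 12, 11, 13, 12, 11, 10, 95]

# Z-score rule: flag points more than z_threshold standard deviations from the mean.
z_threshold = 2.0  # assumed cut-off for this sketch
mean = statistics.mean(data)
std = statistics.pstdev(data)
z_outliers = [x for x in data if abs((x - mean) / std) > z_threshold]

# IQR rule: flag points outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower or x > upper]

print(z_outliers, iqr_outliers)
```

Both rules flag the value 95 here. On small samples the Z-score rule can miss outliers because the extreme value itself inflates the standard deviation, which is one reason the IQR rule is often preferred.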
2. Handling Missing Values
▪ Missing data occurs when no value is stored for a feature in a data record.
▪ Why it matters: Machine learning algorithms often cannot handle missing values directly.
▪ Handling methods:
o Deletion: Remove rows or columns with too many missing values.
o Imputation: Fill in missing values using:
➢ Mean, median, or mode (for numerical/categorical data).
➢ Forward or backward fill in time-series data.
➢ More advanced techniques like KNN or regression-based imputation.
3. Analyzing Redundancy
▪ Redundancy refers to duplicate or highly correlated features that do not add value and may
introduce noise.
▪ Why address it: Redundant features can lead to overfitting and increase computation time
without improving accuracy.
▪ Techniques:
o Correlation analysis: Identify features that are highly correlated.
o Dimensionality reduction: Methods like PCA (Principal Component Analysis) to
remove redundancy.

4. Normalization
▪ Normalization (or feature scaling) is the process of transforming features to a common
scale.
▪ Why it's needed: Many algorithms (e.g., k-NN, SVM, gradient descent-based models)
perform better when features are on a similar scale.
▪ Common methods:
o Min-Max Scaling: Scales values to a range [0, 1].
o Z-score Standardization: Rescales each feature to zero mean and unit variance.
o Robust Scaling: Uses median and IQR, useful when outliers are present.
Data preprocessing enhances the quality of data, leading to more accurate and reliable machine
learning models. Detecting and managing outliers, handling missing values, removing redundancy,
and normalizing data are essential steps that ensure the dataset is clean, consistent, and ready for
analysis or model training.
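
The three scaling methods listed under Normalization can be compared on a small example. This is a minimal sketch with made-up values; the quartiles are computed with statistics.quantiles using the "inclusive" method, which is one of several possible quantile conventions.

```python
import statistics

# Hypothetical feature values including one large value (made-up data).
values = [2.0, 4.0, 6.0, 8.0, 100.0]

# Min-Max scaling: (x - min) / (max - min), result lies in [0, 1].
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Z-score standardization: (x - mean) / std, result has mean 0 and variance 1.
mean = statistics.mean(values)
std = statistics.pstdev(values)
z_scores = [(x - mean) / std for x in values]

# Robust scaling: (x - median) / IQR, less sensitive to the extreme value.
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
robust = [(x - median) / (q3 - q1) for x in values]

print(min_max)
print(z_scores)
print(robust)
```

With Min-Max scaling the single large value squeezes the other four points close to 0, while robust scaling keeps them evenly spread around 0.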

CODE:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

file_path = 'hcvdat0.csv'
df = pd.read_csv(file_path)

print("Initial Dataset Shape:", df.shape)

print("\nDataset Head:")
print(df.head())
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None
print("\nDataset Description:")
print(df.describe())

# Count missing values in each column
missing_values = df.isnull().sum()
print("\nMissing Values per Column:")
print(missing_values)

# Impute missing numeric values with the column median
numeric_cols = ['Age', 'ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT',
                'PROT']
for col in numeric_cols:
    if missing_values[col] > 0:
        median_val = df[col].median()
        df[col] = df[col].fillna(median_val)

# Impute missing categorical values with the most frequent value (mode)
cat_cols = ['Category', 'Sex']
for col in cat_cols:
    if missing_values[col] > 0:
        mode_val = df[col].mode()[0]
        df[col] = df[col].fillna(mode_val)

print("\nMissing Values After Imputation:")
print(df.isnull().sum())

# Box plots to inspect outliers in each numeric column
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(3, 4, i+1)
    sns.boxplot(x=df[col])
    plt.title(col)
plt.tight_layout()
plt.show()

# Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at the bounds
def cap_outliers(col):
    Q1 = col.quantile(0.25)
    Q3 = col.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return col.clip(lower=lower_bound, upper=upper_bound)

for col in numeric_cols:
    df[col] = cap_outliers(df[col])

print("\nSummary statistics after capping outliers:")
print(df[numeric_cols].describe())

# Redundancy analysis: pairwise correlation between numeric features
corr_matrix = df[numeric_cols].corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

threshold = 0.5
print("\nModerate-to-High Correlated Feature Pairs (|correlation| > {}):".format(threshold))
for i in range(len(corr_matrix.columns)):
    for j in range(i):  # lower triangle only, so each pair is reported once
        if abs(corr_matrix.iloc[i, j]) > threshold:
            col1 = corr_matrix.columns[i]
            col2 = corr_matrix.columns[j]
            corr_val = corr_matrix.iloc[i, j]
            print(f"{col1} and {col2}: {corr_val:.2f}")

# Min-Max normalization of the numeric columns to the range [0, 1]
scaler = MinMaxScaler()
df_normalized = df.copy()
df_normalized[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print("\nFirst 5 rows after Normalization:")
# 'Age' is already part of numeric_cols, so it is not listed twice
print(df_normalized[['Category', 'Sex'] + numeric_cols].head())

# Compare feature distributions before and after normalization
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for col in numeric_cols:
    sns.kdeplot(df[col], label=col)
plt.title("Distributions After Outlier Capping")
plt.legend(loc='upper right')
plt.subplot(1, 2, 2)
for col in numeric_cols:
    sns.kdeplot(df_normalized[col], label=col)
plt.title("Distributions After Normalization")
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()
df.to_csv('hcv_preprocessed.csv', index=False)
df_normalized.to_csv('hcv_normalized.csv', index=False)

OUTPUT:

LEARNING:
In this experiment, we performed data pre-processing operations such as handling missing
values, detecting outliers, and applying normalization on the HCV Dataset.

EXPERIMENT-3
AIM: Write a program to implement Linear Regression using any appropriate dataset.
THEORY:
Linear Regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables by fitting a linear equation to the observed
data. The goal is to predict the value of the dependent variable (often called the target or output)
based on the independent variables (often called features or predictors).
Linear-regression models are relatively simple and provide an easy-to-interpret mathematical
formula for generating predictions. They are applied across business and academic study, in
fields ranging from the biological, behavioural, environmental and social sciences to business
forecasting, and are an established way to make predictions from data. Because linear regression
is a long-established statistical procedure, the properties of linear-regression models are well
understood, and the models can be trained very quickly.
In the simplest case, where there is one independent variable, the relationship is modelled as a
straight line, and the equation takes the form:
$y = \beta_0 + \beta_1 x + \epsilon$
Where:

• $y$ is the dependent variable (the value we want to predict),
• $x$ is the independent variable (the feature used to make the prediction),
• $\beta_0$ is the intercept of the line (the value of $y$ when $x = 0$),
• $\beta_1$ is the slope of the line (how much $y$ changes for a one-unit change in $x$),
• $\epsilon$ represents the error term (the difference between the observed and predicted values).
In multiple linear regression, the equation becomes:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon$

Where:
▪ $x_1, x_2, \ldots, x_n$ are the multiple independent variables.
▪ $\beta_1, \beta_2, \ldots, \beta_n$ are the coefficients (weights) that represent the contribution of each
independent variable to the prediction of $y$.
Key Points:

• Simple Linear Regression: Involves one independent variable.
• Multiple Linear Regression: Involves multiple independent variables.
• Linear regression assumes a linear relationship between the dependent and independent
variables.
• The model "learns" the best-fitting line by minimizing the sum of squared errors
between the observed values and the predicted values.
Linear regression is widely used for prediction and forecasting in fields such as economics,
finance, biology, and more.
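
To make the least-squares idea concrete, the sketch below fits a simple linear regression with the standard closed-form solution for one feature. The function name fit_simple_linear_regression and the data points are hypothetical; the points were chosen to lie exactly on y = 1 + 2x so the result is easy to check by hand.

```python
# Ordinary least squares for one feature, using the closed-form solution:
#   beta1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
#   beta0 = y_mean - beta1 * x_mean

def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) minimizing the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Made-up data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

beta0, beta1 = fit_simple_linear_regression(x, y)
prediction = beta0 + beta1 * 5.0   # predict y for an unseen x = 5
print(beta0, beta1, prediction)
```

Because the data are noise-free, the fitted intercept and slope recover 1 and 2; with real data the same formulas give the line with the smallest sum of squared errors.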

CODE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.impute import SimpleImputer

path = "drive/MyDrive/hcvdat0.csv"
df = pd.read_csv(path)
print("Initial data shape:", df.shape)

df = df.dropna()
print("Shape after dropping missing values:", df.shape)

if "Category" not in df.columns:
    raise ValueError("Target column 'Category' not found in the dataset.")

y = df["Category"]
X = df.drop("Category", axis=1)

# Encode string class labels as integers so they can be used as a regression target
if y.dtype == 'object':
    le = LabelEncoder()
    y = le.fit_transform(y)
    print("Target encoded. Classes:", le.classes_)

# One-hot encode categorical features such as Sex
X = pd.get_dummies(X)
print("Features after one-hot encoding. Shape:", X.shape)

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2,
                                                    random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Round the continuous predictions to the nearest valid class index
y_pred_continuous = model.predict(X_test)
y_pred = np.clip(np.rint(y_pred_continuous), 0, len(np.unique(y)) - 1).astype(int)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

OUTPUT:

LEARNING:
In this experiment, we implemented Linear Regression using the HCV dataset.

EXPERIMENT-4
AIM: Write a program to exhibit the working of the decision tree based ID3 algorithm. With
the help of an appropriate data set, build the decision tree and classify a new sample.

THEORY:
ID3 (Iterative Dichotomiser 3) is a decision tree algorithm used for classification tasks. It builds
the tree by choosing the attribute with the highest Information Gain at each step, aiming to
reduce uncertainty (entropy) in the data.
Key Formulas:
1. Entropy (H)
Measures the uncertainty or impurity in the dataset.
$H(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)$

• $S$: current dataset
• $p_i$: proportion of instances in class $i$
• If all instances are of one class, entropy = 0 (pure set)
2. Information Gain (IG)
Measures the reduction in entropy after splitting the dataset on an attribute $A$.

$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$

• $\mathrm{Values}(A)$: possible values of attribute $A$
• $S_v$: subset of $S$ where attribute $A = v$
• Choose the attribute with the highest IG
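
The two formulas above translate almost directly into code. The following is a minimal sketch; the function names entropy and information_gain and the toy "outlook"/"windy" data are hypothetical and used only for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """IG(S, A) = H(S) - sum(|S_v| / |S| * H(S_v)) over the values v of A."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Made-up toy data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
]
labels = ["play", "play", "stay", "stay"]

print(entropy(labels))
print(information_gain(rows, labels, "outlook"))
print(information_gain(rows, labels, "windy"))
```

Splitting on "outlook" removes all uncertainty, so its information gain equals the full entropy of the set, while "windy" leaves each subset as mixed as before and gains nothing.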

How the ID3 Algorithm Works:


1. Start with the full dataset.
2. Calculate entropy of the current dataset.
3. For each attribute, calculate the information gain resulting from splitting the dataset
based on that attribute.
4. Choose the attribute with the highest information gain as the decision node.
5. Split the dataset into subsets based on the selected attribute’s values.
6. Repeat the above steps recursively for each subset, using only the remaining attributes.
7. Stop when:
▪ All data in a node belong to the same class (pure subset).

▪ There are no more attributes to split on.
▪ The dataset is empty (in which case a default class is assigned).
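
The steps above can be sketched as a short recursive function for categorical attributes. This is an illustrative sketch rather than the program used in this experiment: the helper names (split, id3, classify), the nested-dictionary tree representation and the toy data are all assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def split(rows, labels, attribute):
    """Group rows and labels by the value each row has for `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        group = groups.setdefault(row[attribute], ([], []))
        group[0].append(row)
        group[1].append(label)
    return groups

def information_gain(rows, labels, attribute):
    groups = split(rows, labels, attribute)
    remainder = sum(len(l) / len(labels) * entropy(l) for _, l in groups.values())
    return entropy(labels) - remainder

def id3(rows, labels, attributes, default=None):
    if not labels:                       # empty subset: fall back to the default class
        return default
    if len(set(labels)) == 1:            # pure subset: leaf node
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                   # no attributes left: majority-class leaf
        return majority
    # Pick the attribute with the highest information gain as the decision node.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {
        value: id3(sub_rows, sub_labels, remaining, majority)
        for value, (sub_rows, sub_labels) in split(rows, labels, best).items()
    }
    return {"attribute": best, "branches": branches, "default": majority}

def classify(tree, sample):
    """Walk the tree; unseen attribute values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(sample[tree["attribute"]], tree["default"])
    return tree

# Made-up toy data for illustration.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

On this data the root splits on "outlook", and only the "rainy" branch needs a second split on "windy".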
Advantages

• Simple and interpretable.
• Good for categorical data.
Limitations

• Can overfit on noisy data.
• Needs preprocessing for continuous variables.
• Prefers attributes with many values (can be biased).

CODE:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt

# Load the dataset (assumed to be the HCV file used in the earlier experiments)
df = pd.read_csv('hcvdat0.csv')

# Drop the row-index column if it is present
if 'Unnamed: 0' in df.columns:
    df = df.drop(columns=['Unnamed: 0'])

# Fill missing numeric values with the column mean
imputer = SimpleImputer(strategy='mean')
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Encode the categorical columns as integers
label_encoders = {}
for col in ['Sex', 'Category']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
le_cat = label_encoders['Category']

X = df.drop(columns=["Category"])
y = df["Category"]

# Oversample minority classes in the training split only
smote = SMOTE(k_neighbors=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# criterion='entropy' selects splits by information gain, as ID3 does
id3_smote = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=42)
id3_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = id3_smote.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_smote))
print(classification_report(y_test, y_pred_smote, target_names=le_cat.classes_))

plt.figure(figsize=(20, 10))
plot_tree(id3_smote, feature_names=list(X.columns), class_names=list(le_cat.classes_),
          filled=True, rounded=True)
plt.title("ID3 Decision Tree (Entropy-based)")
plt.show()

OUTPUT:

LEARNING:
In this experiment, we demonstrated the working of the decision tree based ID3 algorithm.

EXPERIMENT-5
AIM: Write a program to demonstrate the working of the decision tree based C4.5 algorithm.
With the help of the data set used in the above experiment, build the decision tree and classify a
new sample.

THEORY:
C4.5 is an advanced decision tree algorithm used for classification, developed as an extension
of ID3. It builds decision trees using the Gain Ratio, which corrects ID3's bias toward attributes
with many values. Unlike ID3, C4.5 can handle both continuous and categorical data, manage
missing values, and performs pruning to reduce overfitting. The algorithm selects the best
attribute to split the dataset based on the highest Gain Ratio, computed from Information Gain
and Split Information.
Key Formulas:
1. Entropy: Measures data impurity
$$H(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
2. Information Gain: Reduction in entropy
$$IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$$
3. Split Info: Entropy of the split itself
$$SI(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
4. Gain Ratio: Used to choose the best attribute
$$GR(S, A) = \frac{IG(S, A)}{SI(S, A)}$$
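To see how the Gain Ratio penalizes attributes with many values, the formulas above can be evaluated on a tiny example. This is a minimal sketch; the function name gain_ratio and the toy columns (row_id, outlook) are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(attr_values, labels):
    """GR = IG / SI for one attribute column and the class labels."""
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    info_gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(attr_values)   # SI is the entropy of the partition sizes
    return info_gain / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
row_id = [1, 2, 3, 4]              # unique value per row: IG = 1, SI = 2
outlook = ["s", "s", "r", "r"]     # two values:           IG = 1, SI = 1

print("GR(row_id)  =", gain_ratio(row_id, labels))
print("GR(outlook) =", gain_ratio(outlook, labels))
```

Both attributes have the same Information Gain, but the identifier-like attribute has a larger Split Info, so its Gain Ratio is halved and the two-valued attribute is preferred.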
Steps in the C4.5 Algorithm
▪ Start with the entire dataset.
▪ Calculate the entropy of the dataset.
▪ Compute Gain Ratio for all attributes.
▪ Choose the attribute with the highest Gain Ratio.
▪ Split the dataset accordingly.
▪ Repeat recursively for each subset.
▪ Apply Pruning after the tree is created.
Advantages:
▪ Works with both discrete and continuous data.
▪ Handles missing values gracefully.
▪ Reduces overfitting via pruning.
▪ Less biased compared to ID3 (thanks to Gain Ratio).

Limitations:
▪ Computationally more expensive than ID3.
▪ Can still overfit if not pruned correctly.
▪ May produce complex trees that are hard to interpret.

CODE:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Load the HCV dataset (file name taken from Experiment 6; adjust the path as needed)
df = pd.read_csv("HCVData.csv")
df = df.drop(columns=['Unnamed: 0'])

# Replace missing numeric values with the column mean
imputer = SimpleImputer(strategy='mean')
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Encode the categorical columns as integers
label_encoders = {}
for col in ['Sex', 'Category']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
le_cat = label_encoders['Category']

# Gini-based tree fitted on a 60-row sample, used only for the plot
df_sampled = df.sample(n=60, random_state=42)
X = df_sampled.drop(columns=["Category"])
y = df_sampled["Category"]
clf = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
clf.fit(X, y)
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=list(X.columns), class_names=list(le_cat.classes_),
          filled=True, rounded=True)
plt.title("Decision Tree (Gini-based)")
plt.show()

# scikit-learn does not implement C4.5 (gain ratio), so an entropy-based
# tree trained on SMOTE-resampled data is used here as an approximation
smote = SMOTE(k_neighbors=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Category']), df['Category'], test_size=0.2, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
C4_smote = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)
C4_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = C4_smote.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_smote))
print(classification_report(y_test, y_pred_smote,
                            target_names=label_encoders['Category'].classes_))

OUTPUT:

LEARNING:
In this experiment, we demonstrated the working of the decision tree based C4.5
algorithm.
EXPERIMENT-6
AIM: Write a program to demonstrate the working of the decision tree based CART algorithm.
Build the decision tree and classify a new sample using a suitable dataset. Compare the
performance of ID3, C4.5, and CART in terms of accuracy, recall, precision and
sensitivity.

THEORY:
CART is a popular decision tree algorithm used for both classification and regression tasks. It
builds binary trees by splitting the data at each node based on the feature that results in the best
separation of the target variable.

• Binary Splits Only: Each node splits the data into two child nodes, unlike ID3 or C4.5
which can create multi-way splits.
• Handles Both Tasks:
➢ Classification Trees: Used when the output is categorical.
➢ Regression Trees: Used when the output is numerical.
Key Formulas:
1. Gini Impurity (for classification)
Measures how often a randomly chosen element of the node would be incorrectly labeled if it
were labeled at random according to the class distribution in that node.
$$Gini(t) = 1 - \sum_{i=1}^{n} p_i^2$$
Where:
➢ p_i = proportion of class i in node t
➢ Lower Gini → purer node.
2. Mean Squared Error (MSE) (for regression)
Used to measure the quality of a split in regression tasks.
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y})^2$$
Where:
➢ y_i is the actual value
➢ ŷ is the predicted value (usually the mean of the node)
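Both impurity measures are short enough to compute by hand. The sketch below, with hypothetical helper names gini and node_mse, evaluates them on small lists, assuming the node prediction for regression is the mean of the targets in the node.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of the class labels in one node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def node_mse(targets):
    """MSE of a regression node that predicts the mean of its targets."""
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets) / len(targets)

print(gini(["a", "a", "b", "b"]))   # evenly mixed two-class node
print(gini(["a", "a", "a", "a"]))   # pure node
print(node_mse([1.0, 2.0, 3.0]))
```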

Steps in CART Algorithm:
▪ Start with the entire dataset.
▪ For each feature and possible split value, compute the Gini Impurity (classification) or
MSE (regression) for the split.
▪ Choose the feature and split that gives the best score (lowest impurity or error).
▪ Split the data into two subsets and repeat recursively.
Stop when:
▪ A node is pure (only one class), or
▪ A stopping condition is met (e.g., max depth, minimum samples)
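The split search in the steps above can be illustrated for a single numeric feature. This is a sketch under stated assumptions: candidate thresholds are taken as midpoints between consecutive distinct values, and the name best_split and the toy data are hypothetical. A full CART implementation would repeat this over all features and recurse on the two subsets.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split value <= threshold."""
    n = len(labels)
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, score = best_split(values, labels)
print("threshold:", threshold, "weighted Gini:", score)
```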

Advantages:
▪ Works for both classification and regression.
▪ Handles both categorical and numerical data.
▪ Simple and interpretable.
Limitations:
▪ Prone to overfitting without proper pruning.
▪ Greedy approach can result in suboptimal trees.
▪ Instability: Small data changes can lead to different trees.
CART is widely used for its versatility and clear structure in decision-making tasks.

CODE:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.utils.multiclass import unique_labels

df = pd.read_csv("HCVData.csv")
df = df.drop(columns=['Unnamed: 0'], errors='ignore')

# Replace missing numeric values with the column mean
imputer = SimpleImputer(strategy='mean')
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Encode the categorical columns as integers
label_encoders = {}
for col in ['Sex', 'Category']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

X = df.drop(columns=['Category'])
y = df['Category']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

smote = SMOTE(k_neighbors=2, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# ID3 (entropy criterion), trained on the original training set
id3 = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=42)
id3.fit(X_train, y_train)
id3_preds = id3.predict(X_test)

# C4.5 (approximated with entropy and balanced class weights)
c45 = DecisionTreeClassifier(criterion="entropy", max_depth=5, class_weight='balanced',
                             random_state=42)
c45.fit(X_train, y_train)
c45_preds = c45.predict(X_test)

# CART (Gini criterion), trained on the SMOTE-resampled training set
cart = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)
cart.fit(X_train_smote, y_train_smote)
cart_preds = cart.predict(X_test)

# Report only the classes that occur in the test labels or the CART predictions
labels_present = unique_labels(y_test, cart_preds)
target_names = [label_encoders['Category'].inverse_transform([i])[0] for i in labels_present]

# ID3 results
print("Accuracy:", accuracy_score(y_test, id3_preds))
print(classification_report(y_test, id3_preds, labels=labels_present,
                            target_names=target_names))

# C4.5 results
print("Accuracy:", accuracy_score(y_test, c45_preds))
print(classification_report(y_test, c45_preds, labels=labels_present,
                            target_names=target_names))

# CART results
print("Accuracy:", accuracy_score(y_test, cart_preds))
print(classification_report(y_test, cart_preds, labels=labels_present,
                            target_names=target_names))

OUTPUT:
ID3

C4.5

CART

LEARNING:
In this experiment, we demonstrated the working of the decision tree based CART algorithm
and compared its performance with that of ID3 and C4.5 in terms of accuracy, recall,
precision and sensitivity.

EXPERIMENT-7
AIM: Build an Artificial Neural Network by implementing the Back propagation algorithm
and test the same using appropriate data sets.

THEORY:
Artificial Neural Networks (ANNs) are computational models that simulate the way the human
brain processes information, used for tasks like classification, regression, and pattern
recognition. They consist of multiple layers of neurons (nodes) connected by weighted edges,
where each neuron performs a calculation based on input data and produces an output.
Key Components of ANN:
▪ Neurons: Each neuron processes an input and generates an output using an activation
function.
▪ Layers:
o Input Layer: Receives input data.
o Hidden Layers: Perform computations and process data.
o Output Layer: Provides the final prediction.
▪ Activation Functions: Introduce non-linearity to the model. Common functions include:
o Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
o ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
o Tanh: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
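The three activation functions listed above translate directly into code. The sketch below uses only the standard library and scalar inputs; in practice a library such as NumPy or TensorFlow would apply them element-wise to arrays.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), relu(value), tanh(value))
```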
Backpropagation Algorithm:
Backpropagation is a supervised learning algorithm used to train ANNs. It adjusts weights to
minimize the error between predicted and actual outputs using gradient descent.
1. Forward Pass: Input data is passed through the network, and the output of each layer is
computed using the following formula:
$$a = f(Wx + b)$$
• a is the output (activation).
• W is the weight matrix.
• x is the input vector.
• b is the bias term.
• f is the activation function (e.g., ReLU, Sigmoid).
2. Compute the Error (Loss): After obtaining the predicted output, the error (or loss) is
computed by comparing it with the actual output using a loss function, such as Mean Squared
Error (MSE) for regression or Cross-Entropy Loss for classification:
• MSE:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{true,i} - y_{pred,i})^2$$
where y_true,i is the true value and y_pred,i is the predicted value for the i-th data point,
and n is the number of data points.
3. Backward Pass (Backpropagation): The error is propagated backward through the
network, calculating the gradient of the loss with respect to the weights. Using the chain rule
of calculus, the gradient for each weight W is computed:

$$\frac{\partial Loss}{\partial W} = \frac{\partial Loss}{\partial a} \cdot \frac{\partial a}{\partial W}$$
where:
• ∂Loss/∂a is the derivative of the loss with respect to the activation.
• ∂a/∂W is the derivative of the activation with respect to the weight.
4. Update Weights: Weights are updated using gradient descent:

$$W = W - \eta \cdot \frac{\partial Loss}{\partial W}$$
• W is the weight.
• η is the learning rate (a small constant).
• ∂Loss/∂W is the gradient of the loss with respect to the weight.
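The four steps above can be traced on the smallest possible network: a single sigmoid neuron trained on one example with a squared-error loss. This is a minimal sketch, not the Keras model used in the CODE section; the function name train_step, the input values, and the learning rate are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, y_true, lr):
    """One forward pass, backward pass, and weight update for a single neuron."""
    # Forward pass: a = f(w . x + b)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = sigmoid(z)
    loss = (y_true - a) ** 2
    # Backward pass (chain rule): dLoss/dw = dLoss/da * da/dz * dz/dw
    dloss_da = -2.0 * (y_true - a)
    da_dz = a * (1.0 - a)              # derivative of the sigmoid
    grad_w = [dloss_da * da_dz * xi for xi in x]
    grad_b = dloss_da * da_dz
    # Gradient descent update: W = W - lr * dLoss/dW
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b = b - lr * grad_b
    return w, b, loss

w, b = [0.0, 0.0], 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=[1.0, 2.0], y_true=1.0, lr=0.5)
    losses.append(loss)

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

With more layers, the same chain-rule product is extended backward through every layer, which is what frameworks such as TensorFlow automate.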
Advantages of ANN:
• Non-linear Learning: ANNs can learn complex, non-linear patterns.
• Versatility: They can be used for a wide range of tasks such as classification, regression,
and time-series forecasting.
• Adaptability: ANNs can improve with more data and training.
Limitations of ANN:
• Computationally Expensive: Training large networks can be time-consuming and
require significant computational resources.
• Overfitting: ANNs can overfit the data if not properly regularized.
• Interpretability: ANN models, especially deep networks, are often seen as "black boxes"
because their decision-making process is hard to interpret.

CODE:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

path = "drive/MyDrive/hcvdat0.csv"
df = pd.read_csv(path)
df = df.drop(columns=['Unnamed: 0'])

# Replace missing numeric values with the column mean
imputer = SimpleImputer(strategy='mean')
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Encode the categorical columns as integers
label_encoders = {}
for col in ['Sex', 'Category']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

X = df.drop('Category', axis=1)
y = df['Category']

# Split first, then oversample only the training set with SMOTE
smote = SMOTE(k_neighbors=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Feed-forward network: two hidden ReLU layers and a softmax output layer
num_classes = len(set(y_train))
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Training runs backpropagation internally to update the weights
model.fit(X_train, y_train, epochs=30, batch_size=32)

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_classes))

OUTPUT:

LEARNING:
In this experiment, we built an Artificial Neural Network trained with the
Backpropagation algorithm and tested it on the HCV dataset.

EXPERIMENT-8
AIM: Write a program to implement the Naïve Bayesian classifier for appropriate dataset and
compute the performance measures of the model.

THEORY:
The Naive Bayes classifier is a probabilistic classification algorithm based on Bayes' Theorem.
It is particularly effective for large datasets and is widely used for text classification, such as
spam filtering and sentiment analysis. Despite its simplicity, Naive Bayes often performs
surprisingly well in many real-world applications.
Bayes' Theorem
At the core of the Naive Bayes classifier is Bayes' Theorem, which expresses the probability
of a class C given the observed features X (i.e., X=(x1,x2,…,xn)):
$$P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}$$
Where:
• P(C∣X) is the posterior probability: the probability of class C given the observed
features X.
• P(X∣C) is the likelihood: the probability of observing features X given that the class is
C.
• P(C) is the prior probability: the probability of class C before observing any features.
• P(X) is the evidence: the total probability of the features across all classes.
The main challenge is to compute P(C∣X) efficiently. Naive Bayes simplifies this calculation
by assuming that the features are conditionally independent, given the class.
Naive Assumption: Conditional Independence
The "naive" assumption in the Naive Bayes classifier is that each feature is independent of the
others, given the class. This means:
$$P(X \mid C) = P(x_1, x_2, \ldots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)$$
This greatly simplifies the computation of the likelihood term P(X∣C), as it turns a joint
probability distribution into a product of individual probabilities for each feature.
Working
Given the conditional independence assumption, the algorithm works as follows:
• Prior Probability: It calculates the probability of each class based on the training data.
• Likelihood: It calculates the probability of observing the given features for each class.
• Posterior Probability: Using Bayes' Theorem, it calculates the posterior probability
for each class given the features.
• Prediction: The class with the highest posterior probability is chosen as the predicted
class.
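The four steps above amount to a few multiplications per class. The sketch below works through them for categorical features; the function name naive_bayes_posterior and all probability values are made-up illustrative numbers, not estimates from the HCV data.

```python
def naive_bayes_posterior(priors, likelihoods, features):
    """
    priors:      {class: P(C)}
    likelihoods: {class: [ {value: P(x_i = value | C)}, ... ]}, one dict per feature
    features:    observed feature values, in the same order
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for i, value in enumerate(features):
            score *= likelihoods[cls][i][value]    # product of P(x_i | C)
        scores[cls] = score                        # P(X | C) * P(C)
    evidence = sum(scores.values())                # P(X)
    return {cls: score / evidence for cls, score in scores.items()}

# Hypothetical spam example with two yes/no features
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.5, "no": 0.5}],
    "ham":  [{"yes": 0.1, "no": 0.9}, {"yes": 0.5, "no": 0.5}],
}
posterior = naive_bayes_posterior(priors, likelihoods, ["yes", "no"])
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

Because P(X) is the same for every class, the prediction step only needs the unnormalized scores; the division is kept here so the posteriors sum to one.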

Advantages
• Simple and Fast: It is easy to implement and computationally efficient, even with large
datasets.
• Works Well with High-Dimensional Data: Especially useful for text classification,
where the number of features (e.g., words) is large.
• Handles Missing Data: Can handle missing data by ignoring features with missing
values during probability estimation.
• Works Well with Small Datasets: It performs surprisingly well even with limited data
and in cases where the independence assumption holds.
Disadvantages
• Independence Assumption: The assumption that features are independent is often
unrealistic, especially when features are correlated, leading to reduced performance.
• Limited Performance with Highly Correlated Features: When features are highly
dependent, the model's performance might degrade.
• Not Ideal for Continuous Features: Though Gaussian Naive Bayes handles continuous
features, the model can struggle if the distribution of these features deviates from
normal.

CODE:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report
)

df = pd.read_csv("HCVData.csv")
df.drop(columns=["Unnamed: 0"], inplace=True)
# Replace missing numeric values with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)

# Encode the categorical columns as integers
label_encoders = {}
for col in ['Sex', 'Category']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

X = df.drop(columns=['Category'])
y = df['Category']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Performance measures (weighted averages across classes)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted',
                            zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted',
                      zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred,
                                     target_names=label_encoders['Category'].classes_)

print("Accuracy:", accuracy)
print("\nClassification Report:\n", class_report)

OUTPUT:

LEARNING:
In this experiment, we implemented the Naïve Bayesian classifier on the HCV dataset and
computed the performance measures of the model.

EXPERIMENT-9
AIM: Write a program to implement k-Nearest Neighbor algorithm to classify any dataset of
your choice. Print both correct and wrong predictions.

THEORY:
The K-Nearest Neighbors (KNN) algorithm is a simple, instance-based learning algorithm used
for both classification and regression tasks. It works by identifying the "K" closest data points
(neighbors) to a given input and then making predictions based on the majority label or average
of these neighbors.
How KNN Works:
1. Store All Training Data:
KNN is a lazy learning algorithm, meaning it doesn't learn a model during training but instead
stores the entire training dataset.
2. Distance Metric:
To predict the class or value for a new instance, KNN uses a distance metric (usually Euclidean
distance) to measure how far the new instance is from all the training examples.
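For two points x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n):
Euclidean Distance Formula:
$$d(x, y) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}$$
Manhattan Distance Formula:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Minkowski Distance Formula:
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$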
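The Minkowski distance generalizes the other two: p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. A minimal sketch, with a hypothetical helper name minkowski:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print("Manhattan (p=1):", minkowski(x, y, 1))
print("Euclidean (p=2):", minkowski(x, y, 2))
```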
3. Find Nearest Neighbors:
For a new data point, the algorithm computes the distance to every training data point, selects
the K closest points (neighbors), and uses them to make a prediction.
4. Prediction:
For classification, the predicted class of the new instance is the class that appears most
frequently among its K nearest neighbors.
For regression, the predicted value is the average of the values of the K nearest neighbors.
Steps Involved in KNN:
• Choose the number K (the number of neighbors to consider).
• Compute the distance between the new data point and all the training points.
• Sort the training points based on the computed distance.
• Select the top K neighbors.

• For classification, predict the majority class; for regression, predict the average of the
values.
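The steps above fit in a few lines when written from scratch. This is an illustrative sketch for classification with Euclidean distance (math.dist, available in Python 3.8 and later); the function name knn_predict and the toy points are assumptions, and the CODE section uses scikit-learn's KNeighborsClassifier instead.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the majority class among the k training points nearest to query."""
    # Compute the distance to every training point and sort by distance
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Keep the labels of the k closest points and take a majority vote
    top_k = [label for _, label in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (2, 2), k=3))
print(knn_predict(train_X, train_y, (8.5, 8.5), k=3))
```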
Advantages of KNN:
• Simple and Intuitive: KNN is easy to understand and implement.
• No Training Phase: It’s a lazy learner, meaning it doesn’t require explicit training, just
storing the dataset.
• Adaptable: It can be used for both classification and regression tasks.
• Works Well with Small, Clean Datasets: KNN performs well when the dataset is small,
and the data is not too noisy.
Disadvantages of KNN:
• Computationally Expensive: Since it stores the entire training set and computes the
distance to every point during testing, it can be very slow for large datasets.
• High Memory Usage: KNN requires storing all training data, which can be memory-
intensive for large datasets.
• Sensitive to Irrelevant Features: KNN performs poorly if there are many irrelevant
features because they will distort the distance calculation.
• Choosing the Right K: Selecting the optimal number of neighbors K is crucial; too
few neighbors may lead to overfitting, while too many may cause underfitting.

CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.impute import SimpleImputer

df = pd.read_csv("hcvdat0.csv")

# Replace missing lab values with the column mean
continuous_cols = ['ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT',
                   'PROT']
imputer = SimpleImputer(strategy='mean')
df[continuous_cols] = imputer.fit_transform(df[continuous_cols])

df['Sex'] = df['Sex'].map({'m': 1, 'f': 0})

label_encoder = LabelEncoder()
df['Category'] = label_encoder.fit_transform(df['Category'])

# Drop the target and any row-identifier columns ('ID', 'Unnamed: 0') if present
X = df.drop(columns=['Category', 'ID', 'Unnamed: 0'], errors='ignore')
y = df['Category']

# k-NN is distance-based, so the features are standardized
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

df['Category'].value_counts().plot(kind='bar')
plt.title('Class Distribution')
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3,
                                                    random_state=42)

# Choose k by 5-fold cross-validated grid search
param_grid = {'n_neighbors': list(range(1, 21))}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_k = grid_search.best_params_['n_neighbors']
print(f"Best k value: {best_k}")

knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

accuracy_knn = accuracy_score(y_test, y_pred_knn) * 100
correct_knn = (y_pred_knn == y_test).sum()
wrong_knn = (y_pred_knn != y_test).sum()

print(f"k-NN Accuracy: {accuracy_knn:.2f}%")
print(f"Correct Predictions: {correct_knn}")
print(f"Wrong Predictions: {wrong_knn}\n")

# Print every test prediction, marked as correct or wrong
for i in range(len(y_test)):
    actual = y_test.iloc[i]
    predicted = y_pred_knn[i]
    if actual == predicted:
        print(f"Correct: Predicted = {predicted}, Actual = {actual}")
    else:
        print(f"Wrong: Predicted = {predicted}, Actual = {actual}")

cv_scores_knn = cross_val_score(knn, X_scaled, y, cv=5, scoring='accuracy')
print(f"Cross-validated k-NN Accuracy: {cv_scores_knn.mean() * 100:.2f}%")

# Random Forest on the same split, for comparison
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred_rf) * 100
print(f"Random Forest Accuracy: {accuracy_rf:.2f}%")

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

print("Random Forest Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))

print("k-NN Classification Report:")
print(classification_report(y_test, y_pred_knn))

print("k-NN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))

OUTPUT:

LEARNING:
In this experiment, we implemented the k-Nearest Neighbor algorithm to classify the HCV dataset and printed both the correct and wrong predictions.

EXPERIMENT-10
AIM: Apply k-Means clustering algorithm on suitable datasets and comment on
the quality of clustering.
THEORY:
K-Means is a popular unsupervised learning algorithm used for clustering tasks, where the goal
is to group similar data points into clusters. It partitions a dataset into K clusters based on
feature similarity, where K is a user-defined number of clusters.
How K-Means Works
The K-Means algorithm follows a straightforward approach to classify data into clusters:
1. Initialization:
▪ Choose the number of clusters K.
▪ Randomly select K data points from the dataset as the initial centroids of the clusters.
2. Assign Data Points to Closest Cluster:
▪ For each data point, calculate its distance to each centroid (often using Euclidean
distance).
▪ Assign each data point to the nearest centroid, forming K clusters.
3. Recompute Centroids:
▪ After assigning all points to clusters, recalculate the centroids by computing the mean
of all data points in each cluster.
4. Repeat:
▪ Repeat steps 2 and 3 until convergence (when the centroids no longer change or the
change is minimal).
Mathematical Representation:
1. Distance Calculation (Euclidean Distance is typically used):
$$Distance(x, c) = \left( \sum_{i=1}^{n} (x_i - c_i)^2 \right)^{1/2}$$
where:
▪ x_i represents the ith feature of the data point x.
▪ c_i represents the ith feature of the centroid c.
2. Centroid Update: For each cluster k, the new centroid c_k is calculated as:
$$c_k = \frac{1}{|S_k|} \sum_{x_j \in S_k} x_j$$
where:
▪ S_k is the set of points assigned to cluster k.
▪ x_j is a data point in the cluster.

Stopping Criteria:
The algorithm stops when:
▪ The centroids do not change significantly between iterations.
▪ A predefined number of iterations is reached.
▪ The assignment of points to clusters no longer changes.
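The assignment step, the centroid update, and the stopping criteria can be sketched in plain Python. This is a simplified illustration: the initial centroids are passed in explicitly (rather than chosen at random) so the run is reproducible, and the function name kmeans and the toy points are assumptions. The CODE section uses scikit-learn's KMeans.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop changing."""
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:      # converged: centroids unchanged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data: two well-separated groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print([len(c) for c in clusters])
```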
Advantages:
▪ Efficient and fast, especially for large datasets.
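▪ Scalable to handle large datasets with a time complexity of O(n⋅K⋅t) for a fixed number of features, where n is the number of points, K the number of clusters, and t the number of iterations.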
▪ Simple to understand and implement.
▪ Fast convergence in most cases.
Disadvantages:
▪ Requires predefined K, the number of clusters, which is often hard to determine.
▪ Sensitive to initial centroids, which may lead to suboptimal clustering.
▪ Assumes spherical clusters and may not work well for non-linearly separable clusters.
▪ Sensitive to outliers, which can affect centroid positioning.

CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

data = pd.read_csv('hcvdat0.csv')

# Keep the numeric columns; drop the exported row index if present
numeric_data = data.select_dtypes(include=[np.number])
numeric_data = numeric_data.drop(columns=['Unnamed: 0'], errors='ignore')
print("Numeric Columns:")
print(numeric_data.columns)
print("Numeric Data Shape:", numeric_data.shape)

# Replace missing values with the column mean, then standardize
imputer = SimpleImputer(strategy='mean')
numeric_data_imputed = pd.DataFrame(imputer.fit_transform(numeric_data),
                                    columns=numeric_data.columns)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data_imputed)

k = 3
kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(scaled_data)
numeric_data_imputed['Cluster'] = clusters

print("\nCluster Centers (in scaled space):")
print(kmeans.cluster_centers_)
print("\nCluster Counts:")
print(numeric_data_imputed['Cluster'].value_counts())

# Clustering quality: inertia (within-cluster sum of squares) and silhouette score
inertia = kmeans.inertia_
silhouette_avg = silhouette_score(scaled_data, clusters)
print("\nClustering Performance:")
print("Inertia:", inertia)
print("Silhouette Score:", silhouette_avg)

# Project to two dimensions with PCA when there are more than two features
if scaled_data.shape[1] > 2:
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(scaled_data)
    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(pca_result[:, 0], pca_result[:, 1], c=clusters, cmap='viridis',
                          alpha=0.6)
    plt.xlabel('PCA Component 1')
    plt.ylabel('PCA Component 2')
    plt.title('K-Means Clusters Visualized using PCA')
    plt.colorbar(scatter, label='Cluster')
else:
    plt.figure(figsize=(8, 6))
    plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=clusters, cmap='viridis', alpha=0.6)
    plt.xlabel(numeric_data_imputed.columns[0])
    plt.ylabel(numeric_data_imputed.columns[1])
    plt.title('K-Means Clusters')
plt.show()

OUTPUT:

LEARNING:
In this experiment, we applied the k-Means clustering algorithm to the HCV dataset and assessed the quality of the clustering using inertia and the silhouette score.

EXPERIMENT-11
AIM: Write a program to implement ensemble algorithm- AdaBoost and Bagging using
appropriate dataset and evaluate their performance on that dataset.

THEORY:
Ensemble algorithms combine multiple models (also known as weak learners) to create a
stronger model. The idea behind ensemble methods is that by combining multiple predictions,
we can improve the accuracy and robustness of the final model compared to any individual
model. These techniques are widely used in machine learning to boost performance, reduce
overfitting, and provide more reliable predictions.
The two main types of ensemble methods are Bagging and Boosting.
1. Bagging (Bootstrap Aggregating)
Bagging aims to reduce variance and prevent overfitting by training multiple instances of the
same model on different subsets of the dataset. These subsets are created by sampling the
training data with replacement (bootstrap sampling). The predictions of all models are
combined (by averaging for regression or voting for classification) to produce the final
prediction.
Key Features:
▪ Parallelism: Since each model is trained independently, bagging can be parallelized,
improving computational efficiency.
▪ Reduces Overfitting: By combining the predictions of multiple models, bagging
reduces the overall variance and overfitting, especially with high-variance models like
decision trees.
▪ Popular Model: Random Forest is a popular example of a bagging-based algorithm.
Advantages:
▪ Reduces variance and prevents overfitting.
▪ Works well with complex models (e.g., decision trees).
▪ Effective for high-dimensional data.
Disadvantages:
▪ Increased computational cost due to multiple models.
▪ Does not always reduce bias.
▪ May not improve performance if the base model is too simple.
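Bootstrap sampling and majority voting, the two ingredients of bagging described above, can be sketched with the standard library. The weak learner here (a nearest-class-mean rule on one feature) and all names are hypothetical choices made to keep the example short; bagging is usually applied to decision trees.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_nearest_mean(sample):
    """Hypothetical weak learner: predict the class whose mean feature value is closest."""
    groups = {}
    for x, y in sample:
        groups.setdefault(y, []).append(x)
    means = {y: sum(xs) / len(xs) for y, xs in groups.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def bagging_fit(data, n_estimators, seed=0):
    """Train one weak learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_nearest_mean(bootstrap_sample(data, rng)) for _ in range(n_estimators)]

def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical one-feature dataset: (feature, class)
data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
models = bagging_fit(data, n_estimators=25)
print(bagging_predict(models, 2.5), bagging_predict(models, 8.5))
```

Because every model is trained independently of the others, the list comprehension in bagging_fit could be distributed across workers without changing the result.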
2. Boosting
Boosting is an ensemble technique that works by sequentially training models. Each new model
tries to correct the errors of the previous model by focusing on the misclassified examples.
Unlike bagging, boosting combines the models in a weighted manner, where models that
perform better have more influence on the final prediction.

Key Features:
▪ Sequential Learning: Boosting builds the ensemble in a sequential manner, where each
model corrects the mistakes of the previous one.
▪ Adjusts Weights: Misclassified examples are given more weight, ensuring that
subsequent models focus on harder-to-classify data.
▪ Popular Algorithms: Examples include AdaBoost, Gradient Boosting, and XGBoost.
Advantages:
▪ Reduces both bias and variance.
▪ Effective for both classification and regression problems.
▪ Often provides high predictive accuracy.
Disadvantages:
▪ Can be sensitive to noisy data and outliers.
▪ Prone to overfitting, especially with too many iterations.
▪ Computationally intensive, as models are built sequentially.
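The weight adjustment that boosting relies on can be worked through for a single AdaBoost round. The sketch below assumes binary labels in {-1, +1} and a weighted error strictly between 0 and 1; the function name adaboost_round and the example predictions are illustrative.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """Return the learner weight alpha and the re-normalized sample weights."""
    # Weighted error: total weight of the misclassified samples
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - error) / error)
    # Correct samples (t * p = +1) shrink, misclassified ones (t * p = -1) grow
    new_weights = [w * math.exp(-alpha * t * p)
                   for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]          # the last sample is misclassified
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print("alpha:", alpha)
print("new weights:", new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.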

CODE:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE

# Load and preprocess data
path = "drive/MyDrive/hcvdat0.csv"
df = pd.read_csv(path)
df = df.drop(columns=['Unnamed: 0'])

# Replace missing numeric values with the column mean
imputer = SimpleImputer(strategy='mean')
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Encode the categorical columns as integers
label_encoders = {}
for col in ['Sex', 'Category']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

X = df.drop('Category', axis=1)
y = df['Category']

# Split first, then oversample only the training set with SMOTE
smote = SMOTE(k_neighbors=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)


class DecisionStump:
    """One-feature threshold classifier that predicts -1 or +1."""

    def __init__(self):
        self.feature_idx = None
        self.threshold = None
        self.alpha = None
        self.polarity = 1

    def predict(self, X):
        n_samples = X.shape[0]
        X_column = X[:, self.feature_idx]
        predictions = np.ones(n_samples)

        if self.polarity == 1:
            predictions[X_column < self.threshold] = -1
        else:
            predictions[X_column > self.threshold] = -1

        return predictions

class AdaBoostScratch:
    """AdaBoost with decision stumps. Expects NumPy arrays and labels in {-1, +1}."""

    def __init__(self, n_estimators=50):
        self.n_estimators = n_estimators
        self.models = []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        w = np.full(n_samples, 1 / n_samples)   # start with uniform sample weights

        for _ in range(self.n_estimators):
            model = DecisionStump()
            min_error = float('inf')

            # Find best feature and threshold
            for feature_idx in range(n_features):
                X_column = X[:, feature_idx]
                thresholds = np.unique(X_column)

                for threshold in thresholds:
                    for polarity in [1, -1]:
                        predictions = np.ones(n_samples)
                        if polarity == 1:
                            predictions[X_column < threshold] = -1
                        else:
                            predictions[X_column > threshold] = -1

                        # Weighted error: total weight of the misclassified samples
                        error = np.sum(w[y != predictions])

                        if error < min_error:
                            min_error = error
                            model.feature_idx = feature_idx
                            model.threshold = threshold
                            model.polarity = polarity

            # Calculate alpha (EPS guards against division by zero)
            EPS = 1e-10
            model.alpha = 0.5 * np.log((1 - min_error + EPS) / (min_error + EPS))

            # Update weights: misclassified samples gain weight, then renormalise
            predictions = model.predict(X)
            w *= np.exp(-model.alpha * y * predictions)
            w /= np.sum(w)

            self.models.append(model)

    def predict(self, X):
        # Sign of the alpha-weighted vote of all stumps
        preds = np.zeros(X.shape[0])
        for model in self.models:
            preds += model.alpha * model.predict(X)
        return np.sign(preds)

class BaggingScratch:
    """Bagging with majority voting. Expects NumPy arrays and non-negative integer labels."""

    def __init__(self, base_estimator=None, n_estimators=50):
        self.n_estimators = n_estimators
        self.models = []
        self.base_estimator = base_estimator or DecisionTreeClassifier(max_depth=3)

    def _bootstrap_sample(self, X, y):
        # Draw n_samples row indices with replacement
        n_samples = X.shape[0]
        idxs = np.random.choice(n_samples, n_samples, replace=True)
        return X[idxs], y[idxs]

    def fit(self, X, y):
        for _ in range(self.n_estimators):
            X_sample, y_sample = self._bootstrap_sample(X, y)
            model = clone(self.base_estimator)
            model.fit(X_sample, y_sample)
            self.models.append(model)

    def predict(self, X):
        # Rows: models, columns: test samples; take the majority vote per column
        all_preds = np.array([model.predict(X) for model in self.models])
        return np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=all_preds)

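# The scratch classes index with X[:, j], so convert the pandas objects to NumPy arrays
X_train_np, X_test_np = np.asarray(X_train), np.asarray(X_test)
y_train_np, y_test_np = np.asarray(y_train), np.asarray(y_test)

# AdaBoost: the scratch version only handles two classes labelled -1/+1, so the
# target is binarized here (assumption: encoded class 0, the first entry of
# le.classes_, versus all other classes)
y_train_bin = np.where(y_train_np == 0, -1, 1)
y_test_bin = np.where(y_test_np == 0, -1, 1)
ada_scratch = AdaBoostScratch(n_estimators=50)
ada_scratch.fit(X_train_np, y_train_bin)
ada_preds = ada_scratch.predict(X_test_np).astype(int)

# Bagging (uses the original multi-class target)
bag_scratch = BaggingScratch(n_estimators=50)
bag_scratch.fit(X_train_np, y_train_np)
bag_preds = bag_scratch.predict(X_test_np)

# Results
print("AdaBoost (Scratch) Performance:")
print(classification_report(y_test_bin, ada_preds, zero_division=0))
print(f"Accuracy: {accuracy_score(y_test_bin, ada_preds):.2f}\n")

print("Bagging (Scratch) Performance:")
print(classification_report(y_test_np, bag_preds, labels=np.arange(len(le.classes_)),
                            target_names=le.classes_, zero_division=0))
print(f"Accuracy: {accuracy_score(y_test_np, bag_preds):.2f}")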

OUTPUT:

LEARNING:
In this experiment, we implemented the ensemble algorithms AdaBoost and Bagging from
scratch and evaluated them on the HCV dataset.

EXPERIMENT-12
AIM: Select any two datasets based on their statistics and compare all the
implemented algorithms on them.

THEORY:
1. Linear Regression
▪ Definition: A statistical method used to model the relationship between a dependent
variable and one or more independent variables by fitting a linear equation to observed
data.
▪ Used for: Regression.
▪ When to use: When you need to predict a continuous variable and assume a linear
relationship between the dependent and independent variables.
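As a small illustration of fitting a linear equation to observed data, the sketch below solves an ordinary least squares problem with NumPy. The data points are hypothetical and generated from y = 2x + 1, so the fit should recover a slope near 2 and an intercept near 1.

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones for the intercept term.
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: minimise ||A @ [slope, intercept] - y||^2.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(slope, intercept)   # approximately 2 and 1
```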
2. ID3 (Iterative Dichotomiser 3)
▪ Definition: A decision tree algorithm that builds a tree by selecting the attribute with
the highest Information Gain at each step, reducing uncertainty (entropy) in the data.
▪ Used for: Classification.
▪ When to use: When working with categorical data and need an interpretable model that
classifies data into categories.
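The Information Gain criterion used by ID3 can be sketched in a few lines. The example below is a minimal illustration on made-up data; `entropy` and `information_gain` are hypothetical helper names, not part of any library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
labels = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rain", "rain"]
windy = [True, False, True, False]

print(information_gain(outlook, labels))  # 1.0
print(information_gain(windy, labels))    # 0.0
```

ID3 would split on `outlook` first here, since it removes all of the uncertainty about the class.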
3. C4.5
▪ Definition: An extension of ID3 for decision trees, it handles continuous data, prunes
trees to prevent overfitting, and uses the Gain Ratio for splitting.
▪ Used for: Classification.
▪ When to use: When working with both categorical and continuous data, and when
overfitting is a concern that needs to be addressed through pruning.
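The Gain Ratio divides the information gain by the split information (the entropy of the feature's own values), which penalises features with many distinct values. The sketch below is a minimal illustration on made-up data; the helper names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain divided by split information."""
    n = len(labels)
    remainder = sum(
        count / n * entropy([lab for f, lab in zip(feature_values, labels) if f == v])
        for v, count in Counter(feature_values).items()
    )
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: a unique ID per row has the maximum information gain (1 bit)
# but is penalised by its large split information (2 bits).
labels = ["yes", "yes", "no", "no"]
row_id = [1, 2, 3, 4]
outlook = ["sunny", "sunny", "rain", "rain"]

print(gain_ratio(row_id, labels))   # 0.5
print(gain_ratio(outlook, labels))  # 1.0
```

Plain information gain would rate both features equally, while the gain ratio prefers `outlook`.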
4. CART (Classification and Regression Trees)
▪ Definition: A decision tree algorithm that supports both classification and regression
tasks, using Gini Index for classification and variance for regression.
▪ Used for: Both Classification and Regression.
▪ When to use: When you need a versatile algorithm for both classification and
regression, or when you're dealing with a mixture of categorical and continuous
variables.
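For classification, CART chooses the binary split with the lowest weighted Gini impurity. The sketch below searches the thresholds of one numeric feature; the data and the helper names `gini` and `best_threshold` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split 'value <= threshold'."""
    best = (None, float("inf"))
    n = len(labels)
    for t in sorted(set(values))[:-1]:   # the largest value would leave the right side empty
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: the classes separate cleanly between 3 and 7.
values = [1, 2, 3, 7, 8, 9]
labels = ["a", "a", "a", "b", "b", "b"]
print(gini(labels))                    # 0.5
print(best_threshold(values, labels))  # (3, 0.0)
```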
5. ANN (Artificial Neural Networks) with Back Propagation
▪ Definition: A model consisting of layers of interconnected neurons, where learning
occurs via the backpropagation algorithm that adjusts weights based on errors to
minimize the loss.

▪ Used for: Both Classification and Regression.
▪ When to use: When dealing with complex, non-linear relationships in large datasets,
especially when feature interaction is difficult to model with other algorithms.
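Backpropagation is the chain rule applied layer by layer, from the output back to the input. The sketch below trains a tiny network with one tanh hidden layer on a single made-up sample; the layer sizes, the learning rate of 0.1, and the function names are arbitrary choices for illustration.

```python
import numpy as np

def forward(x, W1, W2):
    """One hidden layer with tanh activation and a linear output."""
    h = np.tanh(W1 @ x)
    return h, W2 @ h

def loss_and_grads(x, y, W1, W2):
    """Squared-error loss and its gradients, obtained by applying the chain rule backwards."""
    h, out = forward(x, W1, W2)
    err = out - y                             # dL/d(out) for L = 0.5 * err^2
    loss = 0.5 * float(err @ err)
    grad_W2 = np.outer(err, h)                # gradient for the output layer
    dh = W2.T @ err                           # error propagated back to the hidden layer
    grad_W1 = np.outer(dh * (1 - h**2), x)    # tanh'(z) = 1 - tanh(z)^2
    return loss, grad_W1, grad_W2

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])    # one hypothetical training sample
W1 = rng.normal(size=(4, 3)) * 0.5
W2 = rng.normal(size=(1, 4)) * 0.5

# Plain gradient descent on the single sample.
losses = []
for _ in range(50):
    loss, g1, g2 = loss_and_grads(x, y, W1, W2)
    losses.append(loss)
    W1 -= 0.1 * g1
    W2 -= 0.1 * g2
print(losses[0], losses[-1])   # the loss should shrink over the iterations
```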
6. Naive Bayesian Classifier
▪ Definition: A probabilistic classifier based on Bayes' Theorem, assuming independence
between features.
▪ Used for: Classification.
▪ When to use: When the features are conditionally independent and you have categorical
data, especially effective in text classification tasks like spam filtering.
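The spam-filtering use case can be sketched with word counts and Bayes' Theorem. The example below is a minimal multinomial Naive Bayes with add-one (Laplace) smoothing; the four training messages and the function names `train_nb` and `predict_nb` are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, lab in zip(docs, labels):
        word_counts[lab].update(words)
    vocab = {w for words in docs for w in words}
    return class_counts, word_counts, vocab

def predict_nb(words, class_counts, word_counts, vocab):
    """Pick the class maximising log P(class) + sum of log P(word | class)."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        n_words = sum(word_counts[c].values())
        score = log(n_c / total)
        for w in words:
            # Add-one smoothing avoids zero probabilities for words unseen in a class.
            score += log((word_counts[c][w] + 1) / (n_words + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical four-message training set.
docs = [["win", "cash", "now"], ["cheap", "cash"],
        ["meeting", "at", "noon"], ["lunch", "at", "noon"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(predict_nb(["cash", "now"], *model))      # spam
print(predict_nb(["noon", "meeting"], *model))  # ham
```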
7. K-nearest Neighbours (K-NN)
▪ Definition: A non-parametric algorithm that classifies a data point by the majority
class of its K nearest neighbours (or, for regression, averages their values).
▪ Used for: Both Classification and Regression.
▪ When to use: When you have a small to medium-sized dataset and when you want a
simple, non-parametric model with minimal training time.
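The majority-vote rule fits in a few lines of plain Python. The sketch below assumes Euclidean distance and uses made-up 2-D points; `knn_predict` is a hypothetical helper name.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pl: dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2, 2)))  # A
print(knn_predict(points, labels, (7, 7)))  # B
```

There is no training step: all the work happens at prediction time, when the distances to the stored points are computed.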
8. K-means Clustering
▪ Definition: An unsupervised learning algorithm that groups data into K clusters based
on proximity to centroids.
▪ Used for: Clustering (Unsupervised Learning).
▪ When to use: When you need to segment data into distinct groups, especially when the
number of clusters (K) is known or can be estimated.
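The grouping by proximity to centroids can be sketched with Lloyd's algorithm, which alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its points. The data, the fixed iteration count, and the random initialisation below are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Lloyd's algorithm with centroids initialised from random data points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_samples, k).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster happens to be empty.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Hypothetical data: two tight groups around (0, 0) and (10, 10).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
print(labels)      # the first three points share one label, the last three the other
print(centroids)
```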
9. Bagging and AdaBoost
Bagging:
▪ Definition: A method that trains multiple models on bootstrapped subsets of data,
reducing variance by averaging predictions (for regression) or voting (for
classification).
▪ Used for: Both Classification and Regression.
▪ When to use: When you want to reduce model variance and improve accuracy,
especially with unstable models like decision trees.
AdaBoost:
▪ Definition: An ensemble method that combines weak classifiers sequentially, adjusting
the weights of misclassified data points and focusing on difficult-to-classify examples.
▪ Used for: Both Classification and Regression.
▪ When to use: When you have weak models and want to boost their performance on
reasonably clean data; because misclassified points are up-weighted, AdaBoost is
sensitive to noisy labels and outliers.

Datasets:
▪ HCV Dataset: A real-world healthcare dataset that is frequently used for
classification problems. It contains patient records labelled with classes such as
Fibrosis, Cirrhosis, and Suspect Blood Donor.
▪ California Housing Dataset: This dataset describes geographical regions in
California, and the target is the median house value of each region. It is used for
regression, and is also converted to a classification problem by categorizing
the house values into high/low.

Application of Algorithms:
▪ Classification: The algorithms (ID3, C4.5, CART, Naive Bayes, KNN, ANN)
were evaluated on their ability to predict categorical outcomes (e.g., the HCV
category or high vs. low median house value). On the HCV dataset, AdaBoost,
Bagging, and Logistic Regression were evaluated as well.
▪ Regression: Linear Regression and ANN (MLPRegressor) were applied to predict
continuous variables (e.g., median house value in California).
▪ Clustering: KMeans was used to discover inherent groupings within the data
(e.g., clustering regions based on similar housing attributes).

CODE:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, adjusted_rand_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# ---- Dataset 1: HCV ----
df = pd.read_csv("hcvdat0.csv")
df = df.dropna()

# Encode every text column (including the target, Category) as integers
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])

X = df.drop('Category', axis=1)
y = df['Category']

# Scaled copy of the features, used for clustering only
X_scaled = StandardScaler().fit_transform(X)

# Same split under two names so the classification and regression code read alike
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X, y, test_size=0.3, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X, y, test_size=0.3, random_state=42)
results = []

# Classification
# Note: scikit-learn trees are CART-based; entropy splits stand in for ID3,
# and entropy splits with a depth limit stand in for C4.5
results.append(("HCV Dataset", 'ID3', 'Classification', accuracy_score(
    yc_test,
    DecisionTreeClassifier(criterion='entropy').fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("HCV Dataset", 'C4.5', 'Classification', accuracy_score(
    yc_test,
    DecisionTreeClassifier(criterion='entropy', max_depth=5).fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("HCV Dataset", 'CART', 'Classification', accuracy_score(
    yc_test,
    DecisionTreeClassifier(criterion='gini').fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("HCV Dataset", 'Naive Bayes', 'Classification', accuracy_score(
    yc_test,
    GaussianNB().fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("HCV Dataset", 'KNN', 'Classification', accuracy_score(
    yc_test,
    KNeighborsClassifier(n_neighbors=5).fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("HCV Dataset", 'ANN', 'Classification', accuracy_score(
    yc_test,
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000).fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("HCV Dataset", 'AdaBoost', 'Classification', accuracy_score(
    yc_test,
    AdaBoostClassifier(n_estimators=100).fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("HCV Dataset", 'Bagging', 'Classification', accuracy_score(
    yc_test,
    BaggingClassifier(n_estimators=100).fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("HCV Dataset", 'Logistic Regression', 'Classification', accuracy_score(
    yc_test,
    LogisticRegression(max_iter=1000).fit(Xc_train, yc_train).predict(Xc_test))))

# Regression (fits the integer-encoded Category as a numeric target)
results.append(("HCV Dataset", 'Linear Regression', 'Regression', r2_score(
    yr_test,
    LinearRegression().fit(Xr_train, yr_train).predict(Xr_test))))

# Clustering (scored against the true categories with the adjusted Rand index)
clusters = KMeans(n_clusters=len(set(y)), random_state=42).fit_predict(X_scaled)
results.append(("HCV Dataset", 'KMeans', 'Clustering', adjusted_rand_score(y, clusters)))

# Summary Table
summary_df = pd.DataFrame(results, columns=['Dataset', 'Algorithm', 'Task', 'Score'])
print("\n=== Summary Table ===")
print(summary_df.pivot(index='Dataset', columns=['Algorithm', 'Task'],
                       values='Score').round(4))

# ---- Dataset 2: California Housing ----
data = fetch_california_housing()
X, y_regression = data.data, data.target

# Convert regression target to classification (e.g., median value > 2.5)
y_classification = (y_regression > 2.5).astype(int)

# Train/test split
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X, y_classification, test_size=0.3, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X, y_regression, test_size=0.3, random_state=42)

# Scaling for clustering
X_scaled = StandardScaler().fit_transform(X)

# Results collection
results = []

# Classification models
results.append(("California Housing", 'ID3', 'Classification', accuracy_score(
    yc_test,
    DecisionTreeClassifier(criterion='entropy').fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("California Housing", 'C4.5', 'Classification', accuracy_score(
    yc_test,
    DecisionTreeClassifier(criterion='entropy', max_depth=5).fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("California Housing", 'CART', 'Classification', accuracy_score(
    yc_test,
    DecisionTreeClassifier(criterion='gini').fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("California Housing", 'Naive Bayes', 'Classification', accuracy_score(
    yc_test,
    GaussianNB().fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("California Housing", 'KNN', 'Classification', accuracy_score(
    yc_test,
    KNeighborsClassifier(n_neighbors=5).fit(Xc_train, yc_train).predict(Xc_test))))

results.append(("California Housing", 'ANN', 'Classification', accuracy_score(
    yc_test,
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=42)
    .fit(Xc_train, yc_train).predict(Xc_test))))

# Regression models
results.append(("California Housing", 'Linear Regression', 'Regression', r2_score(
    yr_test,
    LinearRegression().fit(Xr_train, yr_train).predict(Xr_test))))

results.append(("California Housing", 'ANN', 'Regression', r2_score(
    yr_test,
    MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=42)
    .fit(Xr_train, yr_train).predict(Xr_test))))

# Clustering model (two clusters, scored against the high/low labels)
clusters = KMeans(n_clusters=2, random_state=42).fit_predict(X_scaled)
results.append(("California Housing", 'KMeans', 'Clustering',
                adjusted_rand_score(y_classification, clusters)))

# Create summary DataFrame
summary_df = pd.DataFrame(results, columns=['Dataset', 'Algorithm', 'Task', 'Score'])
pivot_summary = summary_df.pivot(index='Dataset', columns=['Algorithm', 'Task'],
                                 values='Score').round(4)
pivot_summary

OUTPUT:

LEARNING:
In this experiment, we selected two datasets based on their statistics and compared
all the implemented algorithms on them.

EXPERIMENT-13
AIM: Conduct a survey of at least 5 different machine learning tools.
THEORY:
1. TensorFlow
Overview and Features
TensorFlow is a free, open-source library developed by Google for building and training
machine learning models. It is known for its flexibility, efficiency, and strong compatibility
with Python, making it accessible to both beginners and experts. TensorFlow supports a wide
range of applications, from deep learning to traditional machine learning, and includes high-
level APIs like Keras for rapid prototyping. One of its standout features is TensorBoard, which
provides powerful visualization and debugging tools, allowing users to track metrics, analyze
performance, and optimize models easily.
Pre-trained Models and APIs
TensorFlow offers a vast library of pre-trained models for tasks such as image classification
(Inception, MobileNet, ResNet, VGG), object detection (Faster R-CNN, YOLO), NLP (BERT,
ALBERT), and speech and audio tasks (wav2vec2, YAMNet). This accelerates development by
enabling transfer learning and reducing the need for training from scratch. The high-level Keras
API simplifies model building, making it suitable for rapid experimentation and deployment.
Practical Applications
▪ Image Recognition & Computer Vision: Used in healthcare for diagnostics, social
media for photo tagging, and autonomous vehicles for navigation.
▪ Natural Language Processing: Powers language translation, sentiment analysis, and text
summarization.
▪ Robotics & Autonomous Systems: Facilitates object detection, navigation, and
reinforcement learning in robotics.
▪ Industry Use: Widely adopted in healthcare, finance, retail, manufacturing, and
entertainment for tasks ranging from fraud detection to content creation.
Advantages
TensorFlow stands out for its scalability (runs on CPUs, GPUs, TPUs), strong community
support, comprehensive documentation, and robust ecosystem including tools like TensorFlow
Hub and TensorFlow Extended (TFX) for end-to-end ML workflows.

2. PyTorch
Overview and Features
PyTorch is an open-source deep learning library developed by Meta AI (Facebook). It is
renowned for its dynamic computation graph (define-by-run), which allows for real-time graph
construction and modification, making it highly flexible and intuitive—especially for research
and experimentation. PyTorch supports GPU acceleration, integrates seamlessly with Python,
and provides an imperative programming style that simplifies debugging and logic
development.
Core Modules
▪ torch.nn: For building neural networks.
▪ torch.optim: Optimization algorithms.
▪ torch.autograd: Automatic differentiation.
▪ torchvision: Utilities for computer vision tasks.
▪ DataLoader: Efficient data management and batch processing.
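The modules listed above typically appear together in a training loop. The sketch below fits a single linear layer to made-up data drawn from y = 3x - 1; the batch size, learning rate, and epoch count are arbitrary choices, and the example assumes the `torch` package is installed.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical regression data: y = 3x - 1 plus a little noise.
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3 * x - 1 + 0.01 * torch.randn_like(x)
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

model = nn.Linear(1, 1)                                   # torch.nn: model definition
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # torch.optim: parameter updates
loss_fn = nn.MSELoss()

for epoch in range(200):
    for xb, yb in loader:              # DataLoader: shuffled mini-batches
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(xb), yb)
        loss.backward()                # torch.autograd: compute gradients
        optimizer.step()

print(model.weight.item(), model.bias.item())   # should end up close to 3 and -1
```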
Practical Applications
▪ Computer Vision: Image classification, object detection, segmentation, and generative
models (GANs).
▪ Natural Language Processing: Sentiment analysis, translation, and text generation using
RNNs, LSTMs, and Transformers.
▪ Reinforcement Learning: Used in robotics, games, and autonomous systems.
▪ Healthcare: Medical image analysis and predictive analytics.
▪ Finance & Recommendation Systems: Fraud detection, credit scoring, and personalized
recommendations.
Advantages
PyTorch is favored in academia and research for its flexibility, ease of use, and strong
community. Its dynamic graph and Pythonic interface streamline model development and
debugging. PyTorch also supports deployment on edge devices and offers native ONNX export
for cross-platform compatibility.

3. Scikit-learn
Overview and Features
Scikit-learn is a versatile, open-source Python library designed for traditional machine learning
tasks. Built on top of NumPy, SciPy, and Matplotlib, it provides a consistent and user-friendly
API for data preprocessing, model training, and evaluation. Its modular design allows users to
build pipelines using interchangeable components.
Key Features
▪ Wide Algorithm Support: Classification (logistic regression, SVM, decision trees),
regression, clustering (K-means, DBSCAN), and dimensionality reduction (PCA).
▪ Data Preprocessing: Splitting, scaling, feature selection, and extraction.
▪ Model Evaluation: Metrics (accuracy, precision, recall, F1-score, MSE) and
cross-validation.
▪ Model Selection: Grid search and randomized search for hyperparameter tuning.
▪ Inbuilt Datasets: For experimentation and learning.
▪ Open-source and Community-driven: Ensures continuous development and support.
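Several of these features can be combined in a few lines. The sketch below chains scaling and an SVM into a pipeline and tunes one hyperparameter with grid search on the built-in Iris dataset; the choice of model, the grid values, and the split are arbitrary and only meant as an illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # one of the library's built-in datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Preprocessing and model chained into one estimator.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# 5-fold cross-validated search over the SVM regularisation strength C.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))   # accuracy on the held-out test set
```

Because the scaler sits inside the pipeline, it is refitted on each training fold during cross-validation, so no information from the validation folds leaks into preprocessing.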
Practical Applications
▪ Finance: Credit risk assessment, fraud detection.
▪ Healthcare: Disease prediction, treatment optimization.
▪ Marketing: Customer segmentation, churn prediction.
▪ Text Mining: Spam detection, sentiment analysis.
▪ General Data Science: Prototyping, education, and research.
Advantages
Scikit-learn is praised for its simplicity, performance, and integration with the broader Python
data science ecosystem. It is ideal for structured/tabular data and quick prototyping.

4. Azure Machine Learning
Overview and Features
Azure Machine Learning (Azure ML) is a cloud-based platform from Microsoft that supports
the end-to-end machine learning lifecycle. It offers a drag-and-drop visual interface (Azure ML
Studio), scalable cloud resources, and integration with popular frameworks like TensorFlow,
PyTorch, and Scikit-learn.
Key Features
▪ AI Operationalization: Seamless model integration into business applications.
▪ Real-time Predictions: Supports immediate feedback and predictions.
▪ Cost Efficiency: Pay-as-you-go model with scalable resources.
▪ MLOps Tools: For monitoring, retraining, and redeployment.
▪ Security and Compliance: Role-based access, audit trails, and certifications.
▪ Support for Diverse Workloads: Handles various frameworks and languages.
Practical Applications
▪ Healthcare: Medical image analysis, diagnostics.
▪ Finance: Credit scoring, customer analytics.
▪ Manufacturing: Predictive maintenance, quality control.
▪ Retail: Demand forecasting, personalized recommendations.
▪ Energy: Consumption optimization.
▪ HR & Telecom: Talent acquisition, network optimization.
Advantages
Azure ML simplifies model development and deployment, supports a wide range of algorithms
and data sources, and facilitates MLOps for production-scale machine learning. Its enterprise-
grade security and compliance make it suitable for large organizations.

5. H2O.ai
Overview and Features
H2O.ai is an open-source, distributed in-memory machine learning platform known for its
scalability and automated machine learning (AutoML) capabilities. It supports a broad range
of algorithms, including gradient boosted machines, generalized linear models, deep learning,
and more. H2O can be used via R, Python, or its graphical interface, H2O Flow.
Key Features
▪ AutoML: Automates algorithm selection, feature engineering, hyperparameter tuning,
and model assessment.
▪ Distributed Computing: Handles big data efficiently via integration with Hadoop,
Spark, and Kubernetes.
▪ Seamless Deployment: Models can be exported as Java (POJO), binary (MOJO), or
deployed as REST APIs.
▪ Model Interpretation: Tools for variable importance, partial dependence plots, and
SHAP values.
▪ Continuous Learning: Supports regular retraining and optimization.
Practical Applications
▪ Insurance & Healthcare: Large-scale claims modeling, disease prediction.
▪ Finance: Credit scoring, fraud detection.
▪ Retail & Marketing: Customer segmentation, demand forecasting.
▪ General Business Analytics: Predictive maintenance, anomaly detection.
Advantages
H2O.ai excels in automation, scalability, and transparency. Its AutoML feature saves time and
effort, making advanced machine learning accessible even to non-experts. The platform is
highly extensible and integrates with various data science tools and big data infrastructure.

SUMMARY TABLE:

LEARNING:
In this experiment, we learned about 5 machine learning tools.