AI&ML BAIL606 ML Lab Manual
LAB MANUAL
(Effective from the academic year 2024-2025 under 2022 CBCS scheme)
CODE:-
# Import the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Load the tips dataset from seaborn and choose a numerical column to analyse
df = sns.load_dataset("tips")
nc = "total_bill"  # numerical column under study (any numeric column works)

# Statistical analysis
mean = df[nc].mean()
median = df[nc].median()
mode = df[nc].mode()[0]  # first modal value
var = df[nc].var()
std = df[nc].std()
dr = df[nc].max() - df[nc].min()  # data range
print(f"Mean: {mean}, Median: {median}, Mode: {mode}")
print(f"Variance: {var}, Std Dev: {std}, Range: {dr}")
# Generate a histogram
plt.figure(figsize=(10,6))
sns.histplot(df[nc],kde=True)
plt.title(f'Histogram of {nc}')
plt.xlabel(nc)
plt.ylabel('Frequency')
plt.show()
# Generate a boxplot
plt.figure(figsize=(10,6))
sns.boxplot(x=df[nc])
plt.title(f'Boxplot of {nc}')
plt.xlabel(nc)
plt.show()
OUTPUT:-
VIVA-VOCE:-
● The "tips" dataset from the seaborn library is used, which contains information about
restaurant tips, including columns like total_bill, tip, sex, smoker, day, time,
and size.
● KDE provides a smoothed estimate of the data distribution, making it easier to visualize
patterns.
● A boxplot (sns.boxplot()) shows the median, quartiles (Q1 and Q3), outliers, and
data spread. It helps in detecting skewness and extreme values.
● The program computes the frequency of each category in the sex column and visualizes
it using bar and pie charts (see the sketch after this list).
11. What is the difference between a bar chart and a pie chart?
● A bar chart is used to compare categorical values, whereas a pie chart represents
proportions as slices of a circle.
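The bar- and pie-chart cell referred to above is not reproduced in the CODE section. A minimal sketch, assuming the same seaborn tips dataset and its sex column, could look like this:
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("tips")
counts = df["sex"].value_counts()
print(counts)

# Bar chart: compares the counts of each category side by side
counts.plot(kind="bar", color=["steelblue", "salmon"])
plt.title("Frequency of 'sex' categories (bar chart)")
plt.ylabel("Count")
plt.show()

# Pie chart: shows the same counts as proportions of a whole
counts.plot(kind="pie", autopct="%1.1f%%")
plt.title("Proportion of 'sex' categories (pie chart)")
plt.ylabel("")
plt.show()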
CODE:-
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
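# NOTE: the analysis cells of this program are not reproduced in the manual.
# A minimal sketch of the steps described in the viva notes below (assuming the
# seaborn "tips" dataset and its total_bill / tip columns) is:
df = sns.load_dataset("tips")
x_col, y_col = "total_bill", "tip"

# Scatter plot of the two numerical columns
sns.scatterplot(x=df[x_col], y=df[y_col])
plt.title(f"Scatter plot of {x_col} vs {y_col}")
plt.show()

# Pearson correlation, covariance matrix and correlation matrix
print("Pearson correlation:", df[x_col].corr(df[y_col]))
num_df = df.select_dtypes(include=np.number)
print("Covariance matrix:\n", num_df.cov())
corr_matrix = num_df.corr()
print("Correlation matrix:\n", corr_matrix)

# Heatmap of the correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()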
OUTPUT:-
VIVA-VOCE:-
● This program performs a statistical analysis of two numerical columns from a dataset by
plotting a scatter plot, calculating Pearson correlation, computing covariance and
correlation matrices, and visualizing the correlation matrix using a heatmap.
● The scatter plot visually represents the relationship between two numeric variables.
● The scatter plot is generated by using Seaborn’s scatterplot() function.
● It shows the nature of correlation (positive, negative, or no correlation) between the
selected numerical columns.
● Pearson correlation measures the strength and direction of a linear relationship between
two variables.
● Pearson Correlation is computed in the code using Pandas’ corr() function.
● Covariance measures the direction of the relationship between two variables but does
not standardize the values (see the small numeric illustration after this list).
8. Why is a heatmap used for visualizing correlations?
● A heatmap makes it easy to identify variables that have strong positive or negative
correlations.
● The correlation matrix is visualized using Seaborn’s heatmap() function.
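A tiny numeric illustration of the last two points (covariance is scale-dependent, while Pearson correlation is standardized to the range [-1, 1]); this is illustrative only and not part of the original program:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Covariance depends on the scale of the variables...
print(np.cov(x, y)[0, 1])             # 5.0
print(np.cov(x, 100 * y)[0, 1])       # 500.0 (scaling y changes the covariance)

# ...whereas Pearson correlation is standardized
print(np.corrcoef(x, y)[0, 1])        # 1.0
print(np.corrcoef(x, 100 * y)[0, 1])  # 1.0 (scaling does not change the correlation)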
CODE:-
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset and standardize the features to zero mean and unit variance
iris = load_iris()
x = iris.data
y = iris.target
x_scaled = StandardScaler().fit_transform(x)
print(x_scaled[:5])
OUTPUT:-
cov_matrix = np.cov(x_scaled.T)
print(cov_matrix)
OUTPUT:-
# Eigen decomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print(eigenvalues)
print("\n", eigenvectors)
OUTPUT:-
# Sort the eigenvalues (and matching eigenvectors) in descending order of variance
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues_sorted=eigenvalues[sorted_indices]
eigenvectors_sorted = eigenvectors[:,sorted_indices]
print(eigenvalues_sorted)
print("\n",eigenvectors_sorted)
OUTPUT:-
top_2_eigenvectors = eigenvectors_sorted[:,:2]
print(top_2_eigenvectors)
OUTPUT:-
x_pca = x_scaled.dot(top_2_eigenvectors)
print(x_pca)
OUTPUT:-
# Fraction of total variance captured by the first two principal components
total_variance = sum(eigenvalues_sorted)
explained_variance_ratio = eigenvalues_sorted[:2] / total_variance
print(f"Explained variance ratio of the first two components: {explained_variance_ratio}")
# Visualization
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1], c=y, cmap="viridis",edgecolor="k",s=50)
plt.title("PCA of Iris Dataset (Reduced to 2D)",fontsize = 14)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(label="Species")
plt.show()
OUTPUT:-
VIVA-VOCE:-
● The objective is to apply Principal Component Analysis (PCA) on the Iris dataset to
reduce the number of features from 4 to 2, while retaining the most important
information.
● PCA is sensitive to scale. Standardizing ensures that all features contribute equally by
transforming them to have zero mean and unit variance.
● The principal components are ranked based on variance, so we select the top
components that contribute the most.
● Since PCA reduces the data to 2D, a scatter plot helps in visualizing how well the data
clusters after transformation.
● The color represents different species (target labels) in the Iris dataset.
● The explained variance ratio tells how much information the selected principal
components retain from the original dataset (see the cross-check sketch below).
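As a quick cross-check on the manual eigen-decomposition above, the same 2-D reduction can be obtained with scikit-learn's PCA class; this short sketch is not part of the original listing:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the Iris features, then reduce them to two principal components
x_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2)
x_pca = pca.fit_transform(x_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)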
CODE:-
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
# Load the Iris dataset (as described in the viva notes below)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Evaluate k-NN for each value of k and collect (k, accuracy, f1) tuples
def evaluate_knn(k_values, X_train, X_test, y_train, y_test, weighted=False):
    results = []
    for k in k_values:
        # Initialize k-NN classifier with or without distance-based weights
        if weighted:
            knn = KNeighborsClassifier(n_neighbors=k, weights='distance')
        else:
            knn = KNeighborsClassifier(n_neighbors=k, weights='uniform')
        # Train the classifier
        knn.fit(X_train, y_train)
        # Make predictions
        y_pred = knn.predict(X_test)
        # Record accuracy and weighted-average F1-score for this k
        results.append((k, accuracy_score(y_test, y_pred), f1_score(y_test, y_pred, average='weighted')))
    return results
# Test k-NN for different values of k, for both unweighted and weighted classifiers
k_values = [1, 3, 5]
# Unweighted k-NN
print("Unweighted k-NN:")
unweighted_results = evaluate_knn(k_values, X_train, X_test, y_train, y_test, weighted=False)
for k, accuracy, f1 in unweighted_results:
print(f"k={k}, Accuracy={accuracy:.4f}, F1-score={f1:.4f}")
# Weighted k-NN
print("Weighted k-NN:")
weighted_results = evaluate_knn(k_values, X_train, X_test, y_train, y_test, weighted=True)
for k, accuracy, f1 in weighted_results:
print(f"k={k}, Accuracy={accuracy:.4f}, F1-score={f1:.4f}")
OUTPUT:-
VIVA-VOCE:-
● The objective is to implement the k-Nearest Neighbors (k-NN) algorithm on the Iris
dataset for classification, evaluate its performance with different values of k, and
compare weighted and unweighted k-NN.
● k-NN is a supervised learning algorithm that classifies a data point based on the
majority vote of its k nearest neighbors in the feature space.
1. Computes the distance between the query point and all other points.
2. Selects the k nearest neighbors.
3. Assigns the most common class label among the neighbors to the query point (see the sketch after this list).
● train_test_split() randomly splits the dataset into 80% training data and 20% testing data.
● KNeighborsClassifier(n_neighbors=k, weights='uniform') initializes a k-NN classifier with k neighbors and assigns equal weights to all neighbors.
● Weighted k-NN is preferred when data points closer to the query point should have higher influence, especially
in cases where:
○ Data is noisy.
○ Class imbalance exists.
● If weighted k-NN outperforms unweighted k-NN, it means that closer neighbors should
be given higher importance.
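A from-scratch sketch of the three steps listed above, on toy data (illustrative only, not the scikit-learn program used in this experiment):
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # 1. Compute the distance between the query point and all training points
    distances = np.linalg.norm(X_train - query, axis=1)
    # 2. Select the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Assign the most common class label among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.0, 4.0]), k=3))  # predicts class 1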
OUTPUT:-
VIVA-VOCE:-
● LWR is a variant of linear regression where each data point has a different contribution
to the prediction based on its distance from the query point.
● It gives higher weights to nearby points and lower weights to distant points.
3. How does LWR differ from Ordinary Least Squares (OLS) regression?
● It does not learn a fixed set of parameters for the entire dataset but instead computes
weights dynamically for each query point.
● The bandwidth (τ: tau) controls how much weight is given to nearby points.
● A small τ → More localized model (risk of overfitting).
● A large τ → More generalized model (risk of underfitting).
● To compare how well the LWR curve fits the data points (a minimal sketch of LWR follows below).
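The LWR listing itself is not reproduced in this section. A minimal sketch, assuming a Gaussian kernel weight w = exp(-(x - x_q)^2 / (2*tau^2)) and synthetic sine data (not the original program), is:
import numpy as np
import matplotlib.pyplot as plt

def lwr_predict(x_query, X, y, tau=0.5):
    # Gaussian kernel: nearby points get weights near 1, distant points near 0
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    # Weighted least-squares fit of a straight line around the query point
    A = np.vstack([np.ones_like(X), X]).T  # design matrix [1, x]
    W = np.diag(w)
    theta = np.linalg.pinv(A.T @ W @ A) @ A.T @ W @ y
    return theta[0] + theta[1] * x_query

# Synthetic noisy sine data
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 80)
y = np.sin(X) + rng.normal(0, 0.15, X.shape)

# Predict at a dense grid of query points and plot the fitted curve
x_plot = np.linspace(0, 2 * np.pi, 200)
y_plot = np.array([lwr_predict(xq, X, y, tau=0.5) for xq in x_plot])
plt.scatter(X, y, s=15, label='Data')
plt.plot(x_plot, y_plot, color='red', label='LWR fit (tau = 0.5)')
plt.legend()
plt.show()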
1. Linear Regression for Boston Housing Dataset
CODE:-
# Load necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the Boston Housing data (assumed source: OpenML, since sklearn's
# load_boston has been removed), then fit a linear regression model
from sklearn.datasets import fetch_openml
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data.astype(float)  # CHAS/RAD arrive as categoricals; cast all columns to float
y = boston.target.astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred_lr):.2f}, R2 score: {r2_score(y_test, y_pred_lr):.2f}")

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, label='Linear Regression')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')
plt.title('Linear Regression with Boston Housing Dataset')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.legend()
plt.show()
OUTPUT:-
2. Polynomial Regression for Auto MPG Dataset
CODE:-
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing, fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
def polynomial_regression_auto_mpg():
    # Load the Auto MPG dataset from OpenML
    auto_mpg = fetch_openml(name="autoMpg", version=1, as_frame=True)
    data = auto_mpg.data
    target = auto_mpg.target

    # Remove rows with missing 'horsepower' values from both data and target
    data = data.dropna(subset=["horsepower"])
    target = target.loc[data.index]
    X_hp = data[["horsepower"]].astype(float)
    y_mpg = target.astype(float)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X_hp, y_mpg, test_size=0.2, random_state=42)

    # Polynomial transformation
    poly_features = PolynomialFeatures(degree=3)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)

    # Fit a linear model on the polynomial features and predict on the test set
    poly_model = LinearRegression()
    poly_model.fit(X_train_poly, y_train)
    y_pred_poly = poly_model.predict(X_test_poly)

    # Evaluation metrics
    mse_poly = mean_squared_error(y_test, y_pred_poly)
    r2_poly = r2_score(y_test, y_pred_poly)

    # Sort the test points by horsepower so the fitted curve plots smoothly
    sort_idx = np.argsort(X_test.values.ravel())
    X_test_sorted = X_test.values.ravel()[sort_idx]
    y_pred_sorted = y_pred_poly[sort_idx]

    # Visualization
    plt.scatter(X_test, y_test, color='blue', label='Actual')
    plt.plot(X_test_sorted, y_pred_sorted, color='red', linewidth=2, label='Polynomial Regression Fit')
    plt.xlabel('Horsepower')
    plt.ylabel('Miles per Gallon (MPG)')
    plt.title('Polynomial Regression: Horsepower vs. MPG')
    plt.legend()
    plt.show()

    # Output metrics
    print(f'Mean Squared Error: {mse_poly}')
    print(f'R² Score: {r2_poly}')

def run_models():
    polynomial_regression_auto_mpg()

run_models()
OUTPUT:-
7. Develop a program to load the Titanic dataset. Split the data into training and
test sets. Train a decision tree classifier. Visualize the tree structure. Evaluate
accuracy, precision, recall, and F1-score.
● Load the Titanic dataset as a .csv file in Jupyter Notebook and execute the following
program.
CODE:-
# Load necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
# Load the Titanic dataset from the downloaded CSV file (filename assumed)
df = pd.read_csv("titanic.csv")
df.head()
OUTPUT:-
CODE:-
# Data Preprocessing: fill missing values
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Cabin'] = df['Cabin'].fillna('U')  # Fill missing Cabin values with 'U' for unknown
df.head()
OUTPUT:-
CODE:-
# Encode categorical columns, then drop unnecessary ones
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])
df['Embarked'] = label_encoder.fit_transform(df['Embarked'])  # C:0, Q:1, S:2
df.drop(columns=['Name', 'Ticket', 'Cabin'], inplace=True)
X=df.drop(columns=['Survived'])
y=df['Survived']
df.head()
OUTPUT:-
CODE:-
# Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf=DecisionTreeClassifier(random_state=42)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
# Evaluation Metrics
accuracy=accuracy_score(y_test,y_pred)
precision=precision_score(y_test,y_pred)
recall=recall_score(y_test,y_pred)
f1=f1_score(y_test,y_pred)
print(f"Accuracy:{accuracy:.4f}")
print(f"Precision:{precision:.4f}")
print(f"Recall:{recall:.4f}")
print(f"F1-score:{f1:.4f}")
# Visualization
plt.figure(figsize=(12,8))
plot_tree(clf, filled=True, feature_names=X.columns, class_names=['Not Survived', 'Survived'], rounded=True)
plt.title("Decision Tree Classifier for Titanic dataset")
plt.show()
OUTPUT:-
8. Develop a program to implement the Naive Bayesian Classifier considering the
Iris dataset. Compute the accuracy of the classifier by considering the test data.
CODE:-
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
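# NOTE: the training and evaluation cells of this program are not reproduced in
# the manual. A minimal sketch of the missing steps (Gaussian Naive Bayes on the
# Iris data, as the problem statement asks) is:
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))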
OUTPUT:-
9. Develop a program to implement k-means clustering using Wisconsin Breast
Cancer dataset and visualize the clustering result.
CODE:-
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load the Wisconsin Breast Cancer data and standardize the features
data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

# Cluster the scaled data into 2 groups with k-means
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Project to 2 principal components for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = clusters

# Visualization
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100, edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='yellow', marker='X', label='Centroids')
plt.title('k-Means Clustering for Wisconsin Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()
OUTPUT:-
EXTRA PROGRAMS
1. Develop a program to demonstrate the working of the Perceptron learning algorithm using the OR gate.
CODE:-
# Import the required libraries
import numpy as np
# OR gate dataset
X = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])
y = np.array([0, 1, 1, 1])
# Initialize weights, bias and learning rate (assumed values)
weights = np.zeros(2)
bias = 0.0
lr = 0.1
# Perceptron training loop with a step-function prediction
for epoch in range(10):
    for i in range(len(X)):
        y_pred = 1 if np.dot(weights, X[i]) + bias > 0 else 0
        error = y[i] - y_pred
        # Update rule
        weights += lr * error * X[i]
        bias += lr * error
print("Learned weights:", weights, "bias:", bias)
OUTPUT:-
2. Develop a program to demonstrate the working of Logistic Regression using
Iris dataset.
CODE:-
# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the Iris dataset, split it, and train a logistic regression model
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
OUTPUT:-
SAMPLE VIVA-VOCE BASED ON THEORY SYLLABUS