
CMR INSTITUTE OF TECHNOLOGY

(Affiliated to VTU, Approved by AICTE, Accredited by NBA and NAAC with “A++” Grade)
ITPL MAIN ROAD, BROOKFIELD, BENGALURU-560037, KARNATAKA, INDIA

Department of Artificial Intelligence and Machine Learning

LAB MANUAL
(Effective from the academic year 2024-2025 under 2022 CBCS scheme)

Subject: Machine Learning Laboratory


Subject Code: BAIL606
Semester: 6
1. Develop a program to load a dataset and select one numerical column.
Compute mean, median, mode, standard deviation, variance, and range for a
given numerical column in a dataset. Generate a histogram and a boxplot to
understand the distribution of the data. Identify any outliers in the data using IQR.
Select a categorical variable from a dataset. Compute the frequency of each
category and display it as a bar chart or pie chart.

CODE:-
# Import the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset


df=sns.load_dataset('tips')

# Select a numerical column


nc='total_bill'

#Statistical analysis
mean=df[nc].mean()
median=df[nc].median()
mode=df[nc].mode()
var=df[nc].var()
std=df[nc].std()
dr=df[nc].max()-df[nc].min()

# Print the computed statistics


print(f'Mean:{mean}')
print(f'Median:{median}')
print(f'Mode:{mode}')
print(f'Variance:{var}')
print(f'Standard Deviation:{std}')
print(f'Range:{dr}')

# Generate a histogram
plt.figure(figsize=(10,6))
sns.histplot(df[nc],kde=True)
plt.title(f'Histogram of {nc}')
plt.xlabel(nc)
plt.ylabel('Frequency')
plt.show()

# Generate a boxplot
plt.figure(figsize=(10,6))
sns.boxplot(x=df[nc])
plt.title(f'Boxplot of {nc}')
plt.xlabel(nc)
plt.show()

# Detect outliers using IQR (Inter-quartile Range)


Q1=df[nc].quantile(0.25)
Q3=df[nc].quantile(0.75)
IQR=Q3-Q1

# Define the outlier threshold


lb=Q1-1.5*IQR
ub=Q3+1.5*IQR

# Find the outliers


out=df[(df[nc]<lb) | (df[nc]>ub)]
print(f'Outliers:\n{out}')

# Select a categorical variable


categorical_column='sex'

# Compute the frequency of each category


category_counts=df[categorical_column].value_counts()

# Display the frequencies as a bar chart


plt.figure(figsize=(10, 6))
category_counts.plot(kind='bar', color='skyblue')
plt.title(f'Frequency of categories in {categorical_column}')
plt.xlabel(categorical_column)
plt.ylabel('Frequency')
plt.show()

# Display the frequencies as a pie chart


plt.figure(figsize=(8, 8))
category_counts.plot(kind='pie', autopct='%1.1f%%',
startangle=90, colors=['lightblue', 'lightcoral'])
plt.title(f'Pie chart of {categorical_column} categories')
plt.ylabel('') # Hide the y-axis label
plt.show()

OUTPUT:-
VIVA-VOCE:

1. What is the purpose of this program?

●​ The program performs statistical analysis on a numerical column in a dataset, visualizes the data using histograms and boxplots, detects outliers using the IQR method, and analyzes categorical data using bar and pie charts.

2. Which dataset is used in this program?

●​ The "tips" dataset from the seaborn library is used, which contains information about
restaurant tips, including columns like total_bill, tip, sex, smoker, day, time,
and size.

3. Which libraries are used in this program and why?

●​ pandas: For data manipulation and statistical computations.


●​ seaborn: For visualization (histogram, boxplot).
●​ matplotlib.pyplot: For plotting graphs (bar chart, pie chart).
●​ numpy: For numerical operations.

4. How is the dataset loaded in this program?

●​ Using Seaborn’s built-in dataset function:​


df = sns.load_dataset('tips')

5. Which statistical measures are computed in this program?

●​ Mean (mean()): Average of all values.


●​ Median (median()): Middle value of sorted data.
●​ Mode (mode()): Most frequently occurring value.
●​ Variance (var()): Measure of data spread.
●​ Standard Deviation (std()): Square root of variance.
●​ Range: Difference between max and min values.
6. What is the purpose of a histogram?

●​ A histogram (histplot()) shows the distribution of a numerical variable by dividing the data into bins. It helps to identify patterns such as normality, skewness, or multimodal distributions.

7. Why is KDE (Kernel Density Estimation) used in the histogram?

●​ KDE provides a smoothed estimate of the data distribution, making it easier to visualize
patterns.

8. What insights can we gain from a boxplot?

●​ A boxplot (sns.boxplot()) shows the median, quartiles (Q1 and Q3), outliers, and
data spread. It helps in detecting skewness and extreme values.

9. How are outliers detected using the IQR method?

●​ The interquartile range is IQR = Q3 − Q1, where Q1 and Q3 are the 25th and 75th percentiles. Any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is treated as an outlier; this is exactly how the lower and upper bounds are computed in the code.

10. How is categorical data analyzed in this program?

●​ The program computes the frequency of each category in the sex column and visualizes
it using bar and pie charts.

11. What is the difference between a bar chart and a pie chart?

●​ A bar chart is used to compare categorical values, whereas a pie chart represents
proportions as slices of a circle.

12. Why do we use value_counts() for categorical data?

●​ value_counts() calculates the frequency of unique values in a categorical column, which is useful for understanding the distribution of categorical data.
2. Develop a program to load a dataset with at least two numeric columns (e.g.,
Iris, Titanic). Plot a scatter plot of two variables and calculate their Pearson
correlation coefficient. Write a program to compute the covariance and
correlation matrix for a dataset. Visualize the correlation matrix using a heatmap
to know which variables have strong positive/negative correlations.

CODE:-
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset


df=sns.load_dataset('iris')

# Select 2 numerical columns


num_col_1='sepal_length'
num_col_2='sepal_width'

# Plot a scatter plot


plt.figure(figsize=(8,6))
sns.scatterplot(x=df[num_col_1],y=df[num_col_2],color='blue')
plt.title(f'Scatter plot between {num_col_1} and {num_col_2}')
plt.xlabel(num_col_1)
plt.ylabel(num_col_2)
plt.show()

# Calculate Pearson Coefficient


pearson_corr=df[num_col_1].corr(df[num_col_2])
print(f'Pearson Coefficient between {num_col_1} and {num_col_2}: {pearson_corr}')

# Compute covariance and correlation matrix


numeric_columns=df.select_dtypes(include=['number']).columns
print("\nNumeric Columns:")
print(numeric_columns)
cov_matrix=df[numeric_columns].cov()
print("\nCovariance Matrix:")
print(cov_matrix)
corr_matrix=df[numeric_columns].corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

# Visualize the correlation matrix using a heatmap


plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

OUTPUT:-
VIVA-VOCE:-

1. What is the objective of this program?

●​ This program performs a statistical analysis of two numerical columns from a dataset by
plotting a scatter plot, calculating Pearson correlation, computing covariance and
correlation matrices, and visualizing the correlation matrix using a heatmap.

2. Which dataset and libraries are used in this program?

●​ The iris dataset from the Seaborn library is used.


●​ pandas: For data handling and computations.
●​ numpy: For numerical calculations.
●​ matplotlib.pyplot: For data visualization (scatter plot, heatmap).
●​ seaborn: For advanced visualizations (scatter plot, heatmap).

3. How are numerical columns selected?

●​ The dataset is filtered to include only numeric columns using:​


numeric_columns = df.select_dtypes(include=['number']).columns
●​ Specifically, sepal_length and sepal_width are selected.

4. What is the purpose of a scatter plot in this program?

●​ The scatter plot visually represents the relationship between two numeric variables.
●​ The scatter plot is generated by using Seaborn’s scatterplot() function.
●​ It shows the nature of correlation (positive, negative, or no correlation) between the
selected numerical columns.

5. What is Pearson Correlation Coefficient?

●​ Pearson correlation measures the strength and direction of a linear relationship between
two variables.
●​ Pearson Correlation is computed in the code using Pandas’ corr() function.

6. What is a correlation matrix?

●​ A correlation matrix is a table showing correlation coefficients between multiple variables.
●​ A correlation matrix is computed in the code using Pandas’ corr() function.

7. What is covariance? How is it different from correlation?

●​ Covariance measures the direction of the joint variability of two variables, but its value depends on the units of the variables. Correlation standardizes covariance by the product of the two standard deviations, giving a unit-free value between −1 and +1.
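
A quick illustrative check (not part of the lab program; the two iris columns are chosen arbitrarily) that correlation is simply covariance divided by the product of the two standard deviations:

import seaborn as sns

df = sns.load_dataset('iris')
x, y = df['sepal_length'], df['petal_length']

cov_xy = x.cov(y)                        # covariance (depends on the units of x and y)
corr_xy = cov_xy / (x.std() * y.std())   # standardized to the range -1 to +1

print(cov_xy, corr_xy, x.corr(y))        # corr_xy matches pandas' corr()
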
8. Why is a heatmap used for visualizing correlations?

●​ A heatmap makes it easy to identify variables that have strong positive or negative
correlations.
●​ The correlation matrix is visualized using Seaborn’s heatmap() function.

9. What does the color scheme in the heatmap represent?

●​ Red (positive values): Strong positive correlation.


●​ Blue (negative values): Strong negative correlation.
●​ Near zero values: Weak or no correlation.
3. Develop a program to implement Principal Component Analysis (PCA) for
reducing the dimensionality of the Iris dataset from 4 features to 2.

CODE:-
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the dataset


iris = load_iris()
x = iris.data # Features
y = iris.target # Target values
print(x)
print(y)

OUTPUT:-

# Compute PCA conventionally


scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
print(x_scaled)
OUTPUT:-

cov_matrix = np.cov(x_scaled.T)
print(cov_matrix)

OUTPUT:-

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)


print(eigenvalues)
print("\n",eigenvectors)

OUTPUT:-

sorted_indices = np.argsort(eigenvalues)[::-1]

eigenvalues_sorted=eigenvalues[sorted_indices]
eigenvectors_sorted = eigenvectors[:,sorted_indices]
print(eigenvalues_sorted)
print("\n",eigenvectors_sorted)

OUTPUT:-

top_2_eigenvectors = eigenvectors_sorted[:,:2]
print(top_2_eigenvectors)

OUTPUT:-

x_pca = x_scaled.dot(top_2_eigenvectors)
print(x_pca)

OUTPUT:-
total_variance=sum(eigenvalues_sorted)
explained_variance_ratio=eigenvalues_sorted[:2]/total_variance
print(f"Explained variance ratio of the 1st two components: {explained_variance_ratio}")

# Visualization
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1], c=y, cmap="viridis",edgecolor="k",s=50)
plt.title("PCA of Iris Dataset (Reduced to 2D)",fontsize = 14)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(label="Species")
plt.show()

OUTPUT:-
VIVA-VOCE:-

1. What is the objective of this program?

●​ The objective is to apply Principal Component Analysis (PCA) on the Iris dataset to
reduce the number of features from 4 to 2, while retaining the most important
information.

2. What is Principal Component Analysis (PCA)?

●​ PCA is a dimensionality reduction technique that transforms correlated features into a smaller set of uncorrelated features called principal components, which capture the maximum variance in the data.

3. Why is PCA used in Machine Learning?

●​ PCA is used to:


○​ Reduce computational complexity.
○​ Remove redundant features.
○​ Improve model efficiency.
○​ Help in visualizing high-dimensional data.

4. Which libraries are used in this program and why?

●​ numpy: For mathematical computations (eigenvalues, eigenvectors, covariance matrix).


●​ matplotlib.pyplot: For visualization of the reduced dataset.
●​ sklearn.datasets: To load the Iris dataset.
●​ sklearn.preprocessing.StandardScaler: To standardize the dataset before
applying PCA.

5. Why is the dataset standardized using StandardScaler()?

●​ PCA is sensitive to scale. Standardizing ensures that all features contribute equally by
transforming them to have zero mean and unit variance.
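
As a quick sanity check (illustrative only, not part of the lab program), StandardScaler gives the same result as subtracting each column's mean and dividing by its standard deviation:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

x = load_iris().data

# Library standardization
x_scaled = StandardScaler().fit_transform(x)

# Manual standardization: zero mean, unit variance per column
x_manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(x_scaled, x_manual))   # True
print(x_scaled.mean(axis=0).round(6))    # approximately 0 for every feature
print(x_scaled.std(axis=0).round(6))     # approximately 1 for every feature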

6. How is the covariance matrix computed in PCA?

●​ In Python: cov_matrix = np.cov(x_scaled.T)


7. What are eigenvalues and eigenvectors in PCA?

●​ Eigenvalues indicate how much variance is captured by each principal component.


●​ Eigenvectors define the direction of the new feature space.
●​ They are computed using NumPy’s np.linalg.eig() function.

8. Why are eigenvalues sorted in descending order?

●​ The principal components are ranked based on variance, so we select the top
components that contribute the most.

9. Why do we use a scatter plot to visualize PCA results?

●​ Since PCA reduces the data to 2D, a scatter plot helps in visualizing how well the data
clusters after transformation.

10. What does the color represent in the scatter plot?

●​ The color represents different species (target labels) in the Iris dataset.

11. How do we calculate the explained variance ratio?

●​ The explained variance ratio is computed as:

Explained Variance Ratio =

Eigenvalue of a principal component / Sum of all Eigenvalues

●​ It tells how much information the selected principal components retain from the original
dataset.

12. What are some real-world applications and limitations of PCA?

●​ Applications:
○​ Image compression
○​ Feature selection in ML
○​ Genomic data analysis
○​ Fraud detection
●​ Limitations:
○​ PCA assumes linear relationships among features.
○​ It may lose interpretability of original features.
○​ It does not work well with categorical data.

13. How do we decide the number of principal components to keep?

●​ Using the explained variance ratio.


●​ The elbow method is often used to determine the optimal number.
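
As a cross-check of the manual eigen-decomposition above, scikit-learn's PCA reports the same explained-variance ratios directly, and their cumulative sum is a common way to decide how many components to keep. The numbers in the comments are approximate values for the standardized Iris data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x_scaled = StandardScaler().fit_transform(load_iris().data)

pca = PCA()                       # keep all 4 components for inspection
pca.fit(x_scaled)

ratios = pca.explained_variance_ratio_
print(ratios)                     # roughly [0.73, 0.23, 0.04, 0.005]
print(np.cumsum(ratios))          # first two components cover about 96% of the variance

# Keep the smallest number of components that explains, say, 95% of the variance
n_components = np.argmax(np.cumsum(ratios) >= 0.95) + 1
print(n_components)               # 2 for this dataset
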
4. Develop a program to load the Iris dataset. Implement the k-Nearest Neighbors
(k-NN) algorithm for classifying flowers based on their features. Split the dataset
into training and testing sets and evaluate the model using metrics like accuracy
and F1-score. Test it for different values of k (e.g., k=1,3,5) and evaluate the
accuracy. Extend the k-NN algorithm to assign weights based on the distance of
neighbors (e.g., weight=1/d2). Compare the performance of weighted k-NN and
regular k-NN on a synthetic or real-world dataset.

CODE:-
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Load the dataset


iris = datasets.load_iris()
X = iris.data # Features
y = iris.target # Target labels

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to evaluate k-NN with different k values


def evaluate_knn(k_values, X_train, X_test, y_train, y_test, weighted=False):
    results = []

    for k in k_values:
        # Initialize k-NN classifier with or without distance-based weights
        if weighted:
            knn = KNeighborsClassifier(n_neighbors=k, weights='distance')
        else:
            knn = KNeighborsClassifier(n_neighbors=k, weights='uniform')

        # Train the classifier
        knn.fit(X_train, y_train)

        # Make predictions
        y_pred = knn.predict(X_test)

        # Calculate accuracy and F1-score (weighted F1 for multi-class)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        results.append((k, accuracy, f1))

    return results

# Test k-NN for different values of k for unweighted and weighted classes
k_values = [1, 3, 5]

# Unweighted k-NN
print("Unweighted k-NN:")
unweighted_results = evaluate_knn(k_values, X_train, X_test, y_train, y_test, weighted=False)
for k, accuracy, f1 in unweighted_results:
    print(f"k={k}, Accuracy={accuracy:.4f}, F1-score={f1:.4f}")

# Weighted k-NN
print("Weighted k-NN:")
weighted_results = evaluate_knn(k_values, X_train, X_test, y_train, y_test, weighted=True)
for k, accuracy, f1 in weighted_results:
    print(f"k={k}, Accuracy={accuracy:.4f}, F1-score={f1:.4f}")

# Visualize the results


unweighted_accuracies = [accuracy for _, accuracy, _ in
unweighted_results]
weighted_accuracies = [accuracy for _, accuracy, _ in weighted_results]

# Plotting the results


plt.figure(figsize=(10, 6))
plt.plot(k_values, unweighted_accuracies, label='Unweighted k-NN',
marker='o')
plt.plot(k_values, weighted_accuracies, label='Weighted k-NN', marker='o')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.title('Accuracy of k-NN with Different k Values')
plt.legend()
plt.show()

OUTPUT:-
VIVA-VOCE:-

1. What is the objective of this program?

●​ The objective is to implement the k-Nearest Neighbors (k-NN) algorithm on the Iris
dataset for classification, evaluate its performance with different values of k, and
compare weighted and unweighted k-NN.

2. What is the k-Nearest Neighbors (k-NN) algorithm?

●​ k-NN is a supervised learning algorithm that classifies a data point based on the
majority vote of its k nearest neighbors in the feature space.

3. How does k-NN work?

1.​ Computes the distance between the query point and all other points.
2.​ Selects the k nearest neighbors.
3.​ Assigns the most common class label among the neighbors to the query point.
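
These three steps can be written directly in a few lines of NumPy. The helper below is purely illustrative (the lab program itself uses scikit-learn's KNeighborsClassifier) and classifies a single query point using Euclidean distances and a majority vote:

import numpy as np
from collections import Counter
from sklearn.datasets import load_iris

def knn_predict(X_train, y_train, query, k=3):
    # 1. Distances from the query point to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # 2. Indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

iris = load_iris()
print(knn_predict(iris.data, iris.target, iris.data[0], k=3))   # 0 (setosa)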

4. Why do we split the dataset into training and testing sets?

●​ To evaluate the model's generalization ability on unseen data.

5. Which libraries are used in this program and why?

●​ pandas & numpy: For data handling and mathematical operations.


●​ sklearn.datasets: To load the Iris dataset.
●​ sklearn.model_selection.train_test_split: To split the dataset into training
and testing sets.
●​ sklearn.metrics: To evaluate the model using accuracy and F1-score.
●​ sklearn.neighbors.KNeighborsClassifier: To implement k-NN classifier.
●​ matplotlib.pyplot: To visualize accuracy for different values of k.

6. What does train_test_split() do?

●​ It randomly splits the dataset into 80% training data and 20% testing data.

7. What does KNeighborsClassifier(n_neighbors=k, weights='uniform') do?

●​ It initializes a k-NN classifier with k neighbors and assigns equal weights to all neighbors.

8. What evaluation metrics are used in this program?


●​ Accuracy Score: Measures the proportion of correctly classified instances using the
accuracy_score() function.
●​ F1-score: A weighted average of precision and recall, useful for imbalanced datasets
using the f1_score() function.

9. Why do we test k-NN for different values of k?

●​ To find the best k-value that provides optimal performance.


●​ A small k (e.g., 1) may lead to overfitting, while a large k (e.g., 20) may cause
underfitting.

10. What is the difference between weighted and unweighted k-NN?

●​ Unweighted k-NN: Each neighbor has equal weight (default weights='uniform').


●​ Weighted k-NN: Neighbors closer to the query point have higher influence
(weights='distance').

11. How does weighted k-NN assign weights?

●​ It assigns a weight inversely proportional to the square of the distance (weight = 1/d²).

12. When should we use weighted k-NN instead of unweighted k-NN?

●​ When data points closer to the query point should have higher influence, especially
in cases where:
○​ Data is noisy.
○​ Class imbalance exists.
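
Note that scikit-learn's built-in weights='distance' option weights neighbors by 1/d rather than 1/d². If the exact 1/d² weighting mentioned in the problem statement is required, a callable can be passed instead; the sketch below is one possible way to do this (the small epsilon is an added assumption to avoid division by zero when a test point coincides with a training point):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def inverse_square(distances):
    # weight = 1 / d^2, with a small epsilon to avoid division by zero
    return 1.0 / (distances ** 2 + 1e-9)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5, weights=inverse_square)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))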

13. Why do we plot accuracy for different values of k?

●​ To compare the performance of weighted vs. unweighted k-NN.

14. How does the number of neighbors (k) affect accuracy?

●​ Small k (e.g., 1) → More variance, risk of overfitting.


●​ Large k (e.g., 20) → More bias, risk of underfitting.

15. What conclusion can we draw from the accuracy comparison?

●​ If weighted k-NN outperforms unweighted k-NN, it means that closer neighbors should
be given higher importance.

16. What are the advantages of k-NN?

●​ Simple & intuitive.


●​ Works well with small datasets.
●​ No training phase, making it a lazy learner.

17. What are the disadvantages of k-NN?

●​ Computationally expensive for large datasets.


●​ Sensitive to irrelevant features.
●​ Requires careful choice of k.

18. How can we improve k-NN performance?

●​ Feature scaling (Standardization).


●​ Using distance-weighted k-NN.
●​ Selecting an optimal k.
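
One common way to select an optimal k is to compare cross-validated accuracy over a range of candidate values; a minimal sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for each candidate k
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.4f}")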

19. What real-world applications use k-NN?

●​ Handwriting recognition (e.g., OCR).


●​ Recommender systems.
●​ Medical diagnosis.

5. Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select the appropriate dataset for your experiment and draw graphs.
CODE:-
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data


np.random.seed(42)
X = np.linspace(0, 10, 100)
y = np.sin(X) + np.random.normal(scale=0.2, size=X.shape)  # Add some noise

# Reshape X for matrix operations


X = X[:, np.newaxis]

# Locally Weighted Regression algorithm function


def lwr(X_train, y_train, query_point, tau):
    # Compute weights using a Gaussian kernel centered on the query point
    weights = np.exp(-np.sum((X_train - query_point) ** 2, axis=1) / (2 * tau ** 2))

    # Create a diagonal weight matrix
    W = np.diag(weights)

    # Add bias term to X_train
    X_bias = np.hstack([np.ones_like(X_train), X_train])

    # Compute the weighted normal equation to find the parameters (theta)
    theta = np.linalg.inv(X_bias.T @ W @ X_bias) @ X_bias.T @ W @ y_train

    # Predict for the query point
    query_point_bias = np.array([1, query_point])
    pred = query_point_bias @ theta
    return pred

# Function to make predictions using the trained LWR model


def predict_lwr(X_train, y_train, X_test, tau):
    prediction = np.array([lwr(X_train, y_train, x[0], tau) for x in X_test])
    return prediction

# Hyperparameter: Bandwidth (tau)
tau = 0.5

# Predict values for the test set
y_pred = predict_lwr(X, y, X, tau)

# Plot the results


plt.figure(figsize=(10, 6))
plt.scatter(X, y, label="Data Points", color="blue", s=10)
plt.plot(X, y_pred, label=f"LWR Prediction (tau={tau})", color="red",
linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Locally Weighted Regression (LWR)")
plt.legend()
plt.grid(True)
plt.show()

OUTPUT:-

VIVA-VOCE:-

1. What is the objective of this program?


●​ The objective is to implement the Locally Weighted Regression (LWR) algorithm, a
non-parametric regression technique, to fit data points and visualize the regression
curve.

2. What is Locally Weighted Regression (LWR)?

●​ LWR is a variant of linear regression where each data point has a different contribution
to the prediction based on its distance from the query point.
●​ It gives higher weights to nearby points and lower weights to distant points.

3. How does LWR differ from Ordinary Least Squares (OLS) regression?

●​ OLS Regression: Assumes a global relationship between features and output by minimizing a cost function over all data points.
●​ LWR: Uses local relationships by assigning weights to different data points based on
their proximity to the query point.

4. Why is LWR called a non-parametric algorithm?

●​ It does not learn a fixed set of parameters for the entire dataset but instead computes
weights dynamically for each query point.

5. Which libraries are used in this program and why?

●​ numpy: For numerical operations (matrix computations, Gaussian weighting).


●​ matplotlib.pyplot: For data visualization (scatter plots, regression curve).

6. Why is synthetic data used in this program?

●​ To create a controlled environment where a non-linear pattern (sinusoidal curve) is modeled with added noise.

7. What is the role of the bandwidth parameter (τ)?

●​ The bandwidth (τ: tau) controls how much weight is given to nearby points.
●​ A small τ → More localized model (risk of overfitting).
●​ A large τ → More generalized model (risk of underfitting).

8. How are the weights assigned in LWR?

●​ Each training point x_i receives a weight from a Gaussian kernel, w_i = exp(−(x_i − x_q)² / (2τ²)), so points close to the query point x_q get weights near 1 and distant points get weights near 0.

9. What is the purpose of the diagonal weight matrix W?


●​ The diagonal weight matrix ensures that nearby points contribute more to the
regression model than distant points.

10. What is the weighted normal equation used in this code?

●​ θ = (XᵀWX)⁻¹ XᵀWy, where X is the training data with a bias column, W is the diagonal weight matrix, and y is the target vector. In the code this is theta = np.linalg.inv(X_bias.T @ W @ X_bias) @ X_bias.T @ W @ y_train.

11. Why do we add a bias term to X?

●​ A bias term is added to include an intercept (β₀) in the regression model.


●​ The linear regression equation is y = β₀ + β₁x, where β₀ is the intercept and β₁ is the slope.

12. How does the function lwr() work?

●​ It takes training data, a query point, and τ, then:


1.​ Computes weights using the Gaussian kernel.
2.​ Forms a weighted normal equation.
3.​ Solves for θ (theta).
4.​ Makes a prediction for the query point.

13. How is LWR different from k-Nearest Neighbors (k-NN)?

●​ k-NN: Uses a fixed number of neighbors.


●​ LWR: Uses all data points with weights that decrease based on distance.

14. Why do we plot the LWR predictions?

●​ To compare how well the LWR curve fits the data points.

15. What does the red curve in the plot represent?

●​ It represents the LWR-predicted values for different X values.

16. What happens when τ (tau) is too large or too small?

●​ Too large → The model behaves like linear regression.


●​ Too small → The model overfits and captures noise.
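
This effect can be seen by sweeping τ over a few values on the same kind of synthetic data used in the program. The sketch below re-implements the Gaussian-weighted fit compactly and is meant only as an illustration, not as a replacement for the lwr() function above:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
X = np.linspace(0, 10, 100)
y = np.sin(X) + rng.normal(scale=0.2, size=X.shape)

def lwr_predict(xq, X, y, tau):
    # Gaussian weights centered on the query point xq
    w = np.exp(-(X - xq) ** 2 / (2 * tau ** 2))
    A = np.column_stack([np.ones_like(X), X])               # bias column + feature
    theta = np.linalg.pinv((A.T * w) @ A) @ (A.T * w) @ y   # weighted normal equation
    return np.array([1, xq]) @ theta

plt.scatter(X, y, s=10, color="blue", label="Data Points")
for tau in [0.1, 0.5, 5.0]:   # small, moderate and large bandwidth
    y_fit = [lwr_predict(xq, X, y, tau) for xq in X]
    plt.plot(X, y_fit, label=f"tau={tau}")
plt.legend()
plt.title("Effect of the bandwidth tau on LWR")
plt.show()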

17. What are some real-world applications of Locally Weighted Regression?

●​ Stock market price prediction.


●​ Medical diagnosis (personalized medicine).
●​ Handwriting and speech recognition.
18. What are the advantages of LWR?

●​ Works well for non-linear data.


●​ No need to define a fixed function form.
●​ Requires no prior assumptions about the data.
●​ Provides customized fitting for different regions of the dataset.

19. What are the disadvantages of LWR?

●​ Computationally expensive (needs to compute weights for every test point).


●​ Not scalable for large datasets.
●​ Requires careful tuning of τ.

6. Develop a program to demonstrate the working of Linear Regression and
Polynomial Regression. Use Boston Housing Dataset for Linear Regression and
Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial
Regression.

1. Linear Regression for Boston Housing Dataset


●​ Load the Boston Housing dataset as a .csv file in Jupyter Notebook and execute the
following program.

CODE:-
# Load necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset


dataset = "BostonHousingData.csv"

# Read the dataset


df= pd.read_csv(dataset)

# Display the first few rows


df.head()

# Impute missing values with mean


df['CRIM'].fillna(df['CRIM'].mean(), inplace=True)
df['ZN'].fillna(df['ZN'].mean(), inplace=True)
df['INDUS'].fillna(df['INDUS'].mean(), inplace=True)
df['CHAS'].fillna(df['CHAS'].mode()[0], inplace=True)
df['AGE'].fillna(df['AGE'].mean(), inplace=True)
df['LSTAT'].fillna(df['LSTAT'].mean(), inplace=True)
df.head()

# Drop the columns 'CHAS' and 'ZN'


df = df.drop(columns=['CHAS', 'ZN'])

# Verify the columns are dropped


df.head()
# Train the model
x = df.drop('MEDV', axis=1) # Features
y = df['MEDV'] # Target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
random_state=42)
model_lr = LinearRegression()
model_lr.fit(x_train, y_train)
y_pred_lr = model_lr.predict(x_test)

# Evaluate the model's performance


mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

def print_metrics(y_test, y_pred, model_name):
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"\n{model_name} - MSE: {mse}, R²: {r2}")

print_metrics(y_test, y_pred_lr, 'Linear Regression')

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, label='Linear Regression')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)],
color='red')
plt.title('Linear Regression with Boston Housing Dataset')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.legend()
plt.show()

OUTPUT:-
2. Polynomial Regression for Auto MPG Dataset

CODE:-
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def polynomial_regression_auto_mpg():
    # Load the Auto MPG dataset
    auto_mpg = fetch_openml(name="autoMpg", version=1, as_frame=True)
    data = auto_mpg.data
    target = auto_mpg.target

    # Remove rows with missing 'horsepower' values from both data and target
    data = data.dropna(subset=["horsepower"])
    target = target.loc[data.index]

    X_hp = data[["horsepower"]].astype(float)
    y_mpg = target.astype(float)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X_hp, y_mpg, test_size=0.2, random_state=42)

    # Polynomial transformation
    poly_features = PolynomialFeatures(degree=3)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)

    # Linear regression on polynomial features
    lr_poly = LinearRegression()
    lr_poly.fit(X_train_poly, y_train)
    y_pred_poly = lr_poly.predict(X_test_poly)

    # Evaluation metrics
    mse_poly = mean_squared_error(y_test, y_pred_poly)
    r2_poly = r2_score(y_test, y_pred_poly)

    # Sort test points so the fitted curve plots smoothly
    X_test_sorted, y_test_sorted = zip(*sorted(zip(X_test.values.flatten(), y_test)))
    y_pred_sorted = lr_poly.predict(poly_features.transform(np.array(X_test_sorted).reshape(-1, 1)))

    # Visualization
    plt.scatter(X_test, y_test, color='blue', label='Actual')
    plt.plot(X_test_sorted, y_pred_sorted, color='red', linewidth=2, label='Polynomial Regression Fit')
    plt.xlabel('Horsepower')
    plt.ylabel('Miles per Gallon (MPG)')
    plt.title('Polynomial Regression: Horsepower vs. MPG')
    plt.legend()
    plt.show()

    # Output metrics
    print(f'Mean Squared Error: {mse_poly}')
    print(f'R² Score: {r2_poly}')


def run_models():
    polynomial_regression_auto_mpg()


run_models()

OUTPUT:-
7. Develop a program to load the Titanic dataset. Split the data into training and
test sets. Train a decision tree classifier. Visualize the tree structure. Evaluate
accuracy, precision, recall, and F1-score.

●​ Load the Titanic dataset as a .csv file in Jupyter Notebook and execute the following
program.

CODE:-
# Load necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# Load the dataset


df=pd.read_csv("titanic.csv")
df.head()

OUTPUT:-

CODE:-
# Data Preprocessing
df['Age'].fillna(df['Age'].median(),inplace=True)
df['Embarked'].fillna(df['Embarked'].mode()[0],inplace=True)
df['Fare'].fillna(df['Fare'].median(),inplace=True)
df['Cabin'].fillna('U', inplace=True)  # Fill missing Cabin values with 'U' (unknown)
df.head()
OUTPUT:-

CODE:-
# Encode categorical columns and drop unnecessary ones
label_encoder=LabelEncoder()
df['Sex']=label_encoder.fit_transform(df['Sex'])
df['Embarked']=label_encoder.fit_transform(df['Embarked']) # C:0, Q:1, S:2
df.drop(columns=['Name','Ticket','Cabin'],inplace=True)
X=df.drop(columns=['Survived'])
y=df['Survived']
df.head()

OUTPUT:-

CODE:-
# Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf=DecisionTreeClassifier(random_state=42)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
# Evaluation Metrics
accuracy=accuracy_score(y_test,y_pred)
precision=precision_score(y_test,y_pred)
recall=recall_score(y_test,y_pred)
f1=f1_score(y_test,y_pred)
print(f"Accuracy:{accuracy:.4f}")
print(f"Precision:{precision:.4f}")
print(f"Recall:{recall:.4f}")
print(f"F1-score:{f1:.4f}")

# Visualization
plt.figure(figsize=(12,8))
plot_tree(clf, filled=True, feature_names=X.columns, class_names=['Not Survived', 'Survived'], rounded=True)
plt.title("Decision Tree Classifier for Titanic dataset")
plt.show()

OUTPUT:-
8. Develop a program to implement the Naive Bayesian Classifier considering the
Iris dataset. Compute the accuracy of the classifier by considering the test data.

CODE:-
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Iris dataset


iris=load_iris()

# Train and test the classifier


X=iris.data
y=iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model=GaussianNB()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
accuracy=accuracy_score(y_test,y_pred)

# Print the metrics


print("Accuracy:", accuracy)
print("Iris Classification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

OUTPUT:-
9. Develop a program to implement k-means clustering using Wisconsin Breast
Cancer dataset and visualize the clustering result.

CODE:-
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset


data = load_breast_cancer()
X = data.data
y = data.target

# Train the model


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y

# Visualization
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1',
s=100, edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='yellow', marker='X',
label='Centroids')
plt.title('k-Means Clustering for Wisconsin Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()

OUTPUT:-
EXTRA PROGRAMS

1. Develop a program to implement the Perceptron Learning algorithm to simulate
an OR gate.

CODE:-
# Import the required libraries
import numpy as np

# OR gate dataset
X = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])
y = np.array([0, 1, 1, 1])

# Initialize weights and bias


weights = np.zeros(X.shape[1])
bias = 0
lr = 0.1
epochs = 10

# Activation function (step)


def step(x):
    return 1 if x >= 0 else 0

# Training the perceptron


for epoch in range(epochs):
    for i in range(len(X)):
        z = np.dot(X[i], weights) + bias
        y_pred = step(z)
        error = y[i] - y_pred

        # Update rule
        weights += lr * error * X[i]
        bias += lr * error

print("Trained weights:", weights)
print("Trained bias:", bias)

# Test the perceptron
for x in X:
    output = step(np.dot(x, weights) + bias)
    print(f"Input: {x} => Output: {output}")

OUTPUT:-
2. Develop a program to demonstrate the working of Logistic Regression using
Iris dataset.

CODE:-
# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Create logistic regression model


model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict on the test set


y_pred = model.predict(X_test)

# Print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

OUTPUT:-
SAMPLE VIVA-VOCE BASED ON THEORY SYLLABUS

Module 1: Introduction & Understanding Data

1.​ What is Machine Learning?


○​ Machine Learning (ML) is a subset of artificial intelligence that enables systems
to learn from data and improve their performance without being explicitly
programmed.
2.​ What are the different types of Machine Learning?
○​ Supervised Learning: The model learns from labeled data. (e.g., classification,
regression)
○​ Unsupervised Learning: The model identifies patterns in unlabeled data. (e.g.,
clustering, association)
○​ Reinforcement Learning: The model learns by interacting with an environment
and receiving rewards or penalties.
3.​ What are some challenges in Machine Learning?
○​ Data quality and availability
○​ Overfitting and underfitting
○​ Model interpretability
○​ Computational complexity
4.​ What is descriptive statistics, and why is it used in ML?
○​ Descriptive statistics summarize data using measures like mean, median, mode,
variance, and standard deviation. It helps in understanding data distribution
before applying ML models.
5.​ What is the difference between univariate, bivariate, and multivariate analysis?
○​ Univariate analysis: Analysis of a single variable (e.g., histogram of age).
○​ Bivariate analysis: Relationship between two variables (e.g., scatter plot of
height vs. weight).
○​ Multivariate analysis: Involves multiple variables (e.g., PCA for dimensionality
reduction).

Module 2: Feature Engineering & ML Algorithms

6.​ What is feature engineering?


○​ Feature engineering involves selecting, transforming, or creating features to
improve model performance.
7.​ What is dimensionality reduction? Name two techniques.
○​ Dimensionality reduction removes redundant features to improve efficiency.
○​ Techniques: Principal Component Analysis (PCA), Linear Discriminant Analysis
(LDA).
8.​ What is overfitting and how can it be prevented?
○​ Overfitting occurs when a model performs well on training data but poorly on
unseen data.
○​ Prevention methods: Cross-validation, regularization (L1/L2), pruning (for
decision trees), using more data.
9.​ What are training, validation, and testing sets?
○​ Training set: Used to train the model.
○​ Validation set: Used for hyperparameter tuning.
○​ Testing set: Used to evaluate final model performance.
10.​What is a confusion matrix? Explain its components.
○​ A confusion matrix is used to evaluate classification performance.
○​ Components: True Positive (TP), False Positive (FP), True Negative (TN), False
Negative (FN).
11.​What is the ROC curve and AUC?
●​ ROC (Receiver Operating Characteristic) Curve: Plots TPR vs. FPR for
different thresholds.
●​ AUC (Area Under Curve): Measures classifier performance (closer to 1 means
better).
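
A brief sketch showing how both metrics are obtained with scikit-learn (a two-class subset of the Iris data is used purely for illustration, since ROC/AUC applies to binary classification):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Binary problem: classes 0 and 1 of the Iris dataset
X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Confusion matrix: rows = actual class, columns = predicted class
print(confusion_matrix(y_test, model.predict(X_test)))

# AUC needs the predicted probability of the positive class
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))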

Module 3: Similarity-Based Learning

12.​What is the k-Nearest Neighbor (k-NN) algorithm?


●​ k-NN is a non-parametric, lazy learning algorithm that classifies a sample based
on the majority class of its k nearest neighbors.
13.​How does weighted k-NN differ from k-NN?
●​ Weighted k-NN assigns higher importance to closer neighbors by giving them
higher weights.
14.​What is the nearest centroid classifier?
●​ A classification algorithm that assigns a data point to the class whose centroid
(mean vector) is closest to the data point.
15.​What is Locally Weighted Regression (LWR)?
●​ LWR is a non-parametric regression technique that gives different weights to
training points based on their distance from the test point.
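
For the nearest centroid classifier mentioned in question 14 above, scikit-learn provides a ready-made NearestCentroid estimator; a minimal usage sketch on the Iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = NearestCentroid()              # one mean vector (centroid) per class
clf.fit(X_train, y_train)
print(clf.centroids_)                # the learned class centroids
print(clf.score(X_test, y_test))     # accuracy on the held-out test set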
