
CMR INSTITUTE OF TECHNOLOGY

(Affiliated to VTU, Approved by AICTE, Accredited by NBA and NAAC with “A++” Grade)
ITPL MAIN ROAD, BROOKFIELD, BENGALURU-560037, KARNATAKA, INDIA

Department of Artificial Intelligence and Machine Learning

LAB MANUAL
(Effective from the academic year 2024-2025 under 2022 CBCS scheme)

Subject: Machine Learning Laboratory


Subject Code: BAIL606
Semester: 6
1. Develop a program to load a dataset and select one numerical column.
Compute mean, median, mode, standard deviation, variance, and range for a
given numerical column in a dataset. Generate a histogram and a boxplot to
understand the distribution of the data. Identify any outliers in the data using IQR.
Select a categorical variable from a dataset. Compute the frequency of each
category and display it as a bar chart or pie chart.

CODE:-
# Import the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset


df=sns.load_dataset('tips')

# Select a numerical column


nc='total_bill'

#Statistical analysis
mean=df[nc].mean()
median=df[nc].median()
mode=df[nc].mode()
var=df[nc].var()
std=df[nc].std()
dr=df[nc].max()-df[nc].min()

# Print the computed statistics


print(f'Mean:{mean}')
print(f'Median:{median}')
print(f'Mode:{mode}')
print(f'Variance:{var}')
print(f'Standard Deviation:{std}')
print(f'Range:{dr}')

# Generate a histogram
plt.figure(figsize=(10,6))
sns.histplot(df[nc],kde=True)
plt.title(f'Histogram of {nc}')
plt.xlabel(nc)
plt.ylabel('Frequency')
plt.show()

# Generate a boxplot
plt.figure(figsize=(10,6))
sns.boxplot(x=df[nc])
plt.title(f'Boxplot of {nc}')
plt.xlabel(nc)
plt.show()

# Detect outliers using IQR (Inter-quartile Range)


Q1=df[nc].quantile(0.25)
Q3=df[nc].quantile(0.75)
IQR=Q3-Q1

# Define the outlier threshold


lb=Q1-1.5*IQR
ub=Q3+1.5*IQR

# Find the outliers


out=df[(df[nc]<lb) | (df[nc]>ub)]
print(f'Outliers:\n{out}')

# Select a categorical variable


categorical_column='sex'

# Compute the frequency of each category


category_counts=df[categorical_column].value_counts()

# Display the frequencies as a bar chart


plt.figure(figsize=(10, 6))
category_counts.plot(kind='bar', color='skyblue')
plt.title(f'Frequency of categories in {categorical_column}')
plt.xlabel(categorical_column)
plt.ylabel('Frequency')
plt.show()

# Display the frequencies as a pie chart


plt.figure(figsize=(8, 8))
category_counts.plot(kind='pie', autopct='%1.1f%%',
startangle=90, colors=['lightblue', 'lightcoral'])
plt.title(f'Pie chart of {categorical_column} categories')
plt.ylabel('') # Hide the y-axis label
plt.show()

OUTPUT:-
VIVA-VOCE:

1. What is the purpose of this program?

●​ The program performs statistical analysis on a numerical column in a dataset, visualizes the data using histograms and boxplots, detects outliers using the IQR method, and analyzes categorical data using bar and pie charts.

2. Which dataset is used in this program?

●​ The "tips" dataset from the seaborn library is used, which contains information about
restaurant tips, including columns like total_bill, tip, sex, smoker, day, time,
and size.

3. Which libraries are used in this program and why?

●​ pandas: For data manipulation and statistical computations.


●​ seaborn: For visualization (histogram, boxplot).
●​ matplotlib.pyplot: For plotting graphs (bar chart, pie chart).
●​ numpy: For numerical operations.

4. How is the dataset loaded in this program?

●​ Using Seaborn’s built-in dataset function:​


df = sns.load_dataset('tips')

5. Which statistical measures are computed in this program?

●​ Mean (mean()): Average of all values.


●​ Median (median()): Middle value of sorted data.
●​ Mode (mode()): Most frequently occurring value.
●​ Variance (var()): Measure of data spread.
●​ Standard Deviation (std()): Square root of variance.
●​ Range: Difference between max and min values.
6. What is the purpose of a histogram?

●​ A histogram (histplot()) shows the distribution of a numerical variable by dividing the data into bins. It helps to identify patterns such as normality, skewness, or multimodal distributions.

7. Why is KDE (Kernel Density Estimation) used in the histogram?

●​ KDE provides a smoothed estimate of the data distribution, making it easier to visualize
patterns.

8. What insights can we gain from a boxplot?

●​ A boxplot (sns.boxplot()) shows the median, quartiles (Q1 and Q3), outliers, and
data spread. It helps in detecting skewness and extreme values.

9. How are outliers detected using the IQR method?

●​ The interquartile range is IQR = Q3 − Q1, where Q1 and Q3 are the 25th and 75th percentiles. Any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is treated as an outlier; this is exactly how the lower and upper bounds are computed in the code.

10. How is categorical data analyzed in this program?

●​ The program computes the frequency of each category in the sex column and visualizes
it using bar and pie charts.

11. What is the difference between a bar chart and a pie chart?

●​ A bar chart is used to compare categorical values, whereas a pie chart represents
proportions as slices of a circle.

12. Why do we use value_counts() for categorical data?

●​ value_counts() calculates the frequency of unique values in a categorical column, which is useful for understanding the distribution of categorical data.
2. Develop a program to load a dataset with at least two numeric columns (e.g.,
Iris, Titanic). Plot a scatter plot of two variables and calculate their Pearson
correlation coefficient. Write a program to compute the covariance and
correlation matrix for a dataset. Visualize the correlation matrix using a heatmap
to know which variables have strong positive/negative correlations.

CODE:-
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset


df=sns.load_dataset('iris')

# Select 2 numerical columns


num_col_1='sepal_length'
num_col_2='sepal_width'

# Plot a scatter plot


plt.figure(figsize=(8,6))
sns.scatterplot(x=df[num_col_1],y=df[num_col_2],color='blue')
plt.title(f'Scatter plot between {num_col_1} and {num_col_2}')
plt.xlabel(num_col_1)
plt.ylabel(num_col_2)
plt.show()

# Calculate Pearson Coefficient


pearson_corr=df[num_col_1].corr(df[num_col_2])
print(f'Pearson Coefficient between {num_col_1} and {num_col_2}: {pearson_corr}')

# Compute covariance and correlation matrix


numeric_columns=df.select_dtypes(include=['number']).columns
print("\nNumeric Columns:")
print(numeric_columns)
cov_matrix=df[numeric_columns].cov()
print("\nCovariance Matrix:")
print(cov_matrix)
corr_matrix=df[numeric_columns].corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

# Visualize the correlation matrix using a heatmap


plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

OUTPUT:-
VIVA-VOCE:-

1. What is the objective of this program?

●​ This program performs a statistical analysis of two numerical columns from a dataset by
plotting a scatter plot, calculating Pearson correlation, computing covariance and
correlation matrices, and visualizing the correlation matrix using a heatmap.

2. Which dataset and libraries are used in this program?

●​ The iris dataset from the Seaborn library is used.


●​ pandas: For data handling and computations.
●​ numpy: For numerical calculations.
●​ matplotlib.pyplot: For data visualization (scatter plot, heatmap).
●​ seaborn: For advanced visualizations (scatter plot, heatmap).

3. How are numerical columns selected?

●​ The dataset is filtered to include only numeric columns using:​


numeric_columns = df.select_dtypes(include=['number']).columns
●​ Specifically, sepal_length and sepal_width are selected.

4. What is the purpose of a scatter plot in this program?

●​ The scatter plot visually represents the relationship between two numeric variables.
●​ The scatter plot is generated by using Seaborn’s scatterplot() function.
●​ It shows the nature of correlation (positive, negative, or no correlation) between the
selected numerical columns.

5. What is Pearson Correlation Coefficient?

●​ Pearson correlation measures the strength and direction of a linear relationship between
two variables.
●​ Pearson Correlation is computed in the code using Pandas’ corr() function.

6. What is a correlation matrix?

●​ A correlation matrix is a table showing correlation coefficients between multiple variables.
●​ A correlation matrix is computed in the code using Pandas’ corr() function.

7. What is covariance? How is it different from correlation?

●​ Covariance measures the direction of the joint variability of two variables, but its value depends on the units of the variables. Correlation standardizes covariance by the product of the two standard deviations, giving a unit-free value between −1 and +1.
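
A quick illustrative check (not part of the lab program; the two iris columns are chosen arbitrarily) that correlation is simply covariance divided by the product of the two standard deviations:

import seaborn as sns

df = sns.load_dataset('iris')
x, y = df['sepal_length'], df['petal_length']

cov_xy = x.cov(y)                        # covariance (depends on the units of x and y)
corr_xy = cov_xy / (x.std() * y.std())   # standardized to the range -1 to +1

print(cov_xy, corr_xy, x.corr(y))        # corr_xy matches pandas' corr()
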
8. Why is a heatmap used for visualizing correlations?

●​ A heatmap makes it easy to identify variables that have strong positive or negative
correlations.
●​ The correlation matrix is visualized using Seaborn’s heatmap() function.

9. What does the color scheme in the heatmap represent?

●​ Red (positive values): Strong positive correlation.


●​ Blue (negative values): Strong negative correlation.
●​ Near zero values: Weak or no correlation.
3. Develop a program to implement Principal Component Analysis (PCA) for
reducing the dimensionality of the Iris dataset from 4 features to 2.

CODE:-
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the dataset


iris = load_iris()
x = iris.data # Features
y = iris.target # Target values
print(x)
print(y)

OUTPUT:-

# Compute PCA conventionally


scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
print(x_scaled)
OUTPUT:-

cov_matrix = np.cov(x_scaled.T)
print(cov_matrix)

OUTPUT:-

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)


print(eigenvalues)
print("\n",eigenvectors)

OUTPUT:-

sorted_indices = np.argsort(eigenvalues)[::-1]

eigenvalues_sorted=eigenvalues[sorted_indices]
eigenvectors_sorted = eigenvectors[:,sorted_indices]
print(eigenvalues_sorted)
print("\n",eigenvectors_sorted)

OUTPUT:-

top_2_eigenvectors = eigenvectors_sorted[:,:2]
print(top_2_eigenvectors)

OUTPUT:-

x_pca = x_scaled.dot(top_2_eigenvectors)
print(x_pca)

OUTPUT:-
total_variance=sum(eigenvalues_sorted)
explained_variance_ratio=eigenvalues_sorted[:2]/total_variance
print(f"Explained variance ratio of the 1st two components: {explained_variance_ratio}")

# Visualization
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1], c=y, cmap="viridis",edgecolor="k",s=50)
plt.title("PCA of Iris Dataset (Reduced to 2D)",fontsize = 14)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(label="Species")
plt.show()

OUTPUT:-
VIVA-VOCE:-

1. What is the objective of this program?

●​ The objective is to apply Principal Component Analysis (PCA) on the Iris dataset to
reduce the number of features from 4 to 2, while retaining the most important
information.

2. What is Principal Component Analysis (PCA)?

●​ PCA is a dimensionality reduction technique that transforms correlated features into a smaller set of uncorrelated features called principal components, which capture the maximum variance in the data.

3. Why is PCA used in Machine Learning?

●​ PCA is used to:


○​ Reduce computational complexity.
○​ Remove redundant features.
○​ Improve model efficiency.
○​ Help in visualizing high-dimensional data.

4. Which libraries are used in this program and why?

●​ numpy: For mathematical computations (eigenvalues, eigenvectors, covariance matrix).


●​ matplotlib.pyplot: For visualization of the reduced dataset.
●​ sklearn.datasets: To load the Iris dataset.
●​ sklearn.preprocessing.StandardScaler: To standardize the dataset before
applying PCA.

5. Why is the dataset standardized using StandardScaler()?

●​ PCA is sensitive to scale. Standardizing ensures that all features contribute equally by
transforming them to have zero mean and unit variance.
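
As a quick sanity check (illustrative only, not part of the lab program), StandardScaler gives the same result as subtracting each column's mean and dividing by its standard deviation:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

x = load_iris().data

# Library standardization
x_scaled = StandardScaler().fit_transform(x)

# Manual standardization: zero mean, unit variance per column
x_manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(x_scaled, x_manual))   # True
print(x_scaled.mean(axis=0).round(6))    # approximately 0 for every feature
print(x_scaled.std(axis=0).round(6))     # approximately 1 for every feature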

6. How is the covariance matrix computed in PCA?

●​ In Python: cov_matrix = np.cov(x_scaled.T)


7. What are eigenvalues and eigenvectors in PCA?

●​ Eigenvalues indicate how much variance is captured by each principal component.


●​ Eigenvectors define the direction of the new feature space.
●​ They are computed using NumPy’s np.linalg.eig() function.

8. Why are eigenvalues sorted in descending order?

●​ The principal components are ranked based on variance, so we select the top
components that contribute the most.

9. Why do we use a scatter plot to visualize PCA results?

●​ Since PCA reduces the data to 2D, a scatter plot helps in visualizing how well the data
clusters after transformation.

10. What does the color represent in the scatter plot?

●​ The color represents different species (target labels) in the Iris dataset.

11. How do we calculate the explained variance ratio?

●​ The explained variance ratio is computed as:

Explained Variance Ratio =

Eigenvalue of a principal component / Sum of all Eigenvalues

●​ It tells how much information the selected principal components retain from the original
dataset.

12. What are some real-world applications and limitations of PCA?

●​ Applications:
○​ Image compression
○​ Feature selection in ML
○​ Genomic data analysis
○​ Fraud detection
●​ Limitations:
○​ PCA assumes linear relationships among features.
○​ It may lose interpretability of original features.
○​ It does not work well with categorical data.

13. How do we decide the number of principal components to keep?

●​ Using the explained variance ratio.


●​ The elbow method is often used to determine the optimal number.
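
As a cross-check of the manual eigen-decomposition above, scikit-learn's PCA reports the same explained-variance ratios directly, and their cumulative sum is a common way to decide how many components to keep. The numbers in the comments are approximate values for the standardized Iris data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x_scaled = StandardScaler().fit_transform(load_iris().data)

pca = PCA()                       # keep all 4 components for inspection
pca.fit(x_scaled)

ratios = pca.explained_variance_ratio_
print(ratios)                     # roughly [0.73, 0.23, 0.04, 0.005]
print(np.cumsum(ratios))          # first two components cover about 96% of the variance

# Keep the smallest number of components that explains, say, 95% of the variance
n_components = np.argmax(np.cumsum(ratios) >= 0.95) + 1
print(n_components)               # 2 for this dataset
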
4. Develop a program to load the Iris dataset. Implement the k-Nearest Neighbors
(k-NN) algorithm for classifying flowers based on their features. Split the dataset
into training and testing sets and evaluate the model using metrics like accuracy
and F1-score. Test it for different values of k (e.g., k=1,3,5) and evaluate the
accuracy. Extend the k-NN algorithm to assign weights based on the distance of
neighbors (e.g., weight=1/d2). Compare the performance of weighted k-NN and
regular k-NN on a synthetic or real-world dataset.

CODE:-
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Load the dataset


iris = datasets.load_iris()
X = iris.data # Features
y = iris.target # Target labels

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to evaluate k-NN with different k values


def evaluate_knn(k_values, X_train, X_test, y_train, y_test, weighted=False):
    results = []

    for k in k_values:
        # Initialize k-NN classifier with or without distance-based weights
        if weighted:
            knn = KNeighborsClassifier(n_neighbors=k, weights='distance')
        else:
            knn = KNeighborsClassifier(n_neighbors=k, weights='uniform')

        # Train the classifier
        knn.fit(X_train, y_train)

        # Make predictions
        y_pred = knn.predict(X_test)

        # Calculate accuracy and F1-score (weighted F1 for multi-class)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        results.append((k, accuracy, f1))

    return results

# Test k-NN for different values of k for unweighted and weighted classes
k_values = [1, 3, 5]

# Unweighted k-NN
print("Unweighted k-NN:")
unweighted_results = evaluate_knn(k_values, X_train, X_test, y_train, y_test, weighted=False)
for k, accuracy, f1 in unweighted_results:
    print(f"k={k}, Accuracy={accuracy:.4f}, F1-score={f1:.4f}")

# Weighted k-NN
print("Weighted k-NN:")
weighted_results = evaluate_knn(k_values, X_train, X_test, y_train, y_test, weighted=True)
for k, accuracy, f1 in weighted_results:
    print(f"k={k}, Accuracy={accuracy:.4f}, F1-score={f1:.4f}")

# Visualize the results


unweighted_accuracies = [accuracy for _, accuracy, _ in
unweighted_results]
weighted_accuracies = [accuracy for _, accuracy, _ in weighted_results]

# Plotting the results


plt.figure(figsize=(10, 6))
plt.plot(k_values, unweighted_accuracies, label='Unweighted k-NN',
marker='o')
plt.plot(k_values, weighted_accuracies, label='Weighted k-NN', marker='o')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.title('Accuracy of k-NN with Different k Values')
plt.legend()
plt.show()

OUTPUT:-
VIVA-VOCE:-

1. What is the objective of this program?

●​ The objective is to implement the k-Nearest Neighbors (k-NN) algorithm on the Iris
dataset for classification, evaluate its performance with different values of k, and
compare weighted and unweighted k-NN.

2. What is the k-Nearest Neighbors (k-NN) algorithm?

●​ k-NN is a supervised learning algorithm that classifies a data point based on the
majority vote of its k nearest neighbors in the feature space.

3. How does k-NN work?

1.​ Computes the distance between the query point and all other points.
2.​ Selects the k nearest neighbors.
3.​ Assigns the most common class label among the neighbors to the query point.
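
These three steps can be written directly in a few lines of NumPy. The helper below is purely illustrative (the lab program itself uses scikit-learn's KNeighborsClassifier) and classifies a single query point using Euclidean distances and a majority vote:

import numpy as np
from collections import Counter
from sklearn.datasets import load_iris

def knn_predict(X_train, y_train, query, k=3):
    # 1. Distances from the query point to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # 2. Indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

iris = load_iris()
print(knn_predict(iris.data, iris.target, iris.data[0], k=3))   # 0 (setosa)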

4. Why do we split the dataset into training and testing sets?

●​ To evaluate the model's generalization ability on unseen data.

5. Which libraries are used in this program and why?

●​ pandas & numpy: For data handling and mathematical operations.


●​ sklearn.datasets: To load the Iris dataset.
●​ sklearn.model_selection.train_test_split: To split the dataset into training
and testing sets.
●​ sklearn.metrics: To evaluate the model using accuracy and F1-score.
●​ sklearn.neighbors.KNeighborsClassifier: To implement k-NN classifier.
●​ matplotlib.pyplot: To visualize accuracy for different values of k.

6. What does train_test_split() do?

●​ It randomly splits the dataset into 80% training data and 20% testing data.

7. What does KNeighborsClassifier(n_neighbors=k, weights='uniform') do?

●​ It initializes a k-NN classifier with k neighbors and assigns equal weights to all neighbors.

8. What evaluation metrics are used in this program?


●​ Accuracy Score: Measures the proportion of correctly classified instances using the
accuracy_score() function.
●​ F1-score: A weighted average of precision and recall, useful for imbalanced datasets
using the f1_score() function.

9. Why do we test k-NN for different values of k?

●​ To find the best k-value that provides optimal performance.


●​ A small k (e.g., 1) may lead to overfitting, while a large k (e.g., 20) may cause
underfitting.

10. What is the difference between weighted and unweighted k-NN?

●​ Unweighted k-NN: Each neighbor has equal weight (default weights='uniform').


●​ Weighted k-NN: Neighbors closer to the query point have higher influence
(weights='distance').

11. How does weighted k-NN assign weights?

●​ It assigns a weight inversely proportional to the square of the distance (weight = 1/d²).

12. When should we use weighted k-NN instead of unweighted k-NN?

●​ When data points closer to the query point should have higher influence, especially
in cases where:
○​ Data is noisy.
○​ Class imbalance exists.
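
Note that scikit-learn's built-in weights='distance' option weights neighbors by 1/d rather than 1/d². If the exact 1/d² weighting mentioned in the problem statement is required, a callable can be passed instead; the sketch below is one possible way to do this (the small epsilon is an added assumption to avoid division by zero when a test point coincides with a training point):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def inverse_square(distances):
    # weight = 1 / d^2, with a small epsilon to avoid division by zero
    return 1.0 / (distances ** 2 + 1e-9)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5, weights=inverse_square)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))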

13. Why do we plot accuracy for different values of k?

●​ To compare the performance of weighted vs. unweighted k-NN.

14. How does the number of neighbors (k) affect accuracy?

●​ Small k (e.g., 1) → More variance, risk of overfitting.


●​ Large k (e.g., 20) → More bias, risk of underfitting.

15. What conclusion can we draw from the accuracy comparison?

●​ If weighted k-NN outperforms unweighted k-NN, it means that closer neighbors should
be given higher importance.

16. What are the advantages of k-NN?

●​ Simple & intuitive.


●​ Works well with small datasets.
●​ No training phase, making it a lazy learner.

17. What are the disadvantages of k-NN?

●​ Computationally expensive for large datasets.


●​ Sensitive to irrelevant features.
●​ Requires careful choice of k.

18. How can we improve k-NN performance?

●​ Feature scaling (Standardization).


●​ Using distance-weighted k-NN.
●​ Selecting an optimal k.
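
One common way to select an optimal k is to compare cross-validated accuracy over a range of candidate values; a minimal sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for each candidate k
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.4f}")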

19. What real-world applications use k-NN?

●​ Handwriting recognition (e.g., OCR).


●​ Recommender systems.
●​ Medical diagnosis.

5. Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select the appropriate dataset for your experiment and draw graphs.
CODE:-
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data


np.random.seed(42)
X = np.linspace(0, 10, 100)
y = np.sin(X) + np.random.normal(scale=0.2, size=X.shape)  # Add some noise

# Reshape X for matrix operations


X = X[:, np.newaxis]

# Locally Weighted Regression algorithm function


def lwr(X_train, y_train, query_point, tau):
    # Compute weights using a Gaussian kernel centered on the query point
    weights = np.exp(-np.sum((X_train - query_point) ** 2, axis=1) / (2 * tau ** 2))

    # Create a diagonal weight matrix
    W = np.diag(weights)

    # Add bias term to X_train
    X_bias = np.hstack([np.ones_like(X_train), X_train])

    # Compute the weighted normal equation to find the parameters (theta)
    theta = np.linalg.inv(X_bias.T @ W @ X_bias) @ X_bias.T @ W @ y_train

    # Predict for the query point
    query_point_bias = np.array([1, query_point])
    pred = query_point_bias @ theta
    return pred

# Function to make predictions using the trained LWR model


def predict_lwr(X_train, y_train, X_test, tau):
    prediction = np.array([lwr(X_train, y_train, x[0], tau) for x in X_test])
    return prediction

# Hyperparameter: Bandwidth (tau)
tau = 0.5

# Predict values for the test set
y_pred = predict_lwr(X, y, X, tau)

# Plot the results


plt.figure(figsize=(10, 6))
plt.scatter(X, y, label="Data Points", color="blue", s=10)
plt.plot(X, y_pred, label=f"LWR Prediction (tau={tau})", color="red",
linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Locally Weighted Regression (LWR)")
plt.legend()
plt.grid(True)
plt.show()

OUTPUT:-

VIVA-VOCE:-

1. What is the objective of this program?


●​ The objective is to implement the Locally Weighted Regression (LWR) algorithm, a
non-parametric regression technique, to fit data points and visualize the regression
curve.

2. What is Locally Weighted Regression (LWR)?

●​ LWR is a variant of linear regression where each data point has a different contribution
to the prediction based on its distance from the query point.
●​ It gives higher weights to nearby points and lower weights to distant points.

3. How does LWR differ from Ordinary Least Squares (OLS) regression?

●​ OLS Regression: Assumes a global relationship between features and output by minimizing a cost function over all data points.
●​ LWR: Uses local relationships by assigning weights to different data points based on
their proximity to the query point.

4. Why is LWR called a non-parametric algorithm?

●​ It does not learn a fixed set of parameters for the entire dataset but instead computes
weights dynamically for each query point.

5. Which libraries are used in this program and why?

●​ numpy: For numerical operations (matrix computations, Gaussian weighting).


●​ matplotlib.pyplot: For data visualization (scatter plots, regression curve).

6. Why is synthetic data used in this program?

●​ To create a controlled environment where a non-linear pattern (sinusoidal curve) is modeled with added noise.

7. What is the role of the bandwidth parameter (τ)?

●​ The bandwidth (τ: tau) controls how much weight is given to nearby points.
●​ A small τ → More localized model (risk of overfitting).
●​ A large τ → More generalized model (risk of underfitting).

8. How are the weights assigned in LWR?

●​ Each training point x_i receives a weight from a Gaussian kernel, w_i = exp(−(x_i − x_q)² / (2τ²)), so points close to the query point x_q get weights near 1 and distant points get weights near 0.

9. What is the purpose of the diagonal weight matrix W?


●​ The diagonal weight matrix ensures that nearby points contribute more to the
regression model than distant points.

10. What is the weighted normal equation used in this code?

●​ θ = (XᵀWX)⁻¹ XᵀWy, where X is the training data with a bias column, W is the diagonal weight matrix, and y is the target vector. In the code this is theta = np.linalg.inv(X_bias.T @ W @ X_bias) @ X_bias.T @ W @ y_train.

11. Why do we add a bias term to X?

●​ A bias term is added to include an intercept (β₀) in the regression model.


●​ The linear regression equation is y = β₀ + β₁x, where β₀ is the intercept and β₁ is the slope.

12. How does the function lwr() work?

●​ It takes training data, a query point, and τ, then:


1.​ Computes weights using the Gaussian kernel.
2.​ Forms a weighted normal equation.
3.​ Solves for θ (theta).
4.​ Makes a prediction for the query point.

13. How is LWR different from k-Nearest Neighbors (k-NN)?

●​ k-NN: Uses a fixed number of neighbors.


●​ LWR: Uses all data points with weights that decrease based on distance.

14. Why do we plot the LWR predictions?

●​ To compare how well the LWR curve fits the data points.

15. What does the red curve in the plot represent?

●​ It represents the LWR-predicted values for different X values.

16. What happens when τ (tau) is too large or too small?

●​ Too large → The model behaves like linear regression.


●​ Too small → The model overfits and captures noise.
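
This effect can be seen by sweeping τ over a few values on the same kind of synthetic data used in the program. The sketch below re-implements the Gaussian-weighted fit compactly and is meant only as an illustration, not as a replacement for the lwr() function above:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
X = np.linspace(0, 10, 100)
y = np.sin(X) + rng.normal(scale=0.2, size=X.shape)

def lwr_predict(xq, X, y, tau):
    # Gaussian weights centered on the query point xq
    w = np.exp(-(X - xq) ** 2 / (2 * tau ** 2))
    A = np.column_stack([np.ones_like(X), X])               # bias column + feature
    theta = np.linalg.pinv((A.T * w) @ A) @ (A.T * w) @ y   # weighted normal equation
    return np.array([1, xq]) @ theta

plt.scatter(X, y, s=10, color="blue", label="Data Points")
for tau in [0.1, 0.5, 5.0]:   # small, moderate and large bandwidth
    y_fit = [lwr_predict(xq, X, y, tau) for xq in X]
    plt.plot(X, y_fit, label=f"tau={tau}")
plt.legend()
plt.title("Effect of the bandwidth tau on LWR")
plt.show()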

17. What are some real-world applications of Locally Weighted Regression?

●​ Stock market price prediction.


●​ Medical diagnosis (personalized medicine).
●​ Handwriting and speech recognition.
18. What are the advantages of LWR?

●​ Works well for non-linear data.


●​ No need to define a fixed function form.
●​ Requires no prior assumptions about the data.
●​ Provides customized fitting for different regions of the dataset.

19. What are the disadvantages of LWR?

●​ Computationally expensive (needs to compute weights for every test point).


●​ Not scalable for large datasets.
●​ Requires careful tuning of τ.

6. Develop a program to demonstrate the working of Linear Regression and
Polynomial Regression. Use Boston Housing Dataset for Linear Regression and
Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial
Regression.

1. Linear Regression for Boston Housing Dataset


●​ Load the Boston Housing dataset as a .csv file in Jupyter Notebook and execute the
following program.

CODE:-
# Load necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset


dataset = "BostonHousingData.csv"

# Read the dataset


df= pd.read_csv(dataset)

# Display the first few rows


df.head()

# Impute missing values with mean


df['CRIM'].fillna(df['CRIM'].mean(), inplace=True)
df['ZN'].fillna(df['ZN'].mean(), inplace=True)
df['INDUS'].fillna(df['INDUS'].mean(), inplace=True)
df['CHAS'].fillna(df['CHAS'].mode()[0], inplace=True)
df['AGE'].fillna(df['AGE'].mean(), inplace=True)
df['LSTAT'].fillna(df['LSTAT'].mean(), inplace=True)
df.head()

# Drop the columns 'CHAS' and 'ZN'


df = df.drop(columns=['CHAS', 'ZN'])

# Verify the columns are dropped


df.head()
# Train the model
x = df.drop('MEDV', axis=1) # Features
y = df['MEDV'] # Target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
random_state=42)
model_lr = LinearRegression()
model_lr.fit(x_train, y_train)
y_pred_lr = model_lr.predict(x_test)

# Evaluate the model's performance


mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

def print_metrics(y_test, y_pred, model_name):
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"\n{model_name} - MSE: {mse}, R²: {r2}")

print_metrics(y_test, y_pred_lr, 'Linear Regression')

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, label='Linear Regression')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)],
color='red')
plt.title('Linear Regression with Boston Housing Dataset')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.legend()
plt.show()

OUTPUT:-
2. Polynomial Regression for Auto MPG Dataset

CODE:-
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def polynomial_regression_auto_mpg():
    # Load the Auto MPG dataset
    auto_mpg = fetch_openml(name="autoMpg", version=1, as_frame=True)
    data = auto_mpg.data
    target = auto_mpg.target

    # Remove rows with missing 'horsepower' values from both data and target
    data = data.dropna(subset=["horsepower"])
    target = target.loc[data.index]

    X_hp = data[["horsepower"]].astype(float)
    y_mpg = target.astype(float)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X_hp, y_mpg, test_size=0.2, random_state=42)

    # Polynomial transformation
    poly_features = PolynomialFeatures(degree=3)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)

    # Linear regression on polynomial features
    lr_poly = LinearRegression()
    lr_poly.fit(X_train_poly, y_train)
    y_pred_poly = lr_poly.predict(X_test_poly)

    # Evaluation metrics
    mse_poly = mean_squared_error(y_test, y_pred_poly)
    r2_poly = r2_score(y_test, y_pred_poly)

    # Sort test points so the fitted curve plots smoothly
    X_test_sorted, y_test_sorted = zip(*sorted(zip(X_test.values.flatten(), y_test)))
    y_pred_sorted = lr_poly.predict(poly_features.transform(np.array(X_test_sorted).reshape(-1, 1)))

    # Visualization
    plt.scatter(X_test, y_test, color='blue', label='Actual')
    plt.plot(X_test_sorted, y_pred_sorted, color='red', linewidth=2, label='Polynomial Regression Fit')
    plt.xlabel('Horsepower')
    plt.ylabel('Miles per Gallon (MPG)')
    plt.title('Polynomial Regression: Horsepower vs. MPG')
    plt.legend()
    plt.show()

    # Output metrics
    print(f'Mean Squared Error: {mse_poly}')
    print(f'R² Score: {r2_poly}')


def run_models():
    polynomial_regression_auto_mpg()


run_models()

OUTPUT:-
7. Develop a program to load the Titanic dataset. Split the data into training and
test sets. Train a decision tree classifier. Visualize the tree structure. Evaluate
accuracy, precision, recall, and F1-score.

●​ Load the Titanic dataset as a .csv file in Jupyter Notebook and execute the following
program.

CODE:-
# Load necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# Load the dataset


df=pd.read_csv("titanic.csv")
df.head()

OUTPUT:-

CODE:-
# Data Preprocessing
df['Age'].fillna(df['Age'].median(),inplace=True)
df['Embarked'].fillna(df['Embarked'].mode()[0],inplace=True)
df['Fare'].fillna(df['Fare'].median(),inplace=True)
df['Cabin'].fillna('U', inplace=True)  # Fill missing Cabin values with 'U' (unknown)
df.head()
OUTPUT:-

CODE:-
# Encode categorical columns and drop unnecessary ones
label_encoder=LabelEncoder()
df['Sex']=label_encoder.fit_transform(df['Sex'])
df['Embarked']=label_encoder.fit_transform(df['Embarked']) # C:0, Q:1, S:2
df.drop(columns=['Name','Ticket','Cabin'],inplace=True)
X=df.drop(columns=['Survived'])
y=df['Survived']
df.head()

OUTPUT:-

CODE:-
# Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf=DecisionTreeClassifier(random_state=42)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
# Evaluation Metrics
accuracy=accuracy_score(y_test,y_pred)
precision=precision_score(y_test,y_pred)
recall=recall_score(y_test,y_pred)
f1=f1_score(y_test,y_pred)
print(f"Accuracy:{accuracy:.4f}")
print(f"Precision:{precision:.4f}")
print(f"Recall:{recall:.4f}")
print(f"F1-score:{f1:.4f}")

# Visualization
plt.figure(figsize=(12,8))
plot_tree(clf, filled=True, feature_names=X.columns, class_names=['Not Survived', 'Survived'], rounded=True)
plt.title("Decision Tree Classifier for Titanic dataset")
plt.show()

OUTPUT:-
8. Develop a program to implement the Naive Bayesian Classifier considering the
Iris dataset. Compute the accuracy of the classifier by considering the test data.

CODE:-
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Iris dataset


iris=load_iris()

# Train and test the classifier


X=iris.data
y=iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model=GaussianNB()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
accuracy=accuracy_score(y_test,y_pred)

# Print the metrics


print("Accuracy:", accuracy)
print("Iris Classification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

OUTPUT:-
9. Develop a program to implement k-means clustering using Wisconsin Breast
Cancer dataset and visualize the clustering result.

CODE:-
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset


data = load_breast_cancer()
X = data.data
y = data.target

# Train the model


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y

# Visualization
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1',
s=100, edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='yellow', marker='X',
label='Centroids')
plt.title('k-Means Clustering for Wisconsin Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()

OUTPUT:-
EXTRA PROGRAMS

1. Develop a program to implement the Perceptron Learning algorithm to simulate
an OR gate.

CODE:-
# Import the required libraries
import numpy as np

# OR gate dataset
X = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])
y = np.array([0, 1, 1, 1])

# Initialize weights and bias


weights = np.zeros(X.shape[1])
bias = 0
lr = 0.1
epochs = 10

# Activation function (step)


def step(x):
    return 1 if x >= 0 else 0

# Training the perceptron


for epoch in range(epochs):
    for i in range(len(X)):
        z = np.dot(X[i], weights) + bias
        y_pred = step(z)
        error = y[i] - y_pred

        # Update rule
        weights += lr * error * X[i]
        bias += lr * error

print("Trained weights:", weights)
print("Trained bias:", bias)

# Test the perceptron
for x in X:
    output = step(np.dot(x, weights) + bias)
    print(f"Input: {x} => Output: {output}")

OUTPUT:-
2. Develop a program to demonstrate the working of Logistic Regression using
Iris dataset.

CODE:-
# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Create logistic regression model


model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict on the test set


y_pred = model.predict(X_test)

# Print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

OUTPUT:-
SAMPLE VIVA-VOCE BASED ON THEORY SYLLABUS

Module 1: Introduction & Understanding Data

1.​ What is Machine Learning?


○​ Machine Learning (ML) is a subset of artificial intelligence that enables systems
to learn from data and improve their performance without being explicitly
programmed.
2.​ What are the different types of Machine Learning?
○​ Supervised Learning: The model learns from labeled data. (e.g., classification,
regression)
○​ Unsupervised Learning: The model identifies patterns in unlabeled data. (e.g.,
clustering, association)
○​ Reinforcement Learning: The model learns by interacting with an environment
and receiving rewards or penalties.
3.​ What are some challenges in Machine Learning?
○​ Data quality and availability
○​ Overfitting and underfitting
○​ Model interpretability
○​ Computational complexity
4.​ What is descriptive statistics, and why is it used in ML?
○​ Descriptive statistics summarize data using measures like mean, median, mode,
variance, and standard deviation. It helps in understanding data distribution
before applying ML models.
5.​ What is the difference between univariate, bivariate, and multivariate analysis?
○​ Univariate analysis: Analysis of a single variable (e.g., histogram of age).
○​ Bivariate analysis: Relationship between two variables (e.g., scatter plot of
height vs. weight).
○​ Multivariate analysis: Involves multiple variables (e.g., PCA for dimensionality
reduction).

Module 2: Feature Engineering & ML Algorithms

6.​ What is feature engineering?


○​ Feature engineering involves selecting, transforming, or creating features to
improve model performance.
7.​ What is dimensionality reduction? Name two techniques.
○​ Dimensionality reduction removes redundant features to improve efficiency.
○​ Techniques: Principal Component Analysis (PCA), Linear Discriminant Analysis
(LDA).
8.​ What is overfitting and how can it be prevented?
○​ Overfitting occurs when a model performs well on training data but poorly on
unseen data.
○​ Prevention methods: Cross-validation, regularization (L1/L2), pruning (for
decision trees), using more data.
9.​ What are training, validation, and testing sets?
○​ Training set: Used to train the model.
○​ Validation set: Used for hyperparameter tuning.
○​ Testing set: Used to evaluate final model performance.
10.​What is a confusion matrix? Explain its components.
○​ A confusion matrix is used to evaluate classification performance.
○​ Components: True Positive (TP), False Positive (FP), True Negative (TN), False
Negative (FN).
11.​What is the ROC curve and AUC?
●​ ROC (Receiver Operating Characteristic) Curve: Plots TPR vs. FPR for
different thresholds.
●​ AUC (Area Under Curve): Measures classifier performance (closer to 1 means
better).
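
A brief sketch showing how both metrics are obtained with scikit-learn (a two-class subset of the Iris data is used purely for illustration, since ROC/AUC applies to binary classification):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Binary problem: classes 0 and 1 of the Iris dataset
X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Confusion matrix: rows = actual class, columns = predicted class
print(confusion_matrix(y_test, model.predict(X_test)))

# AUC needs the predicted probability of the positive class
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))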

Module 3: Similarity-Based Learning

12.​What is the k-Nearest Neighbor (k-NN) algorithm?


●​ k-NN is a non-parametric, lazy learning algorithm that classifies a sample based
on the majority class of its k nearest neighbors.
13.​How does weighted k-NN differ from k-NN?
●​ Weighted k-NN assigns higher importance to closer neighbors by giving them
higher weights.
14.​What is the nearest centroid classifier?
●​ A classification algorithm that assigns a data point to the class whose centroid
(mean vector) is closest to the data point.
15.​What is Locally Weighted Regression (LWR)?
●​ LWR is a non-parametric regression technique that gives different weights to
training points based on their distance from the test point.
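
For the nearest centroid classifier mentioned in question 14 above, scikit-learn provides a ready-made NearestCentroid estimator; a minimal usage sketch on the Iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = NearestCentroid()              # one mean vector (centroid) per class
clf.fit(X_train, y_train)
print(clf.centroids_)                # the learned class centroids
print(clf.score(X_test, y_test))     # accuracy on the held-out test set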
