ML Lab Manual
1. Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify any
outliers. Use California Housing dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
housing_df = data.frame
numerical_features = housing_df.select_dtypes(include=[np.number]).columns

# Plot histograms
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    plt.hist(housing_df[feature], bins=30, color='skyblue', edgecolor='black')
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

# Plot box plots
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=housing_df[feature], color='orange')
    plt.title(f'Box Plot of {feature}')
plt.tight_layout()
plt.show()

# Detect outliers with the 1.5 * IQR rule
print("Outliers Detection:")
outliers_summary = {}
for feature in numerical_features:
    Q1 = housing_df[feature].quantile(0.25)
    Q3 = housing_df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = housing_df[(housing_df[feature] < lower_bound) |
                          (housing_df[feature] > upper_bound)]
    outliers_summary[feature] = len(outliers)
    print(f"{feature}: {len(outliers)} outliers")

print("\nDataset Summary:")
print(housing_df.describe())
OUTPUT:
Outliers Detection:
MedInc: 681 outliers
HouseAge: 0 outliers
AveRooms: 511 outliers
AveBedrms: 1424 outliers
Population: 1196 outliers
AveOccup: 711 outliers
Latitude: 0 outliers
Longitude: 0 outliers
MedHouseVal: 1071 outliers
Dataset Summary:
[8 rows x 9 columns]
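The outlier counts above come from the 1.5 × IQR rule used in the loop. A minimal sketch of that rule on a tiny hand-made sample (the values are hypothetical, chosen so one point is obviously extreme):

```python
import numpy as np

def iqr_outliers(values):
    """Return points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

sample = [10, 12, 11, 13, 12, 11, 100]
print(iqr_outliers(sample))  # → [100]
```

The same computation, applied per column of the housing frame, produces the counts printed in the OUTPUT section.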
BCSL606 Program 2
2. Develop a program to compute the correlation matrix to understand the relationships between
pairs of features. Visualize the correlation matrix using a heatmap to identify which variables have
strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between
features. Use California Housing dataset.
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

california_data = fetch_california_housing(as_frame=True)
data = california_data.frame

# Heatmap of the correlation matrix
correlation_matrix = data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of California Housing Features')
plt.show()

# Pair plot of all features (slow on the full 20,640-row dataset)
sns.pairplot(data, diag_kind='kde', plot_kws={'alpha': 0.4})
plt.show()
OUTPUT:
BCSL606 Program 3
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
data = iris.data
labels = iris.target
label_names = iris.target_names

# Reduce the 4 features to 2 principal components
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)
reduced_df = pd.DataFrame(data_reduced, columns=['PC1', 'PC2'])
reduced_df['Label'] = labels

plt.figure(figsize=(8, 6))
colors = ['r', 'g', 'b']
for i, label in enumerate(np.unique(labels)):
    subset = reduced_df[reduced_df['Label'] == label]
    plt.scatter(subset['PC1'], subset['PC2'],
                label=label_names[label],
                color=colors[i])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.legend()
plt.grid()
plt.show()
OUTPUT:
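A quick way to check how much information the 2-component projection keeps is the explained variance ratio reported by the fitted PCA object; a short sketch:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2).fit(iris.data)

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(f"Total retained: {pca.explained_variance_ratio_.sum():.3f}")
```

For Iris the first two components retain well over 90% of the variance, which is why the 2-D scatter plot separates the three species so cleanly.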
BCSL606 Program 4
4. For a given set of training data examples stored in a .CSV file, implement and demonstrate the
Find-S algorithm to output a description of the set of all hypotheses consistent with the training
examples.
import pandas as pd

def find_s_algorithm(file_path):
    data = pd.read_csv(file_path)
    print("Training data:")
    print(data)

    attributes = data.columns[:-1]
    class_label = data.columns[-1]

    hypothesis = None  # most specific hypothesis, taken from the first positive example
    for _, row in data.iterrows():
        if row[class_label] == 'Yes':
            if hypothesis is None:
                hypothesis = list(row[attributes])
            else:
                for i, value in enumerate(row[attributes]):
                    if hypothesis[i] == value:
                        hypothesis[i] = value   # attribute still consistent
                    else:
                        hypothesis[i] = '?'     # generalise the attribute
    return hypothesis

file_path = 'training_data.csv'
hypothesis = find_s_algorithm(file_path)
print("\nThe final hypothesis is:", hypothesis)
OUTPUT:
Training data:
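Since training_data.csv is not bundled with the manual, here is a self-contained sketch of the same Find-S logic on a small in-memory table (the attribute names and rows are hypothetical, in the style of the classic EnjoySport example):

```python
import pandas as pd

def find_s(data):
    attributes = data.columns[:-1]
    target = data.columns[-1]
    hypothesis = None
    for _, row in data.iterrows():
        if row[target] == 'Yes':
            if hypothesis is None:
                # Start with the first positive example (most specific)
                hypothesis = list(row[attributes])
            else:
                # Generalise every attribute that disagrees
                for i, value in enumerate(row[attributes]):
                    if hypothesis[i] != value:
                        hypothesis[i] = '?'
    return hypothesis

toy = pd.DataFrame({
    'Sky':        ['Sunny', 'Sunny', 'Rainy', 'Sunny'],
    'AirTemp':    ['Warm',  'Warm',  'Cold',  'Warm'],
    'Humidity':   ['Normal', 'High', 'High',  'High'],
    'EnjoySport': ['Yes',   'Yes',  'No',    'Yes'],
})
print(find_s(toy))  # → ['Sunny', 'Warm', '?']
```

Negative rows are ignored entirely; only positive examples force generalisation, which is why Humidity ends up as '?' while Sky and AirTemp stay specific.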
BCSL606 Program 5
5. Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly
generated values of x in the range [0, 1]. Label the first 50 points as Class1 if x <= 0.5 and as
Class2 otherwise, then classify the remaining 50 points using k-NN for k = 1, 2, 3, 4, 5.
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

data = np.random.rand(100)
labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

def knn_classifier(train_data, train_labels, test_point, k):
    distances = [(abs(test_point - x), label)
                 for x, label in zip(train_data, train_labels)]
    distances.sort(key=lambda x: x[0])
    k_nearest_neighbors = distances[:k]
    k_nearest_labels = [label for _, label in k_nearest_neighbors]
    return Counter(k_nearest_labels).most_common(1)[0][0]

train_data = data[:50]
train_labels = labels
test_data = data[50:]

print("--- k-Nearest Neighbors Classification ---")
print("Training dataset: First 50 points labeled based on the rule "
      "(x <= 0.5 -> Class1, x > 0.5 -> Class2)")

k_values = [1, 2, 3, 4, 5]
results = {}
for k in k_values:
    classified_labels = [knn_classifier(train_data, train_labels, x, k)
                         for x in test_data]
    results[k] = classified_labels
    print(f"Results for k = {k}:")
    print(classified_labels)
    print("\n")
print("Classification complete.\n")

for k in k_values:
    classified_labels = results[k]
    plt.figure(figsize=(10, 6))
    plt.scatter(train_data, [0] * len(train_data),
                c=["blue" if label == "Class1" else "red" for label in train_labels],
                label="Training Data", marker="o")
    plt.scatter(test_data, [1] * len(test_data),
                c=["blue" if label == "Class1" else "red" for label in classified_labels],
                label="Classified Test Data", marker="x")
    plt.title(f"k-NN Classification (k = {k})")
    plt.xlabel("Data Points")
    plt.ylabel("Classification Level")
    plt.legend()
    plt.grid(True)
    plt.show()
OUTPUT:
--- k-Nearest Neighbors Classification ---
Training dataset: First 50 points labeled based on the rule (x <= 0.5 -> Class1, x > 0.5 -> Class2)
Results for k = 1:
Results for k = 2:
Results for k = 3:
Results for k = 4:
Results for k = 5:
Classification complete.
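The classifier above reduces to sorting training points by distance and taking a majority vote among the k closest. A minimal sketch with hand-picked 1-D points (hypothetical values) makes that visible:

```python
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (x, label) pairs; distance in 1-D is |query - x|."""
    neighbors = sorted(train, key=lambda p: abs(query - p[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [(0.1, "Class1"), (0.4, "Class1"), (0.45, "Class1"),
         (0.6, "Class2"), (0.9, "Class2")]
print(knn_predict(train, 0.5, 3))  # → Class1 (2 of the 3 nearest vote Class1)
```

With k = 1 the prediction simply follows the single nearest point, which is why small k tracks the 0.5 boundary most tightly in the plots above.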
BCSL606 Program 6
6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt

def locally_weighted_regression(x, X, y, tau):
    # Gaussian kernel: points near the query x get larger weights
    weights = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(weights)
    X_transpose_W = X.T @ W
    theta = np.linalg.pinv(X_transpose_W @ X) @ X_transpose_W @ y
    return x @ theta

np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones(X.shape), X]

tau = 0.5
x_test = np.linspace(0, 2 * np.pi, 200)
x_test_bias = np.c_[np.ones(x_test.shape), x_test]
y_pred = np.array([locally_weighted_regression(x, X_bias, y, tau)
                   for x in x_test_bias])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', alpha=0.7, label='Training Data')
plt.plot(x_test, y_pred, color='blue', linewidth=2,
         label=f'LWR Fit (tau = {tau})')
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()
OUTPUT:
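The heart of LWR is the Gaussian weighting of training points around each query. A small check (toy points, bias column included as in the program) that the query's own position gets weight 1 while symmetric neighbours get equal, smaller weights:

```python
import numpy as np

def gaussian_weights(x_query, X, tau):
    """w_i = exp(-||x_i - x_query||^2 / (2 * tau^2))"""
    return np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # [bias, feature]
w = gaussian_weights(X[1], X, tau=0.5)
print(w)  # middle point gets weight 1; the two neighbours decay equally
```

Smaller tau shrinks the neighbourhood (a wigglier fit); larger tau approaches ordinary least squares over all points.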
BCSL606 Program 7
7. Develop a program to demonstrate the working of Linear Regression and Polynomial
Regression. Use the California Housing dataset for Linear Regression and the Auto MPG dataset
for Polynomial Regression.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

def linear_regression_california():
    housing = fetch_california_housing(as_frame=True)
    X = housing.data[["AveRooms"]]
    y = housing.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Linear Regression MSE:", mean_squared_error(y_test, y_pred))
    plt.scatter(X_test, y_test, color="blue", alpha=0.3, label="Actual")
    plt.plot(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("AveRooms")
    plt.ylabel("MedHouseVal")
    plt.legend()
    plt.show()

def polynomial_regression_auto_mpg():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
    columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
               "acceleration", "model_year", "origin", "car_name"]
    data = pd.read_csv(url, sep=r"\s+", names=columns, na_values="?")
    data = data.dropna()
    X = data["displacement"].values.reshape(-1, 1)
    y = data["mpg"].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    poly_model.fit(X_train, y_train)
    y_pred = poly_model.predict(X_test)
    print("Polynomial Regression MSE:", mean_squared_error(y_test, y_pred))
    plt.scatter(X_test, y_test, color="blue", alpha=0.5, label="Actual")
    plt.scatter(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Displacement")
    plt.ylabel("MPG")
    plt.legend()
    plt.show()

if __name__ == "__main__":
    print("Demonstrating Linear Regression and Polynomial Regression")
    linear_regression_california()
    polynomial_regression_auto_mpg()
OUTPUT:
Demonstrating Linear Regression and Polynomial Regression
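The polynomial model works by expanding each input into powers before fitting an ordinary linear model; a minimal sketch of that expansion on two hypothetical inputs:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))
# → [[1. 2. 4.]
#    [1. 3. 9.]]
```

Each row becomes [1, x, x²], so the downstream LinearRegression effectively fits a quadratic in the original feature.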
BCSL606 Program 8
8. Develop a program to demonstrate the working of the decision tree algorithm. Use Breast
Cancer Data set for building the decision tree and apply this knowledge to classify a new sample.
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Classify a new sample (here: the first test instance)
new_sample = np.array([X_test[0]])
prediction = clf.predict(new_sample)
print("Predicted class for new sample:", data.target_names[prediction][0])

plt.figure(figsize=(12, 8))
tree.plot_tree(clf, filled=True, feature_names=data.feature_names,
               class_names=data.target_names)
plt.show()
OUTPUT:
BCSL606 Program 9
9. Develop a program to implement the Naive Bayesian classifier, using the Olivetti Face dataset
for training. Compute the accuracy of the classifier on held-out test data.
PROGRAM:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

data = fetch_olivetti_faces(shuffle=True, random_state=42)
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=0))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Display a few test faces with their predicted labels
fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for ax, image, pred in zip(axes.ravel(), X_test, y_pred):
    ax.imshow(image.reshape(64, 64), cmap='gray')
    ax.set_title(f"Pred: {pred}")
    ax.axis('off')
plt.show()
OUTPUT:
Accuracy: 80.83%
Classification Report:
Confusion Matrix:
[[2 0 0 ... 0 0 0]
[0 2 0 ... 0 0 0]
[0 0 2 ... 0 0 1]
...
[0 0 0 ... 1 0 0]
[0 0 0 ... 0 3 0]
[0 0 0 ... 0 0 5]]
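Gaussian Naive Bayes fits a per-class mean and variance for every pixel and picks the class with the highest resulting likelihood. The same decision rule on a tiny 1-D toy set (hypothetical values) makes this visible:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Class 0 clusters near 0, class 1 clusters near 5
X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [4.9]])
y = np.array([0, 0, 0, 1, 1, 1])

gnb = GaussianNB().fit(X, y)
print(gnb.predict([[0.15], [5.1]]))  # → [0 1]
```

With 4096-pixel face vectors the idea is identical, just with one Gaussian per pixel per person.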
BCSL606 Program 10
10. Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set
and visualize the clustering result.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report

data = load_breast_cancer()
X = data.data
y = data.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X_scaled)

# Note: cluster ids are arbitrary, so this comparison is only valid up to a label swap
print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1')
plt.title('K-Means Clustering (PCA-reduced)')
plt.legend(title="Cluster")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label', palette='coolwarm')
plt.title('True Labels (PCA-reduced)')
plt.legend(title="True Label")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1')
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='black', marker='X',
            label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.legend(title="Cluster")
plt.show()
OUTPUT:
Confusion Matrix:
[[175 37]
[ 13 344]]
Classification Report:
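Because k-means assigns arbitrary cluster ids, the confusion matrix above is only meaningful up to a swap of the two columns. The cluster-to-label agreement implied by the printed matrix can be computed directly:

```python
import numpy as np

# Confusion matrix as printed above (rows: true label, columns: cluster id)
cm = np.array([[175, 37],
               [13, 344]])

# If cluster ids align with the true labels, accuracy is the diagonal mass;
# otherwise swap the columns and take the better of the two assignments.
acc = max(np.trace(cm), np.trace(cm[:, ::-1])) / cm.sum()
print(f"Cluster-label agreement: {acc:.3f}")  # → 0.912
```

So the unsupervised clustering recovers the benign/malignant split for roughly 91% of the samples.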