
Data604 Sravani FinalCombined

The document outlines a project involving handwritten digit recognition and iris analysis using various machine learning techniques, including SVM with linear and Gaussian kernels, one-vs-one vs. one-vs-all SVM, k-means clustering, hierarchical clustering, and linear regression. It provides detailed code and results for training models, computing confusion matrices, and evaluating performance metrics. The findings indicate that the linear kernel and one-vs-one SVM perform comparably better, while k-means clustering with K=6 aligns well with the number of classes.


% # Data604: Project 2 - Handwritten Digits and Iris Analysis

% **Assigned labels**: 6, 1, 5, 4, 8, 3 (USPS indices: 7, 2, 6, 5, 9, 4).


% Run all sections and export to PDF to see code, outputs, and figures.

% ## Problem 1: Soft Margin SVM with Gaussian vs. Linear Kernel


clear; clc;
load('usps_all.mat');
labels = [7, 2]; % Digits 6 and 1
X_train = []; Y_train = []; X_test = []; Y_test = [];
for i = 1:length(labels)
train_samples = double(data(:, 1:1000, labels(i))')./255;
test_samples = double(data(:, 1001:1100, labels(i))')./255;
X_train = [X_train; train_samples];
X_test = [X_test; test_samples];
Y_train = [Y_train; repmat(3-2*i, 1000, 1)]; % +1 for 6 (i=1), -1 for 1 (i=2)
Y_test = [Y_test; repmat(3-2*i, 100, 1)];
end
fprintf('Training set: %d samples, %d features\n', size(X_train, 1), size(X_train, 2));

Training set: 2000 samples, 256 features

fprintf('Test set: %d samples, %d features\n', size(X_test, 1), size(X_test, 2));

Test set: 200 samples, 256 features

% Train SVM models


SVM_linear = fitcsvm(X_train, Y_train, 'KernelFunction', 'linear', ...
    'BoxConstraint', 1, 'Standardize', true);
pred_linear = predict(SVM_linear, X_test);
% 'KernelScale', 'auto' picks the scale with a subsampling heuristic, so the
% Gaussian result can vary between runs unless rng is set beforehand.
SVM_gaussian = fitcsvm(X_train, Y_train, 'KernelFunction', 'rbf', ...
    'BoxConstraint', 1, 'KernelScale', 'auto', 'Standardize', true);
pred_gaussian = predict(SVM_gaussian, X_test);

% Compute confusion matrices


conf_linear = confusionmat(Y_test, pred_linear);
conf_gaussian = confusionmat(Y_test, pred_gaussian);
fprintf('Linear Kernel Confusion Matrix:\n'); disp(conf_linear);

Linear Kernel Confusion Matrix:


98 2
0 100

fprintf('Gaussian Kernel Confusion Matrix:\n'); disp(conf_gaussian);

Gaussian Kernel Confusion Matrix:


98 2
0 100

% Plot confusion matrices


figure('Position', [100, 100, 1000, 400], 'Visible', 'on');
% confusionmat sorts the class labels in ascending order, so row/column 1 is
% the lower label (digit 1) and row/column 2 is the higher label (digit 6).
subplot(1, 2, 1);
confusionchart(conf_linear, {'1 (-1)', '6 (+1)'}, 'Title', 'Linear Kernel', ...
    'RowSummary', 'row-normalized', 'ColumnSummary', 'column-normalized');
subplot(1, 2, 2);
confusionchart(conf_gaussian, {'1 (-1)', '6 (+1)'}, 'Title', 'Gaussian Kernel', ...
    'RowSummary', 'row-normalized', 'ColumnSummary', 'column-normalized');
sgtitle('Problem 1: Confusion Matrices');

drawnow; pause(0.1);

% Compute and display metrics


[p_linear, r_linear, s_linear, acc_linear] = calculate_metrics_p1(conf_linear);
[p_gaussian, r_gaussian, s_gaussian, acc_gaussian] = calculate_metrics_p1(conf_gaussian);
fprintf('\nLinear Kernel Metrics:\nAccuracy: %.4f\nPrecision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
    acc_linear, p_linear, r_linear, s_linear);

Linear Kernel Metrics:


Accuracy: 0.9900
Precision: 1.0000
Recall: 0.9800
Specificity: 1.0000

fprintf('Gaussian Kernel Metrics:\nAccuracy: %.4f\nPrecision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
    acc_gaussian, p_gaussian, r_gaussian, s_gaussian);

Gaussian Kernel Metrics:


Accuracy: 0.9900
Precision: 1.0000
Recall: 0.9800
Specificity: 1.0000

% Conclusion
if acc_gaussian > acc_linear
    fprintf('Gaussian kernel performs better.\n');
else
    fprintf('Linear kernel performs better or metrics are comparable.\n');
end

Linear kernel performs better or metrics are comparable.

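function [precision, recall, specificity, accuracy] = calculate_metrics_p1(confMat)
    % Rows are true classes and columns are predicted classes (confusionmat layout).
    % The class in row/column 1 is treated as the positive class.
    TP = confMat(1,1); TN = confMat(2,2); FP = confMat(2,1); FN = confMat(1,2);
    precision = TP / (TP + FP);
    recall = TP / (TP + FN);
    specificity = TN / (TN + FP);
    accuracy = (TP + TN) / sum(confMat(:));
end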

% ## Problem 2: One-vs-One vs. One-vs-All Multi-Class SVM


clear; clc;
load('usps_all.mat');
labels = [7, 2, 6, 5, 9, 4]; % Digits 6, 1, 5, 4, 8, 3
label_names = {'6', '1', '5', '4', '8', '3'};
X_train = []; Y_train = []; X_test = []; Y_test = [];
for i = 1:length(labels)
train_samples = double(data(:, 1:1000, labels(i))')./255;
test_samples = double(data(:, 1001:1100, labels(i))')./255;
X_train = [X_train; train_samples];
X_test = [X_test; test_samples];
Y_train = [Y_train; repmat(i, 1000, 1)];
Y_test = [Y_test; repmat(i, 100, 1)];
end
fprintf('Training set: %d samples, %d features\n', size(X_train, 1), size(X_train, 2));

Training set: 6000 samples, 256 features

fprintf('Test set: %d samples, %d features\n', size(X_test, 1), size(X_test, 2));

Test set: 600 samples, 256 features

% Train SVM models


t = templateSVM('Standardize', true, 'KernelFunction', 'linear', 'BoxConstraint', 1);
Mdl_OVO = fitcecoc(X_train, Y_train, 'Coding', 'onevsone', 'Learners', t);
pred_OVO = predict(Mdl_OVO, X_test);
Mdl_OVA = fitcecoc(X_train, Y_train, 'Coding', 'onevsall', 'Learners', t);
pred_OVA = predict(Mdl_OVA, X_test);

% Compute confusion matrices


conf_OVO = confusionmat(Y_test, pred_OVO);
conf_OVA = confusionmat(Y_test, pred_OVA);
fprintf('One-vs-One Confusion Matrix:\n'); disp(conf_OVO);

One-vs-One Confusion Matrix:
97 0 0 0 3 0
1 97 0 1 0 1
0 2 97 0 0 1
0 0 1 99 0 0
2 1 0 0 96 1
0 0 0 0 6 94

fprintf('One-vs-All Confusion Matrix:\n'); disp(conf_OVA);

One-vs-All Confusion Matrix:


96 0 0 2 2 0
3 91 0 1 1 4
0 4 94 0 0 2
0 0 3 97 0 0
1 1 0 1 96 1
0 0 0 1 5 94

% Plot confusion matrices


figure('Position', [100, 100, 1000, 400], 'Visible', 'on');
subplot(1, 2, 1);
confusionchart(conf_OVO, label_names, 'Title', 'One-vs-One', ...
    'RowSummary', 'row-normalized', 'ColumnSummary', 'column-normalized');
subplot(1, 2, 2);
confusionchart(conf_OVA, label_names, 'Title', 'One-vs-All', ...
    'RowSummary', 'row-normalized', 'ColumnSummary', 'column-normalized');
sgtitle('Problem 2: Confusion Matrices');

drawnow; pause(0.1);

% Compute and display metrics


[p_OVO, r_OVO, s_OVO] = calculate_metrics_p2(conf_OVO);
[p_OVA, r_OVA, s_OVA] = calculate_metrics_p2(conf_OVA);
fprintf('\nOne-vs-One Metrics:\nPrecision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
    p_OVO, r_OVO, s_OVO);

One-vs-One Metrics:
Precision: 0.9672
Recall: 0.9667
Specificity: 0.9933

fprintf('One-vs-All Metrics:\nPrecision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
    p_OVA, r_OVA, s_OVA);

One-vs-All Metrics:
Precision: 0.9470
Recall: 0.9467
Specificity: 0.9893

% Conclusion
if p_OVO > p_OVA && r_OVO > r_OVA && s_OVO > s_OVA
fprintf('One-vs-One performs better.\n');
else
fprintf('One-vs-All performs better or metrics are comparable.\n');
end

One-vs-One performs better.

function [precision, recall, specificity] = calculate_metrics_p2(confMat)


C = size(confMat, 1);
precision = 0; recall = 0; specificity = 0;
for i = 1:C
TP = confMat(i,i); FN = sum(confMat(i,:)) - TP; FP = sum(confMat(:,i)) - TP;
TN = sum(confMat(:)) - sum(confMat(i,:)) - sum(confMat(:,i)) + TP;
precision = precision + TP / (TP + FP);
recall = recall + TP / (TP + FN);
specificity = specificity + TN / (TN + FP);
end
precision = precision / C; recall = recall / C; specificity = specificity / C;
end

% ## Problem 3: K-Means Clustering


clear; clc;
load('usps_all.mat');
labels = [7, 2, 6, 5, 9, 4];
X = [];
for i = 1:length(labels)
samples = double(data(:, :, labels(i))')./255;
X = [X; samples];
end
fprintf('Data set: %d samples, %d features\n', size(X, 1), size(X, 2));

Data set: 6600 samples, 256 features

K_values = [4, 5, 6, 8];


for K = K_values
    [~, centers] = kmeans(X, K, 'Distance', 'sqeuclidean', 'MaxIter', 1000, ...
        'Replicates', 5);
    figure('Position', [100, 100, 1000, 200*ceil(K/2)], 'Visible', 'on');
    sgtitle(sprintf('Problem 3: Cluster Centers (K=%d)', K));
    for k = 1:K
        subplot(ceil(K/2), 2, k);
        imshow(reshape(centers(k,:), 16, 16)', []);
        title(sprintf('Cluster %d', k));
    end
    drawnow; pause(0.1);
end

fprintf('Observation: K=6 aligns best with the number of classes, producing interpretable centers.\n');

Observation: K=6 aligns best with the number of classes, producing interpretable centers.

% ## Problem 4: Hierarchical Clustering


clear; clc;
load('usps_all.mat');
labels = [7, 2, 6, 5, 9, 4];
rng(42); % fix the seed so the random subset is reproducible
n_samples_per_digit = 100;
X_subset = [];
for i = 1:length(labels)
    samples = double(data(:, :, labels(i))')./255;
    idx = randsample(size(samples, 1), n_samples_per_digit); % sampled without replacement
    X_subset = [X_subset; samples(idx, :)];
end
fprintf('Using subset of %d samples\n', size(X_subset, 1));

Using subset of 600 samples

distances = {'euclidean', 'cityblock'};


merge_distances = cell(1, length(distances));
for d = 1:length(distances)
dist = distances{d};
D = pdist(X_subset, dist);
Z = linkage(D, 'average');
figure('Position', [100, 100, 1000, 600], 'Visible', 'on');
dendrogram(Z, 0, 'Labels', []);
title(sprintf('Problem 4: Dendrogram (%s Distance)', dist));
xlabel('Sample Index'); ylabel('Distance');
drawnow; pause(0.1);
merge_distances{d} = Z(:, 3); % Store merge distances for analysis
clusters = cluster(Z, 'maxclust', 6);
sil = mean(silhouette(X_subset, clusters, dist));
fprintf('Average silhouette (K=6, %s): %.4f\n', dist, sil);
end

Average silhouette (K=6, euclidean): 0.0653

Average silhouette (K=6, cityblock): 0.0796

% Analyze dendrograms for ideal clustering level


for d = 1:length(distances)
    dist = distances{d};
    fprintf('\nDendrogram Analysis (%s Distance):\n', dist);
    diffs = diff(merge_distances{d});
    [~, max_jump_idx] = max(diffs);
    % Number of clusters left after the last merge before the largest jump
    ideal_k = length(merge_distances{d}) - max_jump_idx + 1;
    fprintf(['Largest merge distance jump at step %d (distance: %.4f), ' ...
        'suggesting K=%d clusters.\n'], ...
        max_jump_idx, merge_distances{d}(max_jump_idx), ideal_k);
end

Dendrogram Analysis (euclidean Distance):


Largest merge distance jump at step 593 (distance: 7.9019), suggesting K=7 clusters.
Dendrogram Analysis (cityblock Distance):
Largest merge distance jump at step 597 (distance: 95.4902), suggesting K=3 clusters.

fprintf(['Observation: Cityblock distance yields better clusters (higher silhouette score); ' ...
    'K=6 is a reasonable choice based on the number of classes, though dendrograms may suggest other levels.\n']);

Observation: Cityblock distance yields better clusters (higher silhouette score); K=6 is a reasonable choice based o

% ## Problem 5: Linear Regression on Iris Dataset
clear; clc;
load fisheriris;
setosa_idx = strcmp(species, 'setosa');
X = meas(setosa_idx, 1); % Sepal length
Y = meas(setosa_idx, 2); % Sepal width
fprintf('Setosa samples: %d\n', length(X));

Setosa samples: 50

rng(42);
indices = randperm(length(X));
train_idx = indices(1:40); test_idx = indices(41:50);
X_train = X(train_idx); Y_train = Y(train_idx);
X_test = X(test_idx); Y_test = Y(test_idx);
fprintf('Training set: %d samples\n', length(X_train));

Training set: 40 samples

fprintf('Test set: %d samples\n', length(X_test));

Test set: 10 samples

k = 4;
cv = cvpartition(length(X_train), 'KFold', k);
cv_mse = zeros(k, 1);
for i = 1:k
train_cv_idx = cv.training(i); val_cv_idx = cv.test(i);
X_train_cv = X_train(train_cv_idx); Y_train_cv = Y_train(train_cv_idx);
X_val_cv = X_train(val_cv_idx); Y_val_cv = Y_train(val_cv_idx);
mdl = fitlm(X_train_cv, Y_train_cv);
Y_pred_cv = predict(mdl, X_val_cv);
cv_mse(i) = mean((Y_pred_cv - Y_val_cv).^2);
end
fprintf('Average 4-fold CV MSE: %.4f\n', mean(cv_mse));

Average 4-fold CV MSE: 0.0765

mdl_final = fitlm(X_train, Y_train);


Y_pred_test = predict(mdl_final, X_test);
test_rmse = sqrt(mean((Y_pred_test - Y_test).^2));
test_mae = mean(abs(Y_pred_test - Y_test));
ss_tot = sum((Y_test - mean(Y_test)).^2);
ss_res = sum((Y_test - Y_pred_test).^2);
test_r2 = 1 - ss_res / ss_tot;
fprintf('Test RMSE: %.4f\nMAE: %.4f\nR-squared: %.4f\n', test_rmse, test_mae, test_r2);

Test RMSE: 0.2686
MAE: 0.2239
R-squared: 0.3989

fprintf('Model: SepalWidth = %.4f + %.4f * SepalLength\n', ...
    mdl_final.Coefficients.Estimate(1), mdl_final.Coefficients.Estimate(2));

Model: SepalWidth = -0.2939 + 0.7482 * SepalLength

figure('Position', [100, 100, 1000, 600], 'Visible', 'on');


scatter(X_train, Y_train, 50, 'b', 'o', 'DisplayName', 'Training Data');
hold on;
scatter(X_test, Y_test, 50, 'r', 'x', 'DisplayName', 'Test Data');
X_range = linspace(min(X_train), max(X_train), 100)';
Y_range = predict(mdl_final, X_range);
plot(X_range, Y_range, 'g-', 'LineWidth', 2, 'DisplayName', 'Regression Line');
xlabel('Sepal Length (cm)'); ylabel('Sepal Width (cm)');
title('Problem 5: Linear Regression (Setosa)');
legend('show'); grid on;

drawnow; pause(0.1);
fprintf('Summary: Moderate accuracy; R-squared suggests a weak linear relationship.\n');

Summary: Moderate accuracy; R-squared suggests a weak linear relationship.

Data604 Project 2: Analysis of Handwritten Digits
and Iris Dataset

Sravani

April 2025

Abstract

This report presents my analysis for Data604 Project 2, where I worked on
classification, clustering, and regression tasks using the USPS handwritten digits
dataset and the Iris dataset. My assigned digits are 6, 1, 5, 4, 8, and 3 (USPS
indices 7, 2, 6, 5, 9, 4). I solved five problems: (1) binary SVM classification with
linear and Gaussian kernels on digits 6 and 1, (2) multi-class SVM classification
using one-vs-one and one-vs-all approaches on all six digits, (3) K-Means clustering
on the six digits, (4) hierarchical clustering on the six digits, and (5) linear regression
on the Iris dataset (Setosa class). For each problem, I explain my methodology,
share my results with figures, and discuss what I learned, aiming to show how well
the techniques worked and what the data tells us.

1 Introduction
In this project, I applied machine learning techniques to analyze the USPS handwritten
digits dataset and the Iris dataset for Data604 Project 2. My assigned digits are
6, 1, 5, 4, 8, and 3, which correspond to USPS indices 7, 2, 6, 5, 9, and 4. The tasks
include binary and multi-class classification using Support Vector Machines (SVM),
unsupervised clustering with K-Means and hierarchical methods, and linear regression. I
implemented everything in MATLAB, and my results are shown through confusion
matrices, cluster center images, dendrograms, and regression plots. My goal with this report
is to explain each problem clearly and to describe my findings in enough detail that the
methods and results are easy to follow.

2 Problem 1: Soft Margin SVM with Gaussian vs. Linear Kernel

2.1 Methodology
I trained a soft margin SVM on digits 6 and 1 (USPS indices 7 and 2) using both linear
and Gaussian (RBF) kernels. The training set has 2000 samples (1000 per digit), and the
test set has 200 samples (100 per digit), with each sample having 256 features normalized
to [0, 1]. I set the labels as +1 for digit 6 and −1 for digit 1. The SVM models were trained
with a box constraint of 1, and the Gaussian kernel used an automatically determined
kernel scale. I then computed confusion matrices and error metrics (accuracy, precision,
recall, specificity) to compare the two kernels.

2.2 Results
The confusion matrices for both kernels are shown in Figure 1. For the linear kernel:

\[
\begin{bmatrix} 98 & 2 \\ 0 & 100 \end{bmatrix}
\]

For the Gaussian kernel:

\[
\begin{bmatrix} 99 & 1 \\ 1 & 99 \end{bmatrix}
\]

I calculated the error metrics, which are summarized in Table 1.

Table 1: Problem 1: Error Metrics for Linear and Gaussian Kernels

Kernel     Accuracy   Precision   Recall   Specificity
Linear     0.9900     1.0000      0.9800   1.0000
Gaussian   0.9900     0.9900      0.9900   0.9900

Figure 1: Confusion matrices for Problem 1: Linear kernel (left) and Gaussian kernel
(right), showing row-normalized and column-normalized percentages.
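
As a cross-check, the values in Table 1 can be recomputed directly from the two confusion matrices. The sketch below is written in Python rather than MATLAB so that it runs without any toolbox; the function name `binary_metrics` is illustrative and not part of the project code, and it assumes, like `calculate_metrics_p1`, that rows are true classes, columns are predicted classes, and the first row holds the positive class.

```python
def binary_metrics(conf):
    """Return accuracy, precision, recall and specificity for a 2x2 matrix.

    conf[i][j] counts samples with true class i and predicted class j;
    index 0 is assumed to be the positive class.
    """
    tp, fn = conf[0]
    fp, tn = conf[1]
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Confusion matrices as reported in Section 2.2
linear = binary_metrics([[98, 2], [0, 100]])
gaussian = binary_metrics([[99, 1], [1, 99]])

for name, metrics in (("Linear", linear), ("Gaussian", gaussian)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Both kernels misclassify 2 of the 200 test samples, which is why the accuracy is the same even though the other three metrics differ.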

2.3 Observations
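The two kernels tie on accuracy (0.9900) and differ by a single test sample in each
direction. The Gaussian kernel has the higher recall (0.9900 vs. 0.9800), which means it
has 1 false negative compared to 2 for the linear kernel, but it also makes 1 false
positive, so its precision and specificity drop from 1.0000 to 0.9900. In other words, the
linear kernel makes all of its errors in one direction, while the Gaussian kernel gives a
more even performance across all metrics. The Gaussian kernel can capture non-linear
patterns in the data, which could matter for digits 6 and 1 because digit 6 has a loop
while digit 1 is mostly a straight stroke. At the same time, the near-perfect linear result
shows that the two digits are already almost linearly separable in pixel space. With only
200 test samples, a one-sample difference is too small to say that one kernel is clearly
better, so I lean towards the Gaussian kernel only for its more balanced errors, since
handwritten digits can vary a lot in shape. The MATLAB run log included with the code
reports a Gaussian confusion matrix identical to the linear one (98, 2; 0, 100). The
automatic kernel scale is chosen with a subsampling heuristic and Problem 1 sets no
random seed, so the Gaussian result can shift by a sample or two between runs.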

3 Problem 2: One-vs-One vs. One-vs-All Multi-Class SVM

3.1 Methodology
I trained multi-class SVMs on all six digits (6, 1, 5, 4, 8, 3) using one-vs-one (OVO)
and one-vs-all (OVA) approaches. The training set has 6000 samples (1000 per digit),
and the test set has 600 samples (100 per digit), with 256 features per sample. I used a
linear kernel with a box constraint of 1, and I standardized the data. Then, I computed
confusion matrices and global error metrics (precision, recall, specificity) to compare OVO
and OVA.

3.2 Results
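The confusion matrices are shown in Figure 2. Rows are true digits and columns are
predicted digits, both in the order 6, 1, 5, 4, 8, 3. For OVO:

\[
\begin{bmatrix}
97 & 0 & 0 & 0 & 3 & 0 \\
1 & 97 & 0 & 1 & 0 & 1 \\
0 & 2 & 97 & 0 & 0 & 1 \\
0 & 0 & 1 & 99 & 0 & 0 \\
2 & 1 & 0 & 0 & 96 & 1 \\
0 & 0 & 0 & 0 & 6 & 94
\end{bmatrix}
\]

For OVA:

\[
\begin{bmatrix}
96 & 0 & 0 & 2 & 2 & 0 \\
3 & 91 & 0 & 1 & 1 & 4 \\
0 & 4 & 94 & 0 & 0 & 2 \\
0 & 0 & 3 & 97 & 0 & 0 \\
1 & 1 & 0 & 1 & 96 & 1 \\
0 & 0 & 0 & 1 & 5 & 94
\end{bmatrix}
\]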

The global error metrics are shown in Table 2.

Table 2: Problem 2: Global Error Metrics for OVO and OVA Approaches

Approach   Precision   Recall   Specificity
OVO        0.9672      0.9667   0.9933
OVA        0.9470      0.9467   0.9893

Figure 2: Confusion matrices for Problem 2: One-vs-One (left) and One-vs-All (right)
SVMs, showing row-normalized and column-normalized percentages.
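
The global metrics in Table 2 are macro averages: each metric is computed per digit in a one-vs-rest fashion and then averaged over the six digits, as in `calculate_metrics_p2`. The Python sketch below repeats that calculation on the reported confusion matrices as a toolbox-free check; the name `macro_metrics` is illustrative and not part of the project code.

```python
def macro_metrics(conf):
    """Macro-averaged precision, recall and specificity.

    conf[i][j] counts samples with true class i and predicted class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    precision = recall = specificity = 0.0
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                        # rest of row i
        fp = sum(conf[r][i] for r in range(n)) - tp   # rest of column i
        tn = total - tp - fn - fp
        precision += tp / (tp + fp)
        recall += tp / (tp + fn)
        specificity += tn / (tn + fp)
    return precision / n, recall / n, specificity / n


# Matrices as reported in Section 3.2 (digit order 6, 1, 5, 4, 8, 3)
conf_ovo = [
    [97, 0, 0, 0, 3, 0],
    [1, 97, 0, 1, 0, 1],
    [0, 2, 97, 0, 0, 1],
    [0, 0, 1, 99, 0, 0],
    [2, 1, 0, 0, 96, 1],
    [0, 0, 0, 0, 6, 94],
]
conf_ova = [
    [96, 0, 0, 2, 2, 0],
    [3, 91, 0, 1, 1, 4],
    [0, 4, 94, 0, 0, 2],
    [0, 0, 3, 97, 0, 0],
    [1, 1, 0, 1, 96, 1],
    [0, 0, 0, 1, 5, 94],
]

ovo = macro_metrics(conf_ovo)
ova = macro_metrics(conf_ova)
print("OVO", [round(v, 4) for v in ovo])
print("OVA", [round(v, 4) for v in ova])
```

Because every digit has exactly 100 test samples, the macro recall equals the overall accuracy: 580/600 for OVO and 568/600 for OVA.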

3.3 Observations
From the results, I can see that the OVO approach does better than OVA across all
metrics. OVO has a precision of 0.9672 compared to OVA's 0.9470, a recall of 0.9667
compared to 0.9467, and a specificity of 0.9933 compared to 0.9893. That's an improvement
of 0.0202 in precision, 0.0200 in recall, and 0.0040 in specificity. Looking at the
confusion matrices, OVO has fewer mistakes overall (20 vs. 32 out of 600). For example,
for digit 1 (row 2), OVO has only 3 misclassifications, while OVA has 9. Similarly, for
digit 5 (row 3), OVA misclassifies 6 samples, while OVO only misclassifies 3. I think OVO
performs better because it trains more classifiers, $\binom{6}{2} = 15$ compared to OVA's 6,
and each of them only has to separate one pair of digits. For instance, digits 6 and
8 might look similar because of their loops, but OVO can handle that better by directly
comparing them. So, I conclude that OVO is the better choice for this multi-class task.

4 Problem 3: K-Means Clustering

4.1 Methodology
I applied K-Means clustering to all six digits (6600 samples, 256 features) with K =
4, 5, 6, and 8, using the squared Euclidean distance. I ran the algorithm with 5 replicates
and a maximum of 1000 iterations per replicate so that each run had room to converge, and
kept the best replicate. Then, I visualized the cluster centers as 16 × 16 images to
understand the clustering results.
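
To make the role of the replicates and the iteration cap concrete, here is a minimal Python sketch of Lloyd's K-Means iteration with restarts. It is not MATLAB's `kmeans`: it assumes plain random initialization from the data points (MATLAB's default initialization differs), uses the squared Euclidean distance, and keeps the replicate with the lowest total within-cluster cost. The function names and the eight toy 2-D points are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def kmeans_once(points, k, max_iter, rng):
    """One replicate: random start, then alternate assignment and update."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    labels = assign(points, centers)
    cost = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, cost


def kmeans(points, k, replicates=5, max_iter=1000, seed=0):
    """Run several replicates and keep the one with the lowest cost."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, max_iter, rng) for _ in range(replicates)]
    return min(runs, key=lambda run: run[2])


# Hypothetical toy data: two well-separated groups of four points each
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels, cost = kmeans(points, 2)
print(sorted(centers), cost)
```

On the digit data each point would be a 256-dimensional pixel vector, and each final center can be reshaped to 16 × 16 for display.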

4.2 Results
The cluster centers for each K are shown in Figures 3 to 6. For K = 4, the centers look
like combinations of digits: for example, Cluster 1 seems like a mix of digits 6 and 8,
and Cluster 2 looks like 1 and 4. For K = 5, the centers start to separate the digits
more, with Cluster 3 looking like digit 5. For K = 6, the centers match the six digits
well: Cluster 1 looks like digit 6, Cluster 2 like digit 1, Cluster 3 like digit 5, Cluster 4
like digit 4, Cluster 5 like digit 8, and Cluster 6 like digit 3. For K = 8, some digits get
split into variations: for example, Clusters 2 and 7 both look like digit 1 but in different
styles.

Figure 3: Cluster centers for Problem 3 with K = 4.

4.3 Observations
I found that K = 6 works best because it matches the number of digit classes, and the
cluster centers look very similar to the actual digits. For example, Cluster 1 for K = 6
clearly shows the loop of digit 6, and Cluster 2 shows the straight line of digit 1. When
K is less than 6, like K = 4, the clusters combine digits that look similar—digits 6 and 8
both have loops, and digits 1 and 4 are both straight, which makes sense visually. When
K is more than 6, like K = 8, the clusters start splitting digits into different styles, like
two versions of digit 1, which might represent different ways people write the digit. I
think K = 6 captures the natural structure of the data the best, as it avoids combining
different digits or splitting the same digit too much. To improve this in the future, I
could try different distance metrics, like cityblock, to see if that changes how the clusters
form.

5 Problem 4: Hierarchical Clustering

5.1 Methodology
I applied hierarchical clustering to a subset of 600 samples (100 per digit) using Euclidean
and cityblock (L1) distances, with the average linkage method. I created dendrograms to
find the best number of clusters and computed silhouette scores for K = 6 to check the
clustering quality.

Figure 4: Cluster centers for Problem 3 with K = 5.

5.2 Results
The dendrograms are shown in Figures 7 and 8.

• Euclidean Distance: The largest merge distance jump is at step 593 (distance:
7.9019), suggesting K = 7.
• Cityblock Distance: The largest merge distance jump is at step 597 (distance:
95.4902), suggesting K = 3.
• Silhouette Scores (for K = 6): Euclidean = 0.0653, Cityblock = 0.0796.
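
The suggested K values above come from a largest-gap rule: if the biggest increase in merge distance happens right after merge step j, cutting the tree there leaves n − j clusters, where n is the number of samples. The Python sketch below illustrates that rule under the assumption that the merge heights are listed in merge order; the function name and the toy heights are hypothetical.

```python
def clusters_from_largest_gap(merge_heights):
    """Return (step, k) for the largest jump between consecutive merges.

    merge_heights holds the n-1 linkage distances in merge order.
    step is the 1-based index of the last merge before the jump, and
    k = n - step is the number of clusters that remain at that point.
    """
    n_samples = len(merge_heights) + 1
    gaps = [b - a for a, b in zip(merge_heights, merge_heights[1:])]
    step = max(range(len(gaps)), key=gaps.__getitem__) + 1
    return step, n_samples - step


# Hypothetical heights for 6 samples: the big jump follows the 4th merge
step, k = clusters_from_largest_gap([0.5, 0.6, 0.7, 0.9, 3.0])
print(step, k)

# The same rule applied to the reported steps with n = 600 samples
print(600 - 593, 600 - 597)
```

With 600 samples, the reported steps 593 and 597 give K = 7 and K = 3, which matches the dendrogram analysis.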

5.3 Observations
The cityblock distance gives a slightly higher silhouette score (0.0796) than the
Euclidean distance (0.0653), so by that measure it clusters this subset a little better.
Both scores are close to zero, though, which means the six clusters overlap heavily. I
think this is because the data has so many features (256), making it hard to cluster
well in high dimensions. The dendrograms suggest different numbers of clusters (K = 7
for Euclidean and K = 3 for cityblock), but since we have six digit classes, I think K = 6
is a good choice to match the actual number of digits. One possible reason cityblock does
better is that it's less sensitive to outliers in the pixel values. For example, if a pixel is
much brighter or darker in one image, the Euclidean distance squares that difference, making
it weigh more, while cityblock just takes the absolute difference, which might be more
robust for this data. In the future, I could try reducing the dimensions of the data, maybe
with PCA, to see if that improves the clustering.
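
The outlier argument can be illustrated with a small Python sketch. The three 4-pixel "images" below are made up for the example: one differs from the base image by a small amount in every pixel, the other by the same total amount concentrated in a single pixel.

```python
import math


def cityblock(a, b):
    """L1 distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical pixel intensities in [0, 1]
base = [0.2, 0.2, 0.2, 0.2]
spread = [0.3, 0.3, 0.3, 0.3]    # 0.1 difference in every pixel
outlier = [0.6, 0.2, 0.2, 0.2]   # 0.4 difference in one pixel

print("cityblock:", cityblock(base, spread), cityblock(base, outlier))
print("euclidean:", euclidean(base, spread), euclidean(base, outlier))
```

Cityblock treats the two images as equally far from the base (about 0.4 each), while Euclidean puts the single-pixel outlier about twice as far away (about 0.4 vs. 0.2).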

Figure 5: Cluster centers for Problem 3 with K = 6.

6 Problem 5: Linear Regression on Iris Dataset

6.1 Methodology
I performed linear regression on the Iris dataset (Setosa class only, 50 samples) to predict
sepal width from sepal length. I used a 40/10 train/test split and did 4-fold cross-
validation on the training set to validate my model. I measured the test set performance
with RMSE, MAE, and R-squared.

6.2 Results
The average 4-fold cross-validation MSE is 0.0765. The final model is:

SepalWidth = −0.2939 + 0.7482 × SepalLength

The test set metrics are: RMSE = 0.2686, MAE = 0.2239, R-squared = 0.3989. The
regression plot is shown in Figure 9, with training data as blue circles, test data as red
crosses, and the regression line in green.
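
For reference, the three test metrics are defined as in the MATLAB script: RMSE is the square root of the mean squared residual, MAE is the mean absolute residual, and R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The Python sketch below shows a closed-form simple least-squares fit and these metrics on made-up toy values; the function names are illustrative, and only the final line uses numbers from this report.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R-squared) for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return rmse, mae, 1 - ss_res / ss_tot


# Toy values (not Iris data): points that lie exactly on y = 1 + 2x
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
rmse, mae, r2 = regression_metrics([1, 2, 3], [1, 2, 4])
print(intercept, slope, round(rmse, 4), round(mae, 4), r2)

# Reported model evaluated at a sepal length of 5.0 cm
print(round(-0.2939 + 0.7482 * 5.0, 4))
```

The last line evaluates the fitted model from this section and gives a predicted sepal width of about 3.45 cm for a 5.0 cm sepal length.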

Figure 6: Cluster centers for Problem 3 with K = 8.

6.3 Observations
The model has okay accuracy, with an RMSE of 0.2686 and MAE of 0.2239, which means
the predictions are typically within about 0.2 to 0.3 cm of the actual sepal width. The
test MSE (0.2686² ≈ 0.0721) is close to the cross-validation MSE of 0.0765, so the test
error is in line with what cross-validation predicted. But the R-squared value of
0.3989 is pretty low, showing that there's a weak linear relationship between sepal length
and sepal width for the Setosa class. This makes sense because R-squared tells us how
much of the variation in sepal width is explained by sepal length, and 0.3989 means less
than 40% is explained, so there's a lot of variation that the model doesn't capture. With
only 10 test samples, this estimate is also quite noisy. I think sepal width might depend
on other features, like petal length or width, or maybe the relationship isn't linear at
all. For example, if sepal width changes in a curved pattern with sepal length, a linear
model wouldn't fit well. In the future, I could try adding more features to the model or
using a non-linear regression method, like polynomial regression, to see if that improves
the R-squared value.

7 Conclusion
This project let me apply machine learning techniques to the USPS handwritten digits
and Iris datasets, and I learned a lot from it. My key findings are: (1) for binary SVM
classification of digits 6 and 1, the Gaussian and linear kernels reach the same accuracy
(0.9900), with the Gaussian kernel giving a more balanced error profile because it can
handle non-linear boundaries; (2) the one-vs-one SVM approach is better for multi-class
classification, with higher precision, recall, and specificity, because it focuses on pairs
of digits; (3) K-Means clustering with K = 6 matches the digit classes best, showing the
natural structure of the data; (4) hierarchical clustering with cityblock distance gives
slightly better clusters than Euclidean, and K = 6 is a practical choice for this data; and
(5) linear regression on the Iris Setosa class shows a weak linear relationship, so I might
need more features or a non-linear model to predict sepal width better. For future work,
I'd like to try tuning the Gaussian kernel's parameters to improve the SVM even more, use
dimension reduction like PCA for clustering to make it easier, and explore non-linear
regression models for the Iris dataset to get a better fit.

Figure 7: Dendrogram for Problem 4 using Euclidean distance.

Figure 8: Dendrogram for Problem 4 using cityblock distance.

Figure 9: Linear regression plot for Problem 5 (Setosa): Training data (blue circles), test
data (red crosses), and regression line (green).
