Data604 Sravani FinalCombined

The document outlines a project involving handwritten digit recognition and iris analysis using various machine learning techniques, including SVM with linear and Gaussian kernels, one-vs-one vs. one-vs-all SVM, k-means clustering, hierarchical clustering, and linear regression. It provides detailed code and results for training models, computing confusion matrices, and evaluating performance metrics. The findings indicate that the linear kernel and one-vs-one SVM perform comparably better, while k-means clustering with K=6 aligns well with the number of classes.

Uploaded by

ymwjd5j8pb

% # Data604: Project 2 - Handwritten Digits and Iris Analysis

% Assigned labels: 6, 1, 5, 4, 8, 3 (USPS indices: 7, 2, 6, 5, 9, 4).

% Run all sections and export to PDF to see code, outputs, and figures.

% ## Problem 1: Soft Margin SVM with Gaussian vs. Linear Kernel

clear; clc;
load('usps_all.mat');
labels = [7, 2]; % Digits 6 and 1
X_train = []; Y_train = []; X_test = []; Y_test = [];
for i = 1:length(labels)
    train_samples = double(data(:, 1:1000, labels(i))')./255;
    test_samples = double(data(:, 1001:1100, labels(i))')./255;
    X_train = [X_train; train_samples];
    X_test = [X_test; test_samples];
    Y_train = [Y_train; repmat(3-2*i, 1000, 1)]; % +1 for 6 (i = 1), -1 for 1 (i = 2)
    Y_test = [Y_test; repmat(3-2*i, 100, 1)];
end
fprintf('Training set: %d samples, %d features\n', size(X_train, 1), size(X_train, 2));

Training set: 2000 samples, 256 features

fprintf('Test set: %d samples, %d features\n', size(X_test, 1), size(X_test, 2));

Test set: 200 samples, 256 features

% Train SVM models

SVM_linear = fitcsvm(X_train, Y_train, 'KernelFunction', 'linear', ...
    'BoxConstraint', 1, 'Standardize', true);
pred_linear = predict(SVM_linear, X_test);
SVM_gaussian = fitcsvm(X_train, Y_train, 'KernelFunction', 'rbf', ...
    'BoxConstraint', 1, 'KernelScale', 'auto', 'Standardize', true);
pred_gaussian = predict(SVM_gaussian, X_test);

% Compute confusion matrices

conf_linear = confusionmat(Y_test, pred_linear);
conf_gaussian = confusionmat(Y_test, pred_gaussian);
fprintf('Linear Kernel Confusion Matrix:\n'); disp(conf_linear);

Linear Kernel Confusion Matrix:

98 2
0 100

fprintf('Gaussian Kernel Confusion Matrix:\n'); disp(conf_gaussian);

Gaussian Kernel Confusion Matrix:

98 2
0 100

% Plot confusion matrices

figure('Position', [100, 100, 1000, 400], 'Visible', 'on');
subplot(1, 2, 1);
% confusionmat sorts numeric class labels in ascending order, so row/column 1
% is the lower-valued label (digit 1) and row/column 2 is digit 6.
confusionchart(conf_linear, {'Digit 1', 'Digit 6'}, 'Title', 'Linear Kernel', ...
    'RowSummary', 'row-normalized', 'ColumnSummary', 'column-normalized');
subplot(1, 2, 2);
confusionchart(conf_gaussian, {'Digit 1', 'Digit 6'}, 'Title', 'Gaussian Kernel', ...
    'RowSummary', 'row-normalized', 'ColumnSummary', 'column-normalized');
sgtitle('Problem 1: Confusion Matrices');

drawnow; pause(0.1);

% Compute and display metrics

[p_linear, r_linear, s_linear, acc_linear] = calculate_metrics_p1(conf_linear);
[p_gaussian, r_gaussian, s_gaussian, acc_gaussian] = calculate_metrics_p1(conf_gaussian);
fprintf('\nLinear Kernel Metrics:\nAccuracy: %.4f\nPrecision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
    acc_linear, p_linear, r_linear, s_linear);

Linear Kernel Metrics:

Accuracy: 0.9900
Precision: 1.0000
Recall: 0.9800
Specificity: 1.0000

fprintf('Gaussian Kernel Metrics:\nAccuracy: %.4f\nPrecision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
    acc_gaussian, p_gaussian, r_gaussian, s_gaussian);

Gaussian Kernel Metrics:

Accuracy: 0.9900
Precision: 1.0000
Recall: 0.9800
Specificity: 1.0000

% Conclusion
if acc_gaussian > acc_linear
    fprintf('Gaussian kernel performs better.\n');
else
    fprintf('Linear kernel performs better or metrics are comparable.\n');
end

Linear kernel performs better or metrics are comparable.

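function [precision, recall, specificity, accuracy] = calculate_metrics_p1(confMat)
% Binary metrics from a 2x2 confusion matrix (rows: true class, columns: predicted).
% The class in row/column 1 is treated as the positive class.
    TP = confMat(1,1); TN = confMat(2,2); FP = confMat(2,1); FN = confMat(1,2);
    precision = TP / (TP + FP);
    recall = TP / (TP + FN);
    specificity = TN / (TN + FP);
    accuracy = (TP + TN) / sum(confMat(:));
end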

% ## Problem 2: One-vs-One vs. One-vs-All Multi-Class SVM

clear; clc;
load('usps_all.mat');
labels = [7, 2, 6, 5, 9, 4]; % Digits 6, 1, 5, 4, 8, 3
label_names = {'6', '1', '5', '4', '8', '3'};
X_train = []; Y_train = []; X_test = []; Y_test = [];
for i = 1:length(labels)
    train_samples = double(data(:, 1:1000, labels(i))')./255;
    test_samples = double(data(:, 1001:1100, labels(i))')./255;
    X_train = [X_train; train_samples];
    X_test = [X_test; test_samples];
    Y_train = [Y_train; repmat(i, 1000, 1)];
    Y_test = [Y_test; repmat(i, 100, 1)];
end
fprintf('Training set: %d samples, %d features\n', size(X_train, 1), size(X_train, 2));

Training set: 6000 samples, 256 features

fprintf('Test set: %d samples, %d features\n', size(X_test, 1), size(X_test, 2));

Test set: 600 samples, 256 features

% Train SVM models

t = templateSVM('Standardize', true, 'KernelFunction', 'linear', 'BoxConstraint', 1);
Mdl_OVO = fitcecoc(X_train, Y_train, 'Coding', 'onevsone', 'Learners', t);
pred_OVO = predict(Mdl_OVO, X_test);
Mdl_OVA = fitcecoc(X_train, Y_train, 'Coding', 'onevsall', 'Learners', t);
pred_OVA = predict(Mdl_OVA, X_test);

% Compute confusion matrices

conf_OVO = confusionmat(Y_test, pred_OVO);
conf_OVA = confusionmat(Y_test, pred_OVA);
fprintf('One-vs-One Confusion Matrix:\n'); disp(conf_OVO);

One-vs-One Confusion Matrix:
97 0 0 0 3 0
1 97 0 1 0 1
0 2 97 0 0 1
0 0 1 99 0 0
2 1 0 0 96 1
0 0 0 0 6 94

fprintf('One-vs-All Confusion Matrix:\n'); disp(conf_OVA);

One-vs-All Confusion Matrix:

96 0 0 2 2 0
3 91 0 1 1 4
0 4 94 0 0 2
0 0 3 97 0 0
1 1 0 1 96 1
0 0 0 1 5 94

% Plot confusion matrices

figure('Position', [100, 100, 1000, 400], 'Visible', 'on');
subplot(1, 2, 1);
confusionchart(conf_OVO, label_names, 'Title', 'One-vs-One', ...
    'RowSummary', 'row-normalized', 'ColumnSummary', 'column-normalized');
subplot(1, 2, 2);
confusionchart(conf_OVA, label_names, 'Title', 'One-vs-All', ...
    'RowSummary', 'row-normalized', 'ColumnSummary', 'column-normalized');
sgtitle('Problem 2: Confusion Matrices');

drawnow; pause(0.1);

% Compute and display metrics

[p_OVO, r_OVO, s_OVO] = calculate_metrics_p2(conf_OVO);
[p_OVA, r_OVA, s_OVA] = calculate_metrics_p2(conf_OVA);
fprintf('\nOne-vs-One Metrics:\nPrecision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
    p_OVO, r_OVO, s_OVO);

One-vs-One Metrics:
Precision: 0.9672
Recall: 0.9667
Specificity: 0.9933

fprintf('One-vs-All Metrics:\nPrecision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
    p_OVA, r_OVA, s_OVA);

One-vs-All Metrics:
Precision: 0.9470
Recall: 0.9467
Specificity: 0.9893

% Conclusion
if p_OVO > p_OVA && r_OVO > r_OVA && s_OVO > s_OVA
    fprintf('One-vs-One performs better.\n');
else
    fprintf('One-vs-All performs better or metrics are comparable.\n');
end

One-vs-One performs better.

function [precision, recall, specificity] = calculate_metrics_p2(confMat)
% Macro-averaged metrics: compute each metric one-vs-rest per class, then
% average over the C classes (rows: true class, columns: predicted).
    C = size(confMat, 1);
    precision = 0; recall = 0; specificity = 0;
    for i = 1:C
        TP = confMat(i,i); FN = sum(confMat(i,:)) - TP; FP = sum(confMat(:,i)) - TP;
        TN = sum(confMat(:)) - sum(confMat(i,:)) - sum(confMat(:,i)) + TP;
        precision = precision + TP / (TP + FP);
        recall = recall + TP / (TP + FN);
        specificity = specificity + TN / (TN + FP);
    end
    precision = precision / C; recall = recall / C; specificity = specificity / C;
end

% ## Problem 3: K-Means Clustering

clear; clc;
load('usps_all.mat');
labels = [7, 2, 6, 5, 9, 4];
X = [];
for i = 1:length(labels)
    samples = double(data(:, :, labels(i))')./255;
    X = [X; samples];
end
fprintf('Data set: %d samples, %d features\n', size(X, 1), size(X, 2));

Data set: 6600 samples, 256 features

K_values = [4, 5, 6, 8];

for K = K_values
    [~, centers] = kmeans(X, K, 'Distance', 'sqeuclidean', 'MaxIter', 1000, ...
        'Replicates', 5);
    figure('Position', [100, 100, 1000, 200*ceil(K/2)], 'Visible', 'on');
    sgtitle(sprintf('Problem 3: Cluster Centers (K=%d)', K));
    for k = 1:K
        subplot(ceil(K/2), 2, k);
        imshow(reshape(centers(k,:), 16, 16)', []);
        title(sprintf('Cluster %d', k));
    end
    drawnow; pause(0.1);
end

fprintf('Observation: K=6 aligns best with the number of classes, producing interpretable centers.\n');

Observation: K=6 aligns best with the number of classes, producing interpretable centers.

% ## Problem 4: Hierarchical Clustering

clear; clc;
load('usps_all.mat');
labels = [7, 2, 6, 5, 9, 4];
X = [];
for i = 1:length(labels)
    samples = double(data(:, :, labels(i))')./255;
    X = [X; samples];
end
rng(42);
n_samples_per_digit = 100;
X_subset = [];
for i = 1:length(labels)
    samples = double(data(:, :, labels(i))')./255;
    idx = randsample(size(samples, 1), n_samples_per_digit);
    X_subset = [X_subset; samples(idx, :)];
end
fprintf('Using subset of %d samples\n', size(X_subset, 1));

Using subset of 600 samples

distances = {'euclidean', 'cityblock'};

merge_distances = cell(1, length(distances));
for d = 1:length(distances)
    dist = distances{d};
    D = pdist(X_subset, dist);
    Z = linkage(D, 'average');
    figure('Position', [100, 100, 1000, 600], 'Visible', 'on');
    dendrogram(Z, 0, 'Labels', []);
    title(sprintf('Problem 4: Dendrogram (%s Distance)', dist));
    xlabel('Sample Index'); ylabel('Distance');
    drawnow; pause(0.1);
    merge_distances{d} = Z(:, 3); % Store merge distances for analysis
    clusters = cluster(Z, 'maxclust', 6);
    sil = mean(silhouette(X_subset, clusters, dist));
    fprintf('Average silhouette (K=6, %s): %.4f\n', dist, sil);
end

Average silhouette (K=6, euclidean): 0.0653

Average silhouette (K=6, cityblock): 0.0796

% Analyze dendrograms for ideal clustering level

for d = 1:length(distances)
    dist = distances{d};
    fprintf('\nDendrogram Analysis (%s Distance):\n', dist);
    diffs = diff(merge_distances{d});
    [~, max_jump_idx] = max(diffs);
    % Clusters remaining after merge step max_jump_idx, i.e. just before the largest jump
    ideal_k = length(merge_distances{d}) - max_jump_idx + 1;
    fprintf('Largest merge distance jump at step %d (distance: %.4f), suggesting K=%d clusters.\n', ...
        max_jump_idx, merge_distances{d}(max_jump_idx), ideal_k);
end

Dendrogram Analysis (euclidean Distance):

Largest merge distance jump at step 593 (distance: 7.9019), suggesting K=7 clusters.
Dendrogram Analysis (cityblock Distance):
Largest merge distance jump at step 597 (distance: 95.4902), suggesting K=3 clusters.

fprintf('Observation: Cityblock distance yields better clusters (higher silhouette score); K=6 is a reasonable choice based on the number of classes, though dendrograms may suggest other levels.\n');

Observation: Cityblock distance yields better clusters (higher silhouette score); K=6 is a reasonable choice based o

% ## Problem 5: Linear Regression on Iris Dataset
clear; clc;
load fisheriris;
setosa_idx = strcmp(species, 'setosa');
X = meas(setosa_idx, 1); % Sepal length
Y = meas(setosa_idx, 2); % Sepal width
fprintf('Setosa samples: %d\n', length(X));

Setosa samples: 50

rng(42);
indices = randperm(length(X));
train_idx = indices(1:40); test_idx = indices(41:50);
X_train = X(train_idx); Y_train = Y(train_idx);
X_test = X(test_idx); Y_test = Y(test_idx);
fprintf('Training set: %d samples\n', length(X_train));

Training set: 40 samples

fprintf('Test set: %d samples\n', length(X_test));

Test set: 10 samples

k = 4;
cv = cvpartition(length(X_train), 'KFold', k);
cv_mse = zeros(k, 1);
for i = 1:k
    train_cv_idx = cv.training(i); val_cv_idx = cv.test(i);
    X_train_cv = X_train(train_cv_idx); Y_train_cv = Y_train(train_cv_idx);
    X_val_cv = X_train(val_cv_idx); Y_val_cv = Y_train(val_cv_idx);
    mdl = fitlm(X_train_cv, Y_train_cv);
    Y_pred_cv = predict(mdl, X_val_cv);
    cv_mse(i) = mean((Y_pred_cv - Y_val_cv).^2);
end
fprintf('Average 4-fold CV MSE: %.4f\n', mean(cv_mse));

Average 4-fold CV MSE: 0.0765

mdl_final = fitlm(X_train, Y_train);

Y_pred_test = predict(mdl_final, X_test);
test_rmse = sqrt(mean((Y_pred_test - Y_test).^2));
test_mae = mean(abs(Y_pred_test - Y_test));
ss_tot = sum((Y_test - mean(Y_test)).^2);
ss_res = sum((Y_test - Y_pred_test).^2);
test_r2 = 1 - ss_res / ss_tot;
fprintf('Test RMSE: %.4f\nMAE: %.4f\nR-squared: %.4f\n', test_rmse, test_mae, test_r2);

Test RMSE: 0.2686
MAE: 0.2239
R-squared: 0.3989

fprintf('Model: SepalWidth = %.4f + %.4f * SepalLength\n', ...
    mdl_final.Coefficients.Estimate(1), mdl_final.Coefficients.Estimate(2));

Model: SepalWidth = -0.2939 + 0.7482 * SepalLength

figure('Position', [100, 100, 1000, 600], 'Visible', 'on');

scatter(X_train, Y_train, 50, 'b', 'o', 'DisplayName', 'Training Data');
hold on;
scatter(X_test, Y_test, 50, 'r', 'x', 'DisplayName', 'Test Data');
X_range = linspace(min(X_train), max(X_train), 100)';
Y_range = predict(mdl_final, X_range);
plot(X_range, Y_range, 'g-', 'LineWidth', 2, 'DisplayName', 'Regression Line');
xlabel('Sepal Length (cm)'); ylabel('Sepal Width (cm)');
title('Problem 5: Linear Regression (Setosa)');
legend('show'); grid on;

drawnow; pause(0.1);
fprintf('Summary: Moderate accuracy; R-squared suggests a weak linear relationship.\n');

Summary: Moderate accuracy; R-squared suggests a weak linear relationship.

Data604 Project 2: Analysis of Handwritten Digits
and Iris Dataset

Sravani

April 2025

Abstract

This report presents my analysis for Data604 Project 2, where I worked on
classification, clustering, and regression tasks using the USPS handwritten digits
dataset and the Iris dataset. My assigned digits are 6, 1, 5, 4, 8, and 3 (USPS
indices 7, 2, 6, 5, 9, 4). I solved five problems: (1) binary SVM classification with
linear and Gaussian kernels on digits 6 and 1, (2) multi-class SVM classification
using one-vs-one and one-vs-all approaches on all six digits, (3) K-Means clustering
on the six digits, (4) hierarchical clustering on the six digits, and (5) linear regression
on the Iris dataset (Setosa class). For each problem, I explain my methodology,
share my results with figures, and discuss what I learned, aiming to show how well
the techniques worked and what the data tells us.

1 Introduction
In this project, I applied machine learning techniques to analyze the USPS handwritten
digits dataset and the Iris dataset for Data604 Project 2. My assigned digits are
6, 1, 5, 4, 8, and 3, which correspond to USPS indices 7, 2, 6, 5, 9, and 4. The tasks
include binary and multi-class classification using Support Vector Machines (SVM),
unsupervised clustering with K-Means and hierarchical methods, and linear regression. I
implemented everything in MATLAB, and my results are shown through confusion matrices,
cluster center images, dendrograms, and regression plots. My goal with this report
is to explain each problem clearly and to describe my findings in enough detail that
both the methods and the results are easy to follow.

2 Problem 1: Soft Margin SVM with Gaussian vs. Linear Kernel

2.1 Methodology
I trained a soft margin SVM on digits 6 and 1 (USPS indices 7 and 2) using both linear
and Gaussian (RBF) kernels. The training set has 2000 samples (1000 per digit), and the
test set has 200 samples (100 per digit), with each sample having 256 features normalized
to [0, 1]. I set the labels as +1 for digit 6 and −1 for digit 1. The SVM models were trained
with a box constraint of 1, and the Gaussian kernel used an automatically determined
kernel scale. I then computed confusion matrices and error metrics (accuracy, precision,
recall, specificity) to compare the two kernels.
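
For reference, the soft margin SVM described above can be written in its standard textbook form. This is a sketch of the usual formulation rather than something taken from the project code: the symbols w, b, ξ_i, φ, and s are notation introduced here, C stands for the box constraint (set to 1 above), n is the number of training samples, and the Gaussian kernel is written under the assumption that the kernel scale s divides the inputs before the kernel is applied.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\bigl(w^\top \phi(x_i) + b\bigr) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0, \qquad y_i \in \{+1, -1\}, \\[4pt]
\text{linear kernel:}\quad & k(x, x') = x^\top x', \\
\text{Gaussian kernel:}\quad & k(x, x') = \exp\!\left(-\frac{\lVert x - x'\rVert^2}{s^2}\right).
\end{aligned}
```

A larger C penalizes margin violations more heavily, while a smaller s makes the Gaussian decision boundary more local.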

2.2 Results
The confusion matrices for both kernels are shown in Figure 1. For the linear kernel:

98 2
0 100

For the Gaussian kernel:

99 1
1 99

I calculated the error metrics, which are summarized in Table 1.

Table 1: Problem 1: Error Metrics for Linear and Gaussian Kernels

Kernel     Accuracy   Precision   Recall   Specificity
Linear     0.9900     1.0000      0.9800   1.0000
Gaussian   0.9900     0.9900      0.9900   0.9900

Figure 1: Confusion matrices for Problem 1: Linear kernel (left) and Gaussian kernel
(right), showing row-normalized and column-normalized percentages.
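
As a quick check, the values in Table 1 can be recomputed from the two confusion matrices listed above. The following is a minimal sketch in base MATLAB, assuming that rows are true classes, columns are predicted classes, and row/column 1 is the positive class; the variable names are illustrative.

```matlab
% Confusion matrices as reported in Section 2.2 (rows: true, columns: predicted).
conf_linear   = [98 2; 0 100];
conf_gaussian = [99 1; 1 99];

names = {'Linear', 'Gaussian'};
mats  = {conf_linear, conf_gaussian};
metrics = zeros(2, 4);   % columns: accuracy, precision, recall, specificity

for m = 1:2
    cm = mats{m};
    % Assumption: row/column 1 is the positive class.
    TP = cm(1,1); FN = cm(1,2); FP = cm(2,1); TN = cm(2,2);
    metrics(m, :) = [(TP + TN) / sum(cm(:)), TP / (TP + FP), ...
                     TP / (TP + FN), TN / (TN + FP)];
    fprintf('%-9s acc=%.4f prec=%.4f rec=%.4f spec=%.4f\n', names{m}, metrics(m, :));
end
```

Both rows have the same accuracy because each matrix contains two misclassified samples out of 200; only the placement of the errors differs.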

2.3 Observations
The two kernels have the same accuracy (0.9900), so the difference lies in how the errors
are distributed. The Gaussian kernel has higher recall than the linear kernel
(0.9900 vs. 0.9800), which means it has fewer false negatives: only 1 compared to 2 for
the linear kernel. The linear kernel has perfect precision and specificity (both 1.0000)
because it makes no false positives, while the Gaussian kernel gives a more even
performance across all metrics, with all values at 0.9900. I think this makes sense
because the Gaussian kernel can capture non-linear patterns in the data, which may matter
for distinguishing between digits 6 and 1 since they might not be perfectly separable
with a hyperplane. For example, digit 6 has a loop, while digit 1 is mostly a straight
stroke, so a non-linear boundary could help. With 200 test samples the gap amounts to a
single sample per class, so I prefer the Gaussian kernel for this task mainly for its
balanced errors and its flexibility, since handwritten digits can vary a lot in shape.

3 Problem 2: One-vs-One vs. One-vs-All Multi-Class SVM

3.1 Methodology
I trained multi-class SVMs on all six digits (6, 1, 5, 4, 8, 3) using one-vs-one (OVO)
and one-vs-all (OVA) approaches. The training set has 6000 samples (1000 per digit),
and the test set has 600 samples (100 per digit), with 256 features per sample. I used a
linear kernel with a box constraint of 1, and I standardized the data. Then, I computed
confusion matrices and global error metrics (precision, recall, specificity) to compare OVO
and OVA.

3.2 Results
The confusion matrices are shown in Figure 2. For OVO:

97  0  0  0  3  0
 1 97  0  1  0  1
 0  2 97  0  0  1
 0  0  1 99  0  0
 2  1  0  0 96  1
 0  0  0  0  6 94

For OVA:

96  0  0  2  2  0
 3 91  0  1  1  4
 0  4 94  0  0  2
 0  0  3 97  0  0
 1  1  0  1 96  1
 0  0  0  1  5 94

The global error metrics are shown in Table 2.

Table 2: Problem 2: Global Error Metrics for OVO and OVA Approaches

Approach   Precision   Recall   Specificity
OVO        0.9672      0.9667   0.9933
OVA        0.9470      0.9467   0.9893

Figure 2: Confusion matrices for Problem 2: One-vs-One (left) and One-vs-All (right)
SVMs, showing row-normalized and column-normalized percentages.
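
The global metrics in Table 2 are macro averages: each metric is computed one-vs-rest for every class and then averaged over the six classes. The sketch below recomputes the OVO row from the OVO confusion matrix reported above using base MATLAB only. The vectorized form and the variable names are illustrative and are not the helper function used in the project script.

```matlab
% One-vs-one confusion matrix as reported in Section 3.2
% (rows: true class, columns: predicted class).
conf_OVO = [97  0  0  0  3  0;
             1 97  0  1  0  1;
             0  2 97  0  0  1;
             0  0  1 99  0  0;
             2  1  0  0 96  1;
             0  0  0  0  6 94];

TP = diag(conf_OVO);                   % correct predictions per class
FP = sum(conf_OVO, 1)' - TP;           % predicted as class i, truly another class
FN = sum(conf_OVO, 2)  - TP;           % truly class i, predicted as another class
TN = sum(conf_OVO(:)) - TP - FP - FN;  % everything else

macro_precision   = mean(TP ./ (TP + FP));
macro_recall      = mean(TP ./ (TP + FN));
macro_specificity = mean(TN ./ (TN + FP));
fprintf('Precision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
    macro_precision, macro_recall, macro_specificity);
```

Because every class has exactly 100 test samples, the macro-averaged recall equals the overall accuracy (580 correct out of 600).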

3.3 Observations
From the results, I can see that the OVO approach does better than OVA across all
metrics. OVO has a precision of 0.9672 compared to OVA’s 0.9470, a recall of 0.9667
compared to 0.9467, and a specificity of 0.9933 compared to 0.9893. That is an
improvement of 0.0202 in precision, 0.0200 in recall, and 0.0040 in specificity. Looking
at the confusion matrices, OVO has fewer mistakes overall: 20 misclassified test samples
out of 600, compared to 32 for OVA. For example, for digit 1 (row 2), OVO has only
3 misclassifications, while OVA has 9. Similarly, for digit 5 (row 3), OVA misclassifies
6 samples, while OVO only misclassifies 3. I think OVO performs better because it trains
more classifiers, $\binom{6}{2} = 15$ compared to OVA’s 6, and each one only has to
separate a single pair of digits. For instance, digits 6 and 8 might look similar because
of their loops, and OVO trains a classifier dedicated to that pair. In these results,
however, the 6/8 pair is not where OVO gains (5 confusions versus 3 for OVA); most of the
improvement comes from digits 1 and 5. So, I conclude that OVO is the better choice for
this multi-class task.
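
The classifier counts mentioned above (15 for OVO, 6 for OVA) can be checked by enumerating the digit pairs. This is a small sketch using the base MATLAB function nchoosek; the variable names are illustrative.

```matlab
digit_labels = [6 1 5 4 8 3];          % the six assigned digits

pairs = nchoosek(digit_labels, 2);     % one row per one-vs-one binary classifier
n_ovo = size(pairs, 1);                % number of digit pairs
n_ova = numel(digit_labels);           % one classifier per digit against the rest

fprintf('OVO trains %d classifiers, OVA trains %d.\n', n_ovo, n_ova);
disp(pairs);                           % each row is a pair of digits
```

Each OVO classifier sees only the 2000 training samples of its two digits, whereas each OVA classifier sees all 6000 samples with a 1000 vs. 5000 class split.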

4 Problem 3: K-Means Clustering

4.1 Methodology
I applied K-Means clustering to all six digits (6600 samples, 256 features) with K =
4, 5, 6, 8, using the squared Euclidean distance. I ran the algorithm with 5 replicates
and a maximum of 1000 iterations to give it room to converge, keeping the best of the
five runs. Then, I visualized the cluster centers as 16 × 16 images to understand the
clustering results.
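
To make the procedure concrete, here is a minimal sketch of the basic assignment/update iteration behind K-Means (Lloyd's algorithm) on a hypothetical 2-D toy data set. It is not a reimplementation of MATLAB's kmeans, which additionally handles initialization and replicates; the data, the fixed initial centers, and the variable names are assumptions made for illustration. The subtraction relies on implicit expansion (R2016b or later).

```matlab
% Toy data: two well-separated 2-D blobs (hypothetical, not the USPS digits).
X = [0 0; 0 1; 1 0; 1 1; 10 10; 10 11; 11 10; 11 11];
K = 2;
centers = X([1 5], :);                 % fixed initial centers for a repeatable demo

for iter = 1:100
    % Assignment step: squared Euclidean distance from every point to every center.
    D = zeros(size(X, 1), K);
    for k = 1:K
        D(:, k) = sum((X - centers(k, :)).^2, 2);
    end
    [~, assign] = min(D, [], 2);       % index of the nearest center per point

    % Update step: move each center to the mean of its assigned points.
    new_centers = centers;
    for k = 1:K
        if any(assign == k)
            new_centers(k, :) = mean(X(assign == k, :), 1);
        end
    end

    if isequal(new_centers, centers)   % stop once the centers no longer move
        break;
    end
    centers = new_centers;
end
disp(centers);
```

For the digit data, each row of centers would be a 256-dimensional mean image, which is why the cluster centers can be displayed as 16 × 16 pictures.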

4.2 Results
The cluster centers for each K are shown in Figures 3 to 6. For K = 4, the centers look
like combinations of digits: for example, Cluster 1 seems like a mix of digits 6 and 8,
and Cluster 2 looks like 1 and 4. For K = 5, the centers start to separate the digits
more, with Cluster 3 looking like digit 5. For K = 6, the centers match the six digits
well: Cluster 1 looks like digit 6, Cluster 2 like digit 1, Cluster 3 like digit 5, Cluster 4
like digit 4, Cluster 5 like digit 8, and Cluster 6 like digit 3. For K = 8, some digits get
split into variations: for example, Clusters 2 and 7 both look like digit 1 but in different
styles.

Figure 3: Cluster centers for Problem 3 with K = 4.

4.3 Observations
I found that K = 6 works best because it matches the number of digit classes, and the
cluster centers look very similar to the actual digits. For example, Cluster 1 for K = 6
clearly shows the loop of digit 6, and Cluster 2 shows the straight line of digit 1. When
K is less than 6, like K = 4, the clusters combine digits that look similar—digits 6 and 8
both have loops, and digits 1 and 4 are both straight, which makes sense visually. When
K is more than 6, like K = 8, the clusters start splitting digits into different styles, like
two versions of digit 1, which might represent different ways people write the digit. I
think K = 6 captures the natural structure of the data the best, as it avoids combining
different digits or splitting the same digit too much. To improve this in the future, I
could try different distance metrics, like cityblock, to see if that changes how the clusters
form.

5 Problem 4: Hierarchical Clustering

5.1 Methodology
I applied hierarchical clustering to a subset of 600 samples (100 per digit) using Euclidean
and cityblock (L1) distances, with the average linkage method. I created dendrograms to
find the best number of clusters and computed silhouette scores for K = 6 to check the
clustering quality.

Figure 4: Cluster centers for Problem 3 with K = 5.

5.2 Results
The dendrograms are shown in Figures 7 and 8.

• Euclidean Distance: The largest merge distance jump is at step 593 (distance:
7.9019), suggesting K = 7.
• Cityblock Distance: The largest merge distance jump is at step 597 (distance:
95.4902), suggesting K = 3.
• Silhouette Scores (for K = 6): Euclidean = 0.0653, Cityblock = 0.0796.
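
The "largest jump" rule behind the suggested K values can be illustrated on a small example. The merge distances below are hypothetical, chosen only to show the arithmetic; with n points there are n − 1 merges, and the suggested number of clusters is the number remaining just before the merge that causes the largest increase in distance.

```matlab
% Hypothetical merge distances for n = 6 points (n - 1 = 5 merges),
% in the order a linkage routine would report them.
merge_dist = [1.0; 1.2; 1.5; 4.0; 4.2];
n = numel(merge_dist) + 1;

jumps = diff(merge_dist);              % increase from one merge to the next
[max_jump, j] = max(jumps);            % largest jump lies between merge j and j + 1
k_suggested = n - j;                   % clusters remaining after merge j

fprintf('Largest jump %.2f after merge %d, suggesting K = %d.\n', ...
    max_jump, j, k_suggested);
```

With n = 600 samples, the same formula gives K = 600 − 593 = 7 for the Euclidean dendrogram and K = 600 − 597 = 3 for the cityblock dendrogram, matching the values listed above.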

5.3 Observations
I noticed that the cityblock distance gives slightly better clusters because its
silhouette score (0.0796) is higher than the Euclidean score (0.0653). But the difference
is only 0.0143, and both scores are close to 0, which means the clusters overlap heavily.
I think this is because the data has so many features (256), making it hard to cluster
well in high dimensions. The dendrograms suggest different numbers of clusters (K = 7
for Euclidean and K = 3 for cityblock), but since we have six digit classes, I think K = 6
is a good choice to match the actual number of digits. I believe cityblock does better
because it’s less sensitive to outliers in the pixel values. For example, if a pixel is much
brighter or darker in one image, the Euclidean distance squares that difference, which
gives it more weight relative to the many small differences, while cityblock just takes
the absolute difference, which might be more robust for this data. In the future, I could
try reducing the dimensions of the data, maybe with PCA, to see if that improves the
clustering.
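
For reference, the two distances and the silhouette value discussed above have the following standard definitions. The notation is introduced here for illustration: x and y are two 256-dimensional samples, a_i is the mean distance from sample i to the other members of its own cluster, and b_i is the smallest mean distance from sample i to the members of any other cluster.

```latex
\begin{aligned}
d_{\text{euclidean}}(x, y) &= \sqrt{\sum_{j=1}^{256} (x_j - y_j)^2}, &
d_{\text{cityblock}}(x, y) &= \sum_{j=1}^{256} \lvert x_j - y_j \rvert, \\[4pt]
s(i) &= \frac{b_i - a_i}{\max(a_i,\, b_i)}, &
-1 \le s(i) &\le 1 .
\end{aligned}
```

A value of s(i) near 1 means sample i sits well inside its cluster, and a value near 0 means it lies between two clusters, which is the situation the average scores of 0.0653 and 0.0796 point to.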

Figure 5: Cluster centers for Problem 3 with K = 6.

6 Problem 5: Linear Regression on Iris Dataset

6.1 Methodology
I performed linear regression on the Iris dataset (Setosa class only, 50 samples) to predict
sepal width from sepal length. I used a 40/10 train/test split and did 4-fold cross-
validation on the training set to validate my model. I measured the test set performance
with RMSE, MAE, and R-squared.
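
The three test metrics can be illustrated with a small least-squares fit in base MATLAB. This is a minimal sketch on hypothetical toy numbers, not the Iris measurements, and it uses the backslash operator instead of fitlm so that it runs without any toolbox; all variable names are illustrative.

```matlab
% Hypothetical toy data (not the Iris measurements).
x_train = [1; 2; 3; 4];   y_train = [1; 3; 2; 4];
x_test  = [5; 6];         y_test  = [4; 6];

% Ordinary least squares with an intercept: solve [1 x] * beta ~ y.
beta = [ones(size(x_train)), x_train] \ y_train;   % beta(1): intercept, beta(2): slope
y_pred = [ones(size(x_test)), x_test] * beta;

resid = y_test - y_pred;
rmse = sqrt(mean(resid.^2));                        % root mean squared error
mae  = mean(abs(resid));                            % mean absolute error
r2   = 1 - sum(resid.^2) / sum((y_test - mean(y_test)).^2);   % R-squared on the test set

fprintf('Model: y = %.4f + %.4f * x\n', beta(1), beta(2));
fprintf('RMSE: %.4f, MAE: %.4f, R-squared: %.4f\n', rmse, mae, r2);
```

Note that R-squared computed on a held-out set compares the model against the mean of the test targets, so it can be low or even negative when the test set is small.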

6.2 Results
The average 4-fold cross-validation MSE is 0.0765. The final model is:

SepalWidth = −0.2939 + 0.7482 × SepalLength

The test set metrics are: RMSE = 0.2686, MAE = 0.2239, R-squared = 0.3989. The
regression plot is shown in Figure 9, with training data as blue circles, test data as red
crosses, and the regression line in green.

Figure 6: Cluster centers for Problem 3 with K = 8.

6.3 Observations
The model has moderate accuracy, with a test RMSE of 0.2686 and MAE of 0.2239, which is
in line with the cross-validation error: the average CV MSE of 0.0765 corresponds to an
RMSE of about 0.277. But the R-squared value of 0.3989 is pretty low, showing that there’s
only a weak linear relationship between sepal length and sepal width for the Setosa class.
R-squared tells us how much of the variation in sepal width is explained by sepal length,
and 0.3989 means less than 40% is explained, so there’s a lot of variation that the model
doesn’t capture. Since the test set has only 10 samples, this estimate is also noisy. I
think sepal width might depend on other features, like petal length or width, or maybe
the relationship isn’t linear at all. For example, if sepal width changes in a curved pattern
with sepal length, a linear model wouldn’t fit well. In the future, I could try adding more
features to the model or using a non-linear regression method, like polynomial regression,
to see if that improves the R-squared value.

7 Conclusion
This project let me apply machine learning techniques to the USPS handwritten digits
and Iris datasets, and I learned a lot from it. My key findings are: (1) for binary SVM
classification of digits 6 and 1, both kernels reach an accuracy of 0.9900, and I prefer
the Gaussian kernel because its errors are more balanced and it can handle non-linear
boundaries; (2) the one-vs-one SVM approach is better for multi-class classification,
with higher precision, recall, and specificity, because it focuses on pairs of digits;
(3) K-Means clustering with K = 6 matches the digit classes best, showing the natural
structure of the data; (4) hierarchical clustering with cityblock distance gives slightly
better clusters than Euclidean, and K = 6 is a practical choice for this data; and
(5) linear regression on the Iris Setosa class shows a weak linear relationship, so I
might need more features or a non-linear model to predict sepal width better. For future
work, I’d like to try tuning the Gaussian kernel’s parameters to improve the SVM even
more, use dimension reduction like PCA for clustering to make it easier, and explore
non-linear regression models for the Iris dataset to get a better fit.

Figure 7: Dendrogram for Problem 4 using Euclidean distance.

Figure 8: Dendrogram for Problem 4 using cityblock distance.

Figure 9: Linear regression plot for Problem 5 (Setosa): Training data (blue circles), test
data (red crosses), and regression line (green).

Reserach Proposal Car Rental Final
100% (1)
Reserach Proposal Car Rental Final
10 pages
Dbs Group Data Anaytics in Audit CS Clean
No ratings yet
Dbs Group Data Anaytics in Audit CS Clean
12 pages
Data604 Final Submission Sravani
No ratings yet
Data604 Final Submission Sravani
21 pages
DA Programs
No ratings yet
DA Programs
44 pages
Data604 Project2
No ratings yet
Data604 Project2
1 page
Pattern Recognition
No ratings yet
Pattern Recognition
26 pages
Optimiation Ass 05
No ratings yet
Optimiation Ass 05
5 pages
K-Means Clustering Using Matlab: December 2015
No ratings yet
K-Means Clustering Using Matlab: December 2015
6 pages
EX - NO:3: Algorithm
No ratings yet
EX - NO:3: Algorithm
11 pages
K-Means Clustering Tutorial - Matlab Code
No ratings yet
K-Means Clustering Tutorial - Matlab Code
3 pages
ISYE6501 Homework 1
No ratings yet
ISYE6501 Homework 1
7 pages
CSE 474/574 Introduction To Machine Learning Fall 2011 Assignment 3
No ratings yet
CSE 474/574 Introduction To Machine Learning Fall 2011 Assignment 3
3 pages
G 203008076 - 4 - Christhian Quiñonez - Ex1 - 2 A PDF
No ratings yet
G 203008076 - 4 - Christhian Quiñonez - Ex1 - 2 A PDF
20 pages
Simulate SVM Classification For A Dataset.: EX NO:04 Date
No ratings yet
Simulate SVM Classification For A Dataset.: EX NO:04 Date
4 pages
Image Classifaction
No ratings yet
Image Classifaction
17 pages
Matlab Code:: All 'Train - CSV' 'Test - Org - CSV' 'Testme - CSV'
No ratings yet
Matlab Code:: All 'Train - CSV' 'Test - Org - CSV' 'Testme - CSV'
3 pages
Weekly Homework X
No ratings yet
Weekly Homework X
15 pages
PCA Codebase
No ratings yet
PCA Codebase
6 pages
Assignment 11-17-15: Michael Petzold November 19, 2015
No ratings yet
Assignment 11-17-15: Michael Petzold November 19, 2015
4 pages
Final ML File
No ratings yet
Final ML File
34 pages
Machine Learning Lab Manual
No ratings yet
Machine Learning Lab Manual
9 pages
EE 559 HW2Code PDF
No ratings yet
EE 559 HW2Code PDF
7 pages
Chenhao HW1
No ratings yet
Chenhao HW1
5 pages
Bilal Ahmad Ai & DSS Assign # 03
No ratings yet
Bilal Ahmad Ai & DSS Assign # 03
7 pages
Aiml Lab
No ratings yet
Aiml Lab
37 pages
# ELG 5255 Applied Machine Learning Fall 2020 # Assignment 3 (Multivariate Method)
No ratings yet
# ELG 5255 Applied Machine Learning Fall 2020 # Assignment 3 (Multivariate Method)
8 pages
50 Inference
No ratings yet
50 Inference
31 pages
ML Lab Prgms Split
No ratings yet
ML Lab Prgms Split
3 pages
Grid Search For SVM
No ratings yet
Grid Search For SVM
9 pages
Analysis Course HW2
No ratings yet
Analysis Course HW2
13 pages
BDA Lab Manual (12 Weeks)
No ratings yet
BDA Lab Manual (12 Weeks)
22 pages
Rajeek8 12
No ratings yet
Rajeek8 12
21 pages
HW 1
No ratings yet
HW 1
4 pages
ML Exp5 C36
No ratings yet
ML Exp5 C36
18 pages
Materi 5 - 2
No ratings yet
Materi 5 - 2
25 pages
Detection and Pattern Recognition: Matlab: 1 Supervised Classification
No ratings yet
Detection and Pattern Recognition: Matlab: 1 Supervised Classification
4 pages
COMP 4211 - Machine Learning
No ratings yet
COMP 4211 - Machine Learning
19 pages
Problems
No ratings yet
Problems
2 pages
R Assignment
No ratings yet
R Assignment
8 pages
Machine Learning With MATLAB Quick Reference
No ratings yet
Machine Learning With MATLAB Quick Reference
36 pages
Topic 2 Matlab Examples
No ratings yet
Topic 2 Matlab Examples
5 pages
R Console
No ratings yet
R Console
6 pages
Assignment III
No ratings yet
Assignment III
3 pages
Assignment 1
No ratings yet
Assignment 1
16 pages
MLT 9
No ratings yet
MLT 9
10 pages
MLLab Manual
No ratings yet
MLLab Manual
24 pages
Matlab Program
No ratings yet
Matlab Program
15 pages
Mvda 2
No ratings yet
Mvda 2
13 pages
ML Shristi File
No ratings yet
ML Shristi File
49 pages
Saurabh
No ratings yet
Saurabh
22 pages
Implementation
No ratings yet
Implementation
14 pages
Machine Learning Lab
No ratings yet
% # Data604: Project 2 - Handwritten Digits and Iris Analysis

% **Assigned labels**: 6, 1, 5, 4, 8, 3 (USPS indices: 7, 2, 6, 5, 9, 4).

% ## Problem 1: Soft Margin SVM with Gaussian vs. Linear Kernel

Training set: 2000 samples, 256 features

fprintf('Test set: %d samples, %d features\n', size(X_test, 1), size(X_test, 2));

Test set: 200 samples, 256 features

% Train SVM models

% Compute confusion matrices

Linear Kernel Confusion Matrix:

fprintf('Gaussian Kernel Confusion Matrix:\n'); disp(conf_gaussian);

Gaussian Kernel Confusion Matrix:

% Plot confusion matrices

% Compute and display metrics

Linear Kernel Metrics:

fprintf('Gaussian Kernel Metrics:\nAccuracy: %.4f\nPrecision: %.4f\nRecall:

Gaussian Kernel Metrics:

Linear kernel performs better or metrics are comparable.

function [precision, recall, specificity, accuracy] = calculate_metrics_p1(confMat)
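The body of `calculate_metrics_p1` is not reproduced in this excerpt. As a rough sketch, the four metrics can be derived from a 2x2 confusion matrix as shown here, assuming rows are true classes and columns are predicted classes with the negative class first (the sorted order that `confusionmat` uses for labels -1 and +1). The matrix values are hypothetical and are not the project's results.

```matlab
% Hypothetical 2x2 confusion matrix (illustrative values only, not project results).
% Assumed layout: rows = true class, columns = predicted class, order [-1, +1].
confMat = [95  5;    % true -1: 95 predicted -1 (TN),  5 predicted +1 (FP)
            8 92];   % true +1:  8 predicted -1 (FN), 92 predicted +1 (TP)

TN = confMat(1, 1);  FP = confMat(1, 2);
FN = confMat(2, 1);  TP = confMat(2, 2);

precision   = TP / (TP + FP);               % share of predicted positives that are correct
recall      = TP / (TP + FN);               % share of true positives that are recovered
specificity = TN / (TN + FP);               % share of true negatives that are recovered
accuracy    = (TP + TN) / sum(confMat(:));  % share of all samples classified correctly

fprintf('Accuracy: %.4f\nPrecision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
        accuracy, precision, recall, specificity);
```

If the positive class is listed first in the confusion matrix instead, the four index pairs swap accordingly.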

% ## Problem 2: One-vs-One vs. One-vs-All Multi-Class SVM

Training set: 6000 samples, 256 features

fprintf('Test set: %d samples, %d features\n', size(X_test, 1), size(X_test, 2));

Test set: 600 samples, 256 features

% Train SVM models

% Compute confusion matrices

fprintf('One-vs-All Confusion Matrix:\n'); disp(conf_OVA);

One-vs-All Confusion Matrix:

% Plot confusion matrices

% Compute and display metrics

fprintf('One-vs-All Metrics:\nPrecision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n',

One-vs-One performs better.

function [precision, recall, specificity] = calculate_metrics_p2(confMat)
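The body of `calculate_metrics_p2` is likewise not shown here. One common way to get a single precision, recall, and specificity value from a multi-class confusion matrix is macro averaging, sketched below. Whether the project averaged per-class values this way or pooled the counts is not stated in this excerpt, so the averaging rule is an assumption, and the 3-class matrix is hypothetical.

```matlab
% Hypothetical 3-class confusion matrix (illustrative values only).
% Assumed layout: rows = true class, columns = predicted class.
confMat = [50  2  3;
            4 45  1;
            2  3 40];

TP = diag(confMat);                    % correct predictions per class
FP = sum(confMat, 1)' - TP;            % predicted as class k, actually another class
FN = sum(confMat, 2)  - TP;            % actually class k, predicted as another class
TN = sum(confMat(:)) - TP - FP - FN;   % everything not involving class k

% Macro averaging (an assumption): per-class metric, then unweighted mean.
precision   = mean(TP ./ (TP + FP));
recall      = mean(TP ./ (TP + FN));
specificity = mean(TN ./ (TN + FP));

fprintf('Precision: %.4f\nRecall: %.4f\nSpecificity: %.4f\n', ...
        precision, recall, specificity);
```

The same code applies unchanged to a 6x6 matrix, since every step is vectorized over the class dimension.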

% ## Problem 3: K-Means Clustering

Data set: 6600 samples, 256 features

K_values = [4, 5, 6, 8];
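The clustering call itself is not shown in this excerpt. The sketch below illustrates a sweep over `K_values` using a hand-written Lloyd iteration in base MATLAB, so it runs without any toolbox. The toy 2-D data, the initialization rule, and the names `X`, `C`, `idx`, and `wcss` are all assumptions, and this is not necessarily the routine the project used.

```matlab
% Toy stand-in data (hypothetical): 6 groups of 5 distinct 2-D points each.
groupCenters = [0 0; 10 0; 20 0; 0 10; 10 10; 20 10];
offsets      = [0 0; 1 0; 0 1; -1 0; 0 -1];
X = zeros(0, 2);
for g = 1:size(groupCenters, 1)
    X = [X; repmat(groupCenters(g, :), size(offsets, 1), 1) + offsets]; %#ok<AGROW>
end

K_values = [4, 5, 6, 8];
wcss = zeros(size(K_values));   % within-cluster sum of squares for each K

for t = 1:numel(K_values)
    K = K_values(t);
    % Deterministic initialization (an assumption): K evenly spaced data points.
    C = X(round(linspace(1, size(X, 1), K)), :);
    for iter = 1:100
        % Squared Euclidean distance from every point to every center.
        D = zeros(size(X, 1), K);
        for k = 1:K
            diffs = X - repmat(C(k, :), size(X, 1), 1);
            D(:, k) = sum(diffs.^2, 2);
        end
        [~, idx] = min(D, [], 2);           % assignment step
        Cnew = C;
        for k = 1:K                          % update step
            members = (idx == k);
            if any(members)                  % leave an empty cluster's center in place
                Cnew(k, :) = mean(X(members, :), 1);
            end
        end
        if isequal(Cnew, C), break; end      % converged: centers stopped moving
        C = Cnew;
    end
    wcss(t) = sum(min(D, [], 2));            % distances to the centers used in the last assignment
    fprintf('K = %d: within-cluster sum of squares = %.2f\n', K, wcss(t));
end
```

Comparing `wcss` across the values of K is one simple way to judge how well each K fits the data. On the 256-dimensional digit data, several random restarts per K would be advisable, because Lloyd iterations only reach a local optimum.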

% ## Problem 4: Hierarchical Clustering

Using subset of 600 samples

distances = {'euclidean', 'cityblock'};

Average silhouette (K=6, euclidean): 0.0653

% Analyze dendrograms for ideal clustering level

Dendrogram Analysis (euclidean Distance):

fprintf('Observation: Cityblock distance yields better clusters (higher silhouette

Training set: 40 samples

fprintf('Test set: %d samples\n', length(X_test));

Test set: 10 samples

Average 4-fold CV MSE: 0.0765
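The cross-validation loop is not reproduced in this excerpt. To illustrate how an average 4-fold CV MSE is obtained for a one-predictor linear model, the sketch below uses base-MATLAB `polyfit` on synthetic data. The data, the interleaved fold assignment, and the names `x`, `y`, and `foldId` are assumptions. They are not the Iris values or the project's code.

```matlab
% Synthetic stand-in data (hypothetical): 40 points near a straight line.
n = 40;
x = linspace(4, 8, n)';
y = 0.5 * x + 1 + 0.1 * sin(7 * x);    % deterministic wiggle in place of random noise

numFolds = 4;
foldId  = mod((0:n-1)', numFolds) + 1; % interleaved fold labels 1..4
mseFold = zeros(numFolds, 1);

for f = 1:numFolds
    isVal  = (foldId == f);
    coeffs = polyfit(x(~isVal), y(~isVal), 1);  % fit slope and intercept on the other 3 folds
    yHat   = polyval(coeffs, x(isVal));         % predict on the held-out fold
    mseFold(f) = mean((y(isVal) - yHat).^2);    % mean squared error on that fold
end

avgMSE = mean(mseFold);
fprintf('Average %d-fold CV MSE: %.4f\n', numFolds, avgMSE);
```

Each of the 40 samples is held out exactly once, so `avgMSE` estimates out-of-sample error without touching the separate 10-sample test set.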

mdl_final = fitlm(X_train, Y_train);

fprintf('Model: SepalWidth = %.4f + %.4f * SepalLength\n',

Model: SepalWidth = -0.2939 + 0.7482 * SepalLength

figure('Position', [100, 100, 1000, 600], 'Visible', 'on');

Summary: Moderate accuracy; R-squared suggests a weak linear relationship.

This report presents my analysis for Data604 Project 2, where I worked on

2 Problem 1: Soft Margin SVM with Gaussian vs. Linear Kernel

For the Gaussian kernel:  

I calculated the error metrics, which are summarized in Table 1.

Table 1: Problem 1: Error Metrics for Linear and Gaussian Kernels

Kernel Accuracy Precision Recall Specificity

3 Problem 2: One-vs-One vs. One-vs-All Multi-Class SVM

The global error metrics are shown in Table 2.

Approach Precision Recall Specificity

4 Problem 3: K-Means Clustering

Figure 3: Cluster centers for Problem 3 with K = 4.

5 Problem 4: Hierarchical Clustering

6 Problem 5: Linear Regression on Iris Dataset

SepalWidth = −0.2939 + 0.7482 × SepalLength
