Data604 Sravani FinalCombined
subplot(1, 2, 1);
confusionchart(conf_linear, {'6 (+1)', '1 (-1)'}, 'Title', 'Linear Kernel', ...
    'RowSummary', 'row-normalized', 'ColumnSummary', 'column-normalized');
subplot(1, 2, 2);
confusionchart(conf_gaussian, {'6 (+1)', '1 (-1)'}, 'Title', 'Gaussian Kernel', ...
    'RowSummary', 'row-normalized', 'ColumnSummary', 'column-normalized');
sgtitle('Problem 1: Confusion Matrices');
drawnow; pause(0.1);
% Conclusion
if acc_gaussian > acc_linear
fprintf('Gaussian kernel performs better.\n');
else
fprintf('Linear kernel performs better or metrics are comparable.\n');
end
One-vs-One Confusion Matrix:
97 0 0 0 3 0
1 97 0 1 0 1
0 2 97 0 0 1
0 0 1 99 0 0
2 1 0 0 96 1
0 0 0 0 6 94
drawnow; pause(0.1);
One-vs-One Metrics:
Precision: 0.9672
Recall: 0.9667
Specificity: 0.9933
One-vs-All Metrics:
Precision: 0.9470
Recall: 0.9467
Specificity: 0.9893
% Conclusion
if p_OVO > p_OVA && r_OVO > r_OVA && s_OVO > s_OVA
fprintf('One-vs-One performs better.\n');
else
fprintf('One-vs-All performs better or metrics are comparable.\n');
end
[~, centers] = kmeans(X, K, 'Distance', 'sqeuclidean', 'MaxIter', 1000, ...
    'Replicates', 5);
figure('Position', [100, 100, 1000, 200*ceil(K/2)], 'Visible', 'on');
sgtitle(sprintf('Problem 3: Cluster Centers (K=%d)', K));
for k = 1:K
subplot(ceil(K/2), 2, k);
imshow(reshape(centers(k,:), 16, 16)', []);
title(sprintf('Cluster %d', k));
end
drawnow; pause(0.1);
end
fprintf('Observation: K=6 aligns best with the number of classes, producing interpretable centers.\n');
Observation: K=6 aligns best with the number of classes, producing interpretable centers.
fprintf('Using subset of %d samples\n', size(X_subset, 1));
Average silhouette (K=6, cityblock): 0.0796
Observation: Cityblock distance yields better clusters (higher silhouette score); K=6 is a reasonable choice based o
% ## Problem 5: Linear Regression on Iris Dataset
clear; clc;
load fisheriris;
setosa_idx = strcmp(species, 'setosa');
X = meas(setosa_idx, 1); % Sepal length
Y = meas(setosa_idx, 2); % Sepal width
fprintf('Setosa samples: %d\n', length(X));
Setosa samples: 50
rng(42);
indices = randperm(length(X));
train_idx = indices(1:40); test_idx = indices(41:50);
X_train = X(train_idx); Y_train = Y(train_idx);
X_test = X(test_idx); Y_test = Y(test_idx);
fprintf('Training set: %d samples\n', length(X_train));
k = 4;
cv = cvpartition(length(X_train), 'KFold', k);
cv_mse = zeros(k, 1);
for i = 1:k
train_cv_idx = cv.training(i); val_cv_idx = cv.test(i);
X_train_cv = X_train(train_cv_idx); Y_train_cv = Y_train(train_cv_idx);
X_val_cv = X_train(val_cv_idx); Y_val_cv = Y_train(val_cv_idx);
mdl = fitlm(X_train_cv, Y_train_cv);
Y_pred_cv = predict(mdl, X_val_cv);
cv_mse(i) = mean((Y_pred_cv - Y_val_cv).^2);
end
fprintf('Average %d-fold CV MSE: %.4f\n', k, mean(cv_mse));
Test RMSE: 0.2686
MAE: 0.2239
R-squared: 0.3989
drawnow; pause(0.1);
fprintf('Summary: Moderate accuracy; R-squared suggests a weak linear relationship.\n');
Data604 Project 2: Analysis of Handwritten Digits
and Iris Dataset
Sravani
April 2025
Abstract
1 Introduction
In this project, I applied machine learning techniques to analyze the USPS handwritten
digits dataset and the Iris dataset for Data604 Project 2. My assigned digits are
6, 1, 5, 4, 8, and 3, which correspond to USPS indices 7, 2, 6, 5, 9, and 4. The tasks
include binary and multi-class classification using Support Vector Machines (SVM),
unsupervised clustering with K-Means and hierarchical methods, and linear regression. I
implemented everything in MATLAB, and my results are shown through confusion
matrices, cluster center images, dendrograms, and regression plots. My goal with this report
is to explain each problem clearly and to present my findings in enough detail that the
methods and results are easy to follow.
2.1 Methodology
I trained a soft margin SVM on digits 6 and 1 (USPS indices 7 and 2) using both linear
and Gaussian (RBF) kernels. The training set has 2000 samples (1000 per digit), and the
test set has 200 samples (100 per digit), with each sample having 256 features normalized
to [0, 1]. I set the labels as +1 for digit 6 and −1 for digit 1. The SVM models were trained
with a box constraint of 1, and the Gaussian kernel used an automatically determined
kernel scale. I then computed confusion matrices and error metrics (accuracy, precision,
recall, specificity) to compare the two kernels.
2.2 Results
The confusion matrices for both kernels are shown in Figure 1. For the linear kernel:
98 2
0 100
Figure 1: Confusion matrices for Problem 1: Linear kernel (left) and Gaussian kernel
(right), showing row-normalized and column-normalized percentages.
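The four error metrics named in the methodology follow directly from the confusion-matrix counts. The lines below are a minimal sketch, assuming rows are true classes and columns are predicted classes, with digit 6 (+1) as the positive class. That orientation is my assumption, chosen because it reproduces the linear-kernel values discussed in the observations.

```matlab
% Linear-kernel confusion matrix from the results above.
% Assumed layout: rows = true class, columns = predicted class,
% ordered as digit 6 (+1) then digit 1 (-1).
C = [98 2; 0 100];

TP = C(1,1);  FN = C(1,2);   % digit 6 predicted as 6 / as 1
FP = C(2,1);  TN = C(2,2);   % digit 1 predicted as 6 / as 1

accuracy    = (TP + TN) / sum(C(:));
precision   = TP / (TP + FP);
recall      = TP / (TP + FN);
specificity = TN / (TN + FP);

fprintf('Accuracy: %.4f  Precision: %.4f  Recall: %.4f  Specificity: %.4f\n', ...
        accuracy, precision, recall, specificity);
```

With these counts the recall is 0.9800 and the precision and specificity are both 1.0000.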
2.3 Observations
I noticed that the Gaussian kernel does better than the linear kernel in terms of recall
(0.9900 vs. 0.9800), which means it has fewer false negatives: only 1 compared to 2 for
the linear kernel. The linear kernel has perfect precision and specificity (both 1.0000),
but the Gaussian kernel gives a more even performance across all metrics, with all values
at 0.9900. Overall accuracy is the same for both kernels (0.9900, since the linear kernel
also classifies 198 of the 200 test samples correctly), so the Gaussian kernel’s advantage
is in balance rather than in total error. I think this makes sense because the Gaussian
kernel can capture non-linear patterns in the data, which is important for distinguishing
between digits 6 and 1 since they might not be perfectly separable by a linear boundary.
For example, digit 6 has a loop, while digit 1 is mostly a straight stroke, so a non-linear
boundary helps. This suggests that the Gaussian kernel is the better-balanced choice for
this task, especially since handwritten digits can vary a lot in shape.
3.1 Methodology
I trained multi-class SVMs on all six digits (6, 1, 5, 4, 8, 3) using one-vs-one (OVO)
and one-vs-all (OVA) approaches. The training set has 6000 samples (1000 per digit),
and the test set has 600 samples (100 per digit), with 256 features per sample. I used a
linear kernel with a box constraint of 1, and I standardized the data. Then, I computed
confusion matrices and global error metrics (precision, recall, specificity) to compare OVO
and OVA.
3.2 Results
The confusion matrices are shown in Figure 2. For OVO:
97 0 0 0 3 0
1 97 0 1 0 1
0 2 97 0 0 1
0 0 1 99 0 0
2 1 0 0 96 1
0 0 0 0 6 94
For OVA:
96 0 0 2 2 0
3 91 0 1 1 4
0 4 94 0 0 2
0 0 3 97 0 0
1 1 0 1 96 1
0 0 0 1 5 94
Table 2: Problem 2: Global Error Metrics for OVO and OVA Approaches
Approach   Precision   Recall   Specificity
OVO        0.9672      0.9667   0.9933
OVA        0.9470      0.9467   0.9893
Figure 2: Confusion matrices for Problem 2: One-vs-One (left) and One-vs-All (right)
SVMs, showing row-normalized and column-normalized percentages.
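The global metrics can be recomputed from the two confusion matrices. The sketch below assumes rows are true digits and columns are predicted digits, and that "global" means a macro average (each metric computed per class, then averaged over the six classes). Both are assumptions on my part, though under them the output agrees with the reported values to four decimals.

```matlab
% OVO and OVA confusion matrices copied from the results above.
% Assumed layout: rows = true digit, columns = predicted digit.
C_ovo = [97 0 0 0 3 0; 1 97 0 1 0 1; 0 2 97 0 0 1; ...
         0 0 1 99 0 0; 2 1 0 0 96 1; 0 0 0 0 6 94];
C_ova = [96 0 0 2 2 0; 3 91 0 1 1 4; 0 4 94 0 0 2; ...
         0 0 3 97 0 0; 1 1 0 1 96 1; 0 0 0 1 5 94];

mats    = {C_ovo, C_ova};
names   = {'One-vs-One', 'One-vs-All'};
metrics = zeros(2, 3);            % columns: precision, recall, specificity

for m = 1:2
    C  = mats{m};
    TP = diag(C);                 % correct predictions per class
    FP = sum(C, 1)' - TP;         % column total minus the diagonal entry
    FN = sum(C, 2)  - TP;         % row total minus the diagonal entry
    TN = sum(C(:)) - TP - FP - FN;
    % Macro average: per-class metric first, then the mean over classes
    metrics(m, :) = [mean(TP ./ (TP + FP)), ...
                     mean(TP ./ (TP + FN)), ...
                     mean(TN ./ (TN + FP))];
    fprintf('%s: precision %.4f, recall %.4f, specificity %.4f\n', ...
            names{m}, metrics(m, 1), metrics(m, 2), metrics(m, 3));
end
```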
3.3 Observations
From the results, I can see that the OVO approach does better than OVA across all
metrics. OVO has a precision of 0.9672 compared to OVA’s 0.9470, a recall of 0.9667
compared to 0.9467, and a specificity of 0.9933 compared to 0.9893. That’s an improvement
of 0.0202 in precision, 0.0200 in recall, and 0.0040 in specificity. Looking at the
confusion matrices, OVO has fewer mistakes overall: 20 of the 600 test samples are
misclassified, compared to 32 for OVA. For example, for digit 1 (row 2),
OVO has only 3 misclassifications, while OVA has 9. Similarly, for digit 5 (row 3), OVA
misclassifies 6 samples, while OVO only misclassifies 3. I think OVO performs better
because it trains more classifiers, $\binom{6}{2} = 15$ compared to OVA’s 6, which lets it
focus on distinguishing between each pair of digits more carefully. For instance, digits 6 and
8 might look similar because of their loops, but OVO can handle that better by directly
comparing them. So, I conclude that OVO is the better choice for this multi-class task.
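The classifier counts quoted above can be checked by enumerating the digit pairs. This is only an illustrative sketch; the variable names are hypothetical and it does not train anything.

```matlab
digitSet = [6 1 5 4 8 3];          % assigned digits from the introduction
pairs    = nchoosek(digitSet, 2);  % every unordered pair, one binary SVM per row
nOVO     = size(pairs, 1);         % number of one-vs-one classifiers
nOVA     = numel(digitSet);        % one-vs-all trains one classifier per class

fprintf('OVO classifiers: %d, OVA classifiers: %d\n', nOVO, nOVA);
```

With 1000 training samples per digit, each pairwise classifier sees only the 2000 samples of its two digits, while each one-vs-all classifier sees all 6000.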
4.1 Methodology
I applied K-Means clustering to all six digits (6600 samples, 256 features) with K =
4, 5, 6, 8, using the squared Euclidean distance. I ran the algorithm with 5 replicates and a
maximum of 1000 iterations per run to give it room to converge. Then, I visualized the
cluster centers as 16 × 16 images to understand the clustering results.
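To make the procedure concrete, here is a minimal sketch of the assignment and update steps that K-Means alternates between. It uses a hypothetical six-point toy dataset and a hand-picked initialization rather than the USPS data, and it is not the implementation inside MATLAB's `kmeans`, which additionally repeats the run from several random starts when `'Replicates'` is set and keeps the best one.

```matlab
% Toy data: two tight groups of 2-D points (hypothetical, not USPS digits)
X = [0 0; 0.1 0; 0 0.1; 5 5; 5.1 5; 5 5.1];
K = 2;
centers = X([1 4], :);                 % hypothetical initialization: one seed per group

for iter = 1:100
    % Assignment step: squared Euclidean distance from each point to each center
    D = zeros(size(X, 1), K);
    for k = 1:K
        D(:, k) = sum(bsxfun(@minus, X, centers(k, :)).^2, 2);
    end
    [~, labels] = min(D, [], 2);

    % Update step: each center moves to the mean of its assigned points
    newCenters = centers;
    for k = 1:K
        if any(labels == k)
            newCenters(k, :) = mean(X(labels == k, :), 1);
        end
    end

    if isequal(newCenters, centers)    % converged: no center moved
        break;
    end
    centers = newCenters;
end
disp(centers);
```

Each row of `centers` is the mean of the points in that cluster, which is why the USPS centers can be reshaped and viewed as averaged digit images.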
4.2 Results
The cluster centers for each K are shown in Figures 3 to 6. For K = 4, the centers look
like combinations of digits—for example, Cluster 1 seems like a mix of digits 6 and 8,
and Cluster 2 looks like 1 and 4. For K = 5, the centers start to separate the digits
more, with Cluster 3 looking like digit 5. For K = 6, the centers match the six digits
well: Cluster 1 looks like digit 6, Cluster 2 like digit 1, Cluster 3 like digit 5, Cluster 4
like digit 4, Cluster 5 like digit 8, and Cluster 6 like digit 3. For K = 8, some digits get
split into variations—for example, Clusters 2 and 7 both look like digit 1 but in different
styles.
4.3 Observations
I found that K = 6 works best because it matches the number of digit classes, and the
cluster centers look very similar to the actual digits. For example, Cluster 1 for K = 6
clearly shows the loop of digit 6, and Cluster 2 shows the straight line of digit 1. When
K is less than 6, like K = 4, the clusters combine digits that look similar—digits 6 and 8
both have loops, and digits 1 and 4 are both straight, which makes sense visually. When
K is more than 6, like K = 8, the clusters start splitting digits into different styles, like
two versions of digit 1, which might represent different ways people write the digit. I
think K = 6 captures the natural structure of the data the best, as it avoids combining
different digits or splitting the same digit too much. To improve this in the future, I
could try different distance metrics, like cityblock, to see if that changes how the clusters
form.
5.1 Methodology
I applied hierarchical clustering to a subset of 600 samples (100 per digit) using Euclidean
and cityblock (L1 ) distances, with the average linkage method. I created dendrograms to
find the best number of clusters and computed silhouette scores for K = 6 to check the
clustering quality.
Figure 4: Cluster centers for Problem 3 with K = 5.
5.2 Results
The dendrograms are shown in Figures 7 and 8.
• Euclidean Distance: The largest merge distance jump is at step 593 (distance:
7.9019), suggesting K = 7.
• Cityblock Distance: The largest merge distance jump is at step 597 (distance:
95.4902), suggesting K = 3.
• Silhouette Scores (for K = 6): Euclidean = 0.0653, Cityblock = 0.0796.
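The step numbers above map to cluster counts because each merge reduces the number of clusters by one, so after `step` merges of n samples there are n - step clusters left. The sketch below shows that arithmetic on hypothetical merge heights. In MATLAB these heights would be the third column of the `linkage` output, and the exact jump convention used for the reported steps is my assumption.

```matlab
% Hypothetical merge heights for n = 6 samples (n - 1 = 5 merges),
% listed in the order the merges happen.
n = 6;
heights = [0.1 0.2 0.3 2.0 2.1];

[maxJump, step] = max(diff(heights));   % largest increase between consecutive merges
K = n - step;                           % clusters remaining after 'step' merges

fprintf('Largest jump %.2f after merge %d -> K = %d\n', maxJump, step, K);

% Same arithmetic for the steps reported above (n = 600 samples)
K_euclidean = 600 - 593;
K_cityblock = 600 - 597;
fprintf('Euclidean: K = %d, cityblock: K = %d\n', K_euclidean, K_cityblock);
```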
5.3 Observations
I noticed that the cityblock distance gives better clusters because its silhouette score
(0.0796) is higher than the Euclidean score (0.0653). But both scores are pretty low,
which I think is because the data has so many features (256), making it hard to cluster
well in high dimensions. The dendrograms suggest different numbers of clusters—K = 7
for Euclidean and K = 3 for cityblock—but since we have six digit classes, I think K = 6
is a good choice to match the actual number of digits. I believe cityblock does better
because it’s less sensitive to outliers in the pixel values. For example, if a pixel is much
brighter or darker in one image, the Euclidean distance squares that difference, making
it bigger, while cityblock just takes the absolute difference, which might be more robust
for this data. In the future, I could try reducing the dimensions of the data, maybe with
PCA, to see if that improves the clustering.
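The difference between the two distances is easy to see on a tiny example. In this sketch (hypothetical 4-pixel "images", not USPS data), one image differs from the reference in a single pixel and another differs by a small amount in every pixel.

```matlab
x = [0 0 0 0];                 % reference image (hypothetical 4-pixel example)
y = [1 0 0 0];                 % one pixel differs by 1
z = [0.25 0.25 0.25 0.25];     % every pixel differs by 0.25

cityblock = @(a, b) sum(abs(a - b));          % L1 distance
euclidean = @(a, b) sqrt(sum((a - b).^2));    % L2 distance

fprintf('cityblock: d(x,y) = %.2f, d(x,z) = %.2f\n', cityblock(x, y), cityblock(x, z));
fprintf('euclidean: d(x,y) = %.2f, d(x,z) = %.2f\n', euclidean(x, y), euclidean(x, z));
```

Cityblock treats both images as equally far from the reference (1.00 each), while Euclidean puts the single-pixel outlier twice as far away (1.00 vs. 0.50), which is the sensitivity described above.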
Figure 5: Cluster centers for Problem 3 with K = 6.
6.1 Methodology
I performed linear regression on the Iris dataset (Setosa class only, 50 samples) to predict
sepal width from sepal length. I used a 40/10 train/test split and did 4-fold cross-
validation on the training set to validate my model. I measured the test set performance
with RMSE, MAE, and R-squared.
6.2 Results
The average 4-fold cross-validation MSE is 0.0765. The final model is:
The test set metrics are: RMSE = 0.2686, MAE = 0.2239, R-squared = 0.3989. The
regression plot is shown in Figure 9, with training data as blue circles, test data as red
crosses, and the regression line in green.
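For reference, the three test metrics can be computed from a vector of targets and a vector of predictions as sketched below. The numbers are hypothetical, not the Setosa test set, and the R-squared formula shown (one minus residual sum of squares over total sum of squares) is the usual definition, which I am assuming is the one used for the value reported above.

```matlab
% Hypothetical targets and predictions (not the Setosa data)
y    = [3.0; 3.5; 3.2; 3.8];
yhat = [3.1; 3.4; 3.3; 3.6];

res  = y - yhat;                                   % residuals
rmse = sqrt(mean(res.^2));                         % root mean squared error
mae  = mean(abs(res));                             % mean absolute error
r2   = 1 - sum(res.^2) / sum((y - mean(y)).^2);    % share of variance explained

fprintf('RMSE: %.4f\nMAE: %.4f\nR-squared: %.4f\n', rmse, mae, r2);
```

RMSE weights large residuals more heavily than MAE does, so it is never smaller than MAE on the same predictions.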
Figure 6: Cluster centers for Problem 3 with K = 8.
6.3 Observations
The model has moderate accuracy: the RMSE of 0.2686 cm and MAE of 0.2239 cm mean a typical
prediction is off by roughly a quarter of a centimeter. But the R-squared value of
0.3989 is pretty low, showing that there’s a weak linear relationship between sepal length
and sepal width for the Setosa class. This makes sense because R-squared tells us how
much of the variation in sepal width is explained by sepal length, and 0.3989 means less
than 40% is explained, so there’s a lot of variation that the model doesn’t capture. I
think sepal width might depend on other features, like petal length or width, or maybe
the relationship isn’t linear at all. For example, if sepal width changes in a curved pattern
with sepal length, a linear model wouldn’t fit well. In the future, I could try adding more
features to the model or using a non-linear regression method, like polynomial regression,
to see if that improves the R-squared value.
7 Conclusion
This project let me apply machine learning techniques to the USPS handwritten digits
and Iris datasets, and I learned a lot from it. My key findings are: (1) the Gaussian
kernel works better than the linear kernel for binary SVM classification because it can
handle non-linear boundaries, which is important for digits like 6 and 1; (2) the one-vs-
one SVM approach is better for multi-class classification, with higher precision, recall,
8
Figure 7: Dendrogram for Problem 4 using Euclidean distance.
and specificity, because it focuses on pairs of digits; (3) K-Means clustering with K = 6
matches the digit classes best, showing the natural structure of the data; (4) hierarchical
clustering with cityblock distance gives better clusters than Euclidean, and K = 6 is a
practical choice for this data; and (5) linear regression on the Iris Setosa class shows a weak
linear relationship, so I might need more features or a non-linear model to predict sepal
width better. For future work, I’d like to try tuning the Gaussian kernel’s parameters to
improve the SVM even more, use dimension reduction like PCA for clustering to make it
easier, and explore non-linear regression models for the Iris dataset to get a better fit.
9
Figure 8: Dendrogram for Problem 4 using cityblock distance.
Figure 9: Linear regression plot for Problem 5 (Setosa): Training data (blue circles), test
data (red crosses), and regression line (green).