Tutorial 1 Solution
Q1
a)
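For clusters $U$ and $V$ merged into $(UV)$ and any other cluster $W$, with $d_{ij}$ the distance between objects $i$ and $j$:

$$
\begin{aligned}
d_{(UV)W} &= \min_{i \in (UV),\, j \in W} d_{ij} \\
&= \min\{\, d_{ij} \mid i \in (UV),\ j \in W \,\} \\
&= \min\{\, \{ d_{ij} \mid i \in U,\ j \in W \} \cup \{ d_{ij} \mid i \in V,\ j \in W \} \,\} \\
&= \min\{\, \min_{i \in U,\, j \in W} d_{ij},\ \min_{i \in V,\, j \in W} d_{ij} \,\} \\
&= \min(d_{UW},\, d_{VW})
\end{aligned}
$$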
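As a quick numerical check of this identity, the sketch below applies it to the distance matrix of part (b), taking the merged cluster to be BD and the other cluster to be C. The helper name `single_link` is hypothetical and is introduced only for illustration.

```r
# Distance matrix from part (b), entered by column of the lower triangle.
labs <- c("A", "B", "C", "D", "E")
d <- matrix(0, 5, 5, dimnames = list(labs, labs))
d[lower.tri(d)] <- c(4.04, 4.64, 3.28, 0.64,  # column A: rows B, C, D, E
                     1.00, 0.04, 5.00,        # column B: rows C, D, E
                     1.04, 4.00,              # column C: rows D, E
                     4.24)                    # column D: row E
d <- d + t(d)                                 # make the matrix symmetric

# Hypothetical helper: single-linkage distance between two groups of labels.
single_link <- function(d, g1, g2) min(d[g1, g2])

# Left-hand side: minimum over all pairs with one object in (BD) and one in C.
lhs <- single_link(d, c("B", "D"), "C")
# Right-hand side: minimum of the two cluster-to-cluster distances.
rhs <- min(single_link(d, "B", "C"), single_link(d, "D", "C"))

c(lhs = lhs, rhs = rhs)  # both equal 1
```

Both sides give 1.00, which is the BD to C entry used in Step 2 of part (b).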
b)
Step1:
      A     B     C     D     E
A     0
B  4.04     0
C  4.64  1.00     0
D  3.28  0.04  1.04     0
E  0.64  5.00  4.00  4.24     0
BD merge. d(B,D)=0.04.
Step2:
       A     C     E    BD
A      0
C   4.64     0
E   0.64  4.00     0
BD  3.28  1.00  4.24     0
AE merge. d(A,E)=0.64.
Step3:
       C    BD    AE
C      0
BD  1.00     0
AE  4.00  3.28     0
BD, C merge. d(BD,C)=1
Step4:
       AE   BCD
AE      0
BCD  3.28     0
AE, BCD merge. d(AE,BCD)=3.28.
[Figure: dendrogram of the single-linkage solution; axis: Height, 0.0 to 3.0]
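The merge sequence in Steps 1 to 4 can be cross-checked in R. This is a minimal sketch that feeds the Step 1 distance matrix to `hclust` with single linkage; it is one way to verify the hand calculation, not necessarily how the solution was produced.

```r
# Step 1 distance matrix, entered by column of the lower triangle.
labs <- c("A", "B", "C", "D", "E")
d <- matrix(0, 5, 5, dimnames = list(labs, labs))
d[lower.tri(d)] <- c(4.04, 4.64, 3.28, 0.64,  # column A: rows B, C, D, E
                     1.00, 0.04, 5.00,        # column B: rows C, D, E
                     1.04, 4.00,              # column C: rows D, E
                     4.24)                    # column D: row E
d <- d + t(d)

# Single-linkage agglomerative clustering on the precomputed distances.
hc <- hclust(as.dist(d), method = "single")

hc$height                        # 0.04 0.64 1.00 3.28
two <- unname(cutree(hc, k = 2)) # memberships for A, B, C, D, E

# plot(hc) draws the dendrogram.
```

The heights match the merge distances found by hand, and cutting at two clusters separates {A, E} from {B, C, D}.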
Q2
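Initial clusters: BE and ACD.
Centers
BE   0.5  1
ACD  0.6  1.266667
Squared Euclidean distances to the centers
     A         B         C         D         E
BE   1.09      1.25      1.25      0.89      1.25
ACD  1.644444  0.697778  0.897778  0.444444  1.964444
A and E are closest to the BE center; B, C and D are closest to the ACD center. Reallocate to AE and BCD.
Centers
AE   0.4       0
BCD  0.666667  1.933333
Squared Euclidean distances to the centers
     A         B         C         D         E
AE   0.16      4.36      4.16      3.6       0.16
BCD  3.755556  0.115556  0.448889  0.128889  4.182222
Every object is closest to the center of its own cluster. No more reallocation.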
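The reallocation above can be sketched in R. The point coordinates are not given in this solution; the values below were back-solved so that they reproduce the centers and distances in the tables, and should be treated as an assumption. The sketch reassigns all objects at once in each pass, which may differ from an item-by-item reallocation, although both reach the same result here. The helper name `sq_dist` is hypothetical.

```r
# ASSUMPTION: coordinates inferred from the tables, not stated in the solution.
pts <- rbind(A = c(0.8, 0), B = c(1, 2), C = c(0, 2),
             D = c(1, 1.8), E = c(0, 0))

# Hypothetical helper: squared Euclidean distance from each point (rows)
# to each center (columns).
sq_dist <- function(pts, centers) {
  sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(pts, 2, centers[k, ])^2)
  })
}

# Start from cluster 1 = {B, E} and cluster 2 = {A, C, D}.
cluster <- c(A = 2, B = 1, C = 2, D = 2, E = 1)

repeat {
  # Update step: each center is the mean of its current members.
  centers <- rbind(colMeans(pts[cluster == 1, , drop = FALSE]),
                   colMeans(pts[cluster == 2, , drop = FALSE]))
  # Assignment step: move every point to its nearest center.
  new_cluster <- apply(sq_dist(pts, centers), 1, which.min)
  if (all(new_cluster == cluster)) break  # no more reallocation
  cluster <- new_cluster
}

cluster   # A and E in cluster 1; B, C and D in cluster 2
centers   # final centers of the two clusters
```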
Q3
b)
[Figure: dendrogram of the 20 customers, leaves c1 to c20; axis: Height, 0 to 80]
c)
The plot shows a large jump in the coefficient (merge height) when moving from the
3-cluster solution to the 2-cluster solution. Therefore, the number of clusters is 3.
[Figure: merge height against stage; x-axis: stage, 5 to 15; y-axis: height, 0 to 75]
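The "largest jump" rule can be sketched in R as follows. The data here are toy values invented for illustration, not the tutorial data, and the linkage method is likewise an assumption. The helper name `suggest_k` is hypothetical.

```r
# Toy one-dimensional data with three well-separated groups (invented).
x <- matrix(c(0, 0.1, 0.2, 10, 10.1, 10.2, 21, 21.1, 21.2), ncol = 1)
hc <- hclust(dist(x), method = "single")

# Hypothetical helper: number of clusters present just before the largest
# increase in merge height.
suggest_k <- function(hc) {
  h <- hc$height
  n <- length(h) + 1       # number of observations
  j <- which.max(diff(h))  # largest jump lies between stage j and stage j + 1
  n - j                    # clusters remaining after stage j
}

k <- suggest_k(hc)
groups <- cutree(hc, k = k)  # memberships for the suggested solution

# plot(seq_along(hc$height), hc$height, type = "b",
#      xlab = "stage", ylab = "height") draws the stage plot.
```

With these toy values the largest jump occurs at the merge from three clusters to two, so the helper returns 3. In practice the plot should still be inspected, since the largest jump is a rule of thumb rather than a formal test.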
d)
Cluster 1: 1, 5, 8, 10, 11, 12, 14, 17
Cluster 2: 2, 3, 4, 7, 15, 16, 18, 19
Cluster 3: 6, 9, 13, 20
e)
cluster  income    age    edu  expense  hours
1        -0.049   0.24   0.20     0.40   0.32
2        -0.754  -0.90  -0.82    -0.81  -0.73
3         1.607   1.32   1.23     0.83   0.83
[Figure: mean of each variable (income, age, edu, expense, hours) by cluster; y-axis: mean, -1 to 1; legend: cluster 1, 2, 3]
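A table of cluster means like the one in (e) can be produced with `scale` and `aggregate`. The sketch below uses a toy data frame and toy memberships invented for illustration; only the column names `income` and `age` are taken from the table above.

```r
# Toy data and memberships, invented for illustration (not the tutorial data).
dat <- data.frame(income = c(10, 12, 30, 34),
                  age    = c(20, 24, 50, 54))
cluster <- c(1, 1, 2, 2)

# Standardise each variable to mean 0 and standard deviation 1.
z <- as.data.frame(scale(dat))

# Mean of every standardised variable within each cluster.
means <- aggregate(z, by = list(cluster = cluster), FUN = mean)
means
```

Because the variables are standardised, a positive mean marks a cluster that is above the overall average on that variable and a negative mean marks one that is below it.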
f)
Cluster 1 contains customers 1, 5, 8, 10, 11, 12, 14 and 17. They are middle-aged
people with moderate income and a moderate education level. They do not spend a lot of
money on internet service and have moderate internet usage.
Cluster 2 contains customers 2, 3, 4, 7, 15, 16, 18 and 19. They are younger people
with low income and a low education level. They spend less money on internet service
and have low internet usage.
Cluster 3 contains customers 6, 9, 13 and 20. They are older people with high income
and a higher education level. They spend a lot of money on internet service and have
high internet usage.
When the cluster memberships are compared with the value of Y, it appears that the
internet service providers segment their customers into separate markets along the
variables used in this cluster analysis.
PCCW serves the low-usage group.
NWT serves the high-usage group.
Pacific serves the moderate-usage group.
Q4
a)
> fit1<-kmeans(x=eg1,centers=4,algorithm="MacQueen")
> fit1
K-means clustering with 4 clusters of sizes 11, 10, 10, 9
Cluster means:
English Math Chinese Science Music PE
1 0.75676842 -0.8289215 0.66097269 0.71634801 0.81428861 0.6973030
2 -1.54493946 -0.9524401 -1.60051497 -1.57003049 -1.53244087 -1.6212116
3 -0.04555104 0.6845897 0.09240849 0.03987673 -0.05603237 0.1334654
4 0.84227248 1.3107378 0.86781836 0.82463438 0.76972864 0.8007922
> fit2<-kmeans(x=eg1,centers=4,algorithm="Hartigan-Wong")
> fit2
K-means clustering with 4 clusters of sizes 9, 11, 10, 10
Cluster means:
English Math Chinese Science Music PE
1 0.84227248 1.3107378 0.86781836 0.82463438 0.76972864 0.8007922
2 0.75676842 -0.8289215 0.66097269 0.71634801 0.81428861 0.6973030
3 -0.04555104 0.6845897 0.09240849 0.03987673 -0.05603237 0.1334654
4 -1.54493946 -0.9524401 -1.60051497 -1.57003049 -1.53244087 -1.6212116
> table(fit1$cluster,fit2$cluster)
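     1  2  3  4
  1  0 11  0  0
  2  0  0  0 10
  3  0  0 10  0
  4  9  0  0  0
The two algorithms give the same partition and the same cluster means; only the cluster labels differ.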
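Reading such a cross-tabulation can be automated. The sketch below builds two label vectors with the same cluster sizes and label mapping as the table for `fit1` and `fit2`, then checks that the partitions agree up to relabelling. The vectors are constructed for illustration, and the helper name `same_partition` is hypothetical.

```r
# Labels shaped like fit1$cluster (sizes 11, 10, 10, 9) ...
a <- rep(1:4, times = c(11, 10, 10, 9))
# ... and the same groups relabelled as in fit2 (1 -> 2, 2 -> 4, 3 -> 3, 4 -> 1).
b <- c(2, 4, 3, 1)[a]

tab <- table(a, b)

# Hypothetical helper: TRUE when every row and every column of the
# cross-tabulation has exactly one non-zero cell.
same_partition <- function(tab) {
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

same_partition(tab)  # TRUE
```

A single object moved to another cluster would put a second non-zero cell in some row, and the check would return FALSE.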
b)
Centers by Ward’s method
cluster  English   Math  Chinese  Science   Music     PE
1          0.757  -0.83    0.661     0.72   0.814   0.70
2         -1.545  -0.95   -1.601    -1.57  -1.532  -1.62
3          0.842   1.31    0.868     0.82   0.770   0.80
4         -0.046   0.68    0.092     0.04  -0.056   0.13
c)
The class sizes are approximately the same and the numbers of students in classes 1 to 4
are 11, 10, 9 and 10 respectively.
d)
K-means clustering with 4 clusters of sizes 11, 10, 9, 10
Cluster means:
English Math Chinese Science Music PE
1 0.75676842 -0.8289215 0.66097269 0.71634801 0.81428861 0.6973030
2 -1.54493946 -0.9524401 -1.60051497 -1.57003049 -1.53244087 -1.6212116
3 0.84227248 1.3107378 0.86781836 0.82463438 0.76972864 0.8007922
4 -0.04555104 0.6845897 0.09240849 0.03987673 -0.05603237 0.1334654
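The cluster order and means in (d) match the Ward centers in (b), which is consistent with running k-means from the Ward centers as initial centers. Assuming that reading, the workflow can be sketched as below. The data are toy values invented for illustration (the data set `eg1` is not reproduced here), and the choice of `"ward.D2"` over `"ward.D"` is also an assumption.

```r
# Toy data with three well-separated groups (invented for illustration).
x <- rbind(c(0, 0),   c(0, 1),   c(1, 0),
           c(10, 10), c(10, 11), c(11, 10),
           c(20, 0),  c(20, 1),  c(21, 0))

# Hierarchical clustering by Ward's method, cut into three clusters.
hc  <- hclust(dist(x), method = "ward.D2")
grp <- cutree(hc, k = 3)

# Cluster means from the hierarchical solution, one row per cluster.
ward_centers <- apply(x, 2, function(col) tapply(col, grp, mean))

# k-means started from those means instead of random initial centers.
fit <- kmeans(x, centers = ward_centers)

table(grp, fit$cluster)  # diagonal table: same clusters, same labels
```

Starting from the Ward centers keeps the cluster labels aligned between the two solutions, so their centers and sizes can be compared row by row.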
e)
[Figure: cluster means plot; y-axis: mean; legend: cluster 1, 2, 3, 4]