ROCK: A Robust Clustering Algorithm for Categorical Attributes
S. Guha, R. Rastogi and K. Shim
Presented by Mark Harrison and John-Paul Cunliffe
Data Mining and Exploration, 2007
Introduction
• Clustering, traditional approaches.
• Experiments.
– Artificial dataset.
– Real-world datasets.
An Example Problem
• Supermarket transactions.
• We wish to group customers so that those buying similar types of items appear in the same group, e.g.:
Group A— baby-related: diapers, baby-food, toys.
Group B— expensive imported foodstuffs.
etc...
Partitional Clustering
• Attempt to divide the points into k clusters so as to optimise some function E:

E = \sum_{i=1}^{k} \sum_{x \in C_i} |x - \mu_i|
• e.g. k-Means.
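
As a concrete illustration, here is a minimal Python sketch (not from the slides; the array names are illustrative) that evaluates the criterion E for a given assignment of points to clusters:

```python
import numpy as np

def partitional_objective(X, labels, centroids):
    """E: sum over clusters of the distances from each member point to its centre.

    X         : (n, d) array of points
    labels    : length-n integer array, labels[j] = cluster index of point j
    centroids : (k, d) array of cluster centres mu_i
    k-Means minimises the squared version of this sum by alternating
    assignment and centroid-update steps.
    """
    E = 0.0
    for i, mu in enumerate(centroids):
        members = X[labels == i]
        E += np.linalg.norm(members - mu, axis=1).sum()
    return E
```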
Hierarchical Clustering
• Repeatedly merge the two clusters that are the ‘closest’, based on some similarity measure.
• Simple approach: let true = 1, false = 0 and treat the data as numeric.
• A and B will be merged even though they share no items, whilst A and C do share items (see the sketch below).
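
A tiny hypothetical example (the items and transactions are invented here) of how the numeric treatment misleads: Euclidean distance ranks two transactions with no common items as closer than two that do share an item.

```python
import numpy as np

# Hypothetical boolean transactions over a 6-item universe (1 = bought, 0 = not).
A = np.array([1, 0, 0, 0, 0, 0])
B = np.array([0, 1, 0, 0, 0, 0])   # shares no items with A
C = np.array([1, 1, 1, 1, 0, 0])   # shares item 0 with A

print(np.linalg.norm(A - B))       # ~1.41: A and B look closest ...
print(np.linalg.norm(A - C))       # ~1.73: ... although only A and C share an item
```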
J(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}
• Merge the clusters with the most similar pair of points or the highest average similarity.
• This considers only the similarity of two points in isolation; it does not consider the neighbourhood of the points.
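
A one-function sketch of this measure, treating transactions as Python sets (the convention for two empty transactions is my own):

```python
def jaccard(t1, t2):
    """Jaccard coefficient J(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|."""
    t1, t2 = set(t1), set(t2)
    if not t1 and not t2:
        return 1.0                   # convention: two empty transactions are identical
    return len(t1 & t2) / len(t1 | t2)

print(jaccard({"diapers", "baby-food"}, {"diapers", "toys"}))   # 0.333...
```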
• If two points belong to the same cluster they should have many common
neighbours.
• If they belong to different clusters they will have few common neighbours.
• Define a similarity function, sim(p1, p2), that encodes the level of similarity (‘closeness’) between two points.
• Normalise so that sim(p1, p2) is one when p1 equals p2 and zero when they are completely dissimilar.
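
In ROCK, p1 and p2 are neighbours when sim(p1, p2) ≥ θ, and link(pq, pr) counts the neighbours they have in common. A small sketch (function and variable names are mine) that obtains the link counts by squaring the boolean neighbour matrix:

```python
import numpy as np

def link_matrix(points, sim, theta):
    """Return link[i, j] = number of common neighbours of points i and j.

    points : list of transactions (e.g. sets of items)
    sim    : similarity function, e.g. the jaccard() sketch above
    theta  : neighbourhood threshold in [0, 1]
    """
    n = len(points)
    A = np.zeros((n, n), dtype=int)            # neighbour (adjacency) matrix
    for i in range(n):
        for j in range(n):
            A[i, j] = int(sim(points[i], points[j]) >= theta)
    return A @ A                               # entry (i, j) counts common neighbours
```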
• First approach: maximise the number of links between pairs of points in each cluster:

E_l = \sum_{i=1}^{k} \sum_{p_q, p_r \in C_i} \mathrm{link}(p_q, p_r)

• . . . but this does not force points with few links to be split into different clusters.
• Normalising by the expected number of links prevents points with few links being placed in the same cluster.
• If we add a new point, the expected number of links increases; so if the new point has few links, E_l will decrease.
• Define a function f(θ) such that a point belonging to a cluster of size n_i has approximately n_i^{f(θ)} neighbours in the cluster.
E_l = \sum_{i=1}^{k} n_i \sum_{p_q, p_r \in C_i} \frac{\mathrm{link}(p_q, p_r)}{n_i^{1+2f(\theta)}}
• It can be hard to find f(θ), but the authors found that even fairly inaccurate (but reasonable) functions can provide good results.
• For supermarket transactions use f(θ) = (1 − θ)/(1 + θ) (see the sketch below).
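
A sketch of this criterion under the market-basket choice of f(θ); the cluster representation and names are mine:

```python
def f_theta(theta):
    """f(θ) = (1 - θ) / (1 + θ), the authors' choice for market-basket data."""
    return (1.0 - theta) / (1.0 + theta)

def criterion_El(clusters, link, theta):
    """E_l: links inside each cluster, divided by the expected number of links
    n_i^(1 + 2 f(θ)) and weighted by the cluster size n_i.

    clusters : list of clusters, each a list of point indices
    link     : matrix from link_matrix() above
    """
    expo = 1.0 + 2.0 * f_theta(theta)
    total = 0.0
    for members in clusters:
        n_i = len(members)
        in_cluster_links = sum(link[q, r] for q in members for r in members)
        total += n_i * in_cluster_links / n_i ** expo
    return total
```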
g(C_i, C_j) = \frac{\mathrm{link}[C_i, C_j]}{(n_i + n_j)^{1+2f(\theta)} - n_i^{1+2f(\theta)} - n_j^{1+2f(\theta)}}

• At each step of the algorithm, merge the pair of clusters that maximises this function (sketched below).
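
A matching sketch of the goodness measure (names mine); at each step the pair of clusters with the largest g value is merged, and the paper keeps these values in heaps so the best pair can be found quickly:

```python
def goodness(cross_links, n_i, n_j, theta):
    """g(C_i, C_j): links between two clusters, normalised by the expected
    number of cross-cluster links for clusters of sizes n_i and n_j."""
    expo = 1.0 + 2.0 * f_theta(theta)
    expected = (n_i + n_j) ** expo - n_i ** expo - n_j ** expo
    return cross_links / expected
```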
• For each attribute A and each value v it can take, construct an item A.v and include it in the transaction if the attribute takes that value (see the sketch below).
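
A sketch of this encoding for a record given as an attribute-to-value mapping; the dict format is an assumption, and a missing value simply contributes no item:

```python
def record_to_transaction(record):
    """Map a categorical record {attribute: value} to a set of 'A.v' items."""
    return {f"{attr}.{value}" for attr, value in record.items() if value is not None}

print(record_to_transaction({"size": "narrow", "shape": "bell", "odor": "almond"}))
# {'size.narrow', 'shape.bell', 'odor.almond'}
```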
Outliers
• Outliers will probably have very few or no neighbours, and as such will take
little or no part in clustering and can be discarded early on.
• Small clusters of outliers will persist in isolation until near the end of clustering,
so when we are close to reaching the required number of clusters we can stop
and weed out any small isolated clusters with little support.
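
A minimal sketch of the first idea: drop points that have (almost) no neighbours before clustering starts. The cut-off value and the boolean neighbour-matrix argument (the matrix A from the link sketch) are illustrative choices, not values from the paper.

```python
def drop_isolated_points(points, neighbour_matrix, min_neighbours=2):
    """Keep only points with at least min_neighbours neighbours besides themselves."""
    degrees = neighbour_matrix.sum(axis=1) - 1     # a point is always its own neighbour
    return [p for p, d in zip(points, degrees) if d >= min_neighbours]
```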
Random sampling
• If we have a huge number of points we can select a random sample with which
to do the clustering.
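
A trivial sketch; in the paper, the points left out of the sample are assigned to clusters afterwards (the labelling phase), based on their neighbours among the sampled points.

```python
import random

def draw_sample(points, sample_size, seed=42):
    """Cluster only a random sample of the data set; label the rest afterwards."""
    rng = random.Random(seed)
    return rng.sample(points, sample_size)
```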
Summary
Experimental Results
One artificial and three real-world data sets were clustered with ROCK and compared with a traditional hierarchical clustering algorithm. For ROCK:
Cluster         1      2      3      4      5      6
# Transactions  9736   13029  14832  10893  13022  7391
# Items         19     20     19     19     22     19

Cluster         7      8      9      10     Outliers
# Transactions  8564   11973  14279  5411   5456
# Items         19     21     22     19     116
Scalability
• Using random sampling greatly reduces the impact of the data size on the execution time of ROCK.
• What impact does the sample size have on the execution time (excluding labelling)?
• The random sample size is varied for four different settings of θ (the neighbourhood threshold).
Scalability
• The computational complexity of ROCK is roughly quadratic with respect to
the sample size.
• Why?
• Computing neighbours and links involves pairwise comparisons over the sampled points, so the cost grows roughly quadratically with the sample size.
• Moreover, as θ is increased, each transaction has fewer neighbours, and this makes the computation of links more efficient.
Quality
• The number of transactions misclassified by ROCK was measured for our synthetic data set with θ values of 0.5 and 0.6 and a range of sample sizes.
• Note that the quality of clustering is better with θ = 0.5 than with θ = 0.6.
• Why?
Quality
• The random sample sizes we consider range from less than 1% of the database size to about 4.5%.
• Transaction sizes can be as small as 11, while the number of items defining
each cluster is approximately 20.
• A high percentage (roughly 40%) of items in a cluster are also present in other
clusters. Thus, a smaller similarity threshold is required to ensure that a larger
number of transaction pairs from the same cluster are neighbours.
Congressional Votes
“The Congressional voting data set was obtained from the UCI Machine Learning
Repository. It is the United States Congressional Voting Records in 1984. Each
record corresponds to one Congressman’s votes on 16 issues (e.g., education
spending, crime). All attributes are boolean values, and very few contain missing
values. A classification label of Republican or Democrat is provided with each data
record. The data set contains records for 168 Republicans and 267 Democrats.”
• Why?
• The improvement is mainly due to ROCK’s outlier removal scheme and its use of links.
• Only on 3 issues did a majority of Republicans and Democrats cast the same
vote.
• On 12 of the remaining 13 issues, the majority of the Democrats voted
differently from the majority of the Republicans.
• On each of these 12 issues, the Yes and No votes each had sizable support within their respective clusters.
• Therefore the two clusters are quite well-separated.
• Furthermore, there isn’t a significant difference in the sizes of the two clusters.
Mushroom
“The mushroom data set was also obtained from the UCI Machine Learning
Repository. Each data record contains information that describes the physical
characteristics (e.g., color, odor, size, shape) of a single mushroom. A record
also contains a poisonous or edible label for the mushroom. All attributes are
categorical attributes; for instance, the values that the size attribute takes are
narrow and broad, while the values of shape can be bell, flat, conical or convex, and
odor is one of spicy, almond, foul, fishy, pungent etc. The mushroom database
has the largest number of records (that is, 8124) among the real-life data sets we
used in our experiments. The number of edible and poisonous mushrooms in the
data set are 4208 and 3916, respectively.”
Mushroom: Traditional Algorithm
Traditional hierarchical clustering algorithm with the number of clusters set to 20.
Cluster  Edible  Poisonous     Cluster  Edible  Poisonous
1        666     478           11       120     144
2        283     318           12       128     140
3        201     188           13       144     163
4        164     227           14       198     163
5        194     125           15       131     211
6        207     150           16       201     156
7        233     238           17       151     140
8        181     139           18       190     122
9        135     78            19       175     150
10       172     217           20       168     206
• Points belonging to different clusters are merged into a single cluster, and large clusters are split into smaller ones.
• None of the clusters generated by the traditional algorithm are pure.
• Every cluster contains a sizable number of both poisonous and edible mushrooms.
• Sizes of clusters detected by traditional hierarchical clustering are fairly uniform:
More than 90% of the clusters have sizes between 200 and 400, and only 1
cluster has more than 1000 mushrooms.
• Clusters are not well-separated and there is a wide variance in the sizes of
clusters.
• Cluster centers tend to spread out in all the attribute values and lose information
about points in the cluster that they represent.
• Thus, as discussed earlier, distances between centroids of clusters become a poor estimate of the similarity between them.
US Mutual Funds
“We ran ROCK on a time-series database of the closing prices of U.S. mutual funds
that were collected from the MIT AI Laboratories’ Experimental Stock Market
Data Server. The funds represented in this dataset include bond funds, income
funds, asset allocation funds, balanced funds, equity income funds, foreign stock
funds, growth stock funds, aggressive growth stock funds and small company
growth funds. The closing prices for each fund are for business dates only. Some
of the mutual funds that were launched later than Jan 4, 1993 do not have a
price for the entire range of dates from Jan 4, 1993 until Mar 3, 1995. Thus,
there are many missing values for a certain number of mutual funds in our data
set. (...) This makes it difficult to use the traditional algorithm since it is unclear
as to how to treat the missing values in the context of traditional hierarchical
clustering.”
• The cluster named International 2 contains funds that invest in South-east Asia
and the Pacific rim region; they are T. Rowe Price New Asia (PRASX), Fidelity
Southeast Asia (FSEAX), and Scudder Pacific Opportunities (SCOPX).
• The Precious Metals cluster includes mutual funds that invest mainly in Gold.
Remarks
• A new concept of links to measure the similarity/proximity between a pair of data points with categorical attributes is investigated.
• The robust hierarchical clustering algorithm ROCK employs links, not distances, for merging clusters.
• This method naturally extends to non-metric similarity measures, which are relevant in situations where a domain expert/similarity table is the only source of knowledge.
• The results of the experimental study with real-life data sets are encouraging.