
ROCK: A Robust Clustering Algorithm for

Categorical Attributes
S. Guha, R. Rastogi and K. Shim

Presented by Mark Harrison and John-Paul Cunliffe

Introduction
• Clustering, traditional approaches.

• The ROCK algorithm.

• Experiments.
– Artificial dataset.
– Real-world datasets.


Aim: Cluster Items with non-Numerical Attributes


• Clustering: Group similar items together, keep dissimilar items apart.

• We are interested in clustering based on non-numerical data—
categorical/boolean attributes.

Categorical: { black, white, red, green, blue }

Boolean: { true, false }

• Boolean attributes are merely a special case of categorical attributes.


An Example Problem
• Supermarket transactions.

• Each datapoint represents the set of items bought by a single customer.

• We wish to group customers so that those buying similar types of items appear
in the same group, e.g.:
Group A— baby-related: diapers, baby-food, toys.
Group B— expensive imported foodstuffs.
etc.

• Represent each transaction as a binary vector in which each attribute
represents the presence or absence of a particular item in the transaction
(boolean).
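
A minimal sketch of this representation in Python; the item universe and the example baskets below are invented purely for illustration.

# Hypothetical item universe and customer baskets, for illustration only.
ITEMS = ["diapers", "baby-food", "toys", "wine", "cheese"]

def to_binary_vector(basket, items=ITEMS):
    """Encode a transaction (a set of purchased items) as a 0/1 vector."""
    return [1 if item in basket else 0 for item in items]

print(to_binary_vector({"diapers", "toys"}))   # [1, 0, 1, 0, 0]
print(to_binary_vector({"wine", "cheese"}))    # [0, 0, 0, 1, 1]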


Partitional Clustering
• Attempt to divide the points into k clusters so as to optimise some function,
E.

• A common approach is to minimise the total (Euclidean) distance between
each point and its cluster's centre:

E = \sum_{i=1}^{k} \sum_{x \in C_i} |x - \mu_i|

• e.g. k-Means.


(Agglomerative) Hierarchical Clustering


• Start with all items in their own clusters.

• Repeatedly merge the two clusters that are the ‘closest’, based on some
similarity measure.

• Common examples are centroid-based methods— merge the two clusters
whose centres are the closest.


Clustering with Boolean Attributes


• This all works fine for numerical data, but how do we apply it to, for example,
our transaction data?

• Simple approach: Let true = 1, false = 0 and treat the data as numeric.

• An example with hierarchical clustering:

A = (1, 0, 0, 0, 0)    B = (0, 0, 0, 0, 1)    C = (1, 1, 1, 1, 0)
|A − B| = \sqrt{2}    |A − C| = \sqrt{3}    |B − C| = \sqrt{5}

A and B will merge but they share no items, whilst A and C do.
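
A quick check of the numbers above in plain Python; nothing is assumed beyond the three vectors on the slide.

import math

A = (1, 0, 0, 0, 0)
B = (0, 0, 0, 0, 1)
C = (1, 1, 1, 1, 0)

def euclidean(x, y):
    """Euclidean distance between two equal-length 0/1 vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean(A, B))   # sqrt(2) ~ 1.41: smallest, so A and B merge first...
print(euclidean(A, C))   # sqrt(3) ~ 1.73: ...even though A shares an item with C
print(euclidean(B, C))   # sqrt(5) ~ 2.24:    and none with B.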



• Doesn't work very well. Other problems:


– We will end up with long vectors that have only a few non-zero coordinates.
– Two transactions A and B may be similar in that they contain many items
of the same type, but have no individual items in common. Gets worse with
large clusters.


Clustering with Boolean Attributes


• Need a better similarity measure, one suggestion is the Jaccard coefficient:

J(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}

• Merge clusters with the most similar pair of points/highest average similarity.

• Considers only the similarity of two points in isolation, does not consider the
neighbourhood of the points.

• Can fail when clusters are not well-separated, and is sensitive to outliers.
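
As a sketch, the Jaccard coefficient for transactions held as Python sets; the two example baskets are invented.

def jaccard(t1, t2):
    """Jaccard coefficient |T1 ∩ T2| / |T1 ∪ T2| for two item sets."""
    union = t1 | t2
    if not union:
        return 1.0   # convention: two empty transactions count as identical
    return len(t1 & t2) / len(union)

# Two baskets sharing 2 of their 4 distinct items -> similarity 0.5.
print(jaccard({"diapers", "baby-food", "toys"}, {"diapers", "toys", "wine"}))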


Neighbours and Links


• Need a more global approach that considers the links between points.

• Use common neighbours to define links.

• If point A neighbours point C, and point B neighbours point C then the


points A and B are linked, even if they are not themselves neighbours.

• If two points belong to the same cluster they should have many common
neighbours.

• If they belong to different clusters they will have few common neighbours.


Neighbours and Links


• We need a way of deciding which points are ‘neighbours’.

• Define a similarity function, sim(p1, p2), that encodes the level of similarity
(’closeness’) between two points.

• Normalise so that sim(p1, p2) is one when p1 equals p2 and zero when they
are completely dissimilar.

• We then consider p1 and p2 to be 'neighbours' if sim(p1, p2) ≥ θ, where θ is
a user-provided parameter.


Neighbours and Links


• Then define link(p1 , p2) to be the number of common neighbours between p1
and p2.

• The similarity function can be anything— Euclidean distance, the Jaccard
coefficient, a similarity table provided by an expert, etc.

• For supermarket transactions use the Jaccard coefficient.
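
A straightforward, unoptimised sketch of these two definitions (the paper computes link counts more efficiently, essentially as a product of the neighbour adjacency matrix with itself; this version is cubic in the number of points). Since sim(p, p) = 1 ≥ θ, every point counts as a neighbour of itself here, which is one reasonable reading of the definition.

from itertools import combinations

def neighbour_matrix(points, sim, theta):
    """a[i][j] = 1 iff sim(points[i], points[j]) >= theta.

    Because sim(p, p) = 1 >= theta, the diagonal is always 1.
    """
    n = len(points)
    return [[1 if sim(points[i], points[j]) >= theta else 0 for j in range(n)]
            for i in range(n)]

def link_counts(a):
    """link[i][j] = number of common neighbours of points i and j."""
    n = len(a)
    link = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        link[i][j] = link[j][i] = sum(a[i][m] and a[j][m] for m in range(n))
    return link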


The Criterion Function


• We characterise the best set of clusters through the use of a criterion function,
E_l— the best set of clusters is that which maximises E_l.

• First approach— maximise the number of links between pairs of points in each
cluster:
E_l = \sum_{i=1}^{k} \sum_{p_q, p_r \in C_i} link(p_q, p_r)

• Keeps points that share many links in the same cluster . . .

• . . . but does not force points with few links to split into different clusters.

• May end up with all points in one big cluster.


The Criterion Function


• Improved approach— divide the actual number of links by the expected number
of links.

• Prevents points with few links being placed in the same cluster.

• If we add a new point the number of expected links increases, so if the new
point has few links El will decrease.

• Define a function f(θ), such that a point belonging to a cluster of size n has
approximately n^{f(θ)} neighbours in the cluster.

• Depends on the dataset/problem, and has to be provided by the user.


The Criterion Function


• The final criterion function:

E_l = \sum_{i=1}^{k} n_i \sum_{p_q, p_r \in C_i} \frac{link(p_q, p_r)}{n_i^{1+2f(\theta)}}

• It can be hard to find f(θ), but the authors found that even fairly inaccurate,
but reasonable, functions provide good results.

• For supermarket transactions use f(θ) = (1 − θ)/(1 + θ).
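
A direct transcription of E_l as a Python function, assuming a precomputed pointwise link matrix (for example from the neighbour/link sketch a few slides back) and clusters given as lists of point indices.

def f(theta):
    """The heuristic suggested for market-basket data: (1 - theta) / (1 + theta)."""
    return (1.0 - theta) / (1.0 + theta)

def criterion(clusters, link, theta):
    """E_l = sum_i n_i * sum_{p,q in C_i} link(p, q) / n_i^(1 + 2 f(theta)).

    Pairs are counted in both orders here, which only rescales E_l by a
    constant factor and does not change which clustering maximises it.
    """
    total = 0.0
    for cluster in clusters:
        n_i = len(cluster)
        if n_i == 0:
            continue
        within = sum(link[p][q] for p in cluster for q in cluster if p != q)
        total += n_i * within / n_i ** (1.0 + 2.0 * f(theta))
    return total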


ROCK: RObust Clustering using linKs


• A hierarchical clustering algorithm that uses links.

• Define a goodness measure based on the above criterion function:

g(C_i, C_j) = \frac{link[C_i, C_j]}{(n_i + n_j)^{1+2f(\theta)} - n_i^{1+2f(\theta)} - n_j^{1+2f(\theta)}}

• At each step of the algorithm merge the pair of clusters that maximises this
function.
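
As a small sketch, the goodness measure with f(θ) = (1 − θ)/(1 + θ) plugged in; links_between is the total number of cross-links between the two clusters, which ROCK maintains incrementally rather than recomputing each time.

def goodness(links_between, n_i, n_j, theta):
    """g(C_i, C_j): cross-links divided by the expected number of cross-links.

    The denominator, (n_i + n_j)^e - n_i^e - n_j^e with e = 1 + 2 f(theta),
    is the expected link count of the merged cluster minus the links already
    expected within C_i and C_j on their own.
    """
    e = 1.0 + 2.0 * (1.0 - theta) / (1.0 + theta)
    return links_between / ((n_i + n_j) ** e - n_i ** e - n_j ** e)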


Dealing with Categorical Attributes


• How do we handle categorical attributes with the possibility of missing data?

• One possible method is to convert them into transactions.

• For each attribute A and each value v it can take, construct an item A.v and
include it in the transaction if the attribute takes that value.

• If a value is missing, no corresponding item will be present.
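
A sketch of that conversion for a record stored as an attribute-to-value mapping; the mushroom-style record is made up, and missing values are represented by None.

def record_to_transaction(record):
    """Map each (attribute, value) pair to an item 'attribute.value'.

    Attributes with missing values contribute no item at all, so missing
    data simply shortens the transaction.
    """
    return {f"{attr}.{value}" for attr, value in record.items() if value is not None}

record = {"odor": "almond", "size": "broad", "shape": None}   # hypothetical record
print(record_to_transaction(record))   # {'odor.almond', 'size.broad'}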


Outliers
• Outliers will probably have very few or no neighbours, and as such will take
little or no part in clustering and can be discarded early on.

• Small clusters of outliers will persist in isolation until near the end of clustering,
so when we are close to reaching the required number of clusters we can stop
and weed out any small isolated clusters with little support.


Random sampling
• If we have a huge number of points we can select a random sample with which
to do the clustering.

• Once clustering is complete we assign the remaining datapoints from disk


by determining which cluster contains the most neighbours to each point
(normalised by the expected number of neighbours).
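
A rough sketch of that labelling pass. The slide only says the neighbour count is normalised by the expected number of neighbours; the n_i ** f(θ) term below is an illustrative stand-in for that normalisation, not necessarily the paper's exact expression.

def assign_point(p, clusters, sim, theta):
    """Assign an unsampled point to the cluster containing the most neighbours,
    normalised by a rough estimate of the expected neighbour count."""
    f = (1.0 - theta) / (1.0 + theta)
    best_idx, best_score = None, float("-inf")
    for idx, cluster in enumerate(clusters):
        if not cluster:
            continue
        neighbours = sum(1 for q in cluster if sim(p, q) >= theta)
        score = neighbours / (len(cluster) ** f)   # assumed normalisation
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx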


Summary

1. A random sample is drawn from the database.


2. A hierarchical clustering algorithm employing links is applied to the samples.
3. This means: iteratively merge the clusters C_i, C_j that maximise the goodness
function

g(C_i, C_j) = \frac{\text{total number of cross-links}}{\text{expected number of cross-links}}

and stop merging once there are no more links between clusters or the required
number of clusters has been reached.
4. Clusters involving only the sampled points are used to assign the remaining
data points on disk to the appropriate clusters.
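
Putting the pieces together, a compact and deliberately unoptimised sketch of steps 2 and 3: it recomputes cluster-to-cluster link totals on every iteration, whereas the paper maintains them incrementally with heaps, and it uses the Jaccard similarity and f(θ) = (1 − θ)/(1 + θ) stated for market-basket data. The four tiny baskets at the end are invented purely to exercise the code.

from itertools import combinations

def jaccard(t1, t2):
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 1.0

def rock(transactions, theta, k):
    """Greedy ROCK-style agglomeration over a (sampled) list of transactions."""
    n = len(transactions)
    # Neighbour matrix; sim(p, p) = 1 >= theta, so every point neighbours itself.
    neigh = [[1 if jaccard(transactions[i], transactions[j]) >= theta else 0
              for j in range(n)] for i in range(n)]
    # Pointwise link counts: number of common neighbours of each pair of points.
    link = [[sum(neigh[i][m] and neigh[j][m] for m in range(n)) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    e = 1.0 + 2.0 * (1.0 - theta) / (1.0 + theta)

    def goodness(ci, cj):
        cross = sum(link[p][q] for p in ci for q in cj)
        if cross == 0:
            return None                           # unlinked clusters are never merged
        return cross / ((len(ci) + len(cj)) ** e - len(ci) ** e - len(cj) ** e)

    while len(clusters) > k:
        candidates = [(g, a, b)
                      for (a, ci), (b, cj) in combinations(list(enumerate(clusters)), 2)
                      if (g := goodness(ci, cj)) is not None]
        if not candidates:
            break                                 # stop early: no links left between clusters
        _, a, b = max(candidates)
        clusters[a] += clusters.pop(b)            # a < b, so index a is unaffected
    return [[transactions[i] for i in c] for c in clusters]

# Tiny invented example: two 'baby' baskets and two 'wine' baskets.
baskets = [{"diapers", "toys"}, {"diapers", "baby-food"},
           {"wine", "cheese"}, {"wine", "olives"}]
print(rock(baskets, theta=0.2, k=2))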


Experimental Results
One artificial and three real-life data sets were clustered with ROCK and compared
to a traditional hierarchical clustering algorithm. For ROCK:

• In all of the experiments the Jaccard similarity function was used.


• Expected number of links was approximated using f(θ) = (1 − θ)/(1 + θ).

For Hierarchical Clustering:

• Categorical attributes were converted to boolean attributes with 0/1 values.


• New attribute = 1 iff “value for the original categorical attribute” = “value
corresponding to the boolean attribute”, else 0
• Outlier handling performed by eliminating clusters with only one point when
the number of clusters reduces to 1/3 of the original number.


Synthetic Data Set


• Market basket database containing 114586 transactions.
• Of these, 5456 (around 5%) are outliers, while the others belong to one of 10
clusters with sizes varying between 5000 and 15000.
• How did these transactions get constructed?

Cluster          1      2      3      4      5      6      7      8      9     10   Outliers
# Transactions  9736  13029  14832  10893  13022   7391   8564  11973  14279   5411     5456
# Items           19     20     19     19     22     19     19     21     22     19      116


Synthetic Data Set


• Clusters are defined by the items their transactions hold.
• Around 40% of the items in a cluster are shared with other clusters; around
60% are exclusive to the cluster.
• A transaction for a cluster is generated by randomly selecting items from the
set of items that define the cluster.
• Outliers are generated by randomly selecting from among the items for all the
clusters.
• The transaction size parameter has a normal distribution with an average value
of 15. Due to the normal distribution, 98% of transactions have sizes between
11 and 19.


Scalability
• Using random sampling greatly reduces the impact of data size on the
execution time of ROCK.
• What impact does the sample size have on the execution time (excl. labelling)?
• Random sample size is varied for four different settings of θ (the “threshold of
neighbourhood”).


Scalability with Respect to Random Sample Size



Scalability
• The computational complexity of ROCK is roughly quadratic with respect to
the sample size.

• For a given sample size, the performance of ROCK improves as θ is increased.

• Why?

• The reason for this is that as θ is increased, each transaction has fewer
neighbours and this makes the computation of links more efficient.


Quality
• Number of transactions misclassified by ROCK for our synthetic data set with
θ values of 0.5 and 0.6 and a range of sample sizes:

Sample Size   1000   2000   3000   4000   5000
θ = 0.5         37      0      0      0      0
θ = 0.6       8123   1051    384    104      8

• Note that the quality of clustering is better with θ = 0.5 than with θ = 0.6.

• Why?


Quality
• Random sample sizes we consider range from being less than 1% of the
database size to about 4.5%.

• Transaction sizes can be as small as 11, while the number of items defining
each cluster is approximately 20.

• A high percentage (roughly 40%) of items in a cluster are also present in other
clusters. Thus, a smaller similarity threshold is required to ensure that a larger
number of transaction pairs from the same cluster are neighbours.


Real-Life Data Sets


• 3 Real-Life Data Sets:

Data Set         Congressional Votes    Mushroom            U.S. Mutual Fund
# Records        435                    8124                795
# Attributes     16                     22                  548
Missing Values   Yes (very few)         Yes (very few)      Yes
Note             168 Republicans and    4208 edible and     Jan 4, 1993 -
                 267 Democrats          3916 poisonous      Mar 3, 1995

Table 1: Data Sets


Congressional Votes
“The Congressional voting data set was obtained from the UCI Machine Learning
Repository. It is the United States Congressional Voting Records in 1984. Each
record corresponds to one Congressman's votes on 16 issues (e.g., education
spending, crime). All attributes are boolean values, and very few contain missing
values. A classification label of Republican or Democrat is provided with each data
record. The data set contains records for 168 Republicans and 267 Democrats.”


Congressional Votes on the Rock


• Results ROCK with θ = 0.73 and Hierarchical Clustering Algorithm with
centroid-based distance function:

Traditional Hierarchical Clustering Algorithm
Cluster No    No of Republicans    No of Democrats
1             157                  52
2             11                   215

ROCK
Cluster No    No of Republicans    No of Democrats
1             144                  22
2             5                    201

Table 2: Clustering Result for Congressional Voting Data



Congressional Votes on the Rock


• Both identify two clusters, one containing a large number of Republicans and
the other containing a majority of Democrats.

• However, in the cluster for republicans found by the traditional algorithm,


around 25% of the members are democrats, while with ROCK, only 12% are
democrats.

• Why?

• The improvement is mainly due to ROCK's outlier removal scheme and its use
of links.


Congressional Votes on the Rock


Interestingly, the traditional algorithm also discovered the clusters easily. Reasons
for this are:

• Only on 3 issues did a majority of Republicans and Democrats cast the same
vote.
• On 12 of the remaining 13 issues, the majority of the Democrats voted
differently from the majority of the Republicans.
• On each of the 12 issues, the Yes/No vote had sizable support in their
respective clusters.
• Therefore the two clusters are quite well-separated.
• Furthermore, there isn’t a significant difference in the sizes of the two clusters.


Mushroom
“The mushroom data set was also obtained from the UCI Machine Learning
Repository. Each data record contains information that describes the physical
characteristics (e.g., color, odor, size, shape) of a single mushroom. A record
also contains a poisonous or edible label for the mushroom. All attributes are
categorical attributes; for instance, the values that the size attribute takes are
narrow and broad, while the values of shape can be bell, flat, conical or convex, and
odor is one of spicy, almond, foul, fishy, pungent etc. The mushroom database
has the largest number of records (that is, 8124) among the real-life data sets we
used in our experiments. The number of edible and poisonous mushrooms in the
data set are 4208 and 3916, respectively.”


Mushroom on the Rock


ROCK with θ = 0.8
Cluster #   # Edible   # Poisonous        Cluster #   # Edible   # Poisonous
1                 96             0        12                48             0
2                  0           256        13                 0           288
3                704             0        14               192             0
4                 96             0        15                32            72
5                768             0        16                 0          1728
6                  0           192        17               288             0
7               1728             0        18                 0             8
8                  0            32        19               192             0
9                  0          1296        20                16             0
10                 0             8        21                 0            36


Mushroom on the Rock


• ROCK found 21 clusters instead of 20: no pair of clusters among the 21
clusters had links between them and so ROCK could not proceed further.
• All except one (Cluster 15) of the clusters discovered by ROCK are pure
clusters in the sense that mushrooms in every cluster were either all poisonous
or all edible.
• There is a wide variance among the sizes of the clusters: 3 clusters have sizes
above 1000 while 9 of the 21 clusters have a size less than 100.
• The sizes of the largest and smallest cluster are 1728 and 8, respectively.


Mushroom on the Rock


• In general, records in different clusters could be identical with respect to some
attribute values.
• Thus, every pair of clusters generally has some common values for the
attributes.
• Thus the clusters are not well-separated.
• What does this mean for the traditional approach?


Mushroom Traditional
Traditional Hierarchical Clustering Algorithm, cluster # set to 20
Cluster #   # Edible   # Poisonous        Cluster #   # Edible   # Poisonous
1                666           478        11               120           144
2                283           318        12               128           140
3                201           188        13               144           163
4                164           227        14               198           163
5                194           125        15               131           211
6                207           150        16               201           156
7                233           238        17               151           140
8                181           139        18               190           122
9                135            78        19               175           150
10               172           217        20               168           206


Mushroom on the Rock


Observing these results we find that:

• Points belonging to different clusters are merged into a single cluster and large
clusters are split into smaller ones
• None of the clusters generated by the traditional algorithm are pure.
• Every cluster contains a sizable number of both poisonous and edible
mushrooms
• Sizes of clusters detected by traditional hierarchical clustering are fairly uniform:
More than 90% of the clusters have sizes between 200 and 400, and only 1
cluster has more than 1000 mushrooms.


Mushroom on the Rock


So the quality of the clusters generated by the traditional algorithm was very
poor. Reasons for this are:

• Clusters are not well-separated and there is a wide variance in the sizes of
clusters.
• Cluster centers tend to spread out in all the attribute values and lose information
about points in the cluster that they represent.
• Thus - as discussed earlier - distances between centroids of clusters become a
poor estimate of the similarity between them.


US Mutual Funds
“We ran ROCK on a time-series database of the closing prices of U.S. mutual funds
that were collected from the MIT AI Laboratories’ Experimental Stock Market
Data Server. The funds represented in this dataset include bond funds, income
funds, asset allocation funds, balanced funds, equity income funds, foreign stock
funds, growth stock funds, aggressive growth stock funds and small company
growth funds. The closing prices for each fund are for business dates only. Some
of the mutual funds that were launched later than Jan 4, 1993 do not have a
price for the entire range of dates from Jan 4, 1993 until Mar 3, 1995. Thus,
there are many missing values for a certain number of mutual funds in our data
set. (...) This makes it difficult to use the traditional algorithm since it is unclear
as to how to treat the missing values in the context of traditional hierarchical
clustering.”


US Mutual Funds on the Rock


Mutual Funds Clusters generated with ROCK, θ = 0.8
Cluster Name         # Funds   Ticker Symbols                    Note
Bonds 1              4         BTFTX BTFIX BTTTX BTMTX           Coupon
Bonds 2              10        CPTNX FRGVX VWESX FGOVX PRCIX     -
Bonds 3              24        FMUIX SCTFX PRXCX PRFHX VLHYX     Municipal
Bonds 4              15        FTFIX FRHIX PHTBX FHIGX FMBDX     Municipal
Bonds 6              3         VFLTX SWCAX FFLIX                 Municipal
Bonds 7              26        WPGVX DRBDX VUSTX SGZTX PRULX     Income
Financial Service    3         FIDSX FSFSX FSRBX                 -
Precious Metals      10        FDPMX LEXMX VGPMX STIVX USERX     Gold
International 2      4         PRASX FSEAX SCOPX                 Asia


US Mutual Funds on the Rock


• The Financial Service cluster has 3 funds: Fidelity Select Financial Services
(FIDSX), Invesco Strategic Financial Services (FSFSX) and Fidelity Select
Regional Banks (FSRBX) that invest primarily in banks, brokerages and
financial institutions.

• The cluster named International 2 contains funds that invest in South-east Asia
and the Pacific rim region; they are T. Rowe Price New Asia (PRASX), Fidelity
Southeast Asia (FSEAX), and Scudder Pacific Opportunities (SCOPX).

• The Precious Metals cluster includes mutual funds that invest mainly in Gold.


US Mutual Funds on the Rock


• It appears that ROCK can also be used to cluster time-series data.
• It can be employed to determine interesting distributions in the underlying
data even when there are a large number of outliers that do not belong to any
of the clusters, as well as when the data contains a sizable number of missing
values.
• A nice and desirable characteristic of this technique: it does not merge a pair
of clusters if there are no links between them.
• Thus, the desired number of clusters input to ROCK is just a hint: ROCK
may discover more than the specified number of clusters (if there are no links
between clusters) or fewer (in case certain clusters are determined to be outliers
and eliminated).


Remarks
• A new concept of links to measure the similarity/proximity between a pair of
data points with categorical attributes is investigated.
• The robust hierarchical clustering algorithm ROCK employs links and not
distances for merging clusters.
• This method naturally extends to non-metric similarity measures that are
relevant in situations where a domain expert/similarity table is the only source
of knowledge.
• The results of the experimental study with real-life data sets are encouraging.


Email & End


John-Paul Cunliffe wrote:
(...) If there is any relevant information not covered in your paper, I would
appreciate any hint you can give me on it so I can present your work as complete
as possible. (...)
Sudipto Guha:
(...) We started out to solve a problem, I believe the problem was solved and we
(I) moved on. That’s that. ROCK does work quite well in practice, I have even
seen being used on environmental data where the categories were anonymized
and the algorithm gave correct answers.
As far as I am aware other researchers have tried to take the research to the next
step, in terms of optimizations of various factors. (...)

