
CT075-3-2 DTM Topic 10 Cluster Analysis

This document discusses cluster analysis and the k-means clustering algorithm. It defines cluster analysis as grouping a set of data objects into clusters based on similarity. The k-means algorithm partitions observations into k clusters by minimizing distances between observations and assigned cluster centers, iteratively updating cluster centers until convergence. An example applies k-means to movie rating data to generate two clusters.

Data Management

CT075-3-2

Cluster Analysis
Learning Outcomes

By the end of this lecture, you should be able to:
• Understand the clustering concept
• Apply some algorithms used for clustering
• Explain the partitioning algorithm k-means


Key Terms you must be able to use

• If you have mastered this topic, you should be able to use the following terms correctly in your assignments and exams:
– Clustering
– Cluster analysis
– K-means



What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no
predefined classes
• Typical applications
– As a stand-alone tool to get insight into data
distribution
– As a preprocessing step for other algorithms
General Applications of Clustering

• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature
spaces
– detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar
access patterns
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Land use: Identification of areas of similar land use in an
earth observation database
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• City-planning: Identifying groups of houses according to
their house type, value, and geographical location
What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
– high intra-class similarity (within a cluster)
– low inter-class similarity (between clusters)
• The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
Typical Requirements of Clustering in Data Mining
• Scalability: work well on large data sets, not only on small ones
• Ability to deal with different types of attributes
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Ability to handle high dimensionality
• Interpretability and usability
Partitioning Algorithms: Basic Concept

• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
– k-means: Each cluster is represented by the center of the cluster.
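The quantity that k-means is usually described as optimizing is the within-cluster sum of squared distances. As a sketch, with m_j denoting the center of cluster C_j (this notation is assumed here for illustration and does not come from the slides):

```latex
E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - m_j \rVert^{2}
```

A smaller E means that records lie closer to the center of their own cluster.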
The K-Means Clustering Method
The k-means algorithm is implemented in 5 steps:
• Step 1: Ask the user how many clusters k the data set should be
partitioned into.
• Step 2: Randomly assign k records to be the initial cluster center
locations.
• Step 3: For each record, find the nearest cluster center. Thus, in a
sense, each cluster center “owns” a subset of the records, thereby
representing a partition of the data set. We therefore have k clusters,
C1,C2, . . . ,Ck .
• Step 4: For each of the k clusters, find the cluster centroid, and
update the location of each cluster center to the new value of the
centroid.
• Step 5: Repeat steps 3 and 4 until convergence or termination.
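The five steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `initial_centers` and `seed` parameters, the choice of Euclidean distance, the tie-breaking rule and the "no record changed cluster" convergence test are all assumptions made for this sketch. The sample data are the movie ratings used in the worked example of this lecture.

```python
import math
import random


def euclidean(p, q):
    """Straight-line distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(records, k, initial_centers=None, distance=euclidean,
            max_iter=100, seed=0):
    """Illustrative k-means; names and defaults are assumptions of this sketch."""
    # Step 1: k is supplied by the caller.
    # Step 2: pick k records as the initial cluster centers.
    if initial_centers is None:
        centers = random.Random(seed).sample(records, k)
    else:
        centers = list(initial_centers)

    labels = None
    for _ in range(max_iter):  # Step 5: repeat steps 3 and 4
        # Step 3: assign every record to its nearest center
        # (a tie goes to the lower-numbered cluster).
        new_labels = [
            min(range(k), key=lambda j: distance(r, centers[j]))
            for r in records
        ]
        if new_labels == labels:  # convergence: no record changed cluster
            break
        labels = new_labels
        # Step 4: move each center to the centroid (mean) of its records.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers


# Ratings (A, B) of the seven movies M1..M7 from the example in this lecture.
movies = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
labels, centers = k_means(movies, k=2, initial_centers=[movies[0], movies[3]])
print(labels)   # cluster index (0 or 1) for M1..M7
print(centers)  # final cluster centers
```

Passing `initial_centers` makes the run repeatable; leaving it out selects k records at random, as Step 2 describes.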
The K-Means Clustering Method
• Example
[Figure: four scatter-plot panels, each with both axes running from 0 to 10]
Equations required

Euclidean distance: used to find the cluster center nearest to each record.
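For a record x = (x_1, ..., x_n) and a cluster center c = (c_1, ..., c_n), the Euclidean distance can be written as follows (the symbols are chosen here for illustration):

```latex
d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^{2}}
```

For example, for x = (3.5, 5.0) and c = (5.0, 7.0), d = sqrt(1.5^2 + 2.0^2) = sqrt(6.25) = 2.5.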



Example
k-means algorithm: consider the following data set consisting of the
ratings of two variables on each of seven movies.

Movie A B
M1 1.0 1.0
M2 1.5 2.0
M3 3.0 4.0
M4 5.0 7.0
M5 3.5 5.0
M6 4.5 5.0
M7 3.5 4.5
Example

Steps 1 and 2: Let's choose two seeds at random

Movie  A    B
M1     1.0  1.0
M4     5.0  7.0
Example

Steps 3 & 4: Compute the distance from each movie to each seed using the two attributes. For simplicity, the distance used here is the sum of absolute differences (Manhattan distance) rather than the Euclidean distance.
Example

DISTANCE FROM CLUSTERS

Seeds: C1 = (1, 1), C2 = (5, 7)

Movie  A    B    Distance to C1  Distance to C2  Allocation to nearest cluster
M1     1    1    0               10              C1
M2     1.5  2    1.5             8.5             C1
M3     3    4    5               5               C1, C2
M4     5    7    10              0               C2
M5     3.5  5    6.5             3.5             C2
M6     4.5  5    7.5             2.5             C2
M7     3.5  4.5  6               4               C2
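The distances in this table can be reproduced with a short Python sketch. The helper names `manhattan` and `allocate` are made up for this illustration; the data and seeds are those of the example. Ties are reported rather than broken, which is why M3 is listed under both clusters.

```python
def manhattan(p, q):
    """Sum of absolute differences between two equal-length tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))


def allocate(points, centers):
    """Return {name: (distances, nearest)}; a tie lists every nearest center."""
    table = {}
    for name, p in points.items():
        dists = {c: manhattan(p, xy) for c, xy in centers.items()}
        best = min(dists.values())
        nearest = [c for c, d in dists.items() if d == best]
        table[name] = (dists, nearest)
    return table


movies = {
    "M1": (1.0, 1.0), "M2": (1.5, 2.0), "M3": (3.0, 4.0), "M4": (5.0, 7.0),
    "M5": (3.5, 5.0), "M6": (4.5, 5.0), "M7": (3.5, 4.5),
}
seeds = {"C1": movies["M1"], "C2": movies["M4"]}

# One row per movie: name, distance to C1, distance to C2, nearest cluster(s).
for name, (dists, nearest) in allocate(movies, seeds).items():
    print(name, dists["C1"], dists["C2"], ", ".join(nearest))
```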
Example

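STEP 5

New cluster centers (the centroids of the movies allocated to each cluster), shown with the original seeds for comparison. The tied movie M3 is counted in both clusters here.

       A     B
C1     1.83  2.33
C2     3.9   5.1
SEED1  1     1
SEED2  5     7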
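The centroid values can be checked with a small Python sketch. The `centroid` helper and the membership lists are assumptions made for this illustration; counting the tied movie M3 in both clusters is what reproduces the values 1.83, 2.33, 3.9 and 5.1 shown for C1 and C2.

```python
def centroid(points):
    """Attribute-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


movies = {
    "M1": (1.0, 1.0), "M2": (1.5, 2.0), "M3": (3.0, 4.0), "M4": (5.0, 7.0),
    "M5": (3.5, 5.0), "M6": (4.5, 5.0), "M7": (3.5, 4.5),
}

# Membership after the first allocation; M3 was tied, so it appears twice.
c1_members = ["M1", "M2", "M3"]
c2_members = ["M3", "M4", "M5", "M6", "M7"]

c1 = centroid([movies[m] for m in c1_members])
c2 = centroid([movies[m] for m in c2_members])
print([round(v, 2) for v in c1])  # [1.83, 2.33]
print([round(v, 2) for v in c2])  # [3.9, 5.1]
```

If M3 were assigned to C1 only, so that each record belongs to exactly one cluster, the same function would give C2 = (4.125, 5.375).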
Example
DISTANCE FROM CLUSTERS

Centers: C1 = (1.83, 2.33), C2 = (3.9, 5.1)

Movie  A    B    Distance to C1  Distance to C2  Allocation to the nearest cluster
M1     1    1    2.16            7               C1
M2     1.5  2    0.66            5.5             C1
M3     3    4    2.84            2               C2
M4     5    7    7.84            3               C2
M5     3.5  5    4.34            0.5             C2
M6     4.5  5    5.34            0.7             C2
M7     3.5  4.5  3.84            1               C2

Cluster 1 -> M1, M2
Cluster 2 -> M3, M4, M5, M6, M7
Summary

• Clustering algorithm and its applications


• The k-means algorithm.
References

• Larose, D. T. (2005), Discovering Knowledge in Data, Wiley.
