
UNIT-V
Cluster Analysis
Cluster Analysis: Basic Concepts and Algorithms
 Overview: What Is Cluster Analysis? Different Types of Clustering, Different Types of Clusters
 K-means: The Basic K-means Algorithm, K-means Additional Issues, Bisecting K-means, Strengths and Weaknesses
 Agglomerative Hierarchical Clustering: Basic Agglomerative Hierarchical Clustering Algorithm
 DBSCAN: Traditional Density Center-Based Approach, DBSCAN Algorithm, Strengths and Weaknesses
Cluster Analysis
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth observation database
 Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
 City planning: Identifying groups of houses according to their house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Clustering Applications and Multidisciplinary Efforts
 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters or support other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns
Quality - Good Clustering
 A good clustering method will produce high-quality clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the similarity measure used by the method and its implementation
 The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
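To make the intra-class / inter-class idea concrete, here is a small sketch (not from the original slides; the data and function name are illustrative) that computes the average within-cluster and between-cluster distances for a labelled toy data set. A good clustering keeps the first number small and the second large.

```python
import numpy as np

def intra_inter_distances(X, labels):
    """Average pairwise distance within clusters and between clusters."""
    intra, inter = [], []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])
            (intra if labels[i] == labels[j] else inter).append(d)
    return np.mean(intra), np.mean(inter)

# Toy data: two compact groups -> low intra-cluster, high inter-cluster distance
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])
labels = [0, 0, 0, 1, 1, 1]
print(intra_inter_distances(X, labels))   # roughly (0.34, 9.9) for this toy data
```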
Different Types of Clustering

 Hierarchical versus Partitional

 Exclusive versus Overlapping versus Fuzzy

 Complete versus Partial

Different Types of Clusters

 Well-separated Cluster

 Prototype-Based cluster

 Graph-Based Cluster

 Density-Based Cluster

 Shared-property or Conceptual Clusters

Different Types of Clusters
 Well-separated clusters: Each point is closer to all of the points in its cluster than to any point in another cluster.
 Prototype-based clusters: Each point is closer to the center of its cluster than to the center of any other cluster.
 Graph-based clusters: Each point is closer to at least one point in its cluster than to any point in another cluster.
 Density-based clusters: Clusters are regions of high density separated by regions of low density.
 Shared-property (conceptual) clusters: Points in a cluster share some general property that derives from the entire set of points.
Requirements of Clustering in Data
Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Major Clustering Approaches
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
 A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance between Clusters
 Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)
 Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
 Medoid: one chosen, centrally located object in the cluster
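As a concrete illustration (added here; the two clusters and the pairwise helper are made up for the example), the following sketch evaluates the single-link, complete-link, average, and centroid distances between two small clusters of 2-D points.

```python
import numpy as np

def pairwise(Ki, Kj):
    """All pairwise Euclidean distances between points of cluster Ki and cluster Kj."""
    return np.array([[np.linalg.norm(p - q) for q in Kj] for p in Ki])

Ki = np.array([[1.0, 1.0], [2.0, 1.0]])
Kj = np.array([[6.0, 5.0], [7.0, 6.0]])

D = pairwise(Ki, Kj)
print("single link  :", D.min())     # smallest pairwise distance
print("complete link:", D.max())     # largest pairwise distance
print("average link :", D.mean())    # average pairwise distance
print("centroid     :", np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)))  # distance between centroids
```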
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
 Centroid: the "middle" of a cluster

    C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}

 Radius: square root of the average distance from any point of the cluster to its centroid

    R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}

 Diameter: square root of the average mean squared distance between all pairs of points in the cluster

    D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}
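A short sketch of these three quantities (illustrative, not part of the original slides), applying the definitions above to a small numerical cluster:

```python
import numpy as np

def centroid(points):
    return points.mean(axis=0)

def radius(points):
    c = centroid(points)
    # sqrt of the average squared distance from each point to the centroid
    return np.sqrt(((points - c) ** 2).sum(axis=1).mean())

def diameter(points):
    n = len(points)
    # sqrt of the average squared distance over all pairs of points (i == j terms are zero)
    sq = sum(((p - q) ** 2).sum() for p in points for q in points)
    return np.sqrt(sq / (n * (n - 1)))

cluster = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 2.0]])
print(centroid(cluster), radius(cluster), diameter(cluster))
```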
K-means
Partitioning Algorithms: Basic Concept
 Partitioning method: Construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized
 Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen'67): Each cluster is represented by the center of the cluster
 k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
 Given k, the k-means algorithm is implemented in four steps:
 1. Partition objects into k nonempty subsets
 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
 3. Assign each object to the cluster with the nearest seed point
 4. Go back to Step 2; stop when there are no more new assignments
Basic K-means algorithm

1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until Centroids do not change.
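A minimal Python sketch of this pseudocode (illustrative; it simply takes the first K points as the initial centroids, whereas a practical implementation would choose them more carefully):

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    centroids = X[:k].copy()                          # Step 1: select K points as initial centroids
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the centroid of each cluster (keep the old one if a cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):     # Step 5: stop when centroids do not change
            break
        centroids = new_centroids
    return centroids, labels
```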
The K-Means Clustering Method: Example
 (Figure: with K = 2, two objects are arbitrarily chosen as the initial cluster centers; each object is assigned to the most similar center, the cluster means are updated, and objects are reassigned until the assignments no longer change.)
Table of notation

K-Means Method
 Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
 Weaknesses
 Applicable only when the mean is defined; what about categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
Example
 Cluster the following eight points (with (x, y) representing locations) into three clusters:
 A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
 Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
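The worked iterations shown on the original slides are not reproduced above, but the computation itself can be sketched as follows (illustrative code; it runs plain K-means from the stated initial centers and prints whatever clusters it converges to):

```python
import numpy as np

# The eight points A1..A8 and the initial centers A1, A4, A7 given above.
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)

for _ in range(100):
    # Assign each point to its nearest center (Euclidean distance).
    labels = np.linalg.norm(points[:, None] - centers[None, :], axis=2).argmin(axis=1)
    # Recompute each center as the mean of its assigned points
    # (for this particular data set no cluster ever becomes empty).
    new_centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

for j in range(3):
    print("cluster", j + 1, ":", points[labels == j].tolist(), "center", centers[j])
```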
K-Means
 Advantages
 1. Easy to implement.
 2. With a large number of variables, K-means may be computationally faster than hierarchical clustering (if K is small).
 Disadvantages
 1. Difficult to predict the number of clusters (K-value).
 2. Can converge on local minima.
 3. Sensitive to outliers.
K-means: Additional Issues
Handling Empty Clusters:
 One of the problems with the basic K-means algorithm is that empty clusters can be obtained if no points are allocated to a cluster during the assignment step.
 One approach is to choose as a replacement centroid the point that is farthest away from any current centroid.
 Another approach is to choose the replacement centroid from the cluster that has the highest SSE.
Reducing the SSE with Postprocessing
 Two strategies that decrease the total SSE by increasing the number of clusters are the following:
 Split a cluster: The cluster with the largest SSE is chosen and split
 Introduce a new cluster centroid: The point that is farthest from any cluster center is chosen; this point is easy to determine if we keep track of the SSE contributed by each point. Alternatively, choose randomly from all points or from the points with the highest SSE.
 Two strategies that decrease the number of clusters, while trying to minimize the increase in total SSE, are the following:
 Disperse a cluster: Remove the centroid that corresponds to the cluster and reassign its points to other clusters
 Merge two clusters: The clusters with the closest centroids are chosen, or the two clusters whose merge results in the smallest increase in total SSE
 Updating Centroids Incrementally: The centroids can also be updated incrementally, after each assignment of a point to a cluster
Bisecting K-means
 The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm.
 To obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
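A rough sketch of the idea (illustrative only; it uses scikit-learn's KMeans for the 2-way splits, which is an assumption about the available tooling, and always splits the cluster with the largest SSE, one common selection rule):

```python
import numpy as np
from sklearn.cluster import KMeans   # assumed available; used only for the 2-way splits

def bisecting_kmeans(X, K):
    """Repeatedly split the cluster with the largest SSE until K clusters exist."""
    clusters = [X]
    while len(clusters) < K:
        # Pick the cluster with the largest SSE (sum of squared distances to its mean).
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))    # assumes this cluster has >= 2 points
        # Bisect it with ordinary 2-means.
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(target)
        clusters.append(target[km.labels_ == 0])
        clusters.append(target[km.labels_ == 1])
    return clusters

# Usage on synthetic data with three well-separated groups.
X = np.vstack([np.random.randn(20, 2),
               np.random.randn(20, 2) + 5,
               np.random.randn(20, 2) + 10])
for i, c in enumerate(bisecting_kmeans(X, K=3)):
    print("cluster", i, "has", len(c), "points")
```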
Bisecting K-means
 (Figures: bisecting K-means algorithm and a worked example.)
Hierarchical Clustering
 Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
 (Figure: agglomerative clustering (AGNES) merges objects a, b, c, d, e step by step into ab, de, cde, and finally abcde; divisive clustering (DIANA) proceeds in the reverse order, splitting abcde back into the individual objects.)
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., S-Plus
 Uses the Single-Link method and the dissimilarity matrix
 Merges nodes that have the least dissimilarity
 Goes on in a non-descending fashion
 Eventually all nodes belong to the same cluster
 (Figure: three scatter plots showing the data points being progressively merged into larger clusters.)
Dendrogram: Shows How the Clusters are Merged
 Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
 A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Defining Proximity between Clusters
 Cluster proximity is what differentiates the various agglomerative hierarchical techniques.
 Agglomerative hierarchical clustering techniques such as MIN, MAX, and Group Average come from a graph-based view of clusters.
 MIN defines cluster proximity as the proximity between the closest two points that are in different clusters.
 MAX takes the proximity between the farthest two points in different clusters to be the cluster proximity.
 The Group Average technique defines cluster proximity to be the average pairwise proximity of all pairs of points from different clusters.
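For illustration (not part of the original slides), SciPy's hierarchical clustering routines expose exactly these MIN, MAX, and Group Average choices as the 'single', 'complete', and 'average' linkage methods:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

for method in ("single", "complete", "average"):       # MIN, MAX, Group Average
    Z = linkage(X, method=method)                      # merge table: clusters merged, distance, size
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
    print(method, labels)
```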
Agglomerative Clustering: Example
 (Figures: step-by-step agglomerative merging for a worked example, the resulting dendrograms, and a plot of the data.)
Complete Link
 (Figure: complete-link example.)
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., S-Plus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
 (Figure: three scatter plots showing a single cluster being progressively split into smaller clusters.)
Ward's Method and Centroid Methods
 In Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when the two clusters are merged.
 Although Ward's method is somewhat distinct from the other hierarchical techniques, it can be shown mathematically that it is very similar to the group average method when the proximity between two points is taken to be the square of the distance between them.
 Centroid methods calculate the proximity between two clusters as the distance between the centroids of the clusters, similar to K-means.
Strengths and Weaknesses
 Agglomerative hierarchical clustering algorithms can produce better-quality clusters.
 However, they are expensive in terms of their computational and storage requirements.
 The fact that all merges are final can also cause trouble for noisy, high-dimensional data, such as document data.
 These two problems can be addressed to some degree by first partially clustering the data using another technique, such as K-means.
Density-Based Clustering Methods
 Clustering based on density (a local cluster criterion), such as density-connected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD'96)
 OPTICS: Ankerst, et al. (SIGMOD'99)
 DENCLUE: Hinneburg & D. Keim (KDD'98)
 CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
Density-Based Clustering: Basic Concepts
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-neighbourhood of that point
 NEps(p): {q belongs to D | dist(p, q) <= Eps}
 Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
 p belongs to NEps(q)
 core point condition: |NEps(q)| >= MinPts
 (Figure: p lies in the Eps-neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm.)
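A small sketch of these definitions (the data, Eps, and MinPts values are chosen arbitrarily for illustration): the Eps-neighbourhood of a point and the core-point test.

```python
import numpy as np

def eps_neighborhood(D, p, eps):
    """Indices q with dist(D[p], D[q]) <= eps (the point itself included, as in NEps(p))."""
    return [q for q in range(len(D)) if np.linalg.norm(D[p] - D[q]) <= eps]

def is_core(D, p, eps, min_pts):
    """Core point condition: |NEps(p)| >= MinPts."""
    return len(eps_neighborhood(D, p, eps)) >= min_pts

D = np.array([[0, 0], [0.5, 0], [0, 0.5], [0.4, 0.4], [5, 5]])
print(eps_neighborhood(D, 0, eps=1.0))       # neighbours of point 0
print(is_core(D, 0, eps=1.0, min_pts=4))     # True: 4 points fall within Eps of point 0
print(is_core(D, 4, eps=1.0, min_pts=4))     # False: the isolated point is not a core point
```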
Classification of Points According to Center-Based Density
 Core points: These points are in the interior of a density-based cluster. A point is a core point if the number of points within a given neighborhood around the point, as determined by the distance function and a user-specified distance parameter Eps, exceeds a certain threshold MinPts, which is also a user-specified parameter.
 Border points: A border point is not a core point, but falls within the neighborhood of a core point. A border point can fall within the neighborhoods of several core points.
 Noise points: A noise point is any point that is neither a core point nor a border point.
Density-Reachable and Density-Connected
 Density-reachable:
 A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
 Density-connected:
 A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
 Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases with noise
 (Figure: core, border, and outlier (noise) points illustrated with Eps = 1 cm and MinPts = 5.)
DBSCAN: The Algorithm
 Arbitrarily select a point p.
 Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
 Continue the process until all of the points have been processed.
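A from-scratch sketch of this procedure (illustrative; a production implementation such as scikit-learn's DBSCAN would use a spatial index instead of the brute-force neighbour search below):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def region_query(i):
        # Brute-force Eps-neighbourhood of point i (includes i itself).
        return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

    n = len(X)
    labels = np.full(n, -1)                 # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        seeds = region_query(p)
        if len(seeds) < min_pts:            # p is not a core point; leave it as noise for now
            continue
        labels[p] = cluster_id              # start a new cluster from core point p
        while seeds:                        # expand the cluster via density-reachability
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id      # q is reached from the cluster (border or core)
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:   # q is also a core point: keep expanding
                    seeds.extend(q_neighbors)
        cluster_id += 1
    return labels

# Toy example: two dense groups and one isolated noise point.
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.0],
              [4.0, 5.0]])
print(dbscan(X, eps=0.5, min_pts=3))   # e.g. [0 0 0 1 1 1 -1]
```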
Strengths and Weaknesses
 Because DBSCAN uses a density-based definition of a cluster, it is relatively resistant to noise and can handle clusters of arbitrary shapes and sizes.
 DBSCAN can find many clusters that could not be found using K-means.
 DBSCAN has trouble when the clusters have widely varying densities. It also has trouble with high-dimensional data, because density is more difficult to define for such data.
 DBSCAN can be expensive when the computation of nearest neighbors requires computing all pairwise proximities, as is usually the case for high-dimensional data.
Outlier
 What are outliers?
 Objects that are considerably dissimilar from the remainder of the data
 Example: Sports: Michael Jordan, Wayne Gretzky, ...
 Problem: Define and find outliers in large data sets
 Applications:
 Credit card fraud detection
 Telecom fraud detection
 Customer segmentation
 Medical analysis
