
Non-hierarchical clustering techniques

• Designed to group items into a collection of K clusters.

• The number of clusters, K, may either be specified in advance or determined as part of the clustering
procedure.

• Because the matrix of distances (similarities) does not have to be determined, and the basic data do not
have to be stored during the computer run, non-hierarchical methods can be applied to much larger data
sets than hierarchical techniques can.

• Non-hierarchical methods start from either


A) an initial partition of items into groups or
B) an initial set of seed points, which will form the nuclei of clusters.

• Good choices for starting configurations should be free of overt biases. One way to start is to randomly
select seed points from among the items or to randomly partition the items into initial groups.
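
For illustration, here is one minimal way to build both kinds of starting configurations in Python; the data matrix, the generator seed, and the variable names are placeholders of my own and not part of the notes:

import numpy as np

rng = np.random.default_rng(0)
n, K = 20, 3                           # 20 items, K = 3 clusters (placeholder sizes)
X = rng.normal(size=(n, 2))            # placeholder data matrix: n items, p = 2 variables

# Option B: randomly select K of the items themselves as seed points
seed_points = X[rng.choice(n, size=K, replace=False)]

# Option A: randomly partition the items into K initial groups
initial_partition = rng.integers(0, K, size=n)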

Dr. Durba Bhattacharya, St. Xavier’s College(Autonomous), Kolkata


K-means Method
This algorithm assigns each item to the cluster having the nearest centroid (mean).

In its simplest version, the process is composed of these three steps:

1. Partition the items into K initial clusters.

2. Proceed through the list of items, assigning an item to the cluster whose centroid (mean) is nearest.

(Distance is usually computed using Euclidean distance with either standardized or unstandardized observations.)

Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.

3. Repeat Step 2 until no more reassignments take place.

Note that rather than starting with a partition of all items into K preliminary groups in Step 1, we could specify K
initial centroids (seed points) and then proceed to Step 2.

The final assignment of items to clusters is dependent upon the initial partition or the initial selection of seed points.
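
As a concrete illustration of these steps, here is a minimal Python sketch (my own, not part of the notes). It differs slightly from the description above in that it recomputes the centroids only after a full pass through the items rather than after every single reassignment, and it does not guard against a cluster becoming empty:

import numpy as np

def k_means(X, initial_labels, K, max_iter=100):
    # X: (n, p) data matrix; initial_labels: Step 1 partition of the n items into K clusters
    labels = np.asarray(initial_labels).copy()
    for _ in range(max_iter):
        # Centroid (mean) of each current cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2: assign every item to the cluster whose centroid is nearest (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when no more reassignments take place
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

Called on the four-point data set of the example below with initial_labels = [0, 0, 1, 1], this starts from the same (AB)/(CD) partition used there.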



Example: Clustering using the K-means method

Suppose we measure two variables 𝑋1 and 𝑋2 for each of four individuals A, B, C, and D:

Observations

Individual     X1     X2
A               5      3
B              -1      1
C               1     -2
D              -3     -2

Objective: To divide these items into K = 2 clusters such that the items within a cluster are closer to one another
than they are to the items in different clusters.



Step 1: We arbitrarily partition the items into two clusters, (AB) and (CD), and compute the coordinates (x̄1, x̄2) of each
cluster centroid (mean).

Coordinates of the centroid

Cluster        x̄1                    x̄2
(AB)           (5 − 1)/2 = 2         (3 + 1)/2 = 2
(CD)           (1 − 3)/2 = −1        (−2 − 2)/2 = −2
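
These Step 1 centroids can be checked with a couple of lines of Python (illustrative only):

import numpy as np

X = np.array([[ 5,  3],   # A
              [-1,  1],   # B
              [ 1, -2],   # C
              [-3, -2]])  # D

print(X[[0, 1]].mean(axis=0))   # (AB) centroid: [ 2.  2.]
print(X[[2, 3]].mean(axis=0))   # (CD) centroid: [-1. -2.]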



Step 2: We compute the Euclidean distance of each item from the group centroids and reassign each item to the
nearest group.

If an item is moved from the initial configuration, the cluster centroids (means) must be updated before proceeding.

The i-th coordinate, i = 1, 2, ..., p, of a centroid is easily updated. If x_ji denotes the i-th coordinate of the j-th item and a cluster of n items has centroid coordinate x̄_i, then

   when the j-th item is added to the cluster:      x̄_i,new = (n x̄_i + x_ji) / (n + 1)
   when the j-th item is removed from the cluster:  x̄_i,new = (n x̄_i − x_ji) / (n − 1)
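
A small sketch of these update formulas applied to the example (the function names are my own; the reassignment of B noted in the comments follows from the distances to the Step 1 centroids):

import numpy as np

def add_item(centroid, n, x):
    # New centroid when item x joins a cluster that had n items
    return (n * centroid + x) / (n + 1)

def remove_item(centroid, n, x):
    # New centroid when item x leaves a cluster that had n items
    return (n * centroid - x) / (n - 1)

# In the example, B = (-1, 1) is nearer to the (CD) centroid (-1, -2), at distance 3,
# than to its own (AB) centroid (2, 2), at distance sqrt(10) ≈ 3.16, so B is moved:
B = np.array([-1.0, 1.0])
new_A_centroid   = remove_item(np.array([ 2.0,  2.0]), 2, B)   # [5. 3.]  (only A remains)
new_BCD_centroid = add_item(np.array([-1.0, -2.0]), 2, B)      # [-1. -1.]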

• To check the stability of the clustering, it is desirable to rerun the algorithm with a new initial partition.

• A table of the cluster centroids (means) and within-cluster variances also helps to delineate group differences.



Some observations on K-means clustering:

The following are some strong arguments for not fixing the number of clusters, K, in advance:

• If two or more seed points inadvertently lie within a single cluster, their resulting clusters
will be poorly differentiated.

• The existence of an outlier might produce at least one group with very widely dispersed items.

• Even if the population is known to consist of K groups, the sampling method may be such
that data from the rarest group do not appear in the sample.

• Forcing the data into K groups might lead to nonsensical clusters.

• In cases in which a single run of the algorithm requires the user to specify K, it is always a
good idea to rerun the algorithm for several choices of K (a small sketch follows below).
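
As an illustration of that advice, one can rerun the clustering for several values of K and compare the total within-cluster sum of squares. The sketch below uses scikit-learn's KMeans purely for convenience; the notes do not prescribe any particular implementation:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[5, 3], [-1, 1], [1, -2], [-3, -2]], dtype=float)  # the example data

for k in (1, 2, 3):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # km.inertia_ is the total within-cluster sum of squared distances for this choice of K
    print(k, km.inertia_)

A sharp drop in the within-cluster sum of squares followed by a levelling off is one informal way to choose K, in keeping with the suggestion above of comparing cluster means and within-cluster variances across runs.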
