Non Hierarchical Clustering
• The number of clusters, K, may either be specified in advance or determined as part of the clustering
procedure.
• As the matrix of distances (similarities) does not have to be determined, and the basic data do not
have to be stored during the computer run, non hierarchical methods can be applied to much larger data
sets than can hierarchical techniques.
• Good choices for starting configurations should be free of overt biases. One way to start is to randomly
select seed points from among the items or to randomly partition the items into initial groups.
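For instance, both starting strategies can be sketched in a few lines of NumPy (an illustration only; the array X and all names here are hypothetical placeholders, not from the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))   # 20 hypothetical items measured on p = 2 variables
K = 3

# Option 1: randomly select K of the items to serve as seed points (initial centroids).
seed_points = X[rng.choice(len(X), size=K, replace=False)]

# Option 2: randomly partition the items into K non-empty initial groups
# and start from the group means.
groups = np.array_split(rng.permutation(len(X)), K)
initial_centroids = np.array([X[g].mean(axis=0) for g in groups])
```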
The procedure then runs as follows:
1. Partition the items into K initial clusters.
2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest.
(Distance is usually computed using Euclidean distance with either standardized or unstandardized observations.)
Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.
3. Repeat Step 2 until no more reassignments take place.
Note that rather than starting with a partition of all items into K preliminary groups in Step 1, we could specify K
initial centroids (seed points) and then proceed to Step 2.
The final assignment of items to clusters is dependent upon the initial partition or the initial selection of seed points.
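A compact sketch of the procedure described above, assuming Euclidean distance on the raw observations (the function name and defaults are illustrative, not from the original notes):

```python
import numpy as np

def kmeans_sequential(X, K, rng=None, max_pass=100):
    """Sequential K-means: visit items one at a time and update the two
    affected centroids as soon as an item changes cluster."""
    rng = np.random.default_rng(rng)
    labels = rng.permutation(len(X)) % K          # Step 1: random initial partition
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    counts = np.bincount(labels, minlength=K)
    for _ in range(max_pass):
        moved = False
        for j, x in enumerate(X):                 # Step 2: proceed through the items
            d = np.linalg.norm(centroids - x, axis=1)   # Euclidean distances to centroids
            k_new, k_old = d.argmin(), labels[j]
            if k_new != k_old and counts[k_old] > 1:    # keep every cluster non-empty
                # Update the centroid losing the item and the one receiving it.
                centroids[k_old] = (counts[k_old] * centroids[k_old] - x) / (counts[k_old] - 1)
                centroids[k_new] = (counts[k_new] * centroids[k_new] + x) / (counts[k_new] + 1)
                counts[k_old] -= 1
                counts[k_new] += 1
                labels[j] = k_new
                moved = True
        if not moved:                             # Step 3: no reassignments, so stop
            break
    return labels, centroids
```

The update inside the loop uses the incremental centroid formulas given further below, so the basic data never need to be revisited to recompute a mean.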
Suppose we measure two variables 𝑋1 and 𝑋2 for each of four individuals A, B, C, and D:
              Observations
Individuals    𝑋1    𝑋2
A               5     3
B              -1     1
C               1    -2
D              -3    -2
Objective: To divide these items into K = 2 clusters such that the items within a cluster are closer to one another
than they are to the items in different clusters.
We arbitrarily partition the items into two clusters, (AB) and (CD), and compute the coordinates (x̄1, x̄2) of each
cluster centroid (mean).
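For this initial partition, the centroid coordinates are simply the averages of the member items:

Cluster (AB): x̄1 = (5 + (−1))/2 = 2,  x̄2 = (3 + 1)/2 = 2
Cluster (CD): x̄1 = (1 + (−3))/2 = −1,  x̄2 = (−2 + (−2))/2 = −2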
If an item is moved from the initial configuration, the cluster centroids (means) must be updated before proceeding.
The i-th coordinate, i = 1, 2, ..., p, of the centroid is easily updated using the formulas:
new x̄_i = (n·x̄_i + x_ji)/(n + 1)   if the j-th item is added to a group,
new x̄_i = (n·x̄_i − x_ji)/(n − 1)   if the j-th item is removed from a group,
where n is the number of items in the "old" group with centroid coordinate x̄_i and x_ji is the i-th coordinate of the j-th item.
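As a quick illustration with the example above (not part of the original notes): if item B = (−1, 1) were removed from cluster (AB), whose centroid is (2, 2) with n = 2, the updated coordinates would be (2·2 − (−1))/(2 − 1) = 5 and (2·2 − 1)/(2 − 1) = 3, i.e. the coordinates of the remaining item A.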
• To check the stability of the clustering, it is desirable to rerun the algorithm with a new initial partition.
• A table of the cluster centroids (means) and within-cluster variances also helps to delineate group differences.
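Reusing the kmeans_sequential sketch and the four-item example above (names again illustrative), the stability check and the summary table might be produced as follows:

```python
import numpy as np

X = np.array([[5., 3.], [-1., 1.], [1., -2.], [-3., -2.]])   # items A, B, C, D

# Two runs from different random initial partitions; matching cluster
# memberships (up to relabelling) suggest a stable solution.
labels_a, _ = kmeans_sequential(X, K=2, rng=0)
labels_b, _ = kmeans_sequential(X, K=2, rng=1)

# Summary table: centroid and within-cluster variance for each cluster
# (population form, so a single-item cluster simply shows variance 0).
for k in range(2):
    members = X[labels_a == k]
    print(f"cluster {k}: centroid = {members.mean(axis=0)}, "
          f"variance = {members.var(axis=0)}")
```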
The following are strong arguments for not fixing the number of clusters, K, in advance:
• If two or more seed points inadvertently lie within a single cluster, their resulting clusters
will be poorly differentiated.
• The existence of an outlier might produce at least one group with widely dispersed items.
• Even if the population is known to consist of K groups, the sampling method may be such
that data from the rarest group do not appear in the sample.
• In cases in which a single run of the algorithm requires the user to specify K, it is always a
good idea to rerun the algorithm for several choices of K, as in the sketch following this list.
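Following that advice on the small example above (again reusing X and kmeans_sequential; a sketch, not a definitive recipe), one might compare the total within-cluster sum of squares across several choices of K:

```python
# The criterion always shrinks as K grows, so look for the value of K
# beyond which further increases buy little improvement.
for K in (1, 2, 3):
    labels_k, centroids_k = kmeans_sequential(X, K, rng=0)
    wss = sum(((X[labels_k == k] - centroids_k[k]) ** 2).sum() for k in range(K))
    print(f"K = {K}: within-cluster sum of squares = {wss:.1f}")
```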