Cluster Analysis
Amar Saxena
AmarSaxena@gmail.com
+91.993.002.2910
Research Questions in Cluster Analysis
• How to form the taxonomy
o Creating an empirically based classification of objects.
• Reducing data
A Simple Example
Objective Versus Subjective Considerations
Scatter Diagram for Cluster Observations
[Two scatter plots: frequency of going to fast food restaurants (x-axis) versus frequency of eating out (y-axis)]
[Diagram: types of clustering procedures: hierarchical (Agglomerative, Divisive) and Two-Step]
Agglomerative Clustering
• A multi-step process
o Start with all observations as their own cluster.
o Using the selected similarity measure and agglomerative
algorithm, combine the two most similar observations into a
new cluster, now containing two observations.
o Repeat the clustering procedure using the similarity
measure/agglomerative algorithm to combine the two most
similar observations or clusters (i.e., combinations of
observations) into another new cluster.
o Continue the process until all observations are in a single cluster (a minimal sketch of this loop follows below).
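A minimal sketch of this loop in Python, using a handful of hypothetical 2-D observations and single linkage (smallest pairwise distance) as the similarity rule; all names and values are illustrative, not part of the original slides.

```python
import numpy as np

# Hypothetical 2-D observations.
X = np.array([[1.0, 2.0], [1.5, 2.2], [5.0, 8.0], [5.2, 7.8], [9.0, 1.0]])

# Start with every observation as its own cluster.
clusters = [[i] for i in range(len(X))]

while len(clusters) > 1:
    # Find the two most similar clusters (single linkage:
    # smallest pairwise Euclidean distance between members).
    best_pair, best_dist = (0, 1), np.inf
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(np.linalg.norm(X[i] - X[j])
                    for i in clusters[a] for j in clusters[b])
            if d < best_dist:
                best_dist, best_pair = d, (a, b)
    a, b = best_pair
    print(f"merge {clusters[a]} + {clusters[b]} at distance {best_dist:.2f}")
    # Combine them into a new cluster and repeat.
    clusters[a] = clusters[a] + clusters[b]
    del clusters[b]
```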
Statistics Associated with Cluster Analysis
• Agglomeration schedule. An agglomeration schedule gives information on the objects or cases being combined at each stage of a hierarchical clustering process (see the sketch below).
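For instance, the linkage matrix produced by SciPy's hierarchical clustering routine is, in effect, an agglomeration schedule; the data below are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 2.0], [1.5, 2.2], [5.0, 8.0], [5.2, 7.8], [9.0, 1.0]])

# One row per merge: the two cluster ids combined, the distance at
# which they were joined, and the size of the newly formed cluster.
Z = linkage(X, method="single")
print(Z)
```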
Rules of thumb in selecting variables
• Practical considerations
✓ Always use the “best” variables available (i.e., those with little measurement error).
• Inter-object similarity
o An empirical measure of correspondence, or resemblance,
between objects to be clustered.
o Calculated across the entire set of clustering variables to allow
for the grouping of observations and their comparison to each
other.
o City-block (Manhattan) distance between two points: |x1 - x2| + |y1 - y2| (computed in the sketch below).
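A minimal sketch computing two common distance measures for a pair of hypothetical points; the second line implements the city-block formula above.

```python
import math

x1, y1 = 2.0, 3.0   # hypothetical observation 1
x2, y2 = 5.0, 7.0   # hypothetical observation 2

euclidean = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)  # straight-line distance
city_block = abs(x1 - x2) + abs(y1 - y2)                # |x1-x2| + |y1-y2|

print(euclidean, city_block)  # 5.0 7.0
```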
Three Assumptions Underlying Cluster Analysis
1. Structure Exists
o Cluster analysis will always generate a solution, so it is assumed that a “natural” structure of objects exists for the technique to identify.
2. Representativeness of the Sample
o The obtained sample is assumed to be truly representative of the population.
3. Impact of multicollinearity
o Multicollinearity among subsets of variables acts as an implicit “weighting” of the clustering variables.
o Potential remedies:
• Reduce the variables to equal numbers in each set of correlated
measures.
• Use an appropriate distance measure, like Mahalanobis distance (see the sketch after this list).
• Factor Analysis – take one variable from each factor
• Take a proactive approach and include only cluster variables that
are not highly correlated
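As an illustration of the distance-measure remedy, a minimal sketch of the Mahalanobis distance with SciPy; the data are hypothetical, and in practice the covariance matrix would come from the full set of clustering variables.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # hypothetical clustering variables

# Inverse covariance matrix; correlated variables are effectively
# down-weighted, countering the implicit weighting from multicollinearity.
VI = np.linalg.inv(np.cov(X, rowvar=False))

print(mahalanobis(X[0], X[1], VI))
```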
Stage 4:
Deriving Clusters and Assessing Overall Fit
• Non-hierarchical
o The number of clusters is specified by the analyst, and the set of objects is then formed into that many groupings.
Linkage Methods (Hierarchical Clustering)
• Single linkage method (SLINK) or Nearest Neighbor method
o Based on minimum distance, or the nearest neighbor rule.
• Computes all pairwise dissimilarities between the elements in cluster 1 and
the elements in cluster 2, and considers the smallest of these dissimilarities
as a linkage criterion.
o At every stage, the distance between two clusters is the distance
between their two closest points.
o It tends to produce long, “loose” clusters.
[Diagram: Single linkage: minimum distance between Cluster 1 and Cluster 2]
[Diagram: Complete linkage: maximum distance between Cluster 1 and Cluster 2]
[Diagram: Average linkage: average distance between Cluster 1 and Cluster 2]
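To make the three criteria concrete, a minimal sketch with SciPy on hypothetical data; only the method argument changes between the three runs.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Three hypothetical groups of 2-D observations.
X = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in (0.0, 3.0, 6.0)])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree at 3 clusters
    print(method, labels)
```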
Variance Methods
• Variance methods generate clusters so as to minimize the within-cluster variance; the most popular is Ward’s procedure.
o Find the mean of each cluster (the mean of all the variables).
o Calculate the distance between each object in a particular cluster and that cluster’s mean, and square it (squared Euclidean distance).
o Sum these distances for all the objects.
o At each stage, the two clusters whose merger produces the smallest increase in the overall within-cluster sum of squares are combined.
o Because it minimizes the total within-cluster variance, Ward’s procedure tends to produce compact clusters (see the sketch below).
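A minimal sketch of Ward's procedure via SciPy; method="ward" merges, at each step, the pair of clusters giving the smallest increase in the total within-cluster sum of squares. Data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in (0.0, 4.0, 8.0)])

Z = linkage(X, method="ward")                    # Ward's minimum-variance criterion
labels = fcluster(Z, t=3, criterion="maxclust")  # recover a 3-cluster solution
print(labels)
```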
Centroid Method
• The distance between two clusters is the distance between their centroids (the mean vectors on the clustering variables).
• Cons of hierarchical methods generally:
o Permanent combinations – once joined, clusters are never
separated.
o Impact of outliers – outliers may appear as single-object or very small clusters.
o Large samples – hierarchical methods are not amenable to very large samples.
Nonhierarchical Clustering
K-Means
How Nonhierarchical Approaches Work
1. Determine number of clusters to be extracted
2. Specify cluster seeds.
o Analyst specified.
o Sample generated:
• The first cluster seed is the first observation in the data set with no missing values.
• Seed points are selected randomly from all observations.
3. Assign each observation to one of the seeds based on
similarity.
o Sequential Threshold: selects one seed point and develops a cluster, then selects the next seed point and develops a cluster, and so on. An observation cannot be re-assigned to another cluster following its original assignment.
o Parallel Threshold: sets all seed points simultaneously, then
develops clusters.
o Optimization: allows re-assignment of observations based on their proximity to the clusters formed during the clustering process (see the sketch after this list).
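A minimal sketch of these steps with scikit-learn's KMeans: the analyst fixes the number of clusters, supplies seed points through init, and the algorithm re-assigns observations until the solution stabilizes (the optimization approach). The data and seed values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(c, 0.4, size=(25, 2)) for c in (0.0, 3.0, 6.0)])

seeds = np.array([[0.0, 0.0], [3.0, 3.0], [6.0, 6.0]])  # analyst-specified seeds
km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)

print(km.labels_)           # cluster assignment of each observation
print(km.cluster_centers_)  # final cluster centroids
```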
Pros and Cons of Nonhierarchical Methods
• Pros
o Results are less susceptible to:
• outliers in the data,
• the distance measure used, and
• the inclusion of irrelevant or inappropriate variables.
o Can easily analyze very large data sets
• Cons
o Best results require prior knowledge of suitable seed points.
o It is difficult to guarantee an optimal solution.
o Typically generates only spherical and roughly equally sized clusters.
o Less efficient when examining a wide range of cluster solutions.
Visualizing the K-Means Clustering Process
• Silhouette Method
o For each element in a cluster, calculate the average distance to all other elements in its cluster and the average distance to all elements in each of the other clusters.
o The silhouette value is the difference between the smallest average between-cluster distance and the average within-cluster distance, divided by the larger of the two distances.
o This method helps in assessing and visualizing cluster solutions even for very large datasets (a minimal sketch follows).
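A minimal sketch of the silhouette computation with scikit-learn on hypothetical data; silhouette_score averages the per-element values, and a higher score indicates a better-separated solution.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.4, size=(30, 2)) for c in (0.0, 4.0)])

# Compare candidate cluster solutions; the true structure here has 2 groups.
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))
```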
Stage 5:
Interpretation of the Clusters
Cluster Interpretation
• Involves examining each cluster in terms of the cluster variate
to name or assign a label accurately describing the nature of
the clusters.
• The cluster centroid, a mean profile of the cluster on each
clustering variable, is particularly useful in the interpretation
stage.
o Interpretation involves examining the distinguishing
characteristics of each cluster’s profile and identifying
substantial differences between clusters.
o Cluster solutions failing to show substantial variation indicate
other cluster solutions should be examined.
• The cluster centroid should also be assessed for
correspondence with the analyst’s prior expectations based
on theory or practical experience.
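A minimal sketch of building the centroid table for interpretation; the variable names echo the earlier scatter-diagram example but are otherwise hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "eat_out_freq": rng.normal(5, 2, 100),    # hypothetical clustering variables
    "fast_food_freq": rng.normal(3, 1, 100),
})
df["cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(df)

# Cluster centroids: the mean profile of each cluster on every
# clustering variable, used to label and compare the clusters.
print(df.groupby("cluster").mean())
```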
Stage 6:
Validation and Profiling of the Clusters
Validation of the Final Cluster Solution
• Validation is essential in cluster analysis since the clusters
are descriptive of structure and require additional support
for their relevance.
• Two approaches
o Cross-validation – empirically validates a cluster solution by creating two sub-samples (randomly splitting the sample) and then comparing the two cluster solutions for consistency with respect to the number of clusters and the cluster profiles (sketched after this list).
o Criterion validity – achieved by examining differences on
variables not included in the cluster analysis but for which there
is a theoretical and relevant reason to expect variation across the
clusters.
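A minimal sketch of the cross-validation approach: split the sample, cluster each half independently, and measure how consistently the two solutions classify the same observations. The adjusted Rand index used here is one common agreement measure, not the only option.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0.0, 4.0)])

A, B = train_test_split(X, test_size=0.5, random_state=0)
km_a = KMeans(n_clusters=2, n_init=10).fit(A)  # solution from sub-sample A
km_b = KMeans(n_clusters=2, n_init=10).fit(B)  # solution from sub-sample B

# Classify sub-sample B with both solutions and check their agreement
# (1.0 = identical groupings, 0.0 = chance-level agreement).
print(adjusted_rand_score(km_a.predict(B), km_b.predict(B)))
```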
Profiling A Cluster Solution
• Describing the characteristics of each cluster on a set of additional variables (not the clustering variables) to further understand the differences between clusters.
o Examples include descriptive variables (e.g., demographics) as
well as other outcome-related measures.
o Provides analysts with insight into the nature and character of the clusters (see the sketch below).
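A minimal sketch of profiling: cross-tabulate the cluster assignments against a descriptive variable that was not used in the clustering; the demographic field here is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(9)
df = pd.DataFrame({
    "eat_out_freq": rng.normal(5, 2, 200),
    "fast_food_freq": rng.normal(3, 1, 200),
    "age_group": rng.choice(["18-29", "30-44", "45+"], 200),  # not a clustering variable
})

clustering_vars = ["eat_out_freq", "fast_food_freq"]
df["cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(df[clustering_vars])

# Profile: how the clusters break down on the descriptive variable.
print(pd.crosstab(df["cluster"], df["age_group"]))
```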