
Cluster Analysis

Amar Saxena
AmarSaxena@gmail.com
+91.993.002.2910

25th Oct 2024


What is Cluster Analysis?

Cluster Analysis as a Multivariate Technique

Conceptual Development with Cluster Analysis

Necessity of Conceptual Support in Cluster Analysis


What is Cluster Analysis?
• Definition
o Groups objects based on the characteristics they possess,
o so that objects in the same cluster are similar to each other and
different from objects in all the other clusters.
• Cluster analysis is a group of multivariate techniques.

• Cluster Analysis, Discriminant Analysis and Logistic Regression are
all concerned with classification.
However, Discriminant Analysis and Logistic Regression require:
o Prior knowledge of the cluster or group membership for each case
(Supervised Learning).
In contrast, for cluster analysis there is no a priori information about
the group or cluster membership for any of the objects.
The essence of all clustering approaches is the classification of data
as suggested by “natural” groupings of the data themselves.
Clusters are suggested by the data, not defined a priori.
Cluster Analysis …
• Conceptual Development with Cluster Analysis
o Data reduction – reduces a population to a smaller number of
homogeneous groups.
o Hypothesis generation – a means of developing or assessing
hypotheses.

• Necessity of Conceptual Support
o Cluster analysis is descriptive, a-theoretical, and non-inferential.
o Cluster analysis will always create clusters, regardless of the
actual existence of any structure.
o The cluster solution is not generalizable because it is totally
dependent upon the variables used for analysis (the cluster variate).
Uses of Cluster Analysis

• Segmenting the market

• Understanding buyer behaviour

• Identifying new opportunities

• Selecting test markets

• Reducing data

Research Questions in Cluster Analysis
• How to form the taxonomy
o Creating an empirically based classification of objects.

• How to simplify the data
o Grouping observations for further analysis.

• Which relationships can be identified
o Revealing relationships among the observations within and
between groups.
How Does Cluster Analysis Work?

A Simple Example
Objective Versus Subjective Considerations
Scatter Diagram for Cluster Observations
[Scatterplot: frequency of eating out (vertical axis) versus frequency of
going to fast food restaurants (horizontal axis).]

Fundamental Question: How Many Clusters?
Potential Two, Three and Four Cluster Solutions
[The same scatterplot shown with candidate two-, three- and four-cluster
groupings drawn on the points.]

Which one is correct?

Between-Cluster and Within-Cluster Variation

Between-Cluster Variation = Maximize
Within-Cluster Variation = Minimize
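To make the two criteria concrete, here is a minimal sketch (not from the original slides; the toy data and the two-cluster labels are invented for illustration) that computes both quantities for a labelled dataset:

```python
# Within-cluster variation (minimize) and between-cluster variation (maximize)
# for a hypothetical two-cluster assignment of six observations.
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [6.0, 8.5], [1.2, 2.2], [5.5, 7.5]])
labels = np.array([0, 0, 1, 1, 0, 1])        # hypothetical cluster membership

grand_mean = X.mean(axis=0)
within, between = 0.0, 0.0
for k in np.unique(labels):
    members = X[labels == k]
    centroid = members.mean(axis=0)
    within += ((members - centroid) ** 2).sum()                      # minimize
    between += len(members) * ((centroid - grand_mean) ** 2).sum()   # maximize

print(f"within-cluster variation:  {within:.2f}")
print(f"between-cluster variation: {between:.2f}")
```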
Three Basic Questions In A Cluster Analysis
• How do we measure similarity?
o A method of simultaneously comparing observations on the
clustering variables is essential.
o Several methods are possible:
• Correlation between objects, or
• Distance in two-dimensional space.

• How do we form clusters?
o Group the observations that are most similar into a cluster, and
o Determine the cluster group membership of each observation
for each set of clusters formed.

• How many groups do we form? A trade-off:
Fewer clusters with less homogeneity within clusters,
vs.
A larger number of clusters with more within-group homogeneity.
Objective Vs Subjective Considerations
The main criticism stems from two key elements:

1. The analyst has to make judgements about:
• selecting the characteristics to be used,
• selecting the methods of combining clusters, and
• even interpreting the cluster solutions, which makes any
final solution unique to that analyst.

2. Subjectivity in selecting the final solution:
• There is no method to determine the one optimal solution.
• Thus it falls to the analyst to make the final decision on the
number of clusters and to accept it as the final solution.
A Classification of Clustering Procedures
Clustering Procedures
• Hierarchical
o Agglomerative
- Linkage Methods: Single Linkage, Complete Linkage, Average Linkage
- Variance Methods: Ward’s Method
- Centroid Methods
o Divisive
• Nonhierarchical
o Sequential Threshold
o Parallel Threshold
o Optimizing Partitioning
• Other
o Two-Step
Two Types of Hierarchical Clustering Procedures
• Agglomerative Methods
o Buildup: all observations start as individual clusters and are
joined together sequentially.
• Divisive Methods
o Breakdown: initially all observations are in a single cluster,
which is then divided into smaller clusters.
How Agglomerative Hierarchical Approaches Work?

• A multi-step process
o Start with all observations as their own cluster.
o Using the selected similarity measure and agglomerative
algorithm, combine the two most similar observations into a
new cluster, now containing two observations.
o Repeat the clustering procedure using the similarity
measure/agglomerative algorithm to combine the two most
similar observations or clusters (i.e., combinations of
observations) into another new cluster.
o Continue the process until all observations are in a single
cluster.
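A minimal sketch of this combine-and-repeat process in Python using SciPy (the library choice and the toy two-group data are assumptions, not part of the slides); each printed step corresponds to one merge in the process described above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (5, 2)),   # invented data: two loose groups
               rng.normal(3, 0.5, (5, 2))])

# Each row of Z records one merge: the two clusters combined, the distance
# at which they were joined, and the size of the newly formed cluster.
Z = linkage(X, method="average", metric="euclidean")
for step, (c1, c2, dist, size) in enumerate(Z, start=1):
    print(f"step {step}: merge clusters {int(c1)} and {int(c2)} "
          f"at distance {dist:.3f} (new cluster size {int(size)})")
```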
Statistics Associated with Cluster Analysis
• Agglomeration schedule. An agglomeration schedule gives
information on the objects or cases being combined at each
stage of a hierarchical clustering process.

• Cluster centroid. The cluster centroid is the mean value of
the variables for all the cases or objects in a particular cluster.

• Cluster centers. The cluster centers are the initial starting
points in nonhierarchical clustering. Clusters are built
around these centers, or seeds.

• Cluster membership. Cluster membership indicates the
cluster to which each object or case belongs.

• Similarity/distance coefficient matrix. A matrix containing the
pairwise distances between objects or cases.
Statistics Associated with Cluster Analysis
• Distances between cluster centers. These distances indicate
how separated the individual pairs of clusters are. Clusters
that are widely separated are distinct, and therefore desirable.

• Dendrogram. A dendrogram, or tree graph, is a pictorial
representation of clustering results. Vertical lines represent
clusters that are joined together. The position of the line on
the scale indicates the distance at which clusters were joined.
The dendrogram is read from left to right.

• Icicle plot. Another graphical display of clustering results,
so called because it resembles a row of icicles hanging from
the eaves of a house. The columns correspond to the objects
being clustered, and the rows correspond to the number of
clusters. An icicle plot is read from bottom to top.
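A hedged sketch of producing and reading these outputs with SciPy and Matplotlib; the data, the Ward linkage, and the three-cluster cut are illustrative assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))                      # invented data

Z = linkage(X, method="ward")                     # Z is the agglomeration schedule
labels = fcluster(Z, t=3, criterion="maxclust")   # cluster membership for k = 3
print("cluster membership:", labels)

dendrogram(Z)                                     # tree graph of the merge history
plt.xlabel("Observation")
plt.ylabel("Joining distance")
plt.show()
```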
Conducting Cluster Analysis

Formulate the Problem

Select a Distance Measure

Select a Clustering Procedure

Decide on the Number of Clusters

Interpret and Profile Clusters

Assess the Validity of Clustering


Attitudinal Data For Clustering
Stage 1:
Objectives of Cluster Analysis

Research Questions in Cluster Analysis
Selection of Clustering Variables
Formulate the Problem
• Selecting the Variables to be used for clustering –
o Perhaps the most important part in formulating the problem.
o Inclusion of irrelevant variables may distort an otherwise useful
clustering solution.
• The set of variables selected should describe the similarity
between objects in terms that are relevant to the problem.
• Selection of variables should be based on
o Past research, theory, or the hypotheses being tested.
o In exploratory research, the analyst should exercise judgment
and intuition.
• Variables typically used
o Lifestyle characteristics; Psychographic variables; Attitude;
o Geographic; Demographic; Performance;
Selection of Clustering Variables
• Variable Selection is a Critical Decision
o Clustering variables represent the sole means of measuring
similarity among objects.
o As a result, the analysis is constrained based on the variables
included.

• Two Issues in Variable Selection
1. Conceptual considerations
✓ Variables characterize the objects being clustered.
✓ Relate specifically to the objectives of the cluster analysis.

2. Practical considerations.
✓ Should always use the “best” variables available (i.e., little
measurement error, etc.).
Rules of thumb in selecting variables

• Theoretical, conceptual and practical considerations must be
observed when selecting clustering variables for cluster
analysis:
o Only variables that relate specifically to the objectives of the
cluster analysis are included, since “irrelevant” variables cannot
be excluded from the analysis once it begins.
o Variables are selected which characterize the individuals
(objects) being clustered.
Stage 2:
Research Design in Cluster Analysis

Types and Number of Clustering Variables


Sample size
Outlier detection
Measuring object similarity
Data standardization
Types and Number of Clustering Variables

• Type of Variables Included
o Can employ either metric or non-metric variables, but not both together.
o Multiple measures of similarity exist for each type.
• Number of Clustering Variables
o Can suffer from “curse of dimensionality” when large number
of variables analyzed.
o Can have impact with as few as 20 variables.
• Relevance of Clustering Variables
o No method to ascertain the relevancy of clustering variables.
o Include only those variables with strongest conceptual
support.
• Units of the Clustering Variables
o Standardize the variables – to remove the impact of units.
Is the Sample Size Adequate?
• The sample size requirement is based on:
o Sufficient size is needed to ensure representativeness of the
population and its underlying structure.
o Of particular interest is the ability to detect small groups within
the population.
o Minimum group sizes are based on the relevance of each group
to the research question and the confidence needed in
characterizing that group.

• Increasing sample size (say, above 1,000 observations)
o may pose problems for hierarchical clustering methods,
o and hence will require “hybrid” approaches.
Detecting Outliers
• Outliers can severely distort the representativeness of results if
they appear as clusters that are inconsistent with objectives
o They should be removed, if the outlier represents:
▪ Aberrant observations not representative of the population.
▪ Observations of small or insignificant segments within the population.
o They should be retained if the outlier represents:
▪ An under-sampling/poor representation of relevant groups in the
population. In this case, the sample should be augmented to ensure
representation of these groups.
• Identify Outliers based on the similarity measure by:
o Finding observations with large distances from all other
observations
o Graphic profile diagrams or parallel coordinate graphs
highlighting outlying cases.
o Their appearance in cluster solutions as single-member or very
small clusters.
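As an illustration of the first identification approach, the sketch below (all data and the two-standard-deviation screening rule are invented for the example) flags observations with unusually large average distance to all other observations:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 3)),
               np.array([[8.0, 8.0, 8.0]])])     # one planted aberrant case

D = squareform(pdist(X, metric="euclidean"))     # pairwise distance matrix
avg_dist = D.sum(axis=1) / (len(X) - 1)          # mean distance to all others

cutoff = avg_dist.mean() + 2 * avg_dist.std()    # an arbitrary screening rule
print("candidate outliers:", np.where(avg_dist > cutoff)[0])
```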
Defining and Measuring Inter-object Similarity

• Inter-object similarity
o An empirical measure of correspondence, or resemblance,
between objects to be clustered.
o Calculated across the entire set of clustering variables to allow
for the grouping of observations and their comparison to each
other.

• Three methods are most widely used:
o Distance measures – most often used.
o Correlational measures – less often used as they measure
patterns, not distance.
o Association measures – applicable for non-metric clustering
variables.
Types of Distance Measures
• Distance is the most widely used measure of similarity,
o with higher values representing greater dissimilarity (distance
between cases), not similarity.
• Many different distance measures, most common are:
o Euclidean (straight line) distance: Most common measure of
distance.
o Squared Euclidean distance: sum of squared distances.
Recommended measure for the centroid and Ward’s methods of
clustering.

o Mahalanobis distance (D²) accounts for variable
intercorrelations and weights each variable equally.
A good measure when the variables are highly intercorrelated.
Other measures of Distance
• City-block or Manhattan distance between two objects is the sum
of the absolute differences in values for each variable.
o Length needed to move between two points in a grid where you can
only move up, down, left or right.

|x1-x2| + |y1-y2|

• Chebychev distance between two objects is the maximum absolute
difference in values across all the variables.
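The sketch below computes each of these measures for two hypothetical observations using SciPy's distance module; the two points, and the sample used to estimate the covariance matrix for the Mahalanobis distance, are made up for illustration:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 1.0, 0.5])

print("Euclidean:        ", distance.euclidean(x, y))
print("Squared Euclidean:", distance.sqeuclidean(x, y))
print("City-block:       ", distance.cityblock(x, y))  # sum of |xi - yi|
print("Chebychev:        ", distance.chebyshev(x, y))  # max of |xi - yi|

# Mahalanobis needs the inverse covariance matrix of the variables,
# estimated here from a hypothetical sample the two points belong to.
sample = np.random.default_rng(7).normal(size=(50, 3))
VI = np.linalg.inv(np.cov(sample, rowvar=False))
print("Mahalanobis:      ", distance.mahalanobis(x, y, VI))
```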
Select a Distance or Similarity Measure
• Clustering solution is influenced by the units of
measurement.
o So, if variables have different units, data must be standardized
by rescaling each variable or normalized (mean of zero; standard
deviation of one).

• Cluster analysis is very sensitive to the similarity measure used.
o Different distance measures may lead to different clustering
results.
o Use different measures and compare the results.
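A minimal sketch of the standardization step, using scikit-learn's StandardScaler (one common tool choice, assumed here) to show how rescaling changes distances when variables have very different units; the age/income values are hypothetical:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

# Hypothetical data: income dwarfs age in years before scaling.
X = np.array([[25, 300000.0], [40, 900000.0], [30, 450000.0]])

X_std = StandardScaler().fit_transform(X)    # mean of zero, std. dev. of one

print("distances, raw:         ", pdist(X).round(1))      # dominated by income
print("distances, standardized:", pdist(X_std).round(2))  # both variables count
```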
Stage 3:
Assumptions of Cluster Analysis

Structure Exists.
Representativeness of the sample.
Impact of multicollinearity.
Three Assumptions Underlying Cluster Analysis

1. Structure Exists
o Cluster analysis will always generate a solution,
o So, assume that a “natural” structure of objects exists which is to be
identified by the technique.
2. Representativeness of the Sample
o Obtained sample is truly representative of the population.
3. Impact of multicollinearity
o Multicollinearity among subsets of variables is an implicit “weighting”
of the clustering variables
o Potential remedies:
• Reduce the variables to equal numbers in each set of correlated
measures.
• Use an appropriate distance measure, like Mahalanobis Distance.
• Factor Analysis – take one variable from each factor
• Take a proactive approach and include only cluster variables that
are not highly correlated (see the sketch below).
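The sketch referenced in the last remedy above: a simple correlation screen that drops one variable from each highly correlated pair before clustering. The 0.8 cutoff and the synthetic variables are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({"v1": rng.normal(size=100)})
df["v2"] = df["v1"] * 0.95 + rng.normal(scale=0.1, size=100)  # near-duplicate
df["v3"] = rng.normal(size=100)

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
print("dropping:", to_drop)          # keeps one variable per correlated pair
print("clustering variables:", df.drop(columns=to_drop).columns.tolist())
```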
Stage 4:
Deriving Clusters and Assessing
Overall Fit

Selecting the partitioning procedure.
Potentially re-specifying initial cluster solutions by
eliminating outliers or small clusters.
Determining the number of clusters.
Other clustering approaches.
Two Approaches to Partitioning Observations
• Hierarchical
o The most common approach: all objects start as separate
clusters and are then joined sequentially, two clusters at a
time, until only a single cluster remains.

• Non-hierarchical
o The number of clusters is specified by the analyst and then the
set of objects are formed into that set of groupings.
Linkage Methods (Hierarchical Clustering)
• Single linkage method (SLINK) or Nearest Neighbor method
o Based on minimum distance, or the nearest neighbor rule.
• Computes all pairwise dissimilarities between the elements in cluster 1 and
the elements in cluster 2, and considers the smallest of these dissimilarities
as a linkage criterion.
o At every stage, the distance between two clusters is the distance
between their two closest points.
o It tends to produce long, “loose” clusters.
o Useful when dealing with non-spherical clusters or when there
are significant differences in cluster sizes.
o It is often used in biological applications, such as in the analysis
of genetic data.
Linkage Methods (Hierarchical Clustering)
• Complete (or Maximum) linkage or Farthest Neighbor method
o Based on the max distance or the furthest neighbor approach.
• Computes all pairwise dissimilarities between the elements in cluster 1 and
the elements in cluster 2, and considers the largest value (i.e., maximum
value) of these dissimilarities as the distance between the two clusters.
o The distance between two clusters is calculated as the distance
between their two furthest points.
o It tends to produce more compact clusters.
o Useful when the goal is to identify well-separated clusters with
relatively equal diameters.
o It is commonly used in social science research and psychology,
such as in personality trait clustering.
Linkage Methods (Hierarchical Clustering)
• Mean or Average linkage method works similarly.
o The distance between two clusters is defined as the average of
the distances between all pairs of objects, where one member of
the pair is from each of the clusters.
• Computes all pairwise dissimilarities between the elements in cluster 1 and
the elements in cluster 2, and considers the average of these dissimilarities
as the distance between the two clusters.
o Can vary in the compactness of the clusters it creates.
o Average linkage is versatile and can be used in various
scenarios, especially when there is no prior knowledge about the
data distribution.
o It is widely used in market segmentation, where clusters may
have different shapes and sizes.
Linkage Methods of Clustering

[Diagram: three panels, each showing Cluster 1 and Cluster 2.
Single Linkage – clusters joined by the minimum distance between them.
Complete Linkage – clusters joined by the maximum distance between them.
Average Linkage – clusters joined by the average distance between them.]
Variance Methods
• Variance Methods try to generate clusters to minimize the
within-cluster variance. Popular method – Ward's procedure.
o Find mean of each cluster (mean of all the variables).
o Calculate the distance between each object in a particular cluster,
and that clusters’ mean, and square it (squared Euclidean distance).
o Sum these distances for all the objects.
o At each stage, the two clusters whose merger produces the smallest
increase in the overall within-cluster sum of squares are combined.
This minimizes the total within-cluster variance and produces
compact clusters.

• Ward's method is suitable for datasets with a large number of


clusters or when the goal is to identify homogeneous clusters.
It is often used in biological and medical research, such as in
the analysis of gene expression data.
Centroid Methods
• In the centroid methods, the distance between two clusters is
the distance between their centroids (means for all the
variables). Every time objects are grouped, a new centroid is
computed.
Other Agglomerative Clustering Methods
Ward’s Procedure
Centroid Method

Of the hierarchical methods, average linkage and Ward’s
method have been shown to perform better than other
procedures.
Distance measure impacts the clusters formed: the four dendrograms show
clusters formed on the same set of data using different distance measures.
Comparing the Agglomerative Algorithms
• Single-linkage
o Probably the most versatile algorithm, but poorly delineated
cluster structures within the data produce unacceptable
snakelike “chains” for clusters.
• Complete linkage
o Eliminates the chaining problem, but only considers the
outermost observations in a cluster, thus impacted by outliers.
• Average linkage
o Generates clusters with small within-cluster variation and less
affected by outliers.
• Centroid linkage
o Like average linkage, is less affected by outliers.
• Ward’s method
o Most appropriate when the analyst expects somewhat equally
sized clusters, but easily distorted by outliers.
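The comparison below runs the algorithms just described on the same toy data and prints the cluster sizes each produces; the synthetic data and the two-cluster cut are assumptions, and results will differ on other data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1.0, (15, 2)),   # two invented groups
               rng.normal(4, 1.0, (15, 2))])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)             # Euclidean distance by default
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(f"{method:>8}: cluster sizes {np.bincount(labels)[1:]}")
```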
Agglomeration Schedule Using Ward’s Procedure

        Clusters Combined                    Stage Cluster First Appears
Stage   Cluster 1   Cluster 2   Coefficient   Cluster 1   Cluster 2   Next Stage
1       14          16            1.00           0           0           6
2        6           7            2.00           0           0           7
3        2          13            3.50           0           0          15
4        5          11            5.00           0           0          11
5        3           8            6.50           0           0          16
6       10          14            8.16           0           1           9
7        6          12           10.17           2           0          10
8        9          20           13.00           0           0          11
9        4          10           15.58           0           6          12
10       1           6           18.50           6           7          13
11       5           9           23.00           4           8          15
12       4          19           27.75           9           0          17
13       1          17           33.10          10           0          14
14       1          15           41.33          13           0          16
15       2           5           51.83           3          11          18
16       1           3           64.50          14           5          19
17       4          18           79.67          12           0          18
18       2           4          172.66          15          17          19
19       1           2          328.60          16          18           0
Results of Hierarchical Clustering
[Figure: Vertical Icicle Plot Using Ward’s Method]
[Figure: Dendrogram Using Ward’s Method]
Pros and Cons of Hierarchical Methods
• Pros
o Simplicity – generates a tree-like structure that gives a simple
portrayal of the clustering process.
o Measures of similarity – multiple measures to address many
situations.
o Speed – generates the entire set of cluster solutions in a single analysis.

• Cons
o Permanent combinations – once joined, clusters are never
separated.
o Impact of outliers – outliers may appear as single object or very
small clusters.
o Large samples – not amenable to very large samples.
Nonhierarchical Clustering
K-Means
How Nonhierarchical Approaches Work?
1. Determine number of clusters to be extracted
2. Specify cluster seeds.
o Analyst specified.
o Sample generated:
• First cluster seed is 1st observation in data set with no missing values.
• Seed points are selected randomly from all observations.
3. Assign each observation to one of the seeds based on
similarity.
o Sequential Threshold: selects one seed point, develops cluster;
then selects next seed point and develops cluster, and so on.
Observation cannot be re-assigned to another cluster following
its original assignment.
o Parallel Threshold: sets all seed points simultaneously, then
develops clusters.
o Optimization: allow for re-assignment of observations based on
the sequential proximity of observations to clusters formed
during the clustering process.
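A minimal K-means sketch with analyst-specified seed points, using scikit-learn (a library choice assumed here; the slides do not name a tool). The seeds and two-group data are invented:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 0.7, (30, 2)),
               rng.normal(5, 0.7, (30, 2))])

seeds = np.array([[0.0, 0.0], [5.0, 5.0]])    # analyst-specified cluster seeds
km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)

print("cluster centers:\n", km.cluster_centers_.round(2))
print("membership of first 5 cases:", km.labels_[:5])
```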
Pros and Cons of Nonhierarchical Methods
• Pros
o Results are less susceptible to:
• outliers in the data,
• the distance measure used, and
• the inclusion of irrelevant or inappropriate variables.
o Can easily analyze very large data sets

• Cons
o Best results require knowledge of seed points.
o Difficult to guarantee optimal solution.
o Typically generates only spherical and more equally sized
clusters.
o Less efficient in examining wide number of cluster solutions.
Visualizing K-Means Clustering Process

Deciding the number of Clusters
• Elbow Method
o Clusters are defined such that the total intra-cluster variation [or
total within-cluster sum of square (WSS)] is minimized.
o WSS measures the compactness of the clustering. Should be as
small as possible.
o The within-cluster sum of squares (WSS) is defined as the sum of
the squared distances between each member of a cluster and that
cluster’s centroid.

• Silhouette Method
o Difference between the smallest average between cluster
distance and the average within cluster distance, divided by the
larger of the two distances.
For each element in a cluster, calculate the average distance to all other
elements in its cluster and the average distance to all elements in each of the
other clusters.
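A hedged sketch computing both criteria for k = 2 through 6 on synthetic three-group data; the data, the range of k, and the library choices are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 0.8, (40, 2)),
               rng.normal(5, 0.8, (40, 2)),
               rng.normal([0, 6], 0.8, (40, 2))])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_                      # total within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)  # higher is better, range -1 to 1
    print(f"k={k}: WSS={wss:7.1f}  silhouette={sil:.3f}")
```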



Selecting Between Hierarchical and
Nonhierarchical
Hierarchical clustering solutions are preferred when:
• A wide range of (even all) alternative clustering solutions is to
be examined.
• The sample size is moderate (under 300-400, not exceeding
1,000).

Nonhierarchical clustering methods are preferred when:
• The number of clusters is known and initial seed points can
be specified according to some practical, objective or
theoretical basis.
• There is concern about outliers since nonhierarchical
methods generally are less susceptible to outliers.
Combining Hierarchical and Nonhierarchical
Approaches
• Using a non-hierarchical approach followed by a hierarchical
approach is often advisable.
1. A nonhierarchical approach is used to select a large number of
clusters.
2. The cluster centers for the clusters formed at this stage serve as
initial cluster seeds in the next, hierarchical procedure.
3. Hierarchical method then clusters all the cluster seeds.
The two cluster procedures are combined to form the overall
cluster solution.

This process helps in visualizing the cluster solutions even for very
large datasets.
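A minimal sketch of this hybrid under assumed parameters (50 interim clusters, synthetic data): a nonhierarchical pass first, then a hierarchical pass over the resulting centers:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import KMeans

rng = np.random.default_rng(10)
X = rng.normal(size=(5000, 4))        # stands in for a very large sample

# Steps 1-2: nonhierarchical pass produces 50 cluster centers as seeds.
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)

# Step 3: the hierarchical method clusters the 50 centers, not the 5,000
# cases, so the full set of solutions (and a dendrogram) stays cheap.
Z = linkage(km.cluster_centers_, method="ward")
print("merges in the hierarchical stage:", len(Z))
```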
Stage 5:
Interpretation of the Clusters
Cluster Interpretation
• Involves examining each cluster in terms of the cluster variate
to name or assign a label accurately describing the nature of
the clusters.
• The cluster centroid, a mean profile of the cluster on each
clustering variable, is particularly useful in the interpretation
stage.
o Interpretation involves examining the distinguishing
characteristics of each cluster’s profile and identifying
substantial differences between clusters.
o Cluster solutions failing to show substantial variation indicate
other cluster solutions should be examined.
• The cluster centroid should also be assessed for
correspondence with the analyst’s prior expectations based
on theory or practical experience.
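As an illustration of the centroid table used in interpretation, the sketch below clusters invented data and prints the mean profile of each cluster on each clustering variable (the variable names echo the earlier dining example but are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)
df = pd.DataFrame(rng.normal(size=(90, 2)),
                  columns=["eat_out_freq", "fast_food_freq"])
df.iloc[:45] += 3                     # plant two distinguishable groups

km = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(df[["eat_out_freq", "fast_food_freq"]])

# The centroid table: a mean profile of each cluster on each clustering
# variable, the main input for naming and labelling the clusters.
print(df.groupby("cluster").mean().round(2))
```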
Stage 6:
Validation and Profiling of the Clusters

Validation
Profiling
Validation of the Final Cluster Solution
• Validation is essential in cluster analysis since the clusters
are descriptive of structure and require additional support
for their relevance.

• Two approaches
o Cross-validation – empirically validates a cluster solution by
creating two sub-samples (randomly splitting the sample) and
then comparing the two cluster solutions for consistency with
respect to number of clusters and the cluster profiles.
o Criterion validity – achieved by examining differences on
variables not included in the cluster analysis but for which there
is a theoretical and relevant reason to expect variation across the
clusters.
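A hedged sketch of the cross-validation approach: cluster each random half, label one half both by its own solution and by the other half's centroids, and measure agreement with the adjusted Rand index (the index choice, data, and split are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
X = np.vstack([rng.normal(0, 0.8, (60, 2)), rng.normal(4, 0.8, (60, 2))])
A, B = train_test_split(X, test_size=0.5, random_state=0)  # two sub-samples

km_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(A)
km_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(B)

# Label sub-sample B twice: by its own solution and by A's centroids.
agreement = adjusted_rand_score(km_b.labels_, km_a.predict(B))
print(f"cross-validation agreement (ARI): {agreement:.2f}")  # near 1 = stable
```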
Profiling A Cluster Solution
• Describing the characteristics of each cluster on a set of
additional variables (not the clustering variables) to further
understand the differences between clusters
o Examples include descriptive variables (e.g., demographics) as
well as other outcome-related measures.
o Provides insight to analysts as to nature and character of the
clusters.

• Clusters should differ on these relevant dimensions. Testing this
typically involves the use of discriminant analysis or ANOVA.
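A minimal profiling sketch under assumed data: a one-way ANOVA testing whether the clusters differ on a hypothetical variable (income) that was not used for clustering:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(13)
labels = np.repeat([0, 1, 2], 30)              # an existing 3-cluster solution
income = rng.normal(loc=[40, 55, 70], scale=5,
                    size=(30, 3)).T.ravel()    # hypothetical profiling variable

groups = [income[labels == k] for k in np.unique(labels)]
F, p = f_oneway(*groups)
print(f"F = {F:.1f}, p = {p:.4f}")   # a small p: clusters differ on income
```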
