Cluster Analysis
Cluster analysis is a class of techniques used to classify objects or cases into relatively
homogeneous groups called clusters. Objects in a cluster tend to be similar to one another
and dissimilar to objects in other clusters. Cluster analysis is
also called classification analysis.
The steps involved in cluster analysis are:
1. Formulate the problem
2. Select a distance measure
3. Select a clustering procedure
4. Decide on the number of clusters
5. Interpret and profile clusters
6. Assess the validity of clustering
1) The key part of formulating the problem is selecting the variables on
which the clustering is based. The set of variables selected should describe
the similarity between the objects in terms relevant to the research problem.
Including even one or two irrelevant variables can distort an otherwise useful
clustering solution.
2) As the objective of clustering is to group similar objects together, we need some
measure to assess how similar or different the objects are. The most common
approach is to measure similarity in terms of the distance between two objects.
Objects with a smaller distance between them are considered more similar. The
most commonly used measure is the Euclidean distance: the square root
of the sum of the squared differences in values for each variable.
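The Euclidean distance described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation; the function name and the respondent ratings are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences across variables."""
    if len(a) != len(b):
        raise ValueError("both objects must be measured on the same variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical respondents rated on three variables (made-up values).
respondent_a = [6, 4, 7]
respondent_b = [3, 4, 3]
respondent_c = [5, 4, 7]

d_ab = euclidean_distance(respondent_a, respondent_b)  # sqrt(9 + 0 + 16)
d_ac = euclidean_distance(respondent_a, respondent_c)  # sqrt(1 + 0 + 0)
```

Here respondent_c is closer to respondent_a than respondent_b is, so a and c would be treated as the more similar pair. If the variables are measured in very different units, they are usually standardized before distances are computed.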
3) The clustering procedure can be hierarchical, non-hierarchical or some other
procedure. Hierarchical clustering develops a tree-like structure, or hierarchy.
It has two procedures: agglomerative clustering and divisive clustering.
a. Agglomerative clustering is a hierarchical clustering procedure where each object
starts out in a separate cluster. Clusters are formed by grouping objects into
bigger and bigger clusters.
b. Divisive clustering is a hierarchical clustering procedure where all objects start
out in one giant cluster. Clusters are formed by dividing this cluster into smaller
and smaller clusters.
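The agglomerative procedure can be sketched as follows. The notes do not say how the distance between two clusters is measured, so this sketch assumes single linkage (the distance between the two closest members); the function names and the data points are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters (>= 1) remain.

    Assumes single linkage. Returns the clusters (as lists of point
    indices) and the distance at which each merge took place.
    """
    clusters = [[i] for i in range(len(points))]  # each object starts alone
    merge_distances = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_distances.append(d)
        clusters[i] = clusters[i] + clusters[j]  # form a bigger cluster
        del clusters[j]
    return clusters, merge_distances

# Made-up objects measured on two variables.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
clusters, merge_distances = agglomerate(points, n_clusters=2)
```

The list of merge distances plays the role of a simple agglomeration schedule: a large jump in the distance at which two clusters are combined suggests that dissimilar groups are being forced together.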
In non-hierarchical clustering, there are three methods:
a. Sequential threshold method is a non-hierarchical clustering procedure in which
a cluster center is selected and all objects within a prespecified threshold value
from the center are grouped together.
b. Parallel threshold method is a non-hierarchical clustering procedure that
specifies several cluster centers at once. All objects that are within a prespecified
threshold value from a center are grouped with the nearest center.
c. Optimizing partitioning method is a non-hierarchical clustering method that
allows for later reassignment of objects to clusters to optimize an overall
criterion.
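As an illustration of the optimizing partitioning idea, here is a minimal sketch of k-means, one common method of this kind. Choosing k-means is an assumption of this example, as the notes do not name a specific algorithm. The criterion being reduced is the total squared distance of objects from their cluster centers; the function names, starting centers and data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centers, max_iter=100):
    """Reassign objects to the nearest center until nothing changes."""
    centers = [tuple(float(v) for v in c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object joins its nearest center.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:  # no movement, so assignments are stable
            break
        centers = new_centers
    return labels, centers

# Made-up data with a deliberately poor start: both centers in one group.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
labels, centers = kmeans(points, initial_centers=[(1, 1), (2, 1)])
```

With this starting point the object (2, 1) is first assigned to the second cluster and then reassigned to the first once the centers move, which is the later reassignment that distinguishes this method from the two threshold methods.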
4) Decide on the number of clusters:
a. In hierarchical clustering, the distances at which clusters are combined can be
used as a criterion. This information can be obtained from the agglomeration
schedule or from the dendrogram.
b. In non-hierarchical clustering, the ratio of total within-group variance to
between-group variance can be plotted against the number of clusters. The point
at which an elbow or a sharp bend occurs indicates an appropriate number of
clusters. Increasing the number of clusters beyond this point is usually not
worthwhile.
c. The relative size of the clusters should be meaningful.
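The elbow criterion for non-hierarchical clustering can be illustrated with a small sketch. It assumes sums of squares as a stand-in for the within-group and between-group variances, and it takes the cluster assignments as given rather than computing them. The function names, data and candidate partitions are all hypothetical.

```python
def centroid(points):
    """Mean of the points on each variable."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_between_ratio(points, labels):
    """Within-group sum of squares divided by between-group sum of squares.

    Needs at least two clusters; with one cluster the between-group
    term is zero and the ratio is undefined.
    """
    grand = centroid(points)
    within = 0.0
    between = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        c = centroid(members)
        within += sum(sq_dist(p, c) for p in members)
        between += len(members) * sq_dist(c, grand)
    return within / between

# Made-up data containing three visibly separate groups of three objects.
points = [(1, 1), (1, 2), (2, 1),
          (8, 8), (9, 8), (8, 9),
          (1, 8), (2, 9), (1, 9)]

# Hypothetical assignments for 2, 3 and 4 clusters.
partitions = {
    2: [0, 0, 0, 1, 1, 1, 0, 0, 0],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 3, 2, 2, 2],
}
ratios = {k: within_between_ratio(points, labels)
          for k, labels in partitions.items()}
```

Plotting the ratios against the number of clusters for this data would show a steep drop from two to three clusters and only a slight drop from three to four, so the bend at three marks the point beyond which extra clusters add little.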
5) Interpret and profile the clusters: this involves examining the cluster centroids.
A centroid represents the mean value of the objects contained in the cluster on each
of the variables. The centroids enable us to describe each cluster by assigning it a
name or label.
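Profiling by centroid can be sketched as below: for each cluster, take the mean of its members on every variable. The variable names, ratings and cluster assignments are hypothetical and serve only to show the calculation.

```python
def profile_clusters(rows, labels, variables):
    """Return {cluster: {variable: mean of that variable in the cluster}}."""
    profiles = {}
    for k in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == k]
        profiles[k] = {
            name: sum(r[i] for r in members) / len(members)
            for i, name in enumerate(variables)
        }
    return profiles

# Made-up ratings on two attitude statements (1 = disagree, 7 = agree).
variables = ["shopping_is_fun", "price_conscious"]
rows = [(6, 2), (7, 1), (5, 3), (2, 6), (1, 7)]
labels = [0, 0, 0, 1, 1]  # assumed output of an earlier clustering step

profiles = profile_clusters(rows, labels, variables)
```

In this made-up example cluster 0 scores high on enjoying shopping and low on price consciousness, while cluster 1 shows the reverse, so the two clusters might be labelled something like "fun-loving shoppers" and "economical shoppers".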
6) Assess reliability and validity:
a. Perform cluster analysis on the same data using different distance measures.
Compare the results across different measures.
b. Use different methods of clustering and compare the results.
c. Split the data randomly into two halves. Perform clustering on each half. Compare
cluster centroids across the two subsamples.
Discriminant Analysis:
Discriminant analysis is used to analyse market research data in which the dependent
variable is categorical while the independent variables are interval in nature. One of
the key objectives of discriminant analysis is to develop a discriminant function.
Discriminant Function:
It is a linear combination of the independent variables that will best discriminate
between the categories of the dependent variable.
Multiple discriminant analysis is a discriminant analysis technique where the criterion
variable involves three or more categories. The main distinction is that in two-group
discriminant analysis, where the criterion variable has two categories, it is possible to
derive only one discriminant function, whereas in multiple discriminant analysis more
than one function may be computed.
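For the two-group case, the single discriminant function has the form D = b1*X1 + b2*X2 with two independent variables. The notes do not say how the weights are estimated, so the sketch below assumes Fisher's linear discriminant, one common approach, and is restricted to two variables to keep the matrix algebra short. All function names and data values are hypothetical.

```python
def mean_vector(rows):
    """Mean of each variable across the rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def pooled_scatter(groups):
    """2x2 within-group scatter matrix pooled over all groups."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s

def score(weights, x):
    """Discriminant score D = b1*X1 + b2*X2."""
    return weights[0] * x[0] + weights[1] * x[1]

def fit_two_group(group_a, group_b):
    """Fisher weights (inverse scatter times mean difference) and a cutoff."""
    m_a, m_b = mean_vector(group_a), mean_vector(group_b)
    s = pooled_scatter([group_a, group_b])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
    weights = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
               inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Cutoff halfway between the two group means on the discriminant scale.
    cutoff = (score(weights, m_a) + score(weights, m_b)) / 2
    return weights, cutoff

def classify(weights, cutoff, x):
    """Assign to group A when the score exceeds the cutoff, else group B."""
    return "A" if score(weights, x) > cutoff else "B"

# Made-up interval-scaled measurements for two groups of respondents.
group_a = [(8, 7), (9, 6), (7, 8), (8, 9)]
group_b = [(3, 4), (2, 5), (4, 3), (3, 2)]
weights, cutoff = fit_two_group(group_a, group_b)
```

A new case is classified by computing its score and comparing it with the cutoff, for example classify(weights, cutoff, (8, 8)). With three or more groups a single function is generally not enough, which is why multiple discriminant analysis may compute more than one.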