Lecture 02 - Cluster Analysis 1
Review
Factor Analysis
Factor analysis is a class of procedures used for data reduction and
summarization.
It is an interdependence technique: no distinction between dependent
and independent variables.
The factor scores are estimated as linear combinations of the observed variables:

$$F_i = W_{i1} X_1 + W_{i2} X_2 + \cdots + W_{ik} X_k$$

where:
$F_i$ = estimate of the ith factor
$W_{ij}$ = weight or factor score coefficient
$X_j$ = jth variable
$k$ = number of variables
The Factor Analysis Process
(Flowchart; steps include: determine the method of factor extraction, then determine the number of factors.)
This Session: Cluster Analysis
Course Structure: 24h including 9h labs
Multivariate Methods
Dependence:
◦ Regression: metric dependent variable (ANOVA if the regressors are nonmetric)
Interdependence:
◦ The relationship is among the variables: Factor Analysis
◦ The relationship is among the cases/respondents: Cluster Analysis
Example
Our database contains more than 10,000 customers. For each one we know their age, city, income, employment status, and designation (i.e., level of seniority). How can we group them into meaningful segments?
Both cluster analysis and discriminant analysis are concerned with classification.
◦ However, discriminant analysis requires prior knowledge of the cluster or group
membership for each object or case included, to develop the classification rule.
◦ In contrast, in cluster analysis there is no a priori information about the group or cluster
membership for any of the objects. Groups or clusters are suggested by the data, not
defined a priori.
Cluster Analysis vs. Factor Analysis
(Figure: two panels plotting Variable 1 against Variable 2; cluster analysis groups the cases, while factor analysis groups the variables.)
A Practical Clustering Situation
Groups are not that distinct:
(Figure: scatter plot of Variable 1 against Variable 2 with overlapping groups.)
Statistics Associated with Cluster Analysis
Agglomeration schedule: An agglomeration schedule gives information on the
objects or cases being combined at each stage of a hierarchical clustering
process.
Cluster centroid: The cluster centroid consists of the mean values of the variables for all the cases or objects in a particular cluster.
Cluster centers: The cluster centers are the initial starting points in
nonhierarchical clustering. Clusters are built around these centers, or seeds.
Cluster membership: Cluster membership indicates the cluster to which each
object or case belongs.
Statistics Associated with Cluster Analysis
Dendrogram: A dendrogram, or tree graph, is a graphical device for
displaying clustering results. Vertical lines represent clusters that are
joined together. The position of the line on the scale indicates the
distances at which clusters were joined. The dendrogram is read from left
to right.
Distances between cluster centers: These distances indicate how
separated the individual pairs of clusters are. Clusters that are widely
separated are distinct, and therefore desirable.
Statistics Associated with Cluster Analysis
Icicle diagram: An icicle diagram is a graphical display of clustering
results, so called because it resembles a row of icicles hanging from the
eaves of a house. The columns correspond to the objects being
clustered, and the rows correspond to the number of clusters. An icicle
diagram is read from bottom to top.
Similarity/distance coefficient matrix: A similarity/distance coefficient
matrix is a lower-triangle matrix containing pairwise distances between
objects or cases.
Conducting Cluster Analysis

Formulate the Problem → Select a Distance Measure → Select a Clustering Procedure → Decide on the Number of Clusters → Interpret & Profile Clusters → Assess the Validity of Clustering
I - Formulate the Problem
• Perhaps the most important part of formulating the clustering problem is
selecting the variables on which the clustering is based.
• Inclusion of even one or two irrelevant variables may distort an otherwise
useful clustering solution.
• Basically, the set of variables selected should describe the similarity
between objects in terms that are relevant to the research problem.
• The variables should be selected based on past research, theory, or a
consideration of the hypotheses being tested. In exploratory research,
the researcher should exercise judgment and intuition.
II – Select a Distance Measure (1)
Several distance measures are available, each with
specific characteristics.
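The most commonly used measure is the Euclidean distance, or its square. For two objects $i$ and $j$ measured on $k$ variables:

$$d_{ij} = \sqrt{\sum_{m=1}^{k} \left( x_{im} - x_{jm} \right)^2}$$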
(Diagram: overview of clustering procedures)
Hierarchical:
◦ Agglomerative: Linkage (single, complete, average), Variance (Ward's), Centroid
◦ Divisive
Nonhierarchical:
◦ K-Means
Two-Step
III – Select a Clustering Procedure (2)
Hierarchical clustering is characterized by the development of a
hierarchy or tree-like structure. Hierarchical methods can be
agglomerative or divisive.
◦ Agglomerative clustering starts with each object in a separate
cluster. Clusters are formed by grouping objects into bigger and
bigger clusters. This process is continued until all objects are
members of a single cluster.
◦ Divisive clustering starts with all the objects grouped in a single
cluster. Clusters are divided or split until each object is in a
separate cluster.
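To make the agglomerative side concrete, here is a minimal SciPy sketch on made-up toy data (an illustration, not the lecture's dataset; plain SciPy offers no divisive routine):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy data: ten points forming two loose groups (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(5, 1, (5, 2))])

# Agglomerative clustering: start from 10 singleton clusters and merge
# the closest pair at each step until one cluster remains.
Z = linkage(X, method="average")
print(Z)        # each row: the two clusters merged, their distance, new size

dendrogram(Z)   # tree-like structure of the hierarchy (requires matplotlib)
```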
III – Select a Clustering Procedure (3)-HP
Linkage Methods: Single Linkage
The single linkage method is based on minimum distance, or the
nearest neighbor rule:
◦ At every stage, the distance between two clusters is the distance
between their two closest points.
(Figure: single linkage uses the minimum distance between Cluster 1 and Cluster 2.)
III – Select a Clustering Procedure (3)- HP
Linkage Methods: Complete Linkage
The complete linkage method is similar to single linkage, except
that it is based on the maximum distance or the furthest neighbor
approach:
◦ In complete linkage, the distance between two clusters is calculated
as the distance between their two furthest points.
(Figure: complete linkage uses the maximum distance between Cluster 1 and Cluster 2.)
III – Select a Clustering Procedure (3)- HP
Linkage Methods: Average Linkage
The average linkage method works similarly. However, in this
method, the distance between two clusters is defined as the
average of the distances between all pairs of objects, where one
member of the pair is from each of the clusters.
(Figure: average linkage uses the average distance between Cluster 1 and Cluster 2.)
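In symbols, the three linkage rules differ only in how they turn the pairwise distances $d(x, y)$ into a distance between clusters $A$ and $B$:

$$d_{\text{single}}(A, B) = \min_{x \in A,\, y \in B} d(x, y), \qquad d_{\text{complete}}(A, B) = \max_{x \in A,\, y \in B} d(x, y),$$
$$d_{\text{average}}(A, B) = \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x, y).$$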
III – Select a Clustering Procedure (3)-HP
Variance Method
• The variance methods attempt to generate clusters to minimize the
within-cluster variance.
• A commonly used variance method is Ward's procedure:
◦ For each cluster, the means for all the variables are computed.
◦ Then, for each object, the squared Euclidean distance to the cluster means is
calculated.
◦ These distances are summed for all the objects.
(Figure: Ward's procedure.)
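Combining the three bullets, the quantity Ward's procedure keeps small at every merge is the total within-cluster error sum of squares:

$$ESS = \sum_{c} \sum_{i \in c} \sum_{m=1}^{k} \left( x_{im} - \bar{x}_{cm} \right)^{2}$$

where $\bar{x}_{cm}$ is the mean of variable $m$ in cluster $c$. At each stage, the two clusters whose merger produces the smallest increase in $ESS$ are joined.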
III – Select a Clustering Procedure (3)-HP
Centroid Method
• In the centroid methods, the distance between two clusters is the distance
between their centroids (means for all the variables). Every time objects are
grouped, a new centroid is computed.
• Of the hierarchical methods, average linkage and Ward's methods have been
shown to perform better than the other procedures.
(Figure: centroid method, distance between cluster centroids.)
IV – Select a Clustering Procedure (3) NHP
K-Means Method
• The nonhierarchical clustering methods are frequently referred to as k-means clustering:
◦ Note that in this procedure the number k of clusters is fixed
• In the sequential threshold method, a cluster center is selected and all objects within a prespecified threshold value from the center are grouped together. Then a new cluster center or seed is selected, and the process is repeated for the unclustered points. Once an object is clustered with a seed, it is no longer considered for clustering with subsequent seeds.
• Algorithm:
1. Place K points (or seeds) into the space represented by the objects that are being clustered
2. These points represent initial group centroids.
3. Assign each object to the group that has the closest centroid.
4. When all objects have been assigned, recalculate the positions of the K centroids
5. Repeat steps 3 and 4 until the centroids no longer move (see the sketch below).
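A minimal NumPy sketch of the five steps above; for real analyses, scikit-learn's sklearn.cluster.KMeans provides a tuned implementation with smarter seeding:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means following the five steps on the slide."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: place K points in the object space as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the group with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate the K centroids from the assigned objects
        # (an empty cluster keeps its old centroid).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # Step 5: repeat until the centroids no longer move.
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```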
Hierarchical vs. Non-hierarchical Methods
(Figure: side-by-side illustration of hierarchical and non-hierarchical clustering.)

The Data
Nineteen respondents rated six attitude statements (V1 = Fun, V2 = Bad for Budget, V3 = Eating Out, V4 = Best Buys, V5 = Don't Care, V6 = Compare Prices); responses range from 1 to 7:

Case  V1  V2  V3  V4  V5  V6
1     6   4   7   3   2   3
2     2   3   1   4   5   4
3     7   2   6   4   1   3
4     4   6   4   5   3   6
5     1   3   2   2   6   4
6     6   4   6   3   3   4
7     5   3   6   3   3   4
8     7   3   7   4   1   4
9     2   4   3   3   6   3
10    3   5   3   6   4   6
11    1   3   2   3   5   3
12    5   4   5   4   2   4
13    2   2   1   5   4   4
14    4   6   4   6   4   7
15    6   5   4   2   1   4
16    3   5   4   6   4   7
17    4   4   7   2   2   5
18    3   7   2   6   4   3
19    4   6   3   7   2   7
Results of Hierarchical Clustering

Agglomeration Schedule

Stage   Cluster 1   Cluster 2   Coefficients   Stage Cluster 1 First Appears   Stage Cluster 2 First Appears   Next Stage
1       14          16          1.000          0                               0                               6
2       6           7           2.000          0                               0                               7
3       2           13          3.500          0                               0                               14
4       5           11          5.000          0                               0                               8
5       3           8           6.500          0                               0                               15
6       10          14          8.167          0                               1                               9
7       6           12          10.500         2                               0                               10
8       5           9           13.000         4                               0                               14
9       4           10          15.583         0                               6                               11
10      1           6           18.500         0                               7                               12
11      4           19          23.250         9                               0                               16
12      1           17          28.600         10                              0                               13
13      1           15          36.833         12                              0                               15
14      2           5           46.533         3                               8                               17
15      1           3           59.200         13                              5                               18
…

How to read the schedule:
◦ Stage 1: individuals 14 and 16 are the first to be joined together.
◦ Stage 2: individuals 6 and 7 now form a segment.
◦ Stage 6: individual 10 joins the cluster formed at stage 1 (14 and 16); we now have a cluster containing 14, 16, and 10.
◦ The Coefficients column shows the amount of error created at each clustering stage. A large jump in the value of the error term indicates that two quite different clusters are being combined.

(Figure: dendrogram of the results of the hierarchical procedure. Each case begins as a separate cluster; clusters are joined at each step of the procedure until all cases are contained in a single cluster. Clusters 1, 2, and 3 are marked.)
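One way to make the "large jump" heuristic concrete is to difference the Coefficients column; a small sketch over the stages shown above:

```python
# Differences between successive agglomeration coefficients: a large
# jump between stage s and s+1 suggests stopping at the s-stage solution.
coeffs = [1.000, 2.000, 3.500, 5.000, 6.500, 8.167, 10.500, 13.000,
          15.583, 18.500, 23.250, 28.600, 36.833, 46.533, 59.200]
for stage, (a, b) in enumerate(zip(coeffs, coeffs[1:]), start=1):
    print(f"stage {stage} -> {stage + 1}: jump = {b - a:.3f}")
```

In practice one looks for the first jump that is large relative to its neighbors.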
Cluster Membership
Solution 1: 3 clusters; Solution 2: 2 clusters

Case   3 Clusters   2 Clusters
1      1            1
2      2            2
3      1            1
4      3            2
5      2            2
6      1            1
7      1            1
8      1            1
9      2            2
10     3            2
11     2            2
12     1            1
13     2            2
14     3            2
15     1            1
16     3            2
17     1            1
18     3            2
19     3            2

Reading the 3-cluster solution: individuals 1 and 3 belong to cluster 1; individuals 5 and 9 belong to cluster 2; individuals 14 and 16 belong to cluster 3. The second column gives the individuals' membership for the 2-cluster solution.
Cluster Membership: Icicle Plot
(Figure: icicle plot; the columns are the individuals or cases.)
◦ Individuals 9, 11, 5, 13, and 2 belong to cluster 2.
◦ Individuals 18, 19, 16, 14, 10, and 4 belong to cluster 3.
Cluster Membership: Icicle Plot (a 5-cluster solution)
(Figure: icicle plot read at the 5-cluster level; the columns are the individuals or cases.)
◦ Individuals 9, 11, 5, 13, and 2 belong to cluster 3.
◦ Individual 18 belongs to cluster 1.
Cluster Centroids: Description of Clusters

Examine the cluster centroids (we retained 3 clusters here).

Report: mean values by cluster (row for cluster 1):

I like having fun: 5.7500
Going out is bad for budget: 3.6250
I like eating out: 6.0000
I always look for best buys: 3.1250
I don't care about going out: 1.8750
I like comparing prices: 3.8750

K-Means output:

Initial Cluster Centers
                                           Cluster 1   Cluster 2   Cluster 3
I like having fun                          4.00        1.00        7.00
Going out is bad for budget                6.00        3.00        2.00
I like eating out                          3.00        2.00        6.00
I always look for best buys and bargains   7.00        2.00        4.00
I don't care about going out               2.00        6.00        1.00
I like comparing prices                    7.00        4.00        3.00

Final Cluster Centers
                                           Cluster 1   Cluster 2   Cluster 3
I like having fun                          3.50        1.67        5.75
Going out is bad for budget                5.83        3.00        3.63
I like eating out                          3.33        1.83        6.00
I always look for best buys and bargains   6.00        3.50        3.13
I don't care about going out               3.50        5.50        1.88
I like comparing prices                    6.00        3.67        3.88
K-Means Output: Cluster Membership

Case   Cluster   Distance
5      2         1.756
6      3         1.225
7      3         1.500
8      3         2.121
9      2         1.848
10     1         1.143
11     2         1.190
12     3         1.581
13     2         2.533
14     1         1.404
15     3         2.828
16     1         1.624
17     3         2.598
18     1         3.555
19     1         2.154
20     2         1.658

The Distance column gives the distance between each individual and its cluster centroid.
Distance Between the 3 Cluster Centroids

Distances between Final Cluster Centers

Cluster   1       2       3
1                 5.416   5.698
2         5.416           6.910
3         5.698   6.910

The between-cluster distance should be bigger than the within-cluster distance.
Thank you
ANOVA Analysis
Simple Example
Suppose a marketing researcher wishes to determine market segments in a
community based on patterns of loyalty to brands and stores.
A small sample of seven respondents is selected as a pilot test of how cluster
analysis is applied.
◦ Two measures of loyalty were recorded for each respondent on a 0-10 scale:
◦ V1 (store loyalty)
◦ V2 (brand loyalty)
Scatter Plot of the Responses
(Figure: scatter plot of V1 against V2 for observations A through G.)
How do we measure similarity?
Proximity Matrix of Euclidean Distance Between Observations
Observations
Observation
A B C D E F G
A ---
B 3.162 ---
C 5.099 2.000 ---
D 5.099 2.828 2.000 ---
E 5.000 2.236 2.236 4.123 ---
F 6.403 3.606 3.000 5.000 1.414 ---
G 3.606 2.236 3.606 5.000 2.000 3.162 ---
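The slides do not print the raw (V1, V2) scores, but the coordinates below reproduce every entry of the matrix above, so we take them as an assumption and recompute the matrix with SciPy:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Assumed (V1, V2) coordinates for respondents A-G; they reproduce the
# proximity matrix above exactly (e.g., d(E, F) = 1.414, the closest pair).
points = {"A": (3, 2), "B": (4, 5), "C": (4, 7), "D": (2, 7),
          "E": (6, 6), "F": (7, 7), "G": (6, 4)}
X = np.array(list(points.values()), dtype=float)

# Condensed pairwise Euclidean distances folded into a square matrix.
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 3))
```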
How do we form clusters?
SIMPLE RULE:
◦ Identify the two most similar(closest) observations not already in the same cluster and
combine them.
◦ We apply this rule repeatedly to generate a number of cluster solutions, starting with
each observation as its own “cluster” and then combining two clusters at a time until all
observations are in a single cluster.
◦ This process is termed a hierarchical procedure because it moves in a stepwise fashion to form an entire range
of cluster solutions. It is also an agglomerative method because clusters are formed by combining existing
clusters
How do we form clusters?
AGGLOMERATIVE PROCESS AND CLUSTER SOLUTION

Step              Minimum Distance Between    Observation   Cluster Membership       Number of   Overall Similarity Measure
                  Unclustered Observations    Pair                                   Clusters    (Average Within-Cluster Distance)
Initial solution                                            (A)(B)(C)(D)(E)(F)(G)    7           0
1                 1.414                       E-F           (A)(B)(C)(D)(E-F)(G)     6           1.414
2                 2.000                       E-G           (A)(B)(C)(D)(E-F-G)      5           2.192
3                 2.000                       C-D           (A)(B)(C-D)(E-F-G)       4           2.144
4                 2.000                       B-C           (A)(B-C-D)(E-F-G)        3           2.234
5                 2.236                       B-E           (A)(B-C-D-E-F-G)         2           2.896
6                 3.162                       A-B           (A-B-C-D-E-F-G)          1           3.420
• In steps 1, 2, 3, and 4, the overall similarity measure (OSM) does not change substantially, which indicates that we are forming new clusters with essentially the same heterogeneity as the existing clusters.
• When we get to step 5, we see a large increase. This indicates that joining clusters (B-C-D) and (E-F-G) resulted in a single cluster that was markedly less homogeneous.
How many groups do we form?
Therefore, the three-cluster solution of step 4 seems the most appropriate for a final cluster solution, with two equally sized clusters, (B-C-D) and (E-F-G), and a single outlying observation (A).
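SciPy's single-linkage routine implements exactly this nearest-neighbor rule; a sketch using the same assumed coordinates as before (ties at distance 2.000 may merge in a different order, but the three-cluster cut is identical):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Same assumed coordinates as in the proximity-matrix sketch above.
X = np.array([[3, 2], [4, 5], [4, 7], [2, 7],
              [6, 6], [7, 7], [6, 4]], dtype=float)
labels = list("ABCDEFG")

Z = linkage(X, method="single")   # merges the closest pair at each step
for step, (i, j, dist, size) in enumerate(Z, start=1):
    print(f"step {step}: merge at distance {dist:.3f} (cluster size {int(size)})")

# Cutting the tree at three clusters gives (A), (B-C-D), (E-F-G).
print(dict(zip(labels, fcluster(Z, t=3, criterion="maxclust"))))
```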
Analyze>Classify>Hierarchical Cluster …
Analyze>Classify>K-Means Cluster …
Analyze>Classify>Two-Step Cluster …
SPSS Windows: Hierarchical Clustering
1. Select ANALYZE from the SPSS menu bar.
2. Click CLASSIFY and then HIERARCHICAL CLUSTER.
3. Move “Fun [v1],” “Bad for Budget [v2],” “Eating Out [v3],” “Best Buys [v4],” “Don’t Care [v5],” and “Compare Prices [v6]” into the VARIABLES box.
4. In the CLUSTER box check CASES (default option). In the DISPLAY box check STATISTICS and
PLOTS (default options).
5. Click on STATISTICS. In the pop-up window, check AGGLOMERATION SCHEDULE. In the
CLUSTER MEMBERSHIP box check RANGE OF SOLUTIONS. Then, for MINIMUM NUMBER OF
CLUSTERS: enter 2 and for MAXIMUM NUMBER OF CLUSTERS enter 4. Click CONTINUE.
6. Click on PLOTS. In the pop-up window, check DENDROGRAM. In the ICICLE box check ALL
CLUSTERS (default). In the ORIENTATION box, check VERTICAL. Click CONTINUE.
7. Click on METHOD. For CLUSTER METHOD select WARD’S METHOD. In the MEASURE box check
INTERVAL and select SQUARED EUCLIDEAN DISTANCE. Click CONTINUE.
8. Click OK.
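For readers working outside SPSS, a rough Python equivalent of steps 3 to 7 (a sketch: the file name going_out.csv and the columns v1 to v6 are assumptions):

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

df = pd.read_csv("going_out.csv")            # hypothetical data file
X = df[["v1", "v2", "v3", "v4", "v5", "v6"]].to_numpy(float)

# Ward's method: SciPy works from the raw observations and plays the same
# role as SPSS's Ward + squared Euclidean setup, though the reported
# agglomeration coefficients are scaled differently.
Z = linkage(X, method="ward")

# Range of solutions from 2 to 4 clusters, as in step 5.
for k in (2, 3, 4):
    df[f"clusters_{k}"] = fcluster(Z, t=k, criterion="maxclust")
print(df.filter(like="clusters_").head())
```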
SPSS Windows: K-Means Clustering
1. Select ANALYZE from the SPSS menu bar.
2. Click CLASSIFY and then K-MEANS CLUSTER.
3. Move “Fun [v1],” “Bad for Budget [v2],” “Eating Out [v3],” “Best Buys [v4],” “Don’t Care [v5],” and “Compare Prices [v6]” into the VARIABLES box.
4. For NUMBER OF CLUSTERS select 3.
5. Click on OPTIONS. In the pop-up window, in the STATISTICS box, check INITIAL CLUSTER CENTERS and CLUSTER INFORMATION FOR EACH CASE. Click CONTINUE.
6. Click OK.
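Likewise, a rough Python equivalent of this K-Means run (same assumed file and columns as above):

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("going_out.csv")            # hypothetical data file
X = df[["v1", "v2", "v3", "v4", "v5", "v6"]].to_numpy(float)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # analogue of the Final Cluster Centers table
print(km.labels_)            # cluster information for each case
```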
SPSS Windows: Two-Step Clustering
1. Select ANALYZE from the SPSS menu bar.
2. Click CLASSIFY and then TWO-STEP CLUSTER.
3. Move “Fun [v1],” “Bad for Budget [v2],” “Eating Out [v3],” “Best Buys [v4],” “Don’t Care [v5],” and “Compare Prices [v6]” into the CONTINUOUS VARIABLES box.
4. For DISTANCE MEASURE select EUCLIDEAN.
5. For NUMBER OF CLUSTERS select DETERMINE AUTOMATICALLY.
6. For CLUSTERING CRITERION select AKAIKE’S INFORMATION CRITERION (AIC).
7. Click OK.