Unit - 4 - Modified
Syllabus
• Cluster Analysis: Introduction
• Requirements and overview of different categories
• Partitioning method: Introduction
• k-means
• k-medoids
• Hierarchical method: Introduction
• Agglomerative vs. Divisive method
• Distance measures in algorithmic methods
• BIRCH technique
• DBSCAN technique
• STING technique
• CLIQUE technique
• Evaluation of clustering techniques
Session 1
Cluster Analysis: Introduction
Requirements and overview of different categories
• Clustering is the process of grouping a set of data objects into multiple
groups or clusters
• so that objects within a cluster have high similarity, but are very
dissimilar to objects in other clusters.
• Dissimilarities and similarities are assessed based on the attribute
values describing the objects and often involve distance measures.
• Clustering as a data mining tool has its roots in many application
areas such as biology, security, business intelligence, and Web search.
Cluster Analysis
• Cluster: A collection of data objects
• similar (or related) to one another within the same group
• dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
• Finding similarities between data according to the characteristics found in the
data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by observations vs.
learning by examples: supervised)
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
Applications of Cluster Analysis
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth observation database
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults
• Climate: understanding Earth's climate by finding patterns in atmospheric and ocean data
• Economic Science: market research
• Owing to the huge amounts of data collected in databases, cluster analysis has recently become
a highly active topic in data mining research.
• distance-based cluster analysis
• Unsupervised learning // the class label is not present in the data D
• Learning by observation // classification is learning by example
Clustering as a Preprocessing Tool (Utility)
• Summarization:
• Preprocessing for regression, PCA, classification, and association analysis
• Compression:
• Image processing: vector quantization
• Finding K-nearest Neighbors
• Localizing search to one or a small number of clusters
• Outlier detection
• Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
• Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable) (e.g., mining_University)
• Separation of clusters
• Clusters may or may not be exclusive
• Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may
belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity)
• Clustering space
• Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
• May lead to unreliable similarity measurements
Conclusion
• Clustering algorithms have several requirements, including scalability and the ability to deal with different types of attributes, noisy data, incremental updates, clusters of arbitrary shape, and constraints. Interpretability and usability are also important.
• They also differ with respect to the partitioning level, whether or not clusters are mutually exclusive, the similarity measures used, and whether or not subspace clustering is performed.
• We also study the types of data that often occur in cluster analysis and how to preprocess them for such an analysis.
• Suppose that a data set to be clustered contains n objects, which may
represent persons, houses, documents, countries, and so on. Main
memory-based clustering algorithms typically operate on either of the
following two data structures.
Types of Data in Cluster Analysis
Data matrix (or object-by-variable structure):
This represents n objects, such as persons, with p variables (also
called measurements or attributes), such as age, height, weight, gender,
and so on.
• The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables)
Dissimilarity matrix (or object-by-object structure):
This stores a collection of proximities for all pairs of the n objects, often as an n-by-n table.
• d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ.
• For example, changing measurement units from meters to inches for height, or from kilograms
to pounds for weight, may lead to a very different clustering structure.
• To avoid dependence on the choice of measurement units, the data should be standardized.
• Standardizing measurements attempts to give all variables an equal weight.
• Data Transformation by Normalization
• The measurement unit used can affect the data analysis.
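As a minimal illustration of such standardization (a sketch, not from the original slides), the following Python snippet standardizes a numeric variable using the mean absolute deviation, z_if = (x_if − m_f) / s_f, which is commonly preferred over the standard deviation because it is less sensitive to outliers:

```python
import numpy as np

def standardize(values):
    """Standardize one variable: z = (x - mean) / mean absolute deviation."""
    x = np.asarray(values, dtype=float)
    m = x.mean()                      # mean value m_f of the variable
    s = np.abs(x - m).mean()          # mean absolute deviation s_f
    return (x - m) / s

# Hypothetical heights recorded in meters vs. the same heights in inches
heights_m = [1.55, 1.70, 1.85, 1.60]
heights_in = [h * 39.37 for h in heights_m]

# After standardization both unit choices give the same z-scores,
# so the clustering structure no longer depends on the measurement unit.
print(standardize(heights_m))
print(standardize(heights_in))
```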
• A binary variable is symmetric if both of its states are equally valuable and carry the same weight.
• Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity.
• A binary variable is asymmetric if the outcomes of the states are not equally important, such as the
positive and negative outcomes of a disease test.
• A binary variable contains two possible outcomes: 1 (positive/present) or 0
(negative/absent). If there is no preference for which outcome should be coded as
0 and which as 1, the binary variable is called symmetric.
• For example, the binary variable "is evergreen?" for a plant has the possible states
"loses leaves in winter" and "does not lose leaves in winter." Both are equally
valuable and carry the same weight when a proximity measure is computed.
• If the outcomes of a binary variable are not equally important, the binary variable
is called asymmetric.
• An example of such a variable is the presence or absence of a relatively rare
attribute, such as "is color-blind" for a human being.
• While you say that two people who are color-blind have something in common,
you cannot say that people who are not color-blind have something in common.
Jaccard Coefficient
• The number of negative matches, t, is considered unimportant and thus is
ignored in the computation, as
• we can measure the distance between two binary variables based on the
notion of similarity instead of dissimilarity.
• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
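A small Python sketch that reproduces these values (assuming, as in the example, that Y and P are coded as 1, N as 0, and the symmetric attribute gender is ignored):

```python
def asym_binary_dissim(a, b):
    """Asymmetric binary dissimilarity: (r + s) / (q + r + s),
    where negative matches (both 0) are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # positive matches
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1..Test-4 with Y/P coded as 1 and N as 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```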
"How can we compute the dissimilarity between objects described by categorical, ordinal, and ratio-scaled variables?"
Categorical, Ordinal, and Ratio-Scaled
Variables
• A categorical variable is a generalization of the binary variable in that it can take on more than
two states.
• For example, map color is a categorical variable that may have, say, five states: red, yellow, green,
pink, and blue.
• Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M.
• The dissimilarity between two objects i and j can be computed based on the ratio of mismatches (Eqn 7.3):
d(i, j) = (p − m) / p
• where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
Suppose that we have the sample data of Table 7.3, except that only the object identifier and the variable (or attribute) test-1 are available, where test-1 is categorical. Let's compute the dissimilarity matrix.
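As a sketch (assuming, as in the book's Table 7.3, that the four objects take the test-1 values code-A, code-B, code-C, and code-A), the mismatch-ratio dissimilarity matrix can be computed as:

```python
import numpy as np

def categorical_dissim(values):
    """d(i, j) = (p - m) / p for a single categorical variable (p = 1)."""
    n = len(values)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d[i, j] = 0.0 if values[i] == values[j] else 1.0
    return d

# Assumed test-1 values for objects 1-4
test1 = ["code-A", "code-B", "code-C", "code-A"]
print(categorical_dissim(test1))
# Off the diagonal, only d(1, 4) = d(4, 1) = 0, since objects 1 and 4 share the state code-A.
```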
3. Ordinal Variables
• An ordinal variable can be discrete or continuous. (we need to convert ordinal
into ratio scale)
• Order is important, e.g. rank (junior, senior)
• Can be treated like interval-scaled
• Replace an ordinal variable value by its rank:
• The distance can be calculated by treating ordinal as quantitative
• Map the range of each variable onto [0.0, 1.0] by replacing the rank r_if of the i-th object in the f-th variable by the normalized rank
z_if = (r_if − 1) / (M_f − 1)
• There are three states for test-2, namely fair, good, and excellent; that is, M_f = 3.
• In step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively.
• Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
• For step 3, we can use, say, the Euclidean distance, which results in
the following dissimilarity matrix:
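A minimal Python sketch of these three steps (assuming the test-2 values excellent, fair, good, excellent for objects 1 to 4, consistent with the ranks 3, 1, 2, 3 above):

```python
import numpy as np

# Assumed test-2 values for objects 1-4
test2 = ["excellent", "fair", "good", "excellent"]
order = {"fair": 1, "good": 2, "excellent": 3}   # M_f = 3 ordered states

# Step 1: replace each value by its rank r_if
ranks = np.array([order[v] for v in test2], dtype=float)      # [3, 1, 2, 3]

# Step 2: normalize ranks onto [0, 1]: z_if = (r_if - 1) / (M_f - 1)
z = (ranks - 1) / (len(order) - 1)                            # [1.0, 0.0, 0.5, 1.0]

# Step 3: Euclidean distance on the normalized values (1-D case)
d = np.abs(z[:, None] - z[None, :])
print(d)   # e.g., d(2,1) = 1.0, d(3,1) = 0.5, d(4,1) = 0.0
```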
Ratio-Scaled Variables
• A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an approximately exponential scale Ae^(Bt) or Ae^(−Bt), where A and B are positive constants and t typically represents time. Examples: the growth of a bacteria population or the decay of a radioactive element.
Ratio-Scaled Variables
• There are three common ways to handle ratio-scaled variables when computing dissimilarity:
• Treat ratio-scaled variables like interval-scaled variables (usually not a good choice, since the scale may be distorted).
• Apply a logarithmic transformation, y_if = log(x_if), and treat the transformed values as interval-scaled.
• Treat x_if as continuous ordinal data and use its rank as an interval-scaled value.
• The latter two methods are the most effective, although the choice of method used may depend on the given application.
Dissimilarity between ratio-scaled variables
4. Variables of Mixed Types
• how can we compute the dissimilarity between objects of mixed
variable types?”
• One approach is to group each kind of variable together, performing a
separate cluster analysis for each variable type.
• A more preferable approach is to process all variable types together,
performing a single cluster analysis.
• Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as
d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)
where the indicator δ_ij^(f) = 0 if x_if or x_jf is missing (or if x_if = x_jf = 0 and variable f is asymmetric binary), and δ_ij^(f) = 1 otherwise; d_ij^(f) is the contribution of variable f, computed according to its type.
Variables of Mixed Types
Variables of Mixed Types
• Apply a logarithmic transformation to its values. Based on the transformed values of 2.65, 1.34, 2.21, and 3.08 obtained for the objects 1 to 4,
• max_h x_h = 3.08 and min_h x_h = 1.34.
• Then normalize the values in the dissimilarity matrix obtained in Example 7.5 by dividing each one by (3.08 − 1.34) = 1.74.
• We can now use the dissimilarity matrices for the three variables in
our computation.
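As a sketch (assuming raw test-3 values of 445, 22, 164, and 1210, which are consistent with the transformed values 2.65, 1.34, 2.21, and 3.08 quoted above), the log transform and normalization can be written as:

```python
import numpy as np

test3 = np.array([445.0, 22.0, 164.0, 1210.0])   # assumed ratio-scaled values
logged = np.log10(test3)                          # approx. [2.65, 1.34, 2.21, 3.08]

# Pairwise dissimilarity on the transformed values, normalized to [0, 1]
# by dividing by max - min = 3.08 - 1.34 = 1.74
diff = np.abs(logged[:, None] - logged[None, :])
normalized = diff / (logged.max() - logged.min())
print(np.round(normalized, 2))
```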
Vector Objects
• There are several ways to define such a similarity function, s(x, y), to
compare two vectors x and y.
• One popular way is to define the similarity function as a cosine measure:
s(x, y) = (x^T · y) / (‖x‖ ‖y‖)
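A small sketch of the cosine measure in Python (the two example vectors are hypothetical term-frequency vectors):

```python
import numpy as np

def cosine_similarity(x, y):
    """s(x, y) = x.y / (||x|| * ||y||); 1 means same direction, 0 means orthogonal."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical term-frequency vectors for two documents
doc1 = [5, 0, 3, 0, 2]
doc2 = [3, 0, 2, 0, 1]
print(round(cosine_similarity(doc1, doc2), 3))
```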
Session 2
Partitioning Method: Introduction
K-Means Algorithm
Partitioning Algorithms: Basic Concept
• Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the
sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci)
E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, c_i)²
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is
represented by one of the objects in the cluster
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
• Partition objects into k nonempty subsets
• Compute seed points as the centroids of the clusters of the current
partitioning (the centroid is the center, i.e., mean point, of the cluster)
• Assign each object to the cluster with the nearest seed point
• Go back to Step 2, stop when the assignment does not change
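As a sketch of these steps in practice (using scikit-learn, which is assumed to be available; the sample points are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# k-means with k = 2: iteratively assigns points to the nearest centroid
# and recomputes centroids until the assignment no longer changes.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index of each point
print(km.cluster_centers_)  # final centroids (mean points of each cluster)
```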
K-Means Clustering-
K-Means clustering is an unsupervised iterative clustering technique.
It partitions the given data set into k predefined distinct clusters.
A cluster is defined as a collection of data points exhibiting certain similarities.
• Point-01:
• It is relatively efficient with time complexity O(nkt) where-
• n = number of instances
• k = number of clusters
• t = number of iterations
• Point-02:
• K-means is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
• K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point,
medoids can be used, which is the most centrally located object in a cluster
• Another variant to k-means is the k-modes method, which extends the k-means
paradigm to cluster categorical data by replacing the means of clusters with
modes, using new dissimilarity measures to deal with categorical objects and a
frequency-based method to update modes of clusters. The k-means and the k-
modes methods can be integrated to cluster data with mixed numeric and
categorical values.
• The EM (Expectation-Maximization) algorithm extends the k-means paradigm in a
different way. Whereas the k-means algorithm assigns each object to a cluster,
• In EM, each object is assigned to each cluster according to a weight representing its probability of membership.
• In other words, there are no strict boundaries between clusters. Therefore, new
means are computed based on weighted measures.
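A sketch of this soft assignment using scikit-learn's Gaussian mixture model, which is fitted by EM (the data points are hypothetical):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D data with two loose groups
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7], [3.0]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Unlike k-means, each object gets a membership weight for every cluster,
# and the new means are computed from these weighted assignments.
print(np.round(gm.predict_proba(X), 3))
print(gm.means_.ravel())
```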
"How can we make the k-means algorithm more scalable?"
A recent approach to scaling the k-means algorithm is based on the idea of identifying three
kinds of regions in data:
1. regions that are compressible,
2. regions that must be maintained in main memory,
3. and regions that are discardable.
An object is discardable if its membership in a cluster is ascertained.
An object is compressible if it is not discardable but belongs to a tight subcluster.
A data structure known as a clustering feature is used to summarize objects that have been
discarded or compressed.
If an object is neither discardable nor compressible, then it should be retained in main memory.
To achieve scalability, the iterative clustering algorithm includes only the clustering features of the compressible objects and the objects that must be retained in main memory,
• thereby turning a secondary-memory-based algorithm into a main-memory-based algorithm.
• An alternative approach to scaling the k-means algorithm explores the microclustering idea, which first groups nearby objects into "microclusters" and then performs k-means clustering on the microclusters.
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as
ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
We calculate the distance of each point from each of the centers of the three clusters, using the given distance function.
For example, the distance between A1(2, 10) and C2(5, 8) is |5 − 2| + |8 − 10| = 3 + 2 = 5.
• Calculate new cluster centers.
• After the second iteration, the centers of the three clusters are:
• C1(3, 9.5)
• C2(6.5, 5.25)
• C3(1.5, 3.5)
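A short Python sketch of this worked example (a from-scratch k-means using the Manhattan distance given above), which reproduces the centers after the second iteration:

```python
import numpy as np

points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                   (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)  # A1..A8
centers = np.array([(2, 10), (5, 8), (1, 2)], dtype=float)        # A1, A4, A7

for iteration in range(2):
    # Assign each point to the cluster with the nearest center (Manhattan distance)
    dist = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dist.argmin(axis=1)
    # Recompute each center as the mean of its assigned points
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])

print(centers)   # [[3.   9.5 ], [6.5  5.25], [1.5  3.5 ]]
```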
Practice Problem
Session 3
K-Medoids
Hierarchical Method: Introduction
Quality of clustering: measured by the variation within clusters, e.g., the within-cluster sum of squared errors
E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, c_i)²
“How can we modify the k-means algorithm to diminish such sensitivity to
outliers?”
• Instead of taking the mean value of the objects in a cluster as a reference point,
• pick actual objects to represent the clusters, using one representative object per
cluster.
• Each remaining object is assigned to the cluster whose representative object it is most similar to.
• The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object p and its corresponding representative object.
• That is, an absolute-error criterion is used, defined as
E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, o_i)
where o_i is the representative object (medoid) of cluster C_i.
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the
non-medoids if it improves the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well for large data sets (due to the
computational complexity)
[Figure: PAM on a 10 × 10 grid of sample points — arbitrarily choose k objects as the initial medoids, then assign each remaining object to the nearest medoid.]
[Figure, continued: repeat — compute the total cost of swapping a medoid O with a non-medoid O_random, and perform the swap if the quality of the clustering is improved — until no change.]
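A minimal sketch of this swap-based PAM loop in Python (not the slides' own code; it uses Manhattan distance and small hypothetical data):

```python
import numpy as np
from itertools import product

def total_cost(points, medoid_idx):
    """Sum of distances from each point to its nearest medoid (absolute-error E)."""
    d = np.abs(points[:, None, :] - points[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

def pam(points, k):
    medoids = list(range(k))                      # arbitrarily pick the first k objects
    best = total_cost(points, medoids)
    improved = True
    while improved:                               # repeat until no change
        improved = False
        for m, o in product(range(k), range(len(points))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m] = o                      # try swapping medoid m with non-medoid o
            cost = total_cost(points, candidate)
            if cost < best:                       # keep the swap if quality improves
                medoids, best, improved = candidate, cost, True
    return medoids, best

# Hypothetical 2-D points
points = np.array([(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
                   (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)], dtype=float)
print(pam(points, k=2))
```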
• K-medoids is more robust than k-means in the presence of noise and outliers.
• Time complexity: each iteration of PAM costs O(k(n − k)²), which is expensive for large n and k.
• "How can we scale up the k-medoids method?"
• CLARA (Clustering LARge Applications) works on random samples of the data rather than the whole data set.
• In some situations we may want to partition our data into groups at
different levels such as in a hierarchy.
• A hierarchical clustering method works by grouping data objects into
a hierarchy or “tree” of clusters.
• Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.
• Examples: handwriting recognition, hierarchies of species (animals, birds, etc.), employee hierarchies, and game trees (e.g., chess).
• Agglomerative versus divisive hierarchical clustering,
• which organize objects into a hierarchy using a bottom-up or top-
down strategy, respectively.
• Agglomerative methods start with individual objects as clusters,
which are iteratively merged to form larger clusters.
• Conversely, divisive methods initially let all the given objects form one cluster, which they iteratively split into smaller clusters.
• Hierarchical clustering methods can encounter difficulties regarding
the selection of merge or split points. Such a decision is critical,
• merge or split decisions, if not well chosen, may lead to low-quality
clusters.
Moreover, the methods do not scale well because each decision of
merge or split needs to examine and evaluate many objects or clusters.
Solution: they can be combined with other clustering techniques to form multiphase clustering (e.g., BIRCH).
Hierarchical Clustering
DBSCAN technique
STING technique
CLIQUE technique
• Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
DIANA (Divisive Analysis)
Distance Between Clusters
• Single Link: smallest distance between points
• Complete Link: largest distance between points
• Average Link: average distance between points
• Centroid: distance between centroids
Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq) // used when updating the distance matrix
• Complete link: largest distance between an element in one cluster and an element
in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
• Average: avg distance between an element in one cluster and an element in the
other, i.e., dist(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
• Medoid: a chosen, centrally located object in the cluster
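A small sketch computing these inter-cluster distances for two hypothetical 1-D clusters:

```python
import numpy as np

K1 = np.array([1.0, 2.0, 3.0])      # hypothetical cluster 1
K2 = np.array([6.0, 7.0, 9.0])      # hypothetical cluster 2

pairwise = np.abs(K1[:, None] - K2[None, :])   # all |tip - tjq| pairs

print("single link  :", pairwise.min())                 # 3.0 (closest pair: 3 and 6)
print("complete link:", pairwise.max())                 # 8.0 (farthest pair: 1 and 9)
print("average link :", pairwise.mean())                # average over all pairs
print("centroid     :", abs(K1.mean() - K2.mean()))     # distance between cluster means
```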
Centroid, Radius and Diameter of a Cluster (for numerical data
sets)
• Centroid: the "middle" of a cluster, C_m = (Σ_{i=1}^{N} t_ip) / N
• Radius: square root of the average squared distance from any point of the cluster to its centroid, R_m = sqrt( Σ_{i=1}^{N} (t_ip − C_m)² / N )
• Diameter: square root of the average squared distance between all pairs of points in the cluster, D_m = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_ip − t_jq)² / (N(N − 1)) )
[Figure: a dendrogram over objects A, B, C, D, E, cut at a threshold level to obtain clusters.]
Problem: For the one-dimensional data set {7, 10, 20, 28, 35}, perform hierarchical clustering and plot the dendrogram to visualize it.
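A sketch of this problem using SciPy's hierarchical clustering utilities (single linkage assumed, since the problem does not specify one):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

data = np.array([7, 10, 20, 28, 35], dtype=float).reshape(-1, 1)

# Agglomerative clustering with single (min) linkage
Z = linkage(data, method="single")
print(Z)   # each row: the two clusters merged and the distance at which they merge

dendrogram(Z, labels=["7", "10", "20", "28", "35"])
plt.show()
```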
Extensions to Hierarchical Clustering
• Major weaknesses of agglomerative clustering methods: they can never undo what was done previously, and they do not scale well (time complexity of at least O(n²), where n is the number of objects).
[Figure: BIRCH clustering feature example — five points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on a 10 × 10 grid; with the linear sum LS = Σ_{i=1}^{N} X_i and the square sum SS = Σ_{i=1}^{N} X_i², the clustering feature is CF = (5, (16, 30), (54, 190)).]
CF-Tree in BIRCH
• Clustering feature:
• Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd moments of the
subcluster from the statistical point of view
• Registers crucial measurements for computing clusters and utilizes storage efficiently
• A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
[Figure: a non-leaf node of a CF tree, holding entries CF1, CF2, CF3, …, CF5 with child pointers child1, child2, child3, …, child5.]
The Birch Algorithm
• Cluster diameter: D = sqrt( Σ_{i≠j} (x_i − x_j)² / (n(n − 1)) )
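A sketch of a clustering feature CF = (N, LS, SS) and the diameter computed from it, using the five points from the figure above:

```python
import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)

# Clustering feature: 0th, 1st and 2nd moments of the subcluster
N = len(points)                       # 5
LS = points.sum(axis=0)               # linear sum per dimension -> [16. 30.]
SS = (points ** 2).sum(axis=0)        # square sum per dimension -> [54. 190.]
print(N, LS, SS)

# Diameter from the CF alone, using the identity
# sum over i != j of ||xi - xj||^2 == 2 * (N * sum(SS) - ||LS||^2)
ss_total = SS.sum()
diameter = np.sqrt(2 * (N * ss_total - (LS ** 2).sum()) / (N * (N - 1)))
print(round(float(diameter), 3))
```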
Partitioning and hierarchical methods are
designed to find spherical-shaped clusters.
Session 6
DBSCAN
Density-based clusters are dense areas in the data
space separated from each other by sparser areas.
Given such data, partitioning and hierarchical methods would likely identify convex regions inaccurately, with noise or outliers included in the clusters.
• Two parameters:
• Eps: Maximum radius of the neighbourhood
• MinPts: Minimum number of points in an Eps-
neighbourhood of that point
• NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
• p belongs to NEps(q), and
• q satisfies the core point condition: |NEps(q)| ≥ MinPts
[Figure: p inside the Eps-neighborhood of core point q, with MinPts = 5 and Eps = 1 cm.]
Density-Reachable and Density-Connected
• Density-reachable:
• A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
[Figure: a chain q = p1 → … → p, each link directly density-reachable.]
• Density-connected:
• A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: p and q both density-reachable from a common point o.]
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
• Relies on a density-based notion of cluster: A cluster is defined as
a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with
noise
[Figure: core, border, and outlier (noise) points, with Eps = 1 cm and MinPts = 5.]
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
• Continue the process until all of the points have been
processed
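A sketch of DBSCAN in practice using scikit-learn (assumed to be available; eps and min_samples correspond to Eps and MinPts, and the sample data is hypothetical):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away noise point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, size=(20, 2)),
               rng.normal((5, 5), 0.3, size=(20, 2)),
               [[10.0, 10.0]]])

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(db.labels_)   # cluster index per point; -1 marks noise/outliers
```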
DBSCAN: Sensitive to Parameters
Session 7
STING
Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
• STING (a STatistical INformation Grid approach) by Wang,
Yang and Muntz (1997)
• WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
• A multi-resolution clustering approach using wavelet method
• CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering
STING: A Statistical Information Grid Approach
[Figure: the STING hierarchical grid structure — the 1st (top) layer, the (i−1)-st layer, and the i-th layer; each cell at a layer is split into smaller cells at the next lower layer.]
The STING Clustering Method
• Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
• Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
• Parameters of higher level cells can be easily calculated from
parameters of lower level cell
• count, mean, standard deviation (s), min, max
• type of distribution—normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
• Start from a pre-selected layer—typically with a small number of
cells
• For each cell in the current level compute the confidence
interval
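A small sketch of how a higher-level cell's statistics can be derived from its child cells' parameters without rescanning the data (the child statistics below are hypothetical):

```python
# Each child cell stores (count, mean, min, max), computed beforehand.
children = [
    (100, 20.0, 5.0, 40.0),   # hypothetical child cell 1
    (50, 30.0, 10.0, 60.0),   # hypothetical child cell 2
    (150, 25.0, 2.0, 55.0),   # hypothetical child cell 3
]

count = sum(c[0] for c in children)
mean = sum(c[0] * c[1] for c in children) / count   # count-weighted mean
lo = min(c[2] for c in children)
hi = max(c[3] for c in children)

print(count, round(mean, 2), lo, hi)   # parent-cell parameters
```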
STING Algorithm and Its Analysis
• Remove the irrelevant cells from further consideration
• When we finish examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
• Query-independent, easy to parallelize, incremental update
• O(K), where K is the number of grid cells at the lowest level
• Disadvantages:
• All the cluster boundaries are either horizontal or vertical,
and no diagonal boundary is detected
Session 8
CLIQUE
CLIQUE (Clustering In QUEst)
• Partition the data space and find the number of points that lie
inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori
principle
• Identify clusters
• Determine dense units in all subspaces of interest
• Determine connected dense units in all subspaces of interest
• Generate minimal description for the clusters
• Determine maximal regions that cover a cluster of connected
dense units for each cluster
• Determination of minimal cover for each cluster
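A minimal sketch of the grid-and-density step (hypothetical 2-D data over age and salary; a unit is dense if it holds at least `threshold` points, and a 2-D candidate is only checked if both of its 1-D projections are dense, in the spirit of the Apriori principle):

```python
from collections import Counter
from itertools import product

# Hypothetical (age, salary-in-thousands) records
data = [(25, 30), (27, 32), (26, 31), (45, 70), (47, 72), (46, 71), (30, 65)]
width = 10          # grid cell width for both dimensions
threshold = 3       # minimum number of points for a unit to be dense

def cell(value):
    return value // width

# 1-D dense units per dimension
dense_1d = []
for dim in range(2):
    counts = Counter(cell(row[dim]) for row in data)
    dense_1d.append({c for c, n in counts.items() if n >= threshold})

# Apriori-style pruning: only 2-D candidates whose 1-D projections are both dense
candidates = set(product(dense_1d[0], dense_1d[1]))
counts_2d = Counter((cell(a), cell(s)) for a, s in data)
dense_2d = {u for u in candidates if counts_2d[u] >= threshold}

print(dense_1d)   # dense intervals along age and salary
print(dense_2d)   # dense 2-D units, i.e., clusters in the (age, salary) subspace
```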
[Figure: CLIQUE example — the data plotted in the (age, salary) and (age, vacation) subspaces with density threshold τ = 3; the dense units found in each 2-D subspace intersect to form a candidate dense region in the (age, salary, vacation) space for ages of roughly 30–50.]
Strength and Weakness of CLIQUE
• Strength
• automatically finds subspaces of the highest dimensionality
such that high density clusters exist in those subspaces
• insensitive to the order of records in input and does not
presume some canonical data distribution
• scales linearly with the size of input and has good scalability
as the number of dimensions in the data increases
• Weakness
• The accuracy of the clustering result may be degraded at the
expense of simplicity of the method
Session 9
Evaluation of Clustering Techniques
Assessing Clustering Tendency
• Assess if non-random structure exists in the data by measuring the probability that the data is
generated by a uniform data distribution
• Test spatial randomness by a statistical test: the Hopkins statistic
• Given a dataset D regarded as a sample of a random variable o, determine how far away o is
from being uniformly distributed in the data space
• Sample n points, p1, …, pn, uniformly from D. For each pi, find its nearest neighbor in D: xi =
min{dist (pi, v)} where v in D
• Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D – {qi}:
yi = min{dist (qi, v)} where v in D and v ≠ qi
• Calculate the Hopkins statistic: H = Σ_{i=1}^{n} y_i / (Σ_{i=1}^{n} x_i + Σ_{i=1}^{n} y_i)
• If D is uniformly distributed, ∑ xi and ∑ yi will be close to each other and H is close to 0.5. If D
is highly skewed, H is close to 0
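A sketch of the Hopkins statistic in Python (as an assumption of this sketch, the p_i are drawn uniformly from the bounding box of the data space, which is a common way to implement the first sampling step):

```python
import numpy as np

def hopkins(D, n=20, rng=np.random.default_rng(0)):
    D = np.asarray(D, dtype=float)
    lo, hi = D.min(axis=0), D.max(axis=0)

    # p_i: points drawn uniformly from the data space (bounding box of D);
    # x_i = distance from p_i to its nearest neighbor in D
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x = np.array([np.linalg.norm(D - p, axis=1).min() for p in P])

    # q_i: actual data points sampled from D;
    # y_i = distance from q_i to its nearest neighbor in D - {q_i}
    idx = rng.choice(len(D), size=n, replace=False)
    y = np.array([np.sort(np.linalg.norm(D - D[i], axis=1))[1] for i in idx])

    return y.sum() / (x.sum() + y.sum())

uniform = np.random.default_rng(1).uniform(0, 10, size=(200, 2))
clustered = np.vstack([np.random.default_rng(2).normal(c, 0.2, size=(100, 2))
                       for c in [(2, 2), (8, 8)]])
print(round(hopkins(uniform), 2))    # close to 0.5 for uniformly distributed data
print(round(hopkins(clustered), 2))  # close to 0 for highly clustered (skewed) data
```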
Determine the Number of Clusters
• Empirical method
• # of clusters ≈ √(n/2) for a data set of n points
• Elbow method
• Use the turning point in the curve of sum of within cluster variance w.r.t
the # of clusters
• Cross validation method
• Divide a given data set into m parts
• Use m – 1 parts to obtain a clustering model
• Use the remaining part to test the quality of the clustering
• E.g., For each point in the test set, find the closest centroid, and use
the sum of squared distance between all points in the test set and the
closest centroids to measure how well the model fits the test set
• For any k > 0, repeat it m times, compare the overall quality measure w.r.t.
different k’s, and find # of clusters that fits the data the best
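A sketch of the elbow method using scikit-learn (hypothetical data; inertia_ is the within-cluster sum of squared distances):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in [(0, 0), (5, 0), (2, 4)]])

# Within-cluster sum of squares for k = 1..8; look for the "elbow" (turning point)
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```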
Measuring Clustering Quality
• Two methods: extrinsic vs. intrinsic
• Extrinsic: supervised, i.e., the ground truth is available
• Compare a clustering against the ground truth using certain
clustering quality measure
• Ex. BCubed precision and recall metrics
• Intrinsic: unsupervised, i.e., the ground truth is unavailable
• Evaluate the goodness of a clustering by considering how well
the clusters are separated, and how compact the clusters are
• Ex. Silhouette coefficient
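A sketch of intrinsic evaluation with the silhouette coefficient via scikit-learn (hypothetical data; values near +1 indicate compact, well-separated clusters):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in [(0, 0), (6, 6)]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(round(silhouette_score(X, labels), 3))   # average silhouette over all points
```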
Measuring Clustering Quality: Extrinsic Methods
• Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
• Q is good if it satisfies the following 4 essential criteria
• Cluster homogeneity: the purer, the better
• Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster
• Rag bag: putting a heterogeneous object into a pure cluster should be
penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other”
category)
• Small cluster preservation: splitting a small category into pieces is more
harmful than splitting a large category into pieces
References
• Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann Publishers, 2011.
• http://ccs1.hnue.edu.vn/hungtd/DM2012/DataMining_BOOK.pdf