Module 4 ML
CLUSTERING
• Clustering is the process of grouping the data into classes or clusters, so that objects within
a cluster have high similarity in comparison to one another but are very dissimilar to objects
in other clusters.
• Dissimilarities are assessed based on the attribute values describing the objects.
• Clustering has its roots in many areas, including data mining, statistics, biology, and
machine learning.
• The most popular distance measure is Euclidean distance, defined as
d(i, j) = √((xi1 − xj1)² + (xi2 − xj2)² + … + (xin − xjn)²),
where i = (xi1, xi2, …, xin) and j = (xj1, xj2, …, xjn) are two n-dimensional data objects.
• Another well-known metric is Manhattan (or city block) distance, defined as
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xin − xjn|.
Both the Euclidean distance and Manhattan distance satisfy the following mathematic
requirements of a distance function:
1. d (i, j) ≥ 0: Distance is a nonnegative number.
2. d (i, i) = 0: The distance of an object to itself is 0.
3. d (i, j) = d (j, i): Distance is a symmetric function.
4. d (i, j) ≤ d (i, h) + d (h, j): Going directly from object i to object j in space is no
more than making a detour over any other object h (triangular inequality).
Example of Euclidean distance and Manhattan distance.
Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in the figure below. The Euclidean
distance between them is √(2² + 3²) = √13 ≈ 3.61, and the Manhattan distance is 2 + 3 = 5.
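As a minimal sketch (assuming NumPy is available), these two distances for the example objects can be computed as:

import numpy as np

x1 = np.array([1, 2])
x2 = np.array([3, 5])

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = np.sqrt(np.sum((x1 - x2) ** 2))   # sqrt(4 + 9) = sqrt(13) ≈ 3.61

# Manhattan (city block) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(x1 - x2))           # 2 + 3 = 5

print(euclidean, manhattan)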
To measure the similarity between two vector objects, v1 and v2, a cosine measure can be used:
s(v1, v2) = (v1 · v2) / (|v1| |v2|),
where the inner product v1 · v2 is the standard vector dot product, Σ_{i=1..t} v1i v2i, and the
norm |v1| in the denominator is defined as |v1| = √(v1 · v1).
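A minimal sketch of this cosine measure, again assuming NumPy and using made-up example vectors:

import numpy as np

v1 = np.array([1.0, 2.0, 0.0])
v2 = np.array([2.0, 1.0, 1.0])

# Inner product and norms exactly as defined above
dot = np.dot(v1, v2)                  # sum of v1i * v2i
norm_v1 = np.sqrt(np.dot(v1, v1))
norm_v2 = np.sqrt(np.dot(v2, v2))

similarity = dot / (norm_v1 * norm_v2)
print(similarity)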
HIERARCHICAL CLUSTERING
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical method can be classified as being either agglomerative or divisive, based on how
the hierarchical decomposition is formed.
• The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups that are
close to one another, until all of the groups are merged into one (the topmost level of
the hierarchy), or until a termination condition holds.
• The divisive approach, also called the top-down approach, starts with all of the
objects in the same cluster. In each successive iteration, a cluster is split up into smaller
clusters, until eventually each object is in one cluster, or until a termination condition
holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone. That is, if a particular merge or split decision later turns out to have been a poor
choice, the method cannot backtrack and correct it. This rigidity is useful in that it leads to
smaller computation costs by not having to worry about a combinatorial number of different
choices. However, such techniques cannot correct erroneous decisions.
There are two approaches to improving the quality of hierarchical clustering:
1. perform careful analysis of object “linkages” at each hierarchical partitioning, such as in
Chameleon, or
2. integrate hierarchical agglomeration and other approaches by first using a hierarchical
agglomerative algorithm to group objects into microclusters, and then performing
macroclustering on the microclusters using another clustering method such as iterative
relocation, as in BIRCH.
Agglomerative and Divisive Hierarchical Clustering
In general, there are two types of hierarchical clustering methods:
1. Agglomerative hierarchical clustering: This bottom-up strategy starts by placing
each object in its own cluster and then merges these atomic clusters into larger and
larger clusters, until all of the objects are in a single cluster or until certain termination
conditions are satisfied. Most hierarchical clustering methods belong to this category.
They differ only in their definition of intercluster similarity.
2. Divisive hierarchical clustering: This top-down strategy does the reverse of
agglomerative hierarchical clustering by starting with all objects in one cluster. It
subdivides the cluster into smaller and smaller pieces, until each object forms a cluster
on its own or until it satisfies certain termination conditions, such as a desired number
of clusters is obtained or the diameter of each cluster is within a certain threshold.
Agglomerative versus divisive hierarchical clustering. Figure shows the application of
AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and
DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, to a data set of five
objects, {a, b, c, d, e}.
Initially, AGNES places each object into a cluster of its own. The clusters are then merged
step-by-step according to some criterion. This is a single-linkage approach in that each
cluster is represented by all of the objects in the cluster, and the similarity between two
clusters is measured by the similarity of the closest pair of data points belonging to different
clusters. The cluster merging process repeats until all of the objects are eventually merged
to form one cluster.
In DIANA, all of the objects are used to form one initial cluster. The cluster is split
according to some principle, such as the maximum Euclidean distance between the closest
neighbouring objects in the cluster. The cluster splitting process repeats until, eventually,
each new cluster contains only a single object.
In either agglomerative or divisive hierarchical clustering, the user can specify the desired
number of clusters as a termination condition.
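As an illustrative sketch (not the exact AGNES/DIANA procedures), a single-linkage agglomerative clustering of five points labelled a–e can be run with scikit-learn's AgglomerativeClustering; the coordinates below are hypothetical:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical coordinates for objects a, b, c, d, e
X = np.array([[1.0, 1.0],   # a
              [1.2, 1.1],   # b
              [4.0, 4.0],   # c
              [4.1, 4.2],   # d
              [7.0, 1.0]])  # e

# Bottom-up merging using the minimum (single-link) inter-cluster distance,
# terminated when the desired number of clusters remains
agnes_like = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = agnes_like.fit_predict(X)
print(labels)   # cluster label for each of a, b, c, d, e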
A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It shows how objects are grouped together step by step. Figure
below shows a dendrogram for the five objects presented in Figure above,
where l = 0 shows the five objects as singleton clusters at level 0. At l = 1, objects a and b
are grouped together to form the first cluster, and they stay together at all subsequent levels.
We can also use a vertical axis to show the similarity scale between clusters.
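The step-by-step merging can also be visualised as a dendrogram, for example with SciPy's hierarchy utilities (same hypothetical coordinates as in the previous sketch):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical coordinates for objects a, b, c, d, e
X = np.array([[1.0, 1.0], [1.2, 1.1], [4.0, 4.0], [4.1, 4.2], [7.0, 1.0]])

# Each row of the linkage matrix Z records one merge step and the distance at which it occurs
Z = linkage(X, method="single")
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.ylabel("inter-cluster distance (similarity scale)")
plt.show()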
Four widely used measures for distance between clusters are as follows, where |p − p′| is the
distance between two objects or points p and p′, mi is the mean for cluster Ci, and ni is the
number of objects in Ci:
• Minimum distance: dmin(Ci, Cj) = min { |p − p′| : p ∈ Ci, p′ ∈ Cj }
• Maximum distance: dmax(Ci, Cj) = max { |p − p′| : p ∈ Ci, p′ ∈ Cj }
• Mean distance: dmean(Ci, Cj) = |mi − mj|
• Average distance: davg(Ci, Cj) = (1 / (ni nj)) Σ_{p ∈ Ci} Σ_{p′ ∈ Cj} |p − p′|
• When an algorithm uses the minimum distance, dmin (Ci, Cj), to measure the distance
between clusters, it is sometimes called a nearest-neighbour clustering algorithm.
• If the clustering process is terminated when the distance between nearest clusters exceeds
an arbitrary threshold, it is called a single-linkage algorithm.
• If we view the data points as nodes of a graph, with edges forming a path between the nodes
in a cluster, then the merging of two clusters, Ci and Cj, corresponds to adding an edge
between the nearest pair of nodes in Ci and Cj. Because edges linking clusters always go
between distinct clusters, the resulting graph will generate a tree. Thus, an agglomerative
hierarchical clustering algorithm that uses the minimum distance measure is also called a
minimal spanning tree algorithm.
• When an algorithm uses the maximum distance, dmax (Ci, Cj), to measure the distance
between clusters, it is sometimes called a farthest-neighbor clustering algorithm.
• If the clustering process is terminated when the maximum distance between nearest clusters
exceeds an arbitrary threshold, it is called a complete-linkage algorithm.
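These four measures can be computed directly; a small NumPy sketch with two made-up clusters:

import numpy as np

Ci = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]])   # cluster Ci
Cj = np.array([[6.0, 6.0], [7.0, 7.5]])               # cluster Cj

# All pairwise distances |p - p'| with p in Ci and p' in Cj
pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

d_min  = pairwise.min()    # nearest-neighbour / single-link distance
d_max  = pairwise.max()    # farthest-neighbour / complete-link distance
d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))  # distance between means mi, mj
d_avg  = pairwise.mean()   # average over all ni * nj pairs

print(d_min, d_max, d_mean, d_avg)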
What are some of the difficulties with hierarchical clustering?
• The hierarchical clustering method, though simple, often encounters difficulties regarding
the selection of merge or split points. Such a decision is critical because once a group of
objects is merged or split, the process at the next step will operate on the newly generated
clusters. It will neither undo what was done previously nor perform object swapping
between clusters. Thus, merge or split decisions, if not well chosen at some step, may lead
to low-quality clusters.
• The method does not scale well, because each decision to merge or split requires the
examination and evaluation of a good number of objects or clusters.
Classical Partitioning Methods: k-Means
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k
clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can
be viewed as the cluster’s centroid or center of gravity.
The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of
which initially represents a cluster mean or center. For each of the remaining objects, an object
is assigned to the cluster to which it is the most similar, based on the distance between the
object and the cluster mean. It then computes the new mean for each cluster. This process
iterates until the criterion function converges. Typically, the square-error criterion is used,
defined as
E = Σ_{i=1..k} Σ_{p ∈ Ci} |p − mi|²,
where E is the sum of the square error for all objects in the data set; p is the point in space
representing a given object; and mi is the mean of cluster Ci (both p and mi are
multidimensional). In other words, for each object in each cluster, the distance from the object
to its cluster center is squared, and the distances are summed. This criterion tries to make the
resulting k clusters as compact and as separate as possible.
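For concreteness, a minimal NumPy sketch of evaluating this square-error criterion for a hypothetical partition:

import numpy as np

# Hypothetical clusters, each an array of points p
clusters = [np.array([[1.0, 1.0], [1.5, 2.0]]),
            np.array([[6.0, 6.0], [7.0, 7.5], [6.5, 6.0]])]

E = 0.0
for Ci in clusters:
    mi = Ci.mean(axis=0)                                  # mean (centroid) of cluster Ci
    E += np.sum(np.linalg.norm(Ci - mi, axis=1) ** 2)     # sum of squared distances to mi

print(E)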
Suppose that there is a set of objects located in space as depicted in the rectangle shown in
Figure (a).
Let k = 3; that is, the user would like the objects to be partitioned into three clusters. According
to the algorithm, we arbitrarily choose three objects as the three initial cluster centers, where
cluster centers are marked by a “+”. Each object is distributed to a cluster based on the cluster
center to which it is the nearest. Such a distribution forms silhouettes encircled by dotted
curves, as shown in Figure (a). Next, the cluster centers are updated. That is, the mean value
of each cluster is recalculated based on the current objects in the cluster.
Using the new cluster centers, the objects are redistributed to the clusters based on which cluster
center is the nearest. Such a redistribution forms new silhouettes encircled by dashed curves,
as shown in Figure (b). This process iterates, leading to Figure (c). The process of iteratively
reassigning objects to clusters to improve the partitioning is referred to as iterative relocation.
Eventually, no redistribution of the objects in any cluster occurs, and so the process terminates.
The resulting clusters are returned by the clustering process.
The algorithm attempts to determine k partitions that minimize the square-error function. It
works well when the clusters are compact clouds that are rather well separated from one
another. The method is relatively scalable and efficient in processing large data sets because
the computational complexity of the algorithm is O(nkt), where n is the total number of objects,
k is the number of clusters, and t is the number of iterations. The method often terminates at a
local optimum.
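A short sketch using scikit-learn's KMeans on synthetic data (k = 3, as in the example above); the data and parameters are illustrative only:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic, roughly compact and well-separated clouds
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the k cluster means (centroids)
print(km.inertia_)           # the square-error criterion E for the final partition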
Advantages and Disadvantages
• The k-means method, however, can be applied only when the mean of a cluster is defined.
• This may not be the case in some applications, such as when data with categorical attributes
are involved.
• The necessity for users to specify k, the number of clusters, in advance can be seen as a
disadvantage.
• The k-means method is not suitable for discovering clusters with nonconvex shapes or
clusters of very different size.
• It is sensitive to noise and outlier data points because a small number of such data can
substantially influence the mean value.
EXPECTATION-MAXIMIZATION ALGORITHM
The EM (Expectation-Maximization) algorithm extends the k-means paradigm in a different
way. Whereas the k-means algorithm assigns each object to a cluster, in EM each object is
assigned to each cluster according to a weight representing its probability of membership. In
other words, there are no strict boundaries between clusters. Therefore, new means are
computed based on weighted measures.
EM starts with an initial estimate or “guess” of the parameters of the mixture model
(collectively referred to as the parameter vector). It iteratively rescores the objects against the
mixture density produced by the parameter vector. The rescored objects are then used to update
the parameter estimates. Each object is assigned a probability that it would possess a certain
set of attribute values given that it was a member of a given cluster. The algorithm is described
as follows:
1. Make an initial guess of the parameter vector: This involves randomly selecting k objects
to represent the cluster means or centers (as in k-means partitioning), as well as making
guesses for the additional parameters.
2. Iteratively refine the parameters (or clusters) based on the following two steps:
a. Expectation Step: Assign each object xi to cluster Ck with the probability
P(xi ∈ Ck) = p(Ck | xi) = p(Ck) p(xi | Ck) / p(xi),
where p(xi | Ck) = N(mk, Ek)(xi) follows the normal (i.e., Gaussian) distribution
around mean mk with expectation Ek. In other words, this step calculates the
probability of cluster membership of object xi for each of the clusters. These
probabilities are the “expected” cluster memberships for object xi.
b. Maximization Step: Use the probability estimates from above to re-estimate (or
refine) the model parameters. For example, the cluster means are updated as
mk = (1/n) Σ_{i=1..n} xi P(xi ∈ Ck) / Σ_j P(xi ∈ Cj).
This step is the “maximization” of the likelihood of the distributions given the data.
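In practice, this corresponds to fitting a mixture of Gaussians; a minimal sketch with scikit-learn's GaussianMixture, which runs the E- and M-steps internally (synthetic data, k = 2 assumed):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

# GaussianMixture iterates the E-step and M-step until the parameters converge
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

weights = gmm.predict_proba(X)   # probabilistic ("soft") cluster memberships per object
print(gmm.means_)                # re-estimated cluster means after the M-steps
print(weights[:3])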
PRINCIPAL COMPONENT ANALYSIS (PCA)
PCA searches for k n-dimensional orthonormal vectors, the principal components, that can best
be used to represent the data. The principal components serve as a new set of axes and are
sorted in order of decreasing “significance,” so that the first axis captures the most variance.
For example, the figure shows the first two principal components, Y1 and Y2, for a given set
of data originally mapped to the axes X1 and X2. This information helps identify groups or
patterns within the data.
Because the components are sorted according to decreasing order of “significance,” the size
of the data can be reduced by eliminating the weaker components, that is, those with low
variance. Using the strongest principal components, it should be possible to reconstruct a
good approximation of the original data.
PCA is computationally inexpensive, can be applied to ordered and unordered attributes, and
can handle sparse data and skewed data. Multidimensional data of more than two dimensions
can be handled by reducing the problem to two dimensions. Principal components may be used
as inputs to multiple regression and cluster analysis.
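A short sketch of computing principal components with scikit-learn's PCA on synthetic, correlated data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Correlated 2-D data originally expressed in the axes X1, X2
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])

pca = PCA(n_components=2).fit(X)
Y = pca.transform(X)                    # data re-expressed in the new axes Y1, Y2
print(pca.explained_variance_ratio_)    # "significance" (variance share) of each component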