
Module-4 (Unsupervised Learning)

Clustering - Similarity measures, Hierarchical Agglomerative Clustering, K-means partitional clustering, Expectation maximization (EM) for soft clustering. Dimensionality reduction – Principal Component Analysis.

CLUSTERING
• Clustering is the process of grouping the data into classes or clusters, so that objects within
a cluster have high similarity in comparison to one another but are very dissimilar to objects
in other clusters.
• Dissimilarities are assessed based on the attribute values describing the objects.
• Clustering has its roots in many areas, including data mining, statistics, biology, and
machine learning.

WHAT IS CLUSTER ANALYSIS?


The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering. A cluster is a collection of data objects that are similar to one another within
the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can
be treated collectively as one group and so may be considered as a form of data compression.
Although classification is an effective means for distinguishing groups or classes of objects, it
requires the often-costly collection and labelling of a large set of training tuples or patterns,
which the classifier uses to model each group.
It is often more desirable to proceed in the reverse direction:
• First partition the set of data into groups based on data similarity (e.g., using clustering),
and
• Then assign labels to the relatively small number of groups.
Additional advantages of such a clustering-based process are that it is adaptable to changes and
helps single out useful features that distinguish different groups.
Clustering is also called data segmentation in some applications because clustering partitions
large data sets into groups according to their similarity.
Clustering can also be used for outlier detection, where outliers may be more interesting than
common cases. Applications of outlier detection include the detection of credit card fraud and
the monitoring of criminal activities in electronic commerce.
In machine learning, clustering is an example of unsupervised learning. Unlike classification,
clustering and unsupervised learning do not rely on predefined classes and class-labeled
training examples. For this reason, clustering is a form of learning by observation, rather than
learning by examples.
Typical requirements of Clustering are:
1. Scalability:
• Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects.
• Highly scalable clustering algorithms are needed.
2. Ability to deal with different types of attributes:
• Many algorithms are designed to cluster interval-based (numerical) data.
• However, applications may require clustering other types of data, such as binary,
categorical (nominal), and ordinal data, or mixtures of these data types.
3. Discovery of clusters with arbitrary shape:
• Many clustering algorithms determine clusters based on Euclidean or Manhattan
distance measures. Algorithms based on such distance measures tend to find spherical
clusters with similar size and density.
• However, a cluster could be of any shape. It is important to develop algorithms that can
detect clusters of arbitrary shape.
4. Minimal requirements for domain knowledge to determine input parameters:
• Many clustering algorithms require users to input certain parameters in cluster analysis
(such as the number of desired clusters).
• Parameters are often difficult to determine, especially for data sets containing high-
dimensional objects. This not only burdens users, but it also makes the quality of
clustering difficult to control.
5. Ability to deal with noisy data:
• Most real-world databases contain outliers or missing, unknown, or erroneous data.
• Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.
6. Incremental clustering and insensitivity to the order of input records:
• Some clustering algorithms cannot incorporate newly inserted data (i.e., database
updates) into existing clustering structures and, instead, must determine a new
clustering from scratch.
• Some clustering algorithms are sensitive to the order of input data. That is, given a set
of data objects, such an algorithm may return dramatically different clusterings
depending on the order of presentation of the input objects.
7. High dimensionality:
• A database or a data warehouse can contain many dimensions or attributes.
• Many clustering algorithms are good at handling low-dimensional data, involving only
two to three dimensions.
8. Constraint-based clustering:
• Real-world applications may need to perform clustering under various kinds of
constraints.
• A challenging task is to find groups of data with good clustering behavior that satisfy
specified constraints.
9. Interpretability and usability:
• Users expect clustering results to be interpretable, comprehensible, and usable.
• That is, clustering may need to be tied to specific semantic interpretations and
applications.
SIMILARITY MEASURES
Similarity measures are also known as distance measures. These include:
1. Euclidean distance,
2. Manhattan distance,
3. Minkowski distance, and
4. Cosine similarity.

• The most popular distance measure is Euclidean distance, which is defined as

d(i, j) = √(|xi1 − xj1|² + |xi2 − xj2|² + … + |xin − xjn|²)

where i = (xi1, xi2, …, xin) and j = (xj1, xj2, …, xjn) are two n-dimensional data objects.
• Another well-known metric is Manhattan (or city block) distance, defined as

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xin − xjn|
Both the Euclidean distance and Manhattan distance satisfy the following mathematical
requirements of a distance function:
1. d (i, j) ≥ 0: Distance is a nonnegative number.
2. d (i, i) = 0: The distance of an object to itself is 0.
3. d (i, j) = d (j, i): Distance is a symmetric function.
4. d (i, j) ≤ d (i, h) + d (h, j): Going directly from object i to object j in space is no
more than making a detour over any other object h (triangular inequality).
Example of Euclidean distance and Manhattan distance.
Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in Figure below.

• The Euclidean distance between the two is √(2² + 3²) = √13 ≈ 3.61.


• The Manhattan distance between the two is 2+3 = 5.
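A minimal Python/NumPy sketch of these two computations (the helper names are our own, for illustration only):

import numpy as np

def euclidean(x, y):
    # square root of the sum of squared coordinate differences (L2 norm)
    return np.sqrt(np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2))

def manhattan(x, y):
    # sum of absolute coordinate differences (L1 norm)
    return np.sum(np.abs(np.asarray(x, float) - np.asarray(y, float)))

x1, x2 = (1, 2), (3, 5)
print(euclidean(x1, x2))  # sqrt(2^2 + 3^2) = sqrt(13) ≈ 3.61
print(manhattan(x1, x2))  # 2 + 3 = 5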

• Minkowski distance is a generalization of both Euclidean distance and Manhattan distance. It is defined as

d(i, j) = (|xi1 − xj1|^p + |xi2 − xj2|^p + … + |xin − xjn|^p)^(1/p)

where p is a positive integer. Such a distance is also called the Lp norm in some literature. It represents the Manhattan distance when p = 1 (i.e., L1 norm) and the Euclidean distance when p = 2 (i.e., L2 norm). If each variable is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as

d(i, j) = √(w1|xi1 − xj1|² + w2|xi2 − xj2|² + … + wn|xin − xjn|²)

Weighting can also be applied to the Manhattan and Minkowski distances.
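A minimal sketch of the Lp (Minkowski) distance with optional weights, assuming NumPy and a hypothetical helper name; it reduces to Manhattan for p = 1 and Euclidean for p = 2:

import numpy as np

def minkowski(x, y, p=2, w=None):
    # Lp norm of the coordinate differences; w is an optional weight vector
    d = np.abs(np.asarray(x, float) - np.asarray(y, float))
    w = np.ones_like(d) if w is None else np.asarray(w, float)
    return np.sum(w * d ** p) ** (1.0 / p)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, p=1))                # 5.0 (Manhattan, L1)
print(minkowski(x1, x2, p=2))                # ≈ 3.61 (Euclidean, L2)
print(minkowski(x1, x2, p=2, w=(0.5, 2.0)))  # weighted Euclidean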


• Cosine Similarity
Because similar documents are expected to have similar relative term frequencies, we can measure the similarity among a set of documents, or between a document and a query (often defined as a set of keywords), based on similar relative term occurrences in the frequency table. Many metrics have been proposed for measuring document similarity based on relative term occurrences or document vectors.
A representative metric is the cosine measure, defined as follows. Let v1 and v2 be two document vectors. Their cosine similarity is defined as

sim(v1, v2) = (v1 · v2) / (|v1| |v2|)

where the inner product v1 · v2 is the standard vector dot product, defined as Σ (i = 1 to t) v1i v2i, and the norm |v1| in the denominator is defined as |v1| = √(v1 · v1).
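A minimal sketch of the cosine measure for two toy term-frequency vectors (the document vectors and function name here are made up for illustration):

import numpy as np

def cosine_similarity(v1, v2):
    # dot product of the vectors divided by the product of their norms
    v1, v2 = np.asarray(v1, float), np.asarray(v2, float)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# toy term-frequency vectors for two documents
d1 = [3, 0, 2, 1]
d2 = [1, 1, 2, 0]
print(cosine_similarity(d1, d2))  # lies in [0, 1] for nonnegative frequencies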
HIERARCHICAL CLUSTERING
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical method can be classified as being either agglomerative or divisive, based on how
the hierarchical decomposition is formed.
• The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups that are
close to one another, until all of the groups are merged into one (the topmost level of
the hierarchy), or until a termination condition holds.
• The divisive approach, also called the top-down approach, starts with all of the
objects in the same cluster. In each successive iteration, a cluster is split up into smaller
clusters, until eventually each object is in one cluster, or until a termination condition
holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone. That is, if a particular merge or split decision later turns out to have been a poor
choice, the method cannot backtrack and correct it. This rigidity is useful in that it leads to
smaller computation costs by not having to worry about a combinatorial number of different
choices. However, such techniques cannot correct erroneous decisions.
There are two approaches to improving the quality of hierarchical clustering:
1. perform careful analysis of object “linkages” at each hierarchical partitioning, such as in
Chameleon, or
2. integrate hierarchical agglomeration and other approaches by first using a hierarchical
agglomerative algorithm to group objects into microclusters, and then performing
macroclustering on the microclusters using another clustering method such as iterative
relocation, as in BIRCH.
Agglomerative and Divisive Hierarchical Clustering
In general, there are two types of hierarchical clustering methods:
1. Agglomerative hierarchical clustering: This bottom-up strategy starts by placing
each object in its own cluster and then merges these atomic clusters into larger and
larger clusters, until all of the objects are in a single cluster or until certain termination
conditions are satisfied. Most hierarchical clustering methods belong to this category.
They differ only in their definition of intercluster similarity.
2. Divisive hierarchical clustering: This top-down strategy does the reverse of
agglomerative hierarchical clustering by starting with all objects in one cluster. It
subdivides the cluster into smaller and smaller pieces, until each object forms a cluster
on its own or until it satisfies certain termination conditions, such as a desired number
of clusters is obtained or the diameter of each cluster is within a certain threshold.
Agglomerative versus divisive hierarchical clustering. Figure shows the application of
AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and
DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, to a data set of five
objects, {a, b, c, d, e}.

Initially, AGNES places each object into a cluster of its own. The clusters are then merged
step-by-step according to some criterion. This is a single-linkage approach in that each
cluster is represented by all of the objects in the cluster, and the similarity between two
clusters is measured by the similarity of the closest pair of data points belonging to different
clusters. The cluster merging process repeats until all of the objects are eventually merged
to form one cluster.
In DIANA, all of the objects are used to form one initial cluster. The cluster is split
according to some principle, such as the maximum Euclidean distance between the closest
neighbouring objects in the cluster. The cluster splitting process repeats until, eventually,
each new cluster contains only a single object.
In either agglomerative or divisive hierarchical clustering, the user can specify the desired
number of clusters as a termination condition.
A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It shows how objects are grouped together step by step. Figure
below shows a dendrogram for the five objects presented in Figure above,

where l = 0 shows the five objects as singleton clusters at level 0. At l = 1, objects a and b
are grouped together to form the first cluster, and they stay together at all subsequent levels.
We can also use a vertical axis to show the similarity scale between clusters.
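The sketch below illustrates single-linkage agglomerative clustering and the resulting dendrogram structure for five toy 2-D objects, using SciPy's hierarchical clustering routines (the coordinates are assumed for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# five toy 2-D objects standing in for {a, b, c, d, e}
X = np.array([[1.0, 1.0], [1.2, 1.1], [4.0, 4.0], [4.1, 4.2], [7.0, 1.0]])

# single-linkage agglomerative clustering (AGNES-style, closest-pair distance)
Z = linkage(X, method='single', metric='euclidean')

# cut the merge tree to obtain, e.g., two flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# inspect the merge order without plotting (dendrogram(Z) would draw the tree)
tree = dendrogram(Z, no_plot=True)
print(tree['ivl'])  # leaf order of the five objects in the dendrogram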
Four widely used measures for distance between clusters are as follows, where |p − p'| is the distance between two objects or points p and p'; mi is the mean of cluster Ci; and ni is the number of objects in Ci:

Minimum distance: dmin(Ci, Cj) = min {|p − p'| : p ∈ Ci, p' ∈ Cj}
Maximum distance: dmax(Ci, Cj) = max {|p − p'| : p ∈ Ci, p' ∈ Cj}
Mean distance: dmean(Ci, Cj) = |mi − mj|
Average distance: davg(Ci, Cj) = (1/(ni nj)) Σ (p ∈ Ci) Σ (p' ∈ Cj) |p − p'|
• When an algorithm uses the minimum distance, dmin (Ci, Cj), to measure the distance
between clusters, it is sometimes called a nearest-neighbour clustering algorithm.
• If the clustering process is terminated when the distance between nearest clusters exceeds
an arbitrary threshold, it is called a single-linkage algorithm.
• If we view the data points as nodes of a graph, with edges forming a path between the nodes
in a cluster, then the merging of two clusters, Ci and Cj, corresponds to adding an edge
between the nearest pair of nodes in Ci and Cj. Because edges linking clusters always go
between distinct clusters, the resulting graph will generate a tree. Thus, an agglomerative
hierarchical clustering algorithm that uses the minimum distance measure is also called a
minimal spanning tree algorithm.
• When an algorithm uses the maximum distance, dmax (Ci, Cj), to measure the distance
between clusters, it is sometimes called a farthest-neighbor clustering algorithm.
• If the clustering process is terminated when the maximum distance between nearest clusters
exceeds an arbitrary threshold, it is called a complete-linkage algorithm.
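A small sketch computing the four inter-cluster distances above for two toy clusters (the helper name and data points are assumptions for illustration):

import numpy as np
from itertools import product

def cluster_distances(Ci, Cj):
    # pairwise Euclidean distances between every p in Ci and every p' in Cj
    pair = [np.linalg.norm(p - q) for p, q in product(Ci, Cj)]
    return {
        'd_min': min(pair),                                            # single linkage
        'd_max': max(pair),                                            # complete linkage
        'd_mean': np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0)),   # distance between means
        'd_avg': sum(pair) / (len(Ci) * len(Cj)),                      # average pairwise distance
    }

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [5.0, 1.0]])
print(cluster_distances(Ci, Cj))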
What are some of the difficulties with hierarchical clustering?
• The hierarchical clustering method, though simple, often encounters difficulties regarding
the selection of merge or split points. Such a decision is critical because once a group of
objects is merged or split, the process at the next step will operate on the newly generated
clusters. It will neither undo what was done previously nor perform object swapping
between clusters. Thus, merge or split decisions, if not well chosen at some step, may lead
to low-quality clusters.
• The method does not scale well, because each decision to merge or split requires the
examination and evaluation of a good number of objects or clusters.
Classical Partitioning Methods: k-Means
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k
clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can
be viewed as the cluster’s centroid or center of gravity.
The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of
which initially represents a cluster mean or center. For each of the remaining objects, an object
is assigned to the cluster to which it is the most similar, based on the distance between the
object and the cluster mean. It then computes the new mean for each cluster. This process
iterates until the criterion function converges. Typically, the square-error criterion is used,
defined as

E = Σ (i = 1 to k) Σ (p ∈ Ci) |p − mi|²
where E is the sum of the square error for all objects in the data set; p is the point in space
representing a given object; and mi is the mean of cluster Ci (both p and mi are
multidimensional). In other words, for each object in each cluster, the distance from the object
to its cluster center is squared, and the distances are summed. This criterion tries to make the
resulting k clusters as compact and as separate as possible.
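For a concrete feel of the criterion, the sketch below computes E for a fixed assignment of toy points to two clusters (the numbers are made up):

import numpy as np

def square_error(clusters):
    # clusters: list of arrays, each holding the points assigned to one cluster
    E = 0.0
    for C in clusters:
        m = C.mean(axis=0)            # cluster mean (centroid)
        E += np.sum((C - m) ** 2)     # sum of squared distances to the centroid
    return E

C1 = np.array([[1.0, 1.0], [1.0, 2.0]])
C2 = np.array([[8.0, 8.0], [9.0, 8.0]])
print(square_error([C1, C2]))  # 0.5 + 0.5 = 1.0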
Suppose that there is a set of objects located in space as depicted in the rectangle shown in
Figure (a).

Let k = 3; that is, the user would like the objects to be partitioned into three clusters. According
to the algorithm, we arbitrarily choose three objects as the three initial cluster centers, where
cluster centers are marked by a “+”. Each object is distributed to a cluster based on the cluster
center to which it is the nearest. Such a distribution forms silhouettes encircled by dotted
curves, as shown in Figure (a). Next, the cluster centers are updated. That is, the mean value
of each cluster is recalculated based on the current objects in the cluster.
Using the new cluster centers, the objects are redistributed to the clusters based on which cluster
center is the nearest. Such a redistribution forms new silhouettes encircled by dashed curves,
as shown in Figure (b). This process iterates, leading to Figure (c). The process of iteratively
reassigning objects to clusters to improve the partitioning is referred to as iterative relocation.
Eventually, no redistribution of the objects in any cluster occurs, and so the process terminates.
The resulting clusters are returned by the clustering process.
The algorithm attempts to determine k partitions that minimize the square-error function. It
works well when the clusters are compact clouds that are rather well separated from one
another. The method is relatively scalable and efficient in processing large data sets because
the computational complexity of the algorithm is O(nkt), where n is the total number of objects,
k is the number of clusters, and t is the number of iterations. The method often terminates at a
local optimum.
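A compact NumPy sketch of the k-means loop just described (random initial centers, nearest-center assignment, mean update, repeat until the centers stabilize); it is illustrative only and does not handle edge cases such as empty clusters:

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # k random objects as initial means
    for _ in range(n_iter):
        # assignment step: each object goes to the cluster with the nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: recompute each cluster mean from its current members
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centers = k_means(X, k=2)
print(centers)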
Advantages and Disadvantages
• The k-means method, however, can be applied only when the mean of a cluster is defined.
• This may not be the case in some applications, such as when data with categorical attributes
are involved.
• The necessity for users to specify k, the number of clusters, in advance can be seen as a
disadvantage.
• The k-means method is not suitable for discovering clusters with nonconvex shapes or
clusters of very different size.
• It is sensitive to noise and outlier data points because a small number of such data can
substantially influence the mean value.
EXPECTATION-MAXIMIZATION ALGORITHM
The EM (Expectation-Maximization) algorithm extends the k-means paradigm in a different
way. Whereas the k-means algorithm assigns each object to a cluster, in EM each object is
assigned to each cluster according to a weight representing its probability of membership. In
other words, there are no strict boundaries between clusters. Therefore, new means are
computed based on weighted measures.

EM starts with an initial estimate or “guess” of the parameters of the mixture model
(collectively referred to as the parameter vector). It iteratively rescores the objects against the
mixture density produced by the parameter vector. The rescored objects are then used to update
the parameter estimates. Each object is assigned a probability that it would possess a certain
set of attribute values given that it was a member of a given cluster. The algorithm is described
as follows:
1. Make an initial guess of the parameter vector: This involves randomly selecting k objects
to represent the cluster means or centers (as in k-means partitioning), as well as making
guesses for the additional parameters.
2. Iteratively refine the parameters (or clusters) based on the following two steps:
a. Expectation Step: Assign each object xi to cluster Ck with the probability

P(xi ∈ Ck) = p(Ck | xi) = p(Ck) p(xi | Ck) / p(xi)
where p(xi |Ck) = N(mk, Ek(xi)) follows the normal (i.e., Gaussian) distribution
around mean, mk, with expectation, Ek. In other words, this step calculates the
probability of cluster membership of object xi, for each of the clusters. These
probabilities are the “expected” cluster memberships for object xi .
b. Maximization Step: Use the probability estimates from above to re-estimate (or
refine) the model parameters. For example,

mk = (1/n) Σ (i = 1 to n) xi P(xi ∈ Ck) / Σj P(xi ∈ Cj)
This step is the “maximization” of the likelihood of the distributions given the data.

• The EM algorithm is simple and easy to implement.


• It converges fast but may not reach the global optimum. Convergence is guaranteed for
certain forms of optimization functions.
• The computational complexity is linear in d (the number of input features), n (the number
of objects), and t (the number of iterations).
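A compact sketch of EM for soft clustering under a simplifying assumption of spherical, unit-variance Gaussian components with equal priors (a special case of the mixture model described above; the function name and data are illustrative):

import numpy as np

def em_soft_clustering(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # initial guess: k random objects
    for _ in range(n_iter):
        # E-step: membership weight of each object in each cluster, proportional
        # to a unit-variance Gaussian density around each cluster mean
        d2 = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2) ** 2
        w = np.exp(-0.5 * d2)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate each mean as the weighted average of all objects
        means = (w.T @ X) / w.sum(axis=0)[:, None]
    return w, means

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])
weights, means = em_soft_clustering(X, k=2)
print(means)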
Dimensionality Reduction
In dimensionality reduction, data encoding or transformations are applied so as to obtain a
reduced or “compressed” representation of the original data. If the original data can be
reconstructed from the compressed data without any loss of information, the data reduction is
called lossless. If, instead, we can reconstruct only an approximation of the original data, then
the data reduction is called lossy.
Two popular and effective methods of lossy dimensionality reduction are wavelet transforms and Principal Components Analysis.
Principal Components Analysis
Suppose that the data to be reduced consist of tuples or data vectors described by n attributes
or dimensions. Principal components analysis, or PCA (also called the Karhunen-Loeve, or K-
L, method), searches for k n-dimensional orthogonal vectors that can best be used to represent
the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting
in dimensionality reduction.
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This step
helps ensure that attributes with large domains will not dominate attributes with smaller
domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These
vectors are referred to as the principal components. The input data are a linear combination
of the principal components.
3. The principal components are sorted in order of decreasing “significance” or strength. The
principal components essentially serve as a new set of axes for the data, providing important
information about variance. That is, the sorted axes are such that the first axis shows the
most variance among the data, the second axis shows the next highest variance, and so on.

For example, Figure shows the first two principal components, Y1 and Y2, for the given
set of data originally mapped to the axes X1 and X2. This information helps identify groups
or patterns within the data.
4. Because the components are sorted according to decreasing order of “significance,” the size
of the data can be reduced by eliminating the weaker components, that is, those with low
variance. Using the strongest principal components, it should be possible to reconstruct a
good approximation of the original data.
PCA is computationally inexpensive, can be applied to ordered and unordered attributes, and
can handle sparse data and skewed data. Multidimensional data of more than two dimensions
can be handled by reducing the problem to two dimensions. Principal components may be used
as inputs to multiple regression and cluster analysis.
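A short NumPy sketch of the procedure above: normalize the attributes, compute orthonormal components from the covariance matrix, sort them by decreasing variance, and project onto the k strongest (the data and names are illustrative):

import numpy as np

def pca(X, k):
    # 1. normalize: zero mean and unit variance per attribute
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2-3. orthonormal components of the covariance matrix,
    #      sorted by decreasing eigenvalue (variance explained)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # 4. project the normalized data onto the k strongest components
    return Xc @ components, eigvals[order]

X = np.random.randn(100, 5)
X[:, 3] = 2 * X[:, 0] + 0.1 * np.random.randn(100)  # add a correlated attribute
Z, variances = pca(X, k=2)
print(Z.shape, variances)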
