DWM Unit-5 Sem Ans
Cluster analysis is a statistical technique used in data analysis to group similar objects or data points
into clusters or segments. The objective of cluster analysis is to identify patterns or structure in data
and to classify data points based on their similarities.
There are several types of clustering methods used in DWM, including the following (a brief code sketch follows the list):
1. Hierarchical Clustering: This method creates a tree-like structure of clusters based on the
similarity between data points. The process starts by treating each data point as a separate
cluster and then merging them together until all data points belong to a single cluster.
2. K-means Clustering: This method groups data points into a pre-defined number of clusters.
The algorithm randomly assigns initial centroids, then iteratively assigns each data point to
the nearest centroid and recomputes the centroids. The process repeats until the centroids
converge, at which point the final clusters are obtained.
3. Density-Based Clustering: This method groups data points based on their density within the
dataset. The algorithm identifies dense regions of data points and considers them as
clusters, while the sparse regions are considered noise or outliers.
4. Fuzzy Clustering: This method assigns each data point a probability of belonging to each
cluster, allowing for overlapping clusters. The algorithm calculates a degree of membership
for each data point in each cluster, allowing for soft boundaries between clusters.
5. Partitioning Clustering: This method divides the dataset into a pre-defined number of
clusters by minimizing a specific cost function. The K-means algorithm is a popular example
of partitioning clustering.
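As an illustration only, the short Python sketch below shows how three of these families are commonly invoked. It assumes scikit-learn is installed; the toy dataset and all parameter values are invented for the example.

    # Illustrative sketch only: clusters the same toy data with three of the
    # methods listed above. Assumes scikit-learn is installed; the parameter
    # values here are arbitrary examples, not recommendations.
    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=150, centers=3, random_state=42)   # toy 2-D data

    kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
    hierarchical_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
    density_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)  # label -1 marks noise/outliers

    print(kmeans_labels[:10], hierarchical_labels[:10], density_labels[:10])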
First, the k-means algorithm randomly selects k of the objects in D, each of which initially represents a cluster mean or
center. Each of the remaining objects is assigned to the cluster to which it is the most
similar, based on the Euclidean distance between the object and the cluster mean.
The k-means algorithm then iteratively reduces the within-cluster variation. For each cluster, it
computes the new mean using the objects assigned to the cluster in the previous iteration. All the
objects are then reassigned using the updated means as the new cluster centers.
The iterations continue until the assignment is stable, that is, the clusters formed in the current
round are the same as those formed in the previous round. The k-means procedure is summarized below.
Algorithm: k-means
Input −
  D = {t1, t2, …, tn} // set of objects to be clustered
  k // number of desired clusters
Output −
  K // Set of clusters
K-means algorithm −
  arbitrarily choose k objects from D as the initial cluster means
  repeat
    assign each item ti to the cluster which has the closest mean
    recompute the mean of each cluster from the items currently assigned to it
  until the cluster assignments no longer change
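A minimal Python sketch of this procedure for one-dimensional data follows; the function name kmeans and the convergence test are illustrative choices, not part of any standard library.

    # Illustrative sketch of the k-means procedure above for 1-D data.
    # The function name and its arguments are our own, not a library API.
    import random

    def kmeans(points, k, max_iters=100):
        means = random.sample(points, k)              # arbitrarily choose k objects as initial means
        for _ in range(max_iters):
            clusters = [[] for _ in range(k)]
            for p in points:                          # assign each item to the closest mean
                idx = min(range(k), key=lambda i: abs(p - means[i]))
                clusters[idx].append(p)
            new_means = [sum(c) / len(c) if c else means[i]
                         for i, c in enumerate(clusters)]   # recompute each cluster mean
            if new_means == means:                    # stop when the means no longer change
                break
            means = new_means
        return means, clusters

    means, clusters = kmeans([2, 3, 4, 7, 10, 11, 12, 18, 20, 22, 25, 30], k=2)
    print(means, clusters)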
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the
well-known clustering problem. K-means clustering is a method of vector quantization, originally
from signal processing, that is popular for cluster analysis in data mining.
Typically, unsupervised algorithms make inferences from datasets using only input vectors without
referring to known, or labelled, outcomes.
“The objective of K-means is simple: group similar data points together and discover underlying
patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.”
A cluster refers to a collection of data points aggregated together because of certain similarities.
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the
objects in D into k clusters, C1, …, Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j.
An objective function is used to assess the partitioning quality so that objects within a cluster are
similar to one another but dissimilar to objects in other clusters. That is, the objective function aims
for high intracluster similarity and low intercluster similarity.
A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster.
Conceptually, the centroid of a cluster is its center point.
The centroid can be defined in various ways such as by the mean or medoid of the objects (or
points) assigned to the cluster.
The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by
dist (p, ci), where dist (x, y) is the Euclidean distance between two points x and y. The quality of
cluster Ci can be measured by the within-cluster variation, which is the sum of squared error
between all objects in Ci and the centroid ci.
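In the notation defined above, the within-cluster variation (sum of squared errors) of the partitioning can be written as

    E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2

and k-means seeks a partitioning that makes E as small as possible.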
5. Explain briefly Agglomerative versus divisive hierarchical clustering.
A divisive hierarchical clustering method employs a top-down strategy. It starts by placing all
objects in one cluster, which is the hierarchy’s root. It then divides the root cluster into several
smaller subclusters, and recursively partitions those clusters into smaller ones. The partitioning
process continues until each cluster at the lowest level is coherent enough—either containing only
one object, or the objects within a cluster are sufficiently similar to each other.
In either agglomerative or divisive hierarchical clustering, a user can specify the desired number of
clusters as a termination condition.
An agglomerative hierarchical clustering method, in contrast, uses a bottom-up strategy: it starts by
letting each object form its own cluster. The clusters are then merged step-by-step according to some
criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form
the minimum Euclidean distance between any two objects from different clusters.
This is a single-linkage approach in that each cluster is represented by all the objects in the cluster,
and the similarity between two clusters is measured by the similarity of the closest pair of data
points belonging to different clusters. The cluster-merging process repeats until all the objects are
eventually merged to form one cluster.
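A brief Python sketch of the single-linkage merging just described (assuming SciPy is available; the five sample points and the cut at 3 clusters are invented for illustration):

    # Illustrative sketch: single-linkage agglomerative clustering with SciPy.
    # The sample points and the number of clusters are invented for the example.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

    Z = linkage(points, method='single')               # merge the closest pair of clusters at each step
    labels = fcluster(Z, t=3, criterion='maxclust')    # cut the hierarchy into at most 3 clusters
    print(labels)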
6. How does agglomerative hierarchical clustering work? Explain with an example.
Agglomerative hierarchical clustering is a method of clustering that starts with each data
point as its own cluster and gradually merges the most similar clusters together until all data
points belong to a single cluster.
Here's an example of how agglomerative hierarchical clustering works in the context of data
warehouse and mining (DWM):
Suppose we have a dataset of customer transactions consisting of the following features:
age, income, and transaction amount. We want to use agglomerative hierarchical clustering
to group similar customers based on their transaction behavior.
1. Start by treating each data point as its own cluster. In our example, each transaction
is a data point, so we start with n clusters, where n is the number of transactions.
2. Calculate the distance between each pair of clusters using a distance metric such as
Euclidean distance or Manhattan distance. The distance between two clusters can be
defined as the distance between their centroids (mean values of each feature), or the
distance between their closest data points, or any other measure of similarity.
3. Merge the two closest clusters into a single cluster. In our example, let's say that
transaction 1 and transaction 2 are the closest, so we merge them into a new cluster.
4. Recalculate the distance between the new cluster and all other clusters. In our
example, we need to calculate the distance between the new cluster and each of the
other n − 2 clusters.
5. Repeat steps 3 and 4 until all data points belong to a single cluster. At each iteration,
we merge the two closest clusters into a new cluster, recalculate the distances
between the new cluster and all other clusters, and repeat the process until we have
a single cluster containing all data points.
6. Stop when a stopping criterion is met. A stopping criterion can be defined based on
the number of clusters desired or the threshold distance between clusters beyond
which they should not be merged.
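The steps above could be carried out, for instance, with SciPy; the sketch below assumes each row holds age, income, and transaction amount for one customer, and all numbers are invented for illustration.

    # Illustrative sketch of steps 1-6 on invented customer data
    # (columns: age, income, transaction amount).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    customers = np.array([
        [25, 30000, 120],
        [27, 32000, 150],
        [45, 80000, 900],
        [47, 82000, 950],
        [35, 50000, 400],
    ], dtype=float)

    # Step 2: pairwise Euclidean distances (each customer starts as its own cluster).
    distances = pdist(customers, metric='euclidean')

    # Steps 3-5: repeatedly merge the two closest clusters (average linkage here).
    Z = linkage(distances, method='average')

    # Step 6: stop once a desired number of clusters is reached, e.g. 2.
    print(fcluster(Z, t=2, criterion='maxclust'))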
7. Write the k-means algorithm and find the k clusters of the following set of data:
K={2,3,4,7,10,11,12,18,20,22,25,30} given k=2
Now, let's apply the k-means algorithm to the set of data given:
K={2,3,4,7,10,11,12,18,20,22,25,30} with k=2 clusters.
Step 1: Initialize k centroids. Take the initial centroids as c1 = 4 and c2 = 20, consistent with the distances shown below.
Step 2: Assign each data point to the nearest centroid.

Data point   Distance to c1 = 4   Distance to c2 = 20
2            2                    18
3            1                    17
4            0                    16
7            3                    13
10           6                    10
11           7                    9
12           8                    8
18           14                   2
20           16                   0
22           18                   2
25           21                   5
30           26                   10

Assigning each point to its nearest centroid (breaking the tie for 12 toward c1) gives C1 = {2, 3, 4, 7, 10, 11, 12} and C2 = {18, 20, 22, 25, 30}.
Step 3: Recompute the centroids: c1 = (2 + 3 + 4 + 7 + 10 + 11 + 12)/7 = 7 and c2 = (18 + 20 + 22 + 25 + 30)/5 = 23.
Step 4: Reassign each point to the nearest of the new centroids 7 and 23. No point changes cluster, so the algorithm has converged, and the final clusters are {2, 3, 4, 7, 10, 11, 12} and {18, 20, 22, 25, 30}.
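As a quick cross-check (a sketch only, using scikit-learn and taking the initial centroids 4 and 20 read off from the distance table as an assumption), the same two clusters are obtained:

    # Illustrative cross-check of the worked example with scikit-learn.
    # The initial centroids 4 and 20 are assumptions taken from the table above.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([2, 3, 4, 7, 10, 11, 12, 18, 20, 22, 25, 30], dtype=float).reshape(-1, 1)
    init_centroids = np.array([[4.0], [20.0]])

    km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)
    print(km.labels_)            # cluster index for each data point
    print(km.cluster_centers_)   # expected to converge near 7 and 23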