DWM Unit-5 Sem Ans


1. What is cluster analysis? Write in detail about the different types of clusters.

Cluster analysis is a statistical technique used in data analysis to group similar objects or data points
into clusters or segments. The objective of cluster analysis is to identify patterns or structure in data
and to classify data points based on their similarities.

Several types of clustering approaches are used in data warehousing and mining (DWM), including:

1. Hierarchical Clustering: This method creates a tree-like structure of clusters based on the similarity between data points. In the agglomerative (bottom-up) approach, the process starts by treating each data point as a separate cluster and merging the closest clusters until all data points belong to a single cluster; the divisive (top-down) approach instead starts with one all-inclusive cluster and repeatedly splits it.

2. K-means Clustering: This method groups data points into a pre-defined number of clusters, k. The algorithm selects initial centroids, assigns each data point to the nearest centroid, recomputes each centroid as the mean of its assigned points, and repeats this process until the centroids converge and the final clusters are obtained.

3. Density-Based Clustering: This method groups data points based on their density within the
dataset. The algorithm identifies dense regions of data points and considers them as
clusters, while the sparse regions are considered noise or outliers.

4. Fuzzy Clustering: This method assigns each data point a probability of belonging to each
cluster, allowing for overlapping clusters. The algorithm calculates a degree of membership
for each data point in each cluster, allowing for soft boundaries between clusters.

5. Partitioning Clustering: This method divides the dataset into a pre-defined number of
clusters by minimizing a specific cost function. The K-means algorithm is a popular example
of partitioning clustering.

2. Explain the basic hierarchical clustering strategies, AGNES and DIANA.


Hierarchical clustering is a clustering algorithm that creates a hierarchy of clusters by
recursively merging or dividing clusters based on a similarity metric.
AGNES (Agglomerative Nesting):
1. Start with each data point as a singleton cluster.
2. Compute the pairwise distances between all pairs of clusters.
3. Merge the two closest clusters based on a distance metric such as Euclidean distance or Manhattan distance.
4. Recompute the distances between the new cluster and all remaining clusters.
5. Repeat steps 3 and 4 until all data points belong to a single cluster.
AGNES produces a dendrogram that shows the hierarchical structure of the clusters.
DIANA (Divisive Analysis):
1. Start with all data points in a single cluster.
2. Compute the pairwise distances between all pairs of data points.
3. Divide the cluster into two clusters based on a splitting criterion such as variance or maximum distance.
4. Recompute the distances between the remaining data points and the two new clusters.
5. Repeat steps 3 and 4 for each new cluster until every data point is in its own cluster (or a stopping condition is met).
DIANA also produces a dendrogram, but the structure is obtained by recursively dividing clusters instead of merging them.
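For reference, AGNES-style (agglomerative) clustering can be run with SciPy's hierarchy routines. The sketch below is illustrative only: the data points are invented, and divisive (DIANA-style) clustering is not provided as a ready-made routine in SciPy or scikit-learn, so it is usually implemented manually.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five invented 2-D points; each starts as its own singleton cluster.
X = np.array([[1, 1], [1, 2], [5, 5], [6, 5], [9, 9]], dtype=float)

# 'single' linkage repeatedly merges the two clusters whose closest pair of
# points is nearest, mirroring steps 2-4 of AGNES; Z encodes the dendrogram.
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram so that at most 2 flat clusters remain (labels are 1-based).
print(fcluster(Z, t=2, criterion="maxclust"))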
3. Write the basic K-means algorithm and explain.
The k-means algorithm defines the centroid of a cluster as the mean value of the points within the
cluster. It proceeds as follows.

First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or
center. For each of the remaining objects, an object is assigned to the cluster to which it is the most
similar, based on the Euclidean distance between the object and the cluster mean.

The k-means algorithm then iteratively improves the within-cluster variation. For each cluster, it
computes the new mean using the objects assigned to the cluster in the previous iteration. All the
objects are then reassigned using the updated means as the new cluster centers.

The iterations continue until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in the previous round. The k-means procedure is summarized below.

Algorithm: k-means

Input:
    D = {t1, t2, ..., tn}   // set of elements
    k                       // number of desired clusters

Output:
    K                       // set of k clusters

Method:
    assign initial values for the means m1, m2, ..., mk
    repeat
        assign each item ti to the cluster whose mean is closest
        calculate the new mean for each cluster
    until the convergence criteria are met
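A minimal Python sketch of this procedure for one-dimensional data is given below. It is illustrative only; the function and variable names (kmeans, data, k) are not part of the answer above.

import random

def kmeans(data, k, max_iters=100):
    # Step 1: choose k initial means at random from the data.
    means = random.sample(list(data), k)
    for _ in range(max_iters):
        # Step 2: assign each item to the cluster with the closest mean.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        # Step 3: recompute each mean from the items assigned to it.
        new_means = [sum(c) / len(c) if c else means[i]
                     for i, c in enumerate(clusters)]
        # Step 4: stop when the means no longer change (convergence criterion;
        # a tolerance would normally be used for real-valued data).
        if new_means == means:
            break
        means = new_means
    return means, clusters

# Example run on the data set used in question 7 below, with k = 2.
print(kmeans([2, 3, 4, 7, 10, 11, 12, 18, 20, 22, 25, 30], 2))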


4. How is K-means a centroid-based technique? Explain.

K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the
well-known clustering problem. K-means clustering is a method of vector quantization, originally
from signal processing, that is popular for cluster analysis in data mining.

Typically, unsupervised algorithms make inferences from datasets using only input vectors without
referring to known, or labelled, outcomes.

The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.

A cluster refers to a collection of data points aggregated together because of certain similarities.

Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, ..., Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k and i ≠ j.

An objective function is used to assess the partitioning quality so that objects within a cluster are
similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intracluster similarity and low intercluster similarity.

A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster.
Conceptually, the centroid of a cluster is its center point.

The centroid can be defined in various ways such as by the mean or medoid of the objects (or
points) assigned to the cluster.

The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all objects in Ci and the centroid ci.
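Formally, the within-cluster variation (sum of squared errors, E) described above is usually written as

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, ci)²

where p is an object in cluster Ci and ci is the centroid of Ci; the k-means algorithm tries to minimize E.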
5. Explain briefly Agglomerative versus divisive hierarchical clustering.

A hierarchical clustering method can be either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion. Let's have a closer look at these strategies.

An agglomerative hierarchical clustering method uses a bottom-up strategy. It typically starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all the objects are in a single cluster or certain termination conditions are satisfied. The single cluster becomes the hierarchy's root. For the merging step, it finds the two clusters that are closest to each other (according to some similarity measure) and combines the two to form one cluster. Because two clusters are merged per iteration, where each cluster contains at least one object, an agglomerative method requires at most n iterations.

A divisive hierarchical clustering method employs a top-down strategy. It starts by placing all
objects in one cluster, which is the hierarchy’s root. It then divides the root cluster into several
smaller subclusters, and recursively partitions those clusters into smaller ones. The partitioning
process continues until each cluster at the lowest level is coherent enough—either containing only
one object, or the objects within a cluster are sufficiently similar to each other.

In either agglomerative or divisive hierarchical clustering, a user can specify the desired number of
clusters as a termination condition.

Consider the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}. Initially, AGNES, the agglomerative method, places each object into a cluster of its own.

The clusters are then merged step-by-step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters.

This is a single-linkage approach in that each cluster is represented by all the objects in the cluster,
and the similarity between two clusters is measured by the similarity of the closest pair of data
points belonging to different clusters. The cluster-merging process repeats until all the objects are
eventually merged to form one cluster.
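As a small illustration of the single-linkage measure, the helper below computes the distance between the closest pair of points drawn from two clusters; the function name and the one-dimensional example values are invented, not taken from the text.

def single_linkage_distance(cluster_a, cluster_b):
    # Distance between the closest pair of points drawn from the two clusters.
    return min(abs(a - b) for a in cluster_a for b in cluster_b)

# Two one-dimensional clusters; they would be merged next only if this value is
# the smallest among all pairs of current clusters.
print(single_linkage_distance([2, 3, 4], [10, 11]))  # prints 6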
6. How does agglomerative hierarchical clustering work? Explain with an example.
Agglomerative hierarchical clustering is a method of clustering that starts with each data
point as its own cluster and gradually merges the most similar clusters together until all data
points belong to a single cluster.
Here's an example of how agglomerative hierarchical clustering works in the context of data warehousing and mining (DWM); a short code sketch follows the numbered steps.
Suppose we have a dataset of customer transactions consisting of the following features:
age, income, and transaction amount. We want to use agglomerative hierarchical clustering
to group similar customers based on their transaction behavior.
1. Start by treating each data point as its own cluster. In our example, each transaction
is a data point, so we start with n clusters, where n is the number of transactions.
2. Calculate the distance between each pair of clusters using a distance metric such as
Euclidean distance or Manhattan distance. The distance between two clusters can be
defined as the distance between their centroids (mean values of each feature), or the
distance between their closest data points, or any other measure of similarity.
3. Merge the two closest clusters into a single cluster. In our example, let's say that
transaction 1 and transaction 2 are the closest, so we merge them into a new cluster.
4. Recalculate the distance between the new cluster and all other clusters. In our
example, we need to calculate the distance between the new cluster and each of the
remaining n-1 clusters.
5. Repeat steps 3 and 4 until all data points belong to a single cluster. At each iteration,
we merge the two closest clusters into a new cluster, recalculate the distances
between the new cluster and all other clusters, and repeat the process until we have
a single cluster containing all data points.
6. Stop when a stopping criterion is met. A stopping criterion can be defined based on
the number of clusters desired or the threshold distance between clusters beyond
which they should not be merged.
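A hedged sketch of this customer example using scikit-learn's AgglomerativeClustering is shown below. The feature values are invented for illustration, and in practice the features would usually be scaled first so that income does not dominate the distance calculation.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Each row is one transaction: [age, income, transaction amount] (made-up values).
X = np.array([
    [25, 30000, 120],
    [27, 32000, 150],
    [45, 80000, 900],
    [47, 82000, 950],
    [31, 40000, 300],
], dtype=float)

# Agglomerative clustering with single linkage, stopping at 2 clusters.
model = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = model.fit_predict(X)
print(labels)  # one cluster label per transaction, e.g. [0 0 1 1 0]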
7. Write the k-means algorithm and find the k clusters of the following set of data:
K = {2, 3, 4, 7, 10, 11, 12, 18, 20, 22, 25, 30}, given k = 2
Now, let's apply the k-means algorithm to the set of data given:
K={2,3,4,7,10,11,12,18,20,22,25,30} with k=2 clusters.
Step 1: Initialize k = 2 centroids; choose, for example, 4 and 20 as the initial centroids.
Step 2: Assign each data point to its nearest centroid

Data point | Distance to centroid 4 | Distance to centroid 20
2          | 2                      | 18
3          | 1                      | 17
4          | 0                      | 16
7          | 3                      | 13
10         | 6                      | 10
11         | 7                      | 9
12         | 8                      | 8
18         | 14                     | 2
20         | 16                     | 0
22         | 18                     | 2
25         | 21                     | 5
30         | 26                     | 10

Data points {2, 3, 4, 7, 10, 11, 12} belong to centroid 4 (the point 12 is equidistant from both centroids, 8 each; here it is assigned to centroid 4).
Data points {18, 20, 22, 25, 30} belong to centroid 20.
Step 3: Recalculate centroids
Centroid 1: {2, 3, 4, 7, 10, 11, 12}, mean = 7
Centroid 2: {18, 20, 22, 25, 30}, mean = 23
Step 4: Repeat steps 2 and 3 until convergence or the maximum number of iterations is reached.
• With the updated centroids 7 and 23, no data point changes cluster, so the assignment is stable and the algorithm has converged.
Therefore, we obtain two clusters with centroids at 7 and 23 respectively.
• Cluster 1: {2, 3, 4, 7, 10, 11, 12}   Cluster 2: {18, 20, 22, 25, 30}
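The hand calculation can be cross-checked with scikit-learn by forcing the same initial centroids; this is a sketch for verification, not part of the original answer.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([2, 3, 4, 7, 10, 11, 12, 18, 20, 22, 25, 30], dtype=float).reshape(-1, 1)

# Use 4 and 20 as the initial centroids, as in step 1 above.
init = np.array([[4.0], [20.0]])
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

print(km.cluster_centers_.ravel())  # expected: [ 7. 23.]
print(km.labels_)                   # cluster membership of each data point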
8. Write the algorithm of bisecting K-means and explain how it differs from simple K-means.
Bisecting K-means is a variant of the K-means clustering algorithm that repeatedly splits one cluster into two until k clusters are obtained, instead of computing all k clusters at once. Here's how the algorithm works:
1. Initialize the entire dataset as a single cluster.
2. While the number of clusters is less than the desired number of clusters k, repeat the following steps:
a. Select a cluster to split, typically the one with the highest sum of squared errors (SSE) or the largest cluster.
b. Apply K-means with k = 2 to the selected cluster to obtain two sub-clusters.
c. Replace the selected cluster with its two sub-clusters.
The algorithm stops when the desired number of clusters k is reached.
Bisecting K-means differs from simple K-means in the following ways:
1. Instead of initializing k centroids randomly, bisecting K-means starts with a single
cluster that contains all data points.
2. Bisecting K-means repeatedly divides the current cluster into two sub-clusters, while
simple K-means computes all k clusters in parallel.
3. Bisecting K-means decides which cluster to split next (for example, the one with the highest SSE), whereas simple K-means refines all k clusters simultaneously by reassigning every point to its nearest centroid.
4. Bisecting K-means is typically less sensitive to initialization than simple K-means, whose result depends strongly on the randomly chosen initial centroids.
5. Bisecting K-means can be slower than simple K-means for small values of k, but can
be faster for larger values of k or when the dataset is large.
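A hedged Python sketch of bisecting K-means built on top of scikit-learn's ordinary KMeans is given below; the function names (sse, bisecting_kmeans) are hypothetical, and recent scikit-learn versions also provide a ready-made BisectingKMeans estimator.

import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    # Sum of squared errors of a cluster around its own centroid.
    return float(((points - points.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(X, k):
    clusters = [X]                                   # start with one cluster holding all points
    while len(clusters) < k:
        # Pick the cluster with the highest SSE as the one to split next.
        i = max(range(len(clusters)), key=lambda j: sse(clusters[j]))
        target = clusters.pop(i)
        # Split the chosen cluster into two sub-clusters with 2-means.
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(target)
        clusters.extend([target[labels == 0], target[labels == 1]])
    return clusters

X = np.array([2, 3, 4, 7, 10, 11, 12, 18, 20, 22, 25, 30], dtype=float).reshape(-1, 1)
for c in bisecting_kmeans(X, 2):
    print(c.ravel())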

9. Explain the DBSCAN clustering algorithm in detail.


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering
algorithm that groups together data points that are closely packed together while identifying
outliers or noise points that do not belong to any cluster. Here's how the algorithm works:
1. Initialize all data points as unvisited.
2. Choose a random unvisited point and find all the neighboring points within a
distance eps.
3. If the number of neighboring points is less than a threshold value minPts, mark the point as noise for now (it may later turn out to be a border point of a cluster) and move on to the next unvisited point.
4. If the number of neighboring points is greater than or equal to minPts, start a new cluster with the current point and all its neighboring points.
5. For each point added to the new cluster, find its neighboring points within eps. Any unassigned neighbors are added to the cluster, and if the point itself has at least minPts neighbors (i.e., it is a core point), the cluster is expanded through its neighborhood as well.
6. Repeat steps 4 and 5 until no more points can be added to the current cluster.
7. Repeat steps 2-6 for all unvisited points until all points have been assigned to a
cluster or marked as noise.
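A brief usage sketch with scikit-learn's DBSCAN implementation follows; the data, eps, and min_samples values are arbitrary examples chosen for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated point (made-up data).
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8],
              [25, 25]], dtype=float)

# eps is the neighborhood radius; min_samples corresponds to minPts above.
db = DBSCAN(eps=2.0, min_samples=3).fit(X)
print(db.labels_)  # e.g. [ 0  0  0  1  1  1 -1]; the label -1 marks noise points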
10. List the strengths and Weaknesses of DBSCAN algorithm.
Strengths:
1. DBSCAN can find clusters of arbitrary shapes and sizes, as it does not assume any
particular cluster shape or size.
2. DBSCAN can handle noise and outliers, as it identifies them as points that do not
belong to any cluster.
3. DBSCAN does not require prior knowledge of the number of clusters, unlike K-means
and other centroid-based clustering algorithms.
4. DBSCAN can be efficient for large datasets when a spatial index is used, as it mainly needs to compute distances between nearby points.
5. DBSCAN produces largely consistent results, as it does not depend on a random initialization the way K-means does; only the assignment of border points can vary with the processing order.
Weaknesses:
1. DBSCAN requires setting two hyperparameters: the minimum number of points
required to form a dense region (minPts) and the radius of the neighborhood around
each point (eps). The choice of these parameters can have a significant impact on the
clustering results.
2. DBSCAN can struggle with datasets that have varying densities or clusters with vastly
different densities, as it uses a fixed eps radius for each point.
3. DBSCAN may not be suitable for high-dimensional datasets, as the notion of distance
becomes less meaningful in high-dimensional spaces (the curse of dimensionality).
4. DBSCAN may split or miss clusters whose points are separated by internal gaps or low-density regions, since such points are not density-reachable from one another.
5. DBSCAN can be sensitive to the order in which the data points are processed,
particularly if there are multiple clusters with similar densities.
