DWDM Unit V Notes
Cluster Analysis:
Cluster and Importance of Cluster Analysis - Clustering Techniques - Different Types of Clusters - Partitioning Methods (K-Means, K-Medoids) - Strengths and Weaknesses - Hierarchical Methods (Agglomerative, Divisive) - Density-Based Methods (DBSCAN)
Cluster Analysis-Importance:
Cluster Analysis is a vital technique in data mining and machine learning, used to group similar
objects or data points into clusters. Its importance spans across various domains and applications,
and it plays a critical role in understanding data patterns and making informed decisions.
1. Data Summarization
• Cluster analysis simplifies large datasets by grouping similar data points into clusters,
helping to summarize the data efficiently. Instead of analyzing individual data points, you
can analyze clusters, reducing the complexity of the data while retaining its underlying
patterns.
2. Pattern Recognition
• Cluster analysis helps in identifying hidden patterns in data that are not obvious at first
glance. By grouping similar items together, it highlights the structure in the dataset,
revealing underlying relationships between data points.
3. Decision Making
• Clustering can aid business and organizational decision-making by revealing important
segments within data. For example, identifying key customer segments can guide
marketing strategies, product development, and resource allocation.
4. Anomaly Detection
• Clustering can help in identifying outliers or anomalies in the data by grouping normal
points into clusters and flagging those that don’t belong to any cluster. This is useful for
fraud detection, network security, and fault detection.
5. Exploratory Data Analysis
• Cluster analysis is often a first step in exploring datasets when little is known about the data. By identifying groups of similar objects, clustering provides a basis for hypothesis generation and further data analysis.
6. Domain Applications
• Healthcare: Grouping patients with similar symptoms for diagnosis and treatment planning.
K-Means Clustering
Clustering is a machine learning technique that involves grouping data points: given a set of data points, a clustering algorithm assigns each point to a specific group. K-means clustering is a type of unsupervised learning, used when you have unlabeled data (i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable K.
K-means example
Steps for finding the K-means clusters for the data x = {2, 3, 5, 6, 8, 10, 11, 14, 16, 17} with k = 2:
1. We have K = 2 clusters, so randomly select two initial centroids from the dataset.
2. Assign each data point to the cluster of its nearest centroid.
3. Calculate the new centroids by averaging the points in each cluster.
4. Repeat the assignment and update steps until the centroids and cluster assignments stabilize, which indicates the end of the K-means process for this dataset.
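A minimal Python sketch of this procedure is shown below. The initial centroids (2 and 16) are an assumed choice, since the notes do not specify them; a different initialization may converge to different clusters.

# A minimal K-means sketch for the 1-D dataset in the example above.
# The initial centroids (2 and 16) are an assumed choice.

def kmeans_1d(points, centroids, max_iters=100):
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:   # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters

x = [2, 3, 5, 6, 8, 10, 11, 14, 16, 17]
centroids, clusters = kmeans_1d(x, centroids=[2, 16])
print(centroids)   # final cluster means
print(clusters)    # the two clusters

With these assumed starting points the algorithm stabilizes after one update, giving the clusters {2, 3, 5, 6, 8} and {10, 11, 14, 16, 17} with means 4.8 and 13.6.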
Applications of K-Means
• Document clustering
• Image segmentation
• Image compression
• Customer segmentation
Advantages
• With a large number of variables, K-means is generally faster than hierarchical clustering.
Disadvantages
• It does not cluster well when the clusters have complicated geometric (non-spherical) shapes.
K-Medoids Clustering
The k-medoids clustering algorithm is a robust clustering method that identifies representative
objects (medoids) within clusters. It is similar to K-means but offers advantages in terms of
handling noise and outliers because it uses actual data points as cluster centers rather than
centroids.
Key Concepts
• Medoid: The most centrally located point in a cluster, minimizing the distance to all
other points in that cluster.
• Distance Metric: Typically, the Euclidean distance is used, but other metrics can also be
applied, depending on the nature of the data.
1. Initialization:
o Randomly select k data points from the dataset as the initial medoids.
2. Assignment Step:
o Assign each data point to the nearest medoid based on a chosen distance metric.
This forms k clusters.
3. Update Step:
o For each cluster, choose the new medoid by selecting the point within the cluster
that minimizes the sum of distances to all other points in that cluster.
4. Repeat:
o Repeat the assignment and update steps until the medoids no longer change or a
predefined number of iterations is reached.
5. Output:
o The final set of medoids and the clusters formed around them.
Example of K-Medoids
1. Initialization:
o Consider the one-dimensional points {1, 2, 3, 6, 7, 8} with k = 2, and select 1 and 6 as the initial medoids.
2. Assignment:
o Points {1,2,3} are closer to medoid 1, and points {6,7,8} are closer to medoid 6.
3. Update:
o Calculate the new medoids. For the first cluster, 2 becomes the medoid, since its total distance to the points 1, 2, 3 (1 + 0 + 1 = 2) is the smallest; for the second cluster, 7 is chosen for the same reason.
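The following Python sketch mirrors this example. It assumes the one-dimensional points {1, 2, 3, 6, 7, 8} with initial medoids 1 and 6, and uses a simple alternating assign/update loop rather than the full PAM swap search.

# A small k-medoids sketch for the example above (points 1,2,3,6,7,8
# with initial medoids 1 and 6, as in the notes).

def kmedoids(points, medoids, max_iters=100):
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest medoid.
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: abs(p - m))
            clusters[nearest].append(p)
        # Update step: the new medoid of a cluster is the member that
        # minimizes the sum of distances to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(abs(c - p) for p in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):   # medoids no longer change
            break
        medoids = new_medoids
    return medoids, clusters

medoids, clusters = kmedoids([1, 2, 3, 6, 7, 8], medoids=[1, 6])
print(medoids)   # [2, 7]: the most central point of each cluster
print(clusters)  # {2: [1, 2, 3], 7: [6, 7, 8]}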
Advantages of K-Medoids
• Robust to Noise and Outliers: Because it selects actual data points as medoids, it is less
affected by outliers compared to K-means.
• Flexibility with Distance Metrics: K-medoids can use any distance measure, making it
applicable to various types of data.
Disadvantages of K-Medoids
• Sensitive to Initial Medoids: Like K-means, the choice of initial medoids can affect the
outcome, although it tends to be less sensitive than K-means.
Hierarchical Methods
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters.
It is particularly useful for exploratory data analysis and is often visualized using a dendrogram.
There are two main approaches to hierarchical clustering: agglomerative and divisive. Here’s a
breakdown of these methods:
Agglomerative Clustering (Bottom-Up)
In agglomerative clustering, the algorithm starts with each data point as an individual cluster and then merges clusters into larger ones based on a distance metric. The process continues until all points are merged into a single cluster or a stopping criterion is met.
Divisive Clustering (Top-Down)
Divisive clustering begins with a single cluster containing all data points and recursively splits it into smaller clusters.
Dendrogram
A dendrogram is a tree diagram that records the sequence of merges (or splits) in hierarchical clustering. Cutting the dendrogram at a chosen height produces a flat clustering with a specific number of clusters.
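As an illustration, the sketch below builds an agglomerative clustering and its dendrogram with SciPy. The sample points (borrowed from the DBSCAN example later in these notes) and the choice of average linkage with Euclidean distance are assumptions for demonstration only.

# Agglomerative clustering and dendrogram with SciPy (illustrative sketch).
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

points = [[1, 2], [1, 4], [1, 0], [2, 2], [2, 3], [3, 3], [5, 4], [6, 6]]

# Build the merge tree bottom-up (AGNES-style) with average linkage.
Z = linkage(points, method="average", metric="euclidean")

# Cutting the tree into a chosen number of clusters gives flat labels.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster label for each point

# Plot the dendrogram; the merge heights show how dissimilar clusters are.
dendrogram(Z)
plt.show()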
Advantages
• No Need to Specify Number of Clusters: The algorithm does not require the number of clusters to be specified in advance.
• Flexible: Different linkage criteria and distance metrics can be used depending on the
nature of the data.
Disadvantages
• Memory Usage: It requires more memory to store the distance matrix for large datasets.
• Sensitive to Noise and Outliers: Outliers can significantly affect the structure of clusters.
Applications
AGNES (Agglomerative Nesting) and DIANA (Divisive Analysis) are two popular hierarchical
clustering methods used to group similar data points into clusters. While both methods create
hierarchical representations of data, they do so through different approaches.
AGNES is a bottom-up approach to hierarchical clustering. It starts with each data point as its own
cluster and merges them based on their similarity until only one cluster remains.
Steps of AGNES:
1. Initialization: Start with n clusters (each data point is its own cluster).
2. Compute Distance Matrix: Calculate the distance between every pair of clusters.
3. Merge Clusters: Identify the two closest clusters and merge them into one.
4. Update Distance Matrix: Recalculate the distances between the new cluster and all other
clusters.
5. Repeat: Continue steps 3 and 4 until only one cluster remains or until a stopping criterion
is met (e.g., a predefined number of clusters).
Advantages of AGNES:
Disadvantages of AGNES:
DIANA is a top-down approach to hierarchical clustering: it starts with all data points in a single cluster and recursively splits clusters until every point is its own cluster or a stopping criterion is met.
Steps of DIANA:
1. Initialization: Start with one cluster containing all data points.
2. Calculate Dissimilarity: Compute the dissimilarity (distance) of all points from the cluster centroid.
3. Split the Cluster: Identify the point that is farthest from the centroid and create a new cluster with that point.
4. Reassign Points: Move to the new cluster any point that is closer to it than to the remaining cluster.
5. Repeat: Continue the splitting process until all points are in their individual clusters or until a stopping criterion is met.
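A simplified sketch of a single DIANA-style split is shown below, using the one-dimensional data from the K-means example. It seeds the splinter group with the point farthest from the cluster mean and then moves over points that are closer to the splinter group; full DIANA repeats this on the cluster with the largest diameter, so this is only the first split under simplified assumptions.

# One simplified DIANA split step on 1-D data (illustrative sketch only).
def diana_split(points):
    mean = sum(points) / len(points)
    seed = max(points, key=lambda p: abs(p - mean))   # most dissimilar point
    splinter, remaining = [seed], [p for p in points if p != seed]
    moved = True
    while moved:
        moved = False
        rem_mean = sum(remaining) / len(remaining)
        spl_mean = sum(splinter) / len(splinter)
        for p in list(remaining):
            if abs(p - spl_mean) < abs(p - rem_mean):  # closer to the splinter group
                remaining.remove(p)
                splinter.append(p)
                moved = True
    return remaining, splinter

print(diana_split([2, 3, 5, 6, 8, 10, 11, 14, 16, 17]))
# Splits into ([2, 3, 5, 6, 8, 10, 11], [17, 14, 16]) for this data.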
Advantages of DIANA:
• Directly finds the most dissimilar points, which can be useful for identifying outliers.
Disadvantages of DIANA:
Density-Based Methods (DBSCAN)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that lie in dense regions into clusters and marks points in low-density regions as noise.
Key Concepts
• Density: DBSCAN defines clusters based on the density of data points. The algorithm
relies on two parameters:
o Epsilon (ε): The maximum distance between two samples for them to be
considered as in the same neighborhood.
o MinPts: The minimum number of points required to form a dense region (a core
point).
Steps of DBSCAN:
1. Classify Points:
o Core Point: A point that has at least MinPts points within its ε neighborhood.
o Border Point: A point that is within the ε neighborhood of a core point but does
not have enough neighbors to be a core point itself.
o Noise Point: A point that is neither a core point nor a border point.
2. Cluster Formation:
o Start with an unvisited point. If it is a core point, create a new cluster and
retrieve all points within its ε neighborhood.
o Add all those points to the cluster and recursively repeat the process for each of
those points.
o If a point is a border point, it is added to the cluster of the nearest core point.
3. Termination:
o The process repeats until all points have been visited. Points that are never assigned to a cluster remain labeled as noise.
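The short sketch below implements just the classification step in plain Python. The values eps = 2 and min_pts = 4 are assumptions for illustration; the notes leave the parameters unspecified.

# Classify each point as core, border, or noise (eps and min_pts are assumed).
from math import dist

def classify_points(points, eps=2.0, min_pts=4):
    labels = {}
    # A point's eps-neighbourhood conventionally includes the point itself.
    neighbourhoods = {p: [q for q in points if dist(p, q) <= eps] for p in points}
    core = {p for p, nbrs in neighbourhoods.items() if len(nbrs) >= min_pts}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbourhoods[p]):
            labels[p] = "border"   # near a core point but not dense itself
        else:
            labels[p] = "noise"
    return labels

points = [(1, 2), (1, 4), (1, 0), (2, 2), (2, 3), (3, 3), (5, 4), (6, 6)]
for p, label in classify_points(points).items():
    print(p, label)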
Advantages of DBSCAN
• No Need to Specify Number of Clusters: Unlike K-means, DBSCAN does not require the user to specify the number of clusters beforehand.
• Can Discover Arbitrarily Shaped Clusters: DBSCAN can find clusters of varying shapes
and sizes, which is beneficial in many real-world applications.
Disadvantages of DBSCAN
• Parameter Sensitivity: The results can be sensitive to the choice of ε and MinPts. Poor
choices may lead to incorrect clustering.
• Difficulty with Varying Densities: DBSCAN struggles with datasets containing clusters of
varying densities, as a single ε may not be suitable for all clusters.
• High Dimensionality: The algorithm can struggle with high-dimensional data due to the
curse of dimensionality, making distance measures less meaningful.
Applications
Example Dataset
Points: (1,2),(1,4),(1,0),(2,2),(2,3),(3,3),(5,4),(6,6)
Parameters
1. Classify Points: For each point in the dataset, check whether it is a core point by counting the number of points in its ε neighborhood.
o Point (1, 2): neighbors (1, 0), (1, 2), (1, 4), (2, 2), (2, 3) → 5 neighbors → Core Point.
o Checking the remaining points in the same way shows that the other points in this dense region are also core or border points.
2. Cluster Formation:
o From core point (1, 2), we have neighbors (1, 0), (1, 4), (2, 2), (2, 3). This forms the first cluster: Cluster 1 = {(1, 0), (1, 2), (1, 4), (2, 2), (2, 3)}.
o Expanding through the core points of this cluster adds (3, 3), so Cluster 1 = {(1, 0), (1, 2), (1, 4), (2, 2), (2, 3), (3, 3)}.
3. Noise:
o Point (5, 4): no neighbors within ε and not part of any cluster. Label as Noise.
o Point (6, 6): no neighbors within ε and not part of any cluster. Label as Noise.
Final Result:
• Cluster 1: {(1, 0), (1, 2), (1, 4), (2, 2), (2, 3), (3, 3)}
• Noise: {(5, 4), (6, 6)}
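The worked example can be reproduced with scikit-learn's DBSCAN, as sketched below. Since the notes do not state the parameter values, eps = 2 and min_samples = 4 are assumed; with these values the output matches the clusters above (label -1 marks noise).

# Reproducing the worked example with scikit-learn (parameter values assumed).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [1, 4], [1, 0], [2, 2], [2, 3], [3, 3], [5, 4], [6, 6]])

model = DBSCAN(eps=2, min_samples=4).fit(X)
print(model.labels_)
# Expected: the first six points share one cluster label (Cluster 1),
# and (5, 4), (6, 6) are labelled -1 (noise).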
Compare K-means, DBSCAN, and Hierarchical Clustering algorithms
1. K-means Clustering
Approach:
• Works by assigning each data point to the nearest cluster centroid, then recalculating
the centroids iteratively until convergence.
• Objective: Minimize the sum of squared distances between points and their assigned
cluster centroid.
Strengths:
• Efficient for large datasets due to its simplicity and quick convergence.
Weaknesses:
• Requires the user to predefine the number of clusters (K), which may not always be
obvious.
2. DBSCAN
Approach:
• Groups points that are closely packed together (high-density regions), and marks points
in low-density regions as noise or outliers.
• Defines clusters based on two parameters: ε (epsilon), the radius to search for
neighboring points, and minPts, the minimum number of points required to form a
dense region.
Strengths:
• Does not require the user to specify the number of clusters (K).
Weaknesses:
• Performance depends heavily on the choice of the parameters ε and minPts. These
parameters may be difficult to determine, and inappropriate values may result in poor
clustering.
• Struggles with datasets with varying density, as a fixed ε might fail to capture clusters
with different densities.
• Does not scale well with large, high-dimensional datasets, as it becomes computationally
expensive to compute neighbors.
3. Hierarchical Clustering
Approach:
• Two approaches:
o Agglomerative (Bottom-up): Each point starts as its own cluster, and pairs of
clusters are successively merged based on a linkage criterion until one cluster
remains or a threshold is met.
o Divisive (Top-down): Starts with all points in one cluster and recursively splits
them.
Strengths:
• Does not require predefining the number of clusters; instead, you can choose the
number by cutting the dendrogram at a specific level.
• Can capture hierarchical relationships between data points, which can be useful for
some applications (e.g., taxonomy).
• Flexible in terms of distance metrics (Euclidean, Manhattan, etc.) and linkage criteria
(single, complete, average).
• Works well with smaller datasets and is useful when clusters are hierarchical in nature.
Weaknesses:
• Sensitive to the choice of linkage criteria (single, complete, average), which can lead to
different clustering outcomes.
• Not robust to outliers, as they can distort the hierarchy and merge incorrectly.
• Difficult to interpret when there are too many clusters, making dendrograms complex.
Summary Table:
Criterion          | K-means                        | DBSCAN                                   | Hierarchical Clustering
Number of clusters | Must be specified (K)          | Not required                             | Not required (cut the dendrogram)
Cluster shape      | Struggles with complex shapes  | Finds arbitrarily shaped clusters        | Depends on linkage criterion
Noise and outliers | Sensitive                      | Labels low-density points as noise       | Sensitive; outliers distort the hierarchy
Scalability        | Efficient for large datasets   | Costly for large, high-dimensional data  | Best suited to smaller datasets
Key parameters     | K                              | ε and minPts                             | Linkage criterion, distance metric
Assignment 5:
1. What is cluster analysis? What are the important reasons for performing cluster analysis?
2. Explain the different types of clustering methods briefly.
3. Write the applications and limitations of the K-means clustering algorithm.
4. Describe DBSCAN and explain the strengths and weaknesses of the DBSCAN algorithm.
5. Write short notes on AGNES and DIANA.
6. Explain the K-medoids algorithm with an example.
7. Explain hierarchical (agglomerative) clustering briefly.
8. Write about the K-means algorithm and find the K-means clusters for the following dataset with K = 2:
X = {2, 3, 4, 10, 11, 12, 20, 25, 30, 32}
9. What is DBSCAN? Explain the DBSCAN algorithm with a suitable example.
10. Compare K-means, DBSCAN, and Hierarchical Clustering algorithms in terms of their
approaches, strengths, and weaknesses.