DWDM Unit V Note

Cluster analysis is a crucial technique in data mining and machine learning used to group similar data points into clusters, aiding in data summarization, pattern recognition, decision making, and anomaly detection. Clustering methods covered in this unit include K-Means, K-Medoids, hierarchical methods, and DBSCAN, each with its strengths and weaknesses. Applications span domains such as marketing, healthcare, finance, and e-commerce, highlighting the versatility and importance of clustering in data analysis.

UNIT 5

Cluster Analysis:

Cluster and Importance of Cluster Analysis - Clustering Techniques - Different Types of Clusters - Partitioning Methods (K-Means, K-Medoids) - Strengths and Weaknesses - Hierarchical Methods (Agglomerative, Divisive) - Density-Based Methods (DBSCAN)

Cluster Analysis-Importance:

Cluster Analysis is a vital technique in data mining and machine learning, used to group similar
objects or data points into clusters. Its importance spans across various domains and applications,
and it plays a critical role in understanding data patterns and making informed decisions.

Below are the key reasons why Cluster Analysis is important:

1. Data Summarization

• Cluster analysis simplifies large datasets by grouping similar data points into clusters,
helping to summarize the data efficiently. Instead of analyzing individual data points, you
can analyze clusters, reducing the complexity of the data while retaining its underlying
patterns.

2. Pattern Recognition

• Cluster analysis helps in identifying hidden patterns in data that are not obvious at first
glance. By grouping similar items together, it highlights the structure in the dataset,
revealing underlying relationships between data points.

3. Decision Making
• Clustering can aid business and organizational decision-making by revealing important
segments within data. For example, identifying key customer segments can guide
marketing strategies, product development, and resource allocation.

4. Anomaly Detection

• Clustering can help in identifying outliers or anomalies in the data by grouping normal
points into clusters and flagging those that don’t belong to any cluster. This is useful for
fraud detection, network security, and fault detection.

5. Exploratory Data Analysis (EDA)

• Cluster analysis is often a first step in exploring datasets when little is known about the
data. By identifying groups of similar objects, clustering provides a basis for hypothesis
generation and further data analysis.

Applications of Cluster Analysis:

• Marketing: Customer segmentation to design personalized marketing campaigns.

• Healthcare: Grouping patients with similar symptoms for diagnosis and treatment
planning.

• Finance: Identifying fraudulent transactions and segmenting financial portfolios.

• Genomics: Grouping genes with similar expression profiles to understand genetic functions.

• E-commerce: Product recommendation systems and user behavior analysis.

Different Types of Clustering Methods:


K-Means Clustering
K-Means is probably the most well-known clustering algorithm; it is easy to understand and implement.

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled
data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups
in the data, with the number of groups represented by the variable K.

Clustering is a Machine Learning technique that involves the grouping of data points. Given a set
of data points, we can use a clustering algorithm to classify each data point into a specific group.
K-means example

Steps for finding the K-means clusters for the given data x= {2,3,5,6,8,10,11,14,16,17} k=2.

Step 1: Choose the Number of Clusters (K)

We have K = 2 clusters.

Step 2: Randomly Initialize Centroids

Let's randomly select two initial centroids from the dataset. For this example, let's choose:

• Centroid 1 (C1): 3 (initially chosen from the dataset)

• Centroid 2 (C2): 14 (initially chosen from the dataset)

Step 3: Assign Points to the Nearest Centroid

Each point is assigned to whichever centroid is closer. For example, 8 is at distance 5 from C1 = 3 and distance 6 from C2 = 14, so it joins Cluster 1.

Cluster Assignments after Step 3:

• Cluster 1 (C1 = 3): {2, 3, 5, 6, 8}

• Cluster 2 (C2 = 14): {10, 11, 14, 16, 17}

Step 4: Recalculate Centroids

Now we calculate the new centroids by averaging the points in each cluster.

• New Centroid 1 (C1): (2 + 3 + 5 + 6 + 8) / 5 = 4.8

• New Centroid 2 (C2): (10 + 11 + 14 + 16 + 17) / 5 = 13.6

Step 5: Reassign Points to the New Centroids

Every point is still closer to its current centroid than to the other one, so the assignments do not change.

Final Cluster Assignments:

• Cluster 1 (C1 = 4.8): {2, 3, 5, 6, 8}

• Cluster 2 (C2 = 13.6): {10, 11, 14, 16, 17}

The centroids and cluster assignments have stabilized, which indicates the end of the K-means process for this dataset.
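The iteration above can be reproduced with a short script. The following is a minimal sketch of K-means for this 1-D dataset in plain Python; the function name kmeans_1d and its structure are illustrative (not taken from any library), and the sketch assumes no cluster ever becomes empty.

```python
# Minimal 1-D K-means sketch for x = {2, 3, 5, 6, 8, 10, 11, 14, 16, 17}, k = 2.
def kmeans_1d(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (assumes no cluster becomes empty, which holds for this data).
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:   # centroids stabilized: stop
            return clusters, centroids
        centroids = new_centroids
    return clusters, centroids

data = [2, 3, 5, 6, 8, 10, 11, 14, 16, 17]
clusters, centroids = kmeans_1d(data, centroids=[3, 14])
print(clusters)   # [[2, 3, 5, 6, 8], [10, 11, 14, 16, 17]]
print(centroids)  # [4.8, 13.6]
```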

Applications of K-Means Clustering Algorithm


• Market segmentation

• Document Clustering

• Image segmentation

• Image compression

• Customer segmentation

• Analyzing trends in dynamic data

Advantages

• It is very easy to understand and implement.

• For a large number of variables, K-means is computationally faster than hierarchical clustering.

• When centroids are recomputed, an instance can change its cluster.

• K-means tends to form tighter clusters than hierarchical clustering.

Disadvantages

• It is difficult to predict the number of clusters (the value of k) in advance.

• The output is strongly affected by initial inputs such as the number of clusters (value of k) and the initial centroids.

• The order of the data can have a strong impact on the final output.

• It is very sensitive to rescaling: if the data are rescaled by normalization or standardization, the output can change completely.

• It does not perform well when the clusters have complicated geometric shapes.

k-medoids clustering method

The k-medoids clustering algorithm is a robust clustering method that identifies representative
objects (medoids) within clusters. It is similar to K-means but offers advantages in terms of
handling noise and outliers because it uses actual data points as cluster centers rather than
centroids.

Key Concepts
• Medoid: The most centrally located point in a cluster, minimizing the distance to all
other points in that cluster.
• Distance Metric: Typically, the Euclidean distance is used, but other metrics can also be
applied, depending on the nature of the data.

Steps of the K-Medoids Algorithm

1. Initialization:

o Select k initial medoids randomly from the dataset.

2. Assignment Step:

o Assign each data point to the nearest medoid based on a chosen distance metric.
This forms k clusters.

3. Update Step:

o For each cluster, choose the new medoid by selecting the point within the cluster
that minimizes the sum of distances to all other points in that cluster.

4. Repeat:

o Repeat the assignment and update steps until the medoids no longer change or a
predefined number of iterations is reached.

5. Output:

o The final medoids and the assignments of data points to clusters.

Example of K-Medoids

Let’s say we have the following data points: {1,2,3,6,7,8}

and we want to form k=2 clusters.

1. Initialization:

o Randomly select two medoids, e.g., 1 and 6.

2. Assignment:

o Points {1,2,3} are closer to medoid 1, and points {6,7,8} are closer to medoid 6.

3. Update:
o Calculate new medoids. For the first cluster, 2 becomes the new medoid, since it minimizes the total distance to the points 1, 2, 3. For the second cluster, 7 is chosen, since it minimizes the total distance to 6, 7, 8.

4. Repeat until the medoids stabilize.
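The assign-and-update loop above can also be sketched directly in Python. This is an illustrative toy implementation for 1-D data (the helper kmedoids_1d is hypothetical, not the full PAM algorithm); it reproduces the clusters {1, 2, 3} and {6, 7, 8} with medoids 2 and 7.

```python
# Minimal k-medoids sketch for the points {1, 2, 3, 6, 7, 8} with k = 2.
def kmedoids_1d(points, medoids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest medoid.
        clusters = [[] for _ in medoids]
        for p in points:
            nearest = min(range(len(medoids)), key=lambda i: abs(p - medoids[i]))
            clusters[nearest].append(p)
        # Update step: the new medoid of each cluster is the member that
        # minimizes the sum of distances to all other members.
        new_medoids = [min(c, key=lambda m: sum(abs(m - q) for q in c))
                       for c in clusters]
        if new_medoids == medoids:       # medoids no longer change: stop
            return clusters, medoids
        medoids = new_medoids
    return clusters, medoids

clusters, medoids = kmedoids_1d([1, 2, 3, 6, 7, 8], medoids=[1, 6])
print(clusters)  # [[1, 2, 3], [6, 7, 8]]
print(medoids)   # [2, 7]
```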

Advantages of K-Medoids

• Robust to Noise and Outliers: Because it selects actual data points as medoids, it is less
affected by outliers compared to K-means.

• Flexibility with Distance Metrics: K-medoids can use any distance measure, making it
applicable to various types of data.

Disadvantages of K-Medoids

• Computationally Intensive: K-medoids is more computationally expensive than K-means, especially for large datasets, because it requires calculating pairwise distances for all points.

• Sensitive to Initial Medoids: Like K-means, the choice of initial medoids can affect the
outcome, although it tends to be less sensitive than K-means.

Hierarchical Methods

Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters.
It is particularly useful for exploratory data analysis and is often visualized using a dendrogram.
There are two main approaches to hierarchical clustering: agglomerative and divisive. Here’s a
breakdown of these methods:

1. Agglomerative Hierarchical Clustering (Bottom-Up Approach)

In agglomerative clustering, the algorithm starts with each data point as an individual cluster
and then merges them into larger clusters based on a distance metric. The process continues
until all points are merged into a single cluster or a stopping criterion is met.

2. Divisive Hierarchical Clustering (Top-Down Approach)

Divisive clustering begins with a single cluster containing all data points and recursively splits it
into smaller clusters.

Visualization with Dendrograms


Hierarchical clustering is often visualized using a dendrogram, which shows the arrangement of
the clusters and the distances at which clusters were merged or split. The height of the
branches represents the distance or dissimilarity between clusters.

[Figure: Dendrogram]
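A dendrogram like the one described above can be produced with SciPy's hierarchical clustering routines. The snippet below is a minimal sketch assuming SciPy and Matplotlib are available; the sample points and the choice of average linkage are only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Small 2-D sample dataset (illustrative only).
X = np.array([[1, 2], [1, 4], [1, 0], [2, 2], [2, 3], [3, 3], [5, 4], [6, 6]])

# Agglomerative (bottom-up) clustering with average linkage on Euclidean distances.
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree to obtain a flat clustering with, for example, 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# Plot the dendrogram; the branch heights show the merge distances.
dendrogram(Z)
plt.xlabel("Data points")
plt.ylabel("Distance")
plt.show()
```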

Advantages of Hierarchical Clustering

• No Need to Specify Number of Clusters: The algorithm does not require the number of
clusters to be specified in advance.

• Hierarchical Structure: Provides a comprehensive view of the data by showing relationships between clusters at various levels.

• Flexible: Different linkage criteria and distance metrics can be used depending on the
nature of the data.

Disadvantages of Hierarchical Clustering

• Computationally Intensive: Agglomerative clustering can be slow for large datasets, as it involves calculating pairwise distances.

• Memory Usage: It requires more memory to store the distance matrix for large datasets.

• Sensitive to Noise and Outliers: Outliers can significantly affect the structure of clusters.

• Difficulties in Interpretation: Determining the "best" number of clusters can be subjective and may require additional criteria or methods.

Applications

Hierarchical clustering is widely used in various domains, such as:

• Biology: For phylogenetic analysis and classifying species.


• Text Mining: Organizing documents into topic-based clusters.

• Image Analysis: Grouping similar images based on features.

• Market Research: Segmenting customers based on buying behavior.

AGNES (Agglomerative Nesting) and DIANA (Divisive Analysis) are two popular hierarchical
clustering methods used to group similar data points into clusters. While both methods create
hierarchical representations of data, they do so through different approaches.

AGNES (Agglomerative Nesting)

AGNES is a bottom-up approach to hierarchical clustering. It starts with each data point as its own
cluster and merges them based on their similarity until only one cluster remains.

Steps of AGNES:

1. Initialization: Start with n clusters (each data point is its own cluster).

2. Calculate Distances: Compute the distance between every pair of clusters.

3. Merge Clusters: Identify the two closest clusters and merge them into one.

4. Update Distance Matrix: Recalculate the distances between the new cluster and all other
clusters.

5. Repeat: Continue steps 3 and 4 until only one cluster remains or until a stopping criterion
is met (e.g., a predefined number of clusters).
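As a rough sketch of steps 1-5, the following plain-Python function performs single-linkage agglomerative clustering on 1-D points. The name agnes_single_linkage and the data are illustrative only; it recomputes distances on every pass instead of maintaining an explicit distance matrix.

```python
# Minimal AGNES sketch: single-linkage agglomerative clustering of 1-D points.
def agnes_single_linkage(points, num_clusters):
    # Step 1: every point starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Steps 2-3: find and merge the two closest clusters. Single linkage
        # uses the distance between the closest pair of members.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
        # Step 4 (updating the distance matrix) is implicit here, because the
        # distances are recomputed from scratch on every pass.
    return clusters

# Splits at the largest gap (between 11 and 14):
# {2, 3, 5, 6, 8, 10, 11} and {14, 16, 17}.
print(agnes_single_linkage([2, 3, 5, 6, 8, 10, 11, 14, 16, 17], num_clusters=2))
```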

Advantages of AGNES:

• Simple and intuitive to implement.

• Produces a dendrogram, which provides a visual representation of the clustering process.

Disadvantages of AGNES:

• Computationally expensive for large datasets (O(n³)).

• Sensitive to noise and outliers.

• Requires the distance metric and linkage criteria to be specified in advance.

DIANA (Divisive Analysis)


DIANA is a top-down approach to hierarchical clustering. It starts with all data points in a single
cluster and recursively splits it into smaller clusters.

Steps of DIANA:

1. Initialization: Start with one cluster containing all data points.

2. Calculate Dissimilarity: Compute the dissimilarity (distance) of all points from the cluster
centroid.

3. Split the Cluster: Identify the point that is farthest from the centroid and create a new
cluster with that point.

4. Recalculate Centroids: Update the centroid of the remaining points.

5. Repeat: Continue the splitting process until all points are in their individual clusters or
until a stopping criterion is met.
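A simplified sketch of this splitting rule is shown below. It follows the steps exactly as described above, peeling off the point farthest from the current centroid; a full DIANA implementation additionally moves every point that is closer to the splinter group than to the rest, so this is only an illustration of the top-down idea, and the names are hypothetical.

```python
# Simplified sketch of the splitting rule described above: repeatedly peel off
# the point that is farthest from the centroid of the largest remaining cluster.
def diana_simplified(points, num_clusters):
    clusters = [list(points)]                # Step 1: one cluster with all points
    while len(clusters) < num_clusters:
        big = max(clusters, key=len)         # split the largest cluster next
        centroid = sum(big) / len(big)       # Step 2: dissimilarity to the centroid
        farthest = max(big, key=lambda p: abs(p - centroid))
        big.remove(farthest)                 # Step 3: start a new cluster with it
        clusters.append([farthest])          # Step 4: the remaining centroid is
                                             # recomputed on the next pass
    return clusters

# With the 1-D data used earlier, the first split peels off 17:
print(diana_simplified([2, 3, 5, 6, 8, 10, 11, 14, 16, 17], num_clusters=2))
# → [[2, 3, 5, 6, 8, 10, 11, 14, 16], [17]]
```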

Advantages of DIANA:

• Directly finds the most dissimilar points, which can be useful for identifying outliers.

• Useful when there is a clear hierarchical structure in the data.

Disadvantages of DIANA:

• More computationally expensive than AGNES, especially with large datasets.

• Less common and less intuitive than the agglomerative approach.

DBSCAN Clustering method:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that groups points that are closely packed together, while marking points that lie alone in low-density regions as outliers. It is particularly effective for datasets with noise and clusters of varying shapes and sizes.

Key Concepts

• Density: DBSCAN defines clusters based on the density of data points. The algorithm
relies on two parameters:

o Epsilon (ε): The maximum distance between two samples for them to be
considered as in the same neighborhood.
o MinPts: The minimum number of points required to form a dense region (a core
point).

Steps of the DBSCAN Algorithm

1. Classify Points:

o Core Point: A point that has at least MinPts points within its ε neighborhood.

o Border Point: A point that is within the ε neighborhood of a core point but does
not have enough neighbors to be a core point itself.

o Noise Point: A point that is neither a core point nor a border point.

2. Cluster Formation:

o Start with an unvisited point. If it is a core point, create a new cluster and
retrieve all points within its ε neighborhood.

o Add all those points to the cluster and recursively repeat the process for each of
those points.

o If a point is a border point, it is added to the cluster of the nearest core point.

o If a point is a noise point, it is labeled as such and ignored in cluster formation.

3. Termination:

o Continue this process until all points have been visited.
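The classification and expansion steps can be sketched in plain Python as follows. This is a minimal, illustrative implementation (not taken from any library), using Euclidean distance and labelling noise points with -1.

```python
import math

# Minimal DBSCAN sketch: returns a list where labels[i] is a cluster id or -1 for noise.
def dbscan(points, eps, min_pts):
    def neighbors(i):
        # Indices of all points within eps of point i (includes i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)            # None means "not yet visited"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:             # not a core point: mark as noise for now
            labels[i] = -1
            continue
        labels[i] = cluster_id               # core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:                         # expand the cluster
            j = queue.pop()
            if labels[j] == -1:              # previously noise -> border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

# Tiny illustrative usage: three nearby points form one cluster, one point is noise.
print(dbscan([(0, 0), (0, 1), (1, 0), (5, 5)], eps=1.5, min_pts=3))  # [0, 0, 0, -1]
```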


Advantages of DBSCAN

• No Need to Specify Number of Clusters: Unlike K-means, DBSCAN does not require the
user to specify the number of clusters beforehand.

• Robust to Noise: It can effectively identify outliers as noise points.

• Can Discover Arbitrarily Shaped Clusters: DBSCAN can find clusters of varying shapes
and sizes, which is beneficial in many real-world applications.

Disadvantages of DBSCAN

• Parameter Sensitivity: The results can be sensitive to the choice of ε and MinPts. Poor
choices may lead to incorrect clustering.

• Difficulty with Varying Densities: DBSCAN struggles with datasets containing clusters of
varying densities, as a single ε may not be suitable for all clusters.

• High Dimensionality: The algorithm can struggle with high-dimensional data due to the
curse of dimensionality, making distance measures less meaningful.

Applications

DBSCAN is widely used in various fields, including:

• Geospatial Analysis: Identifying clusters of geographic locations, such as crime hotspots or disease outbreaks.

• Image Processing: Segmenting images into regions based on pixel density.

• Anomaly Detection: Identifying rare events or outliers in datasets.

• Biology: Grouping genes or species based on similarity metrics.

Example Dataset

Consider the following dataset of points in a two-dimensional space:

Points: (1,2),(1,4),(1,0),(2,2),(2,3),(3,3),(5,4),(6,6)

Parameters

• Epsilon (ε): 1.5 (The maximum distance to consider points as neighbors)

• MinPts: 3 (The minimum number of points required to form a dense region)


Step-by-Step DBSCAN Process

Step 1: Identify Core Points

For each point in the dataset, check whether it is a core point by counting the points (including the point itself) that lie within its ε neighborhood, using Euclidean distance.

1. Point (1, 2):

o Neighbors within ε: (1, 2), (2, 2), (2, 3) → 3 points

o Core Point

2. Point (1, 4):

o Neighbors within ε: (1, 4), (2, 3) → 2 points

o Not a Core Point

3. Point (1, 0):

o Neighbors within ε: (1, 0) only → 1 point

o Not a Core Point

4. Point (2, 2):

o Neighbors within ε: (2, 2), (1, 2), (2, 3), (3, 3) → 4 points

o Core Point

5. Point (2, 3):

o Neighbors within ε: (2, 3), (1, 2), (1, 4), (2, 2), (3, 3) → 5 points

o Core Point

6. Point (3, 3):

o Neighbors within ε: (3, 3), (2, 2), (2, 3) → 3 points

o Core Point

7. Point (5, 4):

o Neighbors within ε: (5, 4) only → 1 point

o Not a Core Point

8. Point (6, 6):

o Neighbors within ε: (6, 6) only → 1 point

o Not a Core Point

Step 2: Forming Clusters

Now, start forming clusters using the identified core points.

1. Starting with Core Point (1, 2):

o Its ε neighborhood contains (2, 2) and (2, 3).

o This starts the first cluster: Cluster 1 = {(1, 2), (2, 2), (2, 3)}.

2. Expanding from Core Point (2, 2):

o Its ε neighborhood adds (3, 3). Cluster 1 = {(1, 2), (2, 2), (2, 3), (3, 3)}.

3. Expanding from Core Point (2, 3):

o Its ε neighborhood adds (1, 4), which is a border point (it lies within ε of a core point but is not a core point itself).

o Cluster 1 = {(1, 2), (1, 4), (2, 2), (2, 3), (3, 3)}.

4. Expanding from Core Point (3, 3):

o No new points are reached, so the expansion of Cluster 1 stops.

Step 3: Identify Noise Points

• Point (1, 0): Not a core point and not within ε of any core point. Label as Noise.

• Point (5, 4): No neighbors other than itself and not part of any cluster. Label as Noise.

• Point (6, 6): No neighbors other than itself and not part of any cluster. Label as Noise.

Final Clustering Result

• Cluster 1: {(1, 2), (1, 4), (2, 2), (2, 3), (3, 3)}

• Noise Points: {(1, 0), (5, 4), (6, 6)}


Visualization

If you were to visualize this dataset:

• The points in Cluster 1 would appear as one dense group.

• Points (1, 0), (5, 4) and (6, 6) would be marked as noise.
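The same result can be cross-checked with scikit-learn's DBSCAN implementation, assuming scikit-learn is installed; in its convention, noise points receive the label -1 and min_samples counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [1, 4], [1, 0], [2, 2], [2, 3], [3, 3], [5, 4], [6, 6]])

# eps and min_samples correspond to ε and MinPts above
# (in scikit-learn, min_samples counts the point itself).
db = DBSCAN(eps=1.5, min_samples=3).fit(X)
print(db.labels_)
# Expected: one shared cluster label for (1, 2), (1, 4), (2, 2), (2, 3), (3, 3)
# and -1 (noise) for (1, 0), (5, 4), (6, 6).
```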

Compare K-means, DBSCAN, and Hierarchical Clustering algorithms

Here's a comparison of K-means, DBSCAN, and Hierarchical Clustering in terms of their approaches, strengths, and weaknesses:

1. K-means Clustering

Approach:

• Partition-based clustering method.

• Divides the dataset into K predefined clusters.

• Works by assigning each data point to the nearest cluster centroid, then recalculating
the centroids iteratively until convergence.

• Objective: Minimize the sum of squared distances between points and their assigned
cluster centroid.

Strengths:

• Efficient for large datasets due to its simplicity and quick convergence.

• Easy to implement and interpret.

• Works well when clusters are globular and well-separated.

• Easily scales to high-dimensional data.

• K-means++ initialization helps to mitigate poor centroid initialization problems.

Weaknesses:
• Requires the user to predefine the number of clusters (K), which may not always be
obvious.

• Struggles with non-spherical clusters and with clusters of varying sizes and densities.

• Sensitive to outliers and noise, which can skew cluster centroids.

• Often converges to local optima; performance depends on initial centroids.

2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Approach:

• Density-based clustering method.

• Groups points that are closely packed together (high-density regions), and marks points
in low-density regions as noise or outliers.

• Defines clusters based on two parameters: ε (epsilon), the radius to search for
neighboring points, and minPts, the minimum number of points required to form a
dense region.

Strengths:

• Does not require the user to specify the number of clusters (K).

• Capable of identifying non-spherical clusters (clusters of arbitrary shapes, e.g., elongated or crescent-shaped).

• Can automatically detect outliers and noise.

• Performs well when clusters vary in size and density.

Weaknesses:

• Performance depends heavily on the choice of the parameters ε and minPts. These
parameters may be difficult to determine, and inappropriate values may result in poor
clustering.

• Struggles with datasets with varying density, as a fixed ε might fail to capture clusters
with different densities.
• Does not scale well with large, high-dimensional datasets, as it becomes computationally
expensive to compute neighbors.

• Sensitive to noise and the chosen distance metric.

3. Hierarchical Clustering (Agglomerative and Divisive)

Approach:

• Hierarchical-based clustering method.

• Two approaches:

o Agglomerative (Bottom-up): Each point starts as its own cluster, and pairs of
clusters are successively merged based on a linkage criterion until one cluster
remains or a threshold is met.

o Divisive (Top-down): Starts with all points in one cluster and recursively splits
them.

• Generates a dendrogram (tree-like diagram) representing the nested clusters.

Strengths:

• Does not require predefining the number of clusters; instead, you can choose the
number by cutting the dendrogram at a specific level.

• Can capture hierarchical relationships between data points, which can be useful for
some applications (e.g., taxonomy).

• Flexible in terms of distance metrics (Euclidean, Manhattan, etc.) and linkage criteria
(single, complete, average).

• Works well with smaller datasets and is useful when clusters are hierarchical in nature.

Weaknesses:

• Computationally expensive: Agglomerative methods typically have a time complexity of O(n²), making them impractical for large datasets.

• Sensitive to the choice of linkage criteria (single, complete, average), which can lead to
different clustering outcomes.
• Not robust to outliers, as they can distort the hierarchy and merge incorrectly.

• Difficult to interpret when there are too many clusters, making dendrograms complex.

Summary Table:

Feature                  | K-means                        | DBSCAN                    | Hierarchical Clustering
Type                     | Partition-based                | Density-based             | Hierarchical
Cluster Shape            | Spherical                      | Arbitrary shapes          | Arbitrary shapes
Predefine # of Clusters  | Yes                            | No                        | No
Handling Noise/Outliers  | Poor                           | Good                      | Poor
Works on Large Datasets  | Yes (efficient)                | No (scales poorly)        | No (computationally expensive)
Scalability              | High                           | Moderate to low           | Low
Suitable for             | Large, well-separated datasets | Varying densities, noise  | Hierarchical relationships, small datasets
Parameter Sensitivity    | Sensitive to initial centroids | Sensitive to ε and minPts | Sensitive to linkage criterion

Assignment 5:

1. What is cluster analysis? What are the important reasons for performing cluster analysis?
2. Explain the different types of clustering methods briefly?
3. Write the applications and limitations of k-means clustering algorithm?
4. Describe DBSCAN and explain the strengths and weakness of DBSCAN algorithm?
5. Write short notes on AGNES and DIANA?
6. Explain the K-medoids algorithm with an example?
7. Explain about hierarchical, agglomerative clustering briefly?
8. Write about K-means algorithm and find the k-means for the following data set K=2.
X = {2,3,4,10,11,12,20,25,30,32}
9. What is DBSCAN? Explain DBSCAN algorithm with suitable example.
10. Compare K-means, DBSCAN, and Hierarchical Clustering algorithms in terms of their
approaches, strengths, and weaknesses.
