
Unit-5

Clustering
Definition of clustering : Clustering is a machine learning technique that involves grouping a set
of objects in such a way that objects in the same group (or cluster) are more similar to each other than to
those in other groups. The similarity can be based on various metrics, such as distance or density.
Clustering is used in a variety of applications, including data mining, pattern recognition, image analysis,
and bioinformatics, to discover structure in data without prior knowledge of the group definitions
(unsupervised learning).

Properties of Clustering or Typical requirements of clustering in data mining:


1. Clustering Scalability: Modern applications involve vast amounts of data stored in huge databases, so the
clustering algorithm should be scalable. If the algorithm is not scalable, it cannot process the full dataset
appropriately, which can lead to wrong or misleading results.

2. High Dimensionality: The algorithm should be able to handle high-dimensional data, even when the number of
data objects is small relative to the number of dimensions.

3. Algorithm Usability with Multiple Data Kinds: Clustering algorithms should work with different kinds of data.
They should be capable of dealing with different attribute types, such as discrete, categorical,
interval-based (numerical) and binary data.

4. Dealing with Unstructured Data: Some databases contain missing values and noisy or erroneous data. If an
algorithm is sensitive to such data, it may produce poor-quality clusters. A clustering method should therefore
be able to handle unstructured data and give it structure by organizing it into groups of similar data objects,
making it easier for the data expert to process the data and discover new patterns.

5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable. Interpretability
reflects how easily the clustering results can be understood.

Applications of clustering:
Clustering has a wide range of applications across various fields. Here are some key applications:

1. Customer Segmentation: Businesses use clustering to group customers with similar behaviors and
characteristics for targeted marketing strategies.

2. Market Research: Clustering helps in identifying distinct market segments and understanding
consumer needs and preferences.

3. Image Segmentation: In computer vision, clustering is used to partition images into regions with
similar pixels, aiding in object detection and image recognition.
4. Anomaly Detection: Clustering can identify outliers or anomalies in data, which is useful in fraud
detection, network security, and fault detection in systems.

5. Document Clustering: Used in text mining and information retrieval to group similar documents,
improving search engines and recommendation systems.

6. Genomics: Clustering is used to group genes or proteins with similar expression patterns, aiding in the
understanding of biological functions and disease mechanisms.

7. Social Network Analysis: Clustering helps in identifying communities or groups within social
networks, understanding social structures, and analyzing user behavior.

8. Recommender Systems: Clustering techniques group similar users or items to provide personalized
recommendations in platforms like Netflix or Amazon.

9. Urban Planning: Clustering can analyze geographical data to identify areas with similar
characteristics, aiding in urban development and resource allocation.

10. Healthcare: In medical research, clustering groups patients with similar symptoms or genetic
profiles, facilitating personalized medicine and treatment strategies.

11. Climate Science: Clustering helps in analyzing climate data, identifying patterns, and understanding
weather phenomena.

12. E-commerce: Clustering is used to categorize products and improve inventory management by
identifying product demand patterns.

Advantages of Cluster Analysis


 It can help identify patterns and relationships within a dataset that may not be immediately
obvious.
 It can be used for exploratory data analysis and can help with feature selection.
 It can be used to reduce the dimensionality of the data.
 It can be used for anomaly detection and outlier identification.
 It can be used for market segmentation and customer profiling.

Disadvantages of Cluster Analysis


 It can be sensitive to the choice of initial conditions and the number of clusters.
 It can be sensitive to the presence of noise or outliers in the data.
 It can be difficult to interpret the results of the analysis if the clusters are not well-defined.
 It can be computationally expensive for large datasets.
 The results of the analysis can be affected by the choice of clustering algorithm used.
 It is important to note that the success of cluster analysis depends on the data, the goals of the
analysis, and the ability of the analyst to interpret the results.
Major clustering methods
1. Partitioning clustering method
 K-means
 K-medoids
2. Hierarchical clustering method
 Agglomerative
 Divisive
3. Density based clustering method
 DBSCAN
 OPTICS
 DENCLUE
4. Grid based clustering method
 CLIQUE
 STING
 Wavelet transformation
5. Model based clustering method
 Statistical Approach
 Neural Network Approach (Artificial Intelligence)

1. Partitioning Clustering Method


Partitioning clustering methods are a class of clustering techniques that divide a dataset into a set of
distinct, non-overlapping subsets or clusters. Each data point is assigned to exactly one cluster, and the
goal is to partition the data such that points within the same cluster are more similar to each other than to
points in other clusters. These methods generally require the number of clusters (K) to be specified in
advance and aim to optimize a specific objective function, such as minimizing the sum of squared
distances within clusters. The commonly used partition-based clustering algorithms are:

K-Means Clustering
Definition: K-Means is a partitioning clustering algorithm that divides a dataset into K distinct, non-
overlapping subsets or clusters. Each cluster is represented by its centroid, which is the mean of the data
points within the cluster. The algorithm works iteratively to assign data points to clusters based on the
nearest centroid and then recalculates the centroids based on the new cluster assignments until
convergence.

Advantages:

 Simplicity: Easy to understand and implement.


 Efficiency: Computationally efficient for large datasets.
 Scalability: Performs well on large datasets.
 Speed: Converges relatively quickly.

Disadvantages:
 Requires K to be specified: The number of clusters (K) must be determined in advance.
 Sensitive to initial centroids: Poor initial choices can lead to suboptimal clustering (local
minima).
 Assumes spherical clusters: Assumes clusters are of similar size and shape, which may not be
true for all datasets.
 Sensitive to outliers: Outliers can significantly affect the results.
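
For illustration, a minimal usage sketch with scikit-learn's KMeans (this assumes scikit-learn is installed; the two-dimensional toy data below are purely hypothetical):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two loose groups of points (hypothetical values)
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])

# K must be chosen in advance; several restarts (n_init) reduce sensitivity to the initial centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # centroid (mean) of each cluster
print(kmeans.inertia_)          # within-cluster sum of squared distances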

K-Medoids Clustering
Definition: K-Medoids is a partitioning clustering algorithm similar to K-Means, but instead of using the mean
of the points as a cluster's centre, it uses a medoid: an actual data point whose total dissimilarity to the other
points in the cluster is minimal. Because the representative is a real data point rather than a computed mean, it is
less affected by outliers, making K-Medoids more robust for datasets with outliers or skewed distributions.

Advantages:

 Robustness to outliers: Less sensitive to outliers compared to K-Means.


 Flexibility: Works well for data with non-spherical clusters or uneven distributions.
 Reduced Influence of Extreme Values: Using a medoid minimizes the influence of extreme values
within each cluster.

Disadvantages:

 Requires K to be specified: The number of clusters (K) must be determined in advance.


 Computationally expensive: Finding the best medoid requires evaluating many candidate swaps, which is
more computationally intensive than computing a mean, especially for large or high-dimensional data.
 Sensitivity to initial medoids: Like K-Means, it can converge to local minima depending on
initial choices.
 Slower convergence: Typically slower convergence compared to K-Means due to the medoid
search.

K-Means is efficient and simple, but sensitive to outliers and initial conditions, and assumes spherical
clusters. K-Medoids is more robust to outliers and flexible with cluster shapes, but computationally more
intensive and slower to converge.

2. Hierarchical clustering method


Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. There
are two main types of hierarchical clustering: agglomerative and divisive.

Agglomerative Hierarchical Clustering


Definition: Agglomerative hierarchical clustering is a bottom-up approach. It starts with each data point
as a separate cluster and then iteratively merges the closest pairs of clusters until all data points are in a
single cluster or a certain stopping criterion is met.
Divisive Hierarchical Clustering
Definition: Divisive hierarchical clustering is a top-down approach. It starts with all data points in a
single cluster and then iteratively splits the clusters until each data point is in its own cluster or a certain
stopping criterion is met.

A dendrogram is a tree-like diagram that records the sequences of merges or splits in hierarchical
clustering. It visually represents the arrangement of the clusters produced by hierarchical clustering
algorithms. Each branch of the dendrogram represents a cluster, and the length of the branches indicates
the distance or dissimilarity between clusters.

It is a diagram that shows the hierarchical relationship between the objects. It is most commonly created
as an output from hierarchical clustering.

Linkage Criteria:
In hierarchical clustering, the linkage criteria determine how the distance between clusters is calculated
when merging them. The choice of linkage criterion affects the shape and composition of the clusters.
Here are the most common linkage criteria:

1. Single Linkage (Minimum Linkage): The distance between two clusters is the shortest distance
between two points in each cluster. Minimum value is used to find the linkage.
2. Complete Linkage (Maximum Linkage): The distance between two clusters is the longest
distance between two points in each cluster. Maximum value is used to find the linkage.
3. Average Linkage: The distance between two clusters is the average distance between each
point in one cluster and every point in the other cluster. This is also called the unweighted pair-group
method. The average value is used to find the linkage (a short SciPy sketch follows this list).
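
As a brief sketch (assuming SciPy is available), the linkage criterion is selected through the method argument of scipy.cluster.hierarchy.linkage; the five observations below are hypothetical:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical observations (5 points in 2-D)
X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8], [9.0, 0.5]])

# method='single' (minimum), 'complete' (maximum) or 'average' selects the linkage criterion
Z = linkage(X, method='single', metric='euclidean')

labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would plot the merge history (requires matplotlib)
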
3. Density based clustering method
Density-based clustering methods identify clusters based on the density of data points in the feature space.
These methods are particularly effective in identifying arbitrarily shaped clusters and are robust to noise
and outliers.

1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Definition: DBSCAN groups together points that are closely packed, marking points in low-density
regions as outliers. It defines clusters as areas of high point density separated by areas of low point
density.

Parameters (inputs):

 ε (epsilon): The maximum radius of the neighborhood around a point.
 minPts: The minimum number of points required to form a dense region.

Steps:

 Randomly select a point.
 Retrieve all points within ε distance from the selected point.
 If the number of points is greater than or equal to minPts, create a cluster.
 Expand the cluster by recursively including all density-reachable points.
 Mark points that do not belong to any cluster as noise (see the sketch after this list).
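
A minimal sketch using scikit-learn's DBSCAN (the data and the parameter values are hypothetical and would need tuning for real data):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should be flagged as noise (hypothetical data)
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.1], [4.9, 4.8],
              [9.0, 0.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)  # eps corresponds to ε, min_samples to minPts
print(db.labels_)  # cluster index per point; -1 marks noise/outliers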

Advantages:

 Arbitrarily Shaped Clusters: Can find clusters of any shape.


 Robust to Noise: Effectively identifies and handles noise and outliers.
 No Need to Specify the Number of Clusters: Unlike K-means, it does not require a
predetermined number of clusters.

Disadvantages:

 Parameter Sensitivity: The choice of ε and minPts significantly affects the results.
 Variable Density Clusters: Struggles with clusters of varying densities.

2. OPTICS (Ordering Points To Identify the Clustering Structure)

Definition:

OPTICS is an extension of DBSCAN that creates an ordering of the database, capturing the density-based
structure. It helps in identifying clusters with varying densities.

Parameters (inputs):

 ε: The maximum radius of the neighborhood (can be set to a high value).
 minPts: The minimum number of points to form a dense region.

Steps:

 For each point, calculate the core distance and reachability distance.
 Generate an ordering of points based on the smallest reachability distance.
 Use the reachability plot to identify clusters with varying densities (see the sketch below).
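
A short sketch with scikit-learn's OPTICS (hypothetical data; max_eps corresponds to ε and min_samples to minPts):

import numpy as np
from sklearn.cluster import OPTICS

# Hypothetical data containing regions of different density
X = np.array([[1.0, 1.0], [1.05, 1.0], [1.1, 0.95],
              [4.0, 4.0], [4.5, 4.4], [5.0, 5.2],
              [9.0, 9.0]])

opt = OPTICS(min_samples=3, max_eps=np.inf).fit(X)

print(opt.labels_)                       # cluster labels (-1 = noise)
print(opt.reachability_[opt.ordering_])  # the values behind the reachability plot
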
4. Grid based clustering method
Grid-based clustering methods partition the data space into a finite number of cells or grids and perform
clustering on these cells. These methods are generally efficient for large datasets and high-dimensional
data due to their ability to aggregate data points into grid cells.

1. CLIQUE (CLustering In QUEst)

Definition: CLIQUE is a grid-based clustering method that combines the concepts of grid-based and
density-based clustering. It partitions the data space into a grid and identifies dense clusters by examining
the distribution of data points in these grid cells.

 Grid Partitioning: Divide the data space into a grid with each cell covering a range of feature values.
 Density Calculation: Calculate the density of data points in each grid cell.
 Cluster Identification: Identify dense regions where the density of points is higher than a specified
threshold. Merge adjacent dense cells to form clusters.
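
CLIQUE itself is not part of the common Python libraries, but its first two steps can be sketched in two dimensions with a simple grid and density threshold (the data and the threshold are hypothetical; the full algorithm also searches subspaces and merges adjacent dense cells into clusters):

import numpy as np

# Hypothetical 2-D data
X = np.random.RandomState(0).rand(200, 2)

# Grid partitioning: divide each dimension into 5 intervals, giving 25 cells
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=5)

# Density calculation and dense-cell identification: keep cells holding more points than the threshold
threshold = 10
dense_cells = np.argwhere(counts > threshold)
print(dense_cells)  # (row, column) indices of dense grid cells; adjacent ones would be merged into clusters
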

Advantages:

 Scalability: Efficient for large datasets and high-dimensional data due to grid-based partitioning.
 No Assumptions about Shape: Can identify clusters of arbitrary shapes.
 Handling High Dimensions: Effective in high-dimensional spaces compared to methods like K-
means.

Disadvantages:

 Parameter Sensitivity: Performance is sensitive to the grid resolution and density threshold.
 Grid Granularity: The choice of grid size can impact the clustering results, requiring careful tuning.

2. STING (Statistical Information Grid)

Definition: STING is a grid-based clustering method that uses statistical information derived from grid
cells to perform clustering. It organizes data into a hierarchical grid structure and uses statistical measures
to identify clusters.

 Partition the data space into a hierarchical grid structure, where each level of the hierarchy
corresponds to different levels of granularity.
 Compute statistical measures (e.g., mean, variance) for the data points within each grid cell.
 Merge cells based on statistical similarity and density to form clusters.

Advantages:

 Efficiency: Provides efficient clustering by using grid-based data aggregation.


 Handling High Dimensions: Suitable for high-dimensional data due to hierarchical grid organization.

Disadvantages:
 Parameter Dependence: Requires careful tuning of parameters for grid size and statistical
thresholds.
 Scalability Issues: Performance can be affected by very large datasets due to hierarchical
structure management.

3. Wavelet Transformation-Based Clustering

Definition: Wavelet transformation-based clustering involves applying wavelet transforms to data to
capture features at different scales and then performing clustering based on these features. It is
particularly useful for analyzing and clustering data with varying frequencies or scales.

 Apply wavelet transforms to decompose the data into different frequency components.
 Extract features from the transformed data that represent different scales or frequency components.
 Perform clustering on the extracted features using traditional clustering algorithms or specialized
wavelet-based clustering techniques.
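
A rough sketch of this pipeline, assuming the PyWavelets (pywt) package is available; the signals, the wavelet and the number of clusters are hypothetical choices:

import numpy as np
import pywt
from sklearn.cluster import KMeans

# Hypothetical set of 1-D signals (one per row), e.g. sensor readings of length 64
rng = np.random.RandomState(0)
signals = rng.rand(20, 64)

# Decompose each signal with a wavelet transform and concatenate the coefficients as features
features = np.array([np.concatenate(pywt.wavedec(s, 'haar', level=3)) for s in signals])

# Cluster the extracted multi-scale features with a conventional algorithm
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)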

5. Model based clustering method


Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the
given model. A model-based algorithm may locate clusters by constructing a density function that reflects
the spatial distribution of the data points. It also leads to a way of automatically determining the number
of clusters based on standard statistics, taking noise and outliers into account, and thus yields robust
clustering methods. These methods attempt to optimize the fit between the given data and some mathematical
model, based on the assumption that the data are generated by a mixture of underlying probability distributions.

Two major approaches:


1. Statistical Approach
2. AI (Neural Network) Approach

Statistical Approach: Conceptual clustering is:


 A form of clustering in machine learning.
 Produces a classification scheme for a set of unlabeled objects
 Finds characteristic description for each group (concept/class)
 Usually adopts a statistical approach that uses probability measurements in determining the
concepts/clusters.

Advantages:

 Probabilistic Framework: Provides a probabilistic interpretation of clusters, which can be useful
for uncertainty quantification.
 Flexibility: Can model clusters of different shapes and sizes depending on the chosen
distribution.
 Soft Clustering: Allows data points to belong to multiple clusters with certain probabilities.

Disadvantages:
 Assumption of Distribution: Assumes a specific distribution (e.g., Gaussian), which might not
fit all types of data.
 Computational Complexity: Estimation can be computationally intensive, especially for large
datasets or high-dimensional data.
 Initialization Sensitivity: Results can be sensitive to the initial parameter estimates.
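
In practice, the most common statistical model-based approach is a Gaussian mixture model fitted with the EM algorithm; a minimal sketch with scikit-learn (the data are hypothetical):

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data drawn from two overlapping Gaussian groups
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1.0, (50, 2)), rng.normal(4, 1.5, (50, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)

print(gmm.predict(X))            # hard cluster assignments
print(gmm.predict_proba(X)[:3])  # soft clustering: probability of each point under each component
print(gmm.bic(X))                # information criteria such as BIC help choose the number of components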

Neural Network Approach (Artificial Intelligence)

Definition: The neural network approach to model-based clustering uses artificial neural networks to
learn representations of the data and perform clustering. These methods leverage the capabilities of neural
networks to capture complex patterns and relationships in the data.

Advantages:

 Flexibility: Can model complex, non-linear relationships and patterns in the data.
 Automatic Feature Learning: Neural networks can automatically learn relevant features from
the data, reducing the need for manual feature engineering.
 Scalability: Can be scaled to handle large datasets and high-dimensional data.

Disadvantages:

 Complexity: Neural networks can be complex to design and train, requiring careful tuning of
hyperparameters.
 Interpretability: The models can be less interpretable compared to statistical models, making it
harder to understand the clustering results.
 Computational Requirements: Training neural networks can be computationally intensive and
may require specialized hardware (e.g., GPUs).

Evaluation of Clustering
Evaluating clustering results is essential to understand the quality, effectiveness, and reliability of the
clusters produced by a clustering algorithm. Here are some common approaches for evaluation of
clustering in data mining:

1. Internal Evaluation Measures


 Davies-Bouldin Index: This index measures the compactness and separation between clusters. A
lower Davies-Bouldin Index indicates better clustering.
 Silhouette Score: This metric measures how similar an object is to its own cluster compared to other
clusters. A higher silhouette score indicates better-defined clusters.
 Inertia (Within-Cluster Sum of Squares): Inertia measures how far the points within a cluster are from
the centroid. It is minimised when clusters are tight and well-separated.

2. External Evaluation Measures


 Purity: Purity assesses the extent to which clusters contain a single class. It is the ratio of the number
of correctly classified instances to the total number of instances. Higher purity values indicate better
clustering.
 Fowlkes-Mallows Index: This index calculates the geometric mean of precision and recall. It is
useful for comparing clustering results to a known ground truth or reference clustering.
 Rand Index: The Rand Index measures the similarity between the true and predicted clusters,
considering both true positive and true negative instances.
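
A short sketch computing a few of these measures with scikit-learn (the data, the predicted clustering and the ground truth are hypothetical; rand_score requires a reasonably recent scikit-learn):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             rand_score, fowlkes_mallows_score)

# Hypothetical data with a known ground-truth grouping
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
true_labels = np.array([0] * 30 + [1] * 30)

pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal measures (no ground truth needed)
print(silhouette_score(X, pred_labels))      # higher is better
print(davies_bouldin_score(X, pred_labels))  # lower is better

# External measures (compare against the known grouping)
print(rand_score(true_labels, pred_labels))
print(fowlkes_mallows_score(true_labels, pred_labels))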

3. Visual Evaluation: Visualisation techniques, such as scatter plots, dendrograms, or heatmaps, can
help in assessing the quality of clustering by providing a visual representation of how data points are
grouped.
4. Cluster Stability: This involves assessing how stable the clusters are under perturbations or
variations in the dataset. Stability measures can include methods like bootstrapping or subsampling.
5. Cross-validation: Splitting the dataset into training and testing sets and evaluating the clustering
performance on different subsets can help assess the robustness and generalisability of the clustering
algorithm.
6. Statistical Significance: Conduct statistical tests to determine if the observed clustering results are
statistically significant.
7. Domain-specific Evaluation: In some cases, the effectiveness of clustering may depend on domain-
specific criteria or objectives. Evaluation metrics can be tailored to align with the specific goals of the
data mining task.
It's important to note that the choice of evaluation metric depends on the nature of the data, the
characteristics of the clusters, and the goals of the analysis. No single metric is universally applicable,
and multiple metrics should be considered to gain a comprehensive understanding of the clustering
performance. Additionally, the interpretation of results should be done in the context of the specific
application or problem being addressed.

Problems

1. K-means

Algorithm/ Steps to solve/working procedure


Step 1: The number of clusters k is arbitrarily chosen.
Step 2: k random points are selected from the data as the initial centroids (cluster means).
Step 3: Each of the remaining objects is assigned to the cluster to which it is the most
similar, based on the distance between the object and the cluster mean.
Step 4: The new mean is then recomputed for each cluster, i.e. the centroids of the newly formed clusters.
Step 5: Steps 3 and 4 are repeated until the criterion function converges. Stopping criteria for K-Means
clustering are:
 Centroids of newly formed clusters do not change (the mean values and cluster memberships are the same
in the previous and the current step).
 Points remain in the same cluster.
 Maximum number of iterations is reached (a from-scratch sketch of these steps follows this list).
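
A from-scratch sketch of these steps in NumPy (the data are a toy example; for real use a library implementation such as scikit-learn's KMeans is preferable):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Steps 1-2: k is given; pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):  # Step 5: repeat until convergence
        # Step 3: assign each point to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stopping criterion: centroids unchanged
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)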
2. K-medoids

Algorithm/Steps to solve/working procedure


Step 1: Selecting Initial Medoid Points
We start by selecting 'k' number of random data points from our dataset as our initial medoid points.
Step 2: Assigning Non-Medoid Points to The Nearest Medoid
Next, we assign each non-medoid point in the dataset to its nearest medoid based on some distance metric
(e.g., Euclidean or Manhattan distance).
Step 3: Calculating Total Cost
After assigning all non-medoid points to their respective medoids, we calculate the total cost function for
each cluster using some distance measure (e.g., summing up the distances between all non-medoid points and the
medoid they are assigned to). This cost function represents how well the data points assigned to the same medoid
are clustered together.
Step 4: Swapping Non-Medoid Points with the Currently Selected Medoids
Now comes an iterative process where we try swapping one non-medoid point at a time with one currently
selected medoid at a time until there is no further decrease in the cost function, i.e. when the overall cost
function is minimized. This process helps us find better medoids within each cluster, improving our
clustering accuracy.
Step 5: Repeat Until Convergence
Finally, we repeat steps 2-4 until convergence i.e., when there's no further improvement in the cost
function or changes in clusters occur.
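
A simplified from-scratch sketch of this swap-based (PAM-style) procedure on toy data; production code would normally use an optimized library implementation:

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distance matrix
    medoids = list(rng.choice(n, k, replace=False))            # Step 1: random initial medoids

    def total_cost(meds):
        # Steps 2-3: assign every point to its nearest medoid and sum those distances
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):  # Step 5: repeat until no improving swap exists
        improved = False
        for i in range(k):  # Step 4: try swapping each medoid with each non-medoid point
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = candidate
                trial_cost = total_cost(trial)
                if trial_cost < cost:  # keep the swap only if the total cost decreases
                    medoids, cost, improved = trial, trial_cost, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return labels, X[medoids]

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, medoid_points = k_medoids(X, k=2)
print(labels)
print(medoid_points)  # each medoid is an actual data point from its cluster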
3. Agglomerative

Algorithm/ Steps to solve/working procedure


The steps for agglomerative clustering are as follows:

1. Initially, each data point is a separate cluster.
2. Compute the proximity matrix using a distance metric.
3. Use a linkage function to group objects into a hierarchical cluster tree based on the computed
distance matrix from the above step.
4. Data points (or clusters) with the closest proximity (minimum distance) are merged together to form a cluster.
5. Repeat steps 2 through 4 until a single cluster remains.

Example problem
Apply agglomerative hierarchical clustering to the given proximity matrix for the five objects A, B,
C, D and E using the minimum-distance (single linkage) method; also construct the dendrogram.

ITEM A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
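
This example can be checked with SciPy (assuming SciPy is available): the given proximity matrix is converted to condensed form, and single linkage (minimum distance) reproduces the merge order needed for the dendrogram:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

objects = ['A', 'B', 'C', 'D', 'E']
prox = np.array([[0, 1, 2, 2, 3],
                 [1, 0, 2, 4, 3],
                 [2, 2, 0, 1, 5],
                 [2, 4, 1, 0, 3],
                 [3, 3, 5, 3, 0]], dtype=float)

Z = linkage(squareform(prox), method='single')  # single linkage = minimum distance
print(Z)  # each row lists the two clusters merged (indexed in the order of `objects`) and the merge distance
# scipy.cluster.hierarchy.dendrogram(Z, labels=objects) would draw the tree (requires matplotlib)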
4. Divisive

Algorithm/ Steps to solve/working procedure


The algorithm for divisive hierarchical clustering involves several steps.

Step 1: Consider all objects as part of one big cluster.

Step 2: Split the big cluster into smaller clusters, i.e. 'n' clusters, by computing the minimum spanning
tree (MST) for the given proximity matrix.
Step 3: Create a new cluster by breaking the link corresponding to the largest distance (smallest
similarity).
Step 4: Repeat steps 2 and 3 until only singleton clusters remain.

Example problem
Apply divisive hierarchical clustering to the given proximity matrix for the five objects A, B, C, D and E
using a minimum spanning tree, and show the separate clusters.

ITEM A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0

Solution:

Step 1:

Edges A-B A-C A-D A-E B-C B-D B-E C-D C-E D-E

weights 1 2 2 3 2 4 3 1 5 3

Sort in ascending order:

Edges A-B C-D A-C A-D B-C A-E B-E D-E B-D C-E

weights 1 1 2 2 2 3 3 3 4 5

Construct the MST: the minimum spanning tree contains the edges A-B (1), C-D (1), A-C (2) and A-E (3), with total cost 7.
Step 2:

1. The largest weight in the spanning tree is on edge A-E (weight 3), so we break the link between
A and E, leaving the clusters {A, B, C, D} and {E}.

2. The next largest weight is on edge A-C (weight 2), so we break the link between A and C, leaving
the clusters {A, B}, {C, D} and {E}.

3. The next edge is A-B (weight 1); breaking it separates A and B.
4. The next edge is C-D (weight 1); breaking it separates C and D.

We get ‘n’ clusters each with only one object ∴ Termination condition holds (singleton clusters).
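
The same result can be reproduced programmatically (assuming SciPy is available): build the minimum spanning tree from the proximity matrix, then repeatedly remove the heaviest remaining edge and read off the connected components as clusters. Ties between equal-weight edges may be broken differently than in the hand-worked solution above:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

objects = ['A', 'B', 'C', 'D', 'E']
prox = np.array([[0, 1, 2, 2, 3],
                 [1, 0, 2, 4, 3],
                 [2, 2, 0, 1, 5],
                 [2, 4, 1, 0, 3],
                 [3, 3, 5, 3, 0]], dtype=float)

mst = minimum_spanning_tree(prox).toarray()  # weighted MST edges, e.g. A-B, C-D, A-C and A-E

while mst.any():
    # Break the link with the largest remaining weight (smallest similarity)
    i, j = np.unravel_index(mst.argmax(), mst.shape)
    mst[i, j] = 0
    n_comp, comp = connected_components(mst, directed=False)
    clusters = [[objects[p] for p in range(len(objects)) if comp[p] == c] for c in range(n_comp)]
    print('removed', objects[i] + '-' + objects[j], ':', clusters)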

**********************************END**********************************************
