Clustering Notes
Clustering
Definition of clustering: Clustering is a machine learning technique that groups a set of objects so that objects in the same group (or cluster) are more similar to each other than to those in other groups. The similarity can be based on various metrics, such as distance or density. Clustering is used in a variety of applications, including data mining, pattern recognition, image analysis, and bioinformatics, to discover structure in data without prior knowledge of the group definitions (unsupervised learning).
Requirements of clustering:
2. High Dimensionality: The algorithm should be able to handle high-dimensional data as well as datasets of small size.
3. Algorithm Usability with Multiple Data Kinds: Clustering algorithms should work with different kinds of data. They should be capable of dealing with different data types, such as discrete, categorical, interval-based, and binary data.
4. Dealing with Unstructured Data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may produce poor-quality clusters. The algorithm should therefore be able to handle unstructured data and give it structure by organizing it into groups of similar data objects, making it easier for the data expert to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable. Interpretability reflects how easily the results can be understood.
Applications of clustering:
Clustering has a wide range of applications across various fields. Here are some key applications:
1. Customer Segmentation: Businesses use clustering to group customers with similar behaviors and
characteristics for targeted marketing strategies.
2. Market Research: Clustering helps in identifying distinct market segments and understanding
consumer needs and preferences.
3. Image Segmentation: In computer vision, clustering is used to partition images into regions with
similar pixels, aiding in object detection and image recognition.
4. Anomaly Detection: Clustering can identify outliers or anomalies in data, which is useful in fraud
detection, network security, and fault detection in systems.
5. Document Clustering: Used in text mining and information retrieval to group similar documents,
improving search engines and recommendation systems.
6. Genomics: Clustering is used to group genes or proteins with similar expression patterns, aiding in the
understanding of biological functions and disease mechanisms.
7. Social Network Analysis: Clustering helps in identifying communities or groups within social
networks, understanding social structures, and analyzing user behavior.
8. Recommender Systems: Clustering techniques group similar users or items to provide personalized
recommendations in platforms like Netflix or Amazon.
9. Urban Planning: Clustering can analyze geographical data to identify areas with similar
characteristics, aiding in urban development and resource allocation.
10. Healthcare: In medical research, clustering groups patients with similar symptoms or genetic
profiles, facilitating personalized medicine and treatment strategies.
11. Climate Science: Clustering helps in analyzing climate data, identifying patterns, and understanding
weather phenomena.
12. E-commerce: Clustering is used to categorize products and improve inventory management by
identifying product demand patterns.
K-Means Clustering
Definition: K-Means is a partitioning clustering algorithm that divides a dataset into K distinct, non-
overlapping subsets or clusters. Each cluster is represented by its centroid, which is the mean of the data
points within the cluster. The algorithm works iteratively to assign data points to clusters based on the
nearest centroid and then recalculates the centroids based on the new cluster assignments until
convergence.
Advantages:
Simple and easy to implement: The algorithm is straightforward and widely used.
Efficient: It is computationally fast and scales well to large datasets.
Disadvantages:
Requires K to be specified: The number of clusters (K) must be determined in advance.
Sensitive to initial centroids: Poor initial choices can lead to suboptimal clustering (local
minima).
Assumes spherical clusters: Assumes clusters are of similar size and shape, which may not be
true for all datasets.
Sensitive to outliers: Outliers can significantly affect the results.
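The iterative assign-and-update loop described above can be illustrated with a small NumPy sketch. This is only an illustration under assumed data and a hypothetical K; in practice a library implementation such as scikit-learn's KMeans would normally be used.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick K initial centroids at random from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # convergence: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of 2-D points (illustrative data only).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(X, k=2)
print(labels)      # cluster index of each point
print(centroids)   # final cluster centres
```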
K-Medians Clustering
Definition: K-Median is a partitioning clustering algorithm similar to K-Means, but instead of using the
mean to calculate the centroid of a cluster, it uses the median. The median is less affected by outliers,
making K-Median more robust for datasets with outliers or skewed distributions.
Advantages:
Robust to outliers: Using the median makes cluster centres less affected by outliers and skewed distributions.
Flexible with cluster shapes: Less tied to the spherical-cluster assumption than K-Means.
Disadvantages:
Computationally more intensive: Computing medians is more expensive than computing means, and convergence is slower.
K-Means is efficient and simple, but sensitive to outliers and initial conditions, and assumes spherical
clusters. K-Median is more robust to outliers and flexible with cluster shapes, but computationally more
intensive and slower to converge.
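Under the same assumptions as the K-Means sketch above, the only change needed for a K-Median style update is to replace the mean with the component-wise median in the update step; the helper below is a hypothetical drop-in replacement for that step.

```python
import numpy as np

def k_medians_update(X, labels, centroids, k):
    # K-Median style update step: the centre of each cluster is the
    # component-wise median of its points (an empty cluster keeps its centre).
    return np.array([
        np.median(X[labels == j], axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])
```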
A dendrogram is a tree-like diagram that records the sequence of merges or splits in hierarchical clustering; it is most commonly produced as the output of a hierarchical clustering algorithm. It visually represents the arrangement of the clusters and the hierarchical relationship between the objects: each branch of the dendrogram represents a cluster, and the length of the branches indicates the distance or dissimilarity between clusters.
Linkage Criteria:
In hierarchical clustering, the linkage criteria determine how the distance between clusters is calculated
when merging them. The choice of linkage criterion affects the shape and composition of the clusters.
Here are the most common linkage criteria:
1. Single Linkage (Minimum Linkage): The distance between two clusters is the shortest distance between any pair of points, one from each cluster. The minimum value is used as the linkage.
2. Complete Linkage (Maximum Linkage): The distance between two clusters is the longest distance between any pair of points, one from each cluster. The maximum value is used as the linkage.
3. Average Linkage: The distance between two clusters is the average distance from each point in one cluster to every point in the other cluster. This is also known as the unweighted pair group method. The average value is used as the linkage.
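The three linkage criteria can be compared in practice with a short sketch using SciPy and Matplotlib (both assumed installed); the five 2-D points and their labels are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five hypothetical 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9], [4.0, 4.5]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)              # recorded sequence of merges
    plt.figure()
    dendrogram(Z, labels=["A", "B", "C", "D", "E"])
    plt.title(f"{method} linkage")
plt.show()
```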
3. Density-based clustering methods
Density-based clustering methods identify clusters based on the density of data points in the feature space.
These methods are particularly effective in identifying arbitrarily shaped clusters and are robust to noise
and outliers.
Definition: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed, marking points in low-density regions as outliers. It defines clusters as areas of high point density separated by areas of low point density.
Parameters (inputs):
ε (eps): The radius of the neighbourhood considered around each point.
minPts: The minimum number of points required within an ε-neighbourhood for a point to count as a core (dense) point.
Advantages:
No need to specify the number of clusters in advance.
Arbitrary cluster shapes: Can discover clusters of arbitrary shape.
Robust to noise: Points in low-density regions are marked as outliers.
Disadvantages:
Parameter Sensitivity: The choice of ε and minPts significantly affects the results.
Variable Density Clusters: Struggles with clusters of varying densities.
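A minimal sketch of DBSCAN using scikit-learn (assumed installed); the eps and min_samples arguments correspond to ε and minPts above, and the data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus one isolated point (illustrative data only).
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
               rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
               [[10.0, 10.0]]])

# eps plays the role of ε and min_samples the role of minPts.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(labels)    # -1 marks points in low-density regions (noise/outliers)
```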
Definition: OPTICS (Ordering Points To Identify the Clustering Structure) is an extension of DBSCAN that creates an ordering of the database, capturing its density-based structure. It helps in identifying clusters with varying densities.
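A brief sketch using scikit-learn's OPTICS implementation (assumed available); the data and parameter values are placeholders.

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.default_rng(0).random((200, 2))    # placeholder data

model = OPTICS(min_samples=5).fit(X)
print(model.labels_[:10])                         # cluster labels (-1 = noise)
print(model.reachability_[model.ordering_][:10])  # values behind the reachability plot
```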
Definition: CLIQUE is a grid-based clustering method that combines the concepts of grid-based and
density-based clustering. It partitions the data space into a grid and identifies dense clusters by examining
the distribution of data points in these grid cells.
Grid Partitioning: Divide the data space into a grid with each cell covering a range of feature values.
Density Calculation: Calculate the density of data points in each grid cell.
Cluster Identification: Identify dense regions where the density of points is higher than a specified
threshold. Merge adjacent dense cells to form clusters.
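The grid-partitioning, density-calculation, and merging steps can be illustrated with a simplified Python sketch. This is not the full CLIQUE algorithm (which also searches subspaces); the function name, cell size, and density threshold are hypothetical choices for illustration.

```python
import numpy as np
from collections import defaultdict

def grid_density_clusters(X, cell_size=1.0, density_threshold=3):
    # Grid partitioning: map each point to the grid cell it falls in.
    cells = defaultdict(list)
    for i, p in enumerate(X):
        cells[tuple((p // cell_size).astype(int))].append(i)
    # Density calculation: keep only cells with enough points.
    dense = {c for c, pts in cells.items() if len(pts) >= density_threshold}
    # Cluster identification: merge adjacent dense cells with a flood fill.
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        stack, members = [cell], []
        while stack:
            c = stack.pop()
            if c in seen:
                continue
            seen.add(c)
            members.extend(cells[c])
            # Neighbours differ by at most 1 in each grid coordinate.
            stack.extend(n for n in dense
                         if n != c and all(abs(a - b) <= 1 for a, b in zip(n, c)))
        clusters.append(members)
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 0.3, size=(30, 2)), rng.normal(6.0, 0.3, size=(30, 2))])
print(grid_density_clusters(X, cell_size=1.0, density_threshold=3))
```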
Advantages:
Scalability: Efficient for large datasets and high-dimensional data due to grid-based partitioning.
No Assumptions about Shape: Can identify clusters of arbitrary shapes.
Handling High Dimensions: Effective in high-dimensional spaces compared to methods like K-
means.
Disadvantages:
Parameter Sensitivity: Performance is sensitive to the grid resolution and density threshold.
Grid Granularity: The choice of grid size can impact the clustering results, requiring careful tuning.
Definition: STING (Statistical Information Grid) is a grid-based clustering method that uses statistical information derived from grid cells to perform clustering. It organizes data into a hierarchical grid structure and uses statistical measures to identify clusters.
Partition the data space into a hierarchical grid structure, where each level of the hierarchy
corresponds to different levels of granularity.
Compute statistical measures (e.g., mean, variance) for the data points within each grid cell.
Merge cells based on statistical similarity and density to form clusters.
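A small, hypothetical sketch of the first two steps: summarising grid cells at two levels of granularity with simple statistics (NumPy assumed available; the cell sizes and data are placeholders). Merging cells by statistical similarity would then follow as described above.

```python
import numpy as np

def cell_statistics(X, cell_size):
    # Compute count, mean and variance for the points in each grid cell.
    stats = {}
    keys = (X // cell_size).astype(int)
    for key in {tuple(k) for k in keys}:
        pts = X[np.all(keys == key, axis=1)]
        stats[key] = {"count": len(pts),
                      "mean": pts.mean(axis=0),
                      "var": pts.var(axis=0)}
    return stats

X = np.random.default_rng(0).random((100, 2)) * 10   # placeholder 2-D data
coarse = cell_statistics(X, cell_size=5.0)            # top (coarse) level of the hierarchy
fine = cell_statistics(X, cell_size=2.5)              # lower (finer) level
```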
Advantages:
Efficiency: The statistical summaries are computed once and stored in the grid cells, so clustering and queries can be answered quickly without rescanning the data.
Disadvantages:
Parameter Dependence: Requires careful tuning of parameters for grid size and statistical
thresholds.
Scalability Issues: Performance can be affected by very large datasets due to hierarchical
structure management.
Wavelet transform based clustering:
Apply wavelet transforms to decompose the data into different frequency components.
Extract features from the transformed data that represent different scales or frequency components.
Perform clustering on the extracted features using traditional clustering algorithms or specialized wavelet-based clustering techniques.
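A simplified sketch of these three steps, assuming the PyWavelets (pywt), NumPy, and scikit-learn packages are available; the data, wavelet, and number of clusters are placeholders rather than part of any specific published method.

```python
import numpy as np
import pywt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((50, 8))      # 50 points, 8 features (placeholder)

# Decompose each feature vector into low- and high-frequency components.
cA, cD = pywt.dwt(X, "haar", axis=1)
# Cluster on the approximation (low-frequency) coefficients as smoothed features.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(cA)
print(labels)
```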
Definition: The statistical approach to model-based clustering assumes that the data are generated by a mixture of underlying probability distributions (for example, Gaussian components whose parameters are estimated iteratively) and assigns each point to the component that best explains it.
Advantages:
Soft assignments: Provides probabilistic cluster memberships within a principled statistical framework.
Disadvantages:
Assumption of Distribution: Assumes a specific distribution (e.g., Gaussian), which might not
fit all types of data.
Computational Complexity: Estimation can be computationally intensive, especially for large
datasets or high-dimensional data.
Initialization Sensitivity: Results can be sensitive to the initial parameter estimates.
Definition: The neural network approach to model-based clustering uses artificial neural networks to
learn representations of the data and perform clustering. These methods leverage the capabilities of neural
networks to capture complex patterns and relationships in the data.
Advantages:
Flexibility: Can model complex, non-linear relationships and patterns in the data.
Automatic Feature Learning: Neural networks can automatically learn relevant features from
the data, reducing the need for manual feature engineering.
Scalability: Can be scaled to handle large datasets and high-dimensional data.
Disadvantages:
Complexity: Neural networks can be complex to design and train, requiring careful tuning of
hyperparameters.
Interpretability: The models can be less interpretable compared to statistical models, making it
harder to understand the clustering results.
Computational Requirements: Training neural networks can be computationally intensive and
may require specialized hardware (e.g., GPUs).
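One common way to realise this approach is to train a small autoencoder and then cluster the learned representations. The sketch below assumes PyTorch and scikit-learn are installed; the architecture, data, and number of clusters are hypothetical.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

X = torch.randn(200, 10)                     # placeholder data: 200 points, 10 features

# Small autoencoder: compress to a 2-D code, then reconstruct.
encoder = nn.Sequential(nn.Linear(10, 4), nn.ReLU(), nn.Linear(4, 2))
decoder = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 10))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
loss_fn = nn.MSELoss()

for epoch in range(200):                     # unsupervised: learn to reconstruct the input
    opt.zero_grad()
    loss = loss_fn(decoder(encoder(X)), X)
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = encoder(X).numpy()               # learned low-dimensional representation
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(codes)
print(labels[:20])
```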
Evaluation of Clustering
Evaluating clustering results is essential to understand the quality, effectiveness, and reliability of the
clusters produced by a clustering algorithm. Here are some common approaches for evaluation of
clustering in data mining:
3. Visual Evaluation: Visualisation techniques, such as scatter plots, dendrograms, or heatmaps, can
help in assessing the quality of clustering by providing a visual representation of how data points are
grouped.
4. Cluster Stability: This involves assessing how stable the clusters are under perturbations or
variations in the dataset. Stability measures can include methods like bootstrapping or subsampling.
5. Cross-validation: Splitting the dataset into training and testing sets and evaluating the clustering
performance on different subsets can help assess the robustness and generalisability of the clustering
algorithm.
6. Statistical Significance: Conduct statistical tests to determine if the observed clustering results are
statistically significant.
7. Domain-specific Evaluation: In some cases, the effectiveness of clustering may depend on domain-
specific criteria or objectives. Evaluation metrics can be tailored to align with the specific goals of the
data mining task.
It's important to note that the choice of evaluation metric depends on the nature of the data, the
characteristics of the clusters, and the goals of the analysis. No single metric is universally applicable,
and multiple metrics should be considered to gain a comprehensive understanding of the clustering
performance. Additionally, the interpretation of results should be done in the context of the specific
application or problem being addressed.
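As one concrete example of an internal quality metric, the sketch below computes the silhouette score for several values of K using scikit-learn (assumed available); higher scores indicate more cohesive, better-separated clusters. The data are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).random((200, 2))   # placeholder data

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```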
Problems
1. K-means
Example problem
Apply agglomerative hierarchical clustering to the given proximity matrix for the five objects A, B,
C, D and E using the minimum-distance (single-linkage) method, and construct the dendrogram.
ITEM A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
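This problem can be checked with SciPy's single-linkage (minimum-distance) agglomerative clustering on the given proximity matrix, assuming SciPy and Matplotlib are installed.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# The proximity matrix from the problem statement.
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

Z = linkage(squareform(D), method="single")   # single linkage = minimum-distance method
dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.show()
```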
4. Divisive
Example problem
Apply divisive hierarchical clustering to the given proximity matrix for the five objects A, B, C, D and E
using a minimum spanning tree, and show the separate clusters.
ITEM A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Solution:
Step 1: List all edges with their weights, then sort them in increasing order of weight.
Edges:    A-B  A-C  A-D  A-E  B-C  B-D  B-E  C-D  C-E  D-E
Weights:   1    2    2    3    2    4    3    1    5    3
Sorted:
Edges:    A-B  C-D  A-C  A-D  B-C  A-E  B-E  D-E  B-D  C-E
Weights:   1    1    2    2    2    3    3    3    4    5
Construct the MST: applying Kruskal's algorithm to the sorted edge list gives the minimum spanning tree with edges A-B (1), C-D (1), A-C (2) and A-E (3); total cost = 1 + 1 + 2 + 3 = 7.
Step 2:
1. The largest-weight edge in the MST is A-E (weight 3), so we break the link between A and E. Clusters: {A, B, C, D} and {E}.
2. The next largest-weight edge is A-C (weight 2), so we break the link between A and C. Clusters: {A, B}, {C, D} and {E}.
3. The next edge is A-B (weight 1), so we break the link between A and B. Clusters: {A}, {B}, {C, D} and {E}.
4. Finally, we break the remaining edge C-D (weight 1). Clusters: {A}, {B}, {C}, {D} and {E}.
We get ‘n’ clusters each with only one object ∴ Termination condition holds (singleton clusters).
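The divisive procedure above can be reproduced programmatically (SciPy assumed available): build the minimum spanning tree of the proximity matrix, then repeatedly remove the heaviest remaining edge and list the resulting connected components. The two edges of weight 1 may be cut in either order.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

names = ["A", "B", "C", "D", "E"]
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]])

mst = minimum_spanning_tree(csr_matrix(D)).toarray()            # weighted MST edges
edges = [(mst[i, j], i, j) for i in range(5) for j in range(5) if mst[i, j] > 0]

for w, i, j in sorted(edges, reverse=True):                     # cut heaviest edge first
    mst[i, j] = 0
    n, comp = connected_components(csr_matrix(mst), directed=False)
    clusters = [[names[k] for k in range(5) if comp[k] == c] for c in range(n)]
    print(f"cut edge {names[i]}-{names[j]} (weight {int(w)}): {clusters}")
```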
**********************************END**********************************************