DWM PT 2 QB Soln
What is clustering?
Definition:
● Clustering is a machine learning technique that groups similar data points
into clusters based on their shared characteristics, as measured by the
similarity or dissimilarity between points. The goal of clustering is to
identify patterns, structures, or relationships within the data.
Types of Clustering:
1. K-means Clustering:
● One of the most common clustering algorithms.
● Requires the number of clusters (k) to be chosen in advance; each data point
is assigned to the nearest cluster center.
● Iteratively recomputes the cluster centers until they converge.
● Simple and efficient, but sensitive to the choice of k and to the initial
centers.
2. Hierarchical Clustering:
● Creates a hierarchy of clusters, either bottom-up (agglomerative) or top-down
(divisive).
● Agglomerative clustering starts with individual data points as clusters and
merges them based on similarity.
● Divisive clustering starts with a single cluster containing all data points and
splits it into smaller clusters.
● Doesn't require the specification of the number of clusters.
● Can be computationally expensive for large datasets.
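The agglomerative (bottom-up) procedure above can be sketched in pure Python using single-linkage merging; the points, the target cluster count, and the linkage choice below are illustrative assumptions, not from these notes:

```python
import math

def agglomerative(points, target_k):
    """Bottom-up (agglomerative) clustering with single linkage:
    start with one cluster per point, repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Single-linkage distance between two clusters: the distance
        # between their closest pair of members.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters

# Two obvious groups near (0, 0) and (5, 5)
pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
result = agglomerative(pts, target_k=2)
```

Stopping at a target number of clusters is one way to "cut" the hierarchy; running the loop to a single cluster instead records the full merge tree (dendrogram).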
3. Density-based Clustering:
● Identifies clusters based on dense regions of data points.
● Algorithms like DBSCAN and OPTICS are commonly used.
● DBSCAN defines clusters as areas with a minimum number of data points
within a specified radius.
● OPTICS orders data points by reachability distance, from which clusters of
varying density can be extracted.
● Effective for handling noise and outliers.
4. Model-based Clustering:
● Assumes a probabilistic model for the data and fits the model to the data.
● Gaussian Mixture Models (GMMs) are a popular example.
● GMMs assume that the data is generated from a mixture of Gaussian
distributions.
● Provides probabilistic membership for each data point.
● Can be more flexible than other methods but can be computationally
expensive.
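As a sketch of how model-based clustering fits a mixture, here is a minimal EM loop for a two-component one-dimensional Gaussian mixture; the data, initial means, and iteration count are illustrative assumptions:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(data, mu1, mu2, iters=25):
    """EM for a two-component 1-D Gaussian mixture (equal starting weights)."""
    pi1, pi2 = 0.5, 0.5
    s1 = s2 = 1.0
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        # (this is the "probabilistic membership" the notes mention).
        r = []
        for x in data:
            a = pi1 * normal_pdf(x, mu1, s1)
            b = pi2 * normal_pdf(x, mu2, s2)
            r.append(a / (a + b))
        # M-step: re-estimate weights, means, and standard deviations.
        n1 = sum(r)
        n2 = len(data) - n1
        pi1, pi2 = n1 / len(data), n2 / len(data)
        mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
        s1 = math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / n1) or 1e-6
        s2 = math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / n2) or 1e-6
    return mu1, mu2

# Two clumps of points, one near 1.0 and one near 5.0
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mu1, mu2 = em_gmm_1d(data, mu1=0.0, mu2=6.0)
```

The estimated means converge toward the centers of the two clumps; each point's responsibility value is its soft (probabilistic) cluster membership.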
Category/Type:
● K-means falls under the category of partitioning clustering algorithms. This
means it partitions the data into non-overlapping subsets.
Example:
● Consider a dataset of customer information with attributes such as age,
income, and purchase frequency. K-means clustering can be used to group
customers into segments based on their similarities. For example, one cluster
might represent young, high-income customers who frequently purchase
luxury items, while another cluster might represent older, low-income
customers who primarily buy groceries.
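The segmentation idea above can be sketched in pure Python; the (age, purchase frequency) points, the choice of k, and the naive initialization are illustrative assumptions:

```python
import math

def kmeans(points, k, iters=100):
    """Minimal k-means: assign points to the nearest center, recompute centers."""
    centers = list(points[:k])  # naive initialization: first k points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # convergence: centers stopped moving
            break
        centers = new_centers
    return centers, clusters

# Hypothetical customers as (age, purchases per month)
pts = [(25, 80), (27, 75), (24, 82), (60, 20), (62, 25), (58, 22)]
centers, clusters = kmeans(pts, k=2)
```

On this toy data the algorithm separates the young frequent purchasers from the older infrequent ones; in practice the features would be scaled first, since k-means is distance-based.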
DBSCAN
Definition:
● DBSCAN is a density-based clustering algorithm that groups data points
together based on their density. It identifies clusters as dense regions of data
points separated by low-density regions.
Category/Type:
● DBSCAN falls under the category of density-based clustering algorithms.
Working:
1. Choose parameters:
Epsilon (ε): Radius of the neighborhood.
MinPts: Minimum number of points required to form a cluster.
2. Scan the dataset:
For each data point:
Find all points within ε distance.
If at least MinPts points are found, the point is a core point. Otherwise, it is
a border point (if it lies within ε of a core point) or noise.
3. Form clusters:
Starting from a core point, recursively find all points that are directly or indirectly
connected to it within ε distance.
A cluster is formed by all points connected to a core point.
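The three steps above can be sketched in pure Python; here the neighborhood count includes the point itself (one common convention), and the eps/min_pts values are illustrative:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: label each point with a cluster id, or -1 for noise."""
    def neighbors(i):
        # Step 2: all points within eps of points[i] (the point itself counts).
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # not a core point: noise (may become a border point later)
            continue
        # Step 3: grow a cluster from this core point by expanding neighborhoods.
        labels[i] = cluster_id
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reclassified as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_nbrs)
        cluster_id += 1
    return labels

# The example points from these notes, with illustrative parameters:
pts = [(2, 3), (4, 5), (6, 7), (8, 9), (1, 2), (10, 11)]
labels = dbscan(pts, eps=2.0, min_pts=2)
```

With these parameters only (1, 2) and (2, 3) end up in a cluster; every other point has no neighbor within eps and is labeled -1 (noise).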
Diagram: (a typical DBSCAN figure shows core points inside dense
ε-neighborhoods, border points on cluster edges, and isolated noise points; the
figure is not reproduced here.)
Example:
Consider a dataset of two-dimensional points:
(2, 3), (4, 5), (6, 7), (8, 9), (1, 2), (10, 11)
With ε = 2 and MinPts = 2, the points (1, 2) and (2, 3) fall within each other's
ε-neighborhoods and form one cluster, while the remaining points have no
neighbors within ε and are labeled noise.
Advantages of DBSCAN:
● Handles arbitrary shapes of clusters.
● Can handle noise and outliers effectively.
● Does not require specifying the number of clusters.
Disadvantages of DBSCAN:
● Sensitive to the choice of ε and MinPts.
● Struggles when clusters have widely varying densities.
● Distance-based neighborhoods become less meaningful for high-dimensional
data.
Applications:
● Customer segmentation: Grouping customers based on purchase behavior or
demographics.
● Image segmentation: Dividing an image into different regions based on color
or texture.
● Anomaly detection: Identifying unusual data points that deviate from the
norm.
● Social network analysis: Analyzing communities and groups within social
networks.
● Spatial data analysis: Identifying clusters of geographic features.