Unit-4th Question-Bank Solution
Que-1 Explain K-Means algorithms. When would you use k means? 2022-23 2
State whether the statement “K-Means has an assumption each
cluster has roughly equal number of observations” is true or false.
Justify your answer
Que-2 Brief about the working of CLIQUE algorithm. 2022-23 2
Que-5 What are the advantages of the PCY algorithm over the Apriori 2022-23 10
Algorithm?
Que-6 Write short notes on Market Based modeling explaining market 2022-23 10
basket working rules and types of market basket analysis?
Que-7 2021-22 2
Find the entire Association rule from the above given Transaction
with Given minimum support = 50%, minimum confidence= 50
%. Using Apriori algorithm.
Que-8 How does the K-means algorithm work? Write k-means 2021-22 2
algorithm for partitioning.
Que-9 Cluster the following eight points (with (x, y) representing 2021-22 10
locations) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5,
8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9) .
Initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The
distance function between two points a = (x1, y1) and b = (x2,
y2) is defined as- Ρ (a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the
second iteration
Que-10 Explain SON algorithm to find all or most frequent itemsets 2021-22 10
using at most two passes.
Que-11 2021-22 10
Transaction ID Items
T1 {E,K,M,N,O,Y}
T2 {D,E,K,N,O,Y}
T3 {A,E,K,M}
T4 {C,K,M,U,Y}
T5 {C,E,I,K,O,O}
Que-13 What are the applications of frequent itemset analysis? Explain a 2020-21 2
simple and randomized algorithm to find most frequent itemsets
using at most two passes.
Que-14 Explain Toivonen’s algorithm? 2020-21 2
Que-16 How can you improve the efficiency of Apriori based mining 2020-21 10
Que-20
Explain clustering in Non-Euclidean Spaces and clustering for
streams parallelism?
SOLUTION
Ques.1 Explain the K-Means algorithm. When would you use K-Means? State whether the statement “K-Means has an assumption each cluster has roughly equal number of observations” is true or false. Justify your answer.
K-Means is an unsupervised machine learning algorithm used for clustering data into k distinct groups by repeatedly assigning points to the nearest centroid and recomputing the centroids (the detailed steps are given under Que.8). K-Means is appropriate when the data is numeric, the number of clusters k is known or can be estimated, and the clusters are expected to be compact, roughly spherical, and of similar scale; typical uses include customer segmentation, document grouping, and image compression.
Statement Analysis:
"K-Means has an assumption each cluster has roughly equal number of observations"
Answer: False
Justification:
● K-Means does not assume that each cluster must have an equal number of observations.
Instead, it minimizes the within-cluster variance (intra-cluster distance) without
considering the size of the clusters.
● However, K-Means tends to form clusters of similar size due to its reliance on
minimizing variance. This means it may struggle with data where clusters have very
different sizes or densities.
● Example: If one cluster is significantly denser or larger, K-Means may split it into
multiple clusters or merge smaller clusters, leading to poor performance.
Que.2 Brief about the working of the CLIQUE algorithm.
CLIQUE (Clustering In QUEst) is a grid-based and density-based subspace clustering algorithm for high-dimensional data. Its working is as follows:
1. Input:
○ A dataset in a high-dimensional space.
○ Parameters ξ (the number of equal-width intervals per dimension, i.e., the grid size) and τ (the density threshold).
2. Partitioning the Data Space:
○ The data space is divided into non-overlapping rectangular cells (or grids) based on the parameter ξ.
○ Each dimension is divided into equal-width intervals, forming a grid structure.
3. Density Calculation:
○ The density of each cell is calculated based on the number of data points within it.
○ A cell is considered "dense" if the number of points it contains meets or exceeds the density threshold τ.
4. Subspace Identification:
○ CLIQUE identifies dense cells in lower-dimensional subspaces.
○ Dense regions in subspaces are combined to form clusters in higher dimensions.
5. Cluster Formation:
○ Adjacent dense cells are merged to form clusters.
○ The merging process ensures that clusters are represented as unions of dense cells.
6. Output:
○ The algorithm outputs clusters in the original data space.
○ It also identifies the subspaces where the clusters are dense, providing
interpretability.
Features of CLIQUE:
● It automatically identifies the subspaces of the full feature space in which high-density clusters exist.
● It is insensitive to the order in which records are presented and does not presume any specific data distribution.
● It scales well as the number of records and the number of dimensions grow.
Applications: CLIQUE is often used in bioinformatics or market basket analysis, where high-dimensional data (e.g., gene expression levels or product purchases) needs clustering with interpretability in subspaces.
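To make the grid step concrete, below is a minimal 2-D sketch (illustrative only, not the full CLIQUE subspace search); the names dense_cells, xi, and tau are chosen for this example to mirror the parameters described above.

```python
# Simplified 2-D sketch of CLIQUE's grid step: split each dimension into xi
# equal-width intervals, count the points per cell, and keep the dense cells.
from collections import Counter

def dense_cells(points, xi, tau):
    mins = [min(p[d] for p in points) for d in range(2)]
    maxs = [max(p[d] for p in points) for d in range(2)]
    widths = [((maxs[d] - mins[d]) / xi) or 1.0 for d in range(2)]

    counts = Counter()
    for p in points:
        # Map the point to its grid cell (clamp the upper boundary into the last cell).
        cell = tuple(min(int((p[d] - mins[d]) / widths[d]), xi - 1) for d in range(2))
        counts[cell] += 1
    # A cell is "dense" if it holds at least tau points.
    return {cell for cell, count in counts.items() if count >= tau}

points = [(1, 1), (1.2, 1.1), (1.1, 0.9), (5, 5), (9, 9), (9.1, 8.8), (8.9, 9.2)]
print(dense_cells(points, xi=3, tau=2))   # two dense cells: around (1, 1) and (9, 9)
```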
Hierarchical clustering is a technique in unsupervised machine learning used to group data into a
hierarchy of clusters. The technique operates by either merging smaller clusters into larger
ones (agglomerative) or splitting larger clusters into smaller ones (divisive). The result is a
tree-like structure called a dendrogram, which visually represents the data hierarchy.
Hierarchical clustering is a powerful tool for grouping data, particularly in exploratory data
analysis. Its principle of grouping based on proximity and its flexibility in handling arbitrary
cluster shapes make it valuable, but its computational cost and sensitivity to noise should be
considered when working with large or noisy datasets.
Key components of hierarchical clustering:
1. Similarity Measure: A metric to calculate how close two data points or clusters are. Common distance metrics include:
○ Euclidean Distance: Straight-line distance between two points.
○ Manhattan Distance: Sum of absolute differences between coordinates.
○ Cosine Similarity: Measures the cosine of the angle between two vectors.
2. Linkage Criteria: Determines how distances between clusters are calculated. Common
linkage methods include:
○ Single Linkage: Distance between the closest points in two clusters.
○ Complete Linkage: Distance between the farthest points in two clusters.
○ Average Linkage: Average distance between all points in two clusters.
○ Centroid Linkage: Distance between the centroids of two clusters.
Types of hierarchical clustering:
1. Agglomerative (Bottom-Up):
○ Starts with each data point as its own cluster.
○ Iteratively merges the two closest clusters until all data points are in a single
cluster.
○ Example: Imagine grouping friends based on proximity. Start with each friend as
an individual group, then merge the closest groups.
Initially consider every data point as an individual Cluster and at every step, merge the nearest
pairs of the cluster. (It is a bottom-up method). At first, every dataset is considered an individual
entity or cluster. At every iteration, the clusters merge with different clusters until one cluster is
formed.
The algorithm for Agglomerative Hierarchical Clustering is:
● Consider every data point as an individual cluster.
● Calculate the similarity of each cluster with all the other clusters (the proximity matrix).
● Merge the clusters that are most similar or closest to each other.
● Recalculate the proximity matrix for the newly formed clusters.
● Repeat the previous two steps until only a single cluster remains.
In detail:
1. Initialization:
○ Treat each data point as an individual cluster.
2. Calculate Pairwise Distances:
○ Compute the distance between every pair of clusters using a distance metric.
3. Merge Clusters:
○ Identify the two clusters with the smallest distance and merge them into a single cluster.
4. Update Distances:
○ Recalculate distances between the newly formed cluster and all other clusters using the linkage criterion.
5. Repeat:
○ Repeat steps 3 and 4 until all data points are merged into a single cluster.
6. Output:
○ A dendrogram showing the hierarchy of clusters.
2. Divisive (Top-Down):
Divisive Hierarchical Clustering is precisely the opposite of Agglomerative Hierarchical Clustering. We start by treating all of the data points as a single cluster, and in every iteration we split off the data points that are least similar to the rest of their cluster. In the end, we are left with N clusters, each containing a single data point.
Example
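As a brief example, the following sketch (assuming NumPy and SciPy are available) runs agglomerative clustering with average linkage on a few 2-D points and then cuts the resulting dendrogram into two clusters.

```python
# A small agglomerative clustering example (assumes NumPy and SciPy are installed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two visually separate groups.
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [8.0, 8.0], [8.5, 8.2], [7.8, 8.4]])

# Bottom-up merging with average linkage on Euclidean distances;
# Z records every merge and can be drawn as a dendrogram.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram so that exactly two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```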
Limitations
● High computational cost (the proximity matrix requires O(n²) memory and up to O(n³) time), which limits scalability to large datasets.
● Sensitive to noise and outliers.
● Merges (or splits) are greedy and cannot be undone once made.
Lift is a measure used in association rule mining to evaluate the strength of an association rule A → B. It compares how often A and B are purchased together with how often they would co-occur if they were independent:
Lift(A → B) = Support(A ∪ B) / (Support(A) × Support(B)) = Confidence(A → B) / Support(B)
Interpretation:
● Lift > 1: A and B are positively associated (they occur together more often than expected
by chance).
● Lift = 1: A and B are independent.
● Lift < 1: A and B are negatively associated (they occur together less often than expected
by chance).
Example:
If Lift =2, it means customers buying milk are twice as likely to buy bread compared to random
chance.
Qu5. What are the advantages of the PCY algorithm over the Apriori Algorithm?
The PCY (Park-Chen-Yu) Algorithm is an improved version of the Apriori algorithm designed to handle large datasets more efficiently by reducing memory usage and computational overhead. Its main advantages over Apriori are:
● During the first pass, PCY hashes every pair of items in each transaction into buckets and counts the buckets, using memory that Apriori leaves idle in pass 1.
● Buckets whose count is below the support threshold cannot contain any frequent pair, so all pairs hashing only to such buckets are eliminated, greatly reducing the number of candidate pairs in the second pass.
● The bucket counts are compressed into a bitmap (one bit per bucket), so the extra information carried into pass 2 is very small.
● Fewer candidates mean lower memory requirements and faster counting in pass 2, especially for datasets with many items.
Comparison: Apriori counts every pair of frequent items in its second pass, whereas PCY counts only those pairs whose items are frequent and which also hash to a frequent bucket.
Que6.Write short notes on Market Based modeling explaining market basket working rules
and types of market basket analysis?
Market-Based Modeling:
Market-based modeling is a technique used in data mining and machine learning to analyze
consumer behavior. It involves studying patterns in transactional data to identify relationships
between items, helping businesses optimize sales strategies.
1. Association Rules:
○ A rule of the form A → B indicates that if a customer buys A, they are likely to also buy B.
○ Metrics used:
■ Support: Frequency of the rule occurring in the dataset.
■ Confidence: Likelihood of B being purchased given A.
■ Lift: Strength of the association between A and B.
2. Steps:
○ Analyze transaction data to find frequent itemsets.
○ Generate association rules from these itemsets.
○ Evaluate the rules using metrics like support, confidence, and lift.
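To make these metrics concrete, here is a small hedged sketch over a hypothetical set of transactions that computes support, confidence, and lift for the rule bread → butter.

```python
# Compute support, confidence, and lift for a single rule A -> B
# over a tiny, hypothetical set of transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
A, B = {"bread"}, {"butter"}

n = len(transactions)
support_A = sum(A <= t for t in transactions) / n          # P(A)
support_B = sum(B <= t for t in transactions) / n          # P(B)
support_AB = sum((A | B) <= t for t in transactions) / n   # P(A and B)

confidence = support_AB / support_A          # P(B | A)
lift = support_AB / (support_A * support_B)

print(f"support={support_AB:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# support=0.60, confidence=0.75, lift=0.94
```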
Types of Market Basket Analysis (MBA):
1. Descriptive MBA:
○ Identifies patterns in historical data.
○ Example: Customers buying bread are also likely to buy butter.
2. Predictive MBA:
○ Predicts future purchasing behavior based on past transactions.
○ Example: Recommending complementary products to customers.
3. Prescriptive MBA:
○ Provides actionable recommendations to improve sales.
○ Example: Suggesting promotional offers for frequently purchased combinations.
Applications:
● Retail: product placement and shelf arrangement based on frequently co-purchased items.
● E-commerce: recommendation engines and cross-selling ("frequently bought together").
● Promotions: designing combo offers and discounts on associated items.
● Inventory management: stocking and bundling associated items together.
Conclusion:
Market-based modeling and MBA are essential tools for businesses to understand consumer
behavior, improve customer experience, and increase revenue through data-driven decisions.
Que.7 Find the entire Association rule from the above given Transaction with Given
minimum support = 50%, minimum confidence= 50 %. Using Apriori algorithm
Step 1: Support of each candidate 1-itemset (out of the 5 transactions in the given dataset):

Item | Count | Support
Beer | 3 | 60%
Nuts | 3 | 60%
Diaper | 4 | 80%
Coffee | 2 | 40%
Eggs | 3 | 60%
Milk | 2 | 40%

With minimum support = 50% (at least 3 of 5 transactions), the frequent 1-itemsets are Beer, Nuts, Diaper, and Eggs; Coffee and Milk are pruned.
Step 2: Candidate 2-itemsets generated from the frequent items:
● {Beer, Nuts}, {Beer, Diaper}, {Beer, Eggs}, {Nuts, Diaper}, {Nuts, Eggs}, {Diaper, Eggs}
Step 3: Support of the candidate 2-itemsets:

Itemset | Count | Support
{Beer, Diaper} | 3 | 60%
{Beer, Eggs} | 1 | 20%
{Nuts, Diaper} | 2 | 40%
{Nuts, Eggs} | 2 | 40%
{Diaper, Eggs} | 3 | 60%

Of the itemsets counted above, {Beer, Diaper} and {Diaper, Eggs} meet the 50% minimum support.
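To complete the Apriori run: the rules derived from these frequent 2-itemsets are Beer → Diaper (confidence 3/3 = 100%), Diaper → Beer (3/4 = 75%), Diaper → Eggs (3/4 = 75%) and Eggs → Diaper (3/3 = 100%), all of which satisfy the 50% minimum confidence. The short sketch below reproduces this final step from the counts already listed above.

```python
# Derive the association rules from the support counts listed above
# (5 transactions; minimum support 50%, minimum confidence 50%).
from itertools import permutations

n_transactions = 5
item_count = {"Beer": 3, "Nuts": 3, "Diaper": 4, "Eggs": 3}
pair_count = {frozenset({"Beer", "Diaper"}): 3, frozenset({"Diaper", "Eggs"}): 3}
min_support, min_confidence = 0.5, 0.5

for pair, count in pair_count.items():
    if count / n_transactions < min_support:
        continue  # the pair itself is not frequent
    for antecedent, consequent in permutations(sorted(pair), 2):
        confidence = count / item_count[antecedent]
        if confidence >= min_confidence:
            print(f"{antecedent} -> {consequent}: support={count / n_transactions:.0%}, "
                  f"confidence={confidence:.0%}")
```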
Que.8 How does the K-means algorithm work? Write k-means algorithm for partitioning.
K-Means is an iterative clustering algorithm that partitions a dataset into k distinct clusters based on the proximity of data points. It minimizes the variance within clusters and maximizes the variance between clusters.
1. Clustering Objective:
○ Partition the dataset into k clusters such that the sum of squared distances between data points and the centroid of their assigned cluster is minimized.
○ Objective Function:
J = Σ (i = 1 to k) Σ (x ∈ Ci) ‖x − μi‖²
○ where:
■ Ci: cluster i,
■ μi: centroid of cluster i,
■ x: a data point.
2. Centroid:
○ The centroid is the mean of all data points in a cluster.
3. Iterative Process:
○ K-Means alternates between assigning data points to the nearest centroid and
updating centroids based on the mean of assigned points until convergence.
1. Initialization:
○ Select k initial centroids randomly from the dataset.
2. Assignment Step:
○ Assign each data point to the cluster whose centroid is nearest (based on a
distance metric like Euclidean distance).
3. Update Step:
○ Recalculate the centroids of each cluster as the mean of all data points assigned to
that cluster.
4. Convergence:
○ Repeat the assignment and update steps until:
■ The centroids no longer change significantly.
■ A maximum number of iterations is reached.
■ The change in the objective function J is below a threshold.
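The following is a minimal NumPy sketch of these steps (an illustrative implementation, not a library routine); the function name kmeans and its parameters are chosen for this example only.

```python
# A minimal K-Means sketch in NumPy (Euclidean distance, random initialization).
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct points of X as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # 2. Assignment step: each point joins the cluster of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # 4. Convergence: stop when the centroids no longer move significantly.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0], [1.2, 0.8], [9.0, 8.5]])
centers, labels = kmeans(X, k=2)
print(centers, labels)
```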
Advantages of K-Means
● Simple to understand and implement, and computationally efficient, so it scales to large datasets.
● Works well when clusters are compact, roughly spherical, and well separated.
Limitations of K-Means
● The number of clusters k must be specified in advance.
● Sensitive to the initial choice of centroids and to outliers.
● Performs poorly when clusters differ greatly in size, density, or shape (as noted in Que.1).
Que.9 Cluster the following eight points (with (x, y) representing locations) into three
clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9) .
Initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between
two points a = (x1, y1) and b = (x2, y2) is defined as- Ρ (a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
Iteration-01:
● We calculate the distance of each point from each of the centers of the three
clusters.
● The distance is calculated by using the given distance function.
The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters:
Ρ(A1, C1)
= |2 – 2| + |10 – 10|
= 0
Ρ(A1, C2)
= |5 – 2| + |8 – 10|
= 3 + 2
= 5
Ρ(A1, C3)
= |1 – 2| + |2 – 10|
= 1 + 8
= 9
In the similar manner, we calculate the distance of other points from each of the center of
the three clusters.
Next,
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (5, 8) of Cluster-02 | Distance from center (1, 2) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 5 | 9 | C1
A2(2, 5) | 5 | 6 | 4 | C3
A3(8, 4) | 12 | 7 | 9 | C2
A4(5, 8) | 5 | 0 | 10 | C2
A5(7, 5) | 10 | 5 | 9 | C2
A6(6, 4) | 10 | 5 | 7 | C2
A7(1, 2) | 9 | 10 | 0 | C3
A8(4, 9) | 3 | 2 | 10 | C2
Cluster-01:
● A1(2, 10)
Cluster-02
● A3(8, 4)
● A4(5, 8)
● A5(7, 5)
● A6(6, 4)
● A8(4, 9)
Cluster-03:
● A2(2, 5)
● A7(1, 2)
Now, we re-compute the center of each cluster as the mean of its points.
For Cluster-01:
Cluster-01 contains only A1, so
Center of Cluster-01 = (2, 10)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
The new cluster centers after the first iteration are C1(2, 10), C2(6, 6) and C3(1.5, 3.5).
Iteration-02:
● We calculate the distance of each point from each of the center of the three
clusters.
● The distance is calculated by using the given distance function.
The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters:
Ρ(A1, C1)
= |2 – 2| + |10 – 10|
= 0
Ρ(A1, C2)
= |6 – 2| + |6 – 10|
= 4 + 4
= 8
Ρ(A1, C3)
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
= 7
In the similar manner, we calculate the distance of other points from each of the center of
the three clusters.
Next,
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (6, 6) of Cluster-02 | Distance from center (1.5, 3.5) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 8 | 7 | C1
A2(2, 5) | 5 | 5 | 2 | C3
A3(8, 4) | 12 | 4 | 7 | C2
A4(5, 8) | 5 | 3 | 8 | C2
A5(7, 5) | 10 | 2 | 7 | C2
A6(6, 4) | 10 | 2 | 5 | C2
A7(1, 2) | 9 | 9 | 2 | C3
A8(4, 9) | 3 | 5 | 8 | C1
Cluster-01:
● A1(2, 10)
● A8(4, 9)
Cluster-02:
● A3(8, 4)
● A4(5, 8)
● A5(7, 5)
● A6(6, 4)
Cluster-03:
● A2(2, 5)
● A7(1, 2)
Now, we re-compute the center of each cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
Therefore, the three cluster centers after the second iteration are:
● C1(3, 9.5)
● C2(6.5, 5.25)
● C3(1.5, 3.5)
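The computation above can be checked with a short script (a small verification sketch in plain Python) that runs the same two iterations with the given Manhattan distance and the given initial centers.

```python
# Reproduce the two iterations above: K-Means with the Manhattan distance
# P(a, b) = |x2 - x1| + |y2 - y1| and the given initial centers A1, A4, A7.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = [(2, 10), (5, 8), (1, 2)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for iteration in (1, 2):
    # Assignment step: each point joins the cluster with the nearest center.
    clusters = [[] for _ in centers]
    for p in points.values():
        nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        clusters[nearest].append(p)
    # Update step: each center becomes the mean of its cluster's points.
    centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
    print(f"Centers after iteration {iteration}: {centers}")
# Expected after iteration 2: (3.0, 9.5), (6.5, 5.25), (1.5, 3.5)
```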
Que.10 Explain SON algorithm to find all or most frequent itemsets using at most
two passes.
The SON (Savasere, Omiecinski, and Navathe) algorithm is a distributed and scalable approach
for finding frequent item sets in large datasets, especially when the dataset is too large to fit into
memory. It uses the MapReduce framework to process data in parallel and requires at most two
passes over the dataset.
The SON algorithm leverages the fact that any itemset that is frequent in the entire dataset must be frequent in at least one of the partitions of the dataset (when the support threshold is scaled down proportionally). It operates in two phases:
1. Local Computation:
○ Break the dataset into smaller, manageable chunks (partitions).
○ Identify frequent itemsets within each partition (using a lower threshold for local
support).
2. Global Validation:
○ Combine the results from all partitions.
○ Validate which itemsets are globally frequent by counting their occurrences across
the entire dataset.
First pass:
● Divide the dataset D into smaller partitions D1, D2, …, Dn that can fit into memory.
● Run a frequent-itemset mining algorithm (e.g., Apriori) on each partition with a proportionally scaled-down support threshold to find its locally frequent itemsets.
● Combine the locally frequent itemsets from all partitions into a single set of candidate itemsets.
Second pass:
● Make a second pass over the dataset to count the occurrences of the candidate itemsets across the entire dataset.
● Identify itemsets that meet the global support threshold as globally frequent itemsets.
1. Two-Pass Approach:
○ First pass: Identify locally frequent itemsets.
○ Second pass: Validate globally frequent itemsets.
2. Distributed Processing:
○ Efficiently handles large datasets using parallel computation (e.g., MapReduce).
3. Memory Efficiency:
○ Processes data in smaller partitions that fit into memory.
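A hedged in-memory sketch of the two passes is shown below; the helper frequent_in is a brute-force stand-in for Apriori, the chunking is sequential for simplicity, and only itemsets up to size 2 are generated.

```python
# In-memory sketch of SON's two passes.
from itertools import combinations
from collections import Counter

def frequent_in(chunk, min_count, max_size=2):
    # Brute-force miner used as a stand-in for Apriori on one partition.
    counts = Counter()
    for transaction in chunk:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            counts.update(frozenset(c) for c in combinations(items, size))
    return {itemset for itemset, count in counts.items() if count >= min_count}

def son(transactions, support_fraction, num_chunks=2):
    # Pass 1: mine every chunk with a proportionally scaled support threshold;
    # each locally frequent itemset becomes a global candidate.
    chunk_len = -(-len(transactions) // num_chunks)   # ceiling division
    chunks = [transactions[i:i + chunk_len] for i in range(0, len(transactions), chunk_len)]
    candidates = set()
    for chunk in chunks:
        candidates |= frequent_in(chunk, min_count=support_fraction * len(chunk))
    # Pass 2: count every candidate over the full dataset, keep the truly frequent ones.
    counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        counts.update(c for c in candidates if c <= items)
    threshold = support_fraction * len(transactions)
    return {itemset: count for itemset, count in counts.items() if count >= threshold}

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "B"}, {"B", "C"}, {"A", "B", "C"}]
print(son(transactions, support_fraction=0.5))
```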
Example
Dataset (Transactions):
● support global=50%.
1. Scalability:
○ Handles massive datasets by leveraging distributed computation.
2. Efficiency:
○ Processes data in partitions, reducing memory usage.
3. Parallelization:
○ Suitable for frameworks like MapReduce.
Applications
The SON algorithm is a powerful method for mining frequent itemsets efficiently from large
datasets, making it an essential tool in big data analytics.
Que.11
Transaction ID | Items
T1 | {E, K, M, N, O, Y}
T2 | {D, E, K, N, O, Y}
T3 | {A, E, K, M}
T4 | {C, K, M, U, Y}
T5 | {C, E, I, K, O, O}
Explain FP-Growth algorithm. Solve this using the Frequent Pattern Tree.
The above-given data is a hypothetical dataset of transactions with each letter representing an
item. The frequency of each individual item is computed:-
Item | Frequency
A | 1
C | 2
D | 1
E | 4
I | 1
K | 5
M | 3
N | 2
O | 4
U | 1
Y | 3
Let the minimum support be 3. A Frequent Pattern set is built which will contain all the
elements whose frequency is greater than or equal to the minimum support. These elements are
stored in descending order of their respective frequencies. After insertion of the relevant items,
the set L looks like this:-
L = {K : 5, E : 4, M : 3, O : 4, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the
Frequent Pattern set and checking if the current item is contained in the transaction in question.
If the current item is contained, the item is inserted in the Ordered-Item set for the current
transaction. The following table is built for all the transactions:
Transaction ID | Items | Ordered-Item Set
T1 | {E, K, M, N, O, Y} | {K, E, M, O, Y}
T2 | {D, E, K, N, O, Y} | {K, E, O, Y}
T3 | {A, E, K, M} | {K, E, M}
T4 | {C, K, M, U, Y} | {K, M, Y}
T5 | {C, E, I, K, O, O} | {K, E, O}

Now, all the Ordered-Item sets are inserted into a Trie data structure (the FP-Tree).
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence in the set, and the support count of each item is initialized to 1.
b) Inserting the set {K, E, O, Y}:
Till the insertion of the elements K and E, the support count is simply increased by 1. On inserting O, we see that there is no direct link between E and O, so a new node for the item O is initialized with a support count of 1 and item E is linked to this new node. On inserting Y, we initialize a new node for the item Y with a support count of 1 and link the new node of O to the new node of Y.
c) Inserting the set {K, E, M}:
Now for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set
of elements that is common in all the paths in the Conditional Pattern Base of that item and
calculating its support count by summing the support counts of all the paths in the Conditional
Pattern Base.
From the Conditional Frequent Pattern Tree, the frequent patterns are generated by pairing the items of each Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below.
For each row, two types of association rules can be inferred. For example, for the first row, the rules K -> Y and Y -> K can be inferred. To determine which rules are valid, the confidence of both rules is calculated, and those with confidence greater than or equal to the minimum confidence value are retained.
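As a sketch of the first half of FP-Growth on the transactions above (items counted once per transaction, minimum support 3), the following code derives the Ordered-Item sets and inserts them into a simple FP-Tree; the class name FPNode and the helper show are specific to this example.

```python
# Build the Ordered-Item sets and a simple FP-Tree for the five transactions above.
from collections import Counter

transactions = [
    {"E", "K", "M", "N", "O", "Y"},
    {"D", "E", "K", "N", "O", "Y"},
    {"A", "E", "K", "M"},
    {"C", "K", "M", "U", "Y"},
    {"C", "E", "I", "K", "O"},
]
min_support = 3

# Frequent 1-itemsets and their order: descending frequency, ties broken alphabetically.
freq = Counter(item for t in transactions for item in t)
frequent = {item: count for item, count in freq.items() if count >= min_support}
order = sorted(frequent, key=lambda item: (-frequent[item], item))   # K, E, M, O, Y

# Ordered-Item set for each transaction (keep only frequent items, in that order).
ordered_sets = [[item for item in order if item in t] for t in transactions]
print(ordered_sets)   # K,E,M,O,Y / K,E,O,Y / K,E,M / K,M,Y / K,E,O

class FPNode:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

root = FPNode(None)
for itemset in ordered_sets:
    node = root
    for item in itemset:
        # Follow an existing branch if one exists, otherwise create a new child node.
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)   # prints the FP-Tree, starting with the K:5 branch
```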
Que.12 What are the approaches for high dimensional data clustering?
High-dimensional data clustering refers to grouping data points in spaces with a large number of
dimensions (features). High dimensionality poses challenges like the "curse of dimensionality,"
which can obscure meaningful patterns. Here are the key approaches:
1. Dimensionality Reduction followed by Clustering
● Reduce the number of features first (e.g., with PCA or t-SNE) and then apply a standard clustering algorithm to the reduced representation.
2. Subspace Clustering
● Identifies clusters within subsets of dimensions instead of the entire feature space.
● Examples:
○ CLIQUE: Combines grid-based and density-based clustering in subspaces.
○ PROCLUS: Finds medoid-based clusters in subspaces.
3. Spectral Clustering
4. Density-Based Clustering
5. Model-Based Clustering
● Uses probabilistic models like Gaussian Mixture Models (GMM) to identify clusters.
● Assumes data is generated from a mixture of underlying probability distributions.
6. Adaptations of Traditional Algorithms
● k-Means Variants: Adapted for high-dimensional data by using cosine similarity instead of Euclidean distance.
● Hierarchical Clustering: Effective for smaller datasets, but suffers from scalability issues
in high dimensions.
7. Kernel Methods
8. Ensemble Clustering
Challenges
Que.13 What are the applications of frequent itemset analysis? Explain a simple and
randomized algorithm to find most frequent itemsets using at most two passes.
Frequent itemset analysis is widely used in discovering patterns and associations in data. Key applications include:
● Market basket analysis: finding products that are frequently purchased together.
● Cross-selling and recommendation systems in retail and e-commerce.
● Web usage mining: pages or links that are frequently visited together.
● Medical and bioinformatics data analysis: co-occurring symptoms, diagnoses, or genes.
● Fraud and intrusion detection: combinations of events that occur together unusually often.
Objective: To find frequent itemsets (sets of items appearing together frequently) using at most
two passes over the data.
Algorithm:
1. Input:
○ A dataset D of transactions.
○ A support threshold s (minimum support count).
2. Pass 1:
○ Random Sampling:
■ Randomly sample a subset of transactions from D.
○ Frequent Itemset Mining:
■ Apply any frequent itemset mining algorithm (e.g., Apriori) on the
sampled data.
■ Record candidate frequent itemsets.
3. Pass 2:
○ Verify Candidate Itemsets:
■ For each candidate itemset, count its exact support in the entire dataset D.
○ Filter Frequent Itemsets:
■ Retain itemsets with support ≥s.
Key Features:
● The first pass reduces the number of candidate itemsets, making the second pass efficient.
● Random sampling ensures scalability.
Advantages:
Limitations:
Toivonen’s algorithm is a randomized approach for frequent itemset mining, particularly useful for large datasets. It combines random sampling and verification to find frequent itemsets efficiently, minimizing the number of passes over the dataset; this makes it suitable for scenarios where scanning the dataset is expensive. Its use of sampling and verification reduces computational overhead, while the negative border ensures that no frequent itemset is missed.
1. Input:
● A dataset D of transactions and a support threshold s.
2. Sampling:
● Draw a random sample S of transactions from D that fits in memory (often mined with a slightly lowered support threshold to reduce the chance of missing itemsets).
3. Mine the Sample:
● Use a frequent itemset mining algorithm (e.g., Apriori) on S to identify its frequent itemsets Fs.
4. Build the Candidate Set:
● Include both:
○ Frequent itemsets from Fs.
○ Negative Border: itemsets that are not frequent in the sample, but all of whose immediate subsets are frequent in the sample (i.e., the immediate extensions of Fs that just failed to be frequent).
5. Verification (one full pass over D):
● Count the actual support of all candidate itemsets (frequent itemsets and the negative border) in D.
● Retain only those itemsets with support ≥ s.
6. Handle Errors:
● If any itemset from the negative border is frequent in D, the algorithm fails because the sample was not representative.
○ Solution: Restart the algorithm with a larger sample.
Negative Border:
● The negative border contains itemsets that are not frequent in the sample but could
potentially be frequent in the full dataset.
● Ensures that no frequent itemsets are missed.
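A small sketch of how the negative border can be computed from the sample's frequent itemsets is shown below; the function name negative_border is illustrative, and singleton candidates are handled by treating the empty itemset as trivially frequent.

```python
# Compute the negative border of a sample's frequent itemsets: itemsets that are
# not frequent in the sample but all of whose immediate subsets are frequent.
from itertools import combinations

def negative_border(frequent_itemsets, all_items):
    frequent = {frozenset(f) for f in frequent_itemsets}
    frequent.add(frozenset())   # the empty itemset is trivially frequent
    border = set()
    for itemset in frequent:
        for item in all_items - itemset:
            candidate = itemset | {item}
            if candidate in frequent:
                continue   # already frequent in the sample
            # Keep the candidate only if every immediate subset is frequent.
            if all(frozenset(subset) in frequent
                   for subset in combinations(candidate, len(candidate) - 1)):
                border.add(candidate)
    return border

# Using the dataset of the example below, suppose the sample-frequent itemsets are:
sample_frequent = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
print(negative_border(sample_frequent, all_items={"A", "B", "C"}))
# the negative border here is the single itemset {A, B, C}
```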
Advantages
1. Efficiency: Reduces the number of passes over the dataset (only one full pass is needed
for verification).
2. Scalability: Works well for large datasets by operating on a smaller sample.
3. Accuracy: Ensures correctness through verification.
Limitations
1. Random Sampling Bias: A poor sample may lead to errors, requiring a restart.
2. Negative Border Size: The size of the negative border can grow exponentially with the
number of items.
3. Restart Overhead: If the algorithm fails, it needs to restart with a larger sample,
increasing computational cost.
Example
Dataset D:
D={{A,B,C},{A,C},{A,B},{B,C},{A,B,C}}
Step 1: Sampling
Step 4: Verification
● Frequent itemsets verified in D: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}. The negative-border itemset {A, B, C} has support 2 out of 5 in D, so it is not frequent and the algorithm succeeds.
Applications
1. Grid-Based Approaches
○ Divide the feature space into grids and identify dense regions.
○ Example: CLIQUE (Clustering In QUEst):
■ Divides the data space into equal-sized cells.
■ Identifies dense cells (with a number of points exceeding a threshold).
■ Clusters are formed by merging dense cells.
2. Bottom-Up Approaches
○ Start with low-dimensional subspaces and iteratively extend to higher dimensions.
○ Example: Proclus:
■ Finds medoids in subspaces and assigns data points to clusters based on
proximity.
■ Works efficiently by limiting the number of subspaces considered.
3. Top-Down Approaches
○ Start with the full-dimensional space and iteratively remove irrelevant
dimensions.
○ Example: P3C (Projective Clustering with Constraints):
■ Uses statistical tests to identify relevant dimensions for each cluster.
4. Density-Based Approaches
○ Identify clusters as dense regions in subspaces.
○ Example: SUBCLU:
■ Extends DBSCAN to find density-connected clusters in subspaces.
5. Spectral-Based Approaches
○ Use eigenvectors of the similarity matrix to project data into a lower-dimensional
subspace.
○ Example: SSC (Sparse Subspace Clustering):
■ Uses sparse representation techniques to identify clusters.
6. Ensemble Subspace Clustering
○ Combines results from multiple subspace clustering algorithms to improve
robustness.
○ Example: ECSC (Ensemble Clustering of Subspaces):
■ Aggregates clustering results using consensus functions.
Que.16 How can you improve the efficiency of Apriori based mining.
The Apriori algorithm is a classic method for frequent itemset mining, but its performance can be
improved in several ways:
● Hash-Based Technique (as used in PCY):
○ Hash item pairs into buckets during the first pass; pairs that fall only into infrequent buckets cannot be frequent and are pruned from the candidate set.
● Transaction Reduction:
○ Remove transactions that do not contain any frequent itemsets, since they cannot contribute to longer frequent itemsets in later passes.
● Partitioning:
○ Divide the data into partitions that fit in memory; any itemset frequent in the whole dataset must be frequent in at least one partition (the idea behind the SON algorithm).
● Sampling:
○ Mine a random sample with a lowered support threshold and verify the result on the full data (the idea behind Toivonen’s algorithm).
● Dynamic Itemset Counting:
○ Add new candidate itemsets at checkpoints during a scan instead of waiting for the next full scan.
● Vertical Data Format:
○ Represent data as item-to-transaction (TID-list) mappings.
○ The support of a candidate itemset is then obtained by intersecting TID-lists instead of re-scanning transactions, as sketched below.
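As a sketch of the vertical data format idea (on a hypothetical mini-dataset), the support of a candidate pair is obtained by intersecting TID-lists rather than re-scanning the transactions:

```python
# Vertical data format: map each item to its TID-list (the set of transaction IDs
# containing it); the support of a pair is the size of the TID-list intersection.
from collections import defaultdict

transactions = {   # hypothetical mini-dataset
    "T1": {"Beer", "Nuts", "Diaper"},
    "T2": {"Beer", "Diaper", "Eggs"},
    "T3": {"Nuts", "Eggs", "Milk"},
}

tid_lists = defaultdict(set)
for tid, items in transactions.items():
    for item in items:
        tid_lists[item].add(tid)

# Support of {Beer, Diaper} without re-scanning the transactions:
support = len(tid_lists["Beer"] & tid_lists["Diaper"])
print(support)   # 2
```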
Clustering in data mining involves grouping similar data points into clusters. To ensure effective
clustering, the following requirements must be met:
1. Scalability
● The algorithm must handle large datasets efficiently, both in terms of time and memory.
2. Arbitrary Cluster Shapes
● Should identify clusters of arbitrary shapes (e.g., density-based methods like DBSCAN).
3. Robustness to Noise and Outliers
● Clustering results should not be overly sensitive to noise or outliers in the data.
5. High Dimensionality
6. Interpretability
9. Domain-Specific Requirements
The SON (Savasere, Omiecinski, and Navathe) algorithm is a distributed approach for finding
frequent itemsets using the MapReduce framework. It is designed to handle large datasets by
dividing the data into chunks and processing them independently.
1. Input:
● A dataset D of transactions and a global support threshold s.
2. Pass 1 (find candidate itemsets):
● Mapper Phase:
○ Each mapper processes a subset (chunk) of the dataset.
○ Within the chunk, it finds frequent itemsets using the Apriori algorithm with the local support threshold s × (chunk size / dataset size).
● Reducer Phase:
○ Combines all locally frequent itemsets to form a global candidate set.
3. Pass 2 (verify candidates):
● Mapper Phase:
○ Each mapper scans its chunk of the dataset and counts the support of the candidate itemsets.
● Reducer Phase:
○ Aggregates the counts and retains the itemsets with support ≥ s.
Advantages
Applications
The following is a worked example of the PCY (hash-based) algorithm on the transactions below, using the hash function (i * j) mod 10 and a support threshold of 3:
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {2, 4, 6}
Approach:
There are several steps that you have to follow to get the Candidate table.
Step 1: Find the frequency of each element and remove the candidate set having length 1.
Step 2: One by one, transaction-wise, create all the possible pairs and write their frequency against them. Note: pairs that have already been written for an earlier transaction should not be repeated.
Step 3: List all sets whose length is greater than the threshold and then apply Hash Functions. (It
gives us the bucket number). It defines in what bucket this particular pair will be put.
Step 4: This is the last step; in this step, we create a table with the following columns -
● Bit vector - 1 if the frequency of the candidate pair is greater than or equal to the threshold, otherwise 0 (here it is 1 for every listed pair).
● Bucket number - found in the previous step.
● Highest support count - the frequency of this candidate pair, found in Step 2.
● Pairs - the candidate pair itself is listed here.
● Candidate set - if the bit vector is 1, the pair is written here as a candidate.
Solution:
Step 1: Find the frequency of each element and remove the candidate set having length 1.
Items 1 2 3 4 5 6
Frequency 4 7 7 8 6 4
Step 2: One by one, transaction-wise, create all the possible new pairs and write their frequencies:

Transaction | New pairs | Frequency
T5 | (1, 5) | 1
T6 | (2, 6) | 2
T7 | (1, 4) | 2
T8 | (2, 5) | 2
T9 | (3, 6) | 1
T10 | – (no new pairs) | –
T11 | – (no new pairs) | –
T12 | – (no new pairs) | –
Step 3: List all sets whose length is greater than the threshold and then apply Hash Functions. (It
gives us the bucket number).
Hash Function = ( i * j) mod 10
(1, 3) = (1*3) mod 10 = 3
(2,3) = (2*3) mod 10 = 6
(2,4) = (2*4) mod 10 = 8
(3,4) = (3*4) mod 10 = 2
(3,5) = (3*5) mod 10 = 5
(4,5) = (4*5) mod 10 = 0
(4,6) = (4*6) mod 10 = 4
Bucket No. | Pair
0 | (4, 5)
2 | (3, 4)
3 | (1, 3)
4 | (4, 6)
5 | (3, 5)
6 | (2, 3)
8 | (2, 4)
Bit Vector | Bucket No. | Highest Support Count | Pairs | Candidate Set
1 | 0 | 3 | (4, 5) | (4, 5)
1 | 2 | 4 | (3, 4) | (3, 4)
1 | 3 | 3 | (1, 3) | (1, 3)
1 | 4 | 4 | (4, 6) | (4, 6)
1 | 5 | 3 | (3, 5) | (3, 5)
1 | 6 | 3 | (2, 3) | (2, 3)
1 | 8 | 4 | (2, 4) | (2, 4)
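The hashing steps of this example can be reproduced with a short script (assuming the support threshold is 3, as the tables above imply); it counts the pairs transaction by transaction, keeps those meeting the threshold, and prints the bucket each one hashes to with (i * j) mod 10.

```python
# Recompute Steps 2 and 3 of the worked example: count the pairs, keep those that
# meet the assumed threshold of 3, and hash each kept pair into its bucket.
from itertools import combinations
from collections import Counter

transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {1, 3, 5}, {2, 4, 6},
    {1, 3, 4}, {2, 4, 5}, {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {2, 4, 6},
]
threshold = 3     # assumed support threshold, consistent with the tables above
num_buckets = 10  # hash function: (i * j) mod 10

pair_counts = Counter()
for t in transactions:
    pair_counts.update(combinations(sorted(t), 2))

for (i, j) in sorted(pair for pair, count in pair_counts.items() if count >= threshold):
    print((i, j), "-> bucket", (i * j) % num_buckets)
```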
Que.20 Explain clustering in Non-Euclidean Spaces and clustering for streams parallelism?
Clustering in Non-Euclidean spaces involves grouping data points where the distance metric is not based on Euclidean geometry. Non-Euclidean spaces arise when the relationships between data points cannot be captured by simple geometric distances, such as high-dimensional, categorical, or graph-based data. Clustering in non-Euclidean spaces and clustering of data streams both address the challenges of complex data structures and real-time processing: these approaches leverage non-Euclidean distance measures (e.g., Jaccard, cosine, or edit distance), incremental algorithms, and parallelism to handle large-scale, dynamic datasets efficiently. Common techniques for clustering in non-Euclidean spaces include:
1. Density-Based Clustering
2. Graph-Based Clustering
● Spectral Clustering:
○ Constructs a similarity graph where nodes represent data points, and edges
represent similarities.
○ Uses eigenvalues of the graph Laplacian to identify clusters.
3. Kernel-Based Clustering
● Kernel K-Means:
○ Maps data into a higher-dimensional space using a kernel function.
○ Performs clustering in the transformed space.
4. Hierarchical Clustering
Applications
Clustering for data streams involves processing continuous, high-velocity data in real time.
Stream clustering algorithms are designed to handle large-scale, evolving data efficiently, often
leveraging parallelism for scalability.
● CluStream:
○ Maintains micro-clusters (compact summaries) of data in memory.
○ Periodically merges micro-clusters to form macro-clusters.
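To illustrate the idea of a micro-cluster, below is a hedged sketch of a CluStream-style summary structure: it keeps only the number of points and the per-dimension linear and squared sums (real CluStream additionally keeps timestamp statistics), which is enough to update the centroid and radius incrementally as stream points arrive.

```python
# A CluStream-style micro-cluster summary (cluster-feature vector).
import math

class MicroCluster:
    def __init__(self, dim):
        self.n = 0                 # number of points absorbed
        self.ls = [0.0] * dim      # linear sum per dimension
        self.ss = [0.0] * dim      # squared sum per dimension

    def add(self, point):
        self.n += 1
        for d, x in enumerate(point):
            self.ls[d] += x
            self.ss[d] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root-mean-square deviation of the absorbed points from the centroid.
        variance = sum(self.ss[d] / self.n - (self.ls[d] / self.n) ** 2
                       for d in range(len(self.ls)))
        return math.sqrt(max(variance, 0.0))

mc = MicroCluster(dim=2)
for p in [(1.0, 2.0), (1.2, 1.8), (0.9, 2.1)]:   # points arriving from a stream
    mc.add(p)
print(mc.centroid(), mc.radius())
```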
3. Grid-Based Clustering
● D-Stream:
○ Divides the data space into grids.
○ Assigns data points to grids and tracks the density of each grid over time.
Parallelism improves the efficiency of stream clustering by distributing the workload across
multiple processors or machines.
1. MapReduce Framework
● Mapper Phase:
○ Each mapper processes a subset of the stream to identify local clusters.
● Reducer Phase:
○ Combines local clusters into global clusters.
Applications
1. Fraud Detection:
○ Real-time clustering of transaction data to identify anomalies.
2. Network Monitoring:
○ Clustering network traffic streams to detect patterns or intrusions.
3. IoT Data Analysis:
○ Clustering sensor data streams for predictive maintenance.