
UNIT-4th

Frequent Itemset and Clustering


CO-4 Understand item sets, Clustering, Frame works & Visualization

Q.No — Question (Year, Marks)

Que-1: Explain the K-Means algorithm. When would you use K-Means? State whether the statement "K-Means has an assumption each cluster has roughly equal number of observations" is true or false. Justify your answer. (2022-23, 2)

Que-2: Brief about the working of the CLIQUE algorithm. (2022-23, 2)

Que-3: Explain the principle behind the Hierarchical clustering technique. (2022-23, 10)

Que-4: Define Lift in Association Data Mining. (2022-23, 10)

Que-5: What are the advantages of the PCY algorithm over the Apriori algorithm? (2022-23, 10)

Que-6: Write short notes on Market Based modeling, explaining market basket working rules and the types of market basket analysis. (2022-23, 10)

Que-7: Find all the association rules from the given transactions, with minimum support = 50% and minimum confidence = 50%, using the Apriori algorithm. (2021-22, 2)

Que-8: How does the K-means algorithm work? Write the k-means algorithm for partitioning. (2021-22, 2)

Que-9: Cluster the following eight points (with (x, y) representing locations) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). Initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as Ρ(a, b) = |x2 – x1| + |y2 – y1|. Use the K-Means algorithm to find the three cluster centers after the second iteration. (2021-22, 10)

Que-10: Explain the SON algorithm to find all or most frequent itemsets using at most two passes. (2021-22, 10)

Que-11: Explain the FP-Growth algorithm and solve the following transactions using a Frequent Pattern Tree. (2021-22, 10)

Transaction ID   Items
T1   {E, K, M, N, O, Y}
T2   {D, E, K, N, O, Y}
T3   {A, E, K, M}
T4   {C, K, M, U, Y}
T5   {C, E, I, K, O, O}
Que-12: What are the approaches for high dimensional data clustering? (2020-21, 2)

Que-13: What are the applications of frequent itemset analysis? Explain a simple and randomized algorithm to find most frequent itemsets using at most two passes. (2020-21, 2)

Que-14: Explain Toivonen’s algorithm. (2020-21, 2)

Que-15: Discuss the basic subspace clustering approaches. (2020-21, 10)

Que-16: How can you improve the efficiency of Apriori-based mining? (2020-21, 10)

Que-17: What are the requirements for clustering in data mining?

Que-18: Explain the SON algorithm using MapReduce.

Que-19: What is PCY? Apply the PCY algorithm on the following transactions to find the candidate sets (frequent sets), with threshold minimum value 3 and hash function (i*j) mod 10:
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {2, 4, 6}

Que-20: Explain clustering in Non-Euclidean spaces and clustering for streams with parallelism.
SOLUTION
Ques.1 Explain K-Means algorithms. When would you use K-Means? State whether the statement “K-Means has an assumption each cluster has roughly equal number of observations” is true or false. Justify your answer.

K-Means Algorithm is an unsupervised machine learning algorithm used for clustering data into
k distinct groups. The steps involved in K-Means are:

1. Initialization: Select k initial centroids randomly.


2. Assignment: Assign each data point to the nearest centroid, forming k clusters.
3. Update: Calculate the new centroids as the mean of the data points in each cluster.
4. Repeat: Repeat the assignment and update steps until the centroids stabilize (i.e., there is
no significant change in their positions) or a maximum number of iterations is reached.
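A minimal usage sketch in Python of these four steps (scikit-learn assumed available; the small 2-D array X is made up purely for illustration):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])   # illustrative data

# n_clusters = k; n_init random initializations; assign/update repeated until stable
kmeans = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)
labels = kmeans.fit_predict(X)         # cluster index of each point
print(labels)
print(kmeans.cluster_centers_)         # final centroids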

When to Use K-Means:

● When you need to group data into k distinct clusters.


● When the data is numerical and has low to moderate dimensions.
● When you want a fast and computationally efficient clustering method.
● When the clusters are expected to be spherical and of relatively similar size.

Statement Analysis:

"K-Means has an assumption each cluster has roughly equal number of observations"

Answer: False

Justification:

● K-Means does not assume that each cluster must have an equal number of observations.
Instead, it minimizes the within-cluster variance (intra-cluster distance) without
considering the size of the clusters.
● However, K-Means tends to form clusters of similar size due to its reliance on
minimizing variance. This means it may struggle with data where clusters have very
different sizes or densities.
● Example: If one cluster is significantly denser or larger, K-Means may split it into
multiple clusters or merge smaller clusters, leading to poor performance.

Que2. Brief about the working of CLIQUE algorithm.

CLIQUE (Clustering in QUEst) is a grid-based clustering algorithm specifically designed for


high-dimensional data. It combines clustering and subspace discovery to identify dense regions
in subspaces of the data.
Working of the CLIQUE Algorithm

1. Input:
○ A dataset in a high-dimensional space.
○ Parameters ξ (the number of intervals per dimension, i.e., the grid size) and τ (the density threshold).
2. Partitioning the Data Space:
○ The data space is divided into non-overlapping rectangular cells (or grids) based on the parameter ξ.
○ Each dimension is divided into ξ equal-width intervals, forming a grid structure.
3. Density Calculation:
○ The density of each cell is calculated based on the number of data points within it.
○ A cell is considered "dense" if the number of points in it meets or exceeds the density threshold τ.
4. Subspace Identification:
○ CLIQUE identifies dense cells in lower-dimensional subspaces.
○ Dense regions in subspaces are combined to form clusters in higher dimensions.
5. Cluster Formation:
○ Adjacent dense cells are merged to form clusters.
○ The merging process ensures that clusters are represented as unions of dense cells.
6. Output:
○ The algorithm outputs clusters in the original data space.
○ It also identifies the subspaces where the clusters are dense, providing
interpretability.
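A minimal Python sketch of the first two stages (grid partitioning and dense-cell detection in candidate subspaces); here xi is the number of intervals per dimension and tau the density threshold, and the random data is illustrative only:

import numpy as np
from itertools import combinations
from collections import Counter

def dense_cells(X, dims, xi, tau):
    # Count points per grid cell in the subspace `dims`; keep cells with >= tau points.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    widths = (maxs - mins) / xi
    counts = Counter()
    for row in X:
        cell = tuple(min(int((row[d] - mins[d]) / widths[d]), xi - 1) for d in dims)
        counts[cell] += 1
    return {cell: c for cell, c in counts.items() if c >= tau}

X = np.random.rand(200, 3)                        # 200 points in 3 dimensions
for k in (1, 2):                                  # candidate subspaces of size 1 and 2
    for dims in combinations(range(X.shape[1]), k):
        dense = dense_cells(X, dims, xi=5, tau=12)
        if dense:
            print("subspace", dims, "has", len(dense), "dense cells")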

Features of CLIQUE:

1. Scalability: Handles large datasets efficiently by working in a grid structure.


2. High-Dimensional Data: Designed to perform well with high-dimensional data by
focusing on dense regions in subspaces.
3. Interpretability: Identifies relevant subspaces, making the clusters easy to interpret.
4. Automatic Detection: Automatically determines the number of clusters.

Example Use Case:

CLIQUE is often used in bioinformatics or market basket analysis, where high-dimensional data
(e.g., gene expression levels or product purchases) needs clustering with interpretability in
subspaces.

Que3.Explain the Principle behind Hierarchical clustering Technique.

Hierarchical clustering is a technique in unsupervised machine learning used to group data into a
hierarchy of clusters. The technique operates by either merging smaller clusters into larger
ones (agglomerative) or splitting larger clusters into smaller ones (divisive). The result is a
tree-like structure called a dendrogram, which visually represents the data hierarchy.

Hierarchical clustering is a powerful tool for grouping data, particularly in exploratory data
analysis. Its principle of grouping based on proximity and its flexibility in handling arbitrary
cluster shapes make it valuable, but its computational cost and sensitivity to noise should be
considered when working with large or noisy datasets.

Principle of Hierarchical Clustering

The fundamental principle of hierarchical clustering is based on proximity or similarity


between data points. The algorithm organizes data points into a hierarchy by iteratively
combining or splitting clusters based on their pairwise distances.

The process is guided by:

1. Similarity Measure: A metric to calculate how close two data points or clusters are.
Common distance metrics include:
○ Euclidean Distance: Straight-line distance between two points.
○ Manhattan Distance: Sum of absolute differences between coordinates.
○ Cosine Similarity: Measures the cosine of the angle between two vectors.
2. Linkage Criteria: Determines how distances between clusters are calculated. Common
linkage methods include:
○ Single Linkage: Distance between the closest points in two clusters.
○ Complete Linkage: Distance between the farthest points in two clusters.
○ Average Linkage: Average distance between all points in two clusters.
○ Centroid Linkage: Distance between the centroids of two clusters.

Types of Hierarchical Clustering

1. Agglomerative (Bottom-Up):
○ Starts with each data point as its own cluster.
○ Iteratively merges the two closest clusters until all data points are in a single
cluster.
○ Example: Imagine grouping friends based on proximity. Start with each friend as
an individual group, then merge the closest groups.

Initially consider every data point as an individual Cluster and at every step, merge the nearest
pairs of the cluster. (It is a bottom-up method). At first, every dataset is considered an individual
entity or cluster. At every iteration, the clusters merge with different clusters until one cluster is
formed.
The algorithm for Agglomerative Hierarchical Clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters which are highly similar or close to each other.
4. Recalculate the proximity matrix for the newly formed cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.

Let’s see the graphical representation of this algorithm using a dendrogram


● Step-1: Consider each alphabet as a single cluster and calculate the distance of one
cluster from all the other clusters.
● Step-2: In the second step comparable clusters are merged together to form a single
cluster. Let’s say cluster (B) and cluster (C) are very similar to each other therefore
we merge them in the second step similarly to cluster (D) and (E) and at last, we get
the clusters [(A), (BC), (DE), (F)]
● Step-3: We recalculate the proximity according to the algorithm and merge the two
nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
● Step-4: Repeating the same process; The clusters DEF and BC are comparable and
merged together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
● Step-5: At last, the two remaining clusters are merged together to form a single
cluster [(ABCDEF)].

Steps in Agglomerative Hierarchical Clustering

1. Initialization:
○ Treat each data point as an individual cluster.
2. Calculate Pairwise Distances:
○ Compute the distance between every pair of clusters using a distance metric.
3. Merge Clusters:
○ Identify the two clusters with the smallest distance and merge them into a single cluster.
4. Update Distances:
○ Recalculate distances between the newly formed cluster and all other clusters using the linkage criterion.
5. Repeat:
○ Repeat the merge and update steps until all data points are merged into a single cluster.
6. Output:
○ A dendrogram showing the hierarchy of clusters.

2. Divisive (Top-Down):

We can say that Divisive Hierarchical clustering is precisely the opposite of Agglomerative
Hierarchical clustering. In Divisive Hierarchical clustering, we take into account all of the data
points as a single cluster and in every iteration, we separate the data points from the clusters
which aren’t comparable. In the end, we are left with N clusters

○ Starts with all data points in a single cluster.


○ Recursively splits clusters into smaller clusters until each data point forms its own
cluster.
○ Example: Imagine dividing a group of people into subgroups based on shared
interests.

Example

Consider a dataset with five points: A, B, C, D, and E.

1. Start with each point as its own cluster.


2. Calculate distances between all points.
3. Merge the two closest points (e.g., A and B).
4. Recalculate distances and merge the next closest clusters.
5. Continue until all points are in one cluster.
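A minimal Python sketch of this agglomerative procedure (SciPy assumed available; the five 2-D points stand in for A-E and are illustrative):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 5.2], [9, 9]])   # A, B, C, D, E

dists = pdist(points, metric='euclidean')      # pairwise distances (condensed form)
Z = linkage(dists, method='single')            # single linkage; 'complete'/'average' also work
print(Z)                                       # merge history, i.e., the dendrogram as a matrix

labels = fcluster(Z, t=2, criterion='distance')   # cut the dendrogram at distance 2
print(labels)                                     # cluster label of each point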
Advantages of Hierarchical Clustering

1. No Predefined k: Does not require specifying the number of clusters in advance.


2. Dendrogram Visualization: Provides a visual representation of data relationships.
3. Handles Arbitrary Shapes: Works well for non-spherical clusters.

Limitations

1. Scalability: Computationally expensive for large datasets (time complexity O(n³)).


2. Sensitivity to Noise: Outliers can distort clustering results.
3. Fixed Clusters: Once merged or split, clusters cannot be re-evaluated

Que4. Define Lift in Association Data Mining.

Lift is a measure used in association rule mining to evaluate the strength of an association rule. It
compares the likelihood of items being purchased together to their individual likelihoods of
being purchased independently.

Formula for Lift:

Lift(A → B) = Support(A ∩ B) / (Support(A) × Support(B))

● Support(A ∩ B): Joint probability of A and B occurring together.
● Support(A): Probability of A occurring.
● Support(B): Probability of B occurring.

Interpretation:

● Lift > 1: A and B are positively associated (they occur together more often than expected
by chance).
● Lift = 1: A and B are independent.
● Lift < 1: A and B are negatively associated (they occur together less often than expected
by chance).

Example:

If A=Milk and B=Bread

If Lift =2, it means customers buying milk are twice as likely to buy bread compared to random
chance.
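A small Python helper that evaluates the formula above on a list of transactions (the transactions here are illustrative, not from the question bank):

def support(transactions, itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(transactions, A, B):
    return support(transactions, set(A) | set(B)) / (support(transactions, A) * support(transactions, B))

txns = [{"Milk", "Bread"}, {"Milk", "Bread", "Eggs"}, {"Milk"}, {"Bread"}, {"Eggs"}]
print(lift(txns, {"Milk"}, {"Bread"}))   # about 1.11 > 1: Milk and Bread co-occur more than chance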
Qu5. What are the advantages of the PCY algorithm over the Apriori Algorithm?

The PCY (Park-Chen-Yu) Algorithm is an improved version of the Apriori algorithm designed to
handle large datasets more efficiently by reducing memory usage and computational overhead.

Advantages of PCY Algorithm:

1. Efficient Memory Usage:


○ The PCY algorithm uses a hash table to count item pairs in the first pass, storing
only hashed bucket counts instead of all candidate pairs.
○ This reduces the memory required compared to the Apriori algorithm, which
stores all candidate pairs explicitly.
2. Reduced Candidate Pairs:
○ By using a hash table, PCY eliminates non-frequent pairs early in the process,
reducing the number of candidate pairs that need to be generated and tested in
subsequent passes.
3. Fewer Passes Over Data:
○ The PCY algorithm minimizes the number of passes required to generate frequent
itemsets by leveraging the hashed buckets.
4. Scalability:
○ It is more scalable than Apriori for datasets with a large number of items or
transactions due to its efficient memory and computation management.
5. Handles Sparsity:
○ The hashing technique efficiently handles sparse datasets where many item pairs
are infrequent.

Comparison:

● Apriori generates and stores all candidate pairs, making it memory-intensive.


● PCY reduces memory requirements by storing only bucket counts and filtering
non-frequent pairs early.

Que6.Write short notes on Market Based modeling explaining market basket working rules
and types of market basket analysis?

Market-Based Modeling:

Market-based modeling is a technique used in data mining and machine learning to analyze
consumer behavior. It involves studying patterns in transactional data to identify relationships
between items, helping businesses optimize sales strategies.

Market Basket Analysis (MBA):


Market Basket Analysis is a popular application of market-based modeling that identifies
relationships between items frequently purchased together.

Working Rules of Market Basket Analysis:

1. Association Rules:
○ Rules like A→B indicate that if a customer buys A, they are likely to buy B.
○ Metrics used:
■ Support: Frequency of the rule occurring in the dataset.
■ Confidence: Likelihood of B being purchased given A.
■ Lift: Strength of the association between A and B.
2. Steps:
○ Analyze transaction data to find frequent itemsets.
○ Generate association rules from these itemsets.
○ Evaluate the rules using metrics like support, confidence, and lift.

Types of Market Basket Analysis:

1. Descriptive MBA:
○ Identifies patterns in historical data.
○ Example: Customers buying bread are also likely to buy butter.
2. Predictive MBA:
○ Predicts future purchasing behavior based on past transactions.
○ Example: Recommending complementary products to customers.
3. Prescriptive MBA:
○ Provides actionable recommendations to improve sales.
○ Example: Suggesting promotional offers for frequently purchased combinations.

Applications:

● Retail: Optimizing store layouts by placing related products together.


● E-commerce: Recommending products to customers (e.g., "Customers who bought this
also bought").
● Marketing: Designing cross-selling and up-selling strategies.

Conclusion:

Market-based modeling and MBA are essential tools for businesses to understand consumer
behavior, improve customer experience, and increase revenue through data-driven decisions.
Que.7 Find all the association rules from the given transactions, with minimum support = 50% and minimum confidence = 50%, using the Apriori algorithm.

Association Rule Mining Using Apriori Algorithm

Given transactions and parameters:

● Minimum Support = 50%


● Minimum Confidence = 50%

Step 1: Calculate Support for All Items


Item     Support Count   Support %
Beer     3               60%
Nuts     3               60%
Diaper   4               80%
Coffee   2               40%
Eggs     3               60%
Milk     2               40%

Frequent 1-itemsets (Support ≥ 50%):

● {Beer}, {Nuts}, {Diaper}, {Eggs}


Step 2: Generate Candidate 2-itemsets

Possible 2-itemsets from frequent 1-itemsets:

● {Beer, Nuts}, {Beer, Diaper}, {Beer, Eggs}, {Nuts, Diaper}, {Nuts, Eggs}, {Diaper,
Eggs}

Calculate support for each 2-itemset:

Itemset          Support Count   Support %
{Beer, Nuts}     1               20%
{Beer, Diaper}   3               60%
{Beer, Eggs}     1               20%
{Nuts, Diaper}   2               40%
{Nuts, Eggs}     2               40%
{Diaper, Eggs}   3               60%

Frequent 2-itemsets (Support ≥ 50%):

● {Beer, Diaper}, {Diaper, Eggs}

Step 3: Generate Candidate 3-itemsets

Possible 3-itemsets from frequent 2-itemsets:

● {Beer, Diaper, Eggs}

Calculate support for {Beer, Diaper, Eggs}:

Itemset                Support Count   Support %
{Beer, Diaper, Eggs}   1               20%

No frequent 3-itemsets (Support < 50%).

Step 4: Generate Association Rules

For each frequent itemset, generate rules and calculate confidence.

From {Beer, Diaper}:

● Rule: {Beer} → {Diaper}


○ Confidence = Support({Beer, Diaper}) / Support({Beer}) = 3/3 = 100%
● Rule: {Diaper} → {Beer}
○ Confidence = Support({Beer, Diaper}) / Support({Diaper}) = 3/4 = 75%

From {Diaper, Eggs}:

● Rule: {Diaper} → {Eggs}


○ Confidence = Support({Diaper, Eggs}) / Support({Diaper}) = 3/4 = 75%
● Rule: {Eggs} → {Diaper}
○ Confidence = Support({Diaper, Eggs}) / Support({Eggs}) = 3/3 = 100%

Final Association Rules (Confidence ≥ 50%):

1. {Beer} → {Diaper} (Confidence = 100%)


2. {Diaper} → {Beer} (Confidence = 75%)
3. {Diaper} → {Eggs} (Confidence = 75%)
4. {Eggs} → {Diaper} (Confidence = 100%)
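Since the question's transaction table is not reproduced above, the Python sketch below uses a hypothetical five-transaction list purely to illustrate the same steps (frequent 1-itemsets, candidate 2-itemsets, then rules filtered by confidence); it only goes up to pairs, which is all this example needs:

from itertools import combinations

def apriori_rules(transactions, min_support, min_confidence):
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    f1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

    # candidate 2-itemsets from frequent 1-itemsets, kept only if frequent
    f2 = [a | b for a, b in combinations(f1, 2) if support(a | b) >= min_support]

    rules = []
    for itemset in f2:
        for antecedent in itemset:
            a = frozenset([antecedent])
            conf = support(itemset) / support(a)
            if conf >= min_confidence:
                rules.append((set(a), set(itemset - a), conf))
    return rules

txns = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Diaper"}, {"Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"}, {"Beer", "Diaper", "Eggs"}]   # hypothetical transactions
for a, b, conf in apriori_rules(txns, 0.5, 0.5):
    print(a, "->", b, f"(confidence = {conf:.0%})")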

Que.8 How does the K-means algorithm work? Write k-means algorithm for partitioning.

K-Means is an iterative clustering algorithm that partitions a dataset into k distinct clusters
based on the proximity of data points. It minimizes the variance within clusters and maximizes
the variance between clusters.

Principle of K-Means Algorithm

1. Clustering Objective:
○ Partition the dataset into k clusters such that the sum of squared distances between data points and the centroid of their assigned cluster is minimized.
○ Objective Function:
J = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − μ_i||²
○ where:
■ C_i: cluster i,
■ μ_i: centroid of cluster i,
■ x: a data point.
2. Centroid:
○ The centroid is the mean of all data points in a cluster.
3. Iterative Process:
○ K-Means alternates between assigning data points to the nearest centroid and
updating centroids based on the mean of assigned points until convergence.

Steps of the K-Means Algorithm

1. Initialization:
○ Select k initial centroids randomly from the dataset.
2. Assignment Step:
○ Assign each data point to the cluster whose centroid is nearest (based on a
distance metric like Euclidean distance).
3. Update Step:
○ Recalculate the centroids of each cluster as the mean of all data points assigned to
that cluster.
4. Convergence:
○ Repeat the assignment and update steps until:
■ The centroids no longer change significantly.
■ A maximum number of iterations is reached.
■ The change in the objective function J is below a threshold.
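A minimal from-scratch Python sketch of the partitioning steps above (NumPy assumed; the six 2-D points are illustrative):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # Step 1: random initialization
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                  # Step 4: convergence check
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels, centers)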

Advantages of K-Means

1. Simple and easy to implement.


2. Scales well to large datasets.
3. Efficient for spherical clusters.

Limitations of K-Means

1. Sensitive to the initial choice of centroids.


2. Struggles with non-spherical clusters.
3. Requires the number of clusters k to be predefined.
Applications

1. Customer segmentation in marketing.


2. Image compression.
3. Anomaly detection.

Que.9 Cluster the following eight points (with (x, y) representing locations) into three
clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9) .
Initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between
two points a = (x1, y1) and b = (x2, y2) is defined as- Ρ (a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.

Iteration-01:

● We calculate the distance of each point from each of the centers of the three
clusters.
● The distance is calculated by using the given distance function.

The following illustration shows the calculation of distance between point A1(2, 10)
and each of the center of the three clusters-

Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)

= |x2 – x1| + |y2 – y1|

= |2 – 2| + |10 – 10|

=0

Calculating Distance Between A1(2, 10) and C2(5, 8)-


Ρ(A1, C2)

= |x2 – x1| + |y2 – y1|


= |5 – 2| + |8 – 10|

=3+2

=5

Calculating Distance Between A1(2, 10) and C3(1, 2)-

Ρ(A1, C3)

= |x2 – x1| + |y2 – y1|

= |1 – 2| + |2 – 10|

=1+8

=9

In the similar manner, we calculate the distance of other points from each of the center of
the three clusters.

Next,

● We draw a table showing all the results.


● Using the table, we decide which point belongs to which cluster.
● The given point belongs to that cluster whose center is nearest to it.

Given Points Distance from Distance from Distance from Point belongs
center (2, 10) of center (5, 8) of center (1, 2) of to Cluster
Cluster-01 Cluster-02 Cluster-03

A1(2, 10) 0 5 9 C1

A2(2, 5) 5 6 4 C3

A3(8, 4) 12 7 9 C2
A4(5, 8) 5 0 10 C2

A5(7, 5) 10 5 9 C2

A6(6, 4) 10 5 7 C2

A7(1, 2) 9 10 0 C3

A8(4, 9) 3 2 10 C2

From here, New clusters are-

Cluster-01:

First cluster contains points-

● A1(2, 10)

Cluster-02

Second cluster contains points-

● A3(8, 4)
● A4(5, 8)
● A5(7, 5)
● A6(6, 4)
● A8(4, 9)

Cluster-03:

Third cluster contains points-

● A2(2, 5)
● A7(1, 2)
Now,

● We re-compute the new cluster centers.


● The new cluster center is computed by taking mean of all the points contained in
that cluster.

For Cluster-01:

● We have only one point A1(2, 10) in Cluster-01.


● So, cluster center remains the same.

For Cluster-02:

Center of Cluster-02

= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)

= (6, 6)

For Cluster-03:

Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

This is completion of Iteration-01.

Iteration-02:

● We calculate the distance of each point from each of the center of the three
clusters.
● The distance is calculated by using the given distance function.

The following illustration shows the calculation of distance between point A1(2, 10) and each of the
center of the three clusters-

Calculating Distance Between A1(2, 10) and C1(2, 10)


Ρ(A1, C1)

= |x2 – x1| + |y2 – y1|

= |2 – 2| + |10 – 10|

=0

Calculating Distance Between A1(2, 10) and C2(6, 6)-


Ρ(A1, C2)

= |x2 – x1| + |y2 – y1|

= |6 – 2| + |6 – 10|

=4+4

=8

Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-

Ρ(A1, C3)

= |x2 – x1| + |y2 – y1|

= |1.5 – 2| + |3.5 – 10|

= 0.5 + 6.5

=7

In the similar manner, we calculate the distance of other points from each of the center of
the three clusters.
Next,

● We draw a table showing all the results.


● Using the table, we decide which point belongs to which cluster.
● The given point belongs to that cluster whose center is nearest to it.

Given Points Distance from Distance from Distance from Point belongs
center (2, 10) of center (6, 6) of center (1.5, 3.5) of to Cluster
Cluster-01 Cluster-02 Cluster-03

A1(2, 10) 0 8 7 C1

A2(2, 5) 5 5 2 C3

A3(8, 4) 12 4 7 C2

A4(5, 8) 5 3 8 C2

A5(7, 5) 10 2 7 C2

A6(6, 4) 10 2 5 C2

A7(1, 2) 9 9 2 C3

A8(4, 9) 3 5 8 C1

From here, New clusters are-

Cluster-01:

First cluster contains points-


● A1(2, 10)
● A8(4, 9)

Cluster-02:

Second cluster contains points-

● A3(8, 4)
● A4(5, 8)
● A5(7, 5)
● A6(6, 4)

Cluster-03:

Third cluster contains points-

● A2(2, 5)
● A7(1, 2)

Now,

● We re-compute the new cluster centers.


● The new cluster center is computed by taking mean of all the points contained in
that cluster.

For Cluster-01:

Center of Cluster-01

= ((2 + 4)/2, (10 + 9)/2)

= (3, 9.5)

For Cluster-02:
Center of Cluster-02

= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)

= (6.5, 5.25)

For Cluster-03:

Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

This is completion of Iteration-02.

After second iteration, the center of the three clusters are-

● C1(3, 9.5)
● C2(6.5, 5.25)
● C3(1.5, 3.5)
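A short Python check (NumPy assumed) that replays the two iterations above with the given Manhattan distance and the fixed initial centers, and reproduces these final centers:

import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)                  # A1, A4, A7

for _ in range(2):                                                          # two iterations
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)    # |x2 - x1| + |y2 - y1|
    labels = dists.argmin(axis=1)                                           # nearest center
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])

print(centers)   # [[3.  9.5 ] [6.5 5.25] [1.5 3.5 ]]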

Que.10 Explain SON algorithm to find all or most frequent itemsets using at most
two passes.

The SON (Savasere, Omiecinski, and Navathe) algorithm is a distributed and scalable approach
for finding frequent item sets in large datasets, especially when the dataset is too large to fit into
memory. It uses the MapReduce framework to process data in parallel and requires at most two
passes over the dataset.

Principle of the SON Algorithm

The SON algorithm leverages the fact that a subset of frequent itemsets in the entire dataset must
also be frequent in at least one of the partitions of the dataset. It operates in two phases:

1. Local Computation:
○ Break the dataset into smaller, manageable chunks (partitions).
○ Identify frequent itemsets within each partition (using a lower threshold for local
support).
2. Global Validation:
○ Combine the results from all partitions.
○ Validate which itemsets are globally frequent by counting their occurrences across
the entire dataset.

Steps of the SON Algorithm

Step 1: Partition the Dataset

● Divide the dataset D into smaller partitions D1,D2,…,Dn that can fit into memory.

Step 2: First Pass (Local Frequent Itemsets)

● For each partition:


1. Apply a frequent itemset mining algorithm (e.g., Apriori or FP-Growth) with a reduced support threshold support_local = support_global × (|D_i| / |D|).
2. Identify locally frequent itemsets within the partition.

Step 3: Aggregate Local Results

● Combine the locally frequent itemsets from all partitions into a single set of candidate
itemsets.

Step 4: Second Pass (Global Validation)

● Make a second pass over the dataset to count the occurrences of the candidate itemsets
across the entire dataset.
● Identify itemsets that meet the global support threshold as globally frequent itemsets.

Key Features of SON Algorithm

1. Two-Pass Approach:
○ First pass: Identify locally frequent itemsets.
○ Second pass: Validate globally frequent itemsets.
2. Distributed Processing:
○ Efficiently handles large datasets using parallel computation (e.g., MapReduce).
3. Memory Efficiency:
○ Processes data in smaller partitions that fit into memory.

Example
Dataset (Transactions):

T1={A,B,C}, T2={A,C}, T3={A,B}, T4={B,C}, T5={A,B,C}

Global Support Threshold:

● support_global = 50%.

Step 1: Partition the Dataset

● Partition D into D1 = {T1, T2, T3} and D2 = {T4, T5}.

Step 2: Local Frequent Itemsets

● For D1: Frequent itemsets = {A, B, C, AB, AC}
● For D2: Frequent itemsets = {A, B, C, AB, BC}

Step 3: Aggregate Candidates

● Candidate itemsets = {A,B,C,AB,AC,BC}.

Step 4: Global Validation

● Count occurrences of candidates across D:


○ A:4,B:4,C:4,AB:3,AC:3,BC:3
● Frequent itemsets = {A,B,C,AB,AC,BC}

Advantages of SON Algorithm

1. Scalability:
○ Handles massive datasets by leveraging distributed computation.
2. Efficiency:
○ Processes data in partitions, reducing memory usage.
3. Parallelization:
○ Suitable for frameworks like MapReduce.

Applications

1. Market basket analysis.


2. Fraud detection.
3. Recommendation systems.

The SON algorithm is a powerful method for mining frequent itemsets efficiently from large
datasets, making it an essential tool in big data analytics.
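A minimal single-machine Python sketch of the two passes on the toy dataset above; the brute-force local miner is just a stand-in for Apriori or FP-Growth on one partition:

from itertools import combinations
from collections import Counter

def local_frequent(partition, min_count, max_size=3):
    # Brute-force frequent itemsets within one partition (fine for tiny chunks).
    counts = Counter()
    for t in partition:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c >= min_count}

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "B"}, {"B", "C"}, {"A", "B", "C"}]
global_support = 0.5
partitions = [transactions[:3], transactions[3:]]                 # D1, D2

# Pass 1: locally frequent itemsets, threshold scaled to the partition size
candidates = set()
for part in partitions:
    candidates |= local_frequent(part, min_count=global_support * len(part))

# Pass 2: count every candidate over the full dataset, keep the globally frequent ones
counts = Counter()
for t in transactions:
    for c in candidates:
        if c <= t:
            counts[c] += 1
frequent = sorted((set(c) for c in candidates
                   if counts[c] >= global_support * len(transactions)), key=len)
print(frequent)    # the six itemsets A, B, C, AB, AC, BC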
Que.11

Transaction ID   Items
T1   {E, K, M, N, O, Y}
T2   {D, E, K, N, O, Y}
T3   {A, E, K, M}
T4   {C, K, M, U, Y}
T5   {C, E, I, K, O, O}

Explain the FP-Growth algorithm. Solve this using the Frequent Pattern Tree.

The two primary drawbacks of the Apriori Algorithm are:


1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.
These two properties inevitably make the algorithm slower. To overcome these
redundant steps, a new association-rule mining algorithm was developed named
Frequent Pattern Growth Algorithm. It overcomes the disadvantages of the Apriori
algorithm by storing all the transactions in a Trie Data Structure.

The above-given data is a hypothetical dataset of transactions with each letter representing an
item. The frequency of each individual item is computed:-

Item Frequency
A 1

C 2

D 1

E 4

I 1

K 5

M 3

N 2

O 4

U 1

Y 3

Let the minimum support be 3. A Frequent Pattern set is built which will contain all the
elements whose frequency is greater than or equal to the minimum support. These elements are
stored in descending order of their respective frequencies. After insertion of the relevant items,
the set L looks like this:-

L = {K : 5, E : 4, M : 3, O : 4, Y : 3}

Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the
Frequent Pattern set and checking if the current item is contained in the transaction in question.
If the current item is contained, the item is inserted in the Ordered-Item set for the current
transaction. The following table is built for all the transactions:
Transaction ID   Items                  Ordered-Item Set
T1   {E, K, M, N, O, Y}   {K, E, M, O, Y}
T2   {D, E, K, N, O, Y}   {K, E, O, Y}
T3   {A, E, K, M}         {K, E, M}
T4   {C, K, M, U, Y}      {K, M, Y}
T5   {C, E, I, K, O, O}   {K, E, O}

Now, all the Ordered-Item sets are inserted into a Trie Data Structure.

a) Inserting the set {K, E, M, O, Y}:

Here, all the items are simply linked one after the other in the order of occurrence in the set, and the support count of each item is initialized as 1.

b) Inserting the set {K, E, O, Y}:

Till the insertion of the elements K and E, simply the support count is increased by 1. On
inserting O we can see that there is no direct link between E and O, therefore a new node
for the item O is initialized with the support count as 1 and item E is linked to this new
node. On inserting Y, we first initialize a new node for the item Y with support count as 1
and link the new node of O with the new node of Y.
c) Inserting the set {K, E, M}:

Here simply the support count of each element is increased by 1.


d) Inserting the set {K, M, Y}:
Similar to step b), first the support count of K is increased, then new nodes for M and Y are
initialized and linked accordingly.

e) Inserting the set {K, E, O}:


Here simply the support counts of the respective elements are increased. Note that the support
count of the new node of item O is increased.
Now, for each item, the Conditional Pattern Base is computed which is path labels of all the
paths which lead to any node of the given item in the frequent-pattern tree. Note that the items in
the below table are arranged in the ascending order of their frequencies.

Now for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set
of elements that is common in all the paths in the Conditional Pattern Base of that item and
calculating its support count by summing the support counts of all the paths in the Conditional
Pattern Base.

From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the below table.
For each row, two types of association rules can be inferred. For example, for the first row, which contains the elements K and Y, the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both rules is calculated, and the one with confidence greater than or equal to the minimum confidence value is retained.
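A compact Python sketch that builds the FP-tree for the ordered-item sets above (min support = 3). As an assumption, duplicate items within a transaction are counted once here, so O gets support 3 rather than 4; mining of the conditional pattern bases is omitted for brevity:

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

transactions = [set("EKMNOY"), set("DEKNOY"), set("AEKM"), set("CKMUY"), set("CEIKO")]
min_support = 3

freq = Counter(i for t in transactions for i in t)
order = {i: f for i, f in freq.items() if f >= min_support}         # K, E, M, O, Y survive

root, header = Node(None, None), {}                                  # header: item -> its tree nodes
for t in transactions:
    # keep only frequent items, ordered by descending frequency (ties alphabetically)
    items = sorted((i for i in t if i in order), key=lambda i: (-order[i], i))
    node = root
    for item in items:
        if item not in node.children:                                # new branch
            child = Node(item, node)
            node.children[item] = child
            header.setdefault(item, []).append(child)
        else:                                                         # shared prefix: bump count
            node.children[item].count += 1
        node = node.children[item]

# Support of each item as stored in the tree (sum over its header-table nodes)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})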

Que.12 What are the approaches for high dimensional data clustering?

High-dimensional data clustering refers to grouping data points in spaces with a large number of
dimensions (features). High dimensionality poses challenges like the "curse of dimensionality,"
which can obscure meaningful patterns. Here are the key approaches:

1. Dimensionality Reduction-Based Approaches

● Principal Component Analysis (PCA): Reduces dimensions by finding the principal


components that retain most of the variance.
● t-SNE (t-Distributed Stochastic Neighbor Embedding): Projects high-dimensional data
into lower dimensions while preserving neighborhood structures.
● Autoencoders: Neural networks that learn compressed representations of
high-dimensional data.
● Feature Selection: Selects the most relevant features using techniques like mutual
information or correlation analysis.

2. Subspace Clustering

● Identifies clusters within subsets of dimensions instead of the entire feature space.
● Examples:
○ CLIQUE: Combines grid-based and density-based clustering in subspaces.
○ PROCLUS: Finds medoid-based clusters in subspaces.

3. Spectral Clustering

● Uses the eigenvalues of a similarity matrix to perform clustering in a lower-dimensional


embedding space.
● Effective in capturing non-linear relationships in high dimensions.

4. Density-Based Clustering

● DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Identifies


clusters as dense regions separated by sparse areas.
● OPTICS (Ordering Points to Identify the Clustering Structure): Extends DBSCAN to
handle varying density clusters.

5. Model-Based Clustering

● Uses probabilistic models like Gaussian Mixture Models (GMM) to identify clusters.
● Assumes data is generated from a mixture of underlying probability distributions.

6. High-Dimensional Clustering Algorithms

● k-Means Variants: Adapted for high-dimensional data by using cosine similarity instead
of Euclidean distance.
● Hierarchical Clustering: Effective for smaller datasets, but suffers from scalability issues
in high dimensions.

7. Kernel Methods

● Transforms data into a higher-dimensional space using kernel functions to make it


linearly separable.

8. Ensemble Clustering

● Combines multiple clustering results to improve robustness and accuracy in high


dimensions.

Challenges

● Distance Measures: Euclidean distance becomes less meaningful in high dimensions.


● Sparsity: Data points tend to become equidistant.
● Scalability: Algorithms need to handle computational complexity.
Applications

● Bioinformatics (e.g., gene expression analysis).


● Image and video segmentation.
● Text and document clustering.
● Market segmentation in high-dimensional consumer data.
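A minimal Python sketch of the dimensionality-reduction route listed above: project with PCA, then cluster in the reduced space (scikit-learn assumed; the data is synthetic):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                          # 200 points in 50 dimensions

X_reduced = PCA(n_components=5).fit_transform(X)        # keep the top 5 principal components
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:20])                                      # cluster labels in the reduced space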

Que.13 What are the applications of frequent itemset analysis? Explain a simple and
randomized algorithm to find most frequent itemsets using at most two passes.

Applications of Frequent Itemset Analysis

Frequent itemset analysis is widely used in discovering patterns and associations in data. Key
applications include:

1. Market Basket Analysis:


○ Identifies products frequently bought together.
○ Example: "Milk and bread" are often purchased together.
2. Recommendation Systems:
○ Suggests products or services based on frequent patterns in user behavior.
3. Fraud Detection:
○ Identifies unusual patterns in financial transactions.
4. Healthcare:
○ Discovers patterns in medical data, such as co-occurrence of symptoms or
treatments.
5. Web Usage Mining:
○ Analyzes clickstream data to understand user behavior.
6. Telecommunications:
○ Identifies frequently occurring patterns in network usage for optimization.

Randomized Algorithm for Frequent Itemsets (Two-Pass)

Objective: To find frequent itemsets (sets of items appearing together frequently) using at most
two passes over the data.

Algorithm:

1. Input:
○ A dataset D of transactions.
○ A threshold s (support count).
2. Pass 1:
○ Random Sampling:
■ Randomly sample a subset of transactions from D.
○ Frequent Itemset Mining:
■ Apply any frequent itemset mining algorithm (e.g., Apriori) on the
sampled data.
■ Record candidate frequent itemsets.
3. Pass 2:
○ Verify Candidate Itemsets:
■ For each candidate itemset, count its exact support in the entire dataset D.
○ Filter Frequent Itemsets:
■ Retain itemsets with support ≥s.

Key Features:

● The first pass reduces the number of candidate itemsets, making the second pass efficient.
● Random sampling ensures scalability.

Advantages:

● Reduces computational complexity by focusing only on candidates.


● Efficient for large datasets.

Limitations:

● May miss some frequent itemsets due to sampling errors.


● Requires careful tuning of sample size for accuracy.

Que.14 Explain Toivonen’s algorithm?

Toivonen’s algorithm is a randomized approach for frequent itemset mining, particularly useful for large datasets. It combines random sampling and verification to find frequent itemsets efficiently, and it minimizes the number of passes over the dataset, making it suitable for scenarios where scanning the dataset is expensive. Its use of sampling and verification reduces computational overhead, while the negative border ensures that no frequent itemsets are missed.

The algorithm works by:

1. Generating a sample of the dataset.


2. Mining the sample for frequent itemsets using a lower support threshold than the original
dataset.
3. Verifying the candidate frequent itemsets on the entire dataset to ensure correctness.

This approach reduces computational effort while ensuring accurate results.


Steps of Toivonen’s Algorithm

1. Input:

● Dataset D: Collection of transactions.


● Support threshold s: Minimum support for frequent itemsets.
● Sample fraction f: Fraction of data to sample (e.g., 1%).

2. Generate a Random Sample:

● Take a random sample S from D (e.g., 1% of D).


● Let s_s = s × f be the adjusted support threshold for the sample.

3. Find Frequent Itemsets in the Sample:

● Use a frequent itemset mining algorithm (e.g., Apriori) on S with the threshold s_s to identify the frequent itemsets F_s.

4. Generate Candidate Itemsets:

● Include both:
○ Frequent itemsets from Fs
○ Negative Border: Itemsets that are not frequent in the sample but all of whose immediate subsets are frequent (the immediate extensions of the itemsets in F_s).

5. Verify Candidate Itemsets on Full Dataset:

● Count the actual support of all candidate itemsets (frequent itemsets and negative border)
in D.
● Retain only those itemsets with support ≥s.

6. Handle Errors:

● If any item set from the negative border is frequent in D, the algorithm fails because the
sample was not representative.
○ Solution: Restart the algorithm with a larger sample

Negative Border:

● The negative border contains itemsets that are not frequent in the sample but could
potentially be frequent in the full dataset.
● Ensures that no frequent itemsets are missed.

Support Threshold Adjustment:

● The sample support threshold s_s is lower than s to compensate for the smaller size of the sample.

Advantages

1. Efficiency: Reduces the number of passes over the dataset (only one full pass is needed
for verification).
2. Scalability: Works well for large datasets by operating on a smaller sample.
3. Accuracy: Ensures correctness through verification.

Limitations

1. Random Sampling Bias: A poor sample may lead to errors, requiring a restart.
2. Negative Border Size: The size of the negative border can grow exponentially with the
number of items.
3. Restart Overhead: If the algorithm fails, it needs to restart with a larger sample,
increasing computational cost.

Example

Dataset D:
D={{A,B,C},{A,C},{A,B},{B,C},{A,B,C}}

Support Threshold s=60%


Frequent itemsets must appear in at least 3/5=60% of transactions.

Step 1: Sampling

● Random sample S = {{A, B, C}, {A, C}}.
● Adjusted threshold s_s = 60% × 0.5 = 30%.

Step 2: Frequent Itemsets in Sample

● Frequent itemsets in S: Fs={A,C},{A,B},{B,C}

Step 3: Negative Border

● Negative border: {A,B,C} (immediate superset not frequent in the sample).

Step 4: Verification

● Count support of {A,C},{A,B},{B,C},{A,B,C} in D:


○ {A,C}: 3/5, {A,B}: 3/5, {B,C}: 3/5, {A,B,C}: 2/5
Step 5: Result

● Frequent itemsets: {A,C}, {A,B}, {B,C}. The negative-border itemset {A,B,C} is not frequent in D, so the algorithm succeeds.
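A small Python sketch of the negative-border computation used above: itemsets that are not frequent in the sample but all of whose immediate subsets are. The inputs mirror the worked example:

from itertools import combinations

def negative_border(sample_frequent, items):
    frequent = {frozenset(f) for f in sample_frequent}
    frequent.add(frozenset())                       # the empty itemset is trivially frequent
    border = set()
    for size in range(1, len(items) + 1):
        for cand in map(frozenset, combinations(sorted(items), size)):
            if cand not in frequent and all(cand - {i} in frequent for i in cand):
                border.add(cand)
    return border

Fs = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
print([set(s) for s in negative_border(Fs, {"A", "B", "C"})])   # [{'A', 'B', 'C'}]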

Applications

1. Market Basket Analysis: Identifies frequent itemsets in transactional data.


2. Web Usage Mining: Analyzes user clickstream data for patterns.
3. Bioinformatics: Discovers frequent patterns in gene expression data.

Que.15 Discuss the basic subspace clustering approaches.

Subspace clustering is a method for clustering high-dimensional data by identifying clusters in


subsets of dimensions (subspaces) rather than the entire feature space. It is particularly useful for
datasets where clusters exist only in specific subspaces.

Basic Subspace Clustering Approaches

1. Grid-Based Approaches
○ Divide the feature space into grids and identify dense regions.
○ Example: CLIQUE (Clustering In QUEst):
■ Divides the data space into equal-sized cells.
■ Identifies dense cells (with a number of points exceeding a threshold).
■ Clusters are formed by merging dense cells.
2. Bottom-Up Approaches
○ Start with low-dimensional subspaces and iteratively extend to higher dimensions.
○ Example: Proclus:
■ Finds medoids in subspaces and assigns data points to clusters based on
proximity.
■ Works efficiently by limiting the number of subspaces considered.
3. Top-Down Approaches
○ Start with the full-dimensional space and iteratively remove irrelevant
dimensions.
○ Example: P3C (Projective Clustering with Constraints):
■ Uses statistical tests to identify relevant dimensions for each cluster.
4. Density-Based Approaches
○ Identify clusters as dense regions in subspaces.
○ Example: SUBCLU:
■ Extends DBSCAN to find density-connected clusters in subspaces.
5. Spectral-Based Approaches
○ Use eigenvectors of the similarity matrix to project data into a lower-dimensional
subspace.
○ Example: SSC (Sparse Subspace Clustering):
■ Uses sparse representation techniques to identify clusters.
6. Ensemble Subspace Clustering
○ Combines results from multiple subspace clustering algorithms to improve
robustness.
○ Example: ECSC (Ensemble Clustering of Subspaces):
■ Aggregates clustering results using consensus functions.

Que.16 How can you improve the efficiency of Apriori based mining.

The Apriori algorithm is a classic method for frequent itemset mining, but its performance can be
improved in several ways:

1. Reduce Candidate Generation

● Hash-Based Itemset Counting:


○ Use a hash table to count occurrences of candidate itemsets.
○ Reduces the number of candidate itemsets considered.
● Partitioning:
○ Divide the dataset into smaller partitions.
○ Find frequent itemsets in each partition and combine results.
● Direct Hashing and Pruning (DHP):
○ Uses a hash table to prune unnecessary candidate itemsets early.

2. Efficient Data Representation

● Transaction Reduction:
○ Remove transactions that do not contain frequent itemsets to reduce the search
space.
● Vertical Data Format:
○ Represent data as item-to-transaction mappings.
○ Makes it easier to intersect transactions for candidate generation.

3. Early Pruning of Candidate Itemsets

● Dynamic Itemset Counting:


○ Incrementally generate candidates during the scan instead of generating all
candidates at once.
● Reduced Minimum Support Threshold:
○ Use a slightly lower support threshold for initial passes to reduce the number of
candidates.

4. Parallel and Distributed Processing

● Implement Apriori using parallel or distributed computing frameworks like MapReduce


to handle large datasets efficiently.

5. Use Alternative Algorithms

● Algorithms like FP-Growth (Frequent Pattern Growth) avoid candidate generation


altogether, offering significant performance improvements over Apriori.

Que.17 What are the requirements for clustering in data mining?

Clustering in data mining involves grouping similar data points into clusters. To ensure effective
clustering, the following requirements must be met:

1. Scalability

● The algorithm must handle large datasets efficiently, both in terms of time and memory.

2. Ability to Handle Different Data Types

● Support for numeric, categorical, and mixed data types.


● Example: Clustering text data requires handling string similarities.

3. Discovery of Arbitrary Shaped Clusters

● Should identify clusters of arbitrary shapes (e.g., density-based methods like DBSCAN).

4. Robustness to Noise and Outliers

● Clustering results should not be overly sensitive to noise or outliers in the data.

5. High Dimensionality

● Must work effectively in high-dimensional spaces (e.g., subspace clustering for


dimensionality reduction).

6. Interpretability

● Results should be interpretable to facilitate decision-making.

7. Minimal Input Parameters


● Require minimal and intuitive input parameters (e.g., the number of clusters k in k-Means).

8. Ability to Handle Dynamic Data

● Should support incremental updates for streaming or dynamic data.

9. Domain-Specific Requirements

● Tailored to the application, such as time-series clustering or spatial clustering.


Que.18 Explain the SON algorithm using MAP REDUCE

The SON (Savasere, Omiecinski, and Navathe) algorithm is a distributed approach for finding
frequent itemsets using the MapReduce framework. It is designed to handle large datasets by
dividing the data into chunks and processing them independently.

Steps of the SON Algorithm

1. Input:

● Dataset D, minimum support threshold s, and number of mappers.

2. Pass 1 (Local Frequent Itemset Mining):

● Mapper Phase:
○ Each mapper processes a subset (chunk) of the dataset.
○ Within the chunk, find frequent itemsets using the Apriori algorithm with the local support threshold s × (chunk size / dataset size).
● Reducer Phase:
○ Combine all locally frequent itemsets to form a global candidate set.

3. Pass 2 (Global Verification):

● Mapper Phase:
○ Each mapper scans the dataset and counts the support of candidate itemsets.
● Reducer Phase:
○ Aggregates counts and filters itemsets with support ≥s
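A hedged PySpark sketch of these two passes (PySpark assumed available; the brute-force local miner stands in for Apriori on a single chunk, and the tiny dataset mirrors the example in Que.10):

from itertools import combinations
from collections import Counter
from pyspark import SparkContext

def local_frequent(partition, min_frac, max_size=2):
    txns = [set(t) for t in partition]
    counts = Counter()
    for t in txns:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[frozenset(combo)] += 1
    return [s for s, c in counts.items() if c >= min_frac * len(txns)]

sc = SparkContext(appName="SON")
data = sc.parallelize([{"A", "B", "C"}, {"A", "C"}, {"A", "B"}, {"B", "C"}, {"A", "B", "C"}], 2)
s, n = 0.5, data.count()

# Pass 1 - map: mine each chunk locally; reduce: union of the local results
candidates = data.mapPartitions(lambda p: local_frequent(list(p), s)).distinct().collect()
bc = sc.broadcast(candidates)

# Pass 2 - map: emit (candidate, 1) per containing transaction; reduce: sum and filter
frequent = (data.flatMap(lambda t: [(c, 1) for c in bc.value if c <= t])
                .reduceByKey(lambda a, b: a + b)
                .filter(lambda kv: kv[1] >= s * n)
                .collect())
print(frequent)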

Advantages

1. Scalability: Handles massive datasets using distributed processing.


2. Efficiency: Reduces communication overhead by processing chunks locally.

Applications

● Market basket analysis.


● Web usage mining.
● Social network analysis.
Que.19 What is PCY , Apply the PCY algorithm on the following transaction to
find the candidate sets (frequent sets) with threshold minimum value as 3 and
Hash function as (i*j) mod 10

T1 = {1, 2, 3}

T2 = {2, 3, 4}

T3 = {3, 4, 5}

T4 = {4, 5, 6}

T5 = {1, 3, 5}

T6 = {2, 4, 6}

T7 = {1, 3, 4}

T8 = {2, 4, 5}

T9 = {3, 4, 6}

T10 = {1, 2, 4}

T11 = {2, 3, 5}

T12 = {2, 4, 6}

Approach:
There are several steps to follow to build the candidate table.
Step 1: Find the frequency of each item and remove the length-1 candidate sets whose frequency is below the threshold.
Step 2: Transaction by transaction, create all the possible pairs and write their frequencies next to them. Note: a pair is listed only once; pairs already written for an earlier transaction are skipped.
Step 3: List all pairs whose frequency is greater than or equal to the threshold and apply the hash function to each. The hash value is the bucket number, which defines which bucket this particular pair is put into.
Step 4: In this last step, build a table with the following columns:
● Bit vector - 1 if the frequency of the candidate pair is greater than or equal to the threshold, otherwise 0 (mostly 1 here).
● Bucket number - found in the previous step.
● Highest support count - the frequency of the candidate pair, found in Step 2.
● Pairs - the candidate pair itself.
● Candidate set - if the bit vector is 1, the pair is listed here as a candidate.

Solution:
Step 1: Find the frequency of each element and remove the candidate set having length 1.

Items 1 2 3 4 5 6

Frequency 4 7 7 8 6 4

Step 2: One by one transaction-wise, create all the possible pairs and corresponding to it write
its frequency.

T1 {(1, 2), (1, 3)} 2,3

T2 {(2, 3), (2, 4)} 3,4

T3 {(3, 4),(3, 5)} 4,3

T4 {(4, 5) ,(4, 6)} 3,4

T5 {(1, 5)} 1
T6 {(2, 6)} 2

T7 {(1, 4)} 2

T8 {(2, 5)} 2

T9 {(3, 6)} 1

T10 -

T11 -

T12 -

Step 3: List all sets whose length is greater than the threshold and then apply Hash Functions. (It
gives us the bucket number).
Hash Function = ( i * j) mod 10
(1, 3) = (1*3) mod 10 = 3
(2,3) = (2*3) mod 10 = 6
(2,4) = (2*4) mod 10 = 8
(3,4) = (3*4) mod 10 = 2
(3,5) = (3*5) mod 10 = 5
(4,5) = (4*5) mod 10 = 0
(4,6) = (4*6) mod 10 = 4
Bucket No.

Bucket no. Pair

0 (4,5)

2 (3,4)

3 (1,3)

4 (4,6)

5 (3,5)

6 (2,3)
8 (2,4)

Step 4: Prepare the candidate set

Bit Vector   Bucket No.   Highest Support Count   Pairs   Candidate Set
1            0            3                       (4,5)   (4,5)
1            2            4                       (3,4)   (3,4)
1            3            3                       (1,3)   (1,3)
1            4            4                       (4,6)   (4,6)
1            5            3                       (3,5)   (3,5)
1            6            3                       (2,3)   (2,3)
1            8            4                       (2,4)   (2,4)
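A short Python sketch of the standard PCY passes on these twelve transactions. Pass 1 counts single items and hashes every pair occurrence into buckets with (i*j) mod 10; pass 2 counts only the candidate pairs. It tallies pairs per transaction directly, so a count may differ slightly from the hand-worked tables above:

from itertools import combinations
from collections import Counter

transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {1, 3, 5}, {2, 4, 6},
    {1, 3, 4}, {2, 4, 5}, {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {2, 4, 6},
]
threshold, n_buckets = 3, 10

# Pass 1: item counts plus a hashed bucket count for every pair occurrence
item_counts, bucket_counts = Counter(), Counter()
for t in transactions:
    item_counts.update(t)
    for i, j in combinations(sorted(t), 2):
        bucket_counts[(i * j) % n_buckets] += 1

frequent_items = {i for i, c in item_counts.items() if c >= threshold}
bitmap = {b for b, c in bucket_counts.items() if c >= threshold}      # frequent buckets

# Candidate pairs: both items frequent AND the pair hashes to a frequent bucket
candidates = {p for p in combinations(sorted(frequent_items), 2)
              if (p[0] * p[1]) % n_buckets in bitmap}

# Pass 2: count the candidate pairs exactly and keep those meeting the threshold
pair_counts = Counter()
for t in transactions:
    for p in combinations(sorted(t), 2):
        if p in candidates:
            pair_counts[p] += 1
print({p: c for p, c in pair_counts.items() if c >= threshold})
# the seven pairs listed in the table above: (1,3), (2,3), (2,4), (3,4), (3,5), (4,5), (4,6)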

Que.20 Explain clustering in Non-Euclidean Spaces and clustering for streams parallelism?

1. Clustering in Non-Euclidean Spaces

Clustering in Non-Euclidean spaces involves grouping data points where the distance metric is not based on Euclidean geometry. Non-Euclidean spaces arise in scenarios where the relationships between data points cannot be captured by simple geometric distances, such as high-dimensional, categorical, or graph-based data. Clustering in non-Euclidean spaces and for data streams addresses the challenges of complex data structures and real-time processing. These approaches leverage advanced distance metrics, incremental algorithms, and parallelism to handle large-scale, dynamic datasets efficiently.

Key Characteristics of Non-Euclidean Spaces

1. Complex Distance Metrics:


○ Use distance measures like Manhattan distance, Cosine similarity, Jaccard
similarity, or Graph distances.
2. Non-linear Data Relationships:
○ Data points may lie on non-linear manifolds, requiring specialized techniques.

Approaches for Clustering in Non-Euclidean Spaces

1. Density-Based Clustering

● DBSCAN (Density-Based Spatial Clustering of Applications with Noise):


○ Adapts to non-Euclidean metrics by using a distance function like cosine
similarity or graph-based distances.
○ Clusters are formed based on density rather than geometric proximity.

2. Graph-Based Clustering

● Spectral Clustering:
○ Constructs a similarity graph where nodes represent data points, and edges
represent similarities.
○ Uses eigenvalues of the graph Laplacian to identify clusters.
3. Kernel-Based Clustering

● Kernel K-Means:
○ Maps data into a higher-dimensional space using a kernel function.
○ Performs clustering in the transformed space.

4. Hierarchical Clustering

● Works with custom distance metrics to form clusters hierarchically.
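A minimal Python sketch of clustering with a non-Euclidean metric: cosine distances fed into average-linkage hierarchical clustering via SciPy (assumed available; the toy term-count vectors are illustrative):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# toy "document" vectors (term counts); direction matters more than magnitude
X = np.array([[3, 0, 1], [6, 0, 2], [0, 4, 1], [0, 2, 1], [1, 1, 4]], dtype=float)

D = pdist(X, metric='cosine')                       # pairwise cosine distances (condensed)
Z = linkage(D, method='average')                    # average linkage on the custom metric
labels = fcluster(Z, t=2, criterion='maxclust')     # cut the tree into 2 clusters
print(labels)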

Applications

1. Social Network Analysis:


○ Graph-based clustering to find communities.
2. Document Clustering:
○ Cosine similarity for text data.
3. Genomics:
○ Clustering based on genetic sequence similarity.

2. Clustering for Streams Parallelism

Clustering for data streams involves processing continuous, high-velocity data in real time.
Stream clustering algorithms are designed to handle large-scale, evolving data efficiently, often
leveraging parallelism for scalability.

Challenges in Stream Clustering

1. Dynamic Nature of Data:


○ Data evolves over time, requiring incremental updates to clusters.
2. Memory Constraints:
○ Cannot store the entire dataset; uses summary statistics or micro-clusters.
3. Real-Time Processing:
○ Requires low-latency algorithms.

Approaches for Stream Clustering

1. Micro-Cluster Based Algorithms

● CluStream:
○ Maintains micro-clusters (compact summaries) of data in memory.
○ Periodically merges micro-clusters to form macro-clusters.

2. Density-Based Stream Clustering


● DenStream:
○ Extends DBSCAN for streams by maintaining dense regions over time.
○ Adjusts clusters dynamically as new data arrives.

3. Grid-Based Clustering

● D-Stream:
○ Divides the data space into grids.
○ Assigns data points to grids and tracks the density of each grid over time.

Parallelism in Stream Clustering

Parallelism improves the efficiency of stream clustering by distributing the workload across
multiple processors or machines.

1. MapReduce Framework

● Mapper Phase:
○ Each mapper processes a subset of the stream to identify local clusters.
● Reducer Phase:
○ Combines local clusters into global clusters.

2. Parallel Density-Based Clustering

● Extends DBSCAN or DenStream by parallelizing the density computation across


multiple nodes.

3. Apache Flink and Spark Streaming

● Real-time frameworks for stream clustering.


● Enable parallel processing of high-velocity data streams.

Applications

1. Fraud Detection:
○ Real-time clustering of transaction data to identify anomalies.
2. Network Monitoring:
○ Clustering network traffic streams to detect patterns or intrusions.
3. IoT Data Analysis:
○ Clustering sensor data streams for predictive maintenance.
