DM & W - Unit - 3
CLUSTERING
INTRODUCTION
Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined; instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data.
The groups are called clusters. Clustering has been used in many application domains, including biology,
medicine, anthropology, marketing, and economics.
Clustering applications include plant and animal classification, disease classification, image processing,
pattern recognition, and document retrieval.
One of the first domains in which clustering was used was biological taxonomy. Recent uses include
examining Web log data to detect usage patterns.
When clustering is applied to a real-world database, many interesting problems occur:
Outlier handling is difficult. Outlier elements do not naturally fall into any cluster; they can be viewed as solitary clusters.
Dynamic data in the database implies that cluster membership may change over time.
Interpreting the semantic meaning of each cluster may be difficult. With classification, the labelling of the
classes is known ahead of time.
There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact
number of clusters required is not easy to determine.
The types of clustering algorithms can be further classified based on the implementation technique used.
Hierarchical algorithms can be categorized as agglomerative or divisive.
"Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work
in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the
agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms
Given clusters Ki and Kj, there are several standard alternatives to calculate the distance between clusters. A
representative list is:
Single link: Smallest distance between an element in one cluster and an element in the other. We thus have
dis(Ki, Kj) = min(dis(til, tjm)) ∀ til ∈ Ki, til ∉ Kj and ∀ tjm ∈ Kj, tjm ∉ Ki.
Complete link: Largest distance between an element in one cluster and an element in the other. We thus have
dis(Ki, Kj) = max(dis(til, tjm)) ∀ til ∈ Ki, til ∉ Kj and ∀ tjm ∈ Kj, tjm ∉ Ki.
Average link: Average distance between an element in one cluster and an element in the other. We thus have
dis(Ki, Kj) = avg(dis(til, tjm)) ∀ til ∈ Ki, til ∉ Kj and ∀ tjm ∈ Kj, tjm ∉ Ki.
Centroid: If clusters have a representative centroid, then the centroid distance is defined as the distance
between the centroids. We thus have
dis(Ki , Kj) = dis(Ci, Cj), where Ci is the centroid for Ki and similarly for Cj.
Medoid: Using a medoid to represent each cluster, the distance between the clusters can be defined by the
distance between the medoids:
dis(Ki , Kj) = dis(Mi , Mj).
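These alternatives can be illustrated with a few lines of NumPy. This is a minimal sketch, not part of the notes: the two small 2-D clusters below are hypothetical, and Euclidean distance is assumed for dis().

```python
# Minimal sketch of the single, complete, average, and centroid cluster-distance
# alternatives for two small hypothetical 2-D clusters Ki and Kj.
import numpy as np

Ki = np.array([[1.0, 1.0], [2.0, 1.5]])                  # hypothetical elements of Ki
Kj = np.array([[5.0, 4.0], [6.0, 5.0], [5.5, 4.5]])      # hypothetical elements of Kj

# all pairwise Euclidean distances dis(til, tjm)
pairwise = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)

single_link   = pairwise.min()    # smallest pairwise distance
complete_link = pairwise.max()    # largest pairwise distance
average_link  = pairwise.mean()   # average pairwise distance
centroid_dist = np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))  # dis(Ci, Cj)

print(single_link, complete_link, average_link, centroid_dist)
```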
OUTLIERS
Outliers are sample points with values much different from those of the remaining set of data. Outliers may
represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data value) or could be
correct data values that are simply much different from the remaining data.
A person who is 2.5 meters tall is much taller than most people. In analyzing the height of individuals, this value probably would be viewed as an outlier. Some clustering techniques do not perform well in the presence of outliers. This problem is illustrated in Figure 5.3.
Clustering algorithms may actually find and remove outliers to ensure that they perform better. Outlier
detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data
mining, algorithms may then choose to remove or treat these values differently. Some outlier detection techniques are based on statistical measures; these do not always perform well on real-world datasets, so alternative detection techniques based on distance measures may be used.
DISTANCE MEASURES
A similarity (or distance) measure compares objects along the dimensions that represent their features. In simple terms, it tells us how alike two data objects are: if the distance between them is small, their similarity is high, and vice versa. Two similar objects may be denoted Obj1 = Obj2, and two dissimilar objects Obj1 != Obj2.
Similarity is usually normalized to the range of 0 to 1, denoted [0, 1].
There are various techniques of calculating Similarity Distance measure. Let’s look at some of the most
popular one.
Euclidean Distance
o This is the most commonly used measure and is defined as
o Similarity Distance Measure = SQRT((X2-X1)^2 + (Y2-Y1)^2)
o The Euclidean distance between two points is the length of the straight-line path connecting them.
Manhattan Distance
o This is another commonly used measure and is defined as
o Similarity Distance Measure = Abs(X2-X1) + Abs(Y2-Y1)
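A minimal sketch (assumed, not from the notes) of both measures; the sample points are taken from the worked examples later in this unit.

```python
# Euclidean and Manhattan distance between two 2-D points (X1, Y1) and (X2, Y2).
from math import sqrt

def euclidean(p, q):
    return sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def manhattan(p, q):
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

print(euclidean((2, 10), (5, 8)))   # 3.605..., the 3.61 used in the K-means example
print(manhattan((5, 4), (7, 7)))    # 5, the P1-to-M1 value in the K-medoids example
```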
HIERARCHICAL ALGORITHMS
Hierarchical clustering is a popular method for grouping objects. It creates groups so that objects within a
group are similar to each other and different from objects in other groups. Clusters are visually represented in
a hierarchical tree called a dendrogram.
The root in a dendrogram tree contains one cluster where all elements are together. The leaves in the
dendrogram each consist of a single element cluster. Internal nodes in the dendrogram represent new clusters
formed by merging the clusters that appear as its children in the tree. Each level in the tree is associated with
the distance measure that was used to merge the clusters. All clusters created at a particular level were
combined because the children clusters had a distance between them less than the distance value associated
with this level in the tree.
Hierarchical clustering has a couple of key benefits:
1. There is no need to pre-specify the number of clusters. Instead, the dendrogram can be cut at the
appropriate level to obtain the desired number of clusters.
2. Data is easily summarized/organized into a hierarchy using dendrograms. Dendrograms make it easy to
examine and interpret clusters.
APPLICATIONS
There are many real-life applications of Hierarchical clustering. They include:
Bioinformatics: grouping animals according to their biological features to reconstruct phylogeny trees
Business: dividing customers into segments or forming a hierarchy of employees based on salary.
Image processing: grouping handwritten characters in text recognition based on the similarity of the
character shapes.
Information Retrieval: categorizing search results based on the query.
Example
3. Find the minimum distance in the initial distance matrix and create the first cluster. Here combine P3 & P6 as the first cluster and update the distance matrix with respect to the new cluster (P3, P6).
         P1     P2     P3,P6  P4     P5
P1       0
P2       0.23   0
P3,P6                  0
P4       0.37   0.20          0
P5       0.34   0.14          0.29   0
4. To update the distance matrix, find the minimum distance of P1, P2, P4, and P5 with respect to the new cluster (P3, P6).
Distance = Minimum [(P3, P6), (P1)]
= Minimum [(P1, P3), (P1, P6)]
= Minimum [(0.22), (0.23)]
= 0.22
Distance = Minimum [(P3, P6), (P2)]
= Minimum [(P2, P3), (P2, P6)]
= Minimum [(0.15), (0.25)]
= 0.15
Distance = Minimum [(P3, P6), (P4)]
= Minimum [(P3, P4), (P4, P6)]
= Minimum [(0.15), (0.22)]
= 0.15
Distance = Minimum [(P3, P6), (P5)]
= Minimum [(P3, P5), (P5, P6)]
= Minimum [(0.28), (0.39)]
= 0.28
5. The updated distance matrix
         P1     P2     P3,P6  P4     P5
P1       0
P2       0.23   0
P3,P6    0.22   0.15   0
P4       0.37   0.20   0.15   0
P5       0.34   0.14   0.28   0.29   0
6. Find the minimum distance from the updated matrix and create the cluster. Here combine P2 & P5 as the
second cluster and update the distance matrix with respect to the new cluster (P2, P5)
         P1     P2,P5  P3,P6  P4
P1       0
P2,P5           0
P3,P6    0.22          0
P4       0.37          0.15   0
7. To update the distance matrix, find the minimum distance of P1, (P3, P6), and P4 with respect to the new cluster (P2, P5).
Distance = Minimum [(P2, P5), (P1)]
= Minimum [(P1, P2), (P1, P5)]
= Minimum [(0.23), (0.34)]
= 0.23
Distance = Minimum [(P2, P5), (P3, P6)]
= Minimum [(P2, P3), (P2, P6), (P5, P3), (P5, P6)]
= Minimum [(0.15), (0.25), (0.28), (0.39)]
= 0.15
Distance = Minimum [(P2, P5), (P4)]
= Minimum [(P2, P4), (P5, P4)]
= Minimum [(0.20), (0.29)]
= 0.20
The updated distance matrix
         P1     P2,P5  P3,P6  P4
P1       0
P2,P5    0.23   0
P3,P6    0.22   0.15   0
P4       0.37   0.20   0.15   0
8. Find the minimum distance from the updated matrix and create the cluster. Here combine (P2, P5) & (P3,
P6) as the third cluster and update the distance matrix with respect to the new cluster (P2, P5) & (P3, P6).
                   P1     (P2,P5),(P3,P6)  P4
P1                 0
(P2,P5),(P3,P6)           0
P4                 0.37                    0
9. To update the distance matrix, find the minimum distance of P1 and P4 with respect to the new cluster ((P2, P5), (P3, P6)).
Distance = Minimum [((P2, P5),(P3, P6)), P1]
= Minimum [((P2, P5), P1), ((P3, P6), P1)]
= Minimum [(0.23), (0.22)]
= 0.22
Distance = Minimum [((P2, P5),(P3, P6)), P4]
= Minimum [((P2, P5), P4), ((P3, P6), P4)]
= Minimum [(0.20), (0.15)]
= 0.15
                   P1     (P2,P5),(P3,P6)  P4
P1                 0
(P2,P5),(P3,P6)    0.22   0
P4                 0.37   0.15             0
10. Find the minimum distance from the updated matrix and create the cluster. Here combine ((P2, P5), (P3, P6)) & P4 as the fourth cluster and update the distance matrix with respect to the new cluster (((P2, P5), (P3, P6)), P4).
                        P1     ((P2,P5),(P3,P6)),P4
P1                      0
((P2,P5),(P3,P6)),P4           0
11. To update the distance matrix, find the minimum distance of P1 with respect to the new cluster (((P2, P5), (P3, P6)), P4).
Distance = Minimum [((P2, P5),(P3, P6), P4), P1]
= Minimum [(((P2, P5), (P3, P6)), P1), (P4, P1)]
= Minimum [(0.22), (0.37)]
= 0.22
                        P1     ((P2,P5),(P3,P6)),P4
P1                      0
((P2,P5),(P3,P6)),P4    0.22   0
12. Thus the clusters are created. The resulting hierarchy is {((((P3, P6), (P2, P5)), P4), P1)}.
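The single-link merges above can be reproduced with SciPy's hierarchical clustering. This is a minimal sketch, not part of the notes: the distance dis(P3, P6) is never shown in the matrices above (it is only known to be the smallest entry), so the value 0.11 below is an assumption; all other distances are taken from the worked matrices.

```python
# Single-link (minimum-distance) agglomerative clustering of P1..P6 with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Condensed upper-triangular distance vector for P1..P6.
condensed = np.array([
    0.23, 0.22, 0.37, 0.34, 0.23,   # P1 to P2..P6
    0.15, 0.20, 0.14, 0.25,         # P2 to P3..P6
    0.15, 0.28, 0.11,               # P3 to P4, P5, P6 (P3-P6 is assumed)
    0.29, 0.22,                     # P4 to P5, P6
    0.39,                           # P5 to P6
])

Z = linkage(condensed, method="single")   # single link = smallest pairwise distance
print(Z)   # merge distances 0.11, 0.14, 0.15, 0.15, 0.22 match the levels in the example
print(fcluster(Z, t=2, criterion="maxclust"))  # cutting into 2 clusters isolates P1, as above
```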
PARTITIONAL ALGORITHMS
Non-hierarchical or partitional clustering creates the clusters in one step as opposed to several steps.
Partitioning methods are a widely used family of clustering algorithms in data mining that aim to partition
a dataset into K clusters. These algorithms attempt to group similar data points together while maximizing
the differences between the clusters.
Partitioning methods work by iteratively refining the cluster centroids until convergence is reached. These
algorithms are popular for their speed and scalability in handling large datasets.
The most widely used partitioning method is the K-means algorithm. Other popular partitioning methods
include K-medoids, Fuzzy C-means, and Hierarchical K-means.
The K-medoids are similar to K-means but use medoids instead of centroids as cluster representatives.
Fuzzy C-means is a soft clustering algorithm that allows data points to belong to multiple clusters with
varying degrees of membership.
Partitioning methods offer several benefits, including speed, scalability, and simplicity.
They are relatively easy to implement and can handle large datasets. Partitioning methods are also
effective in identifying natural clusters within data and can be used for various applications, such as
customer segmentation, image segmentation, and anomaly detection.
Algorithm: K-means
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects
Output:
A dataset of K clusters
Method:
1. Randomly assign K objects from the dataset (D) as cluster centres(C)
2. (Re)assign each object to the cluster whose mean it is most similar to, i.e., the nearest cluster centre.
3. Update cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat Steps 2 and 3 until no change occurs.
Flowchart:
Example
Create 3 clusters for the following data points,
A1 (2, 10), A2 (2, 5), A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C1 (1, 2), C2 (4, 9). Use Euclidean distance.
1. Since k is given as 3, we need to break the data points into 3 clusters; the initial centroids (chosen randomly) are A1 (2, 10), B1 (5, 8), C1 (1, 2).
2. Calculate the distance using the initial centroids.
Data Point   x    y    Dist. to (2, 10)   Dist. to (5, 8)   Dist. to (1, 2)
A1           2   10    0.00               3.61              8.06
A2           2    5    5.00               4.24              3.16
A3           8    4    8.49               5.00              7.28
B1           5    8    3.61               0.00              7.21
B2           7    5    7.07               3.61              6.71
B3           6    4    7.21               4.12              5.39
C1           1    2    8.06               7.21              0.00
C2           4    9    2.24               1.41              7.62
3. Find the minimum value and assign the cluster
Data Point   x    y    Dist. to (2, 10)   Dist. to (5, 8)   Dist. to (1, 2)   Cluster
A1           2   10    0.00               3.61              8.06              1
A2           2    5    5.00               4.24              3.16              3
A3           8    4    8.49               5.00              7.28              2
B1           5    8    3.61               0.00              7.21              2
B2           7    5    7.07               3.61              6.71              2
B3           6    4    7.21               4.12              5.39              2
C1           1    2    8.06               7.21              0.00              3
C2           4    9    2.24               1.41              7.62              2
The Clusters are
Cluster 1 – A1
Cluster 2 – A3, B1, B2, B3, C2
Cluster 3 – A2, C1
4. Find the new centroids using the assigned clusters,
Cluster 1 – A1 = (2, 10)
Cluster 2 – A3, B1, B2, B3, C2 = (8+5+7+6+4)/5, (4+8+5+4+9)/5 = 30/5, 30/5 = (6, 6)
Cluster 3 – A2, C1 = (2+1)/2, (5+2)/2 = 3/2, 7/2 = (1.5, 3.5)
5. Calculate the distance using the new centroids (2, 10), (6,6), (1.5, 3.5)
Data Point   x    y    Dist. to (2, 10)   Dist. to (6, 6)   Dist. to (1.5, 3.5)
A1           2   10    0.00               5.66              6.52
A2           2    5    5.00               4.12              1.58
A3           8    4    8.49               2.83              6.52
B1           5    8    3.61               2.24              5.70
B2           7    5    7.07               1.41              5.70
B3           6    4    7.21               2.00              4.53
C1           1    2    8.06               6.40              1.58
C2           4    9    2.24               3.61              6.04
6. Find the minimum value and assign the new cluster. The new clusters are
Cluster 1 – A1, C2
Cluster 2 – A3, B1, B2, B3
Cluster 3 – A2, C1
7. Compare the new clusters with the old clusters; if any data point's cluster changed, continue the process. Here C2 moved from Cluster 2 to Cluster 1, so the process continues.
8. Find the new centroids using the assigned clusters,
Cluster 1 – A1, C2 = (2+4)/2, (10+9)/2 = 6/2, 19/2 = (3, 9.5)
Cluster 2 – A3, B1, B2, B3 = (8+5+7+6)/4, (4+8+5+4)/4 = 26/4, 21/4 = (6.5, 5.25)
Cluster 3 – A2, C1 = (2+1)/2, (5+2)/2 = 3/2, 7/2 = (1.5, 3.5)
9. Calculate the distance using the new centroids (3, 9.5), (6.5, 5.25), (1.5, 3.5)
Data Point   x    y    Dist. to (3, 9.5)   Dist. to (6.5, 5.25)   Dist. to (1.5, 3.5)
A1           2   10    1.12                6.54                   6.52
A2           2    5    4.61                4.51                   1.58
A3           8    4    7.43                1.95                   6.52
B1           5    8    2.50                3.13                   5.70
B2           7    5    6.02                0.56                   5.70
B3           6    4    6.26                1.35                   4.53
C1           1    2    7.76                6.39                   1.58
C2           4    9    1.12                4.51                   6.04
10. Find the minimum value and assign the new cluster
Data Point   x    y    Dist. to (3, 9.5)   Dist. to (6.5, 5.25)   Dist. to (1.5, 3.5)   New Cluster   Old Cluster
A1           2   10    1.12                6.54                   6.52                  1             1
A2           2    5    4.61                4.51                   1.58                  3             3
A3           8    4    7.43                1.95                   6.52                  2             2
B1           5    8    2.50                3.13                   5.70                  1             2
B2           7    5    6.02                0.56                   5.70                  2             2
B3           6    4    6.26                1.35                   4.53                  2             2
C1           1    2    7.76                6.39                   1.58                  3             3
C2           4    9    1.12                4.51                   6.04                  1             1
The New Clusters are
Cluster 1 – A1, B1, C2
Cluster 2 – A3, B2, B3
Cluster 3 – A2, C1
11. Compare the new clusters with the old clusters; if any data point's cluster changed, continue the process. Here B1 moved from Cluster 2 to Cluster 1, so the process continues.
12. Find the new centroids using the assigned clusters,
Cluster 1 – A1, B1, C2 = (2+5+4)/3, (10+8+9)/3 = 11/3, 27/3 = (3.67, 9)
Cluster 2 – A3, B2, B3 = (8+7+6)/3, (4+5+4)/3 = 21/3, 13/3 = (7, 4.33)
Cluster 3 – A2, C1 = (2+1)/2, (5+2)/2 = (1.5, 3.5)
13. Calculate the distance using the new centroids (3.67, 9), (7, 4.33), (1.5, 3.5)
Data Point   x    y    Dist. to (3.67, 9)   Dist. to (7, 4.33)   Dist. to (1.5, 3.5)
A1           2   10    1.94                 7.56                 6.52
A2           2    5    4.33                 5.04                 1.58
A3           8    4    6.62                 1.05                 6.52
B1           5    8    1.67                 4.18                 5.70
B2           7    5    5.21                 0.67                 5.70
B3           6    4    5.52                 1.05                 4.53
C1           1    2    7.49                 6.44                 1.58
C2           4    9    0.33                 5.55                 6.04
14. Find the minimum value and assign the new cluster
Data Point   x    y    Dist. to (3.67, 9)   Dist. to (7, 4.33)   Dist. to (1.5, 3.5)   New Cluster   Old Cluster
A1           2   10    1.94                 7.56                 6.52                  1             1
A2           2    5    4.33                 5.04                 1.58                  3             3
A3           8    4    6.62                 1.05                 6.52                  2             2
B1           5    8    1.67                 4.18                 5.70                  1             1
B2           7    5    5.21                 0.67                 5.70                  2             2
B3           6    4    5.52                 1.05                 4.53                  2             2
C1           1    2    7.49                 6.44                 1.58                  3             3
C2           4    9    0.33                 5.55                 6.04                  1             1
15. Compare the new clusters with the old clusters; if any data point's cluster changed, continue the process. Here the old and new clusters are the same, so the algorithm stops. The final 3 clusters are
Cluster 1 – A1, B1, C2
Cluster 2 – A3, B2, B3
Cluster 3 – A2, C1
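The whole iteration can be reproduced with a short NumPy sketch. It is an assumed illustration (not the notes' own code) that starts from the same fixed initial centroids A1, B1, and C1, stops when no assignment changes, and ends with the same three clusters.

```python
# K-means rerun of the worked example with fixed initial centroids A1, B1, C1.
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
names = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"]
centroids = points[[0, 3, 6]].copy()          # A1 (2, 10), B1 (5, 8), C1 (1, 2)

assignment = np.full(len(points), -1)
while True:
    # distance of every point to every centroid, then nearest-centroid assignment
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    new_assignment = dists.argmin(axis=1)
    if np.array_equal(new_assignment, assignment):   # Step 4: stop when nothing changes
        break
    assignment = new_assignment
    # Step 3: recompute each centroid as the mean of its current members
    centroids = np.array([points[assignment == k].mean(axis=0) for k in range(3)])

for k in range(3):
    print(f"Cluster {k + 1}:", [names[i] for i in np.where(assignment == k)[0]])
# Expected: Cluster 1 – A1, B1, C2;  Cluster 2 – A3, B2, B3;  Cluster 3 – A2, C1
```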
K-MEDOIDS ALGORITHM
Instead of centroids as reference points as in the K-means algorithm, the K-medoids algorithm takes a medoid (an actual, centrally located data point of the cluster) as the reference point.
Algorithm:
Given the value of k and unlabelled data:
1. Choose k number of random points from the data and assign these k points to k number of clusters. These
are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid and assign it to the cluster with
the nearest medoid.
3. Calculate the total cost (Sum of all the distances from all the data points to the medoids)
4. Select a random point as the new medoid and swap it with the previous medoid. Repeat steps 2 and 3.
5. If the total cost of the new medoid is less than that of the previous medoid, make the new medoid
permanent and repeat step 4.
6. If the total cost of the new medoid is greater than the cost of the previous medoid, undo the swap and
repeat step 4.
7. The repetitions continue until no change is encountered, i.e., no new medoid reduces the total cost of classifying the data points.
Example
Data set:
x y
P0 5 4
P1 7 7
P2 1 3
P3 8 6
P4 4 9
Scatter plot:
4. Consider New medoids: M1(5, 4) and M2(4, 9) [Change any one medoid only]
5. Calculation of distances using Manhattan Distance: |x1 - x2| + |y1 - y2|
       x   y   Distance from M1(5, 4)   Distance from M2(4, 9)   Cluster (nearest medoid)
P0     5   4   0                        6                        1
P1     7   7   5                        5                        2
P2     1   3   5                        9                        1
P3     8   6   5                        7                        1
P4     4   9   6                        0                        2
(P1 is equidistant from both medoids; following the final clusters in step 8, it is assigned to Cluster 2.)
6. New Total Cost = 0 + 5 + 5 + 5 + 0 = 15
7. Calculate the cost of swapping medoids with respect to the initial and new clusters,
S = Initial Total Cost – New Total Cost
S = 17 – 15 = 2 > 0, so the new total cost is lower than the previous cost and the swap is accepted.
8. The final medoids are M1(5, 4) and M2(4, 9), and the clusters are Cluster 1 = {P0, P2, P3}, Cluster 2 = {P1, P4}.
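A small sketch (assumed helper, not from the notes) that recomputes the Manhattan distances and the new total cost of 15 for the medoids M1(5, 4) and M2(4, 9).

```python
# Recompute the K-medoids distance table and total cost for the example data set.
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

points = {"P0": (5, 4), "P1": (7, 7), "P2": (1, 3), "P3": (8, 6), "P4": (4, 9)}
m1, m2 = (5, 4), (4, 9)

total_cost = 0
for name, p in points.items():
    d1, d2 = manhattan(p, m1), manhattan(p, m2)
    total_cost += min(d1, d2)          # cost = distance to the nearest medoid
    print(name, d1, d2)

# P1 is equidistant from both medoids (5 vs 5); the notes place it in Cluster 2.
# Either choice gives the same total cost.
print("New Total Cost =", total_cost)  # 15, matching step 7 in the notes
```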
ASSOCIATION RULES
Association rules are used to show the relationships between data items. These uncovered relationships are not
inherent in the data, as with functional dependencies, and they do not represent any sort of causality or
correlation.
Association rule mining finds interesting associations and relationships among large sets of data items. This
rule shows how frequently an itemset occurs in a transaction.
A typical example is Market Basket Analysis. Market Basket Analysis is one of the key techniques used by large retailers to show associations between items.
It allows retailers to identify relationships between the items that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on the
occurrences of other items in the transaction.
TABLE 6.4: Association Rule Notation
Term      Description
D         Database of transactions
ti        Transaction in D
s         Support
α         Confidence
X, Y      Itemsets
X --> Y   Association rule
L         Set of large itemsets
l         Large itemset in L
C         Set of candidate itemsets
p         Number of partitions
Basic Definitions
o A sample association rule is: If A, then B (written A => B).
o The "if" part (A) is called the antecedent, and the "then" part (B) is called the consequent.
o Support Count (σ) – Frequency of occurrence of an itemset; for example, σ({Milk, Bread, Diaper}) = 2 means the itemset appears in two transactions.
o Frequent Itemset – An itemset whose support is greater than or equal to minimum support
threshold.
o Association Rule – An implication expression of the form X -> Y, where X and Y are any 2 itemsets.
Example: {Milk, Diaper} -> {Juice}
Rule Evaluation Metrics
Support(s) – The number of transactions that include all items in both the {X} and {Y} parts of the rule, as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together, expressed as a percentage of all transactions.
Support (X) = Frequency of (X) / Total No. of Transactions.
Confidence(c) – It is the ratio of the number of transactions that include all items in both {A} and {B} to the number of transactions that include all items in {A}.
Confidence (A => B) = Support (A ∪ B) / Support (A).
Lift – The lift of the rule X => Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the support (frequency) of {Y}.
Lift = Support (X, Y) / (Support (X) × Support (Y))
If Lift = 1, the occurrences of the antecedent and the consequent are independent of each other.
If Lift > 1, it indicates the degree to which the two itemsets are dependent on each other.
If Lift < 1, it indicates that one item is a substitute for the other, meaning one item has a negative effect on the occurrence of the other.
Example
TID   Items
T1    A, B, C
T2    B, C, E
T3    A, B, C
T4    E, B, D
Support (A, B) = 2/4 = 0.5, Support (A) = 2/4 = 0.5, Support (B) = 4/4 = 1
Lift (A => B) = Support (A, B) / (Support (A) × Support (B)) = 0.5 / (0.5 × 1) = 1, so A and B occur independently of each other.
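The three metrics can be checked with a few lines of Python. This is an assumed sketch, not part of the notes, run on the four transactions above.

```python
# Support, confidence, and lift for the rule A => B over the example transactions.
transactions = [
    {"A", "B", "C"},   # T1
    {"B", "C", "E"},   # T2
    {"A", "B", "C"},   # T3
    {"E", "B", "D"},   # T4
]
n = len(transactions)

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / n

s_a, s_b, s_ab = support({"A"}), support({"B"}), support({"A", "B"})
confidence = s_ab / s_a
lift = s_ab / (s_a * s_b)
print(f"support(A,B)={s_ab:.2f}  confidence(A=>B)={confidence:.2f}  lift={lift:.2f}")
# support(A,B)=0.50  confidence(A=>B)=1.00  lift=1.00
```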
Large Itemsets
The most common approach to finding association rules is to break up the problem into two parts:
1. Find large itemsets.
2. Generate rules from frequent itemsets.
An itemset is any subset of the set of all items, I.
A large (frequent) itemset is an itemset whose number of occurrences is above a threshold, s. We use the
notation L to indicate the complete set of large itemsets and l to indicate a specific large itemset.
Generate rules for each large itemset as follows:
1) For a large itemset X and Y a subset of X, let Z = X – Y
2) If support(X)/Support (Z) > minimum confidence, then the rule Z=>Y (i.e. X-Y=>Y) is a valid rule
Example: Given the transaction data in the table below (with associated supports shown in Table 6.2), suppose that the input support and confidence are s = 30% and α = 50%, respectively.
Transaction   Items
t1            Bread, Jelly, PeanutButter
t2            Bread, PeanutButter
t3            Bread, Milk, PeanutButter
t4            Juice, Bread
t5            Juice, Milk
Using this value of s, we obtain the following set of large itemsets: {Bread}, {PeanutButter}, {Milk}, {Juice}, and {Bread, PeanutButter}.
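A small sketch (assumed, not from the notes) of the rule-generation step above: for the large itemset X = {Bread, PeanutButter}, each rule Z => Y with Z = X − Y is kept if support(X)/support(Z) meets α = 50%.

```python
# Generate rules from the large itemset {Bread, PeanutButter} over the five transactions.
transactions = [
    {"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"}, {"Juice", "Bread"}, {"Juice", "Milk"},
]
alpha = 0.5   # minimum confidence from the example
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

X = frozenset({"Bread", "PeanutButter"})        # a large itemset with support 0.6
for item in X:
    Y = frozenset([item])
    Z = X - Y
    confidence = support(X) / support(Z)        # support(X) / support(Z) vs. alpha
    status = "valid" if confidence >= alpha else "rejected"
    print(f"{set(Z)} => {set(Y)}: confidence {confidence:.2f} ({status})")
# {'Bread'} => {'PeanutButter'}: confidence 0.75 (valid)
# {'PeanutButter'} => {'Bread'}: confidence 1.00 (valid)
```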
BASIC ALGORITHMS
APRIORI ALGORITHM
The Apriori algorithm is a classic and widely used algorithm for mining frequent itemsets and association
rules in transactional databases. It is particularly efficient in identifying patterns in datasets with a large
number of transactions. The algorithm works by generating candidate itemsets, checking their support, and
pruning the search space based on the Apriori principle.
Key Concepts:
Frequent Itemsets: Itemsets that appear in transactions with a frequency greater than or equal to a
specified threshold.
Support: The proportion of transactions in the dataset that contain a particular itemset.
Steps of the Apriori Algorithm:
1. Initialization:
o Begin with the frequent itemsets of length 1 (single items).
o Calculate the support for each item, i.e., the number of transactions containing the item.
2. Generating Candidate Itemsets:
o Use the frequent itemsets of length (k-1) to generate candidate itemsets of length k.
o Candidate generation involves joining two frequent itemsets of length (k-1) if their first (k-2)
items are identical.
For example, if {A, B} and {A, C} are frequent, their join operation will produce {A, B,
C}.
3. Pruning Candidates:
o After generating candidate itemsets, prune those that do not meet the minimum support threshold.
o The Apriori principle states that if an itemset is infrequent, all its supersets will also be infrequent.
o This reduces the search space and avoids the need to consider all possible combinations.
4. Calculating Support:
o Count the support of each candidate itemset by scanning the entire dataset.
o Support is calculated as the number of transactions containing the itemset divided by the total
number of transactions.
5. Repeat Steps 2-4:
o Iterate through the process of generating candidates, pruning, and calculating support for itemsets
of increasing length until no more frequent itemsets can be found.
Example Scenario: Let's consider a simple transaction dataset for a retail store:
Transaction ID Items Purchased
1 {Milk, Bread, Eggs}
2 {Bread, Butter}
3 {Milk, Bread, Butter}
4 {Bread, Eggs}
5 {Milk, Eggs}
Step 1: Initialization
Calculate the support for single items:
o Support({Milk}) = 3/5 = 0.6
o Support({Bread}) = 4/5 = 0.8
o Support({Eggs}) = 3/5 = 0.6
o Support({Butter}) = 2/5 = 0.4
Step 2: Generating Candidate Itemsets
Join frequent itemsets to create candidate itemsets of length 2:
o {Milk, Bread}, {Milk, Eggs}, {Milk, Butter}, {Bread, Eggs}, {Bread, Butter}
Step 3: Pruning Candidates
Apply the Apriori principle: prune candidates that contain an infrequent item. Since {Butter} has support 0.4, below the minimum support threshold (let's assume minimum support = 0.5), the candidates {Milk, Butter} and {Bread, Butter} are pruned, leaving:
o {Milk, Bread}, {Milk, Eggs}, {Bread, Eggs}
Step 4: Calculating Support for Remaining Candidates
Count the support of each remaining candidate:
o Support({Milk, Bread}) = 2/5 = 0.4 (Below minimum support, so prune)
o Support({Milk, Eggs}) = 2/5 = 0.4 (Below minimum support, so prune)
o Support({Bread, Eggs}) = 2/5 = 0.4 (Below minimum support, so prune)
Step 5: Repeat Steps 2-4
Generate candidate itemsets of length 3 (none meet minimum support, so stop)
Association Rules:
From the candidate 2-itemsets we can compute association rules and their confidences (note that none of these itemsets met the assumed minimum support of 0.5, so the rules below are shown for illustration):
o {Bread} => {Milk} (Confidence = 2/4 = 0.5)
o {Bread} => {Eggs} (Confidence = 2/4 = 0.5)
o {Milk} => {Bread} (Confidence = 2/3 = 0.67)
o {Eggs} => {Bread} (Confidence = 2/3 = 0.67)
Interpretation:
The algorithm identified that {Bread} is the most frequent item in the dataset.
The association rules show that customers who bought {Bread} also bought {Milk} or {Eggs} with a
confidence of 50%.
Real-World Application:
o In a retail setting, the Apriori algorithm could be used to analyze customer purchasing patterns:
"Customers who buy Bread are 50% likely to buy Milk as well."
"Customers who buy Bread are 50% likely to buy Eggs as well."
o These insights can guide inventory management, product placement, and targeted marketing strategies.
The Apriori algorithm's ability to efficiently mine frequent itemsets and generate association rules makes it a
powerful tool for discovering valuable patterns in transactional data, leading to informed business decisions
and strategies.
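The level-wise loop described above can be condensed into a short pure-Python sketch. It is an assumed illustration (not the notes' own code) run on the five-transaction example with minimum support 0.5; it reproduces the result that {Milk}, {Bread}, and {Eggs} are the only frequent itemsets at that threshold.

```python
# A compact Apriori loop: generate candidates, prune by the Apriori principle, count support.
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Eggs"}, {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"}, {"Bread", "Eggs"}, {"Milk", "Eggs"},
]
min_support = 0.5
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

k = 2
while frequent[-1]:
    prev = frequent[-1]
    # join step: union pairs of frequent (k-1)-itemsets that give a k-itemset
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # prune step (Apriori principle): every (k-1)-subset must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    # count step: keep candidates that meet the minimum support
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level, sets in enumerate(frequent, start=1):
    print(f"L{level}:", [set(s) for s in sets])
# L1 contains {Milk}, {Bread}, {Eggs} (order may vary); L2 is empty at support 0.5
```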
DATA PARALLELISM
A data parallelism algorithm is the Count Distribution Algorithm (CDA). The database is divided into p partitions, one for each processor. Each processor counts the candidates for its data and then broadcasts its counts to all other processors.
Each processor then determines the global counts. These then are used to determine the large item sets and to
generate the candidates for the next scan.
Explanation of CDA
Objective:
CDA's primary goal is to distribute the task of counting itemsets across multiple processors.
This allows for parallel processing of large datasets, reducing the overall computation time.
Steps in CDA:
o Data Partitioning:
The dataset is divided into partitions, with each partition assigned to a different processor.
For example, consider a retail dataset of customer transactions:
o Processor 1 gets transactions from January to March.
o Processor 2 gets transactions from April to June.
o Processor 3 gets transactions from July to September.
o And so on...
o Local Counting:
Each processor performs local counting of itemsets within its assigned partition.
For instance:
o Processor 1 counts the occurrences of {milk, bread}, {milk, eggs}, etc., in its partition.
o Processor 2 counts the occurrences of {bread, butter}, {bread, jam}, etc., in its partition.
o Processor 3 counts the occurrences of {eggs, cheese}, {eggs, yogurt}, etc., in its partition.
o Each processor independently generates its local itemset counts.
o Aggregation:
After local counting, the partial counts from all processors are aggregated.
This step combines the local counts of the same itemsets from different partitions.
For example:
o If {milk, bread} appeared 100 times in Processor 1's partition and 150 times in Processor
2's partition, the total count becomes 250.
o Final Result:
The final result is the aggregated count of itemsets across all partitions.
This result provides insights into the frequent itemsets that appear frequently across the entire
dataset.
Advantages of CDA:
Reduced Communication:
o Only the final aggregated counts need to be communicated between processors.
o Minimal communication overhead compared to transferring entire datasets.
Scalability:
o CDA scales well with an increase in the number of processors.
o More processors lead to faster computation times due to parallel processing.
o This scalability ensures that the algorithm remains effective for handling increasing amounts of
transaction data.
The Count Distribution Algorithm (CDA) exemplifies data parallelism in association rule mining. By dividing
the dataset, performing local counts, aggregating results, and providing valuable insights, CDA enables
efficient and scalable analysis of large datasets such as retail transaction data.
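The partition–count–aggregate steps above can be illustrated with a short simulation. This is an assumed sketch (not from the notes): the candidate itemsets, the three partitions, and their transactions are hypothetical, and the sequential loop over partitions stands in for real parallel processors.

```python
# CDA simulation: every processor counts the SAME candidates over its OWN data partition,
# then the local counts are summed into global counts.
from collections import Counter

candidates = [frozenset({"milk", "bread"}), frozenset({"bread", "butter"})]

partitions = [                                              # hypothetical data partitions
    [{"milk", "bread", "eggs"}, {"bread", "butter"}],       # processor 1
    [{"milk", "bread", "butter"}, {"bread", "eggs"}],       # processor 2
    [{"milk", "eggs"}, {"milk", "bread"}],                  # processor 3
]

def local_counts(partition):
    # each processor counts every candidate against its local transactions only
    return Counter({c: sum(1 for t in partition if c <= t) for c in candidates})

global_counts = Counter()
for part in partitions:                  # in a real system these run in parallel
    global_counts += local_counts(part)  # broadcast + sum of the local counts

print(dict(global_counts))   # {milk, bread}: 3, {bread, butter}: 2 across all partitions
```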
TASK PARALLELISM
The data distribution algorithm (DDA) demonstrates task parallelism. Here the candidates as well as the
database are partitioned among the processors. Each processor in parallel counts the candidates given to it
using its local database partition.
The Data Distribution Algorithm (DDA) is a task parallelism algorithm used in association rule mining.
Unlike data parallelism, which focuses on parallelizing the data itself, task parallelism, as implemented by
DDA, aims to parallelize the processing of candidate itemsets.
Objective:
o DDA's primary goal is to distribute the task of processing candidate itemsets across multiple
processors or nodes.
o This parallelization enhances the efficiency of finding frequent itemsets in large datasets.
Steps in DDA:
o Candidate Partitioning:
The candidate itemsets are divided into subsets, with each subset assigned to a different
processor.
For example, consider a set of candidate itemsets:
Processor 1 handles {milk, bread}, {milk, eggs}, etc.
Processor 2 handles {bread, butter}, {bread, jam}, etc.
Processor 3 handles {eggs, cheese}, {eggs, yogurt}, etc.
Each processor is responsible for a specific subset of candidate itemsets.
o Local Processing:
Each processor independently processes its assigned subset of candidate itemsets.
For instance:
Processor 1 checks occurrences of {milk, bread}, {milk, eggs} in its subset.
Processor 2 checks occurrences of {bread, butter}, {bread, jam} in its subset.
Processor 3 checks occurrences of {eggs, cheese}, {eggs, yogurt} in its subset.
Local counting and verification of itemsets occur within each processor.
o Communication:
During processing, processors may need to communicate intermediate results.
For example, if a candidate itemset {milk, bread} is found in Processor 1, it might need to
inform other processors about this discovery.
Communication ensures that all processors are aware of the potential frequent itemsets found
in their subsets.
o Aggregation of Results:
Once local processing is complete, the results from all processors are aggregated.
This step combines the locally discovered frequent itemsets into a comprehensive list.
Advantages of DDA:
o Memory Efficiency:
DDA requires memory only for the subset of candidate itemsets assigned to each processor.
Each processor handles a smaller portion of the candidate itemsets, reducing memory
requirements.
o Adaptability:
The algorithm adapts to varying memory sizes at different processors.
Since not all partitions of candidates need to be the same size, DDA can adjust accordingly.
o Scalability:
DDA scales well with an increase in the number of processors.
Adding more processors allows for faster processing of candidate itemsets.
o Real-World Example: Online Retail Platform
Let's consider a real-world example of DDA applied to an online retail platform's transaction data for market
analysis.
Scenario:
o The online platform has a vast database of customer transactions.
o Each transaction includes the items purchased together by customers.
Objective:
o The goal is to identify frequent itemsets to improve product recommendations and marketing
strategies.
Using DDA:
o Candidate Partitioning:
The set of candidate itemsets is divided into subsets based on item combinations.
Each processor is assigned a specific subset of candidate itemsets.
Processor 1 handles {milk, bread}, {milk, eggs}, etc.
Processor 2 handles {bread, butter}, {bread, jam}, etc.
Processor 3 handles {eggs, cheese}, {eggs, yogurt}, etc.
Local Processing:
o Processor 1 checks occurrences of {milk, bread}, {milk, eggs} in its subset of transactions.
o Processor 2 checks occurrences of {bread, butter}, {bread, jam} in its subset of transactions.
o Processor 3 checks occurrences of {eggs, cheese}, {eggs, yogurt} in its subset of transactions.
o Each processor independently determines the frequent itemsets within its assigned subset.
Communication:
o If Processor 1 finds a frequent itemset {milk, bread}, it communicates this to other processors.
o This communication ensures that all processors are aware of the potential frequent itemsets across
subsets.
Aggregation of Results:
o After local processing, the results from all processors are combined.
o The final result is a comprehensive list of frequent itemsets discovered across the entire dataset.
Benefits:
o Efficient Processing:
DDA enables efficient processing of candidate itemsets across multiple processors.
Parallelization reduces the time needed to identify frequent itemsets.
o Personalized Recommendations:
The discovered frequent itemsets help in generating personalized product recommendations
for online shoppers.
Customers are shown items frequently purchased together, enhancing their shopping
experience.
o Marketing Insights:
Retailers gain insights into customer preferences and market trends.
This information guides marketing strategies such as targeted promotions and product
bundling.
The Data Distribution Algorithm (DDA) exemplifies task parallelism in association rule mining. By
distributing the processing of candidate itemsets, DDA enhances efficiency, memory utilization, and
scalability. In real-world applications like online retail platforms, DDA enables the discovery of frequent
itemsets for personalized recommendations and strategic marketing decisions.
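As a contrast with CDA, the sketch below (assumed, not from the notes; the database, candidate subsets, and processor names are hypothetical) simulates DDA's key idea: the candidate itemsets, rather than the data, are partitioned, and each processor counts only its own candidates.

```python
# DDA simulation: candidates are split among processors; each processor counts only
# its assigned candidates over the transactions, and the results are then gathered.
database = [
    {"milk", "bread", "eggs"}, {"bread", "butter"}, {"milk", "bread", "butter"},
    {"bread", "jam"}, {"eggs", "cheese"}, {"eggs", "yogurt"},
]

candidate_partitions = {    # hypothetical assignment of candidate itemsets to processors
    "processor 1": [frozenset({"milk", "bread"}), frozenset({"milk", "eggs"})],
    "processor 2": [frozenset({"bread", "butter"}), frozenset({"bread", "jam"})],
    "processor 3": [frozenset({"eggs", "cheese"}), frozenset({"eggs", "yogurt"})],
}

results = {}
for proc, cands in candidate_partitions.items():   # in a real system, run in parallel
    for c in cands:                                 # each processor handles only its candidates
        results[c] = sum(1 for t in database if c <= t)

# gather the per-processor results into one global list of candidate counts
for c, count in results.items():
    print(set(c), count)
```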
COMPARING APPROACHES
Algorithms can be classified along the following dimensions,
o Target: The algorithms we have examined generate all rules that satisfy a given support and confidence level. Alternatives are algorithms that generate only a subset of these rules, based on given constraints.
o Type: Algorithms may generate regular association rules or more advanced association rules.
o Data type: The rules we have examined are generated for data in categorical databases. Rules may also be derived for other types of data such as plain text.
o Data source: Our investigation has been limited to the use of association rules for market basket data.
This assumes that data are present in a transaction. The absence of data may also be important.
o Technique: The most common strategy to generate association rules is that of finding large itemsets.
Other techniques may also be used.
Itemset strategy: Itemsets may be counted in different ways. The most naïve approach is to generate all
itemsets and count them. As this is usually too space intensive, the bottom-up approach used by Apriori,
which takes advantage of the large itemset property, is the most common approach. A top-down technique
could also be used.
Transaction strategy: To count the itemsets, the transactions in the database must be scanned. All
transactions could be counted, only a sample may be counted, or the transactions could be divided into
partitions.
Itemset data structure: The most common data structure used to store the candidate itemsets and their counts
is a hash tree. Hash trees provide an effective technique to store, access, and count itemsets. They are efficient
to search, insert, and delete item sets. A hash tree is a multiway search tree where the branch to be taken at
each level in the tree is determined by applying a hash function as opposed to comparing key values to
branching points in the node.
Transaction data structure: Transactions may be viewed as in a flat file or as a TID list, which can be
viewed as an inverted file.
Optimization: These techniques look at how to improve on the performance of an algorithm given data
distribution (skewness) or amount of main memory.
Architecture: Sequential, parallel, and distributed algorithms have been proposed.
Parallelism strategy: Both data parallelism and task parallelism have been used.
INCREMENTAL RULES
Incremental rule mining in data mining refers to the process of updating existing rules or discovering new
rules efficiently when new data is added to a dataset. This process is particularly useful when dealing with
dynamic datasets that experience frequent updates or additions. The main goal of incremental rule mining is to
avoid the need to reprocess the entire dataset each time new data arrives, thereby saving computational
resources and time.
Importance of Incremental Rule Mining:
Efficiency: Incremental rule mining helps in updating existing rules or discovering new rules without
reprocessing the entire dataset.
Real-time Updates: It allows for immediate updates to the existing rules as new data becomes available.
Scalability: Incremental mining is essential for handling large datasets efficiently, especially in dynamic
environments.
Techniques Used for Incremental Rule Mining:
1. Incremental Association Rule Mining:
o Focuses on updating existing association rules or discovering new rules efficiently.
o Techniques like Apriori-based incremental mining and FP-Growth-based incremental mining are
common.
2. Sequential Pattern Mining:
o Extends to sequences of events or items, updating patterns as new sequences are observed.
o Examples include algorithms like GSP (Generalized Sequential Pattern) for incremental
sequential pattern mining.
3. Stream Mining:
o Handles continuous streams of data, updating patterns or rules in real-time.
o Algorithms like VFDT (Very Fast Decision Tree) for classification or CluStream for clustering
are used.
Generalized Association Rules
A generalized association rule, X => Y, is defined like a regular association rule, with the restriction that no item in Y may be above (i.e., an ancestor of) any item in X in the concept hierarchy.
When generating generalized association rules, all possible rules are generated using one or more given
hierarchies. Several algorithms have been proposed to generate generalized rules.
The simplest would be to expand each transaction by adding (for each item in it) all items above it in any
hierarchy.
EXAMPLE
Figure 6.7 shows a partial concept hierarchy for food. This hierarchy shows that Wheat Bread is a type of
Bread, which is a type of grain.
An association rule of the form Bread => PeanutButter has a lower support and threshold than one of the form Grain => PeanutButter.
There obviously are more transactions containing any type of grain than transactions containing Bread.
Likewise, WheatBread => PeanutButter has a lower threshold and support than Bread => PeanutButter.
Quantitative Association Rules
Quantitative association rules extend ordinary association rules to datasets that also contain numerical attributes.
Key Concepts:
Quantitative Attributes:
o In addition to categorical items, datasets for quantitative association rules include numerical
attributes.
o Examples: Prices, quantities, ratings, temperatures, scores, etc.
Support and Confidence for Quantitative Rules:
o Support: The proportion of transactions where an itemset with specific numerical conditions appears.
o Confidence: The probability of finding an itemset with specific numerical conditions given the
presence of another itemset.
Quantitative Measures:
o Measures such as mean, median, sum, range, variance, etc., are used to define conditions on
numerical attributes.
Example Scenario:
Consider a dataset of customer transactions at a supermarket:
Transaction ID Items Purchased Total Amount ($)
1 {Milk, Bread, Eggs} 12.50
2 {Bread, Butter} 8.75
3 {Milk, Bread, Butter} 15.20
4 {Bread, Eggs} 6.90
5 {Milk, Eggs} 9.75
Quantitative Association Rule Examples:
1. Support for Total Amount:
o {Milk} => {Total Amount > 10} [Support: 0.4, Confidence: 0.8]
Interpretation: 40% of transactions containing milk have a total amount greater than $10.
2. Confidence for Price Range:
o {Bread} => {Price Range: $5 - $10} [Support: 0.6, Confidence: 0.75]
Interpretation: 75% of transactions with bread fall within the price range of $5 to $10.
3. Quantitative Condition on Quantity:
o {Milk, Quantity > 2} => {Total Amount > 12} [Support: 0.2, Confidence: 1.0]
Interpretation: When purchasing more than 2 units of milk, the total amount spent is
usually over $12.
Benefits of Quantitative Association Rules:
1. Deeper Insights:
o Provides insights into how numerical attributes influence itemset associations.
o Understanding how prices, quantities, or other metrics impact purchasing behaviour.
2. Fine-grained Analysis:
o Enables segmentation of data based on numerical conditions for more targeted analysis.
o Discovering patterns specific to certain price ranges, quantities, or attribute values.
3. Optimized Decision-Making:
o Helps in pricing strategies, inventory management, and product bundling decisions.
o Tailoring promotions or discounts based on customer spending patterns.
Techniques for Quantitative Association Rules:
1. Binning or Discretization:
o Convert continuous numerical attributes into discrete bins or categories.
o Enables the application of standard association rule mining algorithms on the discretized data (see the binning sketch after this list).
2. Threshold-based Mining:
o Define thresholds or ranges for numerical attributes to create conditions for rule discovery.
o Specify minimum or maximum values for support and confidence.
o Allows for a more flexible approach, capturing patterns within specified ranges of significance.
3. Selective Pattern Mining:
o Apply different minimum supports to specific subsets of the data.
o For example, focus on high-value customers with a higher minimum support threshold.
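A minimal sketch of the binning technique from item 1 above, assuming pandas is available; the bin edges and labels are illustrative choices, applied to the Total Amount column of the earlier supermarket example.

```python
# Discretize the "Total Amount ($)" column so that each bin can be treated as an
# ordinary categorical item by a standard association rule miner.
import pandas as pd

amounts = pd.Series([12.50, 8.75, 15.20, 6.90, 9.75], index=[1, 2, 3, 4, 5])
bins = pd.cut(amounts, bins=[0, 5, 10, 20], labels=["$0-$5", "$5-$10", "$10-$20"])
print(bins)
# Transactions 2, 4 and 5 fall in "$5-$10"; transactions 1 and 3 fall in "$10-$20".
```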
Example Scenario:
Consider a retail store analyzing customer purchase data:
Single Minimum Support: 0.1 (for general patterns)
Multiple Minimum Supports:
o High-Value Customers: 0.2 (for patterns among high spenders)
o Low-Value Customers: 0.05 (for patterns among regular spenders)
Pattern Discovery Examples:
1. General Patterns (Single Minimum Support):
o {Milk, Bread} => {Eggs} [Support: 0.15, Confidence: 0.6]
o {Cheese, Crackers} => {Wine} [Support: 0.12, Confidence: 0.8]
2. High-Value Customer Patterns (Higher Minimum Support):
o {Champagne, Caviar} => {Truffles} [Support: 0.25, Confidence: 0.9]
o {Steak, Lobster} => {Red Wine} [Support: 0.18, Confidence: 0.7]
3. Low-Value Customer Patterns (Lower Minimum Support):
o {Chips, Soda} => {Popcorn} [Support: 0.03, Confidence: 0.5]
o {Cookies, Ice Cream} => {Milk} [Support: 0.06, Confidence: 0.6]
Applications:
1. Customer Segmentation:
o Identify patterns specific to different customer segments based on their spending habits.
o Tailor marketing strategies or promotions for each segment accordingly.
2. Product Bundling Strategies:
o Discover item combinations that are popular among high-value customers.
o Create targeted bundles or offers to maximize sales among different customer groups.
3. Market Basket Analysis:
o Understand purchasing behaviors at different levels of granularity.
o Optimize product placement, promotions, and inventory management strategies.
Correlation Rules
Correlation rules in data mining refer to discovering patterns of co-occurrence or association between items or
attributes in a dataset.
Unlike traditional association rules that focus on identifying frequent itemsets, correlation rules emphasize the
strength and direction of relationships between variables.
These rules are particularly useful for understanding how changes in one variable relate to changes in another,
providing insights into dependencies and associations beyond simple co-occurrences.
Key Concepts:
1. Correlation Coefficient:
o The correlation coefficient measures the strength and direction of a linear relationship between
two numerical variables.
o Values range from -1 to 1:
Positive values (close to 1) indicate a positive correlation (both variables increase or
decrease together).
Negative values (close to -1) indicate a negative correlation (one variable increases while
the other decreases).
Values close to 0 indicate little or no correlation.
2. Correlation Rules:
o Correlation rules identify pairs of items or attributes that have a significant correlation coefficient.
o These rules help in understanding how changes in one variable are associated with changes in
another.
Advantages of Correlation Rules:
1. Insight into Relationships:
o Provides insights into the strength and direction of relationships between variables.
o Helps in understanding dependencies and patterns in the data.
2. Identification of Causal Relationships:
o Indicates potential causal relationships between variables.
o Allows for hypothesis testing and validation of assumed cause-effect relationships.
3. Feature Selection:
o Useful for feature selection in predictive modeling tasks.
o Identifies variables that have the most impact on the target variable.
Techniques for Correlation Rule Mining:
1. Pearson Correlation:
o The Pearson correlation coefficient measures the linear relationship between two continuous
variables.
5. Conviction:
o Conviction measures the ratio of the expected frequency that X occurs without Y to the observed
frequency of errors in predicting Y.
o Formula:
Conviction(X => Y) = (1 − Support(Y)) / (1 − Confidence(X => Y))
o Conviction > 1 suggests a strong relationship, with larger values indicating stronger dependency.
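A short sketch of the two measures named above, assuming NumPy; the numeric columns and the support/confidence values fed to the conviction function are hypothetical.

```python
# Pearson correlation between two numeric attributes, and conviction of a rule
# computed from its support and confidence values.
import numpy as np

quantity = np.array([1, 2, 3, 4, 5])            # hypothetical numeric column
amount   = np.array([5.0, 9.5, 15.0, 19.0, 26.0])
r = np.corrcoef(quantity, amount)[0, 1]
print(f"Pearson r = {r:.3f}")                   # close to +1: strong positive correlation

def conviction(support_y, confidence_xy):
    # Conviction(X => Y) = (1 - Support(Y)) / (1 - Confidence(X => Y))
    return float("inf") if confidence_xy == 1 else (1 - support_y) / (1 - confidence_xy)

print(conviction(support_y=0.6, confidence_xy=0.75))   # 1.6 > 1: X and Y are dependent
```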
SECTION –A
1. Define clustering.
2. State any 3 clustering attributes.
3. Define centroid and medoid.
4. What are outliers?
5. Define dendrogram.
6. What is an association rule?
7. Define support.
8. Define confidence and lift.
9. What is data parallelism and task parallelism?
10. What is an incremental rule?
SECTION –B
1. Write short note on classification of clustering algorithms.
2. Explain about the similarity and distance measures.
3. Write short note on divisive clustering algorithms in details.
4. Explain the partitional algorithms.
5. Discuss about large item set method in detail.
SECTION – C
1. Discuss about agglomerative clustering algorithms with an example.
2. Create the cluster for the following data points using k – means & PAM methods.
x y
P0 2 10
P1 8 4
P2 7 5
P3 6 4
P4 1 2
P5 4 9
3. Discuss about steps involved in apriori algorithm in details.