DM & W - Unit - 3

Shanmuga Industries Arts and Science College, Tiruvannamalai.
PG & Research Department of Computer Science
Notes of Lesson – Unit III
Course Code & Name: 23PCS21 – Data Mining & Warehousing
Year: I    Semester: II

CLUSTERING
INTRODUCTION
 Clustering is similar to classification in that data are grouped.
 The groups are not predefined. Instead, the grouping is accomplished by finding similarities between data
according to characteristics found in the actual data.
 The groups are called clusters. Clustering has been used in many application domains, including biology,
medicine, anthropology, marketing, and economics.
 Clustering applications include plant and animal classification, disease classification, image processing,
pattern recognition, and document retrieval.
 One of the first domains in which clustering was used was biological taxonomy. Recent uses include
examining Web log data to detect usage patterns.
 When clustering is applied to a real-world database, many interesting problems occur:
 Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as
solitary clusters.
 Dynamic data in the database implies that cluster membership may change over time.
 Interpreting the semantic meaning of each cluster may be difficult. With classification, the labelling of the
classes is known ahead of time.
 There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact
number of clusters required is not easy to determine.

BASIC FEATURES OF CLUSTERING


 The (best) number of clusters is not known.
 There may not be any a priori knowledge concerning the clusters.
 Cluster results are dynamic.
 A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms
themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters
is created. Each level in the hierarchy has a separate set of clusters.

 The types of clustering algorithms can be further classified based on the implementation technique used.
Hierarchical algorithms can be categorized as agglomerative or divisive.
 "Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work
in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the
agglomerative vs. divisive labels, these terms are typically associated with hierarchical algorithms.

SIMILARITY AND DISTANCE MEASURES


 There are many desirable properties for the clusters created by a solution to a specific clustering problem. The
most important one is that a tuple within one cluster is more like tuples within that cluster than it is similar to
tuples outside it.
 Many clustering algorithms require that the distance between clusters (rather than elements) be determined.
This is not an easy task given that there are many interpretations for distance between clusters.


 Given clusters Ki and Kj, there are several standard alternatives to calculate the distance between clusters. A
representative list is:
 Single link: Smallest distance between an element in one cluster and an element in the other. We thus have
dis(Ki, Kj) = min( dis(til, tjm) ) for all til ∈ Ki (til ∉ Kj) and all tjm ∈ Kj (tjm ∉ Ki).
 Complete link: Largest distance between an element in one cluster and an element in the other. We thus have
dis(Ki, Kj) = max( dis(til, tjm) ) for all til ∈ Ki (til ∉ Kj) and all tjm ∈ Kj (tjm ∉ Ki).
 Average link: Average distance between an element in one cluster and an element in the other. We thus have
dis(Ki, Kj) = avg( dis(til, tjm) ) for all til ∈ Ki (til ∉ Kj) and all tjm ∈ Kj (tjm ∉ Ki).
 Centroid: If clusters have a representative centroid, then the centroid distance is defined as the distance
between the centroids. We thus have
dis(Ki , Kj) = dis(Ci, Cj), where Ci is the centroid for Ki and similarly for Cj.
 Medoid: Using a medoid to represent each cluster, the distance between the clusters can be defined by the
distance between the medoids:
dis(Ki , Kj) = dis(Mi , Mj).
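The following minimal Python sketch shows how these inter-cluster measures could be computed, assuming each cluster is simply a list of numeric coordinate tuples; the function names and sample values are illustrative, not taken from the text.

import math

def euclidean(a, b):
    # point-to-point Euclidean distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(Ki, Kj):
    # smallest distance between an element of Ki and an element of Kj
    return min(euclidean(p, q) for p in Ki for q in Kj)

def complete_link(Ki, Kj):
    # largest distance between an element of Ki and an element of Kj
    return max(euclidean(p, q) for p in Ki for q in Kj)

def average_link(Ki, Kj):
    # average of all pairwise distances between the two clusters
    dists = [euclidean(p, q) for p in Ki for q in Kj]
    return sum(dists) / len(dists)

def centroid_link(Ki, Kj):
    # distance between the centroids (coordinate-wise means) of the clusters
    ci = [sum(c) / len(Ki) for c in zip(*Ki)]
    cj = [sum(c) / len(Kj) for c in zip(*Kj)]
    return euclidean(ci, cj)

Ki = [(0.40, 0.53), (0.35, 0.32)]   # illustrative 2-D clusters
Kj = [(0.22, 0.38), (0.08, 0.41)]
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj), centroid_link(Ki, Kj))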

OUTLIERS
 Outliers are sample points with values much different from those of the remaining set of data. Outliers may
represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data value) or could be
correct data values that are simply much different from the remaining data.
 A person who is 2.5 meters tall is much taller than most people. In analyzing the height of individuals, this
value probably would be viewed as an outlier. Some clustering techniques do not perform well in the
presence of outliers. This problem is illustrated in Figure 5.3.

 Clustering algorithms may actually find and remove outliers to ensure that they perform better. Outlier
detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data
mining, algorithms may then choose to remove or treat these values differently. Some outlier detection
techniques are based on statistical measures; because these often do not perform well on real-world datasets,
alternative detection techniques may be based on distance measures.

DISTANCE MEASURES
 A similarity measure is a distance whose dimensions represent features of the objects. In simple terms, it is
a measure that helps us identify how alike two data objects are. If the distance is small, the objects have a
high similarity factor, and vice versa. So if two objects are similar they are denoted as Obj1 = Obj2, and if
they are not similar they are denoted as Obj1 != Obj2.
 The similarity is always measured in the range of 0 to 1 and is denoted as [0, 1].
 There are various techniques for calculating a similarity/distance measure. Let's look at two of the most
popular ones.


 Euclidean Distance
o This is most commonly used measure and is denoted as
o Similarity Distance Measure = SQRT( (X2-X1)^2 + (Y2-Y1)^2 )
o The Euclidean distance between two points is the length of the path connecting them.

 Manhattan Distance
o This is one more commonly used measure and is denoted as:
o Similarity Distance Measure = Abs (X2-X1) + Abs (Y2-Y1)
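As a quick illustration, here is a short Python sketch of both point-to-point measures; the point values are arbitrary examples.

import math

def euclidean_distance(p1, p2):
    # SQRT( (X2-X1)^2 + (Y2-Y1)^2 )
    return math.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)

def manhattan_distance(p1, p2):
    # Abs(X2-X1) + Abs(Y2-Y1)
    return abs(p2[0] - p1[0]) + abs(p2[1] - p1[1])

p1, p2 = (0.40, 0.53), (0.22, 0.38)
print(round(euclidean_distance(p1, p2), 2))   # 0.23
print(round(manhattan_distance(p1, p2), 2))   # 0.33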

HIERARCHICAL ALGORITHMS
 Hierarchical clustering is a popular method for grouping objects. It creates groups so that objects within a
group are similar to each other and different from objects in other groups. Clusters are visually represented in
a hierarchical tree called a dendrogram.
 The root in a dendrogram tree contains one cluster where all elements are together. The leaves in the
dendrogram each consist of a single element cluster. Internal nodes in the dendrogram represent new clusters
formed by merging the clusters that appear as its children in the tree. Each level in the tree is associated with
the distance measure that was used to merge the clusters. All clusters created at a particular level were
combined because the children clusters had a distance between them less than the distance value associated
with this level in the tree.
 Hierarchical clustering has a couple of key benefits:
1. There is no need to pre-specify the number of clusters. Instead, the dendrogram can be cut at the
appropriate level to obtain the desired number of clusters.
2. Data is easily summarized/organized into a hierarchy using dendrograms. Dendrograms make it easy to
examine and interpret clusters.

APPLICATIONS
 There are many real-life applications of Hierarchical clustering. They include:
 Bioinformatics: grouping animals according to their biological features to reconstruct phylogeny trees
 Business: dividing customers into segments or forming a hierarchy of employees based on salary.
 Image processing: grouping handwritten characters in text recognition based on the similarity of the
character shapes.
 Information Retrieval: categorizing search results based on the query.


Hierarchical Clustering Types


1. Agglomerative: Initially, each object is considered to be its own cluster. According to a particular procedure,
the clusters are then merged step by step until a single cluster remains. At the end of the cluster merging
process, a cluster containing all the elements will be formed.
2. Divisive: The divisive method is the opposite of the agglomerative method. Initially, all objects are
considered to be in a single cluster. Then the division process is performed step by step until each object forms a
separate cluster. The cluster division or splitting is carried out according to some principle, such as splitting at
the maximum distance between neighbouring objects in the cluster.
 Between agglomerative and divisive clustering, agglomerative clustering is generally the preferred method.
The example below focuses on agglomerative clustering algorithms because they are the most popular and
easiest to implement.

Hierarchical Agglomerative Clustering


 Hierarchical clustering employs a measure of distance/similarity to create new clusters.
 Steps for Agglomerative clustering can be summarized as follows:
o Step 1: Compute the proximity matrix using a particular distance metric
o Step 2: Each data point is assigned to a cluster
o Step 3: Merge the clusters based on a metric for the similarity between clusters
o Step 4: Update the distance matrix
o Step 5: Repeat Step 3 and Step 4 until only a single cluster remains
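If SciPy is available, this whole sequence of steps can be reproduced with its hierarchical clustering routines; the sketch below is only illustrative and uses the same six points as the worked example that follows.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# the six 2-D points P1..P6 from the worked example below
X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Steps 1-5: linkage() builds the full merge history (proximity matrix,
# repeated merging and matrix updates) using the single-link criterion
Z = linkage(X, method='single', metric='euclidean')
print(Z)   # each row: [cluster i, cluster j, merge distance, size of new cluster]

# cutting the dendrogram at a chosen distance yields flat clusters
print(fcluster(Z, t=0.16, criterion='distance'))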

Hierarchical Agglomerative Clustering – Single Link


 Find the clusters using the single link technique. Use Euclidean distance and draw the dendrogram.
X Y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30
1. Calculate the Euclidean distance between every pair of points and create the distance matrix.
 Distance [(X1, Y1), (X2, Y2)] = SQRT( (X2-X1)^2 + (Y2-Y1)^2 )
 Distance (P1, P2) = SQRT( (0.40 - 0.22)^2 + (0.53 - 0.38)^2 )
= SQRT( (0.18)^2 + (0.15)^2 )
= SQRT( 0.0324 + 0.0225 )
= SQRT( 0.0549 )
= 0.23
 Similarly, calculate the distance for every pair of points and create the distance matrix.
2. The distance matrix,
P1 P2 P3 P4 P5 P6
P1 0
P2 0.23 0
P3 0.22 0.15 0
P4 0.37 0.20 0.15 0
P5 0.34 0.14 0.28 0.29 0
P6 0.23 0.25 0.11 0.22 0.39 0


3. Find the minimum distance from the matrix and create the cluster. Here combine P3 & P6 as the first cluster
and update the distance matrix with respect to the new cluster (P3, P6)
          P1     P2     (P3, P6)   P4     P5
P1        0
P2        0.23   0
(P3, P6)   -      -      0
P4        0.37   0.20    -          0
P5        0.34   0.14    -          0.29   0
4. To update the distance matrix, find the minimum distance of P1, P2, P4 and P5 with respect to the new cluster (P3, P6).
Distance = Minimum [(P3, P6), (P1)]
= Minimum [(P1, P3), (P1, P6)]
= Minimum [(0.22), (0.23)]
= 0.22
Distance = Minimum [(P3, P6), (P2)]
= Minimum [(P2, P3), (P2, P6)]
= Minimum [(0.15), (0.25)]
= 0.15
Distance = Minimum [(P3, P6), (P4)]
= Minimum [(P3, P4), (P4, P6)]
= Minimum [(0.15), (0.22)]
= 0.15
Distance = Minimum [(P3, P6), (P5)]
= Minimum [(P3, P5), (P5, P6)]
= Minimum [(0.28), (0.39)]
= 0.28
5. The updated distance matrix
P1 P2 P3, P6 P4 P5
P1 0
P2 0.23 0
P3, P6 0.22 0.15 0
P4 0.37 0.20 0.15 0
P5 0.34 0.14 0.28 0.29 0
6. Find the minimum distance from the updated matrix and create the cluster. Here combine P2 & P5 as the
second cluster and update the distance matrix with respect to the new cluster (P2, P5)
          P1     (P2, P5)   (P3, P6)   P4
P1        0
(P2, P5)   -      0
(P3, P6)  0.22    -          0
P4        0.37    -          0.15       0

7. To update the distance matrix, find the minimum distance of P1, (P3, P6) and P4 with respect to the new cluster (P2, P5).
Distance = Minimum [(P2, P5), (P1)]
= Minimum [(P1, P2), (P1, P5)]
= Minimum [(0.23), (0.34)]
= 0.23
Distance = Minimum [(P2, P5), (P3, P6)]


= Minimum [((P3, P6), P2), ((P3, P6), P5)]


= Minimum [(0.15), (0.28)]
= 0.15
Distance = Minimum [(P2, P5), (P4)]
= Minimum [(P2, P4), (P4, P5)]
= Minimum [(0.20), (0.29)]
= 0.20

P1 P2, P5 P3, P6 P4
P1 0
P2, P5 0.23 0
P3, P6 0.22 0.15 0
P4 0.37 0.20 0.15 0

8. Find the minimum distance from the updated matrix and create the cluster. Here combine (P2, P5) & (P3,
P6) as the third cluster and update the distance matrix with respect to the new cluster ((P2, P5), (P3, P6)).

                   P1     (P2, P5, P3, P6)   P4
P1                 0
(P2, P5, P3, P6)    -      0
P4                 0.37    -                  0

9. To update the distance matrix, find the minimum distance of P1 and P4 with respect to the new cluster ((P2, P5), (P3, P6)).
Distance = Minimum [((P2, P5), (P3, P6)), P1]
= Minimum [((P2, P5), P1), ((P3, P6), P1)]
= Minimum [(0.23), (0.22)]
= 0.22
Distance = Minimum [((P2, P5), (P3, P6)), P4]
= Minimum [((P2, P5), P4), ((P3, P6), P4)]
= Minimum [(0.20), (0.15)]
= 0.15

                   P1     (P2, P5, P3, P6)   P4
P1                 0
(P2, P5, P3, P6)   0.22    0
P4                 0.37    0.15               0

10. Find the minimum distance from the updated matrix and create the cluster. Here combine ((P2, P5), (P3,
P6)) & P4 as the fourth cluster and update the distance matrix with respect to the new cluster ((P2, P5), (P3, P6),
P4).


                       P1     (P2, P5, P3, P6, P4)
P1                     0
(P2, P5, P3, P6, P4)    -      0

11. To update the distance matrix, find the minimum distance of P1 with respect to the new cluster ((P2, P5), (P3, P6), P4).
Distance = Minimum [((P2, P5), (P3, P6), P4), P1]
= Minimum [(((P2, P5), (P3, P6)), P1), (P4, P1)]
= Minimum [(0.22), (0.37)]
= 0.22

                       P1     (P2, P5, P3, P6, P4)
P1                     0
(P2, P5, P3, P6, P4)   0.22    0

12. Thus the clusters are created step by step. The final nesting of clusters is {(((P3, P6), (P2, P5)), P4), P1}.

PARTITIONAL ALGORITHMS
 Non-hierarchical or partitional clustering creates the clusters in one step as opposed to several steps.
 Partitioning methods are a widely used family of clustering algorithms in data mining that aim to partition
a dataset into K clusters. These algorithms attempt to group similar data points together while maximizing
the differences between the clusters.
 Partitioning methods work by iteratively refining the cluster centroids until convergence is reached. These
algorithms are popular for their speed and scalability in handling large datasets.
 The most widely used partitioning method is the K-means algorithm. Other popular partitioning methods
include K-medoids, Fuzzy C-means, and Hierarchical K-means.
 The K-medoids are similar to K-means but use medoids instead of centroids as cluster representatives.
Fuzzy C-means is a soft clustering algorithm that allows data points to belong to multiple clusters with
varying degrees of membership.
 Partitioning methods offer several benefits, including speed, scalability, and simplicity.


 They are relatively easy to implement and can handle large datasets. Partitioning methods are also
effective in identifying natural clusters within data and can be used for various applications, such as
customer segmentation, image segmentation, and anomaly detection.

K-Means Clustering or A Centroid-Based Technique


 K-Means (A centroid-based technique): The K-means algorithm takes the input parameter K from the user
and partitions the dataset containing N objects into K clusters so that the resulting similarity among the data
objects inside a group (intra-cluster similarity) is high, while the similarity of data objects with data objects
outside the cluster (inter-cluster similarity) is low.
 The similarity of the cluster is determined with respect to the mean value of the cluster.
 It is a type of squared-error algorithm. At the start, k objects are randomly chosen from the dataset, and
each of these objects represents a cluster mean (centre). Each of the remaining data objects is assigned to
the nearest cluster based on its distance from the cluster mean. The new mean of each cluster is
then recalculated with the added data objects.

Algorithm: K mean:
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects

Output:
A dataset of K clusters

Method (a short code sketch follows the steps):
1. Randomly choose K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to, based on the distance to the cluster means.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated memberships.
4. Repeat Steps 2 and 3 until no change occurs.
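A compact NumPy sketch of this loop, run on the example data points and initial centroids used below; the helper name and the assumption that no cluster ever becomes empty are illustrative additions, not part of the text.

import numpy as np

def kmeans(points, centroids, max_iter=100):
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest cluster mean (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each cluster mean (assumes no cluster becomes empty)
        new_centroids = np.array([points[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        # Step 4: stop when the means no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

data = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]   # A1..C2
labels, centroids = kmeans(data, centroids=[(2, 10), (5, 8), (1, 2)])      # initial centroids A1, B1, C1
print(labels)      # cluster index for A1, A2, A3, B1, B2, B3, C1, C2
print(centroids)   # final means, approximately (3.67, 9), (7, 4.33), (1.5, 3.5)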

Flowchart:

Example
Create 3 clusters for the following data points,
A1 (2, 10), A2 (2, 5), A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C1 (1, 2), C2 (4, 9), Use Euclidean distance.
1. If k is given as 3, we need to break down the data points into 3 clusters and the initial centroids (choose
randomly) are A1 (2, 10), B1 (5, 8), C1 (1, 2).
2. Calculate the distance using the initial centroids.


Point   x   y   Dist. to (2, 10)   Dist. to (5, 8)   Dist. to (1, 2)
A1 2 10 0.00 3.61 8.06
A2 2 5 5.00 4.24 3.16
A3 8 4 8.49 5.00 7.28
B1 5 8 3.61 0.00 7.21
B2 7 5 7.07 3.61 6.71
B3 6 4 7.21 4.12 5.39
C1 1 2 8.06 7.21 0.00
C2 4 9 2.24 1.41 7.62
3. Find the minimum value and assign the cluster
Point   x   y   Dist. to (2, 10)   Dist. to (5, 8)   Dist. to (1, 2)   Cluster
A1 2 10 0.00 3.61 8.06 1
A2 2 5 5.00 4.24 3.16 3
A3 8 4 8.49 5.00 7.28 2
B1 5 8 3.61 0.00 7.21 2
B2 7 5 7.07 3.61 6.71 2
B3 6 4 7.21 4.12 5.39 2
C1 1 2 8.06 7.21 0.00 3
C2 4 9 2.24 1.41 7.62 2
The Clusters are
Cluster 1 – A1
Cluster 2 – A3, B1, B2, B3, C2
Cluster 3 – A2, C1

4. Find the new centroids using the assigned clusters,


Cluster 1 – A1 = (2, 10)
Cluster 2 – A3, B1, B2, B3, C2 = (8+5+7+6+4)/5, (4+8+5+4+9)/5 = 30/5, 30/5 = (6,6)
Cluster 3 – A2, C1 = (2+1)/2, (5+2)/2 = 3/2, 7/2 = (1.5, 3.5)

5. Calculate the distance using the new centroids (2, 10), (6,6), (1.5, 3.5)
Point   x   y   Dist. to (2, 10)   Dist. to (6, 6)   Dist. to (1.5, 3.5)
A1 2 10 0.00 5.66 6.52
A2 2 5 5.00 4.12 1.58
A3 8 4 8.49 2.83 6.52
B1 5 8 3.61 2.24 5.70
B2 7 5 7.07 1.41 5.70
B3 6 4 7.21 2.00 4.53
C1 1 2 8.06 6.40 1.58
C2 4 9 2.24 3.61 6.04
6. Find the minimum value and assign the new cluster


Point   x   y   Dist. to (2, 10)   Dist. to (6, 6)   Dist. to (1.5, 3.5)   New Cluster   Old Cluster
A1 2 10 0.00 5.66 6.52 1 1
A2 2 5 5.00 4.12 1.58 3 3
A3 8 4 8.49 2.83 6.52 2 2
B1 5 8 3.61 2.24 5.70 2 2
B2 7 5 7.07 1.41 5.70 2 2
B3 6 4 7.21 2.00 4.53 2 2
C1 1 2 8.06 6.40 1.58 3 3
C2 4 9 2.24 3.61 6.04 1 2
 The New Clusters are
o Cluster 1 – A1, C2
o Cluster 2 – A3, B1, B2, B3
o Cluster 3 – A2, C1

7. Compare the new clusters with the old clusters; if any data point's cluster has changed, then continue the process.
8. Find the new centroids using the assigned clusters,
Cluster 1 – A1, C2 = (2+4)/2, (10+9)/2 = 6/2, 19/2 = (3, 9.5)
Cluster 2 – A3, B1, B2, B3 = (8+5+7+6)/4, (4+8+5+4)/4 = 26/4, 21/4 = (6.5, 5.25)
Cluster 3 – A2, C1 = (2+1)/2, (5+2)/2 = 3/2, 7/2 = (1.5, 3.5)
9. Calculate the distance using the new centroids (3, 9.5), (6.5, 5.25), (1.5, 3.5)
Point   x   y   Dist. to (3, 9.5)   Dist. to (6.5, 5.25)   Dist. to (1.5, 3.5)
A1 2 10 1.12 6.54 6.52
A2 2 5 4.61 4.51 1.58
A3 8 4 7.43 1.95 6.52
B1 5 8 2.50 3.13 5.70
B2 7 5 6.02 0.56 5.70
B3 6 4 6.26 1.35 4.53
C1 1 2 7.76 6.39 1.58
C2 4 9 1.12 4.51 6.04
10. Find the minimum value and assign the new cluster
Point   x   y   Dist. to (3, 9.5)   Dist. to (6.5, 5.25)   Dist. to (1.5, 3.5)   New Cluster   Old Cluster
A1 2 10 1.12 6.54 6.52 1 1
A2 2 5 4.61 4.51 1.58 3 3
A3 8 4 7.43 1.95 6.52 2 2
B1 5 8 2.50 3.13 5.70 1 2
B2 7 5 6.02 0.56 5.70 2 2
B3 6 4 6.26 1.35 4.53 2 2
C1 1 2 7.76 6.39 1.58 3 3
C2 4 9 1.12 4.51 6.04 1 1
 The New Clusters are


o Cluster 1 – A1, B1, C2


o Cluster 2 – A3, B2, B3
o Cluster 3 – A2, C1

11. Compare the new clusters with the old clusters; if any data point's cluster has changed, then continue the process.
12. Find the new centroids using the assigned clusters,
Cluster 1 – A1, B1, C2 = (2+5+4)/3, (10+8+9)/3 = 11/3, 27/3 = (3.67, 9)
Cluster 2 – A3, B2, B3 = (8+7+6)/3, (4+5+4)/3 = 21/3, 13/3 = (7, 4.33)
Cluster 3 – A2, C1 = (2+1)/2, (5+2)/2 = (1.5, 3.5)
13. Calculate the distance using the new centroids (3.67, 9), (7, 4.33), (1.5, 3.5)

Point   x   y   Dist. to (3.67, 9)   Dist. to (7, 4.33)   Dist. to (1.5, 3.5)
A1 2 10 1.94 7.56 6.52
A2 2 5 4.33 5.04 1.58
A3 8 4 6.62 1.05 6.52
B1 5 8 1.67 4.18 5.70
B2 7 5 5.21 0.67 5.70
B3 6 4 5.52 1.05 4.53
C1 1 2 7.49 6.44 1.58
C2 4 9 0.33 5.55 6.04

14. Find the minimum value and assign the new cluster
Point   x   y   Dist. to (3.67, 9)   Dist. to (7, 4.33)   Dist. to (1.5, 3.5)   New Cluster   Old Cluster
A1 2 10 1.94 7.56 6.52 1 1
A2 2 5 4.33 5.04 1.58 3 3
A3 8 4 6.62 1.05 6.52 2 2
B1 5 8 1.67 4.18 5.70 1 1
B2 7 5 5.21 0.67 5.70 2 2
B3 6 4 5.52 1.05 4.53 2 2
C1 1 2 7.49 6.44 1.58 3 3
C2 4 9 0.33 5.55 6.04 1 1
15. Compare the new clusters with the old clusters; if any data point's cluster has changed, then continue the process.
Here the old and new clusters are the same, so the final 3 clusters are
Cluster 1 – A1, B1, C2
Cluster 2 – A3, B2, B3
Cluster 3 – A2, C1

PAM Algorithm or K – Medoid Clustering


 The PAM (partitioning around medoids) algorithm, also called the K-medoids algorithm, represents a cluster
by a medoid. Using a medoid is an approach that handles outliers well.
 A Medoid is a point in the cluster from which dissimilarities with all the other points in the clusters are
minimal.


 Instead of centroids as reference points in K-Means algorithms, the K-Medoids algorithm takes a Medoid as a
reference point.
 Algorithm:
 Given the value of k and unlabelled data:
1. Choose k number of random points from the data and assign these k points to k number of clusters. These
are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid and assign it to the cluster with
the nearest medoid.
3. Calculate the total cost (Sum of all the distances from all the data points to the medoids)
4. Select a random point as the new medoid and swap it with the previous medoid. Repeat 2 and 3 steps.
5. If the total cost of the new medoid is less than that of the previous medoid, make the new medoid
permanent and repeat step 4.
6. If the total cost of the new medoid is greater than the cost of the previous medoid, undo the swap and
repeat step 4.
7. The repetitions continue until no swap changes the medoids used to classify the data points (a short cost-computation sketch follows).
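The decisive part of each iteration is the total-cost comparison; the short Python sketch below reproduces the cost figures of the worked example that follows (the helper names are illustrative).

def manhattan(p, q):
    # |x1 - x2| + |y1 - y2|
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    # step 3: sum of distances from every point to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

points = [(5, 4), (7, 7), (1, 3), (8, 6), (4, 9)]    # P0..P4

old_cost = total_cost(points, [(1, 3), (4, 9)])      # initial medoids M1(1, 3), M2(4, 9)
new_cost = total_cost(points, [(5, 4), (4, 9)])      # after swapping M1 with P0(5, 4)
print(old_cost, new_cost)                            # 17 15

# steps 5-6: keep the swap only if it lowers the total cost
if new_cost < old_cost:
    print("swap accepted; the medoids become (5, 4) and (4, 9)")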
 Example
 Data set:
x y

P0 5 4

P1 7 7

P2 1 3

P3 8 6

P4 4 9

 Scatter plot:

 If k is given as 2, we need to break down the data points into 2 clusters.


1. Consider Initial medoids: M1(1, 3) and M2(4, 9)
2. Calculation of distances using Manhattan Distance: |x1 - x2| + |y1 - y2|


Point   x   y   Distance from M1(1, 3)   Distance from M2(4, 9)   Cluster (assign to the nearer medoid)

P0 5 4   |1-5| + |3-4| = 5    |4-5| + |9-4| = 6    Cluster 1
P1 7 7   |1-7| + |3-7| = 10   |4-7| + |9-7| = 5    Cluster 2
P2 1 3   |1-1| + |3-3| = 0    |4-1| + |9-3| = 9    Cluster 1
P3 8 6   |1-8| + |3-6| = 10   |4-8| + |9-6| = 7    Cluster 2
P4 4 9   |1-4| + |3-9| = 9    |4-4| + |9-9| = 0    Cluster 2


 Create initial cluster,
o Cluster 1 = P0, P2 = {(5, 4), (1, 3)}
o Cluster 2 = P1, P3, P4 = {(7, 7), (8, 6), (4, 9)}

3. Calculation of total cost: Cost(c, x) = Σi |ci - xi|

Total Cost = Cost((1, 3), (5, 4)) + Cost((4, 9), (7, 7)) + Cost((4, 9), (8, 6))
= (|1 - 5| + |3 - 4|) + (|4 - 7| + |9 - 7|) + (|4 - 8| + |9 - 6|)
= (4 + 1) + (3 + 2) + (4 + 3) = 17

4. Consider New medoids: M1(5, 4) and M2(4, 9) [Change any one medoid only]
5. Calculation of distances using Manhattan Distance: |x1 - x2| + |y1 - y2|
Point   x   y   Distance from M1(5, 4)   Distance from M2(4, 9)   Cluster (assign to the nearer medoid)

P0 5 4   |5-5| + |4-4| = 0    |4-5| + |9-4| = 6    Cluster 1
P1 7 7   |5-7| + |4-7| = 5    |4-7| + |9-7| = 5    Cluster 2
P2 1 3   |5-1| + |4-3| = 5    |4-1| + |9-3| = 9    Cluster 1
P3 8 6   |5-8| + |4-6| = 5    |4-8| + |9-6| = 7    Cluster 1
P4 4 9   |5-4| + |4-9| = 6    |4-4| + |9-9| = 0    Cluster 2

 The New clusters are,


o Cluster 1 = P0, P2, P3 = {(5, 4), (1, 3), (8, 6)}
o Cluster 2 = P1, P4 = {(7, 7), (4, 9)}

6. Calculation of total cost: Cost(c, x) = Σi |ci - xi|

Total Cost = Cost((5, 4), (1, 3)) + Cost((5, 4), (8, 6)) + Cost((4, 9), (7, 7))
= (|5 - 1| + |4 - 3|) + (|5 - 8| + |4 - 6|) + (|4 - 7| + |9 - 7|)
= (4 + 1) + (3 + 2) + (3 + 2) = 15


7. Calculate the change in cost from swapping the medoids, comparing the initial and new clusters:
S = Initial Total Cost – New Total Cost
S = 17 – 15 = 2 > 0, so the new total cost is lower than the previous cost and the swap is accepted.
8. The final medoids are M1(5, 4) and M2(4, 9), and the clusters are Cluster 1 = {P0, P2, P3} and Cluster 2 = {P1, P4}.


ASSOCIATION RULES
 Association rules are used to show the relationships between data items. These uncovered relationships are not
inherent in the data, as with functional dependencies, and they do not represent any sort of causality or
correlation.
 Association rule mining finds interesting associations and relationships among large sets of data items. This
rule shows how frequently an itemset occurs in a transaction.
 A typical example is Market Basket Analysis. Market Basket Analysis is one of the key techniques used by
large retailers to show associations between items.
 It allows retailers to identify relationships between the items that people buy together frequently.
 Given a set of transactions, we can find rules that will predict the occurrence of an item based on the
occurrences of other items in the transaction.
 TABLE 6.4: Association Rule Notation
Term Description
D Database of transactions
ti Transaction in D
s Support
α Confidence
X, Y Item sets
X --> Y Association rule
L Set of large itemsets
l Large itemset in L
C Set of candidate itemsets
p Number of partitions

 Basic Definitions
o A sample association rule is: If A then B (A → B).
o Here the "If" element (A) is called the antecedent and the "Then" element (B) is called the consequent.
o Support Count (σ) – Frequency of occurrence of an itemset; for example, σ({Milk, Bread, Diaper}) = 2.
o Frequent Itemset – An itemset whose support is greater than or equal to a minimum support
threshold.
o Association Rule – An implication expression of the form X -> Y, where X and Y are any two itemsets.
Example: {Milk, Diaper} -> {Juice}
 Rule Evaluation Metrics


 Support(s) – The number of transactions that include the items in both the {X} and {Y} parts of the rule, as a
percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs
together as a percentage of all transactions.
Support (X) = Frequency of (X) / Total No. of Transactions.

 Confidence(c) – The ratio of the number of transactions that include all items in both {A} and {B} to the
number of transactions that include all items in {A}.
Confidence (A → B) = Support (A ∪ B) / Support (A).

 Lift – The lift of the rule X => Y is the confidence of the rule divided by the expected confidence, assuming
that the itemsets X and Y are independent of each other. The expected confidence is simply the support
(frequency) of {Y}.
Lift = Support (X, Y) / ( Support (X) × Support (Y) )
If Lift = 1, the probability of occurrence of antecedent and consequent is independent of each other.
If Lift > 1, it determines the degree to which the two itemsets are dependent to each other.
If Lift < 1, it determines the one item is a substitute for other items, which means one item has a negative
effect on another.
 Example
TID Items
T1 A, B, C

T2 B, C, E

T3 A, B, C
T4 E, B,D

Support (A) = 2 / 4 = 0.5 = 50%

Support (A, B) = 2 / 4 = 0.5 = 50%
Support (B) = 4 / 4 = 1.0 = 100%

Confidence (A → B) = Support (A, B) / Support (A) = 0.5 / 0.5 = 1

Lift (A → B) = Support (A, B) / ( Support (A) × Support (B) ) = 0.5 / (0.5 × 1.0) = 1
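These three metrics can be checked with a few lines of Python on the same four transactions; the function names here are illustrative.

transactions = [{"A", "B", "C"}, {"B", "C", "E"}, {"A", "B", "C"}, {"E", "B", "D"}]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # support of X ∪ Y divided by support of X
    return support(X | Y) / support(X)

def lift(X, Y):
    # confidence of X => Y divided by the support of Y
    return support(X | Y) / (support(X) * support(Y))

A, B = {"A"}, {"B"}
print(support(A), support(A | B), support(B))   # 0.5 0.5 1.0
print(confidence(A, B))                          # 1.0
print(lift(A, B))                                # 1.0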

Applications of Association Rule Learning


 It has various applications in machine learning and data mining. Below are some popular applications of
association rule learning:
o Market Basket Analysis: It is one of the popular examples and applications of association rule mining.
This technique is commonly used by big retailers to determine the association between items.
o Medical Diagnosis: Association rules can assist diagnosis, as they help in identifying the probability of
illness for a particular disease.
o Protein Sequence: The association rules help in determining the synthesis of artificial Proteins.
o It is also used for the Catalog Design and Loss-leader Analysis and many more other applications.


Generating Association Rules


A general 2-step algorithm for generating ARs:
1. Generate all itemsets that have a support exceeding the given threshold. Itemsets with this property are called
large or frequent itemsets.
2. Generate rules for each large itemset as follows:
1) For a large itemset X and Y a subset of X, let Z = X – Y
2) If support(X)/Support(Z) > minimum confidence, then the rule Z=>Y (i.e. X-Y=>Y) is a valid rule.

Large Itemsets
 The most common approach to finding association rules is to break up the problem into two parts:
1. Find large itemsets.
2. Generate rules from frequent itemsets.
 An itemset is any subset of the set of all items, I.
 A large (frequent) itemset is an itemset whose number of occurrences is above a threshold, s. We use the
notation L to indicate the complete set of large itemsets and l to indicate a specific large itemset.
 Generate rules for each large itemset as follows:
1) For a large itemset X and Y a subset of X, let Z = X – Y
2) If support(X)/Support (Z) > minimum confidence, then the rule Z=>Y (i.e. X-Y=>Y) is a valid rule

 Example: Consider the transactions shown below, with the associated supports of each itemset listed in the
table that follows. Suppose that the input support and confidence are s = 30% and α = 50%, respectively. Using
this value of s, we obtain the following set of large itemsets:

Transaction Items
t1 Bread, Jelly, PeanutButter
t2 Bread, PeanutButter
t3 Bread, Milk, PeanutButter
t4 Juice, Bread
t5 Juice, Milk


Set Support Set Support


Juice 40 Juice, Bread, Jelly 0
Bread 80 Juice, Bread, Milk 0
Jelly 20 Juice, Bread, PeanutButter 0
Milk 40 Juice, Jelly, Milk 0
PeanutButter 60 Juice, Jelly, PeanutButter 0
Juice, Bread 20 Bread, Jelly, Milk 0
Juice, Jelly 0 Bread, Jelly, PeanutButter 20
Juice, Milk 20 Bread, Milk, PeanutButter 20
Juice, PeanutButter 0 Jelly, Milk, PeanutButter 0
Bread, Jelly 20 Juice, Bread, Jelly, Milk 0
Bread, Milk 20 Juice, Bread, Jelly, PeanutButter 0
Bread, PeanutButter 60 Juice, Bread, Milk, PeanutButter 0
Jelly, Milk 0 Juice, Jelly, Milk, PeanutButter 0
Jelly, PeanutButter 20 Bread, Jelly, Milk, PeanutButter 0
Milk, PeanutButter 20 Juice, Bread, Jelly, Milk, PeanutButter 0

 L = {{Juice}, {Bread}, {Milk}, {PeanutButter}, {Bread, PeanutButter}}.


o The association rules are generated from the last large itemset.
o Here l = {Bread, PeanutButter}. There are two nonempty subsets of l: {Bread} and {PeanutButter}.
o support ({Bread, PeanutButter}) / support({Bread}) = 60 / 80 = 0.75.
o This means that the confidence of the association rule Bread → PeanutButter is 75%; since this is above α,
it is a valid association rule and is added to R. Likewise with the second subset.
o For the second subset,
o support ({Bread, PeanutButter}) / support({PeanutButter}) = 60 / 60 = 1.
o This means that the confidence of the association rule PeanutButter → Bread is 100%, and this is also a valid
association rule.
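A small Python sketch of this two-step rule generation, applied only to the large itemset {Bread, PeanutButter} with the supports from the table above; the variable names and threshold handling are illustrative.

from itertools import combinations

support = {
    frozenset({"Bread"}): 80,
    frozenset({"PeanutButter"}): 60,
    frozenset({"Bread", "PeanutButter"}): 60,
}
min_confidence = 50   # alpha = 50%

X = frozenset({"Bread", "PeanutButter"})
# for every non-empty proper subset Z of X, output Z => X - Z
# when support(X) / support(Z) clears the confidence threshold
for r in range(1, len(X)):
    for Z in map(frozenset, combinations(X, r)):
        conf = 100 * support[X] / support[Z]
        if conf >= min_confidence:
            print(set(Z), "=>", set(X - Z), f"(confidence {conf:.0f}%)")
# prints (in either order): {'Bread'} => {'PeanutButter'} (confidence 75%)
#                           {'PeanutButter'} => {'Bread'} (confidence 100%)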

BASIC ALGORITHMS

APRIORI ALGORITHM
 The Apriori algorithm is a classic and widely used algorithm for mining frequent itemsets and association
rules in transactional databases. It is particularly efficient in identifying patterns in datasets with a large
number of transactions. The algorithm works by generating candidate itemsets, checking their support, and
pruning the search space based on the Apriori principle.
 Key Concepts:
 Frequent Itemsets: Itemsets that appear in transactions with a frequency greater than or equal to a
specified threshold.
 Support: The proportion of transactions in the dataset that contain a particular itemset.
 Steps of the Apriori Algorithm:
1. Initialization:
o Begin with the frequent itemsets of length 1 (single items).
o Calculate the support for each item, i.e., the number of transactions containing the item.
2. Generating Candidate Itemsets:
o Use the frequent itemsets of length (k-1) to generate candidate itemsets of length k.


o Candidate generation involves joining two frequent itemsets of length (k-1) if their first (k-2)
items are identical.
 For example, if {A, B} and {A, C} are frequent, their join operation will produce {A, B,
C}.
3. Pruning Candidates:
o After generating candidate itemsets, prune those that do not meet the minimum support threshold.
o The Apriori principle states that if an itemset is infrequent, all its supersets will also be infrequent.
o This reduces the search space and avoids the need to consider all possible combinations.
4. Calculating Support:
o Count the support of each candidate itemset by scanning the entire dataset.
o Support is calculated as the number of transactions containing the itemset divided by the total
number of transactions.
5. Repeat Steps 2-4:
o Iterate through the process of generating candidates, pruning, and calculating support for itemsets
of increasing length until no more frequent itemsets can be found.
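The steps above can be condensed into a short, self-contained Python sketch; it is a simplified illustration run on the example transactions that follow, not an optimized implementation.

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    support = lambda itemset: sum(itemset <= t for t in transactions) / n
    # Step 1: frequent itemsets of length 1
    items = {item for t in transactions for item in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]
    k = 2
    while frequent[-1]:
        prev = frequent[-1]
        # Step 2: join frequent (k-1)-itemsets to build candidate k-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Step 3: prune candidates having an infrequent (k-1)-subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Step 4: count support and keep only the frequent candidates
        frequent.append({c for c in candidates if support(c) >= min_support})
        k += 1   # Step 5: repeat for longer itemsets until nothing new is frequent
    return [s for level in frequent for s in level]

transactions = [{"Milk", "Bread", "Eggs"}, {"Bread", "Butter"},
                {"Milk", "Bread", "Butter"}, {"Bread", "Eggs"}, {"Milk", "Eggs"}]
print(apriori(transactions, min_support=0.5))
# only the single items {Bread}, {Milk}, {Eggs} are frequent at support 0.5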
 Example Scenario: Let's consider a simple transaction dataset for a retail store:
Transaction ID Items Purchased
1 {Milk, Bread, Eggs}
2 {Bread, Butter}
3 {Milk, Bread, Butter}
4 {Bread, Eggs}
5 {Milk, Eggs}
 Step 1: Initialization
 Calculate the support for single items:
o Support({Milk}) = 3/5 = 0.6
o Support({Bread}) = 4/5 = 0.8
o Support({Eggs}) = 3/5 = 0.6
o Support({Butter}) = 2/5 = 0.4
 Step 2: Generating Candidate Itemsets
 Join frequent itemsets to create candidate itemsets of length 2:
o {Milk, Bread}, {Milk, Eggs}, {Milk, Butter}, {Bread, Eggs}, {Bread, Butter}
 Step 3: Pruning Candidates
 Remove candidates that contain an infrequent item, here Butter, whose support of 0.4 is below the assumed
minimum support of 0.5:
o {Milk, Bread}, {Milk, Eggs}, {Bread, Eggs}
 Step 4: Calculating Support for Remaining Candidates
 Count the support of each remaining candidate:
o Support({Milk, Bread}) = 2/5 = 0.4 (Below minimum support, so prune)
o Support({Milk, Eggs}) = 2/5 = 0.4 (Below minimum support, so prune)
o Support({Bread, Eggs}) = 2/5 = 0.4 (Below minimum support, so prune)
 Step 5: Repeat Steps 2-4
 Generate candidate itemsets of length 3 (none meet minimum support, so stop)

 Resulting Frequent Itemsets:


 {Bread} with support 0.8
 {Milk} and {Eggs} with support 0.6 (although they didn't make it to length 2, they are individually
frequent)


 Association Rules:
 From the frequent itemsets, we can generate association rules with various confidence thresholds:
o {Bread} => {Milk} (Confidence = 2/4 = 0.5)
o {Bread} => {Eggs} (Confidence = 2/4 = 0.5)
o {Milk} => {Bread} (Confidence = 2/3 = 0.67)
o {Eggs} => {Bread} (Confidence = 2/3 = 0.67)
 Interpretation:
 The algorithm identified that {Bread} is the most frequent item in the dataset.
 The association rules show that customers who bought {Bread} also bought {Milk} or {Eggs} with a
confidence of 50%.
 Real-World Application:
o In a retail setting, the Apriori algorithm could be used to analyze customer purchasing patterns:
 "Customers who buy Bread are 50% likely to buy Milk as well."
 "Customers who buy Bread are 50% likely to buy Eggs as well."
o These insights can guide inventory management, product placement, and targeted marketing strategies.
 The Apriori algorithm's ability to efficiently mine frequent itemsets and generate association rules makes it a
powerful tool for discovering valuable patterns in transactional data, leading to informed business decisions
and strategies.

PARALLEL AND DISTRIBUTED ALGORITHMS


 Parallel or distributed association rule algorithms strive to parallelize either the data, known as data
parallelism, or the candidates, referred to as task parallelism.
Introduction
 Parallel and Distributed Computing
o In data mining, parallel and distributed algorithms are designed to handle the processing of large
datasets.
o These algorithms aim to reduce computation time and improve efficiency by leveraging multiple
processors or nodes.
Parallel vs. Distributed Association Rule Algorithms
 Data Parallelism vs. Task Parallelism
o Parallel or distributed association rule algorithms focus on either:
 Data Parallelism: Parallelizing the data itself.
 Task Parallelism: Parallelizing the candidate itemsets.
 Task Parallelism
o Candidates are partitioned and counted separately at each processor.
o Example: Partitioning algorithm is easily parallelized using task parallelism.
Differentiating Factors in Parallel Association Rule Algorithms
 Load-Balancing Approach
o Algorithms may employ different strategies for distributing the workload among processors.
o Ensures fair distribution of tasks to optimize performance.
 Architecture
o Refers to the underlying structure of the computing system.
o Determines how tasks are assigned and executed across processors or nodes.
Advantages of Data Parallelism Algorithms
 Reduced Communication Cost
o Only initial candidates (item sets) and local counts need to be distributed.
o Communication overhead is lower compared to task parallelism.
 Memory Requirements
o Each processor needs sufficient memory to store all candidates during each scan.
o Performance degrades if I/O is required for both the database and candidate set.


Advantages of Task Parallelism Algorithms


 Memory Efficiency
o Only subset of candidates assigned to a processor needs to fit into memory during each scan.
o Adaptability to varying memory sizes at different sites.
 Scalability
o Task parallel algorithms can scale based on the number of processors and database size.
o Linear scalability with the number of processors is observed in data parallelism.
Considerations for Task Parallelism
 Memory Constraints
o Total size of all candidates must be small enough to fit into the combined memory of all processors.
o Variations of basic algorithms address these memory issues.
Performance Studies
 Data Parallelism
o Scales linearly with the number of processors and database size.
o Reduced memory requirements.
 Task Parallelism
o Offers a solution where data parallelism may not be feasible due to memory constraints.
o Efficiency depends on the load-balancing approach and architecture.
Conclusion
 Parallel and distributed association rule algorithms aim to optimize the processing of large datasets.
 Data parallelism focuses on parallelizing the data, while task parallelism focuses on parallelizing the candidate
itemsets.
 Considerations include load balancing, architecture, communication costs, and memory requirements.
 Task parallelism offers memory efficiency and scalability, making it a viable option for certain scenarios.

DATA PARALLELISM
 A representative data parallelism algorithm is the count distribution algorithm (CDA). The database is divided into p
partitions, one for each processor. Each processor counts the candidates for its data and then broadcasts its
counts to all other processors.
 Each processor then determines the global counts. These then are used to determine the large item sets and to
generate the candidates for the next scan.
Explanation of CDA
 Objective:
 CDA's primary goal is to distribute the task of counting itemsets across multiple processors.
 This allows for parallel processing of large datasets, reducing the overall computation time.
 Steps in CDA:
o Data Partitioning:
 The dataset is divided into partitions, with each partition assigned to a different processor.
 For example, consider a retail dataset of customer transactions:
o Processor 1 gets transactions from January to March.
o Processor 2 gets transactions from April to June.
o Processor 3 gets transactions from July to September.
o And so on...
o Local Counting:
 Each processor performs local counting of itemsets within its assigned partition.
 For instance:
o Processor 1 counts the occurrences of {milk, bread}, {milk, eggs}, etc., in its partition.
o Processor 2 counts the occurrences of {bread, butter}, {bread, jam}, etc., in its partition.
o Processor 3 counts the occurrences of {eggs, cheese}, {eggs, yogurt}, etc., in its partition.
o Each processor independently generates its local itemset counts.


o Aggregation:
 After local counting, the partial counts from all processors are aggregated.
 This step combines the local counts of the same itemsets from different partitions.
 For example:
o If {milk, bread} appeared 100 times in Processor 1's partition and 150 times in Processor
2's partition, the total count becomes 250.
o Final Result:
 The final result is the aggregated count of itemsets across all partitions.
 This result provides insights into the frequent itemsets that appear frequently across the entire
dataset.
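The following toy Python sketch simulates CDA's counting and aggregation sequentially; the items, partitions, and helper names are made up for illustration, and a real system would run local_counts on separate processors and exchange the counters over the network.

from collections import Counter

# candidate itemsets whose global counts are wanted (illustrative)
candidates = [frozenset(c) for c in [("milk", "bread"), ("bread", "butter"), ("milk", "eggs")]]

# the database split into p partitions, one per (simulated) processor
partitions = [
    [{"milk", "bread", "eggs"}, {"bread", "butter"}],          # processor 1
    [{"milk", "bread"}, {"milk", "eggs"}, {"bread", "jam"}],   # processor 2
]

def local_counts(partition):
    # each processor counts every candidate against its own partition only
    return Counter({c: sum(c <= t for t in partition) for c in candidates})

# the broadcast/aggregation step, simulated here by summing the local counters
global_counts = sum((local_counts(p) for p in partitions), Counter())
print(dict(global_counts))   # global counts; zero-count candidates are dropped by Counter addition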
Advantages of CDA:
 Reduced Communication:
o Only the final aggregated counts need to be communicated between processors.
o Minimal communication overhead compared to transferring entire datasets.
 Scalability:
o CDA scales well with an increase in the number of processors.
o More processors lead to faster computation times due to parallel processing.

Example: Retail Market Analysis


 Example of CDA applied to a retail dataset for market analysis.
Scenario:
 Suppose we have a large retail dataset containing customer transactions.
 The dataset includes information about items purchased together in each transaction.
Objective:
 Our goal is to find frequent itemsets that are commonly purchased together.
 This information helps in strategies like product placement, promotions, and bundling.
Using CDA:
1. Data Partitioning:
o The retail dataset is divided into partitions based on transaction dates.
o Each processor is assigned a specific time period of transactions.
2. Local Counting:
o Processor 1 counts the occurrences of {milk, bread}, {milk, eggs}, etc., in its assigned time period.
o Processor 2 counts the occurrences of {bread, butter}, {bread, jam}, etc., in its assigned time period.
o Each processor independently calculates local itemset counts.
3. Aggregation:
o The partial counts of itemsets from all processors are aggregated.
o For instance, the count of {milk, bread} across all partitions is combined.
4. Final Result:
o The final result provides a list of frequent itemsets across the entire retail dataset.
o Retailers can use this information to:
 Arrange frequently purchased items close to each other in stores.
 Create targeted promotions for item bundles.
 Optimize inventory and supply chain management.
Benefits:
 Efficiency:
o CDA allows for efficient mining of frequent itemsets from a vast retail dataset.
o Parallel processing reduces the time needed to derive valuable insights.
 Scalability:
o As the retail dataset grows, CDA can scale by adding more processors.


o This scalability ensures that the algorithm remains effective for handling increasing amounts of
transaction data.
 The Count Distribution Algorithm (CDA) exemplifies data parallelism in association rule mining. By dividing
the dataset, performing local counts, aggregating results, and providing valuable insights, CDA enables
efficient and scalable analysis of large datasets such as retail transaction data.

TASK PARALLELISM
 The data distribution algorithm (DDA) demonstrates task parallelism. Here the candidates as well as the
database are partitioned among the processors. Each processor in parallel counts the candidates given to it
using its local database partition.
 The Data Distribution Algorithm (DDA) is a task parallelism algorithm used in association rule mining.
Unlike data parallelism, which focuses on parallelizing the data itself, task parallelism, as implemented by
DDA, aims to parallelize the processing of candidate itemsets.
 Objective:
o DDA's primary goal is to distribute the task of processing candidate itemsets across multiple
processors or nodes.
o This parallelization enhances the efficiency of finding frequent itemsets in large datasets.
 Steps in DDA:
o Candidate Partitioning:
 The candidate itemsets are divided into subsets, with each subset assigned to a different
processor.
 For example, consider a set of candidate itemsets:
 Processor 1 handles {milk, bread}, {milk, eggs}, etc.
 Processor 2 handles {bread, butter}, {bread, jam}, etc.
 Processor 3 handles {eggs, cheese}, {eggs, yogurt}, etc.
 Each processor is responsible for a specific subset of candidate itemsets.
o Local Processing:
 Each processor independently processes its assigned subset of candidate itemsets.
 For instance:
 Processor 1 checks occurrences of {milk, bread}, {milk, eggs} in its subset.
 Processor 2 checks occurrences of {bread, butter}, {bread, jam} in its subset.
 Processor 3 checks occurrences of {eggs, cheese}, {eggs, yogurt} in its subset.
 Local counting and verification of itemsets occur within each processor.
o Communication:
 During processing, processors may need to communicate intermediate results.
 For example, if a candidate itemset {milk, bread} is found in Processor 1, it might need to
inform other processors about this discovery.
 Communication ensures that all processors are aware of the potential frequent itemsets found
in their subsets.
o Aggregation of Results:
 Once local processing is complete, the results from all processors are aggregated.
 This step combines the locally discovered frequent itemsets into a comprehensive list.
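For contrast with CDA, a toy sequential simulation of DDA's task parallelism is sketched below; the candidate subsets and transactions are invented for illustration, and in practice each subset would be counted on its own processor, combining its local data partition with data received from the other processors.

from collections import Counter

# all transactions; this simplified sketch lets every (simulated) processor scan the full list
transactions = [{"milk", "bread"}, {"bread", "butter"}, {"milk", "eggs"}, {"eggs", "cheese"}]

# task parallelism: the candidate itemsets, not the data, are split across processors
candidate_subsets = [
    [frozenset({"milk", "bread"}), frozenset({"milk", "eggs"})],   # processor 1
    [frozenset({"bread", "butter"})],                              # processor 2
    [frozenset({"eggs", "cheese"})],                               # processor 3
]

def count_subset(cands):
    # each processor counts only its own candidates, but over all transactions
    return Counter({c: sum(c <= t for t in transactions) for c in cands})

# aggregation: the per-processor results are simply combined
results = Counter()
for subset in candidate_subsets:
    results.update(count_subset(subset))
print(dict(results))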
 Advantages of DDA:
o Memory Efficiency:
 DDA requires memory only for the subset of candidate itemsets assigned to each processor.


 Each processor handles a smaller portion of the candidate itemsets, reducing memory
requirements.
o Adaptability:
 The algorithm adapts to varying memory sizes at different processors.
 Since not all partitions of candidates need to be the same size, DDA can adjust accordingly.
o Scalability:
 DDA scales well with an increase in the number of processors.
 Adding more processors allows for faster processing of candidate itemsets.
o Real-World Example: Online Retail Platform
 Let's consider a real-world example of DDA applied to an online retail platform's transaction data for market
analysis.
 Scenario:
o The online platform has a vast database of customer transactions.
o Each transaction includes the items purchased together by customers.
 Objective:
o The goal is to identify frequent itemsets to improve product recommendations and marketing
strategies.
 Using DDA:
o Candidate Partitioning:
 The set of candidate itemsets is divided into subsets based on item combinations.
 Each processor is assigned a specific subset of candidate itemsets.
 Processor 1 handles {milk, bread}, {milk, eggs}, etc.
 Processor 2 handles {bread, butter}, {bread, jam}, etc.
 Processor 3 handles {eggs, cheese}, {eggs, yogurt}, etc.
 Local Processing:
o Processor 1 checks occurrences of {milk, bread}, {milk, eggs} in its subset of transactions.
o Processor 2 checks occurrences of {bread, butter}, {bread, jam} in its subset of transactions.
o Processor 3 checks occurrences of {eggs, cheese}, {eggs, yogurt} in its subset of transactions.
o Each processor independently determines the frequent itemsets within its assigned subset.

 Communication:
o If Processor 1 finds a frequent itemset {milk, bread}, it communicates this to other processors.
o This communication ensures that all processors are aware of the potential frequent itemsets across
subsets.
 Aggregation of Results:
o After local processing, the results from all processors are combined.
o The final result is a comprehensive list of frequent itemsets discovered across the entire dataset.
 Benefits:
o Efficient Processing:
 DDA enables efficient processing of candidate itemsets across multiple processors.
 Parallelization reduces the time needed to identify frequent itemsets.
o Personalized Recommendations:
 The discovered frequent itemsets help in generating personalized product recommendations
for online shoppers.


 Customers are shown items frequently purchased together, enhancing their shopping
experience.
o Marketing Insights:
 Retailers gain insights into customer preferences and market trends.
 This information guides marketing strategies such as targeted promotions and product
bundling.
 The Data Distribution Algorithm (DDA) exemplifies task parallelism in association rule mining. By
distributing the processing of candidate itemsets, DDA enhances efficiency, memory utilization, and
scalability. In real-world applications like online retail platforms, DDA enables the discovery of frequent
itemsets for personalized recommendations and strategic marketing decisions.

COMPARING APPROACHES
 Algorithms can be classified along the following dimensions,
o Target: The algorithms we have examined generate all rules that satisfy a given support and
confidence level. Alternatives to these types of algorithms are those that generate some subset of the
algorithms based on the constraints given.
o Type: Algorithms may generate regular association rules or more advanced association rules
o Data type: The rules generated for data in categorical databases. Rules may also be derived for other
types of data such as plain text.
o Data source: Our investigation has been limited to the use of association rules for market basket data.
This assumes that data are present in a transaction. The absence of data may also be important.
o Technique: The most common strategy to generate association rules is that of finding large itemsets.
Other techniques may also be used.
 Itemset strategy: Itemsets may be counted in different ways. The most naïve approach is to generate all
itemsets and count them. As this is usually too space intensive, the bottom-up approach used by Apriori,
which takes advantage of the large itemset property, is the most common approach. A top-down technique
could also be used.
 Transaction strategy: To count the itemsets, the transactions in the database must be scanned. All
transactions could be counted, only a sample may be counted, or the transactions could be divided into
partitions.
 Itemset data structure: The most common data structure used to store the candidate itemsets and their counts is a hash tree. Hash trees provide an effective technique to store, access, and count itemsets, and they are efficient to search, insert, and delete. A hash tree is a multiway search tree where the branch to be taken at each level is determined by applying a hash function, as opposed to comparing key values to branching points in the node (a minimal sketch follows this list).
 Transaction data structure: Transactions may be viewed as in a flat file or as a TID list, which can be
viewed as an inverted file.
 Optimization: These techniques look at how to improve on the performance of an algorithm given data
distribution (skewness) or amount of main memory.
 Architecture: Sequential, parallel, and distributed algorithms have been proposed.
 Parallelism strategy: Both data parallelism and task parallelism have been used.
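As a concrete illustration of the hash tree described in the itemset data structure point above, here is a minimal Python sketch. The fan-out, leaf capacity, and sample candidate 2-itemsets are illustrative assumptions, and the counting step naively enumerates all k-subsets of a transaction rather than pruning during the tree walk as a production implementation would.

```python
from itertools import combinations

# A hash tree for candidate k-itemsets (all candidates the same length k).
# Interior branching hashes the next item of the sorted itemset; leaves
# hold [itemset, count] entries and split one level deeper on overflow.
FANOUT = 3          # number of hash buckets per interior node (assumption)
LEAF_CAPACITY = 3   # leaf splits when it holds more entries than this (assumption)

class Node:
    def __init__(self):
        self.is_leaf = True
        self.entries = []    # leaf: list of [itemset_tuple, count]
        self.children = {}   # interior: hash bucket -> Node

def _insert_entry(node, entry, depth):
    itemset = entry[0]
    if node.is_leaf:
        node.entries.append(entry)
        if len(node.entries) > LEAF_CAPACITY and depth < len(itemset):
            node.is_leaf = False                  # split the overflowing leaf
            old, node.entries = node.entries, []
            for e in old:
                _insert_entry(node, e, depth)     # re-route entries one level deeper
    else:
        bucket = hash(itemset[depth]) % FANOUT
        child = node.children.setdefault(bucket, Node())
        _insert_entry(child, entry, depth + 1)

def insert(root, itemset):
    """Insert a candidate itemset with an initial count of 0."""
    _insert_entry(root, [tuple(sorted(itemset)), 0], 0)

def find(node, itemset, depth=0):
    """Return the [itemset, count] entry for a sorted tuple, or None."""
    if node.is_leaf:
        for entry in node.entries:
            if entry[0] == itemset:
                return entry
        return None
    child = node.children.get(hash(itemset[depth]) % FANOUT)
    return find(child, itemset, depth + 1) if child else None

def count_transaction(root, transaction, k):
    """Increment every stored candidate that is a k-subset of the transaction."""
    for subset in combinations(sorted(transaction), k):
        entry = find(root, subset)
        if entry is not None:
            entry[1] += 1

root = Node()
for cand in [("bread", "milk"), ("bread", "butter"), ("bread", "eggs"), ("eggs", "milk")]:
    insert(root, cand)
count_transaction(root, {"milk", "bread", "eggs"}, k=2)
print(find(root, ("bread", "milk")))   # -> [('bread', 'milk'), 1]
```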

Comparison of Association Rule Algorithms

Algorithm      Scans   Data Structure   Parallelism
Apriori        m + 1   hash tree        none
Sampling       2       not specified    none
Partitioning   2       hash table       none
CDA            m + 1   hash tree        data
DDA            m + 1   hash tree        task

(Here m is the size of the largest frequent itemset.)

INCREMENTAL RULES
 Incremental rule mining in data mining refers to the process of updating existing rules or discovering new
rules efficiently when new data is added to a dataset. This process is particularly useful when dealing with
dynamic datasets that experience frequent updates or additions. The main goal of incremental rule mining is to
avoid the need to reprocess the entire dataset each time new data arrives, thereby saving computational
resources and time.
 Importance of Incremental Rule Mining:
 Efficiency: Incremental rule mining helps in updating existing rules or discovering new rules without
reprocessing the entire dataset.
 Real-time Updates: It allows for immediate updates to the existing rules as new data becomes available.
 Scalability: Incremental mining is essential for handling large datasets efficiently, especially in dynamic
environments.
 Techniques Used for Incremental Rule Mining:
1. Incremental Association Rule Mining:
o Focuses on updating existing association rules or discovering new rules efficiently.
o Techniques like Apriori-based incremental mining and FP-Growth-based incremental mining are
common.
2. Sequential Pattern Mining:
o Extends to sequences of events or items, updating patterns as new sequences are observed.
o Examples include algorithms like GSP (Generalized Sequential Pattern) for incremental
sequential pattern mining.
3. Stream Mining:
o Handles continuous streams of data, updating patterns or rules in real-time.
o Algorithms like VFDT (Very Fast Decision Tree) for classification or CluStream for clustering
are used.
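A minimal sketch of the incremental association rule idea (technique 1 above), assuming the counts of the previously discovered frequent itemsets were kept from the original run. Only the increment is scanned to refresh those counts; itemsets that become frequent purely because of the new data would additionally need one limited rescan of the old database, which this sketch omits. The counts and transactions are illustrative assumptions.

```python
# FUP-style incremental update: reuse the old counts, scan only the increment.

def count_in(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def update_frequents(old_frequents, old_size, new_transactions, min_support):
    """old_frequents: {frozenset(itemset): count in the original database}."""
    total_size = old_size + len(new_transactions)
    updated = {}
    for itemset, old_count in old_frequents.items():
        new_count = old_count + count_in(itemset, new_transactions)
        if new_count / total_size >= min_support:   # still frequent overall?
            updated[itemset] = new_count
    return updated

# Illustrative usage (numbers are assumptions, not taken from the notes)
old_frequents = {frozenset({"milk", "bread"}): 40, frozenset({"bread", "eggs"}): 25}
new_transactions = [{"milk", "bread", "butter"}, {"bread", "eggs"}, {"milk", "eggs"}]
print(update_frequents(old_frequents, old_size=100,
                       new_transactions=new_transactions, min_support=0.25))
```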

ADVANCED ASSOCIATION RULE TECHNIQUES


 There are several techniques that have been proposed to generate association rules that are more complex than
the basic rules.
Generalized Association Rules
 Using a concept hierarchy that shows the set relationship between different items, generalized association
rules allow rules at different levels.
 Example illustrates the use of these generalized rules using the concept hierarchy in Figure 6.7. Association
rules could be generated for any and all levels in the hierarchy.

 A generalized association rule, X => Y, is defined like a regular association rule with the restriction that no
item in Y may be above any item in X.
 When generating generalized association rules, all possible rules are generated using one or more given
hierarchies. Several algorithms have been proposed to generate generalized rules.
 The simplest would be to expand each transaction by adding (for each item in it) all items above it in any
hierarchy.
 EXAMPLE

 Figure 6.7 shows a partial concept hierarchy for food. This hierarchy shows that Wheat Bread is a type of
Bread, which is a type of grain.
 An association rule of the form Bread => PeanutButter has a lower support and threshold than one of the form Grain => PeanutButter.
 There obviously are more transactions containing any type of grain than transactions containing Bread.
 Likewise, Wheat Bread => PeanutButter has a lower support and threshold than Bread => PeanutButter.
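A minimal sketch of the transaction-expansion approach mentioned above: every item in a transaction is augmented with all of its ancestors in the concept hierarchy, after which an ordinary association rule algorithm can be run on the expanded transactions. The hierarchy below is an assumption loosely modeled on the Figure 6.7 example.

```python
# Illustrative child -> parent links for part of a food concept hierarchy
parent = {
    "WheatBread": "Bread",
    "WhiteBread": "Bread",
    "Bread": "Grain",
    "Rice": "Grain",
    "PeanutButter": "Spread",
    "Jelly": "Spread",
}

def ancestors(item):
    """Walk up the hierarchy and collect every ancestor of an item."""
    result = set()
    while item in parent:
        item = parent[item]
        result.add(item)
    return result

def expand(transaction):
    """Return the transaction plus all ancestors of its items."""
    expanded = set(transaction)
    for item in transaction:
        expanded |= ancestors(item)
    return expanded

print(expand({"WheatBread", "PeanutButter"}))
# -> {'WheatBread', 'Bread', 'Grain', 'PeanutButter', 'Spread'} (set order may vary)
```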

Multiple-Level Association Rules


 A variation of generalized rules is multiple-level association rules.
 With multiple-level rules, itemsets may occur at any level in the hierarchy. Using a variation of the Apriori algorithm, the concept hierarchy is traversed in a top-down manner and large itemsets are generated.
 When large itemsets are found at level i, large itemsets are generated for level i + 1.
 Large k-itemsets at one level in the concept hierarchy are used as candidates to generate large k-itemsets for
children at the next level.
 The basic association rule ideas need a slight modification: we expect more support for itemsets occurring at higher levels in the concept hierarchy.
 Thus, the minimum support required for association rules may vary based on the level in the hierarchy. We would expect that the frequency of itemsets at higher levels is much greater than the frequency of itemsets at lower levels.
 Thus, for the reduced minimum support concept, the following rules apply:
 The minimum support for all nodes in the hierarchy at the same level is identical.
 If α_i is the minimum support for level i in the hierarchy and α_(i-1) is the minimum support for level i − 1 (one level higher), then α_(i-1) > α_i.
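A minimal sketch of this top-down, level-wise traversal with reduced minimum supports. Single items stand in for itemsets to keep the sketch short, and the hierarchy, supports, and per-level thresholds are illustrative assumptions.

```python
# Top-down traversal: only children of large items at level i are
# considered at level i+1, and each level uses its own threshold alpha_i
# with alpha_(i-1) > alpha_i.
children = {"Grain": ["Bread", "Rice"], "Bread": ["WheatBread", "WhiteBread"]}
support = {"Grain": 0.30, "Bread": 0.18, "Rice": 0.04,
           "WheatBread": 0.09, "WhiteBread": 0.06}
alpha = {1: 0.20, 2: 0.10, 3: 0.05}   # decreasing as we move down the hierarchy

frontier, level = ["Grain"], 1
while frontier:
    large = [item for item in frontier if support[item] >= alpha[level]]
    print(f"level {level}: large = {large}")
    # only children of large items at this level become candidates one level down
    frontier = [c for item in large for c in children.get(item, [])]
    level += 1
```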

Quantitative Association Rules


 Quantitative association rules extend the traditional association rule mining approach by incorporating
quantitative attributes or numerical values into the rule discovery process.
 While standard association rules focus on relationships between items in transactions, quantitative association
rules consider both itemsets and numerical attributes associated with these items.
 These rules provide insights into how numerical values affect the occurrence of certain itemsets, enabling
more detailed analysis of associations in datasets.

Key Concepts:
 Quantitative Attributes:
o In addition to categorical items, datasets for quantitative association rules include numerical
attributes.
o Examples: Prices, quantities, ratings, temperatures, scores, etc.
 Support and Confidence for Quantitative Rules:
o Support: The proportion of transactions where an itemset with specific numerical conditions appears.
o Confidence: The probability of finding an itemset with specific numerical conditions given the
presence of another itemset.
 Quantitative Measures:
o Measures such as mean, median, sum, range, variance, etc., are used to define conditions on
numerical attributes.
Example Scenario:
 Consider a dataset of customer transactions at a supermarket:
Transaction ID Items Purchased Total Amount ($)
1 {Milk, Bread, Eggs} 12.50
2 {Bread, Butter} 8.75
3 {Milk, Bread, Butter} 15.20
4 {Bread, Eggs} 6.90
5 {Milk, Eggs} 9.75
 Quantitative Association Rule Examples:
1. Support for Total Amount:
o {Milk} => {Total Amount > 10} [Support: 0.4, Confidence: 0.8]
 Interpretation: 40% of transactions containing milk have a total amount greater than $10.
2. Confidence for Price Range:
o {Bread} => {Price Range: $5 - $10} [Support: 0.6, Confidence: 0.75]
 Interpretation: 75% of transactions with bread fall within the price range of $5 to $10.
3. Quantitative Condition on Quantity:
o {Milk, Quantity > 2} => {Total Amount > 12} [Support: 0.2, Confidence: 1.0]
 Interpretation: When purchasing more than 2 units of milk, the total amount spent is
usually over $12.
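A minimal sketch of how a quantitative rule such as {Milk} => {Total Amount > 10} can be checked against the sample transactions above. The support and confidence it prints are computed from the five rows, so they illustrate the mechanics rather than reproduce the quoted illustrative figures exactly.

```python
# Each transaction is a pair (set of items, total amount in dollars),
# taken from the example table above.
transactions = [
    ({"Milk", "Bread", "Eggs"}, 12.50),
    ({"Bread", "Butter"}, 8.75),
    ({"Milk", "Bread", "Butter"}, 15.20),
    ({"Bread", "Eggs"}, 6.90),
    ({"Milk", "Eggs"}, 9.75),
]

def evaluate(antecedent_items, amount_condition):
    """Support and confidence of: antecedent_items => amount_condition."""
    n = len(transactions)
    antecedent = [t for t in transactions if antecedent_items <= t[0]]
    both = [t for t in antecedent if amount_condition(t[1])]
    support = len(both) / n
    confidence = len(both) / len(antecedent) if antecedent else 0.0
    return support, confidence

s, c = evaluate({"Milk"}, lambda total: total > 10)
print(f"{{Milk}} => {{Total Amount > 10}} [Support: {s:.2f}, Confidence: {c:.2f}]")
```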
 Benefits of Quantitative Association Rules:
1. Deeper Insights:
o Provides insights into how numerical attributes influence itemset associations.
o Understanding how prices, quantities, or other metrics impact purchasing behaviour.
2. Fine-grained Analysis:
o Enables segmentation of data based on numerical conditions for more targeted analysis.
o Discovering patterns specific to certain price ranges, quantities, or attribute values.
3. Optimized Decision-Making:
o Helps in pricing strategies, inventory management, and product bundling decisions.
o Tailoring promotions or discounts based on customer spending patterns.
 Techniques for Quantitative Association Rules:
1. Binning or Discretization:
o Convert continuous numerical attributes into discrete bins or categories.
o Enables the application of standard association rule mining algorithms on discretized data.
2. Threshold-based Mining:
o Define thresholds or ranges for numerical attributes to create conditions for rule discovery.
o Specify minimum or maximum values for support and confidence.

3. Quantitative Measures in Rules:


o Include quantitative measures like mean, sum, variance, etc., in the rule conditions.
o Formulate rules based on statistical properties of numerical attributes.
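Of the techniques above, binning is the simplest to show concretely. The sketch below maps each transaction total from the earlier example table to a categorical bin and adds the bin label to the transaction as a pseudo-item, so a standard association rule algorithm can then be applied; the bin edges are illustrative assumptions.

```python
# Discretize the numerical "total amount" attribute into categorical bins.
def total_bin(amount):
    if amount < 5:
        return "Total:<$5"
    elif amount <= 10:
        return "Total:$5-$10"
    else:
        return "Total:>$10"

transactions = [
    ({"Milk", "Bread", "Eggs"}, 12.50),
    ({"Bread", "Butter"}, 8.75),
    ({"Milk", "Bread", "Butter"}, 15.20),
    ({"Bread", "Eggs"}, 6.90),
    ({"Milk", "Eggs"}, 9.75),
]

# Add each bin label as a pseudo-item; ordinary rule mining can follow.
discretized = [items | {total_bin(amount)} for items, amount in transactions]
for t in discretized:
    print(sorted(t))
```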

 Applications of Quantitative Association Rules:


1. Retail Pricing Analysis:
o Analyzing how price ranges of products affect their co-purchase patterns.
o Identifying optimal price points based on associations with other items.
2. E-commerce Product Recommendations:
o Recommending products based on customers' past purchase amounts or order totals.
o Suggesting additional items to reach certain spending thresholds for discounts.
3. Financial Transaction Analysis:
o Detecting patterns of high-value transactions based on transaction amounts.
o Understanding how certain products or services are bundled in large transactions.

Using Multiple Minimum Supports


 Association rule mining involves discovering interesting relationships or patterns in datasets through the
identification of frequently co-occurring items.
 The traditional approach involves setting a single minimum support threshold, which determines the minimum
frequency with which an itemset must appear in the dataset to be considered significant.
 However, in certain scenarios, using multiple minimum support thresholds can provide a more nuanced and
detailed analysis of the data.
 Key Concepts:
1. Minimum Support:
o Minimum support is the threshold that determines the minimum frequency or occurrence of an
itemset in the dataset.
o An itemset is considered frequent if its support is greater than or equal to the minimum support
threshold.
2. Multiple Minimum Supports:
o In some cases, using a single minimum support may not capture all interesting patterns.
o Using multiple minimum supports allows for the discovery of itemsets at different levels of
significance.
 Advantages of Using Multiple Minimum Supports:
1. Fine-Grained Analysis:
o Allows for the identification of itemsets with varying levels of significance.
o Different levels of minimum support can reveal patterns of different strengths.
2. Tailored Pattern Discovery:
o Enables analysts to focus on specific subsets of itemsets based on their importance or relevance.
o Patterns of high significance can be separated from less significant ones.
3. Flexibility:
o Provides flexibility in adjusting the sensitivity of the mining process.
o Analysts can choose different levels of granularity in the patterns they wish to discover.

 Techniques for Using Multiple Minimum Supports:


1. Hierarchical Minimum Supports:
o Define a hierarchy of minimum supports, where each level represents a different level of
significance.
o Higher levels correspond to stricter criteria, capturing only the most significant patterns.
2. Threshold Ranges:
o Instead of fixed minimum supports, define ranges or intervals of minimum supports.

o Allows for a more flexible approach, capturing patterns within specified ranges of significance.
3. Selective Pattern Mining:
o Apply different minimum supports to specific subsets of the data.
o For example, focus on high-value customers with a higher minimum support threshold.
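A minimal sketch of selective pattern mining with segment-specific minimum supports, using the thresholds of the example scenario below (0.2 for high-value customers, 0.05 for low-value customers, 0.1 as the general default). The transactions and candidate itemsets are illustrative assumptions.

```python
# Per-segment minimum support thresholds (from the example scenario below)
segment_min_support = {"high_value": 0.20, "low_value": 0.05, "general": 0.10}

def frequent_itemsets_for_segment(transactions, candidates, segment):
    """Keep only the candidates whose support meets the segment's threshold."""
    threshold = segment_min_support.get(segment, segment_min_support["general"])
    n = len(transactions)
    result = {}
    for itemset in candidates:
        support = sum(1 for t in transactions if itemset <= t) / n
        if support >= threshold:
            result[itemset] = round(support, 2)
    return result

# Illustrative usage on an assumed high-value customer segment
high_value_txns = [{"Champagne", "Caviar", "Truffles"},
                   {"Champagne", "Caviar"},
                   {"Steak", "Lobster", "Red Wine"}]
candidates = [frozenset({"Champagne", "Caviar"}), frozenset({"Steak", "Lobster"})]
print(frequent_itemsets_for_segment(high_value_txns, candidates, "high_value"))
```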
Example Scenario:
 Consider a retail store analyzing customer purchase data:
 Single Minimum Support: 0.1 (for general patterns)
 Multiple Minimum Supports:
o High-Value Customers: 0.2 (for patterns among high spenders)
o Low-Value Customers: 0.05 (for patterns among regular spenders)
 Pattern Discovery Examples:
1. General Patterns (Single Minimum Support):
o {Milk, Bread} => {Eggs} [Support: 0.15, Confidence: 0.6]
o {Cheese, Crackers} => {Wine} [Support: 0.12, Confidence: 0.8]
2. High-Value Customer Patterns (Higher Minimum Support):
o {Champagne, Caviar} => {Truffles} [Support: 0.25, Confidence: 0.9]
o {Steak, Lobster} => {Red Wine} [Support: 0.18, Confidence: 0.7]
3. Low-Value Customer Patterns (Lower Minimum Support):
o {Chips, Soda} => {Popcorn} [Support: 0.03, Confidence: 0.5]
o {Cookies, Ice Cream} => {Milk} [Support: 0.06, Confidence: 0.6]
 Applications:
1. Customer Segmentation:
o Identify patterns specific to different customer segments based on their spending habits.
o Tailor marketing strategies or promotions for each segment accordingly.
2. Product Bundling Strategies:
o Discover item combinations that are popular among high-value customers.
o Create targeted bundles or offers to maximize sales among different customer groups.
3. Market Basket Analysis:
o Understand purchasing behaviors at different levels of granularity.
o Optimize product placement, promotions, and inventory management strategies.

Correlation Rules
 Correlation rules in data mining refer to discovering patterns of co-occurrence or association between items or
attributes in a dataset.
 Unlike traditional association rules that focus on identifying frequent itemsets, correlation rules emphasize the
strength and direction of relationships between variables.
 These rules are particularly useful for understanding how changes in one variable relate to changes in another,
providing insights into dependencies and associations beyond simple co-occurrences.
 Key Concepts:
1. Correlation Coefficient:
o The correlation coefficient measures the strength and direction of a linear relationship between
two numerical variables.
o Values range from -1 to 1:
 Positive values (close to 1) indicate a positive correlation (both variables increase or
decrease together).
 Negative values (close to -1) indicate a negative correlation (one variable increases while
the other decreases).
 Values close to 0 indicate little or no correlation.
2. Correlation Rules:
o Correlation rules identify pairs of items or attributes that have a significant correlation coefficient.

o These rules help in understanding how changes in one variable are associated with changes in
another.
 Advantages of Correlation Rules:
1. Insight into Relationships:
o Provides insights into the strength and direction of relationships between variables.
o Helps in understanding dependencies and patterns in the data.
2. Identification of Causal Relationships:
o Indicates potential causal relationships between variables.
o Allows for hypothesis testing and validation of assumed cause-effect relationships.
3. Feature Selection:
o Useful for feature selection in predictive modeling tasks.
o Identifies variables that have the most impact on the target variable.
 Techniques for Correlation Rule Mining:
1. Pearson Correlation:
o The Pearson correlation coefficient measures the linear relationship between two continuous
variables.

2. Spearman Rank Correlation:


o Suitable for measuring the strength of monotonic relationships between variables.
o Uses ranked data rather than the actual values of variables.
3. Threshold-based Correlation Rules:
o Define a threshold for the correlation coefficient to identify significant correlations.
o Rules can be generated for pairs of variables with correlation coefficients above the threshold.
 Example Scenario: Consider a dataset of customer purchase behaviour:
Customer ID Age Income ($) Purchase Amount ($)
1 30 50000 100
2 25 60000 120
3 35 70000 150
4 40 55000 110
5 28 75000 160
 Correlation Rule Examples:
1. Positive Correlation between Age and Income:
o {Age} => {Income} [Correlation: 0.75]
 Interpretation: As age increases, income tends to increase, indicating a positive
correlation.
2. Negative Correlation between Age and Purchase Amount:
o {Age} => {Purchase Amount} [Correlation: -0.65]
 Interpretation: Younger customers tend to make larger purchases, indicating a negative
correlation.
3. Positive Correlation between Income and Purchase Amount:
o {Income} => {Purchase Amount} [Correlation: 0.82]
 Interpretation: Customers with higher incomes tend to make larger purchases, indicating a
positive correlation.
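A minimal sketch of threshold-based correlation rule mining using the Pearson coefficient. The attribute arrays and the 0.6 threshold are illustrative assumptions (the arrays are deliberately chosen to be strongly related, and are not the five-row table above).

```python
from itertools import combinations
from math import sqrt

# Illustrative numerical attributes (assumed values, one list per attribute)
data = {
    "Age":            [25, 30, 35, 40, 45, 50],
    "Income":         [30000, 38000, 45000, 52000, 61000, 70000],
    "PurchaseAmount": [80, 95, 110, 120, 140, 155],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

THRESHOLD = 0.6   # keep only pairs whose |correlation| exceeds this (assumption)
for a, b in combinations(data, 2):
    r = pearson(data[a], data[b])
    if abs(r) >= THRESHOLD:
        print(f"{{{a}}} => {{{b}}} [Correlation: {r:.2f}]")
```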

 Applications of Correlation Rules:


1. Market Segmentation:
o Identify patterns of purchasing behavior based on demographic variables.
o Segment customers based on their correlations with certain product categories.
2. Financial Analysis:
o Analyze the relationship between economic indicators and stock prices.
o Identify leading indicators that correlate with market movements.
3. Healthcare Analytics:
o Discover correlations between patient demographics and medical conditions.
o Identify risk factors or predictors for certain diseases.

MEASURING THE QUALITY OF RULES


 Measuring the quality of association rules is essential in data mining to identify meaningful and actionable
patterns from a large set of discovered rules.
 Quality measures help in assessing the significance, reliability, and usefulness of association rules, guiding
analysts in focusing on the most relevant patterns for further analysis or decision-making.
 There are several common measures used to evaluate the quality of association rules, each providing different
perspectives on the strength and significance of the relationships discovered.
 Key Quality Measures:
1. Support:
o Support measures the frequency of occurrence of an itemset in the dataset.
o Formula: Support(X⇒Y) = (transactions containing X∪Y) / (total transactions)
o Indicates how often the rule occurs in the dataset.
o Higher support values suggest stronger relationships.
2. Confidence:
o Confidence measures the conditional probability of the consequent (Y) given the antecedent (X).
o Formula: Confidence(X⇒Y) = Support(X∪Y) / Support(X)
o Indicates the strength of the rule.
o Higher confidence values suggest a stronger relationship between X and Y.
3. Lift:
o Lift measures the ratio of observed support to the expected support if X and Y were independent.
o Formula: Lift(X⇒Y) = Support(X∪Y) / (Support(X) × Support(Y))
o Lift values > 1 indicate that the occurrence of X increases the likelihood of Y.
o Lift = 1 indicates independence, while Lift < 1 indicates a negative relationship.
4. Leverage:
o Leverage measures the difference between the observed frequency of X and Y occurring together
and the frequency expected if they were independent.
o Formula: Leverage(X⇒Y) = Support(X∪Y) − Support(X) × Support(Y)
o Indicates how much the occurrence of X and Y together deviates from what would be expected if
they were independent.
o A higher leverage value indicates a stronger association.

5. Conviction:
o Conviction measures the ratio of the expected frequency that X occurs without Y to the observed
frequency of errors in predicting Y.
o Formula: Conviction(X⇒Y) = (1 − Support(Y)) / (1 − Confidence(X⇒Y))
o Conviction > 1 suggests a strong relationship, with larger values indicating stronger dependency.
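A minimal sketch that computes all five measures defined above for a single rule X => Y directly from a list of transactions; the transactions and the chosen rule are illustrative assumptions.

```python
def measures(transactions, X, Y):
    """Support, confidence, lift, leverage, and conviction of X => Y."""
    n = len(transactions)
    supp_X  = sum(1 for t in transactions if X <= t) / n
    supp_Y  = sum(1 for t in transactions if Y <= t) / n
    supp_XY = sum(1 for t in transactions if (X | Y) <= t) / n

    confidence = supp_XY / supp_X
    lift       = supp_XY / (supp_X * supp_Y)
    leverage   = supp_XY - supp_X * supp_Y
    # conviction is unbounded when confidence = 1 (no counter-examples)
    conviction = (1 - supp_Y) / (1 - confidence) if confidence < 1 else float("inf")

    return {"support": supp_XY, "confidence": confidence,
            "lift": lift, "leverage": leverage, "conviction": conviction}

# Illustrative usage (assumed transactions)
transactions = [
    {"milk", "bread", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]
print(measures(transactions, X=frozenset({"milk"}), Y=frozenset({"bread"})))
```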

 Interpretation of Quality Measures:


 High Support:
o Indicates that the rule is applicable to a significant portion of the dataset.
o Rules with high support are generally more reliable but may not be as interesting or specific.
 High Confidence:
o Indicates a strong relationship between the antecedent and consequent.
o Rules with high confidence are more likely to be true, making them useful for decision-making.
 High Lift:
o Indicates that the rule has a significant impact on the likelihood of the consequent.
o Rules with high lift show stronger dependencies and are often more actionable.
 High Leverage:
o Indicates that the occurrence of X and Y together is significantly higher than expected if they
were independent.
o Rules with high leverage highlight interesting associations that deviate from randomness.
 High Conviction:
o Indicates a strong dependency between the antecedent and consequent.
o Rules with high conviction are more likely to be reliable, especially when confidence is close to 1.

SECTION –A
1. Define clustering.
2. State any 3 clustering attributes.
3. Define centroid and medoid.
4. What are outliers?
5. Define dendrogram.
6. What is an association rule?
7. Define support.
8. Define confidence and lift.
9. What is data parallelism and task parallelism?
10. What is an incremental rule?
SECTION –B
1. Write short note on classification of clustering algorithms.
2. Explain about the similarity and distance measures.
3. Write short note on divisive clustering algorithms in details.
4. Explain the partitional algorithms.
5. Discuss about large item set method in detail.

SECTION – C
1. Discuss about agglomerative clustering algorithms with an example.
2. Create the clusters for the following data points using k-means & PAM methods.

   Point    x    y
   P0       2   10
   P1       8    4
   P2       7    5
   P3       6    4
   P4       1    2
   P5       4    9
3. Discuss about steps involved in apriori algorithm in details.
