Unit-2 DMA
School of Computing - SRMIST Kattankulathur Campus
Association Rules
Example
bread => butter
buys{onions, potatoes} => buys{tomatoes}
Association Rules (continued)
Understanding buying patterns can help to increase sales in several ways.
Example:
● If there is a pair of items, X and Y, that are frequently bought together, both X and Y
can be placed on the same shelf, so that buyers of one item would be prompted to buy
the other.
● Support: denotes the probability that a transaction contains both bread and butter.
● Confidence: denotes the probability that a transaction containing bread also
contains butter.
Examples to calculate the Support and confidence
● Support: This says how popular an itemset is, as measured by the
proportion of transactions in which the itemset appears. In the table, the
support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple
items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%.
● Confidence: This says how likely item Y is purchased when item X is purchased,
expressed as {X -> Y}. It is measured by the proportion of transactions with item X
in which item Y also appears. In the table, the confidence of {apple -> beer} is
3 out of 4, or 75%.
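Both measures can be computed directly from a transaction list. Below is a minimal
Python sketch; the eight transactions are illustrative (chosen only so that the counts
match the figures quoted above), not the original table.

transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "mango"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "mango"},
]

def support(itemset):
    # Fraction of transactions that contain every item of `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = support(antecedent ∪ consequent) / support(antecedent).
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"apple"}))                  # 0.5  -> 50%
print(support({"apple", "beer", "rice"}))  # 0.25 -> 25%
print(confidence({"apple"}, {"beer"}))     # 0.75 -> 75%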
Classification of Association Rules
Association Mining – Fields & Algorithms
● Frequent patterns are patterns (such as itemsets,
subsequences, or substructures) that appear in a data set
frequently.
● E.g., milk and bread.
Market Basket Analysis
Frequent Item sets, Closed Item sets, and Association Rules
• Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence
threshold (min conf) are called strong.
• A set of items is referred to as an itemset.
• The occurrence frequency of an itemset is the number of transactions that contain the itemset.
• If the relative support (the proportion of transactions in a dataset that contain a specific itemset)
of an itemset I satisfies a pre-specified minimum support threshold (i.e., the absolute support of I
satisfies the corresponding minimum support count threshold), then I is a frequent itemset.
Frequent Item sets, Closed Item sets, and Association Rules
1. Find all frequent itemsets: By definition, each of these itemsets will
occur at least as frequently as a predetermined minimum support
count, min sup.
2. Generate strong association rules from the frequent itemsets: By
definition, these rules must satisfy minimum support and minimum
confidence.
Frequent Item sets, Closed Item sets, and Association Rules
● Maximal Itemset: An itemset is maximal frequent if none of its supersets is frequent.
● Closed Itemset: An itemset is closed if none of its immediate supersets has the same support count
as the itemset.
● K-Itemset: An itemset that contains K items is a K-itemset. An itemset is frequent if its support
count is greater than or equal to the minimum support count.
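These definitions can be checked mechanically once every frequent itemset and its support
count are known. A minimal Python sketch, using an illustrative set of frequent itemsets
(the same A, B, C, {A,C} pattern as the worked Apriori example later in this unit):

frequent = {
    frozenset({"A"}): 3,
    frozenset({"B"}): 2,
    frozenset({"C"}): 2,
    frozenset({"A", "C"}): 2,
}

def is_closed(itemset):
    # Closed: no proper superset has the same support count
    # (equivalent to checking immediate supersets, since support is anti-monotone).
    return not any(itemset < other and frequent[other] == frequent[itemset]
                   for other in frequent)

def is_maximal(itemset):
    # Maximal: no proper superset is frequent at all.
    return not any(itemset < other for other in frequent)

for itemset, count in frequent.items():
    print(sorted(itemset), count,
          "closed" if is_closed(itemset) else "not closed",
          "maximal" if is_maximal(itemset) else "not maximal")
# {C} is frequent but not closed (its superset {A,C} has the same count), and
# every maximal itemset printed above is also closed.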
Maximal & Closed Frequent Item set
All maximal frequent itemsets are closed frequent itemsets, but not all
closed frequent itemsets are maximal frequent itemsets.
Apriori Algorithm
Frequent Pattern Mining
● Based on the types of values handled in the rule.
E.g., Boolean association rules and quantitative association rules.
Frequent Itemset Mining Methods – Apriori Algorithm
Example - 2
Steps of Apriori Algorithm
● Find the frequent itemsets in the following database
with min support 50% & min confidence 50%.
Transaction id   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F
● Minimum support count = 50% of 4 transactions = (50/100) × 4 = 2.
● Step 1: Scan D for the count of each candidate and generate C1.
Items   Support count
[A]     3
[B]     2
[C]     2
[D]     1
[E]     1
[F]     1
● Step 2: Compare each candidate's support count with the minimum
support count; L1 is
Items   Support
[A]     3
[B]     2
[C]     2
● Step 3: Generate the candidate set C2 by joining L1 with itself.
Items
[A,B]
[A,C]
[B,C]
● Step 4: Scan D for the count of each candidate in C2 and find its
support.
Items   Support
[A,B]   1
[A,C]   2
[B,C]   1
● Step 5: Compare each C2 candidate's support count with the minimum
support count; L2 is
Items   Support
[A,C]   2
● Generate association rules from L2 = {A,C} (min confidence 50%):
Rule 1: A -> C, confidence = 2/3 ≈ 67% (accepted)
Rule 2: C -> A, confidence = 2/2 = 100% (accepted)
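The steps above can be traced with a short Python sketch on the same toy database
(minimum support count = 2, minimum confidence = 50%); the helper names are illustrative
and the pairwise join is a shortcut for the general Apriori join/prune step:

from itertools import combinations

transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}
min_count, min_conf = 2, 0.5

def count(itemset):
    # Number of transactions containing every item of `itemset`.
    return sum(set(itemset) <= t for t in transactions.values())

# Steps 1-2: candidate 1-itemsets C1, then L1 (those meeting min_count).
items = sorted({i for t in transactions.values() for i in t})
L1 = [frozenset([i]) for i in items if count({i}) >= min_count]

# Steps 3-5: join L1 with itself to get C2, then keep candidates meeting min_count.
C2 = [a | b for a, b in combinations(L1, 2)]
L2 = [c for c in C2 if count(c) >= min_count]
print([sorted(s) for s in L1])   # [['A'], ['B'], ['C']]
print([sorted(s) for s in L2])   # [['A', 'C']]

# Rule generation from L2 = {A, C}.
for X, Y in [({"A"}, {"C"}), ({"C"}, {"A"})]:
    conf = count(X | Y) / count(X)
    print(X, "->", Y, f"confidence = {conf:.0%}",
          "Accepted" if conf >= min_conf else "Rejected")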
Generating Association Rules for Frequent Item sets
Confidence(A => B) = support_count(A ∪ B) / support_count(A)
● R1: I1 ^ I2 -> I5
Confidence = SC(I1,I2,I5)/SC(I1,I2) = 2/4 = 50% (Rejected)
● R2: I1 ^ I5 -> I2
Confidence = SC(I1,I5,I2)/SC(I1,I5) = 2/2 = 100% (Accepted)
● R3: I2 ^ I5 -> I1
Confidence = SC(I2,I5,I1)/SC(I2,I5) = 2/2 = 100% (Accepted)
● R4: I1 -> I2 ^ I5
Confidence = SC(I1,I2,I5)/SC(I1) = 2/6 ≈ 33% (Rejected)
● R5: I2 -> I1 ^ I5
Confidence = SC(I2,I1,I5)/SC(I2) = 2/7 ≈ 29% (Rejected)
● R6: I5 -> I1 ^ I2
Confidence = SC(I5,I1,I2)/SC(I5) = 2/2 = 100% (Accepted)
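The same enumeration can be automated for any frequent itemset. A minimal Python sketch,
using the support counts quoted above; the 70% minimum confidence threshold is an
assumption (the slide only marks which rules were accepted or rejected):

from itertools import combinations

support_count = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}
itemset = frozenset({"I1", "I2", "I5"})
min_conf = 0.70

# Every non-empty proper subset of the itemset becomes an antecedent.
for r in range(1, len(itemset)):
    for antecedent in combinations(sorted(itemset), r):
        antecedent = frozenset(antecedent)
        consequent = itemset - antecedent
        conf = support_count[itemset] / support_count[antecedent]
        verdict = "Accepted" if conf >= min_conf else "Rejected"
        print(f"{sorted(antecedent)} -> {sorted(consequent)}: {conf:.0%} ({verdict})")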
Problem - 1
● A database has five transactions. Let the minimum support and
confidence be min_sup = 60% and min_conf = 100%.
● Find the frequent itemsets and generate the
association rules using the Apriori algorithm.
TID   ITEMS
T1    {M,O,N,K,E,Y}
T2    {D,O,N,K,E,Y}
T3    {M,A,K,E}
T4    {M,U,C,K,Y}
T5    {C,O,O,K,I,E}
Problem - 2
● A database has five transactions. Let the minimum support and
confidence be min_sup = 3 (support count) and min_conf = 80%.
TID ITEMS
T1 {1,2,3,4,5,6}
T2 {7,2,3,4,5,6}
T3 {1,8,4,5}
T4 {1,9,0,4,6}
T5 {0,2,2,4,5}
Improving the Efficiency of Apriori
Hash Based Techniques
Frequent Pattern Growth Algorithm
Mining Frequent Item sets without Candidate Generation
Disadvantages of the Apriori algorithm:
● It may need to generate a huge number of candidate sets.
● It may need to repeatedly scan the database and check a large set of
candidates by pattern matching.
FP Growth Algorithm
Mining Frequent Item sets without Candidate Generation
● We start from the node that has the minimum support count, i.e., I5.
● We exclude the node with the maximum support count, i.e., I2, when
preparing the table.
Mining Frequent Item sets without Candidate Generation
FP Growth Algorithm
FP Growth Algorithm Vs Apriori Algorithm
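Because both algorithms return the same frequent itemsets, they can be compared directly on
identical data. A minimal sketch, assuming the third-party mlxtend library (its
TransactionEncoder, apriori and fpgrowth helpers) and pandas are installed:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

# Toy transactions; any list of item lists works.
transactions = [["A", "B", "C"], ["A", "C"], ["A", "D"], ["B", "E", "F"]]

# One-hot encode: one Boolean column per item.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Both calls return a DataFrame of frequent itemsets with their support;
# only the internal search strategy (candidate generation vs. FP-tree) differs.
print(apriori(onehot, min_support=0.5, use_colnames=True))
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))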
Problem 1 – FP Growth Tree
Problem 2 – FP Growth Tree
Mining Frequent Item sets Using Vertical Data Format
The horizontal data format (TID → itemset) is converted to the vertical data format (item → TID set).
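A minimal Python sketch of the conversion, using an illustrative horizontal database; once the
data is vertical, the support count of any itemset is simply the size of the intersection of
its items' TID sets:

from collections import defaultdict

horizontal = {
    "T1": {"A", "B", "C"},
    "T2": {"A", "C"},
    "T3": {"A", "D"},
    "T4": {"B", "E", "F"},
}

# Vertical format: item -> set of TIDs containing that item.
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

print(dict(vertical))                  # e.g. {'A': {'T1', 'T2', 'T3'}, ...}
print(vertical["A"] & vertical["C"])   # TIDs containing {A, C}; its size (2) is the support count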
Mining Closed Frequent Item sets
● A closed frequent itemset is an itemset that is both closed and whose support is greater than
or equal to min_sup.
● An itemset is closed in a data set if there exists no proper superset that has the
same support count as the original itemset.
Mining Closed Frequent Item sets
● A good strategy is to search for closed frequent itemsets directly during the mining process.
● This requires us to prune the search space as soon as we can identify closed itemsets during
mining.
Pruning strategies
Pruning strategies
● Sub-itemset pruning: If a frequent itemset X is a proper subset of an already found frequent closed
itemset Y and support_count(X) = support_count(Y), then X and all of X's descendants in the set
enumeration tree cannot be frequent closed itemsets and thus can be pruned.
Pruning strategies
● Item skipping: In the depth-first mining of closed itemsets, at each level, there will
be a prefix itemset X associated with a header table and a projected database.
● If a local frequent item p has the same support in several header tables at
different levels, we can safely prune p from the header tables at higher levels.
Pruning strategies
● An important optimization is to perform efficient checking of each newly found frequent itemset.
● Subset checking: checks whether the newly found itemset is a subset of an already
found closed itemset with the same support.
● The current itemset Sc can be subsumed by an already found closed itemset Sa when:
(1) Sc and Sa have the same support,
(2) the length of Sc is smaller than that of Sa, and
(3) all of the items in Sc are contained in Sa.
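A minimal Python sketch of this subsumption test; the dictionary mapping each already found
closed itemset to its support is an assumed data structure for illustration:

def is_subsumed(sc, sc_support, closed_sets):
    # Sc is subsumed by some Sa when: same support, Sc shorter than Sa, and Sc a subset of Sa.
    return any(sc_support == sa_support and len(sc) < len(sa) and sc <= sa
               for sa, sa_support in closed_sets.items())

closed_sets = {frozenset({"A", "C", "D"}): 2}
print(is_subsumed(frozenset({"A", "C"}), 2, closed_sets))  # True  -> prune Sc
print(is_subsumed(frozenset({"A", "C"}), 3, closed_sets))  # False -> keep Sc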
Which Patterns Are Interesting? — Pattern Evaluation Method
● Most association rule mining algorithms employ a support-confidence
framework.
● Many interesting rules can be found using low support thresholds.
● Strong rules are not necessarily interesting.
● Whether or not a rule is interesting can be assessed either subjectively
or objectively.
● Only the user can judge if a given rule is interesting, and this judgment,
being subjective, may differ from one user to another.
● Objective interestingness measures are based on the statistics “behind” the
data.
From Association Analysis to Correlation Analysis
Correlation Measures
● Lift is a simple correlation measure.
● The occurrence of itemset A is independent of the occurrence of itemset
B if P(A U B) = P(A)P(B); otherwise, itemsets A and B are dependent and
correlated as events.
● lift(A, B) = P(A U B) / (P(A)P(B)); a value greater than 1 indicates positive
correlation, less than 1 negative correlation, and exactly 1 independence.
Correlation analysis using lift
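A minimal sketch of the lift calculation from itemset probabilities; the three probabilities
below are illustrative values, not taken from a specific dataset:

# P(A), P(B) and P(A and B) estimated from transaction frequencies.
p_a, p_b, p_ab = 0.5, 0.6, 0.4

lift = p_ab / (p_a * p_b)
print(lift)   # ~1.33 > 1, so A and B are positively correlated;
              # lift == 1 -> independent, lift < 1 -> negatively correlated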
Correlation analysis using Chi-square
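A minimal sketch of the chi-square computation on a 2x2 contingency table of item
co-occurrence counts; the observed counts below are illustrative:

# Rows: buys item A / does not buy A; columns: buys item B / does not buy B.
observed = [[4000, 3500],
            [2000,  500]]

total = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total   # e_ij = (row total x column total) / n
        chi_square += (obs - expected) ** 2 / expected     # sum of (observed - expected)^2 / expected

print(chi_square)   # ~555.6 here; a value far above the critical value for 1 degree
                    # of freedom means the two items are correlated, not independent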
Thank You