
Association rule mining

Chapter 4

1
Association rule mining

• Basic Concepts
• Frequent Pattern and Association rule Mining
• Association rule Evaluation
• Issues in Association rule mining
• Classification of Frequent Pattern Mining
• Mining Frequent Itemsets
• The Apriori Algorithm
• Multi-level Association rules
• Multi-Dimensional Association rule mining

2
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a dataset
• First proposed by Agrawal et al. [1] in the context of frequent
itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and
diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’93), pp. 207–216, Washington, DC, May 1993.
3
What Is Frequent Pattern Analysis?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.

4
Basic Concepts: Frequent Patterns
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• itemset: A set of one or more items
• k-itemset X = {x1, …, xk}
• (absolute) support, or support count, of X: frequency or number of occurrences of itemset X
• (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold

(Figure: Venn diagram of customers who buy beer, buy diaper, or buy both)
6
Basic Concepts: Association Rules
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
  – support, s: probability that a transaction contains X ∪ Y
  – confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more!):
  Beer → Diaper (60%, 100%)
  Diaper → Beer (60%, 75%)

(Figure: Venn diagram of customers who buy beer, buy diaper, or buy both)
7
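To make the two measures concrete, here is a small Python sketch (not part of the original slides; the function names are mine) that computes support and confidence over the toy database above and reproduces the Beer/Diaper rule values.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    """Conditional probability that a transaction containing X also contains Y."""
    return support(transactions, set(X) | set(Y)) / support(transactions, X)

db = [{'Beer', 'Nuts', 'Diaper'}, {'Beer', 'Coffee', 'Diaper'},
      {'Beer', 'Diaper', 'Eggs'}, {'Nuts', 'Eggs', 'Milk'},
      {'Nuts', 'Coffee', 'Diaper', 'Eggs', 'Milk'}]

print(support(db, {'Beer', 'Diaper'}))        # 0.6  -> s = 60% for both rules
print(confidence(db, {'Beer'}, {'Diaper'}))   # 1.0  -> Beer -> Diaper, c = 100%
print(confidence(db, {'Diaper'}, {'Beer'}))   # 0.75 -> Diaper -> Beer, c = 75%
```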
Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
– Computationally prohibitive!
8
Mining Association Rules
Tid  Items bought
10   Bread, Milk
20   Bread, Diaper, Beer, Eggs
30   Milk, Diaper, Beer, Coke
40   Bread, Milk, Diaper, Beer
50   Bread, Milk, Diaper, Coke

Example rules:
  {Milk, Diaper} → {Beer}  (s = 0.4, c = 0.67)
  {Milk, Beer} → {Diaper}  (s = 0.4, c = 1.0)
  {Diaper, Beer} → {Milk}  (s = 0.4, c = 0.67)
  {Beer} → {Milk, Diaper}  (s = 0.4, c = 0.67)
  {Diaper} → {Milk, Beer}  (s = 0.4, c = 0.5)
  {Milk} → {Diaper, Beer}  (s = 0.4, c = 0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
9
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
• Generate all itemsets whose support ≥ minsup
2. Rule Generation
• Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
• Frequent itemset generation is still
computationally expensive

10
Frequent Itemset Generation

(Figure: the lattice of all itemsets over the items; the brute-force approach on the next slide treats every itemset in this lattice as a candidate.)
11
Frequent Itemset Generation
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database

Transactions (N transactions, maximum transaction width w):
Tid  Items bought
10   Bread, Milk
20   Bread, Diaper, Beer, Eggs
30   Milk, Diaper, Beer, Coke
40   Bread, Milk, Diaper, Beer
50   Bread, Milk, Diaper, Coke

List of candidates: M = 2^d (d distinct items)

  – Match each transaction against every candidate
  – Complexity: O(NMw) — this is costly
12
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates
or transactions
– No need to match every candidate against every
transaction
13
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must
also be frequent
• Apriori principle holds due to the following property of the support measure:

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its subsets (anti-monotone property of support)
14
Example Apriori Principle

(Figure: an itemset lattice in which an infrequent itemset and all of its supersets are pruned, illustrating the Apriori principle.)
15
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
16
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
17
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd} 18
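As a sketch of just the candidate-generation step (self-join on a common (k−1)-prefix, then pruning), the following illustrative Python function reproduces the L3 → C4 example above; the itemset representation and names are my own.

```python
from itertools import combinations

def gen_candidates(Lk):
    """Apriori candidate generation sketch: join itemsets that agree on their
    first k-1 items, then drop candidates that have an infrequent k-subset.
    Itemsets are represented as sorted tuples."""
    Lk = sorted(Lk)
    k = len(Lk[0])
    Lk_set = set(Lk)
    candidates = []
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            a, b = Lk[i], Lk[j]
            if a[:k - 1] == b[:k - 1]:                    # join step
                cand = a[:k - 1] + tuple(sorted((a[-1], b[-1])))
                # prune step: every k-subset of the candidate must be in Lk
                if all(sub in Lk_set for sub in combinations(cand, k)):
                    candidates.append(cand)
    return candidates

L3 = [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd'), ('a', 'c', 'e'), ('b', 'c', 'd')]
print(gen_candidates(L3))   # [('a', 'b', 'c', 'd')] — acde is pruned since ade is not in L3
```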
Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine
the support of each candidate itemset
– To reduce the number of comparisons, store the
candidates in a hash structure
• Instead of matching each transaction against every
candidate, match it against candidates contained in the
hashed buckets

19
How to Count Supports of Candidates?

• Why is counting the supports of candidates a problem?


– The total number of candidates can be very huge
– One transaction may contain many candidates
• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and
counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a
transaction
20
Association Rule Discovery: Hash Tree
(Figure: a hash tree storing 15 candidate 3-itemsets — {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}. The hash function sends items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 to the left / middle / right branch. This slide highlights the subtree reached by hashing on 1, 4 or 7.)
21
Association Rule Discovery: Hash Tree
(Figure: the same hash tree; this slide highlights the subtree reached by hashing on 2, 5 or 8.)
22
Association Rule Discovery: Hash Tree
(Figure: the same hash tree; this slide highlights the subtree reached by hashing on 3, 6 or 9.)
23
Subset operation
Given a transaction T, what are the possible subsets of size 3?

Transaction T: 1 2 3 5 6

(Figure: level-wise enumeration of the 3-item subsets of {1, 2, 3, 5, 6}: the first item is fixed at level 1, the second at level 2, the third at level 3, yielding 123, 125, 126, 135, 136, 156, 235, 236, 256, 356.)
24
Subset operation

(Figure: the same enumeration of the 3-subsets of transaction 1 2 3 5 6 written as prefix extensions — 1+ 2356, 2+ 356, 3+ 56 at level 1, then 1 2+ 356, 1 3+ 56, 1 5+ 6, 2 3+ 56, … — the prefix structure that the hash tree exploits.)
25
Subset Operation Using Hash Tree
(Figure: transaction 1 2 3 5 6 is matched against the hash tree by hashing its items level by level — 1+ 2356, 2+ 356, 3+ 56 at the root, then 1 2+ 356, 1 3+ 56, 1 5+ 6, and so on — so the transaction is compared only against candidates stored in the visited leaves.)
26
Maximal Frequent Itemset

• An itemset is maximal frequent if none of its immediate supersets is frequent
28
Closed Itemset
• An itemset is closed if none of its immediate supersets has the same support as the itemset

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset  Support        Itemset    Support
{A}      4              {A,B,C}    2
{B}      5              {A,B,D}    3
{C}      3              {A,C,D}    2
{D}      4              {B,C,D}    3
{A,B}    4              {A,B,C,D}  2
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3
29
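A brute-force check of these definitions can be written directly from the table. The sketch below (illustrative only, and feasible only for toy databases because it enumerates every subset) computes all itemset supports and then flags the closed and maximal frequent itemsets.

```python
from itertools import combinations

def all_itemset_supports(transactions):
    """Support count of every non-empty itemset that appears in the data."""
    transactions = [frozenset(t) for t in transactions]
    supports = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for sub in combinations(sorted(t), k):
                key = frozenset(sub)
                supports[key] = supports.get(key, 0) + 1
    return supports

def closed_and_maximal(supports, minsup):
    """Closed: no immediate superset with the same support.
    Maximal: no immediate superset that is frequent."""
    frequent = {s: c for s, c in supports.items() if c >= minsup}
    closed, maximal = [], []
    for s, c in frequent.items():
        supersets = [t for t in frequent if len(t) == len(s) + 1 and s < t]
        if all(frequent[t] < c for t in supersets):
            closed.append(s)
        if not supersets:
            maximal.append(s)
    return closed, maximal

db = [{'A', 'B'}, {'B', 'C', 'D'}, {'A', 'B', 'C', 'D'}, {'A', 'B', 'D'}, {'A', 'B', 'C', 'D'}]
closed, maximal = closed_and_maximal(all_itemset_supports(db), minsup=1)
# e.g. {B} and {A,B,C,D} come out closed; only {A,B,C,D} is maximal here
print(sorted(map(sorted, closed)), sorted(map(sorted, maximal)))
```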
Maximal vs Closed Itemsets

(Figure: the itemset lattice with closed and maximal frequent itemsets marked.)
30
Maximal vs Closed Itemsets

(Figure: relationship among the collections: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ all frequent itemsets.)
31
The Frequent Pattern Growth Mining Method

• Idea: Frequent pattern growth


– Recursively grow frequent patterns by pattern and database
partition
• Method
– For each frequent item, construct its conditional pattern-
base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-
tree
– Until the resulting FP-tree is empty, or it contains only one
path—single path will generate all the combinations of its
sub-paths, each of which is a frequent pattern

32
FP-growth Algorithm
• Use a compressed representation of the
database using an FP-tree
• Once an FP-tree has been constructed, it uses
a recursive divide-and-conquer approach to
mine the frequent itemsets

33
FP-tree construction
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

Item  Support
B     8
A     7
C     7
D     5
E     3

(Figure: the FP-tree built from these transactions, rooted at null with branches A:7 and B:3; transactions sharing a prefix share a path, e.g., A:7 → B:5 → C:3.)
34
FP-tree construction
(Figure: the same FP-tree together with a header table; for each item — E, D, C, A, B — the header keeps a chain of node-links connecting all occurrences of that item in the tree.)

Item  Support
B     8
A     7
C     7
D     5
E     3
35
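The construction just illustrated can be sketched in a few lines of Python. This is an illustrative implementation (class and function names are mine), which orders items by descending frequency as the FP-growth method prescribes and keeps a simple header dictionary in place of node-links.

```python
from collections import defaultdict

class FPNode:
    """A node of an FP-tree: an item, a count, a parent link and child links."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fptree(transactions, minsup):
    """Minimal FP-tree construction sketch (node-links replaced by a header dict)."""
    # 1st pass: count item frequencies and keep only frequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= minsup}
    root = FPNode(None)
    header = defaultdict(list)        # item -> list of its nodes in the tree
    # 2nd pass: insert each transaction, items sorted by descending frequency
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, freq

db = [{'A','B'}, {'B','C','D'}, {'A','C','D','E'}, {'A','D','E'}, {'A','B','C'},
      {'A','B','C','D'}, {'B','C'}, {'A','B','C'}, {'A','B','D'}, {'B','C','E'}]
root, header, freq = build_fptree(db, minsup=2)
# per-item node counts sum to the supports from the first pass: B:8, A:7, C:7, D:5, E:3
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```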
Benefits of the FP-tree Structure
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely
to be shared
– never larger than the original database (if node-links and counts are not counted)

36
Mining Frequent Patterns Using FP-tree
• General idea (divide-and-conquer)
– Recursively grow frequent pattern path using the
FP-tree
• Method
– For each item, construct its conditional pattern-
base, and then its conditional FP-tree
– Repeat the process on each newly created
conditional FP-tree
– Until the resulting FP-tree is empty, or it contains
only one path (single path will generate all the combinations of
its sub-paths, each of which is a frequent pattern)
37
Major Steps to Mine FP-tree

1) Construct conditional pattern base for each


node in the FP-tree
2) Construct conditional FP-tree from each
conditional pattern-base
3) Recursively mine conditional FP-trees and
grow frequent patterns obtained so far
 If the conditional FP-tree contains a single path,
simply enumerate all the patterns
38
Step 1: From FP-tree to Conditional Pattern Base
• Start at the frequent-item header table of the FP-tree
• Traverse the FP-tree by following the link of each frequent item
• Accumulate all of the transformed prefix paths of that item to form its conditional pattern base

Header Table:
Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

(Figure: the FP-tree rooted at {}, with paths f:4 → c:3 → a:3 → m:2 → p:2, f:4 → c:3 → a:3 → b:1 → m:1, f:4 → b:1, and c:1 → b:1 → p:1.)

Conditional pattern bases:
Item  Cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
39
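Given the tree and header built by the build_fptree() sketch above, the conditional pattern base of an item can be collected by walking parent pointers upward from each of its nodes. Again an illustrative sketch, not the slides' code.

```python
def conditional_pattern_base(header, item):
    """Prefix paths of `item` in the FP-tree built by build_fptree(); each path
    carries the count of the item's node at its end."""
    base = []
    for node in header[item]:                  # stand-in for following the node-links
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# e.g. conditional_pattern_base(header, 'D') lists the prefix paths ending in D
```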
Properties of FP-tree for Conditional
Pattern Base Construction
• Node-link property
– For any frequent item ai, all the possible frequent
patterns that contain ai can be obtained by
following ai's node-links, starting from ai's head in
the FP-tree header
• Prefix path property
– To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai
40
Step 2: Construct Conditional FP-tree
• For each pattern-base
  – Accumulate the count for each item in the base
  – Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

(Figure: starting from the global FP-tree and its header table (f:4, c:4, a:3, b:3, m:3, p:3), the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3; b is infrequent within the base and is dropped.)

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
41
Mining Frequent Patterns by Creating Conditional
Pattern-Bases

Item  Conditional pattern-base    Conditional FP-tree
p     {(fcam:2), (cb:1)}          {(c:3)}|p
m     {(fca:2), (fcab:1)}         {(f:3, c:3, a:3)}|m
b     {(fca:1), (f:1), (c:1)}     Empty
a     {(fc:3)}                    {(f:3, c:3)}|a
c     {(f:3)}                     {(f:3)}|c
f     Empty                       Empty
42
Step 3: Recursively mine the conditional FP-tree

Starting from the m-conditional FP-tree ({} → f:3 → c:3 → a:3):
• Cond. pattern base of “am”: (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
• Cond. pattern base of “cm”: (f:3) → cm-conditional FP-tree: {} → f:3
• Cond. pattern base of “cam”: (f:3) → cam-conditional FP-tree: {} → f:3
43
Single FP-tree Path Generation

• Suppose an FP-tree T has a single path P
• The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P

Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are: m, fm, cm, am, fcm, fam, cam, fcam
44
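The single-path case is easy to code: every combination of items on the path, attached to the current suffix, is frequent, and its count is the minimum count along the chosen items. An illustrative sketch (names mine):

```python
from itertools import combinations

def patterns_from_single_path(path, suffix, suffix_count):
    """If a conditional FP-tree is a single path, every combination of its items,
    appended to the suffix, is frequent with the minimum count of the chosen items."""
    items = [item for item, _ in path]
    results = {tuple(sorted(suffix)): suffix_count}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = min(c for item, c in path if item in combo)
            results[tuple(sorted(combo + tuple(suffix)))] = count
    return results

# m-conditional FP-tree: single path f:3 -> c:3 -> a:3, suffix "m" with count 3
print(patterns_from_single_path([('f', 3), ('c', 3), ('a', 3)], ('m',), 3))
# yields m, fm, cm, am, fcm, fam, cam, fcam, each with count 3
```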
Principles of Frequent Pattern Growth
• Pattern growth property
  – Let α be a frequent itemset in DB, B be α’s conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
• “abcdef” is a frequent pattern, if and only if
  – “abcde” is a frequent pattern, and
  – “f” is frequent in the set of transactions containing “abcde”
45
Benefits of the FP-tree Structure

• Completeness
– Preserve complete information for frequent pattern mining
– Never break a long pattern of any transaction
• Compactness
– Reduce irrelevant info—infrequent items are gone
– Items in frequency descending order: the more frequently
occurring, the more likely to be shared
– Never larger than the original database (not counting node-links and the count field)

46
FP-Growth vs. Apriori: Scalability With the Support Threshold

(Figure: runtime in seconds, 0–100, vs. support threshold, 0–3%, on data set T25I20D10K, comparing D1 FP-growth runtime against D1 Apriori runtime.)
47
FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

(Figure: runtime in seconds, 0–140, vs. support threshold, 0–2%, on data set T25I20D100K, comparing D2 FP-growth against D2 TreeProjection.)
48
Advantages of the Pattern Growth Approach

• Divide-and-conquer:
– Decompose both the mining task and DB according to the frequent
patterns obtained so far
– Lead to focused search of smaller databases
• Other factors
– No candidate generation, no candidate test
– Compressed database: FP-tree structure
– No repeated scan of entire database
– Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
• A good open-source implementation and refinement of FPGrowth
– FPGrowth+ (Grahne and J. Zhu, FIMI'03)
49
ECLAT: Mining by Exploring Vertical Data Format

• Vertical format: t(AB) = {T11, T25, …}
  – tid-list: list of transaction ids containing an itemset
• Deriving frequent patterns based on vertical intersections
  – t(X) = t(Y): X and Y always happen together
  – t(X) ⊂ t(Y): a transaction having X always has Y
• Using diffsets to accelerate mining
  – Only keep track of differences of tids
  – t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
  – Diffset(XY, X) = {T2}
• Eclat (Zaki et al. @KDD’97); mining closed patterns using the vertical format: CHARM (Zaki & Hsiao @SDM’02)
50
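A minimal sketch of the vertical representation and tid-list intersection, using the beer/diaper toy database from the earlier slides (illustrative names, not library code):

```python
def vertical_format(transactions):
    """Convert horizontal transactions {tid: items} into tid-lists {item: tids}."""
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def support_by_intersection(tidlists, itemset):
    """Support of an itemset = size of the intersection of its members' tid-lists."""
    tids = None
    for item in itemset:
        tids = tidlists[item] if tids is None else tids & tidlists[item]
    return len(tids), tids

db = {10: {'Beer', 'Nuts', 'Diaper'}, 20: {'Beer', 'Coffee', 'Diaper'},
      30: {'Beer', 'Diaper', 'Eggs'}, 40: {'Nuts', 'Eggs', 'Milk'},
      50: {'Nuts', 'Coffee', 'Diaper', 'Eggs', 'Milk'}}
tl = vertical_format(db)
print(support_by_intersection(tl, ['Beer', 'Diaper']))   # (3, {10, 20, 30})
# Diffset idea: diffset(XY, X) = t(X) - t(XY); empty here, Beer never occurs without Diaper
print(tl['Beer'] - (tl['Beer'] & tl['Diaper']))          # set()
```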
Mining Frequent Closed Patterns: CLOSET

• F-list: list of all frequent items in support-ascending order
  – F-list: d-a-f-e-c          (Min_sup = 2)

TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f

• Divide the search space
  – Patterns having d
  – Patterns having d but no a, etc.
• Find frequent closed patterns recursively
  – Every transaction having d also has cfa → cfad is a frequent closed pattern
• J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”, DMKD’00.
51
MaxMiner: Mining Max-Patterns
Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

• 1st scan: find frequent items
  – A, B, C, D, E
• 2nd scan: find support for the potential max-patterns
  – AB, AC, AD, AE, ABCDE
  – BC, BD, BE, BCDE
  – CD, CE, CDE, DE
• Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
• R. Bayardo. Efficiently mining long patterns from databases. SIGMOD’98
53
CHARM: Mining by Exploring Vertical Data Format

• Vertical format: t(AB) = {T11, T25, …}
  – tid-list: list of transaction ids containing an itemset
• Deriving closed patterns based on vertical intersections
  – t(X) = t(Y): X and Y always happen together
  – t(X) ⊂ t(Y): a transaction having X always has Y
• Using diffsets to accelerate mining
  – Only keep track of differences of tids
  – t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
  – Diffset(XY, X) = {T2}
• Eclat/MaxEclat (Zaki et al. @KDD’97), VIPER (P. Shenoy et al. @SIGMOD’00), CHARM (Zaki & Hsiao @SDM’02)
54
Computational Complexity of Frequent Itemset Mining

• How many itemsets may potentially be generated in the worst case?
  – The number of frequent itemsets to be generated is sensitive to the minsup threshold
  – When minsup is low, there exist potentially an exponential number of frequent itemsets
  – The worst case: M^N, where M is the number of distinct items and N the max length of transactions
• The worst-case complexity vs. the expected probability
  – Ex. Suppose Walmart has 10^4 kinds of products
    • The chance to pick up one product: 10^-4
    • The chance to pick up a particular set of 10 products: ~10^-40
    • What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9 transactions? (Under independence the expected number of occurrences is 10^9 × 10^-40 = 10^-31, so the chance is negligible — the exponential worst case rarely materializes in practice.)
55
Interestingness Measure: Correlations (Lift)

• play basketball → eat cereal [40%, 66.7%] is misleading
  – The overall % of students eating cereal is 75%, which is greater than 66.7%
• play basketball → not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
• Measure of dependent/correlated events: lift

  lift = P(A ∪ B) / (P(A) P(B))

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
56
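The lift values from the contingency table can be checked with a few lines of Python (an illustrative helper, not a library function):

```python
def lift(n_ab, n_a, n_b, n_total):
    """Lift of A and B from co-occurrence counts: P(A and B) / (P(A) * P(B));
    1 means independence, < 1 negative correlation, > 1 positive correlation."""
    p_ab = n_ab / n_total
    p_a, p_b = n_a / n_total, n_b / n_total
    return p_ab / (p_a * p_b)

# Contingency table from the slide (5000 students)
print(round(lift(2000, 3000, 3750, 5000), 2))   # basketball & cereal      -> 0.89
print(round(lift(1000, 3000, 1250, 5000), 2))   # basketball & not cereal  -> 1.33
```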
Reading assignment
• Read about multi-level association rules
• What are the problems of Apriori?
• What are the two general steps in multi-level association rule mining?

57
