Association Rules
Instructor: Junghye Lee
Ref.: M.J.A. Berry and G. Linoff, Data Mining Techniques, Wiley, 1997.
Contents
Introduction
Association Rules
Basic Process
- Choosing the right set of items
- Generating rules and their measures
- Overcoming the practical limits
Strengths and Weaknesses
Application Areas
Introduction: What is Market Basket Analysis?
Introduction: Point of Sale Transactions
[Figure: point-of-sale data, with labels: customer, items, transaction]
Introduction: Transactions and Co-Occurrence
Customer  Items
1         Orange juice, Soda
2         Milk, Orange juice, Window Cleaner
3         Orange juice, Detergent
4         Orange juice, Detergent, Soda

• OJ and soda are more likely to be purchased together.
• Detergent is never purchased with window cleaner or milk.
• Milk is never purchased with soda or detergent.
Association Rules - The Useful Rule
Association Rules - The Trivial Rule
Association Rules - The Inexplicable Rule
The Basic Process in Market Basket Analysis
Basic Process: Generating rules
Performance Measures - Support
Support
- What fraction of all transactions contain both the “condition” and the “result”, i.e. $P(\text{condition} \cap \text{result})$?
Performance Measures - Confidence
Confidence
- Of the transactions that include the “condition”, what fraction also contain the “result”?

$$C = P(\text{result} \mid \text{condition}) = \frac{P(\text{condition} \cap \text{result})}{P(\text{condition})} = \frac{\text{\# of transactions that include condition and result}}{\text{\# of transactions that include condition}}$$

- Conditional probability
- Degree of association; may not imply causality
- Not symmetric: the confidence of “if A then B” generally differs from that of “if B then A”
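
To make the two measures concrete, here is a minimal Python sketch that computes support and confidence over a small list of baskets. Baskets 1 to 4 are the transactions listed earlier in these slides. The fifth basket is an assumption, chosen to match the item totals of the co-occurrence matrix in the worked example, and the function names `support` and `confidence` are illustrative only.

```python
# Baskets 1-4 come from the transaction list in the slides.
# Basket 5 is an ASSUMPTION (it is not in the extracted list); it is chosen
# to match the item totals of the co-occurrence matrix in the worked example.
transactions = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},  # assumed fifth basket
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(condition, result, baskets):
    """P(result | condition) = support(condition and result) / support(condition)."""
    both = set(condition) | set(result)
    return support(both, baskets) / support(condition, baskets)


print("support(OJ and soda)     =", support({"orange juice", "soda"}, transactions))
print("confidence(soda -> OJ)   =", confidence({"soda"}, {"orange juice"}, transactions))
print("confidence(OJ -> soda)   =", confidence({"orange juice"}, {"soda"}, transactions))
```

The last two lines use the same pair of items but swap condition and result, so they share one support value while their confidences differ.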
Performance Measures - Improvement
Improvement (lift)
- Lift (improvement) tells us how much better a rule is at
predicting the result than just guessing the result at random
Improvement example
Basic Process: Generating rules - Example
Customer Items
Basic Process: Generating rules - Example
Co-Occurrence Matrix
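Each cell gives the number of transactions containing both items, with the fraction of the 5 transactions in parentheses. For example, OJ appears in 4/5 transactions, so its diagonal entry is 4 (0.8).

                OJ       Window Cleaner  Milk     Soda     Detergent
OJ              4 (0.8)  1 (0.2)         1 (0.2)  2 (0.4)  2 (0.4)
Window Cleaner  1 (0.2)  2 (0.4)         1 (0.2)  1 (0.2)  0
Milk            1 (0.2)  1 (0.2)         1 (0.2)  0        0
Soda            2 (0.4)  1 (0.2)         0        3 (0.6)  1 (0.2)
Detergent       2 (0.4)  0               0        1 (0.2)  2 (0.4)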
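
A co-occurrence matrix of this kind can be tabulated directly from the baskets. The sketch below is one possible way to do it in Python. Baskets 1 to 4 are the transactions listed earlier in the slides. The fifth basket is an assumption inferred from the item totals of the matrix, and the helper name `cooccurrence` is hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement

items = ["OJ", "Window Cleaner", "Milk", "Soda", "Detergent"]

# Baskets 1-4 are from the slides; basket 5 is an ASSUMPTION inferred from
# the matrix totals (Window Cleaner appears twice, Soda three times).
transactions = [
    {"OJ", "Soda"},
    {"Milk", "OJ", "Window Cleaner"},
    {"OJ", "Detergent"},
    {"OJ", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},  # assumed fifth basket
]


def cooccurrence(baskets):
    """Count, for every ordered item pair (including an item with itself),
    the number of baskets that contain both items."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations_with_replacement(sorted(basket), 2):
            counts[(a, b)] += 1
            if a != b:
                counts[(b, a)] += 1  # keep the matrix symmetric
    return counts


counts = cooccurrence(transactions)
n = len(transactions)
for row in items:
    cells = [f"{counts[(row, col)]} ({counts[(row, col)] / n:.1f})" for col in items]
    print(f"{row:<15}" + "  ".join(cells))
```

The diagonal entries are the single-item counts, and dividing any cell by the number of baskets gives the support of that item pair.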
Basic Process: Generating rules - Example
Assume the most common combination is ‘A, B, C’:
P(A and B and C) = 5%
Basic Process: Generating rules - Example
Which of A, B, and C should be the result?
Choose the result on the basis of ‘confidence’.
Confidence of the rule “If condition then result”:
– P(Result|Condition) = P(Result and Condition)/P(Condition)
Confidence of the rule “If AB then C” = P(C|AB) = P(ABC)/P(AB)

Association Rule  P(condition)  P(condition and result) (support)  Confidence
If AB then C      25%           5%                                 0.20
Basic Process: Generating rules - Example
What if P(R) > P(R|C) (= confidence)?
‘Improvement’ tells how much better a rule is than just guessing the result at random.

$$\text{Improvement} = \frac{P(R \mid C)}{P(R)} = \frac{P(R \cap C)}{P(R)\,P(C)}$$

– If Improvement > 1, the rule predicts the result better than random guessing.
– If Improvement < 1, the rule does worse than random guessing, and the negative rule “If C, then NOT R” is better.
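
As a worked illustration of the improvement formula, the sketch below evaluates the rule “If AB then C” and its negative counterpart. P(AB) = 25% and P(ABC) = 5% are the values given for this rule in the example. P(C) = 40% is an assumed value, since the slides as extracted do not state it, and the function name `improvement` is hypothetical.

```python
def improvement(p_condition, p_condition_and_result, p_result):
    """Improvement (lift) = P(R|C) / P(R) = P(R and C) / (P(R) * P(C))."""
    confidence = p_condition_and_result / p_condition
    return confidence / p_result


p_ab = 0.25   # P(condition) for "If AB then C" (from the example)
p_abc = 0.05  # P(condition and result) (from the example)
p_c = 0.40    # ASSUMED P(C); not given in the slides

lift = improvement(p_ab, p_abc, p_c)

# Negative rule "If AB then NOT C":
#   P(AB and not C) = P(AB) - P(ABC),  P(not C) = 1 - P(C)
neg_lift = improvement(p_ab, p_ab - p_abc, 1 - p_c)

print(f"If AB then C:     improvement = {lift:.2f}")
print(f"If AB then NOT C: improvement = {neg_lift:.2f}")
```

Under the assumed P(C), the original rule has an improvement below 1 and the negative rule has one above 1, which is the situation the second bullet describes.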
Strengths of Market Basket Analysis
Weaknesses of Market Basket Analysis
Application of Market Basket Analysis
Apriori Algorithm
Apriori Algorithm – Phase 1
Step 0. Specify the minimum support $s_{\min}$.
$k = 1$, $C_1 = [\{i_1\}, \{i_2\}, \ldots, \{i_m\}]$, $L_1 = \{c \in C_1 \mid \mathrm{supp}(c) \ge s_{\min}\}$
Step 1. $k = k + 1$
Generate new candidate itemsets $C_k$ from $L_{k-1}$
(apriori-gen function)
Step 1-1. (join)
Generate candidate k-itemsets by joining $L_{k-1}$ with itself: $C_k = L_{k-1} * L_{k-1}$
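
The level-wise search of Phase 1 can be sketched in Python as follows. This is an illustrative sketch, not the slides' own implementation. The join here is a simple set-union variant rather than the prefix-based join of the original apriori-gen, and the prune step (every (k-1)-subset of a candidate must be large) is taken from the standard description of Apriori. After pruning, both joins yield the same candidates. Baskets 1 to 4 are from the market basket example, the fifth basket is an assumption, and all function names are hypothetical.

```python
from itertools import combinations


def apriori_gen(prev_large, k):
    """Build candidate k-itemsets C_k from the large (k-1)-itemsets L_{k-1}."""
    prev = set(prev_large)
    # Join: union pairs of (k-1)-itemsets whose union has exactly k items.
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune: every (k-1)-subset of a candidate must itself be large.
    return {c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}


def frequent_itemsets(baskets, s_min):
    """Return {itemset: support} for every itemset with support >= s_min."""
    n = len(baskets)

    def supp(c):
        return sum(c <= basket for basket in baskets) / n

    # Step 0: C_1 holds every single item.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    large, k = {}, 1
    while candidates:
        level = {c: supp(c) for c in candidates if supp(c) >= s_min}  # L_k
        large.update(level)
        k += 1                              # Step 1
        candidates = apriori_gen(level, k)  # Step 1-1 (join), then prune
    return large


baskets = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},  # ASSUMED fifth basket, not in the extracted list
]
large = frequent_itemsets(baskets, s_min=0.4)
for itemset, s in sorted(large.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The loop ends as soon as no candidates survive, because an itemset can only be large if all of its subsets are large.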
Rules generated
L={b,c,g}
Apriori Algorithm – Theorem
Sequential Patterns
Sequence: a list of items ordered by time (or by another ordering).
– e.g. $s_1 = \langle A_1, A_2, \ldots, A_n \rangle$, $s_2 = \langle B_1, B_2, \ldots, B_m \rangle$
Algorithm for Finding Sequences
Agrawal and Srikant (1995)
1. Sort Phase: convert the transaction database into customer sequences.
2. L-itemset Phase: find the set L of all l-itemsets (large itemsets), i.e. the itemsets that meet the minimum support.
3. Transformation Phase: transform each customer sequence by replacing every transaction with the set of all l-itemsets contained in that transaction.
4. Sequence Phase: generate the large sequences.
5. Maximal Phase: find the maximal sequences among the large sequences.
Algorithm for Finding Sequences
Example (cont’d)
Min. support: 0.4
2) L-itemset Phase
l-itemset  Support (# of customers)  Mapped to
{a}        4                         1
{b}        3                         2
{e}        2                         3
{e, g}     2                         4
{g}        3                         5
3) Transformation Phase
Cust #  Customer sequence       Transformed sequence
1       <{a}, {b}>              <{1}, {2}>
2       <{c,d}, {a}, {e,f,g}>   <{1}, {3, 4, 5}>
3       <{a,h,g}>               <{1, 5}>
4       <{a}, {e,g}, {b}>       <{1}, {3, 4, 5}, {2}>
5       <{b}>                   <{2}>
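
The two phases above can be reproduced with a short Python sketch. It uses the five customer sequences and the minimum support of 0.4 from the example. It enumerates every subset of every transaction by brute force, which is only reasonable for tiny examples, and the function names (`subsets`, `litemsets`, `transform`) are hypothetical.

```python
from itertools import combinations

# Customer sequences from the example: one list of transactions per customer.
customer_seqs = {
    1: [{"a"}, {"b"}],
    2: [{"c", "d"}, {"a"}, {"e", "f", "g"}],
    3: [{"a", "h", "g"}],
    4: [{"a"}, {"e", "g"}, {"b"}],
    5: [{"b"}],
}
MIN_SUPPORT = 0.4


def subsets(transaction):
    """All non-empty subsets of one transaction, as sorted tuples."""
    items = sorted(transaction)
    return {c for r in range(1, len(items) + 1) for c in combinations(items, r)}


def litemsets(seqs, min_support):
    """Itemsets supported by at least `min_support` of the customers.
    A customer supports an itemset if any one of their transactions contains it."""
    counts = {}
    for seq in seqs.values():
        supported = set().union(*(subsets(t) for t in seq))  # count once per customer
        for s in supported:
            counts[s] = counts.get(s, 0) + 1
    return {s: c for s, c in counts.items() if c / len(seqs) >= min_support}


large = litemsets(customer_seqs, MIN_SUPPORT)
# Map each l-itemset to an integer id, in sorted order.
ids = {s: i for i, s in enumerate(sorted(large), start=1)}


def transform(seq):
    """Replace each transaction by the ids of the l-itemsets it contains;
    transactions that contain no l-itemset are dropped."""
    out = []
    for t in seq:
        mapped = {ids[s] for s in ids if set(s) <= t}
        if mapped:
            out.append(mapped)
    return out


for itemset, i in ids.items():
    print(i, set(itemset), "support:", large[itemset])
for cust, seq in customer_seqs.items():
    print(cust, transform(seq))
```

Customer 2's first transaction {c,d} contains no l-itemset, so it disappears from the transformed sequence, as in the table.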
Algorithm for Finding Sequences
4) Sequence Phase
• AprioriAll: does not guarantee maximal sequences, so it requires the Maximal Phase
• AprioriSome, DynamicSome: guarantee maximal sequences
AprioriAll
Step 0. Set all large 1-sequences to $L_1$.
k=1
Step 1. k=k+1
$C_k = L_{k-1} * L_{k-1}$
Step 2. Obtain $L_k$ from $C_k$.
Stop if $L_k = \emptyset$; otherwise, repeat Step 1.
Example (cont’d)
L1=[<1>, <2>, <3>, <4>, <5>]
L2=[<1, 2>, <1, 3>, <1, 4>, <1, 5>]
L3=[] → stop
Max seq: <1, 2> and <1, 4>
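
The maximal sequences in this example follow from containment between the underlying itemsets: id 4 stands for {e, g}, which contains both {e} (id 3) and {g} (id 5). The sketch below shows one way to run the Maximal Phase on the large sequences listed above, using the id mapping from the L-itemset phase. The function names are hypothetical.

```python
# Mapping from ids back to l-itemsets (from the L-itemset phase of the example).
ITEMSETS = {1: {"a"}, 2: {"b"}, 3: {"e"}, 4: {"e", "g"}, 5: {"g"}}

# Large sequences found by AprioriAll in the example (L1 and L2).
large_seqs = [(1,), (2,), (3,), (4,), (5,), (1, 2), (1, 3), (1, 4), (1, 5)]


def contained_in(s, t):
    """True if sequence s is contained in sequence t: the elements of s can be
    matched, in order, to distinct elements of t whose itemsets include them."""
    pos = 0
    for element in s:
        # Greedily advance to the first element of t that covers this itemset.
        while pos < len(t) and not ITEMSETS[element] <= ITEMSETS[t[pos]]:
            pos += 1
        if pos == len(t):
            return False
        pos += 1
    return True


def maximal(seqs):
    """Keep only the sequences that are not contained in any other sequence."""
    return [s for s in seqs
            if not any(s != t and contained_in(s, t) for t in seqs)]


print(maximal(large_seqs))
```

Here <1, 3> and <1, 5> are dropped because both are contained in <1, 4>, which leaves <1, 2> and <1, 4>.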
Algorithm for Finding Sequences
Example
cust Transformed seq
1 <{1,5}, {2}, {3}, {4}>
2 <{1}, {3}, {4}, {3,5}>
3 <{1}, {2}, {3}, {4}>
4 <{1}, {3}, {5}>
5 <{4}, {5}>
Min support: 0.4
L1=[<1>, <2>, <3>, <4>, <5>]
L2=[<1 2>, <1 3>, <1 4>, <1 5>, <2 3>, <2 4>, <3 4>, <3 5>, <4 5>]
C3=[<1 2 3>, <1 2 4>, <1 3 4>, <1 3 5>, <1 4 5>, <2 3 4>, <2 3 5>, <2 4 5>, <3 4 5>]
L3=[<1 2 3>, <1 2 4>, <1 3 4>, <1 3 5>, <2 3 4>]
C4=L4=[<1 2 3 4>]
Max seq: <1 2 3 4>, <1 3 5>, <4 5>
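
The AprioriAll run on this example can be traced with the following Python sketch. Two points are assumptions rather than statements from the slides. First, the join `*` is implemented as "drop the first id of s, drop the last id of t, and join if the remainders match", which reproduces the C3 and C4 shown above. Second, maximality is tested on the integer ids alone, because the id-to-itemset mapping is not given for this example. All function names are hypothetical.

```python
# Transformed customer sequences from the example (sets of l-itemset ids).
DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
MIN_SUPPORT = 0.4


def occurs_in(seq, customer):
    """True if the ids of seq appear in order, each in a later transaction."""
    pos = 0
    for x in seq:
        while pos < len(customer) and x not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def large_only(candidates, db, min_support):
    """Keep the candidates supported by at least min_support of the customers."""
    return [c for c in candidates
            if sum(occurs_in(c, cust) for cust in db) / len(db) >= min_support]


def join(seqs):
    """C_k = L_{k-1} * L_{k-1} (assumed join): extend s by the last id of t
    whenever s without its first id equals t without its last id."""
    return sorted({s + t[-1:] for s in seqs for t in seqs if s[1:] == t[:-1]})


def apriori_all(db, min_support):
    ids = sorted(set().union(*(t for cust in db for t in cust)))
    level = large_only([(i,) for i in ids], db, min_support)  # Step 0: L1
    levels = []
    while level:  # stop when L_k is empty
        levels.append(level)
        level = large_only(join(level), db, min_support)      # Steps 1 and 2
    return levels


def is_subseq(s, t):
    """True if s is a (not necessarily contiguous) subsequence of t."""
    it = iter(t)
    return all(x in it for x in s)


def maximal(levels):
    flat = [s for level in levels for s in level]
    return [s for s in flat if not any(s != t and is_subseq(s, t) for t in flat)]


levels = apriori_all(DB, MIN_SUPPORT)
for k, level in enumerate(levels, start=1):
    print(f"L{k} =", level)
print("Max seq:", maximal(levels))
```

Unlike the hand-worked listing, this sketch also generates candidates such as <2 1> or <3 3> at k=2. None of them reaches the minimum support, so the resulting L2 is the same.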
Algorithm for Finding Sequences
AprioriSome
(Forward phase)
Step 0. k=1; obtain $L_1$; $C_1 = L_1$; last=1.
Step 1. (Generate $C_k$)
k←k+1
1) $L_{k-1}$ known: $C_k = L_{k-1} * L_{k-1}$
2) $L_{k-1}$ unknown: $C_k = C_{k-1} * C_{k-1}$
Step 2. (Select k for $L_k$)
Stop if $C_k = \emptyset$; otherwise, proceed.
1) If k = next(last), obtain $L_k$, set last=k, and go to Step 1.
2) If k ≠ next(last), go to Step 1.
(Backward phase)
Step 0. k=kmax
Step 1.
1) $L_k$ known: delete from $L_k$ all subsequences of the sequences in $L_i$ ($i > k$)
2) $L_k$ unknown: delete from $C_k$ all subsequences of the sequences in $L_i$ ($i > k$), then obtain $L_k$ from the remaining candidates
Step 2. k←k-1; go to Step 1.
Algorithm for Finding Sequences
The function ‘next’ determines which sequence length is counted next.
– Agrawal & Srikant (1995)

$$\mathrm{hit}(k) = \frac{|L_k|}{|C_k|}$$

1) If hit(k) < 0.666, next(k)=k+1
2) If 0.666 ≤ hit(k) < 0.75, next(k)=k+2
3) If 0.75 ≤ hit(k) < 0.80, next(k)=k+3
4) If 0.80 ≤ hit(k) < 0.85, next(k)=k+4
5) If hit(k) ≥ 0.85, next(k)=k+5
Once all the large sequences are obtained, their union gives the maximal sequences.
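
The threshold rule translates directly into code. The sketch below is one way to write it. The names `hit` and `next_length` are hypothetical (the latter avoids shadowing Python's built-in `next`), and the sample counts are the sizes of L2 and C2 in the AprioriSome example.

```python
def hit(n_large, n_candidates):
    """hit(k) = |L_k| / |C_k|: the fraction of candidates that turned out large."""
    return n_large / n_candidates


def next_length(k, hit_k):
    """Next sequence length to count after length k, using the thresholds above."""
    if hit_k < 0.666:
        return k + 1
    if hit_k < 0.75:
        return k + 2
    if hit_k < 0.80:
        return k + 3
    if hit_k < 0.85:
        return k + 4
    return k + 5


# 9 large 2-sequences out of 10 candidates gives hit(2) = 0.9.
print(next_length(2, hit(9, 10)))
```

The higher the hit ratio, the more lengths are skipped, on the reasoning that many large k-sequences suggest longer large sequences exist.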
Algorithm for Finding Sequences
Example (cont’d) - AprioriSome
Assume next(i)=2i
(Forward phase)
Iteration 0.
L1=C1=[<1>, <2>, <3>, <4>, <5>], last=1
Iteration 1. (k=2)
C2=[<1 2>, <1 3>, <1 4>, <1 5>, <2 3>, <2 4>, <2 5>, <3 4>, <3 5>, <4 5>]
next(1)=2=k
L2=[<1 2>, <1 3>, <1 4>, <1 5>, <2 3>, <2 4>, <3 4>, <3 5>, <4 5>]
last=2
Iteration 2. (k=3)
C3=[<1 2 3>, <1 2 4>, <1 3 4>, <1 3 5>, <1 4 5>, <2 3 4>, <2 3 5>, <2 4 5>,<3 4 5>]
next(2)=4≠3, so L3 is not obtained
Iteration 3.(k=4)
C4=[<1 2 3 4>, <1 2 3 5>, <1 2 4 5>, <1 3 4 5>, <2 3 4 5>]
next(2)=4=k
L4=[<1 2 3 4>]
Algorithm for Finding Sequences
Example(cont’d)
(Backward phase)
Iteration 0.
kmax=4
Iteration 1.
L4=[<1 2 3 4>]; k=3
Iteration 2.
L3=[<1 3 5>]; k=2
Iteration 3.
L2=[<4 5>]
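
Both phases of this AprioriSome example can be traced with the Python sketch below, using next(i)=2i as in the example. Several details are assumptions: the join `*` is the same suffix/prefix join that reproduces the candidate lists shown in the example, containment is tested on integer ids only, and the forward phase additionally stops at the length of the longest customer sequence, a guard added here so the loop always terminates. All function names are hypothetical.

```python
DB = [  # transformed customer sequences from the example
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]


def occurs_in(seq, customer):
    """True if the ids of seq appear in order, each in a later transaction."""
    pos = 0
    for x in seq:
        while pos < len(customer) and x not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def large_only(candidates, db, min_support):
    return [c for c in candidates
            if sum(occurs_in(c, cust) for cust in db) / len(db) >= min_support]


def join(seqs):
    """Assumed join: extend s by the last id of t when s[1:] == t[:-1]."""
    return sorted({s + t[-1:] for s in seqs for t in seqs if s[1:] == t[:-1]})


def is_subseq(s, t):
    it = iter(t)
    return all(x in it for x in s)


def apriori_some(db, min_support, next_fn):
    ids = sorted(set().union(*(t for cust in db for t in cust)))
    L = {1: large_only([(i,) for i in ids], db, min_support)}
    C = {1: L[1]}
    last, k = 1, 1
    max_len = max(len(cust) for cust in db)  # guard: no longer sequence can be supported
    # Forward phase: generate every C_k, but count only the lengths picked by next_fn.
    while k < max_len:
        k += 1
        cands = join(L[k - 1] if k - 1 in L else C[k - 1])
        if not cands:
            break
        C[k] = cands
        if k == next_fn(last):
            L[k] = large_only(cands, db, min_support)
            last = k
    # Backward phase: from the longest length down, drop everything contained
    # in a longer large sequence, counting the lengths skipped going forward.
    for k in range(max(C), 0, -1):
        longer = [t for i, seqs in L.items() if i > k for t in seqs]
        known = k in L
        kept = [s for s in (L[k] if known else C[k])
                if not any(is_subseq(s, t) for t in longer)]
        L[k] = kept if known else large_only(kept, db, min_support)
    return L


result = apriori_some(DB, 0.4, lambda i: 2 * i)
for k in sorted(result, reverse=True):
    print(f"L{k} =", result[k])
```

Only five of the nine length-3 candidates need to be counted in the backward phase, since the other four are already known to be contained in <1 2 3 4>.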