Frequent Itemset Mining: the Apriori Algorithm
Philippe Fournier-Viger
http://www.philippe-Fournier-viger.com
Introduction
Discovering patterns and associations: discovering interesting relationships hidden in large databases.
e.g. beer and diapers are often sold together
Pattern mining is a fundamental data mining problem with many applications in various fields.
Introduced by Agrawal (1993).
Many extensions of this problem to discover
patterns in graphs, sequences, and other
kinds of data.
FREQUENT ITEMSET MINING
Definitions
Let I = {i1, i2, …, im} be the set of items (products) sold in a retail store.
For example: I = {pasta, lemon, bread, orange, cake}
Definitions
A transaction database D is a set of transactions D = {T1, T2, …, Tr} such that each transaction Ti ⊆ I.

Transaction | pasta | lemon | bread | orange | cake
T1 | 1 | 1 | 1 | 1 | 0
T2 | 1 | 1 | 0 | 0 | 0
T3 | 1 | 0 | 0 | 1 | 1
T4 | 1 | 1 | 0 | 1 | 1
Itemsets of size 1:
{pasta}, {lemon}, {bread}, {orange}, {cake}
Itemsets of size 2:
{pasta, lemon}, {pasta, bread}, {pasta, orange}, {pasta, cake}, {lemon, bread}, {lemon, orange}, …
Definitions
The support (frequency) of an itemset X is the number of transactions that contain X:
sup(X) = |{T ∈ D | X ⊆ T}|
For example: the support of {pasta, orange} is 3, which is written as sup({pasta, orange}) = 3.

Transaction | Items appearing in the transaction
T1 | {pasta, lemon, bread, orange}
T2 | {pasta, lemon}
T3 | {pasta, orange, cake}
T4 | {pasta, lemon, orange, cake}
Definitions
The support of an itemset X can also be written as a ratio (relative support).
Example: The support of {pasta, orange} is
75% because it appears in 3 out of 4
transactions.
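To make these definitions concrete, here is a minimal Python sketch (the function name is mine, not from the slides) that computes the absolute and relative support of an itemset over the running example database:

```python
# The example database: each transaction is a set of items.
database = [
    {"pasta", "lemon", "bread", "orange"},   # T1
    {"pasta", "lemon"},                      # T2
    {"pasta", "orange", "cake"},             # T3
    {"pasta", "lemon", "orange", "cake"},    # T4
]

def support(itemset, database):
    """Absolute support: the number of transactions that contain the itemset."""
    return sum(1 for transaction in database if itemset <= transaction)

print(support({"pasta", "orange"}, database))                  # 3
print(support({"pasta", "orange"}, database) / len(database))  # 0.75 (relative support)
```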
Numerous applications
Frequent itemset mining has
numerous applications.
◦ medical applications,
◦ chemistry,
◦ biology,
◦ e-learning,
◦ etc.
Several algorithms
Algorithms:
◦ Apriori, AprioriTID (1993)
◦ Eclat (1997)
◦ FPGrowth (2000)
◦ Hmine (2001)
◦ LCM, …
Moreover, numerous extensions of the
FIM problem: uncertain data, fuzzy data,
purchase quantities, profit, weight, time,
rare itemsets, closed itemsets, etc.
ALGORITHMS
Naïve approach
If there are n items in a database, there are 2^n − 1 itemsets that may be frequent.
Naïve approach: count the support of
all these itemsets.
To do that, we would need to read each
transaction in the database to count
the support of each itemset.
This would be inefficient:
◦ need to perform too many comparisons
◦ requires too much memory
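A sketch of this naïve approach in Python (reusing the example database) shows the problem: it enumerates and counts every one of the 2^n − 1 itemsets.

```python
from itertools import combinations

database = [
    {"pasta", "lemon", "bread", "orange"},
    {"pasta", "lemon"},
    {"pasta", "orange", "cake"},
    {"pasta", "lemon", "orange", "cake"},
]
items = sorted({item for t in database for item in t})

# Naive approach: enumerate every non-empty itemset and scan the whole
# database for each one: 2^n - 1 itemsets in total, exponential in n.
supports = {}
for size in range(1, len(items) + 1):
    for itemset in combinations(items, size):
        supports[itemset] = sum(1 for t in database if set(itemset) <= t)

print(len(supports))  # 31 itemsets for n = 5 items (2^5 - 1)
```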
Search space
These are all the itemsets that can be formed with the items lemon (l), pasta (p), bread (b), orange (o) and cake (c):
∅
l   p   b   o   c
lp  lb  lo  lc  pb  po  pc  bo  bc  oc
lpb  lpo  lpc  lbo  lbc  loc  pbo  pbc  poc  boc
lpbo  lpbc  lpoc  lboc  pboc
lpboc
(l = lemon, p = pasta, b = bread, o = orange, c = cake)
This forms a lattice, which can be viewed as a Hasse diagram.
Search space
If minsup = 2, the frequent itemsets (highlighted in yellow on the original slide) are:
l, p, o, c, lp, lo, po, pc, oc, lpo, poc
(l = lemon, p = pasta, b = bread, o = orange, c = cake)
The search space grows exponentially with the number of items: compare I = {A}, I = {A, B}, I = {A, B, C}, …, I = {A, B, C, D, E, F}.
How to find the frequent itemsets?
Two challenges:
How to count the support of itemsets in an efficient way (without spending too much time or memory)?
How to reduce the search space (we do not want to consider all the possibilities)?
THE APRIORI ALGORITHM (AGRAWAL & SRIKANT, 1993/1994)
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.
Introduction
Apriori is a famous algorithm which:
◦ is not the most efficient algorithm,
◦ but has inspired many other algorithms!
◦ has been applied in many fields,
◦ has been adapted for many other similar problems.
Apriori property: Let there be two itemsets X and Y such that X ⊆ Y. The support of Y is less than or equal to the support of X.
Example:
• The support of {pasta} is 4
• The support of {pasta, lemon} is 3
• The support of {pasta, lemon, orange} is 2
lp  lb  lo  lc  pb  po  pc  bo  bc  oc
lpb  lpo  lpc  lbo  lbc  loc  pbo  pbc  poc  boc
lpbo  lpbc  lpoc  lboc  pboc
lpboc
(In the lattice, all supersets of an infrequent itemset are infrequent itemsets.)
This property is useful to reduce the search space. Example: if «bread» is infrequent (minsup = 2), all itemsets containing bread can be pruned:
∅
l   p   b   o   c
lp  lb  lo  lc  pb  po  pc  bo  bc  oc
lpb  lpo  lpc  lbo  lbc  loc  pbo  pbc  poc  boc
lpbo  lpbc  lpoc  lboc  pboc
lpboc
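This pruning check is simple to implement. A minimal Python sketch (the helper name has_infrequent_subset is mine) that tests whether a candidate has an infrequent subset and can therefore be pruned:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent):
    """Return True if some (k-1)-subset of the size-k candidate is not in
    the set of frequent (k-1)-itemsets: the candidate can then be pruned."""
    return any(frozenset(s) not in frequent
               for s in combinations(candidate, len(candidate) - 1))

# If {bread} is infrequent, any candidate containing bread is pruned:
frequent_items = {frozenset({"pasta"}), frozenset({"lemon"})}
print(has_infrequent_subset(("pasta", "bread"), frequent_items))  # True
```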
Property 2: Let there be an itemset Y. If there exists an itemset X ⊆ Y such that X is infrequent, then Y is infrequent.

The Apriori algorithm
Step 1: scan the database to calculate the support of all itemsets of size 1.
e.g.
{pasta} support = 4
{lemon} support = 3
{bread} support = 1
{orange} support = 3
{cake} support = 2
The Apriori algorithm
Step 2: eliminate infrequent
itemsets.
e.g.
{pasta} support = 4
{lemon} support = 3
{orange} support = 3
{cake} support = 2
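Steps 1 and 2 amount to one database scan plus a filter. A minimal sketch over the running example (minsup = 2):

```python
from collections import Counter

database = [
    {"pasta", "lemon", "bread", "orange"},
    {"pasta", "lemon"},
    {"pasta", "orange", "cake"},
    {"pasta", "lemon", "orange", "cake"},
]
minsup = 2

# Step 1: one scan to count the support of every item.
counts = Counter(item for transaction in database for item in transaction)

# Step 2: eliminate infrequent items.
frequent_1 = {item: sup for item, sup in counts.items() if sup >= minsup}
print(frequent_1)  # bread (support 1) is eliminated
```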
The Apriori algorithm
Step 3: generate candidates of size 2 by combining pairs of frequent itemsets of size 1.
Frequent items: {pasta}, {lemon}, {orange}, {cake}
Candidates of size 2: {pasta, lemon}, {pasta, orange}, {pasta, cake}, {lemon, orange}, {lemon, cake}, {orange, cake}
The Apriori algorithm
Step 4: eliminate candidates of size 2 that have an infrequent subset (Property 2): none!
Frequent items: {pasta}, {lemon}, {orange}, {cake}
Candidates of size 2: {pasta, lemon}, {pasta, orange}, {pasta, cake}, {lemon, orange}, {lemon, cake}, {orange, cake}
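Candidate generation for size 2 is just pairing the frequent items; since every subset of a pair is a frequent single item, nothing is pruned at this level. A sketch:

```python
from itertools import combinations

frequent_items = ["cake", "lemon", "orange", "pasta"]  # from step 2

# Step 3: combine pairs of frequent items into candidates of size 2.
candidates_2 = [frozenset(pair) for pair in combinations(frequent_items, 2)]

# Step 4: prune candidates having an infrequent subset. For size 2, every
# 1-subset is frequent by construction, so no candidate is eliminated.
print(len(candidates_2))  # 6 candidates, as on the slide
```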
The Apriori algorithm
Step 5: scan the database to calculate the support of the remaining candidate itemsets of size 2.
Candidates of size 2: {pasta, lemon} support: 3, {pasta, orange} support: 3, {pasta, cake} support: 2, {lemon, orange} support: 2, {lemon, cake} support: 1, {orange, cake} support: 2

Step 6: eliminate the infrequent candidates ({lemon, cake}, with support 1 < minsup).

Step 7: generate candidates of size 3 by combining pairs of frequent itemsets of size 2.
Frequent itemsets of size 2: {pasta, lemon}, {pasta, orange}, {pasta, cake}, {lemon, orange}, {orange, cake}
Candidates of size 3: {pasta, lemon, orange}, {pasta, lemon, cake}, {pasta, orange, cake}, {lemon, orange, cake}
The Apriori algorithm
Step 8: eliminate candidates of size 3 having a subset of size 2 that is infrequent.
Frequent itemsets of size 2: {pasta, lemon}, {pasta, orange}, {pasta, cake}, {lemon, orange}, {orange, cake}
Candidates of size 3: {pasta, lemon, orange}, {pasta, lemon, cake}, {pasta, orange, cake}, {lemon, orange, cake}
{pasta, lemon, cake} and {lemon, orange, cake} are eliminated, because {lemon, cake} is infrequent!
The remaining candidates of size 3 are {pasta, lemon, orange} and {pasta, orange, cake}.
The Apriori algorithm
Step 9: scan the database to calculate the support of the remaining candidates of size 3.
Candidates of size 3: {pasta, lemon, orange} support: 2, {pasta, orange, cake} support: 2
The Apriori algorithm
Step 10: eliminate infrequent candidates (none!).
Frequent itemsets of size 3: {pasta, lemon, orange}, {pasta, orange, cake}
The Apriori algorithm
Step 11: generate candidates of size 4 by
combining pairs of frequent itemsets of size 3.
Frequent itemsets of size 3: {pasta, lemon, orange}, {pasta, orange, cake}
Candidates of size 4: {pasta, lemon, orange, cake}
The Apriori algorithm
Step 12: eliminate candidates of size 4 having a
subset of size 3 that is infrequent.
Frequent itemsets of size 3: {pasta, lemon, orange}, {pasta, orange, cake}
Candidates of size 4: {pasta, lemon, orange, cake} is eliminated, because its subset {lemon, orange, cake} is infrequent!
The Apriori algorithm
Step 13: since there are no more candidates, we cannot generate candidates of size 5, and the algorithm stops.
Final result
{pasta} support = 4
{lemon} support = 3
{orange} support = 3
{cake} support = 2
{pasta, lemon} support = 3
{pasta, orange} support = 3
{pasta, cake} support = 2
{lemon, orange} support = 2
{orange, cake} support = 2
{pasta, lemon, orange} support = 2
{pasta, orange, cake} support = 2
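Putting all the steps together, here is a compact, self-contained Apriori sketch in Python (a didactic implementation of the steps above, not the authors' original code); on the running example with minsup = 2 it produces exactly the result listed above:

```python
from collections import defaultdict
from itertools import combinations

def apriori(database, minsup):
    """Return {itemset: support} for all frequent itemsets, level by level."""
    # Level 1: count single items in one database scan.
    counts = defaultdict(int)
    for transaction in database:
        for item in transaction:
            counts[frozenset([item])] += 1
    frequent = {x: s for x, s in counts.items() if s >= minsup}
    result, k = dict(frequent), 2
    while frequent:
        # Generate size-k candidates from frequent (k-1)-itemsets, pruning
        # those that have an infrequent (k-1)-subset (the Apriori property).
        level = list(frequent)
        candidates = set()
        for i, x in enumerate(level):
            for y in level[i + 1:]:
                c = x | y
                if len(c) == k and all(frozenset(s) in frequent
                                       for s in combinations(c, k - 1)):
                    candidates.add(c)
        # Scan the database to count the remaining candidates.
        counts = defaultdict(int)
        for transaction in database:
            for c in candidates:
                if c <= transaction:
                    counts[c] += 1
        frequent = {c: s for c, s in counts.items() if s >= minsup}
        result.update(frequent)
        k += 1
    return result

database = [
    {"pasta", "lemon", "bread", "orange"},
    {"pasta", "lemon"},
    {"pasta", "orange", "cake"},
    {"pasta", "lemon", "orange", "cake"},
]
for itemset, sup in sorted(apriori(database, 2).items(), key=lambda e: len(e[0])):
    print(set(itemset), "support =", sup)
```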
Technical details
Combining different itemsets can
generate the same candidate.
Example:
{A, B} and {A, E} → {A, B, E}
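A standard way to avoid generating these duplicates (used by Apriori-style implementations; the sketch below is my paraphrase) is to keep itemsets as sorted tuples and only join two (k−1)-itemsets that share their first k−2 items:

```python
def join(level, k):
    """Join sorted (k-1)-itemsets sharing their first k-2 items, so that
    each size-k candidate is generated exactly once."""
    level = sorted(level)
    candidates = []
    for i, x in enumerate(level):
        for y in level[i + 1:]:
            if x[:k - 2] != y[:k - 2]:
                break  # sorted order: no later y shares this prefix
            candidates.append(x[:k - 2] + (x[-1], y[-1]))
    return candidates

print(join([("A", "B"), ("A", "E"), ("B", "E")], 3))  # [('A', 'B', 'E')]
```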
PERFORMANCE COMPARISON
How to evaluate this type of algorithm?
Execution time,
Memory used,
Scalability: how the performance is
influenced by the number of
transactions
Performance on different types of data:
◦ real data,
◦ synthetic (fake) data,
◦ dense vs sparse data,…
…
Performance (execution time)
(charts comparing the algorithms' execution times)
Performance of Apriori
The performance of Apriori depends on several factors:
◦ the minsup parameter: the lower it is set, the larger the search space and the number of itemsets will be,
◦ the number of items,
◦ the number of transactions,
◦ the average transaction length.
Problems of Apriori
◦ can generate numerous candidates,
◦ requires scanning the database numerous times,
◦ generated candidates may not exist in the database,
…
A FEW OPTIMIZATIONS FOR THE APRIORI ALGORITHM
This is an advanced topic
Optimization 1
In terms of data structure:
Store all items as integers:
e.g. 1 = pasta, 2 = orange, 3 = bread…
Why?
◦ it is faster to compare two integers
than to compare two character
strings,
◦ requires less memory.
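A small sketch of this encoding (the item-to-integer mapping is arbitrary, following the slide's example):

```python
# Encode each item once as an integer, then mine over the integer codes.
item_to_id = {"pasta": 1, "orange": 2, "bread": 3, "lemon": 4, "cake": 5}
id_to_item = {v: k for k, v in item_to_id.items()}

transaction = {"pasta", "lemon", "bread", "orange"}
encoded = {item_to_id[item] for item in transaction}
print(encoded)                            # {1, 2, 3, 4}: integers compare cheaply
print({id_to_item[i] for i in encoded})   # decode back for display
```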
Optimization 2
To reduce the time required to calculate the support of itemsets, sort the transactions by ascending length:

Transaction | Items appearing in the transaction
T2 | {pasta, lemon}
T3 | {pasta, orange, cake}
T4 | {pasta, lemon, orange, cake}
T1 | {pasta, lemon, bread, orange}
Optimization 3
To reduce the time required to calculate the support of itemsets:
Replace all identical transactions by a single transaction with a weight.
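A sketch of this transformation with a counter of transactions (the weighted support computation at the end is the natural consequence, not spelled out on the slide):

```python
from collections import Counter

database = [
    {"pasta", "lemon"},
    {"pasta", "lemon"},           # identical to the first transaction
    {"pasta", "orange", "cake"},
]

# Merge identical transactions into one weighted transaction.
weighted = Counter(frozenset(t) for t in database)

# Support becomes a sum of weights instead of a count of transactions.
sup = sum(w for t, w in weighted.items() if {"pasta", "lemon"} <= t)
print(sup)  # 2, with only 2 distinct transactions stored instead of 3
```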
Optimization 4
To reduce the time required to calculate the support of itemsets:
Sort items in transactions
according to a total order (e.g.
alphabetical order).
Utilize binary search to quickly
check if an item appears in a
transaction.
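A sketch with Python's bisect module, assuming transactions are stored as alphabetically sorted lists:

```python
from bisect import bisect_left

def contains(sorted_transaction, item):
    """Binary search: does a transaction (a sorted list) contain an item?"""
    i = bisect_left(sorted_transaction, item)
    return i < len(sorted_transaction) and sorted_transaction[i] == item

t1 = sorted(["pasta", "lemon", "bread", "orange"])  # alphabetical total order
print(contains(t1, "lemon"))  # True, found in O(log n) comparisons
print(contains(t1, "cake"))   # False
```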
Optimization 5
Store candidates in a hash tree.
To calculate the support of candidates:
◦ calculate a hash value based on a transaction to determine which candidates may be contained in the transaction.
Other optimizations
Sampling and partitioning
AprioriTID: a variation
AprioriTID:
Annotate each itemset with the ids of the transactions that contain it,
use the intersection (∩) of these id lists to calculate the support of itemsets, instead of reading the database.
Example
transactions({pasta}) ∩ transactions({lemon})
= {T1, T2, T3, T4} ∩ {T1, T2, T4}
= {T1, T2, T4}
Hence sup({pasta, lemon}) = |{T1, T2, T4}| = 3.
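In Python, the same computation is a set intersection over tid-lists (a sketch, with transaction ids 1 to 4 standing for T1 to T4):

```python
# Tid-lists: which transactions contain each item.
tids = {
    "pasta":  {1, 2, 3, 4},
    "lemon":  {1, 2, 4},
    "orange": {1, 3, 4},
}

# Support of {pasta, lemon} by intersecting tid-lists, without any database scan.
shared = tids["pasta"] & tids["lemon"]
print(shared, len(shared))  # {1, 2, 4} 3
```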
AprioriTID_Bitset
AprioriTID_bitset:
Same idea, except that bit vectors are used instead of lists of ids.
This allows calculating the intersection using a logical AND, which is often very fast.
Example
Item | bit vector (T1 T2 T3 T4)
pasta | 1111
lemon | 1101
bread | 1000
orange | 1011
cake | 0011

transactions({pasta}) AND transactions({lemon})
= 1111 LOGICAL_AND 1101
= 1101, i.e. {T1, T2, T4}
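In Python, an arbitrary-precision integer can serve as the bit vector (a sketch; the leftmost bit stands for T1):

```python
# One bit per transaction (T1 T2 T3 T4), 1 if the item appears in it.
pasta = 0b1111
lemon = 0b1101

both = pasta & lemon           # logical AND intersects the two bit vectors
print(format(both, "04b"))     # 1101 -> {T1, T2, T4}
print(bin(both).count("1"))    # support of {pasta, lemon} = 3
```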
Conclusion
This video has presented:
The problem of frequent itemset
mining
The Apriori algorithm
Some optimizations