
Frequent Itemset Mining and the Apriori Algorithm

Philippe Fournier-Viger
http://www.philippe-Fournier-viger.com

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.

Source code and datasets available in the SPMF library.
Introduction
Many retail stores collect data about customers, e.g. customer transactions.
This data needs to be analyzed to understand customer behavior. Why?
◦ for marketing purposes,
◦ inventory management,
◦ customer relationship management.
Introduction
Discovering patterns and associations: discovering interesting relationships hidden in large databases.
e.g. beer and diapers are often sold together
Pattern mining is a fundamental data mining problem with many applications in various fields.
Introduced by Agrawal et al. (1993).
Many extensions of this problem exist to discover patterns in graphs, sequences, and other kinds of data.
FREQUENT ITEMSET MINING
Definitions
Let I be the set of items (products) sold in a retail store.

For example:

I = {pasta, lemon, bread, orange, cake}
Definitions
A transaction database D is a set of transactions:
D = {T1, T2, …, Tn} such that each transaction Ti is a subset of I.

Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}
Definitions
Each transaction has a unique identifier called its Transaction ID (TID).
e.g. the transaction ID of T4 is 4.

Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}
Definitions
A transaction is a set of items (an itemset).
e.g. T2 = {pasta, lemon}
An item (a symbol) appears either zero times or once in each transaction. Each transaction is unordered.

Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}
Definitions
A transaction database can be viewed as a binary matrix:

Transaction   pasta   lemon   bread   orange   cake
T1            1       1       1       1        0
T2            1       1       0       0        0
T3            1       0       0       1        1
T4            1       1       0       1        1

• Asymmetrical binary attributes (because 1 is more important than 0).
• There is no information about purchase quantities and prices.
Definitions
Let I be the set of all items:
I = {pasta, lemon, bread, orange, cake}
There are 2^|I| − 1 = 2^5 − 1 = 31 possible (non-empty) itemsets:
{pasta}, {lemon}, {bread}, {orange}, {cake},
{pasta, lemon}, {pasta, bread}, {pasta, orange}, {pasta, cake}, {lemon, bread}, {lemon, orange},
{lemon, cake}, {bread, orange}, {bread, cake},
…
{pasta, lemon, bread, orange, cake}
Definitions
An itemset is said to be of size k if it contains k items.

Itemsets of size 1:
{pasta}, {lemon}, {bread}, {orange}, {cake}

Itemsets of size 2:
{pasta, lemon}, {pasta, bread}, {pasta, orange}, {pasta, cake}, {lemon, bread}, {lemon, orange}, …
Definitions
The support (frequency) of an itemset X is the number of transactions that contain X:
sup(X) = |{T | T ∈ D ∧ X ⊆ T}|
For example, the support of {pasta, orange} is 3, which is written as: sup({pasta, orange}) = 3

Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}
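As a small illustration (not from the original slides), support can be computed directly from this definition, assuming transactions are represented as Python sets:

    def support(itemset, database):
        """Count the transactions that contain every item of `itemset`."""
        return sum(1 for transaction in database if itemset <= transaction)

    # The example database from the slides:
    D = [
        {"pasta", "lemon", "bread", "orange"},  # T1
        {"pasta", "lemon"},                     # T2
        {"pasta", "orange", "cake"},            # T3
        {"pasta", "lemon", "orange", "cake"},   # T4
    ]
    print(support({"pasta", "orange"}, D))  # prints 3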
Definitions
The support of an itemset X can also be written as a ratio (relative support).
Example: the relative support of {pasta, orange} is 75% because it appears in 3 out of 4 transactions.

Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}
The problem of frequent itemset mining
Let there be a numerical value minsup, set by the user.
Frequent itemset mining (FIM) consists of enumerating all frequent itemsets, that is, all itemsets having a support greater than or equal to minsup.

Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}
Example
Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}

For minsup = 2, the frequent itemsets are:
{lemon}, {pasta}, {orange}, {cake}, {lemon, pasta}, {lemon, orange}, {pasta, orange}, {pasta, cake}, {orange, cake}, {lemon, pasta, orange}, {pasta, orange, cake}

For the user, choosing a high minsup value:
◦ will reduce the number of frequent itemsets,
◦ will increase the speed and decrease the memory required for finding the frequent itemsets.
Numerous applications
Frequent itemset mining has numerous applications:
◦ medical applications,
◦ chemistry,
◦ biology,
◦ e-learning,
◦ etc.
Several algorithms
Algorithms:
◦ Apriori, AprioriTID (1993)
◦ Eclat (1997)
◦ FPGrowth (2000)
◦ H-Mine (2001)
◦ LCM, …
◦ …
Moreover, there are numerous extensions of the FIM problem: uncertain data, fuzzy data, purchase quantities, profit, weight, time, rare itemsets, closed itemsets, etc.
ALGORITHMS
Naïve approach
If there are n items in a database, there are 2^n − 1 itemsets that may be frequent.
Naïve approach: count the support of all these itemsets.
To do that, we would need to read each transaction in the database to count the support of each itemset.
This would be inefficient:
◦ it needs to perform too many comparisons,
◦ it requires too much memory.
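To make the cost concrete, here is a minimal sketch of the naïve approach (an illustration written for this text): it enumerates all 2^n − 1 itemsets with itertools and scans the database once per candidate.

    from itertools import combinations

    def naive_fim(database, minsup):
        """Enumerate every non-empty itemset and count its support (exponential!)."""
        items = sorted(set().union(*database))
        frequent = {}
        for k in range(1, len(items) + 1):
            for candidate in combinations(items, k):   # all C(n, k) itemsets of size k
                sup = sum(1 for t in database if set(candidate) <= t)
                if sup >= minsup:
                    frequent[candidate] = sup
        return frequent

    D = [{"pasta", "lemon", "bread", "orange"}, {"pasta", "lemon"},
         {"pasta", "orange", "cake"}, {"pasta", "lemon", "orange", "cake"}]
    print(naive_fim(D, 2))  # 11 frequent itemsets for minsup = 2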
Search space
These are all the itemsets that can be formed with the items lemon (l), pasta (p), bread (b), orange (o) and cake (c):

l    p    b    o    c

lp   lb   lo   lc   pb   po   pc   bo   bc   oc

lpb  lpo  lpc  lbo  lbc  loc  pbo  pbc  poc  boc

lpbo  lpbc  lpoc  lboc  pboc

lpboc

(l = lemon, p = pasta, b = bread, o = orange, c = cake)
This forms a lattice, which can be viewed as a Hasse diagram.
Search space
If minsup = 2, the frequent itemsets (highlighted in yellow on the original slide) are:
l, p, o, c, lp, lo, po, pc, oc, lpo, poc

l    p    b    o    c

lp   lb   lo   lc   pb   po   pc   bo   bc   oc

lpb  lpo  lpc  lbo  lbc  loc  pbo  pbc  poc  boc

lpbo  lpbc  lpoc  lboc  pboc

lpboc
The search space grows quickly with the number of items; the slide illustrates the lattices for I = {A}, I = {A, B}, I = {A, B, C}, I = {A, B, C, D}, I = {A, B, C, D, E}, and I = {A, B, C, D, E, F}: the number of itemsets grows exponentially.
How to find the frequent itemsets?
Two challenges:
How to count the support of itemsets in an efficient way (not spend too much time or memory)?
How to reduce the search space (we do not want to consider all the possibilities)?
THE APRIORI ALGORITHM (AGRAWAL & SRIKANT, 1993/1994)
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.
Introduction
Apriori is a famous algorithm which:
is not the most efficient algorithm,
but has inspired many other algorithms!
has been applied in many fields,
has been adapted for many other similar problems.

Apriori is based on two important properties.
Apriori property: Let there be two itemsets X and Y such that X ⊆ Y. The support of Y is less than or equal to the support of X.

Example:
• The support of {pasta} is 4.
• The support of {pasta, lemon} is 3.
• The support of {pasta, lemon, orange} is 2.

Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}

(support is anti-monotonic)
Illustration
minsup = 2
(On the slide, the lattice is divided into two regions: the frequent itemsets in the upper part, the infrequent itemsets in the lower part.)

∅

l    p    b    o    c

lp   lb   lo   lc   pb   po   pc   bo   bc   oc

lpb  lpo  lpc  lbo  lbc  loc  pbo  pbc  poc  boc

lpbo  lpbc  lpoc  lboc  pboc

lpboc
This property is useful to reduce the search space.
Example: if «bread» is infrequent, then all its supersets are infrequent.
minsup = 2

∅

l    p    b    o    c

lp   lb   lo   lc   pb   po   pc   bo   bc   oc

lpb  lpo  lpc  lbo  lbc  loc  pbo  pbc  poc  boc

lpbo  lpbc  lpoc  lboc  pboc

lpboc
Property 2: Let there be an itemset Y. If there exists an itemset X ⊆ Y such that X is infrequent, then Y is infrequent.

Example:
• Consider {bread, lemon}.
• If we know that {bread} is infrequent, then we can infer that {bread, lemon} is also infrequent.

Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}
The Apriori algorithm
I will now explain how the Apriori algorithm works.
Input:
◦ minsup
◦ a transaction database
Output:
◦ all the frequent itemsets

Consider minsup = 2.
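Before the step-by-step walkthrough, here is a compact Python sketch of the whole level-wise procedure (an illustrative reimplementation written for this text, not the authors' original code):

    from itertools import combinations

    def apriori(database, minsup):
        """Level-wise search: build size-k candidates from size-(k-1) frequent itemsets."""
        def sup(itemset):
            return sum(1 for t in database if set(itemset) <= t)

        items = sorted(set().union(*database))
        result = {(i,): sup((i,)) for i in items if sup((i,)) >= minsup}
        level, k = sorted(result), 2
        while level:
            # Join step: combine two sorted itemsets that differ only in their last item.
            candidates = {a + (b[-1],) for a, b in combinations(level, 2)
                          if a[:-1] == b[:-1]}
            # Prune step (Property 2): every (k-1)-subset must be frequent.
            candidates = [c for c in candidates
                          if all(s in result for s in combinations(c, k - 1))]
            # Database scan: keep the candidates that reach minsup.
            level = sorted(c for c in candidates if sup(c) >= minsup)
            result.update({c: sup(c) for c in level})
            k += 1
        return result

    D = [{"pasta", "lemon", "bread", "orange"}, {"pasta", "lemon"},
         {"pasta", "orange", "cake"}, {"pasta", "lemon", "orange", "cake"}]
    for itemset, s in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), kv[0])):
        print(itemset, "support =", s)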
The Apriori algorithm
Step 1: scan the database to calculate the support of all itemsets of size 1.
e.g.
{pasta} support = 4
{lemon} support = 3
{bread} support = 1
{orange} support = 3
{cake} support = 2
The Apriori algorithm
Step 2: eliminate infrequent itemsets.
e.g.
{pasta} support = 4
{lemon} support = 3
{bread} support = 1   ← eliminated (infrequent)
{orange} support = 3
{cake} support = 2
The Apriori algorithm
Step 2: eliminate infrequent itemsets.
e.g.
{pasta} support = 4
{lemon} support = 3
{orange} support = 3
{cake} support = 2
The Apriori algorithm
Step 3: generate candidates of size 2 by combining pairs of frequent itemsets of size 1.

Frequent items     Candidates of size 2
{pasta}            {pasta, lemon}
{lemon}            {pasta, orange}
{orange}           {pasta, cake}
{cake}             {lemon, orange}
                   {lemon, cake}
                   {orange, cake}
The Apriori algorithm
Step 4: eliminate candidates of size 2 that have an infrequent subset (Property 2). (none!)

Frequent items     Candidates of size 2
{pasta}            {pasta, lemon}
{lemon}            {pasta, orange}
{orange}           {pasta, cake}
{cake}             {lemon, orange}
                   {lemon, cake}
                   {orange, cake}
The Apriori algorithm
Step 5: scan the database to calculate the support of the remaining candidate itemsets of size 2.

Candidates of size 2
{pasta, lemon} support: 3
{pasta, orange} support: 3
{pasta, cake} support: 2
{lemon, orange} support: 2
{lemon, cake} support: 1
{orange, cake} support: 2
The Apriori algorithm
Step 6: eliminate infrequent candidates of size 2.

Candidates of size 2
{pasta, lemon} support: 3
{pasta, orange} support: 3
{pasta, cake} support: 2
{lemon, orange} support: 2
{lemon, cake} support: 1   ← eliminated (infrequent)
{orange, cake} support: 2
The Apriori algorithm
Step 6: eliminate infrequent candidates of size 2.

Frequent itemsets of size 2
{pasta, lemon} support: 3
{pasta, orange} support: 3
{pasta, cake} support: 2
{lemon, orange} support: 2
{orange, cake} support: 2
The Apriori algorithm
Step 7: generate candidates of size 3 by combining pairs of frequent itemsets of size 2.

Frequent itemsets of size 2    Candidates of size 3
{pasta, lemon}                 {pasta, lemon, orange}
{pasta, orange}                {pasta, lemon, cake}
{pasta, cake}                  {pasta, orange, cake}
{lemon, orange}                {lemon, orange, cake}
{orange, cake}
The Apriori algorithm
Step 8: eliminate candidates of size 3 having a subset of size 2 that is infrequent.

Frequent itemsets of size 2    Candidates of size 3
{pasta, lemon}                 {pasta, lemon, orange}
{pasta, orange}                {pasta, lemon, cake}   ← eliminated
{pasta, cake}                  {pasta, orange, cake}
{lemon, orange}                {lemon, orange, cake}  ← eliminated
{orange, cake}

Because {lemon, cake} is infrequent!
The Apriori algorithm
Step 8: eliminate candidates of size 3 having a subset of size 2 that is infrequent.

Frequent itemsets of size 2    Candidates of size 3
{pasta, lemon}                 {pasta, lemon, orange}
{pasta, orange}                {pasta, orange, cake}
{pasta, cake}
{lemon, orange}
{orange, cake}

Because {lemon, cake} is infrequent!
The Apriori algorithm
Step 9: scan the database to calculate the support of the remaining candidates of size 3.

Candidates of size 3
{pasta, lemon, orange} support: 2
{pasta, orange, cake} support: 2
The Apriori algorithm
Step 10: eliminate infrequent candidates. (none!)

Frequent itemsets of size 3
{pasta, lemon, orange} support: 2
{pasta, orange, cake} support: 2
The Apriori algorithm
Step 11: generate candidates of size 4 by combining pairs of frequent itemsets of size 3.

Frequent itemsets of size 3    Candidates of size 4
{pasta, lemon, orange}         {pasta, lemon, orange, cake}
{pasta, orange, cake}
The Apriori algorithm
Step 12: eliminate candidates of size 4 having a subset of size 3 that is infrequent.

Frequent itemsets of size 3    Candidates of size 4
{pasta, lemon, orange}         {pasta, lemon, orange, cake}  ← eliminated
{pasta, orange, cake}

{pasta, lemon, orange, cake} is eliminated because its subset {pasta, lemon, cake} is infrequent.
The Apriori algorithm
Step 13: since there are no more candidates, we cannot generate candidates of size 5, and the algorithm stops.

Result →
Final result
{pasta} support = 4
{lemon} support = 3
{orange} support = 3
{cake} support = 2

{pasta, lemon} support: 3
{pasta, orange} support: 3
{pasta, cake} support: 2
{lemon, orange} support: 2
{orange, cake} support: 2

{pasta, lemon, orange} support: 2
{pasta, orange, cake} support: 2
Technical details
Combining different itemsets can generate the same candidate.
Example:
{A, B} and {A, E} → {A, B, E}
{B, E} and {A, E} → {A, B, E}
Problem: some candidates are generated several times!
Technical details
Combining different itemsets can generate the same candidate.
Example:
{A, B} and {A, E} → {A, B, E}
{B, E} and {A, E} → {A, B, E}
Solution:
• Sort the items in each itemset (e.g. by alphabetical order).
• Combine two itemsets only if all their items except the last one are the same.
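A minimal sketch of this join rule (illustrative; itemsets are kept as sorted tuples so each candidate is generated exactly once):

    from itertools import combinations

    def generate_candidates(frequent_k):
        """Join two sorted k-itemsets only if they share their first k-1 items."""
        candidates = []
        for a, b in combinations(sorted(frequent_k), 2):
            if a[:-1] == b[:-1]:            # all items except the last are the same
                candidates.append(a + (b[-1],))
        return candidates

    # {A,B} joins with {A,E} -> {A,B,E}; {B,E} has a different prefix, so no duplicate.
    print(generate_candidates([("A", "B"), ("A", "E"), ("B", "E")]))
    # [('A', 'B', 'E')]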
Apriori vs. the naïve algorithm
The Apriori property can considerably reduce the number of itemsets to be considered.
In the previous example:
◦ Naïve approach: 2^5 − 1 = 31 itemsets are considered.
◦ By using the Apriori property: 18 itemsets are considered.
PERFORMANCE COMPARISON
How to evaluate this type of algorithm?
Execution time,
Memory used,
Scalability: how the performance is influenced by the number of transactions,
Performance on different types of data:
◦ real data,
◦ synthetic (artificial) data,
◦ dense vs. sparse data, …
…
Performance (execution time)
(charts comparing the execution times of several FIM algorithms; not reproduced here)
Performance of Apriori
The performance of Apriori depends on several factors:
the minsup parameter: the lower it is set, the larger the search space and the number of itemsets will be,
the number of items,
the number of transactions,
the average transaction length.
Problems of Apriori
It can generate numerous candidates.
It requires scanning the database numerous times.
Candidates may not exist in the database.
…
A FEW OPTIMIZATIONS FOR THE APRIORI ALGORITHM
This is an advanced topic.
Optimization 1
In terms of data structure: store all items as integers,
e.g. 1 = pasta, 2 = orange, 3 = bread, …
Why?
◦ it is faster to compare two integers than to compare two character strings,
◦ it requires less memory.
Optimization 2
To reduce the time required to calculate the support of itemsets, sort the transactions by ascending length:

Transaction   Items appearing in the transaction
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}
T1            {pasta, lemon, bread, orange}

To calculate the support of an itemset of size k, only the transactions of size >= k are used.
Optimization 3
To reduce the time required to calculate the support of itemsets, replace all identical transactions by a single transaction with a weight.
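For instance, a minimal sketch of this idea (written for this text): identical transactions are merged with collections.Counter and their multiplicity is used as a weight when computing support.

    from collections import Counter

    # Merge identical transactions; the count acts as a weight.
    transactions = [("pasta", "lemon"), ("pasta", "lemon"), ("pasta", "orange", "cake")]
    weighted = Counter(transactions)   # {("pasta","lemon"): 2, ("pasta","orange","cake"): 1}

    def support(itemset, weighted_db):
        """Sum the weights of the distinct transactions that contain `itemset`."""
        return sum(w for t, w in weighted_db.items() if set(itemset) <= set(t))

    print(support({"pasta", "lemon"}, weighted))  # prints 2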
Optimization 4
To reduce the time required to calculate the support of itemsets:
Sort the items in each transaction according to a total order (e.g. alphabetical order).
Use binary search to quickly check if an item appears in a transaction.
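A minimal sketch with Python's bisect module (illustrative): each transaction is kept as a sorted list, so a membership test costs O(log n) comparisons.

    from bisect import bisect_left

    def contains(sorted_transaction, item):
        """Binary search for `item` in a transaction whose items are sorted."""
        i = bisect_left(sorted_transaction, item)
        return i < len(sorted_transaction) and sorted_transaction[i] == item

    t = sorted(["pasta", "lemon", "bread", "orange"])  # ['bread', 'lemon', 'orange', 'pasta']
    print(contains(t, "lemon"))  # True
    print(contains(t, "cake"))   # False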
Optimization 5
Store the candidates in a hash tree.
To calculate the support of candidates:
◦ calculate a hash value based on a transaction to determine which candidates may be contained in the transaction.
Other optimizations
Sampling and partitioning.
AprioriTID: a variation
AprioriTID:
Annotate each itemset with the IDs of the transactions that contain it.
Use the intersection (∩) to calculate the support of itemsets instead of reading the database.
Example →
Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}

Item      Transactions containing the item
pasta     T1, T2, T3, T4
lemon     T1, T2, T4
bread     T1
orange    T1, T3, T4
cake      T3, T4
Item      Transactions containing the item
pasta     T1, T2, T3, T4
lemon     T1, T2, T4
bread     T1
orange    T1, T3, T4
cake      T3, T4

Example: calculating the support of {pasta, lemon}:
transactions({pasta}) ∩ transactions({lemon})
= {T1, T2, T3, T4} ∩ {T1, T2, T4}
= {T1, T2, T4}
Thus, sup({pasta, lemon}) = 3.
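A minimal sketch of these TID lists using Python sets (illustrative):

    # TID sets for the example database.
    tids = {
        "pasta":  {1, 2, 3, 4},
        "lemon":  {1, 2, 4},
        "orange": {1, 3, 4},
    }

    # The TID set of an itemset is the intersection of its items' TID sets;
    # the support is the size of that intersection.
    tids_pasta_lemon = tids["pasta"] & tids["lemon"]
    print(tids_pasta_lemon, len(tids_pasta_lemon))  # {1, 2, 4} 3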
AprioriTID_bitset
AprioriTID_bitset:
Same idea, except that bit vectors are used instead of lists of IDs.
This allows calculating the intersection using a logical AND, which is often very fast.
Example →
Transaction   Items appearing in the transaction
T1            {pasta, lemon, bread, orange}
T2            {pasta, lemon}
T3            {pasta, orange, cake}
T4            {pasta, lemon, orange, cake}

Item      Bit vector of transactions containing the item
pasta     1111 (representing T1, T2, T3, T4)
lemon     1101
bread     1000
orange    1011
cake      0011
Item      Bit vector
pasta     1111
lemon     1101
bread     1000
orange    1011
cake      0011

Example: calculating the support of {pasta, lemon}:
transactions({pasta}) LOGICAL_AND transactions({lemon})
= 1111 AND 1101
= 1101
The support is the number of 1s: sup({pasta, lemon}) = 3.
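A minimal sketch using Python integers as bit vectors (illustrative): the logical AND intersects the TID sets, and counting the 1 bits gives the support.

    pasta = 0b1111  # T1 T2 T3 T4
    lemon = 0b1101  # T1 T2 .  T4

    tidset = pasta & lemon            # logical AND of the two bit vectors
    support = bin(tidset).count("1")  # number of set bits
    print(bin(tidset), support)       # 0b1101 3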
Conclusion
This video has presented:
the problem of frequent itemset mining,
the Apriori algorithm,
some optimizations.
References
Han, J. and Kamber, M. (2011). Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann Publishers.
Tan, P.-N., Steinbach, M. and Kumar, V. (2006). Introduction to Data Mining. Pearson Education. ISBN-10: 0321321367.
…
