The Apriori Algorithm-A Tutorial
Markus Hegland
CMA, Australian National University
John Dedman Building, Canberra ACT 0200, Australia
E-mail: Markus.Hegland@anu.edu.au
Association rules are "if-then rules" with two measures which quantify the support and confidence of the rule for a given data set. Having their origin in market basket analysis, association rules are now one of the most popular tools in data mining. This popularity is due in large part to the availability of efficient algorithms. The first and arguably most influential algorithm for efficient association rule discovery is Apriori. In the following we will review basic concepts of association rule discovery including support, confidence, the apriori property, constraints and parallel algorithms. The core consists of a review of the most important algorithms for association rule discovery. Some familiarity with concepts like predicates, probability, expectation and random variables is assumed.
1. Introduction
Large amounts of data have been collected routinely in the course of day-
to-day management in business, administration, banking, the delivery of
social and health services, environmental protection, security and in pol-
itics. Such data is primarily used for accounting and for management of
the customer base. Typically, management data sets are very large and
constantly growing and contain a large number of complex features. While
these data sets reflect properties of the managed subjects and relations, and
are thus potentially of some use to their owner, they often have relatively
low information density. One requires robust, simple and computationally
efficient tools to extract information from such data sets. The development
and understanding of such tools is the core business of data mining. These
tools are based on ideas from computer science, mathematics and statistics.
These rules are very simple as is typical for association rule mining. Sim-
ple rules are understandable and ultimately useful. In a large retail shop
there are usually more than 10,000 items on sale and the shop may service
thousands of customers every day. Thus the size of the collected data is
substantial and even the detection of simple rules like the ones above re-
quires sophisticated algorithms. The efficiency of the algorithms will depend
on the particular characteristics of the data sets. An important feature of
many retail data sets is that an average market basket only contains a small
subset of all items available.
The simplest model for the customers assumes that the customers choose
products from the shelves in the shop at random. In this case the choice of each product is independent of any other product. Consequently, association rule discovery will simply recover the likelihoods for any item to be
chosen. While it is important to compare the performance of other models
with this “null-hypothesis” one would usually find that shoppers do have
a more complex approach when they fill the shopping basket (or trolley).
They will buy breakfast items, lunch items, dinner items and snacks, party
drinks, and Sunday dinners. They will have preferences for cheap items, for
(particular) brand items, for high-quality, for freshness, low-fat, special diet
and environmentally safe items. Such goals and preferences of the shopper
will influence the choices but can not be directly observed. In some sense,
market basket analysis should provide information about how the shop-
pers choose. In order to understand this a bit further consider the case of
politicians who vote according to party policy but where we will assume
for the moment that the party membership is unknown. Is it possible to
see an effect of the party membership in voting data? For a small but real
illustrative example consider the US Congress voting records from 1984 [?],
see figure 2. The 16 columns of the displayed bit matrix correspond to the
16 votes and the 435 rows to the members of congress. We have simplified
the data slightly so that a matrix element is one (pixel set) for votes recorded as "voted for", "paired for" or "announced for", and the matrix element is zero in all other cases. The left data matrix in figure 2
is the original data where only the rows and columns have been randomly
permuted to remove any information introduced through the way the data
was collected. The matrix on the right side is purely random such that each
entry is independent and only the total number of entries is maintained.
Can you see the difference between the two bit matrices? We found that for
most people, the difference between the two matrices is not obvious from
visual inspection alone.
Data mining aims to discover patterns in the left bit matrix and thus
differences between the two examples. In particular, we will find columns or
items which display similar voting patterns and we aim to discover rules re-
lating to the items which hold for a large proportion of members of congress.
We will see how many of these rules can be explained by underlying mech-
anisms (in this case party membership).
In this example the selection of what are rows and what columns is
items have received between 34 and 63.5 percent yes votes. Pairs of items
have received between 4 and 49 percent yes votes. The pairs with the most
yes votes (over 45 percent) are in the columns 2/6, 4/6, 13/15, 13/16 and
15/16. Some rules obtained for these pairs are: 92 percent of the yes votes
in column 2 are also yes votes in column 6, 86 percent of the yes votes in
column 4 are also yes votes in column 6 and, on the other hand, 88 percent of the yes votes in column 13 are also yes votes in column 15 and 89 percent of the yes
votes in column 16 are also yes votes in column 15. These figures suggest
combinations of items which could be further investigated in terms of causal
relationships between the items. Only a careful statistical analysis may
provide some certainty on this. This and other issues concerning inference
belong to statistics and are beyond the scope of this tutorial which focusses
on computational issues.
The data is a sequence of itemsets which is represented as a bit matrix where each row corresponds to an itemset and the columns correspond to the items. For the micromarket example a dataset containing the market baskets {juice, bread, milk}, {potatoes} and {bread, potatoes} would be represented by the matrix
$$\begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 \end{pmatrix}.$$
In the congressional voting example mentioned in the previous section the first few rows of the matrix are
1 0 1 1 1 1 0 1 0 1 0 0 1 1 0 0
1 1 1 0 0 1 0 1 1 1 1 0 1 1 0 0
0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 0
0 1 0 1 0 1 0 1 1 0 1 1 0 0 1 0
1 0 1 0 0 0 1 0 1 1 1 0 1 0 0 1
or, when each row is represented by the indices of its nonzero elements,
1 3 4 5 6 8 10 13 14
1 2 3 6 8 9 10 11 13 14
4 5 6 8 9 15
2 4 6 8 9 11 12 15
1 3 7 9 10 11 13 16.
It is assumed that the data matrix $X \in \{0,1\}^{n,d}$ is random and thus the elements $x^{(i)}_j$ are binary random variables. One would in general have to assume correlations between both rows and columns. The correlations between the columns might relate to the type of shopping and customer, e.g., young family with small kids, weekend shopping or shopping for a specific dish. Correlations between the rows might relate to special offers of the retailer, time of the day and week. In the following it will be assumed that the rows are drawn independently from a population of market baskets. Thus it is assumed that there is a probability distribution function $p : X \to [0,1]$ with
$$\sum_{x \in X} p(x) = 1$$
For given supports $s(x)$, this is a linear system of equations which can be solved recursively using $s(e) = p(e)$ (where $e = (1,\ldots,1)$ is the maximal itemset) and
$$p(x) = s(x) - \sum_{z > x} p(z).$$
where $|x|$ is the number of bits set in $x$ and $d$ is the total number of items available, i.e., the number of components of $x$. As any $z \ge x$ has at least all the bits set which are set in $x$ one gets for the support
$$s(x) = p_0^{|x|}.$$
It follows that the frequent itemsets $x$ are exactly the ones with few items. In particular, one may choose
$$\sigma_0 = \min_{|x|=1} s(x).$$
Note that in the random shopper case the rhs is just $p_0$. In this case, all the single items would be frequent.
$$p(x) = \prod_{j=1}^{d} p_j^{x_j}(1-p_j)^{1-x_j}$$
and
$$s(x) = \prod_{j=1}^{d} p_j^{x_j}.$$
In examples one can often find that the $p_j$, when sorted, are approximated by Zipf's law, i.e.,
$$p_j = \frac{\alpha}{j}$$
for some constant $\alpha$. It follows again that itemsets with few popular items are most likely.
However, this type of structure is not really what is of interest in association rule mining. To illustrate this consider again the case of the US Congress voting data. In figure ?? the supports of single itemsets are displayed for the case of the actual data matrix and for a random permutation of all the matrix elements. The supports are between 0.32 and 0.62 for the original data whereas for the randomly permuted case the supports are between 0.46 and 0.56. Note that these supports are computed from the data; in theory, the permuted case should have constant supports somewhere slightly below 0.5. More interesting than the variation of the supports of single items is the case when 2 items are considered. The supports displayed in figure ?? are for itemsets of the form "V2 and Vx". Note that "V2 and V2" is included for reference, even though this itemset has only one item. In the random data case the support for any pair {V2, Vx} (where Vx is not V2) is the square of the support for the single item V2, as predicted by the "random shopper theory" above. One notes that some pairs have significantly higher supports than the random ones and others significantly lower supports. This type of behaviour is not captured by the "random shopper" model above even if variable supports for single items are allowed. The following example attempts to model some of this behaviour.
Fig. 3. Supports for all the votes in the US congress data (split by party). (Vertical axis: proportion yes votes.)

Fig. 4. Supports for pairs of votes in the US congress data (split by party). (Horizontal axis: vote number of second vote in pair; vertical axis: proportion yes votes.)
distribution is
$$p(x) = \pi_0\, p_{00}^{|x_0|} p_{01}^{|x_1|} (1-p_{00})^{d_0-|x_0|}(1-p_{01})^{d_1-|x_1|} + \pi_1\, p_{10}^{|x_0|} p_{11}^{|x_1|} (1-p_{10})^{d_0-|x_0|}(1-p_{11})^{d_1-|x_1|}.$$
This is a mixture model with two components. Recovery of the parameters of mixture models from the data uses the EM algorithm and is discussed in detail in [?]. Note that $\pi_0 + \pi_1 = 1$.
For frequent itemset mining, however, the support function is considered and similarly to the random shopper case can be shown to be
$$s(x) = \pi_0\, p_{00}^{|x_0|} p_{01}^{|x_1|} + \pi_1\, p_{10}^{|x_0|} p_{11}^{|x_1|}.$$
Assume that a shopper of type $i$ is unlikely to purchase items of the other type, so that $p_{00}$ and $p_{11}$ are much larger than $p_{10}$ and $p_{01}$. In this case the frequent itemsets are going to be small (as before); moreover, one has either $x_0 = 0$ or $x_1 = 0$, thus frequent itemsets will only contain items of one type. Thus in this case frequent itemset mining acts as a filter to retrieve "pure" itemsets.
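A small numerical sketch (with illustrative parameters, not taken from the text) shows this filtering effect: mixed itemsets have much lower support than "pure" itemsets of the same size.

```python
def mixture_support(k0, k1, pi0, p00, p01, p10, p11):
    """Support of an itemset with k0 items of type 0 and k1 items of type 1 under
    the two-component model: s(x) = pi0*p00^k0*p01^k1 + pi1*p10^k0*p11^k1."""
    pi1 = 1.0 - pi0
    return pi0 * p00**k0 * p01**k1 + pi1 * p10**k0 * p11**k1

# shoppers rarely buy items of the other type: p00, p11 >> p01, p10
params = dict(pi0=0.5, p00=0.4, p01=0.01, p10=0.01, p11=0.4)

print(mixture_support(2, 0, **params))   # pure itemset of size 2: about 0.08
print(mixture_support(1, 1, **params))   # mixed itemset of size 2: about 0.004
```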
A simple application to the voting data could consider two types of
politicians. A question to further study would be how closely these two
types correspond with the party lines.
One can now consider generalisations of this case by including more
than two types, combining with different probabilities (Zipf’s law) for the
different items in the same class and even use item types and customer types
which overlap. These generalisations lead to graphical models and Bayesian
nets [?,?]. The “association rule approach” in these cases distinguishes itself
by using support functions, frequent itemsets and in particular, is based on
binary data. A statistical approach to this type of data is “discriminant
analysis” [?].
This function takes values which are either 0 or 1 and $a_x(z) = 1$ iff $x \le z$, as in this case all the components $z_i$ which occur to power 1 in $a_x(z)$ are equal to one.
The average length of itemsets is the expectation of $f$ and one can see that
$$E(|x|) = \sum_{|z|=1} s(z).$$
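This identity is just linearity of the expectation: writing $e_j$ for the $j$-th single-item itemset,
$$E(|x|) = E\Big(\sum_{j=1}^d x_j\Big) = \sum_{j=1}^d E(x_j) = \sum_{j=1}^d P(e_j \le x) = \sum_{|z|=1} s(z).$$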
Lemma 5: Let $X$ be a finite Boolean lattice. Then, for each $x \in X$ one has
$$x = \bigvee \{z \in A(X) \mid z \le x\}.$$
The set of atoms associated with any element is unique, and the Boolean lattice itself is isomorphic to the powerset of the set of atoms. This is the key structural theorem of Boolean lattices and is the reason why we can talk about sets (itemsets) in general for association rule discovery.
In our case the atoms are the $d$ basis vectors $e_1, \ldots, e_d$ and any element of $X$ can be represented as a set of basis vectors, in particular $x = \sum_{i=1}^{d} \xi_i e_i$ where $\xi_i \in \{0,1\}$. For the proof of the above lemma and further information on lattices and partially ordered sets see [?]. The significance of the lemma lies in the fact that if $X$ is an arbitrary Boolean lattice it is equivalent to the powerset of its atoms (which can be represented by bitvectors), and so one can find association rules on any Boolean lattice, which conceptually generalises the association rule algorithms.
In figure ?? we show the lattice of patterns for a simple market basket
case which is just a power set. The corresponding lattice for the bitvectors
is in figure ??. We represent the lattice using an undirected graph where the
nodes are the elements of X and edges are introduced between any element
(Figures ?? and ??: the lattice of itemsets for the market basket example, from {} up to {milk, bread, coffee, juice}, and the corresponding lattice of bitvectors from 0000 up to 1111.)
$$c(a_x \Rightarrow a_y) = F^{\delta}(y \mid x)$$
This is basically "the apriori property for the rules" and allows pruning the tree of possible rules considerably. The theorem is again used as a necessary condition. We start the algorithm by considering $z = z_1 \vee z_2$ with 1-itemsets for $z_2$ and looking at all strong rules. Then, if we consider a 2-itemset for $z_2$, both subsets $y < z_2$ need to be consequents of strong rules in order for $z_2$ to be a candidate consequent. The consequents are constructed taking into account that all their nearest neighbours (their cover in lattice terminology) need to be consequents as well. Due to the interpretability problem one is mostly interested in small consequent itemsets so that this is not really a big consideration. See [?] for efficient algorithms for the direct search for association rules.
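To illustrate this pruning of consequents, here is a minimal sketch (itemsets as Python frozensets, supports in a dictionary; the representation and the toy numbers are assumptions for the example, not taken from the text) which grows the consequents of a single frequent itemset level by level:

```python
from itertools import combinations

def rules_from_itemset(z, support, min_conf):
    """All strong rules a_x => a_y with x = z - y, growing the consequents y level
    by level; a (k+1)-itemset can only be a consequent if all of its k-subsets
    already are consequents of strong rules."""
    consequents = [frozenset([i]) for i in z
                   if support[z] / support[z - frozenset([i])] >= min_conf]
    rules = [(z - y, y) for y in consequents]
    k = 1
    while consequents and k < len(z) - 1:
        k += 1
        candidates = {y1 | y2 for y1 in consequents for y2 in consequents
                      if len(y1 | y2) == k}
        candidates = {y for y in candidates
                      if all(frozenset(s) in consequents for s in combinations(y, k - 1))}
        consequents = [y for y in candidates
                       if support[z] / support[z - y] >= min_conf]
        rules += [(z - y, y) for y in consequents]
    return rules

# toy supports for the itemset {2, 3, 5} and all of its subsets (made-up numbers)
support = {frozenset(s): v for s, v in [
    ((2,), 0.75), ((3,), 0.75), ((5,), 0.75),
    ((2, 3), 0.5), ((2, 5), 0.75), ((3, 5), 0.5), ((2, 3, 5), 0.5)]}
for antecedent, consequent in rules_from_itemset(frozenset({2, 3, 5}), support, 0.6):
    print(set(antecedent), "=>", set(consequent))
```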
one percent of all the items are nonzero. In this case it makes sense to store
the matrix in a sparse format. Here we will consider two ways to store
the matrix, either by rows or by columns. The matrix corresponding to an
earlier example is
$$\begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$
First we discuss the horizontal organisation. A row is represented simply by the indices of its nonzero elements, and the matrix is represented as a tuple of rows. For example, the above matrix is represented as
(1, 2) (1, 3, 4) (1, 5) (1, 2, 4) (5)
In practice we also need pointers which tell us where the row starts if
contiguous locations are used in memory.
Now assume that we have a row $x$ and an itemset $z$ and would like to find out if $x$ supports $a_z$, i.e., if $z \le x$. If both vectors are represented in the sparse format this means that we would like to find out if the indices of $z$ are a subset of the indices of $x$. There are several different ways to do this and we will choose the one which uses an auxiliary bitvector $v \in X$ (in full format) which is initialised to zero. The proposed algorithm has 3 steps:
We assume that the time per nonzero element for all the steps is the same $\tau$ and we get for the time
$$T = (2|x| + |z|)\tau.$$
With the same assumptions as above we get (with $|z^{(j)}| = k$) for the time
$$T = (2|x| + m_k k)\tau.$$
Finally, running this algorithm for all the rows $x^{(i)}$ and vectors $z^{(j)}$ of different lengths, one gets the total time
$$T = \sum_k \Big( 2\sum_{i=1}^{n} |x^{(i)}| + m_k k n \Big)\tau.$$
Note that the sum over $k$ is for $k$ between one and $d$ but only the $k$ for which $m_k > 0$ need to be considered. The complexity has two parts. The first part is proportional to $E(|x|)n$ which corresponds to the number of data points times the average complexity of each data point. This part thus encapsulates the data dependency. The second part is proportional to $m_k k n$ where the factor $m_k k$ refers to the complexity of the search space which has to be visited for each record. For $k = 1$ we have $m_1 = d$ as we need to consider all the components. Thus the second part is larger than $d n \tau$; in fact, we would probably have to consider all the pairs so that it would be larger than $d^2 n \tau$, which is much larger than the first part as $2E(|x|) \le 2d$. Thus the major cost is due to the search through the possible patterns and one typically has a good approximation
$$E(T) \approx \sum_k m_k k n \tau.$$
to the first column $j$ with $z_j = 1$, later extract all the values at the points defined by the nonzero elements for the next column $j'$ for which $z_{j'} = 1$, then zero the original ones in $v$ and finally set the extracted values into $v$. More concisely, we have the algorithm, where $x_j$ stands for the whole column $j$ in the data matrix.
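A sketch of this column-wise procedure (assuming the columns are stored as sets of row indices; the exact bookkeeping of which entries of $v$ are cleared differs slightly from the three-access count used below, so this is an illustration rather than a transcription of the original algorithm):

```python
def column_support(columns, z_items, n):
    """Count the rows supporting the itemset z by intersecting the columns of its
    items, using an auxiliary full-format vector v which starts and ends all zero."""
    v = [0] * n
    current = list(columns[z_items[0]])    # rows with a one in the first column of z
    for i in current:
        v[i] = 1
    for j in z_items[1:]:
        extracted = [i for i in columns[j] if v[i] == 1]  # values of v at the nonzeros of column j
        for i in current:                                 # zero the previously set entries
            v[i] = 0
        for i in extracted:                               # set the extracted entries
            v[i] = 1
        current = extracted
    for i in current:                                     # leave v all zero again
        v[i] = 0
    return len(current)

# the 5x5 example matrix from above, stored column-wise as sets of row indices
columns = {1: {0, 1, 2, 3}, 2: {0, 3}, 3: {1}, 4: {1, 3}, 5: {2, 4}}
print(column_support(columns, [1, 2, 4], n=5))   # 1: only the row (1, 2, 4) supports {1, 2, 4}
```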
So we access v three times for each column, once for the extraction of
elements, once for setting it to zero and once for resetting the elements.
Thus for the determination of the support of $z$ in the data base we have the time complexity of
$$T = 3\tau \sum_{j=1}^{d} \sum_{i=1}^{n} x^{(i)}_j z_j.$$
A more careful analysis shows that this is actually an upper bound for the complexity. Now this is done for $m_k$ arrays $z^{(s,k)}$ of size $k$ and for all $k$. Thus we get the total time for the determination of the support of all $a_z$ to be
$$T = 3\tau \sum_{k} \sum_{s=1}^{m_k} \sum_{j=1}^{d} \sum_{i=1}^{n} x^{(i)}_j z^{(s,k)}_j.$$
We can get a simple upper bound for this using $x^{(i)}_j \le 1$ as
$$T \le 3\sum_k m_k k n \tau$$
because $\sum_{j=1}^{d} z^{(s,k)}_j = k$. This is roughly 3 times what we got for the previous algorithm. However, the $x^{(i)}_j$ are random with an expectation $E(x^{(i)}_j)$ which is typically much less than one and have an average expected length
of $E(|x|)/d$. If we introduce this into the equation for $T$ we get the approximation
$$E(T) \approx \frac{3E(|x|)}{d} \sum_k m_k k n \tau$$
which can be substantially smaller than the time for the previous algorithm.
Finally, we should point out that there are many other possible algorithms and other possible data formats. Practical experience and more careful analysis show that one method may be more suitable for one data set whereas the other is better for another data set. Thus one has to carefully consider the specifics of a data set. Another consideration is also the size
k and number mk of the z considered. It is clear from the above that it is
essential to carefully choose the “candidates” az for which the support will
be determined. This will further be discussed in the next sections. There is
one term which occurred in both algorithms above and which characterises the complexity of the search through multiple levels of $a_z$; it is
$$C = \sum_{k=1}^{\infty} m_k k.$$
We will use this constant later in the discussion of the efficiency of the
search procedures.
Fig. 8. Level sets of Boolean lattices (levels $L_0$ to $L_4$).
Algorithm 1 Apriori
$C_1 = A(X)$ is the set of all one-itemsets, $k = 1$
while $C_k \neq \emptyset$ do
  scan database to determine support of all $a_y$ with $y \in C_k$
  extract frequent itemsets from $C_k$ into $L_k$
  generate $C_{k+1}$
  $k := k + 1$
end while
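A compact runnable sketch of Algorithm 1 (transactions as Python frozensets; the join-and-prune candidate generation anticipates the construction described below; this is an illustration of the structure of the algorithm, not an efficient implementation):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return a dictionary {frequent itemset: support}."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]          # C_1: all one-itemsets
    frequent, k = {}, 1
    while candidates:
        # scan the database to determine the support of all candidates in C_k
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)                            # L_k
        # generate C_{k+1}: join itemsets sharing all but the largest item, then
        # prune candidates having an infrequent k-subset (apriori property)
        prev = sorted(sorted(c) for c in level)
        candidates = []
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                cand = frozenset(a) | frozenset(b)
                if all(frozenset(s) in level for s in combinations(cand, k)):
                    candidates.append(cand)
        k += 1
    return frequent

# the small data base used in the Apriori TID example later in the text
data = [frozenset(t) for t in ({1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5})]
print(apriori(data, min_support=0.5))
```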
The upper bound is hopeless for any large size d and we need to get better
bounds. This depends very much on how the candidate itemsets Ck are
chosen. We choose $C_1$ to be the set of all 1-itemsets, and $C_2$ to be the set of all 2-itemsets, so that we get $m_1 = d$ and $m_2 = d(d-1)/2$.
The apriori algorithm determines alternately $C_k$ and $L_k$ such that successively the following chain of sets is generated:
$$C_1 = L_1 \to C_2 \to L_2 \to C_3 \to L_3 \to C_4 \to \cdots$$
How should we now choose the $C_k$? We know that the sequence $L_k$ satisfies the apriori property, which can be reformulated as: every $k$-subitemset of an element of $L_{k+1}$ belongs to $L_k$. In order not to miss any frequent itemsets, $C_{k+1}$ has to be a set which contains $L_{k+1}$. The apriori algorithm chooses the largest set $C_{k+1}$ which satisfies the apriori condition. But is this really necessary? It is if we can find a data set for which the extended sequence is the set of frequent itemsets. This is shown in the next proposition:
Proposition 9: Let $L_1, \ldots, L_m$ be any sequence of sets of $k$-itemsets which satisfies the apriori condition. Then there exists a dataset $D$ and a $\sigma > 0$ such that the $L_k$ are the frequent itemsets for this dataset with minimal support $\sigma$.

Proof: Set $x^{(i)} \in \bigcup_k L_k$, $i = 1, \ldots, n$, to be the sequence of all maximal itemsets, i.e., for any $z \in \bigcup_k L_k$ there is an $x^{(i)}$ such that $z \le x^{(i)}$, and $x^{(i)} \not\le x^{(j)}$ for $i \neq j$. Choose $\sigma = 1/n$. Then the $L_k$ are the sets of frequent itemsets for this data set.
For any collection Lk there might be other data sets as well, the one chosen
above is the minimal one. The sequence of the Ck is now characterised by:
(1) $C_1 = L_1$
(2) If $y \in C_k$ and $z \le y$ then $z \in C_{k'}$ where $k' = |z|$.
In this case we will say that the sequence Ck satisfies the apriori condition.
It turns out that this characterisation is strong enough to get good upper
bounds for the size of mk = |Ck |.
However, before we go any further in the study of bounds for |Ck | we
provide a construction of a sequence Ck which satisfies the apriori condi-
tion. A first method uses Lk to construct Ck+1 which it chooses to be the
maximal set such that the sequence L1 , . . . , Lk , Ck+1 satisfies the apriori
property. One can see by induction that then the sequence C1 , . . . , Ck+1
will also satisfy the apriori property. A more general approach constructs
Ck+1 , . . . , Ck+p such that L1 , . . . , Lk ,Ck+1 , . . . , Ck+p satisfies the apriori
property. As p increases the granularity gets larger and this method may
work well for larger itemsets. However, choosing larger p also amounts to
larger Ck and thus some overhead. We will only discuss the case of p = 1
here.
The generation of Ck+1 is done in two steps. First a slightly larger set is
constructed and then all the elements which break the apriori property are
removed. For the first step the join operation is used. To explain join let the
elements of L1 (the atoms) be enumerated as e1 , . . . , ed . Any itemset can
then be constructed as join of these atoms. We denote a general itemset by
$$e(j_1, \ldots, j_k) = e_{j_1} \vee \cdots \vee e_{j_k}$$
where $j_1 < j_2 < \cdots < j_k$. The join of the set $L_k$ of $k$-itemsets with itself is then defined as
$$L_k \bowtie L_k := \{ e(j_1, \ldots, j_{k+1}) \mid e(j_1, \ldots, j_k) \in L_k,\; e(j_1, \ldots, j_{k-1}, j_{k+1}) \in L_k \}.$$
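A sketch of the join and the subsequent pruning on the bitvector representation (itemsets as integers whose set bits are the items; names and representation are illustrative; the pruning step checks that every $k$-subset of a candidate lies in $L_k$, i.e. that its shadow, discussed next, is contained in $L_k$):

```python
def bits(z):
    """Positions of the set bits of the bitvector z (an integer)."""
    return [j for j in range(z.bit_length()) if z >> j & 1]

def join(Lk):
    """L_k |><| L_k: combine two k-itemsets which agree in all but their largest item."""
    out = set()
    for y in Lk:
        for z in Lk:
            ty, tz = max(bits(y)), max(bits(z))
            if ty < tz and y ^ (1 << ty) == z ^ (1 << tz):
                out.add(y | z)
    return out

def prune(C, Lk):
    """Remove candidates whose shadow (all subsets with one bit removed) is not in L_k."""
    return {z for z in C if all(z ^ (1 << j) in Lk for j in bits(z))}

# L_2 = {{1,2}, {1,3}, {2,3}, {2,4}} with bit j-1 standing for item j
L2 = {0b0011, 0b0101, 0b0110, 0b1010}
C3 = prune(join(L2), L2)
print([bin(z) for z in C3])     # ['0b111']: only {1, 2, 3} survives the pruning
```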
The shadow $\partial C_k$ of a set $C_k$ of $k$-itemsets is the set of bitvectors which have $k-1$ bits set at places where some $z \in C_k$ has them set. The shadow $\partial C_k$ can be smaller or larger than $C_k$. In general, one has for the size $|\partial C_k| \ge k$ independent of the size of $C_k$. So, for example, if $k = 20$ then $|\partial C_k| \ge 20$ even if $|C_k| = 1$. (In this case we actually have $|\partial C_k| = 20$.) For example, we have $\partial C_1 = \emptyset$, and $|\partial C_2| \le d$.
It follows now that the sequence of sets of itemsets $C_k$ satisfies the apriori condition iff
$$\partial(C_k) \subset C_{k-1}.$$
which is the colex (or colexicographic) order. In this order the itemset $\{3,5,6,9\} \prec \{3,4,7,9\}$ as the largest items determine the order. (In the lexicographic ordering the order of these two sets would be reversed.)
Let $[m] := \{0, \ldots, m-1\}$ and let $[m]^{(k)}$ be the set of all $k$-itemsets where all $k$ bits are set within the first $m$ positions. In the colex order any $z$ in which bit $m$ (or a later bit) is set is larger than any of the elements of $[m]^{(k)}$. Thus $[m]^{(k)}$ is just the set of the first $\binom{m}{k}$ bitvectors with $k$ bits set.
We will now construct the sequence of the first $m$ bitvectors with $k$ bits set, for any $m$. This corresponds to the first integers which, in their binary representation, have $k$ ones. Consider, for example, the case of $d = 5$ and $k = 2$. For this case all the bitvectors are listed in table ??. (Printed with the lowest significant bit on the right hand side for legibility.)
(0,0,0,1,1) 3
(0,0,1,0,1) 5
(0,0,1,1,0) 6
(0,1,0,0,1) 9
(0,1,0,1,0) 10
(0,1,1,0,0) 12
(1,0,0,0,1) 17
(1,0,0,1,0) 18
(1,0,1,0,0) 20
(1,1,0,0,0) 24
C ∨ y := {z ∨ y | z ∈ C}.
As only term $j$ does not contain itemsets with item $m_j$ (all the others do) the terms are pairwise disjoint and so the union contains
$$|B^{(k)}(m_k, \ldots, m_s)| = b^{(k)}(m_k, \ldots, m_s) := \sum_{j=s}^{k} \binom{m_j}{j}$$
$k$-itemsets. This set contains the first (in colex order) bitvectors with $k$ bits set. By splitting off the last term in the union one then sees
$$B^{(k)}(m_k, \ldots, m_s) = \big(B^{(k-1)}(m_{k-1}, \ldots, m_s) \vee e_{m_k}\big) \cup [m_k]^{(k)} \qquad (3)$$
and consequently
Lemma 10: For every $m, k \in \mathbb{N}$ there are numbers $m_s < \cdots < m_k$ such that
$$m = \sum_{j=s}^{k} \binom{m_j}{j} \qquad (5)$$
Let $\mathbb{N}^{(k)}$ be the set of all $k$-itemsets of integers. It turns out that the $B^{(k)}$ occur as natural subsets of $\mathbb{N}^{(k)}$: the shadow of the first $k$-itemsets $B^{(k)}(m_k, \ldots, m_s)$ consists of the first $(k-1)$-itemsets, or more precisely:

Lemma 12:
$$\partial B^{(k)}(m_k, \ldots, m_s) = B^{(k-1)}(m_k, \ldots, m_s).$$

Proof: First we observe that in the case of $s = k$ the shadow is simply the set of all $(k-1)$-itemsets with bits in the first $m_k$ positions:
$$\partial [m_k]^{(k)} = [m_k]^{(k-1)}.$$
This can be used as the anchor for the induction over $k - s$. As was shown earlier, one has in general
$$B^{(k)}(m_k, \ldots, m_s) = [m_k]^{(k)} \cup \big(B^{(k-1)}(m_{k-1}, \ldots, m_s) \vee e_{m_k}\big).$$
The shadow is important for the apriori property and we would thus
like to determine the shadow, or at least its size for more arbitrary k-
itemsets as they occur in the apriori algorithm. Getting bounds is feasible
but one requires special technology to do this. This is going to be developed
further in the sequel. We would like to reduce the case of general sets of
k-itemsets to the case of the previous lemma, where we know the shadow.
So we would like to find a mapping which maps the set of k-itemsets to the
first k itemsets in colex order without changing the size of the shadow. We
will see that this can almost be done in the following. The way to move the
itemsets to earlier ones (or to “compress” them) is done by moving later
bits to earlier positions.
So we try to get the $k$-itemsets close to $B^{(k)}(m_k, \ldots, m_s)$ in some sense, so that the size of the shadow can be estimated. In order to simplify notation we will write $z + e_j$ for $z \vee e_j$ when $e_j \not\le z$ and denote the reverse operation (removing the $j$-th bit) by $z - e_j$ when $e_j \le z$. Now we introduce compression of a bitvector as
$$R_{ij}(z) = \begin{cases} z - e_j + e_i & \text{if } e_i \not\le z \text{ and } e_j \le z \\ z & \text{otherwise.} \end{cases}$$
Thus we simply move the bit in position j to position i if there is a bit in
position j and position i is empty. If not, then we don’t do anything. So we
did not change the number of bits set. Also, if i < j then we move the bit
to an earlier position so that $R_{ij}(z) \le z$. For our earlier example, when we number the bits from the right, starting with 0, we get $R_{1,3}((0,1,1,0,0)) = (0,0,1,1,0)$ and $R_{1,3}((0,0,0,1,1)) = (0,0,0,1,1)$. This is a "compression" as it moves a collection of $k$-itemsets closer together and closer to the vector $z = 0$ in terms of the colex order.
The mapping $R_{ij}$ is not injective as
$$R_{ij}(z) = R_{ij}(y)$$
when $y = R_{ij}(z)$, and this is the only case. Now consider for any set $C$ of bitvectors the set $R_{ij}^{-1}(C) \cap C$. These are those elements of $C$ which stay in $C$ when compressed by $R_{ij}$. The compression operator for sets of bitvectors is now defined as
$$\tilde{R}_{i,j}(C) = R_{ij}(C) \cup \big(C \cap R_{ij}^{-1}(C)\big).$$
Thus the points which stay in C under Rij are retained and the points
which are mapped outside C are added. Note that by this we have avoided
the problem with the non-injectivity as only points which stay in C can
be mapped onto each other. The size of the compressed set is thus the
same. However, the elements in the first part have been mapped to earlier
elements in the colex order. In our earlier example, for $i, j = 1, 3$ and
$$C = \{(0,0,0,1,1),\ (0,1,1,0,0),\ (1,1,0,0,0),\ (0,1,0,1,0)\}$$
we get
$$\tilde{R}_{1,3}(C) = \{(0,0,0,1,1),\ (0,0,1,1,0),\ (1,0,0,1,0),\ (0,1,0,1,0)\}.$$
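A sketch of the two compression operators on integer bitvectors (bit $j$ of the integer is component $j$, counted from the right starting at 0 as above; the set operator is written in the equivalent "replace an element by its image if the image leaves $C$" form; names are illustrative):

```python
def R(i, j, z):
    """R_ij: move the bit at position j to position i if bit j is set and bit i is empty."""
    if (z >> j) & 1 and not (z >> i) & 1:
        return z - (1 << j) + (1 << i)
    return z

def compress(i, j, C):
    """R~_ij: keep the elements whose image stays in C, replace the others by their image."""
    return {z if R(i, j, z) in C else R(i, j, z) for z in C}

def as_tuple(z, d=5):
    """Display a bitvector with the lowest significant bit on the right, as in the text."""
    return tuple((z >> k) & 1 for k in reversed(range(d)))

C = {0b00011, 0b01100, 0b11000, 0b01010}
print(sorted(as_tuple(z) for z in compress(1, 3, C)))
```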
The next result shows that in terms of the shadow the "compression" $\tilde{R}_{i,j}$ really is a compression, as the shadow of a compressed set can never be larger than the shadow of the original set. We therefore suggest calling it the compression lemma.
The operator $\tilde{R}_{i,j}$ maps sets of $k$-itemsets onto sets of $k$-itemsets and does not change the number of elements in a set of $k$-itemsets. One now says that a set of $k$-itemsets $C$ is compressed if $\tilde{R}_{i,j}(C) = C$ for all $i < j$. This means that for any $z \in C$ one has again $R_{ij}(z) \in C$. Now we can move on to prove the key theorem:
Proof: First we note that the shadow is a monotone function of the underlying set, i.e., if $A_1 \subset A_2$ then $\partial A_1 \subset \partial A_2$. From this it follows that it is enough to show that the bound holds for $|A| = b^{(k)}(m_k, \ldots, m_s)$.
Furthermore, it is sufficient to show this bound for compressed A as
compression at most reduces the size of the shadow and we are looking for
a lower bound. Thus we will assume A to be compressed in the following.
The proof uses double induction over k and m = |A|. First we show
that the theorem holds for the cases of k = 1 for any m and m = 1 for any
k. In the induction step we show that if the theorem holds for 1, . . . , k − 1
and any m and for 1, . . . , m − 1 and k then it also holds for k and m, see
figure (??).
Fig. 9. Double induction (over the itemset size $k$ and the set size $m$).
This theorem is the tool to derive the bounds for the size of future candidate
itemsets based on a current itemset and the apriori principle.
Theorem 16: Let the sequence $C_k$ satisfy the apriori property and let
$$|C_k| = b^{(k)}(m_k, \ldots, m_r).$$
Then
$$|C_{k+p}| \le b^{(k+p)}(m_k, \ldots, m_r)$$
for all $p \le r$.
Proof: The reason for the condition on $p$ is that the shadows are well defined.
First, we choose $s$ such that $m_r \le r + p - 1$, $m_{r+1} \le r + 1 + p - 1$, ..., $m_{s-1} \le s - 1 + p - 1$ and $m_s \ge s + p - 1$. Note that $s = r$ and $s = k + 1$ may be possible.
Now we get an upper bound for the size $|C_k|$:
$$|C_k| = b^{(k)}(m_k, \ldots, m_r) \le b^{(k)}(m_k, \ldots, m_s) + \sum_{j=1}^{s-1}\binom{j+p-1}{j} = b^{(k)}(m_k, \ldots, m_s) + \binom{s+p-1}{s-1} - 1$$
according to a previous lemma.
If the theorem does not hold then $|C_{k+p}| > b^{(k+p)}(m_k, \ldots, m_r)$ and thus
$$|C_{k+p}| \ge b^{(k+p)}(m_k, \ldots, m_r) + 1 \ge b^{(k+p)}(m_k, \ldots, m_s) + \binom{s+p-1}{s+p-1} = b^{(k+p)}(m_k, \ldots, m_s, s+p-1).$$
Here we can apply the previous theorem to get a lower bound for Ck :
In practice one would know not only the size but also the contents of any
Ck and from that one can get a much better bound than the one provided
by the theory. A consequence of the theorem is that for $L_k$ with $|L_k| \le \binom{m}{k}$ one has $|C_{k+p}| \le \binom{m}{k+p}$. In particular, one has $C_{k+p} = \emptyset$ for $k > m - p$.
4. Extensions
4.1. Apriori Tid
One variant of the apriori algorithm discussed above computes supports
of itemsets by doing intersections of columns. Some of these intersections
are repeated over time and, in particular, entries of the Boolean matrix
are revisited which have no impact on the support. The Apriori TID [?]
algorithm provides a solution to some of these problems. For computing
the supports for larger itemsets it does not revisit the original table but
transforms the table as it goes along. The new columns correspond to the
candidate itemsets. In this way each new candidate itemset only requires
the intersection of two old ones.
The following demonstrates with an example how this works. The exam-
ple is adapted from [?]. In the first row the itemsets from Ck are depicted.
The minimal support is 50 percent or 2 rows. The initial matrix of the tid
algorithm is equal to
$$\begin{array}{ccccc} 1 & 2 & 3 & 4 & 5 \\ \hline 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 \end{array}$$
Note that the column (or item) four is not frequent and is not considered
for Ck . After one step of the Apriori tid one gets the matrix:
$$\begin{array}{cccccc} (1,2) & (1,3) & (1,5) & (2,3) & (2,5) & (3,5) \\ \hline 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{array}$$
Here one can see directly that the itemsets (1, 2) and (1, 5) are not frequent.
It follows that there remains only one candidate itemset with three items, namely $(2, 3, 5)$, and the matrix is
$$\begin{array}{c} (2,3,5) \\ \hline 0 \\ 1 \\ 1 \\ 0 \end{array}$$
Let $z(j_1, \ldots, j_k)$ denote the elements of $C_k$. Then the elements of the transformed Boolean matrix are $a_{z(j_1,\ldots,j_k)}(x^{(i)})$.
We will again use an auxiliary array v ∈ {0, 1}n . The apriori tid algo-
rithm uses the join considered earlier in order to construct a matrix for
the frequent itemsets Lk+1 from Lk . (As in the previous algorithms it is
assumed that all matrices are stored in memory. The case of very large data
sets which do not fit into memory will be discussed later.) The key part of
the algorithm, i.e., the step from k to k + 1 is then:
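A minimal sketch of this step (columns of the transformed matrix stored as Python sets of row indices, set intersection standing in for the bookkeeping with the auxiliary vector $v$; all names are illustrative):

```python
def apriori_tid_step(Lk_columns, n, min_support):
    """From the columns for L_k (dict: frozenset -> set of row indices) build the
    columns for the frequent (k+1)-itemsets.  Two itemsets are joined when they
    agree in all but their largest item; each new column is the intersection of
    the two old columns, so the original data matrix is not touched again."""
    new_columns = {}
    itemsets = sorted(Lk_columns, key=sorted)
    for idx, y in enumerate(itemsets):
        for z in itemsets[idx + 1:]:
            if sorted(y)[:-1] == sorted(z)[:-1]:          # join condition
                rows = Lk_columns[y] & Lk_columns[z]      # intersect the two old columns
                if len(rows) / n >= min_support:
                    new_columns[y | z] = rows
    return new_columns

# columns for L_2 in the example above: frequent pairs -> rows (0-based) containing them
L2 = {frozenset(p): set(r) for p, r in [((1, 3), {0, 2}), ((2, 3), {1, 2}),
                                        ((2, 5), {1, 2, 3}), ((3, 5), {1, 2})]}
print(apriori_tid_step(L2, n=4, min_support=0.5))   # {frozenset({2, 3, 5}): {1, 2}}
```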
There are three major steps where the auxiliary vector v is accessed. The
This has to be done for all elements $y \vee z$ where $y, z \in L_k$. Thus the average complexity is
$$E(T) = \sum_k 3 n m_k E(x^y)\tau$$
for some "average" $y$, where $x^y = \bigwedge_i x_i^{y_i}$. Now for all elements of $L_k$ the support is larger than $\sigma$, thus $E(x^y) \ge \sigma$. So we get a lower bound for the complexity:
$$E(T) \ge \sum_k 3 n m_k \sigma \tau.$$
From this we would expect that for some $r_k \in [1, k]$ we get the approximation
$$E(T) \approx \sum_k 3 n m_k \big(E(|x|)/d\big)^{r_k} \tau$$
and so the "speedup" we can achieve by using this new algorithm is around
$$S \approx \frac{\sum_k k m_k}{\sum_k (E(|x|)/d)^{r_k - 1} m_k}.$$
of the lower support of the $k$-itemsets for larger $k$. While this second effect does strongly depend on $r_k$ and thus on the data, the first effect always holds, so we get a speedup of at least
$$S \ge \frac{\sum_k k m_k}{\sum_k m_k},$$
i.e., the average size of the $k$-itemsets. Note that the role of the number $m_k$ of candidate itemsets may be slightly diminished but this is still the core parameter which determines the complexity of the algorithm, and the need to reduce the size of the frequent itemsets is not diminished.
These constraints will reduce the amount of frequent itemsets which need
to be further processed, but can they also assist in making the algorithms
more efficient? This will be discussed next after we have considered some
examples. Note that the constraints are not necessarily simple conjunctions!
Examples:
• We have mentioned the rule that any frequent itemset should not con-
tain an item and its generalisation, e.g., it should not contain both soft
drinks and ginger beer as this is identical to ginger beer. The constraint
is of the form $b(x) = \neg a_y(x)$ where $y$ is the itemset in which the "soft drink and ginger beer" bits are set.
• In some cases, frequent itemsets have been well established earlier. An example is crisps and soft drinks. There is no need to rediscover this association. Here the constraint is of the form $b(x) = \neg\delta_y(x)$ where $y$ denotes the itemset "soft drinks and chips".
• In some cases, the domain knowledge tells us that some itemsets are pre-
scribed, like in the case of a medical schedule which prescribes certain procedures to be done jointly while others should not be done jointly. Finding
these rules is not interesting. Here the constraint would exclude certain
z, i.e., b(z) = ¬δy (z) where y is the element to exclude.
• In some cases, the itemsets are related by definition. For example the predicate defined by $|z| > 2$ is a consequence of $|z| > 4$. Having discovered the second one relieves us of the need to discover the first one. This,
however, is a different type of constraint which needs to be considered
when defining the search space.
When the apriori condition holds one can generate the candidate item-
sets Ck in the (constrained) apriori algorithm from the sets L∗k instead of
from the larger Lk . However, the constraints need to be anti-monotone. We
know that constraints of the form $a_{z^{(j)}}$ are monotone and thus constraints of the form $b_j = \neg a_{z^{(j)}}$ are antimonotone. Such constraints say that a certain combination of items should not occur in the itemset. An example of
this is the case of ginger beer and soft drinks. Thus we will have simpler
frequent itemsets in general if we apply such a rule. Note that itemsets have
played three different roles so far:
(1) as data points x(i)
(2) as potentially frequent itemsets z and
(3) to define constraints $\neg a_{z^{(j)}}$.
The constraints of the kind $b_j = \neg a_{z^{(j)}}$ are now used to reduce the candidate itemsets $C_k$ prior to the data scan (this is how we save most).
Even better, it turns out that the conditions only need to be checked at level $k = |z^{(j)}|$, where $k$ is the size of the itemset defining the constraint. (This gives a minor saving.) This is summarised in the next theorem:
Proof: We need to show that every element $y \in \tilde{C}_k$ satisfies the constraints $b_j(y) = 1$. Remember that $|y| = k$. There are three cases:
From this it follows that $\tilde{C}_k \subset C_k^*$. The converse is a direct consequence of the definition of the sets.
Thus we get a variant of the apriori algorithm which checks the constraints
only for one level, and moreover, this is done to reduce the number of
candidate itemsets. This is Algorithm ??.
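A sketch of this one-level check (forbidden itemsets and candidates as frozensets; names are illustrative):

```python
def filter_candidates(Ck, k, forbidden):
    """Apply constraints of the form "the itemset must not contain z_j": at level
    k = |z_j| the only candidates violating such a constraint are the forbidden
    itemsets themselves; their supersets are never generated afterwards because
    candidate generation only extends frequent itemsets."""
    banned = {z for z in forbidden if len(z) == k}
    return [c for c in Ck if c not in banned]

forbidden = [frozenset({"ginger beer", "soft drink"})]
C2 = [frozenset({"ginger beer", "soft drink"}), frozenset({"chips", "soft drink"})]
print(filter_candidates(C2, 2, forbidden))    # only {chips, soft drink} remains
```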
redundantly or on one master processor and the result can then be com-
municated. The parallel algorithm also leads to an out-of-core algorithm
which does the counting of the supports in blocks. One can equally develop
an apriori-tid variant as well.
There is a disadvantage of this straight-forward approach, however. It
does require many synchronisation points, respectively, many scans of the
disk, one for each level. As the disks are slow and synchronisation expensive
this will cost some time. We will now discuss an algorithm suggested by [?] which substantially reduces the number of disk scans or synchronisation points at the cost of some redundant computations. First we observe that
$$\min_k \hat{s}_k(a) \le \hat{s}(a) \le \max_k \hat{s}_k(a)$$
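This follows because the overall support is a convex combination of the partition supports: writing $n_j$ for the number of records in partition $D_j$ and $n = \sum_j n_j$ (notation assumed here),
$$\hat{s}(a) = \sum_j \frac{n_j}{n}\, \hat{s}_j(a), \qquad \sum_j \frac{n_j}{n} = 1,$$
so $\hat{s}(a)$ lies between the smallest and the largest of the $\hat{s}_j(a)$.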
In particular, every itemset which is frequent in the full data set has to be frequent in at least one of the partitions.
Proof: If this would not hold for some frequent $a$, then one would get
$$\max_j \hat{s}_j(a) < \sigma_0$$
where $\sigma_0$ is the threshold for frequent itemsets. By the observation above $\hat{s}(a) < \sigma_0$, which contradicts the assumption that $a$ is frequent.
Using this one gets an algorithm which generates in a first step the frequent $k$-itemsets $L_{k,j}$ for each $D_j$ and each $k$. This requires one scan of the data, or can be done on one processor per partition, respectively. The union of all these frequent itemsets is then used as the set of candidate itemsets and the supports of all these candidates are found in a second scan of the data. The parallel variant of the algorithm is then Algorithm ??. Note that the supports for all the levels
Proposition 21: The sequence $C_k^p$ satisfies the apriori property, i.e.,
$$z \in C_k^p \ \text{and}\ y \le z \ \Rightarrow\ y \in C_{|y|}^p.$$
We can use any algorithm to determine the frequent itemsets on one partition, and, if we assume that the algorithm is scalable in the data size, the time to determine the frequent itemsets on all processors is equal to $1/p$ of the time required to determine the frequent itemsets on one processor, as each processor holds $1/p$ of the data. In addition we require two reads of the data base, which has an expectation of $n\lambda\tau_{\rm Disk}/p$, where $\lambda$ is the average size of the market baskets and $\tau_{\rm Disk}$ is the time for one disk access. There is also some time required for the communication which is proportional to the size of the frequent itemsets. We will leave the further analysis, which follows the same lines as our earlier analysis, to the reader at this stage.
As the partition is random, one can actually get away with determining the supports for only a small subset of $C_k^p$, as we only need to determine the support of those $a_z$ for which the supports have not been determined in the first scan. One may also wish to choose the minimal support $\sigma$ for the first scan slightly lower in order to further reduce the amount of work in the second scan.
which holds for $(x_1, \ldots, x_m)$ and $(y_1, \ldots, y_k)$ whenever there is a sequence $1 \le i_1 < i_2 < \cdots < i_m \le k$ such that
$$x_s \le y_{i_s}, \quad s = 1, \ldots, m.$$
One can now verify that this defines a partial order on the set of sequences introduced above. However, the set of sequences does not form a lattice as there are not necessarily unique least upper or greatest lower bounds. For example, the two sequences $((0,1),(1,1),(1,0))$ and $((0,1),(0,1))$ have the two (joint) upper bounds $((0,1),(1,1),(1,0),(0,1))$ and $((0,1),(0,1),(1,1),(1,0))$, which have no common lower bound which is still an upper bound for both original sequences. This makes the search for frequent itemsets somewhat harder.
Another difference is that the complexity of the mining task has grown considerably: with $|I|$ items one has $2^{|I|}$ market baskets and thus $2^{|I|m}$ different sequences of length $\le m$. Thus it is essential to be able to deal with the computational complexity of this problem. Note in particular that the probability of any particular sequence is going to be extremely small. However, one will be able to make statements about the support of small subsequences which correspond to shopping or treatment patterns.
Based on the ordering, the support of a sequence $x$ is the probability of the set of all sequences larger than $x$. This is estimated by the number of sequences in the data base which are in
the support. Note that the itemsets now occur as length 1 sequences and
thus the support of the itemsets can be identified with the support of the
corresponding 1 sequence. As our focus is now on sequences this is different
from the support we get if we look just at the distribution of the itemsets.
The length of a sequence is the number of non-empty components. Thus
we can now define an apriori algorithm as before. This would start with the
determination of all the frequent 1 sequences which correspond to all the
frequent itemsets. Thus the first step of the sequence mining algorithm is
just the ordinary apriori algorithm. Then the apriori algorithm continues
as before, where the candidate generation step is similar but now we join
any two sequences which have all components identical except for the last
(non-empty) one. Then one gets a sequence of length m + 1 from two such
sequences of length m by concatenating the last component of the second
sequence on to the first one. After that one still needs to check if all subse-
quences are frequent to do some pruning.
Initially the tree consists only of the root. Then the first record is read
and a path is attached to the root such that the node labelled with the
first item of the record (items are ordered by their frequency) is adjacent to
the root, the second item labels the next neighbour and so on. In addition
to the item, the label also contains the number 1, see Step 1 in figure ??.
Then the second record is included such that any common prefix (in the example the items f, c, a) is shared with the previous record and the remaining items are added in a split path. The numeric parts of the labels of the shared prefix nodes are increased by one, see Step 2 in the figure. This is
then done with all the other records until the whole data base is stored
in the tree. As the most common items were ordered first, there is a big
likelihood that many prefixes will be shared which results in substantial
saving or compression of the data base. Note that no information is lost
with respect to the supports. The FP tree structure is completed by adding
a header table which contains all items together with pointers to their first
occurrence in the tree. The other occurrences are then linked together so
that all occurrences of an item can easily be retrieved, see figure ??.
The FP tree never breaks a long pattern into smaller patterns the way the Apriori algorithm does. Long patterns can be directly retrieved from the FP tree. The FP tree also contains the full relevant information about the data base. It is compact, as all infrequent items are removed and the highly frequent items share nodes in the tree. The number of nodes is never more than the size of the data base measured as the sum of the sizes of the records, and there is anecdotal evidence that compression rates can be over 100.
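A minimal sketch of the construction of the tree itself (without the header table and the node links; the records reproduce the item supports of the header table in the figure below, everything else is an illustrative assumption):

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(records, min_count):
    """Keep only frequent items, order every record by decreasing item frequency
    and insert it as a path from the root; shared prefixes share nodes and only
    their counts are increased."""
    freq = {}
    for record in records:
        for item in record:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = Node(None, None)
    for record in records:
        path = sorted((i for i in record if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, freq

records = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
root, freq = build_fp_tree(records, min_count=3)
print(freq)   # the frequent items f, c, a, b, m, p, each with its support
```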
The FP tree is used to find all association rules containing particular
items. Starting with the least frequent items, all rules containing those items
(Figure: the FP tree for the example data base, with the header table listing the items f, c, a, b, m, p and their supports 4, 4, 3, 3, 3, 3; tree nodes are labelled with item and count, e.g. f:4, c:3, a:3, m:2, p:2 along the leftmost path.)
can be found simply by generating for each item the conditional data base, which consists, for each path containing the item, of those items which lie between that item and the root. (The lower items don't need to be considered, as they are considered together with other items.) These conditional pattern bases can then again be put into FP-trees, the conditional FP-trees, and for those trees all the rules containing the previously selected item and any other item will be extracted. If the conditional pattern base contains only one item, that item has to be the itemset. The frequencies of
these itemsets can be obtained from the number labels.
An additional speed-up is obtained by mining long prefix paths separately and combining the results at the end. Of course a chain does not necessarily need to be broken into parts, as all its frequent subsets, together with their frequencies, are easily obtained directly.
5. Conclusion
Data mining deals with the processing of large, complex and noisy data.
Robust tools are required to recover weak signals. These tools require highly
efficient algorithms which scale with data size and complexity. Association
rule discovery is one of the most popular and successful tools in data mining.
Efficient algorithms are available. The developments in association rule dis-
covery combine concepts and insights from probability and combinatorics.
The original algorithm “Apriori” was developed in the early years of data
mining and is still widely used. Numerous variants and extensions exist of
which a small selection was covered in this tutorial.
The most recent work in association rules uses concepts from graph
theory, formal concept analysis and statistics, and links association rules with graphical models and with hidden Markov models.
In this tutorial some of the mathematical basis of association rules was covered, but no attempt has been made to cover the vast literature discussing the numerous algorithms.
Acknowledgements
I would like to thank Zuowei Shen for his patience and support during the
preparation of this manuscript. Much of the work has arisen in discussions
and collaboration with John Maindonald, Peter Christen, Ole Nielsen, Steve
Roberts and Graham Williams.