Comparative Study of Different Improvements of Apriori Algorithm
Volume: 4 Issue: 3
ISSN: 2321-8169
75 - 78
_______________________________________________________________________________________
Ambar Dutta
Abstract- Data mining is the process of finding the most frequent patterns in a large dataset. Association rule mining is an important technique in data mining, and the Apriori algorithm is the most basic, popular and simplest algorithm for finding these frequent patterns. Although it is one of the simplest algorithms for association rule mining, it has certain limitations. In the literature, researchers have proposed several improvements of the Apriori algorithm. This paper provides a comparative study of some of these improved versions of the Apriori algorithm with respect to the traditional Apriori algorithm.
Keywords- Data Mining, Apriori algorithm, Frequent Itemset, Support, Dataset.
__________________________________________________*****_________________________________________________
I. INTRODUCTION
Data Mining [1] is a process where information or knowledge is extracted, or mined, from a large dataset. Here we mine the sequential patterns that are present in large databases. Sequential pattern mining was first introduced by Agrawal and Srikant in 1995. Sequential patterns are sets of data items that occur in a specific order among all the data patterns in a given set, and finding the patterns that occur sequentially among all the other patterns is sequential pattern mining. An example of a sequential pattern is as follows: suppose a customer buys a laptop; it is then likely that the customer will sequentially buy a mouse, then an antivirus, and then a printer. Some terms that are used constantly here are itemset, support and confidence. Let there be a set of items L = {l1, l2, ...}; a subset of these items is known as an itemset. For a given database D, the support of an item X is defined as the ratio of the number of sequences in the database that contain X to the total number of sequences in the database. For a given database D, the confidence of a rule relating X and Y is defined as the ratio of the number of sequences that contain both X and Y to the number of sequences that contain X. Mining of sequential patterns can be classified into three categories: 1) mining based on candidate generation (for example, the Apriori algorithm); 2) mining without any candidate generation (for example, the FP-Growth algorithm); and 3) mining itemsets in vertical format (for example, the ECLAT algorithm).
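As a small illustration of the support and confidence definitions above (the transactions here are hypothetical, chosen to mirror the laptop example), both quantities can be computed directly:

```python
# Hypothetical transaction database, each transaction a set of items.
transactions = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "antivirus"},
    {"laptop", "printer"},
    {"mouse", "antivirus"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """Support of x together with y, divided by the support of x."""
    return support(x | y, db) / support(x, db)

print(support({"laptop"}, transactions))                # 0.75
print(confidence({"laptop"}, {"mouse"}, transactions))  # 2/3
```

Here "laptop" appears in 3 of the 4 transactions, so its support is 0.75; of those 3 transactions, 2 also contain "mouse", giving a confidence of 2/3.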
Apriori is the algorithm that involves candidate generation. According to
this algorithm, the 1-itemsets are found first, and the database is scanned to find their support counts. Itemsets with a support count less than the minimum support count are discarded. The resulting itemsets are then used to find the frequent 2-itemsets by the same process. Likewise, all the (k+1)-itemsets are found from the frequent k-itemsets until no more frequent itemsets can be found. In the FP-Growth algorithm [2], no candidates are generated and the database is scanned only twice; it stores the database in a tree-like structure and uses a divide-and-conquer method. In the ECLAT algorithm [2], a depth-first search is used. In the first scan of the database, a TID (Transaction_ID) list is built for each single item. (k+1)-itemsets are then generated from the frequent k-itemsets using the Apriori property, with the TID-set of each (k+1)-itemset obtained by intersecting the TID-sets of the corresponding frequent k-itemsets. This process continues until no more candidate itemsets can be found.
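The TID-set intersection step of ECLAT can be sketched as follows (a minimal illustration with example TID-lists, not the full algorithm):

```python
# Each item maps to the set of IDs of the transactions that contain it.
tid_lists = {
    "I1": {1, 3, 7, 9, 10},
    "I2": {2, 3, 4, 5, 6, 7, 8},
    "I4": {5, 7, 8},
}

def combine(tids_x, tids_y):
    """TID-set of the combined itemset = intersection of the TID-sets."""
    return tids_x & tids_y

# TID-set of the 2-itemset {I2, I4}; its size is the support count,
# so no extra database scan is needed.
tids_i2i4 = combine(tid_lists["I2"], tid_lists["I4"])
print(sorted(tids_i2i4))  # [5, 7, 8] -> support count 3
```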
In this paper, Apriori algorithm is taken into consideration.
In section II, a detailed description of Apriori algorithm is
provided along with its limitations. In section III, some of the
existing improvements of Apriori algorithm are discussed with
examples. In section IV comparisons between the original and
the existing improved apriori algorithm is shown. Finally,
conclusion is derived in section V.
II. APRIORI ALGORITHM
A. Description
The first and most basic algorithm developed to find sequential patterns in a database was the Apriori algorithm [3]. The algorithm involves candidate generation and was first proposed by R. Agrawal and R. Srikant in 1994. In the Apriori algorithm, we first scan the original database, find the support count of each individual item, and discard those items whose support count is less than the minimum support count. The resulting itemsets are then used to find the frequent 2-itemsets: the support count of each candidate 2-itemset is calculated, and only those whose support count meets the minimum support count are kept; the others are discarded. Next we find the frequent 3-itemsets, then the frequent 4-itemsets, and so on until no more frequent itemsets can be generated. The frequent itemsets generated in this way, all satisfying the minimum support count, form the final frequent patterns.
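The level-wise procedure described above can be sketched in Python (a minimal sketch of the classical algorithm, not the authors' implementation):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search for all frequent itemsets (as frozensets)."""
    items = {i for t in transactions for i in t}
    # Frequent 1-itemsets: keep items whose support count meets the threshold.
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count}
    frequent = set(level)
    k = 1
    while level:
        k += 1
        # Join step: candidate k-itemsets as unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        # Scan the database once to count the surviving candidates.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_count}
        frequent |= level
    return frequent

# The transaction database used in this paper's example, min support count 3.
db = [
    {"I1", "I3", "I7"}, {"I2", "I3", "I7"}, {"I1", "I2", "I3"},
    {"I2", "I3"}, {"I2", "I3", "I4", "I5"}, {"I2", "I3"},
    {"I1", "I2", "I3", "I4", "I6"}, {"I2", "I3", "I4", "I6"},
    {"I1"}, {"I1", "I3"},
]
result = apriori(db, 3)
print(frozenset({"I2", "I3", "I4"}) in result)  # True
```

On this database the search stops at the 3-itemsets, with {I2, I3, I4} as the largest frequent itemset.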
B. Advantages and Disadvantages
The advantage of the Apriori algorithm is that it is simple and can be implemented easily. However, it also has some disadvantages. The main disadvantage is that the entire database needs to be scanned at each step. The algorithm also generates a large number of candidate itemsets. If the database is very large, scanning it at each step not only consumes a lot of time, but the generation of a large number of candidate itemsets also consumes a lot of memory, which can sometimes be limited. Therefore, this algorithm works well for small databases but not for large ones.
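The memory problem can be made concrete: from n frequent 1-itemsets, the join step can produce up to C(n, 2) candidate 2-itemsets before any pruning, so the candidate set grows quadratically (a small illustrative computation):

```python
from math import comb

# Worst-case number of candidate 2-itemsets from n frequent 1-itemsets.
for n in (10, 100, 1000):
    print(n, "frequent items ->", comb(n, 2), "candidate 2-itemsets")
# 10 -> 45, 100 -> 4950, 1000 -> 499500
```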
III. IMPROVEMENTS OF APRIORI ALGORITHM

The improved algorithms are illustrated on the following sample transaction database, with a minimum support count of 3:

Transaction_ID   Items
T1               I1,I3,I7
T2               I2,I3,I7
T3               I1,I2,I3
T4               I2,I3
T5               I2,I3,I4,I5
T6               I2,I3
T7               I1,I2,I3,I4,I6
T8               I2,I3,I4,I6
T9               I1
T10              I1,I3

Candidate 1-itemsets and their transaction lists:

Item   Transaction_IDs                 Status
I1     T1,T3,T7,T9,T10
I2     T2,T3,T4,T5,T6,T7,T8
I3     T1,T2,T3,T4,T5,T6,T7,T8,T10
I4     T5,T7,T8
I5     T5                              Deleted
I6     T7,T8                           Deleted
I7     T1,T2                           Deleted

Figure 2. Frequent 1-Itemsets after deleting the rows and columns.

Candidate 2-itemsets; each candidate is counted only in the transactions of its member item with minimum support:

Itemset   Item with Min_support   Transaction_IDs          Status
I1I2      I1                      T1,T3,T7,T10             Deleted
I1I3      I1                      T1,T3,T7,T10
I1I4      I4                      T5,T7,T8                 Deleted
I2I3      I2                      T2,T3,T4,T5,T6,T7,T8
I2I4      I4                      T5,T7,T8
I3I4      I4                      T5,T7,T8

Candidate 3-itemsets:

Itemset   Item with Min_support   Transaction_IDs          Status
I1I2I3    I1                      T1,T3,T7                 Deleted
I1I3I4    I4                      T5,T7,T8                 Deleted
I2I3I4    I4                      T5,T7,T8
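The scan-reduction idea behind the "Item with Min_support" column, counting each candidate only in the transactions of its member item with minimum support rather than in the whole database, can be sketched as follows (helper names are my own, not from the paper):

```python
# TID-lists of the frequent 1-itemsets from the example above.
tid_lists = {
    "I1": {"T1", "T3", "T7", "T9", "T10"},
    "I2": {"T2", "T3", "T4", "T5", "T6", "T7", "T8"},
    "I3": {"T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T10"},
    "I4": {"T5", "T7", "T8"},
}
# The example transaction database, indexed by transaction ID.
db = {
    "T1": {"I1", "I3", "I7"}, "T2": {"I2", "I3", "I7"},
    "T3": {"I1", "I2", "I3"}, "T4": {"I2", "I3"},
    "T5": {"I2", "I3", "I4", "I5"}, "T6": {"I2", "I3"},
    "T7": {"I1", "I2", "I3", "I4", "I6"}, "T8": {"I2", "I3", "I4", "I6"},
    "T9": {"I1"}, "T10": {"I1", "I3"},
}

def support_count(candidate, db, tid_lists):
    # The candidate's member item with the smallest TID-list (min support).
    min_item = min(candidate, key=lambda i: len(tid_lists[i]))
    # Scan only that item's transactions instead of the whole database.
    return sum(candidate <= db[tid] for tid in tid_lists[min_item])

print(support_count({"I2", "I4"}, db, tid_lists))  # 3 (T5, T7, T8)
```

For {I2, I4} only the three transactions in I4's TID-list are scanned, instead of all ten, which is where the reduced scan counts in the comparison table below... oops, in the comparison of scan counts come from.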
IV. COMPARISON

TABLE XI. COMPARISON AMONG THE NUMBER OF SCANS
Algorithm                   Number of scans
                            1-Itemset   2-Itemset   3-Itemset   Total
Normal Apriori algorithm        70          60          30        160
Algorithm 1                     70          26          11        107
Algorithm 2                     70          24           9        103
Algorithm 3                     70          54          12        136
Algorithm 4                     70          60          10        140

[Second comparison table: only the values 16 (normal Apriori algorithm) and 14 (Algorithm 4) are recoverable.]
V. CONCLUSIONS
The Apriori algorithm is the most basic algorithm developed for finding frequent patterns in large databases, and it suffers from certain limitations. In this paper, four proposals for improving the Apriori algorithm were discussed that successfully overcome these limitations, and their comparison has been shown. Still, these algorithms need to be optimised further in terms of time consumption, memory requirement, efficiency and reduction in the number of scans.
REFERENCES
[1]
[2]
[3]
[4]
[6]
[7]
IJRITCC | March 2016, Available @ http://www.ijritcc.org