Data Warehouse and Data Mining - Unit 5
FREQUENT PATTERN MINING
CHAPTER OUTLINE
• Frequent patterns
• Market basket analysis
• Frequent itemsets
• Closed itemsets
• Association rules
• Types of association rules: Single dimensional, Multidimensional, Multilevel, Quantitative
• Finding frequent itemsets and generating association rules from frequent itemsets: Apriori algorithm, Limitations and improving Apriori, FP-growth algorithm
• From Association Mining to Correlation Analysis, Lift
INTRODUCTION
FREQUENT PATTERNS
Market basket analysis only uses transactions with more than one item, as no associations can be made with single purchases. Besides this, it does not consider the order of items, either within a transaction or across transactions.
Market basket analysis can increase sales and customer satisfaction. Using data to determine what
products are often purchased together, retailers can optimize product placement, offer special deals
and create new product bundles to encourage further sales of these combinations. These
improvements can generate additional sales for the retailer, while making the shopping experience
more productive and valuable for customers. By using market basket analysis, customers may feel a
stronger sentiment or brand loyalty toward the company.
Market basket analysis isn't limited to shopping baskets. Other areas where the technique is used include e-commerce websites, analysis of credit card purchases, analysis of telephone calling patterns, identification of fraudulent insurance claims, analysis of telecom service purchases, etc.
FREQUENT ITEMSETS
Transaction Dataset (D): In association rule mining, we are usually interested in the absolute
number of customer transactions, also called baskets that contain a particular set of items, usually
products. A typical application of association rule mining is the analysis of consumer buying
behavior in supermarkets, where the contents of shopping carts brought to the register for checkout are recorded. These transaction data normally consist of tuples of the form shown in Table 5.1 below.
Table 5.1: Market Basket Transactions

TID   Items
1     Pencil, Paper
2     Pencil, Book, Rubber, Ink
3     Paper, Book, Rubber, Ruler
4     Pencil, Paper, Book, Rubber
5     Pencil, Paper, Book, Ruler
Transaction (T): A transaction (basket) is a set of items with a minimum of two items. Each transaction is identified with a unique identifier; items can belong to several transactions. For example, in the given transaction dataset the Transaction ID (TID) is used as a unique identifier for transactions. In the given dataset there are a total of five transactions. Transaction 1 shows an itemset containing the items Pencil, Paper. Transaction 2 shows an itemset containing the items Pencil, Book, Rubber and Ink. Transaction 3 shows an itemset containing the items Paper, Book, Rubber, Ruler. Transaction 4 shows an itemset containing the items Pencil, Paper, Book, Rubber. Transaction 5 shows an itemset containing the items Pencil, Paper, Book, Ruler. Following are some assumptions about the shape of the data for association rule mining:
Items: Depending on the application field, they can be products, objects, patients, events, etc.

Itemset: An itemset is a collection of one or more items. For example, {Paper, Pencil, Book} is one possible itemset.

K-itemset: A k-itemset is a collection of exactly k items. When k = 1 the k-itemset is a 1-itemset, when k = 2 the k-itemset is a 2-itemset, and so on. For example, {Paper, Pencil, Book} is one possible 3-itemset and {Paper, Pencil} is one possible 2-itemset.

Frequent Itemset: An itemset whose support is greater than or equal to a minimum support threshold is called a frequent itemset. The support of an itemset is the frequency of occurrence of that itemset in the transactions of a given transaction dataset. For example, the support of the itemset {Paper, Pencil, Book} is 2, because this itemset appears in only 2 transactions of the given transaction dataset. The minimum support threshold is the minimum value of support required for an itemset to be frequent. For example, if the minimum support threshold is 2 then the itemset {Paper, Pencil, Book} is frequent, because its support of 2 is greater than or equal to the minimum support threshold 2. Frequent itemsets are generally used for association rule generation.
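As a small illustration of these definitions, the following Python sketch (an illustrative example only, not part of the original text) counts itemset supports over the Table 5.1 transactions and lists, by brute-force enumeration, every itemset whose support is at least a minimum support of 2:

```python
from itertools import combinations

# Transactions from Table 5.1
transactions = [
    {"Pencil", "Paper"},
    {"Pencil", "Book", "Rubber", "Ink"},
    {"Paper", "Book", "Rubber", "Ruler"},
    {"Pencil", "Paper", "Book", "Rubber"},
    {"Pencil", "Paper", "Book", "Ruler"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_sup):
    """Enumerate every non-empty itemset and keep those with support >= min_sup."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            count = support_count(set(candidate), transactions)
            if count >= min_sup:
                result[candidate] = count
    return result

print(support_count({"Paper", "Pencil", "Book"}, transactions))   # -> 2
print(len(frequent_itemsets(transactions, min_sup=2)))            # number of frequent itemsets
```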
CLOSED ITEMSETS

Figure 5.2: Relation between frequent and closed itemsets
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X. Itemset X is not closed if at least one of its supersets has the same support count as X. For example, consider the transaction DB shown below in Table 5.2 and Table 5.3.
ASSOCIATION RULES
Association rules are "if-then" statements that help to show the probability of relationships between data items within large data sets in various types of databases. An association rule is an implication expression of the form X → Y, where X and Y are itemsets. An association rule can be read as: given that someone has purchased the items in the set X, they are likely to also buy the items in the set Y. For example, the association rule {Paper, Pencil} → {Rubber} means that, given that someone has purchased the items in the set {Paper, Pencil}, they are likely to also buy the items in the set {Rubber}. This rule can also be written in the form of predicates as buys(someone, Paper) ∧ buys(someone, Pencil) → buys(someone, Rubber). Association rules express the co-occurrence relation among items but not a causal relationship.

It is not very difficult to develop algorithms that will find associations in a large database. The problem is that such an algorithm will also uncover many other associations that are of very little value. Therefore, it is necessary to introduce some measures to distinguish interesting associations from non-interesting ones. These measures are called rule evaluation metrics.
Figure: Multilevel association rule mining with uniform minimum support. Level 1 (min_sup = 5%): Computer [support = 10%]. Level 2 (min_sup = 3%): Laptop Computer [support = 6%], Desktop Computer [support = 4%].
1. Support (S): This metric tells us how popular an itemset is, as measured by the proportion of transactions in which the itemset appears. The support of an itemset X is calculated as:

   Support(X) = Number of transactions that contain the itemset X / Total number of transactions in the given dataset

   In our data shown in Table 5.1, the support of the itemset {Pencil} is:

   Support({Pencil}) = Number of transactions that contain the itemset {Pencil} / Total number of transactions in the given dataset = 4/5 = 0.8 = 80%

   Similarly, the support of the itemset {Paper, Book} is 3/5 = 0.6 = 60%.
2. Confidence (C): Confidence tells us how likely an itemset Y is purchased given that itemset X is purchased, expressed as X → Y. It is measured by the proportion of transactions with itemset X in which itemset Y also appears. The confidence of an association rule X → Y is calculated as:

   Confidence = Number of transactions that contain the itemset {X, Y} / Number of transactions that contain the itemset {X}

   For example, the confidence of the association rule {Paper, Book} → {Rubber} is:

   Confidence({Paper, Book} → {Rubber}) = Number of transactions that contain the itemset {Paper, Book, Rubber} / Number of transactions that contain the itemset {Paper, Book} = 2/3 × 100 ≈ 67%

   Hence, the confidence can be interpreted as an estimate of the conditional probability P(Y | X). In other words, this is the probability of finding the itemset Y in transactions under the condition that these transactions also contain the itemset X. Confidence is directed and therefore usually gives different values for the rules X → Y and Y → X. In our example, the confidence that {Rubber} is purchased given that {Paper, Book} is purchased ({Paper, Book} → {Rubber}) is 2 out of 3, or 67%. This means the conditional probability P({Rubber} | {Paper, Book}) = 67%.

   Note that the confidence measure might misrepresent the importance of an association. This is because it only accounts for how popular itemset X is (in our case {Paper, Book}) but not Y (in our case {Rubber}). If {Rubber} is also very popular in general, there will be a higher chance that a transaction containing {Paper, Book} will also contain {Rubber}, thus inflating the confidence measure. To account for the base popularity of both items, we use a third measure called lift.
3. Lift: Lift tells us how likely itemset Y is purchased when itemset X is purchased, while controlling for how popular itemsets Y and X are. It measures how many times more often X and Y occur together than expected if they were statistically independent. Lift is calculated as:

   Lift(X → Y) = Support({X, Y}) / (Support(X) × Support(Y)), where the supports are expressed as fractions of the total number of transactions.

   • If the value of Lift = 1: implies no association between the itemsets.
   • If the value of Lift > 1: greater than 1 means that itemset Y is likely to be bought if itemset X is bought.
   • If the value of Lift < 1: less than 1 means that itemset Y is unlikely to be bought if itemset X is bought.
For example, the lift for the association rule {Paper, Book} → {Rubber} is calculated as:

Lift({Paper, Book} → {Rubber}) = Support({Paper, Book, Rubber}) / (Support({Paper, Book}) × Support({Rubber})) = (2/5) / ((3/5) × (3/5)) = 0.4 / 0.36 ≈ 1.11

The lift of {Paper, Book} → {Rubber} is 1.11, which is greater than 1, implying that the itemset {Rubber} is likely to be bought if the itemset {Paper, Book} is bought.
Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ a minimum support threshold and confidence ≥ a minimum confidence threshold. If an association rule A → B [Support, Confidence] satisfies the minimum support and minimum confidence thresholds then it is called a strong rule, otherwise it is called a weak rule. So, we can say that the goal of association rule mining is to find all strong rules.
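The three rule evaluation metrics and the strong-rule test can be expressed compactly in code. The sketch below is a minimal Python illustration over the Table 5.1 data (the threshold values are chosen arbitrarily for the example):

```python
# Rule evaluation metrics for the Table 5.1 transactions (illustrative sketch).
transactions = [
    {"Pencil", "Paper"},
    {"Pencil", "Book", "Rubber", "Ink"},
    {"Paper", "Book", "Rubber", "Ruler"},
    {"Pencil", "Paper", "Book", "Rubber"},
    {"Pencil", "Paper", "Book", "Ruler"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Support of the union divided by the product of the individual supports."""
    return support(antecedent | consequent) / (support(antecedent) * support(consequent))

X, Y = {"Paper", "Book"}, {"Rubber"}
print(round(support(X | Y), 2))     # 0.4
print(round(confidence(X, Y), 2))   # 0.67
print(round(lift(X, Y), 2))         # 1.11

# A rule is "strong" if it meets both minimum thresholds.
min_sup, min_conf = 0.4, 0.6
print(support(X | Y) >= min_sup and confidence(X, Y) >= min_conf)   # True
```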
For example, if there are d = 4 items, namely a, b, c and d, given for itemset generation, the itemset generation process generates a total of 2^4 = 16 itemsets. These generated itemsets are: {a, b, c, d}, {a, b, c}, {a, b, d}, {a, c, d}, {b, c, d}, {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}, {a}, {b}, {c}, {d}, {null}. In general, a dataset with d items can produce up to 2^d candidate itemsets, so exhaustive generation quickly becomes impractical. But we can reduce the complexity of itemset generation by using the Apriori principle.
APRIORI PRINCIPLE
The Apriori principle states that if an itemset is frequent, then all of its subsets must also be frequent. For example, if {Rubber, Book, Pencil} is frequent, so are {Rubber, Book}, {Rubber, Pencil}, {Book, Pencil}, {Rubber}, {Book} and {Pencil}. The Apriori principle holds due to the following property of the support measure:

∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y), where X and Y are itemsets. That is, the support of an itemset never exceeds the support of its subsets. This property is also known as the anti-monotone or Apriori property of support.
For example, consider the dataset shown in Table 5.1 above. Based on the given dataset, the support of the itemset {Pencil} = 4, which is greater than the support of its superset itemset {Pencil, Rubber} = 2. Similarly, support({Paper}) > support({Pencil, Paper}) and support({Book, Rubber}) > support({Book, Rubber, Ruler}) also hold. This property is used by the Apriori algorithm to reduce the complexity of candidate generation as: "If there is any itemset which is infrequent, its supersets should not be generated or tested." For example, if the itemset {a, b} is infrequent then we do not need to take into account all its supersets {a, b, c}, {a, b, d} and {a, b, c, d}. That is, these supersets of the infrequent itemset {a, b} are pruned by the Apriori algorithm. Hence, this elimination helps to improve the itemset generation process.
Figure 5.5: Elimination of superset itemsets of the infrequent itemset {a, b} by using the Apriori principle (once {a, b} is found to be infrequent, all of its supersets are pruned).
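The pruning implied by Figure 5.5 can be shown directly in code. The sketch below (an illustrative example, not the full Apriori algorithm) drops every candidate itemset that contains a known infrequent subset, mirroring the {a, b} example:

```python
from itertools import combinations

items = ["a", "b", "c", "d"]
infrequent = {frozenset({"a", "b"})}   # itemset already found to be infrequent

def prune(candidates, infrequent):
    """Keep only candidates that contain no known infrequent itemset."""
    return [c for c in candidates
            if not any(bad <= c for bad in infrequent)]

# All 3-itemsets over {a, b, c, d}
candidates = [frozenset(c) for c in combinations(items, 3)]
print(prune(candidates, infrequent))
# {a, b, c} and {a, b, d} are pruned; {a, c, d} and {b, c, d} survive.
```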
APRIORI ALGORITHM
The Apriori algorithm was the first algorithm proposed for frequent itemset mining. It was later improved by R. Agrawal and R. Srikant and came to be known as Apriori. This algorithm
uses two steps "join" and "prune" to reduce the search space. It is an iterative approach to
discover the most frequent itemsets.
The Apriori algorithm is a candidate generation-and-test approach for frequent itemset generation. It is based on the Apriori principle, which states that "If there is any itemset which is infrequent, its superset should not be generated or tested". Therefore, the Apriori algorithm initially scans the transaction dataset once to get candidate 1-itemsets and then frequent 1-itemsets,
then generates length (k+1) candidate itemsets from length k frequent itemsets, tests the candidates against the transaction dataset to filter the frequent itemsets from the infrequent itemsets present in the candidate itemsets; this process terminates when no frequent or candidate set can be generated.
The steps followed in the Apriori algorithm of data mining are:

Input:
    Transaction database D
    Minimum support count threshold
Output:
    Fk: frequent itemsets of size k in D.

• Join Step: This step generates (k+1)-itemsets from k-itemsets by joining each itemset with itself.
• Prune Step: This step scans the count of each candidate itemset in the database. If a candidate itemset does not meet the minimum support, then it is regarded as infrequent and thus it is removed. This step is performed to reduce the size of the candidate itemsets.
Method:
1. Initially, each item present in the transaction database is a member of the set of candidate 1-itemsets, C1. Find the support count of each candidate itemset present in the candidate 1-itemsets C1 by scanning and counting the number of occurrences of each item in all of the transactions.
Figure: Pruning of low-confidence rules during rule generation (a low-confidence rule and all rules derived from it are pruned).
Figure: Apriori flow chart (step 4: the iterations repeat until the candidate set becomes null).
Iteration 1: Now generating the frequent 1-itemset F1 from the candidate 1-itemset C1. For this we compare the support count value of each itemset with min_sup = 3 and eliminate those itemsets whose support count is less. As you can see here, the itemset {I5} has a support count value of 2, which is less than the minimum support value 3; hence it does not meet min_sup = 3 and is discarded in the upcoming iterations. Only I1, I2, I3, I4 meet the min_sup count. We have the frequent 1-itemset F1 as shown below.

Itemset    Support Count
I1         4
I2         5
I3         4
I4         4
Iteration 2: Candidate 2-itemset C2 and frequent 2-itemset generation. Here, the candidate 2-itemset C2 is generated by joining the frequent 1-itemset F1 to itself, that is, by finding all possible combinations of itemsets in F1. After the candidate itemsets have been found we prune those itemsets which contain an already infrequent itemset, but here no such itemset is present. Now, we find the support count for each 2-itemset as shown below.

Itemset    Support Count
I1, I2     4
I1, I3     3
I1, I4     2
I2, I3     4
I2, I4     3
I3, I4     2

Now comparing the support count of each itemset with min_sup; itemsets having support less than min_sup = 3 are eliminated again. Here, the itemsets {I1, I4} and {I3, I4} do not meet min_sup, thus they are deleted, and we obtain the following frequent itemset F2.

Itemset    Support Count
I1, I2     4
I1, I3     3
I2, I3     4
I2, I4     3
Iteration 3: Candidate 3-itemset C3 and frequent 3-itemset generation. Here, the candidate 3-itemset C3 is generated by joining the frequent 2-itemset F2 to itself, that is, by finding all possible combinations of itemsets in F2. After the candidate itemsets have been found we prune those itemsets which contain an already infrequent itemset. Here all possible C3 are: {I1, I2, I3}, {I1, I2, I4}, {I1, I3, I4}, {I2, I3, I4}. We can see for itemset {I1, I2, I3} that its subsets {I1, I2}, {I1, I3}, {I2, I3} all occur in F2, thus {I1, I2, I3} is frequent. We can see for itemset {I1, I2, I4} that among its subsets {I1, I2}, {I1, I4}, {I2, I4}, the subset {I1, I4} is not frequent, as it does not occur in F2; thus {I1, I2, I4} is not frequent, hence it is deleted. We can see for itemset {I1, I3, I4} that among its subsets {I1, I3}, {I1, I4}, {I3, I4}, the subset {I1, I4} is not frequent, as it does not occur in F2; thus {I1, I3, I4} is not frequent, hence it is deleted. We can see for itemset {I2, I3, I4} that among its subsets {I2, I3}, {I2, I4}, {I3, I4}, the subset {I3, I4} is not frequent, as it does not occur in F2; thus {I2, I3, I4} is not frequent, hence it is deleted. Now, we find the support count for the remaining 3-itemset in C3 as shown below.

Itemset       Support Count
I1, I2, I3    3

The support of the itemset {I1, I2, I3} is 3, so it passes min_sup = 3, so F3 is:

Itemset       Support Count
I1, I2, I3    3
Finally, association rules are generated from the frequent itemset {I1, I2, I3} and their confidences are computed; for example:

4. {I1} → {I2, I3}
   Confidence = support{I1, I2, I3} / support{I1} = 3/4 × 100 = 75%
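The candidate generation-and-test loop traced in these iterations can be sketched in Python as follows. This is a minimal illustration, not an optimized implementation; the six-transaction dataset shown here is an assumption (it is consistent with the support counts of the worked example, whose original table is not reproduced in this text):

```python
from itertools import combinations

# Assumed transaction dataset, consistent with the counts in the worked example.
transactions = [
    {"I1", "I2", "I3"},
    {"I2", "I3", "I4"},
    {"I4", "I5"},
    {"I1", "I2", "I4"},
    {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3", "I4"},
]
MIN_SUP = 3   # minimum support count

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_sup):
    # C1 and F1: candidate and frequent 1-itemsets
    items = sorted(set().union(*transactions))
    frequent = {frozenset([i]): support_count(frozenset([i]))
                for i in items if support_count(frozenset([i])) >= min_sup}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Join step: build candidate k-itemsets from the frequent (k-1)-itemsets
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates having any infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Test step: count supports and keep the frequent candidates
        frequent = {c: support_count(c) for c in candidates
                    if support_count(c) >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

for itemset, count in sorted(apriori(transactions, MIN_SUP).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)
```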
Limitations: The Apriori algorithm is easy to understand and its join and prune steps are easy to implement on large itemsets in large databases. Along with these advantages it has a number of limitations. These are:
1. Huge number of candidates: Candidate generation is the inherent cost of the Apriori algorithm, no matter what implementation technique is applied. It is costly to handle a huge number of candidate sets. For example, if there are 10^4 large 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets. Moreover, to discover a frequent pattern of size 100, it must generate more than 2^100, which is approximately 10^30, candidates in total.
2. Multiple scans of the transaction database are needed, so this algorithm is not a good choice for mining long patterns in large data sets.
3. When the database is scanned to check Ck for creating Fk, a large number of transactions will be scanned even if they do not contain any k-itemset.
Methods to Improve Apriori Efficiency: To improve Apriori efficiency we need to reduce the number of passes over the transaction database, shrink the number of candidates, and facilitate the support counting of candidates. Many methods are available for improving the efficiency of the algorithm.
1. Hash-Based Technique: This method uses a hash-based structure called a hash table for
generating the k-itemsets and its corresponding count. It uses a hash function for
generating the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in future iterations. The transactions which do not contain any frequent k-itemset are marked or removed, because such transactions cannot contain any frequent (k+1)-itemset (see the sketch after this list).
3. Partitioning: This method requires only two database scans to mine the frequent
itemsets. It says that for any itemset to be potentially frequent in the database, it should
be frequent in at least one of the partitions of the database. In scan 1, partition the database and find the local frequent patterns, and in scan 2, consolidate the global frequent patterns.
4. Sampling: This method picks a random sample S from database D and then searches for frequent itemsets in S using Apriori. Scan the database once to verify the frequent itemsets found in sample S; only the borders of the closure of the frequent patterns are checked. For example, check abcd instead of ab, ac, ..., etc. Scan the database again to find missed frequent patterns. It may be possible to lose a global frequent itemset; this risk can be reduced by lowering min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start point of the database during the scanning of the database. It finds longer frequent patterns based on shorter frequent patterns and local database partitions.
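As an illustration of the transaction reduction idea from point 2 above, the following hypothetical Python snippet drops transactions that contain no frequent k-itemset, so they need not be scanned in later passes (the dataset and F2 are the ones assumed in the earlier Apriori sketch):

```python
from itertools import combinations

def reduce_transactions(transactions, frequent_k, k):
    """Keep only transactions containing at least one frequent k-itemset.

    A transaction without any frequent k-itemset cannot contribute to any
    frequent (k+1)-itemset, so it can be skipped in subsequent scans.
    """
    kept = []
    for t in transactions:
        if any(frozenset(c) in frequent_k for c in combinations(sorted(t), k)):
            kept.append(t)
    return kept

transactions = [
    {"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"},
]
F2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I2", "I4")]}
print(len(reduce_transactions(transactions, F2, 2)))   # transaction {I4, I5} is dropped -> 5
```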
FP-GROWTH
• A fixed item order is used, so paths can overlap when transactions share items.
• Pointers are maintained between nodes containing the same item (dotted lines).
Mining Frequent Patterns Using FP-tree:
• Start from each frequent length-1 pattern (called suffix pattern)
• Construct its conditional pattern base (the set of prefix paths in the FP-tree co-occurring with the suffix pattern)
• The pattern growth is achieved by the concatenation of the suffix pattern with the
frequent patterns generated from a conditional FP-tree.
The steps used by the FP-growth approach for FP-tree construction can be expressed in a flow chart as shown below:
Flow chart: While transactions remain (hasNext), read the next transaction t. If an overlapped prefix is found in the tree, increment the frequency count of each overlapped item; if no overlapped prefix is found, create new nodes labeled with the items in t.
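The construction loop described by the flow chart can be sketched as follows. This is a minimal Python illustration using the nine-transaction dataset of the example below; node-links in the header table are kept as simple Python lists rather than pointer chains:

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Build an FP-tree: items are inserted in descending support-count order,
    and a header table links all nodes that hold the same item."""
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    root = FPNode(None, None)
    header = defaultdict(list)   # item -> list of nodes (node-link chain)
    for t in transactions:
        # Keep only frequent items, ordered by descending support count
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:          # overlapped prefix: increment count
                node = node.children[item]
                node.count += 1
            else:                              # otherwise create a new node
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
                node = child
    return root, header

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
    ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]
root, header = build_fp_tree(transactions, min_sup=2)
print({item: sum(n.count for n in header[item]) for item in sorted(header)})
# {'I1': 6, 'I2': 7, 'I3': 6, 'I4': 2, 'I5': 2}
```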
Example: Find all frequent itemsets or frequent patterns in the following database by using the FP-growth algorithm. Take minimum support (min_sup) = 2.
TID    List of item IDs
1      I1, I2, I5
2      I2, I4
3      I2, I3
4      I1, I2, I4
5      I1, I3
6      I2, I3
7      I1, I3
8      I1, I2, I3, I5
9      I1, I2, I3
Now, building an FP-tree of the given transaction database. Here, itemsets are considered in order of
Constructing 1-itemsets and counting support count for each item set:
Itemset    Support count
I1         6
I2         7
I3         6
I4         2
I5         2
4. For Transaction 4: I2, I1, I4
Now, to facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links.
FP-tree construction is over. Now we need to find the conditional pattern base and the conditional FP-tree for each item. For this, start with the last item in the sorted order, that is I5; follow its node pointers and traverse only the paths containing I5, and accumulate all of the transformed prefix paths of that item to form a conditional pattern base.
Item    Conditional Pattern Base
I5      { {I2, I1: 1}, {I2, I1, I3: 1} }
Now constructing the conditional FP-tree based on the conditional pattern base for I5, by merging all paths and keeping nodes that appear ≥ min_sup times while eliminating all others. Here, min_sup is 2; nodes I2 and I1 pass this support, so we keep these two nodes in the prefix path co-occurring with I5, but I3 cannot pass the min_sup = 2, so it is eliminated from the prefix path co-occurring with I5. Therefore the conditional FP-tree for I5 contains only nodes I2 and I1 in the prefix path for I5, as shown below.

Conditional FP-tree for I5: {I2: 2, I1: 2}
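Because the conditional FP-tree for I5 is a single path, every combination of the path's nodes concatenated with the suffix I5 is a frequent pattern, and its count is the minimum count along the path. A small illustrative sketch:

```python
from itertools import combinations

# Single-path conditional FP-tree for I5 (from the text): I2:2 -> I1:2
conditional_path = {"I2": 2, "I1": 2}
suffix = "I5"

patterns = {}
nodes = list(conditional_path)
for k in range(1, len(nodes) + 1):
    for combo in combinations(nodes, k):
        count = min(conditional_path[i] for i in combo)   # count along the path
        patterns[tuple(sorted(combo)) + (suffix,)] = count

print(patterns)
# {('I2', 'I5'): 2, ('I1', 'I5'): 2, ('I1', 'I2', 'I5'): 2}
```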
Move to the next least frequent item in order, i.e., I4; follow its node pointers and traverse only the paths containing I4, and accumulate all of the transformed prefix paths of that item to form a conditional pattern base as shown below.

Item    Conditional Pattern Base
I4      { {I2, I1: 1}, {I2: 1} }

Now, constructing the conditional FP-tree based on the conditional pattern base for I4, by merging all paths and keeping nodes that appear ≥ min_sup times while eliminating all others. Here, min_sup is 2; only node I2 passes this support, so we keep this node in the prefix path co-occurring with I4, but I1 cannot pass the min_sup, so it is eliminated from the prefix path co-occurring with I4. Therefore, the conditional FP-tree for I4 contains only node I2 in the prefix path for I4, as shown below.

Conditional FP-tree for I4: {I2: 2}
Move to the next least frequent item in order, i.e., I3; follow its node pointers and traverse only the paths containing I3, and accumulate all of the transformed prefix paths of that item to form a conditional pattern base as shown below.

Item    Conditional Pattern Base
FROM ASSOCIATION MINING TO CORRELATION ANALYSIS

Normally, association rules that pass the min_sup and min_conf thresholds are called interesting association rules. But sometimes, association rules assessed and qualified by support and confidence as interesting association rules may actually be uninteresting. This shows that the support and confidence measures are insufficient at filtering out uninteresting association rules. The drawback of support is that many potentially interesting patterns involving low-support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example. Suppose we analyze the relationship between people who drink tea and coffee. For this, consider the following supermarket transaction database:
              Tea    Not tea    Total
Coffee        150    750        900
Not coffee    50     50         100
Total         200    800        1000
The information given in this table can be used to evaluate the association rule Tea → Coffee. At first glance, it may appear that people who drink tea also tend to drink coffee because the rule's support = 150/1000 = 15% and confidence = 150/200 = 75% values are reasonably high. This argument would have been acceptable except that the fraction of people who drink coffee, regardless of whether they drink tea, is 900/1000 = 90%, while the fraction of tea drinkers who drink coffee is only 150/200 = 75%. Thus, knowing that a person is a tea drinker actually decreases the probability of that person being a coffee drinker from 90% to 75%! The rule Tea → Coffee is therefore misleading despite its high confidence value.
The tea-coffee example shows that high-confidence rules can sometimes be misleading because the
confidence measure ignores the support of the itemset appearing in the rule consequent. One way to
address this problem is by applying a correlation analysis-based metric known as lift:
Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is
independent of the occurrence of itemset B if P(AUB) = P(A)P(B); otherwise, itemsets A and B are
dependent and correlated as events. This definition can easily be extended to more than two
itemsets. The lift between the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A) P(B))
If lift(A, B) is less than 1, then the occurrence of A is negatively correlated with the occurrence of B. If the resulting value is greater than 1, then A and B are positively correlated, that is, the occurrence of one implies the occurrence of the other. If the resulting value is equal to 1, then A and B are independent and there is no correlation between them. In our example, the lift value equals 0.15 / (0.2 × 0.9) ≈ 0.83, which clearly indicates the negative correlation between coffee and tea.
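The same calculation can be written as a short Python snippet using the numbers from the contingency table above:

```python
# Lift for the rule Tea -> Coffee from the contingency table above
n_total = 1000
n_tea = 200
n_coffee = 900
n_tea_and_coffee = 150

support_tea_coffee = n_tea_and_coffee / n_total    # P(Tea and Coffee) = 0.15
support_tea = n_tea / n_total                      # P(Tea) = 0.20
support_coffee = n_coffee / n_total                # P(Coffee) = 0.90

confidence = support_tea_coffee / support_tea                # 0.75
lift = support_tea_coffee / (support_tea * support_coffee)   # ~0.83

print(round(confidence, 2), round(lift, 2))   # 0.75 0.83
```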
EXERCISE
1. What is a frequent pattern?
2. Why is frequent pattern mining an important data mining task? Explain.
3. What is market basket analysis? Explain it with a suitable example.
4. What is an association rule? Why is it important? Explain.
5. Explain the different types of association rules with suitable examples.
6. Define the concepts of support, confidence and lift for association rule mining.
7. What is the Apriori principle? How is it used by the Apriori algorithm for frequent pattern mining? Explain.
8. What are interesting association rules?
9. How are interesting association rules generated by using association rule generation from the frequent pattern? Explain.
10. What are the limitations of the Apriori approach? How can these limitations be improved? Explain.
11. What is FP-growth? Discuss the FP-growth approach for frequent pattern mining.
12. Why is the FP-growth approach considered better than the Apriori approach? Explain.
13. What is an FP-tree? Differentiate it with a conditional FP-tree.
14. What are interesting association rules? Why is correlation analysis used as a supplement to the support and confidence framework for association rule assessment?
15. Use the Apriori algorithm to generate strong association rules from the following transaction database. Use min_sup = 2 and min_confidence = 75%.

TID    Itemsets
10     A, C, D
20     B, C, E
30     A, B, C, E
40     B, E
16. A database contains only the 9 items A = {A1, A2, A3, A4, A5, A6, A7, A8, A9} and has the transactions shown below. Let min_sup = 20% and min_conf = 60%. Find all frequent itemsets using the Apriori algorithm. List all the strong association rules.

A1 A2 A3 A4 A5 A6 A7 A8 A9
1  0  0  0  1  1  0  1  0
0  1  0  1  0  0  0  1  0
1  0  0  0  0  0  1  1  0
1  1  0  0  0  0  0  0  0
1  0  0  0  0  0  0  1  1
1  0  0  1  1  0  1  0  0
0  0  1  0  0  0  0  0  0
0  0  0  0  0  0  0  0  0
1  0  1  0  1  0  0  0  0
1  1  0  1  0  0  0  0  0
0  1  1  0  0  0  1  0  1
1  0  1  0  0  1  0  1  0
0  0  0  0  0  1  0  1  1
1-> presence of item in the transaction and
0-> absence of item in the transaction
17. A database has 10 transactions and contains only the 6 items A = {A1, A2, A3, A4, A5, A6}. Let min_sup = 30%. Find all frequent itemsets using the Apriori algorithm.

A1 A2 A3 A4 A5 A6
0 0 0 1 1 1
0 1 1 1 0 0
1 0 0 1 1 1
1 1 0 1 0 0
1 0 1 0 1 1
0 1 1 1 0 1
0 0 0 1 0 1
0 1 0 1 0 1
1 0 0 1 0 0
1 1 1 1 1 1
18. A database has five transactions. Let min_sup = 60% and min_conf = 80%. Find all frequent itemsets using the Apriori algorithm. List all the strong association rules.

TID    Items Bought
1      {M, O, N, K, E, Y}
2      {D, O, N, K, E, Y}
3      {M, A, K, E}
4      {M, U, C, K, Y}
5      {C, O, O, K, I, E}
19. Show, using an example, how the FP-tree algorithm solves the association rule mining (ARM) problem.
20. Perform ARM using FP-growth on the following data set with minimum support = 50% and confidence = 75%.

Transaction ID    Items
2                 Bread, Cheese, Juice
3                 Bread, Milk, Yogurt
□□□