Data Warehouse and Data Mining - Unit 5

FREQUENT PATTERN MINING

CHAPTER OUTLINE
• Frequent patterns
• Market basket analysis
• Frequent itemsets
• Closed itemsets
• Association rules
• Types of association rule: Single dimensional, Multidimensional, Multilevel, Quantitative
• Finding frequent itemsets and generating association rules from frequent itemsets: Apriori algorithm, Limitation and improving Apriori, FP-growth algorithm
• From Association Mining to Correlation Analysis, Lift

INTRODUCTION

Frequent pattern mining finds the most frequent and relevant patterns in a large dataset. Frequent patterns are patterns that appear frequently in a dataset. Finding frequent patterns plays an essential role in mining associations, correlations, causal structures, and many other interesting relationships among data. These mined relationships are used to make various decisions in the present and to plan for the future. Thus, frequent pattern mining is an important data mining task. Association rule mining is a popular method for discovering interesting co-occurring relations between variables in a large dataset.

For example, from the sales data of a stationery shop, frequent pattern mining finds pencil, paper and rubber as a frequent pattern. From this frequent pattern, association rule mining finds interesting associations of the form: customers that buy a pencil and paper are likely to buy a rubber. Such associations can be used as the basis for decisions about marketing activities such as recommending the next item to the customer, promotional pricing, product placements, catalog design, cross-marketing, sale campaign analysis, etc. In addition to the above example from market basket analysis, association rule mining is employed today in many application areas including web usage mining, intrusion detection, software bugs and bioinformatics.

This chapter introduces the basic concepts of frequent patterns and association rule mining and describes metrics to evaluate the interestingness of the mined association rules, such as support, confidence, and correlation analysis (e.g., lift). This chapter mainly focuses on various frequent pattern mining algorithms such as Apriori and FP-growth, and on association rule mining from the frequent patterns. At the end of this chapter, we study correlation analysis (lift) as a supplement to the support-confidence framework.

FREQUENT PATTERNS

Frequent patterns are itemsets, subsequences, or substructures that appear in a data set with frequency no less than a user-specified threshold. For example, a set of items, such as Paper and Pencil, that appear frequently together in a stationery transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a frequent sequential pattern. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently in a graph dataset, it is called a frequent structural pattern. Finding frequent patterns plays an essential role in mining associations, correlations, causal structures, and many other interesting relationships among data. Moreover, it helps in data classification, clustering, and other data mining tasks as well.

MARKET BASKET ANALYSIS

Market basket analysis is an example of frequent pattern mining mainly used by retailers to understand customer purchase behaviors. In market basket analysis retailers determine what items are frequently bought together or placed in the same basket by customers. Based on the combinations of products that frequently co-occur in transactions, the retailer uncovers associations between products. These uncovered associations are then used by retailers to make purchase suggestions to consumers. For example, when a person buys a particular model of smartphone, the retailer may suggest other products such as phone cases, screen protectors, memory cards or other accessories for that particular phone. This is due to the frequency with which other consumers bought these items in the same transaction as the phone. Similarly, stationery retailers also make recommendations of the next item to the customer based on baskets similar to those shown in figure 5.1 below.

Figure 5.1: Market baskets for stationery (Customer 1: Pencil, Paper; Customer 2: Pencil, Book, Rubber, Ink; Customer 3: Pencil, Book, Rubber, Ruler; Customer 4: Pencil, Paper, Book, Ruler, Rubber)

Market basket analysis only uses transactions with more than one item, as no associations can be made with single purchases. Besides this, it does not consider the order of items either within a transaction or across transactions.

Benefits of Market Basket Analysis

Market basket analysis can increase sales and customer satisfaction. Using data to determine what products are often purchased together, retailers can optimize product placement, offer special deals and create new product bundles to encourage further sales of these combinations. These improvements can generate additional sales for the retailer, while making the shopping experience more productive and valuable for customers. By using market basket analysis, customers may feel a stronger sentiment or brand loyalty toward the company.
Market basket analysis isn't limited to shopping baskets. Other areas where the technique is used include e-commerce websites, analysis of credit card purchases, analysis of telephone calling patterns, identification of fraudulent insurance claims, analysis of telecom service purchases, etc.

FREQUENT ITEMSETS

Transaction Dataset (D): In association rule mining, we are usually interested in the absolute number of customer transactions, also called baskets, that contain a particular set of items, usually products. A typical application of association rule mining is the analysis of consumer buying behavior in supermarkets, where the contents of shopping carts brought to the register for checkout are recorded. These transaction data normally consist of tuples of the form shown in table 5.1 below.

Table 5.1: Market Basket Transactions

TID   Items
1     Pencil, Paper
2     Pencil, Book, Rubber, Ink
3     Paper, Book, Rubber, Ruler
4     Pencil, Paper, Book, Rubber
5     Pencil, Paper, Book, Ruler
Transaction (T): A transaction (basket) is a set of items with a minimum of two items. Each transaction is identified with a unique identifier. Items can belong to several transactions. For example, in the given transaction dataset, the Transaction ID (TID) is used as a unique identifier for transactions. The given dataset has a total of five transactions. Transaction 1 shows an itemset containing the items Pencil and Paper. Transaction 2 shows an itemset containing the items Pencil, Book, Rubber and Ink. Transaction 3 shows an itemset containing the items Paper, Book, Rubber and Ruler. Transaction 4 shows an itemset containing the items Pencil, Paper, Book and Rubber. Transaction 5 shows an itemset containing the items Pencil, Paper, Book and Ruler. Following are some assumptions about the shape of the data for association rule mining:

Items: Depending on the application field, they can be products, objects, patients, events, etc.

Itemset: An itemset is a collection of one or more items. For example, {Paper, Pencil, Book} is one possible itemset.

K-itemset: A k-itemset is a collection of exactly k items. When k = 1 the k-itemset is a 1-itemset, when k = 2 the k-itemset is a 2-itemset, and so on. For example, {Paper, Pencil, Book} is one possible 3-itemset and {Paper, Pencil} is one possible 2-itemset.

Frequent Itemset: An itemset whose support is greater than or equal to a minimum support threshold is called a frequent itemset. Support of an itemset is the frequency of occurrence of the itemset in the transactions of a given transaction dataset. For example, the support of the itemset {Paper, Pencil, Book} is 2, because this itemset appears in only 2 transactions of the given transaction dataset. The minimum support threshold is the minimum value of support required for an itemset to be frequent. For example, if the minimum support threshold is 2, then the itemset {Paper, Pencil, Book} is frequent because the support of this itemset is 2, which is greater than or equal to the minimum support threshold of 2. Frequent itemsets are generally used for association rule generation.
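To make the support computation concrete, here is a minimal Python sketch (the function and variable names are illustrative, not from the text) that counts the support of an itemset over the Table 5.1 transactions:

```python
# Transactions from Table 5.1 (one set of items per transaction)
transactions = [
    {"Pencil", "Paper"},
    {"Pencil", "Book", "Rubber", "Ink"},
    {"Paper", "Book", "Rubber", "Ruler"},
    {"Pencil", "Paper", "Book", "Rubber"},
    {"Pencil", "Paper", "Book", "Ruler"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# {Paper, Pencil, Book} appears in 2 transactions, so with min_sup = 2 it is frequent.
print(support_count({"Paper", "Pencil", "Book"}, transactions))  # 2
```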

CLOSED ITEMSETS

A long pattern contains a combinatorial number of sub-patterns. For example, {a1, ..., a100} contains (100 choose 1) + (100 choose 2) + ... + (100 choose 100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns. The solution to this problem is to mine closed patterns instead of mining all frequent patterns. A closed pattern is a lossless compression of frequent patterns, that is, the closed itemsets contain complete information regarding the frequent itemsets, but the closed itemsets are fewer in number than the frequent itemsets. This helps to reduce the number of patterns and rules.

Figure 5.2: Relation between frequent and closed itemsets (the closed itemsets are a subset of the frequent itemsets)

An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X. Itemset X is not closed if at least one of its supersets has the same support count as X. For example, consider the transaction DB shown below.

Table 5.2: Transactions

Tid   Items
1     {Pencil, Paper}
2     {Paper, Rubber, Ruler}
3     {Pencil, Paper, Rubber, Ruler}
4     {Pencil, Paper, Ruler}
5     {Pencil, Paper, Rubber, Ruler}

Table 5.3: Support of 1-itemsets and 2-itemsets

Itemset              Support
{Pencil}             4
{Paper}              5
{Rubber}             3
{Ruler}              4
{Pencil, Paper}      4
{Pencil, Rubber}     2
{Pencil, Ruler}      3
{Paper, Rubber}      3
{Paper, Ruler}       4
{Rubber, Ruler}      3

Table 5.4: Support of 3-itemsets and 4-itemsets

Itemset                          Support
{Pencil, Paper, Rubber}          2
{Pencil, Paper, Ruler}           3
{Pencil, Rubber, Ruler}          2
{Paper, Rubber, Ruler}           3
{Pencil, Paper, Rubber, Ruler}   2

In this example, {Paper} is a closed itemset because all of its supersets have less support count than the itemset {Paper}. But other itemsets such as {Pencil}, {Rubber}, and {Ruler} are not closed itemsets, because each of them has a superset with the same support count. Here, the itemset {Pencil, Paper} is also closed, because its supersets {Pencil, Paper, Rubber}, {Pencil, Paper, Ruler} and {Pencil, Paper, Rubber, Ruler} all have less support count than {Pencil, Paper}.
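The closedness test itself is easy to state in code. The following Python sketch (illustrative names, assuming the Table 5.2 transactions) checks whether an itemset is closed by looking for a one-item extension with the same support; by the anti-monotone property of support, checking one-item extensions is enough:

```python
transactions = [
    {"Pencil", "Paper"},
    {"Paper", "Rubber", "Ruler"},
    {"Pencil", "Paper", "Rubber", "Ruler"},
    {"Pencil", "Paper", "Ruler"},
    {"Pencil", "Paper", "Rubber", "Ruler"},
]
items = set().union(*transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset):
    # X is closed if no proper superset of X has the same support as X.
    s = support(itemset)
    return all(support(itemset | {extra}) < s for extra in items - itemset)

print(is_closed({"Paper"}))             # True  (support 5, every superset is smaller)
print(is_closed({"Pencil"}))            # False ({Pencil, Paper} also has support 4)
print(is_closed({"Pencil", "Paper"}))   # True
```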

ASSOCIATION RULES

Association rules are "if-then" statements that help to show the probability of relationships between data items within large data sets in various types of databases. An association rule is an implication expression of the form X → Y, where X and Y are itemsets. The association rule can be read as: given that someone has purchased the items from the set X, they are likely to also buy the items in the set Y. For example, the association rule {Paper, Pencil} → {Rubber} means that given that someone has purchased the items from the set {Paper, Pencil}, they are likely to also buy the items in the set {Rubber}. This rule can also be written in the form of predicates as buys(someone, Paper) ∧ buys(someone, Pencil) → buys(someone, Rubber). Association rules express a co-occurring relation among items, not a causal relationship.

It is not very difficult to develop algorithms that will find associations in a large database. The problem is that such an algorithm will also uncover many other associations that are of very little value. Therefore, it is necessary to introduce some measures to distinguish interesting associations from non-interesting ones. These measures are called rule evaluation metrics.

Types of Association Rules

The popularity of association rule mining has led to its application on many types of data and in many application domains. Some specialized kinds of association rules have been reported in the data mining literature. We describe here some of the important types:
1. Quantitative association rule: Quantitative association rules refer to a special type of association rules of the form X → Y, with X and Y consisting of a set of numerical and/or categorical attributes. Different from general association rules, where both the left-hand and the right-hand sides of the rule should be categorical attributes, at least one attribute of the quantitative association rule (left or right) must involve a numerical attribute. For example,
   Age(x, "30...39") ∧ salary(x, "42...48K") → buys(x, "car").
2. Boolean Association Rule: If a rule involves associations between the presence or absence of items, it is a Boolean association rule. For example, buys(X, "laptop computer") → buys(X, "HP printer").
3. Single dimensional Association Rule: If the items or attributes in an association rule
reference only one dimension, then it is a single-dimens ional association rule. For example,
the rule buys(X, "computer") ➔buys(X, "antivirus software") is single dimensional
association rule. Another example of single dimensional association rule is buys(X, "Paper")
➔ buys(X, "Pencil").
4. Multidimensional Association Rule: If a rule references two or more dimensions, such as the
dimensions age, income and buys, then it is a multidimensio nal association rule. For example,
age(X, "30... 39") /\ income(X, "42K. .. 48K") ➔ buys(X, "high resolution TV"). Based on
whether a predicate repeat or not there are two types of multidimensio nal association rule:
a. Inter-dimension association rules: In this type of multidimensio nal association rule
there is no repeated predicates. For example, age(X,"19-25") /\Occupation(X, "student")
➔ buys(X, "Ruler")
b. Hybrid-dimension association rules: In this type of multidimensio nal association rule
there is repeated predicates. For example, age(X,"19-25") /\ buys(X, "popcorn") ➔
buys(X, "Ruler")
5. Multilevel Association Rule: Association rules generated from mining data at multiple levels of abstraction are called multilevel association rules. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working downward in the hierarchy toward the more specific concept levels, until no more itemsets can be found. For each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or its variations. For example, first find high-level strong rules: computer → keyboard [20%, 60%]. Then find their lower-level "weaker" rules: "laptop computer" → "laptop keyboard" [6%, 50%].

Level 1 (min_sup = 5%): Computer
Level 2 (min_sup = 5%): Laptop Computer [support = 6%], Desktop Computer [support = 4%]

Figure 5.2: Multilevel mining with uniform support


There are a number of variations to this approach, where each variation involves "playing" with the support threshold in a slightly different way. Some are described below:
a. Uniform Support: The same minimum support is used when mining at each level of abstraction. In the figure above, a minimum support threshold of 5% is used throughout. Both "computer" and "laptop computer" are found to be frequent, while "desktop computer" is not.
When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only one minimum support threshold. An Apriori-like optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item whose ancestors do not have minimum support. The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss some meaningful associations occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels.
b. Reduced Support: Each level of abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the corresponding threshold is. For example, in figure 5.3, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, "computer", "laptop computer" and "desktop computer" are all considered frequent.

Level 1 (min_sup = 5%): Computer [support = 10%]
Level 2 (min_sup = 3%): Laptop Computer [support = 6%], Desktop Computer [support = 4%]

Figure 5.3: Multilevel Mining with Reduced Support


c. Group-based Support: Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item, or group-based minimal support thresholds when mining multilevel rules. For example, a user could set up the minimum support thresholds based on product price, or on items of interest, such as by setting particularly low support thresholds for laptop computers and flash drives in order to pay particular attention to the association patterns containing items in these categories.
pule!'

SUPPORT, CONFIDENCE AND LIFT

Imagine a stationery store with tens of thousands of different products. We wouldn't want to calculate all associations between every possible combination of products. Instead, we would want to select only potentially relevant rules from the set of all possible rules. Therefore, we use the measures support, confidence and lift to reduce the number of associations we need to analyze.

1. Support (S): The support metric tells us how popular an itemset is, as measured by the proportion of transactions in which the itemset appears. Support of an itemset X is calculated as:

   Support(X) = (Number of transactions that contain X) / (Total number of transactions in the given dataset)

   In our data shown in table 5.1, the support of the itemset {Pencil} is:

   Support({Pencil}) = (Number of transactions that contain {Pencil}) / (Total number of transactions in the given dataset) = 4/5 = 0.8 = 80%

   The support of the itemset {Pencil, Paper} is:

   Support({Pencil, Paper}) = (Number of transactions that contain {Pencil, Paper}) / (Total number of transactions in the given dataset) = 3/5 = 0.6 = 60%
2. Confidence: Confidence tells us how likely an itemset Y is purchased given that itemset X is purchased, expressed as X → Y. It is measured by the proportion of transactions with itemset X in which itemset Y also appears. Confidence of an association rule X → Y is calculated as:

   Confidence(X → Y) = (Number of transactions that contain the itemset {X, Y}) / (Number of transactions that contain the itemset {X})

   For example, the confidence of the association rule {Paper, Book} → {Rubber} is:

   Confidence({Paper, Book} → {Rubber}) = (Number of transactions that contain {Paper, Book, Rubber}) / (Number of transactions that contain {Paper, Book}) = 2/3 ≈ 0.67 = 67%

   Hence, the confidence can be interpreted as an estimate of the conditional probability P(Y|X). In other words, this is the probability of finding the itemset Y in transactions under the condition that these transactions also contain the itemset X. Confidence is directed and therefore usually gives different values for the rules X → Y and Y → X. In our example, the confidence that {Rubber} is purchased given that {Paper, Book} is purchased ({Paper, Book} → {Rubber}) is 2 out of 3, or 67%. This means the conditional probability P({Rubber} | {Paper, Book}) = 67%.
   Note that the confidence measure might misrepresent the importance of an association. This is because it only accounts for how popular itemset X is (in our case {Paper, Book}) but not Y (in our case {Rubber}). If {Rubber} is also very popular in general, there will be a higher chance that a transaction containing {Paper, Book} will also contain {Rubber}, thus inflating the confidence measure. To account for the base popularity of both items, we use a third measure called lift.
3. Lift: Lift tells us how likely itemset Y is purchased when itemset X is purchased, while controlling for how popular items Y and X are. It measures how many times more often X and Y occur together than expected if they were statistically independent.

   Lift(X → Y) = Support({X, Y}) / (Support(X) × Support(Y))

   • If the value of Lift = 1: there is no association between the itemsets.
   • If the value of Lift > 1: itemset Y is likely to be bought if itemset X is bought.
   • If the value of Lift < 1: itemset Y is unlikely to be bought if itemset X is bought.
For example, the lift for the association rule {Paper, Book} → {Rubber} is calculated as:

Lift({Paper, Book} → {Rubber}) = Support({Paper, Book, Rubber}) / (Support({Paper, Book}) × Support({Rubber})) = (2/5) / ((3/5) × (3/5)) = 0.4 / 0.36 ≈ 1.11

The lift of {Paper, Book} → {Rubber} is about 1.11, which is greater than 1; this implies that itemset {Rubber} is likely to be bought if itemset {Paper, Book} is bought.
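The three measures can be computed together with a short Python sketch over the Table 5.1 transactions (function names are illustrative; lift uses the standard support-based formula given above):

```python
transactions = [
    {"Pencil", "Paper"},
    {"Pencil", "Book", "Rubber", "Ink"},
    {"Paper", "Book", "Rubber", "Ruler"},
    {"Pencil", "Paper", "Book", "Rubber"},
    {"Pencil", "Paper", "Book", "Ruler"},
]
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

def confidence(x, y):
    return support(x | y) / support(x)

def lift(x, y):
    return support(x | y) / (support(x) * support(y))

x, y = {"Paper", "Book"}, {"Rubber"}
print(round(support(x | y), 2))    # 0.4
print(round(confidence(x, y), 2))  # 0.67
print(round(lift(x, y), 2))        # 1.11
```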

FINDING FREQUENT ITEMSETS AND GENERATING ASSOCIATION RULES FROM FREQUENT ITEMSETS

Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minimum support threshold and confidence ≥ minimum confidence threshold. If an association rule A → B [Support, Confidence] satisfies the minimum support and minimum confidence thresholds, then it is called a strong rule, otherwise it is called a weak rule. So, we can say that the goal of association rule mining is to find all strong rules.

Brute-Force Approach for Association Rule Mining

1. List all possible association rules.
2. Compute the support and confidence for each rule.
3. Prune rules that fail the minimum support and minimum confidence thresholds.

For example: Find the association rules from the itemset {Paper, Book, Rubber} based on the dataset D shown in table 5.1. Based on the given dataset D, all possible association rules and their support and confidence for the itemset {Paper, Book, Rubber} are as shown below:
{Paper, Book} → {Rubber} (s = 0.4, c = 0.67)
{Paper, Rubber} → {Book} (s = 0.4, c = 1.0)
{Book, Rubber} → {Paper} (s = 0.4, c = 0.67)
{Rubber} → {Paper, Book} (s = 0.4, c = 0.67)
{Book} → {Paper, Rubber} (s = 0.4, c = 0.5)
{Paper} → {Book, Rubber} (s = 0.4, c = 0.5)

Similarly, for each itemset we can generate all possible association rules and calculate their support and confidence, and then compare the support and confidence of each rule with the minimum support and minimum confidence thresholds to filter the strong rules from the weak rules. But this approach is computationally prohibitive. As we can see in the above calculation, all of the above rules are binary partitions of the same itemset {Paper, Book, Rubber}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence calculations to eliminate the redundant calculation of support for the same itemset. For this we can view association rule mining as a two-step approach:
1• t whose support ~ minimum sup
1. Frequent Itemset Generation: Gc1wratc a1 itcmsc 5 • Port
.d 1 . from each frequent 1temset, where
2. Rule Generation: Gcncr,1tc high conf1 cncc ru cs each
rule is a binary parht10ning of a trcqucnl itcmsct. .
. . t
Frequent item.set generation step 1s still comput,, tona lly expensive! Because for a g1Ven d it
. . ellls,
this method still ~encratcs 2d possible c,mdidatc itcmscls as shown m figure below.
'-

Figure 5.4: 2d ltem1et Generation

In this case there are d= 4 items namely a, b, c and dare given for itemset generation, the item
set generation process generates total 24=16 itemset. These generated itemsets are: {a, b, c, dl,
{a, b, c}, {a, b, d}, {a, c, d}, {b, c, d}, {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}, {a}, {b}, {c}, {d}, {null!.
But we can reduce the complexity of itemset generation by using Apriori principle.
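For illustration, the brute-force enumeration of all 2^d candidate itemsets takes only a few lines of Python with itertools (a sketch for the d = 4 example above):

```python
from itertools import combinations

items = ["a", "b", "c", "d"]

# Enumerate every subset of the d items, including the empty set: 2^4 = 16 itemsets.
candidates = [set(c) for k in range(len(items) + 1) for c in combinations(items, k)]
print(len(candidates))  # 16
```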

APRIORI PRINCIPLE

The Apriori principle states that if an itemset is frequent, then all of its subsets must also be frequent. For example, if {Rubber, Book, Pencil} is frequent, so are {Rubber, Book}, {Rubber, Pencil}, {Book, Pencil}, {Rubber}, {Book} and {Pencil}. The Apriori principle holds due to the following property of the support measure:

∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

where X and Y are itemsets. That is, the support of an itemset never exceeds the support of its subsets. This property is also known as the anti-monotone or Apriori property of support.
For example, consider the dataset D shown in table 5.1 above. Based on the given dataset, the support of the itemset {Pencil} = 4, which is greater than the support of the superset itemset {Pencil, Rubber} = 2. Similarly, support({Paper}) > support({Pencil, Paper}) and support({Book, Rubber}) > support({Book, Rubber, Ruler}) also hold. This property is used by the Apriori algorithm to reduce the complexity of candidate generation as: "If there is any itemset which is infrequent, its superset should not be generated or tested." For example, if the itemset {a, b} is infrequent, then we do not need to take into account all its supersets {a, b, c}, {a, b, d} and {a, b, c, d}. That is, these supersets of the infrequent itemset {a, b} are pruned by the Apriori algorithm. Hence, this elimination helps to improve the itemset generation process.

Figure 5.5: Elimination of superset itemsets of the infrequent itemset {a, b} by using the Apriori principle ({a, b} is found to be infrequent, so its supersets are pruned)
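A quick numeric check of this anti-monotone property on the Table 5.1 data, as a small Python sketch (illustrative, not from the text):

```python
transactions = [
    {"Pencil", "Paper"},
    {"Pencil", "Book", "Rubber", "Ink"},
    {"Paper", "Book", "Rubber", "Ruler"},
    {"Pencil", "Paper", "Book", "Rubber"},
    {"Pencil", "Paper", "Book", "Ruler"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# The support of a superset never exceeds the support of its subset.
assert support_count({"Pencil"}) >= support_count({"Pencil", "Rubber"})                  # 4 >= 2
assert support_count({"Book", "Rubber"}) >= support_count({"Book", "Rubber", "Ruler"})   # 3 >= 1
```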

APRIORI ALGORITHM

The Apriori algorithm was the first algorithm proposed for frequent itemset mining. It was later improved by R. Agrawal and R. Srikant and came to be known as Apriori. This algorithm uses two steps, "join" and "prune", to reduce the search space. It is an iterative approach to discover the most frequent itemsets.
The Apriori algorithm is a candidate generation and test approach for frequent itemset generation. It is based on the Apriori principle, which states that "If there is any itemset which is infrequent, its superset should not be generated or tested". Therefore, the Apriori algorithm initially scans the transaction dataset once to get the candidate 1-itemsets and then the frequent 1-itemsets, generates length (k+1) candidate itemsets from length k frequent itemsets, and tests the candidates against the transaction dataset to filter the frequent itemsets from the infrequent itemsets present in the candidate itemsets. This process terminates when no frequent or candidate set can be generated.
The steps followed in the Apriori algorithm of data mining are:

Input:
Transaction database D
Minimum support count threshold

Output:
Fk: frequent itemsets of size k in D.

• Join Step: This step generates (k+1)-itemsets from k-itemsets by joining each itemset with itself.
• Prune Step: This step scans the count of each candidate itemset in the database. If a candidate itemset does not meet minimum support, then it is regarded as infrequent and thus it is removed. This step is performed to reduce the size of the candidate itemsets.

Method:
1. Initially, each item present in the transaction database is a member of the set of candidate 1-itemsets, C1. Find the support count of each candidate itemset present in C1 by scanning and counting the number of occurrences of each item in all of the transactions. From the candidate 1-itemsets C1, create the frequent 1-itemsets F1 by putting the candidate 1-itemsets from C1 satisfying the minimum support threshold into F1.
2. for (k = 1; Fk != ∅; k++) do // terminate when no frequent or candidate itemsets can be generated
   {
   a. Candidate Generation: Generate the length (k+1) candidate itemsets Ck+1 from the length k frequent itemsets Fk as:
      i. Join Step: Ck+1 is generated by joining Fk with itself, that is, create Ck+1 by combining itemsets present in Fk.
      ii. Prune Step: Prune candidate itemsets in Ck+1 containing subsets of length k that are infrequent.
   b. Support Counting: Count the support of each candidate in Ck+1 by scanning the database: for each transaction t in the database, increment the count of all candidates in Ck+1 that are contained in t.
   c. Create the frequent length (k+1) itemsets Fk+1 by eliminating the length (k+1) candidates that are infrequent, that is, Fk+1 = candidates in Ck+1 with count ≥ min_support.
   }
3. Output all frequent itemsets of size k (Fk) and stop.
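The method above can be condensed into a compact Python sketch of the Apriori loop (illustrative names, not production code; transactions are assumed to be given as sets and min_sup as an absolute count):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    def count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:                     # one database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    # C1 -> F1
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    all_frequent, k = dict(frequent), 1

    while frequent:
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        prev = list(frequent)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = count(candidates)               # support counting
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

Calling apriori on the six transactions of the worked example below with min_sup = 3 should reproduce the frequent itemsets derived there, ending with F3 = {I1, I2, I3}.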
ASSOCIATION RULE GENERATION FROM THE FREQUENT ITEMSETS

After the frequent itemsets are obtained by applying the frequent itemset generation algorithm to the given transaction database, the next step is association rule generation from the frequent itemsets. Each association rule generated from a frequent itemset automatically satisfies the minimum support by default, but it may not satisfy the minimum confidence. So, to filter strong rules from weak rules we need to check only the minimum confidence threshold. It is a two-step process as shown below:
1. Given a frequent itemset F, find all non-empty subsets S ⊂ F.
2. For every non-empty subset S of F, output the rule S → (F - S) if it satisfies the minimum confidence requirement. That is, if support(F) / support(S) ≥ minimum confidence threshold, then output the rule S → (F - S) as a strong rule.

For example, if {A, B, C, D} is a frequent itemset F, based on this frequent itemset F the possible candidate rules are: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB. Among these possible candidate rules, output only those rules satisfying the minimum confidence threshold.

If |F| = k, then there are 2^k - 2 candidate association rules (ignoring F → ∅ and ∅ → F). In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D). But the confidence of rules generated from the same itemset does have an anti-monotone property. For example, suppose {A, B, C, D} is a frequent 4-itemset: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD). This implies confidence is anti-monotone with respect to the number of items on the RHS of the rule.
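The two-step rule generation can likewise be sketched in Python (illustrative helper; support_count is any function returning the absolute support of an itemset, here a simple scan of the Table 5.1 transactions):

```python
from itertools import combinations

def generate_rules(freq_itemset, support_count, min_conf):
    """Output strong rules S -> (F - S) from one frequent itemset F."""
    f = frozenset(freq_itemset)
    rules = []
    for k in range(1, len(f)):                    # all non-empty proper subsets S
        for s in map(frozenset, combinations(f, k)):
            conf = support_count(f) / support_count(s)
            if conf >= min_conf:                  # keep only strong rules
                rules.append((set(s), set(f - s), conf))
    return rules

transactions = [
    {"Pencil", "Paper"},
    {"Pencil", "Book", "Rubber", "Ink"},
    {"Paper", "Book", "Rubber", "Ruler"},
    {"Pencil", "Paper", "Book", "Rubber"},
    {"Pencil", "Paper", "Book", "Ruler"},
]
sc = lambda itemset: sum(1 for t in transactions if itemset <= t)
for s, rhs, conf in generate_rules({"Paper", "Book", "Rubber"}, sc, min_conf=0.7):
    print(s, "->", rhs, round(conf, 2))   # only {Paper, Rubber} -> {Book} reaches 1.0
```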

Figure 5.6: Lattice of association rules (a low-confidence rule and all rules below it in the lattice, i.e., those with larger consequents, are pruned)
Flow chart for frequent itemset generation and association rule mining by the Apriori algorithm:

1. Frequent itemset generation
Step 1: Scan the transaction database to get the support S of each 1-itemset, compare S with min_sup, and get a set of frequent 1-itemsets F1.
Step 2: Use Fk-1 join Fk-1 to generate a set of candidate k-itemsets, and use the Apriori property to prune the infrequent k-itemsets from this set.
Step 3: Scan the transaction database to get the support S of each candidate k-itemset, compare S with min_sup, and get a set of frequent k-itemsets Fk.
Step 4: If the candidate set is not null, repeat from Step 2; otherwise continue with rule generation.

2. Association rule generation
Step 5: For each frequent itemset I, generate all non-empty subsets of I.
Step 6: For every non-empty subset S of I, output the association rule S → (I - S) if the confidence of this rule is ≥ min_conf.

Example: Find the frequent itemsets based on the following transaction dataset and then generate the association rules from the frequent itemsets by using the Apriori algorithm, and output only the strong association rules. Use support threshold = 50%, confidence = 80%.

Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4

T•rcqucnt Po1tcr11 Mlnlug O CHAPTER 5 _ J 107
Solution: Here, the given support threshold = 50%, so min_sup = 0.5 × 6 = 3.

Iteration 1: First of all, create the itemsets of size 1; this itemset is called the candidate 1-itemset, C1, and calculate their support values.

Itemset   Support Count
I1        4
I2        5
I3        4
I4        4
I5        2

Now generate the frequent 1-itemset F1 from the candidate 1-itemset C1. For this we compare the support count value of each itemset with min_sup = 3 and eliminate itemsets whose count is less. As you can see here, the itemset I5 has a support count of 2, which is less than the minimum support value 3; since it does not meet min_sup = 3 it is discarded in the upcoming iterations. Only I1, I2, I3 and I4 meet the min_sup count. We have the frequent 1-itemset F1 as shown below.

Itemset   Support Count
I1        4
I2        5
I3        4
I4        4

Iteration 2: Candidate 2-itemset C2 and frequent 2-itemset generation. Here, the candidate 2-itemset C2 is generated by joining the frequent 1-itemset with itself, that is, by finding all possible combinations of itemsets in F1. After the candidate itemsets have been found, we prune those itemsets which contain an already infrequent itemset, but here no such itemset is present. Now, we find the support count for each 2-itemset as shown below.

Itemset   Support Count
I1, I2    4
I1, I3    3
I1, I4    2
I2, I3    4
I2, I4    3
I3, I4    2

Now, comparing the support count of each itemset with min_sup, itemsets having support less than 3 are eliminated again. Here, the itemsets {I1, I4} and {I3, I4} do not meet min_sup, thus they are deleted, so we obtain the following frequent 2-itemset F2.

Itemset   Support Count
I1, I2    4
I1, I3    3
I2, I3    4
I2, I4    3
te 3-itemset c3 is
lteration 3: Cnndtdatc 3-,temsct C, and rrcquent 3 itemsct generation. J Jf•re, candida
combina tions of
generate d b) joining frequent 2•1tcmsct to itself that is by finding the all-possi ble
itt.'mset inF2. Afta the candidate item cts Jun c l"leen found we prune those 1temscts which contains
alread) infrequent itemsct Here nil poss1hlc CJ ,,11,: {l l ,12,IJ}, fl I, 12,1l}, 111,J3,J.1}, Il2,T3,f4}.
We can
frequent,
see for itemset Ill, 12, nJ subset<: 111, 12}, {11, n}, (12, 13} .nc occurrin g in F2 thus {Jl, 12, 13} is
it is not
\'\~ can see for \tcmsct {11, 12, 14} subsets, {It, 121, {II , Ill, (12, J4}, {11, f.1} is not frequent , as
occumn g m F2 thus Ul 12, 141 1" not fn.,1ucnt, hence it 1s deleted. We ccJn see for itemset {l1,
13, 14}
14} is not
subsets, {11., 13) Ill 14} {13, 14J, (11, 14} is not frC'qucnt , JS it is not occurring in F2 thus {Il, 12,
{13, 14)
frequent, hence it is deleted. \\ c can S<'C for itl!mset {12, 13, 14} subsets, (12, 13}, {12, T4), {13, J4},
1, not frequent, as 1t 15 not occurring in F2 thus {Il, 12, 14} is not
frequent, hence it is deleted. Now, we
find the support count for each 3-itemse t CJ as shown in below.
ltemset Support Count
11,1213 3

Support of the 1tem...--et {11, 12, 15} is 3 so, it passes the m.in_sup =3 so, F, is
ltemset Support Count
11,12,13 3

Form this frequent itemset F3 we cannot further generate candida te iternsets so we


stop here and our
final frequent 1temset is f3 = {11, 12, 13}
Now, generating association rules from the frequent itemset {I1, I2, I3} obtained above. From the frequent itemset F3 discovered above, the associations could be:
1. {I1, I2} → {I3}
   Confidence = support{I1, I2, I3} / support{I1, I2} = 3/4 × 100 = 75%
2. {I1, I3} → {I2}
   Confidence = support{I1, I2, I3} / support{I1, I3} = 3/3 × 100 = 100%
3. {I2, I3} → {I1}
   Confidence = support{I1, I2, I3} / support{I2, I3} = 3/4 × 100 = 75%
4. {I1} → {I2, I3}
   Confidence = support{I1, I2, I3} / support{I1} = 3/4 × 100 = 75%
5. {I2} → {I1, I3}
   Confidence = support{I1, I2, I3} / support{I2} = 3/5 × 100 = 60%
6. {I3} → {I1, I2}
   Confidence = support{I1, I2, I3} / support{I3} = 3/4 × 100 = 75%
7. Now, filtering the strong rules from the weak rules and outputting the strong rules only. For every subset S of I, we output the rule S → (I - S) if support count(I) / support count(S) ≥ the minimum confidence threshold. Here, association rule number 2 is a strong rule since it passes the minimum confidence threshold of 80%. So, the output strong association rule is {I1, I3} → {I2}.
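The result of this worked example can be double-checked with a few lines of Python (names are illustrative); it re-counts the supports and confirms that only {I1, I3} → {I2} reaches the 80% confidence threshold:

```python
transactions = [
    {"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"},
]
count = lambda s: sum(1 for t in transactions if s <= t)

f3 = {"I1", "I2", "I3"}                       # frequent 3-itemset found by Apriori
for s in [{"I1", "I2"}, {"I1", "I3"}, {"I2", "I3"}, {"I1"}, {"I2"}, {"I3"}]:
    conf = count(f3) / count(s)
    print(sorted(s), "->", sorted(f3 - s), f"{conf:.0%}",
          "strong" if conf >= 0.8 else "weak")
```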
FACTORS AFFECTING COMPLEXITY OF APRIORI

The following factors affect the complexity of Apriori:
1. Choice of minimum support threshold: lowering the support threshold results in more frequent itemsets. This may increase the number of candidates and the maximum length of frequent itemsets.
2. Dimensionality (number of items) of the data set: more space is needed to store the support count of each item. If the number of frequent items also increases, both computation and I/O costs may increase.
3. Size of the database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions.
4. Average transaction width: transaction width increases with denser data sets. This may increase the maximum length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width).

LIMITATION AND IMPROVING APRIORI

Limitations: The Apriori algorithm is easy to understand and its join and prune steps are easy to implement on large itemsets in large databases. Along with these advantages it has a number of limitations. These are:
1. Huge number of candidates: Candidate generation is the inherent cost of the Apriori algorithm, no matter what implementation technique is applied. It is costly to handle a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets. Moreover, to discover a pattern of 100 items it must generate more than 2^100, which is approximately 10^30, candidates in total.
2. Multiple scans of the transaction database: to mine large data sets for long patterns this algorithm is not a good choice.
3. When the database is scanned to check Ck for creating Fk, a large number of transactions will be scanned even if they do not contain any k-itemset.
Methods to Improve Apriori Efficiency: To improve the Apriori efficiency need to reduce
passes of transaction database scans, shrink number of candidates, facilitate support counting of
candidates. Many methods are available for improving the efficiency of the algorithm.
1. Hash-Based Technique: This method uses a hash-based structure called a hash table for
generating the k-itemsets and its corresponding count. It uses a hash function for
generating the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in future iterations. The transactions which do not contain any frequent k-itemset are marked or removed, because such transactions cannot contain a frequent (k+1)-itemset (see the sketch after this list).
3. Partitioning: This method requires only two database scans to mine the frequent
itemsets. It says that for any itemset to be potentially frequent in the database, it should
be frequent in at least one of the partitions of the database. In Scan 1 partition database
and find local frequent patterns and In Scan 2 consolidate global frequent patterns.
D.itubasc D anc.f then searches fo
4. Sam,pling: Thi'- method picks a r.mdom s.implc S from
to verify frequ ent itemsets fou~
frequent 1temset m S U'-11\S Apnon. Sc.m d 1t11b,1sc once
,ire checked. For exam ple, check
m sample S onh lX'lrdcrs of closure o( frequent patterns
abed inste ad of ab, ar. ..., etc. &.rn d.1t.1b.1sc ugain
to find missed frequ ent patterns. It
c,111 be reduced by lowering t~
ma, be possible to lose a glolx1l tr(.'(JUCnl 1tcmsct. This
nun..sup.
new cand idate itemsets at any
5. Dynamic ltemsl't Counting: This technique c.1n add r
mark ed start p'1m t ot th~ datab .lse durin g the scan ning of the datab ase. Find longe
local datab ase parti tions.
tn:quent patterns based on short er frequ ent patte rns and
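As a concrete illustration of the transaction-reduction idea (item 2 in the list above), the following Python sketch drops, before the next pass, every transaction that contains no frequent k-itemset; the item names and counts in the example are hypothetical:

```python
from itertools import combinations

def reduce_transactions(transactions, frequent_k, k):
    """Keep only transactions that contain at least one frequent k-itemset.

    A transaction with no frequent k-itemset cannot contain a (k+1)-itemset whose
    subsets are all frequent, so it is useless for later passes.
    """
    kept = []
    for t in transactions:
        if any(frozenset(c) in frequent_k for c in combinations(sorted(t), k)):
            kept.append(t)
    return kept

# Hypothetical example: after the frequent 2-itemsets are known,
# transactions that contain none of them are skipped in later scans.
frequent_2 = {frozenset({"I1", "I2"}), frozenset({"I2", "I3"})}
db = [{"I1", "I2", "I3"}, {"I4", "I5"}, {"I2", "I3", "I4"}]
print(reduce_transactions(db, frequent_2, 2))  # the {I4, I5} transaction is dropped
```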

APPLICATION AREAS OF APRIORI ALGORITHM

• In the education field: extracting association rules in data mining of admitted students through characteristics and specialties.
• In the medical field: for example, analysis of the patients' database.
• In forestry: analysis of the probability and intensity of forest fires using forest fire data.
• Apriori is used by many companies, like Amazon in its recommender system and by Google for the auto-complete feature.

Advantages and Disadvantages of Apriori Algorithm
Advantages
• Easy to understand algorithm.
• The join and prune steps are easy to implement on large itemsets in large databases.
Disadvantages
• It requires high computation if the itemsets are very large and the minimum support is kept very low.
• The entire database needs to be scanned.

FP GROWTH

The main bottlenecks of the Apriori approach are that it uses breadth-first (i.e., level-wise) search and it often generates and tests a huge number of candidates. One possible better alternative to the Apriori approach is the FP-Growth approach, which uses depth-first search and avoids explicit candidate generation. Its major philosophy is "grow long patterns from short ones using local frequent items only". For example, if "abc" is a frequent pattern, then get all transactions having "abc"; if "d" is found to be a local frequent item in the transactions having "abc", then abcd is also a frequent pattern. In FP-Growth there are mainly two steps involved: first, build a compact data structure called the FP-tree and then, extract frequent itemsets directly from the FP-tree.
1. FP-tree Construction from a Transactional DB: The FP-tree is constructed using two passes over the data set:
Pass 1: Scan the DB to find the frequent 1-itemsets as:
   • Scan the DB and find the support count for each item.
   • Discard infrequent items.
   • Sort the items in descending order of their frequency (support count).
   • Sort the items in each transaction in descending order of their frequency.
   • Use this order when building the FP-tree, so common prefixes can be shared.
Pass 2: Scan the DB again and construct the FP-tree:
   • FP-growth reads one transaction at a time and maps it to a path.
   • A fixed order is used, so paths can overlap when transactions share items.
   • Pointers are maintained between nodes containing the same item (dotted lines).
Mining Frequent Patterns Using the FP-tree:
   • Start from each frequent length-1 pattern (called the suffix pattern).
   • Construct its conditional pattern base (the set of prefix paths in the FP-tree co-occurring with the suffix pattern).
   • Then construct its conditional FP-tree.
   • The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from its conditional FP-tree.
The steps used by the FP-growth approach for FP-tree construction can be expressed in a flow chart as shown below:

Flow chart summary: calculate the support count of each item and sort the items in decreasing order of support counts; then, while transactions remain, read the next transaction t; if an overlapped prefix is found in the tree, increment the frequency count of each overlapped item and create new nodes for the non-overlapped items, otherwise create new nodes labeled with the items in t and set their frequency counts to 1; create pointers between nodes containing common items (in the example the pointers are shown by dashed lines); return when all transactions have been read.

Figure 5.8: Flow chart of the FP-tree construction process
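A minimal FP-tree construction following the two passes described above can be sketched in Python (class and function names are illustrative; it counts and orders the items, then inserts each ordered transaction as a path, sharing common prefixes and keeping node-links in a header table):

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> child FPNode

def build_fp_tree(transactions, min_sup):
    # Pass 1: support counts, discard infrequent items, fix a frequency order.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in counts.items() if c >= min_sup}

    def order(t):
        return sorted((i for i in t if i in frequent),
                      key=lambda i: (-frequent[i], i))

    # Pass 2: insert each ordered transaction as a path from the root.
    root = FPNode(None, None)
    header = defaultdict(list)      # item -> nodes containing it (node-links)
    for t in transactions:
        node = root
        for item in order(t):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# Example with the nine transactions of the FP-growth example below (min_sup = 2).
db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
root, header = build_fp_tree(db, 2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# Each item's total count in the tree equals its support: I2=7, I1=6, I3=6, I4=2, I5=2
```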



Example: Find all frequent itemsets or frequent patterns in the following database by using the FP-growth algorithm. Take minimum support (min_sup) = 2.
TID   List of item IDs
1     I1, I2, I5
2     I2, I4
3     I2, I3
4     I1, I2, I4
5     I1, I3
6     I2, I3
7     I1, I3
8     I1, I2, I3, I5
9     I1, I2, I3

Now, building an FP-tree of the given transaction database. Here, itemsets are considered in order of their descending value of support count.
Constructing 1-itemsets and counting the support count of each item:

Itemset   Support count
I1        6
I2        7
I3        6
I4        2
I5        2

Discarding all infrequent itemsets (since min_sup = 2, no item is discarded here):

Itemset   Support count
I1        6
I2        7
I3        6
I4        2
I5        2

Sorting the frequent 1-itemsets in descending order of their support count:

Itemset   Support count
I2        7
I1        6
I3        6
I4        2
I5        2
Now, ordering each itemset in D based on the frequent 1-itemsets above:

TID   List of items       Ordered items
1     I1, I2, I5          I2, I1, I5
2     I2, I4              I2, I4
3     I2, I3              I2, I3
4     I1, I2, I4          I2, I1, I4
5     I1, I3              I1, I3
6     I2, I3              I2, I3
7     I1, I3              I1, I3
8     I1, I2, I3, I5      I2, I1, I3, I5
9     I1, I2, I3          I2, I1, I3

Now drawing the FP-tree by inserting the ordered itemsets one by one (the tree after each insertion is shown in the corresponding figures):
1. For Transaction 1: I2, I1, I5
2. For Transaction 2: I2, I4
3. For Transaction 3: I2, I3
4. For Transaction 4: I2, I1, I4
5. For Transaction 5: I1, I3
6. For Transaction 6: I2, I3
7. For Transaction 7: I1, I3
8. For Transaction 8: I2, I1, I3, I5
9. For Transaction 9: I2, I1, I3
Now, to facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links.

FP-tree construction is now over. Next we need to find the conditional pattern base and conditional FP-tree for each item. For this, start with the last item in sorted order, that is I5, follow its node pointers and traverse only the paths containing I5, and accumulate all of the transformed prefix paths of that item to form its conditional pattern base.
Conditional Pattern Base
I5: {{I2, I1: 1}, {I2, I1, I3: 1}}

Now constructing the conditional FP-tree based on the conditional pattern base for I5, by merging all paths and keeping the nodes that appear ≥ min_sup times while eliminating all others. Here, min_sup is 2; nodes I2 and I1 pass this support, so keep these two nodes in the prefix path co-occurring with I5, but I3 cannot pass min_sup = 2, so it is eliminated from the prefix path co-occurring with I5. Therefore, the conditional FP-tree for I5 contains only nodes I2 and I1 in the prefix path for I5, as shown below.

Conditional FP-tree for I5: {I2: 2, I1: 2}

Move to the next least frequent item in order, i.e., I4, follow its node pointers and traverse only the paths containing I4, and accumulate all of the transformed prefix paths of that item to form a conditional pattern base as shown below.

Conditional Pattern Base
I4: {{I2, I1: 1}, {I2: 1}}

Now, constructing the conditional FP-tree based on the conditional pattern base for I4, by merging all paths and keeping the nodes that appear ≥ min_sup times while eliminating all others. Here, min_sup is 2; only node I2 passes this support, so keep this node in the prefix path co-occurring with I4, but I1 cannot pass min_sup, so it is eliminated from the prefix path co-occurring with I4. Therefore, the conditional FP-tree for I4 contains only node I2 in the prefix path for I4, as shown below.

Conditional FP-tree for I4: {I2: 2}

Move to the next least frequent item in order, i.e., I3, follow its node pointers and traverse only the paths containing I3, and accumulate all of the transformed prefix paths of that item to form a conditional pattern base as shown below.

Conditional Pattern Base
I3: {{I2, I1: 2}, {I2: 2}, {I1: 2}}

Now, constructing the conditional FP-tree based on the conditional pattern base for I3, by merging all paths and keeping the nodes that appear ≥ min_sup times while eliminating all others. Here, min_sup is 2; nodes I2 and I1 pass this support, so keep these two nodes in the prefix paths co-occurring with I3. Therefore, the conditional FP-tree for I3 contains nodes I2 and I1 in the prefix paths for I3, as shown below.

Conditional FP-tree for I3: {I2: 4, I1: 2}, {I1: 2}

Move to the next least frequent item in order, i.e., I1, follow its node pointers and traverse only the paths containing I1, and accumulate all of the transformed prefix paths of that item to form a conditional pattern base as shown below.

Conditional Pattern Base
I1: {{I2: 4}}

Now, constructing the conditional FP-tree based on the conditional pattern base for I1, by merging all paths and keeping the nodes that appear ≥ min_sup times while eliminating all others. Here, min_sup is 2; node I2 passes this support, so keep this node in the prefix path co-occurring with I1. Therefore, the conditional FP-tree for I1 contains only node I2 in the prefix path for I1, as shown below.

Conditional FP-tree for I1: {I2: 4}

Move to the next least frequent item in order, i.e., I2, but this item's prefix path has only one node, labeled null. So there is no need to create a conditional pattern base and conditional FP-tree for such items, and we stop the conditional pattern base and conditional FP-tree construction process.

Now, generate the frequent patterns from the conditional FP-trees by concatenating the suffix node with each frequent prefix node present in the conditional FP-tree for that suffix node.
Item   Frequent Patterns Generated
I5     {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4     {I2, I4: 2}
I3     {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1     {I2, I1: 4}

These are the frequent patterns generated by the FP-growth algorithm for the given transaction database.
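For illustration, the conditional pattern bases of this example can also be read off the ordered transactions directly, since each ordered transaction contributes one prefix path for its suffix item; a small Python sketch (illustrative, not the full recursive FP-growth miner):

```python
from collections import Counter

# Ordered transactions from the example (items already sorted by descending support).
ordered = [["I2","I1","I5"], ["I2","I4"], ["I2","I3"], ["I2","I1","I4"], ["I1","I3"],
           ["I2","I3"], ["I1","I3"], ["I2","I1","I3","I5"], ["I2","I1","I3"]]

def conditional_pattern_base(suffix):
    """Prefix paths (with their counts) that occur before the suffix item."""
    base = Counter()
    for t in ordered:
        if suffix in t:
            prefix = tuple(t[:t.index(suffix)])
            if prefix:
                base[prefix] += 1
    return dict(base)

print(conditional_pattern_base("I5"))  # prefixes (I2, I1) and (I2, I1, I3), count 1 each
print(conditional_pattern_base("I4"))  # prefixes (I2,) and (I2, I1), count 1 each
print(conditional_pattern_base("I3"))  # prefixes (I2,), (I1,) and (I2, I1), count 2 each
```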

Advantages of FP-Growth

Frequent Pattern Growth is faster than Apriori due to the following reasons:
1. No candidate generation, no candidate test.
2. Uses a compact data structure called the FP-tree.
3. Eliminates repeated database scans.
4. The basic operation is counting and FP-tree building.

Disadvantages of FP-Growth

1. The FP-tree is more cumbersome and difficult to build than Apriori.
2. It may be expensive.
3. When the database is large, the algorithm may not fit in the shared memory.

Apriori Versus FP-Growth

1. Apriori is an array-based algorithm, whereas FP-Growth is a tree-based algorithm.
2. Apriori uses join and prune techniques, whereas FP-Growth constructs a conditional frequent pattern tree and a conditional pattern base from the database which satisfy minimum support.
3. Apriori uses a breadth-first search, whereas FP-Growth uses a depth-first search.
4. Apriori utilizes a level-wise approach where it generates patterns containing 1 item, then 2 items, then 3 items, and so on, whereas FP-Growth utilizes a pattern-growth approach, meaning that it only considers patterns actually existing in the database.
5. In Apriori, candidate generation is extremely slow and runtime increases exponentially depending on the number of different items, whereas in FP-Growth runtime increases linearly, depending on the number of transactions and items.
6. In Apriori, candidate generation is very parallelizable, whereas in FP-Growth the data are very interdependent; each node needs the root.
7. Apriori requires large memory space due to the large number of candidates generated, whereas FP-Growth requires less memory space due to its compact structure and no candidate generation.
8. Apriori scans the database multiple times for generating candidate sets, whereas FP-Growth scans the database only twice for constructing the frequent pattern tree.

FROM ASSOCIATION MINING TO CORRELATION ANALYSIS (LIFT)

Normally, association rules that pass the min_sup and min_conf thresholds are called interesting association rules. But sometimes, association rules assessed and qualified by support and confidence as interesting may be uninteresting in actuality. This shows that the support and confidence measures are insufficient at filtering out uninteresting association rules. The drawback of support is that many potentially interesting patterns involving low-support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example. Suppose we analyze the relationship between people who drink tea and coffee. For this, consider the following supermarket transactions database:

              Tea    Not tea   Total
Coffee        150    750       900
Not coffee    50     50        100
Total         200    800       1000

The information given in this table can be used to evaluate the association rule Tea → Coffee. At first glance, it may appear that people who drink tea also tend to drink coffee because the rule's support = 150/1000 = 15% and confidence = 150/200 = 75% values are reasonably high. This argument would have been acceptable except that the fraction of people who drink coffee, regardless of whether they drink tea, is 900/1000 = 90%, while the fraction of tea drinkers who drink coffee is only 150/200 = 75%. Thus, knowing that a person is a tea drinker actually decreases her probability of being a coffee drinker from 90% to 75%! The rule Tea → Coffee is therefore misleading despite its high confidence value.
The tea-coffee example shows that high-confidence rules can sometimes be misleading because the confidence measure ignores the support of the itemset appearing in the rule consequent. One way to address this problem is by applying a correlation analysis-based metric known as lift.
Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. This definition can easily be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing

lift(A, B) = P(A ∪ B) / (P(A) P(B))
If lift(A, B) is less than 1, then the occurrence of A is negatively correlated with the occurrence of B. If the resulting value is greater than 1, then A and B are positively correlated, that is, the occurrence of one implies the occurrence of the other. If the resulting value is equal to 1, then A and B are independent and there is no correlation between them. In our example, the lift value equals 0.15 / (0.2 × 0.9) ≈ 0.83, which clearly indicates the negative correlation between coffee and tea.
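The lift value for the tea/coffee example can be verified with a couple of lines of Python (a sketch using the contingency-table counts above):

```python
n = 1000.0
p_tea_and_coffee = 150 / n          # P(Tea and Coffee)
p_tea, p_coffee = 200 / n, 900 / n

lift = p_tea_and_coffee / (p_tea * p_coffee)
print(round(lift, 2))               # 0.83 -> less than 1, so tea and coffee are negatively correlated
```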

Exercises
1. What is a frequent pattern?
2. Why is frequent pattern mining an important data mining task? Explain.
3. What is market basket analysis? Explain it with a suitable example.
4. What is an association rule? Why is it important? Explain.
5. Explain the different types of association rules with suitable examples.
6. Define the concepts of support, confidence and lift for association rule mining.
7. What is the Apriori principle? How is it used by the Apriori algorithm for frequent pattern mining? Explain.
8. What are interesting association rules?
9. How are interesting association rules generated by using association rule generation from the frequent patterns? Explain.
10. What are the limitations of the Apriori approach? How can these limitations be improved? Explain.
11. What is FP-growth? Discuss the FP-growth approach for frequent pattern mining.
12. Why is the FP-growth approach considered better than the Apriori approach? Explain.
13. What is an FP-tree? Differentiate it with a conditional FP-tree.
14. What are interesting association rules? Why is correlation analysis used as a supplement to the support and confidence framework for association rule assessment?
15. Use the APRIORI algorithm to generate strong association rules from the following transaction database. Use min_sup = 2 and min_confidence = 75%.
TID Itemsets

10 A,C,D

20 B,C,E

30 A, B, C, E

40 B, E
Frequent Pattern Mining O CHAPTER 5 J 121
16. A database has 10 transactions and contains only the 9 items A = {A1, A2, A3, A4, A5, A6, A7, A8, A9}. Let min sup = 20% and min conf = 60%. Find all frequent itemsets using the Apriori algorithm. List all the strong association rules.

-
A1 A2 A3 A4 A5 A6 A7 A8 A9
1  0  0  0  1  1  0  1  0
0  1  0  1  0  0  0  1  0

1 0 0
0 0 0 1 1 0
1 1 0 0 0
0 0 0 0
1 0 0
0 0 0 0 1 1
1 0 0 1 1 0 1
0 0
0 0 1 0 0 0 0 0
0
0 0 0 0 0
0 0 0 0
1 0 1 0 1 0 0
0 0
1 1 0 1 0
0 0 0 0
0 1 1 0 0
0 1 0 1
1 0 1 0 0
1 0 1 0
0 0 0 0 0 1
0 1 1
1-> presence of item in the transaction and
0-> absence of item in the transaction
17. A database has 10 transactions and contains only the 6 items A = {A1, A2, A3, A4, A5, A6}. Let min sup = 30%. Find all frequent itemsets using the Apriori algorithm.

A1 A2 A3 A4 A5 A6

0 0 0 1 1 1

0 1 1 1 0 0

1 0 0 1 1 1

1 1 0 1 0 0

1 0 1 0 1 1

0 1 1 1 0 1

0 0 0 1 0 1

0 1 0 1 0 1

1 0 0 1 0 0

1 1 1 1 1 1
122 Data Warehousing and Data Mining

18. A database has five transactions. Let min sup = 60% and min conf = 80%. Find all frequent itemsets using the Apriori algorithm. List all the strong association rules.
TID   Items Bought
1     {M, O, N, K, E, Y}
2     {D, O, N, K, E, Y}
3     {M, A, K, E}
4     {M, U, C, K, Y}
5     {C, O, O, K, I, E}
19. Show, using an example, how the FP-tree algorithm solves the association rule mining (ARM) problem.
20. Perform ARM using FP-growth on the following data set with minimum support = 50% and confidence = 75%.
Transaction ID Items

1 Bread, Cheese, Eggs, Juice

2 Bread,Cheese,Juice

3 Bread,Milk, Yogurt

4 Bread, Juice, Milk

5 Cheese, Juice, Milk

□□□
