DMT Unit-IV - UR20 - New
ASSOCIATION ANALYSIS:
“Association analysis is the process
of discovering interesting
relationships hidden in large data
sets”.
For example, huge amounts of customer purchase data are collected daily at the counters of grocery stores. Such data are commonly known as market basket transactions.
Each row in a market basket table corresponds to a transaction, which contains a unique identifier labeled TID and the set of items bought by a given customer.
The following rule can be extracted from the data set:
{Diapers} → {Beer}.
The rule suggests that many customers who buy
diapers also buy beer.
PROBLEM DEFINITION:
This section introduces the basic terminology used in association analysis.
Binary Representation
Market basket data can be represented in a binary format, as shown in the table below.

TID  Bread  Milk  Diapers  Beer  Eggs  Cola
 1     1     1       0      0     0     0
 2     1     0       1      1     1     0
 3     0     1       1      1     0     1
 4     1     1       1      1     0     0
 5     1     1       1      0     0     1
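As a minimal sketch (the variable and function names here are illustrative, not from the notes), this binary table can be stored and queried in plain Python; the support count of an itemset is simply the number of rows whose corresponding columns are all 1:

items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Cola"]

# One row per TID; a 1 means the item appears in that transaction.
table = {
    1: [1, 1, 0, 0, 0, 0],
    2: [1, 0, 1, 1, 1, 0],
    3: [0, 1, 1, 1, 0, 1],
    4: [1, 1, 1, 1, 0, 0],
    5: [1, 1, 1, 0, 0, 1],
}

def support_count(itemset):
    # Number of transactions containing every item in `itemset`.
    cols = [items.index(i) for i in itemset]
    return sum(all(row[c] for c in cols) for row in table.values())

print(support_count({"Diapers", "Beer"}))  # 3 of the 5 transactions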
[Figure: the itemset lattice over the items {a, b, c, d, e}, enumerating candidate itemsets from the single items up to the full set.]
Apriori Principle.
“If an item set is frequent, then all of its subsets
must also be frequent”.
• Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}.
• As a result, if {c, d, e} is frequent, then all
subsets of {c, d, e} (i.e., the shaded itemsets in
this figure) must also be frequent.
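To make this concrete, a tiny Python snippet (illustrative, not part of the notes) enumerates the non-empty proper subsets of {c, d, e}, every one of which must be frequent if {c, d, e} is:

from itertools import combinations

itemset = ("c", "d", "e")
# All non-empty proper subsets: ('c',), ('d',), ('e',),
# ('c', 'd'), ('c', 'e'), ('d', 'e')
subsets = [s for k in range(1, len(itemset))
           for s in combinations(itemset, k)]
print(subsets)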
[Figure: the itemset lattice rooted at null over {a, b, c, d, e}, with the subsets of the frequent itemset {c, d, e} shaded.]
• Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested.
• Method (a sketch of the candidate-generation step follows this list):
– Initially, scan the DB once to get the frequent 1-itemsets.
– Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
– Test the candidates against the DB.
– Terminate when no frequent or candidate set can be generated.
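The generation step can be sketched as follows (the function name and the sorted-tuple representation are my own assumptions): candidates of length k+1 are formed by joining frequent k-itemsets that share their first k−1 items, and any candidate with an infrequent k-subset is pruned.

from itertools import combinations

def generate_candidates(frequent_k):
    # Join frequent k-itemsets (sorted tuples) sharing their first k-1
    # items, then apply Apriori pruning: every k-subset must be frequent.
    freq = set(frequent_k)
    candidates = []
    for a in frequent_k:
        for b in frequent_k:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                if all(s in freq for s in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

print(generate_candidates([("B", "C"), ("B", "E"), ("C", "E")]))
# [('B', 'C', 'E')]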
The Apriori Algorithm—An Example-1
Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 (candidates in C1 with sup ≥ Supmin):
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

From L2 the only length-3 candidate is C3 = {{B, C, E}}; a 3rd scan gives L3 = {{B, C, E}} with support 2, after which no further candidates can be generated.
The Apriori Algorithm
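Below is a compact, illustrative Python implementation of the whole loop (the structure and names are my own, not from the notes); running it on the TDB database above with Supmin = 2 reproduces L1, L2, and L3.

from itertools import combinations

def apriori(transactions, min_sup):
    # Return {itemset (sorted tuple): support count} for all frequent itemsets.
    def count(cands):
        counts = {c: 0 for c in cands}
        for t in transactions:
            for c in cands:
                if set(c) <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = sorted({i for t in transactions for i in t})
    frequent = count([(i,) for i in items])  # L1, from the 1st scan
    result = dict(frequent)
    k = 1
    while frequent:
        level = sorted(frequent)
        # Join step: merge k-itemsets sharing their first k-1 items.
        cands = [a + (b[-1],) for a in level for b in level
                 if a[:-1] == b[:-1] and a[-1] < b[-1]]
        # Prune step: drop candidates with an infrequent k-subset.
        cands = [c for c in cands
                 if all(s in frequent for s in combinations(c, k))]
        frequent = count(cands)  # one additional DB scan per level
        result.update(frequent)
        k += 1
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, 2))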
RULE GENERATION
How can association rules be extracted efficiently from a given frequent itemset?
Each frequent k-itemset Y can produce up to 2^k − 2 association rules.
An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold.
Example.
Let X = {1, 2, 3} be a frequent itemset. There are six candidate association rules that can be generated from X:
1) {1, 2} → {3},
2) {1, 3} → {2},
3) {2, 3} → {1},
4) {1} → {2, 3},
5) {2} → {1, 3}, and
6) {3} → {1, 2}.
Note: because the support of each of these rules is identical to the support of X, they all automatically satisfy the support threshold.
Consider the rule {1, 2} → {3},
which is generated from the frequent item set
X = {1, 2, 3}.
The confidence for this rule is
σ({1, 2, 3})/σ({1, 2}).
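A minimal sketch of this partitioning scheme in Python (sigma and rules_from are illustrative names; the small transaction set and the 0.6 confidence threshold are made up for the demonstration):

from itertools import combinations

transactions = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2}, {2, 3}, {1, 3}]

def sigma(itemset):
    # Support count: transactions containing every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t)

def rules_from(Y, min_conf):
    # Emit X -> Y - X for every non-empty proper subset X of Y
    # whose confidence sigma(Y)/sigma(X) meets the threshold.
    for k in range(1, len(Y)):
        for X in map(set, combinations(Y, k)):
            conf = sigma(Y) / sigma(X)
            if conf >= min_conf:
                yield X, Y - X, conf

for X, rhs, conf in rules_from({1, 2, 3}, 0.6):
    print(X, "->", rhs, f"(conf = {conf:.2f})")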
FP-GROWTH ALGORITHM
The FP-growth algorithm first constructs a data structure called an FP-tree and then extracts frequent itemsets directly from this structure.
FP-TREE REPRESENTATION
An FP-tree is a compressed representation of
the input data.
It is constructed by reading the data set one
transaction at a time and mapping each
transaction onto a path in the FP-tree.
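A bare-bones sketch of this construction (class and field names are my own assumptions; a real implementation would also maintain header-table links to support mining):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: count item supports and keep only the frequent items.
    sup = defaultdict(int)
    for t in transactions:
        for i in t:
            sup[i] += 1
    frequent = {i for i, n in sup.items() if n >= min_sup}

    # Pass 2: insert each transaction as a path, items ordered by
    # decreasing support so that common prefixes share tree nodes.
    root = FPNode(None, None)
    for t in transactions:
        node = root
        for i in sorted((i for i in t if i in frequent),
                        key=lambda i: (-sup[i], i)):
            node = node.children.setdefault(i, FPNode(i, node))
            node.count += 1
    return root

tree = build_fp_tree([{"a", "b"}, {"b", "c", "d"}, {"a", "b", "c"}], 2)
print(tree.children["b"].count)  # 'b' is the most frequent item: count 3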
EXAMPLE-1