
UNIT-IV

ASSOCIATION ANALYSIS:
 “Association analysis is the process of discovering interesting relationships hidden in large data sets.”
For example,
 Huge amounts of customer purchase data are collected daily at the counters of grocery stores.
 Such data are commonly known as market basket transactions.
Each row in a market basket table corresponds to a transaction, which contains a unique identifier labeled TID and the set of items bought by a given customer.
The following rule can be extracted from the data set:
{Diapers} → {Beer}.
The rule suggests that many customers who buy diapers also buy beer.
PROBLEM DEFINITION:
This section introduces the basic terminology used in association analysis.
Binary Representation
Market basket data can be represented in a binary format as shown in the table below.

TID   Bread   Milk   Diapers   Beer   Eggs   Cola
 1      1       1       0        0      0      0
 2      1       0       1        1      1      0
 3      0       1       1        1      0      1
 4      1       1       1        1      0      0
 5      1       1       1        0      0      1

A binary 0/1 representation of market basket data.
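For illustration, the following Python sketch builds this binary representation from raw transaction data; the baskets and the item ordering are taken directly from the table above.

    # A minimal sketch: build the binary (0/1) market basket representation.
    transactions = {
        1: {"Bread", "Milk"},
        2: {"Bread", "Diapers", "Beer", "Eggs"},
        3: {"Milk", "Diapers", "Beer", "Cola"},
        4: {"Bread", "Milk", "Diapers", "Beer"},
        5: {"Bread", "Milk", "Diapers", "Cola"},
    }
    items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Cola"]

    for tid, basket in transactions.items():
        # 1 if the transaction contains the item, 0 otherwise.
        row = [1 if item in basket else 0 for item in items]
        print(tid, row)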


Item set:
 A collection of zero or more items is termed an itemset.
 If an itemset contains k items, it is called a k-itemset.
 For instance, {Beer, Diapers, Milk} is an example of a 3-itemset.
 The null (or empty) set is an itemset that does not contain any items.
Support Count: the number of transactions that contain a particular itemset, denoted σ(X).
 Example: The support count for {Beer, Diapers, Milk} is equal to two, because only two transactions (TIDs 3 and 4) contain all three items.
Association Rule:
 An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅.
 The strength of an association rule can be measured in terms of its support and confidence.
Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are:

• Support, s(X → Y) = σ(X ∪ Y) / N
• Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)

where N is the total number of transactions.
EXAMPLE:
 Consider the rule {Milk, Diapers} → {Beer}. Since the support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule's support is 2/5 = 0.4.
 The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support count for {Milk, Diapers}.
 Since there are 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 ≈ 0.67.
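These calculations can be checked directly in code. A minimal sketch, assuming the transactions dictionary built from the binary table earlier:

    def support_count(itemset, transactions):
        # sigma(X): the number of transactions that contain every item in X.
        return sum(1 for basket in transactions.values() if itemset <= basket)

    N = len(transactions)
    s = support_count({"Milk", "Diapers", "Beer"}, transactions) / N   # 2/5 = 0.4
    c = (support_count({"Milk", "Diapers", "Beer"}, transactions)
         / support_count({"Milk", "Diapers"}, transactions))           # 2/3 ≈ 0.67
    print(s, c)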



FREQUENT ITEM SET GENERATION:
 A lattice structure can be used to enumerate the list of all possible itemsets. The figure below shows an itemset lattice for I = {a, b, c, d, e}.
 In general, a data set that contains k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set.
[Figure: the itemset lattice for I = {a, b, c, d, e}, from the null set at the top through the 1-itemsets a, b, c, d, e, the 2-itemsets ab, ..., de, the 3-itemsets, and the 4-itemsets, down to abcde.]
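The count 2^k − 1 is simply the number of non-empty subsets of k items. A minimal sketch enumerating the itemsets of this lattice:

    from itertools import combinations

    items = ["a", "b", "c", "d", "e"]
    itemsets = [set(c) for r in range(1, len(items) + 1)
                       for c in combinations(items, r)]
    print(len(itemsets))  # 2**5 - 1 = 31 itemsets, excluding the null set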
Apriori Principle.
“If an itemset is frequent, then all of its subsets must also be frequent.”
• Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets: {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}.
• As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in the figure below) must also be frequent.
[Figure: the itemset lattice with the frequent itemset {c, d, e} and all of its subsets shaded as frequent.]
Conversely, “if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too.”
[Figure: the itemset lattice with the infrequent itemset {a, b} marked; the entire subtree of its supersets is crossed out as pruned.]
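This support-based pruning translates directly into a candidate check. A minimal sketch, where frequent_itemsets is assumed to be a set of frozensets already known to be frequent:

    from itertools import combinations

    def has_infrequent_subset(candidate, frequent_itemsets):
        # Apriori pruning: if any (k-1)-subset of a k-itemset candidate is
        # infrequent, the candidate itself cannot be frequent.
        k = len(candidate)
        return any(frozenset(sub) not in frequent_itemsets
                   for sub in combinations(candidate, k - 1))

A candidate for which this check returns True can be discarded without counting its support at all.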
Apriori Algorithm
Frequent Itemset Generation in the Apriori Algorithm:
 Apriori is the first association rule mining algorithm.

Apriori: A Candidate Generation & Test Approach

Apriori Principle.
“If an itemset is frequent, then all of its subsets must also be frequent.”
• Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated or tested!
• Method:
– Initially, scan the DB once to get the frequent 1-itemsets
– Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
– Test the candidates against the DB
– Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example-1
Supmin = 2 (minimum support count)

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan — C1 (candidate 1-itemsets with support counts):
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 (frequent 1-itemsets): {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidate 2-itemsets generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan — C2 with support counts:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2 (frequent 2-itemsets): {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3 (candidate 3-itemsets generated from L2): {B, C, E}

3rd scan — L3 (frequent 3-itemsets): {B, C, E}: 2
The Apriori Algorithm
Notation: Ck = candidate itemsets of size k; Lk = frequent itemsets of size k.

Step 1: L1 = {frequent 1-itemsets};
Step 2: for (k = 1; Lk ≠ ∅; k++) do begin
Step 3:   Ck+1 = candidates generated from Lk;
Step 4:   for each transaction t in the database do
Step 5:     increment the count of all candidates in Ck+1 that are contained in t;
Step 6:   Lk+1 = candidates in Ck+1 with min_support;
Step 7: end
Step 8: return ∪k Lk;
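As a concrete illustration, here is a minimal, self-contained Python sketch of this loop (an illustrative implementation, not official textbook code), run against the TDB database from Example-1:

    from itertools import combinations

    def apriori(transactions, min_support):
        # transactions: list of sets; min_support: minimum support count.
        items = sorted({item for t in transactions for item in t})
        current = [frozenset([i]) for i in items]  # candidate 1-itemsets (C1)
        frequent = {}
        k = 1
        while current:
            # Scan the database once to count each candidate's support.
            counts = {c: 0 for c in current}
            for t in transactions:
                for c in current:
                    if c <= t:
                        counts[c] += 1
            Lk = {c: n for c, n in counts.items() if n >= min_support}
            frequent.update(Lk)
            # Generate (k+1)-candidates by joining frequent k-itemsets,
            # pruning any candidate with an infrequent k-subset.
            keys = list(Lk)
            current, seen = [], set()
            for a in keys:
                for b in keys:
                    union = a | b
                    if len(union) == k + 1 and union not in seen:
                        if all(frozenset(s) in Lk for s in combinations(union, k)):
                            seen.add(union)
                            current.append(union)
            k += 1
        return frequent

    TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for itemset, sup in apriori(TDB, 2).items():
        print(sorted(itemset), sup)

This reproduces L1, L2, and L3 from Example-1, ending with {B, C, E} at support count 2.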
[The Apriori Algorithm—An Example-2: a second worked example, shown as figures in the original slides.]
RULE GENERATION
How can association rules be extracted efficiently from a given frequent itemset?
 Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules.
 An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold.
Example:
Let X = {1, 2, 3} be a frequent itemset. There are six candidate association rules that can be generated from X:
1) {1, 2} → {3},
2) {1, 3} → {2},
3) {2, 3} → {1},
4) {1} → {2, 3},
5) {2} → {1, 3}, and
6) {3} → {1, 2}.
Note: Because the support of each rule is identical to the support of X, and X is frequent, all of these rules automatically satisfy the support threshold.
Consider the rule {1, 2} → {3}, which is generated from the frequent itemset X = {1, 2, 3}.
 The confidence for this rule is σ({1, 2, 3}) / σ({1, 2}).
 Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent, too.
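The partitioning scheme above can be written as a short routine. A minimal sketch that enumerates all 2^k − 2 candidate rules from a frequent itemset and keeps those meeting a confidence threshold; it reuses the support_count helper defined earlier and assumes transactions is available:

    from itertools import combinations

    def generate_rules(itemset, transactions, min_conf):
        # Partition Y into X and Y - X for every non-empty proper subset X.
        rules = []
        y = frozenset(itemset)
        for r in range(1, len(y)):
            for antecedent in combinations(sorted(y), r):
                x = frozenset(antecedent)
                conf = support_count(y, transactions) / support_count(x, transactions)
                if conf >= min_conf:
                    rules.append((set(x), set(y - x), conf))
        return rules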
CONFIDENCE-BASED PRUNING
 Unlike the support measure, confidence does not have any monotone property.
 For example, the confidence for X → Y can be larger than, smaller than, or equal to the confidence for another rule X̃ → Ỹ, where X̃ ⊆ X and Ỹ ⊆ Y.
 However, if we compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure.
Theorem: If a rule X → Y − X does not satisfy the confidence threshold, then any rule X′ → Y − X′, where X′ is a subset of X, must not satisfy the confidence threshold as well.
Rule Generation in the Apriori Algorithm
 The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent.
 Initially, all the high-confidence rules that have only one item in the rule consequent are extracted; these are then used to generate candidate rules with larger consequents.
 The figure below shows a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}.
 If any node in the lattice has low confidence, then according to the theorem above, the entire subgraph spanned by that node can be pruned immediately.
[Figure: the lattice of association rules generated from the frequent itemset {a, b, c, d}, from abcd ⇒ {} at the top down to rules such as d ⇒ abc. One node (e.g., bcd ⇒ a) is marked as a low-confidence rule, and the entire subgraph of rules below it is marked as pruned.]
COMPACT REPRESENTATION OF FREQUENT ITEMSETS
 In practice, the number of frequent itemsets produced from a transaction data set can be very large.
 It is useful to identify a small representative set of itemsets from which all other frequent itemsets can be derived.
Two such compact representations are:
1) Maximal frequent itemsets.
2) Closed frequent itemsets.
1) Maximal Frequent Itemsets
 Definition: “A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets are frequent.”
[Figure: an itemset lattice with the border separating frequent from infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying directly along that border.]
2) Closed Frequent Itemsets
 Definition: “An itemset X is closed if none of its immediate supersets has exactly the same support count as X.”
 A closed frequent itemset is an itemset that is closed and whose support is greater than or equal to the minimum support threshold.
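Both definitions translate directly into code. A minimal sketch that, given a dictionary mapping frequent itemsets (frozensets) to their support counts, such as the output of the apriori sketch above, picks out the maximal and closed frequent itemsets:

    def maximal_and_closed(frequent):
        # frequent: dict mapping frozenset -> support count.
        maximal, closed = [], []
        for x, sup in frequent.items():
            # Immediate supersets of x that are themselves frequent. An
            # infrequent superset can never tie a frequent itemset's support,
            # so checking within `frequent` suffices for both tests.
            supersets = [y for y in frequent if x < y and len(y) == len(x) + 1]
            # Maximal: no immediate superset is frequent at all.
            if not supersets:
                maximal.append(x)
            # Closed: no immediate superset has exactly the same support count.
            if all(frequent[y] != sup for y in supersets):
                closed.append(x)
        return maximal, closed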
FP-GROWTH ALGORITHM
 FP-growth is an alternative algorithm for discovering frequent itemsets.

Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
• Bottlenecks of the Apriori approach:
1) It may need to generate a huge number of candidate sets.
2) It may need to repeatedly scan the database.
• The FP-Growth approach:
– Avoids explicit candidate generation.
 The FP-growth algorithm first constructs a data structure called an FP-tree and then extracts frequent itemsets directly from this structure.
FP-TREE REPRESENTATION
 An FP-tree is a compressed representation of the input data.
 It is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the FP-tree.
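A minimal sketch of the construction step is shown below (an illustrative simplification: the header table and node links used for the actual mining phase are omitted):

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}

    def build_fp_tree(transactions, min_support):
        # First pass: count item frequencies and keep only frequent items.
        freq = Counter(item for t in transactions for item in t)
        freq = {i: n for i, n in freq.items() if n >= min_support}
        root = FPNode(None, None)
        # Second pass: insert each transaction as a path, with items ordered
        # by decreasing frequency so that common prefixes share nodes.
        for t in transactions:
            ordered = sorted((i for i in t if i in freq),
                             key=lambda i: (-freq[i], i))
            node = root
            for item in ordered:
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                node = node.children[item]
                node.count += 1
        return root

    tree = build_fp_tree([{"A", "C", "D"}, {"B", "C", "E"},
                          {"A", "B", "C", "E"}, {"B", "E"}], 2)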
[Example-1: FP-tree construction, shown as figures in the original slides.]
