
Data Mining
Association Analysis: Basic Concepts and Algorithms
What Are Patterns?
– Patterns: sets of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set
– Patterns represent intrinsic and important properties of data sets

Examples: frequent itemsets, frequent sequences, frequent structures


What Is Pattern Discovery?
Pattern discovery: uncovering patterns from massive data sets.
It can answer questions such as:
– What products were often purchased together?
– What are the subsequent purchases after buying an iPad?
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
An association rule consists of two parts: an antecedent (if) and a consequent (then).

Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Basic Concepts: Transactional Database
Transactional Database (TDB)
– Each transaction is associated with an identifier, called a TID.
– May also have counts associated with each item sold

TID  Items bought
1    Beer, Nuts, Diaper
2    Beer, Coffee, Diaper
3    Beer, Diaper, Eggs
4    Nuts, Eggs, Milk
5    Nuts, Coffee, Diaper, Eggs, Milk

Definition: Frequent Itemset
Itemset
– A collection of one or more items
  ◆ Example: {Milk, Bread, Diaper}
– k-itemset
  ◆ An itemset that contains k items
Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2 in the market-basket transactions above
Relative support (s)
– The fraction of transactions that contain X (i.e., the probability that a transaction contains X)
– E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
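As an illustration (not part of the original slides), these definitions translate directly into a few lines of Python; the transactions list below is a minimal sketch that mirrors the market-basket table:

    # Minimal sketch: support count and relative support.
    # The transactions reproduce the market-basket table above.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        # sigma(X): number of transactions that contain every item of X
        return sum(1 for t in transactions if itemset <= t)

    X = {"Milk", "Bread", "Diaper"}
    print(support_count(X, transactions))                      # sigma = 2
    print(support_count(X, transactions) / len(transactions))  # s = 0.4

The snippets that follow reuse this transactions list and the support_count helper.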
Rule Measures: Support and Confidence
[Venn diagram: customers who buy diaper, customers who buy beer, and customers who buy both]

Find all the rules X & Y ⇒ Z with minimum confidence and support
– support, s: probability that a transaction contains {X ∪ Y ∪ Z}
– confidence, c: conditional probability that a transaction having {X ∪ Y} also contains Z

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Let minimum support be 50% and minimum confidence 50%; then we have:
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)
Definition: Association Rule
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
– Support (s)
  ◆ Fraction of transactions that contain both X and Y
– Confidence (c)
  ◆ Measures how often items in Y appear in transactions that contain X
  ◆ Confidence is the conditional probability of Y given that X has occurred

Example (using the market-basket transactions above): {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67
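A quick check of these two numbers, reusing the transactions list and the support_count helper sketched earlier (again an illustration, not part of the slides):

    # Support and confidence of {Milk, Diaper} -> {Beer}.
    X, Y = {"Milk", "Diaper"}, {"Beer"}
    s = support_count(X | Y, transactions) / len(transactions)
    c = support_count(X | Y, transactions) / support_count(X, transactions)
    print(round(s, 2), round(c, 2))  # 0.4 0.67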
Mining Association Rules—An Example
Min. support 50%, min. confidence 50%

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For rule A ⇒ C:
support = P(A ∪ C) = 50%
confidence = P(C|A) = 66.6%
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
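To get a feel for why this is prohibitive, the sketch below (illustrative only; count_candidate_rules is a hypothetical helper, not from the slides) enumerates every candidate rule X → Y over a small item universe:

    from itertools import combinations

    def count_candidate_rules(items):
        # Enumerate every rule X -> Y with X and Y non-empty and disjoint:
        # choose the combined itemset, then choose the antecedent X as a
        # proper, non-empty subset of it.
        n = 0
        for k in range(2, len(items) + 1):
            for itemset in combinations(items, k):
                for j in range(1, k):
                    n += len(list(combinations(itemset, j)))
        return n

    print(count_candidate_rules("ABCDE"))  # 180 candidate rules for only d = 5 items

In general the count grows as 3^d - 2^(d+1) + 1, so the number of candidate rules is exponential in the number of items d.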
Mining Association Rules
Example rules from the market-basket transactions above:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
• All of the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation
[Itemset lattice over {A, B, C, D, E}: the null set at the top; below it the 1-itemsets A–E, then all 2-, 3-, and 4-itemsets, down to ABCDE at the bottom]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each of the N transactions (of maximum width w) against each of the M candidates
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
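A direct (assumed, illustrative) rendering of this brute-force counting over the earlier transactions list; all_itemsets and brute_force_supports are hypothetical helper names, not from the slides:

    from itertools import combinations

    def all_itemsets(items):
        # The full lattice: every non-empty subset, M = 2^d - 1 candidates.
        return [frozenset(c) for k in range(1, len(items) + 1)
                             for c in combinations(items, k)]

    def brute_force_supports(transactions):
        items = sorted(set().union(*transactions))
        # Each of the M candidates is matched against all N transactions
        # of width up to w: O(N * M * w) subset checks in total.
        return {c: sum(1 for t in transactions if c <= t)
                for c in all_itemsets(items)}

    supports = brute_force_supports(transactions)
    print(supports[frozenset({"Milk", "Bread", "Diaper"})])  # 2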
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
– Uses prior knowledge of frequent itemset properties

The Apriori principle holds due to the following property of the support measure:
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– The support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
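The property can be spot-checked on the running example (a sketch reusing the earlier transactions and support_count):

    # Anti-monotonicity: X subset of Y implies s(X) >= s(Y).
    X = {"Milk"}
    Y = {"Milk", "Diaper", "Beer"}
    assert X <= Y
    assert support_count(X, transactions) >= support_count(Y, transactions)  # 4 >= 2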
Illustrating Apriori Principle
[Itemset lattice over {A, B, C, D, E}: once an itemset such as AB is found to be infrequent, all of its supersets are pruned from the lattice without being counted]
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3
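The same pruning can be reproduced with a short counting pass (illustrative, reusing transactions and support_count from the earlier sketches):

    from itertools import combinations

    minsup = 3
    # Count 1-itemsets and keep only the frequent ones (drops Coke and Eggs).
    counts1 = {i: support_count({i}, transactions)
               for i in set().union(*transactions)}
    frequent1 = sorted(i for i, n in counts1.items() if n >= minsup)

    # Candidate pairs come only from frequent items, so no pair involving
    # Coke or Eggs is ever generated or counted.
    counts2 = {pair: support_count(set(pair), transactions)
               for pair in combinations(frequent1, 2)}
    print({p: n for p, n in counts2.items() if n >= minsup})
    # {('Beer', 'Diaper'): 3, ('Bread', 'Diaper'): 3,
    #  ('Bread', 'Milk'): 3, ('Diaper', 'Milk'): 3}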
Mining Frequent Itemsets: the Key Step
1. Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset
  ◆ i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
2. Use the frequent itemsets to generate association rules.
The Apriori Algorithm—An Example
minsup = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

F1 (pruning {D}):
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from F1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

F2 (pruning {A, B} and {A, E}):
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3: {B, C, E}

3rd scan → F3:
Itemset    sup
{B, C, E}  2
Apriori Algorithm
Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  ◆ Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  ◆ Prune candidate itemsets containing subsets of length k that are infrequent
  ◆ Count the support of each candidate by scanning the DB
  ◆ Eliminate candidates that are infrequent, leaving only those that are frequent
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
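The pseudo-code translates almost line for line into Python. The sketch below is one possible reading (a hedged illustration, not the lecture's reference implementation); it joins k-itemsets by pairwise union, prunes by subset checks, and is run on the TDB example from the earlier slide:

    from itertools import combinations

    def apriori(transactions, minsup):
        # Level-wise search mirroring the pseudo-code above.
        transactions = [frozenset(t) for t in transactions]

        def count(c):
            return sum(1 for t in transactions if c <= t)

        # L1 = {frequent items}
        items = set().union(*transactions)
        Lk = {frozenset({i}) for i in items if count(frozenset({i})) >= minsup}
        frequent = {c: count(c) for c in Lk}

        k = 1
        while Lk:
            # Join step: unite pairs of frequent k-itemsets that share
            # k-1 items, producing candidate (k+1)-itemsets.
            Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            # Prune step: drop any candidate with an infrequent k-subset.
            Ck = {c for c in Ck
                  if all(frozenset(s) in Lk for s in combinations(c, k))}
            # Scan the database to keep only candidates with min support.
            Lk = {c for c in Ck if count(c) >= minsup}
            frequent.update({c: count(c) for c in Lk})
            k += 1
        return frequent  # the union over all k of Lk, with support counts

    # The TDB example from the earlier slide, minsup = 2.
    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for itemset, sup in sorted(apriori(tdb, 2).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), sup)
    # Prints F1, then F2, then ['B', 'C', 'E'] 2, matching the slide trace.

Counting via a full database scan per level matches the pseudo-code; practical implementations speed up the counting step with structures such as hash trees.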
