Market Basket Analysis
Links
https://rpubs.com/sbushmanov/180410
Visualisation
http://www.rdatamining.com/examples/association-rules
Association mining is usually done on transaction data from a retail store or an online e-commerce
store. Because most transaction data sets are large, the apriori algorithm is used to find frequent patterns,
or rules, in them quickly.
Rule
A rule is a notation that represents which items are frequently bought together with which other items. It has
a left-hand side (LHS) and a right-hand side (RHS) and can be written as follows:
{LHS items} => {RHS items}
This means that the items on the right were frequently purchased along with the items on the left.
Measure the strength of a rule
The apriori() function generates the most relevant set of rules from the given transaction data. It also
reports the support, confidence and lift of those rules. These three measures can be used to judge the relative
strength of the rules. So what do these terms mean?
Support
Confidence
Lift
Lift is the factor by which the co-occurrence of A and B exceeds the probability of A and B co-occurring
that would be expected if they were independent. So, the higher the lift, the higher the chance of A and B
occurring together.
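Written out as formulas, the three measures named in the headings above are commonly defined as follows. This is a sketch in standard notation, assuming a rule A => B where A is the LHS itemset, B is the RHS itemset, and P(A ∩ B) denotes the fraction of transactions that contain every item of both A and B:

```latex
\begin{align}
\mathrm{Support}(A \Rightarrow B) &= \frac{\#\{\text{transactions containing both } A \text{ and } B\}}{\#\{\text{all transactions}\}} = P(A \cap B) \\
\mathrm{Confidence}(A \Rightarrow B) &= \frac{P(A \cap B)}{P(A)} \\
\mathrm{Lift}(A \Rightarrow B) &= \frac{\mathrm{Confidence}(A \Rightarrow B)}{P(B)} = \frac{P(A \cap B)}{P(A)\,P(B)}
\end{align}
```

Here P(B) plays the role of the expected confidence, so a lift of 1 corresponds to A and B being independent.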
Calculation
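As a rough illustration, the three measures can be computed by hand in base R. The five transactions and the helper function support() below are made up for this sketch and are not part of arules:

```r
# Five hypothetical transactions, each a character vector of items.
transactions <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread"),
  c("butter", "jam")
)

# Hypothetical helper: fraction of transactions that contain every item in `items`.
support <- function(items) {
  mean(sapply(transactions, function(tr) all(items %in% tr)))
}

A <- "milk"   # LHS of the rule
B <- "bread"  # RHS of the rule

supp_AB    <- support(c(A, B))         # share of transactions with both A and B
confidence <- supp_AB / support(A)     # share of A-transactions that also contain B
lift       <- confidence / support(B)  # confidence relative to the expected confidence

c(support = supp_AB, confidence = confidence, lift = lift)
```

On real data these values are reported by apriori() directly; the manual version is only meant to show how the numbers relate to each other.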
Example: Groceries
Transactions data
The Groceries data set comes with the arules package. Unlike with a data frame, head(Groceries) does
not display the transaction items in the data. To view the transactions, use the inspect() function
instead.
Since association mining deals with transactions, the data has to be converted to the transactions
class, made available in R through the arules package. This is a necessary step because the
apriori() function operates on objects of class transactions (other inputs are coerced to that class first).
library(arules)
data("Groceries") # Groceries ships with arules, not with the datasets package
class(Groceries)
inspect(head(Groceries, 3))
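The conversion to the transactions class described above can be sketched on a small example. The basket contents here are invented for illustration; the sketch assumes the items of each transaction are held in a plain R list:

```r
library(arules)

# Hypothetical baskets: one character vector of items per transaction.
baskets <- list(
  c("milk", "bread"),
  c("milk"),
  c("bread", "butter", "milk")
)

# Coerce the list to the transactions class used by eclat() and apriori().
trans <- as(baskets, "transactions")

class(trans)    # class of the coerced object
inspect(trans)  # prints the items of each transaction
```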
eclat() takes a transactions object and returns the most frequent itemsets in the data, based on the
minimum support you provide in the supp argument. The maxlen argument defines the maximum number of
items in each frequent itemset.
frequentItems <- eclat(Groceries, parameter = list(supp = 0.07, maxlen = 15)) # calculates support for frequent items
inspect(frequentItems)
# Support: the fraction of transactions in the data set that contain the itemset.
# Confidence: the probability that a rule is correct for a new transaction with the items on the left.
# Lift: the ratio by which the confidence of a rule exceeds the expected confidence. A lift of 1
# indicates that the items on the left and right are independent.
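rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5)) # min support 0.001, min confidence 0.5
rules_conf <- sort(rules, by = "confidence", decreasing = TRUE) # highest-confidence rules first
inspect(head(rules_conf)) # show support, confidence and lift for the top rules
rules_lift <- sort(rules, by = "lift", decreasing = TRUE) # highest-lift rules first
inspect(head(rules_lift))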
Rules with a confidence of 1 (see rules_conf above) imply that, whenever the LHS items were
purchased, the RHS item was also purchased 100% of the time.
A rule with a lift of 18 (see rules_lift above) implies that the items in the LHS and RHS are 18 times more
likely to be purchased together than would be expected if their purchases were unrelated.
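Confidence and lift can also be used together to narrow the rule set down to the strongest rules. The following is a minimal sketch using subset() from arules; the thresholds 0.9 and 3 are arbitrary values chosen for illustration:

```r
library(arules)
data("Groceries")

# Mine rules with the same settings as above, without the progress output.
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.5),
                 control = list(verbose = FALSE))

# Keep only rules that are both highly confident and well above independence.
strong <- subset(rules, subset = confidence >= 0.9 & lift > 3)

# Show the three remaining rules with the highest lift.
inspect(head(sort(strong, by = "lift"), 3))
```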
Control The Number Of Rules in Output
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5, maxlen = 3)) # maxlen = 3 limits each rule to at most 3 items
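To see how the parameters affect the size of the output, the rule counts for two settings can be compared. This is a sketch only; the second confidence threshold of 0.8 is an arbitrary choice for the comparison:

```r
library(arules)
data("Groceries")

# Same support and maxlen, two different minimum confidence values.
rules_loose <- apriori(Groceries,
                       parameter = list(supp = 0.001, conf = 0.5, maxlen = 3),
                       control = list(verbose = FALSE))
rules_strict <- apriori(Groceries,
                        parameter = list(supp = 0.001, conf = 0.8, maxlen = 3),
                        control = list(verbose = FALSE))

# The stricter threshold can never return more rules than the looser one.
c(loose = length(rules_loose), strict = length(rules_strict))
```

Raising supp or conf, or lowering maxlen, reduces the number of rules; doing the opposite increases it.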
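Sometimes it is desirable to remove redundant rules. is.redundant() flags a rule as redundant when a more
general rule (same RHS, fewer items on the LHS) has the same or a higher confidence. To remove such rules,
filter them out.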
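redundant <- is.redundant(rules) # logical vector, TRUE for each redundant rule
sum(redundant) # number of redundant rules
which(redundant) # positions of the redundant rules
# remove them; logical indexing also works when no rule is redundant
rulesNR <- rules[!redundant]
sum(is.redundant(rulesNR)) # should be 0 now
inspect(head(rulesNR))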
One drawback of apriori() is that it returns rules with only 1 item on the RHS, irrespective of the
support, confidence or minlen parameters.
To read transaction data from a file, use read.transactions().
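A minimal sketch of read.transactions() follows. It assumes a file in "basket" format, that is, one transaction per line with comma-separated items; the file contents are made up and written to a temporary file so that the example is self-contained:

```r
library(arules)

# Write a tiny hypothetical basket file: one transaction per line.
path <- tempfile(fileext = ".csv")
writeLines(c("milk,bread", "milk", "bread,butter,milk"), path)

# Read the file back as an object of class transactions.
trans <- read.transactions(path, format = "basket", sep = ",")
inspect(trans)
```

For files with one (transaction ID, item) pair per line, read.transactions() also offers format = "single" together with the cols argument.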
If your transactions are already stored in a data frame, you can convert it to class transactions
as follows:
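One possible conversion is sketched here. It assumes a long-format data frame with one row per (transaction, item) pair; the data frame purchases and its columns id and item are hypothetical names used only for this example:

```r
library(arules)

# Hypothetical long-format data frame: one row per purchased item.
purchases <- data.frame(
  id   = c(1, 1, 2, 3, 3, 3),
  item = c("milk", "bread", "milk", "bread", "butter", "milk"),
  stringsAsFactors = FALSE
)

# Group the items by transaction ID, then coerce the resulting list.
trans <- as(split(purchases$item, purchases$id), "transactions")
inspect(trans)
```

If the data frame is instead in wide format with factor columns, as(myDataFrame, "transactions") can be applied to it directly, where myDataFrame stands for your own data.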