0% found this document useful (0 votes)
69 views15 pages

PHM: Mining Periodic High-Utility Itemsets

The document proposes a new data mining task called periodic high-utility itemset mining. This task aims to efficiently discover groups of items that are periodically bought by customers and generate a high profit. An algorithm called PHM is presented to address the limitations of existing high-utility itemset mining algorithms in identifying recurring customer purchase behaviors. Experimental results show that PHM can efficiently find the desired periodic high-utility itemsets by filtering out a large number of non-periodic patterns.

Uploaded by

SnehaAriga
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
69 views15 pages

PHM: Mining Periodic High-Utility Itemsets

The document proposes a new data mining task called periodic high-utility itemset mining. This task aims to efficiently discover groups of items that are periodically bought by customers and generate a high profit. An algorithm called PHM is presented to address the limitations of existing high-utility itemset mining algorithms in identifying recurring customer purchase behaviors. Experimental results show that PHM can efficiently find the desired periodic high-utility itemsets by filtering out a large number of non-periodic patterns.

Uploaded by

SnehaAriga
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

PHM: Mining Periodic High-Utility Itemsets

Philippe Fournier-Viger1 , Jerry Chun-Wei Lin2 ,


Quang-Huy Duong3 , Thu-Lan Dam3,4 ,
1
School of Natural Sciences and Humanities, Harbin Institute of Technology
Shenzhen Graduate School, China
2
School of Computer Science and Technology, Harbin Institute of Technology
Shenzhen Graduate School, China
3
College of Computer Science and Electronic Engineering, Hunan University, China
4
Faculty of Information Technology, Hanoi University of Industry, Vietnam
philfv@hitsz.edu.cn, jerrylin@ieee.org,
huydqyb@gmail.com, lanfict@gmail.com

Abstract. High-utility itemset mining is the task of discovering high-


utility itemsets, i.e. sets of items that yield a high profit in a customer
transaction database. High-utility itemsets are useful, as they provide
information about profitable sets of items bought by customers to retail
store managers, which can then use this information to take strategic
marketing decisions. An inherent limitation of traditional high-utility
itemset mining algorithms is that they are inappropriate to discover re-
curring customer purchase behavior, although such behavior is common
in real-life situations (for example, a customer may buy some products
every day, week or month). In this paper, we address this limitation by
proposing the task of periodic high-utility itemset mining. The goal is
to discover groups of items that are periodically bought by customers
and generate a high profit. An efficient algorithm named PHM (Peri-
odic High-utility itemset Miner) is proposed to efficiently enumerate all
periodic high-utility itemsets. Experimental results show that the PHM
algorithm is efficient, and can filter a huge amount of non periodic pat-
terns to reveal only the desired periodic high-utility itemsets.

Keywords: high-utility itemset, periodic itemset, average periodicity

1 Introduction
High-utility itemset mining (HUIM) [4, 5, 8–10, 13] is a popular data mining task.
It has attracted a lot of attention in recent years. It extends the traditional
problem of Frequent Itemset Mining (FIM) [1]. This latter consists of discov-
ering frequent itemsets. i.e. groups of items (itemsets) appearing frequently in
a transaction database [1]. FIM has many applications. However, an important
limitation of FIM is that it assumes that each item cannot appear more than once
in each transaction and that all items have the same importance (e.g. weight,
unit profit or value). High-Utility Itemset Mining (HUIM) addresses this issue
by considering that each item may have non binary purchase quantities in trans-
actions and that each item has a weight (e.g. unit profit). The goal of HUIM
is to discover itemsets having a high utility (e.g. yielding a high profit) in a
transaction database. Besides, market basket analysis, HUIM has several other
applications such as website click stream analysis, and biomedical applications [9,
13]. Mining high-utility itemsets is widely recognized as more challenging than
FIM because the utility measure used in HUIM is not anti-monotonic, i.e. a
high utility itemset may have supersets or subsets having lower, equal or higher
utilities [4]. Thus, techniques for reducing the search space in FIM cannot be
directly reused in HUIM.
Though several algorithms have been proposed for HUIM [4, 5, 8–10, 13], an
inherent limitation of these algorithms is that they are inappropriate to dis-
cover recurring customer purchase behavior, although such behavior is common
in real-life situations. For example, in a retail store, some customers may buy
some set of products on approximately a daily or weekly basis. Detecting these
purchase patterns is useful to better understand the behavior of customers and
thus adapt marketing strategies, for example by offering specific promotions to
cross-promote products such as reward or points to customers who are buying a
set of products periodically. In the field of FIM, algorithms have been proposed
to discover periodic frequent patterns (PFP) [2, 3, 6, 11, 7, 12] in a transaction
database. However, these algorithms are inadequate to find periodic patterns
that yield a high profit, as they only select patterns based on their frequency.
Hence, these algorithms may find a huge amount of periodic patterns that gen-
erate a low profit and miss many rare periodic patterns that yield a high profit.
To address this limitation of previous work, this paper proposes the task of
periodic high-utility itemset mining. The goal is to efficently discover all groups
of items that are bought together periodically and generate a high profit, in
a customer transaction database. The contributions of this paper are fourfold.
First, the concept of periodic patterns used in FIM is combined with the con-
cept of HUIs to define a new type of patterns named periodic high-utility itemsets
(PHIs), and its properties are studied. Second, novel measures of a pattern’s pe-
riodicity named average periodicity and minimum periodicity are introduced to
provide a flexible way of assessing the periodicity of patterns. Third, an effi-
cient algorithm named PHM (Periodic High-utility itemset Miner) is proposed
to efficiently discover the periodic high-utility itemsets. Fourth, an extensive
experimental evaluation is carried to compare the efficiency of PHM with the
state-of-the-art FHM algorithm for HUIM. Experimental results show that the
PHM algorithm is efficient, and can filter a huge amount of non periodic patterns
to reveal only the desired itemsets.
The rest of this paper is organized as follows. Section 2, 3, 4, 5 and 6 respec-
tively presents preliminaries related to HUIM, related work, the PHM algorithm,
the experimental evaluation and the conclusion.

2 Related work

This section reviews related work in high-utility itemset mining and periodic
frequent pattern mining.
2.1 High-utility itemset mining

Definition 1 (transaction database). Let I be a set of items (symbols). A


transaction database is a set of transactions D = {T1 , T2 , ..., Tn } such that for
each transaction Tc , Tc ∈ I and Tc has a unique identifier c called its Tid. Each
item i ∈ I is associated with a positive number p(i), called its external utility
(e.g. unit profit). For each transaction Tc such that i ∈ Tc , a positive number
q(i, Tc ) is called the internal utility of i (e.g. purchase quantity).

Example 1. Consider the database of Fig. 1, which will be used as running ex-
ample. This database contains seven transactions (T1 , T2 ...T7 ). Transaction T3
indicates that items a, b, c, d, and e appear in this transaction with an internal
utility of respectively 1, 5, 1, 3 and 1. Fig. 2 indicates that the external utility
of these items are respectively 5, 2, 1, 2 and 3.

Table 1: A transaction database Table 2: External utility values

TID Transaction Item abcd e


T1 (a, 1), (c, 1), Unit profit 5 2 1 2 3
T2 (e, 1)
T3 (a, 1), (b, 5), (c, 1), (d, 3), (e, 1)
T4 (b, 4), (c, 3), (d, 3), (e, 1)
T5 (a, 1), (c, 1), (d, 1)
T6 (a, 2), (c, 6), (e, 2)
T7 (b, 2), (c, 2), (e, 1)

Definition 2 (utility of an item/itemset). The utility of an item i in a


transaction Tc is denoted as u(i, Tc ) and defined as p(i) × q(i, Tc ). The utility
of an itemset X (a group of items P X ⊆ I) in a transaction Tc is denoted as
u(X, Tc ) and defined as u(X, Tc ) = i∈X u(i, Tc ). The utility
P of an itemset X
(in a database) is denoted as u(X) and defined as u(X) = Tc ∈g(X) u(X, Tc ),
where g(X) is the set of transactions containing X.

Example 2. The utility of item a in T6 is u(a, T6 ) = 5 × 2 = 10. The utility of


the itemset {a, c} in T6 is u({a, c}, T6 ) = u(a, T6 ) + u(c, T6 ) = 5 × 2 + 1 × 6 = 16.
The utility of the itemset {a, c} (in the database) is u({a, c}) = u(a) + u(c) =
u(a, T1 ) + u(a, T3 ) + u(a, T5 ) + u(a, T6 ) + u(c, T1 ) + u(c, T3 ) + u(c, T5 ) + u(c, T6 ) =
5 + 5 + 5 + 10 + 1 + 1 + 1 + 6 = 34.

Definition 3 (high-utility itemset mining). The problem of high-utility


itemset mining is to discover all high-utility itemsets [4, 5, 8–10, 13]. An itemset
X is a high-utility itemset if its utility u(X) is no less than a user-specified
minimum utility threshold minutil given by the user.
Example 3. If minutil = 30, the complete set of HUIs is {a, c} : 34, {a, c, e} : 31,
{b, c, d} : 34, {b, c, d, e} : 40, {b, c, e} : 37, {b, d} : 30, {b, d, e} : 36, and {b, e} : 31,
where each HUI is annotated with its utility.

In HUIM, the utility measure is not monotonic or anti-monotonic [10, 13], i.e.,
an itemset may have a utility lower, equal or higher than the utility of its subsets.
Several HUIM algorithms circumvent this problem by overestimating the utility
of itemsets using the Transaction-Weighted Utilization (TWU) measure [10, 13],
which is anti-monotonic, and defined as follows.

Definition 4 (Transaction weighted utilization). The transaction utility


(TU ) of a Ptransaction Tc is the sum of the utility of all the items in Tc . i.e.
T U (Tc ) = x∈Tc u(x, Tc ). The transaction-weighted utilization (TWU ) of an
itemset X is defined as theP sum of the transaction utility of transactions con-
taining X, i.e. T W U (X) = Tc ∈g(X) T U (Tc ).

Example 4. The TUs of T1 , T2 , T3 , T4 , T5 , T6 and T7 are respectively 6,3, 25, 20,


8, 22 and 9. The TWU of single items a, b, c, d, e are respectively 61, 54, 90, 53
and 79. T W U ({c, d}) = T U (T3 ) + T U (T4 ) + T U (T5 ) = 25 + 20 + 8 = 53.

Theorem 1 (Pruning search space using the TWU). Let X be an itemset,


if T W U (X) < minutil, then X and its supersets are low utility. [10]

Algorithms such as Two-Phase [10], BAHUI [8], PB [5], and UPGrowth+ [13]
utilizes the above property to prune the search space. They operate in two phases.
In the first phase, they identify candidate high utility itemsets by calculating
their TWUs. In the second phase, they scan the database to calculate the exact
utility of all candidates found in the first phase to eliminate low utility itemsets.
Recently, an alternative algorithm called HUI-Miner [9] was proposed to mine
HUIs directly using a single phase. Then, a faster depth-first search algorithm
FHM [4] was proposed, which extends HUI-Miner. In FHM, each itemset is
associated with a structure named utility-list [4, 9]. Utility-lists allow calculating
the utility of an itemset quickly by making join operations with utility-lists of
shorter patterns. Utility-lists are defined as follows.

Definition 5 (Utility-list). Let  be any total order on items from I. The


utility-list of an itemset X in a database D is a set of tuples such that there is a
tuple (tid, iutil, rutil) for each transaction Ttid containing X. The iutil element
of a tuple is the
P utility of X in Ttid . i.e., u(X, Ttid ). The rutil element of a tuple
is defined as i∈Ttid ∧ix∀x∈X u(i, Ttid ).

Example 5. Assume that  is the alphabetical order. The utility-list of {a} is


{(T1 , 5, 1), T3 , 5, 20), (T5 , 5, 3), (T6 , 10, 12)}. The utility-list of {d} is {(T3 , 6, 3),
(T4 , 6, 3), (T5 , 2, 0)}. The utility-list of {a, d} is {(T3 , 11, 3), (T5 , 7, 0)}.

To discover HUIs, FHM performs a single database scan to create utility-


lists of patterns containing single items. Then, longer patterns are obtained
by performing the join operation of utility-lists of shorter patterns. The join
operation for single items is performed as follows. Consider two items x, y such
that x  y, and their utility-lists ul({x}) and ul({y}). The utility-list of {x, y}
is obtained by creating a tuple (ex.tid, ex.iutil + ey.iutil, ey.rutil) for each pairs
of tuples ex ∈ ul({x}) and ey ∈ ul({y}) such that ex.tid = ey.tid. The join
operation for two itemsets P ∪ {x} and P ∪ {y} such that x  y is performed
as follows. Let ul(P ), ul({x}) and ul({y}) be the utility-lists of P , {x} and {y}.
The utility-list of P ∪ {x, y} is obtained by creating a tuple (ex.tid, ex.iutil +
ey.iutil − ep.iutil, ey.rutil) for each set of tuples ex ∈ ul({x}), ey ∈ ul({y}),
ep ∈ ul(P ) such that ex.tid = ey.tid = ep.tid. Calculating the utility of an
itemset using its utility-list and pruning the search space is done as follows.
Property 1 (Calculating utility of an itemset using its utility-list). The utility of
an itemset is the sum of iutil values in its utility-list [9].
Theorem 2 (Pruning search space using utility-lists). Let X be an item-
set. Let the extensions of X be the itemsets that can be obtained by appending
an item y to X such that y  i, ∀i ∈ X. If the sum of iutil and rutil values in
ul(X) is less than minutil, X and its extensions are low utility [9].
FHM is very efficient. However, an important limitation of current HUIM
algorithms is that they are not designed for discovering periodic patterns.

2.2 Periodic Frequent Pattern Mining


In the field of FIM, algorithms have been proposed to discover periodical frequent
patterns (PFP) [2, 3, 6, 11, 7, 12] in a transaction database. Discovering PFP has
applications in many domains such as web mining, bioinformatics, and market
basket analysis [12]. The concept of PFP is defined as follows [12].
Definition 6 (Periods of an itemset). Let there be a database D = {T1 , T2 ,
..., Tn } containing n transactions, and an itemset X. The set of transactions
containing X is denoted as g(X) = {Tg1 , Tg2 ..., Tgk }, where 1 ≤ g1 < g2 <
... < gk ≤ n. Two transactions Tx ⊃ X and Ty ⊃ X are said to be consecutive
with respect to X if there does not exist a transaction Tw ∈ g(X) such that
x < w < y. The period of two consecutive transactions Tx and Ty in g(X) is
defined as pe(Tx , Ty ) = (y − x), that is the number of transactions between
Tx and Ty . The periods of an itemset X is a list of periods defined as ps(X) =
{g1 −g0 , g2 −g1 , g3 −g2 , ...gk −gk−1 , gk+1 −gk }, where
S g0 and gk +1 are constants
defined as g0 = 0 and gk + 1 = n. Thus, ps(X) = 1≤z≤k+1 (gz − gz−1 ).
Example 6. For the itemset {a, c}, The list of transactions containing {a, c} is
g({a, c}) = {T1 , T3 , T5 , T6 }. Thus, the periods of this itemset are ps({a, c}) =
{1, 2, 2, 1, 1}.
Definition 7 (Periodic Frequent Pattern). The maximum periodicity of
an itemset X is defined as maxper(X) = max(ps(X)) [12]. An itemset X is a pe-
riodic frequent pattern (PFP) if |g(X)| ≥ minsup and maxper(X) < maxP er,
where minsup and maxP er are user-defined thresholds [12].
The first algorithm for mining PFPs is PFP-Tree [12]. It utilizes a tree-based
and pattern-growth approach for discovering PFPs. Then, the MTKPP algo-
rithm [2] was proposed for discovering the k most frequent PFPs in a database,
where k is a user-specified parameter. MTKPP utilizes a vertical structure to
maintain information about itemsets in the database A variation of the PF-Tree
algorithm named the ITL-Tree was also introduced [3] to reduce the time for
mining PFPs by approximating the periodicity of itemsets. Another approxi-
mate algorithm for PFP mining was recently proposed [7]. Other extensions of
the PF-Tree algorithm named MIS-PF-tree [6] and MaxCPF [11] were respec-
tively proposed to mine PFPs using multiple minsup thresholds, and multiple
minsup and minper thresholds.
An important limitation of traditional algorithms for PFP mining is that
they are inadequate to find periodic patterns that yield a high profit, since they
only consider the support (frequency) of patterns. Hence, they may find a huge
amount of periodic patterns that yield a low profit and miss many rare periodical
patterns that yield a high profit.

3 The PHM algorithm


To address the aforementioned limitation of HUI and PFP mining algorithms,
this section introduces the concept of periodic high-utility itemsets (PHUIs). The
first subsection present novel measures to assess the periodicity of HUIs, while
the second subsection presents and efficient algorithm named PHM (Periodic
High-Utility Itemset Miner) to discover PHUIs efficiently.

3.1 Measuring the periodicity of high-utility patterns


A drawback of the maximum periodicity measure used by most PFP algorithms
is that an itemset is automatically discarded if it has a single period of length
greater than the maxP er threshold. Thus, this measure may be viewed as too
strict. To provide a more flexible way of evaluating the periodicity of patterns,
the concept of average periodicity is introduced in the proposed algorithm.

Definition 8 (Average periodicity of anPitemset). The average periodicity


of an itemset X is defined as avgper(X) = g∈ps(X) /|ps(X)|.

Example 7. The periods of itemsets {a, c} and {e} are respectively ps({a, c}) =
{1, 2, 2, 1, 1} and ps({e}) = {2, 1, 1, 2, 1, 0}. The average periodicities of these
itemsets are respectively avgper({a, c}) = 1.4 and avgper({e}) = 1.16.

Lemma 1 (Relationship between average periodicity and support). Let


X be an itemset appearing in a database D. An alternative and equivalent way
of calculating the average periodicity of X is avgper(X) = |D|/(|g(X)| + 1).

Proof. Let g(X) = {Tg1 , Tg2 , . . . , Tgk } be the set of transactions


P containing X,
such that g1 < g2 < . . . < gk . By definition, avgper(X) = g∈ps(X) /|ps(X)|.
P
To prove that the lemma holds, we need to show that g∈ps(X) /|ps(X)| =
|D|/(|g(X)| + 1). P
(1) We first show that g∈ps(X) = |D|, as follows:
P
g∈ps(X) = (g1 − g0 ) + (g2 − g1 ) + . . . (gk − gk−1 ) + (gk+1 − gk )}
P
= g∈ps(X) = g0 + (g1 − g1 ) + (g2 − g2 ) + . . . (gk − gk ) + (gk+1 )
= gk+1 − g0 = |D|.
(2) We then show that S |ps(X)| = |g(X)| + 1, as follows:
By definition, ps(X) = 1≤z≤k+1 (gz − gz−1 ). Thus, the set ps(X) contains k+1
elements. Since X appears in k transactions, sup(X) = k, and thus |ps(X)| =
|g(X)| + 1.
Since (1) and (2) holds, the lemma holds. t
u
The above lemma is important as it provides an efficient way of calculating
the average periodicity of itemsets in a database D. The term |D| can be cal-
culated once, and thereafter the average periodicity of any itemset X can be
obtained by only calculating |g(X)| + 1, and then dividing |D| by the result.
This is more efficient than calculating the average periodicity using Definition 8.
Besides, this lemma is important as it shows that there is a relationship between
the support used in FIM and the average periodicity of a pattern.
Although the average periodicity is useful as it measures what is the typi-
cal period length of an itemset, it should not be used as the sole measure for
evaluating the periodicity of a pattern because it does not consider whether an
itemset has periods that vary widely or not. For example, the itemset {b, d} has
an average periodicity of 2.33. However, this is misleading since this itemset only
appears in transaction T3 and T4 , and its periods ps({T3 , T4 }) = {3, 1, 4} vary
widely. Intuitively, this pattern should not be a periodic pattern. To avoid finding
patterns having periods that vary widely, our solution is to combine the average
periodicity measure with other periodicity measure(s). The following measures
are combined with the average periodicity to achieve this goal.
First, we define the minimum periodicity of an itemset as minper(X) =
min(ps(X) to avoid discovering itemsets having some very short periods. But
this measure is not reliable since the first and last period of an itemset are
respectively equal to 1 or 0 if the itemset respectively appears in the first or
the last transaction of the database. For example, the last period of itemset
{e} is 0, because it appears in the last transaction (T7 ), and thus its minimum
periodicity is 0. Our solution to this issue is to exclude the first and last periods
of an itemset from the calculation of the minimum periodicity. Moreover, if the
set of periods is empty as a result of this exclusion, the minimum periodicity is
defined as ∞. In the rest of this paper, we consider this definition.
Second, we consider the maximum periodicity of an itemset maxper(X) as
defined in the previous section. The rationale for using this measure in combi-
nation with the average periodicity is that it can avoids discovering periodical
patterns that do not occur for long periods of time.
In terms of calculation cost, a reason for choosing the minimum periodicity,
maximum periodicity and average periodicity as measure is that they can be
calculated very efficiently for an itemset X by scanning the list of transactions
g(X) only once. That is, calculating these measures do not require to store the
set of periods ps(X) in memory. Conversely, other measure such as the stan-
dard deviation would require to calculate all periods of an itemset beforehand.
Thus, we define the concept of periodic high-utility itemsets by considering the
minimum periodicity, maximum periodicity and average periodicity measures.

Definition 9 (Periodic High-Utility Itemsets). Let minutil, minAvg,


maxAvg, minP er and maxP er be positive numbers, provided by the user. An
itemset X is a periodic high-utility itemset if and only if minAvg ≤ avgper(X) ≤
maxAvg, minper(X) ≥ minP er, maxper(X) ≤ maxP er, and u(X) ≥ minutil.

For example, if minutil = 20, minP er = 1, maxP er = 3, minAvg = 1, and


maxAvg = 2, the complete set of PHUIs is shown in table 3.

Table 3: The set of PHUIs in the running example

Itemset u(X) |g(X)| minper(X) maxper(X) avgper(X)


{b} 22 3 1 3 1.75
{b, e} 31 3 1 3 1.75
{b, c, e} 37 3 1 3 1.75
{b, c} 28 3 1 3 1.75
{a} 25 4 1 2 1.4
{a, c} 34 4 1 2 1.4
{c, e} 27 4 1 3 1.4

To develop an efficient algorithms for mining PHUIs, it is important to de-


sign efficient pruning strategies. To use the periodicity measures for pruning the
search space, the following theorems are presented.

Lemma 2 (Monotonicity of the average periodicity). Let X and Y be


itemsets such that X ⊂ Y . It follows that avgper(Y ) ≥ avgper(X).

Proof. The average periodicities of X and Y are respectively avgper(X) =


|D|/(|g(X)| + 1) and avgper(Y ) = |D|/(|g(Y )| + 1). Because X ⊂ Y , it fol-
lows that g(Y ) ⊆ g(X). Hence, avgper(Y ) ≥ avgper(X). t
u

Lemma 3 (Monotonicity of the minimum periodicity). Let X and Y be


itemsets such that X ⊂ Y . It follows that minper(Y ) ≥ minper(X).

Proof. Since X ⊂ Y , g(Y ) ⊆ g(X). If g(Y ) = g(X), then X and Y have the
same periods, and thus minper(Y ) = minper(X). If g(Y ) ⊂ g(X), then for
each transaction Tx ∈ g(X) \ g(Y ), the corresponding periods in ps(X) will
be replaced by a larger period in ps(Y ). Thus, any period in ps(Y ) cannot be
smaller than a period in ps(X). Hence, minper(Y ) ≥ minper(X). t
u

Lemma 4 (Monotonicity of the maximum periodicity). Let X and Y


be itemsets such that X ⊂ Y . It follows that maxper(Y ) ≥ maxper(X) [12].
Theorem 3 (Maximum periodicity pruning). Let X be an itemset ap-
pearing in a database D. X and its supersets are not PHUIs if maxper(X) >
maxP er. Thus, if this condition is met, the search space consisting of X and all
its supersets can be discarded.
Proof. By definition, if maxper(X) > maxP er, X is not a PHUI. By Lemma 4,
supersets of X are also not PHUIs.
Theorem 4 (Average periodicity pruning). Let X be an itemset appearing
in a database D. X is not a PHUI as well as all of its supersets if avgper(X) >
maxAvg, or equivalently if |g(X)| < (|D|/maxAvg) − 1. Thus, if this condition
is met, the search space consisting of X and all its supersets can be discarded.
Proof. By definition, if avgper(X) > maxAvg, X is not a PHUI. By Lemma 2,
supersets of X are also not PHUIs. The pruning condition avgper(X) > maxAvg
is rewritten as: |D|/(|g(X)|+1) > maxAvg. Thus, 1/(|g(X)|+1) > maxAvg/|D|,
which can be further rewritten as |g(X)| + 1 < |D|/maxAvg, and as |g(X)| <
(|D|/maxAvg) − 1. t
u

3.2 The algorithm


The proposed PHM algorithm is a utility-list based algorithm, inspired by the
FHM algorithm [4], where the utility-list of each itemset X is annotated with
two additional values: minper(X) and maxper(X). The main procedure of
PHM (Algorithm 1) takes a transaction database as input, and the minutil,
minAvg, maxAvg, minP er and maxP er thresholds. The algorithm first scans
the database to calculate T W U ({i}), minper({i}), maxper({i}), and |g({i})| for
each item i ∈ I. Then, the algorithm calculates the value γ = (|D|/maxAvg) − 1
to be later used for pruning itemsets using Theorem 4. Then, the algorithm iden-
tifies the set I ∗ of all items having a TWU no less than minutil, a maximum
periodicity no greater than maxP er, and appearing in no less than γ transac-
tions (other items are ignored since they cannot be part of a PHUI by Theorem
1, 3 and 4). The TWU values of items are then used to establish a total order
 on items, which is the order of ascending TWU values (as suggested in [9]).
A database scan is then performed. During this database scan, items in trans-
actions are reordered according to the total order , the utility-list of each item
i ∈ I ∗ is built and a structure named EUCS (Estimated Utility Co-Occurrence
Structure) is built [4]. This latter structure is defined as a set of triples of the
form (a, b, c) ∈ I ∗ × I ∗ × R. A triple (a,b,c) indicates that TWU({a, b}) = c.
The EUCS can be implemented as a triangular matrix (as shown in Fig. 1 for
the running example), or as a hashmap of hashmaps where only tuples of the
form (a, b, c) such that c 6= 0 are kept. After the construction of the EUCS, the
depth-first search exploration of itemsets starts by calling the recursive proce-
dure Search with the empty itemset ∅, the set of single items I ∗ , γ, minutil,
minAvg, minP er, maxP er, the EUCS structure, and |D|.
The Search procedure (Algorithm 2) takes as input an itemset P , extensions
of P having the form P z meaning that P z was previously obtained by appending
T5 (b,2)(c,2)(e,1)(g,2)

TID TU Item TWU Item a b c d e f


T1 8 a 65 b 30
T2 27 b 61 TU(T_1)
c 65 =617 TU(T_2) = 27 TU(T_3) = 30 TU(T_4) = 20
T3 30 c 96 d 38 50 58
T4 20 d 58 e 57 = 61
TWU(a) 65 77 50
TWU(b)= 61 TWU(c)= 96
T5 11 e 88 f 30 30 30 30 30
f 30 TWU(f)=
g 27 3038 TWU(g)
38 0 =3838 0
g 38
Algorithm 1: The PHM algorithm
input : D: a transaction database,
minutil, minAvg, maxAvg, minP er and maxP er: the thresholds
support
output: the set of periodic high-utility itemsets
TID TU Item TWU Item a b c d e
T1 8 a 65 b 30
1 Scan D once to calculate T W U ({i}), minper({i}),
Item a b c d T2 maxper({i}),
27 b |g({i})|
and 61 c 65 61
for each item i ∈ I; T3 30 c 96 d 38 50 58
b 1
2 γ ← (|D|/maxAvg) − 1; T 20 d 58 e 57 61 77 50
∗ c 4 3 4
3 I ← each item i such that TWU(i) ≥ minutil, |g({i})| ≥ γ and
d 2 2 3 T 5 11 e 88 f 30 30 30 30 3
maxper({i}) ≤ maxP er; f 30 g 27 38 38 0 3
e 2 3 4 2 ∗
4 Let  be the total order of TWU ascending values on I ;
∗ g 38
5 Scan D to build the utility-list of each item i ∈ I and build the EU CS
structure;
TW(∅, I ∗ , γ, minutil, minAvg, minP er, maxP er, EU CS, |D|);
6 Search
U support

Item a b c d Item a b c d
b 25 b 1
c 61 54 c 4 3
d 33 45 53 d 2 2 3
e 47 54 76 45 e 2 3 4 2

Fig. 1: The EUCS Fig. 2: The ESCS


TW
U
an item z to P , γ, minutil, minAvg, minP er, maxP er, the EUCS, and |D|. The
search procedure performs a loop on each extension Item Pax of bP . Inc thisdloop, the
average periodicity of P x is obtained by dividing b|D| by25the number of elements
in the utility list of P x plus one (by Lemma 1). Then,c if the54average periodicity
61
of P x is in the [minAvg, maxAvg] interval, the dsum of 33 the45iutil53values of the
e 1),47
utility-list of P x is no less than minutil (cf. Property 54 76 45
the minimum/maximum
periodicity of P x is no less/not greater than minP er/maxP er according to
the values stored in its utility-list, then P x is a PHUI and it is output. Then,
if the sum of iutil and rutil values in the utility-list of P x are no less than
minutil, the number of elements in the utility list of P x is no less than γ,
and maxper(P x) is no greater than maxP er, it means that extensions of P x
should be explored (by Theorem 1, 3 and 4). This is performed by merging P x
with all extensions P y of P such that y  x to form extensions of the form
P xy containing |P x| + 1 items. The utility-list of P xy is then constructed by
calling the Construct procedure (cf. Algorithm 3), to join the utility-lists of
P , P x and P y. This latter procedure is mainly the same as in HUI-Miner [9],
with the exception that periods are calculated during utility-list construction to
obtain maxP er(P xy) and minP er(P xy) (not shown). Then, a recursive call to
the Search procedure with P xy is done to calculate its utility and explore its
extension(s). The Search procedure starts from single items, recursively explores
the search space of itemsets by appending single items, and only prunes the
search space using Theorem 1, 3 and 4. Thus, it can be easily seen that this
procedure is correct and complete to discover all PHUIs.

Algorithm 2: The Search procedure


input : P : an itemset, ExtensionsOfP: a set of extensions of P , γ, minutil,
minAvg, minP er, maxP er, the EU CS structure, |D|
output: the set of periodic high-utility itemsets
1 foreach itemset P x ∈ ExtensionsOfP do
2 avgperP x ← |D|/(|P x.utilitylist| + 1);
3 if SUM(P xy.utilitylist.iutils) ≥ minutil∧
minAvg ≤ avgperP x ≤ maxAvg ∧ P x.utilitylist.minp ≥
minP er ∧ P x.utilitylist.maxp ≤ maxP er∧ then output P x;
4 if SUM(P x.utilitylist.iutils)+SUM(P x.utilitylist.rutils) ≥ minutil ∧
avgperP x ≥ γ and P x.utilitylist.maxp ≤ maxP er then
5 ExtensionsOfPx ← ∅;
6 foreach itemset P y ∈ ExtensionsOfP such that y  x do
7 if ∃(x, y, c) ∈ EUCS such that c ≥ minutil) then
8 P xy ← P x ∪ P y;
9 P xy.utilitylist ← Construct (P, P x, P y);
10 ExtensionsOfPx ← ExtensionsOfPx ∪ {P xy};
11 end
12 end
13 Search (P x, ExtensionsOfPx, γ, minutil, minAvg, minP er, maxP er,
EU CS, |D|);
14 end
15 end

Furthermore, in the implementation of PHM, two additional optimizations


are included, which are briefly described next.
Optimization 1. Estimated Average Periodicity Pruning (EAPP).
The PHM algorithms creates a structure called EUCS to store the TWU of all
pairs of items occurring in the database, and this structure is used to prune any
itemset P xy containing a pair of items {x,y} having a TWU lower than minutil
(Line 7 of the search procedure). The strategy EAPP is a novel strategy that
uses the same idea but prune itemsets using the average periodicity instead of
the utility. During the second database scan, a novel structure called ESCS (Es-
timated Support Co-occurrence Structure) is created to store |g({x, y})| for each
pair of items {x,y} (as shown in Figure 2). Then, Line 7 of the search procedure
is modified to prune itemset P xy if |g({x, y})| is less than γ by Theorem 4.
Optimization 2. Abandoning List Construction early (ALC). An-
other strategy introduced in PHM is to stop constructing the utility-list of an
itemset if a specific condition is met, indicating that the itemset cannot be a
PHUI. By Theorem 4, an itemset P xy cannot be a PHUI, if it appears in less
than γ = (|D|/maxAvg) − 1 transactions. The strategy ALC consists of modify-
Algorithm 3: The Construct procedure
input : P : an itemset, P x: the extension of P with an item x, P y: the
extension of P with an item y
output: the utility-list of P xy
1 U tilityListOf P xy ← ∅;
2 foreach tuple ex ∈ P x.utilitylist do
3 if ∃ey ∈ P y.utilitylist and ex.tid = exy.tid then
4 if P.utilitylist 6= ∅ then
5 Search element e ∈ P.utilitylist such that e.tid = ex.tid.;
6 exy ← (ex.tid, ex.iutil + ey.iutil − e.iutil, ey.rutil);
7 end
8 else
9 exy ← (ex.tid, ex.iutil + ey.iutil, ey.rutil);
10 end
11 periodexy ← calculateP eriod(exy.tid,UtilityListOfPxy);
12 U pdateM inP erM axP er(U tilityListOf P xy, periodexy );
13 U tilityListOf P xy ← U tilityListOf P xy ∪ {exy};
14 end
15 end
16 return UtilityListPxy;

ing the Construct procedure (Algorithm 3) as follows. The first modification is


to initialize a variable max with the value γ in Line 1. The second modification
is to the following lines, where the utility-list of P xy is constructed by checking
if each tuple in the utility-lists of P x appears in the utility-list of P y (Line 3).
For each tuple not appearing in P y, the variable max is decremented by 1. If
max is smaller than γ, the construction of the utility-list of P xy can be stopped
because |g(P xy)| will not be higher than γ. Thus P xy is not a PHUI by Theorem
4, and its extensions can also be ignored.

4 Experimental Study

We performed an experimental study to assess the performance of PHM. The


experiment was performed on a computer with a sixth generation 64 bit Core i5
processor running Windows 10, and equipped with 12 GB of free RAM. We com-
pared the performance of the proposed PHM algorithm with the state-of-the-art
FHM algorithm for mining HUIs. All memory measurements were done using
the Java API. The experiment was carried on four real-life datasets commonly
used in the HUIM litterature: retail, mushroom, chainstore and foodmart. These
datasets have varied characteristics and represents the main types of data typi-
cally encountered in real-life scenarios (dense, sparse and long transactions). Let
|I|, |D| and A represents the number of transactions, distinct items and average
transaction length of a dataset. retail is a sparse dataset with many different
items (|I| = 16,470, |D| = 88,162, A = 10,30). mushroom is a dense dataset with
long transactions (|I| = 119, |D| = 8,124, A = 23). chainstore is a dataset that
contains a huge number of transactions (|I| = 461, |D| = 1,112,949, A = 7.23).
foodmart is a sparse dataset (|I| = 1,559, |D| = 4,141, A = 4.4). The chainstore
and foodmart datasets are real-life customer transaction databases containing
real external and internal utility values. The retail and mushroom datasets con-
tains synthetic utility values, generated randomly [9, 13]. The source code of all
algorithms and datasets can be downloaded from http://goo.gl/Y6eBdz.
In the experiment, PHM was run on each dataset with fixed minper and
minAvg values, while varying the minutil threshold and the values of the
maxAvg and maxper parameters. In these experiments, the values for the pe-
riodicity thresholds have been found empirically for each dataset (as they are
dataset specific), and were chosen to show the trade-off between the number of
periodic patterns found and the execution time. Note that results for varying
the minper and minAvg values are not shown because these parameters have
less influence on the patterns found than the other parameters. Thereafter, the
notation PHM V-W-X-Y represents the PHM algorithm with minper = V ,
maxper = W , minAvg = X, and maxAV G = Y .
Fig. 3 compares the execution times of PHM for various parameter values
and FHM. Fig. 4, compares the number of PHUIs found by PHM for various
parameter values, and the number of HUIs found by FHM.

Retail Mushroom
100 1000
FHM FHM
PHM 1-1000-5-500 PHM 1-1000-5-500
80 PHM 1-5000-5-250
PHM 1-250-5-100 PHM 1-500-5-250
100
Runtime (s)

Runtime (s)

60

40
10
20

1
0 10000 20000 30000 40000 50000 0 2000000 4000000 6000000 8000000 10000000
minutil minutil

Chaintstore Foodmart
1000 10 000
FHM
PHM 1-5000-5-500
1 000
PHM 1-2000-5-200
100
Runtime (ms)
Runtime (s)

100

10 FHM
10 PHM 10-5000-5-500
PHM 1-2000-5-200
PHM 10-1000-5-100
1 1
1000000,00 1600000,00 2200000,00 0 1000 2000 3000 4000 5000
minutil minutil

Fig. 3: Execution times

It can first be observed that mining PHUIs using PHM can be much faster
than mining HUIs. The reason for the excellent performance of PHM is that
it prunes a large part of the search space using its designed pruning strategies
based on the maximum and average periodicity measures. For all datasets, it can
be found that a huge amount of HUIs are non periodic, and thus pruning non
Retail Mushroom
25 10000
FHM FHM
PHM 1-1000-5-500 PHM 1-1000-5-500
20 PHM 1-500-5-250 1000
PHM 1-250-5-100 PHM 1-500-5-250

Pattern count
Pattern count
15
100
10

10
5

1
0 10000 20000 30000 40000 50000 0 2000000 4000000 6000000 8000000 10000000
minutil minutil

Chaintstore Foodmart
450 1 000 000
FHM
400
PHM 1-5000-5-500 100 000
350
PHM 1-2000-5-200
300 10 000

Pattern count
Pattern count

250
1 000
200
150 100 FHM
100 PHM 10-5000-5-500
10 PHM 1-2000-5-200
50
PHM 10-1000-5-100
0 1
1000000,00 1600000,00 2200000,00 0 1000 2000 3000 4000 5000
minutil minutil

Fig. 4: Number of patterns found

periodic patterns leads to a massive performance improvement. For example,


for the lowest minutil, maxP er and maxAvg values on these datasets, PHM is
respectively up to 214, 127, 100 and 230 times faster than FHM. In general, the
more the periodicity thresholds are restrictive, the more the gap between the
runtime of FHM and PHM increases.
A second observation is that the number of PHUIs can be much less than
the number of HUIs (see Fig. 4). For example, on retail, 20,714 HUIs are found
for minutil = 2, 000. But only 110 HUIs are PHUIs for PHM 1-1000-5-500, and
only 7 for PHM 1-250-5-150. Some of the patterns found are quite interesting
as they contains several items. For example, it is found that items with product
ids 32, 48 and 39 are periodically bought with an average periodicity of 16.32,
a minimum periodicity of 1, and a maximum periodicity of 170. Huge reduction
in the number of patterns are also observed on the other datasets. These overall
results show that the proposed PHM algorithm is useful as it can filter a huge
amount of non periodic HUIs encountered in real datasets, and can run faster.
Memory consumption was also compared, although detailed results are not
shown as a figure due to space limitation. It was observed that PHM can use up
to 10 times less memory than FHM depending on how parameters are set. For
example, on chainstore and minutil = 1, 000, 000, FHM and PHM 1-5000-5-500
respectively consumes 1,631 MB and 159 MB of memory.

5 Conclusion
This paper explored the problem of mining periodic high-utility itemsets (PHUIs).
An efficient algorithm named PHM (Periodic High-utility itemset Miner) was
proposed to efficiently discover PHUIs using novel minimum and average peri-
odicity measures. An extensive experimental study with real-life datasets has
shown that PHM can be more than two orders of magnitude faster than FHM,
and discover more than two orders of magnitude less patterns by filtering non
periodic HUIs. Source code of PHM, FHM and datasets can be downloaded from
http://goo.gl/Y6eBdz. For future work, we will consider designing alternative
algorithms to mine PHUIs.

References
1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large
databases. In: Proc. Int. Conf. Very Large Databases, pp. 487–499, (1994)
2. Amphawan, K., Lenca, P., Surarerks, A.: Mining top-k periodic-frequent pattern
from transactional databases without support threshold. In: Proc. 3rd Intern. Conf.
on Advances in Information Technology, pp. 18–29 (2009)
3. Amphawan, K., Surarerks, A., Lenca, P.: Mining periodic-frequent itemsets with
approximate periodicity using interval transaction-ids list tree. In: Proc. 2010 Third
Intern. Conf. on Knowledge Discovery and Data Mining, pp. 245-248 (2010)
4. Fournier-Viger, P., Wu, C.-W., Zida, S., Tseng, V. S.: FHM: Faster high-utility
itemset mining using estimated utility co-occurrence pruning. In: Proc. 21st Intern.
Symp. on Methodologies for Intell. Syst., pp. 83–92 (2014)
5. Lan, G. C., Hong, T. P., Tseng, V. S.: An efficient projection-based indexing ap-
proach for mining high utility itemsets. Knowl. and Inform. Syst. 38(1), 85–107
(2014)
6. Kiran, R. U., Reddy, P. K.: Mining Rare Periodic-Frequent Patterns Using Multiple
Minimum Supports. In: Proc. 15th Intern. Conf. on Management of Data (2009)
7. Uday, U. R., Kitsuregawa, M., Reddy, P. K.: Efficient Discovery of Periodic-Frequent
Patterns in Very Large Databases. Journal of Systems and Software, 112, 110–121
(2015)
8. Song, W., Liu, Y., Li, J.: BAHUI: Fast and memory efficient mining of high utility
itemsets based on bitmap. Intern. Journal of Data Warehousing and Mining. 10(1),
1–15 (2014)
9. Liu, M., Qu, J.: Mining high utility itemsets without candidate generation. In: Proc.
22nd ACM Intern. Conf. Info. and Know. Management, pp. 55–64 (2012)
10. Liu, Y., Liao, W., Choudhary, A.: A two-phase algorithm for fast discovery of high
utility itemsets. In: Proc. 9th Pacific-Asia Conf. on Knowl. Discovery and Data
Mining, pp. 689–695 (2005)
11. Surana, A., Kiran, R. U., Reddy, P. K.: An efficient approach to mine periodic-
frequent patterns in transactional databases. In: Proc. 2011 Quality Issues, Measures
of Interestingness and Evaluation of Data Mining Models Workshop, pp. 254–266
(2012)
12. Tanbeer, S. K., Ahmed, C. F., Jeong, B. S., Lee, Y. K.: Discovering periodic-
frequent patterns in transactional databases. In: Proc. 13th Pacific-Asia Conference
on Knowledge Discovery and Data Mining, pp. 242–253 (2009)
13. Tseng, V. S., Shie, B.-E., Wu, C.-W., Yu., P. S.: Efficient algorithms for mining
high utility itemsets from transactional databases. IEEE Trans. Knowl. Data Eng.
25(8), 1772–1786 (2013)

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy