Heuristics Miners for Streaming Event Data
Abstract
More and more business activities are performed using information systems.
These systems produce such huge amounts of event data that existing systems are
unable to store and process them. Moreover, few processes are in steady-state
and due to changing circumstances processes evolve and systems need to adapt
continuously. Since conventional process discovery algorithms have been defined
for batch processing, it is difficult to apply them in such evolving environments.
Existing algorithms cannot cope with streaming event data and tend to generate
unreliable and obsolete results.
In this paper, we discuss the peculiarities of dealing with streaming event data
in the context of process mining. Subsequently, we present a general framework
for defining process mining algorithms in settings where it is impossible to store all
events over an extended period or where processes evolve while being analyzed.
We show how the Heuristics Miner, one of the most effective process discovery
algorithms for practical applications, can be modified using this framework. Dif-
ferent stream-aware versions of the Heuristics Miner are defined and implemented
in ProM. Moreover, experimental results on artificial and real logs are reported.
1 Introduction
One of the main aims of process mining is control-flow discovery, i.e., learning process
models from example traces recorded in some event log. Many different control-flow
discovery algorithms have been proposed in the past (see [20]). Basically, all such
algorithms have been defined for batch processing, i.e., a complete event log containing
all executed activities is supposed to be available at the moment of execution of the
mining algorithm. Nowadays, however, the information systems supporting business
processes are able to produce a huge amount of events thus creating new opportunities
and challenges from a computational point of view. In fact, in case of streaming data
it may be impossible to store all events. Moreover, even if one is able to store all
event data, it is often impossible to process them due to the exponential nature of most
algorithms. In addition to that, a business process may evolve over time. Manyika et al.
[15] report possible ways for exploiting large amounts of data to improve a company's
business. In their paper, stream processing is defined as "technologies designed to
process large real-time streams of event data” and one of the example applications is
process monitoring. The challenge to deal with streaming event data is also discussed
in the Process Mining Manifesto [10].
Currently, however, there are no process mining algorithms able to mine an event
stream. This paper is the first that presents algorithms for discovering process models
based on streaming event data. In the remainder of this paper we refer to this problem
as Streaming Process Discovery (or SPD).
According to [2, 3], a data stream consists of an unbounded sequence of data items
with a very high throughput. In addition to that, the following assumptions are typically
made: i) data is assumed to have a small and fixed number of attributes; ii) mining
algorithms should be able to process an infinite amount of data without exceeding
memory limits or otherwise failing, no matter how many items are processed; iii) for
classification tasks, data has a limited number of possible class labels; iv) the amount
of memory available to a learning/mining algorithm is considered finite, and typically
much smaller than the data observed in a reasonable span of time; v) there is a small
upper bound on the time allowed to process an item, e.g. algorithms have to scale
linearly with the number of processed items: typically the algorithms work with one
pass of the data; and vi) stream “concepts” are assumed to be stationary or evolving
[25, 27].
In SPD, a typical task is to reconstruct a control-flow model that could have gen-
erated the observed event log. The general representation of the SPD problem that we
adopt in this paper is shown in Fig. 1: one or more sources emit events (represented as
solid dots) which are observed by the stream miner that keeps the representation of the
process model up-to-date. Obviously, no standard mining algorithm adopting a batch
approach is able to deal with this scenario.
An SPD algorithm has to give satisfactory answers to the following two categories
of questions:
1. Is it possible to discover a process model while storing a minimal amount of
information? What should be stored? What is the performance of such methods
both in terms of model quality and speed/memory usage?
2. Can SPD techniques deal with changing processes? What is the performance
when the stream exhibits certain types of concept drift?
In this paper, we discuss the peculiarities of mining a stream of logs in the context
of process mining. Subsequently, we present a general framework for defining process
mining algorithms for streams of logs. We show how the Heuristics Miner, one of the
more effective algorithms for practical applications of process mining, can be adapted
for stream mining according to our SPD framework.
Figure 1: General idea of SPD: the stream miner continuously receives events and,
using the latest observations, updates the process model.
Approaches for mining data streams can be divided into two categories: data based and
task based [6]. Data based techniques use only a fragment of the entire dataset (by
reducing the data to a smaller representation), whereas task based techniques modify
existing algorithms (or design new ones) to achieve time and space efficient solutions.
The main “data based” techniques are: sampling, load shedding, sketching and ag-
gregation. All these are based on the idea of randomly selecting items or stream portions.
The main drawback is that, since the dataset size is unknown, it is hard to define the
number of items to collect; moreover it is possible that some of the items that are ig-
nored were actually interesting and meaningful. Other approaches, like aggregation,
are slightly different: they are based on summarization techniques and, in this case,
the idea is to consider measures such as mean and variance; with these approaches,
problems arise when the data distribution contains many fluctuations.
The main “task based” techniques are: approximation algorithms, sliding window
and algorithm output granularity. Approximation algorithms aim to extract an approx-
imate solution. It is possible to define error bounds on the procedure. This way, one
obtains an "accuracy measure". The basic idea of the sliding window is that users are more
interested in the most recent data; thus the analysis is performed giving more importance
to recent data, and considering only a summarization of the old ones. The main charac-
teristic of “algorithm output granularity” is the ability to adapt the analysis to resource
availability.
The task of mining data streams typically focuses on specific types of algorithms
[6, 27, 2]. In particular, there are techniques for: clustering; classification; frequency
counting; time series analysis and change diagnosis (concept drift detection). All these
techniques cope with very specific problems and cannot be adapted to the SPD prob-
lem. However, as this work shows, it is possible to reuse some of their principles or to
reduce SPD to sub-problems that can be solved with the available algorithms.
Over the last decade dozens of process discovery techniques have been proposed
[20], e.g., the Heuristics Miner [24]. However, these all work on a full event log and not
on streaming data. Few works in the process mining literature address issues related to
mining event data streams.
In [12, 13], the authors focus on incremental workflow mining and task mining (i.e.
the identification of the activities starting from the documents accessed by users). The
basic idea is to mine process instances as soon as they are observed; each new model is
then merged with the previous one so as to refine the global process representation. The
described approach is designed to deal with incremental process refinement based
on logs generated from version management systems. However, as the authors state, only
the initial idea is sketched.
An approach for mining legacy systems is described in [11]. In particular, after
the introduction of monitoring statements into the legacy code, an incremental process
mining approach is presented. The idea is to apply the same heuristics as the Heuristics
Miner to the process instances and to add these data to an AVL tree, which is used
to find the best holding relations. Actually, this technique operates on "log fragments"
and not on single events, so it is not really suitable for an online setting. Moreover, the
heuristics are based on frequencies, so they must be computed with respect to a set of
traces and, again, this is not suitable for settings with streaming event data.
An interesting contribution to the analysis of evolving processes is given in the
paper by Bose et al. [5]. The proposed approach, based on statistical hypothesis tests,
aims at detecting concept drift, i.e. the changes in event logs, and identifying the
regions of change in a process.
Solé and Carmona, in [18], describe an incremental approach for translating tran-
sition systems into Petri nets. This translation is performed using Region Theory. The
approach tackles the complexity of the translation by splitting the log into several parts,
applying Region Theory to each of them, and then combining the results. The resulting
regions are finally converted into a Petri net.
The above review of the literature shows that there is no process mining technique for
SPD that addresses the requirements listed in this section.
The remainder of this paper is organized as follows: Section 2 presents the basic
concepts related to SPD; Section 3 describes the new algorithms designed to tackle
stream process mining; Section 4 reports some details about the implementation of
all the approaches in ProM and Section 5 presents the results of several experiments;
Section 6 concludes the paper. This work contains two appendices: Appendix A sum-
marizes the Heuristics Miner algorithm, Appendix B presents some details on error
bounds.
2 Basic concepts
The main difference between classical process mining [20] and SPD lies in the assumed
input format. For SPD we assume streaming event data that may even come from
multiple sources rather than a static event log containing historic data.
In this paper, we assume that each event, received by the miner, contains the name
of the activity executed, the case id it belongs to, and a timestamp. A formal definition
of these elements is as follows:
Definition 1 (Activity, Case, Time and Event Stream) Let A be a set of activities
and C be a set of case identifiers. An event is a triplet (c, a, t) ∈ C × A × N, i.e.,
the occurrence of activity a for case c (i.e. the process instance) at time t (timestamp of
emission of the event). Actually, in the miner, rather than using an absolute timestamp,
we consider a progressive number representing the number of events seen so far, so
an event at time t is followed by another event at time t + 1, regardless of the time elapsed
between them. S ∈ (C × A × N)∗ is an event stream, i.e., a sequence of events that
are observed item by item. The events in S are sorted according to the order they are
emitted, i.e. the event timestamp.
Starting from this definition, it is possible to define some functions:
Definition 2 (Case time scope) \(t_{start}(c) = \min_{(c,a,t) \in S} t\), i.e. the time when the first
activity for c is observed; \(t_{end}(c) = \max_{(c,a,t) \in S} t\), i.e. the time when the last activity
for c is observed.
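To make these definitions concrete, here is a minimal Python sketch (purely illustrative; the names and the toy stream are not part of any implementation discussed in this paper) that represents events as (case, activity, time) triples and computes the case time scope of Definition 2.

from typing import List, Tuple

Event = Tuple[str, str, int]  # (case id, activity name, logical timestamp)

def t_start(stream: List[Event], case: str) -> int:
    """Time at which the first activity of `case` is observed (Definition 2)."""
    return min(t for (c, _, t) in stream if c == case)

def t_end(stream: List[Event], case: str) -> int:
    """Time at which the last activity of `case` is observed (Definition 2)."""
    return max(t for (c, _, t) in stream if c == case)

# Toy stream with two interleaved cases; timestamps are progressive event numbers.
S = [("c1", "A", 1), ("c2", "A", 2), ("c1", "B", 3), ("c2", "C", 4), ("c1", "D", 5)]
print(t_start(S, "c1"), t_end(S, "c1"))  # prints: 1 5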
In analogy with classical data streams, an event stream can be defined as stationary
or evolving. In our context, a stationary stream can be seen as generated by a business
process that does not change with time. On the contrary, an evolving stream can be
understood as generated by a process that changes in time. More precisely, different
modes of change can be considered: i) drift of the process model; ii) shift of the process
model; iii) cases (i.e., execution instances of the process) distribution change. Drift
and shift of the process model correspond to the classical two modes of concept drift
[5] in data streams: a drift of the model refers to a gradual change of the underlying
process, while a model shift happens when a change between two process models is
more abrupt. The change in cases distribution represents another way in which an event
stream can evolve, i.e. the original process may stay the same during time, however,
the distribution of the cases is not stationary. With this we mean that the distribution
of the features of the process cases change with time. For example, in a production
process of a company selling clothing, the items involved in incoming orders (i.e., cases
features) during winter will follow a completely different distribution with respect to
items involved in incoming orders during the summer. Such distribution change may
significantly affect the relevance of specific paths in the control-flow of the involved
process.
Going back to process model drift, there is a peculiarity of business event streams
that cannot be found in traditional data streams. An event log records that a specific
activity ai of a business process P has been executed at time t for a specific case
cj. If the drift from P to P′ happens at time t∗ while the process is running, there
might be cases for which all the activities have been executed within P (i.e., cases
that have terminated their execution before t∗), cases for which all the activities have
been executed within P′ (i.e., cases that have started their execution on or after t∗),
and cases that have some activities executed within P and some others within P′ (i.e.,
cases that have started their execution before t∗ and have terminated after t∗). We will
refer to these cases as transient cases. So, under this scenario, the stream will first
Figure 2: Two basic approaches for the definition of a finite log out of a stream of
events. The horizontal segments represent the time frames considered for the mining.
emit events of cases executed within P, followed by events of transient cases, followed
by events of cases executed within P′. On the contrary, if the drift does not occur
while the process is running, the stream will first report events referring to complete
executions (i.e. cases) of P, followed by events referring to complete executions of P′
(no transient cases). In any case, the drift is characterized by the fact that P′ is very
similar to P, i.e. the change in the process which emits the events is limited.
Due to space limitation, we restrict our treatment to stationary streams and streams
with concept drift with no generation of transient cases. The treatment of other scenar-
ios is left for future work.
Algorithm 1: Sliding Window HM / Periodic Resets HM
Input: S event stream; M memory of size max M ; PM memory policy (can be
‘reset’ or ‘shift’)
1 forever do
2 e ← observe(S) /* Observe a new event, where
e = (ci , ai , ti ) */
/* Check if event e has to be used */
3 if analyze(e) then
/* Memory update */
4 if size(M ) = max M then
5 if PM is reset then reset(M )
6 if PM is shift then shift(M )
7 end
8 insert(M, e)
/* Mining update */
9 if perform mining then
10 HeuristicsMiner (M )
11 end
12 end
13 end
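As an illustration of Algorithm 1, the following Python sketch (hypothetical helper names; not the ProM implementation) shows the two memory policies: with "reset" the whole memory is emptied once it is full, with "shift" only the oldest event is discarded, and a batch Heuristics Miner, here a placeholder callback, is periodically invoked on the stored events.

from collections import deque

def basic_stream_miner(stream, max_m=1000, policy="shift", mine_every=100,
                       heuristics_miner=lambda log: None):
    """Sketch of Algorithm 1: Sliding Window HM (policy='shift') /
    Periodic Resets HM (policy='reset'). `heuristics_miner` stands for a
    batch Heuristics Miner run on the current memory."""
    memory = deque()
    for i, event in enumerate(stream, start=1):   # event = (case, activity, time)
        if len(memory) == max_m:                  # memory full
            if policy == "reset":
                memory.clear()                    # drop everything
            else:
                memory.popleft()                  # drop only the oldest event
        memory.append(event)
        if i % mine_every == 0:                   # periodic mining update
            heuristics_miner(list(memory))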
The Online Heuristics Miner stores the stream information using three queues:
1. QA, with entries in A × R, stores the most recently observed activities, jointly with
a weight for each activity (that represents its degree of importance with respect to mining);
2. QC, with entries in C × A, stores the most recent observed event for each case;
3. QR, with entries in A × A × R, stores the most recently observed direct succession
relations, jointly with a weight for each succession relation (that represents its
degree of importance with respect to mining).
These queues are used by the online algorithm to retain the information needed to
perform mining.
The detailed description of the new algorithm is presented in Algorithm 2. Specifically,
the algorithm runs forever, considering, at each round, the current observed event
e = (ci , ai , ti ). For each current event, it is checked if ai is already in QA . If this
is not the case, ai is inserted in QA with weight 0. If ai is already present in the queue,
it is removed from its current position and moved at the beginning of the queue. In
any case, before insertion, it is checked if QA is full. If this is the case, the oldest
stored activity, i.e. the last in the queue, is removed. Subsequently, the weights of QA
are updated by fWA . After that, queue QC is examined to look for the most recent
event observed for case ci . If a pair (ci , a) is found, it is removed from the queue, an
instance of the succession relation (a, ai ) is created and searched in QR . If it is found,
it is moved from the current position to the beginning of QR . If it is a new succession
relation, its weight is set to 0. In any case, before insertion, it is checked if QR is
full. If this is the case, the oldest stored relation, i.e. the last in the queue, is removed.
Subsequently, the weights of QR are updated by fWR . Next, after checking if QC is
full (in which case the oldest stored event is removed), the event e is stored in QC .
Finally, it is checked if a model has to be generated. If this is the case, the procedure
generateModel (QA , QR ) is executed taking as input the current version of queues QA
and QR and producing “classical” model representations, such as Causal Nets [21] or
Petri Nets.
Algorithm 2 is parametric with respect to: i) the way weights of queues QA and QR
are updated by fWA , fWR , respectively; ii) how a model is generated by generateModel (QA , QR ).
In the following, generateModel (·, ·) will correspond to the procedure defined by
Heuristics Miner (Appendix A). In particular it is possible to consider QA as the
counter of activities (to filter out only the most frequent ones) and QR as the counter
of direct succession relations, which are used for the computation of the dependency
values between pairs of activities. The following subsections present some specific
instances for fWA and fWR.
Algorithm 2: Online HM
Input: S event stream; max QA , max QC , max QR maximum memory sizes for
queues QA , QC , and QR , respectively; fWA , fWR model policy;
generateModel (·, ·).
1 forever do
2 e ← observe(S) /* observe a new event, where
e = (ci , ai , ti ) */
/* check if event e has to be used */
3 if analyze(e) then
4 if ∄(a, w) ∈ QA s.t. a = ai then
5 if size(QA ) = max QA then
6 removeLast(QA ) /* removes last entry of QA */
7 end
8 w←0
9 else
10 w ← get(QA , ai ) /* get returns the old weight w
of ai and removes (ai , w) */
11 end
12 insert(QA , (ai , w)) /* inserts in front of QA */
13 QA ← fWA (QA ) /* updates the weights of QA */
14 if ∃(c, a) ∈ QC s.t. c = ci then
15 a ← get(QC , ci ) /* get returns the old activity a
of ci and removes (ci , a) */
16 if ∄(as , af , u) ∈ QR s.t. (as = a) ∧ (af = ai ) then
17 if size(QR ) = max QR then
18 removeLast(QR ) /* removes last entry of QR */
19 end
20 u←0
21 else
22 u ← get(QR , a, ai ) /* get returns the old weight
u of relation a → ai and removes (a, ai , u) */
23 end
24 insert(QR , (a, ai , u)) /* inserts in front of QR */
25 QR ← fWR (QR ) /* updates the weights of QR */
26 else if size(QC ) = max QC then
27 removeLast(QC ) /* removes last entry of QC */
28 end
29 insert(QC , (ci , ai )) /* inserts in front of QC */
/* generate model */
30 if model then
31 generateModel (QA , QR )
32 end
33 end
34 end
where first(·) returns the first element of the queue; in this basic version, the weight update
functions fWA and fWR simply increment by one the weight of the element at the head of
the queue (the one associated with the event just observed), leaving the other weights untouched.
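A compact sketch of this bookkeeping is given below (illustrative Python, not the ProM code; the simple increment-by-one weight update described above is assumed). Ordered dictionaries play the role of the bounded queues QA, QC and QR.

from collections import OrderedDict

class BoundedQueue(OrderedDict):
    """Most-recently-used queue of bounded size with per-entry weights."""
    def __init__(self, max_size):
        super().__init__()
        self.max_size = max_size

    def touch(self, key):
        """Move `key` to the front and increment its weight (the assumed f_W)."""
        weight = self.pop(key, 0)
        if len(self) == self.max_size:       # queue full: drop the oldest entry
            self.popitem(last=True)
        self[key] = weight + 1
        self.move_to_end(key, last=False)    # front of the queue = most recent

def online_hm_step(event, q_a, q_c, q_r, max_qc):
    """One iteration of Algorithm 2 for event = (case, activity)."""
    case, activity = event
    q_a.touch(activity)                      # update the activity queue QA
    previous = q_c.pop(case, None)           # last activity observed for this case
    if previous is not None:
        q_r.touch((previous, activity))      # update the succession queue QR
    elif len(q_c) == max_qc:
        q_c.popitem(last=True)               # QC full: drop the oldest case
    q_c[case] = activity                     # remember the last activity of the case
    q_c.move_to_end(case, last=False)

q_a, q_r, q_c = BoundedQueue(100), BoundedQueue(100), OrderedDict()
for e in [("c1", "A"), ("c1", "B"), ("c2", "A"), ("c1", "C")]:
    online_hm_step(e, q_a, q_c, q_r, max_qc=50)
print(dict(q_r))  # both ('A', 'B') and ('B', 'C') end up with weight 1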
In case of stationary streams, it is possible to use the Hoeffding bound to derive
error bounds on the measures computed by the online version of Heuristics Miner.
These bounds become tighter and tighter as the number of processed
events increases. Appendix B reports some details on that.
It must be noticed that if the sizes of the queues are large enough, the Online Heuris-
tics Miner collects all the needed statistics from the beginning of the stream till the
current time. So it performs very well, provided that the activity distribution of the
stream is stationary. However, in real world business processes it is natural to observe
variations both in events distribution and in the workflow of the process generating the
stream (concept drift).
In order to cope with concept drift, more importance should be given to more recent
events than to older ones. In the following we present a variant of Online Heuristics
Miner able to do that.
completely drop observations that occurred before time t − k. This ability could be useful
in case of a sudden, drastic change in the event distribution.
• if the fitness decreases, it is likely that a drift has occurred, so the value of α should
be decreased, so that older events are forgotten faster;
• if the fitness remains unchanged (i.e. it is within a small interval), it means that
there is no drift, so the value of α should be increased (up to 1);
• if the fitness increases, α should be increased too (up to 1).
The experiments, presented in the next section, consider only variations of α by a
constant factor. Alternative update policies (e.g. making the speed of change of α
proportional to the observed fitness change) can be considered and are in fact a topic of
future investigation.
Early explorations seem to reveal that the effectiveness of the α update policy heav-
ily depends on the problem type (i.e. the characteristics of the event stream); however,
this topic still requires further investigation.
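A sketch of such a constant-factor update policy is shown below (illustrative Python; the step size, tolerance and lower bound for α are hypothetical values, not those used in the experiments).

def update_alpha(alpha, previous_fitness, current_fitness,
                 step=0.001, tolerance=0.005, alpha_min=0.9):
    """Adjust the aging factor alpha after each fitness evaluation.

    If fitness drops (a possible drift), alpha is decreased so that older
    events are forgotten faster; if fitness is stable or improves, alpha is
    increased back towards 1 (no forgetting)."""
    delta = current_fitness - previous_fitness
    if delta < -tolerance:                 # fitness decreased: suspected drift
        alpha = max(alpha_min, alpha - step)
    else:                                  # stable or increasing fitness
        alpha = min(1.0, alpha + step)
    return alpha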
In the Lossy Counting algorithm [14], the stream is conceptually divided into buckets of
width w = ⌈1/ε⌉, where ε is the maximal approximation error; the current bucket (i.e., the
bucket of the last element seen) is identified by bcurrent = ⌈N/w⌉, where N is the progressive
events counter.
The basic data structure used by Lossy Counting is a set of entries of the form
(e, f, ∆) where: e is an element of the stream; f is the estimated frequency of the item
e; and ∆ is the maximum possible error. Every time a new element e is observed, the
algorithm looks whether the data structure contains an entry for the corresponding ele-
ment. If such entry exists then its frequency value f is incremented by one, otherwise
Algorithm 3: Lossy Counting HM
Input: S event stream; ε the maximal approximation error; N the bucket counter (initially
set to 1); DA activities set; DC cases set; DR relations set; generateModel (·, ·).
1 w ← ⌈1/ε⌉ /* define the bucket width */
2 forever do
3 bcurrent ← ⌈N/w⌉ /* define the current bucket id */
4 e ← observe(S) /* observe a new event, where e = (ci , ai , ti ) */
/* update the DA data structure */
5 if ∃(a, f, ∆) ∈ DA such that a = ai then
6 Remove the entry (a, f, ∆) from DA
7 DA ← (a, f + 1, ∆) /* updates the frequency of element ai
*/
8 else
9 DA ← DA ∪ {(ai , 1, bcurrent − 1)} /* inserts the new
observation */
10 end
/* update the DC data structure */
11 if ∃(c, a, f, ∆) ∈ DC such that c = ci then
12 Remove the entry (c, a, f, ∆) from DC
13 DC ← (c, ai , f + 1, ∆) /* updates the frequency and last
activity of case ci */
/* update the DR data structure */
14 Build relation ri as a → ai
15 if ∃(r, f, ∆) ∈ DR such that r = ri then
16 Remove the entry (r, f, ∆) from DR
17 DR ← (r, f + 1, ∆) /* updates the frequency of element
r i */
18 else
19 DR ← DR ∪ {(ri , 1, bcurrent − 1)} /* adds the new observation
*/
20 end
21 else
22 DC ← DC ∪ {(ci , ai , 1, bcurrent − 1)} /* adds the new observation
*/
23 end
/* periodic cleanup */
24 if N ≡ 0 (mod w) then
25 foreach (a, f, ∆) ∈ DA such that f + ∆ ≤ bcurrent do
26 Remove (a, f, ∆) from DA
27 end
28 foreach (c, a, f, ∆) ∈ DC such that f + ∆ ≤ bcurrent do
29 Remove (c, a, f, ∆) from DC
30 end
31 foreach (r, f, ∆) ∈ DR such that f + ∆ ≤ bcurrent do
32 Remove (r, f, ∆) from DR
33 end
34 end
35 N ←N +1 /* increments the bucket counter */
/* generate model */
36 if model then
37 generateModel (DA , DR )
38 end
39 end
1 <log openxes.version="1.0RC7" xes.features="nested-attributes"
xes.version="1.0" xmlns="http://www.xes-standard.org/">
2 <trace>
3 <string key="concept:name" value="case_id_0" />
4 <event>
5 <date key="time:timestamp" value="2012-04-23T10:33:04.004+02:00"
/>
6 <string key="concept:name" value="A" />
7 <string key="lifecycle:transition" value="Task_Execution" />
8 </event>
9 </trace>
10 </log>
Listing 1: OpenXES fragment streamed over the network.
a new tuple is added: (e, 1, bcurrent − 1). Every time N ≡ 0 mod w, the algorithm
cleans the data structure by removing the entries that satisfy the following inequality:
f + ∆ ≤ bcurrent . Such condition ensures that, every time the cleanup procedure is
executed, bcurrent ≤ εN .
This algorithm has been adapted to the SPD problem, using three instances of the
basic data structure. In particular, it counts the frequencies of the activities (with the
data structure DA ) and the frequencies of the direct succession relations (with the data
structure DR ). In order to obtain the relations, a third instance of the same data structure
is used, DC . In DC , each item is of the type (c, a, f, ∆) where c ∈ C represents the case
identifier; f and ∆, as in the previous cases, respectively correspond to the frequency and
to the bucket id; and a ∈ A is the latest activity observed for the corresponding case.
Every time a new activity is observed, DA is updated. After that, the procedure checks
whether, given the case identifier of the current event, there is an entry in DC . If this is not
the case, a new entry is added to DC (by adding the current case id and the activity
observed). Otherwise, the f and a components of the entry in DC are updated.
The Heuristics Miner can be used to generate the model, since a set of dependencies
between activities is available.
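The following Python sketch (illustrative, not the ProM plugin) shows the core of this adaptation: a generic Lossy Counting structure is instantiated for the activities (DA) and for the direct succession relations (DR), while DC is simplified here to a plain dictionary mapping each case to its last observed activity.

import math

class LossyCounter:
    """Approximate frequency counting (Manku & Motwani) for one data structure."""
    def __init__(self, epsilon):
        self.width = math.ceil(1.0 / epsilon)        # bucket width w = ceil(1/eps)
        self.entries = {}                            # item -> (frequency, delta)

    def add(self, item, bucket):
        freq, delta = self.entries.get(item, (0, bucket - 1))
        self.entries[item] = (freq + 1, delta)

    def cleanup(self, bucket):
        """Drop items whose estimated frequency cannot exceed the error bound."""
        self.entries = {i: (f, d) for i, (f, d) in self.entries.items()
                        if f + d > bucket}

def lossy_counting_hm(stream, epsilon=0.01):
    """Sketch of Algorithm 3: count activities (DA) and direct successions (DR)."""
    d_a, d_r = LossyCounter(epsilon), LossyCounter(epsilon)
    d_c = {}                                          # case -> last activity seen
    for n, (case, activity, _t) in enumerate(stream, start=1):
        bucket = math.ceil(n / d_a.width)             # current bucket id
        d_a.add(activity, bucket)
        if case in d_c:
            d_r.add((d_c[case], activity), bucket)    # direct succession relation
        d_c[case] = activity
        if n % d_a.width == 0:                        # periodic cleanup
            d_a.cleanup(bucket)
            d_r.cleanup(bucket)
    return d_a.entries, d_r.entries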
4 Implementation
All the approaches presented in this paper have been implemented in the ProM 6.1
toolkit [26]. Moreover, a “stream simulator” and a “logs merger” have also been im-
plemented to allow for experimentation (to test new algorithms and to compose logs).
Communications between stream sources and stream miner are performed over the
network: each event emitted consists of a "small log" (i.e., a trace which contains ex-
actly one event), encoded as an XES string [8]. An example of such a streamed event log
is presented in Listing 1. This approach is useful to simulate "many-to-many environ-
ments” where one source emits events to many miners and one miner can use many
stream sources. The current implementation supports only the first scenario (currently
it is not possible to mine streams generated by more than one source).
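A minimal sketch of how one event could be wrapped in such a single-trace, single-event XES fragment and pushed over the network is shown below (illustrative Python; the XES attribute keys follow Listing 1, but the framing, host and port are assumptions and not the actual ProM protocol).

import socket

XES_TEMPLATE = """<log xes.version="1.0" xmlns="http://www.xes-standard.org/">
 <trace>
  <string key="concept:name" value="{case}"/>
  <event>
   <date key="time:timestamp" value="{timestamp}"/>
   <string key="concept:name" value="{activity}"/>
   <string key="lifecycle:transition" value="Task_Execution"/>
  </event>
 </trace>
</log>"""

def emit_event(case, activity, timestamp, host="localhost", port=1234):
    """Send one event, encoded as a one-event XES log, to a stream miner."""
    payload = XES_TEMPLATE.format(case=case, activity=activity, timestamp=timestamp)
    with socket.create_connection((host, port)) as conn:
        conn.sendall(payload.encode("utf-8"))

# emit_event("case_id_0", "A", "2012-04-23T10:33:04.004+02:00")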
Fig. 3 shows the set of ProM plugins implemented and how they interact with
each other. The available plugins can be split into two groups: plugins for the simu-
lation of the stream and plugins to mine streaming event data. To simulate a stream
there is the "Log Streamer" plugin. This plugin receives a static log file as input and
streams each event over the network, according to its timestamp (in this context, times-
Figure 3: Architecture of the plugins implemented in ProM and how they interact with
each other. Each rounded box represents a ProM plugin.
tamps are used only to determine the order of events). It is possible to define the time
between each event, in order to test the miner under different emission rates (i.e. to
simulate different traffic conditions). A second plugin, called “Logs Merger” can be
used to concatenate different log files generated by different process models, just for
testing purposes.
Once the stream is active (i.e. events are sent through the network), the clients can
use these data to mine the model. There is a “Stream Tester” plugin, which just shows
the events received. The other 6 plugins support the two basic approaches (Section 3.1),
and the four stream specific approaches (Section 3.3 and 3.2).
In a typical session of testing a new stream process mining algorithm, we expect to
have two separate ProM instances active at the same time: the first is streaming events
over the network and the second is collecting and mining them.
Fig. 4 contains three screenshots of the ProM plugins implemented. The first image,
on top, contains the process streamer: the left bar describes the stream configuration
options (such as the speed or the network port for new connections), the central part
contains a representation of the log as a dotted chart [19] (the x axis represents time,
and points with the same x value are events that occurred at the same
instant). Blue dots are the events that are not yet sent (future events), green ones are
the events already streamed (past events). It is possible to change the color of the
future events so that every event referring to the same activity or to the same process
instance has the same color. The figure in the middle contains the Stream Tester: each
event of a stream is appended to this list, which shows the timestamp of the activity, its
name and its case id. The left bar contains some basic statistics (i.e. beginning of the
streaming session, number of events observed and average number of events observed
per second). The last picture, at the bottom, represents the Online HM miner. This
view can be divided into three parts: the central part, where the process representation
is shown (in this case, as a Causal Net); the left bar contains, on top, buttons to start/stop
the miner plus some basic statistics (i.e., beginning of the streaming session, number
of events observed and average number of events observed per second); at the bottom,
there is a graph which shows the evolution of the fitness measure.
Figure 4: Screenshots of four implemented ProM plugins. The first image (top left)
shows the logs merger (it is possible to define the overlap level of the two logs); the
second image (top right) represents the log streamer, the bottom left image is the stream
tester and the image at the bottom right shows the Online HM.
Moreover, Command-Line Interface (CLI) versions of the miners are available too.
In these cases, events are read from a static file (one event per line) and the miners
update the model (this implementation realizes an incremental version of the algo-
rithms). These implementations can be run in batch and are used for automated
experimentation.
5 Results
The algorithms presented in this paper have been tested using four datasets: event logs
from two artificial processes (one stationary and one evolving); a synthetic example;
and a real event log.
Figure 5: Model 1. Process model used to generate the stationary stream.
Figure 6: Model 2. The three process models that generate the evolving stream. Red
rounded rectangles indicate areas subject to modification.
Figure 7: Model 3. The first variant of the third model. Red rounded rectangles indicate
areas that will be subject to the modifications.
at the end, the stream contains traces from 5 different processes. Also in this case the
type of drift is shift. Due to space limitation, only the first process is presented and the
red rectangles indicate areas that are modified over time.
Figure 8: Aggregated experimental results for five streams generated by Model 1. Top:
average (left) and variance (right) values of fitness measures for basic approaches and
the Online HM. Bottom: evolution in time of average fitness for Online HM with
queues size 100 and log size for fitness 200; curves for HM with Aging (α = 0.9985
and α = 0.997), HM with Self Adapting (evolution of the α value is shown at the bot-
tom), Lossy Counting and different configurations of the basic approaches are reported
as well.
an error value ε = 0.01. The right hand side of Fig. 8 compares the basic approaches,
with different window and fitness sizes, against the Online HM and the Lossy Counting
approach. As expected, since there is no drift, the Online HM outperforms the versions
with aging. In fact, HM with Aging, besides being less stable, degrades in performance
as the value of α decreases, i.e. as less importance is given to less recent events. This is
consistent with the bad performance reported for the basic approaches, which can ex-
ploit only the most recent events contained in the window. The self adapting strategy,
after an initial variation of the α parameter, is able to converge to the Online HM by
eventually choosing a value of α equal to 1.
Fig. 9 reports the aggregated experimental results for five streams generated by
Model 2. In this case we adopted exactly the same experimental setup, procedure
and results presentation as described before. In addition, the occurrences of drift are
marked. As expected, the performance of Online HM decreases at each drift, while
HM with Aging is able to recover from the drifts. The price paid for this ability is a
less stable behavior. HM with Self Adapting aging seems to be the right compromise,
being eventually able to recover from the drifts while showing a stable behavior. The α
curve shows that the self adapting strategy seems to be able to detect the concept drifts.
Figure 9: Aggregated experimental results for five streams generated by Model 2, with
the same presentation as Fig. 8; the occurrences of drift are marked.
Model 3, the synthetic example, has been tested with the basic approaches
(Sliding Windows and Periodic Resets), the Online HM, the HM with Self Adapting
and the Lossy Counting; the results are presented in Fig. 10. In this case, the
Lossy Counting and the Online HM outperform the other approaches. Lossy Counting
reaches higher fitness values; however, Online HM is more stable and seems to better
tolerate the drifts. The basic approaches and the HM with Self Adapting, on the other
hand, are very unstable; moreover, it is interesting to note that the value of α of the
HM with Self Adapting is always close to 1. This indicates that the short stabilities of
the fitness values are sufficient to increase α, so the updating policy (i.e. the increment/
decrement speed of α) presented, for this particular case, seems to be too fast. The
second graph, at the bottom, presents three runs of the Lossy Counting with differ-
ent values for ε. As expected, the lower the value of the accepted error, the better the
performance.
Due to the size of this dataset, it is interesting to evaluate the performance of the
approaches also in terms of space and time requirements.
Fig. 11 presents the average memory required by the miner during the processing
of the entire log. Different configurations are tested, both for the basic approaches with
Figure 10: Detailed results of the basic approaches, Online HM, HM with Self Adapt-
ing and Lossy Counting (with different configurations) on data of Model 3. Vertical
gray lines indicate points where concept drift occur.
the Online HM and the HM with Self Adapting, and for the Lossy Counting algorithm.
Clearly, as the windows grow, the space requirement grows too. Concerning
the Lossy Counting, again, as the value of ε (the accepted error) becomes lower, more space
is required. If we pick the Online HM with window 1000 and the Lossy Counting
with ε = 0.01 (from Fig. 10, both seem to behave similarly), the Online HM consumes
less memory: it requires 128.3 MB whereas the Lossy Counting needs 143.8 MB. Fig. 12
shows the time performance of different algorithms and different configurations. It is
interesting to note, from the chart at the bottom, that the time required by the Online
and the Self Adapting is almost independent of the configurations. Instead, the basic
approaches need to perform more complex operations: the Periodic Reset has to add
the new event and, sometimes, it resets the log; the Sliding Window has to update the
log every time a new event is observed.
In order to study the dependence of the storage requirements of Lossy Counting
on the error parameter ε, we have run experiments on the same log for
different values of ε, recording the maximum size of the Lossy Counting sets during
execution. Results for x = 1000 are reported in Fig. 13. Specifically, the figure com-
pares the maximum size of the generated sets, the average fitness value and the average
precision value. As expected, as the value of ε becomes larger, both the fitness value
20
150 150
Online HM Lossy Coun�ng HM
Online HM w/ Self Adap�ng Aging
Sliding Windows HM
Space requirement (MB)
125 125
Periodic Resets HM
100 100
75 75
50 50
q, x = 10 q, x = 100 q, x = 500 q, x = 1000 ε = 0.1 ε = 0.05 ε = 0.01
Configura�ons Configura�ons (fitness size set to 1000)
Figure 11: Average memory requirements, in MB, for a complete run over the entire
log of Model 3, of the approaches (with different configurations).
Figure 12: Time performances over the entire log of Model 3. Top: time required to
process a single event by different algorithms (logarithmic scale). Vertical gray lines
indicate points where concept drift occur. Bottom: average time required to process an
event over the entire log, with different configurations of the algorithms.
Figure 13: Comparison of the average fitness, precision and space required, with re-
spect to different values of ε, for the Lossy Counting HM executed on the log generated
by Model 3.
and the sets size quickly decrease. The precision value, on the contrary, initially de-
creases and then goes up to very high values. This indicates an over-specialization of
the model to specific behaviors.
As an additional test, we decided to compare the proposed algorithms under extreme
storage conditions, which allow only limited information about the observed
events to be retained. Specifically, Table 1 reports the average time required to process a single event,
and the average fitness and precision values, when queues of size 10 and 100, respectively,
are used. For Lossy Counting we have used an ε value which requires
sets of approximately similar sizes. Please note that, for this log, a single process trace is longer
than 10 events so, with a queue of 10 elements, it is not possible to keep all the
events of a case in the queue (because events of different cases are interleaved). From the results it is
clear that, under these conditions, the order of occurrence of the algorithms in the table
(column order) is inversely proportional to all the evaluation criteria (i.e. execution
time, fitness, precision).
The online approaches presented in this work have been tested also against a real
Figure 14: Fitness performance on the real stream dataset by different algorithms.
dataset, and the results are presented in Fig. 14. The reported results refer to 9000 events
generated by a document management system, developed by Siav S.p.A., running at an Ital-
ian banking institution. The observed process contains 8 activities and is assumed to be
stationary. The mining is performed using a queue size of 100 and, for the fitness
computation, the latest 200 events are considered. The behavior of the fitness curves
seems to indicate that some minor drifts occur.
As stated before, the main difference between Online HM and Lossy Counting is
that, whereas the main parameter of Online HM is the size of the queues (i.e. the
maximum space the application is allowed to use), the ε parameter of Lossy Counting
does not directly control the memory occupancy of the approach. Fig. 15 proposes two com-
parisons of the approaches, with two different configurations, against the real stream
dataset. In particular, we defined the two configurations so that the average memory
required by Lossy Counting and Online HM are very close. The results presented are
actually the average values over four runs of the approaches. Please note that the two
configurations validate the fitness against different window sizes (in the first case it
contains 200 events, in the second one 1000), and this causes the second configuration
to validate results against a larger history.
The top part of the figure presents a configuration that uses, on average, about
100 MB. To obtain this behavior, several tests have been made and, at the end, for
Lossy Counting these parameters have been used: ε: 0.2, fitness queue size: 200. For
Online HM, the same fitness queue is used, but the queue size is set to 500. As the plot shows,
in terms of fitness this configuration is sufficient
for the Online HM approach whereas, for Lossy Counting, it is not. The second plot, at
the bottom, presents a different configuration that uses about 170 MB. In this case, the
error (i.e. ε) for Lossy Counting is set to 0.01, the queue size of Online HM is set to
1500 and, for both, the fitness queue size is set to 1000. In this case the two approaches
generate very close results in terms of fitness.
As a final consideration, this empirical evaluation clearly shows that, at least on our
real dataset, both Online HM and Lossy Counting are able to reach very high per-
formance; however, the Online HM is able to better exploit the available information with
respect to the Lossy Counting. In particular, Online HM considers only a finite number
of possible observations (depending on the queue size) that, in this particular case, are
sufficient to mine the correct model. The Lossy Counting, on the contrary, keeps all
the information for a certain time frame (derived from the error parameter ε)
without considering how many different behaviors have already been seen.
Note on fitness measure The usage of fitness for the evaluation of stream process
(a) Configuration that requires about 100 MB. Lossy Counting: ε: 0.2, fitness queue size: 200; Online HM:
queue size: 500, fitness queue size: 200.
(b) Configuration that requires about 170 MB. Lossy Counting: ε: 0.01, fitness queue size: 1000; Online
HM: queue size: 1500, fitness queue size: 1000.
mining algorithms seems to be an effective choice. However, this might not always be
the case: let's consider two very different processes P′ and P″ and a stream composed
of events generated by alternate executions of P′ and P″. Under specific conditions,
the stream miner will generate a model that contains both P′ and P″, connected by
an initial XOR-split and merged with a XOR-join. This model will have a very high
fitness value (it can replay traces from both P′ and P″); however, the mined model is
not the one expected, i.e. the alternation in time of P′ and P″ is not well reflected.
In order to deal with the problem just presented, we also report the performance
of some approaches in terms of "precision". This measure is designed to prefer
models that describe a "minimal behavior" with respect to all the models that can be
generated starting from the same log. In particular, we used the approach by Muñoz-Gama
and Carmona described in [16].
Figure 16: Precision performance on the real stream dataset by different algorithms.
Fig. 16 presents the precision calculated for four
approaches during the analysis of the dataset of real events. It should not be surprising
that the stream specific approaches reach very good precision values, whereas the
basic approach with periodic reset needs to recompute the model from scratch every
1000 events. It is interesting to note that both Online HM and Lossy Counting are not
able to reach the top values, whereas the Self Adapting one, after some time, reaches
the best precision, even if its value fluctuates a bit. The basic approach with sliding
window, instead, seems to behave quite nicely, even if the stream specific approaches
outperform it.
Counting seem to be the right choice in case of concept drift. The largest log has been
used also for measuring performance in terms of time and space requirements.
As future work, we plan to conduct a deeper analysis of the influence of the differ-
ent parameters on the presented approaches. Moreover, we plan to extend the current
approach also to mine the organizational perspective of the process. Finally, from a
process analyst point of view, it may be interesting to not only show the current up-
dated process model, but also report the “evolution points” of the process.
References
[1] Arya Adriansyah, Boudewijn van Dongen, and Wil M. P. van der Aalst. Confor-
mance Checking Using Cost-Based Fitness Analysis. In 2011 IEEE 15th Interna-
tional Enterprise Distributed Object Computing Conference, pages 55–64. IEEE,
August 2011.
[2] Charu Aggarwal. Data Streams: Models and Algorithms, volume 31 of Advances
in Database Systems. Springer US, Boston, MA, 2007.
[3] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernard Pfahringer. MOA:
Massive Online Analysis Learning Examples. Journal of Machine Learning Re-
search, 11:1601–1604, 2010.
[4] R. P. Jagadeesh Chandra Bose. Process Mining in the Large: Preprocessing, Dis-
covery, and Diagnostics. Phd thesis, Technische Universiteit Eindhoven, 2012.
[5] R. P. Jagadeesh Chandra Bose, Wil M. P. van der Aalst, Indrė Žliobaitė, and Mykola
Pechenizkiy. Handling Concept Drift in Process Mining. In Conference on
Advanced Information Systems Engineering (CAiSE), pages 391–405. Springer
Berlin / Heidelberg, 2011.
[6] Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy. Mining
Data Streams: a Review. ACM Sigmod Record, 34(2):18–26, June 2005.
[7] Lukasz Golab and M. Tamer Özsu. Issues in Data Stream Management. ACM
SIGMOD Record, 32(2):5–14, June 2003.
[8] Christian W. Günther. XES Standard Definition. www.xes-standard.org, 2009.
[9] Wassily Hoeffding. Probability Inequalities for Sums of Bounded Random Vari-
ables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[10] IEEE Task Force on Process Mining. Process Mining Manifesto. In Florian
Daniel, Kamel Barkaoui, and Schahram Dustdar, editors, Business Process Man-
agement Workshops, pages 169–194. Springer-Verlag, 2011.
[11] Andre Cristiano Kalsing, Gleison Samuel do Nascimento, Cirano Iochpe, and
Lucineia Heloisa Thom. An Incremental Process Mining Approach to Extract
Knowledge from Legacy Systems. In 2010 14th IEEE International Enterprise
Distributed Object Computing Conference, pages 79–88. IEEE, October 2010.
[12] Ekkart Kindler, Vladimir Rubin, and Wilhelm Schäfer. Incremental Workflow
Mining Based on Document Versioning Information. In International Software
Process Workshop, pages 287–301. Springer Verlag, 2005.
[13] Ekkart Kindler, Vladimir Rubin, and Wilhelm Schäfer. Incremental Workflow
Mining for Process Flexibility. In Proceedings of BPMDS2006, pages 178–187,
2006.
[14] Gurmeet Singh Manku and Rajeev Motwani. Approximate Frequency Counts
over Data Streams. In Proceedings of International Conference on Very Large
Data Bases, pages 346–357, Hong Kong, China, 2002. Morgan Kaufmann.
[15] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs,
Charles Roxburgh, and Angela Hung Byers. Big Data: The Next Frontier for
Innovation, Competition, and Productivity. Technical Report June, McKinsey
Global Institute, 2011.
[16] Jorge Muñoz-Gama and Josep Carmona. A Fresh Look at Precision in Process
Conformance. In Business Process Management, pages 211–226. Springer Berlin
/ Heidelberg, 2010.
[22] Wil M. P. van der Aalst, Arya Adriansyah, and Boudewijn van Dongen. Replaying
History on Process Models for Conformance Checking and Performance Analy-
sis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery,
2(2):182–192, March 2012.
[23] Wil M. P. van der Aalst and Arthur H.M. ter Hofstede. YAWL: Yet Another
Workflow Language. Information Systems, 30(4):245–275, June 2005.
[24] Wil M. P. van der Aalst and Ton A. J. M. M. Weijters. Rediscovering Workflow
Models from Event-based Data Using Little Thumb. Integrated Computer-Aided
Engineering, 10(2):151–162, 2003.
[25] Matthijs van Leeuwen and Arno Siebes. StreamKrimp: Detecting Change in
Data Streams. In Walter Daelemans, Bart Goethals, and Katharina Morik, editors,
Machine Learning and Knowledge Discovery in Databases, volume LNCS 5211
of LNAI, pages 672–687. Springer, 2008.
[26] Eric H. M. W. Verbeek, Joos Buijs, Boudewijn van Dongen, and Wil M. P. van der
Aalst. ProM 6: The Process Mining Toolkit. In BPM 2010 Demo, pages 34–39,
2010.
[27] Gerhard Widmer and Miroslav Kubat. Learning in the Presence of Concept Drift
and Hidden Contexts. Machine Learning, 23(1):69–101, 1996.
A Heuristics Miner
A.1 Heuristics Miner metrics
Heuristics Miner (HM) [24] is a process mining algorithm that counts various types of
frequencies to mine dependency relations among activities represented by logs.
The relation a >W b holds iff there is a trace σ = ⟨t1 , t2 , . . . , tn ⟩ and i ∈
{1, . . . , n − 1} such that ti = a and ti+1 = b. The notation |a >W b| indicates
the number of times that, in W , a >W b holds (i.e., the number of times activity b directly
follows activity a).
The following subsections present a detailed list of all the formulae required by
Heuristics Miner to build a process model.
Figure 17: Example of a possible process model that generates the log W .
A loop of length one is introduced if the quantity
\[ a \Rightarrow_W a = \frac{|a >_W a|}{|a >_W a| + 1} \tag{4} \]
is above a length-one loop threshold. A loop of length two is considered differently: it
is introduced if the quantity:
\[ a \Rightarrow^2_W b = \frac{|a >^2_W b| + |b >^2_W a|}{|a >^2_W b| + |b >^2_W a| + 1} \tag{5} \]
is above a length-two loop threshold. In this case, the \(a >^2_W b\) relation is observed when
a is directly followed by b and then by a again (i.e. for a trace σ = ⟨t1 , t2 , . . . , tn ⟩ ∈ W
there is an i ∈ {1, . . . , n − 2} such that ti = a and ti+1 = b and ti+2 = a).
Please note that the notation ⟨· · ·⟩ⁿ indicates n cases following the same sequence.
Such a log can be generated starting from executions of the process model of Fig. 17.
In the case reported in the figure, the main measure (the dependency relation) produces the
following matrix:

        A       B1      B2      C       D
A       0       0.83    0.83    0       0
B1     −0.83    0       0       0.83    0
B2     −0.83    0       0       0.83    0
C       0      −0.83   −0.83    0       0.909
D       0       0       0      −0.909   0
Starting from this relation and considering – for example – a value 0.9 for the depen-
dency threshold, it is possible to identify the complete set of dependencies, including
the split from activity A to B1 and B2 . In order to identify the type of the split it is
necessary to use the AND measure (Eq. (2)):
\[ A \Rightarrow_W (B_1 \wedge B_2) = \frac{5 + 5}{5 + 5 + 1} = 0.909 \]
So, considering — for example — an AND-threshold of 0.1, the type of the split is
set to AND. In the ProM implementation, the default value for dependency threshold
is 0.9, and for the AND-threshold it is 0.1.
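The following Python sketch (illustrative; the direct-succession counts below are assumed so as to reproduce the values of the example and are not taken from an actual log) computes the dependency measure of Eq. (1) and the AND measure used above from a table of |a >W b| counts.

def dependency(counts, a, b):
    """Heuristics Miner dependency measure a =>_W b (Eq. (1)).
    counts[(x, y)] is |x >_W y|, the number of direct successions x -> y."""
    ab, ba = counts.get((a, b), 0), counts.get((b, a), 0)
    return (ab - ba) / (ab + ba + 1)

def and_measure(counts, a, b, c):
    """AND/XOR split measure a =>_W (b AND c), as used in the worked example."""
    bc, cb = counts.get((b, c), 0), counts.get((c, b), 0)
    ab, ac = counts.get((a, b), 0), counts.get((a, c), 0)
    return (bc + cb) / (ab + ac + 1)

# Direct-succession counts chosen to match the example of Fig. 17 (assumed values):
W = {("A", "B1"): 5, ("A", "B2"): 5, ("B1", "C"): 5, ("B2", "C"): 5,
     ("B1", "B2"): 5, ("B2", "B1"): 5, ("C", "D"): 10}
print(round(dependency(W, "A", "B1"), 2))          # 0.83
print(round(and_measure(W, "A", "B1", "B2"), 3))   # 0.909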
B Error bounds
If the stream is stationary, the Hoeffding bound allows the measures computed by the
online Heuristics Miner on the stream prefix \(S_0^t\) to be related, with probability at
least \(1 - \delta\), to their asymptotic values:
\[ (a \Rightarrow_{S_0^t} b) \;\le\; (a \Rightarrow_S b)\, \frac{E[\rho_{ab} + \rho_{ba}]}{E[\rho_{ab} + \rho_{ba}] - \epsilon_{ab}(t) + \frac{1}{nc(t)}} \;+\; \frac{\epsilon_{ab}(t)}{E[\rho_{ab} + \rho_{ba}] - \epsilon_{ab}(t) + \frac{1}{nc(t)}} \]
\[ (a \Rightarrow_{S_0^t} (b \wedge c)) \;\le\; (a \Rightarrow_S (b \wedge c))\, \frac{E[\rho_{bc} + \rho_{cb}]}{E[\rho_{bc} + \rho_{cb}] - \epsilon_{abc}(t) + \frac{1}{nc(t)}} \;+\; \frac{\epsilon_{bc}(t)}{E[\rho_{ab} + \rho_{ac}] - \epsilon_{abc}(t) + \frac{1}{nc(t)}} \]
where, for all \(d, e, f \in A_S\),
\[ \epsilon_{de}(t) = \sqrt{\frac{(\xi_{de} + \xi_{ed})^2 \ln(2/\delta)}{2\, nc(t)}}, \qquad \epsilon_{def}(t) = \sqrt{\frac{(\xi_{de} + \xi_{df})^2 \ln(2/\delta)}{2\, nc(t)}}, \]
and \(E[x]\) is the expected value of \(x\).
Proof 1 Consider the Heuristics Miner definition \( (a \Rightarrow_S b) = \frac{|a >_S b| - |b >_S a|}{|a >_S b| + |b >_S a| + 1} \) (as
presented in Eq. (1)). Let \(N_c\) be the number of cases contained in \(S_0^t\); then
\[ (a \Rightarrow_{S_0^t} b) = \frac{\frac{|a >_{S_0^t} b| - |b >_{S_0^t} a|}{N_c}}{\frac{|a >_{S_0^t} b| + |b >_{S_0^t} a|}{N_c} + \frac{1}{N_c}} \]
and
\[ (a \Rightarrow_S b) = \lim_{N_c \to +\infty} \frac{\frac{|a >_{S_0^t} b| - |b >_{S_0^t} a|}{N_c}}{\frac{|a >_{S_0^t} b| + |b >_{S_0^t} a|}{N_c} + \frac{1}{N_c}} = \frac{E[\rho_{ab} - \rho_{ba}]}{E[\rho_{ab} + \rho_{ba}]}. \]
We recall that \( \overline{X} = \frac{|a >_{S_0^t} b| - |b >_{S_0^t} a|}{N_c} \) is the mean of the random variable \(X = (\rho_{ab} - \rho_{ba})\)
computed over \(N_c\) independent observations, i.e. traces, and that \(X \in [-\xi_{ba}, \xi_{ab}]\).
We can then use the Hoeffding bound [9], which states that, with probability \(1 - \delta\),
\[ \overline{X} - E[X] < \epsilon_X = \sqrt{\frac{\xi_X^2 \ln(2/\delta)}{2 N_c}}. \]
Defining \(\overline{Y}\), \(\epsilon_Y\) and \(\xi_Y\) analogously for \(Y = (\rho_{ab} + \rho_{ba})\), it follows that
\[ \frac{E[X] - \epsilon_X}{E[Y] + \epsilon_Y + \frac{1}{N_c}} \;\le\; \frac{\overline{X}}{\overline{Y} + \frac{1}{N_c}} = (a \Rightarrow_{S_0^t} b), \]
and, symmetrically, the corresponding upper bound stated above.
The last two bounds can be proved in a similar way by considering \(X = (\rho_{bc} + \rho_{cb}) \in [0, \xi_{bc} + \xi_{cb}]\) and \(Y = (\rho_{ab} + \rho_{ac}) \in [0, \xi_{ab} + \xi_{ac}]\), which leads to
\(\epsilon_X = \sqrt{\frac{(\xi_{bc} + \xi_{cb})^2 \ln(2/\delta)}{2 N_c}}\) and \(\epsilon_Y = \sqrt{\frac{(\xi_{ab} + \xi_{ac})^2 \ln(2/\delta)}{2 N_c}}\).
Similar bounds can be obtained also for the other measures computed by Heuristics
Miner. From the bounds it is possible to see that, as the number of observed cases nc(t)
increases, both 1/nc(t) and the errors ε_ab(t) and ε_abc(t) go to 0, and the
measures computed by the online version of Heuristics Miner consistently converge to
the "right" values.
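For instance, the error term ε_de(t) can be evaluated numerically from its definition, as in the small Python sketch below (the ξ values, δ and the nc(t) values are illustrative).

import math

def epsilon(xi_sum, delta, num_cases):
    """Hoeffding error bound sqrt(xi_sum^2 * ln(2/delta) / (2 * num_cases))."""
    return math.sqrt(xi_sum ** 2 * math.log(2.0 / delta) / (2.0 * num_cases))

# e.g. xi_ab + xi_ba = 2 and confidence 99% (delta = 0.01):
for nc in (100, 1000, 10000):
    print(nc, round(epsilon(2, 0.01, nc), 4))   # the bound shrinks as nc(t) grows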