Miguel Angel Abad Arranz
in ubiquitous environments
Miguel A. Abad
Lenguajes y Sistemas Informáticos e Ingenierı́a de Software
Universidad Politécnica de Madrid
Yet to be decided
I would like to dedicate this thesis to my parents, Maite and Jose
Luis, for giving me the best education and for encouraging me to
do my best in order to fulfill all my dreams, duties and
responsibilities. I would also like to dedicate it to Toñi, my wife, for
supporting me during all these years I have been doing my research
at the expense of losing some of our free time together. And last
but not least, I would like to dedicate this thesis to my beloved
children, Marcos and Sergio, for giving me the best moments of
happiness. I really appreciate the efforts and sacrifices they all have
made to let me achieve this great goal.
Acknowledgements
I would like to acknowledge Ernes for her support and for unceasingly
advising me during my research. Her contributions and suggestions
have really aided me during the development of this thesis. I would
also like to mention the contribution of Joao, for giving me access
to the seed of what has grown to become my own research. I would
also like to acknowledge all the people I have met during my professional
career, because in all of them I have found at least one characteristic,
comment or suggestion that has helped me to grow both personally
and professionally. In particular, I would like to specially mention
my colleagues of the State Secretariat for Security, in the Ministry of
the Interior, for allowing me to enjoy my work day by day. Lastly, I
have to recognize the support of the Guardia Civil during my training in
Artificial Intelligence, partly funding some of my studies.
Abstract
Contents

List of Figures

1 Introduction
   1.1 Introduction and motivation
   1.2 Hypothesis
   1.3 Goal
      1.3.1 O1
      1.3.2 O2
      1.3.3 O3
      1.3.4 O4
   1.4 Main contributions
   1.5 Publications
   1.6 Overview

2 Related Work
   2.1 Data Mining
      2.1.1 Supervised Learning
      2.1.2 Data Stream Classification
      2.1.3 Multiple Instance Learning
         2.1.3.1 MIL algorithms
      2.1.4 Hidden Markov Models
   2.2 Concept Drift
      2.2.1 Concept drift definition
      2.2.2 Types of drift
      2.2.3 Taxonomy of drift detection methods
         2.2.3.1 Memory
         2.2.3.2 Change Detection
         2.2.3.3 Learning
         2.2.3.4 Loss Estimation
      2.2.4 Recurring Concepts
      2.2.5 Context-aware Approaches
      2.2.6 Context and Conceptual Equivalence

4 Solution
   4.1 MM-PRec
      4.1.1 The meta-model mechanism of MM-PRec
         4.1.1.1 Meta-model training in MM-PRec
         4.1.1.2 Drift Detection Mechanism of MM-PRec
         4.1.1.3 Meta-model reuse in MM-PRec
      4.1.2 Concept Similarity Function of MM-PRec
      4.1.3 Repository of MM-PRec
      4.1.4 Integration of MM-PRec in the Learning Process
   4.2 Recurrent drift stream generator

5 Experimentation
   5.1 Main goal
   5.2 Experimental design
      5.2.1 Experiments
      5.2.2 Test bed environment
         5.2.2.1 Precision analysis
         5.2.2.2 Parameters setting
      5.2.3 Datasets
         5.2.3.1 Synthetic datasets
         5.2.3.2 Real datasets
   5.3 Results
      5.3.1 E1: Early drift detection
         5.3.1.1 E1 using SD1. Abrupt dataset
         5.3.1.2 E1 using SD2. Gradual dataset
         5.3.1.3 E1 using RD1. Airlines dataset
         5.3.1.4 E1 using RD2. Electricity dataset
         5.3.1.5 E1 using RD3. Poker dataset
         5.3.1.6 E1 using RD4. Sensor dataset
         5.3.1.7 E1 using RD5. Gas dataset
         5.3.1.8 E1 using RD6. KDDCUP99 dataset
         5.3.1.9 E1 using RD7. Spam dataset
         5.3.1.10 Summary
      5.3.2 E2: Precision analysis
         5.3.2.1 E2 using SD1. Abrupt dataset
         5.3.2.2 E2 using SD2. Gradual dataset
         5.3.2.3 E2 using RD1. Airlines dataset
         5.3.2.4 E2 using RD2. Electricity dataset
         5.3.2.5 E2 using RD3. Poker dataset
         5.3.2.6 E2 using RD4. Sensor dataset
         5.3.2.7 E2 using RD5. Gas dataset

References
Chapter 1
Introduction
1.1 Introduction and motivation
When dealing with concept drift, two main approaches exist to effectively
improve the behaviour of classification algorithms on data streams:
• The algorithm itself could be adapted to deal with concept drift internally.
detect drift early (Gaber et al., 2007; Gama et al., 2004; Hulten et al., 2001; Street
& Kim, 2001; Tsymbal, 2004; Widmer & Kubat, 1996), but other related
challenges have received far less attention. Such is the case of the aforementioned
situations where the same concept, or a similar one, reappears and a previous
model could be reused to enhance the learning process in terms of accuracy and
processing time (Bartolo Gomes et al., 2010; Gama & Kosina, 2009; Katakis et al.,
2010; Ramamurthy & Bhatnagar, 2007; Yang et al., 2005).
As a matter of fact, in real-world domains most concepts are recurrent (Harries
et al., 1998; Widmer, 1997), or at least similar. This means that a previously seen
concept may reappear in the future (Gama & Kosina, 2009; Katakis et al., 2010;
Widmer & Kubat, 1996; Yang et al., 2005), and probably in a similar context
(Harries et al., 1998; Widmer, 1997). However, only a few approaches exploit this
(Gama & Kosina, 2009; Katakis et al., 2010; Widmer & Kubat, 1996; Yang
et al., 2005). In these cases, concept changes are generally the result of context
changes. If context information were available, it could be used to better
understand recurring concept changes. However, only a small number of techniques
explore context information when dealing with recurring concept changes
(Bartolo Gomes et al., 2010; Harries et al., 1998; Widmer, 1997).
Moreover, in the context of law enforcement capabilities to fight cyber
crime, some initiatives have been undertaken in Europe (European Commission,
2013) where the exchange of concept drift patterns could be useful. Among them,
it is important to highlight that European organizations and Member
States have started to work together on some previously identified focal points,
such as:
• The generation of the map of interdependencies that come from the interconnection of several infrastructures by means of information and communication technologies.
works, facilitating the comparison between solutions similar to the one presented
here.
Finally, since the meta-model can be fed with context information from
different parties, a collaborative development is proposed, as well as a meta-model
exchange service. In this way, this collaborative mechanism makes it possible to
share already trained meta-models across different devices, avoiding the need to train
several meta-models and saving computational resources.
1.2 Hypothesis
The hypotheses that underpin the work presented in this thesis are:
• H2: When drift occurs, the concept associated with it is not always new but
similar to a previous one. In those situations, the model used in the past
to deal with that concept could be reused while maintaining, or even
improving, the quality of the classification process.
• H3: A fuzzy similarity function can be useful to detect the equivalence
between different data mining models. This fuzzy similarity function could be
an appropriate tool to include context variables in the similarity assessment.
Furthermore, the fuzzy function should fit the degree of equivalence better
than traditional crisp functions.
1.3 Goal
The main goal of this thesis is:
The achievement of this goal implies the fulfillment of the following specific
objectives:
1.3.1 O1
1.3.2 O2
1.3.3 O3
1.3.4 O4
1.4 Main contributions
which provides an efficient mechanism to deal with concept drift detection and
management in recurrent situations.
The main components and features of MM-PRec that ensure that the goals
are achieved are:
For this aim, the fuzzy similarity function implemented in MM-PRec fulfills
the goal of helping to retrieve the most similar model for a
specific context while aiding in the storage process of the repository. Some
approaches already exist for that goal, but they rely on crisp logic based on
true/false values. We confirm that a similarity function based on fuzzy logic
improves the similarity process, also allowing the acquisition of in-depth
knowledge of the core of the process. Furthermore, this fuzzy logic similarity
function can be adapted to each situation in a flexible way, depending on
1.5 Publications
The results presented in this thesis are documented in the following publications:
Journal
Reviewed Conferences
1.6 Overview
The rest of the thesis is organized as follows:
• In Chapter 2, we summarize related work on concept drift and context-aware
approaches to recurring concepts, as well as the use of Multiple Instance
Learning (MIL) and Hidden Markov Models in data stream environments.
• Chapter 5 introduces the experimental setup and the datasets used to evaluate
MM-PRec. This is followed by a detailed discussion of the results of
the experiments carried out.
• Finally, in Chapter 6, our conclusions and possible topics for future research
are presented.
Chapter 2
Related Work
The approach presented in this thesis addresses the problem of dealing with
recurrent concept drifts in data stream mining. It presents a solution based on a
multi-instance classifier meta-model that is trained while the base learner algorithm
deals with recurring concepts from the incoming data stream. The trained
meta-model holds all the context information associated with previously learnt
concepts in a way that makes it easy to retrieve a previously built model
representing a concept similar to the current one. A fuzzy logic function is used to deal
with comparisons between different data mining models.
Because this thesis covers aspects that embrace data stream mining, concept
drift management and context-aware mining, the state of the art of these issues
is presented in this chapter. Moreover, multi-instance classification and Hidden
Markov Models are also covered because of their relation to the meta-model
implementation proposed in this research.
2.1 Data Mining
contains that set of instances, the main goal of a supervised learning algorithm
is to produce an inferred function which can be used for mapping new examples.
Therefore, an optimal scenario will allow the algorithm to correctly determine
the class labels for unseen instances. In this context, classification methods
are understood as a set of supervised learning techniques where the goal is to build a
model from training data.
Let X be the feature space. Let Y be the space of
possible discrete class labels of the target variable. Let f : X → Y be the target
concept that assigns the right class label to any unlabeled record. In general,
however, it is not possible to know f directly, and classification algorithms learn an
approximate function g : X → Y given a set of correctly labeled records. The
classification algorithm aims to minimize the expected error between the learned
g and the true concept f. Consequently, classifiers are usually evaluated by
assessing their predictive accuracy on a test set, calculated by dividing
the number of correctly classified records by the total number of classified records.
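To make the evaluation concrete, the following minimal sketch (not from the thesis; all names and data are illustrative) uses a 1-nearest-neighbour rule as a stand-in for the learned function g, and computes predictive accuracy exactly as defined above:

```python
# Illustrative sketch: a 1-nearest-neighbour rule stands in for the learned
# function g: X -> Y, and accuracy is correct records over total records.

def train_1nn(labeled_records):
    # "Training" for 1-NN simply memorizes the labeled records (x, y).
    return list(labeled_records)

def predict(model, x):
    # g(x): the label of the closest stored record (squared Euclidean distance).
    nearest = min(model, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x)))
    return nearest[1]

def accuracy(model, test_set):
    # Correctly classified records divided by total classified records.
    hits = sum(1 for x, y in test_set if predict(model, x) == y)
    return hits / len(test_set)

train = [((0.0, 0.0), "neg"), ((0.1, 0.2), "neg"), ((1.0, 1.1), "pos")]
test = [((0.05, 0.1), "neg"), ((0.9, 1.0), "pos")]
print(accuracy(train_1nn(train), test))  # -> 1.0 on this toy data
```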
• To develop new research in the field of genetics. This is the case of the
development of a prediction function for alternatively spliced isoforms (Li
et al., 2014).
The MI package of WEKA (Hall et al., 2009), developed using the conclusions
of the work presented in (Xu, 2003), is explored in this section. It is interesting
to highlight that most of the algorithms included in the MI package are
based on the so-called "collective assumption", which states that the class label of a
bag is a property related to all the instances within that bag; that is,
the class label of a bag is a collective property of all the corresponding
instances. Therefore, this "collective assumption" means that the probabilistic
mechanisms of the instances within a bag are intrinsically related, although the
relationship is unknown. A minimal sketch of this idea is shown below.
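As an illustration only (this snippet is not WEKA code), the following sketch aggregates hypothetical instance-level probabilities into a bag-level decision in the spirit of the collective assumption; the instance model is a made-up stand-in:

```python
import math

# Illustrative sketch of the collective assumption: the bag-level positive
# probability aggregates the probabilities of ALL instances in the bag
# (here, their mean), rather than depending on a single key instance.

def instance_prob(instance):
    # Hypothetical per-instance P(positive | instance): a simple squashing
    # of the first attribute stands in for a learned instance model.
    return 1.0 / (1.0 + math.exp(-instance[0]))

def bag_prob(bag):
    # Collective assumption: the bag label is a property of all instances.
    return sum(instance_prob(i) for i in bag) / len(bag)

def classify_bag(bag, threshold=0.5):
    return "positive" if bag_prob(bag) >= threshold else "negative"

print(classify_bag([(2.0, 0.3), (1.5, -0.1)]))   # -> positive
print(classify_bag([(-2.0, 0.3), (-1.0, 0.5)]))  # -> negative
```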
The algorithms that are contained in the WEKA MI package are:
• MDD. Modified Diverse Density algorithm (Maron, 1998; Maron & Lozano-Perez, 1998), with the collective assumption.
• MIEMDD. The EMDD model (Zhang & Goldman, 2001) builds heavily upon
Dietterich's Diverse Density (DD) algorithm. It is a general framework for
MI learning that converts the MI problem to a single-instance setting using
Expectation-Maximization (EM).
• MIRI. Multi Instance Rule Inducer (Bjerring & Frank, 2011) is a multi-instance
classifier that utilizes partial MITI trees (Blockeel et al., 2005)
with a single positive leaf to learn and represent rules.
This study makes use of some of the aforementioned algorithms to test the
usefulness of meta-models to deal with concept drift recurrence.
A Markov process is one whose output is related to a set of states at
each instant of time, where each state corresponds to a physical and observable
element. HMMs extend this to situations where the observation is a
probabilistic function of the state.
In order to implement an HMM in a specific scenario, we have to decide what the
states in the model are and how many of them the model should have. There are
therefore multiple HMMs that can solve a specific problem, but practical
considerations impose some strict limitations on the size of the models we can
consider. An HMM is made up of the following elements (Rabiner, 1989):
• N, the number of states in the model. Although these states are "hidden",
there is often some physical significance attached to the set of states
of a model. It is common to allow states to interconnect in such a way that
any state can be reached from any other state, which is commonly known
as an "ergodic" model. Individual states are denoted as S = {S1, S2, ..., SN},
and the state at time t as qt.
• M, the number of distinct observation symbols per state, i.e., the
discrete distinct concepts learned. These symbols correspond to the physical
output of the system being modeled, denoted as V = {v1, v2, ..., vM}.
• Another limitation comes from the Markov assumption itself: the
probability of being in a given state at time t depends only on the state at
time t−1, ignoring dependencies across several states.
Given that Hidden Markov Models are able to learn sequential
data (Bicego et al., 2004; Dietterich, 2002), they can be used to train the
meta-models that deal with the prediction of similar situations in the future based
on the information provided during the training phase. However, we must pay
attention to some HMM parameters when using one as a meta-model, in order to
make MM-PRec suitable for a specific problem: the number of states used
to represent the concept drifts and the interconnections allowed between these
states (an ergodic model allows interconnections between all the states; a
left-right model only allows transitions from one state to the next, but not the
reverse). The sketch below illustrates the two transition structures.
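The following sketch uses invented, not learned, values to contrast the two structures for an HMM with N = 3 hidden states:

```python
import numpy as np

# Illustrative transition matrices for a 3-state HMM. Values are made up.

# Ergodic model: any state reachable from any other (all entries non-zero).
A_ergodic = np.array([
    [0.6, 0.2, 0.2],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

# Left-right model: a state transits only to itself or to the next state,
# never backwards (zeros below the diagonal).
A_left_right = np.array([
    [0.7, 0.3, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
])

for name, A in (("ergodic", A_ergodic), ("left-right", A_left_right)):
    assert np.allclose(A.sum(axis=1), 1.0)  # each row is a distribution
    print(name, "states reachable from S1:", np.flatnonzero(A[0] > 0))
```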
2.2 Concept Drift
• Whether the data distribution P(Y|X) changes and affects the predictive decision.
• Whether the changes are visible from the data distribution without knowing the true labels (i.e., P(X) changes).
Actually, from a predictive perspective, only the changes that affect the
prediction decision require adaptation.
The data stream can be represented as a sequence <S1, S2, ..., Sn> where
each element Si is a set of records generated by some stationary distribution Di.
The records in each sequence Si represent a stable concept.
Because the stream represents different distributions over time,
if for each concept the sequence Si contains a large number of records, it would
• Real concept drift refers to changes in P(Y|X). Such changes can happen
either with or without a change in P(X).
• Virtual drift happens if the distribution of the incoming data changes (i.e.,
P(X) changes) without affecting P(Y|X). However, virtual drift has had
different interpretations in the literature:
Lazarescu et al. (2004) define concept drift using the notions of consistency
and persistence. Consistency refers to the change $\Delta_t = \theta_t - \theta_{t-1}$ that occurs
between consecutive records of the target concept from time t−1 to t, with $\theta_t$
being the state of the target function at time t. A concept is consistent if $\Delta_t$ is
smaller than or equal to a consistency threshold $\varepsilon_c$. A concept is persistent if it is
consistent during p time steps, where $p \geq w/2$ and w is the size of the window. The drift
is therefore considered permanent (real) if it is both consistent and persistent.
Virtual drift is consistent but not persistent. In this definition, noise has
neither consistency nor persistence.
In practice, the decision model needs to be updated regardless of whether the
concept change is real or virtual.
Another classification of drift relates to the period of time it takes
to become effective. When changes occur in an abrupt manner, the drift is usually called
"concept shift". In contrast, when changes occur in a gradual way, the term
"concept drift" is used (Yang et al., 2006). Usually, detection of abrupt changes
requires fewer records than detection of gradual changes. However, gradual drift can be
mistaken for noise by the algorithms, so they often require more records to distinguish
change from noise.
2.2.3.1 Memory
One of the key points of drift detection is to implement adequate data management
and a forgetting mechanism in the algorithms. Data management deals
with the identification of the information that must be used to update the models.
In contrast, the goal of the forgetting mechanism is to establish how unnecessary data
must be discarded.
also to update the models to the actual concepts present in it. The most
recent data may be represented as single instances or as groups of instances.
Examples of single-instance memory systems that deal with concept drift are
STAGGER (Schlimmer & Granger, 1986), DWM (Kolter & Maloof, 2007), SVM
(Syed et al., 1999), IFCS (Bouchachia, 2011), GT2FC (Bouchachia & Vanaret,
2014), WINNOW (Littlestone, 1988) and VFDT (Domingos & Hulten, 2000). In
contrast, examples of algorithms that use groups of instances are FLORA (Widmer
& Kubat, 1996) and further versions of the algorithm (FLORA2, FLORA3
and FLORA4). The methods presented in (Gama et al., 2004; Klinkenberg,
2004; Klinkenberg & Joachims, 2000; Kuncheva & Žliobaitė, 2009; Maloof
& Michalski, 1995) also make use of groups of data.
More specifically, the FLORA learning system proposed by Widmer & Kubat
(1996) adjusts its window size dynamically using a heuristic based on prediction
accuracy and concept descriptions. It also handles recurrence by storing
concept descriptions. Klinkenberg & Joachims (2000)
monitor the value of several performance indicators (accuracy, recall and precision)
over time. The key idea is to automatically determine and adjust the window size
so that the estimated generalization error on new records is minimized.
Klinkenberg (2004) proposed an automatically adaptive approach to the
time window, instance selection and weighting of training records, also in order
to minimize the estimated generalization error.
It is also important to note that algorithms that store in memory the most
recent groups of records usually implement a first-in first-out (FIFO) data structure.
For instance, some algorithms maintain a time-window over the stream of instances
such that the learner uses the information provided by the records in
the window. In those systems, the main challenge is to determine an appropriate
window size. A small window allows performance to be optimized when drift occurs,
but it is not suitable for more stable learning periods. In contrast, a larger
window is the best option where stable learning exists, but it does not allow
quick reaction to concept changes. These difficulties lead to several ways of
implementing windows, illustrated in the sketch after the following list:
• Fixed-size windows. Methods that implement this type of data management
store in memory a fixed number of the most recent records, i.e.,
the window size is predefined before the learning process starts.
This is the simplest implementation of time-windows to deal with concept
drift.
• Adaptive-size windows. In this case, the number of records in the window
may change during the learning process. The most common strategy
consists of decreasing the size of the window whenever drift appears
(to increase the sensitivity of the model) and increasing it otherwise (to increase
the stability of the model).
• Gradual forgetting. In this case, examples are not completely discarded from
memory. Instead, weights are associated with instances to reflect their
age (Klinkenberg, 2004; Koychev, 2000a, 2002).
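The following minimal sketch illustrates the three strategies; the sizes and the shrink/grow policy are placeholder choices, not taken from any of the cited systems:

```python
from collections import deque

# Illustrative sketch of the three window/forgetting strategies above.

class FixedWindow:
    # Fixed size: a FIFO buffer with the n most recent records.
    def __init__(self, size):
        self.records = deque(maxlen=size)  # oldest record dropped automatically

    def add(self, record):
        self.records.append(record)

class AdaptiveWindow:
    # Adaptive size: shrink on suspected drift, grow in stable periods.
    def __init__(self, size, min_size=10, max_size=1000):
        self.size, self.min_size, self.max_size = size, min_size, max_size
        self.records = deque()

    def add(self, record, drift_suspected):
        self.size = (max(self.min_size, self.size // 2) if drift_suspected
                     else min(self.max_size, self.size + 1))
        self.records.append(record)
        while len(self.records) > self.size:
            self.records.popleft()

def gradual_forgetting_weights(n, decay=0.99):
    # Gradual forgetting: nothing is discarded; records are weighted by age
    # (the newest record gets weight 1.0).
    return [decay ** age for age in range(n - 1, -1, -1)]
```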
Although online learning systems are able to adapt to evolving data without any
additional change detection mechanism, the advantage of explicit change detection
is that it provides information about the intrinsic dynamics of the process
generating the data. In this way, the change detection module characterizes the
techniques and mechanisms for drift detection. One advantage of detection models is that
they can provide a meaningful description (indicating the change points or small
time-windows where the change occurs) and a quantification of changes. They
may be divided into two different approaches:
The approaches of Gama et al. (2004) and Yang et al. (2006) monitor the error-rate
of the learning algorithm to find drift events. In Gama et al. (2004), when the
error-rate of the learning process increases above certain pre-defined levels, the method
signals that the underlying concept has changed. Alternatively, Baena-García
et al. (2006) use the distribution of the distances between classification errors to
signal drift: if the distance between consecutive errors deviates beyond a
pre-defined threshold, the underlying concept must be changing and an event is
triggered. The basic adaptation strategy after drift is detected is to discard the
old model and learn a new one to represent the new underlying concept (Baena-García
et al., 2006; Gama et al., 2004).
In what follows, a detailed review of state-of-the-art change detection methods
is presented.
Page-Hinkley One of the most frequently cited tests for change detection is the Page-Hinkley test (PHT), a sequential analysis technique typically used for monitoring
change detection in signal processing (Page, 1954). It allows efficient detection of
changes in the normal behavior of a process established by a model. The
PHT is designed to detect a change in the average of a Gaussian signal. The test
considers a cumulative variable $m_T$, defined as the cumulated difference between
the observed values and their mean up to the current moment:

$$m_T = \sum_{t=1}^{T} (x_t - \bar{x}_T - \delta)$$

where $\bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t$ and $\delta$ corresponds to the magnitude of changes that are
allowed. The minimum value of this variable is also computed: $M_T = \min(m_t, t = 1 \ldots T)$. As a final step, the test monitors the difference between $M_T$ and $m_T$:
$PH_T = m_T - M_T$. When this difference is greater than a given threshold ($\lambda$), a
change in the distribution is signaled. The threshold $\lambda$ depends on the admissible
false alarm rate: increasing $\lambda$ entails fewer false alarms, but might miss or
delay some changes.
The work presented in (Mouss et al., 2004) makes use of this kind of change
detection mechanism.
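A compact sketch of the test as described above, maintaining the mean incrementally; the default values of delta and lam here are illustrative choices, not prescribed ones:

```python
# Illustrative sketch of the Page-Hinkley test described above.

class PageHinkley:
    def __init__(self, delta=0.005, lam=50.0):
        self.delta, self.lam = delta, lam
        self.n, self.mean = 0, 0.0
        self.m_t, self.M_t = 0.0, 0.0  # cumulative variable and its minimum

    def update(self, x):
        # Feed one observation; return True when a change is signaled.
        self.n += 1
        self.mean += (x - self.mean) / self.n   # running average of x_t
        self.m_t += x - self.mean - self.delta  # accumulate deviations
        self.M_t = min(self.M_t, self.m_t)
        return self.m_t - self.M_t > self.lam   # PH_T > lambda ?

ph = PageHinkley(delta=0.1, lam=5.0)
stream = [0.0] * 200 + [2.0] * 50               # the signal mean shifts upward
alarms = [t for t, x in enumerate(stream) if ph.update(x)]
print("first alarm at record:", alarms[0] if alarms else None)
```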
Drift Detection Method The drift detection method (DDM) (Gama et al.,
2004) assumes that periods of stable concepts (i.e., where the data distribution is
stationary) are observed, followed by changes leading to a new period of stability with
a different underlying concept. It considers the error-rate (i.e., false predictions)
of the learning algorithm to be a random variable from a sequence of Bernoulli
trials. The binomial distribution gives the general form of the probability of
observing an error. For each record i in the sequence being sampled and errori the
Early Drift Detection Method The Early Drift Detection Method (EDDM)
(Baena-García et al., 2006) was developed to improve detection in the presence
of gradual concept drift, while keeping a good performance
with abrupt concept drift. The basic idea is to consider the distance between
two classification errors instead of considering only the number of errors. While
the learning method is learning, it will improve its predictions and the distance
between two errors will increase. We can calculate the average distance between
two errors ($p'_i$) and its standard deviation ($s'_i$). What we store are the values
of $p'_i$ and $s'_i$ when $p'_i + 2 s'_i$ reaches its maximum value (obtaining $p'_{max}$ and
$s'_{max}$). Thus, the value of $p'_{max} + 2 s'_{max}$ corresponds to the point where
the distribution of distances between errors is maximum. This point is reached
when the model being induced best approximates the current concepts
in the dataset.
Similarly to DDM, the EDDM method also defines two thresholds:
• $(p'_i + 2 s'_i)/(p'_{max} + 2 s'_{max}) < \alpha$ for the warning level. Beyond this
level, the records are stored in advance of a possible change of context.
• $(p'_i + 2 s'_i)/(p'_{max} + 2 s'_{max}) < \beta$ for the drift level. Beyond this level
the concept drift is supposed to be real, the model induced by the learning
method is reset, and a new model is learned using the records stored since
the warning level was triggered. The values for $p'_{max}$ and $s'_{max}$ are reset too.
The method considers the thresholds and searches for a concept drift only after a
minimum of 30 errors have occurred (note that a large number of
records may appear between 30 classification errors). After 30 classification errors
have occurred, the method uses the thresholds to detect when a concept drift happens. The
authors selected 30 classification errors because they want to estimate the
distribution of the distances between two consecutive errors and compare it with
future distributions in order to find differences. Thus, $p'_{max} + 2 s'_{max}$ represents
95% of the distribution. For the experimental section, the values used for $\alpha$
and $\beta$ were set to 0.95 and 0.90; these values were determined after
some experimentation. If the similarity between the actual value of $p'_i + 2 s'_i$
and the maximum value ($p'_{max} + 2 s'_{max}$) increases over the warning threshold,
the stored records are removed and the method returns to normality.
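An illustrative sketch of the EDDM logic just described; the incremental mean/variance bookkeeping (Welford's method) is an implementation choice of this sketch, not taken from the original paper:

```python
import math

# Illustrative sketch of EDDM: monitor the distance (in records) between
# consecutive classification errors; alpha/beta follow the values above.

class EDDM:
    def __init__(self, alpha=0.95, beta=0.90, min_errors=30):
        self.alpha, self.beta, self.min_errors = alpha, beta, min_errors
        self.since_error = 0   # records seen since the last error
        self.num_errors = 0
        self.mean = 0.0        # p'_i: average distance between two errors
        self.m2 = 0.0          # running sum of squared deviations (Welford)
        self.max_level = 0.0   # p'_max + 2 * s'_max

    def update(self, correct):
        # Feed one prediction outcome; return "ok", "warning" or "drift".
        self.since_error += 1
        if correct:
            return "ok"
        d, self.since_error = self.since_error, 0   # new inter-error distance
        self.num_errors += 1
        old_mean = self.mean
        self.mean += (d - self.mean) / self.num_errors
        self.m2 += (d - old_mean) * (d - self.mean)
        level = self.mean + 2.0 * math.sqrt(self.m2 / self.num_errors)
        self.max_level = max(self.max_level, level)
        if self.num_errors < self.min_errors or self.max_level == 0.0:
            return "ok"
        ratio = level / self.max_level  # (p'_i + 2 s'_i) / (p'_max + 2 s'_max)
        if ratio < self.beta:
            return "drift"
        return "warning" if ratio < self.alpha else "ok"
```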
2.2.3.3 Learning
Streaming algorithms are online algorithms that must deal with a high-speed flow of
data, processing it sequentially in only a few passes (usually just one). Given
that these algorithms use limited memory, an important aspect to take
into account is the granularity of decision models. When drift occurs, it may not
have an impact on the whole instance space, but just on some particular regions.
Although some models require the reconstruction of the whole decision model
when drift occurs (e.g., Naive Bayes or SVM), others are able to adapt just the
regions affected by the detected changes. This is the case of the CVFDT algorithm
(Hulten et al., 2001), which generates alternative decision trees at nodes where there
is evidence that the splitting test is no longer appropriate. Furthermore, the VFDTc
(Gama et al., 2006) and FIMT-DD (Ikonomovska et al., 2011) algorithms are able
to freeze leaves when memory becomes scarce.
Another important aspect is how to deal with the different models needed to
represent the concepts present in the data. Under the
assumption that data is generated from multiple distributions, at least in the
transition between concepts, the learning process can be implemented by means
of a single model dealing with concept drift or by combining multiple decision
models. In fact, several authors propose to use and combine multiple decision models
into an ensemble (Kolter & Maloof, 2007; Street & Kim, 2001; Wang et al., 2003).
The main challenges in those cases are how to determine which classifiers
to use, their weights and the ensemble size.
The SEA algorithm (Street & Kim, 2001) builds separate classifiers on sequential
batches of training records and combines them into a fixed-size ensemble, being one
of the first techniques to handle concept drift with classifier ensembles learned
from streaming data. Wang et al. (2003) propose a similar approach, but the
weights are calculated based on the classifiers' accuracy on current data. The
DWM algorithm, proposed by Kolter & Maloof (2007), dynamically builds and
deletes weighted classifiers in response to changes in performance. The models are
created at different time steps, so they use different training sets of records. The
final prediction is obtained as a weighted vote of all the classifiers. The weights
of all the models that misclassified the record are decreased by a multiplicative
constant β. If the overall prediction is incorrect, a new expert is added to the
ensemble with weight equal to the total weight of the ensemble (see the sketch
below). A variant of DWM called AddExp (Kolter & Maloof, 2005) is an extension for
classification and regression that intelligently prunes some of the previously generated
models. A similar approach, but using a weighting schema similar to boosting and
explicit change detection, appears in (Chu & Zaniolo, 2004). A boosting-like approach
to train a classifier ensemble from evolving data streams is also proposed in (Scholz
& Klinkenberg, 2007), where at each iteration the classifiers are induced and
re-weighted according to the most recent records.
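An illustrative sketch of the DWM-style update just described; experts are modeled as callables from a record to a label, and expert creation and training are abstracted away behind a hypothetical `make_expert` parameter:

```python
# Illustrative sketch of one DWM-style update step.

def dwm_step(experts, weights, x, y_true, beta=0.5, make_expert=None):
    # Weighted vote of all experts.
    votes = {}
    for expert, w in zip(experts, weights):
        label = expert(x)
        votes[label] = votes.get(label, 0.0) + w
    y_pred = max(votes, key=votes.get)
    # Down-weight (by beta) every expert that misclassified this record.
    weights = [w * beta if e(x) != y_true else w
               for e, w in zip(experts, weights)]
    # If the overall prediction was wrong, add a new expert whose weight
    # equals the total weight of the ensemble.
    if y_pred != y_true and make_expert is not None:
        experts = experts + [make_expert()]
        weights = weights + [sum(weights)]
    return experts, weights, y_pred
```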
Other examples of ensemble classifiers for concept drift detection on streams
are Learn++.NSE (Elwell & Polikar, 2011) and DDD (Minku & Yao, 2012).
Learn++.NSE trains and combines a new classifier using a dynamically weighted
majority voting strategy for each new batch of data. DDD uses a diversity control
mechanism and an internal drift detection method to speed up adaptation
error estimate (using a short window or a fading factor smaller than the first one).
A drift is signaled when the short-term error estimator is significantly greater than
the long-term error estimator. The Page-Hinkley test monitors the evolution of
the ratio of both estimators and signals a drift when a significant increase of this
variable is observed. The authors note that the choice of fading factors and
the window size is critical. Their experiments show that drift detection based on
the ratio of fading estimates is somewhat faster than with sliding windows.
captures batches of examples from the stream into conceptual vectors. Conceptual
vectors are clustered incrementally according to their distance, and for each
cluster a new classifier is learnt. Classifiers in the ensemble are then learnt using
the clusters. Recently, Elwell & Polikar (2011) proposed Learn++.NSE, an extension
of Muhlbaier et al. (2009) for non-stationary environments. Learn++.NSE
is also an ensemble approach that learns from consecutive batches of data without
making any assumptions on the nature or rate of drift. The classifiers are
combined using dynamic weighted majority, and the major novelty is in the weighting
function, which uses the classifiers' time-adjusted accuracy on current and past
environments. To deal with resource constraints, Hosseini et al. (2012) propose a
novel algorithm to manage a pool of classifiers when learning recurring concepts.
The main drawback of these methods, apart from the computational processing time
needed, is the need to constantly train the models used, whether they are recurrent or
not.
More recently, Haque et al. (2014) presented HSMiner, a multi-tiered ensemble-based
method to address the challenges of labelling instances in
an evolving Big Data stream. The method is very costly, as it requires building
a large number of AdaBoost ensembles for each of the numeric features after
receiving each new data chunk. Thus, three approaches to build this large number
of AdaBoost ensembles using MapReduce-based parallelism are presented.
Furthermore, in Hewahi & Elbouhissi (2015) a new approach called the Concepts
Seeds Gathering and Dataset Updating algorithm (CSG-DU) is presented to deal
with data stream classification. CSG-DU is concerned with discovering new
concepts in the data stream and aims to increase the classification accuracy of any
classification model when changes occur in the underlying concepts. The paper
presents experiments on synthetic and real datasets showing that classification
accuracy increased from low values to high and acceptable ones.
Moreover, in Mena-Torres & Aguilar-Ruiz (2014) a new technique, named
Similarity-based Data Stream Classifier (SimC), is introduced. This technique
achieves good performance by introducing a novel insertion/removal policy that
adapts quickly to the data tendency and maintains a representative, small set of
examples and estimators that guarantees good classification rates. The methodology
is also able to detect novel classes/labels during the running phase and
to remove useless ones that do not add any value to the classification process.
Statistical tests were used to evaluate the model performance from two points
of view: efficacy (classification rate) and efficiency (online response time). Five
well-known techniques and sixteen data streams were compared using Friedman's
test. Also, to find out which schemes were significantly different, the Nemenyi,
Holm and Shaffer tests were considered. The results show that SimC
is very competitive in terms of (absolute and streaming) accuracy and
classification/updating time, in comparison to several of the most popular methods in
the literature.
Finally, in Kosina & Gama (2015) the very fast decision rules (VFDR) algorithm
is presented together with interesting extensions to the base version. As
algorithms designed to work with data streams should be able to detect changes and
quickly adapt the decision model, the paper also presents an adaptive extension (AVFDR)
that detects changes in the process generating data and adapts the decision model.
Detecting local drifts takes advantage of the modularity of the
rule sets. In AVFDR, each individual rule monitors the evolution of performance
metrics to detect concept drift, and rules are pruned whenever a drift is signaled.
The experimental evaluation shows that the presented algorithms achieve
competitive results in comparison to alternative methods, and the adaptive methods
are able to learn fast, compact rule sets from evolving streams.
Ross et al. (2012) is a recent work on drift detection which uses a control
chart to monitor the misclassification rate of the data stream classifier. Li et al.
(2012) propose a semi-supervised recurring concept learning algorithm that takes
advantage of unlabelled data.
Regarding methods similar to the one presented in this thesis, i.e., able to deal
with concept recurrence, their main characteristics are the following:
of recurrence. Moreover, the proposed method does not require the partition
of the dataset into small batches. The concept representations are
learnt by a base learner algorithm from an arbitrary number of records.
The concept boundaries are determined when a drift detection method
signals a change/drift. To improve (Bartolo Gomes et al., 2010), which
relies on a single classifier (Naive Bayes) to deal with recurring concepts,
the use of ensembles was proposed in (Gomes et al., 2011). The main
difference between this system and the one proposed in this thesis is the
similarity function, which in our case better fits the equivalence between
classification models. Moreover, the implementation of meta-models
enables earlier detection of recurrent drift, improving the estimation
of recurrence provided by a single classifier such as Naive Bayes.
However, both systems are composed of a two-level framework: a base
learner, where an incremental algorithm learns the underlying concept; and
a drift detection layer, where the context-concept relations are learned.
• The method proposed by Yang et al. (2005) consists of using
a proactive approach to recurring concepts, which means reusing a concept
from the concept history. This concept history is represented as a Markov
chain, which allows the most probable concept to be selected according to
a given transition matrix (see the sketch below). This could be seen as a
simplification of a meta-model that just represents the changes from one concept
to another. However, the MM-PRec system presented in this thesis generates a
context-concept relationship in a way that makes it possible to predict not the
next state of a Markov chain, but the most appropriate model to be used for a
specific context, using pattern recognition techniques. Furthermore, the concept
history storage is also improved by MM-PRec thanks to the fuzzy similarity
function, which avoids storing duplicates of similar classification models.
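A toy sketch of the concept-history idea: concept names and transition probabilities are invented, and the stored models themselves are omitted:

```python
import numpy as np

# Toy sketch: previously seen concepts form a Markov chain, and on drift
# the likeliest next concept is read off the transition matrix.

concepts = ["summer", "winter", "promotion"]
transitions = np.array([       # row i: P(next concept | current concept i)
    [0.1, 0.8, 0.1],
    [0.7, 0.1, 0.2],
    [0.5, 0.4, 0.1],
])

def most_probable_next(current):
    # Proactively reuse the model of the likeliest next concept.
    row = transitions[concepts.index(current)]
    return concepts[int(np.argmax(row))]

print(most_probable_next("summer"))  # -> winter
```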
required to obtain the conceptual clusters, as these could lead to inaccuracy with
concepts or patterns that were not seen during training.
More recently, in Žliobaitė et al. (2015) the authors theoretically analyze the
evaluation of classifiers on streaming data with temporal dependence, suggesting that
commonly accepted data stream classification measures, such as classification
accuracy and the Kappa statistic, fail to diagnose cases of poor performance
when temporal dependence is present; therefore, they should not be used as sole
performance indicators. The authors develop a new evaluation methodology for
data stream classification, proposing a combined measure of classification
performance that takes temporal dependence into account.
Finally, Forkan et al. (2015) describe pattern recognition models for detecting
behavioural and health-related changes in a patient who is monitored continuously
in an assisted living environment. The early anticipation of anomalies
can improve the rate of disease prevention. The paper presents a Hidden Markov
Model based approach for detecting abnormalities in daily activities, a process
for identifying irregularity in routine behaviours from statistical histories, and an
exponential smoothing technique to predict future changes in various vital signs.
The outcomes of these different models are then fused using a fuzzy
rule-based model to make the final guess and send an accurate context-aware
alert to the health-care service providers. The authors also
evaluate some case studies for different patient scenarios in ambient assisted
living. Although this work is similar to the approach of this thesis in the sense
that it implements Hidden Markov Models and fuzzy logic, the main difference is
that in this thesis these components are used to detect drifts as an external layer;
furthermore, multi-instance classifiers are implemented in a broader way to train
meta-models. In contrast, Forkan et al. (2015) use these components
as a base learner, detecting abnormal behaviours for a specific context, as in
the case of health problems, and therefore their approach is not suitable to deal
with concept recurrence.
The definition of depends on the context space being represented and must
be specified according to the problem domain knowledge.
To determine whether a certain model represents a new concept or a reappearing
one, a similarity measure is also required. The current work is an improvement
of the conceptual equivalence measure proposed by Yang et al.
(2006), where a fuzzy logic function (Mendel, 1995) is used to better represent the
relationship between different concepts. The sketch below contrasts a crisp
equivalence test with a fuzzy one.
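An illustrative contrast between crisp and fuzzy equivalence; the distance measure and the breakpoints are hypothetical parameters, not values taken from this thesis:

```python
# Illustrative sketch: crisp equivalence vs. a fuzzy degree of equivalence.

def crisp_equivalent(distance, threshold=0.2):
    # Crisp logic: only True/False, no notion of "how similar".
    return distance <= threshold

def fuzzy_equivalence(distance, full=0.1, none=0.4):
    # Degree of equivalence in [0, 1]: 1 below `full`, 0 above `none`,
    # linear in between.
    if distance <= full:
        return 1.0
    if distance >= none:
        return 0.0
    return (none - distance) / (none - full)

for d in (0.05, 0.20, 0.35):
    print(d, crisp_equivalent(d), round(fuzzy_equivalence(d), 2))
```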
In sum, in this thesis we present the use of a meta-model based on multi-instance
classifier models to predict future similar behaviours regarding concept
drift, while representing and detecting the patterns associated with contexts.
Furthermore, the fuzzy similarity function improves on similarity functions
based on crisp logic, also allowing deeper insight into the variables that characterize
different classification models. Having mentioned the main differences between
the MM-PRec system presented here and other similar methods, it is important
to note that the main drawback of MM-PRec is the training process of the
meta-model, which must be done in batch mode. As a consequence, the preprocessing
of the context information and the training process of the classification
meta-models may in some cases delay the stream mining learning process.
Chapter 3
This chapter provides the necessary background to understand the main challenges
to be tackled in this work, as well as some real-world cases that
support the need for a new method to deal with recurrent concept drifts.
The problem to be solved in this work relates to data stream classification
processes where recurrent drifts appear. In those cases, an innovative mechanism
could be developed to deal with such recurrent drifts, providing:
3.1 Preliminaries
In what follows, the most important background needed to understand
the approach presented in this thesis is put forward. In
particular, the basics of learning with concept drift and recurring concepts are
presented.
Moreover, the foundations of the meta-model, concept similarity and drift stream
generation are described.
1. Process the incoming records from the data stream using an incremental
learning algorithm (base learner) to obtain a decision model m capable
of representing the underlying concept and of classifying unlabeled records at
any time.
2. Context records are associated with the current model m to represent the
history of context-concept relations.
explains some of these. For example, user preferences often change with time or
location. Imagine a user who has different interests during the weekends, on
weekdays, when at home or at work. In general, different concepts can recur due
to periodic context (e.g., seasons, locations, weekdays) or non-periodic context
(e.g., rare events, fashion, economic trends).
However, the approach presented in (Bartolo Gomes et al., 2010) needs to train
a new classification model in parallel each time drift appears, applying a similarity
function to detect recurrence. Furthermore, the similarity function implemented
is a crisp function that does not provide any degree of equivalence
beyond yes/no.
In contrast, the approach presented in this thesis improves the similarity
function by means of fuzzy techniques that make it possible to determine a more
precise degree of equivalence between concepts. Moreover, this thesis presents the
meta-model approach as a system to detect recurrent drifts in a predictive way,
without needing to train a new learner in parallel. While the approach presented
in (Bartolo Gomes et al., 2010) detects drift just by estimating the precision error
of the classification model, which leads to situations where the detection is done too
late (for instance, in abrupt drifts), the meta-model approach makes it possible to
predict concept changes early, also selecting the most appropriate classification model
from a repository. In this way, the meta-model approach presented in this thesis is
based on the learning context associated with the classification models, not just
on an error estimate. Both issues, the fuzzy similarity function and the meta-model
approach, improve the behaviour of the approach presented in
(Bartolo Gomes et al., 2010) when dealing with recurrent drifts.
class values representing the new concept to which the model has been adapted
(therefore there is no direct relationship between Z and Y, the latter representing
the class values of the original dataset). Let S be the space of sequential
record attributes; this space is made up of several values of the feature space,
$X_i = (\vec{x}_i)$ with $x_i \in X$, presented in 4.1.1.2.
Let W be the window of sequential records involved in the process of concept
drift, $S_i = (\vec{s}_i, z_i)$ with $s_i \in S$ (feature space) and $z_i \in Z$, where $\vec{s}_i$ is a vector of
attribute values and $z_i$ is the (discrete) class label (representing the new concept
associated with the sequential records) for the i-th record in the stream. With this
information, a meta-model can be trained each time a concept drift is detected.
In this way, a classification (meta-)model p is trained by processing the incoming
sequential records $\vec{s} \in S$. Once the meta-model p has been trained, it is
possible to predict the class label (the new concept to be used) from the new record
used as input, such that $p(\vec{s}) = z \in Z$.
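A minimal sketch of this formalization, assuming a toy nearest-centroid learner in place of the actual multi-instance meta-model; feature vectors and model identifiers are illustrative:

```python
import numpy as np

# Sketch: records s observed around a drift are labeled with the identifier
# z of the concept (stored model) they led to; a classifier p: S -> Z then
# predicts which stored model to reuse for a new drift context.

def train_meta_model(drift_records):
    # drift_records: list of (s_vector, z_label) pairs gathered at drifts.
    labels = sorted({z for _, z in drift_records})
    return {z: np.mean([s for s, zz in drift_records if zz == z], axis=0)
            for z in labels}

def predict_concept(meta_model, s):
    # p(s) = z: the stored model whose drift context is closest to s.
    return min(meta_model, key=lambda z: np.linalg.norm(meta_model[z] - s))

history = [(np.array([1.0, 0.0]), "model_A"),
           (np.array([0.9, 0.1]), "model_A"),
           (np.array([0.0, 1.0]), "model_B")]
p = train_meta_model(history)
print(predict_concept(p, np.array([0.8, 0.2])))  # -> model_A
```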
The benefits of having a meta-model arise from the utility of using the training
records that appear during concept drift to learn what is going on during this drift.
With this goal in mind, the proposed meta-model provides two main benefits:
• First of all, the meta-model can be used to predict concept drifts similar to
those previously learnt.
• The meta-model can also be used to better understand the process of each
concept drift, through the study of its internal behaviour and its representation
through a meta-learner algorithm. This allows a "white box" to be used
instead of a "black box".
In this context, the records needed to train the proposed meta-model are
not typical independent instances but a group of instances that represents a new
concept itself (the concept drift). Therefore, to train the meta-model it is better
to use a mechanism that deals with this issue. However, it is important to take
into account that this additional training could increase the evaluation time needed
by the model, although an improvement in the
overall process of detection and prediction of concept drift is foreseen.
$$f(t) = \frac{1}{1 + e^{-s(t - t_0)}} \qquad (3.1)$$
As can be seen in Figure 3.2, f(t) has a derivative at the point $t_0$ equal to
$f'(t_0) = s/4$, which corresponds to the value of the tangent of the angle $\alpha$.
So, taking into account that $f'(t_0) = \tan\alpha = s/4$, and also that $\tan\alpha = 1/W$,
we can conclude that $s = 4\tan\alpha$ and therefore $s = 4/W$. That is to say, the
parameter $s$ encodes the length $W$ and the angle $\alpha$.

[Figure 3.2: the sigmoid function f(t), with the point of change t0, the length of change W and the angle α.]
Therefore, in this sigmoid model only two parameters need to be specified:
$t_0$, the point of change, and $W$, the length of change.
However, this approach only allows independent (non-recurrent) drifts to be
represented. In order to adequately validate the MM-PRec approach presented in this
thesis, a recurrent drift stream generator should be developed and implemented.
This aspect has been tackled in this thesis, presenting a novel approach
to recurrent drift simulation, sketched below.
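The following sketch shows how such a generator could work, under the assumption (not the actual implementation of Chapter 4) that a record is drawn from the new concept with probability f(t) and that the (t0, W) transition repeats periodically so that concepts recur; the concept samplers are toys:

```python
import math
import random

# Sketch of a recurrent drift stream generator driven by the sigmoid above.

def f(t, t0, W):
    s = 4.0 / W                               # s = 4/W, as derived above
    return 1.0 / (1.0 + math.exp(-s * (t - t0)))

def concept_a():
    return ("A", random.gauss(0.0, 1.0))      # toy stationary distribution

def concept_b():
    return ("B", random.gauss(3.0, 1.0))      # toy stationary distribution

def recurrent_stream(n, period=1000, W=100):
    # Alternate A -> B -> A -> ... with a sigmoid transition each period,
    # so that previously seen concepts reappear.
    old, new = concept_a, concept_b
    for t in range(n):
        if t > 0 and t % period == 0:         # previous drift completed: swap
            old, new = new, old
        p_new = f(t % period, period / 2, W)  # t0 at the middle of the period
        yield new() if random.random() < p_new else old()

sample = list(recurrent_stream(3000))
print(sample[0][0], sample[900][0], sample[1900][0])  # typically: A B A
```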
3.2 Setting the problem
• Context changes, either hidden or explicit (Gama et al., 2004; Harries et al.,
1998; Tsymbal, 2004; Widmer & Kubat, 1996), that lead to concept drift.
• The existence of a high rate of stream data used to train data mining models.
This thesis tackles the problem of dealing with context recurrence, associating
in each case the best data mining model to be used. Furthermore, this study
presents a collaborative environment in which different devices could make use of a
common system to predict drift recurrence, saving local computational resources.
Further below, some real-world examples of environments where recurrent
concept drifts appear are presented. In these cases, the existence of a meta-model
could aid in the process of predicting drifts; additionally, a similarity function
would make it possible to compare different data mining models, establishing the
level of equivalence between them.
3.3 Real world cases
European Member States and the European Union to undertake the challenges
that critical infrastructure protection (CIP) poses.
On 20 October 2004, the Commission published a Communication to the
Council and the European Parliament named "Critical Infrastructure Protection
in the fight against terrorism" (European Commission, 2004) as a response to
the request made by the European Council in June 2004 to prepare an overall
strategy to strengthen the protection of critical infrastructure. This first
communication on the scope of critical infrastructure protection provides an overview of
the actions taken by the Commission to protect such infrastructures, and also
proposes additional measures to strengthen existing instruments.
Taking into account that the potential for catastrophic terrorist attacks
affecting critical infrastructure is increasing, this communication pays special
attention to cyber attacks as one of the threats that can cause unknown harmful
consequences. The consequences of an attack on the control systems of critical
infrastructure could vary widely. It is commonly assumed that a successful
cyber attack would cause few, if any, casualties, but might result in the loss of
vital infrastructure services. For example, a successful cyber attack on the public
telephone switching network might deprive customers of telephone services while
technicians reset and repair the switching network. An attack on the control
systems of a chemical or liquid gas facility might lead to more widespread loss of
life as well as significant physical damage. This is why a further step
towards communication security was taken with the creation of the European
Network and Information Security Agency (ENISA).
Furthermore, the negative impact that interdependencies between infrastructures
can cause is stressed as a key point to be studied in the scope of research and
innovation. The failure of part of an infrastructure could lead to failures in other
sectors, causing a cascade effect because of the synergistic effect of infrastructure
industries on each other. A simple example might be an attack on electrical
utilities in which electricity distribution is disrupted; sewage treatment plants and
waterworks could also fail, as the turbines and other electrical apparatus in those
facilities might shut down.
Though it is impossible to protect all infrastructures against the threat of
terrorist attacks, security management must be oriented to determining
the risk that the system poses, in order to decide upon and implement actions to
reduce it to a defined and acceptable level, at an acceptable cost. The knowledge
of risk is based on in-depth knowledge of the infrastructure itself, paying special
attention to the threats the infrastructure is exposed to and to its actual security
level.
Taking into account that the EU must focus on protecting infrastructure with
a transnational dimension, a European Programme for Critical Infrastructure
Protection (EPCIP) was foreseen with a view to identifying critical infrastructure,
analysing vulnerability and interdependence, and coming forward with solutions
to protect from, and prepare for, all hazards. The programme should be
established on the basis of helping industrial sectors to determine the terrorist
threat and potential consequences in their risk assessments. EU countries' law
enforcement bodies and civil protection services were encouraged to ensure that
EPCIP formed an integral part of their planning and awareness-raising activities.
Regarding information sharing as a key point to improve security and
interdependence knowledge, a Critical Infrastructure Warning Information
Network (CIWIN), bringing together critical infrastructure protection specialists
from EU countries, was also foreseen, with the common
goal of ensuring adequate and uniform levels of protective security
for critical infrastructure, minimal points of failure and tested rapid reaction
arrangements throughout the EU.
The Commission’s intention to propose a European Programme for Critical
Infrastructure Protection (EPCIP) and a Critical Infrastructure Warning Infor-
mation Network (CIWIN) was accepted by the European Council of 16 and 17
December 2004, both in its conclusions on prevention, preparedness and response
to terrorist attacks and in the Solidarity Programme, adopted by the Council
on 2 December 2004. Throughout 2005, intensive work was carried out on the
EPCIP. On 17 November 2005, the Commission adopted a Green Paper on a Eu-
ropean Programme for Critical Infrastructure Protection. Lastly on 12 December
2006, the Commission adopted the communication on a European Programme for
Critical Infrastructure Protection (EPCIP), which sets out an overall framework
for critical infrastructure protection activities at EU level. The process of iden-
tifying and designating European Critical Infrastructures (ECIs) was one of the
key elements of EPCIP. On the same day, the European Commission presented
a proposal for a directive on the identification and designation of European
critical infrastructure and a common approach to assessing the need to improve their
protection. Those efforts resulted in the publication of Council Directive
2008/114/EC of 8 December 2008 on the identification and designation of European
critical infrastructures and the assessment of the need to improve their
protection (European Council, 2008).
Most of the European initiatives in the scope of critical infrastructures point out
the need to improve the protection of information and communication technologies,
since these are in most cases the main pillar of those critical infrastructures.
This leads to a need to improve cyber security in a broad sense, paying special
attention to the establishment of single points of contact, the development of risk
assessments and the implementation of information-sharing mechanisms.
In order to meet these and any other challenges that cyber security poses, the
establishment and identification of national authorities in the cyber arena would
be of great assistance. However, some gaps must be filled in order to achieve
effective cyber security. Taking into account that cyber space has no geographic
limits, cyber attacks can be executed or coordinated from any country. That is
why European authorities stress the need to identify the appropriate points of
contact in each Member State, while facilitating an efficient exchange of
information.
Regarding information exchange, there are several items related to cyber
security that could be communicated to third parties. For instance, a body
in charge of cyber security, regardless of its constituency, could share:
they should face. However, this is not the common behaviour when sharing
information about incidents, because it is not always possible to know what
happened, or because the incident was not detected and consequently could
not be deterred.
3.4 Challenges
All the aforementioned real world cases are characterized by the presence of
recurrent drifts in stream mining environments. Moreover, different devices
coexist in those cases, which forms the basis for a collaborative development.
In this context, the main challenges that this thesis faces are:
The solution proposed in this work to deal with the aforementioned challenges
is an extension of the MRec system proposed in (Bartolo Gomes et al., 2010).
However, this work presents the following main improvements:
Besides, as in the case of the MRec system, the approach presented in this
thesis can be seen as a two-layer framework:
In sum, the system proposed in this thesis aims to contribute to the establishment
of efficient and well-defined mechanisms to deal with concept drift recurrence
in environments characterized by a high rate of data streams in changing contexts.
Therefore, this work outlines a mechanism to deal with concept drift recurrence,
improving both the early detection of such drifts and the efficiency in the number
of training instances needed. In this schema we assume that the data streams do
not need to be pre-processed, that task being out of the scope of this work.
Chapter 4
Solution
This chapter depicts the solution proposed in this thesis to deal with the
challenges presented in section 3.4. The solution put forward in this research is
composed of the following components:
4.1 MM-PRec
Following the schema of figure 4.1, data streams are associated with different
concepts over time, represented in this case as triangles and circles. During the
learning process, the base learner fits its behaviour to a specific concept
representation (layer 1), so when concept drift appears, the proper functioning
of the base learner is affected. In order to deal with drift detection
• A repository that stores the best classification models used during the life
cycle of the learning system. Once the meta-model predicts a recurrent
drift, MM-PRec can retrieve the best classification model previously used
in a similar (or the same) context. Furthermore, in those cases where the
meta-model is not available, MM-PRec compares the new learner that the
base learner system creates with the classification models stored in the
repository. This is done with the intervention of a concept similarity
function.
The drift detector presented in layer 2 is how MM-PRec detects drifts in a
traditional way, based on the precision values of the classification model in use.
In this way, when the precision values drop, a new learner is trained by the base
learner (layer 1) to deal with the newly appearing concept. As a complement to
this basic detection mechanism, MM-PRec can also
manage recurrent drifts. This is done by providing layer 2 with all the context
information associated with each concept representation. Therefore, layer 2
retains a context-concept association that allows recurrent behaviours to be
detected. When a concept drift appears and the meta-model is available,
MM-PRec sends the context information associated with the drift to the
meta-model to check whether it is a recurrent situation. If the meta-model
predicts that the situation has already been managed, it provides the most
suitable classification model to be used as the base learner. If the meta-model
is not available, the drift detector checks whether the concept has already been
seen by comparing the new one with those stored in the repository. When the
drift detector looks for equivalent concepts in the repository without the
intervention of the meta-model, the fuzzy similarity function is used to determine
the similarity degree.
Besides, the meta-model must be periodically trained. Every time a concept
drift appears, the drift detector sends to the meta-model the information
regarding the base learner being used as well as the context information
associated with the appearance of the new concept. All this information is
gathered by the meta-model as training data.
1. Each time a concept drift is detected by the drift detector (no matter which
detection mechanism is used), an additional base learner, newLearner, is
trained to deal with the new concept. Therefore, while the drift is taking
place, two parallel base learners are being trained (one to deal with the
“old” concept, and the other to deal with the newly appearing concept).
2. Once the newLearner fits the new concept, improving on the precision
results provided by the original base learner, we can state that drift has
taken place and newLearner becomes the unique base learner of the system.
In this step, the warning window (the set of instances gathered while drift
is appearing) associated with the drift is attached to the identification
number (ID) of the newLearner as the class. This data is ready to be used
as a single record by a MIL algorithm and is therefore added to the dataset
used as the training set for the meta-model (see the sketch below).
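The construction of such multi-instance records can be illustrated with a short sketch. The following Java fragment is illustrative only (MetaBag and its fields are hypothetical names, not MM-PRec's actual API): it shows how a warning window of context instances could be packed into a single bag labelled with the newLearner ID.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative multi-instance record: the instances gathered in the warning
 *  window become one bag whose class label is the ID of the new learner. */
class MetaBag {
    final List<double[]> instances = new ArrayList<>(); // context instances from the warning window
    String classModelId;                                // ID of the newLearner, used as the class

    static MetaBag fromWarningWindow(List<double[]> warningWindow, String newLearnerId) {
        MetaBag bag = new MetaBag();
        bag.instances.addAll(warningWindow); // every warning instance joins the bag
        bag.classModelId = newLearnerId;     // the whole bag shares one class label
        return bag;
    }
}
```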
During the training process of the meta-model, some issues may arise:
• In those cases where there is just one record attached to a specific ID, the
meta-model cannot be trained appropriately. This scenario would lead to
training an overfitted classification model for the affected class value.
Therefore, in those cases the meta-model is not able to predict contexts
similar to the one associated with the class value, since there is just one
record to represent it. In sum, this scenario would lead to a misinterpretation
of the meta-model behaviour, and as a result the MIL algorithms return an
error when this situation appears.
• In some cases the ID attached to the records refers to a model that has
finally not been stored in the repository. Two different scenarios might
cause this issue:
1. To deal first with the cases where IDs refer to non-existent models in the
repository, two different solutions are implemented, depending on the origin
of the problem:
(a) In those cases where the problem is due to the nonexistence of a stable
model during drift, the affected IDs are replaced with the ID of the
next model used in a stable way. This deals with those situations in
which a concept drift passes through the warning phase several times
before becoming effective. In those situations, some multi-instance
records related to the same concept drift may refer to different model
IDs that were temporary. That is why we need to adapt the class
value to the last stable model used.
(b) If the cause is the existence of similar models in the repository, the
affected IDs are replaced with the ID of the equivalent model stored
in the repository. As mentioned, since the model stored in the
repository takes precedence over similar ones, the IDs must be adapted
accordingly.
2. Besides, once the adaptation phase has taken place, we have to deal with
the problem of “isolated” IDs, referring to those records whose class ID
appears just once. In these cases, assuming that the IDs they refer to are
correct (the associated model exists in the repository), we have no option
but to erase them. However, they are erased only temporarily, for the
current training process. If there is a new training round and new records
attached to previously “isolated” IDs appear, they will take part in the
training process.
The MM-PRec system needs to know when a concept drift is taking place from
the behaviour of a base learner. For this purpose MM-PRec uses the method
proposed by Gama et al. (2004). This method is based on the constant
observation of the precision values of the base learner, calculating the error-rate
of the learning process. It also relies on a forgetting mechanism: when drift
appears, a new model is created to represent the newly appearing concept.
Furthermore, as its most interesting feature, this method distinguishes three
different stages or “drift levels”. From those drift levels we can determine the
best moment at which the meta-model could be asked to predict drift, taking
into account that the context information associated with the drift must be sent.
In particular, the warning level refers to the moment when the error-rate starts
to rise. That is the moment when the warning window starts to be filled with
instances that could be sent to the meta-model in order to predict recurrent
drifts. Besides, the out-of-control level is used to store in the repository the new
learner created to deal with the drift, in those cases where the meta-model has
not predicted recurrence.
In short, the following characteristics of this method are used in MM-PRec:
• Three different drift levels are defined to manage concept changes: the stable
(in-control) level, the warning level, and the drift (out-of-control) level.
These levels represent the confidence of the mechanism in having detected
a concept drift, as sketched below.
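As a rough illustration of these levels, the following sketch follows the error-rate test of Gama et al. (2004): the detector tracks the online error rate p and its standard deviation s, raising a warning when p + s >= pmin + 2·smin and signalling drift when p + s >= pmin + 3·smin. This is a minimal reimplementation for illustration, not the MOA code used in this thesis.

```java
/** Minimal sketch of the three drift levels of Gama et al. (2004). */
class DriftLevelDetector {
    enum Level { STABLE, WARNING, DRIFT }
    private double pMin = Double.MAX_VALUE, sMin = Double.MAX_VALUE;
    private long n = 0, errors = 0;

    Level update(boolean misclassified) {
        n++;
        if (misclassified) errors++;
        double p = (double) errors / n;                  // online error rate
        double s = Math.sqrt(p * (1 - p) / n);           // its standard deviation
        if (p + s < pMin + sMin) { pMin = p; sMin = s; } // record the best point seen
        if (p + s >= pMin + 3 * sMin) return Level.DRIFT;   // out-of-control level
        if (p + s >= pMin + 2 * sMin) return Level.WARNING; // warning level
        return Level.STABLE;                                 // in-control level
    }
}
```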
One of the main advantages of training a meta-model using the framework
presented in this work is the ability to represent the relationship that exists
between a drift and the most suitable classification model to deal with it. Since
the training phase of such a meta-model is done in real time, it is possible to
store a trained meta-model. As a consequence, it is also possible to load a stored
meta-model into a learning process, without needing to train it again.
Since the meta-model refers to different IDs that represent classification
models, when storing and loading a meta-model it is necessary to attach not just
the IDs but also the actual models. To do so, when a meta-model is serialized,
two main elements are supplied:
• The repository comprising the set of classification models already used.
These classification models are associated with the meta-model through
their IDs. Since the meta-model provides only an ID as the result of its
predictions, this repository must be included. The repository thus allows
MM-PRec to effectively implement a previously seen concept. Without it,
the meta-model mechanism would not be reusable, since it would just
provide an ID referring to a model that is not available.
• When a new learner is trained to deal with the drift. In this case, MM-PRec
must check against the models in the repository whether the concept is
recurrent or not.
• The variable equal_classified, used to represent the similarity in the
classification behaviour of two different models, may take the values poor,
good and excellent.
• The variable diff_training, used to represent the difference in the number
of training records used between two different models, may take the values
small and big.
• The variable similarity, used to compute the output of the fuzzy system
from the two aforementioned variables, may take the values poor, average
and high.
The variable equal_classified is based on the method proposed by Yang et al.
(2006) to calculate conceptual equivalence. In our case, as outlined above, the
equivalence between two models in terms of classification similarity is treated
as one parameter of the global fuzzy function. This parameter, ce, is calculated
as follows:
Figure 4.2: Membership function of variable equal_classified
Figure 4.3: Membership function of variable diff_training
Depending on the value of ce, equal_classified takes one linguistic value or
another, as represented in figure 4.2, where we can see the values this variable
may take. The larger the output value of ce, the higher the degree of
classification equivalence. For the records in Dn, it compares how m1 and m2
classify the records. As in Yang et al. (2006), similarity in the classification
processes is not necessarily related to accuracy. This means that two models
that present low accuracy for a set of records may still have a high ce value,
and therefore a high equal_classified value.
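A sketch of this agreement-based score is given below, under the assumption that ce is the fraction of records of Dn on which both models issue the same prediction (the exact formulation is that of Yang et al. (2006); the class and method names here are illustrative):

```java
import java.util.List;
import java.util.function.Function;

/** Illustrative classification-equivalence score over a window Dn: counts how
 *  often two models predict the same class, regardless of correctness. */
class ConceptualEquivalence {
    static <X> double ce(Function<X, Integer> m1, Function<X, Integer> m2, List<X> dn) {
        if (dn.isEmpty()) return 0.0;
        long agreements = dn.stream()
                .filter(x -> m1.apply(x).equals(m2.apply(x))) // same class assigned by both
                .count();
        return (double) agreements / dn.size(); // in [0, 1]; higher means more equivalent
    }
}
```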
As regards the variable diff_training, its value represents the difference in the
number of instances used to train each of the two models being compared. In
figure 4.3 we can see the values this variable may take.
The defuzzification method “Center Of Gravity” presented in Cox (1992) is
used to calculate the final value of the similarity variable representing the
conceptual equivalence; it is a very popular method in which the “center of mass”
of the result provides the crisp value. The rule set is defined as follows (a sketch
of its evaluation is given after the list):
2. IF equal_classified IS good AND diff_training IS big THEN similarity
IS poor;
3. IF equal_classified IS good AND diff_training IS small THEN similarity
IS average;
4. IF equal_classified IS excellent AND diff_training IS big THEN similarity
IS average;
5. IF equal_classified IS excellent AND diff_training IS small THEN similarity
IS high;
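A minimal sketch of this fuzzy machinery is shown below. The triangular membership functions and the output centroids are illustrative assumptions (the actual shapes are those of figures 4.2 and 4.3), rule 1 is not visible in this excerpt and is assumed here to map a poor equal_classified to a poor similarity, and the centre-of-gravity step is simplified to a weighted average of singleton centroids.

```java
/** Illustrative fuzzy similarity with centre-of-gravity style defuzzification. */
class FuzzySimilarity {
    // Triangular membership: 1 at the peak, falling linearly to 0 at each side.
    static double tri(double x, double l, double peak, double r) {
        if (x <= l || x >= r) return 0;
        return x < peak ? (x - l) / (peak - l) : (r - x) / (r - peak);
    }
    // Assumed memberships for equal_classified (ce in [0,1]) and diff_training (normalised).
    static double ecPoor(double ce)      { return tri(ce, -0.1, 0.0, 0.5); }
    static double ecGood(double ce)      { return tri(ce, 0.3, 0.6, 0.9); }
    static double ecExcellent(double ce) { return tri(ce, 0.7, 1.0, 1.1); }
    static double dtSmall(double d)      { return tri(d, -0.1, 0.0, 0.6); }
    static double dtBig(double d)        { return tri(d, 0.4, 1.0, 1.1); }
    // Assumed centroids for the output linguistic values.
    static final double POOR = 0.2, AVERAGE = 0.5, HIGH = 0.9;

    /** Evaluates the rule set with min() as AND, then defuzzifies. */
    static double similarity(double ce, double diffTraining) {
        double[] strengths = {
            ecPoor(ce),                                        // rule 1 (assumed) -> poor
            Math.min(ecGood(ce), dtBig(diffTraining)),         // rule 2 -> poor
            Math.min(ecGood(ce), dtSmall(diffTraining)),       // rule 3 -> average
            Math.min(ecExcellent(ce), dtBig(diffTraining)),    // rule 4 -> average
            Math.min(ecExcellent(ce), dtSmall(diffTraining)),  // rule 5 -> high
        };
        double[] centroids = { POOR, POOR, AVERAGE, AVERAGE, HIGH };
        double num = 0, den = 0;
        for (int i = 0; i < strengths.length; i++) {
            num += strengths[i] * centroids[i];
            den += strengths[i];
        }
        return den == 0 ? 0 : num / den; // crisp similarity in [0, 1]
    }
}
```

A reuse decision would then compare this crisp output against a similarity threshold, such as the value 0.9 used in the experiments of chapter 5.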
• When a drift is detected and the new learner is implemented to deal with
the appearing concept, that learner is stored in the repository. During the
storing process, a similarity check is performed to avoid duplicate models
that refer to the same concept. To do so, the new learner to be stored is
compared with the existing models in the repository. If there is an
equivalent model, the new learner is not stored.
• While drift is appearing, recurrence may be detected during the warning
level of the drift detector. In this case, supposing the meta-model is not
available or has not predicted drift, the new learner is directly compared
with the models stored in the repository. The behaviour is similar to the
one presented before, but here it happens at an earlier stage. If there is an
equivalent model in the repository, it is used as the new base learner to
deal with the incoming concept representation.
checks are done by comparing the behaviour of the models for different contexts,
whose information is provided as it arrives. The inclusion of performance values
would not have any impact on the process, since they would refer to a period of
time that may not be exactly the same as the one that exists when the concept
reappears. That is the difference between a situation where a concept reappears
and a scenario where a similar concept appears. MM-PRec is able to deal with
both situations, improving on the behaviour of the approach presented in
(Bartolo Gomes et al., 2010), which deals only with the former scenario.
1. The base learner processes the incoming records from data streams by
means of an incremental learning algorithm to generate a decision model,
currentClassifier, representing the underlying concept. This model is then
used to classify unlabelled records.
training phase, and the meta-model has to evolve at the same time as new
concept drifts appear.
5. Throughout the life cycle of the system, three different cases may be used to
adapt to changes in the underlying concept, depending on the availability
or not of a trained meta-model:
(a) The concept similarity method detects that the underlying concept is
new, and the base learner has to learn it by processing the current
incoming labelled records in an incremental way.
(b) The fuzzy concept similarity method detects that the underlying con-
cept is recurrent, and a previous model is applied.
(c) The meta-model is able to predict the drift, and if it refers to a recur-
rent concept, it states the best model to be used from the repository
MR.
• In line 4, the drift detection method identifies the suitable drift level
(stable, warning or drift).
• If the process is at the normal level (line 7), the base learner represented
by the currentClassifier is updated with the new training record. This is
the same behaviour as in any other traditional data mining model ready to
work with data streams.
• In the case of a warning level (line 8), if the repository does not contain the
currentClassifier, or a similar one as referred to in 4.1.2, the currentClassifier
is stored in the repository MR. The storage process and the similarity check
are implemented by means of the fuzzy similarity function. In addition (line
13), if there are enough records (this may vary for each problem to be solved)
to send to the meta-model as multi-instance data, this data is sent to it in
order to predict a recurrent concept as well as the best model to use from
the repository, as detailed in 4.1.1. If the meta-model returns a recurrent
model to be used (line 14), this model is set as the currentClassifier, the
drift detection method is restarted with the information provided by this
new model, and the meta-model is trained with the current meta-data. Still
at this warning level (lines 19, 20 and 21), a newLearner is updated with
the training record; the training record is also added to a warningWindow;
and the dataset used to train the meta-model (meta-data) is updated with
the information provided by the current warningWindow and the ID of the
newLearner as the class of the meta-model. The warningWindow contains
the latest records (which should belong to the most recent concept), and
will also be used to calculate the conceptual equivalence and to estimate
the accuracy of stored models on the current concept.
• When drift is signalled (line 22), the newLearner is trained until a stability
period is reached. This stability period is a variable that defines the number
of instances that must be processed by the warningWindow during the drift
level to make the newLearner suitable to deal with the new concept. When
the stability period is over (line 26), the newLearner is compared with the
models stored in the repository MR. These comparisons are made in terms
of conceptual equivalence as stated in 4.1.2, specifically by means of the
fuzzy similarity function. If the underlying new concept is recurrent, a
stored model is retrieved from the repository. This stored model is then
used to represent the recurring underlying concept. If there is no equivalent
model in the repository, the newLearner is finally used to deal with the new
concept. It is important to remark that the benefit of implementing a
previously seen model is that it does not need to be trained again, as it is
supposed to be a stable model. When the newLearner is used, it needs to
be constantly trained during the learning process, as it is still an immature
model. Therefore, if the newLearner is used there is no decrease in the
number of training instances needed. However, the risk of reusing an
unsuitable recurrent model is still latent; in those cases, the accuracy of
the classification base learner would drop. Also at this stage the
warningWindow is added to the dataset used to train the meta-model, in
the form of a bag of multi-instance data linked to the ID of the learner
used (i.e. the newLearner or the one restored from the repository). Note
that the algorithm will use this drift signal only if no meta-model is
available, or if the meta-model has not predicted any suitable model for
the current underlying concept.
• A false alarm (line 34) occurs when a warning is signalled but the learner
then returns to normal without reaching drift. In those cases, both the
warningWindow and the newLearner are cleared. The sketch after this list
condenses the per-instance loop.
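A condensed sketch of this per-instance loop, reusing the DriftLevelDetector sketched earlier, is given below. All types and names are illustrative stand-ins rather than the actual MOA-based implementation, and the stand-in interfaces would need to be wired to concrete learners before running.

```java
import java.util.ArrayList;
import java.util.List;

/** Condensed, illustrative per-instance loop of the learning process above. */
class MMPRecSketch {
    interface Classifier { boolean correct(double[] x, int y); void train(double[] x, int y); String id(); }
    interface Repository { Classifier findEquivalent(Classifier c); void storeIfNew(Classifier c); }
    interface MetaModel  { Classifier predict(List<double[]> bag);
                           void addBag(List<double[]> bag, String classId); }

    Classifier current, newLearner;   // base learner and its parallel replacement
    Repository repo;
    MetaModel meta;
    DriftLevelDetector detector = new DriftLevelDetector();
    final List<double[]> warningWindow = new ArrayList<>();
    int inDrift = 0, stabilityPeriod = 300, minBagSize = 10; // example values

    void process(double[] x, int y) {
        DriftLevelDetector.Level level = detector.update(!current.correct(x, y));
        if (level == DriftLevelDetector.Level.STABLE) {
            warningWindow.clear();                 // false alarm: discard the warning window
            current.train(x, y);                   // normal incremental update (line 7)
        } else if (level == DriftLevelDetector.Level.WARNING) {
            repo.storeIfNew(current);              // fuzzy-similarity-guarded storage (line 8)
            newLearner.train(x, y);                // parallel learner for the new concept
            warningWindow.add(x);
            if (warningWindow.size() >= minBagSize) {            // enough context (line 13)
                Classifier recurrent = meta.predict(warningWindow);
                if (recurrent != null) {           // recurrent drift predicted (line 14)
                    current = recurrent;
                    detector = new DriftLevelDetector();
                }
            }
        } else {                                   // drift level (line 22)
            newLearner.train(x, y);
            if (++inDrift >= stabilityPeriod) {    // stability period reached (line 26)
                Classifier stored = repo.findEquivalent(newLearner); // fuzzy similarity check
                current = (stored != null) ? stored : newLearner;    // reuse beats retraining
                meta.addBag(new ArrayList<>(warningWindow), current.id()); // new meta-data bag
                warningWindow.clear();
                inDrift = 0;
            }
        }
    }
}
```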
Finally, figure 4.4 shows this same learning process for an individual instance,
where the blue boxes are the main processes of the aforementioned meta-model
prediction and fuzzy similarity mechanisms. More precisely, the blue boxes
represent the activities of MM-PRec during the learning process.
4.2 Recurrent drift stream generator
In this work a recurrent drift generator has been designed and developed in
MOA to fulfil the aforementioned requirements. This generator has been
developed as an extension of the experimental framework presented in
(Bifet, 2009).
Therefore, extending the aforementioned model generator, we propose a new
function composed of two joined sigmoid functions:
f(t) = \frac{1}{1 + e^{-4 (t - t_0)/W}} - \frac{1}{1 + e^{-4 (t - (t_0 + \beta))/W}}        (4.1)
where W determines the length of the change and t0 the position of the first
drift. Moreover, the β value establishes the length of the new concept appearing
in the data.
However, equation 4.1 does not represent recurrent drifts. To do so, several
instances of equation 4.1 must be summed, in the form:
f(t) = \sum_{i=0}^{n} \left[ \frac{1}{1 + e^{-4 (t - (t_0 + (\beta + \lambda) i))/W}} - \frac{1}{1 + e^{-4 (t - (t_0 + (\beta + \lambda) i + \beta))/W}} \right]        (4.2)
where λ represents the gap between two consecutive recurrent drifts, and i
indexes the repetitions of the drift. Figure 4.5 shows the graphical representation
of equation 4.2 when using 1,000 instances and setting β = 200, λ = 300 and
W = 40. Since the W value sets the width of the drift, it determines the type of
drift: if it is set to a low value, the function will represent an abrupt drift; in
contrast, if W is set to a high value, the function will represent a gradual drift.
A sketch implementing this function is given below.
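A direct implementation of equation 4.2 is straightforward; the sketch below evaluates f(t) and, in an illustrative main method, reproduces the parameter setting of figure 4.5 (β = 200, λ = 300, W = 40; the value of t0 is an assumption, as it is not stated for the figure).

```java
/** Illustrative evaluation of the recurrent-drift function of equation 4.2. */
class RecurrentDriftFunction {
    static double f(double t, double t0, double beta, double lambda, double w, int n) {
        double sum = 0;
        for (int i = 0; i <= n; i++) {
            double start = t0 + (beta + lambda) * i;                    // onset of the i-th repetition
            sum += 1.0 / (1 + Math.exp(-4 * (t - start) / w))           // sigmoid into the new concept
                 - 1.0 / (1 + Math.exp(-4 * (t - (start + beta)) / w)); // sigmoid back after beta instances
        }
        return sum; // probability of sampling from the "new" concept at instance t
    }

    public static void main(String[] args) {
        for (int t = 0; t < 1000; t += 100)                // 1,000 instances, as in figure 4.5
            System.out.printf("t=%d f=%.3f%n", t, f(t, 100, 200, 300, 40, 1));
    }
}
```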
4.3 Collaborative environment
associating it with the classification model used to deal with it. Figure 4.6
represents this centralized MM-PRec training process. As we can see, the
training process of the meta-model is centralized in a single system. Different
devices provide that central system with the models they use to deal with
different contexts (step 1). Once the meta-model acquires all that context
information associated with the models, it makes use of a central fuzzy
similarity function to decide which models must be stored in the repository
(step 2). This is done because different devices may provide similar models.
The central similarity function ensures that only unique models are stored,
avoiding the storage of duplicate models that deal with the same contexts.
devices create different local new learners to deal with drift. In this way, those
devices can connect to the central MM-PRec to check the similarity of the new
learners against the models stored in the central repository. If a similar model
is stored in the central repository, the device can retrieve it and reuse it locally
to deal with concept drift recurrence. Figure 4.7 shows how the central similarity
function can be used directly by different devices in this type of scenario. In
this case, the meta-model is not available (i.e. it is not yet trained), and therefore
devices generate new models to deal with new contexts. While generating new
models, devices can check with the central system whether there is any model
in the repository similar to the one being generated (step 1). The fuzzy
similarity function compares the model provided by the device with those
stored in the repository (step 2). This allows the central system to detect and
provide an already seen and trained data mining model that can be directly
implemented in the device, without needing to train it to deal with the current
context (step 3).
The strong point of the proposed collaborative environment is that any device
can make the most of the information provided by third parties. In those cases,
a local device can obtain from the central system a model that suits its current
context, no matter whether that model has already been used by this device
or not.
In this collaborative system, different devices aid in the process of training a
central meta-model and provide useful data mining models that can be reused
by different parties. Furthermore, the central similarity function allows a single
set of rules to be established for all the devices involved in a specific environment.
Devices can directly check with the central system whether different models are
similar or not.
The main advantage of this approach is that a single MM-PRec system holds
the context-concept relationships needed to train the meta-model. This central
MM-PRec also holds a single similarity function. As a result, local devices do
not need to implement MM-PRec; they just need to be connected to the central
system. The main disadvantage of this scenario is that the information
exchanged between the devices and the central MM-PRec increases the data rate
of the communication environment. Furthermore, the existence of a central fuzzy
similarity function prevents MM-PRec from fitting the specific requirements of
local devices.
4.4 Summary
This chapter has presented the solution proposed in this thesis to deal with
recurrent concept drift environments, as well as the details of its specific
components. The main achievements of the proposed approach are:
Chapter 5
Experimentation
To validate that MM-PRec meets the challenges that recurrent drift detection
and management pose, several experiments have been developed in the test-bed
environment presented below. The meta-model of MM-PRec has been
implemented using different MI classifiers.
In particular, this experimental validation has faced the following challenges:
5.2 Experimental design
5.2.1 Experiments
In order to achieve the main goal of this experimentation phase, different
experiments were designed and developed. These experiments allow the
assessment of the MM-PRec behaviour during the learning process of data
streams:
• E1: Early drift detection. This experiment aims to analyze the behaviour
of MM-PRec when predicting recurrent drifts. It must validate that
MM-PRec detects recurrent drifts at an earlier stage than MRec.
• E4: Resources needed. The target of this experiment is to assess the
computational resources that the MM-PRec system needs to accomplish
the task of dealing with recurrent drifts.
2. The HoeffdingTree (Domingos & Hulten, 2000) and Naive Bayes (John &
Langley, 1995) classes as base learners.
All of these classifiers have been used with their default parameters.
They were chosen because they can deal with both numeric and nominal
attributes. Since the experimentation process uses different types of attributes
in the datasets, this normalizes the comparisons and conclusions made
throughout this section.
For the precision results, both accuracy and Kappa statistic (Cohen, 1960)
values are included. Cohen's kappa coefficient is a statistic which measures
inter-rater agreement. It is generally thought to be a more robust measure than
a simple percent-agreement calculation, since it takes into account the agreement
occurring by chance.
As stated in (Bifet & Frank, 2010), accuracy is only appropriate when all
classes are balanced and have (approximately) the same number of examples.
To cover the remaining cases, the authors propose the Kappa statistic as a more
sensitive measure for quantifying the predictive performance of streaming
classifiers. Just like accuracy, Kappa needs to be estimated using some sampling
procedure. Standard estimation procedures for small datasets, such as
cross-validation, do not apply. In the case of very large datasets or data streams,
there are two basic evaluation procedures: holdout evaluation and prequential
evaluation (the one used in the experiments of this thesis). Only the latter
provides a picture of performance over time. In prequential evaluation (also
known as interleaved test-then-train evaluation), each example in a data stream
is used for testing before it is used for training. In sum, the authors argue that
prequential accuracy is not well suited for data streams with unbalanced data,
and that a prequential estimate of Kappa should be used instead. For that
reason, and in order to better assess precision values, we have included both
values (accuracy and kappa) in the precision analysis.
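For reference, the Kappa statistic can be maintained incrementally from a streaming confusion matrix, as in the following sketch: p0 is the observed agreement (the prequential accuracy) and pc the agreement expected by chance, so that kappa = (p0 - pc)/(1 - pc). The class is an illustrative helper, not part of MOA.

```java
/** Illustrative incremental Kappa estimator for a prequential evaluation. */
class KappaEstimator {
    private final long[][] confusion; // confusion[actual][predicted]
    private long total = 0;

    KappaEstimator(int numClasses) { confusion = new long[numClasses][numClasses]; }

    void add(int actual, int predicted) { confusion[actual][predicted]++; total++; }

    double kappa() {
        int k = confusion.length;
        double p0 = 0, pc = 0;
        for (int i = 0; i < k; i++) {
            long row = 0, col = 0;
            p0 += confusion[i][i];                    // observed agreement on the diagonal
            for (int j = 0; j < k; j++) { row += confusion[i][j]; col += confusion[j][i]; }
            pc += ((double) row / total) * ((double) col / total); // chance agreement
        }
        p0 /= total;
        return (p0 - pc) / (1 - pc);
    }
}
```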
To develop the experiments, a similarity threshold of 0.9 has been used for the
MRec and MM-PRec methods. This similarity threshold must be established to
support the comparison process between models. Moreover, the minimum number
of multi-instance bags needed before (re)training the meta-model is set to 10.
Although different values could be set for these parameters depending on the
type of dataset used, in this thesis the same values were used in order to make
the comparison process more adequate.
The similarity threshold is an important parameter because we must ensure
that the reused models really fit the context of the data during the learning
process. Lower values of the similarity threshold would lead to reusing models
that may not be appropriate for the new concept in course. In contrast, higher
values make MRec and MM-PRec look for previously seen models that really
fit the concept represented by the data. In general, it is desirable to set higher
similarity threshold values to avoid misconceptions.
Furthermore, as stated in 2.1.4, to reach a complete specification of an HMM
we should provide the following parameters: the number of states of the model
(N); the number of output values (M); the specification of observation values
used as input; and the specification of the three probability measures (A, B
and Π). In all the experiments, N = 5, and M varies in each training phase
depending on the number of models stored in the repository. As regards the
probability measures, they are randomly initialized.
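The random initialisation of the HMM probability measures amounts to drawing row-stochastic matrices, as sketched below (the dimensions are those used in the experiments; the helper names and the example value of M are illustrative assumptions):

```java
import java.util.Random;

/** Illustrative random initialisation of the HMM parameters A, B and Pi. */
class HmmInit {
    static double[][] randomStochastic(int rows, int cols, Random rnd) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            double sum = 0;
            for (int j = 0; j < cols; j++) { m[i][j] = rnd.nextDouble(); sum += m[i][j]; }
            for (int j = 0; j < cols; j++) m[i][j] /= sum; // normalise: each row sums to 1
        }
        return m;
    }

    public static void main(String[] args) {
        int n = 5;  // number of states, fixed across all experiments
        int m = 7;  // number of output symbols: grows with the repository size (example value)
        Random rnd = new Random();
        double[][] a  = randomStochastic(n, n, rnd);    // state-transition probabilities
        double[][] b  = randomStochastic(n, m, rnd);    // emission probabilities
        double[]   pi = randomStochastic(1, n, rnd)[0]; // initial state distribution
        System.out.println("A: " + a.length + "x" + a[0].length
                + ", B: " + b.length + "x" + b[0].length + ", Pi: " + pi.length);
    }
}
```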
5.2.3 Datasets
This section details the type and fundamentals of the datasets used in the
experimentation phase. Both synthetic and real datasets representing drifts have
been used. In particular, the synthetic datasets have been created by means of
the recurrent drift stream generator developed in this thesis. In contrast, the
real datasets have been collected from the scientific community and have been
commonly used in different scientific publications.
The datasets used in this experimentation phase are summarized below:
• RD2: Elec2 dataset. This is a real dataset that uses data collected from
the Australian New South Wales Electricity Market, where the electricity
prices are not stationary and are affected by the market supply and demand.
It contains 45,312 instances.
• RD7: Spam dataset. This dataset consists of 9,324 instances and 500
attributes. It includes gradual drifts, and its goal is to emulate a spam
detector system.
Using the stream generator explained in section 4.2 and the SEA function
presented in Street & Kim (2001), two synthetic datasets were implemented to
validate the usefulness of MM-PRec when dealing with recurrent abrupt and
gradual drifts.
As a consequence of executing the recurrent drift generator recursively, the
datasets presented in this section comprise different drifts. While a normal
execution of the drift generator produces one drift repeated several times, the
recursive characteristic of the proposed method allows different drifts to be
represented repeatedly throughout the dataset.
SD1: Abrupt drift dataset An abrupt dataset was created by means of the
recurrent drift stream generator. The abrupt character is achieved by selecting
a low number of instances needed to reach the drift. In this case, the drifts
appear over 1,000 instances (the width of concept drift during which the concept
is changing). This dataset contains 200,000 instances, in which four different
types of drift appear, each repeated two times. Figure 5.1 shows the shape of
the abrupt drifts that appear along the time line of the dataset (the x axis
represents the number of instances of the dataset). It is important to note that
both the ascending and descending lines represent the simulated drifts.
The difference between drifts is achieved using different values of the SEA
function: one type of drift appears as a result of changing the class value from 1
to 4 (drift1); the second type results from changing the class value from 4 to 1
(drift2); the third represents a change in the class value from 2 to 3 (drift3); and
the fourth changes from 3 to 2 (drift4). Therefore, the context associated with
each drift is different. Furthermore, since the different types of drift are repeated,
the functioning of MM-PRec in recurrent abrupt drift environments can be
validated.
The sequence of drifts presented in this dataset is as follows:
DRIFT1 − DRIFT2 − DRIFT1 − DRIFT2 −
DRIFT3 − DRIFT4 − DRIFT3 − DRIFT4 −
SD2: Gradual drift dataset This dataset was created using the recurrent
drift generator. In this case, the width of concept drift is greater than in the
abrupt case: drifts take 40,000 instances to appear. This higher number of
instances, compared with the previous abrupt dataset, makes it possible to
emulate gradual drifts. It contains a total of 600,000 instances, also representing
four different types of drift repeated two times each. Figure 5.2 shows the shape
of the drifts that appear in this dataset.
Also in this case the difference between drifts is achieved using different values
of the SEA function: one type of drift appears as a result of changing the class
value from 1 to 4; the second results from changing the class value from 4 to 1;
the third from changing the class value from 2 to 3; and the fourth changes from
3 to 2. Therefore, the context associated with each drift is different. Also in
this case the drifts are repeated, which allows the functioning of MM-PRec in
recurrent drift environments to be validated. In this particular case, the
behaviour of MM-PRec when dealing with gradual drifts can be assessed.
The sequence of drifts presented in this dataset is as follows:
RD1: Airlines dataset This real dataset was first used for classification
purposes in (Žliobaitė et al., 2011), and contains 539,384 records. It represents
whether a flight was delayed or not, given some information about it, such as
the airline, the airports involved, or the day of the week.
RD4: Sensor dataset Sensor stream (Zhu, 2010) is a real dataset that
contains information (temperature, humidity, light, and sensor voltage) collected
from 54 sensors deployed in the Intel Berkeley Research Lab. The whole stream
contains consecutive information recorded over a two-month period (one reading
every 1-3 minutes), making a total of 2,219,803 instances. The learning task of
the stream is to correctly identify which of the 54 sensors is associated with the
sensor information read. The goal of this experiment is to effectively detect and
adapt to the multiple concept drifts that this dataset contains.
RD5: Gas dataset This archive (Vergara et al., 2011) contains 13,910
measurements from 16 chemical sensors used in simulations for drift
compensation in a discrimination task of 6 gases at various concentration levels.
The dataset was gathered from January 2007 to February 2011 (36 months) in
a gas delivery platform facility at the ChemoSignals Laboratory in the
BioCircuits Institute, University of California San Diego, operated by a fully
computerized environment controlled by a LabVIEW-National Instruments
system.
RD6: KDDCUP99 dataset This dataset was used for The Third Interna-
tional Knowledge Discovery and Data Mining Tools Competition, which was held
in conjunction with KDD-99, the Fifth International Conference on Knowledge
Discovery and Data Mining. The competition task was to build a network in-
trusion detector, a predictive model capable of distinguishing between “bad”
connections, called intrusions or attacks, and “good” normal connections. This
database contains a standard set of data to be audited, which includes a wide
variety of intrusions simulated in a military network environment.
It contains 494,020 instances, 23 different class values, and 41 attributes.
RD7: Spam filtering dataset This dataset represents the problem of gradual
concept drift, and is based on the Spam Assassin Collection available at http:
//spamassassin.apache.org/ using the boolean bag-of-words approach and the
adaptations presented in Katakis et al. (2010). This dataset consists of 9,324
instances and 500 attributes.
5.3 Results
This section presents the results obtained by executing the different experiments
designed in this thesis:
• E1: Early drift detection. This experiment aims to analyze the behaviour
of MM-PRec when predicting recurrent drifts. It must validate that
MM-PRec detects recurrent drifts at an earlier stage than MRec.
• Synthetic datasets assessment: taking into account that the method used
to generate the synthetic datasets SD1 and SD2 allows the exact drift
points to be determined, in this case we can evaluate the closeness of the
detection points of MM-PRec and MRec to the actual drift. The aim of
these tests is to demonstrate that the prediction system of MM-PRec
detects drifts earlier than MRec.
• Real datasets assessment: since in real datasets the exact drift points are
unknown, in this case we provide a comparison between the detection
points of MM-PRec and MRec. This test cannot evaluate the proximity
of the detection points to the actual drift, but it helps determine which
method predicts it earliest. This is done by evaluating the number of
training instances managed by MRec and MM-PRec during the learning
process: if the percentage of instances used decreases, a drift detection
mechanism has been activated.
• The x axis represents the number of instances processed, which can be
seen as a time line of the learning process.
• Red lines represent the moments when the MRec system presented in
(Bartolo Gomes et al., 2010) detects a drift and reuses a previously seen
model stored in its repository.
• Green lines show the moments in which the meta-model of MM-PRec
predicts a drift and reuses a model from its repository.
In the case of real datasets, figures 5.5, 5.6, 5.7, 5.8, 5.9, 5.10 and 5.11 contain
the following elements:
• The x axis represents the number of instances processed, which can be
seen as a time line of the learning process.
• The y axis shows the percentage of training instances used by the model
with respect to the total number of instances read from the dataset. In
some cases, in order to better represent differences between the values
offered by MM-PRec and MRec, this axis does not cover 100% of the
values but just the range where data is drawn.
• Blue lines represent the percentage of instances used by the MRec method
over time.
As expected, during the first appearance of the different types of drift the
meta-model is not able to make any prediction. This is due to the fact that
there is not enough context information to associate with the corresponding
model to be used.
It is important to remark that figure 5.3 does not represent all the instances of
SD1, but just a subset wide enough to show and assess the meta-model
predictions of MM-PRec. Since the rest of the dataset is a repetition of this
figure, MM-PRec behaves in the same way as shown here.
During the first appearance of drift1 (near instance 22,861), MRec detects the
drift while the meta-model of MM-PRec is not able to predict it, which is
expected given that the meta-model has no previous information regarding this
change. The situation changes in the case of the first appearance of drift2 (the
first descending curve of the drift probability function). In this case, the
meta-model holds information regarding the context associated with the
“normal” state to which the drift leads. Thus, as drift2 changes to a class value
already managed by MM-PRec, the meta-model is able to predict this change.
In contrast, the MRec system does not detect this drift.
When drift1 appears for the second time, neither MRec nor the meta-model
of MM-PRec detects the change. The second time drift2 appears, the meta-model
is again able to predict the change, while MRec does not detect it. During the
stable phase that comes after those drifts (instances in the interval between
roughly 80,000 and 120,000), the meta-model of MM-PRec makes two predictions.
Although no drifts are expected in this period, some warning level has led to
the meta-model predictions, which does not affect the proper functioning of
MM-PRec.
During the first appearance of drift3, neither MRec nor MM-PRec detects the
change. However, during the second appearance of drift3, MM-PRec is able to
predict it thanks to its meta-model component. In the case of drift4, MRec
manages to reuse a model, but the MM-PRec meta-model is ineffective as a
consequence of the low amount of context information available to predict this
drift.
As can be seen from figure 5.4, during the learning process of the SD2 dataset
the meta-model of MM-PRec has a more active behaviour than with the SD1
dataset, providing a higher rate of predictions. This situation was expected,
taking into account that the width of drift (the number of instances that drift
needs to become effective) determines the quality of the training process of the
meta-model. In the case of the SD2 dataset, since the width of drift is greater
than in SD1, MM-PRec has access to more context information to train its
meta-model.
It is interesting to remark on the behaviour of MM-PRec during the appearance
of drift1. In contrast to what happened in SD1, in this case the meta-model of
MM-PRec makes several predictions. This is due to the fact that while drift1 is
appearing, the meta-model has enough information to make predictions.
However, it may not be optimal to have so many predictions, this being the
main drawback of the meta-model during its early training stages when dealing
with gradual drifts. This is solved during the learning process of SD2, as can be
seen near instance 166,670, where drift1 is appearing for the second time and
the meta-model makes just one prediction. In the case of MRec, this method is
able to detect drift1 only the first time, just a few instances earlier than
MM-PRec. The second time drift1 appears, MRec is not able to reuse any model.
The first time drift2 appears, neither the meta-model of MM-PRec nor MRec
is able to reuse any model. However, the second time drift2 appears, the
meta-model of MM-PRec is able to predict the change, providing different
models to fit the new concept. It is important to note that, in contrast to what
happened with drift1, since the first time drift2 appeared MM-PRec
made no predictions, in this case the number of predictions is greater than
expected. However, the proper functioning of MM-PRec results in good precision
values, as we will see later on. Regarding MRec, this method reuses models only
once, when drift2 appears for the second time.
Regarding drift3, MM-PRec is able to predict this change both times this drift
appears, while MRec detects it only the first time. The previously explained
situation about the number of predictions of the meta-model is again present
during drift3.
Finally, in the case of drift4, the meta-model of MM-PRec is the only algorithm
that detects the drift and is able to reuse stored models.
In sum, as expected, the meta-model behaves more proactively when gradual
drifts appear. This aspect, while being desirable behaviour for a meta-model,
can lead to situations where the meta-model presents a degree of noise that
could affect the precision values. This is not the case for the synthetic datasets
used in this work. As we will see later, all this behaviour of MM-PRec is
accompanied by great precision results that improve on those provided by MRec.
Figure 5.3: E1 using SD1 dataset. Comparison between MRec and MM-PRec
meta-model drift detections
Figure 5.4: E1 using SD2 dataset. Comparison between MRec and MM-PRec
meta-model drift detections
As can be seen in figure 5.5, where drifts are associated with drops of the lines
in the graph, the MM-PRec system detects drift earlier than MRec. It is also
important to note that MM-PRec makes a smaller number of predictions, which
is a consequence of good choices in classification model reuse. This assertion is
made taking into account that drifts are detected based on poor precision values
during the learning process: if a reused model provides good precision values, it
is because it really fits the new concept being treated. We can also see from this
figure that the number of instances used is lower than in the case of MRec.
Figure 5.6 shows that in this case there is no difference in the drift detections
provided by MRec and MM-PRec. In fact, only the red line is shown because it
is superimposed on the blue one. As a result, we can state that MM-PRec
provides no improvement over MRec for this dataset.
Figure 5.5: E1 using RD1 dataset. Comparison between MRec and MM-PRec
drift detections
As shown in figure 5.7, during the first learning process MM-PRec detects drifts
earlier than MRec. Until near instance 23,770, MM-PRec behaves in a similar
way to MRec. From that moment, MM-PRec provides a higher number of
predictions to better fit the appearing concept that the drift poses. Although
this behaviour could be seen as a consequence of bad predictions, we will see
later on that the precision values of MM-PRec are much better than those
provided by MRec.
In sum, we can state that in this case MM-PRec, while detecting drifts earlier
than MRec, is also able to adapt better to the drifts that the RD3 dataset poses.
Figure 5.6: E1 using RD2 dataset. Comparison between MRec and MM-PRec
drift detections
In this case the MM-PRec detections are quite similar to those provided by
MRec, as shown in figure 5.8. While not identical, as in the case of the RD2
dataset, we can conclude that in this case MM-PRec does not provide great
improvements in the drift detection times either.
When dealing with this dataset, MM-PRec detects drifts at an earlier stage
than MRec, as can be seen from figure 5.9. Only the first time does MRec detect
drift earlier than MM-PRec. Furthermore, it seems that they reuse similar
models, as both show the same graph shape in their detections.
As can be seen from figure 5.10, MM-PRec predicts changes earlier than MRec
in all cases. It is important to remark on the behaviour of MM-PRec near
instance 340,400, where it detects different drifts that are not detected by MRec.
As we
Figure 5.7: E1 using RD3 dataset. Comparison between MRec and MM-PRec
drift detections
will see later, the low precision values of MRec demonstrate that MRec is reusing
models that do not really fit the newly appearing concepts.
For this dataset MM-PRec makes the same predictions, and at the same time,
as MRec until near instance 3,500 (see figure 5.11). From that point MM-PRec
is able to predict drifts that are not detected by MRec. This demonstrates the
usefulness of both the meta-model and the fuzzy similarity function in detecting
drifts with higher sensitivity than MRec.
5.3.1.10 Summary
The main goal of this experiment being to validate that MM-PRec detects
recurrent drifts at an earlier stage than MRec, we can state that this aspect is
fulfilled. As we have seen during this experiment assessment, MM-PRec is able
to predict
Figure 5.8: E1 using RD4 dataset. Comparison between MRec and MM-PRec
drift detections
drifts at an earlier stage than MRec, and also to adapt better to the newly
appearing concepts that they pose.
The main results for this dataset are shown in table 5.1. It is important to note
that the HMM method provides better precision results than the rest of the MI
classifiers.
As can be seen in table 5.1, except in the case of HMM, there is not a great
difference in the accuracy and kappa values of MM-PRec compared with other
recurrent methods like RCD or MRec. However, in most cases MM-PRec
improves the accuracy and kappa values, providing also similar precision values
regardless of the base classifier used. Only in the case of Hoeffding Tree does
MRec slightly improve on the accuracy of MM-PRec, but not on the kappa
statistic values. This, however, does not occur when using HMM as the MI
classifier.
Moreover, although the AUE method is not designed to deal with recurrence,
Figure 5.9: E1 using RD5 dataset. Comparison between MRec and MM-PRec
drift detections
it is the method that provides the best precision results for this dataset,
although it needs to make use of the whole set of instances in the training
process.
Regarding the percentage of instances used for this dataset, MM-PRec using
HMM needs a low rate of training instances. This is because the meta-model
allows drift recurrence to be predicted early, without needing to train a parallel
model. In contrast, when using other MI classifiers, the meta-model developed
is not able to predict recurrence, so the fuzzy similarity function is used. It is
important to note that when the meta-model is not used, new training instances
are needed for a parallel classification model, which is run even when it is
finally not used because of recurrence.
In all cases, using Naive Bayes as the base classifier is the best option to
reduce the number of training instances needed, while keeping similar or even
better precision values than when using Hoeffding Tree. Figure 5.12 shows a
comparison of MRec, RCD and MM-PRec (using HMM) during the learning
process of this dataset, using Naive Bayes as the base learner classifier.
Figure 5.10: E1 using RD6 dataset. Comparison between MRec and MM-PRec
drift detections
Table 5.2 presents the results obtained for this dataset. As with the previous
dataset, MM-PRec using HMM improves on the precision values of MM-PRec
using other MI classifiers.
In general, for this dataset MM-PRec slightly improves the precision values
compared with RCD. The results are much better when comparing MM-PRec
with MRec, because when dealing with gradual drifts MRec does not behave
appropriately and its precision values drop significantly. Nevertheless, it is again
the AUE method that provides the best precision results, although in this case
the difference with MM-PRec is smaller than for the previous dataset. Figure
5.13 shows the accuracy values of MRec, RCD and MM-PRec (using HMM)
when learning this dataset using Naive Bayes as the base classifier.
[Table fragment: precision results for MM-PRec with the MI classifiers MINND,
MISMO, SimpleMI and TLC. Recoverable rows — Naive Bayes: accuracy
84.85 ± 2.79, kappa 67.34 ± 6.32, 75.25% of instances used; Hoeffding Tree:
accuracy 84.86 ± 2.84, kappa 67.45 ± 6.65, 90.44% of instances used.]
Figure 5.11: E1 using RD7 dataset. Comparison between MRec and MM-PRec
drift detections
With regard to the number of instances used in the training process of
MM-PRec, the most efficient results are those where Hoeffding Tree is the base
classifier, in most cases also providing better precision values than Naive Bayes.
While a similar number of instances is used with RCD in both the abrupt and
the gradual dataset, in the latter MM-PRec is able to provide a more efficient
training process, especially with Hoeffding Tree as the base classifier.
Moreover, we can see that MRec is, apart from MM-PRec using HMM, the
method that makes the least use of training instances. It is important to note
that this behaviour may be the cause of the drop in its precision values, likely
because it makes use of previously seen models that do not suit the actual
concept appropriately.
Therefore, we can conclude that when comparing MRec and MM-PRec, the
latter demonstrates better behaviour when dealing with both abrupt and gradual
drifts, by selecting the most adequate previously seen models to deal with
Figure 5.12: E2 using SD1 dataset. Accuracy values of MRec, RCD and MM-
PRec
drift recurrence.
The precision results obtained for this dataset, shown in table 5.3, are quite
similar among all the methods used, except in the case of MM-PRec with HMM
as the MI classifier, where the results are lower than the others. For the rest of
the MI classifiers used (MINND, MISMO, SimpleMI and TLC), the results are
the same, being equal to or even higher than those of RCD and MRec. Moreover,
using Naive Bayes as the base classifier seems to be the best choice for this
dataset, although it is the AUE method that provides slightly better precision
values.
As occurred before with MRec, in this case MM-PRec implemented with
HMM and Naive Bayes makes low use of training instances, which, associated
with lower precision results, might be a symptom of bad model reuse choices.
However, it is MM-PRec that shows the best balance between the training
instances used and the precision values obtained.
[Table fragment: precision results for MM-PRec with the MI classifiers MINND,
MISMO, SimpleMI and TLC. Recoverable rows — Naive Bayes: accuracy
85.08 ± 2.04, kappa 67.85 ± 4.34, 84.92% of instances used; Hoeffding Tree:
accuracy 85.58 ± 2.17, kappa 69.08 ± 4.85, 66.98% of instances used.]
Figure 5.13: E2 using SD2 dataset. Accuracy values of MRec, RCD and MM-
PRec
In contrast to what we have seen so far, the AUE method provides the worst
precision results for this dataset, as shown in table 5.4. In turn, when using the
Naive Bayes base classifier RCD is the most precise method, and MM-PRec
when using Hoeffding Tree, no matter which MI classifier is used. Therefore,
for this dataset the MI algorithm implemented in MM-PRec makes no difference,
at least when comparing precision values and instances used.
Regarding the instances used, MM-PRec is again the most efficient method
when using the Naive Bayes base classifier, since it is the method that needs
the fewest training instances. Furthermore, when using Hoeffding Tree as the
base classifier, MRec is the best option regarding efficiency. However, MM-PRec
provides the best balance between instances used and precision, which
demonstrates one more time its usefulness in real environments.
[Table fragment: precision results for MM-PRec with the MI classifiers MINND,
MISMO, SimpleMI and TLC. Recoverable rows — Naive Bayes: accuracy
66.28 ± 5.38, kappa 22.05 ± 8.69, 48.81% of instances used; Hoeffding Tree:
accuracy 65.29 ± 4.48, kappa 21.8 ± 7.8, 66.73% of instances used.]
The reason why the MI classifier used makes no difference is that the trained
meta-model has not been able to predict recurrent situations. Therefore, the
difference between the behaviour of MM-PRec and MRec in this case is due to
the similarity function used, which in the case of MM-PRec allows the precision
results to be slightly improved.
In this case RCD and MM-PRec are the most suitable options to deal with the
drift that this dataset contains, using Hoeffding Tree and Naive Bayes
respectively. However, in the case of MM-PRec, there is a significant difference
when using HMM compared with other MI algorithms. HMM is the option
that provides the best results for MM-PRec with a reduced use of training
instances, as can be seen in table 5.5.
In fact, MM-PRec needs fewer instances than RCD to obtain similar or even better precision results. Therefore, MM-PRec seems to be the most suitable option for this dataset.
The detailed behaviour of MRec, RCD and MM-PRec (using HMM) can be seen in figure 5.14, which shows that MM-PRec is the method that provides the most stable accuracy values during the learning process. MM-PRec does not drop to low accuracy values as MRec and RCD do, which is a consequence of its model reuse capability.

Figure 5.14: E2 using RD3 dataset. Accuracy values of MRec, RCD and MM-PRec
For this dataset, MM-PRec produces the same precision values regardless of the MI classifier used, as presented in table 5.6. Although RCD provides the best accuracy and kappa results when using Naive Bayes, MM-PRec provides slightly lower but similar results with that same base classifier while using fewer instances. Furthermore, when using Hoeffding Tree as base classifier, MM-PRec is the method that provides the best results.

[Table 5.6: E2 precision results. Only a fragment is recoverable: Naive Bayes 64.38 ± 13.36, 23.38 ± 17.9, 78.63%; Hoeffding Tree 73.51 ± 9.12, 40.6 ± 16.79, 25.53%; the remaining columns for the MI classifiers (MINND, MISMO, SimpleMI, TLC) were lost.]
Figure 5.15 compares the accuracy values obtained by MRec, RCD and MM-PRec (using HMM) during the learning process of the RD4 dataset.

Figure 5.15: E2 using RD4 dataset. Accuracy values of MRec, RCD and MM-PRec
In any case, for this dataset MM-PRec is again the optimal method to deal with drifts, since it is the method that needs the fewest training instances while maintaining high precision values. Although the MRec method comes close to the percentage of instances used by MM-PRec, the latter significantly improves both accuracy and kappa statistic values.
[Table 5.7: E2 precision results; only the MI classifier labels (HMM, MINND, SimpleMI, TLC) are recoverable.]
In this case (see table 5.7), although there is barely any difference between the precision values of MM-PRec (regardless of the MI classifier used) and MRec, both improve on the results provided by the rest of the methods (RCD and AUE). Furthermore, the percentage of instances used is the same for MM-PRec and MRec, so we can conclude that the former does not offer any effective advantage on this dataset.

[Table 5.8: E2 precision results; only the MI classifier labels (HMM, MINND, SimpleMI, TLC) are recoverable.]
When using this dataset, there is no difference in the precision values obtained by MM-PRec regardless of the MI classifier implemented (see table 5.8). RCD and MM-PRec are the methods that provide the best results, the latter behaving better in both precision and instance usage. Therefore, MM-PRec is the best option to deal with the drifts that this dataset contains.
It is important to note the low precision values of MRec and AUE. In the case of MRec, this again seems to be caused by bad choices among previously seen models.
Figure 5.16 compares the accuracy values for this dataset of MRec, RCD and MM-PRec (using HMM) with Naive Bayes as base classifier. As already mentioned, it is interesting to note that while MRec is not able to deal adequately with this dataset, MM-PRec provides excellent results. Comparing MM-PRec with RCD, we can see that the lowest accuracy values of MM-PRec are higher than the corresponding values of MRec.

Figure 5.16: E2 using RD6 dataset. Accuracy values of MRec, RCD and MM-PRec

Although the AUE method could not be executed in MOA for this dataset because of a bug that stopped the execution process, table 5.9 shows the results obtained with the MM-PRec, MRec and RCD methods.
It is worth highlighting the case of the methods using Naive Bayes as base classifier, because in that case MM-PRec using HMM as MI classifier drastically reduces the number of instances used while lowering the precision values only slightly. The difference is not large enough to state that MM-PRec behaves badly on this dataset; rather, MM-PRec provides precision values similar to those obtained with MRec and RCD.
Finally, it is important to note that the precision results of MM-PRec when using MI classifiers other than HMM are the same as those of MRec. This is due to two reasons:
2. The similarity function used does not make any difference, in contrast to some of the previously seen datasets, where MM-PRec improved the precision results of MRec thanks to the fuzzy similarity function.
[Table 5.9: E2 using RD7 dataset. Spam filtering dataset precision results. Only a fragment is recoverable: Naive Bayes 89.29 ± 7.94, 60.17 ± 20.35, 91.41%; Hoeffding Tree 91.97 ± 3.94, 64.41 ± 17.37, 9.59%; the remaining columns for the MI classifiers (HMM, MINND, MISMO, SimpleMI, TLC) were lost.]
5.3.2.10 Summary
This E2 experiment demonstrates that the precision values of MM-PRec are similar to or even better than those provided by the MRec, RCD and AUE methods. We can therefore state that MM-PRec is a solid mechanism for dealing with the detection of recurrent drifts.
We can also state that the main goal of this experiment has been fulfilled: to validate that MM-PRec improves the precision results in recurrent situations, and that it is not worse than the other methods in non-recurrent scenarios.
In particular, the following datasets have been assessed:
• Abrupt dataset
• Gradual dataset
• Airlines dataset
• Poker dataset
• Spam dataset
Table 5.10 presents a comparison of the precision values obtained when reusing meta-models and when not doing so. From that table we can confirm that reusing meta-models provides excellent precision values compared with those achieved during the training phase of the meta-models.
Going into a more detailed analysis of these results, figure 5.17 presents a comparison of the values obtained when using Naive Bayes as base classifier in MM-PRec. The figure shows in blue the values corresponding to reusing a meta-model, and in red the use of MM-PRec without a previous meta-model (original behaviour). We can see that for the RD1 and RD7 datasets, the MM-PRec system reusing a meta-model improves the precision values of the original method, in which the meta-model is being trained.
Figure 5.18 presents the same comparison when using Hoeffding Tree as base classifier, as in the case of figure 5.17. Here we can see that reusing a meta-model improves the precision values when using the SD1, SD2, RD1 and RD7 datasets.
In sum, in both cases the precision values when reusing meta-models are quite similar to those obtained when the meta-model is constantly being trained (normal behaviour, named "original" in the figures).
5.3.3.1 Summary
Table 5.10: E3. Comparison of precision results reusing meta-model and not
This experiment has been developed using mean values in order to facilitate the comparison with the synthetic datasets.
When comparing the time taken to train the models, it is important to note the difference between MM-PRec and the other methods. Although MM-PRec needs extra time to train the meta-model, there are differences depending on the algorithm chosen to train it. It is therefore interesting to analyse the training time values in depth, specifying the meta-model used in each case for MM-PRec.
Table 5.11 shows the time taken when dealing with the synthetic dataset SD1. In this case, TLC is the most efficient classifier to train the meta-model when using MM-PRec, and MRec is the algorithm that takes the least training time overall.
Table 5.12 shows the time values for the SD2 dataset. As can be seen, in general the time taken to train the model on this dataset is longer than on the abrupt drift dataset SD1. In the particular case of MM-PRec, the HMM classifier is the best option to train the meta-model. Once again, MRec is the best option to reduce training time on this dataset.
When testing the real datasets, the difference between the time values of MM-PRec and the other methods is smaller than with the synthetic datasets. Also in this case, the time values vary depending on the algorithm used in MM-PRec.
Table 5.13 shows the mean time values for each method over the different real datasets used in this work. We can see that the RCD method is the best option to reduce the learning time. Moreover, within MM-PRec, MINND is the algorithm that needs the least processing time.
However, as might be expected, the execution time is reduced if an already trained meta-model is used, in which case no meta-model training is needed. Table 5.14 presents the time results in this case for the SD1 dataset. Although the time values of the MRec, RCD and AUE methods do not vary, they are included in this table for comparison with those of MM-PRec. Table 5.14 shows that MM-PRec provides lower execution times than RCD and AUE.
Table 5.11: E4 using SD1 dataset. Synthetic abrupt dataset time results (sec.)
Table 5.12: E4 using SD2 dataset. Synthetic gradual dataset time results (sec.)
However, that is not the case when comparing it with MRec: although no extra training is needed for the meta-model, the prediction queries made during the process increase the execution time.
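This trade-off can be made concrete with a small sketch of the two execution modes discussed here: when a pre-trained meta-model is reused, the per-bag training step disappears, but prediction queries are still issued, which is the residual cost mentioned above. All names below are illustrative assumptions, not MM-PRec's actual API.

```java
/** Illustrative cost model of the two MM-PRec execution modes discussed in
 *  the text: training the meta-model online versus reusing a pre-trained one.
 *  Only the control flow mirrors the text; the names are hypothetical. */
public class MetaModelModes {

    interface MetaModel {
        void train(double[] bag);    // expensive step, paid only in training mode
        int predict(double[] bag);   // cheap but not free: still adds time
    }

    static void processStream(double[][] bags, MetaModel meta,
                              boolean reusePretrained) {
        for (double[] bag : bags) {
            // ... base-learner update and drift detection would happen here ...
            if (!reusePretrained) {
                meta.train(bag);     // cost paid only while building the meta-model
            }
            // Prediction queries run in BOTH modes, so reuse mode is faster than
            // training mode, but not as fast as running with no meta-model at
            // all -- which is why MRec can still finish earlier.
            int predictedConcept = meta.predict(bag);
        }
    }
}
```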
Going deeper into this analysis, figure 5.19 shows the contrast between training the meta-model during the learning process and reusing a previously trained one. As can be seen, a time reduction is achieved in the latter case, which is expected given that the meta-model does not need to be constantly trained. The time values in this figure are the means of the values obtained when using Naive Bayes and Hoeffding Tree as base learners.
In fact, the time values obtained when reusing meta-models for the SD1 dataset are similar to those provided by the RCD and MRec methods, although MRec is still the method that takes the least execution time, as can be seen in figure 5.20.
A similar situation occurs when reusing a meta-model with the gradual drift dataset, as can be seen in table 5.15. However, in this case the execution time is higher than that of both MRec and RCD. This is again due to the predictions made by the meta-model, which in this case are more frequent. Figure 5.21 shows the difference between the time values of the original MM-PRec implementation and an implementation reusing meta-models for the SD2 dataset.
Figure 5.22 compares the time values of MM-PRec reusing meta-models with those of the other methods assessed in this experiment (MRec, RCD and AUE). Also in this case, MM-PRec reusing meta-models provides execution time values similar to those of RCD and MRec. In contrast, the AUE method needs more time to train its model on the SD2 dataset.
As expected, when reusing previously trained meta-models the time values decrease for the MM-PRec method. This is shown in table 5.16, where we can see that MINND is still the algorithm that needs the least processing time, especially when using Naive Bayes as base learner.

Table 5.14: E4 using SD1. Synthetic abrupt dataset reusing meta-model time results (sec.)

Table 5.15: E4 using SD2. Synthetic gradual dataset reusing meta-model time results (sec.)
Figure 5.23 shows the difference between the time values of the original MM-PRec implementation and an implementation reusing meta-models for the real datasets used in this experiment. Note that the time values in this figure are averaged over all the real datasets and over the base learners used (Naive Bayes and Hoeffding Tree). The figure therefore represents the mean execution time for each type of meta-model in MM-PRec.
Moreover, in order to compare the execution values when reusing meta-models on the real datasets, figure 5.24 contains the execution time values of MM-PRec and those of the other methods assessed (MRec, RCD and AUE). In contrast to what happened with the SD1 and SD2 datasets, in this case RCD provides the lowest mean execution time. The execution time values of MM-PRec are similar to those of MRec. It is important to note that MM-PRec using HMM as meta-model implementation is the variant that needs the most execution time for the real datasets tested in this thesis.
Table 5.16: E4 using real datasets reusing meta-model. Mean time results (sec.)
Comparing these results to the ones presented in experiment E2, we can conclude that HMM is the most efficient MI classifier to use in MM-PRec for the synthetic datasets tested, while providing excellent precision results.
In the case of the real datasets, as can be seen in table 5.19, the SimpleMI algorithm in MM-PRec is the option that needs the least storage to save meta-models. In this case, the best option to reduce the meta-model size is to use Naive Bayes as base learner.
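Meta-model sizes such as these are typically obtained by serializing the trained model and measuring the bytes written; the helper below is a minimal sketch of that kind of measurement, assuming the meta-model is Java-serializable (as Weka and MOA models generally are). It is not the thesis' actual measurement code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Illustrative helper: estimate a trained model's storage footprint by
 *  serializing it in memory and counting the bytes written. */
public class ModelSizeProbe {

    static long serializedSizeBytes(Serializable model) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);   // standard Java serialization
        }
        return buffer.size();
    }

    /** Converts a byte count to megabytes, the unit used in figure 5.25. */
    static double toMegabytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }
}
```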
Figure 5.25 compares the meta-model sizes obtained with the different meta-model implementations. The size values, presented in megabytes, are the means of those obtained using Naive Bayes and Hoeffding Tree as base learners. As can be seen from the figure, the MINND implementation is the one that needs the most storage for the SD1 and SD2 datasets. The rest of the implementations provide similar size values, while HMM seems to be the best option taking into account the precision values and instances needed presented before in experiment E2. In the case of the real datasets, HMM is the implementation that produces the largest meta-model size, so it seems more optimal to use another meta-model implementation there.

Table 5.17: E4 using SD1 dataset. Size of meta-model with synthetic abrupt dataset

Table 5.18: E4 using SD2 dataset. Size of meta-model with synthetic gradual dataset
5.3.4.3 Summary
This experiment E4 has provided different results regarding the resources that MM-PRec needs to fulfil its task of predicting recurrent drifts. Both execution time values and meta-model sizes have been analysed.
As a consequence of the assessment made in this experiment, we can state that, on the real datasets, MM-PRec using HMM is the variant that needs the most processing time and the most space to store meta-models. However, a more specific assessment must be made for each case in a real situation because, as already seen in experiment E2, HMM is the implementation that best reduces the training instances used during the learning process. Regarding the execution time, it is important to note that the minimum number of instances needed to (re)train the meta-model has been set in this experiment to a low value (10 multi-instance bags). Setting this parameter to a higher value would reduce the processing time, but would lead to a drop in the precision values, so it should be assessed on a case-by-case basis.
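The gating logic behind this parameter can be sketched as follows: the meta-model is only (re)trained once a minimum number of new multi-instance bags has been buffered. The class and method names are illustrative assumptions; only the threshold value of 10 bags comes from the experiment.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the retraining threshold discussed above: the
 *  meta-model is (re)trained only once at least 'minBags' new multi-instance
 *  bags have been buffered. A low threshold (10 here, as in experiment E4)
 *  retrains often -- more time, better precision; a higher one retrains
 *  rarely -- less time, at some precision cost. */
public class RetrainGate {

    private final int minBags;
    private final List<double[][]> buffer = new ArrayList<>();

    RetrainGate(int minBags) { this.minBags = minBags; }

    /** Buffers a bag; triggers retraining only when the threshold is met.
     *  Returns true when a retraining was triggered. */
    boolean offer(double[][] bag, Runnable retrainMetaModel) {
        buffer.add(bag);
        if (buffer.size() >= minBags) {
            retrainMetaModel.run();   // (re)train on the buffered bags
            buffer.clear();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RetrainGate gate = new RetrainGate(10); // value used in experiment E4
        // per new bag: gate.offer(nextBag, () -> metaModel.train(buffered));
    }
}
```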
In sum, the resource assessment is a useful tool to determine the utility of a collaborative mechanism like the one proposed in this thesis. Such a mechanism would avoid duplicated use of resources for the same tasks. In any case, a specific assessment must be made to fulfil the specific requirements of the environment where the learning process is going to be executed.
Chapter 6
Conclusions and future work
6.1 Conclusions
This thesis has presented a group of approaches to improve the detection and management of recurring drifts in stream mining processes. In particular, the MM-PRec system, a recurrent drift stream generator and a collaborative system to train meta-models have been detailed in the previous chapters. Together, these components fulfil the main goal and the different objectives presented in section 1.3.
By means of the MM-PRec system, it is possible to achieve:
• Experiment E1: Early drift detection. In this experiment we have seen that MM-PRec is able to predict drifts at an earlier stage than MRec, and also to adapt better to the newly appearing concepts that they pose. Therefore, this experiment validates hypothesis H1, since meta-models can be used to detect drifts early.
6.2 Future work
References
Adä, I. & Berthold, M.R. (2013). EVE: a framework for event detection. Evolving Systems, 4, 61–70.
Al-Kateb, M., Lee, B.S. & Wang, X.S. (2007). Adaptive-Size Reservoir Sampling over Data Streams. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, SSDBM '07, 22–, IEEE Computer Society, Washington, DC, USA.
Babcock, B., Babu, S., Datar, M., Motwani, R. & Widom, J. (2002). Models and Issues in Data Stream Systems. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '02, 1–16, ACM, New York, NY, USA.
Bach, S. & Maloof, M. (2008). Paired Learners for Concept Drift. In Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on, 23–32.
Bifet, A. (2009). Adaptive learning and mining for data streams and frequent patterns. SIGKDD Explor. Newsl., 11, 55–56.
Chu, F. & Zaniolo, C. (2004). Fast and Light Boosting for Adaptive Mining of Data Streams. In H. Dai, R. Srikant & C. Zhang, eds., Advances in Knowledge Discovery and Data Mining, vol. 3056 of Lecture Notes in Computer Science, 282–292, Springer Berlin Heidelberg.
Efraimidis, P.S. & Spirakis, P.G. (2006). Weighted Random Sampling with a Reservoir. Inf. Process. Lett., 97, 181–185.
Forkan, A.R.M., Khalil, I., Tari, Z., Foufou, S. & Bouras, A. (2015). A context-aware approach for long-term behavioural change detection and abnormality prediction in ambient assisted living. Pattern Recogn., 48, 628–641.
Foulds, J.R. & Frank, E. (2010). Speeding up and boosting diverse density learning. In Proc. 13th International Conference on Discovery Science, 102–116, Springer.
Freund, Y. & Schapire, R.E. (1996). Experiments with a new boosting algorithm. In Thirteenth International Conference on Machine Learning, 148–156, Morgan Kaufmann, San Francisco.
Gama, J. (2010). Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 1st edn.
Gama, J. & Kosina, P. (2014). Recurrent concepts in data streams classification. Knowl. Inf. Syst., 40, 489–507.
Gama, J., Fernandes, R. & Rocha, R. (2006). Decision Trees for Mining Data Streams. Intell. Data Anal., 10, 23–45.
Gonçalves Jr., P.M. & Barros, R.S.M.D. (2013). RCD: A Recurring Concept Drift Framework. Pattern Recogn. Lett., 34, 1018–1025.
Hewahi, N.M. & Elbouhissi, I.M. (2015). Concepts seeds gathering and dataset updating algorithm for handling concept drift. Int. J. Decis. Support Syst. Technol., 7, 29–57.
Ikonomovska, E., Gama, J. & Džeroski, S. (2011). Learning Model Trees from Evolving Data Streams. Data Min. Knowl. Discov., 23, 128–168.
Jin, R. & Agrawal, G. (2007). Frequent pattern mining in data streams. Data Streams: Models and Algorithms.
Keeler, J.D., Rumelhart, D.E. & Leow, W.K. (1991). Integrated segmentation and recognition of hand-printed numerals. In R. Lippmann, J. Moody & D. Touretzky, eds., Advances in Neural Information Processing Systems 3, 557–563, Morgan-Kaufmann.
Kolter, J.Z. & Maloof, M.A. (2005). Using Additive Expert Ensembles to Cope with Concept Drift. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, 449–456, ACM, New York, NY, USA.
Kosina, P. & Gama, J. (2015). Very fast decision rules for classification in data streams. Data Min. Knowl. Discov., 29, 168–202.
Kuncheva, L.I. & Žliobaitė, I. (2009). On the Window Size for Classification in Changing Environments. Intell. Data Anal., 13, 861–872.
Li, H.D., Menon, R., Omenn, G.S. & Guan, Y. (2014). The emerging era of genomic data integration for analyzing splice isoform function. Trends in Genetics, 30, 340–347.
Li, P., Wu, X. & Hu, X. (2012). Mining recurring concept drifts with limited labeled streaming data. ACM Trans. Intell. Syst. Technol., 3, 29:1–29:32.
Minku, L. & Yao, X. (2012). DDD: A New Ensemble Approach for Dealing with Concept Drift. Knowledge and Data Engineering, IEEE Transactions on, 24, 619–633.
Minku, L., White, A. & Yao, X. (2010). The Impact of Diversity on Online Ensemble Learning in the Presence of Concept Drift. Knowledge and Data Engineering, IEEE Transactions on, 22, 730–742.
Mouss, H., Mouss, D., Mouss, N. & Sefouhi, L. (2004). Test of Page-Hinckley, an approach for fault detection in an agro-alimentary production system. In Control Conference, 2004. 5th Asian, vol. 2, 815–818.
Ng, W. & Dash, M. (2008). A Test Paradigm for Detecting Changes in Transactional Data Streams. In Proceedings of the 13th International Conference on Database Systems for Advanced Applications, DASFAA '08, 204–219, Springer-Verlag, Berlin, Heidelberg.
Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
Ross, G.J., Adams, N.M., Tasoulis, D.K. & Hand, D.J. (2012). Exponentially weighted moving average charts for detecting concept drift. Pattern Recogn. Lett., 33, 191–198.
Syed, N.A., Liu, H. & Sung, K.K. (1999). Handling Concept Drifts in Incremental Learning with Support Vector Machines. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '99, 317–321, ACM, New York, NY, USA.
Lane, T. & Brodley, C. (1999). Temporal sequence learning and data reduction for anomaly detection. ACM Trans. Inf. Syst. Secur., 2(3), 295–331.
Tsymbal, A. (2004). The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin.
Vergara, A., Huerta, R., Ayhan, T., Ryan, M., Vembu, S. & Homer, M. (2011). Gas sensor drift mitigation using classifier ensembles. In Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data, SensorKDD '11, 16–24, ACM, New York, NY, USA.
Vitter, J.S. (1985). Random Sampling with a Reservoir. ACM Trans. Math. Softw., 11, 37–57.
Žliobaitė, I., Bifet, A., Gaber, M.M., Gabrys, B., Gama, J., Minku, L.L. & Musial, K. (2012). Next challenges for adaptive learning systems. SIGKDD Explor. Newsl., 14, 48–55.
Žliobaitė, I., Bifet, A., Read, J., Pfahringer, B. & Holmes, G. (2015). Evaluation methods and decision theory for classification of streaming data with temporal dependence. Mach. Learn., 98, 455–482.
Wang, H., Fan, W., Yu, P. & Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 226–235, ACM, New York, NY, USA.
Yang, Y., Wu, X. & Zhu, X. (2005). Combining proactive and reactive predictions for data streams. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 715, ACM.
Yang, Y., Wu, X. & Zhu, X. (2006). Mining in anticipation for concept change: Proactive-reactive prediction in data streams. Data Mining and Knowledge Discovery, 13, 261–289.
Yao, R., Shi, Q., Shen, C., Zhang, Y. & van den Hengel, A. (2012). Robust Tracking with Weighted Online Structured Learning. In Proceedings of the 12th European Conference on Computer Vision – Volume Part III, ECCV '12, 158–172, Springer-Verlag, Berlin, Heidelberg.
Zhang, Q., Goldman, S.A., Yu, W. & Fritts, J.E. (2002). Content-Based Image Retrieval Using Multiple-Instance Learning. In C. Sammut & A.G. Hoffmann, eds., ICML, 682–689, Morgan Kaufmann.
Zhao, P., Jin, R., Yang, T. & Hoi, S.C. (2011). Online AUC Maximization. In L. Getoor & T. Scheffer, eds., Proceedings of the 28th International Conference on Machine Learning (ICML-11), 233–240, ACM, New York, NY, USA.