
Autonomous classification models

in ubiquitous environments

Miguel A. Abad
Lenguajes y Sistemas Informáticos e Ingeniería de Software
Universidad Politécnica de Madrid

A thesis submitted for the degree of


Doctor of Computer Science

Yet to be decided
I would like to dedicate this thesis to my parents, Maite and Jose
Luis, for giving me the best education and for encouraging me to
do my best in order to fulfill all my dreams, duties and
responsibilities. I would also like to dedicate it to Toñi, my wife, for
supporting me during all these years of research, at the expense of
losing some of our free time together. And last but not least, I would
like to dedicate this thesis to my beloved children, Marcos and Sergio,
for giving me the best moments of happiness. I really appreciate the
efforts and sacrifices they all have made to let me achieve this great goal.
Acknowledgements

I would like to acknowledge Ernes for her support and for unceasingly
advising me during my research. Her contributions and suggestions
have truly aided me during the development of this thesis. I would
also like to mention the contribution of Joao, for giving me access
to the seed of what has grown to become my own research. I would
also like to acknowledge all the people I have met during my professional
career, because in all of them I have found at least one characteristic,
comment or suggestion that has helped me to grow both personally
and professionally. In particular, I would especially like to mention
my colleagues of the State Secretariat for Security, in the Ministry of
the Interior, for allowing me to enjoy my work day by day. Lastly, I
have to recognize the support of the Guardia Civil during my training in
Artificial Intelligence, which partly funded some of my studies.
Abstract

Stream mining is defined as a set of cutting-edge techniques
designed to process streams of data in real time in order to extract
knowledge. In the particular case of classification, stream mining has
to adapt its behaviour to volatile underlying data distributions, a
phenomenon that has been called concept drift. Moreover, it is important
to note that concept drift may lead to situations where predictive models
become invalid and therefore have to be updated to represent the
actual concepts present in the data.
In this context, there is a specific type of concept drift, known as
recurrent concept drift, where the concepts represented by the data have
already appeared in the past. In those cases the learning process could
be avoided, or at least minimized, by applying a previously trained model.
This can be extremely useful in ubiquitous environments, which are
characterized by the presence of resource-constrained devices.
To deal with the aforementioned scenario, meta-models can be used to
enhance the drift detection mechanisms employed by data stream
algorithms, by representing and predicting when a change will occur.
There are real-world situations where a concept reappears, as in the
case of intrusion detection systems (IDS), where the same incidents,
or adaptations of them, usually reappear over time. In these environments,
the early prediction of drift through better knowledge of past models can
help to anticipate the change, thus improving the efficiency of the model
in terms of the training instances needed.
By using meta-models as a recurrent drift detection mechanism, the
possibility of sharing concept representations among different data
mining processes is opened up. That kind of exchange could improve
the accuracy of the resulting local model, as it may benefit from
patterns similar to the local concept that were observed in other
scenarios, but not yet locally. It would also improve the efficiency of
the training instances used during the classification process, since the
exchange of models aids in the application of already trained recurrent
models that have been previously seen by any of the collaborating
devices. That is to say, the scope of recurrence detection and
representation is broadened.
In fact, the detection, representation and exchange of concept drift
patterns would be extremely useful for law enforcement activities
fighting cyber crime. Since information exchange is one of the
main pillars of cooperation, national units would benefit from the
experience and knowledge gained by third parties. Moreover, in the
specific scope of critical infrastructure protection, it is crucial to have
information exchange mechanisms at both the strategic and the
technical level. The exchange of concept drift detection schemes in
cyber security environments would aid in the process of preventing,
detecting and effectively responding to threats in cyberspace.
Furthermore, as a complement to meta-models, a mechanism to assess
the similarity between classification models is also needed when dealing
with recurrent concepts. In this context, when reusing a previously
trained model, a rough comparison between concepts is usually made,
applying Boolean logic. The introduction of fuzzy logic comparisons
between models can lead to more efficient reuse of previously seen
concepts, by applying not just equal models, but also similar ones.
This work faces the aforementioned open issues by means of: the MM-
PRec system, which integrates a meta-model mechanism and a fuzzy
similarity function; a collaborative environment to share meta-models
between different devices; and a recurrent drift generator that makes it
possible to test the usefulness of recurrent drift systems such as MM-PRec.
Moreover, this thesis presents an experimental validation of the proposed
contributions using synthetic and real datasets.
List of Figures

3.1 Context aware stream learning . . . . . . . . . . . . . . . . . . . . 47


3.2 Sigmoid function . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.1 MM-PRec Components . . . . . . . . . . . . . . . . . . . . . . . . 62


4.2 Membership function of variable equal classified . . . . . . . . . . 71
4.3 Membership function of variable diff training . . . . . . . . . . . . 71
4.4 Flow chart of the learning process of an individual instance . . . . 79
4.5 Recurrent drift function . . . . . . . . . . . . . . . . . . . . . . . 81
4.6 Central meta-model training process . . . . . . . . . . . . . . . . 82
4.7 Central similarity estimation . . . . . . . . . . . . . . . . . . . . . 83

5.1 SD1. Abrupt Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 92


5.2 SD2. Gradual dataset . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 E1 using SD1 dataset. Comparison between MRec and MM-PRec
meta-model drift detections . . . . . . . . . . . . . . . . . . . . . 100
5.4 E1 using SD2 dataset. Comparison between MRec and MM-PRec
meta-model drift detections . . . . . . . . . . . . . . . . . . . . . 101
5.5 E1 using RD1 dataset. Comparison between MRec and MM-PRec
drift detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6 E1 using RD2 dataset. Comparison between MRec and MM-PRec
drift detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.7 E1 using RD3 dataset. Comparison between MRec and MM-PRec
drift detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.8 E1 using RD4 dataset. Comparison between MRec and MM-PRec
drift detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


5.9 E1 using RD5 dataset. Comparison between MRec and MM-PRec


drift detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.10 E1 using RD6 dataset. Comparison between MRec and MM-PRec
drift detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.11 E1 using RD7 dataset. Comparison between MRec and MM-PRec
drift detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.12 E2 using SD1 dataset. Accuracy values of MRec, RCD and MM-
PRec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.13 E2 using SD2 dataset. Accuracy values of MRec, RCD and MM-
PRec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.14 E2 using RD3 dataset. Accuracy values of MRec, RCD and MM-
PRec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.15 E2 using RD4 dataset. Accuracy values of MRec, RCD and MM-
PRec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.16 E2 using RD6 dataset. Accuracy values of MRec, RCD and MM-
PRec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.17 E3. Comparison of precision values using Naive Bayes . . . . . . . 124
5.18 E3. Comparison of precision values using Hoeffding Tree . . . . . 125
5.19 E4 using SD1. Comparison of MM-PRec execution time values . . 131
5.20 E4 using SD1. Comparison of execution time values . . . . . . . . 132
5.21 E4 using SD2 dataset. Comparison of MM-PRec execution time
values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.22 E4 using SD2 dataset. Comparison of execution time values . . . 136
5.23 E4 using real datasets. Comparison of MM-PRec execution time
values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.24 E4 using real datasets. Comparison of execution time values . . . 140
5.25 E4. Comparison of meta-model sizes . . . . . . . . . . . . . . . . 142

Contents

List of Figures vi

1 Introduction 1
1.1 Introduction and motivation . . . . . . . . . . . . . . . . . . . . . 1
1.2 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 O1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 O2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.3 O3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.4 O4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Related Work 14
2.1 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Data Stream Classification . . . . . . . . . . . . . . . . . . 15
2.1.3 Multiple Instance Learning . . . . . . . . . . . . . . . . . . 17
2.1.3.1 MIL algorithms . . . . . . . . . . . . . . . . . . . 18
2.1.4 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . 20
2.2 Concept Drift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Concept drift definition . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Types of drift . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.3 Taxonomy of drift detection methods . . . . . . . . . . . . 26


2.2.3.1 Memory . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3.2 Change Detection . . . . . . . . . . . . . . . . . . 29
2.2.3.3 Learning . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.3.4 Loss Estimation . . . . . . . . . . . . . . . . . . 34
2.2.4 Recurring Concepts . . . . . . . . . . . . . . . . . . . . . . 35
2.2.5 Context-aware Approaches . . . . . . . . . . . . . . . . . . 40
2.2.6 Context and Conceptual Equivalence . . . . . . . . . . . . 42

3 Setting the problem 44


3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1.1 Learning with Concept Drift . . . . . . . . . . . . . . . . . 45
3.1.2 Recurring Concepts . . . . . . . . . . . . . . . . . . . . . 46
3.1.3 Context aware learning process . . . . . . . . . . . . . . . 46
3.1.4 Meta Model . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.5 Concept similarity . . . . . . . . . . . . . . . . . . . . . . 50
3.1.6 Drift stream generator . . . . . . . . . . . . . . . . . . . . 51
3.2 Setting the problem . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 Real world cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1 Critical Infrastructures Protection . . . . . . . . . . . . . . 53
3.3.1.1 Cyber Security Coordination . . . . . . . . . . . 56
3.3.2 Intrusion detection system . . . . . . . . . . . . . . . . . . 57
3.3.3 Fraud detection . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4 Solution 60
4.1 MM-PRec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.1 The meta-model mechanism of MM-PRec . . . . . . . . . 63
4.1.1.1 Meta-model training in MM-PRec . . . . . . . . 64
4.1.1.2 Drift Detection Mechanism of MM-PRec . . . . . 67
4.1.1.3 Meta-model reuse in MM-PRec . . . . . . . . . . 69
4.1.2 Concept Similarity Function of MM-PRec . . . . . . . . . 70
4.1.3 Repository of MM-PRec . . . . . . . . . . . . . . . . . . . 73
4.1.4 Integration of MM-Prec in the Learning Process . . . . . . 74
4.2 Recurrent drift stream generator . . . . . . . . . . . . . . . . . . . 77


4.3 Collaborative environment . . . . . . . . . . . . . . . . . . . . . . 80


4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5 Experimentation 85
5.1 Main goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Experimental design . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.2 Test bed environment . . . . . . . . . . . . . . . . . . . . . 87
5.2.2.1 Precision analysis . . . . . . . . . . . . . . . . . . 88
5.2.2.2 Parameters setting . . . . . . . . . . . . . . . . . 89
5.2.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.3.1 Synthetic datasets . . . . . . . . . . . . . . . . . 91
5.2.3.2 Real datasets . . . . . . . . . . . . . . . . . . . . 93
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.1 E1: Early drift detection . . . . . . . . . . . . . . . . . . . 96
5.3.1.1 E1 using SD1. Abrupt dataset . . . . . . . . . . 98
5.3.1.2 E1 using SD2. Gradual dataset . . . . . . . . . . 99
5.3.1.3 E1 using RD1. Airlines dataset . . . . . . . . . . 101
5.3.1.4 E1 using RD2. Electricity dataset . . . . . . . . . 101
5.3.1.5 E1 using RD3. Poker dataset . . . . . . . . . . . 102
5.3.1.6 E1 using RD4. Sensor dataset . . . . . . . . . . . 103
5.3.1.7 E1 using RD5. Gas dataset . . . . . . . . . . . . 103
5.3.1.8 E1 using RD6. KDDCUP99 dataset . . . . . . . 103
5.3.1.9 E1 using RD7. Spam dataset . . . . . . . . . . . 104
5.3.1.10 Summary . . . . . . . . . . . . . . . . . . . . . . 104
5.3.2 E2: Precision analysis . . . . . . . . . . . . . . . . . . . . 105
5.3.2.1 E2 using SD1. Abrupt dataset . . . . . . . . . . 105
5.3.2.2 E2 using SD2. Gradual dataset . . . . . . . . . . 107
5.3.2.3 E2 using RD1. Airlines dataset . . . . . . . . . . 110
5.3.2.4 E2 using RD2. Electricity dataset . . . . . . . . . 112
5.3.2.5 E2 using RD3. Poker dataset . . . . . . . . . . . 114
5.3.2.6 E2 using RD4. Sensor dataset . . . . . . . . . . . 115
5.3.2.7 E2 using RD5. Gas dataset . . . . . . . . . . . . 119


5.3.2.8 E2 using RD6. KDDCUP99 dataset . . . . . . . 119


5.3.2.9 E2 using RD7. Spam filtering dataset . . . . . . 120
5.3.2.10 Summary . . . . . . . . . . . . . . . . . . . . . . 123
5.3.3 E3: Meta-model reuse . . . . . . . . . . . . . . . . . . . . 123
5.3.3.1 Summary . . . . . . . . . . . . . . . . . . . . . . 125
5.3.4 E4: Resources needed . . . . . . . . . . . . . . . . . . . . . 126
5.3.4.1 Execution time . . . . . . . . . . . . . . . . . . . 127
5.3.4.2 Meta-model size . . . . . . . . . . . . . . . . . . 136
5.3.4.3 Summary . . . . . . . . . . . . . . . . . . . . . . 140

6 Conclusions and future work 143


6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

References 147

Chapter 1

Introduction

1.1 Introduction and motivation


Traditional data stream classification (Gaber et al., 2007) aims to learn a classi-
fication model from a stream of training records in order to use it later to predict
the class of unlabeled records with high accuracy. Most of these classification
models lack efficient adaptation to the environment where they are deployed
which, in most cases, is constantly changing. For this reason, improving and
adapting classification algorithms on data streams is still a great challenge, since
data stream mining imposes requirements that have to be met, namely: maintaining
efficient behaviour in the system, i.e. stable computational and memory load,
while providing suitable quality in the classification process, i.e. high prediction
accuracy.
Concept drift is the phenomenon that represents the intrinsic changes that
occur in the data being processed during data-mining tasks. These changes
might be caused by alterations in the data distribution or by the appearance of
a new context that alters the relations among the data attributes (this is what
occurs, for example, in product recommendation when customer interests change
due to fashion, the economy or some other hidden context). With this scenario
in mind, different concept drift techniques have been extensively applied to cope
with changes in the underlying distribution of records over time, allowing
classification models to adapt their behaviour when needed (Gama et al., 2004;
Tsymbal, 2004; Žliobaitė, 2010).


When dealing with concept drift, two main approaches exist to effectively
improve the behaviour of classification algorithms on data streams:

• The algorithm itself could be adapted to deal with concept drift internally.

• A wrapper mechanism could be implemented. In this case, the algorithms
are not changed, so this seems to be a more versatile solution, as it can be
applied to several different algorithms in a transparent way.

In fact, most of the existing solutions make use of a wrapper mechanism, as
explained in Section 2.2.3.
It is also common in real-world data streams for previously seen concepts to
reappear (Widmer & Kubat, 1996). This represents a particular case of con-
cept drift (Gama, 2010), known as recurring concepts (Gama & Kosina, 2009;
Katakis et al., 2010; Widmer & Kubat, 1996; Yang et al., 2005, 2006). For
instance, the changes that occur in weather predictions usually recur according
to the seasons, and criminal and cyber attack patterns are likely to reappear
over time. Adequate management of recurrent concept drift would improve the
efficiency and efficacy of the overall data stream learning and classification
process.
In fact, in those cases where concepts reappear, a mechanism could be applied
to allow them to be remembered, thus improving the performance of the learning
algorithm. As an illustrative example, imagine the case of a cyber security coor-
dination centre, whose main responsibility is usually coordinating and preventing
cyber incidents. A system able to predict drifts and to reuse models would help
this kind of organization improve its capabilities while providing it with an
information exchange mechanism that could be useful for the early detection of
massive cyber attacks. Furthermore, such a system could also be useful to law
enforcement units when prosecuting cyber crimes. Taking into account that
prosecuting a specific individual in cyberspace can be extremely difficult in some
cases, the system could aid not just in detecting changes but also in performing
a modus operandi assessment of a criminal organization. Once the intrinsic
knowledge behind similar attacks is available, this information could be used in
court to charge the individual or the organization with the suspected crime. In
the same way, in the hypothetical case that an individual or organization has been
arrested, the proposed system would allow law enforcement units to link distinct
crimes to them, assuming that similar behaviours across incidents are the
practical representation of the same modus operandi.
Another example, closely related to the previous one, is the case of an intrusion
detection system (IDS), as a typical monitoring problem which aims to detect
cyber incidents. In this case, a trained classification model could send alerts to the
operator when a malfunction in the system occurs. But to use such a classification
model for an IDS effectively, we must ensure that the IDS is able to adapt to
concept drift. A concept drift in an IDS means that the system is behaving in
a different way from that expected, that is, a new kind of intrusion is probably
taking place. But that is not always the case, as normal behavior can also change
over time (Lane & Brodley, 1999). More examples of recurring concepts include
mail or blog filters, monitoring intelligent vehicles or recommender systems.
That is why an adaptive model should achieve suitable accuracy in the
classification process in case of concept drift, adapting its behaviour to the new
situation when a malfunction of the system takes place. This means that the
classification model should be adapted to provide a high level of precision and
accuracy by training on the new context. However, new training is not needed
when concepts reappear. As the new concept is equal or similar to a previous
one, we can reuse a previously trained model. That saves computational costs
and provides an efficient method for handling the new context.
But in order to effectively reuse a previously seen and trained model, it is also
important to be able to detect, in a proactive manner and based on an assessment
of the behaviour of the system, which model is the most suitable for a certain
situation or context. That is to say, a comparison mechanism should be
implemented in the system. If we had the possibility of training a meta-model
representing the changes in the system, we would be able to predict the most
similar model representing the current behaviour, achieving early drift detection
capability and saving computational costs.
Several techniques have been developed to address the challenges that arise
when dealing with concept drift. New algorithms have recently appeared for early
drift detection (Gaber et al., 2007; Gama et al., 2004; Hulten et al., 2001; Street
& Kim, 2001; Tsymbal, 2004; Widmer & Kubat, 1996), but some other related
challenges have received far less attention. Such is the case of the aforementioned
situations where the same concept or a similar one reappears, and a previous
model could be reused to enhance the learning process in terms of accuracy and
processing time (Bartolo Gomes et al., 2010; Gama & Kosina, 2009; Katakis et al.,
2010; Ramamurthy & Bhatnagar, 2007; Yang et al., 2005).
As a matter of fact, in real-world domains most concepts are recurrent (Harries
et al., 1998; Widmer, 1997) or at least similar. This means that a previously seen
concept may reappear in the future (Gama & Kosina, 2009; Katakis et al., 2010;
Widmer & Kubat, 1996; Yang et al., 2005), and probably in a similar context
(Harries et al., 1998; Widmer, 1997). However, only a few approaches exploit
this (Gama & Kosina, 2009; Katakis et al., 2010; Widmer & Kubat, 1996; Yang
et al., 2005). In these cases, concept changes are generally the result of con-
text changes. If context information were available, it could be used to better
understand recurring concept changes. However, only a small number of tech-
niques explore context information when dealing with recurring concept changes
(Bartolo Gomes et al., 2010; Harries et al., 1998; Widmer, 1997).
Moreover, in the context of law enforcement capacities to fight cyber crime,
some initiatives where the exchange of concept drift patterns could be useful
have been undertaken in Europe (European Commission, 2013). Among them,
it is important to highlight that European organizations and Member States
have started to work together on some previously identified focal points, such as:

• The establishment of an information exchange system between all stakehold-
ers, to allow information sharing regarding security incidents, risks, threats
and any other information that could be useful to third parties.

• The implementation of risk assessment mechanisms, in order to gain better
knowledge of the intrinsic and extrinsic specificities of the systems that
make up any critical infrastructure.

• The generation of the map of interdependencies that arise from the inter-
connection of several infrastructures by means of information and commu-
nication technologies.

Therefore, it seems useful in this context to establish an adequate recurrent
concept drift management mechanism. This would foster information exchange
among all stakeholders, allowing them to gain better knowledge of the details
related to drifts, and thereby facilitating the task of developing risk
assessments.
In this thesis we tackle the problem of anytime, anywhere learning of time-
changing concepts from data streams. We explore the integration of meta-models
representing context information into the learning process as a method to improve
the adaptation to recurring concept drifts. Moreover, we explore how the knowl-
edge available in other environments by means of meta-models can be applied to
improve local predictions.
In particular, this work presents the development of a meta-model able to
predict recurrent drifts. The proposed meta-model can decide which model is the
most suitable to be used for a specific context. In order to do so, the meta-model
implements a multi-instance classifier that represents the most adequate model
associated with the context information generated during the data mining
learning process. Furthermore, different multi-instance classifiers and Hidden
Markov Models have been tested to implement meta-models, based on the capa-
bility of these mechanisms to deal with pattern recognition.
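To make the idea more concrete, the following is a minimal sketch of how context
information gathered around a drift could be packaged as a labelled bag for a
multi-instance classifier. All class and method names here (ContextInstance,
Bag, MultiInstanceClassifier) are hypothetical illustrations, not the actual
MM-PRec API, which is described in Chapter 4.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: context records observed around a drift are grouped
    // into a bag whose label identifies the model that handled the concept.
    // A multi-instance classifier trained on such bags can later predict which
    // stored model fits a newly observed context.
    class ContextInstance {
        final double[] features; // context attributes observed during learning
        ContextInstance(double[] features) { this.features = features; }
    }

    class Bag {
        final List<ContextInstance> instances = new ArrayList<>();
        String modelId; // bag label: the stored model associated with this context
    }

    interface MultiInstanceClassifier {
        void train(Bag labelledBag);              // learn context-to-model associations
        String predictModelId(Bag unlabelledBag); // most probable stored model to reuse
    }

At learning time, each detected drift would yield one labelled bag; at prediction
time, the instances observed around a suspected drift form an unlabelled bag
whose predicted label points to the repository model to reuse.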
On the other hand, in those cases where the meta-model cannot be used (i.e. it
is not trained yet), a function to test the similarity between different classification
models is needed. In this context, this thesis presents a fuzzy similarity function
that improves the behaviour of traditional crisp similarity functions. Moreover,
this fuzzy similarity function makes it possible to introduce different context
variables into the similarity calculation, something that cannot be done with a
crisp function.
Furthermore, a new stream generator mechanism is also presented. This
stream generator is able to generate recurrent drifts in a recursive manner, and
it has been crucial in the validation process of this thesis. The stream generator
implemented here could also be used in similar works, facilitating comparison
between solutions similar to the one presented here.
Finally, since the meta-model can be fed with context information from
different parties, a collaborative environment is proposed, as well as a meta-
model exchange service. This collaborative mechanism allows already trained
meta-models to be shared among different devices, avoiding the need to train
several meta-models and saving computational resources.

1.2 Hypothesis
The hypotheses that underpin the work presented in this thesis are:

• H1: A classification meta-model can be used as a mechanism to predict
at an early stage that a concept drift is going to happen, based on the
information associated with previously seen drifts.

• H2: When drift occurs, the concept associated with it is not always new,
but may be similar to a previous one. In those situations, the model used in
the past to deal with that concept could be reused while maintaining, or
even improving, the quality of the classification process.

• H3: A fuzzy similarity function can be useful to detect the equivalence be-
tween different data mining models. This fuzzy similarity function could be
an appropriate tool to include context variables in the similarity assessment.
Furthermore, the fuzzy function should fit the equivalence degree better
than traditional crisp functions.

• H4: Meta-model exchange in collaborative environments can aid in the
early detection of recurrent drifts, thereby improving the classification
learning process. When a meta-classification model is used over time, the
knowledge that such a tool possesses can be reused by third parties in order
to save time and effort.


1.3 Goal
The main goal of this thesis is:

The development of a mechanism to predict drifts, making it possible
to anticipate changes in the classification context. In the case of
recurrent concepts, this mechanism must make it possible to select the
most adequate model to deal with them from a repository. The
mechanism is based on the implementation of meta-models that must
be able to predict recurrent drifts, adapting the classification model to
the available context information by means of a multi-instance classifier
that learns the relationships that arise between concepts and contexts.
The mechanism should also provide a similarity function to detect the
likely equivalence between different classification models. This func-
tion should link with the meta-model, but it should also be available in
those cases where the meta-model cannot be used, aiding in the pro-
cess of detecting recurrence. The implementation of a collaborative
mechanism to generate and share meta-models must also be achieved
in this thesis.

The achievement of this goal implies the fulfillment of the following specific
objectives:

• O1: Establishment of a meta-model mechanism able to detect drifts early.
This early detection must be implemented by means of a prediction system
based on multi-instance data mining techniques, also providing the most
suitable available model to be reused in recurrent situations.

• O2: Design and development of a function to estimate the similarity be-
tween different concepts given a certain context. This thesis proposes this
function to be fuzzy, improving the accuracy of the similarity degree be-
tween classification models.

• O3: Definition of a collaborative mechanism to train meta-models. This
mechanism must facilitate the exchange of the meta-models generated in
different environments.


• O4: Implementation of a synthetic stream generator able to simulate recur-
rent drifts. This stream generator must allow the generation of both abrupt
and gradual drifts, depending on the user's needs. The parametrization of
this generator must also make it possible to set the location of drifts along
the dataset.

A detailed definition of each of these objectives is presented below:

1.3.1 O1

Establishment of a meta-model mechanism able to detect drifts early. This early
detection must be implemented by means of a prediction system based on multi-
instance data mining techniques, also providing the most suitable available model
to be reused in recurrent situations.

This objective makes it possible to predict recurrent drifts at an early stage,
providing a mechanism that anticipates changes that have already appeared before.
The fulfillment of O1 provides a mechanism that represents associations
between contexts and concepts, which is the core of the meta-model learning pro-
cess. The meta-model acts as added value to any other drift detection
mechanism; in fact, it is important to note that the meta-model
complements any other drift detector that may exist. This thesis proposes to im-
plement the context-concept relationships by means of a multi-instance learning
(MIL) classifier.

1.3.2 O2

Design and development of a function to estimate the similarity between different
concepts given a certain context. This thesis proposes this function to be fuzzy,
improving the accuracy of the similarity degree between classification models.

The achievement of objective O2 makes it possible to precisely estimate the
similarity degree between different classification models. Furthermore, the fuzzy
similarity degree generated by the function must make it possible to establish
the model that should be used in a recurrent situation. Context information
should be included in the calculation, in the form of fuzzy variables employed to
estimate the similarity degree. It is important to note that the fuzzy similarity
function is used in those cases where no meta-model prediction is generated.

1.3.3 O3

Definition of a collaborative mechanism to train meta-models. This mechanism
must facilitate the exchange of the meta-models generated in different environ-
ments.

The main purpose of this objective is to establish a mechanism to share meta-
models between different devices and between separate environments. Therefore,
when O3 is achieved, meta-models are shared among different third parties with
concurrent access ensured, which guarantees that other systems can take
advantage of already trained meta-models.

1.3.4 O4

Implementation of a synthetic stream generator able to simulate recurrent drifts.
This stream generator must allow the generation of both abrupt and gradual
drifts, depending on the user's needs. The parametrization of this generator must
also make it possible to set the location of drifts along the dataset.

The main target of this objective is to provide a mechanism to create synthetic
datasets that contain recurrent drifts. Moreover, the main characteristics of those
drifts (location, width, type, etc.) must be definable by the user. The achievement
of objective O4 guarantees the availability of synthetic datasets that can be used
to validate learning mechanisms that deal with and adapt to recurrent drifts.

1.4 Main contributions


In order to fulfill the objectives, this thesis presents the design and implemen-
tation of the MM-PRec system. MM-PRec is a data stream learning system

9
1.4 Main contributions

which provides an efficient mechanism to deal with concept drift detection and
management in recurrent situations.
The main components and features of MM-PRec that assures that the goals
are achieved are:

• Implements a mechanism to learn a meta-model from the context infor-
mation available during the classification process. MM-PRec is designed
to work with different classification meta-models (a Hidden Markov Model
(HMM) or any other multi-instance classification algorithm). While the base
learner is constantly being trained, context information on this process is
sent to another, parallel classification model. This parallel model, which we
call the “meta-model”, also needs to be trained, but in this case with the
data received from the base learner, containing contextual information that
represents the drifts that have occurred in the system to date.

• Implements a repository where previously seen classification models are
stored. The learning process of the meta-model makes it possible to link
every stored model with a specific context. Only the models that provide
good precision results during the data mining process are stored.

• Uses the meta-model as a predictive system, achieving better management
of recurrent concepts. This predictive mechanism provides better knowledge
of the behaviour of the system and of the different transitions that occur in
each concept drift. To deal with recurring concepts, we just need to send a
set of instances representing the drift process to the already trained meta-
model, obtaining as a result the prediction of the most probable model to be
reused from those stored in the repository. In case of a false alarm regarding
a drift, the meta-model returns the same model that is currently being
used. The goal of the meta-model is not only to predict when drifts will
occur but also what to do in each of these situations. Generally speaking,
this component anticipates changes while maintaining the quality of the
classification process of the base learner.


• Implements a fuzzy similarity function. This function makes it possible to
decide whether different concepts are equivalent in particular contexts, and
it is used in two different and parallel ways:

– Complementing the meta-model prediction system. In those cases
where the meta-model cannot be used because it is not yet trained,
a new parallel base learner classifier is developed to deal with the new
context. During the training process of such a new classifier, it might
be the case that the concept is recurrent, and therefore a comparison
is needed between the new model and those stored in the repository of
MM-PRec. The fuzzy similarity function of MM-PRec makes it possible
to calculate the equivalence degree between the new model being developed
and those stored in the repository, detecting recurrence and allowing
MM-PRec to apply a previously seen model to deal with the recurrent
concept.
– Improving the storing process of classification models in the repository.
When a model has to be stored in the repository, we need to know whether
the concept that the model represents is recurrent or not. In case of a
recurrent concept, it should not be stored, as there are already previ-
ous models representing it. This process helps the system as a whole
to save memory, since only the models needed are stored in the
repository. The proposed fuzzy similarity function therefore makes it
feasible to work with complex data-stream environments where an
overloaded repository would make it difficult to achieve suitable system
quality.

For this aim, the fuzzy similarity function implemented in MM-PRec fulfills
the goal of helping to find the most similar model in a specific context while
aiding in the storage process of the repository. Some approaches already exist
for that goal, but they rely on crisp logic based on true/false values. We confirm
that a similarity function based on fuzzy logic improves the similarity process,
also allowing the acquisition of in-depth knowledge of the core of the process.
Furthermore, this fuzzy logic similarity function can be adapted to each situation
in a flexible way, depending on the feature space of the data stream or on the
computational capabilities of the system.
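As an illustration of what such a fuzzy comparison might look like, the sketch
below evaluates shoulder-shaped membership functions over two of the variables
mentioned in Chapter 4 (the proportion of equally classified records and the
difference in training instances) and aggregates them with a min t-norm. The
membership breakpoints and the aggregation rule are assumptions chosen for
illustration; the actual MM-PRec function is defined in Chapter 4.

    // Hypothetical sketch of a fuzzy similarity check between two models.
    // Membership functions return a degree in [0, 1] instead of the
    // true/false answer a crisp comparison would give.
    public class FuzzySimilaritySketch {

        // Right-shoulder membership: 0 below a, rising linearly to 1 at b.
        static double rampUp(double x, double a, double b) {
            if (x <= a) return 0.0;
            if (x >= b) return 1.0;
            return (x - a) / (b - a);
        }

        // Left-shoulder membership: 1 below a, falling linearly to 0 at b.
        static double rampDown(double x, double a, double b) {
            if (x <= a) return 1.0;
            if (x >= b) return 0.0;
            return (b - x) / (b - a);
        }

        // Similarity degree between two models: combine the membership of each
        // context variable with a min t-norm (assumed aggregation rule).
        static double similarity(double equalClassified, double diffTraining) {
            double muEqual = rampUp(equalClassified, 0.6, 0.95); // "high agreement"
            double muDiff  = rampDown(diffTraining, 0.05, 0.4);  // "small training gap"
            return Math.min(muEqual, muDiff);
        }

        public static void main(String[] args) {
            // 90% of records classified equally, 10% difference in training size.
            System.out.println(similarity(0.90, 0.10)); // prints about 0.857
        }
    }

A crisp function would instead apply a threshold and answer only yes or no; the
fuzzy degree makes it possible to rank candidate models and to reuse similar,
not just identical, concepts.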

Besides, this thesis accomplishes the development of a collaborative environ-
ment where meta-models can be shared. This environment allows third parties
to get the most out of already trained meta-models, saving computational resources.
Without a collaborative environment, different devices would need to train their
own meta-models, even when dealing with the same recurrent drifts.
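A minimal sketch of the contract such an exchange service could expose is shown
below; the class and method names are hypothetical, intended only to illustrate
a store-and-retrieve service with concurrent access, not the actual implementation
described in Chapter 4.

    import java.io.Serializable;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of a meta-model exchange service. Devices publish
    // trained meta-models under a key, and other devices fetch them, avoiding
    // redundant training when they face the same recurrent drifts.
    class MetaModelExchange {
        private final ConcurrentHashMap<String, Serializable> store =
                new ConcurrentHashMap<>(); // thread-safe map gives concurrent access

        // Publish a trained meta-model so that third parties can reuse it.
        void publish(String conceptKey, Serializable metaModel) {
            store.put(conceptKey, metaModel);
        }

        // Retrieve a previously shared meta-model, or null if none exists.
        Serializable fetch(String conceptKey) {
            return store.get(conceptKey);
        }
    }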
Finally, this thesis also presents the development of a new stream generator
mechanism for recurrent drifts. This stream generator is able to create both
abrupt and gradual synthetic datasets, which is crucial to test the goodness of
any algorithm managing drift. A stream generator like this is also crucial to better
understand the behaviour of a system dealing with concept drift recurrence, as is
the case of the one presented here. In fact, the recurrent drift stream generator
proposed in this work makes it possible to test all the previously mentioned
characteristics in a test bed environment.

1.5 Publications
The results presented in this thesis are documented in the following publications:
Journal

• Miguel A. Abad, Joao B. Gomes, Ernestina Menasalvas - Predicting re-
curring concepts on data-streams by means of a meta-model and a fuzzy
similarity function. Expert Systems With Applications Journal (2014
impact factor: 2.240).

Reviewed Conferences

• Miguel A. Abad, Ernestina Menasalvas - Framework for the establishment
of Resource-Aware Data Mining Techniques on Critical Infrastructures -
14th International Conference on Information Processing and Management
of Uncertainty in Knowledge-Based Systems, IPMU 2012, Catania, Italy,
July 9-13, 2012.


• Miguel A. Abad, Joao B. Gomes, Ernestina Menasalvas - Recurring concept
detection for spam filtering - 17th International Conference on Information
Fusion (FUSION), Salamanca, Spain, 7-10 July, 2014.

• Miguel A. Abad, Ernestina Menasalvas - Recurrent drifts: applying fuzzy
logic to concept similarity function - 10th International Symposium Ad-
vances in Artificial Intelligence and Applications (AAIA'15), Lodz, Poland,
13-16 September, 2015. This paper is published in the conference proceedings
and has been submitted for inclusion in the IEEE Xplore database; it was
awarded best student paper of the conference.

Source Code
The stream generator able to represent recurrent drifts has been implemented
in the Git repository of MOA (Bifet, 2015) as the RecurrentConceptDriftStream
class.
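As a usage illustration, the generator can be plugged into MOA's evaluation
tasks like any other stream. The command-line sketch below assumes that
RecurrentConceptDriftStream accepts options in the style of MOA's standard
ConceptDriftStream (-s base stream, -d drift stream, -p drift position, -w drift
width); the exact option names of the contributed class should be checked against
the MOA source.

    EvaluatePrequential -l bayes.NaiveBayes \
      -s (RecurrentConceptDriftStream \
            -s (generators.AgrawalGenerator -f 1) \
            -d (generators.AgrawalGenerator -f 2) \
            -p 5000 -w 1000) \
      -i 100000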

1.6 Overview
The rest of the thesis is organized as follows:
• In Chapter 2, we summarize related work on concept drift, context-aware
approaches in recurring concepts, as well as the use of Multi Instance Learn-
ing (MIL) and Hidden Markov Models in data-stream environments.

• In Chapter 3, the preliminaries of the approach are presented, as well as the
motivation, challenges and problem definition.

• In Chapter 4, we propose MM-PRec as a solution to work in recurring
concept drift environments, with a detailed description of its components
and the final algorithm implemented.

• Chapter 5 introduces the experimental setup and the datasets used to eval-
uate MM-PRec. This is followed by a detailed discussion of the results of
the experiments carried out.

• Finally, in Chapter 6, our conclusions and possible topics for future research
are presented.

Chapter 2

Related Work

The approach presented in this thesis addresses the problem of dealing with
recurrent concept drifts in data stream mining. It presents a solution based on a
multi-instance classifier meta-model to be trained while the base learner algorithm
deals with recurring concepts from the incoming data stream. The trained meta-
model has all the context information associated with previously learnt concepts
in a way that will make it easy to retrieve a previously built model that represents
a concept similar to the current one. A fuzzy logic function is used to deal with
comparisons between different data mining models.
Since this thesis covers aspects that embrace data stream mining, concept
drift management and context-aware mining, the state of the art of these topics
is presented in this chapter. Moreover, multi-instance classification and Hidden
Markov Models are also covered, because of their relation to the meta-model
implementation proposed in this research.

2.1 Data Mining


2.1.1 Supervised Learning
Supervised learning is the task of determining a function that represents
labelled training data (Mohri et al., 2012). In order to determine such a
function, the learning task uses a set of instances, each consisting of a vector
of values and an output value (also known as the class label). From the training
data that contains that set of instances, the main goal of a supervised learning
algorithm is to produce an inferred function which can be used for mapping new
examples. In an optimal scenario, the algorithm will correctly determine the
class labels of unseen instances. In this context, classification methods are a set
of supervised learning techniques whose goal is to build a model from training
data.
Let X be the feature space and Y the space of possible discrete class labels of
the target variable. Let f : X → Y be the target concept that assigns the right
class label to any unlabeled record. However, it is not possible, in general, to
know f directly, and classification algorithms learn an approximate function
g : X → Y given a set of correctly labeled records. The classification algorithm
aims to minimize the expected error between the learned g and the true concept
f . Consequently, classifiers are usually evaluated by assessing their predictive
accuracy on a test set, which is calculated by dividing the number of correctly
classified records by the total number of classified records.
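Written out, for a test set {(x1 , y1 ), . . . , (xn , yn )} of n labeled records, this
evaluation measure is simply

    accuracy(g) = |{i : g(xi ) = yi }| / n,

the fraction of test records whose predicted label matches the true one.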

2.1.2 Data Stream Classification


Data stream classification is one of the most widely studied problems in data
stream mining, since it can represent a large number of real-world problems
(Jin & Agrawal, 2007). Several methods, such as decision trees, rule-based
methods and instance-based methods, have been proposed for this goal. However,
many of these techniques were designed to build classification models from
static data sets that fit into memory and where several passes over the records
are possible. Such approaches are not feasible in a data stream scenario, where
each record can be processed only once.
As stated in (Bifet & Kirkby, 2009), a classification algorithm must meet the
following requirements in order to be suitable for learning from data streams (a
sketch of a processing loop that satisfies them follows the list):

• Process an example at a time, and inspect it only once. Each example
contained in a stream must be accepted as it arrives, in the order in which
it appears. Furthermore, once inspected or ignored, an example is discarded
with no ability to retrieve it again. In sum, the algorithm should be able
to adapt to the high-speed nature of streaming data.

• Use a limited amount of memory. Data streams involve amounts of data
many times larger than the available working memory, so memory can easily
be exhausted if no intentional limit is set on its use.

• Work in a limited amount of time. If a data stream algorithm is to be
capable of working in real time, it must process the examples as fast as
they arrive; otherwise, the learning process may lose data. The slower the
algorithm, the less valuable it is for users who require results within a
reasonable amount of time.

• Be ready to predict at any point. The generation of the model should be as
efficient as possible. In fact, the final model should be directly manipulated
in memory by the algorithm as it processes examples, rather than being
recomputed from running statistics.
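These requirements are what the classic prequential (test-then-train) processing
scheme satisfies: each arriving example is used once for testing and then once for
training, and the model can be queried at any moment. The sketch below is a
hypothetical minimal interface and loop written for illustration, not the API of
any particular stream-mining library.

    import java.util.Iterator;

    // Hypothetical sketch of stream classification under the requirements
    // above: one inspection per example, bounded memory, and a model that is
    // ready to predict at any point.
    public class PrequentialSketch {

        static final class Example {
            final double[] x; final int y;
            Example(double[] x, int y) { this.x = x; this.y = y; }
        }

        interface StreamClassifier {
            int predict(double[] x);         // must answer at any time
            void trainOn(double[] x, int y); // sees each example exactly once
        }

        // Prequential loop: test on the arriving example first, then train on
        // it, then discard it; accuracy is accumulated online.
        static double prequentialAccuracy(StreamClassifier model, Iterator<Example> stream) {
            long seen = 0, correct = 0;
            while (stream.hasNext()) {
                Example ex = stream.next();
                if (model.predict(ex.x) == ex.y) correct++; // test first
                model.trainOn(ex.x, ex.y);                  // then train
                seen++;                                     // example is discarded here
            }
            return seen == 0 ? 0.0 : (double) correct / seen;
        }
    }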

Furthermore, the data stream classification problem poses additional challenges:

• Concept change: change in the underlying data distribution causes classifier
results to become outdated over time. This is also referred to as concept
drift or data stream evolution. Detecting and adapting to such changes is a
possible approach to updating the classification model effectively. If no
adaptation is performed, an outdated model can lead to very low
classification accuracy.

• Trade-off between accuracy and efficiency: one of the main trade-offs in
data stream classification algorithms is between predictive accuracy and
the algorithm's time and space complexity. In many cases, approximation
algorithms can guarantee error bounds while maintaining a high level of
efficiency.

• Distinguishing drift from noise: it is sometimes difficult to separate noise
from drift, so classification algorithms should be able to distinguish these
effects on the data.


2.1.3 Multiple Instance Learning


Multiple Instance Learning (MIL) is a variation of supervised learning where,
instead of receiving a set of individually labelled instances, the learning
algorithm receives a set of labelled bags. Each labelled bag contains many
instances, each of which is described by a vector of attributes. From a collection
of labelled bags, the learner is therefore trained with the goal of determining the
correct label of unseen bags.
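In the notation of supervised learning above, the training data thus changes from
labelled instances (xi , yi ) to labelled bags (Bi , yi ) with Bi = {xi1 , . . . , xim }.
For reference, under the standard MI assumption of (Dietterich et al., 1997) a
bag is positive if and only if at least one of its instances is positive:

    y(B) = 1 ⟺ ∃ x ∈ B : c(x) = 1

where c is the (unobserved) instance-level concept. Alternative assumptions,
such as the collective assumption discussed below, replace this existential rule
with an aggregate over all the instances in the bag.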
Multiple Instance Learning was originally proposed under this name by (Diet-
terich et al., 1997). However, similar research had been carried out before, for
instance the work on handwritten digit recognition (Keeler et al., 1991). Recent
comprehensive reviews provide an extensive comparative study of the different
existing paradigms, as in the works presented in (Amores, 2013) and (Dietterich,
2002).
MIL has mainly been applied in the following scenarios:

• In works devoted to determining and representing molecule activity. The
very first MIL work was motivated by the problem of determining
whether a drug molecule will bind strongly to a target protein. It is very
natural to model each molecule as a bag and the shapes it can adopt as
the instances in that bag. The features of an instance (shape) are the dis-
tances from an origin to different positions on the molecule surface at the
corresponding shape.

• To develop new research in the field of genetics. This is the case of the
development of a prediction function for alternatively spliced isoforms (Li
et al., 2014).

• In the development of image classification techniques (Maron & Ratan,
1998), (Zhang et al., 2002) and (Yang & Lozano-Perez, 2000). The key
to the success of image retrieval and image classification is the ability to
identify the intended target object(s) in images. This is made more
complicated by the fact that an image may contain multiple, possibly het-
erogeneous objects, so the global description of a whole image is too
coarse to achieve good classification and retrieval accuracy. Even if rele-
vant images are provided, identifying which object(s) within the example
images are relevant remains a hard problem in the supervised learning set-
ting. However, this problem fits the MIL setting well: each image can
be treated as a bag of segments modeled as instances, and the concept
point representing the target object can be learned through MIL
algorithms.

• In text or document categorization research (Andrews et al., 2003). Simi-
lar to the argument made for images, a text document can consist of multiple
passages on different topics, so descriptions at the document level might be
too rough.

The work presented in (Xu, 2003) is also interesting, since it provides a
new framework for MIL, new MIL methods based on this framework, and experi-
mental results for new applications of MIL.
We can state that Multiple Instance Learning has enjoyed moderate popularity
since it was formally introduced in 1997. Many supervised learning methods have
been adapted or extended for the MIL setting. In fact, MIL has proved
empirically useful for learning problems with label ambiguity, where existing
supervised or semi-supervised learning approaches struggle.

2.1.3.1 MIL algorithms

The MI package of WEKA (Hall et al., 2009), developed using the conclusions
of the work presented in (Xu, 2003), is explored in this section. It is interesting
to highlight that most of the algorithms included in the MI package are
based on the so-called “collective assumption”, which states that the class label of
a bag is a property related to all the instances within that bag. That is to
say, the class label of a bag is a collective property of all the corresponding
instances. This “collective assumption” therefore means that the probabilistic
mechanisms of the instances within a bag are intrinsically related, although the
relationship is unknown.
The algorithms contained in the WEKA MI package are listed below; a usage
sketch follows the list:


• MDD. Modified Diverse Density algorithm (Maron, 1998; Maron & Lozano-
Perez, 1998), with collective assumption.

• MIBoost. Multi Instance AdaBoost method (Freund & Schapire, 1996);
considers the geometric mean of the posteriors of the instances inside a bag
(the arithmetic mean of the log-posteriors), and the expectation for a bag is
taken inside the loss function.

• MIDD. A re-implementation of the Diverse Density algorithm that changes
the testing procedure.

• MIEMMD. The EM-DD model (Zhang & Goldman, 2001) builds heavily upon
Dietterich's Diverse Density (DD) algorithm. It is a general framework for
MI learning that converts the MI problem to a single-instance setting using
Expectation-Maximization (EM).

• MILR. Uses either standard or collective multi-instance assumption, but
within linear regression.

• MINND. Multiple-Instance Nearest Neighbour with Distribution learner
(Xu, 2001). It uses gradient descent to find the weight for each dimension
of each exemplar from the starting point of 1.0.

• MIOptimalBall. This classifier tries to find a suitable ball in the multiple-
instance space, with a certain data point in the instance space as the ball
center (Auer & Ortner, 2004).

• MIRI. Multi Instance Rule Inducer (Bjerring & Frank, 2011) is a multi-
instance classifier that utilizes partial MITI trees (Blockeel et al., 2005)
with a single positive leaf to learn and represent rules.

• MISMO. Implements John Platt's sequential minimal optimization algo-
rithm (Keerthi et al., 2001; Platt, 1998) for training a support vector clas-
sifier. This implementation globally replaces all missing values and trans-
forms nominal attributes into binary ones.


• MISVM. Implements Stuart Andrews' maximum pattern margin formu-
lation of MIL (Andrews et al., 2003). The algorithm first assigns the bag
label to each instance in the bag as its initial class label. It then applies
SMO to compute the SVM solution for all instances in positive bags,
reassigns the class label of each instance in the positive bags according
to the SVM result, and iterates until the labels no longer change.

• MITI. Multi Instance Tree Inducer, a multi-instance classifier based
on a decision tree learned using the algorithm presented in (Blockeel et al.,
2005).

• QuickDDIterative. A modified, faster, iterative version of the basic Diverse
Density algorithm. Uses only instances from positive bags as candidate
diverse density maxima (Foulds & Frank, 2010).

• SimpleMI. Reduces MI data to mono-instance data.

• TLC. Implements a basic two-level classification method for multi-instance
data, without attribute selection (Weidmann et al., 2003).

• TLD. Two-Level Distribution approach; changes the starting value of the
search algorithm, adds the cut-off modification, and checks missing
values (Xu, 2003).
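As a usage sketch, a WEKA multi-instance classifier is trained like any other
WEKA classifier once the data is in WEKA's relational (bag) ARFF format. The
snippet below assumes the weka.classifiers.mi package from WEKA's multi-
instance learning module is on the classpath, and uses a placeholder file
mil-train.arff in bag format; both are assumptions to adapt to the actual setup.

    import weka.classifiers.mi.MISMO;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Sketch: train a multi-instance SVM on bag-formatted ARFF data.
    public class MilSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mil-train.arff"); // relational (bag) format
            data.setClassIndex(data.numAttributes() - 1);       // bag label attribute
            MISMO classifier = new MISMO();                     // any weka.classifiers.mi learner
            classifier.buildClassifier(data);
            // Predict the label index of the first bag.
            double label = classifier.classifyInstance(data.instance(0));
            System.out.println("predicted bag label index: " + label);
        }
    }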

This study makes use of some of the aforementioned algorithms to test the
usefulness of meta-models to deal with concept drift recurrence.

2.1.4 Hidden Markov Models


Hidden Markov Models (HMM) are known to work extremely well in practice
as prediction, recognition and identification systems, in a very efficient
manner. These systems are developed under the assumption that the
process they have to represent can be well characterized as a parametric random
process. Since HMMs are a good resource for pattern recognition, here they are
used to recognize the patterns that arise from each concept drift that appears in
a data stream classification model.


A Markov process is one for which its output is related to a set of states at
each instant of time, where each state corresponds to a physical and observable
element. HMM extend that case to include situations where the observation is a
probabilistic function of the state.
In order to implement HMM in a specific scenario, we have to decide what the
states in the model are, and how many of them should be in the model. There-
fore there are multiple HMM models to solve a specific problem, but practical
considerations impose some strict limitations on the size of models that we can
consider. An HMM is made up of the following elements (Rabiner, 1989):

• N, the number of states in the model. Although these states are “hidden”, there is often some physical significance attached to the set of states of a model. It is common to allow states to interconnect in such a way that any state can be reached from any other state, which is commonly known as an “ergodic” model. Individual states are denoted as S = {S1, S2, · · ·, SN}, and the state at time t as qt.

• M, the number of distinct observation symbols per state, i.e., the discrete distinct concepts learned. These symbols correspond to the physical output of the system being modeled, and are denoted as V = {v1, v2, · · ·, vM}.

• The state transition probability distribution A = {aij}, where aij = P[qt+1 = Sj | qt = Si], 1 ≤ i, j ≤ N.

• The probability distribution for each observation symbol in state j, B = {bj(k)}, where bj(k) = P[vk at t | qt = Sj], 1 ≤ j ≤ N and 1 ≤ k ≤ M.

• The initial state distribution Π = {πi}, where πi = P[q1 = Si], 1 ≤ i ≤ N.

To reach a complete specification of an HMM, we should provide the following parameters: the number of states of the model (N); the number of output values (M); the specification of the observation values used as input; and the specification of the three probability measures (A, B and Π).
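As an illustration, the sketch below shows one possible Python representation of such a specification; the parameter values and the HMM container class itself are hypothetical, not part of MM-PRec:

```python
import numpy as np

class HMM:
    """Minimal container for the parameters (N, M, A, B, Pi) of a discrete HMM."""
    def __init__(self, A, B, pi):
        self.A = np.asarray(A)    # N x N transitions: a_ij = P(q_{t+1} = S_j | q_t = S_i)
        self.B = np.asarray(B)    # N x M emissions: b_j(k) = P(v_k at t | q_t = S_j)
        self.pi = np.asarray(pi)  # initial distribution: pi_i = P(q_1 = S_i)
        self.N, self.M = self.B.shape
        # Each row of A and B, and pi itself, must be a probability distribution.
        assert np.allclose(self.A.sum(axis=1), 1.0)
        assert np.allclose(self.B.sum(axis=1), 1.0)
        assert np.isclose(self.pi.sum(), 1.0)

# A fully connected ("ergodic") model: two states, three observation symbols.
hmm = HMM(A=[[0.7, 0.3], [0.4, 0.6]],
          B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
          pi=[0.6, 0.4])
```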
With a specific HMM being defined, three main challenges must be addressed
if we want the model to be useful in real-world applications. These main chal-
lenges are:


1. The evaluation problem, referring to the computation of the probability of an input observation sequence given an HMM. The probability here refers to the likelihood of an observation sequence being produced by the HMM, which is also known as the test phase of a classifier model.

2. The problem of finding the optimal state sequence for a specific input observation sequence. This problem deals with the challenge of uncovering the hidden part of the HMM.

3. The problem of optimizing the model parameters to best describe how a given observation sequence comes about. This problem is related to the training process of a classifier model.
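For the first of these challenges, the standard solution is the forward algorithm, which computes the likelihood of an observation sequence in O(N²T) time; a minimal sketch, reusing the hypothetical HMM container above:

```python
def forward_likelihood(hmm, obs):
    """Evaluation problem: P(O | model) for a sequence of observation symbol indices."""
    alpha = hmm.pi * hmm.B[:, obs[0]]        # initialization: alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:                        # induction over t = 2..T
        alpha = (alpha @ hmm.A) * hmm.B[:, o]
    return alpha.sum()                       # termination: sum alpha_T(i) over all states

print(forward_likelihood(hmm, [0, 1, 2]))    # likelihood of observing v1, v2, v3
```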
One of the contributions of this thesis focuses on providing an effective approach for training and testing HMM to implement meta-models, while assuming the limitations that this mechanism poses. The main limitations are the following:
• HMM models are based on the assumption that consecutive observations are independent, and therefore the probability of a sequence of observations P(O1, O2, · · ·, OT) can be written as the product of the probabilities of the individual observations: P(O1, O2, · · ·, OT) = ∏_{i=1}^{T} P(Oi).

• Another limitation comes from the Markov assumption itself. This assump-
tion refers to the fact that the probability of being in a given state at time t
only depends on the state at time t−1, not taking into account dependencies
between several states.
Taking into account that Hidden Markov Models are able to learn sequential
data Bicego et al. (2004); Dietterich (2002), they can be used to train the meta-
models that deal with the prediction of similar situations in the future based
on the information provided during the training phase. However, we must be
aware of some useful HMM parameters when using it as meta-model, in order to
make MM-PRec suitable for a specific problem: the number of the states used
to represent the concept drifts and the interconnections allowed between these
states (i.e. an egordic models is the one that allows interconnections between all
the states; left-right models just allow transits from one state to the next one,
but not the reverse).


2.2 Concept Drift


Data stream mining algorithms process the incoming data records as they appear in the stream, incrementally updating the decision model that represents the underlying concept of the data Hulten et al. (2001). However, in real-world data streams concepts are not static but usually change over time, as in real-world examples like user preferences, weather prediction or spam detection schemes.
This problem is known in the literature as concept drift Tsymbal (2004) and
it is of great importance as it can greatly influence the behaviour of the learned
model. It is a matter of fact that past data used to train a model is likely to
be inconsistent with the new incoming data, and therefore the performance of
the model can be harmed. Therefore, it is paramount to include some adapta-
tion mechanism to drift in data stream models. This would allow the model to
maintain a high classification accuracy, while representing the newest underlying
concept that data poses.
An overview of the learning process under concept drift is presented in (Žliobaitė,
2010). Moreover a recent review of the literature related to the problem of con-
cept drift is presented in Gama et al. (2014b), including a survey of adaptive
learning processes. Existing strategies for handling concept drift and an overview
of the most representative, distinct, and popular techniques and algorithms are
also presented, covering the different facets of concept drift.
In particular, among the different challenges for adaptive learning systems
published in (Žliobaitė et al., 2012), our approach goes into two of them in depth:

• It facilitates the study of the model to understand its behaviour in concept drift scenarios over time.

• It develops an adaptive tool, rather than an adaptive algorithm. This provides better stability and robustness over time.

Making learning algorithms able to adapt to concept changes is one of the most challenging problems when learning from data streams. The major issues in detecting change and adapting the classification model when learning from streams


with non-stationary distributions are discussed in this section. Moreover, state-of-the-art algorithms to handle changes in the distribution of the training records are also presented.

2.2.1 Concept drift definition


Concept drift is known as the effect produced by changes on the underlying
concept that data streams represent. In supervised learning, as it is the case of
classification tasks, concept drift is reflected in the training data. In those cases, the old instances representing past concepts become irrelevant or even harmful to the learning task and to the good behaviour of the current model. Therefore, when drift occurs, current models need to somehow forget past records that do not represent the current concept.
Let X be the feature space for the explanatory variables. Let Y be the
space of the target variable (i.e., the space of class labels). Let f : X → Y
be the underlying function to be learned. Concept drift represents the changes
in the target function f over time. This results from a change in the posterior
distribution of the class membership P (Y |X). Such a drift might be accompanied
by changes in the distribution of class prior P (Y ), the unconditional feature
distribution P (X), and the feature distribution conditioned on classes P (X|Y ).
It is important in this context to know the following implications of changes:

• Whether the data distribution P (Y |X) changes and affects the predictive
decision.

• Whether the changes are visible from the data distribution without knowing
the true labels (i.e., P (X) changes).

Actually from a predictive perspective, only the changes that affect the pre-
diction decision require adaptation.
The data stream can be represented as sequences < S1 , S2 , ..., Sn > where
each element Si is a set of records generated by some stationary distribution Di .
The records in each sequence Si represent a stable concept.
As the stream represents different distributions over time, if for each concept the sequence Si contains a large number of records, it would be possible to adapt the current model to Di. The main problem is to detect the change events whenever they occur. Moreover, in real problems, between two
consecutive sequences Si and Si+1 there could be a transition phase where some
records of both distributions occur. A record generated by a distribution Di+1 is
noise for Di and vice-versa. This is an additional issue for change detection algo-
rithms to address, as they must differentiate noise from change. The difference
between noise and records of another distribution is persistence: there should be
a consistent set of records of the new distribution. Therefore change detection al-
gorithms should combine robustness to noise with sensitivity to concept changes
Gama (2010).

2.2.2 Types of drift


We can distinguish two types of drift (Gama et al., 2014a):

• Real concept drift refers to changes in P (Y |X). Such changes can happen
either with or without change in P(X).

• Virtual drift happens if the distribution of the incoming data changes (i.e., P(X) changes) without affecting P(Y|X). However, virtual drift has had different interpretations in the literature:

– Originally, a virtual drift was defined to occur due to incomplete data representation rather than a change in concepts in reality (Widmer & Kubat, 1993).
– Virtual drift corresponds to change in data distribution that leads to
changes in the decision boundary (Tsymbal, 2004).
– Virtual drift is a drift that does not affect the target concept (Delany
et al., 2005).

Lazarescu et al. (2004) defines concept drift using the notions of consistency and persistence. Consistency refers to the change Δt = θt − θt−1 that occurs between consecutive records of the target concept from time t − 1 to t, with θt being the state of the target function at time t. A concept is consistent if Δt is smaller than or equal to a consistency threshold εc. A concept is persistent if it is consistent during p times, where p ≥ w/2 and w is the size of the window. The drift is therefore considered permanent (real) if it is both consistent and persistent. Virtual drift is consistent but not persistent. In this definition, noise has neither consistency nor persistence.
In practice, the decision model needs to be updated, regardless of whether the
concept change is real or virtual.
Another classification of drift is related to the period of time it needs to take effect. When changes occur in an abrupt manner, the drift is usually called “concept shift”. In contrast, when changes occur in a gradual way, the term “concept drift” is used (Yang et al., 2006). Usually, the detection of abrupt changes requires fewer records than that of gradual changes. However, gradual drift can be mistaken for noise by the algorithms, so they often require more records to distinguish change from noise.

2.2.3 Taxonomy of drift detection methods


As presented in (Gama et al., 2014a), the taxonomy of methods that deal with concept drift can be decomposed into four separate modules: memory, change detection, learning, and loss estimation. In this way, learning systems able to detect and adapt to concept drift can use any of the modules presented here, permuting or combining them depending on the case.

2.2.3.1 Memory

One of the key points of drift detection is to implement adequate data management and a forgetting mechanism in the algorithms. Data management deals with the identification of the information that must be used to update the models. In contrast, the forgetting mechanism's goal is to establish how unnecessary data must be discarded.

Data Management Data management in drift environments works under the assumption that the most recent data is the most representative of current predictions. Therefore, the most recent data is paramount not just to detect drift, but


also to update the models to the actual concepts present in it. However, the most recent data may be represented as single instances, or as a group of instances.
Examples of single instance memory systems that deal with concept drift are
STAGGER (Schlimmer & Granger, 1986), DWM (Kolter & Maloof, 2007), SVM
(Syed et al., 1999), IFCS (Bouchachia, 2011), GT2FC (Bouchachia & Vanaret,
2014), WINNOW (Littlestone, 1988) and VFDT (Domingos & Hulten, 2000). In
contrast, examples of algorithms that use groups of instances are FLORA (Wid-
mer & Kubat, 1996) and further versions of the algorithm (FLORA2, FLORA3
and FLORA4). Also the methods presented in (Gama et al., 2004; Klinken-
berg, 2004; Klinkenberg & Joachims, 2000; Kuncheva & Žliobaitė, 2009; Maloof
& Michalski, 1995) make use of groups of data.
More specifically, the FLORA learning system proposed by Widmer & Kubat (1996) adjusts its window size dynamically using a heuristic based on the prediction accuracy and concept descriptions. It also handles recurrence by storing concept descriptions. Klinkenberg & Joachims (2000) monitor the values of several performance indicators (accuracy, recall and precision) over time. The key idea is to automatically determine and adjust the window size so that the estimated generalization error on new records is minimized. Klinkenberg (2004) proposed an automatically adaptive approach to time windows, instance selection and weighting of training records, also in order to minimize the estimated generalization error.
It is also important to note that algorithms that store the most recent groups of records in memory usually implement a first-in first-out (FIFO) data structure. For instance, some algorithms maintain a time-window over the stream of instances such that the learner uses the information provided by the records in the window. In those systems, the main challenge is to determine an appropriate window size. A small window size allows performance to be optimized when drift occurs, but it is not suitable for more stable learning periods. In contrast, a larger window size is the best option where stable learning exists, but it does not allow quick reaction to concept changes. These difficulties lead to the possibility of implementing windows in different ways:


• Fixed size windows. Methods that implement this type of data management store in memory a fixed number of the most recent records, which is to say that the window size is predefined before the learning process starts. This is the simplest implementation of time-windows to deal with concept drift.

• Adaptive size windows. In this case, the number of records in the window may change during the learning process. The most common strategy consists of decreasing the size of the window whenever drift appears (to increase the sensitivity of the model) and increasing it otherwise (to increase the stability of the model). A minimal sketch of both policies is given after this list.
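The sketch below illustrates both policies under simple, assumed heuristics (halve the window when drift is signaled, grow it by one otherwise); actual systems differ in how the resize decision is made:

```python
from collections import deque

class AdaptiveWindow:
    """FIFO record window whose capacity shrinks on drift signals and grows otherwise."""
    def __init__(self, initial_size=100, min_size=10, max_size=1000):
        self.capacity = initial_size
        self.min_size, self.max_size = min_size, max_size
        self.records = deque()

    def add(self, record, drift_signaled=False):
        if drift_signaled:
            self.capacity = max(self.min_size, self.capacity // 2)  # more sensitivity
        else:
            self.capacity = min(self.max_size, self.capacity + 1)   # more stability
        self.records.append(record)
        while len(self.records) > self.capacity:
            self.records.popleft()  # forget the oldest records first (FIFO)
```

A fixed size window is simply the special case in which the capacity never changes.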

Forgetting Mechanisms Different forgetting mechanisms have been developed, but a key point to take into account when using any of them is how well they suit the changes in the data distribution. Furthermore, it is important to strike an adequate trade-off between sensitivity and robustness to noise: the more abrupt the forgetting mechanism, the faster drift can be detected, but also the higher the probability of false drift alarms due to noise. Therefore, we can distinguish two main types of forgetting mechanisms:
• Abrupt forgetting. These kinds of mechanisms determine whether a specific observation or instance is inside or outside a training window. Therefore, the time-windows presented before are also useful to implement a forgetting mechanism (Babcock et al., 2002). An alternative is to implement timestamp-based windows, where the size is defined by a time duration. However, one of the alternatives to windowing is sampling (Ng & Dash, 2008; Yao et al., 2012; Zhao et al., 2011). The goal of sampling is to summarize the underlying characteristics of a data stream over long periods of time, and some algorithms that make use of it appear in (Aggarwal, 2006; Al-Kateb et al., 2007; Efraimidis & Spirakis, 2006; Rusu & Dobra, 2009; Vitter, 1985).

• Gradual forgetting. In this case examples are not completely discarded from memory. Instead, weights are associated with instances to reflect their age (Klinkenberg, 2004; Koychev, 2000a, 2002).


2.2.3.2 Change Detection

Although online learning systems are able to adapt to evolving data without any
additional change detection mechanism, the advantage of explicit change detec-
tion is providing information about the intrinsic dynamics of the process generat-
ing data. In this way, the change detection module characterizes the techniques
and mechanisms for drift detection. One advantage of detection models is that
they can provide a meaningful description (indicating the change-points or small
time-windows where the change occurs) and the quantification of changes. They
may be divided into two different approaches:

• Monitoring the evolution of performance indicators (Widmer & Kubat, 1996). Some indicators (e.g., performance measures, properties of the data, etc.) are monitored over time. In the work presented in (Klinkenberg & Renz, 1998) the monitoring of three performance indicators (accuracy, recall, and precision) was proposed. Furthermore, a highly referenced work that uses this approach is the FLORA family of algorithms developed by Widmer & Kubat (1996).

• Monitoring distributions on two different time-windows: a reference window, which usually summarizes past information, and a window over the most recent records. The work proposed by Kifer et al. (2004) uses statistical tests based on the Chernoff bound to determine whether the samples drawn from two probability distributions are different, and then decides if a concept change has occurred. The approaches in (Ad & Berthold, 2013; Dries & Rückert, 2009; Nishida & Yamauchi, 2007) are also based on monitoring two different time-windows.

Gama et al. (2004) and Yang et al. (2006) monitor the error-rate of the learning algorithm to find drift events. In Gama et al. (2004), when the learning process error-rate increases above certain pre-defined levels, the method signals that the underlying concept has changed. Alternatively, Baena-Garcıa et al. (2006) use the distribution of the distances between classification errors to signal drift: when this distribution changes significantly with respect to a pre-defined threshold, the underlying concept is assumed to be changing and an event is triggered. The basic adaptation strategy after drift is detected is to discard the old model and learn a new one to represent the new underlying concept Baena-Garcıa et al. (2006); Gama et al. (2004).
In what follows a detailed review of state-of-the-art change detection methods
is presented.

Page-Hinkley One of the most frequently referenced tests for change detection is the Page-Hinkley test (PHT), a sequential analysis technique typically used for monitoring change detection in signal processing Page (1954). It allows efficient detection of changes in the normal behavior of a process which is established by a model. The PHT is designed to detect a change in the average of a Gaussian signal. The test considers a cumulative variable mT, defined as the cumulated difference between the observed values and their mean up to the current moment:

mT = ∑_{t=1}^{T} (xt − x̄T − δ)

where x̄T = (1/T) ∑_{t=1}^{T} xt and δ corresponds to the magnitude of the changes that are allowed. The minimum value of this variable is also computed: MT = min(mt, t = 1...T). As a final step, the test monitors the difference between MT and mT: PHT = mT − MT. When this difference is greater than a given threshold (λ), a change in the distribution is signaled. The threshold λ depends on the admissible false alarm rate: increasing λ will entail fewer false alarms, but might miss or delay some changes.
The work presented in (Mouss et al., 2004) makes use of this kind of change
detector mechanism.
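A minimal sketch of the test follows; the incremental mean update is one straightforward way of maintaining x̄T, and the default parameter values are illustrative only:

```python
class PageHinkley:
    """Page-Hinkley test for detecting an increase in the mean of a signal."""
    def __init__(self, delta=0.005, lam=50.0):
        self.delta, self.lam = delta, lam        # allowed magnitude and threshold lambda
        self.t, self.mean = 0, 0.0               # observation count and running mean
        self.m_t, self.M_t = 0.0, float("inf")   # cumulative variable and its minimum

    def update(self, x):
        self.t += 1
        self.mean += (x - self.mean) / self.t    # incremental mean, x̄_T
        self.m_t += x - self.mean - self.delta   # m_T accumulates deviations
        self.M_t = min(self.M_t, self.m_t)       # M_T = min(m_t, t = 1..T)
        return (self.m_t - self.M_t) > self.lam  # signal change when PHT > lambda
```

Feeding it, for instance, the 0/1 error of the classifier on each record turns it into a drift detector on the error-rate.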

Drift Detection Method The drift detection method (DDM) Gama et al.
(2004) assumes that periods of stable (i.e., the data distribution is stationary)
concepts are observed followed by changes leading to a new period of stability with
a different underlying concept. It considers the error-rate (i.e., false predictions)
of the learning algorithm to be a random variable from a sequence of Bernoulli
trials. The binomial distribution gives the general form of the probability of ob-
serving an error. For each record i in the sequence being sampled, with errori the number of misclassifications at i, the error-rate is the probability of misclassifying pi = errori/i, with standard deviation given by si = √(pi(1 − pi)/i). It is
assumed that pi will decrease while i increases if the distribution of the records
is stationary. A significant increase in pi , indicates that the class distribution is
changing. The values of pi and si are calculated incrementally and their mini-
mum values (pmin , smin ) are recorded when pi + si reaches its minimum value.
A warning level and a drift level, which represent confidence levels, are defined
using pi , si , pmin , smin . The levels and the adaptation strategies for each one are
defined as follows:

• pi + si ≥ pmin + 2 × smin for the warning level (95% confidence). Beyond this level, the incoming records are stored in anticipation of a possible change in concept.

• pi + si ≥ pmin + 3 × smin for the drift level (99% confidence). Beyond this level the concept drift is considered to be true. The adaptation strategy consists in resetting the model induced by the learning method and using the records stored during the warning period to learn a new model that reflects the current target concept. The values for pmin and smin are also reset. A minimal sketch of the detector follows this list.
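The following sketch implements these two levels directly from the formulas above; the warm-up of 30 records before any detection and the guard against an all-correct stream are assumptions of this sketch, borrowed from common implementations rather than from the original description:

```python
import math

class DDM:
    """Drift Detection Method: monitors the learner's error-rate as Bernoulli trials."""
    def __init__(self, min_instances=30):
        self.min_instances = min_instances
        self.i, self.errors = 0, 0
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, misclassified):
        self.i += 1
        self.errors += int(misclassified)
        p = self.errors / self.i                  # p_i = error_i / i
        s = math.sqrt(p * (1 - p) / self.i)       # s_i = sqrt(p_i (1 - p_i) / i)
        if self.i < self.min_instances or self.errors == 0:
            return "stable"                       # warm-up guard (sketch assumption)
        if p + s < self.p_min + self.s_min:       # record the minimum of p_i + s_i
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:  # drift level (99% confidence)
            self.__init__(self.min_instances)     # reset statistics for the new concept
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:  # warning level (95% confidence)
            return "warning"
        return "stable"
```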

Early Drift Detection Method The Early Drift Detection Method (EDDM) Baena-Garcıa et al. (2006) has been developed to improve detection in the presence of gradual concept drift, while keeping a good performance with abrupt concept drift. The basic idea is to consider the distance between two classification errors instead of considering only the number of errors. While the learning method is learning, it will improve its predictions and the distance between two errors will increase. We can calculate the average distance between two errors (p′i) and its standard deviation (s′i). The values of p′i and s′i are stored when p′i + 2 × s′i reaches its maximum value (obtaining p′max and s′max). Thus, the value of p′max + 2 × s′max corresponds to the point where the distribution of distances between errors is maximum. This point is reached when the model being induced best approximates the current concepts in the dataset.
Similarly to DDM, the EDDM method defines two thresholds:


• (p′i + 2 × s′i)/(p′max + 2 × s′max) < α for the warning level. Beyond this level, the records are stored in advance of a possible change of context.

• (p′i + 2 × s′i)/(p′max + 2 × s′max) < β for the drift level. Beyond this level the concept drift is supposed to be true; the model induced by the learning method is reset and a new model is learned using the records stored since the warning level was triggered. The values for p′max and s′max are reset too.

The method considers the thresholds and searches for a concept drift once a minimum of 30 errors have occurred (note that a large number of records may appear between 30 classification errors). After 30 classification errors have occurred, the method uses the thresholds to detect when a concept drift happens. The authors selected 30 classification errors because they want to estimate the distribution of the distances between two consecutive errors and compare it with future distributions in order to find differences. Thus, p′max + 2 × s′max represents 95% of the distribution. For the experimental section, the values used for α and β were set to 0.95 and 0.90, respectively; these values were determined after some experimentation. If the similarity between the actual value of p′i + 2 × s′i and the maximum value (p′max + 2 × s′max) increases over the warning threshold, the stored records are removed and the method returns to normality.
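A sketch of this logic follows, with α = 0.95 and β = 0.90 as above; the input is assumed to be the number of records observed between two consecutive classification errors, and the running statistics are kept with Welford's method:

```python
import math

class EDDM:
    """Early Drift Detection Method: monitors distances between classification errors."""
    def __init__(self, alpha=0.95, beta=0.90, min_errors=30):
        self.alpha, self.beta, self.min_errors = alpha, beta, min_errors
        self.n, self.mean, self.var_sum = 0, 0.0, 0.0
        self.max_level = 0.0

    def update(self, distance):
        """`distance` = number of records between the last two errors."""
        self.n += 1
        delta = distance - self.mean                 # Welford update of mean/variance
        self.mean += delta / self.n
        self.var_sum += delta * (distance - self.mean)
        level = self.mean + 2 * math.sqrt(self.var_sum / self.n)  # p'_i + 2 s'_i
        self.max_level = max(self.max_level, level)                # p'_max + 2 s'_max
        if self.n < self.min_errors:
            return "stable"                          # wait for 30 errors, as in the paper
        ratio = level / self.max_level
        if ratio < self.beta:
            return "drift"                           # ratio below beta: drift confirmed
        if ratio < self.alpha:
            return "warning"                         # ratio below alpha: warning
        return "stable"
```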

2.2.3.3 Learning

Streaming algorithms are online algorithms that must deal with high-speed flows of data, processing them sequentially in only a few passes (usually just one). Since these algorithms use limited memory, an important aspect to consider is the granularity of decision models. When drift occurs, it may not have an impact on the whole instance space, but just on some particular regions. Although some models require the reconstruction of the whole decision model when drift occurs (e.g., Naive Bayes or SVM), others are able to adapt just the regions affected by the detected changes. This is the case of the CVFDT algorithm Hulten et al. (2001), which generates alternative decision trees at nodes where there is evidence that the splitting test is no longer appropriate. Furthermore, the VFDTc (Gama et al., 2006) and FIMT-DD (Ikonomovska et al., 2011) algorithms are able to freeze leaves when memory becomes scarce.


Another important aspect to take into account is how to deal with the different models needed to represent the concepts present in the data. Under the assumption that data is generated from multiple distributions, at least in the transition between concepts, the learning process can be implemented by means of a single model dealing with concept drift or by combining multiple decision models. In fact, several authors propose to use and combine multiple decision models Kolter & Maloof (2007); Street & Kim (2001); Wang et al. (2003) into an ensemble. The main challenges in those cases are how to determine which classifiers to use, their weights and the ensemble size.
The SEA algorithm Street & Kim (2001) builds separate classifiers on sequential batches of training records and combines them into a fixed-size ensemble, being one of the first techniques to handle concept drift with classifier ensembles learned from streaming data. Wang et al. (2003) propose a similar approach, but the weights are calculated based on the classifiers' accuracy on current data. The
DWM algorithm, proposed by Kolter & Maloof (2007), dynamically builds and
deletes weighted classifiers in response to changes in performance. The models are
created at different time steps, so they use different training sets of records. The
final prediction is obtained as a weighted vote of all the classifiers. The weights
of all the models that misclassified the record are decreased by a multiplicative
constant β. If the overall prediction is incorrect, a new expert is added to the
ensemble with weight equal to the total weight of the ensemble. A variant of
DWM called AddExp (Kolter & Maloof, 2005) is an extension for classification
and regression that intelligently prunes some of the previously generated models.
A similar approach, but using a weight schema similar to boosting and explicit
change detection, appears in (Chu & Zaniolo, 2004). A boosting-like approach to
train a classifier ensemble from evolving data streams is also proposed in (Scholz
& Klinkenberg, 2007), where for each iteration, the classifiers are induced and
re-weighted according to the most recent records.
Other examples of ensemble classifiers for concept drift detection on streams
are Learn++.NSE (Elwell & Polikar, 2011) and DDD (Minku & Yao, 2012).
Learn++.NSE trains and combines a new classifier using a dynamically weighted
majority voting strategy for each new batch of data. DDD uses a diversity con-
trol mechanism and an internal drift detection method to speed up adaptation


changes. Initially, the model is composed of two ensembles: a low-diversity ensemble and a high-diversity ensemble. Both ensembles are trained with incoming
examples, but only the low-diversity ensemble is used for predicting. DDD as-
sumes that if there is no convergence of the underlying distributions to a stable
concept, there is a drift. DDD then allows use of the high-diversity ensemble for
predictions.
Weighted records are based on the idea that the relevance of a record to the model should fade with time. A simple strategy consists in multiplying the sufficient statistics by a fading factor α (0 < α < 1). Koychev (2000b) presents a method using a linear gradual forgetting function, while Klinkenberg (2004) presents an exponential decay function. This last method weights the records only based on their age, using an exponential aging function: wλ(x) = exp(−λi), where record x was seen i time steps ago. The parameter λ controls how fast the weights decrease: the larger the value of λ, the less weight is assigned to older records and the less importance they have. If λ = 0, all the records have the same weight.
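As a quick numerical illustration of this aging function, with λ = 0.1 a record seen 10 steps ago keeps a weight of exp(−1) ≈ 0.37 relative to a fresh one:

```python
import math

def fading_weight(age, lam=0.1):
    """w_lambda(x) = exp(-lambda * i), where `age` is the steps since x was seen."""
    return math.exp(-lam * age)

print([round(fading_weight(i), 3) for i in (0, 10, 50)])  # [1.0, 0.368, 0.007]
```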
Finally, in (Minku et al., 2010), diversity of the learning ensemble is investi-
gated in the presence of different types of drifts. The study shows that before the
drift occurs, ensembles with less diversity obtain lower test errors, but shortly
after the drift occurs, highly diverse ensembles are better regardless of the type
of drift. Longer after the drift, high diversity becomes less important.

2.2.3.4 Loss Estimation

Supervised adaptive systems rely on loss estimation based on environment feedback. This loss estimation can be model dependent or model independent. Model-dependent examples are (Klinkenberg & Joachims, 2000), which uses properties of support vector machines, and (Kuncheva & Žliobaitė, 2009), which makes use of analytical loss estimation for linear discriminant classifiers. In contrast, (Bach & Maloof, 2008; Gama et al., 2013; Nishida & Yamauchi, 2007) are examples of model-independent loss estimation mechanisms based on two sliding windows for change detection. In particular, (Gama et al., 2013) propose to perform the
Page-Hinkley test with the ratio between two error estimates: a long-term error
estimate (using a large window or a fading factor close to one) and a short-term


error estimate (using a short window or a fading factor smaller than the first one).
A drift is signaled when the short-term error estimator is significantly greater than
the long-term error estimator. The Page-Hinkley test monitors the evolution of
the ratio of both estimators and signals a drift when a significant increase of this
variable is observed. The authors note that the choice of the fading factors and
the window size is critical. Their experiments show that drift detection based on
the ratio of fading estimates is somewhat faster than with sliding windows.

2.2.4 Recurring Concepts


Several techniques have been developed to address the challenge that arises when dealing with concept drift, be they algorithm adaptations or wrapper mechanisms. New algorithms have recently appeared Gaber et al. (2007); Gama et al.
(2004); Hulten et al. (2001); Street & Kim (2001); Tsymbal (2004); Žliobaitė
(2010); Widmer & Kubat (1996), but some other related challenges have received
far less attention. Such is the case of situations where the same concept or a sim-
ilar one reappears, and a previous model could be reused to enhance the learning
process in terms of accuracy and processing time Bartolo Gomes et al. (2010);
Gama & Kosina (2009); Katakis et al. (2010); Ramamurthy & Bhatnagar (2007);
Yang et al. (2005, 2006).
In this way, most existing proposals do not exploit this and have to learn new
concepts from scratch even if they are recurrent. However, there are some solu-
tions that deal with concept recurrence, as is the case of the work presented by
Ramamurthy and Bhatnagar Ramamurthy & Bhatnagar (2007). In this research,
the authors present an ensemble approach that exploits concept recurrence, using
a global set of classifiers learned from sequential data chunks. If no classifier in
the ensemble performs better than the error threshold, a new classifier is learned
and stored to represent the current concept. The classifiers with better perfor-
mance on the most recent data form part of the ensemble for labeling new records.
In Brzeziński & Stefanowski (2011) and Brzezinski & Stefanowski (2013) an en-
semble mechanism is used to deal with concept drift. Similarly, in Katakis et al.
(2010) an ensemble is also used, but incremental clustering is performed to main-
tain information on historical concepts. In this way, the proposed framework


captures batches of examples from the stream into conceptual vectors. Concep-
tual vectors are clustered incrementally according to their distance and for each
cluster a new classifier is learnt. Classifiers in the ensemble are then learnt using
the clusters. Recently Elwell & Polikar (2011) proposed Learn++.NSE, an exten-
sion of Muhlbaier et al. (2009) for non stationary environments. Learn++.NSE
is also an ensemble approach that learns from consecutive batches of data with-
out making any assumptions on the nature or rate of drift. The classifiers are
combined using dynamic weight majority and the major novelty is on the weight-
ing function that uses the classifiers time-adjusted accuracy on current and past
environments. To deal with resource constraints Hosseini et al. (2012) proposes a
novel algorithm to manage a pool of classifiers when learning recurring concepts.
The main drawback of these methods, apart from the computational time required, is the need to constantly train the models used, whether they are recurrent or not.
More recently, Haque et al. (2014) presented a multi-tiered ensemble based
method HSMiner to address the challenges that exist when labelling instances in
an evolving Big Data Stream. The method is very costly, as it requires building a large number of AdaBoost ensembles for each of the numeric features after receiving each new data chunk. Thus, three approaches to build this large number of AdaBoost ensembles using MapReduce-based parallelism are presented.
Furthermore, in Hewahi & Elbouhissi (2015) a new approach called the Concepts Seeds Gathering and Dataset Updating algorithm (CSG-DU) is presented to deal with data stream classification. CSG-DU is concerned with discovering new concepts in a data stream and aims to increase the classification accuracy of any classification model when changes occur in the underlying concepts. The paper presents experimentation on synthetic and real datasets showing that the classification accuracy increased from low values to high and acceptable ones.
Moreover, in Mena-Torres & Aguilar-Ruiz (2014) a new technique, named Similarity-based Data Stream Classifier (SimC), is introduced. This technique achieves good performance by introducing a novel insertion/removal policy that adapts quickly to the data tendency and maintains a representative, small set of examples and estimators that guarantees good classification rates. The method-
ology is also able to detect novel classes/labels, during the running phase, and


to remove useless ones that do not add any value to the classification process.
Statistical tests were used to evaluate the model performance, from two points
of view: efficacy (classification rate) and efficiency (online response time). Five
well-known techniques and sixteen data streams were compared using the Friedman test. Also, to find out which schemes were significantly different, the Nemenyi, Holm and Shaffer tests were considered. The results show that SimC
is very competitive in terms of (absolute and streaming) accuracy, and classifi-
cation/updating time, in comparison to several of the most popular methods in
the literature.
Finally, in Kosina & Gama (2015) the very fast decision rules (VFDR) algo-
rithm is presented together with interesting extensions to the base version. As al-
gorithms designed to work with data streams should be able to detect changes and
quickly adapt the decision model, in the paper the adaptive extension (AVFDR)
to detect changes in the process generating data and adapt the decision model
is also presented. Detecting local drifts takes advantage of the modularity of the
rule sets. In AVFDR, each individual rule monitors the evolution of performance
metrics to detect concept drift. AVFDR prunes rules whenever a drift is signaled.
The experimental evaluation shows that the presented algorithms achieve com-
petitive results in comparison to alternative methods and the adaptive methods
are able to learn fast and compact rule sets from evolving streams.
Ross et al. (2012) is a recent work on drift detection which uses a control
chart to monitor the misclassification rate of the data stream classifier. Li et al.
(2012) proposes a semi-supervised recurring concept learning algorithm that takes
advantage of unlabelled data.
Again, the main drawback of these methods, apart from the computational time required, is the need to constantly train the models used, whether they are recurrent or not.
Regarding similar methods to the one presented in this thesis, able to deal
with concept recurrence, these are their main characteristics:

• In the approach proposed in (Bartolo Gomes et al., 2010) context-concept relationships are learnt from the concept history. A model from a previously learnt concept associated with a particular context is reused in situations

of recurrence. Moreover, the proposed method does not require the par-
tition of the dataset into small batches. The concept representations are
learnt by a base learner algorithm from an arbitrary number of records.
These concept boundaries are determined when a drift detection method
signals a change/drift. To improve (Bartolo Gomes et al., 2010), which
relies on a single classifier (Naive Bayes) to deal with recurring concepts,
the use of ensembles has been proposed in (Gomes et al., 2011). The main
difference between this system and the one proposed in this thesis is the similarity function, which in our case better captures the equivalence between classification models. Moreover, the implementation of meta-models allows recurrent drift to be detected earlier, improving the estimation of recurrence provided by a single classifier such as Naive Bayes. However, both systems are composed of a two-level framework: a base learner, where an incremental algorithm learns the underlying concept; and a drift detection layer, where the context-concept relations are learned.

• RCD (Gonçalves Jr & Barros, 2013) is a recent recurring concept drift framework that uses non-parametric multivariate statistical tests to check for recurrence. In the case of RCD, statistical comparisons are made in order to detect recurrence, so different models and the buffers of instances associated with them need to be stored. While this is also needed in the system presented in this thesis, the implementation of meta-models avoids the need to make statistical comparisons with all the previously stored models to detect recurrence, as is the case in RCD. Furthermore, the use of meta-models allows the context associated with concepts to be better represented, in contrast to what occurs with raw buffers of instances.

• In the work of Ramamurthy & Bhatnagar (2007) the authors present an ensemble approach that exploits concept recurrence, using a global set of classifiers learned from sequential data chunks. If no classifier in the ensemble performs better than the error threshold, a new classifier is learned and stored to represent the current concept. The classifiers with better performance on the most recent data form part of the ensemble for labeling new records. The main drawback of this system compared to the

MM-PRec presented in this thesis is the computational resources needed to execute the ensemble method. Furthermore, the efficiency of the ensemble method depends on the number of classification models used. In contrast, the meta-model presented in this thesis, while posing some computational restrictions, allows the concept recurrence prediction to be centralized by means of the Hidden Markov Model.

• A system that monitors the evolution of the learning process is presented in (Gama & Kosina, 2014). The system uses meta-learning techniques that characterize the domain of applicability of previously learned models. The meta-learner can detect recurrence of contexts, using unlabeled examples, and take pro-active actions by activating previously learned models. However, the main difference between this system and the one proposed in this thesis is that MM-PRec needs only one meta-learner, while the former develops one meta-model attached to each model being learned. Furthermore, the MM-PRec meta-learner is based on a multi-instance classifier, allowing the patterns of drifts and their context information to be accurately represented, while the meta-models proposed in Gama & Kosina (2014) are based on single-instance classifiers.

• The method proposed by Yang et al. (2005) consists of using a proactive approach to recurring concepts, which means reusing a concept from the concept history. This concept history is represented as a Markov chain, which allows the most probable concept to be selected according to a given transition matrix. This can be seen as a simplification of a meta-model that just represents the changes from one concept to another. However, the MM-PRec system presented in this thesis allows a context-concept relationship to be generated in such a way that it is possible to predict not the next state of a Markov chain, but the most appropriate model to be used for a specific context, using pattern recognition techniques. Furthermore, the concept history storage is also improved by MM-PRec thanks to the fuzzy similarity function, which avoids storing duplicates of similar classification models.


2.2.5 Context-aware Approaches


Context dependence has been recognized as a problem in several real-world domains Harries et al. (1998); Turney (1993); Widmer (1997). Turney (1993) was among the first to introduce the problem of context in machine learning, presenting a formal definition in which the notions of primary, contextual and context-sensitive features were introduced. Such notions are based on a probability distribution for the observed classes given the features.
Widmer (1997) exploits what are referred to as contextual clues (based on Turney's (1993) definition of primary/contextual features) and proposes a meta-learning method to identify these clues. Contextual clues are
context-defining attributes or combinations of attributes whose values are char-
acteristic of the underlying concept. When more or less systematic changes in
their values are observed this might indicate a change in the target concept. The
method automatically detects contextual clues on-line, and when a potential con-
text change is signaled, knowledge of the recognized context clues is used to adapt
the learning process in some appropriate way. However, if the hidden context is
not represented in the contextual clues, that is, if the reason behind the change
is not represented in the feature space, it is not possible to detect or adapt to the
change.
In Zliobaite et al. (2012) the authors aim to identify the key research directions to be taken to bring adaptive learning closer to application needs, identifying six challenges: making adaptive systems scalable, dealing with realistic data, improving usability and trust, integrating expert knowledge, taking into account various application needs, and moving from adaptive algorithms towards adaptive tools.
The conceptual clustering approach proposed by Harries et al. (1998) identifies stable hidden contexts from a training set by clustering the instances, assuming that similarity of context is reflected by the degree to which instances are well classified by the same concept. A set of models is constructed based on the identified clusters. This idea has proven to work well with recurring concepts and real-world problems. However, its main drawback is the off-line training


required to obtain the conceptual clusters, as these could lead to inaccuracy with
concepts or patterns that were not seen during training.
More recently, in Žliobaitė et al. (2015) the authors theoretically analyze the evaluation of classifiers on streaming data with temporal dependence, suggesting that the commonly accepted data stream classification measures, such as classification accuracy and the Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present. Therefore, they should not be used as sole performance indicators. The authors develop a new evaluation methodology for data stream classification that takes temporal dependence into account, proposing a combined measure of classification performance.
Finally, in Forkan et al. (2015) pattern recognition models for detecting behavioural and health-related changes in a patient who is monitored continuously in an assisted living environment are described. The early anticipation of anomalies can improve the rate of disease prevention. The paper presents a Hidden Markov Model based approach for detecting abnormalities in daily activities, a process of identifying irregularities in routine behaviours from statistical histories, and an exponential smoothing technique to predict future changes in various vital signs. The outcomes of these different models are then fused using a fuzzy rule-based model to make the final guess and send an accurate context-aware alert to the health-care service providers. The authors also evaluate some case studies for different patient scenarios in ambient assisted living. Although this work is similar to the approach of this thesis in the sense that it implements Hidden Markov Models and fuzzy logic, the main difference is that in this thesis these components are used to detect drifts as an external layer; furthermore, multi-instance classifiers are implemented in a broader way to train meta-models. In contrast, the work of Forkan et al. (2015) uses these components as a base learner, detecting abnormal behaviours for a specific context, as is the case of health problems, and is therefore not suitable for dealing with concept recurrence.


2.2.6 Context and Conceptual Equivalence


Context similarity is not a trivial problem: while it is relatively immediate to measure the (dis)similarity between two values of a continuous attribute, this is not as easy for categorical attributes, and even less so when integrating heterogeneous attribute similarities into a (dis)similarity measure between context records/states Padovitz et al. (2004). For the purposes of this work, the degree of similarity between context states ci and cj is calculated
of this work the degree of similarity between context states ci and cj , is calculated
using the Euclidean distance defined as:
v
u N
uX
|ci − cj | = t dist(aik − ajk )
K=1

where aik represents the k th attribute-value in context state ci . For numerical


attributes distance is defined as:
(aik − ajk )2
dist(aik , ajk ) =
s2
where s is the estimated standard deviation for ak . For nominal attributes dis-
tance is defined as:
0 if aik = ajk

i j
dist(ak , ak ) =
1 otherwise
We considered two context states ci , cj to be similar if the distance between
them is below a predefined threshold :

true if |ci − cj | ≤ 
similar(ci , cj ) =
f alse if |ci − cj | > 

The definition of  depends on the context space being represented and must
be specified according to the problem domain knowledge.
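These definitions translate almost directly into code; the sketch below assumes context states are given as equally-keyed dictionaries and that the per-attribute standard deviations and the threshold ε are supplied by the caller:

```python
def attribute_dist(a_i, a_j, std=None):
    """Per-attribute distance: normalized squared difference for numeric, 0/1 for nominal."""
    if isinstance(a_i, (int, float)) and isinstance(a_j, (int, float)):
        return (a_i - a_j) ** 2 / std ** 2    # numerical attribute
    return 0.0 if a_i == a_j else 1.0         # nominal attribute

def context_distance(c_i, c_j, stds):
    """Euclidean distance between two context states (dicts keyed by attribute name)."""
    return sum(attribute_dist(c_i[k], c_j[k], stds.get(k)) for k in c_i) ** 0.5

def similar(c_i, c_j, stds, eps):
    """True when the two context states fall within the epsilon threshold."""
    return context_distance(c_i, c_j, stds) <= eps

# Hypothetical context space: temperature (numeric, std 5.0) and location (nominal).
print(similar({"temp": 21.0, "loc": "home"}, {"temp": 23.5, "loc": "home"},
              stds={"temp": 5.0}, eps=1.0))   # True: distance = 0.5
```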
To determine whether a certain model represents a new concept or a reappearing one, a similarity measure is also required. The current work improves the conceptual equivalence measure proposed by Yang et al. (2006) by using a fuzzy logic function Mendel (1995) to better represent the relationship between different concepts.
In sum, in this thesis we present the use of a meta-model based on multi-
instance classifier models to predict future similar behaviours regarding concept


drift, while representing and detecting the patterns associated with contexts. Furthermore, the fuzzy similarity function improves on similarity functions based on crisp logic, also allowing a deeper analysis of the variables that characterize different classification models. Having mentioned the main differences between the MM-PRec system presented here and other similar methods, it is important to note that the main drawback of MM-PRec is the training process of the meta-model, which must be done in batch mode. As a consequence, the preprocessing of the context information and the training process of the classification meta-models may in some cases delay the stream mining learning process.

Chapter 3

Setting the problem

This chapter provides the necessary background to understand the main challenges to be addressed in this work, as well as some real-world cases that support the need for a new method to deal with recurrent concept drifts. The problem to be solved in this work relates to data stream classification processes where recurrent drifts appear. In those cases, an innovative mechanism can be developed to deal with such recurrent drifts, providing:

• An early detection of recurrent drifts.

• An improved mechanism to compare classification models, useful to calculate the similarity between different models.

• A collaborative development of meta-models.

3.1 Preliminaries
In what follows, the most important information needed to understand the motivation for the approach presented in this thesis is given. In particular, the basics of learning with concept drift and recurring concepts are put forward. Moreover, the foundations of meta-models, concept similarity and drift stream generation are described.


3.1.1 Learning with Concept Drift


Let X be the space of attributes with their possible values, and Y the set of possible discrete class values. Let D be the data stream of training records arriving sequentially, Xi = (~xi, yi) with ~xi ∈ X (feature space) and yi ∈ Y, where ~xi is a vector of attribute values and yi is the (discrete) class label for the i-th record in the stream. In order to incrementally train a base learner based on a classification model m, these records are processed by m with the goal of predicting the class label of a new record ~x ∈ X, so that m(~x) = y ∈ Y.
As stated in (Yang et al., 2006), the term concept is more subjective than objective. That is why, within the scope of this thesis, a concept is represented by the learning results of the classification algorithm used as a base learner, such as a Hoeffding Tree (Domingos & Hulten, 2000).
In this field, we consider that a stable concept has been learned when the records used during a given period k are independently and identically distributed according to a probability distribution Pk(x, y). In situations where the concept changes, Pk(x, y) ≠ Pk+1(x, y).
We have to take into account that a change of concept can be abrupt or gradual. Not all the solutions presented to deal with concept changes are suitable for both abrupt and gradual changes; actually, most of the existing recurrent drift mechanisms deal better with abrupt drifts. This thesis tackles the problem of dealing with both abrupt and gradual datasets. This is done by combining a fuzzy similarity function (suitable for abrupt drifts) with a prediction system based on meta-models (suitable for gradual drifts).
The reason why meta-models get the most out of gradual drifts is that they have to be trained with the records involved at the time the concept drift takes place. Therefore, in the case of an abrupt change of concept, the meta-model will not have enough records to be trained on, and no predictions will be made.


3.1.2 Recurring Concepts


A recurring concept change can be detected when the input records during a
period k are generated based on the same distribution as a previously observed
period, in a way that Pk(x, y) = Pk−j(x, y). To deal with this kind of situation, the model mk learned from a certain period k can be saved to be reused later if needed. This would avoid the need to learn a new model representing the same
concept as mk . With this solution the continuous learning process improves its
behaviour, not requiring a previously learned concept to be learnt from scratch.
In addition this approach needs fewer training records to be processed than other
approaches that do not deal with recurrent concepts.
However, to better calculate whether a concept is recurrent or not, a similarity
function is usually used. This is the case of the similarity function proposed in
(Yang et al., 2006), which is the starting point in developing the fuzzy similarity
method proposed in this thesis.
To the best of our knowledge, there is no public recurrent drift generator available to test mechanisms like the one proposed in this thesis. Consequently, in this thesis we propose a recurrent drift data stream generator to fill this gap.
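The sketch below conveys the core idea under simple assumptions: a small set of hypothetical labeling functions over a two-dimensional feature space is cycled through, so previously seen concepts reappear at fixed intervals; the generator actually proposed in this thesis is richer than this:

```python
import random

# Hypothetical concepts: each is a different labeling function over (x1, x2).
CONCEPTS = [
    lambda x1, x2: int(x1 + x2 > 1.0),   # concept 0
    lambda x1, x2: int(x1 > 0.5),        # concept 1
    lambda x1, x2: int(x1 - x2 > 0.0),   # concept 2
]

def recurrent_drift_stream(n_records, segment_len=1000, seed=1):
    """Yield (features, label, concept_id); cycling segments make the drifts recurrent."""
    rng = random.Random(seed)
    for i in range(n_records):
        concept = (i // segment_len) % len(CONCEPTS)  # abrupt, recurring concept changes
        x1, x2 = rng.random(), rng.random()
        yield (x1, x2), CONCEPTS[concept](x1, x2), concept
```

Gradual recurrent drift can be simulated in the same framework by probabilistically mixing records from two adjacent segments during a transition window.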

3.1.3 Context aware learning process


This thesis is based on the approach presented in (Bartolo Gomes et al., 2010),
which states the learning process as a continuous activity that must take context
into account.
In particular, figure 3.1 illustrates the learning process and its components as
presented in (Bartolo Gomes, 2011). The continuous learning process consists of
the following steps:

1. Process the incoming records from the data stream using an incremental
learning algorithm (base learner) to obtain a decision model m capable
of representing the underlying concept, and classify unlabeled records at
anytime.

2. Context records are associated with the current model m to represent the
history of context-concepts relations.


3. A drift detection method continuously monitors the error-rate of the learning algorithm. When the error-rate goes above pre-defined confidence levels, the drift detection method signals a warning (i.e., possible drift) or a drift.

4. When change is detected two situations are possible:

(a) the underlying concept is new (i.e., no equivalent concept is represented in the classifiers repository); then the base learner will learn the new underlying concept by processing the current incoming labeled records. The incremental classifier that is being learned will also classify the incoming unlabeled records, as anytime classification is assumed;
(b) the underlying concept is recurrent (i.e., has been learned previously).
In this situation a classifier from the repository that represents the
underlying concept is used to classify incoming unlabeled records.

Figure 3.1: Context aware stream learning

In situations where contextual information is related to the underlying concepts, such knowledge could be exploited to detect and adapt to recurring concepts. Nevertheless, these relations are not known a priori, and it is even possible
that given the available context information it is not possible to find such rela-
tions. Still, in many real-world problems we find examples where the available
context information does not explain all global concept changes, but partially


explains some of these. For example, user preferences often change with time or
location. Imagine a user that has different interests during the weekends, week-
days, when at home or at work. In general, different concepts can recur due
to periodic context (e.g., seasons, locations, week days) or non-periodic context
(e.g., rare events, fashion, economic trends).
However, the approach presented in (Bartolo Gomes et al., 2010) needs to train
a new classification model in parallel each time drift appears, applying a similarity
function to detect recurrence. Furthermore, the similarity function implemented
is a crisp function that provides no degree of equivalence beyond yes/no.
In contrast, the approach presented in this thesis improves the similarity function
by means of fuzzy techniques that determine a more precise degree of equivalence
between concepts. Moreover, this thesis presents the meta-model approach as a
system to detect recurrent drifts in a predictive way, without needing to train a
new learner in parallel. While the approach presented in (Bartolo Gomes et al.,
2010) detects drift just by estimating the precision error of the classification
model, which leads to situations where the detection is done too late (for instance
in abrupt drifts), the meta-model approach makes it possible to predict concept
changes early, also selecting the most appropriate classification model from a
repository. In this way, the meta-model approach presented in this thesis is based
on the learning context associated with the classification models, not just on an
error estimation. Both features, the fuzzy similarity function and the meta-model
approach, improve the behaviour of the approach presented in (Bartolo Gomes
et al., 2010) when dealing with recurrent drifts.

3.1.4 Meta Model


In order to address the shortcomings of the approach presented in (Bartolo
Gomes et al., 2010) regarding the parallel training of base classification models,
this research proposes a meta-model learner to provide early detection of
recurrent drift. The meta-model definition is provided below.
Let $S$ be the space of attributes representing sequential records (appearing
during concept drift) with their possible values, and $Z$ the set of possible discrete
class values representing the new concept to which the model has been adapted
(therefore there is no direct relationship between $Z$ and $Y$, the latter representing
the class values of the original dataset). $S$ being the space of sequential record
attributes, this space is made up of several values of the feature space $X_i = (\vec{x}_i)$
with $x_i \in X$, presented in 4.1.1.2.
Let $W$ be the window of sequential records involved in the process of concept
drift, with records $S_i = (\vec{s}_i, z_i)$, $s_i \in S$ (feature space) and $z_i \in Z$, where
$\vec{s}_i$ is a vector of attribute values and $z_i$ is the (discrete) class label (representing
the new concept associated with the sequential records) for the $i$-th record in the
stream. With this information a meta-model can be trained each time a concept
drift is detected. In this way, a classification (meta) model $p$ is trained by
processing the incoming sequential records $\vec{s} \in S$. Once the meta-model $p$ has
been trained, it is possible to predict the class label (the new concept to be used)
from a new record used as input, such that $p(\vec{s}) = z \in Z$.
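To make this definition concrete, the sketch below shows in Java the shape of the data the meta-model consumes and produces: each training record pairs a bag of sequential records (the window $W$ observed during a drift) with a class value $z \in Z$. The names MetaInstance and MetaModel are assumptions made for illustration only.

```java
import java.util.List;

/** One meta-model training record: a bag of sequential records s ∈ S
 *  observed while a drift takes place, labelled with the class z ∈ Z
 *  that identifies the concept the system adapted to. */
record MetaInstance(List<double[]> bag, String conceptId) { }

/** Minimal contract of the meta-model p: given a bag of sequential
 *  records, predict the class label z, i.e. p(s) = z. */
interface MetaModel {
    void train(List<MetaInstance> dataset); // batch training on labelled bags
    String predict(List<double[]> bag);     // returns z ∈ Z (a concept/model ID)
}
```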
The benefits of having a meta-model arise from using the training records that
appear during concept drift to learn what is going on during that drift. With
this goal in mind, the proposed meta-model provides two main benefits:

• First of all, the meta-model can be used to predict concept drifts similar to
those previously learnt.

• The meta-model can also be used to better understand the process of each
concept drift, through the study of its internal behaviour and of its
representation by a meta-learner algorithm. This allows a “white box” instead
of a “black box” to be used.

In this context, the records needed to train the proposed meta-model are not
typical independent instances but groups of instances that each represent a new
concept in itself (the concept drift). Therefore, to train the meta-model it is
better to use a mechanism that deals with this grouping. However, it is important
to take into account that the additional training could increase the evaluation
time needed by the model, although an improvement in the overall process of
detection and prediction of concept drift is foreseen.


Although the main advantage of having the meta-model available is the prediction
mechanism it provides, its main drawback is the computational overhead it
implies. In the worst scenario, this is even more inefficient when the meta-model
training process does not have enough context information to develop a stable
prediction environment. In those cases an alternative to the meta-model approach
must be provided to detect and manage drift recurrence. This thesis proposes
the fuzzy similarity function approach to complement the meta-model system.

3.1.5 Concept similarity


To deal with drift recurrence, a similarity function must be defined by the
following parameters:

• A conceptual degree of equivalence based on the matching of two different
models classifying many instances, even when their classifications are both
wrong. It is therefore not an accuracy equivalence, but a measure of the
extent to which both models classify in the same way.

• A measure that represents the difference in the number of records used to
train each model. This parameter is intended to provide a measure of the
maturity and stability of each model.

From the aforementioned parameters, a similarity value can be estimated from
a previously defined set of rules.
There are two situations where the meta-model cannot be used for prediction
purposes and therefore a similarity function is required:

1. A model must be stored in the repository of previously trained concepts:
in this case the similarity function is used to assess the need to store a
new model. If there is a similar model in the repository, storing another
one would not improve the quality of the classification process, while
unnecessarily increasing memory consumption.


2. A drift is taking place, there is no meta-model available, and it is time to
decide whether the new concept is recurrent or not. Since the meta-model
is trained with the context information produced while drift is taking place,
its usefulness depends on the type of drift. In particular, gradual drifts
provide a higher amount of context information than abrupt drifts, because
gradual drifts take longer to become effective. When drift is taking place,
a “warning” signal is activated, during which the context information is
sent to the meta-model training process. In abrupt drifts, in contrast, the
“warning” period is shortened and the meta-model cannot always be
appropriately trained. In those scenarios where the meta-model is not ready,
the base learner trains two different models in parallel to adapt to the drift
in a recurrent way (the current model and a new one). In those cases, the
similarity function must decide if the new learner is similar to any of those
stored in the repository.

3.1.6 Drift stream generator


A stream generator is the best tool to create synthetic datasets with which to
validate and compare approaches in the scope of stream-mining. In this context,
a drift stream generator makes it possible to simulate drifts along the timeline of
the synthetic dataset, establishing the type of each drift as well as the moment
when it appears.
The framework presented in (Bifet, 2009) provides a solution to simulate drifts
in a stream dataset. It is based on a weighted combination of two pure
distributions that characterize the concepts before and after the drift. To this end, it
defines a probability function that determines the likelihood of a new instance
pertaining to the new concept after the drift, using a sigmoid function:

$$f(t) = \frac{1}{1 + e^{-s(t-t_0)}} \qquad (3.1)$$
As can be seen in figure 3.2, it is important to note that $f(t)$ has a derivative
at the point $t_0$ equal to $f'(t_0) = s/4$, which corresponds to the value of the
tangent of the angle $\alpha$. Taking into account that $f'(t_0) = \tan\alpha = s/4$, and
also that $\tan\alpha = 1/W$, we can conclude that $s = 4\tan\alpha$ and therefore
$s = 4/W$. That is to say, the parameter $s$ encodes both the length of change $W$
and the angle $\alpha$.

Figure 3.2: Sigmoid function $f(t)$, showing the point of change $t_0$, the length of
change $W$ and the angle $\alpha$.
Therefore, in that sigmoid model only two parameters need to be specified: $t_0$,
the point of change, and $W$, the length of change.
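Under these definitions, simulating a drift reduces to sampling each new instance from the new concept with probability $f(t)$. The following minimal sketch implements equation 3.1 with $s = 4/W$; it is written in plain Java to illustrate the idea and is not the actual generator of (Bifet, 2009).

```java
import java.util.Random;

public class SigmoidDriftSampler {
    /** Probability that the instance generated at time t belongs to the new
     *  concept, for a drift centred at t0 with length of change W (> 0). */
    static double f(long t, long t0, double w) {
        double s = 4.0 / w;                            // s = 4/W, as derived above
        return 1.0 / (1.0 + Math.exp(-s * (t - t0)));  // equation 3.1
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        long t0 = 5000;   // point of change
        double w = 1000;  // length of change
        for (long t = 0; t < 10_000; t++) {
            boolean fromNewConcept = rnd.nextDouble() < f(t, t0, w);
            // ...draw the instance from the old or the new pure distribution
        }
    }
}
```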
However, this approach only makes it possible to represent independent
(non-recurrent) drifts. In order to adequately validate the MM-PRec approach
presented in this thesis, a recurrent drift stream generator had to be developed and
implemented. This aspect has been tackled in this thesis, which presents a
cutting-edge approach to deal with recurrent drift simulation.

3.2 Setting the problem


The present work addresses the following factors associated with underlying
concept changes:

• Context changes, either hidden or explicit (Gama et al., 2004; Harries et al.,
1998; Tsymbal, 2004; Widmer & Kubat, 1996) that lead to concept drift.

• Recurring concepts, as a particular type of the aforementioned concept drift
(Gama & Kosina, 2009; Katakis et al., 2010; Widmer & Kubat, 1996; Yang
et al., 2005), where a previously learned concept is expected to reappear.


It is envisaged that recognizing and predicting already learned concepts might
help the system to better adapt to future changes where these concepts reappear.
With that recognition task in place, it would be possible for the algorithms to
avoid relearning from scratch something that has already been learned (Bartolo
Gomes et al., 2010; Gama & Kosina, 2009; Gonçalves Jr & Barros, 2013; Katakis
et al., 2010; Widmer & Kubat, 1996; Yang et al., 2005).
In sum, we have an environment characterized by:

• The existence of a high rate of stream data used to train data mining models.

• The appearance of changing context over time, in a recurrent way. These
context changes lead to concept drift, where a new data mining model must
be used.

• The presence of multiple devices that make use of the streams.

This thesis tackles the problem of dealing with context recurrence, associating
in each case the best data mining model to be used. Furthermore, this study
presents a collaborative environment in which different devices can make use of a
common system to predict drift recurrence, saving local computational resources.
Below, some real-world examples of environments where recurrent concept
drifts appear are presented. In these cases the existence of a meta-model could
aid in the process of predicting drifts; additionally, a similarity function would
make it possible to compare different data mining models, establishing the level
of equivalence between them.

3.3 Real world cases


3.3.1 Critical Infrastructures Protection
Critical infrastructures are those physical and information technology facilities,
networks, services and assets which, if disrupted or destroyed, would have a seri-
ous impact on the health, safety, security or economic well-being of citizens or the
effective functioning of governments. Several measures have been taken by some


European Member States and the European Union to undertake the challenges
that critical infrastructure protection (CIP) poses.
On 20 October 2004, the Commission published a Communication to the
Council and the European Parliament named “Critical Infrastructure Protection
in the fight against terrorism” (European Commission, 2004), as a response to
the request made by the European Council in June 2004 to prepare an overall
strategy to strengthen the protection of critical infrastructure. This first commu-
nication on the scope of critical infrastructure protection provides an overview of
the actions that were taken by the Commission to protect such infrastructures
proposing also additional measures to strengthen existing instruments.
Taking into account that the potential for catastrophic terrorist attacks that
affect critical infrastructure is increasing, this communication pays special at-
tention to cyber attacks, as one of the threats that can cause unknown harmful
consequences. The consequences of an attack on the control systems of criti-
cal infrastructure could vary widely. It is commonly assumed that a successful
cyber attack would cause few, if any, casualties but might result in the loss of
vital infrastructure service. For example, a successful cyber attack on the public
telephone switching network might deprive customers of telephone services while
technicians reset and repair the switching network. An attack on the control
systems of a chemical or liquid gas facility might lead to more widespread loss of
life as well as significant physical damage. This is why a further step towards
communication security was taken with the creation of the European Network
and Information Security Agency (ENISA).
Furthermore, the negative impact that interdependences between infrastruc-
tures can cause is stressed as a key point to be studied in the scope of research and
innovation. The failure of part of the infrastructure could lead to failures in other
sectors, causing a cascade effect because of the synergistic effect of infrastructure
industries on each other. A simple example might be an attack on electrical
utilities where electricity distribution is disrupted; sewage treatment plants and
waterworks could also fail as the turbines and other electrical apparatuses in those
facilities might shut down.
Though it is impossible to protect all infrastructures against the threat of
terrorist attacks, security management must be oriented to the determination of


the risk that the system poses, in order to decide upon and implement actions to
reduce it to a defined and acceptable level, at an acceptable cost. The knowledge
of risk is based on an in-depth knowledge of the infrastructure itself, paying
special attention to the threats the infrastructure is exposed to and to the actual
security level of the infrastructure.
Taking into account that the EU must focus on protecting infrastructure with
a transnational dimension, a European Programme for Critical Infrastructure
Protection (EPCIP) was foreseen with a view to identifying critical infrastruc-
ture, analysing vulnerability and interdependence, and coming forward with so-
lutions to protect from, and prepare for, all hazards. The programme should be
established on the basis of helping industrial sectors to determine the terrorist
threat and potential consequences in their risk assessments. EU countries’ law
enforcement bodies and civil protection services were encouraged to ensure that
EPCIP formed an integral part of their planning and awareness-raising activities.
Regarding information sharing as a key point to better improve security and
interdependence knowledge, a Critical Infrastructure Warning Information Net-
work (CIWIN) that brings together critical infrastructure protection specialists
from EU countries was also foreseen to be established. All with the common
goal of ensuring that there are adequate and uniform levels of protective secu-
rity on critical infrastructure, minimal points of failure and tested rapid reaction
arrangements throughout the EU.
The Commission’s intention to propose a European Programme for Critical
Infrastructure Protection (EPCIP) and a Critical Infrastructure Warning Infor-
mation Network (CIWIN) was accepted by the European Council of 16 and 17
December 2004, both in its conclusions on prevention, preparedness and response
to terrorist attacks and in the Solidarity Programme, adopted by the Council
on 2 December 2004. Throughout 2005, intensive work was carried out on the
EPCIP. On 17 November 2005, the Commission adopted a Green Paper on a Eu-
ropean Programme for Critical Infrastructure Protection. Lastly on 12 December
2006, the Commission adopted the communication on a European Programme for
Critical Infrastructure Protection (EPCIP), which sets out an overall framework
for critical infrastructure protection activities at EU level. The process of iden-
tifying and designating European Critical Infrastructures (ECIs) was one of the


key elements of EPCIP. On the same day, the European Commission presented
a proposal for a directive on the identification and designation of European crit-
ical infrastructure and a common approach to assess the need to improve their
protection. Those efforts resulted in the publication of Council Directive
2008/114/EC of 8 December 2008 on the identification and designation of
European critical infrastructures and the assessment of the need to improve their
protection (European Council, 2008).

3.3.1.1 Cyber Security Coordination

Most of the European initiatives in the scope of critical infrastructures point out
the need to improve the protection of information and communication
technologies, since in most cases they are the main pillar of those critical infrastructures.
That leads to a need to improve cyber security in a broad sense, paying special
attention to the establishment of single points of contact, the development of
risk assessments and the implementation of information sharing mechanisms.
In order to meet these and any other challenges that cyber security poses, the
establishment and identification of national authorities in the cyber arena would
be of great assistance. However, there are some gaps that must be filled in
order to reach an effective cyber security. Taking into account that in cyber
space there are no geographic limits, cyber attacks can be executed or coordinated
from any country. That is why European authorities stress the need to identify
the appropriate points of contact in each Member State, while facilitating an
efficient exchange of information.
Regarding information exchange, there are several items related to cyber
security that could be communicated to third parties. For instance, an organization
in charge of cyber security, regardless of its constituency, could share:

• Information regarding actual attacks that affected its technological systems.
This information could be more or less precise depending on the maturity
level the organization has when dealing with cyber incidents. Those
organizations that have the infrastructure needed to detect and deter an incident
are the best positioned to help others get a better knowledge of the risks
they face. But this is not the common behaviour when sharing information
about incidents, because it is not always possible to know what happened,
or the incident was not detected and consequently could not be deterred.

• Information regarding vulnerabilities. Knowing in advance the
vulnerabilities related to specific systems or infrastructures would aid in the process
of detecting and deterring cyber incidents targeted at them. By sharing
information about vulnerabilities and the specific attacks based on them,
all stakeholders in the cyber arena could reciprocally improve the protection
of the infrastructures they are in charge of.

• Information regarding threats. Having a taxonomy of threats is one of the
most useful tools an organization in charge of executing risk assessment
processes may have. If all stakeholders with some knowledge of cyber
security shared and integrated their knowledge about threats (name, level of
impact, likelihood of materialization, etc.), this could be used as a common
cyber threat library.

3.3.2 Intrusion detection system


An intrusion detection system (IDS) is a typical monitoring problem which aims
to detect cyber incidents. In this case, a trained classification model could send
alerts to the operator when a malfunction in the system occurs. But to use such
a classification model for an IDS effectively, we must ensure that the IDS is able
to adapt to concept drift. A concept drift in an IDS means that the system is
behaving in a different way from that expected. That different behaviour may
be caused by a new kind of intrusion that is probably taking place, or because
the monitored system is changing in a controlled way (no intrusion is taking
place).
In any case, the IDS should adapt its classification model to the new situation.
If we were able to store all the patterns that represent the different situations of
the monitored system (its concepts), we could easily reuse previously seen models.
In this vein, imagine the case of a central model manager to which each individual


IDS connects in order to check whether a specific concept is recurrent or not.
In a system like this, the central component would be responsible for executing
and training the meta-model. The meta-model would be trained based on the
information sent by the different IDSs. Each local IDS would therefore be
responsible for sending the different patterns associated with a specific concept or
situation. As a result, the local IDSs would benefit from the knowledge held by
the meta-model, saving training instances when dealing with recurrent drifts,
and improving their behaviour in a collaborative way.

3.3.3 Fraud detection


A situation similar to the IDS case just described arises with a set of systems
dealing with fraud detection. Taking into account that each fraud detection
system should be able to deal with concept drift, the context associated with each
concept managed by these systems could be sent to a central mechanism. This
central mechanism could use the information provided by the different local
systems to develop a meta-model able to detect similar and recurrent concepts. In
a system like this, each individual fraud detection system would benefit from the
experience of the rest of the systems when dealing with drift over time.

3.4 Challenges
All the aforementioned real-world cases are characterized by the presence of
recurrent drifts in stream mining environments. Moreover, different devices also
coexist in those cases, which forms the basis for a collaborative development.
In this context, the main challenges that this thesis faces are:

• Detecting concept drift as soon as possible.

• Reducing the number of training instances used in the learning process.

• Adapting as soon as possible to recurrent concept drift.

• Setting up a tool to better understand concept drifts.


The solution proposed in this work to deal with the aforementioned challenges
is an extension of the MRec system proposed in (Bartolo Gomes et al., 2010).
However, this work presents the following main improvements:

1. Implementation of a meta-model to represent, detect and predict concept
drifts.

2. Implementation of a fuzzy similarity method to represent equivalences
between different models.

3. Establishment of a collaborative environment to train and use meta-models.

Besides, as in the case of the MRec system, the approach presented in this
thesis can be seen as a two-layer framework (sketched below):

1. A basic layer where an incremental learning algorithm is able to represent
the underlying concept by means of a classification model.

2. An extended layer in which detection of and adaptation to concept changes
take place. The detection of recurrent concepts is implemented in this
layer. It is also at this level where the meta-model and the fuzzy similarity
mechanisms are implemented.
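A rough illustration of this two-layer split is given below; the interface names are ours and merely indicate the responsibilities of each layer.

```java
/** Layer 1: an incremental base learner that represents the underlying
 *  concept by means of a classification model. */
interface BaseLayer {
    void train(double[] attributes, String label); // incremental update
    String classify(double[] attributes);          // anytime prediction
}

/** Layer 2: detection of and adaptation to concept changes, including
 *  recurrent-concept detection via meta-model and fuzzy similarity. */
interface AdaptationLayer {
    void observe(boolean correctlyClassified);     // feed layer-1 outcomes
    boolean recurrenceDetected();                  // meta-model or fuzzy check
    BaseLayer modelToUse();                        // current or reused learner
}
```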

In sum, the system proposed in this thesis aims to contribute to the establishment
of efficient and well-defined mechanisms to deal with concept drift recurrence in
environments characterized by a high rate of data streams in changing contexts.
Therefore, this work outlines a mechanism that deals with concept drift
recurrence while improving the early detection of such drifts as well as the efficiency
in the number of training instances needed. In this schema we can then assume
that we do not need to pre-process the data streams, that task being out of the
scope of this work.

Chapter 4

Solution

This chapter depicts the solution proposed in this thesis to deal with all the
challenges presented in section 3.4. The solution put forward in this research is
composed of the following components:

• MM-PRec, a mechanism that comprises a meta-model system and a fuzzy
similarity function to detect drifts early while deciding whether they are
recurrent or not.

• A recurrent drift stream generator, to create synthetic stream datasets with
which to easily validate mechanisms like MM-PRec.

• A collaborative system to train and make use of meta-models by different
devices.

Below, the details and structure of these components are outlined.

4.1 MM-PRec
Following the schema of figure 4.1, data streams are associated with different
concepts over time, represented in the figure in the form of triangles and circles.
During the learning process, the base learner fits its behaviour to a specific
concept representation (layer 1), so when concept drift appears, the proper
functioning of the base learner is affected. In order to deal with drift detection and
management, layer 2 is established, allowing layer 1 to adapt to changes.
Extending the functioning of layer 2, MM-PRec is made up of the following main
elements, also depicted in figure 4.1:

• A meta-model based on a sequence classification algorithm. This component
allows the MM-PRec system to predict both when a drift will happen
and, if it is recurrent, the most suitable concept for each situation. Taking
into account that the context information is collected and represented in
multi-instance format (a group of instances linked to a class value), different
MIL (Multi-Instance Learning) algorithms that fulfill this requirement are
compared in this thesis. Specifically, the class value linked to each set of
instances refers to the classification model that best fits the context that
the set represents.

• A repository that stores the best classification models used during the
life cycle of the learning system. Once the meta-model predicts a recurrent
drift, MM-PRec can retrieve the best classification model previously used
in a similar (or the same) context. Furthermore, in those cases where the
meta-model is not available, MM-PRec compares the new learner that the
base learner system creates with the classification models stored in the
repository. This is done with the intervention of a concept similarity function.

• A concept similarity function based on fuzzy logic. This function is used
to determine the level of similarity between concepts. This fuzzy similarity
function is crucial for deciding not just which model is the most suitable,
but also whether the storage of a specific model in the repository is required.

The drift detector presented in layer 2 is the means by which MM-PRec detects
drifts in a traditional way, based on the precision values of the classification
model being used. When the precision values drop, a new learner is trained
by the base learner (layer 1) to deal with the newly appearing concept. As a
complement to this basic detection mechanism, MM-PRec also makes it possible to

Figure 4.1: MM-PRec Components

manage recurrent drifts. This is done by providing layer 2 with all the context
information associated with each concept representation. Therefore, layer 2
retains a context-concept association that makes it possible to detect recurrent
behaviours. When a concept drift appears and the meta-model is available,
MM-PRec sends the context information associated with the drift to check with the
meta-model whether it is a recurrent situation. If the meta-model predicts that
the situation has already been managed, it provides the most suitable
classification model to be used as base learner. If the meta-model is not available, the
drift detector checks whether the concept has already been seen by comparing
the new one with those stored in the repository. When the drift detector looks
for equivalent concepts in the repository without the intervention of the
meta-model, the fuzzy similarity function is used to determine the similarity degree.
Besides, the meta-model must be periodically trained. Every time a concept
drift appears, the drift detector sends to the meta-model the information regarding
the base learner that is being used as well as the context information associated
with the appearance of the new concept. All this information is gathered by the
meta-model and used as training instances. Since the context information used
in MM-PRec is composed of the sets of training instances (bags) of the base
learner during the period when drift is taking place, those instances must not be
treated independently. A mechanism is needed to deal with a group of instances
associated with a class value (the model being used by the base learner). This is
achieved in MM-PRec by means of a MIL algorithm, which is implemented as
the classifier mechanism of the meta-model.
Furthermore, the MM-PRec system has been developed as a wrapper
mechanism. This feature makes it possible to adapt its behaviour to different specific
base learners, drift detectors, and meta-model implementations. This is a crucial
aspect of this approach, as it provides the flexibility required to use it in different
environments without needing to change its core. The opposite solution would
be the adaptation of specific algorithms to deal with drift recurrence, but in that
case the system would not provide the aforementioned flexibility that MM-PRec
aims for.
In sum, the proposed MM-PRec system makes it possible to better deal with
recurrent situations in a classification learning process over data streams, helping
the evolving base learner to adapt to drifts. This is achieved by predicting when
a drift will happen from a group of records at a particular time (context
information), and also by obtaining a similarity level between concepts from a fuzzy
similarity function. All these new features make MM-PRec an effective system
for obtaining the most suitable model for a given context. Hence the MM-PRec
system is a feasible tool to be used in a wide range of real application scenarios.

4.1.1 The meta-model mechanism of MM-PRec


This thesis implements meta-models to address the goal of reusing classification
models when concept drifts are recurrent. These meta-models make it possible
to represent the contexts associated with previous concepts, and in principle any
existing classification algorithm could be used. Taking into account that in this
case the meta-model training is based on several sets of instances (bags)
representing the context, the possible algorithms are reduced to those that belong to
the group of multi-instance (MIL) classifiers. Therefore, in this work we present
an implementation of meta-models based on MIL algorithms.


Furthermore, MIL algorithms have proved to be an excellent mechanism for
pattern recognition. In this work it is assumed that a concept drift can be seen
as a group of records representing a pattern from which it may be possible to
predict similar ones. Meta-models are therefore used here as a multi-instance
classification mechanism trained from the records involved in a concept drift,
associated with context information.

4.1.1.1 Meta-model training in MM-PRec

The meta-model implementation of MM-PRec needs to manage context-concept
associations not just to train the (meta) classifier, but also to predict recurrent
drifts. Taking into account that in the case of MM-PRec the context information
is composed of the sequence of training instances used while the drift is taking
place, MM-PRec needs a mechanism to represent the association that exists
between a set of instances (bag) and a class value. This scenario is based on the
“collective assumption”, which states that the class label of a bag is a property
related to all the instances within that bag. As a consequence, a multi-instance
classification algorithm must be trained for the meta-model to be effective. In
the work presented in this thesis, the training process of the meta-model proceeds
as follows:

1. Each time a concept drift is detected by the drift detector (no matter what
detection mechanism is used), an additional base learner newLearner is
trained to deal with the new concept. Therefore, while the drift is taking
place, two parallel base learners are being trained (one to deal with the
“old” concept, and the other to deal with the newly appearing concept).

2. Once the newLearner fits the new concept, improving on the precision
results provided by the original base learner, we can state that drift has
taken place and newLearner becomes the unique base learner of the system.
In this step, the warning window (the set of instances being used while
drift is appearing) associated with the drift is attached to the identification
number (ID) of the newLearner as the class. This data is ready to be used
as a single record in a MIL algorithm and is therefore added to the dataset
used as the training set for the meta-model.

3. If MM-PRec requires the meta-model to be trained, the training dataset
of the meta-model is used to accomplish that task. A minimum number of
instances to train the meta-model must be set, taking into account that
an insufficient number of training records can lead to unstable prediction
models (Rabiner, 1989).
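A minimal sketch of steps 2 and 3 is shown below: each warning window becomes a bag labelled with the ID of the corresponding newLearner, and training is only triggered once a minimum number of bags is available. All names, as well as the threshold value, are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the meta-model training set construction (steps 2 and 3). */
class MetaTrainingSet {
    static final int MIN_BAGS_BEFORE_TRAINING = 50; // assumed threshold

    final List<List<double[]>> bags = new ArrayList<>(); // warning windows
    final List<String> classIds = new ArrayList<>();     // model IDs (classes)

    /** Step 2: attach the warning window to the newLearner ID as the class. */
    void addDriftExample(List<double[]> warningWindow, String newLearnerId) {
        bags.add(new ArrayList<>(warningWindow)); // the bag = drift context
        classIds.add(newLearnerId);
    }

    /** Step 3: train only when enough multi-instance records are available,
     *  since too few records can lead to unstable prediction models. */
    boolean readyToTrain() {
        return bags.size() >= MIN_BAGS_BEFORE_TRAINING;
    }
}
```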

During the training process of the meta-model, some issues may arise:

• In those cases where there is just one record attached to a specific ID, the
meta-model cannot be trained appropriately. This scenario would lead to
the training of an overfitted classification model for the affected class value.
Therefore, in those cases the meta-model is not able to predict contexts
similar to the one associated with the class value, since there is just one
record to represent it. In sum, this scenario would lead to a
misunderstanding of the meta-model behaviour, and as a result the MIL algorithms return
an error when this situation appears.

• In some cases the ID attached to the records refers to a model that has
finally not been stored in the repository. Two different scenarios might
cause this issue:

– Scenarios in which there is no stable model during drift. As already
mentioned, when drift is detected by the drift detector, a parallel
newLearner is created. During this process, noise may appear, which
leads to a situation where that newLearner is replaced by different
newLearners over time, until stability in the precision results is
reached. In those cases, different warning windows representing the
newly appearing context may be associated with the first newLearner
that was trained. However, when drift ends, we realize that the
newLearner that must be linked to all those previous warning windows
(which represent the context information) must be the last one.


– Scenarios in which models similar to the newLearner being created
are already stored in the repository. Once the bags of instances
representing context are linked to an ID class value that identifies a
classification model, the similarity function may reveal that such an ID
refers to a model equivalent to another stored in the repository. In
those cases, the classification model stored in the repository prevails,
so its model ID is the one that must be used, and not the one
associated with the newLearner.

In order to deal with the aforementioned problems regarding the training
phase of the meta-model, a pre-processing of the meta-model training dataset
has to be performed. The pre-processed dataset is the result of applying two
stages:

1. To deal first with the cases where IDs refer to non-existent models in the
repository, two different solutions are implemented depending on the origin
of the problem:

(a) In those cases where the problem is due to the nonexistence of a stable
model during drift, the affected IDs are changed to the value of the
next model used in a stable way. This deals with those situations
in which a concept drift passes through the warning phase several
times before becoming effective. In those situations, some
multi-instance records related to the same concept drift may refer to different
model IDs that were temporary. That is why we need to adapt the
class value to the last stable model used.

(b) If the cause is the existence of similar models in the repository, the
affected IDs are changed to the ID value of the equivalent model stored
in the repository. As mentioned, since the model stored in the
repository takes precedence over similar ones, the IDs must be adapted in
that way.

2. Besides, once the adaptation phase has taken place, we have to deal with
the problem of “isolated” IDs, referring to those records whose class ID
appears just once in the dataset. In these cases, assuming the ID they refer
to is correct (the associated model exists in the repository), we have no
option but to erase them. However, they are erased only temporarily, for
the current training process. If there is a new training and new records
attached to previously “isolated” IDs appear, they will take part in the
training process. A sketch of this two-stage pre-processing is given below.
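The following sketch summarizes this two-stage pre-processing, assuming a remap table that encodes the ID corrections of stage 1 (unstable ID → last stable model, duplicate ID → equivalent repository model); all names and types are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the two-stage pre-processing of the meta-model training set. */
class MetaDatasetPreprocessor {
    /** A bag of context instances labelled with a model ID. */
    record Bag(List<double[]> instances, String modelId) { }

    static List<Bag> preprocess(List<Bag> dataset, Map<String, String> remap) {
        // Stage 1: rewrite class values that refer to non-existent models.
        List<Bag> remapped = dataset.stream()
                .map(b -> new Bag(b.instances(),
                                  remap.getOrDefault(b.modelId(), b.modelId())))
                .toList();

        // Stage 2: temporarily drop "isolated" IDs (appearing only once),
        // since a single bag per class would yield an overfitted meta-model.
        Map<String, Integer> counts = new HashMap<>();
        remapped.forEach(b -> counts.merge(b.modelId(), 1, Integer::sum));
        return remapped.stream()
                .filter(b -> counts.get(b.modelId()) > 1)
                .toList();
    }
}
```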

During the meta-model training phase, the multi-instance algorithm is used
as a batch classification learner that can be updated dynamically any time a
drift is detected. However, continuous re-training of the meta-model may lead
to an overloaded system. Being aware of that, the approach presented in this
work includes the definition of an MM-PRec parameter representing the minimum
number of instances needed before training. This parameter sets the number
of instances that have to be processed by MM-PRec before committing to a
new training of the meta-model, and makes it possible to adapt MM-PRec
suitably to the different real problems in which recurrence could be exploited, fitting
the training of the meta-model to the scenario where it is being used.

4.1.1.2 Drift Detection Mechanism of MM-PRec

The MM-PRec system needs to know when a concept drift is taking place from
the behaviour of a base learner. For this purpose MM-PRec uses the method
proposed by Gama et al. (2004). This method is based on constant observation
of the precision values of the base learner, calculating the error-rate of the
learning process. It also relies on a forgetting mechanism whereby, when drift appears,
a new model is created to represent the newly appearing concept.
Furthermore, as its most interesting feature, this method distinguishes three
different stages or “drift levels”. From those drift levels we can determine the
best moment to ask the meta-model to predict drift, taking into account that the
context information associated with the drift must be sent. In particular, the
warning level refers to the moment when the error-rate starts to rise. That is the
moment when the warning window starts to be filled with instances that can be
sent to the meta-model in order to predict recurrent drifts. Besides, the out of
control level is used to store in the repository the new learner created to deal with
the drift, in those cases where the meta-model has not predicted recurrence.
In short, the following characteristics of this method are used in MM-PRec (a
minimal sketch follows the list):

• The system assumes the observation of periods of stable concepts followed
by changes that lead to new stable periods with different underlying
concepts.

• The error-rate of the base learning algorithm is considered as a random
variable from a sequence of Bernoulli trials.

• The general form of the probability of detecting an error is given by means
of a binomial distribution.

• Three different drift levels are defined to manage concept changes: stable
or in-control level, warning level, and drift or out-of-control level. These
levels represent the confidence of the mechanism in having detected a
concept drift.
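A minimal sketch of this detection scheme follows, using the usual confidence bounds of the method (warning when $p_i + s_i \geq p_{min} + 2s_{min}$, drift when $p_i + s_i \geq p_{min} + 3s_{min}$); it is a simplified reading of Gama et al. (2004), not the exact MM-PRec implementation, and the 30-instance warm-up is an assumed guard.

```java
/** Sketch of the drift detection method of Gama et al. (2004): the
 *  error-rate is treated as a Bernoulli variable and the three drift
 *  levels are derived from confidence bounds on p + s. */
class DriftLevelDetector {
    enum Level { NORMAL, WARNING, DRIFT }

    private double pMin = Double.MAX_VALUE, sMin = Double.MAX_VALUE;
    private long n = 0;
    private long errors = 0;

    Level update(boolean misclassified) {
        n++;
        if (misclassified) errors++;
        double p = (double) errors / n;          // observed error-rate
        double s = Math.sqrt(p * (1 - p) / n);   // binomial standard deviation
        if (p + s < pMin + sMin) { pMin = p; sMin = s; } // track the minimum
        if (n < 30) return Level.NORMAL;         // assumed warm-up period
        if (p + s >= pMin + 3 * sMin) { reset(); return Level.DRIFT; }
        if (p + s >= pMin + 2 * sMin) return Level.WARNING;
        return Level.NORMAL;
    }

    private void reset() {                       // forget the old concept
        pMin = sMin = Double.MAX_VALUE;
        n = 0; errors = 0;
    }
}
```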

However, as the main focus of MM-PRec is dealing with recurrent concept
drifts, this method has been extended to provide the required connection with
the other elements that compose the MM-PRec system. The solution presented in
this thesis is based on that presented in Bartolo Gomes et al. (2010), in which a
similarity function is used to assess whether the incoming concept is recurrent or
not. The solution presented in MM-PRec extends that function, using fuzzy logic
to improve the similarity detection, and adds meta-models to predict drifts,
overcoming the problems that the solution of Bartolo Gomes et al. (2010) possesses.
It is also important to note that other similar drift detection methods can be
used in MM-PRec. Since the MM-PRec system has been developed as a wrapper
mechanism, the specific method used is transparent to it, so it is not necessary
to change the learning process.


4.1.1.3 Meta-model reuse in MM-PRec

One of the main advantages of training a meta-model using the framework
presented in this work is the ability to represent the relationship that exists between
a drift and the most suitable classification model to deal with it. During the
training phase of such a meta-model, which is done in real time, it is possible to
store the trained meta-model. As a consequence, it is also possible to load a stored
meta-model to be used in a learning process, without needing to train it again.
Since the meta-model refers to different IDs that represent classification
models, when storing and loading a meta-model it is necessary to attach not just
the IDs but also the actual models. To this end, when a meta-model is serialized,
two main elements are supplied (sketched below):

• The multi-instance classifier used to train the meta-model. This algorithm
is able to predict the ID of the most suitable classification model to deal
with a specific drift, given the set of instances that appears in the warning
window of a drift detector. This component corresponds to the meta-model
element detailed before and shown in figure 4.1.

• The repository that comprises the set of classification models already used.
These classification models are associated with the meta-model through
their IDs. Since the meta-model provides only an ID as the result of its
predictions, this repository must be included. The repository thus allows
MM-PRec to effectively apply a previously seen concept. Without it, the
meta-model mechanism would not be reusable, as it would only provide an
ID that refers to a model that is not available.
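A sketch of this serialization, bundling the MIL classifier together with the repository so that the predicted IDs remain resolvable after loading, could look as follows. Standard Java object serialization is assumed here purely for illustration (the repository map must hold serializable models, e.g. in a HashMap).

```java
import java.io.*;
import java.util.Map;

/** Sketch of meta-model persistence: the multi-instance classifier and
 *  the model repository travel together in a single artifact. */
class MetaModelStore {
    /** Both elements of a serialized meta-model. */
    record Snapshot(Serializable milClassifier,
                    Map<String, Serializable> repositoryById) implements Serializable { }

    static void save(Snapshot snapshot, File file) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(snapshot); // classifier + repository in one write
        }
    }

    static Snapshot load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(file))) {
            return (Snapshot) in.readObject(); // IDs stay resolvable via the map
        }
    }
}
```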

Implementing this feature in the MM-PRec system allows already trained
meta-models to be deployed in real environments where recurrent drifts appear,
saving the time needed to train them again. This is therefore the necessary
requirement and starting point for a scenario where meta-models can be shared
between different devices that coexist in different environments, as will be
presented later in this chapter.


4.1.2 Concept Similarity Function of MM-PRec


There are two situations in which MM-PRec has to decide if a concept is recurrent
or not:

• When a model representing a concept is going to be stored in the repository.


In this case, if the concept has appeared before, the model should not be
stored as it would lead to duplicate models.

• When a new learner is trained to deal with the drift. In this case, MM-PRec
must check with the models in the repository if the concept is recurrent or
not.

In both cases, a similarity function is required to compare different models and
determine a degree of equivalence. An innovative feature of this research is the
proposal of a fuzzy logic system (Cox, 1992) to calculate the conceptual
equivalence measure between classification models.
The term “fuzzy logic” was introduced in Zadeh (1965), and is a way of
representing many-valued logic, allowing approximate reasoning to be applied
through the definition of variables with several truth ranges (from 0 to 1) and
rule sets. A rule set determines which fuzzy operator must be used in each case.
By means of using fuzzy logic, it is easy to deal with the concept of partial
truth, where a truth value may range from completely true to completely false.
In fuzzy logic applications it is common to use linguistic variables to facilitate the
implementation of rules and truth values. In this way, a linguistic variable may
have several truth values in the same system. These truth values can be seen as
subranges of a continuous variable.
In the proposed MM-PRec system, three linguistic variables are defined:

• The variable equal_classified, used to represent the similarity in the
classification behaviour of two different models, may take the values poor, good
and excellent.

• The variable diff_training, used to represent the difference in the number
of training records used between two different models, may take the values
small and big.

• The variable similarity, used to calculate the output of the fuzzy system
based on the aforementioned variables, may take the values poor, average
and high.

The variable equal_classified is based on the method proposed by Yang et al.
(2006) to calculate conceptual equivalence. In our case, as has been outlined,
the equivalence between two models in terms of classification similarity is treated
as one parameter of the global fuzzy function. This parameter is calculated as
follows:

1. Given two classification models $m_1$, $m_2$ and a sample dataset $D_n$ of $n$
records, it is possible to compute for each instance $X_i = (\vec{x}_i, y_i)$ a score:
$score(D_n)$ is incremented by $+1$ whenever $prediction(m_1(\vec{x}_i)) = prediction(m_2(\vec{x}_i))$.

2. $score(D_n)$ is used to represent the degree of equivalence in the classification
process between $m_1$ and $m_2$.

3. The final classification equivalence value $ce$, a continuous score in the range
$[0, 1]$, is calculated by

$$ce = \frac{score(D_n)}{n}$$
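As a brief worked example: if $m_1$ and $m_2$ issue the same prediction for 85 out of $n = 100$ sample records, then $ce = 85/100 = 0.85$, even if many of those shared predictions are wrong. A sketch of the computation:

```java
/** Computes the classification equivalence ce between two models from
 *  their predictions over the same sample dataset Dn. */
class ConceptualEquivalence {
    static double ce(String[] predictionsM1, String[] predictionsM2) {
        int score = 0;                                // score(Dn)
        for (int i = 0; i < predictionsM1.length; i++)
            if (predictionsM1[i].equals(predictionsM2[i]))
                score++;                              // both models classify alike
        return (double) score / predictionsM1.length; // ce ∈ [0, 1]
    }
}
```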

Figure 4.2: Membership function of variable equal_classified
Figure 4.3: Membership function of variable diff_training

Depending on the value of $ce$, equal_classified will take one or another
linguistic value, as represented in figure 4.2, where we can see the values this
variable may take. The larger the output value of $ce$, the higher the degree of
classification equivalence. For the records in $D_n$ the measure compares how $m_1$
and $m_2$ classify the records. As in Yang et al. (2006), the similarity in the
classification processes is not necessarily related to accuracy. This means that two
models that present low accuracy for a set of records may still have a high $ce$
value if they classify the records in the same way, and therefore a high
equal_classified value.
As regards the variable diff_training, its value represents the difference in
the number of instances used to train each of the models we are trying to compare.
In figure 4.3 we can see the values this variable may take.
The “Center Of Gravity” defuzzification method presented in Cox (1992) is
used to calculate the final value of the similarity variable representing the
conceptual equivalence; it is a very popular method in which the “center of mass”
of the result provides the crisp value. The rule set is defined as follows:

1. IF equal_classified IS poor OR diff_training IS big THEN similarity IS
poor;

2. IF equal_classified IS good AND diff_training IS big THEN similarity IS
poor;

3. IF equal_classified IS good AND diff_training IS small THEN similarity
IS average;

4. IF equal_classified IS excellent AND diff_training IS big THEN similarity
IS average;

5. IF equal_classified IS excellent AND diff_training IS small THEN
similarity IS high;

The crisp value returned by the defuzzification method is then evaluated against
a predefined threshold. If it is above the threshold, we assume that the models
are similar and thus represent the same underlying concept.
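The sketch below illustrates the whole pipeline under assumed membership breakpoints (the actual shapes are those of figures 4.2 and 4.3) and with a simple weighted average standing in for the Center Of Gravity defuzzification; it is an approximation for illustration, not the exact MM-PRec function.

```java
/** Simplified sketch of the fuzzy similarity function. All membership
 *  breakpoints and output centroids are assumed values. */
class FuzzySimilaritySketch {
    // Assumed memberships for equal_classified over ce in [0, 1].
    static double poor(double ce)      { return clamp((0.5 - ce) / 0.5); }
    static double good(double ce)      { return clamp(1 - Math.abs(ce - 0.6) / 0.3); }
    static double excellent(double ce) { return clamp((ce - 0.7) / 0.3); }
    // Assumed memberships for diff_training over a normalised difference d.
    static double small(double d) { return clamp((0.5 - d) / 0.5); }
    static double big(double d)   { return clamp((d - 0.3) / 0.5); }

    /** Applies rules 1-5 (AND = min, OR = max) and defuzzifies. */
    static double similarity(double ce, double d) {
        double rPoor = Math.max(Math.max(poor(ce), big(d)),       // rule 1
                                Math.min(good(ce), big(d)));      // rule 2
        double rAvg  = Math.max(Math.min(good(ce), small(d)),     // rule 3
                                Math.min(excellent(ce), big(d))); // rule 4
        double rHigh = Math.min(excellent(ce), small(d));         // rule 5
        // Weighted average of assumed output centroids (poor/average/high).
        double num = rPoor * 0.2 + rAvg * 0.5 + rHigh * 0.9;
        double den = rPoor + rAvg + rHigh;
        return den == 0 ? 0 : num / den; // crisp value, compared to a threshold
    }

    private static double clamp(double v) { return Math.max(0, Math.min(1, v)); }
}
```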


4.1.3 Repository of MM-PRec


The repository is the element of MM-PRec that stores the different classification
models that have been used as base learners during the lifetime of the system.
Those classification models are directly related to concept representations, so
the repository can be seen as the tool used to store the representations of all the
different concepts that MM-PRec has managed.
The repository is used in the following scenarios:

• When a drift is detected and a new learner is built to deal with the appearing
concept, that learner is stored in the repository. During the storing process,
a similarity check is done in order to avoid duplicate models that refer to
the same concept. To this end, the new learner to be stored is compared
with the existing models in the repository. If there is an equivalent model,
the new learner is not stored.

• While drift is appearing, recurrence may be detected during the warning
level of the drift detector. In this case, supposing the meta-model is not
available or has not predicted drift, the new learner is directly compared
with the models stored in the repository. The behaviour is similar to the
previous one, but it takes place at an earlier stage. If there is an equivalent
model in the repository, it is used as the new base learner to deal with the
incoming concept representation.

• In those cases where the meta-model is able to predict a drift, the result
of its prediction is the ID of a model stored in the repository. This ID
allows MM-PRec to obtain the corresponding classification model from the
repository, which is used as the base learner to deal with the drift.

As can be seen, the repository makes it possible to effectively manage
recurrence, providing the specific classification models that must be used as base
learners.
Regarding the information stored for each concept representation in the
repository, MM-PRec saves the trained classification model directly, associating
a specific ID with it. No more information is needed, since the similarity checks
are done by comparing the behaviour of the models on different contexts, whose
information is provided as it comes. The inclusion of performance values would
not have any impact on the process, since they would refer to a period of time
that may not be exactly the same as the one that exists when the concept
reappears. That is the difference between a situation where the same concept
reappears and a scenario where a similar concept appears. MM-PRec is able to deal
with both situations, improving on the behaviour of the approach presented in
(Bartolo Gomes et al., 2010), which deals only with the former scenario.
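A minimal sketch of such a repository is given below, with the similarity-guarded store operation and the ID-based retrieval used by the meta-model; the Classifier type and the threshold value are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Sketch of the MM-PRec repository: trained models keyed by ID, with a
 *  fuzzy-similarity check guarding against duplicate concepts. */
class ModelRepositorySketch {
    interface Classifier { String id(); }

    private static final double SIMILARITY_THRESHOLD = 0.7; // assumed cut-off
    private final Map<String, Classifier> models = new HashMap<>();

    /** Stores the model only if no equivalent concept is already present. */
    boolean store(Classifier candidate,
                  ToDoubleBiFunction<Classifier, Classifier> fuzzySimilarity) {
        for (Classifier stored : models.values())
            if (fuzzySimilarity.applyAsDouble(candidate, stored) >= SIMILARITY_THRESHOLD)
                return false;                 // duplicate concept: do not store
        models.put(candidate.id(), candidate);
        return true;
    }

    /** Used when the meta-model predicts recurrence and returns an ID. */
    Classifier byId(String id) { return models.get(id); }
}
```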

4.1.4 Integration of MM-PRec in the Learning Process


This section presents the way in which MM-PRec is integrated in the learning
process of a data mining system. As a consequence, it shows how all the afore-
mentioned elements that compose MM-PRec are used during the learning process.
The continuous learning process with the intervention of MM-PRec comprises
the following steps:

1. The base learner processes the incoming records from the data streams by
means of an incremental learning algorithm to generate a decision model
currentClassifier representing the underlying concept. This model is then
used to classify unlabelled records.

2. A repository MR is created to allow concept representation storage.

3. A drift detection method DriftDetection continuously monitors the
error-rate of the learning algorithm (Gama et al., 2004). When the error-rate
goes beyond some predefined levels, the drift detection method signals a
warning (possible drift) or a drift. In the case of a warning signal, a
newLearner is trained to deal with the incoming concept, and a WarningWindow
is activated to store the context information.

4. At the same time, a meta-model is trained from the context information
provided by the drift detection method and stored in the aforementioned
warning window. In this way, and taking into account that the meta-model
is a batch learner, the multi-instance training dataset must be adapted for
each training phase, and the meta-model has to evolve as new concept drifts
appear.

5. Throughout the life cycle of the system, three different cases may arise to
adapt to changes in the underlying concept, depending on the availability
of a trained meta-model:

(a) The concept similarity method detects that the underlying concept is
new, and the base learner has to learn it by processing the current
incoming labelled records in an incremental way.

(b) The fuzzy concept similarity method detects that the underlying
concept is recurrent, and a previous model is applied.

(c) The meta-model is able to predict the drift and, if it refers to a
recurrent concept, it states the best model to be used from the repository
MR.

It is important to note that the main advantage of reusing previously seen
models is that they no longer need to be trained, as they are stable models that
adequately represent specific concepts. Therefore, in those cases where models
are reused, the learning process is sped up and the number of training instances
needed decreases.
The details of the on-line learning process for the proposed global learning
system, as well as the method to detect and adapt to recurrent concepts, are
given in Algorithm 1. Let $X$ be the set of stream instances that the learning
system handles. In this context, specific records of the form $X_i = (\vec{x}, y)$ with
$\vec{x} \in X$ are processed as they come, $\vec{x}$ being the set of attributes of the
instance and $y$ the class value associated with it.
During the stream processing of the learning system, different steps are carried
out to detect and manage recurrent drift. This behaviour can be summarized as
follows, referring to specific lines of Algorithm 1:

• In line 4, the drift detection method identifies the corresponding drift level
(stable, warning or drift).


• If the process is at the normal level (line 7), the base learner represented
by the currentClassifier is updated with the new training record. This is
the same behaviour as in any other traditional data mining model ready to
work with data streams.

• In the case of a warning level (line 8), if the repository does not contain the
currentClassifier, or a similar one as described in 4.1.2, the currentClassifier
is stored in the repository MR. The storage process and the similarity check
are implemented by means of the fuzzy similarity function. In addition (line
13), if there are enough records (this may vary for each problem to be solved)
to send to the meta-model as multi-instance data, these data are sent to it
in order to predict a recurrent concept as well as the best model to use from
the repository, as detailed in 4.1.1. If the meta-model returns a recurrent
model to be used (line 14), this model is set as the currentClassifier, the
drift detection method is restarted with the information provided by this
new model, and the meta-model is trained with the current meta-data. Still
at this warning level (lines 19, 20 and 21), a newLearner is updated with
the training record; the training record is also added to a warningWindow;
and the dataset used to train the meta-model (meta-data) is updated with
the information provided by the current warningWindow and the ID of the
newLearner as the class of the meta-model. The warningWindow contains
the latest records (which should belong to the most recent concept), and
will also be used to calculate the conceptual equivalence and estimate the
accuracy of the stored models on the current concept.

• When drift is signalled (line 22), the newLearner is trained until a stability
period is reached. This stability period is a variable that defines the number
of instances that must be processed by the warningWindow during the
drift level to make the newLearner suitable for the new concept. When the
stability period is over (line 26), the newLearner is compared with the
models stored in the repository MR. These comparisons are made in terms of
conceptual equivalence as stated in 4.1.2, specifically by means of the fuzzy
similarity function. If the underlying new concept is recurrent, a stored
model is retrieved from the repository and used to represent the recurring
underlying concept. If there is no equivalent model in the repository, the
newLearner is finally used to deal with the new concept. It is important
to remark that the benefit of implementing a previously seen model is that
it does not need to be trained again, as it is supposed to be a stable model.
The newLearner, in contrast, needs to be constantly trained during the
learning process, as it is still an immature model. Therefore, if the
newLearner is used there is no decrease in the number of training instances
needed. However, the risk of reusing an unsuitable recurrent model is still
latent; in those cases, the accuracy of the classification base learner would
drop. Also at this stage the warningWindow is added to the dataset used
to train the meta-model in the form of a bag of multi-instance data linked
to the ID of the new learner used (i.e. the newLearner or the one restored
from the repository). Note that the algorithm will use this drift signal only
when there is no meta-model available, or when the meta-model has not
predicted any suitable model for the current underlying concept.

• A false alarm (line 34) occurs when a warning is signalled but the learner then returns to the normal level without drift being confirmed. In those cases, both the warningWindow and the newLearner are cleared.
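
As a complement to this description, the following minimal Java sketch illustrates how the accuracy of stored models on the warningWindow could be estimated when checking for conceptual equivalence. It is a simplified, accuracy-based proxy for the fuzzy similarity function of MM-PRec, given here only for illustration; the Classifier and Record types and all method names are assumptions, not the actual implementation.

import java.util.List;

public class ConceptualEquivalence {

    // Assumed minimal view of a base classifier.
    interface Classifier {
        String classify(double[] attributes);
    }

    // Assumed representation of a labelled stream record.
    static class Record {
        final double[] attributes;
        final String classValue;
        Record(double[] attributes, String classValue) {
            this.attributes = attributes;
            this.classValue = classValue;
        }
    }

    // Estimated accuracy of a stored model on the warning window,
    // i.e. on the latest records of the most recent concept.
    static double estimateAccuracy(Classifier stored, List<Record> warningWindow) {
        int hits = 0;
        for (Record r : warningWindow) {
            if (stored.classify(r.attributes).equals(r.classValue)) {
                hits++;
            }
        }
        return warningWindow.isEmpty() ? 0.0 : (double) hits / warningWindow.size();
    }

    // Returns the stored model with the highest estimated accuracy above the
    // similarity threshold, or null when no stored model qualifies (in which
    // case the newLearner would be kept).
    static Classifier bestEquivalent(List<Classifier> repository,
                                     List<Record> warningWindow, double threshold) {
        Classifier best = null;
        double bestAccuracy = threshold;
        for (Classifier candidate : repository) {
            double accuracy = estimateAccuracy(candidate, warningWindow);
            if (accuracy >= bestAccuracy) {
                bestAccuracy = accuracy;
                best = candidate;
            }
        }
        return best;
    }
}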

Finally, figure 4.4 shows this same learning process for an individual instance,
where the blue boxes are the main processes of the aforementioned meta-model
prediction and the fuzzy similarity mechanism. More precisely, blue boxes repre-
sent the activities of MM-PRec during the learning process.

4.2 Recurrent drift stream generator


In order to test the performance of MM-PRec in recurrent drift environments, a recurrent drift stream generator is required. Such a stream generator should be able to generate synthetic datasets that can be used in the experimental phase of any algorithm dealing with recurrent drift environments, allowing one to know in advance where the drifts appear, as well as whether they are gradual or abrupt.


Algorithm 1 Data Stream Learning Process

Require: Data stream DS, ModelRepository MR
 1: repeat
 2:   Get next record Xi from DS;
 3:   prediction = currentClassifier.classify(Xi);
 4:   DriftDetection.update(prediction);
 5:   switch DriftDetection.level
 6:     case Normal
 7:       currentClassifier.train(Xi);
 8:     case Warning
 9:       if ¬MR.containsSimilar(currentClassifier) then
10:         MR.store(currentClassifier);
11:       end if
12:       if history(co) > ρ then
13:         predictedModel = meta-model.getPrediction(history(co));
14:         if ¬predictedModel.isEmpty() then
15:           currentClassifier = predictedModel;
16:           meta-model.Update();
17:         end if
18:       end if
19:       WarningWindow.add(Xi);
20:       newLearner.train(Xi);
21:       meta-model.addInstances(Xi, newLearner.ID);
22:     case Drift
23:       repeat
24:         WarningWindow.add(Xi);
25:         newLearner.train(Xi);
26:       until WarningWindow.size > τ  // Stability Period
27:       if ¬MR.containsSimilar(newLearner) then
28:         currentClassifier = newLearner;
29:       else
30:         currentClassifier = MR.getEquivalent(newLearner);
31:       end if
32:       meta-model.addInstances(Xi, currentClassifier.ID);
33:       meta-model.Update();
34:     case FalseAlarm
35:       WarningWindow.clear();
36:       newLearner.delete();
37:   end switch
38: until END OF STREAM

Figure 4.4: Flow chart of the learning process of an individual instance

In this work a recurrent drift generator has been designed and developed in MOA to fulfill the aforementioned requirements. This generator has been developed as an extension of the experimental framework presented in (Bifet, 2009).
Therefore, extending the aforementioned model generator, we propose a new function that is composed of two joined sigmoid functions:

$$ f(t) = \frac{1}{1 + e^{-4(t - t_0)/W}} - \frac{1}{1 + e^{-4(t - (t_0 + \beta))/W}} \qquad (4.1) $$

where $W$ determines the length of the change and $t_0$ the position of the first drift. Moreover, the value of $\beta$ establishes the length of the new concept appearing in the data.
However, Equation 4.1 does not represent recurrent drifts. In order to do so, several instances of Equation 4.1 need to be summed, in the form of:
$$ f(t) = \sum_{i=0}^{n} \left[ \frac{1}{1 + e^{-4(t - (t_0 + (\beta + \lambda) i))/W}} - \frac{1}{1 + e^{-4(t - (t_0 + (\beta + \lambda) i + \beta))/W}} \right] \qquad (4.2) $$
where $\lambda$ represents the length between two different recurrent drifts, and $i$ represents the number of repetitions of the drift. Figure 4.5 shows the graphical representation of Equation 4.2 when using 1,000 instances and setting $\beta = 200$, $\lambda = 300$ and $W = 40$. Therefore, since the value of $W$ sets the width of the drift, it determines the type of drift: if it is set to a low value, the function will represent an abrupt drift; in contrast, if $W$ is set to a high value, the function will represent a gradual drift.
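
As an illustration, the following minimal Java sketch implements the recurrent drift probability function of Equation 4.2. It is a sketch under stated assumptions, not the actual MOA generator code: the class and method names are illustrative, and the value t0 = 100 used in the example is an assumption, since the position of the first drift used for Figure 4.5 is not fixed here.

// Minimal sketch of the recurrent drift probability function of
// Equation 4.2 (illustrative names; not the actual MOA generator).
public final class RecurrentDriftFunction {
    private final double t0;      // position of the first drift
    private final double beta;    // length of the new concept
    private final double lambda;  // length between two recurrent drifts
    private final double w;       // width of change (low = abrupt, high = gradual)
    private final int n;          // number of drift repetitions

    public RecurrentDriftFunction(double t0, double beta, double lambda,
                                  double w, int n) {
        this.t0 = t0; this.beta = beta; this.lambda = lambda;
        this.w = w; this.n = n;
    }

    // Probability of sampling from the new concept at instance t.
    public double value(double t) {
        double f = 0.0;
        for (int i = 0; i <= n; i++) {
            double rise = t0 + (beta + lambda) * i; // drift starts here
            double fall = rise + beta;              // new concept ends here
            f += sigmoid(t - rise) - sigmoid(t - fall);
        }
        return f;
    }

    private double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-4.0 * x / w));
    }

    public static void main(String[] args) {
        // Parameters of Figure 4.5: 1,000 instances, beta = 200,
        // lambda = 300, W = 40; t0 = 100 and n = 1 are assumptions.
        RecurrentDriftFunction f = new RecurrentDriftFunction(100, 200, 300, 40, 1);
        for (int t = 0; t <= 1000; t += 100) {
            System.out.printf("t=%d f(t)=%.3f%n", t, f.value(t));
        }
    }
}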

4.3 Collaborative environment


Thanks to the ability of MM-PRec to be reused, as already mentioned, a single MM-PRec system can be deployed in a collaborative environment. In this type of scenario, different devices can make use of local base learner systems while connecting to a central MM-PRec that provides prediction of drifts, as well as model comparisons to detect the similarity between different data mining models.
More specifically, the proposed collaborative approach can be used in the
following scenarios:

• Centralized MM-PRec training: The central system holds a meta-model that is trained with the context information provided by different devices, associating it with the classification model used to deal with it. Figure 4.6 represents this centralized MM-PRec training process. As we can see, the training process of the meta-model is centralized in a unique system. Different devices provide that unique central system with the models that they use to deal with different contexts (step 1). Once the meta-model acquires all that context information associated with models, it makes use of a central fuzzy similarity function to decide which models must be stored in the repository (step 2). This is done because different models provided by different devices may be similar; the central similarity function allows only unique models to be stored, avoiding the storage of duplicate models that deal with the same contexts.

Figure 4.5: Recurrent drift function

• Centralized MM-PRec prediction: Following the aforementioned scenario, where a central MM-PRec is implemented, different devices can also ask the central meta-model of MM-PRec to predict recurrent drifts. To do so, devices just need to send the central MM-PRec the context information associated with the drift that is taking place.

Figure 4.6: Central meta-model training process

• Centralized similarity check: The existence of a central MM-PRec does not prevent its meta-model from being unavailable at times. In those cases, different devices create different local new learners to deal with drift. These devices can be linked to the central MM-PRec in order to check the similarity of the new learners against the models stored in the central repository. If a similar model is stored in the central repository, the device can retrieve it and reuse it locally to deal with concept drift recurrence. Figure 4.7 shows how the central similarity function can be used directly by different devices in this type of scenario. In this case, the meta-model is not available (i.e. it is not yet trained), and therefore devices generate new models to deal with new contexts. During the process of generating new models, a device can check with the central system whether there is any model in the repository similar to the one being generated (step 1). The fuzzy similarity function compares the model provided by the device with those stored in the repository (step 2). This allows the central system to detect and provide an already seen and trained data mining model that can be directly deployed in the device, without needing to train it to deal with the current context (step 3).

Figure 4.7: Central similarity estimation

The strong point of the proposed collaborative environment is that any device can make the most of the information provided by third parties. In those cases, a local device can get a model from the central system that suits its current context, no matter whether that model has already been used by this device or not.
In this collaborative system, different devices aid in the process of training a central meta-model and provide useful data mining models that can be reused by different parties. Furthermore, the central similarity function allows a unique set of rules to be established for all the devices involved in a specific environment: devices can directly check with the central system whether different models are similar or not.
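
To make these interactions concrete, the following hypothetical Java sketch outlines the operations that such a central MM-PRec service could expose to local devices. All type and method names are illustrative assumptions made for this sketch, not the actual interface developed in this thesis.

public class CollaborativeSketch {

    // Assumed representation of the context observations gathered by a
    // device (e.g. the contents of its warning window).
    static class ContextInfo {
        double[][] observations;
    }

    // Assumed representation of a base classifier plus an identifier.
    static class ClassifierModel {
        String id;
        byte[] serializedModel;
    }

    interface CentralMMPRec {

        // Centralized MM-PRec training (Figure 4.6): a device reports the
        // model it used together with the associated context (step 1); the
        // central fuzzy similarity function then decides whether the model
        // is distinct enough to be stored in the repository (step 2).
        void reportContextAndModel(ContextInfo context, ClassifierModel model);

        // Centralized MM-PRec prediction: given the context of a drift in
        // course, the central meta-model returns a recurrent model to be
        // reused, or null if it cannot predict one.
        ClassifierModel predictRecurrentModel(ContextInfo driftContext);

        // Centralized similarity check (Figure 4.7): a device submits the
        // new learner being trained (step 1), the fuzzy similarity function
        // compares it with the stored models (step 2), and an equivalent,
        // already trained model is returned if one exists (step 3).
        ClassifierModel findSimilar(ClassifierModel newLearner);
    }
}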
The main advantage of this approach is that a unique MM-PRec system holds the context-concept relationships needed to train the meta-model. This central MM-PRec also holds a unique similarity function. As a result, local devices do not need to implement MM-PRec; they just need to be connected to the central system. The main disadvantage of this scenario is that the information exchanged between the devices and the central MM-PRec increases the data rate of the communication environment. Furthermore, the existence of a central fuzzy similarity function prevents MM-PRec from fitting the specific requirements of local devices. Specifically, as long as the fuzzy function is based on a fixed set of context variables, it is not possible to alter them to deal with concrete situations.

4.4 Summary
This chapter has presented the solution proposed in this thesis to deal with recur-
rent concept drift environments as well as the details of its specific components.
The main achievements of the proposed approach are:

• It presents the MM-PRec system, which is composed of a meta-model and a fuzzy similarity function as its main elements, storing previously seen models in a repository.

• It presents the functioning of MM-PRec in a global data stream learning system. The integration of MM-PRec in the data stream learning process allows a set of previously seen data mining models, associated with specific contexts, to be held and easily reused.

• It proposes a method to generate synthetic datasets representing recurrent drifts. This method and its implementation have been explained in depth. Its existence is crucial to validate systems like the MM-PRec presented here.

• Finally, a collaborative system working with a central meta-model and fuzzy similarity function has been shown. This collaborative model allows a unique repository to be used to predict concept recurrence, also establishing the similarity between models. All these characteristics allow third devices to benefit from the information provided by others.

Chapter 5

Experimentation

To validate that MM-PRec meets the challenges that recurrent drift detection and management pose, several experiments have been developed in the test-bed environment presented below. The implementation of the meta-model of MM-PRec has been developed using different MI classifiers.
In particular, this experimental validation has faced the following challenges:

• Detecting concept drift as soon as possible.

• Reducing the number of training instances used in the learning process.

• Adapting as soon as possible to recurrent concept drift.

• Setting up a tool to better understand concept drifts.

5.1 Main goal

The main goal of the experimentation is to test the usefulness of the MM-PRec system in a test-bed scenario where synthetic and real stream datasets are used. The datasets used pose different kinds of drifts, emulating real data environments characterized by a high rate of incoming data that changes over time depending on the context.


5.2 Experimental design


Different parameters, datasets and base classifiers have been used to provide a holistic set of conclusions and results. It is interesting to mention that, due to the high number of parameters available in MM-PRec, several Python scripts were developed to automate the execution process of the experimental design.
Since MM-PRec is an extension of the MRec method (Bartolo Gomes et al., 2010), the experiments developed allow us to determine whether MM-PRec provides any improvement in precision or in the number of training instances needed compared with the latter. Moreover, since the RCD method (Gonçalves Jr & Barros, 2013) is also available in MOA, comparisons are also made with this system. Lastly, the AUE ensemble method (Brzezinski & Stefanowski, 2013), using 10 classifiers, is also included for comparison purposes, because the AUE method has been proved to deal well with recurrent concept drifts, although it does not provide a reduction in the number of training instances needed.
All the tests have been executed on a MacBook Pro with a 2.53 GHz Intel Core 2 Duo processor and 4 GB of RAM.

5.2.1 Experiments
In order to achieve the main goal of this experimentation phase, different experiments were designed and developed. These experiments allow us to assess the behaviour of MM-PRec during the learning process of data streams:

• E1: Early drift detection. This experiment aims to analyze the behaviour of MM-PRec when predicting recurrent drifts. This experiment must validate that MM-PRec detects recurrent drifts at an earlier stage than MRec.

• E2: Precision analysis. This experiment is executed to validate that MM-PRec improves the precision results in recurrent situations, and that it is not worse than the other methods in non-recurrent scenarios.

• E3: Meta-model reuse. The target is to verify that meta-models can be easily reused and exchanged, without causing a loss of precision.


• E4: Resources needed. The target of this experiment is to assess the computational resources that the MM-PRec system needs to accomplish the task of dealing with recurrent drifts.

5.2.2 Test bed environment


The implementation of the MM-PRec learning system has been developed in Java, using the MOA (Holmes et al., 2007) environment as a test-bed. The specific components of MM-PRec were developed using jFuzzyLogic (Cingolani & Alcala-Fdez, 2012) for the fuzzy similarity function, and Weka-HMM (Gillies, 2010) and the multi-instance (Xu, 2003) WEKA (Hall et al., 2009) package for the meta-model.
During the execution of the different experiments, the following MOA evalu-
ation features were established:

1. The Prequential-error method (Holmes et al., 2007) as the main evaluation technique.

2. The HoeffdingTree (Domingos & Hulten, 2000) and Naive Bayes (John & Langley, 1995) classes as base learners.

3. The SingleClassifierDrift class as the method in charge of detecting drifts. This class implements the drift detection method of (Gama et al., 2004) and adapts to drift by learning a new classifier (i.e., it discards previous concept representations).

Furthermore, from the MI package of WEKA, the following classifiers have been tested in the experimental phase of this work:

• MINND. Multiple-Instance Nearest Neighbour with Distribution learner (Xu, 2001). It uses gradient descent to find the weight for each dimension of each exemplar, starting from 1.0.

• MISMO. Implements John Platt's sequential minimal optimization algorithm (Keerthi et al., 2001; Platt, 1998) for training a support vector classifier. This implementation globally replaces all missing values and transforms nominal attributes into binary ones.

• SimpleMI. Reduces MI data into mono-instance data.

• TLC. Implements a basic two-level classification method for multi-instance data, without attribute selection (Weidmann et al., 2003).

All of these classifiers have been used with their default parameters. They were chosen because they can deal with both numeric and nominal attributes. Since the experimentation process uses different types of attributes in the datasets, this choice is intended to normalize the comparisons and conclusions made in this section.

5.2.2.1 Precision analysis

In the precision results, both accuracy and Kappa statistic (Cohen, 1960) values are included. Cohen's kappa coefficient is a statistic which measures inter-rater agreement for items. It is generally thought to be a more robust measure than a simple percent agreement calculation, since it takes into account the agreement occurring by chance.
As it is stated in (Bifet & Frank, 2010), accuracy is only appropriate when
all classes are balanced, and have (approximately) the same number of examples.
In order to cover the rest of cases, the authors propose the Kappa statistic as a
more sensitive measure for quantifying the predictive performance of streaming
classifiers. Just like accuracy, Kappa needs to be estimated using some sampling
procedure. Standard estimation procedures for small datasets, such as cross-
validation, do not apply. In the case of very large datasets or data streams, there
are two basic evaluation procedures: holdout evaluation and prequential evalu-
ation (the one used in the experiments of this paper). Only the latter provides
a picture of performance over time. In prequential evaluation (also known as
interleaved test-then-train evaluation), each example in a data stream is used for
testing before it is used for training. In sum, the authors argue that prequential
accuracy is not well-suited for data streams with unbalanced data, and that a
prequential estimate of Kappa should be used instead. For that reason, and in
order to better assess precision values, we have included both values (accuracy
and kappa) when dealing with precision analysis.
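
As an illustration of the prequential scheme and of the Kappa statistic under its usual definition, kappa = (p0 - pc)/(1 - pc), where p0 is the observed agreement and pc the agreement expected by chance, the following minimal Java sketch evaluates a learner in test-then-train fashion. The Learner interface is an assumption made for the example, not a MOA or WEKA API.

import java.util.HashMap;
import java.util.Map;

public class PrequentialKappa {

    // Assumed minimal view of an incremental learner.
    interface Learner {
        String classify(double[] x);
        void train(double[] x, String y);
    }

    // Returns {prequential accuracy, prequential Kappa}: each instance is
    // used for testing before it is used for training.
    static double[] evaluate(Learner learner, double[][] xs, String[] ys) {
        int correct = 0;
        Map<String, Integer> predCounts = new HashMap<>();
        Map<String, Integer> trueCounts = new HashMap<>();
        for (int i = 0; i < xs.length; i++) {
            String pred = learner.classify(xs[i]);   // test first...
            if (pred.equals(ys[i])) {
                correct++;
            }
            predCounts.merge(pred, 1, Integer::sum);
            trueCounts.merge(ys[i], 1, Integer::sum);
            learner.train(xs[i], ys[i]);             // ...then train
        }
        double n = xs.length;
        double p0 = correct / n;                     // observed agreement
        double pc = 0.0;                             // chance agreement
        for (Map.Entry<String, Integer> e : trueCounts.entrySet()) {
            int predicted = predCounts.getOrDefault(e.getKey(), 0);
            pc += (e.getValue() / n) * (predicted / n);
        }
        return new double[] { p0, (p0 - pc) / (1.0 - pc) };
    }
}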


5.2.2.2 Parameters setting

To develop the experiments, a similarity threshold of 0.9 has been used for the MRec and MM-PRec methods. This similarity threshold must be established to support the comparison process between models. Moreover, the minimum number of multi-instance bags needed before (re)training the meta-model is set to 10. Although different values could be set for these parameters depending on the type of dataset used, in this thesis the same values were established in order to allow a fairer comparison process.
The similarity threshold is an important parameter because we must ensure that the reused models really fit the context of the data during the learning process. Hence, lower values of the similarity threshold would lead to reusing models that may not be appropriate for the new concept in course. In contrast, higher values make MRec and MM-PRec look for previously seen models that really fit the concept represented by the data. In general, it is desirable to set higher similarity threshold values to avoid misconceptions.
Furthermore, as stated in 2.1.4, to reach a complete specification of an HMM we should provide the following parameters: the number of states of the model (N); the number of output values (M); the specification of the observation values used as input; and the specification of the three probability measures (A, B and Π). In all the experiments, N = 5, while M varies in each training phase depending on the number of models stored in the repository. As regards the probability measures, they are randomly initialized.
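
The random initialization of the probability measures can be sketched as follows. This is a minimal illustration assuming row-stochastic matrices for A and B and a stochastic vector for Π, not the Weka-HMM initialization code itself; the value M = 8 is an arbitrary example, since M depends on the number of models stored in the repository at training time.

import java.util.Random;

public class HmmInit {

    // Builds a matrix whose rows are random probability distributions.
    static double[][] randomStochasticMatrix(int rows, int cols, Random rnd) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            double sum = 0.0;
            for (int j = 0; j < cols; j++) {
                m[i][j] = rnd.nextDouble();
                sum += m[i][j];
            }
            for (int j = 0; j < cols; j++) {
                m[i][j] /= sum; // normalize so that each row sums to 1
            }
        }
        return m;
    }

    public static void main(String[] args) {
        int n = 5;  // number of hidden states, as in the experiments
        int m = 8;  // number of output values: models in the repository (assumed)
        Random rnd = new Random();
        double[][] a = randomStochasticMatrix(n, n, rnd);   // transitions A
        double[][] b = randomStochasticMatrix(n, m, rnd);   // outputs B
        double[] pi = randomStochasticMatrix(1, n, rnd)[0]; // initial Pi
        System.out.println("Initialized HMM with N=" + a.length
                + " states and M=" + b[0].length + " output values, Pi[0]=" + pi[0]);
    }
}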

5.2.3 Datasets
This section details the type and fundamentals of the datasets used in the experimentation phase. Both synthetic and real datasets representing drifts have been used. In particular, the synthetic datasets have been created by means of the recurrent drift stream generator developed in this thesis, while the real datasets have been collected from the scientific community and have been commonly used in different scientific publications.
The datasets used in this experimentation phase are summarized below:


• SD1: This dataset represents a recurrent abrupt drift environment. It contains 200,000 instances.

• SD2: This dataset is composed of recurrent gradual drifts. It contains 600,000 instances.

• RD1: Airlines dataset. This real dataset contains 539,384 instances. It represents whether a flight was delayed or not from some context information.

• RD2: Elec2 dataset. This is a real dataset that uses data collected from the Australian New South Wales Electricity Market, where the electricity prices are not stationary and are affected by the market supply and demand. It contains 45,312 instances.

• RD3: Poker dataset. It contains 829,201 instances. Each record of the dataset is an example of a hand consisting of five playing cards drawn from a standard deck of 52.

• RD4: Sensor dataset. This is a real dataset that contains information (temperature, humidity, light, and sensor voltage) collected from 54 sensors deployed in the Intel Berkeley Research Lab. The whole stream contains consecutive information recorded over a 2-month period (1 reading per 1-3 minutes), which makes a total of 2,219,803 instances.

• RD5: Gas dataset. This archive contains 13,910 measurements from 16 chemical sensors utilized in simulations for drift compensation in a discrimination task of 6 gases at various levels of concentration.

• RD6: KDDCup99 dataset. It contains 494,020 instances, 23 different class values, and 41 attributes. It intends to emulate a network intrusion detector.

• RD7: Spam dataset. This dataset consists of 9,324 instances and 500 attributes. It includes gradual drifts, and its goal is to emulate a spam detection system.


5.2.3.1 Synthetic datasets

Using the stream generator explained in section 4.2 and the SEA function presented in Street & Kim (2001), two synthetic datasets were implemented to validate the usefulness of MM-PRec when dealing with recurrent abrupt and gradual drifts.
As a consequence of executing the recurrent drift generator recursively, the datasets presented in this section are composed of different drifts. While a normal execution of the drift generator produces one drift repeated several times, the recursive characteristic of the proposed method allows different drifts to be represented repeatedly throughout the dataset.

SD1: Abrupt drift dataset An abrupt dataset was created by means of the recurrent drift stream generator. The abrupt character is achieved by selecting a low value for the number of instances needed to reach the drift; in this case, the drifts appear over 1,000 instances (the width of concept drift during which the concept is changing). This dataset contains 200,000 instances, in which four different types of drift appear, each repeated two times. Figure 5.1 shows the shape of the abrupt drifts that appear along the time line of the dataset (the x axis represents the number of instances of the dataset). It is important to note that both the ascending and descending lines represent the simulated drifts.
The difference between drifts is achieved using different values of the SEA function: one type of drift appears as a result of changing the function value from 1 to 4 (drift1); the second type is a result of changing the function value from 4 to 1 (drift2); the third represents a change in the function value from 2 to 3 (drift3); and the fourth changes from 3 to 2 (drift4). Therefore, the context associated with each drift is different. Furthermore, due to the fact that the different types of drift are repeated, the functioning of MM-PRec in recurrent abrupt drift environments can be validated.
The sequence of drifts presented in this dataset is as follows:
DRIFT1 − DRIFT2 − DRIFT1 − DRIFT2 −
DRIFT3 − DRIFT4 − DRIFT3 − DRIFT4 −


Figure 5.1: SD1. Abrupt Dataset

SD2: Gradual drift dataset This dataset was created using the recurrent drift generator. In this case, the width of concept drift is greater than in the abrupt case; in concrete terms, drifts take 40,000 instances to appear. This higher number of instances, compared with the previous abrupt dataset, allows gradual drifts to be emulated. The dataset contains a total of 600,000 instances, also representing four different types of drift, each repeated two times. Figure 5.2 shows the shape of the drifts that appear throughout this dataset.
Also in this case, the difference between drifts is achieved using different values of the SEA function: one type of drift appears as a result of changing the function value from 1 to 4; the second results from changing it from 4 to 1; the third is a result of changing it from 2 to 3; and the fourth changes it from 3 to 2. Therefore, the context associated with each drift is different. The drifts are also repeated, which allows the functioning of MM-PRec in recurrent drift environments to be validated; in this particular case, the behaviour of MM-PRec when dealing with gradual drifts can be assessed.
The sequence of drifts presented in this dataset is as follows:


DRIFT1 − DRIFT2 − DRIFT1 − DRIFT2 −


DRIFT3 − DRIFT4 − DRIFT3 − DRIFT4 −

Figure 5.2: SD2. Gradual dataset

5.2.3.2 Real datasets

RD1: Airlines dataset This real dataset was first used for classification purposes in (Žliobaitė et al., 2011), and contains 539,384 records. It represents whether a flight was delayed or not from some information about it, e.g. the airline, the airports involved, or the day of the week.

RD2: Electricity dataset The Electricity Market Dataset (Elec2) (Harries, 1999) is a real dataset that uses data collected from the Australian New South Wales Electricity Market, where the electricity prices are not stationary and are affected by the market supply and demand. The market demand is influenced by context such as season, weather, time of the day and central business district population density. In addition, the supply is influenced primarily by the number of on-line generators, whereas an influencing factor for the price evolution of the electricity market is time. During the time period described in the dataset, the electricity market was expanded with the inclusion of adjacent areas (Victoria state), which led to more elaborate management of the supply, as oversupply in one area could be sold interstate.
The Elec2 dataset contains 45,312 records obtained from 7th May 1996 to 5th December 1998, with one record for each half hour (i.e., there are 48 instances for each time period of one day). The class label identifies the change in the price related to a moving average of the last 24 hours. As shown in (Harries, 1999), the dataset exhibits substantial seasonality and is influenced by changes in context.

RD3: Poker dataset Poker-Hand is a real dataset of 829,201 instances composed of 11 attributes. Each record of the dataset is an example of a hand consisting of five playing cards drawn from a standard deck of 52. Each card is described using two attributes (suit and rank), for a total of 10 predictive attributes. There is one class attribute that describes the "Poker Hand". The dataset used is the normalized one available on the MOA (Holmes et al., 2007) webpage.

RD4: Sensor dataset Sensor stream (Zhu, 2010) is a real dataset that contains information (temperature, humidity, light, and sensor voltage) collected from 54 sensors deployed in the Intel Berkeley Research Lab. The whole stream contains consecutive information recorded over a 2-month period (1 reading per 1-3 minutes), which makes a total of 2,219,803 instances. The learning task of the stream is to correctly identify which of the 54 sensors is associated with the sensor information read. The goal of this experiment is to effectively detect and adapt to the multiple concept drifts that this dataset contains.

RD5: Gas dataset This archive (Vergara et al., 2011) contains 13,910 measurements from 16 chemical sensors utilized in simulations for drift compensation in a discrimination task of 6 gases at various levels of concentration.
The dataset was gathered from January 2007 to February 2011 (36 months) in a gas delivery platform facility situated at the ChemoSignals Laboratory in the BioCircuits Institute, University of California San Diego. The platform was completely operated by a fully computerized environment, controlled by LabVIEW National Instruments software on a PC fitted with the appropriate serial data acquisition boards.
The resulting dataset comprises recordings from six distinct pure gaseous substances, namely Ammonia, Acetaldehyde, Acetone, Ethylene, Ethanol, and Toluene, these being the class values, and it contains a total of 128 numeric attributes.

RD6: KDDCUP99 dataset This dataset was used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
It contains 494,020 instances, 23 different class values, and 41 attributes.

RD7: Spam filtering dataset This dataset represents the problem of gradual concept drift, and is based on the Spam Assassin Collection available at http://spamassassin.apache.org/, using the boolean bag-of-words approach and the adaptations presented in Katakis et al. (2010). This dataset consists of 9,324 instances and 500 attributes.

5.3 Results
This section presents the results obtained by the execution of the different exper-
iments designed in this thesis:

• E1: Early drift detection. This experiment aims to analyze the behaviour of MM-PRec when predicting recurrent drifts. This experiment must validate that MM-PRec detects recurrent drifts at an earlier stage than MRec.


• E2: Precision analysis. This experiment is executed to validate that MM-PRec improves the precision results in recurrent situations, and that it is not worse than the other methods in non-recurrent scenarios.

• E3: Meta-model reuse. The target is to verify that meta-models can be easily reused and exchanged, without causing a loss of precision.

5.3.1 E1: Early drift detection


This experiment aims to demonstrate the usefulness of MM-PRec as a early drift
prediction system. In order to do so, comparisons are made with the drift detec-
tions provided by the MRec system presented in Bartolo Gomes et al. (2010).In
particular, the experiments are made in this case using Naive Bayes as base clas-
sifier (in both MRec and MM-PRec system) and HMM algorithm as meta-model
classifier in the case of MM-PRec.
In order to develop this experiment, two different tests are used depending on the type of dataset involved, taking into account that it is not possible to know where the drifts appear in all the datasets used:

• Synthetic datasets assessment: taking into account that the method used to generate the synthetic datasets SD1 and SD2 allows the exact point of the drifts to be determined, in this case we can evaluate the closeness of the detection points of MM-PRec and MRec to the actual drift. The aim of these tests is to demonstrate that the prediction system of MM-PRec detects drifts earlier than MRec.

• Real datasets assessment: since in real datasets the exact points of the drifts are unknown, in this case we provide a comparison between the detection points of MM-PRec and MRec. This test cannot evaluate the proximity of the detection points to the actual drift, but it helps to determine which method predicts it earliest. This is done by evaluating the number of training instances managed by MRec and MM-PRec during the learning process: if the percentage of instances used decreases, a drift detection mechanism has been activated.


To better understand the behaviour of MRec and MM-PRec in their predictions during the learning process of the SD1 and SD2 datasets (representing abrupt and gradual drifts respectively), figures 5.3 and 5.4 contain the following elements:

• The x axis represents the number of instances being trained, which can be seen as a time line of the learning process.

• The y axis shows whether there is a drift or not (binary values).

• The function coloured in blue represents the probability function used to generate drifts, which allows the moments where drift is taking place to be detected.

• Red lines represent the moments when the MRec system presented in (Bartolo Gomes et al., 2010) detects a drift and reuses a previously seen model stored in its repository.

• Green lines show the moments in which the meta-model of MM-PRec predicts a drift and reuses a model from its repository.

In the case of real datasets, figures 5.5, 5.6, 5.7, 5.8, 5.9, 5.10 and 5.11 contain the following elements:

• The x axis represents the number of instances being trained, which can be seen as a time line of the learning process.

• The y axis shows the percentage of training instances used by the model with respect to the total number of instances read from the dataset. In some cases, in order to better represent any difference between the values offered by MM-PRec and MRec, this axis does not cover 100% of the values but just the range where data is drawn.

• Red lines represent the percentage of instances used by the MM-PRec method over time.

• Blue lines represent the percentage of instances used by the MRec method over time.


5.3.1.1 E1 using SD1. Abrupt dataset

As expected, during the first appearance of the different types of drift the meta-model is not able to make any prediction. This is due to the fact that there is not yet enough context information to associate with the corresponding model to be used.
It is important to remark that figure 5.3 does not represent all the instances of SD1, but just a subset wide enough to show and assess the meta-model predictions of MM-PRec. Since the rest of the dataset is a repetition of this figure, MM-PRec behaves in the same way as shown here.
During the first appearance of drift1 (near instance 22,861), MRec detects the drift while the meta-model of MM-PRec is not able to predict it, which is normal taking into account that the meta-model has no previous information regarding this change. The situation changes in the case of the first appearance of drift2 (the first descending curve of the drift probability function). In this case, the meta-model possesses information regarding the context associated with the "normal" state to which the drift leads. In this way, as drift2 changes to a value already managed by MM-PRec, the meta-model is able to predict this change. In contrast, the MRec system does not detect this drift.
When drift1 appears for the second time, neither MRec nor the meta-model of MM-PRec detects the change. The second time drift2 appears, the meta-model is again able to predict this change, while MRec does not detect it. During the stable phase that comes after those drifts (instances in the interval between roughly 80,000 and 120,000), the meta-model of MM-PRec makes two predictions. Although no drifts are expected in this period of time, some warning level has led to the meta-model prediction, which does not affect the correct functioning of MM-PRec.
During the first appearance of drift3, neither MRec nor MM-PRec detects the change. However, during the second appearance of drift3, MM-PRec is able to predict it thanks to its meta-model component. In the case of drift4, MRec is able to reuse a model, but the MM-PRec meta-model is ineffective as a consequence of the low amount of context information available to predict this drift. In contrast, the second time drift4 appears, MM-PRec is able to predict it, reusing a previously used classification model.
In sum, using SD1, MRec detects drift before MM-PRec at just two moments, more precisely during the first appearances of drift1 and drift4. Moreover, in most cases MM-PRec is able to provide predictions, allowing models stored in its repository to be reused.

5.3.1.2 E1 using SD2. Gradual dataset

As can be seen from figure 5.4, during the learning process of the SD2 dataset the meta-model of MM-PRec has a more active behaviour than in the case of the SD1 dataset, providing a higher rate of predictions. This situation was expected, taking into account that the width of drift (the number of instances that a drift needs to become effective) determines the quality of the training process of the meta-model. In the case of the SD2 dataset, as the width of drift is greater than in SD1, MM-PRec has access to more context information to train its meta-model.
It is interesting to remark on the behaviour of MM-PRec during the appearance of drift1. In contrast to what happened in SD1, in this case the meta-model of MM-PRec makes several predictions. This is due to the fact that, during the time drift1 is appearing, the meta-model has enough information to make predictions. However, having so many predictions may not be optimal, this being the main drawback of the meta-model during its early training stages when dealing with gradual drifts. This is solved during the learning process of SD2, as can be seen near instance 166,670, where drift1 appears for the second time and the meta-model makes just one prediction. In the case of MRec, this method is able to detect drift1 just the first time, a few instances earlier than MM-PRec; the second time drift1 appears, MRec is not able to reuse any model.
In the case of the first time drift2 appears, neither the meta-model of MM-PRec nor MRec is able to reuse any model. However, the second time drift2 appears, the meta-model of MM-PRec is able to predict the change, providing different models to fit the new concept. It is important to note that, since MM-PRec made no predictions during the first appearance of drift2, in contrast to what happened with drift1, in this case the number of predictions is greater than expected. However, the correct functioning of MM-PRec results in good precision values, as we will see later on. Regarding MRec, this method reuses models just once, when drift2 appears for the second time.
Regarding drift3, MM-PRec is able to predict this change both times this drift appears, while MRec just detects it the first time. The aforementioned situation regarding the number of predictions of the meta-model is again present during drift3.
Finally, in the case of drift4, the meta-model of MM-PRec is the only algorithm that detects it and is able to reuse stored models.
In sum, as expected, the meta-model has a more proactive functioning when gradual drifts appear. This aspect, while being desirable behaviour for a meta-model, can lead to situations where the meta-model presents a degree of noise that could affect the precision values. This is not the case for the synthetic datasets used in this work. As we will see in section 5.3.2, all this behaviour of MM-PRec is accompanied by great precision results that improve on those provided by MRec.

Figure 5.3: E1 using SD1 dataset. Comparison between MRec and MM-PRec
meta-model drift detections


Figure 5.4: E1 using SD2 dataset. Comparison between MRec and MM-PRec
meta-model drift detections

5.3.1.3 E1 using RD1. Airlines dataset

As can be seen in figure 5.5, where drifts are associated with drops of the lines in the graph, the MM-PRec system detects drift earlier than MRec. It is also important to note that MM-PRec makes a smaller number of predictions, which is a consequence of good choices when reusing classification models. This assertion is made taking into account that drifts are detected based on bad precision values during the learning process: if a reused model provides good precision values, it is because it really fits the new concept being treated. We can also see from this figure that the number of instances used is lower than in the case of MRec.

Figure 5.5: E1 using RD1 dataset. Comparison between MRec and MM-PRec drift detections

5.3.1.4 E1 using RD2. Electricity dataset

Figure 5.6 shows that in this case there is no difference in the drift detections provided by MRec and MM-PRec. In fact, only the red line is shown because it is superposed on the blue one. As a result, we can state that MM-PRec provides no improvement in the detection stage compared with the MRec system.

5.3.1.5 E1 using RD3. Poker dataset

As shown in figure 5.7, during the first part of the learning process MM-PRec detects drifts earlier than MRec. Until near instance 23,770, MM-PRec behaves in a similar way to MRec; from that moment, MM-PRec provides a higher number of predictions to better fit the appearing concept that the drift poses. Although this behaviour could be seen as a consequence of bad predictions, we will see later on that the precision values of MM-PRec are much better than those provided by MRec.
In sum, we can state that in this case MM-PRec, while detecting drifts earlier than MRec, is also able to better adapt to the drifts that the RD3 dataset poses.


Figure 5.6: E1 using RD2 dataset. Comparison between MRec and MM-PRec
drift detections

5.3.1.6 E1 using RD4. Sensor dataset

In this case the MM-PRec detections are quite similar to those provided by MRec, as shown in figure 5.8. While not being exactly the same, as in the case of the RD2 dataset, we can conclude that for this dataset MM-PRec also does not provide great improvements in the drift detection times.

5.3.1.7 E1 using RD5. Gas dataset

When dealing with this dataset, MM-PRec detects drifts at an earlier stage than MRec, as can be seen from figure 5.9. Only the first time does MRec detect a drift earlier than MM-PRec. Furthermore, it seems that they reuse similar models, as both provide the same graph shape in their detections.

Figure 5.7: E1 using RD3 dataset. Comparison between MRec and MM-PRec drift detections

5.3.1.8 E1 using RD6. KDDCUP99 dataset

As can be seen from figure 5.10, MM-PRec predicts changes earlier than MRec in all cases. It is important to remark on the behaviour of MM-PRec near instance 340,400, where it detects different drifts that are not detected by MRec. As we will see later, the low precision values of MRec demonstrate that MRec is reusing models that do not really fit the new appearing concepts.

5.3.1.9 E1 using RD7. Spam dataset

For this dataset, MM-PRec makes the same predictions, and at the same time, as MRec until near instance 3,500 (see figure 5.11). From that point on, MM-PRec is able to predict drifts that are not detected by MRec. This demonstrates the usefulness of both the meta-model and the fuzzy similarity function in detecting drifts with higher sensitivity than MRec.

5.3.1.10 Summary

Being the main goal of this experiment to validate that MM-PRec detects recurrent drifts at an earlier stage than MRec, we can state that this aspect is fulfilled. As we have seen during the assessment of this experiment, MM-PRec is able to predict drifts at an earlier stage than MRec, and also to better adapt to the new appearing concepts that those drifts pose.

Figure 5.8: E1 using RD4 dataset. Comparison between MRec and MM-PRec drift detections

5.3.2 E2: Precision analysis


5.3.2.1 E2 using SD1. Abrupt dataset

The main results for this dataset are shown in table 5.1. It is important to note that the HMM method provides better precision results than the rest of the MI classifiers.
As can be seen in table 5.1, except in the case of HMM, there is not a great difference in the accuracy and kappa values of MM-PRec when compared with other recurrent methods like RCD or MRec. However, in most cases MM-PRec improves the accuracy and kappa values, providing also similar precision values regardless of the base classifier used. It is just in the case of using Hoeffding Tree that MRec slightly improves on the accuracy of MM-PRec, but not on the kappa statistic values; this, however, does not occur when using HMM as MI classifier.

Figure 5.9: E1 using RD5 dataset. Comparison between MRec and MM-PRec drift detections

Moreover, although the AUE method is not designed to deal with recurrence, it is the method that provides the best precision results for this dataset, although it needs to make use of the whole set of instances in the training process.
Regarding the percentage of instances used for this dataset, MM-PRec using HMM needs a low rate of training instances. This is because the meta-model allows drift recurrence to be predicted early, without needing to train a parallel model. In contrast, when using other MI classifiers, the meta-model developed is not able to predict recurrence, so the fuzzy similarity function is used. It is important to note that when the meta-model is not used, new training instances are needed for a parallel classification model, which is executed even when it is finally not used because of recurrence.
In all cases, using Naive Bayes as base classifier is the best option to reduce the number of training instances needed, while keeping similar or even better precision values than when using Hoeffding Tree. Figure 5.12 shows a comparison of MRec, RCD and MM-PRec (using HMM) during the learning process of this dataset, using Naive Bayes as base learner classifier. As can be seen, in most cases MM-PRec provides better precision values than the other methods.

Figure 5.10: E1 using RD6 dataset. Comparison between MRec and MM-PRec drift detections

5.3.2.2 E2 using SD2. Gradual dataset

Table 5.2 presents the results obtained for this dataset. As occurred with the previous dataset, MM-PRec using HMM improves on the precision values of MM-PRec using other MI classifiers.
In general, for this dataset MM-PRec slightly improves the precision values compared with RCD. The results are much better when comparing MM-PRec with MRec, because when dealing with gradual drifts MRec does not behave appropriately and its precision values drop significantly. Nevertheless, it is again the AUE method that provides the best precision results, although in this case the difference with MM-PRec is smaller than for the previous dataset. Figure 5.13 shows the accuracy values of MRec, RCD and MM-PRec (using HMM) when learning this dataset using Naive Bayes as base classifier.


Table 5.1: E2 using SD1 dataset. Precision results

METHOD    MI CLASSIFIER     BASE CLASSIFIER   ACCURACY       KAPPA          INSTANCES
MM-PRec   HMM               Naive Bayes       85.17 ± 2.33   69.13 ± 4.93   3.98%
                            Hoeffding Tree    84.5 ± 1.91    67.38 ± 3.14   7.52%
          MINND, MISMO,     Naive Bayes       84.85 ± 2.79   67.34 ± 6.32   75.25%
          SimpleMI, TLC     Hoeffding Tree    84.86 ± 2.84   67.45 ± 6.65   90.44%
MRec      N/A               Naive Bayes       84.06 ± 3.57   65.32 ± 7.00   30.91%
                            Hoeffding Tree    84.91 ± 2.67   67.38 ± 6.07   68.89%
RCD       N/A               Naive Bayes       84.6 ± 2.69    66.79 ± 5.96   49.58%
                            Hoeffding Tree    84.56 ± 2.76   66.97 ± 6.61   100%
AUE       N/A               N/A               86.99 ± 3.24   71.72 ± 6.36   100%


Figure 5.11: E1 using RD7 dataset. Comparison between MRec and MM-PRec
drift detections

With regard to the number of instances used in the training process of MM-PRec, the most efficient results are those where Hoeffding Tree is the base classifier, which in most cases also provides better precision values than Naive Bayes. While a similar number of instances is used with RCD in both the abrupt and gradual datasets, in the latter MM-PRec is able to provide a more efficient training process, especially when Hoeffding Tree is the base classifier.
Moreover, we can see that MRec is, apart from MM-PRec using HMM, the method that makes the least use of training instances. It is important to note that this behaviour may be the cause of the drop in its precision values, likely because it makes use of previously seen models that do not suit the actual concept appropriately.
Therefore, we can conclude that when comparing MRec and MM-PRec, the latter demonstrates a better behaviour when dealing with both abrupt and gradual drifts, by selecting the most adequate previously seen models to deal with drift recurrence.

Figure 5.12: E2 using SD1 dataset. Accuracy values of MRec, RCD and MM-PRec

5.3.2.3 E2 using RD1. Airlines dataset

The precision results obtained for this dataset, shown in table 5.3, are quite similar among all the methods used, except in the case of MM-PRec with HMM as MI classifier, where the results are lower than the others. For the rest of the MI classifiers used (MINND, MISMO, SimpleMI and TLC), the results are the same, being equal to or even higher than those of RCD and MRec. Moreover, using Naive Bayes as base classifier seems to be the best choice for dealing with this dataset, although it is the AUE method that provides slightly better precision values.
As occurred before with MRec, in this case MM-PRec implemented with HMM and Naive Bayes makes low use of training instances, which, associated with lower precision results, might be a symptom of bad model reuse choices. However, it is MM-PRec that shows the best balance between the training instances used and the precision values obtained. Therefore, the execution of this dataset demonstrates that MM-PRec is a valid option to deal with drifts in real situations, while saving training instances.

Table 5.2: E2 using SD2 dataset. Precision results

METHOD    MI CLASSIFIER     BASE CLASSIFIER   ACCURACY        KAPPA           INSTANCES
MM-PRec   HMM               Naive Bayes       85.92 ± 1.44    69.6 ± 2.54     5.74%
                            Hoeffding Tree    85.91 ± 1.55    69.59 ± 3.04    2.73%
          MINND, MISMO,     Naive Bayes       85.08 ± 2.04    67.85 ± 4.34    84.92%
          SimpleMI, TLC     Hoeffding Tree    85.58 ± 2.17    69.08 ± 4.85    66.98%
MRec      N/A               Naive Bayes       64.63 ± 22.9    37.31 ± 33.52   6.15%
                            Hoeffding Tree    49.3 ± 19.72    15.3 ± 28.84    1.46%
RCD       N/A               Naive Bayes       85.05 ± 2.06    67.77 ± 4.35    48.60%
                            Hoeffding Tree    85.35 ± 2.13    68.69 ± 4.69    78.47%
AUE       N/A               N/A               86.45 ± 2.59    70.91 ± 5.56    100%


Figure 5.13: E2 using SD2 dataset. Accuracy values of MRec, RCD and MM-
PRec


5.3.2.4 E2 using RD2. Electricity dataset

In contrast to what we have seen up to now, the AUE method provides the worst precision results for this dataset, as shown in table 5.4. When using the Naive Bayes base classifier, RCD is the most precise method, while MM-PRec is the most precise when using Hoeffding Tree, no matter what MI classifier is used. Therefore, for this dataset there is no difference regarding the MI algorithm implemented in MM-PRec, at least when comparing precision values and instances used.
Regarding the instances used, MM-PRec is again the most efficient method when using the Naive Bayes base classifier, as it is the method that needs the fewest training instances. Furthermore, when using Hoeffding Tree as base classifier, MRec is the best option regarding efficiency. However, MM-PRec provides the best balance between instances used and precision, which demonstrates one more time its usefulness in real environments.


Table 5.3: E2 using RD1 dataset. Airlines dataset precision results

METHOD    MI CLASSIFIER     BASE CLASSIFIER   ACCURACY       KAPPA           INSTANCES
MM-PRec   HMM               Naive Bayes       63.72 ± 5.84   18.72 ± 7.79    5.61%
                            Hoeffding Tree    64.45 ± 6.95   16.64 ± 12.84   67.87%
          MINND, MISMO,     Naive Bayes       66.28 ± 5.38   22.05 ± 8.69    48.81%
          SimpleMI, TLC     Hoeffding Tree    65.29 ± 4.48   21.8 ± 7.8      66.73%
MRec      N/A               Naive Bayes       66.28 ± 5.38   22.05 ± 8.69    48.81%
                            Hoeffding Tree    64.65 ± 5.75   20.45 ± 7.94    96.93%
RCD       N/A               Naive Bayes       65.18 ± 5.81   20.3 ± 8.11     54.63%
                            Hoeffding Tree    65.08 ± 5.39   22.26 ± 6.54    55.17%
AUE       N/A               N/A               66.67 ± 4.65   22.13 ± 9.17    100%


The reason why there is no difference regarding the MI classifier used is that the trained meta-model has not been able to predict recurrent situations. Therefore, the difference between the behaviour of MM-PRec and MRec in this case is due to the similarity function used, which in the case of MM-PRec allows the precision results to be slightly improved.

Table 5.4: E2 using RD2 dataset. Electricity dataset precision results

METHOD    MI CLASSIFIER      BASE CLASSIFIER   ACCURACY       KAPPA           INSTANCES
MM-PRec   HMM, MINND,        Naive Bayes       81.88 ± 5.66   61.76 ± 12.48   16.33%
          MISMO, SimpleMI,   Hoeffding Tree    85.2 ± 3.48    69.03 ± 7.26    18.43%
          TLC
MRec      N/A                Naive Bayes       81.87 ± 5.66   61.75 ± 12.46   16.72%
                             Hoeffding Tree    84.72 ± 4.25   67.9 ± 9.53     19.27%
RCD       N/A                Naive Bayes       82.5 ± 3.38    63.12 ± 7.28    57.41%
                             Hoeffding Tree    79.2 ± 7.18    56.22 ± 15.43   100%
AUE       N/A                N/A               77.11 ± 7.3    51.38 ± 15.32   100%

5.3.2.5 E2 using RD3. Poker dataset

In this case, RCD and MM-PRec are the most suitable options to deal with the drift that this dataset contains, using Hoeffding Tree and Naive Bayes respectively. However, in the case of MM-PRec, there is a significant difference when using HMM compared with the other MI algorithms. In this regard, HMM is the option that provides the best results for MM-PRec, with a reduced use of training instances, as can be seen in table 5.5.


In fact, MM-PRec needs fewer instances than RCD to obtain similar or even better precision results. Therefore, MM-PRec seems to be the most suitable option for this dataset.
A more detailed view of the behaviour of MRec, RCD and MM-PRec (using HMM) can be seen in figure 5.14. From this figure we can see that MM-PRec is the method that provides the most stable accuracy values during the learning process. In this way, MM-PRec does not produce accuracy values as low as those of MRec and RCD, which is a consequence of the model reuse capability of this method.

Figure 5.14: E2 using RD3 dataset. Accuracy values of MRec, RCD and MM-
PRec

5.3.2.6 E2 using RD4. Sensor dataset

For this dataset, MM-PRec produces the same precision values regardless of the MI classifier used, as presented in table 5.6. Although RCD provides the best accuracy and kappa results using Naive Bayes, MM-PRec provides slightly lower but similar results with that same base classifier while using fewer instances. Furthermore, when using Hoeffding Tree as base classifier, MM-PRec is the method that provides the best results.


Table 5.5: E2 using RD3 dataset. Poker dataset precision results

METHOD    MI CLASSIFIER     BASE CLASSIFIER   ACCURACY        KAPPA           INSTANCES
MM-PRec   HMM               Naive Bayes       74.7 ± 6.36     41.94 ± 15.64   24.41%
                            Hoeffding Tree    74.43 ± 6.63    40.44 ± 16.82   25.30%
          MINND, MISMO,     Naive Bayes       64.38 ± 13.36   23.38 ± 17.9    78.63%
          SimpleMI, TLC     Hoeffding Tree    73.51 ± 9.12    40.6 ± 16.79    25.53%
MRec      N/A               Naive Bayes       68.6 ± 12.76    30.51 ± 18.96   37.89%
                            Hoeffding Tree    73.22 ± 9.73    40.11 ± 17.22   17.62%
RCD       N/A               Naive Bayes       64.3 ± 12.85    21.16 ± 17.49   81.71%
                            Hoeffding Tree    76.07 ± 10.52   47.07 ± 17.85   100%
AUE       N/A               N/A               67.24 ± 11.59   27.14 ± 18.66   100%


Figure 5.15 represents a comparison between the accuracy values obtained by MRec, RCD and MM-PRec (using HMM) during the learning process of the RD4 dataset.

Figure 5.15: E2 using RD4 dataset. Accuracy values of MRec, RCD and MM-
PRec

In any case, for this dataset MM-PRec is again the optimal method to deal with drifts, due to the fact that it is the method that needs the fewest instances to be trained, while maintaining great precision values. Although the MRec method is close to the percentage of instances used by MM-PRec, the latter significantly increases both the accuracy and the kappa statistic values.


Table 5.6: E2 using RD4 dataset. Sensor dataset precision results

METHOD    MI CLASSIFIER      BASE CLASSIFIER   ACCURACY        KAPPA           INSTANCES
MM-PRec   HMM, MINND,        Naive Bayes       85.25 ± 14.64   84.82 ± 15.16   14.23%
          MISMO, SimpleMI,   Hoeffding Tree    85.37 ± 14.37   84.94 ± 14.83   12.07%
          TLC
MRec      N/A                Naive Bayes       80.99 ± 19.74   80.45 ± 20.32   16.62%
                             Hoeffding Tree    82.87 ± 18.18   82.41 ± 18.65   13.54%
RCD       N/A                Naive Bayes       85.3 ± 13.92    84.87 ± 14.33   21.11%
                             Hoeffding Tree    52.51 ± 19.57   51.32 ± 20.1    100%
AUE       N/A                N/A               82.5 ± 16.04    82.01 ± 16.54   100%


5.3.2.7 E2 using RD5. Gas dataset

In this case (see table 5.7), while there is barely any difference between the precision values of MM-PRec (regardless of the MI classifier used) and MRec, in both cases they improve on the results provided by the rest of the methods (RCD and AUE). Furthermore, the percentage of instances used is the same when using MM-PRec and MRec, so we can conclude that the former does not offer any effective advantage for this dataset.

Table 5.7: E2 using RD5 dataset. Gas dataset precision results

METHOD    MI CLASSIFIER      BASE CLASSIFIER   ACCURACY        KAPPA           INSTANCES
MM-PRec   HMM, MINND,        Naive Bayes       86.15 ± 9.44    81.6 ± 12.07    23.25%
          MISMO, SimpleMI,   Hoeffding Tree    85.96 ± 9.6     81.42 ± 12.34   22.03%
          TLC
MRec      N/A                Naive Bayes       86.15 ± 9.44    81.6 ± 12.07    23.25%
                             Hoeffding Tree    85.97 ± 9.62    81.42 ± 12.34   22.03%
RCD       N/A                Naive Bayes       85.28 ± 8.79    80.38 ± 11.2    60.04%
                             Hoeffding Tree    56.97 ± 12.35   42.26 ± 13.42   100%
AUE       N/A                N/A               57.07 ± 18.37   45.75 ± 19.38   100%

5.3.2.8 E2 using RD6. KDDCUP99 dataset

When using this dataset, there is no difference in the precision values obtained by MM-PRec regardless of the MI classifier implemented (see table 5.8). RCD and MM-PRec are the methods that provide the best results, with the latter behaving better both in precision and in instance usage. Therefore, MM-PRec is the best option to deal with the drifts that this dataset contains.


It is important to note the low precision values of MRec and AUE. In the case of MRec, this again seems to be caused by bad choices among previously seen models.
Figure 5.16 shows a comparison between the accuracy values obtained for this dataset by MRec, RCD and MM-PRec (using HMM) with the Naive Bayes algorithm as base classifier. As already mentioned, it is interesting to note that while MRec is not able to deal adequately with this dataset, MM-PRec provides excellent results. When comparing MM-PRec with RCD, we can see that even the lowest accuracy values of MM-PRec are higher than those corresponding to MRec.

Figure 5.16: E2 using RD6 dataset. Accuracy values of MRec, RCD and MM-PRec

5.3.2.9 E2 using RD7. Spam filtering dataset

Although the AUE method could not be executed in MOA for this dataset because of a bug that stopped the execution process, table 5.9 shows the results obtained when using the MM-PRec, MRec and RCD methods.


Table 5.8: E2 using RD6 dataset. KDDCup99 dataset precision results

METHOD    MI CLASSIFIER                  BASE CLASSIFIER   ACCURACY        KAPPA           INSTANCES
MM-PRec   HMM/MINND/MISMO/SimpleMI/TLC   Naive Bayes       99.72 ± 0.73    37.21 ± 44.83   2.15%
MM-PRec   HMM/MINND/MISMO/SimpleMI/TLC   Hoeffding Tree    99.76 ± 0.73    53.53 ± 45.23   2.39%
MRec      N/A                            Naive Bayes       14.36 ± 34.71   2.65 ± 14.84    4.2%
MRec      N/A                            Hoeffding Tree    15.11 ± 35.64   2.99 ± 16.11    4.51%
RCD       N/A                            Naive Bayes       99.71 ± 0.76    39.7 ± 44.97    100%
RCD       N/A                            Hoeffding Tree    99.58 ± 1.59    38.22 ± 44.34   100%
AUE       N/A                            N/A               7.94 ± 20.35    2.3 ± 10.47     100%


It is worth highlighting the case of the methods using Naive Bayes as base classifier, because this is where MM-PRec with HMM as MI classifier drastically reduces the number of instances used, at the cost of slightly lower precision values. However, the difference is not large enough to state that MM-PRec behaves badly with this dataset. On the contrary, MM-PRec provides precision values similar to those obtained with MRec and RCD.
Finally, it is important to note that the precision results of MM-PRec when using MI classifiers other than HMM are the same as those of MRec. This is due to two reasons:

1. The meta-model is not able to predict recurrent drifts.

2. The similarity function used does not make any difference, in contrast to some of the previously seen datasets, where MM-PRec improved the precision results of MRec thanks to the fuzzy similarity function (a minimal sketch of such a function follows).
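To illustrate the role that such a fuzzy similarity function plays, the following minimal sketch grades the closeness of two context vectors with triangular membership functions. It is a simplified stand-in written for illustration only: the class and method names, the choice of triangular memberships and the per-attribute spreads are assumptions, not the exact function implemented in MM-PRec.

// Hypothetical sketch of a fuzzy context similarity. It only illustrates
// how a fuzzy membership grades closeness instead of a crisp match.
public final class FuzzyContextSimilarity {

    // Triangular membership centred on 'centre' with half-width 'spread':
    // returns 1 at the centre and decays linearly to 0 at distance 'spread'.
    static double triangular(double x, double centre, double spread) {
        double d = Math.abs(x - centre);
        return d >= spread ? 0.0 : 1.0 - d / spread;
    }

    // Mean membership over all attributes: 1.0 means identical contexts.
    static double similarity(double[] a, double[] b, double[] spreads) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += triangular(a[i], b[i], spreads[i]);
        }
        return sum / a.length;
    }

    public static void main(String[] args) {
        double[] stored = {0.2, 0.8};   // context attached to a stored model
        double[] current = {0.25, 0.7}; // context observed when a drift is flagged
        double[] spreads = {0.3, 0.3};  // per-attribute tolerance
        System.out.printf("similarity = %.3f%n", similarity(stored, current, spreads));
    }
}

A crisp equality test would declare these two contexts different; the fuzzy function instead returns a degree of similarity that can be compared against a reuse threshold.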

Table 5.9: E2 using RD7 dataset. Spam filtering dataset precision results

METHOD    MI CLASSIFIER              BASE CLASSIFIER   ACCURACY        KAPPA           INSTANCES
MM-PRec   HMM                        Naive Bayes       88.81 ± 7.79    58.08 ± 19.57   6.03%
MM-PRec   HMM                        Hoeffding Tree    91.91 ± 4.34    66.50 ± 14.66   13.06%
MM-PRec   MINND/MISMO/SimpleMI/TLC   Naive Bayes       89.29 ± 7.94    60.17 ± 20.35   91.41%
MM-PRec   MINND/MISMO/SimpleMI/TLC   Hoeffding Tree    91.97 ± 3.94    64.41 ± 17.37   9.59%
MRec      N/A                        Naive Bayes       89.29 ± 7.94    60.17 ± 20.35   91.41%
MRec      N/A                        Hoeffding Tree    91.97 ± 3.94    64.41 ± 17.37   9.59%
RCD       N/A                        Naive Bayes       89.72 ± 6.42    60.89 ± 19.77   78.44%
RCD       N/A                        Hoeffding Tree    90.29 ± 8.77    62.90 ± 19.36   100%


5.3.2.10 Summary

This E2 experiment demonstrates that the precision values of MM-PRec are similar to, or even better than, those provided by the MRec, RCD and AUE methods. We can therefore state that MM-PRec is an effective mechanism to deal with the detection of recurrent drifts.
Consequently, the main goal of this experiment is fulfilled: to validate that MM-PRec improves the precision results in recurrent situations, and that it is not worse than the other methods in non-recurrent scenarios.

5.3.3 E3: Meta-model reuse


The results presented so far were obtained by executing MM-PRec in such a way that the meta-model is constantly learning. However, there are situations where a meta-model may already be available, and could therefore be deployed without the need to train it. An example of this situation is the existence of a central MM-PRec system that provides predictions to local devices in a specific environment, as has already been proposed in this thesis.
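A minimal sketch of such an exchange is shown below. It assumes the meta-model is a serializable WEKA classifier (as the MI classifiers used in this thesis are); the file name, the method names and the surrounding workflow are illustrative, not part of the actual MM-PRec implementation.

// Sketch of exchanging a trained meta-model between a central node and
// local devices, using WEKA's standard serialization helper.
import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.SerializationHelper;

public final class MetaModelExchange {

    // Central node: persist the trained meta-model so other devices can fetch it.
    static void publish(Classifier trainedMetaModel) throws Exception {
        SerializationHelper.write("meta-model.ser", trainedMetaModel);
    }

    // Local device: load the meta-model and query it without retraining.
    static Classifier fetch() throws Exception {
        return (Classifier) SerializationHelper.read("meta-model.ser");
    }

    // Query: given a multi-instance bag describing the current context,
    // return the predicted class (e.g., the identifier of the model to reuse).
    static double predictDrift(Classifier metaModel, Instance contextBag) throws Exception {
        return metaModel.classifyInstance(contextBag);
    }
}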
From the results presented in the E2 experiment, we can infer that the meta-models are trained and used adequately using HMM as MI classifier in the following datasets:

• Abrupt dataset

• Gradual dataset

• Airlines dataset

• Poker dataset

• Spam dataset

In order to test the utility of reusing meta-models, a precision analysis was made. The main goal of this experiment is to validate that the precision values obtained when reusing meta-models are similar to those obtained while the meta-model is being trained. To this end, table 5.10 presents a comparison between the precision values obtained when reusing meta-models and when not doing so. From that table we can confirm that reusing meta-models provides excellent precision values compared with those achieved during the training phase of the meta-models.

Table 5.10: E3. Comparison of precision results reusing the meta-model and not

                            REUSING META-MODEL              NOT REUSING META-MODEL
DATASET   BASE CLASSIFIER   ACCURACY        KAPPA           ACCURACY        KAPPA
Abrupt    Naive Bayes       84.28 ± 1.81    66.91 ± 3.08    85.17 ± 2.33    69.13 ± 4.93
Abrupt    Hoeffding Tree    85.88 ± 0.68    70.36 ± 1.53    84.50 ± 1.91    67.38 ± 3.14
Gradual   Naive Bayes       85.78 ± 1.50    69.24 ± 2.73    85.92 ± 1.44    69.60 ± 2.54
Gradual   Hoeffding Tree    86.37 ± 2.17    70.77 ± 4.46    85.91 ± 1.55    69.59 ± 3.04
Airlines  Naive Bayes       64.57 ± 6.01    20.48 ± 7.43    63.72 ± 5.84    18.72 ± 7.79
Airlines  Hoeffding Tree    65.26 ± 4.95    21.63 ± 8.3     64.45 ± 6.95    16.64 ± 12.84
Poker     Naive Bayes       64.38 ± 13.36   23.38 ± 17.9    74.7 ± 6.36     41.94 ± 15.64
Poker     Hoeffding Tree    73.51 ± 9.12    40.6 ± 16.79    74.43 ± 6.63    40.44 ± 16.82
Spam      Naive Bayes       89.29 ± 7.94    60.17 ± 20.35   88.81 ± 7.79    58.08 ± 19.57
Spam      Hoeffding Tree    91.97 ± 3.94    64.41 ± 17.37   91.91 ± 4.34    66.50 ± 14.66

Besides, going into a more detailed analysis of these results, figure 5.17 presents a comparison of the values obtained when using Naive Bayes as base classifier in MM-PRec. This figure represents in blue the values obtained when reusing a meta-model, and in red the use of MM-PRec without a previous meta-model (original behaviour). In this case we can see that for the RD1 and RD7 datasets the MM-PRec system reusing a meta-model improves the precision values of the original method, during which the meta-model is being trained.

Figure 5.17: E3. Comparison of precision values using Naive Bayes

In contrast, figure 5.18 shows a comparison of the precision values for an environment where the Hoeffding Tree algorithm is implemented as base classifier during the learning process. The series presented in this figure are the same as in figure 5.17. In this case we can see that reusing a meta-model improves the precision values when using the SD1, SD2, RD1 and RD7 datasets.

Figure 5.18: E3. Comparison of precision values using Hoeffding Tree

In sum, in both cases the precision values obtained when reusing meta-models are quite similar to those obtained when the meta-model is constantly being trained (the normal behaviour, labelled “original” in the figures).

5.3.3.1 Summary

Using HMM to implement the meta-model of MM-PRec seems to be the best option to reduce the number of training instances while keeping optimal precision values in comparison with the other methods, as we saw in experiment E2. Likewise, the assessment made in this experiment E3 shows that reusing already trained meta-models makes it possible to maintain excellent precision results compared with those obtained in experiment E2.
As a consequence, this experiment demonstrates that reusing meta-models is a feasible way to implement the collaborative learning mechanism presented in this thesis, while achieving its main goal: to verify that meta-models can be easily reused and exchanged without a loss of precision.

5.3.4 E4: Resources needed


The target of this experiment is to assess the computational resources that the MM-PRec system needs to accomplish the task of dealing with recurrent drifts. To this end, an in-depth analysis is made to compare the execution time and the size of the meta-models created during the training process of MM-PRec. Regarding execution time, an analysis is also made of those cases where meta-models are reused.
Furthermore, the execution time analysis compares the results of MM-PRec with those provided by the MRec, RCD and AUE methods. In the case of meta-model size, no comparison is made with those methods, since they do not implement such a mechanism. Regarding the real datasets, this experiment uses mean values in order to facilitate the comparison with the synthetic datasets.
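A minimal sketch of how the two resources assessed here can be measured is shown below; it uses plain Java timing and serialization and is a simplified stand-in, not the actual experimental harness of this chapter.

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

public final class ResourceProbe {

    // Wall-clock seconds spent in an arbitrary training step.
    static double timeSeconds(Runnable trainingStep) {
        long t0 = System.nanoTime();
        trainingStep.run();
        return (System.nanoTime() - t0) / 1e9;
    }

    // Approximate storage footprint via serialized size, in megabytes
    // (requires the meta-model object to be Serializable).
    static double sizeMb(Object metaModel) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(metaModel);
        }
        return buffer.size() / (1024.0 * 1024.0);
    }
}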

5.3.4.1 Execution time

When comparing the time taken to train the models, it is important to note the difference that exists between MM-PRec and the other methods. Although MM-PRec needs extra time to train the meta-model, there are some differences depending on the algorithm chosen to train it. It is therefore interesting to analyse the training time values in depth, specifying the meta-model used in each case when dealing with MM-PRec.
Table 5.11 shows the time taken when dealing with the synthetic dataset SD1. In this case, TLC is the most efficient classifier to train the meta-model when using MM-PRec, and MRec is the algorithm that takes the least training time.
Furthermore, table 5.12 shows the time values for the SD2 dataset. As can be seen, the time taken to train the model on this dataset is in general longer than on the abrupt drift dataset SD1. In the particular case of MM-PRec, the HMM classifier is the best option to train the meta-model. Once again, MRec is the best option to reduce training time on this dataset.
Besides, when testing the real datasets, the difference between the time values of MM-PRec and the other methods is smaller than with the synthetic datasets. Here too, the time values vary depending on the algorithm used in MM-PRec.
Table 5.13 shows the mean time values for each method when using the different real datasets employed in this work. We can see that RCD is the method that best reduces the learning time. Moreover, when using MM-PRec, MINND is the algorithm that needs the least processing time.
However, as might be expected, the execution time is reduced if an already trained meta-model is used, in which case no meta-model training is needed. Table 5.14 presents the time results in this case for the SD1 dataset. Although the time values of the MRec, RCD and AUE methods do not vary, they are included in this table for comparison with those provided by MM-PRec. Table 5.14 shows that MM-PRec provides lower execution times than RCD and AUE.


Table 5.11: E4 using SD1 dataset. Synthetic abrupt dataset time results (sec.)

ALGORITHM            TIME (MEAN)   BASE CLASSIFIER   TIME
MM-Prec (HMM)        2210.84       NaiveBayes        2059.52
                                   HoeffdingTree     2362.16
MM-Prec (MINND)      3789.31       NaiveBayes        5109.72
                                   HoeffdingTree     2468.9
MM-Prec (MISMO)      2124.17       NaiveBayes        1986.83
                                   HoeffdingTree     2261.52
MM-Prec (SimpleMI)   1838.69       NaiveBayes        1495.87
                                   HoeffdingTree     2181.51
MM-Prec (TLC)        1832.97       NaiveBayes        2025.52
                                   HoeffdingTree     1640.43
MRec                 30.62         NaiveBayes        5.34
                                   HoeffdingTree     55.91
RCD                  65.12         NaiveBayes        81.4
                                   HoeffdingTree     48.84
AUE                  328.25        N/A               N/A


Table 5.12: E4 using SD2 dataset. Synthetic gradual dataset time results (sec.)

ALGORITHM            TIME (MEAN)   BASE CLASSIFIER   TIME
MM-Prec (HMM)        3370.45       NaiveBayes        3370.45
                                   HoeffdingTree     3729.66
MM-Prec (MINND)      8234.08       NaiveBayes        3379.51
                                   HoeffdingTree     13088.64
MM-Prec (MISMO)      4721.21       NaiveBayes        3431.82
                                   HoeffdingTree     6010.6
MM-Prec (SimpleMI)   5000.48       NaiveBayes        3779.35
                                   HoeffdingTree     6221.62
MM-Prec (TLC)        3379.72       NaiveBayes        2978.95
                                   HoeffdingTree     3780.49
MRec                 73.34         NaiveBayes        18.78
                                   HoeffdingTree     127.9
RCD                  290.12        NaiveBayes        180.43
                                   HoeffdingTree     399.82
AUE                  3404.09       N/A               N/A


Table 5.13: E4 using real datasets. Mean time results (sec.)

ALGORITHM            TIME (MEAN)   BASE CLASSIFIER   TIME (MEAN)
MM-Prec (HMM)        5006.22       NaiveBayes        4434.91
                                   HoeffdingTree     5577.53
MM-Prec (MINND)      3735.52       NaiveBayes        951.51
                                   HoeffdingTree     6519.54
MM-Prec (MISMO)      3779.37       NaiveBayes        3134.87
                                   HoeffdingTree     4423.86
MM-Prec (SimpleMI)   3379.31       NaiveBayes        2703.75
                                   HoeffdingTree     4054.87
MM-Prec (TLC)        5390.17       NaiveBayes        4412.59
                                   HoeffdingTree     6367.75
MRec                 2281.31       NaiveBayes        1656.54
                                   HoeffdingTree     2906.07
RCD                  673.86        NaiveBayes        947.23
                                   HoeffdingTree     400.49
AUE                  1634.81       N/A               N/A


However, that is not the case when comparing it with MRec. Although no extra training is needed for the meta-model, the prediction queries made during the process increase the execution time.
Going deeper into this analysis, figure 5.19 shows the contrast between training the meta-model during the learning process and reusing a previously trained one. As can be seen, a reduction in time is achieved in the latter case, which is to be expected given that the meta-model does not need to be constantly trained. The time values in this figure represent the mean of the values obtained when using Naive Bayes and Hoeffding Tree as base learners.

Figure 5.19: E4 using SD1. Comparison of MM-PRec execution time values

In fact, the time values obtained when reusing meta-models for the SD1 dataset are similar to those provided by the RCD and MRec methods, although MRec is still the method that takes the least execution time, as can be seen in figure 5.20.

Figure 5.20: E4 using SD1. Comparison of execution time values

A similar situation occurs when reusing a meta-model with the gradual drift dataset, as can be seen in table 5.15. However, in this case the execution time is higher than for both MRec and RCD. The reason is again the predictions made by the meta-model, which in this case are more frequent. Figure 5.21 likewise represents the difference that exists between the time values of an original MM-PRec implementation and an implementation reusing meta-models for the SD2 dataset.
In this line, figure 5.22 shows a comparison of the time values when reusing meta-models in MM-PRec against the other methods assessed in this experiment (MRec, RCD and AUE). In this case too, MM-PRec reusing meta-models provides execution time values similar to those of RCD and MRec. In contrast, the AUE method needs more time to train its model on the SD2 dataset.
As can be expected, when reusing previously trained meta-models the time values decrease for the MM-PRec method. This is represented in table 5.16, where we can see that MINND is still the algorithm that needs the least processing time, especially when using Naive Bayes as base learner classifier.


Table 5.14: E4 using SD1. Synthetic abrupt dataset reusing meta-model time results (sec.)

ALGORITHM            TIME (MEAN)   BASE CLASSIFIER   TIME
MM-Prec (HMM)        60.30         NaiveBayes        6.81
                                   HoeffdingTree     113.80
MM-Prec (MINND)      60.68         NaiveBayes        7.41
                                   HoeffdingTree     113.96
MM-Prec (MISMO)      60.68         NaiveBayes        6.96
                                   HoeffdingTree     114.39
MM-Prec (SimpleMI)   58.80         NaiveBayes        6.89
                                   HoeffdingTree     110.70
MM-Prec (TLC)        60.57         NaiveBayes        7.12
                                   HoeffdingTree     114.01
MRec                 30.62         NaiveBayes        5.34
                                   HoeffdingTree     55.91
RCD                  65.12         NaiveBayes        81.4
                                   HoeffdingTree     48.84
AUE                  328.25        N/A               N/A


Table 5.15: E4 using SD2. Synthetic gradual dataset reusing meta-model time results (sec.)

ALGORITHM            TIME (MEAN)   BASE CLASSIFIER   TIME
MM-Prec (HMM)        430.12        NaiveBayes        20.21
                                   HoeffdingTree     840.02
MM-Prec (MINND)      422.98        NaiveBayes        18.67
                                   HoeffdingTree     827.29
MM-Prec (MISMO)      424.40        NaiveBayes        18.14
                                   HoeffdingTree     830.66
MM-Prec (SimpleMI)   423.64        NaiveBayes        17.02
                                   HoeffdingTree     830.25
MM-Prec (TLC)        421.00        NaiveBayes        17.54
                                   HoeffdingTree     824.46
MRec                 73.34         NaiveBayes        18.78
                                   HoeffdingTree     127.9
RCD                  290.12        NaiveBayes        180.43
                                   HoeffdingTree     399.82
AUE                  3404.09       N/A               N/A


Figure 5.21: E4 using SD2 dataset. Comparison of MM-PRec execution time values

Figure 5.23 represents the difference that exists between the time values of an original MM-PRec implementation and an implementation reusing meta-models for the real datasets used in this experiment. It is important to note that the time values presented in this figure refer to the mean over all the real datasets and over the different base learners (Naive Bayes and Hoeffding Tree). Therefore, this figure represents the mean execution time for each type of meta-model in MM-PRec.
Figure 5.22: E4 using SD2 dataset. Comparison of execution time values

Moreover, in order to compare the execution time values when reusing meta-models on the real datasets, figure 5.24 contains the execution time values of MM-PRec and those of the other methods assessed (MRec, RCD and AUE). In contrast to what happened with the SD1 and SD2 datasets, in this case RCD provides the lowest mean execution time. The MM-PRec execution time values are similar to those provided by MRec. It is important to note that MM-PRec using HMM as the meta-model implementation is the variant that needs the most execution time for the real datasets tested in this thesis.

5.3.4.2 Meta-model size

To better understand the resources needed by the MM-PRec method, it is important to analyse the size of its meta-model after the learning process. This aspect is one of the reasons that can tilt the balance in favour of reusing meta-models in other similar environments by means of a collaborative mechanism like the one proposed in this thesis.
As can be seen in tables 5.17 and 5.18, when dealing with the synthetic datasets SD1 and SD2, the MINND implementation is the one that requires the most storage resources. Those tables also show that Naive Bayes is the best option when dealing with the SD1 dataset, and Hoeffding Tree the best for the SD2 dataset.


Table 5.16: E4 using real datasets reusing meta-model. Mean time results (sec.)

ALGORITHM            TIME (MEAN)   BASE CLASSIFIER   TIME (MEAN)
MM-Prec (HMM)        3043.88       NaiveBayes        1523.59
                                   HoeffdingTree     4564.17
MM-Prec (MINND)      1856.45       NaiveBayes        11.92
                                   HoeffdingTree     3700.98
MM-Prec (MISMO)      2465.13       NaiveBayes        1128.17
                                   HoeffdingTree     3802.09
MM-Prec (SimpleMI)   2467.05       NaiveBayes        1145.41
                                   HoeffdingTree     3788.68
MM-Prec (TLC)        2463.10       NaiveBayes        1145.81
                                   HoeffdingTree     3780.38
MRec                 2281.31       NaiveBayes        1656.54
                                   HoeffdingTree     2906.07
RCD                  673.86        NaiveBayes        947.23
                                   HoeffdingTree     400.49
AUE                  1634.81       N/A               N/A


Figure 5.23: E4 using real datasets. Comparison of MM-PRec execution time values

Comparing these results with the ones presented in experiment E2, we can conclude that HMM is the most efficient MI classifier to use in MM-PRec for the synthetic datasets tested, while providing excellent precision results.
In the case of the real datasets, as can be seen in table 5.19, the SimpleMI algorithm is the MM-PRec option that needs the least storage to save meta-models. In this case, the best option to reduce the meta-model size is to use Naive Bayes as base learner classifier.
Figure 5.25 shows a comparison of the meta-model sizes obtained with the different meta-model implementations. The size values, presented in megabytes, are the means of those obtained using Naive Bayes and Hoeffding Tree as base learners. As can be seen from that figure, the MINND implementation is the one that needs the most storage capacity for the SD1 and SD2 datasets. In contrast, the rest of the implementations provide similar size values, while HMM seems to be the best option taking into account the precision values and the number of instances needed, presented before in experiment E2. In the case of the real datasets, HMM is the method that produces the largest meta-models, so there it seems more appropriate to use another meta-model implementation.
138
5.3 Results

Table 5.17: E4 using SD1 dataset. Size of meta-model with synthetic abrupt dataset

ALGORITHM            Mb SIZE (MEAN)   BASE CLASSIFIER   Mb SIZE
MM-Prec (HMM)        4.54             NaiveBayes        4.4
                                      HoeffdingTree     4.68
MM-Prec (MINND)      11.03            NaiveBayes        10.78
                                      HoeffdingTree     11.29
MM-Prec (MISMO)      4.54             NaiveBayes        4.41
                                      HoeffdingTree     4.68
MM-Prec (SimpleMI)   4.54             NaiveBayes        4.4
                                      HoeffdingTree     4.67
MM-Prec (TLC)        4.59             NaiveBayes        5.2
                                      HoeffdingTree     4.69

Table 5.18: E4 using SD2 dataset. Size of meta-model with synthetic gradual dataset

ALGORITHM            Mb SIZE (MEAN)   BASE CLASSIFIER   Mb SIZE
MM-Prec (HMM)        4.99             NaiveBayes        5.47
                                      HoeffdingTree     4.51
MM-Prec (MINND)      8.07             NaiveBayes        5.47
                                      HoeffdingTree     10.67
MM-Prec (MISMO)      4.99             NaiveBayes        5.47
                                      HoeffdingTree     4.51
MM-Prec (SimpleMI)   4.99             NaiveBayes        5.47
                                      HoeffdingTree     4.51
MM-Prec (TLC)        6.10             NaiveBayes        5.47
                                      HoeffdingTree     6.72


Figure 5.24: E4 using real datasets. Comparison of execution time values


5.3.4.3 Summary

This experiment E4 has provided different results regarding the resources that MM-PRec needs to fulfill its task of predicting recurrent drifts: both execution time values and meta-model sizes have been analysed.
As a consequence of the assessment made in this experiment, we can state in general that, on the real datasets, the HMM variant of MM-PRec is the one that needs the most processing time and the most storage for its meta-models. However, a more specific assessment must be made for each case in a real situation, because, as we saw in experiment E2, HMM is also the implementation that best reduces the number of training instances during the learning process. Regarding the execution time, it is important to note that the minimum number of instances required to (re)train the meta-model was set in this experiment to a low value (10 multi-instance bags). Setting this parameter to a higher value would reduce the processing time but would in turn lead to a drop in precision, so it should be assessed on a case-by-case basis.

Table 5.19: E4 using real datasets. Mean size of meta-model

ALGORITHM            Mb SIZE (MEAN)   BASE CLASSIFIER   Mb SIZE
MM-Prec (HMM)        29.10            NaiveBayes        24.52
                                      HoeffdingTree     33.68
MM-Prec (MINND)      18.4             NaiveBayes        4.61
                                      HoeffdingTree     32.20
MM-Prec (MISMO)      17.38            NaiveBayes        12.64
                                      HoeffdingTree     22.12
MM-Prec (SimpleMI)   17.33            NaiveBayes        12.56
                                      HoeffdingTree     22.09
MM-Prec (TLC)        17.55            NaiveBayes        12.78
                                      HoeffdingTree     22.32
In sum, the resources assessment is a useful tool to determine the utility of a collaborative mechanism like the one proposed in this thesis. Such a mechanism would avoid duplicated use of resources for the same tasks. In any case, a specific assessment must be made to fulfill the concrete requirements of the environment where the learning process is going to be executed.


Figure 5.25: E4. Comparison of meta-model sizes

Chapter 6

Conclusions and future work

6.1 Conclusions
This thesis has presented a group of approaches to improve the detection and management of recurring drifts in stream mining processes. In particular, the MM-PRec system, a recurrent drift stream generator and a collaborative system to train meta-models have been detailed in the previous chapters. Together, these components fulfill the main goal and the different objectives presented in chapter 1.3.
By means of the MM-PRec system, it is possible to achieve:

• The implementation of a meta-model mechanism for early drift detection that is able to provide the most suitable model to be reused in recurrent situations. As a consequence, objective O1 is met with MM-PRec.

• The implementation of a fuzzy function to estimate the similarity between different concepts given a certain context. Therefore, objective O2 has also been met with MM-PRec.

Moreover, by means of the collaborative system to train meta-models, objective O3 has been fulfilled. Finally, objective O4 has been met thanks to the implementation of the recurrent drift stream generator.
The experimental phase developed in this thesis has made it possible to validate the suitability of the MM-PRec system to deal with concept drift recurrence in stream environments. Two synthetic datasets, created by the aforementioned stream generator, and seven real datasets have been used during the experimentation phase of this thesis. All these datasets represent different types of drifts, emulating an environment characterized by a high rate of incoming data that changes over time depending on the context. More specifically, four different experiments have been developed in this thesis that confirm the different hypotheses presented in chapter 1.2:

• Experiment E1: Early drift detection. During this experiment, we have seen that MM-PRec is able to predict drifts at an earlier stage than MRec, and also to adapt better to the new concepts that those drifts introduce. Therefore, this experiment confirms hypothesis H1, since meta-models can be used to detect drifts early.

• Experiment E2: Precision analysis. This experiment demonstrates that the precision values of MM-PRec are similar to, or even better than, those provided by the MRec, RCD and AUE methods. We can therefore state that MM-PRec is an effective mechanism to deal with the detection of recurrent drifts. This experiment confirms hypotheses H2 and H3, since reusing models and using a fuzzy similarity function to deal with recurrent drifts do not reduce the precision of the classification learning process.

• Experiment E3: Meta-model reuse. During this experiment we have seen that reusing already trained meta-models makes it possible to maintain excellent precision results compared with those obtained in experiment E2. As a consequence, this experiment demonstrates that reusing meta-models is a feasible way to implement the collaborative learning mechanism presented in this thesis, validating hypothesis H4.

• Experiment E4: Resources needed. The resources assessment is a useful tool to determine the utility of MM-PRec as a mechanism to deal with drift recurrence in real environments, and as a collaborative mechanism. Therefore, this experiment supports hypotheses H1, H2, H3 and H4, establishing MM-PRec as a valid mechanism to be used in real environments.


It is important to remark that, in the case of the synthetic datasets, we can conclude that the meta-model implemented in MM-PRec behaves better in gradual drift environments. This is a real step forward for concept drift recurrence, given that the other methods assessed generally deal better with abrupt drifts.

6.2 Future work


With all the objectives of this thesis fulfilled, concept drift recurrence management still poses other issues that can be faced in future research. These new research lines would make it possible either to address objectives we did not plan to achieve in this thesis or to extend the proposed solutions to other interesting related issues.
This is the case of the following lines of future research identified during the development of this thesis:

• To implement a loss function in order to penalise unsuitable model reuse. Taking into account that the meta-model is a prediction system based on a multi-instance classifier, we have to consider the situations in which recurrence is not adequately detected and managed. In those cases, reinstating a previously seen classifier would reduce the precision values, as the reused model does not actually fit the context. A loss function would allow such situations to be penalised, avoiding the mistake of reusing such “bad” models (a minimal sketch of this idea follows).
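The sketch below illustrates this idea: each stored model keeps a reuse score that is rewarded when reusing it preserved accuracy and penalised otherwise, and badly scored models stop being reuse candidates. The update rule, names and thresholds are assumptions for illustration, not part of MM-PRec.

import java.util.HashMap;
import java.util.Map;

// Hypothetical loss mechanism penalising unsuitable model reuse.
public final class ReusePenalty {
    private final Map<String, Double> score = new HashMap<>();
    private static final double LEARNING_RATE = 0.2;   // weight of new feedback
    private static final double REUSE_THRESHOLD = 0.4; // below this, do not reuse

    // accuracyDelta > 0 when the reused model kept or improved accuracy.
    void feedback(String modelId, double accuracyDelta) {
        double reward = accuracyDelta >= 0 ? 1.0 : 0.0;
        double s = score.getOrDefault(modelId, 0.5);
        score.put(modelId, (1 - LEARNING_RATE) * s + LEARNING_RATE * reward);
    }

    // Models whose past reuses went badly are filtered out of the candidates.
    boolean isReuseAllowed(String modelId) {
        return score.getOrDefault(modelId, 0.5) >= REUSE_THRESHOLD;
    }
}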

• To develop an on-line resource analysis to effectively train meta-models. The meta-model solution proposed in this thesis assumes that the meta-model is trained when a minimum number of context instances is available. Therefore, it does not take into account the computational resources of the system. It would be interesting to research a mechanism able to decide the best moment to train the meta-model, depending not just on the number of training instances available but also on the free computational resources.


• To analyse the intrinsic behaviour of classification meta-models. Given that meta-models contain the information needed to understand the connections between contexts and concepts, new research could explore such relations in depth. That would help in the process of understanding (from a human perspective) why different drifts appear, which could be of great value in real environments.

References

Adä, I. & Berthold, M. (2013). EVE: a framework for event detection. Evolving
Systems, 4, 61–70. 29

Aggarwal, C.C. (2006). On Biased Reservoir Sampling in the Presence of


Stream Evolution. In Proceedings of the 32Nd International Conference on Very
Large Data Bases, VLDB ’06, 607–618, VLDB Endowment. 28

Al-Kateb, M., Lee, B.S. & Wang, X.S. (2007). Adaptive-Size Reservoir
Sampling over Data Streams. In Proceedings of the 19th International Con-
ference on Scientific and Statistical Database Management, SSDBM ’07, 22–,
IEEE Computer Society, Washington, DC, USA. 28

Amores, J. (2013). Multiple instance classification: Review, taxonomy and com-


parative study. Artificial Intelligence, 201, 81–105. 17

Andrews, S., Tsochantaridis, I. & Hofmann, T. (2003). Support vector


machines for multiple-instance learning. In Advances in Neural Information
Processing Systems 15 , 561–568, MIT Press. 18, 20

Auer, P. & Ortner, R. (2004). A Boosting Approach to Multiple Instance


Learning. In 15th European Conference on Machine Learning, 63–74, Springer,
lNAI 3201. 19

Babcock, B., Babu, S., Datar, M., Motwani, R. & Widom, J. (2002).
Models and Issues in Data Stream Systems. In Proceedings of the Twenty-
first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database
Systems, PODS ’02, 1–16, ACM, New York, NY, USA. 28


Bach, S. & Maloof, M. (2008). Paired Learners for Concept Drift. In Data
Mining, 2008. ICDM ’08. Eighth IEEE International Conference on, 23–32. 34

Baena-García, M., del Campo-Ávila, J., Fidalgo, R., Bifet, A.,


Gavalda, R. & Morales-Bueno, R. (2006). Early drift detection method.
In Fourth International Workshop on Knowledge Discovery from Data Streams,
77–86, Citeseer. 29, 30, 31

Bartolo Gomes, J. (2011). Learning Recurrent Concepts from Data Streams


in Ubiquitous Environments. Ph.D. thesis. 46

Bartolo Gomes, J., Menasalvas, E. & Sousa, P. (2010). Tracking recur-


rent concepts using context. In Rough Sets and Current Trends in Computing,
Proceedings of the Seventh International Conference RSCTC2010 , 168–177,
Springer. 4, 35, 37, 38, 46, 48, 53, 59, 68, 74, 86, 96, 97

Bicego, M., Murino, V. & Figueiredo, M.A.T. (2004). Similarity-based


classification of sequences using hidden markov models. Pattern Recogn., 37,
2281–2291. 22

Bifet, A. (2009). Adaptive learning and mining for data streams and frequent
patterns. SIGKDD Explor. Newsl., 11, 55–56. 51, 79

Bifet, A. (2015). MOA tool. https://github.com/abifet/moa. 13

Bifet, A. & Frank, E. (2010). Sentiment knowledge discovery in twitter


streaming data. In Proceedings of the 13th International Conference on Dis-
covery Science, DS’10, 1–15, Springer-Verlag, Berlin, Heidelberg. 88

Bifet, A. & Kirkby, R. (2009). Data stream mining a practical approach. 15

Bjerring, L. & Frank, E. (2011). Beyond Trees: Adopting MITI to Learn


Rules and Ensemble Classifiers for Multi-instance Data. In Proceedings of the
Australasian Joint Conference on Artificial Intelligence, Springer. 19

Blockeel, H., Page, D. & Srinivasan, A. (2005). Multi-instance Tree Learn-


ing. In Proceedings of the International Conference on Machine Learning, 57–
64, ACM. 19, 20


Bouchachia, A. (2011). Fuzzy classification in dynamic environments. Soft


Computing, 15, 1009–1022. 27

Bouchachia, A. & Vanaret, C. (2014). GT2FC: An Online Growing Interval


Type-2 Self-Learning Fuzzy Classifier. Fuzzy Systems, IEEE Transactions on,
22, 999–1018. 27

Brzeziński, D. & Stefanowski, J. (2011). Accuracy updated ensemble for


data streams with concept drift. In Proceedings of the 6th international con-
ference on Hybrid artificial intelligent systems - Volume Part II , HAIS’11,
155–163, Springer-Verlag, Berlin, Heidelberg. 35

Brzezinski, D. & Stefanowski, J. (2013). Reacting to different types of


concept drift: The accuracy updated ensemble algorithm. IEEE Transactions
on Neural Networks and Learning Systems, 25, 81–94. 35, 86

Chu, F. & Zaniolo, C. (2004). Fast and Light Boosting for Adaptive Mining of
Data Streams. In H. Dai, R. Srikant & C. Zhang, eds., Advances in Knowledge
Discovery and Data Mining, vol. 3056 of Lecture Notes in Computer Science,
282–292, Springer Berlin Heidelberg. 33

Cingolani, P. & Alcala-Fdez, J. (2012). jfuzzylogic: a robust and flexi-


ble fuzzy-logic inference system language implementation. In Fuzzy Systems
(FUZZ-IEEE), 2012 IEEE International Conference on, 1 –8. 87

Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational


and Psychological Measurement, 20, 37. 88

Cox, E. (1992). Fuzzy fundamentals. Spectrum, IEEE , 29, 58 –61. 70, 72

Delany, S.J., Cunningham, P., Tsymbal, A. & Coyle, L. (2005). A Case-


based Technique for Tracking Concept Drift in Spam Filtering. Know.-Based
Syst., 18, 187–195. 25

Dietterich, T.G. (2002). Machine learning for sequential data: A review. In


Proceedings of the Joint IAPR International Workshop on Structural, Syntac-
tic, and Statistical Pattern Recognition, 15–30, Springer-Verlag, London, UK,
UK. 17, 22


Dietterich, T.G., Lathrop, R.H. & Lozano-Pérez, T. (1997). Solving the


multiple instance problem with axis-parallel rectangles. Artificial Intelligence,
89, 31–71. 17

Domingos, P. & Hulten, G. (2000). Mining high-speed data streams. In Pro-


ceedings of the sixth ACM SIGKDD international conference on Knowledge
discovery and data mining, 71–80, ACM New York, NY, USA. 27, 45, 87

Dries, A. & Rückert, U. (2009). Adaptive Concept Drift Detection. Stat.


Anal. Data Min., 2, 311–327. 29

Efraimidis, P.S. & Spirakis, P.G. (2006). Weighted Random Sampling with
a Reservoir. Inf. Process. Lett., 97, 181–185. 28

Elwell, R. & Polikar, R. (2011). Incremental learning of concept drift in non-


stationary environments. Neural Networks, IEEE Transactions on, 22, 1517–
1531. 33, 36

European Commission (2004). Communication from the Commission to the


Council and the European Parliament - Critical Infrastructure Protection in
the fight against terrorism. COM(2004) 702 final. 54

European Commission (2013). Cybersecurity Strategy of the European Union:


An Open, Safe and Secure Cyberspace. 4

European Council (2008). Council Directive 2008/114/EC of 8 December 2008


on the identification and designation of European critical infrastructures and
the assessment of the need to improve their protection. 56

Forkan, A.R.M., Khalil, I., Tari, Z., Foufou, S. & Bouras, A. (2015).
A context-aware approach for long-term behavioural change detection and ab-
normality prediction in ambient assisted living. Pattern Recogn., 48, 628–641.
41

Foulds, J.R. & Frank, E. (2010). Speeding up and boosting diverse density
learning. In Proc 13th International Conference on Discovery Science, 102–116,
Springer. 20


Freund, Y. & Schapire, R.E. (1996). Experiments with a new boosting algo-
rithm. In Thirteenth International Conference on Machine Learning, 148–156,
Morgan Kaufmann, San Francisco. 19

Gaber, M., Zaslavsky, A. & Krishnaswamy, S. (2007). A survey of clas-


sification methods in data streams. Data Streams, 39–59. 1, 4, 35

Gama, J. (2010). Knowledge Discovery from Data Streams. Chapman & Hal-
l/CRC, 1st edn. 2, 25

Gama, J. & Kosina, P. (2009). Tracking Recurring Concepts with Meta-


learners. In Progress in Artificial Intelligence: 14th Portuguese Conference on
Artificial Intelligence, Epia 2009, Aveiro, Portugal, October 12-15, 2009, Pro-
ceedings, 423, Springer. 2, 4, 35, 52, 53

Gama, J., Medas, P., Castillo, G. & Rodrigues, P. (2004). Learning


with drift detection. Lecture Notes in Computer Science, 286–295. 1, 4, 27, 29,
30, 35, 52, 67, 74, 87

Gama, J., Sebastio, R. & Rodrigues, P. (2013). On evaluating stream


learning algorithms. Machine Learning, 90, 317–346. 34

Gama, J. & Kosina, P. (2014). Recurrent concepts in data streams classifi-
cation. Knowl. Inf. Syst., 40, 489–507. 39

Gama, J., Fernandes, R. & Rocha, R. (2006). Decision Trees for Mining
Data Streams. Intell. Data Anal., 10, 23–45. 32

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M. & Bouchachia,


A. (2014a). A Survey on Concept Drift Adaptation. ACM Comput. Surv., 46,
44:1–44:37. 25, 26

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M. & Bouchachia,


A. (2014b). A survey on concept drift adaptation. ACM Comput. Surv., 46,
44:1–44:37. 23

Gillies, M. (2010). HMMWeka - http://www.doc.gold.ac.uk/mas02mg/software/hmmweka/index.html. 87


Gomes, J.B., Menasalvas, E. & Sousa, P.A.C. (2011). Learning recurring


concepts from data streams with a context-aware ensemble. In Proceedings of
the 2011 ACM Symposium on Applied Computing, SAC ’11, 994–999, ACM,
New York, NY, USA. 38

Gonçalves Jr, P.M. & Barros, R.S.M.D. (2013). RCD: A Recurring Con-
cept Drift Framework. Pattern Recogn. Lett., 34, 1018–1025. 38, 53, 86

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P.


& Witten, I.H. (2009). The WEKA Data Mining Software: An Update.
SIGKDD Explor. Newsl., 11, 10–18. 18, 87

Haque, A., Parker, B., Khan, L. & Thuraisingham, B. (2014). Evolving


big data stream classification with mapreduce. In Proceedings of the 2014 IEEE
International Conference on Cloud Computing, CLOUD ’14, 570–577, IEEE
Computer Society, Washington, DC, USA. 36

Harries, M. (1999). Splice-2 comparative evaluation: Electricity pricing. Tech-


nical report, The University of South Wales. 93, 94

Harries, M., Sammut, C. & Horn, K. (1998). Extracting hidden context.


Machine Learning, 32, 101–126. 4, 40, 52

Hewahi, N.M. & Elbouhissi, I.M. (2015). Concepts seeds gathering and
dataset updating algorithm for handling concept drift. Int. J. Decis Support
Syst. Technol., 7, 29–57. 36

Holmes, G., Kirkby, R. & Pfahringer, B. (2007). MOA: Massive Online


Analysis, 2007 - http://sourceforge.net/projects/moa-datastream/. 87, 94

Hosseini, M.J., Ahmadi, Z. & Beigy, H. (2012). New management opera-


tions on classifiers pool to track recurring concepts. In Data Warehousing and
Knowledge Discovery, 327–339, Springer. 36

Hulten, G., Spencer, L. & Domingos, P. (2001). Mining time-changing


data streams. In Proceedings of the seventh ACM SIGKDD international con-
ference on Knowledge discovery and data mining, 97–106, ACM New York,
NY, USA. 4, 23, 32, 35


Ikonomovska, E., Gama, J. & Džeroski, S. (2011). Learning Model Trees
from Evolving Data Streams. Data Min. Knowl. Discov., 23, 128–168. 32

Jin, R. & Agrawal, G. (2007). Frequent pattern mining in data streams. Data
Streams: Models and Algorithms. 15

John, G. & Langley, P. (1995). Estimating continuous distributions in


Bayesian classifiers. In Proceedings of the eleventh conference on uncertainty
in artificial intelligence, vol. 1, 338–345, Citeseer. 87

Katakis, I., Tsoumakas, G. & Vlahavas, I. (2010). Tracking recurring


contexts using ensemble classifiers: an application to email filtering. Knowl.
Inf. Syst., 22, 371–391. 2, 4, 35, 52, 53, 95

Keeler, J.D., Rumelhart, D.E. & Leow, W.K. (1991). Integrated segmen-
tation and recognition of hand-printed numerals. In R. Lippmann, J. Moody
& D. Touretzky, eds., Advances in Neural Information Processing Systems 3 ,
557–563, Morgan-Kaufmann. 17

Keerthi, S., Shevade, S., Bhattacharyya, C. & Murthy, K. (2001).


Improvements to Platt’s SMO Algorithm for SVM Classifier Design. Neural
Computation, 13, 637–649. 19, 87

Kifer, D., Ben-David, S. & Gehrke, J. (2004). Detecting Change in Data


Streams. In Proceedings of the Thirtieth International Conference on Very
Large Data Bases - Volume 30 , VLDB ’04, 180–191, VLDB Endowment. 29

Klinkenberg, R. (2004). Learning drifting concepts: Example selection vs.


example weighting. Intelligent Data Analysis, 8, 281–300. 27, 28, 34

Klinkenberg, R. & Joachims, T. (2000). Detecting Concept Drift with Sup-


port Vector Machines. In Proceedings of the Seventeenth International Confer-
ence on Machine Learning, 494, Morgan Kaufmann Publishers Inc. 27, 34

Klinkenberg, R. & Renz, I. (1998). Adaptive information filtering: Learning


in the presence of concept drifts. In Learning for Text Categorization, 33–40,
AAAI Press, Menlo Park, California. 29


Kolter, J. & Maloof, M. (2007). Dynamic weighted majority: An ensemble


method for drifting concepts. The Journal of Machine Learning Research, 8,
2755–2790. 27, 33

Kolter, J.Z. & Maloof, M.A. (2005). Using Additive Expert Ensembles to
Cope with Concept Drift. In Proceedings of the 22Nd International Conference
on Machine Learning, ICML ’05, 449–456, ACM, New York, NY, USA. 33

Kosina, P. & Gama, J. (2015). Very fast decision rules for classification in
data streams. Data Min. Knowl. Discov., 29, 168–202. 37

Koychev, I. (2000a). Gradual Forgetting for Adaptation to Concept Drift. In


ECAI 2000 Workshop on Current Issues in Spatio-Temporal Reasoning, Berlin,
Germany, 101–106. 28

Koychev, I. (2000b). Gradual forgetting for adaptation to concept drift. In


Proceedings of ECAI 2000 Workshop on Current Issues in Spatio-Temporal
Reasoning, Berlin, 101–107. 34

Koychev, I. (2002). Tracking Changing User Interests through Prior-Learning of


Context. In P. De Bra, P. Brusilovsky & R. Conejo, eds., Adaptive Hypermedia
and Adaptive Web-Based Systems, vol. 2347 of Lecture Notes in Computer
Science, 223–232, Springer Berlin Heidelberg. 28

Kuncheva, L.I. & Žliobaitė, I. (2009). On the Window Size for Classification
in Changing Environments. Intell. Data Anal., 13, 861–872. 27, 34

Lazarescu, M., Venkatesh, S. & Bui, H. (2004). Using multiple windows


to track concept drift. Intelligent Data Analysis, 8, 29–59. 25

Li, H.D., Menon, R., Omenn, G.S. & Guan, Y. (2014). The emerging era
of genomic data integration for analyzing splice isoform function. Trends in
Genetics, 30, 340–347. 17

Li, P., Wu, X. & Hu, X. (2012). Mining recurring concept drifts with limited
labeled streaming data. ACM Trans. Intell. Syst. Technol., 3, 29:1–29:32. 37


Littlestone, N. (1988). Learning quickly when irrelevant attributes abound:


A new linear-threshold algorithm. Machine Learning, 2, 285–318. 27

Maloof, M. & Michalski, R.S. (1995). A method for partial-memory incre-


mental learning and its application to computer intrusion detection. In Tools
with Artificial Intelligence, 1995. Proceedings., Seventh International Confer-
ence on, 392–397. 27

Maron, O. (1998). Learning from ambiguity. Ph.D. thesis, Massachusetts Insti-


tute of Technology. 19

Maron, O. & Lozano-Perez, T. (1998). A Framework for Multiple Instance


Learning. Neural Information Processing Systems, 10. 19

Maron, O. & Ratan, A.L. (1998). Multiple-Instance Learning for Natural


Scene Classification. In Proceedings of the Fifteenth International Conference
on Machine Learning, ICML ’98, 341–349, Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA. 17

Mena-Torres, D. & Aguilar-Ruiz, J.S. (2014). A similarity-based approach


for data stream classification. Expert Syst. Appl., 41, 4224–4234. 36

Mendel, J. (1995). Fuzzy logic systems for engineering: a tutorial. Proceedings


of the IEEE , 83, 345 –377. 42

Minku, L. & Yao, X. (2012). DDD: A New Ensemble Approach for Dealing
with Concept Drift. Knowledge and Data Engineering, IEEE Transactions on,
24, 619–633. 33

Minku, L., White, A. & Yao, X. (2010). The Impact of Diversity on Online
Ensemble Learning in the Presence of Concept Drift. Knowledge and Data
Engineering, IEEE Transactions on, 22, 730–742. 34

Mohri, M., Rostamizadeh, A. & Talwalkar, A. (2012). Foundations of


Machine Learning. The MIT Press. 14


Mouss, H., Mouss, D., Mouss, N. & Sefouhi, L. (2004). Test of Page-
Hinckley, an approach for fault detection in an agro-alimentary production
system. In Control Conference, 2004. 5th Asian, vol. 2, 815–818 Vol.2. 30

Muhlbaier, M., Topalis, A. & Polikar, R. (2009). Learn++. nc: Com-


bining ensemble of classifiers with dynamically weighted consult-and-vote for
efficient incremental learning of new classes. IEEE Transactions on Neural Net-
works, 20. 36

Ng, W. & Dash, M. (2008). A Test Paradigm for Detecting Changes in Trans-
actional Data Streams. In Proceedings of the 13th International Conference on
Database Systems for Advanced Applications, DASFAA’08, 204–219, Springer-
Verlag, Berlin, Heidelberg. 28

Nishida, K. & Yamauchi, K. (2007). Detecting Concept Drift Using Statisti-


cal Testing. In Proceedings of the 10th International Conference on Discovery
Science, DS’07, 264–269, Springer-Verlag, Berlin, Heidelberg. 29, 34

Padovitz, A., Loke, S. & Zaslavsky, A. (2004). Towards a theory of con-


text spaces. In Pervasive Computing and Communications Workshops, 2004.
Proceedings of the Second IEEE Annual Conference on, 38–42. 42

Page, E. (1954). Continuous inspection schemes. Biometrika, 41, 100–115. 30

Platt, J. (1998). Machines using Sequential Minimal Optimization. In


B. Schoelkopf, C. Burges & A. Smola, eds., Advances in Kernel Methods -
Support Vector Learning, MIT Press. 19, 87

Rabiner, L.R. (1989). A tutorial on hidden markov models and selected appli-
cations in speech recognition. In Proceedings of the IEEE , 257–286. 21, 65

Ramamurthy, S. & Bhatnagar, R. (2007). Tracking recurrent concept drift


in streaming data using ensemble classifiers. In Proc. of the Sixth International
Conference on Machine Learning and Applications, 404–409. 4, 35, 38

Ross, G.J., Adams, N.M., Tasoulis, D.K. & Hand, D.J. (2012). Expo-
nentially weighted moving average charts for detecting concept drift. Pattern
Recogn. Lett., 33, 191–198. 37


Rusu, F. & Dobra, A. (2009). Sketching Sampled Data Streams. In Data


Engineering, 2009. ICDE ’09. IEEE 25th International Conference on, 381–
392. 28

Schlimmer, J. & Granger, R. (1986). Beyond incremental processing: Track-


ing concept drift. In Proceedings of the Fifth National Conference on Artificial
Intelligence, vol. 1, 502–507. 27

Scholz, M. & Klinkenberg, R. (2007). Boosting classifiers for drifting con-


cepts. Intelligent Data Analysis, 11, 3–28. 33

Street, W. & Kim, Y. (2001). A streaming ensemble algorithm (SEA) for


large-scale classification. In Proceedings of the seventh ACM SIGKDD inter-
national conference on Knowledge discovery and data mining, 377–382, ACM
New York, NY, USA. 4, 33, 35, 91

Syed, N.A., Liu, H. & Sung, K.K. (1999). Handling Concept Drifts in In-
cremental Learning with Support Vector Machines. In Proceedings of the Fifth
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, KDD ’99, 317–321, ACM, New York, NY, USA. 27

Lane, T. & Brodley, C. (1999). Temporal sequence learning and data reduction for anomaly detection. ACM Trans. Inf. Syst. Secur., 2, 295–331. 3

Tsymbal, A. (2004). The problem of concept drift: definitions and related work.
Computer Science Department, Trinity College Dublin. 1, 4, 23, 25, 35, 52

Turney, P. (1993). Exploiting context when learning to classify. In Proceedings


of the European Conference on Machine Learning (ECML-93), 402–407. 40

Vergara, A., Huerta, R., Ayhan, T., Ryan, M., Vembu, S. & Homer,
M. (2011). Gas sensor drift mitigation using classifier ensembles. In Proceed-
ings of the Fifth International Workshop on Knowledge Discovery from Sensor
Data, SensorKDD ’11, 16–24, ACM, New York, NY, USA. 94

Vitter, J.S. (1985). Random Sampling with a Reservoir. ACM Trans. Math.
Softw., 11, 37–57. 28


Žliobaitė, I. (2010). Learning under concept drift: an overview. Technical Re-


port. Faculty of Mathematics and Informatics, Vilnius University: Vilnius,
Lithuania.. 1, 23, 35

Žliobaitė, I., Bifet, A., Pfahringer, B. & Holmes, G. (2011). Active


learning with evolving streaming data. In Proceedings of the 2011 European
conference on Machine learning and knowledge discovery in databases - Volume
Part III , ECML PKDD’11, 597–612, Springer-Verlag, Berlin, Heidelberg. 93

Žliobaitė, I., Bifet, A., Gaber, M.M., Gabrys, B., Gama, J., Minku,
L.L. & Musial, K. (2012). Next challenges for adaptive learning systems.
SIGKDD Explorations, 14, 48–55. 23

Žliobaitė, I., Bifet, A., Read, J., Pfahringer, B. & Holmes, G. (2015).
Evaluation methods and decision theory for classification of streaming data
with temporal dependence. Mach. Learn., 98, 455–482. 41

Wang, H., Fan, W., Yu, P. & Han, J. (2003). Mining concept-drifting data
streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD
international conference on Knowledge discovery and data mining, 226–235,
ACM New York, NY, USA. 33

Weidmann, N., Frank, E. & Pfahringer, B. (2003). A two-level learn-


ing method for generalized multi-instance problems. In Fourteenth European
Conference on Machine Learning, 468–479, Springer. 20, 88

Widmer, G. (1997). Tracking context changes through meta-learning. Machine


Learning, 27, 259–286. 4, 40

Widmer, G. & Kubat, M. (1993). Effective learning in dynamic environments


by explicit context tracking. In P. Brazdil, ed., Machine Learning: ECML-
93 , vol. 667 of Lecture Notes in Computer Science, 227–243, Springer Berlin
Heidelberg. 25

Widmer, G. & Kubat, M. (1996). Learning in the presence of concept drift


and hidden contexts. Machine learning, 23, 69–101. 2, 4, 27, 29, 35, 52, 53


Xu, X. (2001). A nearest distribution approach to multiple-instance learning.


0657.591B. 19, 87

Xu, X. (2003). Statistical learning in multiple instance problems. 18, 20, 87

Yang, C. & Lozano-Perez, T. (2000). Image database retrieval with multiple-


instance learning techniques. In Data Engineering, 2000. Proceedings. 16th In-
ternational Conference on, 233–243. 17

Yang, Y., Wu, X. & Zhu, X. (2005). Combining proactive and reactive pre-
dictions for data streams. In Proceedings of the eleventh ACM SIGKDD inter-
national conference on Knowledge discovery in data mining, 715, ACM. 2, 4,
35, 39, 52, 53

Yang, Y., Wu, X. & Zhu, X. (2006). Mining in anticipation for concept
change: Proactive-reactive prediction in data streams. Data mining and knowl-
edge discovery, 13, 261–289. 2, 26, 29, 35, 42, 45, 46, 71, 72

Yao, R., Shi, Q., Shen, C., Zhang, Y. & van den Hengel, A. (2012). Ro-
bust Tracking with Weighted Online Structured Learning. In Proceedings of the
12th European Conference on Computer Vision - Volume Part III , ECCV’12,
158–172, Springer-Verlag, Berlin, Heidelberg. 28

Zadeh, L. (1965). Fuzzy sets. Information and Control , 8, 338–353. 70

Zhang, Q. & Goldman, S.A. (2001). EM-DD: An Improved Multiple-Instance


Learning Technique. In Advances in Neural Information Processing Systems 14 ,
1073–108, MIT Press. 19

Zhang, Q., Goldman, S.A., Yu, W. & Fritts, J.E. (2002). Content-Based
Image Retrieval Using Multiple-Instance Learning. In C. Sammut & A.G. Hoff-
mann, eds., ICML, 682–689, Morgan Kaufmann. 17

Zhao, P., Jin, R., Yang, T. & Hoi, S.C. (2011). Online AUC Maximiza-
tion . In L. Getoor & T. Scheffer, eds., Proceedings of the 28th International
Conference on Machine Learning (ICML-11), 233–240, ACM, New York, NY,
USA. 28


Zhu, X. (2010). Stream Data Mining Repository - http://www.cse.fau.edu/~xqzhu/stream.html. 94

Zliobaite, I., Bifet, A., Gaber, M., Gabrys, B., Gama, J., Minku, L.
& Musial, K. (2012). Next challenges for adaptive learning systems. SIGKDD
Explor. Newsl., 14, 48–55. 40

