Plamen Angelov
Yannis Manolopoulos
Lazaros Iliadis
Asim Roy
Marley Vellasco Editors
Advances
in Big Data
Proceedings of the 2nd INNS
Conference on Big Data, October
23–25, 2016, Thessaloniki, Greece
Advances in Intelligent Systems and Computing
Volume 529
Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: kacprzyk@ibspan.waw.pl
About this Series
Advisory Board
Chairman
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
e-mail: nikhil@isical.ac.in
Members
Rafael Bello, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, Cuba
e-mail: rbellop@uclv.edu.cu
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
e-mail: escorchado@usal.es
Hani Hagras, University of Essex, Colchester, UK
e-mail: hani@essex.ac.uk
László T. Kóczy, Széchenyi István University, Győr, Hungary
e-mail: koczy@sze.hu
Vladik Kreinovich, University of Texas at El Paso, El Paso, USA
e-mail: vladik@utep.edu
Chin-Teng Lin, National Chiao Tung University, Hsinchu, Taiwan
e-mail: ctlin@mail.nctu.edu.tw
Jie Lu, University of Technology, Sydney, Australia
e-mail: Jie.Lu@uts.edu.au
Patricia Melin, Tijuana Institute of Technology, Tijuana, Mexico
e-mail: epmelin@hafsamx.org
Nadia Nedjah, State University of Rio de Janeiro, Rio de Janeiro, Brazil
e-mail: nadia@eng.uerj.br
Ngoc Thanh Nguyen, Wroclaw University of Technology, Wroclaw, Poland
e-mail: Ngoc-Thanh.Nguyen@pwr.edu.pl
Jun Wang, The Chinese University of Hong Kong, Shatin, Hong Kong
e-mail: jwang@mae.cuhk.edu.hk
Editors

Plamen Angelov
School of Computing and Communications
Lancaster University
Lancaster, UK

Asim Roy
WPC Information Systems Faculty
Arizona State University
Tempe, AZ, USA

Lazaros Iliadis
Lab of Forest Informatics (FiLAB)
Democritus University of Thrace
Orestiada, Greece
The concept of “Big data” is not only related to the storage of and access to data. Analytics play a major role in making sense of that data and exploiting its value. Big data analytics is about examining vast volumes of data in order to discover hidden patterns or existing correlations. With available technology, it is feasible to analyze data and get crucial answers from it almost in real time.
Google has conducted a great deal of research in this field, with impressive results. It is worth mentioning “Google Cloud Machine Learning”, a managed platform enabling easy construction and development of machine learning models that can operate on data of any type and size. The neural network field has historically
focused on algorithms that learn in an online, incremental mode without requiring
in-memory access to huge amounts of data. This type of learning is not only ideal
for streaming data (as in the Industrial Internet or the Internet of Things) but could
also be used on stored big data. Neural network technologies thus can become
significant components of big data analytics platforms.
One of the major challenges of our era is learning from big data, something that requires the development of novel algorithmic approaches. Most of the available machine learning algorithms find it hard to scale up to big data. Moreover, there are serious challenges posed by high dimensionality, velocity, and variety.
Thus the aim of this conference is to promote new advances and research directions in efficient and innovative algorithmic approaches to analyzing big data (e.g., deep networks, nature-inspired and brain-inspired algorithms), implementations on different computing platforms (e.g., neuromorphic, GPUs, clouds, clusters), and applications of Big Data Analytics to solve real-world problems (e.g., weather prediction, transportation, energy management).
The 2nd INNS Big Data 2016 conference, founded by the International Neural Network Society (INNS), was organized by the Aristotle University of Thessaloniki, Greece, and the Democritus University of Thrace, Greece, following the inaugural event held in San Francisco in 2015. This conference aims to initiate a collaborative adventure with big data and other learning technologies.
All papers submitted to the 2nd INNS Big Data conference passed through a peer review process by at least two independent academic referees. Where needed, a third and a fourth referee were consulted to resolve any potential conflicts. Out of 50 submissions, 34 were accepted for oral presentation; 27 papers were accepted as full ones (54 %) and 7 as short ones. Authors come
from 19 different countries around the globe, namely: Brazil, China, Cyprus,
Denmark, France, Germany, Greece, India, Italy, Japan, Malaysia, Portugal, Saudi
Arabia, Spain, Tunisia, Turkey, USA, Ukraine and UK. Proceedings are published
by Springer in the “Advances in Intelligent Systems and Computing Series”.
The program of the 2nd INNS Big Data Conference includes 5 plenary talks by
distinguished keynote speakers, plus 2 tutorials.
David Bholat is Senior Analyst with the Bank of England. In particular, Dr.
Bholat leads a team of ten data scientists and researchers in Advanced Analytics, a
Big Data division in the Bank of England which he helped to establish in 2014. The
division is recognised as a leader among central banks in the area of Big Data, as
noted in a recent MIT Sloan Review article profiling the division. A former
Fulbright fellow, Dr. Bholat graduated from Georgetown University’s School of
Foreign Service with highest honours. He subsequently studied at the London
School of Economics, the University of Chicago and London Business School.
Publications in 2016 include Modelling metadata in central banks; Non-performing
loans: regulatory and accounting treatments of assets; Peer-to-peer lending and
financial innovation in the United Kingdom; and Accounting in central banks.
Other previous publications relevant to the conference include Text mining for
central banks; Big data and central banks; and The future of central bank data.
The title of the talk by Dr. Bholat is “Big Data at the Bank of England”. This talk will discuss the Bank of England’s recent forays into Big Data and other unconventional data sources. Particular attention will be given to the practicalities of embedding data analytics in a business context. Examples of the Bank’s use of Big Data will be given, including the analysis of derivatives and mortgage data, Internet searches, and social media.
Francesco Bonchi is Research Leader at the ISI Foundation in Turin and Scientific Director of the Technological Center of Catalunya in Barcelona. Previously, he was Director of Research at Yahoo Labs in Barcelona, Spain, where he led the Web Mining Research group. His recent research interests include mining query logs, social networks, and social media, as well as the privacy issues related to mining these kinds of sensitive data. In the past he has been interested in data mining query languages, constrained pattern mining, mining spatiotemporal and mobility data, and privacy-preserving data mining. He will be PC Chair of the 16th IEEE International Conference on Data Mining (ICDM 2016) to be held in Barcelona in December 2016. He is a member of the ECML/PKDD Steering Committee, Associate Editor of the IEEE Transactions on Big Data (TBD), the IEEE Transactions on Knowledge & Data Engineering (TKDE), the ACM Transactions on Intelligent Systems and Technology (TIST), and Knowledge and Information Systems (KAIS), and a member of the Editorial Board of Data Mining & Knowledge Discovery (DMKD). He was Program Co-chair of ECML/PKDD 2010. He is co-editor of the
(Model Selection) and the estimation of their quality (Error Estimation). In particular, Statistical Learning Theory (SLT) addresses these problems by deriving non-asymptotic bounds on the generalization error of a model or, in other words, by upper-bounding the true error of the learned model based only on quantities computed on the available data. However, for a long time SLT has been considered only an abstract theoretical framework, useful for inspiring new learning approaches but with limited applicability to practical problems. The purpose of this tutorial is to give an intelligible overview of the problems of Model Selection and Error Estimation, focusing on the ideas behind the different SLT-based approaches and simplifying most of the technical aspects in order to make them more accessible and usable in practice, with particular reference to Big Data problems. We will start from the seminal works of the 1980s and move to the most recent results, then discuss open problems and finally outline future directions of this field of research.
Professor Giacomo Boracchi, of the Politecnico di Milano, will deliver a tutorial entitled “Change Detection in Data Streams: Big Data Challenges”. Changes might indicate unforeseen evolution of the process generating the data, anomalous events, or faults, to name a few examples. As such, change-detection tests provide precious information for understanding stream dynamics and activating suitable actions, which are two primary concerns in financial analysis, quality inspection, and environmental and health-monitoring systems. Change detection also plays a central role in machine learning, often being the first step towards adaptation. This tutorial presents a formal description of the change-detection problem that fits sequential monitoring as well as classification and signal/image analysis applications. The main approaches in the literature are then presented, and their effectiveness is discussed in big data scenarios, where either the data throughput or the data dimension is large. In particular, change-detection tests for monitoring multivariate data streams will be presented in detail, including the popular approach of monitoring the log-likelihood, which will be shown to suffer from detectability loss as the data dimension increases. The tutorial is accompanied by various examples where change-detection methods are applied to real-world problems, including the classification of streaming data, the detection of anomalous heartbeats in ECG tracings, and the localization of anomalous patterns in images for quality control.
Finally, we would like to thank the Artificial Intelligence Journal (Elsevier) for
sponsoring the conference.
We hope that the 2nd INNS Big Data conference will stimulate the international Big Data community and provide insights that open new paths in analytics through further algorithmic and applied research.
General Chairs
Advisory Board
Tutorials/Workshop Chairs
Panel Chair
Awards Chair
Competitions Chair
Publication Chairs
Publicity Chairs
WebMaster
Program Committee
Abstract. The enormous volumes of data generated by web users are the basis of several research activities in a new and innovative field of research: online forecasting. Online forecasting is concerned with the proper processing of web users’ data with the aim of arriving at accurate predictions of the future in several areas of human socio-economic activity. In this paper an algorithm is applied in order to predict the results of the Greek referendum held in 2015, using as input the data generated by users of the Google search engine. The proposed algorithm allows us to predict the results of the referendum with great accuracy. We strongly believe that, due to the high internet penetration and the high usage of web search engines, proper analysis of the data generated by web search users reveals useful information about people’s preferences and/or future actions in several areas of human activity.
1 Introduction
Almost a decade ago, Google opened to the public the web users’ preferences in relation to their search behavior. Several researchers realized that proper processing of web users’ search behavior may reveal useful information about the users’ needs, wants, and concerns and, in general, about their feelings and preferences (Ettredge et al. 2005).
Web users generate data in almost all web activities, such as visiting a website, buying online, sending/receiving emails, and participating in social networks. In cases where the popularity of such activities is high, there is plenty of room for researchers and companies to use these data in order to reach valuable conclusions not only about web users but about the general population. The most indicative case of user-generated data is web search, since it is characterized by high popularity among web users and by an almost monopolized market structure, with the Google search engine holding more than 85 % of the market (source: www.statista.com).
A recent study published by Eurostat indicates that 59 % of Europeans use web search services to find information relevant to goods and services. As the percentages of internet penetration and web search use increase, the generated data regarding web search behavior become statistically significant. Thus, forecasting based on web search data is becoming increasingly accurate.
Within this context, the aim of this paper is to explore whether there is a correlation between users’ web search preferences during the time period before the Greek referendum, held in July 2015, and the actual results of the referendum. In particular, an algorithm is applied in order to analyze the data generated by users of the Google search engine, aiming to predict the actual results of the referendum.

The paper is structured as follows: Sect. 2 discusses the relevant literature, while Sect. 3 describes the proposed algorithm. In the next section the algorithm is applied to the case study, and in Sect. 5 the main findings of this paper are discussed.
2 Literature Review
Online forecasting based on users’ web search data is becoming one of the most promising fields in the research area of forecasting. Several efforts have been carried out by Google’s own researchers, who have attempted predictions using search term popularity in a number of areas ranging from home, automobile, and retail sales to travel behavior (Bangwayo-Skeete and Skeete 2015; Artola et al. 2015; Yang et al. 2015). In (Choi and Varian 2012), forecasting methods are proposed to predict near-term values of economic indicators such as automobile sales, unemployment claims, travel destination planning, and consumer confidence. Home sales predictions
have also been attempted by other researchers (Wu and Brynjolfsson 2009), while the work in (Ginsberg et al. 2009) constitutes an interesting take on predicting flu epidemics before they actually emerge. The prediction of unemployment rates is another successful exercise that has been attempted both before the establishment of Google Trends, for the USA job market (Ettredge et al. 2005), and afterwards, for Spain (Vicente et al. 2015).
With respect to elections, an initial approach in (Pion and Hamel 2007) considers election predictions along with predictions in sports and economics. In addition, Stephens-Davidowitz (2013) argues that Google searches prior to an election can be used to predict turnout in different parts of the USA. Franch (2013) provided predictions for the 2010 UK elections by applying twice the concept behind Galton’s predictive “wisdom of the crowds”.
The proposed algorithm is applied to the data generated by the users of the Google search engine. Each time a user searches the Web with the Google search engine, the relevant data, such as the typed word or phrase, the date, the time, the location, and data related to his/her profile, are stored by Google. The data are analyzed by Google and some of them become publicly available through the Google Trends service. In particular, Google Trends returns a normalized averaged number that corresponds to the volume of daily searches for a specific term compared to the rest of the search terms.
Our case study concerns the analysis of the web search behavior of users located in Greece, aiming to predict the results of the Greek referendum held in 2015. In the early morning of the 27th of June 2015, the Greek prime minister announced a referendum to be held on the 5th of July 2015. The referendum contained only one question, which was related to the bailout conditions proposed by the European Commission (EC), the International Monetary Fund (IMF), and the European Central Bank (ECB). More specifically, Greeks were asked whether or not the Greek government should accept the plan proposed by the EC, the IMF, and the ECB and submitted to the Greek government on the 25th of June. The announcement of the referendum initiated endless public discussions among politicians and, more generally, among the Greek people. Public opinion was divided into two main groups: the supporters of “yes” and the supporters of “no”. The discussions about the referendum had major implications for the behavior of Greek web users. Several profiles were created in social media in relation to the Greek referendum, while the discussions about the referendum monopolized the messages exchanged in social media.
Taking into account the above facts, we applied the proposed algorithm to our case study. The first steps were related to the determination of the geographic restrictions and the time period. Since the referendum concerns the Greek people, we geographically restricted the web search users to users with a Greek location (i.e., with a Greek IP address) during the determined time period. Regarding the time period, the referendum was announced on the 27th of June and held on the 5th of July; therefore the time period was pre-determined as the 27th of June to the 4th of July. This determination was verified by the variance, during that period, of the WI of the most characteristic words used by web users when searching for issues relevant to the referendum. More specifically, during that period the discussions in all media focused on whether the Greek people would vote “yes” or “no” to the question of the referendum. Therefore, the most characteristic words relevant to the referendum were “ναι” and “όχι” in Greek (“ναι” means yes and “όχι” means no). Indeed, as depicted in the following figure¹ (Fig. 1a), the variance of “οχι” and “ναι” during the 12-month period before the referendum verifies both the determination of the time period between the 27th of June and the 4th of July and the selection of “ναι” and “οχι” as the initial set of words under analysis.
Google Trends allows web users to determine the time periods on a monthly basis, presenting daily measures. The variation of the WI of the selected words during June–July 2015 shows a significant increase, starting from the date of the referendum announcement (27th of June) until two days after the referendum (7th of July). The peak values of both WIs occur on Friday the 3rd of July. Thus, the variation of the WIs of both selected words (“ναι” and “οχι”) meets the two criteria described in the previous section, relevant to the variation and the absolute values of the WI. Therefore both words are selected as valid words for our algorithm.
¹ In order to assist other researchers in verifying our results, the graphs contained in the figures are the ones provided by the Google Trends service.
Fig. 1. (a) Determination of the time period (right figure) and (b) WI with indecisive users excluded (left figure)
Having determined the initial set of words, the next step is to examine whether there were other words/phrases which, on the one hand, characterize the search behavior of supporters of “yes” or “no” and, on the other hand, fulfill the two relevant criteria. Therefore we also examined additional forms of the words “yes” and “no”, as well as the names/acronyms/leaders of the major political parties. In particular, for “yes” we examined the relevant form in Latin letters (“nai”), while for “no” we examined the relevant form in Latin letters (“oxi”) and the grammatically correct (accented) form of the Greek “no” (“όχι”). In addition, we examined the WI for the names/acronyms/leaders of the political parties that support “yes” (Nea Dimokratia – ND, PASOK, Potami, Samaras, Venizelos, Theodorakis) and the WI for the names/acronyms/leaders of the political parties that support “no” (Syriza, ANEL, Tsipras, Kammenos).
Inserting the words “nai” and “ναι” into Google Trends, with the time period and the geographical restrictions determined as above, Google Trends returns WI values only for “ναι” in the relevant graph. This means that the WI values for “nai” are significantly lower than the WI of “ναι” and thus could not be presented in the graph (in practice, “nai” does not meet the second criterion of the algorithm). Running similar scenarios for the other words/phrases which may characterize the search behavior of the supporters of “yes” and “no”, as described above, we reached the conclusion that no other words/phrases fulfill the two criteria of the proposed algorithm. Thus, the only characteristic words for supporters of “yes” and of “no” that should be included in the algorithm were “ναι” and “οχι”, respectively.
In this case study the sub-phase of historic feedback is not applicable, since there are no historical data relevant to referendums in the recent history of Greece. Therefore, the next issue under consideration was the elimination of the noise generated by indecisive web users. An indecisive user is one whose web search behavior does not clearly indicate his/her intentions, thoughts, and feelings about the referendum. For example, an indecisive user is one who searches at the same time for “yes” and “no”, typing for example the phrase “yes no referendum” into the web search engine. The web search data generated by indecisive web users should be excluded from the valid data analyzed by the proposed algorithm. Google Trends allows users to exclude a word from a phrase by using the symbol “–”. Applying this rule to our algorithm in order to eliminate noise, we use the following phrases for each category: “ναι – οχι” for the supporters of “yes” and “οχι – ναι” for the supporters of “no”. Figure 1b displays the WI variation of the final set of selected words.
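As a reproducibility aid, the WI series analyzed above can also be retrieved programmatically. The following sketch uses the unofficial pytrends Python client (an assumption on our part; the study itself used the Google Trends web interface), with the “-” exclusion operator encoding the noise-elimination rule described above.

```python
# Minimal sketch: fetch the WI series from Google Trends for the two
# final search phrases. Assumes the unofficial `pytrends` client
# (pip install pytrends); the original study used the web interface.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="el-GR")

# "ναι -οχι" keeps "yes" searches that do not also contain "no"
# (excluding indecisive users); "οχι -ναι" is the symmetric phrase.
keywords = ["ναι -οχι", "οχι -ναι"]
pytrends.build_payload(keywords, geo="GR",
                       timeframe="2015-06-27 2015-07-04")
wi = pytrends.interest_over_time()  # daily WI values, normalized to 0-100
print(wi[keywords])
```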
The last phase of the proposed algorithm correlates its outcomes with the actual results of the referendum. To do this, the following steps are followed: let $\mathrm{WI}_{i,\,\mathrm{group}\ x,\,\mathrm{current},\,N}$ be the Web Interest value on day $i$ before the referendum for the supporters of $x$ (where $x$ takes the values “no” and “yes”) during the 2015 referendum, and let $N$ be the duration (in days) of the determined time period. Then the Average Web Interest (AWI) over a period of $N$ days before the referendum date is calculated. In our scenarios $N$ takes the value of 9 (from the 27th of June to the 4th of July):

$$\mathrm{AWI}_{\mathrm{group}\ x,\,\mathrm{current},\,N} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{WI}_{i,\,\mathrm{group}\ x,\,\mathrm{current},\,N} \qquad (1)$$
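A direct implementation of Eq. (1) is sketched below. The final normalization of the two AWI values into vote-share percentages is our assumption about how the reported predictions were derived, since the text does not spell out this step; the daily WI values shown are placeholders, not the actual data.

```python
import numpy as np

def awi(wi_daily):
    """Average Web Interest over the N days before the referendum, Eq. (1)."""
    return float(np.mean(wi_daily))

# Placeholder daily WI values for the determined time period; the real
# values come from Google Trends as in the previous sketch.
wi_yes = [10, 12, 15, 18, 20, 25, 30, 22, 19]
wi_no = [18, 22, 25, 30, 33, 40, 48, 35, 30]

awi_yes, awi_no = awi(wi_yes), awi(wi_no)
# Assumed normalization of the AWI values into predicted vote shares:
total = awi_yes + awi_no
print(f"yes: {100 * awi_yes / total:.2f} %, no: {100 * awi_no / total:.2f} %")
```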
Running the final experiments, we reach the conclusion that the algorithm predicts the referendum results with great accuracy. Indeed, the algorithm predicts 38.41 % as the percentage of “yes” supporters and 61.59 % as the percentage of “no” supporters, while the actual results were 38.69 % and 61.33 % respectively (prediction error: 0.26 %).
5 Conclusions
We have shown that proper processing of the data generated by web search users can reveal useful information about their future behavior, such as the intention to vote for or against a forthcoming referendum question. In particular, by applying the proposed algorithm to the Greek referendum held in 2015, we are able to predict the results of the referendum with high accuracy, based on data generated up to one day before the referendum. It should be noted that, mainly due to the lack of relevant historical data, the traditional prediction methods (election polls) failed to predict the actual results with acceptable accuracy². The successful application of the proposed algorithm in several election races in different countries, the results of the current study, and several research publications on predictions based on online data lead us to strongly believe that online forecasting already plays, and will continue to play, a major role in predicting people’s and consumers’ behavior, allowing the prediction of human behavior based on web search activities.
² https://en.wikipedia.org/wiki/Greek_bailout_referendum,_2015
References
Choi, H., Varian, H.: Predicting the present with Google Trends. In: Economic Record Special Issue: Selected Papers from the 40th Australian Conference of Economists. Wiley Online Library (2012)
Artola, C., Pinto, F., de Pedraza García, P.: Can internet searches forecast tourism inflows? Int.
J. Manpower 36(1), 103–116 (2015)
Ettredge, M., Gerdes, J., Karuga, G.: Using web-based search data to predict macroeconomic
statistics. Commun. ACM 48(11), 87–92 (2005)
Franch, F.: (Wisdom of the crowds)2: 2010 UK election prediction with social media. J. Inf.
Technol. Polit. 10(1), 57–71 (2013)
Ginsberg, J., Mohebbi, M., Patel, R., Brammer, L., Smolinski, M., Brilliant, L.: Detecting
influenza epidemics using search engine query data. Nature 457, 1012–1014 (2009)
Vicente, M.R., López-Menéndez, A.J., Pérez, R.: Forecasting unemployment with internet search
data: Does it help to improve predictions when job destruction is skyrocketing? Technol.
Forecast. Soc. Change 92(2015), 132–139 (2015)
Bangwayo-Skeete, P.F., Skeete, R.W.: Can Google data improve the forecasting performance of
tourist arrivals? Mixed-data sampling approach. Tourism Manage. 46(2015), 454–464 (2015)
Pion, S., Hamel, L.: The internet democracy: a predictive model based on web text mining. In:
Proceedings of DMIN 2007, pp. 292–300 (2007)
Stephens-Davidowitz, S.I.: Using Google Data to Predict Who Will Vote (2013). Available at
SSRN: http://ssrn.com/abstract=2238863
Polykalas, S.E., Prezerakos, G.N., Konidaris, A.T.: A general purpose model for future
prediction based on web search data: predicting Greek and Spanish elections. In: Proceedings
of International Symposium on Mining and Web, March 2013, Barcelona, Spain (2013a)
Polykalas, S.E., Prezerakos, G.N., Konidaris, A.T.: An algorithm based on Google Trends’ data
for future prediction. Case study: German elections. In: Proceedings of International
Conference ISSPIT/IEEE 2013, Athens, Greece (2013b)
Wu, L., Brynjolfsson, E.: The future of prediction: how Google searches foreshadow housing
prices and sales. In: NBER Conference Technological Progress & Productivity Measurement
(2009)
Yang, X., Pan, B., Evans, J.A., Lv, B.: Forecasting Chinese tourist volume with search engine
data. Tourism Manage. 46(2015), 386–397 (2015)
Spatial Bag of Features Learning for Large Scale
Face Image Retrieval
1 Introduction
Large-scale face image retrieval has recently received a lot of attention, especially in the context of celebrity face image retrieval [2]. Face image retrieval is defined as the task of retrieving face images of a person given a query image that depicts the face of that person. An image must be retrieved even if the person has a different facial expression or pose, or if different illumination conditions exist.
Given the vast amount of face images available on the Internet, a large-scale face
retrieval technique must be able to successfully handle large amounts of data.
Facial image retrieval also poses challenges of high-dimensionality, velocity and
variety.
A wide range of methods have been proposed for face recognition and retrieval [18]. Recently, a widely known technique for image retrieval, the Bag-of-Features (BoF) model [4], also known as the Bag-of-Visual-Words (BoVW) model, has been applied to face recognition/retrieval after being appropriately modified [13,16,17]. The typical pipeline of the BoF model can be summarized as follows. First, multiple features, such as SIFT descriptors [9], are extracted from each image (feature extraction). In this way the feature space is formed, where each image is represented as an unordered set of features. Then, the extracted features are used to learn a codebook of representative features (also called codewords); this process is called codebook learning. Finally, each feature vector is represented using a codeword from the learned codebook and a histogram is
extracted for each image. In this way the histogram space is formed, where each image is represented by a constant-dimensionality histogram vector.

The BoF model discards most of the spatial information contained in the original image, which can severely harm face recognition precision. To overcome this limitation, BoF-based techniques for face recognition, e.g., [13,16,17], define a grid over each image (or over some regions of interest) and independently extract a histogram from each cell of the grid. These techniques tend to be invariant to facial expression, pose, and illumination variations when used for face recognition with a trained classifier [13]. However, when the extracted representation is used for face image retrieval, the problem becomes ill-defined, since the user’s information need is sometimes ambiguous. For example, if the query image depicts a smiling person, the system might retrieve face images of the same person, but it might also retrieve images of other persons that smile. Therefore, it is critical that the extracted representation is appropriately tuned for the given retrieval task.
The main contribution of this paper is the proposal of a supervised codebook learning technique that is able to learn face-retrieval-oriented codebooks using an annotated training set of face images. This allows the use of significantly smaller codebooks, reducing both the storage requirements and the retrieval time by almost two orders of magnitude and allowing the technique to efficiently scale to large datasets. The proposed method is also combined with a spatial image segmentation technique that exploits the natural symmetry of the human face to further reduce the size of the extracted representation. Furthermore, the proposed approach is able to learn in incremental mode without requiring in-memory access to large amounts of data.
The rest of the paper is structured as follows. The related work is discussed in Sect. 2 and the proposed method is presented in Sect. 3. The experimental evaluation, using one large-scale dataset (the YouTube Faces Database) and two smaller datasets (the ORL Database of Faces and the Extended Yale Face Database B), is presented in Sect. 4. Finally, conclusions are drawn in Sect. 5.
2 Related Work
This work mainly concerns codebook learning for face image retrieval using the
BoF model. A rich literature exists in the field of codebook learning for the BoF
representation, ranging from supervised approaches [3,8,10], to unsupervised
ones [11]. Despite the fact that supervised codebook learning is well established
in general computer vision tasks, such as scene classification [8], or action recog-
nition [3], little work has been done in the area of face image retrieval. One
application of supervised codebook learning for face image retrieval is presented
in [16], where a simple identity based codebook construction technique is pro-
posed. This method is combined with hamming signatures and locality-sensitive
hashing (LSH) in order to be applied to large datasets. In contrast to this, the
method proposed in this paper reduces the size of the codebook, instead of rely-
ing on hashing and approximate nearest neighbor search techniques, allowing
3 Proposed Method
Before presenting the proposed codebook learning method, the BoF model and the SBoF model are briefly described. Let $N$ be the number of face images that are to be encoded using the regular BoF model. The $i$-th image is described by $N_i$ feature vectors $\mathbf{x}_{ij} \in \mathbb{R}^D$ ($i = 1 \ldots N$, $j = 1 \ldots N_i$), where $D$ is the length of the extracted feature vectors. In this work, dense SIFT features [6] are extracted from 16 × 16 patches using a regular grid with a spacing of 4 pixels. The BoF model represents each face image using a fixed-length histogram of its quantized feature vectors, where each histogram bin corresponds to a codeword. In hard assignment each feature vector is quantized to its nearest codeword/histogram bin, while in soft assignment every feature vector contributes, by a different amount, to each codeword/bin.
In order to learn a codebook, the set of all feature vectors $S = \{\mathbf{x}_{ij} \mid i = 1 \ldots N,\ j = 1 \ldots N_i\}$ is clustered into $N_K$ clusters and the corresponding centroids (codewords) $\mathbf{v}_k \in \mathbb{R}^D$ ($k = 1 \ldots N_K$) are used to form the codebook $\mathbf{V} \in \mathbb{R}^{D \times N_K}$, where each column of $\mathbf{V}$ is a centroid vector. These centroids are used to quantize the feature vectors. It is common to cluster only a subset of $S$, since this can reduce the training time with little effect on the learned representation. The codebook is learned only once and can then be used to encode any new face image.
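As an illustration of this step (not the authors' code), the codebook can be learned from a random subset of $S$ with scikit-learn's mini-batch k-means:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Placeholder for a random subset of the pooled feature vectors S
# (rows are D-dimensional features, e.g., dense SIFT descriptors).
S_sub = rng.normal(size=(10000, 128))

# Cluster into N_K = 64 codewords; the centroids form the codebook.
kmeans = MiniBatchKMeans(n_clusters=64, random_state=0).fit(S_sub)
V = kmeans.cluster_centers_  # shape (N_K, D), one codeword per row
```

Note that, for convenience, the codewords are stored here as rows, whereas the text above stacks them as columns of $\mathbf{V}$.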
To encode the $i$-th face image, the similarity between each feature vector $\mathbf{x}_{ij}$ and each codeword $\mathbf{v}_k$ is computed as $d_{ijk} = \exp(-\|\mathbf{v}_k - \mathbf{x}_{ij}\|_2 / g) \in \mathbb{R}$. The parameter $g$ controls the quantization process: for harder assignment a small value, i.e., $g < 0.01$, is used, while for softer assignment larger values, i.e., $g > 0.01$, are used. Then, the $l_1$-normalized membership vector of each feature vector $\mathbf{x}_{ij}$ is obtained as $\mathbf{u}_{ij} = \mathbf{d}_{ij} / \|\mathbf{d}_{ij}\|_1 \in \mathbb{R}^{N_K}$. This vector describes the similarity of feature vector $\mathbf{x}_{ij}$ to each codebook vector. Finally, the histogram $\mathbf{s}_i$ is extracted as $\mathbf{s}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbf{u}_{ij} \in \mathbb{R}^{N_K}$. The histogram $\mathbf{s}_i$ has unit $l_1$ norm, since $\|\mathbf{u}_{ij}\|_1 = 1$ for every $j$. These histograms describe each image and can be used to retrieve relevant face images. The training and the encoding processes are unsupervised and no labeled data are required.
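The encoding step just described can be written compactly. The following numpy sketch (our illustration, not the authors' code) computes the soft-assignment histogram $\mathbf{s}_i$ of one image from its feature vectors and a learned codebook:

```python
import numpy as np

def bof_histogram(X, V, g=0.01):
    """Soft-assignment BoF histogram of one image.

    X: (Ni, D) feature vectors of the image.
    V: (NK, D) codebook, one codeword per row.
    g: softness parameter of the quantization.
    """
    # d_ijk = exp(-||v_k - x_ij||_2 / g) for every feature/codeword pair.
    dists = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)  # (Ni, NK)
    z = -dists / g
    z -= z.max(axis=1, keepdims=True)     # numerical stability (shift-invariant)
    d = np.exp(z)
    u = d / d.sum(axis=1, keepdims=True)  # l1-normalized memberships u_ij
    return u.mean(axis=0)                 # s_i, which has unit l1 norm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))  # e.g., 200 dense SIFT descriptors
V = rng.normal(size=(64, 128))   # NK = 64 codewords
s = bof_histogram(X, V)
assert np.isclose(s.sum(), 1.0)
```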
As previously mentioned, the BoF model discards most of the spatial information contained in the images during the encoding process. This can severely harm face recognition precision, since most of the discriminant face features, e.g., eyes, nose, and mouth, are expected to be found in the same position when the face is properly aligned. This allows highly specialized codebooks to be used for each of these regions.
In this work the following segmentation technique is used for the SBoF model. Each image is segmented into $N_r$ equally spaced horizontal stripes. Since the human face is symmetric to a great extent, no vertical segmentation is used. A separate codebook is learned for each stripe and $N_r$ histograms are extracted (one from each stripe). These histograms are fused into the final histogram vector and renormalized. The length of this vector is $N_r \times N_K$. When four regions are used, i.e., $N_r = 4$, and the images are correctly aligned, each region roughly corresponds to one of the following parts of the human face: forehead, eyes, nose, and mouth. Although the images can be further split into more regions (either horizontally, vertically, or both), this provides little increase in retrieval precision. This fact is experimentally validated in Sect. 4. Note that the storage requirements and the retrieval time depend on the number of regions used to segment each image, since the storage required for each image is $N_r \times N_K \times B$ bytes. This technique also provides mirror invariance, while still preserving the discriminant non-symmetric facial features that might appear.
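As a worked example consistent with the experiments in Sect. 4: for $N_r = 4$ stripes, $N_K = 64$ codewords per stripe, and 16-bit floating point histogram entries ($B = 2$ bytes), each image requires $N_r \times N_K \times B = 4 \times 64 \times 2 = 512$ bytes, matching the “# bytes” column reported for the 4 × 64 configurations in Table 1.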
Each cluster $k$ is assigned an entropy value $E_k = -\sum_{j=1}^{N_C} p_{jk} \log p_{jk}$, where $p_{jk}$ is the probability that an image of the $k$-th cluster belongs to class $j$. This probability is estimated as $p_{jk} = h_{jk}/n_k$, where $n_k$ is the number of images in cluster $k$ and $h_{jk}$ is the number of images in cluster $k$ that belong to class $j$.

Low-entropy clusters, i.e., clusters that contain mostly vectors from images of the same person, are preferable for retrieval tasks to high-entropy clusters, i.e., clusters that contain vectors from images that belong to several different persons. Therefore the aim is to learn a codebook $\mathbf{V}$ that minimizes the total entropy of a cluster configuration, which is defined as:

$$E = \sum_{k=1}^{N_T} r_k E_k \qquad (2)$$
However, it is not easy to directly optimize (2), as this function is not continuous with respect to $\mathbf{s}_i$. To this end, a continuous smooth approximation of the previously defined entropy is introduced. To distinguish the previous definition from the smooth entropy approximation, the former is called hard entropy and the latter soft entropy.
A smooth cluster membership vector $\mathbf{q}_i \in \mathbb{R}^{N_T}$ is defined for each histogram $\mathbf{s}_i$, where $q_{ik} = \exp(-\|\mathbf{s}_i - \mathbf{c}_k\|_2 / m)$. The corresponding smooth $l_1$-normalized membership vector is defined as $\mathbf{w}_i = \mathbf{q}_i / \|\mathbf{q}_i\|_1 \in \mathbb{R}^{N_T}$. The parameter $m$ controls the fuzziness of the assignment process: for $m \to 0$ each histogram is assigned to its nearest cluster, while larger values allow fuzzy membership.

Then, the quantities $n_k$ and $h_{jk}$ are redefined as $n_k = \sum_{i=1}^{N} w_{ik}$ and $h_{jk} = \sum_{i=1}^{N} w_{ik} \pi_{ij}$, where $\pi_{ij}$ is 1 if the $i$-th image belongs to class $j$ and 0 otherwise. These modifications lead to a smooth entropy approximation that converges to the hard entropy as $m \to 0$.
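To make these definitions concrete, the following sketch (our illustration) computes the soft entropy for a set of histograms, histogram-space centers, and one-hot class labels; the cluster weights $r_k$ are assumed to be $n_k / \sum_k n_k$, since the text does not define them explicitly.

```python
import numpy as np

def soft_entropy(S, C, Pi, m=0.01, eps=1e-12):
    """Smooth approximation of the total cluster entropy E, Eq. (2).

    S:  (N, NK) image histograms s_i.
    C:  (NT, NK) histogram-space centers c_k.
    Pi: (N, NC) one-hot labels, Pi[i, j] = 1 iff image i is of class j.
    m:  fuzziness parameter (m -> 0 recovers the hard entropy).
    """
    dists = np.linalg.norm(S[:, None, :] - C[None, :, :], axis=2)  # (N, NT)
    z = -dists / m
    z -= z.max(axis=1, keepdims=True)         # numerical stability
    q = np.exp(z)
    W = q / q.sum(axis=1, keepdims=True)      # smooth memberships w_i
    n = W.sum(axis=0)                         # soft cluster sizes n_k
    h = W.T @ Pi                              # h_jk, stored as (NT, NC)
    p = h / (n[:, None] + eps)                # class probabilities p_jk
    Ek = -(p * np.log(p + eps)).sum(axis=1)   # per-cluster entropies E_k
    r = n / n.sum()                           # assumed cluster weights r_k
    return float((r * Ek).sum())
```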
The derivative of $E$ with respect to $\mathbf{v}_m$ can be calculated as the product of two other partial derivatives: $\frac{\partial E}{\partial \mathbf{v}_m} = \sum_{l=1}^{N} \sum_{\kappa=1}^{N_K} \frac{\partial E}{\partial s_{l\kappa}} \frac{\partial s_{l\kappa}}{\partial \mathbf{v}_m}$. In order to reduce the entropy in the histogram space, the histograms $\mathbf{s}_l$ that, in turn, depend on the codebook $\mathbf{V}$ must be shifted. The partial derivative $\frac{\partial E}{\partial \mathbf{s}_l}$ provides the direction in which the histogram $\mathbf{s}_l$ must be moved. Since each codeword $\mathbf{v}_m$ lies in the feature space, the derivative $\frac{\partial s_{l\kappa}}{\partial \mathbf{v}_m}$ projects the previous direction into the codebook.
The calculation of these derivatives is straightforward and it is omitted due
to space constraints. Note that the histogram space derivative does not exist
when a histogram vector and a centroid vector coincide. The same holds for
the feature space derivative when a codebook center and a feature vector also
coincide. When that happens, the corresponding derivatives are set to 0.
During the optimization process the image histograms are updated. This might invalidate the initial choice of the centers $\mathbf{c}_k$, which should therefore also be updated during the optimization. Hence, the derivative of the objective function with respect to each $\mathbf{c}_k$, i.e., $\frac{\partial E}{\partial \mathbf{c}_m}$, is also calculated.

The codebook and the histogram centers can be optimized using gradient descent, i.e., $\Delta \mathbf{V} = -\eta \frac{\partial E}{\partial \mathbf{V}}$ and $\Delta \mathbf{c}_m = -\eta_c \frac{\partial E}{\partial \mathbf{c}_m}$, where $\eta$ and $\eta_c$ are the learning rates. In this work, the adaptive moment estimation algorithm, also known as Adam [5], is used instead of simple gradient descent, since it provides faster and more stable convergence. To avoid simply overfitting the histogram-space centers instead of learning the codebook, a smaller learning rate is used for the histogram-space centers. For all the conducted experiments the optimization process runs for 100 iterations and the following learning rates are used: $\eta = 0.01$ and $\eta_c = 0.001$. The default parameters are used for the Adam algorithm ($\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$) [5]. Also, to reduce the training time, 100 feature vectors are randomly sampled from each region during the training process. This reduces the training time without affecting the quality of the learned codebook. The proposed approach can also be adapted to work in incremental mode by using small mini-batches of, e.g., 100 face images. However, this is not always necessary, since in facial image retrieval the number of annotated training images is usually significantly smaller than the number of images to be encoded.
Regarding the initialization of the codebook and the histogram-space centers, several choices exist. In this work, the codebook is initialized using the k-means algorithm, as in the regular BoF model. For the histogram-space centers, $N_C$ centers are used (one for each person) and each center is initialized using the mean histogram of each person.
Finally, the softness parameters $m$ and $g$ must be selected. The parameter $g$ controls the softness of the quantization process, while $m$ controls the histogram membership fuzziness and should be set to a small enough value in order to closely approximate the hard entropy. It was experimentally established that the method is relatively stable with regard to these parameters: $m = 0.01$ and $g = 0.01$ are used for all the conducted experiments.
To better understand how the proposed method works, a toy example of the optimization process is provided. Four persons are chosen from the YouTube Faces dataset (described in Sect. 4), 30 images are selected for each person, and 64 codewords are used for each of the four stripes. The histogram space is visualized during the optimization process in Fig. 1. Since the histograms lie in a space with 4 × 64 dimensions, the PCA technique is used to visualize the resulting histograms. The expression and illumination variations in the face images lead to a representation where two or more subclasses exist for each person. Nonetheless, optimizing the representation using the proposed O-SBoF technique successfully reduces the overlap between histograms that belong to different classes and brings the histograms that belong to the same person significantly closer.
Fig. 1. Histogram space (PCA projection) during the optimization using the proposed O-SBoF method: the initial state and the states after 10, 50, and 100 iterations, for four persons (Barbara Walters, Celine Dion, Nicolas Cage, Tom Hanks)
4 Experiments
Two small-scale face recognition datasets, the ORL Database of Faces (ORL)
[12], and the cropped variant of the Extended Yale Face Database B (Yale B)
[1,7], and one large-scale face dataset, the YouTube Faces Database [14], were
used for the evaluation of the proposed method.
The ORL dataset contains 400 images from 40 different persons (10 images
per person) under varying pose and facial expression. The cropped Extended
Yale Face Database B contains 2432 images from 38 individuals. The images
were taken under greatly varying lighting conditions. The YouTube Faces dataset contains 621,126 frames from 3,425 videos of 1,595 individuals in total. The aligned version of this dataset is used, i.e., the face is already aligned in each image using face detection and alignment techniques. Before extracting the feature vectors, each image is cropped by removing 25 % of its margins.
For the small-scale experiments each dataset is randomly split into two sets, using half of the images of each person as the training set and the rest as the test set. The training set is used to build the database and train the O-SBoF model. The retrieval performance is evaluated using the images contained in the test set as queries. The training and retrieval evaluation processes are repeated five times, and the mean and standard deviation of the evaluated metrics are reported.
For the large-scale experiments a different evaluation strategy is used, similar to those of celebrity face image retrieval tasks [2]. The training set is formed by randomly selecting 100 images from each of the most popular persons appearing in the videos (5,900 training images are collected from the videos of the 59 most popular persons). A person is considered popular if he or she appears in at least 5 videos. To build the database, the images of persons that appear in at least 3 videos are used (the database contains 356,588 images from 226 persons). To evaluate the retrieval performance, 100 queries from the popular persons (celebrities) are randomly selected. The evaluation process is repeated five times, and the mean and standard deviation of the evaluated metrics are reported.
Throughout this paper two evaluation metrics are used: the interpolated precision (also abbreviated as ‘prec.’) and the mean average precision (mAP). The average precision (AP) for a given query is computed at eleven equally spaced recall points (0, 0.1, ..., 0.9, 1). A more detailed definition of the used evaluation metrics can be found in [10]. Also, for all the evaluated representations 16-bit floating point numbers are used, since this reduces the storage requirements without harming the retrieval precision.
Table 1. Small-scale evaluation using the ORL and the Extended Yale B datasets
Method # codewords # bytes mAP top-20 prec. top-50 prec. top-100 prec.
CSLBP - 960 37.06 ± 1.21 98.23 ± 1.21 94.47 ± 2.34 87.93 ± 2.56
FPLBP - 1120 37.90 ± 1.34 99.15 ± 0.57 96.74 ± 1.61 90.99 ± 2.35
SBoF 4 × 1024 8192 41.19 ± 1.04 99.99 ± 0.02 99.05 ± 0.26 91.50 ± 1.41
SBoF 4 × 4096 32768 40.95 ± 1.07 99.90 ± 0.12 98.54 ± 0.51 90.99 ± 1.48
SBoF 4 × 16 128 32.13 ± 1.35 98.93 ± 0.46 94.72 ± 1.25 84.47 ± 2.25
O-SBoF 4 × 16 128 40.89 ± 1.31 99.71 ± 0.28 97.45 ± 0.70 88.78 ± 1.77
SBoF 4 × 64 512 38.60 ± 1.32 99.84 ± 0.17 98.18 ± 0.43 89.55 ± 1.31
O-SBoF 4 × 64 512 47.19 ± 1.24 99.93 ± 0.13 99.41 ± 0.31 92.17 ± 1.29
5 Conclusions
In this paper, a supervised codebook learning technique for face retrieval was
presented. It was demonstrated using one large-scale dataset and two smaller
datasets that the proposed technique can improve the retrieval precision and, at
the same time, reduce the storage requirements and the retrieval time by almost
two orders of magnitude.
There are several future research directions. In this work it was assumed that the face images were already aligned using face detection and alignment techniques, and a fixed grid was used for the O-SBoF technique. However, the proposed method can be combined with facial feature detectors to more accurately define the regions used for feature extraction, further improving the retrieval precision and reducing the representation size. Furthermore, using more robust feature extractors is expected to improve the face recognition precision even more.
References
1. Georghiades, A., Belhumeur, P., Kriegman, D.: From few to many: illumination
cone models for face recognition under variable lighting and pose. IEEE Trans.
Pattern Anal. Mach. Intell. 23(6), 643–660 (2001)
2. Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-Celeb-1M: challenge of recognizing one million celebrities in the real world. In: IS&T International Symposium on Electronic Imaging (2016)
3. Iosifidis, A., Tefas, A., Pitas, I.: Discriminant bag of words based representation
for human action recognition. Pattern Recogn. Lett. 49, 185–192 (2014)
4. Jégou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image
search. Int. J. Comput. Vis. 87(3), 316–336 (2010)
5. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. Computing
Research Repository, abs/1412.6980 (2014)
6. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169–2178 (2006)
7. Lee, K., Ho, J., Kriegman, D.: Acquiring linear subspaces for face recognition under
variable lighting. IEEE Trans. Pattern Anal. Mach. Intell. 27(5), 684–698 (2005)
8. Lian, X.-C., Li, Z., Lu, B.-L., Zhang, L.: Max-margin dictionary learning for mul-
ticlass image categorization. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.)
ECCV 2010. LNCS, vol. 6314, pp. 157–170. Springer, Heidelberg (2010). doi:10.
1007/978-3-642-15561-1 12
9. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings
of the 7th IEEE International Conference on Computer Vision, vol. 2, pp. 1150–
1157 (1999)
10. Passalis, N., Tefas, A.: Entropy optimized feature-based bag-of-words represen-
tation for information retrieval. IEEE Trans. Knowl. Data Eng. 28, 1664–1677
(2016)
11. Passalis, N., Tefas, A.: Spectral clustering using optimized bag-of-features. In: Pro-
ceedings of the 9th Hellenic Conference on Artificial Intelligence, p. 19 (2016)
12. Samaria, F.S., Harter, A.C.: Parameterisation of a stochastic model for human face
identification. In: Proceedings of the Second IEEE Workshop on Applications of
Computer Vision, pp. 138–142 (1994)
13. Wang, C., Wang, Y., Zhang, Z.: Patch-based bag of features for face recognition
in videos. Biometric Recogn. 7701, 1–8 (2012)
14. Wolf, L., Hassner, T., Maoz, I.: Face recognition in unconstrained videos with
matched background similarity. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp. 529–534 (2011)
15. Wolf, L., Hassner, T., Taigman, Y.: Descriptor based methods in the wild. In: Real-Life Images Workshop at the European Conference on Computer Vision (ECCV) (2008)
16. Wu, Z., Ke, Q., Sun, J., Shum, H.-Y.: Scalable face image retrieval with identity-
based quantization and multireference reranking. IEEE Trans. Pattern Anal. Mach.
Intell. 33(10), 1991–2001 (2011)
17. Yang, S., Bebis, G., Chu, Y., Zhao, L.: Effective face recognition using bag of
features with additive kernels. J. Electron. Imaging 25(1), 013025 (2016)
18. Zhang, X., Gao, Y.: Face recognition across pose: a review. Pattern Recogn. 42(11),
2876–2896 (2009)
Compact Video Description
and Representation for Automated
Summarization of Human Activities
1 Introduction
Several applications exist nowadays where large-scale video footage depicting human activities needs to be analyzed, possibly on a frame-by-frame basis, requiring human intervention. Examples include professional capture sessions, where the action described in the script is typically filmed using multiple cameras, or streams from surveillance cameras which may be capturing continuously for many days. A very large volume of data is usually produced in such scenarios, which may well exceed 6 TB per day [1]. This amount of data is difficult to

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 316564 (IMPART).
In the end, a set of description vectors over various image regions and at various scales is produced, including one truly global description vector (computed over the entire $V_i^k$). The available local information is more spatially focused for higher values of $d$, at the cost of higher computational requirements. In general, however, the main advantage of global image descriptors, i.e., rapid computation [20], is mostly retained with this simple spatial partitioning scheme, in comparison to more complicated alternative approaches such as image segmentation.

Below, the descriptors used for the evaluation of the proposed framework are presented.
Global histograms computed in various image modalities are the most commonly used feature descriptors for video summarization. For instance, in [6], 16-bin hue histograms derived from the frame representation in the HSV color space are employed. In the presented framework, a histogram resolution of 16 bins is also adopted, in the context of the multimodal frame partitioning scheme previously described.
FMoD [8] was also adopted and adapted to our multimodal frame partitioning scheme. FMoD operates on each m × n block by computing one profile vector for the horizontal dimension and one for the vertical dimension, through averaging pixel values across block columns/rows, respectively. The result is an n-dimensional and an m-dimensional profile vector. Each of the two vectors is summarized by its first 4 statistical moments (mean, standard deviation, skewness, kurtosis). The resulting 8-dimensional vector $\mathbf{f}_i = [m_{1H}, m_{2H}, m_{3H}, m_{4H}, m_{1V}, m_{2V}, m_{3V}, m_{4V}]^T$ compactly captures the statistical properties of the block.

In this work, FMoD was extended by adding first-order statistical texture analysis components to the summary of each profile vector, i.e., energy and entropy. Moreover, on top of the horizontal and the vertical profile vectors, a third block vector is constructed by vectorizing the actual block in row-major order. The same statistical synopsis is also applied to this vector, resulting in an 18-dimensional block description vector $\mathbf{f}_i = [m_{1H}, \cdots, m_{6H}, m_{1V}, \cdots, m_{6V}, m_{1B}, \cdots, m_{6B}]^T$. In this notation, H, V, and B refer to the extracted statistical properties of the horizontal profile vectors, vertical profile vectors, and block vectors, respectively.
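A sketch of the extended FMoD block descriptor follows (our reconstruction from the description above). The exact first-order energy and entropy definitions are our assumptions, since several variants exist; here both are computed from a normalized 16-bin histogram of the vector values.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def stats6(v, bins=16, eps=1e-12):
    """First 4 moments plus assumed first-order energy/entropy of a vector."""
    hist, _ = np.histogram(v, bins=bins)
    p = hist / (hist.sum() + eps)             # normalized value histogram
    energy = np.sum(p ** 2)
    entropy = -np.sum(p * np.log2(p + eps))
    return np.array([v.mean(), v.std(), skew(v), kurtosis(v), energy, entropy])

def extended_fmod(block):
    """18-dimensional extended FMoD descriptor of an m x n image block."""
    h_profile = block.mean(axis=0)            # n-dimensional horizontal profile
    v_profile = block.mean(axis=1)            # m-dimensional vertical profile
    b_vector = block.ravel(order="C")         # block vectorized in row-major order
    return np.concatenate([stats6(h_profile), stats6(v_profile),
                           stats6(b_vector)])

block = np.random.default_rng(1).random((32, 32))
assert extended_fmod(block).shape == (18,)
```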
To form the LMoD descriptor, the description vector of the extended FMoD descriptor is computed over every cell of the spatial partitioning scheme and all such vectors are concatenated into an 18S(d)-dimensional block description vector, where S(d) is given by Eq. (1).
Instead of sparsely detecting interest points, as in the case of SIFT or SURF, the luminance modality of the $i$-th video frame ($V_i^l$) is densely sampled on a rectangular grid to extract the block centers where LMoD vectors are to be computed, using a sampling step of $s$ pixels. Subsequently, each candidate block is checked for luminance homogeneity, in order to dismiss blocks conveying minimal information. To achieve rapid computation, this is simply implemented using a threshold $t_l$ on the standard deviation of the block luminance.

Dense sampling of interest points allows the background of a depicted activity to be taken into account and complements the holistic nature of LMoD descriptors. As in the case of SURF, description vectors are constructed on all employed modalities at the spatial coordinates computed in $V_i^l$.
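The dense sampling and homogeneity check might look as follows (a sketch; the block size, grid step s, and threshold tl are free parameters):

```python
import numpy as np

def dense_block_centers(lum, block=16, s=4, tl=5.0):
    """Centers of non-homogeneous blocks on a regular grid.

    lum: 2-D luminance channel of a video frame.
    block: block side length; s: grid step in pixels;
    tl: standard-deviation threshold for luminance homogeneity.
    """
    half = block // 2
    H, W = lum.shape
    centers = []
    for y in range(half, H - half, s):
        for x in range(half, W - half, s):
            patch = lum[y - half:y + half, x - half:x + half]
            # Dismiss nearly uniform blocks conveying minimal information.
            if patch.std() > tl:
                centers.append((y, x))
    return centers
```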
The set of all trajectory vectors from all video frames is employed to construct a codebook of size $c$, which is subsequently used to compute a trajectory histogram $\mathbf{h}_i$ per video frame $V_i^l$. Unlike in the traditional BoF approach, computing $\mathbf{h}_i$ includes a simple weighting scheme: the contribution of each trajectory $\mathbf{t}_j$ is weighted based on the relation between its temporal position $b_j$ and $i$. That is, the corresponding weight $w_j$ is derived from a discrete Gaussian over the temporal axis with its peak at position $i$, where each $\mathbf{t}_j$ is assigned to position $b_j + (T_w/2)$. Obviously, trajectories not containing position $i$ are completely disregarded. In the end, a $c$-dimensional trajectory histogram $\mathbf{h}_i$ has been produced for each video frame $V_i^l$, encoding spatiotemporal activity information.

By employing the approach described above, activity motion descriptions are computed as video features complementary to local description vectors, in a manner that allows per-frame activity representation.
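The per-frame trajectory histogram with Gaussian temporal weighting can be sketched as follows (our illustration; the Gaussian width sigma and the final l1 normalization are assumptions):

```python
import numpy as np

def trajectory_histogram(i, traj_words, traj_starts, c, Tw, sigma=8.0):
    """Weighted BoF trajectory histogram h_i for video frame i.

    traj_words:  codeword index of each trajectory.
    traj_starts: starting frame b_j of each trajectory.
    c: codebook size; Tw: trajectory length in frames.
    """
    h = np.zeros(c)
    for word, b in zip(traj_words, traj_starts):
        if not (b <= i < b + Tw):
            continue                  # trajectories not containing i are ignored
        mid = b + Tw / 2.0            # temporal position assigned to trajectory
        h[word] += np.exp(-0.5 * ((mid - i) / sigma) ** 2)  # Gaussian weight
    norm = h.sum()
    return h / norm if norm > 0 else h
```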
3 Quantitative Evaluation
3.1 Evaluation Dataset
In order to experimentally evaluate the proposed framework and descriptors, a subset of the publicly available, annotated IMPART video dataset [19] was employed. It depicts three subjects/actors in two different settings: one outdoor and one indoor. A living-room-like setting was set up for the latter. Two scripts, one for the outdoor and one for the indoor setting, prescribed the human activities performed by a single human subject during shooting. In each shooting session the camera was static and the script was executed three times in succession, once per subject/actor. This was repeated three times per script, for a total of 3 indoor and 3 outdoor shooting sessions; thus each script was executed three times per actor. Three main actions were performed, namely “Walk”, “Hand-wave” and “Run”, while additional distractor actions were also included and jointly categorized as “Other” (e.g., “Jump-up-down”, “Jump-forward”, “Bend-forward”). During shooting, the actors moved along predefined trajectories defined by three waypoints (A, B and C). Summing up, the dataset consists of 6 MPEG-4 compressed video files with a resolution of 720 × 540 pixels, each depicting three actors performing a series of actions one after another. The mean duration of the videos is about 182 s, or 4542 frames.
The ground truth annotation data provided with the IMPART dataset describe obvious activity segment frame boundaries rather than key-frames pre-selected by users, as in [6] (which would be highly subjective); this fact was exploited to evaluate the proposed framework as objectively as possible. Given the results of each summarization algorithm for each video, the number of extracted key-frames derived from actually different activity segments (hereafter called independent key-frames) can be used as an indication of summarization success. Therefore, the ratio of extracted independent key-frames to the total number of requested key-frames K, hereafter called the Independence Ratio (IR) score, is a practical evaluation metric.
Table 1. A comparison of the mean IR scores for different video description and
representation methods, using K-Means++ summarization.

Method                Mean IR
Framework Histogram   0.685
Extended FMoD         0.723
LMoD                  0.740
SURF                  0.484
LMoD+Trajectories     0.802
SURF+Trajectories     0.553
Global Histogram      0.571
Table 2. A comparison of the mean execution time requirements per-frame for different
video description and representation methods.
successful than Framework Histogram, with LMoD providing the best results in
both metrics. Additionally, the Trajectories per-frame activity descriptor seems
to beneficially enrich the informational content of both employed local descrip-
tors (LMoD and SURF), resulting in the combination LMoD+Trajectories being
the best choice. Obviously, it is reasonable that activity description aids the sum-
marization of activity videos. Not unexpectedly, this comes at the cost of a three-
fold increase in required computational time in comparison to the traditional
global image histograms. This indicates a typical trade-off between summariza-
tion quality and computational requirements, with better performing descriptors
being more appropriate for off-line/non real-time applications.
The very low performance of SURF is of high interest, since it validates our
assumption that sparsely sampled and highly invariant descriptors designed for
recognition tasks are not necessarily suitable for video summarization. This fits
with previous results in [3], which indicated that in the absence of clear shot
boundary information, global image color histograms produced better results
than SIFT and SURF. A local descriptor that is holistic, according to our def-
inition, and densely sampled, i.e., LMoD, outperforms both approaches, pos-
sibly because it captures spatial image properties lost in the case of simple
global histograms.
4 Conclusions
We have proposed a consistent video description and representation framework
that accommodates per-frame local, global and activity descriptors, with the
goal of assisting successful automated summarization of human activity videos.
The framework has been objectively evaluated on a publicly available dataset
(IMPART), using the most common video summarization method, i.e., frame
clustering. In all cases, several image modalities are being exploited (luminance,
hue, edges, optical flow magnitude) in order to simultaneously capture informa-
tion about the depicted shapes, colors, lighting, textures and motions. In this
context, the introduced Extended FMoD, LMoD and Trajectories descriptors
(specially adapted novel extensions of pre-existing descriptors) outperform com-
peting approaches, with the LMoD+Trajectories combination proving to be the
most effective.
References
1. Evans, A., Agenjo, J., Blat, J.: Combined 2D and 3D web-based visualisation of
on-set big media data. In: IEEE International Conference on Image Processing
(ICIP), pp. 1120–1124 (2015)
2. Money, A.G., Agius, H.: Video summarization: a conceptual framework and survey
of the state of the art. J. Vis. Commun. Image Represent. 19(2), 121–143 (2008)
3. Cahuina, E.J., Chavez, G.C.: A new method for static video summarization using
local descriptors and video temporal segmentation. In: Conference on Graphics,
Patterns and Images (SIBGRAPI), pp. 226–233. IEEE (2013)
4. Hu, W., Xie, N., Li, L., Zeng, X., Maybank, S.: A survey on visual content-based
video indexing and retrieval. IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev.
41(6), 797–819 (2011)
5. Zhuang, Y., Rui, Y., Huang, T., Mehrotra, S.: Adaptive key frame extraction using
unsupervised clustering. In: International Conference on Image Processing (ICIP),
pp. 866–870. IEEE (1998)
6. De Avilla, S.E.F., Lopes, A.P.B., Luz Jr., A.L., Araujo, A.A.: VSUMM: a mecha-
nism designed to produce static video summaries and a novel evaluation method.
Pattern Recogn. Lett. 32(1), 56–68 (2011)
7. Wan, T., Qin, Z.: A new technique for summarizing video sequences through his-
togram evolution, pp. 1–5. IEEE (2010)
8. Mademlis, I., Nikolaidis, N., Pitas, I.: Stereoscopic video description for key-frame
extraction in movie summarization. In: European Signal Processing Conference
(EUSIPCO), pp. 819–823. IEEE (2015)
9. Li, J.: Video shot segmentation and key frame extraction based on SIFT feature.
In: International Conference on Image Analysis and Signal Processing (IASP), pp.
1–8. IEEE (2012)
10. Lowe, D.G.: Object recognition from local scale-invariant features. In: International
Conference on Computer Vision (ICCV), pp. 1150–1157. IEEE (1999)
11. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF).
Comput. Vis. Image Underst. 110(3), 346–359 (2008)
12. Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization
with bags of keypoints. In: European Conference on Computer Vision (ECCV),
pp. 1–2 (2004)
13. Tian, Z., Xue, J., Lan, X., Li, C., Zheng, N.: Key object-based static video summa-
rization. In: ACM International Conference on Multimedia, pp. 1301–1304 (2011)
14. Cernekova, Z., Pitas, I., Nikou, C.: Information theory-based shot cut/fade detec-
tion and video summarization. IEEE Trans. Circuits Syst. Video Technol. 16(1),
82–91 (2006)
15. Fu, W., Wang, J., Gui, L., Lu, H., Ma, S.: Online video synopsis of structured
motion. Neurocomputing 135, 155–162 (2014)
16. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by Dense Trajec-
tories. In: IEEE Conference on Computer Vision & Pattern Recognition (CVPR),
pp. 3169–3176 (2011)
17. Mademlis, I., Iosifidis, A., Tefas, A., Nikolaidis, N., Pitas, I.: Exploiting stereo-
scopic disparity for augmenting human activity recognition performance. Multi-
media Tools Appl. 75, 1–20 (2015)
18. Kourous, N., Iosifidis, A., Tefas, A., Nikolaidis, N., Pitas, I.: Video characteriza-
tion based on activity clustering. In: International Conference on Electrical and
Computer Engineering (ICECE), pp. 266–269. IEEE (2014)
19. Kim, H., Hilton, A.: Influence of colour and feature geometry on multi-modal 3D
point clouds data registration. In: International Conference on 3D Vision (3DV),
pp. 202–209 (2014)
20. Penatti, O., Valle, E., da Silva Torres, R.: Comparative study of global color and
texture descriptors for Web image retrieval. J. Vis. Commun. Image Representation
23(2), 359–380 (2012)
21. Arthur, D., Vassilvitskii, S.: K-Means++: the advantages of careful seeding. In:
Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and
Applied Mathematics (2007)
22. Farnebäck, G.: Two-frame motion estimation based on polynomial expansion.
In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 363–370.
Springer, Heidelberg (2003). doi:10.1007/3-540-45103-X_50
Incremental Estimation of Visual Vocabulary
Size for Image Retrieval
Abstract. The increasing number of image databases in recent years
has highlighted the need to represent an image collection efficiently and
quickly. The majority of image retrieval and image clustering approaches
has been based on the construction of a visual vocabulary in the so
called Bag-of-Visual-words (BoV) model, analogous to the Bag-of-Words
(BoW) model in the representation of a collection of text documents. A
visual vocabulary (codebook) is constructed by clustering all available
visual features in an image collection, using k-means or approximate k-
means, which require as input the number of visual words, i.e. the size of
the visual vocabulary; this number is hard to tune or to estimate directly
from the total amount of visual descriptors. In order to avoid tuning or guessing
the number of visual words, we propose an incremental estimation of
the optimal visual vocabulary size, based on the DBSCAN-Martingale,
which has been introduced in the context of text clustering and is able to
estimate the number of clusters efficiently, even for very noisy datasets.
For a sample of images, our method estimates the potential number of
very dense SIFT patterns for each image in the collection. The proposed
approach is evaluated in an image retrieval and in an image cluster-
ing task, by means of Mean Average Precision and Normalized Mutual
Information.
1 Introduction
Image retrieval and image clustering are related tasks because of their need to
efficiently and quickly search for nearest neighbors in an image collection. Taking
into account that image collections are dramatically increasing (e.g. Facebook
and Flickr), both tasks, retrieval and clustering, become very challenging and
traditional techniques scale poorly. Nowadays, there are many applications of
image retrieval and clustering, such as assisting personal photo organization
and search.
Searching an image collection for similar images is strongly affected by
the representation of the images. Spatial verification techniques for image rep-
resentation, like RANSAC and pixel-to-pixel comparisons, are computationally
expensive.
Therefore, we are able to build efficient visual vocabularies without tuning the
size or guessing a value for k. Our proposed method is a hybrid framework, which
combines the recent DBSCAN-Martingale [6] and k-means clustering. The pro-
posed hybrid framework is evaluated in the image retrieval and image clustering
problems, where we initially provide an estimation for the number of visual words
k, using the DBSCAN-Martingale, and then cluster all visual descriptors into k
clusters, as traditionally done with k-means clustering.
In Sect. 2 we present the related work in visual vocabulary construction and
in Sect. 3 we briefly present the DBSCAN-Martingale estimator of the number
of clusters. In Sect. 4, our proposed hybrid method for the construction of visual
vocabularies is described in detail, and finally, in Sect. 5, it is evaluated under
the image retrieval and image clustering tasks.
2 Related Work
The Bag-of-Visual-Words (BoV) model initially appeared in [17], in which
k-means clustering is applied for the construction of a visual vocabulary. The
constructed visual vocabulary is then used for image retrieval purposes and is
similar to the Bag-of-Words model, where a vocabulary of words is constructed,
mainly for text retrieval, clustering and classification. In the BoV model, the
image query and each image of the collection are represented as a sparse vec-
tor of term (visual word) occurrences, weighted by tf-idf scores. The similarity
between the query and each image is calculated, using the Mahalanobis distance
or simply the Euclidean distance. However, there is no obvious value for the
number of clusters k in the k-means clustering algorithm.
Other approaches for the construction of visual vocabularies include Approx-
imate k-means (AKM) clustering, which offers scalable construction of visual
vocabularies. Hierarchical k-means (HKM) was the first approximate method
for fast and scalable construction of a visual vocabulary [11], where data points
are clustered by k = 2 or k = 10 using k-means clustering and then k-means
is applied to each one of the newly generated clusters, using the same num-
ber of clusters k. After n steps (levels), the result is k^n clusters. HKM has
been outperformed by AKM [15], where a forest of 8 randomized k-d trees pro-
vides approximate nearest neighbor search between points and the approximately
nearest cluster centers. The use of 8 randomized k-d trees with skewed splits has
recently been proposed, in the special case of SIFT descriptors [5]. However, all
AKM clustering methods require as input the number of clusters k, so an efficient
estimation of k is necessary.
The need to estimate the number of visual words emerges from the com-
putational cost of k-means algorithm, either in exact or approximate k-means
clustering [19]. Apart from being a time-consuming process, tuning the num-
ber of clusters k may significantly affect the performance of the image retrieval
task [15]. Some studies assume a fixed value of k, such as k = 4000 in [18], but
in general the choice of k varies from 10^3 up to 10^7, as stated in [12]. In another
approach, 10 clusters are extracted using k-means for each one of the considered
classes (categories), which are then concatenated in order to form a global visual
vocabulary [20]. In contrast, we shall estimate the number of clusters using the
DBSCAN-Martingale [6], which automatically estimates the number of clusters,
based on an extension of DBSCAN [3], without a priori knowledge of the density
parameter minPts of DBSCAN. DBSCAN-Martingale generates a probability
distribution over the number of clusters and has been applied to news clustering,
in combination with LDA [6]. Contrary to the previous text clustering approach,
DBSCAN-Martingale shall be presented in the context of visual vocabulary con-
struction for the estimation of the optimal number of visual words, to support
any image retrieval or clustering task.
Fig. 1. Probability distribution over the number of clusters (x-axis: clusters;
y-axis: probability).
ID with size greater than minPts. After T stages, we have progressively gained
knowledge about the final number of clusters k̂, since all clusters have been
extracted with high probability.
The estimation of the number of clusters k̂ is a random variable, because of the
randomness of the generated density levels ε_t, t = 1, 2, . . . , T. For each realiza-
tion of the DBSCAN-Martingale one estimate k̂ is generated, and the final
estimation of the number of clusters has been proposed [6] as the majority
vote over 10 realizations of the DBSCAN-Martingale. The percentage of realiza-
tions in which the DBSCAN-Martingale outputs exactly k̂ clusters forms a
probability distribution, such as the one shown in Fig. 1.
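A simplified sketch of one realization and of the majority vote is given below, assuming Euclidean SIFT vectors in X; eps_max, T and min_pts are illustrative parameters, and the full algorithm in [6] keeps only the unclustered members of each newly found cluster.

import numpy as np
from collections import Counter
from sklearn.cluster import DBSCAN

def martingale_realization(X, eps_max, T=5, min_pts=10, rng=None):
    rng = rng or np.random.default_rng()
    levels = np.sort(rng.uniform(0, eps_max, size=T))  # random density levels
    clustered = np.zeros(len(X), dtype=bool)
    k = 0
    for eps in levels:
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        for lab in set(labels) - {-1}:                 # -1 marks noise
            members = labels == lab
            # Count only newly extracted clusters of size greater than min_pts.
            if members.sum() > min_pts and not clustered[members].any():
                clustered |= members
                k += 1
    return k

def estimate_clusters(X, eps_max, realizations=100, **kw):
    # Majority vote over the realizations of the DBSCAN-Martingale.
    votes = Counter(martingale_realization(X, eps_max, **kw)
                    for _ in range(realizations))
    return votes.most_common(1)[0][0]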
Fig. 2. The estimation of the number of visual words in an image collection. Each image
i contributes with ki visual words to the overall estimation of the visual vocabulary
size.
Starting from the first image, keypoints are detected and SIFT descriptors
[9] are extracted. Each visual feature is represented as a 128-dimensional vector;
hence the whole image i is a matrix M_i with 128 columns, whose number of
rows equals the number of detected keypoints. On each matrix M_i, the 128-
dimensional vectors are clustered using the DBSCAN-Martingale, which outputs
the number of dense patterns in the set of visual features, as provided by several
density levels. Assuming that the application of 100 realizations of the DBSCAN-
Martingale has output k_1 for the first image, k_2 for the second image and k_l for
the l-th image (Fig. 2), the proposed optimal size of the visual vocabulary is:
\hat{k} = \sum_{i=1}^{l} k_i \qquad (2)
We utilize the estimation k̂, provided by Eq. (2), in order to cluster all visual
features by k̂ using k-means clustering. Therefore, a visual vocabulary of size
k̂ is constructed. After the construction of a visual vocabulary, as shown in
Fig. 3, images are represented using term-frequency scores with inverse document
frequency weighting (tf-idf) [17]:
\mathrm{tfidf}_{id} = \frac{n_{id}}{n_d} \log\frac{D}{n_i} \qquad (3)
where nid is the number of occurrences of visual word i in image d, nd is the
number of visual words in image d, ni is the number of occurrences of visual
word i in the whole image collection and D is the total number of images in the
database.
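A minimal numpy sketch of Eq. (3) follows, assuming counts[d, i] holds the occurrences n_id of visual word i in image d, and that every image and every visual word is non-empty.

import numpy as np

def tfidf(counts):
    counts = np.asarray(counts, dtype=float)
    D = counts.shape[0]                       # total number of images
    n_d = counts.sum(axis=1, keepdims=True)   # visual words per image
    n_i = counts.sum(axis=0, keepdims=True)   # word occurrences in collection
    return (counts / n_d) * np.log(D / n_i)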
In the following section, we test our hybrid visual vocabulary construction
in the image retrieval and image clustering problems.
1 https://www.r-project.org/.
2 https://github.com/MKLab-ITI/topic-detection/blob/master/DBSCAN_Martingale.r.
3 https://cran.r-project.org/web/packages/dbscan/index.html.
Fig. 3. The hybrid visual vocabulary construction framework using the DBSCAN-
Martingale for the estimation of k and either exact or approximate k-means clustering
by k. After the visual vocabulary is constructed, the collection of images is efficiently
represented for any application, such as image retrieval or clustering.
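Reusing the estimate_clusters sketch above, the hybrid framework of Fig. 3 can be summarized as follows; sift_per_image is a sample of per-image SIFT matrices and all_descriptors stacks the descriptors of the whole collection.

from sklearn.cluster import KMeans

def build_vocabulary(sift_per_image, all_descriptors, eps_max):
    # Incremental estimation: each sampled image contributes k_i words (Eq. 2).
    k_hat = sum(estimate_clusters(M, eps_max) for M in sift_per_image)
    # Cluster all descriptors into k_hat visual words (20 k-means iterations,
    # as in the experimental setup below).
    km = KMeans(n_clusters=k_hat, max_iter=20).fit(all_descriptors)
    return km.cluster_centers_, k_hat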
5 Experiments
We evaluate our method in the image retrieval and image clustering tasks, in
which nearest neighbor search is performed in an unsupervised way. The datasets
we have selected are the WANG 1K and the Caltech 2.5 K (2,516 images), with
queries as described in [5] for the image retrieval task. The number of extracted
visual descriptors (SIFT) is 505,834 and 769,546 128-dimensional vectors in the
WANG 1K and Caltech 2.5 K datasets, respectively. The number of topics is 10
for the WANG dataset and 21 for the Caltech, allowing also image clustering
experiments with the considered datasets. We selected these datasets because
they are appropriate for performing both image retrieval and image clustering
experiments and tuning the number of visual words k may be done in reasonable
processing time, so as to evaluate the visual vocabulary construction in terms of
Mean Average Precision (MAP) and Normalized Mutual Information (NMI).
Keypoints are detected and SIFT descriptors are extracted using the LIP-
VIREO toolkit. For the implementation of DBSCAN-Martingale we used the
R-script available on GitHub, with max = 200 and 100 realizations. We
build visual vocabularies for several numbers of visual words k, tuned over
k ∈ {100, 200, 300, . . . , 4000}. The parameter minPts is tuned from 5 to 30
and the final number of clusters per image is the one most robust to variations
of minPts. In k-means clustering, we allow a maximum of 20 iterations.
4 http://wang.ist.psu.edu/docs/related/.
5 http://www.vision.caltech.edu/Image_Datasets/Caltech101/.
6 http://pami.xmu.edu.cn/~wlzhao/lip-vireo.htm.
7 https://github.com/MKLab-ITI/topic-detection/blob/master/DBSCAN_Martingale.r.
(a) MAP for the WANG dataset (b) NMI for the WANG dataset
(c) MAP for the Caltech dataset (d) NMI for the Caltech dataset
Fig. 4. Evaluation using MAP and NMI in image retrieval and image clustering tasks
for the WANG and Caltech datasets. The MAP and NMI scores obtained by our k̂
estimation are shown as the straight red line.
Our estimates of the number of visual words are k̂ = 2180 and k̂ = 1840
for the WANG and Caltech datasets, respectively, given a sample of 200 images.
The corresponding MAP and NMI are compared to the best MAP and NMI
scores over k ∈ {100, 200, 300, . . . , 4000}. The results are reported in Table 1, where
apart from the best MAP and NMI scores, we also present the ratio of the MAP
(NMI) provided by our k̂-estimation to the maximum observed MAP (NMI)
score, denoted by r_MAP (r_NMI). In particular, in the WANG dataset, MAP
is 96.42 % of the best MAP observed and NMI is 94.91 % of the best NMI.
Similar behavior is observed in the Caltech dataset, where NMI reaches 95.36 %
and MAP 80.06 % of the respective best scores. In Fig. 4 we observe that our
estimated k̂ yields scores close to the best observed ones.
6 Conclusion
We presented an incremental estimation of the optimal size of the visual vocab-
ulary, which efficiently estimates the number of visual words, and evaluated the
performance of the constructed visual vocabulary in an image retrieval and an
image clustering task. Increasing the visual vocabulary size does not guarantee
that the performance of the corresponding retrieval and clustering tasks will also
increase as shown in Fig. 4 (c) for the Mean Average Precision. Our proposed
hybrid framework utilizes the output of DBSCAN-Martingale on a sample of
SIFT descriptors, to estimate the visual vocabulary size, in order to be used as
input in any k-means or approximate k-means clustering for the construction
of a visual vocabulary. The estimation is incremental and can be distributed
to several threads. A potential limitation of our approach could appear in the
case where an image exists more than once in the image collection and therefore
needlessly contributes extra visual words to the final estimation. However, if
the sample of images on which the DBSCAN-Martingale is applied does not
have duplicate images, the overall estimation will not be affected. In the future,
we plan to test our method using other visual features and in the context of
multimedia retrieval, where multiple modalities are employed.
References
1. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: ordering points
to identify the clustering structure. ACM Sigmod Rec. 28(2), 49–60 (1999)
2. Devroye, L.: Sample-based non-uniform random variate generation. In: Proceedings
of the 18th Conference on Winter Simulation, pp. 260–265. ACM, December 1986
3. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discov-
ering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
4. Gan, J., Tao, Y.: DBSCAN revisited: mis-claim, un-fixability, and approximation.
In: Proceedings of the 2015 ACM SIGMOD International Conference on Manage-
ment of Data, pp. 519–530. ACM, May 2015
5. Gialampoukidis, I., Vrochidis, S., Kompatsiaris, I.: Fast visual vocabulary con-
struction for image retrieval using skewed-split kd trees. In: MultiMedia Modeling,
pp. 466–477. Springer International Publishing, January 2016
6. Gialampoukidis, I., Vrochidis, S., Kompatsiaris, I.: A hybrid framework for news
clustering based on the DBSCAN-Martingale and LDA. In: Machine Learning
and Data Mining in Pattern Recognition, pp. 170–184. Springer International
Publishing, July 2016
7. He, Y., Tan, H., Luo, W., Feng, S., Fan, J.: MR-DBSCAN: a scalable MapReduce-
based DBSCAN algorithm for heavily skewed data. Front. Comput. Sci. 8(1), 83–99
(2014)
8. Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a
compact image representation. In: 2010 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 3304–3311. IEEE, June 2010
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Intl. J. Com-
put. Vis. 60(2), 91–110 (2004)
10. Markatopoulou, F., Mezaris, V., Patras, I.: Cascade of classifiers based on binary,
non-binary and deep convolutional network descriptors for video concept detection.
In: 2015 IEEE International Conference on Image Processing (ICIP), pp. 1786–
1790. IEEE, September 2015
11. Mikolajczyk, K., Leibe, B., Schiele, B.: Multiple object class detection with a gen-
erative model. In: 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, vol. 1, pp. 26–36. IEEE, June 2006
12. Mikulik, A., Chum, O., Matas, J.: Image retrieval for online browsing in large image
collections. In: Brisaboa, N., Pedreira, O., Zezula, P. (eds.) SISAP 2013. LNCS,
vol. 8199, pp. 3–15. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41062-8_2
13. Perronnin, F., Sánchez, J., Mensink, T.: Improving the fisher kernel for large-scale
image classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV
2010. LNCS, vol. 6314, pp. 143–156. Springer, Heidelberg (2010).
doi:10.1007/978-3-642-15561-1_11
14. Philbin, J.: Scalable object retrieval in very large image collections. Doctoral dis-
sertation, Oxford University (2010)
15. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with
large vocabularies and fast spatial matching. In: IEEE Conference on Computer
Vision and Pattern Recognition, 2007, CVPR 2007, pp. 1–8. IEEE, June 2007
16. Rawlings, J.O., Pantula, S.G., Dickey, D.A.: Applied regression analysis: a research
tool. Springer Science & Business Media (1998)
17. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching
in videos. In: Ninth IEEE International Conference on Computer Vision, 2003,
Proceedings, pp. 1470–1477. IEEE, October 2003
18. Van De Sande, K.E., Gevers, T., Snoek, C.G.: Evaluating color descriptors for
object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1582–
1596 (2010)
19. Wang, J., Wang, J., Ke, Q., Zeng, G., Li, S.: Fast approximate k-means via clus-
ter closures. In: Multimedia Data Mining and Analytics, pp. 373–395. Springer
International Publishing (2015)
20. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels
for classification of texture and object categories: a comprehensive study. Intl. J.
Comput. Vis. 73(2), 213–238 (2007)
Attribute Learning for Network Intrusion
Detection
1 Introduction
Nowadays society is becoming increasingly dependent on the use of computer sys-
tems in various fields such as finance, security and many aspects of everyday life.
On the other hand, threats and attacks are constantly growing. The Cyber-Security
research area looks at the ability to act proactively in order to mitigate or
prevent attacks. In that sense Network Intrusion Detection (NID) is one of the
(possible) solutions. This task is basically carried out under two approaches: (i)
Misuse Detection and (ii) Anomaly Detection. These approaches have advan-
tages and disadvantages associated with their suitability to various scenarios [1].
There are machine learning based solutions to these approaches [2,3]. Despite
the extensive academic research, their deployment in operational NID environ-
ments has been very limited [4]. On the other hand new attacks are constantly
occurring, often as variants of known attacks.
Traditional machine learning approaches are unable to tackle challenging
scenarios in which new classes may appear after the learning stage. This scenario
is present in many real-world situations [5], and specifically in NID-related tasks,
where one of the main problems is the emergence of new attacks, which correspond
to new classes. In that case the detection algorithm should identify the new classes,
which is important in real environments. Recently, there has been an increasing
interest in the study of ZSL approaches, which might be a possible solution to
this problem [5–7]. In this paper we propose an Attribute Learning for Network
Intrusion Detection (ALNID) algorithm. We present the preliminary results as
an initial step prior to the detection of new attacks on networks. A proposal of
the experimental data setup for the application of ZSL in NID is also given. The
proposed attribute learning algorithm can be generalized to many problems.
It is simple and based on criteria such as attribute frequency and entropy,
and it learns new values for the original attributes (a loose illustration is
sketched at the end of this section). The results show a
significant improvement in the representation of the attributes for classes. This
is encouraging since we expect to achieve higher accuracy in the inference stage
of ZSL. The rest of the paper is organized as follows. In Sect. 2 we briefly review
the background, related work on ZSL as well as the attribute learning prior
stage. In Sect. 3 we describe the ALNID algorithm. In Sect. 4 we present the
data preprocessing and propose an experimental data setup for the application
of ZSL in NID. In Sect. 5 we discuss the results on KDD Cup 99 dataset. Finally,
we conclude and address some lines of future work.
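The ALNID algorithm itself is detailed in Sect. 3; as a loose illustration of the frequency- and entropy-based criteria mentioned above, the sketch below scores each attribute by the Shannon entropy of its empirical value distribution. All names here are hypothetical and not taken from the paper.

import numpy as np

def attribute_entropy(values):
    # Shannon entropy of the empirical distribution of one attribute.
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def attribute_scores(X):
    # X: (instances x attributes) array; one entropy score per attribute.
    return np.array([attribute_entropy(X[:, j]) for j in range(X.shape[1])])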
2 Zero-Shot Learning
ZSL is inspired by the way human beings are capable of identifying new classes
when a high-level description is provided. It consists of identifying new classes
without training examples, from a high-level description of the new classes
(unseen classes) that relate them to classes previously learned during the train-
ing (seen classes). This is done by learning the attributes as an intermediate
layer that provides semantic information about the classes to classify. ZSL has
two stages: the training or attribute learning stage in which knowledge about the
attributes is captured, and then the inference stage where this knowledge is used
to classify instances among a new set of classes. This approach has been widely
applied to images classification tasks [5–8]. There are different solutions for both
attribute learning and for the inference stage, which models the relationships
among features, attributes, and classes in images. Such a strategy makes it
difficult to apply ZSL to other problems. This, coupled with the need to identify
new attacks on computer networks, motivated us to develop this research.
In the inference stage the predicted attributes are combined to infer the classes.
There are basically three approaches [5] for this second stage: k–nearest neigh-
bour (k–NN), probabilistic frameworks [7] and energy function [5,6]. One inter-
esting technique is the cascaded probabilistic framework proposed in [7,10] where
the predicted attributes on the first stage are combined to determine the most
likely target class. It has two variants: the Direct Attribute Prediction (DAP),
in which during the training stage, for each attribute, a probabilistic classifier is
learned. Then, at inference stage the classifiers are used to infer new classes from
their attribute signatures. The other variant is the Indirect Attribute Predic-
tion (IAP), which learns a classifier for each training class and then combines
predictions of training and test classes. The third ZSL approach uses an energy
function [5,6], which acts as a penalty function: given data x and a category
vector v, the energy function E_W(x, v) = xᵀW v is trained, with the learned
matrix W making E_W(x, v) positive if x matches v and negative otherwise. In [11]
a framework is proposed to predict both seen and unseen classes. Unlike other
approaches this proposal not only classifies unknown classes but also classifies
known classes. In the attribute learning stage they consider the set of all classes
Y during training and testing. Some classes y are available as seen classes in
training data (Ys ) and the others are the Zero-Shot classes, without any train-
ing data, as unseen classes (Yu ). Then they define W = Ws ∪ Wu as the word
vectors for both, seen and unseen classes. All training images x(i) ∈ Xy of a seen
class y ∈ Ys are mapped to the word vector wy . Then, to train this mapping a
two-layer neural network to minimize the following objective function (Eq. 1) is
trained [11]:
J(\Theta) = \sum_{y \in Y_s} \sum_{x^{(i)} \in X_y} \left\| w_y - \theta^{(2)} f\big(\theta^{(1)} x^{(i)}\big) \right\|^2 \qquad (1)
Other recent and simple energy function based approach is the ESZSL pro-
posal [5]. It is based on [6] which models the relationships among features,
attributes, and classes. The authors assume that at training stage there are
z classes, each with a signature composed of a attributes. These signatures are
represented in a matrix S ∈ [0, 1]^{a×z}. This matrix contains, for each attribute,
a value in [0, 1] representing the relationship between that attribute and the
classes. However, how the matrix S is computed is not addressed. The instances
available at the training stage are denoted by X ∈ R^{d×m}, where d is the dimen-
sionality of the data, and m is the number of instances. The authors also compute the
matrix Y ∈ {−1, 1}^{m×z} to denote the class of each instance. During
the training they compute the matrix V ∈ R^{d×a} as in Eq. 2:
categories, thus reducing the number of classes to 5 [1]. Each instance repre-
sents a TCP/IP connection composed of 41 attributes, both quantitative and
qualitative. Although our proposal holds for the attribute learning stage, we pro-
pose herein a scheme to apply ZSL approach in NID tasks (see Fig. 1). Table 1
shows that classes such as spy and perl have 2 and 3 instances respectively while
classes as smurf and normal have 280,790 and 97,277 respectively. Then, con-
sidering the study in [1], we selected 12 of the original 41 attributes. Their
statistical description is shown in Table 2. We modified the dataset using
the five categories as classes. Later, for each category we selected two attacks
as Zero-Shot classes. Table 1 illustrates these values and the Zero-Shot classes.
This setup is very practical for the application at hand because we can classify
2 Current research uses a small portion representing 10 % of the original dataset,
containing 494,021 instances.
the Zero-Shot classes. In this case, new attacks can be classified in the categories
to which they belong to.
Fig. 2. (continued) (s)–(t) dst host srv count vs. dst host srv count';
(u)–(v) dst host same srv rate vs. dst host same srv rate';
(w)–(x) dst host same port rate vs. dst host same port rate'.
distributions per class for the learned attributes with our proposal than for the
original ones. The learned attribute duration' in Fig. 2(a) and (b) improves the
distribution at least w.r.t. one class, e.g. the NORMAL (N) one. The rest of
the attributes learned by ALNID achieve a higher separability than the origi-
nal ones in their distribution regarding the classes (see Fig. 2(c)–(x)). This new
representation of the learned attributes is expected to improve the k–NN classi-
fication during the inference stage. We also found a way to compute the signature
matrix S ∈ [0, 1]^{a×z} required by the energy function used in the inference
approach in [5]. The range of values of the learned attributes is discrete, which
can easily be integrated in the inference stage.
6 Conclusions
References
1. Rivero Pérez, P.L.: Técnicas de aprendizaje automático para la detección de intru-
sos en redes de computadoras. Revista Cubana de Ciencias Informáticas 8(4),
52–73 (2014)
2. Sangkatsanee, P., Wattanapongsakorn, N., Charnsripinyo, C.: Practical real-time
intrusion detection using machine learning approaches. Comput. Commun. 34(18),
2227–2235 (2011)
3. Tsai, C.F., Hsu, Y.F., Lin, C.Y., Lin, W.Y.: Intrusion detection by machine learn-
ing: a review. Expert Syst. Appl. 36(10), 11994–12000 (2009)
4. Sommer, R., Paxson, V.: Outside the closed world: on using machine learning for
network intrusion detection. In: 2010 IEEE Symposium on Security and Privacy
(SP), pp. 305–316. IEEE (2010)
5. Romera-Paredes, B., Torr, P.: An embarrassingly simple approach to zero-shot
learning. In: Proceedings of The 32nd International Conference on Machine Learn-
ing, pp. 2152–2161 (2015)
6. Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for attribute-
based classification. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 819–826 (2013)
7. Lampert, C.H., Nickisch, H., Harmeling, S.: Attribute-based classification for zero-
shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell. 36(3),
453–465 (2014)
8. Patterson, G., Xu, C., Su, H., Hays, J.: The sun attribute database: beyond cate-
gories for deeper scene understanding. Int. J. Comput. Vis. 108(1–2), 59–81 (2014)
9. Ferrari, V., Zisserman, A.: Learning visual attributes. In: NIPS, pp. 433–440 (2007)
10. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object
classes by between-class attribute transfer. In: IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2009, pp. 951–958. IEEE (2009)
11. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-
modal transfer. In: NIPS, pp. 935–943 (2013)
Sampling Methods in Genetic
Programming Learners from Large Datasets:
A Comparative Study
Hmida Hmida1,2(B), Sana Ben Hamida2, Amel Borgi1, and Marta Rukoz2
1 Université Tunis El Manar, LIPAH, Tunis, Tunisia
hhmida@gmail.com
2 Université Paris Dauphine, PSL Research University, CNRS, UMR 7243,
LAMSADE, 75016 Paris, France
Abstract. The amount of data available for data mining and knowledge
discovery continues to grow very fast in the era of Big Data. Genetic
Programming algorithms (GP), which are efficient machine learning tech-
niques, face a new challenge: dealing with the mass of provided data.
Active Sampling, already used for Active Learning, might be a good
solution to improve the training of Evolutionary Algorithms (EA) on very
big data sets. This paper presents a review of sampling techniques already
used with active GP learners and discusses their ability to improve GP
training on very big data sets. One method from each sampling strategy is
implemented and applied to the KDD intrusion detection problem using
very close parameters. Experimental results show that sampling methods
outperform results obtained with the full dataset, but some of them cannot
be scaled to large datasets.
1 Introduction
Genetic Programming (GP) [9] is an Evolutionary Algorithm (EA) considered
as a universal solver. This paradigm has shown its potential by reinventing
previously patented inventions and creating new patentable inventions.
Applied to supervised learning and classification problems, GP generates a
population of classifiers by means of genetic operators. The performance of each
classifier is measured by a fitness function that needs the evaluation (execution)
of every program against the complete training dataset. This leads to a com-
putational overhead that increases especially with large datasets and is propor-
tional to the product of the number of instances in the dataset, the population
size and the number of generations carried out during the evolution process.
Moreover, using the full learning data might be impossible when the input data
doesn't fit within main memory, which is often the case with Big Datasets.
The problem of reducing this calculation cost can be addressed mainly with
two approaches:
The main objective of the sampling methods studied in this paper is to reduce
the original training dataset size, substituting it with a much smaller represen-
tative subset, and hence to reduce the evaluation cost of the GP algorithm. Two
major classes of sampling techniques can be laid out: static sampling and dynamic
sampling. With static sampling, the learner obtains the whole input training set
at once, and it remains unmodified across the learning process. The training
subset is created by selecting a given number of patterns, independently of the
training process in which they will be used. Some methods use a unique subset
during all GP runs; others can assign a different subset to each run. For example,
Historical Subset Selection (HSS) [6] gathers instances misclassified by the best
individual at each generation of previous GP runs (using the full dataset).
Other well-known static sampling methods are the Bagging/Boosting tech-
niques. These were applied to GP in [8], not as a speed-up method but rather
to improve GP quality by evolving several subpopulations with different training
sets generated by sampling the base dataset with replacement.
Dynamic sampling methods are also referred to as active data selection methods,
as derived from Active Learning [2]. The underlying sampling techniques are
tightly related to the evolution of the learning process. Generally, selection
depends on a particular feature, such as unresolved cases or the number of
generations reached.
DSS needs three parameters to be tuned: the difficulty exponent (d), the age
exponent (a) and the target size (S). Gathercole [6] tried different subset and
population sizes on the thyroid classification problem, using a full training set
size of 3772. With 10000 individuals, GP with DSS achieves better results than
neural networks. DSS reduced evaluations by 20 % whilst obtaining similar
classifiers.
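A minimal sketch of the DSS selection step follows, assuming per-instance difficulty counts (recent misclassifications) and ages (generations since last selection); d, a and S follow the parameters named above, and the weights are assumed not to be all zero.

import numpy as np

def dss_select(difficulty, age, S, d=1.0, a=1.0, rng=None):
    rng = rng or np.random.default_rng()
    # Selection weight: difficulty^d + age^a, as in Gathercole's DSS.
    weights = difficulty.astype(float) ** d + age.astype(float) ** a
    idx = rng.choice(len(weights), size=S, replace=False,
                     p=weights / weights.sum())
    age += 1        # unselected cases grow older...
    age[idx] = 0    # ...selected ones are reset
    return idx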
from this partition based on RSS or DSS. Finally, at level 2, the selected block
is considered as the full dataset on which DSS was applied for several rounds.
Depending on the level 1 algorithm, we have RSS-DSS hierarchy or DSS-DSS
hierarchy. Besides DSS parameters (see Sect. 2.2), new parameters are added:
level 0 block size, level 1 number of iterations, level 2 iterations, maximum level
2 iterations and tournament iterations. Only level 2 iterations is calculated by:
Incremental Sampling. The training subset starts with a few cases and
acquires more cases at every generation to reach the full dataset size.
Zhang [16,17] proposes to perform a uniform data crossover simultaneously
with genetic program evolution. Data crossover means that when the crossover oper-
ation is applied to genetic programs to produce new programs, their subsets are
crossed to obtain new offspring subsets inducing data inheritance through gen-
erations. This helps to preserve the knowledge acquired by the parents. This
method called Incremental Data Inheritance IDI starts with n0 sized subsets,
and increases by λ at every generation with respect to an import rate depend-
ing on a third parameter ρ. In addition, Zhang uses an Adaptive Fitness
Function that varies from one individual to another [17]. This method was applied
to the evolution of collective behaviors for multiple robotic agents [16] and time
series prediction problem [17]. It was compared to the standard GP, GP with
Incremental Random Selection (IRS) and FRS.
Experimental evidence supports that evolving programs on an incrementally
selected subset of fitness cases can significantly reduce the fitness evaluation time
without sacrificing generalization accuracy of the evolved programs.
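A loose sketch of the data-inheritance step follows, with subsets held as lists of case indices; the import-rate handling via rho is simplified to a fixed growth of lambda fresh cases per offspring, which is an assumption of this sketch.

import random

def idi_offspring_subsets(parent_a, parent_b, n_cases, lam):
    # Cross the parents' subsets: pool their cases and split them in two,
    # so each child inherits data from both parents.
    pool = list(set(parent_a) | set(parent_b))
    random.shuffle(pool)
    half = len(pool) // 2
    child_a, child_b = pool[:half], pool[half:]
    # Growth step: each child imports lam fresh cases from the full dataset.
    fresh = random.sample(range(n_cases), 2 * lam)
    return child_a + fresh[:lam], child_b + fresh[lam:]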
Topology Based Sampling. Inspired by the idea that structure influences the
efficiency of heuristics working on it, this method consists of building a topology
of the problem from the knowledge acquired by individuals about fitness cases.
Lasarczyk [10] suggests a fitness-case Topology Based Sampling (TBS) in which
– Static Balanced Sampling: selects cases randomly from each class without
replacement until obtaining a balanced subset with equal number of majority
and minority class instances of the desired size at every generation.
– Basic Under-sampling (BUSS): selects all minority class instances and
then an equal number from the majority class randomly (see the sketch after
this list).
– Basic Over-sampling (BOSS): selects all majority class instances and then
an equal number from minority class randomly with replacement.
– Under-sampling A/B and Over-sampling A/B: Multiple balanced sam-
ples created at each generation with BUSS (respectively BOSS). Then, indi-
viduals are attributed the average fitness realized across all the sample sets
(and the minimum fitness for version B).
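A minimal sketch of BUSS, as defined in the list above, assuming integer class labels in y:

import numpy as np

def buss(X, y, minority_label, rng=None):
    rng = rng or np.random.default_rng()
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    # Keep all minority instances, draw an equal number of majority ones.
    drawn = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, drawn])
    return X[idx], y[idx]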
BRSS. This method is Balanced RSS, in which the subset generated by RSS
reflects the class frequencies of the original dataset. The number of patterns per
class is calculated with respect to the target size.
RSS-DSS. The RSS-DSS implemented here does not use the same fitness as
proposed by the authors; it uses the same fitness function as the other imple-
mented methods. The effect of this change will be discussed in Sect. 4.3. Two
configurations were tested, as shown in Table 4, changing the target size and
the RSS and DSS iterations simultaneously. RSS-DSS2 denotes the second
configuration.
IDI. Since we use a fixed subset growth increment, the full dataset size is not
reached at the last generation.
TBS. We calculated the threshold using simpler steps: we start with the lowest
edge weight as the initial threshold, to obtain the smallest possible subset, and
then the TBS selection/exclusion steps are applied using the current threshold.
When the subset size is too small, the new threshold value is the next edge
weight and the previously excluded patterns are added back to the candidates.
BUSS. KDD-99 has a normal class and 4 attack classes. The subset selection
uses the U2R attack class (52 patterns) as the minority class and then randomly
selects an equal number of patterns from each of the remaining 3 attack classes,
obtaining 208 attack patterns. In a first experiment, 52 normal patterns were
added, yielding a subset of 260 patterns. In the second experiment (BUSS2),
208 normal patterns were chosen, for a final subset of 416 patterns. Subsets are
renewed each generation.
Terminal and function sets. The terminal set includes the input variables, namely
the 41 KDD-99 features, plus 8 randomly generated constants. The function set
has 17 functions, comprising basic arithmetic, comparison and logical operators
(Table 2).
Parameter                        Value
Population size                  512 for RSS, DSS, BUSS, BRSS and RSS-DSS;
                                 128 for IDI, TBS and Full Subset
Subpopulations number            1
Number of generations            500 for RSS, DSS, BUSS and BRSS;
                                 100 for IDI, TBS;
                                 depending on the RSS iteration for RSS-DSS
CGP nodes                        300
Inputs/Outputs                   49/1
Tournament size                  4
Crossover/Mutation probability   0.9/0.04
Fitness                          Minimize classification errors
However, this finding does not hold for the best FPR values. Indeed, according
to the Accuracy and TPR metrics of the best solutions, the BUSS method
performed better than all the other sampling approaches, but it has the 2nd
worst FPR value. BUSS2, while using a larger subset (more normal cases than
BUSS), loses accuracy and TPR but achieves a much better FPR. With the
RSS-DSS2 settings, this method improves its performance but FPR remains
very high.
TBS, IDI and FSS have been tested with a reduced number of generations,
population size and target subset size due to the huge amount of time needed for
each generation. Nevertheless, TBS and IDI reached a comparable performance
level. All 3 methods have much longer run times. Moreover, both IDI and
TBS spent more time than using the full dataset, suggesting that the computing
time needed for sampling is very high compared with the evaluation time.
From Table 6, we can see that RSS-DSS in both settings finds the best indi-
vidual of the run in the very first generations. BUSS makes the fastest runs
because it uses the smallest subset size, but this result depends on the classifi-
cation problem itself and on the classes of the learning data.
5 Conclusion
The main purpose of this work is to study the ability of the best-known active
sampling techniques for GP learners to deal with very large data sets. These
techniques were implemented on the same framework and applied to a classifi-
cation problem on the KDD intrusion detection data set.
Three main conclusions can be deduced from the experimental results. First,
any active sampling technique is able to induce better generalizing classifiers
compared to the standard GP using the full data set. However, the applicability
of some methods is limited because of their very high computation time (such
as IDI and TBS) or because they need specific conditions to be efficient (such
as RSS-DSS). This is the second conclusion. Third, simple sampling techniques,
such as BUSS or RSS have very low computational cost and they are able to
reach a competitive performance according to the advanced techniques.
To deal with the run-time problem of IDI and TBS, mixing them with other
sampling methods in a hierarchical scheme is a promising solution. The fitness
function affects the quality of GP results, and this needs to be experimented
with on the sampling methods tested here.
References
1. CGP: Cartesian GP website. http://www.cartesiangp.co.uk
2. Cohn, D., Atlas, L., Ladner, R.: Improving generalization with active learning.
Mach. Learn. 15, 201–221 (1994)
3. Curry, R., Heywood, M.: Towards efficient training on large datasets for genetic
programming. In: Tawfik, A.Y., Goodwin, S.D. (eds.) AI 2004. LNCS (LNAI), vol.
3060, pp. 161–174. Springer, Heidelberg (2004). doi:10.1007/978-3-540-24840-8_12
4. Curry, R., Lichodzijewski, P., Heywood, M.I.: Scaling genetic programming to
large datasets using hierarchical dynamic subset selection. IEEE Trans. Syst. Man
Cybern. Part B 37(4), 1065–1073 (2007)
5. Gathercole, C.: An Investigation of Supervised Learning in Genetic Programming.
University of Edinburgh, Thesis (1998)
6. Gathercole, C., Ross, P.: Dynamic training subset selection for supervised learn-
ing in Genetic Programming. In: Davidor, Y., Schwefel, H.-P., Männer, R. (eds.)
PPSN 1994. LNCS, vol. 866, pp. 312–321. Springer, Heidelberg (1994).
doi:10.1007/3-540-58484-6_275
7. Hunt, R., Johnston, M., Browne, W., Zhang, M.: Sampling methods in genetic
programming for classification with unbalanced data. In: Li, J. (ed.) AI 2010.
LNCS (LNAI), vol. 6464, pp. 273–282. Springer, Heidelberg (2010).
doi:10.1007/978-3-642-17432-2_28
8. Iba, H.: Bagging, boosting, and bloating in genetic programming. In: The 1st
Annual Conference on Genetic and Evolutionary Computation, Proceedings of
GECCO 1999, vol. 2, pp. 1053–1060. Morgan Kaufmann, San Francisco (1999)
9. Koza, J.R.: Genetic programming: on the programming of computers by means of
natural selection. Stat. Comput. 4(2), 87–112 (1994)
10. Lasarczyk, C., Dittrich, P., Banzhaf, W.: Dynamic subset selection based on a
fitness case topology. Evol. Comput. 12(2), 223–242 (2004)
11. Luke, S.: ECJ homepage. http://cs.gmu.edu/~eclab/projects/ecj/
12. Miller, J.F., Thomson, P.: Cartesian genetic programming. In: Poli, R., Banzhaf, W.,
Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol.
1802, pp. 121–132. Springer, Heidelberg (2000). doi:10.1007/978-3-540-46239-2_9
13. Nordin, P., Banzhaf, W.: An on-line method to evolve behavior and to control
a miniature robot in real time with genetic programming. Adaptive Behav. 5(2),
107–140 (1997)
14. Teller, A., David, A.: Automatically choosing the number of fitness cases: the
rational allocation of trials. In: Genetic Programming 1997: Proceedings of the
Second Annual Conference, pp. 321–328. Morgan Kaufmann (1997)
15. UCI: Kdd cup (1999). http://kdd.ics.uci.edu/databases/kddcup99/
16. Zhang, B.-T., Cho, D.-Y.: Genetic programming with active data selection. In:
McKay, B., Yao, X., Newton, C.S., Kim, J.-H., Furuhashi, T. (eds.) SEAL 1998.
LNCS (LNAI), vol. 1585, pp. 146–153. Springer, Heidelberg (1999).
doi:10.1007/3-540-48873-1_20
17. Zhang, B.T., Joung, J.G.: Genetic programming with incremental data inheritance.
In: The Genetic and Evolutionary Computation Conference, Proceedings, vol. 2,
pp. 1217–1224. Morgan Kaufmann, Orlando (1999)
A Fast Deep Convolutional Neural Network
for Face Detection in Big Visual Data
1 Introduction
Face detection has been an active research area in the computer vision field for
more than two decades mainly due to the numerous applications that require
face detection as a first step (e.g., face verification [1]). The seminal work of Viola
and Jones [3] made real-time face detection possible and later inspired many
cascade-based methods. Thereafter, research in face detection made noteworthy
progress as a result of the availability of data in unconstrained capture condi-
tions, the development of publicly available benchmarks and the fast growth in
computational and processing power of modern computers.
The original Viola-Jones detector is fast to evaluate, yet fails in detecting
faces from different angles. This issue was initially addressed either by using one
classifier cascade for each specific facial view [7,8], or by using a decision tree
for pose estimation and the corresponding cascade to verify the detection [6].
However, these approaches require pose/orientation annotations while complex
cascade structures increase the computational cost. The main line of research in
this direction was based on the combination of robust descriptors with classifiers
[9,10]. Among the variants, a method named Headhunter [11] improved the
Fig. 1. An example of face detection in various poses and occlusions. The bounding
boxes and scores show the output of the trained CNN.
The recent resurgence of interest in deep neural networks owes a certain debt
to the availability of powerful GPUs which routinely speed up common oper-
ations such as large matrix computations. Deep convolutional neural networks
have wide applications in language processing, object classification and recom-
mendation systems. A deep network named AlexNet [15], which was trained on
ILSVRC 2012, rekindled interest in convolutional neural networks and outper-
formed all other methods used for large scale image classification. The R-CNN
method proposed in [16] generates category-independent region proposals and
uses a CNN to extract a feature vector from each region. Then it applies a set
of class-specific linear SVMs to recognize the object category. Recently, a face
detector called DDFD [2], showed that a CNN can detect faces in a wide range
of orientations using a single model.
In this study, a novel CNN for face detection is presented that extends the
work in [2]. It is shown that a properly trained CNN can outperform more
complex and computationally expensive architectures. This paper makes the
following contributions:
2 Proposed Method
The CNN was trained with positive examples extracted from the AFLW [17]
and the MTFL [21] datasets. The first consists of 21 K images with 24 K face
annotations while the second consists of 12 K face annotations. Both datasets
include real world images with expression, pose, gender, age and ethnicity vari-
ations. For AFLW we used the provided face rectangle annotations. For MTFL
1 https://github.com/danaitri/Face-detection-cnn.
The CNN was trained successively in a set of five different data sets. Let NT
be a collection of images that will serve as a pool of negative examples. Let D0
be the original training set, consisting of the original set of positive examples T0
and the set of negative examples N0 ⊂ NT:
D0 = T0 ∪ N0    (1)
Once the training process is complete, we run the network on the set of images
NT and we collect a new set of false positives F1, which is added to the original
set of negative examples N0 :
N1 = N0 ∪ F1 (2)
The set F1 is selected according to the network’s score. During each training
round, we sort the false positives according to their score and we select a pre-
defined number of examples. In order to maintain the same ratio of positive to
negative examples after each training round we increase the number of positive
examples proportionally. A new set of positive examples P1, extracted from
images containing faces, is added to the original set T0:
T1 = T0 ∪ P1    (3)
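A minimal sketch of this training scheme is given below; train_fn, mine_fn (returning (example, score) false positives) and sample_pos_fn are caller-supplied placeholders, since the actual training code is not part of the excerpt.

def training_rounds(T0, N0, negative_pool, rounds, fp_per_round,
                    train_fn, mine_fn, sample_pos_fn):
    positives, negatives = list(T0), list(N0)
    ratio = len(positives) / len(negatives)
    model = None
    for _ in range(rounds):
        model = train_fn(positives + negatives)
        # Sort false positives by network score, keep the strongest ones.
        fps = sorted(mine_fn(model, negative_pool),
                     key=lambda fp: fp[1], reverse=True)
        negatives += [fp[0] for fp in fps[:fp_per_round]]  # N_t+1 = N_t U F_t+1
        # Grow the positives proportionally to keep the ratio fixed.
        needed = int(ratio * len(negatives)) - len(positives)
        positives += sample_pos_fn(needed)                 # T_t+1 = T_t U P_t+1
    return model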
Fig. 3. The results of the proposed training methodology on the FDDB dataset for the
face detection CNN.
3 Experiments
We implemented the proposed face detector using the Caffe library [18]. The
output of the network shows the scores of the CNN for every 32 × 32 window
with a stride of 2 pixels in the original image. In order to detect faces smaller
or larger than 32 × 32 we scale up or down the original image respectively. We
apply a non-maximum suppression strategy according to which all bounding
boxes with a score lower than the score of the maximum window multiplied
by a constant factor are removed. The system was able to detect faces in the
FDDB dataset that were not annotated. These detections were removed, as they
would count as false positives and lead to deteriorated performance. During
deployment of the CNN, we add an extra average pooling layer of 3×3 to the final
output of the network. It has been verified that this layer reduces the number of
false positives. The heatmap produced by the CNN is smoothed by this extra
pooling layer.
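A minimal sketch of this score-based suppression rule follows; the candidate boxes and the constant FACTOR are assumed values chosen for illustration.

# Score-based suppression as described above: drop every candidate whose
# score is below the maximum window's score times a constant factor.
# Boxes are (x, y, w, h, score) tuples; FACTOR = 0.5 is an assumed value.
FACTOR = 0.5

def suppress(boxes):
    if not boxes:
        return []
    best = max(b[4] for b in boxes)
    return [b for b in boxes if b[4] >= best * FACTOR]

candidates = [(10, 12, 32, 32, 0.95), (40, 8, 32, 32, 0.30), (11, 13, 32, 32, 0.80)]
print(suppress(candidates))   # keeps the two boxes scoring >= 0.475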
In FDDB, faces are annotated with elliptical regions. As stated in [11], changing
the output format of detections to ellipses increases the overlap region between
the detections and the ground truth boxes. However, our detector achieves a
high recall rate without this conversion. Figures 4(a)-4(b) show the final results
on the FDDB dataset after circular training of the network on all the
aforementioned datasets described in Sect. 2.3. Our method achieves a recall rate
of 90 %, outperforming almost all recently published face detection methods.
Fig. 4. Comparison of different face detectors on the FDDB dataset: (a) against
deep architectures; (b) against other state-of-the-art approaches.
Fig. 5. Face detection examples in the FDDB dataset using the proposed CNN.
kernel size with the output and the input dimension of that layer. For example,
as shown in Table 1, the first layer has 3 × 3 × 3 × 24 = 648 weights. The complexity
of a deep CNN can be estimated by the number of FLOPS required for a single
frame. Table 2 summarizes this comparison. We compare the complexity of
our network with that of AlexNet, used in [2,23,25]. AlexNet requires 1.5
billion FLOPS whereas our model requires only 6 million FLOPS. As a result,
our model has 800 times fewer parameters and is 250 times faster than AlexNet.
This issue is very important during training but also during testing and
deployment. The proposed lightweight model can be easily deployed to smart
devices (e.g. smartphones, notepads, etc.) or robotic systems (e.g. drones) that
do not have expensive and energy consuming multiple GPUs installed. Addi-
tionally, the proposed approach proves that when we have to deal with a specific
task (i.e., face detection), even if it is very complex, we can design and train
smaller and efficient architectures that outperform deeper and larger networks
in performance and in execution time.
# parameters and # FLOPS of the two models:

           # parameters   # FLOPS
AlexNet    60 M           1.5 B
Ours       76 K           6 M
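The per-layer counts above can be reproduced with a back-of-the-envelope helper like the one below; counting two FLOPS per multiply-accumulate and the 32 × 32 output size are assumptions made for illustration.

# Weight and FLOP count of one convolutional layer; the first layer above
# has 3x3 kernels, 3 input and 24 output channels: 3*3*3*24 = 648 weights.
def conv_layer_cost(k, c_in, c_out, h_out, w_out):
    weights = k * k * c_in * c_out
    flops = 2 * weights * h_out * w_out   # one multiply-add = 2 FLOPS (assumed)
    return weights, flops

w, f = conv_layer_cost(k=3, c_in=3, c_out=24, h_out=32, w_out=32)
print(w, f)   # 648 weights for the first layer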
4 Conclusion
In this paper, we presented a novel deep convolutional neural network for the
task of face detection. Our experiments on the publicly available benchmark
FDDB show the success of our method. Our detector is able to detect faces in
a wide range of orientations and expressions. Our detector does not require any
extra modules usually used in deep learning methods, such as SVMs or bounding-
box regression. Our work extends the DDFD detector by using a lightweight
model that improves run time and training speed. Our model outperforms the
DDFD detector on the challenging FDDB dataset by a margin of 6 %. We
show that a properly trained smaller model is efficient and outperforms a more
complex and large network used for the same task. One of the future research
directions is to exploit the proposed approach in order to train a lightweight
deep model for face identification.
References
1. Kotropoulos, C., Tefas, A., Pitas, I.: Frontal face authentication using variants
of dynamic link matching based on mathematical morphology. In: Proceedings of
IEEE International Conference on Image Processing (ICIP 1998), Chicago, USA,
vol. 1, pp. 122–126, 4–7 October 1998
2. Farfade, S.S., Saberian, M., Li, L.-J.: Multi-view face detection using deep convo-
lutional neural networks. In: ICMR (2015)
3. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vis. 57,
137–154 (2004)
4. Chen, D., Ren, S., Wei, Y., Cao, X., Sun, J.: Joint cascade face detection and
alignment. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV
2014. LNCS, vol. 8694, pp. 109–122. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10599-4_8
5. Yang, B., Yan, J., Lei, Z., Li, S.: Aggregate channel features for multi-view face
detection. In: IEEE International Joint Conference on Biometrics (2014)
6. Jones, M., Viola, P.: Fast multi-view face detection. In: Proceedings of CVPR
(2003)
7. Wu, B., Ai, H., Huang, C., Lao, S.: Fast rotation invariant multi-view face detection
based on real adaboost. In: Proceedings of IEEE Automatic Face and Gesture
Recognition (2004)
8. Li, S.Z., Zhu, L., Zhang, Z.Q., Blake, A., Zhang, H.J., Shum, H.: Statistical learning
of multi-view face detection. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P.
(eds.) ECCV 2002. LNCS, vol. 2353, pp. 67–81. Springer, Heidelberg (2002). doi:10.1007/3-540-47979-1_5
9. Li, J., Zhang, Y.: Learning surf cascade for fast and accurate object detection. In:
CVPR (2013)
10. Jun, B., Choi, I., Kim, D.: Local transform features and hybridization for accurate
face and human detection. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1423–1436
(2013)
11. Mathias, M., Benenson, R., Pedersoli, M., Gool, L.: Face detection without bells
and whistles. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV
2014. LNCS, vol. 8692, pp. 720–735. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10593-2_47
12. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multi-
scale, deformable part model. In: Proceedings of CVPR (2008)
13. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.: Cascade object detection with
deformable part models. In: Computer Vision and Pattern Recognition (2010)
14. Ranjan, R., Patel, V.M., Chellappa, R.: A deep pyramid deformable part model
for face detection. In: International Conference on Biometrics Theory, Applications
and Systems (2015)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. In: Proceedings of NIPS (2012)
16. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accu-
rate object detection and semantic segmentation. In: Proceedings of CVPR (2014)
17. Köstinger, M., Wohlhart, P., Roth, P.M., Bischof, H.: Annotated facial landmarks
in the wild: a large-scale, real-world database for facial landmark localization.
In: Proceedings of IEEE International Workshop on Benchmarking Facial Image
Analysis Technologies (2011)
18. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadar-
rama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding.
arXiv preprint arXiv:1408.5093 (2014)
19. Jain, V., Learned-Miller, E.: FDDB: a benchmark for face detection in uncon-
strained settings. Technical Report UM-CS-2010-009, University of Massachusetts,
Amherst (2010)
20. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing
human-level performance on ImageNet classification. In: IEEE International Conference on Com-
puter Vision (ICCV) (2015)
21. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-
task learning. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV
2014. LNCS, vol. 8694, pp. 94–108. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10599-4_7
22. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward
neural networks. In: International Conference on Artificial Intelligence and Statis-
tics (2010)
23. Yang, S., Luo, P., Loy, C.C., Tang, X.: From facial parts responses to face detection:
a deep learning approach. In: IEEE International Conference on Computer Vision
(2015)
24. Yang, B., Yan, J., Lei, Z., Li, S.Z.: Convolutional channel features. In: IEEE Inter-
national Conference on Computer Vision (2015)
25. Ranjan, R., Patel, V.M., Chellappa, R.: HyperFace: A Deep Multi-task Learn-
ing Framework for Face Detection, Landmark Localization, Pose Estimation, and
Gender Recognition. arXiv:1603.01249 (2016)
Novel Automatic Filter-Class Feature Selection
for Machine Learning Regression
1 Introduction
More data is available now than ever before, as sensors become cheaper and are
installed everywhere. The "Internet of Things" and "Big Data" are terms
connected to the fact that the amount of data is increasing rapidly. Machine
learning regression attempts to learn the relation between parameters of a sys-
tem based on historical data. Implementing a successful machine learning algo-
rithm requires choosing a representation of the solution, selecting relevant input
features and setting parameters associated with the learning method [17].
2 Method
The performance of RDESF will be benchmarked against commonly used feature
selection algorithms described in the literature. The benchmarking will be based
on prediction error and computational speed.
Forward and Backward Selection. The forward and backward selection are
both wrapper-class feature selection algorithms. This means that the learning
algorithm for which features are selected is included in the feature selection
process. In forward selection the features are added one at a time. If the testing
error decreases when a feature is added, the feature is kept. In backward selection
the process is reversed where features are removed, and if the error increases,
the removed features are included again [11]. As with the NLPCA, the data
is randomly divided 50/50 into a training set and a test set. The error is the
corrected Akaike information criterion (AICc) [1]. The formula for AICc is:
AICc = n · ln(RMSE) + 2(k + 1) + 2k(k + 1)/(n − k − 1)    (1)
with n being the sample size, k the number of features and RMSE the root
mean square error.
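Equation (1) translates directly into code; the input values below are placeholders.

# Direct transcription of the AICc formula in Eq. (1).
import math

def aicc(rmse, n, k):
    """Corrected Akaike information criterion for n samples and k features."""
    return n * math.log(rmse) + 2 * (k + 1) + (2 * k * (k + 1)) / (n - k - 1)

print(aicc(rmse=0.42, n=100, k=5))   # placeholder inputs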
Shannon Entropy. The Shannon entropy [14] measures the information content of a feature. It is defined as:

H(X) = − Σ_{i} P(xi) log P(xi)    (2)

with P(xi) being the probability density function, X the feature and xi the
samples of that feature [14].
Granger Causality. A variable X "Granger-causes" Y if Y can be better pre-
dicted using the histories of both X and Y than it can using the history of Y
alone [3]. The Granger causality score is calculated by creating two linear regres-
sions; one that only contains the samples from Y and one that also includes the
samples from X. F-tests are used to reject the null hypothesis that X does not
Granger-cause Y. By using an F-distribution every input Xi can be scored as to
how much it causes Y.
Mutual Information (MI). The MI is a measure of the distance between the
distributions of two variables and hence the dependence of the variables [2].
Choosing features with high dependence to the output could indicate that the
feature is good for predicting the output. The MI of an input X to the output
Y is defined as [2]:
I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]    (3)
with p(x, y) being the joint probability density function and p(x) and p(y) being
marginal probability distribution functions of X and Y respectively.
Pearson. The Pearson correlation coefficient is a measure of the dependence
between two variables [13]. The Pearson correlation coefficient is defined as [13]:
ρX,Y = cov(X, Y) / (σX σY)    (4)
with cov being the covariance and σX and σY the standard deviations of X and
Y respectively.
Spearman. Spearman’s rank correlation coefficient is simply the Pearson cor-
relation coefficient applied to ranked data [13]. Ranking the data will be more
resistant to outliers [13], and does not assume the data is numerical. The ranking
used is the normal ordering of the input.
Artificial Neural Networks (ANNs) are universal approximators [6] and have
been applied successfully to regression problems for many years. Multilayer
Perceptrons (MLPs), also known as feedforward networks, are among the most
used ANNs. An MLP will be used to attempt to solve the regression problems.
The ANN uses the tanh activation function, 20 nodes in the hidden layer and
the RPROP training algorithm [10]. A rule of thumb states to use the average of
the number of input features and the number of outputs as the number of nodes
in the hidden layer. However, we found that fewer nodes result in better
generalization, and we settled on 20 hidden nodes. The network is trained
until an error of 0.001 % is achieved, with a maximum of 500 iterations. For
further information on the technique the reader is referred to [19]. The focus of
this paper is on feature selection, which is why we have chosen the MLP, which
may be the most basic, but a very efficient, type of ANN. A researcher's choice of
ANN comes down to personal preference, the type of problem, and also what
is popular at the time. Modern types of ANNs such as Deep Neural Networks have
feature selection functionality as well, but feature selection as a pre-processing
step will always decrease the learning complexity, regardless of the ANN type.
10-fold cross-validation is performed, and in each fold the network is trained
and tested 10 times to decrease the influence by the randomness in the ANN.
The average of those 10 repetitions is used for the comparison.
2.3 Statistics
The error of the prediction is calculated using the root mean square error
(RMSE) measurement. The RMSE is defined as [13, p. 497]:
RMSE = √( (1/n) Σ_{t=1}^{n} (ŷt − yt)² )    (5)
where ŷt is the predicted value and yt is the actual value. The number of pre-
dictions in the series is denoted n. RMSE punishes negative and positive errors
equally.
The feature selection methods will be compared against each other using a
Wilcoxon signed-rank test. Because the outputs from all methods are used in the
identically configured ANN, we assume the samples are dependent, and hence we
must use a paired test. Because we have a small sample size, we cannot make
any safe assumptions about the underlying distribution. For that reason a rank
test is necessary. The Wilcoxon signed-rank test works with any measurement type
and returns a p-value on the null hypothesis that the two sample populations
are identical. This also means that no specific error measurements or time
measurements will be presented. The Wilcoxon signed-rank test uses the
following test statistic, W:
W = Σ_{i=1}^{Nr} [sgn(x2,i − x1,i) · Ri]    (6)
where Nr is the number of pairs with equalities removed, Ri is the rank of pair
i and x1 and x2 are the pair samples.
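As a hedged illustration of this comparison procedure, the sketch below applies SciPy's implementation of the paired Wilcoxon signed-rank test to two placeholder per-fold error series (one per method); the numbers are not results from the paper.

# Paired Wilcoxon signed-rank test on per-fold RMSE values; the numbers
# below are placeholders, not measurements from the experiments.
from scipy.stats import wilcoxon

rmse_rdesf = [0.42, 0.39, 0.45, 0.41, 0.44, 0.40, 0.43, 0.38, 0.46, 0.41]
rmse_pca   = [0.50, 0.47, 0.52, 0.49, 0.51, 0.48, 0.53, 0.46, 0.54, 0.50]

# Null hypothesis: the two sample populations are identical.
stat, p_value = wilcoxon(rmse_rdesf, rmse_pca)
print(stat, p_value)   # reject the null at the 5 % level if p_value < 0.05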
2.4 Experiments
The feature selection algorithms will be tested against two problems. Both data
sets are publicly available and will function as benchmarks for future comparison
and experimentation. The problems are time-series problems and we assume the
parameters change naturally over time. For this reason it is necessary to add the
delayed input and output to the available input features. We did a grid search
on a reduced version of one of the data sets, and found that a delay of 12 h gives
the best performance. We assume this delay will perform equally well for the full
data set as well as the other data set. With the delay the first problem will have
a total of 285 input features and the second problem will have a total of 142
features. To perform the long term prediction the output is shifted accordingly.
Besides the prediction error the computational speed will also be measured.
This gives an indication of the implications of using the feature selection algo-
rithms. The time will also be measured 10 times in each cross fold to even out any
randomness. The timer is started before the feature selection and stopped after
the output from the ANN has been denormalized. There is no significant time
difference between short-term and long-term predictions, so only the measured
time from the short-term prediction will be used for comparison.
The data is first delayed and then divided into the crossfold bins. In each
crossfold bin the feature selection is performed on the training part followed by
a normalization of the entire data in the bin to prepare it for the ANN. After the
ANN training the same features are selected from the test part and the ANN is
run using this data.
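In outline, the per-bin procedure just described might look as follows; select_features and train_ann are hypothetical stand-ins for the compared selection algorithms and the MLP of Sect. 2, and the sketch makes simplifying assumptions about the normalisation step.

# Sketch of one cross-fold bin: selection on the training part, then
# normalisation, ANN training, and evaluation with the same selected features.
import numpy as np

def run_fold(X_train, y_train, X_test, y_test, select_features, train_ann):
    cols = select_features(X_train, y_train)       # selection on training part
    mu = X_train[:, cols].mean(axis=0)             # normalisation parameters
    sd = X_train[:, cols].std(axis=0) + 1e-12
    model = train_ann((X_train[:, cols] - mu) / sd, y_train)
    pred = model.predict((X_test[:, cols] - mu) / sd)  # same features on test
    return np.sqrt(np.mean((pred - y_test) ** 2))      # RMSE, Eq. (5)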
The results of the performance of predicting the indoor temperature and the
computational speed can be seen in Table 1. With 5 % significance the prediction
with RDESF outperforms backward selection, NLPCA and PCA on both short-
term and long-term. RDESF also outperforms using all available features on
short-term and with 10 % significance on long-term. Only forward selection is
able to get a better prediction than RDESF and only on short-term. On long-
term there is not enough statistical significance to make any statements. It was
expected that forward selection performs best because it includes the model in
the feature selection process. Interestingly, the backward selection also includes
the model in the selection process but does not have the same performance.
Table 1. p-values from the Wilcoxon signed-rank test for methods compared to RDESF
for the SML2010 data set
Results from the temperature prediction from the design reference year data
set can be seen in Table 2. The results do not change much between short term
Table 2. p-values from the Wilcoxon signed-rank test for methods compared to RDESF
for the reference year data set
and long term prediction, as seen in experiment 1. Our algorithm RDESF clearly
outperforms both NLPCA and PCA (with 1 % significance) when it comes to
the prediction error. There is no statistical significance for the performance of
RDESF against using all available input features or backward selection. Again,
RDESF is only outperformed by the forward selection.
Just like in experiment 1, we expected that forward selection would outper-
form RDESF, because the model is included in the feature selection process. In
experiment 1 we see that backward selection does not have the same superior
performance as forward selection. Forward and backward selection both have
advantages and disadvantages and it might be that the disadvantages of back-
ward selection are influencing the results. One could overcome the disadvantages
of both selection algorithms by using stepwise regression, which combines forward
and backward selection, but that will not be further investigated in this paper.
The computational speed of RDESF outperforms all other algorithms with
1 % significance just as in experiment 1. As expected using all input features
is also faster in this experiment. However, keep in mind that using all input
features will increase the computational time heavily for other types of neural
networks.
The filters we chose to include in RDESF are by no means final. Mutual informa-
tion (MI), Pearson and Spearman ranking are all measures of dependency. Their
effectiveness is based on the assumption that a dependence between an input fea-
ture and the output will equal a better prediction performance. The Shannon
entropy ranks higher those features that are unique with respect to their probability
density function. Selecting features with a high Shannon entropy score will include
features that are different from each other, thereby decreasing the amount of
similar information given to the prediction algorithm. The last included filter was
the Granger causality which investigates if histories of the input feature and the
output are better together. We believe that this mix of different types of filter
makes RDESF strong and a generic solution to automatic feature selection of the
filter-class. We will continue to investigate the performance of RDESF with other
filters.
The number of filters included in RDESF has a big influence. Too many
filters will decrease the influence of each individual filter. Because of the filter-wise
selection, all possible features can be selected if too many filters are included. On
the other hand, too few filters will mean the individual filters are too influential.
If the filters use similar measurements, the top 10 % selected by every filter will
be almost identical. The meta approach implies that a variety of included filters
with different types of measurements will result in better performance.
The union selection used as the distinct selection in RDESF was the obvious
choice for us. Other set-theoretic selections, such as intersection or even the
Cartesian product, should be investigated. Other, more elaborate possibilities
such as heuristics or voting systems should also be further investigated in future work.
Our initial goal was to beat the prediction performance of PCA. PCA
is included in MATLAB, which is often used in machine learning. PCA is often
the default feature selection method used, and it has clear advantages such as
reducing the dimension space and reversibility. It is a very positive result that
RDESF outperforms PCA. That RDESF outperforms or matches using all
input features is a good indication that RDESF will perform well for computa-
tionally heavy types of ANNs. Support Vector Regression is one of those types
of networks, and preliminary results show that RDESF performs equally well
for SVRs and RBFs. Testing RDESF with other types of ANNs and other
application areas is planned for further research.
Through careful implementation of the included filters, RDESF is very fast,
especially compared to the computationally intensive NLPCA and forward and
backward selection. We implemented RDESF using a mix of the Apache
Commons library and an optimized use of data structures in Java. The speed
can be further improved by the use of multi-threading, with every filter scoring
the features simultaneously.
4 Conclusion
A novel filter-class algorithm for automated selection of features has been pro-
posed. Our algorithm called Ranked Distinct Elitism Selection Filter (RDESF)
was tested in two experiments. Filter-class feature selection methods are model-
free and thereby fast. A big drawback of filters is the requirement of choosing a
threshold to select the features. This problem was overcome by creating a meta-
filter that selects the top 10 % of features for each included filter. To avoid duplicate
features a distinct selection was performed, in this case a union selection.
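A minimal sketch of this meta-filter follows; the two NumPy-based scores stand in for the five filters actually used, and the data are random placeholders.

# Minimal sketch of the RDESF meta-filter: each filter scores all features,
# the top 10% per filter are kept (elitism), and the union of the per-filter
# selections is returned (the distinct selection).
import numpy as np

def rdesf_select(X, y, filters, top_frac=0.10):
    """X: (n_samples, n_features), y: (n_samples,), filters: score functions."""
    n_keep = max(1, int(top_frac * X.shape[1]))
    selected = set()
    for score_fn in filters:
        scores = np.array([score_fn(X[:, j], y) for j in range(X.shape[1])])
        top = np.argsort(scores)[::-1][:n_keep]   # best-ranked 10% per filter
        selected |= set(top.tolist())             # union of distinct features
    return sorted(selected)

pearson = lambda x, y: abs(np.corrcoef(x, y)[0, 1])
spearman = lambda x, y: abs(np.corrcoef(np.argsort(np.argsort(x)),
                                        np.argsort(np.argsort(y)))[0, 1])

X, y = np.random.randn(200, 20), np.random.randn(200)
print(rdesf_select(X, y, [pearson, spearman]))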
The experiments in which RDESF was tested were time-series regression
problems of predicting a variable. RDESF was only outperformed by forward
selection, which was expected because the prediction model is included in
the forward selection process. All the other feature selection algorithms
(backward selection, NLPCA and PCA) were outperformed by RDESF.
In the first experiment RDESF even outperformed using all input features, which
indicates good performance in computationally heavy types of ANNs. The com-
putational speed of RDESF was only outperformed by using all input features,
which was also expected, since in that case no pre-calculations are required.
RDESF clearly outperformed PCA in both prediction error and computa-
tional speed, which was our benchmark baseline. Unlike PCA, RDESF allows
for feature analysis in fault detection scenarios because the features are not trans-
formed. This means that RDESF is a strong competitor in the feature selection
field, and applicable to a variety of application areas. The meta-filter approach
does not require the user to select a threshold for the filter which allows the user
to include filters into an automatic feature selection process or system.
References
1. Cavanaugh, J.E.: Unifying the derivations for the Akaike and corrected Akaike infor-
mation criteria. Stat. Probab. Lett. 33(2), 201–208 (1997)
2. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York
(1991)
3. Granger, C.: Some recent development in a concept of causality. J. Econometrics
39, 199–211 (1988)
4. Guttman, L.: Some necessary conditions for common-factor analysis. Psychome-
trika 19(2), 149–161 (1954)
5. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach.
Learn. Res. 3, 1157–1182 (2003)
6. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are uni-
versal approximators. Neural Netw. 2(5), 359–366 (1989)
7. Kramer, M.A.: Nonlinear principal component analysis using autoassociative
neural networks. AIChE J. 37(2), 233–243 (1991)
8. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/
ml
9. Nautical Almanac Office: Almanac for Computers 1990. U.S Government Printing
Office (1989)
10. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation
learning: the RPROP algorithm. In: IEEE International Conference on Neural
Networks, pp. 586–591. IEEE (1993)
11. May, R., Dandy, G., Maier, H.: Review of Input Variable Selection Methods for
Artificial Neural Networks. InTech, April 2011
12. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-
propagating errors. Cogn. Model. 5, 1 (1988)
13. Shanmugan, K.S., Breipohl, A.M.: Random Signals: Detection, Estimation, and
Data Analysis. Wiley, New York (1988)
14. Shannon, C.E.: A mathematical theory of communication. SIGMOBILE Mob.
Comput. Commun. Rev. 5(1), 3–55 (2001)
15. Shlens, J.: A tutorial on principal component analysis. arXiv preprint
arXiv:1404.1100 (2014)
16. Wang, P.G., Mikael, S., Nielsen, K.P., Wittchen, K.B., Kern-Hansen, C.: Reference
climate dataset for technical dimensioning in building, construction and other sec-
tors. DMI Technical reports (2013)
17. Whiteson, S., Stone, P., Stanley, K.O., Miikkulainen, R., Kohl, N.: Automatic
feature selection in neuroevolution. In: GECCO (2005)
18. Zamora-Martínez, F., Romeu, P., Botella-Rocamora, P., Pardo, J.: On-line learning
of indoor temperature forecasting models towards energy efficiency. Energy Build.
83, 162–172 (2014)
19. Zhang, G., Patuwo, B.E., Hu, M.Y.: Forecasting with artificial neural networks:
the state of the art. Int. J. Forecast. 14(1), 35–62 (1998)
Learning Using Multiple-Type Privileged
Information and SVM+ThinkTank
Abstract. In this paper, based on an extension to the standard Support Vector
Machine Plus (SVM+) model for the Learning Using Privileged Information
(LUPI) paradigm, a new SVM+ThinkTank (SVM+TT) model is proposed for
Learning Using Multiple-Type Privileged Information (LUMTPI). In cases where
Multiple-Type Privileged Information (MTPI) from different perspectives is
available for interpreting the training samples, such as in Big Data analytics,
it can be beneficial to leverage all these different types of Privileged Infor-
mation collectively, constructing multiple correcting spaces simultaneously for
training the maximum-margin separating hyperplane of the SVM model. In
fact, from a practical point of view, organising Privileged Information from
different perspectives might be an easier task than finding a single type of
perfect Privileged Information for the hardest training samples. The MTPI
collectively plays the role of a think tank, just as single-type Privileged Infor-
mation plays the role of a single master-class teacher. The preliminary
experimental results presented and analysed in this paper demonstrate that
SVM+TT, as a new learning instrument for the proposed LUMTPI paradigm, is
capable of improving the generalisation ability of the standard SVM+ when
learning from a small number of training samples by incorporating the
MTPI interpretations of these samples.
1 Introduction
In this section, the basic models of SVM [8] and SVM+ [1] for classification problems
are extended to devise a Multiple-Type Privileged Information (MTPI) based
SVM+ThinkTank for the case of a classical binary classification problem.
where there is one subset with y = +1 and another subset with y = −1. In order to
find the decision rule y = f(x) to separate the two subsets, SVM first maps the vector xi
of space X into vector zi of space Z, where it constructs the optimal separating
hyperplane by minimising the functional:
R(w, b, ξ) = (1/2)(w, w) + C Σ_{i=1}^{l} ξi    (2)

subject to the constraints:

yi[(w, zi) + b] ≥ 1 − ξi, i = 1, . . . , l, and ξi > 0,    (3)

where (w, w) determines the margin size of the separating hyperplane, ξi represents the deviation
of zi from the margin border, and C is the user-defined model regularisation parameter,
which balances the size of the margin against the number of acceptable non-separable training
sample vectors. The generalisation ability of the decision rule y = f(x), in terms of
prediction accuracy, can be tested with testing samples.
here x*1, . . . , x*t are the different types of privileged information, which are only available
during the training process, and there is one subset with y = +1 and another subset
with y = −1.
In order to leverage the MTPI for training the classifier to control the slack vari-
ables of all sample data in its decision space, multiple correcting spaces should be
constructed and used to assist the classifier to make a better distinction between a hard
example and an easy one in that decision space. Therefore, a new SVM+ThinkTank
(SVM+TT) model is proposed to incorporate one common decision space and multiple
correcting spaces together to train the classifier. By extending the objective function
(2), a generalised objective function will be:
R(w, w^1, . . . , w^t) = (1/2)(w, w) + (1/(2γ)) Σ_{r=1}^{t} (w^r, w^r) + (Cr/t) Σ_{r=1}^{t} Σ_{i=1}^{l} ξ_i^r,    (5)

where

ξ_i^r = (w^r, z_i^r) + b_r,    (6)

subject to the constraints:

yi[(w, zi) + b] ≥ 1 − ξ_i^r, i = 1, . . . , l, r = 1, . . . , t, ξ_i^r > 0.    (7)
Without loss of generality, let us consider two types (i.e., t = 2) of Privileged Infor-
mation, x* and x**, in the problem setup of binary classification on two finite subsets of
vector x from the training set, where the Privileged Information is only available
during the training process and there is one subset with y = +1 and another subset
with y = −1.

The proposed SVM+TT first maps the vector xi of space X into vector zi of space Z,
x*_i into vector z*_i in Z* space, and x**_i into vector z**_i in Z** space, where it
constructs the optimal separating hyperplane by minimising the functional:

R(w, w*, w**, b, b*, b**) = (1/2)[(w, w) + γ((w*, w*) + (w**, w**))] + (C1/2) Σ_{i=1}^{l} ξi + (C2/2) Σ_{i=1}^{l} ξ*_i    (9)

subject to the constraints:

yi[(w, zi) + b] ≥ 1 − ξi,
yi[(w, zi) + b] ≥ 1 − ξ*_i,    (12)
i = 1, . . . , l, ξi > 0, ξ*_i > 0.
Maximising the corresponding dual functional:

R(α, β, π) = Σ_{i=1}^{l} αi − (1/2) Σ_{i,j=1}^{l} αi αj yi yj K(xi, xj)
    − (1/(2γ)) [ Σ_{i,j=1}^{l} (αi + βi − C1)(αj + βj − C1) K*(x*_i, x*_j)
    + Σ_{i,j=1}^{l} (αi + πi − C2)(αj + πj − C2) K**(x**_i, x**_j) ],    (13)
in which there are two model regularisation parameters C1 and C2 applied to the two
correcting spaces respectively and the maximisation is subject to constraints:
Σ_{i=1}^{l} yi αi = 0,  Σ_{i=1}^{l} (αi + βi − C1) = 0,  Σ_{i=1}^{l} (αi + πi − C2) = 0,    (14)

i = 1, . . . , l, and αi ≥ 0, βi ≥ 0, πi ≥ 0.
The resulting decision function is:

f(x) = Σ_{i=1}^{l} yi αi K(xi, x) + b    (15)
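Since the Privileged Information enters only the training problem, deployment reduces to evaluating Eq. (15). The sketch below does so with an RBF kernel; the support vectors, multipliers and bias are placeholders that a trained model would supply.

# Evaluating the decision rule of Eq. (15); all model values are placeholders.
import numpy as np

def rbf(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def decision(x, xs, ys, alphas, b):
    return sum(y * a * rbf(xi, x) for xi, y, a in zip(xs, ys, alphas)) + b

xs = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]   # support vectors
ys, alphas, b = [+1, -1], [0.7, 0.7], 0.1           # labels, multipliers, bias
print(np.sign(decision(np.array([0.2, 0.9]), xs, ys, alphas, b)))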
3 Experiments
use of kernels in the correcting spaces, there are four further cases, combining
the two different kernels (RBF or polynomial).

SVM+TT: Experiments with both type I and type II privileged information together,
i.e., training with the non-privileged features and the type I and type II
privileged features together, but testing with the non-privileged features
only. In this scenario, regarding the combinations of the use of privileged
features and the use of kernels in the correcting spaces, there are four different
cases, combining the two different kernels (RBF or polynomial).
The model selection uses 5-fold cross-validation and the final prediction accuracy
of the selected model is calculated as the average accuracy over 5 runs.
and original feature spaces belong to the same data modality. In future work, information
from different data modalities, such as narratives extracted and featurised from
diagnosis reports, could be explored and used as the Privileged Information to validate and
demonstrate possible further generalisation performance improvements.
4 Related Work
5 Conclusions
References
1. Vapnik, V., Izmailov, R.: Learning using privileged information: similarity control and
knowledge transfer. J. Mach. Learn. Res. 16, 2023–2049 (2015)
2. Vapnik, V., Izmailov, R.: Learning with intelligent teacher. In: Gammerman, A., Luo, Z.,
Vega, J., Vovk, V. (eds.) COPA 2016. LNCS, vol. 9653, pp. 3–19. Springer, Heidelberg
(2016). doi:10.1007/978-3-319-33395-3_1
3. Wang, S., Yang, J.: Learning to transfer privileged ranking attribute for object classification.
J. Inf. Comput. Sci. 12(1), 367–380 (2015)
4. Liu, J., Zhu, W., Zhong, P.: A new multi-class support vector algorithm based on privileged
information. J. Inf. Comput. Sci. 10(2), 443–450 (2013)
5. Sharmanska, V., Quadrianto, N., Lampert, C.: Learning to Transfer Privileged Information,
CoRR, October 2014
6. Sharmanska, V., Quadrianto, N., Lampert, C.: Learning to rank using privileged information.
In: Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013),
1–8 December 2013
7. Serra-Toro, C., Traver, V., Pla, F.: Exploring some practical issues of SVM+: is really
privileged information that helps? Pattern Recogn. Lett. 42, 40–46 (2014)
8. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
9. Cai, F.: Advanced Learning Approaches Based on SVM+ Methodology. Ph.D. thesis,
University of Minnesota, July 2011
10. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer
11. Liang, L., Cherkassky, V.: Connection between SVM+ and multi-task learning. In:
Proceedings of the IJCNN (2008)
12. Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: Proceedings of the 17th
SIGKDD Conference on Knowledge Discovery and Data Mining (2004)
13. Ji, Y., Sun, S., Lu, Y.: Multitask multiclass privileged information support vector machines.
In: Proceedings of the ICPR, pp. 2323–2326. IEEE (2012)
14. Lapin, M., Hein, M., Schiele, B.: Learning using privileged information: SVM+ and
weighted SVM. Neural Netw. (2014)
15. Zhang, Y., Zhang, L., Hossain, A.: Adaptive 3D facial action intensity estimation and
emotion recognition. Expert Syst. Appl. 42(3), 1446–1464 (2015)
16. Zhang, Y., Zhang, L., Neoh, S.C., Mistry, K., Hossain, A.: Intelligent affect regression for
bodily expressions using hybrid particle swarm optimization and adaptive ensembles. Expert
Syst. Appl. 42(22), 8678–8697 (2015)
Learning Symbols by Neural Network
1 Introduction
When robots act autonomously in a given environment, they need an intel-
ligent information processing mechanism, such as learning patterns from the
surrounding environment. Reasoning based on symbols is a basic component of
intelligent information processing, and symbols can be used in subsequent intel-
ligent information processing. A symbol is a kind of pattern generated by
abstracting signals from the outside world, but it differs in that a symbol can be
used individually or in combination with other symbols depending on the situation.

Since the early 2000s, symbol learning for robots has been studied. Inamura [1] pro-
posed a model of stochastic behavior recognition and symbol learning. Kadone
[2] proposed a symbolic representation learning model for a humanoid robot. In
these studies, symbols and the relations between symbols are learned based on co-
occurrences of actions. In semiotics, many theories about symbols have also been
proposed. Chandler [3] proposed a model of the double articulation of
semiotic codes. Each semiotic code belongs to one of the classes: single articulation,
double articulation or no articulation. Double articulation enables a semi-
otic code to form an infinite number of meaningful combinations using a small
number of low-level elements.

In all of these positions, the common point is that a symbol is not only an
abstracted pattern, but can also have mutual relationships with other symbols. We proposed
Hopfield [7] showed that an associative memory has attractors and reaches
equilibrium points if its weight matrix is symmetric and its activation
function is a monotonically increasing function. Kadone [2] theoretically analyzed
the relations between patterns and attractors in Hopfield-type associative memory.
He concluded that some learned patterns acquire attractors if the correlation between
a presented pattern and a learned pattern is high. On the other hand, the associa-
tive memory cannot acquire any attractor and equilibrates at an averaged pattern
of the learned patterns if the correlation between an unlearned pattern and the learned
patterns is low. This result shows that retrievals reach different statuses, depend-
ing on the correlations between patterns. Therefore, we can discriminate unlearned
parts in an input from learned patterns for a neural network by monitoring the
retrieval process of an associative memory. We analyze the equilibrium point of
each neuron in an associative memory to find unused neurons by analyzing the chaotic
oscillations of the associative memory.
Aihara [8] proposed the Chaos Neural Network (CNN), which shows chaotic retrieval
dynamics in addition to the normal dynamics of associative memory. The chaotic
ηi(t + 1) = kn ηi(t) + Σ_{j=1}^{N} wij xj(t),    (3)
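Equation (3) is the internal-state update of a chaos neuron. A direct transcription follows; kn and the weight matrix are placeholder values chosen for illustration.

# Direct transcription of Eq. (3); k_n and the weights are placeholders.
import numpy as np

def eta_update(eta, W, x, k_n=0.5):
    """eta: (N,) internal states, W: (N, N) weights, x: (N,) neuron outputs."""
    return k_n * eta + W @ x

N = 4
eta = np.zeros(N)
W = np.random.randn(N, N); W = (W + W.T) / 2   # symmetric, as in Hopfield nets
x = np.sign(np.random.randn(N))
print(eta_update(eta, W, x))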
is a cyclic permutation σ = (i11, · · · , i1m1, · · · , ik1, · · · , ikmk). σk is cluster
k (k = 1, · · · , n) and mk is an index of elements in cluster k. For any σ ∈ Uσ, the
orthogonal complement Uσ⊥ is invariant to Fa,ε(x). The dynamics of each cluster is
measured by the transversal Lyapunov exponent, i.e., the Lyapunov exponent from Uσ
to Uσ⊥. Transversal Lyapunov exponents depend on the eigenvalue of Fa,ε(x) on Uσ⊥,
which is (1 − ε) f′a(xij1), (j = 1, · · · , k).
Let uμ be the status of neuron i for a learned pattern μ, and uν the status of neuron
i for an unlearned pattern ν. Further, uν,a is the part of uν that has high cor-
relation with uμ and uν,c is the part of uν that has low correlation with uμ.
For uν,a and uν,c, we define the clusters σa and σc respectively. The eigen-
value of Uσa is F′(xi) ua = (1 − ε) f′a(xi) ua, and the eigenvalue of Uσc is
F′a,ε(xi) uc = (1 − ε) f′a(xi) uc. From the definition of uν,a and uν,c, the former reaches a
stable equilibrium point on Uσa, but that is not always the case for uν,c on Uσc. There-
fore, the eigenvalue of Uσa shows a stable status but the eigenvalue of Uσc does
not.
In the retrieval process of the CNN, highly correlated neurons retract to other neurons,
while weakly correlated neurons do not. As a result, the weakly correlated elements show
isolated behavior. Therefore, the role of a neuron in recognizing patterns is determined
through the status of each neuron based on the retrieval process of the CNN.
When VSF−Network uses its learning results, not all learned patterns are
required for the recognition of a pattern. For example, if VSF−Network has learned
patterns μ and ν, then it should return an output for pattern μ when pattern μ
is presented as input. In the same way, it returns an output for pattern ν if
pattern ν is presented. It should return an output for the combined pattern if
patterns μ and ν are both presented in an input. The use of suitable patterns
depending on the situation is a basic requirement for patterns to act as symbols.
The convolutional neural network [11] learns partial patterns in its middle layers
and recognizes more complex patterns by combining these partial patterns. In
developing VSF−Network, we adopt this technique to recognize patterns by a
combination of learned patterns.
Class    Sub-network
         f1       f2     · · ·   fn
cμ       uμ∧ν     u*     · · ·   u*
cν       u*       uν     · · ·   u*
cμ∧ν     uμ       uν     · · ·   u*
· · ·    · · ·    · · ·  · · ·   · · ·
space spanned by a hierarchical neural network, and the forward pass between
the input layer and the middle layer is a mapping f : U → U. For a learned
hierarchical neural network, we can define an invariant space Λ for f. For Λ
and its open neighborhood T, there are attractors and trapping areas that sat-
isfy the conditions f(T̄) ∈ T and ∩_{i=0}^{∞} f^i(T̄) = Λ. Furthermore, there are basins
B(T) = {x ∈ U : f^M(x) ∈ T} of the attractors for a large enough number
M > 0. The basins exist on each sub-space Ui (i = 1, . . . , p) spanned by the learned
sub-networks fi (i = 1, . . . , p). Let Bi (i = 1, . . . , p) be a basin on Ui. If each
sub-network fi is an independent hierarchical neural network, each sub-space Ui
is orthogonal to the others and

U = U1 ⊕ · · · ⊕ Up.
In (7), η is a coefficient for the update and Hjμ is the output from the jth neuron in the
middle layer. λi is the synchronous rate calculated with (9) and P is a threshold
for λi. corki,kj is the correlation between neurons i and j in the middle layer. Our
weight updating scheme is based on the correlations between patterns; therefore
we apply the correlations of the weights between the input layer and the middle
layer. These have the effect of amplifying the results from the CNN-module.
The update value ΔWjk for the weight between the j-th neuron in the middle
layer and the k-th neuron in the output layer is defined as

ΔWjk = η ∂Ejk/∂Wjk   if λi ≤ P,
ΔWjk = 0             if λi > P,
with ∂Ejk/∂Wjk = Σ_{j=1}^{n} E^μ f(Y_k^μ).    (8)

In (8), E^μ is the error between the current output Ŷ^μ and the expected output Y^μ.
The synchronous rate λi is calculated by correlation integration (9) based on
Heaviside function H. In (9), r is a threshold for a difference xi − xj .
λi = (1/T²) Σ_{j=1, j≠i}^{N} H(r − |xi − xj|)    (9)
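A direct transcription of Eq. (9) follows; the neuron outputs, the threshold r and the normalisation constant T are placeholder values.

# Synchronous rate of neuron i per Eq. (9): the Heaviside-counted fraction
# of neurons j (j != i) whose output differs from x_i by less than r.
import numpy as np

def synchronous_rate(x, i, r, T):
    diffs = np.abs(x[i] - np.delete(x, i))   # |x_i - x_j| for j != i
    return np.sum(diffs < r) / (T ** 2)      # H(r - |x_i - x_j|) summed

x = np.array([0.10, 0.12, 0.80, 0.11, 0.75])
print(synchronous_rate(x, i=0, r=0.05, T=len(x)))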
values are shown. Each value is chosen so that the CNN−module shows weakly
chaotic behavior. The other parameters and values for the BP−module in the experiment
are shown in Table 3.

The initial weights for the BP−module are obtained from another trained hierarchi-
cal neural network. That hierarchical neural network was trained for 1500 epochs
and its topology is the same as that of the BP−module. Each epoch consists of 2000
training iterations and the data are randomly selected from the data set of condition 1.
At the incremental learning step, the numbers of epochs and training iterations are
the same as in the first step. The input data for this step are randomly selected from the
data set of condition 2. To examine the ability of VSF−Network to recognize
combined-type patterns, we provide data randomly selected from the data set of
condition 3. Data of condition 1 and condition 2 are also provided at this step
to examine the ability of VSF−Network for incremental learning. In this step,
we compare the MSE (Mean Squared Error) at each epoch for each data set.
In Fig. 3, we show the changes of the MSE as incremental learning progresses.
For incremental learning, the effect of VSF−Network is obvious:
VSF−Network incrementally learns unlearned patterns and the weights for the
condition 1 are not destroyed. The combined patterns can be recognized without
explicit training as incremental learning progresses. After a certain number of learn-
ing steps, the MSE for each condition reaches an equilibrium status, and learning
stops when there are no redundant neurons, because VSF−Network learns
incrementally by reusing redundant neurons.
6 Conclusion
In this paper, we have presented the theoretical background of VSF−Network. Through
the experiment, VSF−Network showed good performance in incremental learning.
The combined patterns were also recognized well without any training on the combined
pattern.

We now have three research tasks for VSF−Network. The first task is to
expand the kinds of experimental tasks in order to examine additional abilities of VSF−Network.
The second task is the application of deep learning schemes to VSF−Network.
The current BP−module has 3 layers; we expect an improvement of its ability
by introducing a deep learning scheme. The third task is the refinement of the rela-
tions between patterns learned by VSF−Network. Several related studies on
symbol learning have shown the learning of high-level relations between symbols, such
as hierarchies of symbols. For VSF−Network, such high-level relations can be
introduced by a more detailed analysis of the dynamics of the CNN−module.
References
1. Inamura, T., Tanie, H., Nakamura, Y.: Proto-symbol development and manipula-
tion in the geometry of stochastic model for motion generation and recognition.
Technical Report NC2003-65, IEICE (2003)
2. Kadone, H., Nakamura, Y.: Symbolic memory of motion patterns using hierarchical
bifurcations of attractors in an associative memory model. J. Robot Soc. Jpn. 25,
249–258 (2007)
3. Chandler, D.: Semiotics for Beginners. Routledge, London (1995)
4. Kakemoto, Y., Nakasuka, S.: Neural assembly generation by selective connection
weight updating. In: Proceedings of IJCNN 2010 (2010)
A CPM-Based Change Detection Test for Big Data
Abstract. Big Data analytics nowadays represents one of the most rel-
evant and promising research activities in the field of Big Data. Tools
and solutions designed for this purpose are meant to analyse very large
sets of data to extract relevant and valuable information. Along this path, this
paper addresses the problem of sequentially analysing big streams of data
to inspect for changes. This problem, which has been extensively studied
for scalar or multivariate datastreams, has been mostly left unattended
in the Big Data scenario. More specifically, the aim of this paper is to
introduce a change detection test able to detect changes in datastreams
characterized by very large dimensions (up to 1000). The proposed test,
based on a change-point method, is non-parametric (in the sense that it
does not require any a priori information about the system under inspec-
tion or the possible changes) and is designed to detect changes in the
mean vector of the datastreams. The effectiveness and the efficiency of
the proposed change detection test have been evaluated on both synthetic
and real datasets.
1 Introduction
In recent years the pervasive dissemination of sensors and devices (e.g., cyber-
physical systems, Internet-of-things technology and mobile devices) led to the
availability of a tremendous and ever-growing amount of collected data [13]. Such
a Big Data revolution made commonly-used techniques for data acquisition,
managing and processing inadequate and led to the design of novel tools and
solutions able to deal with such an enormous amount of data.
Among the wide range of technological and scientific issues related to Big
Data, the ability to make sense out of such a huge amount of data and exploit its
value is increasingly gaining importance. This is where Big Data Analytics come
into play, where novel techniques and mechanisms are designed to analyse (pos-
sibly very) large amounts of data to extract relevant information [5,7,15]. Often
such large amounts of data arrive in a stream manner (e.g., collected by large-
scale IoT systems, sensor networks or acquired through on-line systems/services),
hence requiring a sequential analysis to inspect for events/information of interest.
The aim of this paper is to address the problem of detecting changes in
Big Data scenarios, where massive amounts of data steadily arrive over time.
The detection of changes affecting streams of Big Data is extremely important
since it could reveal critical situations, such as ageing effects and faults affecting
the sensing/control apparatus, anomalous events/behaviours, or an unforeseen evo-
lution of the system under inspection. Hence, the prompt detection of such changes
is essential for undertaking suitable countermeasures like repairing/replacing a
sensor, raising an alarm, or activating adaptation mechanisms. While the prob-
lem of detecting changes in scalar datastreams has been extensively explored
(e.g., see [3]), very few works have been proposed for multivariate datastreams
(e.g., see [11]) and, to the best of our knowledge, the problem of sequen-
tially detecting changes in Big Data has never been addressed in the literature.
A review of related literature is presented in Sect. 2.
The aim of this paper is to propose a Change Detection Test (CDT) able
to detect changes in Big Data. The proposed test is non-parametric, hence
not requiring any a-priori information about the data-generating process or the
change to operate, and able to manage datastreams characterized by very large
dimensions (up to 1000). More specifically, the proposed CDT extends a multi-
variate Change-Point Method (CPM) [14], i.e., a theoretically-grounded statis-
tical hypothesis test able to verify and locate the presence of a change-point in
a fixed-length sequence of data, by considering sliding windows of data and a
modification of Hotelling's T² statistic to detect changes in the mean vector
of the datastream.
The proposed CDT for Big Data has been validated on both synthetic and
real datasets characterized by large dimensions and made available to the scien-
tific community as a MATLAB Demo Toolbox1 . In addition, the computational
times of the proposed CDT has been evaluated for different dimensions of the
datastreams.
The paper is organized as follows. Section 2 critically reviews the related
literature, while Sect. 3 formulates the change-detection problem. The proposed
CDT is detailed in Sect. 4. The experimental analysis is shown in Sect. 5, while
conclusions are finally drawn in Sect. 6.
2 Related Literature
CDTs are statistical techniques aiming at assessing the stationarity of a data-gen-
erating process over time. CDTs typically operate by inspecting independent
and identically distributed (i.i.d.) features extracted from the considered data
stream (e.g., the empirical error of the model, the sample mean or variance),
building an initial hypothesis and checking the validity of such an assumption
over time. Differently from statistical hypothesis tests or CPMs that require a
fixed dataset to operate, CDTs are meant to sequentially analyse streams of data.
While the majority of existing CDTs have been designed for univariate distrib-
utions [3], several multivariate solutions have been presented in the literature.
These multivariate solutions can be divided into three main families according to
¹ The MATLAB Demo Toolbox of the proposed CDT can be found at the following url: http://roveri.faculty.polimi.it/software-and-datasets/.
3 Problem Formulation
the occurrence of changes in X, while not introducing false positive detections (i.e.,
detections before t0) or false negative detections (i.e., changes that are not detected
by the CDT).
Aτ = {W(i), i = 1, . . . , τ},
Bτ = {W(i), i = τ + 1, . . . , 2γp},

with γp + 1 ≤ τ ≤ 2γp, and compute the modified Hotelling statistic T²τ for
Aτ, Bτ as follows:

T²τ = (Yτ)ᵀ (S⁻¹) (Yτ),    (2)

where

Yτ = [τ(2γp − τ)/2γp]^(1/2) (X̄1,τ − X̄τ+1,2γp)    (3)

and X̄1,τ, X̄τ+1,2γp, S⁻¹ are the sample mean vector of Aτ, the sample mean
vector of Bτ and the inverse of the sample covariance matrix estimated on W1, respectively.
We emphasize that, differently from the traditional Hotelling statistic [14], in our
case S does not depend on τ . Hence it can be computed only once during the
initial configuration of the CDT and it is not recomputed during the operational
life, thus reducing the computational complexity of the proposed CDT. With
this modification, we are implicitly assuming that the covariance of W1 and W2
are equal both before and after the change: an assumption that does not limit
the performance of our CDT since it focuses on detecting changes in the mean
vector.
The CDT computes the values of T²τ for τ = {γp + 1, . . . , 2γp}. Hence, T²τ is
computed only on W2ᵗ, since there is no need to analyse W1, which is assumed to be
stationary.
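A minimal NumPy sketch of Eqs. (2)-(3) follows; S is inverted once on the training window W1 and reused for every τ, and all data and dimensions here are random placeholders.

# Modified Hotelling statistic of Eqs. (2)-(3); S^-1 is fixed, estimated on W1.
import numpy as np

def t2_statistics(W, S_inv):
    """W: (2*gamma*p, p) window of observations; returns the T^2_tau values."""
    n = W.shape[0]                        # n = 2 * gamma * p
    gamma_p = n // 2
    out = []
    for tau in range(gamma_p + 1, n):     # tau = gamma*p + 1, ..., 2*gamma*p - 1
        A, B = W[:tau], W[tau:]
        y = np.sqrt(tau * (n - tau) / n) * (A.mean(axis=0) - B.mean(axis=0))
        out.append(y @ S_inv @ y)         # Eq. (2)
    return np.array(out)

p, gamma = 10, 1.5
gamma_p = int(gamma * p)                              # = 15
W1 = np.random.randn(gamma_p, p)                      # training window
S_inv = np.linalg.inv(np.cov(W1, rowvar=False))
W = np.random.randn(2 * gamma_p, p)                   # current window
print(t2_statistics(W, S_inv).max())   # compare against threshold h_{p,gamma,alpha}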
5 Experimental Results
The aim of this section is to analyse the change detection ability of the pro-
posed CDT in both small and large-dimension datastreams. In more detail, the
effectiveness of the proposed CDT has been contrasted with the Rank-based
procedure and evaluated on very-large normally-distributed synthetic datasets
(with p = 100, 300, 500 and 1000) and a real dataset (with p = 300).
To measure the ability to promptly and correctly detect changes in large-
dimension datastreams, we consider the following three figures of merits:
1. False positive (FP): it measures the times a change is detected, while it is not
(percentage).
2. False negative (FN): it measures the times a change is not detected, while it
is present (percentage).
3. Detection Delay (DD): it measures the time delay between when the change
occurred and when it is detected (in number of samples).
FP, FN and DD are averaged over 500 iterations. In our experimental analysis
we set γ = 1.5 and considered the following change model for the synthetic
experiments:
x(t) ∼ N(μ, Σ)        if t < 2γp + 100,
x(t) ∼ N(μ, Σ) + Δ    if t ≥ 2γp + 100,    (5)
where μ ∈ Rp and Σ ∈ Rp×p are randomly generated at each iteration and
Δ ∈ Rp is the additive perturbation inserted at time t0 = 2γp + 100 to be
detected. The length of the training sequence W1 is γp samples.
1. Δ1 = [−25, 0, . . . , 0]: modelling the case where only one dimension is affected
by a change;
2. Δ2 = [−25, −25, . . . , −25]: modelling the case where all the dimensions are
affected by a change.
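For reference, a synthetic stream following Eq. (5) with the Δ1 perturbation can be generated along the following lines; μ, Σ and the stream length are random or assumed placeholders, not the exact generator used in the paper.

# Stream following the change model of Eq. (5), with the Delta_1 perturbation
# (one affected dimension); mu and Sigma are randomly generated placeholders.
import numpy as np

def make_stream(p, gamma=1.5, delta_value=-25.0, extra=200):
    t0 = int(2 * gamma * p) + 100                  # change time of Eq. (5)
    mu = np.random.randn(p)
    A = np.random.randn(p, p)
    Sigma = A @ A.T + p * np.eye(p)                # random SPD covariance
    X = np.random.multivariate_normal(mu, Sigma, size=t0 + extra)
    Delta = np.zeros(p)
    Delta[0] = delta_value                         # Delta_1: one dimension only
    X[t0:] += Delta                                # N(mu, Sigma) + Delta after t0
    return X, t0

X, t0 = make_stream(p=100)
print(X.shape, t0)                                 # (600, 100) 400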
The Rank-based procedure has been configured to achieve DDs comparable with
the ones provided by the proposed CDT.
The comparison of detection abilities is presented in Table 1. As we can see
the proposed CDT outperforms the Rank-based procedure both in terms of
FP/FN and DD. As expected, the Rank-based procedure is not able to detect
changes that affect all the dimensions in the same way (i.e., FNs are very large
in Δ2 ) due to the rank-based analysis. Differently, the proposed CDT is able
to effectively detect changes affecting both just a single dimension and all the
dimensions.
1. Δ3 = [δ, 0, . . . , 0]: modelling the case where only one dimension is affected by
the additive term δ;
2. Δ4 = [0.1δ, 0.1δ, . . . , 0.1δ]: modelling the case where all the dimensions are
affected by the additive term 0.1δ,
Δ     p      α      FPs   FNs   DD
Δ3    100    0.1    2.4   0     6.06
      100    0.01   0.6   0     7.75
      300    0.1    0.8   0     7.72
      300    0.01   0     0     10.41
      500    0.1    1.3   0     8.29
      500    0.01   0     0     13.33
      1000   0.1    2.6   0     8.21
      1000   0.01   0     0     11.58
Δ4    100    0.1    2.0   0     6.28
      100    0.01   0.6   0     7.97
      300    0.1    0.6   0     3.06
      300    0.01   0     0     3.96
      500    0.1    2.0   0     2.15
      500    0.01   0     0     2.62
      1000   0.1    2.6   0     1.39
      1000   0.01   0     0     1.62
Fig. 2. Execution time (in seconds) required to process each new acquired observation
by the proposed CDT w.r.t. the number of dimensions p.
² The computing platform is based on a 1.7 GHz Intel Core i5 with 4 GB of 1333 MHz DDR3 memory.
Fig. 3. The computed test statistic on the Mutant p53 proteins dataset.
The proposed CDT has been validated on a real dataset, i.e., the biophysical
models of mutant p53 proteins [4] available on the UCI website for classification
tasks. It is composed of 16592 instances (143 belonging to active class and 16449
to the inactive class) and 5408 attributes. We organised the dataset by appending all
the observations of the active class at the end of the inactive samples to model a
change in the distribution as per Eq. (1). In our experiments we considered the
first p = 300 features and we applied a preprocessing phase (i.e., computing the
sample mean on non-overlapping windows of 15 samples for each attribute) since
the original data distribution is not multinormal. This led to a 300-dimension
dataset composed by 1106 samples. The first γp = 450 samples are used to
configure the CDT, while t0 = 1097 (samples between 451 and 900 are used to fill
W2t at time t = 900). Figure 3 reports the behaviour of the test statistic in Eq. (4)
over time (solid blue line); two different thresholds have been considered, i.e.,
those related to α = 0.1 (red dotted line) and α = 0.01 (green dotted line). As
can be seen, the test statistic exceeds both thresholds at t = 1101 (hence with
a DD of four samples), without introducing FPs. This highlights the effectiveness
of the proposed CDT in detecting changes in real, non-normal, high-dimensional
datastreams.
6 Conclusions
The aim of this paper was to introduce a CDT able to operate in Big Data
scenarios. More specifically, the proposed test is able to operate without requir-
ing any a priori information about the system under inspection or the possible
changes and is able to manage datastreams characterized by very large dimensions
(up to 1000). The performance of the proposed CDT has been contrasted with
the rank-based procedure and tested on both synthetic and real datasets.

Future work will encompass the characterization of the probability of false
positive detections of the proposed CDT, hence providing additional "confi-
dence" information when a change in the datastream is detected.
Appendix
As described in Sect. 4, the thresholds hp,γ,α have been computed through simula-
tions, since an analytical derivation of the test statistic in (4) is hard to obtain.
Thresholds have been computed as follows: for each value of p and α and γ, we
simulated 10,000 experiments in which we randomly generated a p-variate nor-
mal distribution with random mean vector and covariance matrix. The threshold
hp,γ,α is set to guarantee that the empirical probability of having a false positive
detection by the proposed CDT on a fixed data-sequence whose length is 2γp is
equal to the confidence parameter α. Computed thresholds for different values of
p and α and γ = 1.5 are shown in Table 3. Further values of hp,γ,α for different
configurations of p and α and γ can be found at the following url: http://roveri.
faculty.polimi.it/software-and-datasets/.
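The simulation procedure can be sketched as follows. Since the test statistic of Eq. (4) is not reproduced in this excerpt, the statistic argument is a placeholder for it, and the range of candidate change times is an assumption.

import numpy as np

def simulate_threshold(statistic, p, gamma, alpha, runs=10000, seed=0):
    """Monte Carlo estimate of h_{p,gamma,alpha}. statistic(X, t) must
    implement the test statistic of Eq. (4), not reproduced here."""
    rng = np.random.default_rng(seed)
    n = int(2 * gamma * p)              # fixed sequence length 2*gamma*p
    maxima = np.empty(runs)
    for r in range(runs):
        mu = rng.standard_normal(p)     # random mean vector
        A = rng.standard_normal((p, p))
        sigma = A @ A.T                 # random (PSD) covariance matrix
        X = rng.multivariate_normal(mu, sigma, size=n)
        # candidate change times; the exact range depends on the CDT setup
        maxima[r] = max(statistic(X, t) for t in range(n // 2, n))
    # the (1 - alpha) quantile keeps the empirical FP probability at alpha
    return np.quantile(maxima, 1.0 - alpha)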
References
1. Agarwal, D.: An empirical Bayes approach to detect anomalies in dynamic multidi-
mensional arrays. In: Fifth IEEE International Conference on Data Mining, 8 pp.
IEEE (2005)
2. Alippi, C., Roveri, M.: Just-in-time adaptive classifiers. Part I: detecting nonstation-
ary changes. IEEE Trans. Neural Netw. 19(7), 1145–1153 (2008)
3. Basseville, M., Nikiforov, I.V., et al.: Detection of Abrupt Changes: Theory and
Application, vol. 104. Prentice Hall, Englewood Cliffs (1993)
4. Danziger, S.A., Swamidass, S.J., Zeng, J., Dearth, L.R., Lu, Q., Chen, J.H., Cheng,
J., Hoang, V.P., Saigo, H., Luo, R., et al.: Functional census of mutation sequence
spaces: the example of p53 cancer rescue mutants. IEEE/ACM Trans. Comput.
Biol. Bioinform. (TCBB) 3(2), 114–125 (2006)
5. Ferreira, L.N., Zhao, L.: A time series clustering technique based on community
detection in networks. Procedia Comput. Sci. 53, 183–190 (2015). INNS Conference
on Big Data 2015, San Francisco, CA, USA, 8–10 August 2015
6. Galeano, P., Peña, D.: Covariance changes detection in multivariate time series. J.
Stat. Plann. Infer. 137(1), 194–211 (2007)
7. Hajj, N., Rizk, Y., Awad, M.: A mapreduce cortical algorithms implementation for
unsupervised learning of big data. Procedia Comput. Sci. 53, 327–334 (2015)
8. Hegedűs, I., Nyers, L., Ormándi, R.: Detecting concept drift in fully distributed
environments. In: 2012 IEEE 10th Jubilee International Symposium on Intelligent
Systems and Informatics (SISY), pp. 183–188. IEEE (2012)
9. Kuncheva, L.I.: Change detection in streaming multivariate data using likelihood
detectors. IEEE Trans. Knowl. Data Eng. 25(5), 1175–1180 (2013)
10. Qiu, P., Hawkins, D.: A rank-based multivariate CUSUM procedure. Technometrics
43(2), 120–132 (2001)
11. Sullivan, J.H., Woodall, W.H.: Change-point detection of mean vector or covariance
matrix shifts using multivariate individual observations. IIE Trans. 32(6), 537–549
(2000)
12. Wang, T.Y., Chen, L.H.: Mean shifts detection and classification in multivariate
process: a neural-fuzzy approach. J. Intell. Manufact. 13(3), 211–221 (2002)
13. Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans.
Knowl. Data Eng. 26(1), 97–107 (2014)
14. Zamba, K., Hawkins, D.M.: A multivariate change-point model for statistical
process control. Technometrics 48(4), 539–549 (2006)
15. Zikopoulos, P., Eaton, C., et al.: Understanding Big Data: Analytics for Enterprise
Class Hadoop and Streaming Data. McGraw-Hill Osborne Media, New York (2011)
Hadoop MapReduce Performance on SSDs:
The Case of Complex Network Analysis Tasks
1 Introduction
2 Related Work
Investigating the usage of SSDs in Hadoop clusters has very recently become a hot
topic of discussion. The work most relevant to ours is included in the following articles [4, 5,
8, 9, 11]. The first effort [5] to study the impact of SSDs on Hadoop was conducted on a
virtualized cluster (multiple Hadoop nodes on a single physical machine) and showed
up to three times better performance for SSDs versus HDDs. However, it remains
unclear whether the conclusions still hold in non-virtualized environments. The work in
[8] compared Hadoop's performance on SSDs and HDDs on hardware with
non-uniform bandwidth and cost using the Terasort benchmark. The major finding is
that SSDs can accelerate the shuffle phase of MapReduce. However, this work is
limited by the very narrow type of application/workload used for the investigation
and by the interference of data transfers across the network. Cloudera's employees
in [4], using a set of same-rack-mounted machines (without reporting how many),
focused on measuring the relative performance of SSDs and HDDs as equal-bandwidth
storage media. The MapReduce jobs they used are either read-heavy (Teravalidate,
Teraread, WordCount) or network-heavy (Teragen, HDFS data write), plus Terasort,
which is read/write/shuffle "neutral". Thus, neither is the processing pattern mixed nor
are the network effects neutralized. Their findings showed that SSDs achieve higher
performance than HDDs, but that the benefits vary depending on the MapReduce job
involved, which is exactly the issue the present study addresses [7].
The analysis performed in [9] using Intel's HiBench benchmark [2] concluded that
"… the performance of SSD and HDD is nearly the same", which contradicts all
previously mentioned works. A study of both pure (only HDDs or only SSDs)
and hybrid systems (combined SSDs and HDDs) is reported in [11], using a five-
node cluster and the HiBench benchmark. In contrast to the current work, the authors in
[11] investigated the impact of HDFS's block size, memory buffers, and input data
volume on execution time. The results illustrated that when the input data set size
and/or the block size increases, the performance gap between a pure SSD system and a
pure HDD system widens in favor of the SSD. Moreover, for hybrid systems, the work
showed that more SSDs result in better performance. These conclusions are expected,
since voluminous data imply increased network usage among nodes. Earlier
work [3, 10] studied the impact of the interconnect on Hadoop performance with SSDs,
identifying bandwidth as a potential bottleneck. Finally, some works propose exten-
sions to Hadoop with SSDs. For instance, VENU [6] is a proposed extension to
Hadoop that uses SSDs as a cache (of the HDDs) not for all data, but only for those
data that are expected to benefit from the characteristics of SSDs. This work still leaves
open the question of how to tell which applications will benefit from the performance
characteristics of SSDs.
3 Investigated Algorithms
Complex network analysis comprises a large set of diverse tasks (algorithms for finding
communities, centralities, epidemics, etc.) that cannot be enumerated here. Among all
these problems and their associated MapReduce solutions, we had to select some of
them based on (a) their usefulness in complex network analysis tasks, (b) their
suitability to the MapReduce programming paradigm, (c) the availability of their
implementations (free/open code) for reproducibility of measurements, and (d) their
complexity in terms of multiple rounds of map-reduce operations. Based on these
criteria, we selected three problems/algorithms for our experiments.¹ The first
algorithm deals with a very simple problem which is at the same time a fundamental
operation in Facebook: finding mutual friends. The second algorithm deals with a
network-wide, path-based analysis for finding connected components, which finds
applications in reachability queries, techniques for testing network robustness and
resilience to attacks, epidemics, etc. The third algorithm is about counting triangles,
a fundamental operation for higher-level tasks such as calculating the clustering
coefficient or executing community-finding algorithms based on clique percolation
concepts. Table 1 summarizes the "identity" of the tasks.
We deferred a more advanced method for measuring the performance of multi-job
workloads, such as the one described in [1], because the standalone, one-job-at-a-time
method allows for examining the interaction between MapReduce and the storage
media without the interference of job scheduling and task placement algorithms. We
aim at showing that the conclusions about the relative performance of SSDs versus
HDDs depend strongly on the features of the algorithms examined, which has
largely been neglected in earlier comparative studies [4, 5, 8], and based on these
features we draw some conclusions on the relative benefits of SSDs.
4 System Setup
A commodity computer (Table 2) was used for the experiments. Three storage media
were used (Table 2), with capacities similar to those used in [8]. On each of the three
drives (one HDD and two SSDs), a separate and identical installation of the latest
version of the required software was used. We emphasize at this point that, since we
need to factor out the network effects, we used single-machine installations. Three
incrementally tuned setups were used: (a) default settings, allowing 6 parallel maps,
(b) modified containers, allowing 3 parallel maps, and (c) custom settings (Table 3).
In all these setups, speculative execution was disabled and no early shuffling was
permitted.
¹ The MapReduce codes (along with many experiments) can be found in the technical report at
http://www.inf.uth.gr/~dkatsar/Hadoop-SSD-HD-for-SNA.pdf.
For the evaluation of the two disk types, we used ten real social network datasets
(Table 4). They were retrieved from https://snap.stanford.edu/ and http://konect.uni-
koblenz.de/.
The two SSDs were of different sizes, which prevented some datasets from being run
on both. The most important measures we captured were the Map and Reduce execution
times, as well as the Sort (merge) and Shuffle phase times. One common side effect is
"cache hits" from previous executions, which was also experienced in [8]. In order to
give each experiment an equal environment, Hadoop was halted and the page cache was
flushed after each experiment. Before each test, HDFS was re-formatted.
6 The Results
1. Mutual Friends
The complexity of this algorithm is exponential due to the mapper of the 2nd
MapReduce job. Thus, the 2nd MapReduce job is the most resource-intensive of the
three jobs, rendering it a good inspection point for our measurements (see Table 5),
whereas the 1st and 3rd MapReduce jobs executed quickly and almost identically on
all disks. For Amazon, Brightkite and DBLP, the three disks performed almost equally.
For the bigger datasets, the magnetic disk gives competitive (with respect to both SSD
drives) execution times for the reduce phase, but performs worse for the map phase.
The SSD2 displays superior shuffling performance.
Table 5. Average times for each phase of the 2nd job (creating triples) of the "mutual friends" algorithm
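The exact MapReduce implementation used here is available in the technical report cited above; the following plain-Python sketch, with a hypothetical four-user friendship graph, merely illustrates one classic map-reduce formulation of the mutual-friends problem and hints at why the pair-generating job is the heavy one (each mapper ships a full friend list per emitted pair).

from collections import defaultdict

# Hypothetical adjacency lists (user -> set of friends)
friends = {"a": {"b", "c", "d"}, "b": {"a", "c"},
           "c": {"a", "b", "d"}, "d": {"a", "c"}}

# Map: for every (user, friend) pair, key the user's friend list by the
# sorted pair so that both endpoints land on the same reducer.
grouped = defaultdict(list)
for u, flist in friends.items():
    for v in flist:
        grouped[tuple(sorted((u, v)))].append(flist)

# Reduce: the mutual friends of (u, v) are the intersection of the two
# friend lists collected under the same key.
mutual = {pair: set.intersection(*lists)
          for pair, lists in grouped.items() if len(lists) == 2}
print(mutual[("a", "c")])   # -> {'b', 'd'}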
2. Counting Triangles
Here, the SSDs outperform the HDD on all evaluated datasets. In the "forming the
triads" job, the HDD showed competitive behavior in the reduce phase (Table 7). The
"counting the triangles" job demonstrated greater variation in execution times. Our
evaluation shows that with small datasets the performance differences between the
two disk types are small (Table 6), whereas with larger ones (like the YouTube dataset),
the SSDs' capabilities become evident in the shuffle and merge (sort) phases.
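The two jobs named here ("forming the triads" and "counting the triangles") follow a well-known MapReduce pattern; a compact plain-Python sketch of that pattern, on a toy edge set, is given below. It illustrates the general technique, not the authors' exact implementation.

from itertools import combinations
from collections import defaultdict

edges = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")}

# Job 1 ("create triads"): build adjacency lists, then, for every node,
# emit each pair of its neighbours as an open triad.
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

triads = []
for node, neigh in adj.items():
    triads.extend(combinations(sorted(neigh), 2))

# Job 2 ("count triangles"): a triad closed by an existing edge forms a
# triangle; each triangle is discovered once per vertex, hence the / 3.
norm_edges = {tuple(sorted(e)) for e in edges}
matches = sum(1 for t in triads if t in norm_edges)
print(matches // 3)   # -> 1 (the a-b-c triangle)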
For the 1st MR job (creating triads), the map, shuffle and merge phases finished quite
fast and with almost zero differences among disks. The reduce phase lasted signifi-
cantly longer, with both disk types performing equally (Table 7). With the "containers"
settings, the biggest dataset, Flickr, obtains a significant improvement for both disk
types (Table 8). To optimize performance, increasing the following settings provided
the best results for the magnetic disk, compared to the "containers" settings:
Table 6. Average times for each phase for 2nd job (calculate triangles) of “counting triangles”
Table 7. Average times for each phase for 1st job (create triads) of “counting triangles”
algorithm
Table 8. Average times for each phase for 1st job (create triads) of “counting triangles”
algorithm, with changed container’s settings
Table 9. Performance difference for YouTube dataset at “Counting Triangles”, increasing sort
factor, for HDD
Table 10. Performance difference for YouTube dataset at “Counting Triangles”, increasing sort
factor, for SSD2
(a) The number of streams to merge at once while sorting files. Increasing it minimizes
merge time for both disk types and improves HDD shuffle time as well (Tables 9
and 10).
(b) The buffer size for I/O (read/write) operations (Table 11).
On the other hand, increasing the buffer size for I/O operations had minimal effect
on SSD2 performance (Tables 12, 13 and 14).
Table 11. Performance difference for YouTube dataset at “Counting Triangles”, increasing file
buffer size, for HDD
Table 12. Performance difference for YouTube dataset at “Counting Triangles”, increasing file
buffer size, for SSD2
Table 13. Percentage difference between the "custom" and "containers" settings for the YouTube
dataset, at the "Counting Triangles" algorithm
Table 14. Percentage difference between the "custom" and "containers" settings for the YouTube
dataset, at the "Mutual Friends" algorithm
3. Connected Components
Comparing SSD1 to the HDD, the Connected Components algorithm seems to
slightly favor SSD1 for small datasets in the reduce phase. Map, shuffle and sort phase
times are close for both disk types (Table 15). For the Flickr and LiveJournal datasets,
the magnetic disk takes the lead in the reduce phase, which for the Hadoop framework
is mostly characterized as a "write" procedure. Surprisingly, SSD1 performs quite
slowly in the shuffle phase for the LiveJournal dataset. The SSD2 generally delivers
great performance, especially in the map and shuffle phases, increasingly so as the
datasets' size grows. In the reduce phase the HDD falls behind SSD2, but not by a
great margin.
Table 15. Sum of average times for each phase for the iterative Jobs of “Connected
Components”
7 Conclusions
We compared the performance of solid state drives and hard disk drives for social
network analysis tasks. SSDs did not come out as the undisputed winner. The second
SSD performed significantly better. In many cases SSD1 and the magnetic disk ended
in a draw; although SSD1 was slightly faster in many tests, in some cases the magnetic
disk outperformed it. Even compared to the faster SSD2, the magnetic disk provided
competitive times for the reduce phase, especially with the "mutual friends" algorithm,
where it performed marginally better. The magnetic disk's shuffle times can be reduced
by tuning, whereas the SSDs' performance shows no further improvement; nevertheless,
the HDD cannot catch up with the SSDs' superior shuffling performance. With tweaking,
merge-sort can be performed in fewer steps, minimizing merge-phase times for both
disk types and slightly favoring the magnetic disk, which would otherwise perform
more slowly. For the map phase, both disk types can obtain similar performance
improvements.
Acknowledgement. This work was supported by the Project “REDUCTION: Reducing Envi-
ronmental Footprint based on Multi-Modal Fleet management System for Eco-Routing and
Driver Behaviour Adaptation,” funded by the EU.ICT program, Challenge ICT-2011.7.
References
1. Chen, Y., Ganapathi, A., Griffith, R., Katz, R.: The case for evaluating MapReduce
performance using workload suites. In: Proceedings of IEEE MASCOTS, pp. 390–399 (2011)
2. Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench benchmark suite:
Characterization of the MapReduce-based data analysis. In: Proceedings of ICDE
Workshops (2010)
3. Islam, N., Rahman, M., Jose, J., Rajachandrasekar, R., Wang, H., Subramoni, H., Murthy,
C., Panda, D.: High performance RDMA-design of HDFS over InfiniBand. In: Proceedings
of SC (2012)
4. Kambatla, K., Chen, Y.: The truth about MapReduce performance on SSDs. In: Proceedings
of LISA, pp. 109–117 (2014)
5. Kang, S.-H., Koo, D.-H., Kang, W.-H., Lee, S.-W.: A case for flash memory SSD in Hadoop
applications. Int. J. Control Autom. 6, 201–210 (2013)
6. Krish, K.R., Iqbal, M.S., Butt, A.R.: VENU: orchestrating SSDs in Hadoop storage. In:
Proceedings of IEEE BigData, pp. 207–212 (2014)
7. Min, C., Kim, K., Cho, H., Lee, S.-W., Eom, Y.I.: SFS: random write considered harmful in
solid state drives. In: Proceedings of USENIX FAST (2012)
8. Moon, S., Lee, J., Kee, Y.S.: Introducing SSDs to the Hadoop MapReduce framework. In:
Proceeding of IEEE CLOUD, pp. 272–279 (2014)
9. Saxena, P., Chou, J.: How much solid state drive can improve the performance of Hadoop
cluster? Performance evaluation of Hadoop on SSD and HDD. Int. J. Mod. Commun.
Technol. Res. 2(5), 1–7 (2014)
10. Sur, S., Wang, H., Huang, J., Ouyang, X., Panda, D.: Can high-performance interconnects
benefit Hadoop distributed file system. In: Proceedings of the Workshop MASVDC (2010)
11. Wu, D., Xie, W., Ji, X., Luo, W., He, J., Wu, D.: Understanding the impacts of solid-state
storage on the Hadoop performance. In: Proceedings of Advanced Cloud and Big Data,
pp. 125–130 (2013)
Designing HMMs in the Age of Big Data
Abstract. The rise of the Big Data age has made traditional solutions for
data processing and analysis unsuitable due to their high computational
complexity. To address this problem, novel, specifically-designed
techniques to analyse Big Data have recently been presented. Along this
path, when such large amounts of data arrive in a streaming manner, a
sequential mechanism for Big Data analysis is required. In this paper
we target the modelling of high-dimension datastreams through hidden
Markov models (HMMs) and introduce an HMM-based solution, named
h-HMM, suitable for datastreams characterized by high dimensions. The
proposed h-HMM relies on a suitably-defined clustering algorithm (oper-
ating in the space of the datastream dimensions) to create clusters of
highly uncorrelated dimensions of the datastream (as requested by the
theory of HMMs), and a two-layer hierarchy of HMMs modelling the
datastreams of such clusters. Experimental results on both synthetic and
real-world data confirm the advantages of the proposed solution.
1 Introduction
It is not rare that, when great amounts of data are made available, e.g., by distributed
and pervasive systems [2,7], traditional solutions for data processing and analysis
prove inadequate due to their excessive computational complexity. For
this reason, a novel generation of solutions has recently arisen under the umbrella
of Big Data Analytics [10], whose goal is to analyse Big Data to extract rele-
vant/valuable information. Big Data often come in a streaming manner, and
sequential analysis is the only way to provide real-time answers. The
problem is amplified when the computational complexity per datum is high, as
is frequently the case in neural network processing, e.g., when convolutional
networks, deep learning models or multidimensional HMMs need to be both gen-
erated and executed.
Among the solutions present in the literature for this purpose (e.g., hierar-
chical self-organizing maps [12], principal component analysis and compressive
sampling [17]), we focus on hidden Markov models (HMMs), which represent a well-
known and widely-used statistical tool for modelling the temporal evolution of
streams of data [14]. HMMs have been successfully applied to many applica-
tion scenarios for event detection (e.g., speech [11] and emotion recognition [9],
human behaviour analysis [5], cognitive fault detection and identification [4])
but, to the best of our knowledge, HMMs have not yet been considered in the Big
Data scenario. In fact, in the HMM literature, the considered datastreams
rarely exceed 50 dimensions due to the poor scaling of the computational
complexity w.r.t. the problem dimension. HMM-based solutions operating on
high-dimension datastreams have been proposed for specific application scenarios
(e.g., network anomaly detection [6] and identity management for social media
networks [18]). Hence, modelling Big Data with HMMs still represents an open
problem. The reason is twofold: (i) theoretically, the accuracy of the learning
phase generally decreases with the dimension of the considered datastream, since
a large training set is required to obtain the modelling accuracy that is feasible in
low dimensions [3]; (ii) modelling high-dimension datastreams through an HMM
can be a time- and resource-consuming task.
The aim of this paper is to propose a novel HMM-based solution for the
modelling of high-dimension datastreams. Similarly to other solutions present
in the field of Big Data and Data Analytics [16], the proposed solution relies
on the ability to transform the modelling problem into sub-problems, which can
be more easily managed, and to suitably aggregate their outcomes. In more detail,
the proposed solution, named hierarchical-HMM (h-HMM), is based on the use
of a suitably-defined clustering algorithm to create sets of highly uncorrelated
features, i.e., dimensions of the datastreams, and a two-layer hierarchy of
HMMs meant to learn, in a hierarchical way, the system model generating the
high-dimension datastreams.
Through a suitably defined figure of merit, the proposed solution takes into
account the learning accuracy as well as the computational time of the proposed
h-HMM during its operational life, since the latter plays a critical role in Big Data
Analytics.
The paper is organized as follows. Section 2 describes the architecture of the
proposed h-HMM, while its learning algorithm is presented in Sect. 3. Experi-
mental results are detailed in Sect. 4. Finally, Sect. 5 presents our conclusions
and future work.
Fig. 1. The general idea of the proposed hierarchical HMM-based solution for large-
dimension datastreams: the dimensions x1(t), . . . , xd(t) are grouped by an uncorrelation-
based k-medoids clustering into the clusters C1, . . . , Ck.
the whole datastream. Through the clustering phase and the subsequent mod-
elling of the datastreams coming from these clusters, the idea is to transform the
problem of modelling X into sub-problems that can be more easily managed
by the proposed h-HMM.
In more detail, a set of clusters C_j, j ∈ {1, . . . , k}, meant to cluster the dimen-
sions of X, is provided by a suitably-defined clustering algorithm based on the
analysis of uncorrelation among dimensions. The proposed clustering algorithm,
which will be detailed in the next section, operates in the space of the dimensions
of X by relying on the theoretical foundations of HMMs, which assume that the
stochastic behaviour of uncorrelated features is learned [14].
Afterwards, the hierarchy of HMMs is organized into two layers, named H^I
and H^II, respectively.
The first layer H^I comprises k HMMs, i.e.,
These log-likelihood values measure the extent to which the HMMs in H^I of Eq. (1)
are able to recognize data from their respective clusters up to time t.
The second layer of the h-HMM relies on a single HMM, named H^II, whose
goal is to learn the log-likelihoods l^I(t) of Eq. (2), which form a k-dimensional datas-
tream, and to provide, as output, the log-likelihood l^II(t). Hence, this second layer
represents the aggregation of the outputs of the first-layer HMMs. Here, the idea
is to capture the temporal behaviour of the likelihoods produced by the first-
layer HMMs, representing the extent to which each HMM is able to recognize
the sequence of data acquired from the corresponding cluster. Hence, the second-
layer HMM is meant to jointly learn the temporal behaviour of the recognition
abilities of the first-layer HMMs: l^II(t) measures the extent to which the h-HMM is
able to recognize the statistical behaviour of X over time.
Hence, the first-level HMMs are characterized by the components H_j^I =
{S_j^I, P_j^I, A_j^I, π_j^I}, j = 1 . . . k, while the one at the second layer by H^II =
{S^II, P^II, A^II, π^II}. The joint training of {H_1^I . . . H_k^I} and H^II is achieved
through a novel training algorithm that will be discussed in the next section.
The proposed training algorithm also allows the correct number of clusters k,
which is a crucial parameter for the h-HMM, to be discovered automatically. In
fact, this choice heavily affects the learning ability of the h-HMM, since large
values of k would improve the learning of {H_1^I . . . H_k^I} (working on fewer
dimensions) at the expense of an increased complexity of the learning of H^II
(which must operate on a larger number of dimensions). Conversely, smaller
numbers of clusters would ease the learning of H^II at the expense of the learning
abilities of {H_1^I . . . H_k^I}, which must operate on clusters characterized by larger
dimensions. Hence, the correct trade-off between the accuracies of the two layers
of the h-HMM must be identified to maximize the overall accuracy.
Input: TS, γ;
1. Create T_TS, V^1_TS, V^2_TS as described in Sect. 3;
2. F = [ ];
3. Set k_max = d;
4. for k = 1 : k_max do
5.   Apply the uncorrelation-based k-medoids algorithm given k and get
     C^k = {C_1, . . . , C_k};
6.   Train H^I = {H^I_1, . . . , H^I_k} on TS;
     for j = 1 : k do
7.     Evaluate HMM H^I_j on V^1_TS and compute the log-likelihoods l^I_j(t) on V^1_TS;
8.     l^I_j = {l^I_j(1), . . . , l^I_j(t)} on V^1_TS;
     end
9.   l^I = {l^I_1, . . . , l^I_k};
10.  Train H^II on l^I on V^1_TS and compute l^II(t) on V^2_TS;
11.  Measure the execution time ET as described in Sect. 3;
12.  F = [F ; mean(l^II) − γ × ET], where mean(l^II) is the average of l^II(t) on V^2_TS;
end
13. Determine max(F) and the associated k;
taking into account the average log-likelihood mean(l^II) on the validation set V^2_TS and
the execution time ET of the h-HMM (normalized w.r.t. l^II), multiplied by
an application-specific coefficient γ ≥ 0. ET denotes the time required to process
one sample x(t) and compute the associated log-likelihood l^II(t), hence accounting
for the whole processing of x(t) through the H^I_j s and H^II. F measures the
modelling accuracy, with the associated computational time acting as a penalty
term.
The algorithm assumes the availability of a training sequence TS = {x(t), t =
1, . . . , t_T}, aiming at modelling the nominal state of X. TS is then divided
into T_TS, V^1_TS, V^2_TS (line 1) as follows: TS = [T_TS, V^1_TS, V^2_TS], whose lengths are
l_TS, l_V1, and l_V2 respectively, and the maximum number of clusters k_max to
be considered is set equal to the dimension d of the feature vector (line 3).
The proposed algorithm explores the number of clusters in the range [1, k_max],
where k_max = d (line 4); for each k, the ad-hoc clustering algorithm based
on k-medoids is applied (line 5) and clusters C^k = {C_1, . . . , C_k} are
produced.
This clustering algorithm is meant to create clusters of uncorrelated features,
i.e., dimensions of the datastream, maximizing the extent to which the features
belonging to the same cluster are uncorrelated with each other. As shown in Sect. 4,
this improves the HMM learning of each cluster, hence increasing the
learning ability of the whole h-HMM solution.
The HMMs are trained through the Baum-Welch algorithm [8], while fully-connected
HMMs are considered (i.e., the system may transit to any state at a given time instant,
thus A^I_j > 0, j = 1 . . . k, and A^II > 0). The matrix l^I is formed by collecting the
vectors {l^I_1, . . . , l^I_k}, which are computed by evaluating H^I = {H^I_1, . . . , H^I_k}
on V^1_TS (line 7), i.e., l^I = {l^I_1, . . . , l^I_k} with k dimensions and length l_V1 (line 8).
Subsequently, we capture the behaviour of l^I through H^II, which operates on
the sequence of log-likelihoods generated by H^I on V^1_TS. H^II is evaluated on V^2_TS
to obtain l^II, which characterizes the extent to which the h-HMM is able to
recognize the statistical behaviour of X (line 10). Finally, we compute the figure
of merit F = mean(l^II) − γ × ET for the specific value of k, where ET is computed
on one sample and mean(l^II) is the average value of l^II(t) on V^2_TS (line 12). Its
maximum value (line 13) over the range k = [1, . . . , k_max] reveals the optimal h-HMM
setting (the block diagram of the proposed system is depicted in Fig. 1).
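A compact sketch of this model-selection loop is given below; cluster_dims, train_first_layer, train_second_layer, loglik_sequence and exec_time are hypothetical helpers standing in for lines 5-11 of the listing, and the array slicing assumes V1 and V2 hold one column per dimension.

import numpy as np

def select_k(TS, V1, V2, d, gamma):
    best_F, best_k = float("-inf"), None
    for k in range(1, d + 1):                    # k_max = d (lines 3-4)
        clusters = cluster_dims(TS, k)           # uncorrelation k-medoids (5)
        H1 = train_first_layer(TS, clusters)     # k first-layer HMMs (6)
        l1 = [h.loglik_sequence(V1[:, c])        # log-likelihoods on V1 (7-9)
              for h, c in zip(H1, clusters)]
        H2 = train_second_layer(l1)              # second-layer HMM (10)
        l2_mean = np.mean(H2.loglik_sequence(V2))
        F = l2_mean - gamma * exec_time(H1, H2)  # figure of merit (12)
        if F > best_F:
            best_F, best_k = F, k                # arg-max over k (13)
    return best_k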
4 Experiments
This section describes the experimental campaign designed to assess the effec-
tiveness of the proposed h-HMM on both synthetic and real datasets. Here, the
goal is to show that the proposed h-HMM is able to provide better modelling
abilities (measured as larger log-likelihood values) than other solutions present
in the literature on high-dimension datastreams.
Synthetic Datasets. We modelled X as a 200-dimensional data-generating
process, whose data x(t) are randomly drawn from a multivariate normal
distribution with mean vector μ and a covariance matrix containing integers
randomly drawn from the discrete uniform distribution on the interval [0, 100].
In particular, we considered two different scenarios for X:
• independent dimensions F^id: the dataset is built by considering a diagonal
covariance matrix Σ_id, leading to independent feature vectors;
• dependent dimensions F^d: a symmetric positive semi-definite covariance
matrix Σ_d is considered, leading to correlated dimensions.
The length of each experiment is 10,000 samples, while results are aver-
aged over 50 runs. The proposed h-HMM is compared with the following three
solutions: (a) the traditional HMM; (b) c-HMM: the h-HMM with correlation-
based clustering, where the distance metric for the k-medoids is D_{x_i,x_j} =
1 − Cor(x_i, x_j), aiming at clustering together the dimensions which are highly
correlated; and (c) r-HMM: the h-HMM where the dimensions of X are ran-
domly clustered using labels drawn from a uniform distribution on the interval
[1, k_max].
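The distance matrices involved are easy to construct. In the sketch below, the c-HMM metric follows the definition just given, while the h-HMM (uncorrelation-based) distance is an assumed reading of the paper, since its exact formula is not reproduced in this excerpt; X is a hypothetical window of the datastream.

import numpy as np

# X is an (n_samples x d) window of the datastream; each dimension is one
# "point" to be clustered.
C = np.corrcoef(X.T)          # d x d correlation between dimensions

# c-HMM baseline, as defined above: small distance = high correlation.
D_corr = 1.0 - C

# h-HMM (assumed reading): dimensions should be grouped together when
# they are highly UNcorrelated, so |correlation| itself acts as distance.
D_uncorr = np.abs(C)

# Either matrix can be handed to any k-medoids implementation that
# accepts a precomputed distance matrix.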
It should be noted that, during our experiments, we used the k-medoids
algorithm embedded in the MATLAB programming platform and
the HMM implementation provided in [13]. The length of TS is 6000 samples,
and l_TS, l_V1, and l_V2 are 4800, 600, and 600 samples, respectively. HMMs are
trained by exploring the following configurations: S ∈ {3, 4, 5, 6, 7, 8, 9} states and
{2, 4, 8, 16, 32, 64, 128, 256, 512} Gaussian functions. The HMM providing the
highest log-likelihood on a validation set was the final choice. For the HMMs in
H^I the validation set is V^1_TS, while for H^II it is V^2_TS.
The experimental results are shown in Table 1, including the three considered
solutions (h-HMM, c-HMM, r-HMM) as well as the traditional single HMM. All
approaches are evaluated on both F^id and F^d. In addition, Fig. 2 depicts the
considered figure of merit F as a function of k for the proposed h-HMM in both
scenarios.
By observing Table 1 we can see that the h-HMM approach provides higher
accuracy than all the other approaches (i.e., traditional HMM, c-HMM and r-
HMM) at the expense of a slight increase in ET. Nevertheless, we emphasize
that ET can be managed by tuning the parameter γ in F.
Hence, the h-HMM approach considering uncorrelated features provides the
best performance when applied to both F^d and F^id. This behaviour is justified by
the fact that the GMMs assume a diagonal covariance, i.e., feature independence.
Interestingly, a small number of clusters leads to quite satisfactory performance
when modelled in a hierarchical fashion, without needing an excessive amount
of additional time. This is due to the fact that hierarchical HMMs take into
consideration the temporal evolution of the log-likelihoods produced by the H^I's
over time.
Another interesting observation is that results on F^id are better than on F^d,
which is reasonable since feature independence is assumed by HMMs, as men-
tioned in Sect. 3. The h-HMM is able to consistently provide higher log-likelihoods
than c-HMM and r-HMM across every clustering setting. At the same
time, ET is kept at relatively low values. In addition, the proposed h-HMM
approach provides higher modelling accuracy than the standard HMM, which, as
expected, provides very poor learning abilities on datastreams characterized by
high dimensions.
As expected, in the case of k = 1, h-HMM, c-HMM and r-HMM provide the same
log-likelihood values (as well as ET), since they all operate on the same unique
cluster comprising the whole set of dimensions of X.
Real-World Data. In this experimental phase we considered a multivariate
time-series dataset containing 13,910 measurements from 16 chemical sensors
exposed to 6 gases at different concentration levels, which is publicly available
for research purposes at the UCI Machine Learning Repository [1]. A full descrip-
tion of the dataset is provided in [15,20]. The proposed algorithm was applied
with γ = 1 and a TS length equal to 10,000, and it automatically divided the 129
dimensions into 7 clusters, while H^II provided mean(l^II) = −456.4. At the same
time, the single HMM was unable to capture the behaviour of the dataset, since it
provided a log-likelihood of −∞: here TS is not long enough for learning a single
HMM. Thus, the h-HMM may be more appropriate, as it operates on
features of low dimensions. In Fig. 3 we can see how the log-likelihood l^II varies
with respect to different numbers of clusters. When considering fewer than 7 or
more than 12 clusters, l^II = −∞, while its maximum value is reached when
k = 7. This behaviour shows that for several clustering settings TS is not ade-
quate for learning the associated statistical behaviour, thus the HMM is unable
to explain the observed feature sequence.
Fig. 3. The log-likelihood l^II with respect to different numbers of clusters, computed on
real-world data coming from 16 chemical sensors.
5 Conclusions
This paper presented an extensive study on the statistical modelling of Big Data
using HMMs. We proposed a hierarchical HMM-modelling approach employing
a clustering algorithm for discovering groups of uncorrelated dimensions.
The number of clusters is automatically selected by an algorithm maximizing
a figure of merit, which considers both modelling accuracy and computational
time. Thorough experiments on both synthetic and real-world data demonstrated
the superiority of the proposed approach over the traditional HMM method and
both the random and correlation-based clustering solutions.
This work comprises a first step towards HMM-based modelling of large-scale
datastreams, while our main goal is to extend the present modelling approach
to learning data acquired by large-scale cyber-physical systems (LCPS) com-
posed of heterogeneous units endowed with sensing, processing, communication
and (possibly) actuation abilities. In such a large-scale scenario, the analysis of
data to promptly detect changes in the environment and faults affecting sensors
References
1. http://archive.ics.uci.edu/ml/datasets/gas+sensor+array+drift+dataset+at+
different+concentrations (2015)
2. Abolfazli, S., Sanaei, Z., Ahmed, E., Gani, A., Buyya, R.: Cloud-based augmenta-
tion for mobile devices: motivation, taxonomies, and open challenges. IEEE Com-
mun. Surv. Tutorials 16(1), 337–368 (2014). doi:10.1109/SURV.2013.070813.00285
3. Aladjem, M.: Projection pursuit fitting Gaussian mixture models. In: Caelli, T.,
Amin, A., Duin, R.P.W., de Ridder, D., Kamel, M. (eds.) SSPR/SPR 2002.
LNCS, vol. 2396, pp. 396–404. Springer, Heidelberg (2002).
doi:10.1007/3-540-70659-3_41
4. Alippi, C., Ntalampiras, S., Roveri, M.: A cognitive fault diagnosis system for
distributed sensor networks. IEEE Trans. Neural Netw. Learn. Syst. 24(8), 1213–
1226 (2013)
5. Andersson, M., Ntalampiras, S., Ganchev, T., Rydell, J., Ahlberg, J., Fakotakis,
N.: Fusion of acoustic and optical sensor data for automatic fight detection in
urban environments. In: 2010 13th Conference on Information Fusion (FUSION),
pp. 1–8 (2010)
6. Babaie, T., Chawla, S., Ardon, S., Yu, Y.: A unified approach to network anomaly
detection. In: 2014 IEEE International Conference on Big Data (Big Data), pp.
650–655 (2014)
7. Bari, N., Mani, G., Berkovich, S.: Internet of things as a methodological concept.
In: 2013 Fourth International Conference on Computing for Geospatial Research
and Application (COM.Geo), pp. 48–55 (2013)
8. Baum, L.E., Petrie, T.: Statistical inference for probabilistic functions of finite
state markov chains. Ann. Math. Stat. 37(6), 1554–1563 (1966)
9. Guo, Y., Gao, H.: Emotion recognition system in images based on fuzzy neural net-
work and hmm. In: 5th IEEE International Conference on Cognitive Informatics,
ICCI 2006, vol. 1, pp. 73–78 (2006)
10. Hu, H., Wen, Y., Chua, T.S., Li, X.: Toward scalable systems for big data analytics:
A technology tutorial. IEEE Access 2, 652–687 (2014). doi:10.1109/ACCESS.2014.
2332453
11. Kinjo, T., Funaki, K.: On hmm speech recognition based on complex speech analy-
sis. In: 32nd Annual Conference on IEEE Industrial Electronics, IECON 2006, pp.
3477–3480 (2006)
12. Lee, C.H., Wu, C.H.: A novel big data modeling method for improving driving range
estimation of EVS. IEEE Access 3, 1980–1993 (2015). doi:10.1109/ACCESS.2015.
2492923
13. Murphy, K.: https://www.cs.ubc.ca/~murphyk/software/hmm/hmm.html (1998)
14. Rabiner, L.R.: A tutorial on hidden markov models and selected applications in
speech recognition. In: Proceedings of the IEEE, pp. 257–286 (1989)
15. Rodriguez-Lujan, I., Fonollosa, J., Vergara, A., Homer, M., Huerta, R.: On the
calibration of sensor arrays for pattern recognition using the minimal number of
experiments. Chemometr. Intell. Lab. Syst. 130, 123–134 (2014)
16. Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y., Herawan, T.: Big data clus-
tering: a review. In: Murgante, B., Misra, S., Rocha, A.M.A.C., Torre, C.,
Rocha, J.G., Falcão, M.I., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA
2014. LNCS, vol. 8583, pp. 707–720. Springer, Heidelberg (2014). doi:10.1007/
978-3-319-09156-3 49
17. Slavakis, K., Giannakis, G.B., Mateos, G.: Modeling and optimization for big data
analytics: (statistical) learning tools for our era of data deluge. IEEE Sig. Process.
Mag. 31(5), 18–31 (2014)
18. Subasinghe, K., Kodithuwakku, S.R.: A big data analytic identity management
expert system for social media networks. In: 2015 IEEE International WIE Con-
ference on Electrical and Computer Engineering (WIECON-ECE), pp. 126–129
(2015)
19. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 3rd edn. Academic Press,
Inc., Orlando (2006)
20. Vergara, A., Vembu, S., Ayhan, T., Ryan, M.A., Homer, M.L., Huerta, R.: Chem-
ical gas sensor drift compensation using classifier ensembles. Sensors Actuators B:
Chem. 166–167, 320–329 (2012)
Analyzing Big Security Logs in Cluster
with Apache Spark
1 Introduction
A log can be defined as a record of an event [1]. A log is composed of log entries,
which are generally called log messages. A log message, or a log entry, in broad
terms is a message generated by computing and communication systems and appli-
cations in response to an event or events that happened to them. Logs are in
general simple text files to which new log messages are appended. Log mes-
sages are the primary source of information for identifying operational problems,
policy violations, and security incidents. Logging is also used for auditing and
monitoring [2].
Logs, whether or not they are generated with security in mind, generally contain
valuable security-related information. However, security log analytics has become
a Big Data problem in terms of the logs' massive size, varied or undefined
formats, diverse sources, variable generation rates, and veracity. In order to cope
with these challenges, security log analytics should exploit Cloud Computing
capabilities and Big Data analytics processes.
In this study, Apache Spark is used as a cloud Big Data analytics framework
for security log analytics. It is shown that Apache Spark provides a very suitable
platform for exploratory or batch security log analysis in a cloud environment.
In the following two sections, the properties of Apache Spark and of security logs
are briefly introduced; then, through a concrete problem, Apache Spark's security
log analysis capabilities are demonstrated on a real-world dataset.
2 Apache Spark
3 Security Logs
Logs are generated by all sorts of applications, servers, network devices, and even
embedded devices. There are many types of logs, created for different purposes
such as measuring the performance of a system, tracking security-related information,
and debugging applications.
1 http://spark.apache.org.
2 http://hadoop.apache.org.
3 http://www.mathworks.com.
4 http://www.r-project.org.
Many types of logs can be used to track security-related information. These
logs, in broad terms, can be classified as security or security-related logs. They
include:
Security software logs, which are intentionally generated to track security issues.
Security-related software includes intrusion detection and prevention
systems, antivirus and malware software, remote access software, proxy servers,
vulnerability management software, authentication servers, routers, and firewalls.
Operating system logs, such as system events and audit records.
Application logs, such as those of web, database, e-mail, FTP, or other specific
application servers. There are many types of applications that generate logs. From a
security standpoint, the logs of these applications contain useful data for analyzing or
tracking security issues. Some of the valuable data found in these logs are account
information, client/server interactions, used resources, and other significant
events such as failures.
There are challenges in terms of log management, storage, and analytics. Some
of them are as follows.
Variety of Sources: Logs are generated by different types of applications; even
a single application can create different logs for different purposes. These logs
need to be combined for management and analysis purposes.
Variety of Log Contents: Logs created by different applications focus on
different contents, and their levels of detail differ considerably.
Variety of Log Formats: There are very few common log formats and fields.
It is usually difficult to correlate different logs' contents for specific events. Some
logs are created for humans to read, others for log-processing appli-
cations. Most applications' log schemas are not well defined, and implementing
parsers for them is difficult. The terminology used may differ for the same
subject matter. The tags of the fields may be abbreviated, misspelled, or simply
different. The model or schema of application logs may change quickly with
the evolution of the applications. Combined logs generally mix struc-
tured, semi-structured, and unstructured parts originating from different applica-
tions, which makes combined logs practically unstructured.
Variety of Time Formats: Time data is very important for correlating security-
related events. Different time formats or missing granularity in timestamps greatly
affect security analysis in combined logs.
Volume: Log management is increasingly becoming a Big Data problem. Even for a
web site with moderate traffic, one day's log may consume gigabytes of disk space,
depending on the level of detail at which logs are recorded. The volume of the activities
that need to be logged also impacts storage requirements. The elasticity and
scalability of the underlying infrastructure may help to handle the storage needs.
Velocity: On high-volume web sites, transactions create immense logs, and aggre-
gating these logs in real time may create bottlenecks in the network as well as
on log servers, depending on the infrastructure. Real-time analysis of important
logs may not be possible because of the high velocity and volume of the logs;
log reduction techniques may need to be applied.
Veracity: Authentication of log sources, integrity of logs, and the level of noise in
the logs all affect the value of log data. It is easy to miss an important event
by assigning it the wrong category or filtering at the wrong level. It is also easy to
generate too much noise and disguise the valuable information in the logs.
(NAS). Logs from nodes with log-forwarding capability, such as the firewall, the
router, some PCs running Linux, and the NAS itself, are directed to the NAS's
syslog server. Logs are accumulated in a Network File System (NFS)-accessible
directory of the NAS. Even in such a relatively simple environment,
around 50 MB of logs are generated daily. At the time of this study, around 45 GB
of logs had been accumulated. 45 GB of data may or may not be Big Data in terms of
the available analytic infrastructure; however, for an average analytic environment
that is not clustered, it easily becomes a Big Data problem in terms of processing
capabilities and the size of the analytic data to work on.
For the demonstration of Apache Spark's capabilities for analyzing security
logs, the Internet addresses most frequently contacted by the network devices will be
extracted and statistics will be gathered. Firewall logs are the primary source
of security information; however, firewall log messages are intermixed with log
messages from various other devices in the combined syslog. The general problem
will be tackled as follows.
– Creating a cluster with Hadoop and Apache Spark and storing the log data
in the Hadoop Distributed File System (HDFS).
– Creating a Spark application that performs (in parallel):
• Filtering firewall logs from the syslog for outbound traffic
• Creating MapReduce source-destination pairs
• Grouping the most used IPs per source
• Outputting the statistics
A tiny Spark cluster, with four PCs with 2nd and 3rd generation i5 processors
and 8 GB of RAM each, on Ubuntu 14.04 LTS Server OS, was created to be able to
analyze the network security logs. With Spark and Hadoop HDFS, a tiny clus-
ter with 32 GB of RAM and 12 cores is easily set up from relatively inexpensive
consumer hardware. Building Spark and configuring it on the cluster nodes man-
ually is a somewhat labor-intensive endeavor; however, there are many cluster
provisioning tools available which handle the automatic installation and configura-
tion of cluster nodes. It would be easier to use one with the instructions that can
easily be found on the Internet. One such tool is Ansible.
Apache Spark provides several programming language interfaces. Because of
its very terse and easy-to-use syntax, the Python API is used here. For the solution of
the problem, a sample mostusedips.py Spark application was developed. Some
important parts of the source code, with explanations, are provided in the
following section.
lines = sc.textFile("hdfs://path/to/file")
Listing 1.1 shows a way of filtering a log file based on source and destination
IP patterns. All the operations on RDDs are handled in parallel.
The filter_domain_ip pattern-search function passed to lines.filter
evaluates all lines in parallel and filters out every line that lacks the source and
destination patterns. The resulting domain_lines is an RDD that contains only lines
with appropriate source and outbound destination IP addresses.
import re

# source must be an internal 192.168.1.x address
srcpat = re.compile(r'src="(192\.168\.1\.\d{1,3}):(\d+)')
# destinations inside the LAN are excluded (outbound traffic only)
dstpat_internal = re.compile(r'dst="(192\.168\.1\.\d{1,3}):(\d+)')
dstpat = re.compile(r'dst="(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d+)')

def filter_domain_ip(line):
    should_filter = (
        (srcpat.search(line) and
         dstpat.search(line)) and
        (not dstpat_internal.search(line))
    )
    return should_filter

domain_lines = lines.filter(filter_domain_ip)
Listing 1.2 shows the extraction of source and destination IP pairs from the fil-
tered log lines domain_lines, creating the log messages as tuples. The expected
output is presented in Listing 1.3. These pairs are the mapping part of the first
map-reduce operation.
The extract_src_dst_info transformation function passed to domain_lines.map
evaluates all the lines in parallel and transforms every line into a (src, dst) pair.
def extract_src_dst_info(report):
    srcmatches = srcpat.search(report)
    if srcmatches:
        srcip, srcport = srcmatches.groups()
        dstmatches = dstpat.search(report)
        if dstmatches:
            dstip, dstport = dstmatches.groups()
            return (srcip, dstip)
        else:
            print('ERROR in line {}: dst does not match'.format(report))
    else:
        print('ERROR in line {}: src does not match'.format(report))

src_dst_pairs = domain_lines.map(extract_src_dst_info)
(src1, dst10)
(src1, dst1)
(src1, dst2)
(src2, dst5)
(src1, dst3)
(src3, dst7)
(src1, dst2)
(src2, dst4)
...
average programming skills and some Apache Spark API knowledge. The heavy
lifting of parallel operation and distribution of work is handled automatically,
behind the scenes, by Spark.
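The grouping and statistics steps of mostusedips.py are not shown in this excerpt; a minimal sketch of one possible continuation is given below, assuming the src_dst_pairs RDD from Listing 1.2 (the top-10 cut-off is an arbitrary illustrative choice).

from operator import add

pair_counts = (src_dst_pairs
               .filter(lambda p: p is not None)  # drop lines that failed to match
               .map(lambda p: (p, 1))
               .reduceByKey(add))                # count each (src, dst) pair

top_per_src = (pair_counts
               .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
               .groupByKey()                     # all destinations of one source
               .mapValues(lambda ds: sorted(ds, key=lambda d: -d[1])[:10]))

for src, dsts in top_per_src.collect():
    print(src, dsts)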
6 Conclusion
In this study, it was demonstrated that Apache Spark enables large-scale log analy-
sis in interactive and batch processing environments. With its flexible and exten-
sive programming interface and built-in libraries, it is easy to implement sophisti-
cated parallel analytic tasks without knowing much about the cluster computing
details. Apache Spark provides an excellent environment for security log analysis
and mining. The demo application can easily be extended, and more sophisticated
security analyses can be performed. As future work, more complex security
analyses will be performed, along with benchmark comparisons on a cluster with
several nodes versus a single-node multicore computer, based on different sizes of logs.
Moreover, by using MLlib and the Streaming API, the real-time log analysis capabilities
of Apache Spark will be investigated for security log analysis.
References
1. Kent, K., Souppaya, M.: Guide to computer security log management. Recom-
mendations of the National Institute of Standards and Technology (2006). http://
nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-92.pdf. Accessed
10 Jan 2016
2. Fekete, R.: Log message classification with syslog-ng (2010). http://lwn.net/
Articles/369075/. Accessed 10 Jan 2016
3. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin,
M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstrac-
tion for in-memory cluster computing. In: Proceedings of the 9th USENIX Confer-
ence on Networked Systems Design and Implementation, NSDI 2012, p. 2. USENIX
Association, Berkeley (2012)
4. Spark, A.: Apache spark web site. http://spark.apache.org. Accessed 10 Jan 2016
5. Kreps, J.: The log: What every software engineer should know about real-time
data’s unifying abstraction (2013). https://engineering.linkedin.com/distributed-
systems/log-what-every-software-engineer-should-know-about-real-time-datas-
unifying. Accessed 14 Jan 2016
6. Dean, A.: The three eras of business data processing (2014). http://snowplow
analytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing.
Accessed 14 Jan 2016
7. codecondo.com: 8 tools for log monitoring and processing big data (2014). http://
codecondo.com/8-tools-for-log-monitoring-and-processing-big-data. Accessed 14
Jan 2016
8. Kobielus, J.: Big data log analysis thrives on machine learning (2014). http://
www.infoworld.com/article/2608064/big-data/big-data-log-analysis-thrives-on-
machine-learning.html. Accessed 14 Jan 2016
9. Zeltser, L.: Critical log review checklist for security incidents (2015). https://
zeltser.com/security-incident-log-review-checklist. Accessed 14 Jan 2016
10. databricks.com: Log analysis with Spark. https://databricks.gitbooks.io/databricks-
spark-reference-applications/content/logs_analyzer/index.html. Accessed 14 Jan
2016
Delay Prediction System for Large-Scale
Railway Networks Based on Big Data Analytics
1 Introduction
Big Data Analytics is one of the current trending research topics in the context
of railway transportation systems. Indeed, many aspects of the railway world
can greatly benefit from new technologies and methodologies able to collect,
store, process, analyze and visualize large amounts of data [34,37], e.g., condition-
based maintenance of railway assets [9,25], alarm detection with wireless sensor
networks [20], passenger information systems [23], risk analysis [8], and the like.
In particular, this paper focuses on predicting train delays in order to improve
traffic management and dispatching using Big Data Analytics, scaling up to large
railway networks.
Although trains should respect a fixed schedule called the "nominal timetable",
delays occur daily because of accidents, repair works, extreme weather condi-
tions, etc., and negatively affect railway operations, causing service disruptions
and, in the worst cases, losses. Rail Traffic Management Systems (TMSs) have
been developed to support the management of complex rail services and networks
and to increase operational efficiency by allowing train dispatching through remote
control of signaling systems. By providing accurate train delay predictions to
TMSs, it becomes possible to greatly improve traffic management and dispatching
in terms of passenger information systems and perceived reliability [24], freight
tracking systems for improved customer decision-making, timetable planning
[5] by detecting recurrent delays, and delay management (rescheduling) [6].
Due to its key role, a TMS stores the information about every "train move-
ment", i.e., every train arrival and departure timestamp and delay at the "check-
points" monitored by signaling systems (e.g., a station, a switch, etc.). Datasets
composed of train movement records have been used as the fundamental data
source in every work addressing the problem of train delay prediction. For
instance, Milinkovic et al. [22] developed a Fuzzy Petri Net (FPN) model to
estimate train delays based both on expert knowledge and on historical train
movement data. Berger et al. [2] presented a stochastic model for delay propa-
gation and forecasting based on directed acyclic graphs. Goverde, Kecman et al.
[11,12,17,18] developed intensive research in the context of delay predic-
tion and propagation by using process mining techniques based on innovative
timed event graphs, on historical train movement data, and on expert knowl-
edge about the railway infrastructure. However, their models are based on classical
univariate statistics, while our solution integrates multivariate statistical con-
cepts that allow our models to be extended in the future by including other kinds
of data (e.g., weather forecasts, passenger flows, etc.). Moreover, these models
were not especially developed for Big Data technologies, possibly limiting their
adoption for large-scale networks. Last but not least, Pongnumkul et al. [29]
worked on data-driven models for train delay prediction, treating the problem
as a time series forecasting problem. The described system investigates the applica-
tion of ARIMA and k-NN models over limited train data, making it unsuitable
for Big Data.
For these reasons, this paper investigates the problem of predicting train
delays in large-scale railway networks by treating it as a time series forecasting
problem, where every train movement represents an event in time, and by exploit-
ing Big Data Analytics methodologies. The delay profiles of each train are used
to build a set of data-driven models that, working together, make it possible to
perform a regression analysis on past delay profiles and consequently to
predict future ones. The data-driven models exploit a well-known Machine
Learning algorithm, i.e., Extreme Learning Machines (ELMs), which has
been adapted to exploit typical Big Data parallel architectures. Moreover, the
data have been analyzed by using state-of-the-art Big Data technologies, i.e., Apache
Spark on Apache Hadoop, so that the system can be used for large-scale railway
networks. The described approach and the prediction system's performance have been
validated on the real historical data provided by Rete Ferroviaria Italiana
(RFI), the Italian Infrastructure Manager (IM) that controls all the traffic of
the Italian railway network. For this purpose, a set of novel Key Performance
Indicators (KPIs) agreed upon with RFI has been designed and used. Several months
of records from the entire Italian railway network have been exploited to show
that the newly proposed methodology outperforms the technique currently used by
RFI to predict train delays in terms of overall accuracy.
In Algorithm 1, τ and niter are parameters related to the speed of the optimization
algorithm; they are therefore usually set based on the experience of the user.
In any case, τ and niter can be seen as additional regularization terms, like λ,
since they are connected with the early stopping regularization technique
[4,30].
Algorithm 1 is well-suited for implementation in Spark, and many of the required
tools are already available in MLlib [21]. Basically, the implementation of
Algorithm 1 reported in Algorithm 2 is an application of two functions: a map
for the computation of the per-sample gradients and a reduce for summing the
single gradients.
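As a concrete illustration, the following is a minimal PySpark sketch of this map/reduce pattern. The squared loss, the function names, and the parameter handling are illustrative assumptions rather than the authors' implementation; τ is taken here as the step size and niter as the iteration budget, consistent with their role described above.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="sgd-map-reduce-sketch")

def grad_point(w, x, y, lam):
    # Per-sample gradient of a regularized squared loss (illustrative choice).
    return (np.dot(w, x) - y) * x + lam * w

def sgd(data, h, lam, tau, niter):
    # data: RDD of (phi_x, y) pairs, phi_x being the h-dimensional projection.
    n = data.count()
    w = np.zeros(h)
    for _ in range(niter):                   # niter: early-stopping budget
        g = data.map(lambda p: grad_point(w, p[0], p[1], lam)) \
                .reduce(lambda a, b: a + b)  # sum of the single gradients
        w = w - tau * g / n                  # tau: step size (assumption)
    return w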
The main problem of Algorithm 2 is the computation and storage of A. If
h ≪ d, the matrix A ∈ R^{n×h} will be much smaller than the dataset, which
belongs to R^{n×d}. In this case, it is more appropriate to compute A before the
SGD algorithm starts the iterative process and to keep it in memory (note that
the computation of A is fully parallel). In this way, all the data in R^{n×d}, projected
by φ into the matrix A ∈ R^{n×h}, can be largely kept in volatile memory (RAM)
instead of being read from disk. If instead h ≫ d, employing Algorithm 2 we
risk that A ∈ R^{n×h} does not fit into the RAM, consequently causing too many
disk accesses. For this reason, we adopt two different strategies:
Fig. 3. KPIs for the train and the itinerary depicted in Fig. 1
We compared the results of our simulations with those of the current train
delay prediction system used by RFI, which is quite similar to the one described
in [12]. In order to fairly assess the performance of the two systems, a set of novel
KPIs agreed with RFI has been designed and exploited. Since the purpose of this
work was to build predictive models able to forecast the train delays, these KPIs
represent different indicators of the quality of these predictive models. Based on
these considerations, three different indicators have been used, which are also
proposed in Fig. 3 in a graphical fashion:
– Average Accuracy at the i-th following Checkpoint for train j (AAiCj): for a
particular train j, we average the absolute value of the difference between the
predicted delay and the actual delay at the i-th Checkpoint following the
current one.
– AAiC: the average of AAiCj over the different trains j.
– Average Accuracy at Checkpoint i for train j (AACij): for a particular train j,
the average of the absolute value of the difference between the predicted delay
and the actual delay at the i-th checkpoint.
– AACi: the average of AACij over the different trains j.
– Total Average Accuracy for train j (TAAj): the average of AACij over the
different checkpoints i (or, equivalently, the average of AAiCj over the index i).
– TAA: the average of TAAj over the different trains j.
We ran the experiments by exploiting the Google Compute Engine [10] on the
Google Cloud Platform. We employed a four-machine cluster, each machine equipped
with two cores (machine type n1-highcpu-2), 1.8 GB of RAM, and a 30 GB HDD.
We used Spark 1.5.1 running on Hadoop 2.7.1, configured analogously to
[31]. The set of possible configurations of hyperparameters is defined as a set H,
where H = {(h, λ) : h ∈ G_h, λ ∈ G_λ} with G_h = 10^{1, 1.2, ···, 3.8, 4} and
G_λ = 10^{−6, −5.5, ···, 3.5, 4}. Finally, as suggested by the RFI experts, t_0 − δ is set
equal to the time, in the nominal timetable, of the origin of the train.
Table 1. ELM based and RFI prediction systems KPIs (in minutes).
In Table 1 we have reported the KPIs for the proposed prediction system
and for the one currently used by RFI. Note that train and checkpoint IDs
have been anonymized, and that only part of the results has been included in
the paper because of space constraints. Although the RFI system proved
to be robust and accurate during our simulations, our data-driven prediction
system managed to outperform it by a factor of 1.65 in terms of TAA, since it is
able to infer a model that takes into account the entire state of the network and
not just the local dependencies. The accuracy of the ELM-based method is quite
homogeneous with respect to different trains and stations. Moreover, AAiCj
grows with i, as expected (the further the prediction lies in the future, the more
uncertain it is). Finally, note that there are blank spaces in the
tables: this means, for example, that for AAiCj some trains have fewer checkpoints
(at least in the dataset that we analysed), or that for AACij different trains run
over different checkpoints.
As a final remark, we underline that if only Algorithm 2 is exploited, rather
than the proposed combination of Algorithms 2 and 3 based on d and h, training
all the models of our simulation requires 15 hours, whereas the combined strategy
needs approximately one hour. In other words, our optimization strategy is 15
times faster than the naive approach.
Our future work will also take into account exogenous information available
from external sources. For example, we will include in the models information
about passenger flows (using touristic databases), weather conditions (using
the Italian National Weather Service), railway asset conditions, and any other
source of data that may affect railway dispatching operations.
References
1. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: In-sample and out-of-sample model
selection and error estimation for support vector machines. IEEE Trans. Neural
Netw. Learn. Syst. 23(9), 1390–1406 (2012)
2. Berger, A., Gebhardt, A., Müller-Hannemann, M., Ostrowski, M.: Stochastic delay
prediction in large train networks. In: OASIcs-OpenAccess Series in Informatics
(2011)
3. Cambria, E., Huang, G.B.: Extreme learning machines. IEEE Intell. Syst. 28(6),
30–59 (2013)
4. Caruana, R., Lawrence, S., Lee, G.: Overfitting in neural nets: backpropagation,
conjugate gradient, and early stopping. In: Neural Information Processing Systems
(2001)
5. Cordeau, J.F., Toth, P., Vigo, D.: A survey of optimization models for train routing
and scheduling. Transp. Sci. 32(4), 380–404 (1998)
6. Dollevoet, T., Corman, F., D’Ariano, A., Huisman, D.: An iterative optimization
framework for delay management and train scheduling. Flex. Serv. Manuf. J. 26(4),
490–515 (2014)
7. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman & Hall,
New York (1993)
8. Figueres-Esteban, M., Hughes, P., Van Gulijk, C.: The role of data visualization
in railway big data risk analysis. In: European Safety and Reliability Conference
(2015)
9. Fumeo, E., Oneto, L., Anguita, D.: Condition based maintenance in railway trans-
portation systems based on big data streaming analysis. In: The INNS Big Data
conference (2015)
10. Google: Google Compute Engine (2016). https://cloud.google.com/compute/.
Accessed 3 May 2016
11. Goverde, R.M.P.: A delay propagation algorithm for large-scale railway traffic net-
works. Transp. Res. Part C: Emerg. Technol. 18(3), 269–287 (2010)
12. Hansen, I.A., Goverde, R.M.P., Van Der Meer, D.J.: Online train delay recogni-
tion and running time prediction. In: IEEE International Conference on Intelligent
Transportation Systems (2010)
13. Huang, G., Huang, G.B., Song, S., You, K.: Trends in extreme learning machines:
a review. Neural Netw. 61, 32–48 (2015)
14. Huang, G.B., Chen, L., Siew, C.K.: Universal approximation using incremental
constructive feedforward networks with random hidden nodes. IEEE Trans. Neural
Netw. 17(4), 879–892 (2006)
15. Huang, G.B., Zhou, H., Ding, X., Zhang, R.: Extreme learning machine for regres-
sion and multiclass classification. IEEE Trans. Syst. Man Cybern. Part B: Cybern.
42(2), 513–529 (2012)
16. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: a new learning
scheme of feedforward neural networks. In: IEEE International Joint Conference
on Neural Networks (2004)
17. Kecman, P.: Models for predictive railway traffic management (Ph.D. thesis). TU
Delft, Delft University of Technology (2014)
18. Kecman, P., Goverde, R.M.P.: Online data-driven adaptive prediction of train event
times. IEEE Trans. Intell. Transp. Syst. 16(1), 465–474 (2015)
19. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and
model selection. In: International Joint Conference on Artificial Intelligence (1995)
20. Li, H., Parikh, D., He, Q., Qian, B., Li, Z., Fang, D., Hampapur, A.: Improving rail
network velocity: a machine learning approach to predictive maintenance. Transp.
Res. Part C: Emerg. Technol. 45, 17–26 (2014)
21. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman,
J., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia,
M., Talwalkar, A.: MLlib: machine learning in apache spark. J. Mach. Learn. Res.
17(34), 1–7 (2016)
22. Milinković, S., Marković, M., Vesković, S., Ivić, M., Pavlović, N.: A fuzzy petri net
model to estimate train delays. Simul. Model. Prac. Theor. 33, 144–157 (2013)
23. Morris, C., Easton, J., Roberts, C.: Applications of linked data in the rail domain.
In: IEEE International Conference on Big Data (2014)
24. Müller-Hannemann, M., Schnee, M.: Efficient timetable information in the pres-
ence of delays. In: Ahuja, R.K., Möhring, R.H., Zaroliagis, C.D. (eds.) Robust
and Online Large-Scale Optimization. LNCS, vol. 5868, pp. 249–272. Springer,
Heidelberg (2009). doi:10.1007/978-3-642-05465-5_10
25. Núñez, A., Hendriks, J., Li, Z., De Schutter, B., Dollevoet, R.: Facilitating mainte-
nance decisions on the dutch railways using big data: the aba case study. In: IEEE
International Conference on Big Data (2014)
26. Oneto, L., Orlandi, I., Anguita, D.: Performance assessment and uncertainty quan-
tification of predictive models for smart manufacturing systems. In: IEEE Inter-
national Conference on Big Data (Big Data) (2015)
27. Oneto, L., Pilarz, B., Ghio, A., Anguita, D.: Model selection for big data: algorithmic
stability and bag of little bootstraps on gpus. In: European Symposium on Artificial
Neural Networks, Computational Intelligence and Machine Learning (2015)
28. Packard, N.H., Crutchfield, J.P., Farmer, J.D., Shaw, R.S.: Geometry from a time
series. Phys. Rev. Lett. 45(9), 712 (1980)
29. Pongnumkul, S., Pechprasarn, T., Kunaseth, N., Chaipah, K.: Improving arrival
time prediction of thailand’s passenger trains using historical travel times. In: Inter-
national Joint Conference on Computer Science and Software Engineering (2014)
30. Prechelt, L.: Automatic early stopping using cross validation: quantifying the cri-
teria. Neural Netw. 11(4), 761–767 (1998)
31. Reyes-Ortiz, J.L., Oneto, L., Anguita, D.: Big data analytics in the cloud: spark
on hadoop vs mpi/openmp on beowulf. In: The INNS Big Data Conference (2015)
32. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-
propagating errors. Cogn. Model. 5(3), 1 (1988)
33. Shoro, A.G., Soomro, T.R.: Big data analysis: apache spark perspective. Glob. J.
Comput. Sci. Technol. 15(1) (2015)
34. Thaduri, A., Galar, D., Kumar, U.: Railway assets: a potential domain for big data
analytics. In: The INNS Big Data conference (2015)
35. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw.
10(5), 988–999 (1999)
36. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin,
M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstrac-
tion for in-memory cluster computing. In: USENIX Conference on Networked Sys-
tems Design and Implementation (2012)
37. Zarembski, A.M.: Some examples of big data in railroad engineering. In: IEEE
International Conference on Big Data (2014)
An Empirical Comparison of Methods
for Multi-label Data Stream Classification
1 Introduction
So far, most of the work in the literature has dealt separately with the classification
of data streams and the classification of multi-label data. However,
many real-world problems involve data that are both streaming and multi-labeled.
This paper focuses on learning from multi-labeled data streams.
It reviews the related literature and experimentally compares state-of-the-art
multi-label methods, deriving several interesting conclusions.
In Sect. 2, we present some of the existing work regarding multi-label classi-
fication and data stream classification, while in Sect. 3 we discuss related work
on the classification of multi-label data streams, as well as the evaluation methods
and metrics suggested in the literature. This extensive literature
study gives a fairly complete picture of the problem of multi-label data stream
classification.
With a view to carrying out benchmarking experiments, some of the algo-
rithms studied in the literature were implemented under the MULAN library
[13]. In Sect. 4, we present the set of experiments that were conducted using
these algorithms. More specifically, we discuss the real-world data stream we
used, which is from the WISE 2014 Challenge [12], and list all of the algorithms
along with the settings needed to produce the best results. Moreover,
we provide an analysis of the results and discuss the conclusions.
Finally, Sect. 5 discusses the general conclusions of this study along with some
interesting future steps of this research.
2 Background
2.1 Multi-label Learning
A large number of supervised learning methods deal with the analysis of data
that are associated with a single label from a set of labels (i.e., values of a
discrete target variable). However, in many applications the training examples
can be associated with a set of labels from a larger finite set of labels. This kind
of data is called multi-label data [11].
In recent years, multi-label learning has caught the attention of many
researchers and has begun providing solutions to a growing number of new
applications, such as the semantic annotation of images and videos and music
categorization based on emotion. Moreover, data related to documents and
websites are most often associated with more than one label.
There are two different main tasks in supervised learning from multi-label
data. The first one is multi-label classification, which involves training a model
that outputs a bipartition of the set of labels into relevant and irrelevant with
respect to a new instance. The second task is called label ranking, namely train-
ing a model that outputs a ranked list of labels according to their relevance to
a new instance.
Multi-label learning methods can be grouped into two categories. The first,
called Problem Transformation, consists of algorithm-independent methods
that transform the multi-label learning task into one or more single-label
classification tasks. The second, called Algorithm Adaptation, comprises methods
that extend existing algorithms to handle multi-label data directly.
Two of the most popular libraries for multi-label learning are MULAN and
MEKA. MULAN [13] is a Java open source library based on Weka and used
for training models from multi-label datasets. This library provides a variety
of state of the art algorithms for multi-label classification as well as an evalua-
tion framework which calculates a wide variety of multi-label metrics. Similarly,
MEKA [8] provides open source Java implementations of methods and is based
on the Weka toolkit. The library contains several basic methods, methods for
pruned sets and classifier chains, methods from the scientific community, and a
wrapper for the MULAN platform.
In order to solve real-world problems within applications that use data mining
and machine learning, several technologies for the distributed processing of data
streams have been designed. These include S4 [6], Apache Storm¹, SAMOA [5],
and Apache Samza².
On the other hand, MOA (Massive Online Analysis) [2] is non-distributed
software for data stream mining. It is a programming environment that implements
algorithms and performs experiments for training models in real time on
evolving data streams.
3 Related Work
3.1 Techniques
In [16], the authors suggest the Multiple Windows Classification method, where
they deal with concept drift and class imbalance by creating two fixed-size
windows, one for the positive examples and one for the negative ones. They also
use the Binary Relevance transformation with the kNN algorithm as the base
classifier.
Another approach to multi-label stream classification, called SMART [4],
maintains several random trees to keep track of the labels that appear in the
data stream. It deals with the problem of concept drift and is able to model the
correlation between the labels as well as their joint sparseness.
¹ http://storm.apache.org/.
² http://samza.apache.org/.
In [15], a new method that deals with concept drift and class imbalance
was introduced. The Binary Relevance model is trained using Active Learning,
and a Minimal Classifier Uncertainty sampling function is used to choose the
most informative instance. Moreover, the concept drift problem is addressed
using a Maximum Posterior weight schema.
The MOA extensions described in [7] can also be useful for classifying
multi-label data streams. The authors compare different types of existing
transformations and present several improvements to them. Among those
methods are Pruned Sets, Pairwise Classification and Ranking, and Threshold.
Another tree suggested in [7], based on the Hoeffding Tree, is the Multi-label
Hoeffding Tree. The split of each node is based on the Multi-label Information
Gain, and the base classifier at each leaf is a Pruned Sets transformation with
Naive Bayes.
An improvement of the Multi-label Hoeffding Tree, using the Class Incremental
technique, was introduced in [9]. The authors propose an extra filtering step
that improves the performance of the model by training it only with the
instances that contain the most frequent combinations of labels.
3.2 Evaluation
An important factor for any 'intelligent' system is its evaluation methodology.
The solutions provided in the literature for data streams are either
the Holdout or the Prequential method. In the first case, the evaluation is based
on a withheld dataset, which is applied to the model at different points in time
in order to evaluate it. In the latter case, each instance is first used to evaluate
the model and only then to train it, so accuracy may improve over time.
Multi-label classification requires different evaluation metrics from
single-label classification. They are divided into three categories, as mentioned
in [10,17]: Example-based, Label-based, and Ranking-based metrics. The authors
of [18] provide an extensive analysis of an evaluation methodology that can be
applied to data streams with temporal dependence: the performance of the
classifier under consideration is compared to the performance of baseline
classifiers, which sets bounds that the examined classifier should exceed. The
evaluation metric suggested for data streams without temporal dependence is
called the Kappa Statistic, while the one for data streams with temporal
dependence is called the Kappa Temporal Statistic.
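To make the two statistics concrete, the sketch below combines the prequential test-then-train loop with the Kappa and Kappa Temporal computations as defined in [18]. It is shown for the single-label case for brevity; the model is assumed to expose scikit-learn-style predict/partial_fit methods, and the chance-agreement term pc of the Kappa Statistic has to be estimated from the observed label distribution.

def kappa(p0, pc):
    # Kappa Statistic: accuracy p0 corrected by the chance agreement pc.
    return (p0 - pc) / (1.0 - pc)

def kappa_temporal(p0, p_per):
    # Kappa Temporal Statistic: p0 corrected by the accuracy p_per of the
    # 'persistent' baseline that always predicts the previous label.
    return (p0 - p_per) / (1.0 - p_per)

def prequential(model, stream):
    # Test-then-train: each instance evaluates the model before training it.
    correct = persistent_correct = n = 0
    prev_y = None
    for x, y in stream:
        correct += (model.predict([x])[0] == y)  # test first ...
        persistent_correct += (prev_y == y)      # persistent baseline
        model.partial_fit([x], [y])              # ... then train
        prev_y, n = y, n + 1                     # (first call may need classes=)
    return correct / n, persistent_correct / n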
4 Empirical Comparison
4.1 Data
The initial set of data was collected by DataScouting³, which scanned a number
of Greek articles from May 2013 until September 2013. These articles were
³ http://www.datascouting.com/.
manually partitioned, and their text was extracted via OCR (Optical Character
Recognition) software. The text of the articles is represented using the bag-of-words
model. For each token found in the text of any article, the tf-idf statistic
was calculated, and each tf-idf value was normalized using unit normalization.
More information on this dataset and the corresponding challenge can
be found in [12].
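The preprocessing just described corresponds to standard tf-idf weighting followed by unit (L2) normalization of each document vector. A minimal equivalent in scikit-learn, assuming the raw OCR'ed article texts were available as plain strings, would be:

from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf weighting; norm="l2" performs the unit normalization per document.
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform(articles)  # articles: list of text strings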
To effectively carry out the experiments, we selected 16 out of the 203 labels.
To maintain balance in the distribution, we ensured that 12 of the 16 labels
appeared frequently in the dataset, while the remaining 4 occurred rarely. Both
the frequent and the rare labels were selected at random. As a result, the
training set was reduced from 64,857 to 23,240 articles and the test set from
34,923 to 12,162 articles.
We applied several algorithms to the same dataset in order to draw conclusions.
Regarding transformations, we used: Binary Relevance with Naive Bayes as the
base classifier; Binary Relevance with SGD using the hinge loss; and Binary
Relevance with SGD using the logistic loss, with the regularization parameter
set to 6. We also used the same algorithms with the difference that the Binary
Relevance method was executed incrementally: Binary Relevance Updateable
with Naive Bayes Updateable, Binary Relevance Updateable with SGD and the
hinge loss, and Binary Relevance Updateable with SGD and the logistic loss
with the regularization parameter set to 6. For every incremental algorithm we
used prequential evaluation on the test set.
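The experiments themselves were run with MULAN (Java/Weka). For readers more familiar with Python, an analogous Binary Relevance setup can be sketched with scikit-learn, where one independent SGD classifier is trained per label; note that scikit-learn's regularization parameter does not necessarily correspond one-to-one to the value 6 used above.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

# Binary Relevance: with a 0/1 label indicator matrix Y, OneVsRestClassifier
# trains exactly one independent binary SGDClassifier per label.
br_hinge = OneVsRestClassifier(SGDClassifier(loss="hinge"))    # hinge loss
br_log = OneVsRestClassifier(SGDClassifier(loss="log_loss"))   # logistic loss
# ("log_loss" is named "log" in older scikit-learn versions)

br_hinge.fit(X_train, Y_train)   # Y_train: (n_samples, n_labels) 0/1 matrix
Y_pred = br_hinge.predict(X_test)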
In addition, we tried the Multi-label Windows Classifier with kNN as the
base classifier. The number of neighbors was set to 11; the window of positive
examples was given size 400 and the window of negative examples size 600.
Furthermore, the threshold was updated every 1000 examples.
Regarding trees, we tried three different algorithms. First, we tested the
Multi-label Hoeffding Tree with the Pruned Sets Updateable transformation and
Naive Bayes Updateable at the leaves. The pruning value was set to 1, and the
tree leaves were initialized with 1/3 of the training set; the rest of the training
set was used to train the whole model, with prequential evaluation. Similarly,
we tried the Multi-label Hoeffding Tree with the Class Incremental method at
the leaves, along with the modified Naive Bayes Class Incremental method. The
buffer size for the Class Incremental method was set to 200 and the pruning
value to 1. As before, the tree leaves were initialized with 1/3 of the training
set, the rest being used to train the whole model, again with prequential
evaluation.
Lastly, we trained a model with the SMART algorithm, where the number of
random trees was set to 20, their height to 15, and the fading factor to 200. As
in all incremental cases, the test set was evaluated prequentially.
The most significant observation from the experiments is that the algorithms
using incremental transformations outperform the non-incremental ones. This
observation is clearest in the following comparisons:
– Binary Relevance with Naive Bayes vs. Binary Relevance Updateable with Naive
Bayes Updateable
– Binary Relevance with SGD (Hinge Loss) vs. Binary Relevance Updateable with
SGD (Hinge Loss)
– Binary Relevance with SGD (Logistic Regression) vs. Binary Relevance Updateable
with SGD (Logistic Regression)
The results of the experiments show that, in general, Binary Relevance
Updateable with SGD (Hinge Loss) has the best performance on the Example-based
and Label-based metrics. Second best appears to be the static version of the
same model. Next in the list we observe Binary Relevance Updateable with SGD
(Logistic Regression) and its non-incremental form. It is important to emphasize
that these last two models have the best performance on the Ranking-based
evaluation metrics and therefore surpass the remaining models in ranking
ability.
Observing the values of the Example-based F-Measure, Micro-averaged
F-Measure, and Macro-averaged F-Measure, we notice that the experiments
involving trees do not perform as well as the transformations mentioned earlier.
This is primarily because the trees do not perform at the desired level on
high-dimensional data such as ours. However, comparing the three tree models,
we observe that on the Label-based and Example-based metrics the Multi-label
Hoeffding Tree with Class Incremental learning performs best, due to its
extensive filtering of the examples. On the Ranking-based metrics, the SMART
model outperforms the rest.
Finally, regarding the evaluation measures Kappa Statistic and Kappa Temporal
Statistic, which evaluate the classification of data streams, the best performance
was observed for Binary Relevance Updateable with SGD (Hinge Loss)
as well as for Binary Relevance Updateable with SGD (Logistic Regression).
5 Conclusions
In this work, we studied and implemented methods for classifying data streams
with multiple labels. We surveyed existing methods proposed in the literature
both for multi-label data classification and for data stream classification.
Moreover, we analyzed the evaluation methods and metrics that concern
multi-label data classification. We then implemented the most important methods
and performed a number of experiments on the DataScouting dataset. The
experiments included both models combining transformations with base classifiers
and various other algorithms and tree-based models. From these, the following
conclusions emerged:
References
1. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: On demand classification of data
streams. In: Proceedings of the Tenth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD 2004, pp. 503–508. ACM, New York
(2004). http://doi.acm.org/10.1145/1014052.1014110
2. Bifet, A., Holmes, G., Pfahringer, B., Kranen, P., Kremer, H., Jansen, T., Seidl, T.:
Moa: massive online analysis, a framework for stream classification and clustering.
In: Invited Presentation at the International Workshop on Handling Concept Drift
in Adaptive Information Systems in Conjunction with European Conference on
Machine Learning and Principles and Practice of Knowledge Discovery in Data-
bases (ECML PKDD 2010), pp. 3–16 (2010)
3. Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of the
Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, KDD 2000, pp. 71–80. ACM, New York (2000). http://doi.acm.org/10.
1145/347090.347107
4. Kong, X., Yu, P.S.: An ensemble-based approach to fast classification of multi-label
data streams. In: Georgakopoulos, D., Joshi, J.B.D. (eds.) 7th International Con-
ference on Collaborative Computing: Networking, Applications and Worksharing,
CollaborateCom 2011, Orlando, FL, USA, pp. 95–104. ICST/IEEE, 15-18 October
2011. http://dx.doi.org/10.4108/icst.collaboratecom.2011.247086
5. Morales, G.D.F., Bifet, A.: Samoa: scalable advanced massive online analysis.
J. Mach. Learn. Res. 16, 149–153 (2015). http://jmlr.org/papers/v16/morales
15a.html
6. Neumeyer, L., Robbins, B., Nair, A., Kesari, A.: S4: distributed stream com-
puting platform. In: Proceedings of the 2010 IEEE International Confer-
ence on Data Mining Workshops, ICDMW 2010, pp. 170–177. IEEE Com-
puter Society, Washington, DC (2010). http://dx.doi.org/10.1109/ICDMW.
2010.172
7. Read, J., Bifet, A., Holmes, G., Pfahringer, B.: Scalable and efficient multi-label
classification for evolving data streams. Mach. Learn. 88(1–2), 243–272 (2012).
http://dx.doi.org/10.1007/s10994-012-5279-6
8. Read, J., Reutemann, P., Pfahringer, B., Holmes, G.: MEKA: a multi-
label/multi-target extension to Weka. J. Mach. Learn. Res. 17(21), 1–5 (2016).
http://jmlr.org/papers/v17/12-164.html
9. Shi, Z., Xue, Y., Wen, Y., Cai, G.: Efficient class incremental learning for multi-
label classification of evolving data streams. In: 2014 International Joint Conference
on Neural Networks, IJCNN 2014, Beijing, China, pp. 2093–2099. IEEE, 6-11 July
2014. http://dx.doi.org/10.1109/IJCNN.2014.6889926
10. Tsoumakas, G., Vlahavas, I.: Random k -labelsets: an ensemble method for mul-
tilabel classification. In: Kok, J.N., Koronacki, J., Mantaras, R.L., Matwin, S.,
Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 406–
417. Springer, Heidelberg (2007). doi:10.1007/978-3-540-74958-5 38
11. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Maimon, O.,
Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, 2nd edn, pp.
667–685. Springer, Heidelberg (2010). Chap. 34
12. Tsoumakas, G., et al.: WISE 2014 challenge: multi-label classification of print
media articles to topics. In: Benatallah, B., Bestavros, A., Manolopoulos, Y.,
Vakali, A., Zhang, Y. (eds.) Web Information Systems Engineering - WISE 2014.
LNCS, vol. 8787, pp. 541–548. Springer, Heidelberg (2014)
An Empirical Comparison of Methods 159
13. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: Mulan: a java
library for multi-label learning. J. Mach. Learn. Res. (JMLR) 12, 2411–2414 (2011)
14. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using
ensemble classifiers. In: Proceedings of the Ninth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD 2003, pp. 226–235.
ACM, New York (2003). http://doi.acm.org/10.1145/956750.956778
15. Wang, P., Zhang, P., Guo, L.: Mining multi-label data streams using ensemble-
based active learning. In: SDM, pp. 1131–1140 (2012)
16. Xioufis, E.S., Spiliopoulou, M., Tsoumakas, G., Vlahavas, I.P.: Dealing with con-
cept drift and class imbalance in multi-label stream classification. In: Walsh, T.
(ed.) IJCAI, pp. 1583–1588. IJCAI/AAAI (2011)
17. Zhang, M.L., Zhou, Z.H.: ML-KNN: a lazy learning approach to multi-label learn-
ing. Pattern Recogn. 40(7), 2038–2048 (2007)
18. Zliobaite, I., Bifet, A., Read, J., Pfahringer, B., Holmes, G.: Evaluation methods
and decision theory for classification of streaming data with temporal dependence.
Mach. Learn. 98(3), 455–482 (2015). http://dx.doi.org/10.1007/s10994-014-5441-4
Extended Formulations for Online Action
Selection on Big Action Sets
1 Introduction
Consider the example of a large-scale wireless network. The data obtained from
such a network have both intrinsic structure (paths far apart have similar network
traffic density) and extrinsic structure (paths locally connect to each other). Such
data are often represented as a combinatorial graph, where every path (i.e., set of
edges) denotes an action in an action set; an action is a route from the source
to the destination. Now consider the problem of sequentially selecting a route
for transferring packets of information over the wireless network that is optimal
at every time step. If the number of edges in the network is m, the
search space for selecting a path has size 2^m, since every edge may or may
not be present on the selected path. The space is thus exponential in the size of the
problem, which is overwhelming in the case of big data from multiple sensors
and streaming-data cloud servers. It is therefore important to design sequential
algorithms that can adaptively select an optimal path at every time step.
$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} a_t^{\top} z_t\right] - \min_{a \in \mathcal{A}} \mathbb{E}\left[\sum_{t=1}^{T} a^{\top} z_t\right]. \qquad (1)$$
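For reference, the exponential-weights principle underlying the Exp2/Exp3 family [4] can be sketched as follows for a finite action set with losses in [0, 1]; this is the textbook Exp3 update with importance-weighted loss estimates, not the extended formulation developed in this paper.

import numpy as np

def exp3(n_actions, T, get_loss, gamma=0.05, rng=np.random.default_rng(0)):
    # Textbook Exp3 [4]: exponential weights with uniform exploration.
    w = np.ones(n_actions)
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / n_actions
        a = rng.choice(n_actions, p=p)
        loss = get_loss(a, t)            # bandit feedback: chosen action only
        estimate = loss / p[a]           # importance-weighted loss estimate
        w[a] *= np.exp(-gamma * estimate / n_actions)
    return w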
Definition 4. As long as the weight matrix M_t has positive rank, a non-negative
factorization M = RU is guaranteed to exist, where R and U are non-negative
factors. In general, there is a minimum k such that
$M_{i,j} = \sum_k R_{i,k} U_{k,j}$.
$$M = \begin{pmatrix} 0 & 0 & 1 & 2 & 2 & 1\\ 1 & 0 & 0 & 1 & 2 & 2\\ 2 & 1 & 0 & 0 & 1 & 2\\ 2 & 2 & 1 & 0 & 0 & 1\\ 1 & 2 & 2 & 1 & 0 & 0\\ 0 & 1 & 2 & 2 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 & 0 & 0\\ 1 & 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 1 & 2\\ 0 & 1 & 0 & 0 & 1\\ 0 & 1 & 1 & 0 & 0\\ 0 & 0 & 2 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0 & 0 & 0 & 1 & 2 & 1\\ 1 & 2 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 1 & 0 & 0\\ 0 & 1 & 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$
Fig. 2. Slack matrix factorization. Slack matrix for the regular hexagon M factored
into non-negative matrices. Example taken from [16].
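The factorization in Fig. 2 can be checked numerically; the following snippet verifies M = RU for the transcribed matrices.

import numpy as np

M = np.array([[0, 0, 1, 2, 2, 1], [1, 0, 0, 1, 2, 2], [2, 1, 0, 0, 1, 2],
              [2, 2, 1, 0, 0, 1], [1, 2, 2, 1, 0, 0], [0, 1, 2, 2, 1, 0]])
R = np.array([[1, 0, 1, 0, 0], [1, 0, 0, 0, 1], [0, 0, 0, 1, 2],
              [0, 1, 0, 0, 1], [0, 1, 1, 0, 0], [0, 0, 2, 1, 0]])
U = np.array([[0, 0, 0, 1, 2, 1], [1, 2, 1, 0, 0, 0], [0, 0, 1, 1, 0, 0],
              [0, 1, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1]])
assert (M == R @ U).all()  # non-negative factorization with inner dimension 5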
4 Experimental Evaluation
We compared the extended exponential weighted approach with the state-of-
the-art exponential weighted algorithms in the adversarial linear optimization
bandit setting.
4.1 Simulations
In the first experiment, the learning algorithm must select the optimal route
in a d-dimensional network of routes. Typically, we choose d to vary between
10 and 15. The environment is an oblivious adversary to the player's actions,
simulated to choose fixed but unknown losses at each round. Losses are bounded
in the range [0, 1]. The learning algorithm is our basic Extended Exp algorithm.
Each action is represented as a d-dimensional incidence vector, with 1 indicating
that an edge is present in the route and 0 otherwise. Figure 3 displays the results
for the cases d = 10, 15, and 20. The performance measure is the instantaneous
regret over time; we use pseudo-regret here.
Fig. 3. Simulation results with the network dataset, averaged over 100 runs. Extended
Exp2 (BLACK) beats the baseline Exp2 (RED). (a) Network dataset with 10 dimensions.
(b) Network dataset with 15 dimensions.
The number of iterations in the game is 10,000 for d = 10 and 100
for d = 15. In both cases, the results are averaged over 100 runs of the games. We
implemented Algorithm 1, Exp2 [7], Exp3 [4], Exp3.P [4], and CombBand [10].
Our baseline is Exp2, which has the best known performance but is provably
suboptimal. We could not compare with Exp2 with John's exploration [8], as its
authors note its computational inefficiency. All the experiments were implemented
in Matlab on a notebook with an i7-2820QM CPU at 2.30 GHz and 8 GB of RAM.
We see that Extended Exp2 clearly beats the baseline against the
oblivious adversary. We repeated the experiments with different configurations of
the network and different dimensions; each time, the complexity of the problem
increases exponentially with the dimension. Extended Exp2 performed best in all
our experiments; the results of the other trials are excluded for brevity.
Fig. 4. Dataset results with the Jester Online Joke Recommendation dataset with
20 dimensions. Results averaged over 100 runs. Extended Exp2 (BLACK) beats the
baseline Exp2 (RED).
The experiments were again run in Matlab on the same notebook (i7-2820QM
CPU at 2.30 GHz with 8 GB of RAM). We observe that once again Extended Exp2
beats all the others. Quite surprisingly, Exp2 and CombBand seem to perform
equivalently in the non-oblivious setting. It would be interesting to compare with
mirror-descent-based linear optimization; that is left for future work.
References
1. Abernethy, J., Hazan, E., Rakhlin, A.: Competing in the dark: an efficient algo-
rithm for bandit linear optimization. In: Proceedings of the 21st Annual Conference
on Learning Theory (COLT), vol. 3, p. 3 (2008)
2. Audibert, J.-Y., Bubeck, S., Lugosi, G.: Minimax policies for combinatorial predic-
tion games. arXiv preprint arXiv:1105.4871 (2011)
3. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: Gambling in a rigged casino:
the adversarial multi-armed bandit problem. In: Proceedings of the 36th Annual
Symposium on Foundations of Computer Science, 1995, pp. 322–331. IEEE (1995)
4. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multi-
armed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002)
5. Balas, E., Ceria, S., Cornuéjols, G.: A lift-and-project cutting plane algorithm for
mixed 0–1 programs. Math. Program. 58(1–3), 295–324 (1993)
6. Ball, K.: An elementary introduction to modern convex geometry. Flavors Geom.
13, 13 (1997)
7. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-
armed bandit problems. Found. Trends Mach. Learn. 5(1), 64–87 (2012)
8. Bubeck, S., Cesa-Bianchi, N., Kakade, S.M.: Towards minimax policies for online
linear optimization with bandit feedback. In: JMLR Workshop and Conference
Proceedings Volume 23: COLT 2012 (2012)
9. Cesa-Bianchi, N., Gaillard, P., Lugosi, G., Stoltz, G.: Mirror descent meets fixed
share (and feels no regret). arXiv preprint arXiv:1202.3323 (2012)
10. Cesa-Bianchi, N., Lugosi, G.: Combinatorial bandits. J. Comput. Syst. Sci. 78(5),
1404–1422 (2012)
11. Chandrasekaran, V., Jordan, M.I.: Computational and statistical tradeoffs via con-
vex relaxation. In: Proceedings of the National Academy of Sciences (2013)
12. Conforti, M., Cornuéjols, G., Zambelli, G.: Extended formulations in combinatorial
optimization. 4OR 8(1), 1–48 (2010)
13. Dani, V., Hayes, T., Kakade, S.M.: The price of bandit information for online
optimization. In: Advances in Neural Information Processing Systems, vol. 20, pp.
345–352 (2008)
14. Fiorini, S., Massar, S., Pokutta, S., Tiwary, H.R., de Wolf, R.: Linear vs. semidef-
inite extended formulations: exponential separation and strong lower bounds. In:
Proceedings of the 44th Symposium on Theory of Computing, pp. 95–106. ACM
(2012)
15. Goldberg, K.: Anonymous Ratings from the Jester Online Joke Recommender Sys-
tem (2003). Accessed 03 Oct 2013
16. Gouveia, J., Parrilo, P.A., Thomas, R.: Lifts of convex sets and cone factorizations.
Math. Oper. Res. 38, 248–264 (2011)
17. Nesterov, Y., Nemirovskii, A.S., Ye, Y.: Interior-Point Polynomial Algorithms in
Convex Programming, vol. 13. SIAM, Philadelphia (1994)
18. Yannakakis, M.: Expressing combinatorial optimization problems by linear pro-
grams. J. Comput. Syst. Sci. 43(3), 441–466 (1991)
A-BIRCH: Automatic Threshold Estimation
for the BIRCH Clustering Algorithm
1 Introduction
Clustering is an unsupervised learning method that groups a set of given data
points into well separated subsets. Two prominent examples of clustering algo-
rithms are k-means [9] and the EM algorithm [4]. This paper addresses two
issues with clustering: (1) clustering algorithms usually do not scale well and
(2) most algorithms require the number of clusters (cluster count) as input. The
first issue is becoming more and more important. For applications that need to
cluster, e.g., millions of documents, huge image or video databases, or terabytes
of IoT sensor data, scalability is essential. The second issue severely reduces the
applicability of clustering in situations where the cluster count is very difficult to
predict, such as data exploration, feature engineering, and document clustering.
An important clustering method is BIRCH [17], which is one of the fastest
clustering algorithms available. It outperforms most of the other clustering algo-
rithms by up to two orders of magnitude. Thus, BIRCH already solves the first
issue mentioned above. However, to achieve sufficient clustering quality, BIRCH
requires the cluster count as input, therefore failing to solve the second issue.
This paper describes a method to use BIRCH without having to provide the
cluster count, yet preserving cluster quality and speed. We achieve this by omit-
ting the global clustering step and carefully choosing the threshold parameter
of BIRCH. Following an idea by Bach and Jordan [7], we propose to learn this
parameter from representative data. Our approach targets datasets drawn from
two-dimensional isotropic Gaussian distributions, which are typical when dealing
with, for example, geospatial data.
2 Related Work
Clustering algorithms usually do not scale well, because they often have a
complexity of O(N²) or O(N M), where N is the number of data points and M
is the cluster count. Scalability is typically achieved by parallelization of the
algorithm in compute clusters, e.g., Mahout’s k-means clustering [11] or Spark’s
distributed versions of k-means, EM clustering, and power iteration [10]. Other
parallelization attempts use the GPU. This has been done for k-means [16], EM
clustering [8], and others. The bottleneck here is the relatively slow connection
between host and device memory if the data does not fit into device memory.
The second issue we are concerned with is the identification of the cluster
count. A standard approach is to use one of the clustering algorithms that require
the cluster count as an input parameter and run it for each count k in an
interval of likely values. Then, the "elbow method" [14] is used to determine
the optimal number k. For probabilistic models, one can apply information
criteria like AIC [1] or BIC [13] to rate the different clustering results, see, for
example, [18]. But all those methods increase the computation time considerably,
especially if there is not enough prior information to keep the range of possi-
ble cluster counts small. Some clustering algorithms find the number of clusters
directly, without being required to run the algorithm for all possible counts. Two
of the more well-known examples are DBSCAN [5] and Gap Statistic [15]. There
are also some attempts to improve the clustering quality of BIRCH by chang-
ing the algorithm itself, e.g. with non-constant thresholds [6], with two different
global thresholds [2], or by using DBSCAN on each CF level to reduce noise [3].
However, while sometimes improving the quality, those approaches slow BIRCH
down and still require the cluster count as input.
3 BIRCH
We briefly describe BIRCH, mainly to fix notation; for details, see [17]. BIRCH
requires three parameters: the branching factor Br, the threshold T, and the
cluster count k. While the data points are entered into BIRCH, a height-balanced
tree of hierarchical clusters, the CF tree, is built. Each node contains the most
important information of the corresponding cluster, the cluster features (CF). From
those, the cluster center $C = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $\{x_i\}_{i=1}^{n}$ are the elements of the
cluster, and the cluster radius $R = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - C)^2}$ can be computed for each cluster.
Every new point starts at the root and recursively walks down the tree, entering
the subcluster with the nearest center.
When adding a point at a leaf, a new cluster is created if the cluster
radius R would increase beyond the threshold T; otherwise, the point is added to the
nearest cluster. If the creation of a new cluster leads to more than Br child nodes
of the parent, the parent is split. To ensure that the tree stays balanced, the nodes
above might need to be split recursively. Once all points are submitted to BIRCH,
the centers of the leaf clusters are, in the global clustering phase, entered into
k-means with the cluster count k. This last step improves the cluster quality by
merging neighboring clusters. In this paper, we will refer to the BIRCH algorithm
as full-BIRCH and to BIRCH without its global clustering phase as tree-BIRCH.
Tree-BIRCH is very fast: it clusters 100,000 points into 1000 clusters in 4 s
on a 2.9 GHz Intel Core i7, using scikit-learn [12]. The k-means implementation
of the same library needs over two minutes for the same task on the
same architecture. Furthermore, tree-BIRCH does not require the cluster count
as input, which in full-BIRCH is only needed for the global clustering phase.
However, tree-BIRCH usually suffers from poor clustering quality. Therefore, this
paper focuses on improving the clustering quality of tree-BIRCH.
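In scikit-learn, tree-BIRCH corresponds to disabling the global clustering step by passing n_clusters=None, so that the leaf subclusters of the CF tree are returned directly; a minimal usage sketch (the threshold value here is an arbitrary placeholder):

from sklearn.cluster import Birch

# Birch without the global clustering phase = tree-BIRCH.
tree_birch = Birch(threshold=0.5, branching_factor=50, n_clusters=None)
labels = tree_birch.fit_predict(X)  # X: (n_samples, 2) array of data points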
4 Concept
In the following, we present a method that automatically chooses an optimal
threshold parameter for tree-BIRCH. First, note that the CF-tree depends on
the order in which the data is entered. If the points of a single cluster are entered
in the order of increasing distance from the center, tree-BIRCH is more likely
to return just one cluster than if the first two points are from opposite sides
of the cluster. In the latter case, the single cluster is likely to split, a situation
we will refer to as cluster splitting (Fig. 1A). Next, consider two neighboring
clusters. If the first two points are from opposite clusters but still near each
other, they could be collected into the same cluster, given the threshold is large
enough, which we refer to as cluster combining (Fig. 1B). Cluster combining
often co-occurs with cluster splitting. To reduce splitting of a single cluster, the
threshold parameter of tree-BIRCH has to be increased, whereas a decreased
threshold parameter reduces cluster combining. Datasets with a large ratio of
cluster distance (the distance between the cluster centers) to cluster radius are
less likely to produce such errors (Fig. 1C). In the next section, we derive a
condition for the error probability to be less than one percent. Also, a formula
for an appropriate threshold is provided. Note that there is, in fact, another
source of splitting. This happens if a cluster overlaps with two regions belonging
to two non-leaf nodes in the CF-tree. Therefore, we choose the branching factor
Br to be larger than the highest possible cluster count.
Fig. 1. Cluster splitting (A), cluster combining where the combining cluster is circled
(B), depiction of cluster radius and cluster distance (C). Different forms and colors of
the observations correspond to different clusters they belong to.
Our method requires two values: the radius R of the clusters and the minimum cluster distance Dmin. It is presumed that there
exists a small but representative subset of the data that has the same cluster
radius and minimum cluster distance Dmin as the full dataset. On this small
dataset, Gap Statistic is applied to obtain the cluster count k. This k in turn is
given to k-means to produce a clustering of the subset, which finally yields the
two values R and Dmin .
For the determination of R and Dmin, one could also use any other clustering
algorithm that finds the cluster centers and radii without requiring the
cluster count k as input. However, Gap Statistic is chosen here due to its
high precision. Our approach is heuristic: in each case, tree-BIRCH is run often
enough to deduce estimates of the cluster splitting and combining probabilities
with sufficiently small 95 % confidence intervals (with 10,000 repetitions of each
tree-BIRCH clustering, the confidence interval of our error estimate at 0.01 is
roughly 0.01 ± 0.002).
Avoiding Cluster Splitting
We create many clusters containing the same number of elements n by sampling
from a single isotropic two-dimensional Gaussian probability density function.
The units are chosen such that the radius R of this cluster will be one. Then,
tree-BIRCH is applied with the same threshold T to each of those datasets. After
each iteration, we determine whether tree-BIRCH returns the correct number of
clusters, namely one. From this, we assess whether the error probability estimate
for tree-BIRCH is less than 0.01, i.e., one percent. We also investigate the impact
of varying the number of elements n in the cluster on the resulting error probability.
Finally, we repeat all of the above for several thresholds T. For this heuristic
analysis, we use the Python library scikit-learn and its implementation of BIRCH.
According to the results presented in Fig. 2a, there is no indication that the
number of objects in the cluster impacts the error probability. However, the error
probability is clearly affected by the threshold parameter; the error drops below
one percent when the threshold value is greater than or equal to 1.6 · R.
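The splitting experiment can be reproduced along the following lines; this is a simplified sketch (fewer repetitions than the 10,000 used above, our own rescaling to radius R = 1, and a branching factor chosen large as recommended above):

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)

def splitting_error_rate(T, n=500, reps=1000):
    # Fraction of runs in which tree-BIRCH splits a single Gaussian cluster,
    # i.e. returns a cluster count different from 1.
    errors = 0
    for _ in range(reps):
        X = rng.standard_normal((n, 2))
        X /= np.sqrt((X ** 2).sum(axis=1).mean())  # rescale so that R = 1
        birch = Birch(threshold=T, branching_factor=300, n_clusters=None)
        errors += len(np.unique(birch.fit_predict(X))) != 1
    return errors / reps

# e.g. splitting_error_rate(1.6) should fall below 0.01, matching Fig. 2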
Avoiding Cluster Combining
We use a mixture of two Gaussians with cluster distance D = 6, both with
radius R = 1. Again, the error probabilities do not depend on the total
number of data points, as shown in Fig. 3a. This figure pertains only to the cluster
Fig. 2. (a) For each pair (n, T ) of total number n of objects, running from 100 to
10,000, and threshold T ∈ {1.5, 1.6, 1.7}, we sampled n elements from a Gaussian of
radius R = 1, and applied tree-BIRCH 10,000 times to compute the probabilities for
each of the cluster counts 1, 2, and 3. Every count different from 1 is an error. (b)
For each threshold we sample 500 points from a single Gaussian of radius R = 1,
apply tree-BIRCH and record how often it returns the right number of clusters. This
is repeated 10,000 times for each T to obtain an estimate for the error probability.
distance 6.0. To understand the situation for different cluster distances, consider
Fig. 3b. Here, we see the dependence of the error probability on the threshold
for several different cluster distances. For small thresholds (T < 1.7) we witness
cluster splitting, which results in higher cluster counts and a higher error. This
can also be deduced from Fig. 3a, where for the small threshold T = 1.5 we see
many cluster counts of three and four. With T = 2.0 less splitting occurs and the
error probability decreases. If the threshold continues growing (T = 3.0), cluster
combining occurs more frequently, which increases the cluster counts of three
and therefore increases the error probability. The fact that the graphs in Fig. 3b
drop below one percent later than those in Fig. 2b is due to cluster combining,
which was not possible with just one cluster. For D ≥ 6.0 there are thresholds
where the error probability drops below one percent; the minimum is located
near 1.9. From this observation it can be inferred that, if Dmin ≥ 6.0 · R, a
choice of
T = 1.9 · R (1)
would ensure that for each pair of neighboring clusters in the dataset, the error
probability would be less than one percent.
Of course, if Dmin is clearly larger than 6.0 · R, it would be beneficial to
increase the threshold beyond (1). While the lower bound on the threshold is
nearly the same for all cluster distances, the upper bound increases linearly,
Fig. 3. (a) For each pair (n, T ) of total number n of objects, running from 100 to 10,000,
and threshold T ∈ {1.5, 2.0, 3.0}, we sampled 10, 000 times n elements from a mixture
of two Gaussians of distance 6.0, and each time applied tree-BIRCH to compute the
probabilities for each of the cluster counts 1, 2, 3, and 4. Every count different from
2 is an error. (b) For each pair (D, T ) of cluster distance D and threshold T , we
sampled 10, 000 times 500 elements from a mixture of two Gaussians of radius R = 1
and distance D. Each time we applied tree-BIRCH with the threshold set to T and
computed the error probabilities.
roughly with half the increase of the distance (Fig. 4). This is intuitively clear,
since twice the threshold should fit comfortably between two clusters to
avoid cluster combining. To place the threshold in the middle between the lower
and the upper bound, we choose 1/4 as the ratio of ΔT to ΔDmin. We then fit an
intercept of 0.7, which yields the following expression for choosing the threshold:

T = Dmin/4 + 0.7 · R. (2)
Fig. 4. For each pair (D, T ) of cluster distance D and threshold T , we sampled 10, 000
times 500 elements from a mixture of two Gaussians of radius R = 1 and distance D.
Each time we applied tree-BIRCH to compute the error probabilities.
6 Evaluation
We evaluated the accuracy of A-BIRCH with the threshold estimation stated
in Eq. (2). Figure 5 shows that A-BIRCH performs correctly for different sizes of
Dmin and different numbers of clusters. The evaluation datasets contain samples
from two-dimensional isotropic Gaussian distributions with equal variance and
Dmin ≥ 6.0 · R, which fulfills the previously defined requirement.
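Putting the pieces together, the overall A-BIRCH recipe can be sketched as below. The helper names are ours, k is assumed to come from Gap Statistic on the representative subset, and the threshold line assumes Eq. (2) has the linear form derived in the previous section (slope 1/4, intercept 0.7).

import numpy as np
from sklearn.cluster import Birch, KMeans

def estimate_r_dmin(subset, k):
    # R and Dmin from a k-means clustering of the representative subset;
    # taking the largest per-cluster radius is a conservative choice.
    km = KMeans(n_clusters=k, n_init=10).fit(subset)
    c, lab = km.cluster_centers_, km.labels_
    R = max(np.sqrt(((subset[lab == j] - c[j]) ** 2).sum(axis=1).mean())
            for j in range(k))
    dists = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    return R, dists[dists > 0].min()

def a_birch(X, subset, k):
    R, d_min = estimate_r_dmin(subset, k)
    T = d_min / 4.0 + 0.7 * R  # threshold rule of Eq. (2), as derived above
    return Birch(threshold=T, branching_factor=300,
                 n_clusters=None).fit_predict(X)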
In an additional step, we evaluated the scalability of A-BIRCH. While tree-BIRCH
scales well with an increasing number of elements and clusters, Gap
Statistic is the bottleneck, as described in Sect. 5. We tested the parallelized
implementation of Gap Statistic on an Apache Spark cluster on Microsoft
Fig. 5. The datasets A, B, C and D contain 3, 10, 100 and 200 clusters, respectively.
Each cluster consists of 1000 elements, the radius of the clusters is R = 1, and the
Dmin is in all cases larger than 6: in A - 6.001, in B - 7.225, in C - 6.025, in D - 6.410.
Azure. Two cluster configurations were evaluated, each with two master
nodes and with four and eight workers, respectively, each running on
Standard D3 virtual machines. A Standard D3 virtual machine currently provides
4 CPU cores and 14 GB of memory, and runs the Linux operating system. The
parallelization was implemented using the Spark Python API (PySpark).
The computation of Gap Statistic was run on a dataset with 10 clusters, each
consisting of 1000 two-dimensional data points. The computation times for varying
numbers B of reference datasets and maximal cluster counts kmax are
shown in Table 1.
The results show that the parallelized implementation of Gap Statistic with
Spark is scalable as the computation times decrease linearly with an increasing
number of worker nodes. Although the Gap Statistic phase is computationally
expensive, it increases the correctness of BIRCH significantly and does not
require any prior knowledge of the dataset.
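A sketch of the parallelization pattern follows: the expensive part of Gap Statistic, clustering the B reference datasets for every candidate k, is distributed over the Spark workers as one task per (k, b) pair. The simplified selection rule at the end (taking the argmax of the gap) stands in for the full criterion of [15].

import numpy as np
from pyspark import SparkContext
from sklearn.cluster import KMeans

sc = SparkContext(appName="gap-statistic-sketch")

def log_wk(data, k):
    # Log of the within-cluster dispersion of a k-means clustering.
    return np.log(KMeans(n_clusters=k, n_init=10).fit(data).inertia_)

def gap_statistic(X, k_max, B):
    lo, hi, shape = X.min(axis=0), X.max(axis=0), X.shape
    # One Spark task per (k, reference dataset) pair; references are drawn
    # uniformly from the bounding box of the data, as in [15].
    pairs = sc.parallelize([(k, b) for k in range(1, k_max + 1)
                            for b in range(B)])
    ref = pairs.map(lambda kb: (kb[0], log_wk(
              np.random.default_rng(kb[1]).uniform(lo, hi, shape), kb[0]))) \
               .groupByKey().mapValues(lambda v: float(np.mean(list(v))))
    gaps = {k: m - log_wk(X, k) for k, m in ref.collect()}
    return max(gaps, key=gaps.get)  # simplified selection rule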
7 Conclusion
References
1. Akaike, H.: Information theory and an extension of the maximum likelihood prin-
ciple. In: Parzen, E., Tanabe, K., Kitagawa, G. (eds.) Selected Papers of Hirotugu
Akaike, pp. 199–213. Springer, New York (1998)
2. Burbeck, K., Nadjm-Tehrani, S.: Adaptive real-time anomaly detection with incre-
mental clustering. Inf. Secur. Tech. Rep. 12(1), 56–67 (2007)
3. Dash, M., Liu, H., Xu, X.: ‘1 + 1 > 2’: Merging distance and density based clus-
tering. In: Proceedings of Seventh International Conference on Database Systems
for Advanced Applications, 2001, pp. 32–39 (2001)
4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete
data via the EM algorithm. J. Roy. Stat. Soc. Ser. B (Methodol.) 39(1), 1–38
(1977)
5. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for dis-
covering clusters in large spatial databases with noise. In: Simoudis, E., Han, J.,
Fayyad, U.M. (eds.) Second International Conference on Knowledge Discovery and
Data Mining, pp. 226–231. AAAI Press (1996)
6. Ismael, N., Alzaalan, M., Ashour, W.: Improved multi threshold birch clustering
algorithm. Int. J. Artif. Intell. Appl. Smart Devices 2(1), 1–10 (2014)
7. Jordan, M.I., Bach, F.R.: Learning spectral clustering. In: Advances in Neural
Information Processing Systems 16. MIT Press (2003)
8. Kumar, N.S.L.P., Satoor, S., Buck, I.: Fast parallel expectation maximization for
gaussian mixture models on gpus using cuda. In: 11th IEEE International Confer-
ence on High Performance Computing and Communications, pp. 103–109 (2009)
9. Macqueen, J.B.: Some methods for classification and analysis of multivariate obser-
vations. In: Proceedings of the Fifth Berkeley Symposium on Math, Statistics, and
Probability, vol. 1, pp. 281–297. University of California Press (1967)
10. Meng, X., Bradley, J.K., Yavuz, B., Sparks, E.R., Venkataraman, S., Liu, D., Free-
man, J., Tsai, D.B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh,
R., Zaharia, M., Talwalkar, A.: MLlib: machine learning in apache spark. CoRR
(2015)
11. Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action. Manning Pub-
lications Co., Shelter Island (2011)
12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine
learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
13. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
14. Sugar, C.A.: Techniques for Clustering and Classification with Applications to
Medical Problems. Ph.D. Dissertation, Department of Statistics, Stanford Univer-
sity (1998)
15. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a
data set via the gap statistic. J. Roy. Stat. Soc. B (Stat. Methodol.) 63(2), 411–
423 (2001)
16. Zechner, M., Granitzer, M.: Accelerating k-means on the graphics processor via CUDA. In: First International Conference on Intensive Applications and Services, pp. 7–15 (2009)
17. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: a new data clustering algorithm
and its applications. Data Min. Knowl. Disc. 1(2), 141–182 (1997)
18. Zhou, B., Hansen, J.: Unsupervised audio stream segmentation and clustering via the Bayesian information criterion. In: Proceedings of ICSLP-2000: International Conference on Spoken Language Processing, pp. 714–717 (2000)
Playlist Generation via Vector
Representation of Songs
1 Introduction
With the help of improvements in the Internet and related technologies, users face an astonishing array of choices when listening to songs online or on their portable devices. The Internet enables users to access various remote music resources and to listen to songs online. A music recommender system helps users filter and discover songs according to their tastes. A good music recommender system should be able to automatically detect preferences and generate playlists accordingly. Because the amount of data on the Internet is huge and it is not easy to search and filter it according to a person's music taste, recommender systems become more of a necessity as data sizes increase. Recommender systems can be seen in every domain, in different formats and shapes [1]; they are not limited to music, and we often use them without even realizing it. Amazon and Spotify are good examples of web sites employing recommender systems.
It can be difficult and time-consuming to select suitable tracks to listen to or to organize them into a playlist. The fundamental goal of a music recommendation system is to reduce users' effort and increase their satisfaction.
Playlist generation is a significant research area in music recommendation that allows listening to music based on user feedback. There are various approaches to generating playlists automatically. A random approach, given a song or an artist, is based on users' behavior [2]. The most common approach to creating automatic playlists is based on estimating music similarities between pairs of songs, often established with the help of human experts.
Various similarity functions have been suggested, such as social tag information [3], web document mining [4], analysis of audio content [5, 6], or a combination of these approaches in a hybrid system [7].
The remainder of this paper is organized as follows. Section 2 reviews related work on playlist generation. Section 3 explains the proposed technique for playlist generation using word embedding. Section 4 evaluates the proposed recommender system by performing tests on the real-world Cornell playlist dataset. Section 5 concludes the paper and presents future work.
2 Related Work
In its typical form, a playlist is a list of songs. Playlists can be in sequential or shuffled order; most of the time, however, they are sequential and semantically coherent. In this work, we generate playlists based on a song set which can contain one or more songs. The principal idea is to find a vector representation of each song with the Word2vec algorithm. Our approach is based on the idea that if music is a communication language and songs are words, then playlists are sentences. Therefore, we serialize playlists into sentences and then index the songs from 0 to n, where n is the total size of our song space. We then invoke the Word2vec algorithm to find the vector representation of each song. In our work, we use Apache Spark's Word2vec implementation.
Apache Spark is a fast and powerful open-source platform for processing big data, developed at the University of California, Berkeley's AMPLab and afterward donated to the Apache Software Foundation. It provides many useful operations for machine learning, structured query language, streaming, and graph processing.
$$V(\mathrm{SongSet}) = \frac{1}{\lVert \mathrm{SongSet} \rVert} \sum_{song \in \mathrm{SongSet}} V(song) \qquad (1)$$

$$\mathrm{similarity} = \cos(\theta) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert} \qquad (2)$$
Having found the vector representation of each song, we are able to produce top-N recommendations via cosine similarity. In this approach, thanks to the use of vector representations, the model is not limited to a single starting point when generating a playlist: users can specify their music taste with many songs, because the model computes the vector capturing the user's taste as the average of the song vectors (Eq. 1). The general model is illustrated in Fig. 1.
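Since the paper names Apache Spark's Word2vec implementation, a hedged sketch of the pipeline follows, using the MLlib RDD API (assumed Spark 2.x): playlists act as sentences, song IDs as words, and recommendations come from cosine similarity against the averaged seed vectors (Eqs. 1 and 2). The input path, song IDs, and hyperparameters are illustrative, not the authors' exact setup.

```python
# Hedged sketch of playlist generation with Spark MLlib's Word2Vec (RDD API).
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName="playlist-word2vec")

# Hypothetical input: each line is one playlist of space-separated song IDs.
playlists = sc.textFile("playlists.txt").map(lambda line: line.split())

# Window size 125 mirrors the best setting in Table 2; vector size is illustrative.
model = Word2Vec().setVectorSize(100).setWindowSize(125).setSeed(42).fit(playlists)

def recommend(seed_songs, top_n=10):
    """Average the seed songs' vectors (Eq. 1), rank by cosine similarity (Eq. 2)."""
    vecs = [model.transform(s).toArray() for s in seed_songs]
    taste = np.mean(vecs, axis=0)
    # findSynonyms ranks the vocabulary by cosine similarity to the given vector.
    return model.findSynonyms(taste.tolist(), top_n)

print(recommend(["song_17", "song_942"]))
```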
This section presents the studies conducted to evaluate the proposed architecture for playlist generation. The proposed system is evaluated on a real-world song dataset, and the performance results in terms of hit rates are presented at the end of the section.
As a real-world song dataset, we use a dataset published by Cornell University and collected by Shuo Chen of the Department of Computer Science. The dataset contains playlist sequences and tag data obtained from Yes.com and Last.fm. To obtain datasets of various sizes, the researchers pruned the raw data so that only songs whose number of appearances exceeds a certain threshold are kept. The pruned set is then divided into a training set and a testing set, making sure that each song appears at least once in the training set. The playlist data is divided into three parts, named yes_small, yes_big, and yes_complete. The detailed properties of the dataset are shown in Table 1 [8, 15, 16].
In our evaluation design, for each playlist sequence in the test set, the first song is set apart as the target song and dropped from the playlist sequence. The goal is to predict the removed song with the model's top-N recommendations: if the target song appears among them, it is considered predicted correctly. The corresponding measure (Eq. 3) is known as the hit rate [17].
$$\mathrm{HitRate}(\mathrm{Train}, \mathrm{Test}) = \frac{1}{\lVert \mathrm{Test} \rVert} \sum_{(h,t) \in \mathrm{Test}} d_{t,\, R_{\mathrm{Train}}(h)} \qquad (3)$$
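A minimal sketch of this evaluation loop, assuming a recommend(history, top_n) function like the one sketched earlier (names illustrative):

```python
# Hedged sketch of the hit-rate evaluation in Eq. (3): the held-out first song
# of each test playlist counts as a hit if it appears in the top-N list.
def hit_rate(test_playlists, recommend, top_n=10):
    hits = 0
    for playlist in test_playlists:
        target, history = playlist[0], playlist[1:]  # first song is held out
        recommended = [song for song, _ in recommend(history, top_n)]
        hits += int(target in recommended)
    return hits / len(test_playlists)
```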
Fig. 2. Hit rates for window size 5 on yes_small
Fig. 3. Hit rates for window size 25 on yes_small
Fig. 4. Hit rates for window size 125 on yes_small
Fig. 5. Hit rates for window size 250 on yes_small
Table 2. Parameters of the five best average hit rates on yes_small

Rank | Window size | Vector size | Average hit rate
1 | 125 | 50 | 0.7282
2 | 125 | 200 | 0.7276
3 | 125 | 100 | 0.7270
4 | 125 | 300 | 0.7266
5 | 125 | 500 | 0.7260
We also use the parameters described in Table 2 for measuring yes_big and yes_complete. Hit rate results for the yes_big dataset are shown in Fig. 6 and for the yes_complete dataset in Fig. 7.
The study presented in this paper proposed a novel playlist generation method. The evaluations and tests showed that the proposed recommendation system can help music audiences discover personalized playlists in an easy and efficient way. In this study, we also showed that the Word2vec algorithm can be used with the Apache Spark big data framework and run on distributed vector representations of songs to produce playlists reflecting a person's tastes.
Recommender systems are a broad and challenging research area, and the flexibility of the Word2vec model provides exciting opportunities for future work. However, our model cannot solve the cold-start problem, so it could be combined with traditional models such as content-based systems, or with a new vector-based approach, to address it.
References
1. Bobadilla, J., Ortega, F., Hernando, A., Gutiérrez, A.: Recommender systems survey.
Knowl. Based Syst. 46, 109–132 (2013)
2. Celma, O.: Music Recommendation. Springer, Berlin, Heidelberg (2012)
3. Levy, M., Sandler, M.: Music information retrieval using social tags and audio. IEEE Trans.
Multimedia 11, 383–395 (2009)
4. Knees, P., Pohle, T., Schedl, M., Schnitzer, D., Seyerlehner, K.: A document-centered
approach to a natural language music search engine. Adv. Inf. Retr. 4956, 627–631 (2008)
5. Logan, B.: Content-based playlist generation: exploratory experiments. In: Proceedings of the 3rd International Conference on Music Information Retrieval, Paris, France, 13–17 October, pp. 1–2 (2002)
6. Elias, P., Arthur, F., Gerhard, W.: Improvements of audio-based music similarity and genre classification. In: Proceedings of the 6th International Conference on Music Information Retrieval, London, UK, 11–15 September, pp. 634–637 (2005)
7. Paul, Q., Robert, P., Michael, M., Maxwell, W., Scott, J., Richard, W.: Playlist generation,
delivery and navigation. U.S. Patent No US20030135513 A1 (2002)
8. Chen, S., Moore, J.L., Turnbull, D., Joachims, T.: Playlist prediction via metric embedding. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA, 12–16 August, pp. 714–722 (2012)
9. Gärtner, D.: User adaptive music similarity with an application to playlist generation. Ph.D. Thesis, Universität Karlsruhe (TH), Germany; Carnegie Mellon University, Pittsburgh, PA, USA (2006)
10. Baltrunas, L., Kaminskas, M., Ludwig, B., Moling, O., Ricci, F., Aydin, A., Lüke, K.-H.,
Schwaiger, R.: InCarMusic: context-aware music recommendations in a car. In: Huemer, C.,
Setzer, T. (eds.) EC-Web 2011. LNBIP, vol. 85, pp. 89–100. Springer, Heidelberg (2011).
doi:10.1007/978-3-642-23014-1_8
11. Masahiro, N., Takaesu, H., Demachi, H., Oono, M., Saito, H.: Development of an automatic music selection system based on runner's step frequency. In: Proceedings of the 9th International Conference on Music Information Retrieval, Pennsylvania, USA, 14–18 September, pp. 193–198 (2008)
12. Wang, X., Wang, Y., Hsu, D., Wang, Y.: Exploration in interactive personalized music
recommendation: a reinforcement learning approach. ACM Trans. Multimedia Comput.
Commun. Appl. 11, 1–22 (2014)
13. Bogdanov, D., Herrera, P.: How much metadata do we need in music recommendation? A subjective evaluation using preference sets. In: Proceedings of the 12th International Conference on Music Information Retrieval, Miami, Florida, USA, 24–28 October, pp. 97–102 (2011)
14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of
words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26,
3111–3119 (2013)
15. Moore, J.L., Chen, S., Joachims, T., Turnbull, D.: Learning to embed songs and tags for playlist prediction. In: Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal, 8–12 October, pp. 349–354 (2012)
16. Chen, S., Xu, J., Joachims, T.: Multi-space probabilistic sequence modeling. In: Proceedings of the 19th ACM Conference on Knowledge Discovery and Data Mining, Chicago, USA, 11–14 August, pp. 865–873 (2013)
17. Bonnin, G., Jannach, D.: A comparison of playlist generation strategies for music recommendation and a new baseline scheme. In: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, Washington, USA, 16–18 July, pp. 16–23 (2013)
A Distributed Framework for Early Trending
Topics Detection on Big Social Networks
Data Threads
Abstract. Social networks have become big data production engines, and their analytics can reveal insightful trending topics, whose hidden knowledge can be utilized in various applications and settings. This paper addresses the problem of early prediction of popular topics and trends out of social network data streams, which demands distributed software architectures. Trending topics are detected under an online time series classification model, implemented in a flexible and adaptive distributed framework. Emphasis is placed on the early detection process and on the performance of the proposed framework. The implemented framework builds on the lambda architecture design, and the experimentation carried out highlights the usefulness of the proposed approach in early trend detection, with high performance and with a validation aligned with a popular microblogging service.
1 Introduction
Big data threads are rapidly produced in social networks, with emerging textual data streams evolving unpredictably, in a non-predetermined manner. Such data threads offer fertile ground for analytics which can reveal trends, phenomena, and knowledge. In social networks such as the Twitter microblogging service1, data threads unfolding over time are characterized as trending topics when they exhibit viral cascading data patterns.
This work is motivated by the fact that predicting whether a social network topic will become a trend, before it is declared as a trend by the social network itself (i.e. Twitter), can be addressed as a classification problem under evolving big data requirements. The major obstacles to predicting which topics will actually end up as trends in social networks are the real-time, often bursty, data production, along with the missing exemplar time series needed to ignite data processing.
While previous research has analyzed trending topics mostly over the long term, recent efforts have focused on detecting tweet trends in an almost real-time fashion, due to ongoing and continuous social network events [1, 2]. Setting up the appropriate distributed frameworks to support trend detection in social networks embeds both big data demands and real-time requirements [3, 4].
1 https://twitter.com/whatstrending.
Such frameworks must be tunable to offer appropriate testbeds for important applications highly related to trend detection, such as fraud detection and emergencies [5, 6].
This paper addresses the problem of predicting popular topics and trends out of social network data streams, under an online time series classification model implemented over a flexible and adaptive distributed framework. Trend prediction is performed in a real-time fashion, utilizing big data principles and technologies, under a framework designed along the Lambda architecture outline. The Lambda architecture has been chosen due to its flexibility in processing both evolving and static data threads, based on a tunable latent source model. The proposed model sets the so-called latent source "signals" corresponding to an exemplar event of a certain type, and a clustering process is combined with the classification task of labeling data threads into specific categories (detected as trends or not). Different similarity measures are used to support classification via the latent source model. The proposed work is validated in an experimentation setting with actual large-scale Twitter data threads, using the Twitter-declared trending topics as the ground truth for evaluating detected viral topics. In the experimentation carried out on the proposed distributed architecture, trending topics prediction reached an accuracy of 78.4 %, while in more than one third of the studied cases the prediction succeeded in advance of the respective Twitter trend declaration.
The structure of the paper is as follows. The next section outlines the principles of early trend detection in social networks and the suitability of the architecture to carry out such a process. Section 3 details the proposed micro-blogging trending topics prediction implementation, while experimentation and results are discussed in Sect. 4. Section 5 concludes and indicates future work.
The trend detection problem as a time series classification problem has been studied in several earlier efforts [7, 8], with a focus on classifying a topic as a trend based either on the Euclidean distance between the topic's time series and the time series of trend (or non-trend) training sets, or on streaming time series into a vector which is then used for clustering over the evolving time series. The real-time emergence of social network (Twitter in particular) data streams has set the floor for research approaches capable of predicting phenomena such as disease outbreaks, emergencies, and opinion shifts [9–11]. Popularity alone is not adequate for a topic to break into the trends list, since Twitter favours novelty over popularity. It is the velocity of interactions around a specific topic which should be monitored in relation to a baseline interaction level [1, 12]. To resolve such issues, earlier efforts have focused on the so-called latent source model for time series, which naturally leads to a "weighted majority voting" classification rule, approximated by a nearest-neighbor classifier [7]. The proposed work is inspired by this latent classification model, which is advanced here with novel distributed data processing methodologies and with the inclusion of social features related to a process favoring the early detection of a trending topic. To support this approach,
two distance metrics, cosine and squared Euclidean, are utilized, advancing earlier efforts which favored the use of only Euclidean-space metrics.
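As a hedged illustration of this voting scheme, the sketch below scores a candidate time series against trend and non-trend exemplars with both metrics; the exponential weighting and the gamma parameter follow the latent source literature [7] rather than this paper's exact formulation.

```python
# Sketch of latent-source-style weighted majority voting over exemplar time series.
import numpy as np

def sq_euclidean(a, b):
    return float(np.sum((a - b) ** 2))

def cosine_dist(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def trend_score(x, trend_exemplars, nontrend_exemplars, dist=sq_euclidean, gamma=1.0):
    """Ratio of summed exponentially weighted votes; above a threshold => trend."""
    vote = lambda exemplars: sum(np.exp(-gamma * dist(x, s)) for s in exemplars)
    return vote(trend_exemplars) / max(vote(nontrend_exemplars), 1e-12)
```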
Complex architectures have been designed and tested on evolving big data management use cases, extracting useful insights from large-scale data collections [13, 14]. Among such distributed frameworks, the Lambda Architecture2 is proposed in this work, since it gains ground due to its well-defined set of architecture principles allowing both batch and real-time stream data processing [15, 16]. The three Lambda architecture layers, i.e. the Batch, the Serving, and the Speed layer, support the proposed framework since they enable processing of high volumes of data in either batch or real-time mode. Data streams initially enter the system and are dispatched to both the batch and the speed layers.
In our case, since data are collected from social networks, data arrive from the Twitter streaming API. At the batch layer, a master dataset holds an immutable, append-only set of raw data, from which arbitrary views are computed. Hadoop3 and its implementation of the MapReduce model have been chosen since they are ideal for this batch layer, leveraging HDFS. The serving layer gets the output from the batch layer as a set of flat files containing pre-computed views and exposes these views for querying. Bridging these layers and supporting a fast database is handled with specific tools (like Apache Drill4) in combination with an in-memory key-value datastore. The speed layer computes data views as well, but it does so incrementally, balancing the high latencies of the batch layer with real-time view computations. Real-time views are of a transient nature, aligned with the data propagation rates (from the batch and serving layers), under the so-called "complexity isolation" process which pushes into the layer only temporary results [17]. In response to user queries, merged results from batch and real-time views are delivered to the user under an integrated common interface.
This architecture is exploited next for the specific social network trend detection problem, due to its capability of handling both static and evolving big data threads. The proposed framework builds on the Lambda Architecture, with a batch, a speed, and a serving layer. The key difference arises from the framework's execution workflow, depicted in Fig. 1. Both batch and speed layers are fed with the same incoming data stream, but the batch layer executes periodic processing on the whole dataset up to a specified time, while the speed layer performs the same processing, in real time, on the part of the data acquired while the former layer is busy. Finally, the results of the batch layer processing stored in the serving layer are aggregated with the results of the speed layer processing, and the outcome consists of global views of the dataset in response to specific queries.
2 http://lambda-architecture.net/.
3 https://hadoop.apache.org/.
4 https://drill.apache.org/.
Data processing on the batch and speed layers of the implemented framework differs in that the batch layer is used both for the creation of the training set and for its periodic update, while the speed layer covers the classification of topics into trends through a series of execution steps. A big volume of tweets is used to define the training set, in an asynchronous execution at the speed layer, accessing results stored at the serving layer.
In this work, early prediction of Twitter's trending topics involves a time series comparison process, requiring a sample of the overall tweets dataset over a specific period. This sample is built on the basis of the Twitter-declared trending topics and by performing retrospective analysis on the dataset to generate the time series for each topic. For each topic, the ratio of summarized distances is calculated, and once it is above a threshold the topic is classified as a potential trending topic. The novelty of the proposed execution approach lies in the following key points:
• training set’s time series are constructed by a small rate of the overall tweets dataset;
• the time series of topics to be tested are not static, but are generated in real time in a
form of a sliding window, maintaining low percentage of tweets for constructing the
time series.
The creation of the final form of the time series is given in Fig. 2, which visualizes the flow of a time series generation example, in line with the following steps:
• identification of the timestamps and the exact date at which the topic was declared a trending topic;
• definition of a time range ending just before the topic became trending, i.e. at the time of the last kept time slot;
• removal of all time slots preceding the one that signifies the start of the aforementioned time range;
• production of the final form of the time series by applying a set of normalization filters.
Time series of non-trending topics are generated under a similar procedure. The density of the time series may range from too sparse to too dense, and the normalization filters proposed here accordingly range from baseline to spike filters [7].
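A minimal sketch of this generation step follows, assuming the 2-minute slots used in the experiments (Sect. 4) and a simple baseline normalization; the function and its filters are illustrative, not the authors' implementation.

```python
# Hedged sketch: bucket a topic's tweet timestamps into fixed slots and normalize.
import numpy as np

def topic_time_series(timestamps, t_start, t_end, slot_seconds=120):
    """Count tweets per slot inside [t_start, t_end) and divide by the mean count."""
    n_slots = int((t_end - t_start) // slot_seconds)
    counts = np.zeros(n_slots)
    for ts in timestamps:
        if t_start <= ts < t_end:
            counts[int((ts - t_start) // slot_seconds)] += 1
    baseline = counts.mean() if counts.mean() > 0 else 1.0
    return counts / baseline  # baseline filter; spike filters would follow here
```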
The overall framework (depicted in Fig. 2) has been installed on GRNET's cloud service, called Okeanos5, and consists of a Hadoop cluster (batch layer), a Storm cluster (speed layer), and a single node (serving layer). The available resources from Okeanos were 48 CPU cores, 46 GB of RAM, and 600 GB of disk storage; in total, 14 virtual machines were created using these resources for the implementation of the above clusters and servers.
The proposed framework has been stress-tested with various big data tweet threads collected via the Twitter streaming API over several time windows. Due to lack of space, a sample of a one-month period (Nov-Dec 2013) is discussed here, with a total size exceeding 300 GB, along with the Twitter trending topics announced for the same time window. The parallel collection of tweets and trending topics followed the proposed time series generation, dividing the one-month time window into fine-grained time slots of 2 min. The evaluation of the proposed methodology, implemented on the lambda-inspired architecture, reached improved performance, since almost 80 % of the actual trending topics were classified as potential trending topics by the method (Fig. 3a).
5 https://okeanos.grnet.gr.
Specifically, 79 % of the correctly classified trending topics were claimed as trending topics by the method under cosine similarities, and this was achieved earlier than by the Twitter API, with a mean time of 1.43 h earlier; only 4 % of the potential trending topics were false positives. Figure 3(b–d) also depicts the rates of trend detection ignition for true positives prior to and upon Twitter trending topic declaration, with the similarity estimation based on either Euclidean or cosine metrics, at the regular intervals of the 24th, 36th, and 48th execution hours.
The percentage of trending topics reported by the implemented framework before the respective Twitter reports follows a downward path throughout the execution (as partially depicted in Fig. 4, which illustrates the whole execution flow).
On the other hand, the mean times of the true positives' early reporting, in comparison to the timing of Twitter's reports, are quite satisfactory, and this improved performance is enhanced by the fact that the mean times of the later reports, in comparison to Twitter's, are significantly smaller.
Fig. 4. Hourly percentages (a) and mean times (b) fluctuation for trend detection prior to and after Twitter declaration
In summary, the overall percentage of trending topics detected by the proposed framework validates its positive performance, which tracks Twitter's respective percentages of trending topic detection; its strong contribution lies in its ability to detect trending topics early, and it can be further extended and improved.
The use of different distance metrics for the time series comparison can be beneficial: time-series-oriented distance metrics such as edit distance with Real Penalty, move-split-merge, and sparse dynamic time warping are currently under experimentation in the proposed framework. Normalization and filtering of the time series can also be extended and refined, while the implemented algorithms' execution workflow can become more "loyal" to the principles of the Lambda Architecture. In such a case, a more powerful infrastructure and the inclusion of a distributed NoSQL database serving the scope of the serving layer are under consideration. In terms of applicability, observing and analyzing social network textual threads as evolving time series has many future relevant applications, such as recommenders and other personalized services.
Regarding the infrastructure part of the current work, a challenging next step would be the use of more recent technologies and tools. While the presented architecture is heavily based on [17], other frameworks could be more effective; indicatively, Apache Spark6 could be used as the batch layer and Apache Flink7 as the serving layer in the Lambda Architecture.
References
1. Zubiaga, A.: Real-time classification of Twitter trends. J. Assoc. Inf. Sci. Technol. 66, 462–473 (2015)
2. Gaglio, S., Lo Re, G., Morana, M.: A framework for real-time Twitter data analysis. Comput. Commun. 73, 236–242 (2016)
3. Li, J.: Bursty event detection from microblog: a distributed and incremental approach. Concurrency Comput. Pract. Exp. 28, 3115–3130 (2015)
4. Das, M.: Towards methods for systematic research on big data. In: IEEE International Conference on Big Data (Big Data). IEEE (2015)
5. Giatsoglou, M., Chatzakou, D., Shah, N., Faloutsos, C., Vakali, A.: Retweeting activity on
Twitter: signs of deception. In: Cao, T., Lim, E.-P., Zhou, Z.-H., Ho, T.-B., Cheung, D.,
Motoda, H. (eds.) PAKDD 2015. LNCS, vol. 9077, pp. 122–134. Springer, Heidelberg
(2015)
6. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event
detection by social sensors. In: Proceedings of the Nineteenth International WWW
Conference. ACM (2010)
7. Chen, G.H., Nikolov, S., Shah, D.: A latent source model for nonparametric time series classification. In: Advances in Neural Information Processing Systems (2013)
8. Kontaki, M., Papadopoulos, A.N., Manolopoulos, Y.: Continuous trend-based classification
of streaming time series. In: Eder, J., Haav, H.-M., Kalja, A., Penjam, J. (eds.) ADBIS 2005.
LNCS, vol. 3631, pp. 294–308. Springer, Heidelberg (2005)
9. Szomszor, M., Kostkova, P., de Quincey, E.: #Swineflu: Twitter predicts swine flu outbreak
in 2009. In: Szomszor, M., Kostkova, P. (eds.) e-Health. LNICST, vol. 69, pp. 18–26.
Springer, Heidelberg (2011)
10. Shi, L.: Predicting US primary elections with Twitter. http://snap.stanford.edu/social2012/papers/shi.pdf. Accessed 2012
11. Wang, Y.: To Follow or Not to Follow: Analyzing the Growth Patterns of the Trumpists on
Twitter. arXiv:1603.08174 (2016)
12. Mathioudakis, M., Koudas, N.: TwitterMonitor: trend detection over the Twitter stream. In: SIGMOD. ACM (2010)
13. Gorton, I., Klein, K.: Distribution, data, deployment: Software architecture convergence in
big data systems. IEEE Softw. 32(3), 78–85 (2015)
14. Tang, B.: A hierarchical distributed fog computing architecture for big data analysis in smart
cities. In: Proceedings of the ASE BigData & SocialInformatics. ACM (2015)
6 http://spark.apache.org/.
7 https://flink.apache.org/.
15. Kiran, M.: Lambda architecture for cost-effective batch and speed big data processing. In: IEEE Big Data International Conference (2015)
16. Martínez-Prieto, M.: The solid architecture for real-time management of big semantic data.
Future Gener. Comput. Syst. 47, 62–79 (2015)
17. Marz, N., Warren, J.: Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications (2015)
Multi-task Deep Neural Networks
for Automated Extraction of Primary Site
and Laterality Information from Cancer
Pathology Reports
1 Introduction
Pathology reports are free-text documents containing information about human tissue
specimens. They are a standard component of cancer patient clinical reporting and
management. Extraction of information from pathology reports is a critical but largely
manual process performed by cancer registries as part of the national cancer surveil-
lance program. In the past few years there has been tremendous interest in developing
natural language processing (NLP) methods to automate this time-consuming process
and ease the burden of manual abstraction [1, 2]. However, due to the nature of the
pathology report, simple text matching is not effective. Some well-known challenges
are missing punctuation, clinical findings interspersed with explanations, as well as
complex information about multiple specimens mentioned throughout the report.
Due to these challenges, machine learning-based NLP methods have been explored as a promising approach. Martinez and Li applied Naïve Bayes, Support Vector Machine (SVM), and AdaBoost classifiers to identify information from pathology reports, including primary cancer site, with features selected from Bag-of-Words, lemmas, and annotations from lexical databases [3]. Jouhet et al. applied Naïve Bayes and SVM to classify topographical and morphological codes using the Term Frequency with Inverse Document Frequency (TF-IDF) of Bag-of-Words features [4]. Kavuluru et al. proposed n-gram features to identify 14 primary cancer sites using a Logistic Regression classifier [5]. Reported results vary widely depending on the study, information target, classifier, and feature selection method, with F-scores ranging from 0.4 to 0.9.
Recently, there has been growing interest in exploring deep learning (DL) methods for
NLP tasks [6–8]. Applications range from low-level tasks such as part-of-speech
tagging and named entity recognition to high-level tasks such as information retrieval,
semantic analysis, and sentence relation modeling. DL methods bypass laborious
hand-engineered feature extraction while often boosting NLP task performance by
learning high-level abstractions of word-level representations and sentence-level rep-
resentations. In the biomedical domain, DL-based NLP has been applied for analysis of
biomedical articles and electronic medical records [9, 10].
In this study we focused on automated information extraction from cancer
pathology reports, and specifically extraction of two key parameters, primary cancer
site and its laterality. First, we explored the application of deep neural networks
(DNNs), a cascade of hidden representations known to be able to learn complex
mappings between the input features and the output. Then, we studied the use of a
multi-task learning (MTL) paradigm to improve upon the DNN performance. MTL is a
learning framework designed to exploit commonality of different but related classifi-
cation tasks as a means to achieve higher classification performance [11]. Specifically,
we performed comparative analysis between an MTL-based DNN classifier versus
single-task DNN classifiers. In addition, we applied Naïve Bayes classifiers and
Logistic Regression, two common classification choices.
The paper is organized as follows: Sect. 2 provides a brief description of the pathology report structure; Sect. 3 describes the available dataset, the feature extraction methods, the classifiers, the experimental design, and the performance evaluation protocols employed in this study; finally, Sects. 4 and 5 describe the results and discussion from our studies.
The Surveillance, Epidemiology, and End Results (SEER) program began collecting data on cancer cases in 1973, in seven US states, and has since expanded to 18 population-based cancer registries [12]. The SEER program collects incidence and survival data of patients with malignant tumors, representing approximately 30% of the US population. Cancer pathology reports are a primary information source for cancer registries.
A sample pathology report is shown in Fig. 1. A typical report is divided into sections marked by the angle-bracketed tags listed in Table 1. As Fig. 1 shows, not all sections are filled; therefore, simply reviewing a few sections to match keywords is not an effective and reliable information extraction strategy.
Fig. 1. Screenshot of sample free text pathology report. Note that not all sections are filled, e.g.,
<TEXT_PATH_COMMENTS>.
Table 1. Section tags marking the parts of a pathology report.

<PATIENT_DISPLAY_ID>
<TUMOR_RECORD_NUMBER>
<RECORD_DOCUMENT_ID>
<TEXT_PATH_CLINICAL_HISTORY>
<TEXT_PATH_COMMENTS>
<TEXT_PATH_FORMAL_DX>
<TEXT_PATH_FULL_TEXT>
<TEXT_PATH_GROSS_PATHOLOGY>
<TEXT_PATH_MICROSCOPIC_DESC>
<TEXT_PATH_NATURE_OF_SPECIMENS>
<TEXT_PATH_STAGING_PARMS>
<TEXT_PATH_SUPP_REPORTS_ADDENDA>
3 Methods
3.1 Dataset
In this study, we used de-identified data obtained with an IRB-approved protocol from
the National Cancer Institute. The acquired data consisted of 1,976 de-identified free
text pathology reports of breast and lung cancers from five SEER cancer registries with
human-annotated gold standards. Of those, we used 985 reports for which the anno-
tations clearly stated their primary site category (breast, lung) and laterality (left, right).
There were 313 cases of left breast, 285 cases of right breast, 156 cases of left lung, and
231 cases of right lung. To mitigate case prevalence bias, we generated a dataset with
300 cases per primary (P) cancer site and laterality (L) grouping by either
over-sampling the under-represented or by randomly selecting 300 cases from the
over-represented group in the available data. This process resulted in 1,200 cases
(=2 primary sites × 2 laterality categories × 300 cases per (P,L) group).
Table 2. Top 30 most frequently occurring n-grams and their number of occurrences in the pathology report data.

n-gram | Frequency | n-gram | Frequency | n-gram | Frequency
Specimen | 2128 | Carcinoma | 1574 | Number | 1325
Submitted | 2062 | Description | 1479 | Adenocarcinoma | 1319
Tissue | 1931 | Biopsy | 1477 | Grade | 1295
Diagnosis | 1916 | Gross | 1455 | Patient | 1281
cm | 1850 | Negative | 1431 | Tan | 1277
Labeled | 1771 | Formalin | 1427 | Present | 1268
Microscopic | 1741 | Histologic | 1403 | Identified | 1254
Received | 1656 | Signed | 1377 | Positive | 1221
Tumor | 1649 | Case | 1325 | Gross description | 1166
Clinical | 1591 | Performed | 1325 | Invasion | 1166
where $w_{ij}$ is the weight on the connection to output unit $j$ from input unit $i$, $b_j$ is the bias input of unit $j$, and $\sigma(x) = \frac{1}{1 + e^{-x}}$. The output layer $k$ produces the class probability $p_k$ over all $N$ classes by using the softmax nonlinearity

$$p_k = \frac{\exp(h_k)}{\sum_{n=1}^{N} \exp(h_n)}.$$
The DNN is trained using Stochastic Gradient Descent (SGD) [16], where the derivatives are computed on random mini-batches of training cases and the weights are updated in proportion to the gradient. In the context of our study, two separate single-task DNNs were developed for text analysis of pathology reports: one for extracting the primary cancer site and another for extracting cancer site laterality.
$$\mathrm{Precision}_{class_i} = \frac{TP_{class_i}}{TP_{class_i} + FP_{class_i}}, \qquad \mathrm{Recall}_{class_i} = \frac{TP_{class_i}}{TP_{class_i} + FN_{class_i}}$$

$$F_{class_i} = \frac{2\,\mathrm{Precision}_{class_i}\,\mathrm{Recall}_{class_i}}{\mathrm{Precision}_{class_i} + \mathrm{Recall}_{class_i}}$$

where $TP_{class_i}$, $FP_{class_i}$, and $FN_{class_i}$ are the true positives, false positives, and false negatives of $class_i$. The macro F-score is defined as

$$\text{Macro-}F = \frac{1}{N} \sum_{i=1}^{N} F_{class_i}.$$

On the other hand, micro precision, recall, and F-scores are defined as follows:

$$\mathrm{Precision}' = \frac{\sum_{i=1}^{N} TP_{class_i}}{\sum_{i=1}^{N} (TP_{class_i} + FP_{class_i})}, \qquad \mathrm{Recall}' = \frac{\sum_{i=1}^{N} TP_{class_i}}{\sum_{i=1}^{N} (TP_{class_i} + FN_{class_i})}$$

$$\text{Micro-}F = \frac{2\,\mathrm{Precision}'\,\mathrm{Recall}'}{\mathrm{Precision}' + \mathrm{Recall}'}.$$
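A minimal Python sketch of these definitions, computing macro and micro F-scores from per-class counts (the function name is illustrative):

```python
# Hedged sketch of the macro/micro F-score formulas above.
def f_scores(tp, fp, fn):
    """tp, fp, fn: per-class count lists; returns (macro_f, micro_f)."""
    def f1(t, p, n):
        prec = t / (t + p) if (t + p) else 0.0
        rec = t / (t + n) if (t + n) else 0.0
        return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    macro = sum(f1(t, p, n) for t, p, n in zip(tp, fp, fn)) / len(tp)
    micro = f1(sum(tp), sum(fp), sum(fn))
    return macro, micro
```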
4 Results
The MTL-DNN algorithm was implemented with the PDNN [17] software package running on the Theano 0.8.2 [19] deep learning library in a Python 2.7.11 environment. The network architecture shared across tasks was 400 × 400 × 100 nodes, and the task-specific individual networks were 100 × 50 × 20 × 2 nodes. The sigmoid activation function was selected, and the learning rate was 0.08.
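The paper's implementation uses PDNN on Theano; purely as a hedged illustration of the same topology, the sketch below expresses the shared 400 × 400 × 100 trunk and the two 100 × 50 × 20 × 2 task heads in the Keras functional API. The 400-dimensional input and the training call are assumptions, not the authors' code.

```python
# Hedged sketch of the MTL-DNN topology with sigmoid units and SGD (lr = 0.08).
from keras.layers import Dense, Input
from keras.models import Model
from keras.optimizers import SGD

inputs = Input(shape=(400,))                        # assumed 400 input features
shared = Dense(400, activation="sigmoid")(inputs)   # shared representation
shared = Dense(100, activation="sigmoid")(shared)

def task_head(name, x):
    for units in (100, 50, 20):
        x = Dense(units, activation="sigmoid")(x)
    return Dense(2, activation="softmax", name=name)(x)  # two classes per task

site = task_head("primary_site", shared)        # breast vs. lung
laterality = task_head("laterality", shared)    # left vs. right

model = Model(inputs=inputs, outputs=[site, laterality])
model.compile(optimizer=SGD(lr=0.08), loss="categorical_crossentropy")
# model.fit(X, [y_site, y_laterality], batch_size=..., epochs=...)
```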
For comparison purposes, we divided the primary and secondary tasks and repeated
the study using individual single-task Deep Neural Networks with the same network
configuration and learning rule used to train the MTL-DNN. In addition, we imple-
mented single task classifiers using Logistic Regression and Naïve Bayes classifiers
using the Weka [20] software package.
Results are shown in Table 3.
The classification performance for the primary cancer site classification task was
almost identical and nearly perfect for all classifiers. Notable performance differences
Table 3. Macro and micro F-scores of the two classification tasks, primary site category and laterality, based on Naïve Bayes, Logistic Regression, Deep Neural Networks, and Multi-Task Learning of DNN.

Task | Classifier | Macro F-score | Micro F-score
Primary site | Naïve Bayes | 0.987 | 0.988
Primary site | Logistic Regression | 0.997 | 0.998
Primary site | DNN | 0.997 | 0.998
Primary site | MTL | 0.998 | 0.998
Laterality | Naïve Bayes | 0.648 | 0.654
Laterality | Logistic Regression | 0.899 | 0.899
Laterality | DNN | 0.929 | 0.929
Laterality | MTL | 0.948 | 0.948
were observed for the secondary, laterality classification task. The MTL classifier achieved substantially higher macro and micro F-scores than the other classification methods, suggesting that MTL can effectively leverage the commonality between the two tasks through a shared representation.
Further analysis of the laterality classification was performed using the area under
the Receiver Operating Characteristic (ROC) curve as the performance metric (Fig. 2).
ROC curve fitting and calculation of the Area Under ROC Curve (AUC) were based on
Wilcoxon statistics and obtained by JROCFIT [21] software.
AUC of MTL was 0.9762 (±0.0045), which is dramatically higher than the AUCs
of Naïve Bayes, 0.7061 (±0.0149), and Logistic Regression, 0.9102 (±0.0087).
However, it was not significantly higher than the AUC of DNN, 0.9731 (±0.0048).
Fig. 2. Receiver Operating Characteristic (ROC) curves of the laterality classification task. Area
Under ROC Curves (AUC) of the classifiers based on Wilcoxon statistics are 0.7061 for Naïve
Bayes, 0.9102 for Logistic Regression, 0.9731 for DNN, and 0.9762 for MTL.
202 H.-J. Yoon et al.
Fig. 3. Time to reach a fixed classification accuracy (92 %) for different architecture and
number of computing nodes.
5 Discussion
This study presents a novel application of deep learning for information extraction from
cancer pathology reports. Using two different but closely related information extraction
tasks, the study showed that deep learning outperforms traditional classifiers for the
more challenging task. Furthermore, the study demonstrated the value of multi-task
learning for boosting further classification performance. Although the improvement
was not statistically significant (mainly due to the relatively small size of the available
dataset), MTL appears to be a promising approach for analysis of pathology reports,
which tend to include several related pieces of clinical information. Compared to other
published studies on the topic, our information extraction algorithms perform com-
petitively, with F-scores on the high end of the reported spectrum (0.58 to 0.97).
However, no further conclusions can be drawn since in depth analysis requires
implementation on the same clinical text corpus. Our future studies will focus on
expanding the dataset size as well as extending the scope of the information extraction
task to multiple cancer types and multiple clinical attributes.
Acknowledgements. This manuscript has been authored by UT-Battelle, LLC under Contract
No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government
retains and the publisher, by accepting the article for publication, acknowledges that the United
States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or
reproduce the published form of this manuscript, or allow others to do so, for United States
Government purposes. The Department of Energy will provide public access to these results of
federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/
downloads/doe-public-access-plan).
The study was supported by the Laboratory Directed Research and Development (LDRD)
program of Oak Ridge National Laboratory, under LDRD projects No. 7417 and No. 8231.
References
1. Greenhalgh, T., Hurwitz, B.: Narrative based medicine: why study narrative. Br. Med. J. 318
(7175), 48–50 (1999)
2. Stein, H.D., Nadkarni, P., Erdos, J., Miller, P.L.: Exploring the degree of concordance of
coded and textual data in answering clinical queries from a clinical data repository. JAMIA 7
(1), 42–54 (2000)
3. Martinez, D., Li, Y.: Information extraction from pathology reports in a hospital setting. In:
Proceedings of the 20th ACM International Conference on Information and Knowledge
Management, pp. 1877–1882 (2011)
4. Jouhet, V., Defossez, G., Burgun, A., Le Beux, P., Levillain, P., Ingrand, P., Claveau, V.:
Automated classification of free-text pathology reports for registration of incident cases of
cancer. Methods Inf. Med. 51(3), 242 (2012)
5. Kavuluru, R., Hands, I., Durbin, E.B., Witt, L.: Automatic extraction of ICD-O-3 primary
sites from cancer pathology reports. In: Clinical Research Informatics AMIA Symposium
(2013, forthcoming)
6. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
7. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural
networks with multitask learning. In: Proceedings of the 25th International Conference on
Machine Learning, pp. 160–167. ACM (2008)
8. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of
words and phrases and their compositionality. In: Advances in Neural Information
Processing Systems, pp. 3111–3119 (2013)
9. Choi, E., Schuetz, A., Stewart, W.F., Sun, J.: Medical concept representation learning from
electronic health records and its application on heart failure prediction. arXiv preprint arXiv:
1602.03686. (2016)
10. Miotto, R., Li, L., Dudley, J.T.: Deep Learning to predict patient future diseases from the
electronic health records. In: Advances in Information Retrieval, pp. 768–774. Springer
International Publishing (2016)
11. Caruana, R.: Multitask Learning. Springer, New York (1998)
12. Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov)
Research Data (1973–2013), National Cancer Institute, DCCPS, Surveillance Research
Program, Surveillance Systems Branch, released April 2016
13. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc.,
New York (1986)
1 Introduction
A wireless multimedia sensor network (WMSN) is a distributed wireless net-
work that consists of a set of multimedia sensor nodes which are connected to
each other or connected to leading gateways. Nowadays, smart devices such as
mobile phones, smart televisions and smart watches are equipped with sensors
and network connections. Hence, with the advances in wireless communication
technologies, multimedia sensor networks are expected to be one of the major
components in the Internet of things (IoT).
A typical application for a WMSN would be a surveillance or monitoring system. Smart city surveillance cameras with 24/7 recording, or one million sensor nodes reporting meteorological data, produce data in various formats such as video, audio, and text [4]. All these huge amounts of structured
2 Related Work
Over the years, various methods have been used for wireless sensor network data representation and management [12–14, 17]. Li et al. [10] propose a hybrid data model to store their wireless sensor network data; the NoSQL part of the hybrid model is stored in a key-value structure to provide higher scalability and better performance. Jardak et al. [18] discuss big data with spatial data received from wireless sensors using real-life scenarios.
Angles and Gutierrez [9] present a survey on graph database models. They compare graph database models to other database models, such as the relational model, and then compare representative graph database models. Levene and Loizou [8] introduce a graph-based model called the HyperNode model; they also define a language, HNQL (HyperNode Query Language), to query and update the model. Manjeshwar and Agrawal focus on information retrieval from sensor networks and propose a hybrid protocol called APTEEN [11]. They note that achieving an efficient model requires replacing the conventional flat topology with a graph-based model.
PipeNet [20] is a multi-layered wireless sensor network application for a pipeline scenario, with a multi-layer architecture similar to the one we propose but in another domain. The system was developed to analyze the collected multi-modal data for leak detection. Another line of WSN research related to the surveillance domain is the survey by Felemban [21], which reviews the literature on experimental work in border surveillance and intrusion detection using WSN technology. Our research differs from existing work by employing a graph-based approach for the surveillance domain and focusing on the simulation of big data.
are done at the third-level fusion. The output of the last fusion level is an action such as triggering an alarm or sending a notification message to another system.
A graph database is our storage environment, and the data model is designed as a graph-based big data model, in parallel with our NoSQL database selection. Data stored in the graph database is used for further analytical processes such as tracking and event detection.
6 Experimental Work
We setup a test environment to make some experiments on our graph model.
Test environment hardware specifications are;
• Intel i7-4710HQ Quad Core CPU
• 16 GB DDR3 RAM
• 240 GB SSD storage
• 4 GB NVIDIA 860GTX GPU
The test environment hosts three database systems: OrientDB v2.1.2 (graph database), Neo4j v2.3.2 (graph database), and MySQL v5.7.1 (relational database). We simulated a sensor network with synthetic raw data for all three databases. Sensor nodes are placed in a square-shaped area, gateways are located at the center of each group of sensor nodes, and the sink node is placed at the center of the whole area. The total counts for specific node types are:
• 1 Sink
• 2,500 Gateways (each gateway leads 10,000 sensor nodes)
• 25,000,000 Sensor Nodes
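As a hedged illustration of this simulated topology (scaled down from the 2,500 gateways and 10,000 sensor nodes per gateway above), the sketch below generates sink, gateway, and sensor-node records with grid coordinates and "lead" relations; all names are illustrative rather than the authors' generator.

```python
# Sketch: generate a square grid of sensor nodes with per-group gateways and a sink.
import itertools

def build_topology(groups_per_side=5, nodes_per_side=10):
    """Yields (node_type, name, x, y, leader) tuples."""
    side = groups_per_side * nodes_per_side
    yield ("sink", "sink_0", side / 2, side / 2, None)
    for gx, gy in itertools.product(range(groups_per_side), repeat=2):
        gname = f"gw_{gx}_{gy}"
        cx = gx * nodes_per_side + nodes_per_side / 2  # gateway at group center
        cy = gy * nodes_per_side + nodes_per_side / 2
        yield ("gateway", gname, cx, cy, "sink_0")
        for nx, ny in itertools.product(range(nodes_per_side), repeat=2):
            x, y = gx * nodes_per_side + nx, gy * nodes_per_side + ny
            yield ("sensor_node", f"sn_{x}_{y}", x, y, gname)
```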
2. Explosion Videos Query: This query finds possible explosions by identifying sustained high volume around the surveillance area, together with the recorded video paths and video durations. Acoustic values greater than 15 are assumed to indicate high-volume sound.
OrientDB Query:
SELECT in("collect").name[0], in("collect").indexX[0], in("collect").indexY[0], acoustic, out("video").videoPath[0], out("video").videoDurationSec[0] FROM sensor_raw_data WHERE acoustic > 15 AND out("next").acoustic[0] > 15 AND out("next").out("next").acoustic[0] > 15 ORDER BY name
Neo4j Query:
MATCH (sn:sensor_node)-[:collect]->(sa:sensor_raw_data)-[:next]->(sb:sensor_raw_data)-[:next]->(sc:sensor_raw_data)-[:video]->(sv:sensor_raw_video_data) WHERE sa.acoustic > 15 AND sb.acoustic > 15 AND sc.acoustic > 15 RETURN sn.name, sa.acoustic, sn.indexX, sn.indexY, sv.videoPath, sv.videoDurationSec ORDER BY sn.name
Relational SQL Query:
SELECT s.name, s.index_x, s.index_y, ra.collectDate, ra.acoustic, v.video_path, v.duration FROM sensor_node s, sensor_raw_data ra, sensor_raw_data rb, sensor_raw_data rc, sensor_raw_video_data v WHERE ra.acoustic > 15 AND rb.acoustic > 15 AND rc.acoustic > 15 AND s.id = ra.sensor_node_id AND ra.id = rb.next_id AND rb.id = rc.next_id AND rc.video_id = v.id ORDER BY s.name ASC
Neo4j Query:
MATCH p=(a:sink)-[:lead*]->(b:gateway) WITH b.name AS gname, length(p) AS depth MATCH (sfd:fused_data)-[:fused_by]->(sn:sensor_node)<-[:lead]-(g:gateway) WHERE sfd.concept = "Human" AND sfd.weight > 0.9 AND g.name = gname RETURN gname, sn.name, sfd.fusionDate, depth
Relational SQL Query:
WITH RECURSIVE search_graph(id, name, lead, depth) AS (SELECT g.id, g.name, g.lead, 0 FROM gateway g WHERE g.lead IS NULL UNION ALL SELECT g.id, g.name, g.lead, sg.depth + 1 FROM gateway g, search_graph sg WHERE g.lead = sg.id) SELECT s.name, s2.name, s2.indexX, s2.indexY, s.depth FROM search_graph s INNER JOIN (SELECT DISTINCT sn.id, sn.name, sn.indexX, sn.indexY, sn.lead FROM sink_fused_data sfd, gateway_fused_data gfd, sensor_fused_data srfd, sensor_node sn WHERE srfd.fusion = gfd.id AND gfd.fusion = sfd.id AND srfd.fused_by = sn.id AND sfd.concept = 'Human' AND sfd.weight > 0.9) s2 ON s2.lead = s.id
Table 2 shows the performance results of our example queries. For the first query, OrientDB beats Neo4j, and the graph model is better than the relational model. For the second query, Neo4j performs better than OrientDB, and the graph model is again better than the relational one. The last query tests the recursive-SQL-like query; here the graph-based model is much better than the relational model. Neo4j, however, fails on this query; other constructs of Cypher (the query language of Neo4j) might handle it better.
7 Conclusions
We have developed a graph-based big data model to represent the multimedia
sensor networks. We have implemented our proposed graph data model and sim-
ulated multimedia wireless sensor networks with synthetic data to generate big
multimedia sensor data. As an application area, we have selected the surveil-
lance systems. The sensor data retrieved from the network and the video data
streamed from cameras are treated like big data. The storage environment of
aforesaid big data is selected as a graph database which suits well to our graph
model.
As our performance test results show, the graph model performs much better than the relational model. In order to decide on the graph database for our implementation, we compared two very well-known graph databases, Neo4j and OrientDB. OrientDB generally performs better than Neo4j in our experiments. Moreover, Neo4j's master-slave approach is not scalable, because only one single node is active at a time; from a big data point of view, this is not suitable for us.
Our graph-based big data model successfully survives even with millions of data nodes. We have tested many complex query scenarios on our synthetic data, and millions of data items can be efficiently queried in a few seconds. In our ongoing research, the big data graph database will be used for advanced analytics, more complex database queries, object tracking, and early detection of events.
References
1. Moniruzzaman, A.B., Hossain, S.A.: NoSQL database: new era of databases for big data analytics - classification, characteristics and comparison. arXiv preprint arXiv:1307.0191 (2013)
2. Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y., Wilkins, D.: A comparison of a graph database and a relational database: a data provenance perspective. In: Proceedings of the 48th Annual Southeast Regional Conference, p. 42. ACM (2010)
3. Chen, M., Man, S., Liu, Y.: Big data: a survey. Mobile Netw. Appl. 19, 171–209
(2014)
4. Perera, C., Zaslavsky, A., Christen, P., Georgakopoulos, D.: Sensing as a service
model for smart cities supported by Internet of Things. Trans. Emerg. Telecommun.
Technol. 25(1), 81–93 (2014)
5. Ho, L-Y., Wu, J.-J., Liu, P.: Distributed graph database for large-scale social
computing. In: 2012 IEEE 5th International Conference on Cloud Computing
(CLOUD), pp. 455–462. IEEE (2012)
6. Zaslavsky, A., Perera, C., Georgakopoulos, D.: Sensing as a service, big data, arXiv
preprint arXiv:1301.0159 (2013)
7. Hadim, S., Nader, M.: Middleware: Middleware challenges and approaches for wire-
less sensor networks. IEEE Distrib. Syst. Online 3, (2006)
8. Levene, M., Loizou, G.: A graph-based data model and its ramifications. IEEE Trans. Knowl. Data Eng. 7(5), 809–823 (1995)
9. Angles, R., Gutierrez, C.: Survey of graph database models. ACM Comput. Surv.
(CSUR) 40(1), 1–39 (2008)
10. Li, Y., Wu, C., Guo, L., Lee, C.-H., Guo, Y.: Wiki-health: a big data platform for
health. In: Cloud Computing Applications for Quality Health Care Delivery, p. 59
(2014)
11. Manjeshwar, A., Agrawal, D.P.: APTEEN: a hybrid protocol for efficient routing
and comprehensive information retrieval in wireless sensor networks. In: Interna-
tional Parallel and Distributed Processing Symposium, vol. 2. IEEE Computer
Society (2002)
12. Hu, C., Liu, Y., Chen, L.: Semantic link network based model for organizing mul-
timedia big data. IEEE Trans. Emerg. Topics Comput., 1 (2011)
13. Korpeoglu, B., Yazici, A., Korpeoglu, I., George, R.: A new approach for informa-
tion processing in wireless sensor network. In: Proceedings of the 22nd International
Conference on Data Engineering Workshops, pp. 34–34. IEEE (2006)
14. Zhang, P., Yan, Z., Sun, H.: A novel architecture based on cloud computing for
wireless sensor network. In: Proceedings of the 2nd International Conference on
Computer Science and Electronics Engineering (ICCSEE 2013), pp. 472–475 (2013)
15. Jing, H., Haihong, E., Le, G., Du, J.: Survey on NoSQL database. In: 2011 6th
International Conference on Pervasive Computing and Applications (ICPCA), pp.
363–366. IEEE (2011)
16. Xu, Z., Liu, Y., Mei, L., Hu, C., Chen, L.: Semantic based representing and orga-
nizing surveillance big data using video structural description technology. J. Syst.
Softw. 102, 217–225 (2015)
17. Diallo, O., Rodrigues, J.J., Sene, M.: Real-time data management on wireless sen-
sor networks: a survey. J. Netw. Comput. Appl. 35(3), 1013–1021 (2012)
18. Jardak, C., Mahonen, P., Riihijärvi, J.: Spatial big data and wireless networks:
experiences, applications, and research challenges. IEEE Netw. 28(4), 26–31 (2014)
19. Simmen, D., Schnaitter, K., Davis, J., He, Y., Lohariwala, S., Mysore, A., Shenoi,
V., Tan, M., Xiao, Y.: Large-scale graph analytics in Aster 6: bringing context to
big data discovery. Proc. VLDB Endowment 7(13), 1405–1416 (2014)
20. Stoianov, I., Nachman, L., Madden, S., Tokmouline, T.: PIPENET: a wireless
sensor network for pipeline monitoring. In: 2007 6th International Symposium on
Information Processing in Sensor Networks, pp. 264–273. IEEE (2007)
21. Felemban, E.: Advanced border intrusion detection and surveillance using wireless
sensor network technology. Int. J. Commun. Netw. Syst. Sci. 6(5), 251 (2013)
CCM: Controlling the Change Magnitude
in High Dimensional Data
1 Introduction
popular machine learning repositories like the UCI one [14]. Finally (Sect. 6),
we demonstrate the effectiveness of the proposed method, showing that changes
introduced by other methods can yield considerably different magnitudes as the
data dimension scales, as in big data scenarios.
2 Problem Formulation
We consider a dataset S of stationary data, i.e. containing independent and
identically distributed (i.i.d.) random vectors s ∈ Rd . We assume that S is
generated by a probability density function (pdf) φ0 that is possibly unknown.
We want to generate a datastream X = {x(t), t = 1, . . . } affected by a change
at time t = τ as
$$x(t) \sim \begin{cases} \phi_0 & t < \tau \\ \phi_1 & t \ge \tau \end{cases}, \quad \text{where } \phi_1(x) := \phi_0(Qx + v), \qquad (1)$$
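As a concrete illustration of (1) (a minimal sketch, not the authors' released code), post-change samples can be generated from stationary ones: if $s \sim \phi_0$, then $Q^{-1}(s - v) \sim \phi_1$, since Q is a rotation and $|\det Q| = 1$. The name `phi0_sampler` is a hypothetical placeholder for a sampler of the stationary pdf.

```python
import numpy as np

def generate_datastream(phi0_sampler, Q, v, T, tau):
    """Generate a stream of length T with a change at time tau, per Eq. (1).

    phi0_sampler(n) is assumed to return n i.i.d. samples from phi0.
    Post-change samples y = Q^{-1}(s - v) have pdf phi1(x) = phi0(Q x + v),
    since Q is orthogonal (|det Q| = 1).
    """
    pre = phi0_sampler(tau)                      # x(t) ~ phi0 for t < tau
    s = phi0_sampler(T - tau)
    post = (np.linalg.inv(Q) @ (s - v).T).T      # x(t) ~ phi1 for t >= tau
    return np.vstack([pre, post])
```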
3 Method Description
In this section we describe two algorithms that iteratively compute the rotation
matrix Q and the translation vector v yielding sKL(φ0 , φ1 ) ≈ κ. To compute
sKL(φ0 , φ1 ) in (2), both φ0 and φ1 are necessary; however, since φ0 is typically
unknown, we replace it with a GM estimate φ0 , which is computed on the whole
dataset S. We also adopt a suitable parametrization for Q and v, which are ran-
domly initialized as in Algorithm 1. These parameters are then adjusted using a
bisection method, described in Algorithm 2, such that sKL(φ0, φ1) approaches
the target value κ up to a tolerance ε.
Fitting Pre-Change Distribution: To define φ1 satisfying (3) we need to
know φ0 , which is typically unknown. Therefore, we compute its estimate φ0 by
fitting a GM on the whole dataset of stationary data S. We adopt GMs since
they are flexible models that can approximate skewed and multimodal
distributions well, even in high dimensions [13].
The pdf φ0 of a GM is a convex combination of k Gaussian functions

$$\phi_0(x) = \sum_{i=1}^{k} \frac{\lambda_i}{(2\pi)^{d/2}\det(\Sigma_i)^{1/2}}\, e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}, \qquad (4)$$
where $\sum_{i=1}^{k} \lambda_i = 1$ and $\lambda_i \ge 0$, $i \in \{1, \ldots, k\}$, are the weights of the Gaussians,
each having mean $\mu_i$ and covariance $\Sigma_i$. We estimate the weights $\lambda_i$ and the
parameters of the Gaussians using an Expectation-Maximization (EM) algorithm [15],
and select the best value of k via cross-validation.
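A minimal sketch of this fitting step, assuming scikit-learn is available (the released framework is in MATLAB, so this is only an illustrative equivalent): fit GMs for several candidate values of k and keep the one with the best cross-validated log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_score

def fit_phi0(S, k_candidates=(1, 2, 4, 8)):
    """Fit the GM estimate of phi0 on the stationary dataset S (n x d),
    selecting the number of components k via cross-validation."""
    best_k, best_score = None, -np.inf
    for k in k_candidates:
        gm = GaussianMixture(n_components=k, covariance_type='full')
        # cross_val_score uses gm.score = mean per-sample log-likelihood
        score = cross_val_score(gm, S, cv=5).mean()
        if score > best_score:
            best_k, best_score = k, score
    return GaussianMixture(n_components=best_k).fit(S)
```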
Parametrization: To ease calculations, we express Q with respect to its angles
of rotation. We stack $m := \lfloor d/2 \rfloor$ angles $\theta_1, \ldots, \theta_m$ in a vector θ, and define the
matrix $S(\theta) \in \mathbb{R}^{d \times d}$ as

$$S(\theta) = \begin{bmatrix} R(\theta_1) & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & R(\theta_m) & 0 \\ 0 & \cdots & 0 & 1 \end{bmatrix}, \qquad R(\theta_i) = \begin{bmatrix} \cos\theta_i & -\sin\theta_i \\ \sin\theta_i & \cos\theta_i \end{bmatrix}. \qquad (5)$$
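To make (5) concrete, a small sketch (an illustrative reconstruction, not the authors' code) that assembles S(θ) as a block-diagonal matrix of 2×2 rotations, with a trailing 1 when d is odd:

```python
import numpy as np
from scipy.linalg import block_diag

def rotation_matrix(theta, d):
    """Build S(theta) in R^{d x d} from m = floor(d/2) rotation angles, Eq. (5)."""
    blocks = [np.array([[np.cos(t), -np.sin(t)],
                        [np.sin(t),  np.cos(t)]]) for t in theta]
    if d % 2 == 1:
        blocks.append(np.eye(1))   # trailing 1 as in (5) when d is odd
    return block_diag(*blocks)
```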
Initially, $\theta_l$ and $\rho_l$ are set to 0 (line 3), while $\theta_u = \theta^{(0)}$ and $\rho_u = \rho^{(0)}$ (line 4).
As in any bisection method, we set $\theta^{(j)}$ to the average of $\theta_l^{(j)}$ and $\theta_u^{(j)}$ (line 7),
and $\rho^{(j)}$ to the average of $\rho_l^{(j)}$ and $\rho_u^{(j)}$ (line 8). Then, if the corresponding value
of sKL, namely $s^{(j)}$ (line 10), is smaller than the target value κ, we update $\theta_l$ and $\rho_l$
(line 11); otherwise we update $\theta_u$ and $\rho_u$ (line 13).
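A compact sketch of this bisection loop, assuming a hypothetical function `skl_of(theta, rho)` that evaluates sKL(φ0, φ1) for the roto-translation parametrized by θ and ρ (all names here are illustrative):

```python
import numpy as np

def bisect_change(skl_of, theta0, rho0, kappa, eps=1e-3, max_iter=100):
    """Shrink [lower, upper] until sKL is within eps of the target kappa."""
    theta_l, rho_l = np.zeros_like(theta0), 0.0      # line 3
    theta_u, rho_u = theta0, rho0                    # line 4
    for _ in range(max_iter):
        theta = (theta_l + theta_u) / 2              # line 7
        rho = (rho_l + rho_u) / 2                    # line 8
        s = skl_of(theta, rho)                       # line 10
        if abs(s - kappa) <= eps:
            break
        if s < kappa:
            theta_l, rho_l = theta, rho              # line 11
        else:
            theta_u, rho_u = theta, rho              # line 13
    return theta, rho
```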
4 Proofs of Convergence
To prove the convergence of the proposed method, we first demonstrate that
Algorithm 1 terminates after a finite number of iterations (Theorem 1), and
then that Algorithm 2 actually converges (Theorem 2).
Theorem 1. Let φ0 be a Gaussian mixture. Then, for any κ > 0, Algorithm 1
converges in a finite number of iterations.
Proof. It is enough to show that $\mathrm{sKL}(\phi_0, \phi_0(Q^{(0)}\cdot + v)) \to \infty$ as $\|v\|_2 \to \infty$,
or that it admits a lower bound that diverges as $\|v\|_2 \to \infty$. The lower bound
follows from the definition of sKL and Jensen's inequality:

$$\mathrm{sKL}\big(\phi_0, \phi_0(Q^{(0)}\cdot + v)\big) \ge \int_{\mathbb{R}^d} \phi_0(x)\,\log\frac{\phi_0(x)}{\phi_0(Q^{(0)}x + v)}\,dx \ge \int_{\mathbb{R}^d} \phi_0(x)\log(\phi_0(x))\,dx - \log\!\left(\int_{\mathbb{R}^d} \phi_0(x)\,\phi_0(Q^{(0)}x + v)\,dx\right). \qquad (8)$$
We now define $f(v) := \int_{\mathbb{R}^d} \phi_0(x)\,\phi_0(Q^{(0)}x + v)\,dx$, which we observe is a con-
tinuous function (see Lemma 16.1 of [6]). We prove the theorem by demonstrating
that $f \in L^1(\mathbb{R}^d)$, as this implies that $f(v) \to 0$ when $\|v\|_2 \to \infty$, thus that
the lower bound (8) diverges. It follows that:

$$\int_{\mathbb{R}^d} |f(v)|\,dv = \int_{\mathbb{R}^d} f(v)\,dv = \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} \phi_0(x)\,\phi_0(Q^{(0)}x + v)\,dx\,dv = \int_{\mathbb{R}^d} \phi_0(x)\left(\int_{\mathbb{R}^d} \phi_0(Q^{(0)}x + v)\,dv\right)dx = 1. \qquad (9)$$
Theorem 2. Let φ0 be a Gaussian mixture. Then, for any κ > 0, Algorithm 2
converges in a finite number of iterations.
Proof. The thesis follows from the Intermediate Value Theorem [18] (Theorem
4.23) applied to the function used in the bisection procedure. We have
that sKL(0) < κ (since sKL(0) = 0) and sKL(1) > κ; thus it is enough to show
that sKL(·) is continuous in [0, 1] to prove that there exists a value s̄ such that
sKL(s̄) = κ, thus that θ(s̄) and ρ(s̄) yield an sKL equal to κ, and thus that the
bisection method converges (see [8], Theorem 2.1).
To show that sKL(·) in (10) is continuous, we show that $\mathrm{KL}(\cdot) = \int_{\mathbb{R}^d} g(\cdot, x)\,dx$ is continuous, where

$$g(a, x) := \phi_0(x)\,\log\frac{\phi_0(x)}{\phi_0\big(Q(\theta(a), P)\,x + v(\rho(a), u)\big)}, \qquad (11)$$
that holds for some constants $c_1, c_2, r > 0$. Then, for a sufficiently large r,

$$|g(a, x)| = \phi_0(x)\,\Big|\log(\phi_0(x)) - \log\big(\phi_0(Q(\theta(a), P)x + v(\rho(a), u))\big)\Big| \le \phi_0(x)\,\big(c_1 + c_2\|x\|_2^2\big)\,\big(c_1 + c_2\|Q(\theta(a), P)x + v(\rho(a), u)\|_2^2\big) \qquad (13)$$

for some $c > 0$. Thus, the function that dominates $|g(\cdot, x)|$ for $\|x\|_2 > r$ is
$c\,\phi_0(x)$, which obviously belongs³ to $L^1(\mathbb{R}^d)$.
5 The Framework
We implement the proposed method in a MATLAB framework which is publicly
available for download at http://home.deib.polimi.it/carrerad. The main func-
tion of the framework is generateDatastreams, which manipulates a dataset S
containing i.i.d. data and generates multiple datastreams affected by a change as
in (1). At first, this function performs a random shuffling of the dataset to obtain
different datastreams every time the function is invoked. Then it estimates the
pdf φ0 over the whole S and generates a change having magnitude κ at time τ
by performing a roto-translation of the data implementing Algorithms 1 and 2.
The pdf φ0 is implemented as the object gmDistr, which includes fit and
symmetricKullbackLeibler among its methods. The method fit estimates
² This bound is trivial for Gaussian pdfs and follows from basic algebra in the case of GMs.
³ There is no need to find a dominant function for $\|x\|_2 \le r$ since there $g(\cdot, x)$ is bounded.
φ0 over the dataset S and is based on the EM algorithm [15] in MATLAB.
The method symmetricKullbackLeibler takes as input the φ1 and estimates
sKL(φ0 , φ1 ) via a Monte Carlo simulation which involves the computation of the
log-likelihood with respect to φ0 and φ1 on synthetically generated samples. To
prevent severe numerical issues arising when computing the log-likelihood of a
GM via (4) (in particular when $d \gg 1$), we adopt the following upper bound
that approximates $\phi_0(x)$, as in [2]:

$$\log\phi_0(x) \le \log(k\lambda_{i^*}) - \frac{1}{2}\left[\log\big((2\pi)^d \det(\Sigma_{i^*})\big) + (x - \mu_{i^*})^T\,\Sigma_{i^*}^{-1}\,(x - \mu_{i^*})\right] \qquad (14)$$

where $i^* = \operatorname{argmax}_{i=1,\ldots,k}\ \frac{\lambda_i}{(2\pi)^{d/2}\det(\Sigma_i)^{1/2}}\, e^{-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}$. The matrix Q and the
vector v are computed by computeRotoTranslation, which implements both
Algorithms 1 and 2, as in Sect. 3.
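A sketch of the bound (14) (an illustrative reconstruction; the framework implements it inside symmetricKullbackLeibler in MATLAB): find the dominating component i* and evaluate the expression in the log domain, avoiding the underflow of exponentiating (4) when d ≫ 1.

```python
import numpy as np

def log_gm_upper_bound(x, lambdas, mus, Sigmas):
    """Upper bound on log phi0(x) for a k-component GM, as in (14)."""
    k, d = len(lambdas), x.shape[0]
    log_comp = np.empty(k)
    for i in range(k):
        diff = x - mus[i]
        maha = diff @ np.linalg.solve(Sigmas[i], diff)
        log_comp[i] = (np.log(lambdas[i])
                       - 0.5 * (d * np.log(2 * np.pi)
                                + np.linalg.slogdet(Sigmas[i])[1] + maha))
    i_star = np.argmax(log_comp)          # dominating component i*
    return np.log(k) + log_comp[i_star]   # log(k * lambda_{i*} * N_{i*}(x))
```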
6 Experiments
[Figure: sKL as a function of the data dimension d, comparing CCM against the Swap, Offset, and Power alternatives; panels (a) and (b).]
7 Conclusions
We have presented CCM, a rigorous method to control the change magnitude
in multivariate datasets. The algorithms at the core of CCM are sound, and a
proof of their convergence is given. Our experiments highlight the importance of
generating datasets with CCM to control the change magnitude, rather than
relying on the more heuristic experimental practices that are commonly adopted.
Most importantly, CCM enables a fair assessment of detection performance and
an investigation of the change-detection challenges arising in high-dimensional data,
as typically happens in big data scenarios. Future work concerns extending CCM to
also include change models affecting the data dispersion.
References
1. Alippi, C.: Intelligence for Embedded Systems, A Methodological Approach.
Springer, Switzerland (2014)
2. Alippi, C., Boracchi, G., Carrera, D., Roveri, M.: Change detection in multivariate
datastreams: Likelihood and detectability loss. In: Proceedings of IJCAI (2016)
3. Alippi, C., Boracchi, G., Roveri, M.: A Just-In-Time adaptive classification system
based on the Intersection of Confidence Intervals rule. Neural Netw. 24(8), 791–800
(2011)
4. Alippi, C., Boracchi, G., Roveri, M.: Just-in-time classifiers for recurrent concepts.
IEEE Trans. Neural Netw. Learn. Syst. 24(4) (2013)
5. Alippi, C., Boracchi, G., Roveri, M.: Hierarchical change-detection tests. IEEE
Trans. Neural Netw. Learn. Syst. PP(99), 1–13 (2016)
6. Bauer, H.: Measure and Integration Theory. Walter de Gruyter, Berlin (2001)
7. Boracchi, G., Roveri, M.: Exploiting self-similarity for change detection. In: Pro-
ceedings of IEEE International Joint Conference on Neural Networks (IJCNN)
(2014)
8. Burden, R.L., Faires, J.D.: Numerical Analysis. Brooks/Cole, USA (2001)
9. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Com-
put. Surv. 41(3), 15:1–15:58 (2009)
10. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In:
Proceedings of Brazilian Symposium on Artificial Intelligence (SBIA) (2004)
11. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on
concept drift adaptation. ACM Comput. Surv. (CSUR) 46(4) (2014)
12. Harel, M., Mannor, S., El-yaniv, R., Crammer, K.: Concept drift detection through
resampling. In: Proceedings of ICML, pp. 1009–1017 (2014)
13. Kuncheva, L.I.: Change detection in streaming multivariate data using likelihood
detectors. IEEE Trans. Knowl. Data Eng. 25(5) (2013)
14. Lichman, M.: UCI machine learning repository. http://archive.ics.uci.edu/ml
15. McLachlan, G., Peel, D.: Finite Mixture Models. Wiley, New York (2004)
16. Pimentel, M.A., Clifton, D.A., Clifton, L., Tarassenko, L.: A review of novelty
detection. Sig. Process. 99, 215–249 (2014)
17. Ross, G.J., Tasoulis, D.K., Adams, N.M.: Nonparametric monitoring of data
streams for changes in location and scale. Technometrics 53(4) (2011)
18. Rudin, W.: Principles of Mathematical Analysis. McGraw-Hill, New York (1964)
19. Street, W.N., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale clas-
sification. In: Proceedings of the Seventh ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (2001)
Spark Parameter Tuning via Trial-and-Error
1 Introduction
Spark [9,10] has emerged as one of the most widely used frameworks for mas-
sively parallel data analytics. In summary, it improves upon Hadoop MapReduce
in terms of flexibility in the programming model and performance [6], especially
for iterative applications. It can accommodate both batch and streaming applica-
tions, while providing interfaces to other established big data technologies, espe-
cially regarding storage, such as HDFS and NoSQL databases. Its key feature is
that it manages to hide the complexities related to parallelism, fault-tolerance
and cluster setting from end users and application developers. To support all
these, the Spark execution engine has evolved into an efficient albeit complex
system with more than 150 configurable parameters. The default values are usu-
ally sufficient for a Spark program to run, e.g., so that it does not run out of memory
and crash for lack of the option to spill data to disk. But this gives rise to
the following research question: “Can the default configuration be improved?”
The aim of this work is to answer the above question in an efficient manner.
Clearly, it is practically impossible to check all the different combinations of
parameter values for all tunable parameters. Therefore, tuning arbitrary Spark
applications by inexpensively navigating through the vast search space of all
The work in [2] deals with the issue of parameter optimization in workflows,
but may involve far too many experimental runs, whereas we advocate a limited
number of configuration runs independently from the application size.
3 Parameters of Interest
Navigating through the vast search space is one of the biggest challenges in
parameter testing and tuning, due to the exponential increase in the number of
different configurations with the number of properties and their valid values.
Based on evidence from
(i) the documentation and the earlier works presented in Sect. 2 and (ii) our own
runs (for which we do not present results due to space constraints), we narrow
our focus on 12 parameters, the configuration of which needs to be investi-
gated according to each application instance separately. As already explained,
the application-independent parameters that need to be set for a specific
data center and those related to data parallelism are out of our scope. Finally,
we do not consider parameters related to YARN or MESOS.
4 Sensitivity Analysis
We employ three benchmark applications: (i) sort-by-key; (ii) shuffling and (iii)
k-means. K-means and sort-by-key are also part of the HiBench benchmark and
were selected because they can be considered representative of a variety
of applications. The shuffling application generates the data according to the
terasort benchmark, but does not perform any sorting; it just shuffles all the
data in order to stress the shuffling component of the system, given that shuffling
is known to play a big role in the performance of Spark applications. To avoid
the interference of the underlying file system, in all applications, the dataset was
generated at the beginning of each run on the fly. The MareNostrum hardware
specifications are described in [7]. 20 16-core machines are used, and the average
allocated memory per core is 1.5 GB. The version of Spark at the time of the
experiments was 1.5.2.
For each of the selected parameters, we perform a separate set of tests, and in
each test we examine a different value. Then, the performance is compared with
the performance of the default configuration after modifying the serializer, as
argued below. If there is a big difference between the results, then the parameter
can be considered as having an impact on the overall performance.
The parameter values are selected as follows. If the parameter takes a binary
value, for instance a parameter that specifies whether to use a feature or not, then
the non-default value is tested. For parameters that take one of several distinct
values, for instance the compression codec to be used
(snappy, lzf, lz4), all the different values are tested. Finally, for parameters that
take numeric values in a wide range, e.g., spark.io.file.buffer, the values
close to the default are tested. Each experiment was conducted five times (at
least) and the median value is reported.
does not appear to provide any significant improvement either. This could be
attributed to a variety of reasons, one being the fact that the sort-by-key appli-
cation does not generate a very high number of files during shuffling. The last
three parameters, shuffle.spill.compress, shuffle.io.preferDirectBufs
and rdd.compress, do not seem to significantly affect the performance either. For
the former, this can be attributed to the fact that few spills are conducted.
For rdd.compress, there is a small performance degradation, as expected, since
the RDD can fit into main memory and CPU time is unnecessarily spent on
the compression.
Based on (i) the results of the previous section, (ii) the expert knowledge as
summarized in Sect. 2 and (iii) our overall experience from running hundreds
of experiments (not all are shown in this work), we derive an easily applicable
tuning methodology. This methodology is presented in Fig. 1 in the form of a
block diagram. In the figure, each node represents a test run with one or two
different configurations. Test runs that are higher in the figure are expected to
have a bigger impact on performance and, as a result, a higher priority. As such,
runs start from the top and, if an individual configuration improves the perfor-
mance, the configuration is kept and passed to its children replacing the default
value for all the test runs on the same path branch. If an individual configu-
ration does not improve the performance, then the configuration is not added
and the default is kept. In other words, each parameter configuration is propa-
gated downstream up to the final configuration as long as it yields performance
improvements. Contrary to the previous experiments, the methodology test runs
investigate the combined effect of tuning.
Overall, as shown in the figure, at most ten configurations need to be eval-
uated, referring to nine of the parameters in Sect. 3. Note that, even if each
parameter took only two values, exhaustively checking all combinations would
result in 2⁹ = 512 runs. Finally, the methodology can be employed in a less
restrictive manner, where a configuration is chosen not merely if it improves the
performance, but only if the improvement exceeds a threshold, e.g., 5 % or 10 %.
The rationale of the diagram blocks, itemized by the main parameter they
target, is as follows:
– spark.serializer: This parameter had the highest impact in our series of
experiments, and using the KryoSerializer was the default baseline for all
the other parameters. We keep the same rationale in our methodology, so we
perform this test first.
– spark.shuffle.manager: As shown in the results of the previous section, the
shuffle manager has a high impact on performance, so it should be included in
the methodology. Since, based on documentation, tungsten-sort works better
with the lzf compression codec, we combine the test of these two settings. Also,
the test run for the other option of this parameter, the hash shuffling manager,
is conducted in combination with the implementation of consolidating files
during a shuffle, to avoid problems from the creation of too many intermediate
files.
– spark.shuffle.compress: In our experiments, disabling it led to serious per-
formance degradation (by default it is enabled). This means that it has an
impact. Interestingly, the best results presented by Spark’s developers for the
terasort benchmark were produced when this is disabled, which further supports
our choice to include it in our methodology. A configuration sketch combining
these settings is given after this list.
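Purely as an illustration (a sketch assuming a Spark 1.5-era PySpark deployment, not a configuration prescribed by the authors), the settings discussed above can be combined programmatically; the parameter names are standard Spark ones:

```python
from pyspark import SparkConf, SparkContext

# Start from the serializer change, which had the highest impact,
# then layer the shuffle-related settings tested by the methodology.
conf = (SparkConf()
        .setAppName("tuning-trial")
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer")
        .set("spark.shuffle.manager", "tungsten-sort")    # or "hash"
        .set("spark.io.compression.codec", "lzf")         # pairs with tungsten-sort
        # .set("spark.shuffle.consolidateFiles", "true")  # pair with "hash" instead
        .set("spark.shuffle.compress", "true"))           # default; keep enabled

sc = SparkContext(conf=conf)
```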
6 Conclusions
This work deals with configuring Spark applications in an efficient manner. We
focus on 12 key application instance-specific configurable parameters and assess
their impact using real runs on a petaflop supercomputer. Based on the results
and the knowledge about the role of these parameters, we derive a trial-and-
error methodology, which requires a very small number of experimental runs.
We evaluate the effectiveness of our methodology using three case studies, and
the results show that we can achieve speedups of more than 10-fold. Although
our results are significant, further research is required to investigate additional
infrastructures, benchmark applications, parameters and combinations.
References
1. Awan, A.J., Brorsson, M., Vlassov, V., Ayguade, E.: How data volume affects spark
based data analytics on a scale-up server (2015). arXiv:1507.08340
2. Holl, S., Zimmermann, O., Palmblad, M., Mohammed, Y., Hofmann-Apitius, M.: A
new optimization phase for scientific workflow management systems. Future Gener.
Comput. Syst. 36, 352–362 (2014)
3. Karau, H., Konwinski, A., Wendell, P., Zaharia, M.: Learning Spark: Lightning-
Fast Data Analysis. O’Reilly Media, Sebastopol (2015)
4. Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.G.: Making sense
of performance in data analytics frameworks. In: 12th USENIX Symposium on
Networked Systems Design and Implementation (NSDI 2015), pp. 293–307 (2015)
5. Petridis, P., Gounaris, A., Torres, J.: Spark parameter tuning via trial-and-error.
arXiv:1607.07348
6. Shi, J., Qiu, Y., Minhas, U.F., Jiao, L., Wang, C., Reinwald, B., Özcan, F.: Clash
of the titans: mapreduce vs. spark for large scale data analytics. PVLDB 8(13),
2110–2121 (2015)
7. Tous, R., Gounaris, A., Tripiana, C., Torres, J., Girona, S., Ayguadé, E., Labarta,
J., Becerra, Y., Carrera, D., Valero, M.: Spark deployment and performance eval-
uation on the marenostrum supercomputer. In: IEEE International Conference on
Big Data (Big Data), pp. 299–306 (2015)
8. Wang, Y., Goldstone, R., Yu, W., Wang, T.: Characterization and optimization of
memory-resident mapreduce on HPC systems. In: 28th International Parallel and
Distributed Processing Symposium, pp. 799–808 (2014)
9. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin,
M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstrac-
tion for in-memory cluster computing. In: NSDI 2012 (2012)
10. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster
computing with working sets. In: HotCloud 2010 (2010)
Parallel Computing TEDA for High Frequency
Streaming Data Clustering
1 Introduction
The amount and scale of data streams grow rapidly thanks to maturing
technologies, the lower prices of various electronic devices, and more widely
distributed sensor networks. As the world has already entered the Era of
Big Data, data-intensive technologies have been extensively used in the devel-
oped economies and numerous international organizations and companies are
Additionally, we use the time tag, age [21] to monitor the quality of the
existing clusters stored in the data stream processors and enhance its ability
to self-evolve. Thus, Parallel TEDA is able to follow the changes of the data
pattern and avoid the accumulation of error. Because of the time tag, when the
Parallel TEDA approach is dealing with multiple data streams, there is no need to
process the data streams sequentially one by one. As soon as any
processor is free, the next data stream can be processed seamlessly. In addition,
several data streams can be processed together at the same time because of the
multi-processor structure.
The remainder of this paper is organized as follows: Sect. 2 introduces the
theoretical basis of the TEDA framework. The details of the proposed Par-
allel TEDA approach are described in Sect. 3. In Sect. 4, the algorithm of the
proposed Parallel TEDA approach is presented. Section 5 is for numerical exper-
iments as well as analysis and the conclusions are presented in Sect. 6.
where $d(x_i, x_l)$ denotes any distance measure between $x_i$ and $x_l$. For clarity, in
this paper we use the most commonly used Euclidean type of distance, namely
$d(x_i, x_l) = \|x_i - x_l\|$.
2.3 Typicality
Typicality is the main measure introduced within TEDA [18–20]. It can be con-
sidered as a form of centrality or the normalized inverse cumulative proximity
[20]. The typicality of $x_i$ $(i = 1, 2, \ldots, k;\ k > 1)$ is [20]:

$$\tau_k(x_i) = \frac{\pi_k^{-1}(x_i)}{\sum_{l=1}^{k} \pi_k^{-1}(x_l)} = \left[\pi_k(x_i)\sum_{l=1}^{k}\big(\pi_k(x_l)\big)^{-1}\right]^{-1} \qquad (4)$$
where

$$\mu_{k+1} = \frac{k}{k+1}\,\mu_k + \frac{1}{k+1}\,x_{k+1};\quad \mu_1 = x_1 \qquad (5)$$
$$X_{k+1} = \frac{k}{k+1}\,X_k + \frac{1}{k+1}\,\|x_{k+1}\|^2;\quad X_1 = \|x_1\|^2$$
The corresponding sum of cumulated proximity can be updated as follows
[20]:

$$\sum_{j=1}^{k+1}\pi_{k+1}(x_j) = \sum_{j=1}^{k+1}\sum_{l=1}^{k+1}(x_j - x_l)^T(x_j - x_l) = 2(k+1)^2\big(X_{k+1} - \|\mu_{k+1}\|^2\big) \qquad (6)$$
Using Eqs. (5) and (6), the recursive form of the standardized eccentricity
can be obtained as:

$$\varepsilon_{k+1}(x_{k+1}) = \frac{\|x_{k+1} - \mu_{k+1}\|^2 + X_{k+1} - \|\mu_{k+1}\|^2}{X_{k+1} - \|\mu_{k+1}\|^2} = 1 + \frac{\|x_{k+1} - \mu_{k+1}\|^2}{X_{k+1} - \|\mu_{k+1}\|^2} \qquad (7)$$
Since the proposed approach is for live streaming data, the data pattern of the
stream will potentially change as time elapses; as a result, the old existing
clusters may not be able to represent well the ensemble properties of the data
samples following a possible shift or drift [22,23]. Cluster age [21] is an accumulated
relative time tag which allows one to decide whether a cluster is outdated, and is
expressed as follows:

$$A_k^c = k - \frac{\sum_{l=1}^{S_k^c} I_l^c}{S_k^c};\qquad c = 1, 2, \ldots, C;\quad S_k^c \ge 1;\quad k > 1 \qquad (8)$$
where the sub-label c denotes the cth cluster and C is the number of existing
clusters in the clustering result; Skc is the support of the cth cluster at the time
instance k (number of data samples associated with it).
The well-known Chebyshev inequality [24] describes the probability of the dis-
tance between a certain data sample x i and the mean value to be larger than n
times the standard deviation, namely nσ. For Euclidean type of distance it has
the following form:
$$P\big(\|x_i - \mu_k\|^2 \le n^2\sigma^2\big) \ge 1 - \frac{1}{n^2} \qquad (9)$$
In TEDA, this condition is expressed regarding standardized eccentricity and
in a more elegant form [19]:
$$P\big(\varepsilon_k(x_i) \le 1 + n^2\big) \ge 1 - \frac{1}{n^2} \qquad (10)$$
That means we can directly check whether the value of the standardized eccentricity,
$\varepsilon_k$, is less than 10 for the “3σ” case (n = 3).
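A small sketch tying Eqs. (5), (7) and (10) together (an illustrative reimplementation assuming the Euclidean distance; the evaluation in the paper runs in MATLAB): update μ and X recursively and flag a sample as anomalous when ε exceeds 1 + n².

```python
import numpy as np

class EccentricityTracker:
    """Recursive mean / scalar-product tracking per Eqs. (5) and (7)."""

    def __init__(self):
        self.k, self.mu, self.X = 0, None, 0.0

    def update(self, x, n=3):
        x = np.asarray(x, dtype=float)
        self.k += 1
        if self.k == 1:
            self.mu, self.X = x.copy(), float(x @ x)   # mu_1 = x_1, X_1 = ||x_1||^2
            return 1.0, False
        w = (self.k - 1) / self.k
        self.mu = w * self.mu + x / self.k             # Eq. (5)
        self.X = w * self.X + (x @ x) / self.k
        denom = self.X - self.mu @ self.mu
        eps = 1.0 + ((x - self.mu) @ (x - self.mu)) / denom   # Eq. (7)
        return eps, eps > 1 + n ** 2                   # Chebyshev test, Eq. (10)
```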
For each data chunk, once it is sent to the data stream processor, it will be
processed separately and the current processing result will only be influenced by
the previous data chunks processed within the same data stream processor and
will also influence the future output of the processor. Therefore, for the remainder
of this section, all the basic elements of the Parallel TEDA approach introduced are
considered to be within the same particular data stream processor.
The collection of data samples entering the i-th data processor is denoted as
$X^i = \{x_1^i, x_2^i, \ldots, x_K^i\}$ $(i = 1, 2, \ldots, N)$, and the number of samples, which can
also be viewed as the time instance, is k and will keep growing with time. The number
of data samples processed by each processor is considered to be the same.
The Parallel TEDA approach is divided into two stages. The first stage is the
updating of clusters and parameters within each individual data stream separately;
we consider this stage to be separate processing. The second stage merges
the separate clustering results of all existing data streams together, and is called
clusters fusion. The architecture of the proposed approach is presented in Fig. 1.
In the remainder of this section, the proposed approach will be described in
detail.
their standardized eccentricity per cluster is calculated using Eq. (7). Then, we
monitor $\varepsilon_{k+1}^{i,c}(x_{k+1}^i)$, $c = 1, 2, \ldots, C^i$, where $C^i$ denotes the number of existing
clusters in the i-th processor.
For every cluster, the standardized eccentricities of all its members sum to
$2S_k^{i,c}$, and the average standardized eccentricity is $\varepsilon_{average} = 2$. Combining this
with the Chebyshev condition in terms of the standardized eccentricity (Eq. (10)) [19],
we use n = 2 to achieve a balance between sensitivity to anomalous data and toler-
ance to the intra-cluster variance, namely, $\varepsilon_o = 5$. The condition for associating
a data sample with a particular cluster can be expressed as follows:
$$\text{IF } \Big(\varepsilon_{k+1}^{i,c}(x_{k+1}^i) \le \varepsilon_o\Big) \text{ THEN } \Big(x_{k+1}^i \in c\Big) \qquad (12)$$
For each new data sample $x_{k+1}^i$, if the condition is met for a particular existing
cluster, the sample is associated with this cluster. If $x_{k+1}^i$ is associated with two
or more clusters, it is assigned to the cluster that satisfies the following
condition:

$$c_{selected} = \operatorname{argmin}_{c=1,\ldots,C^i}\ \varepsilon_{k+1}^{i,c}(x_{k+1}^i) \qquad (13)$$
Once the cluster that $x_{k+1}^i$ should be assigned to is decided, the correspond-
ing mean (center) $\mu_{k+1}^{i,c_{selected}}$, mean of the scalar product $X_{k+1}^{i,c_{selected}}$, age
$A_{k+1}^{i,c_{selected}}$ and support $S_{k+1}^{i,c_{selected}}$ of the cluster need to be updated accordingly:

$$\mu_{k+1}^{i,c_{selected}} \leftarrow \frac{S_k^{i,c_{selected}}}{S_k^{i,c_{selected}}+1}\,\mu_k^{i,c_{selected}} + \frac{1}{S_k^{i,c_{selected}}+1}\,x_{k+1}^i$$
$$X_{k+1}^{i,c_{selected}} \leftarrow \frac{S_k^{i,c_{selected}}}{S_k^{i,c_{selected}}+1}\,X_k^{i,c_{selected}} + \frac{1}{S_k^{i,c_{selected}}+1}\,\|x_{k+1}^i\|^2 \qquad (14)$$
$$A_{k+1}^{i,c_{selected}} \leftarrow k + 1 - \frac{S_k^{i,c_{selected}}\big(k - A_k^{i,c_{selected}}\big) + k + 1}{S_k^{i,c_{selected}}+1}$$
$$S_{k+1}^{i,c_{selected}} \leftarrow S_k^{i,c_{selected}} + 1$$
In contrast, if there is no cluster that meets the condition (Eq. (12)) for a
particular data sample, x ik+1 , the new data sample x ik+1 forms a new cluster.
The parameters of the new cluster are then initialised as follows:
$$C^i \leftarrow C^i + 1;\quad \mu_{k+1}^{i,C^i} \leftarrow x_{k+1}^i;\quad X_{k+1}^{i,C^i} \leftarrow \|x_{k+1}^i\|^2;\quad A_{k+1}^{i,C^i} \leftarrow 0;\quad S_{k+1}^{i,C^i} \leftarrow 1 \qquad (15)$$
For the existing clusters which do not receive a new member at the (k+1)-th
time instance, the age is updated using Eq. (16); all other parameters
do not change.

$$A_{k+1}^{i,c} \leftarrow A_k^{i,c} + 1,\quad c \in \text{Other} \qquad (16)$$
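The per-sample logic of Eqs. (12)–(16) can be sketched as follows (an illustrative reimplementation with hypothetical cluster dictionaries, not the authors' MATLAB code):

```python
import numpy as np

EPS_O = 5.0  # Chebyshev threshold for n = 2, i.e. 1 + n^2, Eq. (12)

def process_sample(x, clusters, k):
    """Assign sample x per Eqs. (12)-(13) and update parameters per (14)-(16).

    Each cluster is a dict with keys 'mu', 'X', 'A', 'S'.
    """
    candidates = []
    for c in clusters:
        S = c['S']
        mu_t = (S * c['mu'] + x) / (S + 1)        # tentative update, Eq. (14)
        X_t = (S * c['X'] + x @ x) / (S + 1)
        eps = 1 + ((x - mu_t) @ (x - mu_t)) / (X_t - mu_t @ mu_t)  # Eq. (7)
        if eps <= EPS_O:                          # Eq. (12)
            candidates.append((eps, c, mu_t, X_t))
    if candidates:
        _, c, mu_t, X_t = min(candidates, key=lambda t: t[0])      # Eq. (13)
        S = c['S']
        c['A'] = k + 1 - (S * (k - c['A']) + k + 1) / (S + 1)      # Eq. (14)
        c['mu'], c['X'], c['S'] = mu_t, X_t, S + 1
    else:                                         # new cluster, Eq. (15)
        c = {'mu': np.asarray(x, float), 'X': float(x @ x), 'A': 0.0, 'S': 1}
        clusters.append(c)
    for other in clusters:                        # ageing, Eq. (16)
        if other is not c:
            other['A'] += 1
    return clusters
```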
After the parameters are updated, and before the processor handles the next input
data sample, every existing cluster is checked against the following ageing condition
to decide whether it is already out of date:

$$\text{IF } \big(A_{k+1}^{i,c} > \mu_A^i + \sigma_A^i\big) \text{ AND } \big(A_{k+1}^{i,c} > K\big) \text{ THEN (the } c\text{-th cluster is out of date)} \qquad (17)$$

where $\mu_A^i$ is the average of the ages of all clusters within the processor and
$\sigma_A^i$ is the standard deviation of the ages of all clusters within the i-th processor.
Once the processor detects a stale cluster, the stale cluster is removed auto-
matically, because it may have an adverse influence on future clustering results.
After the stale-cluster cleaning operation, the data stream processor is
ready for the next data sample and begins a new round of data processing and
parameter updating. Depending on the needs of users, the clustering results can be
viewed and checked at any time. Once requested by users, all the processors
send the existing clusters and the corresponding parameters stored in their
memories to the fusion center, and the fusion center fuses all the clustering
results together and gives the final output. However, this user intervention is entirely
optional: the algorithm is designed to work fully autonomously and will perform
fusion anyway unless specifically prompted not to.
In this stage, the clustering results from all the data stream processors will
be fused together to generate the final output of the proposed Parallel TEDA
approach.
Once we get all the clusters from all the processors and re-denote their
parameters as centers $\mu_j$, means of the scalar product $X_j$ and supports $S_j$,
$j = 1, 2, \ldots, C_0$ $\big(C_0 = \sum_{i=1}^{N} C^i\big)$, the fusion process starts. The clusters fusion
stage begins from the cluster with the smallest support and ends with the one with
the largest. For each cluster, starting from its closest neighboring cluster and moving
to the cluster farthest away from it, we check the condition to decide whether this
cluster should be merged with the neighboring cluster:
where $\mu_j$ and $\mu_l$ are the centers of the j-th and l-th clusters, and $\varepsilon_l$ and $\varepsilon_j$ are
calculated using the following equations:

$$\varepsilon_l(\mu_j) = \frac{S_l^2\|\mu_j - \mu_l\|^2 + (S_l+1)\big(S_l X_l + \|\mu_j\|^2\big) - \|\mu_j + S_l\mu_l\|^2}{(S_l+1)\big(S_l X_l + \|\mu_j\|^2\big) - \|\mu_j + S_l\mu_l\|^2} \qquad (19)$$
$$\varepsilon_j(\mu_l) = \frac{S_j^2\|\mu_l - \mu_j\|^2 + (S_j+1)\big(S_j X_j + \|\mu_l\|^2\big) - \|\mu_l + S_j\mu_j\|^2}{(S_j+1)\big(S_j X_j + \|\mu_l\|^2\big) - \|\mu_l + S_j\mu_j\|^2}$$
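A sketch of the eccentricity computation in (19) used during fusion (again an illustrative reconstruction; each cluster is assumed to carry its center μ, scalar-product mean X and support S):

```python
import numpy as np

def fusion_eccentricity(mu_j, mu_l, X_l, S_l):
    """Standardized eccentricity of center mu_j w.r.t. cluster l, Eq. (19)."""
    num_extra = S_l ** 2 * np.sum((mu_j - mu_l) ** 2)
    denom = ((S_l + 1) * (S_l * X_l + mu_j @ mu_j)
             - np.sum((mu_j + S_l * mu_l) ** 2))
    return (num_extra + denom) / denom
```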
After the merging, there may be some trivial clusters (with small support)
left satisfying the following condition:
$$\text{IF } \left(S_i < \frac{\sum_{j=1}^{C_0} S_j}{5\,C_0}\right) \text{ THEN (merge the } i\text{-th cluster with the nearest larger cluster)} \qquad (21)$$
3. Else
* Add a new cluster and its corresponding parameters using Eq. (15);
* Update the ages of other clusters using Eq. (16);
4. End If
• End If
– End While
2. Parallel TEDA Algorithm Part 2:
– While the existing clusters exhibit the potential of merging
• Calculate εl (µj ) and εj (µl ) using Eq. (19).
• If (Merging Condition is met) Then:
* Merge the two clusters using Eq. (20);
– End While
– Merge the minor clusters with the nearest larger clusters using Eqs. (21), (22)
and (20)
5 Numerical Experiments
In this section, several numerical experiments using benchmark datasets from
[25] are conducted to study the performance of the newly proposed Paral-
lel TEDA approach. The details of the benchmark datasets are given in Table 1.
Table 1. Details of the benchmark datasets.

Dataset | Data samples | Clusters | Attributes | Max. cluster size | Min. cluster size
A1      | 3000         | 20       | 2          | 150               | 150
A2      | 5250         | 35       | 2          | 150               | 150
A3      | 7500         | 50       | 2          | 150               | 150
S1      | 5000         | 15       | 2          | 350               | 300
S2      | 5000         | 15       | 2          | 350               | 300
The Parallel TEDA approach was implemented as software running in
MATLAB R2015a. The performance was evaluated on a PC with a dual-core
processor with a clock frequency of 3.6 GHz per core and 8 GB RAM. The first exper-
iment is conducted to verify the correctness and effectiveness of the Paral-
lel TEDA approach. Datasets A1 and A2 are used together to test the ability
of the parallel computation as well as the handling of multiple data streams. In this
experiment, 11 data stream processors are involved and the data chunk size is set to
250. The processing procedure of the data chunks of the datasets is given
in Fig. 2. The time-varying clustering results of each process cycle are presented
in Fig. 3.
Fig. 2. The processing procedure of first experiment (DC is short for data chunk)
Fig. 3. The time-varying clustering result (The green dots are the data samples from
A1 dataset, blue dots are the ones from A2 dataset and red circles are centers of the
clusters).
Fig. 4. The relationship between processing time and number of processors (The blue
bars represent the time consumed in stage 1 and the green ones represent the time
consumed in stage 2)
one data stream. We use 4 processors in the proposed Parallel TEDA approach
in the comparative experiments, and the chunk size is set to 200 data samples/points.
The comparison results are presented in Table 3. As shown in Table 3, com-
pared with the two alternative clustering approaches, the Parallel TEDA approach
exhibits more accurate clustering results. Note that both the subtractive and ELM
clustering approaches require the radius (r or R, respectively) to be pre-defined.
Moreover, its choice heavily influences the result (see Table 3). The proposed
Parallel TEDA approach does not require such a parameter to be pre-defined.
Yet, it is able to identify the correct number of clusters and is the fastest of
them. The subtractive clustering approach can be comparable with the proposed
6 Conclusions
In this paper, we proposed a novel real-time clustering approach for high fre-
quency streaming data processing, called Parallel TEDA. This approach inher-
its the advantages of the recently introduced TEDA theoretical framework and
has the ability of parallel computation due to its multi-processor structure. In
addition, it can successfully follow the drifts and/or shifts in the data pattern.
Within the TEDA framework, the streaming data samples are divided into a
number of data chunks and sequentially sent to the data processors for cluster-
ing. Each generated cluster is additionally assigned a time tag for online cluster
quality monitoring. The fusion center gathers the clustering results from each
data processor and fuses them together to obtain the overall output. Numerical
experiments show the superior performance of the proposed approach as well
as the potential for an even higher processing speed. This approach will be a
promising tool for further applications in online high frequency data processing
and analysis.
Acknowledgements. The second author would like to acknowledge the partial sup-
port through The Royal Society grant IE141329/2014 Novel Machine Learning Par-
adigms to Address Big Data Streams. The third, fourth, and fifth authors would like
to acknowledge the support by the Spanish Government under the project TRA2013-
48314-C3-1-R and the project TRA2015-63708-R.
References
1. Fukunaga, K., Hostetler, L.: The estimation of the gradient of a density function,
with applications in pattern recognition. IEEE Trans. Inf. Theor. 21(1), 32–40
(1975)
2. MacQueen, J.: Some methods for classification and analysis of multi-variate obser-
vations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Sta-
tistics and Probability. Statistics, vol. 1, Berkeley, pp. 281–297 (1967)
3. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy c-means clustering algorithm.
Comput. Geosci. 10(2), 191–203 (1984)
4. Johnson, S.: Hierarchical clustering schemes. Psychometrika 32(3), 241–254 (1967)
5. de Oliveira, J.V., Pedrycz, W. (eds.): Advances in Fuzzy Clustering and Its Appli-
cations. Wiley, New York (2007)
6. Yager, R., Filev, D.: Generation of fuzzy rules by mountain clustering. J. Intell.
Fuzzy Syst. 2(3), 209–219 (1994)
7. Chiu, S.L.: Fuzzy model identification based on cluster estimation. J. Intell. Fuzzy
Syst. 2(3), 1064–1246 (1994)
8. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discover-
ing clusters in large spatial databases with noise. In: 2nd International Conference
on Knowledge Discovery and Data Mining, vol. 96(34), Portland, Oregon, pp. 226–
231 (1996)
9. Wang, C., Lai, J., Huang, D., Zheng, W.: SVStream: a support vector-based algo-
rithm for clustering data streams. IEEE Trans. Knowl. Data Eng. 25(6), 1410–1424
(2013)
10. Baruah, R., Angelov, P.: Evolving local means method for clustering of streaming
data. In: IEEE Congress on Computational Intelligence, Brisbane, Australia, pp.
2161–2168 (2012)
11. Hyde, R., Angelov, P.: A fully autonomous data density based clustering technique.
In: IEEE Symposium on Evolving and Autonomous Learning Systems, Orlando,
USA, pp. 116–123 (2014)
12. Angelov, P., Gu, X., Gutierrez, G., Iglesias, J.A., Sanchis, A.: Autonomous data
density based clustering method. In: 2016 IEEEWorld Congress on Computational
Intelligence, Vancouver, Canada, pp. 2405-2413 (2016)
13. Guha, S., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams.
In: Proceedings of the Annual Symposium on Foundations of Computer Science
(FOCS), Redondo Beach, CA, pp. 359–366 (2000)
14. Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering
data streams: theory and practice. IEEE Trans. Knowl. Data Eng. 15(3), 515–528
(2003)
15. Aggarwal, C., Han, J., Wang, J., Yu, S.: A framework for clustering evolving data
streams. In: Proceedings of the International Conference on Very Large Data Bases,
Berlin, Germany, pp. 81–92 (2003)
16. Comode, G., Muthukrishnan, S., Zhang, W.: Conquering the divide: continuous
clustering of distributed data streams. In: Proceedings of the International Con-
ference on Data Engineering, Istanbul, pp. 1036–1045 (2007)
17. Gama, J., Rodrigues, P., Sebastio, R.: Evaluating algorithms that learn from data
streams. In: Proceedings of the ACM Symposium on Applied Computing, Hawaii,
pp. 1496–1500 (2009)
18. Angelov, P.: Outside the box: an alternative data analytics framework. J. Autom.
Mob. Rob. Intell. Syst. 8(2), 53–59 (2014)
19. Angelov, P.: Typicality distribution function - a new density-based data analyt-
ics tool. In: IEEE International Joint Conference on Neural Networks (IJCNN),
Killarney, pp. 1–8 (2015)
20. Angelov, P., Gu, X., Kangin, D., Principe, J.: Empirical data analysis: a new tool
for data analytics. In: 2016 IEEE International Conference on Systems, Man, and
Cybernetics, Budapest, Hungary (2016, to appear)
21. Angelov, P., Filev, D.: Simple TS: a simplified method for learning evolving Takagi-
Sugeno fuzzy models. In: IEEE International Conference on Fuzzy Systems, Reno,
USA, pp. 1068–1073 (2005)
22. Angelov, P.: Autonomous Learning Systems from Data Stream to Knowledge in
Real Time. John Wiley & Sons, Ltd., West Sussex (2012)
23. Lughofer, E., Angelov, P.: Handling drifts and shifts in on-line data streams with
evolving fuzzy systems. Appl. Soft Comput. J. 11(2), 2057–2068 (2011)
24. Saw, J., Yang, M., Mo, T.: Chebyshev inequality with estimated mean and variance.
Am. Stat. 38(2), 130–132 (1984)
25. Clustering Datasets - University of Eastern Finland (2016). http://cs.joensuu.fi/
sipu/datasets/. Accessed 12 May 2016
A Big Data Intelligent Search Assistant Based
on the Random Neural Network
Will Serrano
Abstract. The need to search for specific information or products in the ever-
expanding Internet has led to the development of Web search engines and rec-
ommender systems. Whereas their benefit is the provision of a direct connection
between users and the information or products sought within the Big Data, any
search outcome will be influenced by a commercial interest as well as by the
users’ own ambiguity in formulating their requests or queries. This research
analyses the result rank relevance provided by the different Web search engines,
metasearch engines, academic databases and recommender systems. We propose
an Intelligent Internet Search Assistant (ISA) that acts as an interface between
the user and Big Data search engines. We also present a new relevance metric
which combines both relevance and rank. We use this metric to validate and
compare the performance of our proposed algorithm against other search
engines and recommender systems. On average, our ISA outperforms other
search engines.
1 Introduction
The extensive size of Big Data does not allow Internet users to find all relevant
information or products without the use of Web search engines or recommender
systems. Web users cannot be guaranteed that the results provided by search appli-
cations are either exhaustive or relevant to their search needs. Businesses have a
commercial interest in ranking higher in results or recommendations to attract more cus-
tomers, while Web search engines and recommender systems make their profit from
advertisements and product purchases. The main consequence is that irrelevant
results or products may be shown in top positions and relevant ones “hidden” at the very
bottom of the search list. As the size of the Internet and Big Data increasingly expands,
Web users are more and more dependent on information filtering applications.
We describe the application of neural networks in recommender systems in Sect. 2.
In order to address the presented search issues, this paper proposes in Sect. 3 an
Intelligent Internet Search Assistant (ISA) that acts as an interface between an indi-
vidual user’s query and the different search engines. We have validated our ISA against
other Web search engines and metasearch engines, online databases and recommender
systems in Sect. 4. Our conclusions are presented in Sect. 5.
2 Related Work
The ability of neural networks to learn iteratively from different inputs to acquire the
desired outputs, as a mechanism of adaptation to users’ interests in order to provide
relevant answers, has already been applied in the World Wide Web and recommender
systems. S. Patil et al. [1] propose a recommender system using a collaborative filtering
mechanism with k-separability approach for Web based marketing. They build a model
for each user on several steps: they cluster a group of individuals into different cate-
gories according to their similarity using Adaptive Resonance Theory (ART) and then
they calculate the Singular Value Decomposition matrix. M. Lee et al. [2] propose a
new recommender system which combines collaborative filtering with a
Self-Organizing Map neural network. They segment all users by demographic char-
acteristics where users in each segment are clustered according to the preference of
items using the neural network. C. Vassiliou et al. [3] propose a framework that
combines neural networks and collaborative filtering. Their approach uses a neural
network to recognize implicit patterns between user profiles and items of interest, which
are then further enhanced by collaborative filtering to produce personalized suggestions.
K. Kongsakun et al. [4] develop an intelligent recommender system framework based
on an investigation of the possible correlations between the students’ historic records
and final results. C. Chang et al. [5] train the artificial neural networks to group users
into different types. They use an Adaptive Resonance Theory (ART) neural network
model in an unsupervised learning model where the input layer is a vector made of
user’s features and the output layer is the different cluster. P. Chou et al. [6] integrate a
back propagation neural network with supervised learning and feed forward architec-
ture in an “interior desire system”. D. Billsus et al. [7] propose a representation for
collaborative filtering tasks that allows the application of any machine learning algo-
rithm, including a feed forward neural network with k input neurons, 2 hidden neurons
and 1 output neuron. M. Krstic et al. [8] apply a single hidden layer feed forward neural
network as a classifier tool which estimates whether a certain TV programme is rele-
vant to the user based on the TV programme description, contextual data and the
feedback provided by the user. C. Biancalana et al. [9] propose a neural network to
include contextual information on film recommendations. The aim of the neural net-
work is to identify which member of a household gave a specific rating to a film at a
specific time. M. Devi et al. [10] use a probabilistic neural network to calculate the
rating between users based on the rating matrix. They smooth the sparse rating matrix
by predicting the rating values of the unrated items.
The search assistant we design is based on the Random Neural Network (RNN) [11–
13]. This is a spiking recurrent stochastic model for neural networks. Its main analytical
properties are its “product form” and the existence of a unique network steady-state
solution. It represents more closely how signals are transmitted in many biological
neural networks, where they actually travel as spikes or impulses rather than as analogue
signal levels, and has been used in different applications, including network routing
with cognitive packet networks, using reinforcement learning, which requires the
search for paths that meet certain pre-specified quality of service requirements [14],
search for exit routes for evacuees in emergency situations [15, 16] and network
routing [17], pattern based search for specific objects [18], video compression [19],
image texture learning and generation [20] and Deep Learning [21].
Gelenbe, E. et al. have investigated different search models [22–24]. In our own
application of the RNN [25], our ISA acquires a query from the user and retrieves
results from one or various search engines, assigning one neuron to each Web result
dimension. The result relevance is calculated by applying our innovative cost function,
based on the division of a query into a multidimensional vector weighting its dimension
terms with different relevance parameters. Our ISA adapts to and learns the perceived user’s
interest and reorders the retrieved snippets based on our dimension relevance centre point.
Our ISA learns result relevance through an iterative process in which the user evaluates
the listed results directly. We evaluate and compare its performance against other search engines
with a newly proposed quality definition, which combines both relevance and rank. We
have also included two learning algorithms: Gradient Descent learns the centre of rele-
vant dimensions, and Reinforcement Learning updates the network weights by
rewarding relevant dimensions and punishing irrelevant ones.
4 Validation
$$Q = \sum_{result=1}^{Y} R_{ML} \cdot R_{SE} \qquad (1)$$

where $R_{ML}$ is the rank of the result in the master list, which represents the optimum result
relevance order, $R_{SE}$ is the rank of the same result in a particular search engine, and Y
is the number of results shown to the user; if the result order is larger than Y, we
discard the result from our calculation as it is considered irrelevant.

We define the normalized quality, $\overline{Q}$, as the division of the quality Q by the optimum
figure, which is obtained when the results provided are ranked in the same order as in the
master list; this value corresponds to the sum of the squares of the first Y integers:

$$\overline{Q} = \frac{Q}{\frac{Y(Y+1)(2Y+1)}{6}} \qquad (2)$$

where Y is the total number of results shown to the user.
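A small sketch of the quality metric (1)–(2) (an illustrative implementation; the rank convention follows the text, with the top result scoring Y and the last scoring 1, and the helper names are hypothetical):

```python
def normalized_quality(master_ranks, engine_ranks, Y):
    """Compute Q (Eq. 1) and its normalized form (Eq. 2).

    master_ranks / engine_ranks map each result URL to its rank score,
    where the top result scores Y and the last one scores 1.
    """
    q = sum(master_ranks[r] * engine_ranks[r]
            for r in engine_ranks if r in master_ranks)
    optimum = Y * (Y + 1) * (2 * Y + 1) / 6   # sum of the first Y squares
    return q, q / optimum
```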
The Intelligent Internet Search Assistant we have proposed emulates how Web
search engines work by using a very similar interface to introduce and display infor-
mation. We validate our proposed ISA against current Metasearch engines: we retrieve the
results from the Web search engines they use to generate the result master list, and then
compare the results provided by the Metasearch engines against this master list.
This proposed method has the drawback that we do not consider any results
obtained from Internet Web directories or online databases, from which Meta-
search engines may have retrieved some of the results they display. We have selected
Ixquick and Metacrawler as the Metasearch engines against which to compare our ISA.
After analysing the main characteristics of both Metasearch engines, we consider that
Metacrawler uses Google, Yahoo and Yandex, and Ixquick uses Google, Yahoo and
Bing as their main sources of search results. We have run our ISA to acquire 10 different
queries based on the travel industry from a user. The ISA then retrieves the first 30 results
from each of the main Web search engines for which a driver was programmed (Google,
Yahoo, Bing and Yandex); we therefore assign 30 points to the Web site result displayed
in the top position, 1 point to the Web site result shown in the last position, and 0 points
to each result that belongs to the same Web site and is shown more than once. After
scoring the 120 results provided by the 4 different Web search engines, we combine them
by adding the scores of results from the same Web site and rank them to generate the
result master list. We have done this evaluation exercise for each high-level query. We
then retrieve the first 30 results from Metacrawler and Ixquick and benchmark them
against the result master list using the proposed Quality formula. We present the average
Quality values for the 10 different queries in Table 1.
$$Q = \sum_{i=1}^{Y} R_{SE_i} \qquad (3)$$

where $R_{SE_i}$ is the rank of result i in a particular search engine, with a value of N if
the result is in the first position and 1 if the result is the last one; Y is the total number
of results selected by the user. The best Online Academic Database would have the
largest Quality value. We define the normalized quality, $\overline{Q}$, as the division of the quality
Q by the optimum figure, which is obtained when the user considers all the results
provided by the Web search engine relevant. In this situation Y and N have the same value:

$$\overline{Q} = \frac{Q}{\frac{N(N+1)}{2}} \qquad (4)$$

$$I = \frac{\overline{Q}_W - \overline{Q}_R}{\overline{Q}_R} \qquad (5)$$
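Analogously, a sketch of the selection-based quality (3)–(4) and the improvement metric (5) (illustrative only; $\overline{Q}_W$ and $\overline{Q}_R$ are taken as the two normalized qualities being compared in the text):

```python
def normalized_selection_quality(selected_ranks, N):
    """Eqs. (3)-(4): selected_ranks are the rank scores (N = first position,
    1 = last) of the results the user marked as relevant."""
    return sum(selected_ranks) / (N * (N + 1) / 2)

def improvement(q_w, q_r):
    """Eq. (5): relative difference between the two normalized qualities."""
    return (q_w - q_r) / q_r
```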
5 Conclusions
Acknowledgment. This research has used the GroupLens dataset from the Department of
Computer Science and Engineering at the University of Minnesota; the Trip Advisor dataset from
the University of California, Irvine, Machine Learning Repository, Centre for Machine Learning
and Intelligent Systems; and the Amazon dataset from Julian McAuley, Computer Science
Department, University of California, San Diego.
References
1. Patil, S., Mane, Y., Dabre, K., Dewan, P., Kalbande, D.: An efficient recommender system
using collaborative filtering methods with k-separability approach. Int. J. Eng. Res. Appl.,
30–35 (2012)
2. Lee, M., Choi, P., Woo, Y.T.: A hybrid recommender system combining collaborative
filtering with neural network. In: de Bra, P., Brusilovsky, P., Conejo, R. (eds.) AH 2002.
LNCS, vol. 2347, pp. 531–534. Springer, Heidelberg (2002)
3. Vassiliou, C., Stamoulis, D., Martakos, D., Athanassopoulos, S.: A recommender system
framework combining neural networks & collaborative filtering. In: International Conference
on Instrumentation, Measurement, Circuits and Systems, pp. 285–290 (2006)
4. Kongsakun, K., Kajornrit, J., Fung, C.: Neural network modelling for an intelligent
recommendation system supporting SRM for universities in Thailand. In: International
Conference on Computing and Information Technology, vol. 2, pp. 34–44 (2013)
5. Chang, C., Chen, P., Chiu, F., Chen, Y.: Application of neural networks and Kanos’s method
to content recommendation in Web personalization. Expert Syst. Appl. 36, 5310–5316
(2009)
6. Chou, P., Li, P., Chen, K., Wu, M.: Integrating Web mining and neural network for
personalized e-commerce automatic service. Expert Syst. Appl. 37, 2898–2910 (2010)
A Big Data Intelligent Search Assistant Based on the RNN 261
7. Billsus, D., Pazzani, M.: Learning collaborative information filters. In: International
Conference of Machine Learning, pp. 46–54 (1998)
8. Krstic, M., Bjelica, M.: Context aware personalized program guide based on neural network.
IEEE Trans. Consum. Electron. 58, 1301–1306 (2012)
9. Biancalana, C., Gasparetti, F., Micarelli, A., Miola, A., Sansonetti, G.: Context-aware movie
recommendation based on signal processing and machine learning. In: The Challenge on
Context Aware Movie Recommendation, pp. 5–10 (2011)
10. Devi, M., Samy, R., Kumar, S., Venkatesh, P.: Probabilistic neural network approach to
alleviate sparsity and cold start problems in collaborative recommender systems. Comput.
Intell. Comput. Res., 1–4 (2010)
11. Gelenbe, E.: Random neural network with negative and positive signals and product form
solution. Neural Comput. 1, 502–510 (1989)
12. Gelenbe, E.: Learning in the recurrent Random Neural Network. Neural Comput. 5, 154–164
(1993)
13. Gelenbe, E., Timotheou, S.: Random neural networks with synchronized interactions. Neural
Comput. 20(9), 2308–2324 (2008)
14. Gelenbe, E., Lent, R., Xu, Z.: Towards networks with cognitive packets. In: Goto, K.,
Hasegawa, T., Takagi, H., Takahashi, Y. (eds.) Performance and QoS of Next Generation
Networking, pp. 3–17. Springer, London (2011)
15. Gelenbe, E., Wu, F.J.: Large scale simulation for human evacuation and rescue. Comput.
Math Appl. 64(12), 3869–3880 (2012)
16. Filippoupolitis, A., Hey, L., Loukas, G., Gelenbe, E., Timotheou, S.: Emergency response
simulation using wireless sensor networks. In: Proceedings of the 1st International
Conference on Ambient Media and Systems, p. 21 (2008)
17. Gelenbe, E.: Steps towards self-aware networks. Commun. ACM 52(7), 66–75 (2009)
18. Gelenbe, E., Koçak, T.: Area-based results for mine detection. IEEE Trans. Geosci. Remote
Sens. 38(1), 12–24 (2000)
19. Cramer, C., Gelenbe, E., Bakircloglu, H.: Low bit-rate video compression with neural
networks and temporal subsampling. Proc. IEEE 84(10), 1529–1543 (1996)
20. Atalay, V., Gelenbe, E., Yalabik, N.: The random neural network model for texture
generation. Int. J. Pattern Recognit Artif Intell. 6(1), 131–141 (1992)
21. Gelenbe, E., Yin, Y.: Deep learning with random neural networks, In: International Joint
Conference on Neural Networks (IJCNN 2016) World Congress on Computational
Intelligence. IEEE Xplore, Vancouver (2016). Paper Number 16502
22. Gelenbe, E.: Search in unknown random environments. Phys. Rev. E 82(6), 061112 (2007)
23. Gelenbe, E., Abdelrahman, O.H.: Search in the universe of big networks and data. IEEE
Netw. 28(4), 20–25 (2014)
24. Abdelrahman, O.H., Gelenbe, E.: Time and energy in team-based search. Phys. Rev. E 87,
032125 (2013)
25. Gelenbe, E., Serrano, W.: An intelligent internet search assistant based on the random neural
network. In: Iliadis, L., Maglogiannis, I. (eds.) AIAI 2016, IFIP AICT, vol. 475, pp. 141–
153. Springer, Switzerland (2016)
RandomFIS: A Fuzzy Classification System
for Big Datasets
Abstract. One of the main advantages of fuzzy classifier models is their lin-
guistic interpretability, revealing the relation between input variables and the
output class. However, these systems suffer from the curse of dimensionality
when dealing with high dimensional problems (large number of attributes and
instances). This paper presents a new fuzzy classifier model, named
RandomFIS, that provides good performance in both classification accuracy and
rule base interpretability even when dealing with databases comprising large
numbers of inputs (attributes) and patterns (instances). RandomFIS employs
concepts from Random Subspace and the Bag of Little Bootstraps (BLB), resulting in
an ensemble of fuzzy classifiers. It was tested on different classification
benchmarks, proving to be an accurate and interpretable model, even for
problems involving big databases.
1 Introduction
Classification systems based on fuzzy rules are useful and well-known tools for representing and extracting knowledge from databases involving uncertainty, inaccuracy
and nonlinearity [1]. Approaches based on fuzzy logic have the ability to provide
accurate models and, at the same time, interpretable linguistic rules that inform the end
user of the relationship between input variables and output classes [2]. In order to build
models that consider both accuracy and interpretability, most studies have used
(i) evolutionary algorithms [3] to elaborate fuzzy rules and (ii) Evolving Fuzzy
Inference Systems [4] to create and adapt a fuzzy rule base by gathering new
observations.
In [5], a simpler approach, known as AutoFIS-Class, was proposed to automatically
generate a Fuzzy Classifier System that provides good accuracy but also favors lin-
guistic interpretability. AutoFIS-Class was compared to other fuzzy-evolutionary
models [5] and performed very well in terms of accuracy and interpretability. However,
for big databases AutoFIS-Class suffers from the curse of dimensionality [6], especially with respect to computational time and memory consumption.
2 RandomFIS
$$b = n^{c} \qquad (1)$$

where $n$ is the number of patterns (instances) in the original database, $b$ is the size of the reduced dataset and $c$ is a constant ($0.5 \le c \le 0.9$). The Subsampling technique is used to generate $s$ sets by randomly selecting patterns based on a uniform distribution
without repetition. The second level generates, from each $Set_i$, $i \in \{1, \ldots, s\}$, a Bootstrap (Resampling) $Subset_j$, $j \in \{1, \ldots, r\}$, now with repetition. Each of the $r$ subsets contains a reduced number ($J^*$) of randomly selected attributes (known as Random Subspace), given by Eq. (2), where $J$ is the total number of attributes in the original database and $J^*$ is the number of variables selected in each $Subset_j$.
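To make the two-level sampling concrete, the following is a minimal sketch of the partitioning scheme described above. The value c = 0.7 (the best value reported in Sect. 3) and the choice J* = sqrt(J) are assumptions for illustration, since Eq. (2) itself is not reproduced in the text; keeping each bootstrap subset at the reduced size b is likewise an assumption.

```python
import numpy as np

def randomfis_partitions(X, s=8, r=100, c=0.7, rng=np.random.default_rng(0)):
    n, J = X.shape
    b = int(n ** c)                        # Eq. (1): size of each reduced set
    J_star = max(1, int(np.sqrt(J)))       # assumed stand-in for Eq. (2)
    partitions = []
    for _ in range(s):
        # level 1: subsample b patterns without repetition (uniform)
        set_i = rng.choice(n, size=b, replace=False)
        for _ in range(r):
            # level 2: bootstrap resample with repetition
            subset_j = rng.choice(set_i, size=b, replace=True)
            # random subspace: J* randomly selected attributes
            attrs = rng.choice(J, size=J_star, replace=False)
            partitions.append((subset_j, attrs))
    return partitions
```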
2.2.1 Fuzzification
In classification, the main information consists of $n$ patterns $x_i = [x_{i1}, x_{i2}, \ldots, x_{iJ}]$ of $J$ attributes $X_j$ present in the database ($i = 1, \ldots, n$ and $j = 1, \ldots, J$). A number of $L$ fuzzy sets $A_{jl} = \{(x_{ij}, \mu_{A_{jl}}(x_{ij})) \mid x_{ij} \in X_j\}$ is associated with each $j$-th attribute, where $\mu_{A_{jl}}: X_j \to [0, 1]$ is a membership function that assigns to each observation $x_{ij}$ a membership degree $\mu_{A_{jl}}(x_{ij})$ to the fuzzy set $A_{jl}$. Each pattern is associated with a class $C_i$ out of $K$ possible ones, that is, $C_i \in \{1, 2, \ldots, k, \ldots, K\}$. The Fuzzification stage takes into account three aspects: the membership function format, the support of each membership function $\mu_{A_{jl}}(x_{ij})$, and the appropriate linguistic label. RandomFIS uses the Tukey approach [5], which considers the information from the quartiles.
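As an illustration of the fuzzification step, the sketch below builds L = 3 triangular membership functions anchored on the quartiles of one attribute. The exact Tukey-based construction of [5] is not detailed in the text, so the anchor points and linguistic labels here are assumptions.

```python
import numpy as np

def quartile_mfs(column):
    # quartile anchors; distinct quartiles are assumed
    q0, q1, q2, q3, q4 = np.percentile(column, [0, 25, 50, 75, 100])

    def tri(a, b, c):
        def mu(x):
            if a < x <= b:
                return (x - a) / (b - a)   # rising edge
            if b < x < c:
                return (c - x) / (c - b)   # falling edge
            return 1.0 if x == b else 0.0
        return mu

    # one membership function per assumed linguistic label: low, medium, high
    return [tri(q0, q1, q2), tri(q1, q2, q3), tri(q2, q3, q4)]
```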
where $\mu_{A_d}(x_i)$ is the joint membership degree of pattern $i$ in premise $d$ ($d = 1, \ldots, D$), and $*$ is a t-norm. More generally, a premise can be built from a combination of the $\mu_{A_{jl}}(x_{ij})$ through the use of t-norms, t-conorms, negation operators and linguistic hedges. The negation operator acts upon each element in a premise.
Regarding interpretability, it is desirable to have few rules with few antecedent elements. Thus, a limit is imposed on the maximum size of premises, and these are generated in an organized way: initialization with size-1 premises, creation of size-2 premises from the viable size-1 ones, generation of size-3 premises from viable size-2 premises, and so on. In addition, premise viability is evaluated through a set of filters: support, similarity and conflict in classification.
(a) Support Filter: aims at building premises that cover a large number of patterns in the database. The Relative Support of a premise $\mu_{A_d}(x_i)$ is given by:

$$\mathrm{Sup}_d = \frac{\sum_{i=1}^{n} \mu_{A_d}(x_i)}{n} \qquad (4)$$
(b) Similarity Filter: given a user-defined tolerance $\epsilon_{sim}$, two premises are similar if $\mathrm{Sim}_{d,v} > \epsilon_{sim}$. If similarity is identified, the premise with the lower Relative Support is removed.
(c) PCD Filter: this filter aims at reducing the occurrence of similar or conflicting rules by computing the Penalized Confidence Degree ($\mathrm{PCD}_k$) [10]:

$$\mathrm{PCD}_k = \max\left( \frac{\sum_{i \in k} \mu_{A_d}(x_i) - \sum_{i \notin k} \mu_{A_d}(x_i)}{\sum_{i=1}^{n} \mu_{A_d}(x_i)},\; 0 \right) \qquad (6)$$
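A minimal sketch of how the support and PCD filters could be computed for one premise, assuming mu is an (n,) array with the joint membership of every pattern in that premise and y is the (n,) array of class labels:

```python
import numpy as np

def relative_support(mu):
    return mu.sum() / len(mu)               # Eq. (4)

def pcd(mu, y, k):
    in_k = (y == k)
    num = mu[in_k].sum() - mu[~in_k].sum()  # penalize activation outside class k
    return max(num / mu.sum(), 0.0)         # Eq. (6)

# Premises with relative support below eps_sup, or with PCD_k = 0 for every
# class k, would be discarded before rule generation.
```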
``If $X_1$ is $A_{1l}$ and … and $X_j$ is $A_{jl}$ and … and $X_J$ is $A_{Jl}$, then $x_i$ is Class $k$''
The Weighted Average estimated through Restricted Least Squares (RLS) is used here. For each $d$-th premise, a linear configuration of weights in the range [0, 1] is determined. This can be formulated as:

$$\min \sum_{i=1}^{n} \sum_{k=1}^{K} \left( \mu_{C_i \in k}(x_i) - \beta_k\, \mu_{A_d}(x_i) \right)^2 \qquad (7)$$

$$\text{subject to} \quad \sum_{k=1}^{K} \beta_k = 1 \quad \text{and} \quad \beta_k \ge 0$$

where $\mu_{C_i \in k}(x_i) \in \{0, 1\}$ is the membership degree of pattern $x_i$ to class $k$ (binary representation of class $k$), and $\beta_k$ is the degree of influence of class $k$ on premise $\mu_{A_d}(x_i)$. If $\beta_k = 0$, then premise $\mu_{A_d}(x_i)$ is eliminated.
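The constrained problem of Eq. (7) can be solved numerically; the sketch below uses SciPy's SLSQP solver for illustration (the paper does not specify which solver was used). M is the (n,) membership vector of one premise and Y the (n, K) binary class indicator matrix.

```python
import numpy as np
from scipy.optimize import minimize

def rls_weights(M, Y):
    n, K = Y.shape

    def objective(beta):
        # Eq. (7): sum_i sum_k (mu_{Ci in k}(x_i) - beta_k * mu_{Ad}(x_i))^2
        return np.sum((Y - np.outer(M, beta)) ** 2)

    res = minimize(
        objective,
        x0=np.full(K, 1.0 / K),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * K,                                   # beta_k in [0, 1]
        constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1}],  # sum = 1
    )
    return res.x  # beta_k == 0 means the premise is dropped for class k
```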
$$\hat{\mu}_{C_i \in k}(x_i) = g\left[ \mu_{A_1^{(2)}}(x_i), \ldots, \mu_{A_D^{(2)}}(x_i) \right] \qquad (9)$$

$$\vdots$$

$$\hat{\mu}_{C_i \in k}(x_i) = g\left[ \mu_{A_1^{(K)}}(x_i), \ldots, \mu_{A_D^{(K)}}(x_i) \right] \qquad (10)$$
Based on the results obtained with AutoFIS [5], RandomFIS uses the Weighted Average estimated through Restricted Least Squares (RLS) as the aggregation operator [11]:

$$\hat{\mu}_{C_i \in k}(x_i) = \sum_{d^{(1)}=1}^{D^{(1)}} w_{d^{(1)}}\, \mu_{A_{d^{(1)}}}(x_i) \qquad (11)$$

$$\vdots$$

$$\hat{\mu}_{C_i \in k}(x_i) = \sum_{d^{(K)}=1}^{D^{(K)}} w_{d^{(K)}}\, \mu_{A_{d^{(K)}}}(x_i) \qquad (12)$$
where $w_{d^{(k)}}$ is the weight, or degree of influence, of $\mu_{A_{d^{(k)}}}(x_i)$ in the discrimination of patterns related to class $k$. The same process described in the Association stage is used for finding the weights.
2.6 Decision
The decision on the membership of pattern $x_i$ to class $k$ is computed by:

$$\hat{C}_i = \arg\max_k \left\{ \hat{\mu}_{C_i \in 1}(x_i), \ldots, \hat{\mu}_{C_i \in k}(x_i), \ldots, \hat{\mu}_{C_i \in K}(x_i) \right\} \qquad (13)$$

where $\hat{C}_i$ is the predicted class, i.e., the $k$-th argument that maximizes Eq. (13). This method therefore associates pattern $x_i$ with the most pertinent class according to the existing rule base. In case of a tie, $x_i$ is associated with the majority class.
3 Experiments
In order to evaluate RandomFIS, five databases were selected from the UCI machine learning repository [12]: KDD Cup 1999, Poker Hand, Covertype, Census-Income (KDD) and the Fatality Analysis Reporting System (FARS). Table 1 presents their main features in terms of the number of attributes (J), the number of instances (patterns) (n) and the number of classes. All databases were evaluated using 10-fold cross-validation.
For comparison with the models presented in [13], all multi-class datasets were transformed into binary classification problems. The labels of the datasets in Table 1 indicate the classes used in each binary classification problem.
The parameters used to create each fuzzy classifier were set according to the best results obtained in [5]: number of membership functions for each attribute = 3; product t-norm; active negation; maximum premise size = 2; $\epsilon_{Sup} = 0.075$ and $\epsilon_{sim} = 0.95$.
Parameters such as the number of Subsamples ($s$), the number of Bootstraps ($r$) and $c$ were empirically determined after some preliminary tests, based on the results presented in [9]. Best results were obtained for $c = 0.7$ and $r = 100$. In order to compare the performance of RandomFIS with that of other similar models, three different Subsample values were considered: $s = 8, 16, 32$, as in [13]. In that work, the Chi-MapReduce model was proposed to deal with high-dimensionality problems.
Results are compared in terms of the number of fuzzy rules generated and the Classification Accuracy, computed as:

$$\mathrm{Accuracy} = 1 - \frac{\sum_{i=1}^{n} \left\| C_i - \hat{C}_i \right\|}{n} \qquad (14)$$

where $C_i$ is the real class of pattern $x_i$ and $\hat{C}_i$ is the predicted class. In Eq. (14), $\| C_i - \hat{C}_i \| = 0$ if $C_i = \hat{C}_i$, and 1 otherwise.
3.1 Results
Tables 2, 3 and 4 present the results in terms of Accuracy, average Number of Rules and Computational Time for the RandomFIS_Ave and RandomFIS_Rules configurations, with 8, 16 and 32 Subsamples. In terms of Accuracy, Table 2 indicates that RandomFIS_Rules provides better results than RandomFIS_Ave. Additionally, the configuration with 32 partitions tends to produce better Accuracy. Regarding the average number of rules generated (Table 3), RandomFIS_Rules outperforms RandomFIS_Ave: it provides a significant reduction in the final number of rules, considerably enhancing interpretability while maintaining classification accuracy. The average computational time presented in Table 4, for each partition configuration, refers to the time needed for processing both RandomFIS models on a PC with an Intel Core i7-5820K CPU @ 3.30 GHz, 32 GB of RAM and Windows 7 x64.

Table 2. Average accuracy results for RandomFIS versions with 8, 16, 32 partitions.

| Datasets | RandomFIS_Ave (8) | RandomFIS_Ave (16) | RandomFIS_Ave (32) | RandomFIS_Rules (8) | RandomFIS_Rules (16) | RandomFIS_Rules (32) |
|---|---|---|---|---|---|---|
| Kddcup_DOS_vs_normal | 99.61 | 99.71 | 99.75 | 99.91 | 99.9 | 99.92 |
| Poker_0_vs_1 | 57.07 | 56.77 | 56.98 | 58.46 | 59.27 | 60.02 |
| Covtype_2_vs_1 | 73.15 | 73.23 | 73.18 | 76.58 | 76.54 | 76.66 |
| Census | 94.29 | 94.31 | 94.29 | 94.53 | 94.56 | 94.58 |
| Fars_Fatal_Inj_vs_No_Inj | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Average | 84.05 | 84.04 | 84.08 | 85.90 | 86.05 | 86.24 |

Table 3. Average Number of Rules for RandomFIS versions with 8, 16, 32 partitions.

| Datasets | RandomFIS_Ave (8) | RandomFIS_Ave (16) | RandomFIS_Ave (32) | RandomFIS_Rules (8) | RandomFIS_Rules (16) | RandomFIS_Rules (32) |
|---|---|---|---|---|---|---|
| Kddcup_DOS_vs_normal | 103.80 | 192.50 | 402.70 | 43.8 | 50.1 | 70 |
| Poker_0_vs_1 | 623.00 | 1271.10 | 2522.90 | 111.2 | 104.2 | 96 |
| Covtype_2_vs_1 | 97.00 | 186.70 | 384.60 | 31.5 | 35.4 | 44.7 |
| Census | 252.70 | 505.20 | 1012.90 | 54.6 | 66.7 | 76.6 |
| Fars_Fatal_Inj_vs_No_Inj | 360.90 | 734.40 | 1447.70 | 194.3 | 328.5 | 525.8 |
| Average | 287.48 | 577.98 | 1154.16 | 87.08 | 116.98 | 162.62 |

Table 4. Average runtime elapsed for RandomFIS versions with 8, 16, 32 partitions.

| Datasets | 8 partitions | 16 partitions | 32 partitions |
|---|---|---|---|
| Kddcup_DOS_vs_normal | 126777.77 | 250116.30 | 442169.25 |
| Poker_0_vs_1 | 84356.87 | 167628.51 | 330257.05 |
| Covtype_2_vs_1 | 3284.65 | 6013.94 | 11536.96 |
| Census | 7148.62 | 14666.95 | 29682.78 |
| Fars_Fatal_Inj_vs_No_Inj | 10865.68 | 22299.52 | 44596.19 |
| Average | 46486.72 | 92145.04 | 171648.45 |
The best RandomFIS model – RandomFIS_Rules – was compared to two similar models proposed in [13]: Chi-FRBCS and Chi-MapReduce. Figures 4 and 5 show the
results in terms of Classification Accuracy and Average Number of Rules.
4 Conclusions
This paper presented a new classification model, called RandomFIS, that is able to deal with high-dimensionality problems. The proposed model was evaluated on five different binary databases, and the results demonstrate that it provides good accuracy with a reduced number of fuzzy rules. This enhances its interpretability when compared to other similar models in the literature. Future work will involve the application of RandomFIS to benchmarks with multiple classes, as well as to real big datasets. Additionally, feature selection methods will be added to improve its performance, as well as other ways of combining the generated fuzzy classifiers.
References
1. Kuncheva, L.I.: Fuzzy Classifier Design, vol. 49. Springer, Heidelberg (2000)
2. Zhang, Y., Wu, X.B., Xing, Z.Y., Hu, W.L.: On generating interpretable and precise fuzzy
systems based on Pareto multi-objective cooperative co-evolutionary algorithm. Appl. Soft
Comput. J. 11(1), 1284–1294 (2011)
3. Koshiyama, A.S., Vellasco, M.M.B.R., Tanscheit, R.: GPFIS-CLASS: A Genetic Fuzzy
System based on Genetic Programming for classification problems. Appl. Soft Comput. 37,
561–571 (2015)
4. Lemos, A., Caminhas, W., Gomide, F.: Evolving Intelligent Systems: Methods, Algorithms
and Applications, pp. 117–159. Springer, Berlin, Heidelberg (2013)
5. Paredes, J., Tanscheit, R., Vellasco, M., Koshiyama, A.: Automatic synthesis of fuzzy
inference systems for classification. In: Carvalho, J.P., Lesot, M.-J., Kaymak, U., Vieira, S.,
Bouchon-Meunier, B., Yager, R.R. (eds.) IPMU 2016. CCIS, vol. 610, pp. 486–497.
Springer, Heidelberg (2016). doi:10.1007/978-3-319-40596-4_41
6. Ishibuchi, H., Nakashima, T., Nii, M.: Classification and modeling with linguistic
information granules: advanced approaches to linguistic Data Mining (2006)
7. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms (2004)
8. Panov, P., Džeroski, S.: Combining bagging and random subspaces to create better
ensembles. In: International Conference on Intelligent Data Analysis, pp. 118–129 (2007)
9. Kleiner, A., Jordan, M.I.: The big data bootstrap. In: Proceedings of 29th International
Conference Machine Learning, p. 8 (2012)
10. Fernández, A., Calderón, M., Barrenechea, E., Bustince, H., Herrera, F.: Solving multi-class
problems with linguistic fuzzy rule based classification systems based on pairwise learning
and preference relations. Fuzzy Sets Syst. 161(23), 3064–3080 (2010)
11. Calvo, T., Kolesárová, A., Komorníková, M., Mesiar, R.: Aggregation operators: properties,
classes and construction methods. Aggreg. Oper. New Trends Appl. 97(1), 3–104 (2002)
12. Lichman, M.: UCI Machine Learning Repository (2013). http://archive.ics.uci.edu/ml.
Accessed 01 Mar 2016
13. del Río, S., López, V., Benítez, J.M., Herrera, F.: A mapreduce approach to address big data
classification problems based on the fusion of linguistic fuzzy rules. Int. J. Comput. Intell.
Syst. 8(3), 422–437 (2015)
Big Data for a Linked Open Economy
Abstract. As the volume of data grows exponentially, more and more big data handling approaches are being applied to the linked data cloud. Thus, semantic triples, which are the nucleus of the Resource Description Framework (RDF), must be harmonized with the demanding needs of the 4 Vs. This paper presents the architecture and the main components of a big linked data repository named LinkedEconomy. The scope of the platform is to collect, process, interlink and publish highly detailed economic data in machine-readable format, in order to (a) provide a new data corpus for enriching research efforts in Economics, Statistics and Business studies, and (b) contribute to Big data analytics for corporate decision making.
Keywords: Big data · Linked data · Open data · Semantic web · Ontology · Economy · Public finance · Prices
The present paper describes the underlying structure and functionality of LinkedEconomy1, a co-funded project under the National Strategic Reference Framework (NSRF). LinkedEconomy can be considered a basic component of the Web economy, providing universal access to Greek and international economic data, as well as promoting the benefits of linking heterogeneous sources under the concept of big, open and reusable data in this critical domain [1]. These data flows are of high added value not only for exploration purposes, but also for exploiting the benefits of transparency and openness for the citizens, the research and business communities, and the government itself.
Related research is characterized by a two-fold approach. On the one hand, there are efforts that analyse the benefits of using open data in the economy, trying in parallel to provide a unified framework and modelling strategy [2–4]. On the other hand, there are initiatives that integrate the benefits of openness, thus creating a flourishing linked ecosystem of economy data [5–8].

1 http://linkedeconomy.org/en/.
Through our platform, we aim at tackling the challenge of building a common terminology for basic financial and economic activities, which will in turn facilitate research over new linked data and sources. All the components analyzed in the following sections form a system capable of linking economy-related data at large scale, creating in parallel a framework for collecting, validating, cleaning and publishing linked big data streams. To the best of our knowledge, linkedeconomy.org is an innovative effort that incorporates some of the cutting-edge semantic and big data handling approaches.
Open economic data related to public budgeting, spending and prices are characterized by high volume, velocity, variety and veracity. Thus, we decided to build custom components under the common logic of transforming static data into linked and big open data streams. This decision was made for the benefit of flexibility, which helps us address the 3 V's of economic data in a way that cannot be found in Linked (Open) Data Management Suites (e.g. UnifiedViews2).
The big picture of our approach is depicted in Fig. 1. As shown, data are stored raw (as harvested from the sources), in RDF and in JSON formats. Enriched data are distributed through five channels: data dumps (CKAN open source data portal software3), SPARQL queries, the Web, social media and structured inputs to Business Intelligence (BI) systems. Additionally, data can be further analysed and exchanged with relevant platforms (e.g. SPARQL to R). The validation component runs throughout the whole process to safeguard high data quality by detecting errors. The messaging component works as an internal messaging and alert system for all components. It also produces and automatically posts messages to Twitter related to data availability (e.g. Daily #opendata for public financial decisions (2/16) @diavgeia @YDMED @CKANproject @OpenGovGr @ODIAthens #diavgeia http://goo.gl/oGmfDP).
As core infrastructure we use okeanos4, an established cloud service provided to the Greek research and academic community. Our computational stack consists of 12 virtual machines with memory and storage capacities spanning from 4 GB to 16 GB of RAM and from 100 GB to 500 GB of storage respectively, as well as a non-commodity (physical) server with 12 CPUs, 128 GB of RAM and a storage capacity of more than 8 TB. The first column of Table 1 indicates the nodes of our infrastructure; the second column highlights their distinct role and the functionality they provide (in both internal and external procedures), while the last column lists the third-party dependencies and services employed, as well as the stored data sources.
2 https://www.semantic-web.at/unifiedviews.
3 http://ckan.org/.
4 https://okeanos.grnet.gr/home/.
5 https://github.com/garlik/4store.
6 https://www.big-data-europe.eu/.
7 https://flume.apache.org/.
8 http://kafka.apache.org/.
9 http://spark.apache.org/.
multiple sources and according to the processes of Fig. 2. We next present an instance of a unified view for organizations (see Fig. 3; case: Greek Ministry of Culture, http://linkedeconomy.org/en/page/paymentAgents?=afm=090283815). It consists of the tabs DIAVGEIA, E-Procurement (KHMDHS) and NSRF. Each tab presents related data about the organization according to each of the sources A, B and C described above. In this way, the user can get the "big picture" of the provided information at a single web point.
More analytically, the first tab shows aggregated information about a buyer and/or seller with respect to payments as published in DIAVGEIA. The user can see the basic information of a buyer/seller (e.g. name, VAT id, address), as well as payments and procurements from 2010 up to the current year. The user is also informed with detailed data on payments (such as commitments, approvals and finalized decisions) and procurements (e.g. assignments, notices and awards). The second tab shows information about a buyer or seller as published in the Central Electronic Public Procurement Register (e-Procurement). Similarly to the first tab, the user can see the basic information about a buyer or seller, as well as the respective tenders, contracts and payments from 2013 to the current year. More details are also provided for these items, such as the date, the name of the seller (or buyer), the amount, the common procurement vocabulary and the source. The third tab displays information about the NSRF (National Strategic Reference Framework) subsidies received by a beneficiary. The user sees the basic information of each beneficiary (e.g. name, VAT id, address) and information about the respective subsidies, such as the budget, the related contracts and payments, together with detailed information on each subsidy.
In a similar way, LinkedEconomy.org provides unified profiles for other cases, such as statistics based on the Common Procurement Vocabulary (CPV)10 and on Diavgeia11,12, as well as profile pages based on single data sources. Examples include the profiles and statistics of the Central Market of Thessaloniki (KATH)13,14, and profiles and statistics for products15,16 or shop/market points17,18.
10 http://linkedeconomy.org/en/page/cpv?=cpv=09000000-3.
11 http://linkedeconomy.org/en/diaugeia/stats-diaugeia.
12 http://linkedeconomy.org/en/diaugeia/errorsvat-diaugeia.
13 http://linkedeconomy.org/en/kathprices?=product=16.
14 http://linkedeconomy.org/en/kath-stats.
15 http://linkedeconomy.org/en/page/eprices-product?=id=107.
16 http://linkedeconomy.org/en/page/eprices-product-stats.
17 http://linkedeconomy.org/en/page/eprices-shop?=id=1090.
18 http://linkedeconomy.org/en/page/shops-stats.
Fig. 3. The result of a unified view instance - the Greek Ministry of Culture
In this section, we present the basic components and processes that interoperate in order to support LinkedEconomy.org as an ecosystem of big data, web technologies and semantics in the economy domain. All of our data are text-based information related to the economy field, supporting both static and dynamic streaming threads, while the adaptation of the 3 Vs to our paradigm can be summarized as follows:
Variety refers to the different types and nature of the data we can now use, which helps analysts use the resulting insight effectively. In the past, the focus was on structured data that neatly fits into tables or relational databases, such as financial data (for example, sales by product or region). Today, an estimated 80 percent of the world's data is unstructured and therefore cannot easily be put into tables or relational databases.
Veracity refers to the messiness or trustworthiness of the data: the quality of captured data can vary greatly, affecting accurate analysis. Big data and analytics technology now allow us to work with these types of data, and the volumes often make up for the lack of quality or accuracy.
Volume refers to the vast amount of data generated and stored in a small unit of time, in the sense that no traditional database can efficiently store it. With big data technology we can now store and use these data sets with the help of distributed systems, where parts of the data are stored in different locations, connected by networks and brought together by software when needed.
19 http://jxls.sourceforge.net/.
20 http://www.thessaloniki.gr/egov/budget.html.
21 http://opendata.kalamaria.gr:8080/accounting/opendata/budgetView.
22 https://github.com/ui4j/ui4j.
23 https://jena.apache.org/index.html.
3.3 Volume: Handling the Data: Storage, Indexing and Publishing
LinkedEconomy data are stored in a CouchDB24 instance in document format. CouchDB is a document-oriented database that stores data in JSON format, with a RESTful HTTP API for executing create, read, update and delete (CRUD) commands. In our case, we maintain more than 50 folders containing up to 300,000 documents. Updates occur weekly or monthly using the Bulk Document API25.
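For illustration, a periodic bulk update against CouchDB's _bulk_docs endpoint could look like the following sketch; the server URL, database name and documents are hypothetical placeholders.

```python
import requests

COUCH = "http://localhost:5984"   # hypothetical CouchDB server
DB = "diavgeia_payments"          # hypothetical database name

docs = [
    {"_id": "decision-2016-001", "amount": 1200.50, "source": "DIAVGEIA"},
    {"_id": "decision-2016-002", "amount": 310.00, "source": "DIAVGEIA"},
]

# POST the whole batch in one request through the Bulk Document API
resp = requests.post(f"{COUCH}/{DB}/_bulk_docs", json={"docs": docs})
resp.raise_for_status()  # each returned entry reports ok/id/rev or an error
print(resp.json())
```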
For full-text search we use CouchDB-Lucene26, which integrates CouchDB with the Lucene text-based search engine. The relevant CouchDB document fields are indexed, while queries are submitted to the created inverted indices27, thus providing fast responses.
Finally, in LinkedEconomy, we use the CKAN data portal to store and serve the data after the harvesting procedures. In order to fully automate the procedure, the CKAN component periodically uploads the data (in zip format). For each dataset that needs to be handled and uploaded to CKAN, the user (a member of our team) creates two distinct files with all the necessary manual settings. The first contains the parameters needed for choosing the right data and creating the compressed file, while the second contains the information needed for the basic CKAN functionalities.
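A minimal sketch of such an automated upload, using CKAN's action API over HTTP; the portal URL, API key, package id and file name are hypothetical placeholders (the project's actual settings files are not reproduced here).

```python
import requests

CKAN = "http://ckan.linkedeconomy.org"  # portal URL (hypothetical endpoint use)
API_KEY = "xxxx-xxxx"                   # hypothetical API key

# Attach a compressed dump to an existing dataset via resource_create
with open("diavgeia_2016.zip", "rb") as payload:
    resp = requests.post(
        f"{CKAN}/api/3/action/resource_create",
        headers={"Authorization": API_KEY},
        data={"package_id": "diavgeia-payments", "format": "ZIP"},
        files={"upload": payload},
    )
resp.raise_for_status()
print(resp.json()["success"])
```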
This paper presents the basic components and process flows of LinkedEconomy, which lead to publishing data of high added value not only for exploration purposes, but also for exploiting the benefits of transparency and openness for the citizens, the research community, and the government itself. Our efforts have produced a CKAN repository, which publishes datasets from sources that are updated regularly and contain valuable information with respect to social and economic research28. Citizens and economy stakeholders (government, local authorities) can exploit 27 datasets from 14 classified data sources in many different formats (XLSX, CSV, RDF). Examples include economic data such as public procurements, budgets, prices, expenditures, taxes and fines.
Finally, users of LinkedEconomy receive high-level aggregated information through the traditional Web 2.0 paradigm (web pages with rich content, social media, online community) with the help of advanced search capabilities and intelligent Structured Data Markup for the returned results. However, apart from traditional client-server techniques, our platform: (a) employs big data handling procedures in terms of variety, veracity and high-volume data, (b) envisages the use of Web 3.0 technologies by offering a publicly available SPARQL endpoint29 (offering machine-readable results from a fast-growing semantic repository of more than 1B triples in total), and (c) supports the open community by sharing all ontological schemas and related specifications on GitHub30.
24 http://couchdb.apache.org/.
25 https://wiki.apache.org/couchdb/HTTP_Bulk_Document_API.
26 https://github.com/rnewson/couchdb-lucene.
27 https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/search/Similarity.html.
28 http://ckan.linkedeconomy.org.
29 http://linkedeconomy.org/sparql.
Having the necessary expertise, and based on the same processes and components, we are currently working on modeling a long series of global economic datasets. We expect soon to provide more than 15 different economic datasets from Europe, the UK, Australia, the USA and Canada, and we plan to further extend this economy linked data cloud in the near future.
References
1. Vafopoulos, M.: The Web economy: goods, users, models and policies. Found. Trends Web
Sci. 3(1–2), 1–136 (2011)
2. O’Riain, S., Curry, E., Harth, A.: XBRL and open data for global financial ecosystems: a
linked data approach. Intl. J. Account. Inf. Syst. 13(2), 141–162 (2012)
3. Hodess, R.: Open budgets: the political economy of transparency, participation, and
accountability. J. Econ. Lit. 52(2), 545–548 (2014)
4. Tygel, A.F., Attard, J., Orlandi, F., Campos, M.L., Auer, S.: "How much?" is not enough: an analysis of open budget initiatives. arXiv:1504.01563 (2015)
5. Petrou, I., Meimaris, M., Papastefanatos, G.: Towards a methodology for publishing Linked
Open Statistical Data. JeDEM-eJournal eDemocracy Open Gov. 6(1), 97–105 (2014)
6. Höffner, K., Martin, M., Lehmann, J.: LinkedSpending: openspending becomes linked open
data. Semantic Web 7(1), 95–104 (2015)
7. Alvarez-Rodríguez, J.M., Vafopoulos, M., Llorens, J.: Enabling policy making processes by
unifying and reconciling corporate names in public procurement data. The CORFU
Technique, Comput. Stan. Interfaces 41, 28–38 (2015)
8. Vafopoulos, M.N., Vafeiadis, G., Razis, G., Anagnostopoulos, I., Negkas. D., Galanos, L.:
Linked Open Economy: Take Full Advantage of Economic Data, SSRN 2732218 (2016)
30 https://github.com/LinkedEcon/LinkedEconomyOntology-ELOD.
Smart Data Integration by Goal Driven
Ontology Learning
Abstract. A smart data integration approach is proposed to compose data and knowledge of different nature, origin, formats and standards. The approach is based on selective, goal-driven ontology learning. The automated planning paradigm, in combination with the value of perfect information approach, is proposed for evaluating the correspondence of knowledge with the learning goal in the data integration domain. The information model of a document is represented as a supplement to the Partially Observable Markov Decision Process (POMDP) strategy of a domain, which helps to estimate a document's pertinence as the increment of the strategy's expected utility. A statistical method for identifying semantic relations in natural language texts from their linguistic characteristics is developed. It helps to extract Ontology Web Language (OWL) predicates from natural language text using data about sub-semantic links. A set of methods and tools based on ontology learning was developed to support the smart data integration process. The technology uses the Link Grammar Parser natural language processing software and the WordNet Application Programming Interface (API), as well as the OWL API.
1 Introduction
The issue of integrating data assets, knowledge and technologies is tremendous; it has many solutions and consists of many significant sub-problems. The most promising solution is probably connected with both the agent approach and the semantic web, because big amounts of data interconnect as a whole, in different aspects and for different purposes, only by means of logically defined semantic relations. Their implementation ultimately depends on the ontology of a particular domain: its content, volume, structure, adaptivity and learning method. We consider the ontology as the formal explicit representation of the common terminology and its logical interdependence for a certain subject domain. The ontology formalizes the intensional meaning of the domain, e.g. a set of rules in terms of formal logic, while its extensional meaning is defined in the knowledge base as a set of facts about instances of concepts and the relationships between them.
The process of filling the knowledge base is called knowledge markup, or ontology population. In turn, ontology learning (OL) methods are methods for the automatic (semi-automatic) development of the ontology structure, based on natural language processing (NLP) and machine learning. Far less attention has been paid to the approaches developed in the field of automated planning (AP), in particular to the hierarchical task network (HTN) approach.
The bottleneck of the smart data integration approach lies in an effective OL technology. Manual OL is too expensive and time-consuming. On the other hand, the ontology structure should fit the needs of the domain tasks. The methods of simply collecting facts, implemented by most ontology population tools, are not effective enough. The absence of a logical structure causes a lack of integration of the collected data: the data are only classified, which is not enough for decision making in domain problem solving. A key issue of smart data integration, therefore, is to build an OL strategy that takes into account the particular domain tasks, their solving methods and tools. This means the ontology structure should include (during the OL process) all related terms and semantic relations. For this purpose, the OL method must recognize the measure of "usefulness" of the logical (semantic) structures of data detected by the applied NLP method: the knowledge pertinence.
We suppose the AP approach can provide such an infrastructure for estimating knowledge pertinence. Therefore, including AP in the ontology structure, based on the planning paradigm, should solve the problem of selective OL with the ultimate aim of smart data integration.
2 Related Works
The very first approaches to data integration in the field of text mining using NLP were based on processing text documents (classification, ranking by relevance) [1]. They used not systematic metadata about the document but the whole body of an unstructured text document, represented as a set of words or concepts (i.e., a "bag of words") [1, 2].
The first systematic research on automatic ontology construction (ontology learning) considered three main aspects of the problem: (1) methodology; (2) assessment; (3) usage scenarios in the particular subject area.
The key difference between ordinary and goal-driven ontology learning consists in selectively updating the ontology only with data of sufficiently high pertinence. The problem facing the developers of GDOL methods and tools is extremely complex because it includes the following sub-problems:
• Natural Language (NL) text linguistic analysis;
• building the message model, identified in a text document, using a formal knowledge representation language;
• constructing the ontology and knowledge base of the intelligent agent as a model of the information needs (interests) of a client of the information retrieval system;
• automatically constructing the optimal strategy of the intelligent agent;
• numerically evaluating the pertinence of the message, detected in a text document, for the client of the information retrieval system;
• numerically evaluating the reliability of the received message.
Only a combination of the above tools allows us to start solving the problem facing the developers of GDOL tools.
The interdisciplinary nature of GDOL makes us consider the process of knowledge extraction from NL texts as technology engineering in the field of knowledge engineering, similar to the design of vehicles or, when attempting to build a prototype of an artificial intelligence system, similar to the design of an aircraft. This similarity is due to the need to use different technology nodes (i.e., different IT).
$$p(X \mid C_j) \propto \prod_{k=1}^{d} p(X_k \mid C_j) \qquad (1)$$

$$p(C_j \mid X) \propto p(C_j) \prod_{k=1}^{d} p(X_k \mid C_j) \qquad (2)$$
The results of parsing an NL sentence with the LGP parser into pairs of words linked by meta-semantic links were used as descriptors (signs) of certain semantic links, with object and subject as parameters.
As an example, for the simple test sentence
[(a)(test.n)(is.v)(an)(example.n)]
the LGP parsing result is
[[0 1 0 (Ds)][1 2 0 (Ss)][2 4 0 (Ost)][3 4 0 (Ds)]].
The results of Bayesian recognition of the semantic relation using (2), after a short learning phase, are:
(1) cause: 1.0882684165532656E-4;
(2) caused-by: 0.013810506200916856;
(3) is-a: 0.024124901979118252;
(4) is-about: 0.0;
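The following sketch illustrates how Eq. (2) can be applied to score candidate semantic relations from LGP link-type descriptors. The priors and likelihoods are hypothetical placeholders, not the learned statistics of the actual system.

```python
import math
from collections import defaultdict

# p(C_j): prior over candidate semantic relations (hypothetical values)
prior = {"is-a": 0.25, "is-about": 0.25, "cause": 0.25, "caused-by": 0.25}

# p(X_k | C_j): likelihood of a link type given a relation, learned from
# labelled sentences; unseen pairs get a small smoothing value
likelihood = defaultdict(lambda: 1e-6, {
    ("Ss", "is-a"): 0.6, ("Ost", "is-a"): 0.5, ("Ds", "is-a"): 0.3,
    ("Ss", "cause"): 0.1, ("Ost", "cause"): 0.05, ("Ds", "cause"): 0.02,
})

def score(links, relation):
    # log form of Eq. (2): p(C_j) * prod_k p(X_k | C_j)
    s = math.log(prior[relation])
    for link in links:
        s += math.log(likelihood[(link, relation)])
    return s

links = ["Ds", "Ss", "Ost", "Ds"]  # link types from the parse above
best = max(prior, key=lambda rel: score(links, rel))
print(best)
```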
Fig. 1. Message model consisting of two main components: stating and constructive part.
where EMV is the probability-weighted sum of possible payoffs for each alternative:

$$\mathrm{EMV} = \max_i \sum_j p_j R_{ij},$$

where $\sum_j p_j R_{ij}$ is the expected payoff for action $i$; EV|PI is the expected (average) return if we a priori have the "perfect" (i.e., new) information for the best choice of $i$:

$$\mathrm{EV|PI} = \sum_j p_j \left( \max_i R_{ij} \right) \qquad (4)$$
To estimate the EVPI of new knowledge, we must have the EMV value for each solving approach (the action in (3)) for each task from the HTN of our ontology. To obtain them all, we have to create and solve the appropriate POMDP task:

$$\mathrm{EMV}_i \equiv U(S_i) = R(S_i) + \gamma \max_{A_{ik}} \sum_j P(S_i, A_{ik}, S_j)\, U(S_j) \qquad (5)$$
Equation (5) describes the reward for taking the action that gives the highest expected return. The additional information decreases the model uncertainty, and the expected common reward (utility) is not less than without such information. If we use POMDP algorithms, such as value and policy iteration as well as gradient ascent algorithms [16, 17], for a particular domain model, then we can evaluate the expected utility for both cases: with the new information and without it.
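For the fully observable core of the model, the Bellman backup of Eq. (5) can be evaluated by standard value iteration; the sketch below is illustrative, with hypothetical two-state, two-action numbers (the full POMDP solvers of [16, 17] are out of scope here).

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P[a, s, s'] transition probabilities, R[s] rewards, gamma discount."""
    U = np.zeros(R.shape[0])
    while True:
        # U(s) = R(s) + gamma * max_a sum_s' P(s, a, s') U(s'), cf. Eq. (5)
        U_new = R + gamma * np.max(P @ U, axis=0)
        if np.max(np.abs(U_new - U)) < eps:
            return U_new
        U = U_new

P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # action 0
              [[0.5, 0.5], [0.3, 0.7]]])  # action 1
R = np.array([0.0, 1.0])
print(value_iteration(P, R))  # expected utilities U(S_i), i.e. EMV_i
```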
$$D_{n,i+1} = \frac{D_{n,i}}{2 - s} + s\, \frac{1 - D_{n,i}}{2} \qquad (6)$$

where $s$ is the truth of the statement, taking the value 1 if the statement is true and 0 otherwise, and $i$ is the step number of the confirmation/denial of the truth of one statement of the $n$-th source.
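A small sketch of the reliability update of Eq. (6); the sequence of confirmations is a hypothetical example. D drifts towards 1 for a source whose statements keep being confirmed (s = 1) and towards 0 for one whose statements keep being denied (s = 0).

```python
def update_reliability(D, s):
    # Eq. (6): D_{n,i+1} = D_{n,i} / (2 - s) + s * (1 - D_{n,i}) / 2
    return D / (2 - s) + s * (1 - D) / 2

D = 0.5                      # initial reliability of the n-th source
for s in [1, 1, 0, 1, 1]:    # confirmation/denial outcomes per step i
    D = update_reliability(D, s)
    print(round(D, 4))
```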
4 Implementation
The OL concept presented above was implemented in the system of Cognition Relations Or Concepts Using Semantics (CROCUS). The system is built on the Java Protégé-OWL API, using OWL-DL as the knowledge representation language. The WordNet API was added to provide recognition of new domain concepts in the interactive learning mode. Using the MySQL DBMS, the CROCUS system can store, process and employ statistics for many domains and users simultaneously and independently during the learning process (Fig. 2).
5 Conclusions
A new approach to data integration as goal-driven ontology learning is proposed, and the pertinence evaluation method was elaborated as its key technology. It is shown that if the pertinence of new information is below the threshold value, then such information is rejected from the ontology learning process as irrelevant to the goal of the domain. Means for automated ontology learning were elaborated to build the agent's optimal strategy for solving the defined task. They are implemented in the Java programming language as the semantic analysis package CROCUS and have been tested on English educational texts.
The whole framework could serve as a prototype for knowledge discovery tools. It helps to supply ontology learning tools with pertinent information and to use the learned ontology to evaluate the pertinence of information.
The elaborated approach of goal-driven ontology learning in the framework of hierarchical task network OWL planning can be widely implemented using existing open source software libraries. Their fast development presages a corresponding development of all the relevant scientific fields: natural language processing, automated planning and ontology learning. This will change information search techniques and means dramatically. But before the developed approach can be widely used, developing the new ontology architecture and ontology learning paradigm remain the main challenges on the way.
Acknowledgments. This work is supported by the China 973 fundamental research and development project, grant number 2014CB340404, and by the National Natural Science Foundation of China, grant number 61373037.
References
1. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw Hill, New
York (1983)
2. Meadow, C.T., et al.: Text Information Retrieval Systems. Elsevier, Burlington (2007)
3. PubMed celebrates its 10th anniversary. Technical Bulletin, United States National Library of Medicine, 5 October 2006. Cited 22 March 2011
4. Jacso, P.: The impact of Eugene Garfield through the prism of web of science. Ann. Libr. Inf.
Stud. 57, 222 (2010)
5. Muller, H.M., Kenny, E.E., Sternberg, P.W.: Textpresso: an ontology-based information
retrieval and extraction system for biological literature. PLoS Biol. 2(11), e309 (2004).
doi:10.1371/journal.pbio.0020309
6. Tschantz, M.C.: Formalizing and enforcing purpose restrictions. Ph.D. thesis (2012)
7. Sirin, E., Parsia, B.: Planning for semantic web services. In: Proceedings of the Semantic
Web Services Workshop at 3rd International Semantic Web Conference (ISWC 2004)
(2004)
8. Bouillet, E., Feblowitz, M., Liu Z., Ranganathan, A., Riabov, A.: A knowledge engineering
and planning framework based on OWL ontologies. In: Proceedings of the Second
International Competition on Knowledge Engineering (ICKEPS 2007) (2007)
9. Freitas, A., Schmidt, D., Meneguzzi, F., Vieira, R., Bordini, R.H.: Using ontologies as
semantic representations of hierarchical task network planning domains. In: Proceedings of
WWW (2014)
10. Horridge, M., Bechhofer, S.: The OWL API: a Java API for OWL ontologies. Semant. Web
2(1), 11–21 (2011)
11. Sleator D., Temperley D.: Parsing English with a link grammar. Carnegie Mellon University
Computer Science Technical report CMU-CS-91-196, October 1991
12. Wong, W., Liu, W., Bennamoun, M.: Ontology learning from text: a look back and into the
future. ACM Comput. Surv. (CSUR) 44(4), 20 (2012)
13. Lytvyn, V., Medykovskyj, M., Shakhovska, N., Dosyn, D.: Intelligent agent on the basis of
adaptive ontologies. J. Appl. Comput. Sci. 20(2), 71–77 (2012)
14. Arboleda, H., Paz, A., Jiménez, M., Tamura, G.: A framework for the generation and
management of self-adaptive enterprise applications. In: 10th Computing Colombian
Conference (10CCC) (2015)
15. Hauskrecht, M.: Value-function approximations for partially observable Markov decision
processes. JAIR 13, 33–94 (2000)
16. Braziunas, D.: POMDP solution methods, Technical report, Department of Computer
Science, University of Toronto (2003)
17. Halbert, T.R.: An Improved Algorithm for Sequential Information-Gathering Decisions in
Design under Uncertainty. Master’s thesis, Texas A&M University (2015). http://hdl.handle.
net/1969.1/155384
An Infrastructure and Approach for Inferring
Knowledge Over Big Data in the Vehicle
Insurance Industry
1 Introduction
In recent years, an impressive growth of generated data has been observed across business and government organizations. New information is constantly added, and existing information is continuously changed or removed, in any format and coming from any type of data source (internal and external). The manipulation of these massive amounts of data has thus gone beyond the power and performance of conventional processes and tools. At the same time, the volume of data offers greater and broader opportunities for developing existing business areas or driving new ones by improving insight and decision-making and by detecting sources of profit [15, 17].
Big Data [18], one of the most common buzzwords dominating the IT world, can cover this need efficiently and effectively and is thus currently influencing most aspects of business units. In brief, this technological achievement can be thoroughly described by the following characteristics: Variety (the number of types of data), Velocity (the speed of data generation and processing), Volume (the amount of data) [21] and Value (the knowledge). However, the lack of semantics seems to be restrictive towards the 4th V of Big Data, while the aspect of the 1st V also remains a topic for further discussion.
2 Background
2.1 Big Data Ontology Access
In recent years, quite a few data storage and processing technologies have emerged around Big Data. Platforms like NoSQL, Yarn, Hadoop, MapReduce and HDFS are now some of the most familiar terms within this growing ecosystem [22]. On the semantic technologies' side, the "traditional" triplestores are continually evolving, following
the vision of exploring large data sets. Oracle Spatial and Graph with Oracle Database 12c, AllegroGraph, Stardog and OpenLink Virtuoso v6.1 are only some examples that expand their scalability and performance features to meet these premium requirements.
A lot of initiatives and synergies are in progress, with the main purpose of smoothly integrating the widely known Big Data technologies with whatever comes from the area of semantic technologies. For example, AllegroGraph 6 has recently been released with a certification on Cloudera Enterprise1, one of the leading companies in Apache Hadoop-based software. Furthermore, AllegroGraph has implemented extensions allowing users to query MongoDB databases using SPARQL and to execute heterogeneous joins [10].
Besides all the above notable new features of triplestores, the issue of querying data from various legacy relational databases still stumbles on the required ETL (Extract-Transform-Load) processes [20]. In these cases, the data flow can be summarized as extracting the data from the database, transforming it into triples and then storing it in conventional RDF stores. The OBDA approach seems to cope well with this scenario by leveraging the pure nature of the ontology notion. With OBDA, an ontology is used to expose data in a conceptual manner by abstracting away from the schema-level details of the underlying data. The connection between data and ontology is performed via mappings. The combination of ontology and mappings makes it possible to automatically translate queries posed over the ontology into data-level queries that can be executed by the specific underlying database management system. Ontop2, Mastro3 and Stardog4 are among the most popular OBDA systems.
1 http://franz.com/about/press_room/ag-6.0_2-8-2016.lhtml.
2 http://ontop.inf.unibz.it/.
3 http://www.dis.uniroma1.it/~mastro/?q=node/2?.
4 http://stardog.com/.
3 Insurance Ontology
In this section, we briefly describe how an ontology, fully expressed in the Web Ontology Language (OWL) [2], can be designed by considering the Property and Casualty (P&C) model. We present how data entities representing most P&C insurance business processes can be converted into ontology components. Following the InfoSphere Data Architect guidelines [14], we can simply correlate the logical data model's data types with OWL data types, the logical data model's entities with ontology elements, and the logical data model's relationships with OWL object properties. In detail, Table 1 gathers all the restrictions that must be followed for an effective conversion of these relationships.
Figure 2 depicts part of the neighborhood of the InsurableObject, Claim, Agreement, Person and other major entities of the model, presented as classes in the resulting ontology, based on the class hierarchy and the property relations that may associate instances of one class with another.
4 Architecture
Fig. 2. The mapping of InsurableObject and other entities in the P&C ontology.
In systems such as Ontop, there are two main elements: an ontology, which models the application domain using a shared vocabulary, and the mappings, which relate the ontological terms with the schemata of the underlying data sources. The ontology and the mappings combined expose a setting in which end-users can formulate queries without a deeper understanding of the data sources, the relations between them, or the encoding of the data.
The outcome of the mappings in conjunction with the defined ontology can be thought of as an RDF view, which may or may not be materialized. In the materialization case, the data are triplified and then used directly within an RDF triplestore, without requiring additional interaction with the initial data sources. In the second case, which is the one followed by our approach, the RDF view is called a virtual graph and can be queried at query execution time, thus allowing on-the-fly ontology-based access to, and inference on, the data. At the same time, this approach relieves us from the burden of data replication.
In our setting, the ontology is provided by our implementation of the P&C model in OWL, described in the previous section. To construct the mappings, we used the mapping tool integrated in the Ontop Protégé plugin. Some of the mappings designed to create the virtualized graph are shown in Table 2 in the form of ID, Source and Target.
4.2 Infrastructure
After the ontology definition and the design of the mappings, the next step is to set up how raw data are consumed. Ontop can be configured to access DBMS data or other sources through the Teiid data virtualization platform5. For our purposes, we interfaced the data dumps through a JDBC driver, by means of an H2/MySQL database. Given the appropriate mappings, it is also possible to include data flows from other sources as well, such as federated databases, insurance brokers, agents filing claims, vehicle sensors and other, possibly unstructured, data in a similar manner.

Table 2. A partial set of mappings used for the semantic modeling of insurance data.

| ID | Source (SQL query) | Target (Triples template) |
|---|---|---|
| 1 | SELECT plate, contract FROM InsuredItemVehicle | :{plate} a :Vehicle ; :hasInsurance :{contract} . :{contract} a :Policy . |
| 2 | SELECT customerCode as c, iv.plate FROM InsuredItemCustomer as ic, InsuredItemVehicle as iv WHERE ic.contract = iv.contract | :{c} a :Person ; :isOwnerOf :{plate} . |
| 3 | SELECT policyNumer as pol, ClaimNumber as r, totalPayAmount as amnt, iv.contract, plate FROM Claims, InsuredItemVehicle as iv WHERE pol = iv.contract | :Amount_{pol}_{r} :hasAmount {amnt}^^xsd:decimal . :Claim_{pol}_{r} a :Claim ; :settlementResultsIn :Amount_{pol}_{r} . :{plate} :involvedIn :Claim_{pol}_{r} |
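To make the virtual-graph idea concrete, the sketch below materializes, with rdflib, the triples that mapping #1 of Table 2 describes; this is for illustration only, since Ontop keeps the graph virtual and rewrites SPARQL into SQL at query time. The base IRI and the rows are hypothetical placeholders.

```python
from rdflib import Graph, Namespace, RDF, URIRef

NS = Namespace("http://example.org/insurance#")  # hypothetical base IRI
# stand-in for the result of "SELECT plate, contract FROM InsuredItemVehicle"
rows = [("ABC1234", "POL001"), ("XYZ9876", "POL002")]

g = Graph()
for plate, contract in rows:
    vehicle, policy = URIRef(NS[plate]), URIRef(NS[contract])
    g.add((vehicle, RDF.type, NS.Vehicle))     # :{plate} a :Vehicle
    g.add((vehicle, NS.hasInsurance, policy))  # ... :hasInsurance :{contract}
    g.add((policy, RDF.type, NS.Policy))       # :{contract} a :Policy
print(g.serialize(format="turtle"))
```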
Having established the connection to the data, queries can then be performed using SPARQL. To this purpose, Ontop exposes a SPARQL interface via the Protégé plugin. For query evaluation, the system parses the query and communicates with its internal reasoner module, which has already loaded the ontology, in order to infer any implicit semantic implications. Next, the original query is internally transformed into one or more SQL queries addressed to the federation of data sources, taking into account the mappings already defined and the outcomes of the reasoning process.
As a result, query answering and reasoning take place on the fly, without the need to first replicate the data into ontology instances or materialize the graph resulting from the mappings. This also means that online updates to the data are directly reflected in the answers received by SPARQL queries, as all evaluation occurs at query execution time. Ontop can also be used as a standard SPARQL HTTP endpoint by extending the Sesame Workbench, a web application for administering RDF repositories [3]. The data flow and the query evaluation process are depicted in Fig. 3.
A set of example queries over the insurance dataset is presented in this section. All three queries involve some form of reasoning, so as to demonstrate the added value that ontology-based inferences can bring to insurance data. They are also indicative of the expressivity of the logical axioms allowed in OWL 2 QL, such as inheritance, hierarchy, inverse properties and existential restrictions.
5 http://teiid.jboss.org/.
Fig. 3. Architecture of the experimental setup for data flow and query answering.
In the first query, shown in Fig. 4, even though we have not designed a mapping rule defining instances of the InsurableObject class, the instances of Vehicle are automatically classified as such, because Vehicle is defined as a subclass of InsurableObject in the ontology (see Fig. 2).
The next example, shown in Fig. 5, retrieves the insurance policies of all vehicles owned by a specific Party. Note that we are able to use the isInsuranceOf property in the query, instead of hasInsurance as specified by mapping #1 in Table 2, because the two are defined as inverses in the ontology (see Fig. 2).
With mapping #3, we relate a Vehicle to a Claim and the Claim to its resulting settlement ClaimAmount. The third query (Fig. 6) discovers vehicles involvedIn Claims that have already been settled with a ClaimAmount through settlementResultsIn. This is made possible by the auxiliary class AlreadySettled, specified as an existential restriction on the settlementResultsIn property.
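Since Figs. 4–6 are not reproduced here, the sketch below gives hedged approximations of the three queries as SPARQL strings; the prefix IRI and the individual :person123 are hypothetical placeholders. Under Ontop, the reasoner rewrites each query using the ontology before translating it into SQL over the mappings.

```python
PREFIX = "PREFIX : <http://example.org/insurance#>\n"

# Query 1: vehicles are returned although only :Vehicle facts are mapped,
# because :Vehicle is a subclass of :InsurableObject in the ontology.
q1 = PREFIX + "SELECT ?x WHERE { ?x a :InsurableObject . }"

# Query 2: uses the inverse property :isInsuranceOf instead of :hasInsurance.
q2 = PREFIX + """
SELECT ?policy WHERE {
  :person123 :isOwnerOf ?vehicle .
  ?policy :isInsuranceOf ?vehicle .
}"""

# Query 3: vehicles involved in claims that already have a settlement amount,
# via the auxiliary class :AlreadySettled (an existential restriction on
# :settlementResultsIn).
q3 = PREFIX + "SELECT ?v WHERE { ?v :involvedIn ?c . ?c a :AlreadySettled . }"
```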
and risk management, the obstacle of gathering data from different access points in different formats, as well as of its semantic analytics, still remains.
In this work, we have exploited the capabilities of OBDA systems in order to highlight the contribution of semantic technologies to the vehicle insurance industry. Having at our disposal a voluminous dataset from an existing insurance company, we set up an infrastructure to infer knowledge in an efficient manner. The key components are our proposed P&C ontology and a set of logical axioms and mappings that correlate the data with the ontology. We have shown that the ontology can be utilized as an intermediate level for directly accessing legacy, existing databases and performing reasoning-enabled queries on them.
While graph materialization appears to perform well with Ontop, there is some delay when importing raw data into the relational schema. This may prove a bottleneck further on, so it may be worth investigating concurrent data flow approaches for RDF virtualization and SPARQL querying, like C-SPARQL [1].
References
1. Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: Querying RDF streams
with C-SPARQL. SIGMOD Rec. 39(1), 20–26 (2010)
2. Bechhofer, S., Van Harmelen, F., Hendler, J., Horrocks, I., Mc Guinness, D.L.,
Patel-Schneider, P.F., Stein, L.A.: OWL Web Ontology Language Reference, W3C
Recommendation http://www.w3.org/TR/2004/REC-owl-ref-20040210/
3. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: a generic architecture for storing
and querying RDF and RDF Schema. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS,
vol. 2342, pp. 54–68. Springer, Heidelberg (2002)
4. Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M.,
Rodriguez-Muro, M., Xiao, G.: Ontop: answering SPARQL queries over relational
databases. Semantic Web – Interoperability, Usability, Applicability (2016, in Press). ISSN:
1570-0844
5. Jenkins, W., Molnar, R., Wallman, B., Ford, T.: Property and Casualty Data Model
Specification (2011)
6. Kalou, A.K., Koutsomitropoulos, D.A.: Linking data in the insurance sector: a case study.
In: Iliadis, L., Maglogiannis, I., Papadopoulos, H., Sioutas, S., Makris, C. (eds.) AIAI 2014.
IFIP AICT, vol. 437, pp. 320–329. Springer, Heidelberg (2014)
7. Lanti, D., Rezk, M., Xiao, G., Calvanese, D.: The NPD benchmark: reality check for OBDA
systems. In: Proceedings of the 18th International Conference on Extending Database
Technology (EDBT), pp. 617–628 (2015)
8. Llull, E.: Big data analysis to transform insurance industry. Technical article, Financial
Times (2016)
9. Marr, B.: How Big Data is changing insurance forever. Technical article, Forbes (2015)
10. Michel, F., Faron-Zucker, C., Montagnat, J.: A mapping-based method to query MongoDB
documents with SPARQL. In: Hartmann, S., Ma, H. (eds.) DEXA 2016. LNCS, vol. 9828,
pp. 52–67. Springer, Heidelberg (2016). doi:10.1007/978-3-319-44406-2_6
11. Mitchell, I., Wilson, M.: Linked Data: Connecting and exploiting big data. White paper.
Fujitsu UK (2012)
12. World Bank Group. Transport and ICT: Open Data for Sustainable Development. Technical
report (2015)
13. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.: Linking
data to ontologies. In: Spaccapietra, S. (ed.) Journal on Data Semantics X. LNCS, vol. 4900,
pp. 133–173. Springer, Heidelberg (2008)
14. Soares, S.: IBM InfoSphere: A Platform for Big Data Governance and Process Data
Governance. MC Press Online, LLC, February 2013
15. Sodenkamp, M., Kozlovskiy, I., Staake, T.: Gaining IS business value through big data
analytics: a case study of the energy sector. In: Proceedings of the Thirty Sixth International
Conference on Information Systems (ICIS), Fort Worth, USA, pp. 13–16 (2015)
16. The Object Management Group (OMG). MDA Guide Version 1.0.1 (2003)
17. Tsai, C.W., Lai, C.F., Chao, H.C., Vasilakos, A.C.: Big data analytics: a survey. J. Big Data
2(21), 1–32 (2015)
18. Ylijoki, O., Porras, J.: Perspectives to definition of big data: a mapping study and discussion.
J. Innov. Manag. 4(1), 69–91 (2016)
19. Lenzerini, M.: Ontology-based data management. In: Proceedings of CIKM 2011, pp. 5–6
(2011)
20. Rodriguez-Muro, M., Calvanese, D.: Quest, an OWL 2 QL reasoner for ontology-based data
access. In: Proceedings of the 9th International Workshop on OWL: Experiences and
Directions (OWLED 2012). CEUR Electronic Workshop Proceedings, vol. 849 (2012)
21. Laney, D.: 3D data management: Controlling data volume, velocity and variety. META
Group Research Note 6, 70 (2001)
22. Press, G.: Top 10 hot big data technologies. Technical article. Forbes (2016)
Defining and Identifying Stophashtags
in Instagram
analyzing the informative content of each term and measuring its discriminant and characteristic capability: in order for a term to be discriminant it has to stand out in one category against the others, while in order for a word to be characterized as a stopword it has to be common across all categories. In addition, hashtags that promote the glorification of self-harm are no longer searchable, according to the Instagram community guidelines [9].
Drewe [6], with the aid of the Instagram API and a list of popular hashtags, created a list of unsearchable hashtags. This list, however, is unofficial and incomplete, and needs regular updates, since it was created using ad-hoc processes rather than a scientific methodology. Sedhai and Sun [13], in their research on locating spam tweets, concluded that 40% of spam tweets have three or more hashtags and are more likely to use the word 'follow' as part of the tweet hashtags. Fan et al. [7] dealt with the problem of spam hashtags in Flickr and developed an algorithm to clean spam tags through cross-modal tag cleansing and junk image filtering.
Fig. 1. An example of Instagram image along with its hashtags. Note the presence of
meaningless hashtags such as #likeforlike, #instapic, #instalike.
Yang and Lee [17] extract descriptive keywords from web pages and measure the relatedness between web pages and tags in order to detect spam tags in social bookmarking. Tang et al. [15], in their research on eliminating noise tags in a folksonomy system, propose a two-stage semantic-based method: first, they remove non-descriptive tags, and then the semantic similarity between tags is examined in order to remove the noise tags. Zhu et al. [18], in their approach to tag refinement, propose a form of convex optimization which considers tag characteristics, error sparsity, content consistency and tag correlation.
According to a previous study [8], only 30 % of Instagram hashtags are relevant to the visual content of posted photos. The remaining 70 % can be attributed to metacommunicative use, trend mimicry and/or users' efforts to fool the Instagram search engine so as to attract views for their photos. Figure 1 shows an indicative example depicting hashtags belonging to the above categories. The picture depicts a pit bull along with the hashtags the creator/owner used to annotate it. Hashtags such as #mydog and #pitbullchocolate are quite descriptive of the visual content of this image, while #pitbullisnotacrime is a hashtag used in a metacommunicative way to express rallying support. On the other hand, hashtags such as #likeforlike, #instapic and #instalike are meaningless 'trendy' hashtags without any actual descriptive or metacommunicative value. These hashtags can be considered stophashtags, since they appear in many visually irrelevant images, in a similar manner as common words appearing in irrelevant documents are considered stopwords and discarded in document indexing and classification.
2 Methodology
In order to derive concrete conclusions in our study, we propose a mathematical framework which allows us to extract the results, i.e., candidate stophashtags, combined with a social science research methodology to evaluate the obtained results. Moreover, we propose a novel algorithm for calculating hashtag scores and locating stophashtags.
As already mentioned, we consider as stophashtags those hashtags that are meaningless and appear in different images retrieved through different and generally non-related hashtag categories. We select N subjects/hashtags that have no obvious thematic relation (e.g. the subjects 'food' and 'gadgets') and retrieve all images related to these subjects. We argue that common hashtags appearing across the retrieved images have no obvious relation to the images' visual content; thus, they have no descriptive value. The proposed methodology is analytically presented in the form of pseudocode in Algorithm 3. It involves six basic steps: (1) create a list of N independent hashtags, (2) for each hashtag H collect relevant images, (3) for each image I find all hashtags, (4) for each hashtag H compute the stophashtag score S_H, (5) compute the threshold T of the S_H scores with the aid of Otsu's method, (6) if S_H > T, add H to the stophashtag list.
$$SF_H = \frac{N_H - 1}{N - 1} \qquad (1)$$

where N > 2 and N_H is the number of subjects in which hashtag H appears. We use N_H − 1 in the numerator of Eq. 1 to indicate that the hashtag used to retrieve the photos within a subject (let us call it the retrieval keyword) is, by definition, relevant to the content of the photos retrieved within this particular subject and cannot be considered a stophashtag. For instance, assume that one of the N subjects/hashtags used to collect photos is the hashtag #car. In order to estimate the normalized subject score for the hashtag #car itself, we should count its frequency of appearance in the remaining subjects, excluding the one in which it was used as a retrieval keyword. In this respect, the number of independent subjects (used in the denominator of Eq. 1) is N − 1.
$$IF_H = \begin{cases} 0, & N_H = 1 \\[4pt] \dfrac{I_H}{I}, & N_H > 1 \end{cases} \qquad (2)$$

$$S_H = a_N \cdot SF_H + (1 - a_N) \cdot IF_H \qquad (3)$$
where 0 < a_N < 1 is a weighting factor indicating the relative importance of the normalized subject and image frequencies. Recommended values for a_N, found through experimentation, lie in the interval [0.75, 1).
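A minimal sketch of Eqs. 1–3 and of the Otsu thresholding step (step 5 of the methodology) is given below. The function names, the histogram bin count, and the dictionary-based input representation are illustrative assumptions, not part of the original algorithm; I_H is taken as the number of retrieved images containing hashtag H and I as the total number of retrieved images.

```python
import numpy as np

def stophashtag_scores(subject_counts, image_counts, n_subjects, n_images, a_n=0.8):
    """Compute S_H = a_N * SF_H + (1 - a_N) * IF_H (Eq. 3) for every hashtag.

    subject_counts: dict hashtag -> N_H, number of subjects whose images contain it.
    image_counts:   dict hashtag -> I_H, number of retrieved images containing it.
    """
    scores = {}
    for h, n_h in subject_counts.items():
        sf = (n_h - 1) / (n_subjects - 1)                      # Eq. 1
        i_f = image_counts[h] / n_images if n_h > 1 else 0.0   # Eq. 2
        scores[h] = a_n * sf + (1 - a_n) * i_f                 # Eq. 3
    return scores

def otsu_threshold(values, bins=64):
    """1-D Otsu threshold: pick the cut that maximises between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for k in range(1, bins):
        w0, w1 = p[:k].sum(), p[k:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (p[:k] * centers[:k]).sum() / w0   # mean of the low-score class
        m1 = (p[k:] * centers[k:]).sum() / w1   # mean of the high-score class
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[k]
    return best_t
```

Hashtags whose score exceeds the Otsu threshold T would then be appended to the stophashtag list, as in step 6 above.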
Table 1. Stophashtag list identified using the proposed methodology (stophashtags confirmed through human evaluation are denoted with *; see Table 2)
photo in question. From the photos retrieved through the N subjects, 30 were randomly selected and included in an online questionnaire which was delivered to users through SurveyMonkey (http://www.surveymonkey.com/). For each of the questionnaire photos, participants had to select among eight hashtags, including both descriptive hashtags and stophashtags, all of them assigned to the photo by its owner/creator.
The 30 photos were split into two separate questionnaires, containing 15 pictures each, in order to reduce the fill-in time and avoid fatigue effects. Among the eight choices given for each photo, 2–3 correspond to descriptive hashtags and the rest to stophashtags. In any case, participants were not aware that any of the given choices were considered either descriptive or stophashtags; thus, they were free to select as many of them as they wished, according to their interpretation of the shown photo. Choices not selected by participants are likely to correspond to stophashtags, since participants consider them not descriptive of the photo presented to them. In that context, we can use effectiveness measures such as Precision and Recall to identify which of the stophashtags identified by the algorithm were also -indirectly- classified as such by humans, and vice versa. For this purpose, however, we needed to create the list of stophashtags according to human judgement.
Let H_T denote the number of times a hashtag was selected by the users, and H_S the number of times users had the possibility to choose it as descriptive of a picture's visual content, over the 30 pictures presented to the users. The descriptive score D_H of each hashtag H is then given by the following formula:

$$D_H = \frac{H_T}{H_S} \qquad (4)$$
For instance, consider that a hashtag was selected two times and was given to the users as a choice for only one image. If that photo was accessed 25 times, that is, it appeared in 25 filled-in questionnaires, then D_H equals 2/25. Based on D_H, and by using a threshold T_D computed with the aid of Otsu's method [12], we classified the hashtags into descriptive (D_H > T_D) and stophashtags (D_H ≤ T_D) according to human judgement. T_D was found to be 0.17.
Table 2 shows the stophashtags identified through human judgement with the procedure explained above. By comparing Tables 1 and 2 we can see that 26 hashtags appear in both tables, while the 17 hashtags with the highest stophashtag score S_H, shown in Table 1, were -indirectly- verified by human judgement. Thus, we have a clear indication of the effectiveness of the proposed method, especially as far as the definition of the stophashtag score (see Eq. 3) is concerned. On the other hand, there are some hashtags identified by the proposed algorithm, such as #picoftheday, #vscogood, #vsco hub, etc., that were not confirmed by human judgement, although it is clear that they lack descriptive power.
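As a sketch of the evaluation just described, the Precision/Recall comparison between the algorithm's stophashtag list (Table 1) and the human-derived list (Table 2) reduces to simple set overlap; the function below is an illustrative assumption, not the authors' code.

```python
def precision_recall(algorithm_list, human_list):
    """Precision/recall of the algorithm's stophashtags against human judgement."""
    alg, hum = set(algorithm_list), set(human_list)
    tp = len(alg & hum)                         # stophashtags confirmed by both
    precision = tp / len(alg) if alg else 0.0   # fraction of algorithm output confirmed
    recall = tp / len(hum) if hum else 0.0      # fraction of human list recovered
    return precision, recall
```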
4 Conclusion
References
1. Armano, G., Fanni, F., Giulian, A.: Stopwords identification by means of character-
istic and discriminant analysis. In: Proceedings of the 7th International Conference
on Agents Artificial Intelligence (ICAART 2015), pp. 353–360. Lisbon, Portugal
(2015). doi:10.5220/0005194303530360
2. Baranovic, M.: What #hashtags mean to mobile photography (2013). http://
connect.dpreview.com/post/1256293279/hastag-photography
3. Bramer, M.: Principles of Data Mining. Springer, London (2007)
4. Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of ACM Conference on Image and Video Retrieval, pp. 368–375. Santorini, Greece (2009). doi:10.1145/1646396.1646452
5. Daer, A.R., Hoffman, R., Goodman, S.: Rhetorical functions of hashtag forms across social media applications. Commun. Des. Q. Rev. 3(1), 12–16 (2014). doi:10.1145/2721882.2721884
6. Drewe, N.: The Hilarious List of Hashtags Instagram Won't Let You Search. http://thedatapack.com/banned-instagram-hashtags-update/#more-171
7. Fan, J., Shen, Y., Zhou, N., Gao, Y.: Harvesting large-scale weakly-tagged image databases from the web. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2010), pp. 802–809. San Francisco, United States (2010). doi:10.1109/CVPR.2010.5540135
8. Giannoulakis, S., Tsapatsoulis, N.: Instagram hashtags as image annotation metadata. In: Proceedings of the 11th International Conference on Artificial Intelligence Applications and Innovations (AIAI'15), pp. 206–220. Bayonne, France (2015). doi:10.1007/978-3-319-23868-5_15
9. Instagram’s New Guidelines Against Self-Harm Images & Accounts.
In: Instagram Inc (2016). http://blog.instagram.com/post/21454597658/
instagrams-new-guidelines-against-self-harm
10. Jin, R., Chai, J.Y., Si, L.: Effective automatic image annotation via a coherent
language model and active learning. In: Proceedings of the 12th ACM Interna-
tional Conference on Multimedia (ACM Multimedia 2004), pp. 892–899. New York,
United States (2004). doi:10.1145/1027527.1027732
11. Jin, C., Jin, S.-W.: Automatic image annotation using feature selection based
on improving quantum particle swarm optimization. Sig. Process. 109, 172–181
(2015). doi:10.1016/j.sigpro.2014.10.031
12. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans.
Syst. Man Cybern. 9(1), 62–66 (1979). doi:10.1109/TSMC.1979.4310076
13. Sedhai, S., Sun, A.: HSpam14: a collection of 14 million tweets for hashtag-oriented
spam research. In: Proceedings of the 38th International ACM SIGIR Conference
on Research and Development in Information Retrieval, pp. 223–232. Santiago,
Chile (2015). doi:10.1145/2766462.2767701
14. Snoek, C.G., Worring, M.: Concept-based video retrieval. Found. Trends Inf.
Retrieval 2(4), 215–322 (2008). doi:10.1561/1500000014
15. Tang, R., Zuo, J., Xu, K., Zheng, J., Wang, Y.: An intelligent semantic-based
tag cleaner for folksonomies. In: Proceedings 2010 International Conference on
Intelligent Computing and Integrated Systems (ICISS2010), pp. 773–776. Guilin,
China (2010). doi:10.1109/ICISS.2010.5657118
16. Theodosiou, Z., Tsapatsoulis, N.: Crowdsourcing annotation: modelling keywords
using low level features. In: Proceedings of the 5th International Conference on
Internet Multimedia Systems Architecture and Application (IEEE IMSAA 2011),
pp. 1–4. Bangalore, India (2011). doi:10.1109/IMSAA.2011.6156351
17. Yang, H.-C., Lee, C.H.: Identifying spam tags by mining tag semantics. In: Pro-
ceedings 3rd International Conference on Data Mining and Intelligent Information
Technology Applications (ICMiA), pp. 263–268. Macao, China (2011)
18. Zhu, G., Yan, S., Ma, Y.: Image tag refinement towards low-rank, content-tag
prior and error sparsity. In: 18th ACM international conference on Multimedia
(MM 2010), pp. 461–470. Firenze, Italy (2010). doi:10.1145/1873951.1874028
Big Data and the Virtuous Circle
of Railway Digitization
1 Introduction
Railway safety has improved over the years, often as a result of learning hard lessons from railway catastrophes. The modern approach is to try to envisage the accident before it happens and put mitigations in place. We are now entering an age of complex socio-technological systems that rely upon computer control and human–machine interfaces. This is particularly true in the railway industry, which is moving towards more sophisticated automatic train protection (ATP). This could mean that the human workers who are still involved will find that fully understanding the systems and operations is an almost intractable problem (Hollnagel 2015).
The situation is not all bad news, though, as there are other technologies and philosophies emerging that could, when taken together, help take the railway to a new low in accident and incident occurrence. These opportunities are described in this paper as a digital virtuous circle, which is illustrated in Fig. 1.
The key elements are: a new way of considering safety termed Safety II (Hollnagel
2015), the advent of Big Data (BD) and the Internet of Things (IoT) (Parkinson and
Bamford 2016), the Digital Railway initiative (Network Rail 2015), and the potential
innovation of simulation, virtual and augmented reality and how they might fit
together.
The next section contains a brief discussion of the philosophy behind railway safety and why we may need a new approach. Section 3 contains a discussion of Safety II and of how the existing techniques are no longer adequate. Section 4 provides an overview of the disruptive technologies currently affecting our approach to safety.
Section 5 introduces an approach to assessing risk, building in the potential of Big Data. Section 6 describes how the Digital Railway initiative, coupled with Systems Engineering (SE) and Computer Aided Design (CAD) data, would provide a foundation for the Big Data approach. Section 7 describes how simulation using all the aforementioned techniques might close the digital virtuous circle, and Sect. 8 reviews how this all can support the intent of Safety II.
2 Railway Safety
The unique characteristics of railway safety are that there is a vehicle with a large mass travelling at high velocity with low-friction, steel-wheel-to-steel-rail contact, making for very long braking distances that are often longer than the driver's field of vision. Also, the train travels on a guiding track, which allows no opportunity to avoid collision by steering action. High collision speeds and high kinetic energy make for high-severity consequences. To make risks acceptable, therefore, it is necessary to reduce the probability of train accidents to a vanishingly small number.
Railway tragedies over many years have led to numerous safety improvements
which include (Erdos 2004):
• Braking Systems: Vacuum brakes, compressor based braking systems, the West-
inghouse Brake system,
• Communication Systems: Electric telegraph, telephone and radio systems,
• Vehicle Design: Crashworthiness avoiding telescoping,
• Signalling: Absolute Block Sections, Multiple-Aspect Signals, Track Circuits, Interlocking, Automated level crossings, ATP.
There is, of course, still the danger of large, unexpected railway accidents, such as the recent ones in Bavaria (BBC 2016) and Santiago de Compostela in Spain (RSSB 2015), which together resulted in over one hundred deaths. These are attributable to organisational accident factors, are always complex in nature, and are usually the result of human error (Hollnagel 2015). The railway appears to be catching people out with more and more complex systems and rules.
So, in the railway as in other industries, there are tightly coupled technologies with non-linear functioning, producing 'outlier' events that usually result in significant impacts and which human supervisors find difficult to control in normal and degraded operation. The next section discusses these implications in a little more detail and what might be needed to control them.
3 Safety II
Hollnagel (2015) has suggested that, instead of focusing upon failure, we focus upon how the system actually works in the real world, and not as it is imagined to work by a development team. Once we understand how the system interacts with humans, it can be established how best to manage it safely. Hollnagel (2015) has described this as Safety II, which would supersede the traditional Safety I approach, although the latter would still be suitable for non-complex systems.
Figure 3, taken from Hollnagel (2015), illustrates the imbalance that occurs when the focus is on what goes wrong and not on how things go right.
Fig. 3. The imbalance between things that go right and things that go wrong (from Hollnagel 2015): a failure probability of 10⁻⁴ corresponds to 1 failure and 9,999 non-failures in 10,000 events.
Safety I has its focus on analysing failure and has evolved from the analysis of simpler electro-mechanical systems through three ages, as described by Hale and Hovden (1998): the age of technology, the age of human factors and the age of safety management. Many of the techniques still used are founded in simpler times, such as failure mode and effect analysis (FMEA) and fault tree analysis (FTA), amongst others. A thorough understanding by the practitioner of past incidents is essential in all these processes. Without this knowledge it is impossible to understand errors, failures, and how hazards propagate into accidents.
As technical failures have been reduced, and as the complexity and difficulty of understanding complex systems have increased, more and more accidents are being attributed to human error. It is becoming increasingly difficult to train workers to respond to the various hazardous states in which these complex systems can find themselves. So, new approaches are needed to understand, analyse and optimise day-to-day safety management. Modern systems are very reliable, and focussing upon failure is the wrong way to analyse these systems if the next step change in safety is to be achieved.
As stated earlier, new techniques, data and initiatives are available that will enable a new focus upon how systems actually work. In addition, technologies to simulate environments exist that help increase the ability of the human worker to deal with system perturbation and increase the resilience of the safety system. The next section contains an overview of some of these new technologies.
4 Disruptive Technologies
It is a widely held belief that we are entering a revolution in technology and computer intelligence, driven primarily by increases in computing power and the development of intelligent algorithms. This is being described as the 3rd Industrial Revolution (Ford 2015). Several so-called disruptive technologies are present, including Big Data (BD), the Internet of Things (IoT), Virtual Reality (VR), Augmented Reality (AR), and Machine Learning (ML), to name a few.
This interaction of disruptive technologies can be viewed as a key part of the virtuous circle. The pace at which ML is advancing is now hard to deny (Parkinson, Bamford and Kandola 2016). Many of these technologies have promised great things in the past and not delivered. For example, neural networks were very much in fashion in the 1990s; however, due to the lack of computational power and the lack of understanding of how to combine them with other approaches, they did not deliver the required accuracy for safety-related predictions (Iwnicki and Parkinson 1999). Gartner (2016) has identified a curve that describes this trend in the uptake of new technology, as shown in Fig. 4.
Many of these disruptive technologies are now beginning to climb the slope of
enlightenment and are approaching the plateau of productivity.
We are now in the Big Data Analytics (BDA) age, where mountains of data should enable us to understand how complex socio-technical systems actually work. This is a central idea of Safety II.
In addition, visualisation will be key to providing 'users' with a view of system dynamics. Data/information overload is a problem that will need to be overcome, possibly by presenting only information that is critical to the business or the individual. This new area of critical visualisation is being called the Internet of Me: a 'what matters for you' view of the world (IET Turing Lecture 2016).
Spending on railway systems in the next twenty years is set to explode, and it is critical that best practices in safety, data analysis and systems engineering are employed. Parkinson and Bamford (2016) have defined a BDA approach to learning from accidents which will enable the identification of heightened risk. The approach uses an accident causation model called an ELBowTie, illustrated in Fig. 5. This model is based upon the bowtie methodology that is already well utilised in the safety engineering world (Yaneira 2013).
Parkinson and Bamford (2016) first establish a railway Enterprise Data Taxonomy (EDT), which lists the available sources of data that can be linked to railway operational risks. These include, for example, condition-based monitoring information from sensors, either analogue or digital, providing digital information such as vibration (accelerometers), machine vision, heat, displacement, strain, humidity, particle ingress, etc., which would be classified as 'Real Time, remote monitoring', already an accepted means of classifying this type of data. Other data types are less well defined, for example data from industry reports, staff morale and organisational culture, but can be equally as important in safety evaluations.

Fig. 5. ELBowTie Big Sat Railway Risk Assessment Tool (Parkinson and Bamford 2016)

The EDT is aimed at providing a mechanism for systematically classifying data related to safety incidents, to facilitate assessment of data sources in a traceable and consistent manner, giving the new approach its name: "Enterprise data taxonomy Linked Bow-Tie" (ELBowTie).
The research by Parkinson and Bamford (2016) looked at a series of railway accidents and established their potential hazardous causes and conditions. These causes and conditions were then linked to the EDT, which is in turn linked to the bowtie elements to predict heightened railway risk. It will be necessary to develop a bowtie for every railway hazard in order to develop the full risk picture.
At the heart of the tool is a series of Machine Learning algorithms that will be trained to analyse a stream of data and recognise when the system is outside of its normal bounds. These outputs are then amalgamated to take into account the complex interaction of the typical accident scenario (Parkinson, Bamford and Kandola 2016). The data is linked to the barriers on either side of the bowtie to enable real-time interventions to take place, either to prevent hazards occurring or to prevent hazards propagating into accidents. When a red-flag situation transpires, the particular initiators are known, thus enabling automated warnings to be activated. Figure 6 depicts this relationship.

Fig. 6. Proposed Structure for Combining Data (Parkinson, Bamford and Kandola 2016)

The paper by Parkinson, Bamford and Kandola (2016) also has an extensive discussion of the various machine learning approaches that might be taken and how these could be amalgamated.
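Structurally, the arrangement in Fig. 6 can be pictured as a set of trained detectors, one per monitored data stream, whose outputs are amalgamated into red flags against the bowtie barriers. The sketch below is purely illustrative: the names (BarrierMonitor, red_flags), the 0–1 anomaly score, and the thresholding logic are our assumptions, not the published ELBowTie implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class BarrierMonitor:
    """Hypothetical monitor linking one EDT data stream to one bowtie barrier."""
    name: str                                # e.g. "axle-bearing vibration"
    detector: Callable[[list], float]        # trained anomaly scorer, returns 0..1
    threshold: float                         # above this, the barrier is degraded

def red_flags(monitors: List[BarrierMonitor],
              streams: Dict[str, list]) -> List[Tuple[str, float]]:
    """Amalgamate per-stream anomaly scores and report degraded barriers."""
    flags = []
    for m in monitors:
        score = m.detector(streams[m.name])
        if score > m.threshold:
            flags.append((m.name, score))    # initiator is known -> automated warning
    return flags
```

Because each flag carries the name of the stream that raised it, the initiators of a red-flag situation are known, which is what allows warnings to be automated.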
The Digital Railway is an initiative in the UK driven mainly by Network Rail (Network Rail Ltd 2015) and is depicted in Fig. 7. Network Rail is the organisation responsible for maintaining the mainline railway infrastructure in Great Britain. Several other organisations, called TOCs (Train Operating Companies), are responsible for running trains, and there is a myriad of private maintenance contractors and engineering support companies.
The core element of the Digital Railway will be the ERTMS (European Rail Traffic Management System) which, simplistically, comprises an on-board computer that provides automatic train protection by calculating a safe train movement authority based upon track conditions and the status of the railway. The interlocking sends signals to the train, indicating whether a route is set and whether other trains are present. Around this core system, all the communication, passenger information, status, etc. will be provided.
The Digital Railway will provide a platform for an IoT approach with enhanced
communication systems and inbuilt sensors in assets.
Finally, we have moved into the areas of simulation and training. With the aid of the digital models created during system development, and using AR and VR, we can create scenarios for workers to investigate how systems really function. This allows managers and designers to communicate with the various stakeholders about how the system works in all its complexity, and not simply how they "think" it will work. This will enable effective training to be designed.
To build upon the ability to simulate systems, the recent training development of gamification is useful. Research suggests this is the optimum approach to training (Kapp 2012), where trainees are stimulated to learn through competition, collaboration and reward. E-learning has struggled to deliver on its promise due to a lack of good training design and creativity. However, the latest simulation technology and gamification opportunities for blended delivery will improve the situation. Blended delivery is the combination of face-to-face delivery with e-learning and online training interventions.
Training and simulation enable the understanding of safety concepts and the working principles to be properly laid out. Training course attendees discover the skills and knowledge by playing the game, and come away more motivated and better able to apply the principles.
A virtuous circle has been described that could take railways to a lower level of safety risk. Disruptive technologies could be threats to the way the railway is operated and to its continued viability, but if the chances for efficiencies and improvements in safety are taken, then the railway's prospects are improved. Machine learning and supporting technologies (Ford 2015) are now coming together in ways that will make much white-collar work radically different in 5 years' time, and potentially make much of it redundant in 10 years' time. This is likely to apply to railway risk and safety management activities too.
How does this affect the future of safety management? It will mean the replacement of risk assessment with real data. The sacred cow of risk assessment and quantification no longer works for socio-technological systems, as the world of railways is becoming too complex. Quantification is ineffective, as it seeks to model the world using a gross simplification by deriving numbers that imply a high level of accuracy. Approaches like the RSSB safety risk model, which uses historical data to determine quantified risk figures for hazards on the railway, would be made redundant, as the approach described in the ELBowTie would use real data in real time.
A data-driven approach will replace "conventional safety engineering". If one looks through the railway safety and reliability standard CENELEC EN50126 and the recommended activity at each life-cycle stage, it can be seen that nearly all the items are data driven. By using Big Data and Machine Learning, the majority of the work undertaken now by safety engineers could be eliminated.
9 Conclusion
The railway industry has advanced a long way with safety methods over the past fifty years and accidents are at a record low, but now there must be a major change if further improvements are to be realised. The approach of blaming 80 % of our accidents on human error (Hollnagel 2015) is clearly not right. Safety II, Big Data, the Digital Railway and Simulation have all come along in the last 5 years or so to produce a virtuous circle that will provide massive synergies. The complexity of modern socio-technological systems requires a new approach, and the elements discussed in this paper provide a way forward. This will offer railway stakeholders striking opportunities, including greater safety, happier workers, better reliability, greater efficiency and less waste.
References
BBC Report Bad Aibling Rail Disaster (2016). http://www.bbc.co.uk/news/world-europe-
35530538. Accessed 23 Mar 2016
CENELEC EN50126
Erdos, G.D.: The Evolution of Rail Safety. IRSE Technical Meeting, Adelaide, 29 October 2004 (2004)
Ford, M.: The Rise of the Robots. Oneworld, London (2015)
Gartner (2016). http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp.
Accessed 23 Mar 2016
Hale, A., Hovden, J.: Management and culture: the third age of safety (1998)
Hollnagel, E.: Safety I and Safety II. Ashgate, Farnham (2015)
IET and BCS Turing Lecture 2016: The Internet of me: It’s all about my screens. http://
conferences.theiet.org/turing/. Accessed 29 Mar 2016
Iwnicki, S., Parkinson, H.J.: Assessing railway vehicle derailment potential using Neural Networks. In: International IMechE Conference: Fault Free Infrastructure: Effective Solutions to Improve Efficiency, Derby, England, 23–24 November 1999. ISBN 1860582338 (1999)
Kapp, K.M.: The Gamification of Learning and Instruction. Pfeiffer/Wiley, San Francisco (2012)
Network Rail 2015, Digital Railway Program. http://www.networkrail.co.uk/Digital-Railway-
Introduction.pdf. Accessed 26 Feb 2016
Parkinson, H.J., Bamford, G.: The potential for using big data analytics to predict safety risks by
analyzing rail accidents. In: Accepted for 3rd International Conference on Railway
Technology: Research, Development and Maintenance, Cagliari, Sardinia, Italy, 5–8 April
2016
Parkinson, H.J., Bamford, G., Kandola, B.: The development of an enhanced bowtie railway
safety assessment tool using a big data analytics approach. In: IET Conference ICRE
Brussels, May 2016
RSSB Annual Safety Performance Report 2014/15 (2015). http://www.rssb.co.uk/Library/risk-
analysis-and-safety-reporting/2015-07-aspr-key-findings-2014-15.pdf. Accessed 26 Feb 2016
Yaneira, E.S., et al.: Bowtie diagrams in downstream hazard identification and risk assessment. In: 8th Global Congress on Process Safety, Houston, TX, 1–4 April 2012. American Institute of Chemical Engineers (2013)
Unified Retrieval Model of Big Data
Abstract. With the huge growth of big data, effective information retrieval methods have gained research focus. This paper addresses the difficulty of retrieving relevant information in a large system that involves fusion of data. We propose a retrieval model that enhances the retrieval process by learning from the user's metadata.
1 Introduction
data to retrieve pertinent and relevant information for providing good recommender systems for effective and timely decision-making [4–6].
Section 2 of this paper provides an overview of big data, including its structure and analysis techniques. In Sect. 3 we discuss retrieval model approaches in the context of big data. Issues and challenges of big data in developing an effective retrieval model are explained in Sect. 4. Section 5 presents the proposed effective information retrieval model. This retrieval model is based on learning the user's metadata, combining TPO information (Time, Position, Occasion) obtained from the user's location using GPS techniques, which facilitates arriving at a reduced data set intelligently. Finally, conclusions are presented in Sect. 6.
We are awash in a flood of data: every day we create about 2.5 quintillion bytes of data. This big data comes from everywhere and everyone: machines such as sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few [7, 8].
Big data refers to data sets or combinations of data sets whose size (volume), complexity (variability), and rate of growth (velocity) make them difficult to capture, manage, process or analyse with conventional technologies and tools, such as relational databases and desktop statistics or visualization packages, within the time necessary to make them useful [9–11]. The size used to determine whether a particular data set is considered big data is not firmly defined and continues to change over time, and the rate of growth of big data is expected to increase for the foreseeable future [40]. When big data is effectively and efficiently captured, processed and analysed, it influences the way of doing business and drives real business value. Big data can support value creation for organizations as follows:
1. Creating transparency by making big data openly available for business and
functional analysis (quality, lower costs, reduce time to market).
2. Supporting experimental analysis in individual locations that can test decisions or
approaches, such as specific market programs.
3. Assisting, based on customer information, in defining market segmentation at more
narrow levels.
4. Supporting real-time analysis and decisions based on sophisticated analytics applied to data sets from customers and embedded sensors.
5. Facilitating computer-assisted innovation in products based on embedded product
sensors indicating customer responses [12, 13, 28].
2. Velocity: velocity measures the speed of data creation, streaming, and aggregation, that is, the speed at which data moves from endpoints into processing and storage. E-commerce has rapidly increased the speed and richness of data used for different business transactions (for example, web-site clicks). Data velocity management is much more than a bandwidth issue; it is also an ingest (extract-transform-load) issue [17–19].
3. Variety: variety is a measure of the richness of the data representation: text, images, video and audio. From an analytic perspective, it is probably the biggest obstacle to effectively using large volumes of data [21, 22]. Incompatible data formats, non-aligned data structures, and inconsistent data semantics represent significant challenges that can lead to analytic sprawl [23, 24].
4. Complexity: complexity measures the degree of interconnectedness (possibly very large) and interdependence in big data structures, such that a small change (or combination of small changes) in one or a few elements can yield very large changes, or a small change that ripples across or cascades through the system and substantially affects its behaviour, or no change at all [25, 27, 29].
• Map Phase: divides the workload into smaller sub-workloads and assigns tasks to the Mapper, which processes each unit block of data. The output of the Mapper is a sorted list of (key, value) pairs. This list is passed (a step also called shuffling) to the next phase.
• Reduce Phase: analyzes and merges the input to produce the final output, which is written to the HDFS in the cluster [39] (a minimal sketch of these phases is given below).
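A minimal word-count illustration of the two phases, with the intermediate shuffle, might look as follows; the in-memory Python functions stand in for the distributed Mapper and Reducer tasks and are illustrative only.

```python
from collections import defaultdict
from itertools import chain

def map_phase(block):
    """Mapper: emit a sorted list of (key, value) pairs for one unit block of data."""
    return sorted((word, 1) for word in block.split())

def shuffle(mapped_lists):
    """Shuffling: group values by key before they are handed to the reducers."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_lists):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: merge the grouped values into the final output."""
    return {key: sum(values) for key, values in groups.items()}

blocks = ["big data big", "data velocity data"]   # each string stands in for an HDFS block
print(reduce_phase(shuffle(map(map_phase, blocks))))
# {'big': 2, 'data': 3, 'velocity': 1}
```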
documents to a given query are the basis for the document rankings that are now a
familiar part of IR systems [46].
Generally, there are two major categories of IR technology and research: semantic
and statistical. Semantic approaches attempt to implement some degree of syntactic and
semantic analysis [47, 48]. In statistical approaches, the documents that are retrieved or
that are highly ranked are those that match the query most closely in terms of some
statistical measure. Statistical approaches fall into a number of categories: Boolean,
vector space, and probabilistic. Statistical approaches break documents and queries into
terms. These terms are the population that is counted and measured statistically [20].
A major challenge for IT researchers and practitioners is that the growth rate of data is very fast and exceeds the ability to handle and analyse data effectively. Many issues and challenges have been raised regarding the retrieval of these data.
4.1 Issues
There are many issues raised when dealing with retrieving big data. The three most fundamental concern storage, management, and processing [26]. Each has its own technical problems and challenges that need to be dealt with.
1. Storage Issues: previously, the quantity of data exploded each time a new storage medium was introduced. Nowadays, in contrast, the most recent explosion, driven by social networks and media, has evolved without the introduction of any new storage medium. Moreover, data is created by everyone and everything, not only by professionals as before.
2. Management Issues: management may be the most difficult big data issue to deal with. This is because the sources of these data vary temporally and spatially, by format, and by method of collection, and not all of these data sources have adequate metadata descriptions. Actually, there is no perfect big data management solution until now, and this raises an important research challenge.
3. Processing Issues: when big data needs to be processed entirely in an effective manner, extensive parallel processing and new analytical algorithms are needed in order to provide timely and actionable information [49–51].
4.2 Challenges
Retrieving information from big data faces many challenges regarding its dynamic design
and analytics. Under each challenge there are many issues and concerns [1, 26, 28].
• Design
1. Quality versus Quantity: as users acquire and gain access to more data (quantity), they often want even more. On the other hand, a big data user may be more concerned with quality, which means not having all the data available, but having a (very) large quantity of high-quality data that can be used to draw specific and high-value conclusions. The level of precision that the user requires is important in resolving this issue.
2. Structured versus Unstructured Data: translation between structured and unstructured data suitable for analytics can delay end-to-end processing performance. The emergence of non-relational, distributed, analytics-oriented databases such as NoSQL stores, MongoDB, and SciDB provides dynamic flexibility in representing and organizing information.
3. Data Ownership: data ownership presents a critical and ongoing challenge, particularly in social media. While huge amounts of social media data reside on the servers of Facebook and Twitter, for example, the data is not really owned by them, although they may claim ownership because of residency. Actually, the owners of the pages or accounts believe they own the data, and this is a dichotomy which needs to be resolved by law.
4. Security: in certain domains, there is a fear that certain organizations will know too much about individuals as more data is accumulated about them. A key research problem is to randomize personal data within a large data set sufficiently to ensure privacy. Perhaps the biggest threat to personal security is the unregulated accumulation of data by numerous social media companies. Clearly, some big data must be secured with respect to privacy and security laws and regulations [26, 52].
4.3 Analysis
1. Scale: managing large and rapidly increasing volumes of data has been a challenging issue for many decades. A critical issue is whether or not an analytic process scales as the data set increases by orders of magnitude. Data volume is scaling faster than compute resources, and CPU speeds are static. Over the last five years processor technology has made a dramatic shift, and there has also been a move towards cloud computing, which now aggregates multiple disparate workloads with varying performance goals [1].
2. Timeliness: the flip side of size is speed. The larger the data set to be processed, the longer it takes to analyze. A system designed to deal effectively with size is likely also to process a given size of data set faster. There are many situations in which the result of the analysis is required immediately, and it is obviously impractical to scan the entire data set to find suitable elements. Instead, index structures are created in advance to permit finding qualifying elements quickly [1, 28].
3. Privacy: the privacy of data is another huge concern, and one that increases in the context of Big Data. There is great public fear regarding the inappropriate use of personal data, particularly through the linking of data from multiple sources. Managing privacy is effectively both a technical and a sociological problem, which must be addressed jointly from both perspectives to realize the promise of big data [54]. The wealth of individual-level information that Google, Facebook, and some mobile phone and credit card companies would jointly hold, if they ever were to pool their information, is in itself a concern [32]. Another privacy issue arises with big data recommendation systems that use the user's location for better prediction. This affects the user's privacy by tracking his location and may cause many critical problems for the user, although in some systems the user himself is the one who publishes his location on a social network site like Foursquare, Google Latitude or Facebook Places [53, 55].
The proposed retrieval model is based on the idea of using the user's metadata to filter the results and return more valuable responses to the user. To search contents on the Internet, search engines should make use of information (metadata) such as the current time, the user's location, common occasions, and so on. We propose an information retrieval model that effectively uses both the user's query and a database of metadata based on (Time, Position, Occasion).
In this model, the query part reduces the user's information overload by choosing the relevant metadata, and the user can re-sort the matching results. Then, in the matching part, the search engine compares the user's metadata with the database's metadata, along with the user's query, to judge whether the two sets of metadata match. The set of matching functions should be sufficient to compare any user's metadata with the database's metadata; matching functions compare values of the user's metadata and the database's metadata. When a lot of metadata is stored in the user's storage, a mechanism for automatically selecting metadata is desirable. These metadata can be gathered using techniques related to GPS (Global Positioning System) to determine the exact location of the user. After that, the results are ranked according to some scoring method, using static or dynamic scoring functions. Finally, the response is sent to the user (Fig. 2).
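As an illustration of the matching part, a toy TPO matching function might compare a user's (Time, Position, Occasion) metadata with a document's metadata as sketched below; the field names, the haversine distance on GPS coordinates, and the tolerance parameters (max_hours, max_km) are our own assumptions, not part of the proposed model's specification.

```python
from datetime import datetime
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS coordinates, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

def tpo_match(user_meta, doc_meta, max_hours=3, max_km=5):
    """Judge whether user and document metadata match on Time, Position, Occasion."""
    dt = abs((user_meta["time"] - doc_meta["time"]).total_seconds()) / 3600
    dist = haversine_km(*user_meta["position"], *doc_meta["position"])
    return (dt <= max_hours and dist <= max_km
            and user_meta["occasion"] == doc_meta["occasion"])

# toy usage: a lunchtime query near Thessaloniki matched against one document
user = {"time": datetime(2016, 10, 23, 12), "position": (40.64, 22.94), "occasion": "lunch"}
doc  = {"time": datetime(2016, 10, 23, 13), "position": (40.63, 22.95), "occasion": "lunch"}
print(tpo_match(user, doc))  # True
```

Documents passing such a match would then be ranked by the static or dynamic scoring functions mentioned above.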
This retrieval model solves many of the issues of traditional retrieval models: it reduces the storage size, allows the scaling part to be done more effectively and efficiently, and increases the speed of returning results to the user.
6 Conclusion
This paper discussed the importance of an efficient retrieval model for huge big data and proposed an effective retrieval model.
The system framework was implemented to automate the process of a physical search on the web. As huge data comprises many formats, it has become mandatory to perform intelligent automation and to devise methods of retrieving information that is relevant and fast with respect to a user's metadata.
The proposed model shows the process of providing more relevant results for the user by considering the user's metadata along with the user's query information. It is capable of narrowing results to the user's needs and preferences and, based on them, considering the user's current location.
The real-life implementation has demonstrated that analysing a user query along with his/her metadata makes the retrieval system highly effective.
References
1. Agrawal, D., Bernstein, P., Davidson, S.: Challenges and Opportunities with Big Data.
A community white paper developed by leading researchers across the United States, p. 17
(2011)
2. Arai, A., Fujikawa, K., Sunahara, H.: A proposal of information retrieval method based on
TPO metadata (2009)
3. Bakshi, K.: Considerations for Big Data: Architecture and Approach (2012)
4. Begoli, E., Horey, J.: Design principles for effective knowledge discovery from big data. In:
Joint Working Conference on Software Architecture & 6th European Conference on
Software Architecture, p. 4 (2012)
5. Big Data for Development: Challenges & Opportunities. Global Pluse, 47 (2012)
6. Big Data Survey. Giga Spaces, 5 (2011)
7. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global
Institute, 156 (2011)
8. Bindra, A., Ashish Bindra, S., Ashish Bindra, K.: Distributed big advertiser data mining. In:
IEEE 12th International Conference on Data Mining Workshops, p. 1 (2012)
9. Borkar, V., Carey, M.J., Li, C.: Inside “Big Data Management”: ogres, onions, or parfaits?
In: EDBT/ICDT 2012 Joint Conference, Berlin, Germany, p. 12 (2012)
10. Cavoukian, A.: Privacy, security, big data–yes, you can! In: Information and Privacy
Commissioner Ontario, Canada, p. 26 (2013)
11. Chandramouli, B., Goldstein, J., Duan, S.: Temporal analytics on big data for web
advertising. In: IEEE 28th International Conference on Data Engineering, p. 12 (2012)
12. Chen, H., Chiang, R.H.L., Storey, V.C.: Business intelligence and analytics: from big data to
big impact. Bus. Intell. Res., 25 (2012)
13. Clement, M., Sokol, L., Gary, L.: Robust decision engineering: collaborative big data and
its application to international development/aid. In: 8th International Conference on
Collaborative Computing: Networking, Applications and Worksharing, p. 8 (2012)
14. CS4103 Distributed Systems Coursework Part 1: Big Data (2012)
15. Demchenko, Y., Zhao, Z., Grosso, P., Wibisono, A., de Laat, C.: Addressing big data
challenges for scientific data infrastructure. In: IEEE 4th International Conference on Cloud
Computing Technology and Science, p. 4 (2012)
16. Distributed Systems Coursework Part 1: Big Data (2012). http://www.luisramalho.com/
wp-content/uploads/2012/04/bigdata.pdf
17. Dumbill, E.: Making sense of big data. 2BD, 2 (2013)
18. Geron, T.: Live: Facebook Launches Graph Search, A Social Search Engine, With Bing
Partnership (2013). http://www.forbes.com/sites/tomiogeron/2013/01/15/live-facebook-
announces-graph-search/
19. Greengrass, E.D.: Information Retrieval: A Survey (2000)
20. Guo, Z., Wang, J.: Information Retrieval from Large Data Sets via Multiple-winners-take-all
(2011)
21. Han, X., Tian, L., Yoon, M., Lee, M.: A big data model supporting information
recommendation in social networks. In: Second International Conference on Cloud and
Green Computing, p. 4 (2012)
22. HPCC Systems (n.d.). HPCC Systems: Models for Big Data. White paper, 17
23. IBM big data success stories. IBM Corporation, 76 (2011)
24. Intel IT Center. Big Data Analytics. Intel’s IT Manager Survey on How Organizations are
Using Big Data, 27 (2012)
25. Jain, M., Singh, S.K.: A survey on: content based image retrieval systems using clustering
techniques for large data sets. Int. J. Manag. Inf. Technol. (IJMIT) 3(4), 17 (2011)
26. Ji, C., Li, Y., Qiu, W., Awada, U., Li, K.: Big data processing in cloud computing environments. In: International Symposium on Pervasive Systems, Algorithms and Networks, p. 7 (2012)
27. Borrero, J.D., Gualda, E.: Crawling big data in a new frontier for socioeconomic research:
testing with social tagging. J. Spat. Organ. Dyn. - Discussion Papers Number 12, 23 (2012)
28. Kaisler, S., Armour, F., Alberto Espinosa, J., Money, W.: Big data: issues and challenges
moving forward. In: 46th Hawaii International Conference on System Sciences, p. 10 (2013)
29. Kejariwal, A.: Big data challenges a program optimization perspective. In: Second
International Conference on Cloud and Green Computing, p. 6 (2012)
30. Kirkpatrick, R.: BIG data for development. BD3, 1(1), 2 (2013)
31. Kraska, T.: Finding the Needle in the Big Data Systems Haystack, p. 3. Brown University
(2013)
32. Laurila, J.K., Aad, I., Perez, D.J.: The Mobile Data Challenge: Big Data for Mobile Computing Research (n.d.)
33. Lioma, C.: Big Data Challenges for Information Retrieval. University of Copenhagen-
Department of Computer Science, p. 12 (2012)
34. Logothetis, D., Yocum, K.: Data Indexing for Stateful, Large-scale Data Processing (2009)
35. Lumley, T., Rice, K.: Storing and retrieving large data. UW Biostatistics, p. 18 (2009)
36. Meij, E.: Large-scale Data Processing for Information Retrieval #nlhug, 12 April 2012.
http://www.slideshare.net/edgar.meij/largescale-data-processing-for-information-retrieval-
nlhug. (Retrieved)
37. Miller, S.: How “Big Data” will change your life….. Pew Research Center’s Internet &
American Life Project, p. 29 (2012)
38. Nambiar, U.: Answering Imprecise Queries Over Autonomous Databases (2005). http://
rakaposhi.eas.asu.edu/ullas-thesis.pdf. (Retrieved)
39. Navint Enterprise. Why is BIG Data Important?. A Navint Partners White Paper, 5 (2012).
www.navint.com. (Retrieved)
40. Oracle Information Architecture: An Architect’s Guide to Big Data. An Oracle White Paper
in Enterprise Architecture, 25 (2012)
41. Oracle. Oracle: Big data for Enterprise. Oracle Enterprise, 16 (2012)
42. Oracle. Combining big data tools with traditional data management offers enterprises the
complete view. White paper: Integrate for Insight, 4 (2012)
43. Part III: IBM’s strategy for big data and analytics. IBM Corporation, 5 (2012)
44. Bennett, P.N., El-Arini, K.: Enriching Information Retrieval. In: SIGIR Workshop Report,
p. 6 (2011)
45. Paz-Trillo, C., Wassermann, R., Braga, P.P.: An Information Retrieval application using Ontologies (2005). http://www.ime.usp.br/~rmcobe/onair/files/jsbc_onair.pdf
46. Provost, F., Fawcett, T.: DATA science and its relationship to big data and data-driven
decision making. BD51 1(1), 9 (2013)
47. Rabinowitz, J.: Indexing arbitrary data with SWISH-E. In: The Proceedings of the 2004
USENIX Technical Conference, p. 7 (2004)
48. Recommender system (2013). http://en.wikipedia.org/wiki/Recommender_system
49. Rouse, M.: What is Graph Search? (2013). http://whatis.techtarget.com/definition/Graph-
Search
50. Smith, M., Szongott, S., Henne, B., Voigt, G.: Big Data Privacy Issues in Public Social
Media (2013)
51. Sun Yanhou, Y.: Big data in enterprise challenges & opportunities. Software and Service
Group, p. 15 (2011)
52. Venkatraman, S., Kamatkar, S.J.: Intelligent information retrieval and recommender system
framework. Int. J. Future Comput. Commun. 2(2), 5 (2013)
53. Zhu, J.: Data Modeling for Big Data (2011)
54. Zhou, B., Yao, Y.: Evaluating Information Retrieval System Performance Based on User Preference. http://www2.cs.uregina.ca/~zhou200b/4-zhou.pdf
55. Zikopoulos, P., Deustch, T.: The big deal about big data. IBM Corporation, 43 (2012)
Adaptive Elitist Differential Evolution Extreme
Learning Machines on Big Data: Intelligent
Recognition of Invasive Species
Abstract. One of the direct consequences of climate change lies in the spread of invasive species, which constitute a serious and rapidly worsening threat to ecology, the preservation of natural biodiversity and the protection of flora and fauna; they can even be a potential threat to human health. These species do not appear to have serious morphological variations, despite their strong biological differences, which makes their identification often quite difficult. The need to protect the environment and safeguard public health requires the development of advanced methods for early and accurate identification of some particularly dangerous invasive species, in order to plan and apply specific and effective management measures. The aim of this study is to create an advanced computer vision system for the automatic recognition of invasive or other unknown species, based on their phenotypes. More specifically, this research proposes an innovative and very effective Extreme Learning Machine (ELM) model, which is optimized by the Adaptive Elitist Differential Evolution algorithm (AEDE). The AEDE is an improved version of the differential evolution (DE) algorithm and is suitable for big data problems. Feature selection is done using deep-learning Convolutional Neural Networks. A geolocation system is used to detect invasive species by comparing them with the local species of the region under research.
1 Introduction
moving is searching for colder climate conditions. A typical reason is that their physical environment no longer meets the temperature range in which they can survive; they also follow different plant species or organisms which migrate to cooler habitats. Although not all migrating species are harmful, the preparation of proper invasive species management plans (depending on their risk profile) is imposed both preemptively and institutionally. Their control and potential eradication should be a process parallel to the restoration of the ecosystems affected by them.
The identification and classification of invasive species, exclusively with the use of phenotypic markers, is an extremely difficult and dangerous process, as in this case there might be neither major morphological differences nor significant similarities capable of reflecting the affinity (or not) of the organisms (the species problem) [2]. However, it is a necessary and extremely important process in the overall strategy for addressing alien species. Specifically, the above method is a key preventive mechanism, as it can be used easily on low-cost devices, e.g. smartphones, without significant expense, unlike genetic identification methods such as DNA barcoding or methods using comparison with biochemical or molecular markers [3]. Moreover, it can be used by personnel without specialized knowledge, such as farmers or breeders.
Fig. 1. Example images of four different fish species. All of them have similar visual appearance
despite being distinct species. (Images taken by J.E. Randall) [30]
Fig. 2. The three different capture conditions: “controlled”, “in-situ” and “out-of-the-water”.
Significant variation in appearance due to the changed imaging conditions (session variation) is
evident. Ground truth bounding boxes are shown in red [32].
The dimensions of the pictures used in this research were 81 × 42, and 96 filters were applied (feature maps). Each picture was divided into frames of 9 × 7 pixels (in order to avoid 1 × 1-pixel frames, which would lead to an intractably large feature vector). To perform feature extraction, the following procedure was used:
The dimensions of the convolution layer were 81 × 42 × 96, which means that it contained 81 × 42 = 3402 neurons for each one of the 96 filters used, leading to a total number of neurons equal to 81 × 42 × 96 = 326,592. Each of these neurons had 9 × 7 × 3 = 189 weights plus a bias, i.e. 9 × 7 × 3 + 1 = 190 parameters (the factor 3 is due to the RGB color model, which requires one channel per color). This gives a total of 326,592 × 190 = 62,052,480 weights and biases, an intractable number of parameters to be learned.
In order to avoid this huge problem, we required that the neurons belonging to the same filter be assigned the same weights. This means that a filter with 3402 neurons has only 190 weights instead of 3402 × 190 = 646,380. In this way the number of weights plus biases was reduced significantly, to 96 × (9 × 7 × 3 + 1) = 18,240. This assumption is not arbitrary; it is based on the rationale that the application of a filter can yield useful results (extracted characteristics) regardless of the position at which it is applied [33, 34].
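The parameter counts above can be verified with a few lines of arithmetic:

```python
H, W, F = 81, 42, 96          # feature-map height, width, number of filters
kH, kW, C = 9, 7, 3           # kernel height, width, RGB channels

neurons = H * W * F                        # 326,592 neurons in the convolution layer
params_per_neuron = kH * kW * C + 1        # 189 weights + 1 bias = 190 parameters

print(neurons)                             # 326592
print(neurons * params_per_neuron)         # 62052480  (no weight sharing)
print(F * params_per_neuron)               # 18240     (with weight sharing per filter)
```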
A determining factor for delivering high precision and low uncertainty in machine
vision metrology is the resolution of the acquired image. As a gauge, the smallest unit
of measurement in a machine vision system is the single pixel. As with any mea-
surement system, in order to make a repeatable and reliable measurement one must use
a gauge where the smallest measurement unit (as a general rule of thumb) is one tenth
of the required measurement tolerance band [6].
parameters of hidden nodes at random, before they see the training data vectors. They
are extremely fast and effective and they are capable of handling a wide range of
activation functions (e.g. stopping criterion, learning rate and learning epochs [35]. The
The output of an ELM with a training set $\{(x_i, t_i)\}_{i=1}^{N}$ consisting of $N$ discrete samples ($x_i \in \mathbb{R}^n$ and $t_i \in \mathbb{R}^m$), with an activation function $g(x)$ and $l$ hidden nodes, is given by the following function (1) [35]:

$$ t_i = \sum_{j=1}^{l} \beta_j \, g(\omega_j \cdot x_i + b_j), \qquad i = 1, 2, \ldots, N, \qquad (1) $$

where $\beta_j = [\beta_{j1}, \beta_{j2}, \ldots, \beta_{jm}]^T$ is the weight vector connecting the $j$-th hidden neuron and the output neurons, $\omega_j = [\omega_{j1}, \omega_{j2}, \ldots, \omega_{jn}]^T$ is the weight vector connecting the input neurons and the $j$-th hidden neuron, $b_j$ is the bias of the $j$-th hidden neuron, and $\omega_j \cdot x_i$ denotes the inner product of $\omega_j$ and $x_i$. Function (1) can also be written as follows:

$$ H\beta = T, \qquad (2) $$

where

$$ H(\omega_j, b_j, x_i) = \begin{bmatrix} g(\omega_1 \cdot x_1 + b_1) & \cdots & g(\omega_l \cdot x_1 + b_l) \\ \vdots & \ddots & \vdots \\ g(\omega_1 \cdot x_N + b_1) & \cdots & g(\omega_l \cdot x_N + b_l) \end{bmatrix}_{N \times l}. \qquad (3) $$

$H$ is called the hidden layer output matrix of the network. Before training, the input weights matrix $\omega$ and the hidden biases vector $b$ are drawn at random from the interval $[-1, 1]$. Then the hidden layer output matrix $H$ is computed by applying the activation function to the training set, $H = g(\omega x + b)$. Finally, the output weights matrix $\beta$ is estimated as $\beta = H^{+} T$, where $H^{+}$ is the Moore–Penrose generalized inverse of the matrix $H$ and $T = [t_1, t_2, \ldots, t_N]^T$ [35].
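A minimal NumPy sketch of the training procedure defined by Eqs. (1)-(3) is given below; the function names are ours, and tanh merely stands in for the unspecified activation g(x).

import numpy as np

def train_elm(X, T, hidden_nodes, activation=np.tanh, seed=None):
    # X: (N, n) input samples, T: (N, m) targets.
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    # Input weights and biases are drawn at random from [-1, 1] and never trained.
    W = rng.uniform(-1.0, 1.0, size=(n, hidden_nodes))
    b = rng.uniform(-1.0, 1.0, size=hidden_nodes)
    # Hidden layer output matrix H (N x l), Eq. (3).
    H = activation(X @ W + b)
    # Output weights via the Moore-Penrose pseudoinverse: beta = H^+ T.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def predict_elm(X, W, b, beta, activation=np.tanh):
    return activation(X @ W + b) @ beta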
speed of the DE algorithm. The new adaptive mutation scheme of the DE uses two mutation operators. The first one is "rand/1", which aims to preserve the diversity of the population and to prevent it from getting stuck in a local optimum. The second is "current-to-best/1", which aims to accelerate the convergence of the population by guiding it toward the best individual. The new selection mechanism, on the other hand, works as follows: first, the children population C, consisting of the trial vectors, is combined with the parent population P of target vectors to create a combined population Q. Then the NP best individuals are chosen from Q to form the population of the next generation. In this way the best individuals of the whole population are always carried over to the next generation, which helps the algorithm achieve a better convergence rate [30]. The elitist selection operator is presented in Algorithm 1; a sketch of both mutation operators and of the selection step follows below.
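The following Python sketch illustrates the two mutation operators and the elitist selection step described above; the function names and the DE scale factor F are ours, and the fitness function is assumed to be minimized.

import numpy as np

def mutate_rand_1(pop, F, rng):
    # "rand/1": combine three distinct random individuals to preserve diversity.
    r1, r2, r3 = pop[rng.choice(len(pop), size=3, replace=False)]
    return r1 + F * (r2 - r3)

def mutate_current_to_best_1(pop, i, best, F, rng):
    # "current-to-best/1": pull individual i toward the best individual.
    r1, r2 = pop[rng.choice(len(pop), size=2, replace=False)]
    return pop[i] + F * (best - pop[i]) + F * (r1 - r2)

def elitist_selection(parents, children, fitness, NP):
    # Merge the parent population P and the trial vectors C into Q,
    # then keep the NP fittest individuals (lower fitness is better).
    Q = np.vstack([parents, children])
    scores = np.array([fitness(q) for q in Q])
    return Q[np.argsort(scores)[:NP]]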
The algorithmic approach of the proposed hybrid scheme includes, in the first stage, the feature extraction process using a CNN, in order to extract features from each photo of the fish dataset. In the second stage, these features are fed into the proposed ELM, which is optimized by the AEDE approach, to yield the maximum classification accuracy in identifying the type of fish detected by the machine vision system. Once the type has been identified, the coordinates obtained from the GPS are mapped to the country to which they belong. A crosscheck is then performed to determine whether the identified type is indigenous to this country; otherwise it is labeled as an invasive species. The lists of indigenous and invasive species were exported from the Invasive Species Compendium (http://www.cabi.org/isc/) [38], the most authoritative and comprehensive database on the subject worldwide. The algorithm for labeling a species as invasive is presented below (Fig. 3):
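A minimal Python sketch of this crosscheck logic is given below; classify, country_of and indigenous_species are hypothetical placeholders standing in for the CNN/AEDE-ELM classifier, the GPS-to-country mapping (e.g. built on the country data of [37]) and the species lists exported from [38].

def flag_invasive(image, coordinates, classify, country_of, indigenous_species):
    # Species label from the CNN-feature + AEDE-ELM classifier (placeholder).
    species = classify(image)
    # GPS coordinates -> country of capture (placeholder).
    country = country_of(coordinates)
    # A species absent from the country's indigenous list is labeled invasive.
    return species, species not in indigenous_species[country]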
It is particularly encouraging that the proposed hybrid system manages to solve a highly complex computer vision problem with high accuracy, despite the fact that the dataset used for training and evaluating the proposed AEDE-ELM is itself highly complex, given the similarity between the tested species and the specificities introduced by the CNN feature extraction process. It is also notable that, in the categorization of the tested species, the accuracy was very high under all three capture conditions ("controlled", "in-situ", "out-of-the-water"). This confirms the generalization capability of the proposed system.
Table 1 presents the detailed values of the predictive power of the AEDE-ELM under a 10-fold cross-validation approach (10-FoldCV), together with the corresponding results of the competing algorithms: Differential Evolution ELM (DE-ELM), Artificial Neural Network (ANN) and Support Vector Machine (SVM).
The Precision measure shows what percentage of positive predictions were correct, whereas Recall measures what percentage of actual positive events were correctly predicted. The F-Score can be interpreted as a weighted average (harmonic mean) of precision and recall; it therefore takes both false positives and false negatives into account. Intuitively it is not as easy to interpret as accuracy, but the F-Score is usually more informative, and it works best when, as in this case, false positives and false negatives have similar cost. Finally, the ROC curve relates in a direct and natural way to the cost/benefit analysis of diagnostic decision making. This comparison generates encouraging expectations for the AEDE-ELM as a robust classification model suitable for difficult problems. The method was tested with great success in the control and automatic recognition of "Lagocephalus Sceleratus", a Mediterranean invasive fish species highly dangerous to public health (Table 2).
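Assuming the standard definitions of these measures (the paper does not spell them out), a minimal sketch of their computation from confusion-matrix counts:

def precision_recall_f1(tp, fp, fn):
    # Standard definitions (assumed): precision = TP/(TP+FP),
    # recall = TP/(TP+FN), F1 = harmonic mean of the two.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1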
A future extension would be to incorporate into the system other methods for determining similar characteristics, such as Representational Similarity Analysis, Local Similarity Analysis, Isoperimetry and Gaussian Analysis. Finally, it would be very important to test the performance of Deep Extreme Learning Machines on this specific case.
References
1. Rahel, F., Olden, J.D.: Assessing the effects of climate change on aquatic invasive species.
Soc. Conserv. Biol. 22(3), 521–533 (2008)
2. Miller, W.: The structure of species, outcomes of speciation and the species problem: ideas for paleobiology. Palaeogeogr. Palaeoclimatol. Palaeoecol. 176, 1–10 (2001)
3. Demertzis, K., Iliadis, L.: Intelligent bio-inspired detection of food borne pathogen by DNA
barcodes: the case of invasive fish species Lagocephalus Sceleratus. Eng. Appl. Neural
Netw. 517, 89–99 (2015). doi:10.1007/978-3-319-23983-5_9
4. Hornberg, A.: Handbook of Machine Vision, p. 709. Wiley, Hoboken (2006). ISBN:
978-3-527-40584-8
5. Graves, M., Batchelor, B.G.: Machine Vision for the Inspection of Natural Products, p. 5.
Springer, London (2003). ISBN: 978-1-85233-525-0
6. Carsten, S., Ulrich, M., Wiedemann, C.: Machine Vision Algorithms and Applications, p. 1.
Wiley-VCH, Weinheim (2008). ISBN: 978-3-527-40734-7
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
8. Svellingen, C., Totland, B., White, D., Øvredal, T.: Automatic species recognition, length measurement and weight determination using the CatchMeter computer vision system (2006)
9. Cabreira, A.G., Tripode, M., Madirolas, A.: Artificial neural networks for fish-species
identification. ICES J. Mar. Sci. 66, 1119–1129 (2009)
10. Rova, A., Mori, G., Dill, L.M.: One fish, two fish, butterfish, trumpeter: recognizing fish in
underwater video. In: Conference on Machine Vision Applications, pp. 404–407 (2007)
11. Lee, D.J., Schoenberger, R., Shiozawa, D., Xu, X., Zhan, P.: Contour matching for a fish
recognition and migration monitoring system. Stud. Comput. Intell. 122, 183–207 (2008)
12. Ogunlana, S.O., Olabode, O., Oluwadare, S.A.A., Iwasokun, G.B.: Fish classification using
SVM. IEEE Afr. J. Comput. ICT 8(2), 75–82 (2015)
13. Mutasem, K.A., Khairuddin, B.O., Shahrulazman, N., Ibrahim, A.: Fish recognition based
on the combination between robust features selection, image segmentation and geometrical
parameters techniques using artificial neural network and decision tree. J. Comput. Sci. Inf.
Secur. 6(2), 215–221 (2009)
14. Zhu, Q.Y., Qin, A.K., Suganthan, P.N., Huang, G.B.: Evolutionary extreme learning
machine. Pattern Recogn. 38, 1759–1763 (2005)
15. Qu, Y., Shen, Q., Parthaláin, N.M., Wu, W.: Extreme learning machine for mammographic risk analysis. In: UK Workshop on Computational Intelligence, pp. 1–5 (2010)
16. Sridevi, N., Subashini, P.: Combining Zernike moments with regional features for classification of handwritten ancient Tamil scripts using extreme learning machine. In: IEEE International Conference on Emerging Trends in Computing, Communication and Nanotechnology, pp. 158–162 (2013)
17. Wang, D.D., Wang, R., Yan, H.: Fast prediction of protein-protein interaction sites based on
extreme learning machines. Neurocomputing 77, 258–266 (2014)
Adaptive Elitist Differential Evolution ELM on Big Data 345
18. Bazi, Y., Alajlan, N., Melgani, F., AlHichri, H., Malek, S., Yager, R.R.: Differential evolution extreme learning machine for the classification of hyperspectral images. IEEE Geosci. Remote Sens. Lett. 11, 1066–1070 (2014)
19. Zhao, X.: A perturbed particle swarm algorithm for numerical optimization. Appl. Soft
Comput. 10(1), 119–124 (2010). doi:10.1016/j.asoc.2009.06.010
20. Li, X., Yin, M.: Application of differential evolution algorithm on self-potential data.
PLoS ONE 7(12), e51199 (2012). doi:10.1371/journal.pone.0051199
21. Gandomi, A.H., Yang, X.-S., Alavi, A.H.: Cuckoo search algorithm: a metaheuristic
approach to solve structural optimization problems. Eng. Comput. 29(1), 17–35 (2013)
22. Wang, G.-G., Guo, L., Duan, H., Wang, H.: A new improved firefly algorithm for global
numerical optimization. J. Comput. Theor. Nanosci. 11(2), 477–485 (2014). doi:10.1166/
jctn.2014.3383
23. Mirjalili, S., Mirjalili, S.M., Lewis, A.: Grey wolf optimizer. Adv. Eng. Softw. 69, 46–61
(2014). doi:10.1016/j.advengsoft.2013.12.007
24. Geem, Z.W., Kim, J.H., Loganathan, G.V.: A new heuristic optimization algorithm:
harmony search. Simulation 76(2), 60–68 (2001). doi:10.1177/003754970107600201
25. Simon, D.: Biogeography-based optimization. IEEE Trans. Evolut. Comput. 12(6), 702–713
(2008). doi:10.1109/TEVC.2008.919004
26. Li, X., Zhang, J., Yin, M.: Animal migration optimization: an optimization algorithm
inspired by animal migration behavior. Neural Comput. Appl. 24(7–8), 1867–1877 (2014).
doi:10.1007/s00521-013-1433-8
27. Mirjalili, S., Mohd Hashim, S.Z., Moradian Sardroudi, H.: Training feedforward neural
networks using hybrid particle swarm optimization and gravitational search algorithm. Appl.
Math. Comput. 218(22), 11125–11137 (2012). doi:10.1016/j.amc.2012.04.069
28. Zhang, Z., Zhang, N., Feng, Z.: Multi-satellite control resource scheduling based on ant
colony optimization. Expert Syst. Appl. 41(6), 2816–2823 (2014). doi:10.1016/j.eswa.2013.
10.014
29. Gandomi, A.H.: Interior search algorithm (ISA): a novel approach for global optimization.
ISA Trans. 53(4), 1168–1183 (2014). doi:10.1016/j.isatra.2014.03.018
30. Ho-Huu, V., Nguyen-Thoi, T., Vo-Duy, T., Nguyen-Trang, T.: An adaptive elitist
differential evolution for optimization of truss structures with discrete design variables.
Comput. Struct. 165, 59–75 (2016)
31. Bluche, T., Ney, H., Kermorvant, C.: Feature extraction with convolutional neural networks
for handwritten word recognition. In: 12th International Conference on Document Analysis
and Recognition, pp. 285–289. IEEE (2013)
32. Anantharajah, K., Ge, Z., McCool, C., Denman, S., Fookes, C., Corke, P., Tjondronegoro,
D., Sridharan, S.: Local inter-session variability modelling for object classification. In: IEEE
Winter Conference on Applications of Computer Vision (WACV 2014). Steamboat Springs,
Co., 24–26 March 2014
33. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A., Fei-Fei, L.: ImageNet Large Scale Visual Recognition
Challenge. IJCV (2015). arXiv:1409.0575
34. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: a deep convolutional activation feature for generic visual recognition. CoRR abs/1310.1531 (2013)
35. Cambria, E., Huang, G.-B.: Extreme learning machines. IEEE Intell. Syst. 28, 37–134
(2013)
36. Price, K., Storn, R., Lampinen, J.: Differential Evolution: A Practical Approach to Global Optimization. Springer, Heidelberg (2005). ISBN: 978-3-540-20950-8
37. https://github.com/che0/countries
38. http://www.cabi.org/isc/