2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

Streaming Linear Regression on Spark MLlib and MOA
Barış Akgün
Computer Engineering Department, Istanbul Technical University, Istanbul, Turkey
barisakgun@itu.edu.tr

Şule Gündüz Öğüdücü
Computer Engineering Department, Istanbul Technical University, Istanbul, Turkey
gunduz@cs.itu.edu.tr

Abstract: In recent years, analyzing data streams has attracted considerable attention in different fields of computer science. In this paper, two frameworks, MOA and Spark MLlib, are examined for linear regression on streaming data. The focus is on determining how well the linear regression techniques implemented in these frameworks can model data streams. We also examine the challenges of massive data streams and how MOA and Spark Streaming address them. The experiments show that, although MOA is easier to use than Spark MLlib, Spark MLlib's linear regression performs better on streaming data.

Keywords: Stream Mining; Spark Streaming; Spark MLlib; MOA; Streaming Linear Regression; Data Streams

I. INTRODUCTION

As the world becomes more digital, data are automatically generated by mobile applications, sensor applications, log records, email, Twitter posts, etc. at an increasing rate [1]. Much of this massive data is valuable only at the time it is received; such data are therefore called data streams and have to be analyzed in real time. Data mining can be used to analyze massive volumes of data. A data stream is a real-time, continuous, ordered sequence of items [2]; hence, stream data mining involves extracting knowledge from real-time actions. These real-time actions mostly produce massive, high-rate data. Stream data mining approaches must handle these massive (big) and high-rate data in a very short time.

Linear regression analysis is one of the most widely used techniques in data mining. A linear regression model is not complex to implement and it is an efficient algorithm; therefore, it is a good choice for modelling and predicting the behavior of massive stream data.

Several frameworks have been built for large-scale analysis of evolving data streams, such as Apache Storm [3], IBM InfoSphere Streams [4], and Apache Samza [5]. Two of the widely used frameworks are Massive Online Analysis (MOA) [6] and Spark MLlib [7]. MOA is a software environment for implementing algorithms and running experiments for online learning from evolving data streams [6]. Machine Learning Library (MLlib) is Spark's scalable machine learning library consisting of common learning algorithms [7].

This paper aims at comparing the performance of two widely used frameworks, MOA and Spark MLlib, for linear regression analysis on continuous data streams. Streaming linear regression is implemented on MLlib with Spark Streaming support. Several experiments are carried out, and the frameworks are compared in terms of CPU time cost, supported data set types, usability, fault tolerance and coding standards.

The rest of the paper is organized as follows. In Section II, we present the concept of data streams in order to explain the challenges of analyzing stream data. In Section III, we briefly introduce data stream mining, including the linear regression analysis technique used by the MOA and Spark MLlib frameworks. In Sections IV and V, we discuss the usage of MOA and Spark MLlib for linear regression and show the advantages and disadvantages of these frameworks for streaming linear regression. Section VI presents and analyzes the experimental results. Finally, Section VII concludes the paper.

II. DATA STREAMS

A data stream is a real-time, continuous, ordered sequence of items [2]. Data streams may be created by transactional systems. Such streams are generated when an interaction occurs between data attributes, for example a commercial credit card purchase, market trades, online scoring, or a client request to a web server [8]. The other type of data stream is the machine-generated data stream, which is produced automatically by computer systems without human intervention; examples are GPS data records, sensor data and server performance logs. These kinds of stream data may therefore be big in volume [8].

Data stream analysis techniques have several requirements; the most significant challenges for data stream analysis are the following [6]:

• Items are processed one at a time.

• Only a limited amount of memory may be used.

• The streaming method has to be ready for data analysis at any time, and the arrival rate in the streams may be very fast, which may result in crashing if too many items arrive.

• The volume of a data stream may be very large; nevertheless, the arriving data must be processed in a limited amount of time.
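The first two requirements can be illustrated with a minimal sketch, assuming a hypothetical stream of numeric items. Each item is inspected once and only a constant amount of state is kept, however long the stream is. The names `RunningStats` and `consume` are illustrative and do not come from MOA or Spark.

```python
from typing import Iterable


class RunningStats:
    """Constant-memory summary of a numeric stream (Welford's method)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # One item at a time: the item is used once and never stored.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of the items seen so far.
        return self._m2 / self.count if self.count else 0.0


def consume(stream: Iterable[float]) -> RunningStats:
    """Single pass over the stream; memory use does not grow with its length."""
    stats = RunningStats()
    for item in stream:
        stats.update(item)
    return stats
```

The summary is usable at any point in the stream, which also matches the requirement that a streaming method be ready for analysis at any time.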

ASONAM 2015, Paris, France, August 2015
These requirements show that data stream algorithms must process data that arrive at high speed under very strict space and time constraints. Traditional data mining techniques cannot meet these requirements. New techniques, collectively called data stream mining, were developed to address these challenges.

III. DATA STREAM MINING & LINEAR REGRESSION

Data stream mining is the extraction of patterns and information from a sequence of elements that need to be analyzed online as they arrive. It has become an increasingly important research area in the last decade, since many real-world applications, such as web user behavior analysis, telecommunication connection analysis and sensor data analysis, produce continuous and large data streams. Therefore, many machine learning algorithms have been adapted to this setting, and many techniques have been designed to overcome the challenges of stream data analysis [9].

Linear regression is used for finding linear relationships between variables. Due to their simplicity and low complexity, linear regression models are among the most fundamental and widely used techniques for modelling and predicting the behavior of massive data streams. Linear regression is a statistical method in which a dependent variable $y$ (the target variable) is computed from $p$ independent variables that are assumed to have an influence on the target variable [10]. Given a data set $\{y_i, x_{i1}, \dots, x_{ip}\}_{i=1}^{n}$ of $n$ data points, the formula for the regression of one data point $y_i$ (regressand) is as follows [10]:

$$y_i = \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \dots, n \tag{1}$$

Here $\beta_j$ is the regression coefficient, which can be calculated using the Least Squares approach, $x_{ij}$ (regressor) is the value of the $j$th independent variable and $\varepsilon_i$ is the error term. Linear regression aims at fitting a straight line, called a regression line, through the set of $n$ data points such that the sum of squared residuals is minimized. The error of a prediction for a point is the difference between the value of the point $y_i$ and the predicted value $\hat{y}_i$ (the value on the line). The error function, which measures the deviation of the predicted values from the true values, can be calculated as follows:

$$E(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{2}$$

The best values of the regression coefficients and the error terms can be found by minimizing the error function in Eq. 2. Gradient Descent (GD) is one of the approaches applied to minimize this error function. The GD algorithm starts its search from arbitrary values of the regression coefficients ($\beta$) and the error terms. At each iteration, the algorithm updates the regression coefficients and the error terms so that they yield a lower error than in the previous iteration. This is accomplished by moving in the negative direction of the gradient of the error function [17]. GD uses a learning rate ($\alpha$) parameter, which determines how fast or slow the algorithm updates the regression coefficients towards their optimal values. Although GD is one of the best algorithms for minimizing the error function, it solves the minimization problem using all of the data points.

$$\beta := \beta - \alpha \nabla E(\beta) \tag{3}$$

Here $\nabla E(\beta)$ is the derivative of our error function computed over every data point in the data set, and $\alpha$ is the learning rate parameter.

Stochastic Gradient Descent (SGD) also updates a set of parameters iteratively to minimize an error function, in the same way as GD. On the other hand, SGD uses only one data point to update the parameters in a particular iteration. Since computation time and resource management are among the challenges of stream mining, SGD is more suitable for stream mining algorithms. Both Spark MLlib and MOA use the SGD algorithm for streaming linear regression.

IV. MOA STREAMING LINEAR REGRESSION

Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams [6]. This section briefly introduces the usage of MOA for streaming linear regression. The advantages and disadvantages of MOA streaming linear regression are also discussed in this section.

MOA has a graphical user interface (GUI) for configuring and running tasks. The configuration steps for streaming linear regression on MOA are given below:

• Choose the learner algorithm. Although MOA has different learning algorithms for linear regression, we use the SGD algorithm. We chose SGD because it is the only algorithm Spark MLlib supports and because SGD is one of the best learning methods for linear regression.

• Set the SGD parameters. Its parameters are lambda regularization, learning rate and loss function. The learning rate and the loss function are introduced in Section III. The lambda regularization parameter, like the learning rate parameter, is used to protect the regression against overfitting.

• Define the data set. MOA has many predefined stream generators for training linear regression. MOA is an open source framework, which makes it possible to implement a new stream generator on it. It also supports bi-directional interaction with WEKA [11]; therefore, it accepts the Attribute-Relation File Format (arff) as input for streaming data.

When working with arff files in MOA, two parameters should be set, namely the number of passes (numPasses) and the maximum instances (maxInstances). The number of passes parameter indicates how many passes to make over the data set, whereas the maximum instances parameter sets the maximum number of instances to train on per pass over the data.

MOA is a Java-based framework with a simple coding structure. We added our own arff reader method that listens to the data set. Whenever a new instance arrives, the regression coefficients are automatically updated by the SGD algorithm in our implementation. The data set format for MOA streaming linear regression is the standard arff file format. The features and target labels of the instances in the given data set must be numeric, and the features must not be null. One important disadvantage of MOA is that it has no fault tolerance mechanism. It has to restart the operation in the event of a failure.
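The per-instance update that runs when a new instance arrives can be sketched as follows. This is a minimal illustration of SGD for linear regression with squared loss, a learning rate and an L2 (lambda) penalty; it is not MOA's actual SGD class. The name `OnlineSGDRegressor`, the bias term and the way the penalty is scaled are assumptions made for the example, so the values in the paper's parameter table should not be expected to carry over directly.

```python
from typing import List, Sequence


class OnlineSGDRegressor:
    """Linear regression trained one instance at a time with SGD."""

    def __init__(self, n_features: int, learning_rate: float = 0.1,
                 lambda_reg: float = 0.0) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.lambda_reg = lambda_reg

    def predict(self, x: Sequence[float]) -> float:
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x: Sequence[float], y: float) -> float:
        """Apply one SGD step using a single instance; return its error."""
        error = self.predict(x) - y
        # Gradient of 0.5 * error**2 + 0.5 * lambda * ||w||**2 with respect
        # to each weight. The factor 2 of the plain squared error is
        # absorbed into the learning rate (an assumption of this sketch).
        for j, xj in enumerate(x):
            gradient = error * xj + self.lambda_reg * self.weights[j]
            self.weights[j] -= self.learning_rate * gradient
        self.bias -= self.learning_rate * error  # bias is not regularized
        return error
```

A stream reader would call `update` once per arriving instance, so the memory footprint stays at one weight per feature regardless of how many instances have been seen.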

V. SPARK MLLIB STREAMING LINEAR REGRESSION

Spark MLlib currently supports streaming linear regression using Spark Streaming technology. The regression algorithm runs on each batch of data, so that the model is continually updated with new data from the stream [12]. It uses SGD to update the regression coefficients. Spark MLlib lets one train and test a linear regression model on streaming data. The streaming data must be in the form of a Discretized Stream [13]. The configuration steps of Spark MLlib for streaming linear regression are given below:

• Determine the training and (optional) test folders. Spark MLlib streaming linear regression listens to the given training and test folders for streaming data; it therefore detects streaming data when new files are added to the training or test folders. Training and test data can be ingested from many sources. Due to Spark's data-parallel paradigm, Spark MLlib streaming linear regression requires a shared file system for the training and test folders. Examples of shared file systems are S3, NFS and HDFS [14]. Training and test data instances have to be in the form of an RDD of Labeled Points. The number of data points per training batch can vary, but the number of features must be constant [12]. The input format for Spark streaming linear regression is as follows: each line should be a data point formatted as (y,[x1,x2,x3]), where y is the label and x1, x2, x3 are the features [12]. Any time a text file is placed in the training folder, the model is updated. The features and the labels have to be numeric. The Spark MLlib streaming linear regression algorithm checks the file creation time in the training and test folders; therefore, a file created before the start time of Spark streaming linear regression will not be processed. This is one disadvantage of Spark MLlib streaming linear regression.

• Set the streaming linear regression parameters [7]. The Spark MLlib streaming linear regression algorithm has four parameters: step size (learning rate), number of iterations (for terminating the gradient descent), initial weights vector and mini batch fraction time. The first three parameters are required for linear regression as mentioned in Section III, and the last parameter is used for the batch time. The batch time parameter sets the time window for Spark Streaming. Spark Streaming linear regression has a latency of several seconds because of the mini batch time. On the other hand, this mini batch time efficiently guarantees that each stream item is processed exactly once.

Although Spark Streaming supports the Java, Scala and Python languages, streaming linear regression is implemented in Scala. The coding flow of Spark Streaming linear regression is simple; therefore, users can easily add their own implementations to the streaming regression.

Spark MLlib streaming linear regression works in the memory of the distributed machines; hence, this memory-based structure decreases the execution time of the linear regression algorithm. This is the greatest strength of Spark MLlib for streaming linear regression.

VI. EXPERIMENTAL RESULTS

In our experiments, the aim is to compare the performance of two widely used frameworks, MOA and Spark MLlib, on streaming linear regression. The experiments are applied to the Airfoil Self-Noise data set that was processed by NASA in 1989 [15]. The Airfoil Self-Noise data set has 1503 instances and consists of five numeric features (Frequency, Angle of attack, Chord length, Free-stream velocity, Suction side displacement thickness) and one output that gives the scaled sound pressure level. To produce massive data sets, we create new sets of instances by copying subsets of the existing data set.

The MOA and Spark MLlib streaming linear regression algorithms are run on an 8-core 2.2 GHz Intel i7 PC with 16 GB of memory and a 750 GB disk, running the Ubuntu 14.04 operating system.

The experiments require different settings for the two frameworks. First, we introduce the MOA streaming linear regression parameters. As mentioned in Section IV, the lambda regularization, learning rate, loss function, learner algorithm, maximum instances and number of passes parameters must be set to run MOA linear regression. Since we used our own arff reader, there is no need to set the maximum instances and number of passes parameters. Our arff reader shows the current model weights after every 250 records; however, this introduces some latency. The parameters can be set through the user-friendly GUI of the MOA framework. All parameter values that we used in our tests are given in Table I.

To build Spark MLlib streaming linear regression, the paths of the folders where the training and test data sets reside and the number of features must be set by the user. The user can also change the other parameters that are explained in Section V. Spark MLlib has no GUI for setting parameters or running tasks; hence, all parameter assignments have to be made in code. The Spark MLlib parameters are set as shown in Table I.

TABLE I. Spark MLlib and MOA parameters for streaming linear regression

Spark MLlib Parameters | MOA Parameters
stepSize: 0.1 | lambdaRegularization: 1
numIterations: 1 | learningRate: 0.1
miniBatchFraction: 1.0 | lossFunc: SQUAREDLOSS
initial weights: Vector with 0 values | learner: class.moa.classifier.functions.SGD

The experiments measure the performance of the two frameworks in terms of CPU time costs, which are given in Figure 1. As can be seen from Figure 1, Spark MLlib streaming linear regression is much faster than MOA, and the difference in CPU performance becomes clearer on massive data sets. There were also some challenges that affected us at the development stage. The following table

illustrates the comparison points of the two frameworks that were determined from these challenges.

TABLE II. Comparison of Streaming Linear Regression

Comparison Point | Spark MLlib | MOA
Code Complexity | Low | Low
Programming Languages | Scala | Java
Noise Data Set Support | No | No
Fault Tolerance | High | None
Usability | Simple | Easy (GUI)
Documentation | Not much | Not much
Version Rate | Apache distribution (Very High) | Not High

[Figure 1: bar chart. x-axis: Data Size (1.2 GB, 2.4 GB, 4.8 GB); y-axis: Execution Time (minutes), 0.00 to 9.00; series: Spark MLlib, MOA; bar labels: 0.60, 1.93, 2.60, 3.88, 5.24, 7.72.]

Fig. 1. CPU time costs of Streaming Linear Regression

Comparing performance and development effort, streaming linear regression on Spark MLlib produces results faster than MOA; nevertheless, development time with MOA is shorter than with Spark MLlib.

VII. CONCLUSION

We have presented linear regression on streaming data using the stream mining tools MOA and Spark MLlib with Spark Streaming support. This study also aims to guide users in implementing streaming linear regression models with MOA and Spark MLlib. Our empirical evaluation shows that the key idea of MOA is to provide simple, end-to-end solutions through its user-friendly GUI; in fact, a user can easily implement a streaming linear model without coding knowledge. On the other hand, the key idea of Spark streaming linear regression is to handle streaming data as a series of short batch jobs and to complete these batch jobs in as short a time as possible.

Both of these frameworks are open source; therefore, they offer users many advantages for easily implementing massive data stream applications. We used existing streaming linear regression libraries and also implemented additional strategies to improve the MOA framework in order to compare the two frameworks.

Although we used Spark MLlib in local mode and the non-distributed MOA framework for streaming linear regression, both handled massive data streams (for local mode) in reasonable CPU times. On the other hand, Spark Streaming is one of the best technologies in the area of distributed streaming computing, and SAMOA (Scalable Advanced Massive Online Analysis), the distributed form of MOA [16], provides a collection of distributed streaming algorithms. In future work, we plan to use Spark Streaming and SAMOA on clusters to make these comparisons on very massive data.

REFERENCES

[1] A. Bifet, G. Holmes, B. Pfahringer, P. Kranen, H. Kremer, T. Jansen, T. Seidl, "MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering," JMLR: Workshop and Conference Proceedings, pp. 44-50, 2010.
[2] L. Golab and M. Ozsu, "Issues in data stream management," ACM SIGMOD Record, vol. 32, no. 2, pp. 5-14, 2003.
[3] Apache Storm Website. Available: https://storm.apache.org/
[4] Real Time Processing with IBM InfoSphere Streams, IBM Data Sheets. Available: http://www-03.ibm.com/software/products/en/infosphere-streams
[5] Apache Samza Website. Available: http://samza.apache.org/
[6] A. Bifet, R. Kirkby, G. Holmes, B. Pfahringer, "MOA: Massive Online Analysis," Journal of Machine Learning Research, pp. 1601-1604, 2010.
[7] Apache Spark MLlib Website. Available: https://spark.apache.org/mllib/
[8] N. Koudas, D. Srivastava, "Data Stream Query Processing: A Tutorial," Proceedings of the 29th VLDB Conference, pp. 1149-1149, 2003.
[9] G. Krempl, I. Žliobaitė, D. Brzezinski, M. Last et al., "Open Challenges for Data Stream Mining Research," ACM SIGKDD Explorations Newsletter, July 2014.
[10] C.H. Nadungodage, Y. Xia, F. Li, J. Ge, "StreamFitter: A Real Time Linear Regression Analysis System for Continuous Data Streams," 16th International Conference, DASFAA, pp. 458-461, 2011.
[11] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, vol. 11, pp. 10-18, 2009.
[12] Spark MLlib Linear Methods Programming Guide.
[13] M. Zaharia, T. Das, H. Li, S. Shenker, I. Stoica, "Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters," Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, pp. 10-10, 2012.
[14] T.S. Morais, "Survey on Frameworks for Distributed Computing: Hadoop, Spark and Storm," Doctoral Symposium in Informatics Engineering, pp. 95-105, 2015.
[15] T.F. Brooks, D.S. Pope, and A.M. Marcolini, "Airfoil self-noise and prediction," Technical Report, NASA RP-1218, 1989.
[16] G.D.F. Morales and A. Bifet, "SAMOA: Scalable Advanced Massive Online Analysis," Journal of Machine Learning Research 16, pp. 149-153, 2015.
[17] Christopher M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Inc., New York, NY, USA, 1995.
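As a supplementary illustration of the batch-wise update described in Section V, the following is a minimal sketch in plain Python rather than Spark code. It parses lines in the `(y,[x1,x2,x3])` text format and applies gradient descent steps on each arriving batch. The names `parse_labeled_point` and `StreamingLinearModel` are hypothetical; averaging the gradient over the batch, the constant step size and the absence of an intercept are assumptions of the sketch, and sampling by mini batch fraction is omitted (equivalent to a fraction of 1.0).

```python
from typing import List, Sequence, Tuple

LabeledPoint = Tuple[float, List[float]]


def parse_labeled_point(line: str) -> LabeledPoint:
    """Parse one line of the form (y,[x1,x2,x3]) into (label, features)."""
    text = line.strip()
    if not (text.startswith("(") and text.endswith("])")):
        raise ValueError(f"not a labeled point: {line!r}")
    label_text, features_text = text[1:-2].split(",[", 1)
    return float(label_text), [float(v) for v in features_text.split(",")]


class StreamingLinearModel:
    """Linear model whose weights are refined by every arriving batch."""

    def __init__(self, n_features: int, step_size: float = 0.1,
                 num_iterations: int = 1) -> None:
        self.weights = [0.0] * n_features  # initial weights: all zeros
        self.step_size = step_size
        self.num_iterations = num_iterations

    def predict(self, x: Sequence[float]) -> float:
        return sum(w * xi for w, xi in zip(self.weights, x))

    def train_on_batch(self, batch: Sequence[LabeledPoint]) -> None:
        """Run num_iterations gradient steps using only this batch."""
        if not batch:
            return  # an empty batch leaves the model unchanged
        n = len(self.weights)
        for _ in range(self.num_iterations):
            gradient = [0.0] * n
            for y, x in batch:
                if len(x) != n:
                    raise ValueError("the number of features must be constant")
                error = self.predict(x) - y
                for j in range(n):
                    gradient[j] += error * x[j]
            for j in range(n):
                self.weights[j] -= self.step_size * gradient[j] / len(batch)
```

Batch sizes may differ from one call to the next, while the feature count is checked on every point, mirroring the constraint stated in Section V.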
