Multivariate Time Series Classification With WEASEL+MUSE
arXiv:1711.11343v4 [cs.LG] 17 Aug 2018

ABSTRACT

Multivariate time series (MTS) arise when multiple interconnected sensors record data over time. Dealing with this high-dimensional data is challenging for every classifier for at least two aspects: First, an MTS is not only characterized by individual features at a single time instant, but also by the interplay of features in different dimensions. Second, an MTS is typically characterized by long idle periods with small bursts of characteristic movements in every dimension. Here, the exact time instant of an event, e.g., thumbs up, is irrelevant for classification. To effectively deal with this kind of information, an MTSC has to deal with noise and irrelevant dimension data, and, most importantly, extract relevant features from each dimension.

[Figure 1: Raw multivariate time series of a hand gesture. Dimensions include: 1. Hand tip left, X coordinate; 2. Hand tip left, Y coordinate; 3. Hand tip left, Z coordinate; 4. Hand tip right, X coordinate; 5. Hand tip right, Y coordinate.]

In this paper, we introduce our novel domain agnostic MTSC method called WEASEL+MUSE (WEASEL plus Multivariate Unsupervised Symbols and dErivatives). WEASEL+MUSE conceptually builds on the bag-of-patterns (BOP) model and the WEASEL (Word ExtrAction for time SEries cLassification) pipeline for feature selection. The BOP model moves a sliding window over an MTS, extracts discrete features per window, and creates a histogram over feature counts. The way features are extracted and filtered in WEASEL+MUSE is different from state-of-the-art multivariate classifiers:

(1) Identifiers: WEASEL+MUSE adds a dimension (sensor) identifier to each extracted discrete feature. Thereby WEASEL+MUSE can discriminate between the presence of features in different dimensions, e.g., whether a left or a right hand was raised.

(2) Derivatives: To improve accuracy, derivatives are added as features to the MTS. These are the differences between neighbouring data points in each dimension. Derivatives represent the general shape and are invariant to the exact value at a given time stamp. This information can help to increase classification accuracy.

(3) Noise robustness: WEASEL+MUSE derives discrete features from windows extracted from each dimension of the MTS using a truncated Fourier transform and discretization, thereby reducing noise.

(4) Interplay of features: The interplay of features along the dimensions is learned by assigning weights to features (using logistic regression), thereby boosting or dampening feature counts. Essentially, when two features from different dimensions are characteristic for the class label, they are assigned high weights, and their co-occurrence increases the likelihood of a class.

(5) Order invariance: A main advantage of the BOP model is its invariance to the order of the subsequences, as a result of using histograms over feature counts. Thus, two MTS are similar if they show a similar number of feature occurrences rather than the same values at the same time instances.

(6) Feature selection: The wide range of features considered by WEASEL+MUSE (dimensions, derivatives, unigrams, bigrams, and varying window lengths) introduces many non-discriminative features. Therefore, WEASEL+MUSE applies statistical feature selection and feature weighting to identify those features that best discern between classes. The aim of our feature selection is to prune the feature space to a level at which feature weighting can be learned in reasonable time.

In our experimental evaluation on 20 public benchmark MTS datasets and a use case on motion capture data, WEASEL+MUSE is constantly among the most accurate methods. WEASEL+MUSE clearly outperforms all other classifiers except the very recent deep-learning-based method from [11]. Compared to the latter, WEASEL+MUSE performs better for small-sized datasets with fewer features or samples to use for training, such as sensor readings.

The rest of this paper is organized as follows: Section 2 briefly recaps bag-of-patterns classifiers and definitions. In Section 3 we present related work. In Section 4 we present WEASEL+MUSE's novel way of feature generation and selection. Section 5 presents evaluation results and Section 6 our conclusion.

Figure 2: Transformation of a TS into the Bag-of-Patterns (BOP) model using overlapping windows ((1) Windowing, second to top), discretization of windows to words ((2) Discretization, second from bottom), and word counts ((3) Bag-of-Patterns model, bottom) (see [23]).

2 BACKGROUND: TIME SERIES AND BAG-OF-PATTERNS

A univariate time series (TS) T = {t1, . . . , tn} is an ordered sequence of n ∈ N real values ti ∈ R. A multivariate time series (MTS) T = {t1, . . . , tm} is an ordered sequence of m ∈ N streams (dimensions) with ti = (ti,1, . . . , ti,n) ∈ R^n, for instance, a stream of m interconnected sensors recording values at each time instant. As we primarily address MTS generated from automatic sensors with a fixed and synchronized sampling along all dimensions, we can safely ignore time stamps. A time series dataset D contains N time series. Note that we consider only MTS with numerical attributes (not categorical ones).

The derivative of a stream ti = (ti,1, . . . , ti,n) is given by the sequence of pairwise absolute differences ti' = (|ti,2 − ti,1|, . . . , |ti,n − ti,n−1|). Adding derivatives to an MTS T = {t1, . . . , tm} of m streams effectively doubles the number of streams: T = {t1, . . . , tm, t1', . . . , tm'}.
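The derivative definition above can be illustrated in a few lines of code. This is a minimal numpy sketch of the formula, not the authors' implementation; the function names are ours:

```python
import numpy as np

def derivative(stream):
    # |t_{i,2} - t_{i,1}|, ..., |t_{i,n} - t_{i,n-1}|: absolute pairwise differences
    return np.abs(np.diff(stream))

def add_derivatives(mts):
    """mts: list of m 1-d arrays (streams); returns the 2*m streams
    obtained by appending one derivative stream per original stream."""
    return list(mts) + [derivative(s) for s in mts]

streams = [np.array([1.0, 3.0, 2.0]), np.array([0.0, 0.5, 1.5])]
doubled = add_derivatives(streams)
# doubled[2] is the derivative of the first stream: [2.0, 1.0]
```

Note that each derivative stream is one value shorter than its source stream, which is why the streams are kept as a list rather than stacked into a single matrix.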
Multivariate Time Series Classification with WEASEL+MUSE Conference’17, July 2017, Washington, DC, USA
Given a univariate TS T, a window S of length w is a subsequence with w contiguous values starting at offset a in T, i.e., S(a, w) = (ta, . . . , ta+w−1) with 1 ≤ a ≤ n − w + 1.

We associate each TS with a class label y ∈ Y from a predefined set of labels Y. Time series classification (TSC) is the task of predicting a class label for a TS whose label is unknown. A TS classifier is a function that is learned from a set of labelled time series (the training data), takes an unlabelled time series as input, and outputs a label.

Our method is based on the bag-of-patterns (BOP) model [14, 19, 20]. Algorithms following the BOP model build a classification function by (1) extracting subsequences from a TS, (2) discretizing each real-valued subsequence into a discrete-valued word (a sequence of symbols over a fixed alphabet), (3) building a histogram (feature vector) from word counts, and (4) finally using a classification model from the machine learning repertoire on these feature vectors.

Figure 2 illustrates these steps from a raw time series to a BOP model using overlapping windows. Overlapping subsequences of fixed length are extracted from a time series (second from top), each subsequence is discretized to a word (second from bottom), and finally a histogram is built over the word counts.

Different discretization functions have been used in the literature, including SAX [13] and SFA [21]. SAX is based on the discretization of mean values and SFA is based on the discretization of coefficients of the Fourier transform.

In the BOP model, two TS are similar if the subsequences have similar frequencies in both TS. Feature selection and weighting can be used to dampen or emphasize important subsequences, as in the WEASEL model [23].

3 RELATED WORK

Research in univariate TSC has a long tradition and dozens of approaches have been proposed; refer to [2, 7, 22] for a summary. The techniques used for TSC can broadly be categorized into two classes: (a) similarity-based (distance-based) methods and (b) feature-based methods.

Similarity-based methods make use of a similarity measure like Dynamic Time Warping (DTW) [18] to compare two TS. 1-Nearest Neighbour DTW is commonly used as a baseline in TSC comparisons [2]. In contrast, feature-based TSC relies on comparing features, typically generated from substructures of a TS. The most successful approaches are shapelets and bag-of-patterns (BOP). Shapelets are defined as TS subsequences that are maximally representative of a class [29]. The standard BOP model [14] breaks up a TS into windows, represents these as discrete features, and finally builds a histogram of feature counts as the basis for classification.

In previous research we have studied the BOP model for univariate TSC. The BOSS (Bag-of-SFA-Symbols) [20] classifier is based on the (unsupervised) Symbolic Fourier Approximation (SFA) [21] to generate discrete features and uses a similarity measure on the histogram of feature counts. The WEASEL classifier [23] applies a supervised symbolic representation to transform subsequences to words, uses statistical feature selection, and subsequently feeds the words into a logistic regression classifier. WEASEL is among the most accurate and fastest univariate TSC [23]. WEASEL was optimized to extract discriminative words to ease classification of univariate TS. We observed that this led to an overall low accuracy for MTSC due to the increased number of possible features along all dimensions (see Section 5). WEASEL+MUSE was designed on the WEASEL pipeline, but adds sensor identifiers to each word and generates unsupervised discrete features to minimize overfitting, as opposed to WEASEL, which uses a supervised transformation. WEASEL+MUSE further adds derivatives (differences between all neighbouring points) to the feature space to increase accuracy.

For multivariate time series classification (MTSC), the most basic approach is to apply rigid dimensionality reduction (e.g., PCA) or to simply concatenate all dimensions of the MTS to obtain a univariate TS and use a proven univariate TSC. Beyond this, several domain agnostic MTSC methods have been proposed.

Symbolic Representation for Multivariate Time series (SMTS) [3] uses codebook learning and the bag-of-words (BOW) model for classification. First, a random forest is trained on the raw MTS to partition the MTS into leaf nodes. Each leaf node is then labelled by a word of a codebook. There is no additional feature extraction, apart from calculating derivatives (first order differences) for the numerical dimensions. For classification a second random forest is trained on the BOW representation of all MTS.

Ultra Fast Shapelets (UFS) [27] applies the shapelet discovery method to MTS classification. The major limiting factor for shapelet discovery is the time to find discriminative subsequences, which becomes even more demanding when dealing with MTS. UFS solves this by extracting random shapelets. On this transformed data, a linear SVM or a Random Forest is trained. Unfortunately, the code is not available to allow for reproducibility.

Generalized Random Shapelet Forests (gRSF) [12] also generates a set of shapelet-based decision trees over randomly extracted shapelets. In their experimental evaluation, gRSF was the best MTSC when compared to SMTS, LPS and UFS on 14 MTS datasets. Thus, we use gRSF as a representative for random shapelets.

Learned Pattern Similarity (LPS) [4] extracts segments from an MTS. It then trains regression trees to identify structural dependencies between segments. The regression trees trained in this manner represent a non-linear AR model. LPS next builds a BOW representation based on the labels of the leaf nodes, similar to SMTS. Finally a similarity measure is defined on the BOW representations of the MTS. LPS showed better performance than DTW in a benchmark using 15 MTS datasets. Autoregressive (AR) Kernel [5] proposes an AR kernel-based distance measure for MTSC.

Autoregressive forests for multivariate time series modelling (mv-ARF) [25] proposes a tree ensemble trained on autoregressive models of the MTS, each one with a different lag. This model is used to capture linear and non-linear relationships between features in the dimensions of an MTS. The authors compared mv-ARF to AR Kernel, LPS and DTW on 19 MTS datasets. mv-ARF and AR Kernel showed the best results. mv-ARF performs well on motion recognition data. AR Kernel outperformed the other methods for sensor readings.

At the time of writing this paper, Multivariate LSTM-FCN (MLSTM-FCN) [11] was proposed, which introduces a deep learning architecture based on a long short-term memory (LSTM), a fully convolutional network
Conference’17, July 2017, Washington, DC, USA Patrick Schäfer and Ulf Leser
(FCN) and a squeeze-and-excitation block. Their method is compared to the state-of-the-art and shows the overall best results.

4 WEASEL+MUSE

We present our novel method for domain agnostic multivariate time series classification (MTSC) called WEASEL+MUSE (WEASEL + Multivariate Unsupervised Symbols and dErivatives). WEASEL+MUSE addresses the major challenges of MTSC in a specific manner (using gesture recognition as an example):

(1) Interplay of dimensions: MTS are not only characterized by individual features at a single time instance, but also by the interplay of features in different dimensions. For example, to predict a hand gesture, a complex orchestration of interactions between hand, finger and elbow may have to be considered.

(2) Phase invariance: Relevant events in an MTS do not necessarily reappear at the same time instances in each dimension. Thus, characteristic features may appear anywhere in an MTS (or not at all). For example, a hand gesture should allow for considerable differences in time schedule.

(3) Invariance to irrelevant dimensions: Only small periods in time and in some streams may contain relevant information for classification. What makes things even harder is the fact that whole sensor streams may be irrelevant for classification. For instance, a movement of a leg is irrelevant for capturing hand gestures and vice versa.

We engineered WEASEL+MUSE to address these challenges. Our method conceptually builds on our previous work on the bag-of-patterns (BOP) model and univariate TSC [20, 23], yet uses a different approach in many of the individual steps to deal with the aforementioned challenges. We will use the terms feature and word interchangeably throughout the text. In essence, WEASEL+MUSE makes use of a histogram of feature counts. In this feature vector it captures information about local and global changes in the MTS along different dimensions. It then learns weights to boost or dampen characteristic features. The interplay of features is represented by high weights.

4.1 Overview

We first give an overview of our basic idea and an example of how we deal with the challenges described above. In WEASEL+MUSE a feature is represented by a word that encodes the identifiers (sensor id, window size, and discretized Fourier coefficients) and counts its occurrences. Figure 3 shows an example of the WEASEL+MUSE model for a fixed window length of 15 on motion capture data. The data has 3 dimensions (x, y, z coordinates). The feature ('3-15-ad', 2) (see Figure 3 (b)) represents the unigram 'ad' for the z-dimension with window length 15 and frequency 2, and the feature ('2-15-bd ad', 2) represents the bigram 'bd ad' for the y-dimension with window length 15 and frequency 2.

Pipeline: WEASEL+MUSE is composed of the building blocks depicted in Figure 4: the symbolic representation SFA [21], BOP models for each dimension, feature selection and the WEASEL+MUSE model. WEASEL+MUSE conceptually builds upon the univariate BOP model applied to each dimension. Multivariate words are obtained from the univariate words of each BOP model by concatenating each word with an identifier (representing the sensor and the window size). This maintains the association between the dimension and the feature space.

More precisely, an MTS is first split into its dimensions. Each dimension can now be considered as a univariate TS and transformed using the classical BOP approach. To this end, z-normalized windows of varying lengths are extracted. Next, each window is approximated using the truncated Fourier transform, keeping only the lower frequency components of each window. The Fourier values (real and imaginary parts separately) are then discretized into words based on equi-depth or equi-frequency binning using a symbolic transformation (details are given in Subsection 4.2). Thereby, words (unigrams) and pairs of words (bigrams) with varying window lengths are computed. These words are concatenated with their identifiers, i.e., the sensor id (dimension) and the window length. Thus, WEASEL+MUSE keeps a disjoint word space for each dimension, and two words from different dimensions can never coincide. To deal with irrelevant features and dimensions, a Chi-squared test is applied to all multivariate words (Subsection 4.4). As a result, a highly discriminative feature vector is obtained, and a fast linear-time logistic regression classifier can be trained (Subsection 4.4). It further captures the interplay of features in different dimensions by learning high weights for important features in each dimension (Subsection 4.5).

4.2 Word Extraction: Symbolic Fourier Approximation

Instead of training a multivariate symbolic transformation, we train and apply the univariate symbolic transformation SFA to each dimension of the MTS separately. This allows for (a) phase invariance between different dimensions, as a separate BOP model is built for each dimension, but (b) the information that two features occurred at exactly the same time instant in two different dimensions is lost. Semantically, splitting an MTS into its dimensions results in two MTS T1 and T2 being similar if both share similar substructures within the i-th dimension, at arbitrary time stamps.

SFA transforms a real-valued TS window to a word using an alphabet of size c as in [21]:

(1) Approximation: Each normalized window of length w is subjected to dimensionality reduction by the use of the truncated Fourier transform, keeping only the first l ≪ w coefficients for further analysis. This step acts as a low pass (noise) filter, as higher order Fourier coefficients typically represent rapid changes like drop-outs or noise.

(2) Quantization: Each Fourier coefficient is then discretized to a symbol of an alphabet of fixed size c, which in turn achieves further robustness against noise.

Figure 5 exemplifies this process for a univariate time series, resulting in the word ABDDABBB. As a result, each real-valued window in the i-th dimension is transformed into a word of length l over an alphabet of size c. For a given window length, there is a maximum of O(n) windows in each of the m dimensions, resulting in a total of O(n × m) words.

SFA is a data-adaptive symbolic transformation, as opposed to SAX [13], which always uses the same set of bins irrespective of
Figure 3: WEASEL+MUSE model of a motion capture. (a) Motion of a left hand in x/y/z coordinates. (b) The WEASEL+MUSE model for each of these coordinates. A feature in the WEASEL+MUSE model encodes the dimension, window length and actual word, e.g., '1-15-aa' for 'left hand', window length 15 and word 'aa'.
Figure 4: WEASEL+MUSE Pipeline: Feature extraction, univariate Bag-of-Patterns (BOP) models and WEASEL+MUSE.
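The identifier encoding from the Figure 3 caption can be sketched directly. This is our own illustrative helper, not the reference implementation: each univariate word is prefixed with its dimension id and window length, so equal words from different dimensions map to distinct features:

```python
from collections import Counter

def multivariate_words(words_per_dim, window_len):
    """words_per_dim: {dim_id: [word, ...]} -- univariate SFA words per dimension.
    Prefixing each word with its dimension id and window length keeps the
    word spaces disjoint: words from different dimensions can never collide."""
    bag = Counter()
    for dim_id, words in words_per_dim.items():
        for w in words:
            bag[f"{dim_id}-{window_len}-{w}"] += 1
    return bag

bag = multivariate_words({1: ["aa", "aa", "ad"], 2: ["aa"]}, window_len=15)
# 'aa' from dimension 1 and 'aa' from dimension 2 become distinct features:
# bag["1-15-aa"] == 2, bag["2-15-aa"] == 1
```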
the data distribution. Quantization boundaries are derived from a (sampled) train dataset using either (a) equi-depth or (b) equi-frequency binning, such that (a) the Fourier frequency range is divided into equal-sized bins or (b) the boundaries are chosen to hold an equal number of Fourier values. SFA is trained for each dimension separately, resulting in m SFA transformations. Each SFA transformation is then used to transform only its dimension of the MTS.

[Figure 5: The SFA transformation of a univariate time series window, in three panels: Time Series, DFT, SFA.]

4.3 Univariate Bag-of-Patterns: Unigrams, bigrams, derivatives, window lengths

In the BOP model, two TS are distinguished by the frequencies of certain subsequences rather than their presence or absence. A TS is represented by word counts, obtained from the windows of the time series. BOP-based methods have a number of parameters, and of particular importance is the window length, which heavily influences their performance. For dealing with MTS, we have to find the best window lengths for each dimension, as one cannot assume that there is a single optimal value for all dimensions. WEASEL+MUSE

Algorithm 1: Build one BOP model using SFA, multiple window lengths, bigrams and the Chi-squared test for feature selection. l is the number of Fourier values to keep and wLen are the window lengths used for sliding window extraction.
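Since Algorithm 1 itself is not reproduced in this extract, the following is a hypothetical compact sketch of the per-dimension BOP construction it describes: z-normalized windows, truncated Fourier approximation, quantization into an alphabet of size c = 4, and word counting. Bin boundaries are fixed here for illustration, whereas SFA would learn them from a train sample (equi-depth or equi-frequency); bigrams and the Chi-squared filtering step are omitted:

```python
import numpy as np

ALPHABET = "abcd"  # alphabet size c = 4

def sfa_word(window, n_coeffs, bins):
    """Discretize the first Fourier coefficients of a z-normalized window."""
    w = (window - window.mean()) / (window.std() + 1e-8)
    dft = np.fft.rfft(w)
    # real and imaginary parts separately, truncated (acts as a low-pass filter)
    vals = np.concatenate([dft.real[:n_coeffs], dft.imag[:n_coeffs]])
    symbols = np.digitize(vals, bins)  # quantization into len(bins)+1 bins
    return "".join(ALPHABET[min(s, len(ALPHABET) - 1)] for s in symbols)

def bop_histogram(series, w_len, n_coeffs=1, bins=(-2.0, 0.0, 2.0)):
    """Sliding windows -> SFA-like words -> word counts (one dimension)."""
    counts = {}
    bins = np.asarray(bins)
    for a in range(len(series) - w_len + 1):
        word = sfa_word(series[a:a + w_len], n_coeffs, bins)
        counts[word] = counts.get(word, 0) + 1
    return counts

series = np.sin(np.linspace(0.0, 12.0, 60))
hist = bop_histogram(series, w_len=15)
# one word per window position: sum(hist.values()) == 60 - 15 + 1
```

In the full method this histogram would be built per dimension and per window length, with the identifier prefix keeping the word spaces disjoint.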
feature selection methods. As our main aim is to reduce the runtime for training, we did not look into other feature selection techniques.

For a set of N m-dimensional MTS of length n, the size of the BOP feature space is O(min(Nn^2, c^l) × m) for word length l, c symbols and m dimensions. The number of MTS N and the length n affect the actual word frequencies. But in the worst case each TS window can only produce one distinct word, and there are Nn^2 windows in each dimension. WEASEL+MUSE further uses bigrams, derivatives, and O(n) window lengths. WEASEL+MUSE keeps a disjoint word space for each dimension and window length, thus two words from different dimensions can never collide (no false positives). Thus, the theoretical dimensionality of this feature space rises to O(min(Nn^2, c^(2l) · n) × m). Essentially, the feature space can grow quadratically with the number of observations of an MTS, if every observation generates a distinct word. However, in practice we never observed that many features due to the periodicity of TS or superfluous data/dimensions. Statistical feature selection reduces the total number of features to just a few hundred.

We use sparse vectors to store the words for each MTS, as each feature vector only contains a few features after feature selection. We implemented our MTS classifier using liblinear [8] as it scales linearly with the dimensionality of the feature space [17].

Dataset             #classes   m    n         N Train   N Test
Digits              10         13   4-93      6600      2200
AUSLAN              95         22   45-136    1140      1425
CharTrajectories    20         3    109-205   300       2558
CMUsubject16        2          62   127-580   29        29
DigitShapes         4          2    30-98     24        16
ECG                 2          2    39-152    100       100
Japanese Vowels     9          12   7-29      270       370
KickvsPunch         2          62   274-841   16        10
LIBRAS              15         2    45        180       180
Robot Failure LP1   4          6    15        38        50
Robot Failure LP2   5          6    15        17        30
Robot Failure LP3   4          6    15        17        30
Robot Failure LP4   3          6    15        42        75
Robot Failure LP5   5          6    15        64        100
NetFlow             2          4    50-997    803       534
PenDigits           10         2    8         300       10692
Shapes              3          2    52-98     18        12
UWave               8          3    315       200       4278
Wafer               2          6    104-198   298       896
WalkvsRun           2          62   128-1918  28        16

Table 1: 20 multivariate time series datasets collected from [15].

4.5 Feature Interplay

The WEASEL+MUSE model is essentially a histogram of discrete features extracted from all dimensions. The logistic regression classifier trains a weight vector for each class, assigning high weights to those features that are relevant within each dimension. Thereby, it captures the feature interplay, as dimensions are not treated separately but the weight vector is trained over all dimensions. Still, this approach allows for phase invariance, as classes (events) are represented by the frequency of occurrence of discrete features rather than the exact time instance of an event.
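The selection and weighting steps can be sketched end-to-end on toy histogram data. These are our own minimal implementations for illustration; WEASEL+MUSE itself applies a Chi-squared test and trains logistic regression with liblinear on sparse vectors:

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared score per feature for non-negative count data X
    (N samples x F features) against binary labels y in {0, 1}."""
    observed = np.array([X[y == c].sum(axis=0) for c in (0, 1)])  # 2 x F
    class_tot = observed.sum(axis=1, keepdims=True)
    feat_tot = observed.sum(axis=0, keepdims=True)
    expected = class_tot * feat_tot / feat_tot.sum()
    return ((observed - expected) ** 2 / (expected + 1e-12)).sum(axis=0)

def fit_logreg(X, y, lr=0.5, epochs=300):
    """Plain gradient-descent logistic regression; returns the weight vector
    that boosts or dampens feature counts."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# toy histograms: feature 0 fires only for class 1, feature 2 is noise
X = np.array([[0, 1, 1], [0, 2, 0], [3, 1, 1], [4, 1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])
keep = chi2_scores(X, y) > 0.5        # prune non-discriminative features
w = fit_logreg(X[:, keep], y)          # learn per-feature weights
```

On this toy data the noisy third feature is pruned, and the learned weights separate the two classes from the remaining counts.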
Figure 6: Average ranks on the 20 MTS datasets: WEASEL+MUSE (2.45), MLSTM-FCN (2.75), gRSF (3.85), SMTS (4.05), LPS (4.05), ARKernel (4.35), mv-ARF (4.7), WEASEL (6.05), DTWi (6.6). WEASEL+MUSE and MLSTM-FCN are the most accurate.

5.2 Accuracy

Figure 6 shows a critical difference diagram (as introduced in [6]) over the average ranks of the different MTSC methods. Classifiers with the lowest (best) ranks are to the right. Groups of classifiers that are not significantly different in their rankings are connected by a bar. The critical difference (CD) length at the top represents statistically significant differences.

MLSTM-FCN and WEASEL+MUSE show the lowest overall ranks and are in the group of best classifiers. They are also significantly better than the baseline DTWi. When compared to the plain WEASEL classifier, we can see that the MUSE extension to WEASEL also leads to significantly better ranks (6.05 vs 2.45). This is a result of using feature identifiers and derivatives.

Overall, WEASEL+MUSE has 12 wins (or ties) on the MTS datasets (Table 2), which is the highest of all classifiers. With a mean accuracy of 93.5% it shows a similar average accuracy to MLSTM-FCN (mean accuracy 92.1%).

In the next section we look into the differences between MLSTM-FCN and WEASEL+MUSE and identify the domains for which each classifier is best suited.

5.3 Domain-dependent strength or limitation

We studied the individual accuracy of each method on the 20 different MTS datasets, and grouped the datasets by domain (Handwriting, Motion Sensors, Sensor Readings, Speech) to test whether our method has a domain-dependent strength or limitation. Figure 7 shows the accuracies of WEASEL+MUSE (orange line) and MLSTM-FCN (black line) vs. the other six MTS classifiers (green area).

Overall, the performance of WEASEL+MUSE is very competitive for all datasets. The orange line is mostly very close to the upper outline of the green area, indicating that WEASEL+MUSE's performance is close to that of its best competitor in many cases. In total WEASEL+MUSE has 12 out of 20 possible wins (or ties). WEASEL+MUSE has the highest percentage of wins in the group of motion sensors, followed by speech and handwriting.

WEASEL+MUSE and MLSTM-FCN perform similarly on many dataset domains. WEASEL+MUSE performs best for sensor reading datasets and MLSTM-FCN performs best for motion and speech datasets. Sensor readings are the datasets with the least number
Figure 7: Classification accuracies on the 20 MTS datasets (grouped into Handwriting, Motions, Sensor Readings and Speech) for WEASEL+MUSE (orange), MLSTM-FCN (black) vs. six state-of-the-art MTSC. The green area represents the six classifiers' accuracies.
of samples N or features n, in the range of a few dozens. On the other hand, speech and motion datasets contain the most samples or features, in the range of hundreds to thousands.

This might indicate that WEASEL+MUSE performs well even for small-sized datasets, whereas MLSTM-FCN seems to require larger training corpora to be most accurate. Furthermore, WEASEL+MUSE is based on the BOP model, which compares signals based on the frequency of occurrence of subsequences rather than their absence or presence. Thus, signals with some repetition, such as ECG signals, profit from using WEASEL+MUSE.

5.4 Effects of Gaussian Noise on Classification Accuracy

WEASEL+MUSE applies the truncated Fourier transform and discretization to generate features. This acts as a low-pass filter. To illustrate the relevance of noise to the classification task, we performed another experiment on the two multivariate synthetic datasets Shapes and DigitShapes. First, all dimensions of each dataset were z-normalised to have a standard deviation (SD) of 1. We then added increasing Gaussian noise with an SD of 0 to 1.0 to each dimension, equal to noise levels of 0% to 100%.

Figure 8 shows the two classifiers DTWi and WEASEL+MUSE. For DTWi the classification accuracy drops by up to 30 percentage points for increasing levels of noise. At the same time, WEASEL+MUSE was robust to Gaussian noise and its accuracy remains stable up to a noise level of 100%.

5.5 Relative Prediction Times

In addition to achieving state-of-the-art accuracy, WEASEL+MUSE is also competitive in terms of prediction times. In this experiment, we compare WEASEL+MUSE to DTWi. We could not perform a meaningful comparison to the other competitors, as we either do not have the source codes or the implementation is given in a different language (R, Matlab).

In general, 1-NN DTW has a computational complexity of O(Nn^2) for TS of length n. For the implementation of DTWi we make use of the state-of-the-art cascading lower bounds from [18]. In this experiment, we measure CPU time to account for parallel and single-threaded codes. The DTWi prediction time is reported relative to that of WEASEL+MUSE, i.e., a number lower than 1 means that DTW is faster than WEASEL+MUSE. 1-NN DTW is a nearest neighbour classifier, so its prediction times directly depend on the size of the train dataset. For WEASEL+MUSE the length of the time series n and the number of dimensions m are most important.

For all but three datasets WEASEL+MUSE is faster than DTWi. It is up to 400 times faster for the Robot Failure LP1 dataset and 200 times faster for ArabicDigits. On average it is 43 times faster than DTWi. There are three datasets for which DTW is faster: WalkvsRun, KickvsPunch and CMU-MOCAP. These are the datasets with
Figure 8: Effects of Gaussian noise on classification accuracy (x-axis: SD of the Gaussian noise added). With increasing levels of noise, the accuracy of DTW drops, while that of WEASEL+MUSE remains stable.
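The noise setup described in Section 5.4 can be reproduced schematically. This is our own sketch (dataset loading and classification omitted): each dimension is z-normalised to SD 1, then Gaussian noise with an SD between 0 and 1.0 is added, corresponding to noise levels of 0% to 100%:

```python
import numpy as np

def z_normalise(x):
    # zero mean, unit standard deviation per dimension
    return (x - x.mean()) / x.std()

def add_noise(mts, level, rng):
    """mts: (m, n) array; level in [0, 1] is the SD of the Gaussian noise
    added to each z-normalised (unit-SD) dimension."""
    normed = np.apply_along_axis(z_normalise, 1, mts)
    return normed + rng.normal(0.0, level, size=normed.shape)

rng = np.random.default_rng(0)
mts = rng.normal(size=(3, 100)) * 5 + 2      # toy 3-dimensional series
noisy = add_noise(mts, level=0.5, rng=rng)   # 50% noise level
```

Sweeping `level` from 0.0 to 1.0 and re-evaluating a classifier at each step yields curves like those in Figure 8.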
scored 172 (95.6%). This challenge underlined that WEASEL+MUSE is applicable out-of-the-box to real-world use cases and is competitive with domain-specific tailored approaches. Motion capture data is characterized by noise and contains much superfluous information among dimensions. By design, WEASEL+MUSE is able to deal with this kind of data effectively.

6 CONCLUSION

In this work, we have presented WEASEL+MUSE, a novel multivariate time series classification method following the bag-of-patterns approach and achieving highly competitive classification accuracies. The novelty of WEASEL+MUSE is its feature space engineering using statistical feature selection, derivatives, variable window lengths, bi-grams, and a symbolic representation for generating discriminative words. WEASEL+MUSE provides tolerance to noise (by use of the truncated Fourier transform), phase invariance, and invariance to superfluous data/dimensions. Thereby, WEASEL+MUSE assigns high weights to characteristic, local and global substructures along

[13] Jessica Lin, Eamonn J. Keogh, Li Wei, and Stefano Lonardi. 2007. Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery 15, 2 (2007), 107–144.
[14] Jessica Lin, Rohan Khade, and Yuan Li. 2012. Rotation-invariant similarity in time series using bag-of-patterns representation. Journal of Intelligent Information Systems 39, 2 (2012), 287–315.
[15] Mustafa Gokce Baydogan. 2017. Multivariate Time Series Classification Datasets. http://www.mustafabaydogan.com. (2017).
[16] Christopher Mutschler, Holger Ziekow, and Zbigniew Jerzak. 2013. The DEBS 2013 grand challenge. In Proceedings of the 2013 ACM International Conference on Distributed Event-based Systems. ACM, 289–294.
[17] Andrew Y. Ng. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the 2004 ACM International Conference on Machine Learning. ACM, 78.
[18] Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh. 2012. Searching and mining trillions of time series subsequences under dynamic time warping. In Proceedings of the 2012 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 262–270.
[19] Patrick Schäfer. 2015. Scalable time series classification. Data Mining and Knowledge Discovery (2015), 1–26.
[20] Patrick Schäfer. 2015. The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery 29, 6 (2015), 1505–1530.
[21] Patrick Schäfer and Mikael Högqvist. 2012. SFA: a symbolic fourier approximation
dimensions of a multivariate time series. In our evaluation on al- tion and index for similarity search in high dimensional datasets. In Proceedings
together 21 datasets, WEASEL+MUSE is consistently among the of the 2012 International Conference on Extending Database Technology. ACM,
516–527.
most accurate classifiers and outperforms state-of-the-art similar- [22] Patrick Schäfer and Ulf Leser. 2017. Benchmarking Univariate Time Series
ity measures or shapelet-based approaches. It performs well even Classifiers. In BTW 2017. 289–298.
[23] Patrick Schäfer and Ulf Leser. 2017. Fast and Accurate Time Series Classification
for small-sized datasets, where deep learning based approaches with WEASEL. Proceedings of the 2017 ACM on Conference on Information and
typically tend to perform poorly. When looking into application Knowledge Management (2017), 637–646.
domains, it is best for sensor readings, followed by speech, motion [24] The Value of Wind Power Forecasting. 2016. http://www.nrel.gov/docs/fy11osti/
50814.pdf. (2016).
and handwriting recognition tasks. [25] Kerem Sinan Tuncel and Mustafa Gokce Baydogan. 2018. Autoregressive forests
Future work could direct into different feature selection methods, for multivariate time series modeling. Pattern Recognition 73 (2018), 202–215.
benchmarking approaches based on train and prediction times, or [26] WEASEL+MUSE Classifier Source Code and Raw Results. 2017. https://www2.
informatik.hu-berlin.de/∼schaefpa/muse/. (2017).
use ensembling to build more powerful classifiers, which has been [27] Martin Wistuba, Josif Grabocka, and Lars Schmidt-Thieme. 2015. Ultra-fast
successfully used for univariate time series classification. shapelets for time series classification. arXiv preprint arXiv:1503.05018 (2015).
[28] Y Chen, E Keogh, B Hu, N Begum, A Bagnall, A Mueen and G Batista . 2015.
The UCR Time Series Classification Archive. http://www.cs.ucr.edu/∼eamonn/
REFERENCES time series data. (2015).
[1] AALTD Time Series Classification Challenge. 2016. https://aaltd16.irisa.fr/ [29] Lexiang Ye and Eamonn J. Keogh. 2009. Time series shapelets: a new primitive
challenge/. (2016). for data mining. In Proceedings of the 2009 ACM SIGKDD International Conference
[2] Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, and Eamonn Keogh. on Knowledge Discovery and Data Mining. ACM.
2016. The Great Time Series Classification Bake Off: An Experimental Evaluation
of Recently Proposed Algorithms. Extended Version. Data Mining and Knowledge
Discovery (2016), 1–55.
[3] Mustafa Gokce Baydogan and George Runger. 2015. Learning a symbolic repre-
sentation for multivariate time series classification. Data Mining and Knowledge
Discovery 29, 2 (2015), 400–422.
[4] Mustafa Gokce Baydogan and George Runger. 2016. Time series representation
and similarity based on local autopatterns. Data Mining and Knowledge Discovery
30, 2 (2016), 476–509.
[5] Marco Cuturi and Arnaud Doucet. 2011. Autoregressive kernels for time series.
arXiv preprint arXiv:1101.0673 (2011).
[6] Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets.
The Journal of Machine Learning Research 7 (2006), 1–30.
[7] Philippe Esling and Carlos Agon. 2012. Time-series data mining. ACM Computing
Surveys 45, 1 (2012), 12:1–12:34.
[8] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
Lin. 2008. LIBLINEAR: A library for large linear classification. The Journal of
Machine Learning Research 9 (2008), 1871–1874.
[9] Benjamin F Hobbs, Suradet Jitprapaikulsarn, Sreenivas Konda, Vira Chankong,
Kenneth A Loparo, and Dominic J Maratukulam. 1999. Analysis of the value
for unit commitment of improved load forecasts. IEEE Transactions on Power
Systems 14, 4 (1999), 1342–1348.
[10] Zbigniew Jerzak and Holger Ziekow. 2014. The DEBS 2014 Grand Challenge. In
Proceedings of the 2014 ACM International Conference on Distributed Event-based
Systems. ACM, 266–269.
[11] Fazle Karim, Somshubra Majumdar, Houshang Darabi, and Samuel Harford.
2018. Multivariate LSTM-FCNs for Time Series Classification. arXiv preprint
arXiv:1801.04503 (2018).
[12] Isak Karlsson, Panagiotis Papapetrou, and Henrik Boström. 2016. Generalized
random shapelet forests. Data mining and knowledge discovery 30, 5 (2016),
1053–1085.