
Multivariate Time Series Classification with WEASEL+MUSE
Patrick Schäfer
Humboldt University of Berlin
Berlin, Germany
patrick.schaefer@informatik.hu-berlin.de

Ulf Leser
Humboldt University of Berlin
Berlin, Germany
leser@informatik.hu-berlin.de

arXiv:1711.11343v4 [cs.LG] 17 Aug 2018

ABSTRACT
Multivariate time series (MTS) arise when multiple interconnected sensors record data over time. Dealing with this high-dimensional data is challenging for every classifier for at least two aspects: First, an MTS is not only characterized by individual feature values, but also by the interplay of features in different dimensions. Second, this typically adds large amounts of irrelevant data and noise. We present our novel MTS classifier WEASEL+MUSE which addresses both challenges. WEASEL+MUSE builds a multivariate feature vector, first using a sliding-window approach applied to each dimension of the MTS, then extracts discrete features per window and dimension. The feature vector is subsequently fed through feature selection, removing non-discriminative features, and analysed by a machine learning classifier. The novelty of WEASEL+MUSE lies in its specific way of extracting and filtering multivariate features from MTS by encoding context information into each feature. Still the resulting feature set is small, yet very discriminative and useful for MTS classification. Based on a popular benchmark of 20 MTS datasets, we found that WEASEL+MUSE is among the most accurate classifiers, when compared to the state of the art. The outstanding robustness of WEASEL+MUSE is further confirmed based on motion gesture recognition data, where it out-of-the-box achieved similar accuracies as domain-specific methods.

Figure 1: Motion data recorded from 8 sensors recording x/y/z coordinates (indicated by different line styles) at the left/right hand, left/right elbow, left/right wrist and left/right thumb (indicated by different colours).

KEYWORDS
Time series; multivariate; classification; feature selection; bag-of-patterns
ACM Reference format:
Patrick Schäfer and Ulf Leser. 2016. Multivariate Time Series Classification with WEASEL+MUSE. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference'17), 11 pages. DOI: 10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
A time series (TS) is a collection of values sequentially ordered in time. TS emerge in many scientific and commercial applications, like weather observations, wind energy forecasting, industry automation, mobility tracking, etc. [28] One driving force behind their rising importance is the sharply increasing use of heterogeneous sensors for automatic and high-resolution monitoring in domains like smart homes [10], machine surveillance [16], or smart grids [9, 24].

A multivariate time series (MTS) arises when multiple interconnected streams of data are recorded over time. These are typically produced by devices with multiple (heterogeneous) sensors like weather observations (humidity, temperature), Earth movement (3 axis), or satellite images (in different spectra).

In this work we study the problem of multivariate time series classification (MTSC). Given a concrete MTS, the task of MTSC is to determine which of a set of predefined classes this MTS belongs to, e.g., labeling a sign language gesture based on a set of predefined gestures. The high dimensionality introduced by multiple streams of sensors is very challenging for classifiers, as MTS are not only described by individual features but also by their interplay/co-occurrence in different dimensions [3].

As a concrete example, consider the problem of gesture recognition of different users performing isolated gestures (Figure 1). The dataset was recorded using 8 sensors recording x/y/z coordinates at the left/right hand, left/right elbow, left/right wrist and left/right thumb (24 dimensions in total). The data is high dimensional and
characterized by long idle periods with small bursts of characteristic movements in every dimension. Here, the exact time instant of an event, e.g., thumbs up, is irrelevant for classification. To effectively deal with this kind of information, an MTSC has to deal with noise, irrelevant dimension data, and, most importantly, extract relevant features from each dimension.
In this paper, we introduce our novel domain agnostic MTSC method called WEASEL+MUSE (WEASEL plus Multivariate Unsupervised Symbols and dErivatives). WEASEL+MUSE conceptually builds on the bag-of-patterns (BOP) model and the WEASEL (Word ExtrAction for time SEries cLassification) pipeline for feature selection. The BOP model moves a sliding window over an MTS, extracts discrete features per window, and creates a histogram over feature counts. These histograms are subsequently fed into a machine learning classifier. However, the concrete way of constructing and filtering features in WEASEL+MUSE is different from state-of-the-art multivariate classifiers:

(1) Identifiers: WEASEL+MUSE adds a dimension (sensor) identifier to each extracted discrete feature. Thereby WEASEL+MUSE can discriminate between the presence of features in different dimensions - i.e., a left vs. right hand was raised.
Figure 2: Transformation of a TS into the Bag-of-Patterns (BOP) model using overlapping windows (second from top), discretization of windows to words (second from bottom), and word counts (bottom) (see [23]).

(2) Derivatives: To improve the accuracy, derivatives are added as features to the MTS. Those are the differences between neighbouring data points in each dimension. These derivatives represent the general shape and are invariant
to the exact value at a given time stamp. This information
can help to increase classification accuracy.
(3) Noise robust: WEASEL+MUSE derives discrete features from windows extracted from each dimension of the MTS using a truncated Fourier transform and discretization, thereby reducing noise.
(4) Interplay of features: The interplay of features along the dimensions is learned by assigning weights to features (using logistic regression), thereby boosting or dampening feature counts. Essentially, when two features from different dimensions are characteristic for the class label, these get assigned high weights, and their co-occurrence increases the likelihood of a class.
(5) Order invariance: A main advantage of the BOP model is its invariance to the order of the subsequences, as a result of using histograms over feature counts. Thus, two MTS are similar, if they show a similar number of feature occurrences rather than having the same values at the same time instances.
(6) Feature selection: The wide range of features considered by WEASEL+MUSE (dimensions, derivatives, unigrams, bigrams, and varying window lengths) introduces many non-discriminative features. Therefore, WEASEL+MUSE applies statistical feature selection and feature weighting to identify those features that best discern between classes. The aim of our feature selection is to prune the feature space to a level that feature weighting can be learned in reasonable time.

In our experimental evaluation on 20 public benchmark MTS datasets and a use case on motion capture data, WEASEL+MUSE is constantly among the most accurate methods. WEASEL+MUSE clearly outperforms all other classifiers except the very recent deep-learning-based method from [11]. Compared to the latter, WEASEL+MUSE performs better for small-sized datasets with fewer features or samples to use for training, such as sensor readings.

The rest of this paper is organized as follows: Section 2 briefly recaps bag-of-patterns classifiers and definitions. In Section 3 we present related work. In Section 4 we present WEASEL+MUSE's novel way of feature generation and selection. Section 5 presents evaluation results and Section 6 our conclusion.

2 BACKGROUND: TIME SERIES AND BAG-OF-PATTERNS
A univariate time series (TS) T = {t_1, ..., t_n} is an ordered sequence of n ∈ N real values t_i ∈ R. A multivariate time series (MTS) T = {t_1, ..., t_m} is an ordered sequence of m ∈ N streams (dimensions) with t_i = (t_{i,1}, ..., t_{i,n}) ∈ R^n. For instance, a stream of m interconnected sensors recording values at each time instant. As we primarily address MTS generated from automatic sensors with a fixed and synchronized sampling along all dimensions, we can safely ignore time stamps. A time series dataset D contains N time series. Note that we consider only MTS with numerical attributes (not categorical).

The derivative of a stream t_i = (t_{i,1}, ..., t_{i,n}) is given by the sequence of pairwise differences t_i' = (|t_{i,2} − t_{i,1}|, ..., |t_{i,n} − t_{i,n−1}|). Adding derivatives to an MTS T = {t_1, ..., t_m} of m streams effectively doubles the number of streams: T = {t_1, ..., t_m, t_1', ..., t_m'}. A minimal sketch of this step is given below.
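To make the definition concrete, a minimal sketch in Python (NumPy is used for illustration; the function name is ours, not from the WEASEL+MUSE code):

import numpy as np

def add_derivatives(mts: list) -> list:
    # the derivative of a stream t_i is the sequence of absolute
    # pairwise differences (|t_{i,2} - t_{i,1}|, ..., |t_{i,n} - t_{i,n-1}|);
    # appending one derivative per stream doubles the number of streams
    return mts + [np.abs(np.diff(t)) for t in mts]

# toy MTS with m = 2 streams of n = 5 values -> 4 streams
T = [np.array([1.0, 2.0, 4.0, 4.0, 3.0]),
     np.array([0.0, 0.5, 0.5, 1.5, 1.0])]
print(len(add_derivatives(T)))  # 4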
Given a univariate TS T, a window S of length w is a subsequence with w contiguous values starting at offset a in T, i.e., S(a, w) = (t_a, ..., t_{a+w−1}) with 1 ≤ a ≤ n − w + 1.

We associate each TS with a class label y ∈ Y from a predefined set of labels Y. Time series classification (TSC) is the task of predicting a class label for a TS whose label is unknown. A TS classifier is a function that is learned from a set of labelled time series (the training data), that takes an unlabelled time series as input and outputs a label.

Our method is based on the bag-of-patterns (BOP) model [14, 19, 20]. Algorithms following the BOP model build a classification function by (1) extracting subsequences from a TS, (2) discretizing each real-valued subsequence into a discrete-valued word (a sequence of symbols over a fixed alphabet), (3) building a histogram (feature vector) from word counts, and (4) finally using a classification model from the machine learning repertoire on these feature vectors.

Figure 2 illustrates these steps from a raw time series to a BOP model using overlapping windows. Overlapping subsequences of fixed length are extracted from a time series (second from top), each subsequence is discretized to a word (second from bottom), and finally a histogram is built over the word counts.

Different discretization functions have been used in literature, including SAX [13] and SFA [21]. SAX is based on the discretization of mean values and SFA is based on the discretization of coefficients of the Fourier transform.

In the BOP model, two TS are similar, if the subsequences have similar frequencies in both TS. Feature selection and weighting can be used to dampen or emphasize important subsequences, like in the WEASEL model [23]. A toy rendering of steps (1)-(3) follows below.
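The sketch uses a deliberately simple mean-value binning as a stand-in discretizer, not SAX or SFA; function names and the toy signal are ours:

from collections import Counter
import numpy as np

def toy_word(window: np.ndarray, symbols: str = "abcd") -> str:
    # stand-in discretizer: one symbol per third of the window, chosen
    # by the segment mean (SAX or SFA would be used in practice)
    bins = np.linspace(window.min(), window.max() + 1e-9, len(symbols) + 1)
    return "".join(symbols[np.digitize(seg.mean(), bins) - 1]
                   for seg in np.array_split(window, 3))

def bag_of_patterns(ts: np.ndarray, w: int) -> Counter:
    # (1) extract overlapping windows, (2) discretize each to a word,
    # (3) build a histogram (feature vector) of word counts
    return Counter(toy_word(ts[a:a + w]) for a in range(len(ts) - w + 1))

ts = np.sin(np.linspace(0, 8 * np.pi, 200))
print(bag_of_patterns(ts, w=30).most_common(3))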
3 RELATED WORK
Research in univariate TSC has a long tradition and dozens of approaches have been proposed; refer to [2, 7, 22] for a summary. The techniques used for TSC can broadly be categorized into two classes: (a) similarity-based (distance-based) methods and (b) feature-based methods.

Similarity-based methods make use of a similarity measure like Dynamic Time Warping (DTW) [18] to compare two TS. 1-Nearest Neighbour DTW is commonly used as a baseline in TSC comparisons [2]. In contrast, feature-based TSC rely on comparing features, typically generated from substructures of a TS. The most successful approaches are shapelets or bag-of-patterns (BOP). Shapelets are defined as TS subsequences that are maximally representative of a class [29]. The standard BOP model [14] breaks up a TS into windows, represents these as discrete features, and finally builds a histogram of feature counts as basis for classification.

In previous research we have studied the BOP model for univariate TSC. The BOSS (Bag-of-SFA-Symbols) [20] classifier is based on the (unsupervised) Symbolic Fourier Approximation (SFA) [21] to generate discrete features and uses a similarity measure on the histogram of feature counts. The WEASEL classifier [23] applies a supervised symbolic representation to transform subsequences to words, uses statistical feature selection, and subsequently feeds the words into a logistic regression classifier. WEASEL is among the most accurate and fastest univariate TSC [23]. WEASEL was optimized to extract discriminative words to ease classification of univariate TS. We observed that this led to an overall low accuracy for MTSC due to the increased number of possible features along all dimensions (see Section 5). WEASEL+MUSE was designed on the WEASEL pipeline, but adds sensor identifiers to each word and generates unsupervised discrete features to minimize overfitting, as opposed to WEASEL, which uses a supervised transformation. WEASEL+MUSE further adds derivatives (differences between all neighbouring points) to the feature space to increase accuracy.

For multivariate time series classification (MTSC), the most basic approach is to apply rigid dimensionality reduction (i.e., PCA) or simply concatenate all dimensions of the MTS to obtain a univariate TS and use proven univariate TSC. Some domain agnostic MTSC methods have been proposed.

Symbolic Representation for Multivariate Time series (SMTS) [3] uses codebook learning and the bag-of-words (BOW) model for classification. First, a random forest is trained on the raw MTS to partition the MTS into leaf nodes. Each leaf node is then labelled by a word of a codebook. There is no additional feature extraction, apart from calculating derivatives for the numerical dimensions (first order differences). For classification a second random forest is trained on the BOW representation of all MTS.

Ultra Fast Shapelets (UFS) [27] applies the shapelet discovery method to MTS classification. The major limiting factor for shapelet discovery is the time to find discriminative subsequences, which becomes even more demanding when dealing with MTS. UFS solves this by extracting random shapelets. On this transformed data, a linear SVM or a Random Forest is trained. Unfortunately, the code is not available to allow for reproducibility.

Generalized Random Shapelet Forests (gRSF) [12] also generates a set of shapelet-based decision trees over randomly extracted shapelets. In their experimental evaluation, gRSF was the best MTSC when compared to SMTS, LPS and UFS on 14 MTS datasets. Thus, we use gRSF as a representative for random shapelets.

Learned Pattern Similarity (LPS) [4] extracts segments from an MTS. It then trains regression trees to identify structural dependencies between segments. The regression trees trained in this manner represent a non-linear AR model. LPS next builds a BOW representation based on the labels of the leaf nodes similar to SMTS. Finally a similarity measure is defined on the BOW representations of the MTS. LPS showed better performance than DTW in a benchmark using 15 MTS datasets. Autoregressive (AR) Kernel [5] proposes an AR kernel-based distance measure for MTSC.

Autoregressive forests for multivariate time series modelling (mv-ARF) [25] proposes a tree ensemble trained on autoregressive models, each one with a different lag, of the MTS. This model is used to capture linear and non-linear relationships between features in the dimensions of an MTS. The authors compared mv-ARF to AR Kernel, LPS and DTW on 19 MTS datasets. mv-ARF and AR kernel showed the best results. mv-ARF performs well on motion recognition data. AR kernel outperformed the other methods for sensor readings.

At the time of writing this paper, Multivariate LSTM-FCN [11] was proposed that introduces a deep learning architecture based on a long short-term memory (LSTM), a fully convolutional network
(FCN) and a squeeze and excitation block. Their method is compared to state-of-the-art and shows the overall best results.

4 WEASEL+MUSE
We present our novel method for domain agnostic multivariate time series classification (MTSC) called WEASEL+MUSE (WEASEL+Multivariate Unsupervised Symbols and dErivatives). WEASEL+MUSE addresses the major challenges of MTSC in a specific manner (using gesture recognition as an example):

(1) Interplay of dimensions: MTS are not only characterized by individual features at a single time instance, but also by the interplay of features in different dimensions. For example, to predict a hand gesture, a complex orchestration of interactions between hand, finger and elbow may have to be considered.
(2) Phase invariance: Relevant events in an MTS do not necessarily reappear at the same time instances in each dimension. Thus, characteristic features may appear anywhere in an MTS (or not at all). For example, a hand gesture should allow for considerable differences in time schedule.
(3) Invariance to irrelevant dimensions: Only small periods in time and in some streams may contain relevant information for classification. What makes things even harder is the fact that whole sensor streams may be irrelevant for classification. For instance, a movement of a leg is irrelevant to capture hand gestures and vice versa.

We engineered WEASEL+MUSE to address these challenges. Our method conceptually builds on our previous work on the bag-of-patterns (BOP) model and univariate TSC [20, 23], yet uses a different approach in many of the individual steps to deal with the aforementioned challenges. We will use the terms feature and word interchangeably throughout the text. In essence, WEASEL+MUSE makes use of a histogram of feature counts. In this feature vector it captures information about local and global changes in the MTS along different dimensions. It then learns weights to boost or dampen characteristic features. The interplay of features is represented by high weights.

4.1 Overview
We first give an overview of our basic idea and an example how we deal with the challenges described above. In WEASEL+MUSE a feature is represented by a word that encodes the identifiers (sensor id, window size, and discretized Fourier coefficients) and counts its occurrences. Figure 3 shows an example for the WEASEL+MUSE model of a fixed window length 15 on motion capture data. The data has 3 dimensions (x, y, z coordinates). The feature ('3 15 ad', 2) (see Figure 3 (b)) represents the unigram 'ad' for the z-dimension with window length 15 and frequency 2, and the feature ('2 15 bd ad', 2) represents the bigram 'bd ad' for the y-dimension with length 15 and frequency 2. A toy encoding of such feature keys is sketched below.
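The exact string format below is an assumption for illustration; WEASEL+MUSE only requires that keys from different dimensions and window sizes never collide:

from collections import Counter

def muse_key(dim_id: int, window_len: int, word: str, prev_word: str = None) -> str:
    # prefix each word with its sensor id and window size, so that
    # identical words from different dimensions stay distinct features
    if prev_word is not None:
        return f"{dim_id} {window_len} {prev_word} {word}"  # bigram
    return f"{dim_id} {window_len} {word}"                  # unigram

bag = Counter()
bag[muse_key(3, 15, "ad")] += 2                   # ('3 15 ad', 2): unigram, z-dimension
bag[muse_key(2, 15, "ad", prev_word="bd")] += 2   # ('2 15 bd ad', 2): bigram, y-dimension
print(bag)  # Counter({'3 15 ad': 2, '2 15 bd ad': 2})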
Pipeline: WEASEL+MUSE is composed of the building blocks depicted in Figure 4: the symbolic representation SFA [21], BOP models for each dimension, feature selection and the WEASEL+MUSE model. WEASEL+MUSE conceptually builds upon the univariate BOP model applied to each dimension. Multivariate words are obtained from the univariate words of each BOP model by concatenating each word with an identifier (representing the sensor and the window size). This maintains the association between the dimension and the feature space.

More precisely, an MTS is first split into its dimensions. Each dimension can now be considered as a univariate TS and transformed using the classical BOP approach. To this end, z-normalized windows of varying lengths are extracted. Next, each window is approximated using the truncated Fourier transform, keeping only lower frequency components of each window. Fourier values (real and imaginary part separately) are then discretized into words based on equi-depth or equi-frequency binning using a symbolic transformation (details will be given in Subsection 4.2). Thereby, words (unigrams) and pairs of words (bigrams) with varying window lengths are computed. These words are concatenated with their identifiers, i.e., the sensor id (dimension) and the window length. Thus, WEASEL+MUSE keeps a disjoint word space for each dimension and two words from different dimensions can never coincide. To deal with irrelevant features and dimensions, a Chi-squared test is applied to all multivariate words (Subsection 4.4). As a result, a highly discriminative feature vector is obtained and a fast linear-time logistic regression classifier can be trained (Subsection 4.4). It further captures the interplay of features in different dimensions by learning high weights for important features in each dimension (Subsection 4.5).

4.2 Word Extraction: Symbolic Fourier Approximation
Instead of training a multivariate symbolic transformation, we train and apply the univariate symbolic transformation SFA to each dimension of the MTS separately. This allows for (a) phase invariance between different dimensions, as a separate BOP model is built for each dimension, but (b) the information that two features occurred at exactly the same time instant in two different dimensions is lost. Semantically, splitting an MTS into its dimensions results in two MTS T1 and T2 being similar if both share similar substructures within the i-th dimension at arbitrary time stamps.

SFA transforms a real-valued TS window to a word using an alphabet of size c as in [21]:

(1) Approximation: Each normalized window of length w is subjected to dimensionality reduction by the use of the truncated Fourier transform, keeping only the first l ≪ w coefficients for further analysis. This step acts as a low-pass (noise) filter, as higher-order Fourier coefficients typically represent rapid changes like drop-outs or noise.
(2) Quantization: Each Fourier coefficient is then discretized to a symbol of an alphabet of fixed size c, which in turn achieves further robustness against noise.

Figure 5 exemplifies this process for a univariate time series, resulting in the word ABDDABBB. As a result, each real-valued window in the i-th dimension is transformed into a word of length l with an alphabet of size c. For a given window length, there are a maximum of O(n) windows in each of the m dimensions, resulting in a total of O(n × m) words. A compact sketch of these two steps is given below.
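The sketch assumes equi-depth binning learned from training windows; normalization details and the handling of the DC coefficient follow [21] and are simplified here:

import numpy as np

def dft_features(window: np.ndarray, l: int) -> np.ndarray:
    # (1) Approximation: keep the first l/2 complex Fourier coefficients,
    # stored as l real values (real and imaginary parts separately)
    coeffs = np.fft.rfft(window)[: l // 2]
    return np.concatenate([coeffs.real, coeffs.imag])

def fit_equi_depth_bins(train_windows, l: int, c: int = 4) -> np.ndarray:
    # per coefficient, choose c-1 boundaries holding an equal number of
    # training values (equi-width binning would split the range instead)
    feats = np.array([dft_features(w, l) for w in train_windows])
    return np.percentile(feats, np.linspace(0, 100, c + 1)[1:-1], axis=0)

def sfa_word(window: np.ndarray, bins: np.ndarray, alphabet: str = "abcd") -> str:
    # (2) Quantization: map each coefficient to the symbol of its bin
    feats = dft_features(window, bins.shape[1])
    return "".join(alphabet[np.searchsorted(bins[:, i], v)]
                   for i, v in enumerate(feats))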
Figure 3: WEASEL+MUSE model of a motion capture. (a) motion of a left hand in x/y/z coordinates. (b) the WEASEL+MUSE model for each of these coordinates. A feature in the WEASEL+MUSE model encodes the dimension, window length and actual word, e.g., '1 15 aa' for 'left hand', window length 15 and word 'aa'.

Figure 4: WEASEL+MUSE Pipeline: Feature extraction, univariate Bag-of-Patterns (BOP) models and WEASEL+MUSE.

SFA is a data-adaptive symbolic transformation, as opposed to SAX [13], which always uses the same set of bins irrespective of the data distribution. Quantization boundaries are derived from a (sampled) train dataset using either (a) equi-depth or (b) equi-frequency binning, such that (a) the Fourier frequency range is divided into equal-sized bins or (b) the boundaries are chosen to hold an equal number of Fourier values. SFA is trained for each dimension separately, resulting in m SFA transformations. Each SFA transformation is then used to transform only its dimension of the MTS.

4.3 Univariate Bag-of-Patterns: Unigrams, bigrams, derivatives, window lengths
In the BOP model, two TS are distinguished by the frequencies of certain subsequences rather than their presence or absence. A TS is represented by word counts, obtained from the windows of the time series. BOP-based methods have a number of parameters, and of particular importance is the window length, which heavily influences its performance. For dealing with MTS, we have to find the best window lengths for each dimension, as one cannot assume that there is a single optimal value for all dimensions.
WEASEL+MUSE addresses this issue by building a large feature space using multiple window lengths, the MTS dimensions, unigrams, bigrams, and derivatives. This very large feature space is aggressively reduced in a second, separate step (Subsection 4.4).

Figure 5: The Symbolic Fourier Approximation (SFA): A time series (left) is approximated using the truncated Fourier transform (centre) and discretized to the word ABDDABBB (right) with the four-letter alphabet ('a' to 'd'). The inverse transform is depicted by an orange area (right), representing the tolerance for all signals that will be mapped to the same word.

The feature set of WEASEL+MUSE, given an MTS T = (t_1, ..., t_m), is composed of (see also Section 4.4):
(1) Derivatives: Derivatives are added to the MTS. These are the differences between all neighbouring points in one dimension (see Section 2). This captures information about how much a signal changes in time. It has been shown that this additional information can improve the accuracy [3]. We show the utility of derivatives in Section 5.6.
(2) Local and Global Substructures: For each possible window length w ∈ [4, ..., len(t_i)], windows are extracted from the dimensions and the derivatives, and each window is transformed to a word using the SFA transformation. This helps to capture both local and global patterns in an MTS.
(3) Unigrams and Bigrams: Once we have extracted all words (unigrams), we enrich this feature space with co-occurrences of words (bigrams). It has been shown in [23] that the usage of bigrams reduces the order-invariance of the BOP model. We could include m-grams, but the feature space grows polynomially in the m-gram number, such that it is infeasible to use anything larger than bigrams (resulting in O(n^2) features).
(4) Identifiers: Each word is concatenated with its sensor id and window size (see Figure 3). It is rather meaningless to compare features from different sensors: if a temperature sensor measures 10 and a humidity sensor measures 10, these capture totally different concepts. To distinguish between sensors, the features are appended with sensor ids, e.g., (temp: 10) and (humid: 10). However, both measurements can be important for classification. Thus, we add them to the same feature vector and use feature selection and feature weights to identify the important ones.

Algorithm 1: Build one BOP model using SFA, multiple window lengths, bigrams and the Chi-squared test for feature selection. l is the number of Fourier values to keep and wLen are the window lengths used for sliding window extraction.

 1 function WEASEL_MUSE(mts, l, wLen)
 2   bag = empty BagOfPattern
 3   // extract from each dimension
 4   for each dimId in mts:
 5     // use multiple window lengths
 6     for each window-length w in wLen:
 7       for each (prevWindow, window) in SLIDING_WINDOWS(mts[dimId], w):
 8
 9         // BOP computed from unigrams
10         word = SFA(window, l)
11         unigram = concat(dimId, w, word)
12         bag[unigram].increaseCounts()
13
14         // BOP computed from bigrams
15         prevWord = SFA(prevWindow, l)
16         bigram = concat(dimId, w, prevWord, word)
17         bag[bigram].increaseCounts()
18
19   // feature selection using ChiSquared
20   return CHI_SQUARED_FILTER(bag)

Pseudocode: Algorithm 1 illustrates WEASEL+MUSE: sliding windows of length w are extracted in each dimension (line 7). We empirically set the window lengths to all values in [4, ..., n]. Smaller values are possible, but the feature space can become intractable, and small window lengths are basically meaningless for TS of length > 10^3. The SFA transformation is applied to each real-valued sliding window (lines 10 and 15). Each word is concatenated with the window length and dimension identifier, and its occurrence is counted (lines 12 and 17). Lines 15-17 illustrate the use of bigrams: the preceding sliding window is concatenated with the current window. Note that all words (each dimension, unigrams, bigrams, each window length) are joined within a single bag-of-patterns. Finally, irrelevant words are removed from this bag-of-patterns using the Chi-squared test (line 20). The target SFA length l is learned through cross-validation. A runnable rendering of this loop is sketched below.
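In this Python rendering, sfa_word stands in for the trained, per-dimension SFA transformation of Section 4.2, and the Chi-squared filtering is applied afterwards as a separate step:

from collections import Counter

def weasel_muse_bag(mts, l, w_lens, sfa_word) -> Counter:
    # one bag-of-patterns per MTS (a list of 1-d numpy arrays);
    # sfa_word(window, l, dim_id) is assumed to be the trained SFA
    # transformation of the respective dimension
    bag = Counter()
    for dim_id, ts in enumerate(mts):                  # each dimension
        for w in w_lens:                               # each window length
            words = [sfa_word(ts[a:a + w], l, dim_id)  # each sliding window
                     for a in range(len(ts) - w + 1)]
            for i, word in enumerate(words):
                bag[f"{dim_id} {w} {word}"] += 1                     # unigram
                if i > 0:
                    bag[f"{dim_id} {w} {words[i - 1]} {word}"] += 1  # bigram
    return bag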
4.4 Feature Selection and Weighting: Chi-squared Test and Logistic Regression
WEASEL+MUSE applies the Chi-squared (χ²) test to identify the most relevant features; only features passing a certain threshold are kept to reduce this feature space prior to training the classifier. We set the threshold such that it is high enough for the logistic regression classifier to train a model in reasonable time (when set too low, training takes longer). If a feature is irrelevant but not removed due to the χ²-test, it will get assigned a low weight by the logistic regression classifier. It would be possible to use different feature selection methods. As our main aim is to reduce the runtime for training, we did not look into other feature selection techniques.

For a set of N m-dimensional MTS of length n, the size of the BOP feature space is O(min(Nn², c^l) × m) for word length l, c symbols and m dimensions. The number of MTS N and length n affects the actual word frequencies. But in the worst case each TS window can only produce one distinct word, and there are Nn² windows in each dimension. WEASEL+MUSE further uses bigrams, derivatives, and O(n) window lengths. WEASEL+MUSE keeps a disjoint word space for each dimension and window length, thus two words from different dimensions can never collide (no false positives). Thus, the theoretical dimensionality of this feature space rises to O(min(Nn², c^{2l} · n) × m). Essentially, the feature space can grow quadratically with the number of observations of an MTS, if every observation generates a distinct word. However, in practice we never observed that many features due to the periodicity of TS or superfluous data/dimensions. Statistical feature selection reduces the total number of features to just a few hundred features.

We use sparse vectors to store the words for each MTS, as each feature vector only contains a few features after feature selection. We implemented our MTS classifier using liblinear [8] as it scales linearly with the dimensionality of the feature space [17]. A sketch of this selection-plus-weighting step with off-the-shelf tools is given below.
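The sketch uses scikit-learn as a stand-in for the liblinear-based implementation; the percentile threshold and the toy bags are assumptions for illustration:

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# one word-count dictionary per MTS, as produced by Algorithm 1 (toy data)
bags = [{"1 15 aa": 3, "2 15 bd ad": 1},
        {"1 15 ab": 2, "3 15 ad": 2},
        {"1 15 aa": 2, "3 15 ad": 1}]
y = np.array([0, 1, 0])

clf = make_pipeline(
    DictVectorizer(),                           # sparse bag-of-patterns vectors
    SelectPercentile(chi2, percentile=50),      # Chi-squared feature selection
    LogisticRegression(C=5.0, max_iter=1000),   # learns per-class feature weights
)
clf.fit(bags, y)
print(clf.predict([{"1 15 aa": 2}]))  # class 0 expected on this toy data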
4.5 Feature Interplay
The WEASEL+MUSE model is essentially a histogram of discrete features extracted from all dimensions. The logistic regression classifier trains a weight vector for each class to assign high weights to those features that are relevant within each dimension. Thereby, it captures the feature interplay, as dimensions are not treated separately but the weight vector is trained over all dimensions. Still, this approach allows for phase invariance, as classes (events) are represented by the frequency of occurrence of discrete features rather than the exact time instance of an event.

Table 1: 20 multivariate time series datasets collected from [15].

Dataset             #classes   m    n          N Train   N Test
Digits              10         13   4-93       6600      2200
AUSLAN              95         22   45-136     1140      1425
CharTrajectories    20         3    109-205    300       2558
CMUsubject16        2          62   127-580    29        29
DigitShapes         4          2    30-98      24        16
ECG                 2          2    39-152     100       100
Japanese Vowels     9          12   7-29       270       370
KickvsPunch         2          62   274-841    16        10
LIBRAS              15         2    45         180       180
Robot Failure LP1   4          6    15         38        50
Robot Failure LP2   5          6    15         17        30
Robot Failure LP3   4          6    15         17        30
Robot Failure LP4   3          6    15         42        75
Robot Failure LP5   5          6    15         64        100
NetFlow             2          4    50-997     803       534
PenDigits           10         2    8          300       10692
Shapes              3          2    52-98      18        12
UWave               8          3    315        200       4278
Wafer               2          6    104-198    298       896
WalkvsRun           2          62   128-1918   28        16

5 EVALUATION

5.1 Experimental Setup
• Datasets: We evaluated our WEASEL+MUSE classifier using 20 publicly available MTS datasets listed in Table 1. Furthermore, we compared its performance on a real-life dataset taken from the motion capture domain; results are reported in Section 5.7. Each MTS dataset provides a train and test split set which we use unchanged to make our results comparable to prior publications.
• Competitors: We compare WEASEL+MUSE to the 7 domain agnostic state-of-the-art MTSC methods we are aware of: ARKernel [5], LPS [4], mv-ARF [25], SMTS [3], gRSF [12], MLSTM-FCN [11], and the common baseline Dynamic Time Warping independent (DTWi), implemented as the sum of DTW distances in each dimension with a full warping window (a reference sketch is given at the end of this subsection). We use the reported test accuracies on the MTS datasets given by the authors in their publications, thereby avoiding any bias in the training parameter settings. All reported numbers in our experiments correspond to the accuracy on the test split. We were not able to reproduce the published results for MLSTM-FCN using their code. The authors told us that this is due to random seeding and their results are based on a single run. Instead, we report the median over 5 runs using their published code [11]. For SMTS and gRSF, we additionally ran their code on the missing 5 and 7 datasets. The code for UFS is not available, thus we did not include it into the experiments. The webpage¹ lists state-of-the-art univariate TSC. However, we cannot use univariate TSC to classify multivariate datasets.
• Server: The experiments were carried out on a server running LINUX with 2x Intel Xeon E5-2630v3 and 64GB RAM, using JAVA JDK x64 1.8.
• Training WEASEL+MUSE: For WEASEL+MUSE we performed 10-fold cross-validation on the train datasets to find the most appropriate parameters for the SFA word lengths l ∈ [2, 4, 6] and SFA quantization method equi-depth or equi-frequency binning. All other parameters are constant: chi = 2, as we observed that varying these values has only a negligible effect on the accuracy. We used liblinear with default parameters (bias = 1, p = 0.1, c = 5 and solver L2R_LR_DUAL). To ensure reproducible results, we provide the WEASEL+MUSE source code and the raw measurement sheets [26].

¹ http://www.timeseriesclassification.com
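For reference, a sketch of the DTWi baseline as defined above (full warping window, no lower-bound pruning; the cascading lower bounds of [18] are omitted, and function names are ours):

import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    # classic O(n*m) dynamic program with a full warping window
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

def dtw_independent(x, y) -> float:
    # DTWi: the sum of univariate DTW distances over all dimensions
    return sum(dtw(xd, yd) for xd, yd in zip(x, y))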
Dataset SMTS LPS mvARF DTWi ARKernel gRSF MLSTMFCN MUSE


ArabicDigits 96.4% 97.1% 95.2% 90.8% 98.8% 97.5% 99.0% 99.2%
AUSLAN 94.7% 75.4% 93.4% 72.7% 91.8% 95.5% 95.0% 97%
CharTrajectories 99.2% 96.5% 92.8% 94.8% 90% 99.4% 99.0% 97.3%
CMUsubject16 99.7% 100% 100% 93% 100% 100% 100% 100%
ECG 81.8% 82% 78.5% 79% 82% 88% 87% 88%
JapaneseVowels 96.9% 95.1% 95.9% 96.2% 98.4% 80% 100% 97.6%
KickvsPunch 82% 90% 97.6% 60% 92.7% 100% 90% 100%
Libras 90.9% 90.3% 94.5% 88.8% 95.2% 91.1% 97% 89.4%
NetFlow 97.7% 96.8% NaN 97.6% NaN 91.4% 95% 96.1%
UWave 94.1% 98% 95.2% 91.6% 90.4% 92.9% 97% 91.6%
Wafer 96.5% 96.2% 93.1% 97.4% 96.8% 99.2% 99% 99.7%
WalkvsRun 100% 100% 100% 100% 100% 100% 100% 100%
LP1 85.6% 86.2% 82.4% 76% 86% 84% 80% 94%
LP2 76% 70.4% 62.6% 70% 63.4% 66.7% 80% 73.3%
LP3 76% 72% 77% 56.7% 56.7% 63.3% 73% 90%
LP4 89.5% 91% 90.6% 86.7% 96% 86.7% 89% 96%
LP5 65% 69% 68% 54% 47% 45% 65% 69%
PenDigits 91.7% 90.8% 92.3% 92.7% 95.2% 93.2% 97% 91.2%
Shapes 100% 100% 100% 100% 100% 100% 100% 100%
DigitShapes 100% 100% 100% 93.8% 100% 100% 100% 100%
Wins/Ties 4 6 4 2 5 6 8 13
Mean 90.7% 89.8% 90% 84.6% 88.4% 88.7% 92.1% 93.5%
Avg. Rank 4.05 4.05 4.7 6.6 4.35 3.85 2.75 2.45
Table 2: Accuracies for each dataset. The best approaches are highlighted using a bold font.

5.2 Accuracy
Figure 6 shows a critical difference diagram (as introduced in [6]) over the average ranks of the different MTSC methods. Classifiers with the lowest (best) ranks are to the right. The group of classifiers that are not significantly different in their rankings are connected by a bar. The critical difference (CD) length at the top represents statistically significant differences.

Figure 6: Average ranks on the 20 MTS datasets. WEASEL+MUSE and MLSTM-FCN are the most accurate (average ranks: WEASEL+MUSE 2.45, MLSTM-FCN 2.75, gRSF 3.85, SMTS 4.05, LPS 4.05, ARKernel 4.35, mv-ARF 4.7, WEASEL 6.05, DTWi 6.6).

MLSTM-FCN and WEASEL+MUSE show the lowest overall ranks and are in the group of best classifiers. These are also significantly better than the baseline DTWi. When compared to the plain WEASEL classifier, we can see that the MUSE extension to WEASEL also leads to significantly better ranks (6.05 vs 2.45). This is a result of using feature identifiers and using derivatives.

Overall, WEASEL+MUSE has 12 wins (or ties) on the MTS datasets (Table 2), which is the highest of all classifiers. With a mean of 93.5% it shows a similar average accuracy as MLSTM-FCN with mean accuracy 92.1%.

In the next section we look into the differences between MLSTM-FCN and WEASEL+MUSE and identify the domains for which each classifier is best suited.

5.3 Domain-dependent strength or limitation
We studied the individual accuracy of each method on the 20 different MTS datasets, and grouped datasets by domain (Handwriting, Motion Sensors, Sensor Readings, Speech) to test if our method has a domain-dependent strength or limitation. Figure 7 shows the accuracies of WEASEL+MUSE (orange line), MLSTM-FCN (black line) vs. the other six MTS classifiers (green area).

Overall, the performance of WEASEL+MUSE is very competitive for all datasets. The black line is mostly very close to the upper outline of the orange area, indicating that WEASEL+MUSE's performance is close to that of its best competitor in many cases. In total WEASEL+MUSE has 12 out of 20 possible wins (or ties). WEASEL+MUSE has the highest percentage of wins in the groups of motion sensors, followed by speech and handwriting.

WEASEL+MUSE and MLSTM-FCN perform similarly on many dataset domains. WEASEL+MUSE performs best for sensor reading datasets and MLSTM-FCN performs best for motion and speech datasets. Sensor readings are the datasets with the least number
Figure 7: Classification accuracies on the 20 MTS datasets for WEASEL+MUSE (orange), MLSTM-FCN (black) vs six state-of-the-art MTSC, with datasets ordered by domain (Handwriting, Motions, Sensor Readings, Speech). The green area represents the classifiers' accuracies.

of samples N or features n in the range of a few dozens. On the other hand, speech and motion datasets contain the most samples or features in the range of hundreds to thousands.

This might indicate that WEASEL+MUSE performs well, even for small-sized datasets, whereas MLSTM-FCN seems to require larger training corpora to be most accurate. Furthermore, WEASEL+MUSE is based on the BOP model that compares signals based on the frequency of occurrence of subsequences rather than their absence or presence. Thus, signals with some repetition profit from using WEASEL+MUSE, such as ECG-signals.

5.4 Effects of Gaussian Noise on Classification Accuracy
WEASEL+MUSE applies the truncated Fourier Transform and discretization to generate features. This acts as a low-pass filter. To illustrate the relevance of noise to the classification task, we performed another experiment on the two multivariate synthetic datasets Shapes and DigitShapes.

First, all dimensions of each dataset were z-normalised to have a standard deviation (SD) of 1. We then added increasing Gaussian noise with a SD of 0 to 1.0 to each dimension, equal to noise levels of 0% to 100%. A sketch of this noise injection is given below.
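Function and variable names in this sketch are ours; the synthetic MTS is a stand-in for the Shapes/DigitShapes data:

import numpy as np

def add_gaussian_noise(mts: np.ndarray, level: float, seed: int = 0) -> np.ndarray:
    # z-normalise each dimension to SD 1, then add Gaussian noise with
    # standard deviation `level` in [0, 1], i.e. noise levels of 0%-100%
    rng = np.random.default_rng(seed)
    z = (mts - mts.mean(axis=1, keepdims=True)) / mts.std(axis=1, keepdims=True)
    return z + rng.normal(scale=level, size=z.shape)

# example: a synthetic 2-dimensional MTS at rising noise levels
mts = np.vstack([np.sin(np.linspace(0, 6 * np.pi, 100)),
                 np.cos(np.linspace(0, 6 * np.pi, 100))])
for level in (0.0, 0.5, 1.0):
    print(level, add_gaussian_noise(mts, level).std(axis=1).round(2))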
Figure 8 shows the two classifiers DTWi and WEASEL+MUSE. For DTWi the classification accuracy drops by up to 30 percentage points for increasing levels of noise. At the same time, WEASEL+MUSE was robust to Gaussian noise and its accuracy remains stable up to a noise level of 100%.

5.5 Relative Prediction Times
In addition to achieving state-of-the-art accuracy, WEASEL+MUSE is also competitive in terms of prediction times. In this experiment, we compare WEASEL+MUSE to DTWi. We could not perform a meaningful comparison to the other competitors, as we either do not have the source codes or the implementation is given in a different language (R, Matlab).

In general, 1-NN DTW has a computational complexity of O(Nn²) for TS of length n. For the implementation of DTWi we make use of the state-of-the-art cascading lower bounds from [18]. In this experiment, we measure CPU time to address parallel and single-threaded codes. The DTWi prediction time is reported relative to that of WEASEL+MUSE, i.e., a number lower than 1 means that DTW is faster than WEASEL+MUSE. 1-NN DTW is a nearest neighbour classifier, so its prediction times directly depend on the size of the train dataset. For WEASEL+MUSE the length of the time series n and the number of dimensions m are most important.

For all but three datasets WEASEL+MUSE is faster than DTWi. It is up to 400 times faster for the Robot Failure LP1 dataset and 200 times faster for ArabicDigits. On average it is 43 times faster than DTWi. There are three datasets for which DTW is faster: WalkvsRun, KickvsPunch and CMU-MOCAP. These are the datasets with
Figure 8: Effects of Gaussian noise on classification accuracy on DigitShapes and Shapes. With increasing levels of noise added, the accuracy of DTW drops, while it remains stable for WEASEL+MUSE.

the highest number of dimensions (m = 62). Thus, WEASEL+MUSE is not only significantly more accurate than DTWi but also orders of magnitude faster.

Table 3: Relative and absolute prediction times (lower is always better) of WEASEL+MUSE compared to DTWi.

                     Relative Time       Absolute Time in ms
Dataset              DTWi     MUSE       DTWi         MUSE
ArabicDigits         206.3    1          10509055     50952
AUSLAN               42.4     1          3040070      71737
CharTrajectories     7.2      1          1153620      161104
CMU MOCAP            0.4      1          162131       410387
DigitShapes          16.7     1          2194         131
ECG                  2.1      1          4725         2228
Japanese Vowels      34.1     1          15588        457
KickvsPunch          0.1      1          31761        256406
LIBRAS               9.7      1          3399         350
Robot Failure LP1    413.7    1          28961        70
Robot Failure LP2    1.2      1          53           43
Robot Failure LP3    1.4      1          51           36
Robot Failure LP4    10.1     1          689          68
Robot Failure LP5    20.4     1          1630         80
PenDigits            45.6     1          21409        469
Shapes               2.3      1          264          116
UWave                4.5      1          5724667      1262706
Wafer                8.8      1          704649       79945
WalkvsRun            0.3      1          67350        242181
Average              43.3     1          1130119      133656

5.6 Influence of Design Decisions on Accuracy
We next look into the impact of several design decisions of the WEASEL+MUSE classifier. Figure 9 shows the average ranks of the WEASEL+MUSE classifier on the 20 MTS datasets where each of the following extensions is disabled or enabled:

(1) Multivariate vs univariate: A key design decision of WEASEL+MUSE is to keep sensor ids for each word. The opposite is to treat all dimensions equally, i.e., concatenate all dimensionality information and treat the data like a univariate TS classification problem.
(2) Derivatives vs raw TS: Following [27] and [3], we have added derivatives for each dimension to add trend information.

Figure 9: Impact of the design decisions of the WEASEL+MUSE classifier on accuracy (average ranks: WEASEL+MUSE 1.35; multivariate, no derivatives 1.4; univariate, derivatives 3; WEASEL, i.e. univariate, no derivatives, 3.1).

The univariate without derivatives approach is in concept similar to WEASEL (without MUSE). There is a big gap between the multivariate and univariate models of WEASEL+MUSE. The univariate approaches are the least accurate, as the association of features to sensors is lost. The use of derivatives results in a slightly better score. Both extensions (multivariate and derivatives) combined improve ranks the most.

5.7 Use Case: Motion Capture Data (Kinect)
This real-world dataset was part of a challenge for gesture recognition [1] and represents isolated gestures of different users captured by a Kinect camera system. The task was to predict the gestures performed by the users.

The dataset consists of 180 labelled train and 180 test MTS. The labels on the test set were not revealed. A total of 8 sensors were used to record x, y, z coordinates with a total of 51 time instances, i.e., an MTS with 24 streams of 51 values each. The sensors are placed at the left/right hand, left/right elbow, left/right wrist and left/right thumb (see Figure 1 for an example gesture).

WEASEL+MUSE (alias MWSL) scored 171 correct predictions, which is equal to a test accuracy of 95%. The winning approach scored 173 (96.1%), based on an ensemble of random shapelets and domain-specific feature extraction, and the SMTS [3] classifier
scored 172 (95.6%). This challenge underlined that WEASEL+MUSE [13] Jessica Lin, Eamonn J. Keogh, Li Wei, and Stefano Lonardi. 2007. Experiencing
is applicable out-of-the-box to real-world use cases and competitive SAX: a novel symbolic representation of time series. Data Mining and knowledge
discovery 15, 2 (2007), 107–144.
with domain-specific tailored approaches. Motion captured data [14] Jessica Lin, Rohan Khade, and Yuan Li. 2012. Rotation-invariant similarity in time
is characterized by noisy data, that contains many superfluous series using bag-of-patterns representation. Journal of Intelligent Information
Systems 39, 2 (2012), 287–315.
information among dimensions. By design WEASEL+MUSE is able [15] Mustafa Gokce Baydogan. 2017. Multivariate Time Series Classification Datasets.
to deal with this kind of data effectively. http://www.mustafabaydogan.com. (2017).
[16] Christopher Mutschler, Holger Ziekow, and Zbigniew Jerzak. 2013. The DEBS
2013 grand challenge. In Proceedings of the 2013 ACM International Conference
6 CONCLUSION on Distributed Event-based Systems. ACM, 289–294.
In this work, we have presented WEASEL+MUSE, a novel multivari- [17] Andrew Y Ng. 2004. Feature selection, L 1 vs. L 2 regularization, and rotational
invariance. In Proceedings of the 2004 ACM International Conference on Machine
ate time series classification method following the bag-of-pattern Learning. ACM, 78.
approach and achieving highly competitive classification accuracies. [18] Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista,
The novelty of WEASEL+MUSE is its feature space engineering Brandon Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh. 2012. Search-
ing and mining trillions of time series subsequences under dynamic time warping.
using statistical feature selection, derivatives, variable window In Proceedings of the 2012 ACM SIGKDD International Conference on Knowledge
lengths, bi-grams, and a symbolic representation for generating Discovery and Data Mining. ACM, 262–270.
[19] Patrick Schäfer. 2015. Scalable time series classification. Data Mining and
discriminative words. WEASEL+MUSE provides tolerance to noise Knowledge Discovery (2015), 1–26.
(by use of the truncated Fourier transform), phase invariance, and [20] Patrick Schäfer. 2015. The BOSS is concerned with time series classification
superfluous data/dimensions. Thereby, WEASEL+MUSE assigns in the presence of noise. Data Mining and Knowledge Discovery 29, 6 (2015),
1505–1530.
high weights to characteristic, local and global substructures along [21] Patrick Schäfer and Mikael Högqvist. 2012. SFA: a symbolic fourier approxima-
dimensions of a multivariate time series. In our evaluation on al- tion and index for similarity search in high dimensional datasets. In Proceedings
together 21 datasets, WEASEL+MUSE is consistently among the of the 2012 International Conference on Extending Database Technology. ACM,
516–527.
most accurate classifiers and outperforms state-of-the-art similar- [22] Patrick Schäfer and Ulf Leser. 2017. Benchmarking Univariate Time Series
ity measures or shapelet-based approaches. It performs well even Classifiers. In BTW 2017. 289–298.
[23] Patrick Schäfer and Ulf Leser. 2017. Fast and Accurate Time Series Classification
for small-sized datasets, where deep learning based approaches with WEASEL. Proceedings of the 2017 ACM on Conference on Information and
typically tend to perform poorly. When looking into application Knowledge Management (2017), 637–646.
domains, it is best for sensor readings, followed by speech, motion [24] The Value of Wind Power Forecasting. 2016. http://www.nrel.gov/docs/fy11osti/
50814.pdf. (2016).
and handwriting recognition tasks. [25] Kerem Sinan Tuncel and Mustafa Gokce Baydogan. 2018. Autoregressive forests
Future work could direct into different feature selection methods, for multivariate time series modeling. Pattern Recognition 73 (2018), 202–215.
benchmarking approaches based on train and prediction times, or [26] WEASEL+MUSE Classifier Source Code and Raw Results. 2017. https://www2.
informatik.hu-berlin.de/∼schaefpa/muse/. (2017).
use ensembling to build more powerful classifiers, which has been [27] Martin Wistuba, Josif Grabocka, and Lars Schmidt-Thieme. 2015. Ultra-fast
successfully used for univariate time series classification. shapelets for time series classification. arXiv preprint arXiv:1503.05018 (2015).
[28] Y Chen, E Keogh, B Hu, N Begum, A Bagnall, A Mueen and G Batista . 2015.
The UCR Time Series Classification Archive. http://www.cs.ucr.edu/∼eamonn/
REFERENCES time series data. (2015).
[1] AALTD Time Series Classification Challenge. 2016. https://aaltd16.irisa.fr/ [29] Lexiang Ye and Eamonn J. Keogh. 2009. Time series shapelets: a new primitive
challenge/. (2016). for data mining. In Proceedings of the 2009 ACM SIGKDD International Conference
[2] Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, and Eamonn Keogh. on Knowledge Discovery and Data Mining. ACM.
2016. The Great Time Series Classification Bake Off: An Experimental Evaluation
of Recently Proposed Algorithms. Extended Version. Data Mining and Knowledge
Discovery (2016), 1–55.
[3] Mustafa Gokce Baydogan and George Runger. 2015. Learning a symbolic repre-
sentation for multivariate time series classification. Data Mining and Knowledge
Discovery 29, 2 (2015), 400–422.
[4] Mustafa Gokce Baydogan and George Runger. 2016. Time series representation
and similarity based on local autopatterns. Data Mining and Knowledge Discovery
30, 2 (2016), 476–509.
[5] Marco Cuturi and Arnaud Doucet. 2011. Autoregressive kernels for time series.
arXiv preprint arXiv:1101.0673 (2011).
[6] Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets.
The Journal of Machine Learning Research 7 (2006), 1–30.
[7] Philippe Esling and Carlos Agon. 2012. Time-series data mining. ACM Computing
Surveys 45, 1 (2012), 12:1–12:34.
[8] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
Lin. 2008. LIBLINEAR: A library for large linear classification. The Journal of
Machine Learning Research 9 (2008), 1871–1874.
[9] Benjamin F Hobbs, Suradet Jitprapaikulsarn, Sreenivas Konda, Vira Chankong,
Kenneth A Loparo, and Dominic J Maratukulam. 1999. Analysis of the value
for unit commitment of improved load forecasts. IEEE Transactions on Power
Systems 14, 4 (1999), 1342–1348.
[10] Zbigniew Jerzak and Holger Ziekow. 2014. The DEBS 2014 Grand Challenge. In
Proceedings of the 2014 ACM International Conference on Distributed Event-based
Systems. ACM, 266–269.
[11] Fazle Karim, Somshubra Majumdar, Houshang Darabi, and Samuel Harford.
2018. Multivariate LSTM-FCNs for Time Series Classification. arXiv preprint
arXiv:1801.04503 (2018).
[12] Isak Karlsson, Panagiotis Papapetrou, and Henrik Boström. 2016. Generalized
random shapelet forests. Data mining and knowledge discovery 30, 5 (2016),
1053–1085.
