1 Introduction
Supporting the quality of life of the aging population is one of the greatest challenges facing our societies today. With age come neurodegenerative diseases, memory loss, an increased risk of falls, and broken limbs. Smart home technologies, however, offer new opportunities to help the elderly live independently. In our work, we are interested in using activity recognition to build applications that can help people with cognitive or memory problems. We focus on indoor environments, such as an apartment, which can be equipped with static sensors (e.g., Kinect depth cameras). In addition, the users can carry wearable sensors hidden in clothes or accessories (e.g., a watch or a smartphone). In this context, an activity recognition system should (i) recognise a wide range of activities and (ii) remain flexible with regard to the loss of one or more sensor modalities.
There is a large body of previous work on activity recognition [7, 10, 13]. Many works target a medical application: for instance, Amft et al. [1] use a variety of on-body sensors to perform dietary monitoring and thereby help patients
Fig. 1. Kinect-extracted skeleton for two sample poses. The skeleton is overlaid in green, with yellow dots indicating joints with a low confidence score reported by the sensor. The skeleton provided by the Kinect is reliable when the person is standing in front of the sensor, but unreliable when parts of the person's body are occluded.
2 Dataset
Several activity recognition datasets are available and allow researchers to benchmark their algorithms. However, even recent datasets [9] do not include depth sensors. Our dataset consists of recordings of typical daily living activities performed by one subject so far, but we are working on a database in which several subjects will take part in the recordings¹. The recording environment simulates the setup of a small apartment with one Kinect camera per room. We decided to use Kinect and accelerometer sensors because they are both relatively cheap (compared to a motion-capture setup, for example) and not too invasive for the user of our system. In addition to its skeleton-tracking capabilities, the Kinect can also be used as a camera and therefore allows us to show pictures of the activities to the user, which are more easily interpretable than text labels, as shown by Browne et al. [3]. In order to respect the privacy of the person, the system does not need to store the video once it has been processed (i.e., it will only store the snapshots, the duration of the activities, etc.).

¹ The current database is available upon request. The complete database involving several subjects will soon be available on our website http://ape.iict.ch.
During the sequence, the subject performs various daily activities, including: Read, Sleep, Sit idle, Dress, Undress, Brush teeth, Clean a table, Work at the computer, Tidy up the wardrobe, Pick something up, and Sweep the floor, repeating each activity multiple times. The duration of the first recording is 2176 seconds, which corresponds to 65,057 samples. Each sample consists of the skeleton² of the person, obtained with Microsoft's SDK [12] from the Kinect sensor placed at a height of around 2 m in the rooms, and the 3-axis acceleration from 5 wireless inertial measurement units (IMUs) from Shimmer Research³, placed on the wrists, the ankles, and the back. In addition, for each timestamp the Kinect also provides an RGB image, which we used to label the data (by hand) and to validate our system. Figure 1 shows two examples of Kinect captures with the extracted skeleton overlaid in green. In-house software was used to synchronize the accelerometers with the Kinect. Figure 2 shows the activity labels of the whole sequence used in our experiments.
[Fig. 2: activity labels (read, dress, sleep, undress, brush_teeth, drink, sweep_floor, none) plotted against the sample index (0 to about 60,000).]
3 Learning Pipeline
Activity recognition was achieved by combining a set of binary classifiers based
on feedforward artificial neural networks (i.e., Multi-layer Perceptrons).
² The Kinect sensor tracks the position (relative to the sensor) of 20 body joints at a rate of about 30 Hz.
³ http://shimmer-research.com
We decided to use a one-vs-all approach [4] and trained a neural network for each activity. Each individual neural network was trained to distinguish the samples belonging to one activity from a randomly chosen set of samples belonging to all the other activities. The main reason for this choice is that different activities might require different features: for example, one might assume that the position of the feet is not relevant for detecting the drink activity.
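As an illustration of this one-vs-all setup, the sketch below shows how such a binary training set could be assembled. The function name, the balancing ratio, and the random subsampling strategy are assumptions made for the example; the paper does not specify how many negative samples were drawn.

```python
import numpy as np

def build_one_vs_all_set(features, labels, target_activity, neg_ratio=1.0, seed=0):
    """Illustrative one-vs-all training set: positives are all frames of
    `target_activity`, negatives are a random subset of the remaining frames
    (the balancing ratio is hypothetical)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(labels == target_activity)
    neg_idx = np.flatnonzero(labels != target_activity)
    n_neg = min(len(neg_idx), int(neg_ratio * len(pos_idx)))
    neg_idx = rng.choice(neg_idx, size=n_neg, replace=False)
    idx = np.concatenate([pos_idx, neg_idx])
    X = features[idx]
    y = (labels[idx] == target_activity).astype(float)  # 1 = activity, 0 = rest
    return X, y
```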
Our learning pipeline consists of the following steps, applied for each individual binary classifier: acceleration and skeleton acquisition, signal filtering, window-size selection, neural network complexity (number of hidden neurons) selection, and input selection. Where possible, we performed a parameter exploration to help us choose the best value for each parameter of the pipeline. This makes running the whole pipeline quite time-consuming, but most parameters (e.g., window size, topology) found to be optimal on our test sequences are likely to be near-optimal on new, similar sequences that our system might encounter in the future.
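A minimal sketch of what such a parameter exploration could look like is given below. Only the window-size range comes from the text; the candidate hidden-layer sizes and the `train_and_validate` callback are assumptions made for illustration.

```python
from itertools import product

WINDOW_SIZES = range(10, 101, 10)   # samples; range taken from the text
HIDDEN_UNITS = (5, 10, 20, 40)      # hypothetical candidate hidden-layer sizes

def explore_parameters(train_and_validate):
    """Exhaustive search over (window size, hidden units).

    `train_and_validate(window, hidden)` is assumed to train one binary
    classifier with the given settings and return its validation error.
    """
    best = None
    for window, hidden in product(WINDOW_SIZES, HIDDEN_UNITS):
        err = train_and_validate(window, hidden)
        if best is None or err < best[0]:
            best = (err, window, hidden)
    return best  # (error, window, hidden)
```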
Acceleration and Skeleton Acquisition. We used 5 three-axis accelerometers: two placed on the legs, one on the back, and one on each wrist of the subject. This gives a subtotal of 3 × 5 = 15 inputs from the accelerometers. In addition, we used the following information from the Kinect skeleton data: the position of the person in the room; the positions of the left and right hands, elbows, and hips; and the position of each hand relative to the corresponding shoulder. This gives a subtotal of 3 × 9 = 27 inputs from the Kinect, so we have a total of 15 + 27 = 42 input values for each frame.
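A sketch of how the 42 input values of one frame could be assembled is shown below; the dictionary keys are illustrative names, not the actual format of our recordings.

```python
import numpy as np

# 5 accelerometers x 3 axes = 15 values, plus 9 Kinect-derived 3-D quantities
# = 27 values, i.e. 42 inputs per frame. Field names are assumptions.
ACC_SENSORS = ("wrist_left", "wrist_right", "leg_left", "leg_right", "back")
KINECT_FEATURES = ("position", "hand_left", "hand_right", "elbow_left",
                   "elbow_right", "hip_left", "hip_right",
                   "hand_left_rel_shoulder", "hand_right_rel_shoulder")

def frame_vector(acc, skel):
    """acc and skel map feature names to 3-element (x, y, z) sequences."""
    parts = [np.asarray(acc[s], dtype=float) for s in ACC_SENSORS]
    parts += [np.asarray(skel[k], dtype=float) for k in KINECT_FEATURES]
    vec = np.concatenate(parts)
    assert vec.shape == (42,)
    return vec
```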
Signal Filtering. We used a 1 Hz low-pass filter to keep only the low-frequency components of the subject's movements. Hence, we mostly retained information about the posture of the subject while performing an activity; indeed, most of the activities we are interested in correspond to quite different postures.
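The filtering step could be implemented as follows. Only the 1 Hz cut-off comes from the text; the Butterworth/zero-phase (filtfilt) combination, the filter order, and the sampling rate argument are assumptions made for this sketch.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_1hz(signal, fs=30.0, order=2):
    """Zero-phase 1 Hz low-pass filter applied along the time axis.

    fs is the sampling rate (about 30 Hz for the Kinect skeleton). The filter
    type and order are illustrative choices, not details from the paper.
    """
    b, a = butter(order, 1.0 / (fs / 2.0), btype="low")
    return filtfilt(b, a, np.asarray(signal, dtype=float), axis=0)
```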
Window Size Selection. After the signal has been filtered, we used a sliding window to extract feature vectors. We tried window sizes between 10 and 100 samples and chose the one that gave the lowest validation error, using the absolute error |y_ground truth − y_predicted| to evaluate the training performance of each single neural network for this step. Although not all activities have the same optimal window size, for simplicity we used a single window size for all activities: 30 samples, which was a good choice for almost all of them. We tested two overlap settings for our sliding windows: 50% and "99%" (i.e., sliding the window by a single sample). The windows are used to sub-sample the original time series by computing the mean over each window.
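The sliding-window sub-sampling described above can be sketched as follows; with a window of 30 samples, a step of 15 corresponds to the 50% overlap setting and a step of 1 to the "99%" setting.

```python
import numpy as np

def window_means(frames, window=30, step=15):
    """Sub-sample a (n_frames, n_inputs) array by averaging sliding windows.

    window=30 with step=15 gives 50% overlap; step=1 gives the "99%" setting
    (slide the window by one sample).
    """
    frames = np.asarray(frames, dtype=float)
    out = [frames[i:i + window].mean(axis=0)
           for i in range(0, len(frames) - window + 1, step)]
    return np.stack(out)
```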
Topology Selection. Since we use activity-specific one-vs-all networks, each network has a single output unit indicating whether the input corresponds to the execution of the given activity (1) or not (0). We used sigmoidal output neurons, so output values between 0 and 1 represent the neural network's certainty regarding the classification of a particular input pattern. Since we found that networks with a single hidden layer were sufficiently complex for
[Fig. 3 matrix: activities sweep floor, drink, brush teeth, undress, sleep, dress, read, clean table, pick up, tidy up, sit idle, work, against the accelerometer inputs (right/left wrist, back, right/left leg) and the Kinect inputs (position, left/right hand, left/right elbow, left/right hip, left/right hand-to-shoulder).]
Fig. 3. Matrix showing the selected inputs for each class. Each modality has 3 axes (x, y, z). For each activity, we select a different set of inputs. A black square indicates that the input has been selected for the corresponding activity. On the left are the features computed from the accelerometers, on the right the features computed from the Kinect (the red line shows the separation).
Binary Classifiers' Training and Fusion. Each multi-layer perceptron (MLP) implementing an activity-specific one-vs-all binary classifier was trained using the FENNIX software⁴, developed by one of the authors. In particular, we used the Backpropagation algorithm [2] for between 50 and 250 epochs, a learning rate decreasing from 0.025 to 0.001, and a momentum term of 0.7. The output of the ensemble of binary classifiers was the action having the highest classification certainty value (i.e., the highest activity-specific one-vs-all classifier output), provided that this value was greater than 0.7. Otherwise, we considered the recognized activity to be the none activity, a special class for unknown activities.

⁴ http://fennix.sourceforge.net
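The fusion rule can be summarised by the following sketch; the list of activity names is illustrative, and the one-vs-all classifiers are assumed to be trained already (the paper uses the FENNIX software for that step).

```python
import numpy as np

# Illustrative activity order; one sigmoid output per activity is expected.
ACTIVITIES = ("read", "sleep", "sit_idle", "dress", "undress", "brush_teeth",
              "clean_table", "work", "tidy_up", "pick_up", "sweep_floor", "drink")

def fuse(outputs, threshold=0.7):
    """Combine the activity-specific one-vs-all outputs into a single label.

    The activity with the highest certainty wins, provided it exceeds the 0.7
    threshold described in the text; otherwise the frame is labelled 'none'.
    """
    outputs = np.asarray(outputs, dtype=float)
    best = int(np.argmax(outputs))
    return ACTIVITIES[best] if outputs[best] > threshold else "none"
```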
4 Results
To fully profit from the use of both wearable and static sensors, we implemented a model-switching capability within the activity recognition system. The skeleton data provided by the Kinect includes a quality flag for each joint and frame, indicating the confidence in the reported joint position. When this flag was low for most of the tracked joints, we considered the Kinect data unreliable and relied on a model based only on the accelerometers to recognize the activity being performed. To achieve this, we trained two models: one using both the wearable and the static depth-based sensors, and a second one using only the wearable accelerometers.
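A sketch of this switching logic is given below. The 0.5 fraction of reliable joints is a hypothetical threshold, since the text only states "most of the joints"; the two model arguments stand for the classifiers trained as described above.

```python
import numpy as np

def predict_with_switching(frame, joint_confidences, model_full, model_acc_only,
                           min_reliable_fraction=0.5):
    """Pick the model according to Kinect skeleton reliability.

    `joint_confidences` holds one flag per tracked joint (True = reliable).
    When most joints are unreliable, fall back to the accelerometer-only
    model; otherwise use the model trained on both modalities.
    """
    reliable = np.mean(np.asarray(joint_confidences, dtype=float))
    model = model_full if reliable >= min_reliable_fraction else model_acc_only
    return model(frame)
```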
[Fig. 5 boxplots: F1 score (roughly 0.50 to 0.85) for k-NN and for our models (Accelerometers only, Accelerometers & Kinect), with 50% and 99% window overlap.]
Fig. 5. Classification performance (F1 score) for our pipeline. On the left of each plot, k-NN is used as a benchmark. There are two plots: one where the none class is taken into account and one where it is not. In the middle of each plot, the performance of our models without input pruning is shown, and on the right are the results for our models with input pruning. The window-overlap parameter is indicated by the colour. The number on top of each boxplot indicates the median value, for ease of comparison.
5 Conclusions
This paper presents an activity recognition system built using activity-specific one-vs-all artificial neural networks. The experiments demonstrated that the performance of the system is comparable to that of a k-NN classifier, with the advantage that, once the networks are trained, the training database can be discarded. This makes it possible to use this kind of system on an embedded device such as a smartphone or a tablet. We also showed that input pruning and a custom configuration of each activity-specific model can be used to improve the performance of the system.
Last but not least, we showed that the combination of wearable motion sensors and static depth sensors can enhance not only the performance but also the robustness of indoor activity recognition systems, by taking advantage of one modality when the other is not reliable. Our results are quite encouraging, but further tests with several subjects are required to better assess the generalization performance of such a system. Moreover, given that our aim is to work towards the democratization of activity recognition, we will try to use as few sensors as possible in the future.
References
1. Amft, O., Tröster, G.: Recognition of dietary activity events using on-body sensors.
Artificial Intelligence in Medicine 42(2), 121–136 (2008)
2. Bishop, C.: Neural networks for pattern recognition. OUP, USA (1995)
3. Browne, G., et al.: SenseCam improves memory for recent events and quality of life in a patient with memory retrieval difficulties. Memory 19(7), 713–722 (2011)
4. Garcia-Pedrajas, N., Ortiz-Boyer, D.: An empirical study of binary classifier fusion
methods for multiclass classification. Information Fusion 12(2), 111–130 (2011)
5. Hondori, H., et al.: Monitoring intake gestures using sensor fusion (Microsoft Kinect and inertial sensors) for smart home tele-rehab setting. In: 2012 1st Annual IEEE Healthcare Innovation Conference (2012)
6. Kepski, M., Kwolek, B.: Fall detection on embedded platform using Kinect and wireless accelerometer. In: Miesenberger, K., Karshmer, A., Penaz, P., Zagler, W.
(eds.) ICCHP 2012, Part II. LNCS, vol. 7383, pp. 407–414. Springer, Heidelberg
(2012)
7. Lara, O.D., et al.: Centinela: A human activity recognition system based on accel-
eration and vital sign data. Pervasive and Mobile Computing 8(5), 717–729 (2012)
8. Rijsbergen, C.J.V.: Information Retrieval, 2nd edn. Butterworth-Heinemann,
Newton (1979)
9. Roggen, D., et al.: Collecting complex activity data sets in highly rich networked
sensor environments. In: Proceedings of the Seventh International Conference on
Networked Sensing Systems (INSS), pp. 233–240. IEEE CSP (2010)
10. Sagha, H., et al.: Benchmarking classification techniques using the Opportunity
human activity dataset. In: IEEE International Conference on Systems, Man, and
Cybernetics (2011)
11. Satizábal M, H.F., Pérez-Uribe, A.: Relevance metrics to reduce input dimensions
in artificial neural networks. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic,
D.P. (eds.) ICANN 2007. LNCS, vol. 4668, pp. 39–48. Springer, Heidelberg (2007)
12. Shotton, J., et al.: Real-time human pose recognition in parts from single depth
images. In: Computer Vision and Pattern Recognition (CVPR), pp. 1297–1304.
IEEE (2011)
13. Stiefmeier, T., et al.: Wearable activity tracking in car manufacturing. IEEE
Pervasive Computing 7(2), 42–50 (2008)