Garcia-Hernando First-Person Hand Action CVPR 2018 Paper
speed variations across subjects are more pronounced due to a higher degree of mobility of the fingers, and the motion can be very subtle. A setback for using hand pose for action recognition is the absence of reliable off-the-shelf pose estimators, in contrast to full body [55, 71], mainly due to the absence of hand pose annotations on real (cf. synthetic) data sequences, notably when objects are involved [10, 35, 48, 49].

In this work we introduce a new dataset of first-person dynamic hand action sequences with more than 100,000 RGB-D frames annotated with 3D hand poses, using six magnetic sensors attached to the fingertips and inverse kinematics. We captured 1,175 action samples including 45 categories manipulating 26 different objects in 3 scenarios. We designed our hand actions and selected objects to cover multiple hand configurations and temporal dynamics. Furthermore, to encourage further research, we also provide 6-dimensional object pose ground truth and 3D mesh models for 4 objects, spanning 10 different actions. We evaluate several baselines and state-of-the-art RGB-D and pose-based action recognition methods on our dataset and test the current state-of-the-art in hand pose estimation and its influence on action recognition. To the best of our knowledge, this is the first work that studies the problem of first-person action recognition with the use of hand pose features, and the first benchmark of its kind. In summary, the contribution of this paper is three-fold:

Dataset: we propose a fully annotated dataset to help the study of egocentric dynamic hand-object actions and poses. This is the first dataset to combine both fields in the context of hands in real videos with quality hand pose labels.

Action recognition: we evaluate 18 baselines and state-of-the-art approaches in RGB-D and pose-based action recognition using our proposed dataset. Our selected methods cover most of the research trends in both methodology and use of different data modalities.

Hand pose: we evaluate a state-of-the-art hand pose estimator on our real dataset, i.e., the occluded setting of hand-object manipulations, and assess its performance for action recognition.

2. Related work

Egocentric vision and manipulation datasets: The important role of hands while manipulating objects has attracted interest from both the computer vision and robotics communities. From an action recognition perspective and only using RGB cues, recent research [5, 13, 14, 31, 44, 56] has delved into recognizing daily actions and determined that both manipulated objects and hands are important cues for the action recognition problem. A related line of work is the study of human grasp from a robotics perspective [6, 7], as a cue for action recognition [8, 16, 23, 77], for force estimation [16, 26, 50], and as a recognition problem itself [20, 50]. Recently, [30] proposed a benchmark using a thermal camera enabling easier hand detection, without exploring its use for action recognition. In these previous works, hands are modeled using low-level features or intermediate representations following empirical grasp taxonomies [6] and thus are limited compared to the 3D hand pose sequences used in this work. In [50], synthetic hand poses are used to recognize grasps in static frames, whereas our interest is in dynamic actions and hand poses in real videos. From a hand pose estimation perspective, [48] proposed a small synthetic dataset of static poses and thus could not successfully train data-hungry algorithms, a limitation recently relieved by larger synthetic datasets [10, 35]. Given that we also provide 6D object poses and 3D mesh models for a subset of objects, our dataset can also be of interest to the emerging object pose and joint hand-object tracking communities [57, 62]. We compare our dataset with other first-person view datasets in Section 3.5.

RGB-D and pose-based action recognition: Using depth sensors differs from traditional color action recognition in that most successful color approaches [15, 67] cannot be directly applied to the depth stream due to its nature: noisy, textureless and discontinuous pixel regions led to the necessity of depth-tailored methods. These methods usually focus on how to extract discriminative features from the depth images, using either local geometric descriptors [40, 43, 76], which are sensitive to viewpoint changes, or view-invariant approaches [46, 47]. However, the recent trend is to take advantage of the depth channel to obtain robust body pose estimates [55] and use them directly as a feature to recognize actions, which is known as pose- or skeleton-based action recognition. Popular approaches include the use of temporal state-space models [17, 70, 74, 75, 86], key-poses [66, 85], hand-crafted pose features [64, 65], and temporal recurrent models [12, 63, 87]. Having multiple data streams has led to the study of combining different sources of information such as depth and pose [4, 40, 52, 69], color and pose [88], and all of them [19]. Most previous works in RGB-D action recognition focus on actions performed by the whole human body, with some exceptions that are mainly application-oriented, such as hand gestures for human-computer interaction [9, 11, 29, 34, 40] and sign language [68]. Related to us, [33] mounted a depth sensor to recognize egocentric activities, modeling hands using low-level skin features. Similar to our interests but in third-person view, [27, 78] used a hand tracker to obtain noisy estimates of hand pose in kitchen manipulation actions, while [11] recognized basic hand gestures for human-computer interaction without objects involved. In these works, the actions performed and the pose labels are very limited due to the low quality of the hand tracker, while in this work we provide accurate hand pose labels to study more realistic hand actions. We go in depth and evaluate several baselines and state-of-the-art approaches in Sections 4 and 5.
Figure 2: Hand actions: We captured daily hand actions using an RGB-D sensor and used a mo-cap system to annotate hand pose. Left: 'put sugar' and 'pour milk' (kitchen). Right: 'charge cell phone' (office) and 'handshake' (social).
3D hand pose estimation: Mainly due to the recent availability of RGB-D sensors, the field has made significant progress in the object-less third-person view [22, 25, 28, 37, 39, 41, 45, 53, 60, 80] and more modest advances in the first-person view [10, 35, 38, 48]. In [42], 3D tracking of a hand interacting with an object in third-person view was investigated. [18] studied the use of object grasp as a hand pose prior, while [51] used the object shape as a cue. An important limitation is the difficulty of obtaining accurate 3D hand pose annotations, leading researchers to resort to synthetic [3, 10, 35, 48, 53] or manually or semi-automatically annotated [38, 58, 59, 61] datasets, resulting in non-realistic images, a low number of samples, and often inconsistent annotations. With the help of magnetic sensors for annotation and similar to [72], [84] proposed a big benchmark that included egocentric poses with no objects involved and showed that a ConvNet baseline can achieve state-of-the-art performance when enough training data is available. This was confirmed in a public challenge [83], also using a subset of our proposed dataset, and followed by a work [82] analyzing the current state of the art of the field.

3. Daily hand-object actions dataset

3.1. Dataset overview

The dataset contains 1,175 action videos belonging to 45 different action categories, in 3 different scenarios, and performed by 6 actors. A total of 105,459 RGB-D frames are annotated with accurate hand pose and action category. Action sequences present high inter-subject and intra-subject variability of style, speed, scale, and viewpoint. The object's 6-dimensional pose (3D location and orientation) and mesh model are also provided for 4 objects involving 10 different action categories. Our plan is to keep growing the dataset with more models and objects. In Fig. 2 we show some example frames for different action categories and a visualization of the hand-pose annotation.

3.2. Hand-object actions

We captured 45 different daily hand action categories involving 26 different objects. We designed our action categories to span a high number of different hand configurations, following the same taxonomy as [50], and to be diverse in both hand pose and action space (see Fig. 4). Each object has a minimum of one associated action (e.g., pen-'write') and a maximum of four (e.g., sponge-'wash', 'scratch', 'squeeze', and 'flip'). These 45 hand actions were recorded and grouped in three different scenarios: kitchen (25), office (12) and social (8). In this work we consider each hand-object manipulation as a different action category, similar to previous datasets [14], although other definitions are possible [73, 78].

3.3. Sensors and data acquisition

Visual data: We mounted an Intel RealSense SR300 RGB-D camera on the shoulder of the subject and captured sequences at 30 fps and resolutions of 1920x1080 and 640x480 for the color and depth streams, respectively.

Pose annotation: To obtain quality annotations of hand and object pose, the hand pose is captured using six magnetic sensors [36] attached to the user's hand, five on the fingertips and one on the wrist, following [84]. Each sensor provides position and orientation with 6 degrees of freedom, and the full hand pose is inferred using inverse kinematics over a defined 21-joint hand model. Each sensor is 2 mm wide and, when attached to the human hand, does not influence the depth image. The color image is affected, as the sensors and the tape attaching them are visible; however, the hand is fully visible and actions remain distinguishable in the color image. Regarding object pose, we attach one more sensor to the closest reachable point to the object's center of mass.
model are also provided for 4 objects involving 10 different Recording process: We asked 6 people, all right-
action categories. Our plan is to keep growing the dataset handed, to perform the actions. Instructions on how to per-
with more models and objects. In Fig. 2 we show some ex- form the action in a safe manner were given, however no
ample frames for different action categories and hand-pose instructions about style or speed were provided, in order to
annotation visualization. capture realistic data. Actions were labeled manually.
Dataset            Sensor   Real?  Class.  Seq.    Frames    Labels
Yale [6]           RGB      ✓      33      -       9,100     Grasp
UTG [7]            RGB      ✓      17      -       -         Grasp
GTEA [14]          RGB      ✓      61      525     31,222    Action
Choi et al. [10]   RGB-D    ✗      33      -       16,500    Grasp+Pose
SynthHands [35]    RGB-D    ✗      -       -       63,530    Pose
EgoDexter [35]     RGB-D    ✓      -       -       3,190     Fingertips
Luo et al. [30]    RGB-D-T  ✓      44      250     450,000   Action
Ours               RGB-D    ✓      45      1,175   105,459   Action+Pose

Table 1: First-person view datasets with hands and objects involved. Our proposed dataset is the first providing both hand pose and action annotations in real data (cf. synthetic).

Figure 3: Taxonomy of our dataset of hand actions involving objects. Some objects are associated with multiple actions (e.g., spoon, sponge, liquid soap), while some others have only one linked action (e.g., calculator, pen, cell charger).

3.4. Dataset statistics

Taxonomy: Fig. 3 shows the distribution of different actions per involved object. Some objects such as 'spoon' have multiple actions (e.g., 'stir', 'sprinkle', 'scoop', and 'put sugar'), while some objects have only one action ('use calculator'). Although it is not an object per se, we included 'hand' as an object in the actions 'handshake' and 'high five'.

Videos per action class: On average there are 26.11 sequences per action class and 45.19 sequences per object. For detailed per-class numbers see Fig. 4 (c).

Duration of videos: Fig. 4 (d) shows the average video duration for the 45 action classes. Some action classes such as 'put sugar' and 'open wallet' involve short atomic movements, on average one second, while others such as 'open letter' require more time to be executed.

Grasps: We identified 34 different grasps following the same taxonomy as in [50], including the most frequently studied ones [8] (i.e., precision/power grasps for different object attributes such as prismatic/round/flat/deformable). In Fig. 4 (b) we show some examples of the correlation between objects, hand poses, and actions.

Viewpoints: In Fig. 4 (e) we show the distribution of frames per hand viewpoint. We define the viewpoint as the angle between the camera direction and the palm of the hand. The dataset presents viewpoints that are more prone to self-occlusion than typical ones in third-person view.
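As an illustration of how such a viewpoint angle can be computed from an annotated frame, the sketch below estimates a palm direction from three palm joints and measures its angle to the camera's optical axis; the joint indices and the camera axis are assumptions for the example, not the dataset's actual convention.

```python
# Sketch: hand viewpoint as the angle between the camera's optical axis and the palm
# direction, with the palm normal built from three palm joints (assumed joint indices).
import numpy as np

def viewpoint_angle_deg(joints_xyz, camera_axis=np.array([0.0, 0.0, 1.0])):
    """joints_xyz: (21, 3) hand pose in camera coordinates -> viewpoint angle in degrees."""
    wrist, index_mcp, pinky_mcp = joints_xyz[0], joints_xyz[5], joints_xyz[17]  # assumed layout
    normal = np.cross(index_mcp - wrist, pinky_mcp - wrist)
    normal /= np.linalg.norm(normal)
    cosang = np.clip(np.dot(normal, camera_axis / np.linalg.norm(camera_axis)), -1.0, 1.0)
    return float(np.degrees(np.arccos(cosang)))

pose = np.random.default_rng(0).normal(size=(21, 3))  # stand-in for one annotated frame
print(viewpoint_angle_deg(pose))
```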
Occlusions: Fig. 6 (a) (bottom) shows the average number of visible (not occluded by object or viewpoint) hand joints per action class. Most actions present a high degree of occlusion, with on average 10 visible joints out of 21.

3.5. Comparison with other datasets

In Table 1 we summarize popular egocentric datasets that involve hands and objects, in either dynamic or static fashion depending on their problem of interest. For conciseness, we have excluded from the table related datasets that do not partially or fully contain object manipulations, e.g., [38, 44, 84]. Note that previous datasets in action recognition [5, 14, 30] do not include hand pose labels. On the other hand, pose and grasp datasets [6, 7, 10, 35, 48, 50] do not contain dynamic actions, and their hand pose annotation is obtained by generating synthetic images or by rough manual annotations [35]. Our dataset 'fills the gap' of egocentric dynamic hand actions with pose and compares favorably in terms of diversity, number of frames, and use of real data.

4. Evaluated algorithms and baselines

4.1. Action recognition

In order to evaluate the current state-of-the-art in action recognition we chose a variety of approaches that, we believe, cover the most representative trends in the literature, as shown in Table 4. As the nature of our data is RGB-D and we have hand pose, we focus our attention on RGB-D and pose-based action recognition approaches, although we also evaluate two RGB action recognition methods [15, 19]. Note that, as discussed above, most previous works in RGB-D action recognition involve full body poses instead of hands, and some of them might not be tailored for hand actions. We elaborate further on this in Section 5.1.
Figure 4: (a) t-SNE [32] visualization of the hand pose embedding over our dataset. Each colored dot represents a full hand pose and each trajectory an action sequence. (b) Correlation between objects, grasps, and actions. Shown poses are the average pose over all action sequences of a certain class. One object can have multiple grasps associated depending on the action performed (e.g., 'juice carton' and 'milk bottle'), and one grasp can have multiple actions associated (e.g., the lateral grasp present in 'sprinkle' and 'clean glasses'). (c) Number of action instances per hand action class. (d) Average number of frames in each video per hand action class. Our dataset contains both atomic and more temporally complex action classes. (e) Distribution of hand viewpoints, defined as the angle between the direction of the camera and the direction of the palm of the hand.
We start with one baseline to assess how the current state-of-the-art in RGB action recognition performs on our dataset. For this, and given that the most successful RGB action recognition approaches [31, 56] use ConvNets to learn descriptors from color and motion flow, we evaluate a recent two-stream architecture fine-tuned on our dataset [15].

For the depth modality, we first evaluate two local depth descriptor approaches, HOG2 [40] and HON4D [43], which exploit gradient and surface normal information as features for action recognition. As a global-scene depth descriptor, we evaluate the recent approach by [47] that learns view-invariant features using ConvNets trained on several synthesized depth views of human body pose.

We follow our evaluation with pose-based action recognition methods. As our main baseline, we implemented a recurrent neural network with long short-term memory (LSTM) modules, inspired by the architecture of [87]. We also evaluate several state-of-the-art pose-based action recognition approaches. We start with descriptor-based methods such as Moving Pose [85], which encodes atomic motion information, and [64], which represents poses as points on a Lie group. For methods focusing on learning temporal dependencies, we evaluate HBRNN [12], Gram Matrix [86] and TF [17]. HBRNN consists of a bidirectional recurrent neural network with hierarchical layers designed to learn features from the body pose. Gram Matrix is currently the best performing method for body pose and uses Gram matrices to learn the dynamics of actions. TF learns both discriminative static poses and transitions between poses using decision forests.
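To give a flavor of the kind of representation behind the Gram Matrix method, the snippet below summarizes a pose sequence by the Gram matrix of its centered, flattened frames; this is a simplified illustration of the idea, not the full formulation of [86].

```python
# Simplified illustration of a Gram-matrix sequence descriptor (not the full method of [86]):
# a sequence of T hand poses (21 joints x 3 coordinates) is summarized by the T x T matrix
# of pairwise inner products between centered frames, which encodes the sequence dynamics.
import numpy as np

def gram_descriptor(pose_sequence):
    """pose_sequence: (T, 21, 3) array -> (T, T) Gram matrix of centered, flattened poses."""
    X = pose_sequence.reshape(len(pose_sequence), -1)
    X = X - X.mean(axis=0, keepdims=True)   # remove the per-sequence mean pose
    return X @ X.T

seq = np.random.default_rng(1).normal(size=(40, 21, 3))  # stand-in action sequence
print(gram_descriptor(seq).shape)                         # (40, 40)
```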
To conclude, we evaluate one hybrid approach that jointly learns heterogeneous features (JOULE) [19], using an iterative algorithm to learn features jointly, taking into account all the data channels: color, depth, and hand pose.

4.2. Hand pose estimation

To assess the state-of-the-art in hand pose estimation, we use the same ConvNet as [84]. We choose this approach as it is easy to interpret and it was shown to provide good performance in a cross-benchmark evaluation [84]. The chosen method is a discriminative approach operating on a frame-by-frame basis, which does not need any initialization or manual recovery when tracking fails [21, 41].
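As an illustration of what such a frame-by-frame discriminative estimator looks like, the sketch below regresses the 63 joint coordinates from a cropped depth patch with a small ConvNet; the layer sizes and input resolution are stand-ins, not the actual network of [84].

```python
# Stand-in sketch of a per-frame discriminative hand pose estimator (not the architecture
# of [84]): a small ConvNet regressing 21 x 3 joint coordinates from a cropped, normalized
# depth patch, applied independently to every frame, so no tracking or initialization is needed.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 96, 1)),                 # assumed hand crop size
    tf.keras.layers.Conv2D(32, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(63),                         # 21 joints x 3 coordinates
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```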
5. Benchmark evaluation results

5.1. Action recognition

In the following we present our experiments in action recognition. In this section we assume the hand pose is given, i.e., we use the hand pose annotations obtained using the magnetic sensors and inverse kinematics. We evaluate the use of estimated hand poses, without the aid of the sensors, for action recognition in Section 5.2.

Following common practice in full-body pose action recognition [64, 85], we compensate for anthropomorphic and viewpoint differences by normalizing poses to have the same distance between pairs of joints and by defining the wrist as the center of coordinates.
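A minimal sketch of this normalization is given below, assuming a (21, 3) joint array with the wrist at index 0 and a particular parent layout for the kinematic tree; the topology and the canonical bone length are illustrative choices, not necessarily the exact ones used in the paper.

```python
# Sketch of the pose normalization used before pose-based recognition (assumed joint
# topology): center the pose on the wrist and re-chain every bone with a canonical length
# so that inter-joint distances are the same across subjects.
import numpy as np

# Assumed 21-joint layout: 0 = wrist, then 4 joints per finger, thumb to pinky.
PARENT = np.array([0, 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11, 0, 13, 14, 15, 0, 17, 18, 19])
REF_BONE = 0.03  # canonical bone length in meters (single illustrative value)

def normalize_pose(joints):
    """joints: (21, 3) -> wrist-centered pose with canonical bone lengths."""
    joints = joints - joints[0]                      # wrist as the origin
    out = np.zeros_like(joints)
    for j in range(1, len(joints)):                  # parents always precede children
        bone = joints[j] - joints[PARENT[j]]
        bone = bone / (np.linalg.norm(bone) + 1e-8) * REF_BONE
        out[j] = out[PARENT[j]] + bone
    return out

print(normalize_pose(np.random.default_rng(2).normal(size=(21, 3)))[0])  # wrist at origin
```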
5.1.1 A baseline: LSTM

We start our experimental evaluation with a simple yet powerful baseline: a recurrent neural network with a long short-term memory (LSTM) module. The architecture of our network is inspired by [87], with two differences: we do not 'go deep', and we use a more conventional unidirectional network instead of a bidirectional one. Following [87], we set the number of neurons to 100 and the dropout probability to 0.2. We use TensorFlow and the Adam optimizer.
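A sketch of such a baseline in tf.keras is shown below; the padded sequence length and the exact layer arrangement are assumptions, since the paper only specifies 100 units, 0.2 dropout, TensorFlow and Adam.

```python
# Sketch of the LSTM baseline (details beyond 100 units / 0.2 dropout / Adam are assumed):
# per-frame input is the flattened, normalized 21-joint pose (63 values) and the network
# predicts one of the 45 action classes per sequence.
import tensorflow as tf

MAX_LEN, POSE_DIMS, NUM_CLASSES = 300, 63, 45  # assumed padding length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN, POSE_DIMS)),
    tf.keras.layers.Masking(mask_value=0.0),        # ignore zero-padded frames
    tf.keras.layers.LSTM(100, dropout=0.2),         # single unidirectional layer
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_sequences, train_labels, validation_data=(test_sequences, test_labels))
```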
Training and testing protocols: We experiment with two protocols. The first protocol consists of using different partitions of the data for training and the rest for testing, and we tried three different training:testing ratios of 1:3, 1:1 and 3:1 at the sequence level. The second protocol is a 6-fold 'leave-one-person-out' cross-validation, i.e., each fold consists of 5 subjects for training and one for testing. Results are presented in Table 2. We observe that following a cross-person protocol yields the worst results, taking into account that in each fold we have training/testing proportions similar to the 3:1 setting. This can be explained by the difference in hand action styles between subjects. In the rest of the paper we perform our experiments using the 1:1 setting, with 600 action sequences for training and 575 for testing.

Protocol    1:3     1:1     3:1     cross-person
Acc. (%)    58.75   78.73   84.82   62.06

Table 2: Action recognition results (percentage of correct video classification) for different training/testing protocols.
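Both protocols can be generated from per-sequence metadata alone; the sketch below builds them from a list of subject ids, with the input names and the random seed as assumptions.

```python
# Sketch of the two evaluation protocols (inputs are assumed): random train:test splits at
# the sequence level (1:3, 1:1, 3:1) and 6-fold leave-one-person-out cross-validation.
import numpy as np

def ratio_split(num_sequences, train_fraction, seed=0):
    order = np.random.default_rng(seed).permutation(num_sequences)
    cut = int(round(train_fraction * num_sequences))
    return order[:cut], order[cut:]                  # train indices, test indices

def leave_one_person_out(subject_ids):
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):           # 6 folds for the 6 subjects
        yield np.flatnonzero(subject_ids != subject), np.flatnonzero(subject_ids == subject)

train_idx, test_idx = ratio_split(1175, train_fraction=0.5)  # the 1:1 protocol
print(len(train_idx), len(test_idx))  # ~1:1 here; the paper's actual split is 600/575
```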
Results discussion: In Fig. 5 (a) we show the recognition accuracies per category on a subset of actions, and the action confusion matrix is shown in Fig. 6 (b). Some actions such as 'sprinkle spoon', 'put tea bag' and 'pour juice' are easily identifiable, while actions such as 'open wallet' and 'use calculator' are commonly confused, likely because the hand poses involved are similar and the motion more subtle. In Fig. 5 (d) we show the contribution of each finger's motion to action recognition performance, finding that the index is the most informative finger. Combining thumb and index poses boosts the accuracy, likely due to the fact that most grasps are explained by these two fingers [6]. Fingertips alone are also a high source of information, being the most articulated joints and able to 'explain' the hand pose.

5.1.2 State-of-the-art evaluation

In Table 4 we show results for state-of-the-art approaches in different data modalities. We observe that the Two-stream [15] method performs well when combining both spatial and temporal cues. Depth methods tend to perform slightly worse than the rest of the methods, suggesting that they are not able to fully capture either the object cues or the hand pose. Note that for Novel View [47] we extracted deep features from a network trained on several synthetic views of bodies, which may not generalize well to hand poses, and fine-tuning on our dataset did not help. From all the approaches, we observe that the ones using hand pose achieve the best performance, with Gram Matrix [86] and Lie group [64] performing particularly well, a result in line with those reported in body-pose action recognition.

In Fig. 5 we select some of the most representative methods and analyze their performance in detail. We observe that the pose-based method Gram Matrix outperforms the rest in most of the measures, especially when we retrieve the top-k action hypotheses (Fig. 5 (b)), showing the benefit of using hand pose for action recognition. Looking at Fig. 5 (a), we observe that Two-stream outperforms the rest of the methods in some categories in which the object is big and the action does not involve much motion, e.g., 'use calculator' and 'read paper'. This good performance can be due to the pre-training of the spatial network on a big image recognition dataset. We further observe this in Fig. 5 (c), where we analyze the top-k hypotheses given by the prediction and check whether the predicted action contains the object being manipulated, suggesting that the network correctly recognizes the object but fails to capture the temporal dynamics.
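The two curves of Fig. 5 (b, c) can be computed as below from per-sequence class scores and a mapping from each action class to its manipulated object; both inputs and their names are illustrative assumptions.

```python
# Sketch of the top-k measures of Fig. 5 (inputs assumed): top-k action accuracy checks
# whether the true class is among the k highest-scoring hypotheses; top-k object accuracy
# checks whether any of those k hypotheses involves the manipulated object.
import numpy as np

def topk_action_accuracy(scores, labels, k):
    """scores: (N, 45) class scores, labels: (N,) true classes."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def topk_object_accuracy(scores, labels, k, action_to_object):
    """action_to_object: (45,) id of the object manipulated in each action class."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    true_obj = action_to_object[labels]
    return float(np.mean([true_obj[i] in action_to_object[topk[i]] for i in range(len(labels))]))

rng = np.random.default_rng(3)
scores, labels = rng.random((575, 45)), rng.integers(0, 45, 575)
print(topk_action_accuracy(scores, labels, k=3))
```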
Hand pose vs. depth vs. color: We performed one additional experiment using the JOULE [19] approach, breaking down the contribution of each data modality. In Table 4 (bottom) we show that the hand pose features are the most discriminative ones, although the performance can be increased by combining them with RGB and depth cues. This result suggests that hand poses capture information complementary to RGB and depth features.

Object pose: We did an additional experiment using the object pose as a feature for action recognition, on the subset of actions that have annotated object poses: a total of 261 sequences for 10 different classes and 4 objects. We trained our LSTM baseline on half of the sequences using three different inputs: hand pose, object pose, and both combined. In Table 3 we show the results and observe that object pose and hand pose features are complementary and useful for recognizing egocentric hand-object actions.
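The combined input of Table 3 can be formed by simply concatenating the per-frame features before feeding them to the LSTM; a short sketch with assumed array shapes follows.

```python
# Sketch of the combined per-frame feature used for Table 3 (shapes assumed): the flattened
# 21-joint hand pose (63 values) concatenated with the 6D object pose of the same frame.
import numpy as np

def combined_feature(hand_pose_seq, object_pose_seq):
    """hand_pose_seq: (T, 21, 3), object_pose_seq: (T, 6) -> (T, 69) LSTM input."""
    T = len(hand_pose_seq)
    return np.concatenate([hand_pose_seq.reshape(T, -1), object_pose_seq], axis=1)

features = combined_feature(np.zeros((120, 21, 3)), np.zeros((120, 6)))
print(features.shape)  # (120, 69)
```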
Figure 5: (a) Class accuracies of some representative methods on a subset of classes. (b) Top-k action accuracy: the true action label is among the top-k action prediction hypotheses. (c) Top-k object accuracy: the manipulated object appears among the top-k action prediction hypotheses. (d) Impact of each of the five fingers, combinations of them, and the fingertips on action recognition.
Pose feature       Hand    Object   Hand+Object
Action acc. (%)    87.45   74.45    91.97

Table 3: We evaluate the use of the 6D object pose for action recognition on a subset of our dataset. We observe the benefit of combining it with the hand pose.

Method                   Year   Color   Depth   Pose   Acc. (%)
Two stream-color [15]    2016   ✓       ✗       ✗      61.56
Two stream-flow [15]     2016   ✓       ✗       ✗      69.91
Two stream-all [15]      2016   ✓       ✗       ✗      75.30
HOG2-depth [40]          2013   ✗       ✓       ✗      59.83
HOG2-depth+pose [40]     2013   ✗       ✓       ✓      66.78
HON4D [43]               2013   ✗       ✓       ✗      70.61
Novel View [47]          2016   ✗       ✓       ✗      69.21
1-layer LSTM             2016   ✗       ✗       ✓      78.73
2-layer LSTM             2016   ✗       ✗       ✓      80.14
Moving Pose [85]         2013   ✗       ✗       ✓      56.34
Lie Group [64]           2014   ✗       ✗       ✓      82.69
HBRNN [12]               2015   ✗       ✗       ✓      77.40
Gram Matrix [86]         2016   ✗       ✗       ✓      85.39
TF [17]                  2017   ✗       ✗       ✓      80.69
JOULE-color [19]         2015   ✓       ✗       ✗      66.78
JOULE-depth [19]         2015   ✗       ✓       ✗      60.17
JOULE-pose [19]          2015   ✗       ✗       ✓      74.60
JOULE-all [19]           2015   ✓       ✓       ✓      78.78

Table 4: Hand action recognition performance of the different evaluated approaches on our proposed dataset.

5.2. Hand pose estimation

Training with objects vs. no objects: One question raised while designing our experiments was whether we actually needed to annotate the hand pose with close-to-ground-truth accuracy to experiment with dynamic hand actions. We try to answer this question by estimating the hand poses of our hand action dataset in two ways, partitioning the data as in our action split: using the nearly 300k object-free egocentric samples from [84] for training, and using the images in the training set of our hand action dataset. As observed in Fig. 6 (c) and Table 5, the results suggest that having hand-object images in the training set is crucial to train state-of-the-art hand pose estimators, likely due to the fact that occlusions and object shapes need to be seen by the estimator beforehand. To confirm this, we conducted two extra experiments: cross-subject (half of the users in training and half in testing, with all objects seen in both splits) and cross-object (half of the objects in training and half in testing, with all subjects seen in both splits). In Fig. 6 (c) and Table 5 we observe that the network is able to generalize to unseen subjects but struggles to do so for unseen objects, suggesting that recognizing the shape of the object and its associated grasp is crucial to train hand pose estimators. This shows the need for annotated hand poses interacting with objects, and why our dataset can be of interest to the hand pose community.
Figure 6: (a) Top: Class action recognition accuracies for our LSTM baseline using estimated hand poses (accuracies with
groundtruth poses are represented with black triangles). Bottom: Average number of visible (not occluded) joints for hand
actions on our dataset and its impact on hand pose estimation. (b) Hand action confusion matrix for our LSTM baseline.
(c) Percentage of frames for different hand pose estimation error thresholds. (d) Qualitative results on hand pose estimation.
In Fig. 6 (d) we show some qualitative results of hand pose estimation on our proposed dataset and observe that, while not perfect, the estimates are good enough for action recognition.

Hand pose protocol                      Pose error (mm)   Action (%)
Cross-subject                           11.25             -
Cross-object                            19.84             -
Action split (training w/o objects)     31.03             29.63
Action split (training w/ objects)      14.34             72.06
Action split (GT mag.+IK poses)         -                 78.73

Table 5: Average hand pose estimation error (3D distance over all 21 joints between the magnetic-sensor poses and the estimates) for different protocols, and its impact on action recognition.
Hand pose estimation and action recognition: Now we try to answer the following key question: how good is current hand pose estimation for recognizing hand actions? In Table 5 we show results of hand action recognition obtained by replacing the hand pose labels with the estimated ones in the test set. We observe that reducing the hand pose error by a factor of two yields a more than twofold improvement in action recognition. The difference in hand action recognition between using the hand pose labels and using the estimated ones at test time is 6.67%. We also tested the two best-performing methods from the previous section, Lie group [64] and Gram Matrix [86]. For Lie group we obtained an accuracy of 69.22%, while Gram Matrix gave a poor result of 32.22%, likely due to its strong assumptions on the noise distribution. On the other hand, our LSTM baseline shows more robust behavior in the presence of noisy hand pose estimates. In Fig. 6 (a) we show how hand occlusion affects the pose estimation quality and its impact on class recognition accuracies. Although some classes present a clear correlation between pose error and action accuracy degradation (e.g., 'receive coin', 'pour wine'), the LSTM is still able to obtain acceptable recognition rates, likely because it can infer the action from temporal patterns and correctly estimated joints. For more insight, we analyzed the pose error per finger: T: 12.45, I: 15.48, M: 18.08, R: 16.69, P: 18.95, all in mm. Thumb and index joints present the lowest estimation error because they are typically less occluded in the egocentric setting. Together with the previous section's finding that the motion of these two fingers is a high source of information, this is a plausible explanation of why we can still obtain good action recognition performance while having noisy hand pose estimates.

6. Concluding remarks

We have proposed a novel benchmark and presented experimental evaluations for RGB-D and pose-based hand action recognition in a first-person setting. The benchmark provides both temporal action labels and full 3D hand pose labels, and additionally 6D object pose labels on a part of the dataset. Both RGB-D action recognition and 3D hand pose estimation are relatively new fields, and this is a first attempt to relate the two, similar to what has been done for the full human body. We have evaluated several baselines on our dataset and concluded that hand pose features are a rich source of information for recognizing manipulation actions. We believe that our dataset and experiments can encourage future work in multiple fields including action recognition, hand pose estimation, object pose estimation, and emerging ones such as joint hand-object pose estimation.

Acknowledgements: This work is part of the Imperial College London-Samsung Research project, supported by Samsung Electronics.
References

[1] S. Allin and D. Ramanan. Assessment of post-stroke functioning using machine vision. In MVA, 2007.
[2] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. RAS, 2009.
[3] S. Baek, K. I. Kim, and T.-K. Kim. Augmented skeleton space transfer for depth-based hand pose estimation. In CVPR, 2018.
[4] S. Baek, Z. Shi, M. Kawade, and T.-K. Kim. Kinematic-layout-aware random forests for depth-based action recognition. In BMVC, 2017.
[5] S. Bambach, S. Lee, D. J. Crandall, and C. Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In ICCV, 2015.
[6] I. M. Bullock, T. Feix, and A. M. Dollar. The yale human grasping dataset: Grasp, object, and task data in household and machine shop environments. IJRR, 2015.
[7] M. Cai, K. M. Kitani, and Y. Sato. A scalable approach for understanding the visual structures of hand grasps. In ICRA, 2015.
[8] M. Cai, K. M. Kitani, and Y. Sato. Understanding hand-object manipulation with grasp types and object attributes. In RSS, 2016.
[9] H. J. Chang, G. Garcia-Hernando, D. Tang, and T.-K. Kim. Spatio-temporal hough forest for efficient detection–localisation–recognition of fingerwriting in egocentric camera. CVIU, 2016.
[10] C. Choi, S. H. Yoon, C.-N. Chen, and K. Ramani. Robust hand pose estimation during the interaction with an unknown object. In ICCV, 2017.
[11] Q. De Smedt, H. Wannous, and J.-P. Vandeborre. Skeleton-based dynamic hand gesture recognition. In CVPRW, 2016.
[12] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
[13] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In ICCV, 2011.
[14] A. Fathi, X. Ren, and J. M. Rehg. Learning to recognize objects in egocentric activities. In CVPR, 2011.
[15] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[16] C. Fermüller, F. Wang, Y. Yang, K. Zampogiannis, Y. Zhang, F. Barranco, and M. Pfeiffer. Prediction of manipulation actions. IJCV, 2017.
[17] G. Garcia-Hernando and T.-K. Kim. Transition forests: Learning discriminative temporal transitions for action recognition and detection. In CVPR, 2017.
[18] H. Hamer, J. Gall, T. Weise, and L. Van Gool. An object-dependent hand pose prior from sparse training data. In CVPR, 2010.
[19] J.-F. Hu, W.-S. Zheng, J. Lai, and J. Zhang. Jointly learning heterogeneous features for RGB-D activity recognition. In CVPR, 2015.
[20] D.-A. Huang, M. Ma, W.-C. Ma, and K. M. Kitani. How do we use our hands? discovering a diverse set of common grasps. In CVPR, 2015.
[21] Intel. Perceptual computing sdk. 2013.
[22] C. Ionescu, J. Carreira, and C. Sminchisescu. Iterated second-order label sensitive pooling for 3d human pose estimation. In CVPR, 2014.
[23] T. Ishihara, K. M. Kitani, W.-C. Ma, H. Takagi, and C. Asakawa. Recognizing hand-object interactions in wearable camera videos. In ICIP, 2015.
[24] Y. Jang, S.-T. Noh, H. J. Chang, T.-K. Kim, and W. Woo. 3d finger cape: Clicking action and position estimation under self-occlusions in egocentric viewpoint. TVCG, 2015.
[25] C. Keskin, F. Kıraç, Y. E. Kara, and L. Akarun. Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In ECCV, 2012.
[26] P. G. Kry and D. K. Pai. Interaction capture and synthesis. ACM Transactions on Graphics, 2006.
[27] J. Lei, X. Ren, and D. Fox. Fine-grained kitchen activity recognition using rgb-d. In Ubicomp, 2012.
[28] H. Liang, J. Yuan, and D. Thalmann. Parsing the hand in depth images. TMM, 2014.
[29] L. Liu and L. Shao. Learning discriminative representations from rgb-d video data. In IJCAI, 2013.
[30] R. Luo, O. Sener, and S. Savarese. Scene semantic reconstruction from egocentric rgb-d-thermal videos. In 3D Vision (3DV), 2017.
[31] M. Ma, H. Fan, and K. M. Kitani. Going deeper into first-person activity recognition. In CVPR, 2016.
[32] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. JMLR, 2008.
[33] M. Moghimi, P. Azagra, L. Montesano, A. C. Murillo, and S. Belongie. Experiments on an rgb-d wearable vision system for egocentric activity recognition. In CVPRW, 2014.
[34] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural networks. In CVPR, 2016.
[35] F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas, and C. Theobalt. Real-time hand tracking under occlusion from an egocentric rgb-d sensor. In ICCV, 2017.
[36] NDItrakSTAR. http://www.ascension-tech.com/products/trakstar-2-drivebay-2/.
[37] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout. Hand segmentation with structured convolutional learning. In ACCV, 2014.
[38] M. Oberweger, G. Riegler, P. Wohlhart, and V. Lepetit. Efficiently creating 3d training data for fine hand pose estimation. In CVPR, 2016.
[39] M. Oberweger, P. Wohlhart, and V. Lepetit. Training a feedback loop for hand pose estimation. In ICCV, 2015.
[40] E. Ohn-Bar and M. M. Trivedi. Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations. ITS, 2014.
[41] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficient model-based 3d tracking of hand articulations using kinect. In BMVC, 2011.
[42] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Full dof tracking of a hand interacting with an object by modeling occlusions and physical constraints. In ICCV, 2011.
[43] O. Oreifej and Z. Liu. HON4D: histogram of oriented 4D normals for activity recognition from depth sequences. In CVPR, 2013.
[44] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In CVPR, 2012.
[45] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun. Realtime and robust hand tracking from depth. In CVPR, 2014.
[46] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian. Histogram of oriented principal components for cross-view action recognition. TPAMI, 2016.
[47] H. Rahmani and A. Mian. 3d action recognition from novel viewpoints. In CVPR, 2016.
[48] G. Rogez, M. Khademi, J. Supančič III, J. M. M. Montiel, and D. Ramanan. 3d hand pose detection in egocentric rgb-d images. In ECCV, 2014.
[49] G. Rogez, J. S. Supancic, and D. Ramanan. First-person pose recognition using egocentric workspaces. In CVPR, 2015.
[50] G. Rogez, J. S. Supancic, and D. Ramanan. Understanding everyday hands in action from rgb-d images. In ICCV, 2015.
[51] J. Romero, H. Kjellström, C. H. Ek, and D. Kragic. Non-parametric hand pose estimation with object context. IVC, 2013.
[52] A. Shahroudy, T.-T. Ng, Q. Yang, and G. Wang. Multimodal multipart learning for action recognition in depth videos. TPAMI, 2016.
[53] T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Kim, C. Rhemann, I. Leichter, A. Vinnikov, Y. Wei, D. Freedman, P. Kohli, E. Krupka, A. Fitzgibbon, and S. Izadi. Accurate, robust, and flexible real-time hand tracking. In CHI, 2015.
[54] Z. Shi and T.-K. Kim. Learning and refining of privileged information-based rnns for action recognition from depth sequences. In CVPR, 2017.
[55] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Commun. ACM, 2013.
[56] S. Singh, C. Arora, and C. Jawahar. First person action recognition using deep learned descriptors. In CVPR, 2016.
[57] S. Sridhar, F. Mueller, M. Zollhöfer, D. Casas, A. Oulasvirta, and C. Theobalt. Real-time joint tracking of a hand manipulating an object from rgb-d input. In ECCV, 2016.
[58] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded hand pose regression. In CVPR, 2015.
[59] D. Tang, H. J. Chang, A. Tejani, and T.-K. Kim. Latent regression forest: Structured estimation of 3d articulated hand posture. In CVPR, 2014.
[60] D. Tang, T.-H. Yu, and T.-K. Kim. Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In ICCV, 2013.
[61] J. Tompson, M. Stein, Y. Lecun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. TOG, 2014.
[62] D. Tzionas, L. Ballan, A. Srikantha, P. Aponte, M. Pollefeys, and J. Gall. Capturing hands in action using discriminative salient points and physics simulation. IJCV, 2016.
[63] V. Veeriah, N. Zhuang, and G.-J. Qi. Differential recurrent neural networks for action recognition. In ICCV, 2015.
[64] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In CVPR, 2014.
[65] R. Vemulapalli and R. Chellappa. Rolling rotations for recognizing human actions from 3d skeletal data. In CVPR, 2016.
[66] C. Wang, Y. Wang, and A. L. Yuille. Mining 3d key-pose-motifs for action recognition. In CVPR, 2016.
[67] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[68] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu. Robust 3d action recognition with random occupancy patterns. In ECCV, 2012.
[69] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In CVPR, 2012.
[70] P. Wang, C. Yuan, W. Hu, B. Li, and Y. Zhang. Graph based skeleton motion representation and similarity measurement for action recognition. In ECCV, 2016.
[71] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[72] A. Wetzler, R. Slossberg, and R. Kimmel. Rule of thumb: Deep derotation for improved fingertip detection. In BMVC, 2015.
[73] M. Wray, D. Moltisanti, W. Mayol-Cuevas, and D. Damen. Sembed: Semantic embedding of egocentric action videos. In ECCVW, 2016.
[74] D. Wu and L. Shao. Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In CVPR, 2014.
[75] L. Xia, C.-C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3d joints. In CVPRW, 2012.
[76] X. Yang and Y. Tian. Super normal vector for activity recognition using depth sequences. In CVPR, 2014.
[77] Y. Yang, C. Fermuller, Y. Li, and Y. Aloimonos. Grasp type revisited: A modern perspective on a classical feature for vision. In CVPR, 2015.
[78] Y. Yang, A. Guha, C. Fermuller, and Y. Aloimonos. A cognitive system for understanding human manipulation actions. ACS, 2014.
[79] A. Yao, J. Gall, G. Fanelli, and L. J. Van Gool. Does human action recognition benefit from pose estimation? In BMVC, 2011.
[80] Q. Ye, S. Yuan, and T.-K. Kim. Spatial attention deep net with partial pso for hierarchical hybrid hand pose estimation. In ECCV, 2016.
[81] T.-H. Yu, T.-K. Kim, and R. Cipolla. Unconstrained monocular 3d human pose estimation by action detection and cross-modality regression forest. In CVPR, 2013.
[82] S. Yuan, G. Garcia-Hernando, B. Stenger, G. Moon, J. Y. Chang, K. M. Lee, P. Molchanov, J. Kautz, S. Honari, L. Ge, J. Yuan, X. Chen, G. Wang, F. Yang, K. Akiyama, Y. Wu, Q. Wan, M. Madadi, S. Escalera, S. Li, D. Lee, I. Oikonomidis, A. Argyros, and T.-K. Kim. Depth-based 3d hand pose estimation: From current achievements to future goals. In CVPR, 2018.
[83] S. Yuan, Q. Ye, G. Garcia-Hernando, and T.-K. Kim. The 2017 hands in the million challenge on 3d hand pose estimation. arXiv preprint arXiv:1707.02237, 2017.
[84] S. Yuan, Q. Ye, B. Stenger, S. Jain, and T.-K. Kim. Big hand 2.2m benchmark: Hand pose data set and state of the art analysis. In CVPR, 2017.
[85] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In ICCV, 2013.
[86] X. Zhang, Y. Wang, M. Gou, M. Sznaier, and O. Camps. Efficient temporal sequence comparison and classification using gram matrix embeddings on a riemannian manifold. In CVPR, 2016.
[87] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie. Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In AAAI, 2016.
[88] Y. Zhu, W. Chen, and G. Guo. Fusing spatiotemporal features and joints for 3d action recognition. In CVPRW, 2013.
[89] L. Zollo, S. Roccella, E. Guglielmelli, M. C. Carrozza, and P. Dario. Biomechatronic design and control of an anthropomorphic artificial hand for prosthetic and robotic applications. TMECH, 2007.