Garcia-Hernando First-Person Hand Action CVPR 2018 Paper
speed variations across subjects are more pronounced due to a higher degree of mobility of the fingers, and the motion can be very subtle. A setback for using hand pose for action recognition is the absence of reliable off-the-shelf pose estimators, in contrast to full body [55, 71], mainly due to the absence of hand pose annotations on real (cf. synthetic) data sequences, notably when objects are involved [10, 35, 48, 49].

In this work we introduce a new dataset of first-person dynamic hand action sequences with more than 100,000 RGB-D frames annotated with 3D hand poses, using six magnetic sensors attached to the fingertips and inverse kinematics. We captured 1,175 action samples including 45 categories manipulating 26 different objects in 3 scenarios. We designed our hand actions and selected objects to cover multiple hand configurations and temporal dynamics. Furthermore, to encourage further research, we also provide 6-dimensional object pose ground truth and 3D mesh models for 4 objects, spanning 10 different actions. We evaluate several baselines and state-of-the-art RGB-D and pose-based action recognition methods on our dataset and test the current state-of-the-art in hand pose estimation and its influence on action recognition. To the best of our knowledge, this is the first work that studies the problem of first-person action recognition with the use of hand pose features, and the first benchmark of its kind. In summary, the contribution of this paper is three-fold:

Dataset: we propose a fully annotated dataset to help the study of egocentric dynamic hand-object actions and poses. This is the first dataset to combine both fields in the context of hands in real videos with quality hand pose labels.

Action recognition: we evaluate 18 baselines and state-of-the-art approaches in RGB-D and pose-based action recognition using our proposed dataset. Our selected methods cover most of the research trends in both methodology and use of different data modalities.

Hand pose: we evaluate a state-of-the-art hand pose estimator on our real dataset, i.e., the occluded setting of hand-object manipulations, and assess its performance for action recognition.

2. Related work

Egocentric vision and manipulation datasets: The important role of hands while manipulating objects has attracted interest from both the computer vision and robotics communities. From an action recognition perspective and only using RGB cues, recent research [5, 13, 14, 31, 44, 56] has delved into recognizing daily actions and determined that both manipulated objects and hands are important cues for the action recognition problem. A related line of work is the study of human grasp from a robotics perspective [6, 7], as a cue for action recognition [8, 16, 23, 77], for force estimation [16, 26, 50], and as a recognition problem itself [20, 50]. Recently, [30] proposed a benchmark using a thermal camera enabling easier hand detection, without exploring its use for action recognition. In these previous works, hands are modeled using low-level features or intermediate representations following empirical grasp taxonomies [6] and thus are limited compared to the 3D hand pose sequences used in this work. In [50], synthetic hand poses are used to recognize grasps in static frames, whereas our interest is in dynamic actions and hand poses in real videos. From a hand pose estimation perspective, [48] proposed a small synthetic dataset of static poses and thus could not successfully train data-hungry algorithms, a limitation recently relieved by larger synthetic datasets [10, 35]. Given that we also provide 6D object poses and 3D mesh models for a subset of objects, our dataset can also be of interest to the emerging object pose and joint hand-object tracking communities [57, 62]. We compare our dataset with other first-person view datasets in Section 3.5.

RGB-D and pose-based action recognition: Using depth sensors differs from traditional color action recognition in that most successful color approaches [15, 67] cannot be directly applied to the depth stream due to its nature: noisy, textureless and discontinuous pixel regions led to the necessity of depth-tailored methods. These methods usually focus on how to extract discriminative features from the depth images, using either local geometric descriptors [40, 43, 76], which are sensitive to viewpoint changes, or view-invariant approaches [46, 47]. However, the recent trend is to take advantage of the depth channel to obtain robust body pose estimates [55] and use them directly as a feature to recognize actions, which is known as pose- or skeleton-based action recognition. Popular approaches include the use of temporal state-space models [17, 70, 74, 75, 86], key-poses [66, 85], hand-crafted pose features [64, 65], and temporal recurrent models [12, 63, 87]. Having multiple data streams has led to the study of combining different sources of information such as depth and pose [4, 40, 52, 69], color and pose [88], and all of them [19]. Most previous works in RGB-D action recognition focus on actions performed by the whole human body, with some exceptions that are mainly application-oriented, such as hand gestures for human-computer interaction [9, 11, 29, 34, 40] and sign language [68]. Related to us, [33] mounted a depth sensor to recognize egocentric activities, modeling hands using low-level skin features. Similar to our interests but in third-person view, [27, 78] used a hand tracker to obtain noisy estimates of hand pose in kitchen manipulation actions, while [11] recognized basic hand gestures for human-computer interaction without objects involved. In these works, the actions performed and the pose labels are very limited due to the low quality of the hand tracker, while in this work we provide accurate hand pose labels to study more realistic hand actions. We go in depth and evaluate several baselines and state-of-the-art approaches in Sections 4 and 5.
Figure 2: Hand actions: We captured daily hand actions using an RGB-D sensor and used a mo-cap system to annotate hand pose. Left: 'put sugar' and 'pour milk' (kitchen). Right: 'charge cell phone' (office) and 'handshake' (social).
3D hand pose estimation: Mainly due to the recent availability of RGB-D sensors, the field has made significant progress in the object-less third-person view [22, 25, 28, 37, 39, 41, 45, 53, 60, 80] and more modest advances in the first-person view [10, 35, 38, 48]. In [42], 3D tracking of a hand interacting with an object in third-person view was investigated. [18] studied the use of object grasp as a hand pose prior, while [51] used the object shape as a cue. An important limitation is the difficulty of obtaining accurate 3D hand pose annotations, leading researchers to resort to synthetic [3, 10, 35, 48, 53] or manually or semi-automatically annotated [38, 58, 59, 61] datasets, resulting in non-realistic images, a low number of samples, and often inconsistent annotations. With the help of magnetic sensors for annotation and similar to [72], [84] proposed a big benchmark that included egocentric poses with no objects involved and showed that a ConvNet baseline can achieve state-of-the-art performance when enough training data is available. This was confirmed in a public challenge [83], also using a subset of our proposed dataset, and followed by a work [82] analyzing the current state of the art of the field.

3. Daily hand-object actions dataset

3.1. Dataset overview

The dataset contains 1,175 action videos belonging to 45 different action categories, in 3 different scenarios, and performed by 6 actors. A total of 105,459 RGB-D frames are annotated with accurate hand pose and action category. Action sequences present high inter-subject and intra-subject variability of style, speed, scale, and viewpoint. The object's 6-dimensional pose (3D location and orientation) and mesh model are also provided for 4 objects involving 10 different action categories. Our plan is to keep growing the dataset with more models and objects. In Fig. 2 we show some example frames for different action categories and a visualization of the hand-pose annotation.

3.2. Hand-object actions

We captured 45 different daily hand action categories involving 26 different objects. We designed our action categories to span a high number of different hand configurations, following the same taxonomy as [50], and to be diverse in both hand pose and action space (see Fig. 4). Each object has a minimum of one associated action (e.g., pen-'write') and a maximum of four (e.g., sponge-'wash', 'scratch', 'squeeze', and 'flip'). These 45 hand actions were recorded and grouped in three different scenarios: kitchen (25), office (12) and social (8). In this work we consider each hand-object manipulation as a different action category, similar to previous datasets [14], although other definitions are possible [73, 78].

3.3. Sensors and data acquisition

Visual data: We mounted an Intel RealSense SR300 RGB-D camera on the shoulder of the subject and captured sequences at 30 fps and resolutions of 1920x1080 and 640x480 for the color and depth streams, respectively.

Pose annotation: To obtain quality annotations of hand and object pose, the hand pose is captured using six magnetic sensors [36] attached to the user's hand, five on the fingertips and one on the wrist, following [84]. Each sensor provides position and orientation with 6 degrees of freedom, and the full hand pose is inferred using inverse kinematics over a defined 21-joint hand model. Each sensor is 2 mm wide and, when attached to the human hand, does not influence the depth image. The color image is affected, as the sensors and the tape attaching them are visible; however, the hand is fully visible and actions remain distinguishable in the color image. Regarding object pose, we attach one more sensor to the closest reachable point to the object's center of mass.
model are also provided for 4 objects involving 10 different Recording process: We asked 6 people, all right-
action categories. Our plan is to keep growing the dataset handed, to perform the actions. Instructions on how to per-
with more models and objects. In Fig. 2 we show some ex- form the action in a safe manner were given, however no
ample frames for different action categories and hand-pose instructions about style or speed were provided, in order to
annotation visualization. capture realistic data. Actions were labeled manually.
Dataset            Sensor   Real?  Class.  Seq.    Frames    Labels
Yale [6]           RGB      ✓      33      -       9,100     Grasp
UTG [7]            RGB      ✓      17      -       -         Grasp
GTEA [14]          RGB      ✓      61      525     31,222    Action
Choi et al. [10]   RGB-D    ✗      33      -       16,500    Grasp+Pose
SynthHands [35]    RGB-D    ✗      -       -       63,530    Pose
EgoDexter [35]     RGB-D    ✓      -       -       3,190     Fingertips
Luo et al. [30]    RGB-D-T  ✓      44      250     450,000   Action
Ours               RGB-D    ✓      45      1,175   105,459   Action+Pose

Table 1: First-person view datasets with hands and objects involved. Our proposed dataset is the first providing both hand pose and action annotations in real data (cf. synthetic).

Figure 3: Taxonomy of our dataset of hand actions involving objects. Some objects are associated with multiple actions (e.g., spoon, sponge, liquid soap), while some others have only one linked action (e.g., calculator, pen, cell charger).

3.4. Dataset statistics

Taxonomy: Fig. 3 shows the distribution of different actions per involved object. Some objects such as 'spoon' have multiple actions (e.g., 'stir', 'sprinkle', 'scoop', and 'put sugar'), while some objects have only one action ('use calculator'). Although it is not an object per se, we included 'hand' as an object in the actions 'handshake' and 'high five'.

Videos per action class: On average there are 26.11 sequences per action class and 45.19 sequences per object. For detailed per-class numbers see Fig. 4 (c).

Duration of videos: Fig. 4 (d) shows the average video duration for the 45 action classes. Some action classes such as 'put sugar' and 'open wallet' involve short atomic movements, on average one second, while others such as 'open letter' require more time to be executed.

Grasps: We identified 34 different grasps following the same taxonomy as in [50], including the most frequently studied ones [8] (i.e., precision/power grasps for different object attributes such as prismatic/round/flat/deformable). In Fig. 4 (b) we show some examples of the correlation between objects, hand poses, and actions.

Viewpoints: In Fig. 4 (e) we show the distribution of frames per hand viewpoint. We define the viewpoint as the angle between the camera direction and the palm of the hand. The dataset presents viewpoints that are more prone to self-occlusion than typical ones in third-person view.
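As an illustration of how such a viewpoint angle can be computed from an annotated frame, the sketch below estimates a palm direction from three palm joints and measures its angle to the camera's optical axis; the joint indices and the camera axis are assumptions for the example, not the dataset's actual convention.

```python
# Sketch: hand viewpoint as the angle between the camera's optical axis and the palm
# direction, with the palm normal built from three palm joints (assumed joint indices).
import numpy as np

def viewpoint_angle_deg(joints_xyz, camera_axis=np.array([0.0, 0.0, 1.0])):
    """joints_xyz: (21, 3) hand pose in camera coordinates -> viewpoint angle in degrees."""
    wrist, index_mcp, pinky_mcp = joints_xyz[0], joints_xyz[5], joints_xyz[17]  # assumed layout
    normal = np.cross(index_mcp - wrist, pinky_mcp - wrist)
    normal /= np.linalg.norm(normal)
    cosang = np.clip(np.dot(normal, camera_axis / np.linalg.norm(camera_axis)), -1.0, 1.0)
    return float(np.degrees(np.arccos(cosang)))

pose = np.random.default_rng(0).normal(size=(21, 3))  # stand-in for one annotated frame
print(viewpoint_angle_deg(pose))
```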
Occlusions: Fig. 6 (a) (bottom) shows the average number of visible (not occluded by object or viewpoint) hand joints per action class. Most actions present a high degree of occlusion, with on average 10 visible joints out of 21.

3.5. Comparison with other datasets

In Table 1 we summarize popular egocentric datasets that involve hands and objects, in either dynamic or static fashion depending on their problem of interest. For conciseness, we have excluded from the table related datasets that do not partially or fully contain object manipulations, e.g., [38, 44, 84]. Note that previous datasets in action recognition [5, 14, 30] do not include hand pose labels. On the other hand, pose and grasp datasets [6, 7, 10, 35, 48, 50] do not contain dynamic actions, and their hand pose annotation is obtained by generating synthetic images or by rough manual annotations [35]. Our dataset 'fills the gap' of egocentric dynamic hand actions with pose and compares favorably in terms of diversity, number of frames, and use of real data.

4. Evaluated algorithms and baselines

4.1. Action recognition

In order to evaluate the current state-of-the-art in action recognition we chose a variety of approaches that, we believe, cover the most representative trends in the literature, as shown in Table 4. As the nature of our data is RGB-D and we have hand pose, we focus our attention on RGB-D and pose-based action recognition approaches, although we also evaluate two RGB action recognition methods [15, 19]. Note that, as discussed above, most previous works in RGB-D action recognition involve full body poses instead of hands, and some of them might not be tailored for hand actions. We elaborate further on this in Section 5.1.
Figure 4: (a) t-SNE [32] visualization of the hand pose embedding over our dataset. Each colored dot represents a full hand pose and each trajectory an action sequence. (b) Correlation between objects, grasps, and actions. Shown poses are the average pose over all action sequences of a certain class. One object can have multiple grasps associated depending on the action performed (e.g., 'juice carton' and 'milk bottle'), and one grasp can have multiple actions associated (e.g., the lateral grasp present in 'sprinkle' and 'clean glasses'). (c) Number of action instances per hand action class. (d) Average number of frames in each video per hand action class. Our dataset contains both atomic and more temporally complex action classes. (e) Distribution of hand viewpoints, defined as the angle between the direction of the camera and the direction of the palm of the hand.
We start with one baseline to assess how the current state-of-the-art in RGB action recognition performs on our dataset. For this, and given that the most successful RGB action recognition approaches [31, 56] use ConvNets to learn descriptors from color and motion flow, we evaluate a recent two-stream architecture fine-tuned on our dataset [15].

For the depth modality, we first evaluate two local depth descriptor approaches, HOG2 [40] and HON4D [43], which exploit gradient and surface normal information as features for action recognition. As a global-scene depth descriptor, we evaluate the recent approach by [47] that learns view-invariant features using ConvNets trained on several synthesized depth views of human body pose.

We follow our evaluation with pose-based action recognition methods. As our main baseline, we implemented a recurrent neural network with long short-term memory (LSTM) modules, inspired by the architecture of [87]. We also evaluate several state-of-the-art pose-based action recognition approaches. We start with descriptor-based methods such as Moving Pose [85], which encodes atomic motion information, and [64], which represents poses as points on a Lie group. For methods focusing on learning temporal dependencies, we evaluate HBRNN [12], Gram Matrix [86] and TF [17]. HBRNN consists of a bidirectional recurrent neural network with hierarchical layers designed to learn features from the body pose. Gram Matrix is currently the best performing method for body pose and uses Gram matrices to learn the dynamics of actions. TF learns both discriminative static poses and transitions between poses using decision forests.
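To give a flavor of the kind of representation behind the Gram Matrix method, the snippet below summarizes a pose sequence by the Gram matrix of its centered, flattened frames; this is a simplified illustration of the idea, not the full formulation of [86].

```python
# Simplified illustration of a Gram-matrix sequence descriptor (not the full method of [86]):
# a sequence of T hand poses (21 joints x 3 coordinates) is summarized by the T x T matrix
# of pairwise inner products between centered frames, which encodes the sequence dynamics.
import numpy as np

def gram_descriptor(pose_sequence):
    """pose_sequence: (T, 21, 3) array -> (T, T) Gram matrix of centered, flattened poses."""
    X = pose_sequence.reshape(len(pose_sequence), -1)
    X = X - X.mean(axis=0, keepdims=True)   # remove the per-sequence mean pose
    return X @ X.T

seq = np.random.default_rng(1).normal(size=(40, 21, 3))  # stand-in action sequence
print(gram_descriptor(seq).shape)                         # (40, 40)
```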
To conclude, we evaluate one hybrid approach that jointly learns heterogeneous features (JOULE) [19], using an iterative algorithm to learn features jointly, taking into account all the data channels: color, depth, and hand pose.

4.2. Hand pose estimation

To assess the state-of-the-art in hand pose estimation, we use the same ConvNet as [84]. We choose this approach as it is easy to interpret and it was shown to provide good performance in a cross-benchmark evaluation [84]. The chosen method is a discriminative approach operating on a frame-by-frame basis, which does not need any initialization or manual recovery when tracking fails [21, 41].
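As an illustration of what such a frame-by-frame discriminative estimator looks like, the sketch below regresses the 63 joint coordinates from a cropped depth patch with a small ConvNet; the layer sizes and input resolution are stand-ins, not the actual network of [84].

```python
# Stand-in sketch of a per-frame discriminative hand pose estimator (not the architecture
# of [84]): a small ConvNet regressing 21 x 3 joint coordinates from a cropped, normalized
# depth patch, applied independently to every frame, so no tracking or initialization is needed.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 96, 1)),                 # assumed hand crop size
    tf.keras.layers.Conv2D(32, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(63),                         # 21 joints x 3 coordinates
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```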
5. Benchmark evaluation results

5.1. Action recognition

In the following we present our experiments in action recognition. In this section we assume the hand pose is given, i.e., we use the hand pose annotations obtained using the magnetic sensors and inverse kinematics. We evaluate the use of estimated hand poses, without the aid of the sensors, for action recognition in Section 5.2.

Following common practice in full-body pose action recognition [64, 85], we compensate for anthropomorphic and viewpoint differences by normalizing poses to have the same distance between pairs of joints and by defining the wrist as the center of coordinates.
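A minimal sketch of this normalization is given below, assuming a (21, 3) joint array with the wrist at index 0 and a particular parent layout for the kinematic tree; the topology and the canonical bone length are illustrative choices, not necessarily the exact ones used in the paper.

```python
# Sketch of the pose normalization used before pose-based recognition (assumed joint
# topology): center the pose on the wrist and re-chain every bone with a canonical length
# so that inter-joint distances are the same across subjects.
import numpy as np

# Assumed 21-joint layout: 0 = wrist, then 4 joints per finger, thumb to pinky.
PARENT = np.array([0, 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11, 0, 13, 14, 15, 0, 17, 18, 19])
REF_BONE = 0.03  # canonical bone length in meters (single illustrative value)

def normalize_pose(joints):
    """joints: (21, 3) -> wrist-centered pose with canonical bone lengths."""
    joints = joints - joints[0]                      # wrist as the origin
    out = np.zeros_like(joints)
    for j in range(1, len(joints)):                  # parents always precede children
        bone = joints[j] - joints[PARENT[j]]
        bone = bone / (np.linalg.norm(bone) + 1e-8) * REF_BONE
        out[j] = out[PARENT[j]] + bone
    return out

print(normalize_pose(np.random.default_rng(2).normal(size=(21, 3)))[0])  # wrist at origin
```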
5.1.1 A baseline: LSTM

We start our experimental evaluation with a simple yet powerful baseline: a recurrent neural network with a long short-term memory (LSTM) module. The architecture of our network is inspired by [87], with two differences: we do not 'go deep', and we use a more conventional unidirectional network instead of a bidirectional one. Following [87], we set the number of neurons to 100 and the dropout probability to 0.2. We use TensorFlow and the Adam optimizer.
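A sketch of such a baseline in tf.keras is shown below; the padded sequence length and the exact layer arrangement are assumptions, since the paper only specifies 100 units, 0.2 dropout, TensorFlow and Adam.

```python
# Sketch of the LSTM baseline (details beyond 100 units / 0.2 dropout / Adam are assumed):
# per-frame input is the flattened, normalized 21-joint pose (63 values) and the network
# predicts one of the 45 action classes per sequence.
import tensorflow as tf

MAX_LEN, POSE_DIMS, NUM_CLASSES = 300, 63, 45  # assumed padding length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN, POSE_DIMS)),
    tf.keras.layers.Masking(mask_value=0.0),        # ignore zero-padded frames
    tf.keras.layers.LSTM(100, dropout=0.2),         # single unidirectional layer
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_sequences, train_labels, validation_data=(test_sequences, test_labels))
```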
Training and testing protocols: We experiment with two protocols. The first protocol consists of using different partitions of the data for training and the rest for testing, and we tried three different training:testing ratios of 1:3, 1:1 and 3:1 at the sequence level. The second protocol is a 6-fold 'leave-one-person-out' cross-validation, i.e., each fold consists of 5 subjects for training and one for testing. Results are presented in Table 2. We observe that following a cross-person protocol yields the worst results, taking into account that in each fold we have training/testing proportions similar to the 3:1 setting. This can be explained by the difference in hand action styles between subjects. In the rest of the paper we perform our experiments using the 1:1 setting, with 600 action sequences for training and 575 for testing.

Protocol    1:3     1:1     3:1     cross-person
Acc. (%)    58.75   78.73   84.82   62.06

Table 2: Action recognition results (percentage of correct video classification) for different training/testing protocols.
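Both protocols can be generated from per-sequence metadata alone; the sketch below builds them from a list of subject ids, with the input names and the random seed as assumptions.

```python
# Sketch of the two evaluation protocols (inputs are assumed): random train:test splits at
# the sequence level (1:3, 1:1, 3:1) and 6-fold leave-one-person-out cross-validation.
import numpy as np

def ratio_split(num_sequences, train_fraction, seed=0):
    order = np.random.default_rng(seed).permutation(num_sequences)
    cut = int(round(train_fraction * num_sequences))
    return order[:cut], order[cut:]                  # train indices, test indices

def leave_one_person_out(subject_ids):
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):           # 6 folds for the 6 subjects
        yield np.flatnonzero(subject_ids != subject), np.flatnonzero(subject_ids == subject)

train_idx, test_idx = ratio_split(1175, train_fraction=0.5)  # the 1:1 protocol
print(len(train_idx), len(test_idx))  # ~1:1 here; the paper's actual split is 600/575
```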
Results discussion: In Fig. 5 (a) we show the recognition accuracies per category on a subset of actions, and the action confusion matrix is shown in Fig. 6 (b). Some actions such as 'sprinkle spoon', 'put tea bag' and 'pour juice' are easily identifiable, while actions such as 'open wallet' and 'use calculator' are commonly confused, likely because the hand poses involved are similar and the motion more subtle. In Fig. 5 (d) we show the contribution of each finger's motion to action recognition performance, finding that the index is the most informative finger. Combining thumb and index poses boosts the accuracy, likely due to the fact that most grasps are explained by these two fingers [6]. Fingertips alone are also a high source of information, being the most articulated joints and able to 'explain' the hand pose.

5.1.2 State-of-the-art evaluation

In Table 4 we show results for state-of-the-art approaches in different data modalities. We observe that the Two-stream [15] method performs well when combining both spatial and temporal cues. Depth methods tend to perform slightly worse than the rest of the methods, suggesting that they are not able to fully capture either the object cues or the hand pose. Note that for Novel View [47] we extracted deep features from a network trained on several synthetic views of bodies, which may not generalize well to hand poses, and fine-tuning on our dataset did not help. From all the approaches, we observe that the ones using hand pose achieve the best performance, with Gram Matrix [86] and Lie group [64] performing particularly well, a result in line with those reported in body-pose action recognition.

In Fig. 5 we select some of the most representative methods and analyze their performance in detail. We observe that the pose-based method Gram Matrix outperforms the rest in most of the measures, especially when we retrieve the top-k action hypotheses (Fig. 5 (b)), showing the benefit of using hand pose for action recognition. Looking at Fig. 5 (a), we observe that Two-stream outperforms the rest of the methods in some categories in which the object is big and the action does not involve much motion, e.g., 'use calculator' and 'read paper'. This good performance can be due to the pre-training of the spatial network on a big image recognition dataset. We further observe this in Fig. 5 (c), where we analyze the top-k hypotheses given by the prediction and check whether the predicted action contains the object being manipulated, suggesting that the network correctly recognizes the object but fails to capture the temporal dynamics.
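The two curves of Fig. 5 (b, c) can be computed as below from per-sequence class scores and a mapping from each action class to its manipulated object; both inputs and their names are illustrative assumptions.

```python
# Sketch of the top-k measures of Fig. 5 (inputs assumed): top-k action accuracy checks
# whether the true class is among the k highest-scoring hypotheses; top-k object accuracy
# checks whether any of those k hypotheses involves the manipulated object.
import numpy as np

def topk_action_accuracy(scores, labels, k):
    """scores: (N, 45) class scores, labels: (N,) true classes."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def topk_object_accuracy(scores, labels, k, action_to_object):
    """action_to_object: (45,) id of the object manipulated in each action class."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    true_obj = action_to_object[labels]
    return float(np.mean([true_obj[i] in action_to_object[topk[i]] for i in range(len(labels))]))

rng = np.random.default_rng(3)
scores, labels = rng.random((575, 45)), rng.integers(0, 45, 575)
print(topk_action_accuracy(scores, labels, k=3))
```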
Hand pose vs. depth vs. color: We performed one additional experiment using the JOULE [19] approach, breaking down the contribution of each data modality. In Table 4 (bottom) we show that the hand pose features are the most discriminative ones, although the performance can be increased by combining them with RGB and depth cues. This result suggests that hand poses capture information complementary to RGB and depth features.

Object pose: We did an additional experiment using the object pose as a feature for action recognition, on the subset of actions that have annotated object poses: a total of 261 sequences for 10 different classes and 4 objects. We trained our LSTM baseline on half of the sequences using three different inputs: hand pose, object pose, and both combined. In Table 3 we show the results and observe that object pose and hand pose features are complementary and useful for recognizing egocentric hand-object actions.
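The combined input of Table 3 can be formed by simply concatenating the per-frame features before feeding them to the LSTM; a short sketch with assumed array shapes follows.

```python
# Sketch of the combined per-frame feature used for Table 3 (shapes assumed): the flattened
# 21-joint hand pose (63 values) concatenated with the 6D object pose of the same frame.
import numpy as np

def combined_feature(hand_pose_seq, object_pose_seq):
    """hand_pose_seq: (T, 21, 3), object_pose_seq: (T, 6) -> (T, 69) LSTM input."""
    T = len(hand_pose_seq)
    return np.concatenate([hand_pose_seq.reshape(T, -1), object_pose_seq], axis=1)

features = combined_feature(np.zeros((120, 21, 3)), np.zeros((120, 6)))
print(features.shape)  # (120, 69)
```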
Figure 5: (a) Class accuracies of some representative methods on a subset of classes. (b) Top-k action accuracy: the true action label is among the top-k action prediction hypotheses. (c) Top-k object accuracy: the manipulated object appears among the top-k action prediction hypotheses. (d) Impact of each of the five fingers, combinations of them, and the fingertips on action recognition.
Pose feature       Hand    Object   Hand+Object
Action acc. (%)    87.45   74.45    91.97

Table 3: We evaluate the use of the 6D object pose for action recognition on a subset of our dataset. We observe the benefit of combining it with the hand pose.

Method                   Year   Color   Depth   Pose   Acc. (%)
Two stream-color [15]    2016   ✓       ✗       ✗      61.56
Two stream-flow [15]     2016   ✓       ✗       ✗      69.91
Two stream-all [15]      2016   ✓       ✗       ✗      75.30
HOG2-depth [40]          2013   ✗       ✓       ✗      59.83
HOG2-depth+pose [40]     2013   ✗       ✓       ✓      66.78
HON4D [43]               2013   ✗       ✓       ✗      70.61
Novel View [47]          2016   ✗       ✓       ✗      69.21
1-layer LSTM             2016   ✗       ✗       ✓      78.73
2-layer LSTM             2016   ✗       ✗       ✓      80.14
Moving Pose [85]         2013   ✗       ✗       ✓      56.34
Lie Group [64]           2014   ✗       ✗       ✓      82.69
HBRNN [12]               2015   ✗       ✗       ✓      77.40
Gram Matrix [86]         2016   ✗       ✗       ✓      85.39
TF [17]                  2017   ✗       ✗       ✓      80.69
JOULE-color [19]         2015   ✓       ✗       ✗      66.78
JOULE-depth [19]         2015   ✗       ✓       ✗      60.17
JOULE-pose [19]          2015   ✗       ✗       ✓      74.60
JOULE-all [19]           2015   ✓       ✓       ✓      78.78

Table 4: Hand action recognition performance of the different evaluated approaches on our proposed dataset.

5.2. Hand pose estimation

Training with objects vs. no objects: One question raised while designing our experiments was whether we actually needed to annotate the hand pose with close-to-ground-truth accuracy to experiment with dynamic hand actions. We try to answer this question by estimating the hand poses of our hand action dataset in two ways, partitioning the data as in our action split: using the nearly 300k object-free egocentric samples from [84] for training, and using the images in the training set of our hand action dataset. As observed in Fig. 6 (c) and Table 5, the results suggest that having hand-object images in the training set is crucial to train state-of-the-art hand pose estimators, likely due to the fact that occlusions and object shapes need to be seen by the estimator beforehand. To confirm this, we conducted two extra experiments: cross-subject (half of the users in training and half in testing, with all objects seen in both splits) and cross-object (half of the objects in training and half in testing, with all subjects seen in both splits). In Fig. 6 (c) and Table 5 we observe that the network is able to generalize to unseen subjects but struggles to do so for unseen objects, suggesting that recognizing the shape of the object and its associated grasp is crucial to train hand pose estimators. This shows the need for annotated hand poses interacting with objects, and why our dataset can be of interest to the hand pose community.
Figure 6: (a) Top: Class action recognition accuracies for our LSTM baseline using estimated hand poses (accuracies with
groundtruth poses are represented with black triangles). Bottom: Average number of visible (not occluded) joints for hand
actions on our dataset and its impact on hand pose estimation. (b) Hand action confusion matrix for our LSTM baseline.
(c) Percentage of frames for different hand pose estimation error thresholds. (d) Qualitative results on hand pose estimation.
In Fig. 6 (d) we show some qualitative results of hand pose estimation on our proposed dataset and observe that, while not perfect, the estimates are good enough for action recognition.

Hand pose protocol                      Pose error (mm)   Action (%)
Cross-subject                           11.25             -
Cross-object                            19.84             -
Action split (training w/o objects)     31.03             29.63
Action split (training w/ objects)      14.34             72.06
Action split (GT mag.+IK poses)         -                 78.73

Table 5: Average hand pose estimation error (3D distance over all 21 joints between the magnetic-sensor poses and the estimates) for different protocols, and its impact on action recognition.
Hand pose estimation and action recognition: Now we try to answer the following key question: how good is current hand pose estimation for recognizing hand actions? In Table 5 we show results of hand action recognition obtained by replacing the hand pose labels with the estimated ones in the test set. We observe that reducing the hand pose error by a factor of two yields a more than twofold improvement in action recognition. The difference in hand action recognition between using the hand pose labels and using the estimated ones at test time is 6.67%. We also tested the two best-performing methods from the previous section, Lie group [64] and Gram Matrix [86]. For Lie group we obtained an accuracy of 69.22%, while Gram Matrix gave a poor result of 32.22%, likely due to its strong assumptions on the noise distribution. On the other hand, our LSTM baseline shows more robust behavior in the presence of noisy hand pose estimates. In Fig. 6 (a) we show how hand occlusion affects the pose estimation quality and its impact on class recognition accuracies. Although some classes present a clear correlation between pose error and action accuracy degradation (e.g., 'receive coin', 'pour wine'), the LSTM is still able to obtain acceptable recognition rates, likely because it can infer the action from temporal patterns and correctly estimated joints. For more insight, we analyzed the pose error per finger: T: 12.45, I: 15.48, M: 18.08, R: 16.69, P: 18.95, all in mm. Thumb and index joints present the lowest estimation error because they are typically less occluded in the egocentric setting. Together with the previous section's finding that the motion of these two fingers is a high source of information, this is a plausible explanation of why we can still obtain good action recognition performance while having noisy hand pose estimates.

6. Concluding remarks

We have proposed a novel benchmark and presented experimental evaluations for RGB-D and pose-based hand action recognition in a first-person setting. The benchmark provides both temporal action labels and full 3D hand pose labels, and additionally 6D object pose labels on a part of the dataset. Both RGB-D action recognition and 3D hand pose estimation are relatively new fields, and this is a first attempt to relate the two, similar to what has been done for the full human body. We have evaluated several baselines on our dataset and concluded that hand pose features are a rich source of information for recognizing manipulation actions. We believe that our dataset and experiments can encourage future work in multiple fields including action recognition, hand pose estimation, object pose estimation, and emerging ones such as joint hand-object pose estimation.

Acknowledgements: This work is part of the Imperial College London-Samsung Research project, supported by Samsung Electronics.
References

[1] S. Allin and D. Ramanan. Assessment of post-stroke functioning using machine vision. In MVA, 2007.
[2] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. RAS, 2009.
[3] S. Baek, K. I. Kim, and T.-K. Kim. Augmented skeleton space transfer for depth-based hand pose estimation. In CVPR, 2018.
[4] S. Baek, Z. Shi, M. Kawade, and T.-K. Kim. Kinematic-layout-aware random forests for depth-based action recognition. In BMVC, 2017.
[5] S. Bambach, S. Lee, D. J. Crandall, and C. Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In ICCV, 2015.
[6] I. M. Bullock, T. Feix, and A. M. Dollar. The yale human grasping dataset: Grasp, object, and task data in household and machine shop environments. IJRR, 2015.
[7] M. Cai, K. M. Kitani, and Y. Sato. A scalable approach for understanding the visual structures of hand grasps. In ICRA, 2015.
[8] M. Cai, K. M. Kitani, and Y. Sato. Understanding hand-object manipulation with grasp types and object attributes. In RSS, 2016.
[9] H. J. Chang, G. Garcia-Hernando, D. Tang, and T.-K. Kim. Spatio-temporal hough forest for efficient detection–localisation–recognition of fingerwriting in egocentric camera. CVIU, 2016.
[10] C. Choi, S. H. Yoon, C.-N. Chen, and K. Ramani. Robust hand pose estimation during the interaction with an unknown object. In ICCV, 2017.
[11] Q. De Smedt, H. Wannous, and J.-P. Vandeborre. Skeleton-based dynamic hand gesture recognition. In CVPRW, 2016.
[12] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
[13] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In ICCV, 2011.
[14] A. Fathi, X. Ren, and J. M. Rehg. Learning to recognize objects in egocentric activities. In CVPR, 2011.
[15] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[16] C. Fermüller, F. Wang, Y. Yang, K. Zampogiannis, Y. Zhang, F. Barranco, and M. Pfeiffer. Prediction of manipulation actions. IJCV, 2017.
[17] G. Garcia-Hernando and T.-K. Kim. Transition forests: Learning discriminative temporal transitions for action recognition and detection. In CVPR, 2017.
[18] H. Hamer, J. Gall, T. Weise, and L. Van Gool. An object-dependent hand pose prior from sparse training data. In CVPR, 2010.
[19] J.-F. Hu, W.-S. Zheng, J. Lai, and J. Zhang. Jointly learning heterogeneous features for RGB-D activity recognition. In CVPR, 2015.
[20] D.-A. Huang, M. Ma, W.-C. Ma, and K. M. Kitani. How do we use our hands? discovering a diverse set of common grasps. In CVPR, 2015.
[21] Intel. Perceptual computing sdk. 2013.
[22] C. Ionescu, J. Carreira, and C. Sminchisescu. Iterated second-order label sensitive pooling for 3d human pose estimation. In CVPR, 2014.
[23] T. Ishihara, K. M. Kitani, W.-C. Ma, H. Takagi, and C. Asakawa. Recognizing hand-object interactions in wearable camera videos. In ICIP, 2015.
[24] Y. Jang, S.-T. Noh, H. J. Chang, T.-K. Kim, and W. Woo. 3d finger cape: Clicking action and position estimation under self-occlusions in egocentric viewpoint. TVCG, 2015.
[25] C. Keskin, F. Kıraç, Y. E. Kara, and L. Akarun. Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In ECCV, 2012.
[26] P. G. Kry and D. K. Pai. Interaction capture and synthesis. ACM Transactions on Graphics, 2006.
[27] J. Lei, X. Ren, and D. Fox. Fine-grained kitchen activity recognition using rgb-d. In Ubicomp, 2012.
[28] H. Liang, J. Yuan, and D. Thalmann. Parsing the hand in depth images. TMM, 2014.
[29] L. Liu and L. Shao. Learning discriminative representations from rgb-d video data. In IJCAI, 2013.
[30] R. Luo, O. Sener, and S. Savarese. Scene semantic reconstruction from egocentric rgb-d-thermal videos. In 3D Vision (3DV), 2017.
[31] M. Ma, H. Fan, and K. M. Kitani. Going deeper into first-person activity recognition. In CVPR, 2016.
[32] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. JMLR, 2008.
[33] M. Moghimi, P. Azagra, L. Montesano, A. C. Murillo, and S. Belongie. Experiments on an rgb-d wearable vision system for egocentric activity recognition. In CVPRW, 2014.
[34] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural networks. In CVPR, 2016.
[35] F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas, and C. Theobalt. Real-time hand tracking under occlusion from an egocentric rgb-d sensor. In ICCV, 2017.
[36] NDItrakSTAR. http://www.ascension-tech.com/products/trakstar-2-drivebay-2/.
[37] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout. Hand segmentation with structured convolutional learning. In ACCV, 2014.
[38] M. Oberweger, G. Riegler, P. Wohlhart, and V. Lepetit. Efficiently creating 3d training data for fine hand pose estimation. In CVPR, 2016.
[39] M. Oberweger, P. Wohlhart, and V. Lepetit. Training a feedback loop for hand pose estimation. In ICCV, 2015.
[40] E. Ohn-Bar and M. M. Trivedi. Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations. ITS, 2014.
[41] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficient model-based 3d tracking of hand articulations using kinect. In BMVC, 2011.
[42] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Full dof tracking of a hand interacting with an object by modeling occlusions and physical constraints. In ICCV, 2011.
[43] O. Oreifej and Z. Liu. HON4D: histogram of oriented 4D normals for activity recognition from depth sequences. In CVPR, 2013.
[44] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In CVPR, 2012.
[45] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun. Realtime and robust hand tracking from depth. In CVPR, 2014.
[46] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian. Histogram of oriented principal components for cross-view action recognition. TPAMI, 2016.
[47] H. Rahmani and A. Mian. 3d action recognition from novel viewpoints. In CVPR, 2016.
[48] G. Rogez, M. Khademi, J. Supančič III, J. M. M. Montiel, and D. Ramanan. 3d hand pose detection in egocentric rgb-d images. In ECCV, 2014.
[49] G. Rogez, J. S. Supancic, and D. Ramanan. First-person pose recognition using egocentric workspaces. In CVPR, 2015.
[50] G. Rogez, J. S. Supancic, and D. Ramanan. Understanding everyday hands in action from rgb-d images. In ICCV, 2015.
[51] J. Romero, H. Kjellström, C. H. Ek, and D. Kragic. Non-parametric hand pose estimation with object context. IVC, 2013.
[52] A. Shahroudy, T.-T. Ng, Q. Yang, and G. Wang. Multimodal multipart learning for action recognition in depth videos. TPAMI, 2016.
[53] T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Kim, C. Rhemann, I. Leichter, A. Vinnikov, Y. Wei, D. Freedman, P. Kohli, E. Krupka, A. Fitzgibbon, and S. Izadi. Accurate, robust, and flexible real-time hand tracking. In CHI, 2015.
[54] Z. Shi and T.-K. Kim. Learning and refining of privileged information-based rnns for action recognition from depth sequences. In CVPR, 2017.
[55] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Commun. ACM, 2013.
[56] S. Singh, C. Arora, and C. Jawahar. First person action recognition using deep learned descriptors. In CVPR, 2016.
[57] S. Sridhar, F. Mueller, M. Zollhöfer, D. Casas, A. Oulasvirta, and C. Theobalt. Real-time joint tracking of a hand manipulating an object from rgb-d input. In ECCV, 2016.
[58] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded hand pose regression. In CVPR, 2015.
[59] D. Tang, H. J. Chang, A. Tejani, and T.-K. Kim. Latent regression forest: Structured estimation of 3d articulated hand posture. In CVPR, 2014.
[60] D. Tang, T.-H. Yu, and T.-K. Kim. Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In ICCV, 2013.
[61] J. Tompson, M. Stein, Y. Lecun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. TOG, 2014.
[62] D. Tzionas, L. Ballan, A. Srikantha, P. Aponte, M. Pollefeys, and J. Gall. Capturing hands in action using discriminative salient points and physics simulation. IJCV, 2016.
[63] V. Veeriah, N. Zhuang, and G.-J. Qi. Differential recurrent neural networks for action recognition. In ICCV, 2015.
[64] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In CVPR, 2014.
[65] R. Vemulapalli and R. Chellappa. Rolling rotations for recognizing human actions from 3d skeletal data. In CVPR, 2016.
[66] C. Wang, Y. Wang, and A. L. Yuille. Mining 3d key-pose-motifs for action recognition. In CVPR, 2016.
[67] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[68] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu. Robust 3d action recognition with random occupancy patterns. In ECCV, 2012.
[69] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In CVPR, 2012.
[70] P. Wang, C. Yuan, W. Hu, B. Li, and Y. Zhang. Graph based skeleton motion representation and similarity measurement for action recognition. In ECCV, 2016.
[71] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[72] A. Wetzler, R. Slossberg, and R. Kimmel. Rule of thumb: Deep derotation for improved fingertip detection. In BMVC, 2015.
[73] M. Wray, D. Moltisanti, W. Mayol-Cuevas, and D. Damen. Sembed: Semantic embedding of egocentric action videos. In ECCVW, 2016.
[74] D. Wu and L. Shao. Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In CVPR, 2014.
[75] L. Xia, C.-C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3d joints. In CVPRW, 2012.
[76] X. Yang and Y. Tian. Super normal vector for activity recognition using depth sequences. In CVPR, 2014.
[77] Y. Yang, C. Fermuller, Y. Li, and Y. Aloimonos. Grasp type revisited: A modern perspective on a classical feature for vision. In CVPR, 2015.
[78] Y. Yang, A. Guha, C. Fermuller, and Y. Aloimonos. A cognitive system for understanding human manipulation actions. ACS, 2014.
[79] A. Yao, J. Gall, G. Fanelli, and L. J. Van Gool. Does human action recognition benefit from pose estimation? In BMVC, 2011.
[80] Q. Ye, S. Yuan, and T.-K. Kim. Spatial attention deep net with partial pso for hierarchical hybrid hand pose estimation. In ECCV, 2016.
[81] T.-H. Yu, T.-K. Kim, and R. Cipolla. Unconstrained monocular 3d human pose estimation by action detection and cross-modality regression forest. In CVPR, 2013.
[82] S. Yuan, G. Garcia-Hernando, B. Stenger, G. Moon, J. Y. Chang, K. M. Lee, P. Molchanov, J. Kautz, S. Honari, L. Ge, J. Yuan, X. Chen, G. Wang, F. Yang, K. Akiyama, Y. Wu, Q. Wan, M. Madadi, S. Escalera, S. Li, D. Lee, I. Oikonomidis, A. Argyros, and T.-K. Kim. Depth-based 3d hand pose estimation: From current achievements to future goals. In CVPR, 2018.
[83] S. Yuan, Q. Ye, G. Garcia-Hernando, and T.-K. Kim. The 2017 hands in the million challenge on 3d hand pose estimation. arXiv preprint arXiv:1707.02237, 2017.
[84] S. Yuan, Q. Ye, B. Stenger, S. Jain, and T.-K. Kim. Big hand 2.2m benchmark: Hand pose data set and state of the art analysis. In CVPR, 2017.
[85] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In ICCV, 2013.
[86] X. Zhang, Y. Wang, M. Gou, M. Sznaier, and O. Camps. Efficient temporal sequence comparison and classification using gram matrix embeddings on a riemannian manifold. In CVPR, 2016.
[87] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie. Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In AAAI, 2016.
[88] Y. Zhu, W. Chen, and G. Guo. Fusing spatiotemporal features and joints for 3d action recognition. In CVPRW, 2013.
[89] L. Zollo, S. Roccella, E. Guglielmelli, M. C. Carrozza, and P. Dario. Biomechatronic design and control of an anthropomorphic artificial hand for prosthetic and robotic applications. TMECH, 2007.