
Viewpoints and Keypoints

Shubham Tulsiani and Jitendra Malik


University of California, Berkeley - Berkeley, CA 94720
{shubhtuls,malik}@eecs.berkeley.edu

Abstract

We characterize the problem of pose estimation for rigid objects in terms of determining viewpoint to explain coarse pose and keypoint prediction to capture the finer details. We address both these tasks in two different settings - the constrained setting with known bounding boxes and the more challenging detection setting where the aim is to simultaneously detect and correctly estimate pose of objects. We present Convolutional Neural Network based architectures for these and demonstrate that leveraging viewpoint estimates can substantially improve local appearance based keypoint predictions. In addition to achieving significant improvements over state-of-the-art in the above tasks, we analyze the error modes and effect of object characteristics on performance to guide future efforts towards this goal.

Figure 1: Alternate characterizations of pose in terms of viewpoint and keypoint locations

1. Introduction

There are two ways in which one can describe the pose of the car in Figure 1 - either via its viewpoint or via specifying the locations of a fixed set of keypoints. The former characterization provides a global perspective about the object whereas the latter provides a more local one. In this work, we aim to reliably predict both these representations of pose for objects.

Our overall approach is motivated by the theory of global precedence - that humans perceive the global structure before the fine level local details [27]. It was also noted by Koenderink and van Doorn [22] that viewpoint determines appearance and several works have shown that larger wholes improve the discrimination performance of parts [31, 26, 29]. Inspired by this philosophy, we propose an algorithm which first estimates viewpoint for the target object and leverages the predicted viewpoint to improve the local appearance based keypoint predictions.

Viewpoint is manifested in a 2D image by the spatial relationships among the different features of the object. Convolutional Neural Network (CNN) [9, 24] based methods which can implicitly capture and hierarchically build on such relations are therefore suitable candidates for viewpoint prediction.

A robot which merely knows that a cup exists but cannot find its handle will not be able to grasp it. Towards the goal of developing a finer understanding of objects, we tackle the task of predicting keypoints by modeling appearances at multiple scales - a fine scale appearance model, while prone to false positives, can localize accurately and a coarser scale appearance model is more robust to mis-localizations. Note that merely reasoning over local appearance is not sufficient to solve the task of keypoint prediction. For example, the notion of the 'front wheel' assumes its meaning in context of the whole bicycle. The local appearance of the patch might also correspond to the 'back wheel' - it is because we know the bicycle is front facing that we are able to disambiguate. Motivated by this, we use the viewpoint predicted by our system to improve the local appearance based keypoint predictions.

Our proposed algorithm, as illustrated in Figure 2, has the following components -

Viewpoint Prediction : We formulate the problem of viewpoint prediction as predicting three euler angles (azimuth, elevation and cyclorotation) corresponding to the instance. We train a CNN based architecture which can implicitly capture and aggregate local evidence for predicting the euler angles to obtain a viewpoint estimate.

Local Appearance based Keypoint Activation : We propose a fully convolutional CNN based architecture to model local part appearance. We capture the appearance at multiple scales and combine the CNN responses across scales to obtain a resulting heatmap which corresponds to a spatial log-likelihood distribution for each keypoint.
Figure 2: Overview of our approach. To recover an estimate of the global pose, we use a CNN based architecture to predict
viewpoint. For each keypoint, a spatial likelihood map is obtained via combining multiscale convolutional response maps
and it is then combined with a likelihood conditioned on predicted viewpoint to obtain our final predictions.

Viewpoint Conditioned Keypoint Likelihood : We propose a viewpoint conditioned keypoint likelihood, implemented as a non-parametric mixture of gaussians, to model the probability distribution of keypoints given the viewpoint prediction. We combine it with the appearance based likelihood computed above to obtain our keypoint predictions.

Keypoint prediction methods have traditionally been evaluated assuming ground-truth boxes as input [1, 21, 25]. This means that the evaluation setting is quite different from the conditions under which these methods would be used - in conjunction with imprecisely localized object detections. Yang and Ramanan [38] argued for the importance of this task for human pose estimation and introduced an evaluation criterion which we adapt to generic object categories. To the best of our knowledge, we are the first to empirically evaluate the applicability of a keypoint prediction algorithm not restricted to a specific object category in this challenging setting.

Furthermore, inspired by the analysis of the detection methods presented by Hoiem et al. [18], we present an analysis of our algorithm's failure modes as well as the impact of object characteristics on the algorithm's performance.

2. Related Work

Viewpoint Prediction: Recently, CNNs [9, 24] have been shown to outperform Deformable Part Model (DPM) [8] based methods for recognition tasks [11, 6, 23]. Whereas DPMs explicitly model part appearances and their deformations, the CNN architecture allows such relations to be captured implicitly using a hierarchical convolutional structure. Girshick et al. [12] argued that DPMs could also be thought of as a specific instantiation of CNNs and therefore training an end-to-end CNN for the corresponding task should outperform a method which instead explicitly models part appearances and relations.

This result is particularly applicable to viewpoint estimation where the prominent approaches, from the initial instance based methods [19] to current state-of-the-art [37, 30], explicitly model local appearances and aggregate evidence to infer viewpoint. Pepik et al. [30] extend DPMs to 3D to model part appearances and rely on these to infer pose, and Xiang et al. [37] introduce a separate DPM component corresponding to each viewpoint. Ghodrati et al. [10] differ from the explicit part-based methodology, using a fixed global descriptor to estimate viewpoint. We build on both these approaches by using a method which, while using a global descriptor, can implicitly capture part appearances.

Keypoint Prediction: Keypoint prediction can be classified into two settings - a) 'Keypoint Localization' where the task is to find keypoints for objects with known bounding boxes and b) 'Keypoint Detection' where the bounding box is unknown. This problem has been particularly well studied for humans - tracing back from classic model-based approaches for video [28, 17] to more recent pictorial structure based approaches [38] on challenging single image based real world datasets like LSP [21] or MPII Human Pose [1]. Recently Toshev et al. [36] demonstrated that CNN based models can successfully be used for keypoint prediction
for humans and Tompson et al. [35] significantly improved upon these results using a purely convolutional approach. These evaluations, however, are restricted to keypoint localizations. A more general task of keypoint detection without assuming ground truth box annotations was also recently introduced for humans by Yang and Ramanan [38] and Gkioxari et al. [14, 13] evaluated their keypoint prediction algorithm in this setting.

For generic object categories, annotations for keypoints on the challenging PASCAL VOC dataset [7] were introduced by Bourdev et al. [4]. Though similar annotations or fitted CAD models have been successfully used to train better object detection systems [3] as well as for simultaneous object detection and viewpoint estimation [30], the task of keypoint prediction has largely been unaddressed for generic object categories. Long et al. [25] recently evaluated keypoint localization results across all PASCAL categories but, to the best of our knowledge, the more general setting of keypoint detection for generic object categories has not yet been explored.

Previous works [32, 15, 16] have also jointly tackled the problem of keypoint detection and pose estimation. While these are perhaps the closest to ours in terms of goals, they differ markedly in methodology - they explicitly aggregate local evidence for pose estimation and have either been restricted to a specific object category [15, 16] or use instance model based matching [32]. Long et al. [25], on the other hand, share many commonalities with our methodology for the task of keypoint prediction - convolutional keypoint detections augmented with global priors to predict keypoints. However, we show that we can significantly improve their results by combining multiscale convolutional predictions from a trained CNN with a more principled, viewpoint estimation based global model. Both [16, 25] only evaluate keypoint localization performance whereas we also evaluate our method in the setting of keypoint detection.

3. Viewpoint Estimation

3.1. Formulation

We formulate the global pose estimation for rigid categories as predicting the viewpoint with respect to a canonical pose. This is equivalent to determining the three euler angles corresponding to azimuth (φ), elevation (ϕ) and cyclorotation (ψ). We frame the task of predicting the euler angles as a classification problem where the classes {1, . . . , Nθ} correspond to Nθ disjoint angular bins. We note that the euler angles, and therefore every viewpoint, can be equivalently described by a rotation matrix. We will use the notion of viewpoints, euler angles and rotation matrices interchangeably.

3.2. Network Architecture and Training

Let Nc be the number of object classes and Na be the number of angles to be predicted per instance. The number of output units per class is Na ∗ Nθ, resulting in a total of Nc ∗ Na ∗ Nθ outputs. We adopt an approach similar to Girshick et al. [11] and finetune a CNN model whose weights are initialized from a model pretrained on the Imagenet [5] classification task. We experimented with the architectures from Krizhevsky et al. [23] (denoted as TNet) and Simonyan et al. [33] (denoted as ONet). The architecture of our network is the same as the corresponding pre-trained network with an additional fully-connected layer having Nc ∗ Na ∗ Nθ output units. We provide an alternate detailed visualization of the network architecture in the supplementary material.

Instead of training a separate CNN for each class, we implement a loss layer that selectively considers the Na ∗ Nθ outputs corresponding to the class of the training instance and computes a logistic loss for each of the angle predictions. This allows us to train a CNN which can jointly predict viewpoint for all classes, thus enabling learning a shared feature representation across all categories. We use the Caffe framework [20] to train and extract features from the CNN described above. We augment the training data with jittered ground-truth bounding boxes that overlap with the annotated bounding box with IoU > 0.7. Xiang et al. [37] provide annotations for (φ, ϕ, ψ) corresponding to all the instances in the PASCAL VOC 2012 detection train and validation sets as well as for ImageNet images. We use the PASCAL train set and the ImageNet annotations to train the network described above and use the PASCAL VOC 2012 validation set annotations to evaluate our performance.

4. Viewpoint Informed Keypoint Prediction

As we noted earlier, parts assume their meaning in context of the whole. Thus, in addition to local appearance, we should take into account the global context. To operationalize this observation, we propose a two-component approach to keypoint prediction.

4.1. Multiscale Convolutional Response Maps

We use CNN based architectures to learn the appearance of keypoints across an object class. Using a fully convolutional architecture allows us to capture local appearance in a more hierarchical and robust way than HOG feature based models while still allowing for efficient inference by sharing computations across evaluations at different spatial locations in the same image.

Let C denote the set of classes, Kc denote the set of keypoints for class c and Nc = |Kc|. The total number of keypoints Nkp is therefore Σ_{c ∈ C} Nc. We train a fully convolutional network with an input size of (384 × 384) such that the channels in its last layer correspond to the keypoints, i.e.
we use a loss which forces the channels in the last layer to only fire at positions which correspond to the locations of the respective keypoint. The CNN architecture we use has the convolutional layers from ONet followed by an additional convolution layer with output size 12 × 12 × Nkp such that each channel of the output corresponds to a specific keypoint of a particular class.

The architecture enforces that the receptive field of an output unit at location (i, j) has a centre corresponding to (32 ∗ i, 32 ∗ j) in the input image. For each training instance with annotated keypoints at locations {(x_k, y_k) | k ∈ Kc}, we construct a target response map T with T(k_i, k_j, k) = 1 and zero otherwise (where (k_i, k_j) is the index of the unit whose receptive field's centre is closest to the annotated keypoint). For each keypoint, this is similar to training with multiple classification examples per image centered at the receptive fields of the output units, akin to the formulation used for object detection by Szegedy et al. [34]. Similar to the details described in section 3.2, we use a loss layer that only selects the channels corresponding to the instance class and implements a euclidean loss between the output and the target map, thus enabling us to jointly train a single network to predict keypoints for all classes. We train using the annotations from Bourdev et al. [4] and use ground truth and jittered boxes as training examples.

The above network captures the appearance of the keypoints at a particular scale. A coarser scale would be more robust to false positives as it captures more context, but would not be able to localize well. In order to benefit from the predictions at a coarser level without compromising localization, we propose using a multiscale ensemble of networks. We therefore train another network with exactly the same architecture with a smaller input size (192 × 192) and a smaller output size 6 × 6 × Nkp. We upsample the outputs of the smaller network and linearly combine them with the outputs of the larger network to get a spatial log-likelihood response map L(·, ·, k) for each keypoint k.

4.2. Viewpoint Conditioned Keypoint Likelihood

If we know that a particular car is left-facing, we'd expect its left wheels to be visible but not the right wheels. In addition to the ability to predict visibility, we'd also have a strong intuition about the approximate locations of the keypoints. If the problem setting was restricted to a particular instance, the exact locations of the keypoints could be inferred geometrically from the exact global pose. However, the two assumptions that would allow this approach do not hold true - we have to deal with different instances of the object category and our inferred global pose would only be approximate. To counter this, we propose a non-parametric solution - we would expect the keypoints of a given instance to lie at positions similar to other training instances whose global pose is close to the predicted global pose for the given instance.

Let the training instances for class c be denoted by {R_i, {(x_k^i, y_k^i) | k ∈ Kc}}, where R_i is the rotation matrix and {(x_k^i, y_k^i) | k ∈ Kc} the annotated keypoints corresponding to the ith instance. Let R be the predicted rotation matrix corresponding to which we want a prior for keypoint locations denoted by P, such that P(i, j, k) indicates the likelihood of keypoint k being present at location (i, j). Let ∆(R1, R2) = ||log(R1^T R2)||_F / √2 denote the geodesic distance between rotation matrices R1, R2 and N(R) = {i | ∆(R, R_i) < π/6} represent the training instances whose viewpoint is close to the predicted viewpoint. Our non-parametric global pose conditional likelihood (P) is defined as a mixture of gaussians and we combine it with the local appearance likelihood (L) to get keypoint locations as follows -

P(·, ·, k) = (1 / |N(R)|) Σ_{i ∈ N(R)} N((x_k^i, y_k^i), σI)    (1)

(x_k, y_k) = argmax_{x, y} log(P(x, y, k)) + L(x, y, k)    (2)

Note that all the coordinates above are normalized by warping the instance bounding box to a fixed size (12 × 12) and we choose σ = 2.

5. Experiments : Viewpoint Prediction

In this section, we use the PASCAL3D+ [37] annotations to evaluate the viewpoint estimation performance of our approach in the two different settings described below -

5.1. Viewpoint Estimation with Ground Truth box

To analyze the performance of our viewpoint estimation method independent of factors like mis-localization, we first tackle the task of estimating the viewpoint of an object with known bounds. Let ∆(R1, R2) = ||log(R1^T R2)||_F / √2 denote the geodesic distance function over the manifold of rotation matrices. ∆(R_gt, R_pred) captures the difference between the ground truth viewpoint R_gt and the predicted viewpoint R_pred. We use two complementary metrics for evaluation -

• Median Error : The common confusions for the task of viewpoint estimation are often predictions which are far apart (eg. a left facing vs right facing car), and the median error (MedErr) is a widely used metric that is robust to these if a significant fraction of the estimates are accurate.

• Accuracy at θ : A small median error does not necessarily imply accurate estimates for all instances; a complementary performance measure is the fraction of instances whose predicted viewpoint is within a fixed threshold of the target viewpoint. We denote this metric by Acc_θ where θ is the threshold. We use θ = π/6.
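The procedure in Eqs. (1) and (2) can be illustrated with a short sketch. The snippet below is a minimal reimplementation of the stated definitions rather than the exact code used for our experiments: it assumes the local appearance log-likelihood maps L, the training-set rotation matrices, and the grid-normalized keypoint annotations for one class are already available as numpy arrays, and the helper names (geodesic_dist, build_pose_prior, predict_keypoints) are illustrative placeholders.

```python
import numpy as np
from scipy.linalg import logm

GRID = 12  # keypoint likelihoods live on a 12 x 12 grid after box warping (Sec. 4.2)

def geodesic_dist(R1, R2):
    """Delta(R1, R2) = ||log(R1^T R2)||_F / sqrt(2)."""
    return np.linalg.norm(logm(R1.T @ R2), 'fro') / np.sqrt(2)

def build_pose_prior(R_pred, train_R, train_kps, sigma=2.0, thresh=np.pi / 6):
    """Eq. (1): mixture of Gaussians over training instances (of one class)
    whose viewpoint lies within `thresh` of the predicted rotation R_pred.

    train_R   : list of (3, 3) rotation matrices, one per training instance
    train_kps : (num_train, num_kp, 2) keypoint (x, y) coords on the 12 x 12 grid
    returns   : (GRID, GRID, num_kp) prior P (unnormalized Gaussians; the
                constant factor is irrelevant for the argmax in Eq. (2))
    """
    neighbours = [i for i, Ri in enumerate(train_R)
                  if geodesic_dist(R_pred, Ri) < thresh]
    num_kp = train_kps.shape[1]
    ys, xs = np.mgrid[0:GRID, 0:GRID]
    P = np.zeros((GRID, GRID, num_kp))
    for i in neighbours:
        for k in range(num_kp):
            xk, yk = train_kps[i, k]
            P[:, :, k] += np.exp(-((xs - xk) ** 2 + (ys - yk) ** 2) / (2 * sigma ** 2))
    P /= max(len(neighbours), 1)
    return P + 1e-12  # avoid log(0) when combining with the appearance term

def predict_keypoints(L, P):
    """Eq. (2): per-keypoint argmax of log(prior) + appearance log-likelihood L."""
    score = np.log(P) + L  # both (GRID, GRID, num_kp)
    flat = score.reshape(-1, score.shape[2]).argmax(axis=0)
    return np.stack(np.unravel_index(flat, (GRID, GRID)), axis=1)  # (num_kp, 2) grid cells
```

Since only the argmax in Eq. (2) is required, the Gaussian normalizing constant is dropped from the prior without affecting the predicted locations.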
Figure 3: Viewpoint predictions for unoccluded groundtruth instances using our algorithm. The columns show 15th, 30th,
45th, 60th, 75th and 90th percentile instances respectively in terms of the error. We visualize the predictions by rendering a
3D model using our predicted viewpoint.

                        aero  bike  boat  bottle  bus   car   chair  table  mbike  sofa  train  tv    mean

Acc_{π/6} (Pool5-TNet)  0.27  0.18  0.36  0.81    0.71  0.36  0.52   0.52   0.38   0.67  0.7    0.71  0.52
Acc_{π/6} (fc7-TNet)    0.5   0.44  0.39  0.88    0.81  0.7   0.39   0.38   0.48   0.44  0.78   0.65  0.57
Acc_{π/6} (ours-TNet)   0.78  0.74  0.49  0.93    0.94  0.90  0.65   0.67   0.83   0.67  0.79   0.76  0.76
Acc_{π/6} (ours-ONet)   0.81  0.77  0.59  0.93    0.98  0.89  0.80   0.62   0.88   0.82  0.80   0.80  0.81

MedErr (Pool5-TNet)     42.6  52.3  46.3  18.5    17.5  45.6  28.6   27.7   37     25.9  20.6   21.5  32
MedErr (fc7-TNet)       29.8  40.3  49.5  13.5    7.6   13.6  45.5   38.7   31.4   38.5  9.9    22.6  28.4
MedErr (ours-TNet)      14.7  18.6  31.2  13.5    6.3   8.8   17.7   17.4   17.6   15.1  8.9    17.8  15.6
MedErr (ours-ONet)      13.8  17.7  21.3  12.9    5.8   9.1   14.8   15.2   14.7   13.7  8.7    15.4  13.6

Table 1: Viewpoint Estimation with Ground Truth box
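For reference, the two metrics reported in Table 1 can be computed as in the sketch below. It is illustrative only: the euler-to-rotation conversion shown is one common Z-X-Z composition and should in practice follow the PASCAL3D+ annotation convention, and the function names are placeholders.

```python
import numpy as np
from scipy.linalg import logm

def euler_to_rot(azimuth, elevation, cyclorotation):
    """Map the three euler angles to a rotation matrix (Sec. 3.1). This Z-X-Z
    composition is one possible convention; the exact one should follow the
    PASCAL3D+ annotation toolkit."""
    def rz(a):
        return np.array([[np.cos(a), -np.sin(a), 0.0],
                         [np.sin(a),  np.cos(a), 0.0],
                         [0.0, 0.0, 1.0]])
    def rx(a):
        return np.array([[1.0, 0.0, 0.0],
                         [0.0, np.cos(a), -np.sin(a)],
                         [0.0, np.sin(a),  np.cos(a)]])
    return rz(cyclorotation) @ rx(elevation) @ rz(azimuth)

def delta(R1, R2):
    """Geodesic distance ||log(R1^T R2)||_F / sqrt(2), in radians."""
    return np.linalg.norm(logm(R1.T @ R2), 'fro') / np.sqrt(2)

def viewpoint_metrics(R_gt_list, R_pred_list, thresh=np.pi / 6):
    """Returns (MedErr in degrees, Acc_thresh) over a set of instances."""
    errs = np.array([delta(Rg, Rp) for Rg, Rp in zip(R_gt_list, R_pred_list)])
    return np.degrees(np.median(errs)), float(np.mean(errs < thresh))
```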

Recently, Ghodrati et al. [10] achieved results comparable to state-of-the-art by using a linear classifier over layer 5 features of TNet. We denote this method as 'Pool5-TNet' and implement it as a baseline. To study the effect of end-to-end training of the CNN architecture, we use a linear classifier on top of the fc7 layer of TNet as another baseline (denoted as 'fc7-TNet'). With the aim of analyzing viewpoint estimation independently, the evaluations were restricted only to objects marked as non-occluded and non-truncated, and we defer the study of the effects of occlusion/truncation in this setting to section 7.1. The performance of our method and comparisons to the baselines are shown in Table 1. The results clearly demonstrate that end-to-end training improves results and that our method with the TNet architecture performs significantly better than the 'Pool5-TNet' method used in [10]. We also observe a significant improvement by using the ONet architecture and only use this architecture for further experiments/analysis. In Figure 3, we show our predictions sorted in terms of the error and it can be seen that the predictions for most categories are reliable even at the 90th percentile.

5.2. Detection and Viewpoint Estimation

Xiang et al. [37] introduced the AVP metric to measure advances in the task of viewpoint estimation in the setting where localizations are not known a priori. The metric is similar to the AP criterion used for PASCAL VOC detection except that each detection candidate has an associated viewpoint and the detection is labeled correct if it has a correct predicted viewpoint bin as well as a correct localization
(bounding box IoU > 0.5). Xiang et al. [37] also compared to Pepik et al. [30] on the AVP metric using various viewpoint bin sizes and Ghodrati et al. [10] also showed comparable results on the metric. To evaluate our method, we obtain detections from RCNN [11] using MCG [2] object proposals and augment them with a pose predicted using the corresponding detection's bounding box. We note that there are two issues with the AVP metric - it only evaluates the prediction for the azimuth (φ) angle and it discretizes viewpoint instead of treating it continuously. Therefore, we also introduce two additional evaluation metrics which follow the IoU > 0.5 criterion for localization but modify the criterion for assigning a viewpoint prediction to be correct as follows -

• AVP_θ : δ(φ_gt, φ_pred) < θ

• ARP_θ : ∆(R_gt, R_pred) < θ

Note that ARP_θ requires the prediction of all euler angles instead of just φ and is therefore a stricter metric.

The performance of our CNN based approach for viewpoint prediction is shown in Table 2 and it can be seen that we significantly outperform the state-of-the-art methods across all categories. While it is not possible to compare our pose estimation performance independent of detection with DPM based methods like [37, 30], an indirect comparison results from the analysis using ground truth boxes where we demonstrate that our pose estimation approach is an improvement over [10] which in turn performs similar to [37, 30] while using similar detectors.

                      AVP                      AVP_{π/6}  ARP_{π/6}
Number of bins        4     8     16    24     -          -
Xiang et al. [37]     19.5  18.7  15.6  12.1   -          -
Pepik et al. [30]     23.8  21.5  17.3  13.6   -          -
Ghodrati et al. [10]  24.1  22.3  17.3  13.7   -          -
ours                  49.1  44.5  36.0  31.1   50.7       46.5

Table 2: Mean performance of our approach for various metrics. We report the performance for individual classes in the supplementary material.

6. Experiments : Keypoint Prediction

The task of keypoint prediction is commonly studied in the setting with known location of the object, but some methods, restricted to specific categories like 'people', recently evaluated their performance in the more general detection setting. We extend these metrics to generic categories and evaluate our predictions in both the settings using the following metrics proposed by Yang and Ramanan [38] -

• PCK (Keypoint Localization) : For each annotated instance, the algorithm predicts a location for each keypoint and a groundtruth keypoint is said to have been found correctly if the corresponding prediction lies within α ∗ max(h, w) of the annotated keypoint, with the corresponding object's dimensions being (h, w). For each keypoint, we measure the fraction of objects where it was found correctly.

• APK (Keypoint Detection) : A keypoint candidate is deemed correct if it lies within α ∗ max(h, w) of a groundtruth keypoint. Each keypoint hypothesis has an associated score and the area under the precision-recall curve is used as the evaluation criterion.

We use the keypoint annotations from [4] and use the PASCAL VOC train set for training and the validation set images for evaluation.

6.1. Keypoint Localization

The performance of our system and comparison to [25] are shown in Table 3. We denote by 'conv6' ('conv12') the predictions using only the 6 × 6 (12 × 12) output size network, by 'conv6+conv12' the predictions using the multiscale convolutional response and by 'conv6+conv12+pLikelihood' the predictions using our full system. Our baseline system ('conv6+conv12') performs much better than [25], indicating the importance of end-to-end training and multiscale response maps. We also see that incorporating the viewpoint conditioned likelihood induces a significant performance gain.

6.2. Keypoint Detection

Given an image, we use RCNN [11] combined with MCG [2] object proposals to obtain detection candidates, each comprising a class label and location. We then predict keypoints on each candidate using our system and score each keypoint hypothesis by linearly combining the keypoint log-likelihood score and the object detection system score. Our results for the task of keypoint detection are summarized in Table 4. The pose conditioned likelihood consistently improves the local appearance based predictions. Though the task of keypoint detection on PASCAL VOC has not yet been analyzed for categories other than person, we believe our results of 33.2% mean APK with a reasonably strict threshold indicate a promising start.

The above results support our three main assertions - that a global prior obtained in the form of a viewpoint conditioned likelihood improves the local appearance based predictions, that end-to-end trained CNNs can effectively model part appearances, and that combining responses from multiple scales significantly improves performance.
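The PCK and APK criteria defined above share the same correctness test - a prediction counts if it falls within α · max(h, w) of the annotated keypoint. A minimal sketch of this test and of the per-keypoint PCK aggregation is given below; it assumes one prediction per annotated instance and illustrates the definition rather than the exact evaluation script.

```python
import numpy as np

def keypoint_correct(pred_xy, gt_xy, box_hw, alpha=0.1):
    """A predicted keypoint is correct if it lies within alpha * max(h, w)
    of the annotated keypoint, where (h, w) are the object dimensions."""
    h, w = box_hw
    dist = np.linalg.norm(np.asarray(pred_xy, float) - np.asarray(gt_xy, float))
    return dist <= alpha * max(h, w)

def pck(preds, gts, boxes_hw, alpha=0.1):
    """PCK for one keypoint type: fraction of annotated instances on which the
    keypoint is found correctly (preds, gts: (N, 2); boxes_hw: (N, 2))."""
    hits = [keypoint_correct(p, g, hw, alpha) for p, g, hw in zip(preds, gts, boxes_hw)]
    return float(np.mean(hits))
```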
Figure 4: Visualization of keypoints predicted in the detection setting. We visualize every 15th detection, sorted by score, for
’Nosetip’ of aeroplanes, ’Crankcentre’ of bicycles, ’Left Headlight’ of cars and ’Right Base’ of buses.

PCK[α = 0.1] aero bike boat bottle bus car chair table mbike sofa train tv mean

Long et al. [25] 53.7 60.9 33.8 72.9 70.4 55.7 18.5 22.9 52.9 38.3 53.3 49.2 48.5
conv6 (coarse scale) 51.4 62.4 37.8 65.1 60.1 59.9 34.8 31.8 53.6 44 52.3 41.1 49.5
conv12 (fine scale) 54.9 66.8 32.6 60.2 80.5 59.3 35.1 37.8 58 41.6 59.3 53.8 53.3
conv6+conv12 61.9 74.6 43.6 72.8 84.3 70.0 45.0 44.8 66.7 51.2 66.8 56.8 61.5
conv6+conv12+pLikelihood 66.0 77.8 52.1 83.8 88.7 81.3 65.0 47.3 68.3 58.8 72.0 65.1 68.8

Table 3: Keypoint Localization
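The 'conv6+conv12' entries above correspond to the multiscale combination of Section 4.1: the coarse 6 × 6 response map is upsampled and linearly combined with the fine 12 × 12 map. The sketch below illustrates one such combination; the bilinear upsampling and equal weights are illustrative assumptions, since the exact combination weights are not specified here.

```python
import numpy as np
from scipy.ndimage import zoom

def combine_multiscale(resp_conv6, resp_conv12, w_coarse=0.5, w_fine=0.5):
    """Upsample the coarse 6 x 6 x Nkp response map and linearly combine it with
    the fine 12 x 12 x Nkp map to obtain the spatial log-likelihood L (Sec. 4.1)."""
    upsampled = zoom(resp_conv6, (2, 2, 1), order=1)  # 6 x 6 x Nkp -> 12 x 12 x Nkp
    return w_coarse * upsampled + w_fine * resp_conv12
```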

6.3. Generalization to Articulated Pose Setting

While the focus of our work is pose prediction for rigid objects, we note that our multiscale convolutional response based approach is also applicable for articulated pose estimation. To demonstrate this, we trained our convolutional response map system to detect keypoints for the category 'person' in PASCAL VOC 2012 and achieved an APK = 0.22, which is a significant improvement compared to the state-of-the-art method [13] which achieves APK = 0.15. We refer the reader to [13] for further details on the evaluation metrics for the task of articulated pose estimation.

                  Mean Error  Mean Accuracy
Default           13.5        0.81
Small Objects     15.1        0.75
Large Objects     12.7        0.87
Occluded Objects  19.9        0.65

Table 5: Object characteristics vs viewpoint prediction error

Setting                                   Accuracy
Error < π/9                               83.7
π/9 < Error < 2π/9                        5.7
Error > π/9 & Error(π − flip) < π/9       5.8
Error > π/9 & Error(z − ref) < π/9        6.5
Other                                     2.9

Table 6: Analysis of error modes for viewpoint prediction

7. Analysis

An understanding of failure cases and the effect of object characteristics on performance can often suggest insights for future directions. Hoiem et al. [18] suggested some excellent diagnostics for object detection systems and we adapt those for the task of pose estimation. We evaluate our system's output for both the task of viewpoint prediction as well as keypoint prediction but restrict our analysis to the setting with known bounding boxes - this enables us to analyze our pose estimation method independent of the detection system. We denote as 'large objects' the top third of instances and by 'small objects' the bottom third
APK[α = 0.1] aero bike boat bottle bus car chair table mbike sofa train tv mean

conv6+conv12 41.9 47.1 15.4 29.0 58.2 37.1 11.2 8.1 40.7 25.0 36.9 25.5 31.3
conv6+conv12+pLikelihood 44.9 48.3 17.0 30.0 60.8 40.7 14.6 8.6 42.8 25.7 38.3 26.2 33.2

Table 4: Keypoint Detection
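The APK values above are areas under precision-recall curves of scored keypoint hypotheses. The sketch below shows one straightforward way to compute such an average precision for a single keypoint type, assuming at most one ground-truth keypoint per instance, greedy matching in score order, and uninterpolated integration of the precision-recall curve; it illustrates the metric's definition and is not the evaluation code used for the table.

```python
import numpy as np

def keypoint_ap(candidates, gts, boxes_hw, alpha=0.1):
    """Average precision for one keypoint type.

    candidates : list of (score, instance_id, (x, y)) hypotheses over all detections
    gts        : dict instance_id -> (x, y) annotated keypoint location
    boxes_hw   : dict instance_id -> (h, w) object dimensions
    """
    order = sorted(range(len(candidates)), key=lambda i: -candidates[i][0])
    tp, fp = np.zeros(len(order)), np.zeros(len(order))
    matched = set()
    for rank, idx in enumerate(order):
        _, inst, (x, y) = candidates[idx]
        if inst in gts and inst not in matched:
            h, w = boxes_hw[inst]
            if np.hypot(x - gts[inst][0], y - gts[inst][1]) <= alpha * max(h, w):
                tp[rank] = 1
                matched.add(inst)
                continue
        fp[rank] = 1
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = np.concatenate([[0.0], tp / max(len(gts), 1)])
    precision = np.concatenate([[1.0], tp / np.maximum(tp + fp, 1e-12)])
    # uninterpolated area under the precision-recall curve
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
```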

PCK[α = 0.1] aero bike boat bottle bus car chair table mbike sofa train tv mean

Default 66.0 77.8 52.1 83.8 88.7 81.3 65.0 47.3 68.3 58.8 72.0 65.1 68.8
Occluded Objects 55.2 56.6 38.7 68.8 64.4 62.8 48.1 40.5 53.1 59.6 68.6 47.3 55.3
Small Objects 51.6 66.4 48.1 81.2 85 67.4 57.4 48.2 57.9 53.8 57.4 56.8 60.9
Large Objects 74.6 87.4 57.2 86.3 90.9 90.6 65.1 37.7 76.1 68.5 74.1 65.3 72.8

left/right 71.1 80.2 53.4 84.4 90.9 84.1 74.7 49.2 69.8 63.4 75.0 68.2 72.0
PCK[α = 0.2] 79.9 88.7 69.1 95.2 92 88.3 79.6 67.5 87.3 72.2 82.2 78.1 81.7

Table 7: Analysis of Keypoint Prediction

of instances. The label 'occluded' describes all the objects marked as truncated or occluded according to the PASCAL VOC annotations. We summarize our observations below.

7.1. Viewpoint Prediction

Object Characteristics : Table 5 shows the effect of object characteristics by reporting the mean across the classes of the median viewpoint error and accuracy. We can see that the method performs worse for occluded objects. There is also a significant difference between the performance for small and large objects - while such error trends are acceptable in the robotic setting where ambiguity for the farther objects is tolerable, one may need to capture more context to perform well without higher resolution input.

Error Modes : Since it is difficult to characterize error modes for generic rotations, we restrict the analysis to only the predicted azimuth. Assuming the image plane to be XY, we denote by z − ref the pose for the instance reflected along the XY plane and by π − flip a rotation of π along the Z axis. Table 6 reports the percentage of instances whose predicted pose corresponds to various modes. We observe that these error modes are equally common and that only about 3% of the errors are not explained by these.

Note that we exclude the 'diningtable' and 'bottle' categories from the above analysis due to the small number of unoccluded instances and insignificant variations respectively.

7.2. Keypoint Prediction

We use the PCK metric (section 6) to characterize our algorithm's performance for various settings. Our results using the full method (local appearance combined with viewpoint conditioned likelihood) are reported in Table 7. We report the analysis using various components (single scale prediction, purely local appearance etc.) in the supplementary material.

Object Characteristics : The effect of object characteristics is similar to the viewpoint prediction setting - occluded objects are not handled well and there is a significant performance gap between small and large objects.

Error Modes : In the 'left/right' setting, we label a prediction to be correct if it was in the vicinity of the corresponding or the laterally inverted keypoint. Surprisingly, the performance is similar to the base performance - indicating that laterally symmetric keypoints are not a significant error mode. The difference between the base performance and PCK[α = 0.2] analyzes the inaccurate localizations, which we find to be the main source of error.

8. Conclusion

We have presented an algorithm which leverages CNN architectures to predict viewpoint, and combines multiscale appearance with a viewpoint conditioned likelihood to predict keypoints. We demonstrated that our approach significantly improves state-of-the-art in settings with and without annotated bounding boxes for both viewpoint and keypoint prediction tasks. We also present evaluations for the keypoint detection setting along with a detailed ablation study of our performance on various tasks and hope that these will contribute towards progress on the task of pose estimation for generic objects. We will make our code and trained models publicly available.
Acknowledgements

The authors would like to thank Abhishek Kar, João Carreira and Saurabh Gupta for their valuable comments. This work was supported in part by NSF Award IIS-1212798 and ONR MURI - N00014-10-1-0933 and the Berkeley Graduate Fellowship. We gratefully acknowledge NVIDIA corporation for the donation of Tesla GPUs for this research.

References

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014. 2
[2] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In Computer Vision and Pattern Recognition, 2014. 6
[3] H. Azizpour and I. Laptev. Object detection using strongly-supervised deformable part models. In Proceedings of the 12th European Conference on Computer Vision - Volume Part I, ECCV'12, pages 836–849, Berlin, Heidelberg, 2012. Springer-Verlag. 3
[4] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. In European Conference on Computer Vision (ECCV), 2010. 3, 4, 6
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 3
[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013. 2
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html. 3
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010. 2
[9] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193–202, 1980. 1, 2
[10] A. Ghodrati, M. Pedersoli, and T. Tuytelaars. Is 2d information enough for viewpoint estimation? In Proceedings of the British Machine Vision Conference. BMVA Press, 2014. 2, 5, 6
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 2, 3, 6
[12] R. B. Girshick, F. N. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. CoRR, abs/1409.5403, 2014. 2
[13] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. R-cnns for pose estimation and action detection. CoRR, abs/1406.5212, 2014. 3, 7
[14] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In Computer Vision and Pattern Recognition (CVPR), 2014. 3
[15] D. Glasner, M. Galun, S. Alpert, R. Basri, and G. Shakhnarovich. Viewpoint-aware object detection and pose estimation. In IEEE International Conference on Computer Vision (ICCV), pages 1275–1282, 2011. 3
[16] M. Hejrati and D. Ramanan. Analyzing 3d objects in cluttered images. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 602–610. 2012. 3
[17] D. Hogg. Model-based vision: a program to see a walking person. Image and Vision Computing, 1(1):5 – 20, 1983. 2
[18] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In Computer Vision–ECCV 2012, pages 340–353. Springer Berlin Heidelberg, 2012. 2, 7
[19] D. P. Huttenlocher and S. Ullman. Recognizing solid objects by alignment with an image. International Journal of Computer Vision, 5(2):195–212, 1990. 2
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 3
[21] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In Proceedings of the British Machine Vision Conference, 2010. doi:10.5244/C.24.12. 2
[22] J. J. Koenderink and A. J. van Doorn. The internal representation of solid shape with respect to vision. Biological cybernetics, 32(4):211–216, 1979. 1
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 2, 3
[24] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, Dec. 1989. 1, 2
[25] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS, 2014. 2, 3, 6, 7
[26] J. McClelland and J. Miller. Structural factors in figure perception. Perception & Psychophysics, 26(3):221–229, 1979. 1
[27] D. Navon. Forest before trees: The precedence of global features in visual perception. 1977. 1
[28] J. O'Rourke and N. Badler. Model-based image analysis of human motion using constraint propagation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-2(6):522–536, Nov 1980. 2
[29] S. E. Palmer and N. M. Bucher. Configural effects in perceived pointing of ambiguous triangles. Journal of Experimental Psychology: Human Perception and Performance, 7(1):88, 1981. 1
[30] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Teaching 3d geometry to deformable part models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2012. 2, 3, 6
[31] J. R. Pomerantz, L. C. Sager, and R. J. Stoever. Perception of
wholes and of their component parts: Some configural supe-
riority effects. Journal of Experimental Psychology-human
Perception and Performance, 3:422–435, 1977. 1
[32] S. Savarese and L. Fei-Fei. 3d generic object categorization,
localization and pose estimation. In IEEE International Con-
ference on Computer Vision (ICCV), 2007. 3
[33] K. Simonyan and A. Zisserman. Very deep convolu-
tional networks for large-scale image recognition. CoRR,
abs/1409.1556, 2014. 3
[34] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks
for object detection. In Advances in Neural Information Pro-
cessing Systems, pages 2553–2561, 2013. 4
[35] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training
of a convolutional network and a graphical model for human
pose estimation. CoRR, abs/1406.2984, 2014. 3
[36] A. Toshev and C. Szegedy. Deeppose: Human pose esti-
mation via deep neural networks. In 2014 IEEE Conference
on Computer Vision and Pattern Recognition, CVPR 2014,
Columbus, OH, USA, June 23-28, 2014, pages 1653–1660.
IEEE, 2014. 2
[37] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A
benchmark for 3d object detection in the wild. In WACV,
2014. 2, 3, 4, 5, 6
[38] Y. Yang and D. Ramanan. Articulated pose estimation with
flexible mixtures-of-parts. In Proceedings of the 2011 IEEE
Conference on Computer Vision and Pattern Recognition,
CVPR ’11, pages 1385–1392, Washington, DC, USA, 2011.
IEEE Computer Society. 2, 3, 6
