Tulsiani Viewpoints and Keypoints 2015 CVPR Paper
Viewpoint Conditioned Keypoint Likelihood: We propose a viewpoint conditioned keypoint likelihood, implemented as a non-parametric mixture of Gaussians, to model the probability distribution of keypoints given the viewpoint prediction. We combine it with the appearance based likelihood computed above to obtain our keypoint predictions.

Keypoint prediction methods have traditionally been evaluated assuming ground-truth boxes as input [1, 21, 25]. This means that the evaluation setting is quite different from the conditions under which these methods would be used - in conjunction with imprecisely localized object detections. Yang and Ramanan [38] argued for the importance of this task for human pose estimation and introduced an evaluation criterion which we adapt to generic object categories. To the best of our knowledge, we are the first to empirically evaluate the applicability of a keypoint prediction algorithm not restricted to a specific object category in this challenging setting.

Furthermore, inspired by the analysis of detection methods presented by Hoiem et al. [18], we present an analysis of our algorithm's failure modes as well as the impact of object characteristics on the algorithm's performance.

2. Related Work

Viewpoint Prediction: Recently, CNNs [9, 24] have been shown to outperform Deformable Part Model (DPM) [8] based methods for recognition tasks [11, 6, 23]. Whereas DPMs explicitly model part appearances and their deformations, the CNN architecture allows such relations to be captured implicitly using a hierarchical convolutional structure. Girshick et al. [12] argued that DPMs could also be thought of as a specific instantiation of CNNs, and that training an end-to-end CNN for the corresponding task should therefore outperform a method which instead explicitly models part appearances and relations.

This result is particularly applicable to viewpoint estimation, where the prominent approaches, from the initial instance based methods [19] to the current state-of-the-art [37, 30], explicitly model local appearances and aggregate evidence to infer viewpoint. Pepik et al. [30] extend DPMs to 3D to model part appearances and rely on these to infer pose, and Xiang et al. [37] introduce a separate DPM component corresponding to each viewpoint. Ghodrati et al. [10] differ from the explicit part-based methodology, using a fixed global descriptor to estimate viewpoint. We build on both these approaches by using a method which, while using a global descriptor, can implicitly capture part appearances.

Keypoint Prediction: Keypoint prediction can be classified into two settings - a) 'Keypoint Localization', where the task is to find keypoints for objects with known bounding boxes, and b) 'Keypoint Detection', where the bounding box is unknown. This problem has been particularly well studied for humans - tracing back from classic model-based approaches for video [28, 17] to more recent pictorial structure based approaches [38] on challenging single image based real world datasets like LSP [21] or MPII Human Pose [1]. Recently Toshev et al. [36] demonstrated that CNN based models can successfully be used for keypoint prediction for humans, and Tompson et al. [35] significantly improved upon these results using a purely convolutional approach.
These evaluations, however, are restricted to keypoint localization. A more general task of keypoint detection without assuming ground truth box annotations was also recently introduced for humans by Yang and Ramanan [38], and Gkioxari et al. [14, 13] evaluated their keypoint prediction algorithm in this setting.

For generic object categories, annotations for keypoints on the challenging PASCAL VOC dataset [7] were introduced by Bourdev et al. [4]. Though similar annotations or fitted CAD models have been successfully used to train better object detection systems [3] as well as for simultaneous object detection and viewpoint estimation [30], the task of keypoint prediction has largely been unaddressed for generic object categories. Long et al. [25] recently evaluated keypoint localization results across all PASCAL categories but, to the best of our knowledge, the more general setting of keypoint detection for generic object categories has not yet been explored.

Previous works [32, 15, 16] have also jointly tackled the problem of keypoint detection and pose estimation. While these are perhaps the closest to ours in terms of goals, they differ markedly in methodology - they explicitly aggregate local evidence for pose estimation and have either been restricted to a specific object category [15, 16] or use instance model based matching [32]. Long et al. [25], on the other hand, share many commonalities with our methodology for the task of keypoint prediction - convolutional keypoint detections augmented with global priors. However, we show that we can significantly improve on their results by combining multiscale convolutional predictions from a trained CNN with a more principled, viewpoint estimation based global model. Both [16, 25] only evaluate keypoint localization performance, whereas we also evaluate our method in the setting of keypoint detection.

3. Viewpoint Estimation

3.1. Formulation

We formulate global pose estimation for rigid categories as predicting the viewpoint with respect to a canonical pose. This is equivalent to determining the three Euler angles corresponding to azimuth (φ), elevation (ϕ) and cyclo-rotation (ψ). We frame the task of predicting the Euler angles as a classification problem where the classes {1, . . . , Nθ} correspond to Nθ disjoint angular bins. We note that the Euler angles, and therefore every viewpoint, can be equivalently described by a rotation matrix. We will use the notions of viewpoints, Euler angles and rotation matrices interchangeably.
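To make the angular binning and the Euler-angle/rotation-matrix equivalence concrete, the following is a minimal sketch. The specific composition order of the Euler angles and the value of Nθ are not fixed by the text above, so both are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def rotation_from_euler(azimuth, elevation, cyclo):
    """Compose a rotation matrix from (azimuth, elevation, cyclo-rotation).

    The axis convention / composition order is an assumption made for this
    sketch; the section only states that every viewpoint can equivalently
    be written as a rotation matrix.
    """
    def Rz(t):
        return np.array([[np.cos(t), -np.sin(t), 0.0],
                         [np.sin(t),  np.cos(t), 0.0],
                         [0.0, 0.0, 1.0]])

    def Rx(t):
        return np.array([[1.0, 0.0, 0.0],
                         [0.0, np.cos(t), -np.sin(t)],
                         [0.0, np.sin(t),  np.cos(t)]])

    return Rz(cyclo) @ Rx(elevation) @ Rz(azimuth)

def angle_to_bin(angle, n_theta):
    """Map an angle in radians to one of N_theta disjoint angular bins."""
    angle = angle % (2.0 * np.pi)
    return int(angle / (2.0 * np.pi) * n_theta) % n_theta
```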
3.2. Network Architecture and Training

Let Nc be the number of object classes and Na the number of angles to be predicted per instance. The number of output units per class is Na ∗ Nθ, resulting in a total of Nc ∗ Na ∗ Nθ outputs. We adopt an approach similar to Girshick et al. [11] and finetune a CNN model whose weights are initialized from a model pretrained on the ImageNet [5] classification task. We experimented with the architectures from Krizhevsky et al. [23] (denoted as TNet) and Simonyan et al. [33] (denoted as ONet). The architecture of our network is the same as the corresponding pre-trained network, with an additional fully-connected layer having Nc ∗ Na ∗ Nθ output units. We provide an alternate detailed visualization of the network architecture in the supplementary material.

Instead of training a separate CNN for each class, we implement a loss layer that selectively considers the Na ∗ Nθ outputs corresponding to the class of the training instance and computes a logistic loss for each of the angle predictions. This allows us to train a CNN which can jointly predict viewpoints for all classes, thus enabling a shared feature representation to be learned across all categories. We use the Caffe framework [20] to train and extract features from the CNN described above. We augment the training data with jittered ground-truth bounding boxes that overlap the annotated bounding box with IoU > 0.7. Xiang et al. [37] provide annotations for (φ, ϕ, ψ) for all the instances in the PASCAL VOC 2012 detection train and validation sets, as well as for ImageNet images. We use the PASCAL train set and the ImageNet annotations to train the network described above, and use the PASCAL VOC 2012 validation set annotations to evaluate our performance.
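A class-selective loss of this kind can be sketched as follows. The reshape of the flat output vector into (Nc, Na, Nθ) blocks and the use of a per-angle softmax cross-entropy are assumptions made for illustration; the paper states only that a logistic loss is computed for each angle prediction of the instance's class.

```python
import numpy as np

def class_selective_viewpoint_loss(outputs, cls, angle_bins,
                                   n_classes, n_angles, n_theta):
    """Loss over the N_a * N_theta outputs belonging to one training instance.

    outputs    : (n_classes * n_angles * n_theta,) raw scores from the last layer
    cls        : ground-truth class index of the instance
    angle_bins : (n_angles,) ground-truth angular bin for each predicted angle
    Only the block of outputs belonging to `cls` contributes, which is what
    allows a single network to be trained jointly for all classes.
    """
    block = outputs.reshape(n_classes, n_angles, n_theta)[cls]
    loss = 0.0
    for a in range(n_angles):
        scores = block[a]
        m = scores.max()
        log_probs = scores - (m + np.log(np.exp(scores - m).sum()))
        loss += -log_probs[angle_bins[a]]   # cross-entropy for this angle
    return loss
```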
4. Viewpoint Informed Keypoint Prediction

As we noted earlier, parts assume their meaning in the context of the whole. Thus, in addition to local appearance, we should take into account the global context. To operationalize this observation, we propose a two-component approach to keypoint prediction.

4.1. Multiscale Convolutional Response Maps

We use CNN based architectures to learn the appearance of keypoints across an object class. Using a fully convolutional architecture allows us to capture local appearance in a more hierarchical and robust way than HOG feature based models, while still allowing efficient inference by sharing computations across evaluations at different spatial locations in the same image.

Let C denote the set of classes, Kc the set of keypoints for class c, and Nc = |Kc|. The total number of keypoints is therefore Nkp = Σ_{c∈C} Nc. We train a fully convolutional network with an input size of 384 × 384 such that the channels in its last layer correspond to the keypoints, i.e. we use a loss which forces each channel in the last layer to fire only at positions which correspond to the locations of the respective keypoint. The CNN architecture we use has the convolutional layers from ONet followed by an additional convolution layer with output size 12 × 12 × Nkp, such that each channel of the output corresponds to a specific keypoint of a particular class.

The architecture enforces that the receptive field of an output unit at location (i, j) has its centre at (32 ∗ i, 32 ∗ j) in the input image. For each training instance with annotated keypoint locations {(xk, yk) | k ∈ Kc}, we construct a target response map T with T(ki, kj, k) = 1 and zero otherwise, where (ki, kj) is the index of the unit whose receptive field centre is closest to the annotated keypoint. For each keypoint, this is similar to training with multiple classification examples per image, centred at the receptive fields of the output units, akin to the formulation used for object detection by Szegedy et al. [34]. Similar to the details described in Section 3.2, we use a loss layer that only selects the channels corresponding to the instance class and implements a Euclidean loss between the output and the target map, thus enabling us to jointly train a single network to predict keypoints for all classes. We train using the annotations from Bourdev et al. [4] and use ground truth and jittered boxes as training examples.
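A sketch of the target map construction for one instance follows; the exact handling of image coordinates versus output indices, and of keypoints that fall outside the crop, is glossed over and is an assumption here. During training, the Euclidean loss would then be taken between this map and the predicted channels of the instance's class.

```python
import numpy as np

def build_target_map(keypoints, n_keypoints, grid=12, stride=32):
    """Target response map T for one training instance.

    keypoints : dict {keypoint index k: (x, y) image coordinates}
    Returns T of shape (grid, grid, n_keypoints): the output unit whose
    receptive-field centre (stride*i, stride*j) is closest to the annotated
    keypoint is set to 1; all other entries stay 0.
    """
    T = np.zeros((grid, grid, n_keypoints))
    centres = np.arange(grid) * stride
    for k, (x, y) in keypoints.items():
        ki = int(np.argmin(np.abs(centres - x)))
        kj = int(np.argmin(np.abs(centres - y)))
        T[ki, kj, k] = 1.0
    return T
```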
The above network captures the appearance of the keypoints at a particular scale. A coarser scale would be more robust to false positives as it captures more context, but would not be able to localize well. In order to benefit from the predictions at a coarser level without compromising localization, we propose using a multiscale ensemble of networks. We therefore train another network with exactly the same architecture but with a smaller input size (192 × 192) and a smaller output size of 6 × 6 × Nkp. We upsample the outputs of the smaller network and linearly combine them with the outputs of the larger network to get a spatial log-likelihood response map L(·, ·, k) for each keypoint k.
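The multiscale combination can be sketched as below; nearest-neighbour upsampling and equal mixing weights are assumptions, since the text only states that the coarse outputs are upsampled and linearly combined with the fine ones.

```python
import numpy as np

def combine_scales(coarse, fine, w_coarse=0.5, w_fine=0.5):
    """Linearly combine the 6x6xN_kp and 12x12xN_kp response maps.

    coarse : (6, 6, n_kp) outputs of the smaller-input network
    fine   : (12, 12, n_kp) outputs of the larger-input network
    Returns the combined (12, 12, n_kp) spatial log-likelihood map L(., ., k).
    """
    upsampled = np.kron(coarse, np.ones((2, 2, 1)))   # nearest-neighbour upsampling
    return w_coarse * upsampled + w_fine * fine
```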
4.2. Viewpoint Conditioned Keypoint Likelihood

If we know that a particular car is left-facing, we would expect its left wheels to be visible but not the right wheels. In addition to the ability to predict visibility, we would also have a strong intuition about the approximate locations of the keypoints. If the problem setting were restricted to a particular instance, the exact locations of the keypoints could be inferred geometrically from the exact global pose. However, the two assumptions that would allow this approach do not hold true - we have to deal with different instances of the object category, and our inferred global pose is only approximate. To counter this, we propose a non-parametric solution - we expect the keypoints of a given instance to lie at positions similar to those of other training instances whose global pose is close to the predicted global pose for the given instance.

Let the training instances for class c be denoted by {Ri, {(x^i_k, y^i_k) | k ∈ Kc}}, where Ri is the rotation matrix and {(x^i_k, y^i_k) | k ∈ Kc} the annotated keypoints corresponding to the ith instance. Let R be the predicted rotation matrix for which we want a prior over keypoint locations, denoted by P, such that P(i, j, k) indicates the likelihood of keypoint k being present at location (i, j). Let ∆(R1, R2) = ||log(R1ᵀ R2)||_F / √2 denote the geodesic distance between rotation matrices R1 and R2, and let N(R) = {i | ∆(R, Ri) < π/6} represent the training instances whose viewpoint is close to the predicted viewpoint. Our non-parametric global pose conditional likelihood P is defined as a mixture of Gaussians, and we combine it with the local appearance likelihood L to obtain the keypoint locations as follows -

    P(·, ·, k) = (1 / |N(R)|) Σ_{i ∈ N(R)} N((x^i_k, y^i_k), σI)        (1)

    (x_k, y_k) = argmax_{x,y} [ log P(x, y, k) + L(x, y, k) ]            (2)

Note that all the coordinates above are normalized by warping the instance bounding box to a fixed size (12 × 12), and we choose σ = 2.
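Equations (1) and (2) can be read as the following sketch. The Gaussian normalisation constant is dropped since it does not affect the argmax, and the data structures holding the training annotations are hypothetical; the geodesic distance is computed via the equivalent rotation-angle formula.

```python
import numpy as np

def geodesic_distance(R1, R2):
    """Delta(R1, R2) = ||log(R1^T R2)||_F / sqrt(2); for rotation matrices this
    equals the rotation angle between R1 and R2, computed here via the trace."""
    cos_theta = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def pose_conditioned_prior(R_pred, train_rotations, train_keypoints,
                           n_keypoints, grid=12, sigma=2.0, thresh=np.pi / 6):
    """Non-parametric prior P(., ., k) of eq. (1).

    train_rotations : list of 3x3 rotation matrices R_i
    train_keypoints : list of dicts {k: (x, y)} in the normalized 12x12 frame
    Every training instance whose viewpoint lies within pi/6 of R_pred adds an
    isotropic (unnormalised) Gaussian centred on its annotated keypoints.
    """
    neighbours = [i for i, Ri in enumerate(train_rotations)
                  if geodesic_distance(R_pred, Ri) < thresh]
    xs, ys = np.meshgrid(np.arange(grid), np.arange(grid))  # xs: column, ys: row
    P = np.zeros((grid, grid, n_keypoints))
    for i in neighbours:
        for k, (xk, yk) in train_keypoints[i].items():
            P[..., k] += np.exp(-((xs - xk) ** 2 + (ys - yk) ** 2) / (2 * sigma ** 2))
    if neighbours:
        P /= len(neighbours)
    return P

def combine_prior_and_appearance(P, L):
    """Eq. (2): per-keypoint argmax of log P(x, y, k) + L(x, y, k)."""
    combined = np.log(P + 1e-12) + L
    locations = []
    for k in range(combined.shape[-1]):
        idx = np.unravel_index(np.argmax(combined[..., k]), combined.shape[:2])
        locations.append(idx)
    return locations
```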
5. Experiments: Viewpoint Prediction

In this section, we use the PASCAL3D+ [37] annotations to evaluate the viewpoint estimation performance of our approach in the two settings described below.

5.1. Viewpoint Estimation with Ground Truth Box

To analyze the performance of our viewpoint estimation method independent of factors like mis-localization, we first tackle the task of estimating the viewpoint of an object with known bounds. Let ∆(R1, R2) = ||log(R1ᵀ R2)||_F / √2 denote the geodesic distance function over the manifold of rotation matrices. ∆(Rgt, Rpred) captures the difference between the ground truth viewpoint Rgt and the predicted viewpoint Rpred. We use two complementary metrics for evaluation -

• Median Error: The common confusions for the task of viewpoint estimation are often predictions which are far apart (e.g. a left facing vs a right facing car), and the median error (MedErr) is a widely used metric that is robust to these as long as a significant fraction of the estimates are accurate.

• Accuracy at θ: A small median error does not necessarily imply accurate estimates for all instances; a complementary performance measure is the fraction of instances whose predicted viewpoint is within a fixed threshold of the target viewpoint. We denote this metric by Acc_θ, where θ is the threshold. We use θ = π/6.
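Both metrics reduce to a few lines given the geodesic distance defined above. This is a sketch reusing geodesic_distance from the earlier block; reporting MedErr in degrees is an assumption about the units used in the tables.

```python
import numpy as np

def viewpoint_metrics(gt_rotations, pred_rotations, theta=np.pi / 6):
    """Median error (degrees, assumed) and Acc_theta over a set of instances."""
    errors = np.array([geodesic_distance(Rg, Rp)
                       for Rg, Rp in zip(gt_rotations, pred_rotations)])
    med_err = np.degrees(np.median(errors))
    acc_theta = float(np.mean(errors < theta))
    return med_err, acc_theta
```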
Figure 3: Viewpoint predictions for unoccluded groundtruth instances using our algorithm. The columns show 15th, 30th,
45th, 60th, 75th and 90th percentile instances respectively in terms of the error. We visualize the predictions by rendering a
3D model using our predicted viewpoint.
                          aero  bike  boat  bottle  bus   car   chair  table  mbike  sofa  train  tv    mean
Acc_{π/6} (Pool5-TNet)    0.27  0.18  0.36  0.81    0.71  0.36  0.52   0.52   0.38   0.67  0.7    0.71  0.52
Acc_{π/6} (fc7-TNet)      0.5   0.44  0.39  0.88    0.81  0.7   0.39   0.38   0.48   0.44  0.78   0.65  0.57
Acc_{π/6} (ours-TNet)     0.78  0.74  0.49  0.93    0.94  0.90  0.65   0.67   0.83   0.67  0.79   0.76  0.76
Acc_{π/6} (ours-ONet)     0.81  0.77  0.59  0.93    0.98  0.89  0.80   0.62   0.88   0.82  0.80   0.80  0.81
MedErr (Pool5-TNet)       42.6  52.3  46.3  18.5    17.5  45.6  28.6   27.7   37     25.9  20.6   21.5  32
MedErr (fc7-TNet)         29.8  40.3  49.5  13.5    7.6   13.6  45.5   38.7   31.4   38.5  9.9    22.6  28.4
MedErr (ours-TNet)        14.7  18.6  31.2  13.5    6.3   8.8   17.7   17.4   17.6   15.1  8.9    17.8  15.6
MedErr (ours-ONet)        13.8  17.7  21.3  12.9    5.8   9.1   14.8   15.2   14.7   13.7  8.7    15.4  13.6

Table 1: Viewpoint estimation with ground truth boxes (per-class Acc_{π/6} and MedErr).
Recently, Ghodrati et al. [10] achieved results comparable to the state-of-the-art by using a linear classifier over layer 5 features of TNet. We denote this method as 'Pool5-TNet' and implement it as a baseline. To study the effect of end-to-end training of the CNN architecture, we use a linear classifier on top of the fc7 layer of TNet as another baseline (denoted as 'fc7-TNet'). With the aim of analyzing viewpoint estimation independently, the evaluations were restricted to objects marked as non-occluded and non-truncated; we defer the study of the effects of occlusion/truncation in this setting to Section 7.1. The performance of our method and comparisons to the baselines are shown in Table 1. The results clearly demonstrate that end-to-end training improves results and that our method with the TNet architecture performs significantly better than the 'Pool5-TNet' method used in [10]. We also observe a significant improvement by using the ONet architecture and only use this architecture for further experiments/analysis. In Figure 3, we show our predictions sorted in terms of the error; it can be seen that the predictions for most categories are reliable even at the 90th percentile.

5.2. Detection and Viewpoint Estimation

Xiang et al. [37] introduced the AVP metric to measure advances in the task of viewpoint estimation in the setting where localizations are not known a priori. The metric is similar to the AP criterion used for PASCAL VOC detection, except that each detection candidate has an associated viewpoint and a detection is labeled correct only if it has a correct predicted viewpoint bin as well as a correct localization (bounding box IoU > 0.5).
Xiang et al. [37] also compared to Pepik et al. [30] on the AVP metric using various viewpoint bin sizes, and Ghodrati et al. [10] also showed comparable results on the metric. To evaluate our method, we obtain detections from RCNN [11] using MCG [2] object proposals and augment them with a pose predicted using the corresponding detection's bounding box. We note that there are two issues with the AVP metric - it only evaluates the prediction for the azimuth (φ) angle, and it discretizes viewpoint instead of treating it continuously. Therefore, we also introduce two additional evaluation metrics which follow the IoU > 0.5 criterion for localization but modify the criterion for assigning a viewpoint prediction to be correct as follows -

• AVP_θ : δ(φ_gt, φ_pred) < θ

• ARP_θ : ∆(R_gt, R_pred) < θ

Note that ARP_θ requires the prediction of all Euler angles instead of just φ, and is therefore a stricter metric.
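A minimal check of the two criteria for a single detection that has already passed the IoU > 0.5 test might look as follows; the dictionary fields are hypothetical, δ is taken to be the smallest absolute azimuth difference, and geodesic_distance refers to the earlier sketch.

```python
import numpy as np

def azimuth_difference(a, b):
    """Smallest absolute difference between two azimuth angles in radians."""
    d = abs(a - b) % (2.0 * np.pi)
    return min(d, 2.0 * np.pi - d)

def viewpoint_correct(gt, det, theta=np.pi / 6, full_rotation=False):
    """Viewpoint criterion of AVP_theta (azimuth only) or ARP_theta (all angles).

    gt, det : dicts with hypothetical fields 'azimuth' (radians) and 'R'
    (3x3 rotation matrix); localization is assumed to be checked elsewhere.
    """
    if full_rotation:   # ARP_theta
        return geodesic_distance(gt['R'], det['R']) < theta
    return azimuth_difference(gt['azimuth'], det['azimuth']) < theta  # AVP_theta
```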
The performance of our CNN based approach for viewpoint prediction is shown in Table 2, and it can be seen that we significantly outperform the state-of-the-art methods across all categories. While it is not possible to compare our pose estimation performance independent of detection with DPM based methods like [37, 30], an indirect comparison is provided by the analysis using ground truth boxes, where we demonstrate that our pose estimation approach is an improvement over [10], which in turn performs similarly to [37, 30] while using similar detectors.
                        AVP (4 bins)  AVP (8 bins)  AVP (16 bins)  AVP (24 bins)  AVP_{π/6}  ARP_{π/6}
Xiang et al. [37]       19.5          18.7          15.6           12.1           -          -
Pepik et al. [30]       23.8          21.5          17.3           13.6           -          -
Ghodrati et al. [10]    24.1          22.3          17.3           13.7           -          -
ours                    49.1          44.5          36.0           31.1           50.7       46.5

Table 2: Mean performance of our approach for various metrics. We report the performance for individual classes in the supplementary material.
6. Experiments: Keypoint Prediction

The task of keypoint prediction is commonly studied in the setting with a known location of the object, but some methods, restricted to specific categories like 'people', have recently evaluated their performance in the more general detection setting. We extend these metrics to generic categories and evaluate our predictions in both settings using the following metrics proposed by Yang and Ramanan [38] -

• PCK (Keypoint Localization): For each annotated instance, the algorithm predicts a location for each keypoint, and a groundtruth keypoint is said to have been found correctly if the corresponding prediction lies within α ∗ max(h, w) of the annotated keypoint, the corresponding object's dimensions being (h, w). For each keypoint, we measure the fraction of objects on which it was found correctly.

• APK (Keypoint Detection): A keypoint candidate is deemed correct if it lies within α ∗ max(h, w) of a groundtruth keypoint. Each keypoint hypothesis has an associated score, and the area under the precision-recall curve is used as the evaluation criterion.

We use the keypoint annotations from [4], the PASCAL VOC train set for training and the validation set images for evaluation.
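As a concrete reading of the PCK definition, the sketch below computes, for one keypoint of one category, the fraction of annotated instances on which it is found correctly; the input arrays (ground-truth locations, predictions and box sizes per instance) are hypothetical.

```python
import numpy as np

def pck_for_keypoint(gt_locations, pred_locations, box_sizes, alpha=0.1):
    """PCK for a single keypoint over a set of annotated instances.

    gt_locations, pred_locations : (n, 2) arrays of (x, y) per instance
    box_sizes                    : (n, 2) array of (h, w) per instance
    A prediction counts as correct if it lies within alpha * max(h, w) of the
    annotated keypoint; the returned value is the fraction of instances on
    which the keypoint was found correctly.
    """
    dists = np.linalg.norm(pred_locations - gt_locations, axis=1)
    thresholds = alpha * box_sizes.max(axis=1)
    return float(np.mean(dists < thresholds))
```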
Figure 4: Visualization of keypoints predicted in the detection setting. We visualize every 15th detection, sorted by score, for
’Nosetip’ of aeroplanes, ’Crankcentre’ of bicycles, ’Left Headlight’ of cars and ’Right Base’ of buses.
PCK[α = 0.1]                 aero  bike  boat  bottle  bus   car   chair  table  mbike  sofa  train  tv    mean
Long et al. [25]             53.7  60.9  33.8  72.9    70.4  55.7  18.5   22.9   52.9   38.3  53.3   49.2  48.5
conv6 (coarse scale)         51.4  62.4  37.8  65.1    60.1  59.9  34.8   31.8   53.6   44    52.3   41.1  49.5
conv12 (fine scale)          54.9  66.8  32.6  60.2    80.5  59.3  35.1   37.8   58     41.6  59.3   53.8  53.3
conv6+conv12                 61.9  74.6  43.6  72.8    84.3  70.0  45.0   44.8   66.7   51.2  66.8   56.8  61.5
conv6+conv12+pLikelihood     66.0  77.8  52.1  83.8    88.7  81.3  65.0   47.3   68.3   58.8  72.0   65.1  68.8

Table 3: Keypoint localization performance (PCK, α = 0.1).
6.1. Keypoint Localization

The performance of our system and the comparison to [25] are shown in Table 3. We denote by 'conv6' ('conv12') the predictions using only the 6 × 6 (12 × 12) output size network, by 'conv6+conv12' the predictions using the multiscale convolutional response, and by 'conv6+conv12+pLikelihood' the predictions using our full system. Our baseline system ('conv6+conv12') performs much better than [25], indicating the importance of end-to-end training and multiscale response maps. We also see that incorporating the viewpoint conditioned likelihood induces a significant performance gain.

6.2. Keypoint Detection

Given an image, we use RCNN [11] combined with MCG [2] object proposals to obtain detection candidates, each comprising a class label and a location. We then predict keypoints on each candidate using our system and score each keypoint hypothesis by linearly combining the keypoint log-likelihood score and the object detection system score. Our results for the task of keypoint detection are summarized in Table 4. The pose conditioned likelihood consistently improves the local appearance based predictions. Though the task of keypoint detection on PASCAL VOC has not yet been analyzed for categories other than 'person', we believe our result of 33.2% mean APK at a reasonably strict threshold indicates a promising start.

APK[α = 0.1]                 aero  bike  boat  bottle  bus   car   chair  table  mbike  sofa  train  tv    mean
conv6+conv12                 41.9  47.1  15.4  29.0    58.2  37.1  11.2   8.1    40.7   25.0  36.9   25.5  31.3
conv6+conv12+pLikelihood     44.9  48.3  17.0  30.0    60.8  40.7  14.6   8.6    42.8   25.7  38.3   26.2  33.2

Table 4: Keypoint detection performance (APK, α = 0.1).

The above results support our three main assertions - a global prior obtained in the form of a viewpoint conditioned likelihood improves the local appearance based predictions, end-to-end trained CNNs can effectively model part appearances, and combining responses from multiple scales significantly improves performance.

While the focus of our work is pose prediction for rigid objects, we note that our multiscale convolutional response based approach is also applicable to articulated pose estimation. To demonstrate this, we trained our convolutional response map system to detect keypoints for the category 'person' in PASCAL VOC 2012 and achieved an APK of 0.22, a significant improvement over the state-of-the-art method [13], which achieves an APK of 0.15. We refer the reader to [13] for further details on the evaluation metrics for the task of articulated pose estimation.

7. Analysis

An understanding of failure cases and of the effect of object characteristics on performance can often suggest insights for future directions. Hoiem et al. [18] proposed some excellent diagnostics for object detection systems, and we adapt those for the task of pose estimation. We evaluate our system's output for both the task of viewpoint prediction and the task of keypoint prediction, but restrict our analysis to the setting with known bounding boxes - this enables us to analyze our pose estimation method independent of the detection system. We denote as 'large objects' the top third of instances and by 'small objects' the bottom third of instances. The label 'occluded' describes all the objects marked as truncated or occluded according to the PASCAL VOC annotations. We summarize our observations below.

Setting             MedErr  Acc_{π/6}
Default             13.5    0.81
Small Objects       15.1    0.75
Large Objects       12.7    0.87
Occluded Objects    19.9    0.65

Table 5: Object characteristics vs viewpoint prediction error.

Setting                                  Accuracy
Error < π/9                              83.7
π/9 < Error < 2π/9                       5.7
Error > π/9 & Error(π-flip) < π/9        5.8
Error > π/9 & Error(z-ref) < π/9         6.5
Other                                    2.9

Table 6: Analysis of error modes for viewpoint prediction.

PCK[α = 0.1]        aero  bike  boat  bottle  bus   car   chair  table  mbike  sofa  train  tv    mean
Default             66.0  77.8  52.1  83.8    88.7  81.3  65.0   47.3   68.3   58.8  72.0   65.1  68.8
Occluded Objects    55.2  56.6  38.7  68.8    64.4  62.8  48.1   40.5   53.1   59.6  68.6   47.3  55.3
Small Objects       51.6  66.4  48.1  81.2    85    67.4  57.4   48.2   57.9   53.8  57.4   56.8  60.9
Large Objects       74.6  87.4  57.2  86.3    90.9  90.6  65.1   37.7   76.1   68.5  74.1   65.3  72.8
left/right          71.1  80.2  53.4  84.4    90.9  84.1  74.7   49.2   69.8   63.4  75.0   68.2  72.0
PCK[α = 0.2]        79.9  88.7  69.1  95.2    92    88.3  79.6   67.5   87.3   72.2  82.2   78.1  81.7

Object characteristics vs keypoint localization performance (PCK).

… scale prediction, purely local appearance etc.) in the supplementary material.