Scalable Object Detection
part-based models [7], which more recently have achieved impressive performance thanks to discriminative learning and carefully crafted features [6]. These methods, however, rely on exhaustive application of part templates over multiple scales and as such are expensive. Moreover, they scale linearly in the number of classes, which becomes a challenge for modern datasets such as ImageNet¹.

¹ A typical deformable-parts model takes 1 CPU-sec/image/label at inference time, thus for 1000 classes inference would take 1000 CPU-seconds; sharing parts across class labels is an open research problem.

To address the former issue, Lampert et al. [12] use a branch-and-bound strategy to avoid evaluating all potential object locations. To address the latter issue, Song et al. [14] use a low-dimensional part basis, shared across all object classes. A hashing-based approach for efficient part detection has shown good results as well [3].

A different line of work, closer to ours, is based on the idea that objects can be localized without having to know their class. Some of these approaches build on bottom-up classless segmentation [10]. The segments obtained in this way can be scored using top-down feedback [17, 2, 4]. Using the same motivation, Alexe et al. [1] use an inexpensive classifier to score object hypotheses for being an object or not, and in this way reduce the number of locations for the subsequent detection steps. These approaches can be thought of as multi-layered models, with segmentation as the first layer and segment classification as a subsequent layer. Despite the fact that they encode proven perceptual principles, we will show that having deeper models which are fully learned can lead to superior results.

Finally, we capitalize on the recent advances in Deep Learning, most notably the work by Krizhevsky et al. [11]. We extend their bounding box regression approach for detection to the case of handling multiple objects in a scalable manner. DNN-based regression applied to object masks has been investigated by Szegedy et al. [15]. This last approach achieves state-of-the-art detection performance on VOC2007 but does not scale up to multiple classes due to the cost of a single mask regression: in that setup, one needs to execute 5 networks per class at inference time, which is not scalable for most real-world applications.

3. Proposed approach

We aim at achieving class-agnostic, scalable object detection by predicting a set of bounding boxes, which represent potential objects. More precisely, we use a Deep Neural Network (DNN), which outputs a fixed number of bounding boxes. In addition, it outputs a score for each box expressing the network confidence of this box containing an object.

Model. To formalize the above idea, we encode the i-th object box and its associated confidence as node values of the last net layer:

Bounding box: we encode the upper-left and lower-right coordinates of each box as four node values, which can be written as a vector $l_i \in \mathbb{R}^4$. These coordinates are normalized w.r.t. image dimensions to achieve invariance to absolute image size. Each normalized coordinate is produced by a linear transformation of the last hidden layer.

Confidence: the confidence score for the box containing an object is encoded as a single node value $c_i \in [0, 1]$. This value is produced through a linear transformation of the last hidden layer followed by a sigmoid.

We can combine the bounding box locations $l_i$, $i \in \{1, \dots, K\}$, as one linear layer. Similarly, we can treat the collection of all confidences $c_i$, $i \in \{1, \dots, K\}$, as the output of one sigmoid layer. Both these output layers are connected to the last hidden layer.

At inference time, our algorithm produces K bounding boxes. In our experiments, we use K = 100 and K = 200. If desired, we can use the confidence scores and non-maximum suppression to obtain a smaller number of high-confidence boxes at inference time. These boxes are supposed to represent objects. As such, they can be classified with a subsequent classifier to achieve object detection. Since the number of boxes is very small, we can afford powerful classifiers. In our experiments, we use a second DNN for classification [11].

Training Objective. We train a DNN to predict bounding boxes and their confidence scores for each training image such that the highest scoring boxes match the ground-truth object boxes for the image well. Suppose that for a particular training example, M objects were labeled by bounding boxes $g_j$, $j \in \{1, \dots, M\}$. In practice, the number of predictions K is much larger than the number of groundtruth boxes M. Therefore, we try to optimize only the subset of predicted boxes which best match the ground-truth ones. We optimize their locations to improve their match and maximize their confidences. At the same time we minimize the confidences of the remaining predictions, which are deemed not to localize the true objects well.

To achieve the above, we formulate an assignment problem for each training example. Let $x_{ij} \in \{0, 1\}$ denote the assignment: $x_{ij} = 1$ iff the i-th prediction is assigned to the j-th true object. The objective of this assignment can be expressed as:

$$F_{\text{match}}(x, l) = \frac{1}{2} \sum_{i,j} x_{ij} \, \lVert l_i - g_j \rVert_2^2 \qquad (1)$$

where we use the L2 distance between the normalized bounding box coordinates to quantify the dissimilarity between bounding boxes.
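To make the output encoding and Eq. 1 concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function names (`decode_outputs`, `match_loss`), the list-of-lists weight layout and the toy dimensions are assumptions made for illustration; the only parts taken from the text are that the K boxes come from a linear layer, the K confidences from a linear layer followed by a sigmoid, and that the matching cost is half the squared L2 distance summed over assigned pairs.

```python
import math


def sigmoid(z):
    """Logistic function, used for the confidence outputs."""
    return 1.0 / (1.0 + math.exp(-z))


def linear(weights, biases, hidden):
    """Apply one linear layer: one output per (weight row, bias) pair."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, biases)]


def decode_outputs(hidden, w_loc, b_loc, w_conf, b_conf):
    """Map the last hidden layer to K boxes and K confidences.

    Assumed (hypothetical) layout: w_loc has 4*K rows, so outputs
    4*i .. 4*i+3 are the normalized (x1, y1, x2, y2) of box i;
    w_conf has K rows, one confidence logit per box.
    """
    flat = linear(w_loc, b_loc, hidden)
    boxes = [flat[4 * i:4 * i + 4] for i in range(len(flat) // 4)]
    confs = [sigmoid(z) for z in linear(w_conf, b_conf, hidden)]
    return boxes, confs


def match_loss(x, boxes, gt):
    """Eq. 1: 0.5 * sum_ij x[i][j] * ||l_i - g_j||^2."""
    total = 0.0
    for i, l in enumerate(boxes):
        for j, g in enumerate(gt):
            if x[i][j]:
                total += sum((a - b) ** 2 for a, b in zip(l, g))
    return 0.5 * total
```

Here `x` is a K-by-M 0/1 matrix, so a prediction that is assigned to no ground-truth box contributes nothing to the matching cost.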
Additionally, we want to optimize the confidences of the boxes according to the assignment x. Maximizing the confidences of assigned predictions can be expressed as:

$$F_{\text{conf}}(x, c) = -\sum_{i,j} x_{ij} \log(c_i) - \sum_i \Bigl(1 - \sum_j x_{ij}\Bigr) \log(1 - c_i) \qquad (2)$$

In the above objective, $\sum_j x_{ij} = 1$ iff prediction i has been matched to a groundtruth. In that case $c_i$ is being maximized, while in the opposite case it is being minimized. A different interpretation of the above term is achieved if we view $\sum_j x_{ij}$ as a probability of prediction i containing an object of interest. Then, the above loss is the negative of the entropy and thus corresponds to a max entropy loss.

The final loss objective combines the matching and confidence losses:

$$F(x, l, c) = \alpha F_{\text{match}}(x, l) + F_{\text{conf}}(x, c) \qquad (3)$$

subject to the constraints on x given in Eq. 5. $\alpha$ balances the contribution of the different loss terms.

Optimization. For each training example, we solve for an optimal assignment $x^*$ of predictions to true boxes by

$$x^* = \arg\min_x F(x, l, c) \qquad (4)$$

$$\text{subject to} \quad x_{ij} \in \{0, 1\}, \quad \sum_i x_{ij} = 1, \qquad (5)$$

match between the K priors and the ground truth. Once the matching is done, the target confidences are computed as before. Moreover, the location prediction loss is also unchanged: for any matched pair of (target, prediction) locations, the loss is defined by the difference between the groundtruth and the coordinates that correspond to the matched prior. We call the usage of priors for matching prior matching and hypothesize that it enforces diversification among the predictions, since the linear assignment forces the model to learn a diverse set of predictions. We have found that without prior matching, the convergence speed and quality of our models were significantly lower.

It should be noted that although we defined our method in a class-agnostic way, we can apply it to predicting object boxes for a particular class. To do this, we simply need to train our models on bounding boxes for that class.

Further, we can predict K boxes per class. Unfortunately, the number of parameters of this model grows linearly with the number of classes. Also, in a typical setting, where the number of objects for a given class is relatively small, most of these parameters will see very few training examples with a corresponding gradient contribution. We thus argue that our two-step process (first localize, then recognize) is a superior alternative in that it allows leveraging data from multiple object types in the same image using a small number of parameters.
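Returning to the optimization in Eqs. 2-5, the sketch below evaluates the combined loss for a candidate assignment and finds the minimizer by brute force. It is an illustration under stated assumptions, not the authors' solver: the names `total_loss` and `best_assignment` are hypothetical, the assignment is represented as a tuple giving the matched prediction index for each ground-truth box, and each prediction is assumed to be matched to at most one ground-truth box (a one-to-one matching, which the constraints shown above do not state explicitly). Exhaustive enumeration is only practical for toy sizes; a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice at K = 100.

```python
import itertools
import math


def sq_dist(a, b):
    """Squared L2 distance between two 4-vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def total_loss(assign, boxes, confs, gt, alpha=1.0):
    """F = alpha * F_match + F_conf (Eqs. 1-3).

    assign[j] is the index of the prediction matched to ground-truth
    box j. Assumption: the indices in assign are distinct, and every
    confidence lies strictly between 0 and 1 so the logs are finite.
    """
    matched = set(assign)
    f_match = 0.5 * sum(sq_dist(boxes[i], gt[j])
                        for j, i in enumerate(assign))
    # Matched predictions are pushed towards 1, the rest towards 0.
    f_conf = -sum(math.log(c) if i in matched else math.log(1.0 - c)
                  for i, c in enumerate(confs))
    return alpha * f_match + f_conf


def best_assignment(boxes, confs, gt, alpha=1.0):
    """Brute-force arg min over all one-to-one assignments (Eq. 4)."""
    candidates = itertools.permutations(range(len(boxes)), len(gt))
    return min(candidates,
               key=lambda a: total_loss(a, boxes, confs, gt, alpha))
```

For example, with three predictions and two ground-truth boxes, `best_assignment` returns a 2-tuple naming the prediction chosen for each ground-truth box; the remaining prediction only contributes its $-\log(1 - c_i)$ term.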
Figure 2. Sample of detection results on VOC 2007: up to 10 boxes from the class-agnostic detector are output, after non-max-suppression
with Jaccard overlap 0.5 is performed.
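The non-max-suppression step mentioned in the caption of Figure 2 can be sketched as follows. This is a minimal illustration that assumes the common greedy variant (keep the most confident box, drop every remaining box that overlaps a kept box too much); the text does not specify which variant was used. Boxes are assumed to be (x1, y1, x2, y2) tuples with upper-left and lower-right corners, and the names `jaccard` and `non_max_suppression` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap (intersection over union) of two boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, confs, overlap=0.5, max_boxes=10):
    """Greedy NMS: return indices of kept boxes, most confident first.

    A box is kept only if its Jaccard overlap with every already
    kept box is at most `overlap`; at most `max_boxes` are returned.
    """
    order = sorted(range(len(boxes)), key=lambda i: confs[i], reverse=True)
    kept = []
    for i in order:
        if all(jaccard(boxes[i], boxes[k]) <= overlap for k in kept):
            kept.append(i)
            if len(kept) == max_boxes:
                break
    return kept
```

The defaults mirror the caption (overlap 0.5, up to 10 boxes); both are ordinary parameters and can be changed per experiment.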
uniformly distributed among the classes. The validation set, on which the performance metrics are calculated, consists of 48,238 images.

4.4.1 Training methodology

In addition to a localization model that is identical (up to the dataset on which it is trained) to the VOC model, we also train a model on the ImageNet Classification challenge data, which will serve as the recognition model. This model is trained in a procedure that is substantially similar to that of [11] and is able to achieve the same results on the classification challenge validation set. Note that we only train a single model instead of 7: the latter brings substantial benefits in terms of classification accuracy, but is 7× more
expensive, which is not a negligible factor.

Figure 3. Precision-recall curves on selected VOC classes (recoverable panel titles: cat, chair, horse, person; axes: recall vs. precision).

Inference is done as with the VOC setup: the number of predicted locations is K = 100, which are then reduced by Non-Max-Suppression (Jaccard overlap criterion of 0.4) and which are post-scored by the classifier: the score is the product of the localizer confidence for the given box and the score of the classifier evaluated on the minimum square region around the crop. The final scores (detection score times classification score) are then sorted in descending order and only the top scoring score/location pair is kept for a given class (as per the challenge evaluation criterion).

In all experiments, the hyper-parameters were selected by evaluating on a held-out portion of the training set (10% random choice of examples).

4.4.2 Evaluation methodology

The official metric of the "Classification with localization" ILSVRC-2012 challenge is detection@5, where an algorithm is only allowed to produce one box for each of the 5 labels (in other words, a model is neither penalized nor rewarded for producing valid multiple detections of the same class), and where the detection criterion is 0.5 Jaccard overlap with any of the ground-truth boxes (in addition to the matching class label).

Table 2 contains a comparison of the proposed method, dubbed DeepMultiBox, with classifying the ground-truth boxes directly and with the approach of inferring one box per class directly. The metrics reported are detection@5 and classification@5, the official metrics of the ILSVRC-2012 challenge. In the table, we vary the number of windows at which we apply the classifier (this number represents the top windows chosen after non-max-suppression, the ranking coming from the confidence scores). The one-box-per-class approach is a careful re-implementation of the winning entry of ILSVRC-2012 (the "classification with localization" challenge), with 1 network trained (instead of 7).

Table 2. Performance of Multibox (the proposed method) vs. classifying ground-truth boxes directly and predicting one box per class

Method                          det@5     class@5
One-box-per-class               61.00%    79.40%
Classify GT directly            82.81%    82.81%
DeepMultiBox, top 1 window      56.65%    73.03%
DeepMultiBox, top 3 windows     58.71%    77.56%
DeepMultiBox, top 5 windows     58.94%    78.41%
DeepMultiBox, top 10 windows    59.06%    78.70%
DeepMultiBox, top 25 windows    59.04%    78.76%

We can see that the DeepMultiBox approach is quite competitive: with 5-10 windows, it is able to perform about as well as the competing approach. While the one-box-per-class approach may come off as more appealing in this particular case in terms of the raw performance, it suffers from a number of drawbacks: first, its output scales linearly with the number of classes, for which there needs to be training data. The multibox approach can in principle use transfer learning to detect certain types of objects on which it has never been specifically trained, but which share similarities with objects that it has seen². Figure 5 explores this hypothesis by observing what happens when one takes a localization model trained on ImageNet and applies it on the VOC test set, and vice-versa. The figure shows a precision-recall curve. In this case, we perform a class-agnostic detection: a true positive occurs if two windows (prediction and groundtruth) overlap by more than 0.5, independently of their class. Interestingly, the ImageNet-trained model is able to capture more VOC windows than vice-versa: we hypothesize that this is due to the ImageNet class set being

² For instance, if one trains with fine-grained categories of dogs, it will likely generalize to other kinds of breeds by itself.
Figure 4. Some selected detection results on the ILSVRC-2012 classification with localization challenge validation set.
much richer than the VOC class set.

Secondly, the one-box-per-class approach does not generalize naturally to multiple instances of objects of the same type (except via the method presented in this work, for instance). Figure 5 shows this too, in the comparison between DeepMultiBox and the one-per-class approach³. Generalizing to such a scenario is necessary for actual image understanding by algorithms, thus such limitations need to be overcome, and our method is a scalable way of doing so. Evidence supporting this statement is given in Figure 5, which shows that the proposed method is generally able to capture more objects more accurately than a single-box method.

³ In the case of the one-box-per-class method, non-max-suppression is performed on the 1000 boxes using the same criterion as the DeepMultiBox method.

5. Discussion and Conclusion

In this work, we propose a novel method for localizing objects in an image, which predicts multiple bounding boxes at a time. The method uses a deep convolutional neural network as a base feature extraction and learning model. It formulates a multiple box localization cost that is able to take advantage of a variable number of groundtruth locations of interest in a given image and learn to predict such locations in unseen images.

We present results on two challenging benchmarks, VOC2007 and ILSVRC-2012, on which the proposed method is competitive. Moreover, the method is able to perform well by predicting only very few locations to be probed by a subsequent classifier. Our results show that the DeepMultiBox approach is scalable and can even generalize across the two datasets, in terms of being able to predict locations of interest, even for categories on which it was not trained. Additionally, it is able to capture multiple instances of objects of the same class, which is an important feature of algorithms that aim for better image understanding.

While our method is indeed competitive, there exist methods which have substantially larger computational cost, but that can achieve better detection performance, notably on VOC2007 and ILSVRC localization. OverFeat [13] efficiently slides a convolutional network at multiple locations and scales, predicting one bounding box per class. That model takes 2 seconds/image on a GPU, roughly 40x slower than a GPU implementation of our model. Fig. 9 of [13] has the results of a single-scale, centered crop version of their model, the closest to what we propose. That results in a 40% top-5 result on ILSVRC-2012, compared to 40.94%, but with DeepMultiBox we are able to extract multiple regions of interest in one network evaluation.

Another method is that of [8], which uses selective search [16] to propose 2000 candidate locations per image, extracts top-layer features from a ConvNet and uses a hard-negative-trained SVM to classify the locations into VOC classes. The main differences with our approach are that this method is 200x more expensive, that the authors pre-train their feature extractor on ImageNet, and that they use hard negative mining to learn a mapping from features to classes that has a low false positive ratio.

The latter two are good lessons, which we need to explore. While we showed in Fig. 1 that by predicting more
Figure 5. Class-agnostic detection on ILSVRC-2012 (left) and VOC 2007 (right).
windows we are able to capture more ground-truth bounding boxes, we did not observe a comparable increase in mAP on VOC2007. We hypothesize that a classification model that incorporates better hard-negative mining and learns to better model local features, the context and detector confidences jointly will likely take better advantage of the proposed windows.

In the future, we hope to be able to fold the localization and recognition paths into a single network, such that we would be able to extract both location and class label information in a single feed-forward pass through the network. Even in its current state, the two-pass procedure (localization network followed by categorization network) entails 5-10 network evaluations. Importantly, this number does not scale linearly with the number of classes to be recognized, which still makes the proposed approach very competitive with DPM-like approaches.

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR. IEEE, 2010.
[2] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, 2010.
[3] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, 2013.
[4] I. Endres and D. Hoiem. Category independent object proposals. In ECCV. 2010.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.
[7] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. Computers, IEEE Transactions on, 100(1):67–92, 1973.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[9] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/.
[10] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, 2009.
[11] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[12] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR, 2008.
[13] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[14] H. O. Song, S. Zickler, T. Althoff, R. Girshick, M. Fritz, C. Geyer, P. Felzenszwalb, and T. Darrell. Sparselet models for efficient multiclass object detection. In ECCV. 2012.
[15] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems (NIPS), 2013.
[16] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
[17] K. E. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.
[18] L. Zhu, Y. Chen, A. Yuille, and W. Freeman. Latent hierarchical structural learning for object detection. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1062–1069. IEEE, 2010.