Scalable Object Detection using Deep Neural Networks

Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov


Google, Inc.
1600 Amphitheatre Parkway, Mountain View (CA), 94043, USA
{dumitru, szegedy, toshev, dragomir}@google.com

Abstract

Deep convolutional neural networks have recently achieved state-of-the-art performance on a number of image recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on the localization sub-task was a network that predicts a single bounding box and a confidence score for each object category in the image. Such a model captures the whole-image context around the objects but cannot handle multiple instances of the same object in the image without naively replicating the number of outputs for each instance. In this work, we propose a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest. The model naturally handles a variable number of instances for each class and allows for cross-class generalization at the highest levels of the network. We are able to obtain competitive recognition performance on VOC2007 and ILSVRC2012, while using only the top few predicted locations in each image and a small number of neural network evaluations.

1. Introduction

Object detection is one of the fundamental tasks in computer vision. A common paradigm to address this problem is to train object detectors which operate on a sub-image and apply these detectors in an exhaustive manner across all locations and scales. This paradigm was successfully used within a discriminatively trained Deformable Part Model (DPM) to achieve state-of-the-art results on detection tasks [6].

The exhaustive search through all possible locations and scales poses a computational challenge. This challenge becomes even harder as the number of classes grows, since most of the approaches train a separate detector per class. In order to address this issue a variety of methods were proposed, varying from detector cascades, to using segmentation to suggest a small number of object hypotheses [17, 2, 4].

In this paper, we subscribe to the latter philosophy and propose to train a detector, called “DeepMultiBox”, which generates a small number of bounding boxes as object candidates. These boxes are generated by a single Deep Neural Network (DNN) in a class-agnostic manner. Our model has several contributions. First, we define object detection as a regression problem to the coordinates of several bounding boxes. In addition, for each predicted box the net outputs a confidence score of how likely this box contains an object. This is quite different from traditional approaches, which score features within predefined boxes, and has the advantage of expressing detection of objects in a very compact and efficient way.

The second major contribution is the loss, which trains the bounding box predictors as part of the network training. For each training example, we solve an assignment problem between the current predictions and the groundtruth boxes and update the matched box coordinates, their confidences and the underlying features through backpropagation. In this way, we learn a deep net tailored towards our localization problem. We capitalize on the excellent representation learning abilities of DNNs, as exemplified recently in image classification [11] and object detection settings [15], and perform joint learning of representation and predictors.

Finally, we train our object box predictor in a class-agnostic manner. We consider this as a scalable way to enable efficient detection of a large number of object classes. We show in our experiments that by only post-classifying fewer than ten boxes, obtained by a single network application, we can achieve competitive detection results. Further, we show that our box predictor generalizes over unseen classes and as such is flexible to be re-used within other detection problems.

2. Previous work

The literature on object detection is vast, and in this section we will focus on approaches exploiting class-agnostic ideas and addressing scalability.

Many of the proposed detection approaches are based on
part-based models [7], which more recently have achieved impressive performance thanks to discriminative learning and carefully crafted features [6]. These methods, however, rely on exhaustive application of part templates over multiple scales and as such are expensive. Moreover, they scale linearly in the number of classes, which becomes a challenge for modern datasets such as ImageNet.¹

¹ A typical deformable-parts model takes 1 CPU-sec/image/label at inference time, thus for 1000 classes inference would take 1000 CPU-seconds; sharing parts across class labels is an open research problem.

To address the former issue, Lampert et al. [12] use a branch-and-bound strategy to avoid evaluating all potential object locations. To address the latter issue, Song et al. [14] use a low-dimensional part basis, shared across all object classes. A hashing-based approach for efficient part detection has shown good results as well [3].

A different line of work, closer to ours, is based on the idea that objects can be localized without having to know their class. Some of these approaches build on bottom-up classless segmentation [10]. The segments, obtained in this way, can be scored using top-down feedback [17, 2, 4]. Using the same motivation, Alexe et al. [1] use an inexpensive classifier to score object hypotheses for being an object or not and in this way reduce the number of locations for the subsequent detection steps. These approaches can be thought of as multi-layered models, with segmentation as the first layer and a segment classification as a subsequent layer. Despite the fact that they encode proven perceptual principles, we will show that having deeper models which are fully learned can lead to superior results.

Finally, we capitalize on the recent advances in Deep Learning, most notably the work by Krizhevsky et al. [11]. We extend their bounding box regression approach for detection to the case of handling multiple objects in a scalable manner. DNN-based regression applied to object masks has been investigated by Szegedy et al. [15]. This last approach achieves state-of-the-art detection performance on VOC2007 but does not scale up to multiple classes due to the cost of a single mask regression: in that setup, one needs to execute 5 networks per class at inference time, which is not scalable for most real-world applications.

3. Proposed approach

We aim at achieving class-agnostic scalable object detection by predicting a set of bounding boxes, which represent potential objects. More precisely, we use a Deep Neural Network (DNN), which outputs a fixed number of bounding boxes. In addition, it outputs a score for each box expressing the network confidence of this box containing an object.

Model To formalize the above idea, we encode the i-th object box and its associated confidence as node values of the last net layer:

Bounding box: we encode the upper-left and lower-right coordinates of each box as four node values, which can be written as a vector $l_i \in \mathbb{R}^4$. These coordinates are normalized w.r.t. image dimensions to achieve invariance to absolute image size. Each normalized coordinate is produced by a linear transformation of the last hidden layer.

Confidence: the confidence score for the box containing an object is encoded as a single node value $c_i \in [0, 1]$. This value is produced through a linear transformation of the last hidden layer followed by a sigmoid.

We can combine the bounding box locations $l_i$, $i \in \{1, \ldots, K\}$, as one linear layer. Similarly, we can treat the collection of all confidences $c_i$, $i \in \{1, \ldots, K\}$, as the output of one sigmoid layer. Both these output layers are connected to the last hidden layer.

At inference time, our algorithm produces K bounding boxes. In our experiments, we use K = 100 and K = 200. If desired, we can use the confidence scores and non-maximum suppression to obtain a smaller number of high-confidence boxes at inference time. These boxes are supposed to represent objects. As such, they can be classified with a subsequent classifier to achieve object detection. Since the number of boxes is very small, we can afford powerful classifiers. In our experiments, we use a second DNN for classification [11].

Training Objective We train a DNN to predict bounding boxes and their confidence scores for each training image such that the highest scoring boxes match well the ground truth object boxes for the image. Suppose that for a particular training example, M objects were labeled by bounding boxes $g_j$, $j \in \{1, \ldots, M\}$. In practice, the number of predictions K is much larger than the number of groundtruth boxes M. Therefore, we try to optimize only the subset of predicted boxes which match best the ground truth ones. We optimize their locations to improve their match and maximize their confidences. At the same time we minimize the confidences of the remaining predictions, which are deemed not to localize the true objects well.

To achieve the above, we formulate an assignment problem for each training example. Let $x_{ij} \in \{0, 1\}$ denote the assignment: $x_{ij} = 1$ iff the i-th prediction is assigned to the j-th true object. The objective of this assignment can be expressed as:

$$F_{\text{match}}(x, l) = \frac{1}{2} \sum_{i,j} x_{ij} \, \lVert l_i - g_j \rVert_2^2 \tag{1}$$

where we use the L2 distance between the normalized bounding box coordinates to quantify the dissimilarity between bounding boxes.
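As a minimal sketch of the matching term in Eq. 1 (not the authors' implementation), the following pure-Python snippet evaluates it for an explicitly given assignment. The function name `match_loss`, the list-of-pairs encoding of the assignment and the toy box coordinates are assumptions made purely for illustration.

```python
def match_loss(assignment, pred_boxes, gt_boxes):
    """Evaluate F_match from Eq. 1 for an explicit assignment.

    assignment: list of (i, j) pairs for which x_ij = 1 (all other x_ij are 0).
    pred_boxes, gt_boxes: lists of normalized (x1, y1, x2, y2) tuples.
    """
    total = 0.0
    for i, j in assignment:
        # Squared L2 distance between prediction i and ground truth j.
        total += sum((p - g) ** 2 for p, g in zip(pred_boxes[i], gt_boxes[j]))
    return 0.5 * total


# Toy example (hypothetical numbers): K = 3 predictions, M = 1 ground-truth box.
preds = [(0.1, 0.1, 0.5, 0.5), (0.4, 0.4, 0.9, 0.9), (0.0, 0.0, 1.0, 1.0)]
truth = [(0.5, 0.5, 1.0, 1.0)]

loss_near = match_loss([(1, 0)], preds, truth)  # prediction 1 is close to the object
loss_far = match_loss([(0, 0)], preds, truth)   # prediction 0 is far from it
print(loss_near, loss_far)
```

Assigning the nearby prediction yields a much smaller matching term than assigning the distant one, which is the property the assignment problem exploits.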
Additionally, we want to optimize the confidences of the boxes according to the assignment x. Maximizing the confidences of assigned predictions can be expressed as:

$$F_{\text{conf}}(x, c) = -\sum_{i,j} x_{ij} \log(c_i) - \sum_i \Big(1 - \sum_j x_{ij}\Big) \log(1 - c_i) \tag{2}$$

In the above objective $\sum_j x_{ij} = 1$ iff prediction i has been matched to a groundtruth. In that case $c_i$ is being maximized, while in the opposite case it is being minimized. A different interpretation of the above term is achieved if we view $\sum_j x_{ij}$ as a probability of prediction i containing an object of interest. Then, the above loss is the negative of the entropy and thus corresponds to a max entropy loss.

The final loss objective combines the matching and confidence losses:

$$F(x, l, c) = \alpha F_{\text{match}}(x, l) + F_{\text{conf}}(x, c) \tag{3}$$

subject to the constraints in Eq. 5. α balances the contribution of the different loss terms.

Optimization For each training example, we solve for an optimal assignment $x^*$ of predictions to true boxes by

$$x^* = \arg\min_x F(x, l, c) \tag{4}$$

$$\text{subject to } x_{ij} \in \{0, 1\}, \quad \sum_i x_{ij} = 1, \tag{5}$$

where the constraints enforce an assignment solution. This is a variant of bipartite matching, which is polynomial in complexity. In our application the matching is very inexpensive: the number of labeled objects per image is less than a dozen and in most cases only very few objects are labeled.

Then, we optimize the network parameters via back-propagation. For example, the first derivatives of the back-propagation algorithm are computed w.r.t. l and c:

$$\frac{\partial F}{\partial l_i} = \alpha \sum_j (l_i - g_j) \, x^*_{ij} \tag{6}$$

$$\frac{\partial F}{\partial c_i} = \frac{c_i - \sum_j x^*_{ij}}{c_i (1 - c_i)} \tag{7}$$

Training Details While the loss as defined above is in principle sufficient, three modifications make it possible to reach better accuracy significantly faster. The first such modification is to perform clustering of ground truth locations and find K such clusters/centroids that we can use as priors for each of the predicted locations. Thus, the learning algorithm is encouraged to learn a residual to a prior, for each of the predicted locations.

A second modification pertains to using these priors in the matching process: instead of matching the M ground truth locations with the K predictions, we find the best match between the K priors and the ground truth. Once the matching is done, the target confidences are computed as before. Moreover, the location prediction loss is also unchanged: for any matched pair of (target, prediction) locations, the loss is defined by the difference between the groundtruth and the coordinates that correspond to the matched prior. We call the usage of priors for matching prior matching and hypothesize that it enforces diversification among the predictions, since the linear assignment forces the model to learn a diverse set of predictions. We have found that without prior matching, the convergence speed and quality of our models were significantly lower.

It should be noted that although we defined our method in a class-agnostic way, we can apply it to predicting object boxes for a particular class. To do this, we simply need to train our models on bounding boxes for that class.

Further, we can predict K boxes per class. Unfortunately, this model will have a number of parameters growing linearly with the number of classes. Also, in a typical setting, where the number of objects for a given class is relatively small, most of these parameters will see very few training examples with a corresponding gradient contribution. We thus argue that our two-step process (first localize, then recognize) is a superior alternative in that it allows leveraging data from multiple object types in the same image using a small number of parameters.

4. Experimental results

4.1. Network Architecture and Experiment Details

The network architecture for the localization and classification models that we use is the same as the one used by [11]. We use Adagrad for controlling the learning rate decay, mini-batches of size 128, and parallel distributed training with multiple identical replicas of the network, which achieves faster convergence. As mentioned previously, we use priors in the localization loss; these are computed using k-means on the training set. We also use an α of 0.3 to balance the localization and confidence losses.

The localizer might output coordinates outside the crop area used for the inference. The coordinates are mapped and truncated to the final image area at the end. Boxes are additionally pruned using non-maximum suppression with a Jaccard similarity threshold of 0.5. Our second model then classifies each bounding box as an object of interest or “background”.

To train our localizer networks, we generated several million images (approximately 10–30 million, depending on the dataset) from the training set by applying the following procedure to each image in the training set. For each image, we generate the same number of square samples such that the total number of samples is about ten million. For each image, the samples are bucketed such that for each of the ratios in the ranges of 0–5%, 5–15%, 15–50%, 50–100%, there is an equal number of samples in which the ratio covered by the bounding boxes is in the given range.
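The post-processing described earlier in this subsection (truncating boxes to the image area, then pruning with non-maximum suppression at a Jaccard threshold of 0.5) can be sketched as follows. This is an illustrative, assumed implementation rather than the authors' code: the helper names `jaccard`, `clip_box` and `nms`, the normalized `(x1, y1, x2, y2)` box format, and the choice to suppress a box only when its overlap strictly exceeds the threshold are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity (intersection over union) of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def clip_box(box):
    """Truncate normalized coordinates to the image area [0, 1]."""
    return tuple(min(1.0, max(0.0, v)) for v in box)


def nms(boxes, scores, threshold=0.5, top_k=None):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with a better box.
        if all(jaccard(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
            if top_k is not None and len(keep) == top_k:
                break
    return keep


# Toy example (hypothetical numbers): the second box heavily overlaps the first.
boxes = [(0.0, 0.0, 0.5, 0.5), (0.05, 0.05, 0.55, 0.55), (0.6, 0.6, 0.9, 0.9)]
scores = [0.9, 0.8, 0.7]
kept = nms([clip_box(b) for b in boxes], scores, threshold=0.5, top_k=10)
print(kept)
```

In this toy case the second box overlaps the highest-scoring one by more than 0.5 and is dropped, while the third, disjoint box is retained.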
For the experiments below we have not explored any non-standard data generation or regularization options. In all experiments, all hyper-parameters were selected by evaluating on a held out portion of the training set (10% random choice of examples).
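To tie the training objective of Section 3 to the α = 0.3 setting reported in this subsection, the sketch below evaluates the combined loss F = αF_match + F_conf, finds the optimal assignment, and computes the derivatives of F with respect to the box coordinates and confidences. It is a toy illustration under stated assumptions, not the authors' implementation: the assignment is taken to be one-to-one (each ground truth gets exactly one prediction and each prediction at most one ground truth), it is found by exhaustive search instead of a polynomial-time bipartite matching solver, and all names and numbers are hypothetical. The derivative with respect to the coordinates carries the factor α that follows from differentiating the combined loss.

```python
import math
from itertools import permutations

ALPHA = 0.3  # weighting of the matching term, as reported in Section 4.1


def sq_dist(a, b):
    """Squared L2 distance between two boxes."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def total_loss(match, boxes, confs, gts, alpha=ALPHA):
    """F = alpha * F_match + F_conf for a one-to-one assignment.

    match[j] = i means ground truth j is assigned to prediction i (x_ij = 1).
    """
    matched = set(match)
    f_match = 0.5 * sum(sq_dist(boxes[i], gts[j]) for j, i in enumerate(match))
    f_conf = -sum(math.log(c) if i in matched else math.log(1.0 - c)
                  for i, c in enumerate(confs))
    return alpha * f_match + f_conf


def best_assignment(boxes, confs, gts, alpha=ALPHA):
    """Exhaustive search over one-to-one assignments (feasible for toy sizes only)."""
    candidates = permutations(range(len(boxes)), len(gts))
    return min(candidates, key=lambda m: total_loss(m, boxes, confs, gts, alpha))


def gradients(match, boxes, confs, gts, alpha=ALPHA):
    """Analytic dF/dl_i and dF/dc_i at a fixed assignment."""
    assigned = {i: j for j, i in enumerate(match)}
    d_boxes, d_confs = [], []
    for i, (box, c) in enumerate(zip(boxes, confs)):
        if i in assigned:
            g = gts[assigned[i]]
            d_boxes.append(tuple(alpha * (p - q) for p, q in zip(box, g)))
        else:
            d_boxes.append((0.0, 0.0, 0.0, 0.0))  # unmatched boxes get no location gradient
        s = 1.0 if i in assigned else 0.0         # s = sum_j x_ij
        d_confs.append((c - s) / (c * (1.0 - c)))
    return d_boxes, d_confs


# Toy example (hypothetical numbers): K = 3 predictions, M = 2 ground-truth boxes.
boxes = [(0.1, 0.1, 0.5, 0.5), (0.4, 0.4, 0.9, 0.9), (0.0, 0.0, 1.0, 1.0)]
confs = [0.6, 0.7, 0.2]
gts = [(0.5, 0.5, 1.0, 1.0), (0.1, 0.1, 0.45, 0.5)]

best = best_assignment(boxes, confs, gts)
d_boxes, d_confs = gradients(best, boxes, confs, gts)
print(best, total_loss(best, boxes, confs, gts))
```

For realistic sizes (K = 100 or more) the exhaustive search is far too slow; a Hungarian-style solver such as SciPy's `scipy.optimize.linear_sum_assignment`, applied to a cost matrix built from the same per-pair terms, would be the natural replacement.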
4.2. VOC 2007

The Pascal Visual Object Classes (VOC) Challenge [5] is the most common benchmark for object detection algorithms. It consists mainly of complex scene images in which bounding boxes of 20 diverse object classes were labelled.

In our evaluation we focus on the 2007 edition of VOC, for which a test set was released. We present results by training on VOC 2012, which contains approx. 11000 images. We trained a 100 box localizer as well as a deep net based classifier [11].

4.2.1 Training methodology

We trained the classifier on a data set comprising

• 10 million crops overlapping some object with at least 0.5 Jaccard overlap similarity. The crops are labeled with one of the 20 VOC object classes.

• 20 million negative crops that have at most 0.2 Jaccard similarity with any of the object boxes. These crops are labeled with the special “background” class label.

The architecture and the selection of hyperparameters followed that of [11].

4.2.2 Evaluation methodology

In the first round, the localizer model is applied to the maximum center square crop in the image. The crop is resized to the network input size, which is 220 × 220. A single pass through this network gives us up to one hundred candidate boxes. After non-maximum suppression with overlap threshold 0.5, the top 10 highest scoring detections are kept and classified by the 21-way classifier model in separate passes through the network. The final detection score is the product of the localizer score for the given box and the score of the classifier evaluated on the maximum square region around the crop. These scores are passed to the evaluation and used for computing the precision recall curves.

4.3. Discussion

First, we analyze the performance of our localizer in isolation. We present the number of detected objects, as defined by the Pascal detection criterion, against the number of produced bounding boxes. In Fig. 1 we show results obtained by training on VOC2012. In addition, we present results by using the max-center square crop of the image as input as well as by using two scales: the max-center crop plus a second scale where we select 3 × 3 windows of size 60% of the image size.

Figure 1. Detection rate of class “object” vs number of bounding boxes per image. The model, used for these results, was trained on VOC 2012.

As we can see, when using a budget of 10 bounding boxes we can localize 45.3% of the objects with the first model, and 48% with the second model. This shows better performance than other reported results, such as the objectness algorithm achieving 42% [1]. Further, this plot shows the importance of looking at the image at several resolutions. Although our algorithm manages to get a large number of objects by using the max-center crop, we obtain an additional boost when using higher resolution image crops.

Further, we classify the produced bounding boxes by a 21-way classifier, as described above. The average precisions (APs) on VOC 2007 are presented in Table 1. The achieved mean AP is 0.29, which is quite competitive. Note that our running time complexity is very low: we simply use the top 10 boxes.

Example detections and full precision recall curves are shown in Fig. 2 and Fig. 3 respectively. It is important to note that the visualized detections were obtained by using only the max-centered square image crop, i.e. the full image was used. Nevertheless, we manage to obtain relatively small objects, such as the boats in row 2 and column 2, as well as the sheep in row 3 and column 3.

4.4. ILSVRC 2012 Classification with Localization Challenge

For this set of experiments, we used the ILSVRC 2012 classification with localization challenge dataset. This dataset consists of 544,545 training images labeled with categories and locations of 1,000 object categories, relatively
class                aero  bicycle  bird  boat  bottle  bus   car   cat   chair  cow
DeepMultiBox         .413  .277     .305  .176  .032    .454  .362  .535  .069   .256
3-layer model [18]   .294  .558     .094  .143  .286    .440  .513  .213  .200   .193
Felz. et al. [6]     .328  .568     .025  .168  .285    .397  .516  .213  .179   .185
Girshick et al. [9]  .324  .577     .107  .157  .253    .513  .542  .179  .210   .240
Szegedy et al. [15]  .292  .352     .194  .167  .037    .532  .502  .272  .102   .348

class                table  dog   horse  m-bike  person  plant  sheep  sofa  train  tv
DeepMultiBox         .273   .464  .312   .297    .375    .074   .298   .211  .436   .225
3-layer model [18]   .252   .125  .504   .384    .366    .151   .197   .251  .368   .393
Felz. et al. [6]     .259   .088  .492   .412    .368    .146   .162   .244  .392   .391
Girshick et al. [9]  .257   .116  .556   .475    .435    .145   .226   .342  .442   .413
Szegedy et al. [15]  .302   .282  .466   .417    .262    .103   .328   .268  .398   .47

Table 1. Average Precision on VOC 2007 test of our method, called DeepMultiBox, and other competitive methods. DeepMultiBox was trained on VOC2012 training data, while the rest of the models were trained on VOC2007 data.
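As a quick arithmetic check of the mean AP of 0.29 quoted in Section 4.3, the per-class DeepMultiBox values from Table 1 can be averaged directly. The dictionary name below is illustrative; the numbers are copied from the DeepMultiBox rows of the table.

```python
# Per-class average precision of DeepMultiBox, copied from Table 1.
deep_multibox_ap = {
    "aero": 0.413, "bicycle": 0.277, "bird": 0.305, "boat": 0.176,
    "bottle": 0.032, "bus": 0.454, "car": 0.362, "cat": 0.535,
    "chair": 0.069, "cow": 0.256, "table": 0.273, "dog": 0.464,
    "horse": 0.312, "m-bike": 0.297, "person": 0.375, "plant": 0.074,
    "sheep": 0.298, "sofa": 0.211, "train": 0.436, "tv": 0.225,
}

# Mean AP is the unweighted average over the 20 VOC classes.
mean_ap = sum(deep_multibox_ap.values()) / len(deep_multibox_ap)
print(round(mean_ap, 2))
```

The 20 values sum to 5.844, so the unweighted mean is about 0.292, which rounds to the reported 0.29.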

Figure 2. Sample of detection results on VOC 2007: up to 10 boxes from the class-agnostic detector are output, after non-max-suppression
with Jaccard overlap 0.5 is performed.

uniformly distributed among the classes. The validation set, on which the performance metrics are calculated, consists of 48,238 images.

4.4.1 Training methodology

In addition to a localization model that is identical (up to the dataset on which it is trained) to the VOC model, we also train a model on the ImageNet Classification challenge data, which will serve as the recognition model. This model is trained in a procedure that is substantially similar to that of [11] and is able to achieve the same results on the classification challenge validation set; note that we only train a single model, instead of 7. The latter brings substantial benefits in terms of classification accuracy, but is 7× more
Figure 3. Precision-recall curves on selected VOC classes. Panels: cat, chair, horse, person, potted plant, sheep, train, tv; each panel plots precision against recall.

expensive, which is not a negligible factor.

Inference is done as with the VOC setup: the number of predicted locations is K = 100, which are then reduced by non-maximum suppression (Jaccard overlap criterion of 0.4) and which are post-scored by the classifier: the score is the product of the localizer confidence for the given box and the score of the classifier evaluated on the minimum square region around the crop. The final scores (detection score times classification score) are then sorted in descending order and only the top scoring score/location pair is kept for a given class (as per the challenge evaluation criterion).

In all experiments, the hyper-parameters were selected by evaluating on a held out portion of the training set (10% random choice of examples).

4.4.2 Evaluation methodology

The official metric of the “Classification with localization” ILSVRC-2012 challenge is detection@5, where an algorithm is only allowed to produce one box for each of the 5 labels (in other words, a model is neither penalized nor rewarded for producing valid multiple detections of the same class), where the detection criterion is 0.5 Jaccard overlap with any of the ground-truth boxes (in addition to the matching class label).

Table 2 contains a comparison of the proposed method, dubbed DeepMultiBox, with classifying the ground-truth boxes directly and with the approach of inferring one box per class directly. The metrics reported are detection@5 and classification@5, the official metrics of the ILSVRC-2012 challenge. In the table, we vary the number of windows at which we apply the classifier (this number represents the top windows chosen after non-maximum suppression, the ranking coming from the confidence scores). The one-box-per-class approach is a careful re-implementation of the winning entry of ILSVRC-2012 (the “classification with localization” challenge), with 1 network trained (instead of 7).

Table 2. Performance of MultiBox (the proposed method) vs. classifying ground-truth boxes directly and predicting one box per class.

Method                         det@5    class@5
One-box-per-class              61.00%   79.40%
Classify GT directly           82.81%   82.81%
DeepMultiBox, top 1 window     56.65%   73.03%
DeepMultiBox, top 3 windows    58.71%   77.56%
DeepMultiBox, top 5 windows    58.94%   78.41%
DeepMultiBox, top 10 windows   59.06%   78.70%
DeepMultiBox, top 25 windows   59.04%   78.76%

We can see that the DeepMultiBox approach is quite competitive: with 5-10 windows, it is able to perform about as well as the competing approach. While the one-box-per-class approach may come off as more appealing in this particular case in terms of the raw performance, it suffers from a number of drawbacks: first, its output scales linearly with the number of classes, for which there needs to be training data. The multibox approach can in principle use transfer learning to detect certain types of objects on which it has never been specifically trained, but which share similarities with objects that it has seen.² Figure 5 explores this hypothesis by observing what happens when one takes a localization model trained on ImageNet and applies it on the VOC test set, and vice-versa. The figure shows a precision-recall curve: in this case, we perform a class-agnostic detection: a true positive occurs if two windows (prediction and groundtruth) overlap by more than 0.5, independently of their class. Interestingly, the ImageNet-trained model is able to capture more VOC windows than vice-versa: we hypothesize that this is due to the ImageNet class set being

² For instance, if one trains with fine-grained categories of dogs, it will likely generalize to other kinds of breeds by itself.
Figure 4. Some selected detection results on the ILSVRC-2012 classification with localization challenge validation set.

much richer than the VOC class set.

Secondly, the one-box-per-class approach does not generalize naturally to multiple instances of objects of the same type (except via the method presented in this work, for instance). Figure 5 shows this too, in the comparison between DeepMultiBox and the one-per-class approach.³ Generalizing to such a scenario is necessary for actual image understanding by algorithms, thus such limitations need to be overcome, and our method is a scalable way of doing so. Evidence supporting this statement is given in Figure 5, which shows that the proposed method is generally able to capture more objects more accurately than a single-box method.

³ In the case of the one-box-per-class method, non-maximum suppression is performed on the 1000 boxes using the same criterion as the DeepMultiBox method.

5. Discussion and Conclusion

In this work, we propose a novel method for localizing objects in an image, which predicts multiple bounding boxes at a time. The method uses a deep convolutional neural network as a base feature extraction and learning model. It formulates a multiple box localization cost that is able to take advantage of a variable number of groundtruth locations of interest in a given image and learn to predict such locations in unseen images.

We present results on two challenging benchmarks, VOC2007 and ILSVRC-2012, on which the proposed method is competitive. Moreover, the method is able to perform well by predicting only very few locations to be probed by a subsequent classifier. Our results show that the DeepMultiBox approach is scalable and can even generalize across the two datasets, in terms of being able to predict locations of interest, even for categories on which it was not trained. Additionally, it is able to capture multiple instances of objects of the same class, which is an important feature of algorithms that aim for better image understanding.

While our method is indeed competitive, there exist methods which have substantially larger computational cost, but that can achieve better detection performance, notably on VOC2007 and ILSVRC localization. OverFeat [13] efficiently slides a convolutional network at multiple locations and scales, predicting one bounding box per class. That model takes 2 seconds/image on a GPU, roughly 40x slower than a GPU implementation of our model. Fig. 9 of [13] has the results of a single-scale, centered crop version of their model, the closest to what we propose. That results in a 40% top-5 result on ILSVRC-2012, compared to 40.94%, but with DeepMultiBox we are able to extract multiple regions of interest in one network evaluation.

Another method is that of [8], which uses selective search [16] to propose 2000 candidate locations per image, extracts top-layer features from a ConvNet and uses a hard-negative-trained SVM to classify the locations into VOC classes. The main differences with our approach are that this method is 200x more expensive, the authors pre-train their feature extractor on ImageNet and that they use hard negative mining to learn a mapping from features to classes that has a low false positive ratio.

The latter two are good lessons, which we need to explore. While we showed in Fig. 1 that by predicting more
Figure 5. Class-agnostic detection on ILSVRC-2012 (left) and VOC 2007 (right).

windows we are able to capture more ground-truth bounding boxes, a comparable increase in mAP on VOC2007 was not observed by us. We hypothesize that a classification model that incorporates better hard-negative mining and learns to better model local features, the context and detector confidences jointly will likely take better advantage of the proposed windows.

In the future, we hope to be able to fold the localization and recognition paths into a single network, such that we would be able to extract both location and class label information in a single feed-forward pass through the network. Even in its current state, the two-pass procedure (localization network followed by categorization network) entails 5-10 network evaluations. Importantly, this number does not scale linearly with the number of classes to be recognized, which still makes the proposed approach very competitive with DPM-like approaches.

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR. IEEE, 2010.
[2] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, 2010.
[3] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, 2013.
[4] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, 2010.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.
[7] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. Computers, IEEE Transactions on, 100(1):67–92, 1973.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[9] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/.
[10] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, 2009.
[11] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[12] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR, 2008.
[13] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[14] H. O. Song, S. Zickler, T. Althoff, R. Girshick, M. Fritz, C. Geyer, P. Felzenszwalb, and T. Darrell. Sparselet models for efficient multiclass object detection. In ECCV, 2012.
[15] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems (NIPS), 2013.
[16] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
[17] K. E. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.
[18] L. Zhu, Y. Chen, A. Yuille, and W. Freeman. Latent hierarchical structural learning for object detection. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1062–1069. IEEE, 2010.
