
R-CNN minus R

Karel Lenc    Andrea Vedaldi
Department of Engineering Science, University of Oxford

arXiv:1506.06981v1 [cs.CV] 23 Jun 2015

June 24, 2015

Abstract

Deep convolutional neural networks (CNNs) have had a major impact in most areas of image understanding, including object category detection. In object detection, methods such as R-CNN have obtained excellent results by integrating CNNs with region proposal generation algorithms such as selective search. In this paper, we investigate the role of proposal generation in CNN-based detectors in order to determine whether it is a necessary modelling component, carrying essential geometric information not contained in the CNN, or whether it is merely a way of accelerating detection. We do so by designing and evaluating a detector that uses a trivial region generation scheme, constant for each image. Combined with SPP, this results in an excellent and fast detector that does not require processing the image with any algorithms other than the CNN itself. We also streamline and simplify the training of CNN-based detectors by integrating several learning steps in a single algorithm, as well as by proposing a number of improvements that accelerate detection.

Figure 1: Some examples of the bounding box regressor outputs. The dashed box is the image-agnostic proposal, correctly selected despite the bad overlap, and the solid box is the result of improving it by using the pose regressor. Both steps use the same CNN, but the first uses the geometrically-invariant fully-connected layers, and the last the geometry-sensitive convolutional layers. In this manner, accurate object location can be recovered without using complementary mechanisms such as selective search.
1 Introduction

Object detection is one of the core problems in image understanding. Until recently, the best performing detectors in standard benchmarks such as PASCAL VOC were based on a combination of handcrafted image representations, such as SIFT, HOG, and the Fisher Vector, and a form of structured output regression, from sliding windows to deformable parts models. Recently, however, these pipelines have been significantly outperformed by ones based on deep learning, which acquire representations automatically from data using Convolutional Neural Networks (CNNs). Currently, the best CNN-based detectors are based on the R-CNN construction of [9]. Conceptually, R-CNN is remarkably simple: it samples image regions using a proposal mechanism such as Selective Search (SS; [18]) and classifies them as foreground or background using a CNN. Looking more closely, however, R-CNN leaves open several interesting questions.

The first question is whether CNNs contain sufficient geometric information to localise objects, or whether the latter must be supplemented by an external mechanism, such as region proposal generation. There are in fact two hypotheses. The first one is that the only role of proposal generation is to cut down computation by allowing the CNN, which is expensive, to be evaluated on a small number of image regions. In this case, proposal generation becomes less important as other speedups such as SPP-CNN [11] become available, and may be foregone. The second hypothesis, instead, is that proposal generation provides geometric information vital for accurate object localisation which is not represented in the CNN. This is not unlikely, given that CNNs are often trained to be highly invariant to even large geometric deformations and hence may not be sensitive to an object's location. This question is answered in Section 3.1 by showing that the convolutional layers of standard CNNs contain sufficient information to localise objects (Figure 1).

The second question is whether the R-CNN pipeline can be simplified. While conceptually straightforward, in practice R-CNN comprises many steps that need to be carefully implemented and tuned to obtain good performance. To start with, R-CNN builds on a CNN pre-trained on an image classification task such as ImageNet ILSVRC [5]. This CNN is ported to detection by: i) learning an SVM classifier for each object class on top of the last fully-connected layer of the network, ii) fine-tuning the CNN on the task of discriminating objects and background, and iii) learning a bounding box regressor for each object class. Section 3.2 simplifies these steps, which require running a mix of different software on cached data, by training a single CNN addressing all required tasks.

The third question is whether R-CNN can be accelerated. A substantial speedup was already obtained by the spatial pyramid pooling (SPP) of [11], which realised that convolutional features can be shared among different regions rather than being recomputed. However, this does not accelerate training, and at test time the region proposal generation mechanism becomes the new bottleneck. The combination of dropping proposal generation and of the other simplifications is shown in Section 4 to provide a substantial detection speedup, and this for the overall system, not just the CNN part. Our findings are summarised in Section 5.

Related work. The basis of our work is the current generation of deep CNNs for image understanding, pioneered by [14]. For object detection, our method builds directly on the R-CNN approach of [9] as well as the SPP extension proposed in [11]. All such methods rely not only on CNNs, but also on a region proposal generation mechanism such as SS [18], CPMC [3], multi-scale combinatorial grouping [2], and edge boxes [23]. These methods, which are extensively reviewed in [13], originate in the idea of "objectness" proposed by [1]. Interestingly, [13] showed that a good region proposal scheme is essential for R-CNN to work well. Here, we show that this is in fact not the case, provided that bounding box locations are corrected by a strong CNN-based bounding box regressor, a step that was not evaluated for R-CNNs in [13]. The R-CNN and SPP-CNN detectors build on years of research in object detection. Both can be seen as accelerated sliding window detectors [20, 4]. The two-stage computation using region proposal generation is a form of cascade detector [20] or jumping window [17, 19]. However, they differ from part-based detectors such as [7] in that they do not explicitly model object parts in learning; instead, parts are implicitly captured in the CNN. Integrated training of SPP-CNN as a single CNN learning problem, not dissimilar to some of the ideas of Section 3.2, has very recently been explored in the unpublished manuscript [8].

2 CNN-based detectors

This section introduces the R-CNN (Section 2.1) and SPP-CNN (Section 2.2) detectors.

2.1 R-CNN detector

The R-CNN method [9] is a chain of conceptually simple steps: generating candidate object regions, classifying them as foreground or background, and post-processing them to improve their fit to objects. These steps are described next.

Region proposal generation. R-CNN starts by running an algorithm such as SS [18] or CPMC [3] to extract from an image x a shortlist of image regions R ∈ R(x) that are likely to contain objects. These proposals, on the order of a few thousand per image, may have arbitrary shapes, but in the following are assumed to be converted to rectangles.

CNN-based features. Candidate regions are described by CNN features before being classified. The CNN itself is transferred from a different problem, usually image classification in the ImageNet ILSVRC challenge [5]. In this manner, the CNN can be trained on a very large dataset, as required to obtain good performance, and then applied to object detection, where datasets are usually much smaller. In order to transfer a pre-trained CNN to object detection, its last few layers, which are specific to the classification task, are removed; this results in a "beheaded" CNN φ that outputs relatively generic features. The CNN is applied to the image regions R by cropping and resizing the image x, i.e. φRCNN(x; R) = φ(resize(x|R)). Cropping and resizing serve two purposes: to localise the descriptor and to provide the CNN with an image of a fixed size, as this is required by many CNN architectures.

SVM training. Given the region descriptor φRCNN(x; R), the next step is to learn an SVM classifier to decide whether a region contains an object or background. Learning the SVM starts from a number of example images x1, ..., xN, each annotated with ground-truth regions R̄ ∈ Rgt(xi) and object labels c(R̄) ∈ {1, ..., C}. In order to learn a classifier for class c, R-CNN divides ground-truth Rgt(xi) and candidate R(x) regions into positives and negatives. In particular, ground truth regions R ∈ Rgt(x) for class c(R) = c are assigned a positive label y(R; c, τ) = +1; other regions R are labelled as ambiguous, y(R; c, τ) = ε, and ignored if they have overlap(R, R̄) ≥ τ with any ground truth region R̄ ∈ Rgt(x) of the same class c(R̄) = c. The remaining regions are labelled as negative. Here overlap(A, B) = |A ∩ B|/|A ∪ B| is the intersection-over-union (IoU) overlap measure, and the threshold is set to τ = 0.3. The SVM takes the form φSVM ◦ φRCNN(x; R), where φSVM is a linear predictor ⟨wc, φRCNN⟩ + bc learned using an SVM solver to minimise the regularised empirical hinge loss risk.

Bounding box regression. Candidate bounding boxes are refitted to detected objects by using a CNN-based regressor as in [9]. Given a candidate bounding box R = (x, y, w, h), where (x, y) is its centre and (w, h) its width and height, a linear regressor estimates an adjustment d = (dx, dy, dw, dh) that yields the new bounding box d[R] = (w dx + x, h dy + y, w e^dw, h e^dh). In order to train this regressor, one collects for each ground truth region R* all the candidates R that overlap sufficiently with it (with an overlap of at least 0.5). Each pair (R*, R) of regions is converted into a training input/output pair (φcnv(x, R), d) for the regressor, where d is the adjustment required to transform R into R*, i.e. R* = d[R]. The pairs are then used to train the regressor using ridge regression with a large regularisation constant. The regressor itself takes the form d = Qc⊤ φcnv(resize(x|R)) + tc, where φcnv denotes the CNN restricted to the convolutional layers, as further discussed in Section 2.2. The regressor is further improved by retraining it after removing the 20% of the examples with the worst regression loss, a heuristic found in the publicly-available implementation of SPP-CNN.

Post-processing. The refined bounding boxes are passed to non-maxima suppression before being evaluated. Non-maxima suppression eliminates duplicate detections, prioritising regions with higher SVM score ŝ(φ(R)). Starting from the highest ranked region in an image, other regions are iteratively removed if they overlap by more than 0.3 with any region retained so far.
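To make these definitions concrete, the following is a minimal sketch (ours, not the authors' implementation; boxes are (x1, y1, x2, y2) corner tuples, except for the centre/size form used by the adjustment) of the IoU overlap, the adjustment d[R], and greedy non-maxima suppression with the 0.3 threshold:

    import numpy as np

    def overlap(A, B):
        # IoU: |A ∩ B| / |A ∪ B| for boxes (x1, y1, x2, y2)
        iw = max(0.0, min(A[2], B[2]) - max(A[0], B[0]))
        ih = max(0.0, min(A[3], B[3]) - max(A[1], B[1]))
        inter = iw * ih
        area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area(A) + area(B) - inter)

    def apply_adjustment(R, d):
        # d[R] = (w*dx + x, h*dy + y, w*exp(dw), h*exp(dh)) in centre/size form
        x, y, w, h = R
        dx, dy, dw, dh = d
        return (w * dx + x, h * dy + y, w * np.exp(dw), h * np.exp(dh))

    def nms(boxes, scores, max_overlap=0.3):
        # greedily keep the highest-scored boxes, dropping near-duplicates
        keep = []
        for i in np.argsort(scores)[::-1]:
            if all(overlap(boxes[i], boxes[j]) <= max_overlap for j in keep):
                keep.append(i)
        return keep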

CNN fine-tuning. The quality of the CNN features, ported from an image classification task, can be improved by fine-tuning the network on the target data. In order to do so, the CNN φRCNN(x; R) is concatenated with additional layers φsftmx (linear followed by softmax normalisation) to obtain a predictor for the C + 1 object classes. The new CNN φsftmx ◦ φRCNN(x; R) is then trained as a classifier by minimising its empirical logistic risk on a training set of labelled regions. This is analogous to the procedure used to learn the CNN in the first place, but with a reduced learning rate and a different (and smaller) training set similar to the one used to train the SVM. In this dataset, a region R, either ground-truth or candidate, is assigned the class c(R; τ+, τ−) = c(R̄*) of the closest ground-truth region R̄* = argmax_{R̄ ∈ Rgt(x)} overlap(R, R̄), provided that overlap(R, R̄*) ≥ τ+. If instead overlap(R, R̄*) < τ−, then the region is labelled as background, c(R; τ+, τ−) = 0, and the remaining regions as ambiguous. By default τ+ and τ− are both set to 1/2, resulting in a much more relaxed training set than for the SVM. Since the dataset is strongly biased towards background regions, during CNN training it is rebalanced by sampling regions such that c(R) > 0 with 25% probability and regions such that c(R) = 0 with 75% probability.

2.2 SPP-CNN detector

A significant disadvantage of R-CNN is the need to recompute the whole CNN from scratch for each evaluated region; since this occurs thousands of times per image, the method is slow. SPP-CNN addresses this issue by factoring the CNN φ = φfc ◦ φcnv into two parts, where φcnv contains the so-called convolutional layers, pooling information from local regions, and φfc the fully connected (FC) ones, pooling information from the image as a whole. Since the convolutional layers encode local information, this can be selectively pooled to encode the appearance of an image subregion R instead of the whole image. In more detail, let y = φcnv(x) be the output of the convolutional layers applied to image x. The feature field y is a H × W × D tensor of height H and width W, proportional to the height and width of the input image x, and D feature channels. Let z = SP(y; R) be the result of applying the spatial pooling (SP) operator to the features in y contained in region R. This operator is defined as:

    zd = max_{(i,j) : g(i,j) ∈ R} yijd,    d = 1, ..., D    (1)

where the function g maps the feature coordinates (i, j) back to image coordinates g(i, j). The SP operator is extended to spatial pyramid pooling (SPP; [15]) by dividing the region R into subregions R = R1 ∪ R2 ∪ ... ∪ RK, applying the SP operator to each, and then stacking the resulting features. In practice, SPP-CNN uses K × K subdivisions, where K is chosen to match the size of the convolutional feature field in the original CNN. In this manner, the output can be concatenated with the existing FC layers: φSPP(x; R) = φfc ◦ SPP(·; R) ◦ φcnv(x). Note that, compared to R-CNN, the first part of the computation is shared among all regions R.

Next, we derive the map g that transforms feature coordinates back to image coordinates as required by (1) (this correspondence was established only heuristically in [11]). It suffices to consider one spatial dimension. The question is which pixel x0(i0) corresponds to feature xL(iL) in the L-th layer of a CNN. While there is no unique definition, a useful one is to let i0 be the centre of the receptive field of feature xL(iL), defined as the set of pixels ΩL(iL) that can affect xL(iL) as a function of the image (i.e. the support of the feature). A short calculation leads to

    i0 = gL(iL) = αL (iL − 1) + βL,
    αL = ∏_{p=1}^{L} Sp,
    βL = 1 + ∑_{p=1}^{L} ( ∏_{q=1}^{p−1} Sq ) ( (Fp − 1)/2 − Pp ),

where Sp, Fp and Pp are, respectively, the stride, filter support, and padding of layer p.
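For concreteness, αL and βL can be accumulated layer by layer, and the SP operator (1) then reduces to a max over the feature positions whose receptive-field centres fall inside R. A minimal sketch (ours; the region and layer parameter conventions are illustrative):

    import numpy as np

    def feature_to_image_map(strides, filters, pads):
        # accumulate alpha_L = prod_p S_p and
        # beta_L = 1 + sum_p (prod_{q<p} S_q) * ((F_p - 1)/2 - P_p)
        alpha, beta = 1.0, 1.0
        for S, F, P in zip(strides, filters, pads):
            beta += alpha * ((F - 1) / 2.0 - P)
            alpha *= S
        return alpha, beta  # i0 = alpha * (iL - 1) + beta, 1-based indices

    def sp_pool(y, R, alpha, beta):
        # y: H x W x D feature field; R = (r1, c1, r2, c2) in image pixels
        H, W, D = y.shape
        rows = alpha * np.arange(H) + beta  # np.arange(H) plays the role of iL - 1
        cols = alpha * np.arange(W) + beta
        r1, c1, r2, c2 = R
        ri = np.where((rows >= r1) & (rows <= r2))[0]
        ci = np.where((cols >= c1) & (cols <= c2))[0]
        if ri.size == 0 or ci.size == 0:
            return np.zeros(D)  # no receptive-field centre falls inside R
        return y[np.ix_(ri, ci)].reshape(-1, D).max(axis=0)

Pooling over a K × K subdivision of R and stacking the results then gives the SPP descriptor.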

[Figure 2 shows five columns of 2D histograms, with panels GT, SS, SS (overlap with GT > 0.5), SW 7k, and Cluster 3k; the rows plot the box centre (x, y), log10(w) vs log10(h), and log10(scale) vs |c|.]

Figure 2: Bounding box distributions using the normalised coordinates of Section 3.1. Rows show the histograms for the bounding box centre (x, y), size (w, h), and scale vs distance from centre (s, |c|). Columns show the statistics for ground-truth, selective search, restricted selective search, sliding window, and cluster bounding boxes.

3 Simplifying and streamlining R-CNN

This section describes the main technical contributions of the paper: removing region proposal generation from R-CNN (Section 3.1) and streamlining the pipeline (Section 3.2).

3.1 Dropping region proposal generation

While the SPP method of [11] (Section 2.2) accelerates R-CNN evaluation by orders of magnitude, it does not result in a comparable acceleration of the detector as a whole; in fact, proposal generation with SS is about ten times slower than SPP classification. Much faster proposal generators exist, but may not result in very accurate regions [22]. Here we propose to drop R(x) entirely and to use instead an image-independent list of candidate regions R0, using the CNN itself to regress better object locations a-posteriori.

Constructing R0 starts by studying the distribution of bounding boxes in a representative object detection benchmark, namely the PASCAL VOC 2007 data [6]. A box is defined by the tuple (rs, cs, re, ce) denoting the upper-left and lower-right corner coordinates (rs, cs) and (re, ce). Given an image of size H × W, define the normalised width and height as w = (ce − cs)/W and h = (re − rs)/H respectively; define also the scale s = √(wh) and the distance from the image centre |c| = ‖[(cs + ce)/2W − 0.5, (rs + re)/2H − 0.5]‖2.

The first column of Figure 2 shows the distribution of such parameters for the GT boxes in the PASCAL data. It is evident that boxes tend to appear close to the image centre and to fill the image. The statistics of SS regions differ substantially; in particular, the (s, |c|) histogram shows that SS boxes tend to be distributed much more uniformly in scale and space compared to the GT ones. If SS boxes are restricted to the ones that have an overlap of at least 0.5 with a GT bounding box, then the distributions are similar again, with a strong preference for centred and large boxes.

The fourth column shows the distribution of boxes generated by a sliding window (SW; [4]) object detector. For an "exhaustive" enumeration of boxes at all locations, scales, and aspect ratios, there can be hundreds of thousands of boxes per image. Here we subsample this set to 7K in order to obtain a candidate set with a size comparable to SS. This was obtained by sampling the width of the bounding boxes as w = w0 2^l, l = 0, 0.5, ..., 4, where w0 ≈ 40 pixels is the width of the smallest bounding box considered in the SPP-CNN detector. Similarly, aspect ratios are sampled as 2^{−1, −0.75, ..., 1}. The distribution of boxes, visualised in the fourth column of Figure 2, is similar to SS and dissimilar from GT.

A simple modification of sliding window is to bias sampling to match the statistics of the GT bounding boxes. We do so by computing n K-means clusters from the collection of vectors (rs, cs, re, ce) obtained from the GT boxes in the PASCAL VOC training data. We call this set of boxes R0(n); the fifth column of Figure 2 shows that, as expected, the corresponding distribution matches nicely the one of the GT boxes. Section 4 shows empirically that, when combined with a CNN-based bounding box regressor, this proposal set results in a very competitive (and very fast) detector.
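As an illustration, constructing R0(n) reduces to a few lines: collect the ground-truth corner tuples, normalise them by the image size (our assumption, so that the resulting set is image-independent), and cluster them. The sketch below assumes scikit-learn's KMeans and illustrative names:

    import numpy as np
    from sklearn.cluster import KMeans

    def build_fixed_proposals(gt_boxes, image_sizes, n=3000):
        # gt_boxes: list of (rs, cs, re, ce); image_sizes: matching list of (H, W)
        X = np.array([(rs / H, cs / W, re / H, ce / W)
                      for (rs, cs, re, ce), (H, W) in zip(gt_boxes, image_sizes)])
        return KMeans(n_clusters=n).fit(X).cluster_centers_  # R0(n), normalised

    def proposals_for_image(R0, H, W):
        # the same n boxes are reused for every test image, rescaled to its size
        return R0 * np.array([H, W, H, W])

Since R0(n) is fixed once and for all, proposal generation costs essentially nothing at test time.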

3.2 Streamlined detection pipeline

This section proposes several simplifications to the R/SPP-CNN pipelines, complementary to dropping region proposal generation as done in Section 3.1. As a result of all these changes, the whole detector, including detection of multiple object classes and bounding box regression, reduces to evaluating a single CNN. Furthermore, the pipeline is straightforward to implement on GPU, and is sufficiently memory-efficient to process multiple images at once. In practice, this results in an extremely fast detector which still retains excellent performance.

Dropping the SVM. As discussed in Section 2.1, R-CNN involves training an SVM classifier for each target object class as well as fine-tuning the CNN features for all classes. An obvious question is whether SVM training is redundant and can be eliminated. Recall from Section 2.1 that fine-tuning learns a softmax predictor φsftmx on top of the R-CNN features φRCNN(x; R), whereas SVM training learns a linear predictor φSVM on top of the same features. In the first case, Pc = P(c|x, R) = [φsftmx ◦ φRCNN(x; R)]c is an estimate of the class posterior for region R; in the second case, Sc = [φSVM ◦ φRCNN(x; R)]c is a score that discriminates class c from any other class (in both cases background is treated as one of the classes). As verified in Section 4 and Table 1, Pc works poorly as a score for an object detector; however, and somewhat surprisingly, using as score the ratio S′c = Pc/P0 results in performance nearly as good as using an SVM. Further, note that φsftmx can be decomposed into C + 1 linear predictors ⟨wc, φRCNN⟩ + bc followed by exponentiation and normalisation; hence the score S′c reduces to the expression S′c = exp(⟨wc − w0, φRCNN⟩ + bc − b0).

    Evaluation method            Single scale    Multi scale
    BB regression                no      yes     no      yes
    Sc (SVM)                     54.0    58.6    56.3    59.7
    Pc (softmax)                 27.9    34.5    30.1    38.1
    Pc/P0 (modified softmax)     54.0    58.0    55.3    58.4

Table 1: Evaluation of SPP-CNN with and without the SVM classifier. The table reports mAP on the PASCAL VOC 2007 test set for the single scale and multi scale detector, with or without bounding box regression. Different rows compare the different bounding box scoring mechanisms of Section 3.2: the SVM scores Sc, the softmax posterior probability scores Pc, and the modified softmax scores Pc/P0.

Integrating SPP and bounding box regression. While in the original implementation of SPP [11] the pooling mechanism is external to the CNN software, we implement it directly as a layer SPP(·; R1, ..., Rn). This layer takes as input a tensor representing the convolutional features φcnv(x) ∈ R^{H×W×D} and outputs n feature fields of size h × w × D, one for each region R1, ..., Rn passed as input. These fields can be stacked in a 4D output tensor, which is supported by all common CNN software. Given a dual CPU/GPU implementation of the layer, SPP integrates seamlessly with most CNN packages, with substantial benefits in speed and flexibility, including the possibility of training with backpropagation through it.

Similar to SPP, bounding box regression is easily integrated as a bank of filters (Qc, bc), c = 1, ..., C, running on top of the convolutional features φcnv(x). This is cheap enough that it can be done in parallel for all the object classes.

Scale-augmented training, single scale evaluation. While SPP is fast, one of the most time consuming steps is to evaluate features at multiple scales [11]. However, the authors of [11] also indicate that restricting evaluation to a single scale has only a marginal effect on performance. Here, we maintain the idea of evaluating the detector at test time by processing each image at a single scale. However, this requires the CNN to explicitly learn scale invariance, which is achieved by fine-tuning the CNN using randomly rescaled versions of the training data.
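Returning to the "Dropping the SVM" paragraph above, the modified score needs no extra machinery at test time: since φsftmx is C + 1 linear predictors followed by exponentiation and normalisation, S′c = Pc/P0 amounts to subtracting the background logit. A minimal sketch (ours, with illustrative names):

    import numpy as np

    def modified_softmax_scores(phi, W, b):
        # phi: region descriptor phi_RCNN(x; R); W: (C+1) x dim softmax weights,
        # b: (C+1,) biases, with index 0 reserved for the background class
        logits = W @ phi + b
        # S'_c = P_c / P_0 = exp(<w_c - w_0, phi> + b_c - b_0)
        return np.exp(logits[1:] - logits[0])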

method mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
SVM MS 59.68 66.8 75.8 55.5 43.1 38.1 66.6 73.8 70.9 29.2 71.4 58.6 65.5 76.2 73.6 57.4 29.9 60.1 48.4 66.0 66.8
SVM SS 58.60 66.1 76.0 54.9 38.6 32.4 66.3 72.8 69.3 30.2 67.7 63.7 66.2 72.5 71.2 56.4 27.3 59.5 50.4 65.3 65.2
FC8 MS 58.38 69.2 75.2 53.7 40.0 33.0 67.2 71.3 71.6 26.9 69.6 60.3 64.5 74.0 73.4 55.6 25.3 60.4 47.0 64.9 64.4
FC8 SS 57.99 67.0 75.0 53.3 37.7 28.3 69.2 71.1 69.7 29.7 69.1 62.9 64.0 72.7 71.0 56.1 25.6 57.7 50.7 66.5 62.3
FC8 C3k MS 53.41 55.8 73.1 47.5 36.5 17.8 69.1 55.2 73.1 24.4 49.3 63.9 67.8 76.8 71.1 48.7 27.6 42.6 43.4 70.1 54.5
FC8 C3k SS 53.52 55.8 73.3 47.3 37.3 17.6 69.3 55.3 73.2 24.0 49.0 63.3 68.2 76.5 71.3 48.2 27.1 43.8 45.1 70.2 54.6

Table 2: Comparison of different variants of the SPP-CNN detector. First group of rows: original SPP-CNN
using Multi Scale (MS) or Single Scale (SS) detection. Second group: the same experiment, but dropping
the SVM and using the modified softmax scores of Section 3.2. Third group: SPP-CNN without region
proposal generation, but using a fixed set of 3K candidate bounding boxes as explained in Section 3.1.
4 Experiments

This section evaluates the changes to R-CNN and SPP-CNN proposed in Section 3. All experiments use the Zeiler and Fergus (ZF) small CNN [21], as this is the same network used by [11], which introduced SPP-CNN. While more recent networks such as the very deep models of Simonyan and Zisserman [16] are likely to perform better, this choice allows a direct comparison with [11]. The detector itself is trained and evaluated on the PASCAL VOC 2007 data [6], as this is a default benchmark for object detection and is used in [11] as well.

[Figure 3 plots mAP (0.4 to 0.6) against the number of boxes per image (1 to 7 ×10³) for SS, Cx (clusters), and SW (sliding windows), each with and without bounding box regression (-BBR).]

Figure 3: mAP on the PASCAL VOC 2007 test data as a function of the number of candidate boxes per image, proposal generation method, and using or not bounding box regression.

Dropping the SVM. The first experiment evaluates the performance of the SPP-CNN detector with or without the linear SVM classifier, comparing the bounding box scores Sc (SVM), Pc (softmax), and S′c (modified softmax) of Section 3.2. As can be seen in Table 1 and Table 2, the best performing method is SPP-CNN evaluated at multiple scales, resulting in 59.7% mAP on the PASCAL VOC 2007 test data (this number matches the one reported in [11], validating our implementation). Removing the SVM and using the CNN softmax scores directly performs very poorly, with a drop of 21.6% mAP points. However, adjusting the softmax scores using the simple formula Pc/P0 restores the performance almost entirely, back to 58.4% mAP. While there is still a small 1.3% drop in mAP accuracy compared to using the SVM, removing the latter dramatically simplifies the detector pipeline, resulting in particular in significantly faster training, as it removes the need of preparing and caching data for the SVM (as well as learning it).

Multi-scale evaluation. The second set of experiments assesses the importance of performing multi-scale evaluation of the detector. Results are reported once more in Tables 1 and 2. Once more, multi-scale detection is the best performing method, with performance up to 59.7% mAP. However, single scale testing is very close to this level of performance, at 58.6%, with a drop of just 1.1% mAP points. Just like when removing the SVM, the resulting simplification and, in this case, detection speedup make this drop in accuracy more than tolerable. In particular, testing at a single scale accelerates detection roughly fivefold.

Dropping region proposal generation. The next experiment evaluates replacing the SS region proposals RSS(x) with the fixed proposals R0(n) as suggested in Section 3.1. Table 2 shows the detection performance for n = 3,000, a number of candidates comparable with the 2,000 extracted by selective search. While there is a drop in performance compared to using SS, this is small (59.68% vs 53.41%, i.e. a 6.1% reduction), which is surprising since bounding box proposals are now oblivious of the image content.

Figure 3 looks at these results in greater detail. Three bounding box generation methods are compared: selective search, sliding windows, and clustering (see also Section 3.1), with or without bounding box regression. Neither clustering nor sliding windows results in an accurate detector: even if the number of candidate boxes is increased substantially (up to n = 7K), performance saturates at around 46% mAP.

    Scale   Impl.   SelS       Prep.   Move    Conv    SPP     FC     BBR    Σ − SelS
    MS      SPP     1.98·10³   23.3    67.5    186.6   211.1   91.0   39.8   619.2 ± 118.0
    MS      OURS    1.98·10³   23.7    17.7    179.4   38.9    87.9   9.8    357.4 ± 34.3
    SS      SPP     1.98·10³   9.0     47.7    31.1    207.1   90.4   39.9   425.1 ± 117.0
    SS      OURS    1.98·10³   9.0     3.0     30.3    19.4    88.0   9.8    159.5 ± 31.5

Table 3: Timing (in ms) of the original SPP-CNN and our streamlined full-GPU implementation, evaluated multi-scale (MS) and single-scale (SS), broken down into selective search (SelS), preprocessing (Prep: image loading and scaling), CPU/GPU data transfer (Move), convolutional layers (Conv), spatial pyramid pooling (SPP), fully connected layers (FC), and bounding box regression (BBR). The last column, Σ − SelS, is the total time excluding selective search.
This is much poorer than the ∼56% achieved by selective search. Bounding box regression improves selective search by about 3% mAP, up to ∼59%, but it has a much more significant effect on the other two methods, improving performance by about 10% mAP. Note that clustering with 3K candidates performs as well as sliding windows with 7K.

We can draw several interesting conclusions. First, for the same low number of candidate boxes, selective search is much better than any fixed proposal set; less expected is that performance does not increase even with 3× more candidates, indicating that the CNN is unable to tell which bounding boxes wrap objects better, even when tight boxes are contained in the shortlist of proposals. This can be explained by the high degree of geometric invariance in the CNN. At the same time, the CNN-based bounding box regressor can make loose bounding boxes significantly tighter, which requires geometric information to be preserved by the CNN. This apparent contradiction can be explained by noting that bounding box classification is built on top of the FC layers of the CNN, whereas bounding box regression is built on the convolutional ones. Evidently, geometric information is removed in the FC layers, but is still contained in the convolutional layers (see also Figure 1).

Detection speed. The last experiment (Table 3) evaluates the detection speed of SPP-CNN (which is already orders of magnitude faster than R-CNN) and of our streamlined implementation. Not counting SS proposal generation, the streamlined implementation is between 1.7× (multi-scale) and 2.6× (single-scale) faster than the original SPP, with the most significant gain emerging from the integrated SPP and bounding box regression implementation on GPU and the consequent reduction of data transfer cost between CPU and GPU.

As suggested before, however, the bottleneck is selective search. Compared to the slowest MS SPP-CNN implementation of [11], using all the simplifications of Section 3, including removing selective search, results in an overall detection speedup of more than 16×, from about 2.5s per image down to 160ms (this at a reduction of about 6% mAP points).

5 Conclusions

Our most significant finding is that current CNNs do contain sufficient geometric information for accurate object detection, although in the convolutional rather than fully connected layers. This finding opens the possibility of building state-of-the-art object detectors that rely exclusively on CNNs, removing region proposal generation schemes such as selective search, and resulting in integrated, simpler, and faster detectors.

Our current implementation of a proposal-free detector is already much faster than SPP-CNN, and very close, but not quite as good, in terms of mAP. However, we have only begun exploring the design possibilities and we believe that it is a matter of time before the gap closes entirely. In particular, our current scheme is likely to miss small objects in the image. These may be retained by alternative methods to search the object pose space, such as, for example, Hough voting on top of convolutional features, which would maintain the computational advantage and elegance of integrated and streamlined CNN detectors while allowing a thorough search of the image for object occurrences.

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In Proc. CVPR, 2010.
[2] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In Proc. CVPR, 2014.
[3] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 2012.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. CVPR, 2005.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
[6] M. Everingham, A. Zisserman, C. Williams, and L. Van Gool. The PASCAL visual object classes challenge 2007 (VOC2007) results. Technical report, Pascal Challenge, 2007.
[7] P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proc. CVPR, 2008.
[8] R. Girshick. Fast R-CNN. arXiv:1504.08083, 2015.
[9] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. CVPR, 2014.
[10] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Proc. ECCV, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Proc. ECCV, 2014.
[12] X. He, R. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In Proc. CVPR, 2004.
[13] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? arXiv:1502.05082, 2015.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, 2006.
[16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[17] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In Proc. ICCV, 2005.
[18] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
[19] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In Proc. ICCV, 2009.
[20] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. CVPR, 2001.
[21] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. ECCV, 2014.
[22] Q. Zhao, Z. Liu, and B. Yin. Cracking BING and beyond. In Proc. BMVC, 2014.
[23] C. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Proc. ECCV, 2014.
