OTA: Optimal Transport Assignment For Object Detection
Zheng Ge1,2, Songtao Liu2*, Zeming Li2, Osamu Yoshie1, Jian Sun2
1 Waseda University, 2 Megvii Technology
jokerzz@fuji.waseda.jp; liusongtao@megvii.com; lizeming@megvii.com;
yoshie@waseda.jp; sunjian@megvii.com
arXiv:2103.14259v1 [cs.CV] 26 Mar 2021
Abstract
…anchors to any gt (or background) may introduce harmful gradients w.r.t. other gts. Hence the assignment for ambiguous anchors is non-trivial and requires information beyond the local view. A better assigning strategy should therefore abandon the convention of pursuing the optimal assignment for each gt independently and instead seek the global optimum, in other words, the best high-confidence assignment for all gts in an image jointly.

DeTR [3] is the first work that attempts to consider label assignment from a global view. It replaces the detection head with transformer layers [39] and performs one-to-one assignment via the Hungarian algorithm, which matches only one query to each gt at a global minimum loss. However, for CNN-based detectors, the networks often produce correlated scores in the neighboring regions around an object, so each gt is assigned to many anchors (i.e., one-to-many), which also benefits training efficiency. In this one-to-many setting, assigning labels from a global view remains unexplored.

To achieve a globally optimal assigning result under the one-to-many situation, we propose to formulate label assignment as an Optimal Transport (OT) problem – a special form of Linear Programming (LP) in Optimization Theory. Specifically, we define each gt as a supplier who supplies a certain number of labels, and define each anchor as a demander who needs one unit of label. If an anchor receives a sufficient amount of positive label from a certain gt, this anchor becomes a positive anchor for that gt. In this context, the number of positive labels each gt supplies can be interpreted as "how many positive anchors that gt needs for better convergence during the training process". The unit transportation cost between each anchor-gt pair is defined as the weighted summation of their pair-wise cls and reg losses. Furthermore, as being negative should also be considered for each anchor, we introduce another supplier – background – who supplies negative labels to make up the rest of the labels in need. The cost between background and a certain anchor is defined as their pair-wise classification loss only. After this formulation, finding the best assignment is converted into solving for the optimal transport plan, which can be solved quickly and efficiently by the off-the-shelf Sinkhorn-Knopp Iteration [5]. We name this assigning strategy Optimal Transport Assignment (OTA).

Comprehensive experiments are carried out on the MS COCO [22] benchmark, and the significant improvements from OTA demonstrate its advantage. OTA also achieves SOTA performance among one-stage detectors on a crowded pedestrian detection dataset named CrowdHuman [35], showing OTA's generalization ability across detection benchmarks.

2. Related Work

2.1. Fixed Label Assignment

Determining which gt (or background) each anchor should be assigned to is a necessary procedure before training object detectors. Anchor-based detectors usually adopt IoU at a certain threshold as the assigning criterion. For example, the RPN in Faster R-CNN [33] uses 0.7 and 0.3 as the positive and negative thresholds, respectively. When training the R-CNN module, the IoU threshold for pos/neg division is changed to 0.5. IoU-based label assignment has proved effective and was soon adopted by many Faster R-CNN variants [2, 12, 20, 42, 26, 49, 37], as well as by many one-stage detectors [31, 32, 25, 27, 23, 21].

Recently, anchor-free detectors have drawn much attention because of their conciseness and high computational efficiency. Without anchor boxes, FCOS [38], FoveaBox [17] and their precursors [30, 14, 46] directly assign anchor points around the centers of objects as positive samples, showing promising detection performance. Another stream of anchor-free detectors [18, 8, 50, 45, 4] views each object as a single key-point or a set of key-points. They have characteristics distinct from other detectors and hence will not be further discussed in this paper.

Although the detectors mentioned above differ in many aspects, for label assignment they all adopt a single fixed assigning criterion (e.g., a fixed region of the center area or a fixed IoU threshold) for objects of various sizes, shapes, and categories, which may lead to sub-optimal assigning results.

2.2. Dynamic Label Assignment

Many recent works try to make the label assigning procedure more adaptive, aiming to further improve detection performance. Instead of using pre-defined anchors, GuidedAnchoring [40] generates anchors with an anchor-free mechanism to better fit the distribution of various objects. MetaAnchor [44] proposes an anchor generation function to learn dynamic anchors from arbitrary customized prior boxes. NoisyAnchors [19] proposes soft-label and anchor re-weighting mechanisms based on classification and localization losses. FreeAnchor [48] constructs top-k anchor candidates for each gt based on IoU and then proposes a detection-customized likelihood to perform pos/neg division within each candidate set. ATSS [47] proposes an adaptive sample selection strategy that adopts the mean+std of the IoU values of a set of closest anchors for each gt as the pos/neg threshold. PAA [16] assumes that the joint loss distributions of positive and negative samples follow Gaussian distributions. Hence it uses a GMM to fit the distributions of positive and negative samples, and then uses the center of the positive-sample distribution as the pos/neg division boundary. AutoAssign [51] tackles label
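The supply/demand bookkeeping described above can be made concrete with a tiny numeric sketch (assuming, for illustration only, a fixed number k of positive labels per gt; the counts below are hypothetical and not from the paper):

```python
# Supply/demand bookkeeping for the OT view of label assignment.
# Hypothetical sizes: m gts, n anchors, and a fixed k positives per gt.
m, n, k = 3, 100, 10

supplies = [k] * m            # each gt supplies k units of positive label
supplies.append(n - m * k)    # background supplies the remaining negatives
demands = [1] * n             # every anchor demands exactly one unit of label

# A valid OT problem requires total supply to equal total demand.
assert sum(supplies) == sum(demands) == n
print(supplies)               # [10, 10, 10, 70]
```

With 3 gts and 100 anchors, the three gts supply 10 positive labels each and the background makes up the remaining 70 negatives, so the supply side exactly matches the 100 one-unit demands.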
[Figure 2: the cost matrix (rows: GT0, GT1, GT2, …, BG; columns: anchors across FPN levels) is converted by the Sinkhorn-Knopp Iteration into the optimal assigning plan π* and the final assigning results.]

Figure 2. An illustration of Optimal Transport Assignment. The Cost Matrix is composed of the pair-wise cls and reg losses between each anchor-gt pair. The goal of finding the best label assignment is converted into solving for the best transporting plan, which transports the labels from suppliers (i.e., GT and BG) to demanders (i.e., anchors) at a minimal transportation cost via the Sinkhorn-Knopp Iteration.
assignment in a fully data-driven manner by automatically determining positives/negatives in both the spatial and scale dimensions.

These methods explore the optimal assigning strategy for individual objects while failing to consider context information from a global perspective. DeTR [3] examines the idea of globally optimal matching, but the Hungarian algorithm it adopts can only work in a one-to-one assignment manner. So far, for CNN-based detectors in one-to-many scenarios, a globally optimal assigning strategy remains uncharted.

…according to which all goods from the suppliers can be transported to the demanders at a minimal transportation cost:

$$\min_{\pi} \; \sum_{i=1}^{m}\sum_{j=1}^{n} c_{ij}\,\pi_{ij}$$
$$\text{s.t.}\quad \sum_{i=1}^{m}\pi_{ij}=d_j,\qquad \sum_{j=1}^{n}\pi_{ij}=s_i,\qquad \sum_{i=1}^{m}s_i=\sum_{j=1}^{n}d_j,\qquad \pi_{ij}\ge 0. \tag{1}$$
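The LP above is typically solved in its entropy-regularized form. A minimal NumPy sketch of the Sinkhorn-Knopp Iteration [5] on a toy cost matrix follows; the problem sizes, random costs, and the regularization weight `eps` are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sinkhorn(cost, s, d, eps=1.0, n_iters=200):
    """Entropy-regularized Sinkhorn-Knopp iteration (after Cuturi, 2013).

    cost: (m, n) cost matrix; s: supplies (length m); d: demands (length n).
    Returns a transport plan whose row sums match s and column sums match d.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones(len(s))
    for _ in range(n_iters):
        v = d / (K.T @ u)            # scale columns toward the demand marginal
        u = s / (K @ v)              # scale rows toward the supply marginal
    return u[:, None] * K * v[None, :]

# Toy OT problem: 3 suppliers (e.g., 2 gts + background) and 6 anchors.
rng = np.random.default_rng(0)
cost = rng.uniform(0.0, 3.0, size=(3, 6))
s = np.array([2.0, 1.0, 3.0])        # supplies sum to 6
d = np.ones(6)                       # each anchor demands one unit of label
pi = sinkhorn(cost, s, d)
print(pi.shape)                      # (3, 6)
```

After a couple hundred iterations both marginals of the plan match the supply and demand vectors to numerical precision, which is why the assignment can be read off the plan directly.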
unit of positive label from $gt_i$ to anchor $a_j$ is defined as the weighted summation of their cls and reg losses:

$$c^{fg}_{ij} = L_{cls}\big(P^{cls}_j(\theta),\,G^{cls}_i\big) + \alpha\, L_{reg}\big(P^{box}_j(\theta),\,G^{box}_i\big), \tag{2}$$

where θ stands for the model's parameters. $P^{cls}_j$ and $P^{box}_j$ denote the predicted cls score and bounding box for $a_j$. $G^{cls}_i$ and $G^{box}_i$ denote the ground-truth class and bounding box for $gt_i$. $L_{cls}$ and $L_{reg}$ stand for the cross-entropy loss and the IoU Loss [46]. One can also replace these two losses with Focal Loss [21] and GIoU Loss [34]/SmoothL1 Loss [11]. α is the balancing coefficient.

Besides positive assigning, a large set of anchors is treated as negative samples during training. As the optimal transportation involves all anchors, we introduce another supplier – background – who only provides negative labels. In a standard OT problem, the total supply must be equal to the total demand. We thus set the number of negative labels that background can supply as n − m × k. The cost for transporting one unit of negative label from background to $a_j$ is defined as:

$$c^{bg}_{j} = L_{cls}\big(P^{cls}_j(\theta),\,\varnothing\big), \tag{3}$$

where ∅ means the background class. Concatenating this $c^{bg} \in \mathbb{R}^{1\times n}$ to the last row of $c^{fg} \in \mathbb{R}^{m\times n}$, we get the complete form of the cost matrix $c \in \mathbb{R}^{(m+1)\times n}$. The supplying vector s is correspondingly updated as:

$$s_i = \begin{cases} k, & \text{if } i \le m \\ n - m\times k, & \text{if } i = m+1. \end{cases} \tag{4}$$

As we already have the cost matrix c, the supplying vector $s \in \mathbb{R}^{m+1}$ and the demanding vector $d \in \mathbb{R}^{n}$, the optimal transportation plan $\pi^* \in \mathbb{R}^{(m+1)\times n}$ can be obtained by solving this OT problem via the off-the-shelf Sinkhorn-Knopp Iteration [5]. After getting π*, one can decode the corresponding label assigning solution by assigning each anchor to the supplier that transports the largest amount of labels to it. The subsequent processes (e.g., calculating losses based on the assigning result, back-propagation) are exactly the same as in FCOS [38] and ATSS [47]. Note that the optimization process of the OT problem only contains matrix multiplications, which can be accelerated by GPU devices; hence OTA increases the total training time by less than 20% and is totally cost-free in the testing phase.

Algorithm 1: Optimal Transport Assignment (OTA)
Input:
  I: an input image
  A: a set of anchors
  G: the gt annotations for objects in image I
  γ: the regularization intensity in the Sinkhorn-Knopp Iteration
  T: the number of iterations in the Sinkhorn-Knopp Iteration
  α: the balancing coefficient in Eq. 2
Output:
  π*: the optimal assigning plan
 1: m ← |G|, n ← |A|
 2: P^cls, P^box ← Forward(I, A)
 3: s_i (i = 1, 2, ..., m) ← Dynamic k Estimation
 4: s_{m+1} ← n − Σ_{i=1}^{m} s_i
 5: d_j (j = 1, 2, ..., n) ← OnesInit
 6: pairwise cls cost: c^{cls}_{ij} ← FocalLoss(P^{cls}_j, G^{cls}_i)
 7: pairwise reg cost: c^{reg}_{ij} ← IoULoss(P^{box}_j, G^{box}_i)
 8: pairwise Center Prior cost: c^{cp}_{ij} ← CenterPrior(A_j, G^{box}_i)
 9: bg cls cost: c^{cls}_{bg} ← FocalLoss(P^{cls}_j, ∅)
10: fg cost: c^{fg} ← c^{cls} + α c^{reg} + c^{cp}
11: compute the final cost matrix c by concatenating c^{cls}_{bg} to the last row of c^{fg}
12: v^0, u^0 ← OnesInit
13: for t = 0 to T do:
14:   u^{t+1}, v^{t+1} ← SinkhornIter(c, u^t, v^t, s, d)
15: compute the optimal assigning plan π* according to Eq. 11
16: return π*

3.3. Advanced Designs

Center Prior. Previous works [47, 16, 48] only select positive anchors from a limited center region of objects, called the Center Prior. This is because they suffer from either a large number of ambiguous anchors or poor statistics in the subsequent process. Instead of relying on statistical characteristics, our OTA is based on a global optimization methodology and is thus naturally resistant to these two issues. Theoretically, OTA can assign any anchor within the region of a gt's box as a positive sample. However, for general detection datasets like COCO, we find that the Center Prior still benefits the training of OTA. Forcing detectors to focus on potential positive areas (i.e., center areas) can help stabilize the training process, especially in the early stage of training, and leads to a better final performance. Hence, we impose a Center Prior on the cost matrix. For each gt, we select the r² closest anchors from each FPN level according to the center distance between anchors and gts.² For anchors not in the r²-closest list, their corresponding entries in the cost matrix c are subject to an additional constant cost to reduce the possibility that they are assigned as positive samples during the training stage. In Sec. 4, we will demonstrate that although OTA adopts a certain degree of Center Prior like other works [38, 47, 48] do, OTA consistently outperforms its counterparts by a large margin when r is set to a large value (i.e., a large number of

² For anchor-based methods, the distances are measured between the geometric centers of anchors and gts.
potential positive anchors as well as more ambiguous anchors).

Dynamic k Estimation. Intuitively, the appropriate number of positive anchors for each gt (i.e., s_i in Sec. 3.1) should differ and depend on many factors such as the object's size, scale, and occlusion condition. As it is hard to directly model a mapping function from these factors to the number of positive anchors, we propose a simple but effective method to roughly estimate the appropriate number of positive anchors for each gt based on the IoU values between the predicted bounding boxes and that gt. Specifically, for each gt, we select its top-q predictions according to IoU values. These IoU values are summed up to represent this gt's estimated number of positive anchors. We name this method Dynamic k Estimation. This estimation is based on the following intuition: the appropriate number of positive anchors for a certain gt should be positively correlated with the number of anchors that regress this gt well. In Sec. 4, we present a detailed comparison between the fixed k and Dynamic k Estimation strategies.

A toy visualization of OTA is shown in Fig. 2. We also describe OTA's complete procedure, including the Center Prior and Dynamic k Estimation, in Algorithm 1.

4. Experiments

In this section, we conduct extensive experiments on MS COCO 2017 [22], which contains about 118k, 5k and 20k images in the train, val, and test-dev sets, respectively. For ablation studies, we train detectors on the train set and report performance on the val set. Comparisons with other methods are conducted on the test-dev set. We also compare OTA with other methods on the CrowdHuman [35] validation set to demonstrate the superiority of OTA in crowd scenarios.

4.1. Implementation Details

If not specified, we use ResNet-50 [13] pre-trained on ImageNet [6] with FPN [20] as our default backbone. Most experiments are trained for 90k iterations, denoted as "1×". The initial learning rate is 0.01 and is decayed by a factor of 10 after 60k and 80k iterations. The mini-batch size is set to 16. Following common practice, the model is trained with SGD [1] on 8 GPUs.

OTA can be adopted in both anchor-based and anchor-free detectors; the following experiments are mainly conducted on FCOS [38] because of its simplicity. We adopt Focal Loss and IoU Loss as the L_cls and L_reg that make up the cost matrix. α in Eq. 2 is set to 1.5. For back-propagation, the regression loss is replaced by GIoU Loss and is re-weighted by a factor of 2. The IoU Branch was first introduced in YOLOv1 [30] and proved effective in modern one-stage object detectors by PAA [16]. We also adopt the IoU Branch as a default component in our experiments. The top-q in Sec. 3.3 is directly set to 20, as we find this parameter value consistently yields stable results in various situations.

| Method | Aux. Branch | Center | Dyn. k | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|
| FCOS | - | ✓ | | 38.3 | 57.1 | 41.3 |
| FCOS | CenterNess | ✓ | | 38.9 | 57.5 | 42.0 |
| FCOS | IoU | | | 38.8 | 57.7 | 41.8 |
| FCOS | IoU | ✓ | | 39.5 | 57.6 | 42.9 |
| OTA (FCOS) | - | ✓ | | 39.2 | 58.3 | 42.2 |
| OTA (FCOS) | IoU | | | 39.6 | 58.1 | 42.5 |
| OTA (FCOS) | IoU | ✓ | | 40.3 | 58.6 | 43.7 |
| OTA (FCOS) | IoU | ✓ | ✓ | 40.7 | 58.4 | 44.3 |
| OTA (RetinaNet) | IoU | ✓ | ✓ | 40.7 | 58.6 | 44.1 |

Table 1. Ablation studies on each component in OTA. "Center" stands for Center Prior and Center Sampling for OTA and FCOS, respectively. "Dyn. k" is the abbreviation of our proposed Dynamic k Estimation strategy.

4.2. Ablation Studies and Analysis

Effects of Individual Components. We verify the effectiveness of each component in our proposed method. For fair comparisons, all detectors' regression losses are multiplied by 2, which is a known trick to boost the AP at high IoU thresholds [28]. As seen in Table 1, when no auxiliary branch is adopted, OTA outperforms FCOS by 0.9% AP (39.2% vs. 38.3%). This gap almost remains the same after adding the IoU branch to both of them (39.5% vs. 40.3% and 38.8% vs. 39.6%, with and without Center Prior, respectively). Finally, Dynamic k pushes the AP to a new state of the art, 40.7%. Throughout the paper, we emphasize that OTA can be applied to both anchor-based and anchor-free detectors. Hence we also apply OTA to RetinaNet [21] with only one square anchor per location across the feature maps. As shown in Table 1, the AP values of OTA-FCOS and OTA-RetinaNet are exactly the same, demonstrating OTA's applicability to both anchor-based and anchor-free detectors.

Effects of r. The radius r of the Center Prior controls the number of candidate anchors for each gt. With a small r, only anchors near an object's center can be assigned as positives, helping the optimization process focus on regions that are more likely to be informative. As r increases, the number of candidates grows quadratically, leading to potential instability in the optimization process. For example, when r is set to 3, 5 or 7, the corresponding numbers of candidate anchors are 45, 125 and 245,³ respectively. We study the behaviors of ATSS [47],

³ The total number of potential positive anchors equals r² × (number of FPN levels).
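Dynamic k Estimation as described above can be sketched in a few lines; the rounding to an integer and the floor of one positive per gt are illustrative assumptions (the text only states that the top-q IoUs are summed), and the IoU matrix is random:

```python
import numpy as np

def dynamic_k(ious, q=20, min_k=1):
    """Estimate s_i per gt by summing the top-q IoUs between that gt and
    the predicted boxes. The int rounding and the minimum of one positive
    are assumptions made here for a runnable sketch."""
    topq = np.sort(ious, axis=1)[:, -q:]     # top-q IoU values per gt
    return np.maximum(topq.sum(axis=1).astype(int), min_k)

# Hypothetical IoU matrix: 2 gts x 30 predicted boxes.
rng = np.random.default_rng(0)
ious = rng.uniform(0.0, 0.6, size=(2, 30))
ks = dynamic_k(ious)
print(ks.shape)                              # (2,)
```

A gt with many well-regressed predictions accumulates a larger top-q IoU sum and therefore receives a larger supply s_i, matching the intuition stated above.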
| Method | r | Namb. | AP | AP50 | AP75 |
|---|---|---|---|---|---|
| ATSS [47] | 3 | 2.1 | 39.4 | 57.5 | 42.7 |
| ATSS [47] | 5 | 15.9 | 38.0 | 56.7 | 40.4 |
| ATSS [47] | 7 | 36.3 | 37.2 | 55.8 | 39.8 |
| PAA [16] | 3 | 0.5 | 40.3 | 58.9 | 43.4 |
| PAA [16] | 5 | 0.8 | 40.1 | 58.4 | 43.4 |
| PAA [16] | 7 | 1.2 | 39.5 | 57.5 | 42.4 |
| OTA | 3 | 0.2 | 40.6 | 58.7 | 44.1 |
| OTA | 5 | 0.2 | 40.7 | 58.4 | 44.3 |
| OTA | 7 | 0.3 | 40.4 | 58.3 | 43.6 |

Table 2. Performance of different label assigning strategies under different numbers of anchor candidates. "Namb." denotes the average number of ambiguous anchors per image, calculated on the COCO train set.
gt supply. This value also represents how many anchors every gt needs for better convergence. A naive way is to set k to a constant value for all gts. We try different values of k from 1 to 20. As seen in Table 4, among all values, k=10 and k=12 achieve the best performance. As k increases from 10 to 20, the possibility that an anchor is suitable as a positive sample for two close targets at the same time also increases, but there is no obvious performance drop (0.2%) according to Table 4, which proves OTA's superiority in handling potential ambiguity. When k=1, OTA becomes a one-to-one assigning strategy, the same as in DeTR. The poor performance tells us that achieving competitive performance via one-to-one assignment under the 1× schedule remains challenging, unless an auxiliary one-to-many supervision is added [41].

| k | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|
| 1 | 36.5 | 55.4 | 38.8 | 21.4 | 39.7 | 46.2 |
| 5 | 39.5 | 58.1 | 42.7 | 23.1 | 43.0 | 50.6 |
| 8 | 39.8 | 58.4 | 42.9 | 22.7 | 43.6 | 51.5 |
| 10 | 40.3 | 58.6 | 43.7 | 23.4 | 44.2 | 52.1 |
| 12 | 40.3 | 58.6 | 43.6 | 23.2 | 44.2 | 51.9 |
| 15 | 40.2 | 58.4 | 43.6 | 23.2 | 44.1 | 51.9 |
| 20 | 40.1 | 58.2 | 43.6 | 23.5 | 44.0 | 52.8 |
| Dyn. k | 40.7 | 58.4 | 44.3 | 23.2 | 45.0 | 53.6 |

Table 4. Analysis of different values of k and the Dynamic k Estimation strategy on the COCO val set.

The fixed-k strategy assumes every gt has the same number of appropriate positive anchors. However, we believe this number should vary per gt and may be affected by many factors such as the object's size, spatial attitude, and occlusion condition. Hence we adopt the Dynamic k Estimation proposed in Sec. 3.3 and compare its performance to the fixed-k strategy. Results in Table 4 show that Dynamic k surpasses the best fixed-k performance by 0.4% AP, validating our point and the effectiveness of the Dynamic k Estimation strategy.

4.3. Comparison with State-of-the-art Methods

We compare our final models with other state-of-the-art one-stage detectors on MS COCO test-dev. Following previous works [21, 38], we randomly scale the shorter side of images in the range from 640 to 800. Besides, we double the total number of iterations to 180k, with the learning-rate change points scaled proportionally. Other settings are consistent with [21, 38].

As shown in Table 5, our method with ResNet-101-FPN achieves 45.3% AP, outperforming all other methods with the same backbone, including ATSS (43.6% AP), AutoAssign (44.5% AP) and PAA (44.6% AP). Note that for PAA, we remove the score voting procedure for a fair comparison between label assigning strategies. With ResNeXt-64x4d-101-FPN [43], the performance of OTA is further improved to 47.0% AP. To demonstrate the compatibility of our method with other advanced techniques in object detection, we add Deformable Convolutional Networks (DCN) [54] to the ResNeXt backbones as well as to the last convolution layer in the detection head. This improves our model's performance from 47.0% AP to 49.2% AP. Finally, with the multi-scale testing technique, our best model achieves 51.5% AP.

4.4. Experiments on CrowdHuman

Object detection in crowded scenarios has drawn more and more attention [24, 15, 9, 10]. Compared to datasets designed for general object detection like COCO, ambiguity happens much more frequently in crowded datasets. Hence, to demonstrate OTA's advantage in handling ambiguous anchors, we conduct experiments on a crowded dataset – CrowdHuman [35]. CrowdHuman contains 15000, 4370, and 5000 images in the training, validation, and test sets, respectively, with an average of 22.6 persons per image. For all experiments, we train the detectors for 30 epochs (i.e., a 2.5× schedule) for better convergence. The NMS threshold is set to 0.6. We adopt ResNet-50 [13] as the default backbone. Other settings are the same as in our COCO experiments. For evaluation, we follow the standard Caltech [7] evaluation metric – MR, which stands for the Log-Average Miss Rate over false positives per image (FPPI) ranging in [10⁻², 10⁰]. AP and Recall are also reported for reference. All evaluation results are reported on the CrowdHuman val subset.

As shown in Table 6, RetinaNet and FCOS only achieve 58.8% and 55.0% MR respectively, far worse than two-stage detectors like Faster R-CNN (with FPN), revealing the dilemma of one-stage detectors in crowd scenarios. Starting from FreeAnchor, the performance of one-stage detectors gradually improves with dynamic label assigning strategies. ATSS achieves 49.5% MR, which is very close to the performance of Faster R-CNN (48.7% MR). The recently proposed LLA [10] leverages loss-aware label assignment, which is similar to OTA, and achieves 47.9% MR. However, our OTA takes a step forward by introducing global information into the label assignment, boosting MR to 46.6%. The AP and Recall of OTA also surpass other existing one-stage detectors by a clear margin.

Although PAA achieves performance competitive with OTA on COCO, it struggles on CrowdHuman. We conjecture that PAA needs clear pos/neg decision boundaries to help the GMM learn better clusters. But in crowded scenarios, such clear boundaries may not exist because potential negative samples usually cover a substantial amount of foreground area, resulting in PAA's poor performance. Also, PAA performs per-gt clustering, which heavily increases the training time on crowded datasets like CrowdHuman. Compared to PAA, OTA still shows promising
| Method | Iteration | Backbone | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|---|---|
| RetinaNet [21] | 135k | ResNet-101 | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 |
| FCOS [38] | 180k | ResNet-101 | 41.5 | 60.7 | 45.0 | 24.4 | 44.8 | 51.6 |
| NoisyAnchor [19] | 180k | ResNet-101 | 41.8 | 61.1 | 44.9 | 23.4 | 44.9 | 52.9 |
| FreeAnchor [48] | 180k | ResNet-101 | 43.1 | 62.2 | 46.4 | 24.5 | 46.1 | 54.8 |
| SAPD [52] | 180k | ResNet-101 | 43.5 | 63.6 | 46.5 | 24.9 | 46.8 | 54.6 |
| MAL [44] | 180k | ResNet-101 | 43.6 | 61.8 | 47.1 | 25.0 | 46.9 | 55.8 |
| ATSS [47] | 180k | ResNet-101 | 43.6 | 62.1 | 47.4 | 26.1 | 47.0 | 53.6 |
| AutoAssign [51] | 180k | ResNet-101 | 44.5 | 64.3 | 48.4 | 25.9 | 47.4 | 55.0 |
| PAA [16] | 180k | ResNet-101 | 44.6 | 63.3 | 48.4 | 26.4 | 48.5 | 56.0 |
| OTA (Ours) | 180k | ResNet-101 | 45.3 | 63.5 | 49.3 | 26.9 | 48.8 | 56.1 |
| FoveaBox [17] | 180k | ResNeXt-101 | 42.1 | 61.9 | 45.2 | 24.9 | 46.8 | 55.6 |
| FSAF [53] | 180k | ResNeXt-64x4d-101 | 42.9 | 63.8 | 46.3 | 26.6 | 46.2 | 52.7 |
| FCOS [38] | 180k | ResNeXt-64x4d-101 | 43.2 | 62.8 | 46.6 | 26.5 | 46.2 | 53.3 |
| NoisyAnchor [19] | 180k | ResNeXt-101 | 44.1 | 63.8 | 47.5 | 26.0 | 47.4 | 55.0 |
| FreeAnchor [48] | 180k | ResNeXt-64x4d-101 | 44.9 | 64.3 | 48.5 | 26.8 | 48.3 | 55.9 |
| SAPD [52] | 180k | ResNeXt-64x4d-101 | 45.4 | 65.6 | 48.9 | 27.3 | 48.7 | 56.8 |
| ATSS [47] | 180k | ResNeXt-64x4d-101 | 45.6 | 64.6 | 49.7 | 28.5 | 48.9 | 55.6 |
| MAL [44] | 180k | ResNeXt-101 | 45.9 | 65.4 | 49.7 | 27.8 | 49.1 | 57.8 |
| AutoAssign [51] | 180k | ResNeXt-64x4d-101 | 46.5 | 66.5 | 50.7 | 28.3 | 49.7 | 56.6 |
| PAA [16] | 180k | ResNeXt-64x4d-101 | 46.6 | 65.6 | 50.7 | 28.7 | 50.5 | 58.1 |
| OTA (Ours) | 180k | ResNeXt-64x4d-101 | 47.0 | 65.8 | 51.1 | 29.2 | 50.4 | 57.9 |
| SAPD [52] | 180k | ResNeXt-64x4d-101-DCN | 47.4 | 67.4 | 51.1 | 28.1 | 50.3 | 61.5 |
| ATSS [47] | 180k | ResNeXt-64x4d-101-DCN | 47.7 | 66.5 | 51.9 | 29.7 | 50.8 | 59.4 |
| AutoAssign [51] | 180k | ResNeXt-64x4d-101-DCN | 48.3 | 67.4 | 52.7 | 29.2 | 51.0 | 60.3 |
| PAA [16] | 180k | ResNeXt-64x4d-101-DCN | 48.6 | 67.5 | 52.7 | 29.9 | 52.2 | 61.5 |
| OTA (Ours) | 180k | ResNeXt-64x4d-101-DCN | 49.2 | 67.6 | 53.5 | 30.0 | 52.5 | 62.3 |
| ATSS [47]* | 180k | ResNeXt-64x4d-101-DCN | 50.7 | 68.9 | 56.3 | 33.2 | 52.9 | 62.2 |
| PAA [16]* | 180k | ResNeXt-64x4d-101-DCN | 51.3 | 68.8 | 56.6 | 34.3 | 53.5 | 63.6 |
| OTA (Ours)* | 180k | ResNeXt-64x4d-101-DCN | 51.5 | 68.6 | 57.1 | 34.1 | 53.7 | 64.1 |

Table 5. Performance comparison with state-of-the-art one-stage detectors on the MS COCO 2017 test-dev set. * indicates the specific form of multi-scale testing adopted in ATSS [47].
| Method | MR | AP | Recall |
|---|---|---|---|
| Faster R-CNN with FPN [20] | 48.7 | 86.1 | 90.4 |
| RetinaNet [21] | 58.8 | 81.0 | 88.2 |
| FCOS [38] | 55.0 | 86.4 | 94.1 |
| FreeAnchor [48] | 51.3 | 83.9 | 89.8 |
| ATSS [47] | 49.5 | 87.4 | 94.2 |
| PAA [16] | 52.2 | 86.0 | 92.0 |
| LLA [10] | 47.9 | 88.0 | 94.0 |
| OTA (Ours) | 46.6 | 88.4 | 95.1 |

Table 6. Performance comparison on the CrowdHuman validation set. All experiments are conducted under the 2.5× schedule.

…strategy. OTA formulates the label assigning procedure in object detection as an Optimal Transport problem, which aims to transport labels from ground-truth objects and the background to anchors at a minimal transporting cost. To determine the number of positive labels needed by each gt, we further propose a simple estimation strategy based on the IoU values between the predicted bounding boxes and each gt. As shown in the experiments, OTA achieves new SOTA performance on MS COCO. Because OTA handles the assignment of ambiguous anchors well, it also outperforms all other one-stage detectors on the CrowdHuman dataset by a large margin, demonstrating its strong generalization ability.
References

[1] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[2] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.
[4] Yihong Chen, Zheng Zhang, Yue Cao, Liwei Wang, Stephen Lin, and Han Hu. Reppoints v2: Verification meets regression for object detection. arXiv preprint arXiv:2007.08508, 2020.
[5] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[7] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 304–311. IEEE, 2009.
[8] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6569–6578, 2019.
[9] Zheng Ge, Zequn Jie, Xin Huang, Rong Xu, and Osamu Yoshie. Ps-rcnn: Detecting secondary human instances in a crowd via primary object suppression. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2020.
[10] Zheng Ge, Jianfeng Wang, Xin Huang, Songtao Liu, and Osamu Yoshie. Lla: Loss-aware label assignment for dense pedestrian detection. arXiv preprint arXiv:2101.04307, 2021.
[11] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[14] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
[15] Xin Huang, Zheng Ge, Zequn Jie, and Osamu Yoshie. Nms by representative region: Towards crowded pedestrian detection by proposal pairing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10750–10759, 2020.
[16] Kang Kim and Hee Seok Lee. Probabilistic anchor assignment with iou prediction for object detection. arXiv preprint arXiv:2007.08103, 2020.
[17] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, Lei Li, and Jianbo Shi. Foveabox: Beyound anchor-based object detection. IEEE Transactions on Image Processing, 29:7389–7398, 2020.
[18] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
[19] Hengduo Li, Zuxuan Wu, Chen Zhu, Caiming Xiong, Richard Socher, and Larry S Davis. Learning from noisy anchors for one-stage object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10588–10597, 2020.
[20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[21] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[23] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 385–400, 2018.
[24] Songtao Liu, Di Huang, and Yunhong Wang. Adaptive nms: Refining pedestrian detection in a crowd. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6459–6468, 2019.
[25] Songtao Liu, Di Huang, and Yunhong Wang. Learning spatial fusion for single-shot object detection. arXiv preprint arXiv:1911.09516, 2019.
[26] Songtao Liu, Di Huang, and Yunhong Wang. Pay attention to them: deep reinforcement learning-based cascade object detection. IEEE transactions on neural networks and learning systems, 31(7):2544–2556, 2019.
[27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[28] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 821–830, 2019.
[29] Han Qiu, Yuchen Ma, Zeming Li, Songtao Liu, and Jian Sun. Borderdet: Border feature for dense object detection. In European Conference on Computer Vision, pages 549–564. Springer, 2020.
[30] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
[31] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
[32] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[34] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Com-
[44] Tong Yang, Xiangyu Zhang, Zeming Li, Wenqiang Zhang, and Jian Sun. Metaanchor: Learning to detect objects with customized anchors. In Advances in Neural Information Processing Systems, pages 320–330, 2018.
[45] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. Reppoints: Point set representation for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9657–9666, 2019.
[46] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM international conference on Multimedia, pages 516–520, 2016.
[47] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9759–9768, 2020.
[48] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and
puter Vision and Pattern Recognition, pages 658–666, 2019. Qixiang Ye. Freeanchor: Learning to match anchors for vi-
4 sual object detection. In Advances in Neural Information
[35] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Processing Systems, pages 147–155, 2019. 1, 2, 4, 6, 8
Xiangyu Zhang, and Jian Sun. Crowdhuman: A bench- [49] Yangtao Zheng, Di Huang, Songtao Liu, and Yunhong Wang.
mark for detecting human in a crowd. arXiv preprint Cross-domain object detection through coarse-to-fine feature
arXiv:1805.00123, 2018. 2, 5, 7 adaptation. In Proceedings of the IEEE/CVF Conference
[36] Lin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Hongbin on Computer Vision and Pattern Recognition, pages 13766–
Sun, Jian Sun, and Nanning Zheng. Fine-grained dynamic 13775, 2020. 2
head for object detection. arXiv preprint arXiv:2012.03519, [50] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Ob-
2020. 1 jects as points. arXiv preprint arXiv:1904.07850, 2019. 2
[37] Lin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Xiangyu [51] Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong,
Zhang, Hongbin Sun, Jian Sun, and Nanning Zheng. Re- Songtao Liu, Zeming Li, and Jian Sun. Autoassign: Differ-
thinking learnable tree filter for generic feature transform. entiable label assignment for dense object detection. arXiv
arXiv preprint arXiv:2012.03482, 2020. 2 preprint arXiv:2007.03496, 2020. 1, 2, 8
[38] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos:
[52] Chenchen Zhu, Fangyi Chen, Zhiqiang Shen, and Marios
Fully convolutional one-stage object detection. In Proceed-
Savvides. Soft anchor-point object detection. arXiv preprint
ings of the IEEE international conference on computer vi-
arXiv:1911.12448, 2019. 8
sion, pages 9627–9636, 2019. 1, 2, 4, 5, 6, 7, 8
[53] Chenchen Zhu, Yihui He, and Marios Savvides. Feature se-
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
lective anchor-free module for single-shot object detection.
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
In Proceedings of the IEEE Conference on Computer Vision
Polosukhin. Attention is all you need. In Advances in neural
and Pattern Recognition, pages 840–849, 2019. 8
information processing systems, pages 5998–6008, 2017. 2
[40] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and [54] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. De-
Dahua Lin. Region proposal by guided anchoring. In Pro- formable convnets v2: More deformable, better results. In
ceedings of the IEEE Conference on Computer Vision and Proceedings of the IEEE Conference on Computer Vision
Pattern Recognition, pages 2965–2974, 2019. 2 and Pattern Recognition, pages 9308–9316, 2019. 7
[41] Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun,
Jian Sun, and Nanning Zheng. End-to-end object de-
tection with fully convolutional network. arXiv preprint
arXiv:2012.03544, 2020. 7
[42] Jiaxi Wu, Songtao Liu, Di Huang, and Yunhong Wang.
Multi-scale positive sample refinement for few-shot object
detection. In European Conference on Computer Vision,
pages 456–472. Springer, 2020. 2
[43] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and
Kaiming He. Aggregated residual transformations for deep
neural networks. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1492–1500,
2017. 7
10
A. Appendix

A.1. Optimal Transport and Sinkhorn Iteration

For the integrity of this paper, we briefly review the derivation of the Sinkhorn Iteration algorithm; we emphasize that it is not our contribution and belongs to textbook knowledge.

The mathematical formulation of the Optimal Transport problem is defined in Eq. 1. This is a linear program which can be solved in polynomial time. For dense detectors, however, the resulting linear program is large, involving the square of the feature dimensions with anchors at all scales. This issue can be addressed by a fast iterative solution, which converts the optimization target in Eq. 1 into a non-linear but convex form by adding an entropic regularization term $E$:

$$\min_{\pi} \; \sum_{i=1}^{m}\sum_{j=1}^{n} c_{ij}\pi_{ij} + \gamma E(\pi_{ij}), \eqno(5)$$

Introducing Lagrange multipliers $\alpha_j$ and $\beta_i$ for the two marginal constraints and setting the derivative of the resulting Lagrangian to zero yields the optimal plan:

$$\pi_{ij}^{*} = \exp\!\left(-\frac{\alpha_j}{\gamma}\right)\exp\!\left(-\frac{c_{ij}}{\gamma}\right)\exp\!\left(-\frac{\beta_i}{\gamma}\right). \eqno(7)$$

Letting $u_j = \exp(-\frac{\alpha_j}{\gamma})$, $v_i = \exp(-\frac{\beta_i}{\gamma})$ and $M_{ij} = \exp(-\frac{c_{ij}}{\gamma})$, the following constraints can be enforced:

$$\sum_{i}\pi_{ij} = u_j\Big(\sum_{i}M_{ij}v_i\Big) = d_j, \eqno(8)$$

$$\sum_{j}\pi_{ij} = \Big(\sum_{j}u_j M_{ij}\Big)v_i = s_i. \eqno(9)$$

Alternately solving Eq. 8 for $u_j$ and Eq. 9 for $v_i$ gives the updating rule

$$u_j \leftarrow \frac{d_j}{\sum_{i}M_{ij}v_i}, \qquad v_i \leftarrow \frac{s_i}{\sum_{j}M_{ij}u_j}, \eqno(10)$$

which is also known as the Sinkhorn-Knopp Iteration. After repeating this iteration $T$ times, the approximate optimal plan $\pi^{*}$ can be obtained:

$$\pi^{*} = \mathrm{diag}(v)\,M\,\mathrm{diag}(u). \eqno(11)$$

$\gamma$ and $T$ are empirically set to 0.1 and 50, respectively. Please refer to our code for more details.
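To make the procedure concrete, the full iteration can be sketched in a few lines of NumPy. This is a minimal illustration rather than our released implementation: the function name `sinkhorn` and the dense formulation are chosen for clarity only, and no numerical safeguards (e.g. log-domain computation for very small $\gamma$) are included.

```python
import numpy as np

def sinkhorn(cost, supply, demand, gamma=0.1, T=50):
    """Approximate the optimal transport plan by Sinkhorn-Knopp iteration.

    cost:   (m, n) cost matrix c_ij between suppliers and demanders
    supply: (m,) supplying vector s_i
    demand: (n,) demanding vector d_j, with sum(supply) == sum(demand)
    """
    M = np.exp(-cost / gamma)           # M_ij = exp(-c_ij / gamma)
    v = np.ones_like(supply)            # v_i = exp(-beta_i / gamma), init to 1
    for _ in range(T):
        u = demand / (M.T @ v)          # u_j <- d_j / sum_i M_ij v_i   (Eq. 10)
        v = supply / (M @ u)            # v_i <- s_i / sum_j M_ij u_j   (Eq. 10)
    return v[:, None] * M * u[None, :]  # pi* = diag(v) M diag(u)       (Eq. 11)
```

After convergence the row sums of the returned plan match the supplying vector and the column sums match the demanding vector, so each demander (gt) has effectively been assigned its requested number of units from the suppliers (anchors).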