
OTA: Optimal Transport Assignment for Object Detection

Zheng Ge1,2, Songtao Liu2*, Zeming Li2, Osamu Yoshie1, Jian Sun2
1Waseda University, 2Megvii Technology
jokerzz@fuji.waseda.jp; liusongtao@megvii.com; lizeming@megvii.com;
yoshie@waseda.jp; sunjian@megvii.com
arXiv:2103.14259v1 [cs.CV] 26 Mar 2021

Abstract

Recent advances in label assignment in object detection mainly seek to independently define positive/negative training samples for each ground-truth (gt) object. In this paper, we revisit label assignment from a global perspective and propose to formulate the assigning procedure as an Optimal Transport (OT) problem – a well-studied topic in Optimization Theory. Concretely, we define the unit transportation cost between each demander (anchor) and supplier (gt) pair as the weighted summation of their classification and regression losses. After formulation, finding the best assignment solution is converted to solving for the optimal transport plan at minimal transportation cost, which can be done via the Sinkhorn-Knopp Iteration. On COCO, a single FCOS-ResNet-50 detector equipped with Optimal Transport Assignment (OTA) reaches 40.7% mAP under the 1x scheduler, outperforming all other existing assigning methods. Extensive experiments conducted on COCO and CrowdHuman further validate the effectiveness of our proposed OTA, especially its superiority in crowd scenarios. The code is available at https://github.com/Megvii-BaseDetection/OTA.

[Figure 1: two sample images with gt bounding boxes (GT1, GT2) and a set of ambiguous anchor points.] Figure 1. An illustration of ambiguous anchor points in object detection. Red dots show some of the ambiguous anchors in two sample images. Currently, the assignment of these ambiguous anchors is heavily based on hand-crafted rules.

1. Introduction

Current CNN-based object detectors [27, 30, 21, 47, 33, 29, 36] perform prediction in a dense manner, predicting the classification (cls) labels and regression (reg) offsets for a set of pre-defined anchors1. To train the detector, defining cls and reg targets for each anchor is a necessary procedure, which is called label assignment in object detection.

Classical label assigning strategies commonly adopt pre-defined rules to match the ground-truth (gt) object or background for each anchor. For example, RetinaNet [21] adopts Intersection-over-Union (IoU) as its thresholding criterion for pos/neg anchor division. Anchor-free detectors like FCOS [38] treat the anchors within the center/bbox region of any gt object as the corresponding positives. Such static strategies ignore the fact that for objects with different sizes, shapes, or occlusion conditions, the appropriate positive/negative (pos/neg) division boundaries may vary.

Motivated by this, many dynamic assignment strategies have been proposed. ATSS [47] sets the division boundary for each gt based on statistical characteristics. Other recent advances [48, 19, 51, 16] suggest that the predicted confidence scores of each anchor can be a proper indicator for designing dynamic assigning strategies, i.e., high-confidence anchors can be easily learned by the network and thus be assigned to the related gt, while anchors with uncertain predictions should be considered negatives. These strategies enable the detector to dynamically choose positive anchors for each individual gt object and achieve state-of-the-art performance.

However, independently assigning pos/neg samples for each gt without context can be sub-optimal, just as a lack of context may lead to improper predictions. When dealing with ambiguous anchors (i.e., anchors that qualify as positive samples for multiple gts simultaneously, as shown in Fig. 1), existing assignment strategies rely heavily on hand-crafted rules (e.g., Min Area [38], Max IoU [16, 21, 47]). We argue that assigning ambiguous anchors to any single gt (or background) may introduce harmful gradients w.r.t. the other gts. Hence the assignment of ambiguous anchors is non-trivial and requires information beyond the local view. A better assigning strategy should abandon the convention of pursuing an optimal assignment for each gt independently and turn to the ideology of global optimum, in other words, finding the globally high-confidence assignment for all gts in an image.

DeTR [3] is the first work that attempts to consider label assignment from a global view. It replaces the detection head with transformer layers [39] and adopts one-to-one assignment using the Hungarian algorithm, which matches only one query to each gt with global minimum loss. However, for CNN-based detectors, as the networks often produce correlated scores in the neighboring regions around an object, each gt is assigned to many anchors (i.e., one-to-many), which also benefits training efficiency. In this one-to-many setting, assigning labels with a global view remains unexplored.

To achieve a globally optimal assigning result under the one-to-many situation, we propose to formulate label assignment as an Optimal Transport (OT) problem – a special form of Linear Programming (LP) in Optimization Theory. Specifically, we define each gt as a supplier who supplies a certain number of labels, and each anchor as a demander who needs one unit of label. If an anchor receives a sufficient amount of positive label from a certain gt, this anchor becomes one positive anchor for that gt. In this context, the number of positive labels each gt supplies can be interpreted as "how many positive anchors that gt needs for better convergence during the training process". The unit transportation cost between each anchor-gt pair is defined as the weighted summation of their pair-wise cls and reg losses. Furthermore, as being negative should also be considered for each anchor, we introduce another supplier – background – who supplies negative labels to make up the rest of the labels in need. The cost between background and a certain anchor is defined as their pair-wise classification loss only. After formulation, finding the best assignment solution is converted to solving for the optimal transport plan, which can be quickly and efficiently computed by the off-the-shelf Sinkhorn-Knopp Iteration [5]. We name such an assigning strategy Optimal Transport Assignment (OTA).

Comprehensive experiments are carried out on the MS COCO [22] benchmark, and significant improvements from OTA demonstrate its advantage. OTA also achieves state-of-the-art performance among one-stage detectors on a crowded pedestrian detection dataset named CrowdHuman [35], showing OTA's generalization ability across different detection benchmarks.

* Corresponding author
1 For anchor-free detectors like FCOS [38], the feature points can be viewed as shrunk anchor boxes. Hence in this paper, we collectively refer to anchor boxes and anchor points as "anchors".

2. Related Work

2.1. Fixed Label Assignment

Determining which gt (or background) each anchor should be assigned to is a necessary procedure before training object detectors. Anchor-based detectors usually adopt IoU at a certain threshold as the assigning criterion. For example, the RPN in Faster R-CNN [33] uses 0.7 and 0.3 as the positive and negative thresholds, respectively. When training the R-CNN module, the IoU threshold for pos/neg division is changed to 0.5. IoU-based label assignment has proved effective and was soon adopted by many Faster R-CNN variants like [2, 12, 20, 42, 26, 49, 37], as well as many one-stage detectors like [31, 32, 25, 27, 23, 21].

Recently, anchor-free detectors have drawn much attention because of their concision and high computational efficiency. Without anchor boxes, FCOS [38], FoveaBox [17] and their precursors [30, 14, 46] directly assign anchor points around the centers of objects as positive samples, showing promising detection performance. Another stream of anchor-free detectors [18, 8, 50, 45, 4] views each object as a single key-point or a set of key-points. They share distinct characteristics from the other detectors, hence will not be further discussed in this paper.

Although the detectors mentioned above differ in many aspects, for label assignment they all adopt a single fixed assigning criterion (e.g., a fixed region of the center area or a fixed IoU threshold) for objects of various sizes, shapes, and categories, which may lead to sub-optimal assigning results.

2.2. Dynamic Label Assignment

Many recent works try to make the label assigning procedure more adaptive, aiming to further improve detection performance. Instead of using pre-defined anchors, GuidedAnchoring [40] generates anchors based on an anchor-free mechanism to better fit the distribution of various objects. MetaAnchor [44] proposes an anchor generation function to learn dynamic anchors from arbitrary customized prior boxes. NoisyAnchors [19] proposes soft-label and anchor re-weighting mechanisms based on classification and localization losses. FreeAnchor [48] constructs top-k anchor candidates for each gt based on IoU and then proposes a detection-customized likelihood to perform pos/neg division within each candidate set. ATSS [47] proposes an adaptive sample selection strategy that adopts the mean+std of the IoU values from a set of closest anchors for each gt as the pos/neg threshold. PAA [16] assumes that the distribution of the joint loss for positive and negative samples follows a Gaussian distribution. Hence it uses a GMM to fit the distributions of positive and negative samples, and then uses the center of the positive sample distribution as the pos/neg division boundary.
AutoAssign [51] tackles label assignment in a fully data-driven manner by automatically determining the positives/negatives in both spatial and scale dimensions.

These methods explore the optimal assigning strategy for individual objects, while failing to consider context information from a global perspective. DeTR [3] examines the idea of global optimal matching. But the Hungarian algorithm it adopts can only work in a one-to-one assignment manner. So far, for CNN-based detectors in one-to-many scenarios, a globally optimal assigning strategy remains uncharted.

[Figure 2: the pipeline from a cost matrix over anchors across FPN levels, through the Sinkhorn-Knopp Iteration, to the optimal assigning plan π* and the final assigning results.] Figure 2. An illustration of Optimal Transport Assignment. The cost matrix is composed of the pair-wise cls and reg losses between each anchor-gt pair. The goal of finding the best label assignment is converted to solving for the best transporting plan, which transports the labels from suppliers (i.e., GT and BG) to demanders (i.e., anchors) at a minimal transportation cost via the Sinkhorn-Knopp Iteration.

3. Method

In this section, we first revisit the definition of the Optimal Transport problem and then demonstrate how we formulate label assignment in object detection as an OT problem. We also introduce two advanced designs which we suggest adopting to make the best use of OTA.

3.1. Optimal Transport

The Optimal Transport (OT) problem is the following: suppose there are m suppliers and n demanders in a certain area. The i-th supplier holds s_i units of goods while the j-th demander needs d_j units of goods. The transporting cost for each unit of goods from supplier i to demander j is denoted by c_{ij}. The goal of the OT problem is to find a transportation plan \pi^* = \{\pi_{i,j} \mid i = 1,2,\dots,m,\; j = 1,2,\dots,n\}, according to which all goods from the suppliers can be transported to the demanders at a minimal transportation cost:

    \min_{\pi} \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij}\,\pi_{ij}
    s.t. \sum_{i=1}^{m} \pi_{ij} = d_j, \quad \sum_{j=1}^{n} \pi_{ij} = s_i,
         \sum_{i=1}^{m} s_i = \sum_{j=1}^{n} d_j,
         \pi_{ij} \ge 0, \quad i = 1,2,\dots,m, \; j = 1,2,\dots,n.    (1)

This is a linear program which can be solved in polynomial time. In our case, however, the resulting linear program is large, as it involves the anchors at all scales of the feature maps. We thus address this issue via a fast iterative solution named Sinkhorn-Knopp [5] (described in Appendix A.1).
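For reference, the Sinkhorn-Knopp Iteration used to approximately solve Eq. 1 can be sketched in a few lines of NumPy. This is a minimal sketch with entropic regularization; the function name, default γ, and iteration count are illustrative and do not reproduce the released implementation:

```python
import numpy as np

def sinkhorn_knopp(cost, s, d, gamma=0.1, n_iters=100):
    """Approximately solve the OT problem of Eq. 1 with entropic regularization.

    cost:  (m, n) unit transportation costs c_ij
    s:     (m,)   supply vector (sum(s) must equal sum(d))
    d:     (n,)   demand vector
    gamma: regularization intensity (the gamma of Algorithm 1)
    Returns pi: (m, n) transportation plan with row sums ~ s, column sums ~ d.
    """
    K = np.exp(-cost / gamma)          # Gibbs kernel of the cost matrix
    u = np.ones_like(s, dtype=float)
    v = np.ones_like(d, dtype=float)
    for _ in range(n_iters):           # alternate row / column scaling
        u = s / (K @ v)
        v = d / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

Each iteration is just two matrix-vector products, which is why the whole procedure maps well onto GPUs, as noted in Sec. 3.2.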
3.2. OT for Label Assignment

In the context of object detection, suppose there are m gt targets and n anchors (across all FPN [20] levels) for an input image I. We view each gt as a supplier who holds k units of positive labels (i.e., s_i = k, i = 1,2,...,m), and each anchor as a demander who needs one unit of label (i.e., d_j = 1, j = 1,2,...,n). The cost c^{fg} for transporting one unit of positive label from gt_i to anchor a_j is defined as the weighted summation of their cls and reg losses:

    c^{fg}_{ij} = L_{cls}(P^{cls}_j(\theta), G^{cls}_i) + \alpha\, L_{reg}(P^{box}_j(\theta), G^{box}_i),    (2)

where θ stands for the model's parameters. P^{cls}_j and P^{box}_j denote the predicted cls score and bounding box for a_j. G^{cls}_i and G^{box}_i denote the ground-truth class and bounding box of gt_i. L_{cls} and L_{reg} stand for the cross-entropy loss and IoU loss [46]. One can also replace these two losses with Focal Loss [21] and GIoU [34]/SmoothL1 loss [11]. α is the balancing coefficient.

Besides positive assigning, a large set of anchors is treated as negative samples during training. As the optimal transportation involves all anchors, we introduce another supplier – background – who only provides negative labels. In a standard OT problem, the total supply must equal the total demand. We thus set the number of negative labels that background can supply to n − m × k. The cost for transporting one unit of negative label from background to a_j is defined as:

    c^{bg}_j = L_{cls}(P^{cls}_j(\theta), \varnothing),    (3)

where ∅ denotes the background class. Concatenating this c^{bg} ∈ R^{1×n} to the last row of c^{fg} ∈ R^{m×n}, we get the complete form of the cost matrix c ∈ R^{(m+1)×n}. The supplying vector s is correspondingly updated as:

    s_i = k,            if i ≤ m,
    s_i = n − m × k,    if i = m + 1.    (4)

As we already have the cost matrix c, the supplying vector s ∈ R^{m+1}, and the demanding vector d ∈ R^n, the optimal transportation plan π* ∈ R^{(m+1)×n} can be obtained by solving this OT problem via the off-the-shelf Sinkhorn-Knopp Iteration [5]. After getting π*, one can decode the corresponding label assigning solution by assigning each anchor to the supplier who transports the largest amount of labels to it. The subsequent processes (e.g., calculating losses based on the assigning result, back-propagation) are exactly the same as in FCOS [38] and ATSS [47]. Note that the optimization process of the OT problem only contains matrix multiplications, which can be accelerated by GPU devices; hence OTA increases the total training time by less than 20% and is totally cost-free in the testing phase.

Algorithm 1 Optimal Transport Assignment (OTA)
Input:
    I is an input image
    A is a set of anchors
    G is the gt annotations for objects in image I
    γ is the regularization intensity in the Sinkhorn-Knopp Iteration
    T is the number of iterations in the Sinkhorn-Knopp Iteration
    α is the balancing coefficient in Eq. 2
Output:
    π* is the optimal assigning plan
 1: m ← |G|, n ← |A|
 2: P^{cls}, P^{box} ← Forward(I, A)
 3: s_i (i = 1, 2, ..., m) ← Dynamic k Estimation
 4: s_{m+1} ← n − Σ_{i=1}^{m} s_i
 5: d_j (j = 1, 2, ..., n) ← OnesInit
 6: pairwise cls cost: c^{cls}_{ij} ← FocalLoss(P^{cls}_j, G^{cls}_i)
 7: pairwise reg cost: c^{reg}_{ij} ← IoULoss(P^{box}_j, G^{box}_i)
 8: pairwise Center Prior cost: c^{cp}_{ij} ← Center Prior(A_j, G^{box}_i)
 9: background cls cost: c^{cls}_{bg} ← FocalLoss(P^{cls}_j, ∅)
10: foreground cost: c^{fg} ← c^{cls} + α c^{reg} + c^{cp}
11: compute the final cost matrix c by concatenating c^{cls}_{bg} to the last row of c^{fg}
12: u^0, v^0 ← OnesInit
13: for t = 0 to T do:
14:     u^{t+1}, v^{t+1} ← SinkhornIter(c, u^t, v^t, s, d)
15: compute the optimal assigning plan π* according to Eq. 11
16: return π*

3.3. Advanced Designs

Center Prior. Previous works [47, 16, 48] only select positive anchors from a limited center region of objects, called the Center Prior. This is because they suffer from either a large number of ambiguous anchors or poor statistics in the subsequent process. Instead of relying on statistical characteristics, our OTA is based on a global optimization methodology and is thus naturally resistant to these two issues. Theoretically, OTA can assign any anchor within the region of a gt's box as a positive sample. However, for general detection datasets like COCO, we find the Center Prior still benefits the training of OTA. Forcing detectors to focus on potential positive areas (i.e., center areas) can help stabilize the training process, especially in the early stage, which leads to a better final performance. Hence, we impose a Center Prior on the cost matrix. For each gt, we select the r^2 closest anchors from each FPN level according to the center distance between anchors and gts2. For anchors not in the r^2 closest list, their corresponding entries in the cost matrix c are subject to an additional constant cost, to reduce the possibility that they are assigned as positive samples during the training stage. In Sec. 4, we demonstrate that although OTA adopts a certain degree of Center Prior like other works [38, 47, 48] do, it consistently outperforms its counterparts by a large margin when r is set to a large value (i.e., a large number of potential positive anchors as well as more ambiguous anchors).

2 For anchor-based methods, the distances are measured between the geometric centers of anchors and gts.
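The cost matrix, supplying vector, and demanding vector of Eqs. 2–4 can be assembled as follows. This is a minimal NumPy sketch: the per-pair losses are assumed precomputed, the function name is illustrative, and k is fixed for simplicity (Algorithm 1 instead uses Dynamic k Estimation for s_i):

```python
import numpy as np

def build_ota_cost_and_supply(cls_cost, reg_cost, center_prior_cost,
                              bg_cls_cost, k, alpha=1.5):
    """Assemble the (m+1) x n cost matrix c and the supply/demand vectors.

    cls_cost, reg_cost, center_prior_cost: (m, n) pair-wise foreground costs
    bg_cls_cost: (n,) background classification cost of Eq. 3
    k: number of positive labels each gt supplies (fixed here for simplicity)
    alpha: the balancing coefficient of Eq. 2 (the paper uses 1.5)
    """
    m, n = cls_cost.shape
    # Eq. 2 plus the additional Center Prior cost of Algorithm 1 (line 10)
    c_fg = cls_cost + alpha * reg_cost + center_prior_cost
    # concatenate the background row (Eq. 3) as the last row of c
    c = np.vstack([c_fg, bg_cls_cost[None, :]])
    # Eq. 4: each gt supplies k positives, background supplies the rest
    s = np.concatenate([np.full(m, float(k)), [n - m * k]])
    d = np.ones(n)                      # each anchor demands one unit of label
    return c, s, d
```

By construction s.sum() == d.sum() == n, satisfying the supply/demand balance required by Eq. 1.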
Dynamic k Estimation. Intuitively, the appropriate number of positive anchors for each gt (i.e., s_i in Sec. 3.1) should differ, depending on many factors such as the objects' sizes, scales, and occlusion conditions. As it is hard to directly model a mapping function from these factors to a positive-anchor number, we propose a simple but effective method to roughly estimate the appropriate number of positive anchors for each gt based on the IoU values between the predicted bounding boxes and the gts. Specifically, for each gt, we select the top q predictions according to IoU values. These IoU values are summed up to represent this gt's estimated number of positive anchors. We name this method Dynamic k Estimation. The estimation is based on the following intuition: the appropriate number of positive anchors for a certain gt should be positively correlated with the number of anchors that regress this gt well. In Sec. 4, we present a detailed comparison between the fixed-k and Dynamic k Estimation strategies.

A toy visualization of OTA is shown in Fig. 2. We also describe OTA's complete procedure, including Center Prior and Dynamic k Estimation, in Algorithm 1.

4. Experiments

In this section, we conduct extensive experiments on MS COCO 2017 [22], which contains about 118k, 5k, and 20k images for the train, val, and test-dev sets, respectively. For ablation studies, we train detectors on the train set and report the performance on the val set. Comparisons with other methods are conducted on the test-dev set. We also compare OTA with other methods on the CrowdHuman [35] validation set to demonstrate the superiority of OTA in crowd scenarios.

4.1. Implementation Details

If not specified, we use ResNet-50 [13] pre-trained on ImageNet [6] with FPN [20] as our default backbone. Most experiments are trained for 90k iterations, denoted as "1x". The initial learning rate is 0.01 and is decayed by a factor of 10 after 60k and 80k iterations. The mini-batch size is set to 16. Following common practice, the model is trained with SGD [1] on 8 GPUs.

OTA can be adopted in both anchor-based and anchor-free detectors; the following experiments are mainly conducted on FCOS [38] because of its simplicity. We adopt Focal Loss and IoU Loss as the L_cls and L_reg that make up the cost matrix. α in Eq. 2 is set to 1.5. For back-propagation, the regression loss is replaced by GIoU Loss and is re-weighted by a factor of 2. The IoU branch was first introduced in YOLOv1 [30] and proved effective in modern one-stage object detectors by PAA [16]; we also adopt the IoU branch as a default component in our experiments. The top q in Sec. 3.3 is directly set to 20, as we find this parameter value consistently yields stable results in various situations.

4.2. Ablation Studies and Analysis

Effects of Individual Components. We verify the effectiveness of each component of our proposed method. For fair comparison, all detectors' regression losses are multiplied by 2, which is known as a useful trick to boost the AP at high IoU thresholds [28]. As seen in Table 1, when no auxiliary branch is adopted, OTA outperforms FCOS by 0.9% AP (39.2% vs. 38.3%). This gap almost remains the same after adding the IoU branch to both of them (39.5% vs. 40.3% and 38.8% vs. 39.6%, with or without Center Prior, respectively). Finally, Dynamic k pushes AP to a new state of the art, 40.7%. Throughout this paper, we emphasize that OTA can be applied to both anchor-based and anchor-free detectors. Hence we also adopt OTA on RetinaNet [21] with only one square anchor per location across the feature maps. As shown in Table 1, the AP values of OTA-FCOS and OTA-RetinaNet are exactly the same, demonstrating OTA's applicability to both anchor-based and anchor-free detectors.

Table 1. Ablation studies on each component in OTA. "Center" stands for Center Prior and Center Sampling for OTA and FCOS, respectively. "Dyn. k" is the abbreviation of our proposed Dynamic k Estimation strategy.

    Method          | Aux. Branch | Center | Dyn. k | AP   | AP50 | AP75
    FCOS            | -           | X      |        | 38.3 | 57.1 | 41.3
    FCOS            | CenterNess  | X      |        | 38.9 | 57.5 | 42.0
    FCOS            | IoU         |        |        | 38.8 | 57.7 | 41.8
    FCOS            | IoU         | X      |        | 39.5 | 57.6 | 42.9
    OTA (FCOS)      | -           | X      |        | 39.2 | 58.3 | 42.2
    OTA (FCOS)      | IoU         |        |        | 39.6 | 58.1 | 42.5
    OTA (FCOS)      | IoU         | X      |        | 40.3 | 58.6 | 43.7
    OTA (FCOS)      | IoU         | X      | X      | 40.7 | 58.4 | 44.3
    OTA (RetinaNet) | IoU         | X      | X      | 40.7 | 58.6 | 44.1

Effects of r. The radius r of the Center Prior controls the number of candidate anchors for each gt. With a small r, only anchors near objects' centers can be assigned as positives, helping the optimization process focus on regions that are more likely to be informative. As r increases, the number of candidates grows quadratically, leading to potential instability in the optimization process. For example, when r is set to 3, 5, or 7, the corresponding numbers of candidate anchors are 45, 125, and 245, respectively3. We study the behaviors of ATSS [47], PAA [16], and OTA under different values of r in Table 2.

3 The total number of potential positive anchors equals r^2 × (number of FPN levels).
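Since OTA's assigning plan π* is continuous, an anchor is counted as ambiguous when no supplier transports at least 0.9 of its one unit of label to it (the criterion used for the Namb. statistic in Table 2). A minimal sketch, with illustrative names:

```python
import numpy as np

def count_ambiguous_anchors(pi, threshold=0.9):
    """Count ambiguous anchors under a continuous assigning plan.

    pi: (m+1, n) transportation plan (gts plus the background row)
    An anchor j is ambiguous if max_i pi[i, j] < threshold, i.e. its
    unit of label is split among several suppliers.
    """
    return int((pi.max(axis=0) < threshold).sum())
```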
Table 2. Performance of different label assigning strategies under different numbers of anchor candidates. Namb. denotes the average number of ambiguous anchors per image, calculated on the COCO train set.

    Method | ATSS [47]        | PAA [16]         | OTA
    r      | 3     5     7    | 3     5     7    | 3     5     7
    Namb.  | 2.1   15.9  36.3 | 0.5   0.8   1.2  | 0.2   0.2   0.3
    AP     | 39.4  38.0  37.2 | 40.3  40.1  39.5 | 40.6  40.7  40.4
    AP50   | 57.5  56.7  55.8 | 58.9  58.4  57.5 | 58.7  58.4  58.3
    AP75   | 42.7  40.4  39.8 | 43.4  43.4  42.4 | 44.1  44.3  43.6

OTA achieves the best performance (40.7% AP) when r is set to 5. When r is set to 3 as in ATSS and PAA, OTA also achieves 40.6% AP, indicating that most potential positive anchors are near the centers of objects on COCO. When r is set to 7, the performance only slightly drops by 0.3%, showing that OTA is insensitive to the hyper-parameter r.

Ambiguous Anchors Handling. Most existing dynamic label assigning methods [47, 16, 48] only construct a small candidate set for each gt, because a large number of candidates brings trouble – when occlusion happens or several objects are close enough, an anchor may simultaneously be a qualified candidate for multiple gts. We define such anchors as ambiguous anchors. Previous methods mainly handle this ambiguity by introducing hand-crafted rules, e.g., Min Area [38], Max IoU [47, 16, 21], and Min Loss4. To illustrate OTA's superiority in ambiguity handling, we count the number of ambiguous anchors in ATSS, PAA, and OTA, and evaluate their corresponding performance under different r in Table 2. Note that the optimal assigning plan in OTA is continuous; hence we define anchor a_j as an ambiguous anchor if max_i π*_{ij} < 0.9. Table 2 shows that for ATSS, the number of ambiguous anchors greatly increases as r varies from 3 to 7, and its performance correspondingly drops from 39.4% to 37.2%. For PAA, the number of ambiguous anchors is less sensitive to r, but its performance still drops by 0.8%, indicating that the Max IoU rule adopted by PAA is not an ideal prior for ambiguous anchors. In OTA, when multiple gts tend to transport positive labels to the same anchor, the OT algorithm automatically resolves their conflicts based on the principle of minimum global cost. Hence the number of ambiguous anchors in OTA remains low and barely increases as r grows from 3 to 7, and the corresponding performance is also stable.

4 Assigning an ambiguous anchor to the gt with the minimal loss.

Further, we manually assign the ambiguous anchors based on hand-crafted rules before performing OTA. In this case, OTA is only in charge of the pos/neg sample division. Table 3 shows that such combinations of hand-crafted rules and OTA decrease the AP by 0.7% and 0.4%, respectively. Finally, we visualize some assigning results in Fig. 3. Red arrows and dashed ovals highlight the ambiguous regions (i.e., overlaps between different fgs or junctions between fg and bg). Suffering from the lack of context and global information, ATSS and PAA perform poorly in such regions, leading to sub-optimal detection performance. Conversely, OTA assigns far fewer positive anchors in such regions, which we believe is the desired behavior.

[Figure 3: assigning results of PAA, ATSS, and OTA on sample images, with ambiguous regions highlighted.] Figure 3. Visualizations of assigning results. For PAA, the dots stand for the geometric centers of positive anchor boxes. For ATSS and OTA, the dots stand for positive anchor points. Rectangles represent the gt bounding boxes. To clearly illustrate the differences between the assigning strategies, we set r to 5 for all methods. Only the FPN layers with the largest number of positive anchors are shown for better visualization.

Table 3. Performance comparisons on ambiguity handling between OTA and other human-designed strategies on the COCO val set. "f.b." denotes "followed by".

    Method                 | AP   | AP50 | AP75
    Min Area [38] f.b. OTA | 40.0 | 57.8 | 43.6
    Max IoU [47] f.b. OTA  | 40.3 | 58.1 | 43.7
    Min Loss f.b. OTA      | 40.3 | 57.9 | 43.6
    OTA                    | 40.7 | 58.4 | 44.3
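The Dynamic k Estimation strategy of Sec. 3.3 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the function name is illustrative, and the truncation to an integer and the clamping to at least one positive are our assumptions about details the paper leaves open:

```python
import numpy as np

def dynamic_k_estimation(ious, q=20, min_k=1):
    """Estimate s_i for each gt from the IoUs between predicted boxes and gts.

    ious: (m, n) IoU between each gt and each predicted bounding box
    q:    number of top predictions considered per gt (the paper sets q = 20)
    Per Sec. 3.3: sum the top-q IoU values of each gt as its estimated
    number of positive anchors; rounding/clamping is our assumption.
    """
    m, n = ious.shape
    topq = np.sort(ious, axis=1)[:, -min(q, n):]      # top-q IoUs per gt
    ks = np.maximum(topq.sum(axis=1).astype(int), min_k)
    return ks
```

A gt that is well regressed by many anchors thus receives a larger supply s_i, matching the intuition stated in Sec. 3.3.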
gt supply. This value also represents how many anchors ev- ther improved to 47.0% AP. To demonstrate the compati-
ery gt needs for better convergence. A naive way is setting bility of our method with other advanced technologies in
k to a constant value for all gts. We try different values of object detection, we adopt Deformable Convolutional Net-
k from 1 to 20. As seen in Table 4, among all different val- works (DCN) [54] to ResNeXt backbones as well as the
ues, k=10 and k=12 achieve the best performances. As k in- last convolution layer in the detection head. This improves
creases from 10 to 20, the possibility that an anchor is suit- our model’s performance from 47.0% AP to 49.2% AP. Fi-
able as a positive sample for two close targets at the same nally, with the multi-scale testing technique, our best model
time also increases, but there is no obvious performance achieves 51.5% AP.
drop (0.2%) according to Table 4 which proves OTA’s su-
periority in handling potential ambiguity. When k=1, OTA 4.4. Experiments on CrowdHuman
becomes a one-to-one assigning strategy, the same as in
Object detection in crowded scenarios has raised more
DeTR. The poor performance tells us that achieving com-
and more attention [24, 15, 9, 10]. Compared to dataset
petitive performance via one-to-one assignment under the
designed for general object detection like COCO, ambigu-
1× scheduler remains challenging, unless an auxiliary one-
k       AP    AP50  AP75  APs   APm   APl
1       36.5  55.4  38.8  21.4  39.7  46.2
5       39.5  58.1  42.7  23.1  43.0  50.6
8       39.8  58.4  42.9  22.7  43.6  51.5
10      40.3  58.6  43.7  23.4  44.2  52.1
12      40.3  58.6  43.6  23.2  44.2  51.9
15      40.2  58.4  43.6  23.2  44.1  51.9
20      40.1  58.2  43.6  23.5  44.0  52.8
Dyn. k  40.7  58.4  44.3  23.2  45.0  53.6
Table 4. Analysis of different values of k and the Dynamic k Estimation strategy on the COCO val set.

The fixed-k strategy assumes that every gt has the same number of appropriate positive anchors. However, we believe that this number should vary per gt and may be affected by many factors like objects' sizes, spatial attitudes, and occlusion conditions, etc. Hence we adopt the Dynamic k Estimation proposed in Sec. 3.3 and compare its performance to the fixed-k strategy. Results in Table 4 show that dynamic k surpasses the best performance of fixed k by 0.4% AP, validating our point and the effectiveness of the Dynamic k Estimation strategy.

4.3. Comparison with State-of-the-art Methods

We compare our final models with other state-of-the-art one-stage detectors on MS COCO test-dev. Following previous works [21, 38], we randomly scale the shorter side of images in the range from 640 to 800. Besides, we double the total number of iterations to 180K, with the learning rate change points scaled proportionally. Other settings are consistent with [21, 38].

As shown in Table 5, our method with ResNet-101-FPN achieves 45.3% AP, outperforming all other methods with the same backbone, including ATSS (43.6% AP), AutoAssign (44.5% AP), and PAA (44.6% AP). Note that for PAA, we remove the score voting procedure for fair comparison between different label assigning strategies. With ResNeXt-64x4d-101-FPN [43], the performance of OTA can be further improved to 47.0% AP (Table 5).

ity happens more frequently in crowded datasets. Hence, to demonstrate OTA's advantage in handling ambiguous anchors, it is necessary to conduct experiments on a crowded dataset, CrowdHuman [35]. CrowdHuman contains 15000, 4370, and 5000 images in the training, validation, and test sets, respectively, with an average of 22.6 persons per image. For all experiments, we train the detectors for 30 epochs (i.e., a 2.5x schedule) for better convergence. The NMS threshold is set to 0.6. We adopt ResNet-50 [13] as the default backbone in our experiments. Other settings are the same as in our experiments on COCO. For evaluation, we follow the standard Caltech [7] evaluation metric, MR, which stands for the log-average miss rate over false positives per image (FPPI) in the range [10^-2, 10^0]. AP and Recall are also reported for reference. All evaluation results are reported on the CrowdHuman val subset.

As shown in Table 6, RetinaNet and FCOS only achieve 58.8% and 55.0% MR respectively, far worse than two-stage detectors like Faster R-CNN (with FPN), revealing the dilemma of one-stage detectors in crowd scenarios. Starting from FreeAnchor, the performance of one-stage detectors is gradually improved by dynamic label assigning strategies. ATSS achieves 49.5% MR, which is very close to the performance of Faster R-CNN (48.7% MR). The recently proposed LLA [10] leverages loss-aware label assignment, which is similar to OTA, and achieves 47.9% MR. However, our OTA takes a step forward by introducing global information into the label assignment, further reducing MR to 46.6%. The AP and Recall of OTA also surpass those of other existing one-stage detectors by a clear margin.

Although PAA achieves performance competitive with OTA on COCO, it struggles on CrowdHuman. We conjecture that PAA needs clear pos/neg decision boundaries to help the GMM learn better clusters. But in crowded scenarios, such clear boundaries may not exist, because potential negative samples usually cover a sufficient amount of foreground area, resulting in PAA's poor performance. Also, PAA performs per-gt clustering, which heavily increases the training time on crowded datasets like CrowdHuman. Compared to PAA, OTA still shows promis-

…to-many supervision is added [41].
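The Dynamic k Estimation compared against fixed k in Table 4 is characterized in this section only as an estimate "based on the IoU values between predicted bounding boxes and each gt". A minimal sketch consistent with that description, summing each gt's top-q IoUs and truncating to an integer, is shown below; the function name and the candidate-pool size q are our own assumptions, not fixed by the text:

```python
import numpy as np

def dynamic_k_estimate(ious, q=10):
    """Sketch of a per-gt positive-anchor count from IoU evidence.

    ious : (num_gt, num_anchors) IoUs between each gt box and the
           predicted boxes of all candidate anchors.
    Each gt's k is the integer-truncated sum of its top-q IoUs,
    clamped to at least 1, so well-covered gts get more positives.
    """
    q = min(q, ious.shape[1])
    top_q = -np.sort(-ious, axis=1)[:, :q]   # top-q IoUs per gt row
    return np.clip(top_q.sum(axis=1).astype(int), 1, None)
```

The intuition matches the ablation: a large object overlapped by many high-quality predictions yields a large IoU sum and hence a large k, while a small or heavily occluded object is assigned fewer positives.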

Method Iteration Backbone AP AP50 AP75 APs APm APl
RetinaNet [21] 135k ResNet-101 39.1 59.1 42.3 21.8 42.7 50.2
FCOS [38] 180k ResNet-101 41.5 60.7 45.0 24.4 44.8 51.6
NoisyAnchor [19] 180k ResNet-101 41.8 61.1 44.9 23.4 44.9 52.9
FreeAnchor [48] 180k ResNet-101 43.1 62.2 46.4 24.5 46.1 54.8
SAPD [52] 180k ResNet-101 43.5 63.6 46.5 24.9 46.8 54.6
MAL [44] 180k ResNet-101 43.6 61.8 47.1 25.0 46.9 55.8
ATSS [47] 180k ResNet-101 43.6 62.1 47.4 26.1 47.0 53.6
AutoAssign [51] 180k ResNet-101 44.5 64.3 48.4 25.9 47.4 55.0
PAA [16] 180k ResNet-101 44.6 63.3 48.4 26.4 48.5 56.0
OTA (Ours) 180k ResNet-101 45.3 63.5 49.3 26.9 48.8 56.1
FoveaBox [17] 180k ResNeXt-101 42.1 61.9 45.2 24.9 46.8 55.6
FSAF [53] 180k ResNeXt-64x4d-101 42.9 63.8 46.3 26.6 46.2 52.7
FCOS [38] 180k ResNeXt-64x4d-101 43.2 62.8 46.6 26.5 46.2 53.3
NoisyAnchor [19] 180k ResNeXt-101 44.1 63.8 47.5 26.0 47.4 55.0
FreeAnchor [48] 180k ResNeXt-64x4d-101 44.9 64.3 48.5 26.8 48.3 55.9
SAPD [52] 180k ResNeXt-64x4d-101 45.4 65.6 48.9 27.3 48.7 56.8
ATSS [47] 180k ResNeXt-64x4d-101 45.6 64.6 49.7 28.5 48.9 55.6
MAL [44] 180k ResNeXt-101 45.9 65.4 49.7 27.8 49.1 57.8
AutoAssign [51] 180k ResNeXt-64x4d-101 46.5 66.5 50.7 28.3 49.7 56.6
PAA [16] 180k ResNeXt-64x4d-101 46.6 65.6 50.7 28.7 50.5 58.1
OTA (Ours) 180k ResNeXt-64x4d-101 47.0 65.8 51.1 29.2 50.4 57.9
SAPD [52] 180k ResNeXt-64x4d-101-DCN 47.4 67.4 51.1 28.1 50.3 61.5
ATSS [47] 180k ResNeXt-64x4d-101-DCN 47.7 66.5 51.9 29.7 50.8 59.4
AutoAssign [51] 180k ResNeXt-64x4d-101-DCN 48.3 67.4 52.7 29.2 51.0 60.3
PAA [16] 180k ResNeXt-64x4d-101-DCN 48.6 67.5 52.7 29.9 52.2 61.5
OTA (Ours) 180k ResNeXt-64x4d-101-DCN 49.2 67.6 53.5 30.0 52.5 62.3
ATSS [47]∗ 180k ResNeXt-64x4d-101-DCN 50.7 68.9 56.3 33.2 52.9 62.2
PAA [16]∗ 180k ResNeXt-64x4d-101-DCN 51.3 68.8 56.6 34.3 53.5 63.6
OTA (Ours)∗ 180k ResNeXt-64x4d-101-DCN 51.5 68.6 57.1 34.1 53.7 64.1
Table 5. Performance comparison with state-of-the-art one-stage detectors on the MS COCO 2017 test-dev set. * indicates the specific form of multi-scale testing adopted in ATSS [47].

ing results, which demonstrates OTA's superiority on various detection benchmarks.

Method                      MR    AP    Recall
Faster R-CNN with FPN [20]  48.7  86.1  90.4
RetinaNet [21]              58.8  81.0  88.2
FCOS [38]                   55.0  86.4  94.1
FreeAnchor [48]             51.3  83.9  89.8
ATSS [47]                   49.5  87.4  94.2
PAA [16]                    52.2  86.0  92.0
LLA [10]                    47.9  88.0  94.0
OTA (Ours)                  46.6  88.4  95.1
Table 6. Performance comparison on the CrowdHuman validation set. All experiments are conducted under the 2.5x scheduler.

5. Conclusion

In this paper, we propose Optimal Transport Assignment (OTA), an optimization-theory-based label assigning strategy. OTA formulates the label assigning procedure in object detection as an Optimal Transport problem, which aims to transport labels from ground-truth objects and backgrounds to anchors at minimal transporting cost. To determine the number of positive labels needed by each gt, we further propose a simple estimation strategy based on the IoU values between predicted bounding boxes and each gt. As shown in the experiments, OTA achieves new SOTA performance on MS COCO. Because OTA handles the assignment of ambiguous anchors well, it also outperforms all other one-stage detectors on the CrowdHuman dataset by a large margin, demonstrating its strong generalization ability.

Acknowledgements

This research was partially supported by the National Key R&D Program of China (No. 2017YFA0700800) and the Beijing Academy of Artificial Intelligence (BAAI).

References

[1] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[2] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.
[4] Yihong Chen, Zheng Zhang, Yue Cao, Liwei Wang, Stephen Lin, and Han Hu. Reppoints v2: Verification meets regression for object detection. arXiv preprint arXiv:2007.08508, 2020.
[5] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[7] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 304–311. IEEE, 2009.
[8] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6569–6578, 2019.
[9] Zheng Ge, Zequn Jie, Xin Huang, Rong Xu, and Osamu Yoshie. Ps-rcnn: Detecting secondary human instances in a crowd via primary object suppression. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2020.
[10] Zheng Ge, Jianfeng Wang, Xin Huang, Songtao Liu, and Osamu Yoshie. Lla: Loss-aware label assignment for dense pedestrian detection. arXiv preprint arXiv:2101.04307, 2021.
[11] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[14] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
[15] Xin Huang, Zheng Ge, Zequn Jie, and Osamu Yoshie. Nms by representative region: Towards crowded pedestrian detection by proposal pairing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10750–10759, 2020.
[16] Kang Kim and Hee Seok Lee. Probabilistic anchor assignment with iou prediction for object detection. arXiv preprint arXiv:2007.08103, 2020.
[17] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, Lei Li, and Jianbo Shi. Foveabox: Beyound anchor-based object detection. IEEE Transactions on Image Processing, 29:7389–7398, 2020.
[18] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
[19] Hengduo Li, Zuxuan Wu, Chen Zhu, Caiming Xiong, Richard Socher, and Larry S Davis. Learning from noisy anchors for one-stage object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10588–10597, 2020.
[20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[21] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[23] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 385–400, 2018.
[24] Songtao Liu, Di Huang, and Yunhong Wang. Adaptive nms: Refining pedestrian detection in a crowd. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6459–6468, 2019.
[25] Songtao Liu, Di Huang, and Yunhong Wang. Learning spatial fusion for single-shot object detection. arXiv preprint arXiv:1911.09516, 2019.
[26] Songtao Liu, Di Huang, and Yunhong Wang. Pay attention to them: deep reinforcement learning-based cascade object detection. IEEE transactions on neural networks and learning systems, 31(7):2544–2556, 2019.
[27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[28] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 821–830, 2019.
[29] Han Qiu, Yuchen Ma, Zeming Li, Songtao Liu, and Jian Sun. Borderdet: Border feature for dense object detection. In European Conference on Computer Vision, pages 549–564. Springer, 2020.
[30] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
[31] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
[32] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[34] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
[35] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
[36] Lin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. Fine-grained dynamic head for object detection. arXiv preprint arXiv:2012.03519, 2020.
[37] Lin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Xiangyu Zhang, Hongbin Sun, Jian Sun, and Nanning Zheng. Rethinking learnable tree filter for generic feature transform. arXiv preprint arXiv:2012.03482, 2020.
[38] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE international conference on computer vision, pages 9627–9636, 2019.
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
[40] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2965–2974, 2019.
[41] Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. End-to-end object detection with fully convolutional network. arXiv preprint arXiv:2012.03544, 2020.
[42] Jiaxi Wu, Songtao Liu, Di Huang, and Yunhong Wang. Multi-scale positive sample refinement for few-shot object detection. In European Conference on Computer Vision, pages 456–472. Springer, 2020.
[43] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
[44] Tong Yang, Xiangyu Zhang, Zeming Li, Wenqiang Zhang, and Jian Sun. Metaanchor: Learning to detect objects with customized anchors. In Advances in Neural Information Processing Systems, pages 320–330, 2018.
[45] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. Reppoints: Point set representation for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9657–9666, 2019.
[46] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM international conference on Multimedia, pages 516–520, 2016.
[47] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9759–9768, 2020.
[48] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. Freeanchor: Learning to match anchors for visual object detection. In Advances in Neural Information Processing Systems, pages 147–155, 2019.
[49] Yangtao Zheng, Di Huang, Songtao Liu, and Yunhong Wang. Cross-domain object detection through coarse-to-fine feature adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13766–13775, 2020.
[50] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[51] Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, and Jian Sun. Autoassign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020.
[52] Chenchen Zhu, Fangyi Chen, Zhiqiang Shen, and Marios Savvides. Soft anchor-point object detection. arXiv preprint arXiv:1911.12448, 2019.
[53] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 840–849, 2019.
[54] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.
A. Appendix

A.1. Optimal Transport and Sinkhorn Iteration

To ensure the integrity of this paper, we briefly introduce the derivation of the Sinkhorn Iteration algorithm, which, we emphasize, is not our contribution but textbook knowledge.

The mathematical formulation of the Optimal Transport problem is defined in Eq. 1. This is a linear program which can be solved in polynomial time. For dense detectors, however, the resulting linear program is large, involving the square of the feature dimensions with anchors at all scales. This issue can be addressed by a fast iterative solution, which converts the optimization target in Eq. 1 into a non-linear but convex form by adding an entropic regularization term E:

    \min_{\pi} \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} \pi_{ij} + \gamma E(\pi_{ij}),    (5)

where E(\pi_{ij}) = \pi_{ij} (\log \pi_{ij} - 1), and \gamma is a constant hyperparameter controlling the intensity of the regularization term. According to the Lagrange multiplier method, the constrained optimization target in Eq. 5 can be converted into an unconstrained one:

    \min_{\pi} \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} \pi_{ij} + \gamma E(\pi_{ij}) + \sum_{j=1}^{n} \alpha_j \Big( \sum_{i=1}^{m} \pi_{ij} - d_j \Big) + \sum_{i=1}^{m} \beta_i \Big( \sum_{j=1}^{n} \pi_{ij} - s_i \Big),    (6)

where \alpha_j (j = 1, 2, \dots, n) and \beta_i (i = 1, 2, \dots, m) are Lagrange multipliers. Setting the derivatives of the optimization target with respect to \pi_{ij} to zero resolves the optimal plan \pi^* as:

    \pi^*_{ij} = \exp\!\big(-\tfrac{\alpha_j}{\gamma}\big) \exp\!\big(-\tfrac{c_{ij}}{\gamma}\big) \exp\!\big(-\tfrac{\beta_i}{\gamma}\big).    (7)

Letting u_j = \exp(-\tfrac{\alpha_j}{\gamma}), v_i = \exp(-\tfrac{\beta_i}{\gamma}), and M_{ij} = \exp(-\tfrac{c_{ij}}{\gamma}), the following constraints can be enforced:

    \sum_{i} \pi_{ij} = u_j \Big( \sum_{i} M_{ij} v_i \Big) = d_j,    (8)

    \sum_{j} \pi_{ij} = v_i \Big( \sum_{j} M_{ij} u_j \Big) = s_i.    (9)

These two equations have to be satisfied simultaneously. One possible solution is to calculate v_i and u_j by repeating the following update formulas for a sufficient number of steps:

    u_j^{t+1} = \frac{d_j}{\sum_i M_{ij} v_i^t}, \qquad v_i^{t+1} = \frac{s_i}{\sum_j M_{ij} u_j^{t+1}}.    (10)

The update rule in Eq. 10 is also known as the Sinkhorn-Knopp Iteration. After repeating this iteration T times, the approximate optimal plan \pi^* can be obtained:

    \pi^* = \mathrm{diag}(v)\, M\, \mathrm{diag}(u).    (11)

\gamma and T are empirically set to 0.1 and 50. Please refer to our code for more details.
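The derivation above (Eqs. 8-11) translates almost line for line into code. Below is a minimal NumPy sketch of the Sinkhorn-Knopp iteration with the paper's defaults of gamma = 0.1 and T = 50; the function name and argument layout are our own, not taken from the paper's released code:

```python
import numpy as np

def sinkhorn(cost, supply, demand, gamma=0.1, n_iters=50):
    """Entropic-regularized Optimal Transport via Sinkhorn-Knopp.

    cost   : (m, n) cost matrix c_ij
    supply : (m,) supplier marginals s_i
    demand : (n,) demander marginals d_j (sum(supply) == sum(demand))
    Returns the (m, n) approximate optimal plan pi* = diag(v) M diag(u).
    """
    M = np.exp(-cost / gamma)           # M_ij = exp(-c_ij / gamma)
    u = np.ones(cost.shape[1])          # u_j, one entry per demander
    v = np.ones(cost.shape[0])          # v_i, one entry per supplier
    for _ in range(n_iters):            # Eq. 10, repeated T times
        u = demand / (M.T @ v)          # u_j = d_j / sum_i M_ij v_i^t
        v = supply / (M @ u)            # v_i = s_i / sum_j M_ij u_j^{t+1}
    return v[:, None] * M * u[None, :]  # Eq. 11: diag(v) M diag(u)
```

On a toy 2x2 problem with unit marginals and a cost matrix that makes the diagonal cheap, the returned plan concentrates its mass on the diagonal while matching both row (supply) and column (demand) marginals; smaller gamma gives a plan closer to the exact linear-program solution at the price of slower, less stable iterations.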