
High-level Semantic Feature Detection:
A New Perspective for Pedestrian Detection

Wei Liu1,2*, Shengcai Liao3†, Weiqiang Ren4, Weidong Hu1, Yinan Yu4
1 ATR, College of Electronic Science, National University of Defense Technology, Changsha, China
2 CBSR & NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
3 Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE
4 Horizon Robotics Inc., Beijing, China
{liuwei16, wdhu}@nudt.edu.cn, scliao@ieee.org, {weiqiang.ren, yinan.yu}@hobot.cc

arXiv:1904.02948v1 [cs.CV] 5 Apr 2019

Abstract

Object detection generally requires sliding-window classifiers in tradition or anchor-based predictions in modern deep learning approaches. However, either of these approaches requires tedious configurations in windows or anchors. In this paper, taking pedestrian detection as an example, we provide a new perspective where detecting objects is motivated as a high-level semantic feature detection task. Like edges, corners, blobs and other feature detectors, the proposed detector scans for feature points all over the image, for which the convolution is naturally suited. However, unlike these traditional low-level features, the proposed detector goes for a higher-level abstraction, that is, we are looking for central points where there are pedestrians, and modern deep models are already capable of such a high-level semantic abstraction. Besides, like blob detection, we also predict the scales of the pedestrian points, which is also a straightforward convolution. Therefore, in this paper, pedestrian detection is simplified as a straightforward center and scale prediction task through convolutions. This way, the proposed method enjoys an anchor-free setting. Though structurally simple, it presents competitive accuracy and good speed on challenging pedestrian detection benchmarks, hence leading to a new attractive pedestrian detector. Code and models will be available at https://github.com/liuwei16/CSP.

Figure 1. The overall pipeline of the proposed CSP detector. The final convolutions have two channels: one is a heatmap indicating the locations of the centers (red dots), and the other serves to predict the scales (yellow dotted lines) for each detected center.

* Wei Liu finished his part of the work during his visit to CASIA.
† Shengcai Liao is the corresponding author. He was previously with CASIA.

1. Introduction

Feature detection is one of the most fundamental problems in computer vision. It is usually viewed as a low-level technique, with typical tasks including edge detection (e.g. Canny [4], Sobel [41]), corner (or interest point) detection (e.g. SUSAN [40], FAST [37]), and blob (or region of interest) detection (e.g. LoG [25], DoG [31], MSER [33]). Feature detection is of vital importance to a variety of computer vision tasks ranging from image representation and image matching to 3D scene reconstruction, to name a few.

Generally speaking, a feature is defined as an "interesting" part of an image, so feature detection aims to compute abstractions of image information and make a local decision at every image point as to whether there is an image feature of a given type at that point or not [1]. Regarding abstraction of image information, with the rapid development of computer vision, deep convolutional neural networks (CNNs) are believed to have a very good capability of learning high-level image abstractions. Therefore, they have also been applied to feature detection, and demonstrate attractive successes even in low-level feature detection. For example, there is a recent trend of using CNNs to perform edge detection [39, 47, 2, 29], which has substantially advanced this field. It shows that clean and continuous edges can be obtained by deep convolutions, which indicates that CNNs have a stronger capability of learning higher-level abstractions of natural images than traditional methods. This capability may
not be limited to low-level feature detection; it may open up many other possibilities for high-level feature detection.

Therefore, in this paper, taking pedestrian detection as an example, we provide a new perspective where detecting objects is motivated as a high-level semantic feature detection task. Like edges, corners, blobs and other feature detectors, the proposed detector scans for feature points all over the image, for which the convolution is naturally suited. However, unlike these traditional low-level feature detectors, the proposed detector goes for a higher-level abstraction, that is, we are looking for central points where there are pedestrians. Besides, similar to blob detection, we also predict the scales of the pedestrian points. However, instead of processing an image pyramid to determine the scale as in traditional blob detection, we predict the object scale with an equally straightforward convolution in one pass upon a fully convolutional network (FCN) [30], considering its strong capability. As a result, pedestrian detection is simply formulated as a straightforward center and scale prediction task via convolution. The overall pipeline of the proposed method, denoted as the Center and Scale Prediction (CSP) based detector, is illustrated in Fig. 1.

As for general object detection, starting from the pioneering work of the Viola-Jones detector [45], it generally requires sliding-window classifiers in tradition or anchor-based predictions in CNN-based methods. These detectors are essentially local classifiers used to judge the pre-defined windows or anchors as being objects or not. However, either of these approaches requires tedious configurations of windows or anchors. Generally speaking, object detection is to tell where the object is and how big it is. Traditional methods combine the "where" and "how" subproblems into a single one through the overall judgement of various scales of windows or anchors. In contrast, the proposed CSP detector separates the "where" and "how" subproblems into two different convolutions. This makes detection more natural and enjoys a window-free and anchor-free setting, significantly reducing the difficulty of training.

There is another line of research which inspires us a lot. Previously, FCNs have already been applied to and made a success in multi-person pose estimation [5, 34], where several keypoints are first detected merely through the responses of full convolutions, and then they are further grouped into the complete poses of individual persons. In view of this, recently two inspirational works, CornerNet [18] and TLL [42], successfully go free from windows and anchors by performing object detection as convolutional keypoint detections and their associations. Though the keypoint association requires additional computations, sometimes complex as in TLL, the keypoint prediction by FCN inspires us to go a step further, achieving center and scale prediction based pedestrian detection in full convolutions.

In summary, the main contributions of this work are as follows: (i) We show a new possibility that pedestrian detection can be simplified as a straightforward center and scale prediction task through convolutions, which bypasses the limitations of anchor-based detectors and gets rid of the complex post-processing of recent keypoint-pairing based detectors. (ii) The proposed CSP detector achieves new state-of-the-art performance on two challenging pedestrian detection benchmarks, CityPersons [51] and Caltech [9].

2. Related Works

2.1. Anchor-based object detection

One key component of anchor-based detectors is the anchor boxes of pre-defined scales and aspect ratios. In this way, detection is performed by classifying and regressing these anchor boxes. Faster R-CNN [36] is known as a two-stage detector, which generates objectness proposals and further classifies and refines these proposals in a single framework. In contrast, single-stage detectors, popularized by SSD [27], remove the proposal generation step and achieve comparable accuracy while being more efficient than two-stage detectors. In terms of pedestrian detection, Faster R-CNN has become the predominant framework. For example, RPN+BF [48] adapts the RPN and re-scores its proposals via boosted forests. MS-CNN [3] also applies the Faster R-CNN framework but generates proposals on multi-scale feature maps. Zhang et al. [51] contribute five strategies to adapt the plain Faster R-CNN for pedestrian detection. RepLoss [46] and OR-CNN [52] design two novel regression losses to tackle occluded pedestrian detection in crowded scenes. Bi-Box [53] proposes an auxiliary sub-network to predict the visible part of a pedestrian instance. Most recently, single-stage detectors have also presented competitive performance. For example, ALFNet [28] proposes the asymptotic localization fitting strategy to evolve the default anchor boxes step by step into precise detection results, and [21] focuses on discriminative feature learning based on the original SSD architecture.

2.2. Anchor-free object detection

Anchor-free detectors bypass the requirement of anchor boxes and detect objects directly from an image. DeNet [44] proposes to generate proposals by predicting the confidence of each location belonging to one of the four corners of objects. Following the two-stage pipeline, DeNet also appends another sub-network to re-score these proposals. Within the single-stage framework, YOLO [35] appends fully-connected layers to parse the final feature maps of a network into class confidence scores and box coordinates. DenseBox [14] devises a unified FCN that directly regresses the classification scores and the distances to the boundary of a ground-truth box on all pixels, and demonstrates improved performance with landmark localization via multi-task learning. Most recently,
CornerNet [18] also applies an FCN to predict objects' top-left and bottom-right corners and then groups them via associative embedding [34]. Enhanced by the novel corner pooling layer, CornerNet achieves superior performance on the MS COCO object detection benchmark [22]. Similarly, TLL [42] proposes to detect an object by predicting its top and bottom vertexes. To group these paired keypoints into individual instances, it also predicts the link edge between them and employs a post-processing scheme based on a Markov Random Field. Applied to pedestrian detection, TLL achieves significant improvement on Caltech [9], especially for small-scale pedestrians.

Our work also falls within anchor-free object detection, but with significant differences from all the above methods. We try to answer to what extent a single FCN can be simplified for pedestrian detection, and demonstrate that a single center point is feasible for object localization. Along with the scale prediction, CSP is able to generate bounding boxes without any requirement for extra post-processing schemes except Non-Maximum Suppression (NMS).

2.3. Feature detection

Feature detection is a long-standing problem in computer vision with an extensive literature. Generally speaking, it mainly includes edge detection [4, 41], corner detection [37, 38], blob detection [33, 7] and so on. Traditional leading methods [4, 41] mainly focus on the utilization of local cues, such as brightness, colors, gradients and textures. With the development of CNNs, a series of CNN-based methods have been proposed that significantly push forward the state of the art in feature detection. For example, there is a recent trend of using CNNs to perform edge detection [39, 47, 2, 29], which has substantially advanced this field. However, different from these low-level feature points like edges, corners and blobs, the proposed method goes for a higher-level abstraction task, that is, we focus on detecting central points where there are pedestrians, for which modern deep models are already capable.

3. Proposed Method

3.1. Preliminary

CNN-based object detectors often rely on a backbone network (e.g. ResNet [12]). Taking an image I as input, the network generates several feature maps with different resolutions, which can be defined as follows:

    φi = fi(φi−1) = fi(fi−1(...f2(f1(I)))),    (1)

where φi represents the feature maps output by the ith layer. These feature maps decrease in size progressively and are generated by fi(.), which may be a combination of convolution, pooling, etc. Given a network with N layers, all the generated feature maps can be denoted as Φ = {φ1, φ2, ..., φN}, which is further utilized by detection heads.

Generally speaking, CNN-based object detectors differ in how they utilize Φ. We denote the feature maps that are responsible for detection as Φdet. In RPN [36], only the final feature map φN is used to perform detection, thus the final set of feature maps for detection is Φdet = {φN}. In SSD [27], the detection feature maps can be represented as Φdet = {φL, φL+1, ..., φN}, where 1 < L < N. Further, in order to enrich the semantic information of the shallower layers for detecting small-scale objects, FPN [23] and DSSD [10] utilize lateral connections to combine feature maps of different resolutions, resulting in Φdet = {φ'L, φ'L+1, ..., φ'N}, where φ'i (i = L, L+1, ..., N) is a combination of φi (i = L, L+1, ..., N).

Besides Φdet, in anchor-based detectors another key component is the set of anchor boxes (denoted as B). Given Φdet and B, detection can be formulated as:

    Dets = H(Φdet, B) = {cls(Φdet, B), regr(Φdet, B)},    (2)

where B is pre-defined according to the corresponding set of feature maps Φdet, and H(.) represents the detection head. Generally, H(.) contains two elements, namely cls(.), which predicts the classification scores, and regr(.), which predicts the scalings and offsets of the anchor boxes.

In anchor-free detectors, detection is performed merely on the set of feature maps Φdet, that is,

    Dets = H(Φdet).    (3)
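To make the contrast between Eq. 2 and Eq. 3 concrete, the following minimal sketch (our illustration, not code from the paper) expresses the two styles of detection head as Python functions; the cls/regr bodies are hypothetical stand-ins.

```python
# Illustration of Eq. 2 vs. Eq. 3 (hypothetical stand-in functions).
import numpy as np

def cls(phi_det, B):
    """Stand-in classifier: one objectness score per anchor box."""
    return np.zeros(len(B))

def regr(phi_det, B):
    """Stand-in regressor: 4 scaling/offset values per anchor box."""
    return np.zeros((len(B), 4))

def detect_anchor_based(phi_det, B):
    # Eq. 2: detections are classifications and regressions of B.
    return cls(phi_det, B), regr(phi_det, B)

def detect_anchor_free(phi_det):
    # Eq. 3: detections are parsed from the feature maps alone;
    # CSP instantiates this head as center and scale prediction maps.
    center_map = phi_det.mean(axis=-1)      # placeholder
    scale_map = np.zeros_like(center_map)   # placeholder
    return center_map, scale_map
```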
3.2. Overall architecture

The overall architecture of the proposed CSP detector is illustrated in Fig. 2. The backbone network is truncated from a standard network pretrained on ImageNet [8] (e.g. ResNet-50 [12] or MobileNet [13]).

Feature Extraction. Taking ResNet-50 as an example, its convolutional layers can be divided into five stages, in which the output feature maps are downsampled by 2, 4, 8, 16 and 32 w.r.t. the input image. As a common practice [46, 42], dilated convolutions are adopted in stage 5 to keep its output at 1/16 of the input image size. We denote the outputs of stages 2, 3, 4 and 5 as φ2, φ3, φ4 and φ5, in which the shallower feature maps can provide more precise localization information, while the coarser ones contain more semantic information owing to their increasing receptive fields. Therefore, we fuse the multi-scale feature maps from each stage into a single one in a simple way: a deconvolution layer is adopted to bring the multi-scale feature maps to the same resolution before concatenation. Since the feature maps from each stage have different scales, we use L2-normalization to rescale their norms to 10, which is similar to [21].
Figure 2. Overall architecture of CSP, which mainly comprises two components, i.e. the feature extraction module and the detection head.
The feature extraction module concatenates feature maps of different resolutions into a single one. The detection head merely contains a
3x3 convolutional layer, followed by two prediction layers, one for the center location and the other for the corresponding scale.

To investigate the optimal combination of these multi-scale feature maps, we conduct an ablative experiment in Sec. 4.2 and demonstrate that Φdet = {φ3, φ4, φ5} is the best choice. Given an input image of size H × W, the size of the final concatenated feature maps is H/r × W/r, where r is the downsampling factor. Similarly to [42], r = 4 gives the best performance as demonstrated in our experiments, because a larger r means coarser feature maps which struggle with accurate localization, while a smaller r brings more computational burden. Note that more complicated feature fusion strategies like [23, 15, 17] could be explored to further improve the detection performance, but they are not in the scope of this work.
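The fusion just described can be sketched roughly as follows in Keras (the framework the paper implements in, Sec. 4.1). This is a minimal sketch, not the authors' code: the deconvolution kernel sizes, the 256-channel width, and the Lambda-based rescaling are our assumptions; only the target 1/4 resolution, the L2-rescaling to norm 10, and the concatenation follow the text.

```python
# Assumes phi3, phi4, phi5 are backbone outputs at 1/8, 1/16, 1/16
# resolution (stage 5 keeps 1/16 due to dilation), fused to 1/4.
import tensorflow as tf
from tensorflow.keras import layers

def l2_rescale(x, scale=10.0):
    # L2-normalize each pixel's channel vector, then rescale its norm.
    return scale * tf.math.l2_normalize(x, axis=-1)

def fuse(phi3, phi4, phi5):
    # Deconvolution (transposed conv) brings every stage to 1/4 resolution.
    up3 = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(phi3)
    up4 = layers.Conv2DTranspose(256, 4, strides=4, padding='same')(phi4)
    up5 = layers.Conv2DTranspose(256, 4, strides=4, padding='same')(phi5)
    normed = [layers.Lambda(l2_rescale)(u) for u in (up3, up4, up5)]
    return layers.Concatenate(axis=-1)(normed)
```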
Figure 3. (a) shows the bounding box annotations commonly adopted by anchor-based detectors. (b) shows the center and scale ground truth generated automatically from (a). Locations of all objects' center points are assigned as positives, and negatives otherwise. Each pixel is assigned the scale value of the corresponding object if it is a positive point, or 0 otherwise. We only show the height information of the two positives for clarity. (c) is the overall Gaussian mask map M defined in Eq. 4 to reduce the ambiguity of the negatives surrounding the positives.
Detection Head. Upon the concatenated feature maps Φdet, a detection head is appended to parse them into detection results. As stated in [26], the detection head plays a significant role in top performance, and it has been extensively explored in the literature [10, 26, 20, 19]. In this work, we first attach a single 3×3 convolutional layer on Φdet to reduce its channel dimension to 256, and then two sibling 1×1 convolutional layers are appended to produce the center heatmap and the scale map, respectively. We do this for simplicity, and any improvement of the detection head [10, 26, 20, 19] can be flexibly incorporated into this work to build a better detector.

A drawback of the downsampled feature maps is poor localization. Optionally, to slightly adjust the center location, an extra offset prediction branch can be appended in parallel with the above two branches.
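A minimal sketch of this head follows; the paper only specifies the 3×3 conv to 256 channels and the sibling 1×1 convs, so the activations and layer names below are our assumptions.

```python
# Hedged sketch of the detection head: sigmoid for the heatmap and
# linear outputs for log-scale/offsets are assumptions, not specified
# in the paper.
from tensorflow.keras import layers

def detection_head(fused, with_offset=True):
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(fused)
    center = layers.Conv2D(1, 1, activation='sigmoid', name='center')(x)
    scale = layers.Conv2D(1, 1, name='scale')(x)          # predicts log(h)
    if with_offset:
        offset = layers.Conv2D(2, 1, name='offset')(x)    # (dx, dy) in cells
        return center, scale, offset
    return center, scale
```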
3.3. Training

Ground Truth. The predicted heatmaps have the same size as the concatenated feature maps (i.e. H/r × W/r). Given the bounding box annotations, we can generate the center and scale ground truth automatically. An illustrative example is depicted in Fig. 3 (b). For the center ground truth, the location where an object's center point falls is assigned as positive, while all other locations are negatives.

Scale can be defined as the height and/or width of objects. Towards high-quality ground truth for pedestrian detection, line annotation was first proposed in [50, 51], where tight bounding boxes are automatically generated with a uniform aspect ratio of 0.41. In accordance with this annotation, we can merely predict the height of each object and generate the bounding box with the predetermined aspect ratio. For the scale ground truth, the kth positive location is assigned the value of log(hk) corresponding to the kth object. To reduce ambiguity, log(hk) is also assigned to the negatives within a radius of 2 of the positives, while all other locations are assigned zeros. Alternatively, we can also predict the width or height+width, but with slightly poorer performance for pedestrian detection as demonstrated
in our experiments (Sec. 4.2).

When the offset prediction branch is appended, the ground truth for the offsets of the centers can be defined as (xk/r − ⌊xk/r⌋, yk/r − ⌊yk/r⌋).
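The ground-truth construction above can be summarized in a short NumPy sketch (an illustration, not the authors' code; the square window used for the log-height neighborhood and the σ-to-box-size ratio are assumptions, while the positive assignment, log(h) values, offset residuals and max-combined Gaussian masks follow the text and Eq. 4 below).

```python
import numpy as np

def build_targets(boxes, H, W, r=4, sigma_ratio=0.5):
    """boxes: list of (x1, y1, x2, y2) in input-image pixels."""
    h, w = H // r, W // r
    center = np.zeros((h, w))
    scale = np.zeros((h, w))          # stores log(height)
    offset = np.zeros((h, w, 2))
    M = np.zeros((h, w))              # Gaussian mask of Eq. 4
    rows, cols = np.mgrid[0:h, 0:w]
    for (x1, y1, x2, y2) in boxes:
        cx, cy = (x1 + x2) / 2 / r, (y1 + y2) / 2 / r
        ix, iy = int(cx), int(cy)
        center[iy, ix] = 1                       # positive location
        offset[iy, ix] = (cx - ix, cy - iy)      # sub-cell residual
        # log-height at the positive and within a radius of 2 cells
        scale[max(iy-2, 0):iy+3, max(ix-2, 0):ix+3] = np.log(y2 - y1)
        # Gaussian bump; sigma proportional to box size (ratio assumed)
        sw = sigma_ratio * (x2 - x1) / r
        sh = sigma_ratio * (y2 - y1) / r
        g = np.exp(-(((cols - cx) ** 2) / (2 * sw ** 2)
                     + ((rows - cy) ** 2) / (2 * sh ** 2)))
        M = np.maximum(M, g)          # overlaps keep the maximum
    return center, scale, offset, M
```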
Loss Function. For the center prediction branch, we formulate it as a classification task via the cross-entropy loss. Note that it is difficult to decide an 'exact' center point, so the hard designation of positives and negatives brings difficulties for training. In order to reduce the ambiguity of the negatives surrounding the positives, we also apply a 2D Gaussian mask G(.) centered at the location of each positive. An illustrative example of the overall mask map M is depicted in Fig. 3 (c). Formally, it is formulated as:

    Mij = max_{k=1,2,...,K} G(i, j; xk, yk, σwk, σhk),
    G(i, j; x, y, σw, σh) = exp(−((i − x)²/(2σw²) + (j − y)²/(2σh²))),    (4)

where K is the number of objects in an image, (xk, yk, wk, hk) are the center coordinates, width and height of the kth object, and the variances (σwk, σhk) of the Gaussian masks are proportional to the height and width of the individual objects. If the masks overlap, we choose the maximum values at the overlapped locations. To combat the extreme positive-negative imbalance problem, the focal weights [24] on hard examples are also adopted. Thus, the classification loss can be formulated as:

    Lcenter = −(1/K) Σ_{i=1..W/r} Σ_{j=1..H/r} αij (1 − p̂ij)^γ log(p̂ij),    (5)

where

    p̂ij = { pij          if yij = 1
           { 1 − pij      otherwise,
                                                    (6)
    αij  = { 1            if yij = 1
           { (1 − Mij)^β  otherwise.

In the above, pij ∈ [0, 1] is the network's estimated probability indicating whether there is an object's center at location (i, j), and yij ∈ {0, 1} specifies the ground truth label, where yij = 1 represents a positive location. αij and γ are the focusing hyper-parameters; we experimentally set γ = 2 as suggested in [24]. To reduce the ambiguity from the negatives surrounding the positives, the weight αij according to the Gaussian mask M is applied to reduce their contribution to the total loss, in which the hyper-parameter β controls the penalty. Experimentally, β = 4 gives the best performance, which is similar to the setting in [18]. For positives, αij is set to 1.

For scale prediction, we formulate it as a regression task via the smooth L1 loss [11]:

    Lscale = (1/K) Σ_{k=1..K} SmoothL1(sk, tk),    (7)

where sk and tk represent the network's prediction and the ground truth for each positive, respectively.

If the offset prediction branch is appended, a similar smooth L1 loss to that in Eq. 7 is adopted (denoted as Loffset). To sum up, the full optimization objective is:

    L = λc Lcenter + λs Lscale + λo Loffset,    (8)

where λc, λs and λo are the weights for the center classification, scale regression and offset regression losses, which are experimentally set to 0.01, 1 and 0.1, respectively.
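Eqs. 5-8 translate almost line by line into code. The following NumPy sketch is a direct transcription for clarity (not the training code; γ = 2, β = 4 and the loss weights follow the text, and s, t are assumed to be arrays holding the K positives' predictions and targets).

```python
import numpy as np

def center_loss(p, y, M, K, gamma=2.0, beta=4.0, eps=1e-12):
    p_hat = np.where(y == 1, p, 1 - p)                   # Eq. 6 (top)
    alpha = np.where(y == 1, 1.0, (1 - M) ** beta)       # Eq. 6 (bottom)
    return -np.sum(alpha * (1 - p_hat) ** gamma * np.log(p_hat + eps)) / K

def smooth_l1(s, t):
    d = np.abs(s - t)
    return np.where(d < 1, 0.5 * d * d, d - 0.5)

def total_loss(p, y, M, s, t, K, lc=0.01, ls=1.0, lo=0.1, offs=None):
    # Eq. 8: weighted sum of center, scale and (optional) offset losses.
    L = lc * center_loss(p, y, M, K) + ls * np.mean(smooth_l1(s, t))
    if offs is not None:                                 # (pred, gt) pair
        L += lo * np.mean(smooth_l1(*offs))
    return L
```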
Data Augmentation. To increase the diversity of the training data, standard data augmentation techniques are adopted. Firstly, random color distortion and horizontal flipping are applied, followed by random scaling in the range of [0.4, 1.5]. Secondly, a patch is cropped or expanded by zero-padding such that the shorter side has a fixed number of pixels (640 for CityPersons [51], and 336 for Caltech [9]). Note that the aspect ratio of the image is kept during this process.

3.4. Inference

During testing, CSP simply involves a single forward pass of the FCN with several predictions. Specifically, locations with a confidence score above 0.01 in the center heatmap are kept, along with their corresponding scales in the scale map. Then bounding boxes are generated automatically and remapped to the original image size, followed by NMS with a threshold of 0.5. If the offset prediction branch is appended, the centers are adjusted accordingly before remapping.
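The decoding step can be sketched as follows (our illustration, not the released code: the half-cell center shift and the greedy NMS are assumptions, while the 0.01 threshold, the 0.41 aspect ratio, the remapping by r and the 0.5 NMS threshold follow the text).

```python
import numpy as np

def decode(center, scale, offset=None, r=4, thr=0.01, ratio=0.41):
    ys, xs = np.where(center > thr)
    boxes, scores = [], []
    for y, x in zip(ys, xs):
        h = np.exp(scale[y, x])               # predicted height in pixels
        cx, cy = x + 0.5, y + 0.5
        if offset is not None:                # optional center adjustment
            cx, cy = x + offset[y, x, 0], y + offset[y, x, 1]
        w = ratio * h                         # width from the 0.41 ratio
        boxes.append([r * cx - w / 2, r * cy - h / 2,
                      r * cx + w / 2, r * cy + h / 2])
        scores.append(center[y, x])
    if not boxes:
        return np.empty((0, 4)), np.empty(0)
    boxes, scores = np.array(boxes), np.array(scores)
    keep = nms(boxes, scores, iou_thr=0.5)
    return boxes[keep], scores[keep]

def nms(boxes, scores, iou_thr=0.5):
    # Simple greedy NMS: keep the highest-scoring box, drop overlaps.
    order = np.argsort(-scores)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (area[i] + area[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]
    return keep
```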

4. Experiments

4.1. Experiment settings

Datasets. To demonstrate the effectiveness of the proposed method, we evaluate on two of the largest pedestrian detection benchmarks, i.e. Caltech [9] and CityPersons [51]. Caltech comprises approximately 2.5 hours of auto-driving video with extensively labelled bounding boxes. Following [51, 32, 46, 28, 52], we use the training data augmented 10-fold (42782 frames) and test on the 4024 frames in the standard test set; all experiments are conducted on the new annotations provided by [49]. CityPersons is a more challenging large-scale pedestrian detection dataset with various occlusion levels. We train the models on the official training set with 2975 images and test on the validation set with 500 images.

One reason we choose these two datasets is that they provide bounding boxes via central body line annotation and a normalized aspect ratio; this annotation procedure helps ensure that the boxes align well with the centers of pedestrians. Evaluation follows the standard Caltech evaluation metric [9], that is, the log-average Miss Rate over False Positives Per Image (FPPI) ranging in [10−2, 100] (denoted as MR−2). Tests are only applied on the original image size without enlarging, for speed consideration.
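For reference, a common way to compute MR−2 on Caltech (assumed here; the paper does not spell it out) samples the miss rate at nine FPPI points evenly spaced in log-space over [10−2, 100] and reports the geometric mean:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """fppi, miss_rate: arrays tracing the detector's FPPI-MR curve,
    with fppi sorted in ascending order."""
    ref = np.logspace(-2, 0, 9)              # 9 points in [1e-2, 1]
    sampled = []
    for f in ref:
        idx = np.where(fppi <= f)[0]
        # if the curve never reaches this FPPI, take its first value
        sampled.append(miss_rate[idx[-1]] if idx.size else miss_rate[0])
    return np.exp(np.mean(np.log(np.maximum(sampled, 1e-10))))
```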
Point           MR−2 (%)
Prediction      IoU=0.5    IoU=0.75
Center point    4.62       36.47
Top vertex      7.75       44.70
Bottom vertex   6.52       40.25

Table 1. Comparisons of different high-level feature points. Bold numbers indicate the best results.

Scale           MR−2 (%)
Prediction      IoU=0.5    IoU=0.75
Height          4.62       36.47
Width           5.31       53.06
Height+Width    4.73       41.09

Table 2. Comparisons of different definitions for scale prediction. Bold numbers indicate the best results.
Training details. We implement the proposed method in Keras [6]. The backbone is ResNet-50 [12] pretrained on ImageNet [8] unless otherwise stated. Adam [16] is applied to optimize the network. We also apply the moving-average-weights strategy proposed in [43] to achieve more stable training. For Caltech [9], a mini-batch contains 16 images on one GPU (GTX 1080Ti), the learning rate is set to 10−4, and training is stopped after 15K iterations. Following [51, 46, 28, 52], we also include experiments with the model initialized from CityPersons [51], which is trained with a learning rate of 2 × 10−5. For CityPersons [51], we optimize the network on 4 GPUs with 2 images per GPU in a mini-batch; the learning rate is set to 2 × 10−4 and training is stopped after 37.5K iterations.
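A hedged sketch of this setup for CityPersons follows (the EMA decay value is an assumption; [43] describes the moving-average idea, shown here as a simple exponential moving average over model weights):

```python
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=2e-4)   # CityPersons setting from the text

def ema_update(avg_weights, new_weights, decay=0.999):
    # Exponential moving average of weights, a stand-in for [43];
    # the decay value is hypothetical.
    return [decay * a + (1 - decay) * w
            for a, w in zip(avg_weights, new_weights)]
```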
4.2. Ablation Study

In this section, an ablative analysis of the proposed method is conducted on the Caltech dataset; evaluations are based on the new annotations provided by [49].

Why the Center Point? As a kind of high-level feature point, the center point is capable of locating an individual object. A natural question is how other high-level feature points fare. To answer this, we choose two other high-level feature points as adopted in [42], i.e. the top and bottom vertexes. Comparisons are reported in Table 1. It is shown that both vertexes can succeed in detection but underperform the center point by approximately 2%-3% under IoU=0.5, and the performance gap is even larger under the stricter IoU=0.75. This is probably because the center point is advantageous for perceiving the full-body information and is thus easier to train.

How important is the Scale Prediction? Scale prediction is another indispensable component for bounding box generation. In practice, we merely predict the height for each detected center, in accordance with the line annotation in [50, 51]. To demonstrate the generality of CSP, we have also tried to predict the Width or Height+Width for comparison. For Height+Width, the only difference in the network architecture is that the scale prediction branch has two channels, responsible for the height and width respectively. It can be observed in Table 2 that Width and Height+Width prediction achieve comparable but suboptimal results relative to Height prediction. This result may be attributed to the line annotation adopted in [50, 51], which provides accurate height information with less noise during training. Besides, the ground truth for width is automatically generated from the annotated height information, and thus cannot provide additional information for training. Given the comparable performance of Height+Width prediction, CSP is potentially feasible for other object detection tasks requiring both height and width.

How important is the Feature Resolution? In the proposed method, the final set of feature maps (denoted as Φrdet) is downsampled by r w.r.t. the input image. To explore the influence of r, we train models with r = 2, 4, 8, 16, respectively. For r = 2, Φ2det is upsampled from Φ4det by deconvolution. To remedy the issue of poor localization from downsampling, the offset prediction branch is alternatively appended for r = 4, 8, 16 to adjust the center location. Evaluations under IoU=0.75 are included to verify the effectiveness of the additional offset prediction when stricter localization quality is required. As can be seen from Table 3, without offset prediction, Φ4det presents the best result under IoU=0.5 but performs poorly under IoU=0.75 compared with Φ2det, which indicates that finer feature maps are beneficial for precise localization. Though Φ2det performs the best under IoU=0.75, it does not bring a performance gain under IoU=0.5 despite more computational burden. Not surprisingly, a larger r brings a significant performance drop, mainly because coarser feature maps lead to poor localization. In this case, the offset prediction plays a significant role. Notably, the additional offset prediction substantially improves the detector upon Φ16det by 12.86% and 41.30% under the IoU thresholds of 0.5 and 0.75, respectively. It also achieves an improvement of 7.67% under IoU=0.75 for the detector upon Φ4det, even though the performance gain is saturated under IoU=0.5. It is worth noting that the extra computational cost of the offset prediction is negligible, at approximately 1ms per image of 480x640 pixels.

How important is the Feature Combination? It has been revealed in [42] that multi-scale representation is vital for detecting pedestrians of various scales. In this part, we conduct an ablative experiment to study which combination of the multi-scale feature maps from the backbone is optimal. As the much lower layers have limited discriminant information, in practice we choose the output of stage 2 (φ2) as the starting point, and the downsampling factor r is fixed at 4.
Feature for    +Offset    Test Time    MR−2 (%)              ΔMR−2 (%)
Detection                 (ms/img)     IoU=0.5   IoU=0.75    IoU=0.5   IoU=0.75
Φ2det                     69.8         5.32      30.08       -         -
Φ4det                     58.2         4.62      36.47
Φ4det          X          59.6         4.54      28.80       +0.08     +7.67
Φ8det                     49.2         7.00      54.25
Φ8det          X          50.4         6.08      32.93       +0.92     +21.32
Φ16det                    42.0         20.27     75.17
Φ16det         X          42.7         7.41      33.87       +12.86    +41.30

Table 3. Comparisons of different downsampling factors of the feature maps, denoted as Φrdet for downsampling by r w.r.t. the input image. Test time is evaluated on images of 480x640 pixels. ΔMR−2 means the improvement from the utilization of the offset prediction. Bold numbers indicate the best results.

                         ResNet-50 [12]                          MobileNetV1 [13]
φ2  φ3  φ4  φ5   # Parameters  Test Time    MR−2 (%)    # Parameters  Test Time    MR−2 (%)
X   X            4.7MB         36.2ms/img   9.96        2.1MB         27.3ms/img   34.96
    X   X        16.1MB        44.5ms/img   5.68        6.0MB         32.3ms/img   8.33
        X   X    37.4MB        54.4ms/img   5.84        10.7MB        34.5ms/img   10.03
X   X   X        16.7MB        46.0ms/img   6.34        6.3MB         33.3ms/img   8.43
    X   X   X    40.0MB        58.2ms/img   4.62        12.3MB        38.2ms/img   9.59
X   X   X   X    40.6MB        61.1ms/img   4.99        12.6MB        40.5ms/img   9.05

Table 4. Comparisons of different combinations of the multi-scale feature representations defined in Sec. 3.2. φ2, φ3, φ4 and φ5 represent the outputs of stages 2, 3, 4 and 5 of the backbone network, respectively. Bold numbers indicate the best results.

Besides ResNet-50 [12] with its stronger feature representation, we also choose a light-weight network, MobileNetV1 [13], as the backbone. The results in Table 4 show that the much shallower feature maps like φ2 result in poorer accuracy, while deeper feature maps like φ4 and φ5 are of great importance for superior performance, and the middle-level feature maps φ3 are indispensable for achieving the best results. For ResNet-50, the best performance comes from the combination of {φ3, φ4, φ5}, while {φ3, φ4} is optimal for MobileNetV1.

4.3. Comparison with the State of the Arts

Caltech. The proposed method is extensively compared with the state of the arts in three settings: Reasonable, All and Heavy Occlusion. As shown in Fig. 4, CSP achieves an MR−2 of 4.5% on the Reasonable setting, which outperforms the best competitor (5.0% of RepLoss [46]) by 0.5%. When the model is initialized from CityPersons [51], CSP also achieves a new state of the art of 3.8%, compared to 4.0% for RepLoss [46], 4.1% for OR-CNN [52], and 4.5% for ALFNet [28]. It demonstrates superiority in detecting pedestrians of various scales and occlusion levels, as shown in Fig. 4 (b). Moreover, Fig. 4 (c) shows that CSP also performs very well for heavily occluded pedestrians, outperforming RepLoss [46] and OR-CNN [52], which are explicitly designed for occlusion cases.

CityPersons. Table 5 shows the comparisons with the previous state of the arts on CityPersons. Besides the Reasonable subset, following [46], we also evaluate on three subsets with different occlusion levels, and following [51], results on three subsets with various scale ranges are also included. It can be observed that CSP beats the competitors and performs fairly well on occlusion cases even without any specific occlusion-handling strategies [46, 52]. On the Reasonable subset, CSP with offset prediction achieves the best performance, with a gain of 1.0% MR−2 over the closest competitor (ALFNet [28]), while the speed is comparable in the same running environment, at 0.33 seconds per image of 1024x2048 pixels.

4.4. Discussions

Note that CSP only requires object centers and scales for training, though generating them from bounding box or central line annotations is more feasible, since centers are not always easy to annotate. Besides, the model may be puzzled by ambiguous centers during training. To demonstrate this, we randomly disturbed the object centers in the ranges of [0,4] and [0,8] pixels during training. From the results shown in Table 6, it can be seen that performance drops with increasing annotation noise. For Caltech, we also applied the original annotations, but with inferior performance to TLL [42], which is also anchor-free. A possible reason is that TLL includes a series of post-processing strategies in its keypoint pairing.
Figure 4. Comparisons with the state of the arts on Caltech using new annotations.

Method             Backbone    Reasonable  Heavy  Partial  Bare  Small  Medium  Large  Test Time
FRCNN [51]         VGG-16      15.4        -      -        -     25.6   7.2     7.9    -
FRCNN+Seg [51]     VGG-16      14.8        -      -        -     22.6   6.7     8.0    -
OR-CNN [52]        VGG-16      12.8        55.7   15.3     6.7   -      -       -      -
RepLoss [46]       ResNet-50   13.2        56.9   16.8     7.6   -      -       -      -
TLL [42]           ResNet-50   15.5        53.6   17.2     10.0  -      -       -      -
TLL+MRF [42]       ResNet-50   14.4        52.0   15.9     9.2   -      -       -      -
ALFNet [28]        ResNet-50   12.0        51.9   11.4     8.4   19.0   5.7     6.6    0.27s/img
CSP (w/o offset)   ResNet-50   11.4        49.9   10.8     8.1   18.2   3.9     6.0    0.33s/img
CSP (with offset)  ResNet-50   11.0        49.3   10.4     7.3   16.0   3.7     6.5    0.33s/img

Table 5. Comparison with the state of the arts on CityPersons [51]. Results tested on the original image size (1024x2048 pixels) are reported. Red and green indicate the best and second-best performance.

Disturbance (pixels)    MR−2 (%)    ΔMR−2 (%)
0                       4.62        -
[0, 4]                  5.68        ↓ 1.06
[0, 8]                  8.59        ↓ 3.97

Table 6. Performance drop with disturbances of the centers.

For evaluation with tight annotations based on central lines, as the results of TLL on Caltech are not reported in [42], a comparison to TLL is given in Table 5 on CityPersons, which shows the superiority of CSP. Therefore, the proposed method may be limited for annotations with ambiguous centers, e.g. traditional pedestrian bounding box annotations affected by limbs. In view of this, it may also not be straightforward to apply CSP to generic object detection without further improvement or new annotations.

When compared with anchor-based methods, the advantage of CSP lies in two aspects. Firstly, CSP does not require tedious configurations of anchors specifically for each dataset. Secondly, anchor-based methods detect objects by an overall classification of each anchor, where background information and occlusions are also included and will confuse the detector's training. CSP overcomes this drawback by scanning for pedestrian centers instead of boxes in an image, and is thus more robust to occluded objects.

5. Conclusion

Inspired by the traditional feature detection task, we provide a new perspective where pedestrian detection is motivated as a high-level semantic feature detection task through straightforward convolutions for center and scale predictions. This way, the proposed method enjoys an anchor-free setting and is also free from the complex post-processing strategies of recent keypoint-pairing based detectors. As a result, the proposed CSP detector achieves new state-of-the-art performance on two challenging pedestrian detection benchmarks, namely CityPersons and Caltech. Given the general structure of the CSP detector, it is interesting to further explore its capability in other tasks like face detection, vehicle detection and general object detection.

Acknowledgment

This work was partly supported by the National Key Research and Development Plan (Grant No. 2016YFC0801003), the NSFC Project #61672521, the NLPR Independent Research Project #Z-2018008, and the IIAI financial support.
References

[1] https://en.wikipedia.org/wiki/Feature_detection_(computer_vision).
[2] Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. DeepEdge: A multi-scale bifurcated deep network for top-down contour detection. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4380-4389. IEEE, 2015.
[3] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pages 354-370. Springer, 2016.
[4] John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):679-698, 1986.
[5] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1611.08050, 2016.
[6] François Chollet. Keras. Published on GitHub (https://github.com/fchollet/keras), 2015.
[7] Hongli Deng, Wei Zhang, Eric Mortensen, Thomas Dietterich, and Linda Shapiro. Principal curvature-based region detector for object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1-8. IEEE, 2007.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248-255. IEEE, 2009.
[9] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):743-761, 2012.
[10] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[11] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[13] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[14] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. DenseBox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
[15] Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko. Parallel feature pyramid network for object detection. In The European Conference on Computer Vision (ECCV), September 2018.
[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] Tao Kong, Fuchun Sun, Chuanqi Tan, Huaping Liu, and Wenbing Huang. Deep feature pyramid reconfiguration for object detection. In The European Conference on Computer Vision (ECCV), September 2018.
[18] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In The European Conference on Computer Vision (ECCV), September 2018.
[19] Jian Li, Yabiao Wang, Changan Wang, Ying Tai, Jianjun Qian, Jian Yang, Chengjie Wang, Jilin Li, and Feiyue Huang. DSFD: Dual shot face detector. arXiv preprint arXiv:1810.10220, 2018.
[20] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Light-head R-CNN: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.
[21] Chunze Lin, Jiwen Lu, Gang Wang, and Jie Zhou. Graininess-aware deep feature learning for pedestrian detection. In The European Conference on Computer Vision (ECCV), September 2018.
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.
[23] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144, 2016.
[24] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
[25] Tony Lindeberg. Scale selection properties of generalized scale-space interest point detectors. Journal of Mathematical Imaging and Vision, 46(2):177-210, 2013.
[26] Songtao Liu, Di Huang, and Yunhong Wang. Receptive field block net for accurate and fast object detection. In The European Conference on Computer Vision (ECCV), September 2018.
[27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21-37. Springer, 2016.
[28] Wei Liu, Shengcai Liao, Weidong Hu, Xuezhi Liang, and Xiao Chen. Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In The European Conference on Computer Vision (ECCV), September 2018.
[29] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional features for edge detection. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5872-5881. IEEE, 2017.
[30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.
[31] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[32] Jiayuan Mao, Tete Xiao, Yuning Jiang, and Zhimin Cao. What can help pedestrian detection? In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
[33] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10):761-767, 2004.
[34] Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pages 2277-2287, 2017.
[35] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788, 2016.
[36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[37] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision, pages 430-443. Springer, 2006.
[38] Edward Rosten, Reid Porter, and Tom Drummond. Faster and better: A machine learning approach to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):105-119, 2010.
[39] Wei Shen, Xinggang Wang, Yan Wang, Xiang Bai, and Zhijiang Zhang. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3982-3991, 2015.
[40] Stephen M. Smith and J. Michael Brady. SUSAN: A new approach to low level image processing. International Journal of Computer Vision, 23(1):45-78, 1997.
[41] Irwin Sobel. Camera models and machine perception. Technical report, Computer Science Department, Technion, 1972.
[42] Tao Song, Leiyu Sun, Di Xie, Haiming Sun, and Shiliang Pu. Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In The European Conference on Computer Vision (ECCV), September 2018.
[43] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pages 1195-1204, 2017.
[44] Lachlan Tychsen-Smith and Lars Petersson. DeNet: Scalable real-time object detection with directed sparse sampling. In Proceedings of the IEEE International Conference on Computer Vision, pages 428-436, 2017.
[45] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137-154, 2004.
[46] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion loss: Detecting pedestrians in a crowd. arXiv preprint arXiv:1711.07752, 2017.
[47] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. International Journal of Computer Vision, 125(1-3):3-18, 2017.
[48] Liliang Zhang, Liang Lin, Xiaodan Liang, and Kaiming He. Is Faster R-CNN doing well for pedestrian detection? In European Conference on Computer Vision, pages 443-457. Springer, 2016.
[49] Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. How far are we from solving pedestrian detection? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1259-1267, 2016.
[50] Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. Towards reaching human performance in pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):973-986, 2018.
[51] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. CityPersons: A diverse dataset for pedestrian detection. arXiv preprint arXiv:1702.05693, 2017.
[52] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In The European Conference on Computer Vision (ECCV), September 2018.
[53] Chunluan Zhou and Junsong Yuan. Bi-box regression for pedestrian detection and occlusion estimation. In The European Conference on Computer Vision (ECCV), September 2018.
