Overview of Object Detection Algorithms Using
https://www.scirp.org/journal/jcc
ISSN Online: 2327-5227
ISSN Print: 2327-5219
School of Computer Science and Engineering, Sichuan University of Science and Engineering, Yibin, China
1. Introduction
Convolutional neural networks (CNNs) [1] have made great progress in recent
years and are among the most prominent models in the rapidly growing family of
deep neural networks. Computer vision gives artificial intelligence the ability
to perceive and understand visual information. In recent years, thanks to
improved computer hardware and the creation of large-scale annotated image data
sets, computer vision algorithms based on deep learning have achieved great
success in classic computer vision tasks such as image classification, object
detection, and image segmentation.
DOI: 10.4236/jcc.2022.101006 Jan. 29, 2022 115 Journal of Computer and Communications
J. S. Ren, Y. Wang
At present, object detection has not only been studied extensively in academia
but has also been widely used in real life, such as video fire detection [2],
unmanned driving [3], security monitoring [4], and UAV scene analysis [5].
Object detection algorithms fall into two main types: traditional algorithms
based on image processing and algorithms based on convolutional neural networks.
In 2014, Girshick et al. proposed R-CNN [6], which applied convolutional neural
networks to object detection for the first time and improved detection accuracy
by nearly 30% compared with traditional detection algorithms, attracting wide
attention. In both current academic research and practical applications, object
detection algorithms based on convolutional neural networks achieve higher
accuracy and shorter test times than traditional methods and have almost
completely replaced them.
through the intermediate layer, but performs feature extraction, target
classification, and position regression in the entire convolutional network, and
then obtains the target position and category. Its recognition accuracy is
slightly lower than that of two-stage object detection algorithms, but its speed
is greatly improved. The development of the two-stage and one-stage algorithms
is shown in Figure 3 and Figure 4, respectively.
4.2. SPPNet
In 2015, SPPNet [18] was published in IEEE Transactions on Pattern Analysis and
Machine Intelligence. In R-CNN, to generate a vector of equal dimensions for all
candidate regions, the candidate regions are forcibly scaled. This distorts the
aspect ratio of the image, which harms feature extraction, and the extraction
process is also quite time-consuming. SPPNet addresses these problems with
spatial pyramid pooling.
Figure 3. The development history of the two-stage object detection network framework.
Figure 4. The development history of the one-stage object detection network framework.
The spatial pyramid pooling layer is the core of SPPNet, and its main purpose is
to generate a fixed-size output for an input of any size. The idea is to divide
a feature map of any size into 16, 4, and 1 blocks and then apply max pooling to
each block. The pooled features are concatenated into a fixed-dimensional output
that meets the needs of the fully connected layer. As a result, images of
different sizes yield vectors of the same size, which is the advantage of
spatial pyramid pooling. The structure of the SPPNet network is shown in
Figure 6.
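The pooling scheme described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name spatial_pyramid_pool, the (channels, height, width) layout, and the way bin edges are rounded are assumptions made for illustration, and the feature map is assumed to be at least 4 x 4.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    """Max-pool a (C, H, W) feature map over an n x n grid for each n in levels.

    Returns a 1-D vector of length C * sum(n * n for n in levels), which does
    not depend on H or W (assumes H and W are at least max(levels)).
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges are chosen so that the n bins cover the whole map.
        row_edges = np.linspace(0, height, n + 1)
        col_edges = np.linspace(0, width, n + 1)
        for i in range(n):
            r0 = int(np.floor(row_edges[i]))
            r1 = int(np.ceil(row_edges[i + 1]))
            for j in range(n):
                c0 = int(np.floor(col_edges[j]))
                c1 = int(np.ceil(col_edges[j + 1]))
                block = feature_map[:, r0:r1, c0:c1]
                # One maximum per channel for this block.
                pooled.append(block.max(axis=(1, 2)))
    return np.concatenate(pooled)

# Two inputs of different spatial size give vectors of the same length.
small = spatial_pyramid_pool(np.random.rand(8, 13, 17))
large = spatial_pyramid_pool(np.random.rand(8, 32, 24))
```

With 8 channels, both vectors have 8 x (16 + 4 + 1) = 168 entries, so the same fully connected layer can follow either input.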
R-CNN model, it is trained together with the entire model. By integrating
candidate box extraction into the deep network, RPN learns from the data how to
generate high-quality region proposals, which reduces the number of proposals
while still maintaining the accuracy of object detection.
Faster R-CNN [22] is another influential work co-authored by Ross Girshick after
Fast R-CNN. It also uses VGG-16 as the backbone of the network. Its inference
speed reaches 5 fps on a GPU (including the generation of candidate regions),
that is, it can process five images per second. Its accuracy was also further
improved, and it won first place in several tracks of the 2015 ILSVRC and COCO
competitions. The structure of the Faster R-CNN network is shown in Figure 8.
4.7. D2Det
In 2020, Cao et al. improved the classification and regression branches of the
two-stage method to further improve the accuracy of object detection and
instance segmentation. They proposed D2Det [29], a method that achieves both
precise localization and accurate classification.
For precise localization, the paper introduces a dense local regression method,
which predicts multiple dense box offsets for each candidate box.
For accurate classification, the paper introduces a discriminative RoI pooling
scheme, which samples from different sub-regions of a candidate region and
assigns adaptive weights during the computation to obtain discriminative
features. The structure of the D2Det network is shown in Figure 11.
4.9. YOLOv1
YOLO [24] was proposed in 2016 and published at CVPR, a leading computer vision
conference.
Unlike the R-CNN series, which first finds candidate regions and then identifies
the objects in them, YOLO makes its prediction from the entire image and outputs
all detected target information, including category and location, in one pass.
The first step of YOLO is to divide the image into a grid of equally sized
cells. The core idea of YOLO is to turn object detection into a regression
problem: the entire image is the input of the network, and a single pass through
one neural network yields the locations of the bounding boxes and their
categories. Its detection speed is extremely fast and its generalization ability
is strong, but the gain in speed comes at the cost of some accuracy. Its
disadvantage is that it has difficulty detecting small objects and overlapping
objects.
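As a small illustration of the grid division, the sketch below maps a ground-truth box to the grid cell that contains its center, which is the cell that YOLOv1 makes responsible for predicting that object. The function name responsible_cell and the default grid size of 7 are assumptions for illustration.

```python
def responsible_cell(box, image_width, image_height, grid_size=7):
    """Return (row, col) of the grid cell that contains the center of the box.

    box is (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = box
    center_x = (x_min + x_max) / 2.0
    center_y = (y_min + y_max) / 2.0
    # Scale the center to grid units and clamp points on the far border.
    col = min(int(center_x / image_width * grid_size), grid_size - 1)
    row = min(int(center_y / image_height * grid_size), grid_size - 1)
    return row, col

# A box centered at (150, 250) in a 448 x 448 image falls in row 3, column 2.
cell = responsible_cell((100, 200, 200, 300), 448, 448)
```

Because each cell predicts only a limited number of boxes, several small or overlapping objects whose centers share one cell compete for the same predictions, which is consistent with the weakness noted above.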
4.10. YOLOv2
In 2017, Joseph Redmon and Ali Farhadi made many improvements to YOLOv1 and
proposed YOLOv2 [32], focusing on YOLOv1's shortcomings in recall and
localization accuracy.
Whereas YOLOv1 uses a fully connected layer to directly predict the coordinates
of the bounding box, YOLOv2 draws on the idea of Faster R-CNN and introduces the
anchor mechanism. K-means clustering is applied to the bounding boxes of the
training set to compute better anchor templates, which greatly improves the
recall of the algorithm. At the same time, to exploit the fine-grained features
of the image, shallow features are concatenated with deep features, which helps
detect small-scale targets.
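The anchor clustering step can be sketched as follows. This is a simplified illustration rather than the authors' code: it clusters (width, height) pairs with the distance 1 - IoU used in the YOLOv2 paper, and the function names, the random initialization, and the toy box list are assumptions.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (N, 2) boxes and (K, 2) anchors given as (width, height),
    with every box and anchor aligned at a common corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_boxes = boxes[:, 0] * boxes[:, 1]
    area_anchors = anchors[:, 0] * anchors[:, 1]
    return inter / (area_boxes[:, None] + area_anchors[None, :] - inter)

def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs into k anchor templates."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iterations):
        # Distance is 1 - IoU, so the nearest anchor has the highest IoU.
        assignment = np.argmax(iou_wh(boxes, anchors), axis=1)
        new_anchors = anchors.copy()
        for c in range(k):
            members = boxes[assignment == c]
            if len(members) > 0:
                new_anchors[c] = members.mean(axis=0)
        if np.allclose(new_anchors, anchors):
            break
        anchors = new_anchors
    return anchors

# Hypothetical training boxes: three small squares and three wide boxes.
boxes = np.array([[10, 10], [11, 11], [10, 11],
                  [100, 50], [102, 52], [98, 49]], dtype=float)
anchors = kmeans_anchors(boxes, k=2)
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering error, which is the motivation given in the YOLOv2 paper.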
The paper also proposes a new training method, a joint training algorithm, that
can mix a detection data set with a classification data set. It uses a
hierarchical view of object classes and draws on the huge amount of data in the
classification data set to expand the detection data set.
4.11. YOLOv3
In 2018, Redmon made some improvements to YOLOv2. The feature extraction part
uses the Darknet-53 network structure in place of the original Darknet-19 and
uses a feature pyramid network structure to achieve multi-scale detection.
Classification uses logistic regression instead of softmax, which maintains the
accuracy of object detection while preserving real-time performance.
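The difference between softmax and independent logistic classifiers can be seen in a short sketch. The logits and the class names in the comment are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Scores that compete with each other and sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def sigmoid(logits):
    """Independent per-class scores, each in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical logits for the overlapping labels "person", "woman" and for "car".
logits = np.array([3.0, 2.5, -4.0])
softmax_scores = softmax(logits)
logistic_scores = sigmoid(logits)
```

With softmax the two overlapping labels share the probability mass, so the second label scores below 0.5. With independent logistic outputs both labels can score above 0.9 at the same time, which suits data sets with overlapping labels.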
Earlier detection systems repurpose classifiers or localizers to perform
detection: they apply the model to multiple locations and scales of the image
and treat the regions with higher scores as detections. YOLOv3 takes a
completely different approach and applies a single neural network to the entire
image. The network divides the image into regions and predicts the bounding
boxes and probabilities of each region. These bounding boxes are weighted by the
predicted probabilities. The model has some advantages over classifier-based
systems. It looks at the entire image at test time, so its predictions use the
global information in the image. Unlike R-CNN, which requires thousands of
network evaluations for a single image, it makes predictions with a single
network evaluation. This makes YOLOv3 [33] very fast: it is reported to be about
1000 times faster than R-CNN and 100 times faster than Fast R-CNN.
4.12. YOLOv4
In 2020, Bochkovskiy et al. released YOLOv4 [33]. YOLOv4 tested a large number
of techniques commonly used in deep learning and finally selected the following:
weighted residual connections (WRC), cross-stage partial connections (CSP),
cross mini-batch normalization (CmBN), self-adversarial training (SAT), Mish
activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss.
YOLOv4 adds these techniques to the traditional YOLO design in pursuit of the
best trade-off between detection speed and accuracy.
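Of the techniques listed above, the Mish activation is the simplest to write down: mish(x) = x * tanh(softplus(x)). The sketch below is a plain NumPy illustration of that formula, not code from YOLOv4; the function name is chosen here for convenience.

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    # logaddexp(0, x) computes log(1 + exp(x)) without overflow for large x.
    softplus = np.logaddexp(0.0, x)
    return x * np.tanh(softplus)

values = mish(np.array([-20.0, 0.0, 1.0, 20.0]))
```

The function is close to the identity for large positive inputs and close to zero for large negative inputs, while staying smooth around the origin.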
4.13. SSD
The full name of the SSD [25] algorithm is Single Shot MultiBox Detector.
"Single shot" indicates that SSD is a one-stage method, and "MultiBox" indicates
that SSD predicts multiple boxes.
Compared with YOLO, SSD uses convolutional layers to perform detection directly
instead of performing detection after a fully connected layer as YOLO does.
Direct convolutional detection is only one of the differences between SSD and
YOLO; there are two other important changes. First, SSD extracts feature maps of
different scales for detection: large-scale feature maps (the earlier feature
maps) can be used to detect small objects, while small-scale feature maps (the
later feature maps) can be used to detect large objects. Second, SSD uses prior
boxes with different scales and aspect ratios (also called default boxes, and
called anchors in Faster R-CNN). YOLO has difficulty detecting small targets and
its localization is not accurate, and these improvements enable SSD to overcome
those shortcomings to a certain extent.
SSD uses VGG16 as the base model and adds new convolutional layers on top of
VGG16 to obtain more feature maps for detection.
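The prior boxes can be sketched as follows. This is a simplified illustration under stated assumptions: coordinates are normalized to [0, 1], the scales are spaced linearly between s_min = 0.2 and s_max = 0.9 as in the SSD paper, only three aspect ratios are used, and the extra box per location that the paper adds for aspect ratio 1 is omitted. The function names are hypothetical.

```python
import math

def layer_scales(num_layers, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales, one per feature map (num_layers >= 2)."""
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]

def default_boxes(feature_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Boxes (cx, cy, w, h) centered on every cell of a square feature map."""
    boxes = []
    for row in range(feature_size):
        for col in range(feature_size):
            cx = (col + 0.5) / feature_size
            cy = (row + 0.5) / feature_size
            for ratio in aspect_ratios:
                # Same area for every ratio; only the shape changes.
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes

scales = layer_scales(6)
coarse_boxes = default_boxes(feature_size=4, scale=scales[-1])
fine_boxes = default_boxes(feature_size=32, scale=scales[0])
```

The fine 32 x 32 map receives many small boxes and the coarse 4 x 4 map receives a few large ones, which matches the division of labor between earlier and later feature maps described above.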
4.14. RetinaNet
In 2017, Lin et al. proposed RetinaNet [34]. They argue that one-stage methods
are fast but less accurate than two-stage methods because the positive and
negative samples are imbalanced.
To address this class imbalance during training, they designed a new loss for
the one-stage detector, focal loss, which replaces the cross-entropy loss of the
classification task.
Focal loss is a dynamically scaled cross-entropy loss. As the confidence in the
correct category increases, the scaling factor decays to 0. The scaling factor
automatically reduces the weight of the loss contributed by easy examples during
training so that the model focuses on hard examples.
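The scaling behavior can be made concrete with the binary form of focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The sketch below is a minimal NumPy illustration; the defaults gamma = 2 and alpha = 0.25 are the values reported as working best in the focal loss paper, and the function name and clipping constant are assumptions.

```python
import numpy as np

def focal_loss(prob_positive, target, gamma=2.0, alpha=0.25):
    """Per-example binary focal loss.

    prob_positive: predicted probability of the positive class.
    target: 1 for positive examples, 0 for negative examples.
    """
    p = np.clip(prob_positive, 1e-7, 1.0 - 1e-7)  # avoid log(0)
    # p_t is the probability assigned to the correct class.
    p_t = np.where(target == 1, p, 1.0 - p)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    # (1 - p_t) ** gamma is the scaling factor; it decays to 0 as p_t approaches 1.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

easy = focal_loss(np.array([0.95]), np.array([1]))[0]
hard = focal_loss(np.array([0.10]), np.array([1]))[0]
```

With gamma set to 0 the scaling factor disappears and the expression reduces to alpha-weighted cross-entropy. With gamma = 2 the well-classified example contributes a loss several orders of magnitude smaller than the hard one.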
FPN serves as the backbone. It adds a top-down pathway and lateral connections
to the ResNet [35] network and builds a rich, multi-scale feature pyramid from a
single-resolution input image. The features of each pyramid level are used to
detect targets of different sizes. The structure of the RetinaNet network is
shown in Figure 12.
4.15. CornerNet
In 2018, Law and Deng published CornerNet [36] at ECCV 2018. They proposed to
treat object detection as a keypoint detection problem, that is, to obtain the
predicted box by detecting two keypoints, the top-left corner and the
bottom-right corner of the target box. Therefore, there is no concept of an
anchor in the CornerNet algorithm. This approach was relatively innovative in
the object detection field and achieves good results. The entire detection
network is trained from scratch and is not based on a pre-trained classification
model, which allows users to freely design a feature extraction network without
being restricted by a pre-trained model. CornerNet also proposed a new pooling
method, corner pooling. The structure of the CornerNet network is shown in
Figure 13.
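Corner pooling for the top-left corner can be sketched in a few lines. Following the description in the CornerNet paper, each location takes the maximum of one feature map over everything below it and the maximum of a second feature map over everything to its right, and the two results are added. The function name and the single-channel (H, W) layout are assumptions for illustration.

```python
import numpy as np

def top_left_corner_pool(vertical_features, horizontal_features):
    """Top-left corner pooling on two (H, W) feature maps."""
    # Maximum over all values at or below each location, column by column.
    below_max = np.maximum.accumulate(vertical_features[::-1, :], axis=0)[::-1, :]
    # Maximum over all values at or to the right of each location, row by row.
    right_max = np.maximum.accumulate(horizontal_features[:, ::-1], axis=1)[:, ::-1]
    return below_max + right_max

f_vertical = np.array([[1.0, 0.0],
                       [0.0, 3.0]])
f_horizontal = np.array([[0.0, 2.0],
                         [5.0, 1.0]])
pooled = top_left_corner_pool(f_vertical, f_horizontal)
```

A top-left corner usually lies outside the object, so looking down and to the right lets the corner location gather evidence from the object's top and left boundaries. The bottom-right variant scans in the opposite directions.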
4.16. CenterNet
As its name suggests, CenterNet [37] predicts the center point of the target
instead of the two corner points used in CornerNet. CenterNet uses a heat map to
achieve this, placing a Gaussian distribution around each annotated center point
to compute the target values of the heat map. The heat map output by the network
is first normalized to the range 0 to 1 by the sigmoid function and then passed
to the loss function.
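A target heat map of this kind can be sketched as below. The function name and the fixed sigma are assumptions for illustration; in the CenterNet paper the standard deviation is adapted to the object size, which is not reproduced here.

```python
import numpy as np

def gaussian_heatmap(height, width, center, sigma):
    """Heat map with value 1 at the center and a Gaussian falloff around it.

    center is (x, y) in heat-map coordinates.
    """
    cx, cy = center
    ys, xs = np.mgrid[0:height, 0:width]
    squared_distance = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-squared_distance / (2.0 * sigma ** 2))

first = gaussian_heatmap(5, 8, center=(3, 2), sigma=1.0)
second = gaussian_heatmap(5, 8, center=(6, 1), sigma=1.0)
# When several objects of one class are present, keep the larger value per pixel.
target = np.maximum(first, second)
```

Pixels near an annotated center receive target values close to 1 rather than 0, so predictions that are slightly off-center are penalized less than distant ones.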
CenterNet does not include operations such as corner pooling: because the center
point of the target box is relatively likely to fall on the target itself,
conventional pooling can extract effective features. CenterNet also uses the
same offset prediction as CornerNet. The offset represents the coordinate error
caused by rounding when the annotation is mapped from the input image to the
output feature map. CornerNet computes the offsets of the two corner points,
whereas CenterNet computes the offset of the center point.
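The rounding error that the offset branch has to recover can be computed directly. The sketch below assumes an output stride of 4, the value used in the CenterNet paper, and a hypothetical function name; box coordinates are in input-image pixels.

```python
def center_and_offset(box, stride=4):
    """Map a box center to the output feature map.

    Returns the integer cell (x, y) and the fractional offset lost by rounding down.
    """
    x_min, y_min, x_max, y_max = box
    # Exact center position in feature-map coordinates.
    cx = (x_min + x_max) / 2.0 / stride
    cy = (y_min + y_max) / 2.0 / stride
    # int() rounds toward zero, which is a floor for non-negative coordinates.
    cell_x, cell_y = int(cx), int(cy)
    return (cell_x, cell_y), (cx - cell_x, cy - cell_y)

cell, offset = center_and_offset((10, 20, 33, 47))
```

For this box the center maps to (5.375, 8.375) on the feature map, so the cell is (5, 8) and the offset is (0.375, 0.375). Multiplying cell plus offset by the stride recovers the original center.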
4.17. EfficientDet
Tan et al. proposed the EfficientDet [38] algorithm at CVPR 2020. They observe
that current object detectors either pursue more accurate results at a high cost
or are more efficient at the expense of accuracy. Therefore, the paper designs a
family of object detection frameworks that adapt to different resource
constraints while achieving both high accuracy and high efficiency. Its main
contributions are BiFPN and a compound scaling method.
BiFPN is an improvement on FPN. It adds edges to the original FPN module to
bring in more contextual information and multiplies each edge by a corresponding
learnable weight, which allows simple and fast multi-scale feature fusion. The
compound scaling method uniformly scales the resolution, depth, and width of the
backbone, the feature network, and the box/class prediction networks.
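The weighted fusion of edges can be sketched with the fast normalized fusion rule from the EfficientDet paper, O = sum_i (w_i / (eps + sum_j w_j)) * I_i, where each weight is kept non-negative with a ReLU. The function name and the toy inputs are assumptions; in a real network the weights are learned and the inputs are first resized to a common resolution.

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with normalized non-negative weights."""
    # ReLU keeps every weight non-negative so the normalization is well defined.
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    w = w / (w.sum() + eps)
    return sum(wi * feature for wi, feature in zip(w, features))

inputs = [np.ones((2, 2)), 3.0 * np.ones((2, 2))]
fused_equal = fast_normalized_fusion(inputs, weights=[1.0, 1.0])
fused_first_only = fast_normalized_fusion(inputs, weights=[1.0, -5.0])
```

Equal weights give approximately the average of the inputs, and a negative weight is clipped to zero so that input is ignored. The paper motivates this rule as a cheaper alternative to a softmax over the weights.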
4.18. CentripetalNet
CentripetalNet [39], published at CVPR 2020, uses centripetal shifts to pair
corner points that belong to the same instance. CentripetalNet predicts the
positions and centripetal shifts of corner points and matches the corners whose
shifted results are aligned.
By combining location information, CentripetalNet matches corner points more
accurately than traditional embedding methods. Corner pooling extracts the
information in the bounding box to the boundary. To make the information at the
5. Conclusions
In recent years, with the rapid development of deep learning, general object
detection has advanced rapidly and achieved breakthroughs. However, there is
still a large gap between the efficiency and speed of detection models and
human-level performance. Existing research shows that the open problems and
future research trends of deep-learning-based general object detection mainly
include:
1) Unsupervised object detection: automated labeling technology is exciting and
promising, and unsupervised object detection can eliminate manual labeling.
2) Detection methods that combine the advantages of two-stage and one-stage
models.
3) The design of efficient feature extraction networks.
4) GAN-based object detection: deep learning object detectors usually require a
lot of data for training, and a GAN is an important architecture for generating
synthetic images. Combining real scenes with GAN-simulated data helps the
detector become more robust and general.
5) Multi-domain object detection: developing a general object detector that can
detect objects from multiple domains without prior knowledge.
Funding
Supported by the scientific research project of the Sichuan University of Science
and Engineering (ZHZJ19-01).
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this
paper.
References
[1] Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2017) ImageNet Classification with
Deep Convolutional Neural Networks. Communications of the ACM, 60, 84-90.
https://doi.org/10.1145/3065386
[2] Kim, B. and Lee, J. (2019) A Video-Based Fire Detection Using Deep Learning
Models. Applied Sciences, 9, Article No. 2862. https://doi.org/10.3390/app9142862
[3] Li, P., Chen, X. and Shen, S. (2019) Stereo R-CNN Based 3D Object Detection for
Autonomous Driving. Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, Long Beach, 15-20 June 2019, 7636-7644.
https://doi.org/10.1109/CVPR.2019.00783
[4] Zhang, X., Yi, W.-J. and Saniie, J. (2019) Home Surveillance System Using Com-
puter Vision and Convolutional Neural Network. 2019 IEEE International Confe-
https://doi.org/10.1109/ICCV.2017.324
[35] He, K., Zhang, X., Ren, S. and Sun, J. (2016) Deep Residual Learning for Image
Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Las Vegas, 27-30 June 2016, 770-778.
https://doi.org/10.1109/CVPR.2016.90
[36] Law, H. and Deng, J. (2018) CornerNet: Detecting Objects as Paired Keypoints.
Proceedings of the European Conference on Computer Vision (ECCV), Munich,
8-14 September 2018, 765-781. https://doi.org/10.1007/978-3-030-01264-9_45
[37] Zhou, X., Wang, D. and Krähenbühl, P. (2019) Objects as Points.
https://arxiv.org/abs/1904.07850
[38] Tan, M. and Le, Q. (2019) EfficientNet: Rethinking Model Scaling for
Convolutional Neural Networks. International Conference on Machine Learning,
Long Beach, 10-15 June 2019, 6105-6114.
[39] Dong, Z., Li, G., Liao, Y., Wang, F., Ren, P. and Qian, C. (2020) CentripetalNet:
Pursuing High-Quality Keypoint Pairs for Object Detection. Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, 13-19
June 2020, 10516-10525. https://doi.org/10.1109/CVPR42600.2020.01053