Vision-Based Learning for Drones: A Survey
Fig. 3 (block diagram): Visual Perception (Images) → Image Processing (Deep learning, Visual odometry → Patterns) → Flight Control (Commands).
II. BACKGROUND
A. Vision-based Learning Drones
Fig. 4: Vision-based control for drones' obstacle avoidance in simple dynamic environments. (a) Drone racing in a dynamic environment with moving gates [26]; (b) A drone avoiding a ball thrown at it with event cameras [27].

A typical vision-based drone consists of three parts (see Fig. 3): (1) Visual perception: sensing the environment around the drone via monocular or stereo cameras; (2) Image processing: extracting features from an observed image sequence and outputting specific patterns or information, such as
III. OBJECT DETECTION WITH VISUAL PERCEPTION

Fig. 7: (a) R-CNN neural network architecture [64]; (b) Fast R-CNN neural network architecture [65].

Object detection is a pivotal module in vision-based learning drones when handling complex missions such as inspection, avoidance, and search and rescue. Object detection aims to find all objects of interest in an image and determine their positions and sizes [57]. It is one of the core problems in the field of computer vision (CV). Nowadays, the applications of object detection include face detection, pedestrian detection, vehicle detection, and terrain detection in remote sensing images. Object detection has always been one of the most challenging problems in CV due to the different appearances, shapes, and poses of various objects, as well as interference from factors such as illumination and occlusion during imaging. At present, object detection algorithms can be roughly divided into two categories: multi-stage (two-stage) algorithms, which first generate candidate regions and then perform classification, and one-stage algorithms, which directly apply the network to the input image and output the categories and corresponding positions. Beyond that, to retrieve 3D positions, depth estimation has become a popular research subbranch related to object detection, whether using monocular [58] or stereo depth estimation [59]. For a very long time, the core neural network module (backbone) of object detection has been the convolutional neural network (CNN) [60]. The CNN is a classic neural network in image processing that originates from the study of the human optic nerve system. The main idea is to convolve the image with convolution kernels to obtain a series of reorganized features, and these reorganized features represent the important information of the image. As such, the CNN not only has the ability to recognize the image but also effectively decreases the requirement for computing resources. Recently, vision transformers (ViTs) [61], originally proposed for image classification tasks, have been extended to the realm of object detection [62]. These models demonstrate superior performance by utilizing the self-attention mechanism, which processes visual information non-locally [63]. However, a major limitation of ViTs is their high computational demand, which presents difficulties in achieving real-time inference, particularly on platforms with limited resources like drones.
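To make the convolutional feature extraction described above concrete, the following is a minimal PyTorch sketch of a CNN backbone turning an image into a spatial feature map; the layer widths, strides, and input size are illustrative choices rather than the configuration of any detector discussed in this survey.

import torch
import torch.nn as nn

# Minimal CNN backbone sketch: each convolution slides learned kernels over the
# image and produces feature maps; strided convolutions shrink the spatial size.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # RGB image -> 16 feature maps
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

image = torch.randn(1, 3, 224, 224)   # a dummy camera frame
features = backbone(image)            # (1, 64, 28, 28) spatial feature map
print(features.shape)
# A detection head would then predict object categories and box coordinates from these features.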
A. Multi-stage Algorithms

Classic multi-stage algorithms include R-CNN (Region-based Convolutional Neural Network) [64], Fast R-CNN [65], and Faster R-CNN [66]. Multi-stage algorithms can generally meet the accuracy requirements of real-life scenarios, but the models are more complex and are difficult to apply in scenarios with high-efficiency requirements. In the R-CNN structure [64], it is necessary to first generate a set of region proposals (RPs), then use convolutional layers for feature extraction, and finally classify the regions according to these features. That is, the object detection problem is transformed into an image classification problem. The R-CNN model is very intuitive, but its disadvantage is that it is too slow, as its output is obtained by training multiple Support Vector Machines (SVMs). To solve the problem of slow training speed, the Fast R-CNN model was proposed (Fig. 7b). This model makes two improvements over R-CNN: (1) convolutional layers first extract features from the whole image, so that a single convolutional pass provides the features for all RPs; and (2) the multiple SVMs are replaced by a single fully-connected layer followed by a softmax layer. These techniques greatly improve the computation speed but still fail to address the efficiency issue of the Selective Search Algorithm (SSA) used to generate RPs.

Faster R-CNN is an improvement built on Fast R-CNN. To solve the problem of the SSA, the SSA that generates RPs in Fast R-CNN is replaced by a Region Proposal Network (RPN), and a single model integrates RP generation, feature extraction, object classification, and bounding-box regression. The RPN is a fully convolutional network that simultaneously predicts object boundaries at each location. It is trained end-to-end to generate high-quality region proposals, which are then detected by Fast R-CNN, with the RPN and Fast R-CNN sharing convolutional features. Meanwhile, in the feature extraction stage, Faster R-CNN uses a convolutional neural network. The model achieves 73.2% and 70.4% mean Average Precision (mAP) per category on the PASCAL VOC 2007 and 2012 datasets, respectively. Faster R-CNN is greatly improved in speed over Fast R-CNN, its accuracy reached the state of the art (SOTA), and it established a fully end-to-end object detection framework. However, Faster R-CNN still cannot achieve real-time object detection. Besides, after obtaining the RPs, it requires heavy computation to classify each RP.
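For reference, a detector from this multi-stage family can be exercised off the shelf. The sketch below assumes torchvision 0.13 or newer; weights=None keeps the example runnable offline, whereas weights="DEFAULT" would load the pretrained COCO model.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Faster R-CNN with a ResNet-50 FPN backbone; random weights here, so the
# outputs are meaningless but the two-stage pipeline (RPN proposals followed by
# per-RoI classification and box regression) runs end to end.
model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None)
model.eval()

frame = torch.rand(3, 480, 640)               # a dummy RGB frame scaled to [0, 1]
with torch.no_grad():
    prediction = model([frame])[0]            # dict with "boxes", "labels", "scores"
print(prediction["boxes"].shape, prediction["scores"].shape)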
Fig. 9: YOLO network architecture [68].

Fig. 10: SSD network architecture [67].

B. One-stage Algorithms

A classic one-stage algorithm is YOLO [68], which divides detection into three phases, namely resizing the input image, running a single convolutional neural network over the full image, and applying non-maximum suppression (NMS) to the resulting detections. The main advantages of the YOLO model are that it is fast, makes few background errors thanks to its global processing of the image, and generalizes well. Meanwhile, YOLO formulates the detection task as a unified, end-to-end regression problem and obtains locations and classifications simultaneously by processing the image only once. However, YOLO also has some problems, such as its coarse grid, which limits its performance on small objects. The subsequent YOLOv3, YOLOv5, YOLOX [69], and YOLOv8 improved the network on the basis of the original YOLO and achieved better detection results.
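The NMS step mentioned above can be written in a few lines of NumPy. The sketch below keeps the highest-scoring box and discards any remaining box whose IoU with it exceeds a threshold; the 0.5 threshold and the toy boxes are illustrative only.

import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """boxes: (N, 4) arrays of [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    order = scores.argsort()[::-1]          # process boxes from highest to lowest score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection-over-union of the kept box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thr]   # drop boxes overlapping the kept one too much
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))                   # -> [0, 2]: the duplicate of box 0 is suppressed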
C. Vision Transformer
ViTs have recently emerged as one of the most active research directions in object detection, with models like Swin-Transformer [62], [70], ViTDet [71], and DINO [72] at the forefront. Unlike conventional CNNs, ViTs leverage self-attention mechanisms to process image patches as sequences, offering a more flexible representation of spatial hierarchies. The core mechanism of these models involves dividing an image into a sequence of patches and applying Transformer encoders [73] to capture complex dependencies between them. This process enables ViTs to efficiently learn global context, which is pivotal in understanding comprehensive scene layouts and object relations. For instance, the Swin-Transformer [62] introduces a hierarchical structure with shifted windows, enhancing the model's ability to capture both local and global features. Subsequently, the Swin-Transformer was scaled up to Swin-Transformer V2 [70], which is capable of training on high-resolution images (see Fig. 11).

Fig. 11: Swin Transformer V2 framework [70].
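The patch-based mechanism described above can be sketched compactly in PyTorch. The toy encoder below is illustrative only and is not the architecture of Swin-Transformer, ViTDet, or DINO; the patch size, embedding width, and depth are arbitrary choices.

import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        # A strided convolution cuts the image into non-overlapping patches and
        # linearly embeds each patch in one operation.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                            # x: (B, 3, H, W)
        tokens = self.embed(x)                       # (B, dim, H/patch, W/patch)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, N, dim) sequence of patch tokens
        return self.encoder(tokens + self.pos)       # self-attention mixes all patches globally

feats = PatchEncoder()(torch.randn(1, 3, 224, 224))
print(feats.shape)                                   # torch.Size([1, 196, 256])
# A detection head would consume these per-patch features to predict classes and boxes.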
The primary advantages of ViTs in object detection are their scalability to large datasets and superior performance in capturing long-range dependencies. This makes them particularly effective in scenarios where contextual understanding is crucial. Additionally, ViTs demonstrate strong transfer learning capabilities, performing well across various domains with minimal fine-tuning. However, challenges with ViTs include their computational intensity due to the self-attention mechanism, particularly when processing high-resolution images. This can limit their deployment in real-time applications where computational resources are constrained. Additionally, ViTs often require large-scale datasets for pre-training to achieve optimal performance, which can be a limitation in data-scarce environments. Despite these challenges, ongoing advancements in ViT architectures, such as the development of efficient attention mechanisms [74] and hybrid CNN-Transformer models [75], continue to enhance their applicability and performance in diverse object detection tasks.

Fig. 12: Air-to-air object detection of micro-UAVs with a monocular camera [76].

When applying object detection algorithms to drone applications, it is necessary to find the best balance between computation speed and accuracy. Besides, massive drone datasets are required for training and testing. Zheng et al. [76] collected an air-to-air drone dataset, “Det-Fly” (see Fig. 12), and evaluated air-to-air object detection of a micro-UAV with eight different object detection algorithms, namely RetinaNet [77], SSD, Faster R-CNN, YOLOv3 [78], FPN [79], Cascade R-CNN [80], and Grid R-CNN [81]. The evaluation results in [76] showed that the overall performance of Cascade R-CNN and Grid R-CNN is superior to that of the others, whereas YOLOv3 provides the fastest inference speed. Wei Xun et al. [82] conducted another investigation into drone detection, employing the YOLOv3 architecture and deploying the model on the NVIDIA Jetson TX2 platform. They collected a dataset comprising 1435 images featuring various UAVs, including drones, hexacopters, and quadcopters. Utilizing custom-trained weights, the YOLOv3 model demonstrated proficiency in drone detection within images. However, the deployment of this trained model faced constraints due to the limited computation capacity of the Jetson TX2, which posed challenges for effective real-time application.

In agile flight, computation speed is more important than accuracy, since real-time object detection is required to avoid obstacles swiftly. Therefore, a simple gate detector and a filtering algorithm are adopted as the basis of the visual perception of Swift [15]. Considering the agility of the drone, a stabilization module is required to obtain more robust and accurate object detection and tracking results in real-time flight. Moreover, in the drone datasets covered by existing works, each image only includes a single UAV. To classify and detect different classes of drones in multi-drone systems, a new dataset of multiple types of drones has to be built from scratch. Furthermore, such a dataset can be adapted to capture adversary drones in omnidirectional visual perception to enhance avoidance capability.

IV. VISION-BASED CONTROL

Vision-based control for robotics has been widely studied in recent years, whether for ground robots or for aerial robots such as drones. For drones flying in a GPS-denied environment,
Fig. 13: Vision-based control methods for drone applications. Based on how visual perception is obtained and how control commands are generated, the methods can be divided into indirect methods, semi-direct methods, and end-to-end methods.
visual-inertial odometry (VIO) [37]–[39] and visual simultaneous localization and mapping (SLAM) systems [40]–[42] have been the preferred choices for navigation. Meanwhile, in cluttered environments, research on obstacle avoidance [13], [28], [83], [84] based on visual perception has attracted much attention in the past few years. Obstacle avoidance has been a main task for vision-based control as well as for current learning algorithms for drones.

From the perspective of how drones obtain visual perception (the perception end) and how drones generate control commands from visual perception (the control end), existing vision-based control methods can be categorized into indirect methods, semi-direct methods, and end-to-end methods. The relationship between these three categories is illustrated in Fig. 13. In the following, we discuss and evaluate the methods in these three categories, respectively.

A. Indirect Methods

Indirect methods [13], [27], [85]–[93] refer to extracting features from images or videos to generate visual odometry, depth maps, and 3D point cloud maps for drones to perform path planning based on traditional optimization algorithms (see Fig. 14). Obstacle states, such as 3D shape, position, and velocity, are detected and mapped before a maneuver is taken. Once online maps are built or obstacles are located, the drone can generate a feasible path or take actions to avoid obstacles. SOTA indirect methods generally divide the mission into several subtasks, namely perception, mapping, and planning.

On the perception side, depth images are usually required to generate the corresponding distance and position information for navigation. A depth image is a grey-level or color image that represents the distance between the surfaces of objects and the viewpoint of the agent. Fig. 15 shows color images and corresponding depth images from a drone's viewpoint, where the illuminance encodes the distance from the camera: a lighter color denotes a nearer surface, and darker areas denote surfaces further away. A depth map provides the necessary distance information for drones to make decisions to avoid static and dynamic obstacles. Currently, off-the-shelf RGB-D cameras, such as the Intel RealSense depth camera D415, the ZED 2 stereo camera, and the Structure Core depth camera, are widely used in drone applications. Therefore, traditional obstacle avoidance methods can treat depth information as a direct input. However, for omnidirectional perception in wide-view scenarios, efficient onboard monocular depth estimation is always required, which remains a challenge for existing methods.
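As a simple illustration of how a depth map can feed an avoidance decision directly, the NumPy sketch below checks whether anything inside a central region of interest is closer than a safety distance; the synthetic depth frame and the 2 m threshold are arbitrary examples rather than values used in the works cited above.

import numpy as np

def too_close(depth_m, safety_dist=2.0, roi_frac=0.3):
    """depth_m: (H, W) depth image in meters (0 = invalid). Returns True if an
    obstacle inside the central region of interest is nearer than safety_dist."""
    h, w = depth_m.shape
    dh, dw = int(h * roi_frac / 2), int(w * roi_frac / 2)
    roi = depth_m[h // 2 - dh:h // 2 + dh, w // 2 - dw:w // 2 + dw]
    valid = roi[roi > 0]                      # ignore pixels with no depth return
    return valid.size > 0 and float(valid.min()) < safety_dist

# Synthetic 480x640 depth frame: mostly 5 m away, with a 1.5 m obstacle ahead.
depth = np.full((480, 640), 5.0)
depth[200:280, 300:340] = 1.5
print(too_close(depth))                       # True -> trigger a stop or replanning step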
On the mapping side, a point cloud map [94] or an OctoMap [95], representing a set of data points in 3D space, is commonly generated. Each point has its own Cartesian coordinates and can be used to represent a 3D shape or an object. A 3D point cloud map is not tied to the viewpoint of the drone but constitutes a global 3D map that provides global environmental information for the drone to fly. The point cloud map can be generated from a LIDAR scanner or from many overlapping images combined with depth information. An illustration of an original scene and the corresponding point cloud map is shown in Fig. 14, where the drone can travel around without colliding with static obstacles.

Planning is a basic requirement for a vision-based drone to avoid obstacles. Within the indirect methods, planning can be further divided into two categories: one is offline methods based on high-resolution maps and pre-known position information, such as Dijkstra's algorithm [96], A* [97], RRT-Connect [98], and sequential convex optimization [99]; the other is online methods based on real-time visual perception and decision-making. Online methods can be further categorized into online path planning [13], [90], [91] and artificial potential field (APF) methods [27], [100].
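For concreteness, the following is a minimal A* search on a 2D occupancy grid, i.e., the graph-search style of offline planner referred to above; practical drone planners such as Fast-Planner [90] operate on 3D maps and add kinodynamic and smoothness constraints on top of such a search.

import heapq

def astar(grid, start, goal):
    """grid: 2D list, 0 = free, 1 = occupied; start/goal: (row, col). Returns a path or None."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # Manhattan distance heuristic
    open_set = [(h(start), 0, start, None)]
    came_from, g_best = {}, {start: 0}
    while open_set:
        _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue                          # already expanded with an equal or better cost
        came_from[node] = parent
        if node == goal:                      # reconstruct the path by walking parents back
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                ng = g + 1
                if ng < g_best.get(nxt, float("inf")):
                    g_best[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt), ng, nxt, node))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))   # [(0,0), (0,1), (0,2), (1,2), (2,2), (2,1), (2,0)]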
Most vision-based drones rely on online methods. Compared to offline methods, which require an accurate pre-built global map, online methods provide advanced maneuvering capabilities for drones, especially in dynamic environments. Currently, owing to their optimization and prediction capabilities, online path planning methods have become the preferred choice for drone obstacle avoidance. For instance, in the SOTA work [90], Boyu Zhou et al. introduced a robust and efficient motion planning system called Fast-Planner for a vision-based drone performing high-speed flight in unknown cluttered environments. The key contribution of this work is a robust and efficient planning scheme incorporating path searching, B-spline optimization, and time adjustment to generate feasible and safe trajectories for vision-based drones' obstacle avoidance. Using only onboard vision-based perception and computing, this work demonstrated agile drone navigation in unexplored indoor and outdoor environments. However, this approach can only achieve maximum speeds of 3 m/s and requires 7.3 ms of computation at each step. To improve the flight performance and save computation time, Xin Zhou et al. [13] provided a Euclidean Signed Distance Field (ESDF)-free gradient-based planning framework, EGO-Planner,
Fig. 16: Artificial potential field (APF) methods in drones' obstacle avoidance [100]. (a) Generated artificial potential field; (b) Flight trajectory based on the APF.
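To illustrate the APF idea behind [27], [100], the sketch below performs one planar control step at a time: an attractive force pulls the drone toward the goal while repulsive forces push it away from obstacles inside an influence radius. The gains, radius, and 2D setting are arbitrary illustrative choices, not parameters from the cited works.

import numpy as np

def apf_step(pos, goal, obstacles, k_att=1.0, k_rep=0.6, d0=2.0, step=0.1):
    """One APF update in 2D: returns the next position of the drone."""
    force = k_att * (goal - pos)                       # attractive force toward the goal
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 0 < d < d0:                                 # repulsion only inside the influence radius
            force += k_rep * (1.0 / d - 1.0 / d0) / d**2 * (diff / d)
    return pos + step * force / (np.linalg.norm(force) + 1e-9)   # small step along the net force

pos = np.array([0.0, 0.0])
goal = np.array([5.0, 0.0])
obstacles = [np.array([2.5, 0.2])]
for _ in range(200):
    if np.linalg.norm(goal - pos) < 0.2:
        break
    pos = apf_step(pos, goal, obstacles)
print(np.round(pos, 2))   # ends near the goal after being pushed around the obstacle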
Fig. 20: Training process used in [28] to fly an agile drone through a forest.

Fig. 21: A drone avoids obstacles in a forest with semi-direct methods [83].
VI. OPEN QUESTIONS AND POTENTIAL SOLUTIONS

Despite significant advancements in the domain of vision-based learning for drones, numerous challenges remain that impede the pace of development and the real-world applicability of these methods. These challenges span various aspects, from data collection and simulation accuracy to operational efficiency and safety concerns.

One potential solution lies in optimizing machine learning models for edge computing, enabling drones to process data and make decisions swiftly. Techniques like model pruning [185], [186], quantization [187], [188], and distillation [189], as well as the development of specialized hardware accelerators, can play a pivotal role in this regard.
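As one concrete example of the quantization techniques mentioned above, PyTorch's post-training dynamic quantization converts the linear layers of a trained network to 8-bit integer arithmetic; the toy model below merely stands in for a perception network and is not one actually deployed on a drone.

import torch
import torch.nn as nn

# A small network standing in for an onboard perception head.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

# Post-training dynamic quantization: nn.Linear weights are stored as int8 and
# activations are quantized on the fly, shrinking memory use and speeding up
# CPU inference on embedded platforms.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
print(model(x).shape, quantized(x).shape)    # both produce (1, 10) outputs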
[30] N. J. Sanket, C. M. Parameshwara, C. D. Singh, A. V. Kuruttukulam, [51] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. Mc-
C. Fermüller, D. Scaramuzza, and Y. Aloimonos, “Evdodgenet: Deep Grew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schnei-
dynamic obstacle dodging with event cameras,” in 2020 IEEE Interna- der, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba,
tional Conference on Robotics and Automation (ICRA). IEEE, 2020, and L. Zhang, “Solving rubik’s cube with a robot hand,” arXiv preprint,
pp. 10 651–10 657. 2019.
[31] N. J. Sanket, C. D. Singh, C. Fermüller, and Y. Aloimonos, “Ajna: [52] M. Liaq and Y. Byun, “Autonomous uav navigation using reinforcement
Generalized deep uncertainty for minimal perception on parsimonious learning,” International Journal of Machine Learning and Computing,
robots,” Science Robotics, vol. 8, no. 81, p. eadd5139, 2023. vol. 9, no. 6, 2019.
[32] R. Siegwart, I. R. Nourbakhsh, and D. Scaramuzza, Introduction to [53] C. Wu, B. Ju, Y. Wu, X. Lin, N. Xiong, G. Xu, H. Li, and X. Liang,
autonomous mobile robots. MIT press, 2011. “Uav autonomous target search based on deep reinforcement learning
[33] N. Michael, S. Shen, K. Mohta, Y. Mulgaonkar, V. Kumar, K. Nagatani, in complex disaster scene,” IEEE Access, vol. 7, pp. 117 227–117 245,
Y. Okada, S. Kiribayashi, K. Otake, K. Yoshida et al., “Collaborative 2019.
mapping of an earthquake-damaged building via ground and aerial [54] C. Xiao, P. Lu, and Q. He, “Flying through a narrow gap using end-to-
robots,” Journal of Field Robotics, vol. 29, no. 5, pp. 832–841, 2012. end deep reinforcement learning augmented with curriculum learning
[34] H. Guan, X. Sun, Y. Su, T. Hu, H. Wang, H. Wang, C. Peng, and and sim2real,” IEEE Transactions on Neural Networks and Learning
Q. Guo, “UAV-lidar aids automatic intelligent powerline inspection,” Systems, vol. 34, no. 5, pp. 2701–2708, 2023.
International Journal of Electrical Power and Energy Systems, vol. [55] Y. Song, M. Steinweg, E. Kaufmann, and D. Scaramuzza, “Autonomous
130, p. 106987, sep 2021. drone racing with deep reinforcement learning,” in 2021 IEEE/RSJ
[35] R. Opromolla, G. Fasano, G. Rufino, and M. Grassi, “Uncooperative International Conference on Intelligent Robots and Systems (IROS).
pose estimation with a lidar-based system,” Acta Astronautica, vol. 110, IEEE, 2021, pp. 1205–1212.
pp. 287–297, 2015. [56] J. Xiao, P. Pisutsin, and M. Feroskhan, “Collaborative target search
[36] Z. Wang, Z. Zhao, Z. Jin, Z. Che, J. Tang, C. Shen, and Y. Peng, with a visual drone swarm: An adaptive curriculum embedded multi-
“Multi-stage fusion for multi-class 3d lidar detection,” in Proceedings stage reinforcement learning approach,” IEEE Transactions on Neural
of the IEEE/CVF International Conference on Computer Vision, 2021, Networks and Learning Systems, pp. 1–15, 2023.
pp. 3120–3128. [57] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object detection with
[37] M. O. Aqel, M. H. Marhaban, M. I. Saripan, and N. B. Ismail, “Review deep learning: A review,” IEEE Transactions on Neural Networks and
of visual odometry: types, approaches, challenges, and applications,” Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
SpringerPlus, vol. 5, pp. 1–26, 2016. [58] S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, “Zoedepth:
[38] J. Delmerico and D. Scaramuzza, “A benchmark comparison of monoc- Zero-shot transfer by combining relative and metric depth,” arXiv
ular visual-inertial odometry algorithms for flying robots,” in 2018 preprint arXiv:2302.12288, 2023.
IEEE international conference on robotics and automation (ICRA). [59] H. Laga, L. V. Jospin, F. Boussaid, and M. Bennamoun, “A survey
IEEE, 2018, pp. 2502–2509. on deep learning techniques for stereo-based depth estimation,” IEEE
[39] D. Scaramuzza and Z. Zhang, Aerial Robots, Visual-Inertial Odometry Transactions on Pattern Analysis and Machine Intelligence, vol. 44,
of. Berlin, Heidelberg: Springer Berlin Heidelberg, 2020, pp. 1–9. no. 4, pp. 1738–1764, 2020.
[Online]. Available: https://doi.org/10.1007/978-3-642-41610-1 71-1 [60] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, “A survey of convolutional
[40] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile neural networks: Analysis, applications, and prospects,” IEEE Trans-
and accurate monocular slam system,” IEEE transactions on robotics, actions on Neural Networks and Learning Systems, vol. 33, no. 12, pp.
vol. 31, no. 5, pp. 1147–1163, 2015. 6999–7019, 2022.
[41] T. Qin, P. Li, and S. Shen, “VINS-Mono: A Robust and Versatile [61] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
Monocular Visual-Inertial State Estimator,” IEEE Transactions on T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al.,
Robotics, vol. 34, no. 4, pp. 1004–1020, aug 2018. “An image is worth 16x16 words: Transformers for image recognition
[42] C. Campos, R. Elvira, J. J. Gómez Rodrı́guez, J. M. M. Montiel, and at scale,” arXiv preprint arXiv:2010.11929, 2020.
J. D. Tardós, “ORB-SLAM3: An Accurate Open-Source Library for [62] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo,
Visual, Visual-Inertial and Multi-Map SLAM,” 2021. “Swin transformer: Hierarchical vision transformer using shifted win-
[43] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, dows,” in Proceedings of the IEEE/CVF international conference on
S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis et al., “Event- computer vision, 2021, pp. 10 012–10 022.
based vision: A survey,” IEEE transactions on pattern analysis and [63] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi,
machine intelligence, vol. 44, no. 1, pp. 154–180, 2020. J. Fan, and Z. He, “A survey of visual transformers,” IEEE Transactions
[44] W. Gao, K. Wang, W. Ding, F. Gao, T. Qin, and S. Shen, “Autonomous on Neural Networks and Learning Systems, pp. 1–21, 2023.
aerial robot using dual-fisheye cameras,” Journal of Field Robotics, [64] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based
vol. 37, no. 4, pp. 497–514, 2020. convolutional networks for accurate object detection and segmenta-
[45] V. R. Kumar, S. Yogamani, H. Rashed, G. Sitsu, C. Witt, I. Leang, tion,” IEEE transactions on pattern analysis and machine intelligence,
S. Milz, and P. Mäder, “Omnidet: Surround view cameras based multi- vol. 38, no. 1, pp. 142–158, 2015.
task visual perception network for autonomous driving,” IEEE Robotics [65] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international
and Automation Letters, vol. 6, no. 2, pp. 2830–2837, 2021. conference on computer vision, 2015, pp. 1440–1448.
[46] A. D. Haumann, K. D. Listmann, and V. Willert, “DisCoverage: A [66] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
new paradigm for multi-robot exploration,” in Proceedings - IEEE object detection with region proposal networks,” Advances in neural
International Conference on Robotics and Automation, 2010, pp. 929– information processing systems, vol. 28, 2015.
934. [67] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu,
[47] A. H. Tan, F. P. Bejarano, Y. Zhu, R. Ren, and G. Nejat, “Deep and A. C. Berg, “Ssd: Single shot multibox detector,” in European
reinforcement learning for decentralized multi-robot exploration with conference on computer vision. Springer, 2016, pp. 21–37.
macro actions,” IEEE Robotics and Automation Letters, vol. 8, no. 1, [68] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
pp. 272–279, 2022. once: Unified, real-time object detection,” in Proceedings of the IEEE
[48] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and conference on computer vision and pattern recognition, 2016, pp. 779–
A. Farhadi, “Target-driven visual navigation in indoor scenes using 788.
deep reinforcement learning,” in 2017 IEEE international conference [69] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo
on robotics and automation (ICRA). IEEE, 2017, pp. 3357–3364. series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
[49] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, [70] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang,
M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and L. Dong et al., “Swin transformer v2: Scaling up capacity and
R. Hadsell, “Learning to navigate in complex environments,” 5th resolution,” in Proceedings of the IEEE/CVF conference on computer
International Conference on Learning Representations, ICLR 2017 - vision and pattern recognition, 2022, pp. 12 009–12 019.
Conference Track Proceedings, 2017. [71] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision
[50] Q. Wu, X. Gong, K. Xu, D. Manocha, J. Dong, and J. Wang, “Towards transformer backbones for object detection,” in European Conference
target-driven visual navigation in indoor scenes via generative imitation on Computer Vision. Springer, 2022, pp. 280–296.
learning,” IEEE Robotics and Automation Letters, vol. 6, no. 1, pp. [72] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. Ni, and H.-Y.
175–182, 2020. Shum, “Dino: Detr with improved denoising anchor boxes for end-
to-end object detection,” in The Eleventh International Conference on [94] R. B. Rusu, Z. C. Marton, N. Blodow, M. Dolha, and M. Beetz, “To-
Learning Representations, 2022. wards 3d point cloud based object maps for household environments,”
[73] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Robotics and Autonomous Systems, vol. 56, no. 11, pp. 927–941, 2008.
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” [95] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and
Advances in neural information processing systems, vol. 30, 2017. W. Burgard, “OctoMap: An efficient probabilistic 3D mapping
[74] Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, “Efficient attention: framework based on octrees,” Autonomous Robots, 2013, software
Attention with linear complexities,” in Proceedings of the IEEE/CVF available at https://octomap.github.io. [Online]. Available: https:
winter conference on applications of computer vision, 2021, pp. 3531– //octomap.github.io
3539. [96] E. W. Dijkstra, “A note on two problems in connexion with graphs,”
[75] M. Maaz, A. Shaker, H. Cholakkal, S. Khan, S. W. Zamir, R. M. Numerische Mathematik, vol. 1, pp. 269–271, 1959.
Anwer, and F. Shahbaz Khan, “Edgenext: efficiently amalgamated cnn- [97] P. E. Hart, N. J. Nilsson, and B. Raphael, “A formal basis for the
transformer architecture for mobile vision applications,” in European heuristic determination of minimum cost paths,” IEEE Transactions on
Conference on Computer Vision. Springer, 2022, pp. 3–20. Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.
[76] Y. Zheng, Z. Chen, D. Lv, Z. Li, Z. Lan, and S. Zhao, “Air-to-air visual [98] J. J. Kuffner and S. M. LaValle, “Rrt-connect: An efficient approach
detection of micro-uavs: An experimental evaluation of deep learning,” to single-query path planning,” in Proceedings 2000 ICRA. Millennium
IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1020–1027, Conference. IEEE International Conference on Robotics and Automa-
2021. tion. Symposia Proceedings (Cat. No. 00CH37065), vol. 2. IEEE,
[77] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss 2000, pp. 995–1001.
for dense object detection,” in Proceedings of the IEEE international [99] F. Augugliaro, A. P. Schoellig, and R. D’Andrea, “Generation of
conference on computer vision, 2017, pp. 2980–2988. collision-free trajectories for a quadrocopter fleet: A sequential convex
[78] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” programming approach,” in 2012 IEEE/RSJ international conference
arXiv preprint arXiv:1804.02767, 2018. on Intelligent Robots and Systems. IEEE, 2012, pp. 1917–1922.
[79] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, [100] I. Iswanto, A. Ma’arif, O. Wahyunggoro, and A. Imam, “Artificial
“Feature pyramid networks for object detection,” in Proceedings of the potential field algorithm implementation for quadrotor path planning,”
IEEE conference on computer vision and pattern recognition, 2017, Int. J. Adv. Comput. Sci. Appl, vol. 10, no. 8, pp. 575–585, 2019.
pp. 2117–2125. [101] T. Huang, S. Zhao, L. Geng, and Q. Xu, “Unsupervised monocular
[80] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality depth estimation based on residual neural network of coarse–refined
object detection,” in Proceedings of the IEEE conference on computer feature extractions for drone,” Electronics, vol. 8, no. 10, p. 1179,
vision and pattern recognition, 2018, pp. 6154–6162. 2019.
[81] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan, “Grid r-cnn,” in Proceed- [102] O. Khatib, “Real-time obstacle avoidance for manipulators and mo-
ings of the IEEE/CVF Conference on Computer Vision and Pattern bile robots,” in Proceedings. 1985 IEEE International Conference on
Recognition, 2019, pp. 7363–7372. Robotics and Automation, vol. 2. IEEE, 1985, pp. 500–505.
[82] D. T. Wei Xun, Y. L. Lim, and S. Srigrarom, “Drone detection [103] X. Dai, Y. Mao, T. Huang, N. Qin, D. Huang, and Y. Li, “Automatic
using yolov3 with transfer learning on nvidia jetson tx2,” in 2021 obstacle avoidance of quadrotor uav via cnn-based learning,” Neuro-
Second International Symposium on Instrumentation, Control, Artificial computing, vol. 402, pp. 346–358, 2020.
Intelligence, and Robotics (ICA-SYMP), 2021, pp. 1–6. [104] M. A. Anwar and A. Raychowdhury, “Navren-rl: Learning to fly in real
[83] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, environment via end-to-end deep reinforcement learning using monoc-
J. A. Bagnell, and M. Hebert, “Learning monocular reactive uav ular images,” in 2018 25th International Conference on Mechatronics
control in cluttered natural environments,” in 2013 IEEE international and Machine Vision in Practice (M2VIP). IEEE, 2018, pp. 1–6.
conference on robotics and automation. IEEE, 2013, pp. 1765–1772. [105] Y. Zhang, K. H. Low, and C. Lyu, “Partially-observable monocular
[84] L. Xie, S. Wang, A. Markham, and N. Trigoni, “Towards monocular autonomous navigation for uav through deep reinforcement learning,”
vision based obstacle avoidance through deep reinforcement learning,” in AIAA AVIATION 2023 Forum, 2023, p. 3813.
arXiv preprint arXiv:1706.09829, 2017. [106] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
[85] K. Mohta, M. Watterson, Y. Mulgaonkar, S. Liu, C. Qu, A. Makineni, MIT press, 2018.
K. Saulnier, K. Sun, A. Zhu, J. Delmerico et al., “Fast, autonomous [107] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
flight in gps-denied and cluttered environments,” Journal of Field recognition,” in Proceedings of the IEEE conference on computer vision
Robotics, vol. 35, no. 1, pp. 101–120, 2018. and pattern recognition, 2016, pp. 770–778.
[86] F. Gao, W. Wu, J. Pan, B. Zhou, and S. Shen, “Optimal Time [108] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
Allocation for Quadrotor Trajectory Generation,” in IEEE International large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
Conference on Intelligent Robots and Systems. Institute of Electrical [109] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
and Electronics Engineers Inc., dec 2018, pp. 4715–4722. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
[87] F. Gao, L. Wang, B. Zhou, X. Zhou, J. Pan, and S. Shen, “Teach- S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran,
repeat-replan: A complete and robust system for aggressive flight in D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through
complex environments,” IEEE Transactions on Robotics, vol. 36, no. 5, deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533,
pp. 1526–1545, 2020. feb 2015.
[88] F. Gao, W. Wu, W. Gao, and S. Shen, “Flying on point clouds: Online [110] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox-
trajectory generation and autonomous navigation for quadrotors in imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347,
cluttered environments,” Journal of Field Robotics, vol. 36, no. 4, pp. 2017.
710–733, jun 2019. [111] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8,
[89] F. Gao, L. Wang, B. Zhou, L. Han, J. Pan, and S. Shen, “Teach-repeat- no. 3, pp. 279–292, 1992.
replan: A complete and robust system for aggressive flight in complex [112] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
environments,” pp. 1526–1545, may 2019. with double q-learning,” in Proceedings of the AAAI conference on
[90] B. Zhou, F. Gao, L. Wang, C. Liu, and S. Shen, “Robust and efficient artificial intelligence, vol. 30, no. 1, 2016.
quadrotor trajectory generation for fast autonomous flight,” IEEE [113] Z. Zhu, K. Lin, B. Dai, and J. Zhou, “Off-policy imitation learning from
Robotics and Automation Letters, vol. 4, no. 4, pp. 3529–3536, 2019. observations,” Advances in Neural Information Processing Systems,
[91] B. Zhou, J. Pan, F. Gao, and S. Shen, “Raptor: Robust and perception- vol. 33, pp. 12 402–12 413, 2020.
aware trajectory replanning for quadrotor fast flight,” IEEE Transac- [114] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson,
tions on Robotics, 2021. “Counterfactual multi-agent policy gradients,” Proceedings of the AAAI
[92] L. Quan, Z. Zhang, X. Zhong, C. Xu, and F. Gao, “Eva-planner: En- conference on artificial intelligence, vol. 32, no. 1, 2018.
vironmental adaptive quadrotor planning,” in 2021 IEEE International [115] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster,
Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. and S. Whiteson, “Monotonic value function factorisation for deep
398–404. multi-agent reinforcement learning,” The Journal of Machine Learning
[93] Y. Zhang, Q. Yu, K. H. Low, and C. Lv, “A self-supervised monocular Research, vol. 21, no. 1, pp. 7234–7284, 2020.
depth estimation approach based on uav aerial images,” in 2022 [116] J. Xiao, Y. X. M. Tan, X. Zhou, and M. Feroskhan, “Learning
IEEE/AIAA 41st Digital Avionics Systems Conference (DASC). IEEE, collaborative multi-target search for a visual drone swarm,” in 2023
2022, pp. 1–8. IEEE Conference on Artificial Intelligence (CAI), 2023, pp. 5–7.
[117] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity [138] E. Lygouras, N. Santavas, A. Taitzoglou, K. Tarchanidis, A. Mitropou-
visual and physical simulation for autonomous vehicles,” in Field and los, and A. Gasteratos, “Unsupervised human detection with an em-
service robotics. Springer, 2018, pp. 621–635. bedded vision system on a fully autonomous uav for search and rescue
[118] A. Juliani, V.-P. Berges, E. Teng, A. Cohen, J. Harper, C. Elion, C. Goy, operations,” Sensors, vol. 19, no. 16, p. 3542, 2019.
Y. Gao, H. Henry, M. Mattar et al., “Unity: A general platform for [139] T. Tomic, K. Schmid, P. Lutz, A. Domel, M. Kassecker, E. Mair,
intelligent agents,” arXiv preprint arXiv:1809.02627, 2018. I. L. Grixa, F. Ruess, M. Suppa, and D. Burschka, “Toward a fully
[119] I. Zamora, N. G. Lopez, V. M. Vilches, and A. H. Cordero, “Extending autonomous uav: Research platform for indoor and outdoor urban
the openai gym for robotics: a toolkit for reinforcement learning using search and rescue,” IEEE robotics & automation magazine, vol. 19,
ros and gazebo,” arXiv preprint arXiv:1608.05742, 2016. no. 3, pp. 46–56, 2012.
[120] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. To- [140] J. Senthilnath, M. Kandukuri, A. Dokania, and K. Ramesh, “Applica-
bin, P. Abbeel, and W. Zaremba, “Transfer from simulation to real tion of uav imaging platform for vegetation analysis based on spectral-
world through learning deep inverse dynamics model,” arXiv preprint spatial methods,” Computers and Electronics in Agriculture, vol. 140,
arXiv:1610.03518, 2016. pp. 8–24, 2017.
[121] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bo- [141] M. R. Khosravi and S. Samadi, “Bl-alm: A blind scalable edge-guided
hez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for reconstruction filter for smart environmental monitoring through green
quadruped robots,” arXiv preprint arXiv:1804.10332, 2018. iomt-uav networks,” IEEE Transactions on Green Communications and
Networking, vol. 5, no. 2, pp. 727–736, 2021.
[122] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning
[142] C. Donmez, O. Villi, S. Berberoglu, and A. Cilek, “Computer vision-
for fast adaptation of deep networks,” in International conference on
based citrus tree detection in a cultivated environment using uav
machine learning. PMLR, 2017, pp. 1126–1135.
imagery,” Computers and Electronics in Agriculture, vol. 187, p.
[123] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, 106273, 2021.
“Domain randomization for transferring deep neural networks from [143] B. Lu and Y. He, “Species classification using unmanned aerial vehicle
simulation to the real world,” in 2017 IEEE/RSJ international con- (uav)-acquired high spatial resolution imagery in a heterogeneous
ference on intelligent robots and systems (IROS). IEEE, 2017, pp. grassland,” ISPRS Journal of Photogrammetry and Remote Sensing,
23–30. vol. 128, pp. 73–85, 2017.
[124] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. Mc- [144] D. Kim, M. Liu, S. Lee, and V. R. Kamat, “Remote proximity mon-
Grew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., itoring between mobile construction resources using camera-mounted
“Learning dexterous in-hand manipulation,” The International Journal uavs,” Automation in Construction, vol. 99, pp. 168–182, 2019.
of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020. [145] T. Khuc, T. A. Nguyen, H. Dao, and F. N. Catbas, “Swaying displace-
[125] M. A. Akhloufi, S. Arola, and A. Bonnet, “Drones chasing drones: ment measurement for structural monitoring using computer vision and
Reinforcement learning and deep search area proposal,” Drones, vol. 3, an unmanned aerial vehicle,” Measurement, vol. 159, p. 107769, 2020.
no. 3, p. 58, 2019. [146] S. Li, E. van der Horst, P. Duernay, C. De Wagter, and G. C.
[126] S. Geyer and E. Johnson, “3d obstacle avoidance in adversarial environ- de Croon, “Visual model-predictive localization for computationally
ments for unmanned aerial vehicles,” in AIAA Guidance, Navigation, efficient autonomous racing of a 72-g drone,” Journal of Field Robotics,
and Control Conference and Exhibit, 2006, p. 6542. vol. 37, no. 4, pp. 667–692, 2020.
[127] F. Schilling, J. Lecoeur, F. Schiano, and D. Floreano, “Learning [147] M. Muller, G. Li, V. Casser, N. Smith, D. L. Michels, and B. Ghanem,
vision-based flight in drone swarms by imitation,” IEEE Robotics and “Learning a controller fusion network by online trajectory filtering for
Automation Letters, vol. 4, no. 4, pp. 4523–4530, 2019. vision-based uav racing,” in Proceedings of the IEEE/CVF Conference
[128] Y. Xie, M. Lu, R. Peng, and P. Lu, “Learning agile flights through nar- on Computer Vision and Pattern Recognition Workshops, 2019, pp.
row gaps with varying angles using onboard sensing,” IEEE Robotics 0–0.
and Automation Letters, vol. 8, no. 9, pp. 5424–5431, 2023. [148] M. Muller, V. Casser, N. Smith, D. L. Michels, and B. Ghanem,
[129] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocu- “Teaching uavs to race: End-to-end regression of agile controls in
lar depth estimation with left-right consistency,” in Proceedings of the simulation,” in Proceedings of the European Conference on Computer
IEEE conference on computer vision and pattern recognition, 2017, Vision (ECCV) Workshops, 2018, pp. 0–0.
pp. 270–279. [149] R. Penicka and D. Scaramuzza, “Minimum-time quadrotor waypoint
[130] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging flight in cluttered environments,” IEEE Robotics and Automation
into self-supervised monocular depth estimation,” in Proceedings of Letters, vol. 7, no. 2, pp. 5719–5726, 2022.
the IEEE/CVF International Conference on Computer Vision, 2019, [150] E. Kaufmann, M. Gehrig, P. Foehn, R. Ranftl, A. Dosovitskiy,
pp. 3828–3838. V. Koltun, and D. Scaramuzza, “Beauty and the beast: Optimal methods
[131] Y. Liu, L. Wang, and M. Liu, “Yolostereo3d: A step back to 2d for meet learning for drone racing,” in 2019 International Conference on
efficient stereo 3d detection,” in 2021 IEEE International Conference Robotics and Automation (ICRA). IEEE, 2019, pp. 690–696.
on Robotics and Automation (ICRA). IEEE, 2021, pp. 13 018–13 024. [151] L. Xing, X. Fan, Y. Dong, Z. Xiong, L. Xing, Y. Yang, H. Bai, and
C. Zhou, “Multi-uav cooperative system for search and rescue based
[132] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, “Fast
on yolov5,” International Journal of Disaster Risk Reduction, vol. 76,
robust monocular depth estimation for obstacle detection with fully
p. 102972, 2022.
convolutional networks,” in 2016 IEEE/RSJ International Conference
[152] B. Lin, L. Wu, and Y. Niu, “End-to-end vision-based cooperative target
on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4296–
geo-localization for multiple micro uavs,” Journal of Intelligent &
4303.
Robotic Systems, vol. 106, no. 1, p. 13, 2022.
[133] A. Singh, D. Patil, and S. Omkar, “Eye in the sky: Real-time drone [153] M. E. Campbell and W. W. Whitacre, “Cooperative tracking using
surveillance system (dss) for violent individuals identification using vision measurements on seascan uavs,” IEEE Transactions on Control
scatternet hybrid deep learning network,” in Proceedings of the IEEE Systems Technology, vol. 15, no. 4, pp. 613–626, 2007.
conference on computer vision and pattern recognition workshops, [154] J. Gu, T. Su, Q. Wang, X. Du, and M. Guizani, “Multiple moving
2018, pp. 1629–1637. targets surveillance based on a cooperative network for multi-uav,”
[134] W. Li, H. Li, Q. Wu, X. Chen, and K. N. Ngan, “Simultaneously IEEE Communications Magazine, vol. 56, no. 4, pp. 82–89, 2018.
detecting and counting dense vehicles from drone images,” IEEE [155] Y. Cao, F. Qi, Y. Jing, M. Zhu, T. Lei, Z. Li, J. Xia, J. Wang, and G. Lu,
Transactions on Industrial Electronics, vol. 66, no. 12, pp. 9651–9662, “Mission chain driven unmanned aerial vehicle swarms cooperation for
2019. the search and rescue of outdoor injured human targets,” Drones, vol. 6,
[135] H. Zhou, H. Kong, L. Wei, D. Creighton, and S. Nahavandi, “On no. 6, p. 138, 2022.
detecting road regions in a single uav image,” IEEE transactions on [156] G. Loianno, J. Thomas, and V. Kumar, “Cooperative localization and
intelligent transportation systems, vol. 18, no. 7, pp. 1713–1722, 2016. mapping of mavs using rgb-d sensors,” in 2015 IEEE International
[136] N. H. Motlagh, M. Bagaa, and T. Taleb, “Uav-based iot platform: A Conference on Robotics and Automation (ICRA). IEEE, 2015, pp.
crowd surveillance use case,” IEEE Communications Magazine, vol. 55, 4021–4028.
no. 2, pp. 128–134, 2017. [157] P. Tong, X. Yang, Y. Yang, W. Liu, and P. Wu, “Multi-uav collaborative
[137] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, absolute vision positioning and navigation: A survey and discussion,”
J. A. Adams, and C. Humphrey, “Supporting wilderness search and Drones, vol. 7, no. 4, p. 261, 2023.
rescue using a camera-equipped mini uav,” Journal of Field Robotics, [158] N. Piasco, J. Marzat, and M. Sanfourche, “Collaborative localization
vol. 25, no. 1-2, pp. 89–110, 2008. and formation flying using distributed stereo-vision,” in 2016 IEEE
International Conference on Robotics and Automation (ICRA). IEEE, [178] Y. Zhao, Z. Ju, T. Sun, F. Dong, J. Li, R. Yang, Q. Fu, C. Lian,
2016, pp. 1202–1207. and P. Shan, “Tgc-yolov5: An enhanced yolov5 drone detection model
[159] D. Liu, X. Zhu, W. Bao, B. Fei, and J. Wu, “Smart: Vision-based based on transformer, gam & ca attention mechanism,” Drones, vol. 7,
method of cooperative surveillance and tracking by multiple uavs in the no. 7, p. 446, 2023.
urban environment,” IEEE Transactions on Intelligent Transportation [179] Y. Song, S. Naji, E. Kaufmann, A. Loquercio, and D. Scaramuzza,
Systems, vol. 23, no. 12, pp. 24 941–24 956, 2022. “Flightmare: A flexible quadrotor simulator,” in Conference on Robot
[160] N. Farmani, L. Sun, and D. J. Pack, “A scalable multitarget tracking Learning. PMLR, 2021, pp. 1147–1157.
system for cooperative unmanned aerial vehicles,” IEEE Transactions [180] J. Panerati, H. Zheng, S. Zhou, J. Xu, A. Prorok, and A. P. Schoel-
on Aerospace and Electronic Systems, vol. 53, no. 4, pp. 1947–1961, lig, “Learning to fly—a gym environment with pybullet physics for
2017. reinforcement learning of multi-agent quadcopter control,” in 2021
[161] M. Jouhari, A. K. Al-Ali, E. Baccour, A. Mohamed, A. Erbad, IEEE/RSJ International Conference on Intelligent Robots and Systems
M. Guizani, and M. Hamdi, “Distributed cnn inference on resource- (IROS), 2021, pp. 7512–7519.
constrained uavs for surveillance systems: Design and optimization,” [181] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla:
IEEE Internet of Things Journal, vol. 9, no. 2, pp. 1227–1242, 2021. An open urban driving simulator,” in Conference on robot learning.
[162] W. J. Yun, S. Park, J. Kim, M. Shin, S. Jung, D. A. Mohaisen, and J.-H. PMLR, 2017, pp. 1–16.
Kim, “Cooperative multiagent deep reinforcement learning for reliable [182] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and
surveillance via autonomous multi-uav control,” IEEE Transactions on Q. He, “A comprehensive survey on transfer learning,” Proceedings of
Industrial Informatics, vol. 18, no. 10, pp. 7086–7096, 2022. the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
[163] Y. Tang, Y. Hu, J. Cui, F. Liao, M. Lao, F. Lin, and R. S. Teo, “Vision- [183] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few
aided multi-uav autonomous flocking in gps-denied environment,” examples: A survey on few-shot learning,” ACM computing surveys
IEEE Transactions on industrial electronics, vol. 66, no. 1, pp. 616– (csur), vol. 53, no. 3, pp. 1–34, 2020.
626, 2018. [184] J. Fonseca and F. Bacao, “Tabular and latent space synthetic data
[164] J. Scherer, S. Yahyanejad, S. Hayat, E. Yanmaz, T. Andre, A. Khan, generation: a literature review,” Journal of Big Data, vol. 10, no. 1, p.
V. Vukadinovic, C. Bettstetter, H. Hellwagner, and B. Rinner, “An 115, 2023.
autonomous multi-uav system for search and rescue,” in Proceedings [185] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the
of the first workshop on micro aerial vehicle networks, systems, and value of network pruning,” in International Conference on Learning
applications for civilian use, 2015, pp. 33–38. Representations, 2018.
[165] Y. Rizk, M. Awad, and E. W. Tunstel, “Cooperative heterogeneous [186] Y. Jiang, S. Wang, V. Valls, B. J. Ko, W.-H. Lee, K. K. Leung, and
multi-robot systems: A survey,” ACM Computing Surveys (CSUR), L. Tassiulas, “Model pruning enables efficient federated learning on
vol. 52, no. 2, pp. 1–31, 2019. edge devices,” IEEE Transactions on Neural Networks and Learning
[166] G. Niu, L. Wu, Y. Gao, and M.-O. Pun, “Unmanned aerial vehicle Systems, 2022.
(uav)-assisted path planning for unmanned ground vehicles (ugvs) [187] Y. Zhou, S.-M. Moosavi-Dezfooli, N.-M. Cheung, and P. Frossard,
via disciplined convex-concave programming,” IEEE Transactions on “Adaptive quantization for deep neural network,” in Proceedings of
Vehicular Technology, vol. 71, no. 7, pp. 6996–7007, 2022. the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[188] L. Deng, G. Li, S. Han, L. Shi, and Y. Xie, “Model compression and
[167] D. Liu, W. Bao, X. Zhu, B. Fei, Z. Xiao, and T. Men, “Vision-aware
hardware acceleration for neural networks: A comprehensive survey,”
air-ground cooperative target localization for uav and ugv,” Aerospace
Proceedings of the IEEE, vol. 108, no. 4, pp. 485–532, 2020.
Science and Technology, vol. 124, p. 107525, 2022.
[189] L. Wang and K.-J. Yoon, “Knowledge distillation and student-teacher
[168] J. Li, G. Deng, C. Luo, Q. Lin, Q. Yan, and Z. Ming, “A hybrid path
learning for visual intelligence: A review and new outlooks,” IEEE
planning method in unmanned air/ground vehicle (uav/ugv) cooperative
transactions on pattern analysis and machine intelligence, vol. 44,
systems,” IEEE Transactions on Vehicular Technology, vol. 65, no. 12,
no. 6, pp. 3048–3068, 2021.
pp. 9585–9596, 2016.
[190] R. Gong, Q. Huang, X. Ma, H. Vo, Z. Durante, Y. Noda, Z. Zheng,
[169] L. Zhang, F. Gao, F. Deng, L. Xi, and J. Chen, “Distributed estimation S.-C. Zhu, D. Terzopoulos, L. Fei-Fei et al., “Mindagent: Emergent
of a layered architecture for collaborative air–ground target geolocation gaming interaction,” arXiv preprint arXiv:2309.09971, 2023.
in outdoor environments,” IEEE Transactions on Industrial Electronics, [191] A. Bhoopchand, B. Brownfield, A. Collister, A. Dal Lago, A. Edwards,
vol. 70, no. 3, pp. 2822–2832, 2022. R. Everett, A. Fréchette, Y. G. Oliveira, E. Hughes, K. W. Mathewson
[170] G. Niu, Q. Yang, Y. Gao, and M.-O. Pun, “Vision-based autonomous et al., “Learning few-shot imitation as cultural transmission,” Nature
landing for unmanned aerial and ground vehicles cooperative systems,” Communications, vol. 14, no. 1, p. 7536, 2023.
IEEE robotics and automation letters, vol. 7, no. 3, pp. 6234–6241, [192] J. Xiao and M. Feroskhan, “Cyber attack detection and isolation for
2021. a quadrotor uav with modified sliding innovation sequences,” IEEE
[171] Z.-C. Xu, B.-B. Hu, B. Liu, X. Wang, and H.-T. Zhang, “Vision- Transactions on Vehicular Technology, vol. 71, no. 7, pp. 7202–7214,
based autonomous landing of unmanned aerial vehicle on a motional 2022.
unmanned surface vessel,” in 2020 39th Chinese Control Conference [193] T. T. Nguyen and V. J. Reddi, “Deep reinforcement learning for
(CCC). IEEE, 2020, pp. 6845–6850. cyber security,” IEEE Transactions on Neural Networks and Learning
[172] C. Hui, C. Yousheng, L. Xiaokun, and W. W. Shing, “Autonomous Systems, vol. 34, no. 8, pp. 3779–3795, 2023.
takeoff, tracking and landing of a uav on a moving ugv using onboard [194] I. Ilahi, M. Usama, J. Qadir, M. U. Janjua, A. Al-Fuqaha, D. T. Hoang,
monocular vision,” in Proceedings of the 32nd Chinese control confer- and D. Niyato, “Challenges and countermeasures for adversarial attacks
ence. IEEE, 2013, pp. 5895–5901. on deep reinforcement learning,” IEEE Transactions on Artificial
[173] I. Kalinov, A. Petrovsky, V. Ilin, E. Pristanskiy, M. Kurenkov, Intelligence, vol. 3, no. 2, pp. 90–109, 2021.
V. Ramzhaev, I. Idrisov, and D. Tsetserukou, “Warevision: Cnn barcode [195] A. Rawal, J. McCoy, D. B. Rawat, B. M. Sadler, and R. S. Amant,
detection-based uav trajectory optimization for autonomous warehouse “Recent advances in trustworthy explainable artificial intelligence:
stocktaking,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. Status, challenges, and perspectives,” IEEE Transactions on Artificial
6647–6653, 2020. Intelligence, vol. 3, no. 6, pp. 852–866, 2021.
[174] S. Minaeian, J. Liu, and Y.-J. Son, “Vision-based target detection [196] G. A. Vouros, “Explainable deep reinforcement learning: state of the
and localization via a team of cooperative uav and ugvs,” IEEE art and challenges,” ACM Computing Surveys, vol. 55, no. 5, pp. 1–39,
Transactions on systems, man, and cybernetics: systems, vol. 46, no. 7, 2022.
pp. 1005–1016, 2015.
[175] A. Adaptable, “Building an aerial–ground robotics system for precision
farming,” IEEE ROBOTICS & AUTOMATION MAGAZINE, 2021.
[176] Q. Vuong, S. Levine, H. R. Walke, K. Pertsch, A. Singh, R. Doshi,
C. Xu, J. Luo, L. Tan, D. Shah et al., “Open x-embodiment: Robotic
learning datasets and rt-x models,” in 2nd Workshop on Language and
Robot Learning: Language as Grounding, 2023.
[177] N. Jiang, K. Wang, X. Peng, X. Yu, Q. Wang, J. Xing, G. Li, G. Guo,
Q. Ye, J. Jiao, J. Zhao, and Z. Han, “Anti-uav: A large-scale benchmark
for vision-based uav tracking,” IEEE Transactions on Multimedia,
vol. 25, pp. 486–500, 2023.