Vision-Based Learning for Drones: A Survey


Jiaping Xiao, Rangya Zhang, Yuhang Zhang, and Mir Feroskhan, Member, IEEE

Abstract—Drones as advanced cyber-physical systems are undergoing a transformative shift with the advent of vision-based learning, a field that is rapidly gaining prominence due to its profound impact on drone autonomy and functionality. Different from existing task-specific surveys, this review offers a comprehensive overview of vision-based learning in drones, emphasizing its pivotal role in enhancing their operational capabilities under various scenarios. We start by elucidating the fundamental principles of vision-based learning, highlighting how it significantly improves drones' visual perception and decision-making processes. We then categorize vision-based control methods into indirect, semi-direct, and end-to-end approaches from the perception-control perspective. We further explore various applications of vision-based drones with learning capabilities, ranging from single-agent systems to more complex multi-agent and heterogeneous system scenarios, and underscore the challenges and innovations characterizing each area. Finally, we explore open questions and potential solutions, paving the way for ongoing research and development in this dynamic and rapidly evolving field. With growing large language models (LLMs) and embodied intelligence, vision-based learning for drones provides a promising but challenging road towards artificial general intelligence (AGI) in the 3D physical world.

Index Terms—Drones, learning systems, robotics learning, embodied intelligence.

Fig. 1: Applications of vision-based drones. (a) Parcel delivery; (b) Photography; (c) Precision agriculture; (d) Power grid inspection.

J. Xiao, R. Zhang, Y. Zhang and M. Feroskhan are with the School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639798, Singapore (e-mail: jiaping001@e.ntu.edu.sg; rangya001@e.ntu.edu.sg; yuhang002@e.ntu.edu.sg; mir.feroskhan@ntu.edu.sg). (Corresponding author: Mir Feroskhan.)

I. INTRODUCTION

DRONES are intelligent cyber-physical systems (CPS) [1], [2] that rely on functional sensing, system communication, real-time computation, and flight control to achieve perception and autonomous flight. Due to their high autonomy and maneuverability, drones [3] have been widely used in various missions, such as industrial inspection [4], precision agriculture [5], parcel delivery [6], [7] and search-and-rescue [8] (some applications are shown in Fig. 1). As such, drones in smart cities are an essential feature of the service industry of tomorrow. For future applications, many novel drones and functions are under development with the advent of advanced materials (adhesive and flexible materials), miniaturization of electronic and optical components (sensors, microprocessors), onboard computers (Nvidia Jetson, Intel NUC, Raspberry Pi, etc.), batteries, and localization systems (SLAM, UWB, GPS, etc.). Meanwhile, the functionality of drones is becoming more complex and intelligent due to the rapid advancement of artificial intelligence (AI) and onboard computation capability. From the perspective of innovation, there are three main directions for the future development of drones. Specifically:

• The miniaturization of drones. Micro and nano drones are capable of accomplishing missions at a low cost and without space constraints. Small autonomous drones have raised the interest of scientists in drawing more inspiration from biology, such as bees [9] and flapping birds [10].

• The novel designs of drones. Structural and aerodynamic design enables drones to obtain increased maneuverability and improved flight performance. Tilting, morphing, and folding structures and actuators are widely studied in drone design and control [7], [11], [12].

• The autonomy of drones. Drone autonomy enables autonomous navigation and task execution. It requires real-time perception and onboard computation capabilities. Drones that integrate visual sensors, efficient online planning, and learning-based decision-making algorithms have notably enhanced autonomy and intelligence [13], [14], even beating world-level champions in drone racing with a vision-based learning system called Swift [15].

Recently, to improve drone autonomy, vision-based learning drones, which combine advanced sensing with learning capabilities, have been attracting growing interest (see the rapid growth trend in Fig. 2). With such capabilities, drones are even moving towards artificial general intelligence (AGI) in the 3D physical world, especially when integrated with rapidly growing large language models (LLMs) [16] and embodied intelligence [17]. Existing vision-based drone-related surveys focus only on specific tasks and applications, such as UAV navigation [18], [19], vision-based UAV landing [20], obstacle avoidance [21], vision-based inspection with UAVs [22], [23], and autonomous drone racing [24], which limits the understanding of vision-based learning in drones from a holistic perspective. Therefore, this survey provides a comprehensive review of vision-based learning for drones to deliver a more general view of current drone autonomy technologies, including background, visual perception, vision-based control, applications and challenges, and open questions with potential solutions.

Fig. 3: General framework of vision-based drones.


Fig. 2: Number of related publications in Google Scholar using
keyword “vision-based learning drones”.
To summarize, our main contributions are:

• We discussed the development of vision-based drones with learning capabilities and analyzed the core components, especially visual perception and machine learning applied in drones. We further highlighted object detection with visual perception and how it benefits drone applications.

• We discussed the current state of vision-based control methods for drones and categorized them into indirect, semi-direct, and end-to-end methods from the perception-control perspective. This perspective helps readers understand vision-based control methods and differentiate their features more clearly.

• We summarized the applications of vision-based learning drones in single-agent systems, multi-agent systems, and heterogeneous systems and discussed the corresponding challenges in different applications.

• We explored several open questions that can hinder the development and applicability of vision-based learning for drones. Furthermore, we discuss potential solutions for each question.

Organization: The rest of this survey is organized as follows: Section II discusses the concept of vision-based learning drones and their core components; Section III summarizes object detection with visual perception and its application to vision-based drones; we introduce the vision-based control methods for drones and categorize them in Section IV; the applications and challenges of vision-based learning for drones are discussed in Section V; we list the open questions faced by vision-based learning drones and potential solutions in Section VI; Section VII summarizes and concludes this survey.

II. BACKGROUND

A. Vision-based Learning Drones

A typical vision-based drone consists of three parts (see Fig. 3): (1) Visual perception: sensing the environment around the drone via monocular cameras or stereo cameras; (2) Image processing: extracting features from an observed image sequence and outputting specific patterns or information, such as navigation information, depth information, and object information; (3) Flight controller: generating high-level and low-level commands for drones to perform assigned missions. Image processing and flight control are generally conducted on the onboard computer, while visual perception relies on the performance of visual sensors. Vision-based drones have been widely used in traditional missions such as environmental exploration [25], navigation [13], and obstacle avoidance [26], [27]. With efficient image processing and simple path planners, they can avoid dynamic obstacles effectively (see Fig. 4).

Fig. 4: Vision-based control for drones' obstacle avoidance in simple dynamic environments. (a) Drone racing in a dynamic environment with moving gates [26]; (b) A drone avoiding a ball thrown at it with event cameras [27].

Currently, vision-based learning drones, which utilize visual sensors and efficient learning algorithms, have achieved remarkable performance in a series of standardized visual perception and decision-making tasks, such as agile flight control [28], navigation [29] and obstacle avoidance [27]. Various cases have showcased the power of learning algorithms in improving the agility and perception capabilities of vision-based drones. For instance, using only depth cameras, inertial measurement units (IMUs) and a lightweight onboard computer, the vision-based learning drone in [28] succeeded in performing high-speed flight in unseen and unstructured wild environments. The controller of the drone in [28] was trained in a high-fidelity simulation and transferred to a physical platform. Following that, the Swift system was developed in [15] to achieve world champion-level autonomous drone racing with a tracking camera and a shallow neural network. Such vision-based learning drones are leading the future of drones due to their perception and learning capabilities in complex environments. Similarly, with event cameras and a trained narrow neural network, EVDodgeNet, the vision-based learning drone in [30] was able to dodge multiple dynamic obstacles (balls) during flight.


Fig. 5: LIDAR on drones for visual perception. (a) A typical
surrounding LIDAR; (b) Generated point cloud with LIDAR.

Afterwards, to improve onboard perception in the real world, an uncertainty estimation module was trained with the Ajna network [31], which significantly increased the generalization capability of learning-based control for drones. The power of deep learning in handling uncertain information frees drones from the complex computation and accurate modeling required by traditional approaches.

B. Visual Perception

Visual perception for drones is the ability of drones to perceive their surroundings and their own states through the extraction of necessary features for specific tasks with visual sensors. Light detection and ranging (LIDAR) and cameras are commonly used sensors for drones to perceive the surrounding environment.

1) Light Detection and Ranging (LIDAR): LIDAR is a kind of active range sensor that relies on the calculated time of flight (TOF) between the transmitted and received beams (laser) to estimate the distance between the robot and the reflecting surface of objects [32]. Based on the scanning mechanism, LIDAR can be divided into solid-state LIDAR, which has a fixed field of view without moving parts, and surrounding LIDAR, which spins to provide a 360-degree horizontal view. Surrounding LIDAR is also referred to as "laser scanning" or "3D scanning", which creates a 3D representation of the explored environment using eye-safe laser beams. A typical LIDAR (see Fig. 5) consists of laser emitters, laser receivers, and a spinning motor. The vertical field of view (FOV) of a LIDAR is determined by the number of vertical arrays of lasers. For instance, a vertical array of 16 lasers scanning 30 degrees gives a vertical resolution of 2 degrees in a typical configuration. LIDAR has recently been used on drones for mapping [33], power grid inspection [34], pose estimation [35] and object detection [36]. LIDAR provides sufficient and accurate depth information for drones to navigate in cluttered environments. However, it is bulky and power-hungry and does not fit within the payload restrictions of agile autonomous drones. Meanwhile, using a raycast representation in a simulation environment makes it hard to match the inputs of a real LIDAR device, which brings many challenges for Sim2Real transfer when a learning approach is considered.
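To make the two quantities above concrete, the sketch below computes a time-of-flight range and the vertical angular resolution of a multi-beam LIDAR. The 200 ns round trip is an arbitrary illustrative value, while the 16-beam, 30-degree configuration is the example given above.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_range(round_trip_time_s: float) -> float:
    """Distance to the reflecting surface from the laser round-trip time."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

def vertical_resolution(num_beams: int, vertical_fov_deg: float) -> float:
    """Approximate vertical angular resolution of a multi-beam spinning LIDAR."""
    return vertical_fov_deg / (num_beams - 1)

print(tof_range(200e-9))              # ~30 m for a 200 ns round trip
print(vertical_resolution(16, 30.0))  # 2 degrees, matching the example above
```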
2) Camera: Compared to LIDAR, cameras provide a cheap and lightweight way for drones to perceive the environment. Cameras are external passive sensors used to monitor the drone's geometric and dynamic relationship to its task, environment or the objects that it is handling. Cameras are commonly used perception sensors for drones to sense environment information, such as objects' positions and a point cloud map of the environment. In contrast to a motion capture system, which can only broadcast global geometric and dynamic pose information within a limited space from an offboard synchronized system, cameras enable a drone to fly without space constraints. Cameras can provide positioning for drone navigation in GPS-denied environments via visual odometry (VO) [37]–[39] and visual simultaneous localization and mapping systems (V-SLAM) [40]–[42]. Meanwhile, object detection and depth estimation can be performed with cameras to obtain the relative positions and sizes of obstacles. However, avoiding dynamic obstacles, and even physical attacks like bird chasing, poses fundamental challenges to the visual perception of agile vision-based drones. Motion blur, sparse-texture environments, and unbalanced lighting conditions can cause the loss of feature detection in VIO and object detection. LIDAR and event cameras [43] can partially address these challenges. However, LIDAR and event cameras are either too bulky or too expensive for agile drone applications. Considering the agility requirement of physical attack avoidance, lightweight dual-fisheye cameras are used for visual perception. With dual fisheye cameras, the drone can achieve better navigation capability [44] and omnidirectional visual perception [45]. Some sensor fusion and state estimation techniques are required to alleviate the accuracy loss caused by motion blur.

Fig. 6: Visual perception with cameras for drones. (a) Visual odometry for drones' positioning; (b) Object detection for drones' obstacle avoidance with an event camera [27].

C. Machine Learning

Recently, machine learning (ML), especially deep learning (DL), has attracted much attention from various fields and has
been widely applied to robotics for environmental exploration [46], [47], navigation in unknown environments [48]–[50], ob-
stacle avoidance, and intelligent control [51]. In the domain of
drones, learning-based methods have also achieved promising
success, particularly for deep reinforcement learning (DRL)
[15], [28], [52]–[56]. In [54], a curriculum learning augmented
end-to-end reinforcement learning was proposed for a UAV to
fly through a narrow gap in the real world. A vision-based end-
to-end learning method was successfully developed in [28] to
fly agile quadrotors through complex wild and human-made
environments with only onboard sensing and computation
capabilities, such as depth information. A visual drone swarm
was developed in [56] to perform collaborative target search
with adaptive curriculum embedded multistage learning. These
works verified the marvelous power of learning-based methods
on drone applications, which pushes the agility and coopera-
tion of drones to a level that classical approaches can hardly
reach. Different from the classical approaches relying on
separate mapping, localization, and planning, learning-based
methods map the observations, such as the visual information
or localization of obstacles, to commands directly without
further planning. This greatly helps drones handle uncertain
information in operations. However, learning-based methods
require massive experiences and training datasets to obtain
good generalization capability, which poses another challenge
in deployment over unknown environments.
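As a minimal illustration of this direct observation-to-command mapping, the following PyTorch sketch defines a small policy network that turns a depth image into a velocity command. The architecture and dimensions are purely illustrative and do not correspond to any of the cited systems.

```python
import torch
import torch.nn as nn

class DepthToCommandPolicy(nn.Module):
    """Maps a single-channel depth image directly to a 3-D velocity command."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # compress the depth image
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(                 # regress the command
            nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 3),
        )

    def forward(self, depth):                      # depth: (B, 1, H, W)
        return self.head(self.encoder(depth))      # (B, 3) velocity command

policy = DepthToCommandPolicy()
command = policy(torch.rand(1, 1, 90, 160))        # e.g., vx, vy, vz
```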

Fig. 7: (a) R-CNN neural network architecture [64]; (b) Fast R-CNN neural network architecture [65].

III. OBJECT DETECTION WITH VISUAL PERCEPTION

Object detection is a pivotal module in vision-based learning drones when handling complex missions such as inspection, avoidance, and search and rescue. Object detection aims to find all the objects of interest in an image and determine their positions and sizes [57]. It is one of the core problems in the field of computer vision (CV). Nowadays, the applications of object detection include face detection, pedestrian detection, vehicle detection, and terrain detection in remote sensing images. Object detection has always been one of the most challenging problems in the field of CV due to the different appearances, shapes, and poses of various objects, as well as the interference of factors such as illumination and occlusion during imaging. At present, object detection algorithms can be roughly divided into two categories: multi-stage (two-stage) algorithms, whose idea is to first generate candidate regions and then perform classification, and one-stage algorithms, whose idea is to directly apply the algorithm to the input image and output the categories and corresponding positions. Beyond that, to retrieve 3D positions, depth estimation has been a popular research subbranch related to object detection, whether using monocular [58] or stereo depth estimation [59]. For a very long time, the core neural network module (backbone) of object detection has been the convolutional neural network (CNN) [60]. CNN is a classic neural network in image processing that originates from the study of the human optic nerve system. The main idea is to convolve the image with convolution kernels to obtain a series of reorganized features, and these reorganized features represent the important information of the image. As such, CNN not only has the ability to recognize the image but also effectively decreases the requirement for computing resources. Recently, vision transformers (ViTs) [61], originally proposed for image classification tasks, have been extended to the realm of object detection [62]. These models demonstrate superior performance by utilizing the self-attention mechanism, which processes visual information non-locally [63]. However, a major limitation of ViTs is their high computational demand. This presents difficulties in achieving real-time inference, particularly on platforms with limited resources like drones.

A. Multi-stage Algorithms

Classic multi-stage algorithms include R-CNN (Region-based Convolutional Neural Network) [64], Fast R-CNN [65], and Faster R-CNN [66]. Multi-stage algorithms can basically meet the accuracy requirements in real-life scenarios, but the models are more complex and cannot readily be applied to scenarios with high-efficiency requirements. In the R-CNN structure [64], it is necessary to first generate some region proposals (RP), then use convolutional layers for feature extraction, and then classify the regions according to these features. That is, the object detection problem is transformed into an image classification problem. The R-CNN model is very intuitive, but the disadvantage is that it is too slow, and the output
is obtained via training multiple Support Vector Machines (SVMs). To solve the problem of slow training speed, the Fast R-CNN model was proposed (Fig. 7b). This model makes two improvements to R-CNN: (1) it first uses convolutional layers to perform feature extraction on the image so that only one convolutional pass is needed to obtain the RPs; (2) it replaces the training of multiple SVMs with a single fully-connected layer and a softmax layer. These techniques greatly improve the computation speed but still fail to address the efficiency issue of the Selective Search Algorithm (SSA) for RP generation.

Fig. 8: Faster R-CNN neural network architecture [66].

Faster R-CNN is an improvement on the basis of Fast R-CNN. In order to solve the problem of SSA, the SSA that generates RPs in Fast R-CNN is replaced by a Region Proposal Network (RPN), and a model that integrates RP generation, feature extraction, object classification and bounding box regression is used. The RPN is a fully convolutional network that simultaneously predicts object boundaries at each location. It is trained end-to-end to generate high-quality region proposals, which are then detected by Fast R-CNN. At the same time, the RPN and Fast R-CNN share convolutional features. Meanwhile, in the feature extraction stage, Faster R-CNN uses a convolutional neural network. The model achieves 73.2% and 70.4% mean Average Precision (mAP) per category on the PASCAL VOC 2007 and 2012 datasets, respectively. Faster R-CNN is greatly improved in speed compared with Fast R-CNN, its accuracy reaches the state-of-the-art (SOTA), and it also establishes a fully end-to-end object detection framework. However, Faster R-CNN still cannot achieve real-time object detection. Besides, after obtaining the RPs, it requires heavy computation for each RP classification.
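For reference, a pre-trained two-stage detector of the kind described above can be run for inference with a few lines of code; the sketch below assumes torchvision (version 0.13 or later, which uses the `weights` argument) is available, and the confidence threshold is arbitrary.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Faster R-CNN with a ResNet-50 FPN backbone pre-trained on COCO.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)          # stand-in for a camera frame in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]       # region proposals + per-RP classification

keep = prediction["scores"] > 0.5        # keep confident detections only
print(prediction["boxes"][keep], prediction["labels"][keep])
```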

B. One-stage Algorithms

One-stage algorithms such as the Single Shot Multibox Detector (SSD) model [67] and the YOLO series models [68] are generally slightly less accurate than the two-stage algorithms, but have simpler architectures, which can facilitate end-to-end training and are more suitable for real-time object detection. The basic process of YOLO (see Fig. 9) is divided into three phases, namely, resizing the image, passing the image through a convolutional neural network, and applying non-maximum suppression (NMS). The main advantages of the YOLO model are that it is fast, makes few background errors thanks to global processing, and has good generalization performance. Meanwhile, YOLO formulates the detection task as a unified, end-to-end regression problem and simultaneously obtains locations and classifications by processing the image only once. However, there are also some problems with YOLO, such as its coarse grid, which limits YOLO's performance on small objects. The subsequent YOLOv3, YOLOv5, YOLOX [69] and YOLOv8 improved the network on the basis of the original YOLO and achieved better detection results.

Fig. 9: YOLO network architecture [68].

SSD is another classic one-stage object detection algorithm. The flowchart of SSD is: (1) first extract features from the image through a CNN, (2) generate feature maps, (3) extract feature maps of multiple layers, and then (4) generate default boxes at each point of the feature maps. Finally, (5) all the generated default boxes are collected and filtered using NMS. The neural network architecture of SSD is shown in Fig. 10. SSD uses more convolutional layers than YOLO, which means that SSD has more feature maps. At the same time, SSD uses different convolution segments based on the VGG model to output feature maps to the regressor, which tries to improve the detection accuracy on small objects.

Fig. 10: SSD network architecture [67].

The aforementioned multi-stage algorithms and one-stage algorithms have their own advantages and disadvantages. Multi-stage algorithms achieve high detection accuracy, but bring more computing overhead and repeated detection. A one-stage model generally consists of a basic network (Backbone Network) and a Detection Head. The former is used as a feature extractor to give representations of different sizes and abstraction levels of images; the latter learns classification and location associations based on these representations and a supervised dataset. The two tasks of category prediction and position regression, which the detection head is responsible for, are often carried out in parallel, formulating a multi-task loss function for joint training. Class prediction and position regression are performed in a single pass, and most of the weights are shared. Hence, one-stage algorithms are more time-efficient
at the cost of accuracy. In any case, with the continuous development of deep learning in the field of computer vision, object detection algorithms are also constantly learning from and improving each other.
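Since both YOLO and SSD finish with an NMS step, the following illustrative NumPy implementation of IoU-based greedy non-maximum suppression shows what that filtering stage does; the box format and threshold are assumptions for illustration, not values taken from any cited detector.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
    return keep
```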

C. Vision Transformer
ViTs have emerged as the most active research field in object
detection tasks recently, with models like Swin-Transformer
[62], [70], ViTdet [71], and DINO [72] leading the fore-
front. Unlike conventional CNNs, ViTs leverage self-attention
mechanisms to process image patches as sequences, offering
a more flexible representation of spatial hierarchies. The core
mechanism of these models involves dividing an image into
a sequence of patches and applying Transformer encoders
[73] to capture complex dependencies between them. This
process enables ViTs to efficiently learn global context, which
is pivotal in understanding comprehensive scene layouts and object relations. For instance, the Swin-Transformer [62] introduces a hierarchical structure with shifted windows, enhancing the model's ability to capture both local and global features. Subsequently, the Swin-Transformer was scaled up to Swin-Transformer V2 [70] with the capability of training on high-resolution images (see Fig. 11).

Fig. 11: Swin Transformer V2 framework [70].

The primary advantages of ViTs in object detection are their scalability to large datasets and superior performance in capturing long-range dependencies. This makes them particularly effective in scenarios where contextual understanding is crucial. Additionally, ViTs demonstrate strong transfer learning capabilities, performing well across various domains with minimal fine-tuning. However, challenges with ViTs include their computational intensity due to self-attention mechanisms, particularly when processing high-resolution images. This can limit their deployment in real-time applications where computational resources are constrained. Additionally, ViTs often require large-scale datasets for pre-training to achieve optimal performance, which can be a limitation in data-scarce environments. Despite these challenges, ongoing advancements in ViT architectures, such as the development of efficient attention mechanisms [74] and hybrid CNN-Transformer models [75], continue to enhance their applicability and performance in diverse object detection tasks.
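The patch-sequence idea at the core of ViTs can be sketched in a few lines of PyTorch: an image is split into non-overlapping patches, each patch is projected to a token, and a Transformer encoder mixes information across all tokens via self-attention. The patch size, embedding dimension, and layer count below are illustrative choices, not those of any cited model.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into non-overlapping patches and projects each patch
    into an embedding, producing the token sequence an encoder consumes."""
    def __init__(self, patch=16, in_ch=3, dim=256):
        super().__init__()
        # A stride-`patch` convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, 3, H, W)
        tokens = self.proj(x)                      # (B, dim, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)   # (B, num_patches, dim)

tokens = PatchEmbedding()(torch.rand(1, 3, 224, 224))       # (1, 196, 256)
layer = nn.TransformerEncoderLayer(256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
context = encoder(tokens)  # self-attention mixes information across all patches
```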
When applying object detection algorithms to drone applications, it is necessary to find the best balance between computation speed and accuracy. Besides, massive drone datasets are required for training and testing. Zheng Ye et al. [76] collected an air-to-air drone dataset, "Det-Fly" (see Fig. 12), and evaluated air-to-air object detection of a micro-UAV with eight different object detection algorithms, namely RetinaNet [77], SSD, Faster R-CNN, YOLOv3 [78], FPN [79], Cascade R-CNN [80] and Grid R-CNN [81]. The evaluation results in [76] showed that the overall performance of Cascade R-CNN and Grid R-CNN is superior compared to the others; however, YOLOv3 provides the fastest inference speed among them. Wei Xun et al. [82] conducted another investigation into drone detection, employing the YOLOv3 architecture and deploying the model on the NVIDIA Jetson TX2 platform. They collected a dataset comprising 1435 images featuring various UAVs, including drones, hexacopters, and quadcopters. Utilizing custom-trained weights, the YOLOv3 model demonstrated proficiency in drone detection within images. However, the deployment of this trained model faced constraints due to the limited computation capacity of the Jetson TX2, which posed challenges for effective real-time application.

Fig. 12: Air-to-air object detection of micro-UAVs with a monocular camera [76].

In agile flight, computation speed is more important than accuracy since real-time object detection is required to avoid obstacles swiftly. Therefore, a simple gate detector and a filter algorithm are adopted as the basis of the visual perception of Swift [15]. Considering the agility of the drone, a stabilization module is required to obtain more robust and accurate object detection and tracking results in real-time flight. Moreover, in the drone datasets covered in existing works, each image only includes a single UAV. To classify and detect different classes of drones in multi-drone systems, a new dataset of multiple types of drones has to be built from scratch. Furthermore, the dataset can be adapted to capture adversary drones in omnidirectional visual perception to enhance avoidance capability.

IV. VISION-BASED CONTROL

Vision-based control for robotics has been widely studied in recent years, whether for ground robots or aerial robots such as drones. For drones flying in a GPS-denied environment,
Fig. 13: Vision-based control methods for drone applications. Based on the ways of visual perception and control, the methods
can be divided into indirect methods, semi-direct methods, and end-to-end methods.

visual inertial odometry (VIO) [37]–[39] and visual simultaneous localization and mapping systems (SLAM) [40]–[42] have been preferred choices for navigation. Meanwhile, in cluttered environments, research on obstacle avoidance [13], [28], [83], [84] based on visual perception has attracted much attention in the past few years. Obstacle avoidance has been a main task for vision-based control as well as for the current learning algorithms of drones.

From the perspective of how drones obtain visual perception (perception end) and how drones generate control commands from visual perception (control end), existing vision-based control methods can be categorized into indirect methods, semi-direct methods, and end-to-end methods. The relationship between these three categories is illustrated in Fig. 13. In the following, we discuss and evaluate these methods in the three different categories, respectively.

A. Indirect Methods

Indirect methods [13], [27], [85]–[93] refer to extracting features from images or videos to generate visual odometry, depth maps, and 3D point cloud maps for drones to perform path planning based on traditional optimization algorithms (see Fig. 14). Obstacle states, such as 3D shape, position, and velocity, are detected and mapped before a maneuver is taken. Once online maps are built or obstacles are located, the drone can generate a feasible path or take actions to avoid obstacles. SOTA indirect methods generally divide the mission into several subtasks, namely perception, mapping and planning.

On the perception side, depth images are always required to generate corresponding distance and position information for navigation. A depth image is a grey-level or color image that represents the distance between the surfaces of objects and the viewpoint of the agent. Fig. 15 shows color images and corresponding depth images from a drone's viewpoint. The intensity encodes the distance from the camera: a lighter color denotes a nearer surface, and darker areas denote farther surfaces. A depth map provides the necessary distance information for drones to make decisions to avoid static and dynamic obstacles. Currently, off-the-shelf RGB-D cameras, such as the Intel RealSense depth camera D415, the ZED 2 stereo camera, and the Structure Core depth camera, are widely used for drone applications. Therefore, traditional obstacle avoidance methods can treat depth information as a direct input. However, for omnidirectional perception in wide-view scenarios, efficient onboard monocular depth estimation is always required, which is a challenge to address with existing methods.
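As a small worked example of how such depth images feed the mapping stage, the sketch below back-projects a metric depth image into a camera-frame point cloud under a pinhole model; the intrinsics are made-up values, not those of any camera mentioned above.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a metric depth image into a 3-D point cloud in the
    camera frame, assuming a pinhole model with intrinsics fx, fy, cx, cy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example with assumed intrinsics for a 640x480 depth camera.
points = depth_to_points(np.full((480, 640), 2.0),
                         fx=380.0, fy=380.0, cx=320.0, cy=240.0)
```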
On the mapping side, a point cloud map [94] or OctoMap [95], representing a set of data points in 3D space, is commonly generated. Each point has its own Cartesian coordinates and can be used to represent a 3D shape or an object. A 3D point cloud map is not from the view of a drone but constructs a global 3D map that provides global environmental information for a drone to fly. The point cloud map can be generated from a LIDAR scanner or from many overlapping images combined with depth information. An illustration of an original scene and a point cloud map is shown in Fig. 14, where the drone can travel around without colliding with static obstacles.

Planning is a basic requirement for a vision-based drone to avoid obstacles. Within the indirect methods, planning can be further divided into two categories: one is offline methods based on high-resolution maps and pre-known position information, such as Dijkstra's algorithm [96], A-star [97], RRT-Connect [98] and sequential convex optimization [99]; the other is online methods based on real-time visual perception and decision-making. Online methods can be further categorized into online path planning [13], [90], [91] and artificial potential field (APF) methods [27], [100].

Most vision-based drones rely on online methods. Compared to offline methods, which require an accurate pre-built global map, online methods provide advanced maneuvering capabilities for drones, especially in dynamic environments. Currently, due to their advantages in optimization and prediction capabilities, online path planning methods have become the preferred choice for drone obstacle avoidance. For instance, in the SOTA work [90], Zhou Boyu et al. introduced a robust and efficient motion planning system called Fast-Planner for a vision-based drone to perform high-speed flight in an unknown cluttered environment. The key contribution of this work is a robust and efficient planning scheme incorporating path searching, B-spline optimization, and time adjustment to generate feasible and safe trajectories for vision-based drones' obstacle avoidance. Using only onboard vision-based perception and computing, this work demonstrated agile drone navigation in unexplored indoor and outdoor environments. However, this approach can only achieve maximum speeds of 3 m/s and requires 7.3 ms of computation in each step. To improve the flight performance and save computation time, Zhou Xin et al. [13] provided a Euclidean Signed Distance Field (ESDF)-free gradient-based planning framework, EGO-Planner,
for drone autonomous navigation in unknown obstacle-rich situations. Compared to the Fast-Planner, the EGO-Planner achieved faster speeds and saved a lot of computation time. However, these online path planning methods require bulky visual sensors, such as RGB-D cameras or LIDAR, and a powerful onboard computer for the complex numerical calculations needed to obtain a locally or globally optimal trajectory.

Fig. 14: A vision-based drone traversing a cluttered indoor environment with generated point cloud maps and online path planning [90].

Fig. 15: Depth maps generated from the viewpoint of a drone [101].

Fig. 16: Artificial potential field (APF) methods in drones' obstacle avoidance [100]. (a) Generated artificial potential field; (b) Flight trajectory based on the APF.

In contrast to online path planning, artificial potential field methods require fewer computation resources and can cope well with dynamic obstacle avoidance using limited sensor information. The APF algorithm is a robot path planning approach that uses an attractive force to reach the objective position and repulsive forces to avoid obstacles in an unknown environment [102]. Falanga et al. [27] developed an efficient and fast control strategy based on the artificial potential field method to avoid fast-approaching dynamic obstacles. The obstacles in [27] are represented as repulsive fields that decay over time, and the repulsive forces are generated from the first-order derivative of the repulsive fields at each time step. However, the computed repulsive forces only reach substantial values when the obstacle is very close, which may lead to unstable and aggressive behavior. Besides, artificial potential field methods are heuristic methods that cannot guarantee global optimality and robustness for drones. Hence, it is not ideal to adopt potential field methods to navigate through cluttered environments.
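The following sketch shows an artificial-potential-field command of the kind described above, combining an attractive term toward the goal with repulsive terms from nearby obstacles. It follows the classic textbook formulation rather than the time-decaying fields of [27], and all gains and distances are arbitrary illustrative values.

```python
import numpy as np

def apf_velocity(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=2.0):
    """Classic APF command: attraction to the goal plus repulsion from
    obstacles that are closer than the influence distance d0."""
    force = k_att * (goal - pos)                      # attractive component
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff) + 1e-9
        if d < d0:                                    # repulsion only when close
            force += k_rep * (1.0 / d - 1.0 / d0) / d**2 * (diff / d)
    return force                                      # used as a velocity command

cmd = apf_velocity(np.zeros(3), np.array([5.0, 0.0, 1.0]),
                   [np.array([2.0, 0.2, 1.0])])
```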
With end-to-end methods, the visual perception of drones
is encoded by a deep neural network into an observation
B. End-to-end Methods vector of the policy network. The mapping process is usu-
In contrast to the indirect methods, which divide the whole ally trained offline with abundant data, which requires high-
mission into multiple sub-tasks, such as perception, mapping, performance computers and simulation platforms. The training
and planning, end-to-end methods [28], [48], [103]–[105] data set is collected from expert flight demonstrations for
combine computer vision and reinforcement learning (RL) to imitation learning or from simulations for online training.
map the visual observations to actions directly. RL [106] is To better generalize the performance of a trained neural net-
work model, scenario randomization (domain randomization) is essential during the training process. Malik Aqeel Anwar
et al. [104] presented an end-to-end reinforcement learning
approach called NAVREN-RL to navigate a quadrotor drone in
an indoor environment with expert data and knowledge-based
data aggregation. The reward function in [104] was formulated
from a ground truth depth image and a generated depth image.
Loquercio et al. [28] developed an end-to-end approach that
can autonomously guide a quadrotor drone through complex
wild and human-made environments at high speeds with purely
onboard visual perception (depth image) and computation. The
neural network policy was trained in a high-fidelity simulation
environment with massive expert knowledge data. While end-
to-end methods provide us with a straightforward way to
generate obstacle avoidance policies for drones, they require massive training data with domain randomization (usually counted in the millions) to obtain acceptable generalization capabilities. Meanwhile, without expert knowledge data, it is challenging for the neural network policy to update its weights when the reward space is sparse. To implement end-to-end methods, we commonly consider the following aspects: the neural network architecture, the training process, and Sim2Real transfer.

Neural Network Architecture: The neural network architecture is the core component of end-to-end methods, which determines the computation efficiency and intelligence level of the policy. In an end-to-end method, the input of the neural network architecture is the raw image data (RGB/RGB-D), and the output is the action vector an agent needs to take. The images are encoded into a vector and then concatenated with other normalized observations to form an input vector for a policy network. For the image encoder, there are many pre-trained neural network architectures that can be considered, such as ResNet [107], VGG [108], and the Nature CNN [109]. Zhu et al. [48] developed a target-driven visual navigation approach for robots with end-to-end deep reinforcement learning, where a pre-trained ResNet-50 is used to encode the image into a feature vector. In [50], a more data-efficient image encoder with 5 layers was designed for target-driven visual navigation for robots with end-to-end imitation learning. Before designing the neural network architecture, we first need to determine the observation space and action space of the task. Training efficiency and space complexity are the two aspects we need to consider in the design process.

Fig. 19: The neural network architectures of end-to-end reinforcement learning methods. (a) Target-driven navigation with a ResNet-50 encoder [48]; (b) Target-driven visual navigation for a robot with a shallow CNN encoder [50].

Training Process: The training process is the most time-consuming part of end-to-end learning methods. It requires the use of gradient information from the loss function to update the weights of the neural network and improve the policy model. There are two main types of training algorithms for reinforcement learning, namely on-policy algorithms and off-policy algorithms. On-policy algorithms attempt to evaluate or improve the policy that is used to make decisions, i.e., the behavior policy for generating actions is the same as the target policy for learning. On-policy algorithms are more stable than off-policy algorithms when the agent is exploring the environment. SARSA (State-Action-Reward-State-Action experiences to update the Q-values) [106] is an on-policy reinforcement learning algorithm that estimates the value function of the policy being carried out. PPO [110] is another efficient on-policy algorithm widely used in reinforcement learning. By compensating for the fact that more probable actions are going to be taken more often, PPO addresses the issue of high variance and low convergence of policy gradient methods. To avoid excessively large policy updates, a "clipped" policy gradient update formulation is designed.
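The clipped update mentioned above can be written compactly; the snippet below is a generic PPO clipped surrogate loss in PyTorch (the clip ratio of 0.2 is a common default, not a value taken from the cited works).

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective: large policy updates are cut off so a
    single batch cannot move the policy too far from the behavior policy."""
    ratio = torch.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # minimize the negative surrogate
```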
In contrast, off-policy algorithms evaluate or improve a policy different from the one currently used to generate actions. Off-policy algorithms such as Q-learning [111], [112] try to learn a greedy policy at every step. Off-policy algorithms perform better at movement prediction, especially in an unknown environment. However, off-policy algorithms can be very unstable due to non-stationary environments that the offline data seldom covers. Imitation learning [113] is another form of off-policy algorithm that tries to mimic demonstrated behavior in a given task. Through training from demonstrations, imitation learning can save remarkable exploration costs. [28] adopted imitation learning to speed up the training process; the whole training process is illustrated in Fig. 20. With this privileged
expert knowledge, the policy can be trained to find a time-efficient trajectory to avoid obstacles.

Fig. 20: Training process used in [28] to fly an agile drone through a forest.

For multi-agent systems, multi-agent reinforcement learning (MARL) is widely studied to encourage agent collaboration. In MARL, credit assignment is a crucial issue that determines the contribution of each agent to the group's success or failure. COMA [114] is a baseline method that uses a centralized critic to estimate the action-value function and, for each agent, computes a counterfactual advantage function to represent the value difference, which can determine the contribution of each agent. However, COMA still does not fully address the model complexity issue. QMIX [115] was developed to address the scalability issue by decomposing the global action-value function into individual agents' value functions. However, QMIX assumes the environment is fully observable and may not be able to handle scenarios with a continuous action space. Hence, attention mechanism-enabled MARL is a promising direction to address variable observations. Besides, to balance individual and team rewards, a MARL algorithm with mixed credit assignment, POCA-Mix, was proposed in [116] to achieve collaborative multi-target search with a visual drone swarm.
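To illustrate the counterfactual idea behind COMA's credit assignment, the sketch below computes a counterfactual advantage for one agent from a centralized Q function, here assumed to be a simple lookup keyed by joint-action tuples; the values are toy numbers for illustration only.

```python
def counterfactual_advantage(q_joint, pi_agent, joint_action, agent_idx, actions):
    """COMA-style credit assignment sketch: compare the centralized Q value of
    the executed joint action against a baseline that marginalizes out this
    agent's own action while keeping the other agents' actions fixed."""
    baseline = 0.0
    for action, prob in zip(actions, pi_agent):
        alternative = list(joint_action)
        alternative[agent_idx] = action
        baseline += prob * q_joint[tuple(alternative)]
    return q_joint[tuple(joint_action)] - baseline

q = {("up", "left"): 1.0, ("down", "left"): 0.2}
adv = counterfactual_advantage(q, [0.7, 0.3], ("up", "left"), 0, ["up", "down"])
print(adv)  # 0.24: agent 0's "up" beat its own counterfactual baseline
```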
Sim2Real Transfer: For complex tasks, the policy neural networks are usually trained on simulation platforms such as AirSim [117], Unity [118] or Gazebo [119]. The differences between the simulation and the real environment are non-negligible. Sim2Real is the process of deploying a neural network model trained in a simulation environment on a real physical agent. For deployment, it is essential to validate the generalization capability of trained neural network models. A wide variety of Sim2Real techniques [120]–[124] have been developed to improve the generalization and transfer capabilities of models. Domain randomization [123] is one of these techniques for improving the generalization capability of the trained neural network to unseen environments. Domain randomization is a method of trying to discover a representation that can be used in a variety of scenes or domains. Existing domain randomization techniques [28], [54] for drones' obstacle avoidance include position randomization, depth image noise randomization, texture randomization, size randomization, etc. Therefore, we need to apply domain randomization in the training process to enhance the generalization capability of the trained model in deployment.
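A domain-randomization setup of the kind listed above can be as simple as sampling a fresh scenario configuration per training episode. The sketch below is illustrative, and the parameter ranges are arbitrary assumptions rather than values from the cited works.

```python
import random

def sample_randomized_scenario():
    """Draw one randomized training scenario, covering the kinds of
    randomization listed above (positions, depth noise, textures, sizes)."""
    return {
        "obstacle_positions": [
            (random.uniform(-10, 10), random.uniform(-10, 10), random.uniform(0.5, 3.0))
            for _ in range(random.randint(5, 30))
        ],
        "obstacle_scale": random.uniform(0.5, 2.0),
        "depth_noise_std": random.uniform(0.0, 0.1),   # metres of sensor noise
        "texture_id": random.randrange(100),           # index into a texture library
    }

scenarios = [sample_randomized_scenario() for _ in range(1000)]  # one per episode
```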
Fig. 21: A drone avoids obstacles in a forest with semi-direct methods [83].

Fig. 22: A drone tracking and chasing a target drone with object detection and deep reinforcement learning [125].

C. Semi-direct Methods

Compared to the end-to-end methods, which generate actions from raw image data directly, semi-direct methods [45], [83], [125], [126] introduce an intermediate phase for drones to take actions from visual perception, aiming to improve the generalization and transfer capabilities of the methods over unseen environments. There are two ways to design semi-direct method architectures: one is to generate the required information from image processing (such as the relative positions of obstacles from object detection and tracking, or the point cloud map) and train the control policy with deep reinforcement learning; the other is to obtain the required states (such as a depth image) directly from the raw image data with deep learning and avoid the obstacles using numerical or heuristic methods. These two variants can be denoted as indirect (front end)-direct (back end) methods and direct (front end)-indirect (back end) methods.

Indirect-direct methods [83], [125] first obtain intermediate features, such as the relative position or velocity of the obstacles, from image processing and then use this intermediate feature information as observations to train the policy neural network via deep reinforcement learning. Indirect-direct methods generally rely on designing suitable intermediate features. In [83], features related to depth cues, such as Radon features (30 dimensions), structure tensor statistics (15 dimensions), Laws' masks (8 dimensions), and optical flow (5 dimensions), were extracted and concatenated into a single feature vector as the visual observation. Together with nine additional features, the control policy was trained with imitation learning to navigate the drone through a dense forest environment (see Fig. 21).

Moulay et al. [125] proposed a semi-direct vision-based learning control policy for UAV pursuit-evasion. Firstly, a
deep object detector (YOLOv2) and a search area proposal (SAP) were used to predict the relative position of the target UAV in the next frame for target tracking. Afterward, deep reinforcement learning (see Fig. 22) was adopted to predict the actions the follower UAV needs to perform to track the target UAV. Indirect-direct methods are able to improve the generalization capability of policy neural networks, but at the cost of heavy computation and time overhead.
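As an illustration of the intermediate-feature design used by indirect-direct methods, the sketch below assembles the kind of hand-crafted observation vector described for [83] (30 Radon features, 15 structure tensor statistics, 8 Laws' mask responses, 5 optical flow values, plus 9 additional states); the feature extraction itself is assumed to happen elsewhere.

```python
import numpy as np

def build_observation(radon, tensor_stats, laws, flow, extra):
    """Concatenate hand-crafted visual features into the single observation
    vector a semi-direct control policy consumes (30 + 15 + 8 + 5 visual
    features plus 9 extra states, following the description of [83])."""
    obs = np.concatenate([radon, tensor_stats, laws, flow, extra])
    assert obs.shape == (67,)
    return obs

obs = build_observation(np.zeros(30), np.zeros(15), np.zeros(8),
                        np.zeros(5), np.zeros(9))
```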
Direct-indirect methods try to obtain depth images [129], [130] and track obstacles [131], [132] with trained models, and then use non-learning-based methods, such as path planning or APF, to avoid obstacles. Direct-indirect methods can be applied to a lightweight micro drone with only monocular vision, but they require a lot of training data to obtain depth images or the 3D poses of the obstacles. Michele et al. [132] developed an object detection system to detect obstacles at very long range and at very high speed, without specific assumptions about the type of motion. With a deep neural network trained on real and synthetic image data, fast, robust and consistent depth information can be used for drones' obstacle avoidance. Direct-indirect methods address the ego-drift problem of monocular depth estimation using Structure from Motion (SfM) and provide a direct way to get depth information from the image. However, the massive training dataset and limited generalization capability are the main challenges for their further application.

Fig. 23: Via deep learning, a 3D space is generated from a monocular image for obstacle avoidance [132].

In summary, the field of vision-based control for drones encompasses a variety of methods, each with its own unique approach to perception and control. Indirect methods rely on traditional optimization algorithms and depth or 3D point cloud maps for navigation and obstacle avoidance. End-to-end methods leverage deep neural networks for visual perception and utilize reinforcement learning for direct action mapping. Semi-direct methods balance between computational efficiency and generalization by using intermediate features from image processing and a combination of DRL and heuristic methods for action generation. A comprehensive overview of these methods, along with key studies in each category, is summarized in Table I, which provides a detailed comparison of their perception and control strategies.

TABLE I: Summary of Vision-Based Control Methods in Drone Technology

Indirect Methods — Perception end: depth maps, 3D point cloud maps; Control end: traditional optimization algorithms; Key studies: [85], [86], [87], [91]. Focus on generating visual odometry, depth maps, and 3D point cloud maps for path planning. Suitable for safety-critical tasks with accurate models.

Indirect Methods — Perception end: depth maps, 3D point cloud maps; Control end: online path planning, APF, etc.; Key studies: [27], [92]. Utilize depth information for real-time visual perception and decision-making in dynamic environments. Suitable for rapid-response tasks with certain information.

End-to-End Methods — Perception end: visual observations encoded by DNNs; Control end: RL for action mapping; Key studies: [56], [103], [104]. Combine deep learning for visual perception with RL for direct action response. Suitable for complex tasks with uncertain information.

End-to-End Methods — Perception end: encoded visual observations; Control end: imitation learning and online training; Key studies: [127], [28]. Use deep learning for visual encoding and train the control system with expert demonstrations and online training. Suitable for sample-efficient tasks with expert demonstrations.

Semi-Direct Methods — Perception end: intermediate features from image processing; Control end: DRL for action; Key studies: [83], [125], [15], [128]. Extract intermediate features like relative positions or velocities and use DRL for action decisions. Suitable for complex tasks with high generalization requirements.

Semi-Direct Methods — Perception end: raw image data for depth images or obstacle tracking; Control end: numerical/heuristic methods for obstacle avoidance; Key studies: [126], [45]. Utilize direct image data to obtain necessary states for obstacle avoidance using non-learning-based methods. Suitable for tasks where robust stereo information is not available.

V. APPLICATIONS AND CHALLENGES

A. Single Drone Application

The versatility of single drones is increasingly recognized in a variety of challenging environments. These unmanned vehicles, with their inherent advantages, are effectively employed in critical areas such as hazardous environment detection and search and rescue operations. Single drone applications in vision-based learning primarily involve tasks like obstacle avoidance [28], [30], surveillance [133]–[136], search-and-rescue operations [137]–[139], environmental monitoring [140]–[143], industrial inspection [23], [144], [145] and autonomous racing [146]–[148]. Each field, while benefiting from the unique capabilities of drones, also presents its own set of challenges and areas for development.

1) Obstacle Avoidance: The development of obstacle avoidance capabilities in drones, especially for vision-based control systems, poses significant challenges. Recent studies have primarily focused on static or simple dynamic environments, where obstacle paths are predictable [26], [28], [149]. However, complex scenarios involving unpredictable physical attacks from birds or intelligent adversaries remain largely unaddressed. For instance, [26], [27], [30], [150] have explored basic dynamic obstacle avoidance but do not account for adversarial environments. To effectively handle such threats, drones require advanced features like omnidirectional visual perception and agile maneuvering capabilities. Current research,
however, is limited in addressing these needs, underscoring the necessity for further development in drone technology
to enhance evasion strategies against smart, unpredictable
adversaries.
2) Surveillance: While drones play a pivotal role in surveil-
lance tasks, their deployment is not without challenges. Key
obstacles include managing high data processing loads and
addressing the limitations of onboard computational resources.
In addressing these challenges, the study by Singh et al. [133]
presented a real-time drone surveillance system used to iden- (a) (b)
tify violent individuals in public areas. The proposed study was
facilitated by cloud processing of drone images to address the
challenge of slow and memory-intensive computations while
still maintaining onboard short-term navigation capabilities.
Additionally, in the study [136], a drone-based crowd surveil-
lance system was tested to achieve the goal of saving scarce
energy of the drone battery. This approach involved offloading
video data processing from the drones by employing the
Mobile Edge Computing (MEC) method. Nevertheless, while
off-board processing diminishes computational demands and (c) (d)
energy consumption, it inevitably heightens the need for data
transmission. Addressing the challenge of achieving real-time Fig. 24: Applications of single drone. (a) Surveillance [133];
surveillance in environments with limited signal connectivity (b) Search and Rescue [151]; (c) Environmental Monitoring
is an additional critical issue that requires resolution. [143]; (d) Autonomous Racing [15].
3) Search and Rescue: In the field of search and rescue
operations, a number of challenges exist that hinder the de-
velopment of drone technology. A primary challenge faced by ments and conducting precise measurements in the presence of
drones is extracting maximum useful information from limited various disturbances. Kim et al. [144] addressed the challenge
data sources. This is crucial for improving the efficiency of autonomous navigation by using drones for proximity
and success rate of these missions. Goodrich et al. [137] measurement among construction entities, enhancing safety
address this by developing a contour search algorithm designed in the construction industry. Additionally, Khuc et al. [145]
to optimize video data analysis, enhancing the capability to focused on precise structural health inspection with drones,
identify key elements swiftly. However, incorporating temporal especially in high or inaccessible locations. Despite these
information into this algorithm introduces additional compu- advancements in autonomous navigation and measurement ac-
tational demands. These increased requirements present new curacy, maintaining data accuracy and reliability in industrial
challenges, such as the need for more powerful processing settings with interference from machinery, electromagnetic
capabilities and potentially greater energy consumption. fields, and physical obstacles continues to be a significant
4) Environmental Monitoring: A major challenge in the application of drones for environmental monitoring lies in efficiently collecting high-resolution data while navigating the constraints of battery life, flight duration, and diverse weather conditions. Addressing this, Senthilnath et al. [140] showcased the use of fixed-wing and VTOL (Vertical Take-Off and Landing) drones in vegetation analysis, focusing on the challenge of detailed mapping through spectral-spatial classification methods. In another study, Lu et al. [143] demonstrated the utility of drones for species classification in grasslands, contributing to the development of methodologies for drone-acquired imagery processing, which is crucial for environmental assessment and management. While these studies represent significant steps in drone applications for environmental monitoring, several challenges persist. Future research may need to address the problems of improving the drones' resilience to diverse environmental conditions and extending their operational range and duration to comprehensively cover extensive and varied landscapes.
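A common building block behind such vegetation analysis is a spectral index computed from co-registered bands. The sketch below computes NDVI from assumed near-infrared and red reflectance bands; it is a generic illustration, not the spectral-spatial classification pipeline of [140] or [143], and the threshold is a placeholder.

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalized Difference Vegetation Index from co-registered
    near-infrared and red reflectance bands."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / (nir + red + eps)

def vegetation_mask(nir: np.ndarray, red: np.ndarray,
                    threshold: float = 0.4) -> np.ndarray:
    """Binary mask of likely vegetation pixels; the threshold is a
    hypothetical value that would be tuned per sensor and season."""
    return ndvi(nir, red) > threshold

if __name__ == "__main__":
    nir = np.random.rand(128, 128)
    red = np.random.rand(128, 128)
    mask = vegetation_mask(nir, red)
    print("vegetation fraction:", mask.mean())
```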
5) Industrial Inspection: In industrial inspection, drones face key challenges like safely navigating complex environments and conducting precise measurements in the presence of various disturbances. Kim et al. [144] addressed the challenge of autonomous navigation by using drones for proximity measurement among construction entities, enhancing safety in the construction industry. Additionally, Khuc et al. [145] focused on precise structural health inspection with drones, especially in high or inaccessible locations. Despite these advancements in autonomous navigation and measurement accuracy, maintaining data accuracy and reliability in industrial settings with interference from machinery, electromagnetic fields, and physical obstacles continues to be a significant challenge, necessitating further research and development in this domain.
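Many of these proximity and inspection measurements reduce to simple projective geometry. The sketch below estimates range to a marker of known physical size with the textbook pinhole relation; it is an illustrative calculation, not the method of Kim et al. [144], and the focal length and marker size are assumed calibration values.

```python
def distance_from_apparent_size(marker_width_m: float,
                                marker_width_px: float,
                                focal_length_px: float) -> float:
    """Pinhole-camera range estimate: Z = f * W / w, where W is the true
    marker width, w its width in pixels, and f the focal length in pixels."""
    if marker_width_px <= 0:
        raise ValueError("marker not detected (zero pixel width)")
    return focal_length_px * marker_width_m / marker_width_px

if __name__ == "__main__":
    # Example: a 0.30 m fiducial appearing 45 px wide with f = 900 px.
    print(f"{distance_from_apparent_size(0.30, 45.0, 900.0):.2f} m")  # 6.00 m
```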
6) Autonomous Racing: In autonomous drone racing, the central challenge is reducing the delays in visual information processing and decision making and enhancing the adaptability of perception networks. In [146], a novel sensor fusion method was proposed to enable high-speed autonomous racing for mini-drones. This work also addressed issues with occasional large outliers and vision delays commonly encountered in fast drone racing. Another work [147] introduced an innovative approach to drone control, where a deep neural network (DNN) was used to fuse trajectories from multiple controllers. In the latest work [15], the vision-based drone outperformed world champions in the racing task, purely relying on onboard perception and a trained neural network. The primary challenge in autonomous drone racing, as identified in these studies, lies in the need for improved adaptability of perception networks to various environments and textures, which is crucial for the high-speed demands of the sport.
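One way to appreciate the vision-delay problem is a constant-velocity forward prediction of the delayed estimate, sketched below; this is far simpler than the model-predictive localization of [146] and assumes the latency and velocity estimates are available onboard.

```python
import numpy as np

def compensate_vision_delay(p_meas: np.ndarray, v_est: np.ndarray,
                            latency_s: float) -> np.ndarray:
    """Propagate a delayed visual position estimate to the current time
    under a constant-velocity assumption: p_now = p_meas + v * latency."""
    return p_meas + v_est * latency_s

if __name__ == "__main__":
    p_meas = np.array([4.0, 1.5, 2.0])   # gate-relative position at image capture time (m)
    v_est = np.array([-8.0, 0.5, 0.0])   # current velocity estimate (m/s)
    print(compensate_vision_delay(p_meas, v_est, latency_s=0.04))  # [3.68 1.52 2.  ]
```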
Overall, the primary challenges in single drone applications include limited battery life, which restricts operational
duration, and the need for effective obstacle avoidance in varied environments. Additionally, limitations in data processing capabilities affect real-time decision-making and adaptability. Advanced technological solutions are essential to overcome these challenges, ensuring that single drones can operate efficiently and reliably in diverse scenarios and paving the way for future innovations.

Fig. 25: Applications of multi-drone. (a) Coordinated Surveying [152]; (b) Cooperative Tracking [153]; (c) Synchronized Monitoring [154]; (d) Disaster Response [155].

B. Multi-Drone Application

While single drones offer convenience, their limited monitoring range has prompted interest in multi-drone collaboration. This approach seeks to overcome range limitations by leveraging the collective capabilities of multiple drones for broader, more efficient operations. Multi-drone applications, encompassing activities such as coordinated surveying [152], [156]–[158], cooperative tracking [153], [159], [160], synchronized monitoring [154], [161], [162], and disaster response [56], [151], [155], [163], [164], bring the added complexity of inter-drone communication, coordination and real-time data integration. These applications leverage the combined capabilities of multiple drones to achieve greater efficiency and coverage than single drone operations.

1) Coordinated Surveying: In coordinated surveying, several challenges are prominent: merging diverse data from individual drones, and addressing computational demands in the cooperative process. These challenges were tackled by several works. In [156], a monocular visual odometry algorithm was used to enable autonomous onboard control with cooperative localization and mapping. This work addressed the challenges of coordinating and merging the different maps constructed by each drone platform and, moreover, the computational bottlenecks typically associated with 3D RGB-D cooperative SLAM. Micro-air vehicles also play an outstanding role in the field of coordinated surveying. Similarly, in [158], a sensor fusion scheme was proposed to improve the accuracy of localization in MAV fleets. Ensuring that each MAV effectively contributes to the overall perception of the environment poses a challenge addressed in this work. Both methods depend on effective collaboration and communication among multiple agents. However, maintaining stable communication between drones remains a critical and unresolved issue in multi-drone operations.
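A core step in cooperative surveying is fusing the maps built by individual drones. The sketch below merges two occupancy grids under an assumed known relative offset between the drones; real cooperative SLAM systems such as [156] estimate this transform and fuse beliefs probabilistically.

```python
import numpy as np

def merge_occupancy_grids(grid_a: np.ndarray, grid_b: np.ndarray,
                          offset_b: tuple) -> np.ndarray:
    """Merge two occupancy grids (cell values in [0, 1], 0.5 = unknown).
    `offset_b` is drone B's grid origin expressed in A's (non-negative)
    cell coordinates, assumed to come from inter-drone relative localization."""
    dr, dc = offset_b
    rows = max(grid_a.shape[0], dr + grid_b.shape[0])
    cols = max(grid_a.shape[1], dc + grid_b.shape[1])
    merged = np.full((rows, cols), 0.5, dtype=np.float32)
    merged[:grid_a.shape[0], :grid_a.shape[1]] = grid_a
    # Where both drones observed a cell, average their occupancy beliefs;
    # where only B observed it, take B's value.
    region = merged[dr:dr + grid_b.shape[0], dc:dc + grid_b.shape[1]]
    known_b = np.abs(grid_b - 0.5) > 1e-3
    region[known_b] = np.where(np.abs(region[known_b] - 0.5) > 1e-3,
                               0.5 * (region[known_b] + grid_b[known_b]),
                               grid_b[known_b])
    return merged

if __name__ == "__main__":
    a = np.full((40, 40), 0.5, dtype=np.float32); a[10:20, 10:20] = 0.9
    b = np.full((40, 40), 0.5, dtype=np.float32); b[0:5, 0:5] = 0.1
    print(merge_occupancy_grids(a, b, offset_b=(20, 20)).shape)  # (60, 60)
```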
2) Cooperative Tracking: In the field of multi-drone object tracking, navigating complex operational environments and overcoming limited communication bandwidth are significant challenges. These topics have also garnered significant research interest. In the study [159], researchers developed a system for cooperative surveillance and tracking in urban settings, specifically tackling the issue of collision avoidance among drones. Additionally, another work by Farmani et al. [160] explores a decentralized tracking system for drones, focusing on overcoming limited communication bandwidth and intermittent connectivity challenges. However, a persistent difficulty in this field is navigating complex external environments. Effective path planning and avoiding obstacles during multi-drone operations remain crucial challenges that require ongoing attention and innovation.
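A standard tool for decentralized track fusion under unknown cross-correlation between drones' estimates is covariance intersection, sketched below; it is a generic building block rather than the specific scheme of Farmani et al. [160].

```python
import numpy as np

def covariance_intersection(x1, P1, x2, P2, n_grid: int = 51):
    """Fuse two estimates (x1, P1) and (x2, P2) of the same target state
    without knowing their cross-correlation (covariance intersection).
    The weight omega is chosen on a grid to minimize the fused covariance trace."""
    I1, I2 = np.linalg.inv(P1), np.linalg.inv(P2)
    best_x, best_P = None, None
    for omega in np.linspace(0.0, 1.0, n_grid):
        info = omega * I1 + (1.0 - omega) * I2
        P = np.linalg.inv(info)
        if best_P is None or np.trace(P) < np.trace(best_P):
            best_x = P @ (omega * I1 @ x1 + (1.0 - omega) * I2 @ x2)
            best_P = P
    return best_x, best_P

if __name__ == "__main__":
    x1, P1 = np.array([10.0, 5.0]), np.diag([4.0, 1.0])   # drone 1's track of the target
    x2, P2 = np.array([10.6, 4.8]), np.diag([1.0, 4.0])   # drone 2's track of the same target
    x, P = covariance_intersection(x1, P1, x2, P2)
    print(x, np.trace(P))
```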
3) Synchronized Monitoring: In synchronized monitoring missions with multiple drones, the focus is on how to effectively allocate tasks among drones and improve overall mission efficiency, as well as on overcoming computational limitations. Gu et al. [154] developed a small target detection and tracking model using data fusion, establishing a cooperative network for synchronized monitoring. This study also addresses task distribution challenges and efficiency optimization. Moreover, [161] explores implementing deep neural network (DNN) models on computationally limited drones, focusing on reducing classification latency in collaborative drone systems. However, issues like collision avoidance in complex environments and signal interference in multi-drone systems are not comprehensively addressed in these studies.

4) Disaster Response: In the domain of multi-drone systems for disaster response, challenges such as autonomous navigation, communication limitations, and real-time decision-making are paramount. To address these problems, Tang et al. [163] introduced a tracking-learning-detection framework with an advanced flocking strategy for exploration and search missions. However, the study does not examine the crucial aspect of task distribution and optimization for enhancing mission efficiency in multi-drone disaster response. This shortcoming points to an essential area for further research, as effective task management is key to utilizing the full capabilities of multi-drone systems, particularly in the dynamic and urgent context of disaster scenarios.

The primary challenges in multi-drone applications include maintaining stable and reliable communication links, collision avoidance between drones, and effective distribution of tasks to optimize the overall mission efficiency. Additionally, issues like signal interference and managing the flight paths of multiple drones simultaneously are significant hurdles. Addressing
these challenges is vital for unlocking the full potential of multi-drone systems, paving the way for advancements in various application domains.

Fig. 26: Applications of heterogeneous systems. (a) UAV-UGV path planning [166]; (b) UAV-UGV precise landing [170]; (c) UAV-UGV inventory inspection [173]; (d) UAV-UGV object detection [174].

C. Heterogeneous Systems Application

As the complexity and uncertainty of task scenarios escalate, it becomes increasingly challenging for a single robot, or even a homogeneous multi-robot system, to efficiently adapt to diverse environments. Consequently, in recent years, heterogeneous multi-robot systems (HMRS) have emerged as a focal point of research within the community [165]. In the domain of UAV applications, HMRS mainly refers to communication-enabled networks that integrate various UAVs with other intelligent robotic platforms, such as unmanned ground vehicles (UGVs) and unmanned surface vehicles (USVs). This integration facilitates a diverse range of applications, leveraging the unique capabilities of each system to enhance overall operational efficiency. These systems execute a range of tasks, either individually or in a collaborative manner. The inherently heterogeneous scheduling approach of HMRS significantly enhances feasibility and adaptability, thereby effectively tackling a series of demanding tasks across different environments. Consequently, applications of HMRS are rapidly evolving; examples include but are not limited to localization and path planning [166]–[169], precise landing [170]–[172], and comprehensive inspection and detection [173]–[175].

1) Localization and Path Planning: The UAV-UGV cooperation system can address the challenges of GPS denial and limited sensing range in UGVs. In this system, UAVs provide essential auxiliary information for UGV localization and path planning, relying solely on cost-effective cameras and wireless communication channels. Niu [166] introduced a framework wherein a single UAV served multiple UGVs, achieving optimal path planning based on aerial imagery. This approach surpassed traditional heuristic path planning algorithms in performance. Furthermore, Liu et al. [167] presented a joint UAV-UGV architecture designed to overcome frequent target occlusion issues encountered by single-ground platforms. This architecture enabled accurate and dynamic target localization, leveraging visual inputs from UAVs.
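To illustrate how aerial imagery can support UGV planning, the sketch below runs a breadth-first search over a binary traversability grid that could be obtained by segmenting a UAV's nadir image; it is a simplified baseline, not the disciplined convex-concave formulation of Niu [166].

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest 4-connected path on a binary traversability grid
    (0 = free, 1 = obstacle), e.g. obtained by segmenting aerial imagery.
    Returns a list of (row, col) cells, or None if no path exists."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

if __name__ == "__main__":
    grid = [[0, 0, 0, 0],
            [1, 1, 0, 1],
            [0, 0, 0, 0],
            [0, 1, 1, 0]]
    print(bfs_path(grid, (0, 0), (3, 3)))
```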
2) Precise Landing: Given the potential necessity for battery recharging and emergency maintenance of UAVs during extended missions, the UAV-UGV heterogeneous system can facilitate UAV landings. This design reduces reliance on manual intervention while enhancing the UAVs' capacity for prolonged, uninterrupted operation. In the study [170], a vision-based heterogeneous system was proposed to address the challenge of UAVs' temporary landings during long-range inspections. This system accomplished precise target geolocation and safe landings in the absence of GPS data by detecting QR codes mounted on UGVs. Additionally, Xu et al. [171] illustrated the application of UAV heterogeneous systems for landing on USVs. A similar approach was explored for UAVs' target localization and landing, leveraging QR code recognition on USVs. Collectively, these studies underscored the feasibility and adaptability of the heterogeneous landing system across different platforms.
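The visual front end of such landing systems can be as simple as locating the fiducial in the onboard image. The sketch below uses OpenCV's QR detector to return the marker payload and image-plane center that a landing controller could servo towards; the pose estimation and geolocation steps of [170], [171] are more involved, and the test image name is hypothetical.

```python
import cv2
import numpy as np

def find_landing_marker(frame: np.ndarray):
    """Detect a QR code in a BGR camera frame and return its payload and
    image-plane center, which a landing controller could servo towards."""
    detector = cv2.QRCodeDetector()
    payload, points, _ = detector.detectAndDecode(frame)
    if points is None:
        return None
    center = points.reshape(-1, 2).mean(axis=0)   # mean of the 4 corner points
    return payload, (float(center[0]), float(center[1]))

if __name__ == "__main__":
    frame = cv2.imread("ugv_landing_pad.jpg")     # hypothetical test image
    if frame is not None:
        print(find_landing_marker(frame))
```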
3) Inspection and Detection: Heterogeneous UAV systems present an effective solution to overcome the limitations of background clutter and incoherent target interference often encountered in single-ground vision detection platforms. By leveraging the expansive field of view and swift scanning capabilities of UAVs, in conjunction with the endurance and high accuracy of UGVs, such heterogeneous systems can achieve time-efficient and accurate target inspection and detection in specific applications. For instance, Kalinov et al. [173] introduced a heterogeneous inventory management system, pairing a ground robot with a UAV. In this system, the ground robot determined motion trajectories by deploying the SLAM algorithm, while the UAV, with its high maneuverability, was tasked with scanning barcodes. Furthermore, Pretto [175] developed a heterogeneous farming system to enhance agricultural automation. This innovative system utilized the aerial perspective of the UAV to assist in farmland segmentation and the classification of crops from weeds, significantly contributing to the advancement of automated farming practices.

To sum up, most of the applications above primarily focus on single UAV to single UGV or single UAV to multiple UGV configurations, with few scenarios designed for multiple UAVs interacting with multiple UGVs. It is evident that there remains significant research potential in the realm of vision-based, multi-agent-to-multi-agent heterogeneous systems. Key areas such as communication and data integration within heterogeneous systems, coordination and control in dynamic and unpredictable environments, and individual agents' autonomy and decision-making capabilities warrant further exploration.
VI. OPEN QUESTIONS AND POTENTIAL SOLUTIONS

Despite significant advancements in the domain of vision-based learning for drones, numerous challenges remain that impede the pace of development and real-world applicability of these methods. These challenges span various aspects, from data collection and simulation accuracy to operational efficiency and safety concerns.

A. Dataset

A major impediment in the field is the absence of a comprehensive, public dataset analogous to Open X-Embodiment [176] in robotic manipulation. This unified dataset should ideally encompass a wide range of scenarios and tasks to facilitate generalizable learning. The current reliance on domain-specific datasets like "Anti-UAV" [177] and "SUAV-DATA" [178] limits the scope and applicability of research. A potential solution is the collaborative development of a diverse, multi-purpose dataset by academic and industry stakeholders, incorporating various environmental, weather, and lighting conditions.

B. Simulator

While simulators are vital for training and validating vision-based drone models, their realism and accuracy often fall short of replicating real-world complexities. This gap hampers the transition from simulation to actual deployment. Meanwhile, there is no unified simulator covering most of the drone tasks, resulting in repetitive domain-specific simulator development [117], [179], [180]. Drawing inspiration from the self-driving car domain, the integration of off-the-shelf and highly flexible simulators such as CARLA [181] could be a solution. These simulators, known for their advanced features in realistic traffic simulation and diverse environmental conditions, can provide more authentic and varied data for training. Adapting such simulators to drone-specific scenarios could greatly enhance the quality of training and testing environments.
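A unified simulator would, at minimum, expose a common task interface. The sketch below shows a Gymnasium-style environment with toy double-integrator dynamics as a stand-in for a rendered drone task; it illustrates the interface only and is not the API of any simulator cited in [117], [179]–[181].

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class ToyHoverEnv(gym.Env):
    """Minimal Gymnasium-style drone task: keep a point mass near the origin.
    Real simulators would return camera observations instead of raw state."""

    def __init__(self, dt: float = 0.05):
        self.dt = dt
        self.observation_space = spaces.Box(low=-10.0, high=10.0, shape=(6,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)  # accel command
        self.state = np.zeros(6, dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(-1.0, 1.0, size=6).astype(np.float32)
        return self.state.copy(), {}

    def step(self, action):
        pos, vel = self.state[:3], self.state[3:]
        vel = vel + np.asarray(action, dtype=np.float32) * self.dt
        pos = pos + vel * self.dt
        self.state = np.concatenate([pos, vel])
        reward = -float(np.linalg.norm(pos))          # stay close to the origin
        terminated = bool(np.linalg.norm(pos) > 10.0)
        return self.state.copy(), reward, terminated, False, {}

if __name__ == "__main__":
    env = ToyHoverEnv()
    obs, _ = env.reset(seed=0)
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    print(reward)
```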

C. Sample Efficiency

Enhancing sample efficiency in machine learning models for drones is crucial, particularly in environments where data collection is hazardous or impractical. Even though simulators are available for generating training data, there are still challenges in ensuring the realism and diversity of these simulated environments. The gap between simulated and real-world data can lead to performance discrepancies when models are deployed in actual scenarios. Developing algorithms that leverage transfer learning [182], few-shot learning [183], and synthetic data generation [184] could provide significant strides in learning efficiently from limited datasets. These approaches aim to bridge the gap between simulation and reality, enhancing the applicability and robustness of machine learning models in diverse and dynamic real-world situations.
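One widely used recipe for sample-efficient learning is to freeze a pretrained backbone and train only a small task head on scarce drone data, as sketched below; the architecture, class count, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes: int = 5) -> nn.Module:
    """ResNet-18 with a frozen ImageNet-pretrained backbone and a trainable
    classification head, a simple way to learn from a small drone-specific dataset."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    for param in model.parameters():
        param.requires_grad = False                               # freeze pretrained features
    model.fc = nn.Linear(model.fc.in_features, num_classes)       # new trainable head
    return model

if __name__ == "__main__":
    model = build_finetune_model()
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    images = torch.randn(4, 3, 224, 224)                          # stand-in for a small labeled batch
    labels = torch.randint(0, 5, (4,))
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    print(float(loss))
```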

D. Inference Speed

Balancing inference speed with accuracy is a critical challenge for drones operating in dynamic environments. The key lies in optimizing machine learning models for edge computing, enabling drones to process data and make decisions swiftly. Techniques like model pruning [185], [186], quantization [187], [188], distillation [189] and the development of specialized hardware accelerators can play a pivotal role in this regard.
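A minimal PyTorch recipe combining two of these ideas is sketched below: unstructured magnitude pruning followed by post-training dynamic quantization of a small stand-in network; deployed systems would also consider structured pruning, distillation, and hardware-specific compilation.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small stand-in perception head; a real model would be a detector backbone.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# 1) Magnitude pruning: zero out 50% of the smallest weights in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")      # make the pruning permanent

# 2) Post-training dynamic quantization of the remaining weights to int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)                   # torch.Size([1, 10])
```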
E. Real World Deployment

Transitioning from controlled simulation environments to real-world deployment (Sim2Real) involves addressing unpredictability in environmental conditions, regulatory compliance, and adaptability to diverse operational contexts. Domain randomization [123] tries to address the Sim2Real issue in a certain way but is limited to predicted scenarios with known domain distributions. Developing robust and adaptive algorithms capable of on-the-fly learning and decision-making, along with rigorous field testing under varied conditions, can aid in overcoming these challenges.
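At its core, domain randomization [123] amounts to resampling simulator parameters every episode, as sketched below; the parameter names and ranges are illustrative and would be chosen to bracket the real drone's uncertainty.

```python
import random

def sample_randomized_domain(rng: random.Random) -> dict:
    """Draw one randomized simulation configuration. Ranges are illustrative;
    in practice they are chosen to bracket the uncertainty of the real drone."""
    return {
        "mass_kg": rng.uniform(0.7, 1.3),             # payload / battery variation
        "motor_gain": rng.uniform(0.85, 1.15),        # actuator mismatch
        "wind_mps": [rng.uniform(-3.0, 3.0) for _ in range(3)],
        "camera_latency_s": rng.uniform(0.01, 0.08),
        "light_intensity": rng.uniform(0.4, 1.6),     # visual appearance changes
        "texture_id": rng.randrange(100),
    }

if __name__ == "__main__":
    rng = random.Random(42)
    for episode in range(3):
        cfg = sample_randomized_domain(rng)
        # env.reset(options=cfg)  # a randomization-aware simulator would consume this
        print(episode, cfg["mass_kg"], cfg["camera_latency_s"])
```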
F. Embodied Intelligence in Open World

Existing vision-based learning methods for drones require explicit task descriptions and formal constraints, while in an open world, it is hard to provide all necessary formulations at the beginning to find the optimal solution. For instance, in a complex search and rescue mission, the drone can only find the targets first and conduct rescue based on the information collected. In each stage, the task may change, and there is no prior explicit problem at the start. Human interactions are necessary during this mission. With large language models and embodied intelligence, the potential of drone autonomy can be greatly increased. Through interactions in the open world [17], [190] or through few-shot imitation [191], vision-based learning can emerge with full autonomy for drone applications.

G. Safety and Security

Ensuring the safety and security of drone operations is paramount, especially in densely populated or sensitive areas. This includes not only physical safety but also cybersecurity concerns [192], [193]. The security aspect extends beyond data protection, including the resilience of drones to adversarial attacks. Such attacks could take various forms, from signal jamming to deceptive inputs aimed at misleading vision-based systems and DRL algorithms [194]. Addressing these concerns requires a multifaceted approach. Firstly, incorporating advanced cryptographic techniques ensures data integrity and secure communication. Secondly, implementing anomaly detection systems can help identify and mitigate unusual patterns indicative of adversarial interference. Moreover, improving the robustness of learning models against adversarial attacks and investigating the explainability of designed models [195], [196] are imperative. Lastly, regular updates and patches to the drone's software, based on the latest threat intelligence, can fortify its defenses against evolving cyber threats.
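As a minimal illustration of the anomaly-detection idea, the sketch below flags measurements whose innovation deviates from statistics calibrated on nominal flight logs; deployed detectors, such as the model-based scheme in [192], are considerably more elaborate.

```python
import numpy as np

class ResidualMonitor:
    """Flag potentially spoofed or adversarial measurements by checking the
    squared Mahalanobis distance of the innovation (measured - predicted)."""

    def __init__(self, calibration_residuals: np.ndarray, threshold: float = 13.8):
        # Baseline statistics from nominal flight logs; 13.8 ~ chi-square 99.9% for 2 dof.
        self.mean = calibration_residuals.mean(axis=0)
        self.cov_inv = np.linalg.inv(np.cov(calibration_residuals, rowvar=False))
        self.threshold = threshold

    def is_anomalous(self, residual: np.ndarray) -> bool:
        d = residual - self.mean
        return float(d @ self.cov_inv @ d) > self.threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nominal = rng.normal(scale=0.2, size=(500, 2))       # calibration residuals
    monitor = ResidualMonitor(nominal)
    print(monitor.is_anomalous(np.array([0.05, -0.1])))  # False: consistent with nominal flight
    print(monitor.is_anomalous(np.array([3.0, 2.5])))    # True: likely spoofed input
```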
VII. C ONCLUSION [7] P. M. Kornatowski, M. Feroskhan, W. J. Stewart, and D. Floreano, “A


morphing cargo drone for safe flight in proximity of humans,” IEEE
This comprehensive survey has thoroughly explored the Robotics and Automation Letters, vol. 5, no. 3, pp. 4233–4240, 2020.
rapidly growing field of vision-based learning for drones, [8] D. Câmara, “Cavalry to the rescue: Drones fleet to help rescuers
operations over disasters scenarios,” in 2014 IEEE Conference on
particularly emphasizing their evolving role in multi-drone Antenna Measurements & Applications (CAMA). IEEE, 2014, pp.
systems and complex environments such as search and res- 1–4.
cue missions and adversarial settings. The investigation re- [9] M. Graule, P. Chirarattananon, S. Fuller, N. Jafferis, K. Ma, M. Spenko,
R. Kornbluh, and R. Wood, “Perching and takeoff of a robotic insect on
vealed that drones are increasingly becoming sophisticated, overhangs using switchable electrostatic adhesion,” Science, vol. 352,
autonomous systems capable of intricate tasks, largely driven no. 6288, pp. 978–982, 2016.
by advancements in AI, machine learning, and sensor tech- [10] D. Floreano and R. J. Wood, “Science, technology and the future of
small autonomous drones,” Nature, vol. 521, no. 7553, pp. 460–466,
nology. The exploration of micro and nano drones, innovative 2015.
structural designs, and enhanced autonomy stand out as key [11] J. Shu and P. Chirarattananon, “A quadrotor with an origami-inspired
trends shaping the future of drone technology. Crucially, protective mechanism,” IEEE Robotics and Automation Letters, vol. 4,
no. 4, p. 3820–3827, 2019.
the integration of visual perception with machine learning [12] E. Ajanic, M. Feroskhan, S. Mintchev, F. Noca, and D. Floreano,
algorithms, including deep reinforcement learning, opens up “Bioinspired wing a and tail morphing extends drone flight capabil-
new avenues for drones to operate with greater efficiency and ities,” Sci. Robot., vol. 5, p. eabc2897, 2020.
[13] X. Zhou, J. Zhu, H. Zhou, C. Xu, and F. Gao, “Ego-swarm: A fully
intelligence. These capabilities are particularly pertinent in the autonomous and decentralized quadrotor swarm system in cluttered
context of object detection and decision-making processes, environments,” in 2021 IEEE international conference on robotics and
vital for complex drone operations. The survey categorized automation (ICRA). IEEE, 2021, pp. 4101–4107.
[14] E. Kaufmann, A. Loquercio, R. Ranftl, M. Müller, V. Koltun,
vision-based control methods into indirect, semi-direct, and and D. Scaramuzza, “Deep drone acrobatics,” arXiv preprint
end-to-end methods, offering a nuanced understanding of how arXiv:2006.05768, 2020.
drones perceive and interact with their environment. Applica- [15] E. Kaufmann, L. Bauersfeld, A. Loquercio, M. Müller, V. Koltun, and
D. Scaramuzza, “Champion-level drone racing using deep reinforce-
tions of vision-based learning drones, spanning from single- ment learning,” Nature, vol. 620, no. 7976, pp. 982–987, 2023.
agent to multi-agent and heterogeneous systems, demonstrate [16] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay,
their versatility and potential in various sectors, including D. Fox, J. Thomason, and A. Garg, “Progprompt: Generating situated
robot task plans using large language models,” in 2023 IEEE Interna-
agriculture, industrial inspection, and emergency response. tional Conference on Robotics and Automation (ICRA). IEEE, 2023,
However, this expansion also brings forth challenges such pp. 11 523–11 530.
as data processing limitations, real-time decision-making, and [17] A. Gupta, S. Savarese, S. Ganguli, and L. Fei-Fei, “Embodied intel-
ligence via learning and evolution,” Nature communications, vol. 12,
ensuring robustness in diverse operational scenarios. no. 1, p. 5721, 2021.
The survey highlighted open questions and potential so- [18] Y. Lu, Z. Xue, G.-S. Xia, and L. Zhang, “A survey on vision-based
lutions in the field, stressing the need for comprehensive uav navigation,” Geo-spatial information science, vol. 21, no. 1, pp.
21–32, 2018.
datasets, realistic simulators, improved sample efficiency, and [19] M. Y. Arafat, M. M. Alam, and S. Moh, “Vision-based navigation
faster inference speeds. Addressing these challenges is crucial techniques for unmanned aerial vehicles: Review and challenges,”
for the effective deployment of drones in real-world scenarios. Drones, vol. 7, no. 2, p. 89, 2023.
[20] E. Kakaletsis, C. Symeonidis, M. Tzelepi, I. Mademlis, A. Tefas,
Safety and security, especially in the context of adversarial N. Nikolaidis, and I. Pitas, “Computer vision for autonomous uav flight
environments, remain paramount concerns that need ongo- safety: An overview and a vision-based safe landing pipeline example,”
ing attention. While significant progress has been made in Acm Computing Surveys (Csur), vol. 54, no. 9, pp. 1–37, 2021.
[21] A. Mcfadyen and L. Mejias, “A survey of autonomous vision-based
vision-based learning for drones, the journey towards fully see and avoid for unmanned aircraft systems,” Progress in Aerospace
autonomous, intelligent, and reliable systems, even AGI in the Sciences, vol. 80, pp. 1–17, 2016.
physical world, is ongoing. Future research and development [22] R. Jenssen, D. Roverso et al., “Automatic autonomous vision-based
power line inspection: A review of current status and the potential role
in this field hold the promise of revolutionizing various indus- of deep learning,” International Journal of Electrical Power & Energy
tries, pushing the boundaries of what’s possible with drone Systems, vol. 99, pp. 107–120, 2018.
technology in complex and dynamic environments. [23] B. F. Spencer Jr, V. Hoskere, and Y. Narazaki, “Advances in computer
vision-based civil infrastructure inspection and monitoring,” Engineer-
ing, vol. 5, no. 2, pp. 199–222, 2019.
R EFERENCES [24] D. Hanover, A. Loquercio, L. Bauersfeld, A. Romero, R. Penicka,
Y. Song, G. Cioffi, E. Kaufmann, and D. Scaramuzza, “Autonomous
[1] R. Rajkumar, I. Lee, L. Sha, and J. Stankovic, “Cyber-physical systems: drone racing: A survey,” arXiv e-prints, pp. arXiv–2301, 2023.
the next computing revolution,” in Design Automation Conference. [25] B. Zhou, H. Xu, and S. Shen, “Racer: Rapid collaborative explo-
IEEE, 2010, pp. 731–736. ration with a decentralized multi-uav system,” IEEE Transactions on
[2] R. Baheti and H. Gill, “Cyber-physical systems,” The impact of control Robotics, vol. 39, no. 3, pp. 1816–1835, 2023.
technology, vol. 12, no. 1, pp. 161–166, 2011. [26] E. Kaufmann, A. Loquercio, R. Ranftl, A. Dosovitskiy, V. Koltun, and
[3] M. Hassanalian and A. Abdelkefi, “Classifications, applications, and D. Scaramuzza, “Deep drone racing: Learning agile flight in dynamic
design challenges of drones: A review,” Progress in Aerospace Sci- environments,” in Conference on Robot Learning. PMLR, 2018, pp.
ences, vol. 91, pp. 99–131, 2017. 133–145.
[4] P. Nooralishahi, C. Ibarra-Castanedo, S. Deane, F. López, S. Pant, [27] D. Falanga, K. Kleber, and D. Scaramuzza, “Dynamic obstacle avoid-
M. Genest, N. P. Avdelidis, and X. P. Maldague, “Drone-based non- ance for quadrotors with event cameras,” Science Robotics, vol. 5,
destructive inspection of industrial sites: A review and case studies,” no. 40, 2020.
Drones, vol. 5, no. 4, p. 106, 2021. [28] A. Loquercio, E. Kaufmann, R. Ranftl, M. Müller, V. Koltun, and
[5] N. J. Stehr, “Drones: The newest technology for precision agriculture,” D. Scaramuzza, “Learning high-speed flight in the wild,” Science
Natural Sciences Education, vol. 44, no. 1, pp. 89–91, 2015. Robotics, vol. 6, no. 59, p. eabg5810, 2021.
[6] P. M. Kornatowski, M. Feroskhan, W. J. Stewart, and D. Floreano, [29] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monoc-
“Downside up:rethinking parcel position for aerial delivery,” IEEE ular visual-inertial state estimator,” IEEE Transactions on Robotics,
Robotics and Automation Letters, vol. 5, no. 3, pp. 4297–4304, 2020. vol. 34, no. 4, pp. 1004–1020, 2018.
17

[30] N. J. Sanket, C. M. Parameshwara, C. D. Singh, A. V. Kuruttukulam, [51] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. Mc-
C. Fermüller, D. Scaramuzza, and Y. Aloimonos, “Evdodgenet: Deep Grew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schnei-
dynamic obstacle dodging with event cameras,” in 2020 IEEE Interna- der, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba,
tional Conference on Robotics and Automation (ICRA). IEEE, 2020, and L. Zhang, “Solving rubik’s cube with a robot hand,” arXiv preprint,
pp. 10 651–10 657. 2019.
[31] N. J. Sanket, C. D. Singh, C. Fermüller, and Y. Aloimonos, “Ajna: [52] M. Liaq and Y. Byun, “Autonomous uav navigation using reinforcement
Generalized deep uncertainty for minimal perception on parsimonious learning,” International Journal of Machine Learning and Computing,
robots,” Science Robotics, vol. 8, no. 81, p. eadd5139, 2023. vol. 9, no. 6, 2019.
[32] R. Siegwart, I. R. Nourbakhsh, and D. Scaramuzza, Introduction to [53] C. Wu, B. Ju, Y. Wu, X. Lin, N. Xiong, G. Xu, H. Li, and X. Liang,
autonomous mobile robots. MIT press, 2011. “Uav autonomous target search based on deep reinforcement learning
[33] N. Michael, S. Shen, K. Mohta, Y. Mulgaonkar, V. Kumar, K. Nagatani, in complex disaster scene,” IEEE Access, vol. 7, pp. 117 227–117 245,
Y. Okada, S. Kiribayashi, K. Otake, K. Yoshida et al., “Collaborative 2019.
mapping of an earthquake-damaged building via ground and aerial [54] C. Xiao, P. Lu, and Q. He, “Flying through a narrow gap using end-to-
robots,” Journal of Field Robotics, vol. 29, no. 5, pp. 832–841, 2012. end deep reinforcement learning augmented with curriculum learning
[34] H. Guan, X. Sun, Y. Su, T. Hu, H. Wang, H. Wang, C. Peng, and and sim2real,” IEEE Transactions on Neural Networks and Learning
Q. Guo, “UAV-lidar aids automatic intelligent powerline inspection,” Systems, vol. 34, no. 5, pp. 2701–2708, 2023.
International Journal of Electrical Power and Energy Systems, vol. [55] Y. Song, M. Steinweg, E. Kaufmann, and D. Scaramuzza, “Autonomous
130, p. 106987, sep 2021. drone racing with deep reinforcement learning,” in 2021 IEEE/RSJ
[35] R. Opromolla, G. Fasano, G. Rufino, and M. Grassi, “Uncooperative International Conference on Intelligent Robots and Systems (IROS).
pose estimation with a lidar-based system,” Acta Astronautica, vol. 110, IEEE, 2021, pp. 1205–1212.
pp. 287–297, 2015. [56] J. Xiao, P. Pisutsin, and M. Feroskhan, “Collaborative target search
[36] Z. Wang, Z. Zhao, Z. Jin, Z. Che, J. Tang, C. Shen, and Y. Peng, with a visual drone swarm: An adaptive curriculum embedded multi-
“Multi-stage fusion for multi-class 3d lidar detection,” in Proceedings stage reinforcement learning approach,” IEEE Transactions on Neural
of the IEEE/CVF International Conference on Computer Vision, 2021, Networks and Learning Systems, pp. 1–15, 2023.
pp. 3120–3128. [57] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object detection with
[37] M. O. Aqel, M. H. Marhaban, M. I. Saripan, and N. B. Ismail, “Review deep learning: A review,” IEEE Transactions on Neural Networks and
of visual odometry: types, approaches, challenges, and applications,” Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
SpringerPlus, vol. 5, pp. 1–26, 2016. [58] S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, “Zoedepth:
[38] J. Delmerico and D. Scaramuzza, “A benchmark comparison of monoc- Zero-shot transfer by combining relative and metric depth,” arXiv
ular visual-inertial odometry algorithms for flying robots,” in 2018 preprint arXiv:2302.12288, 2023.
IEEE international conference on robotics and automation (ICRA). [59] H. Laga, L. V. Jospin, F. Boussaid, and M. Bennamoun, “A survey
IEEE, 2018, pp. 2502–2509. on deep learning techniques for stereo-based depth estimation,” IEEE
[39] D. Scaramuzza and Z. Zhang, Aerial Robots, Visual-Inertial Odometry Transactions on Pattern Analysis and Machine Intelligence, vol. 44,
of. Berlin, Heidelberg: Springer Berlin Heidelberg, 2020, pp. 1–9. no. 4, pp. 1738–1764, 2020.
[Online]. Available: https://doi.org/10.1007/978-3-642-41610-1 71-1 [60] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, “A survey of convolutional
[40] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile neural networks: Analysis, applications, and prospects,” IEEE Trans-
and accurate monocular slam system,” IEEE transactions on robotics, actions on Neural Networks and Learning Systems, vol. 33, no. 12, pp.
vol. 31, no. 5, pp. 1147–1163, 2015. 6999–7019, 2022.
[41] T. Qin, P. Li, and S. Shen, “VINS-Mono: A Robust and Versatile [61] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
Monocular Visual-Inertial State Estimator,” IEEE Transactions on T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al.,
Robotics, vol. 34, no. 4, pp. 1004–1020, aug 2018. “An image is worth 16x16 words: Transformers for image recognition
[42] C. Campos, R. Elvira, J. J. Gómez Rodrı́guez, J. M. M. Montiel, and at scale,” arXiv preprint arXiv:2010.11929, 2020.
J. D. Tardós, “ORB-SLAM3: An Accurate Open-Source Library for [62] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo,
Visual, Visual-Inertial and Multi-Map SLAM,” 2021. “Swin transformer: Hierarchical vision transformer using shifted win-
[43] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, dows,” in Proceedings of the IEEE/CVF international conference on
S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis et al., “Event- computer vision, 2021, pp. 10 012–10 022.
based vision: A survey,” IEEE transactions on pattern analysis and [63] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi,
machine intelligence, vol. 44, no. 1, pp. 154–180, 2020. J. Fan, and Z. He, “A survey of visual transformers,” IEEE Transactions
[44] W. Gao, K. Wang, W. Ding, F. Gao, T. Qin, and S. Shen, “Autonomous on Neural Networks and Learning Systems, pp. 1–21, 2023.
aerial robot using dual-fisheye cameras,” Journal of Field Robotics, [64] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based
vol. 37, no. 4, pp. 497–514, 2020. convolutional networks for accurate object detection and segmenta-
[45] V. R. Kumar, S. Yogamani, H. Rashed, G. Sitsu, C. Witt, I. Leang, tion,” IEEE transactions on pattern analysis and machine intelligence,
S. Milz, and P. Mäder, “Omnidet: Surround view cameras based multi- vol. 38, no. 1, pp. 142–158, 2015.
task visual perception network for autonomous driving,” IEEE Robotics [65] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international
and Automation Letters, vol. 6, no. 2, pp. 2830–2837, 2021. conference on computer vision, 2015, pp. 1440–1448.
[46] A. D. Haumann, K. D. Listmann, and V. Willert, “DisCoverage: A [66] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
new paradigm for multi-robot exploration,” in Proceedings - IEEE object detection with region proposal networks,” Advances in neural
International Conference on Robotics and Automation, 2010, pp. 929– information processing systems, vol. 28, 2015.
934. [67] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu,
[47] A. H. Tan, F. P. Bejarano, Y. Zhu, R. Ren, and G. Nejat, “Deep and A. C. Berg, “Ssd: Single shot multibox detector,” in European
reinforcement learning for decentralized multi-robot exploration with conference on computer vision. Springer, 2016, pp. 21–37.
macro actions,” IEEE Robotics and Automation Letters, vol. 8, no. 1, [68] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
pp. 272–279, 2022. once: Unified, real-time object detection,” in Proceedings of the IEEE
[48] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and conference on computer vision and pattern recognition, 2016, pp. 779–
A. Farhadi, “Target-driven visual navigation in indoor scenes using 788.
deep reinforcement learning,” in 2017 IEEE international conference [69] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo
on robotics and automation (ICRA). IEEE, 2017, pp. 3357–3364. series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
[49] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, [70] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang,
M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and L. Dong et al., “Swin transformer v2: Scaling up capacity and
R. Hadsell, “Learning to navigate in complex environments,” 5th resolution,” in Proceedings of the IEEE/CVF conference on computer
International Conference on Learning Representations, ICLR 2017 - vision and pattern recognition, 2022, pp. 12 009–12 019.
Conference Track Proceedings, 2017. [71] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision
[50] Q. Wu, X. Gong, K. Xu, D. Manocha, J. Dong, and J. Wang, “Towards transformer backbones for object detection,” in European Conference
target-driven visual navigation in indoor scenes via generative imitation on Computer Vision. Springer, 2022, pp. 280–296.
learning,” IEEE Robotics and Automation Letters, vol. 6, no. 1, pp. [72] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. Ni, and H.-Y.
175–182, 2020. Shum, “Dino: Detr with improved denoising anchor boxes for end-
18

to-end object detection,” in The Eleventh International Conference on [94] R. B. Rusu, Z. C. Marton, N. Blodow, M. Dolha, and M. Beetz, “To-
Learning Representations, 2022. wards 3d point cloud based object maps for household environments,”
[73] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Robotics and Autonomous Systems, vol. 56, no. 11, pp. 927–941, 2008.
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” [95] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and
Advances in neural information processing systems, vol. 30, 2017. W. Burgard, “OctoMap: An efficient probabilistic 3D mapping
[74] Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, “Efficient attention: framework based on octrees,” Autonomous Robots, 2013, software
Attention with linear complexities,” in Proceedings of the IEEE/CVF available at https://octomap.github.io. [Online]. Available: https:
winter conference on applications of computer vision, 2021, pp. 3531– //octomap.github.io
3539. [96] E. W. Dijkstra, “A note on two problems in connexion with graphs,”
[75] M. Maaz, A. Shaker, H. Cholakkal, S. Khan, S. W. Zamir, R. M. Numerische Mathematik, vol. 1, pp. 269–271, 1959.
Anwer, and F. Shahbaz Khan, “Edgenext: efficiently amalgamated cnn- [97] P. E. Hart, N. J. Nilsson, and B. Raphael, “A formal basis for the
transformer architecture for mobile vision applications,” in European heuristic determination of minimum cost paths,” IEEE Transactions on
Conference on Computer Vision. Springer, 2022, pp. 3–20. Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.
[76] Y. Zheng, Z. Chen, D. Lv, Z. Li, Z. Lan, and S. Zhao, “Air-to-air visual [98] J. J. Kuffner and S. M. LaValle, “Rrt-connect: An efficient approach
detection of micro-uavs: An experimental evaluation of deep learning,” to single-query path planning,” in Proceedings 2000 ICRA. Millennium
IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1020–1027, Conference. IEEE International Conference on Robotics and Automa-
2021. tion. Symposia Proceedings (Cat. No. 00CH37065), vol. 2. IEEE,
[77] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss 2000, pp. 995–1001.
for dense object detection,” in Proceedings of the IEEE international [99] F. Augugliaro, A. P. Schoellig, and R. D’Andrea, “Generation of
conference on computer vision, 2017, pp. 2980–2988. collision-free trajectories for a quadrocopter fleet: A sequential convex
[78] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” programming approach,” in 2012 IEEE/RSJ international conference
arXiv preprint arXiv:1804.02767, 2018. on Intelligent Robots and Systems. IEEE, 2012, pp. 1917–1922.
[79] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, [100] I. Iswanto, A. Ma’arif, O. Wahyunggoro, and A. Imam, “Artificial
“Feature pyramid networks for object detection,” in Proceedings of the potential field algorithm implementation for quadrotor path planning,”
IEEE conference on computer vision and pattern recognition, 2017, Int. J. Adv. Comput. Sci. Appl, vol. 10, no. 8, pp. 575–585, 2019.
pp. 2117–2125. [101] T. Huang, S. Zhao, L. Geng, and Q. Xu, “Unsupervised monocular
[80] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality depth estimation based on residual neural network of coarse–refined
object detection,” in Proceedings of the IEEE conference on computer feature extractions for drone,” Electronics, vol. 8, no. 10, p. 1179,
vision and pattern recognition, 2018, pp. 6154–6162. 2019.
[81] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan, “Grid r-cnn,” in Proceed- [102] O. Khatib, “Real-time obstacle avoidance for manipulators and mo-
ings of the IEEE/CVF Conference on Computer Vision and Pattern bile robots,” in Proceedings. 1985 IEEE International Conference on
Recognition, 2019, pp. 7363–7372. Robotics and Automation, vol. 2. IEEE, 1985, pp. 500–505.
[82] D. T. Wei Xun, Y. L. Lim, and S. Srigrarom, “Drone detection [103] X. Dai, Y. Mao, T. Huang, N. Qin, D. Huang, and Y. Li, “Automatic
using yolov3 with transfer learning on nvidia jetson tx2,” in 2021 obstacle avoidance of quadrotor uav via cnn-based learning,” Neuro-
Second International Symposium on Instrumentation, Control, Artificial computing, vol. 402, pp. 346–358, 2020.
Intelligence, and Robotics (ICA-SYMP), 2021, pp. 1–6. [104] M. A. Anwar and A. Raychowdhury, “Navren-rl: Learning to fly in real
[83] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, environment via end-to-end deep reinforcement learning using monoc-
J. A. Bagnell, and M. Hebert, “Learning monocular reactive uav ular images,” in 2018 25th International Conference on Mechatronics
control in cluttered natural environments,” in 2013 IEEE international and Machine Vision in Practice (M2VIP). IEEE, 2018, pp. 1–6.
conference on robotics and automation. IEEE, 2013, pp. 1765–1772. [105] Y. Zhang, K. H. Low, and C. Lyu, “Partially-observable monocular
[84] L. Xie, S. Wang, A. Markham, and N. Trigoni, “Towards monocular autonomous navigation for uav through deep reinforcement learning,”
vision based obstacle avoidance through deep reinforcement learning,” in AIAA AVIATION 2023 Forum, 2023, p. 3813.
arXiv preprint arXiv:1706.09829, 2017. [106] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
[85] K. Mohta, M. Watterson, Y. Mulgaonkar, S. Liu, C. Qu, A. Makineni, MIT press, 2018.
K. Saulnier, K. Sun, A. Zhu, J. Delmerico et al., “Fast, autonomous [107] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
flight in gps-denied and cluttered environments,” Journal of Field recognition,” in Proceedings of the IEEE conference on computer vision
Robotics, vol. 35, no. 1, pp. 101–120, 2018. and pattern recognition, 2016, pp. 770–778.
[86] F. Gao, W. Wu, J. Pan, B. Zhou, and S. Shen, “Optimal Time [108] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
Allocation for Quadrotor Trajectory Generation,” in IEEE International large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
Conference on Intelligent Robots and Systems. Institute of Electrical [109] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
and Electronics Engineers Inc., dec 2018, pp. 4715–4722. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
[87] F. Gao, L. Wang, B. Zhou, X. Zhou, J. Pan, and S. Shen, “Teach- S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran,
repeat-replan: A complete and robust system for aggressive flight in D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through
complex environments,” IEEE Transactions on Robotics, vol. 36, no. 5, deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533,
pp. 1526–1545, 2020. feb 2015.
[88] F. Gao, W. Wu, W. Gao, and S. Shen, “Flying on point clouds: Online [110] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox-
trajectory generation and autonomous navigation for quadrotors in imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347,
cluttered environments,” Journal of Field Robotics, vol. 36, no. 4, pp. 2017.
710–733, jun 2019. [111] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8,
[89] F. Gao, L. Wang, B. Zhou, L. Han, J. Pan, and S. Shen, “Teach-repeat- no. 3, pp. 279–292, 1992.
replan: A complete and robust system for aggressive flight in complex [112] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
environments,” pp. 1526–1545, may 2019. with double q-learning,” in Proceedings of the AAAI conference on
[90] B. Zhou, F. Gao, L. Wang, C. Liu, and S. Shen, “Robust and efficient artificial intelligence, vol. 30, no. 1, 2016.
quadrotor trajectory generation for fast autonomous flight,” IEEE [113] Z. Zhu, K. Lin, B. Dai, and J. Zhou, “Off-policy imitation learning from
Robotics and Automation Letters, vol. 4, no. 4, pp. 3529–3536, 2019. observations,” Advances in Neural Information Processing Systems,
[91] B. Zhou, J. Pan, F. Gao, and S. Shen, “Raptor: Robust and perception- vol. 33, pp. 12 402–12 413, 2020.
aware trajectory replanning for quadrotor fast flight,” IEEE Transac- [114] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson,
tions on Robotics, 2021. “Counterfactual multi-agent policy gradients,” Proceedings of the AAAI
[92] L. Quan, Z. Zhang, X. Zhong, C. Xu, and F. Gao, “Eva-planner: En- conference on artificial intelligence, vol. 32, no. 1, 2018.
vironmental adaptive quadrotor planning,” in 2021 IEEE International [115] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster,
Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. and S. Whiteson, “Monotonic value function factorisation for deep
398–404. multi-agent reinforcement learning,” The Journal of Machine Learning
[93] Y. Zhang, Q. Yu, K. H. Low, and C. Lv, “A self-supervised monocular Research, vol. 21, no. 1, pp. 7234–7284, 2020.
depth estimation approach based on uav aerial images,” in 2022 [116] J. Xiao, Y. X. M. Tan, X. Zhou, and M. Feroskhan, “Learning
IEEE/AIAA 41st Digital Avionics Systems Conference (DASC). IEEE, collaborative multi-target search for a visual drone swarm,” in 2023
2022, pp. 1–8. IEEE Conference on Artificial Intelligence (CAI), 2023, pp. 5–7.
19

[117] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity [138] E. Lygouras, N. Santavas, A. Taitzoglou, K. Tarchanidis, A. Mitropou-
visual and physical simulation for autonomous vehicles,” in Field and los, and A. Gasteratos, “Unsupervised human detection with an em-
service robotics. Springer, 2018, pp. 621–635. bedded vision system on a fully autonomous uav for search and rescue
[118] A. Juliani, V.-P. Berges, E. Teng, A. Cohen, J. Harper, C. Elion, C. Goy, operations,” Sensors, vol. 19, no. 16, p. 3542, 2019.
Y. Gao, H. Henry, M. Mattar et al., “Unity: A general platform for [139] T. Tomic, K. Schmid, P. Lutz, A. Domel, M. Kassecker, E. Mair,
intelligent agents,” arXiv preprint arXiv:1809.02627, 2018. I. L. Grixa, F. Ruess, M. Suppa, and D. Burschka, “Toward a fully
[119] I. Zamora, N. G. Lopez, V. M. Vilches, and A. H. Cordero, “Extending autonomous uav: Research platform for indoor and outdoor urban
the openai gym for robotics: a toolkit for reinforcement learning using search and rescue,” IEEE robotics & automation magazine, vol. 19,
ros and gazebo,” arXiv preprint arXiv:1608.05742, 2016. no. 3, pp. 46–56, 2012.
[120] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. To- [140] J. Senthilnath, M. Kandukuri, A. Dokania, and K. Ramesh, “Applica-
bin, P. Abbeel, and W. Zaremba, “Transfer from simulation to real tion of uav imaging platform for vegetation analysis based on spectral-
world through learning deep inverse dynamics model,” arXiv preprint spatial methods,” Computers and Electronics in Agriculture, vol. 140,
arXiv:1610.03518, 2016. pp. 8–24, 2017.
[121] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bo- [141] M. R. Khosravi and S. Samadi, “Bl-alm: A blind scalable edge-guided
hez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for reconstruction filter for smart environmental monitoring through green
quadruped robots,” arXiv preprint arXiv:1804.10332, 2018. iomt-uav networks,” IEEE Transactions on Green Communications and
Networking, vol. 5, no. 2, pp. 727–736, 2021.
[122] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning
[142] C. Donmez, O. Villi, S. Berberoglu, and A. Cilek, “Computer vision-
for fast adaptation of deep networks,” in International conference on
based citrus tree detection in a cultivated environment using uav
machine learning. PMLR, 2017, pp. 1126–1135.
imagery,” Computers and Electronics in Agriculture, vol. 187, p.
[123] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, 106273, 2021.
“Domain randomization for transferring deep neural networks from [143] B. Lu and Y. He, “Species classification using unmanned aerial vehicle
simulation to the real world,” in 2017 IEEE/RSJ international con- (uav)-acquired high spatial resolution imagery in a heterogeneous
ference on intelligent robots and systems (IROS). IEEE, 2017, pp. grassland,” ISPRS Journal of Photogrammetry and Remote Sensing,
23–30. vol. 128, pp. 73–85, 2017.
[124] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. Mc- [144] D. Kim, M. Liu, S. Lee, and V. R. Kamat, “Remote proximity mon-
Grew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., itoring between mobile construction resources using camera-mounted
“Learning dexterous in-hand manipulation,” The International Journal uavs,” Automation in Construction, vol. 99, pp. 168–182, 2019.
of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020. [145] T. Khuc, T. A. Nguyen, H. Dao, and F. N. Catbas, “Swaying displace-
[125] M. A. Akhloufi, S. Arola, and A. Bonnet, “Drones chasing drones: ment measurement for structural monitoring using computer vision and
Reinforcement learning and deep search area proposal,” Drones, vol. 3, an unmanned aerial vehicle,” Measurement, vol. 159, p. 107769, 2020.
no. 3, p. 58, 2019. [146] S. Li, E. van der Horst, P. Duernay, C. De Wagter, and G. C.
[126] S. Geyer and E. Johnson, “3d obstacle avoidance in adversarial environ- de Croon, “Visual model-predictive localization for computationally
ments for unmanned aerial vehicles,” in AIAA Guidance, Navigation, efficient autonomous racing of a 72-g drone,” Journal of Field Robotics,
and Control Conference and Exhibit, 2006, p. 6542. vol. 37, no. 4, pp. 667–692, 2020.
[127] F. Schilling, J. Lecoeur, F. Schiano, and D. Floreano, “Learning [147] M. Muller, G. Li, V. Casser, N. Smith, D. L. Michels, and B. Ghanem,
vision-based flight in drone swarms by imitation,” IEEE Robotics and “Learning a controller fusion network by online trajectory filtering for
Automation Letters, vol. 4, no. 4, pp. 4523–4530, 2019. vision-based uav racing,” in Proceedings of the IEEE/CVF Conference
[128] Y. Xie, M. Lu, R. Peng, and P. Lu, “Learning agile flights through nar- on Computer Vision and Pattern Recognition Workshops, 2019, pp.
row gaps with varying angles using onboard sensing,” IEEE Robotics 0–0.
and Automation Letters, vol. 8, no. 9, pp. 5424–5431, 2023. [148] M. Muller, V. Casser, N. Smith, D. L. Michels, and B. Ghanem,
[129] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocu- “Teaching uavs to race: End-to-end regression of agile controls in
lar depth estimation with left-right consistency,” in Proceedings of the simulation,” in Proceedings of the European Conference on Computer
IEEE conference on computer vision and pattern recognition, 2017, Vision (ECCV) Workshops, 2018, pp. 0–0.
pp. 270–279. [149] R. Penicka and D. Scaramuzza, “Minimum-time quadrotor waypoint
[130] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging flight in cluttered environments,” IEEE Robotics and Automation
into self-supervised monocular depth estimation,” in Proceedings of Letters, vol. 7, no. 2, pp. 5719–5726, 2022.
the IEEE/CVF International Conference on Computer Vision, 2019, [150] E. Kaufmann, M. Gehrig, P. Foehn, R. Ranftl, A. Dosovitskiy,
pp. 3828–3838. V. Koltun, and D. Scaramuzza, “Beauty and the beast: Optimal methods
[131] Y. Liu, L. Wang, and M. Liu, “Yolostereo3d: A step back to 2d for meet learning for drone racing,” in 2019 International Conference on
efficient stereo 3d detection,” in 2021 IEEE International Conference Robotics and Automation (ICRA). IEEE, 2019, pp. 690–696.
on Robotics and Automation (ICRA). IEEE, 2021, pp. 13 018–13 024. [151] L. Xing, X. Fan, Y. Dong, Z. Xiong, L. Xing, Y. Yang, H. Bai, and
C. Zhou, “Multi-uav cooperative system for search and rescue based
[132] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, “Fast
on yolov5,” International Journal of Disaster Risk Reduction, vol. 76,
robust monocular depth estimation for obstacle detection with fully
p. 102972, 2022.
convolutional networks,” in 2016 IEEE/RSJ International Conference
[152] B. Lin, L. Wu, and Y. Niu, “End-to-end vision-based cooperative target
on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4296–
geo-localization for multiple micro uavs,” Journal of Intelligent &
4303.
Robotic Systems, vol. 106, no. 1, p. 13, 2022.
[133] A. Singh, D. Patil, and S. Omkar, “Eye in the sky: Real-time drone [153] M. E. Campbell and W. W. Whitacre, “Cooperative tracking using
surveillance system (dss) for violent individuals identification using vision measurements on seascan uavs,” IEEE Transactions on Control
scatternet hybrid deep learning network,” in Proceedings of the IEEE Systems Technology, vol. 15, no. 4, pp. 613–626, 2007.
conference on computer vision and pattern recognition workshops, [154] J. Gu, T. Su, Q. Wang, X. Du, and M. Guizani, “Multiple moving
2018, pp. 1629–1637. targets surveillance based on a cooperative network for multi-uav,”
[134] W. Li, H. Li, Q. Wu, X. Chen, and K. N. Ngan, “Simultaneously IEEE Communications Magazine, vol. 56, no. 4, pp. 82–89, 2018.
detecting and counting dense vehicles from drone images,” IEEE [155] Y. Cao, F. Qi, Y. Jing, M. Zhu, T. Lei, Z. Li, J. Xia, J. Wang, and G. Lu,
Transactions on Industrial Electronics, vol. 66, no. 12, pp. 9651–9662, “Mission chain driven unmanned aerial vehicle swarms cooperation for
2019. the search and rescue of outdoor injured human targets,” Drones, vol. 6,
[135] H. Zhou, H. Kong, L. Wei, D. Creighton, and S. Nahavandi, “On no. 6, p. 138, 2022.