Collaborative Coverage Path Planning of UAVs Using RL

Abstract—With the continuous application of unmanned aerial vehicles (UAVs) in national defense and civil use, the UAV cluster system, in which multiple UAVs cooperate to perform tasks, has become a key research topic in many countries. This paper focuses on the problem of multi-UAV Coverage Path Planning (mCPP), based on Reinforcement Learning (RL), to exploit all points of interest within an area, where each UAV starts from a random position and carries a camera during the mission. A number of optimal algorithms have been proposed for the coverage path planning of a single UAV; however, the problem remains under-explored for multiple UAVs. As such, we leverage deep reinforcement learning with Double Deep Q-learning Networks (DDQN) to learn a global optimal control policy for a team of UAVs under certain power constraints, so that they cooperate effectively to explore a wider area. Regarding the task area as a 2D plane, we divide it into a collection of uniform grid cells, each representing a section of the environment. The camera field of view of each UAV covers the cell directly underneath it. Simulation results demonstrate that, wherever the start positions are, the UAV cluster can fully cover the whole task area under the energy constraints and achieve autonomous collaboration. The proposed method also has great potential for application to dynamic environments.

Keywords-UAV cluster system, mCPP, DDQN, full coverage, energy constraints

I. INTRODUCTION

With the advantages of high mobility, flexible deployment, and low cost, unmanned aerial vehicles (UAVs) have served as an emerging facility widely applied to terrain coverage, agricultural production, environmental reconnaissance, air rescue, disaster warning, and other social industries [1-2]. More recently, the UAV cluster system composed of several collaborative UAVs has become the focus of attention in various application domains for its higher execution efficiency. In this paper, we investigate the problem of multi-UAV Coverage Path Planning (mCPP) to achieve complete coverage under battery duration, which is a fundamental issue for the UAV cluster system.

Generally, the Coverage Path Planning (CPP) task aims to determine an optimal path that travels over all points within an area of interest while reducing redundant paths [2]. For the case of a single agent, it is relatively easy to design a non-repetitive route or minimize the path length. A recent survey [3] analyzed different CPP approaches for UAVs, where the CPP problem is typically solved by splitting the target area into non-intersecting shapes, so that the UAV can travel over the maximum coverage while avoiding obstacles and minimizing the path length. As for multiple UAVs, the studies in [1, 4] decompose the terrain and assign the resulting sub-regions among the UAVs, transforming the mCPP problem into independent single-UAV CPP problems. However, those decomposition methods rely on prior knowledge of the environment and do not allow the UAVs to coordinate with each other. It is also difficult to predict trajectories in a large-scale environment with random start positions.

With the help of deep learning techniques, Reinforcement Learning (RL) has shown impressive progress, for example in playing video games [5], path planning for data harvesting [6], and V2X resource allocation [7]. This paper aims to address the coverage path planning of a UAV cluster based on an RL algorithm, specifically the Double Deep Q-learning Networks (DDQN) method. Various RL methods [5, 11, 12] have been developed for UAV path planning; however, they are designed only for a single UAV. With an increasing number of UAVs, the path planning problem becomes more complicated and uncertain. In this paper, we discretize the task area by splitting it into grid cells of equal size, and design the state set and action set of the UAV cluster. More importantly, we formulate reward and punishment mechanisms that help the UAV cluster achieve better path planning ability with higher accumulated scores. In addition, the use of two deep neural networks weakens the dependence between the target value and the network parameters, which speeds up convergence of the training process.

The main contributions of this paper are as follows:

• We introduce a novel control policy based on DDQN for a UAV cluster to completely cover the task area.

• The proposed method generalizes over random start positions and balances the requirements of full coverage, path shortening, and energy limitation.

• The control method enables the UAV cluster to autonomously and cooperatively plan flight trajectories without inter-vehicle communication or prior knowledge of the task area.

The remainder of this paper is organized as follows: Section II introduces the multi-UAV mobility and scenario model, Section III describes the proposed DDQN-based approach for the UAV cluster, Section IV presents the simulation results and discussion, and Section V concludes the paper with a summary.
II. SYSTEM MODEL

In this section, we present the key models for the UAV cluster coverage path planning. In order to implement the RL approach, we make some simplifications and reasonable assumptions in our models.

A. Task Scenario Model

It is assumed that there are n UAVs performing a coverage reconnaissance mission in an open area, which can be represented by a square grid world of K km x K km with cell size C km x C km. In this way, we abstract and discretize the area into a grid of size N x N, where N = ceil(K/C). As shown in Figure 1, the red dotted lines represent the warning areas, in which UAVs have to adjust their direction to prevent flying out of bounds.

Figure 1. Task environment of the model

B. UAV Model

In Figure 1, each yellow point represents one UAV with a random start position. As the reconnaissance range of small UAVs is limited, the field of view of a UAV can be approximated by a single grid cell of the area map. Thus, the cells along a UAV trajectory form the covered area: a covered cell is marked as 1 no matter how many times it is overlapped, and an uncovered cell is marked as 0. By counting the number of grid cells marked 1, we can calculate the coverage rate at each moment.

Because the DDQN algorithm can only handle discrete variables, the flight direction D of each UAV agent is discretized into a few fixed directions. As shown in Figure 2, the UAV can choose one flight direction from north, south, west, or east, marked by 1, 2, 3, and 4, namely D ∈ {1, 2, 3, 4}. Also, the total moving distance of each UAV is limited to 300 steps in consideration of the battery capacity.

Figure 2. Flight directions of the UAV

III. METHODOLOGY

In this section, we describe the RL-based method to address the aforementioned mCPP issue in detail. RL is usually modeled as a Markov Decision Process (MDP), which is defined through the tuple (S, A, P, R). S indicates the state space of the UAV cluster, which is the set of observations obtained by the UAV cluster while interacting with the environment. A describes the joint action space of the UAVs. P is a probability function determining how the state transitions. R represents a reward function that evaluates the action A selected by the agents according to the goals.

A. State space

The state space reflects the observations from the environment over a sequence of discrete time slots. Note that the environment is dynamic because of the mobility of the UAV cluster during the mission. In this paper, we adopt centralized learning with distributed implementation to achieve collaboration among UAVs. Therefore, we need to collect the global position information from the UAV cluster as the basis for decision making. An individual UAV i can obtain its current position s_i^t = (x_i^t, y_i^t) by GPS as a sub-state, where x_i^t and y_i^t indicate its X-axis and Y-axis positions at time t. Then, the joint state space is presented by

S^t = {s_1^t, s_2^t, ..., s_n^t}    (1)

where s_i^t is the sub-state of UAV i, i = 1, 2, ..., n. In this way, the position states of all other UAVs are available to each individual UAV.

B. Action space

As mentioned before, mCPP aims to build a strategy that autonomously makes multiple UAVs find optimal directions to completely cover the task area as efficiently as possible, while the DDQN-based method requires discrete variables to optimize. As a result, the action space of each UAV is composed of 4 directions, namely north, south, west, and east, and each UAV is respectively assigned a flight direction based on DDQN. As such, the size of the joint action space is 4^n with 4 actions and n UAVs. The joint action space is expressed as

A^t = {a_1^t, a_2^t, ..., a_n^t},  a_i^t ∈ {1, 2, 3, 4}    (2)
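To make the discretization above concrete, the following Python sketch shows one possible grid-world abstraction with coverage marking, the joint state of (1), and the joint action of (2). The class name GridCoverageEnv, the default parameter values, and the boundary handling are illustrative assumptions, not the authors' implementation, and the reward terms of Section III.C are omitted.

import math
import random

import numpy as np

class GridCoverageEnv:
    """Hypothetical grid world for n UAVs covering a K km x K km area with cell size C km."""

    # action codes 1-4 mapped to (dx, dy): north, south, west, east
    MOVES = {1: (0, 1), 2: (0, -1), 3: (-1, 0), 4: (1, 0)}

    def __init__(self, n_uavs=4, K=10.0, C=1.0, max_steps=300):
        self.n = n_uavs
        self.N = math.ceil(K / C)          # grid is N x N, N = ceil(K/C)
        self.max_steps = max_steps         # battery limit expressed as flight steps
        self.reset()

    def reset(self):
        self.covered = np.zeros((self.N, self.N), dtype=np.int8)  # 1 = visited, 0 = not yet
        # random start position for each UAV, as assumed in the paper
        self.pos = [(random.randrange(self.N), random.randrange(self.N)) for _ in range(self.n)]
        for x, y in self.pos:
            self.covered[x, y] = 1
        self.steps = 0
        return self.state()

    def state(self):
        # joint state S^t: the grid-cell positions of all UAVs, Eq. (1)
        return tuple(self.pos)

    def coverage_rate(self):
        return self.covered.sum() / (self.N * self.N)

    def step(self, joint_action):
        # joint action A^t: one direction code in {1, 2, 3, 4} per UAV, Eq. (2)
        for i, a in enumerate(joint_action):
            dx, dy = self.MOVES[a]
            x, y = self.pos[i]
            nx, ny = x + dx, y + dy
            if 0 <= nx < self.N and 0 <= ny < self.N:   # an out-of-bound move keeps the UAV hovering
                self.pos[i] = (nx, ny)
            self.covered[self.pos[i]] = 1
        self.steps += 1
        done = self.coverage_rate() == 1.0 or self.steps >= self.max_steps
        return self.state(), self.coverage_rate(), done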
C. Reward Function

Using a reward signal to formalize the idea of a goal is one of the most distinctive features of reinforcement learning, and it plays a critical role in solving problems with hard-to-optimize objectives [11]. A reward function should encourage correct actions as well as punish false behaviors of the agents. According to the final objective and constraints, the reward is defined as a simple number r^t at each time step, and the whole UAV cluster shares the same reward, so that collaborative behavior among the UAVs is motivated [12]. In particular, we decompose the main target, i.e., we perform credit assignment, to instruct the UAV agents by generating dense, informative signals. In response to the main goal of full coverage, the coverage reward is formulated in (3) in terms of c^t, the coverage rate of the UAV cluster at time t, and c^{t+1}, the coverage rate at the next moment after the UAV cluster takes the joint action A^t, where division rounded down is applied. It is worth noting that, with the expansion of coverage, the reward increases nonlinearly: the gap to the final goal determines the intensity of the incentive that guides the agents, and the final goal corresponds to the maximum reward. Moreover, a larger coverage means the paths of the UAVs are more likely to overlap. As such, reducing repeated paths is necessary, and we set a punishment with a negative reward of -3 in that case.

Meanwhile, with regard to disallowed behavior, such as flying out of bounds, the dangerous action is forecast within the warning area. Once a UAV is predicted to fly out of range at the next step, it is forced to hover at its current position until its next action is legal. The punishment caused by this stationary coverage therefore also helps to avoid out-of-bound behavior.

In addition to illegal behavior and overlapping coverage, we take the battery capacity into consideration as a constraint of the mCPP mission, which is translated into a limited number of flight steps for each UAV. This forms the other part of the reward function, given in (4), where T_max acts as the maximum number of flight steps, T is the actual average number of steps of each UAV, and a discount factor turns the extra expense into a linearly varying negative reward.

According to (3) and (4), the total reward r^t is summarized in (5) as the weighted sum of the two parts, where w_1 and w_2 are the respective weights. However, it is difficult to predict the agents' future performance by simply depending on the score of the next state. Therefore, the goal of RL is to maximize the total amount of discounted return that the agents receive, that is, to maximize not the immediate reward but the cumulative discounted reward in the long run [11]. The expected return is also called the action-value function or Q-value, denoted as follows:

Q^pi(s, a) = E_pi[ sum_{k=0}^{inf} gamma^k R^{t+k} | S^t = s, A^t = a ]    (6)

where gamma is the discount rate determining the present value of future rewards. A larger gamma means that rewards further in the future are taken into account in the total return until the mission is finished. On the contrary, the UAV cluster would be concerned only with maximizing immediate rewards when gamma = 0.

D. Learning algorithm

1) Q-learning

Q-learning is a model-free RL algorithm that generalizes over various situations and does not require prior information on the state transition function [13]. For the case in this paper, Q-learning is based on a cycle of interaction between the multi-agent system (namely the UAV cluster) and the task environment, through which an optimized behavior rule is trained to obtain as high a reward as possible. Figure 3 illustrates this cycle of interaction: the UAV cluster receives the current state S^t from the environment, each UAV then determines a flight action, and the individual actions are integrated into a joint action A^t. Subsequently, the environment feeds back a reward R^t to the UAV cluster to evaluate its performance and evolves to the next state S^{t+1}. It can be seen that Q-learning is devoted to learning an improved policy pi that maps states to actions through the action-value function Q(s, a), as formalized in (7), where the Q-value is the one denoted in (6). Hence, for a given state, an optimal policy can be obtained simply by selecting the action as

a^{t,*} = arg max_a Q(S^t, a)    (8)

That is, the action to be taken is the one that maximizes the Q-value.

Figure 3. The cycle of interaction for Q-learning
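For reference, the greedy selection in (8) and one interaction cycle of Figure 3 can be written out in a few lines. The following is a generic tabular Q-learning sketch with assumed names (Q, ACTIONS, the learning rate alpha, and the dummy transition at the end); the paper itself replaces the table with the deep networks of the next subsection.

import random
from collections import defaultdict

# Tabular Q-learning pieces: greedy selection per Eq. (8) and the standard
# one-step temporal-difference update.  ACTIONS, alpha, and the dummy
# transition below are illustrative assumptions, not values from the paper.
ACTIONS = (1, 2, 3, 4)              # north, south, west, east
Q = defaultdict(float)              # Q[(state, action)] -> estimated value

def greedy_action(state):
    # Eq. (8): pick the action that maximizes the Q-value in this state
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def epsilon_greedy(state, eps):
    # explore with probability eps, otherwise exploit the greedy action
    return random.choice(ACTIONS) if random.random() < eps else greedy_action(state)

def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # move Q(s, a) toward r + gamma * max_a' Q(s', a')
    target = reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (target - Q[(state, action)])

# one dummy interaction cycle (Figure 3): state -> action -> reward, next state
s, s_next = ((0, 0),), ((0, 1),)    # joint positions of a one-UAV "cluster"
a = epsilon_greedy(s, eps=0.3)
q_update(s, a, reward=1.0, next_state=s_next)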
2) Double Deep Q-learning Networks

In general, Q-learning can be implemented by constructing a table of dimension |S| x |A|. However, for a case with large-scale state and action spaces, the traditional tabular Q-learning method is inefficient and inaccurate in searching for optimal actions, because many state-action pairs are seldom visited. As a result, the widely used Deep Neural Network (DNN) is adopted in this paper to address this sophisticated, large-scale problem. If we rely on a single DNN both to calculate the target Q-value and to evaluate the Q-value while updating the network parameters based on that target, convergence is hindered by the excessive dependence between them. For this reason, we use two DNNs of identical structure: one selects actions and updates its network parameters, while the other is only responsible for calculating target Q-values and asynchronously copies the parameters of the former network. With the RMSProp optimizer, we train a Deep Q-Network (DQN) so that the expected temporal difference (TD) error converges efficiently, achieving the optimal mapping from the input state S^t to the output Q-values.

In the training phase, we store previous experience in a memory buffer M to break the correlations between successive training data in the sequence. To be specific, after the UAV cluster performs one cycle of interaction, a training sample represented by the tuple (S^t, A^t, R^t, S^{t+1}) is collected into M. When the number of samples grows to the maximum size of M, a new sample randomly replaces one sample stored in the buffer. As such, the DQN can be trained with a mini-batch of diverse and uncorrelated data at every episode. At each training step, the UAV cluster leverages a dynamic soft policy, i.e., epsilon-greedy, to guarantee that both exploration and exploitation are considered. This policy indicates that the action with the maximal estimated value is chosen with probability 1 - epsilon, while a random action is chosen with probability epsilon, and epsilon decreases gradually as the number of training steps increases.

In particular, there are two separate networks, called the Q-evaluate network and the Q-target network, used to construct the TD error as

delta^t = R^t + gamma * max_{a'} Q(S^{t+1}, a'; theta') - Q(S^t, A^t; theta)    (9)

where theta and theta' indicate the parameters of the Q-evaluate network and the Q-target network, respectively. Note that theta' is copied from theta of the Q-evaluate network periodically, being updated after several rounds, and gamma is the discount factor. Formula (9) means that when the optimal Q-value is acquired, the TD error converges to zero. For simplicity, we use y^t to represent the former term of (9) as the target value, i.e.,

y^t = R^t + gamma * max_{a'} Q(S^{t+1}, a'; theta')    (10)

As such, with y^t from the Q-target network acting as a label and the evaluative Q-value from the Q-evaluate network, the loss function for updating the weights theta of the DQN can be expressed as

L(theta) = E[ (y^t - Q(S^t, A^t; theta))^2 ]    (11)

Nevertheless, it is neither robust nor proper to blindly select the action that maximizes the Q-value of the Q-target network, which may lead to overestimation of Q-values under certain conditions. As a result, a further improvement is made in Double Deep Q-learning Networks (DDQN), which separates the selection of the action that generates the target Q-value from the calculation of the target Q-value itself. The Q-evaluate network is used to choose the action that maximizes Q; subsequently, the Q-target network outputs the target Q-value by evaluating S^{t+1} together with the previously selected action. This operation ensures that the target Q-value is appropriate rather than simply the maximum one, so as to avoid choosing an over-estimated action. Thus, we finally adopt the DDQN method in this paper, and the loss function for our model is given by

L(theta) = E[ (y_DDQN^t - Q(S^t, A^t; theta))^2 ]    (12)

where the target value y_DDQN^t is described by

y_DDQN^t = R^t + gamma * Q(S^{t+1}, arg max_{a'} Q(S^{t+1}, a'; theta); theta')    (13)

3) Training and Testing

The entire structure of the DDQN algorithm is shown in Figure 4. We also summarize the procedure for solving the mCPP problem for the UAV cluster based on DDQN in Algorithm 1.
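The difference between the DQN target in (10) and the DDQN target in (13) can be seen in the following NumPy sketch, which computes both targets for a small assumed batch. The array values are placeholders and terminal-state handling is omitted.

import numpy as np

def dqn_targets(reward, q_target_next, gamma=0.9):
    # Eq. (10): y = r + gamma * max_a' Q_target(s', a')
    return reward + gamma * q_target_next.max(axis=1)

def ddqn_targets(reward, q_eval_next, q_target_next, gamma=0.9):
    # Eq. (13): the Q-evaluate network chooses the action, the Q-target
    # network evaluates it, which mitigates over-estimation.
    best_actions = q_eval_next.argmax(axis=1)
    return reward + gamma * q_target_next[np.arange(len(reward)), best_actions]

reward = np.array([1.0, -3.0])                   # batch of 2 rewards (illustrative)
q_eval_next = np.array([[0.2, 0.9, 0.1, 0.4],    # Q-evaluate outputs for s'
                        [0.5, 0.3, 0.8, 0.1]])
q_target_next = np.array([[0.3, 0.7, 0.2, 0.6],  # Q-target outputs for s'
                          [0.4, 0.2, 0.9, 0.3]])
print(dqn_targets(reward, q_target_next))
print(ddqn_targets(reward, q_eval_next, q_target_next))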
Figure 4. The structure of DDQN

Algorithm 1 UAV coverage path planning with DDQN
1: Set task environment parameters and start environment modeling
2: Build neural network model and initialize network parameters
3: for each episode do:
4:   Update the location and coverage of UAV cluster nodes
5:   for each step do

IV. SIMULATION RESULTS AND DISCUSSION

Simulation results are presented in this section to illustrate the performance of the proposed method.

We assume that all UAVs in one cluster keep the same flight altitude and speed during the mission. Within every episode, the UAV cluster generates its initial positions randomly. The experiments are conducted with the parameters in Table I.

TABLE I. SIMULATION PARAMETERS
Parameter            Value
Number of UAVs n     4
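For orientation, the training procedure summarized in Algorithm 1 and Section III.D (experience replay, epsilon-greedy exploration, DDQN targets, and periodic copying of the Q-evaluate parameters into the Q-target network) might be sketched as follows. The network sizes, hyperparameters, and the placeholder random transitions below are illustrative assumptions, not the configuration of Table I or the authors' implementation.

import random
from collections import deque

import torch
import torch.nn as nn

N_UAVS, STATE_DIM, N_JOINT_ACTIONS = 4, 8, 4 ** 4    # state: (x, y) per UAV

def make_net():
    # two networks of identical structure, as described in Section III.D
    return nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                         nn.Linear(128, N_JOINT_ACTIONS))

q_eval, q_target = make_net(), make_net()
q_target.load_state_dict(q_eval.state_dict())
optimizer = torch.optim.RMSprop(q_eval.parameters(), lr=1e-3)
memory = deque(maxlen=10_000)                         # replay buffer M
gamma, eps, batch_size = 0.9, 1.0, 32

for step in range(200):                               # stand-in for the episode/step loops
    s = torch.rand(STATE_DIM)                         # placeholder state from the environment
    if random.random() < eps:                         # epsilon-greedy joint action
        a = random.randrange(N_JOINT_ACTIONS)
    else:
        a = int(q_eval(s).argmax())
    r, s_next = random.uniform(-3, 10), torch.rand(STATE_DIM)  # placeholder feedback
    memory.append((s, a, r, s_next))
    eps = max(0.05, eps * 0.99)                       # decay exploration over time

    if len(memory) >= batch_size:
        batch = random.sample(list(memory), batch_size)
        ss = torch.stack([b[0] for b in batch])
        aa = torch.tensor([b[1] for b in batch])
        rr = torch.tensor([b[2] for b in batch])
        ss_next = torch.stack([b[3] for b in batch])
        with torch.no_grad():                         # DDQN target, Eq. (13)
            best = q_eval(ss_next).argmax(dim=1)
            y = rr + gamma * q_target(ss_next).gather(1, best.unsqueeze(1)).squeeze(1)
        q_sa = q_eval(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, y)        # Eq. (12)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if step % 50 == 0:                                # periodic target-network sync
        q_target.load_state_dict(q_eval.state_dict())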
The network output is the Q-values of the joint actions. Furthermore, this verifies that it is not necessary for the UAV cluster to obtain prior knowledge about the geographic information of the task area. Each individual UAV only needs to upload its GPS location periodically to the dispatching center, and all UAVs receive and obey the commands coherently, without the extra delay and energy consumption of inter-UAV communication.
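Since the networks score joint actions, the dispatching center has to translate a single joint-action index back into one flight direction per UAV before broadcasting commands. A minimal sketch of such a mapping is shown below; the base-4 encoding and the function names are assumptions for illustration, as the paper does not specify how the joint action is indexed.

# Hypothetical mapping between a joint-action index in {0, ..., 4**n - 1}
# and per-UAV direction codes {1: north, 2: south, 3: west, 4: east}.
def decode_joint_action(index, n_uavs):
    directions = []
    for _ in range(n_uavs):
        directions.append(index % 4 + 1)   # low "digit" first: UAV 0, UAV 1, ...
        index //= 4
    return directions

def encode_joint_action(directions):
    index = 0
    for d in reversed(directions):
        index = index * 4 + (d - 1)
    return index

assert decode_joint_action(encode_joint_action([1, 4, 2, 3]), 4) == [1, 4, 2, 3]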
REFERENCES

[2] Galceran, E., Carreras, M. (2013). A survey on coverage path planning for robotics. Robotics and Autonomous Systems, 61(12), 1258-1276.
[3] Cabreira, T. M., Brisolara, L. B., Ferreira Jr, P. R. (2019). Survey on coverage path planning with unmanned aerial vehicles. Drones, 3(1), 4.
[4] Maza, I., Ollero, A. (2007). Multiple UAV cooperative searching operation using polygon area decomposition and efficient coverage algorithms. In: Alami, R., Chatila, R., Asama, H. (Eds.), Distributed Autonomous Robotic Systems 6. Springer, Tokyo. pp. 221-230.
[5] Yu, C., Velu, A., Vinitsky, E., Wang, Y., Bayen, A., Wu, Y. (2021). The surprising effectiveness of MAPPO in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955.
[6] Bayerlein, H., Theile, M., Caccamo, M., Gesbert, D. (2021). Multi-UAV path planning for wireless data harvesting with deep reinforcement learning. IEEE Open Journal of the Communications Society, 2, 1171-1187.
[7] Ye, H., Li, G. Y., Juang, B. H. F. (2019). Deep reinforcement learning based resource allocation for V2V communications. IEEE Transactions on Vehicular Technology, 68(4), 3163-3173.
[8] Theile, M., Bayerlein, H., Nai, R., Gesbert, D., Caccamo, M. (2020). UAV coverage path planning under varying power constraints using deep reinforcement learning. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE. pp. 1444-1449.
[9] Maciel-Pearson, B. G., Marchegiani, L., Akcay, S., Atapour-Abarghouei, A., Garforth, J., Breckon, T. P. (2019). Online deep reinforcement learning for autonomous UAV navigation and exploration of outdoor environments. arXiv preprint arXiv:1912.05684.
[10] Piciarelli, C., Foresti, G. L. (2019). Drone patrolling with reinforcement learning. In: Proceedings of the 13th International Conference on Distributed Smart Cameras. pp. 1-6.
[11] Thrun, S., Littman, M. L. (2000). Reinforcement learning: An introduction. AI Magazine, 21(1), 103-103.
[12] Liang, L., Ye, H., Li, G. Y. (2019). Spectrum sharing in vehicular networks based on multi-agent reinforcement learning. IEEE Journal on Selected Areas in Communications, 37(10), 2282-2292.
[13] Wu, F., Zhang, H., Wu, J., Han, Z., Poor, H. V., Song, L. (2021). UAV-to-device underlay communications: Age of information minimization by multi-agent deep reinforcement learning. IEEE Transactions on Communications.