
3D UAV Trajectory and Data Collection Optimisation via Deep Reinforcement Learning

Nguyen, K. K., Duong, T. Q., Do-Duy, T., Claussen, H., & Hanzo, L. (2022). 3D UAV Trajectory and Data Collection Optimisation via Deep Reinforcement Learning. IEEE Transactions on Communications. Advance online publication. https://doi.org/10.1109/TCOMM.2022.3148364

Published in: IEEE Transactions on Communications

Document Version: Peer reviewed version

Publisher rights: © 2022 IEEE.


3D UAV Trajectory and Data Collection Optimisation via Deep Reinforcement Learning

Khoi Khac Nguyen, Student Member, IEEE, Trung Q. Duong, Fellow, IEEE, Tan Do-Duy, Member, IEEE, Holger Claussen, Fellow, IEEE, and Lajos Hanzo, Fellow, IEEE

Khoi Khac Nguyen and Trung Q. Duong are with the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT7 1NN, U.K. (e-mail: {knguyen02, trung.q.duong}@qub.ac.uk). Tan Do-Duy is with Ho Chi Minh City University of Technology and Education, Vietnam (e-mail: tandd@hcmute.edu.vn). Holger Claussen is with Tyndall National Institute, Dublin, Ireland (e-mail: holger.claussen@tyndall.ie). Lajos Hanzo is with the School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, U.K. (e-mail: lh@ecs.soton.ac.uk).

The work of T. Q. Duong was supported by the U.K. Royal Academy of Engineering (RAEng) under the RAEng Research Chair and Senior Research Fellowship scheme Grant RCSRF2021\11\41. L. Hanzo would like to acknowledge the financial support of the Engineering and Physical Sciences Research Council projects EP/P034284/1 and EP/P003990/1 (COALESCE) as well as of the European Research Council's Advanced Fellow Grant QuantCom (Grant No. 789028).

Abstract—Unmanned aerial vehicles (UAVs) are now beginning to be deployed for enhancing the network performance and coverage in wireless communication. However, due to the limitation of their on-board power and flight time, it is challenging to obtain an optimal resource allocation scheme for the UAV-assisted Internet of Things (IoT). In this paper, we design a new UAV-assisted IoT system relying on the shortest flight path of the UAVs while maximising the amount of data collected from IoT devices. Then, a deep reinforcement learning-based technique is conceived for finding the optimal trajectory and throughput in a specific coverage area. After training, the UAV has the ability to autonomously collect all the data from user nodes at a significant total sum-rate improvement while minimising the associated resources used. Numerical results are provided to highlight how our techniques strike a balance between the throughput attained, the trajectory, and the time spent. More explicitly, we characterise the attainable performance in terms of the UAV trajectory, the expected reward and the total sum-rate.

Keywords: UAV-assisted wireless network, trajectory, data collection, and deep reinforcement learning.

I. INTRODUCTION

Given the agility of unmanned aerial vehicles (UAVs), they are capable of supporting compelling applications and are beginning to be deployed more broadly. Recently, the UK and Chilean authorities proposed to deliver medical support and other essential supplies by UAV to vulnerable people in response to Covid-19 [1], [2]. In [3], the authors used UAVs for image collection and high-resolution topography exploration. However, given the several limitations of the on-board power level and the ability to adapt to changes in the environment, UAVs may not be fully autonomous and can only operate for short flight durations, unless remote laser-charging is used [4]. Moreover, for challenging tasks such as topographic surveying, data collection or obstacle avoidance, the existing UAV technologies cannot operate in an optimal manner.

Wireless networks supported by UAVs constitute a promising technology for enhancing the network performance [5]. The applications of UAVs in wireless networks span diverse research fields, such as wireless sensor networks (WSNs) [6], caching [7], heterogeneous cellular networks [8], massive multiple-input multiple-output (MIMO) [9], disaster communications [10], [11] and device-to-device (D2D) communications [12]. For example, in [13], UAVs were deployed to provide network coverage for people in remote areas and disaster zones. UAVs were also used for collecting data in a WSN [6]. Nevertheless, the benefits of UAV-aided wireless communication are critically dependent on the limited on-board power level. Thus, the resource allocation of UAV-aided wireless networks plays a pivotal role in approaching the optimal performance. Yet, the existing contributions typically assume a static environment [10], [11], [14] and often ignore the stringent flight time constraints of real-life applications [6], [8], [15].

Machine learning has recently been proposed for the intelligent support of UAVs and other devices in the network [9], [16]–[24]. Reinforcement learning (RL) is capable of searching for an optimal policy by trial-and-error learning. However, it is challenging for model-free RL algorithms, such as Q-learning, to obtain an optimal strategy when considering a large state and action space. Fortunately, relying on the emerging neural networks, the sophisticated combination of RL and deep learning, namely deep reinforcement learning (DRL), is eminently suitable for solving high-dimensional problems. Hence, DRL algorithms have been widely applied in fields such as robotics [25], business management [26] and gaming [27]. Recently, DRL has also become popular for solving diverse problems in wireless networks thanks to its decision-making ability and flexible interaction with the environment [7], [9], [18]–[24], [28]–[30]. For example, DRL was used for solving problems in the areas of resource allocation [18], [19], [29], navigation [9], [31] and interference management [22].

A. Related Contributions

UAV-aided wireless networks have also been used for machine-to-machine communications [32] and D2D scenarios in 5G [14], [33], but the associated resource allocation problems remain challenging in real-life applications. Several techniques have been developed for solving resource allocation problems [18], [19], [31], [34]–[36]. In [34], the authors conceived multi-beam UAV communications and a cooperative interference cancellation scheme for maximising the uplink sum-rate received from multiple UAVs by the base stations (BSs) on the ground.
The UAVs were deployed as access points to serve several ground users in [35]. Then, the authors proposed successive convex programming for maximising the minimum uplink rate gleaned from all the ground users. In [31], the authors characterised the trade-off between the ground terminal transmission power and the specific UAV trajectory, both for a straight and for a circular trajectory.

The issues of data collection, energy minimisation, and path planning have been considered in [23], [32], [37]–[45]. In [38], the authors minimised the energy consumption of the data collection task considered by jointly optimising the sensor nodes' wakeup schedule and the UAV trajectory. The authors of [39] proposed an efficient algorithm for joint trajectory and power allocation optimisation in UAV-assisted networks to maximise the sum-rate during a specific length of time. A pair of near-optimal approaches was proposed, optimising the trajectory for a given UAV power allocation and the power allocation for a given trajectory. In [32], the authors introduced a communication framework for UAV-to-UAV communication under the constraints of the UAV's flight speed, location uncertainty and communication throughput. Then, a path planning algorithm was proposed for minimising the associated task completion time while balancing the performance versus computational complexity trade-off. However, these techniques mostly operate in offline modes and may impose excessive delay on the system. It is crucial to improve the decision-making time for meeting the stringent requirements of UAV-assisted wireless networks.

Again, machine learning has been recognised as a powerful tool for solving the highly dynamic trajectory and resource allocation problems of wireless networks. In [36], the authors proposed a model based on the classic k-means algorithm for grouping the users into clusters and assigned a dedicated UAV to serve each cluster. By relying on their decision-making ability, DRL algorithms have been used for lending each node some degree of autonomy [7], [18]–[21], [28], [29], [46]. In [28], an optimal DRL-based channel access strategy maximising the sum rate and α-fairness was considered. In [18], [19], we deployed DRL techniques for enhancing the energy-efficiency of D2D communications. In [21], the authors characterised the DQL algorithm for minimising the data packet loss of UAV-assisted power transfer and data collection systems. As a further advance, caching problems were considered in [7] to maximise the cache success hit rate and to minimise the transmission delay. The authors designed both a centralised and a decentralised system model and used an actor-critic algorithm to find the optimal policy.

DRL algorithms have also been applied for path planning in UAV-assisted wireless communications [9], [22]–[24], [30], [47]. In [22], the authors proposed a DRL algorithm based on the echo state network of [48] for finding the flight path, transmission power and associated cell in UAV-powered wireless networks. The so-called deterministic policy gradient algorithm of [49] was invoked for UAV-assisted cellular networks in [30]. The UAV's trajectory was designed for maximising the uplink sum-rate attained without knowledge of the user location and the transmit power. Moreover, in [9], the authors used the DQL algorithm for the UAV's navigation based on the received signal strengths estimated by a massive MIMO scheme. In [23], Q-learning was used for controlling the movement of multiple UAVs in a pair of scenarios, namely for static user locations and for dynamic user locations under a random walk model. However, the aforementioned contributions have not addressed the joint trajectory and data collection optimisation of UAV-assisted networks, which is a difficult research challenge. Furthermore, these existing works mostly neglected interference, 3D trajectories and dynamic environments.

B. Contributions and Organisation

A novel DRL-aided UAV-assisted system is conceived for finding the optimal UAV path that maximises a joint reward function based on the shortest flight distance and the uplink transmission rate. We boldly and explicitly contrast our proposed solution to the state-of-the-art in Table I. Our main contributions are further summarised as follows:

• In our UAV-aided system, the maximum amount of data is collected from the users with the shortest distance travelled.

• Our UAV-aided system is specifically designed for tackling the stringent constraints owing to the position of the destination, the UAV's limited flight time and the communication link's realistic constraints. The UAV's objective is to find the optimal trajectory for maximising the total network throughput, while minimising its distance travelled.

• Explicitly, these challenges are tackled by conceiving bespoke DRL techniques for solving the above problem. To elaborate, the area is divided into a grid to enable fast convergence. Following its training, the UAV has the autonomy to make a decision concerning its next action at each position in the area, hence eliminating the need for human navigation. This makes our UAV-aided system more reliable and practical, and optimises the resource requirements.

• A pair of scenarios is considered relying either on three or five clusters for quantifying the efficiency of our novel DRL techniques in terms of the sum-rate, the trajectory and the associated time. A convincing 3D trajectory visualisation is also provided.

• Finally, but most importantly, it is demonstrated that our DRL techniques approach the performance of the optimal "genie-solution" associated with perfect knowledge of the environment.

Although the existing DRL algorithms have been well exploited in wireless networks, it is challenging to apply them to the current scenario owing to the stringent constraints of the considered system, such as the UAV's flying time, the transmission distance, and the mobile users. For the DQL and dueling DQL algorithms, we discretise the flying path into a grid, and the UAV only needs to decide its action within a finite action space. With a finite state and action space, the neural networks can be readily trained and deployed for the online phase.
TABLE I
A COMPARISON WITH THE EXISTING LITERATURE
(Columns: [37], [6], [21], [23], [40], [9], [41], [47], [42], [43], our work. Rows: 3D trajectory, sum-rate maximisation, time minimisation, dynamic environment, unknown users, reinforcement learning, deep neural networks.)

We have also tried other existing RL algorithms and found that some of them are not effective in solving our proposed problem. Meanwhile, the continuous-action RL solvers, e.g., the deep deterministic policy gradient (DDPG) and proximal policy optimisation (PPO), are not well suited to this trade-off problem. Therefore, in this paper, we propose the DQL and dueling DQL algorithms to obtain the optimal trade-off between the total achievable sum-rate and the trajectory. As such, we can transfer a real-life application into a digital environment for optimisation and solve it efficiently.

The rest of our paper is organised as follows. In Section II, we describe our data collection system model and the problem formulation of IoT networks relying on UAVs. Then, the mathematical background of the DRL algorithms is presented in Section III. Deep Q-learning (DQL) is employed for finding the best trajectory and for solving our data collection problem in Section IV. Furthermore, we use the dueling DQL algorithm of [50] for improving the system performance and convergence speed in Section V. Next, we characterise the efficiency of the DRL techniques in Section VI. Finally, in Section VII, we summarise our findings and discuss our future research.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Consider a system consisting of a single UAV and M groups of users, as shown in Fig. 1, where the UAV relying on a single antenna visits all clusters to cover all the users. The 3D coordinate of the UAV at time step t is defined as X^t = (x_0^t, y_0^t, H_0^t). Each cluster consists of K users, which are unknown and distributed randomly within the coverage radius C. The users move following the random walk model with the maximum velocity v. The position of the kth user in the mth cluster at time step t is defined as X_{m,k}^t = (x_{m,k}^t, y_{m,k}^t). The UAV's objective is to find the best trajectory while covering all the users and to reach the dock upon completing its mission.

[Figure] Fig. 1. System model of UAV-aided IoT communications.

A. Observation model

The distance from the UAV to user k in cluster m at time step t is given by:

d_{m,k}^t = \sqrt{(x_0^t - x_{m,k}^t)^2 + (y_0^t - y_{m,k}^t)^2 + (H_0^t)^2}.   (1)

We assume that the communication channels between the UAV and the users are dominated by line-of-sight (LoS) links; thus the channel between the UAV and the kth user in the mth cluster at time step t follows the free-space path loss model, which is represented as

h_{m,k}^t = \beta_0 (d_{m,k}^t)^{-2} = \frac{\beta_0}{(x_0^t - x_{m,k}^t)^2 + (y_0^t - y_{m,k}^t)^2 + (H_0^t)^2},   (2)

where the channel's power gain at a reference distance of d = 1 m is denoted by \beta_0.

The achievable throughput from the kth user in the mth cluster to the UAV at time t, if the user satisfies the distance constraint, is defined as follows:

R_{m,k}^t = B \log_2\left(1 + \frac{p_{m,k}^t h_{m,k}^t}{\sum_{i \neq m}^{M} \sum_{j}^{K} p_{i,j}^t h_{i,j}^t + \sum_{u \neq k}^{K} p_{m,u}^t h_{m,u}^t + \alpha^2}\right), \forall m, k,   (3)

where B and \alpha^2 are the bandwidth and the noise power, respectively, and p_{m,k} is the transmit power of the kth user in the mth cluster. Then the total sum-rate over the T time steps from the kth user in cluster m to the UAV is given by:

R_{m,k} = \int_0^T R_{m,k}^t \, dt, \forall m, k.   (4)
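As an illustration of (1)-(3), the short Python sketch below evaluates the distance, the free-space channel gain and the resulting uplink rate of every user at one time step. The array shapes, helper names and numerical defaults are illustrative assumptions only: B = 1 MHz, \beta_0 = -50 dB and \alpha^2 = -110 dBm follow Table II, whereas the per-user transmit power of 5 W is merely an assumed placeholder, since the user transmit power is not specified there.

import numpy as np

def channel_gain(uav_pos, user_pos, beta_0=1e-5):
    # Free-space path-loss gain of (1)-(2).
    # uav_pos:  (3,) array [x0, y0, H0]; user_pos: (M, K, 2) array of user (x, y).
    dx = uav_pos[0] - user_pos[..., 0]
    dy = uav_pos[1] - user_pos[..., 1]
    d_sq = dx**2 + dy**2 + uav_pos[2]**2          # squared distance of (1)
    return beta_0 / d_sq                          # (2): h = beta_0 * d^(-2)

def throughput(uav_pos, user_pos, p, bandwidth=1e6, noise=1e-14):
    # Per-user uplink rate of (3) in bit/s, treating all other users as interference.
    h = channel_gain(uav_pos, user_pos)
    signal = p * h                                # received power of every user
    interference = signal.sum() - signal          # all other (i, j) terms of (3)
    sinr = signal / (interference + noise)
    return bandwidth * np.log2(1.0 + sinr)

# Toy usage: 3 clusters of 10 users scattered over a 1000 m x 1000 m area.
rng = np.random.default_rng(0)
users = rng.uniform(0, 1000, size=(3, 10, 2))
rates = throughput(np.array([0.0, 0.0, 200.0]), users, p=5.0)
print(rates.shape, rates.max())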
B. Game formulation

Both the current location and the action taken jointly influence the rewards obtained by the UAV; thus the trial-and-error based learning task of the UAV satisfies the Markov property. We formulate the associated Markov decision process (MDP) [51] as a 4-tuple <S, A, P_{ss'}, R>, where S is the state space of the UAV, A is the action space, R is the expected reward of the UAV and P_{ss'} is the probability of transition from state s to state s', where we have s' = s^{t+1} | s = s^t. Through learning, the UAV can find the optimal policy \pi^*: S \to A for maximising the reward R. In line with the definition of RL, the UAV does not have any prior knowledge about the environment. We transfer the real-life data collection application of UAV-assisted IoT networks into a digital form. Thus, the UAV only has local information, and the state is defined by the position of the UAV. We have also discretised the state and action space for learning. More particularly, we formulate the trajectory and data collection game of UAV-aided IoT networks as follows:

• Agent: The UAV acts as an agent interacting with the environment to find the peak of the reward.

• State space: We define the state space by the position of the UAV as

S = \{x, y, H\}.   (5)

At time step t, the state of the UAV is defined as s^t = (x^t, y^t, H^t).

• Action space: The UAV at state s^t can choose an action a^t of the action space by following the policy at time step t. By dividing the area into a grid, we can define the action space as follows:

A = \{left, right, forward, backward, upward, downward, hover\}.   (6)

The UAV moves in the environment and begins collecting information when the users are within the coverage of the UAV. When the UAV has collected sufficient information, R_{m,k} \geq r_{min}, from the kth user in the mth cluster, that user is marked as collected in this mission and may not be visited by the UAV again.

• Reward function: In joint trajectory and data collection optimisation, we design the reward function to depend on both the total sum-rate of the ground users associated with the UAV and the reward gleaned when the UAV completes one route, which is formulated as follows:

R = \frac{\beta}{MK}\left[\sum_{m}^{M}\sum_{k}^{K} P(m,k)\, R_{m,k}\right] + \zeta R_{plus},   (7)

where \beta and \zeta are positive variables that represent the trade-off between the network's sum-rate and the UAV's movement, which will be described in the sequel. Here, P(m,k) \in \{0,1\} indicates whether or not user k of cluster m is associated with the UAV, and R_{plus} is the reward acquired when the UAV completes a mission by reaching the final destination. On the other hand, the term \frac{\sum_{m}^{M}\sum_{k}^{K} P(m,k) R_{m,k}}{MK} defines the average throughput of all users.

• Probability: We define P_{s^t s^{t+1}}(a^t, \pi) as the probability of transition from state s^t to state s^{t+1} by taking the action a^t under the policy \pi.

At each time step t, the UAV chooses the action a^t based on its local information to obtain the reward r^t under the policy \pi. Then the UAV moves to the next state s^{t+1} by taking the action a^t and starts collecting information from the users if any available node in the network satisfies the distance constraint. Meanwhile, the users in the clusters also move to new positions following the random walk model with velocity v. Again, we use the DRL techniques to find the optimal policy \pi^* for the UAV to maximise the reward attained in (7). Following the policy \pi, the UAV forms a chain of actions (a^0, a^1, \ldots, a^t, \ldots, a^{final}) to reach the landing dock.

Our target is to maximise the reward expected by the UAV upon completing a single mission, during which the UAV flies from the initial position over the clusters and lands at the destination. Thus, we design the trajectory reward R_{plus} granted when the UAV reaches the destination in two different ways. Firstly, the binary reward function is defined as follows:

R_{plus} = \begin{cases} 1, & X_{final} \in X_{target} \\ 0, & \text{otherwise}, \end{cases}   (8)

where X_{final} and X_{target} are the final position of the UAV and the destination, respectively. However, the UAV has to move a long distance to reach the final destination. It may also become trapped in a zone and fail to complete the mission. These situations lead to increased energy consumption and reduced convergence. Thus, we consider the value of R_{plus}^t in a different form by calculating the horizontal distance between the UAV and the final destination at time step t, yielding:

R_{plus}^t = \begin{cases} 1, & X_{final} \in X_{target} \\ \exp(d_{targ})^{-1}, & \text{otherwise}, \end{cases}   (9)

where d_{targ} = \sqrt{(x_{target} - x_0^t)^2 + (y_{target} - y_0^t)^2} is the distance from the UAV to the landing dock.

When we design the reward function as in (9), the UAV is motivated to move ahead to reach the final destination. However, one of the disadvantages is that the UAV only moves forward. Thus, the UAV is unable to attain the best performance in terms of its total sum-rate in some environmental settings. We compare the performance of the two trajectory reward function definitions in Section VI to evaluate the pros and cons of each approach.

In our work, we optimise the 3D trajectory of the UAV and the data collection for the IoT network. In particular, we have designed the reward function as a trade-off game between the achievable sum-rate and the trajectory. Denoting the flying path of the UAV from the initial point to the final position by X = (X_0, X_1, \ldots, X_{final}), the agent needs to learn by interacting with the environment to find an optimal X. We have defined the trade-off values \beta and \zeta to make our approach more adaptive and flexible. By modifying the value of \beta/\zeta, the UAV adapts to several scenarios: a) fast deployment for emergency services, b) maximising the total sum-rate, and c) maximising the number of connections between the UAV and the users. Depending on the specific problem, we can adjust the value of the trade-off parameters \beta, \zeta to achieve the best performance.
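To make the game formulation concrete, the following Python sketch maps the discrete actions of (6) onto grid moves and evaluates the trajectory rewards (8)-(9) and the composite reward (7). The grid cell size, the radius used to approximate the destination zone and the default \beta/\zeta values are assumptions introduced purely for illustration, not part of the authors' implementation.

import math

# Action space of (6): each action moves the UAV by one grid cell (or hovers).
ACTIONS = {
    "left": (-1, 0, 0), "right": (1, 0, 0),
    "forward": (0, 1, 0), "backward": (0, -1, 0),
    "upward": (0, 0, 1), "downward": (0, 0, -1),
    "hover": (0, 0, 0),
}

def step_position(pos, action, cell=50.0):
    # Apply one action on the discretised grid (cell size in metres is assumed).
    dx, dy, dz = ACTIONS[action]
    return (pos[0] + dx * cell, pos[1] + dy * cell, pos[2] + dz * cell)

def r_plus_binary(pos, target, radius=50.0):
    # Binary trajectory reward of (8); the destination zone is approximated by a radius.
    return 1.0 if math.dist(pos[:2], target[:2]) <= radius else 0.0

def r_plus_exp(pos, target, radius=50.0):
    # Exponential trajectory reward of (9): 1 at the dock, exp(d_targ)^(-1) elsewhere.
    d_targ = math.dist(pos[:2], target[:2])
    return 1.0 if d_targ <= radius else math.exp(-d_targ)

def total_reward(rates, associated, r_plus, beta=2.0, zeta=1.0):
    # Composite reward of (7): (beta / MK) * sum of collected rates + zeta * R_plus.
    # rates: dict {(m, k): R_mk}; associated: dict {(m, k): 0 or 1}, i.e. P(m, k).
    MK = max(len(rates), 1)
    avg = sum(associated[u] * rates[u] for u in rates) / MK
    return beta * avg + zeta * r_plus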
Thus, the game formulation is defined as follows:

\max_{X} R = \frac{\beta}{MK}\left[\sum_{m}^{M}\sum_{k}^{K} P(m,k)\, R_{m,k}\right] + \zeta R_{plus},
s.t.  X_{final} = X_{target},
      d_{m,k} \le d_{cons},
      R_{m,k} \ge r_{min},
      P(m,k) \in \{0,1\},
      T \le T_{cons},
      \beta \ge 0, \ \zeta \ge 0,   (10)

where T and T_{cons} are the number of steps that the UAV takes in a single mission and the maximum number of UAV steps given its limited power, respectively. The term X_{final} = X_{target} denotes the completed flying route, when the final position of the UAV belongs to the destination zone. We have designed the reward function following this constraint with two functions: the binary reward function in (8) and the exponential reward function in (9). The terms d_{m,k} \le d_{cons}, R_{m,k} \ge r_{min} and P(m,k) \in \{0,1\} denote the communication constraints. Particularly, the distance constraint d_{m,k} \le d_{cons} indicates that the served (m,k)-user is within a satisfactory distance of the UAV. P(m,k) \in \{0,1\} indicates whether or not user k of cluster m is associated with the UAV. R_{m,k} \ge r_{min} denotes the minimum information collected during the flying path. All the constraints are integrated into the reward functions of the RL algorithm. The term T \le T_{cons} denotes the flying time constraint. Given the maximum flying time T_{cons}, the UAV needs to complete a route by reaching the destination zone before T_{cons}. If the UAV cannot complete a route before T_{cons}, then R_{plus} = 0, as defined in (8) and (9). We also have the trade-off values \beta \ge 0, \zeta \ge 0 in the reward function. These stringent constraints, such as the transmission distance, position and flight time, make the optimisation problem more challenging. Thus, we propose DRL techniques for the UAV in order to attain optimal performance.

III. PRELIMINARIES

In this section, we introduce the fundamental concept of Q-learning, where the so-called value function is defined by the reward of the UAV at state s^t as follows:

V(s,\pi) = E\left[\sum_{t}^{T} \gamma R^t(s^t,\pi) \,\middle|\, s_0 = s\right],   (11)

where E[\cdot] represents an average over a number of samples and 0 \le \gamma \le 1 denotes the discount factor. In a finite game, there is always an optimal policy \pi^* that satisfies the Bellman optimality equation [52]

V^*(s,\pi) = V(s,\pi^*) = \max_{a \in A} E\left[R^t(s^t,\pi^*) + \gamma \sum_{s' \in S} P_{ss'}(a,\pi^*) V(s',\pi^*)\right].   (12)

The action-value function is obtained when the agent at state s^t takes action a^t and receives the reward r^t under the agent policy \pi. The optimal Q-value can be formulated as:

Q^*(s,a,\pi) = E\left[R^t(s^t,\pi^*)\right] + \gamma \sum_{s' \in S} P_{ss'}(a,\pi^*) V(s',\pi^*).   (13)

The optimal policy \pi^* can be obtained from Q^*(s,a,\pi) as follows:

V^*(s,\pi) = \max_{a \in A} Q(s,a,\pi).   (14)

From (13) and (14), we have

Q^*(s,a,\pi) = E\left[R^t(s^t,\pi^*)\right] + \gamma \sum_{s' \in S} P_{ss'}(a,\pi^*) \max_{a' \in A} Q(s',a',\pi) = E\left[R^t(s^t,\pi^*) + \gamma \max_{a' \in A} Q(s',a',\pi)\right],   (15)

where the agent takes the action a' = a^{t+1} at state s^{t+1}. Through learning, the Q-value is updated based on the available information as follows:

Q(s,a,\pi) = Q(s,a,\pi) + \alpha\left[R^t(s^t,\pi^*) + \gamma \max_{a' \in A} Q(s',a',\pi) - Q(s,a,\pi)\right],   (16)

where \alpha denotes the update parameter of the Q-value function.

In RL algorithms, it is challenging to balance exploration and exploitation for appropriately selecting the action. The most common approach relies on the \epsilon-greedy policy for the action selection mechanism as follows:

a = \begin{cases} \arg\max_{a} Q(s,a,\pi), & \text{with probability } \epsilon \\ \text{random action}, & \text{with probability } 1-\epsilon. \end{cases}   (17)

Upon assuming that each episode lasts T steps, the action at time step t is a^t, selected by following the \epsilon-greedy policy of (17). The UAV at state s^t communicates with the user nodes on the ground if the distance constraint d_{m,k} \le d_{cons} is satisfied. Following the information transmission phase, the user nodes are marked as collected users and may not be revisited later during that mission. Then, after obtaining the immediate reward r(s^t, a^t), the agent at state s^t takes action a^t to move to state s^{t+1} as well as to update the Q-value function in (16). Each episode ends when the UAV reaches the final destination and the flight duration constraint is satisfied.
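A minimal tabular realisation of the update (16) and the \epsilon-greedy selection (17) is sketched below in Python. The learning-rate value and the hashable state representation are assumptions made only for illustration; the DQL algorithm of Section IV replaces the Q-table by a neural network.

import random
from collections import defaultdict

class QLearningAgent:
    # Tabular Q-learning with the update of (16) and the epsilon-greedy rule of (17).
    # States and actions are assumed hashable (e.g. grid coordinates and the action names of (6)).

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.9):
        self.q = defaultdict(float)          # Q(s, a), initialised to zero
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def select_action(self, s):
        # (17): exploit the greedy action with probability epsilon, otherwise explore.
        if random.random() < self.epsilon:
            return max(self.actions, key=lambda a: self.q[(s, a)])
        return random.choice(self.actions)

    def update(self, s, a, r, s_next):
        # (16): Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        td_error = r + self.gamma * best_next - self.q[(s, a)]
        self.q[(s, a)] += self.alpha * td_error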
IV. AN EFFECTIVE DEEP REINFORCEMENT LEARNING APPROACH FOR UAV-ASSISTED IOT NETWORKS

In this section, we conceive the DQL algorithm for trajectory and data collection optimisation in UAV-aided IoT networks. However, the Q-learning technique typically falters for large state and action spaces due to its excessive Q-table size. Thus, instead of applying the Q-table of Q-learning, we use deep neural networks to represent the relationship between the action and state space. Furthermore, we employ a pair of techniques for stabilising the neural network's performance in our DQL algorithm as follows:

• Experience replay buffer: Instead of using only the current experience, we use a so-called replay buffer B to store the transitions (s, a, r, s') for supporting the neural network in overcoming any potential instability. When the buffer B is filled with transitions, we randomly select a mini-batch of K samples for training the networks. The finite buffer size of B allows it to remain up-to-date, and the neural networks learn from the new samples.

• Target networks: If we used the same network to calculate both the state-action value Q and the target value, the network could shift dramatically during the training phase. Thus, we employ a target network Q' as the target value estimator. After a number of iterations, the parameters of the target network Q' are updated from the network Q.

The UAV starts from the initial position and interacts with the environment to find the proper action in each state. The agent chooses the action a^t following the current policy at state s^t. By executing the action a^t, the agent receives the response from the environment in the form of the reward r^t and the new state s^{t+1}. After each time step, the UAV has a new position and the environment has changed with the moving users. The obtained transitions are stored in a finite memory buffer and used for training the neural networks.

The neural network parameters are updated by minimising the loss function defined as follows:

L(\theta) = E_{s,a,r,s'}\left[\left(y^{DQL} - Q(s,a;\theta)\right)^2\right],   (20)

where \theta is a parameter of the network Q and we have

y^{DQL} = \begin{cases} r^t, & \text{if terminated at } s^{t+1} \\ r^t + \gamma \max_{a' \in A} Q'(s',a';\theta'), & \text{otherwise}. \end{cases}   (21)

The details of the DQL approach in our joint trajectory and data collection trade-off game designed for UAV-aided IoT networks are presented in Algorithm 1, where L denotes the number of episodes.

Algorithm 1 The deep Q-learning algorithm for trajectory and data collection optimisation in UAV-aided IoT networks.
1: Initialise the network Q and the target network Q' with the random parameters \theta and \theta', respectively
2: Initialise the replay memory pool B
3: for episode = 1, ..., L do
4:    Receive the initial observation state s^0
5:    while X_{final} \notin X_{target} or T \le T_{cons} do
6:       Obtain the action a^t of the UAV according to the \epsilon-greedy mechanism (17)
7:       Execute the action a^t and estimate the reward r^t according to (7)
8:       Observe the next state s^{t+1}
9:       Store the transition (s^t, a^t, r^t, s^{t+1}) in the replay buffer B
10:      Randomly select a mini-batch of K transitions (s_k, a_k, r_k, s_{k+1}) from B
11:      Update the network parameters using gradient descent to minimise the loss
         L(\theta) = E_{s,a,r,s'}\left[\left(y^{DQL} - Q(s,a;\theta)\right)^2\right],   (18)
         where the gradient update is
         \nabla_\theta L(\theta) = E_{s,a,r,s'}\left[\left(y^{DQL} - Q(s,a;\theta)\right)\nabla_\theta Q(s,a;\theta)\right],   (19)
12:      Update the state s^t = s^{t+1}
13:      Update the target network parameters after a number of iterations as \theta' = \theta
14:   end while
15: end for

Moreover, in this paper, we design the reward obtained in each step to assume one of two different forms and compare them in our simulation results. Firstly, we calculate the difference between the current and the previous reward of the UAV as follows:

r_1^t(s^t,a^t) = r^t(s^t,a^t) - r^{t-1}(s^{t-1},a^{t-1}).   (22)

Secondly, we design the total episode reward as the accumulation of all immediate rewards of each step within one episode as

r_2^t(s^t,a^t) = \sum_{i=0}^{t} r_1^i(s^i,a^i).   (23)
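The following framework-agnostic Python sketch illustrates the experience replay buffer, the bootstrapped target of (21) and the periodic target-network synchronisation used in Algorithm 1. The Q-network itself is represented only by a placeholder callable, and the capacity and batch-size defaults are assumptions rather than the authors' implementation (which relies on TensorFlow, cf. Section VI).

import random
from collections import deque

class ReplayBuffer:
    # Finite experience replay buffer B of Algorithm 1 (oldest samples are dropped).
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

def dql_targets(batch, target_q, gamma=0.9):
    # Targets of (21): r if the episode terminated at s', else r + gamma * max_a' Q'(s', a'; theta').
    return [r if done else r + gamma * max(target_q(s_next))
            for _, _, r, s_next, done in batch]

def sync_target(online_params, target_params):
    # Periodic hard update theta' <- theta (step 13 of Algorithm 1).
    for name, value in online_params.items():
        target_params[name] = value.copy()

# Toy usage with a dummy target network returning one Q-value per action of (6).
buffer = ReplayBuffer()
for t in range(200):
    buffer.store(s=t, a=t % 7, r=0.1, s_next=t + 1, done=(t == 199))
targets = dql_targets(buffer.sample(32), target_q=lambda s: [0.0] * 7)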
V. DEEP REINFORCEMENT LEARNING APPROACH FOR UAV-ASSISTED IOT NETWORKS: A DUELING DEEP Q-LEARNING APPROACH

According to Wang et al. [50], the standard Q-learning algorithm often falters due to the over-supervision of all the state-action pairs. On the other hand, it is unnecessary to estimate the value of each action choice in a particular state. For example, in our environment setting, the UAV has to consider moving either to the left or to the right when it hits the boundaries. Thus, we can improve the convergence speed by avoiding visiting all state-action pairs. Instead of using the Q-value function of the conventional DQL algorithm, the dueling neural network of [50] is introduced for improving the convergence rate and stability. The so-called advantage function A(s,a) = Q(s,a) - V(s), related both to the value function and to the Q-value function, describes the importance of each action relative to each state.

The idea of a dueling deep network is based on a combination of two streams, the value function and the advantage function, used for estimating the single output Q-function. One stream of a fully-connected layer estimates the value function V(s;\theta_V), while the other stream outputs a vector A(s,a;\theta_A), where \theta_A and \theta_V represent the parameters of the two networks. The Q-function can be obtained by combining the two streams' outputs as follows:

Q(s,a;\theta,\theta_A,\theta_V) = V(s;\theta_V) + A(s,a;\theta_A).   (27)

Equation (27) applies to all (s,a) instances; thus, we have to replicate the scalar V(s;\theta_V) |A| times to form a matrix.
However, Q(s,a;\theta,\theta_A,\theta_V) is only a parameterised estimator of the true Q-function; thus, we cannot uniquely recover the value function V and the advantage function A. Therefore, (27) results in poor practical performance when used directly. To address this problem, we can force the advantage function estimator to have zero advantage at the chosen action by combining the two streams as follows:

Q(s,a;\theta,\theta_A,\theta_V) = V(s;\theta_V) + \left(A(s,a;\theta_A) - \max_{a' \in A} A(s,a';\theta_A)\right).   (28)

Intuitively, for a^* = \arg\max_{a' \in A} Q(s,a';\theta,\theta_A,\theta_V) = \arg\max_{a' \in A} A(s,a';\theta_A), we have Q(s,a^*;\theta,\theta_A,\theta_V) = V(s;\theta_V). Hence, the stream V(s;\theta_V) estimates the value function, while the other stream is the advantage function estimator. We can transform (28) using an average formulation instead of the max operator as follows:

Q(s,a;\theta,\theta_A,\theta_V) = V(s;\theta_V) + \left(A(s,a;\theta_A) - \frac{1}{|A|}\sum_{a'} A(s,a';\theta_A)\right).   (29)

Now, we can solve the problem of identifiability by subtracting the mean as in (29). Based on (29), we propose a dueling DQL algorithm for our joint trajectory and data collection problem in UAV-assisted IoT networks, relying on Algorithm 2. Note that estimating V(s;\theta_V) and A(s,a;\theta_A) does not require any extra supervision; they are computed automatically.

Algorithm 2 The dueling deep Q-learning algorithm for trajectory and data collection optimisation in UAV-aided IoT networks.
1: Initialise the network Q and the target network Q' with the random parameters \theta and \theta', respectively
2: Initialise the replay memory pool B
3: for episode = 1, ..., L do
4:    Receive the initial observation state s^0
5:    while X_{final} \notin X_{target} or T \le T_{cons} do
6:       Obtain the action a^t of the UAV according to the \epsilon-greedy mechanism (17)
7:       Execute the action a^t and estimate the reward r^t according to (7)
8:       Observe the next state s^{t+1}
9:       Store the transition (s^t, a^t, r^t, s^{t+1}) in the replay buffer B
10:      Randomly select a mini-batch of K transitions (s_k, a_k, r_k, s_{k+1}) from B
11:      Estimate the Q-value function by combining the two streams as follows:
         Q(s,a;\theta,\theta_A,\theta_V) = V(s;\theta_V) + \left(A(s,a;\theta_A) - \frac{1}{|A|}\sum_{a'} A(s,a';\theta_A)\right),   (24)
12:      Update the network parameters using gradient descent to minimise the loss
         L(\theta) = E_{s,a,r,s'}\left[\left(y^{DuelingDQL} - Q(s,a;\theta,\theta_A,\theta_V)\right)^2\right],   (25)
13:      where
         y^{DuelingDQL} = r^t + \gamma \max_{a' \in A} Q'(s',a';\theta',\theta_A,\theta_V).   (26)
14:      Update the state s^t = s^{t+1}
15:      Update the target network parameters after a number of iterations as \theta' = \theta
16:   end while
17: end for
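For clarity, the aggregation step (29) of the dueling architecture is illustrated below for a single state. The value and advantage inputs would be produced by the two fully-connected streams of the network, which are omitted here; only the combination rule is shown.

import numpy as np

def dueling_q(value, advantages):
    # Aggregate the two dueling streams as in (29): Q = V + (A - mean(A)).
    # value: scalar V(s; theta_V); advantages: 1-D array A(s, a; theta_A), one entry per action.
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())

# Toy check: the greedy action is preserved while the identifiability offset
# between V and A is removed by the mean subtraction.
q = dueling_q(value=1.5, advantages=[0.2, -0.1, 0.4, 0.0, 0.0, -0.3, 0.1])
assert int(np.argmax(q)) == 2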
In Fig. 2, we present the trajectory obtained after training
using the DQL algorithm in the 5-cluster scenario. The green
circle and blue dots represent the clusters’ coverage and the
the true Q-function; thus, we cannot uniquely recover the
user nodes, respectively. The red line and black line in the
value function V and the advantage function A. Therefore,
figure represent the UAV’s trajectory based on our method in
(27) results in poor practical performances when used directly.
(8) and (9), respectively. The UAV starts at (0, 0), visits about
To address this problem, we can map the advantage function
40 users, and lands at the destination that is denoted by the
estimator to have no advantage at the chosen action by
black star. In a complex environment setting, it is challenging
combining the two streams as follows:
to expect the UAV to visit all users, while satisfying the flight-
Q(s, a; θ, θA , θV ) = V (s; θV ) duration and power level constraints.
 
+ A(s, a; θA ) − max
0
A(s, a0 ; θA ) . A. Expected reward
a ∈|A|
(28) We compare our proposed algorithm with opitimal perfor-
mance and the Q-learning algorithm. However, to achieve the
Intuitively, for a∗ = arg maxa0 ∈A Q(s, a0 ; θ, θA , θV ) = optimal results, we have defined some assumptions of knowing
arg maxa0 ∈A A(s, a0 ; θA ), we have the IoT’s position and unlimited power level of the UAV. For
Q(s, a∗ ; θ, θA , θV ) = V (s; θV ). Hence, the stream V (s; θV ) purposes of comparison, we run the algorithm five times in five
estimates the value function and the other streams is the different environmental settings and take the average to draw
Firstly, we compare the reward obtained following (7). Let us consider the 3-cluster scenario and \beta/\zeta = 2 : 1 in Fig. 3a, where the DQL and the dueling DQL algorithms using the exponential function (9) reach the best performance. When using the exponential trajectory design function (9), the performance converges faster than that of the DQL and of the dueling DQL methods using the binary trajectory function (8). The performance of the Q-learning algorithm is the worst. In addition, in Fig. 3b, we compare the performance of the DQL and dueling DQL techniques using different \beta/\zeta values. The average performance of the dueling DQL algorithm is better than that of the DQL algorithm. Furthermore, the results obtained using the exponential function (9) are better than those using the binary function (8). When \beta/\zeta \ge 1 : 2, the performance achieved by the DQL and dueling DQL algorithms is close to the optimal performance.

[Figure] Fig. 3. The performance when using the DQL and dueling DQL algorithms with 3 clusters while considering different \beta/\zeta values: (a) expected reward versus episodes; (b) expected reward versus \beta/\zeta.

Furthermore, we compare the rewards obtained by the DQL and dueling DQL algorithms in complex scenarios with 5 clusters and 50 user nodes in Fig. 4. The performance of using the episode reward (23) is better than that of using the immediate reward (22) in both trajectory designs relying on the DQL and dueling DQL algorithms. In Fig. 4a, we compare the performance in conjunction with the binary trajectory design, while in Fig. 4b the exponential trajectory design is considered. For \beta/\zeta = 1 : 1, the rewards obtained by the DQL and dueling DQL are similar and stable after about 400 episodes. When using the exponential function (9), the dueling DQL algorithm reaches the best performance and is close to the optimal performance. Moreover, the convergence of the dueling DQL technique is faster than that of the DQL algorithm. In both reward definitions, Q-learning with (22) shows the worst performance.

[Figure] Fig. 4. The expected reward when using the DQL and dueling DQL algorithms in the 5-cluster scenario: (a) with (8); (b) with (9).

In Fig. 5, we compare the performance of the DQL and of the dueling DQL algorithms while considering different \beta/\zeta parameter values. The dueling DQL algorithm shows better performance for all the \beta/\zeta pair values, exhibiting better rewards. Additionally, when using the exponential function (9), both proposed algorithms show better performance than the ones using the binary function (8) if \beta/\zeta \le 1 : 1, but this becomes less effective when \beta/\zeta is set higher. Again, we achieve a near-optimal solution while considering a complex environment with mobile users and without knowing the IoT nodes' positions. It is challenging to expect the UAV to visit all IoT nodes with limited flying power and duration.

[Figure] Fig. 5. The performance when using the DQL and dueling DQL algorithms with 5 clusters and different \beta/\zeta values.

We compare the performance of the DQL and of the dueling DQL algorithm using different reward function settings in Fig. 6 and in Fig. 7, respectively. The DQL algorithm reaches the best performance when using the episode reward (23) in Fig. 6a, while the fastest convergence speed is achieved by using the exponential function (9). When \beta/\zeta \ge 1 : 1, the DQL algorithm relying on the episode function (23) outperforms the ones using the immediate reward function (22) in Fig. 6b. The reward (7) using the exponential trajectory design (9) has a better performance than that using the binary trajectory design (8) for all the \beta/\zeta values.
Similar results are obtained when using the dueling DQL algorithm in Fig. 7. The immediate reward function (22) is less effective than the episode reward function (23).

[Figure] Fig. 6. The expected reward when using the DQL algorithm with 5 clusters and different reward function settings: (a) expected reward versus episodes; (b) expected reward versus \beta/\zeta.

[Figure] Fig. 7. The performance when using the dueling DQL algorithm with 5 clusters and different \beta/\zeta values.

B. Throughput comparison

In (7), we consider two elements: the trajectory cost and the average throughput. In order to quantify the communication efficiency, we compare the total throughput in different scenarios. In Fig. 8, the performances of the DQL algorithm associated with several \beta/\zeta values are considered while using the binary trajectory function (8), the episode reward (23) and 3 clusters. The throughput obtained for \beta/\zeta = 1 : 1 is higher than that of the others, and when \beta increases, the performance degrades. However, when comparing with Fig. 3b, we realise that in some scenarios the UAV became stuck and could not find the way to the destination, which leads to increased flight time and distance travelled. More details are shown in Fig. 8b, where we compare the expected throughput of both the DQL and dueling DQL algorithms. The best throughput is achieved when using the dueling DQL algorithm with \beta/\zeta = 1 : 1 in conjunction with (8), which is higher than the peak of the DQL method with \beta/\zeta = 1 : 2.

[Figure] Fig. 8. The network's sum-rate when using the DQL and dueling DQL algorithms with 3 clusters: (a) throughput versus episodes with (8); (b) expected throughput versus \beta/\zeta.

In Fig. 9, we compare the throughput of different techniques in the 5-cluster scenario. Let us now consider the binary trajectory design function (8) in Fig. 9a, where the DQL algorithm achieves the best performance using \beta/\zeta = 1 : 1 and \beta/\zeta = 2 : 1. There is only a slight difference between the DQL methods having different settings when using the exponential trajectory design function (9), as shown in Fig. 9b.

[Figure] Fig. 9. The obtained total throughput when using the DQL algorithm with 5 clusters: (a) with (8) and (23); (b) with (9) and (23).

In Fig. 10 and Fig. 11, we compare the throughput of different \beta/\zeta pairs. The DQL algorithm reaches the optimal throughput with the aid of trial-and-error learning, hence it is important to carefully design the reward function to avoid excessive offline training. As shown in Fig. 10, the DQL and dueling DQL algorithms exhibit reasonable stability for several \beta/\zeta \le 1 : 1 pairs as well as reward functions. While we can achieve a similar expected reward with different reward settings in Fig. 6, the throughput is degraded when \beta/\zeta increases. In contrast, with higher \beta values, the UAV can finish the mission faster. It is a trade-off game in which we can choose an appropriate \beta/\zeta value for our specific purposes. When we employ the DQL and the dueling DQL algorithms with the episode reward (23), the throughput attained is higher than that using the immediate reward (22) for different \beta/\zeta values.

[Figure] Fig. 10. The obtained throughput when using the DQL and dueling DQL algorithms in the 5-cluster scenario: (a) DQL; (b) dueling DQL.

Furthermore, we compare the expected throughput of the DQL and of the dueling DQL algorithm when using the exponential trajectory design (9) in Fig. 11a and the episode reward (23) in Fig. 11b. In Fig. 11a, the dueling DQL method outperforms the DQL algorithm for almost all \beta/\zeta values with both functions (22) and (23). When we use the episode reward (23), the obtained throughput is stable for different \beta/\zeta values. The throughput attained by using the exponential function (9) is lower than that using the binary trajectory design (8), and the throughput attained by using the episode reward (23) is higher than that of the immediate reward (22).
We can achieve the best performance when using the dueling DQL algorithm with (9) and (23). However, in some scenarios, we can achieve a better performance with a different algorithmic setting, as we can see in Fig. 8b and Fig. 10a. Thus, there is a trade-off governing the choice of the algorithm and function design.

[Figure] Fig. 11. The expected throughput when using the DQL and dueling DQL algorithms with 5 clusters: (a) with (9); (b) with (23).

C. Parametric Study

In Fig. 12, we compare the performance of our DQL technique using different discount factor \gamma and exploration factor \epsilon values in our \epsilon-greedy method. The DQL algorithm achieves the best performance with the discounting factor of \gamma = 0.9 and \epsilon = 0.9 in the 5-cluster scenario of Fig. 12. Balancing the exploration and exploitation as well as the action chosen is quite challenging in order to maintain a steady performance of the DQL algorithm. Based on the results of Fig. 12, we opted for \gamma = 0.9 and \epsilon = 0.9 for our algorithmic setting.

[Figure] Fig. 12. The performance when using the DQL algorithm with different discount factors, \gamma, and exploration factors, \epsilon.

Next, we compare the expected reward for different mini-batch sizes, K. In the 5-cluster scenario of Fig. 13, the DQL achieves the optimal performance with a batch size of K = 32. There is only a slight difference in terms of convergence speed, with the batch size K = 32 being the fastest. Overall, we set the mini-batch size to K = 32 for our DQL algorithm.

[Figure] Fig. 13. The performance when using the DQL algorithm in the 5-cluster scenario with different batch sizes, K.

Fig. 14 shows the performance of the DQL algorithm with different learning rates for updating the neural network parameters while considering the 5-cluster scenario. When the learning rate is as high as \alpha = 0.01, the pace of updating the network may result in fluctuating performance. Moreover, when \alpha = 0.0001 or \alpha = 0.00001, the convergence speed is slower and the algorithm may become stuck in a local optimum instead of reaching the global optimum. Thus, based on our experiments, we opted for the learning rate of \alpha = 0.001 for the algorithms.

[Figure] Fig. 14. The performance when using the DQL algorithm with different learning rates, lr.
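The hyperparameters selected through this parametric study can be summarised as follows; gathering them in a Python dictionary is merely an illustrative convention and not part of the original implementation.

# Values adopted in Section VI after the parametric study (Figs. 12-14 and Table II).
TRAINING_CONFIG = {
    "discount_factor_gamma": 0.9,      # Fig. 12
    "epsilon_greedy": 0.9,             # Fig. 12
    "mini_batch_size_K": 32,           # Fig. 13
    "learning_rate": 1e-3,             # Fig. 14, with the Adam optimiser [54]
    "max_beta_zeta_ratio": (4, 1),     # upper bound on beta/zeta chosen in Section VI
    "framework": "TensorFlow 1.13.1 [53]",
}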

VII. CONCLUSION

In this paper, a DRL technique has been proposed for jointly optimising the flight trajectory and the data collection performance of UAV-assisted IoT networks. The optimisation game has been formulated to balance the flight time and the total throughput while guaranteeing the quality-of-service constraints. Bearing in mind the limited UAV power level and the associated communication constraints, we proposed a DRL technique for maximising the throughput while the UAV has to move along the shortest path to reach the destination. Both the DQL and dueling DQL techniques, having a low computational complexity, have been conceived. Our simulation results showed the efficiency of our techniques in both simple and complex environmental settings.

REFERENCES

[1] "Drone trial to help Isle of Wight receive medical supplies faster during COVID19 pandemic." [Online]. Available: https://www.southampton.ac.uk/news/2020/04/drones-covid-iow.page
[2] "This Chilean community is using drones to deliver medicine to the elderly." [Online]. Available: https://www.weforum.org/agenda/2020/04/drone-chile-covid19/
[3] M. Gao, X. Xu, Y. Klinger, J. van der Woerd, and P. Tapponnier, "High-resolution mapping based on an unmanned aerial vehicle (UAV) to capture paleoseismic offsets along the Altyn-Tagh fault, China," Sci. Rep., vol. 7, no. 1, pp. 1–11, Aug. 2017.
[4] Q. Liu, J. Wu, P. Xia, S. Zhao, Y. Yang, W. Chen, and L. Hanzo, "Charging unplugged: Will distributed laser charging for mobile wireless power transfer work?" IEEE Vehicular Technology Magazine, vol. 11, no. 4, pp. 36–45, Dec. 2016.
[5] H. Claussen, "Distributed algorithms for robust self-deployment and load balancing in autonomous wireless access networks," in Proc. IEEE Int. Conf. on Commun. (ICC), vol. 4, Istanbul, Turkey, Jun. 2006, pp. 1927–1932.
[6] J. Gong, T.-H. Chang, C. Shen, and X. Chen, "Flight time minimization of UAV for data collection over wireless sensor networks," IEEE J. Select. Areas Commun., vol. 36, no. 9, pp. 1942–1954, Sept. 2018.
[7] C. Zhong, M. C. Gursoy, and S. Velipasalar, "Deep reinforcement learning-based edge caching in wireless networks," IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 1, pp. 48–61, Mar. 2020.
[8] H. Wu, Z. Wei, Y. Hou, N. Zhang, and X. Tao, "Cell-edge user offloading via flying UAV in non-uniform heterogeneous cellular networks," IEEE Trans. Wireless Commun., vol. 19, no. 4, pp. 2411–2426, Apr. 2020.
[9] H. Huang et al., "Deep reinforcement learning for UAV navigation through massive MIMO technique," IEEE Trans. Veh. Technol., vol. 69, no. 1, pp. 1117–1121, Jan. 2020.
[10] T. Q. Duong, L. D. Nguyen, H. D. Tuan, and L. Hanzo, "Learning-aided realtime performance optimisation of cognitive UAV-assisted disaster communication," in Proc. IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, Dec. 2019.
[11] T. Q. Duong, L. D. Nguyen, and L. K. Nguyen, "Practical optimisation of path planning and completion time of data collection for UAV-enabled disaster communications," in Proc. 15th Int. Wireless Commun. Mobile Computing Conf. (IWCMC), Tangier, Morocco, Jun. 2019, pp. 372–377.
[12] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, "Unmanned aerial vehicle with underlaid device-to-device communications: Performance and tradeoffs," IEEE Trans. Wireless Commun., vol. 15, no. 6, pp. 3949–3963, Jun. 2016.
[13] L. D. Nguyen, A. Kortun, and T. Q. Duong, "An introduction of real-time embedded optimisation programming for UAV systems under disaster communication," EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, vol. 5, no. 17, pp. 1–8, Dec. 2018.
[14] M.-N. Nguyen, L. D. Nguyen, T. Q. Duong, and H. D. Tuan, "Real-time optimal resource allocation for embedded UAV communication systems," IEEE Wireless Commun. Lett., vol. 8, no. 1, pp. 225–228, Feb. 2019.
[15] X. Li, H. Yao, J. Wang, X. Xu, C. Jiang, and L. Hanzo, "A near-optimal UAV-aided radio coverage strategy for dense urban areas," IEEE Trans. Veh. Technol., vol. 68, no. 9, pp. 9098–9109, Sept. 2019.
[16] H. Zhang and L. Hanzo, "Federated learning assisted multi-UAV networks," IEEE Trans. Veh. Technol., vol. 69, no. 11, pp. 14104–14109, Nov. 2020.
[17] X. Liu, Y. Liu, Y. Chen, and L. Hanzo, "Trajectory design and power control for multi-UAV assisted wireless networks: A machine learning approach," IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7957–7969, Aug. 2019.
[18] K. K. Nguyen, T. Q. Duong, N. A. Vien, N.-A. Le-Khac, and L. D. Nguyen, "Distributed deep deterministic policy gradient for power allocation control in D2D-based V2V communications," IEEE Access, vol. 7, pp. 164533–164543, Nov. 2019.
[19] K. K. Nguyen, T. Q. Duong, N. A. Vien, N.-A. Le-Khac, and N. M. Nguyen, "Non-cooperative energy efficient power allocation game in D2D communication: A multi-agent deep reinforcement learning approach," IEEE Access, vol. 7, pp. 100480–100490, Jul. 2019.
[20] K. K. Nguyen, N. A. Vien, L. D. Nguyen, M.-T. Le, L. Hanzo, and T. Q. Duong, "Real-time energy harvesting aided scheduling in UAV-assisted D2D networks relying on deep reinforcement learning," IEEE Access, vol. 9, pp. 3638–3648, Dec. 2021.
[21] K. Li, W. Ni, E. Tovar, and A. Jamalipour, "On-board deep Q-network for UAV-assisted online power transfer and data collection," IEEE Trans. Veh. Technol., vol. 68, no. 12, pp. 12215–12226, Dec. 2019.
[22] U. Challita, W. Saad, and C. Bettstetter, "Interference management for cellular-connected UAVs: A deep reinforcement learning approach," IEEE Trans. Wireless Commun., vol. 18, no. 4, pp. 2125–2140, Apr. 2019.
[23] X. Liu, Y. Liu, and Y. Chen, "Reinforcement learning in multiple-UAV networks: Deployment and movement design," IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8036–8049, Aug. 2019.
[24] C. Wang, J. Wang, Y. Shen, and X. Zhang, "Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 68, no. 3, pp. 2124–2136, Mar. 2019.
[25] S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in Proc. IEEE International Conf. Robot. Autom. (ICRA), May 2017, pp. 3389–3396.
[26] Q. Cai, A. Filos-Ratsikas, P. Tang, and Y. Zhang, "Reinforcement mechanism design for fraudulent behaviour in e-commerce," in Proc. Thirty-Second AAAI Conf. Artif. Intell., 2018.
[27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," 2013. [Online]. Available: arXiv preprint arXiv:1312.5602
[28] Y. Yu, T. Wang, and S. C. Liew, "Deep-reinforcement learning multiple access for heterogeneous wireless networks," IEEE J. Select. Areas Commun., vol. 37, no. 6, pp. 1277–1290, Jun. 2019.
[29] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, "Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks," IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5141–5152, Nov. 2019.
[30] S. Yin, S. Zhao, Y. Zhao, and F. R. Yu, "Intelligent trajectory design in UAV-aided communications with reinforcement learning," IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8227–8231, Aug. 2019.
[31] D. Yang, Q. Wu, Y. Zeng, and R. Zhang, "Energy tradeoff in ground-to-UAV communication via trajectory design," IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6721–6726, Jul. 2018.
[Fig. 12] The performance when using the DQL algorithm with different discount factors, γ, and exploration factors, ε. Axes: expected reward versus episode, with curves for (γ, ε) ∈ {(0.3, 0.9), (0.6, 0.9), (0.9, 0.9), (0.9, 0.6), (0.9, 0.3)}.

[Fig. 14] The performance when using the DQL algorithm with different learning rates, lr. Axes: expected reward versus episode, with curves for lr ∈ {0.01, 0.001, 0.0001, 0.00001}.

[Fig. 13] The performance when using the DQL algorithm in the 5-cluster scenario with different batch sizes, K. Axes: expected reward versus episode, with curves for K ∈ {32, 64, 128, 256}.

Khoi Khac Nguyen (Student Member, IEEE) was born in Bac Ninh, Vietnam. He received his B.S. degree in information and communication technology from the Hanoi University of Science and Technology (HUST), Vietnam, in 2018. He is working towards his Ph.D. degree with the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, U.K. His research interests include machine learning and deep reinforcement learning for real-time optimisation in wireless networks, reconfigurable intelligent surfaces, unmanned aerial vehicle (UAV) communication, and the massive Internet of Things (IoT).

Trung Q. Duong (Fellow, IEEE) is a Chair Professor of Telecommunications at Queen's University Belfast (UK), where he was a Lecturer (Assistant Professor) (2013-2017), a Reader (Associate Professor) (2018-2020), and has been a Full Professor since August 2020. He also holds a prestigious Research Chair of the Royal Academy of Engineering. His current research interests include wireless communications, machine learning, real-time optimisation, and data analytics.
Dr. Duong currently serves as an Editor for the IEEE Transactions on Wireless Communications, IEEE Transactions on Vehicular Technology, and IEEE Wireless Communications Letters, and as an Executive Editor for IEEE Communications Letters. He has served as an Editor/Guest Editor for IEEE Transactions on Communications, IEEE Wireless Communications, IEEE Communications Magazine, IEEE Communications Letters, and IEEE Journal on Selected Areas in Communications. He was awarded the Best Paper Award at the IEEE Vehicular Technology Conference (VTC-Spring) 2013, IEEE International Conference on Communications (ICC) 2014, IEEE Global Communications Conference (GLOBECOM) 2016 and 2019, IEEE Digital Signal Processing Conference (DSP) 2017, and International Wireless Communications & Mobile Computing Conference (IWCMC) 2019. He is the recipient of a prestigious Royal Academy of Engineering Research Fellowship (2015-2020) and won a prestigious Newton Prize in 2017. He is a Fellow of the IEEE (2022 Class).

Tan Do-Duy (Member, IEEE) received his B.S. degree from Ho Chi Minh City University of Technology (HCMUT), Vietnam, and his M.S. degree from Kumoh National Institute of Technology, Korea, in 2010 and 2013, respectively. He received his Ph.D. degree from the Autonomous University of Barcelona, Spain, in 2019. He is currently an Assistant Professor with the Department of Computer and Communication Engineering, Ho Chi Minh City University of Technology and Education (HCMUTE), Vietnam. His main research interests include wireless cooperative communications, real-time optimisation for resource allocation in wireless networks, and coding applications for wireless communications.

Holger Claussen (Fellow, IEEE) is Head of the Wireless Communications Laboratory at Tyndall National Institute, and Research Professor at University College Cork, where he is building up research teams in the areas of RF, Access, Protocols, AI, and Quantum Systems to invent the future of wireless communication networks. Previously he led the Wireless Communications Research Department of Nokia Bell Labs, located in Ireland and the US. In this role, he and his team innovated in all areas related to the future evolution, deployment, and operation of wireless networks to enable exponential growth in mobile data traffic and reliable low-latency communications. His research in this domain has been commercialised in Nokia's (formerly Alcatel-Lucent's) Small Cell product portfolio and continues to have significant impact. He received the 2014 World Technology Award in the individual category Communications Technologies for innovative work of "the greatest likely long-term significance". Prior to this, Holger directed research in the area of self-managing networks to enable the first large-scale femtocell deployments. Holger joined Bell Labs in 2004, where he began his research in the areas of network optimisation, cellular architectures, and improving the energy efficiency of networks. He received his Ph.D. degree in signal processing for digital communications from the University of Edinburgh, United Kingdom, in 2004. He is the author of the book "Small Cell Networks", more than 130 journal and conference publications, 78 granted patent families, and 46 filed patent applications pending. He is a Fellow of the IEEE, a Fellow of the World Technology Network, and a Member of the IET.

Lajos Hanzo (Fellow, IEEE) received his Master degree and Doctorate in 1976 and 1983, respectively, from the Technical University (TU) of Budapest. He was also awarded the Doctor of Sciences (DSc) degree by the University of Southampton (2004) and Honorary Doctorates by the TU of Budapest (2009) and by the University of Edinburgh (2015). He is a Foreign Member of the Hungarian Academy of Sciences and a former Editor-in-Chief of the IEEE Press. He has served several terms as Governor of both IEEE ComSoc and VTS. He has published 2000+ contributions at IEEE Xplore and 19 Wiley-IEEE Press books, and has helped the fast-track career of 123 PhD students. Over 40 of them are Professors at various stages of their careers in academia and many of them are leading scientists in the wireless industry. He is also a Fellow of the Royal Academy of Engineering (FREng), of the IET, and of EURASIP, and was awarded the Eric Sumner Field Award.
