Version 9

3D UAV Trajectory and Data Collection Optimisation via Deep Reinforcement Learning
Nguyen, K. K., Duong, T. Q., Do-Duy, T., Claussen, H., & Hanzo, L. (2022). 3D UAV Trajectory and Data
Collection Optimisation via Deep Reinforcement Learning. IEEE Transactions on Communications. Advance
online publication. https://doi.org/10.1109/TCOMM.2022.3148364
Published in:
IEEE Transactions on Communications
Document Version:
Peer reviewed version
Publisher rights
© 2022 IEEE.
This work is made available online in accordance with the publisher’s policies. Please refer to any applicable terms of use of the publisher.
Abstract—Unmanned aerial vehicles (UAVs) are now beginning to be deployed for enhancing the network performance and coverage in wireless communication. However, due to the limitation of their on-board power and flight time, it is challenging to obtain an optimal resource allocation scheme for the UAV-assisted Internet of Things (IoT). In this paper, we design a new UAV-assisted IoT system relying on the shortest flight path of the UAVs while maximising the amount of data collected from IoT devices. Then, a deep reinforcement learning-based technique is conceived for finding the optimal trajectory and throughput in a specific coverage area. After training, the UAV has the ability to autonomously collect all the data from user nodes at a significant total sum-rate improvement while minimising the associated resources used. Numerical results are provided to highlight how our techniques strike a balance between the throughput attained, the trajectory and the time spent. More explicitly, we characterise the attainable performance in terms of the UAV trajectory, the expected reward and the total sum-rate.

Keywords—UAV-assisted wireless network, trajectory, data collection, and deep reinforcement learning.

Khoi Khac Nguyen and Trung Q. Duong are with the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT7 1NN, U.K. (e-mail: {knguyen02,trung.q.duong}@qub.ac.uk). Tan Do-Duy is with Ho Chi Minh City University of Technology and Education, Vietnam (e-mail: tandd@hcmute.edu.vn). Holger Claussen is with Tyndall National Institute, Dublin, Ireland (e-mail: holger.claussen@tyndall.ie). Lajos Hanzo is with the School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, U.K. (e-mail: lh@ecs.soton.ac.uk).

The work of T. Q. Duong was supported by the U.K. Royal Academy of Engineering (RAEng) under the RAEng Research Chair and Senior Research Fellowship scheme Grant RCSRF2021\11\41. L. Hanzo would like to acknowledge the financial support of the Engineering and Physical Sciences Research Council projects EP/P034284/1 and EP/P003990/1 (COALESCE) as well as of the European Research Council's Advanced Fellow Grant QuantCom (Grant No. 789028).

I. INTRODUCTION

Given the agility of unmanned aerial vehicles (UAVs), they are capable of supporting compelling applications and are beginning to be deployed more broadly. Recently, the UK and Chile authorities proposed to deliver medical support and other essential supplies by UAVs to vulnerable people in response to Covid-19 [1], [2]. In [3], the authors used UAVs for image collection and high-resolution topography exploration. However, given the limitations of the on-board power level and of the ability to adapt to changes in the environment, UAVs may not be fully autonomous and can only operate for short flight durations, unless remote laser-charging is used [4]. Moreover, for challenging tasks such as topographic surveying, data collection or obstacle avoidance, the existing UAV technologies cannot operate in an optimal manner.

Wireless networks supported by UAVs constitute a promising technology for enhancing the network performance [5]. The applications of UAVs in wireless networks span diverse research fields, such as wireless sensor networks (WSNs) [6], caching [7], heterogeneous cellular networks [8], massive multiple-input multiple-output (MIMO) [9], disaster communications [10], [11] and device-to-device (D2D) communications [12]. For example, in [13], UAVs were deployed to provide network coverage for people in remote areas and disaster zones. UAVs were also used for collecting data in a WSN [6]. Nevertheless, the benefits of UAV-aided wireless communication are critically dependent on the limited on-board power level. Thus, the resource allocation of UAV-aided wireless networks plays a pivotal role in approaching the optimal performance. Yet, the existing contributions typically assume a static environment [10], [11], [14] and often ignore the stringent flight time constraints of real-life applications [6], [8], [15].

Machine learning has recently been proposed for the intelligent support of UAVs and other devices in the network [9], [16]–[24]. Reinforcement learning (RL) is capable of searching for an optimal policy by trial-and-error learning. However, it is challenging for model-free RL algorithms, such as Q-learning, to obtain an optimal strategy when considering a large state and action space. Fortunately, relying on the emerging neural networks, the sophisticated combination of RL and deep learning, namely deep reinforcement learning (DRL), is eminently suitable for solving high-dimensional problems. Hence, DRL algorithms have been widely applied in fields such as robotics [25], business management [26] and gaming [27]. Recently, DRL has also become popular for solving diverse problems in wireless networks thanks to its decision-making ability and flexible interaction with the environment [7], [9], [18]–[24], [28]–[30]. For example, DRL was used for solving problems in the areas of resource allocation [18], [19], [29], navigation [9], [31] and interference management [22].

A. Related Contributions

UAV-aided wireless networks have also been used for machine-to-machine communications [32] and D2D scenarios in 5G [14], [33], but the associated resource allocation problems remain challenging in real-life applications. Several techniques have been developed for solving resource allocation problems [18], [19], [31], [34]–[36]. In [34], the authors conceived multi-beam UAV communications and a cooperative interference cancellation scheme for maximising
the uplink sum-rate received from multiple UAVs by the base stations (BSs) on the ground. The UAVs were deployed as access points to serve several ground users in [35]. Then, the authors proposed successive convex programming for maximising the minimum uplink rate gleaned from all the ground users. In [31], the authors characterised the trade-off between the ground terminal transmission power and the specific UAV trajectory, both for a straight and for a circular trajectory.

The issues of data collection, energy minimisation and path planning have been considered in [23], [32], [37]–[45]. In [38], the authors minimised the energy consumption of the considered data collection task by jointly optimising the sensor nodes' wakeup schedule and the UAV trajectory. The authors of [39] proposed an efficient algorithm for joint trajectory and power allocation optimisation in UAV-assisted networks to maximise the sum-rate during a specific length of time. A pair of near-optimal approaches was proposed, namely trajectory optimisation for a given UAV power allocation and power allocation optimisation for a given trajectory. In [32], the authors introduced a communication framework for UAV-to-UAV communication under the constraints of the UAV's flight speed, location uncertainty and communication throughput. Then, a path planning algorithm was proposed for minimising the task completion time while balancing the performance versus computational complexity trade-off. However, these techniques mostly operate in offline modes and may impose excessive delay on the system. It is crucial to improve the decision-making time for meeting the stringent requirements of UAV-assisted wireless networks.

Again, machine learning has been recognised as a powerful tool for solving the highly dynamic trajectory and resource allocation problems of wireless networks. In [36], the authors proposed a model based on the classic k-means algorithm for grouping the users into clusters and assigned a dedicated UAV to serve each cluster. By relying on their decision-making ability, DRL algorithms have been used for lending each node some degree of autonomy [7], [18]–[21], [28], [29], [46]. In [28], an optimal DRL-based channel access strategy maximising the sum-rate and α-fairness was considered. In [18], [19], we deployed DRL techniques for enhancing the energy-efficiency of D2D communications. In [21], the authors characterised the DQL algorithm conceived for minimising the data packet loss of UAV-assisted power transfer and data collection systems. As a further advance, caching problems were considered in [7] to maximise the cache hit rate and to minimise the transmission delay. The authors designed both a centralised and a decentralised system model and used an actor-critic algorithm to find the optimal policy.

DRL algorithms have also been applied for path planning in UAV-assisted wireless communications [9], [22]–[24], [30], [47]. In [22], the authors proposed a DRL algorithm based on the echo state network of [48] for finding the flight path, transmission power and associated cell in UAV-powered wireless networks. The so-called deterministic policy gradient algorithm of [49] was invoked for UAV-assisted cellular networks in [30]. The UAV's trajectory was designed for maximising the uplink sum-rate attained without the knowledge of the user location and the transmit power. Moreover, in [9], the authors used the DQL algorithm for the UAV's navigation based on the received signal strengths estimated by a massive MIMO scheme. In [23], Q-learning was used for controlling the movement of multiple UAVs in a pair of scenarios, namely for static user locations and for dynamic user locations obeying a random walk model. However, the aforementioned contributions have not addressed the joint trajectory and data collection optimisation of UAV-assisted networks, which is a difficult research challenge. Furthermore, these existing works mostly neglected interference, 3D trajectories and dynamic environments.

B. Contributions and Organisation

A novel DRL-aided UAV-assisted system is conceived for finding the optimal UAV path maximising the joint reward function based on the shortest flight distance and the uplink transmission rate. We boldly and explicitly contrast our proposed solution to the state-of-the-art in Table I. Our main contributions are further summarised as follows:
• In our UAV-aided system, the maximum amount of data is collected from the users with the shortest distance travelled.
• Our UAV-aided system is specifically designed for tackling the stringent constraints owing to the position of the destination, the UAV's limited flight time and the communication link's realistic constraints. The UAV's objective is to find the optimal trajectory maximising the total network throughput, while minimising its distance travelled.
• Explicitly, these challenges are tackled by conceiving bespoke DRL techniques for solving the above problem. To elaborate, the area is divided into a grid to enable fast convergence. Following its training, the UAV has the autonomy to make a decision concerning its next action at each position in the area, hence eliminating the need for human navigation. This makes our UAV-aided system more reliable and practical, and reduces its resource requirements.
• A pair of scenarios relying either on three or on five clusters is considered for quantifying the efficiency of our novel DRL techniques in terms of the sum-rate, the trajectory and the associated time. A convincing 3D trajectory visualisation is also provided.
• Finally, but most importantly, it is demonstrated that our DRL techniques approach the performance of the optimal "genie-solution" associated with perfect knowledge of the environment.

Although the existing DRL algorithms have been well exploited in wireless networks, it is challenging to apply them to the current scenario due to the stringent constraints of the considered system, such as the UAV's flying time, the transmission distance and the mobile users. In our DQL and dueling DQL algorithms, we discretise the flying path into a grid and the UAV only needs to decide its action in a finite action space. With the finite state and action space, the neural networks can easily be trained and deployed in the online phase. With other
TABLE I
A COMPARISON WITH EXISTING LITERATURE
The table contrasts [37], [6], [21], [23], [40], [9], [41], [47], [42], [43] and our work in terms of: 3D trajectory, sum-rate maximisation, time minimisation, dynamic environment, unknown users, reinforcement learning, and deep neural networks.
the value of the trade-off parameters β, ζ to achieve the best performance. Thus, the game formulation is defined as follows:

    max R = (β / (MK)) Σ_{m=1}^{M} Σ_{k=1}^{K} P(m,k) R_{m,k} + ζ R_plus,
    s.t.  X_final = X_target,
          d_{m,k} ≤ d_cons,
          R_{m,k} ≥ r_min,                                                   (10)
          P(m,k) ∈ {0, 1},
          T ≤ T_cons,
          β ≥ 0,  ζ ≥ 0,

where T and T_cons are the number of steps that the UAV takes in a single mission and the maximum number of the UAV's steps given its limited power, respectively. The term X_final = X_target denotes the completed flying route, i.e. the final position of the UAV belongs to the destination zone. We have designed the reward function following this constraint with two functions: the binary reward function in (8) and the exponential reward function in (9). The terms d_{m,k} ≤ d_cons, R_{m,k} ≥ r_min and P(m,k) ∈ {0, 1} denote the communication constraints. Particularly, the distance constraint d_{m,k} ≤ d_cons indicates that the served (m,k)-user is within a satisfactory distance of the UAV. P(m,k) ∈ {0, 1} indicates whether or not user k of cluster m is associated with the UAV. R_{m,k} ≥ r_min denotes the minimum information collected during the flying path. All the constraints are integrated into the reward functions of the RL algorithm. The term T ≤ T_cons denotes the flying-time constraint. Given the maximum flying time T_cons, the UAV needs to complete a route by reaching the destination zone before T_cons. If the UAV cannot complete a route before T_cons, then R_plus = 0, as defined in (8) and (9). The trade-off weights of the reward function satisfy β ≥ 0, ζ ≥ 0. These stringent constraints, such as the transmission distance, the position and the flight time, make the optimisation problem more challenging. Thus, we propose DRL techniques for the UAV in order to attain optimal performance.
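As a concrete illustration of how the constraints of (10) can be folded into a single scalar reward, the sketch below evaluates the weighted objective for one completed mission. The helper names (rate_matrix, association, reached_target, num_steps) and the simple binary completion bonus standing in for the reward-shaping functions (8) and (9) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def composite_reward(rate_matrix, association, reached_target, num_steps,
                     beta=2.0, zeta=1.0, r_min=0.5, t_cons=100):
    """Evaluate the weighted objective of (10) for one finished mission.

    rate_matrix    : (M, K) array, R_{m,k} collected from user k of cluster m
    association    : (M, K) binary array, P(m,k) in {0, 1}
    reached_target : True if X_final lies in the destination zone
    num_steps      : number of steps T taken by the UAV
    beta, zeta     : trade-off weights (beta/zeta = 2:1 here, one of the
                     settings studied in the numerical results)
    """
    M, K = rate_matrix.shape
    # Only count users that also satisfy the minimum-rate constraint R_{m,k} >= r_min.
    served = association * (rate_matrix >= r_min)
    sum_rate_term = (beta / (M * K)) * np.sum(served * rate_matrix)

    # Binary completion bonus: R_plus = 0 unless the UAV reaches the
    # destination zone within T_cons steps (a stand-in for (8)/(9)).
    r_plus = 1.0 if (reached_target and num_steps <= t_cons) else 0.0
    return sum_rate_term + zeta * r_plus

# Toy usage with 3 clusters of 4 users each.
rng = np.random.default_rng(0)
rates = rng.uniform(0.0, 2.0, size=(3, 4))
assoc = (rates > 0.3).astype(float)
print(composite_reward(rates, assoc, reached_target=True, num_steps=80))
```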
III. PRELIMINARIES

In this section, we introduce the fundamental concept of Q-learning, where the so-called value function is defined by the reward of the UAV at state s_t as follows:

    V(s, π) = E[ Σ_{t=0}^{T} γ^t R^t(s^t, π) | s_0 = s ],                                          (11)

where E[·] represents the average over the samples and 0 ≤ γ ≤ 1 denotes the discount factor. In a finite game, there is always an optimal policy π* that satisfies the Bellman optimality equation [52]

    V*(s, π) = V(s, π*)
             = max_{a∈A} E[ R^t(s^t, π*) + γ Σ_{s'∈S} P_{ss'}(a, π*) V*(s', π*) ].                 (12)

The action-value function is obtained when the agent at state s_t takes action a_t and receives the reward r_t under the agent policy π. The optimal Q-value can be formulated as:

    Q*(s, a, π) = E[ R^t(s^t, π*) + γ Σ_{s'∈S} P_{ss'}(a, π*) V*(s', π*) ].                        (13)

The optimal policy π* can be obtained from Q*(s, a, π) as follows:

    V*(s, π) = max_{a∈A} Q(s, a, π).                                                               (14)

From (13) and (14), we have

    Q*(s, a, π) = E[ R^t(s^t, π*) + γ Σ_{s'∈S} P_{ss'}(a, π*) max_{a'∈A} Q(s', a', π) ]
                = E[ R^t(s^t, π*) + γ max_{a'∈A} Q(s', a', π) ],                                   (15)

where the agent takes the action a' = a_{t+1} at state s_{t+1}. Through learning, the Q-value is updated based on the available information as follows:

    Q(s, a, π) = Q(s, a, π) + α[ R^t(s^t, π*) + γ max_{a'∈A} Q(s', a', π) − Q(s, a, π) ],          (16)

where α denotes the update parameter (learning rate) of the Q-value function.

In RL algorithms, it is challenging to balance exploration and exploitation when selecting the action. The most common approach relies on the ε-greedy policy for the action selection mechanism as follows:

    a = { arg max_{a∈A} Q(s, a, π)   with probability ε,
        { a random action            with probability 1 − ε.                                       (17)

Upon assuming that each episode lasts T steps, the action at time step t, denoted by a_t, is selected by following the ε-greedy policy of (17). The UAV at state s_t communicates with the user nodes on the ground if the distance constraint d_{m,k} ≤ d_cons is satisfied. Following the information transmission phase, the user nodes are marked as collected users and may not be revisited later during that mission. Then, after obtaining the immediate reward r(s_t, a_t), the agent at state s_t takes action a_t to move to state s_{t+1} and updates the Q-value function in (16). Each episode ends when the UAV reaches the final destination and the flight duration constraint is satisfied.
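For readers who prefer code to notation, the minimal tabular sketch below mirrors the update (16) and the ε-greedy rule (17). The environment interface (env.reset(), env.step()) returning integer grid-cell states is an assumption made purely for illustration; it is not the paper's simulator.

```python
import numpy as np

def q_learning(env, num_states, num_actions,
               episodes=500, alpha=0.001, gamma=0.9, eps=0.9):
    """Tabular Q-learning with the epsilon-greedy rule of (17):
    exploit with probability eps, explore randomly with probability 1 - eps."""
    q_table = np.zeros((num_states, num_actions))
    rng = np.random.default_rng(0)

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if rng.random() < eps:                       # exploit
                action = int(np.argmax(q_table[state]))
            else:                                        # explore
                action = int(rng.integers(num_actions))
            next_state, reward, done = env.step(action)
            # One-step update following (16).
            td_target = reward + gamma * np.max(q_table[next_state])
            q_table[state, action] += alpha * (td_target - q_table[state, action])
            state = next_state
    return q_table
```

The defaults γ = 0.9, ε = 0.9 and α = 0.001 correspond to the parameter values examined in the numerical results.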
IV. AN EFFECTIVE DEEP REINFORCEMENT LEARNING APPROACH FOR UAV-ASSISTED IOT NETWORKS

In this section, we conceive the DQL algorithm for trajectory and data collection optimisation in UAV-aided IoT networks. However, the Q-learning technique typically falters for large state and action spaces due to its excessive Q-table size. Thus, instead of applying the Q-table of Q-learning, we use deep neural networks to represent the relationship between the action and state space. Furthermore, we employ a pair of
techniques for stabilising the neural network's performance in our DQL algorithm as follows:
• Experience replay buffer: Instead of using only the current experience, we use a so-called replay buffer B to store the transitions (s, a, r, s') for supporting the neural network in overcoming any potential instability. When the buffer B is filled with transitions, we randomly select a mini-batch of K samples for training the networks. The finite buffer size of B allows it to be always up-to-date, so that the neural networks learn from the new samples.
• Target networks: If we used the same network to calculate both the state-action value Q and the target value, the network could shift dramatically during the training phase. Thus, we employ a target network Q' as the target value estimator. After a number of iterations, the parameters of the target network Q' are updated by copying those of the network Q.
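A minimal sketch of these two stabilisers is given below; the buffer capacity and the representation of the network parameters as plain arrays are illustrative assumptions rather than details taken from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite FIFO buffer B storing transitions (s, a, r, s', done)."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old samples are dropped automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Randomly draw a mini-batch of K = batch_size transitions."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def sync_target_network(q_params, target_params):
    """Copy the parameters theta of the network Q into the target network Q'.
    Both networks are represented here as dicts of numpy-style arrays."""
    for name, value in q_params.items():
        target_params[name] = value.copy()
```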
The UAVs start from the initial position and interact with the environment to find the proper action in each state. The agent chooses the action a_t following the current policy at state s_t. By executing the action a_t, the agent receives the response of the environment in the form of the reward r_t and the new state s_{t+1}. After each time step, the UAVs have new positions and the environment changes with the moving users. The obtained transitions are stored into a finite memory buffer and used for training the neural networks.
The neural network parameters are updated by minimising the loss function defined as follows:

    L(θ) = E_{s,a,r,s'}[ (y^DQL − Q(s, a; θ))² ],                                                  (20)

where θ denotes the parameters of the network Q and we have

    y^DQL = { r^t                                      if terminated at s_{t+1},
            { r^t + γ max_{a'∈A} Q'(s', a'; θ')         otherwise.                                 (21)

The details of the DQL approach in our joint trajectory and data collection trade-off game designed for UAV-aided IoT networks are presented in Algorithm 1, where L denotes the number of episodes.

Algorithm 1 The deep Q-learning algorithm for trajectory and data collection optimisation in UAV-aided IoT networks.
 1: Initialise the network Q and the target network Q' with the random parameters θ and θ', respectively
 2: Initialise the replay memory pool B
 3: for episode = 1, . . . , L do
 4:   Receive the initial observation state s_0
 5:   while X_final ∉ X_target or T ≤ T_cons do
 6:     Obtain the action a_t of the UAV according to the ε-greedy mechanism (17)
 7:     Execute the action a_t and estimate the reward r_t according to (7)
 8:     Observe the next state s_{t+1}
 9:     Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer B
10:     Randomly select a mini-batch of K transitions (s_k, a_k, r_k, s_{k+1}) from B
11:     Update the network parameters using gradient descent to minimise the loss
            L(θ) = E_{s,a,r,s'}[ (y^DQL − Q(s, a; θ))² ],                                          (18)
        where the gradient update is
            ∇_θ L(θ) = E_{s,a,r,s'}[ (y^DQL − Q(s, a; θ)) ∇_θ Q(s, a; θ) ],                        (19)
12:     Update the state s_t = s_{t+1}
13:     Update the target network parameters after a number of iterations as θ' = θ
14:   end while
15: end for
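To make the inner update of Algorithm 1 (steps 10–11) concrete, the sketch below computes the targets of (21) and minimises the loss of (18)/(20) on one sampled mini-batch using TensorFlow, which the paper cites [53], together with the Adam optimiser [54]. The layer sizes and the exact data handling are illustrative assumptions rather than the authors' architecture.

```python
import tensorflow as tf

def build_q_network(state_dim, num_actions):
    # Small fully-connected Q-network; the sizes are illustrative.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_actions),
    ])

def dql_update(q_net, target_net, optimizer, batch, gamma=0.9):
    """One gradient step of Algorithm 1 (steps 10-11) on a sampled mini-batch."""
    states, actions, rewards, next_states, dones = batch
    states = tf.convert_to_tensor(states, tf.float32)
    next_states = tf.convert_to_tensor(next_states, tf.float32)
    rewards = tf.convert_to_tensor(rewards, tf.float32)
    dones = tf.convert_to_tensor(dones, tf.float32)
    actions = tf.convert_to_tensor(actions, tf.int32)

    # Target of (21): r if the episode terminated, else r + gamma * max_a' Q'(s', a'; theta').
    next_q = target_net(next_states)
    targets = rewards + gamma * (1.0 - dones) * tf.reduce_max(next_q, axis=1)

    with tf.GradientTape() as tape:
        q_values = q_net(states)
        # Pick Q(s, a; theta) for the actions actually taken.
        action_mask = tf.one_hot(actions, q_values.shape[-1])
        chosen_q = tf.reduce_sum(q_values * action_mask, axis=1)
        loss = tf.reduce_mean(tf.square(targets - chosen_q))   # loss of (18)/(20)

    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)

# Example optimiser; the learning rate of 0.001 matches the value adopted
# in the numerical results.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
```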
Moreover, in this paper, we design the reward obtained in each step to assume one of two different forms and compare them in our simulation results. Firstly, we calculate the difference between the current and the previous reward of the UAV as follows:

    r_1^t(s_t, a_t) = r^t(s_t, a_t) − r^{t−1}(s_{t−1}, a_{t−1}).                                   (22)

Secondly, we design the total episode reward as the accumulation of all immediate rewards of the steps within one episode as

    r_2^t(s_t, a_t) = Σ_{i=0}^{t} r_1^i(s_i, a_i).                                                 (23)
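Expressed in code, the two reward shapes of (22) and (23) read as follows; the list of per-step rewards is assumed to come from the environment interaction described above.

```python
def immediate_reward(step_rewards, t):
    """r_1^t of (22): difference between the current and the previous step reward."""
    previous = step_rewards[t - 1] if t > 0 else 0.0
    return step_rewards[t] - previous

def episode_reward(step_rewards, t):
    """r_2^t of (23): accumulation of the immediate rewards up to step t."""
    return sum(immediate_reward(step_rewards, i) for i in range(t + 1))
```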
V. DEEP REINFORCEMENT LEARNING APPROACH FOR UAV-ASSISTED IOT NETWORKS: A DUELING DEEP Q-LEARNING APPROACH

According to Wang et al. [50], the standard Q-learning algorithm often falters due to the over-supervision of all the state-action pairs. On the other hand, it is unnecessary to estimate the value of each action choice in a particular state. For example, in our environment setting, the UAV has to consider moving either to the left or to the right when it hits the boundaries. Thus, we can improve the convergence speed by avoiding visiting all state-action pairs. Instead of using the Q-value function of the conventional DQL algorithm, the dueling neural network of [50] is introduced for improving the convergence rate and stability. The so-called advantage function A(s, a) = Q(s, a) − V(s), related both to the value function and to the Q-value function, describes the importance of each action in each state.

The idea of a dueling deep network is based on a combination of two streams, namely the value function and the advantage function, used for estimating the single output Q-function. One stream of a fully-connected layer estimates the value function V(s; θ_V), while the other stream outputs a vector A(s, a; θ_A), where θ_A and θ_V represent the parameters of the two streams. The Q-function can be obtained by combining the two streams' outputs as follows:

    Q(s, a; θ, θ_A, θ_V) = V(s; θ_V) + A(s, a; θ_A).                                               (27)

Equation (27) applies to all (s, a) instances; thus, we have to replicate the scalar V(s; θ_V) |A| times to form a matrix. However, Q(s, a; θ, θ_A, θ_V) is a parameterised estimator of
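A sketch of this two-stream architecture is shown below. The aggregation follows (27) directly, while the commonly used mean-subtracted aggregation proposed by Wang et al. [50] is indicated in a comment as an alternative; the layer sizes are illustrative assumptions.

```python
import tensorflow as tf

class DuelingQNetwork(tf.keras.Model):
    """Two-stream network combining V(s; theta_V) and A(s, a; theta_A) as in (27)."""
    def __init__(self, num_actions, hidden=64):
        super().__init__()
        self.shared = tf.keras.layers.Dense(hidden, activation="relu")
        self.value_stream = tf.keras.layers.Dense(1)                 # scalar V(s; theta_V)
        self.advantage_stream = tf.keras.layers.Dense(num_actions)   # vector A(s, a; theta_A)

    def call(self, states):
        features = self.shared(states)
        value = self.value_stream(features)          # shape (batch, 1)
        advantage = self.advantage_stream(features)  # shape (batch, |A|)
        # (27): broadcast the scalar V over the |A| actions and add the advantage.
        q_values = value + advantage
        # Mean-subtracted aggregation of [50], often preferred for identifiability:
        # q_values = value + advantage - tf.reduce_mean(advantage, axis=1, keepdims=True)
        return q_values
```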
Fig. 2. Trajectory obtained by using our dueling DQL algorithm (3D plot showing the proposed path with (8), the proposed path with (9), the initial location, the destination zone and the IoT devices).

Fig. 3. The performance when using the DQL and dueling DQL algorithms with 3 clusters while considering different β/ζ values ((a) expected reward versus episodes; (b) expected reward versus β/ζ; curves: DQL with (8)/(9), dueling DQL with (8)/(9), Q-learning, Optimal).

the figures. Firstly, we compare the reward obtained following (7). Let us consider the 3-cluster scenario and β/ζ = 2 : 1 in Fig. 3a, where the DQL and the dueling DQL algorithms using the exponential function (9) reach the best performance. When using the exponential trajectory design function (9), the performance converges faster than that of the DQL and of the dueling DQL methods using the binary trajectory function (8). The performance of the Q-learning algorithm is the worst. In addition, in Fig. 3b, we compare the performance of the DQL and dueling DQL techniques using different β/ζ values. The average performance of the dueling DQL algorithm is better than that of the DQL algorithm. Furthermore, the results obtained using the exponential function (9) are better than those using the binary function (8). When β/ζ ≥ 1 : 2, the performance achieved by the DQL and dueling DQL algorithms is close to the optimal performance.

Furthermore, we compare the rewards obtained by the DQL and dueling DQL algorithms in complex scenarios with 5 clusters and 50 user nodes in Fig. 4. The performance when using the episode reward (23) is better than that when using the immediate reward (22) in both trajectory designs relying on the DQL and dueling DQL algorithms. In Fig. 4a, we compare the performance in conjunction with the binary trajectory design, while in Fig. 4b the exponential trajectory design is considered. For β/ζ = 1 : 1, the rewards obtained by the DQL and dueling DQL are similar and stable after about 400 episodes. When using the exponential function (9), the dueling DQL algorithm reaches the best performance and is close to the optimal performance. Moreover, the convergence of the dueling DQL technique is faster than that of the DQL algorithm. For both reward definitions, Q-learning with (22) shows the worst performance.

In Fig. 5, we compare the performance of the DQL and of the dueling DQL algorithms while considering different β/ζ parameter values. The dueling DQL algorithm shows better performance for all the β/ζ pair values, exhibiting better rewards. Additionally, when using the exponential function (9), both proposed algorithms show better performance than the ones using the binary function (8) if β/ζ ≤ 1 : 1, but this becomes less effective when β/ζ is set higher. Again, we achieve a near-optimal solution while considering a complex environment without knowing the positions of the IoT nodes and the mobile users. It is challenging to expect the UAV to visit all the IoT nodes with limited flying power and duration.

We compare the performance of the DQL and of the dueling DQL algorithms using different reward function settings in Fig. 6 and in Fig. 7, respectively. The DQL algorithm reaches the best performance when using the episode reward (23) in Fig. 6a, while the fastest convergence speed is achieved by using the exponential function (9). When β/ζ ≥ 1 : 1, the DQL algorithm relying on the episode function (23) outperforms the ones using the immediate reward function (22) in Fig. 6b. The reward (7) using the exponential trajectory design (9) has a better performance than that using the binary
[Figure panels: expected reward versus episode and versus β/ζ for the DQL and dueling DQL algorithms using the reward functions (22), (23) and the trajectory designs (8), (9), compared against Q-learning and the optimal solution.]

Fig. 7. The performance when using the dueling DQL with 5 clusters, and different β/ζ values.

achieves the optimal performance with a batch size of K = 32. There is a slight difference in terms of convergence speed, with the batch size of K = 32 being the fastest. Overall, we set the mini-batch size to K = 32 for our DQL algorithm.

Fig. 14 shows the performance of the DQL algorithm with different learning rates used for updating the neural network parameters, while considering the scenario of 5 clusters. When the learning rate is as high as α = 0.01, the pace of updating the network may result in fluctuating performance. Moreover, when α = 0.0001 or α = 0.00001, the convergence speed is slower and the algorithm may become stuck in a local optimum instead of reaching the global optimum. Thus, based on our experiments, we opted for the learning rate of α = 0.001 for the algorithms.

Fig. 8. The network's sum-rate when using the DQL and dueling DQL algorithms with 3 clusters (throughput in bits/s/Hz versus episode for β/ζ = 1:1, 2:1, 3:1 and 4:1).

Fig. 9. The obtained total throughput when using the DQL algorithm with 5 clusters (throughput in bits/s/Hz versus episode for β/ζ = 1:1, 2:1, 3:1 and 4:1).
Fig. 10. The obtained throughput when using the DQL and dueling DQL algorithms in the 5-cluster scenario (expected throughput in bits/s/Hz versus β/ζ for the combinations of (8)/(9) with (22)/(23)).

Fig. 11. The expected throughput when using the DQL and dueling DQL algorithms with 5 clusters (expected throughput in bits/s/Hz versus β/ζ for the reward functions (22) and (23)).

Fig. 12. The performance when using the DQL algorithm with different discount factors, γ, and exploration factors, ε (expected reward versus episode for γ = 0.3, 0.6, 0.9 with ε = 0.9, and γ = 0.9 with ε = 0.6).

Fig. 14. The performance when using the DQL algorithm with different learning rates, lr (expected reward versus episode for lr = 0.01, 0.001 and 0.0001).

REFERENCES

[4] Q. Liu, J. Wu, P. Xia, S. Zhao, Y. Yang, W. Chen, and L. Hanzo, “Charging unplugged: Will distributed laser charging for mobile wireless power transfer work?” IEEE Vehicular Technology Magazine, vol. 11, no. 4, pp. 36–45, Dec. 2016.
[5] H. Claussen, “Distributed algorithms for robust self-deployment and load balancing in autonomous wireless access networks,” in Proc. IEEE Int. Conf. on Commun. (ICC), vol. 4, Istanbul, Turkey, Jun. 2006, pp. 1927–1932.
[6] J. Gong, T.-H. Chang, C. Shen, and X. Chen, “Flight time minimization of UAV for data collection over wireless sensor networks,” IEEE J. Select. Areas Commun., vol. 36, no. 9, pp. 1942–1954, Sept. 2018.
[7] C. Zhong, M. C. Gursoy, and S. Velipasalar, “Deep reinforcement learning-based edge caching in wireless networks,” IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 1, pp. 48–61, Mar. 2020.
[8] H. Wu, Z. Wei, Y. Hou, N. Zhang, and X. Tao, “Cell-edge user offloading via flying UAV in non-uniform heterogeneous cellular networks,” IEEE Trans. Wireless Commun., vol. 19, no. 4, pp. 2411–2426, Apr. 2020.
[9] H. Huang et al., “Deep reinforcement learning for UAV navigation through massive MIMO technique,” IEEE Trans. Veh. Technol., vol. 69, no. 1, pp. 1117–1121, Jan. 2020.
[10] T. Q. Duong, L. D. Nguyen, H. D. Tuan, and L. Hanzo, “Learning-aided realtime performance optimisation of cognitive UAV-assisted disaster communication,” in Proc. IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, Dec. 2019.
[11] T. Q. Duong, L. D. Nguyen, and L. K. Nguyen, “Practical optimisation of path planning and completion time of data collection for UAV-enabled disaster communications,” in Proc. 15th Int. Wireless Commun. Mobile Computing Conf. (IWCMC), Tangier, Morocco, Jun. 2019, pp. 372–377.
[12] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Unmanned aerial vehicle with underlaid device-to-device communications: Performance and tradeoffs,” IEEE Trans. Wireless Commun., vol. 15, no. 6, pp. 3949–3963, Jun. 2016.
[13] L. D. Nguyen, A. Kortun, and T. Q. Duong, “An introduction of real-time embedded optimisation programming for UAV systems under disaster communication,” EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, vol. 5, no. 17, pp. 1–8, Dec. 2018.
[14] M.-N. Nguyen, L. D. Nguyen, T. Q. Duong, and H. D. Tuan, “Real-time optimal resource allocation for embedded UAV communication systems,” IEEE Wireless Commun. Lett., vol. 8, no. 1, pp. 225–228, Feb. 2019.
[15] X. Li, H. Yao, J. Wang, X. Xu, C. Jiang, and L. Hanzo, “A near-optimal UAV-aided radio coverage strategy for dense urban areas,” IEEE Trans. Veh. Technol., vol. 68, no. 9, pp. 9098–9109, Sept. 2019.
[16] H. Zhang and L. Hanzo, “Federated learning assisted multi-UAV networks,” IEEE Trans. Veh. Technol., vol. 69, no. 11, pp. 14104–14109, Nov. 2020.
[17] X. Liu, Y. Liu, Y. Chen, and L. Hanzo, “Trajectory design and power control for multi-UAV assisted wireless networks: A machine learning approach,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7957–7969, Aug. 2019.
[18] K. K. Nguyen, T. Q. Duong, N. A. Vien, N.-A. Le-Khac, and L. D. Nguyen, “Distributed deep deterministic policy gradient for power allocation control in D2D-based V2V communications,” IEEE Access, vol. 7, pp. 164533–164543, Nov. 2019.
[19] K. K. Nguyen, T. Q. Duong, N. A. Vien, N.-A. Le-Khac, and N. M. Nguyen, “Non-cooperative energy efficient power allocation game in D2D communication: A multi-agent deep reinforcement learning approach,” IEEE Access, vol. 7, pp. 100480–100490, Jul. 2019.
[20] K. K. Nguyen, N. A. Vien, L. D. Nguyen, M.-T. Le, L. Hanzo, and T. Q. Duong, “Real-time energy harvesting aided scheduling in UAV-assisted D2D networks relying on deep reinforcement learning,” IEEE Access, vol. 9, pp. 3638–3648, Dec. 2021.
[21] K. Li, W. Ni, E. Tovar, and A. Jamalipour, “On-board deep Q-network for UAV-assisted online power transfer and data collection,” IEEE Trans. Veh. Technol., vol. 68, no. 12, pp. 12215–12226, Dec. 2019.
[22] U. Challita, W. Saad, and C. Bettstetter, “Interference management for cellular-connected UAVs: A deep reinforcement learning approach,” IEEE Trans. Wireless Commun., vol. 18, no. 4, pp. 2125–2140, Apr. 2019.
[23] X. Liu, Y. Liu, and Y. Chen, “Reinforcement learning in multiple-UAV networks: Deployment and movement design,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8036–8049, Aug. 2019.
[24] C. Wang, J. Wang, Y. Shen, and X. Zhang, “Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach,” IEEE Trans. Veh. Technol., vol. 68, no. 3, pp. 2124–2136, Mar. 2019.
[25] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in Proc. IEEE International Conf. Robot. Autom. (ICRA), May 2017, pp. 3389–3396.
[26] Q. Cai, A. Filos-Ratsikas, P. Tang, and Y. Zhang, “Reinforcement mechanism design for fraudulent behaviour in e-commerce,” in Proc. Thirty-Second AAAI Conf. Artif. Intell., 2018.
[27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” 2013. [Online]. Available: arXiv preprint arXiv:1312.5602
[28] Y. Yu, T. Wang, and S. C. Liew, “Deep-reinforcement learning multiple access for heterogeneous wireless networks,” IEEE J. Select. Areas Commun., vol. 37, no. 6, pp. 1277–1290, Jun. 2019.
[29] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, “Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks,” IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5141–5152, Nov. 2019.
[30] S. Yin, S. Zhao, Y. Zhao, and F. R. Yu, “Intelligent trajectory design in UAV-aided communications with reinforcement learning,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8227–8231, Aug. 2019.
[31] D. Yang, Q. Wu, Y. Zeng, and R. Zhang, “Energy tradeoff in ground-to-UAV communication via trajectory design,” IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6721–6726, Jul. 2018.
[53] M. Abadi et al., “Tensorflow: A system for large-scale machine learning,” in Proc. 12th USENIX Sym. Opr. Syst. Design and Imp. (OSDI 16), Nov. 2016, pp. 265–283.
[54] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” 2014. [Online]. Available: arXiv preprint arXiv:1412.6980

Khoi Khac Nguyen (Student Member, IEEE) was born in Bac Ninh, Vietnam. He received his B.S. degree in information and communication technology from the Hanoi University of Science and Technology (HUST), Vietnam in 2018. He is working towards his Ph.D. degree with the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, U.K. His research interests include machine learning and deep reinforcement learning for real-time optimisation in wireless networks, reconfigurable intelligent surfaces, unmanned aerial vehicle (UAV) communication and the massive Internet of Things (IoT).

Holger Claussen (Fellow, IEEE) is Head of the Wireless Communications Laboratory at Tyndall National Institute, and Research Professor at University College Cork, where he is building up research teams in the areas of RF, Access, Protocols, AI, and Quantum Systems to invent the future of Wireless Communication Networks. Previously he led the Wireless Communications Research Department of Nokia Bell Labs located in Ireland and the US. In this role, he and his team innovated in all areas related to the future evolution, deployment, and operation of wireless networks to enable exponential growth in mobile data traffic and reliable low latency communications. His research in this domain has been commercialised in Nokia's (formerly Alcatel-Lucent's) Small Cell product portfolio and continues to have significant impact. He received the 2014 World Technology Award in the individual category Communications Technologies for innovative work of “the greatest likely long-term significance”. Prior to this, Holger directed research in the areas of self-managing networks to enable the first large scale femtocell deployments. Holger joined Bell Labs in 2004, where he began his research in the areas of network optimisation, cellular architectures, and improving the energy efficiency of networks. Holger received his Ph.D. degree in signal processing for digital communications from the University of Edinburgh, United Kingdom in 2004. He is the author of the book “Small Cell Networks”, more than 130 journal and conference publications, 78 granted patent families, and 46 filed patent applications pending. He is a Fellow of the IEEE, a Fellow of the World Technology Network, and a member of the IET.