Abstract: Traditional manual or semi-mechanized pesticide spraying methods often suffer from
issues such as redundant coverage and cumbersome operational steps, which fail to meet current
pest and disease control requirements. Therefore, there is an urgent need to develop an efficient pest
control technology system. This paper builds upon the Deep Q-Network algorithm by integrating
the Bi-directional Long Short-Term Memory structure to propose the BL-DQN algorithm. Based on
this, a path planning framework for pest and disease control using agricultural drones is designed.
This framework comprises four modules: remote sensing image acquisition via the Google Earth
platform, task area segmentation using a deep learning U-Net model, rasterized environmental map
creation, and coverage path planning. The goal is to enhance the efficiency and safety of pesticide
application by drones in complex agricultural environments. Through simulation experiments, the
BL-DQN algorithm achieved a 41.68% improvement in coverage compared with the traditional DQN
algorithm. The repeat coverage rate for BL-DQN was 5.56%, which is lower than the 9.78% achieved
by the DQN algorithm and the 31.29% of the Depth-First Search (DFS) algorithm. Additionally, the
number of steps required by BL-DQN was only 80.1% of that of the DFS algorithm. In terms of
target point guidance, the BL-DQN algorithm also outperformed both DQN and DFS, demonstrating
superior performance.
Agricultural drones are commonly equipped with specialized sensors such as hyperspectral and thermal imaging systems. These sensors are employed
to monitor plant health, soil moisture, and other agricultural concerns. Additionally,
agricultural drones may be equipped with liquid spraying systems for precise pesticide or
fertilizer application, a feature that is relatively uncommon in other types of drones [8–11].
The coverage path planning (CPP) problem, as a critical research area in drone path
planning, aims to design a path within a given area such that the drone can traverse
every point or cover each sub-region of the map in the fewest possible steps [12–14].
Solutions to the CPP problem can be roughly categorized into four types. First, Depth-First
Search (DFS) is a graph traversal algorithm that explores as far as possible along each
branch before backtracking. It is often used for solving problems that can be modeled as a
graph, where the goal is to visit all nodes or find a specific path. DFS uses a stack to manage
the nodes to be explored and systematically searches each branch to its end before retracing
its steps [15]. Second, heuristic algorithms, such as the A* algorithm, model the search
space as a tree structure and use heuristic search techniques to solve the problem [16]. Third,
the Artificial Potential Field (APF) method, as a local obstacle avoidance path planning
algorithm, completes the planning task by simulating the potential fields between the
target and obstacles [17]. Fourth, the Deep Q-Network (DQN) algorithm, based on deep
reinforcement learning (DRL) [18,19], approximates the Q-value function through deep
neural networks, allowing the agent to learn and optimize path planning strategies [20].
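For the first category, a minimal sketch of a DFS-style traversal on a rasterized map is shown below; the 0/1 grid encoding and the function name are illustrative assumptions rather than the cited implementations.

```python
# Minimal DFS-style coverage sketch on a rasterized map (illustrative assumptions:
# 0 marks a free cell to visit, 1 marks an obstacle; names are not from the cited works).
def dfs_coverage(grid, start):
    rows, cols = len(grid), len(grid[0])
    visited, order, stack = set(), [], [start]
    while stack:
        x, y = stack.pop()             # explore the most recently discovered cell first
        if (x, y) in visited:
            continue
        visited.add((x, y))
        order.append((x, y))
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):   # four movement directions
            nx, ny = x + dx, y + dy
            if 0 <= nx < rows and 0 <= ny < cols and grid[nx][ny] == 0 and (nx, ny) not in visited:
                stack.append((nx, ny))
    return order   # visiting order; backtracking moves between branches are implicit

# Example: cover a 3 x 3 map with a single obstacle in the centre.
print(dfs_coverage([[0, 0, 0], [0, 1, 0], [0, 0, 0]], (0, 0)))
```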
Cai et al. proposed a coverage path planning algorithm based on an improved A* algo-
rithm, which efficiently accomplishes coverage tasks for cleaning robots by incorporating a
U-turn search algorithm. However, despite its effectiveness in high-precision maps, the A*
algorithm is associated with significant computational complexity and node redundancy
issues [21].
Wang et al. proposed a multi-agent coverage path planning method based on the
Artificial Potential Field (APF) theory, which guides the movement of agents by simulating
the interactions of forces in a physical field. However, the APF method is susceptible to
becoming trapped in local optima, faces challenges with complex parameter adjustments,
and may encounter potential singularity issues during planning [22].
Tang et al. proposed a coverage path planning method based on region-optimal
decomposition, which combines an improved Depth-First Search (DFS) algorithm with a
genetic algorithm to achieve efficient coverage inspection tasks for drones in port environ-
ments [23]. Although the improved DFS algorithm successfully completes the coverage
path planning tasks, it may encounter local optima in certain complex environments and
faces challenges in ensuring safety during actual drone flight operations.
Unlike classical algorithms that rely on predetermined rules, the DQN algorithm,
based on deep reinforcement learning, introduces two optimization techniques: “target
network” and “experience replay”. The target network updates its parameters at regular
intervals to maintain relative stability in the target values, thereby reducing fluctuations
during the training process. Experience replay allows for the reuse of past experiences,
sampling from diverse previous interactions to mitigate the instability issues caused by data
correlation. Through this continual accumulation and reuse of experience, the drone becomes capable of making decisions under complex environmental conditions, which makes it particularly effective in dynamic and challenging environments [24–26].
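As an illustration of the experience replay mechanism described above, a minimal buffer sketch is shown below; the class name and interface are our own and not taken from any cited implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay sketch (illustrative, not the paper's implementation)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```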
In recent years, Mirco Theile and colleagues have addressed the CPP problem for
drones under fluctuating power limitations by leveraging the DDQN algorithm to balance
battery budgets, enabling the achievement of full coverage maps [27]. S.Y. Luis and
colleagues approached the problem of patrolling water resources by modeling it as a
Markov Decision Process (MDP) and using DQN and DDQN algorithms for training.
However, the large number of parameters involved made it challenging to ensure algorithm
stability [28].
Figure 1. The comprehensive process of the framework.
2.1. Designed Planning Area
This research explores notable topographical differences in various agricultural production settings, focusing on Jilin Province to address the diversity found in farmlands. Situated in the heart of the Northeast Plain, Jilin Province is part of one of the globe's three main black soil areas and serves as a vital agricultural region and significant grain production hub in China. The province features diverse terrain with higher elevations in the southeast and lower elevations in the northwest. Its landscape predominantly comprises plains and mountainous regions, covering an area of approximately 187,400 square km, with elevations ranging from 5 m to 2691 m.
High-resolution remote sensing images of farmlands in Jilin Province from April 2021 to October 2022 were acquired using the Google Earth platform. The geographic coordinates of the study area range from 123°01′ to 128°08′ east longitude and 43°14′ to 45°36′ north latitude [30]. Six regions were randomly selected for analysis, as depicted in Figure 2. Furthermore, the U-Net model was utilized for detailed segmentation of the farmland areas, classifying them into two categories: (1) task areas and (2) non-task areas, as illustrated in Figure 3.
Additionally, this study employs a gridded map approach, dividing the environment
into a 10 × 10 grid and using a two-dimensional integer array for storage and operations.
Based on this, the drone’s environment map is defined as a state matrix with 100 elements,
where each element represents a grid cell on the map. The side length of the map is denoted
by L, and M( x,y) represents the environmental state at position (x, y) on the map. Each
position is assigned one of five distinct values to characterize its specific environmental
features, as illustrated in Table 1.
In this paper, each grid cell is considered as a unit of the map for each movement
of the drone. When an action Ai is executed, the corresponding state matrix changes,
transitioning from the current state Si to the next state Si+1 , as illustrated in Figure 4.
Figure 2. Remote sensing map extraction (task areas are highlighted with red boxes).
Figure 3. Designed planning area process.
Figure 4. UAV action selection space.
Here, A_i = {1, 2, 3, 4} represents the four allowed movement directions for the drone at its current position: left, right, down, and up, respectively.
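A minimal sketch of this rasterized environment and the action set A_i is given below. Apart from the obstacle code (4) and the target code (3) used later in Equation (8), the cell values, the example positions, and the row/column orientation of the four directions are illustrative assumptions, since Table 1 defines the full encoding.

```python
import numpy as np

L = 10
M = np.zeros((L, L), dtype=int)   # 10 x 10 state matrix with 100 elements
M[2, 7] = 3                       # hypothetical target point (code 3, as in Equation (8))
M[4, 4] = M[5, 4] = 4             # hypothetical obstacle cells (code 4, as in Equation (8))

# A_i = {1, 2, 3, 4}: left, right, down, up, expressed as (row, column) offsets.
ACTION_OFFSETS = {1: (0, -1), 2: (0, 1), 3: (1, 0), 4: (-1, 0)}

def step(position, action):
    """Apply action A_i and return the next grid position, clipped to the map bounds."""
    dx, dy = ACTION_OFFSETS[action]
    x = min(max(position[0] + dx, 0), L - 1)
    y = min(max(position[1] + dy, 0), L - 1)
    return x, y

print(step((0, 0), 2))   # moving right from the corner gives (0, 1)
```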
2.2. Basic Theory
2.2.1. Deep Q-Learning
In 2015, Volodymyr Mnih and colleagues proposed the DQN algorithm [31], which integrates the characteristics of DL with the principles of reinforcement learning (RL). This method utilizes DL models to directly acquire control strategies from complex sensory information and assesses the value through neural networks. The DQN algorithm implements an experience replay system, in which experience tuples created while the agent interacts with the environment are saved in a replay buffer, as shown in Equation (1). These experiences are then randomly sampled from the buffer for training.

< s_t, a_t, r_t, s_{t+1} >    (1)
Furthermore, DQN utilizes two networks that share the same structure and parameters: the policy network and the target network. Throughout training, only the parameters of the policy network are consistently updated. At specific intervals, these parameters are transferred to the target network to reduce instability caused by the frequent updates to the target Q-values. The DQN algorithm has shown impressive performance in various classic Atari 2600 games, reaching near-human-level proficiency through learning, and has consequently become a significant subject in artificial intelligence research in recent years.
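A minimal sketch of this periodic parameter transfer is shown below, assuming PyTorch-style networks; the small feed-forward model is only a stand-in for the actual architecture, and the update interval of 10 episodes follows the setting reported later for Table 2.

```python
import copy
import torch.nn as nn

# Stand-in policy/target pair sharing structure and initial parameters (illustrative only).
policy_net = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 4))
target_net = copy.deepcopy(policy_net)
target_net.eval()   # the target network is never optimized directly

def sync_target(episode, n=10):
    # At fixed intervals, copy the policy parameters into the target network so the
    # target Q-values remain stable between synchronizations.
    if episode % n == 0:
        target_net.load_state_dict(policy_net.state_dict())
```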
2.2.2. Reward
Since the model relies solely on feedback rewards obtained through interactions with
the environment to guide learning, the design of an effective reward mechanism typically
determines the efficiency and accuracy of the model’s learning process. Therefore, a well-
designed reward function should be simple and reflect the specific requirements of the task.
The reward function used in traditional DQN path planning algorithms is represented by
Equation (2).
r_t = \begin{cases} r_{overlay}, & \text{Task area coverage} \\ r_{crash}, & \text{Collision} \\ 0, & \text{Other cases} \end{cases}    (2)
Based on the different outcomes at the next timestep, the rewards are divided into three parts. The reward r_overlay for covering the task area is given a positive value to encourage the model to pursue the target, whereas r_crash for collisions is assigned a negative value to penalize collision behavior. However, such sparse rewards, which occur only upon task-area coverage or collisions, provide little useful feedback for most individual actions. This not only reduces learning efficiency and increases exploration difficulty but also complicates policy optimization [32].
The target guidance reward for the drone is designed as shown in Equation (4). In this equation, current_distance refers to the distance between the drone's current position and the destination, while new_distance is the distance to the destination from the new position after the movement. M_x and M_y represent the current position coordinates of the drone on the map, M_newx and M_newy are the coordinates of the agent's position after executing the current action, selected based on a greedy strategy, and G_x and G_y are the coordinates of the target location on the map. β is the adjustable scaling factor for the target point reward. Equations (5)–(7) illustrate these calculations.
To calculate the distance from the current location to the destination, use the Euclidean distance equation:

current\_distance = \sqrt{(M_x − G_x)^2 + (M_y − G_y)^2}    (5)
To calculate the distance between the new position and the destination, use the Euclidean distance equation:

new\_distance = \sqrt{(M_{newx} − G_x)^2 + (M_{newy} − G_y)^2}    (6)
To calculate the reward R_goal for approaching the target based on the difference between the two distances, use the following equation:

R_{goal} = \begin{cases} \beta\,(current\_distance − new\_distance), & \text{if } new\_distance < current\_distance \\ 0, & \text{otherwise} \end{cases}    (7)
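Translated into code, Equations (5)–(7) amount to the following sketch; β = 0.1 and the example coordinates are illustrative values only.

```python
import math

def goal_reward(current_pos, new_pos, goal, beta=0.1):
    """Target guidance reward of Equation (7); beta is an illustrative scaling factor."""
    current_distance = math.dist(current_pos, goal)   # Equation (5)
    new_distance = math.dist(new_pos, goal)           # Equation (6)
    if new_distance < current_distance:
        return beta * (current_distance - new_distance)
    return 0.0

print(goal_reward((0, 0), (0, 1), (0, 9)))   # moving toward the goal earns a positive reward
```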
To avoid the program getting stuck in local optima and to reduce the training burden,
this paper presents three solutions, where “True” indicates that the task has been completed
or failed, signaling that the current episode has ended, as illustrated in Equation (8).
1. The current episode is terminated when the drone collides with an obstacle.
2. The current episode is terminated if the drone exceeds the maximum step limit or if
the drone has not scored for a number of consecutive steps beyond the specified threshold.
3. The current episode is terminated when the drone reaches the target point and there
are no uncovered areas within the task zone.
R, Done = \begin{cases} r_{crash}\ (−1 \text{ and True}), & \text{if } M_{(newx,\,newy)} = 4 \\ r_{step}\ (−0.1 \text{ and True}), & \text{if } S \geq S_{max} \text{ or } N \geq N_{max} \\ r_{reach}\ (1 \text{ and True}), & \text{if } M_{(newx,\,newy)} = 3 \text{ and } ones\_ratio = 0 \end{cases}    (8)
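The three termination cases of Equation (8) can be sketched as follows. The cell codes 3 and 4 follow Equation (8), and the limits S_max = 100 and N_max = 5 follow the hyperparameters reported later; the remaining names are our own.

```python
def reward_and_done(cell_value, steps, no_score_steps, uncovered_ratio,
                    s_max=100, n_max=5):
    """Return (reward, done) following the three cases of Equation (8)."""
    if cell_value == 4:                                   # collision with an obstacle
        return -1.0, True
    if steps >= s_max or no_score_steps >= n_max:         # step limit or no-score limit reached
        return -0.1, True
    if cell_value == 3 and uncovered_ratio == 0:          # target reached, task area fully covered
        return 1.0, True
    return 0.0, False                                     # otherwise the episode continues
```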
2.4. BL-DQN
Bi-LSTM is an extended LSTM network that simultaneously considers data from both
past and future contexts by employing two LSTM layers operating in opposite directions
to capture contextual relationships. This structure enables the model to obtain more
comprehensive information when processing data, thereby enhancing performance. This
paper extends the DQN algorithm by integrating the Bi-LSTM structure, which improves
the focus on multi-temporal action values through deep neural networks. The network
architecture consists of two LSTM layers in different directions to increase model depth
and capture more complex sequence dependencies, and a fully connected layer that maps
the high-dimensional feature vectors from the Bi-LSTM layers to a low-dimensional space
matching the number of output classes for the task, as illustrated in Figure 5c.
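A minimal PyTorch sketch of such a network is shown below. The 128 hidden units per direction and the four-action output follow the hyperparameters reported for Table 2; everything else is an illustrative reconstruction rather than the authors' code.

```python
import torch
import torch.nn as nn

class BiLSTMQNet(nn.Module):
    """Illustrative Bi-LSTM Q-network: forward and backward LSTM layers plus a fully connected head."""
    def __init__(self, state_dim=100, hidden=128, n_actions=4):
        super().__init__()
        # bidirectional=True runs one LSTM over the sequence in each direction and
        # concatenates their hidden states, as in Equation (9).
        self.bilstm = nn.LSTM(input_size=state_dim, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_actions)   # maps [h_t, h'_t] to the four Q-values

    def forward(self, x):
        out, _ = self.bilstm(x)          # x: (batch, sequence length, state_dim)
        return self.fc(out[:, -1, :])    # Q-values estimated from the last time step

q_net = BiLSTMQNet()
state = torch.rand(1, 1, 100)            # one flattened 10 x 10 map as a single-step sequence
print(q_net(state).shape)                # torch.Size([1, 4])
```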
The entire model inference process is illustrated in Figure 5b. Initially, the input data
were converted into an array and subjected to normalization. Subsequently, the data were
transformed into tensors and dimensions were added to meet the input requirements of the
subsequent model, as shown in Figure 5a. The preprocessed data were then input into the
model after being flattened and reshaped, allowing the model to generate output results
that provided guidance for path planning.
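A hedged sketch of this preprocessing chain (array conversion, normalization, tensor conversion, and dimension expansion) is shown below; the normalization scheme is an assumption, since the paper does not specify it.

```python
import numpy as np
import torch

def preprocess(state_matrix):
    """Normalize the integer map, convert it to a tensor, and add batch and sequence dimensions."""
    arr = np.asarray(state_matrix, dtype=np.float32)
    if arr.max() > 0:
        arr = arr / arr.max()                 # simple max-normalization (assumed scheme)
    tensor = torch.from_numpy(arr.flatten())  # flatten the 10 x 10 map into 100 values
    return tensor.unsqueeze(0).unsqueeze(0)   # shape (1, 1, 100), matching the Bi-LSTM input
```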
Figure 5. (a) Data processing. (b) Model inference process. (c) Model structure.
At each time step t, the Bi-LSTM's output consisted of the merging of hidden states from both the forward and backward LSTM layers, as shown in Equation (9).

h_t^{BiLSTM} = [h_t, h'_t]    (9)

Here, h_t and h'_t denote the hidden states of the forward and backward LSTM layers at time step t. Subsequently, the output of the Bi-LSTM layer will be processed through a fully connected layer to estimate the Q-values.

Q(s, a) ≈ Q(s, a; θ) = W · h_t^{BiLSTM} + b    (10)
The Q-value update equation is given by Equation (11).

Q'(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]    (11)
Here, Q(s, a) denotes the Q-value for taking action a in state s, where α represents the learning rate, r is the reward, and γ is the discount factor. A larger γ emphasizes long-term rewards more heavily, whereas a smaller γ prioritizes short-term gains.

This paper employs a greedy strategy to balance the concepts of exploitation and exploration when selecting actions at each time step, as shown in Equation (12). In this equation, RN represents a random number generated in the range of 0 to 1 for each time step. ϵ is a hyperparameter used to balance exploitation and exploration, dynamically adjusted throughout the training phase, as shown in Equation (13). When RN > ϵ, the action with the highest Q-value for the current state is chosen for exploitation. Otherwise, a random action is selected for exploration.

Action = \begin{cases} argmax_a\, Q(s, a; θ), & \text{if } RN > ϵ \\ \text{random action}, & \text{otherwise} \end{cases}    (12)

ϵ = ϵ_{min} + (ϵ_{initial} − ϵ_{min}) × e^{−Decay\ rate × current\ episode}    (13)
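In code, the ϵ-greedy choice of Equation (12) and the decay of Equation (13) can be sketched as follows. The constants follow the hyperparameters reported later for Table 2, and the exponent is read as the current episode divided by the decay rate, which is our interpretation of the gradual decay described in the text.

```python
import math
import random

def select_action(q_values, epsilon):
    """Equation (12): exploit the highest Q-value with probability 1 - epsilon, else explore."""
    if random.random() > epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return random.randrange(len(q_values))

def epsilon_at(episode, eps_min=0.1, eps_initial=1.0, decay_rate=3000):
    """Equation (13): exponential decay of the exploration rate over training episodes."""
    return eps_min + (eps_initial - eps_min) * math.exp(-episode / decay_rate)

print(epsilon_at(0), epsilon_at(10_000))   # roughly 1.0 at the start, near 0.13 after 10,000 episodes
```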
This paper uses the Smooth L1 Loss as the loss function, which smooths the input
values close to zero to reduce the occurrence of extreme gradient values during gradient
descent. This loss function applies squared error for small errors and linear error for larger
errors. By computing the loss between Q_value and Q_target, and optimizing parameters
through backpropagation, the agent learns the actions that maximize expected rewards in a
given state after extensive training and optimization, as shown in Equation (14).
L(y, ŷ) = \begin{cases} 0.5\,(Q(s, a; θ) − y)^2, & \text{if } |Q(s, a; θ) − y| < 1 \\ |Q(s, a; θ) − y| − 0.5, & \text{otherwise} \end{cases}    (14)
Here, y represents the Q_target value given by the target network, as shown in Equation (15).
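In PyTorch terms, Equation (14) corresponds to the built-in Smooth L1 criterion; a minimal sketch of the loss between the two Q-estimates is shown below, with placeholder values.

```python
import torch
import torch.nn.functional as F

# Placeholder Q-value from the policy network and target value y from the target network.
q_value = torch.tensor([0.42], requires_grad=True)
q_target = torch.tensor([0.65])

loss = F.smooth_l1_loss(q_value, q_target)   # squared error below a unit gap, linear above it
loss.backward()                              # gradients flow only into the policy parameters
print(loss.item())
```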
Table 2 presents the hyperparameter configurations for the BL-DQN algorithm. The maximum number of training episodes (E_P) was set to 50,000, with a maximum step count per episode (S_max) of 100 to prevent excessive training time. The maximum number of consecutive steps without reward (N_max) was set to 5, encouraging exploration in the absence of rewards. The discount factor (γ) was 0.95, highlighting the importance of long-term returns. The batch size (B) was set to 128, and the capacity of the experience replay buffer (M) was 100,000. The learning rate (LR) was 1 × 10^{−3}, affecting the step size for weight adjustments. The initial exploration rate (ϵ_initial) was set to 1.0 to encourage the agent to explore by selecting random actions during the early training phase. The minimum exploration rate (ϵ_min) was 0.1, ensuring that the agent retains some randomness for exploration in the later training stages. The decay rate was set to 3000, controlling the speed at which ϵ decreases. Each LSTM layer contained 128 neurons (Layers_{LSTM1} and Layers_{LSTM2}), enhancing the model's expressive capability. The network update frequency (n) was set to 10, meaning that the parameters of the policy network would be transferred to the target network every 10 training episodes. The action space size (N_actions) was 4, corresponding to movements in four directions. Finally, the Adam optimizer was selected to accelerate convergence and improve learning efficiency.
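Collected as a configuration dictionary, the settings described above (our transcription of Table 2, not the authors' code) would look roughly as follows.

```python
# Hyperparameters as described in the text for Table 2 (transcribed for illustration).
BL_DQN_CONFIG = {
    "max_episodes": 50_000,        # EP
    "max_steps_per_episode": 100,  # Smax
    "max_no_reward_steps": 5,      # Nmax
    "gamma": 0.95,                 # discount factor
    "batch_size": 128,             # B
    "replay_capacity": 100_000,    # M
    "learning_rate": 1e-3,         # LR
    "epsilon_initial": 1.0,
    "epsilon_min": 0.1,
    "decay_rate": 3000,
    "lstm_hidden_units": 128,      # per LSTM direction
    "target_update_every": 10,     # n, in training episodes
    "n_actions": 4,
    "optimizer": "Adam",
}
```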
Figure 6. Path planning results of the DQN (left), DFS (middle), and BL-DQN (right) on Map 1.
Figure 7. Path planning results of the DQN (left), DFS (middle), and BL-DQN (right) on Map 2.
Figure 8. Path planning results of the DQN (left), DFS (middle), and BL-DQN (right) on Map 3.
Figure 9. Path planning results of the DQN (left), DFS (middle), and BL-DQN (right) on Map 4.
Figure 10. Path planning results of the DQN (left), DFS (middle), and BL-DQN (right) on Map 5.
Figure 11. Path planning results of the DQN (left), DFS (middle), and BL-DQN (right) on Map 6.
As shown in Figure 12, the loss values of the BL-DQN algorithm were generally lower
than those of the DQN algorithm. During training, the BL-DQN demonstrated greater
stability and more efficient problem-solving strategy learning capabilities, with better
generalization performance. Although there may be significant fluctuations in the loss
values of the BL-DQN under certain specific conditions, leading to occasional performance
degradation, its overall performance remained superior to that of the DQN algorithm.
Figure 12. Comparison of loss between BL-DQN algorithm and DQN algorithm.
As illustrated in Figure 13, the reward value of the proposed BL-DQN algorithm significantly outperformed that of the DQN algorithm during training. However, as the complexity of the map increased, the model's performance also exhibited greater fluctuations.
Figure 13. Comparison of reward between BL-DQN algorithm and DQN algorithm.
To perform a quantitative assessment of the three algorithms, the research measured the repetition rate, the steps involved in path planning, coverage rate, number of collisions, target point arrival status, and adherence to rules (including collisions with obstacles and deviations from the specified direction) for each algorithm in completing the task area coverage. The analysis results are shown in Table 3. The analysis shows that the BL-DQN algorithm surpasses the other algorithms in terms of drone path planning, coverage rate, number of steps, target point guidance, and adherence to rules. After 50,000 training iterations, the DQN algorithm did not achieve full coverage and effective target point guidance. Although the DFS algorithm showed stable coverage, it did not match the BL-DQN in terms of target point accuracy and task completion, and it exhibited higher repeat rates and rule violations.
3.3. Discussion
The BL-DQN algorithm outperformed traditional DQN and DFS algorithms in terms
of the number of steps, coverage rate, and repeat coverage rate. It also achieved significant
advancements in task-oriented guidance, indicating that the BL-DQN algorithm enhances
efficiency in path planning while better optimizing drone recovery issues. However,
despite the notable optimization effects demonstrated by the proposed BL-DQN algorithm
in simulated environments, several limitations remain in the current research. In recent
years, Pan and colleagues have made innovative improvements to the traditional APF
method for formation control in three-dimensional constrained spaces by introducing
the concept of rotational potential fields. They developed a novel formation controller
utilizing potential function methods [33]. Additionally, Zhou and associates proposed a
biologically inspired path planning algorithm for real-time obstacle avoidance in unmapped
environments for unmanned aerial vehicles [34]. Fang et al. proposed a solution that
integrates distributed network localization with formation maneuvering control. This
approach utilizes relative measurement information among agents to achieve real-time
positioning and coordinated formation management of multi-agent systems in multi-
dimensional spaces [35]. Enrique Aldao and colleagues introduced a real-time obstacle
avoidance algorithm based on optimal control theory, suitable for autonomous navigation
of UAVs in dynamic indoor environments. By integrating pre-registered three-dimensional
model information with onboard sensor data, this algorithm optimizes UAV flight paths and
effectively avoids collisions with both fixed and moving obstacles in the environment [36].
He, Y. et al. proposed a new stability analysis method for dealing with hybrid systems with
double time delays, which has important implications for the design of control strategies
in the field of UAVs [37]. Considering the research trends of the past three years, there
remains room for exploration in the following areas within this study.
1. A limitation of the current model is its reliance on pre-defined map data. Future
research should focus on integrating real-time environmental data, such as weather
conditions, crop growth dynamics, pest distribution information, and other distur-
bances, to enable dynamic adjustments in path planning, ensuring the stability of the
drones. The development of such adaptive algorithms will substantially enhance the
robustness and effectiveness of the model in practical agricultural applications.
2. Extending the existing single-agent model to a multi-agent framework holds promise
for further improving operational efficiency and coverage in large-scale farmland.
Investigating how to coordinate multiple drones for joint path optimization, while
considering communication constraints and task allocation strategies, represents a
challenging yet promising direction for future research.
3. As depicted in Figures 7 and 10, the complexity of the maps resulted in target points
being unmet in Map 2 and Map 5. This indicates that there is potential for enhance-
ment. Future efforts will focus on refining the model and adjusting parameters to
improve planning efficacy.
Author Contributions: Conceptualization, H.F., Z.L., X.F. and J.L.; methodology, J.L. and W.Z.;
software, H.F. and Y.F.; investigation, L.Z. and W.Z.; resources, Z.L. and Y.F.; writing—original draft,
Z.L.; writing—review and editing, H.F.; visualization, Z.L.; supervision, H.F., J.L. and X.F.; funding
acquisition, L.Z.; validation, L.Z. and X.F.; data curation, W.Z.; project administration, X.F. All authors
have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Jilin Province Science and Technology Development Plan
Project (20240302092GX).
Data Availability Statement: The original contributions presented in the study are included in the
article; further inquiries can be directed to the corresponding author.
References
1. Liu, J. Trend forecast of major crop diseases and insect pests in China in 2024. China Plant Prot. Guide 2024, 44, 37–40.
2. Tudi, M.; Li, H.; Li, H.; Wang, L.; Lyu, J.; Yang, L.; Tong, S.; Yu, Q.J.; Ruan, H.D.; Atabila, A.; et al. Exposure Routes and Health
Risks Associated with Pesticide Application. Toxics 2022, 10, 335. [CrossRef]
3. Benbrook, C.M. Trends in Glyphosate Herbicide Use in the United States and Globally. Environ. Sci. Eur. 2016, 28, 3. [CrossRef]
4. Fang, X.; Xie, L.; Li, X. Distributed Localization in Dynamic Networks via Complex Laplacian. Automatica 2023, 151, 110915.
[CrossRef]
5. Kim, J.; Kim, S.; Ju, C.; Son, H.I. Unmanned Aerial Vehicles in Agriculture: A Review of Perspective of Platform, Control, and
Applications. IEEE Access 2019, 7, 105100–105115. [CrossRef]
6. Fang, X.; Li, J.; Li, X.; Xie, L. 2-D Distributed Pose Estimation of Multi-Agent Systems Using Bearing Measurements. J. Autom.
Intell. 2023, 2, 70–78. [CrossRef]
7. He, Y.; Zhu, D.; Chen, C.; Wang, Y. Data-Driven Control of Singularly Perturbed Hybrid Systems with Multi-Rate Sampling. ISA
Trans. 2024, 148, 490–499. [CrossRef]
8. Nazarov, D.; Nazarov, A.; Kulikova, E. Drones in Agriculture: Analysis of Different Countries. BIO Web Conf. 2023, 67. [CrossRef]
9. Ayamga, M.; Akaba, S.; Nyaaba, A.A. Multifaceted Applicability of Drones: A Review. Technol. Forecast. Soc. Change 2021, 167,
120677. [CrossRef]
10. Tsouros, D.C.; Bibi, S.; Sarigiannidis, P.G. A Review on UAV-Based Applications for Precision Agriculture. Information 2019,
10, 349. [CrossRef]
11. An, D.; Chen, Y. Non-Intrusive Soil Carbon Content Quantification Methods Using Machine Learning Algorithms: A Comparison
of Microwave and Millimeter Wave Radar Sensors. J. Autom. Intell. 2023, 2, 152–166. [CrossRef]
12. Cabreira, T.M.; Brisolara, L.B.; Paulo, R.F., Jr. Survey on Coverage Path Planning with Unmanned Aerial Vehicles. Drones 2019,
3, 4. [CrossRef]
13. Aggarwal, S.; Kumar, N. Path Planning Techniques for Unmanned Aerial Vehicles: A Review, Solutions, and Challenges. Comput.
Commun. 2020, 149, 270–299. [CrossRef]
14. Fang, X.; Xie, L. Distributed Formation Maneuver Control Using Complex Laplacian. IEEE Trans. Autom. Control 2024, 69,
1850–1857. [CrossRef]
15. Tarjan, R. Depth-First Search and Linear Graph Algorithms. In Proceedings of the 12th Annual Symposium on Switching and
Automata Theory (swat 1971), East Lansing, MI, USA, 13–15 October 1971; pp. 114–121.
16. Tang, G.; Tang, C.; Claramunt, C.; Hu, X.; Zhou, P. Geometric A-Star Algorithm: An Improved A-Star Algorithm for AGV Path
Planning in a Port Environment. IEEE Access 2021, 9, 59196–59210. [CrossRef]
17. Sang, H.; You, Y.; Sun, X.; Zhou, Y.; Liu, F. The Hybrid Path Planning Algorithm Based on Improved A* and Artificial Potential
Field for Unmanned Surface Vehicle Formations. OCEAN Eng. 2021, 223, 108709. [CrossRef]
18. Hu, L.; Hu, H.; Naeem, W.; Wang, Z. A Review on COLREGs-Compliant Navigation of Autonomous Surface Vehicles: From
Traditional to Learning-Based Approaches. J. Autom. Intell. 2022, 1, 100003. [CrossRef]
19. Ning, Z.; Xie, L. A Survey on Multi-Agent Reinforcement Learning and Its Application. J. Autom. Intell. 2024, 3, 73–91. [CrossRef]
20. Li, L.; Wu, D.; Huang, Y.; Yuan, Z.-M. A Path Planning Strategy Unified with a COLREGS Collision Avoidance Function Based on
Deep Reinforcement Learning and Artificial Potential Field. Appl. Ocean Res. 2021, 113, 102759. [CrossRef]
21. Cai, Z.; Li, S.; Gan, Y.; Zhang, R.; Zhang, Q. Research on Complete Coverage Path Planning Algorithms Based on A* Algorithms.
Open Cybern. Syst. J. 2014, 8, 418–426.
22. Wang, Z.; Zhao, X.; Zhang, J.; Yang, N.; Wang, P.; Tang, J.; Zhang, J.; Shi, L. APF-CPP: An Artificial Potential Field Based
Multi-Robot Online Coverage Path Planning Approach. IEEE Robot. Autom. Lett. 2024, 9, 9199–9206. [CrossRef]
23. Tang, G.; Tang, C.; Zhou, H.; Claramunt, C.; Men, S. R-DFS: A Coverage Path Planning Approach Based on Region Optimal
Decomposition. Remote Sens. 2021, 13, 1525. [CrossRef]
24. Liu, L.; Wang, X.; Yang, X.; Liu, H.; Li, J.; Wang, P. Path Planning Techniques for Mobile Robots: Review and Prospect. Expert Syst.
Appl. 2023, 227, 120254. [CrossRef]
25. Qin, H.; Shao, S.; Wang, T.; Yu, X.; Jiang, Y.; Cao, Z. Review of Autonomous Path Planning Algorithms for Mobile Robots. Drones
2023, 7, 211. [CrossRef]
26. Patle, B.K.; Babu, L.G.; Pandey, A.; Parhi, D.R.K.; Jagadeesh, A. A Review: On Path Planning Strategies for Navigation of Mobile
Robot. Def. Technol. 2019, 15, 582–606. [CrossRef]
27. Theile, M.; Bayerlein, H.; Nai, R.; Gesbert, D.; Caccamo, M. UAV Coverage Path Planning under Varying Power Constraints
Using Deep Reinforcement Learning. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 1444–1449.
28. Luis, S.Y.; Reina, D.G.; Marín, S.L.T. A Deep Reinforcement Learning Approach for the Patrolling Problem of Water Resources
Through Autonomous Surface Vehicles: The Ypacarai Lake Case. IEEE Access 2020, 8, 204076–204093. [CrossRef]
29. Li, J.; Zhang, W.; Ren, J.; Yu, W.; Wang, G.; Ding, P.; Wang, J.; Zhang, X. A Multi-Area Task Path-Planning Algorithm for
Agricultural Drones Based on Improved Double Deep Q-Learning Net. Agriculture 2024, 14, 1294. [CrossRef]
30. Ma, C.; Wang, L.; Chen, Y.; Wu, J.; Liang, A.; Li, X.; Jiang, C.; Omrani, H. Evolution and Drivers of Production Patterns of Major
Crops in Jilin Province, China. Land 2024, 13, 992. [CrossRef]
31. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.;
Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533. [CrossRef]
32. Guo, S.; Zhang, X.; Du, Y.; Zheng, Y.; Cao, Z. Path Planning of Coastal Ships Based on Optimized DQN Reward Function. J. Mar.
Sci. Eng. 2021, 9, 210. [CrossRef]
33. Pan, Z.; Zhang, C.; Xia, Y.; Xiong, H.; Shao, X. An Improved Artificial Potential Field Method for Path Planning and Formation
Control of the Multi-UAV Systems. IEEE Trans. Circuits Syst. II-Express Briefs 2022, 69, 1129–1133. [CrossRef]
34. Zhou, Y.; Su, Y.; Xie, A.; Kong, L. A Newly Bio-Inspired Path Planning Algorithm for Autonomous Obstacle Avoidance of UAV.
Chin. J. Aeronaut. 2021, 34, 199–209. [CrossRef]
35. Fang, X.; Xie, L.; Li, X. Integrated Relative-Measurement-Based Network Localization and Formation Maneuver Control. IEEE
Trans. Autom. Control 2024, 69, 1906–1913. [CrossRef]
36. Aldao, E.; Gonzalez-deSantos, L.M.; Michinel, H.; Gonzalez-Jorge, H. UAV Obstacle Avoidance Algorithm to Navigate in
Dynamic Building Environments. Drones 2022, 6, 16. [CrossRef]
37. He, Y.; Zhu, G.; Gong, C.; Shi, P. Stability Analysis for Hybrid Time-Delay Systems with Double Degrees. IEEE Trans. Syst. Man
Cybern. -Syst. 2022, 52, 7444–7456. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.