
Wireless Networks (2021) 27:3537–3547
https://doi.org/10.1007/s11276-019-02225-x

Design and application of adaptive PID controller based on asynchronous advantage actor–critic learning method

Qifeng Sun1 · Chengze Du1 · Youxiang Duan1 · Hui Ren1 · Hongqiang Li2

Published online: 31 December 2019
© The Author(s) 2019

Abstract
To address the slow convergence and inefficiency of existing adaptive PID controllers, we propose a new adaptive PID controller based on the asynchronous advantage actor–critic (A3C) algorithm. Firstly, the controller trains multiple agents with actor–critic structures in parallel, exploiting the multi-thread asynchronous learning characteristics of the A3C framework. Secondly, to achieve the best control effect, each agent uses a multilayer neural network to approximate the policy function and the value function and to search for the best parameter-tuning strategy in a continuous action space. The simulation results indicate that the proposed controller achieves faster convergence and stronger adaptability than conventional controllers.

Keywords: Reinforcement learning · Asynchronous advantage actor–critic · Adaptive PID control · Stepping motor

1 Introduction

The PID controller is a control loop feedback mechanism widely used in industrial control systems [1]. Building on the conventional PID controller, the adaptive PID controller adjusts its parameters online according to the state of the system and therefore adapts better to the system. The fuzzy PID controller [2] adopts the ideology of matrix estimation [3, 4]. To satisfy the requirement of self-tuning PID parameters, the method adjusts the parameters by querying a fuzzy matrix table. The limitation of this method is that it needs a great deal of prior knowledge; moreover, it has a large number of parameters that need to be optimized [5].

The adaptive PID controller [6, 7] approximates a nonlinear structure with neural networks, which can achieve effective control without identifying the complex nonlinear controlled object. However, it is difficult to obtain the teacher signals required by its supervised learning process. The evolutionary adaptive PID controller [8] requires less prior knowledge but has difficulty achieving real-time control [9]. The adaptive PID controller based on reinforcement learning [10] solves this problem by obtaining the teacher's signal through an unsupervised learning process, and the optimization of the control parameters is simple. The actor–critic (AC) adaptive PID controller [11, 12] is the most widely used reinforcement learning controller. However, its convergence speed is limited by the correlation of the learning data in the AC algorithm [13].

Google's DeepMind team proposed the asynchronous advantage actor–critic (A3C) learning algorithm [14, 15]. This algorithm trains multiple agents in parallel with multiple exploration strategies, and each agent experiences different learning states, so the correlation of the learning samples is broken while the computational efficiency improves [16]. The algorithm has been applied in many fields [17, 18].

The proposed method aims to improve the convergence and adaptive ability of the PID controller. To achieve this purpose, we use the A3C algorithm, which raises the learning rate by training agents in parallel threads, and two BP neural networks are used to approximate the policy function and the value function separately. The experiments show that the proposed algorithm outperforms conventional PID control algorithms. The rest of the paper is arranged as follows: starting from a brief description of related work and the PID controller in Sects. 2 and 3, we introduce our new approach in Sect. 4 and show experimental results in Sect. 5. We conclude the paper in Sect. 6.

Corresponding author: Qifeng Sun, Sunqf@upc.edu.cn
1 China University of Petroleum, Qingdao 266580, China
2 China Petrochemical Group Victory Drilling Technology Research Institute, Dongying 257000, China


2 Related work

Conventional PID control algorithms can be roughly classified into three categories: fuzzy PID controllers, neural network PID controllers and reinforcement learning PID controllers.

2.1 Fuzzy PID controller

Tang [19] proposed a method that combined fuzzy math with PID control. However, this method still had limitations, such as requiring a great deal of manual experience to establish the rule table; besides, the rule table was often adapted only to a specific application scenario. To address these issues, Sun [20] developed a fuzzy PID controller based on an improved genetic algorithm, which used multiple fuzzy control rules and adjusted the parameters with the genetic algorithm. The controller abandoned much of the manual work and set up a rule table specific to the environment. Inspired by that work, Zhu [21] added a normalized velocity parameter reflecting the response of the system to the adjusting factor of the fuzzy rules. The method changed the mapping between input and output variables through fuzzy subsets, so that the controller could divide the error and the error rate into multiple control stages.

2.2 Adaptive controller based on neural network

Liao [22] proposed a method utilizing a neural network to reinforce the performance of the PID controller for nonlinear systems. Although the initial parameters of the neural network could be determined by manual testing, the reliability of the manual result could not be ensured. Based on this, Li [23] adopted a genetic algorithm to obtain the optimal initial parameters of the network. However, the genetic algorithm easily falls into local optima. To solve this problem, Patel [24] appended an immigration mechanism to the multilayer neural network adaptive PID controller (MN-PID), in which 10% of the elite population and the inferior population were selected as the variant population. In addition, Nie [25] presented an adaptive chaos particle swarm optimization for tuning the parameters of the PID controller (CSP-PID) to avoid local minima.

2.3 Reinforcement learning adaptive controller

Aziz Khater [26] proposed a PID controller that combines an ASN reinforcement-learning network with fuzzy math. Although this method did not need very accurate training samples compared with the neural network PID, its structure was too complex to guarantee real-time performance. In view of this, Adel [10] designed an adaptive PID controller based on the AC algorithm. This controller had a simple structure with one RBF network. However, its convergence was slow owing to the correlation of the learning samples in the AC algorithm.

3 Basic structure of PID controller

Incremental PID is a PID control algorithm that computes the increment of the control variable. The typical control system structure is shown in Fig. 1, and its formula is as follows:

u(t) = u(t-1) + \Delta u(t) = u(t-1) + K_i(t)\,e(t) + K_p(t)\,\Delta e(t) + K_d(t)\,\Delta^2 e(t)    (1)

where

e(t) = y_0(t) - y(t), \quad \Delta e(t) = e(t) - e(t-1), \quad \Delta^2 e(t) = e(t) - 2e(t-1) + e(t-2)

Here y_0(t), y(t), e(t), Δe(t) and Δ²e(t) represent the reference signal value, the output of the current system, the system output error, the first-order difference of the error and the second-order difference of the error, respectively. In the form of Eq. (1), incremental PID cancels the integral summation, which saves calculation time; in addition, it affects the system only slightly in the event of a malfunction. Considering these factors, incremental PID is a good choice for practical applications.

Fig. 1 PID control structure
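The incremental law of Eq. (1) needs only the last two errors as internal state. The following Python sketch (our illustration, not code from the paper) implements it directly:

```python
class IncrementalPID:
    """Incremental PID of Eq. (1): u(t) = u(t-1) + Ki*e + Kp*de + Kd*d2e."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.e_prev = 0.0    # e(t-1)
        self.e_prev2 = 0.0   # e(t-2)
        self.u = 0.0         # u(t-1)

    def step(self, setpoint, measurement):
        e = setpoint - measurement                            # e(t) = y0(t) - y(t)
        de = e - self.e_prev                                  # first-order difference
        d2e = e - 2.0 * self.e_prev + self.e_prev2            # second-order difference
        self.u += self.ki * e + self.kp * de + self.kd * d2e  # increment of Eq. (1)
        self.e_prev2, self.e_prev = self.e_prev, e
        return self.u
```

Because only an increment is added at each step, a faulty sample perturbs the control signal once rather than accumulating, which is the robustness property mentioned above.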


4 A3C adaptive PID control

The A3C algorithm is a deep reinforcement learning algorithm. It introduces an asynchronous training method on the basis of the AC framework. The A3C learning framework consists of a central network (Global Net) and multiple AC structures, which are executed and learned in parallel by creating multiple agents in the same environment instances. The central network is responsible for updating and storing the AC network parameters. Each agent has its own AC structure, and the different agents transfer learning data to the central network to update the parameters of their AC networks. Within each structure, the Actor network is responsible for policy learning, while the Critic network is responsible for estimating the value function.

4.1 Structure of A3C-PID controller

The design of the A3C adaptive PID controller combines the asynchronous learning structure of A3C with the incremental PID controller. Its structure is shown in Fig. 2.

Fig. 2 Adaptive PID control diagram based on A3C learning

The whole process is as follows:

Step 1: For each thread, the initial error e_m(t) enters the state converter, which calculates Δe_m(t) and Δ²e_m(t) and outputs the state vector S_m(t) = [e_m(t), Δe_m(t), Δ²e_m(t)]^T.
Step 2: The Actor(m) maps the state vector S_m(t) to the three PID parameters K_p, K_i and K_d.
Step 3: The updated controller acts on the environment and receives the reward r_m(t).

After n steps, Critic(m) receives S_m(t+n), the state vector of the system, and produces the value function estimate V(S_{t+n}; W'_v) and the n-step TD error δ_TD, which are the main basis for updating the parameters. The reward function is given by Eq. (2):

r_m(t) = a_1 r_1(t) + a_2 r_2(t)

r_1(t) = \begin{cases} 0, & |e_m(t)| < \varepsilon \\ \varepsilon - e_m(t), & \text{otherwise} \end{cases}

r_2(t) = \begin{cases} 0, & |e_m(t)| \ge |e_m(t-1)| \\ |e_m(t)| - |e_m(t-1)|, & \text{otherwise} \end{cases}    (2)
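A minimal sketch of the reward of Eq. (2). The weights a1 and a2 are not given numerically in the text, so they are placeholders here, and the branches follow Eq. (2) as written above:

```python
def reward(e_t, e_prev, a1=1.0, a2=1.0, eps=1e-3):
    """r_m(t) = a1*r1(t) + a2*r2(t), Eq. (2)."""
    # r1: zero inside the tolerance band, otherwise eps - e(t)
    r1 = 0.0 if abs(e_t) < eps else eps - e_t
    # r2: zero unless the absolute error has decreased since the last step
    r2 = 0.0 if abs(e_t) >= abs(e_prev) else abs(e_t) - abs(e_prev)
    return a1 * r1 + a2 * r2
```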
In the next step, the Actor(m) and the Critic(m) send their own parameters W'_am, W'_vm and the generated δ_TD to the Global Net, which updates W_a and W_v with the policy gradient and gradient descent. The Global Net then passes W_a and W_v back to Actor(m) and Critic(m), so that they continue learning with the new parameters.

4.2 A3C learning with neural networks

The multilayer feed-forward neural network [27, 28], also known as the BP neural network, is a multilayer feed-forward network trained with the back-propagation algorithm. It has a strong ability for nonlinear mapping and is suitable for problems with complex internal mechanisms. Therefore, the method uses two BP neural networks to realize the learning of the policy function and the value function, respectively. The network structure is as follows.

As shown in Fig. 3, the Actor network has three layers. The first layer is the input layer; the input vector S = [e_m(t), Δe_m(t), Δ²e_m(t)]^T is the state vector. The second layer is the hidden layer, whose input is:


h_{ik}(t) = \sum_{i=1}^{n} w_{ik}\,x_i(t) - b_k, \quad k = 1, 2, 3, \ldots, 20    (3)

where k indexes the neurons in the hidden layer, w_ik are the weights connecting the input layer to the hidden layer, and b_k is the bias of the k-th neuron. The output of the hidden layer is:

h_{ok}(t) = \min(\max(h_{ik}(t), 0), 6), \quad k = 1, 2, 3, \ldots, 20    (4)

The third layer is the output layer. The input of the output layer is:

y_{io}(t) = \sum_{j=1}^{k} w_{ho}\,h_{oj} - b_o, \quad o = 1, 2, 3    (5)

where o indexes the neurons in the output layer, w_ho are the weights connecting the hidden layer to the output layer, and b_o is the bias of the o-th neuron. The output of the output layer is:

y_{oo}(t) = \log\!\left(1 + e^{y_{io}(t)}\right), \quad o = 1, 2, 3    (6)

The Actor network does not output the values of K_p, K_i and K_d directly; instead it outputs the mean and variance of the three parameters, and the actual values of K_p, K_i and K_d are then sampled from the Gaussian distribution.

Fig. 3 Actor network structure of actor–critic
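To make Eqs. (3)–(6) concrete, the sketch below runs one forward pass of a 3–20–3 Actor network in NumPy: the clipped-ReLU hidden activation of Eq. (4), the softplus output of Eq. (6), and Gaussian sampling of K_p, K_i and K_d. The random initial weights and the per-gain standard deviation are our own assumptions; the paper only states that the Actor outputs the means and variances of the three gains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-20-3 Actor network (Eqs. 3-6); in practice the weights are learned.
W1 = rng.normal(scale=0.1, size=(3, 20))   # input -> hidden weights w_ik
b1 = np.zeros(20)                          # hidden biases b_k
W2 = rng.normal(scale=0.1, size=(20, 3))   # hidden -> output weights w_ho
b2 = np.zeros(3)                           # output biases b_o
log_sigma = np.full(3, -2.0)               # assumed per-gain log std (not specified in the paper)

def actor_forward(s):
    """Map state s = [e, de, d2e] to sampled PID gains (Kp, Ki, Kd)."""
    h_in = s @ W1 - b1                      # Eq. (3): hidden-layer input
    h_out = np.clip(h_in, 0.0, 6.0)         # Eq. (4): ReLU clipped at 6
    y_in = h_out @ W2 - b2                  # Eq. (5): output-layer input
    mu = np.log1p(np.exp(y_in))             # Eq. (6): softplus keeps the gains positive
    sigma = np.exp(log_sigma)
    return rng.normal(mu, sigma), mu, sigma  # Gaussian sampling of Kp, Ki, Kd

gains, mu, sigma = actor_forward(np.array([0.1, 0.02, -0.01]))
```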
The Critic network structure is similar to the Actor network structure. As shown in Fig. 4, the Critic network also uses a BP neural network with a three-layer structure. The first two layers are the same as in the Actor network. The output layer of the Critic network has only one node, which outputs the value function V(S_t; W'_v) of the state.

Fig. 4 Critic network structure of actor–critic

In the A3C structure, the Actor and Critic networks use the n-step TD error method [29, 30] to learn the action probability function and the value function. In this learning method, the n-step TD error δ_TD is computed as the difference between the value estimate V(S_t; W'_v) of the initial state and the estimate obtained after n steps, as follows:

\delta_{TD} = q_t - V(S_t; W'_v)

q_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(S_{t+n}; W'_v)    (7)

Here 0 < γ < 1 is the discount factor, which sets the ratio between delayed and immediate returns, and W'_v is the weight of the Critic network. The TD error δ_TD reflects the quality of the actions selected by the Actor network. The learning performance of the system is:

E(t) = \frac{1}{2}\,\delta_{TD}^2(t)    (8)

After calculating the TD error, each AC network in the A3C structure does not update its own network weights directly, but updates the AC network parameters of the central network (Global Net) with its own gradient. The update formulas are as follows:

W_a = W_a + \alpha_a \left( dW_a + \nabla_{W'_a} \log \pi(a \mid s; W'_a)\,\delta_{TD} \right)    (9)

W_v = W_v + \alpha_c \left( dW_v + \frac{\partial \delta_{TD}^2}{\partial W'_v} \right)    (10)

where W_a is the weight of the Actor network stored by the central network, W'_a represents the weights of the Actor network in each AC structure, W_v is the weight of the Critic network in the central network, and W'_v represents the Critic network weights of each AC structure; α_a is the learning rate of the Actor and α_c is the learning rate of the Critic.
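The n-step bookkeeping of Eq. (7) and the push of local gradients into the Global Net implied by Eqs. (9)–(10) can be sketched as follows. The actual gradient computation is framework-dependent, so it is abstracted away; only the bootstrapped return, the TD error and the additive central update are shown:

```python
def n_step_return(rewards, v_end, gamma=0.9):
    """q_t = r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * V(S_{t+n}), Eq. (7)."""
    q = v_end
    for r in reversed(rewards):       # accumulate from the last reward backwards
        q = r + gamma * q
    return q

def td_error(q_t, v_start):
    """delta_TD = q_t - V(S_t; W'_v), Eq. (7); E = 0.5*delta^2 as in Eq. (8)."""
    return q_t - v_start

def push_to_global(global_weights, local_gradients, lr):
    """Additive Global-Net update in the spirit of Eqs. (9)-(10): W <- W + lr * local gradient."""
    return [w + lr * g for w, g in zip(global_weights, local_gradients)]
```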


4.3 The network initialization of the A3C-PID controller

The initial parameters of the network directly affect the stability of the closed-loop control system. However, it is difficult for the neural network PID controller to obtain the teacher's signal, so the network parameters have to be determined by experience or manual trial. The unsupervised learning characteristics of reinforcement learning enable the controller to obtain the optimal initial network parameters through iterative learning. However, the AC-PID controller converges slowly because of the correlation between the learning samples obtained by the AC algorithm. The A3C-PID controller learns the network parameters asynchronously in multiple threads, which breaks the correlation of the samples and improves the convergence rate. The learning of the A3C-PID network parameters is similar to the process described in Sect. 4.1, except that during iterative learning A3C-PID sets m to the number of CPU cores of the computer, and m is then set to one for online control.

4.4 Working process of A3C-PID controller

Based on the asynchronous learning architecture and the learning mode that takes the n-step TD error as the performance measure, the working process of the A3C-PID controller is as follows (a code sketch of one worker thread is given after the list):

(a) Set the sampling period t_s, the number of threads m of the A3C algorithm and the update period n, and initialize the network parameters of each AC structure through K iterations of learning;
(b) Calculate the errors of the system and construct the state vectors as inputs to Actor(m) and Critic(m);
(c) Critic(m) outputs V(S_t; W'_v);
(d) Actor(m) outputs the values of K_p, K_i and K_d. The system then observes the error e_m(t+1) at the next sampling time and calculates the reward r_m(t) according to Eq. (2);
(e) Determine whether to update the parameters of Actor(m) and Critic(m). If the update period n has been reached, the Critic outputs the state value V(S_{t+n}; W'_v) and the system updates the parameters W_a and W_v of the Global Net according to Eqs. (9) and (10); otherwise, return to step (d);
(f) The Global Net transmits the new parameters W'_am and W'_vm to each Actor(m) and Critic(m);
(g) Determine whether the end condition is satisfied; if so, exit the control loop; otherwise, update S_m(t) and return to step (c).
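Under the assumptions above, one worker thread can be sketched as below. The env, actor, critic and global_net objects and their methods are placeholders standing for the plant-plus-PID loop, the two BP networks and the central network; this is an illustration of steps (a)–(g), not the authors' implementation:

```python
def a3c_pid_worker(env, actor, critic, global_net, n=30, gamma=0.9):
    """One A3C-PID worker thread following steps (a)-(g); a sketch under assumed interfaces."""
    s = env.reset()                          # state vector [e, de, d2e], step (b)
    while not env.done():                    # step (g): end condition
        states, rewards = [], []
        for _ in range(n):                   # collect an n-step segment, steps (c)-(d)
            gains = actor.sample_gains(s)    # Kp, Ki, Kd from the Gaussian policy
            s_next, r = env.step(gains)      # apply incremental PID, observe e_m(t+1), Eq. (2)
            states.append(s)
            rewards.append(r)
            s = s_next
        q = critic.value(s)                  # bootstrap with V(S_{t+n}; W'_v), step (e)
        for s_k, r_k in zip(reversed(states), reversed(rewards)):
            q = r_k + gamma * q              # backed-up n-step return, Eq. (7)
            delta = q - critic.value(s_k)    # n-step TD error
            actor.accumulate_grad(s_k, delta)
            critic.accumulate_grad(s_k, delta)
        global_net.apply(actor.grads, critic.grads)   # central update, Eqs. (9)-(10)
        actor.load(global_net.actor_weights)          # step (f): pull new parameters
        critic.load(global_net.critic_weights)
```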

5 Experiments

5.1 Simulation experiment of nonlinear signal

In order to verify the effectiveness and superiority of the algorithm, a nonlinear object is simulated and analyzed with the PID, CSP-PID, MN-PID, AC-PID and A3C-PID controllers, respectively. The discrete model of the object is:

y(k+1) = f\big(y(k), y(k-1), y(k-2), u(k), u(k-1)\big)    (11)

where

f(x_1, x_2, x_3, x_4, x_5) = \frac{x_1 x_2 x_3 x_5 (x_3 - 1) + x_4}{1 + x_3^2 + x_2^2}

The reference input rin is:

rin(k) = \begin{cases} 0.5\sin(\pi k/25), & k < 250 \\ 0.5, & 250 \le k < 460 \\ 0.5, & 460 \le k < 660 \\ 0.5, & 660 \le k < 870 \\ 0.3\sin(\pi k/25) + 0.4\sin(\pi k/32) + 0.3\sin(\pi k/40), & 870 \le k < 1000 \end{cases}    (12)

The parameters of the nonlinear signal simulation are set as follows: the sampling period is 1 s, m = 4, α_a = 0.001, α_c = 0.01, ε = 0.001, γ = 0.9, n = 30, K = 3000. The root mean square error (RMSE) and the mean absolute error (MAE) are used to describe the accuracy of the controllers.
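For reproducibility, the benchmark plant of Eq. (11) and the reference input of Eq. (12) translate directly into code; the expression for f follows the reconstruction given after Eq. (11), and the constant setpoint segments keep the values printed in the text:

```python
import numpy as np

def f(x1, x2, x3, x4, x5):
    """Nonlinear plant map of Eq. (11): y(k+1) = f(y(k), y(k-1), y(k-2), u(k), u(k-1))."""
    return (x1 * x2 * x3 * x5 * (x3 - 1.0) + x4) / (1.0 + x3**2 + x2**2)

def rin(k):
    """Reference input of Eq. (12)."""
    if k < 250:
        return 0.5 * np.sin(np.pi * k / 25)
    if k < 460:
        return 0.5        # constant setpoint segments, values as given in the text
    if k < 660:
        return 0.5
    if k < 870:
        return 0.5
    return 0.3 * np.sin(np.pi * k / 25) + 0.4 * np.sin(np.pi * k / 32) + 0.3 * np.sin(np.pi * k / 40)
```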


The simulation results are shown in Figs. 5, 6, 7, 8 and 9 and Table 1.

Fig. 5 Position tracking of PID
Fig. 6 Position tracking of CSP-PID
Fig. 7 Position tracking of MN-PID
Fig. 8 Position tracking of AC-PID
Fig. 9 Position tracking of A3C-PID

Table 1 The comparison of controller performance

Controller   RMSE     MAE
PID          0.1547   0.0705
CSP-PID      0.1201   0.0620
MN-PID       0.1203   0.0628
AC-PID       0.1196   0.0621
A3C-PID      0.0884   0.0326

The simulation results show that the A3C-PID controller reaches the minimum root mean square error (RMSE) and mean absolute error (MAE) values. Compared with the other controllers, the control accuracy of A3C-PID is higher. This not only proves that the design of the new PID controller is reasonable, but also shows that the controller has better control performance for the nonlinear system.

5.2 Simulation experiment of inverted pendulum

The control of the single inverted pendulum is a classic problem in control studies. The control task is to apply a force F to the bottom of the cart so that the cart stays at the set position while the angle between the rod and the vertical line stays within a deviation range.

Fig. 10 The structure of single inverted pendulum


Figure 10 shows the single inverted pendulum. As shown in Fig. 10, the mass of the cart is M, the mass of the pendulum is m, the position of the cart is x and the angle of the pendulum is θ. The equations of the single inverted pendulum are obtained as Eqs. (13) and (14):

\ddot{\theta} = \frac{m(m+M)gl}{(m+M)I + mMl^2}\,\theta - \frac{ml}{(m+M)I + mMl^2}\,F    (13)

\ddot{x} = -\frac{m^2 g l^2}{(m+M)I + mMl^2}\,\theta + \frac{I + ml^2}{(m+M)I + mMl^2}\,F    (14)

where I = \frac{1}{12} m L^2, l = \frac{1}{2} L, and F is the force acting on the cart, which takes continuous values in [−10, 10]. The sampling period is 20 ms. The single inverted pendulum has four control indexes: pendulum angle, swing speed, cart position and cart speed. The initial conditions are:

\theta(0) = 10^{\circ}, \quad \dot{\theta}(0) = 0, \quad x(0) = 0.2, \quad \dot{x}(0) = 0    (15)

The expected final state is:

\theta = 0^{\circ}, \quad \dot{\theta} = 0, \quad x = 0, \quad \dot{x} = 0    (16)

In the simulation, the parameters of the inverted pendulum are as follows:

g = 9.8 m/s², M = 10 kg, m = 0.1 kg, L = 0.5 m, μ_c = 0.005, μ_p = 2 × 10⁻⁵

where μ_c is the friction coefficient of the cart relative to the guide rail and μ_p is the friction coefficient of the rod relative to the cart. The parameters of the A3C-PID controller are set as follows:

m = 4, α_a = 0.002, α_c = 0.01, ε = 0.001, γ = 0.9, n = 50
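A minimal forward-Euler integration of the linearized model in Eqs. (13)–(14), using the parameter values listed above and the 20 ms sampling period as the integration step; the friction coefficients do not appear in Eqs. (13)–(14) and are therefore omitted from this sketch:

```python
import math

# Parameters as listed in the text
g, M, m, L = 9.8, 10.0, 0.1, 0.5
l = L / 2.0
I = m * L**2 / 12.0
den = (m + M) * I + m * M * l**2           # common denominator of Eqs. (13)-(14)
dt = 0.02                                   # 20 ms sampling period

def pendulum_step(theta, theta_dot, x, x_dot, F):
    """One Euler step of the linearized inverted-pendulum model, Eqs. (13)-(14)."""
    theta_acc = (m * (m + M) * g * l * theta - m * l * F) / den
    x_acc = (-m**2 * g * l**2 * theta + (I + m * l**2) * F) / den
    theta_dot += theta_acc * dt
    theta += theta_dot * dt
    x_dot += x_acc * dt
    x += x_dot * dt
    return theta, theta_dot, x, x_dot

# Initial conditions of Eq. (15): 10 degrees, x = 0.2 m
state = (math.radians(10.0), 0.0, 0.2, 0.0)
```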
The results of the simulation are shown in Figs. 11 and 12. Figure 11 shows the response of the four control indicators of the inverted pendulum over 10 s. From Fig. 11, it can be seen that under A3C-PID control the inverted pendulum quickly reaches a stable state in all four control indicators. Figure 12 shows the output of the A3C-PID, AC-PID and traditional PID controllers. It can be seen that the A3C-PID controller has better system tracking performance than the traditional PID and AC-PID.

Fig. 11 The response of the four indexes with A3C-PID
Fig. 12 The output of the controllers: PID, AC-PID and A3C-PID

5.3 Position control of two-phase hybrid stepping motor

5.3.1 Closed-loop control structure of the stepping motor

The stepper motor is a low-speed permanent magnet synchronous motor. Rather than being driven directly by a pulse sequence, it is used in the digital control system as an angle actuating element whose excitation state is switched. The stepper motor is usually equipped with a photoelectric encoder, a rotary transformer or another measuring feedback element to achieve high-precision positioning in closed-loop control. The block diagram of the closed-loop servo control system is shown in Fig. 13. In Fig. 13, the inner loops are the current loop and the speed loop. The current loop tracks the current of the two-phase hybrid stepping motor, so that the motor can output torque smoothly under micro-stepping. The speed loop enables the load to track the set speed and achieves the effect of speed control. The outer loop is the position loop, which makes the load output track a given position. The position loop controller usually adopts PID control. Therefore, we added the A3C-PID controller to the position loop to test the validity of the controller.


Fig. 13 The closed-loop servo control system of the hybrid stepping motor (input position → position controller → speed controller → torque controller → current controller → power drive → motor and load, with current and position feedback)

5.3.2 Modeling and simulation of the two-phase hybrid stepping motor

In this paper, a two-phase hybrid stepping motor is used as the controlled object in the simulation experiment. First, a mathematical model needs to be established. However, the two-phase hybrid stepping motor is a highly nonlinear electromechanical device, so it is difficult to describe it accurately. Therefore, the mathematical model of the two-phase hybrid stepping motor studied in this paper is simplified under the following assumptions: the flux linkage in the phase windings of the permanent magnet varies sinusoidally with the rotor position; magnetic hysteresis and the eddy current effect are not considered; only the mean and fundamental components of the air-gap magnetic conductance are considered; and the mutual inductance between the two phase windings is ignored. On this basis, the mathematical model of the two-phase hybrid stepping motor can be described by Eqs. (17)–(21):

u_a = L \frac{di_a}{dt} + R i_a - k_e \omega \sin(N_r \theta)    (17)

u_b = L \frac{di_b}{dt} + R i_b - k_e \omega \sin(N_r \theta)    (18)

T_e = k_e i_a \sin(N_r \theta) + k_e i_b \cos(N_r \theta)    (19)

J \frac{d\omega}{dt} + B \omega + T_L = T_e    (20)

\frac{d\theta}{dt} = \omega    (21)

In the above formulas, u_a and u_b are the two-phase voltages and i_a and i_b the currents of phases A and B, R is the winding resistance, L is the winding inductance, k_e is the torque coefficient, θ and ω are the rotation angle and angular velocity of the motor, N_r is the number of rotor teeth, T_e is the electromagnetic torque of the hybrid stepping motor, T_L is the load torque, and J and B are the load moment of inertia and the viscous friction coefficient, respectively. It can be seen from this mathematical model that, even under these simplifying assumptions, the two-phase hybrid stepping motor remains a highly nonlinear and coupled system.
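Equations (17)–(21) define a four-dimensional state (i_a, i_b, ω, θ). The sketch below evaluates their right-hand sides so that any explicit integrator can step the model; the default parameter values are taken from the simulation settings given later in this section (with J = 2 g·cm² converted to kg·m²), and the choice of integration step is left to the reader:

```python
import math

def stepper_derivatives(state, u_a, u_b, R=8.0, L=0.5, ke=17.5, Nr=50, J=2e-7, B=0.0, TL=0.0):
    """Right-hand side of Eqs. (17)-(21) for the two-phase hybrid stepping motor.

    state = (i_a, i_b, omega, theta). Parameter defaults follow the values used in
    the simulation of this section; they are placeholders, not a general motor model.
    """
    i_a, i_b, omega, theta = state
    di_a = (u_a - R * i_a + ke * omega * math.sin(Nr * theta)) / L          # Eq. (17)
    di_b = (u_b - R * i_b + ke * omega * math.sin(Nr * theta)) / L          # Eq. (18)
    Te = ke * i_a * math.sin(Nr * theta) + ke * i_b * math.cos(Nr * theta)  # Eq. (19)
    domega = (Te - B * omega - TL) / J                                      # Eq. (20)
    dtheta = omega                                                          # Eq. (21)
    return di_a, di_b, domega, dtheta
```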
Fig. 14 The simulation of the servo system

The simulation model of the two-phase hybrid stepping motor servo control system is built with Simulink in Matlab, as shown in Fig. 14. The parameters of the motor are as follows: L = 0.5 H, N_r = 50, R = 8 Ω, J = 2 g·cm², B = 0 N·m·s/rad, N = 100, T_L = 0, k_e = 17.5 N·m/A, where N is the reduction ratio of the harmonic reducer. The parameters of the A3C-PID controller are set as follows: m = 4, α_a = 0.001, t_s = 0.001 s, α_c = 0.01, ε = 0.001, γ = 0.9, n = 30, K = 3000. The results are shown in Figs. 15 and 16 and Table 2.

The dynamic performance of the A3C, BP and AC adaptive PID controllers is shown in Fig. 15. In the early stage of the simulation (20 cycles), the BP-PID controller has a faster response speed and a shorter rise time (12 ms), but it has a higher overshoot of 2.1705%. On the contrary, both the AC-PID and the A3C-PID controllers have smaller overshoots of 0.1571% and 0.1021%. However, the adjustment time of AC-PID is long (48 ms) and its rise time is 21 ms. In contrast, the A3C-PID controller has better stability and rapidity.


Fig. 15 Position tracking
Fig. 16 The result of controller parameter tuning
Fig. 17 Reward value curve of reinforcement learning

Table 2 The comparison of controller performance

Controller   Overshoot (%)   Rise time (ms)   Steady state error   Adjustment time (ms)
A3C-PID      0.1571          18               0                    33
AC-PID       0.1021          21               0                    48
BP-PID       2.1705          12               0                    32

Figure 16 shows the process of adaptive adjustment of the A3C-PID controller parameters. As can be seen from Fig. 16, the A3C-PID controller is able to adjust the PID parameters based on the errors in different periods. At the beginning of the simulation the tracking error of the system is large; to ensure a fast response speed, K_p is continuously increased while K_d is reduced, and the increase of K_i is limited to prevent a high overshoot. As the error decreases, K_p begins to decrease and K_i is gradually increased to eliminate the cumulative error, which at the same time causes a small amount of overshoot. Since K_d has a large influence on the system at this stage, it tends to remain stable. When the final tracking error reaches zero, K_p, K_i and K_d reach a steady state. These simulation results show that the A3C-PID controller has good adaptive capabilities.

The AC-PID and A3C-PID reward value curves are shown in Fig. 17. The goal of reinforcement learning is to learn the best strategy that maximizes the reward value U, computed as in Eq. (22):

U = E\!\left[\sum_{t=0}^{\text{end}} \gamma^t R(S_t)\right]    (22)

From the analysis of Fig. 17 we can conclude that after 3000 iterations the A3C-PID controller has a higher U value than the AC-PID controller. In addition, the U value of A3C-PID becomes stable after about 1800 iterations, while AC-PID converges only after about 2500 iterations. Therefore, A3C-PID has a faster convergence rate than AC-PID.
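Computed over one recorded episode, the criterion of Eq. (22) reduces to a simple discounted sum, for example:

```python
def discounted_reward(rewards, gamma=0.9):
    """U = sum_t gamma^t * R(S_t) over one recorded episode, Eq. (22)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```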
6 Conclusions


Machine learning and intelligent algorithms have been widely applied in many industrial fields [31–36]. The purpose of this paper has been to present our efforts to improve the convergence and adaptability of the adaptive PID controller. In this paper, a new PID controller based on the A3C algorithm is proposed. The controller uses BP neural networks to approximate the policy function and the value function; the BP neural network has a strong ability for nonlinear mapping, which enhances the adaptive ability of the controller. The learning speed of the A3C-PID controller is accelerated by parallel training on multiple CPU threads. The asynchronous multi-thread training reduces the correlation of the training data and makes the controller more stable and adaptable. Our experiments with a nonlinear signal and an inverted pendulum demonstrate that the A3C-PID controller has higher control accuracy than the other PID controllers. The experiments on the position control of a two-phase hybrid stepping motor show that the A3C-PID controller performs well in terms of overshoot, rise time, steady-state error and adjustment time. This work confirms the effectiveness and application significance of the new method. Our aim is to apply the controller to multi-axis motion control and actual industrial production.

Acknowledgements This work was supported by the National Science and Technology Major Project of China (Grant Number 2017ZX05009-001).

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References

1. Adel, T., & Abdelkader, C. (2013). A particle swarm optimization approach for optimum design of PID controller for nonlinear systems. In International conference on electrical engineering and software applications (pp. 1–4). IEEE.
2. Savran, A. (2013). A multivariable predictive fuzzy PID control system. Applied Soft Computing, 13(5), 2658–2667.
3. Jiang, D., Wang, W., Shi, L., & Song, H. (2018). A compressive sensing-based approach to end-to-end network traffic reconstruction. IEEE Transactions on Network Science and Engineering, 5(3), 1–12.
4. Jiang, D., Huo, L., & Li, Y. (2018). Fine-granularity inference and estimations to network traffic for SDN. PLoS ONE, 13(5), 1–23.
5. Zhang, X., Bao, H., Du, J., & Wang, C. (2014). Application of a new membership function in nonlinear fuzzy PID controllers with variable gains. Information and Control, 5, 1–7.
6. Caocang, L., & Cuifang, Z. (2015). Adaptive neuron PID control based on minimum resource allocation network. Application Research of Computers, 32(1), 167–169.
7. Sheng, X., Jiang, T., Wang, J., et al. (2015). Speed-feed-forward PID controller design based on BP neural network. Journal of Computer Applications, 35(S2), 134–137.
8. Wang, X. S., Cheng, Y. H., & Wei, S. (2007). A proposal of adaptive PID controller based on reinforcement learning. Journal of China University of Mining and Technology, 17(1), 40–44.
9. Huo, L., Jiang, D., & Lv, Z. (2018). Soft frequency reuse-based optimization algorithm for energy efficiency of multi-cell networks. Computers and Electrical Engineering, 66(2), 316–331.
10. Akbarimajd, A. (2015). Reinforcement learning adaptive PID controller for an under-actuated robot arm. International Journal of Integrated Engineering, 7(2), 20–27.
11. Chen, X. S., & Yang, Y. M. (2011). A novel adaptive PID controller based on actor–critic learning. Control Theory and Applications, 28(8), 1187–1192.
12. Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., et al. (2016). An actor–critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
13. Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., et al. (2016). Sample efficient actor–critic with experience replay. arXiv preprint arXiv:1611.01224.
14. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., et al. (2016). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928–1937).
15. Jiang, D., Huo, L., Lv, Z., Song, H., & Qin, W. (2018). A joint multi-criteria utility-based network selection approach for vehicle-to-infrastructure networking. IEEE Transactions on Intelligent Transportation Systems, 19(10), 3305–3319.
16. Liu, Q., Zhai, J. W., Zhang, Z. Z., & Zhong, S. (2018). A survey on deep reinforcement learning. Chinese Journal of Computers, 41(01), 1–27.
17. Qin, R., Zeng, S., Li, J. J., & Yuan, Y. (2015). Parallel enterprises resource planning based on deep reinforcement learning. Zidonghua Xuebao/Acta Automatica Sinica, 43(9), 1588–1596.
18. Jiang, D., Wang, Y., Lv, Z., Qi, S., & Singh, S. (2019). Big data analysis-based network behavior insight of cellular networks for industry 4.0 applications. IEEE Transactions on Industrial Informatics. https://doi.org/10.1109/TII.2019.2930226.
19. Tang, H. C., Li, Z. X., Wang, Z. T., et al. (2005). A fuzzy PID control system. Electric Machines and Control, 2, 136–138.
20. Sun, J. P., Yan, L., Li, Y., et al. (2006). Design of fuzzy PID controllers based on improved genetic algorithms. Chinese Journal of Scientific Instrument, S3, 1991–1992.
21. Zhu, Y. H., Xue, L. Y., & Huang, W. (2011). Design of fuzzy PID controller based on self-organizing adjustment factors. Journal of System Simulation, 23(12), 2732–2737.
22. Liao, F. F., & Xiao, J. (2005). Research on self-tuning of PID parameters based on BP neural networks. Acta Simulata Systematica Sinica, 07, 1711–1713.
23. Li, G. Y., & Chen, X. L. (2008). Neural network self-learning PID controller based on real-coded genetic algorithm. Micromotors Servo Technique, 1, 43–45.
24. Patel, R., & Kumar, V. (2015). Multilayer neuro PID controller based on back propagation algorithm. Procedia Computer Science, 54, 207–214.
25. Nie, S. K., Wang, Y. J., Xiao, S., & Liu, Z. (2017). An adaptive chaos particle swarm optimization for tuning parameters of PID controller. Optimal Control Applications and Methods, 38(6), 1091–1102.
26. Aziz Khater, A., El-Bardini, M., & El-Rabaie, N. M. (2015). Embedded adaptive fuzzy controller based on reinforcement learning for DC motor with flexible shaft. Arabian Journal for Science and Engineering, 40(8), 2389–2406.
27. Liu, Z., Zeng, X., Liu, H., & Chu, R. (2015). A heuristic two-layer reinforcement learning algorithm based on BP neural networks. Journal of Computer Research and Development, 52(3), 579–587.
28. Zhu, J., Song, Y., Jiang, D., & Song, H. (2018). A new deep-Q-learning-based transmission scheduling mechanism for the cognitive Internet of things. IEEE Internet of Things Journal, 5(4), 2375–2385.
29. Xu, X., Zuo, L., & Huang, Z. (2014). Reinforcement learning algorithms with function approximation: Recent advances and applications. Information Sciences, 261, 1–31.
30. Jiang, D., Huo, L., & Song, H. (2018). Rethinking behaviors and activities of base stations in mobile cellular networks based on big data analysis. IEEE Transactions on Network Science and Engineering, 1(1), 1–12.
31. Wang, F., Jiang, D., & Qi, S. (2019). An adaptive routing algorithm for integrated information networks. China Communications, 7(1), 196–207.
32. Huo, L., & Jiang, D. (2019). Stackelberg game-based energy-efficient resource allocation for 5G cellular networks. Telecommunication System, 23(4), 1–11.
33. Jiang, D., Zhang, P., Lv, Z., & Song, H. (2016). Energy-efficient multi-constraint routing algorithm with load balancing for smart city applications. IEEE Internet of Things Journal, 3(6), 1437–1447.
34. Jiang, D., Li, W., & Lv, H. (2017). An energy-efficient cooperative multicast routing in multi-hop wireless networks for smart medical applications. Neurocomputing, 220, 160–169.
35. Wang, F., Jiang, D., Wen, H., & Song, H. (2019). Adaboost-based security level classification of mobile intelligent terminals. The Journal of Supercomputing, 75, 1–19.
36. Sun, M., Jiang, D., Song, H., & Liu, Y. (2017). Statistical resolution limit analysis of two closely-spaced signal sources using Rao test. IEEE Access, 5, 22013–22022.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Qifeng Sun was born in 1976, Ph.D. He graduated from China University of Petroleum and is now a lecturer at the College of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China. His research interests include intelligent control, machine learning, etc.

Chengze Du was born in 1996, Postgraduate. He is a researcher at the College of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China. His research interests include deep learning, industrial applications, etc.

Youxiang Duan was born in 1964, Ph.D. He graduated from China University of Petroleum and is now a professor at the College of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China. His research interests include service computing, intelligent control, machine learning, etc.

Hui Ren was born in 1993 and received his B.S. degree in Electrical Engineering from China University of Petroleum, China. His research interests include intelligent control, actor–critic learning, etc.

Hongqiang Li was born in 1975. He is a Senior Engineer at the Shengli Drilling Institute of Sinopec Group, Dongying 257000, China. He has been engaged in the field service of instruments while drilling. He is dedicated to research on control accuracy analysis of horizontal wells with large displacement, azimuth gamma imaging while drilling, diameter measurement while drilling, and array imaging measurement methods while drilling.