Faculty of Engineering and Information Technology (FEIT), School of Computer Science, Australian Artificial Intelligence Institute,
University of Technology Sydney, Sydney, NSW, Australia
In formation control, a robot (or an agent) learns to arrange itself in a particular spatial alignment. However, in a few scenarios, it is also vital to learn temporal alignment along with spatial alignment. An effective control system encompasses flexibility, precision, and timeliness. Existing reinforcement learning algorithms excel at learning to select an action given a state. However, executing an optimal action at an appropriate time remains challenging. Building a reinforcement learning agent that can learn an optimal time to act along with an optimal action can address this challenge. Neural networks in which timing relies on dynamic changes in the activity of a population of neurons have been shown to be a more effective representation of time. In this work, we trained a reinforcement learning agent to create its own representation of time using a neural network with a population of recurrently connected nonlinear firing rate neurons. Trained using a reward-based recursive least squares algorithm, the agent learned to produce a neural trajectory that peaks at the "time-to-act"; thus, it learns "when" to act.
“when” to act. A few control system applications also require the agent to temporally
scale its action. We trained the agent so that it could temporally scale its action for different speed inputs. Furthermore, given one state, the agent could learn to plan multiple future actions, that is, multiple times to act, without needing to observe a new state.

Keywords: reinforcement learning, recurrent neural network, time perception, formation control, temporal scaling

Edited by: Qin Wang, Yangzhou University, China
Reviewed by: Peng Liu, North University of China, China; Tianhong Liu, Yangzhou University, China
*Correspondence: Chin-Teng Lin, chin-teng.lin@uts.edu.au
Specialty section: This article was submitted to Nonlinear Control, a section of the journal Frontiers in Control Engineering
Received: 08 June 2021; Accepted: 12 July 2021; Published: 06 August 2021
Citation: Akella A and Lin C-T (2021) Time and Action Co-Training in Reinforcement Learning Agents. Front. Control. Eng. 2:722092. doi: 10.3389/fcteg.2021.722092

1 INTRODUCTION

A powerful formation control system requires continuously monitoring the current state, comparing the performance, and deciding whether to take necessary actions. This process not only needs to understand the system's state and optimal actions but also needs to learn the appropriate time to perform an action. Deep reinforcement learning algorithms, which have achieved remarkable success in the fields of robotics, games, and board games, have also been shown to perform well in adaptive control system problems Li et al. (2019); Oh et al. (2015); Xue et al. (2013). However, the challenge of learning the precise time to act has not been directly addressed.

The ability to measure time from the start of a state change and use it accordingly is an essential part of applications such as adaptive control systems. In general, the environment is encoded in four dimensions: the three dimensions of space and the dimension of time. The representation of time affects the decision-making process along with the spatial aspects of the environment Klapproth (2008). However, in the field of reinforcement learning (RL), the essential role of time is not explicitly acknowledged, and existing RL research mainly focuses on the spatial dimensions. The lack of a time sense might not be an issue when considering a simple behavioral task, but many tasks in control
FIGURE 2 | Proposed reinforcement learning architecture. (A) State input is received by the agent over an episode with a length of 3,600 ms. The agent contains an RNN (B) and a deep Q-network (C). The RNN receives a continuous input signal with state values for 20 ms and zeros for the remaining time. The state values shown here are s1 = 1.0, s2 = 1.5, s3 = 2.0, s4 = 2.5, and s5 = 3. The weights W^In (the orange connections) are initialized randomly and held constant throughout the experiment. The weights W^Rec and W^Out (the blue connections) are initialized randomly and trained over the episodes. The DQN, with one input and four output nodes, receives the state value as its input and outputs the Q-value for each circle.
action given a specific state. The RNN and DQN are co-trained to learn the time to act and the action. The RNN was trained using a reward-based recursive least squares algorithm, and the DQN was trained using the Bellman equation. The results of a series of task-switching scenarios show that the agent learned to produce a neural trajectory, reflecting its own sense of time, that peaked at the correct time-to-act. Furthermore, the agent was able to temporally scale its time-to-act more quickly or more slowly according to the input speed. We also compared the performance of the proposed architecture with DNN models such as the LSTM, which can implicitly represent time. We observed that for tasks involving precisely timed action, neural network models such as the population clock model perform better than the LSTM.

This article first presents the task-switching scenario and describes the proposed architecture and training methodology used in the work. Section 3 presents the performance of the trained RL agent on six different experiments. In Section 4, we present the performance of the LSTM in comparison with the proposed model. Finally, Section 5 presents an extensive discussion about the learned time representation with respect to prior electrophysiology studies.

2 METHODS

2.1 Task-Switching Scenario
In the scenario, there are n different circles, and the agent must learn to click on each circle within a specific time interval and in a specific order. This task involves learning to decide which circle to click and when that circle should be clicked. Figure 1 shows an example scenario with four circles. Circle 1 must be clicked at some point between 800 and 900 ms. Similarly, circles 2, 3, and 4 must be clicked at 1,500–1,600, 2,300–2,400, and 3,300–3,400 ms, respectively. If the agent clicks the correct circle in the correct time period, it receives a positive reward. If it clicks a circle at the incorrect time, it receives a negative reward (refer to Table 1 for the exact reward values). Each circle becomes inactive once its time interval has passed. For example, circle 1 in Figure 1 becomes inactive at 901 ms, meaning that the agent cannot click it after 900 ms and receives a reward of 0 if it attempts to click the inactive circle. Each circle can only be clicked once during an episode.

The same scenario was modified to conduct the experiments described in the following sections.
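For concreteness, the base reward scheme described above can be sketched as a minimal helper function. The time windows follow the four-circle example; the reward magnitudes, the helper name, and the handling of repeated clicks are illustrative assumptions rather than the paper's exact settings (the exact values are listed in Table 1, which is not reproduced here):

```python
# Hypothetical sketch of the task-switching reward scheme described in Section 2.1.
# Time windows follow the four-circle example; reward magnitudes are assumed placeholders.
TIME_WINDOWS_MS = {1: (800, 900), 2: (1500, 1600), 3: (2300, 2400), 4: (3300, 3400)}

def reward_for_click(circle: int, t_ms: float, clicked: set) -> float:
    """Return the reward for clicking `circle` at time `t_ms` (one click per circle)."""
    start, end = TIME_WINDOWS_MS[circle]
    if circle in clicked:
        return 0.0        # each circle can only be clicked once (assumed to yield 0)
    if t_ms > end:
        return 0.0        # circle is inactive once its window has passed
    clicked.add(circle)
    if start <= t_ms <= end:
        return +1.0       # assumed positive reward (exact value in Table 1)
    return -1.0           # assumed negative reward for clicking at the wrong time

# Example: clicking circle 1 at 850 ms is rewarded; a second click on it yields 0.
clicked = set()
print(reward_for_click(1, 850, clicked))   # +1.0
print(reward_for_click(1, 950, clicked))   # 0.0 (already clicked)
```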
FIGURE 3 | Trained activity of four different scenarios. Each scenario contains different times to act. Each colored bar represents the time-to-act interval. The
orange line in each figure represents the threshold (0.5).
FIGURE 4 | RNN with speed as the input and state input (A) to the RNN (B).
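The "what to act" side of the agent relies on standard Q-learning, summarized by Eqs 1–5 that follow (and later by the DQN update in Eqs 9–10). As a hedged illustration, the sketch below uses a tabular Q-function as a stand-in for the paper's fully connected deep Q-network; the state and action sizes and the learning parameters are assumed values:

```python
import numpy as np

# Illustrative tabular Q-learning update (Eqs 3-5); the paper itself uses a fully
# connected deep Q-network, so this table is only a stand-in for exposition.
n_states, n_actions = 5, 4            # e.g., five state inputs, four circles (assumed)
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99              # learning rate and discount factor c (assumed values)

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward the Bellman target r + c * max_a' Q(s', a')."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def greedy_action(s):
    """Eq 5: pick the action with the highest estimated Q-value in state s."""
    return int(np.argmax(Q[s]))

q_update(s=0, a=1, r=1.0, s_next=1)
print(greedy_action(0))               # action 1 after the single positive update
```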
R_t = \sum_{t=1}^{\infty} c^t r_t    (1)

where c ∈ [0, 1] is the discount factor that determines the importance of the immediate reward and the future reward. If c = 0, the agent will learn to choose actions that produce an immediate reward. If c = 1, the agent will evaluate its actions based on the sum of all its future rewards. To learn the sequence of actions that leads to the maximum discounted sum of future rewards, an agent estimates optimal values for all possible actions in a given state. These estimated values are defined by the expected sum of future rewards under a given policy π.

Q_\pi(s, a) = E_\pi\{ R_t \mid s_t = s, a_t = a \}    (2)

where E_π is the expectation under the policy π, and Q_π(s, a) is the expected sum of discounted rewards when the action a is chosen by the agent in the state s under a policy π. Q-learning Watkins and Dayan (1992) is a widely used reinforcement learning algorithm that enables the agent to update its Q_π(s, a) estimate iteratively by using the following formula:

Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) + \alpha \left[ r_t + c \max_a Q_\pi(s_{t+\delta t}, a) - Q_\pi(s_t, a_t) \right]    (3)

where α is the learning rate, and Q_π(s_{t+1}, a) is the future value estimate. By iteratively updating the Q values based on the agent's experience, the Q function converges to the optimal Q function, which satisfies the following Bellman optimality equation:

Q_\pi^*(s, a) = E\left[ r_t + c \max_{a'} Q_\pi^*(s', a') \right]    (4)

where π* is the optimal policy. Action a can be determined as follows:

a = \arg\max_a Q^*(s, a)    (5)

When the state space and the action space are discrete and finite, the Q function can be a table that contains all possible state-action values. However, when the state and action spaces are large or continuous, a neural network is commonly used as a Q-function approximator Mnih et al. (2015); Lillicrap et al. (2015). In this work, we model a reinforcement learning agent which uses a fully connected DNN as a Q-function approximator to select one of the four circles.

2.2.2 Recurrent Neural Network
In this study, we used the population clock model for training the RL agent to learn the representation of time. In previous studies, this model has been shown to robustly learn and generate simple-to-complex temporal patterns Laje and Buonomano (2013); Hardy et al. (2018). The population clock model (i.e., the RNN) contains a pool of recurrently connected nonlinear firing rate neurons with random initial weights, as shown at the top of Figure 2. To achieve "time-to-act" and temporal scaling of timing behavior, we trained the weights of both recurrent neurons and output neurons. The network we used in this study contained 300 recurrent neurons, as indicated by the blue neurons inside the green circle, plus one input and one output neuron. The dynamics of the network Sompolinsky et al. (1988) are governed by Eqs 6–8. Learning performance was similar with a larger number of neurons and started to decline when 200 neurons were used.
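As a rough illustration of how such firing-rate dynamics (Eqs 6–8 below) can be simulated, the following sketch uses simple Euler integration. N = 300, the 300-neuron pool, the 0.5 threshold, and g = 1.6 follow the text; τ, Δt, the connection probability, the input pulse, and the output-weight scale are assumed values in the spirit of the cited population clock literature:

```python
import numpy as np

# Sketch of the population clock RNN (Eqs 6-8) integrated with the Euler method.
# N and g follow the text; tau, dt, p, the input pulse, and W_out scaling are assumed.
N, n_in = 300, 1
tau, dt, g, p = 50.0, 1.0, 1.6, 0.2          # ms, ms, network gain, connection probability
rng = np.random.default_rng(0)

mask = rng.random((N, N)) < p                # sparse recurrent connectivity
W_rec = rng.normal(0.0, g / np.sqrt(p * N), (N, N)) * mask
W_in = rng.normal(0.0, 1.0, (N, n_in))       # fixed input weights
W_out = rng.normal(0.0, 1.0 / np.sqrt(N), (1, N))

x = np.zeros(N)                              # neuron states, initially zero
act_times = []                               # time points where output crosses the 0.5 threshold
for t in range(3600):                        # one 3,600 ms episode
    y = np.array([1.0]) if t < 20 else np.zeros(1)   # 20 ms state pulse, then silence
    fr = np.tanh(x)                                  # Eq 8: firing rates
    x += (dt / tau) * (-x + W_rec @ fr + W_in @ y)   # Eq 6: Euler step of the dynamics
    z = float(W_out @ fr)                            # Eq 7: output activity
    if z > 0.5:
        act_times.append(t)   # before training, any threshold crossings are arbitrary
```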
FIGURE 5 | RNN activity with a training speed of 1 and test speeds of 0.01, 0.8, and 1.3. The colored bars indicate the expected time-to-act intervals.
\tau \frac{dx_i}{dt} = -x_i(t) + \sum_{j=1}^{N} W_{ij}^{Rec} fr_j(t) + \sum_{j=1}^{I} W_{ij}^{In} y_j(t)    (6)

z = \sum_{j=1}^{N} W_j^{Out} r_j    (7)

fr_i = \tanh(x_i)    (8)

Given a network that contains N recurrent neurons, fr_i represents the firing rate of the ith (i = 1, 2, ..., N) recurrent neuron. W^Rec, which is an N×N weight matrix, defines the connectivity of the recurrent neurons; it is initialized randomly from a normal distribution with a mean of 0 and a standard deviation of g/\sqrt{pN}, where g represents the gain of the network. Each input neuron is connected to every recurrent neuron in the network with W^In, which is an N×1 input weight matrix. W^In is initialized randomly from a normal distribution with a mean of 0 and a standard deviation of 1 and is fixed during training. Similarly, every recurrent neuron is connected to each output neuron with W^Out, which is a 1×N output weight matrix. In this study, we trained W^Rec and W^Out using a reward-based recursive least squares method. The variable y represents the activity level of the input neurons (states), and z represents the output. x_i(t) represents the state of the ith recurrent neuron, which is initially zero, and τ is the neuron time constant.

Initially, due to the high gain caused by W^Rec (when g = 1.6), the network produces chaotic dynamics, which in theory can encode time for a long time Hardy et al. (2018). In practice, the recurrent weights need to be tuned to reduce this chaos and locally stabilize the output activity. The parameters, such as the connection probability, Δt, g (the gain of the network), and τ, were chosen based on existing population clock model research Buonomano and Maass (2009); Laje and Buonomano (2013). In this work, we trained both the recurrent and output weights using a reward-based recursive least squares algorithm. During an episode, the agent chooses to act when the output activity exceeds a threshold (in this study, 0.5). We experimented with other threshold values between 0.4 and 1, but each produced
FIGURE 6 | Multiple times to act. The state input (A) and the output activity (B), which peaks at three different intervals after state s1 and at one interval after state s2. The colored bars indicate the correct time-to-act.
similar results to 0.5. If the activity never exceeds the threshold, then the agent chooses a random time point to act. This ensures that the agent tries different time points and acts before it learns the temporal nature of the task.

As illustrated in Figure 2 (left side), a sequence of state inputs is given to the agent during an episode lasting 3,600 ms, where each state is a 20-ms input signal for the RNN and a single value for the DQN. The agent receives state s1 at 0 ms. At this point, all circles are active. At 900 ms, the first circle turns inactive, and the agent receives state s2. In other words, the agent only receives the next state after the previous state has changed. In this case, the changes are caused by a circle turning inactive due to the time constraints preset in the task. The final state, s5, is a terminal state in which all the circles are inactive. Note that each action given by the Q network is only executed at the time points defined by the RNN.

2.3 Time and Action Co-Training in Reinforcement Learning Agent
At the start of an episode, the agent explores the environment by selecting random circles to click. At the end of the episode, the agent collects a set of experience tuples (s_t, a_t, r_{t+δt}, s_{t+δt}) that are used to train the DQN and the RNN.

2.3.1 DQN
The parameters θ of the Q network are iteratively updated using Eqs 9, 10 for action a_t taken in state s_t, which results in reward r_{t+δt}.

\theta_{t+1} = \theta_t + \alpha \left[ y - Q(s_t, a_t; \theta_t) \right] \nabla_{\theta_t} Q(s_t, a_t; \theta_t)    (9)

y = r_{t+1} + c \max_a Q(s_{t+1}, a; \theta_t)    (10)

2.3.2 Recurrent Neural Network
In the RNN, both the recurrent weights and the output weights were updated at every Δt = 10 ms using the collected experiences. The recursive least squares (RLS) algorithm Åström and Wittenmark (2013) is a basic recursive application of the least squares algorithm. Given an input signal x_1, x_2, ..., x_n and a set of desired responses y_1, y_2, ..., y_n, RLS updates the parameters W^Rec and W^Out to minimize the mean difference between the desired and the actual output of the RNN (which is the firing rate fr_i of the recurrent neuron). In the proposed architecture, we generate the desired response of a recurrent neuron by adding the reward to the firing rate fr_i(t) of neuron i at time t, such that the desired firing rate decreases at time t if r_t < 0 and increases if r_t > 0. The desired response of the output neuron was generated by adding the reward to the output activity z, as defined in Eq 7.

The error e_i^{rec}(t) of a recurrent neuron is computed using Eq 12, where fr_i(t) is the firing rate of neuron i at time t, and r_t is the reward received at time t. The desired signal fr_i(t) + reward(t) is clipped between R_min and R_max due to the high variance of the firing rate. The update of the parameters W^Rec is dictated by Eq 11, where W_{ij}^{Rec} is the recurrent weight between the ith neuron and the jth neuron. The exact values of Z_min, Z_max, R_min, and R_max are shown in Table 1. Z_min and Z_max act as clamping values for the desired output activity. So, in this study, the value of Z_max was chosen to be close to the positive threshold (+0.5), and the value of Z_min was chosen to be close to the negative threshold (−0.5). The parameter Δt was set based on existing population clock model research Buonomano and Maass (2009); Laje and Buonomano (2013).

In this study, we trained only a subset of recurrent neurons, which were randomly selected at the start of training. SubRec is the subset of randomly selected neurons from the population. For the experiments in this study, we selected 30% of the recurrent
FIGURE 7 | Results of the skip state test. The top figures show the state input (left) and the corresponding RNN output (right), where all states are present in the input. The bottom figures show the state input with the fourth state skipped (left), which results in subdued output activity from 3,200 to 3,300 ms (right).
neurons for training. The square matrix P^i governs the learning rate of the recurrent neuron i and is updated at every Δt using Eq 13.

W_{ij}^{Rec}(t) = W_{ij}^{Rec}(t - \Delta t) - e_i^{rec}(t) \sum_{k \in SubRec} P_{jk}^{i}(t)\, fr_k(t)    (11)

e_i^{rec}(t) = fr_i(t) - \max\left( R_{min}, \min\left( fr_i(t) + r_t, R_{max} \right) \right)    (12)

P^{i}(t) = P^{i}(t - \Delta t) - \frac{P^{i}(t - \Delta t)\, fr(t)\, fr'(t)\, P^{i}(t - \Delta t)}{1 + fr'(t)\, P^{i}(t - \Delta t)\, fr(t)}    (13)

The output weights W_{ij}^{Out} (the weight between recurrent neuron j and output neuron i) are also updated in a similar way; the error is calculated using Eq 14 as follows:

e_j^{out}(t) = z(t) - \max\left( Z_{min}, \min\left( z(t) + reward(t), Z_{max} \right) \right)    (14)

3 EXPERIMENTS

3.1 Different Scenarios
To understand the proficiency of this model, we trained and tested the agent on multiple different scenarios with different time intervals and different numbers of circles. We observed that the agent learned to produce a neural trajectory that peaked at the time-to-act intervals with near-perfect accuracy. Figure 3 demonstrates the learned neural trajectory for a few of the scenarios we trained. The colored bars in Figure 3 indicate the correct time-to-act intervals.

The proposed RNN training method exhibited some notable behavioral features, such as the following: 1) the agent learned to subdue its activity as soon as it observed a new state, analogous to restarting a clock, and 2) depending on the observed state, the
FIGURE 8 | RNN output when trained on a scenario with 20 circles. The colored bars indicate the expected time-to-act.
agent learned to ramp its activity to peak at the time-to-act. We also observed that the agent could learn to do the same without training the recurrent weights (i.e., by only training the output weights W^Out). However, by training a percentage of the recurrent neurons, we observed that the agent could learn to produce the desired activity in relatively fewer episodes of training.

3.2 Temporal Scaling
It is interesting how humans can execute their actions, such as speaking, writing, or playing music, at different speeds. Temporal scaling is another feature we observed in our proposed method. A few studies have explored temporal scaling in humans Diedrichsen et al. (2007); Collier and Wright (1995), particularly the study by Hardy et al. (2018), which modeled temporal scaling using an RNN and a supervised learning method. Their approach involved training the recurrent neurons using a second RNN that generates a target output for each of the recurrent neurons in the population. Unfortunately, this approach is not feasible with an online learning algorithm such as reinforcement learning. So, to explore the possibility of temporal scaling with our method, we trained the model using an additional speed input (shown in Figure 4), using the same approach as is outlined in Eqs 11, 12, 14. In this set-up, the RNN receives both a state input and a speed input. The speed input is a constant value given only when there is a state input; for the rest of the time, the speed input is zero. We trained the model only with one speed (speed = 1) and tested it at three different speeds: speed = 1.3, speed = 0.01, and speed = 0.8. Figure 5 shows the results. We observed that the shift in click time with respect to speed could be defined using Eq 15. We used a similar procedure to that described in Section 2.3.2 to train for temporal scaling.

click time = click time + speed / default speed + 200    (15)

3.3 Learning to Plan Multiple Future Times-to-Act
One of the inherent properties of an RNN is that it can produce multiple peaks at different time points, even with only one input at the start of the trial. Results of the study by Hardy et al. (2018) showed that the output of the RNN (trained using supervised learning) peaked at multiple time points given a single input of 250 ms at the start of the trial. To understand whether an agent could learn to plan such multiple future times-to-act given one state using the proposed training, we trained an agent on a slightly modified task-switching scenario. Here, the agent needed to click on the first circle at three different time intervals, 400–500 ms, 1,000–1,100 ms, and 1,700–1,800 ms, and on the second circle at 2,300–2,400 ms. The first circle was set to deactivate at 1,801 ms.
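Both this experiment and the temporal-scaling experiment above reuse the reward-based recursive least squares rule of Section 2.3.2. A minimal sketch of that update (Eqs 11–13) for a single trained recurrent neuron is given below; the clamp values, the initial P matrix, the subset size, and the variable names are assumptions rather than the paper's exact settings (which are listed in Table 1):

```python
import numpy as np

# Sketch of the reward-based RLS update (Eqs 11-13) for one trained recurrent neuron i.
# R_MIN/R_MAX and the initial P scaling are assumed; fr_sub is the firing-rate vector
# of the randomly chosen trained subset (SubRec, here 30% of 300 neurons).
R_MIN, R_MAX = -1.0, 1.0            # assumed clamp values for the desired firing rate
n_sub = 90                          # 30% of 300 recurrent neurons
P_i = np.eye(n_sub)                 # learning-rate matrix P^i, one per trained neuron
w_i = np.zeros(n_sub)               # recurrent weights W_ij^Rec onto neuron i (j in SubRec)

def rls_step(w_i, P_i, fr_i, fr_sub, r_t):
    """One update at time t: clamp the reward-shifted target, then apply Eqs 13 and 11."""
    target = np.clip(fr_i + r_t, R_MIN, R_MAX)           # desired firing rate of neuron i
    e_i = fr_i - target                                   # Eq 12: error w.r.t. the target
    Pf = P_i @ fr_sub
    P_i = P_i - np.outer(Pf, Pf) / (1.0 + fr_sub @ Pf)    # Eq 13: update of P^i
    w_i = w_i - e_i * (P_i @ fr_sub)                      # Eq 11: weight update over SubRec
    return w_i, P_i

# Example step with an arbitrary firing rate and a positive reward.
fr_sub = np.random.default_rng(1).normal(size=n_sub)
w_i, P_i = rls_step(w_i, P_i, fr_i=0.3, fr_sub=fr_sub, r_t=0.5)
```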
FIGURE 9 | RNN output when trained on a scenario with two circles, where the first circle must be clicked after 2,000 ms. The colored bars indicate the expected
time-to-act.
At the first state s1, the agent learned to produce a neural trajectory that peaked at three intervals; following state s2, the trajectory peaked at 2,300–2,400 ms, as shown in Figure 6.

3.4 Skip State Test
As seen in experiment 3, the multiple peaks (multiple times-to-act) that the agent was producing could be based on this inherent property of the RNN. In reinforcement learning, however, the peak at the time-to-act should be truly dependent on each input state and also leverage the temporal properties of the RNN. Hence, to evaluate whether the learned network was truly dependent on the state, we tested it by skipping one of the input states. As Figure 7 shows, when the agent did not receive a state at 2,400 ms, it did not choose to act during the 3,200–3,300 ms interval, proving that the learned time-to-act is truly state dependent.

3.5 Task Switching With 20 Tasks
To investigate the scalability of the proposed method to a relatively large state space, we trained and tested the model in
FIGURE 10 | Left shows the pendulum scenario. The pendulum rod (the black line) is 1 m long, and the blob (blue dot) weighs 1 kg. Right shows the trained RNN activity.
FIGURE 11 | Input difference between the RNN and the LSTM network.
a scenario consisting of 20 circles with 20 different times-to-act. Figure 8 demonstrates that the agent could indeed still learn the time-to-act with near-perfect accuracy.

3.6 Memory Task
From the above experiments, the agent was able to learn and employ its time representation in multiple ways. However, we are also interested in knowing how long an agent can remember a given input. To investigate this, we delayed the time-to-act until 2,000 ms after the offset of the input and trained the agent. The trained agent remembered a state seen at 0–20 ms until 2,000 ms (see Figure 9), as indicated by the peak in the output activity. We also trained the agent to remember a state for 3,000 ms. With the current number of recurrent neurons (i.e., 300 neurons), the agent was not able to remember for 3,000 ms from the offset of an input.

3.7 Shooting a Moving Target
Similar to the task-switching experiment, we trained the RL agent to learn "when to act" in a different scenario. In this scenario, the agent is rewarded for shooting a moving target. The target is the blob of a moving damped pendulum. The length of the pendulum is 1 m, and the weight of the blob is 1 kg. We trained the DQN to select the direction of shooting and the RNN to learn the exact time to release the trigger. The agent was rewarded positively for hitting the blob within an error of 0.1 m and negatively if it missed the target. The learned activity is shown in Figure 10; the left shows the motion of the pendulum, and the right shows the learned RNN activity. The threshold in this experiment was 0.05, and the agent was able to hit the blob 5 times in 3,000 ms. Although it is still not clear why the agent's activity did not peak from 0 to 1,500 ms, the agent showed better performance after 1,500 ms.

4 COMPARISON WITH LONG SHORT-TERM MEMORY (LSTM) NETWORK

A recent study by Deverett et al. (2019) investigated the interval timing abilities of a reinforcement learning agent. In the study, an
FIGURE 12 | Output activity of the trained LSTM network for a task-switching scenario containing four circles, with time-to-act intervals shown in colored bars.
RL agent was trained to reproduce a given temporal interval. However, the time representation in that study was in the form of movement (or velocity) control. In other words, the agent had to move from one point to a goal point within the same interval as presented at the start of the experiment. The agent, which used an LSTM network in the study by Deverett et al. (2019), performed the task with near-perfect accuracy, indicating the ability to learn temporal properties using LSTM networks. Following these findings, our study endeavors to understand whether an agent can learn a direct representation of time (instead of an indirect representation of time, such as velocity or acceleration) using an LSTM.

To investigate this direction, we trained an RL agent with only one LSTM network as its DQN (no RNN was used in this test) on the same task-switching scenario. The input sequence for an RNN works in terms of dt (as shown in Eq 6), whereas the input for an LSTM works in terms of sequence length, as shown in Figure 11. For example, an input signal with a length of 3,000 ms can be given 1 ms at a time to an RNN, whereas for an LSTM, the same input should be divided into segments of a fixed length to effectively capture the temporal properties of the input. We used an LSTM with 100 input nodes and gave an input signal of 100 ms to the network, followed by the next 100 ms. Indeed, the sequence length can be smaller than 100 ms. In our experiments, we trained the agent with different sequence lengths (50, 100, 200, and 300 ms), and the agent showed better performance for 300 ms (results for 50, 100, and 200 ms are given in the Appendix). The architecture of the LSTM we used contained one LSTM layer with 256 hidden units, 300 input nodes, and two linear layers with 100 nodes each. The output size of the network was 300, which resulted in an activity of n points for a given input signal of n ms. The hidden states of the LSTM network were carried over throughout the episode.

The trained activity of the LSTM network is shown in Figure 12 (bottom), where the light blue region shows the output activity of the network. Figure 12 shows the output activity of the LSTM network and the correct time-to-act intervals (colored bars) for clicking each circle. The LSTM network did learn to exceed the threshold, indicating when to act, at a few time-to-act intervals. However, the network learned a periodic pattern, meaning that for every 300 ms, the network learned to produce similar activity.

5 DISCUSSION

In this study, we trained a reinforcement learning agent to learn "when to act" using an RNN and "what to act" using a DQN. We introduced a reward-based recursive least squares algorithm to train the RNN. By disentangling the process of learning the temporal and spatial aspects of action into independent tasks, we intend to understand explicit time representation in an RL agent. Through this strategy, the agent learned to create its own representation of time. Our experiments, which employed a peak-interval style, show that the agent could learn to produce a neural trajectory that peaked at the time-to-act with near-perfect accuracy. We also observed several other intriguing behaviors.

• The agent learned to subdue its activity immediately after observing a new state. We interpreted this as the agent restarting its clock.
• The agent was able to temporally scale its actions in our proposed learning method. Even though we trained the agent with a single speed value (speed = 1), it learned to temporally scale its action to speeds that were both lower (speed = 0.01) and higher (speed = 1.3) than the trained speed. Notably, the agent was not able to scale its actions beyond speed 1.3.
• We observed that neural networks such as the LSTM might not be able to learn an explicit representation of time when compared with population clock models. Deverett et al. (2019) showed that an RL agent can scale its actions (increase or decrease the velocity) using an LSTM network. However, when we trained the LSTM network to learn a direct representation of time, it learned periodic activity.
• In this research study, we trained an RL agent in an environment similar to task switching: shooting a moving target. The target in our experiment is the blob of a damped pendulum with a length of 1 m and a mass of 1 kg. The agent was able to shoot the fast-moving blob by learning to shoot at a few near-accurate time points.

DATA AVAILABILITY STATEMENT

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

ACKNOWLEDGMENTS

This work was supported in part by the Australian Research Council (ARC) under discovery grants DP180100656 and DP210101093. Research was also sponsored in part by the Australia Defence Innovation Hub under Contract No. P18-650825, the US Office of Naval Research Global under Cooperative Agreement Number ONRG-NICOP-N62909-19-1-2058, and the AFOSR–DST Australian Autonomy Initiative agreement ID10134. We also thank the NSW Defence Innovation Network and the NSW State Government of Australia for financial support in part of this research through grants DINPP2019 S1-03/09 and PP21-22.03.02.
REFERENCES

Åström, K. J., and Wittenmark, B. (2013). Computer-Controlled Systems: Theory and Design. Englewood Cliffs, NJ: Courier Corporation.
Bakker, B. (2002). "Reinforcement Learning with Long Short-Term Memory," in Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, Vancouver, Canada, 1475–1482.
Buonomano, D. V., and Laje, R. (2011). "Population Clocks," in Space, Time and Number in the Brain (Elsevier), 71–85. doi:10.1016/b978-0-12-385948-8.00006-2
Buonomano, D. V., and Maass, W. (2009). State-Dependent Computations: Spatiotemporal Processing in Cortical Networks. Nat. Rev. Neurosci. 10, 113–125. doi:10.1038/nrn2558
Carrara, N., Leurent, E., Laroche, R., Urvoy, T., Maillard, O. A., and Pietquin, O. (2019). "Budgeted Reinforcement Learning in Continuous State Space," in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, December 8–14, 2019, 9295–9305.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Workshop on Deep Learning, Quebec, Canada, December 2014. Preprint arXiv:1412.3555.
Collier, G. L., and Wright, C. E. (1995). Temporal Rescaling of Simple and Complex Ratios in Rhythmic Tapping. J. Exp. Psychol. Hum. Perception Perform. 21, 602–627. doi:10.1037/0096-1523.21.3.602
Deverett, B., Faulkner, R., Fortunato, M., Wayne, G., and Leibo, J. Z. (2019). "Interval Timing in Deep Reinforcement Learning Agents," in 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, 6689–6698.
Diedrichsen, J., Criscimagna-Hemminger, S. E., and Shadmehr, R. (2007). Dissociating Timing and Coordination as Functions of the Cerebellum. J. Neurosci. 27, 6291–6301. doi:10.1523/jneurosci.0061-07.2007
Doya, K. (2000). Reinforcement Learning in Continuous Time and Space. Neural Comput. 12, 219–245. doi:10.1162/089976600300015961
Durstewitz, D. (2003). Self-Organizing Neural Integrator Predicts Interval Times through Climbing Activity. J. Neurosci. 23, 5342–5353. doi:10.1523/jneurosci.23-12-05342.2003
Hardy, N. F., Goudar, V., Romero-Sosa, J. L., and Buonomano, D. V. (2018). A Model of Temporal Scaling Correctly Predicts that Motor Timing Improves with Speed. Nat. Commun. 9, 4732–4814. doi:10.1038/s41467-018-07161-6
Hochreiter, S., and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Comput. 9, 1735–1780. doi:10.1162/neco.1997.9.8.1735
Klapproth, F. (2008). Time and Decision Making in Humans. Cogn. Affective, Behav. Neurosci. 8, 509–524. doi:10.3758/cabn.8.4.509
Laje, R., and Buonomano, D. V. (2013). Robust Timing and Motor Patterns by Taming Chaos in Recurrent Neural Networks. Nat. Neurosci. 16, 925–933. doi:10.1038/nn.3405
Li, D., Ge, S. S., He, W., Ma, G., and Xie, L. (2019). Multilayer Formation Control of Multi-Agent Systems. Automatica 109, 108558. doi:10.1016/j.automatica.2019.108558
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous Control with Deep Reinforcement Learning. 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, May 2–4, 2016. Preprint arXiv:1509.02971.
Matell, M. S., Meck, W. H., and Nicolelis, M. A. L. (2003). Interval Timing and the Encoding of Signal Duration by Ensembles of Cortical and Striatal Neurons. Behav. Neurosci. 117, 760–773. doi:10.1037/0735-7044.117.4.760
Miall, C. (1989). The Storage of Time Intervals Using Oscillating Neurons. Neural Comput. 1, 359–371. doi:10.1162/neco.1989.1.3.359
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. Preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-Level Control through Deep Reinforcement Learning. Nature 518, 529–533. doi:10.1038/nature14236
Oh, K.-K., Park, M.-C., and Ahn, H.-S. (2015). A Survey of Multi-Agent Formation Control. Automatica 53, 424–440. doi:10.1016/j.automatica.2014.10.022
Petter, E. A., Gershman, S. J., and Meck, W. H. (2018). Integrating Models of Interval Timing and Reinforcement Learning. Trends Cogn. Sci. 22, 911–922. doi:10.1016/j.tics.2018.08.004
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the Game of Go without Human Knowledge. Nature 550, 354. doi:10.1038/nature24270
Simen, P., Balci, F., deSouza, L., Cohen, J. D., and Holmes, P. (2011). A Model of Interval Timing by Neural Integration. J. Neurosci. 31, 9238–9253. doi:10.1523/jneurosci.3121-10.2011
Sompolinsky, H., Crisanti, A., and Sommers, H.-J. (1988). Chaos in Random Neural Networks. Phys. Rev. Lett. 61, 259. doi:10.1103/physrevlett.61.259
Tallec, C., Blier, L., and Ollivier, Y. (2019). Making Deep Q-Learning Methods Robust to Time Discretization. International Conference on Machine Learning (ICML), Long Beach. Preprint arXiv:1901.09732.
Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., et al. (2019). Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning. Nature 575, 350–354.
Watkins, C. J., and Dayan, P. (1992). Q-Learning. Machine Learn. 8, 279–292. doi:10.1023/a:1022676722315
Xue, D., Yao, J., Wang, J., Guo, Y., and Han, X. (2013). Formation Control of Multi-Agent Systems with Stochastic Switching Topology and Time-Varying Communication Delays. IET Control Theor. Appl. 7, 1689–1698. doi:10.1049/iet-cta.2011.0325

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2021 Akella and Lin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.