ABIDES-Gym: Gym Environments for Multi-Agent Discrete Event Simulation and Application to Financial Markets

ABSTRACT
Model-free Reinforcement Learning (RL) requires the ability to sample trajectories by taking actions in the original problem environment or a simulated version of it. Breakthroughs in the field of RL have been largely facilitated by the development of dedicated open-source simulators with easy-to-use frameworks such as OpenAI Gym and its Atari environments. In this paper we propose to use the OpenAI Gym framework on event-time-based Discrete Event Multi-Agent Simulation (DEMAS). We introduce a general technique to wrap a DEMAS simulator into the Gym framework. We expose the technique in detail and implement it using the simulator ABIDES as a base. We apply this work by specifically using the markets extension of ABIDES, ABIDES-Markets, and develop two benchmark financial-market OpenAI Gym environments for training daily investor and execution agents.¹ As a result, these two environments describe classic financial problems with a complex, interactive market behavior responding to the experimental agent's actions.

∗ Both authors contributed equally to this research.
¹ ABIDES source code is open-sourced on https://github.com/jpmorganchase/abides-jpmc-public and available upon request. Please reach out to Selim Amrouni and Aymeric Moulin.

ACM Reference Format:
Selim Amrouni, Aymeric Moulin, Jared Vann, Svitlana Vyetrenko, Tucker Balch, and Manuela Veloso. 2021. ABIDES-Gym: Gym Environments for Multi-Agent Discrete Event Simulation and Application to Financial Markets. In Proceedings of ICAIF'21. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION
Reinforcement learning (RL) [22] is a field of machine learning concerned with maximizing the objective of an agent. The environment the agent evolves in is modeled by a Markov decision process (MDP). The objective is typically defined as a cumulative numerical reward and is maximized by optimizing the policy the agent uses to choose its actions.
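For concreteness, this objective can be written as the standard expected discounted return (a textbook formulation, not a formula quoted from this paper):

    \max_{\pi} \; \mathbb{E}_{\pi}\Big[ \sum_{t \ge 0} \gamma^{t} r_{t} \Big], \qquad 0 \le \gamma \le 1,

where r_t is the reward received at step t and γ is a discount factor.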
The world is considered to be divided into two distinct parts: the experimental agent and the rest of the world, called the environment. Interactions between the agent and the environment are summarized as: (1) the experimental agent takes an action; (2) the environment evolves to a new state based solely on the previous state and the action taken by the experimental agent. At each step the agent receives a numerical reward based on the state and the action taken to reach that state.

There are two classic types of methods to approach RL problems: model-based methods [11] and model-free methods [22]. Model-based RL assumes that a model of the state-action transition distribution and of the reward distribution is known. Model-free RL assumes these models are unknown, but that instead the agent can interact with the environment and collect samples.

Model-free approaches to RL require the experimental agent to be able to interact with the MDP environment to gather information about state-action transitions and the resulting rewards. This can be done by directly interacting with a real-world system; however, the cost and risk associated with this interaction have proven challenging in most cases. The largest success stories of RL happened on problems where the original target environment is numerical and cheap to run by nature, or where the environment can be simulated as such (as developed in [6]).

If an environment is straightforward to model, the reward and new state arising from a state-action pair can be simulated directly. However, this is not always the case: there are systems where it is non-trivial to directly model the state and action transition steps. Some of them are by nature multi-agent systems. In that case, the easiest way to model the transition from a state observed by an agent to the next state some time in the future, after it took an action, is to simulate the actions taken by all the agents in the system.
Discrete Event Multi-Agent Simulation (DEMAS) has been a topic of interest for a long time [7, 8]. There are two main types of DEMAS:
• Time-step based simulation: the simulator advances time by increments determined before starting the simulation, typically of fixed size.
• Event-time based simulation: the simulator advances time as it processes events from the queue. Time jumps to the next event time.

In the case of event-time based simulation, most of the research has focused on simulating the entire system and its agents and observing the evolution of the different variables of the system.

In this paper, we propose a framework for wrapping an event-time based DEMAS simulator into an OpenAI Gym framework. It enables using a multi-agent discrete event simulator as a straightforward environment for RL research. The framework abstracts away the details of the simulator to only present the MDP of the agent of interest.

To the best of our knowledge, in the context of event-time based DEMAS, there has not been any work published where one agent is considered separately with its own MDP and the other agents are considered together as the rest-of-world background that drives the state-action transitions and rewards.

For practical purposes we detail the framework mechanism by applying it to ABIDES [5], a multipurpose multi-agent based discrete event simulator. We illustrate the benefits for RL research by using ABIDES-Markets, the markets simulation extension of ABIDES, as a base simulator to build two financial-market trading OpenAI Gym environments and train RL agents in them.

2 ABIDES: AGENT BASED INTERACTIVE DISCRETE EVENT SIMULATOR
In this section we present the details of the original implementation and use of ABIDES. We introduce the core simulator and its extension to equity markets simulation through ABIDES-Markets.

2.1 ABIDES-Core
ABIDES is a DEMAS simulator where agents exclusively interact through a messaging system. Agents only have access to their own state and obtain information about the rest of the world from the messages they receive. An optional latency model is applied to the messaging system.

2.1.1 Kernel.
The kernel (see figure 1) drives and coordinates the entire simulation. It is composed of a priority message queue used for handling messages between agents. It takes as input a start time, an end time, a list of agents, a latency model and a pseudo-random seed.

It first sets the clock to the start time and executes the kernelInitialize method for all agents. Then, it calls the kernelStarting method for all agents (this effectively has the same purpose as the kernelInitialize method but with the guarantee that all agents have already been instantiated when it runs). Agents start sending messages in the initialisation stage. The kernel then starts "processing" messages based on reception simulation time (messages are not opened, just routed to the recipient). Messages can be: (1) a general message, such as a data request, sent by one agent to another, or (2) a wakeup message sent by an agent to itself in order to be woken up later. An agent is only active when it receives a message (general or wakeup). An agent that is active and taking actions will likely result in more messages being sent and added to the queue. The kernel keeps processing the message queue until either the queue is empty or simulation time reaches the end time.

Once the end time has been reached, the kernel calls kernelStopping on all agents, then it calls kernelTerminating on all agents. These functions are used to clean and format the data and logs agents have collected throughout the simulation.
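As an illustration, here is a minimal Python sketch of this event loop, using illustrative names (SketchKernel, send, run) rather than the actual ABIDES classes, and omitting the latency model:

import heapq
import itertools

class SketchKernel:
    def __init__(self, agents, start_time, end_time):
        self.agents = agents              # agent objects, indexed by their id
        self.queue = []                   # priority message queue keyed on delivery time
        self.counter = itertools.count()  # tie-breaker for equal delivery times
        self.time, self.end_time = start_time, end_time

    def send(self, delivery_time, recipient_id, message):
        heapq.heappush(self.queue, (delivery_time, next(self.counter), recipient_id, message))

    def run(self):
        for agent in self.agents:
            agent.kernelInitialize(self)
        for agent in self.agents:
            agent.kernelStarting()        # agents may already queue messages here
        while self.queue:
            self.time, _, recipient_id, message = heapq.heappop(self.queue)
            if self.time > self.end_time:
                break
            # the kernel does not open messages, it only routes them to the recipient
            self.agents[recipient_id].receiveMessage(self.time, message)
        for agent in self.agents:
            agent.kernelStopping()
        for agent in self.agents:
            agent.kernelTerminating()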
2.1.2 Agents.
An agent is an object with the following methods: kernelInitialize, kernelStarting, receiveMessage, kernelStopping, kernelTerminate and wakeUp. Apart from these requirements, agent functioning is flexible. The agent can perform any computations and communicate with the rest of the world by sending messages routed through the kernel.
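A minimal sketch of an agent exposing this interface, compatible with the kernel sketch above (illustrative only; the real ABIDES base classes carry much more functionality):

class SketchAgent:
    def __init__(self, agent_id):
        self.id = agent_id
        self.kernel = None

    def kernelInitialize(self, kernel):
        self.kernel = kernel              # handle used to route messages

    def kernelStarting(self):
        # all agents exist at this point, so it is safe to schedule a first wakeup
        self.kernel.send(delivery_time=0, recipient_id=self.id, message="WAKEUP")

    def receiveMessage(self, current_time, message):
        if message == "WAKEUP":
            self.wakeUp(current_time)
        # otherwise: arbitrary computation, plus further messages through the kernel

    def wakeUp(self, current_time):
        pass

    def kernelStopping(self):
        pass                              # clean and format collected data and logs

    def kernelTerminating(self):          # listed as kernelTerminate in the text above
        pass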
2.2 ABIDES-Markets
ABIDES-Markets extends ABIDES-Core. It implements a market with a single exchange and several market participants. The exchange is an agent in the simulation and the market participants are other agents. This way, by construction of ABIDES-Core, market participants and the exchange only communicate via messages. Typical messages such as orders, but also market data, rely on the messaging system. (Market data is based on either a direct single request or a recurring subscription.)

This work on ABIDES-Markets has focused on representing the NASDAQ equity exchange and its regular-hours continuous trading session [19]. The exchange receives trading instructions similar to the OUCH protocol [20]. Orders are matched based on a price/time priority model.

The implementation of interactions with the exchange is facilitated by the parent object classes FinancialAgent and TradingAgent. The basic background agents inherit from them. They include value agents, momentum agents and others ([5] gives a description of the agents).

3 ABIDES-GYM
In this section we introduce the Gym wrapping. We expose the details of the two-layer wrapping on ABIDES: ABIDES-Gym-Core and the ABIDES-Gym sub-environments.

3.1 Motivation
As described in section 2, ABIDES is a flexible tool that facilitates DEMAS. However, in its original version, ABIDES presents drawbacks that possibly make it difficult to use for some applications. Creating an experimental agent and adding it to an existing configuration requires a deep understanding of ABIDES. Additionally, the framework is unconventional and makes it hard to leverage popular RL tools. Figure 2 illustrates this point: the experimental agent is part of the simulation like the others. For this reason the simulation returns nothing until it is done. The full experimental agent behavior has to be put in the agent code, inside the simulator.
There is no direct access to the MDP of the RL problem from outside of ABIDES.

3.2 Approach
To address the aforementioned difficulties and make ABIDES easily usable for RL, we introduce ABIDES-Gym, a novel way to use ABIDES through the OpenAI Gym environment framework; in other words, to run ABIDES while leaving the learning algorithm and the MDP formulation outside of the simulator. To the best of our knowledge, it is the first instance of a DEMAS simulator allowing interaction through an OpenAI Gym framework.

Figure 2 shows that ABIDES-Gym allows using ABIDES as a black box. From the learning algorithm's perspective, the entire interaction with ABIDES-Gym can be summarized into: (1) drawing an initial state by calling env.reset(), (2) calling env.step(a) to take an action a and obtain the next state, reward and done variables. The sample code in Listing 1 shows a short example of the training loop for a learning algorithm.
import gym
import ABIDES_gym

env = gym.make('markets-daily_investor-v0')
env.seed(0)
state, done = env.reset(), False
agent = MyAgentStrategy(params)
while not done:
    action = agent.choose_action(state)
    new_state, reward, done, info = env.step(action)
    agent.update_policy(new_state, state, reward, action)
    state = new_state

Listing 1: Use of ABIDES-Gym with the OpenAI Gym API
3.3 Key idea: interruptible simulation kernel
Most DEMAS simulators, including ABIDES, run in one single uninterruptible block. To be able to interact with ABIDES in the OpenAI Gym framework, we need to be able to start the simulation, pause it at specified points in time, return a state and then resume the simulation.

We propose a new kernel version in which the initialization, running and termination phases are broken down into 3 separate methods. The kernel is initialized using the initialization method (effectively calling the kernelInitializing and kernelStarting methods). Then the kernel is run using the runner method until either the message queue is empty or an agent sends an interruption instruction to the kernel. When runner finishes, it returns a state from a specified agent. Additionally, we add to the runner method the option to send an action for an agent to execute as the first event when the simulation resumes. This new kernel can be used in the original mode or in the new Gym mode:

• Original mode. To run ABIDES in the original mode, we successively run the initialization, runner and termination methods. Running a configuration with agents that never send interruptions, the runner method runs until the end of the simulation.
• New Gym mode. To run ABIDES in the new Gym mode, we introduce a "placeholder" agent we call the Gym agent. At every wake-up call this agent receives, it sends an interruption instruction and its current raw state to the kernel. The kernel pauses the simulation and returns the raw state passed by the Gym agent (the raw state contains all the information passed from the Gym agent to the outside of the simulator). The "pause" gives back control to the main user script/thread. The user can use the raw state to perform any computation it wants in order to select the next action a. Calling runner with action a as input, the user takes its action and resumes the simulation until the next interruption or until the queue is empty (a sketch of such a kernel is given below).
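A minimal sketch of such an interruptible kernel (illustrative names and structure, not the actual ABIDES-Gym code; apply_action on the Gym agent is an assumed helper):

import heapq
import itertools

class SketchInterruptibleKernel:
    def __init__(self, agents, gym_agent, start_time, end_time):
        self.agents, self.gym_agent = agents, gym_agent
        self.time, self.end_time = start_time, end_time
        self.queue, self.counter = [], itertools.count()
        self.interrupted, self.raw_state = False, None

    def send(self, delivery_time, recipient_id, message):
        heapq.heappush(self.queue, (delivery_time, next(self.counter), recipient_id, message))

    def interrupt(self, raw_state):
        # called by the Gym ("placeholder") agent on each of its wake-ups in Gym mode
        self.interrupted, self.raw_state = True, raw_state

    def initialize(self):
        for agent in self.agents:
            agent.kernelInitialize(self)
        for agent in self.agents:
            agent.kernelStarting()

    def runner(self, agent_action=None):
        if agent_action is not None:
            self.gym_agent.apply_action(agent_action)   # first event after resuming
        self.interrupted = False
        while self.queue and not self.interrupted:
            self.time, _, recipient_id, message = heapq.heappop(self.queue)
            if self.time > self.end_time:
                break
            self.agents[recipient_id].receiveMessage(self.time, message)
        done = (not self.queue) or self.time > self.end_time
        return self.raw_state, done

    def terminate(self):
        for agent in self.agents:
            agent.kernelStopping()
        for agent in self.agents:
            agent.kernelTerminating()

In original mode, one would call initialize(), runner() and terminate() in sequence; in Gym mode, runner() is called repeatedly, once per action.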
3.4 ABIDES-Gym-Core environment: Wrapping ABIDES in the OpenAI Gym framework
With the new kernel described in subsection 3.3, ABIDES-Gym wraps ABIDES in an OpenAI Gym framework:
• env.reset(): instantiates the kernel with the configuration, starts the simulation using the kernel runner method, waits for the Gym agent to interrupt and send its state, and returns this state.
• env.step(a): calls the runner method on the kernel previously obtained with env.reset() and feeds it action a.
This wrapping is independent of the nature of the simulation performed with ABIDES. We structure it into an abstract Gym environment, ABIDES-Gym-Core.
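A minimal sketch of this wrapping (assumed names, building on the interruptible kernel sketched in 3.3, not the actual ABIDES-Gym-Core code):

import gym

class SketchAbidesGymCore(gym.Env):
    def __init__(self, build_kernel):
        self.build_kernel = build_kernel      # callable building a fresh kernel + agents
        self.kernel = None

    def reset(self):
        self.kernel = self.build_kernel()     # new episode: new configuration
        self.kernel.initialize()
        raw_state, _ = self.kernel.runner()   # run until the Gym agent interrupts
        return self.raw_state_to_state(raw_state)

    def step(self, action):
        raw_state, done = self.kernel.runner(agent_action=action)
        state = self.raw_state_to_state(raw_state)
        reward = self.raw_state_to_reward(raw_state)
        return state, reward, done, {}

    # the MDP itself is left to sub-environments (section 3.5)
    def raw_state_to_state(self, raw_state):
        raise NotImplementedError

    def raw_state_to_reward(self, raw_state):
        raise NotImplementedError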
3.5 ABIDES-Gym sub-environments: Fully defining a Markov Decision Process
The ABIDES-Gym-Core abstract environment enforces the Gym framework mechanisms. It leaves the MDP undefined: the notions of time steps, state and reward are left unspecified. ABIDES-Gym sub-environments, inheriting from ABIDES-Gym-Core, specify these notions as follows:
• Time-steps: the Gym agent is given a process to follow for its wake-up times (it can be deterministic or stochastic).
• State: a function is defined to compute the actual state of the MDP from the raw state returned by the Gym agent.
• Reward: a function is defined to compute it from the raw state. An additional function is defined to update the reward at the end of an episode if needed (see the sketch below).
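A minimal sketch of a sub-environment filling in these three notions on top of the core wrapper sketched above (hypothetical hook names and raw-state fields):

class SketchDailyInvestorEnv(SketchAbidesGymCore):
    def wakeup_frequency(self):
        return "1min"                          # deterministic wake-up schedule

    def raw_state_to_state(self, raw_state):
        # map the Gym agent's raw state to the MDP state (e.g. the features of 4.1.2)
        return (raw_state["holdings"], raw_state["imbalance"], raw_state["spread"])

    def raw_state_to_reward(self, raw_state):
        return raw_state["marked_to_market"] - raw_state["previous_marked_to_market"]

    def update_reward_at_episode_end(self, raw_state):
        return 0.0                             # no terminal adjustment in this sketch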
3.6 More details on ABIDES-Gym-Markets
ABIDES-Gym-Markets inherits from ABIDES-Gym-Core and constitutes a middle abstract layer between ABIDES-Gym-Core and the environments dedicated to financial markets. In addition to the general wrapping defined by ABIDES-Gym-Core, ABIDES-Gym-Markets handles the market data subscriptions with the exchange, the tracking of orders sent, the order API and more. In practice we achieve these functionalities by introducing a Gym agent that is more specific to markets simulation.

4 ABIDES-GYM APPLICATION TO FINANCE: INTRODUCING TWO MARKET ENVIRONMENTS
In this section we focus on using the ABIDES-Gym approach for ABIDES-Markets simulations and introduce two market environments that address classic problems.

4.1 Daily Investor Environment
This environment presents an example of the classic problem where an investor tries to make money buying and selling a stock throughout a single day. The investor starts the day with cash but no position, then repeatedly buys and sells the stock in order to maximize the marked-to-market value at the end of the day (i.e., cash plus holdings valued at the market price).
Figure 3: ABIDES-Gym kernel mechanism when running in Gym mode, with the RL training loop on the left. The figure describes communications between agents and the kernel inside ABIDES, as well as communications between the RL loop and the simulation. Time is represented by reading the events from top to bottom, following the RL training loop on the left.
4.1.1 Time steps.
As described in subsection 3.5, we introduce a notion of time step by considering the experimental agent's MDP. Here we make the experimental agent wake up every minute starting at 09:35.

4.1.2 State space.
The experimental agent perceives the market through the state representation:

    s(t) = (holdings_t, imbalance_t, spread_t, directionFeature_t, R^k_t)

where:
• holdings_t: number of shares of the stock held by the experimental agent at time step t
• imbalance_t = (bids volume) / (bids volume + asks volume), using the first 3 levels of the order book. The value is set to 0, 1 and 0.5 respectively for no bids, no asks and an empty book.
• spread_t = bestAsk_t − bestBid_t
• directionFeature_t = midPrice_t − lastTransactionPrice_t
• R^k_t = (r_t, ..., r_{t−k+1}): series of mid-price differences, where r_{t−i} = mid_{t−i} − mid_{t−i−1}. It is set to 0 when undefined. By default k = 3 (a sketch of these feature computations is given below).
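A sketch of these feature computations, assuming the raw state exposes best bid/ask prices, aggregated bid/ask volumes over the first 3 levels, a mid-price history and the last transaction price (field and function names are illustrative, not the ABIDES-Gym ones):

def compute_state(holdings, bid_volume, ask_volume, best_bid, best_ask,
                  mid_history, last_transaction_price, k=3):
    if bid_volume + ask_volume == 0:
        imbalance = 0.5                                      # empty book
    else:
        imbalance = bid_volume / (bid_volume + ask_volume)   # 0 if no bids, 1 if no asks
    spread = best_ask - best_bid
    mid = 0.5 * (best_bid + best_ask)
    direction_feature = mid - last_transaction_price
    # R_t^k: the last k mid-price differences, most recent first, set to 0 when undefined
    diffs = [mid_history[i] - mid_history[i - 1] for i in range(1, len(mid_history))]
    returns = (list(reversed(diffs[-k:])) + [0.0] * k)[:k]
    return (holdings, imbalance, spread, direction_feature, tuple(returns))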
4.1.3 Action space.
The environment allows 3 simple actions: "BUY", "HOLD" and "SELL". The "BUY" and "SELL" actions correspond to market orders of a constant size, which is defined at the instantiation of the environment and defaults to 100.

4.1.4 Reward.
We define the step reward as reward_t = markedToMarket_t − markedToMarket_{t−1}, where markedToMarket_t = cash_t + holdings_t · lastTransaction_t and lastTransaction_t is the price at which the last transaction in the market was executed before time step t.
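A sketch of this reward computation (illustrative names):

def marked_to_market(cash, holdings, last_transaction_price):
    return cash + holdings * last_transaction_price

def step_reward(cash_t, holdings_t, last_price_t, cash_prev, holdings_prev, last_price_prev):
    return (marked_to_market(cash_t, holdings_t, last_price_t)
            - marked_to_market(cash_prev, holdings_prev, last_price_prev))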
5.1.2 Results.
Figure 4 shows the average reward throughout training for different initial environment seeds (1, 2 and 3). For all of these different global initial seeds the agent is able to learn a profitable strategy.

[Figure 4: Training average reward per global initial seed (1, 2, 3); y-axis: Reward.]

5.2 Execution environment
5.2.1 Setup.
Same setup as in 5.1.1, except for the learning rate, which is fixed at 1 · 10⁻⁴.
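This excerpt does not spell out the training stack; one possible way to launch such a DQN run with Ray Tune/RLlib [12, 13, 18] is sketched below, where the environment id "markets-execution-v0", the config keys and the stopping criterion are assumptions:

import ray
from ray import tune
import ABIDES_gym  # assumed to register the ABIDES-Gym environments with gym

ray.init()
tune.run(
    "DQN",
    config={
        "env": "markets-execution-v0",   # assumed env id for the execution environment
        "lr": 1e-4,                      # learning rate fixed at 1e-4 (subsection 5.2.1)
        "seed": 1,                       # global initial seed (1, 2 or 3)
    },
    stop={"timesteps_total": 200_000},   # roughly the range shown in Figure 5
)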
5.2.2 Results.
Figure 5 shows the average reward throughout training for different initial environment seeds (1, 2 and 3). For the execution problem, achieving a reward close to 0 means executing at a low cost. The agent indeed learns to execute the parent order while minimizing the cost of execution. At the moment there are positive spikes, explained by rare unrealistic behaviors of the synthetic market.

Figure 5: Daily Execution env: training of a DQN agent (average reward per 1000 with 5-point moving average). [Plot: Reward vs. Steps (0–200,000) for global initial seeds 1, 2, 3.]

6 RELATED WORK
We now briefly discuss related work on simulation approaches, Reinforcement Learning, OpenAI Gym environments and applications to trading environments.

6.1 DEMAS simulators
DEMAS enables modeling the behavior and evolution of complex systems. It is sometimes easier to generate samples or estimate statistics from a distribution by reproducing the components' behavior rather than trying to directly express the distribution or directly produce a sample. Multiple open-source simulators exist; however, none of them has become a de-facto standard. Many of them contain specifics of the domain they arose from. For example, the Swarm simulator [16] is well suited to describing biological systems. It has the notion of a swarm, a group of agents and their schedule, which can in turn be connected to build a configuration. For this paper we choose to use ABIDES [5] as a base because of its message-based kernel, which enables precise control of information dissemination between agents. This makes ABIDES particularly well suited for the simulation of financial markets.

Even though the above described simulators enable modeling complex systems, they are not specifically designed for RL. Many different MDPs can be formulated depending on the part of the whole system chosen as the experimental agent and on the considered task and reward formulation. Thus, they often lack the ability to interact with the simulator via an easy API to train one or more experimental RL agents with the rest of the system as an MDP background.

6.2 Multi-agent RL
As we deal with RL in multi-agent scenarios, it is important to mention Multi-Agent RL (MARL). MARL is a sub-field of RL where several learning agents are considered at once, interacting with each other. On the one hand, MARL could be considered as a more general instance of the problem we introduce in this paper, as all agents (or at least most) are learning agents in MARL. On the other hand, so far the problem formulations that have been used for MARL can be quite restrictive for the nature of agent interaction, as described below. Classic MARL problem formulations used:
• Partially Observable Stochastic Games (POSGs) [14]: This formulation is widely used. All agents "take steps" simultaneously by taking an action and observing the next state and reward. It is inherently time-step based: one needs to consider all time-steps and provide many null actions if the game is not fully simultaneous by nature. Additionally, environments modeling these types of games are typically
REFERENCES
[1] AminHP. [n.d.]. github repo: gym-anytrading. https://github.com/AminHP/gym-anytrading
[2] Robert Aumann and S. Hart (Eds.). 1992. Handbook of Game Theory with Economic
Applications (1 ed.). Vol. 1. Elsevier. https://EconPapers.repec.org/RePEc:eee:
gamhes:1
[3] Tucker Hybinette Balch, Mahmoud Mahfouz, Joshua Lockhart, Maria Hybinette,
and David Byrd. 2019. How to Evaluate Trading Strategies: Single Agent Market
Replay or Multiple Agent Interactive Simulation? arXiv:1906.12010 [q-fin.TR]
[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John
Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym.
arXiv:1606.01540 [cs.LG]
[5] David Byrd, Maria Hybinette, and Tucker Hybinette Balch. 2019.
ABIDES: Towards High-Fidelity Market Simulation for AI Research.
arXiv:1904.12066 [cs.MA]
[6] Gabriel Dulac-Arnold, Daniel J. Mankowitz, and Todd Hester. 2019. Chal-
lenges of Real-World Reinforcement Learning. CoRR abs/1904.12901 (2019).
arXiv:1904.12901 http://arxiv.org/abs/1904.12901
[7] Daniele Gianni. 2008. Bringing Discrete Event Simulation Concepts into Multi-
agent Systems. In Tenth International Conference on Computer Modeling and
Simulation (uksim 2008). 186–191. https://doi.org/10.1109/UKSIM.2008.139
[8] Alfred Hartmann and Herb Schwetman. 1998. Discrete-Event Simula-
tion of Computer and Communication Systems. John Wiley & Sons,
Ltd, Chapter 20, 659–676. https://doi.org/10.1002/9780470172445.ch20
arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470172445.ch20
[9] Harold William Kuhn. 1953. Contributions to the Theory of Games (AM-28), Volume II. Princeton University Press. https://doi.org/10.1515/9781400881970
[10] Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi,
Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls,
Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds,
Ryan Faulkner, János Kramár, Bart De Vylder, Brennan Saeta, James Bradbury,
David Ding, Sebastian Borgeaud, Matthew Lai, Julian Schrittwieser, Thomas An-
thony, Edward Hughes, Ivo Danihelka, and Jonah Ryan-Davis. 2019. OpenSpiel:
A Framework for Reinforcement Learning in Games. CoRR abs/1908.09453 (2019).
arXiv:1908.09453 [cs.LG] http://arxiv.org/abs/1908.09453
[11] Sergey Levine and Vladlen Koltun. 2013. Guided Policy Search. In Proceedings of
the 30th International Conference on Machine Learning (Proceedings of Machine
Learning Research, Vol. 28), Sanjoy Dasgupta and David McAllester (Eds.). PMLR,
Atlanta, Georgia, USA, 1–9. http://proceedings.mlr.press/v28/levine13.html
[12] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Gold-
berg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. 2018. RLlib: Abstrac-
tions for Distributed Reinforcement Learning. In Proceedings of the 35th In-
ternational Conference on Machine Learning (Proceedings of Machine Learning
Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 3053–3062.
http://proceedings.mlr.press/v80/liang18b.html
[13] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez,
and Ion Stoica. 2018. Tune: A Research Platform for Distributed Model Selection
and Training. arXiv preprint arXiv:1807.05118 (2018).
[14] Michael L. Littman. 1994. Markov Games as a Framework for Multi-Agent
Reinforcement Learning. In In Proceedings of the Eleventh International Conference
on Machine Learning. Morgan Kaufmann, 157–163.
[15] Thomas Lux and Michele Marchesi. 1998. Scaling and Criticality in a Stochastic
Multi-Agent Model of a Financial Market. Nature 397 (08 1998). https://doi.org/
10.1038/17290
[16] Nelson Minar, Roger Burkhart, Chris Langton, and et al. 1996. The Swarm
Simulation System: A Toolkit for Building Multi-Agent Simulations. Technical
Report.
[17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602 [cs.LG]. NIPS Deep Learning Workshop 2013. http://arxiv.org/abs/1312.5602
[18] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard
Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan,
and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications.
arXiv:1712.05889 [cs.DC]
[19] NASDAQ. [n.d.]. The Nasdaq Stock Market (Nasdaq). https://www.nasdaqtrader.
com/trader.aspx?id=tradingusequities
[20] NASDAQ. 2021. O*U*C*H Version 4.2. http://www.nasdaqtrader.com/content/
technicalsupport/specifications/TradingProducts/OUCH4.2.pdf
[21] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. 2006. Reinforcement learning
for optimized trade execution. ICML 2006 - Proceedings of the 23rd International
Conference on Machine Learning 2006, 673–680. https://doi.org/10.1145/1143844.
1143929
[22] Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An intro-
duction. MIT press.
[23] Svitlana Vyetrenko, David Byrd, Nick Petosa, Mahmoud Mahfouz, Danial Dervovic, Manuela Veloso, and Tucker Hybinette Balch. 2019. Get Real: Realism Metrics for Robust Limit Order Book Market Simulations. arXiv:1912.04941 [q-fin.TR]