
Data-Driven Market-Making via Model-Free Learning

Yueyang Zhong (1), YeeMan Bergstrom (2), and Amy Ward (3)

(1, 3) Booth School of Business, University of Chicago
(2) Proprietary Trading, Chicago
yzhong0@chicagobooth.edu, yee.man.bergstrom@gmail.com, Amy.Ward@chicagobooth.edu

Abstract
This paper studies when a market-making firm should place orders to maximize their expected net profit, while also constraining risk, assuming orders are maintained on an electronic limit order book (LOB). To do this, we use a model-free and off-policy method, Q-learning, coupled with state aggregation, to develop a proposed trading strategy that can be implemented using a simple lookup table. Our main training dataset is derived from event-by-event data recording the state of the LOB. Our proposed trading strategy has passed both in-sample and out-of-sample testing in the backtester of the market-making firm with whom we are collaborating, and it also outperforms other benchmark strategies. As a result, the firm desires to put the strategy into production.

1 Introduction

We consider a financial asset traded on an electronic exchange. Market participants, including institutional investors, market makers, and speculators, can post two types of buy/sell orders. A market order is an order to buy/sell a certain quantity of the asset at the best available price in the market. A limit order is an order to trade a certain amount at a specified price, known as an ask price for a sell order, and a bid price for a buy order. Limit orders are posted to an electronic trading system, and all the outstanding limit orders are summarized by stating the quantities posted at each price level in a limit order book (LOB), as shown in Figure 1, which is the dominant market structure among exchange-traded U.S. equities and futures. The LOB is available to all market participants.

[Figure 1: An illustration of a limit order book (LOB)]

The limit orders rest or wait in the LOB, and are matched against incoming market orders. A market buy (sell) order executes first at the lowest ask (highest bid) price, and next in ascending (descending) order with higher (lower) priced asks (bids). The execution within each price level is prioritized in accordance with the limit order time of arrival, in a first-come-first-served (FCFS) fashion.

In this paper, we take the perspective of a market-making firm. The market-making firm provides liquidity by submitting limit orders, and removes liquidity by canceling existing limit orders. Provided that the lowest ask price exceeds the highest bid price,[1] the market-making firm earns profit when one market order to buy trades with its resting limit sell order and another market order to sell trades with its resting limit buy order. The challenge is that the market-making firm cannot guarantee always being on both sides of the trade, due to the stochasticity of order arrivals and the resulting movements of the lowest ask and highest bid prices. The market-making firm with whom we partnered prefers to begin with the simplest possible strategy that places at most one order per side. Furthermore, the firm is most interested in a strategy for placing orders at the best bid and ask prices.

Our objective is to provide real-time guidance for how to manage the firm's portfolio of limit buy and sell orders on the LOB, so as to maximize the expected net profit, while penalizing mismatch between the amount bought and sold, and ensuring a sufficiently high Sharpe ratio.[2] To do this, we use historical trading data to train a model for real-time decision making. More specifically, we formulate this problem as a Markov decision problem (MDP). Two main issues in solving the MDP are: (1) difficulty in estimating the transition probabilities, and (2) a very large state space (the notorious curse of dimensionality). To overcome these issues and be able to find a well-performing heuristic, we implement a model-free Q-learning algorithm together with state aggregation.

[1] The arrivals of limit buy orders with bid prices higher than the lowest ask price will be fulfilled immediately, similarly for the arrivals of limit sell orders with ask prices lower than the highest bid price; thus the highest bid price does not exceed the lowest ask price.

[2] The Sharpe ratio measures the return of an investment compared to its risk. Usually, any Sharpe ratio greater than 1.0 is considered acceptable to good by investors. A ratio higher than 2.0 is rated as very good. A ratio of 3.0 or higher is considered excellent.
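Because the Sharpe ratio is the headline risk metric throughout the paper, a minimal sketch of how it can be computed from a daily PnL series may help fix ideas. The daily PnL array, the zero risk-free rate, and the 252-trading-day annualization factor below are illustrative assumptions of ours, not values taken from the paper.

import numpy as np

def sharpe_ratio(daily_pnl, trading_days=252):
    # Annualized Sharpe ratio of a daily PnL series (risk-free rate taken as 0).
    daily_pnl = np.asarray(daily_pnl, dtype=float)
    sd = daily_pnl.std(ddof=1)
    if sd == 0:
        return float("nan")
    return np.sqrt(trading_days) * daily_pnl.mean() / sd

# Example with made-up numbers: mean gain of 1.0 per day, noise of 5.0.
rng = np.random.default_rng(0)
print(sharpe_ratio(rng.normal(1.0, 5.0, size=250)))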
[Figure 2: Timing of LOB events]

2 Model

We model this problem as a finite-horizon discrete-time MDP. The simplified assumed timing of events happening in the LOB is illustrated in Figure 2. The objective is to provide a strategy for when (and when not) to have one buy and/or one sell order resting on the LOB. The assumption that at most one buy and one sell order can rest on the buy and sell side respectively is based on a high-frequency trading convention to (1) backtest whether the simple strategy is profitable, (2) see how the simple strategy performs in production, and (3) expand to more complicated order strategies (such as order stacking).

2.1 LOB State Variable

Assume there are n price levels in the order book, indexed by P := {1, 2, ..., n}. At time t ∈ T := {0, 1, 2, ..., T}, |R^1_{tp}| ∈ {0, 1} denotes whether there exists a limit order belonging to us at price p ∈ P, and |R^2_{tp}| ∈ {0, 1, 2, ...} denotes the total number of limit orders resting from other market participants at price p ∈ P. We distinguish between the bid and the ask side according to whether R^i_{tp} (i = 1, 2) is negative or positive; R^i_{tp} < 0 (i = 1, 2) for the bid side, and R^i_{tp} > 0 (i = 1, 2) for the ask side. Whenever the state is such that we have an order resting, we conservatively assume that our order rests at the back of the queue. The implication is that our model will tend to underestimate the frequency at which our orders are executed, resulting in an underestimation of profit.

The best bid and ask prices (also called the market bid and ask prices) can be expressed as a function of R_t = (R^1_{tp}, R^2_{tp})_{p ∈ P}.

• The best bid price (which is the highest bid price) is β_{R_t} := max{p ∈ {0, 1, ..., n} : R^1_{tp} + R^2_{tp} < 0}.

• The best ask price (which is the lowest ask price) is α_{R_t} := min{p ∈ {1, ..., n, n+1} : R^1_{tp} + R^2_{tp} > 0}.

In the above, p = 0 and p = n + 1 represent the degenerate cases of no bids and no asks, respectively. Since the best bid and ask prices can be determined from R_t, there is no need to include them as part of the state variable. Then, the pre-decision state variable at time t is given by R_t.
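As a concrete illustration of the definitions of β_{R_t} and α_{R_t}, the sketch below computes the best bid and ask levels from the two signed book vectors. The toy five-level book and the 0-indexed price levels are assumptions made purely for the example.

import numpy as np

def best_bid_ask(r_own, r_other):
    # Best bid/ask levels from signed per-level order counts (negative entries are
    # bids, positive entries are asks), following the definitions of beta_Rt and alpha_Rt.
    n = len(r_own)
    total = np.asarray(r_own) + np.asarray(r_other)
    bids = [p for p in range(n) if total[p] < 0]
    asks = [p for p in range(n) if total[p] > 0]
    best_bid = max(bids) if bids else -1   # -1 plays the role of p = 0 (no bids)
    best_ask = min(asks) if asks else n    # n plays the role of p = n + 1 (no asks)
    return best_bid, best_ask

# Toy book with n = 5 price levels: bids at levels 0-1, asks at levels 3-4.
r_own = np.array([0, -1, 0, 1, 0])     # our resting orders
r_other = np.array([-4, -2, 0, 3, 5])  # other participants' resting orders
print(best_bid_ask(r_own, r_other))    # -> (1, 3)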
2.2 Decision Variable

A trading policy can be decomposed into a sequence of actions taken at the best bid and/or the best ask price. The available actions are to add, cancel, or do nothing, and we encode this using 0 and 1. A 0 on the bid side implies we do not want an order resting at the best bid price, and so we cancel any existing order on the bid side, and otherwise do nothing. A 1 implies we do want an order resting at the best bid price, and so we place an order at the best bid price and simultaneously cancel any existing order on the bid side. This leads to the allowable action space for any state R_t being A := {(0, 0), (0, 1), (1, 0), (1, 1)}, where the two components in an action pair correspond to the action on the bid side and the ask side, respectively. Later, this will be useful for us to restrict the action space when there is too much mismatch between the amounts bought and sold, in which case the allowable actions will be a subset of A. The state after taking an action A_t = (A_{t1}, A_{t2}) ∈ A can be expressed by two n-dimensional vectors R^{a1}_t and R^{a2}_t, defined, for all p ∈ {1, 2, ..., n}, as

R^{a2}_{tp} := R^2_{tp},
R^{a1}_{tp} := 1, if A_{t1} = 1 and p = β_{R_t}, or A_{t2} = 1 and p = α_{R_t};
              0, otherwise.                                                     (1)

2.3 Exogenous Order Arrivals and Cancellations

Let D̂^{MB}_t and D̂^{MS}_t be the number of units demanded respectively by market buy and sell orders, which arose between time t and t + 1. We have at most one resting order at α_{R_t} and one at β_{R_t}, which rest at the end of the queue, and none elsewhere. The implication is that our orders execute if D̂^{MB}_t and/or D̂^{MS}_t is no fewer than the number of orders resting at the best ask and/or best bid; the state can then be updated in terms of the first n-dimensional vector, for all p ∈ P,

R^{m1}_{tp} := 0, if p = α_{R_t} and D̂^{MB}_t ≥ R^{a1}_{tp} + R^{a2}_{tp};
              0, if p = β_{R_t} and D̂^{MS}_t ≥ R^{a1}_{tp} + R^{a2}_{tp};
              R^{a1}_{tp}, otherwise.                                           (2)
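To make the post-action update (1) and the post-execution update (2) concrete, here is a small sketch that places our single units at the best quotes and then matches incoming market orders against them. The function names, the 0-indexed levels, and the use of magnitudes on the bid side (book entries carry a negative sign in this model) are illustrative choices of ours, not the authors' implementation.

import numpy as np

def apply_action(n_levels, action, best_bid, best_ask):
    # Equation (1): after taking (A_t1, A_t2), we rest one unit at the best bid
    # and/or the best ask, and nothing anywhere else.
    bid_on, ask_on = action
    r_a1 = np.zeros(n_levels, dtype=int)
    if bid_on and best_bid >= 0:
        r_a1[best_bid] = -1   # our one unit at the best bid (bids stored negative)
    if ask_on and best_ask < n_levels:
        r_a1[best_ask] = 1    # our one unit at the best ask
    return r_a1

def execute_market_orders(r_a1, r_a2, d_mb, d_ms, best_bid, best_ask):
    # Equation (2): our unit executes only when market demand covers the whole queue
    # at that level, because we conservatively rest at the back of the queue.
    r_m1 = r_a1.copy()
    if best_ask < len(r_a1) and d_mb >= abs(r_a1[best_ask]) + abs(r_a2[best_ask]):
        r_m1[best_ask] = 0    # our ask was lifted
    if best_bid >= 0 and d_ms >= abs(r_a1[best_bid]) + abs(r_a2[best_bid]):
        r_m1[best_bid] = 0    # our bid was hit
    return r_m1

# Toy example: five levels, best bid at level 1, best ask at level 3.
r_a2 = np.array([-4, -2, 0, 3, 5])
r_a1 = apply_action(5, (1, 1), best_bid=1, best_ask=3)
print(execute_market_orders(r_a1, r_a2, d_mb=6, d_ms=1, best_bid=1, best_ask=3))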
In order to update the second n-dimensional vector, which represents the resting orders from other market participants, we require more detailed knowledge. Define p^α_{R_t} to be the highest ask price against which a market buy order will execute.[3] If there are enough limit orders resting at the lowest ask price to fill the incoming market buy orders (i.e., D̂^{MB}_t ≤ R^{a1}_{tα_{R_t}} + R^{a2}_{tα_{R_t}}), then p^α_{R_t} = α_{R_t} and the trade quantity at price α_{R_t} is k^α_{R_t} := D̂^{MB}_t. Otherwise p^α_{R_t} > α_{R_t}, the trade quantities at any ask prices p lower than p^α_{R_t} exactly equal the number of resting orders at the price, and the trade quantity at price p^α_{R_t} can be expressed by

k^α_{R_t} := D̂^{MB}_t − (R^{a1}_{tα_{R_t}} + R^{a2}_{tα_{R_t}}) − Σ_{p = α_{R_t}+1}^{p^α_{R_t}−1} R^{a2}_{tp},

assuming D̂^{MB}_t ≤ (R^{a1}_{tα_{R_t}} + R^{a2}_{tα_{R_t}}) + Σ_{p = α_{R_t}+1}^{n} R^{a2}_{tp} (where the summation in the above display is the empty set if p^α_{R_t} = α_{R_t} + 1). In the rare case that the total number of limit orders resting on the book is not enough to fill all the incoming market buy orders (that is, if D̂^{MB}_t > (R^{a1}_{tα_{R_t}} + R^{a2}_{tα_{R_t}}) + Σ_{p = α_{R_t}+1}^{n} R^{a2}_{tp}), then p^α_{R_t} = n and the trade quantity at price n is the amount resting at that price, so that k^α_{R_t} := R^{a2}_{tn} (excess demand is lost). Similarly, define p^β_{R_t} to be the lowest bid price against which a market sell order will execute,[4] and k^β_{R_t} to be the trade quantity at price p^β_{R_t}. Then the second component of the state after the arrival of market orders is, for all p ∈ P,

R^{m2}_{tp} := 0, if p ∈ {p^β_{R_t}+1, ..., β_{R_t}} ∪ {α_{R_t}, ..., p^α_{R_t}−1};
              R^{a2}_{tp} − k^α_{R_t}, if p = p^α_{R_t};
              R^{a2}_{tp} − k^β_{R_t}, if p = p^β_{R_t};
              R^{a2}_{tp}, otherwise,                                           (3)

where R^{a2}_t is as defined in (1).

Finally, the other market participants add and cancel orders between time t and t + 1, denoted by Ô_t = (Ô_{tp})_{p ∈ P} and Ĉ_t = (Ĉ_{tp})_{p ∈ P}. This results in the state update

R^{o1}_{tp} := R^{m1}_{tp} and R^{o2}_{tp} := R^{m2}_{tp} + Ô_{tp} − Ĉ_{tp}, for all p ∈ P,   (4)

and has the restriction that R^{m2}_{tp} + Ô_{tp} ≥ Ĉ_{tp}, for all p ∈ P; i.e., the number of orders canceled cannot exceed the number of orders present.

[3] This assumes the non-degenerate case that there are ask orders resting on the LOB (α_{R_t} < n + 1).

[4] This assumes the non-degenerate case that there are bid orders resting on the LOB (β_{R_t} > 0).

2.4 Transition Function

According to the timing of events as shown in Figure 2, there are three state updates from time t to t + 1, which are elaborated in equations (1), (2), (3), and (4). Then, we can write the pre-decision state vector in the next decision epoch as

R_{t+1} = (R^{o1}_{tp}, R^{o2}_{tp}).   (5)

2.5 Objective Function

Over the course of each day, the market-making firm gains profit and incurs loss when market orders execute against the market maker's resting limit orders. For a given state and action pair, (R_t, A_t), and arrival of market buy and sell orders, (D̂^{MB}_t, D̂^{MS}_t), we define the contribution function as the common financial metric profit and loss (PnL):

C(R_t, A_t, D̂^{MB}_t, D̂^{MS}_t) := E^β · (m_{R_t} − β_{R_t}) + E^α · (α_{R_t} − m_{R_t}),   (6)

where the binary variables E^β and E^α indicate respectively whether we have a resting order that was executed on the bid side and on the ask side, and m_{R_t} := (α_{R_t} + β_{R_t})/2 denotes the mid price.

Since PnL is accounted for relative to the mid price, it is necessary to include another term in the objective function that penalizes the potential change in cash value due to movements in the mid price. To do this, the decision-maker must also track his open position, or inventory level. Recalling that A_t = (A_{t1}, A_{t2}) has first component corresponding to an action on the bid side and second component corresponding to an action on the ask side, the open position, or inventory level, between time t and t + 1 after the arrival of market orders is defined as

inv_t := Σ_{i=0}^{t} 1{A_{i1} = 1} 1{D̂^{MS}_i ≥ R^{a1}_{iβ_{R_i}} + R^{a2}_{iβ_{R_i}}} − Σ_{i=0}^{t} 1{A_{i2} = 1} 1{D̂^{MB}_i ≥ R^{a1}_{iα_{R_i}} + R^{a2}_{iα_{R_i}}},   (7)

where the first and second summation represent the cumulative amount bought and sold, respectively. Then, if ∆m_t denotes the change in the mid price between time period t − 1 and t (defined to be 0 for t = 0), the objective function is

V(R_t, A_t, D̂^{MB}_t, D̂^{MS}_t, inv_t) := C(R_t, A_t, D̂^{MB}_t, D̂^{MS}_t) + inv_t · ∆m_t.   (8)

Some papers, like [Spooner et al., 2018], also studied two alternative penalty terms in the objective function, symmetrically dampened PnL, η · inv_t · ∆m_t, and asymmetrically dampened PnL, min(0, η · inv_t · ∆m_t), to disincentivize trend-following and bolster spread capture, because a dampening applied to the inventory term reduces the profit gained through speculation (i.e., trend-following behavior) relative to that from capturing the spread. We also tried both in our experiments, but they do not display a better performance, so here we only consider the basic objective function in (8).
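The per-period reward in (6)-(8) is simple enough to spell out in code. The sketch below assumes integer tick indices for prices (so the spread capture is measured in ticks) and single-unit fills, which matches the at-most-one-order-per-side convention; the variable names are ours, not the paper's.

def contribution(best_bid, best_ask, bid_filled, ask_filled):
    # Equation (6): spread captured relative to the mid price, in ticks.
    mid = 0.5 * (best_bid + best_ask)
    return bid_filled * (mid - best_bid) + ask_filled * (best_ask - mid)

def step_objective(best_bid, best_ask, bid_filled, ask_filled, inventory, mid_change):
    # Equation (8): PnL contribution plus the inventory-times-mid-price-change term.
    # `inventory` is the running difference between units bought and sold (equation (7)).
    return contribution(best_bid, best_ask, bid_filled, ask_filled) + inventory * mid_change

# Example: both sides filled while flat, then only the bid filled while long one unit
# and the mid price dropping by one tick.
print(step_objective(100, 101, True, True, inventory=0, mid_change=0.0))    # 1.0 (full spread)
print(step_objective(100, 101, True, False, inventory=1, mid_change=-1.0))  # 0.5 - 1.0 = -0.5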
3 Data Analysis

Our dataset is a common and competitive futures contract traded on the Chicago Mercantile Exchange (CME)'s Globex electronic trading platform in 2019. It is Level II order book data, which provides more granular information than the trade and quotes (TAQ) data mostly used by traders to do financial analysis. Since trading is extremely active during the time near market open and market close, the dynamics of the LOB may differ significantly during these time periods, as compared to the behavior throughout the remainder of the trading day. Hence, we truncate the data to the timeframe 9:00 a.m.-2:30 p.m. for every day.

In contrast to most of the literature, where the time stamps are only accurate to 1 second, our event-by-event data records the state of the LOB with microsecond precision, once an order submission, order cancellation, or order execution occurs. From this, we extract time-stamped detailed information on order adds, order cancels, and order transactions at each of the highest 10 price levels on the bid side and the lowest 10 price levels on the ask side.[5] Then, we aggregate the data at the second level and construct six time series, for the following six order book events: (1) market buy orders; (2) market sell orders; (3) limit buy orders; (4) limit sell orders; (5) cancellations on the bid side; (6) cancellations on the ask side. The reason we aggregate at the second level is that our purpose is to derive a strategy that can have slow execution speed (i.e., need not execute at the microsecond level or faster). This is because the market-making firm with whom we partnered does not view speed as its primary competitive advantage.

[5] For the product we study, the difference between the best bid and ask prices, i.e., the spread, is rarely more than one tick. A spread of more than one tick occurs less than 0.01% of the time. As a result, in contrast to some of the past literature [Spooner et al., 2018], we do not need to record the spread for decision making purposes.
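As an illustration of the second-level aggregation, a pandas sketch along the following lines could turn a raw event stream into the six per-second series. The column names (timestamp, event_type, size) and the event labels are assumptions about the data layout, not the firm's actual schema.

import pandas as pd

EVENT_TYPES = ["market_buy", "market_sell", "limit_buy", "limit_sell",
               "cancel_bid", "cancel_ask"]

def per_second_series(events: pd.DataFrame) -> pd.DataFrame:
    # Aggregate microsecond-stamped events into per-second total sizes,
    # one column per order book event type.
    return (events
            .assign(second=events["timestamp"].dt.floor("1s"))
            .pivot_table(index="second", columns="event_type", values="size",
                         aggfunc="sum", fill_value=0)
            .reindex(columns=EVENT_TYPES, fill_value=0))

# Tiny synthetic example.
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-06-03 09:00:00.000125",
                                 "2019-06-03 09:00:00.730001",
                                 "2019-06-03 09:00:01.200443"]),
    "event_type": ["market_buy", "limit_sell", "market_buy"],
    "size": [2, 1, 3],
})
print(per_second_series(events))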
3.1 Independence Check

For analysis purposes, we would like to know that D̂^{MB}_t, D̂^{MS}_t, Ô_t, Ĉ_t are independent across time, as well as independent of each other. The intuitive reason this may be true is that aggregation at the second level is large enough to minimize the impact of any following behavior occurring in the other market participants, as that usually happens at much finer timescales. In other words, although the data may show slight autocorrelations at the microsecond level, one second is large enough for that autocorrelation to be negligible.

We investigate the autocorrelation and cross-correlation of the sizes and inter-arrival times of the aforementioned six time series, and also examine the cross-correlation between different price levels for each time series. The absence of correlation is necessary but not sufficient to show that successive observations of a random variable are independent. However, in our particular application setting, no correlation for both the observations and their common variants (e.g., square, inverse) should suffice as an indication of independence, in the same spirit as [Cont and De Larrard, 2012].

Autocorrelation

We first study the autocorrelations of order sizes, and the results show that all autocorrelation coefficients are close to zero for all time-lag separations by the Durbin-Watson test. We also investigate the autocorrelation of the square and inverse of the size, and the results remain the same. Thus, we conclude that the order sizes of each of the order book events are all independent.

Doing a similar check for inter-arrival times, we find that the sequences of inter-arrival times for market buy and sell orders are positively autocorrelated with autocorrelation coefficient around 0.2, but the Durbin-Watson statistic is not statistically significant. The inter-arrival times of limit orders and cancellations are significantly positively correlated, but the correlation coefficients are both smaller than 0.1. As mentioned earlier, such small autocorrelations can be ignored when we aggregate at the second level. Moreover, note that the large number of observations in the high-frequency data induces narrow confidence bands and spurious significance; thus, when the number of observations is large, statistically significant autocorrelations do not indicate practical significance if the correlations are very small. Therefore, we can state that the arrivals of all events are also independent.

Cross-Correlation

When we examine the cross-correlation between different time series, we use Spearman's coefficient to measure the correlation, where +1 and −1 represent strong positive and negative correlation respectively, and 0 represents no correlation. The results show that the Spearman's correlation coefficients with p-value smaller than 0.05 are all very close to zero (smaller than 0.01). Thus, we conclude that the order sizes and the arrivals of these six order book events are pairwise uncorrelated/independent; and the sizes and arrivals of limit orders and cancellations at different price levels are uncorrelated as well.

3.2 Distribution Fitting

Traditionally, to solve an MDP, we need information on the transition matrix. To this end, we investigate the distribution of order size and inter-arrival time for market orders, limit orders, and cancellations on the bid and ask sides. After dropping the last 1% of outliers in the dataset, we try more than 50 common discrete and continuous distributions to fit the data, but it turns out that the p-values of the chi-squared test or the KS-test for all the fits are close to zero, and the sums of squared errors of prediction (SSE) are much greater than 1, which are both indicative of very poor fits. We also considered estimating an empirical distribution, but that did not pass statistical tests, due to the heavy-tail pattern of our data. Hence, it is difficult to estimate the MDP transition matrix.

4 Q-learning Model

Our statistical analysis in Section 3 suggests that first estimating transition probabilities for the MDP in Section 2, and next applying standard MDP solution techniques, will not yield satisfactory results. This observation motivates us to consider a stochastic iterative (also called stochastic approximation) method called Q-learning. Q-learning is a model-free algorithm which can be applied to obtain an optimal control policy for an MDP when the transition rewards and the transition probabilities are unknown. Previous works, such as [Bertsekas and Tsitsiklis, 1996; Tsitsiklis, 1994; Jaakkola et al., 1994], have shown the convergence property of Q-learning.

However, [Powell, 2007] and others have observed that the Q-learning algorithm only works well in small state and action spaces, and even in modest spaces, the performance may not be good. The LOB state variable, (R^1_{tp}, R^2_{tp}) for any given t ∈ T, has n = 20 price levels, and, for each price level, two possible values for R^1_{tp} and an infinite number of possible values for R^2_{tp}, which we truncate to size 1000 for implementation purposes. This results in a lower bound on the LOB state space size of 1000^20 = 1 × 10^60, and this does not account for the history-dependent inventory level. That large number motivates us to create a state aggregation function, in the same spirit as [Pepyne et al., 1996], allowing us to reduce the original large-scale MDP to a much smaller, and more easily implementable, size.

4.1 Aggregation Method

From the extensive literature on market microstructure, such as [Cartea and Jaimungal, 2016; Spooner et al., 2018; Cartea et al., 2018], these are some attributes commonly used to describe the condition of the market and the decision-maker: the imbalance of the book size on both the bid and ask side, the magnitude of the market price movement, the trade volume, the relative strength index (RSI), the net amount bought and/or sold, and the current PnL. After experimenting with many different combinations of the aforementioned state attributes, we find the best results using the five attributes listed below. Note that the attributes used to describe the condition of the market (the first three below) come directly from the LOB data, whereas the attributes used to describe the decision-maker (the last two below) are history-dependent
and must be updated in real time as the market-making firm executes trades. At each time t ∈ T:

• bidSpeed (BS) ∈ {0, 1}: indicates whether the market sell orders exceed the book size at the best bid price, and is defined as BS := 1{D̂^{MS}_t > R^1_{tβ_{R_t}} + R^2_{tβ_{R_t}}}.

• askSpeed (AS) ∈ {0, 1}: indicates whether the market buy orders exceed the book size at the best ask price, and is defined as AS := 1{D̂^{MB}_t > R^1_{tα_{R_t}} + R^2_{tα_{R_t}}}.

• avgmidChangeFrac (MF) ∈ {0, ±1, ±2}: characterizes the relative change in the average mid price[6] from time period t − 1 to t compared to the range in the mid price over these two time periods, denoted f_t. The parameter f ∈ [−1, 1] determines the direction of the mid price movement (positive or negative), and whether that movement is large or small; that is, |MF| = 2 if |f_t| > f, |MF| = 1 if |f_t| ∈ (0, f], and MF = 0 if f_t = 0.

• invSign (IS) ∈ {0, ±1, ±2}: characterizes the side and magnitude of the open position, inv_t, as defined in (7). The state is oversold if inv_t < 0 and overbought if inv_t > 0. The parameter I ∈ (0, ∞) determines if the firm is oversold or overbought by a large amount, in which case |inv_t| > I and |IS| = 2, or by a small amount, in which case |inv_t| ∈ (0, I] and |IS| = 1. The balanced state IS = 0 occurs when inv_t = 0.

• cumPnL (CP) ∈ {0, 1}: indicates whether the cumulative PnL, defined from equation (6) as

pnl_t := Σ_{i=0}^{t} C(R_i, A_i, D̂^{MB}_i, D̂^{MS}_i),   (9)

is large or small, as determined by the parameter P ∈ (−∞, ∞). Specifically, CP = 1 if pnl_t ≤ P and CP = 0 if pnl_t > P.

[6] We choose the mid price rather than the market bid and/or market ask prices to represent the state of the book due to the fact that the market bid and ask prices always move in the same direction.

The attributes bidSpeed and askSpeed together characterize market volatility. For instance, bidSpeed = 1 and askSpeed = 0 indicate a sell-heavy market, which might be followed by a decrease of market (bid and ask) prices; thus it is suggestive to place limit sells and cancel limit buys. The avgmidChangeFrac measures the magnitude and direction of mid price changes, which provide signals about whether orders should be placed on the bid or ask side. The invSign indicates if there is a high mismatch between the amount bought and sold, and could be used to restrict order placement on the overbought/oversold side, even when conditions are favorable otherwise. Lastly, the cumPnL monitors the decision-maker's profitability in real time, and could induce more conservative behavior in the face of large losses, or more risky behavior in the face of large gains. Overall, there are three parameters that must be configured: f, I, and P.
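A minimal sketch of the resulting aggregation map might look as follows. The threshold values and the helper inputs (per-second market-order sizes, book sizes at the best quotes, the mid-price change fraction) are placeholders chosen for illustration, not the calibrated values used in the paper.

def aggregate_state(mkt_sell, bid_book, mkt_buy, ask_book,
                    mid_change_frac, inventory, cum_pnl,
                    f_bar=0.5, i_bar=3, p_bar=0.0):
    # Map full LOB/trader information to the five-attribute aggregated state
    # (BS, AS, MF, IS, CP) described in Section 4.1.
    bs = int(mkt_sell > bid_book)                                  # bidSpeed
    as_ = int(mkt_buy > ask_book)                                  # askSpeed
    if mid_change_frac == 0:
        mf = 0
    else:
        sign = 1 if mid_change_frac > 0 else -1
        mf = sign * (2 if abs(mid_change_frac) > f_bar else 1)     # avgmidChangeFrac
    if inventory == 0:
        inv_sign = 0
    else:
        sign = 1 if inventory > 0 else -1
        inv_sign = sign * (2 if abs(inventory) > i_bar else 1)     # invSign
    cp = int(cum_pnl <= p_bar)                                     # cumPnL flag (1 = low PnL)
    return bs, as_, mf, inv_sign, cp

print(aggregate_state(mkt_sell=5, bid_book=3, mkt_buy=1, ask_book=4,
                      mid_change_frac=-0.8, inventory=2, cum_pnl=1.5))
# -> (1, 0, -2, 1, 0)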
4.2 Q-learning Model

Given an LOB state R_t ∈ R defined in Section 2.1, inventory value inv_t defined in (7), and cumulative PnL pnl_t defined in (9), the state space aggregation function is

G(R_t, inv_t, pnl_t) := (BS_t, AS_t, MF_t, IS_t, CP_t), for all t ∈ T.   (10)

Let ran(G) be the range of the state space aggregation function G, and denote the aggregated state space by G := ran(G). Then, it is straightforward that the size of the aggregated state space is 2 × 2 × 5 × 5 × 2 = 200, which is small enough that the Q-learning algorithm has good performance.

We restrict the admissible action space in Section 2.2 to prevent placing orders when the invSign is either +2 or −2, meaning we have had many more limit buy orders execute than limit sell orders or vice versa. This is because there is a high level of risk associated with such imbalance. For a given aggregated state s ∈ G, let s_IS denote the fourth component of the right-hand side of (10). The restricted admissible action space is

A_s := {(0, 0), (1, 0)}, if s_IS = −2;
       {(0, 0), (0, 1)}, if s_IS = +2;
       {(0, 0), (0, 1), (1, 0), (1, 1)}, otherwise.                           (11)

As in [Watkins and Dayan, 1992], the Q factor Q(s, a) represents the value of taking action a when in aggregated state s. The recommended action when in state s is

a*(s) = arg max_{a ∈ A_s} Q(s, a),   (12)

and we record these in a lookup table called the Q table. The resulting size of the Q table used to look up the recommended action associated with any given state is 2 × 2 × 5 × 3 × 2 × 4 + 2 × 2 × 5 × 2 × 2 × 2 = 640.

4.3 Algorithm

The core of a Q-learning algorithm is the iterative updates of Q factors based on sample paths. However, since we have no information about real-time inventory and PnL in our dataset, for a given aggregated state, we select a sample path in the dataset based on the first three dimensions of the state and randomly set an inventory level and PnL value consistent with the invSign term and the cumPnL term, respectively. For ease of exposition, we define a sampling-related aggregation function by, for all R_t ∈ R,

Γ(R_t) := (BS_{R_t}, AS_{R_t}, MF_{R_t}).   (13)

For any aggregated state s ∈ G, define M_s := {R_t ∈ R : Γ(R_t) = (s_BS, s_AS, s_MF)}, where s_BS, s_AS, s_MF denote the first three components of the right-hand side of (10), to be the set of full states that can be mapped into aggregated state s. Let τ_s := {t : R_t ∈ M_s} be the set of timestamps at which the full state of the order book is an element of the set M_s. Suppose a sample path ω starts at time t ∈ τ_s. Then, we denote the immediate exogenous information (i.e., adds, cancels, and trades) in the following one second by Ô(ω), Ĉ(ω), D̂^{MB}(ω), and D̂^{MS}(ω). Let Ω_s be the set of all possible sample paths for aggregated state s.
Algorithm 1: Q-learning algorithm pseudocode

1: Initialization: Set Q_0(s, a) = 0, α_0(s, a) = α_0, K_0(s, a) = 0, for all s ∈ G, a ∈ A_s, and stopping criterion N̄; and set n = 0.
2: while n = 0, 1, 2, ... do
3:   Randomly select S^Q_n ⊆ {(s, a) | s ∈ G, a ∈ A_s};
4:   for (s, a) : s ∈ G, a ∈ A_s do
5:     if (s, a) ∈ S^Q_n then
6:       (i) Randomly select a sample path ω^s_n ∈ Ω_s, and denote its initial full state by R^s_n. Randomly set an exact value for inventory level and cumulative PnL based on s_IS and s_CP, denoted by inv^s_n and pnl^s_n. (ii) Update the full state to R^s_{n+1}, as detailed in equations (1)-(4) in Section 2. Denote the updated inventory level and cumulative PnL by inv^s_{n+1} and pnl^s_{n+1}. (iii) Translate the updated full state (R^s_{n+1}, inv^s_{n+1}, pnl^s_{n+1}) into an aggregated state by s̄ = G(R^s_{n+1}, inv^s_{n+1}, pnl^s_{n+1}). (iv) Update K_n(s, a) = K_{n−1}(s, a) + 1. (v) Compute Q_{n+1} by Equation (14) with α_n(s, a) = α_0 / (1 + K_n(s, a)).
7:     else
8:       Q_{n+1}(s, a) = Q_n(s, a); K_n(s, a) = K_{n−1}(s, a).
9:     end if
10:  end for
11:  if n > N̄ then
12:    break while
13:  else
14:    n = n + 1
15:  end if
16: end while
17: return For any R_t ∈ R in which the decision-maker's current inventory and PnL are inv and pnl, the optimal action is arg max_{a ∈ A_s} Q_{n+1}(G(R_t, inv, pnl), a).

The Q-learning algorithm, more specifically detailed in the pseudocode in Algorithm 1, follows the steps below. We set the maximum number of iterations N̄ to be large enough such that the resulting objective function has become stable. At each iteration, we make a uniform random selection of a certain number of state-action pairs to update. For each aggregated state s of interest, we randomly select a sample path ω ∈ Ω_s that has initial full state R(ω) ∈ M_s. Then we update the full state based on the action and exogenous information from the sample path, calculate the inventory level inv(ω) and cumulative PnL, and further translate the updated full state into an aggregated version, denoted by s̄, using the aggregation function G defined in equation (10). Then, we can update the Q factor according to the update function

Q_{n+1}(s, a) = (1 − α_n(s, a)) · Q_n(s, a) + α_n(s, a) · (V(R(ω), a, D̂^{MB}(ω), D̂^{MS}(ω), inv(ω)) + γ max_{v ∈ A_{s̄}} Q_n(s̄, v)),   (14)

where V(R(ω), a, D̂^{MB}(ω), D̂^{MS}(ω), inv(ω)) is as defined in equation (8), and the learning rate α_n(s, a) is as defined in the Q-learning algorithm pseudocode.

4.4 Resulting Q Factors

The trading policy learned from the Q-learning algorithm can be summarized by the following several rules: (1) it is profitable to place limit orders on the more active side, that is, it is better to add a limit buy order in a sell-heavy market, and vice versa; (2) market-making is not directional, that is, movements in the mid price do not affect the decisions regarding where to place an order, which aligns with the market-making strategy structure in [Menkveld, 2013]; (3) the optimal market-making strategy depends on the level of inventory, and maintaining inventory near zero is preferable, which is consistent with [Guilbaud and Pham, 2013]; (4) market makers control their cumulative PnL by canceling all orders in many cases when cumulative PnL is low.
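Before turning to the empirical results, the core of the tabular update (14) and the greedy lookup (12) can be sketched in a few lines of code. The dictionary-based Q table, the placeholder discount factor and learning-rate constant, and the toy transition are our own simplifications for illustration, not the authors' implementation.

import random
from collections import defaultdict

GAMMA = 0.95      # discount factor (placeholder value)
ALPHA0 = 0.5      # initial learning rate (placeholder value)

Q = defaultdict(float)   # Q[(state, action)]
K = defaultdict(int)     # visit counts per (state, action)

def allowed_actions(state):
    # Equation (11): restrict order placement when the inventory sign is extreme.
    inv_sign = state[3]
    if inv_sign == -2:
        return [(0, 0), (1, 0)]
    if inv_sign == 2:
        return [(0, 0), (0, 1)]
    return [(0, 0), (0, 1), (1, 0), (1, 1)]

def q_update(state, action, reward, next_state):
    # One application of update rule (14) with learning rate ALPHA0 / (1 + K).
    K[(state, action)] += 1
    alpha = ALPHA0 / (1 + K[(state, action)])
    target = reward + GAMMA * max(Q[(next_state, a)] for a in allowed_actions(next_state))
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target

def greedy_action(state):
    # Equation (12): the recommended action recorded in the Q table.
    return max(allowed_actions(state), key=lambda a: Q[(state, a)])

# Toy usage with a made-up transition: reward 0.5 for quoting both sides.
s, s_next = (1, 0, 0, 0, 0), (0, 1, 1, 1, 0)
q_update(s, (1, 1), reward=0.5, next_state=s_next)
print(greedy_action(s))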
5 Results

5.1 Performance Evaluation

Our dataset provides information on the LOB during time periods at which our partner market-making firm was trading, but we do not know the strategies the firm was using. In other words, we do not know what is called the behavior policy in the off-policy evaluation literature. The implication is that we cannot use any of the three main off-policy evaluation methods: the direct method [Bertsekas et al., 1995; Lagoudakis and Parr, 2003; Sutton and Barto, 2018], importance sampling [Swaminathan and Joachims, 2015], and the doubly robust method [Dudík et al., 2014; Jiang and Li, 2015; Robins et al., 1994].

Fortunately for evaluation purposes, our partner market-making firm represents a small percentage of the trades recorded in the LOB. As a result, it is reasonable to assume that the orders placed by our partner market-making firm are not unduly influencing the other players in the market, and so the information we see recorded in the LOB regarding the arrival of market buy and sell orders, and the arrival of limit buy and sell orders, as well as cancellations, by other market participants should not change too much when our partner market-making firm trades according to a different strategy. This suggests that a straightforward backtest evaluation of the profit that would have been made, and the associated Sharpe ratio, is a representative test of the performance of our proposed trading strategy, in which orders are executed according to the Q-table output from the Q-learning algorithm.

Our partner market-making firm developed its own backtester to evaluate any proposed trading strategy. This is accomplished by using historical data to reconstruct the trades that would have occurred in the past using the proposed trading strategy, and recording the resulting cumulative PnL and associated Sharpe ratio. We first used the backtester to conduct in-sample experiments, and used the results of those experiments to set algorithm parameters (specifically, the f, I, and P thresholds defined in Section 4.1). We further used the backtester to set two external controls, one that forces closing all open positions if the PnL becomes too negative, and another that forces closing if the maximum drawdown becomes too large. After finalizing our algorithm's parameters and the aforementioned external control parameters, we tested once on the out-of-sample data in the backtester.

The out-of-sample performance of our algorithm in the backtester resulted in an average daily PnL with three orders of magnitude, and a Sharpe ratio above 3. This passed the firm's test standards. Consequently, our partner market-making firm desired to put our algorithm into production.

5.2 Benchmarks

To further anchor the performance of our algorithm in the literature, we compare with a set of benchmarks as below. From [Spooner et al., 2018], [Lim and Gorse, 2018] and [Doloc, 2019], common benchmarks include fixed spread-based strategies, random strategies, and the Avellaneda-Stoikov strategy [Avellaneda and Stoikov, 2008]. The fixed spread-based strategy that provides the most relevant benchmark is the one that at all times has limit orders resting at the best bid and ask prices. The most natural random strategy benchmark is the one that flips two fair coins in each time period, one to decide whether or not to have an order resting at the best bid price and the other for the best ask price. The Avellaneda-Stoikov strategy is not relevant for us because its orders may be placed at prices other than the best bid and ask prices.

Figure 3 compares our Q-learning algorithm against the aforementioned two common benchmarks, as well as against our partner firm's implemented trading strategy. For this, there is no need to conduct in-sample testing because the parameters of the benchmark strategies are all fixed, and we use the same external controls[7] for the benchmark strategies as we used when we implemented our Q-learning algorithm. Figure 3 clearly shows that our algorithm attains the highest cumulative PnL over the one-month out-of-sample trading period. As for the Sharpe ratio, the only benchmark strategy for which the Sharpe ratio is positive is our partner firm's implemented trading strategy; however, that ratio is still smaller than the one attained by our proposed Q-learning strategy.
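The random benchmark is simple enough to spell out. A sketch under the stated coin-flip rule might look like the following; the per-second decision frequency and the 5.5-hour horizon are illustrative choices of ours.

import random

def random_benchmark_actions(num_periods, seed=7):
    # Benchmark policy from Section 5.2: in each period, flip two fair coins to decide
    # whether to rest an order at the best bid and at the best ask.
    rng = random.Random(seed)
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(num_periods)]

# One action pair per second over a 5.5-hour trading window.
actions = random_benchmark_actions(num_periods=int(5.5 * 3600))
print(actions[:5])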
[Figure 3: Out-of-Sample Cumulative PnL, with re-scaled y-axis to protect confidentiality]

[7] Recall from Section 5.1 that the two external controls force trading to stop if the PnL becomes too negative or if the maximum drawdown becomes too large.

6 Future Work

From Figure 3, we note that under our Q-learning algorithm we may encounter the following phenomenon: several days in a row we lose money, followed by one day in which we make a large amount of money. This is not ideal from a risk management perspective and motivates our main desire in future work, to smooth out the resulting equity curve. On the days with losses, often we could have been profitable on that day had we locked in the profit earlier and closed down trading. However, we are still working to develop an algorithmic approach to decide when to close down trading, resulting in the length of the time horizon T being a random variable.

Acknowledgments

We would like to extend our deepest gratitude to Volodymyr Babich, Nathan Kallus, and Melanie Rubino for their helpful comments.

References

[Avellaneda and Stoikov, 2008] Marco Avellaneda and Sasha Stoikov. High-frequency trading in a limit order book. Quantitative Finance, 8(3):217-224, 2008.

[Bertsekas and Tsitsiklis, 1996] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming, volume 5. Athena Scientific, Belmont, MA, 1996.

[Bertsekas et al., 1995] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.

[Cartea and Jaimungal, 2016] Álvaro Cartea and Sebastian Jaimungal. Incorporating order-flow into optimal execution. Mathematics and Financial Economics, 10(3):339-364, 2016.

[Cartea et al., 2018] Álvaro Cartea, Ryan Donnelly, and Sebastian Jaimungal. Enhancing trading strategies with order book signals. Applied Mathematical Finance, 25(1):1-35, 2018.

[Cont and De Larrard, 2012] Rama Cont and Adrien De Larrard. Order book dynamics in liquid markets: limit theorems and diffusion approximations. Available at SSRN 1757861, 2012.

[Doloc, 2019] Cris Doloc. Applications of Computational Intelligence in Data-Driven Trading. John Wiley & Sons, 2019.

[Dudík et al., 2014] Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485-511, 2014.

[Guilbaud and Pham, 2013] Fabien Guilbaud and Huyen Pham. Optimal high-frequency trading with limit and market orders. Quantitative Finance, 13(1):79-94, 2013.

[Jaakkola et al., 1994] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pages 703-710, 1994.
[Jiang and Li, 2015] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.

[Lagoudakis and Parr, 2003] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107-1149, 2003.

[Lim and Gorse, 2018] Ye-Sheen Lim and Denise Gorse. Reinforcement learning for high-frequency market making. In ESANN, 2018.

[Menkveld, 2013] Albert J. Menkveld. High frequency trading and the new market makers. Journal of Financial Markets, 16(4):712-740, 2013.

[Pepyne et al., 1996] David L. Pepyne, Douglas P. Looze, Christos G. Cassandras, and Theodore E. Djaferis. Application of Q-learning to elevator dispatching. IFAC Proceedings Volumes, 29(1):4742-4747, 1996.

[Powell, 2007] Warren B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality, volume 703. John Wiley & Sons, 2007.

[Robins et al., 1994] James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846-866, 1994.

[Spooner et al., 2018] Thomas Spooner, John Fearnley, Rahul Savani, and Andreas Koukorinis. Market making via reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 434-442. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

[Sutton and Barto, 2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[Swaminathan and Joachims, 2015] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231-3239, 2015.

[Tsitsiklis, 1994] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185-202, 1994.

[Watkins and Dayan, 1992] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.
