A Multiagent Approach to Q-Learning for Daily Stock Trading
Abstract—Portfolio management for trading in the stock market poses a challenging stochastic control problem of significant commercial interest to the finance industry. To date, many researchers have proposed various methods to build an intelligent portfolio management system that can recommend financial decisions for daily stock trading. Many promising results have been reported from the supervised learning community on the possibility of building a profitable trading system. More recently, several studies have shown that even the problem of integrating stock price prediction results with trading strategies can be successfully addressed by applying reinforcement learning algorithms. Motivated by this, we present a new stock trading framework that attempts to further enhance the performance of reinforcement learning-based systems. The proposed approach incorporates multiple Q-learning agents, allowing them to effectively divide and conquer the stock trading problem by defining the necessary roles for cooperatively carrying out stock pricing and selection decisions. Furthermore, in an attempt to address the complexity issue that arises when considering a large amount of data to obtain long-term dependence among the stock prices, we present a representation scheme that can succinctly summarize the history of price changes. Experimental results on a Korean stock market show that the proposed trading framework outperforms systems trained by other alternative approaches, both in terms of profit and risk management.

Index Terms—Financial prediction, intelligent multiagent systems, portfolio management, Q-learning, stock trading.

Manuscript received August 5, 2005; revised February 21, 2006. This work was supported by a research grant (2004) from Sungshin Women's University. This paper was recommended by Associate Editor R. Subbu.

J. W. Lee and E. Hong are with the School of Computer Science and Engineering, Sungshin Women's University, Seoul 136-742, Korea (e-mail: jwlee@sungshin.ac.kr; hes@sungshin.ac.kr).

J. Park (corresponding author) is with the Department of Industrial Engineering, Seoul National University, Seoul 151-742, Korea (e-mail: jonghun@snu.ac.kr).

J. O was with the School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea. He is now with NHN Corporation, Seongnam 463-811, Korea (e-mail: rupino11@naver.com).

J. Lee is with the Department of Multimedia Science, Sookmyung Women's University, Seoul 140-742, Korea (e-mail: bigrain@sookmyung.ac.kr).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSMCA.2007.904825
I. INTRODUCTION

BUILDING an intelligent system that can produce timely stock trading suggestions has always been a subject of great interest for many investors and financial analysts. Nevertheless, the problem of finding the best time to buy or sell has remained extremely hard, since there are too many factors that may influence stock prices [1]. The famous "efficient market hypothesis" (EMH), which was tested in economics over a 40-year period without definitive findings, states that no investment system can consistently yield average returns exceeding the average returns of the market as a whole. For many years, finance theoreticians have argued for the EMH as a basis for denouncing techniques that attempt to extract useful information about the future behavior of stock prices from historical data [2].

However, the assumptions underlying this hypothesis turn out to be unrealistic in many cases [3], and in particular, most approaches taken to testing the hypothesis were based on linear time series modeling [4]. Accordingly, as claimed in [4], given enough data and time, an appropriate nonparametric machine learning method may be able to discover more complex nonlinear relationships through learning from examples. Furthermore, if we step back from the goal of "consistently" beating the market, we find many interesting empirical results indicating that the market might be somehow predictable [5].

Indeed, the last decade has witnessed an abundance of such approaches to financial analysis from both academia and industry. Application of various machine learning techniques to stock trading and portfolio management has experienced significant growth, and many trading systems based on different computational methodologies and investment strategies have been proposed in the literature [6]–[10]. In particular, there has been a huge amount of interest in the application of neural networks to predicting stock market behavior from current and historical data, and this popularity continues mainly because neural networks do not require an exact parametric system model and are relatively insensitive to unusual data patterns [3], [11].
More recently, numerous studies have shown that even the problem of integrating stock price prediction results with dynamic trading strategies to develop an automatic trading system can be successfully addressed by applying reinforcement learning algorithms. Reinforcement learning provides an approach to solving the problem of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals [12]. Compared with supervised learning techniques such as neural networks, which require input and output pairs, a reinforcement learning agent learns behavior through trial-and-error interactions with a dynamic environment, while attempting to compute an optimal policy under which it can achieve maximal average rewards from the environment.
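As a quick orientation for the algorithms discussed below, the following minimal tabular sketch shows the standard Q-learning backup together with an ε-greedy action choice. The action set, state encoding, and reward here are toy placeholders, not the MQ-Trader formulations.

```python
from collections import defaultdict
import random

# Standard tabular Q-learning backup:
#   Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]

ALPHA, GAMMA = 0.1, 0.95            # learning rate and discount factor
ACTIONS = ("buy", "sell", "hold")   # illustrative action set

Q = defaultdict(float)              # Q-table: (state, action) -> value

def update(state, action, reward, next_state):
    """Apply one Q-learning backup for an observed transition."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    td_error = reward + GAMMA * best_next - Q[(state, action)]
    Q[(state, action)] += ALPHA * td_error

def epsilon_greedy(state, eps=0.1):
    """Explore with probability eps; otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])
```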
Hence, considering the problem characteristics of designing a stock trading system that interacts with a highly dynamic stock market with the objective of maximizing profit, it is worth considering a reinforcement learning algorithm such as Q-learning to train a trading system. Several research results along this line have been published in the literature. Neuneier [8] used a Q-learning approach to make asset allocation decisions in financial markets, and Neuneier and Mihatsch [13] incorporated a notion of risk sensitivity into the construction of the Q-function. Another portfolio management system built by use of Q-learning was presented in [14], where absolute profit and relative risk-adjusted profit were considered as performance functions to train the system. In [15], an adaptive algorithm for direct reinforcement, named recurrent reinforcement learning, was proposed and used to learn an investment strategy online. Later, Moody and Saffell [16] showed how to train trading systems via direct reinforcement; the performance of the learning algorithm proposed in [16] was demonstrated through an intraday currency trader and a monthly asset allocation system for the S&P 500 stock index and T-Bills.

In this paper, we propose a new stock trading framework that attempts to further enhance the performance of reinforcement learning-based systems. The proposed framework, named MQ-Trader, aims to make buy and sell suggestions for investors in their daily stock trading. It takes a multiagent approach in which each agent has its own specialized capability and knowledge, and it employs a Q-learning algorithm to train the agents. The motivation behind the incorporation of multiple Q-learning agents is to enable them to effectively divide and conquer the complex stock trading problem by defining the necessary roles for cooperatively carrying out stock pricing and selection decisions. At the same time, the proposed multiagent architecture attempts to model a human trader's behavior as closely as possible.

Specifically, MQ-Trader defines an architecture that consists of four cooperative Q-learning agents: the first two agents, named the buy and sell signal agents, respectively, attempt to determine the right time to buy and sell shares based on global trend prediction. The other two agents, named the buy and sell order agents, carry out intraday order executions by deciding the best buy price (BP) and sell price (SP), respectively. The individual behavior of the order agents is defined in such a way that microscopic market characteristics such as intraday price movements are considered. Cooperation among these agents facilitates efficient learning of trading policies that can maximize profitability while managing risks effectively in a unified framework.

One of the important issues that must be addressed when designing a reinforcement learning algorithm is the representation of states. In particular, the problem of maintaining the whole raw series of past stock prices to compute long-term correlations becomes intractable as the size of the considered time window grows large. Motivated by this, we propose a new state representation scheme, named the turning point (TP) matrix, that can succinctly summarize the historical information of price changes. The TP matrix is essentially a binary matrix for state representation of the signal agents. Furthermore, in MQ-Trader, various technical analysis methods such as short-term moving averages (MAs) and the Japanese candlestick representation [17] are utilized by the order agents.

In Section II, we present the architecture of the proposed framework, describe how cooperation among the trading agents in MQ-Trader is achieved, and subsequently define the state representation schemes. Section III presents the learning algorithms for the participating agents after briefly introducing basic concepts of Q-learning. The experimental setup and results on a real Korean stock market, i.e., the Korea Composite Stock Price Index (KOSPI), are described in Section IV. Finally, Section V concludes this paper with a discussion of future research directions.

II. PROPOSED FRAMEWORK FOR MULTIAGENT Q-LEARNING

In this section, we first present the proposed MQ-Trader framework, which employs a cooperative multiagent architecture for Q-learning. After describing the behavior of the individual agents during the learning process, this section proceeds to define the necessary state representations for the agents. The detailed learning algorithms are presented in Section III.

A. Proposed Learning Framework

In an attempt to simulate a human investor's behavior and, at the same time, to divide and conquer the considered learning problem more effectively, MQ-Trader defines four agents. First, the stock trading problem is divided into the timing and the pricing problems, of which the objectives are, respectively, to determine the best time and the best price for trading. This naturally leads to the introduction of the following two types of agents: 1) the signal agent and 2) the order agent.

Second, the motivation for separating the buy signal agent from the sell signal agent comes from the fact that an investor has different criteria for decision making depending on whether she/he buys or sells a stock. When buying a stock, the investor usually considers the possibility of the stock price rising or falling. In contrast, when selling a stock, the investor considers not only the tendency of the stock price movements but also the profit or loss incurred by the stock. Accordingly, the separation is necessary to allow the agents to have different state representations. That is, while the buy signal agent maintains the price history information as its state to estimate the future trend based on the price changes over a long-term period, the sell signal agent needs to consider the current profit/loss obtained in addition to the price history.

Finally, the buy order and the sell order agents, respectively, generate orders to buy and sell a stock at some specified price. These are called bid and offer. The objective of these order agents is to decide the best price for trading within a single day in an attempt to maximize profit. A structural sketch of the four roles follows.
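The division of labor just described can be summarized structurally as follows. This is a minimal organizational sketch under assumed class and method names, not the paper's API; the actual state encodings, action sets, and rewards are defined in the remainder of this section and in Section III.

```python
# Structural sketch of the four cooperative MQ-Trader roles.
# All class and method names are illustrative assumptions.

class BuySignalAgent:
    """Decides WHEN to buy from long-term price history (e.g., a TP matrix)."""
    def should_buy(self, price_history_state) -> bool: ...

class SellSignalAgent:
    """Decides WHEN to sell from price history plus the current profit/loss."""
    def should_sell(self, price_history_state, profit_ratio) -> bool: ...

class BuyOrderAgent:
    """Decides the intraday buy price (BP) at which the bid is placed."""
    def decide_bp(self, intraday_state) -> float: ...

class SellOrderAgent:
    """Decides the intraday sell price (SP) at which the offer is placed."""
    def decide_sp(self, intraday_state) -> float: ...
```

In an episode, a buy decision by the buy signal agent is handed to the buy order agent for execution, and symmetrically for selling, so the four decisions chain together as described next.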
Fig. 1 shows the overall learning procedure defined in MQ-Trader. It aims to maximize the profit from investment by considering the global trend of the stock price as well as the intraday price movements. Under this framework, each agent has its own goal while interacting with the others to share episodes throughout the learning process.

More specifically, given a randomly selected stock item, an episode for learning is started by randomly selecting a
(Case 1) $1 \leq i < n$:
$$
a_{ij} = \begin{cases}
1, & \text{if there exists a TP such that } C_{\mathrm{TP},D} \leq 0 \text{ and } \displaystyle\sum_{k=0}^{i-1} F_k \leq |C_{\mathrm{TP},D}| \times 100 < \sum_{k=0}^{i} F_k \\
   & \text{during the period } \left[\displaystyle\sum_{k=0}^{j-1} F_k + 1,\ \sum_{k=0}^{j} F_k\right] \\
0, & \text{otherwise}
\end{cases}
$$

(Case 2) $i = n$:
$$
a_{ij} = \begin{cases}
1, & \text{if there exists a TP such that } C_{\mathrm{TP},D} \leq 0 \text{ and } \displaystyle\sum_{k=0}^{i-1} F_k \leq |C_{\mathrm{TP},D}| \times 100 < \infty \\
   & \text{during the period } \left[\displaystyle\sum_{k=0}^{j-1} F_k + 1,\ \sum_{k=0}^{j} F_k\right] \\
0, & \text{otherwise}
\end{cases}
$$
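For concreteness, the case analysis above can be transcribed directly into code. This is a minimal sketch under two assumptions made only for illustration: F = (F_0, ..., F_n) is the sequence of interval widths appearing in the summations, and each turning point (TP) is given as a pair (t, c), where t is the number of days before the decision day D at which the TP occurred and c is its ratio C_TP,D.

```python
def tp_entry(i, j, tps, F, n):
    """Entry a_ij of the binary TP matrix (i and j are 1-indexed).

    tps: list of (t, c) pairs, one per turning point, where t is the TP's
    age in days before day D and c is its ratio C_TP,D.
    """
    lo = sum(F[:i])                           # sum_{k=0}^{i-1} F_k
    hi = float("inf") if i == n else sum(F[:i + 1])
    t_lo = sum(F[:j]) + 1                     # period start (days before D)
    t_hi = sum(F[:j + 1])                     # period end
    return int(any(
        c <= 0 and lo <= abs(c) * 100 < hi and t_lo <= t <= t_hi
        for t, c in tps
    ))
```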
TABLE I
SAMPLE ENCODING SCHEME FOR PROFIT RATIO

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
$$
TABLE II
STATE REPRESENTATION FOR THE ORDER AGENTS

of the stock even with the lowest possible price is not possible, the SP is set to $P^{C}_{\delta_{\mathrm{SELL}}+1}$, which is the closing price on day $\delta_{\mathrm{SELL}} + 1$. The lowest reward, i.e., 0, is given in this case. Otherwise, the agent tries different prices until a feasible SP is obtained, as in the case of the buy order agent. The reward function that considers the TC and price slippage for this case is defined similarly to that of the buy order agent and achieves its maximum value when the SP determined is equal to the highest possible price.
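The fallback and reward behavior just described can be sketched as follows. Names and the simplified reward shape are illustrative assumptions; the paper's actual reward also incorporates the TC and price slippage terms.

```python
def execute_sell(candidate_sps, day_low, day_high, next_close):
    """Try candidate sell prices (SPs) in turn. A sell at price sp is
    treated as executable if the market traded at or above sp that day;
    if even the lowest possible price cannot be executed, fall back to
    the next day's closing price with zero reward."""
    for sp in candidate_sps:
        if sp <= day_high:                       # a feasible SP was found
            # Reward is maximal when sp equals the highest possible price
            # (TC and slippage terms are omitted in this sketch).
            span = day_high - day_low
            reward = (sp - day_low) / span if span > 0 else 1.0
            return sp, max(0.0, reward)
    return next_close, 0.0                       # SP = closing price on
                                                 # day delta_SELL + 1
```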
Fig. 10. Distribution of the ratio of the difference between the five-day MA and the lowest stock price to the five-day MA.
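Written out with notation assumed here for readability (the paper's own symbols may differ), the quantity whose distribution Fig. 10 shows is

$$
r(d) = \frac{\mathrm{MA}_5(d) - P^{L}(d)}{\mathrm{MA}_5(d)}
$$

where $\mathrm{MA}_5(d)$ is the five-day MA on day $d$ and $P^{L}(d)$ is the lowest stock price.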
TABLE IV
PREDICTION PERFORMANCE OF THE CONSIDERED NEURAL NETWORK CONFIGURATIONS
Fig. 12. Performance comparison result for the first test data set.
Fig. 13. Performance comparison result for the second test data set.
terms of the asset growth rates achieved by each trading system throughout the entire test period, from June 2001 to November 2005, when it was provided with 0.4 billion won initially (the won is the basic unit of money in Korea). For example, the best case performance of MQ-Trader shows that its asset had grown to 4.55 billion won (= 0.4 × 1138.7%) by the end of November 2005.

Only the best case results for the two aforementioned test sets in this section are separately given in Figs. 12 and 13, where the dotted line at the bottom indicates the KOSPI index,
which is defined similarly to the S&P 500 index of the U.S. stock market. The KOSPI index is shown to compare the performances of the aforementioned trading systems against the baseline market performance during the test period, and it is equated to 0.4 billion at the beginning for visualization purposes. The series (a) through (f) in Fig. 12 show the accumulated assets for each trading system during the first test period (June 2001 to August 2003), when each system starts with an initial asset of 0.4 billion won. To make the performance comparison clear, the starting assets for the trading systems during the second test period (September 2003 to November 2005) are also equated to 0.4 billion in Fig. 13.

As can be seen from these results, the proposed MQ-Trader outperformed the other alternative trading frameworks (represented by the series (c) to (f) in Figs. 12 and 13) by achieving more than four times the asset growth for the first test period and more than 2.5 times for the second test period. The performance of MQ-Trader always lay between those of the I2Q-Trader and the 2Q-Trader (respectively represented as the series (a) and (c) in Figs. 12 and 13), as expected. Accordingly, the performance difference between MQ-Trader and 2Q-Trader can be attributed to the contributions of the order agents. Furthermore, it can be deduced by comparing the series (b) and (d) that the proposed TP matrix facilitates the performance improvement of a trading system.

It is interesting to note that MQ-Trader performed satisfactorily during the long bear market between April 2002 and April 2003. In addition, it endured the short stock market shock of May 2004 quite well, with a relatively small loss. Note, however, that in the bull market period (May 2005 to July 2005), the traders with multiple agents, including MQ-Trader, were not able to exploit the opportunity, while the two single-agent traders (indicated by the sharp rises of the series (e) and (f) during the period) were. Based on this observation, it appears that MQ-Trader achieves good performance particularly when stock prices are sharply declining due to market inefficiency incurred by psychological reactions of investors.

The results of the experimental study examining the effects of TCs and price slippages on the performance of MQ-Trader are presented in Tables VI and VII, where only the best case performances among 20 trials with different random initializations of the neural networks are shown. Three different rates for calculating TCs and three different probabilities of price slippage were considered, resulting in a total of nine configurations. Tables VI and VII, respectively, present the results of the asset growth rates and the trading frequencies achieved by MQ-Trader for different configurations during the entire period of June 2001 through November 2005. The initial asset given to MQ-Trader was 0.4 billion won.

From Table VI, it can be seen that the most profitable results (1138.7% asset growth) were obtained when both the TC rate and the price slippage percentage were lowest, and that the profit decreases as the TC rate increases and the chance of price slippage becomes higher. Similar results were observed for the number of trades made during the same test period, as shown in Table VII. Together, these imply that MQ-Trader has learned the risks associated with stock trading, which were introduced through the TC and price slippage: when the TC is expensive, MQ-Trader buys and sells a stock carefully, leading to less frequent trades and smaller net profits. This is a natural consequence, since a trade with a small gross profit may end up as an overall loss after the TCs for the stock purchase and disposition are paid; a toy calculation below illustrates this point. Likewise, with a high chance of price slippage, it is advantageous for MQ-Trader to avoid aggressive trading.
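To make the effect of the TC and slippage concrete, the following toy calculation (with assumed, not reported, numbers) shows how a small gross gain turns into a net loss once round-trip costs are paid.

```python
def net_profit(buy_price, sell_price, shares, tc_rate, slippage):
    """Net profit after an adverse slippage on each execution price and
    the TC paid on both the purchase and the disposition."""
    buy_cost = buy_price * (1 + slippage) * shares * (1 + tc_rate)
    sell_proceeds = sell_price * (1 - slippage) * shares * (1 - tc_rate)
    return sell_proceeds - buy_cost

# A 0.5% gross gain is wiped out by a 0.3% TC rate and 0.1% slippage:
print(net_profit(10_000, 10_050, 100, 0.003, 0.001))   # about -3020: a net loss
```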
Furthermore, Figs. 14 and 15 show how the profitability of MQ-Trader decreases as the risks increase throughout the entire test period. The TC rate and the percentage of price slippage used for each series in Figs. 14 and 15 are summarized in Table VIII. As expected, the profitability is highest when the risks represented by the TC and price slippage are lowest, and lowest when the risks are highest.

Finally, we found that consideration of the current profit or loss by MQ-Trader did not necessarily lead to the disposition effect, in contrast to human investors, who are subject to it for psychological reasons. The average number of days MQ-Trader held a stock item was 6.9 for the profitable trades and 7.3 for the unsuccessful trades. This small difference of 0.4 day suggests that MQ-Trader is not prone to the disposition effect.
Fig. 14. Performance comparison result for different levels of risks during the first test period.
Fig. 15. Performance comparison result for different levels of risks during the second test period.
V. CONCLUSION

There has long been a strong interest in applying machine learning techniques to financial problems. This paper has explored the issues of designing a multiagent system that aims to provide effective decision support for the daily stock trading problem. The proposed approach, named MQ-Trader, defines multiple Q-learning agents in order to effectively divide and conquer the stock trading problem in an integrated environment. We presented the learning framework along with the state representations for the cooperative agents of MQ-Trader and described the detailed algorithms for training the agents.

Furthermore, in an attempt to address the complexity problem that arises when considering a large amount of data to compute long-term dependence among the stock prices, we proposed a new state representation scheme, named the TP matrix, that can succinctly represent the history of price changes.

Through an extensive empirical study using real financial data from a Korean stock market, we found that our approach produces better trading performance than systems based on other alternative frameworks. Based on these observations, the profits that can be obtained from the proposed framework appear to be promising.

From the future research point of view, there are some clear extensions to be investigated. These include addressing the issues of distributing the asset across multiple portfolios and of adapting to the trend of a stock market. While reinforcement learning is promising, the introduction of these considerations will make the problem more complex. Therefore, one of the future
TABLE VIII
LEGEND FOR THE SERIES IN THE PLOTS OF FIGS. 14 AND 15