A Multiagent Approach to Q-Learning for Daily Stock Trading
Abstract—Portfolio management for trading in the stock market poses a challenging stochastic control problem of significant commercial interest to the finance industry. To date, many researchers have proposed various methods to build an intelligent portfolio management system that can recommend financial decisions for daily stock trading. Many promising results have been reported from the supervised learning community on the possibility of building a profitable trading system. More recently, several studies have shown that even the problem of integrating stock price prediction results with trading strategies can be successfully addressed by applying reinforcement learning algorithms. Motivated by this, we present a new stock trading framework that attempts to further enhance the performance of reinforcement learning-based systems. The proposed approach incorporates multiple Q-learning agents, allowing them to effectively divide and conquer the stock trading problem by defining the necessary roles for cooperatively carrying out stock pricing and selection decisions. Furthermore, in an attempt to address the complexity issue that arises when considering a large amount of data to obtain long-term dependence among the stock prices, we present a representation scheme that can succinctly summarize the history of price changes. Experimental results on a Korean stock market show that the proposed trading framework outperforms systems trained by other alternative approaches, both in terms of profit and risk management.

Index Terms—Financial prediction, intelligent multiagent systems, portfolio management, Q-learning, stock trading.

Manuscript received August 5, 2005; revised February 21, 2006. This work was supported by a research grant (2004) from Sungshin Women's University. This paper was recommended by Associate Editor R. Subbu.

J. W. Lee and E. Hong are with the School of Computer Science and Engineering, Sungshin Women's University, Seoul 136-742, Korea (e-mail: jwlee@sungshin.ac.kr; hes@sungshin.ac.kr).

J. Park (corresponding author) is with the Department of Industrial Engineering, Seoul National University, Seoul 151-742, Korea (e-mail: jonghun@snu.ac.kr).

J. O was with the School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea. He is now with NHN Corporation, Seongnam 463-811, Korea (e-mail: rupino11@naver.com).

J. Lee is with the Department of Multimedia Science, Sookmyung Women's University, Seoul 140-742, Korea (e-mail: bigrain@sookmyung.ac.kr).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSMCA.2007.904825
I. INTRODUCTION

BUILDING an intelligent system that can produce timely stock trading suggestions has always been a subject of great interest for many investors and financial analysts. Nevertheless, the problem of finding the best time to buy or sell has remained extremely hard, since there are too many factors that may influence stock prices [1]. The famous "efficient market hypothesis" (EMH), which was tested in economics over a 40-year period without definitive findings, states that no investment system can consistently yield average returns exceeding the average returns of the market as a whole. For many years, finance theoreticians have argued for the EMH as a basis for denouncing techniques that attempt to extract useful information about the future behavior of stock prices from historical data [2].

However, the assumptions underlying this hypothesis turn out to be unrealistic in many cases [3], and in particular, most approaches taken to testing the hypothesis were based on linear time series modeling [4]. Accordingly, as claimed in [4], given enough data and time, an appropriate nonparametric machine learning method may be able to discover more complex nonlinear relationships through learning from examples. Furthermore, if we step back from the goal of "consistently" beating the market, we find many interesting empirical results indicating that the market might be somehow predictable [5].

Indeed, the last decade has witnessed an abundance of such approaches to financial analysis from both academia and industry. Application of various machine learning techniques to stock trading and portfolio management has experienced significant growth, and many trading systems based on different computational methodologies and investment strategies have been proposed in the literature [6]–[10]. In particular, there has been a huge amount of interest in the application of neural networks to predicting stock market behavior from current and historical data, and this popularity continues mainly because neural networks do not require an exact parametric system model and are relatively insensitive to unusual data patterns [3], [11].
More recently, numerous studies have shown that even the problem of integrating stock price prediction results with dynamic trading strategies to develop an automatic trading system can be successfully addressed by applying reinforcement learning algorithms. Reinforcement learning provides an approach to solving the problem of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals [12]. Compared with supervised learning techniques such as neural networks, which require input and output pairs, a reinforcement learning agent learns behavior through trial-and-error interactions with a dynamic environment, while attempting to compute an optimal policy under which it can achieve maximal average rewards from the environment.
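As a quick orientation for the algorithms discussed below, the following minimal tabular sketch shows the standard Q-learning backup together with an ε-greedy action choice. The action set, state encoding, and reward here are toy placeholders, not the MQ-Trader formulations.

```python
from collections import defaultdict
import random

# Standard tabular Q-learning backup:
#   Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]

ALPHA, GAMMA = 0.1, 0.95            # learning rate and discount factor
ACTIONS = ("buy", "sell", "hold")   # illustrative action set

Q = defaultdict(float)              # Q-table: (state, action) -> value

def update(state, action, reward, next_state):
    """Apply one Q-learning backup for an observed transition."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    td_error = reward + GAMMA * best_next - Q[(state, action)]
    Q[(state, action)] += ALPHA * td_error

def epsilon_greedy(state, eps=0.1):
    """Explore with probability eps; otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])
```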
Hence, considering the problem characteristics of designing a stock trading system that interacts with a highly dynamic stock market with the objective of maximizing profit, it is worth considering a reinforcement learning algorithm such as Q-learning to train a trading system. Several research results along this line have been published in the literature. Neuneier [8] used a Q-learning approach to make asset allocation decisions in financial markets, and Neuneier and Mihatsch [13] incorporated a notion of risk sensitivity into the construction of the Q-function. Another portfolio management system built by use of Q-learning was presented in [14], where absolute profit and relative risk-adjusted profit were considered as performance functions to train the system. In [15], an adaptive algorithm for direct reinforcement, named recurrent reinforcement learning, was proposed and used to learn an investment strategy online. Later, Moody and Saffell [16] showed how to train trading systems via direct reinforcement; the performance of the learning algorithm proposed in [16] was demonstrated through an intraday currency trader and a monthly asset allocation system for the S&P 500 stock index and T-Bills.

In this paper, we propose a new stock trading framework that attempts to further enhance the performance of reinforcement learning-based systems. The proposed framework, named MQ-Trader, aims to make buy and sell suggestions for investors in their daily stock trading. It takes a multiagent approach in which each agent has its own specialized capability and knowledge, and it employs a Q-learning algorithm to train the agents. The motivation behind the incorporation of multiple Q-learning agents is to enable them to effectively divide and conquer the complex stock trading problem by defining the necessary roles for cooperatively carrying out stock pricing and selection decisions. At the same time, the proposed multiagent architecture attempts to model a human trader's behavior as closely as possible.

Specifically, MQ-Trader defines an architecture that consists of four cooperative Q-learning agents: the first two agents, named the buy and sell signal agents, respectively, attempt to determine the right time to buy and sell shares based on global trend prediction. The other two agents, named the buy and sell order agents, carry out intraday order executions by deciding the best buy price (BP) and sell price (SP), respectively. The individual behavior of the order agents is defined in such a way that microscopic market characteristics such as intraday price movements are considered. Cooperation among these agents facilitates efficient learning of trading policies that can maximize profitability while managing risks effectively in a unified framework.

One of the important issues that must be addressed when designing a reinforcement learning algorithm is the representation of states. In particular, the problem of maintaining the whole raw series of past stock prices to compute long-term correlations becomes intractable as the size of the considered time window grows large. Motivated by this, we propose a new state representation scheme, named the turning point (TP) matrix, that can succinctly summarize the historical information of price changes. The TP matrix is essentially a binary matrix for state representation of the signal agents. Furthermore, in MQ-Trader, various technical analysis methods such as short-term moving averages (MAs) and the Japanese candlestick representation [17] are utilized by the order agents.

In Section II, we present the architecture of the proposed framework, describe how cooperation among the trading agents in MQ-Trader is achieved, and subsequently define the state representation schemes. Section III presents the learning algorithms for the participating agents after briefly introducing basic concepts of Q-learning. The experimental setup and results on a real Korean stock market, i.e., the Korea Composite Stock Price Index (KOSPI), are described in Section IV. Finally, Section V concludes this paper with a discussion of future research directions.

II. PROPOSED FRAMEWORK FOR MULTIAGENT Q-LEARNING

In this section, we first present the proposed MQ-Trader framework, which employs a cooperative multiagent architecture for Q-learning. After describing the behavior of the individual agents during the learning process, this section proceeds to define the necessary state representations for the agents. The detailed learning algorithms are presented in Section III.

A. Proposed Learning Framework

In an attempt to simulate a human investor's behavior and, at the same time, to divide and conquer the considered learning problem more effectively, MQ-Trader defines four agents. First, the stock trading problem is divided into the timing and the pricing problems, of which the objectives are, respectively, to determine the best time and the best price for trading. This naturally leads to the introduction of the following two types of agents: 1) the signal agent and 2) the order agent.

Second, the motivation for separating the buy signal agent from the sell signal agent comes from the fact that an investor has different criteria for decision making depending on whether she/he buys or sells a stock. When buying a stock, the investor usually considers the possibility of the stock price rising or falling. In contrast, when selling a stock, the investor considers not only the tendency of the stock price movements but also the profit or loss incurred by the stock. Accordingly, the separation is necessary to allow the agents to have different state representations. That is, while the buy signal agent maintains the price history information as its state to estimate the future trend based on the price changes over a long-term period, the sell signal agent needs to consider the current profit/loss obtained in addition to the price history.

Finally, the buy order and the sell order agents, respectively, generate orders to buy and sell a stock at some specified price. These are called bid and offer. The objective of these order agents is to decide the best price for trading within a single day in an attempt to maximize profit. A structural sketch of the four roles follows.
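The division of labor just described can be summarized structurally as follows. This is a minimal organizational sketch under assumed class and method names, not the paper's API; the actual state encodings, action sets, and rewards are defined in the remainder of this section and in Section III.

```python
# Structural sketch of the four cooperative MQ-Trader roles.
# All class and method names are illustrative assumptions.

class BuySignalAgent:
    """Decides WHEN to buy from long-term price history (e.g., a TP matrix)."""
    def should_buy(self, price_history_state) -> bool: ...

class SellSignalAgent:
    """Decides WHEN to sell from price history plus the current profit/loss."""
    def should_sell(self, price_history_state, profit_ratio) -> bool: ...

class BuyOrderAgent:
    """Decides the intraday buy price (BP) at which the bid is placed."""
    def decide_bp(self, intraday_state) -> float: ...

class SellOrderAgent:
    """Decides the intraday sell price (SP) at which the offer is placed."""
    def decide_sp(self, intraday_state) -> float: ...
```

In an episode, a buy decision by the buy signal agent is handed to the buy order agent for execution, and symmetrically for selling, so the four decisions chain together as described next.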
Fig. 1 shows the overall learning procedure defined in MQ-Trader. It aims to maximize the profit from investment by considering the global trend of the stock price as well as the intraday price movements. Under this framework, each agent has its own goal while interacting with the others to share episodes throughout the learning process.

More specifically, given a randomly selected stock item, an episode for learning is started by randomly selecting a
(Case 1) $1 \leq i < n$:
$$
a_{ij} = \begin{cases}
1, & \text{if there exists a TP such that } C_{\mathrm{TP},D} \leq 0 \text{ and } \displaystyle\sum_{k=0}^{i-1} F_k \leq |C_{\mathrm{TP},D}| \times 100 < \sum_{k=0}^{i} F_k \\
   & \text{during the period } \left[\displaystyle\sum_{k=0}^{j-1} F_k + 1,\ \sum_{k=0}^{j} F_k\right] \\
0, & \text{otherwise}
\end{cases}
$$

(Case 2) $i = n$:
$$
a_{ij} = \begin{cases}
1, & \text{if there exists a TP such that } C_{\mathrm{TP},D} \leq 0 \text{ and } \displaystyle\sum_{k=0}^{i-1} F_k \leq |C_{\mathrm{TP},D}| \times 100 < \infty \\
   & \text{during the period } \left[\displaystyle\sum_{k=0}^{j-1} F_k + 1,\ \sum_{k=0}^{j} F_k\right] \\
0, & \text{otherwise}
\end{cases}
$$
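For concreteness, the case analysis above can be transcribed directly into code. This is a minimal sketch under two assumptions made only for illustration: F = (F_0, ..., F_n) is the sequence of interval widths appearing in the summations, and each turning point (TP) is given as a pair (t, c), where t is the number of days before the decision day D at which the TP occurred and c is its ratio C_TP,D.

```python
def tp_entry(i, j, tps, F, n):
    """Entry a_ij of the binary TP matrix (i and j are 1-indexed).

    tps: list of (t, c) pairs, one per turning point, where t is the TP's
    age in days before day D and c is its ratio C_TP,D.
    """
    lo = sum(F[:i])                           # sum_{k=0}^{i-1} F_k
    hi = float("inf") if i == n else sum(F[:i + 1])
    t_lo = sum(F[:j]) + 1                     # period start (days before D)
    t_hi = sum(F[:j + 1])                     # period end
    return int(any(
        c <= 0 and lo <= abs(c) * 100 < hi and t_lo <= t <= t_hi
        for t, c in tps
    ))
```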
TABLE I
SAMPLE ENCODING SCHEME FOR PROFIT RATIO

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
$$
TABLE II
STATE REPRESENTATION FOR THE ORDER AGENTS

of the stock even with the lowest possible price is not possible, the SP is set to $P^{C}_{\delta_{\mathrm{SELL}}+1}$, which is the closing price on day $\delta_{\mathrm{SELL}} + 1$. The lowest reward, i.e., 0, is given in this case. Otherwise, the agent tries different prices until a feasible SP is obtained, as in the case of the buy order agent. The reward function that considers the TC and price slippage for this case is defined similarly to that of the buy order agent and achieves its maximum value when the SP determined is equal to the highest possible price.
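The fallback and reward behavior just described can be sketched as follows. Names and the simplified reward shape are illustrative assumptions; the paper's actual reward also incorporates the TC and price slippage terms.

```python
def execute_sell(candidate_sps, day_low, day_high, next_close):
    """Try candidate sell prices (SPs) in turn. A sell at price sp is
    treated as executable if the market traded at or above sp that day;
    if even the lowest possible price cannot be executed, fall back to
    the next day's closing price with zero reward."""
    for sp in candidate_sps:
        if sp <= day_high:                       # a feasible SP was found
            # Reward is maximal when sp equals the highest possible price
            # (TC and slippage terms are omitted in this sketch).
            span = day_high - day_low
            reward = (sp - day_low) / span if span > 0 else 1.0
            return sp, max(0.0, reward)
    return next_close, 0.0                       # SP = closing price on
                                                 # day delta_SELL + 1
```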
Fig. 10. Distribution of the ratio of the difference between the five-day MA and the lowest stock price to the five-day MA.
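Written out with notation assumed here for readability (the paper's own symbols may differ), the quantity whose distribution Fig. 10 shows is

$$
r(d) = \frac{\mathrm{MA}_5(d) - P^{L}(d)}{\mathrm{MA}_5(d)}
$$

where $\mathrm{MA}_5(d)$ is the five-day MA on day $d$ and $P^{L}(d)$ is the lowest stock price.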
TABLE IV
PREDICTION PERFORMANCE OF THE CONSIDERED NEURAL NETWORK CONFIGURATIONS
Fig. 12. Performance comparison result for the first test data set.
Fig. 13. Performance comparison result for the second test data set.
terms of the asset growth rates achieved by each trading system throughout the entire test period, from June 2001 to November 2005, when it was provided with 0.4 billion won initially (the won is the basic unit of money in Korea). For example, the best case performance of MQ-Trader shows that its asset had grown to 4.55 billion won (= 0.4 × 1138.7%) by the end of November 2005.

Only the best case results for the two aforementioned test sets in this section are separately given in Figs. 12 and 13, where the dotted line at the bottom indicates the KOSPI index,
which is defined similarly to the S&P 500 index of the U.S. stock market. The KOSPI index is shown to compare the performances of the aforementioned trading systems against the baseline market performance during the test period, and it is equated to 0.4 billion at the beginning for visualization purposes. The series (a) through (f) in Fig. 12 show the accumulated assets for each trading system during the first test period (June 2001 to August 2003), when each system starts with an initial asset of 0.4 billion won. To make the performance comparison clear, the starting assets for the trading systems during the second test period (September 2003 to November 2005) are also equated to 0.4 billion in Fig. 13.

As can be seen from these results, the proposed MQ-Trader outperformed the other alternative trading frameworks (represented by the series (c) to (f) in Figs. 12 and 13) by achieving more than four times the asset growth for the first test period and more than 2.5 times for the second test period. The performance of MQ-Trader always lay between those of the I2Q-Trader and the 2Q-Trader (respectively represented as the series (a) and (c) in Figs. 12 and 13), as expected. Accordingly, the performance difference between MQ-Trader and 2Q-Trader can be attributed to the contributions of the order agents. Furthermore, it can be deduced by comparing the series (b) and (d) that the proposed TP matrix facilitates the performance improvement of a trading system.

It is interesting to note that MQ-Trader performed satisfactorily during the long bear market between April 2002 and April 2003. In addition, it endured the short stock market shock of May 2004 quite well, with a relatively small loss. Note, however, that in the bull market period (May 2005 to July 2005), the traders with multiple agents, including MQ-Trader, were not able to exploit the opportunity, while the two single-agent traders (indicated by the sharp rises of the series (e) and (f) during the period) were. Based on this observation, it appears that MQ-Trader achieves good performance particularly when stock prices are sharply declining due to market inefficiency incurred by psychological reactions of investors.

The results of the experimental study examining the effects of TCs and price slippages on the performance of MQ-Trader are presented in Tables VI and VII, where only the best case performances among 20 trials with different random initializations of the neural networks are shown. Three different rates for calculating TCs and three different probabilities of price slippage were considered, resulting in a total of nine configurations. Tables VI and VII, respectively, present the results of the asset growth rates and the trading frequencies achieved by MQ-Trader for different configurations during the entire period of June 2001 through November 2005. The initial asset given to MQ-Trader was 0.4 billion won.

From Table VI, it can be seen that the most profitable results (1138.7% asset growth) were obtained when both the TC rate and the price slippage percentage were lowest, and that the profit decreases as the TC rate increases and the chance of price slippage becomes higher. Similar results were observed for the number of trades made during the same test period, as shown in Table VII. Together, these imply that MQ-Trader has learned the risks associated with stock trading, which were introduced through the TC and price slippage: when the TC is expensive, MQ-Trader buys and sells a stock carefully, leading to less frequent trades and smaller net profits. This is a natural consequence, since a trade with a small gross profit may end up as an overall loss after the TCs for the stock purchase and disposition are paid; a toy calculation below illustrates this point. Likewise, with a high chance of price slippage, it is advantageous for MQ-Trader to avoid aggressive trading.
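To make the effect of the TC and slippage concrete, the following toy calculation (with assumed, not reported, numbers) shows how a small gross gain turns into a net loss once round-trip costs are paid.

```python
def net_profit(buy_price, sell_price, shares, tc_rate, slippage):
    """Net profit after an adverse slippage on each execution price and
    the TC paid on both the purchase and the disposition."""
    buy_cost = buy_price * (1 + slippage) * shares * (1 + tc_rate)
    sell_proceeds = sell_price * (1 - slippage) * shares * (1 - tc_rate)
    return sell_proceeds - buy_cost

# A 0.5% gross gain is wiped out by a 0.3% TC rate and 0.1% slippage:
print(net_profit(10_000, 10_050, 100, 0.003, 0.001))   # about -3020: a net loss
```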
Furthermore, Figs. 14 and 15 show how the profitability of MQ-Trader decreases as the risks increase throughout the entire test period. The TC rate and the percentage of price slippage used for each series in Figs. 14 and 15 are summarized in Table VIII. As expected, the profitability is highest when the risks represented by the TC and price slippage are lowest, and lowest when the risks are highest.

Finally, we found that consideration of the current profit or loss by MQ-Trader did not necessarily lead to the disposition effect, in contrast to human investors, who are subject to it for psychological reasons. The average number of days MQ-Trader held a stock item was 6.9 for the profitable trades and 7.3 for the unsuccessful trades. This small difference of 0.4 day suggests that MQ-Trader is not prone to the disposition effect.
Fig. 14. Performance comparison result for different levels of risks during the first test period.
Fig. 15. Performance comparison result for different levels of risks during the second test period.
V. CONCLUSION

There has long been a strong interest in applying machine learning techniques to financial problems. This paper has explored the issues of designing a multiagent system that aims to provide effective decision support for the daily stock trading problem. The proposed approach, named MQ-Trader, defines multiple Q-learning agents in order to effectively divide and conquer the stock trading problem in an integrated environment. We presented the learning framework along with the state representations for the cooperative agents of MQ-Trader and described the detailed algorithms for training the agents.

Furthermore, in an attempt to address the complexity problem that arises when considering a large amount of data to compute long-term dependence among the stock prices, we proposed a new state representation scheme, named the TP matrix, that can succinctly represent the history of price changes.

Through an extensive empirical study using real financial data from a Korean stock market, we found that our approach produces better trading performance than systems based on other alternative frameworks. Based on these observations, the profits that can be obtained from the proposed framework appear to be promising.

From the future research point of view, there are some clear extensions to be investigated. These include addressing the issues of distributing the asset across multiple portfolios and of adapting to the trend of a stock market. While reinforcement learning is promising, the introduction of these considerations will make the problem more complex. Therefore, one of the future
TABLE VIII
LEGEND FOR THE SERIES IN THE PLOTS OF FIGS. 14 AND 15