Forex Trading DRL Approach
*Corresponding author: Dr. Parviz Rashidi-Khazaee.
Davoud Sarani is a Master's student at the Information Technology and Computer Engineering Department, Urmia University of Technology, Urmia, Iran (e-mail: sarani.davoud@it.uut.ac.ir).
Dr. Parviz Rashidi-Khazaee is an Assistant Professor at the Information Technology and Computer Engineering Department, Urmia University of Technology, 4km Band Road, Urmia, Iran (e-mail: p.rashidi@uut.ac.ir).

The Forex market is the largest financial market in the world for currency trading due to its high daily trading volume [1]. Financial trading can be either manual or algorithmic. Manual trading relies on technical and fundamental analysis, while algorithmic trading is carried out by computers using rule-based or machine learning (ML)-based methods. In price prediction, prior knowledge of the financial domain is required for selecting trading actions, and the accuracy of the prediction influences the decision-making process [5]. Price prediction aims to construct a model capable of forecasting future prices. However, algorithmic trading extends beyond price prediction, focusing on active participation in the financial market (such as selecting trading positions and the number of shares traded) to maximize profits [4]. Supervised learning methods have shown significant potential for predicting financial markets, such as the Forex market. However, their prediction accuracy may not be sufficient for algorithmic applications in real markets due to the fluctuations, instability, and uncertain nature of financial time series data [2]. Financial data is noisy, and this might be a reason why supervised learning methods have not been successful in the past [6].
Supervised learning is not suitable for problems involving
long-term and delayed rewards, such as trading in financial markets. For addressing decision-making problems (conducting trades) in an uncertain environment (the financial market), RL is a more appropriate choice [4]. RL does not require supervision labels. In RL, an agent interacts with the environment and receives rewards or penalties. In a financial trading environment, the agent decides what trading actions to take and is rewarded or penalized based on its trading performance [3]. Training an RL agent eliminates the complexity of manual label selection and allows the agent to determine, from the received rewards (trading profits), which trading positions have predictable and valuable outcomes. It also enables the direct optimization of profitability and loss-related metrics [7]. Newly developed Deep Reinforcement Learning (DRL) algorithms can independently make optimal decisions in complex environments and perform better than basic strategies [2]. It has been shown that DRL algorithms, which exploit the potential advantages of RL-based trading strategies, outperform rule-based strategies [2].
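To make this agent-environment interaction loop concrete, the sketch below shows a minimal, self-contained trading environment in the usual reset/step form; the synthetic price series, the three-position action set, and the raw price-change reward are illustrative assumptions rather than the environment used in this study.

```python
import numpy as np

# Minimal sketch of the agent-environment loop described above. The synthetic price
# series, the action set {-1, 0, +1}, and the price-change reward are illustrative
# assumptions, not the environment used in this study.
class ToyForexEnv:
    def __init__(self, n_steps=500, seed=0):
        rng = np.random.default_rng(seed)
        # synthetic close prices standing in for one currency pair
        self.prices = 1.10 + np.cumsum(rng.normal(0.0, 1e-4, n_steps))
        self.t = 0

    def _obs(self):
        # observation: the last 10 percentage price changes (zero-padded at the start)
        window = self.prices[max(0, self.t - 10): self.t + 1]
        rets = np.diff(window) / window[:-1] if len(window) > 1 else np.zeros(1)
        return np.pad(rets, (10 - len(rets), 0))

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        # action: -1 = short, 0 = flat, +1 = long; reward is the resulting price move
        price_change = self.prices[self.t + 1] - self.prices[self.t]
        reward = action * price_change
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self._obs(), reward, done

# interaction loop: here a random policy stands in for the trained DRL agent
env = ToyForexEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = np.random.choice([-1, 0, 1])   # a DRL policy would map obs -> action
    obs, reward, done = env.step(action)
    total_reward += reward
print(f"cumulative reward (profit-and-loss proxy): {total_reward:.5f}")
```

An actual DRL agent would replace the random policy in this loop, and a realistic reward would also account for spreads and transaction costs.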
DRL algorithms can be divided into single-agent and multi-agent methods. Carapuco et al. used the Q-learning algorithm together with a method for more efficient use of the available historical tick data of the EUR/USD pair, including bid price and spread, to improve training quality, and showed stable equity growth and minimal drawdowns [8]. Tsantekidis et al. apply the single-agent learning method using the Proximal Policy Optimization (PPO) algorithm to Forex trading and propose a market-wide training approach with a feature extraction method that enables agents to adapt to diverse currency pairs [3].

The complexity and dynamic nature of the financial market make it necessary to find an optimal trading strategy, prompting the exploration of multi-agent systems within DRL, which generally outperform single-agent approaches [9]. In the multi-agent domain, Shavandi and Khedmati propose a hierarchical DRL framework for Forex trading with agents specialized in different time periods [2]. These independent agents communicate through a hierarchical mechanism, aggregating the intelligence across different timeframes to resist noise in financial data and improve trading decisions. Korczak and Hernes integrate DL with a multi-agent system to enhance the ability to generate profitable trading strategies, employing a supervisory agent to orchestrate diverse trading strategies and select the most promising recommendations [10]. Ma et al. underscore the importance of multi-agent systems in portfolio management, showcasing superior performance compared to single-agent strategies [9]. Tsantekidis et al. underscore the effectiveness of knowledge distillation from multiple teacher agents to student agents, thereby enhancing the trading performance of the students. Their work emphasizes the significance of diversifying teacher models to trade various currencies in volatile markets, thus improving the performance of the students [7].

Parallel multi-agent algorithms such as asynchronous advantage actor-critic (A3C), IMPALA, and SeedRL could be used in the Forex trading market. Among these, A3C plays an important role in the Forex market. Li et al. employ A3C algorithms to tackle feature extraction and strategy adaptation issues and showcase superior performance in stock and futures markets [11]. Kang et al. implement the A3C algorithm for stock selection and portfolio management, observing enhanced stability and convergence in training but encountering less impressive performance during testing, possibly due to data limitations and neural network (NN) simplicity [12]. Ponomarev et al. investigate the A3C algorithm's efficacy in algorithmic trading, particularly on Russia Trading System (RTS) Index futures, by creating a trading environment, testing NN architectures, and analyzing historical data, underscoring the algorithm's profitability [13]. The implementation of parallel workers with workload distribution in A3C [14] enhances computational efficiency, reduces agent training time, and explores the environment effectively, learning an improved policy in less time. Through parallel environment exploration, A3C outperforms other algorithms in terms of diverse experiences and trading profitability [13].

The use of multiple agents to make the final trading decision has shown considerable advancements [2, 7, 10], and it is technically feasible to train agents to handle market-wide trading on a range of currency pairs [3, 7]. Training multiple teacher agents and distilling their trading decisions to student agents has enhanced the trading performance of the students, and using a diverse subset of currency pairs to train the teachers can improve the students' proficiency [7]. Training DL models is time-consuming, but distributed computing can expedite the learning process [10]. However, distilling only profitable trading decisions to the students [7] limits their knowledge and their overall awareness of varied market circumstances, and not training in a distributed manner leads to suboptimal resource utilization.

So far, the A3C algorithm has not been utilized for the parallel training of multiple agents across various currency pairs so that they share their knowledge with each other and develop a generalized optimal policy for the Forex market. In this study, we aim to pioneer this approach and explore its effectiveness. An additional objective of this work is to compare single-agent (SA) and multi-agent (MA) approaches. The key contribution of this study is to utilize distributed training to develop an agent capable of trading diverse pairs in financial markets such as Forex. This approach aims to enhance agent learning and policy generalization across various market conditions, improve exploration efficiency, and accelerate learning and exploration in different segments of the environment, thereby enabling the acquisition of more robust and generalized policies. The utilization of multiple parallel workers with the A3C algorithm enables the acquisition of extensive experience in diverse environments, resulting in quicker adaptation and convergence in response to abrupt changes within financial markets.
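The sketch below illustrates this distributed-training idea: several A3C-style workers, each interacting with a different currency pair, asynchronously update one shared actor-critic network. The pair list, network sizes, synthetic per-pair return series, and the use of threads with a gradient lock are simplifying assumptions for illustration, not the implementation details of this study.

```python
import threading

import numpy as np
import torch
import torch.nn as nn

# Hedged sketch of the distributed idea described above: one A3C-style worker per
# currency pair, all workers asynchronously updating a single shared ("global")
# actor-critic network. Pair list, network sizes, synthetic returns, and the
# thread/lock mechanics are illustrative assumptions, not this study's setup.

PAIRS = ["EURUSD", "GBPUSD", "USDJPY", "AUDUSD"]
OBS_DIM, N_ACTIONS, ROLLOUT, GAMMA = 10, 3, 20, 0.99   # actions: short / flat / long

class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh())
        self.policy = nn.Linear(64, N_ACTIONS)
        self.value = nn.Linear(64, 1)

    def forward(self, x):
        h = self.body(x)
        return torch.distributions.Categorical(logits=self.policy(h)), self.value(h)

def synthetic_returns(pair, n=2000):
    # stand-in for the real percentage returns of one currency pair
    rng = np.random.default_rng(abs(hash(pair)) % 2**32)
    return rng.normal(0.0, 1e-4, n).astype(np.float32)

global_net = ActorCritic()
optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-3)
update_lock = threading.Lock()

def worker(pair):
    local = ActorCritic()
    rets, t = synthetic_returns(pair), OBS_DIM
    while t + ROLLOUT < len(rets):
        with update_lock:
            local.load_state_dict(global_net.state_dict())   # sync with shared weights
        log_probs, values, rewards = [], [], []
        for _ in range(ROLLOUT):                              # collect a short rollout
            obs = torch.from_numpy(rets[t - OBS_DIM:t])
            dist, value = local(obs)
            action = dist.sample()                            # 0 = short, 1 = flat, 2 = long
            rewards.append(float((action.item() - 1) * rets[t]))  # per-step PnL proxy
            log_probs.append(dist.log_prob(action))
            values.append(value)
            t += 1
        # n-step discounted returns and advantages for the actor-critic loss
        R, returns = 0.0, []
        for r in reversed(rewards):
            R = r + GAMMA * R
            returns.insert(0, R)
        returns = torch.tensor(returns)
        values = torch.cat(values)
        advantage = returns - values
        loss = -(torch.stack(log_probs) * advantage.detach()).mean() + advantage.pow(2).mean()
        local.zero_grad()
        loss.backward()
        with update_lock:                                     # push gradients to the global net
            for g_param, l_param in zip(global_net.parameters(), local.parameters()):
                g_param.grad = l_param.grad.clone()
            optimizer.step()
            optimizer.zero_grad()

threads = [threading.Thread(target=worker, args=(pair,)) for pair in PAIRS]
for th in threads:
    th.start()
for th in threads:
    th.join()
print("shared policy updated with experience from", len(PAIRS), "currency pairs")
```

Because each worker draws experience from a different pair, the gradients applied to the shared network mix several market regimes, which is the intuition behind the generalization benefit described above.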
The rest of the paper is structured as follows: Section II offers a comprehensive review of previous works relevant to multi-agent RL. Section III explores the details of RL models, while Section IV discusses our methodology. Following this, Section V provides an overview of the implementation details of the method. Section VI analyzes and interprets the results. Finally, Section VII provides conclusions.

II. RELATED WORKS

Carapuco et al. employ the Q-learning algorithm and a method for more efficient utilization of the historical tick data of the EUR/USD pair, including bid price and spread, to enhance training quality. They demonstrate stable equity growth and minimal drawdowns. Despite the non-deterministic and noisy nature of financial markets, the study showcases stable learning on the training dataset and the Q-network's ability to identify relationships in financial data, resulting in profitable trading on a test dataset. The potential for further optimization of parameters, network topology, and model selection methods suggests promising avenues for future exploration in algorithmic trading strategies. The study proposes future work concentrating on optimizing parameters, network topology, and model selection methods, alongside exploring enhancements in dataset selection and financial optimizations [8].

Shavandi and Khedmati introduce a novel DRL multi-agent framework tailored for financial trading, where agents specialize in specific time periods. These agents operate independently yet collaborate through a hierarchical feedback mechanism, facilitating the transmission of knowledge from higher-timeframe agents to lower ones. This mechanism serves to resist noise in financial data, enabling the aggregation of intelligence across different timeframes. By sharing insights and learning the characteristics of each interval, the framework outperforms both independent agents and rule-based strategies. Its primary objective is to establish intertemporal learning interactions via collective intelligence among multiple agents, enabling adaptation to noise and utilization of price movement details for enhanced trading performance [2].

The study of Korczak and Hernes discusses the integration of DL into a multi-agent framework for creating profitable trading strategies in Forex. The system, called A-Trader, utilizes trading actions and fuzzy logic to make decisions based on factors such as confidence levels and probabilities. A supervisory agent oversees decision-making, coordinating various trading strategies and selecting the most appropriate suggestions for investment decisions. DL is employed to forecast financial data, aiming to enhance A-Trader's ability to offer profitable trading recommendations. However, a drawback of DL is its time-consuming learning mode, which could be mitigated by employing distributed cloud computing [10].

Ma et al. introduce a novel approach to financial portfolio management, leveraging a multi-agent DRL algorithm with trend consistency regularization to recognize consistency in stock trends and guide the agent's trading strategies. This approach divides stock trends into two categories and trains two agents that share the same policy and value models but use different reward functions, which differ in their regularization terms, enhancing adaptability to market conditions. By dynamically switching between the agents based on market conditions, the proposed algorithm optimizes portfolio allocation, achieving higher returns and lower risk compared to existing algorithms [9].

A knowledge distillation method is proposed by Tsantekidis et al. for training RL agents in the financial market by employing teacher agents in diverse sub-environments to diversify their learned policies. Student agents then utilize profitable knowledge from these teachers to emulate the existing trading strategies. The work emphasizes that diversifying teacher models for trading various currencies, together with knowledge distillation from multiple teacher agents, can significantly enhance the performance of the students in volatile financial markets. For this purpose, observations of past candlestick price patterns are pre-processed to identify percentage differences between sampled prices. The authors also suggest that the Policy Gradient approach is more efficient than the DQN approach [7].
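The exact feature set of [7] is not reproduced here; the fragment below only sketches the general idea of converting raw candlestick prices into scale-free percentage differences, with the column names and the particular ratios chosen as assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative only: turn raw OHLC candles into percentage-difference features, in the
# spirit of the preprocessing described in [3], [7]. The exact feature set used in
# those works is not reproduced here.
def pct_features(candles: pd.DataFrame) -> pd.DataFrame:
    feats = pd.DataFrame(index=candles.index)
    feats["close_ret"] = candles["close"].pct_change()             # close-to-close change
    feats["high_rel"] = candles["high"] / candles["close"] - 1.0   # high relative to close
    feats["low_rel"] = candles["low"] / candles["close"] - 1.0     # low relative to close
    return feats.dropna()

# toy usage with synthetic candles standing in for one currency pair
rng = np.random.default_rng(0)
close = 1.10 + np.cumsum(rng.normal(0.0, 1e-4, 100))
candles = pd.DataFrame({
    "close": close,
    "high": close + rng.uniform(0.0, 2e-4, 100),
    "low": close - rng.uniform(0.0, 2e-4, 100),
})
print(pct_features(candles).head())
```

Features of this kind keep the inputs on a comparable scale across pairs with very different price levels, which is one reason such preprocessing supports market-wide training.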
Tsantekidis et al. also suggest a price-based reward-shaping method for Forex trading with a DRL approach using the Proximal Policy Optimization (PPO) algorithm. This approach enhances agent performance in terms of profit, Sharpe ratio, and maximum drawdown. The authors also employ a data preprocessing and fixed-feature extraction method to enable agent training on various Forex currency pairs, facilitating the development of RL agents across the wide range of pairs in the market while mitigating overfitting. The paper emphasizes that RL agents have typically been trained to trade individual assets, whereas human traders can adapt their trading strategies to different assets and conditions. To overcome this limitation, they propose a market-wide training approach that extracts valuable insights from various financial instruments. The proposed feature extraction method enhances the effective processing of data drawn from diverse distributions [3].

Li et al. propose a framework for algorithmic trading that utilizes DQN and A3C algorithms with SDAEs and LSTM networks to address feature extraction and strategy adaptation challenges. Their framework achieves superior performance compared to baseline methods in both stock and futures markets, demonstrating substantial improvement and potential for practical trading applications. Specifically, the SDAEs-LSTM A3C model learns a more valuable strategy and surpasses LSTM in predictive accuracy [11].

Kang et al. apply the A3C algorithm to stock selection and portfolio management using a subset of S&P 500 index stocks, training asynchronously with multiple environments initiated at different times to simulate experience buffers. Despite notable improvements in stability and convergence during training (shortening the training process and speeding up convergence), the model's performance during the test period is not as impressive, potentially due to limitations in data availability and the simplicity of the NN architecture. The authors recommend incorporating more data and features to boost performance, emphasizing the significance of a robust model architecture and adequate data for effective results [12].

Ponomarev et al. explore the A3C algorithm in algorithmic trading, focusing on trading RTS Index futures. The study constructs a trading environment, experiments with various NN architectures, tests on historical data, and highlights the potential profitability and attractiveness of the algorithm for investment, verifying the effectiveness of Long Short-Term Memory (LSTM) and dropout layers while debating the impact of the reward function and the number of neurons in the hidden layers,
emphasizing the importance of optimizing architectures for real trading systems [13].

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}    (3)
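Equation (3) is the PPO probability ratio between the current policy and the previous ("old") policy used to collect the data. Assuming the standard PPO formulation, this ratio would enter a clipped surrogate objective of the form

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right],

where \hat{A}_t denotes the advantage estimate and \epsilon the clipping range.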