
Draft

A Deep Reinforcement Learning Approach for Trading Optimization in the Forex Market with Multi-Agent Asynchronous Distribution
Davoud Sarani, Dr. Parviz Rashidi-Khazaee

Davoud Sarani is a Master's student at the Information Technology and Computer Engineering Department, Urmia University of Technology, Urmia, Iran (e-mail: sarani.davoud@it.uut.ac.ir). Dr. Parviz Rashidi-Khazaee (corresponding author) is Assistant Professor at the Information Technology and Computer Engineering Department, Urmia University of Technology, 4km Band Road, Urmia, Iran (e-mail: p.rashidi@uut.ac.ir).

Abstract— In today's forex market, traders increasingly turn to algorithmic trading, leveraging computers to seek more profit. Deep learning techniques, as cutting-edge advancements in machine learning, are capable of identifying patterns in financial data, and traders utilize these patterns to execute more effective trades in accordance with algorithmic trading rules. Deep reinforcement learning (DRL) methods, by directly executing trades based on identified patterns and assessing their profitability, offer advantages over traditional DL approaches. This research pioneers the application of a multi-agent (MA) RL framework with the state-of-the-art Asynchronous Advantage Actor-Critic (A3C) algorithm. The proposed method employs parallel learning across multiple asynchronous workers, each specialized in trading across multiple currency pairs, to explore the potential for nuanced strategies tailored to different market conditions and currency pairs. Two MA models, A3C with lock and A3C without lock, were proposed and trained on single-currency and multi-currency data. The results indicate that both models outperform the Proximal Policy Optimization (PPO) model; A3C with lock outperforms the others in the single-currency training scenario, and A3C without lock outperforms the others in the multi-currency scenario. The findings demonstrate that this approach facilitates broader and faster exploration of different currency pairs, significantly enhancing trading returns. Additionally, the agent can learn a more profitable trading strategy in a shorter time.

Index Terms— Asynchronous Advantage Actor-Critic (A3C), Distributed Training, Financial Market, Forex Trading, Multi-Agent

I. INTRODUCTION

The Forex market is the largest financial market in the world for currency trading due to its high daily trading volume [1]. Financial trading can be either manual or algorithmic. Manual trading relies on technical and fundamental analysis, while algorithmic trading is carried out by computers using rule-based or machine learning (ML)-based strategies [2]. Humans may make erroneous trading decisions due to emotional and psychological factors. Therefore, traditional trading can be susceptible to human errors, while computers are not adversely affected by these factors and are more precise in executing trading decisions compared to humans [2]. The rule-based approach in algorithmic trading involves identifying profitable trading signals from specific market movements by analyzing the time series of price data. The ML approach, on the other hand, automatically learns patterns that lead to predictable market movements and generates trading signals accordingly [3].

ML algorithms can extract patterns and relationships from time series data without the need for predefined instructions or strategies set by domain experts, discovering profitable trading patterns that may not be discernible to humans [2]. In recent years, ML, serving as an intelligent agent, has replaced traditional human decision-making methods. This transition, particularly with the advent of deep learning (DL), has enhanced algorithm performance: it enables the extraction of complex patterns from data and provides immunity to the emotions that could impact performance [4]. The ML approach to algorithmic trading can be divided into supervised learning and RL approaches [5]. The supervised learning approach focuses on predicting stock prices or price trends in the next time step [5]. One drawback of supervised learning and DL algorithms used for prediction and classification is that they do not execute trading decisions [1]. The predicted prices or trends cannot be directly associated with order actions. Therefore, after price prediction, prior knowledge of the financial domain is required for selecting trading actions, and the accuracy of the prediction influences the decision-making process [5]. Price prediction aims to construct a model capable of forecasting future prices, whereas algorithmic trading extends beyond price prediction, focusing on active participation in the financial market (such as selecting trading positions and the number of shares traded) to maximize profits [4]. Supervised learning methods have shown significant potential for predicting financial markets, such as the Forex market. However, their prediction accuracy may not be sufficient for algorithmic applications in real markets due to the fluctuations, instability, and uncertain nature of financial time series data [2]. Financial data is noisy, and this might be a reason why supervised learning methods have not been successful in the past [6].
Supervised learning is not suitable for problems involving long-term and delayed rewards, such as trading in financial markets. For addressing decision-making issues (conducting trades) in an uncertain environment (the financial market), RL is a more appropriate choice [4]. RL does not require supervision labels. In RL, an agent interacts with the environment and receives rewards or penalties. In a financial trading environment, the agent decides which trading actions to take and is rewarded or penalized based on its trading performance [3]. Training an RL agent eliminates the complexity of manual label selection and allows the agent to determine which trading positions have predictable, valuable outcomes based on the received rewards (trading profits). It enables the direct optimization of profitability and loss-related metrics [7]. Newly developed Deep Reinforcement Learning (DRL) algorithms can independently make optimal decisions in complex environments and perform better than basic strategies [2]. It has been shown that DRL algorithms, which exploit the potential advantages of RL-based trading strategies, outperform rule-based strategies [2].

DRL algorithms are divided into single-agent and multi-agent methods. Carapuço et al. used the Q-learning algorithm and a method for more efficient use of available historical tick data of the EUR/USD pair, including bid and spread, to improve training quality, and showed stable equity growth and minimal drawdowns [8]. Tsantekidis et al. apply the single-agent learning method using the Proximal Policy Optimization (PPO) algorithm to Forex trading and propose a market-wide training approach with a feature extraction method that enables agents to adapt to diverse currency pairs [3]. The complexity and dynamic nature of the financial market make it necessary to find an optimal trading strategy, prompting the exploration of multi-agent systems within DRL, which generally outperform single-agent approaches [9]. In the multi-agent domain, Shavandi and Khedmati propose a hierarchical DRL framework for forex trading with agents specialized in different time periods [2]. These independent agents communicate through a hierarchical mechanism, aggregating intelligence across different timeframes to resist noise in financial data and improve trading decisions. Korczak and Hernes integrate DL with a multi-agent system to enhance the ability to generate profitable trading strategies, employing a supervisory agent to orchestrate diverse trading strategies and select the most promising recommendations [10]. Ma et al. underscore the importance of multi-agent systems in portfolio management, showcasing superior performance compared to single-agent strategies [9]. Tsantekidis et al. underscore the effectiveness of knowledge distillation from multiple teacher agents to student agents, thereby enhancing the trading performance of the students, and emphasize the significance of diversifying teacher models to trade various currencies in volatile markets [7].

Parallel multi-agent algorithms such as Asynchronous Advantage Actor-Critic (A3C), IMPALA, and SeedRL could be used in the forex trading market. Among these, A3C plays an important role in the forex market. Li et al. employ A3C algorithms to tackle feature extraction and strategy adaptation issues and showcase superior performance in stock and futures markets [11]. Kang et al. implement the A3C algorithm for stock selection and portfolio management, observing enhanced stability and convergence in training but encountering less impressive performance during testing, possibly due to data limitations and neural network (NN) simplicity [12]. Ponomarev et al. investigate the A3C algorithm's efficacy in algorithmic trading, particularly on Russia Trading System (RTS) Index futures, by creating a trading environment, testing NN architectures, and analyzing historical data, underscoring the algorithm's profitability [13]. The implementation of parallel workers with workload distribution in A3C [14] enhances computational efficiency, reduces agent training time, and effectively explores the environment, learning an improved policy in less time. Through parallel environment exploration, A3C outperforms other algorithms in terms of diverse experiences and trading profitability [13].

The use of multiple agents to make the ultimate trading decision has shown considerable advancements [2, 7, 10], and it is technically feasible to train agents to handle market-wide trading on a range of currency pairs [3, 7]. Training multiple teacher agents and distilling their trading decisions to student agents has enhanced the trading performance of the student, and using a diverse subset of currency pairs to train the teachers can improve the student's proficiency [7]. Training DL models is time-consuming, but implementing distributed computing can expedite the learning process [10]. However, distilling only profitable trading decisions to students [7] limits the students' knowledge and their overall awareness of varied market circumstances, and not training in a distributed manner leads to suboptimal resource utilization.

So far, the A3C algorithm has not been utilized for the parallel training of multiple agents across various currency pairs that share their knowledge with each other and develop a generalized optimal policy for the Forex market. In this study, we aim to pioneer this approach and explore its effectiveness. Additionally, another objective of this work is to compare single-agent (SA) and multi-agent (MA) approaches. The key contribution of this study is to utilize distributed training to develop an agent capable of trading diverse pairs in financial markets like forex. This approach aims to enhance agent learning and policy generalization across various market conditions, improve exploration efficiency, and accelerate learning and exploration in different environmental segments, thereby enabling the acquisition of more robust and generalized policies. The utilization of multiple parallel workers with the A3C algorithm enables the acquisition of extensive experience in diverse environments, resulting in quicker adaptation and convergence to abrupt changes within financial markets.

The rest of the paper is structured as follows: Section II offers a review of previous works relevant to multi-agent RL. Section III explores the details of the RL models, while Section IV discusses our methodology. Following this, Section V provides an overview of the implementation details of the method. Section VI analyzes and interprets the results.
Finally, Section VII provides conclusions.

II. RELATED WORKS

Carapuço et al. employ the Q-learning algorithm and a method for more efficient utilization of historical tick data of the EUR/USD pair, including bid price and spread, to enhance training quality. They demonstrate stable equity growth and minimal drawdowns. Despite the non-deterministic and noisy nature of financial markets, the study showcases stable learning on the training dataset and the Q-network's ability to identify relationships in financial data, resulting in profitable trading on a test dataset. The study proposes future work concentrating on optimizing parameters, network topology, and model selection methods, alongside exploring enhancements in dataset selection and financial optimizations [8].

Shavandi and Khedmati introduce a novel DRL multi-agent framework tailored for financial trading, where agents specialize in specific time periods. These agents operate independently yet collaborate through a hierarchical feedback mechanism, facilitating the transmission of knowledge from higher-timeframe agents to lower ones. This mechanism serves to resist noise in financial data, enabling the aggregation of intelligence across different timeframes. By sharing insights and learning the characteristics of each interval, the framework outperforms both independent agents and rule-based strategies. Its primary objective is to establish intertemporal learning interactions via collective intelligence among multiple agents, enabling adaptation to noise and utilization of price movement details for enhanced trading performance [2].

The study of Korczak and Hernes discusses the integration of DL into a multi-agent framework for creating profitable trading strategies in Forex. The system, called A-Trader, utilizes trading actions and fuzzy logic to make decisions based on factors such as confidence levels and probabilities. A supervisory agent oversees decision-making, coordinating various trading strategies and selecting the most appropriate suggestions for investment decisions. DL is employed to forecast financial data, aiming to enhance A-Trader's ability to offer profitable trading recommendations. However, a drawback of DL is its time-consuming learning mode, which could be mitigated by employing distributed cloud computing [10].

Ma et al. introduce a novel approach to financial portfolio management, leveraging a multi-agent DRL algorithm with trend consistency regularization to recognize consistency in stock trends and guide the agent's trading strategies. This approach divides stock trends into two categories and trains two agents with the same policy and value models but different reward functions, differing in regularization, for enhanced adaptability to market conditions. By dynamically switching between agents based on market conditions, the proposed algorithm optimizes portfolio allocation, achieving higher returns and lower risk compared to existing algorithms [9].

A knowledge distillation method is proposed by Tsantekidis et al. for training RL agents in the financial market by employing teacher agents in diverse sub-environments to diversify their learned policies. Student agents then utilize profitable knowledge from these teachers to emulate the existing trading strategies. The study emphasizes that diversifying the teacher models to trade various currencies and distilling knowledge from multiple teacher agents can significantly enhance the performance of students in volatile financial markets; for this purpose, observations of past candlestick prices are pre-processed into percentage differences between sampled prices. They also suggest that the Policy Gradient approach is more efficient than the DQN approach [7].

Tsantekidis et al. also suggest a price-based reward-shaping method for Forex trading with a DRL approach using the Proximal Policy Optimization (PPO) algorithm. This approach enhances agent performance in terms of profit, Sharpe ratio, and maximum drawdown. The authors also employ a data preprocessing and fixed feature extraction method to enable agent training on various Forex currency pairs, facilitating the development of RL agents across the wide range of pairs in the market while mitigating overfitting. The paper emphasizes that RL agents have typically been trained to trade individual assets, whereas human traders can adapt their trading strategies to different assets and conditions. To overcome this limitation, they propose a market-wide training approach that extracts valuable insights from various financial instruments. The proposed feature extraction method enhances the effective processing of data from diverse distributions [3].

Li et al. propose a framework for algorithmic trading that utilizes DQN and A3C algorithms with SDAEs and LSTM networks to address feature extraction and strategy adaptation challenges, resulting in superior performance compared to baseline methods in both stock and futures markets and demonstrating substantial improvement and potential for practical trading applications. Specifically, the SDAEs-LSTM A3C model learns a more valuable strategy and surpasses LSTM in predictive accuracy [11].

Kang et al. apply the A3C algorithm to stock selection and portfolio management using a subset of S&P 500 index stocks, training asynchronously with multiple environments initiated at different times to simulate experience buffers. Despite notable improvements in stability and convergence during training (shortening the training process and speeding up convergence), the model's performance during the test period is not as impressive, potentially due to limitations in data availability and the simplicity of the NN architecture. They recommend incorporating more data and features to boost performance, emphasizing the significance of a robust model architecture and adequate data for effective results [12].

Ponomarev et al. explore the A3C algorithm in algorithmic trading, focusing on trading RTS Index futures. The study constructs a trading environment, experiments with various NN architectures, tests on historical data, highlights the potential profitability and attractiveness of the algorithm for investment, and verifies the effectiveness of Long Short-Term Memory (LSTM) and dropout layers, while debating the impact of the reward function and the number of neurons in the hidden layers and emphasizing the importance of optimizing architectures for real trading systems [13].
emphasizing the importance of optimizing architectures for real 𝑟𝑡 (𝜃) =
𝜋𝜃 (𝑎𝑡 |𝑠𝑡 )
(3)
𝜋𝜃 (𝑎𝑡 |𝑠𝑡 )
trading systems [13]. old

D. Asynchronous Advantage Actor-Critic (A3C)


III. REINFORCEMENT LEARNING (RL) MODELS
The A3C is an advanced variant of the actor-critic architecture,
In this research, algorithms based on PPO and A3C were The advantage in A3C refers to the advantage function, which
utilized to train an RL agent capable of executing trades in the quantifies how much better or worse a particular action is
forex market. compared to the average action [14]. In A3C, Multiple local
workers run in parallel with their copy of the policy network and
A. Actor-Critic
environment, collecting experiences and updating the global
Actor-Critic employs an NN architecture with two networks asynchronously. This enables efficient utilization of
components: an actor and a critic; The actor learns a policy (the resources, faster convergence, better exploration, and more
strategy for selecting actions), while the critic network estimates sample-efficient learning.
the expected future reward and reduces variance by providing a Accumulate gradients with respect to the local policy network
baseline for advantage estimates. The combination of actor-critic parameters 𝜃 ′ using the policy gradient and advantage estimation
architecture helps stabilize training by reducing the high variance are calculated as 𝛻𝜃′ 𝑙𝑜𝑔𝜋(𝑎𝑖 |𝑠𝑖 ; 𝜃 ′ )(𝑅 − 𝑉(𝑠𝑖 ; 𝜃𝑣′ )), and the
typically associated with policy-based methods. accumulated gradients with respect to the local value network
parameters 𝜃𝑣′ using the squared temporal difference error is
B. Advantage function
computed as 𝜕(𝑅 − 𝑉(𝑆𝑖 ; 𝜃𝑣′ ))2 /𝜕𝜃𝑣′ , Where (𝑅𝑡 − 𝑉(𝑠𝑡 )) is
The advantage function is used to compute the policy
the difference between the estimated value 𝑉(𝑠𝑡 ) and the
gradient. It guides the actor to select actions that lead to better
observed reward 𝑅𝑡 . The 𝜕/𝜕𝜃𝑣′ is the partial derivative with
outcomes and addresses the credit assignment problem by respect to the parameter 𝜃𝑣′ and it is used to find how a small
providing feedback on the quality of chosen actions. The change in the parameter 𝜃𝑣′ affects this expression.
advantage is calculated as (1), where the Q-value (action value) The 𝑅 is the total discounted return calculated as 𝑅 ← 𝑟𝑖 + 𝛾𝑅,
represents the expected cumulative reward by taking a specific in this expression 𝑟𝑖 is an immediate reward received at the time
action in a particular state and then following a certain policy step 𝑖, and 𝛾 is the discount factor to discount the values of
thereafter. The V-value (critic value) represents the expected future rewards.
cumulative reward that can be obtained from a particular state The Policy loss (actor loss) for the N time steps in A3C is
onwards, following a certain policy. computed as (4), The Value loss (critic loss) in A3C guides the
𝐴(𝑠𝑎) = 𝑄(𝑠𝑎) − 𝑉(𝑠) (1) value function towards better approximations of the expected
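To make the clipped objective in (2)-(3) concrete, the following PyTorch sketch computes the PPO policy loss for a batch of transitions. It illustrates the standard clipped surrogate with the usual sign convention (minimizing the negative surrogate plus an entropy bonus); it is not the authors' implementation, and names such as log_probs_old and the coefficient beta are assumptions of the example.

import torch

def ppo_policy_loss(log_probs_new, log_probs_old, advantages, entropy,
                    eps=0.2, beta=0.01):
    """Clipped PPO surrogate loss in the spirit of Eq. (2)-(3).

    log_probs_new: log pi_theta(a_t | s_t) under the current policy
    log_probs_old: log pi_theta_old(a_t | s_t) stored when the data was collected
    advantages:    advantage estimates A_t
    entropy:       per-sample entropy of pi_theta(. | s_t)
    """
    # Probability ratio r_t(theta) of Eq. (3), computed in log space for stability.
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Unclipped and clipped surrogate terms of Eq. (2).
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # Minimizing this loss maximizes the clipped objective; the entropy term
    # (weighted by beta) encourages exploration.
    return -(torch.min(surr_unclipped, surr_clipped) + beta * entropy).mean()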
D. Asynchronous Advantage Actor-Critic (A3C)

A3C is an advanced variant of the actor-critic architecture. The "advantage" in A3C refers to the advantage function, which quantifies how much better or worse a particular action is compared to the average action [14]. In A3C, multiple local workers run in parallel, each with its own copy of the policy network and of the environment, collecting experiences and updating the global networks asynchronously. This enables efficient utilization of resources, faster convergence, better exploration, and more sample-efficient learning.

The accumulated gradients with respect to the local policy network parameters $\theta'$, using the policy gradient and the advantage estimate, are calculated as $\nabla_{\theta'} \log\pi(a_i|s_i;\theta')\,(R - V(s_i;\theta_v'))$, and the accumulated gradients with respect to the local value network parameters $\theta_v'$, using the squared temporal-difference error, are computed as $\partial(R - V(s_i;\theta_v'))^2 / \partial\theta_v'$, where $(R - V(s_i;\theta_v'))$ is the difference between the observed return $R$ and the estimated value $V(s_i;\theta_v')$. The term $\partial/\partial\theta_v'$ is the partial derivative with respect to the parameter $\theta_v'$ and indicates how a small change in $\theta_v'$ affects this expression. Here $R$ is the total discounted return, calculated as $R \leftarrow r_i + \gamma R$, where $r_i$ is the immediate reward received at time step $i$ and $\gamma$ is the discount factor applied to future rewards.

The policy loss (actor loss) over $N$ time steps in A3C is computed as (4). The value loss (critic loss) guides the value function towards better approximations of the expected return $R$; it is the mean squared error between the estimated value function $V$ and the actual return, and over $N$ time steps it is computed as (5).

$$L_{\mathrm{policy}} = -\frac{1}{N}\sum_{t=1}^{N} \log\pi(a_t|s_t)\cdot A_t \quad (4)$$

$$L_{\mathrm{critic}} = \frac{1}{N}\sum_{i=1}^{N} \left(R_i - V(s_i;\theta_v)\right)^2 \quad (5)$$

In the asynchronous updating of the global worker from the local workers, the global network aggregates gradients from multiple local workers and updates its parameters, while each local worker updates its parameters independently based on its local experiences. Assuming $\theta$ and $\theta_v$ are the shared parameters of the global worker and $\theta'$ and $\theta_v'$ are the parameters of a local worker, then, based on Algorithm S3 of [14], the asynchronous update can be defined as (6) for the actor and (7) for the critic.

$$d\theta \leftarrow d\theta + \nabla_{\theta'} \log\pi(a_i|s_i;\theta')\,(R - V(s_i;\theta_v')) \quad (6)$$

$$d\theta_v \leftarrow d\theta_v + \frac{\partial (R - V(s_i;\theta_v'))^2}{\partial \theta_v'} \quad (7)$$
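As an illustration of how a local worker turns the losses (4)-(5) into the asynchronous updates (6)-(7), the sketch below performs one update of a shared global model from a locally collected rollout. It is a minimal sketch under stated assumptions (a rollout of (log_prob, value, reward) tuples, a local/global model pair with identical architecture, and an optimizer built over the global parameters); it is not the authors' training loop.

import torch

def a3c_worker_update(local_model, global_model, optimizer, rollout, gamma=0.99):
    """One asynchronous update from a local worker, following Eq. (4)-(7).

    rollout: list of (log_prob, value, reward) tuples collected by this worker,
             where log_prob and value are tensors from the local model and
             reward is the scalar reward of Eq. (13).
    """
    R = torch.zeros(1)                     # bootstrap value, assumed 0 at episode end
    policy_loss, value_loss = 0.0, 0.0

    # Walk the rollout backwards, accumulating the discounted return R <- r_i + gamma * R.
    for log_prob, value, reward in reversed(rollout):
        R = reward + gamma * R
        advantage = R - value                                       # R - V(s_i; theta_v')
        policy_loss = policy_loss - log_prob * advantage.detach()   # Eq. (4) / (6)
        value_loss = value_loss + advantage.pow(2)                  # Eq. (5) / (7)

    loss = (policy_loss + value_loss) / len(rollout)

    # Gradients are computed on the local copy and handed to the global parameters.
    optimizer.zero_grad()
    loss.backward()
    for local_p, global_p in zip(local_model.parameters(), global_model.parameters()):
        global_p._grad = local_p.grad      # common PyTorch A3C pattern for the hand-off
    optimizer.step()                       # the shared optimizer updates the global worker

    # The local worker then resynchronizes with the updated global parameters.
    local_model.load_state_dict(global_model.state_dict())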
IV. METHODOLOGY

In this section, we first address the necessary prerequisites. We then examine and discuss our proposed method for enhancing the speed of the agent and learning optimal policies in the forex market.
A. Data Preparation

Candlestick data of different currency pairs in the Forex market in one-hour timeframes have been used to train the agents. This data is directly retrieved from the forex broker's terminal.

Due to price fluctuations in various currency pairs and the recurring patterns in candlestick models, using raw candlestick data is inefficient. To address this issue and normalize the data, the method described in [3] is employed (8). This method utilizes the ratios of candlestick changes from the time series data to create 5 new features (9) instead of using the raw data.

$$x_{1t} = \frac{P_{c_t} - P_{c_{t-1}}}{P_{c_{t-1}}},\quad x_{2t} = \frac{P_{h_t} - P_{h_{t-1}}}{P_{h_{t-1}}},\quad x_{3t} = \frac{P_{l_t} - P_{l_{t-1}}}{P_{l_{t-1}}},\quad x_{4t} = \frac{P_{h_t} - P_{c_t}}{P_{c_t}},\quad x_{5t} = \frac{P_{c_t} - P_{l_t}}{P_{c_t}} \quad (8)$$

$$X_t = [x_{1t}, x_{2t}, x_{3t}, x_{4t}, x_{5t}] \quad (9)$$

$$X = [X_{t-\mathrm{window}}, \ldots, X_{t-2}, X_{t-1}, X_t] \quad (10)$$
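As a concrete reading of (8)-(10), the snippet below derives the five ratio features from a pandas DataFrame of hourly candles and stacks the last 16 of them into one observation, matching the time window used later in the experiments. The column names (close, high, low) are assumptions about the broker export; this is an illustrative sketch, not the paper's preprocessing code.

import numpy as np
import pandas as pd

def candle_features(df: pd.DataFrame, window: int = 16) -> np.ndarray:
    """Normalized candlestick features of Eq. (8)-(9), stacked as in Eq. (10)."""
    feats = pd.DataFrame({
        "x1": df["close"].pct_change(),                  # close-to-close change
        "x2": df["high"].pct_change(),                   # high-to-high change
        "x3": df["low"].pct_change(),                    # low-to-low change
        "x4": (df["high"] - df["close"]) / df["close"],  # upper-wick ratio
        "x5": (df["close"] - df["low"]) / df["close"],   # lower-wick ratio
    }).dropna()

    # Observation X at time t: the most recent `window` feature rows, shape (window, 5).
    return feats.tail(window).to_numpy(dtype=np.float32)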
B. Reward Function

The evaluation of agent actions is carried out using rewards received from the environment. Through its interactions with the environment and its selection of a trading decision (12), the agent receives a reward (13), which can be either profitable (positive reward) or detrimental (negative reward). The received reward is a normalized value determined by the change in the closing prices of two consecutive candlesticks (11).

$$z_t = \frac{P_{c_t} - P_{c_{t-1}}}{P_{c_{t-1}}} \quad (11)$$

$$\delta_t \in \{\,1\ \text{(Long)},\ -1\ \text{(Short)},\ 0\ \text{(Neutral)}\,\} \quad (12)$$

$$r_t = \delta_t \cdot z_t \quad (13)$$

The value of $\delta$ is 1 for a long trading decision, -1 for a short trading decision, and 0 when closing a trade or staying out of the market. In this work, the primary objective was to highlight training-time efficiency and the learning of an optimal policy; therefore, commission is not taken into account. It is also assumed that there is no spread, so the bid and ask prices are equal and rewards are calculated from the closing price.
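As a small worked example of (11)-(13): over a candle where EUR/USD closes at 1.1055 after closing at 1.1000, z_t = 0.005, so a long decision earns a reward of +0.005, a short earns -0.005, and a neutral decision earns 0. The sketch below (with illustrative names) mirrors this computation.

import numpy as np

def step_reward(closes: np.ndarray, t: int, decision: int) -> float:
    """Reward of Eq. (11)-(13): decision is 1 (long), -1 (short) or 0 (neutral)."""
    z_t = (closes[t] - closes[t - 1]) / closes[t - 1]   # Eq. (11)
    return decision * z_t                               # Eq. (13)

closes = np.array([1.1000, 1.1055])
print(step_reward(closes, 1, 1), step_reward(closes, 1, -1), step_reward(closes, 1, 0))
# 0.005, -0.005, 0.0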
C. Proposed Reinforcement Learning (RL) Model

RL comprises the interaction between two main components: the agent and the environment. The environment's role involves both preparing financial market data as the observation space and responding to the received trading decisions as the action space. The observation space consists of historical financial data as in (10), in conjunction with the trading decisions made within the specified past window, which are categorized into one of three groups: opening a long position, opening a short position, and closing an existing position or staying out of the market. In RL, the agent receives an observation state, makes a trading decision, and subsequently receives a reward from the environment based on its profit or loss.

Figure 1 and Scheme 1 show the structure of the proposed model. A uniform NN model was used for both algorithms. The model's input consists of two parts: the first part contains an LSTM layer for the normalized time series data (10), and the second part is a linear layer for the one-hot encoded trading decisions made within the chosen past window. The number of neurons in the hidden LSTM layer is set to 128. The output of the LSTM is first passed through a layer of 32 neurons; it is then combined with the trading decisions from the previous steps (according to the window size) and fed into a fully connected (FC) layer with 64 neurons. This FC layer sends its output to either the actor or the critic FC layer, depending on whether it is used for trading decisions or for evaluation. The actor FC layer's output size corresponds to the number of possible trading decisions, while the critic FC layer's output is used to evaluate the actor's performance.

Fig. 1. The proposed Actor-Critic model.

ActorCritic(
  (actor): LSTM_MLP(
    (lstm): LSTM(5, 128, batch_first=True)
    (fc1): Linear(in_features=128, out_features=32, bias=True)
    (fc2): Linear(in_features=80, out_features=64, bias=True)
    (fc3): Linear(in_features=64, out_features=64, bias=True)
    (output_layer): Linear(in_features=64, out_features=3, bias=True)
  )
  (critic): LSTM_MLP(
    (lstm): LSTM(5, 128, batch_first=True)
    (fc1): Linear(in_features=128, out_features=32, bias=True)
    (fc2): Linear(in_features=80, out_features=64, bias=True)
    (fc3): Linear(in_features=64, out_features=64, bias=True)
    (output_layer): Linear(in_features=64, out_features=1, bias=True)
  )
)

Scheme 1 - Scheme of the Actor-Critic model.
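The in_features=80 of fc2 in Scheme 1 is consistent with concatenating the 32-unit projection of the LSTM output with a one-hot history of the last 16 trading decisions (16 x 3 = 48, and 32 + 48 = 80). The PyTorch sketch below reconstructs one LSTM_MLP branch under that reading; the ReLU activations and the use of the last LSTM time step are assumptions, and this is an illustration rather than the authors' source code.

import torch
import torch.nn as nn

class LSTM_MLP(nn.Module):
    """One branch (actor or critic) matching the layer sizes in Scheme 1."""
    def __init__(self, n_features=5, window=16, n_actions=3, out_dim=3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, 128, batch_first=True)
        self.fc1 = nn.Linear(128, 32)
        # 32 LSTM features + flattened one-hot action history (16 * 3 = 48) -> 80
        self.fc2 = nn.Linear(32 + window * n_actions, 64)
        self.fc3 = nn.Linear(64, 64)
        self.output_layer = nn.Linear(64, out_dim)  # 3 logits for the actor, 1 value for the critic

    def forward(self, prices, action_history):
        # prices: (batch, window, 5) normalized features; action_history: (batch, 48) one-hot
        out, _ = self.lstm(prices)
        h = torch.relu(self.fc1(out[:, -1]))        # last LSTM step -> 32 units (assumption)
        h = torch.cat([h, action_history], dim=1)   # 32 + 48 = 80
        h = torch.relu(self.fc2(h))
        h = torch.relu(self.fc3(h))
        return self.output_layer(h)

actor, critic = LSTM_MLP(out_dim=3), LSTM_MLP(out_dim=1)
x = torch.randn(2, 16, 5)      # a batch of 2 observation windows
a = torch.zeros(2, 48)         # flattened one-hot history of past decisions
print(actor(x, a).shape, critic(x, a).shape)   # torch.Size([2, 3]) torch.Size([2, 1])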
V. EXPERIMENTS

The proposed methods are examined in this section and their results are compared. Initially, the dataset is reviewed, followed by a discussion of the training methods and parameters. Finally, the impact of asynchronous learning with multiple parallel agents is investigated.

A. Training and Evaluation Dataset

A dataset of transaction data from the Forex market has been used to train and evaluate the algorithms. The datasets were obtained and saved using the MetaTrader 5 terminal and contain price movement charts (candles) for major, minor, and cross currency pairs within a one-hour timeframe from 2009 to mid-2017. The data was divided into two parts: the data from 2009 to the end of 2016 was used for training, and the first four months of 2017 were used for backtesting.

B. Evaluation Metrics

The models are evaluated using key measures such as "Return," which indicates whether a strategy is profitable or not. The "Sharpe Ratio" helps to assess risk and returns, while the "Profit Factor" measures the amount of money made versus lost. "Maximum Drawdown" reveals the strategy's largest loss. These measures are utilized to analyze and compare the models and determine their relative performance.

$$\mathrm{Return} = \frac{\mathrm{FinalPrice} - \mathrm{InitialPrice}}{\mathrm{InitialPrice}} \times 100 \quad (14)$$

$$\mathrm{SharpeRatio} = \frac{R_p}{\sigma_p} \quad (15)$$

In the Sharpe Ratio, $R_p$ is the mean of the returns and $\sigma_p$ is the standard deviation of the returns.
C. Parameters

The parameters used in all experiments were as follows: a discount factor of 0.99, a learning rate of 0.00004, and a time window of 16, which determines how many previous time series samples are considered as input. In addition, the same reward function was employed to evaluate the performance of all models, and the same seed was used for both model training and evaluation.

D. Training

The training process for both the single-agent (SA) and multi-agent (MA) approaches was conducted using two different scenarios: single-currency (SC) and multi-currency (MC). In the SC scenario, the EUR/USD currency pair was used for training, with a randomly chosen starting point for each training episode. In the MC scenario, training was carried out on 28 different currency pairs, where at each episode a random currency pair and starting point were selected. In both approaches, each training episode consisted of 600 steps, with a total of one million steps for each training approach. The PPO algorithm was used for the SA approach and the A3C algorithm for the MA approach.

E. Single-Agent Training (SA)

In SA training, at each training step the 5 extracted features at time $t$, along with the 15 previous feature vectors (a total of 16 rows), are fed into the LSTM layer, and the output, together with the trading actions selected at each step in one-hot encoded form, is passed to the subsequent FC layers.

F. Multi-Agent Training (MA)

The MA training process follows the same parameters and conditions as SA. Five local workers are employed for parallel training to speed up the learning process. Each local worker undergoes 20 training steps before updating the global worker. This process continues until the end of one episode, after which a new episode begins.

During training with A3C, two different approaches were examined. In MA-Lock, a lock mechanism was implemented that allowed only one local worker at a time to push updates to the global worker. MA-NoLock did not have a locking mechanism, and multiple local workers could interact with the environment and concurrently share updates with the global worker without any conflicts. Also, an optimizer with shared parameters was implemented across all workers.
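The difference between the two schemes can be sketched with torch.multiprocessing: MA-Lock serializes every push to the global worker behind a lock, while MA-NoLock lets all five workers write concurrently. The toy example below uses a shared scalar in place of the global network parameters, so it only illustrates the synchronization pattern, not the actual gradient transfer of (6)-(7).

import torch
import torch.multiprocessing as mp

def worker(global_param, lock, n_updates=1000):
    """Each worker repeatedly adds its local 'gradient' to the shared parameter.
    With a lock (MA-Lock) the read-modify-write is serialized; with lock=None
    (MA-NoLock) the five workers write concurrently."""
    for _ in range(n_updates):
        grad = torch.ones(1)              # stand-in for the accumulated gradients of Eq. (6)-(7)
        if lock is not None:
            with lock:                    # MA-Lock: one worker updates the global worker at a time
                global_param += grad
        else:
            global_param += grad          # MA-NoLock: lock-free, Hogwild-style update

if __name__ == "__main__":
    global_param = torch.zeros(1)
    global_param.share_memory_()          # shared across processes, like the global model
    use_lock = True                       # True -> MA-Lock, False -> MA-NoLock
    lock = mp.Lock() if use_lock else None
    workers = [mp.Process(target=worker, args=(global_param, lock)) for _ in range(5)]
    for p in workers: p.start()
    for p in workers: p.join()
    print(global_param)                   # exactly 5000 with the lock; may lose updates without it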
VI. RESULTS

The performance evaluation of each approach was conducted with the backtesting library [16] over four months of test data. For the A3C algorithm, the global worker is used for evaluation. In backtesting, when the agent repeatedly selects the same trading decision in multiple consecutive environment states, the chosen trading decisions are regarded as one continuous trade.

A. Single-Currency Pair Scenario (SC)

When comparing the performance of SA and MA in the SC training scenario, both MA models, with and without the lock mechanism, demonstrated higher returns, more favorable risk-adjusted returns (as indicated by the Sharpe Ratio), and a slightly better profit factor compared to SA (Table I, Fig. 2).

TABLE I
SC SCENARIO EVALUATION RESULTS
Model       Return    Sharpe Ratio   Win Rate   Profit Factor
SA          71.27 %   1.27           51.80 %    1.20
MA-Lock     98.22 %   1.46           51.75 %    1.27
MA-NoLock   81.33 %   1.58           52.09 %    1.20

Fig. 2. Comparative return of backtesting in the SC scenario.

B. Multi-Currency Pairs Scenario (MC)

In the MC scenario, the SA maintained relatively consistent yet generally lower returns, while MA exhibited mixed performance in returns when comparing the MA-Lock and MA-NoLock mechanisms (Table III).

MA-NoLock stood out with the highest average return (Table IV, Fig. 3), favorable risk-adjusted returns (positive Sharpe Ratio), and the best profit factor. SA had the lowest return, with slightly lower risk-adjusted returns and profit factors. MA-Lock fell in between these two, displaying slightly better risk-adjusted returns than SA but still lagging behind MA-NoLock in terms of return and profit factor.

Notably, MA-NoLock has the advantage of recovering from drawdowns in a shorter time than the other two models, despite not having the lowest drawdown magnitude. This suggests that MA-NoLock has a more efficient and faster recovery mechanism or risk management strategy in place. While SA has a lower drawdown magnitude, its longer drawdown duration indicates that it takes more time to bounce back from losses. This observation underscores the importance of considering not only the depth of drawdowns but also the time it takes to recover when evaluating trading models (Table V, Table VI).

TABLE V
MC BACKTESTING DRAWDOWN ON 9 PAIRS
Model       Pair     Max. Drawdown   Avg. Drawdown   Max. Drawdown Duration   Avg. Drawdown Duration
SA          EURUSD   -14.52 %        -2.74 %         17 days 03:00:00         2 days 01:00:00
SA          AUDUSD   -22.82 %        -4.43 %         55 days 07:00:00         6 days 03:00:00
SA          EURGBP   -20.93 %        -2.31 %         24 days 04:00:00         2 days 03:00:00
SA          AUDCAD   -21.18 %        -4.20 %         49 days 13:00:00         4 days 15:00:00
SA          EURCHF   -28.46 %        -10.62 %        70 days 02:00:00         26 days 05:00:00
SA          EURAUD   -26.24 %        -3.18 %         63 days 01:00:00         3 days 05:00:00
SA          USDCAD   -25.03 %        -6.25 %         76 days 03:00:00         6 days 23:00:00
SA          GBPNZD   -47.93 %        -14.56 %        55 days 03:00:00         9 days 06:00:00
SA          GBPUSD   -37.32 %        -5.55 %         90 days 13:00:00         6 days 10:00:00
MA-Lock     EURUSD   -15.66 %        -1.82 %         36 days 12:00:00         1 day 17:00:00
MA-Lock     AUDUSD   -48.87 %        -3.73 %         94 days 10:00:00         5 days 11:00:00
MA-Lock     EURGBP   -27.32 %        -3.11 %         101 days 23:00:00        9 days 11:00:00
MA-Lock     AUDCAD   -50.37 %        -4.47 %         94 days 12:00:00         6 days 12:00:00
MA-Lock     EURCHF   -12.81 %        -1.59 %         34 days 19:00:00         2 days 15:00:00
MA-Lock     EURAUD   -49.62 %        -4.31 %         84 days 16:00:00         4 days
MA-Lock     USDCAD   -40.44 %        -10.89 %        82 days 09:00:00         14 days 21:00:00
MA-Lock     GBPNZD   -60.26 %        -6.12 %         38 days 17:00:00         3 days 22:00:00
MA-Lock     GBPUSD   -16.10 %        -3.13 %         15 days 13:00:00         2 days 03:00:00
MA-NoLock   EURUSD   -20.05 %        -2.36 %         31 days 01:00:00         1 day 21:00:00
MA-NoLock   AUDUSD   -11.45 %        -1.55 %         20 days 03:00:00         1 day 09:00:00
MA-NoLock   EURGBP   -33.52 %        -4.28 %         48 days 10:00:00         3 days 20:00:00
MA-NoLock   AUDCAD   -14.56 %        -2.82 %         46 days 22:00:00         2 days 20:00:00
MA-NoLock   EURCHF   -8.01 %         -1.30 %         17 days 08:00:00         1 day 12:00:00
MA-NoLock   EURAUD   -45.69 %        -7.48 %         65 days 16:00:00         5 days 19:00:00
MA-NoLock   USDCAD   -66.68 %        -16.59 %        83 days 02:00:00         14 days 23:00:00
MA-NoLock   GBPNZD   -61.36 %        -12.65 %        65 days 09:00:00         7 days 08:00:00
MA-NoLock   GBPUSD   -18.08 %        -2.76 %         27 days 22:00:00         1 day 12:00:00

TABLE VI
MC BACKTESTING AVERAGE DRAWDOWN ON 9 PAIRS
Model       Max. Drawdown   Avg. Drawdown   Max. Drawdown Duration   Avg. Drawdown Duration
SA          -27.16 %        -5.98 %         53 days, 10 hours        7 days, 5 hours
MA-Lock     -35.72 %        -4.35 %         60 days, 6 hours         5 days, 5.78 hours
MA-NoLock   -31.04 %        -5.76 %         37 days, 18.33 hours     4 days, 11.56 hours

In the MC scenario, the backtesting results on EUR/USD showed that both MA models outperformed the base SA model (Table VII, Fig. 4). Additionally, MA-Lock performed better than MA-NoLock, highlighting the superiority of using the A3C with lock mechanism in the financial domain. Furthermore, the results demonstrated that training on a single currency yields better results.

TABLE VII
MC SCENARIO BACKTESTING EVALUATION RESULTS ON EUR/USD
Model       Return    Sharpe Ratio   Win Rate   Profit Factor
SA          54.21 %   0.78           51.61 %    1.05
MA-Lock     73.72 %   1.72           52.64 %    1.21
MA-NoLock   70.12 %   1.29           49.16 %    1.19

Fig. 4. Comparative MC scenario backtesting results on EUR/USD.

Figure 5 compares the overall backtesting results on EUR/USD of the baseline model and the proposed models for the SC and MC training scenarios.

Fig. 5. Comparative SC and MC training scenario backtesting results on EUR/USD.

C. Training Time

MA-NoLock significantly outperformed the other two models in training time, being the lowest in both the SC and MC scenarios (Table VIII, Fig. 6). In comparison, MA-Lock exhibited slightly higher values in both scenarios, and SA had the highest values in both scenarios. Therefore, MA-NoLock excels as the most time-efficient model, outperforming the other models by a substantial margin.

TABLE VIII
TRAINING TIME IN MINUTES (LOWER IS BETTER)
Model       Single currency   Multi-currency
SA          347               346
MA-Lock     193               251
MA-NoLock   153               218

Fig. 6. Training time of different models.
D. Trading Execution

Figure 7 shows a snapshot of the trades executed during backtesting on the EUR/USD currency pair using MA-Lock in the SC scenario. It shows that the agent can identify the direction of price movements and execute trades, resulting in profitable outcomes.

Fig. 7. A snapshot of MA-Lock SC scenario backtesting trades on EUR/USD.

VII. CONCLUSION

We aimed to train an RL agent capable of trading across diverse assets in forex to make more profit while optimizing resource utilization to reduce training time. Our approach involves using the A3C algorithm to distribute training across multiple processes. We explored different training approaches, such as training on single currency pairs and on multiple currency pairs, and compared the results. Additionally, we experimented with both single-agent and multi-agent setups, employing PPO as our single-agent algorithm.

Comparing single-agent and multi-agent training, in single-currency pair training both A3C models displayed superior returns and Sharpe Ratios. Multi-currency training showed PPO with generally lower returns, while A3C without lock stood out with the highest average returns over multi-pair backtesting and a positive Sharpe Ratio. Backtesting results confirmed A3C's superiority over PPO, especially A3C with lock. Single-currency training yielded better overall results. A3C without lock significantly outperformed the other models in training time, asserting its effectiveness in both single- and multi-currency scenarios.

Given the complexity of financial markets and the challenge of finding appropriate reward functions for training RL agents, it is recommended that future work consider the use of no-reward or reward-free RL methods.

REFERENCES

[1] T. Chau, M. T. Nguyen, D. V. Ngo, A. D. T. Nguyen, and T. H. Do, "Deep reinforcement learning methods for Automation Forex Trading," in 2022 RIVF International Conference on Computing and Communication Technologies (RIVF), 20-22 Dec. 2022, pp. 671-676, doi: 10.1109/RIVF55975.2022.10013861.
[2] A. Shavandi and M. Khedmati, "A multi-agent deep reinforcement learning framework for algorithmic trading in financial markets," Expert Systems with Applications, vol. 208, p. 118124, 2022, doi: 10.1016/j.eswa.2022.118124.
[3] A. Tsantekidis, N. Passalis, A. S. Toufa, K. Saitas-Zarkias, S. Chairistanidis, and A. Tefas, "Price Trailing for Financial Trading Using Deep Reinforcement Learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 7, pp. 2837-2846, 2021, doi: 10.1109/TNNLS.2020.2997523.
[4] N. Majidi, M. Shamsi, and F. Marvasti, "Algorithmic trading using continuous action space deep reinforcement learning," Expert Systems with Applications, vol. 235, p. 121245, 2024, doi: 10.1016/j.eswa.2023.121245.
[5] K. Lei, B. Zhang, Y. Li, M. Yang, and Y. Shen, "Time-driven feature-aware jointly deep reinforcement learning for financial signal representation and algorithmic trading," Expert Systems with Applications, vol. 140, p. 112872, 2020, doi: 10.1016/j.eswa.2019.112872.
[6] T. Huotari, J. Savolainen, and M. Collan, "Deep Reinforcement Learning Agent for S&P 500 Stock Selection," Axioms, vol. 9, no. 4, 2020, doi: 10.3390/axioms9040130.
[7] A. Tsantekidis, N. Passalis, and A. Tefas, "Diversity-driven knowledge distillation for financial trading using Deep Reinforcement Learning," Neural Networks, vol. 140, pp. 193-202, 2021, doi: 10.1016/j.neunet.2021.02.026.
[8] J. Carapuço, R. Neves, and N. Horta, "Reinforcement learning applied to Forex trading," Applied Soft Computing, vol. 73, pp. 783-794, 2018, doi: 10.1016/j.asoc.2018.09.017.
[9] C. Ma, J. Zhang, Z. Li, and S. Xu, "Multi-agent deep reinforcement learning algorithm with trend consistency regularization for portfolio management," Neural Computing and Applications, vol. 35, no. 9, pp. 6589-6601, 2023, doi: 10.1007/s00521-022-08011-9.
[10] J. Korczak and M. Hernes, "Deep learning for financial time series forecasting in A-Trader system," in 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), 3-6 Sept. 2017, pp. 905-912, doi: 10.15439/2017F449.
[11] Y. Li, W. Zheng, and Z. Zheng, "Deep Robust Reinforcement Learning for Practical Algorithmic Trading," IEEE Access, vol. 7, pp. 108014-108022, 2019, doi: 10.1109/ACCESS.2019.2932789.
[12] Q. Kang, H. Zhou, and Y. Kang, "An Asynchronous Advantage Actor-Critic Reinforcement Learning Method for Stock Selection and Portfolio Management," in Proceedings of the 2nd International Conference on Big Data Research, Weihai, China, 2018, doi: 10.1145/3291801.3291831.
[13] E. S. Ponomarev, I. V. Oseledets, and A. S. Cichocki, "Using Reinforcement Learning in the Algorithmic Trading Problem," Journal of Communications Technology and Electronics, vol. 64, no. 12, pp. 1450-1457, 2019, doi: 10.1134/S1064226919120131.
[14] V. Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning," in Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016, pp. 2850-2869. [Online]. Available: http://arxiv.org/abs/1602.01783.
[15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[16] Backtesting.py: Backtest trading strategies in Python, 2023. [Online]. Available: https://github.com/kernc/backtesting.py.
