Article
A Systematic Approach to Portfolio Optimization: A Comparative
Study of Reinforcement Learning Agents, Market Signals, and
Investment Horizons
Francisco Espiga-Fernández * , Álvaro García-Sánchez † and Joaquín Ordieres-Meré †
Industrial Engineering, Business Administration and Statistics Department, Escuela Técnica Superior de
Ingenieros Industriales, Universidad Politécnica de Madrid, José Gutierrez Abascal 2, 28006 Madrid, Spain;
alvaro.garcia@upm.es (Á.G.-S.); j.ordieres@upm.es (J.O.-M.)
* Correspondence: francisco.espiga.fernandez@alumnos.upm.es
† These authors contributed equally to this work.
Abstract: This paper presents a systematic exploration of deep reinforcement learning (RL) for portfo-
lio optimization and compares various agent architectures, such as the DQN, DDPG, PPO, and SAC.
We evaluate these agents’ performance across multiple market signals, including OHLC price data
and technical indicators, while incorporating different rebalancing frequencies and historical window
lengths. This study uses six major financial indices and a risk-free asset as the core instruments.
Our results show that CNN-based feature extractors, particularly with longer lookback periods,
significantly outperform MLP models, providing superior risk-adjusted returns. DQN and DDPG
agents consistently surpass market benchmarks, such as the S&P 500, in annualized returns. However,
continuous rebalancing leads to higher transaction costs and slippage, making periodic rebalancing a
more efficient approach to managing risk. This research offers valuable insights into the adaptability
of RL agents to dynamic market conditions, proposing a robust framework for future advancements
in financial machine learning.
Keywords: reinforcement learning; portfolio optimization; policy optimization; deep learning; deep
reinforcement learning; mean-variance optimization
Citation: Espiga-Fernández, F.; García-Sánchez, Á.; Ordieres-Meré, J. A Systematic Approach to Portfolio Optimization: A Comparative Study of Reinforcement Learning Agents, Market Signals, and Investment Horizons. Algorithms 2024, 17, 570. https://doi.org/10.3390/a17120570

Academic Editor: Ulrich Kerzel

Received: 29 October 2024; Revised: 8 December 2024; Accepted: 9 December 2024; Published: 12 December 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

1.1. Overview of Portfolio Optimization

Portfolio optimization refers to the systematic selection of a combination of investment
assets, or a portfolio, with the goal of maximizing returns while managing risk. The
optimization process typically aims to find an allocation of assets that achieves an investor's
objectives, such as maximizing return for a given level of risk or minimizing risk for a
given expected return. This balancing act is commonly framed through the lens of Modern
Portfolio Theory (MPT) [1], which models risk through the variance or covariance of
asset returns. In practice, portfolio optimization applies to various investment horizons
and strategies, involving the dynamic reallocation of assets over time in response to
market changes.

The assets or investment vehicles within a portfolio can include a broad range of
financial instruments. These include equities (stocks), bonds, commodities, currencies, and
derivatives such as options and futures. Each asset class behaves differently in terms of risk
and return, making it important for portfolio managers to consider the interrelationships
between asset returns when optimizing the portfolio. Investment vehicles vary in their
functions: equities are generally employed for long-term capital appreciation, whereas
bonds offer more predictable income streams and reduced risk exposure.

One of the central challenges in portfolio optimization is handling transaction costs
and slippage, both of which can significantly impact the returns of a portfolio. Transaction
costs include brokerage fees, taxes, and other costs incurred during the buying and selling
of assets. Meanwhile, slippage refers to the difference between the expected price of a trade
and the actual price, which often occurs in markets with low liquidity or during periods of
high volatility. These factors complicate the portfolio optimization process because frequent
trading, which becomes essential in rapidly changing market environments, can erode
potential returns through these hidden costs. Additionally, the need to rebalance portfolios
dynamically in response to market conditions adds further complexity.
Portfolio optimization techniques are widely applied across the financial industry,
particularly among large institutional players such as hedge funds, mutual funds, pension
funds, and investment banks. These institutions use sophisticated quantitative models
to manage the portfolios of high-net-worth individuals and institutional investors, with
the objective of exceeding market benchmarks while maintaining robust risk management
frameworks [2,3].
$$\min_{w} \; w^{T}\Sigma w \quad \text{subject to} \quad w^{T}\mu \geq r_{\min}, \quad \sum_i w_i = 1, \quad 0 \leq w_i \leq 1 \tag{1}$$
where w represents portfolio weights, Σ is the covariance matrix of asset returns, and µ is
the vector of expected returns [2]. Risk parity, on the other hand, aims to allocate assets
such that each contributes equally to the overall portfolio risk. This method focuses on
diversifying risk rather than returns, making it attractive for risk-averse investors. Value at
Risk (VaR) is a widely used risk measure that estimates the potential loss in portfolio value
over a specific time frame with a given confidence level. VaR is often used to determine the
maximum expected loss under normal market conditions, though its reliance on historical
data makes it vulnerable to extreme events [3].
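As a concrete illustration of Equation (1), the following is a minimal sketch that solves a small mean-variance problem with SciPy; the expected returns, covariance matrix, and return floor are hypothetical values for illustration only, not data from this study.

```python
# Minimal mean-variance optimization sketch for Equation (1).
# mu, Sigma, and r_min below are hypothetical illustrative values.
import numpy as np
from scipy.optimize import minimize

mu = np.array([0.08, 0.05, 0.11])          # hypothetical expected annual returns
Sigma = np.array([[0.040, 0.006, 0.010],
                  [0.006, 0.020, 0.004],
                  [0.010, 0.004, 0.090]])  # hypothetical covariance matrix
r_min = 0.07                               # minimum acceptable expected return

n = len(mu)
constraints = [
    {"type": "eq",   "fun": lambda w: np.sum(w) - 1.0},   # fully invested
    {"type": "ineq", "fun": lambda w: w @ mu - r_min},    # return floor
]
bounds = [(0.0, 1.0)] * n                                 # long-only weights

result = minimize(lambda w: w @ Sigma @ w,                # minimize portfolio variance
                  x0=np.full(n, 1.0 / n),
                  bounds=bounds,
                  constraints=constraints,
                  method="SLSQP")
print(result.x)  # optimal weights under the stated assumptions
```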
Despite their widespread adoption, these traditional methods face significant limi-
tations, particularly in the context of modern financial markets. One key issue is their
lack of adaptability to rapidly changing market conditions. Methods like MVO rely on
historical data and assume that future asset returns and risks will behave similarly, which is
frequently invalid in volatile or non-stationary markets. This reliance on historical averages
can result in overfitting, where the model becomes too tailored to past performance, leading
to suboptimal decisions when market dynamics shift [5]. Moreover, MVO’s assumption
that risk can be fully captured by variance oversimplifies the complex nature of risk in
financial markets, ignoring factors such as tail risk or extreme market movements.
Another limitation of these traditional approaches is their reliance on strict assump-
tions about asset returns. MVO, for instance, assumes that asset returns follow a normal
distribution and that correlations between assets are stable over time. In reality, financial
returns tend to be skewed and exhibit fat tails, with extreme events occurring more fre-
quently than predicted by normal distribution models [6]. Similarly, VaR assumes a linear
relationship between risk factors and portfolio returns, which breaks down during periods
of market stress, leading to an underestimation of risk. Furthermore, risk parity can be
overly conservative in environments where certain assets exhibit low volatility but are
exposed to systemic risks, thus potentially skewing allocations [7].
conditions. To date, no study has conducted an exhaustive analysis of how different rebal-
ancing frequencies, investment instruments, feature extractors, and reward functions affect
portfolio optimization performance under RL. As a result, there is a gap in understanding
the trade-offs involved in choosing one configuration over another in varying market
conditions [32].
the additional trading costs, providing insight into the optimal rebalancing frequency for
maximizing performance in real-world scenarios [34].
With respect to the value of technical indicators versus raw prices, two different
signal configurations were tested, as detailed in Section 2.4. By maintaining all other
parameters constant between experiments, we aimed to isolate and assess the additional
value provided by technical indicators. On the one hand, as neural networks are powerful
universal function approximators [35] and feature learners [36,37], it could be argued that
the raw price data alone should be sufficient for the agent to learn meaningful patterns,
and the technical indicators could be learned as latent-state representations. However,
given the limited amount of data available, technical indicators may offer a concise and
informative summary of the market environment, potentially enhancing learning efficiency
and convergence. This is particularly relevant in light of the EMH [7,33], which posits that
all available information is reflected in the price. Some studies in the literature suggest that
additional market indicators may contradict the EMH, providing opportunities for better
decision-making by RL agents. Therefore, the experiment sought to explore the balance
between raw data and engineered features in portfolio optimization.
With respect to capturing patterns and anticipating behavior in the time series of
the indicators for each financial instrument, two different feature extractors—MLP and
CNN—were employed. MLPs are effective at capturing general relationships across all
input features but might miss localized temporal patterns, while CNNs excel at identifying
local structures and sequential dependencies in time-series data, making them well suited
for detecting patterns in market signals. Additionally, the experiments varied the lookback
period, with lengths of 16 and 28 periods, to determine the trade-off between providing the
agent with more historical context and the increased complexity of the model. A longer
lookback may provide valuable historical context, allowing the agent to capture long-term
market trends, but it also increases the size of the input, especially for CNNs, which could
lead to more challenging training and potential overfitting. This experiment sought to
determine the optimal balance between the depth of historical data and network complexity.
Lastly, in terms of performance, the study compared different RL agent families
to determine whether continuous-action agents (e.g., SAC, DDPG) that allocate across
multiple instruments at once can outperform discrete-action agents (e.g., DQN, PPO) that
allocate the entire portfolio to a single instrument. The hypothesis is that continuous
agents, by diversifying across instruments, may be better equipped to handle complex
market environments, thus improving risk-adjusted returns. On the other hand, discrete-
action agents, which focus on identifying the single instrument with the highest expected
value, might excel in simpler, trending markets. By comparing these two approaches, the
experiment aimed to evaluate the trade-offs between diversification and targeted allocation
in RL-based portfolio optimization.
equal to 1, representing a full allocation to one instrument, while the remaining components
are set to 0. The precise architecture and functioning of the agent are further detailed in
Appendix A.
In each rebalancing period t + f , with f being the rebalancing frequency in trading
periods, the agent proposes a new set of portfolio weights, which are combined with the
previous portfolio weights decided at time t according to Section 2.2 to compute the return
over the rebalancing period. The calculation accounts for slippage and transaction costs,
modeled based on weight changes, which decrease the return of bought or sold positions
as a percentage. One of the assumptions of our paper is that instruments are liquid and
fractional assets, as opposed to discrete units (shares) priced at Pi . Specifically, the agent’s
proposal results in three subsets:
• H (held positions): Assets where the portfolio weight remains unchanged, $w_{i,t} = w_{i,t+f}$.
• B (bought positions): Assets where the portfolio weight has increased, requiring the purchase of additional units, $w_{i,t} < w_{i,t+f}$.
• S (sold positions): Assets where the portfolio weight has decreased, requiring the sale of some units, $w_{i,t} > w_{i,t+f}$.
For the first trading period, the return is computed by adjusting the portfolio based on
these three subsets, taking into account the impact of market fluctuations during the initial
transition period.
Using a weight decomposition, we can consider the portfolio weights as the evolution
from time t − f to time t, where the weights are defined as follows:
• $\vec{w}_h$ corresponds to the portfolio weights of the held positions between $t$ and $t + f$ and corresponds, for each instrument, to $\min(w_{i,t}, w_{i,t+f})$.
• $\vec{w}_b$ corresponds to the positive weight deltas $\delta w_i^{+} = w_{i,t+f} - w_{i,t}$, where $w_{i,t+f} - w_{i,t} > 0$, and corresponds to increased positions on an instrument (buys).
• $\vec{w}_s$ corresponds to the negative weight deltas $\delta w_i^{-} = w_{i,t+f} - w_{i,t}$, where $w_{i,t+f} - w_{i,t} < 0$, and corresponds to decreased positions on an instrument (sales).

$$(\vec{w}_h + \vec{w}_b + \vec{w}_s) \cdot \vec{1} = 1 \quad \text{(ensures portfolio balance)} \tag{2}$$
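This decomposition is straightforward to implement. The sketch below (not the authors' code) returns the three weight vectors defined above for a pair of old and new allocations.

```python
import numpy as np

def decompose_weights(w_old: np.ndarray, w_new: np.ndarray):
    """Split a rebalancing move into held, bought, and sold weight vectors.

    Following the definitions above: w_h[i] = min(w_old[i], w_new[i]),
    w_b holds the positive deltas (buys) and w_s the negative deltas (sells).
    """
    w_h = np.minimum(w_old, w_new)
    delta = w_new - w_old
    w_b = np.where(delta > 0, delta, 0.0)   # increased positions (buys)
    w_s = np.where(delta < 0, delta, 0.0)   # decreased positions (sells), non-positive entries
    return w_h, w_b, w_s
```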
In regard to the returns of the different positions, the returns of the held positions, $r_h$,
are computed as the ratio of closing prices between rebalancing periods $t + f$ and $t$:
$$r_h = \frac{P_{Close,t+f}}{P_{Close,t}} \tag{3}$$
The returns of the increased positions (buys) are assumed to execute at the opening
price of the next trading period after the rebalancing date $t + f + 1$, and the return $r_b$ is
$$r_b = \frac{P_{Close,t+f+1}}{P_{Open,t+f+1}} \quad \text{(executed at the opening price after rebalancing)} \tag{4}$$
The returns of the decreased positions (sells) are assumed to execute at the opening
price of the next trading day after the rebalancing date $t + f + 1$, and the return $r_s$ is
$$r_s = \frac{P_{Open,t+f+1}}{P_{Close,t+f}} \tag{5}$$
For the subsequent periods, returns are computed as the ratio of the closing price at
time $t$ over the closing price at time $t - 1$. Now, a return trajectory between $t$ and $t + f$
rebalancing dates can be obtained:
$$\forall\, T \in (t + f + 1, t + 2f] : \quad r_{pf,T} = \vec{w}_h \cdot \vec{r}_h \tag{6}$$
and for the first trading day after the rebalancing date, it is
$$r_{pf,t+f+1} = \vec{w}_h \cdot \vec{r}_{h,t+f+1} + \vec{w}_b \cdot \vec{r}_{b,t+f+1} + \vec{w}_s \cdot \vec{r}_{s,t+f+1} \tag{7}$$
The returns of the increased and decreased positions are relatively decremented by
the transaction costs and slippage. In our study, transaction costs and slippage have been fixed to 5
and 2 basis points, respectively. This ensures the realistic modeling of transaction costs and
slippage, which are critical in practical portfolio management.
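Putting the pieces together, one plausible reading of Equations (2)-(7) with the stated cost assumptions is sketched below; the treatment of the sold weights through their magnitude and the `decompose_weights` helper sketched earlier are interpretations and assumptions, not the authors' verified implementation.

```python
import numpy as np

TRANSACTION_COST = 0.0005   # 5 basis points, as assumed in the study
SLIPPAGE = 0.0002           # 2 basis points, as assumed in the study

def first_day_return(w_old, w_new, r_h, r_b, r_s):
    """Return over the first trading day after a rebalance (cf. Equation (7)).

    Bought and sold legs are penalized by transaction costs and slippage;
    sold weights enter through their magnitude (one possible interpretation).
    """
    w_h, w_b, w_s = decompose_weights(w_old, w_new)   # hypothetical helper sketched earlier
    cost_factor = 1.0 - TRANSACTION_COST - SLIPPAGE
    return w_h @ r_h + cost_factor * (w_b @ r_b) + cost_factor * (np.abs(w_s) @ r_s)
```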
The goal of the agent is to maximize the total cumulative log-return over an entire
episode, which consists of 252 periods, representing a full trading year. By optimizing
over this time horizon, the agent learns to balance risk and return, adjusting the portfolio
dynamically based on market conditions to achieve long-term profitability.
By analyzing OHLC data alongside candlestick patterns, traders can make informed
decisions about potential entry and exit points in the market. These data are fundamental
to technical analysis, providing a visual representation of price action and helping to detect
market sentiment shifts. Although [39] highlights that postprocessing is required to obtain
representative OHLC data for a specific aggregation period, in this study, it was used as-is
so as to avoid the requirement for higher-frequency OHLC data or Limit Order Book (LOB)
data to build the aggregate OHLC data and ensure that our research is easily reproducible.
Figure 1. Candlestick chart displaying OHLC (Open-High-Low-Close) prices for the S&P 500 index.
The EMH posits that markets are fully efficient, meaning that all available information
is already reflected in asset prices. If the EMH holds true, no amount of technical or
fundamental analysis will allow an investor to consistently achieve returns that outperform
the market, as prices already represent all known information about an asset’s current
behavior and future potential. In this view, price alone would contain all relevant data,
rendering additional signals from technical indicators redundant or unnecessary. However,
critics of the EMH [33] argue that inefficiencies, such as human behavior, market psychology,
or anomalies, provide opportunities to generate excess returns through the use of advanced
signals beyond just price.
Market Behavior | AROONOSC | RSI | CCI | CMO | MFI | Williams %R | STOCHF
Overbought (Uptrend) | >50 | >70 | >+100 | >50 | >80 | >−20 | >80
Oversold (Downtrend) | <−50 | <30 | <−100 | <−50 | <20 | <−80 | <20
High Volatility | N/A | N/A | N/A | N/A | N/A | N/A | N/A
Low Volatility | N/A | N/A | N/A | N/A | N/A | N/A | N/A
Bullish Divergence | Price ↓, AROON ↑ | Price ↓, RSI ↑ | Price ↓, CCI ↑ | Price ↓, CMO ↑ | Price ↓, MFI ↑ | Price ↓, %R ↑ | Price ↓, STOCHF ↑
Bearish Divergence | Price ↑, AROON ↓ | Price ↑, RSI ↓ | Price ↑, CCI ↓ | Price ↑, CMO ↓ | Price ↑, MFI ↓ | Price ↑, %R ↓ | Price ↑, STOCHF ↓
Momentum Confirmation | >0 (Uptrend) | >50 (Bullish) | >+100 (Uptrend) | >0 (Uptrend) | >50 (Bullish) | >−50 (Bullish) | %K > %D
Trend Reversal | Crosses 0 (Up/Down) | Crosses 50 | Crosses +100/−100 | Crosses 0 | Crosses 80/20 | Crosses −20/−80 | Crosses 80/20
Market behaviors and the associated directional changes (↑/↓) of Price and Technical indicators.
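To make the signal configuration concrete, the following is a minimal sketch (not the study's exact pipeline) that computes the indicators in the table with TA-Lib at its default parameters, as stated in the Data Availability Statement; the pandas DataFrame `df` with OHLCV columns for a single instrument is an assumed input.

```python
# Sketch: computing the table's indicators with TA-Lib defaults.
# Assumes `df` is a pandas DataFrame with Open/High/Low/Close/Volume columns.
import talib

high = df["High"].to_numpy(dtype=float)
low = df["Low"].to_numpy(dtype=float)
close = df["Close"].to_numpy(dtype=float)
volume = df["Volume"].to_numpy(dtype=float)

rsi = talib.RSI(close)                         # Relative Strength Index
aroonosc = talib.AROONOSC(high, low)           # Aroon Oscillator
cci = talib.CCI(high, low, close)              # Commodity Channel Index
cmo = talib.CMO(close)                         # Chande Momentum Oscillator
mfi = talib.MFI(high, low, close, volume)      # Money Flow Index
willr = talib.WILLR(high, low, close)          # Williams %R
fastk, fastd = talib.STOCHF(high, low, close)  # Fast Stochastic oscillator (%K, %D)

overbought = rsi > 70                          # example thresholds from the table
oversold = rsi < 30
```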
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} \tag{8}$$
where
• $z^{(l)}$ is the pre-activation value at layer $l$;
• $W^{(l)}$ is the weight matrix for layer $l$;
• $a^{(l-1)}$ is the output (activation) from the previous layer;
• $b^{(l)}$ is the bias vector.
The output $a^{(l)}$ is then passed through a non-linear activation function $\sigma$, such as
ReLU (Rectified Linear Unit) or sigmoid:
$$a^{(l)} = \sigma\left(z^{(l)}\right) \tag{9}$$
In a CNN, the pre-activation at each spatial position is instead obtained by convolving the
previous layer's activations with a shared kernel:
$$z_{i,j}^{(l)} = \sum_{m}\sum_{n} W_{m,n}^{(l)}\, a_{i+m,j+n}^{(l-1)} + b^{(l)} \tag{10}$$
where
• $z_{i,j}^{(l)}$ is the activation at position $(i, j)$ in layer $l$;
• $W_{m,n}^{(l)}$ represents the convolutional kernel applied at position $(m, n)$;
• $a_{i+m,j+n}^{(l-1)}$ is the input from the previous layer at the corresponding position;
• $b^{(l)}$ is the bias term.
CNNs are particularly advantageous for tasks where spatial or temporal relationships
are crucial, such as identifying trends in financial time-series data. By sharing weights
across different regions, CNNs are more parameter-efficient than MLPs, leading to faster
training and fewer parameters to tune.
Overall, MLPs offer greater flexibility at the expense of parameter efficiency, while
CNNs are more scalable and effective for structured data but may require more sophisti-
cated architectures and tuning.
In this paper, the MLPs in the different experiments are the vanilla implementations
for each agent from the StableBaselines3 [40] library. Table 2 provides a detailed view of
each architecture, including the activation functions, number of layers, and neurons used
for each agent for the MLP variant.
Table 2. Feature extractor architectures for the MLP for each agent type.
For the CNN, a custom model has been implemented with the architecture in Table 3.
Convolutional Layer | Input Channels | Output Channels | Kernel Size | Stride | Padding | Activation fn
2D Convolution | input_size | 32 | 2 | 1 | 0 | ReLU
2D Convolution | 32 | 64 | 2 | 1 | 0 | ReLU

Fully Connected Layer | Input Size | Output Size | Activation fn
Flatten | (64, 32, 2, 2) | 6656 | -
Linear | 6656 | 128 | ReLU

The input_size of the first layers is equal to a 3-D tensor of dimension lookback × 6 × 4 in the experiments with only OHLC prices and lookback × 6 × 12 in the experiments with technical indicators from Section 2.4.
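For reference, the custom CNN in Table 3 can be expressed as a Stable-Baselines3 feature extractor roughly as follows; this is a sketch consistent with the table, not the authors' exact code, and the channels-first observation layout is an assumption.

```python
# Sketch of a custom CNN feature extractor following Table 3, using the
# Stable-Baselines3 BaseFeaturesExtractor interface. Observation layout
# (channels, height, width) is an assumption.
import torch
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomCNNExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=2, stride=1, padding=0),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=2, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with a dummy forward pass.
        with torch.no_grad():
            n_flatten = self.cnn(
                torch.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        return self.linear(self.cnn(observations))
```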
$$r_t = \log\left(\frac{P_{close}}{P_{open}}\right) \tag{11}$$
where $P_{close}$ is the closing price, and $P_{open}$ is the opening price for the time period $t$.
The reward function is designed to align with the agent’s objective of maximizing
cumulative log-returns over an entire episode. The log transformation simplifies the reward
computation, allowing the total reward R for an episode to be expressed as the sum of the
log-returns between rebalancing periods. For each rebalancing interval, the log-return is
calculated based on the portfolio value V at the beginning and end of the interval. The
portfolio value is derived as the dot product of the asset weights w and the corresponding
asset prices P at each time step:
$$R = \sum_{t=1}^{T} \log\frac{V_t}{V_{t-1}} \tag{12}$$
$$V_t = \vec{w}_t \cdot \vec{P}_t \tag{13}$$
with wt being the vector of portfolio weights and Pt the asset prices at time t.
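The reward computation in Equations (12) and (13) reduces to a few lines. The sketch below assumes the weight and price trajectories are stored as arrays with one row per trading period; it is illustrative, not the environment's exact code.

```python
import numpy as np

def episode_reward(weights: np.ndarray, prices: np.ndarray) -> float:
    """Cumulative log-return reward (Equation (12)).

    weights: (T+1, n_assets) portfolio weights at each step.
    prices:  (T+1, n_assets) asset prices at each step.
    """
    values = np.einsum("ti,ti->t", weights, prices)   # V_t = w_t . P_t (Equation (13))
    return float(np.sum(np.log(values[1:] / values[:-1])))
```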
Cumulative returns are a well-established metric within the financial industry, fre-
quently used to evaluate and compare the performance of various investment strategies.
One of the main advantages of cumulative returns is their ability to be annualized, al-
lowing for consistent comparisons across different investment horizons and rebalancing
frequencies, typically expressed in terms of years. This makes it easier to analyze both
the risk (volatility) and reward (return) in a standardized manner. Moreover, cumulative
log-returns are widely employed as the reward metric in RL-based portfolio optimization
studies. Although cumulative returns are the most common choice for RL agents, alterna-
tive reward metrics, such as the Sharpe ratio, which adjusts returns based on risk, or other
measures, like the Sortino ratio, Calmar ratio, and maximum drawdown (Section 2.7.5), are
also used to better account for downside risk [13,30].
• For continuous-action agents, the action represents the percentage allocation across
instruments, ensuring that ∀i : wi ≥ 0 and ∑i wi = 1, which guarantees long-only
positions and a fully invested portfolio.
The reward function in the environment is computed as the log-return between con-
secutive rebalancing periods. This is calculated by multiplying the portfolio weights by the
prices of each instrument, as outlined in Section 2.6. This framework enables the agent to
optimize cumulative log-returns over the duration of the episode.
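The mapping from raw agent actions to valid portfolio weights can be implemented, for example, as follows; this sketches one common choice (one-hot encoding for discrete agents, softmax normalization for continuous agents) and is not necessarily the exact transformation used in the study.

```python
import numpy as np

def to_weights(action, n_assets: int, discrete: bool) -> np.ndarray:
    """Map a raw agent action to long-only, fully invested portfolio weights."""
    if discrete:
        w = np.zeros(n_assets)
        w[int(action)] = 1.0                 # full allocation to one instrument
        return w
    a = np.asarray(action, dtype=float)
    e = np.exp(a - a.max())                  # softmax: w_i >= 0 and sum(w) == 1
    return e / e.sum()
```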
Figure 3. Environment observations as a vector for the MLP and a tensor for the CNN. Each color represents a different technical indicator for a chosen investment instrument.
evaluation period allows for a comprehensive assessment of the agents’ adaptability and
performance across different market environments.
For technical indicators, particularly those that rely on moving averages, no embargo
period has been implemented, as suggested by [39], since the indicators used in this study
do not exhibit target leakage. Consequently, the entirety of the evaluation data has been
utilized without restrictions.
To ensure the robustness and validity of the results, the evaluation period from 2016
to April 2024 has been consistently applied across all RL agents and baseline models.
This consistent evaluation timeframe guarantees that each model is tested under identical
market conditions, allowing for a fair and meaningful comparison of their performance.
where rt represents the return at time t, and T is the number of time periods in a year.
2. Annualized Volatility: Volatility is a measure of risk, reflecting the standard deviation
of the portfolio’s returns over time, annualized to match the return horizon.
$$\sigma_{Pf,annualized} = \sigma \times \sqrt{T} \tag{15}$$
where σ is the standard deviation of returns, and T is the number of time periods in
a year.
3. Sharpe Ratio: The Sharpe ratio measures risk-adjusted returns, considering the excess
return over the risk-free rate relative to volatility.
$$R_{Sharpe} = \frac{R_{Pf,annualized} - R_f}{\sigma_{Pf,annualized}} \tag{16}$$
4. Sortino Ratio: The Sortino ratio measures risk-adjusted returns using downside volatility instead of total volatility.
$$R_{Sortino} = \frac{R_{annualized} - R_{risk\text{-}free}}{\sigma_{downside}} \tag{17}$$
where σdownside is the standard deviation of negative returns.
5. Maximum Drawdown: This metric measures the largest peak-to-trough decline in
portfolio value, providing insight into the worst-case wealth loss.
$$\text{Maximum Drawdown} = \frac{\max(V_t) - \min(V_t)}{\max(V_t)} \tag{18}$$
where Vt is the portfolio value at time t.
6. Calmar Ratio: The Calmar ratio is a risk-adjusted return metric that evaluates the
portfolio’s performance relative to its maximum drawdown.
$$R_{Calmar} = \frac{R_{Pf,annualized}}{\text{Maximum Drawdown}} \tag{19}$$
These metrics complement each other by offering a balanced view of portfolio perfor-
mance. While the Sharpe ratio uses symmetrical volatility as a risk measure, the Sortino
ratio focuses solely on downside risk, making it more suitable for risk-averse investors.
The Calmar ratio and maximum drawdown, on the other hand, emphasize worst-case loss
scenarios, which are particularly relevant for investors focused on capital preservation.
By combining these metrics, we can assess which RL agents and configurations excel in
maximizing returns, minimizing risk, or avoiding significant drawdowns, depending on
the specific investment objectives.
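For completeness, the metrics above can be computed from a return series and a portfolio value series as in the following sketch; the 252-period annualization and the geometric compounding of the annualized return are assumptions, and the maximum drawdown follows Equation (18) as written.

```python
import numpy as np

def performance_metrics(returns: np.ndarray, values: np.ndarray,
                        risk_free: float = 0.0, periods_per_year: int = 252) -> dict:
    """Sketch of the evaluation metrics in Equations (15)-(19)."""
    ann_return = (1.0 + returns).prod() ** (periods_per_year / len(returns)) - 1.0
    ann_vol = returns.std() * np.sqrt(periods_per_year)            # Eq. (15)
    sharpe = (ann_return - risk_free) / ann_vol                    # Eq. (16)
    downside = returns[returns < 0].std() * np.sqrt(periods_per_year)
    sortino = (ann_return - risk_free) / downside                  # Eq. (17)
    max_drawdown = (values.max() - values.min()) / values.max()    # Eq. (18), as defined
    calmar = ann_return / max_drawdown                             # Eq. (19)
    return {"annualized_return": ann_return, "annualized_volatility": ann_vol,
            "sharpe": sharpe, "sortino": sortino,
            "max_drawdown": max_drawdown, "calmar": calmar}
```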
$$\vec{w}_{market} = [1, 0, 0, 0, 0, 0, 0] \tag{21}$$
These two portfolio models—the equally weighted portfolio and the market portfo-
lio—serve as baselines for evaluating the RL agent configurations. By comparing their
performance to these benchmarks, we gain insight into how well the RL agents perform in
relation to standard, market-based investment strategies. The cumulative reward of the
market and EQP portfolio for the period 2016-April 2024 is shown in Figure 4.
Figure 4. Cumulative returns for the market (S&P 500 proxy) and the EWP for the period 2016–
April 2024.
3. Results
3.1. Agents’ Performance
The results of the different RL agent configurations are summarized in terms of key
financial metrics. These metrics were calculated for each configuration of agent type,
feature extractor, lookback period, and rebalancing frequency and are shown in Table 5.
For example, the DQN agent using an MLP feature extractor and a 16-period lookback
window achieved an annualized return of 18.25% with a Sharpe ratio of 0.66. A similar
configuration with a 28-period lookback showed a higher annualized return of 25.76% and
a Sharpe ratio of 0.80. These results suggest that longer lookback periods can yield higher
returns but introduce additional risk, particularly during periods of market volatility.
On the other hand, the PPO agent with an MLP feature extractor exhibited under-
whelming performance, with an annualized return of −1.27% and a low Sharpe ratio,
indicating inferior results when compared to the baselines, as shown in Table 6.
When comparing RL agents to the market benchmark (S&P 500) and the equally
weighted portfolio (EWP), we observed the following:
• The S&P 500 baseline had a Sharpe ratio of 0.80 and an annualized return of 14.18%,
which outperformed several agent configurations, particularly those with shorter
lookback periods and high rebalancing frequency.
• The equally weighted portfolio baseline yielded an annualized return of 8.73% and a Sharpe
ratio of 0.68, serving as a middle-ground benchmark that several RL agents surpassed.
Additionally, it is worth mentioning that the SAC experiments experienced conver-
gence issues during training. Despite several tuning attempts, the models did not converge
consistently, and as a result, the SAC agent configurations were excluded from the final
analysis due to their unreliable performance.
In summary, the DQN consistently outperforms the other agent families in terms of
annualized return and risk-adjusted metrics such as the Sharpe, Sortino, and Calmar ratios,
especially when using longer lookback windows. On the other hand, DDPG excels in
managing risk, demonstrating the lowest volatility, downside risk, and max drawdown
among the agent families, making it more robust in risk-averse scenarios. Regarding feature
extractors, the CNN tends to perform better in terms of maximizing returns and achieving
higher risk-adjusted ratios (Sharpe, Sortino, Calmar), whereas the MLP shows superior
performance in risk mitigation, as evidenced by lower volatility and max drawdown. This
suggests that the choice between the CNN and MLP depends on whether the primary focus
is on maximizing returns or minimizing risk.
In contrast, the Russell 2000 (^RUT) and FTSE (^FTSE) indices are either underrepre-
sented or completely absent in several configurations, suggesting that the agents might find
these indices less favorable for maximizing returns given the prevailing market conditions.
Their lower inclusion could indicate that these indices are less predictable or offer lower
risk-adjusted returns.
CASH, representing the risk-free asset, holds a significant portion in some agent
configurations, particularly with DQN agents using CNN feature extractors. This suggests
a conservative, risk-averse strategy, where the agent opts to hedge against market volatility
by allocating a larger portion of the portfolio to CASH. In certain DQN configurations,
such as with CNN feature extraction, the allocation to CASH is notably high, indicating a
preference for safety during periods of market instability.
In general, DDPG agents (with either MLP or CNN extractors) show more diversified
portfolios across various instruments, whereas DQN agents tend to focus on fewer instru-
ments or exhibit extreme concentration in one or two indices. Some DQN configurations
allocate the entire portfolio to a single asset, as observed in cases with complete allocation
to ^IXIC, as summarized in Table 7.
^RUT ^IXIC ^GDAXI ^FTSE ^N225 ^GSPC CASH Agent Feat.Ext. Lookback Reb.Freq. Indicators
0.00 0.27 0.27 0.00 0.03 0.14 0.29 DDPG MLP 16 1 No
0.15 0.00 0.11 0.00 0.03 0.36 0.36 DDPG MLP 28 1 No
0.12 0.59 0.02 0.01 0.00 0.05 0.21 DQN MLP 16 1 No
0.00 0.00 1.00 0.00 0.00 0.00 0.00 DQN MLP 28 1 No
0.14 0.09 0.17 0.25 0.11 0.10 0.13 PPO MLP 16 1 No
0.11 0.15 0.06 0.12 0.23 0.28 0.06 PPO MLP 28 1 No
0.21 0.21 0.00 0.00 0.15 0.21 0.21 DDPG MLP 16 1 Yes
0.25 0.25 0.25 0.00 0.25 0.00 0.00 DDPG MLP 28 1 Yes
0.01 0.02 0.18 0.00 0.78 0.01 0.00 DQN MLP 16 1 Yes
0.00 0.00 0.00 0.00 0.00 1.00 0.00 DQN MLP 28 1 Yes
0.17 0.06 0.15 0.32 0.15 0.04 0.11 PPO MLP 16 1 Yes
0.13 0.12 0.13 0.15 0.18 0.14 0.14 PPO MLP 28 1 Yes
0.00 0.39 0.03 0.00 0.00 0.19 0.38 DDPG MLP 16 10 No
0.25 0.02 0.07 0.12 0.23 0.25 0.06 DDPG MLP 28 10 No
0.00 0.01 0.09 0.61 0.00 0.28 0.01 DQN MLP 16 10 No
0.50 0.01 0.04 0.00 0.20 0.00 0.24 DQN MLP 28 10 No
0.14 0.10 0.25 0.03 0.28 0.07 0.12 PPO MLP 16 10 No
0.22 0.29 0.27 0.02 0.03 0.11 0.07 PPO MLP 28 10 No
0.28 0.33 0.00 0.05 0.33 0.02 0.00 DDPG MLP 16 10 Yes
0.31 0.31 0.31 0.03 0.03 0.00 0.01 DDPG MLP 28 10 Yes
0.38 0.00 0.38 0.02 0.10 0.01 0.11 DQN MLP 16 10 Yes
0.39 0.01 0.04 0.16 0.02 0.28 0.10 DQN MLP 28 10 Yes
0.01 0.08 0.13 0.08 0.10 0.26 0.35 PPO MLP 16 10 Yes
0.17 0.16 0.02 0.09 0.13 0.39 0.04 PPO MLP 28 10 Yes
0.25 0.00 0.25 0.25 0.00 0.00 0.25 DDPG CNN 16 10 No
0.08 0.01 0.20 0.20 0.18 0.14 0.20 DDPG CNN 16 1 No
0.00 0.29 0.00 0.29 0.00 0.29 0.13 DDPG CNN 28 10 No
0.01 0.31 0.31 0.00 0.00 0.06 0.30 DDPG CNN 28 1 No
0.00 0.00 0.01 0.00 0.00 0.00 0.99 DQN CNN 16 10 No
0.00 0.00 0.00 0.00 1.00 0.00 0.00 DQN CNN 16 1 No
0.32 0.00 0.03 0.65 0.00 0.00 0.00 DQN CNN 28 10 No
0.00 0.00 0.00 0.00 0.00 1.00 0.00 DQN CNN 28 1 No
0.12 0.08 0.46 0.16 0.02 0.13 0.03 PPO CNN 16 10 No
0.02 0.19 0.13 0.06 0.25 0.14 0.20 PPO CNN 16 1 No
0.01 0.15 0.06 0.07 0.01 0.11 0.60 PPO CNN 28 10 No
0.10 0.27 0.05 0.26 0.07 0.04 0.21 PPO CNN 28 1 No
0.04 0.24 0.24 0.03 0.00 0.24 0.20 DDPG CNN 16 10 Yes
0.13 0.00 0.22 0.21 0.22 0.22 0.00 DDPG CNN 16 1 Yes
0.00 0.20 0.20 0.00 0.20 0.20 0.20 DDPG CNN 28 10 Yes
0.02 0.37 0.23 0.06 0.32 0.00 0.00 DDPG CNN 28 1 Yes
0.00 0.00 0.00 0.00 0.00 1.00 0.00 DQN CNN 16 10 Yes
0.00 0.00 0.01 0.00 0.00 0.00 0.99 DQN CNN 16 1 Yes
0.98 0.00 0.00 0.00 0.01 0.00 0.00 DQN CNN 28 10 Yes
0.05 0.00 0.95 0.00 0.00 0.00 0.00 DQN CNN 28 1 Yes
0.20 0.04 0.10 0.22 0.29 0.03 0.12 PPO CNN 16 10 Yes
0.03 0.49 0.17 0.05 0.11 0.00 0.14 PPO CNN 16 1 Yes
0.11 0.19 0.08 0.34 0.25 0.02 0.01 PPO CNN 28 10 Yes
0.07 0.31 0.06 0.01 0.29 0.18 0.08 PPO CNN 28 1 Yes
Instrument allocation weights.
The top-performing portfolios (top 10) typically demonstrate a more diversified asset
allocation, with balanced distributions across multiple instruments. For instance, in the
DDPG-MLP-28 configuration using technical indicators, the allocation is spread evenly,
with 25% allocated across different assets and minimal reliance on cash. This reflects a more
aggressive investment strategy aiming to capitalize on growth across various markets.
In contrast to the top performers, the worst-performing portfolios (bottom 10) display
either extreme concentration or an over-allocation to CASH, limiting their potential to
capture the upside during market rallies. For example, some DQN-CNN configurations are
almost entirely allocated to the risk-free asset (CASH), neglecting other instruments, which
severely hampers growth potential. Similarly, certain CNN-based portfolios show poor
diversification, focusing heavily on one or two instruments, such as ^GDAXI, with little to
no allocation to other markets. This lack of diversification reduces the ability to mitigate
risks, resulting in underperformance. Additionally, these portfolios often feature a higher
reliance on CASH, reflecting a more conservative stance, but this ultimately sacrifices
opportunities for growth in more favorable market conditions.
4. Discussion
4.1. RL Agents
The primary goal of the RL agents was to maximize cumulative log-returns, aiming
for a consistent increase in the portfolio’s value over time. While risk management is funda-
mental in traditional portfolio optimization, in this study, risk was implicitly addressed only
when the agent incurred losses, reflected as negative log-returns. The allocation weights
assigned to different instruments, such as ^IXIC, ^GSPC, and ^N225, illustrate how each
agent family approached asset allocation. SAC models, however, faced convergence issues
and are excluded from further discussion due to unreliable results.
Distinct patterns emerged among the different RL agent families. DDPG agents, which
operate in continuous action spaces, exhibited greater portfolio diversification, distributing
allocations across multiple instruments. For example, the DDPG-CNN-16 configuration
allocated equal weights (0.25) to ^RUT, ^IXIC, and ^FTSE while maintaining a CASH
allocation of 0.25, demonstrating a balanced approach. In contrast, DQN and PPO agents,
operating in discrete action spaces, tended to focus on identifying and fully investing in a
single, high-performing asset. For instance, DQN-MLP-28 allocated 100% of its portfolio
to ^GDAXI, reflecting a concentrated strategy aimed at maximizing returns, with less
consideration for risk diversification.
In contrast, MLP-based agents, particularly those using shorter lookback periods (e.g.,
16 periods), struggled to effectively anticipate market downturns. For example, PPO-MLP-
16 experienced higher maximum drawdowns, reflecting its difficulty in adjusting to rapid
market declines. These MLP-based agents frequently underperformed relative to the EWP
in risk-adjusted returns, as indicated by their lower Sharpe ratios and comparable Sortino
ratios. For instance, DDPG-MLP-16 yielded a Sharpe ratio of 0.66 and a Sortino ratio of
0.91, showing moderate performance but an inability to capitalize on bullish periods as
effectively as CNN-based models. The market portfolio itself often outperformed these
MLP configurations during periods of heightened volatility.
In summary, CNNs—especially with longer lookback periods—were more adept at
recognizing and reacting to periods of volatility, allowing for higher allocations to riskier,
high-reward assets like ^IXIC and ^GSPC. On the other hand, MLPs, particularly those with
shorter lookback windows, exhibited more conservative behavior but were less effective at
adapting to shifting market conditions. Overall, CNN-based models delivered stronger risk–
return performance, as evidenced by superior Sharpe and Sortino ratios, outperforming
MLP configurations and baseline portfolios in many cases.
return of 25.9%, a Sharpe ratio of 0.80, and a Sortino ratio of 1.12, compared to 25.8%, 0.74,
and 1.02 for the OHLC configuration.
However, in some cases, technical indicators did not provide a significant advantage.
For example, the PPO-CNN-16 agent showed little difference between configurations, with
the technical indicator version achieving an annualized return of 18.5% and a Sharpe ratio
of 0.72, versus 18.0% and 0.71, respectively, for the OHLC-only version. This suggests that,
while technical indicators can provide added insight in certain scenarios, their utility is not
universally applicable across all agent configurations and market conditions.
In conclusion, technical indicators can enhance an agent’s ability to identify market
trends and manage risk in specific configurations, but their effectiveness is contingent on
factors such as the agent architecture, the complexity of the market environment, and the
length of the lookback period. These findings do not fully contradict the EMH but indicate
that under certain conditions, technical indicators may offer incremental value.
It is important to note, however, that RL agent portfolios are not bound by the efficient
frontier. The efficient frontier assumes static mean returns and covariances for assets, which
is a simplified view of financial markets. In contrast, RL agents dynamically adapt to
changing market conditions, potentially enabling them to achieve superior performance in
specific environments. This adaptability allows RL agents to optimize portfolios in ways
that traditional mean-variance optimization may overlook, demonstrating the flexibility
and potential of reinforcement learning in portfolio management.
4.6. Limitations
This study presents several limitations that could impact the generalizability of the
results. Firstly, the technical indicators and hyperparameters used were selected based on
prior knowledge rather than a systematic optimization process. As a result, it is possible
that alternative configurations could yield improved outcomes. The chosen set of indicators
may not represent the optimal solution across all market conditions, potentially limiting
the performance of the models in different environments.
Moreover, the reliance on daily data based on market close and open prices may not
fully capture the real-world execution prices faced by traders. Price fluctuations between
the close and the following day’s open can introduce discrepancies, leading to inaccuracies
in the simulation. Although the inclusion of slippage and transaction costs mitigates this
effect to some extent, alternative methods of preparing bar data, such as those proposed
by [39], may offer a more accurate reflection of trading conditions.
Additionally, this study assumes that financial instruments are both liquid and frac-
tional, simplifying the complexities of actual trading environments. In reality, liquidity
constraints and the need to trade in discrete units may affect portfolio adjustments, par-
ticularly in markets with limited liquidity or where fractional shares are not available.
These assumptions may limit the applicability of the results to real-world scenarios, where
trading constraints can significantly impact portfolio rebalancing and execution.
5. Conclusions
This study has demonstrated the potential of deep reinforcement learning (RL) agents
in portfolio optimization by evaluating various configurations of agents, feature extractors,
and rebalancing frequencies. The findings indicate that DQN and DDPG agents generally
outperform traditional baselines, such as the market portfolio (S&P 500) and the equally
weighted portfolio, in terms of both annualized returns and risk-adjusted performance,
with the best portfolios providing an additional yearly return of 10% compared to the
market portfolio and 17% compared to the equally weighted portfolio. Notably, CNN-
based feature extractors, particularly with longer lookback periods, were more effective at
identifying market patterns and adapting to volatile conditions, yielding superior Sharpe
and Sortino ratios compared to MLP-based agents.
The analysis of rebalancing strategies revealed that while continuous rebalancing can
capture short-term opportunities, it often incurs higher transaction costs and slippage. As
a result, periodic rebalancing (e.g., every 10 periods) emerged as a more efficient strategy
for balancing risk and managing costs. Furthermore, the inclusion of technical indicators
provided marginal improvements in certain configurations, suggesting that while they
may enhance the agent’s understanding of market dynamics, they do not consistently
outperform raw OHLC data.
These findings underscore the dynamic adaptability of RL agents in real-time market
conditions, moving beyond the static assumptions of traditional portfolio optimization
models like the efficient frontier. RL agents offer robust portfolio strategies that adjust
dynamically to shifting market environments. Their strong performance, particularly in
high-volatility settings, highlights their potential as valuable tools for portfolio manage-
ment. Nonetheless, challenges remain, including the optimization of hyperparameters,
the reduction in transaction costs, and the exploration of advanced architectures such as
Transformers, offering promising directions for future research.
Author Contributions: Conceptualization, F.E.-F., Á.G.-S. and J.O.-M.; methodology, F.E.-F., Á.G.-S.
and J.O.-M.; software, F.E.-F.; validation, Á.G.-S. and J.O.-M.; formal analysis, F.E.-F., Á.G.-S. and
J.O.-M.; investigation, F.E.-F.; resources, F.E.-F. and J.O.-M.; data curation, F.E.-F.; writing—original
draft preparation, F.E.-F.; writing—review and editing, F.E.-F., Á.G.-S. and J.O.-M.; visualization,
F.E.-F.; supervision, Á.G.-S. and J.O.-M.; project administration, Á.G.-S. and J.O.-M. All authors have
read and agreed to the published version of the manuscript.
Funding: The authors want to thank the Spanish Agencia Estatal de Investigación, as this research has been partially supported by the Ministerio de Ciencia e Innovación of Spain (Grant Ref. PID2022-137748OB-C31, funded by MCIN/AEI/10.13039/501100011033) and "ERDF A way of making Europe".
Data Availability Statement: Data are available using the yfinance library for the OHLC prices and
TALIB to compute the technical indicators. All technical indicator parameters have been kept as the
library defaults.
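As a reproduction aid, the raw OHLC inputs described here can be retrieved roughly as follows; the start date shown is an assumption, since only the 2016–April 2024 evaluation window is stated explicitly.

```python
# Sketch: downloading daily OHLC data for the six indices used in the study.
import yfinance as yf

TICKERS = ["^RUT", "^IXIC", "^GDAXI", "^FTSE", "^N225", "^GSPC"]

data = yf.download(TICKERS, start="2010-01-01", end="2024-04-30", group_by="ticker")
sp500 = data["^GSPC"][["Open", "High", "Low", "Close", "Volume"]]  # one index as example
```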
Conflicts of Interest: The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
RL Reinforcement learning
MLP Multi-layer perceptron
CNN Convolutional Neural Network
EMH Efficient Market Hypothesis
DQN Deep Q-Network
DDPG Deep Deterministic Policy Gradient
PPO Proximal Policy Optimization
SAC Soft Actor–Critic
RSI Relative Strength Index
NATR Normalized Average True Range
AROONOSC Aroon Oscillator
OHLC Open-High-Low-Close
CCI Commodity Channel Index
CMO Chande Momentum Oscillator
MFI Money Flow Index
QP Quadratic Programming
GA Genetic Algorithm
MDP Markov Decision Process
Appendix A. RL Agents
Appendix A.1. DQN
The DQN [24] extends classical Q-learning by employing deep neural networks to
approximate the Q-value function. The DQN estimates the action-value function Q(s, a),
which represents the expected cumulative future reward for taking action a in state s, and
follows the Bellman [48] equation to update Q-values iteratively. A key feature of the DQN
is its use of experience replay and target networks to stabilize training by breaking the
correlation between samples and reducing Q-value variance.
The Q-learning update in the DQN is defined by
$$L(\theta) = \mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right] \tag{A1}$$
where $\theta^{-}$ represents the parameters of the target network, which are periodically updated
from the main Q-network.
While the DQN can achieve human-level performance in many environments, espe-
cially in discrete action spaces, it suffers from overestimation bias due to the maximization
step over noisy Q-value estimates. This overestimation can lead to suboptimal policies,
especially in environments with high variance or noise in the reward signals.
Double Q-learning [38], originally proposed to address the overestimation bias in
classical Q-learning, was later adapted to the DQN. In the Double DQN, the key idea is to
decouple the action selection from the action evaluation to reduce the overestimation bias in
the Q-values. Instead of using the same Q-values for both selecting and evaluating actions,
the Double DQN selects actions using the main Q-network and evaluates them using the
target Q-network. This modification reduces overestimation and improves stability. The
update rule in the Double DQN is modified as follows:
$$y_t = r_t + \gamma\, Q\!\left(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta);\, \theta^{-}\right) \tag{A2}$$
Here, the action is selected based on the online network parameters $\theta$, while the
evaluation is performed using the target network $\theta^{-}$.
In comparison to the vanilla DQN, the Double DQN reduces overestimation substan-
tially, improving policy evaluation and performance, especially in noisy environments.
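The difference between the two targets is easiest to see in code. The sketch below assumes `online_net` and `target_net` are Q-networks mapping a batch of states to per-action values; it illustrates the standard constructions, not the exact implementation used in the study.

```python
import torch

def dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Vanilla DQN vs. Double DQN bootstrap targets (sketch).

    rewards, dones: float tensors of shape (batch,); next_states: (batch, ...).
    """
    with torch.no_grad():
        next_q_target = target_net(next_states)                   # Q(s', .; theta^-)
        # Vanilla DQN: max over the target network's own estimates
        y_vanilla = rewards + gamma * (1 - dones) * next_q_target.max(dim=1).values
        # Double DQN: select a' with the online network, evaluate with the target network
        a_star = online_net(next_states).argmax(dim=1, keepdim=True)
        y_double = rewards + gamma * (1 - dones) * next_q_target.gather(1, a_star).squeeze(1)
    return y_vanilla, y_double
```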
The Dueling DQN [49] architecture further improves upon the standard DQN by
separating the estimation of the state-value function V (s) and the advantage function
A(s, a), which measures the relative benefit of an action in a given state compared to the
average action. The key advantage of this separation is that the value function can be
learned more effectively, even in states where the choice of action has little impact on
the reward.
The Q-value function is decomposed as
$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right) \tag{A3}$$
This decomposition allows the agent to learn state values without needing to evaluate
the impact of every possible action. This is particularly useful in environments where many
actions lead to similar outcomes. The Dueling DQN often outperforms the standard DQN
and Double DQN in terms of both learning speed and final performance.
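In code, the aggregation of Equation (A3) is a single line. The sketch below assumes separate value and advantage heads producing tensors of shape (batch, 1) and (batch, n_actions).

```python
def dueling_q_values(value, advantage):
    """Combine V(s) and A(s, a) as in Equation (A3).

    value:     tensor of shape (batch, 1)
    advantage: tensor of shape (batch, n_actions)
    """
    return value + advantage - advantage.mean(dim=1, keepdim=True)
```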
The algorithm uses two networks: an actor, which learns a policy that maps states
to actions, and a critic, which evaluates the action-value function Q(s, a). The critic’s
objective is to maximize the expected return, while the actor optimizes the policy based
on feedback from the critic. The critic learns the action-value function Q(s, a) using the
Bellman equation:
$$y_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \mu'(s_{t+1} | \theta^{\mu'}) \,\big|\, \theta^{Q'}\right) \tag{A4}$$
Similar to the DQN, DDPG employs a replay buffer to break the correlation between
consecutive transitions, enhancing training stability and efficiency. Both the actor and critic
networks have corresponding target networks, Q′ and µ′ , which are slowly updated to
provide stable target values during training:
$$\theta' \leftarrow \tau \theta + (1 - \tau)\theta' \tag{A6}$$
Since the actor outputs deterministic actions, exploration is induced by adding noise
to the action selection process. DDPG typically uses an Ornstein–Uhlenbeck noise process for
temporally correlated exploration; this noise helps in exploring continuous action spaces
more effectively than simple uncorrelated Gaussian noise.
$$a_t = \mu(s_t | \theta^{\mu}) + \mathcal{N}_t \tag{A7}$$
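A compact sketch of the soft target update (Equation (A6)) and the Ornstein–Uhlenbeck exploration noise (Equation (A7)); the τ, θ, and σ values shown are illustrative defaults, not the study's settings.

```python
import numpy as np
import torch

def soft_update(target_net, online_net, tau: float = 0.005):
    """theta' <- tau * theta + (1 - tau) * theta'  (Equation (A6))."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), online_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise N_t added to the actor's action (Equation (A7))."""
    def __init__(self, size: int, mu: float = 0.0, theta: float = 0.15, sigma: float = 0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.full(size, mu)

    def sample(self) -> np.ndarray:
        self.x += self.theta * (self.mu - self.x) + self.sigma * np.random.randn(len(self.x))
        return self.x.copy()
```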
DDPG’s key advantage is its efficiency in continuous action spaces, directly outputting
continuous actions without requiring discretization. This makes it ideal for tasks requir-
ing precise control, such as real-time portfolio weight adjustments. Additionally, DDPG
incorporates target networks for both the actor and critic, which significantly improves
the stability of learning, especially in environments with high variance. Another strength
of DDPG is that it is an off-policy algorithm, meaning it can learn from past experiences
stored in the replay buffer, thus making better use of data and reducing the variance in
policy updates.
However, DDPG also presents some disadvantages, especially in terms of exploration.
Since it outputs deterministic actions, DDPG relies heavily on adding noise to ensure
sufficient exploration. If the exploration is not well tuned, this can lead to suboptimal
policies, particularly in highly stochastic environments such as financial markets with
unpredictable price movements. Moreover, DDPG is sensitive to hyperparameter settings,
such as the learning rates for the actor and critic, the type of noise process used, and the
size of the replay buffer. Poor tuning of these parameters can result in slow learning or, in
some cases, divergence of the model.
The core update in PPO uses a clipped surrogate objective, which is designed to limit
policy updates without requiring the complexity of TRPO’s constraints. The surrogate
objective is defined as
where H(π (·|st )) is the entropy of the policy at state st , and α is a temperature parameter
that controls the trade-off between reward maximization and entropy maximization. SAC
maintains two Q-value functions Q(s, a), a value function V (s), and a policy π ( a|s). The
update for each function is based on the following.
The Q-function is updated using soft Bellman backups:
And, lastly, the policy is updated by minimizing the KL divergence between the policy
and a Boltzmann distribution of the Q-function:
$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[ \mathbb{E}_{a_t \sim \pi_{\phi}}\left[ \alpha \log \pi_{\phi}(a_t | s_t) - Q_{\theta}(s_t, a_t) \right] \right] \tag{A12}$$
SAC’s most significant advantage is its ability to balance exploration and exploitation
through the entropy term. By promoting exploration, SAC avoids the risk of prematurely
converging to suboptimal policies, a common issue in deterministic approaches like DDPG.
SAC is also off-policy, meaning it can reuse past experiences stored in a replay buffer,
making it sample-efficient compared to on-policy methods. This efficiency is critical in
data-scarce environments like financial markets. Furthermore, SAC leverages a stochastic
policy, which allows for better exploration and smoother learning compared to algorithms
that rely on deterministic policies.
SAC’s reliance on entropy maximization introduces a trade-off: while promoting
exploration, the algorithm can occasionally over-explore, leading to slower convergence,
especially if the temperature parameter α is not well tuned. Additionally, SAC can be
more computationally intensive than simpler policy gradient methods due to the need for
maintaining and updating multiple value functions and a policy network. The performance
of SAC is also sensitive to the choice of hyperparameters, particularly the temperature α,
which must be carefully adjusted for each environment to achieve an appropriate balance
between exploration and exploitation.
to uncover robust strategies that perform well across a variety of market scenarios. Its
off-policy nature further enhances its efficiency by allowing the algorithm to learn from
historical market data, making it particularly advantageous in environments with limited
access to real-time data [27,32].
The ATR is the moving average of the True Range over N periods (typically 14):
The NATR is calculated by dividing the ATR by the current closing price:
$$NATR = \frac{ATR}{Close_t} \times 100 \tag{A15}$$
$$CMO = \frac{\sum(\text{Gains}) - \sum(\text{Losses})}{\sum(\text{Gains}) + \sum(\text{Losses})} \times 100 \tag{A19}$$
References
1. Markowitz, H. Portfolio Selection. J. Financ. 1952, 7, 77–91.
2. Benhamou, E.; Saltiel, D.; Ungari, S.; Mukhopadhyay, A. Bridging the gap between Markowitz planning and deep reinforcement
learning. arXiv 2020, arXiv:2010.09108. [CrossRef]
3. Halperin, I.; Liu, J.; Zhang, X. Combining Reinforcement Learning and Inverse Reinforcement Learning for Asset Allocation
Recommendations. arXiv 2022, arXiv:2201.01874. [CrossRef]
4. Markowitz, H. The optimization of a quadratic function subject to linear constraints. Nav. Res. Logist. Q. 1956, 3, 111–133.
[CrossRef]
5. Benhamou, E.; Saltiel, D.; Ohana, J.; Atif, J.; Laraki, R. Deep Reinforcement Learning (DRL) for Portfolio Allocation. In Machine
Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track; Springer: Cham, Switzerland, 2021; pp. 527–531.
[CrossRef]
6. Merton, R.C.; Samuelson, P.A. Fallacy of the log-normal approximation to optimal portfolio decision-making over many periods.
J. Financ. Econ. 1974, 1, 67–94. [CrossRef]
7. Odermatt, L.; Beqiraj, J.; Osterrieder, J. Deep Reinforcement Learning for Finance and the Efficient Market Hypothesis. SSRN
Electron. J. 2021. [CrossRef]
8. Li, G. Enhancing Portfolio Performances through LSTM and Covariance Shrinkage. Adv. Econ. Manag. Political Sci. 2023,
26, 187–198. [CrossRef]
9. Hochreiter, S.; Schmidhuber, J. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
10. Sen, J.; Dasgupta, S. Portfolio Optimization: A Comparative Study. arXiv 2023, arXiv:2307.05048. [CrossRef]
11. Pun, C.S.; Wong, H.Y. Robust investment–reinsurance optimization with multiscale stochastic volatility. Insur. Math. Econ. 2015,
62, 245–256. [CrossRef]
12. Supandi, E.; Rosadi, D.; Abdurakhman, A. Improved robust portfolio optimization. Malays. J. Math. Sci. 2017, 11, 239–260.
13. Liu, X.Y.; Xiong, Z.; Zhong, S.; Yang, H.; Walid, A. Practical Deep Reinforcement Learning Approach for Stock Trading. arXiv
2022, arXiv:1811.07522. [CrossRef]
14. Wang, Z.; Jin, S.; Li, W. Research on Portfolio Optimization Based on Deep Reinforcement Learning. In Proceedings of the
2022 4th International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), Shanghai, China, 28–30
October 2022; pp. 391–395. [CrossRef]
15. Yang, S. Deep reinforcement learning for portfolio management. Knowl.-Based Syst. 2023, 278, 110905. [CrossRef]
16. Wei, L.; Weiwei, Z. Research on Portfolio Optimization Models Using Deep Deterministic Policy Gradient. In Proceedings of the
2020 International Conference on Robots & Intelligent System (ICRIS), Sanya, China, 7–8 November 2020; pp. 698–701. [CrossRef]
17. Harnpadungkij, T.; Chaisangmongkon, W.; Phunchongharn, P. Risk-Sensitive Portfolio Management by using Distributional
Reinforcement Learning. In Proceedings of the 2019 IEEE 10th International Conference on Awareness Science and Technology
(iCAST), Morioka, Japan, 23–25 October 2019; pp. 1–6. [CrossRef]
18. Hakansson, N.H. Multi-Period Mean-Variance Analysis: Toward A General Theory of Portfolio Choice. J. Financ. 1971,
26, 857–884.
19. Sefiane, S.; Benbouziane, M. Portfolio Selection Using Genetic Algorithm. J. Appl. Financ. Bank. 2012, 2, 143–154.
20. Hochreiter, R. An Evolutionary Optimization Approach to Risk Parity Portfolio Selection. In Applications of Evolutionary
Computation; Springer International Publishing: Cham, Switzerland, 2015; pp. 279–288. [CrossRef]
21. Cen, S.; Cheng, C.; Chen, Y.; Wei, Y.; Chi, Y. Fast Global Convergence of Natural Policy Gradient Methods with Entropy
Regularization. arXiv 2021, arXiv:2007.06558. [CrossRef]
22. LeCun, Y.; Bengio, Y. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural
Networks; MIT Press: Cambridge, MA, USA, 1998; pp. 255–258.
23. Jiang, Z.; Xu, D.; Liang, J. A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. arXiv
2017, arXiv:1706.10059. [CrossRef]
24. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski,
G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef]
25. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep
reinforcement learning. arXiv 2019, arXiv:1509.02971. [CrossRef]
26. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017,
arXiv:1707.06347. [CrossRef]
27. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with
a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [CrossRef]
28. Pendharkar, P.; Cusatis, P. Trading Financial Indices with Reinforcement Learning Agents. Expert Syst. Appl. 2018, 103, 1–13.
[CrossRef]
29. Yu, P.; Lee, J.S.; Kulyatin, I.; Shi, Z.; Dasgupta, S. Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization.
arXiv 2019, arXiv:1901.08740. [CrossRef]
30. Lattimore, T.; Hutter, M. PAC Bounds for Discounted MDPs. arXiv 2012, arXiv:1202.3890. [CrossRef]
31. Longstaff, F.; Schwartz, E. Valuing American Options by Simulation: A Simple Least-Squares Approach. Rev. Financ. Stud. 2001,
14, 113–147. [CrossRef]
32. Soleymani, F.; Paquet, E. Financial Portfolio Optimization with Online Deep Reinforcement Learning and Restricted Stacked
Autoencoder—DeepBreath. Expert Syst. Appl. 2020, 156, 113456. [CrossRef]
33. Fama, E.F. Efficient Capital Markets: A Review of Theory and Empirical Work. J. Financ. 1970, 2, 383–417. [CrossRef]
34. Carver, R. Systematic Trading: A Unique New Method for Designing Trading and Investing Systems; EBL-Schweitzer, Harriman House:
Hampshire, UK, 2015; p. 48.
35. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989,
2, 359–366. [CrossRef]
36. Shi, Z.; Wei, J.; Liang, Y. A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage
over Fixed Features. arXiv 2022, arXiv:2206.01717. [CrossRef]
37. Yang, G.; Hu, E.J. Feature Learning in Infinite-Width Neural Networks. arXiv 2022, arXiv:2011.14522. [CrossRef]
38. van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-learning. arXiv 2015, arXiv:1509.06461.
[CrossRef]
39. de Prado, M.L. Advances in Financial Machine Learning, 1st ed.; Wiley Publishing: Hoboken, NJ, USA, 2018; Chapter 2.
40. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning
Implementations. J. Mach. Learn. Res. 2021, 22, 1–8.
41. Vigna, E. On Time Consistency for Mean-Variance Portfolio Selection; Carlo Alberto Notebooks 476; Collegio Carlo Alberto: Torino,
Italy, 2016.
42. Espiga, F. Portfolio Optimization. 2024. Available online: https://figshare.com/collections/PORTFOLIO_OPTIMIZATION/7467934 (accessed on 8 December 2024).
43. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016,
arXiv:1606.01540. [CrossRef]
44. Towers, M.; Terry, J.K.; Kwiatkowski, A.; Balis, J.U.; de Cola, G.; Deleu, T.; Goulão, M.; Kallinteris, A.; KG, A.; Krimmel, M.; et al.
Gymnasium. 2023. Available online: https://zenodo.org/records/8127026 (accessed on 8 December 2024).
45. Sharpe, W. Mutual Fund Performance. J. Bus. 1965, 39, 119–138. [CrossRef]
46. Sortino, F.A.; van der Meer, R. Downside Risk. J. Portf. Manag. 1991, 17, 27–31. [CrossRef]
47. Young, T. Calmar ratio: A smoother tool. Futures 1991, 20, 40–41.
48. Bellman, R. Dynamic Programming; Dover Publications: Mineola, NY, USA, 1957.
49. Wang, Z.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; de Freitas, N. Dueling Network Architectures for Deep Reinforce-
ment Learning. arXiv 2016, arXiv:1511.06581. [CrossRef]
50. Fujimoto, S.; Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. arXiv 2018, arXiv:1802.09477.
[CrossRef]
51. Gao, Z.; Gao, Y.; Hu, Y.; Jiang, Z.; Su, J. Application of Deep Q-Network in Portfolio Management. arXiv 2020, arXiv:2003.06365.
[CrossRef]
52. Wilder, J. New Concepts in Technical Trading Systems; Trend Research: Abu Dhabi, United Arab Emirates, 1978; pp. 21–23. 63–70.
53. Chande, T. The Time Price Oscillator. 1995. Available online: https://store.traders.com/-v13-c09-thetime-pdf.html?srsltid=
AfmBOooJTN6RVoFN_XXqhHegrxp0fXdJP3cTxfuPJ2j_4Uw2-oHy9SAG (accessed on 8 December 2024).
54. Lambert, D. Commodity Channel Index: Tool for Trading Cyclic Trends. 1982. Available online: https://store.traders.
com/-v01-c05-comm-pdf.html?srsltid=AfmBOoqNzNIaJtdSg7OwR3FazvfXfbfpjZQzcRaXYCvwpPH98gGv9B2M (accessed on
8 December 2024).
55. Chande, T.; Kroll, S. The New Technical Trader: Boost Your Profit by Plugging into the Latest Indicators; Wiley Finance, Wiley: Hoboken,
NJ, USA, 1994; pp. 94–118.
56. Quong, G.; Soudack, A. Volume-Weighted RSI: Money Flow. 1989. Available online: https://store.traders.com/-v07-
c03-volumew-pdf.html?srsltid=AfmBOortDQy0rOdmnMXX-QVdXorrh_q0h4DZt3HgP-8seBHmcA76z8CV (accessed on
8 December 2024).
57. Williams, L. How I Made One Million Dollars. . . Last Year. . . Trading Commodities; Windsor Books: Brightwaters, NY, USA, 1979;
pp. 94–118.
58. Lane, G.C. Lane’s Stochastics. 1984. Available online: https://store.traders.com/-v02-c03-lane-pdf.html?srsltid=AfmBOoqpO8
0_Wrs0Lx1ezAu1T3OEIb8X0Ztkx5OJsakotBnc-Lm4sLqg (accessed on 8 December 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.