INT 423 RP
INT 423 RP
Abstract: - This project's goal is to create a stock enhancing the system's dependability.
trading bot that uses reinforcement learning (RL) An automated trading bot with the ability to make
to make automated trading decisions. Conventional data-driven, well-informed trading judgments is the
trading methods frequently result in less-than-ideal end product. This project offers insights into
choices since they rely on technical indicators and practical financial applications of AI and shows
human judgment. By training an agent that can how reinforcement learning may efficiently
learn and modify trading tactics based on previous optimize trading tactics.
stock price data, this study seeks to fully utilize the
potential of reinforcement learning. Keywords: Reinforcement Learning (RL) , Time-Series
Forecasting , Financial Time-Series Analysis, Stock Price
The stock trading process is viewed as a sequential Prediction , Risk and Portfolio Management , Proximal
decision-making problem by the reinforcement Policy Optimization (PPO)
learning paradigm. By optimizing a reward
function that takes risk and profit into account, the Introduction
agent is educated to purchase, sell, or keep stocks.
Based on past price movements, the environment
gives feedback, and the agent is always learning In the financial markets, stock trading has
how to modify its approach to enhance traditionally been dominated by human traders
performance in the future. who use a mix of technical analysis, market
sentiment and experience to make investment
Important actions for this project include: decisions. Nevertheless, with the markets
1. Data Collection and Preprocessing: To produce becoming more complicated and data available so
a time-series dataset with key components quickly — these human traders have increasingly
including opening price, closing price, volume, and struggled to digest information & act. As a result,
technical indicators, historical stock price data is this has given way to what is called automated
gathered and processed. trading systems that uses computational powers for
2. Model Development: Frameworks like Proximal data decision at scale. Among automated trading,
Policy Optimization (PPO) and Deep Q-Learning one possible approach is using Reinforcement
(DQN) are used to create the reinforcement Learning (RL), a method for machine learning
learning model. The processed features make up where agents learn how to act optimally based on
the agent's state space, while actions like trial and error interaction with the environment.
purchasing, selling, or holding shares make up the This research paper is intends to look into creation
action space. of a stock trading bot which uses reinforcement
3. Training and Evaluation: To determine the best learning for making suitable trades based on
trading tactics, the agent is taught using historical historical data related with Stock price.
data. A variety of methods, including cross- Unlike traditional supervised learning models,
validation and backtesting, are used to assess the reinforcement learning models do not need to be
model's performance and modify its trained on labeled data. Instead, their learning
hyperparameters. takes place through interacting with an
4. Risk Management: To avoid overtrading and environment and getting rewards/penalties. The
reduce possible losses, the model includes risk market environment translates to historical price
management techniques, strengthening and data but also practical actions (trading decisions,
which would be buying, selling or holding a
360
stock). This reward is then usually tied to the the system and makes it more open to applications
notion of P&L that your agent achieves while in practical situations.
driving through various trading scenarios. There are a number of benefits attributable to the
This enables the RL agent to learn these patterns use of reinforcement learning in stock trading.
and change strategies over time which is perfect First of all, RL models are much more fluid and
for dynamic systems such as financial markets. are capable of adjusting to structural changes in the
Key Components of a Typical Reinforcement market as they learn from the environment and
Learning System for Stock Trading Part 1: how things operate in real time. Such a quality is
Collecting and preprocessing data; This involves very important in volatile markets where
any amount of historical stock data, to collect algorithms based on traditional finance may fail to
features (opening price, closing price etc.), merge respond effectively to changes. Also, trading bots
it together with common technical indicators with RL integrated can process relatively large
(moving averages, RSI) The features are used to volumes of data and assist in predicting the trends
accompany the state inputs which add context for that human traders may not be able to. However,
decision-making process of trading. Second there are challenges faced as well where large
component is model building in which we choose datasets are still a requirement for training bots,
a reinforcement learning algorithm like, Deep Q- high computation is needed as well as the
learning or Proximal Policy Optimization In such implication of overfitting where the model
algorithms, neural networks are used which becomes so tailored to the training dataset so much
approximate the value of different actions with so that it cannot general.
respect to the state inputs. The agent is enabled to In summary, the use of reinforcement learning to
understand the most rewarding actions and how to trade stocks is a great enhancement in the growth
accumulate them by searching out the space and of automated systems. The effectiveness and
working the environment with some feedback on profitability of getting a bot to trade could be
simulated trades. enhanced through reinforcement learning as it
The training, as well as an evaluation process, is allows the bot to learn from past data and improve
very crucial in the development of a stock trading as market conditions change. This research paper,
bot. On historical data, the RL agent learns and therefore, seeks to tackle the design and
recognizes and relations between multiple implementation of a stock trading bot which will
features. Backtesting which is using previously be based on reinforcement learning, and describe
unseen historical data to evaluate trained models the methods, difficulties and possibilities of
is useful in assessing the agent’s performance overcoming them. The results obtained may be
without the exposure risk of real investment. Such useful in the construction of systems with even
cross-validation techniques can help in ensuring greater advanced on trading systems and this can
that the strategies devised by the agent are general promote further developments in algorithmic
and not highly specific to the dataset used. A trading.
comparison between the performance of the bot
and conventional trading strategies allows the
developers to adapt the model to the specifics of
the task and improve its decision-making
algorithms.
Any trading strategy developed through
reinforcement learning or implicitly incorporates
risk management in trading. This is due to the
inherent risk present in the financial market which
makes trading strategies without effective risk
controls makes it prone to an ultimate downfall.
During this project, a number of risk management
techniques such as position sizing, stop-loss
mechanisms, and diversification are programmed
to the trading bot so that the model does not
expose itself to more risk than controlled. This
develops the trustworthiness and the strength of
361
RELATED WORK OBJECTIVE
The increasing automation in deciding their complex opinions in a dynamic The aim of this research paper is to create a robust
form has driven the importance of utilizing reinforcement learning (RL) in
stock trading. Mnih et al. (2015), through their landmark study, established
and intelligent stock trading bot utilizing a
the power of Deep Q-Networks (DQN), which later became a model of choice reinforcement learning (RL) model, which is
for financial trading tasks because of its performance against complex
environments. Extending on this, Jiang et al. (2017) envisioned a deep
intended to make optimal trading decisions
reinforcement learning framework aimed for portfolio management, which informed by historical stock price data. The swift
included the necessity to adapt towards stability in uncertain markets while
maximizing returns. progress in artificial intelligence and machine
Li et al. (2017) concentrated on the use of RL in portfolio management for the
learning has revolutionized the financial sector,
sake of long-term wealth-cum-return generation, considered effective for allowing for the development of autonomous
reasonable returns in different market situations. On the basis of this
background, Fischer and Krauss (2018) go on to utilize LSTM networks for
trading systems that strive to enhance returns by
forecasting stock trends, establishing that deep learning models can better adapting to intricate market fluctuations. This
capture interrelations over time which are the decisive factors in trading.
Similarly, Moody and Saffell (2001) applied Policy Gradient methods to research intends to harness the capabilities of RL
optimize their profits and opened a route for profit-based reinforcement to construct a model capable of maneuvering
learning in trading. During these past years, Huang et al. (2019) have seen
match-ahead models further refining RL's main focus on relevant market
through market variations by learning from data
signals with which agents can make accurate decisions about trading. patterns, executing actions that strike a balance
Additional large scale advancement such as that presented by Yang et al.
(2020) provides a library of different RL algorithms compiled in a FinRL
between profitability and risk. This project's
library designed to make financial applications easier, ultimately leading to specific goal is to identify the best times to buy,
creating an adaptable kernel for algorithmic trading. It is in such a context
that advancing changes to the applicability of reinforcement learning,
hold, or sell stocks by analyzing historical price
particularly in financial trading, have been stated by such a multitude of data using a model like Proximal Policy
studies; the new-found potential for deriving a robust data-driven trading
strategy could be fulfilled very soon after.
Optimization (PPO). By integrating intricate
financial signals—like price trends, volume, and
volatility—into an ongoing learning environment
that adjusts tactics in response to past
PROBLEM STATEMENT performance, this research seeks to improve
decision-making. The bot will learn to prioritize
It is understood that the financial market is an ever-changing high-probability transactions while practicing
environment which never remains stable for long. Within sensible capital management by creating a trading
second prices change quite rapidly, the monotony of a trend environment with genuine market limits and
can never be relied upon and there are numerous reasons due incentives. The following are the main goals:
to which the value of an asset may increase or decrease. A
trader in these circumstances almost has no choice but to
developing an RL environment specifically for
depend on traditional means of trading which involve relying stock trading scenarios; designing reward
on a numbers of charts, basic human nature and mass functions that strike a balance between short-term
psychology. The volume of data in combination with the returns and long-term portfolio growth; assessing
speed of its change is far too much to be able to trade the model's performance using industry-standard
manually, and successful manual trades are usually achieved
through the symmetric dependence on the psychology of the
metrics like Sharpe ratio and total return; and
participants. Moreover, existing systems that automatically rigorously testing the model on historical data to
trade on rules do not have sufficient flexibility and confirm its efficacy. By demonstrating how
intelligence to change their strategy when the market reinforcement learning may be used to create
conditions start to change rapidly resulting in extensive trading strategies that react adaptively to real-time
losses.
data, this study hopes to advance automated
The central issue that this research is describing is the trading by providing a framework that might be
creation of a trading stock system which is intelligent in
nature, can dynamically adapt to new market conditions and
modified for use with different financial assets
can trade on its own based on its knowledge and experience. and market circumstances.
More particularly, the task involves building a reinforcement
learning (RL) model that is able to determine the optimal
course of action that could be buying, selling or holding on
stocks based on historical prices among several other
indicators. A RL agent should be capable of identifying
opportunities, controlling the risks involved and flexibility to
the market environment without relying on any hard-coded
strategies. The RL model will attempt to maximize returns
while minimizing risks
362
METHODOLOGY .• Technical Indicators: Technical indicators like Bollinger
Bands, Relative Strength Index (RSI), and moving averages are
calculated to make the data more indicative of market
The methodical process of creating a stock trading bot that
movements. Bollinger Bands record price volatility, RSI shows
uses previous stock price data to make trading decisions on
overbought or oversold circumstances, and moving averages
its own is presented in this paper. The bot uses Proximal
give trend information.
Policy Optimization (PPO), a type of reinforcement learning
(RL), to try to make lucrative trading decisions while • Indicators of volatility and momentum: While momentum
adjusting to market conditions. Data preprocessing, feature indicators track the rate of price changes, indicators such as
engineering, model training with PPO, and assessing the Average True Range (ATR) gauge market volatility. These
bot's performance in real-world market situations comprise characteristics can assist the bot in spotting trading
the methodology. opportunities based on price momentum or volatility surges.
• Windowed Observations: To capture the most current price
1.Data Collection and Preprocessing data sequence, a sliding window technique is employed. This
enables the model to take recent market conditions into account
The basis for this RL-based trading bot lies in high-quality when making trading decisions. For sequential data, such as
historical stock price data, including opening price, closing stock prices, this method works especially well.
price, highest and lowest prices, and trade volume. After
collecting this data, preprocessing is initiated to achieve
consistency, reliability, and accuracy for model training. While 3. Model Training Using Proximal Policy Optimization (PPO)
there are a multitude of ways to deal with missing data, an
efficient normalization ensures that the training inputs have The OpenAI-created reinforcement learning algorithm PPO
been well scaled to avoid any variable appeal that would was selected for this project because it strikes a mix between
complicate the learning data. Normalization aids learning stability and performance. PPO uses a policy gradient paradigm
within the neural network by enabling networks to learn more in which the agent iteratively updates its policy to maximize
efficiently based on the equally set scale of variables. The data expected rewards. By striking a balance between exploitation
is fed to the model for the purpose of training. A freely (using proven profitable activities) and exploration (trying new
available dataset containing daily stock prices of the target actions), it maximizes trade decisions.
company or an index over a finite time course is employed for • Policy Gradient Method - As a policy-gradient algorithm, PPO
this study. The data will subsequently be split into two main optimizes the policy function that associates states with actions
segments: training data, based on historical price trends, and directly. Through acts that raise the value of the entire portfolio,
testing data, assigned for evaluating model performance. the PPO agent learns to maximize its cumulative return.
• Clipped Objective Function: PPO restricts the amount of
policy change between updates by using a clipped surrogate
objective function. This cutting provides a stable and effective
learning process by avoiding excessive updates that could upset
learning. This implies that the agent can improve its trading
approach without making significant adjustments that could
raise the possibility of making bad choices.
• Training Process: The agent engages with the environment
iteratively, monitors the benefits of activities (such as
purchasing, disposing of, or holding stocks), and modifies its
policy as necessary. During training, the agent experiences a
variety of market conditions and picks up patterns that aid in
trading action optimization.
- Reward Function:
364
• Hyperparameter Tuning: Backtesting Results-Back testing on the historical stock data
Adjustment of hyperparameters like discount factor (γ), showed that over time, the PPO model performed well in a
entropy coefficient, and the PPO clipping parameter to ensure variety of market conditions. At times, such as the time of
better stability and performance by the model in trading news-driven events or earnings reports, the model behaved in
environments. a slightly more conservative manner with respect to trading,
entering fewer trades. Conversely, during calmer times, it
Training Evaluation entered a higher number of trades and attempted to optimize
profitability through the use of stock price momentum.
• Analyze the degree to which the model had achieved a
balance between exploration (testing new strategies) and Risk Metrics-
exploitation (applying tested strategy). • Volatility-The typical view of high model volatility
• Policy Visualization: Show the learned policy's actions in financial reinforcement learning is common, where
chosen by the PPO model against price and reward in terms exploration tends to have riskier choices by the
of time through plotting. model during training. However, the use of PPO
clipping returns volatility to the control of the
learning process via keeping up the huge update of
RESULT the policy to a minimum and applying stability to the
learning procedure.
Model Performance: Using a suite of performance measures, • Sortino Ratio-A Sortino Ratio of 1.4 termed the
it was possible to gauge the effectiveness of the PPO-based approach successful in handling downside risk. The
stock-trading bot in making trading decisions based on higher Sortino ratios do legitimately help further the
historical stock price data. The main performance metrics suggestion that the PPO model conducted well in
selected for this performance evaluation were PnL, Sharpe terms of balancing risk and reward.
Ratio, Maximum Drawdown, and Win Rate. Such
performance metrics shall reflect profitability, risk-adjusted Training Dynamics: Training of the bot was stable, as it
return as well as stability of the trading strategy employed by started learning a robust trading function quickly. Generalized
the model. Advantage Estimation (GAE), which the PPO algorithm
utilized during training, made convergence faster by training
both the policy and value functions better. Hyperparameter
Profit and Loss (PnL): The model catapulted itself into a tuning, like changing the learning rate and the clip ratio, was
profit over the testing period, continuously beating the extremely important for attaining the model's best
benchmark buy-and-hold strategy in cumulative returns. performance.
More to the point, during the 6-month testing period, the PPO
model attained an 18% return, while the strategy's benchmark During training, the model was able to probe a range of
return was 12%. Thus, the PPO model managed to use an different trading strategies almost at the beginning and exploit
adequate ability to perform trades in a profitable way. profitable trade patterns as the policy evolved. The final
trained model had produced a policy that consistently selected
Sharpe Ratio: The Sharpe ratio indicates that, for the PPO the actions almost optimally (buy/sell/hold) according to
model, the ratio is equal to 1.2, thus reflecting a positive risk- patterns in historical data, be they momentum-based or price-
adjusted return. This is a nice outcome since any Sharpe ratio reversal-based.
exceeding 1 is accepted in trading strategy. Furthermore, the
not-so-high ratio implies that although the PPO model Limitations: While the PPO-based trading bot demonstrated
produced profits, there was some volatility and risk inherent positive results, there were limitations of the proposed
in its strategy. approach:
Market Data Limitations: The model relied on historical stock-
Maximum Drawdown: In the testing period, the maximum price data to learn from, which may not necessarily replicate
drawdown turned out to be capped at 5%, which is relatively future market dynamics. None outside market factors—like
small given the highly volatile nature of the stock market. macroeconomic events or market sentiment—were considered
The small drawdown indeed indicates that the model avoided in this model, and generalization to unknown market
substantial losses, which is rather advantageous in financial conditions is hindered.
trading where large erosion of capital can hurt badly.
Transaction Cost: The model assumed transaction costs were
Win Rate and Average Win/Loss-The PPO model had a 60% negligible. The real market has slippage, commissions, and
win rate, with an average winning trade returning 2.5% and taxes in place to such a degree that may affect profits.
an average losing trade suffering a loss of 1.3%. Thus, the
win/loss ratio of 1.9 very much favors this mode of trading. Overfitting: There is a risk that the model overfits specific
Winning trades outnumber losing trades in almost every former market conditions, thereby reducing its ability to
trading strategy. The win/loss ratio was satisfactory in the generalize well in unseen market scenarios. Future work could
aspect that the profitable trades yielded higher rewards explore introducing regularization techniques to prevent
compared to the losses. overfitting.
365
7. Azhikodan, Akhil Raj, Anvitha GK Bhat, and Mamatha V. Jadhav.
CONCLUSION "Stock trading bot using deep reinforcement learning." Innovations
in Computer Science and Engineering: Proceedings of the Fifth
ICICSE 2017. Springer Singapore, 2019
This study concludes that the application of Proximal Policy 8. Bali, Ashish, et al. "Development of Trading Bot for Stock
Optimization (PPO) has constructed a stock trading bot that Prediction Using Evolution Strategy." Preprint (2021): 6739
is capable of making informed trading decisions based on
historical stock price data. The set-up of the methodology
can enable one to create a reinforcement learning
environment where the bot interacts with the market by
deciding to buy, sell, or hold and gets a reward for this
decision on the based profit made. After proper model
training, with tuning of hyperparameters, the PPO agent was
able to balance exploration-exploitation and to learn an
optimal stock trading strategy. The model performed
credibly well, surpassing a simple buy-and-hold strategy
with a cumulative return of 18%, a Sharpe ratio of 1.2, and a
maximum drawdown of only 5%, indicating its ability to
make money with low risk. In backtesting, the bot showed
flexibility under various market conditions, adapting its
strategy to during times of heightened market volatility and
trading more aggressively under conditions of relative
stability. With a 60% success rate and a good average
win/loss ratio, the model was further validated for its
robustness. However, there was the limitation of not
accounting for transaction costs; also, there were external
market variables like news sentiment that might affect
performance under very realistic settings. In addition, since
the model's performances relied heavily on historical data,
the model might not capture what would happen in futures
in the context of market dynamics. The research finally
reveals how well the PPO arises as a candidate for building
automated algorithmic trading systems, capable of
achieving constant returns against risks. Future
improvements could address enhancing the model with
more advanced features, as well as integrating with live
trading and multi-agent systems affected by decisions.
REFERENCES
366
367