Research Project Stelios Gavriel
Stylianos Gavriel
University of Twente
P.O. Box 26, 7523SB Enschede
The Netherlands
s.gavriel@student.utwente.nl
In the remainder of this paper, Section 2 provides background on the stock market and the recurrent neural networks used in this research. Section 3 discusses related work, Section 4 explains the approach of the research, Section 5 analyses the data used, and Section 6 presents the experiments with the LSTM variants. Finally, Section 7 presents the discussion and future work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
34th Twente Student Conference on IT, Jun. 29th, 2021, Enschede, The Netherlands.
Copyright 2021, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.
2. BACKGROUND
In order to perform research in the field of predicting stock prices, it is important to understand the features found in the market and the machine learning techniques that will be used. This section first gives an indication of quantitative data about the S&P 500, and then elaborates on recurrent neural networks.

2.1 Stock Market
The S&P 500 is a stock market index which measures the performance of the five hundred largest companies in the United States; such companies include Apple Inc., Microsoft, Amazon.com and many more. A share is characterized by a price which is available on the S&P 500 index [1]. Stock markets usually open during weekdays at nine-thirty a.m. and close at four p.m. Eastern time. Many data-sets used for the prediction of prices include features such as the Open, Close, High, Low, Adjusted Closing price and Volume [17]. High and Low refer to the maximum and minimum prices of a given stock during a day, respectively. Adjusted Closing refers to the closing price after taking any corporate actions into account, which can differ from the raw closing price. Finally, Volume characterises the number of stocks sold and bought each day. Earnings per share (EPS) is an important measure which indicates how profitable a company is [14]. The price-to-earnings ratio (P/E) refers to the ratio of the current stock price to the EPS [13].

2.2 Recurrent Neural Networks
Recurrent Neural Networks (RNN) are a class of neural networks specifically designed to handle sequential data. There are two types of RNNs, discrete-time RNNs and continuous-time RNNs [35]. They are designed with a cyclic connection architecture, which allows them to update their current state given the previous states and the current input data. RNNs usually consist of standard recurrent cells and are known to perform well on sequential problems. They are specialized in processing a sequence of values χ1..χn, such as time-series data, where n is the total number of features and χ are the features. Scaling to images with large width and height, and processing images of variable size, is also feasible to a large extent. Furthermore, most RNNs are capable of processing sequences of variable length. However, RNNs lack the ability to learn long-term dependencies, as illustrated in research conducted by Bengio et al. [8]. Therefore, in order to handle these long-term dependencies, in 1997 Hochreiter and Schmidhuber proposed a solution called Long Short-Term Memory (LSTM) [16].

3. RELATED WORK
Since the advent of artificial intelligence, many have attempted to combine deep learning and machine learning using core principles. Artificial intelligence methods include the convolutional neural network, multi-layer perceptron, naive Bayes network, back propagation network, recurrent neural network, single-layer LSTM, support vector machine and many more [12]. A study in 2018 by Siami-Namini et al. [30] compared LSTM with ARIMA [4], a model used for analysing time-series data. This study focused on implementing and applying financial data to an LSTM, which was superior to the ARIMA model. Further, a study by Khaled A. Althelaya et al. in 2018 [5] evaluated the performance of bidirectional and stacked LSTMs for stock market prediction. The performance of the tuned models was also compared with a shallow and a unidirectional LSTM. The study concluded that the bidirectional and stacked LSTMs had better performance for short-term prices as opposed to the long-term prediction results. Further, the results showed that deep architectures outperformed their shallow counterparts. Another example is a study by Roondiwala et al. [27], which attempted to create a model based on LSTM for an accurate read of future market indices. The researchers further analysed various hyperparameters of their model, mainly the number of epochs used with various variable combinations of the market. They concluded that using multiple variables (High/Low/Open/Close) resulted in the lowest errors. Later on, in 2020, Hao and Gao [34] proposed a hybrid model based on LSTM and multiple time scale feature learning and compared it to other existing models. The study also compared models based on single time scale feature learning. Furthermore, a design was made to combine the output representations from three LSTM-based architectures. A study by David G. McMillan [24] attempts to understand which variables proxy for changes in the expected future cash flow. It was concluded that combinations of forecasts outperform single-forecast models.

Looking into the combinations of the features, the hyperparameters and different LSTM variations would allow us to better understand LSTMs and expand on the research of this topic in general. From the related studies we expect that deep architectures would outperform their shallow counterparts and that multiple feature combinations would generally perform better than single features.

4. METHODOLOGY
This section introduces the major steps that will be considered in this research project: (i) data acquisition, (ii) data preprocessing, (iii) details about the RNN-based models, and (iv) the evaluation metrics.

(i) Data used for this research will be extracted from multpl.com [2] and finance.yahoo.com [1].

(ii) Data will be normalized using the python library sklearn. Feature scaling is a method to normalize the range of independent feature variables. Data is then split into a training and a testing set.

(iii) Data will be used to train the single time scale feature models over many iterations to predict each variable independently. As in many traditional predicting models, the Closing Price will be used as the feature to train the control model. Another model will then be trained using the price from multpl.com as a feature. After evaluating these methods, multiple time scale feature models are created and trained. First a control model will be trained using the best possible features of the standard data-set, which will be selected by running tests for each combination of features and comparing their losses, and the label will be set to the Close price. The stock price can be calculated with equation (2) using EPS and P/E, creating a new feature, Calculated Price. Multiple feature models will then be trained using the EPS, PE and Calculated Price as features and the Price as the label, and further compared with the control model. Finally a comparison of the traditional models with the proposed multiple feature models will be made. A standard dropout LSTM model will be optimized through experimentation with hyperparameters and compared with other variants of LSTM.

(iv) Evaluation of the methods will be made using the root mean squared error (1), and visuals will be created depicting predicted and real values, where N is the total number of values, Y_i is the predicted price value and Ŷ_i is the real price.
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{Y}_i - Y_i\right)^2} \quad (1)$$
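As a small illustration of equation (1), the RMSE can be computed directly with numpy. This is only a sketch with placeholder arrays, not the evaluation code used in this research.

```python
import numpy as np

def rmse(y_pred: np.ndarray, y_real: np.ndarray) -> float:
    """Root mean squared error, as defined in equation (1)."""
    return float(np.sqrt(np.mean((y_real - y_pred) ** 2)))

# Placeholder values on the [0, 1] scale used later in the paper.
y_pred = np.array([0.51, 0.53, 0.55])
y_real = np.array([0.50, 0.54, 0.56])
print(rmse(y_pred, y_real))
```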
4.1 Long Short-Term Memory
As mentioned in Section 2.2, Long Short-Term Memory (LSTM) was introduced by Hochreiter and Schmidhuber in 1997 [16] to cope with the problem of long-term dependencies. LSTMs use an architecture similar to RNNs and have been shown to outperform traditional RNNs on numerous tasks [16]. LSTM networks work extremely well with sequence data and long-term dependencies due to their powerful learning capabilities and memory mechanisms. By introducing gates, they improve memory capacity and control over the memory cell. One gate is dedicated to reading out the entries from the cell, the output gate. Another gate is needed to decide when data should be read into the cell; this is called the input gate. Finally, a forget gate resets the content of the cell. This design was chosen in order to decide when to remember and when to ignore inputs at the hidden state. A sigmoid activation function computes the values of the three gates from the input of the current time step and the hidden state of the previous time step, and these values lie in the range (0, 1). The hidden state values are then calculated with a gated version of the tanh activation of the memory cell, which takes values in the range (-1, 1) [37].

4.2 Stacked Long Short-Term Memory
Stacked LSTMs, introduced by Graves et al. in their 2013 paper on speech recognition [15], are now a stable technique for challenging sequential problems. Existing studies [22] have shown that LSTM architectures with several hidden layers can build up a higher level of representation of sequential data, therefore working more effectively and with higher accuracy. The architecture comprises multiple stacked LSTM layers, where the output of a hidden layer is fed directly into the input of the subsequent hidden layer. Instead of the traditional multi-layer LSTM architecture, where a single layer provides a single output value to the next layer, a stacked LSTM provides a sequence of values.

4.3 Bidirectional Long Short-Term Memory
A bidirectional LSTM (BiLSTM), invented in 1997 by Schuster and Paliwal [29], is trained on the sequence of data both forwards and backwards in two separate recurrent networks which are connected to the same output layer [29, 6]. The idea is to split the state of the neurons of a network into a part that is responsible for the forward states, starting from a time frame of t=1, and a part for the backwards direction, starting from t=T.

5. DATA
This section focuses on describing the original data collected, the processing steps, and the feature selection.

The first data-set is collected from finance.yahoo.com [1]. Yahoo is one of the best resources for stock research because it is freely available and provides stock data from around the world. Yahoo provides approximately 1,822,800 records of the S&P 500 index from 1927 to 2020. For the purposes of this research, ten years of data are used, from 2010 to 2020, with a total of approximately 19,600 records. The second data-set is collected from multpl.com [2]. This website provides S&P 500 data not only for the price index but also for the price-to-earnings ratio, earnings per share and dividend yield, to name a few. There are approximately 5,400 records of monthly data, from April 1st, 1871 to January 28, 2021. Data of the last 120 years is used, with approximately 4,350 records. Needless to say, calculating the price gives values very similar to the real price. Formula (2) can therefore be used to introduce another feature to the data-set. The graph in Figure 2 depicts the real price of the S&P 500 from multpl.com and the calculated price.

$$EPS \times P/E = \text{Stock Price} \quad (2)$$

5.1 Datasets Basic Statistics
In order to understand this data, numpy from python was used to calculate the mean and standard deviation for each feature. Table 1 shows these statistics, and Figure 3 shows box plots of the data collected from finance.yahoo.com and from multpl.com, with an extra calculated price using formula (2). Data in the boxplots, and for the rest of the research, is scaled between zero and one. Figure 3 allows us to understand the distribution and skewness of the numerical data by displaying the data quartiles and averages. We can hence observe that Open, High, Low, Close and Adjusted Close follow a very similar trend, with the means being almost identical. Moreover, Volume has a huge number of outliers that differ significantly from the other observations; overall the Volume data varies widely. The same can be said for PE, EPS, Price and the price calculated with formula (2). Machine learning models are generally sensitive to the distribution and range of values. Therefore, outliers may mislead and spoil the training process, resulting in higher losses and longer training times. A paper by Kai Zhang et al. in 2015 concluded that outliers played a huge role in the performance of Extreme Learning Machines (ELM) [38].
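The preprocessing described above (computing the Calculated Price with formula (2), inspecting means and standard deviations, and scaling every feature to [0, 1] with sklearn) could look roughly like the following sketch. The file name and column names are placeholders and the use of pandas is an assumption; the paper only states that sklearn and numpy are used.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical CSV export of the multpl.com data-set.
df = pd.read_csv("sp500_multpl.csv")

# Formula (2): EPS x P/E gives the Calculated Price feature.
df["Calculated Price"] = df["EPS"] * df["PE"]

# Basic statistics per feature, as reported in Table 1.
print(df.mean(numeric_only=True))
print(df.std(numeric_only=True))

# Scale every numeric feature to the range [0, 1] before training.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(df.select_dtypes("number"))
```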
Figure 3. A box plot for Open, High, Low, Close, Adjusted Close and Volume.

Table 1. Data-set statistics, in terms of mean (µ) and standard deviation (σ). The first six rows are data from finance.yahoo.com, while the next three rows are data from multpl.com.
Features | Mean | Standard Deviation
Open | 2570.66 | 432.82
High | 2583.49 | 435.41
Low | 2556.43 | 430.15
Close | 2570.89 | 432.77
Adj. Close | 2570.89 | 432.77
Volume [×10^7] | 383.20 | 95.51
PE | 16.13 | 9.25
EPS | 38.29 | 28.42
Price | 379.47 | 676.48
Calculated Price | 678.29 | 704.70

Figure 4. Ten days of S&P 500: Open, High, Low and Close.

In order to make Open, High, Low and Close clearer, Figure 4 is included. It is observed that Open and Close fluctuate between High and Low, and the overall data follows the same trend, hence the high correlation levels observed. The Calculated Price can be used as an extra feature for prediction purposes.

While Open, High, Low, Close and Adjusted Close are almost identical, they present some very minor differences, which should in theory have no large effect on the selection process of the model. The Adjusted Close price is identical to the feature Close; therefore, for the purpose of this research, Adjusted Close will not be used. Figure 5 shows the correlation between features in heat maps. It is noted that Volume has the least correlation with the other features. Further, Close and Adjusted Close have 100% correlation, supporting the previous statement of being identical. Price is highly dependent on EPS, but surprisingly less on the P/E ratio. The Calculated Price, as expected, has high correlation levels with the real Price, and should allow for overall good results when used.

Statistics should be able to give us more insight into the data and are generally considered an indispensable part of the field of machine learning. Understanding the data and its characteristics is really important to finally come to a conclusion about certain results found in the subsequent sections. In the next section I will execute some experiments in order to select the best features that could be applied to LSTMs.

6. EXPERIMENTS AND RESULTS
In this section a model is constructed as a basis for testing features, their combinations and model parameters.

6.1 LSTM Model Details
LSTMs in general are capable of coping with historical data, hence they are really good candidates for stock prediction. LSTMs can learn the order dependence between items in a sequence and are known for their good performance on sequence data-sets. For the purpose of selecting the best combination of features, a dropout-based LSTM model (DrLSTM) with four hidden LSTM layers and 50 units per hidden layer is trained and evaluated. Each hidden LSTM layer has a subsequent dropout layer, and finally a dense layer is used to connect all the neurons, followed by the last dropout. Dropout is a technique which selects neurons that will be ignored during training; this means that their contribution to the activation of downstream neurons is temporarily removed. The structure of the DrLSTM is found in Figure 7 of the appendix. The DrLSTM is trained with windows of 60 previous days predicting the next day. Table 2 shows the windows of days, where X are the input arrays for the 60 days of data, y are the predicted prices per day (the outcomes of the model for each array X), and n is the total number of days in the data-set.

Table 2. Sliding window inputs (X) and outcomes (y); n is the total number of days.
Days: 1 2 3 ... 60 | 61 62 63 ... n
X1 = days 1–60, y1 = day 61
X2 = days 2–61, y2 = day 62
X3 = days 3–62, y3 = day 63
...

6.2 Feature Selection
In order to perform the feature selection step, I have done a grid search using all feature combinations. There are d!/(r!(d−r)!) possible combinations for each data-set, where d is the total number of features and r is the number of features selected.
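Relating this back to Section 6.1 and Table 2, the 60-day sliding windows can be constructed as in the sketch below. The array names, the single feature column and the placeholder data are assumptions for illustration only.

```python
import numpy as np

def make_windows(series: np.ndarray, window: int = 60):
    """Build inputs X (windows of `window` days) and targets y (the following day)."""
    X, y = [], []
    for i in range(window, len(series)):
        X.append(series[i - window:i])  # e.g. X1 = days 1..60
        y.append(series[i])             # e.g. y1 = day 61
    return np.array(X), np.array(y)

scaled_close = np.random.rand(500)        # placeholder for the scaled Close series
X, y = make_windows(scaled_close, window=60)
print(X.shape, y.shape)                   # (440, 60) (440,)
```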
Figure 5. Correlation Coefficient for the data-set of finance.yahoo.com.
Figure 6. Losses for the number of nodes per layer, the dropout probability and the optimizers used, respectively.
Table 3. RMSE losses for the four-layer LSTM model with dropouts (DrLSTM), the stacked LSTM (StLSTM), the shallow LSTM (ShLSTM) and the bidirectional LSTM (BiLSTM).
Features/Models DrLSTM StLSTM ShLSTM BiLSTM
Close 0.0346 0.0247 0.0230 0.0224
High, Vol. 0.0408 0.0275 0.0238 0.0233
High, Low, Vol. 0.0356 0.0297 0.0231 0.0219
High, Low, Close, Vol. 0.0389 0.0574 0.0233 0.0252
Price 0.0552 0.0454 0.0346 0.0712
EPS, Price 0.0411 0.0682 0.0535 0.0651
PE, Price, Calc.Price 0.0507 0.0818 0.0374 0.1197
[…] counterparts [23]. Finally, the best performing model was the BiLSTM, which had a loss of 0.0219 with the use of multiple features. To understand what these losses represent in real prices: at the average closing price of 2570.89, a model with a loss of 0.0346 has a deviation of 124.73 dollars, and the best performing model, with a loss of 0.0219, has a deviation of 56.18 dollars. In Section 7, I discuss the possible reasons for the behaviours observed.

7. DISCUSSION AND FUTURE WORK
While experimenting with the DrLSTM, we have observed that dropouts introduce a bottleneck in the adjustment of the model's parameters. In many machine learning processes it is useful to know how certain the output of a model is. For example, a prediction is more likely to be closer to the actual price when an input is very similar to elements of the training set [21]. The outputs of a dropout layer are randomly ignored, therefore having the effect of reducing the capacity of a network during training. Requiring more nodes in the context of dropout could potentially remove this bottleneck. In Figure 6 we have observed that increasing the number of nodes gives more positive outcomes. The StLSTM supports this argument, since its dropout layers are absent, hence the better performance. The ShLSTM was the second best performing model, contrary to what I was expecting. A good reason is that the 200 nodes used to train the ShLSTM in one layer were a much better fit for the data used than the 50 nodes per layer of its deep counterparts. A paper by Andrew R. Barron in 1993 [7] gives more insight into the size of a single-layer neural network needed to approximate various tasks. Furthermore, the BiLSTM performed the best throughout the experiments and could potentially be used for long-term transactions in the stock market; however, it leaves much room for improvement. Since the BiLSTM passes the data-set twice, it makes certain trends more visible, adding more weight to certain neurons and extending data usage [11]. The LSTM architecture is mainly used for long-term dependencies, so it is generally good to have more and more contextual information.

In this research the default optimizer parameters were used and seemed to perform generally well. However, running more experiments while adjusting Adam's parameters accordingly could provide improvements. More improvement can also be achieved by looking into the depth (number of layers) and width (number of nodes) of each variant. The window span used to create the input data for the models could be tested for values larger or smaller than 60 days. Finding a more suitable window could also fix the lag observed between the predicted price and the real values. A study by Salah Bouktif et al. [9] tried solving the lag arising from time features with a selection of an appropriate lag length using a genetic algorithm (GA). A deviation of 56.18 dollars for short-term transactions could seem high, since a stock index in general requires more than a couple of days to deviate significantly in order to minimize trading losses; therefore even the BiLSTM leaves much room for improvement.

8. CONCLUSION
This research paper attempts to forecast the S&P 500 index using multiple LSTM variants while performing several experiments for optimization purposes. I trained the models with a popular data-set from finance.yahoo.com and a data-set from multpl.com. This paper has shown that a single feature selection performed better in some instances, while multiple features proved advantageous for the BiLSTM. The testing results confirm that the LSTM variants are capable of tracing the evolution of the closing price for long-term transactions, leaving much room for improvement for daily transactions. This study gave insight into two different data-sets and analysed the results of different variants of LSTM, which should allow researchers and investors to use and expand upon it in the future. Although only one of the many machine learning techniques has been used in this research, there are many more methods, which can be broken down into two categories (statistical techniques and artificial intelligence).

9. ACKNOWLEDGMENT
I would like to thank Dr. Elena Mocanu for helping me during the execution of this research. I also appreciate the time she spent directing me to the right sources.
10. REFERENCES
[1] SNP, Dec 10, 2020: S&P 500 (^GSPC), retrieved from: https://yhoo.it/3ikW3DM.
[2] multpl, https://www.multpl.com/s-p-500-pe-ratio.
[3] A. Adebiyi, A. Adewumi, and C. Ayo. Comparison of ARIMA and artificial neural networks models for stock price prediction. J. Appl. Math., 1–7, 2014.
[4] R. Adhikari and R. K. Agrawal. An introductory study on time series modeling and forecasting, 2013.
[5] K. A. Althelaya, E. M. El-Alfy, and S. Mohammed. Evaluation of bidirectional LSTM for short- and long-term stock market prediction. In 2018 9th International Conference on Information and Communication Systems (ICICS), pages 151–156, 2018.
[6] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction. BIOINF: Bioinformatics, p. 15, 1999.
[7] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[8] Y. Bengio, P. Frasconi, and P. Simard. The problem of learning long-term dependencies in recurrent networks. In IEEE International Conference on Neural Networks, pages 1183–1188 vol. 3, 1993.
[9] S. Bouktif, A. Fiaz, A. Ouni, and M. Serhani. Optimal deep learning LSTM model for electric load forecasting using feature selection and genetic algorithm: Comparison with machine learning approaches. Energies, 11(7):1636, Jun 2018.
[10] S. Chakraborty. Capturing financial markets to apply deep reinforcement learning, 2019.
[11] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, and I. Androutsopoulos. Neural contract element extraction revisited. In Workshop on Document Intelligence at NeurIPS 2019, 2019.
[12] G. Ding and L. Qin. Study on the prediction of stock price based on the associated network model of LSTM. International Journal of Machine Learning and Cybernetics, 11, 06 2020.
[13] J. Fernando. Price-to-Earnings Ratio – P/E Ratio, https://www.investopedia.com/terms/p/price-earningsratio.asp, Nov 13, 2020.
[14] J. Fernando. Earnings Per Share – EPS Definition, https://www.investopedia.com/terms/e/eps.asp, Nov 17, 2020.
[15] A. Graves, N. Jaitly, and A. Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278, 2013.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
[17] J. Jagwani, M. Gupta, H. Sachdeva, and A. Singhal. Stock price forecasting using data from yahoo finance and analysing seasonal and nonseasonal trend. Pages 462–467, 06 2018.
[18] Y. Kara, M. Acar, and Baykan. Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul Stock Exchange. Expert Syst. Appl., 38, 5311–5319, 2011.
[19] W. Kenton. S&P 500 Index – Standard & Poor's 500 Index, https://bit.ly/2LWBUYO, Dec 22, 2020.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2017.
[21] A. Labach, H. Salehinejad, and S. Valaee. Survey of dropout methods for deep neural networks. 25 October 2019.
[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[23] Y. Levine, O. Sharir, and A. Shashua. Benefits of depth for long-term memory of recurrent networks. 15 February 2018.
[24] D. G. McMillan. Which variables predict and forecast stock market returns? 2016.
[25] T. Moyaert and M. Petitjean. The performance of popular stochastic volatility option pricing models during the subprime crisis. Applied Financial Economics, 21(14), 2011.
[26] P.-F. Pai and C.-S. Lin. A hybrid ARIMA and support vector machines model in stock price forecasting. Omega, 33, 497–505, 2005.
[27] M. Roondiwala, H. Patel, and S. Varma. Predicting stock prices using LSTM. International Journal of Science and Research (IJSR), 6, 04 2017.
[28] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[29] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45, pp. 2673–2681, 1997.
[30] S. Siami-Namini and A. S. Namin. Forecasting economics and financial time series: ARIMA vs. LSTM, 2018.
[31] T. J. Strader, J. J. Rozycki, T. H. Root, and Y. J. Huang. Machine learning stock market prediction studies: Review and research directions. Journal of International Technology and Information Management, 28(3), 2020.
[32] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[33] J. Yang and G. Yang. Modified convolutional neural network based on dropout and the stochastic gradient descent optimizer. Algorithms, 11(3), 2018.
[34] Y. Hao and Q. Gao. Predicting the trend of stock market index using the hybrid neural network based on multiple time scale feature learning. Applied Sciences, 10(11), 2020.
[35] Y. Yu, X. Si, C. Hu, and J. Zhang. A review of recurrent neural networks: LSTM cells and network architectures. Neural Computation, 31(7):1235–1270, 2019.
[36] M. D. Zeiler. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701, 2012.
[37] A. Zhang, Z. Lipton, M. Li, and A. Smola. Dive into Deep Learning. 2020.
[38] K. Zhang and M. Luo. Outlier-robust extreme learning machine for regression problems, volume 151, part 3, pages 1519–1527, March 5, 2015.
APPENDIX
A. EXPERIMENTS AND RESULTS
A.1 Feature Combinations

Table 4. RMSE losses from feature combinations (to predict Close values).
Features | RMSE Loss
Close | 0.0346
Open - High | 0.0446
Open - Low | 0.0573
Open - Close | 0.0616
Open - Volume | 0.0578
High - Low | 0.0513
High - Close | 0.0505
High - Volume | 0.0408
Low - Close | 0.0675
Low - Volume | 0.0412
Close - Volume | 0.0479
Open - High - Low | 0.0497
Open - High - Close | 0.0666
Open - High - Volume | 0.0506
Open - Low - Close | 0.0755
Open - Low - Volume | 0.0623
Open - Close - Volume | 0.0607
High - Low - Close | 0.0367
High - Low - Volume | 0.0356
High - Close - Volume | 0.0711
Low - Close - Volume | 0.0683
Open - High - Low - Close | 0.0432
Open - High - Low - Volume | 0.0455
Open - High - Close - Volume | 0.0697
Open - Low - Close - Volume | 0.0588
High - Low - Close - Volume | 0.0389
Open - High - Low - Close - Volume | 0.0548

Table 4 shows the results from the combinations of features collected from finance.yahoo.com. Even though many machine learning models achieve better results with a selection of multiple features, in this research a single feature was capable of performing the best.

Table 5. RMSE losses from feature combinations (to predict Price).
Features | RMSE Loss
Price | 0.0552
EPS - PE | 0.5294
EPS - Price | 0.0411
EPS - Calculated Price | 0.3440
PE - Price | 0.1212
PE - Calculated Price | 0.5054
Price - Calculated Price | 0.0935
EPS - PE - Price | 0.0968
EPS - PE - Calculated Price | 0.5305
EPS - Price - Calculated Price | 0.0916
PE - Price - Calculated Price | 0.0507
EPS - PE - Price - Calculated Price | 0.0953

Table 5 shows the results from the combinations of features collected from multpl.com. In contrast to Table 4, a combination of two features performed the best.

A.2 LSTM Model Parameters

Table 6. RMSE for different numbers of nodes per layer.
Nodes no. | RMSE (Dropout 0.2)
25 | 0.0639
50 | 0.0346
75 | 0.0376
100 | 0.0647
125 | 0.0356
150 | 0.0329

Table 6 shows the results of the tests performed to optimize the DrLSTM with respect to the number of nodes per layer. It was observed that 150 nodes performed the best; however, the time required to train the DrLSTM was significantly higher than with 50 nodes. In the discussion in Section 7 I expand on this observation.

Table 7. RMSE for different dropout probabilities.
Dropout Probability | RMSE (Nodes 50)
0.05 | 0.0274
0.1 | 0.0314
0.15 | 0.0554
0.20 | 0.0346
0.25 | 0.0705
0.30 | 0.0805

Table 7 shows the results of the tests performed in order to optimize the DrLSTM model with respect to the dropout probability of the layers. More information about the structure of the DrLSTM model can be found in Figure 7. It is observed that decreasing the dropout probability gives better results. Therefore, the dropout layers create a barrier to the DrLSTM's process of adjusting parameters during training.

Table 8. RMSE for different optimizers.
Optimizers | RMSE (50 nodes, 0.2 dropout)
Adam | 0.0346
RMSprop | 0.0847
SGD | 0.0905
Adadelta | 0.1198
Adamax | 0.0528

Table 8 shows the results of the tests performed in order to optimize the DrLSTM model with respect to the optimizer that the DrLSTM uses to adjust the parameters of the model during training. It was observed that the Adam optimizer performs the best for the purpose of this research. In Section 7, I discuss the possibility of adjusting the parameters of the optimizer; for simplicity purposes this research used the default parameters.
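As a side note on the last remark, adjusting the optimizer's parameters (left as future work in Section 7) would amount to something like the following; tf.keras is assumed here since the paper does not name its framework, and the specific values are purely illustrative.

```python
import tensorflow as tf

# Default Adam, i.e. the default parameters used in this research.
default_adam = tf.keras.optimizers.Adam()

# Illustrative non-default settings; these values were not evaluated in the paper.
tuned_adam = tf.keras.optimizers.Adam(learning_rate=5e-4, beta_1=0.85, beta_2=0.995)

# model.compile(optimizer=tuned_adam, loss="mse")  # `model` is a placeholder
```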
A.3 Model Variants
Figure 7. Model variant architectures: (a) the dropout LSTM model (DrLSTM), which consists of 4 LSTM layers with a dropout layer each; (b) the stacked LSTM (StLSTM), which consists of the same four LSTM layers but excludes the dropout layers; (c) the bidirectional LSTM (BiLSTM), which consists of a single forward and backward layer; and (d) the shallow LSTM (ShLSTM), which has a single 200-node LSTM layer.
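To make Figure 7 concrete, below is a minimal sketch of how the four variants could be defined. The paper does not state which deep learning framework it uses, so tf.keras, the input shape of (60, 1) and the exact layer ordering are assumptions based on Sections 6.1 and A.2, not the author's actual code.

```python
import tensorflow as tf
from tensorflow.keras import layers

INPUT_SHAPE = (60, 1)  # 60-day window, single feature (assumed)

def drlstm(units: int = 50, dropout: float = 0.2) -> tf.keras.Model:
    """(a) Dropout LSTM: four LSTM layers, each followed by a dropout layer."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=INPUT_SHAPE))
    for i in range(4):
        model.add(layers.LSTM(units, return_sequences=(i < 3)))
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(1))
    return model

def stlstm(units: int = 50) -> tf.keras.Model:
    """(b) Stacked LSTM: the same four LSTM layers without dropout."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=INPUT_SHAPE))
    for i in range(4):
        model.add(layers.LSTM(units, return_sequences=(i < 3)))
    model.add(layers.Dense(1))
    return model

def bilstm(units: int = 50) -> tf.keras.Model:
    """(c) Bidirectional LSTM: a single forward and backward layer."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=INPUT_SHAPE),
        layers.Bidirectional(layers.LSTM(units)),
        layers.Dense(1),
    ])

def shlstm(units: int = 200) -> tf.keras.Model:
    """(d) Shallow LSTM: a single 200-node LSTM layer."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=INPUT_SHAPE),
        layers.LSTM(units),
        layers.Dense(1),
    ])

for build in (drlstm, stlstm, bilstm, shlstm):
    model = build()
    model.compile(optimizer="adam", loss="mse")  # Adam performed best (Table 8)
```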
Figure 8. Best results for each model, depicting the actual price and the predicted price: (a) the dropout LSTM model, (b) the stacked LSTM, (c) the bidirectional LSTM and (d) the shallow LSTM. In graph (a) it is noticeable that the predicted values (in orange) and the real price (in blue) deviate and show a noticeable lag, which is touched upon in Section 7. This lag is most noticeable in (a) but can be found in the rest of the graphs as well. In graph (c) it is noticeable that the bidirectional LSTM had good performance, hence the darker colour of the line.
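For completeness, a minimal matplotlib sketch of how a comparison plot like those in Figure 8 can be produced; the series here are random placeholders (with an artificial lag added to the predictions) and only the colour convention follows the caption.

```python
import numpy as np
import matplotlib.pyplot as plt

days = np.arange(200)
real = np.cumsum(np.random.randn(200)) + 100          # placeholder "real" series
predicted = np.roll(real, 3) + np.random.randn(200)   # placeholder predictions with a small lag

plt.plot(days, real, color="blue", label="Real price")
plt.plot(days, predicted, color="orange", label="Predicted price")
plt.xlabel("Day")
plt.ylabel("Price")
plt.legend()
plt.show()
```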