Stock Prediction
Student
Varinder Jot Bangar
bangav01@myunitec.ac.nz
Supervisor: Dr. Hamid Sharifzadeh
Co-Supervisor: Dr. Soheil Varastehpour
Master of Computing
Contents
Chapter 1: Introduction
Chapter 4: Methodology
4.1: Proposed Research Approach
4.2: Research Methodologies
4.2.1: Data collection
4.2.2: Feature extraction
4.2.3: Training phase
4.2.4: Test phase
References
Chapter 1: Introduction
As used in North America and now around the world, the word “stock” refers to
“equity” [1]. Equity helps businesses raise funds, and investors receive an equivalent
share in the company's profits. Organisations therefore list on the stock market to add value
to their business and to attract investment from the general public, which makes it an
ever-changing market. People invest in businesses that they expect to grow, so that the value
of their investment increases [2]. Things may also go wrong for investors, because the stock
market is strongly influenced by unforeseen incidents that disrupt the normal working of
businesses, such as terrorist attacks, natural calamities (tsunamis, floods and earthquakes),
economic crises and pandemics. Sudden changes in the market create panic among investors,
who then tend to make wrong moves with their stocks, either by investing more or by backing
off from certain shares [3]. This unusual increase or decrease in investment leads to further
market imbalance, making stock values rise or fall, respectively.
This has led to a great deal of research over the past decade on predicting the future
values of stocks. Stock data is highly dynamic, non-linear and chaotic, which always poses
challenges for prediction [4], [5], [6]. To give investors an idea of future trends, lessen
their panic and help stabilise the market, researchers work on numerous models and ideas to
predict stock values. Classical models failed to meet investors' need to predict the
ever-changing market and to explain the dynamics of the lower-level market [7]. The research
gained new life with advances in information technology and the introduction of paradigms
such as Business Intelligence (BI), Artificial Intelligence (AI), big data analysis and
machine learning. Using these advances in unconventional ways has helped researchers propose
numerous different models for stock market prediction.
For example, A. Thakkar et al. [2] use Term Frequency-Inverse Document Frequency (TF-IDF)
on stock data in the form of numerical time series, unlike its usual application to textual
data. The model proposed by W. Chen et al. [4] also takes an unusual approach, in which the
stock data is represented as images and processed by a hybrid of a Convolutional Neural
Network (CNN) and Graph Convolution (GC). The GC takes into account other stocks related to
the stock under consideration to improve prediction performance. S. Bouktif et al. [5] use
Sentiment Analysis (SA) textual features, such as the lexicon-based approach, to establish a
correlation between public moods and opinions and stock market trends, and to predict future
trends from those opinions. A similar but enhanced approach was proposed by J. Liu et al. [3],
who use text mining to verify an event with a major impact on the stock market, namely a
terrorist attack, and then combine it with satellite imagery comparing night lights before
and after the incident to predict stock values.
Stock market prediction using SA is gaining popularity because of advances in
technology, the growing number of platforms for registering public opinion, the increasing
availability of sentiment data and ongoing developments in AI and BI. SA can be used to
predict different events [3], to analyse social media posts [5] or to find correlations
between the stock market and financial news [7]. All of this revolves around the idea that
public sentiment affects stock market trends. Different researchers use different sources to
extract investors’ mood data for their models, such as the social media website “Twitter”
[8]–[12], financial news [13] and bulletin boards [14], [15].
Larger stock markets such as those of the USA, Japan, China, India and the UK are
always the centre of attention for such research, because sudden fluctuations in them affect
the global economy. Stock markets such as the New Zealand Stock Exchange (NZX), however, have
not been explored to their full potential because they are somewhat isolated and have a more
autonomous structure. Therefore, an SA-based prediction model is proposed for NZX, in which
websites that publish financial news will be analysed on certain parameters to find the
dependence of the stock market on public opinion.
The structure of the rest of the proposal is as follows: the literature review and the
statement for the research gap are discussed in Chapter 2. Objectives and the challenges are
discussed in Chapter 3, followed by a description of the proposed Methodology in Chapter 4.
The expected outcomes and a brief timeline for the research progress are given in Chapter 5.
Chapter 2: Research Project Development
The value of a company's shares depends not only on internal attributes of the
business but also on external factors such as pandemics, financial crises and any other
national, international or local events that might affect the business directly or
indirectly. These incidents can move stock market trends in either direction, since “one
man’s meat is another man’s poison”. With these changing trends, people may invest more in
growing stocks, making them rise even further, or may withdraw from stocks in panic, causing
the market to crash [3]. The Covid-19 pandemic has greatly disturbed the global economy. By
the end of April 2021, there were almost 154 million confirmed cases of Covid-19 and
approximately 3.22 million deaths [16]. In March 2020, when the pandemic started to spread
worldwide, the US and Japanese stock markets faced a 20% dip, whereas in Germany the drop was
about 10% [17]. Other stock exchanges around the world showed a similar trend. To reduce
market risk and give investors a fair idea of the future values of stocks, there has been
ample research in the field of stock market prediction, and different researchers use
different approaches to identify the relevant features and how they affect the accuracy of
the proposed models. Table 1 gives an overview of different prediction models in terms of the
features and methods they use and the research problems they leave for future study. In this
research proposal, SA is considered along with other technical and financial features from
the stock market data.
In the era of the internet, where exabytes of data are generated each day, Natural
Language Processing (NLP) techniques provide a way to analyse unstructured data such as text
and speech, which previously could not be processed numerically and algorithmically [18]. NLP
is an area of Artificial Intelligence (AI) that deals with the automatic analysis of human
language in the form of text or speech. The advent of word encoders [19]–[21] and
sentence encoders [22]–[25] led to the successful application of deep learning models in NLP.
SA is a specialised application of NLP [26]. It is the process of extracting
information about people's feelings from natural language and converting it into a
machine-understandable format. SA helps businesses understand the interests of the general
public so that they can customise their products or services accordingly, gaining value and
reputation in the market [27], [28]. Although there is no dearth of sentiment-labelled
datasets in the public domain, they mostly belong to the categories of products and movies
[29]. Models based on these datasets [30]–[33] perform well in their own fields. Applying the
same models to other fields, however, does not give effective results, because every field
has its own set of words for expressing emotions. Therefore, each area needs specialised SA
based on data relevant to that particular domain.
Paper/Year: [2] / 2020
Method:
• TF-IDF for extracting information from stock data
• BackPropagation Neural Network (BPNN), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) for predicting the trend in stock market
Datasets: Daily stock data from Yahoo financial website for Standard & Poor’s 500 (S&P500) and Dow Jones Industrial Average (DJIA) over the period of seven years between 2010 and 2017
Features:
• TF-IDF is commonly used for textual data but in this paper is used for information extraction from a weighted matrix formed by processing past stock data
• The predictions are done for the value of stocks in the next day and on the fifth day
• These predictions are compared with two other existing approaches to prove high accuracy of the proposed model
Problem:
• The model can be upgraded by using more technical indicators and features
• The TF-IDF approach for data retrieval can be used in combination with the other neural network algorithms to test the efficiency of those models

Paper/Year: [3] / 2021
Method:
• CNN and decision tree for incident detection
• TF-IDF for information extraction
• GRU and LSTM for stock value prediction
Datasets: From the stock exchanges of Israel, Columbia and Spain for 25, 14 and 23 years respectively till 2018
Features:
• The unexpected incidents that affect the market are first detected with an accuracy of 91%-93%
• The direction of the market graph is then predicted based on the effect of incident on businesses
• The stocks in 3 different markets are analysed and the precision of almost 70% is obtained using this model
Problem:
• Only incidents related to terrorist attacks are considered for this research but the scope of this model can be extended to other incidents as well
• More information about the incidents can be extracted from different resources
• With the help of data available currently, only direction of the market can be predicted. More precise prediction can be done if more detailed and live data is available instead of data being updated each day

Paper/Year: [4] / 2021
Method: GC-CNN
Datasets: 6 randomly selected Chinese stocks between the years 2015 and 2019
Features:
• Takes into consideration the stock market information in addition to the individual stock
• Improved graph convolution and Dual-CNN is used
• Financial evaluation and computational performance evaluation are performed to compare the predicted values to other trend prediction methods and common stock trading strategies
Problem:
• Metaheuristic algorithms can be used to optimise structural parameters
• Other types of images can be generated during image creation phase
• Investor sentiments not considered
• Long term prediction can be done by considering more features like industry background, shareholder structure

Paper/Year: [34] / 2017
Method:
• Hybrid Attention Network (HAN) with two attention layers
• Recurrent Neural Networks (RNN) for the news feature processing
• Self-paced Learning Mechanism (SPL) for prediction
Datasets: Chinese stock data and economic news data from 2014-2017
Features:
• Uses two attention layers, one for the news level and other for the temporal level
• The news level layer separates the more important news from the rest at a given date
• The temporal level layer correlates the news with their corresponding stock
• The SPL imitates human learning features and therefore can choose between the training samples that are more suitable at initial stages
Problem:
• The news related to individual stocks are considered. The research can be further carried on utilising the relationship between the news related to other stocks having industrial connections with each other
• The approach based on news can be combined with the technical analysis to further increase the efficiency of prediction model

Paper/Year: [5] / 2020
Method:
• Natural Language Processing (NLP) for extracting information from tweets
• Support Vector Machine – Recursive Feature Elimination (SVM-RFE), Principal Component Analysis (PCA), Kernel-PCA, Random Forest (RF), Extreme Gradient Boost (XGB) for feature selection
• SVM, Logistic Regression (LR), Artificial Neural Network (ANN), RF, XGB, Statistical Modeling (SM) as machine learning techniques
Datasets: 10 companies of greater influence from National Association of Securities Dealers Automated Quotations (NASDAQ) stock domains and data from Twitter about these companies in time range 2008-18
Features:
• Use SA on tweets related to companies used for stock prediction
• Different combinations of machine learning algorithms and machine learning techniques were used on training set to compare with the test set of data for optimised result
Problem:
• Different methods to better represent sentiments can be considered
• The ability of machine learning techniques can be enhanced by using hybrid models

Paper/Year: [35] / 2020
Method: Adaptive moment estimation is used as optimiser along with LSTM
Datasets: Data from the year 2009 to 2019 related to 3 IT consulting companies listed on both Bombay Stock Exchange and National Stock Exchange in India
Features:
• Two different LSTM models are used to predict the next day opening values of same company in different stock exchanges
• A 3rd model is used to predict the difference in opening values of both models
• Cross reference is done by combining model 1 or 2 with the third one to predict the opening value of the stock in the other stock exchange and the results are compared to optimise the prediction
Problem:
• Forecast based on other technical indicators can be performed
• The prediction model can be applied to check the results for a company listed in stock exchanges of different countries

Paper/Year: [36] / 2020
Method: Effective Transfer Entropy (ETE) along with machine learning techniques including LR, Multilayer Perceptron (MLP), RF, XGB, LSTM and Adjusted Accuracy
Datasets: A total of 55 stocks with top 5 stocks from each sector from NASDAQ and New York Stock Exchange (NYSE). The data included in research ranges from the year 2000 to 2018
Features:
• Granger-Causal relationship is analysed to relate the financial events to stock market trends
• ETE is used to improve the performance of prediction model and to reduce the noise in information flow
• The network indicators from ETE enhance the accuracy of all the machine learning models used for stock prediction
• The most effective models for using with ETE are found to be MLP and LSTM
Problem:
• The model can be developed further to be applied to all the stocks in a broader view
• The time varying ETE can be researched upon to find out its effectiveness with machine learning in stock prediction
Table 1: Historical research on stock market prediction based on SA
As stated earlier, stock market trends depend not only on the performance of
businesses but also on natural or man-made events. Local and international media cover these
events. People also discuss them on public opinion platforms, which may take the form of
blogs in the print media or of posts and comments on social media websites such as Facebook
and Twitter. In the past, when modes of communication were less developed, news could not
spread around the world quickly, which kept its influence on businesses to a minimum. Now
that the internet and satellite communication have turned the world into one small village,
public opinion also has some effect on the stock market [37], [38]. Although researchers hold
mixed views on this idea, many have continued to address it over the last few years. There
has been significant research supporting the theory that “investor’s sentiments do influence
the stock market trends”. SA has been a point of interest for various researchers working in
the field of finance, who use it to predict stock prices [29], [39], [40], international
economic market trends [41], [42] and business revenues [43].
S. Carta et al. [44] break the whole approach down into technical analysis and
fundamental analysis: predicting the stock market trend from mathematical features derived
from stock values falls under the former, while forecasts based on factors such as news and
macroeconomics belong to the latter. To support this theory, S. Carta et al. [44] use the
Efficient Market Hypothesis (EMH) proposed by D. Shah et al. [45] and the Adaptive Market
Hypothesis (AMH) proposed by D.M. Blei et al. [46]. EMH asserts an immediate relationship
between stock prices and the latest information available through news or other media; but
because certain events, and therefore the news about them, are unpredictable, it is not
feasible to predict stock prices. The problem raised by Shah et al. [45] was thus resolved by
D.M. Blei et al. [46], who address stock market inconsistencies through behavioural economic
features in conjunction with psychological theories.
C.J. Hutto and E. Gilbert [47] use Twitter as the main source of data for SA: all
tweets related to the stock market were analysed using the lexicon approach, and Naïve Bayes
(NB) was used for SA as discussed by F.Z. Xing et al. [48]. Three training sets were
generated from the pre-processed tweets. The first was the basic stock market dataset. The
second combined the basic stock market data with normalised counts of tweets containing the
words worry, hope and fear. The third combined the basic dataset with normalised counts of
tweets containing words related to eight basic emotions. These datasets were used as input to
SVM and neural network algorithms, and the SVM model was found to be more accurate in stock
trend prediction than the other models. A lexicon-based approach similar to that of C.J.
Hutto and E. Gilbert [47] is used by F.Z. Xing et al. [49]. The difference is that SA is
applied not only to individual stocks but also to trends in public opinion on particular
business sectors over a given time span, by feeding the sentiment-based stock data as input
to a Decision Tree algorithm, a white box model.
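The second training set described above relies on normalised counts of tweets that mention worry, hope and fear. A minimal sketch of that counting step is given here for illustration. The function name, the (date, text) input shape, the naive whitespace tokenisation and the choice to normalise by the day's tweet total are all assumptions, not the cited authors' exact procedure.

```python
from collections import defaultdict

EMOTION_WORDS = ("worry", "hope", "fear")  # the three words named in the text


def normalised_emotion_counts(tweets):
    """Return {date: {word: share of that day's tweets containing the word}}.

    `tweets` is an iterable of (date, text) pairs (an assumed input shape).
    Tokenisation is a plain lowercase whitespace split, so "fear!" would not
    match "fear"; a real pipeline would clean punctuation first.
    """
    totals = defaultdict(int)
    hits = defaultdict(lambda: dict.fromkeys(EMOTION_WORDS, 0))
    for date, text in tweets:
        totals[date] += 1
        tokens = set(text.lower().split())
        for word in EMOTION_WORDS:
            if word in tokens:
                hits[date][word] += 1
    return {
        date: {w: hits[date][w] / totals[date] for w in EMOTION_WORDS}
        for date in totals
    }
```

The resulting per-day shares could then be joined to the daily stock rows by date to form the combined training set.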
S. Bouktif et al. [5] consider not only the lexicon-based approach but also
supervised methods, i.e. machine learning algorithms, drawing on the features and advantages
of both methods described by F.Z. Xing et al. [50]. Stock data for 6 of the biggest brands in
the world was taken from Yahoo Finance, along with textual data related to these companies
from Twitter between 2008 and 2018. The tweets were cleaned of irrelevant information such as
retweets, non-English tweets or words, stop words, punctuation and hashtags using the Natural
Language Toolkit (NLTK). Latent Dirichlet Allocation (LDA) was used, as proposed by J.
Smailović et al. [11], to assign the stock-related words in the pre-processed tweets to their
respective organisations. The Python-based TextBlob and Valence Aware Dictionary and
sEntiment Reasoner (VADER) were used for sentiment feature extraction in the machine learning
and lexicon-based settings respectively [51], whereas n-grams were used for the action words
related to the companies in the tweet corpus. The prepared data was normalised and fed to
different models, namely NB, Logistic Regression, SVM, ANN, RF and XGB. The lexicon-based
approach gave the better results, with an accuracy of about 60%.
Data from Twitter contains too much noise, because it is a social networking
platform where people from all walks of life can talk about anything under the sun. To focus
the research on the sentiments of investors and entrepreneurs, F. Xing et al. [38] extract
data from StockTwits, which is similar to Twitter but whose posts and discussions are mainly
on financial topics. The variational RNN (VRNN), a hybrid of RNN, is proposed to predict
volatility in the stock market, but the role of investors would be neglected in this model
setting, as also noted by P. Yu and X. Yan [52]. Stock market fluctuations, as discussed
earlier, can be an effect of events and public opinion. Therefore, following the suggestions
of P. Koratamaddi et al. [53], N. Oliveira et al. [54] and P. Yu et al. [55], the missing
piece of public sentiment is added to the proposed model, and this hybrid model is named
Sentiment-Aware Volatility forecastING (SAVING). GRUs are used in conjunction with the RNN
for the prediction model, and compared with other models such as VRNN and NSVM, SAVING
performs better for most of the stocks considered in the research.
Unlike the approaches discussed so far, J. Kordonis et al. [56] consider not only
social networking platforms but also various online news websites, with a prime focus on
economic news. For stock data pre-processing they use a sentiment polarity model [55], in
which stock prices are represented as positive, neutral or negative. The stock data was then
related to news sentiments at this stage. Later, the data was fed to a sentiment classifier
built using different machine learning techniques (NB, SVM, MLP). An LSTM RNN method is used
to predict future trends. Similarly, P. Koratamaddi et al. [53] take both sources into
account for SA. Google News is used as an aggregator to collect news from a wide range of
websites in one place. Selenium, a browser automation tool, was used to extract news from
Google News. Following the models proposed by N. Oliveira et al. [54], M. Arias et al. [57]
and J. Kordonis et al. [58], Twitter was used to extract stock market sentiments from a
social media platform. As in S. Bouktif et al. [5], VADER is used for SA, converting the raw
textual data (from Twitter) into numerical scores. The input data thus obtained was given to
the Adaptive Deep Deterministic Policy Gradient (ADDPG) model to predict the portfolio value
of the stocks for an investor.
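The polarity representation of stock prices mentioned above (positive, neutral, negative) can be illustrated with a small labelling routine. This is only a sketch under stated assumptions: the function name, the use of day-over-day relative change in closing price, and the 0.5% neutral band are hypothetical choices, not details taken from the cited work.

```python
def polarity_labels(closes, threshold=0.005):
    """Label each day after the first as 'positive', 'neutral' or 'negative'.

    `closes` is a list of closing prices in chronological order.
    `threshold` is a hypothetical band: relative changes within
    +/- threshold are treated as neutral.
    """
    labels = []
    for prev, curr in zip(closes, closes[1:]):
        change = (curr - prev) / prev
        if change > threshold:
            labels.append("positive")
        elif change < -threshold:
            labels.append("negative")
        else:
            labels.append("neutral")
    return labels
```

Labels of this kind can be aligned by date with news sentiment scores before being passed to a classifier.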
Within the above scenario, J. Liu et al. [6] try to strengthen the relationship
between stock trends and public opinion through Score-Inverse Similarity (S-IS), as
discussed by Q. Liu et al. [59]. The approach suggested by A. Fader et al. [60] and M. Banko
et al. [61], in which openly extracted information is represented in a structured form, was
better than the conventional methods. Still, it had the drawback of increased sparsity, which
X. Ding et al. [62] solved by characterising the same structured events with dense vectors of
event embeddings. Data crawling was used on the Yahoo Finance website to capture news data
related to the stocks. Because stock data cannot be shuffled randomly without losing its
usability, the technique proposed in [63] is used to divide the dataset chronologically in
the ratio 8:1:1. The Matthews Correlation Coefficient (MCC) is used as the evaluation metric,
in which the values -1 and 1 represent completely wrong and completely right binary
classifiers, respectively. The proposed Multi-Element Hierarchical Attention Capsule Network
(MHACN) was then compared with the outputs of CNN and LSTM and came out best, with an
efficiency of 60.56%.
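The chronological 8:1:1 split and the MCC metric described above can be sketched in a few lines. The function names are illustrative, and returning 0.0 when the MCC denominator is zero is a common convention adopted here as an assumption rather than something stated in the cited paper.

```python
import math


def chronological_split(rows, ratios=(8, 1, 1)):
    """Split time-ordered rows into train/validation/test without shuffling."""
    total = sum(ratios)
    n = len(rows)
    train_end = n * ratios[0] // total
    val_end = n * (ratios[0] + ratios[1]) // total
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels coded as 0/1; 0.0 if the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Keeping the oldest rows for training and the newest for testing means the model is never evaluated on dates that precede its training data.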
Besides SA based on financial news or public opinion on social networking
websites, [3] draws attention to another approach to event-based fluctuations in the stock
market. In textual data analysis from the web, some sentiment-related words can point
towards a specific natural or man-made event that hinders the normal working of businesses,
creating chaos in the stock market. The authors discuss the 9/11 attacks in the USA, the
nuclear disaster in Fukushima (Japan) and the recent Covid-19 pandemic, but the focus of the
paper is on terrorist attacks. To detect an event, an approach similar to [64] was used,
eliminating irrelevant data with the help of a set of incident-related keywords. The cleaned
data was then classified into event types, first using CNN binary classifiers to ensure that
the data is related to the incidents under consideration and then using a decision tree to
group the data by category. Information was extracted from the resulting corpus in a way
similar to [65], i.e. using Jaccard similarity coefficients between the title words and the
names of locations. Terrorist attack events are confirmed using the number of casualties,
extracted with Stanford CoreNLP [66], and the weapons used, extracted from the news summary
with TF-IDF. Already popular methods for stock market prediction, namely HAN with GRU [34]
and a model based on the LSTM algorithm [13], were used to forecast stock market trends. The
former has two attention layers that help in analysing news text from important time
periods, whereas the latter is better suited to time-series data, as its feedback
connections help in forecasting the next-day opening price of the stocks.
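The Jaccard similarity coefficient mentioned above, applied to title words and location names, can be computed as follows. This is a minimal sketch: the function name and the example headline are made up for illustration, and how the cited work tokenises titles or thresholds the score is not specified here.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical example: one of five distinct tokens is shared.
title_words = "bomb attack in madrid station".split()
location_names = ["madrid"]
score = jaccard(title_words, location_names)
```

A higher score indicates greater overlap between the headline and the candidate location names.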
Research on financial news for stock market prediction was carried out in [44],
where the authors used the dictionaries proposed in [67], [68] to extract textual features
from the news data. They follow [67], which states that dictionaries developed for SA in
other domains can misclassify the words found in financial data, putting the efficiency of
the prediction model at stake. This theory was confirmed by researchers in the field of
financial prediction in [69], [70]. For the prediction model, RF was applied and pruned using
the Gini impurity metric.
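For reference, the Gini impurity of a set of class labels is 1 minus the sum of the squared class proportions. The sketch below shows only how the metric itself is computed for one node; the function name is illustrative, and how the cited authors apply it when pruning the RF is not detailed in this proposal.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())
```

A pure node (all labels identical) has impurity 0, and for two classes the maximum of 0.5 occurs at an even split.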
Another study [49] explored this category by looking only at financial news
available on the web, as it contains the cleanest information about stock prices and trends.
The news data is broadly divided into business level and industry level. Business-level news
or articles discuss individual stock prices. In contrast, industry-level news discusses the
trends of a whole industry containing businesses that provide similar products or services.
The researchers also use the lexicon-based approach discussed earlier in [5]. Words related
to the industry are used to create a lexicon, following [71]. In addition to the two groups
above, the news is divided into three groups based on time interval: last month, last week
and previous day. This adds up to a total of five groups, and for the lexicon words in each
group an average percentage is calculated, on a scale of 0-100, from the number of times a
particular word is seen in the news. After data pre-processing and lexicon creation, the
combined input is fed to a Decision Tree classifier [72], a supervised learning method [73],
to classify the data based on the desired features. The Decision Tree algorithm lags behind
the ANN in terms of accuracy but, being a white box model, it is easy to explain and
visualise [73], [74].
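The per-group lexicon percentages described above can be sketched as follows. The interpretation used here, the percentage of articles in each group that contain a given lexicon word, is an assumption, as are the function name, the group keys and the whitespace tokenisation; the cited paper may compute the 0-100 value differently.

```python
def lexicon_percentages(groups, lexicon):
    """Return {group: {word: percentage of that group's articles containing word}}.

    `groups` maps a group name (e.g. "last_week") to a list of article texts.
    Empty groups yield 0.0 for every lexicon word.
    """
    result = {}
    for name, articles in groups.items():
        tokenised = [set(article.lower().split()) for article in articles]
        result[name] = {
            word: (100.0 * sum(word in tokens for tokens in tokenised) / len(tokenised)
                   if tokenised else 0.0)
            for word in lexicon
        }
    return result
```

Each group then contributes one numeric feature per lexicon word, which can be concatenated into the input row for a classifier.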
2.3: Research Gap
From the previous discussion, it is evident that there is no dearth of research on
stock market prediction. Still, prediction based on SA is limited because researchers differ
in their opinions. One group advocates the impact of sentiments or public moods on the share
market, while the other rejects the idea of forecasting using SA, believing that stock
market fluctuations are purely based on the performance of businesses [5]. We side with the
former group, as the models proposed earlier were successful in stock market prediction. As
concluded in [38], stocks that are talked about more on the internet or other media are more
unstable than those that feature less in such discussions, which supports the view that
stock values are linked to public opinion and moods. The accuracies of the earlier models
were not up to the mark. Still, continued research and new models incorporating hybrid deep
learning models and advanced algorithms have addressed the drawbacks of their predecessors,
improving the accuracies. The ongoing research supports the idea of a connection between
public opinion and stock values and encourages us to find the right combination of
techniques and deep learning algorithms to further increase the efficiency of SA-based stock
market prediction models. In the proposed model, sentiments related to the stock market will
be extracted from textual data and combined with the stock data to generate the input for
the stock market prediction model.
Table 1 lists the problems identified in some of the papers for future research.
The research problems to be addressed in this project are as follows:
• The timeline for the stock data is chosen so that the inconsistencies of the stock market
during the Covid-19 pandemic can be considered.
• The research focuses on NZX, a smaller and more autonomous stock market, and little
research has concentrated on this type of market. Therefore, this research will add to the
knowledge of the dynamics of comparatively smaller stock markets.
Chapter 3: Objectives and Challenges
3.1: Scope
This project registers our participation in the debate on the dependence of the
stock market on public/investor sentiment. Through this research, we try to find the
correlation between the two by predicting the future trends of NZX based on SA of the
financial news available on the internet. The proposed prediction model might be implemented
in future in stock trading apps that suggest stocks with predicted upward trends to their
users.
The NZX differs from stock exchanges in the rest of the world in terms of size,
dependency, capital formation and contribution to the economy. The factors that influence
variation in its stock values are therefore unique to it, which makes prediction models
developed from research on the bigger stock markets of the USA, UK, Japan and China
incompatible. This draws our attention to the need for specialised and focused research on
the New Zealand market, and our proposed model tries to fill that gap. This research will
also encourage more people to participate in this field of study.
3.3: Challenges
The biggest challenge in the study of NZX is the non-availability of data. There
are no newsrooms dedicated to the financial or economic news of the country. The lack of
online platforms where investors can share their views and understanding of the stocks also
adds to the complexity of extracting sentiments.
Chapter 4: Methodology
It is evident from the available literature that numerous approaches are used when
developing a model for stock market prediction. SA, being a new paradigm, is still being
explored by researchers around the world. The literature review done for this proposal shows
that the basic framework is the same throughout, and that the difference lies in the tools
and techniques used to perform each step and attain the maximum possible efficiency.
In this project, the first step is to obtain the stock data for major businesses in
New Zealand, together with a dataset based on news and other online platforms. In the second
step, both are pre-processed: features are extracted from the stock data using deep learning
techniques, as discussed in the next section, and NLP algorithms are used for feature
extraction from the news dataset. In step three, deep learning techniques are applied to
establish a correlation between historical events/incidents and fluctuations in stock
values. A prediction model for future incidents is then designed in step four, and in the
last step the accuracy of the prediction is evaluated using a test set of the processed data.
The prediction model for the NZX can broadly be divided into two phases: the training
phase and the test phase. As seen in Fig. 1, the historical stock data is downloaded from
Yahoo Finance for the top five NZX companies. The downloaded data takes the form of five
separate datasets covering the years 2011 to 2021.
J. Liu et al. [6] use both newsrooms and Twitter to gauge public sentiment, and F. Xing
et al. [38] use social media for analysing public opinion. The problem with such data is that
it is full of irrelevant information, and cleaning it consumes a significant amount of time.
Hence, for this research, the financial news data is downloaded from websites like
“sharechat.co.nz” and “Intensify NZ” that pick out the everyday financial news from the daily
online newsrooms. Below is an example of a headline in the “Latest News” section of
sharechat.co.nz:
“Just Life Group Limited (NZX: JLG) Annual Results for the Year Ended 30 June 2021” [75]
A web scraping technique will be used to download the financial news data, using Python
with the Selenium web driver. The Pandas library will then help to put the data set into a
usable format.
Fig. 1: Training phase of prediction model
4.2.2: Feature extraction
After the data sets are created, the next step is to clean them and extract the features
to be fed to the prediction algorithms. The raw data scraped from financial news has
features such as Date, Day, Time, Title, Text, Links and Source. For our project we only need
the Date, Title, and Text, so the rest of the features are discarded. The textual data
downloaded from financial news websites contains a lot of irrelevant information that needs
to be removed: hashtags, stop words, webpage links and special characters add nothing to the
sentiment of the text. NLTK (a Python package that provides libraries and methods for
processing human language) will be used for eliminating this irrelevant information [5]. This
step will leave behind only the words that carry relevant information.
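The cleaning step described above can be sketched as follows. This is a minimal illustration
under stated assumptions, not the project's implementation: it uses only the Python standard
library, the small STOP_WORDS set is a hypothetical stand-in for the stop-word list that NLTK
would supply, and the sample row is invented apart from the headline and date taken from [75].
The names clean_text and prepare_record are introduced here for illustration only.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption: the
# real pipeline would load the list from NLTK instead of hard-coding it).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

# Only these scraped fields are kept; Day, Time, Links and Source are dropped.
KEEP_FIELDS = ("Date", "Title", "Text")

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_text(text):
    """Lower-case the text and drop links, hashtags, special characters and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


def prepare_record(raw):
    """Keep only Date, Title and Text, cleaning the two textual fields."""
    record = {field: raw[field] for field in KEEP_FIELDS}
    record["Title"] = clean_text(record["Title"])
    record["Text"] = clean_text(record["Text"])
    return record


# Illustrative row: the Day, Time, Text, Links and Source values are invented.
raw_row = {
    "Date": "2021-08-30",
    "Day": "Monday",
    "Time": "09:15",
    "Title": "Just Life Group Limited (NZX: JLG) Annual Results for the Year Ended 30 June 2021",
    "Text": "Results are on the website: http://www.sharechat.co.nz #NZX",
    "Links": "http://www.sharechat.co.nz",
    "Source": "sharechat.co.nz",
}
cleaned = prepare_record(raw_row)
print(cleaned)
```

Dropping the whole hashtag, rather than only the "#" sign, is a design choice of this sketch;
keeping the tagged word would be an equally reasonable reading of the cleaning step.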
An RNN is suitable for extracting contextual information from textual data [76], [77].
The vocabulary size will be determined based on the length of the text, and sentiment vectors
will be generated and stored in the form of matrices for use by the prediction model.
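One common way to store text in matrix form for an RNN is to map each word to an integer and
pad every text to the same length. The sketch below assumes this encoding, which the proposal
does not prescribe. The names build_vocab and encode, the reserved ids 0 and 1, and the two
sample headlines are illustrative assumptions. The sentiment vectors themselves would be
produced later by the RNN from such a matrix.

```python
from collections import Counter

PAD_ID = 0  # reserved for padding short texts
UNK_ID = 1  # reserved for words outside the vocabulary


def build_vocab(texts, max_size=None):
    """Map each word to an integer id, most frequent words first."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending frequency, then alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if max_size is not None:
        ordered = ordered[:max_size]
    return {word: idx + 2 for idx, (word, _) in enumerate(ordered)}


def encode(texts, vocab, max_len):
    """Turn texts into a matrix (list of rows) of equal-length id sequences."""
    matrix = []
    for text in texts:
        ids = [vocab.get(word, UNK_ID) for word in text.split()][:max_len]
        ids += [PAD_ID] * (max_len - len(ids))
        matrix.append(ids)
    return matrix


# Two already-cleaned headlines (invented for illustration).
headlines = [
    "annual results year ended june",
    "annual meeting results",
]
vocab = build_vocab(headlines)
max_len = max(len(h.split()) for h in headlines)  # row length follows the longest text
matrix = encode(headlines, vocab, max_len)
print(len(vocab), matrix)
```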
The stock data contains the features High, Low, Open, Close, Adjusted Close and Volume.
The first four are extracted over the time domain and will be fused with the features
extracted from the financial news of the same date, to see the trend of the market based on
investor sentiment. The stock data is divided into a training data set and a validation data
set in the proportion of 80% and 20% respectively.
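The fusion by date and the 80/20 division can be sketched as below. This is an illustration
under assumptions: the news features are reduced to a single hypothetical sentiment score per
day, days without news receive a neutral score of 0.0, the split is chronological rather than
random, and all prices and scores are invented. The names fuse_by_date and
chronological_split are not part of the proposal.

```python
PRICE_FIELDS = ("High", "Low", "Open", "Close")  # the first four stock features


def fuse_by_date(stock_rows, sentiment_by_date, default_sentiment=0.0):
    """Attach the news sentiment of each trading day to its price features."""
    fused = []
    for row in sorted(stock_rows, key=lambda r: r["Date"]):
        features = [float(row[name]) for name in PRICE_FIELDS]
        # Days without news get a neutral score (an assumption of this sketch).
        features.append(sentiment_by_date.get(row["Date"], default_sentiment))
        fused.append((row["Date"], features))
    return fused


def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows without shuffling: the earliest part is for training."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Invented daily rows in the column layout described above.
stock_rows = [
    {"Date": "2021-08-23", "High": 1.05, "Low": 0.98, "Open": 1.00, "Close": 1.02,
     "Adj Close": 1.02, "Volume": 12000},
    {"Date": "2021-08-24", "High": 1.08, "Low": 1.01, "Open": 1.02, "Close": 1.07,
     "Adj Close": 1.07, "Volume": 15000},
    {"Date": "2021-08-25", "High": 1.09, "Low": 1.03, "Open": 1.07, "Close": 1.04,
     "Adj Close": 1.04, "Volume": 9000},
    {"Date": "2021-08-26", "High": 1.06, "Low": 1.00, "Open": 1.04, "Close": 1.01,
     "Adj Close": 1.01, "Volume": 11000},
    {"Date": "2021-08-27", "High": 1.03, "Low": 0.97, "Open": 1.01, "Close": 0.99,
     "Adj Close": 0.99, "Volume": 13000},
]
sentiment_by_date = {"2021-08-24": 0.6, "2021-08-27": -0.2}

fused = fuse_by_date(stock_rows, sentiment_by_date)
train, validation = chronological_split(fused)
print(len(train), len(validation))
```

A chronological split keeps future prices out of the training data, which a random split of
a time series would not guarantee.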
4.2.3: Training phase
Data with long-term dependencies, such as stock market data, is better analysed by
LSTM [35]. Z. Li et al. [3] also use the LSTM-based model proposed by R. Akita et al. [13]
for their prediction model. These studies suggest that LSTM is efficient when working with
stock data. Hence, the matrix of training data, which will be created by classifying the
input data in terms of the extracted features, will be fed to an LSTM model for stock market
prediction. This phase will use the training data set (80% of the data), and debugging will
also be done in this phase.
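LSTM layers in common deep learning libraries take input arranged as samples of consecutive
time steps, each step holding one feature vector. The sketch below shows one way the training
matrix could be cut into such samples for next-day prediction. The window length of three
days, the choice of target, the toy series and the name make_windows are assumptions made for
illustration; the proposal does not specify them.

```python
def make_windows(series, window, target_index):
    """Build (inputs, targets) pairs for next-step prediction.

    series: time-ordered list of feature vectors, one per trading day.
    window: number of consecutive days given to the model per sample.
    target_index: position in a feature vector of the value to predict.
    """
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window][target_index])
    return inputs, targets


# Toy series of six days with two features per day (invented values).
series = [[float(day), float(day) + 0.5] for day in range(6)]
inputs, targets = make_windows(series, window=3, target_index=0)

# Shape is (samples, time steps, features) = (3, 3, 2).
print(len(inputs), len(inputs[0]), len(inputs[0][0]))
print(targets)
```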
4.2.4: Test phase
As seen in Fig. 2, the prediction model will then be evaluated for efficiency using the
validation dataset in the test phase. The validation dataset is simply the test dataset
created after pre-processing the stock data, that is, the 20% of the data held out from
training. It will be fed to the prediction model, and the output values will be compared
with the already available values to calculate the efficiency of the prediction model.
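The proposal does not fix how efficiency is measured. The sketch below assumes three commonly
used measures for comparing predicted and actual values: root mean squared error, mean
absolute error, and the share of days on which the predicted direction of movement matches
the real one. The function names and the sample prices are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def directional_accuracy(actual, predicted):
    """Share of days on which the predicted move has the same sign as the real move."""
    hits = 0
    for i in range(1, len(actual)):
        real_move = actual[i] - actual[i - 1]
        predicted_move = predicted[i] - actual[i - 1]
        if (real_move > 0) == (predicted_move > 0):
            hits += 1
    return hits / (len(actual) - 1)


# Invented closing prices and model outputs for four validation days.
actual = [10.0, 11.0, 10.0, 12.0]
predicted = [10.0, 12.0, 11.5, 11.0]

print(rmse(actual, predicted))
print(mae(actual, predicted))
print(directional_accuracy(actual, predicted))
```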
Chapter 5: Expected Outcomes and Timeline
5.1: Outcomes
5.3: Timeline
The approximate timeline for the proposed research is presented in Fig. 3.
References
[1] R. J. Teweles and E. S. Bradley, The Stock Market, 7th ed. John Wiley & Sons, Inc.,
1998.
[2] A. Thakkar and K. Chaudhari, “Predicting stock trend using an integrated term
frequency–inverse document frequency-based feature weight matrix with neural
networks,” Applied Soft Computing Journal, vol. 96, Nov. 2020.
[3] Z. Li, S. Lyu, H. Zhang, and T. Jiang, “One Step Ahead: A Framework for Detecting
Unexpected Incidents and Predicting the Stock Markets,” IEEE Access, vol. 9, pp.
30292–30305, Feb. 2021.
[4] W. Chen, M. Jiang, W. G. Zhang, and Z. Chen, “A novel graph convolutional feature
based convolutional neural network for stock trend prediction,” Information Sciences,
vol. 556, pp. 67–94, May 2021.
[5] S. Bouktif, A. Fiaz, and M. Awad, “Augmented Textual Features-Based Stock Market
Prediction,” IEEE Access, vol. 8, pp. 40269–40282, Feb. 2020.
[6] J. Liu, H. Lin, L. Yang, B. Xu, and D. Wen, “Multi-Element Hierarchical Attention
Capsule Network for Stock Prediction,” IEEE Access, vol. 8, pp. 143114–143123,
Aug. 2020.
[7] X. Wan, J. Yang, S. Marinov, J. P. Calliess, S. Zohren, and X. Dong, “Sentiment
correlation in financial news networks and associated market movements,” Scientific
Reports, vol. 11, no. 1, Dec. 2021.
[8] A. Tafti, R. Zotti, and W. Jank, “Real-time diffusion of information on twitter and the
financial markets,” PLoS ONE, vol. 11, no. 8, Aug. 2016.
[9] A. Papana, C. Kyrtsou, and D. Kugiumtzis, “Detecting causality in non-stationary time
series using Partial Symbolic Transfer Entropy: Evidence in financial data,”
Computational Economics, vol. 47, no. 3, pp. 341–365, Mar. 2016.
[10] G. Ranco, D. Aleksovski, G. Caldarelli, M. Grčar, and I. Mozetič, “The effects of
twitter sentiment on stock price returns,” PLoS ONE, vol. 10, no. 9, Sep. 2015.
[11] J. Smailović, M. Grčar, N. Lavrač, and M. Žnidaršič, “Stream-based active learning for
sentiment analysis in the financial domain,” Information Sciences, vol. 285, no. 1, pp.
181–203, Nov. 2014.
[12] B. Liu, Q. Li, H. Li, J. Si, A. Mukherjee, and X. Deng, “Exploiting Topic based
Twitter Sentiment for Stock Prediction,” in Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics, Sofia, Bulgaria, Aug. 2013, pp. 24–29.
[13] R. Akita, A. Yoshihara, T. Matsubara, and K. Uehara, “Deep learning for stock
prediction using numerical and textual information,” in Proceedings of 15th
International Conference on Computer and Information Science, Okayama, Japan, pp.
1–6, Aug. 2016.
[14] T. H. Nguyen and K. Shirai, “Topic Modeling based Sentiment Analysis on Social
Media for Stock Market Prediction,” in Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th International Joint Conference
on Natural Language Processing, Beijing, China, pp. 1354–1364, Jul. 2015.
[15] S. R. Das and M. Y. Chen, “Yahoo! for amazon: Sentiment extraction from small talk
on the Web,” Management Science, vol. 53, no. 9, pp. 1375–1388, Sep. 2007.
[16] “Covid-19,” Google News, Aug. 18, 2021. Accessed on: Aug. 18, 2021. [Online].
Available: https://news.google.com/covid19/map?hl=en-NZ&gl=NZ&ceid=NZ%3Aen
[17] W. Zhang and S. Hamori, “Crude oil market and stock markets during the COVID-19
pandemic: Evidence from the US, Japan, and Germany,” International Review of
Financial Analysis, vol. 74, pp. 1–13, Mar. 2021.
[18] M. Mitchell, J. Aguilar, T. Wilson, and B. van Durme, “Open Domain Targeted
Sentiment,” in Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing, Seattle, Washington, USA, pp. 1643–1654, Oct. 2013.
[19] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words
and Phrases and their Compositionality,” in Proceedings of NIPS’13: 26th
International Conference on Neural Information Processing Systems, Nevada, USA,
pp. 3111–3119, Dec. 2013.
[20] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word
Representation,” in Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543, Oct. 2014.
[21] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with
Subword Information,” Transactions of the Association for Computational Linguistics,
vol. 5, pp. 135–146, Jun. 2017.
[22] R. Kiros et al., “Skip-Thought Vectors,” in Proceedings of the 28th International
Conference on Neural Information Processing Systems (NIPS’15), Montreal, Canada,
pp. 3294–3302, Dec. 2015.
[23] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised Learning
of Universal Sentence Representations from Natural Language Inference Data,” in
Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing, Copenhagen, Denmark, pp. 670–680, Sep. 2017.
[24] M. Artetxe and H. Schwenk, “Massively Multilingual Sentence Embeddings for Zero-
Shot Cross-Lingual Transfer and Beyond,” Transactions of the Association for
Computational Linguistics, vol. 7, pp. 597–610, Mar. 2019.
[25] Q. Le and T. Mikolov, “Distributed Representations of Sentences and Documents,” in
ICML’14: Proceedings of the 31st International Conference on Machine Learning,
Beijing, China, pp. 1188–1196, Jun. 2014.
[26] A. Montoyo, P. Martínez-Barco, and A. Balahur, “Subjectivity and sentiment analysis:
An overview of the current state of the area and envisaged developments,” Decision
Support Systems, vol. 53, no. 4, pp. 675–679, Nov. 2012.
[27] K. Ravi and V. Ravi, “A survey on opinion mining and sentiment analysis: Tasks,
approaches and applications,” Knowledge-Based Systems, vol. 89, pp. 14–46, Nov.
2015.
[28] M. Rushdi Saleh, M. T. Martín-Valdivia, A. Montejo-Ráez, and L. A. Ureña-López,
“Experiments with SVM to classify opinions in different domains,” Expert Systems
with Applications, vol. 38, no. 12, pp. 14799–14804, Nov. 2011.
[29] K. Mishev, A. Gjorgjevikj, I. Vodenska, L. T. Chitkushev, and D. Trajanov,
“Evaluation of Sentiment Analysis in Finance: From Lexicons to Transformers,” IEEE
Access, vol. 8, pp. 131662–131682, Jul. 2020.
[30] A. Yenter and A. Verma, “Deep CNN-LSTM with Combined Kernels from Multiple
Branches for IMDb Review Sentiment Analysis,” in Proceedings of 8th IEEE Annual
Ubiquitous Computing, Electronics & Mobile Communication Conference
(UEMCON), New York, USA, pp. 540–546, Oct. 2017.
[31] D. Cer et al., “Universal Sentence Encoder for English,” in Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing: System
Demonstrations, Brussels, Belgium, pp. 169–174, Nov. 2018.
[32] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding,” in Proceedings of North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies 2019, Minneapolis, USA, pp. 4171–4186, Jun. 2019.
[33] Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” in
Proceedings of International Conference on Learning Representations, pp. 1–13, Jul.
2019.
[34] Z. Hu, W. Liu, J. Bian, X. Liu, and T. Y. Liu, “Listening to chaotic whispers: A deep
learning framework for news-oriented Stock trend prediction,” in WSDM 2018 -
Proceedings of the 11th ACM International Conference on Web Search and Data
Mining, New York, USA, pp. 261–269, Feb. 2018.
[35] A. Thakkar and K. Chaudhari, “CREST: Cross-Reference to Exchange-based Stock
Trend Prediction using Long Short-Term Memory,” Procedia Computer Science, vol.
167, pp. 616–625, Apr. 2020.
[36] S. Kim, S. Ku, W. Chang, and J. W. Song, “Predicting the
Direction of US Stock Prices Using Effective Transfer Entropy and Machine Learning
Techniques,” IEEE Access, vol. 8, pp. 111660–111682, Jun. 2020.
[37] N. Oliveira, P. Cortez, and N. Areal, “Stock market sentiment lexicon acquisition
using microblogging data and statistical measures,” Decision Support Systems, vol. 85,
pp. 62–73, May 2016, doi: 10.1016/j.dss.2016.02.013.
[38] F. Z. Xing, E. Cambria, and Y. Zhang, “Sentiment-aware volatility forecasting,”
Knowledge-Based Systems, vol. 176, pp. 68–76, Jul. 2019.
[39] W. Souma, I. Vodenska, and H. Aoyama, “Enhanced news sentiment analysis using
deep learning methods,” Journal of Computational Social Science, vol. 2, no. 1, pp.
33–46, Jan. 2019.
[40] M. Y. Day and C. C. Lee, “Deep learning for financial sentiment analysis on finance
news providers,” in 2016 IEEE/ACM International Conference on Advances in Social
Networks Analysis and Mining (ASONAM), San Francisco, USA, pp. 1127–1134, Aug.
2016.
[41] S. F. Crone and C. Koeppel, “Predicting exchange rates with sentiment indicators: An
empirical evaluation using text mining and multilayer perceptrons,” in Proceedings of
IEEE/IAFE Conference on Computational Intelligence for Financial Engineering
(CIFEr), London, UK, pp. 1127–1134, Oct. 2014.
[42] C. Curme, H. E. Stanley, and I. Vodenska, “Coupled Network Approach to
Predictability of Financial Market Returns and News Sentiments,” International
Journal of Theoretical and Applied Finance, vol. 18, no. 7, Nov. 2015.
[43] K. Mishev, A. Gjorgjevikj, I. Vodenska, L. Chitkushev, W. Souma, and D. Trajanov,
“Forecasting Corporate Revenue by Using Deep-Learning Methodologies,” in
Proceedings of 3rd International Conference on Control, Artificial Intelligence,
Robotics and Optimization (ICCAIRO 2019), Athens, Greece, pp. 115–120, May 2019.
[44] A. Picasso, S. Merello, Y. Ma, L. Oneto, and E. Cambria, “Technical analysis and
sentiment embeddings for market trend prediction,” Expert Systems with Applications,
vol. 135, pp. 60–70, Nov. 2019.
[45] E. F. Fama, “Efficient Capital Markets: II,” The Journal of Finance, vol. 46, no. 5,
pp. 1575–1617, Dec. 1991.
[46] A. W. Lo, “The Adaptive Markets Hypothesis: Market Efficiency from an
Evolutionary Perspective,” The Journal of Portfolio Management, vol. 30, no. 5, pp.
15–29, Oct. 2004.
[47] A. Porshnev, I. Redkin, and A. Shevchenko, “Machine learning in prediction of stock
market indicators based on historical data and data from twitter sentiment analysis,” in
Proceedings of 13th IEEE International Conference on Data Mining Workshops
(ICDMW), Dallas, USA, pp. 440–444, Dec. 2013.
[48] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment Classification using
Machine Learning Techniques,” in Proceedings of the 2002 Conference on Empirical
Methods in Natural Language Processing (EMNLP), Pennsylvania, USA, pp. 79–86,
Jul. 2002.
[49] S. Carta, S. Consoli, L. Piras, A. S. Podda, and D. R. Recupero, “Explainable Machine
Learning Exploiting News and Domain-specific Lexicon for Stock Market
Forecasting,” IEEE Access, vol. 9, pp. 30193–30205, Feb. 2021.
[50] D. Shah, H. Isah, and F. Zulkernine, “Stock market analysis: A review and taxonomy
of prediction techniques,” International Journal of Financial Studies, vol. 7, no. 2, pp.
1–22, May 2019.
[51] C. J. Hutto and E. Gilbert, “VADER: A Parsimonious Rule-based Model for Sentiment
Analysis of Social Media Text,” 2014.
[52] F. Z. Xing, E. Cambria, L. Malandri, and C. Vercellis, “Discovering bayesian market
views for intelligent asset allocation,” in Machine Learning and Knowledge Discovery
in Databases, vol. 11053, Springer Verlag, pp. 120–135, 2019.
[53] P. Koratamaddi, K. Wadhwani, M. Gupta, and S. G. Sanjeevi, “Market sentiment-
aware deep reinforcement learning approach for stock portfolio allocation,”
Engineering Science and Technology, an International Journal, vol. 24, no. 4, pp.
848–859, Aug. 2021.
[54] N. Oliveira, P. Cortez, and N. Areal, “The impact of microblogging data for stock
market prediction: Using Twitter to predict returns, volatility, trading volume and
survey sentiment indices,” Expert Systems with Applications, vol. 73, pp. 125–144,
May 2017.
[55] P. Yu and X. Yan, “Stock price prediction based on deep neural networks,” Neural
Computing and Applications, vol. 32, no. 6, pp. 1609–1628, Mar. 2020.
[56] P. Mehta, S. Pandya, and K. Kotecha, “Harvesting social media sentiment analysis to
enhance stock market prediction using deep learning,” PeerJ Computer Science, vol. 7,
pp. 1–21, Apr. 2021.
[57] M. Arias, A. Arratia, and R. Xuriguera, “Forecasting with twitter data,” ACM
Transactions on Intelligent Systems and Technology, vol. 5, no. 1, pp. 1–24, Dec.
2013.
[58] J. Kordonis, S. Symeonidis, and A. Arampatzis, “Stock price forecasting via sentiment
analysis on Twitter,” in Proceedings of the 20th Pan-Hellenic Conference on
Informatics, New York, USA, pp. 1–6, Nov. 2016.
[59] Q. Liu, X. Cheng, S. Su, and S. Zhu, “Hierarchical complementary attention network
for predicting stock price movements with news,” in Proceedings of International
Conference on Information and Knowledge Management, New York, USA, pp. 1603–
1606, Oct. 2018.
[60] A. Fader, S. Soderland, and O. Etzioni, “Identifying Relations for Open Information
Extraction,” in Proceedings of the 2011 Conference on Empirical Methods in Natural
Language Processing, Edinburgh, UK, pp. 1535–1545, Jul. 2011.
[61] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, “Open
Information Extraction from the Web,” in Proceedings of 20th International Joint
Conference on Artificial Intelligence, Hyderabad, India, pp. 2670–2676, Jan. 2007.
[62] X. Ding, Y. Zhang, T. Liu, and J. Duan, “Deep Learning for Event-Driven Stock
Prediction,” in Proceedings of the 24th International Conference on Artificial
Intelligence, Buenos Aires, Argentina, pp. 2327–2333, Jul. 2015.
[63] Y. Xu and S. B. Cohen, “Stock Movement Prediction from Tweets and Historical
Prices,” in Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics, Melbourne, Australia, pp. 1970–1979, Jul. 2018.
[64] F. Petroni et al., “An extensible event extraction system with cross-media event
resolution,” in Proceedings of the International Conference on Knowledge Discovery
and Data Mining, London, UK, pp. 626–635, Jul. 2018.
[65] K. Radinsky and E. Horvitz, “Mining the web to predict future events,” in
Proceedings of the 6th ACM International Conference on Web Search and Data Mining,
Rome, Italy, pp. 255–264, Feb. 2013.
[66] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. Mcclosky, “The
Stanford CoreNLP Natural Language Processing Toolkit,” in Proceedings of 52nd
Annual Meeting of the Association for Computational Linguistics: System
Demonstrations, Maryland, USA, pp. 55–60, Jun. 2014.
[67] T. Loughran and B. McDonald, “When Is a Liability Not a Liability? Textual Analysis,
Dictionaries, and 10-Ks,” The Journal of Finance, vol. 66, no. 1, pp. 35–65, Feb.
2011.
[68] E. Cambria, J. Fu, F. Bisio, and S. Poria, “AffectiveSpace 2: Enabling Affective
Intuition for Concept-Level Sentiment Analysis,” in Proceedings of the 29th AAAI
Conference on Artificial Intelligence, Texas, USA, pp. 508–514, Jan. 2015.
[69] X. Li, H. Xie, L. Chen, J. Wang, and X. Deng, “News impact on stock price return via
sentiment analysis,” Knowledge-Based Systems, vol. 69, no. 1, pp. 14–23, Oct. 2014.
[70] F. Jin, N. Self, P. Saraf, P. Butler, W. Wang, and N. Ramakrishnan, “Forex-foreteller:
Currency trend modeling using news articles,” in Proceedings of the International
Conference on Knowledge Discovery and Data Mining, Chicago, USA, pp. 1470–1473,
Aug. 2013.
[71] S. Carta, S. Consoli, L. Piras, A. S. Podda, and D. R. Recupero, “Dynamic Industry-
Specific Lexicon Generation for Stock Market Forecast,” in Proceedings of the 6th
International Conference on Machine Learning, Optimization, and Data Science,
Siena, Italy, pp. 162–176, Sep. 2020.
[72] A. Atkins, M. Niranjan, and E. Gerding, “Financial news predicts stock market
volatility better than close price,” Journal of Finance and Data Science, vol. 4, no. 2,
pp. 120–137, Jun. 2018.
[73] C. D. Sutton, “Classification and Regression Trees, Bagging, and Boosting,” in
Handbook of Statistics, vol. 24, pp. 303–329, 2005.
[74] S. R. Safavian and D. Landgrebe, “A Survey of Decision Tree Classifier
Methodology,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, no. 3,
pp. 660–674, Jun. 1991.
[75] “Latest News,” sharechat.co.nz, Aug. 30, 2021. Accessed on: Aug. 30, 2021. [Online].
Available:
http://www.sharechat.co.nz/article/6444f1bb/just-life-group-limited-nzx-jlg-annual-results-for-the-year-ended-30-june-2021.html
[76] S. Gite, H. Khatavkar, K. Kotecha, S. Srivastava, P. Maheshwari, and N. Pandey,
“Explainable stock prices prediction from financial news articles using sentiment
analysis,” PeerJ Computer Science, vol. 7, pp. 1–21, Jan. 2021.
[77] K. Cho et al., “Learning Phrase Representations using RNN Encoder-Decoder for
Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734,
Oct. 2014.