Stock Price Prediction Using News Sentiment Analysis
1 Computer Engineering, San José State University, San José, CA
∗ Corresponding Author, david.anastasiu at sjsu.edu

Abstract: Predicting stock market prices has been a topic of interest among both analysts and researchers for a long time. Stock prices are hard to predict because of their highly volatile nature, which depends on diverse political and economic factors, change of leadership, investor sentiment, and many other factors. Predicting stock prices based on either historical data or textual information alone has proven to be insufficient.
Existing studies in sentiment analysis have found that there is a strong correlation between the movement of stock prices and the publication of news articles. Several sentiment analysis studies have been attempted at various levels using algorithms such as support vector machines, naive Bayes regression, and deep learning. The accuracy of deep learning algorithms depends upon the amount of training data provided. However, the amount of textual data collected and analyzed during the past studies has been insufficient and thus has resulted in predictions with low accuracy.
In our paper, we improve the accuracy of stock price predictions by gathering a large amount of time series data and analyzing it in relation to related news articles, using deep learning models. The dataset we have gathered includes daily stock prices for S&P 500 companies for five years, along with more than 265,000 financial news articles related to these companies. Given the large size of the dataset, we use cloud computing as an invaluable resource for training prediction models and performing inference for a given stock in real time.
Index Terms: stock market prediction, cloud, big data, machine learning, regression.

I. INTRODUCTION

There are many factors that influence stock market prices. One of those factors is investors' reaction to financial news and day-to-day events. Nowadays, news availability has increased dramatically, and it is hard for investors to judge the trend of stock prices from such a large volume of news. An automated system to predict future stock prices would therefore be helpful for investors. Such a system can gather financial news related to the companies of interest in real time and can execute a machine learning model on those data, along with historical stock price information, to predict the price.
For years, research has been done on predicting stock prices either based on historical stock price data alone or by using textual data and historical data [1], [2], [3], [4], [5], [6]. Some of the previous works used Twitter sentiments, financial blogs, or news articles as the textual data. In our work, we use financial news articles from well-known sources to avoid fake news that may be prevalent on social media. We have used past stock prices and current day financial news to predict the current day closing stock price. We believe that this approach is better because financial news related to the company has a significant effect on its stock price. Hence, taking into consideration the financial news of a company, instead of only considering the past stock prices, can lead to better prediction results.

II. RELATED WORKS

Most stock prediction approaches have been built on technical and fundamental analyses of stocks. Recent studies have shown that there is a strong correlation between news articles related to a company and its stock price movements.
Alostad and Davulcu [1] used hourly stock prices of 30 stocks and online stock news articles from the NASDAQ website. They collected tweets related to those 30 stocks for a period of six months. Li et al. [2] collected five years of Hong Kong stock exchange data. They gathered financial news articles over the same time period in order to draw a correlation between the news articles and stock market trends. For each trading day, they collected the open, high, close, and low prices of the stock of each company.
Collected articles have been processed in different ways to extract features. Alostad and Davulcu [1] extracted n-gram features after removing stop-words, whitespace, punctuation, and numbers. They built the final features into a document matrix and used OpenNLP to extract sentences from each document. They then used the SentiStrength library with the Loughran and McDonald Financial Sentiment Dictionaries on those sentences to detect sentiment. Removal of HTML tags [3], tokenization of sentences [3], noun phrasing [7], document weighting [3], tf-idf [3], and extraction of named entities are some of the text pre-processing steps used in various papers.
Alostad and Davulcu [1] treated stock price trend prediction as a classification problem. They applied logistic regression to the n-gram document matrix, the stock price direction for each hour, and the weight of each document. Later, they used SVM to perform the classification. The experiments also showed that extracting document-level sentiment does not significantly increase the prediction accuracy. In various other works, random forest, naive Bayes, and genetic algorithms were used for stock price/direction prediction.

III. DATA COLLECTION

We have collected two different datasets for this research. The daily stock price dataset consists of closing stock prices of the Standard and Poor's 500 companies, from February 2013 to March 2017. We also collected news articles for the S&P 500 companies from February 2013 to March 2017 from international daily newspaper websites. The total number of articles collected is 265,463. The challenging aspect of stock price prediction is making use of available data to make an informed decision. A lot of data is being generated for many companies and, if these data were to be processed manually,
Authorized licensed use limited to: PUC GO - Universidade Católica de Goiás. Downloaded on March 22,2024 at 17:37:27 UTC from IEEE Xplore. Restrictions apply.
libraries to derive the sentiment or polarity of each document. The sentiment library gives positive, negative, and neutral values as output. In this experiment, we considered positive and negative values only. The final polarity of a company was calculated by identifying the maximum absolute positive or negative sentiment in each text and averaging the identified polarities across all texts identified for a company, i.e.,

\[
\mathit{Polarity}_{ic} = (\pm)\,\max\bigl(|N_{ic}|, |P_{ic}|\bigr),
\]
\[
\mathit{Polarity}_{c} = \frac{1}{k} \sum_{i=1}^{k} \mathit{Polarity}_{ic},
\]

where $N_{ic}$ and $P_{ic}$ are the negative and positive values corresponding to words in the $i$th of $k$ documents about company $c$.
We trained the model based on current polarity and previous stock prices, i.e., we predicted the closing stock price at a given time $t$ ($\mathit{Price}_t$) by considering the pairs $(\mathit{Price}_{t-1}, P_t)$, $(\mathit{Price}_{t-2}, P_{t-1})$, ..., $(\mathit{Price}_{t-m}, P_{t-m+1})$, where $P_t$ is the polarity at time $t$. We transformed and normalized data as described in Section VI-C1. We tuned various parameters of the RNN LSTM model, such as the number of LSTM layers, the number of units in each layer, the batch size, and the number of epochs. We used the RMSprop optimizer and linear activation to fit the model.
3) Approach 3 - RNN LSTM with Stock Prices and Textual Information: Instead of just identifying the sentiment of the text, as in the previous method, we processed the whole text and fed it as input to the neural network along with the label price. We computed a linear combination of tf-idf weights with the word2vec representation of words in the documents, which we then provided as input to a convolutional neural network. The output of this neural network is a 10-dimensional vector, which is in turn given as input to the recurrent neural network (RNN) along with the normalized price. We designed the convolutional network in the following sequence: two convolutional layers, max pooling, one convolutional layer, 2D global average pooling, dense layer, dropout, and finally two dense layers with ReLU as the activation function and a dropout rate of 0.4.
4) Approach 4 - RNN LSTM Multivariate model: In this experiment, we processed the textual data the same way as in Approach 2, using NLTK libraries, but constructed combined daily samples that included stock prices and sentiment polarity for all companies. We used four consecutive days of samples as a window, with the fifth day's stock prices as the expected output. Stock prices were normalized across the companies. We built this model to test whether the stock price of a particular company is influenced by the stock price changes of other companies. The output of this model is the stock prices of all companies in the dataset for a single day.

VII. RESULTS AND ANALYSES

In this section, we discuss the results obtained using our models. Table I shows the labels we use to represent models in the result graphs and table. RNN models performed well when compared to traditional models like ARIMA and Facebook Prophet. The RNN-pt model performed well for stable stocks and was able to follow their trends, but performed poorly for low-priced and highly volatile stocks. The RNN-pp model gave the best results across all experiments. This model performed well for companies for which we had more textual data. Fig. 1 shows the stock price predictions for Apple for all models. Additionally, Fig. 2 shows the average MAPE across all companies for each model. Most RNN approaches perform very well, with MAPE values between 2.03 and 2.17. Classical time series models like ARIMA and Facebook Prophet perform much worse, with higher MAPE values between 7.39 and 7.98. The models that used text (financial news articles) as part of the input performed very well, while models that predicted future stock prices only on the basis of historical stock prices led to high percentage error. Table II shows a comparison of these results for five major companies. There were more documents collected for these companies as compared to other companies, and the high number of news articles contributed to achieving low MAPE for these models. Finally, the global RNN-mv model performed poorly overall, indicating that the individual signal for each company is more important for predicting its performance. More work is needed to incorporate global stock price information into a local company-specific prediction model.

TABLE I
LABELS TO REPRESENT VARIOUS MODELS

Label       Model
fbprophet   Facebook Prophet
RNN-p       RNN-LSTM model with prices as input
RNN-pp      RNN-LSTM model with prices and text polarity as input
RNN-pt      RNN-LSTM model with prices and text as input
RNN-mv      RNN-LSTM multivariate model

VIII. CONCLUSION AND FUTURE WORK

In this work, we predicted stock prices using time series models, neural networks, and a combination of neural networks and financial news articles. The results suggest that there is a strong relationship between stock prices and financial news articles. We built prediction models based on time series forecasting models, such as ARIMA, RNN, and Facebook Prophet. We achieved better results with RNN and found that there is a correlation between the textual information and stock price direction. The models did not perform well in cases where stock prices are low or highly volatile. There are still different ways to build stock prediction models, which we leave as future work. Some of these include building a domain-specific model by grouping companies according to their sector, considering adverse effects on the stock price of a company due to news about other related companies, and considering more general industry and global news that could indicate general market stability.

REFERENCES

[1] H. Alostad and H. Davulcu, “Directional prediction of stock prices using breaking news on Twitter,” in 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, Dec 2015, pp. 523–530.
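As an illustration of the polarity aggregation defined for Approach 2, the following Python sketch computes a signed per-document polarity and averages it over the documents of one company. It is a minimal sketch under stated assumptions, not the authors' code: the function names `document_polarity` and `company_polarity` are hypothetical, the negative and positive scores are assumed to be already produced by a sentiment library, and ties between the two scores are resolved in favor of the positive value, which the text does not specify.

```python
from statistics import mean
from typing import Iterable, Tuple


def document_polarity(neg: float, pos: float) -> float:
    """Signed polarity of one document.

    Returns the score with the larger magnitude, negated when the
    negative score dominates. Tie-breaking toward the positive score
    is an assumption of this sketch.
    """
    if abs(pos) >= abs(neg):
        return abs(pos)
    return -abs(neg)


def company_polarity(scores: Iterable[Tuple[float, float]]) -> float:
    """Average the per-document polarities of the k documents of a company.

    `scores` holds one (negative, positive) pair per document.
    """
    polarities = [document_polarity(neg, pos) for neg, pos in scores]
    if not polarities:
        raise ValueError("at least one document is required")
    return mean(polarities)


if __name__ == "__main__":
    # Three hypothetical documents about one company: (negative, positive).
    docs = [(0.1, 0.6), (0.7, 0.2), (0.0, 0.3)]
    print(company_polarity(docs))
```

Discarding the weaker of the two scores in each document keeps only the dominant sentiment, so a document with mixed but mostly negative wording contributes a purely negative value to the company average.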
(a) Apple: Stock prices plotted on zero axis. (b) Apple: Closer perspective of stock prices.
Fig. 1. Stock price predictions using multiple models
TABLE II
MODEL PERFORMANCE FOR A FEW COMPANIES AND ALL MODELS
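The model comparison is reported in terms of the mean absolute percentage error (MAPE), where lower values are better. The formula is not spelled out in this part of the paper, so the Python sketch below assumes the standard definition, the mean of |actual - predicted| / |actual| expressed as a percentage. The function name `mape` and the sample prices are hypothetical.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (standard definition assumed)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical closing prices and model predictions for four days.
    closing = [100.0, 102.0, 101.0, 105.0]
    forecast = [99.0, 103.0, 100.0, 107.0]
    print(f"MAPE: {mape(closing, forecast):.2f}%")
```

Because each error is divided by the actual price, the same absolute error weighs more heavily on a low-priced stock, which is consistent with the observation that the models perform worse on low-priced and highly volatile stocks.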