
Final Dzuo Tding Vfang PDF

This document describes a project to predict stock market movements using time series data and Twitter sentiment analysis. The authors aim to determine the most accurate machine learning model (SVM, logistic regression, neural networks) for the prediction task and whether incorporating sentiment data improves results. They analyze S&P 500 data from 2008-2010 to classify daily movements as up, down, or unchanged. Initial models using 1, 5, 10, and 30 days of past market data achieve around 40-50% accuracy. Sentiment is analyzed using a Naive Bayes classifier trained on manually labeled tweets. The authors aim to compare prediction performances with and without the added sentiment attribute.

Stock Market Prediction based on Time Series Data and Market Sentiment

Tina Ding
Northwestern University
Apt 301, 1940 Sherman Avenue
Evanston, IL 60201
1-847-702-4609
xiaotianding1.2013@u.northwestern.edu

Vanessa Fang
Northwestern University
2133 1/2 Ridge Ave, #2D
Evanston, IL 60201
1-703-405-0688
vanessafang2014@u.northwestern.edu

Daniel Zuo
Northwestern University
626 University Place
Evanston, IL 60201
1-847-220-3178
pengzuo2014@u.northwestern.edu

ABSTRACT
In this project, we aim to create a system that predicts stock market movements on a given day, based on time series data and market sentiment analysis. We will use Twitter data from that day to predict the market sentiment, and S&P 500 values to perform analysis on historical data.

1. INTRODUCTION
Stock market prediction has always been an interesting topic among researchers. We found the idea of combining market data with public sentiment to predict market movement particularly interesting when addressing this topic. We believe that such a combination could help predict stock market movement more accurately. We seek to find the most relevant historical data attributes, the best learning method, and whether the addition of a public sentiment attribute is helpful in the prediction of stock market movement.

2. OVERVIEW

2.1 Objective
Given the time series data and Twitter data from January 2008 to April 2010, we will construct a system to predict stock market movement (up, same, down) on a given day, based on the key attributes computed from the historical data of the past few days and on market sentiment. Given these inputs, the model is expected to output a prediction of S&P index movement for a given day.

We would like to train the system on three models and compare their performances. Furthermore, we would like to determine whether the addition of a public sentiment attribute is beneficial in the prediction of market movement. To help us examine these problems, we raise three questions:

1. Which classification method is the most accurate in predicting market movement on a given day: SVM, Logistic Regression, or Neural Networks?
2. Is the use of 1, 5, 10, or 30 days of prior market data most helpful in the prediction of market movement?
3. Does the addition of public sentiment, in this case from Twitter, help in the prediction of market movement?

2.2 General Approach
We will train the system on three models: SVM, logistic regression, and neural networks. Afterwards, we will compare the performances of these three learning models, with and without market sentiment, and determine which is the most effective. We will use N-fold cross validation on data from January 2008 to April 2010 to measure the performance of the system.

2.3 Method and Software Usage
We decided to use the Python NLTK (Natural Language Toolkit) for our sentiment analysis [7]. The NLTK is an open source suite that provides useful tools and libraries for text processing. The NLTK package also includes a number of trainable classifiers, including a Naive Bayes classifier with built-in training and classifying methods.

2.4 Dataset
For time series data analysis, we imported the prices for the S&P 500 from January 2008 to April 2010 directly from Yahoo! Finance into an Excel spreadsheet.

For sentiment analysis, we obtained the Twitter Census: Stock Tweets dataset from Infochimps, a privately held company that offers a “data marketplace” that gives users access to public and proprietary data sets [8]. This dataset includes 2.3 million stock tweets. The data set provides the timestamp, ticker symbol, tweet ID, and keywords for each tweet. An example is provided below:

    20090323173524 $TAZ 1376714687 make,money,buy

After inspecting the dataset, we identified a few concerns: 1) the tweets before 2008 were sparse and did not contain representative keywords; 2) many of the extracted keywords are irrelevant for our purposes; 3) the data set does not provide the original text of the tweets and instead features only extracted keywords, resulting in a loss of context.

We decided to modify the raw dataset in order to address some of these concerns. We first removed all tweets before 2008. We then created a bag of 80 words relevant to market movement prediction. Using this bag, we went through the data set and, for each tweet, extracted the timestamp and the keywords that appear in our bag.

We did not address the third concern in this project; however, it is an important one to take into consideration. It is possible that the lack of context for our examples prevents us from accurately determining the sentiment of each tweet and each day.

To train the Naive Bayes classifier in NLTK, we manually prepared a training set that contains 295 labeled tweets. An example instance is provided:

    $SINA in short now, seems not much interest. -1
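The preprocessing described in Section 2.4 can be illustrated with a minimal sketch. It assumes that the fields of each record are separated by whitespace, as in the example record above, and it uses a small placeholder word set, because the 80 words of the actual bag are not listed in this paper. The names `RELEVANT_WORDS`, `parse_record`, and `filter_record` are hypothetical and only serve the illustration.

```python
from datetime import datetime

# Placeholder for the 80-word bag; the real word list is not given in the paper.
RELEVANT_WORDS = {"buy", "sell", "short", "long", "money", "bull", "bear"}


def parse_record(line):
    """Split one 'timestamp ticker tweet_id keywords' record into its fields."""
    timestamp, ticker, tweet_id, keywords = line.split(None, 3)
    return {
        "time": datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        "ticker": ticker,
        "tweet_id": tweet_id,
        "keywords": [k for k in keywords.strip().split(",") if k],
    }


def filter_record(record, bag=RELEVANT_WORDS, first_year=2008):
    """Drop tweets before first_year; otherwise keep the timestamp and bag keywords."""
    if record["time"].year < first_year:
        return None
    kept = [k for k in record["keywords"] if k in bag]
    return record["time"], kept


example = parse_record("20090323173524 $TAZ 1376714687 make,money,buy")
filtered = filter_record(example)
print(filtered)
```

With this placeholder bag, the example record keeps `money` and `buy` and drops `make`. Whether tweets with no remaining keywords should be kept is not stated in the text, so the sketch leaves them in with an empty keyword list.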
3. ANALYSIS & RESULTS

3.1 Time Series Data Analysis
Based on the S&P movement chart, we found that the magnitude of index movement varies considerably and that more than a quarter of the movements are within [-5, 5]. Therefore, we believed that labeling movements into 3 categories would provide more useful information on market movement. Since the magnitude of a movement within [-5, 5] is relatively small, we chose to label it as “same” or “not moving.” We thus have 3 labels for S&P index movement: Up, Down, and Same (relative to the previous day).

Figure 1. Market Movement Histogram

Label         Up        Same      Down      Total
Count         186       137       175       498
(Percentage)  (37.35%)  (27.51%)  (35.14%)  (100%)
Table 1. Data distribution

We used the following attributes to predict S&P movement. We chose our attributes based on “Prediction of Closing Stock Prices” [6]. For each attribute, there is a threshold or criterion that indicates whether the S&P will go up or down on the next day. Since we have 3 classes of index movement, we decided to change the number of labels of each attribute from 2 to 3.

Based on the original threshold values and criteria, we divided each attribute into three classes: we label it as “-1” if it strongly indicates the market will go down, “1” if it strongly indicates the market will go up, and “0” if it gives no strong indication. We chose the threshold values so that each attribute is approximately evenly divided among the three classes.

We also wanted to compare the performance of using data from different time periods. Therefore, we adjusted the calculations of the attributes so that their values reflect the information of a certain time period. Here are the attributes computed for t-day period data:

Attribute 1: On Balance Volume-Movement
On Balance Volume (OBV) measures buying and selling pressure as a cumulative indicator that adds volume on up days and subtracts volume on down days.

Attribute 2: Price Momentum Oscillator-Movement
Momentum measures the amount that a financial instrument's price has changed over a given timeframe.

Attribute 3: Relative Strength Index-Movement
The RSI is classified as a momentum oscillator, measuring the velocity and magnitude of directional price movements. It is intended to chart the current and historical strength or weakness of a stock or market based on the closing prices of a recent trading period.

Attribute 4: Stochastic Oscillator
The Stochastic Oscillator is a momentum indicator that shows the location of the close relative to the high-low range over a set number of periods.

Attribute 5: Weighted Moving Average-Movement (WMA-Movement)
A weighted moving average (WMA) has the specific meaning of weights that decrease in arithmetical progression.

Results:

          SVM (RBF)  Logistic  Neural
1-day     43.37%     42.37%    39.56%
5-day     51.00%     49.20%    48.19%
10-day    39.36%     43.78%    40.76%
30-day    43.57%     42.97%    40.56%
Table 2. Initial results from time series data analysis

We observed that using 5-day data returns the best result, and that SVM outperformed the other two models in all cases except the 10-day timeframe, where it had the lowest accuracy.

3.2 Sentiment Analysis
We decided to use a simple approach, a Naive Bayes classifier, to analyze sentiment in the tweet data set. The Python NLTK provides a built-in Naive Bayes classifier, which can be trained on a labeled feature set and then used to classify future instances.

After training our Naive Bayes classifier on the training set, we ran the classifier on each tweet in our data set. For each tweet, the classifier outputs a score in the range (0, 1). For each day, we computed the average score over all the tweets and used the average score as a benchmark to label each tweet: we labeled the tweets with a score higher than the average as positive, and the tweets with a score lower than the average as negative. In the end, we divided the number of positive tweets by the total number of tweets within a day to get the daily sentiment score.

Figure 2. Example output from NLTK Naive Bayes Classifier

3.3 Combined Analysis
Before incorporating the sentiment scores into the original time series data analysis, we labeled the sentiment scores as up, down or
same. We set the cutoffs to 0.52 and 0.56 such that any day with a sentiment score lower than 0.52 is labeled down, between 0.52 and 0.56 is labeled same, and above 0.56 is labeled up.

After adding the labeled sentiment results into the original model, we obtained the results presented in the following table.

          SVM (RBF)  Logistic  NN
1-day     40.21%     44.12%    42.01%
5-day     51.88%     50.42%    48.13%
10-day    43.75%     41.46%    38.54%
30-day    48.54%     46.04%    42.92%
Table 3. Results from combined analysis

From the figures in Table 3, we observed that, in 8 out of the 12 cases, the performance was improved by adding in the sentiment analysis. Overall, there is an average increase of 1.11 percentage points in classification correctness among the 12 cases.

The 5-day timeframe still generated the best performance among the four timeframes we considered, and SVM remained the best model except in the 1-day timeframe.

We also observed that the classification correctness improved the most for SVM in the 10-day and 30-day timeframes. A possible explanation is that, in time series data analysis, the historical data from a 10-day or 30-day timeframe was not very relevant to current-day market movement, and thus the more relevant sentiment results helped to improve the performance more.

3.4 Statistical Tests
We did a one-tailed independent t-test on the differences in prediction accuracy between learning models, between timeframes, and with and without the sentiment attribute. We found that: 1) the accuracies of the 3 models are not significantly different; 2) the 5-day timeframe returned significantly higher accuracies than the other timeframes, at significance levels between 0.005 and 0.1; 3) the labeled sentiment results significantly improve the accuracies of SVM on the 10-day and 30-day timeframes at a significance level of 0.1, but have an insignificant impact in the other cases.

4. CONCLUSION AND FUTURE WORK
Taking our results and analysis into consideration, we can now answer the three questions posed earlier in the paper.

1. SVM appeared to be the most accurate learning model for predicting market movement, but the statistical tests showed that SVM is not significantly better than logistic regression.
2. Across all three learning methods, 5 days of prior data achieved the highest percentage of correctly classified instances and is statistically better than the other timeframes.
3. In most cases, the addition of the Twitter sentiment analysis results appeared to improve performance moderately, but the improvement is only statistically significant when using SVM on 10 and 30 days of prior data.

For future work, there are three aspects that can be improved on. Firstly, the data set on which we performed sentiment analysis provided only keywords instead of the original tweets. This lack of context may have affected the accuracy of our sentiment analysis. Secondly, the small training set that we built can be improved. With only 295 training examples, the learned model is far from accurate and comprehensive. Thirdly, the sentiment analysis method we used from the Python NLTK is a very simple and preliminary textual sentiment analysis tool. There are many more sophisticated tools available that may yield more accurate results.

We believe that with better sentiment analysis tools, a data set providing more context for words, and a larger training set, it may be possible to further increase the accuracy of the methods presented.

5. REFERENCES
[1] J. Bollen and H. Mao. Twitter mood as a stock market predictor. IEEE Computer, 44(10):91–94. http://arxiv.org/pdf/1010.3003.pdf
[2] V. H. Shah. 2007. Machine Learning Techniques for Stock Prediction. Foundations of Machine Learning, New York University. http://www.vatsals.com/Essays/MachineLearningTechniquesforStockPrediction.pdf
[3] T. B. Trafalis and H. Ince. Support vector machine for regression and applications to financial forecasting. IJCNN2000, 348-353. http://www.svms.org/regression/TrIn00.pdf
[4] H. Yang, L. Chan, and I. King. Support vector machine regression for volatile stock market prediction. Proceedings of the Third International Conference on Intelligent Data Engineering and Automated Learning, 2002. http://www.cse.cuhk.edu.hk/~lwchan/papers/ideal2002.pdf
[5] M. Cohen, P. Damiani, S. Durandeu, R. Navas, H. Merlino, E. Fernandez. Sentiment analysis in microblogging: a practical implementation. Red de Universidades con Carreras en Informática (RedUNCI), pp. 191-200. http://sedici.unlp.edu.ar/bitstream/handle/10915/18642/Documento_completo.pdf?sequence=1
[6] G. Garner. Prediction of Closing Stock Prices. Course project for Engineering Data Analysis and Modeling at Portland State University, Fall term, 2004. http://web.cecs.pdx.edu/~edam/Reports/2004/Garner.pdf
[7] S. Bird, E. Loper, and E. Klein. Natural Language Processing with Python. O’Reilly Media Inc., 2009. http://nltk.org/
[8] http://www.infochimps.com/
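As a supplementary illustration of Sections 3.2 and 3.3, the following is a minimal sketch of how a daily sentiment score could be derived from per-tweet classifier scores and labeled with the 0.52 and 0.56 cutoffs, followed by a recomputation of the accuracy changes between Table 2 and Table 3. The function names and the sample scores are hypothetical; the accuracy values are copied from the two tables. How scores exactly equal to the daily average or to a cutoff were handled is not stated in the text, so the choices below are assumptions.

```python
def daily_sentiment_score(tweet_scores):
    """Fraction of a day's tweets whose score is above that day's average score."""
    average = sum(tweet_scores) / len(tweet_scores)
    positive = sum(1 for s in tweet_scores if s > average)  # ties count as negative (assumption)
    return positive / len(tweet_scores)


def label_sentiment(score, low=0.52, high=0.56):
    """Map a daily sentiment score to down / same / up using the stated cutoffs."""
    if score < low:
        return "down"
    if score > high:
        return "up"
    return "same"  # includes scores exactly on a cutoff (assumption)


# Accuracies in percent, column order: SVM (RBF), Logistic, Neural network.
TABLE_2 = {"1-day": (43.37, 42.37, 39.56), "5-day": (51.00, 49.20, 48.19),
           "10-day": (39.36, 43.78, 40.76), "30-day": (43.57, 42.97, 40.56)}
TABLE_3 = {"1-day": (40.21, 44.12, 42.01), "5-day": (51.88, 50.42, 48.13),
           "10-day": (43.75, 41.46, 38.54), "30-day": (48.54, 46.04, 42.92)}


def accuracy_changes(before, after):
    """Per-cell change in percentage points after adding the sentiment attribute."""
    return {frame: tuple(round(a - b, 2) for b, a in zip(before[frame], after[frame]))
            for frame in before}


changes = accuracy_changes(TABLE_2, TABLE_3)
flat = [delta for row in changes.values() for delta in row]
improved = sum(1 for delta in flat if delta > 0)
mean_change = sum(flat) / len(flat)

day_score = daily_sentiment_score([0.1, 0.6, 0.7, 0.8])  # made-up per-tweet scores
print(day_score, label_sentiment(day_score))
print(improved, round(mean_change, 2))
```

On the table values, the sketch reproduces the two summary figures reported in Section 3.3: 8 improved cases out of 12 and a mean change of 1.11 percentage points.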
