Twitter Sentiment Analysis System
ABSTRACT
Social media is increasingly used by humans to express their feelings and opinions in the form of short text messages. Detecting sentiment in text has a wide range of applications, including identifying anxiety or depression in individuals and measuring the well-being or mood of a community. Sentiments can be expressed in many observable ways, such as facial expressions and gestures, speech, and written text. Sentiment analysis in text documents is essentially a content-based classification problem involving concepts from the domains of Natural Language Processing as well as Machine Learning. In this paper, sentiment recognition based on textual data and the techniques used in sentiment analysis are discussed.

Keywords
Machine Learning, Python, Social Media, Sentiment Analysis

1. INTRODUCTION
What do you do when you want to express yourself or reach out to a large audience? We log on to one of our favorite social media services. Social media has taken over in today's world: most of the methods we use to connect and communicate run through social networks, and Twitter is one of the major places where we express our sentiments about a specific topic or concept.

Twitter serves as a means for individuals to express their thoughts and feelings about different subjects. [1] These emotions are used in various analytics for a better understanding of humans. [2] In this paper, we have attempted to conduct sentiment analysis on "tweets" using different machine learning algorithms. We attempt to classify the polarity of a tweet as either positive or negative. If the tweet has both positive and negative elements, the more dominant sentiment should be picked as the final label.

2. PROBLEM STATEMENT & APPLICABILITY
Since the advent of the Internet, humans have used it as a communication tool, mostly in the form of text messages and nowadays also video and audio streams, and as we increase our dependence on technology it becomes increasingly important to better gauge human sentiments expressed through it. However, in textual communication we lose access to the sentiments or emotions conveyed behind a sentence, since we often use our hands and facial expressions to express the intent behind a statement. From this textual data we can gain insights into the individual, insights which can be used for many purposes such as content recommendation based on current mood, market segmentation analysis, and psychological analysis of humans. [3]

In this project, we have attempted to classify human sentiment into two categories, namely positive and negative. This helps us better understand human thinking and gives us insight which can be used in the variety of ways stated above.

3. PROPOSED METHODOLOGY
In this paper we classify sentiments with the help of machine learning and natural language processing (NLP) algorithms. We use a dataset from Kaggle which was crawled from the internet and labeled positive/negative. The data provided comes with emoticons, usernames and hashtags, which are required to be processed and converted into a standard form. We also need to extract useful features from the text, such as unigrams and bigrams, which are a form of representation of the "tweet". We then use various machine learning algorithms based on NLP to conduct sentiment analysis using the extracted features. Finally, we report our experimental results and findings at the end.

3.1 Data Description
The dataset is provided as comma-separated values (csv) files containing "tweets" and their corresponding sentiments. The training dataset is a csv file of the form tweet_id, sentiment, tweet, where tweet_id is a unique integer identifying the tweet, sentiment is either 1 (positive) or 0 (negative), and tweet is the tweet text enclosed in double quotes. Similarly, the test dataset is a csv file of the form tweet_id, tweet. The dataset is a mixture of words, emoticons, symbols, URLs and references to people, as usually seen on Twitter. Words and emoticons contribute to predicting the sentiment, but URLs and references to people do not; therefore, URLs and references are ignored. The words also include misspelled words, extra punctuation, and words with many repeated letters. The "tweets" therefore must be preprocessed to standardize the dataset. The provided training and test datasets have 800000 and 200000 tweets respectively. Preliminary statistical analysis of the contents of the datasets, after preprocessing as described in section 3.2, is shown in tables 1 and 2.
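To make the file format concrete, a small sketch (in Python, using the standard csv module) of parsing the training file; the two rows below are invented toy examples, not tweets from the actual dataset:

```python
import csv
import io

# Toy stand-in for the training file described above: the real training set
# has 800000 rows of the form tweet_id, sentiment, tweet, where sentiment
# is 1 (positive) or 0 (negative). The rows here are invented examples.
raw = io.StringIO(
    '1,0,"this update is terrible :("\n'
    '2,1,"i love this song :)"\n'
)
rows = [(int(tweet_id), int(label), text)
        for tweet_id, label, text in csv.reader(raw)]
labels = [label for _, label, _ in rows]  # [0, 1]
```

The csv reader takes care of the quoting around the tweet text, so commas inside tweets do not break the parse.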
International Journal of Computer Applications (0975 – 8887)
Volume 180 – No.47, June 2018
3.2 Preprocessing
Raw tweets scraped from Twitter generally result in a noisy and obscure dataset. This is due to the casual and ingenious nature of people's usage of social media. Tweets have certain special characteristics, such as retweets, emoticons and user mentions, which have to be suitably extracted. Therefore, raw Twitter data has to be normalized to create a dataset which can be easily learned by various classifiers. We have applied an extensive number of pre-processing steps to standardize the dataset and reduce its size. We first do some general pre-processing on tweets as follows:

● Convert the tweet characters to lowercase.
● Replace 2 or more dots (.) with a space.
● Strip spaces and quotes (" and ’) from the ends of the tweet.
● Replace 2 or more spaces with a single space.

We handle special Twitter features as follows:

3.2.1 Uniform Resource Locator (URL)
Users often share hyperlinks to other webpages in their tweets. Any particular URL is not important for text classification, as it would lead to very sparse features and incorrect classification. Therefore, we replace all URLs in tweets with the word URL. The regular expression used to match URLs is ((www\.[\S]+)|(https?://[\S]+)).

3.2.2 User Mention
Every Twitter user has a handle associated with them. Users often mention other users in their tweets by @handle. We replace all user mentions with the word USER_MENTION. The regular expression (regex) used to match user mentions is @[\S]+.

3.2.3 Emoticon
Users often use several different emoticons in their tweets to convey different emotions. It is impossible to exhaustively match all the different emoticons used on social media, as their number is ever increasing. However, we match some common emoticons which are used very frequently. We replace the matched emoticons with either EMO_POS or EMO_NEG depending on whether the emoticon conveys a positive or a negative emotion. A list of all emoticons matched by our method is given in table 3.
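The general clean-up steps and the URL, user-mention and emoticon replacements above can be sketched with Python's re module. This is a minimal sketch, not the authors' code: only the laugh and sad emoticon classes are shown, the laugh pattern is adapted to already-lowercased text, and the ordering of the steps is an assumption.

```python
import re

# Minimal sketch of the tweet-level normalization of section 3.2.
# Only two emoticon classes are handled; the laugh regex from Table 3
# is adapted here to run on lowercased text.
def preprocess_tweet(tweet: str) -> str:
    tweet = tweet.strip(' "\'')                     # strip spaces/quotes from the ends
    tweet = tweet.lower()                           # lowercase
    tweet = re.sub(r'\.{2,}', ' ', tweet)           # 2 or more dots -> space
    tweet = re.sub(r'(www\.[\S]+)|(https?://[\S]+)', 'URL', tweet)
    tweet = re.sub(r'@[\S]+', 'USER_MENTION', tweet)
    tweet = re.sub(r'(:\s?d|:-d|x-?d)', 'EMO_POS', tweet)           # laugh
    tweet = re.sub(r'(:\s?\(|:-\(|\)\s?:|\)-:)', 'EMO_NEG', tweet)  # sad
    tweet = re.sub(r' {2,}', ' ', tweet)            # collapse repeated spaces
    return tweet
```

For example, preprocess_tweet('@bob   check www.example.com :D') yields 'USER_MENTION check URL EMO_POS'.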
Table 3: List of Emoticons matched by our method

Emoticon(s)                    | Type  | Regex                      | Replacement
:D, : D, :-D, xD, x-D, XD, X-D | Laugh | (:\s?D|:-D|x-?D|X-?D)      | EMO_POS
;-), ;), ;-D, ;D, (;, (-;      | Wink  | (;-?\)|;-?D|\(-?;)         | EMO_POS
<3, :*                         | Love  | (<3|:\*)                   | EMO_POS
:-(, : (, :(, ):, )-:          | Sad   | (:\s?\(|:-\(|\)\s?:|\)-:)  | EMO_NEG
:,(, :’(, :"(                  | Cry   | (:,\(|:’\(|:"\()           | EMO_NEG

3.2.4 Hashtag
Hashtags are un-spaced phrases prefixed by the hash symbol (#), frequently used by users to mention a trending topic on Twitter. We replace all hashtags with the word following the hash symbol. For example, #hello is replaced by hello. The regular expression used to match hashtags is #(\S+).

3.2.5 Retweet
Retweets are tweets which have already been sent by someone else and are shared by other users. Retweets begin with the letters RT. We remove RT from the tweets as it is not an important feature for text classification. The regular expression used to match retweets is \brt\b.

After applying tweet-level pre-processing, we process the individual words of tweets as follows:

● Strip any punctuation [’"?!,.():;] from the word.
● Convert 2 or more letter repetitions to 2 letters. Some people send tweets like "I am sooooo happpppy", adding multiple characters to emphasize certain words. This step handles such tweets by converting them to "I am soo happy".
● Remove - and ’. This handles words like t-shirt and their’s by converting them to the more general forms tshirt and theirs.
● Check if the word is valid and accept it only if it is. We define a valid word as a word which begins with an alphabet, with successive characters being alphabets, numbers, or one of dot (.) and underscore (_).

Some example tweets from the training dataset and their normalized versions are shown in table 4.

Table 4: Example Tweets from the Dataset and their Normalized Version

Raw: misses Swimming Class. http://plurk.com/p/12nt0b
Raw: @98PXYRochester HEYYYYYYYYY!! its Fer from Chile again
Normalized: USER_MENTION heyy its fer from chile again
Raw: Sometimes, You gotta hate #Windows updates.
Normalized: sometimes you gotta hate windows updates
Raw: @Santiago_Steph hii come talk to me i got candy :)
Normalized: USER_MENTION hii come talk to me i got candy EMO_POS
Normalized: USER_MENTION oh no EMO_NEG r.i.p your bella

3.3 Feature Extraction
We extract two types of features from our dataset, namely unigrams and bigrams. We create a frequency distribution of the unigrams and bigrams present in the dataset and choose the top N unigrams and bigrams for our analysis.

3.3.1 Unigrams
Probably the simplest and most commonly used features for text classification are the presence of single words, or tokens, in the text. We extract single words from the training dataset and create a frequency distribution of these words. A total of 181232 unique words are extracted from the dataset. Of these, most of the words at the tail of the frequency spectrum are noise and occur too few times to influence classification. We therefore only use the top N words to create our vocabulary, where N is 15000 for sparse vector classification.

[Figure: Frequencies of Top 20 Unigrams]
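The unigram and bigram extraction just described can be sketched as follows. The cutoff for bigrams is an assumption here; the text gives N = 15000 only for unigrams.

```python
from collections import Counter

# Sketch of section 3.3: frequency distributions over preprocessed tweets,
# keeping the top-N unigrams and bigrams as the vocabulary.
# n_bigrams is an assumed value; the text specifies 15000 only for unigrams.
def build_vocab(tweets, n_unigrams=15000, n_bigrams=10000):
    unigram_freq, bigram_freq = Counter(), Counter()
    for tweet in tweets:
        words = tweet.split()
        unigram_freq.update(words)
        bigram_freq.update(zip(words, words[1:]))
    top_unigrams = [w for w, _ in unigram_freq.most_common(n_unigrams)]
    top_bigrams = [b for b, _ in bigram_freq.most_common(n_bigrams)]
    return top_unigrams, top_bigrams
```

Words outside the top-N lists are simply ignored when tweets are later vectorized, which is what discards the noisy tail of the frequency spectrum.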
[Figure: Bigram Frequency]
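Given a top-N vocabulary, the presence and frequency feature types referred to below can be sketched as follows; this is an illustrative stand-in, with a plain list in place of an actual sparse vector format:

```python
# Sketch of presence/frequency feature vectors over a fixed vocabulary.
# A real implementation would use a sparse matrix; a dense list is used
# here for clarity.
def vectorize(tweet, vocab, mode="presence"):
    index = {word: i for i, word in enumerate(vocab)}
    vec = [0] * len(vocab)
    for word in tweet.split():
        if word in index:
            vec[index[word]] = 1 if mode == "presence" else vec[index[word]] + 1
    return vec
```

For example, with vocab ['good', 'bad', 'movie'], the tweet 'good good movie' maps to [1, 0, 1] with presence features and [2, 0, 1] with frequency features.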
4.1 Algorithms

4.1.1 Baseline (Evaluation Metric)
For a baseline, we use a simple positive and negative word counting method to assign sentiment to a given tweet. We use the Opinion Dataset of positive and negative words to classify tweets. In cases where the numbers of positive and negative words are equal, we assign positive sentiment.

4.1.2 Naïve Bayes
Naive Bayes classifiers are a family of simple "probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features." [5] Naive Bayes is a simple model which can be used for text classification. In this model, the class $c$ is assigned to a tweet $t$, where

$c = \underset{c}{\operatorname{argmax}}\; P(c \mid t)$

$P(c \mid t) \propto P(c) \prod_{i=1}^{n} P(f_i \mid c)$

In the formulas above, $f_i$ represents the $i$-th feature of a total of $n$ features. $P(c)$ and $P(f_i \mid c)$ can be obtained through maximum likelihood estimates. We used MultinomialNB from the sklearn.naive_bayes package of scikit-learn for Naive Bayes classification. We used the Laplace-smoothed version of Naive Bayes with the smoothing parameter α set to its default value of 1. We used a sparse vector representation for classification and ran experiments using both presence and frequency feature types. We found that presence features outperform frequency features, because Naive Bayes is essentially built to
work better on integer features rather than floats. We also observed that the addition of bigram features improves the accuracy.

4.1.3 Maximum Entropy
The Maximum Entropy classifier model is based on the Principle of Maximum Entropy. The main idea behind it is to choose the most uniform probabilistic model that maximizes the entropy, subject to the given constraints. [6] Unlike Naive Bayes, it does not assume that features are conditionally independent of each other, so we can add features like bigrams without worrying about feature overlap. The model is represented by

$P_{ME}(c \mid d, \lambda) = \dfrac{\exp\left[\sum_i \lambda_i f_i(c, d)\right]}{\sum_{c'} \exp\left[\sum_i \lambda_i f_i(c', d)\right]}$

Here, $c$ is the class, $d$ is the tweet and $\lambda$ is the weight vector. The weight vector is found by numerical optimization of the lambdas so as to maximize the conditional probability.

The nltk library provides several text analysis tools. We use its MaxentClassifier to perform sentiment analysis on the given tweets. Unigrams, bigrams and a combination of both were given as input features to the classifier. The Improved Iterative Scaling algorithm for training provided better results than Generalized Iterative Scaling.

5. EVALUATION METRICS
For evaluation, we use the baseline algorithm, which applies the simple positive and negative word counting method to assign sentiment to a given tweet: we use the Golden Dataset of positive and negative words to classify tweets, and in cases where the numbers of positive and negative words are equal, we assign positive sentiment. A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. We can use these predictions to measure the baseline performance (e.g. accuracy); this metric then becomes what we compare any other machine learning algorithm against.

6. CONCLUSION
We tried to build a sentiment analysis system by studying and implementing machine learning algorithms. We implemented Naive Bayes and Maximum Entropy classifiers. The baseline model performed the worst, as expected, since it uses the fewest features. The modular system we have built can easily be extended with new algorithms, be they from machine learning, deep learning or natural language processing. Sentiment analysis is an active field of research, and we can still further improve our system by working more on the algorithms, trying out different approaches in preprocessing and checking which ones yield the best precision metrics.

6.1 Future Work
6.1.1 Handling Emotion Ranges
We can improve and train our models to handle a range of sentiments. Tweets do not always carry positive or negative sentiment; at times they may have no sentiment at all, i.e. they are neutral. Sentiment can also have gradations: the sentence "This is good." is positive, but the sentence "This is extraordinary." is somewhat more positive than the first. We could therefore classify sentiment in ranges, say from -2 to +2.

6.1.2 Using Symbols
During our pre-processing, we discard most symbols, such as commas, full stops and exclamation marks. These symbols may be helpful in assigning sentiment to a sentence.

7. ACKNOWLEDGMENTS
We wish to express our thanks to Prof. Deshpande, Dr. Ghadekar and our families, as without their constant support and guidance this project would not have been possible.

8. REFERENCES
[1] R. Plutchik, "Emotions: A general psychoevolutionary theory," in K. R. Scherer & P. Ekman (Eds.), Approaches to Emotion. Hillsdale, NJ: Lawrence Erlbaum Associates, 1984.
[2] P. Basile, V. Basile, M. Nissim, N. Novielli, V. Patti, "Sentiment Analysis of Microblogging Data," to appear in Encyclopedia of Social Network Analysis and Mining, Springer, in press.
[3] J. Bollen, H. Mao, A. Pepe, "Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena," in International AAAI Conference on Weblogs and Social Media (ICWSM'11), 2011.
[4] C. Cortes, V. N. Vapnik, "Support-vector networks," Machine Learning, 20(3): 273–297, 1995.
[5] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach (2nd ed.), Prentice Hall, 2003. ISBN 978-0137903955.
[6] W. H. Greene, Econometric Analysis (7th ed.), Boston: Pearson Education, 2012, pp. 803–806. ISBN 978-0-273-75356-8.
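As a closing illustration, the word-counting baseline described in sections 4.1.1 and 5 can be sketched as follows; the two word sets here are toy stand-ins for the opinion lexicon actually used, and the function is an illustrative reconstruction rather than the authors' code:

```python
# Toy stand-ins for the positive/negative opinion lexicon used by the baseline.
POSITIVE = {"good", "love", "happy", "great"}
NEGATIVE = {"bad", "hate", "sad", "terrible"}

def baseline_sentiment(tweet: str) -> int:
    """Count lexicon hits in the tweet; ties are assigned positive, per the paper."""
    words = tweet.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 1 if pos >= neg else 0  # 1 = positive, 0 = negative
```

Despite its simplicity, this kind of heuristic gives a floor against which the Naive Bayes and Maximum Entropy results can be compared.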