
International Journal of Computer Applications (0975 – 8887)

Volume 180 – No.47, June 2018

Twitter Sentiment Analysis System

Shaunak Joshi
Department of Information Technology
Vishwakarma Institute of Technology
Pune, Maharashtra, India

Deepali Deshpande
Department of Information Technology
Vishwakarma Institute of Technology
Pune, Maharashtra, India

ABSTRACT
Social media is increasingly used by people to express their feelings and opinions in the form of short text messages. Detecting sentiment in text has a wide range of applications, including identifying anxiety or depression in individuals and measuring the well-being or mood of a community. Sentiment can be expressed in many ways, such as facial expressions and gestures, speech, and written text. Sentiment analysis of text documents is essentially a content-based classification problem involving concepts from the domains of Natural Language Processing and Machine Learning. In this paper, sentiment recognition based on textual data and the techniques used in sentiment analysis are discussed.

Keywords
Machine Learning, Python, Social Media, Sentiment Analysis

1. INTRODUCTION
What do we do when we want to express ourselves or reach out to a large audience? We log on to one of our favorite social media services. Social media has taken over in today's world; most of the ways we connect and communicate run through social networks, and Twitter is one of the major places where we express our sentiments about a specific topic or concept.

Twitter serves as a means for individuals to express their thoughts or feelings about different subjects. [1] These emotions are used in various analytics for a better understanding of humans. [2] In this paper, we attempt to conduct sentiment analysis on "tweets" using different machine learning algorithms. We attempt to classify the polarity of each tweet as either positive or negative. If a tweet has both positive and negative elements, the more dominant sentiment should be picked as the final label.

2. PROBLEM STATEMENT & APPLICABILITY
Since the advent of the Internet, humans have used it as a communication tool, mostly in the form of text messages and, nowadays, video and audio streams. As our dependence on technology grows, it becomes increasingly important to gauge the human sentiments expressed through it. In textual communication, however, we lose access to the sentiments or emotions conveyed behind a sentence, since we often use our hands and facial expressions to express the intent behind a statement. From this textual data, we can gain insights into the individual: insights that can serve multiple uses, such as content recommendation based on current mood, market segmentation analysis, and psychological analysis. [3]

In this project, we attempt to classify human sentiment into two categories, positive and negative, which helps us better understand human thinking and yields insights that can be used in the variety of ways stated above.

3. PROPOSED METHODOLOGY
In this paper we classify sentiments with the help of machine learning and natural language processing (NLP) techniques. We use a dataset from Kaggle which was crawled from the internet and labeled positive/negative. The data provided comes with emoticons, usernames, and hashtags, which need to be processed (so as to be readable) and converted into a standard form. We also need to extract useful features from the text, such as unigrams and bigrams, which form a representation of the "tweet". We then use various machine learning algorithms to conduct sentiment analysis using the extracted features. Finally, we report our experimental results and findings.

3.1 Data Description
The dataset is provided as comma-separated values (CSV) files containing "tweets" and their corresponding sentiments. The training dataset is a CSV file of the form tweet_id, sentiment, tweet, where tweet_id is a unique integer identifying the tweet, sentiment is either 1 (positive) or 0 (negative), and tweet is the tweet text enclosed in inverted commas (double quotes). Similarly, the test dataset is a CSV file of the form tweet_id, tweet. The dataset is a mixture of words, emoticons, symbols, URLs, and references to people, as usually seen on Twitter. Words and emoticons contribute to predicting the sentiment, but URLs and references to people do not, and are therefore ignored. The words are also a mixture of misspelled or incorrect words, extra punctuation, and words with many repeated letters. The "tweets" must therefore be preprocessed to standardize the dataset. The provided training and test datasets have 800000 and 200000 tweets respectively. Preliminary statistical analysis of the contents of the datasets, after preprocessing as described in section 3.2, is shown in tables 1 and 2.
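For concreteness, a minimal Python sketch of loading this data is shown below. The file name is an assumption, and the column order tweet_id, sentiment, tweet follows the description above; a header row, if present, would need to be skipped.

    import csv

    def load_train(path="train.csv"):
        """Load a training CSV of the form tweet_id, sentiment, tweet (sketch)."""
        tweets, labels = [], []
        with open(path, newline="", encoding="utf-8") as f:
            for tweet_id, sentiment, tweet in csv.reader(f):
                tweets.append(tweet)
                labels.append(int(sentiment))  # 1 = positive, 0 = negative
        return tweets, labels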


Table 1: Statistics of Preprocessed Train Dataset
(Average and Maximum are counts per tweet)

                   Total      Unique     Average    Maximum    Positive   Negative
    Tweets         800000     N/A        N/A        N/A        400312     399688
    User Mentions  393392     N/A        0.4917     12         N/A        N/A
    Emoticons      6797       N/A        0.0085     5          5807       990
    URLs           38698      N/A        0.0484     5          N/A        N/A
    Unigrams       9823554    181232     12.279     40         N/A        N/A
    Bigrams        9025707    1954953    11.28      N/A        N/A        N/A

Table 2: Statistics of Preprocessed Test Dataset
(Average and Maximum are counts per tweet)

                   Total      Unique     Average    Maximum    Positive   Negative
    Tweets         200000     N/A        N/A        N/A        N/A        N/A
    User Mentions  97887      N/A        0.4894     11         N/A        N/A
    Emoticons      1700       N/A        0.0085     10         1472       228
    URLs           9553       N/A        0.0478     5          N/A        N/A
    Unigrams       2457216    78282      12.286     36         N/A        N/A
    Bigrams        2257751    686530     11.29      N/A        N/A        N/A

3.2 Preprocessing
Raw tweets scraped from Twitter generally yield a noisy and obscure dataset, owing to the casual and inventive nature of people's social media usage. Tweets have certain special characteristics, such as retweets, emoticons, and user mentions, which must be suitably extracted. Raw Twitter data must therefore be normalized to create a dataset that can be easily learned by various classifiers. We apply an extensive number of preprocessing steps to standardize the dataset and reduce its size. We first do some general preprocessing on tweets, as follows:

● Convert the tweet characters to lowercase.
● Replace 2 or more dots (.) with a space.
● Strip spaces and quotes (" and ') from the ends of the tweet.
● Replace 2 or more spaces with a single space.

We handle special Twitter features as follows:

3.2.1 Uniform Resource Locator (URL)
Users often share hyperlinks to other webpages in their tweets. Any particular URL is not important for text classification, as it would lead to very sparse features and incorrect classification. Therefore, we replace all URLs in tweets with the word URL. The regular expression used to match URLs is ((www\.[\S]+)|(https?://[\S]+)).

3.2.2 User Mention
Every Twitter user has a handle associated with them. Users often mention other users in their tweets by @handle. We replace all user mentions with the word USER_MENTION. The regular expression (regex) used to match user mentions is @[\S]+.

3.2.3 Emoticon
Users often use several different emoticons in their tweets to convey different emotions. It is impossible to exhaustively match all the different emoticons used on social media, as their number is ever increasing. However, we match some common emoticons which are used very frequently. We replace the matched emoticons with either EMO_POS or EMO_NEG, depending on whether they convey a positive or a negative emotion. A list of all emoticons matched by our method is given in table 3.


Table 3: List of Emoticons matched by our method

    Emoticon(s)                       Type    Regex                               Replacement
    :), : ), :-), (:, ( :, (-:, :')   Smile   (:\s?\)|:-\)|\(\s?:|\(-:|:\'\))     EMO_POS
    :D, : D, :-D, xD, x-D, XD, X-D    Laugh   (:\s?D|:-D|x-?D|X-?D)               EMO_POS
    ;-), ;), ;-D, ;D, (;, (-;         Wink    (;-?\)|;-?D|\(-?;)                  EMO_POS
    <3, :*                            Love    (<3|:\*)                            EMO_POS
    :-(, : (, :(, ):, )-:             Sad     (:\s?\(|:-\(|\)\s?:|\)-:)           EMO_NEG
    :,(, :'(, :"(                     Cry     (:,\(|:\'\(|:"\()                   EMO_NEG

Table 4: Example Tweets from the Dataset and their Normalized Version

    Raw:        misses Swimming Class. http://plurk.com/p/12nt0b
    Normalized: misses swimming class URL

    Raw:        @98PXYRochester HEYYYYYYYYY!! its Fer from Chile again
    Normalized: USER_MENTION heyy its fer from chile again

    Raw:        Sometimes, You gotta hate #Windows updates.
    Normalized: sometimes you gotta hate windows updates

    Raw:        @Santiago_Steph hii come talk to me i got candy :)
    Normalized: USER_MENTION hii come talk to me i got candy EMO_POS

    Raw:        @bolly47 oh no :'( r.i.p. your bella
    Normalized: USER_MENTION oh no EMO_NEG r.i.p your bella

3.2.3 Hashtag
Hashtags are unspaced phrases prefixed by the hash symbol (#), frequently used by users to mention a trending topic on Twitter. We replace each hashtag with the same word without the hash symbol. For example, #hello is replaced by hello. The regular expression used to match hashtags is #(\S+).
3.2.4 Retweet
Retweets are tweets which have already been sent by someone else and are shared by other users. Retweets begin with the letters RT. We remove RT from the tweets, as it is not an important feature for text classification. The regular expression used to match retweets is \brt\b.

After applying tweet-level pre-processing, we process the individual words of tweets as follows (the steps are consolidated in the code sketch below):

● Strip any punctuation ['"?!,.():;] from the word.
● Convert 2 or more letter repetitions to 2 letters. Some people send tweets like I am sooooo happpppy, adding multiple characters to emphasize certain words; such tweets are handled by converting them to I am soo happy.
● Remove - and '. This handles words like t-shirt and their's by converting them to the more general forms tshirt and theirs.
● Check whether the word is valid and accept it only if it is. We define a valid word as one which begins with a letter, with each successive character being a letter, a number, or one of dot (.) and underscore (_).

Some example tweets from the training dataset and their normalized versions are shown in table 4.
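Here is that consolidated sketch: a minimal Python version of the tweet-level and word-level steps, using the regular expressions quoted in sections 3.2.1 to 3.2.4. For brevity it includes only the Smile and Sad emoticon rows of table 3, and the exact ordering of steps is an assumption.

    import re

    def preprocess_tweet(tweet):
        """Normalize a raw tweet using the rules of section 3.2 (sketch)."""
        tweet = tweet.lower()                                             # lowercase
        tweet = re.sub(r'((www\.[\S]+)|(https?://[\S]+))', 'URL', tweet)  # URLs
        tweet = re.sub(r'@[\S]+', 'USER_MENTION', tweet)                  # user mentions
        tweet = re.sub(r'#(\S+)', r'\1', tweet)                           # drop # from hashtags
        tweet = re.sub(r'\brt\b', '', tweet)                              # remove retweet marker
        tweet = re.sub(r'\.{2,}', ' ', tweet)                             # 2+ dots -> space
        tweet = tweet.strip(' "\'')                                       # strip ends
        # two emoticon rows from table 3 (not the full list)
        tweet = re.sub(r'(:\s?\)|:-\)|\(\s?:|\(-:|:\'\))', 'EMO_POS', tweet)
        tweet = re.sub(r'(:\s?\(|:-\(|\)\s?:|\)-:)', 'EMO_NEG', tweet)
        tweet = re.sub(r'\s{2,}', ' ', tweet)                             # squeeze spaces
        words = []
        for word in tweet.split():
            word = word.strip('\'"?!,.():;')                 # strip punctuation
            word = re.sub(r'(.)\1{2,}', r'\1\1', word)       # sooooo -> soo
            word = word.replace('-', '').replace('\'', '')   # t-shirt -> tshirt
            if re.match(r'^[a-zA-Z][a-zA-Z0-9._]*$', word):  # keep valid words only
                words.append(word)
        return ' '.join(words)

    print(preprocess_tweet("@98PXYRochester HEYYYYYYYYY!! its Fer from Chile again"))
    # -> USER_MENTION heyy its fer from chile again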
3.3 Feature Extraction
We extract two types of features from our dataset, namely unigrams and bigrams. We create a frequency distribution of the unigrams and bigrams present in the dataset and choose the top N unigrams and bigrams for our analysis.

3.3.1 Unigrams
Probably the simplest and most commonly used features for text classification are the presence of single words or tokens in the text. We extract single words from the training dataset and create a frequency distribution of these words. A total of 181232 unique words were extracted from the dataset. Of these, most of the words at the tail end of the frequency spectrum are noise and occur too few times to influence classification. We therefore use only the top N words to create our vocabulary, where N is 15000 for sparse vector classification.

Figure 1: Statistics of Unigram Occurrence Frequency (bar chart of the frequencies of the top 20 unigrams)


3.3.2 Bigrams
Bigrams are pairs of words which occur in succession in the corpus. These features are a good way to model negation in natural language, as in the phrase This is not good. A total of 1954953 unique bigrams were extracted from the dataset. Of these, most of the bigrams at the tail end of the frequency spectrum are noise and occur too few times to influence classification. We therefore use only the top 10000 bigrams to create our vocabulary.

Figure 2: Statistics of Bigram Occurrence Frequency (bar chart of the frequencies of the top 20 bigrams)
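A minimal sketch of this feature-selection step is shown below. The 15000 and 10000 cutoffs are the values quoted above; tweets are assumed to be preprocessed, whitespace-separated strings.

    from collections import Counter

    def build_vocab(tweets, n_unigrams=15000, n_bigrams=10000):
        """Frequency distributions over the corpus; keep only the top-N features."""
        unigram_freq, bigram_freq = Counter(), Counter()
        for tweet in tweets:
            words = tweet.split()
            unigram_freq.update(words)
            bigram_freq.update(zip(words, words[1:]))  # successive word pairs
        top_unigrams = [w for w, _ in unigram_freq.most_common(n_unigrams)]
        top_bigrams = [b for b, _ in bigram_freq.most_common(n_bigrams)]
        return top_unigrams, top_bigrams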
4. MODEL
Our model uses both the training and test datasets with various algorithms; due to the modular nature of the program, we can add and remove algorithms with ease. The workflow of the system is shown in figure 3. First, we split the data into a training and a test set. We also keep separate positive and negative pre-labelled datasets for training the model and checking its generalization on the test set. The training data is then fed to a machine learning algorithm such as Naive Bayes, Maximum Entropy, or SVM [4], which learns to make predictions. To evaluate the system, we use baseline classification as our evaluation reference: the test data is fed to the learned algorithm, which in turn generates its predictions. With the help of the pre-classified golden set and the evaluation metric, we check the accuracy of our model.

Fig 3: Machine Learning Model for the Twitter Sentiment Analysis System
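The split step of figure 3 can be sketched as follows, reusing the hypothetical load_train() loader from the earlier sketch; the 80/20 ratio mirrors the 800000/200000 sizes of the provided datasets.

    from sklearn.model_selection import train_test_split

    tweets, labels = load_train()  # hypothetical loader from the data-description sketch
    train_tweets, test_tweets, train_labels, test_labels = train_test_split(
        tweets, labels, test_size=0.2, random_state=42)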

4.1 Algorithms
4.1.1 Baseline (Evaluation Metric)
For a baseline, we use a simple positive and negative word counting method to assign sentiment to a given tweet. We use the Opinion Dataset of positive and negative words to classify tweets. In cases where the numbers of positive and negative words are equal, we assign positive sentiment.

4.1.2 Naïve Bayes
Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. [5] Naive Bayes is a simple model which can be used for text classification. In this model, the class c^* assigned to a tweet t is

    c^* = \arg\max_c P(c \mid t)

    P(c \mid t) \propto P(c) \prod_{i=1}^{n} P(f_i \mid c)

In the formulas above, f_i represents the i-th feature of n total features; P(c) and P(f_i \mid c) can be obtained through maximum likelihood estimates. We used MultinomialNB from the sklearn.naive_bayes package of scikit-learn for Naive Bayes classification, with the Laplace-smoothed version of Naive Bayes and the smoothing parameter α set to its default value of 1. We used a sparse vector representation for classification and ran experiments using both presence and frequency feature types. We found that presence features outperform frequency features, because Naive Bayes is essentially built to work better on integer features rather than floats. We also observed that the addition of bigram features improves the accuracy.
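A minimal sketch of this setup follows, assuming a preprocessed corpus and the top-N vocabulary from section 3.3; to_presence_matrix is an illustrative helper, not from the paper.

    from scipy.sparse import lil_matrix
    from sklearn.naive_bayes import MultinomialNB

    def to_presence_matrix(tweets, vocab):
        """Sparse 0/1 presence features over the chosen vocabulary (sketch)."""
        index = {word: i for i, word in enumerate(vocab)}
        X = lil_matrix((len(tweets), len(vocab)))
        for row, tweet in enumerate(tweets):
            for word in set(tweet.split()):
                if word in index:
                    X[row, index[word]] = 1
        return X.tocsr()

    # train_tweets, train_labels, test_tweets, vocab: assumed from earlier steps
    clf = MultinomialNB(alpha=1.0)  # Laplace smoothing, alpha left at its default of 1
    clf.fit(to_presence_matrix(train_tweets, vocab), train_labels)
    predicted = clf.predict(to_presence_matrix(test_tweets, vocab))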
4.1.3 Maximum Entropy
The Maximum Entropy classifier model is based on the Principle of Maximum Entropy. The main idea behind it is to choose the most uniform probabilistic model that maximizes the entropy, given the constraints. [6] Unlike Naive Bayes, it does not assume that features are conditionally independent of each other, so we can add features like bigrams without worrying about feature overlap. The model is represented by

    P_{ME}(c \mid d, \lambda) = \frac{\exp\left[\sum_i \lambda_i f_i(c, d)\right]}{\sum_{c'} \exp\left[\sum_i \lambda_i f_i(c', d)\right]}

Here, c is the class, d is the tweet, and \lambda is the weight vector. The weight vector is found by numerical optimization of the lambdas so as to maximize the conditional probability.

The nltk library provides several text analysis tools. We use its MaxentClassifier to perform sentiment analysis on the given tweets. Unigrams, bigrams, and a combination of both were given as input features to the classifier. The Improved Iterative Scaling algorithm for training provided better results than Generalized Iterative Scaling.
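A minimal sketch of the nltk-based classifier described above is given below; the presence-feature dict and the training pairs are illustrative assumptions, and 'iis' selects the Improved Iterative Scaling algorithm.

    from nltk.classify import MaxentClassifier

    def presence_features(tweet, vocab):
        """Unigram presence features as an nltk featureset dict (sketch)."""
        words = set(tweet.split())
        return {word: (word in words) for word in vocab}

    # train_pairs: assumed list of (preprocessed_tweet, label) tuples
    train_toks = [(presence_features(t, y_vocab := vocab), y) for t, y in train_pairs]
    classifier = MaxentClassifier.train(train_toks, algorithm='iis', max_iter=10)
    label = classifier.classify(presence_features('i love this EMO_POS', vocab))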
5. EVALUATION METRICS
For evaluation, we use the baseline algorithm, which applies the simple positive and negative word counting method to assign sentiment to a given tweet, using the Golden Dataset of positive and negative words to classify tweets. In cases where the counts of positive and negative words are equal, we assign positive sentiment. A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. We can use these predictions to measure the baseline performance (e.g. accuracy); this metric then becomes what we compare any other machine learning algorithm against.
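A minimal sketch of this counting baseline is shown below; the two word sets are tiny stand-ins for the full positive/negative opinion lexicon the paper uses.

    # stand-in lexicon; the paper uses a full list of positive and negative words
    positive_words = {"good", "great", "happy", "love"}
    negative_words = {"bad", "sad", "hate", "awful"}

    def baseline_sentiment(tweet):
        """Count lexicon hits; ties are assigned positive, as in the paper."""
        words = tweet.split()
        pos = sum(w in positive_words for w in words)
        neg = sum(w in negative_words for w in words)
        return 1 if pos >= neg else 0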
6. CONCLUSION
We built a sentiment analysis system by studying and implementing machine learning algorithms, specifically Naive Bayes and Maximum Entropy. As expected, the baseline model performed the worst, since it relies on the fewest features. The modular system we have built can easily be extended with new algorithms, be they from machine learning, deep learning, or natural language processing. Sentiment analysis is an active field of research, and we can still improve our system by working more on the algorithms, trying out different approaches in preprocessing, and checking which ones yield the best precision metrics.

6.1 Future Work
6.1.1 Handling Emotion Ranges
We can improve and train our models to handle a range of sentiments. Tweets do not always carry positive or negative sentiment; at times they may carry no sentiment, i.e. neutral. Sentiment can also have gradations: the sentence This is good is positive, but the sentence This is extraordinary is somewhat more positive than the first. We could therefore classify sentiment in ranges, say from -2 to +2.

6.1.2 Using Symbols
During preprocessing, we discard most symbols, such as commas, full stops, and exclamation marks. These symbols may be helpful in assigning sentiment to a sentence.

7. ACKNOWLEDGMENTS
We wish to express our thanks to Prof. Deshpande, Dr. Ghadekar, and our families; without their constant support and guidance this project would not have been possible.

8. REFERENCES
[1] R. Plutchik, "Emotions: A general psychoevolutionary theory," in K. R. Scherer & P. Ekman (Eds.), Approaches to Emotion. Hillsdale, NJ: Lawrence Erlbaum Associates, 1984.
[2] P. Basile, V. Basile, M. Nissim, N. Novielli, and V. Patti, "Sentiment Analysis of Microblogging Data," in Encyclopedia of Social Network Analysis and Mining, Springer. In press.
[3] J. Bollen, H. Mao, and A. Pepe, "Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena," in International AAAI Conference on Weblogs and Social Media (ICWSM'11), 2011.
[4] C. Cortes and V. N. Vapnik, "Support-vector networks," Machine Learning, 20(3): 273-297, 1995.
[5] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (2nd ed.), Prentice Hall, 2003. ISBN 978-0137903955.
[6] W. H. Greene, Econometric Analysis (7th ed.), Boston: Pearson Education, 2012, pp. 803-806. ISBN 978-0-273-75356-8.

