Abstract- Data available online in the form of text coming from various sources such as social media comments, product reviews, customer queries and search engine queries is a huge source of information, and extracting the required knowledge from it is valuable to business. However, processing and inferring meaningful information from this unstructured data is a highly challenging task. The first and foremost reason is that data obtained from social networking comments and customer reviews is usually grammatically incorrect. Further, it lacks sufficient statistical information to support many state-of-the-art approaches. Finally, such texts are complex and ambiguous, contain misspelled words and are generated in enormous volume, which further increases the difficulty of handling them. After studying multiple methods proposed recently in the field of text analysis, it is observed that in order to infer the actual meaning of short text it is essential to have semantic knowledge. This work proposes a prototype system that uses a deep neural network for text processing. The proposed system has two phases, namely Model Building and Live Testing. In the first phase, a Keras sequential model is built and trained on the Yelp dataset of business reviews. In Live Testing, the user input query is processed in real time to derive its semantics and sentiment, which can be either positive, negative or neutral. For semantic information the proposed method uses the Simple Lesk algorithm, and the sentiment is derived using the Keras model built in the first phase. The proposed method is tested on Ebay customer review data and the results are compared with some state-of-the-art methods, namely TextBlob, VADER analysis and SentiWordNet (SWN) analysis. The results show our method is more effective than SentiWordNet analysis and almost as effective as VADER analysis, but less effective than TextBlob.

Keywords- Text processing, Simple Lesk, Keras sequential model.
I. INTRODUCTION

In today's world everyone is connected through the internet, as smartphones and internet access have become so cheap that almost everyone can afford them. With the increasing use of the internet, the amount of data generated in the form of short text is enormous. The sources are chat applications (WhatsApp, Telegram, etc.), social networking sites (Facebook, Twitter, Instagram), online customer reviews, internet blogs, search engine queries, news titles and so on. Analyzing this data and accurately deriving useful information from it is crucial to many business applications such as e-commerce, spam filtering, web search engines and chatbots. However, analyzing short texts is a difficult and challenging task. The primary reason is that short texts usually do not follow language grammar. They may be ambiguous and noisy, with random word placement. Further, in today's fast-changing world they are generated in huge volume, which makes it complex and time consuming to perform tasks such as information extraction, clustering and classification. These challenges give rise to a significant amount of ambiguity and make it extremely difficult to handle and process the available data. Many text analytics approaches have been proposed recently, but most of them face challenges mainly due to the lack of sufficient statistical information. Consider the polysemy of the word "apple": it has different meanings such as a fruit, a tree, a company or a brand. Due to the lack of contextual information, such ambiguous words make it extremely hard for computers to understand short texts.

Typically, there are three phases in understanding short text: segmentation, type detection and sentiment analysis. Text segmentation breaks input text into smaller terms, which can be a word or a phrase. Type detection attaches a meaningful type to each term within the input text. Generally, POS taggers determine lexical types based on grammatical rules. These approaches are inapplicable here, as short texts usually do not follow grammar and lack sufficient contextual information. Further, traditional POS tagging methods cannot distinguish semantic types, which are very important for sentiment analysis. In instance disambiguation, meaningful labels or types are assigned to each term; these concepts are derived from a domain ontology. Sentiment analysis, also referred to as opinion mining, is an approach to NLP that identifies the emotional tone behind a body of text. It helps organizations determine and categorize opinions about their products, services and ideas. Organizations can use sentiment analysis to gather insights from the complex and unstructured data that comes from online sources such as customer reviews, emails, blog posts, support tickets, web chats, social media channels, forums and comments. In addition to identifying sentiment, opinion mining can extract the polarity, i.e. the amount of positivity and negativity, within the text. Furthermore, it can be applied at varying scopes such as the document, paragraph, sentence and sub-sentence levels.

Although these three steps for short text understanding look straightforward, there are many challenges, and new approaches must be introduced to tackle them. Short texts are usually noisy, informal and error-prone. They contain abbreviations, nicknames, misspellings and the like. For example, "New York city" is sometimes referred to as "nyc". This calls for the vocabulary to incorporate as much information about abbreviations and nicknames as possible. Meanwhile, extracting approximate terms is also required to handle spelling mistakes in short texts. The next challenge is ambiguous types, where a term can belong to several types and its best type in a short text depends on context semantics. For example, "watch" in "watch price" refers to a wrist watch and should be labeled as an instance, whereas in "watch movie" it is a verb (the tagger sketch below illustrates this). Short texts are also generated in a much larger volume than whole documents. For example, recent statistics indicate Google now processes over 40,000 search queries every second on average, which translates to over 8.5 billion searches per day and around 3 trillion searches per year worldwide. Twitter generates around 6,000 tweets every second, which corresponds to 500 million tweets per day. Therefore, a feasible method for short text processing should be able to handle short texts both effectively and efficiently. However, a short text can have multiple possible segmentations, a term can be labeled with multiple types, and an instance can refer to hundreds of concepts. Hence, it is extremely difficult and time consuming to eliminate these complexities and achieve the best semantic interpretation for a short text.
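The "watch" ambiguity can be reproduced with any off-the-shelf tagger. A minimal Python sketch, assuming NLTK and its standard tagger data are installed:

# Tag the two "watch" phrases with a generic, grammar-trained POS tagger.
# The tagger assigns lexical tags only; it has no way to say that "watch"
# in "watch price" denotes the wristwatch instance.
import nltk

for phrase in ["watch price", "watch movie"]:
    tokens = nltk.word_tokenize(phrase)
    print(phrase, "->", nltk.pos_tag(tokens))

With so little context, the tagger's output carries no semantic type information, which is exactly the gap the approaches surveyed below try to fill.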
II. LITERATURE SURVEY

The present work in text analysis is mainly focused on tokenization, which splits input text into a set of terms or tokens and assigns part-of-speech tags [1][2][3][4][5][6]. For text segmentation, various vocabulary-based approaches [2][3][4] have been proposed which use online knowledge bases and dictionaries to extract terms. The longest-cover method is one such vocabulary-based method; it searches for the longest matching term in a dictionary to segment the input text. Li et al. [2] proposed an unsupervised method of named entity recognition (NER) for Twitter data which utilizes Wikipedia and a web corpus for segmentation. Hua et al. [1] proposed a trie-based framework which uses a graph to represent candidate segmentation terms and their relationships.
One of the common drawbacks of existing methods for text segmentation is that they only consider lexical features and ignore the semantics within the segmentation. Statistical methods for segmentation count how often two terms occur together in a corpus. The N-gram model [2][3] is one such statistical model: it calculates the frequency with which two or more words occur together in a corpus to decide whether those words can be treated as a single term. Semantic hashing [4] is another approach, which represents text as binary codes that are then used for clustering. However, for short texts such approaches can sometimes yield incorrect information because of the noisy nature of the data.
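As an illustration of the statistical segmentation idea, the following minimal sketch scores word pairs by co-occurrence; the toy corpus, the frequency filter and the PMI ranking are illustrative choices, not taken from the surveyed papers.

# Score bigrams by co-occurrence strength to decide which pairs form terms.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

corpus = ("new york city has great pizza "
          "i love new york "
          "new york never sleeps").split()

finder = BigramCollocationFinder.from_words(corpus)
finder.apply_freq_filter(2)                       # keep pairs seen at least twice
print(finder.nbest(BigramAssocMeasures.pmi, 3))   # ('new', 'york') ranks highly

Pairs that score above a chosen threshold would then be treated as single terms during segmentation.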
In part-of-speech tagging, appropriate lexical types are assigned to individual terms based on their meaning and context. This can be done using grammatical methods, which rely on predefined rules, or statistical approaches [1], which use models trained on a large corpus. The rule-based approach incurs the high cost of constructing production rules but gives stable results, whereas statistical models use learned statistics instead of tagging rules to assign tags, and their results are less stable. Both approaches assume that terms are correctly arranged in the given input, which may not be the case in short text. Song et al. [5] proposed a method of short text understanding that uses the popular knowledge base Probase to obtain real-world concepts and Bayesian inference to build word concept vectors. However, most existing knowledge bases are limited in scale and scope, and most of them do not consider content semantics.

Su et al. [7] proposed a long short-term memory (LSTM) based recurrent neural network model which recognizes text emotion by deriving two word vectors, semantic and emotional. It can detect seven distinct emotion categories: anger, anxiety, boredom, happiness, sadness, disgust and surprise. LSTM models overcome the drawback of traditional recurrent neural networks, namely that they cannot learn long-distance dependencies. Jin et al. [8] proposed a bag-of-words model to process short texts for duplicate detection. It uses Word2vec to derive word vectors, and the Simhash algorithm compares sequences using Hamming distance.
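A minimal sketch of the Simhash-plus-Hamming-distance comparison just described; the hash function, the 64-bit width and the whitespace tokenization are illustrative assumptions (Jin et al. derive their vectors with Word2vec, which is omitted here for brevity).

# Near-duplicate detection: fingerprint each text, then compare fingerprints.
import hashlib

def simhash(tokens, bits=64):
    # Accumulate a signed weight per bit position across all token hashes.
    weights = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps a 1 wherever the accumulated weight is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

a = simhash("cheap phone good battery".split())
b = simhash("good battery cheap phone deal".split())
print(hamming(a, b))  # a small distance suggests near-duplicate short texts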
Table 1: Comparison of various methods proposed for short text understanding

1. Y. Song, H. Wang, Z. Wang, H. Li, W. Chen [5] — Short text conceptualization using Probase (July 2011)
   Approach: finds named entities in the input text and assigns them meaningful labels.
   Limitations: applicability is limited; does not focus on word semantics.

2. C. Li, J. Weng, Q. He, Y. Yao, A. Datta, A. Sun, B.-S. Lee [2] — Named entity recognition in a targeted Twitter stream (August 2012)
   Approach: uses Wikipedia and a web corpus to segment tweets with an n-gram method, then performs NER using an unsupervised model.
   Limitations: applicability is restricted to tweets only.

3. Z. Yu, H. Wang, X. Lin, M. Wang [4] — Short text clustering using a Deep Neural Network (Feb 2016)
   Approach: uses a DNN to convert input text into binary codes; texts are then compared via their binary codes to cluster similar texts.
   Limitations: heavy computation is required; model output depends entirely on the quality of the testing set.

4. W. Hua, Z. Wang, H. Wang, K. Zheng, X. Zhou [1] — Semantic analysis of short text using an online knowledge base and web corpus (March 2017)
   Approach: segmentation is done by applying a Monte Carlo algorithm to the term graph, followed by POS tagging using the Stanford tagger.
   Limitations: totally dependent on the online knowledge base; computing the co-occurrence network is very complex and needs much time and space.

5. M. Su, C. Wu, K. Huang, Q. Hong [7] — Text emotion recognition based on word vectors (May 2018)
   Approach: extracts a semantic word vector and an emotional word vector using a word2vec model and an autoencoder respectively; the concatenated vector is analyzed by an LSTM to recognize text emotion.
   Limitations: results are derived from a limited amount of data; accuracy needs improvement.

6. J. Yang, G. Huang, B. Cai [10] — Short text clustering using TRTD (July 2019)
   Approach: discovers topic-representative terms from individual occurrence frequency and co-occurrence frequency; a simple yet effective method of text clustering.
   Limitations: discovers topic terms based on high frequency counts, which may not always hold.

7. R. Man, K. Lin [11] — Sentiment analysis algorithm based on BERT and a Convolutional Neural Network (April 2021)
   Approach: article feature extraction using BERT and a convolutional neural network.
   Limitations: heavy computation is required; model output depends entirely on the quality of the testing set.
III. PROBLEM STATEMENT

Analyzing textual data available online in multiple forms is crucial to many e-commerce and other business processes. It is important for organizations to accurately analyze their customer reviews, social networking posts, news and chatbot queries, which come in the form of short texts, in order to better understand customer needs and gain a competitive advantage. However, processing this unstructured, complex and huge volume of data is a highly challenging task
as it lacks contextual information. Further, short texts are noisy and may contain abbreviations and ambiguous words, which makes it extremely difficult to infer semantic meaning from them. As a result, traditional natural language processing tools such as part-of-speech tagging and dependency parsing cannot be applied efficiently to short texts.

Given a short text s written in a natural language, we generate a semantic interpretation of s represented as a sequence of typed terms, namely

s = { t_i | i = 1, …, n }

and from this semantic knowledge we determine the sentiment of the input text, which can be either positive, negative or neutral. For example:

Input sentence: "went bank to deposit money"
Output: went [verb], bank [noun - financial institution], deposit [verb], money [noun]
Sentiment: Neutral
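One plausible in-memory representation of such a typed-term interpretation is sketched below; the structure and field names are illustrative, not prescribed by the paper.

# A typed term pairs a surface form with its detected type and, when
# available, its contextual meaning.
from dataclasses import dataclass

@dataclass
class TypedTerm:
    term: str
    type: str        # lexical or semantic type
    gloss: str = ""  # contextual description, when available

interpretation = [
    TypedTerm("went", "verb"),
    TypedTerm("bank", "noun", "a financial institution"),
    TypedTerm("deposit", "verb"),
    TypedTerm("money", "noun"),
]
sentiment = "neutral"  # one of: positive, negative, neutral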
IV. PROPOSED METHODOLOGY

Fig. 1 illustrates the proposed system framework for short text understanding. The proposed system has two phases, namely Model Building and Live Testing. In Model Building, the Yelp dataset, with 44,025 records of customer reviews about different business processes, is used. Initially the dataset is pre-processed by removing English stop words and applying the Snowball stemmer. Bigrams are derived from the cleansed data, and the most frequently co-occurring words are grouped together to form a co-occurrence network. Then a vocabulary index is built using the W2V model [7] based on word occurrence frequency. The Yelp dataset is split into a training set containing 35,220 records and a testing set containing 8,805 records. The vocabulary index is used for converting the textual reviews in the training and testing sets into word embeddings, since the machine learning model understands only numbers. A label binarizer is used to encode the labels in the training and testing sets into sequence vectors. Finally, the input reviews in the form of embedding vectors and the output labels in the form of sequence vectors are fed to a Keras sequential model. The model is trained and fitted with 20 epochs and a batch size of 512 records, yielding a training accuracy of 78.41%.
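A condensed sketch of this Model Building pipeline is given below, assuming TensorFlow/Keras, scikit-learn and NLTK. The vocabulary size, sequence length and layer sizes are illustrative choices, since the paper reports only the epoch count and batch size; the bigram/co-occurrence-network step is omitted, and a plain Keras Tokenizer stands in for the frequency-based vocabulary index.

# Pre-processing: stop-word removal and Snowball stemming.
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

stop = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def clean(text):
    return " ".join(stemmer.stem(w) for w in text.lower().split() if w not in stop)

reviews = ["Great food and friendly staff!",
           "Terrible service, never going back.",
           "It was okay, nothing special."]          # stand-in for the Yelp reviews
labels = ["positive", "negative", "neutral"]
cleaned = [clean(r) for r in reviews]

# Vocabulary index and padded integer sequences for the network input.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=20000)               # vocabulary size: assumption
tokenizer.fit_on_texts(cleaned)
X = pad_sequences(tokenizer.texts_to_sequences(cleaned), maxlen=100)

# Encode the three sentiment labels as one-hot vectors.
from sklearn.preprocessing import LabelBinarizer
binarizer = LabelBinarizer()
y = binarizer.fit_transform(labels)

# LSTM-based Keras sequential model, trained with 20 epochs, batch size 512.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

model = Sequential([
    Embedding(input_dim=20000, output_dim=128),
    LSTM(64),
    Dense(3, activation="softmax"),                  # positive / negative / neutral
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, batch_size=512)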
In the Live Testing phase, the user input text is processed in real time to derive its semantic knowledge and sentiment. Initially the input text is pre-processed to correct misspelled words. Then the text is transformed into word embeddings using the tokenizer and fed to the model to predict its sentiment. Words in the input text are disambiguated using the Simple Lesk algorithm [14] to derive their part-of-speech tags and contextual meaning.
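A minimal sketch of this live-testing step, reusing the tokenizer, model and binarizer from the sketch above. NLTK's classic lesk() is used here as a stand-in for the Simple Lesk variant cited by the paper, and spelling correction is assumed to have been applied already.

# Disambiguate each word against its context, then predict the sentiment.
from nltk import word_tokenize
from nltk.wsd import lesk
from tensorflow.keras.preprocessing.sequence import pad_sequences

query = "went bank to deposit money"
tokens = word_tokenize(query)

for tok in tokens:
    sense = lesk(tokens, tok)   # WordNet sense whose gloss best overlaps the context
    if sense is not None:
        print(tok, "->", sense.pos(), "-", sense.definition())

x = pad_sequences(tokenizer.texts_to_sequences([query]), maxlen=100)
print(binarizer.classes_[model.predict(x).argmax()])   # positive / negative / neutral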
V. RESULT ANALYSIS

Fig 3: Comparison of TextBlob, VADER analysis, SWN analysis and the proposed method based on the Keras model

Fig. 3 shows the sentiment counts predicted by the various methods on the Ebay dataset of customer reviews. We compared our method with three other methods, namely TextBlob, VADER and SentiWordNet analysis.
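For reference, a minimal sketch of how each baseline scores a review; the neutral cut-offs and the SentiWordNet averaging scheme are assumptions, since the paper does not specify them. (The VADER and SentiWordNet scorers require the corresponding NLTK data packages.)

# TextBlob: polarity in [-1, 1].
from textblob import TextBlob
text = "the delivery was quick and the product works great"
print(TextBlob(text).sentiment.polarity)

# VADER: compound score in [-1, 1].
from nltk.sentiment.vader import SentimentIntensityAnalyzer
print(SentimentIntensityAnalyzer().polarity_scores(text)["compound"])

# SentiWordNet: average (positive - negative) score over word senses.
from nltk.corpus import sentiwordnet as swn
scores = [s.pos_score() - s.neg_score()
          for w in text.split() for s in swn.senti_synsets(w)]
print(sum(scores) / len(scores) if scores else 0.0)

A score near zero would be read as neutral, with positive and negative otherwise decided by sign.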
Three labels, namely positive, negative and neutral, are used to denote the sentiment of the input text. The efficiency of these methods is compared using the precision, recall, F1-score and accuracy metrics.
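These metrics can be computed as sketched below; macro averaging over the three classes is an assumption, as the paper does not state which averaging it used.

# Compare gold labels against predictions with the four reported metrics.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = ["positive", "negative", "neutral", "positive"]   # illustrative gold labels
y_pred = ["positive", "negative", "positive", "positive"]  # illustrative predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1-score :", f1_score(y_true, y_pred, average="macro"))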
TextBlob performs best among the methods in comparison: it has the highest accuracy (89%) and F1-score (90%). VADER analysis has the highest precision (93%), which tells us, of all the reviews labeled as positive, how many are actually positive. TextBlob has the highest recall (89%), which tells us, of all the reviews that are actually positive, how many were labeled positive. The proposed method is more effective than SentiWordNet analysis in terms of all four metrics. The proposed method's precision is almost the same as that of TextBlob and VADER analysis, but its recall is lower. F1-score is usually more informative than accuracy, especially with an uneven class distribution. Overall, the results show our method is more effective than SentiWordNet analysis and almost as effective as VADER analysis, but less effective than TextBlob. Hence it needs further improvement to compete with the other state-of-the-art methods.

Fig 4: Precision comparison of TextBlob, VADER analysis, SWN analysis and the proposed method based on the Keras model

Fig 5: Recall comparison of TextBlob, VADER analysis, SWN analysis and the proposed method based on the Keras model

Fig 6: F1-score comparison of TextBlob, VADER analysis, SWN analysis and the proposed method based on the Keras model
VI. CONCLUSION
This paper studies existing work in the field of semantic and sentiment analysis of the short text data available online in the form of tweets, social networking posts, customer reviews, comments, search engine queries and so on, where the text under analysis is short, with a limited number of words. Multiple rule-based as well as statistical methods have been proposed recently in the field of text processing, but all have their own limitations due to multiple challenges. The primary challenges in text analysis are the lack of contextual information, noisy or misspelled words and the enormous volume of data.

In this work a generalized framework is proposed to analyze the semantic and sentiment knowledge of short text data. An effort is made towards an alternative method that understands short texts effectively by exploiting semantic knowledge. The proposed method is divided into two phases, Model Building and Live Testing. In Model Building, the input dataset is first pre-processed to remove URLs and stop words. Misspelled words are corrected and words are reduced to their stems using the Snowball stemmer. A co-occurrence network is then built from the most frequently co-occurring terms, and a vocabulary index is built using the W2V model. The proposed method uses an LSTM-based Keras sequential model, a deep learning network, for determining the sentiment of short text. The model is trained on the Yelp dataset, with 44,025 records of customer reviews about various business processes and sentiments classified as positive, negative and neutral. In Live Testing, user input text is processed in real time to determine its semantics and sentiment. For semantic type detection the proposed method uses the Simple Lesk algorithm, which assigns meaningful types and contextual descriptions to the words in the input. For determining the sentiment, it uses the Keras sequential model trained on the Yelp dataset.

For performance analysis, the Ebay dataset of customer reviews is taken for validation testing. The results of the proposed knowledge-intensive approach are compared with existing state-of-the-art methods, namely TextBlob analysis, VADER analysis and SentiWordNet analysis. The results show our method is more effective than SentiWordNet analysis and almost as effective as VADER analysis, but less effective than TextBlob. Hence it needs further improvement to compete with state-of-the-art methods. To improve the accuracy of the proposed method further, it is advisable to use a more comprehensive dataset. In future, a multi-language model needs to be built which will be both effective and efficient in sentiment analysis.
REFERENCES