
Twitter and Emotions: Exploring Sentiment Detection

José Carmen Morales Castro, Tirtha Prasad Mukhopadhyay, Rafael Guzmán Cabrera (corresponding author)
Universidad de Guanajuato, México
jc.moralescastro@ugto.mx, tirtha@ugto.mx, guzmanc@ugto.mx

John R. Baker
University of Economics and Finance, Vietnam; Shinawatra University, Thailand
drjohnrbaker@yahoo.com

Abstract—Human emotions are often discerned through tone, facial expressions, and gestures in face-to-face interactions. However, the question arises: can sentiment be accurately identified from unstructured text on social networks? In this study, we demonstrate that it is indeed possible. We applied four machine learning methods—Deep Learning, Decision Trees, Naive Bayes, and Support Vector Machines—in two classification scenarios, cross-validation and training/test sets, enhanced by a meta-classifier. Our goal was to identify which combination of classification scenario, learning method, and preprocessing performs best in sentiment analysis. To validate our approach, we used a manually labeled corpus, forming three datasets of different sizes with varying preprocessing techniques. The results underscore the viability and effectiveness of the proposed approach and provide implications for various fields (product development, marketing, political analysis, customer service education, linguistic education).

Keywords—natural language preprocessing, sentiment analysis, machine learning

I. INTRODUCTION

Within the context of the exponential growth of social networks, Twitter stands out as a virtual space where millions of users share their opinions, emotions, and experiences in real time. Such a platform offers a unique window into understanding how people relate to their surroundings, which is why sentiment analysis has become essential for understanding the complexity of human expressions in the digital world [6]. However, sentiment analysis on social networks like Twitter presents various challenges due to the ambiguous nature of the messages. To overcome this ambiguity, employing a multifaceted approach that combines different techniques and methodologies is vital. One example of these techniques is using base classifiers and linguistic resources, which provide a foundation for identifying sentiments and categorizing tweets as positive, negative, or neutral. This can significantly facilitate initial data processing [4]. Such work also involves a meta-classifier that integrates multiple models and approaches to generate more robust and reliable predictions about a tweet's sentiment. Additionally, including a Deep Learning technique allows us to explore complex and non-linear data patterns.

II. PROBLEM STATEMENT

A key challenge is automatically identifying sentiment in unstructured texts, particularly tweets, using an architecture that combines base classifiers and linguistic resources. To address this, we developed automated tools to extract subjective information (opinions or feelings) from natural language texts. This process allows for the generation of structured, processible knowledge for decision-making systems, enabling a better understanding of users' perceptions and facilitating the adoption of strategic measures based on accurate, relevant information.

III. METHODS

To address sentiment in unstructured texts (tweets), we began with an exhaustive review of related work, expanding upon an established design [6], to identify the different types of classifiers, methodologies, and evaluation metrics. This culminated in a research design that allowed us to address the task competitively and efficiently (Fig. 1).

Fig. 1. Research Design.

In the next stage, we selected a suitable database and performed data preprocessing, applying techniques drawn from related research. We utilized a dataset of approximately 163,000 manually labeled tweets, categorized by polarity as positive, negative, or neutral. These tweets were sourced from an archived dataset [see 6]. Our study builds on this previous work by expanding the scope through the introduction of new deep learning models, such as convolutional neural networks (CNNs), and by analyzing additional dimensions of sentiment classification. Furthermore, we enhanced the methodological framework by incorporating additional machine learning models and refining the preprocessing techniques. This extension enables us to explore more advanced methodologies and provides deeper insights into how various approaches affect the accuracy and scalability of sentiment analysis tasks, particularly when working with large datasets. In this case, the aim was to evaluate perceptions of public figures on social networks during a national event (the 2019 general elections in India) and to automatically classify these opinions, facilitating adjustments to public figures' messaging or moderating discourse based on public sentiment.

The texts were labeled with opinion values: (0) for neutral, (1) for positive, and (-1) for negative. After identifying the dataset, we divided it into two subsets for the analysis, consisting of 1,000 and 5,000 tweets, respectively. The choice of these specific sizes facilitated comparison between a smaller and a larger dataset, allowing us to observe the impact of dataset size on model performance. For both subsets, all messages were filtered using relevant keywords, including hashtags and stems. This process was included in our experimental design to ensure consistency and relevance in the data utilized for sentiment analysis.

Afterward, CNNs were used to compare the results. In this context, we chose to implement CNNs in the Weka platform using the WekaDeeplearning4j extension, which is based on the library of the same name and follows a specific procedure that begins with installing the extension in the platform. This extension offers a graphical interface for configuring, training, and evaluating deep learning models, can extract spatial features from data, and offers an API for integration with Java applications. A key feature is its ability to leverage GPUs and distributed clusters, significantly speeding up model training and inference, especially with large datasets [5].

In the next step, once the corpus was obtained and the experiment was undertaken, data preprocessing (Figure 2) was performed. For this stage, a series of steps were taken to standardize the structure of all the tweets in the corpus, facilitating their interpretation during processing.

Fig. 2. Steps followed within preprocessing.

Five steps were taken to carry out the preprocessing. First, stopwords or empty words were removed, i.e., those that have no meaning by themselves [4]. Then, uppercase letters were converted to lowercase to homogenize the corpus. Afterward, the tweets were tokenized, segmenting the text into phrases or words. Next, the tweets were lemmatized to reduce morphological variability and improve the accuracy of text processing. Finally, the information gain technique was applied to measure the relevance of an attribute within a dataset.
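The text-normalization steps can be sketched in Python. The study ran its pipeline in Weka, so this is only an illustrative stand-in: the stopword list, the tokenizer pattern, and the tiny lemma table below are assumptions for the example, not the authors' actual resources.

```python
import re

# Illustrative stand-ins (assumed, not from the paper): a toy stopword list
# and a toy lemma table replacing a real lemmatizer.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on"}
LEMMAS = {"winning": "win", "wins": "win", "voted": "vote", "votes": "vote"}

def preprocess(tweet: str) -> list:
    """Apply the normalization steps described above to one tweet."""
    tweet = tweet.lower()                               # lowercase the text
    tokens = re.findall(r"[#@]?\w+", tweet)             # tokenize into words
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [LEMMAS.get(t, t) for t in tokens]           # lemmatize known forms

print(preprocess("The candidate IS winning votes in the polls"))
# ['candidate', 'win', 'vote', 'polls']
```

A real pipeline would swap in a full stopword list and a proper lemmatizer; the control flow, however, mirrors the five-step order described above (information gain, the fifth step, operates on the resulting attributes rather than on individual tweets).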
Considering the preprocessing phase, both datasets were divided into four files, each at a different preprocessing stage. We termed the first set the baseline; this file received no additional preprocessing, keeping the tweets in their original form without stopword removal, lemmatization, or application of the information gain technique. For the second set, the full preprocessing was carried out: eliminating stopwords, lemmatizing, and applying information gain, resulting in 352 selected attributes for the 1,000-tweet set and 637 attributes for the 5,000-tweet set. For the third set, only the information gain technique was applied to select the most relevant attributes, resulting in 422 attributes for the 1,000-tweet set and 2,399 attributes for the 5,000-tweet set. Finally, the fourth set followed a procedure similar to the second file's, eliminating stopwords and lemmatizing but without applying information gain, generating 3,319 attributes for the 1,000-tweet set and 597 attributes for the 5,000-tweet set.
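Information gain, the attribute-selection criterion used above, can be computed directly as the reduction in class entropy given a feature. A minimal sketch with a binary term-presence feature and toy labels (not the study's actual feature extraction):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """IG(Y; X) = H(Y) - sum over values v of p(X=v) * H(Y | X=v)."""
    gain, n = entropy(labels), len(labels)
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy example: does the presence of a token (1/0) predict tweet polarity?
has_token = [1, 1, 1, 0, 0, 0]
polarity = ["pos", "pos", "pos", "neg", "neg", "neg"]
print(round(info_gain(has_token, polarity), 3))  # 1.0 — perfectly informative
```

Ranking every candidate attribute by this score and keeping the top scorers is what yields the reduced attribute counts reported above.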
Our research used two classification scenarios. The first was 10-fold cross-validation, a method for evaluating predictive models and preventing overfitting: the model is trained on a subset of the data and validated on the remaining folds, and this process is repeated ten times to ensure an accurate evaluation and classification [5]. The second scenario used training and testing sets, where the dataset is divided into two parts, one for training and one for testing. Most of the data were used in training the model, while a smaller portion was allocated for testing and evaluating its performance. The training set adjusted the model using selected data to improve its accuracy on new data, and the testing set evaluated the model's performance, guarding against overfitting by comparing predictions with actual classifications [6].
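The two scenarios amount to different ways of partitioning example indices. The evaluation itself was run in Weka; the sketch below only illustrates the bookkeeping, using contiguous folds for clarity (assumed — fold assignment is typically randomized in practice).

```python
def kfold_indices(n, k=10):
    """Split range(n) into k folds; each fold serves once as the validation
    set while the remaining indices form the training set."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

# First scenario: 10-fold cross-validation over the 1,000-tweet subset.
splits = kfold_indices(1000, k=10)
train, test = splits[0]
print(len(train), len(test))  # 900 100

# Second scenario: a single holdout split, e.g. 80% train / 20% test
# (the paper does not state its exact ratio; 80/20 is an assumption).
cut = int(1000 * 0.8)
train_idx, test_idx = list(range(cut)), list(range(cut, 1000))
```

Averaging a model's score over the ten folds gives the cross-validation estimate; the holdout split instead reports a single score on the reserved test portion.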
In both cases, supervised learning methods were used to classify the comments according to their corresponding labels. These techniques included (1) Support Vector Machines (SVM), a learning-based method that solves classification and regression problems through training and resolution phases, the result of which is a proposed output for an established problem [7]; (2) Naive Bayes (NB), a classifier that calculates the probability of an event given prior information about it, based on Bayes' theorem with additional independence assumptions [8]; and (3) Decision Trees (J48 algorithm), a machine learning algorithm that builds decision trees for classification, selecting at each node the feature with the highest discrimination capacity to divide the dataset into subsets [9]. These techniques proved effective in achieving precise class separation and obtaining high performance in comment classification.

SVM was chosen because of its ability to handle nonlinear data and its effectiveness in classifying short texts such as tweets, where features are not always easily separable. Naive Bayes was selected for its simplicity and speed of training, making it ideal for processing large volumes of data quickly. Decision Trees (J48) provide a clear visualization of decision
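The meta-classifier that combines these base models is described only at a high level. One common way to realize it, assumed here for illustration (the paper's actual combination rule may differ), is a simple majority vote over the per-model labels:

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse one predicted label per base model (e.g., SVM, NB, J48, CNN)
    into a single sentiment; ties resolve to the earliest-seen label,
    following Counter's insertion-order tie-breaking."""
    return Counter(predictions).most_common(1)[0][0]

# One tweet, one label from each of four base classifiers.
votes = ["positive", "positive", "neutral", "positive"]
print(majority_vote(votes))  # positive
```

Weighted voting or stacking (training a second-level model on the base models' outputs) are the usual refinements when a plain majority proves too coarse.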
and provides deeper insights into how various approaches processing. Finally, the information gain technique was
affect the accuracy and scalability of sentiment analysis tasks, applied to measure the relevance of an attribute within a
particularly when working with large datasets. In this case, to dataset.
evaluate perceptions of public figures on social networks
during a national event (2019 general elections in India) and Considering the preprocessing phase, both datasets were
to automatically classify these opinions, facilitating divided into four files, each with different preprocessing
adjustments to public figures' messaging or moderating stages. We termed the first set baseline; this file did not
discourse based on public sentiment. experience additional preprocessing, keeping the tweets in
their original form without stopword removal, lemmatization,
The texts were labeled with opinion values: (0) for neutral, or application of information gain techniques. For the next set,
(1) for positive, and (-1) for negative. After identifying the preprocessing was carried out. This consisted of eliminating
dataset, we divided it into two subsets for the analysis: one stopwords, lemmatizing, and applying information gain,
consisting of 1,000 tweets and the other of 5,000 tweets. The resulting in 352 selected attributes for the set of 1000 data and
choice of these specific sizes was to facilitate comparison 637 attributes for the set of 5000 data. For the third set, only
between a smaller and larger dataset, allowing us to observe the information gain technique was applied to select the most
the impact of dataset size on model performance. For both relevant attributes, resulting in 422 attributes for the 1000 data
subsets, all messages were filtered using relevant keywords, set and 2399 attributes for the 5000 data set. Finally, a similar
including hashtags and stems. This process was part of our procedure was undertaken for the fourth set and the second
experimental design to ensure consistency and relevance in file. This resulted in the elimination of stop words and
the data utilized for sentiment analysis. lemmatization, generating 3319 attributes for the set of 1000
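The five preprocessing steps described above were performed on the Weka platform; as a language-agnostic illustration, they can be sketched in Python. The stopword list and the suffix-stripping "lemmatizer" below are toy stand-ins (a real pipeline would use a full stopword list and a dictionary-based lemmatizer), the example tweet is invented, and the sketch tokenizes before removing stopwords, which is the usual practical ordering:

```python
import re

# Toy stand-ins: a real pipeline would use a complete stopword list and a
# dictionary-based lemmatizer rather than naive suffix stripping.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def lemmatize(token: str) -> str:
    """Crude suffix stripping, standing in for true lemmatization."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(tweet: str) -> list[str]:
    text = tweet.lower()                                 # lowercase the corpus
    tokens = re.findall(r"[a-z#@']+", text)              # tokenize into words
    tokens = [t for t in tokens if t not in STOPWORDS]   # drop stopwords
    return [lemmatize(t) for t in tokens]                # reduce variability

print(preprocess("The candidates ARE debating the new policies"))
# ['candidat', 'debat', 'new', 'polici']
```

The tokens produced this way form the attribute (term) space over which the later attribute-selection and classification stages operate.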
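The information gain technique scores each candidate attribute by how much knowing its presence reduces uncertainty about the class label; Weka computes this internally (its InfoGainAttributeEval evaluator). The following minimal Python sketch, over an invented four-tweet corpus, shows the underlying calculation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(term, docs, labels):
    """Reduction in class entropy from splitting on the presence of `term`."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (with_term, without) if part)
    return entropy(labels) - remainder

# Invented labelled corpus (token sets): 1 = positive, -1 = negative
docs = [{"great", "win"}, {"great", "rally"}, {"fraud", "loss"}, {"fraud", "bad"}]
labels = [1, 1, -1, -1]

print(information_gain("great", docs, labels))  # 1.0: perfectly separates classes
print(information_gain("rally", docs, labels))  # ~0.31: only weakly informative
```

Keeping only the highest-scoring terms is what reduces the attribute counts reported for the information-gain file sets.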
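The two evaluation scenarios, 10-fold cross-validation and a separate train/test split, can be illustrated with a small index-partitioning sketch. This is plain Python for clarity (Weka provides both splitters internally), and the 80/20 holdout ratio is illustrative; the paper does not state the exact proportion used:

```python
import random

def kfold_indices(n: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists; each example is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def holdout_indices(n: int, test_ratio: float = 0.2, seed: int = 0):
    """Single train/test split: most data trains, the rest evaluates."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - test_ratio))
    return idx[:cut], idx[cut:]

# Over 1000 examples, the 10 test folds cover the whole dataset exactly once.
tested = [j for _, test in kfold_indices(1000) for j in test]
assert sorted(tested) == list(range(1000))

train, test = holdout_indices(1000)
print(len(train), len(test))  # 800 200
```

Averaging a model's score over the ten test folds gives the cross-validation estimate; the holdout split instead reserves one untouched portion for final evaluation.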
Precision was used as the evaluation metric: a performance measure applied to the data retrieved from a set, corpus, or sample space. It is also termed the positive predictive value, representing the proportion of relevant instances among those retrieved, as indicated in Eq. (1):

    Precision = tp / (tp + fp)        (1)

where tp corresponds to a true positive count and fp to a false positive count [2].

Finally, as an additional step, a meta-classifier was implemented that combined the three learning techniques with the best accuracy percentages obtained in the experiments: SVM, Naive Bayes, and Decision Trees.

The results were presented in tables and comparative graphs, showing the best outcome for each set under both classification scenarios on the Weka platform. These highlight the highest precision values, indicating the percentage of correctly classified instances.
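The precision formula in Eq. (1) and the combination of base classifiers can be sketched as follows. The per-model predictions are invented for illustration, and majority voting is one plausible combination scheme; the paper does not specify exactly how the Weka meta-classifier combined the three models:

```python
from collections import Counter

def precision(y_true, y_pred, positive=1):
    """Eq. (1): precision = tp / (tp + fp) for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if tp + fp else 0.0

def majority_vote(per_model_predictions):
    """Combine base classifiers (e.g., SVM, NB, J48) by simple majority."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_model_predictions)]

# Invented predictions on five tweets; labels: 1 = pos, 0 = neutral, -1 = neg
svm = [1, -1, 1, 0, 1]
nb  = [1, -1, -1, 0, 1]
j48 = [1, 1, 1, 0, -1]
y_true = [1, -1, 1, 0, 1]

combined = majority_vote([svm, nb, j48])
print(combined)                      # [1, -1, 1, 0, 1]
print(precision(y_true, combined))   # 1.0
```

In this toy example the vote corrects each base model's single error, which mirrors the intuition behind combining the three classifiers.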
IV. RESULTS

This study aimed to investigate sentiment analysis on Twitter using various machine learning techniques, particularly Support Vector Machines (SVM), Naive Bayes, and Decision Trees. We conducted experiments on two datasets of 1,000 and 5,000 tweets, employing different preprocessing techniques.

Our key findings indicate that SVM consistently outperformed the other classifiers across the various preprocessing methods and dataset sizes. The preprocessing approach utilizing information gain yielded the best results for both datasets. Additionally, the cross-validation and training/testing scenarios revealed similar performance trends, while a meta-classifier combining SVM, Naive Bayes, and Decision Trees improved overall accuracy.

The detailed results are presented in the following figures.

Fig. 1: Comparison of 1000 data Cross-Validation.

Fig. 2: Comparison of 1000 data Training and Testing Sets.

The following figures show the results obtained for the dataset containing 5000 tweets.

Fig. 3: Comparison of 5000 data Cross-Validation.

Fig. 4: Comparison of 5000 data Training and Testing Sets.
Our results demonstrate that sentiment analysis plays a crucial role in extracting information from unstructured texts on Twitter, generating structured knowledge useful for decision-making. The Support Vector Machine (SVM) algorithm consistently obtained the best performance across all experiments, particularly with the preprocessing that included information gain. For the 1000-tweet dataset, this preprocessing resulted in 422 attributes, while for the 5000-tweet dataset it yielded 637 attributes.

V. DISCUSSION AND CONCLUSION

This study demonstrates that classifiers such as SVM, Naive Bayes, and Decision Trees can achieve high accuracy in sentiment classification. The strong performance of these classifiers, especially when combined in a meta-classifier, contributes to the development of automated tools for extracting information from unstructured text. These tools can aid decision-making processes by providing relevant and precise data derived from social media sentiment.

The sentiment analysis techniques explored in this study have significant implications across various fields. In product development, they allow companies to better understand user reactions and make data-driven improvements. In marketing, these tools help gauge public sentiment toward campaigns, enabling real-time strategy adjustments. In political analysis, they offer insights into public opinion, benefiting campaign managers and policymakers. Additionally, organizations can enhance customer service by automating the categorization and prioritization of feedback. Finally, they have important implications for linguistic education, particularly in preparing students who intend to enter these areas.

However, this study has limitations that future research should address. The dataset's focus on specific political figures suggests a need for broader topic exploration to test the generalizability of the methods. Further studies could also examine the effectiveness of these techniques in different languages, analyze temporal aspects, compare traditional and deep learning approaches more comprehensively, and develop systems for real-time sentiment analysis.

REFERENCES

[1] Berrar, D. (2019). Cross-validation [White paper]. Department of Information and Communications Engineering, School of Engineering, Tokyo Institute of Technology, Tokyo, Japan.
[2] Bowers, A. J., & Zhou, X. (2019). Receiver operating characteristic (ROC) area under the curve (AUC): A diagnostic measure for evaluating the accuracy of predictors of education outcomes. Journal of Education for Students Placed at Risk, 24(1), 20-46.
[3] Castro, J. C. M., Carrillo, L. M. L., & Cabrera, R. G. (2022). Identificación de polaridad en Twitter usando validación cruzada. Identidad Energética, 4, 86-90.
[4] Jianqiang, Z., & Xiaolin, G. (2017). Comparison research on text pre-processing methods on Twitter sentiment analysis. IEEE Access, 5, 2870-2879.
[5] Lang, S., Bravo-Marquez, F., Beckham, C., Hall, M., & Frank, E. (2019). WekaDeeplearning4j: A deep learning package for Weka based on Deeplearning4j. Knowledge-Based Systems, 178, 48-50.
[6] Morales-Castro, J. C., Pérez-Crespo, J. A., Prasad-Mukhopadhyay, T., & Guzmán-Cabrera, R. (2022a). Automatic identification of sentiment in unstructured text. 6(15), 22-28.
[7] Morales Castro, W., & Guzmán Cabrera, R. (2020). Tuberculosis: Diagnóstico mediante procesamiento de imágenes. Computación y Sistemas, 24(2), 875-882.
[8] Santana Mansilla, P. F., Costaguta, R. N., & Missio, D. (2014). Aplicación de algoritmos de clasificación de minería de textos para el reconocimiento de habilidades de e-tutores colaborativos. Inteligencia Artificial, 18(54), 2-11.
[9] Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.