Text Classification
1,2,3,5,8 Department of ICT, Osun State University, Nigeria
4,6 Department of Mathematical Sciences, Osun State University, Nigeria
7 Department of Physics, Osun State University, Nigeria
ABSTRACT
The information superhighway provides an important channel for disseminating information to a wide range of audiences. Organizations depend on knowing customer opinions about their products and services, but the volume of such data can be too large to process manually. This study investigates a technique that applies Python programming to collect datasets automatically. Machine learning models based on the Random Forest and Naïve Bayes algorithms are then applied to the collected data for text classification. This process classifies each review as positive, negative, slightly negative, slightly positive, or neutral. The results of the study show that the Random Forest classifier is more effective than the Naïve Bayes algorithm, achieving an accuracy of 76.5% compared with 70.01% for Naïve Bayes. This technique enables organizations to gain insight into customer sentiment.
Keywords: Text classification, Internet community, Random Forest (RF), Insight, Data scraping.
The application of machine learning models to analyze articles was discussed by Rejeb et al. (2024), who describe ChatGPT as an important tool for students and educators. The study indicates ChatGPT's crucial function in improving students' writing tasks and enhancing an interactive learning community. The study also identifies theoretical and practical concerns for applying ChatGPT in educational institutions. Choe et al. (2024) investigate scalable meta learning by introducing SAMA, which integrates classification algorithms and models. When the SAMA algorithm is compared against large-scale learning benchmarks, it produces a reduction in storage requirements. Also, SAMA-based data optimization produces consistent improvements in text classification accuracy. Abubakr et al. (2024) present a comparative analysis between two models for multi-class classification. The result of the deep learning technique yielded an accuracy of 94.95%, compared to 85.71% for the previous model. Mupaikwa (2024) proposed the application of artificial intelligence and machine learning in digital libraries. The technique utilized the K-Nearest Neighbor, Bayesian networks, fuzzy logic, support vector machines, clustering, and classification algorithms. The paper proposed the training of librarians, curriculum reviews, and research on Python-dependent technology for libraries. Büyükkeçeci & Okur (2024) discuss the feature selection technique for selecting features relevant to machine learning functions. This study focused on feature selection and feature selection stability. This technique minimizes dataset size, which plays a role in improving the performance of machine learning models.
Valtonen et al. (2024) proposed a standard research database of unstructured text and encountered representativeness differences between collections of preprocessing and UML-based algorithms that affect research undertakings and transparency. The study calls for contextual representations that focus on these issues and offers recommendations for addressing the contextual suitability of UML in research settings. A review of past research works on text mining was done by Shamshiri et al. (2024). The paper investigates the aims of several research works with specific functions. The findings from this paper enumerate important insights, resulting in further progress in computer network research and its connection to academia and industry.

Duan et al. (2024) proposed measuring a dataset, including social media, to integrate with a system's decision-making process. The system's process depends on several types of data collected from various sources. The research uses text-mining techniques to process Twitter data. The paper applies Naïve Bayes, Random Forest, and XGBoost techniques to classify comments on social media, uses a sampling method to handle imbalances in class distribution, and obtains public opinion about street cleanliness. This research can be applied to other social media platforms, including Facebook. The study can derive costs and provide an understanding of the efficiency of the approach. Umer et al. (2023) propose a CNN model for text classification. The technique was applied to the classification model together with a word-embedding model. In addition, the proposed technique has been applied to Twitter data. The system shows the reliability of the FastText word-embedding approach.

A practical framework is presented in Pal et al. (2023). The paper finds solutions to challenges in research by investigating user comments for selected websites. The study selects the principal variables, known as predictors, and classifies the predictors into two groups depending on their relative importance. The results from the study indicate that time, cost, responsiveness, and accessibility are predictors for producing a significant user experience on the internet. The recommendations from this research will improve quality, resulting in greater user contentment. Kariri et al. (2023) examine the overall body of research on ANNs and provide directions for future research. The research enumerates several articles from various journals using a text-mining technique. The study indicates that research in machine learning is increasing. The study proposed by Kariri et al. (2023) requires the availability of a framework to provide a robust study of ANNs. Abdusalomovna (2023) presents a framework for applying text mining to examine unstructured text in databases and transform it into structured data usable for artificial intelligence (AI) technology.

METHODOLOGY

This research utilizes web scraping as a method for data collection, employing a scraper developed in the Python programming language. This approach is chosen over the conventional method of copy and paste due to its efficiency and time-saving capabilities. Web scraping automates the process, enabling the collection of large volumes of data from websites in a matter of minutes, a task that would otherwise be tedious and time-consuming. It is important to note that web scraping is limited to textual comments and does not include animations or images. The research focuses on gathering data spanning five years of reviews on both perishable and non-perishable food products from Amazon's webpage. A total of 113,683 reviews were collected using this method. Random Forest and Naïve Bayes classifiers were selected for analysis, as they are known to perform well with large datasets.
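The scraper itself is not reproduced in the paper; the following is a minimal sketch of the kind of requests/BeautifulSoup routine described above. The URL, CSS selector, and output file name are placeholders and would need to be adapted to the actual page being scraped.

```python
# Minimal web-scraping sketch (illustrative only; URL and selector are placeholders).
import csv
import requests
from bs4 import BeautifulSoup

def scrape_reviews(url: str) -> list[dict]:
    """Download one review page and extract the textual review blocks."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    reviews = []
    # 'span.review-text' is a hypothetical selector; inspect the real page for the correct one.
    for node in soup.select("span.review-text"):
        text = node.get_text(strip=True)
        if text:  # only textual comments are kept; images and animations are skipped
            reviews.append({"review": text})
    return reviews

if __name__ == "__main__":
    rows = scrape_reviews("https://example.com/product-reviews")  # placeholder URL
    with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["review"])
        writer.writeheader()
        writer.writerows(rows)
```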
The system architecture is illustrated in Figure 1, which provides an overview of the research framework. Data for the proposed architecture are collected with the help of the web scraping method and are later pre-processed (text transformation). The dataset is divided into training and test sets. The training set is input into the algorithms to develop the models. The test set is input into the trained models to predict the results. The output from the models is analyzed to investigate their performance.
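The training code is not listed in the paper; the sketch below shows one conventional scikit-learn realisation of the split/train/predict pipeline described above, assuming the scraped reviews and their sentiment labels are stored in a CSV file (the file name and column names are assumptions).

```python
# Sketch of the split / train / predict pipeline (file and column names are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

data = pd.read_csv("reviews_labelled.csv")  # assumed columns: "review", "sentiment"
X_train, X_test, y_train, y_test = train_test_split(
    data["review"], data["sentiment"], test_size=0.2, random_state=42)

vectorizer = CountVectorizer()               # bag-of-words text transformation
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

for name, model in [("Naive Bayes", MultinomialNB()),
                    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42))]:
    model.fit(X_train_bow, y_train)          # train on the training split
    predictions = model.predict(X_test_bow)  # predict on the held-out test split
    print(name, "accuracy:", accuracy_score(y_test, predictions))
```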
Additionally, the flowchart outlining the data collection and analysis process is presented in Figure 2, offering a visual representation of the methodology employed in the study.

Random forest and Naïve Bayes classifiers

The Random Forest classifier selects its output category based on a majority vote, whereby the most frequently occurring category among the predictions from multiple trees is taken as the final result. This approach ensures robustness and reliability in classification. Moreover, Random Forest classifiers are user-friendly, requiring minimal expertise and programming skills. They are accessible to both experts and novices, making them suitable for individuals without an extensive mathematical background.
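To make the majority-vote rule concrete, the sketch below hand-rolls a small ensemble of decision trees grown on bootstrap samples of synthetic data and takes the most frequent prediction. This is a simplified illustration, not the implementation used in the study.

```python
# Simplified illustration of the majority-vote rule behind a random forest:
# several decision trees are grown on bootstrap samples and their votes are counted.
from collections import Counter
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(15):                                  # grow 15 trees
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap resampling with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0).fit(X[idx], y[idx])
    trees.append(tree)

sample = X[:1]                                       # classify one sample
votes = [int(tree.predict(sample)[0]) for tree in trees]
majority_class, count = Counter(votes).most_common(1)[0]
print(f"votes: {votes} -> predicted class {majority_class} ({count}/{len(trees)} trees)")
```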
The Naïve Bayes classifier is a method based on Bayes' theorem. It operates under the assumption that the presence of a particular feature during classification is independent of the presence of other features. This model is particularly advantageous for handling very large datasets due to its simplicity and ease of implementation. In addition to its simplicity, the Naïve Bayes classifier is well-suited for problems that involve associating objects with discrete categories. It belongs to the group of numerically-based approaches and offers several benefits, including simplicity, speed, and high accuracy. Overall, Naïve Bayes classifiers provide a straightforward and efficient solution for a wide range of classification tasks. Spiteri et al. (2020) describe the Bayes rule as:

γ(α|β) = γ(α) · γ(β|α) / γ(β)   (1)

where α is the specific class, β is the document to be classified, γ(α) and γ(β) are the prior probabilities, and γ(α|β) and γ(β|α) are the posterior probabilities. The value of class α might be positive, slightly negative, negative, or neutral.

A review of a food product can be considered as a document. Verzi & Auger (2021) highlighted that the multinomial model of Naive Bayes effectively captures word frequency information within documents. The Maximum Likelihood Estimate (MLE) determines the most likely value for each parameter given the training data, thereby providing a reliable ratio. This approach helps in accurately estimating the parameters based on the available training data. For the prior probability, this estimate is given as:

γ(α) = Nc / N   (2)

where Nc is the number of documents in class α and N is the total number of documents. The multinomial model assumes that each attribute value is independent of the others given the actual class:

γ(β|α) = γ(φ1, …, φnd | α)   (3)

In the multinomial model, a document is structured as a sequence of word occurrences drawn from the same vocabulary, denoted as V. Each document, denoted as βi, is considered independent of the others. The parameter βi represents the distribution of words within each document, following a multinomial distribution with numerous independent trials. This results in the common bag-of-words (BOW) representation for documents. The BOW model is commonly utilized in document classification tasks, where the frequency of word occurrences serves as features for training classifiers. A unigram feature is employed to indicate the presence of a single word within a text interval. This approach enables the representation of documents based on the occurrence of individual words, facilitating effective classification processes.
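These quantities map directly onto standard library components: CountVectorizer builds the unigram bag-of-words counts and MultinomialNB estimates the class priors and word likelihoods. The toy reviews below are invented for illustration and are not the study's dataset.

```python
# Toy illustration of multinomial Naive Bayes over a bag-of-words representation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good quality dog food", "not as advertised", "great taffy", "my cats are not fans"]
labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()     # unigram bag-of-words: word counts per document
bow = vectorizer.fit_transform(docs)

nb = MultinomialNB()
nb.fit(bow, labels)                # estimates the priors gamma(alpha) and the word likelihoods

new_doc = vectorizer.transform(["great dog food"])
print(nb.predict(new_doc))         # class with the highest posterior probability
print(dict(zip(nb.classes_, nb.predict_proba(new_doc)[0])))
```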
The predicted class is the one that maximizes the posterior probability:

αNB = arg max (αj ∈ α) ∏i γ(βi | αj)   (5)

Consider Table 1 as a dataset comprising product reviews. The objective of the model is to classify these reviews into either positive or negative categories. Table 1 provides an overview of the structure of the dataset, serving as the foundation for the classification process. The prior probability is calculated using Equation 2:

γ(negative) = 2/3   (7)

γ(negative | d4) is the maximum, meaning that the probability of the negative class for document 4 is the highest, so document 4 is classified as negative.
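As a small worked check of the prior estimate, the snippet below computes γ(α) = Nc / N for a toy label set with two negative documents out of three, matching the value of 2/3 above; the label list is invented to mirror that example.

```python
# Prior probabilities gamma(alpha) = Nc / N from a tiny label set
# (three training documents, two of them negative, as in the worked example above).
from collections import Counter

labels = ["negative", "negative", "positive"]
N = len(labels)
priors = {cls: count / N for cls, count in Counter(labels).items()}
print(priors)   # {'negative': 0.666..., 'positive': 0.333...}
```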
Performance evaluation

In this experiment, performance metrics are employed for the algorithm's accuracy analysis. The proposed system is evaluated using several accuracy measures, which include precision, recall, and the F1-score:
1. Precision: this deals with the ability of the classifier not to tag a positive sample as otherwise; how often the classifier is correct each time it predicts a positive, defined as TP / (TP + FP).

2. Recall: this deals with the ability of the classifier to find all positive instances. It is defined as the ratio of true positives to the sum of true positives and false negatives for each class: TP / (TP + FN).

3. F1-score: this is the harmonic mean of precision and recall, used to measure the accuracy of the classifier for each class, computed as (2 × precision × recall) / (precision + recall); see the sketch below.
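The paper does not include code for these metrics; the sketch below computes them from raw counts (the counts are invented for illustration) and cross-checks the results against scikit-learn.

```python
# Precision, recall, and F1 computed from confusion-matrix counts (invented counts).
from sklearn.metrics import precision_score, recall_score, f1_score

TP, FP, FN, TN = 90, 25, 30, 80           # hypothetical counts for one class

precision = TP / (TP + FP)                 # how often a predicted positive is correct
recall = TP / (TP + FN)                    # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# The same values via scikit-learn, reconstructed from per-sample labels:
y_true = [1] * TP + [0] * FP + [1] * FN + [0] * TN
y_pred = [1] * TP + [1] * FP + [0] * FN + [0] * TN
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```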
RESULTS AND DISCUSSION

The dataset used for the experiments contains reviews about perishable and non-perishable food products from Amazon's web page with the labels Positive, Negative, Slightly Positive, Slightly Negative, and Neutral. A sample of the dataset is shown in Table 2.
Table 2. Dataset.

ID  Review                                        Sentiment
1   Good quality dog food                         Positive
2   Not as advertised                             Negative
3   "Delight" says it all                         Slightly positive
4   Cough medicine                                Neutral
5   Great taffy                                   Slightly positive
6   Nice taffy                                    Slightly positive
7   Great! Just as good as the expensive brands!  Positive
8   Wonderful, tasty taffy                        Positive
9   Yay barley                                    Positive
10  Healthy dog food                              Positive
11  The best hot sauce in the world               Positive
12  My cats LOVE this "diet" better than theirs   Positive
13  My cats are not fans of the new food          Negative
14  Fresh and greasy!                             Slightly positive
15  Strawberry Twizzlers - yummy                  Positive
The description of the dataset is given in Table 3.

Experimental results

The experimental results for the two classifiers are presented in the form of confusion matrices, showcasing the counts of true positives, false negatives, true negatives, and false positives. These matrices offer a comprehensive view of the performance of each classifier, as outlined in Table 4. True Positives (TP): tested positive and the review is actually positive. True Negatives (TN): tested negative and the review is actually negative. False Positives (FP): tested positive but the review is not (otherwise known as a "Type I error"). False Negatives (FN): tested negative but the review is actually positive (a "Type II error").
                          Predicted class
                          Negative           Positive
Actual class   Negative   True Neg. (TN)     False Pos. (FP)
               Positive   False Neg. (FN)    True Pos. (TP)
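As an illustration of how such a matrix is obtained, the sketch below builds a binary confusion matrix with scikit-learn from a few invented labels; rows correspond to actual classes and columns to predicted classes, as in the layout above.

```python
# Building a binary confusion matrix (labels are invented for illustration).
from sklearn.metrics import confusion_matrix

y_true = ["negative", "negative", "positive", "positive", "positive", "negative"]
y_pred = ["negative", "positive", "positive", "positive", "negative", "negative"]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
matrix = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])
print(matrix)
```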
Each review is classified as either positive, slightly negative, slightly positive, neutral, or negative. Upon testing one of the reviews from the dataset, the outcome revealed its predicted class. For the data samples, the respective predicted values for the positive class are close to the actual values, which signifies that the model is accurate.
Performance evaluation
The results show more correctly classified samples than incorrectly classified samples for the text classifiers.

Table 9. Percentage accuracy of various classifiers.

Dataset          Classifier       Performance accuracy of classifier
Product review   Naïve Bayes      70.01%
Product review   Random Forest    76.50%
CONCLUSIONS

For many large and mid-sized companies, understanding customer sentiments regarding their products and services is crucial due to the significant impact these sentiments can have on the company's financial performance. In this study, text classification was applied to a dataset comprising product reviews. Both the Naive Bayes classifier and the Random Forest classifier were evaluated, and it was observed that the Random Forest classifier outperformed the Naive Bayes classifier. Going forward, such classifiers could be made available through a user-friendly graphical interface. Such tools would allow organizations to analyze customer sentiments towards their products. This approach would facilitate broader accessibility and adoption.

REFERENCES

[2] Choe, S., Mehta, S. V., Ahn, H., Neiswanger, W., Xie, P., Strubell, E., & Xing, E. (2024). Making scalable meta learning practical. Advances in Neural Information Processing Systems, 36.
[3] Abubakr, M., Rady, M., Badran, K., & Mahfouz, S. Y. (2024). Application of deep learning in damage classification of reinforced concrete bridges. Ain Shams Engineering Journal, 15(1), 102297.
[4] Mupaikwa, E. (2025). The application of artificial intelligence and machine learning in digital libraries. In Encyclopedia of Information Science and Technology, Sixth Edition (pp. 1-18). IGI Global.
[5] Büyükkeçeci, M., & Okur, M. C. (2024). Feature selection and feature selection stability in machine learning. Gazi University Journal of Science.
[6] Valtonen, L., Mäkinen, S. J., & Kirjavainen, J. (2024). Advancing reproducibility and transparency in reporting preprocessing and
[7] Shamshiri, A., Ryu, K. R., & Park, J. Y. (2024). Text mining and natural language processing
[9] Umer, M., et al. (2023). Impact of convolutional neural network and FastText embedding on text classification. Multimedia Tools and Applications, 82(4), 5569-5585.
[10] Pal, S., Biswas, B., Gupta, R., Kumar, A., & Gupta, S. (2023). Exploring the factors that affect user experience in mobile-health applications: A text-mining and machine-learning approach. Journal of Business Research, 156, 113484.
[11] Kariri, E., Louati, H., Louati, A., & Masmoudi, F. (2023). Exploring the advancements and future research directions of artificial neural networks: a text mining approach. Applied Sciences, 13(5), 3186.
[12] Abdusalomovna, T. D. (2023). Text mining. European Journal of Interdisciplinary Research and Development, 13, 284-289.
[13] Spiteri, G., Fielding, J., Diercke, M., Campese, C., Enouf, V., Gaymard, A., Bella, A., Sognamiglio, P., Moros, M. J. S., Riutort, A. N., & Demina, Y. V. (2020). First cases of coronavirus disease 2019 (COVID-19) in the WHO European Region, 24 January to 21 February 2020. Eurosurveillance, 25(9), 2000178.