
LAUTECH Journal of Engineering and Technology 18 (1) 2024: 47-56

APPLICATION OF MACHINE LEARNING TO TEXT CLASSIFICATION


1*Ozoh P., 2Rasheed S., 3Akanbi C., 4Olayiwola M., 5Ibrahim M., 6Kolawole M., 7Olubusayo O., 8Adigun A.

1,2,3,5,8 Department of ICT, Osun State University, Nigeria
4,6 Department of Mathematical Sciences, Osun State University, Nigeria
7 Department of Physics, Osun State University, Nigeria

Corresponding author emails: patrick.ozoh@uniosun.edu.ng; olayiwola.oyedunsi@uniosun.edu.ng

ABSTRACT
The internet provides an important channel for distributing information to a wide audience. Organizations depend on knowing customer opinions about their products and services, but the volume of data involved can be too large to process manually. This study investigates a technique that applies Python programming to collect datasets automatically. Machine learning models are developed by applying the Random Forest and Naïve Bayes algorithms to the collected data for text classification purposes. This process classifies the data as positive, negative, slightly negative, slightly positive, or neutral. The results from the study show that the Random Forest classifier is more efficient than the Naïve Bayes algorithm, achieving an accuracy of 76.5% compared with 70.01% for Naïve Bayes. This technique enables organizations to gain insight into how their customers think.

Keywords: Text classification, Internet community, Random Forest (RF), Insight, Data scraping.

INTRODUCTION

The application of machine learning models to analyze articles was discussed by Rejeb et al. (2024). The study shows that ChatGPT is an important tool for students and educators, indicating ChatGPT's crucial role in supporting students' writing tasks and fostering an interactive learning community. The study also identifies theoretical and practical concerns for applying ChatGPT in educational institutions. Choe et al. (2024) investigate scalable meta learning by introducing SAMA, which integrates classification algorithms and models. When the SAMA algorithm is compared with large-scale learning benchmarks, SAMA produces a reduction in storage requirements, and SAMA-based data optimization produces consistent improvements in text classification accuracy. Abubakr et al. (2024) present a comparative analysis between two models for multi-class classification. The result from the study indicates the proposed application of the deep learning technique yielded an accuracy of 94.95%, compared to 85.71% for the previous technique.

Mupaikwa (2024) proposed the application of machine learning in digital libraries. The technique utilized K-nearest neighbor, Bayesian networks, fuzzy logic, support vector machines, clustering, and classification algorithms. The paper proposed the training of librarians, curriculum reviews, and research on Python-based technology for libraries. Büyükkeçeci & Okur (2024) discuss the feature selection technique for selecting the features relevant to a machine learning task. Their study focused on feature selection and feature selection stability. This technique reduces dataset size, which plays a role in improving the performance of machine learning models. Valtonen et al. (2024) examined a standard research database of unstructured text and encountered representativeness differences between preprocessing choices and unsupervised machine learning (UML) algorithms that hamper research undertakings and transparency. The study calls for contextual representations to focus on these issues and offers recommendations for addressing the contextual suitability of UML in research settings. A review of past research works on text mining was done by Shamshiri et al. (2024). The paper surveys several research works with specialized functions, and its findings highlight important insights for further progress in construction research and its connection to academia and industry.
Duan et al. (2024) proposed measuring a dataset, including social media data, to integrate with a system's decision-making process, where the process depends on several types of data collected from different sources. The research uses text-mining techniques to process Twitter data and applies Naïve Bayes, Random Forest, and XGBoost techniques to classify comments on social media. The paper uses a sampling method to handle imbalances in class distribution and obtains public opinion about street cleanliness. This research can be applied to other social media platforms, including Facebook, and can be used to derive costs and assess the efficiency of the approach. Umer et al. (2023) propose a CNN model for text classification. The technique was applied to the classification model to produce a word-embedding model and has also been applied to Twitter data. The system shows the reliability of the FastText word-embedding approach.

A practical framework is presented in Pal et al. (2023). The paper addresses research challenges by investigating user comments for selected websites. The study selects the principal variables, known as predictors, and classifies the predictors into two groups depending on their relative importance. The results from the study indicate that time, cost, responsiveness, and accessibility are predictors of significant user experience on the internet, and the recommendations from this research will improve quality, resulting in more user satisfaction. Kariri et al. (2023) examine the overall study of ANNs and provide directions for future research, surveying numerous articles across various journals using a text-mining technique. The study indicates that research in machine learning is increasing, and it calls for the availability of a framework to provide a robust basis for studying ANNs. Abdusalomovna (2023) presents a framework for examining unstructured text in databases and transforming it into structured data usable by artificial intelligence (AI) technology.

METHODOLOGY

This research utilizes web scraping as a method for data collection, employing a scraper developed in the Python programming language. This approach is chosen over the conventional method of copy and paste due to its efficiency and time-saving capabilities. Web scraping automates the process, enabling the collection of large volumes of data from websites in a matter of minutes, a task that would otherwise be tedious and time-consuming. It is important to note that the web scraping here is limited to textual comments and does not include animations or images. The research focuses on gathering data spanning five years of reviews on both perishable and non-perishable food products from Amazon's webpage. A total of 113,683 reviews were collected using this method. Random Forest and Naïve Bayes classifiers were selected for analysis, as they are known to perform well with large datasets.
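The paper does not identify the libraries behind its scraper; as an illustration only, the sketch below shows one common way such a collector is written in Python with the requests and BeautifulSoup packages. The URL and the CSS selector are hypothetical placeholders, not Amazon's actual page structure.

```python
# Hypothetical sketch of a review scraper; the URL and the CSS selector
# below are placeholders, not a real site's markup.
import requests
from bs4 import BeautifulSoup

def scrape_reviews(url: str) -> list[str]:
    """Fetch one page of a product-review listing and return its review texts."""
    response = requests.get(url, headers={"User-Agent": "review-scraper/0.1"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Only textual comments are collected; images and animations are ignored.
    return [node.get_text(strip=True) for node in soup.select("div.review-text")]

reviews = scrape_reviews("https://example.com/product/123/reviews?page=1")
print(len(reviews), "reviews collected from this page")
```

In practice such a loop is repeated over many result pages until the full corpus (here, 113,683 reviews) is assembled.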

The system architecture is illustrated in Figure 1, which provides an overview of the research framework. The data in the proposed architecture are collected with the help of the web scraping method and are later pre-processed (text transformation).

Figure 1. System architecture.

Figure 2. Flowchart depicting the web scraping process.

The dataset is divided into train and test sets. The train set is input into each algorithm to develop the models. The test set is input into the trained models to predict the results, and the output from the models is analyzed to investigate their performance. Additionally, the flowchart outlining the data collection and analysis process is presented in Figure 2, offering a visual representation of the methodology employed in the study.

Random forest and Naïve Bayes classifiers

The Random Forest classifier selects its output category based on a majority vote, whereby the most frequently occurring category among the predictions from multiple trees is taken as the final result. This approach ensures robustness and reliability in classification. Moreover, Random Forest classifiers are user-friendly, requiring minimal expertise and programming skill; they are accessible to both experts and novices, making them suitable for individuals without an extensive mathematical background.

The Naïve Bayes classifier is a method based on Bayes' theorem. It operates under the assumption that the presence of a particular feature during classification is independent of the presence of other features. This model is particularly advantageous for handling very large datasets due to its simplicity and ease of implementation. In addition to its simplicity, the Naïve Bayes classifier is well-suited for problems that involve assigning objects to discrete categories. It belongs to the group of numerically-based approaches and offers several benefits, including simplicity, speed, and high accuracy. Overall, Naïve Bayes classifiers provide a straightforward and efficient solution for a wide range of classification tasks. Spiteri et al. (2020) describe the Bayes rule as:

γ(α|β) = (γ(α) ∗ γ(β|α)) / γ(β)    (1)

where α is the specific class, β is the document to be classified, γ(α) and γ(β) are the prior probabilities, γ(β|α) is the likelihood, and γ(α|β) is the posterior probability. The value of class α might be positive, slightly negative, negative, or neutral.

A review of a food product can be considered as a document. Verzi & Auger (2021) highlighted that the multinomial model of Naive Bayes effectively captures word frequency information within documents. The Maximum Likelihood Estimate (MLE) determines the most likely value for each parameter given the training data, thereby providing a reliable ratio. This approach helps in accurately estimating the parameters from the available training data. For the prior, this estimate is given as:

γ(α) = Nc / N    (2)

where Nc is the number of training documents in class α and N is the total number of documents. The multinomial model assumes every attribute value to be independent of the others, given the class:

γ(β|α) = γ(φ1, ..., φnd | α)    (3)

where φ1, ..., φnd are the words of document β. In the multinomial model, a document is structured as a sequence of word occurrences drawn from the same vocabulary, denoted as V. Each document, denoted as βi, is considered independent of the others. The parameter βi represents the distribution of words within each document, following a multinomial distribution with numerous independent trials. This results in the common bag-of-words (BOW) representation for documents. The BOW model is commonly utilized in document classification tasks, where the frequencies of word occurrences serve as features for training classifiers. A unigram feature is employed to indicate the presence of a single word within a text interval. This approach enables the representation of documents based on the occurrence of individual words, facilitating effective classification processes.
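To make the bag-of-words idea concrete, the short sketch below builds unigram count features with scikit-learn's CountVectorizer (an assumption for illustration; the paper does not name its feature-extraction code). Each review becomes a vector of word counts over the vocabulary V.

```python
# Bag-of-words (unigram) features with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["Sweet food", "Not good as advertised", "Bad food"]
vectorizer = CountVectorizer()             # unigram counts by default
X = vectorizer.fit_transform(reviews)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the vocabulary V, |V| = 7 here:
# ['advertised' 'as' 'bad' 'food' 'good' 'not' 'sweet']
print(X.toarray())                         # word counts per document
```

Note that this tiny corpus is the same one used in the worked example of the next subsection, where the seven-term vocabulary appears in the smoothing denominators.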


The conditional probability γ(φ|α) is estimated as the relative frequency of term φ in documents belonging to class α, counting multiple occurrences of a term within a document:

γ(φ|α) = (count(φ, α) + 1) / (count(α) + |V|)    (4)

where count(φ, α) is the number of occurrences of φ in training documents from class α, count(α) is the number of words in that class, and |V| is the number of terms in the vocabulary.

To address the issue of zero probability, the add-one or Laplace smoothing technique is applied, which involves adding one to every count. This adjustment ensures that no probability value is zero. Subsequently, the likelihood of a document given its category is calculated using the multinomial distribution, as presented in Equation (4). Finally, utilizing the posterior probability, the new document is classified.

Let αNB represent the class with the maximum posterior probability, where αj is a class from α and βi is the ith document. By calculating the posterior probability based on the likelihood of the document given its category, classification of the new document can be achieved effectively:

αNB = arg max αj ∈ α  γ(αj) ∏i γ(βi|αj)    (5)

Consider Table 1 as a dataset comprising product reviews. The objective of the model is to classify these reviews into either positive or negative categories. Table 1 provides an overview of the structure of the dataset, serving as the foundation for the classification process.

Table 1: Sample dataset

Set        ID  Review                  Sentiment
Train set  1   Sweet food              Positive
           2   Not good as advertised  Negative
           3   Bad food                Negative
Test set   4   Bad food                Negative

Calculate the prior probabilities using Equation (2):

γ(positive) = 1/3    (6)
γ(negative) = 2/3    (7)

Calculate the conditional probabilities (maximum likelihood estimates with Laplace smoothing) using Equation (4). The positive class contains 2 words, the negative class contains 6 words, and the vocabulary has |V| = 7 terms:

γ(bad|positive) = (0 + 1)/(2 + 7) = 0.1111    (8)
γ(bad|negative) = (1 + 1)/(6 + 7) = 0.1538    (9)
γ(food|positive) = (1 + 1)/(2 + 7) = 0.2222    (10)
γ(food|negative) = (1 + 1)/(6 + 7) = 0.1538    (11)

Calculate the posterior probabilities for test document 4 ("Bad food") using Equation (5):

γ(positive|d4) = 1/3 ∗ 0.1111 ∗ 0.2222 = 0.0082    (12)
γ(negative|d4) = 2/3 ∗ 0.1538 ∗ 0.1538 = 0.0158    (13)
γ(negative|d4) > γ(positive|d4)    (14)

Since γ(negative|d4) is the maximum, the posterior probability of the negative class for document 4 is the highest, so document 4 is classified as negative.
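The same computation can be checked mechanically. The script below reproduces Equations (2), (4), and (5) on the Table 1 data in plain Python; it is a verification sketch, not the authors' code.

```python
# Worked Naive Bayes example from Table 1 (Equations 2, 4, and 5).
from collections import Counter

train = [("sweet food", "positive"),
         ("not good as advertised", "negative"),
         ("bad food", "negative")]
test_doc = "bad food"

vocab = {w for text, _ in train for w in text.split()}           # |V| = 7
classes = ["positive", "negative"]
docs = {c: [t for t, y in train if y == c] for c in classes}
words = {c: Counter(w for t in docs[c] for w in t.split()) for c in classes}
total = {c: sum(words[c].values()) for c in classes}             # 2 and 6 words

def prior(c):                                 # Equation (2): Nc / N
    return len(docs[c]) / len(train)

def likelihood(w, c):                         # Equation (4): Laplace smoothing
    return (words[c][w] + 1) / (total[c] + len(vocab))

for c in classes:
    posterior = prior(c)
    for w in test_doc.split():
        posterior *= likelihood(w, c)
    print(c, round(posterior, 4))
# positive 0.0082, negative 0.0158 -> document 4 is classified negative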
Performance evaluation

In this experiment, performance metrics are employed for the accuracy analysis of the algorithms. The proposed system is evaluated using several measures, which include precision, recall, and F1-score.

1. Precision: deals with the ability of the classifier not to label a negative sample as positive. How often the classifier is correct when it predicts positive is defined as TP/(TP + FP).

2. Recall: deals with the ability of the classifier to find all positive instances. It is defined as the ratio of true positives to the sum of true positives and false negatives for each class: TP/(TP + FN).

3. F1-score: the harmonic mean of precision and recall, measuring the accuracy of the classifier for each class: (2 ∗ precision ∗ recall)/(precision + recall).

RESULTS AND DISCUSSION

The dataset used for the experiments contains reviews about perishable and non-perishable food products from Amazon's web page, with the labels Positive, Negative, Slightly positive, Slightly negative, and Neutral. The sample dataset is shown in Table 2.

Table 2. Dataset.

ID  Review                                        Sentiment
1   Good quality dog food                         Positive
2   Not as advertised                             Negative
3   "Delight" says it all                         Slightly positive
4   Cough medicine                                Neutral
5   Great taffy                                   Slightly positive
6   Nice taffy                                    Slightly positive
7   Great! Just as good as the expensive brands!  Positive
8   Wonderful, tasty taffy                        Positive
9   Yay barley                                    Positive
10  Healthy dog food                              Positive
11  The best hot sauce in the world               Positive
12  My cats LOVE this "diet" better than theirs   Positive
13  My cats are not fans of the new food          Negative
14  Fresh and greasy!                             Slightly positive
15  Strawberry Twizzlers - yummy                  Positive

Table 3. Description of dataset

Name       Variable type  Variable description
ID         Input          Unique ID of each review
Review     Input          Comments about food products from social media pages
Sentiment  Output         The label associated with each review

The description of the dataset is given in Table 3.

Experimental results

The experimental results for the two classifiers are presented in the form of confusion matrices, showcasing the counts of true positives, false negatives, true negatives, and false positives. These matrices offer a comprehensive view of the performance of each classifier, as outlined in Table 4. True Positives (TP): tested positive and the review is actually positive. True Negatives (TN): tested negative and the review is actually negative. False Positives (FP): tested positive and the review is not (otherwise known as a "Type I error"). False Negatives (FN): tested negative and the review is not (otherwise known as a "Type II error").

Table 4. Structure of confusion matrix

                 Predicted negative  Predicted positive
Actual negative  True Neg. (TN)      False Pos. (FP)
Actual positive  False Neg. (FN)     True Pos. (TP)
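The metric definitions above translate directly into code. The sketch below computes precision, recall, and F1 from the Table 4 counts; the numeric values are hypothetical, for illustration only.

```python
# Precision, recall, and F1 from confusion-matrix counts (Table 4 notation).
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)          # how often a positive prediction is right

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)          # how many actual positives are found

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r)     # harmonic mean of precision and recall

# Hypothetical counts for one class out of the five:
tp, fp, fn = 80, 20, 30
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))  # 0.8 0.73 0.76
```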

Results for the Naïve Bayes classifier

The Naïve Bayes algorithm was employed to classify the polarity of documents within the dataset. This algorithm categorizes reviews as either positive, slightly negative, slightly positive, neutral, or negative. Upon testing one of the reviews from the dataset, the outcome revealed its polarity classification. Table 5 displays the experimental results, indicating that 79,658 correct samples were identified out of 113,683 reviews using the Naïve Bayes classifier, as determined from the confusion matrix.

Table 5. Experimental result of the Naïve Bayes classifier

Total reviews     113,683
Classifier        Naive Bayes
Correct sample    79,658
Incorrect sample  34,025

The representations in Table 5 are given as follows: correct samples = summation of all TP values; incorrect samples = summation of all FN and FP values.

Out of the total 113,683 reviews, 79,658 were correctly classified while 34,025 were incorrectly classified. The Naïve Bayes classifier demonstrated a higher number of correct classifications compared to incorrect ones. The results of the confusion matrix of the Naïve Bayes classifier are given in Figure 3, and Figure 4 shows a bar chart depicting the output from the classifier.

Figure 3. Confusion matrix of Naïve Bayes.

The implication of Figure 3 is that, for the different data samples, the predicted values for the positive class are close to the actual values. This signifies that the model is accurate.

Figure 4. Bar chart depicting output from the Naïve Bayes classifier.

Results for the Random Forest classifier

The Random Forest algorithm was employed to classify the polarity of documents within the dataset. This algorithm categorizes reviews as positive, slightly negative, slightly positive, neutral, or negative. Upon testing one of the reviews from the dataset, the outcome revealed its polarity classification. Table 6 displays the experimental results, indicating that 86,898 correct samples were identified out of 113,683 reviews using the Random Forest algorithm, as derived from the confusion matrix presented in Figure 5. Furthermore, Figure 6 illustrates a pie chart representing the output from the Random Forest classifier. As before, correct samples = summation of all TP values and incorrect samples = summation of all FN and FP values.

Figure 5. Confusion matrix of Random Forest.

Table 6. Experimental result of the Random Forest classifier

Total reviews     113,683
Classifier        Random forest
Correct sample    86,898
Incorrect sample  26,788

Figure 6. Pie chart depicting output from the Random Forest classifier.
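The paper does not publish its implementation, so the following is a minimal sketch of how such a two-classifier experiment is commonly assembled with scikit-learn. The placeholder data, the 80/20 split, and the hyperparameters are assumptions, not the authors' settings; in the study the inputs would be the 113,683 scraped reviews and their five sentiment labels.

```python
# Minimal sketch of the two-classifier experiment with scikit-learn.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Placeholder data; in the study these come from the scraped review corpus.
reviews = ["Sweet food", "Bad food", "Not good as advertised",
           "Great taffy", "Cough medicine", "Healthy dog food"] * 10
sentiments = ["Positive", "Negative", "Negative",
              "Slightly positive", "Neutral", "Positive"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    reviews, sentiments, test_size=0.2, random_state=42)  # assumed split ratio

vectorizer = CountVectorizer()                  # bag-of-words features
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

for model in (MultinomialNB(), RandomForestClassifier(n_estimators=100)):
    model.fit(X_train_bow, y_train)
    y_pred = model.predict(X_test_bow)
    print(type(model).__name__, accuracy_score(y_test, y_pred))
    # Per-class precision/recall/F1, in the style of Tables 7-8:
    print(classification_report(y_test, y_pred, zero_division=0))
```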

Performance evaluation

Tables 7-8 are the individual classification reports for the two techniques, as utilized for performance evaluation.

Table 7: Naïve Bayes classification report

               Precision  Recall  F1 score
Negative       0.63       0.48    0.55
Neutral        0.58       0.06    0.11
Positive       0.72       0.99    0.83
Slightly neg.  0.59       0.15    0.24
Slightly pos.  0.51       0.09    0.15
Avg. Total     0.66       0.70    0.63

Table 8: Random Forest classification report

               Precision  Recall  F1 score
Negative       0.67       0.66    0.66
Neutral        0.55       0.37    0.44
Positive       0.81       0.94    0.87
Slightly neg.  0.61       0.43    0.51
Slightly pos.  0.63       0.36    0.45
Avg. Total     0.74       0.77    0.74
Discussion

At the end of the experimental analysis, the Naïve Bayes classifier (Table 5) obtained an accuracy of 70.01% on the test data, while the Random Forest classifier (Table 6) obtained 76.5%; therefore, the best accuracy was given by the Random Forest classifier. The percentage accuracy of the classifiers is given in Table 9. Accuracy is calculated by:

Accuracy = (Number of correct samples / Total number of samples) × 100    (15)

The implication of the results obtained from Equation (15) is that both text classifiers produce more correct samples than incorrect samples.

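As a quick arithmetic check, Equation (15) applied to the counts in Tables 5 and 6 reproduces the accuracies summarized in Table 9 to rounding:

```python
# Accuracy = correct samples / total samples (Equation 15).
total = 113_683
print(f"Naive Bayes:   {79_658 / total:.2%}")   # prints 70.07%, i.e. about 70%
print(f"Random forest: {86_898 / total:.2%}")   # prints 76.44%, i.e. about 76%
```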

Table 9. Percentage accuracy of the classifiers

Dataset         Classifier     Accuracy of classifier
Product review  Naïve Bayes    70.01%
Product review  Random forest  76.50%

CONCLUSIONS

For many large and mid-sized companies, understanding customer sentiments and opinions regarding their products and services is crucial due to the significant impact these sentiments can have on the company's financial performance. In this study, experimental analysis was carried out on a dataset comprising product reviews. Both the Naive Bayes classifier and the Random Forest classifier were trained on the dataset, and the Random Forest classifier was observed to outperform the Naive Bayes classifier. Going forward, it is recommended to explore the development of a mobile application or a user-friendly graphical interface. Such tools would enable individuals without programming skills to easily assess and understand their customers' sentiments towards their products. This approach would facilitate broader accessibility and utilization of sentiment analysis tools, empowering companies to make informed decisions based on customer feedback.
REFERENCES

[1] Rejeb, A., Rejeb, K., Appolloni, A., Treiblmaier, H., & Iranmanesh, M. (2024). Exploring the impact of ChatGPT on education: A web mining and machine learning approach. The International Journal of Management Education, 22(1), 100932.

[2] Choe, S., Mehta, S. V., Ahn, H., Neiswanger, W., Xie, P., Strubell, E., & Xing, E. (2024). Making scalable meta learning practical. Advances in Neural Information Processing Systems, 36.

[3] Abubakr, M., Rady, M., Badran, K., & Mahfouz, S. Y. (2024). Application of deep learning in damage classification of reinforced concrete bridges. Ain Shams Engineering Journal, 15(1), 102297.

[4] Mupaikwa, E. (2025). The application of artificial intelligence and machine learning in academic libraries. In Encyclopedia of Information Science and Technology, Sixth Edition (pp. 1-18). IGI Global.

[5] Büyükkeçeci, M., & Okur, M. C. (2024). A comprehensive review of feature selection and feature selection stability in machine learning. Gazi University Journal of Science, 1-1.

[6] Valtonen, L., Mäkinen, S. J., & Kirjavainen, J. (2024). Advancing reproducibility and accountability of unsupervised machine learning in text mining: Importance of transparency in reporting preprocessing and algorithm selection. Organizational Research Methods, 27(1), 88-113.

[7] Shamshiri, A., Ryu, K. R., & Park, J. Y. (2024). Text mining and natural language processing in construction. Automation in Construction, 158, 105200.

[8] Duan, H. K., Vasarhelyi, M. A., Codesso, M., & Alzamil, Z. (2023). Enhancing the government accounting information systems using social media information: An application of text mining and machine learning. International Journal of Accounting Information Systems, 48, 100600.

[9] Umer, M., Imtiaz, Z., Ahmad, M., Nappi, M., Medaglia, C., Choi, G. S., & Mehmood, A. (2023). Impact of convolutional neural network and FastText embedding on text classification. Multimedia Tools and Applications, 82(4), 5569-5585.

[10] Pal, S., Biswas, B., Gupta, R., Kumar, A., & Gupta, S. (2023). Exploring the factors that affect user experience in mobile-health applications: A text-mining and machine-learning approach. Journal of Business Research, 156, 113484.

[11] Kariri, E., Louati, H., Louati, A., & Masmoudi, F. (2023). Exploring the advancements and future research directions of artificial neural networks: A text mining approach. Applied Sciences, 13(5), 3186.

[12] Abdusalomovna, T. D. (2023). Text mining. European Journal of Interdisciplinary Research and Development, 13, 284-289.

[13] Spiteri, G., Fielding, J., Diercke, M., Campese, C., Enouf, V., Gaymard, A., Bella, A., Sognamiglio, P., Moros, M. J. S., Riutort, A. N., & Demina, Y. V. (2020). First cases of coronavirus disease 2019 (COVID-19) in the WHO European Region, 24 January to 21 February 2020. Eurosurveillance, 25(9), 2000178.
