Word Embedding Methods of Text Processing in Big Data

Lahcen Idouglid and Said Tkatek
Computer Sciences Research Laboratory, Faculty of Sciences, Ibn Tofail University, Kenitra, Morocco
{lahcen.idouglid,said.tkatek}@uit.ac.ma
Abstract. One of the biggest challenges any NLP data scientist faces is choosing the best numeric/vector representation of a text string for running a machine learning model. This paper provides a comprehensive study of big data and its impact on the performance of word embedding techniques for text processing. We propose a method for text processing that combines word embedding techniques with machine learning (ML) algorithms to improve the performance of data analysis and decision-making. Several word embedding methods can be used for text processing, in particular the most popular ones, CountVectorizer, TF-IDF, and HashingVectorizer, combined with supervised ML algorithms such as decision trees, random forest classifiers, and logistic regression for text classification. Compared with recent work, our comparative study shows the impact of dataset size on the performance of text classification algorithms and yields good results.
1 Introduction
The term big data refers to vast, complex, and real-time data that necessitates sophisticated management, analytical, and processing approaches to extract insights. Few people can think critically about big data problems and have the skills and knowledge to tackle them [14].
Machine learning (ML) is a science focused on understanding and developing learning algorithms [13]. It is considered a component of artificial intelligence [7]. Without being explicitly programmed to do so, machine learning algorithms build a model from sample data, also referred to as training data, in order to make predictions or decisions.
The study of theories and techniques that enable information exchange between humans and computers through natural language is known as “natural language processing” [9]. NLP combines linguistics, computer science, and mathematics [2]. It is an area of AI that helps computers manipulate and interpret human language.
Text mining is a technique used in NLP to extract relevant information from text. The goal of NLP is to glean knowledge from natural language. One of the most common NLP tasks is text classification, which is the subject of this paper.
2 Related Work
This section reviews related work on methods for evaluating word embeddings and on existing studies that evaluate embeddings in downstream tasks.
The performance of word embedding approaches was compared, examined, and evaluated using machine learning algorithms in a Turkish sentiment analysis study conducted on a dataset of user comments shared on multiple shopping websites in recent years; this study serves as our starting point for the discussion of word embeddings [3]. The second study, “BnVec: Towards the Development of Word Embedding for Bangla Language Processing”, proposes approaches for Bengali word embedding. Six well-known word embedding techniques, CountVectorizer, TF-IDF, HashingVectorizer, Word2vec, fastText, and GloVe, are included in the first of them, which highlights their well-known functionality [9]. Various qualitative and quantitative tests were conducted on a few tasks to show each technique’s ability to infer word proximity and to examine how well it performs in comparison with the other word embedding techniques [9]. The third work concerns sentiment analysis of film reviews in the Gujarati language using machine learning; the paper is a comparative study of TF-IDF Vectorizer and CountVectorizer features after applying sentiment analysis [12], comparing the results of two machine learning algorithms on the Accuracy, Recall, Precision, and F-score performance parameters. The last work cited, “Measuring associational thinking through word embeddings”, investigates various ways of combining existing embeddings to estimate the semantic or non-semantic associative strength between words so that the correlation with human judgments is maximized [11].
3 Methodology
3.1 System Architecture
The text classification process consists of four steps, with initial steps for collecting and preparing the datasets. Preprocessing techniques play an important role in improving the performance of the models. There are three key steps of data preprocessing, namely tokenization, stop word removal, and stemming, as sketched below. Tokenization involves separating a stream of text into recognizable tokens, such as words, phrases, symbols, or other practical elements; its objective is to analyze each word in a statement. Stemming is a technique for reducing a word’s numerous forms to a common stem. Bag of Words is one of the most popular methods: it is a text representation that records the occurrence of words in a text while disregarding their order. After stemming and lemmatization, the next step is the division of the dataset into a training set and a test set.
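The three preprocessing steps can be sketched as follows with NLTK (an assumption on our part; the paper does not name its preprocessing library), where the function name preprocess is introduced here purely for illustration:

```python
# A minimal preprocessing sketch using NLTK (an assumption; the paper does
# not name its preprocessing library). `preprocess` is an illustrative name.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer model
nltk.download("stopwords", quiet=True)  # stop word lists

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> str:
    # 1) Tokenization: split the text stream into word tokens.
    tokens = word_tokenize(text.lower())
    # 2) Stop word removal: drop frequent, low-information words.
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    # 3) Stemming: reduce each word's inflected forms to a common stem.
    return " ".join(STEMMER.stem(t) for t in tokens)

print(preprocess("Tokenization splits a stream of text into tokens."))
```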
In this part, we present the machine learning algorithms implemented in our work.
Decision Tree Classifier: A decision tree is an induction approach that has been applied to a variety of classification problems. It is based on splitting on features and determining their worth [4]. The splitting procedure continues until each branch carries only one classification label.
Random Forest Classifier: Random Forest (RF) is a well-known decision tree ensemble that is often used for classification. It generates decision trees from random data samples, obtains a prediction from each tree, and selects the best solution by voting. The popularity of RF stems from its superior performance compared with other classification methods [8].
Logistic Regression Classifier: Logistic regression is one of the most often used classification techniques. It is employed in a variety of fields since it is easy to understand and its results are interpretable, allowing for what-if scenarios. It is a classification method that models the probability of class membership by applying the logistic function to a linear combination of the predictors [10].
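As a minimal sketch, the three classifiers can be instantiated with scikit-learn as follows; the hyperparameter values shown are library defaults used for illustration, not necessarily the settings of the experiments reported here:

```python
# Illustrative scikit-learn instantiations of the three classifiers; the
# hyperparameters are library defaults, not the settings used in the study.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

classifiers = {
    # Splits recursively on informative features until leaves are pure.
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    # Grows many trees on random samples and aggregates their votes.
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Models the class probability with the logistic function; coefficients
    # remain interpretable, which supports what-if analysis.
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
```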
4 Results and Discussion
In this section, we compare three machine learning algorithms, namely Decision Tree, Random Forest, and Logistic Regression, and their compatibility with three word embedding methods, namely TfidfVectorizer, CountVectorizer, and HashingVectorizer. In the first part of this comparative study, we examine the evolution of the accuracy of the algorithms with respect to the size of the dataset; for this, we take subsets of our dataset (2000, 5000, 20000, 50000, and 150000 entries) and present the results in graphs, following the sketch below.
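A hedged sketch of this sweep with scikit-learn is given below; the file name questions.csv, the column names text and label, and the 80/20 train/test split are illustrative assumptions rather than details reported in this paper:

```python
# A sketch of the comparative sweep. The file "questions.csv" and the column
# names "text"/"label" are assumptions for illustration; the paper does not
# publish its exact loading code or hyperparameters.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import (CountVectorizer,
                                              HashingVectorizer,
                                              TfidfVectorizer)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("questions.csv")  # hypothetical path to the QA dataset

vectorizers = {
    "TfidfVectorizer": TfidfVectorizer(),
    "CountVectorizer": CountVectorizer(),
    "HashingVectorizer": HashingVectorizer(n_features=2**18),
}
classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}
sizes = [2000, 5000, 20000, 50000, 150000]  # subset sizes from the study

results = {}
for n in sizes:
    subset = df.sample(n=n, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        subset["text"], subset["label"], test_size=0.2, random_state=0)
    for v_name, vectorizer in vectorizers.items():
        for c_name, classifier in classifiers.items():
            # Chain the vectorizer and classifier, fit on the training split,
            # and record test accuracy for this (size, method, model) cell.
            model = make_pipeline(vectorizer, classifier)
            model.fit(X_train, y_train)
            results[(n, v_name, c_name)] = accuracy_score(
                y_test, model.predict(X_test))
```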
The dataset was downloaded from GitHub (“Large Question Answering Datasets”). All experiments were implemented, executed, and tested on Google Colaboratory.
Table 1. Comparative accuracy of ML algorithms with word embedding techniques for text processing
This table shows the results of the comparative study using the most famous machine learning algorithms, namely Decision Tree, Random Forest Classifier, and Logistic Regression, with the TfidfVectorizer, CountVectorizer, and HashingVectorizer word embedding methods. The evaluation is conducted using evaluation parameters such as accuracy. The HashingVectorizer word embedding method achieved the highest accuracy score, 98.5%, with the Random Forest Classifier ML algorithm at a dataset size of 2000.
The graphical representation (a) shows the evolution of the accuracy of the TfidfVectorizer method with the number of entries in the training dataset. As the graph shows, the accuracy increases with the number of entries in the training dataset, and the best result is reached when the size of the training dataset is 5000 entries, with the best performance obtained for this model being 85%.
The second graph, (b), shows the evolution of CountVectorizer performance with increasing dataset size. The accuracy of CountVectorizer increases with the number of inputs in the training dataset, but there is a degradation when the dataset size is 50000; the best performance achieved for this model is 85.47%, and the algorithms that give the best results are DT and RF.
As shown in graph (c), which presents the evolution of HashingVectorizer performance with increasing dataset size, the accuracy increases with the number of inputs in the training dataset, and the best result is reached when the size of the training dataset is 15000 inputs, with the best performance achieved for this model being 98%. The HashingVectorizer method works efficiently with all the algorithms tested.
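For illustration, graphs of this kind could be generated with matplotlib from the results dictionary of the sweep sketched in Sect. 4; the plotting code below is an assumed reconstruction, not the scripts actually used:

```python
# An illustrative matplotlib sketch for graphs of type (a)-(c), assuming the
# `results` dictionary produced by the sweep sketched earlier in Sect. 4.
import matplotlib.pyplot as plt

sizes = [2000, 5000, 20000, 50000, 150000]
for v_name in ("TfidfVectorizer", "CountVectorizer", "HashingVectorizer"):
    plt.figure()
    for c_name in ("DT", "RF", "LR"):
        # One accuracy curve per classifier, over increasing dataset sizes.
        accuracies = [results[(n, v_name, c_name)] for n in sizes]
        plt.plot(sizes, accuracies, marker="o", label=c_name)
    plt.xscale("log")  # subset sizes span two orders of magnitude
    plt.xlabel("Training dataset size (entries)")
    plt.ylabel("Accuracy")
    plt.title(v_name)
    plt.legend()
plt.show()
```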
5 Conclusion
This work presents the results of a comparative study of word embedding methods for text processing, namely TfidfVectorizer, CountVectorizer, and HashingVectorizer.
The results show the impact of big data and of the size of the training data on the performance of machine learning algorithms with the three word embedding methods: the larger the dataset, the higher the performance becomes. The best performance achieved for our model is 98%, and the best method is HashingVectorizer, which works efficiently with all the algorithms tested. In future work, we will investigate semantic similarity for text categorization using genetic algorithms. We will also build a larger model for several languages.
References
1. Alammary, A.S.: Arabic Questions Classification Using Modified TF-IDF. IEEE Access 9,
95109–95122 (2021). https://doi.org/10.1109/ACCESS.2021.3094115
2. Al-Ansari, K.: Survey on Word Embedding Techniques in Natural Language Processing
3. Aydoğan, M.: Comparison of word embedding methods for Turkish sentiment classification
4. Guezzaz, A., Benkirane, S., Azrour, M., Khurram, S.: A reliable network intrusion detection
approach using decision tree with enhanced data quality. Secur. Commun. Netw. 2021, 1–8
(2021). https://doi.org/10.1155/2021/1230593
5. Haque, F., Md Manik, M.H., Hashem, M.M.A.: Opinion mining from Bangla and phonetic Bangla reviews using vectorization methods. In: 2019 4th International Conference on Electrical Information and Communication Technology (EICT), Khulna, Bangladesh, pp. 1–6. IEEE (2019)
6. Haque, R., Islam, N., Islam, M., Ahsan, M.M.: A comparative analysis on suicidal ideation detection using NLP, machine, and deep learning. Technologies 10, 57 (2022). https://doi.org/10.3390/technologies10030057
7. Hilbert, S., et al.: Machine learning for the educational sciences. Rev. Educ. 9(3), e3310 (2021). https://doi.org/10.1002/rev3.3310
8. Jain, A., Sharma, Y., Kishor, K.: Financial administration system using ML algorithm
9. Kowsher, M., et al.: BnVec: towards the development of word embedding for Bangla language
processing. IJET 10, 95 (2021). https://doi.org/10.14419/ijet.v10i2.31538
10. Mahesh, B.: Machine Learning Algorithms – A Review. 9, 7 (2018)
11. Periñán-Pascual, C.: Measuring associational thinking through word embeddings. Artif. Intell.
Rev. 55(3), 2065–2102 (2021). https://doi.org/10.1007/s10462-021-10056-6
12. Shah, P., Swaminarayan, P., Patel, M.: Sentiment analysis on film review in Gujarati language using machine learning. IJECE 12, 1030 (2022). https://doi.org/10.11591/ijece.v12i1.pp1030-1039
13. Tkatek, S.: A hybrid genetic algorithms and sequential simulated annealing for a constrained personal reassignment problem to preferred posts. IJATCSE 9, 454–464 (2020). https://doi.org/10.30534/ijatcse/2020/62912020
14. Tkatek, S., Belmzoukia, A., Nafai, S., Abouchabaka, J., Ibnou-ratib, Y.: Putting the world back to work: an expert system using big data and artificial intelligence in combating the spread of COVID-19 and similar contagious diseases. WOR 67, 557–572 (2020). https://doi.org/10.3233/WOR-203309