

Word Embedding Methods of Text Processing
in Big Data: A Comparative Study

Lahcen Idouglid(B) and Said Tkatek

Computer Sciences Research Laboratory, Faculty of Sciences, IBN Tofail University, Kenitra,
Morocco
{lahcen.idouglid,said.tkatek}@uit.ac.ma

Abstract. One of the biggest challenges any NLP data scientist faces is choosing the best numeric/vector representation of a text string for running a machine learning model. This research paper provides a comprehensive study of big data and its impact on the performance of word embedding techniques for text processing. We propose a method for text processing that combines word embedding techniques with machine learning (ML) algorithms to improve the performance of data analysis and decision-making. Several word embedding methods can be used for text processing, in particular the most popular ones, CountVectorizer, TF-IDF, and HashingVectorizer; we combine them with supervised ML algorithms such as decision trees, random forest classifiers, and logistic regression for text classification. Compared with recent work, our comparative study shows the impact of dataset size on the performance of text classification algorithms and gives good results.

Keywords: Word Embedding · NLP · Big Data · Machine Learning

1 Introduction
The term big data refers to vast, complex, and real-time data that necessitates sophisticated management, analytical, and processing approaches to extract insights. Few people can think critically about big data problems and have the skills and knowledge to tackle them [14].
Machine learning (ML) is a science focused on understanding and developing learn-
ing algorithms [13]. It is considered to be a component of artificial intelligence [7].
Without being expressly taught to do so, machine learning algorithms create a model
using sample data, also referred to as training data, to make predictions or judgments.
The study of theories and techniques that enable information exchange between
humans and computers through natural language is known as “natural language pro-
cessing" [9]. NLP combines linguistics, computer science, and mathematics [2]. NLP
is an area of AI that aids computers in manipulating and interpreting human languages.
Text mining is a technique used in NLP to extract relevant information from text. The
goal of NLP is to glean knowledge from natural language. One of the most common


document vocabulary representations is word embedding. It can record a document's context, the semantic and syntactic similarity of its words, their relationships to one another, etc. [2].
The paper is organized as follows. Section 2 presents related work. Section 3 discusses the methodology and system architecture and presents the word embedding techniques and classifiers used in this work. The experimental results are discussed in Sect. 4. Finally, the conclusion is presented in Sect. 5.

2 Related Work
This section introduces related work on methods of evaluating word embeddings and existing studies that evaluate embeddings in downstream tasks.
The performance of word embedding approaches was compared, examined, and evaluated using machine learning algorithms in a Turkish sentiment analysis study conducted on a dataset derived from user comments shared on multiple shopping websites in recent years. This study serves as our starting point for the discussion of word embeddings [3]. The second study, "BnVec: Towards the Development of Word Embedding for Bangla Language Processing", proposes approaches for Bengali word embedding. Six well-known word embedding techniques, CountVectorizer, TF-IDF, Hash vectorizer, Word2vec, fastText, and GloVe, are included in the first one, highlighting their well-known functionality [9]. Various qualitative and quantitative tests were conducted on a few tasks to show each technique's ability to infer word proximity and to examine how well it performs in comparison with the other word embedding techniques [9]. The third work concerns sentiment analysis of film reviews in the Gujarati language using machine learning; it is a comparative study of TF-IDF vectorizer and CountVectorizer features after applying sentiment analysis [12], comparing the results of two different machine learning algorithms on the accuracy, recall, precision, and F-score performance parameters. The last work cited, "Measuring associational thinking through word embeddings", investigates various ways of combining existing embeddings to determine the semantic or non-semantic associative strength between words, so that the agreement with human judgments can be increased [11].

3 Methodology
3.1 System Architecture
The text classification process comprises four steps, starting with collecting and preparing the datasets. Preprocessing techniques play an important role in improving the performance of the models. The three key steps of data preprocessing are tokenization, stop-word removal, and stemming. Tokenization involves separating a stream of text into recognized tokens, such as words, phrases, symbols, or other practical elements; its objective is to analyze each word in a statement. Stemming is a technique for reducing a word's numerous forms to a common stem. Bag of Words is one of the most popular representations; it describes the occurrence of words within a text while disregarding their order. After stemming and lemmatization, the next step is the division of the dataset into a training set and a test set.
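The following is a minimal sketch of such a preprocessing pipeline, assuming NLTK for tokenization, stop-word removal, and stemming and scikit-learn for the train/test split; the paper does not name its implementation, so the corpus, labels, and function names here are illustrative.

```python
# Minimal preprocessing sketch (assumed libraries: NLTK and scikit-learn;
# the paper does not specify its implementation, so names are illustrative).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # stop-word lists

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> str:
    """Tokenize, remove stop words, and stem a raw text string."""
    tokens = word_tokenize(text.lower())                                # tokenization
    kept = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]   # stop-word removal
    return " ".join(STEMMER.stem(t) for t in kept)                      # stemming

# Illustrative corpus; the paper uses a question-answering dataset instead.
texts = ["What is the capital of Morocco?", "How do neural networks learn?",
         "Who discovered penicillin?", "How do I change a flat tire?"]
labels = ["factoid", "how-to", "factoid", "how-to"]

cleaned = [preprocess(t) for t in texts]
X_train, X_test, y_train, y_test = train_test_split(cleaned, labels,
                                                    test_size=0.25, random_state=0)
```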
Word Embedding Methods of Text Processing in Big Data 833

3.2 Word Embedding Techniques


TfidfVectorizer: After the most significant terms in the super vector have been selected, each word is represented as a weighted vector of the terms discovered in the super vector. Each document assigns a weight to each word, which determines the significance of the word in a corpus document. The weight is calculated by multiplying the term frequency (TF) by the inverse document frequency (IDF). TF counts the number of times a term appears in a document, while IDF measures the term's significance with respect to the corpus as a whole [1].

TF-IDF(t) = TF(t) × IDF(t)    (1)
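To illustrate Eq. (1), the following sketch computes TF-IDF weights with scikit-learn's TfidfVectorizer; the corpus and parameters are illustrative, as the paper does not report its configuration.

```python
# Illustrative TF-IDF sketch using scikit-learn defaults (configuration assumed,
# not taken from the paper).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "big data needs scalable text processing",
    "word embedding methods represent text as vectors",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)          # sparse N x M matrix of TF-IDF weights

print(tfidf.get_feature_names_out())     # the learned vocabulary (the "super vector")
print(X.toarray())                       # one weighted row per document
```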

CountVectorizer: The count vector is a simple yet incredibly effective method used in language processing [9]. Given N data components (documents) and M unit components (unique tokens) present in them, an N × M matrix is constructed [6]. Each data component is represented by the frequency of the unit elements it contains [9]. In the binary variant, text is transformed into a vector by marking the presence (1) or absence (0) of each word of a given input [12].
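A minimal sketch of both the count-based and the binary (presence/absence) variants with scikit-learn follows; the example corpus is illustrative.

```python
# Illustrative CountVectorizer sketch (scikit-learn defaults assumed).
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat"]

counts = CountVectorizer()               # term-frequency variant (N x M count matrix)
print(counts.fit_transform(corpus).toarray())

binary = CountVectorizer(binary=True)    # presence (1) / absence (0) variant
print(binary.fit_transform(corpus).toarray())
```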
HashingVectorizer: The hashing vectorizer uses the hashing trick to map a token's string name to a feature integer index. This vectorizer converts text documents into matrices by building sparse matrices of token occurrence counts from the collection of documents [5].
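A minimal sketch with scikit-learn's HashingVectorizer follows; the number of hashed features and the options that keep raw occurrence counts are assumptions, not settings reported in the paper.

```python
# Illustrative HashingVectorizer sketch (n_features and count options assumed).
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ["hashing maps each token to a column index",
          "no vocabulary dictionary is stored in memory"]

hasher = HashingVectorizer(n_features=2**10,      # fixed output dimensionality
                           alternate_sign=False,  # keep non-negative counts
                           norm=None)             # raw occurrence counts
X = hasher.transform(corpus)                      # sparse matrix; no fit step needed
print(X.shape)
```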

3.3 Text Classifiers

In this paragraph, we present the machine learning algorithms implemented in our work.
Decision Tree Classifier: A decision tree is an induction approach that has been applied to a variety of classification problems. It is based on splitting on features and evaluating their worth [4]. The splitting procedure continues until each branch carries only one classification label. It generates decision trees from random data samples, assigns a prediction to each tree, and chooses the best solution.
Random Forest Classifier: Random Forest (RF) is a well-known decision tree ensemble that is often used for classification. The popularity of RF stems from its superior performance compared to other classification methods [8].
Logistic Regression Classifier: Logistic regression is one of the most widely used classification techniques. It is employed in a variety of fields since it is easy to understand and its results are interpretable, allowing for what-if scenarios. It is a classification method that models the probability of a class as a logistic function of a linear combination of the predictors [10].
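The sketch below trains the three classifiers on TF-IDF features using scikit-learn pipelines; the toy data, default hyperparameters, and max_iter setting are assumptions rather than the paper's exact setup.

```python
# Illustrative sketch of the three classifiers (toy data and default
# hyperparameters assumed; not the paper's exact configuration).
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

X_train = ["what is the capital of france", "how do i boil an egg",
           "who wrote hamlet", "how do i change a tire"]
y_train = ["factoid", "how-to", "factoid", "how-to"]
X_test = ["who painted the mona lisa", "how do i bake bread"]
y_test = ["factoid", "how-to"]

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    model = make_pipeline(TfidfVectorizer(), clf)   # vectorize, then classify
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```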

4 Results and Discussion


In this experiment, we evaluated the classification results based on accuracy, a standard evaluation metric, to compare three machine learning algorithms, namely Decision Tree, Random Forest classifier, and Logistic Regression, and their compatibility with three word embedding methods, namely TfidfVectorizer, CountVectorizer, and HashingVectorizer. In the first part of this comparative study, we examine how the accuracy of the algorithms evolves with the size of the dataset; for this, we take subsets of our dataset (2000, 5000, 20000, 50000, and 150000 inputs) and present the results in graphs.
The dataset was downloaded from GitHub ("Large Question Answering Datasets"). All experiments were implemented, executed, and tested on Google Colaboratory.
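The following sketch outlines the experimental grid: each (vectorizer, classifier) pair is evaluated at increasing subset sizes. The dataset itself is not reproduced here, and the run_grid function, split ratio, and hyperparameters are illustrative assumptions rather than the paper's exact setup.

```python
# Illustrative sketch of the accuracy-vs-dataset-size grid (dataset not included;
# split ratio and hyperparameters are assumptions).
from sklearn.feature_extraction.text import (TfidfVectorizer, CountVectorizer,
                                              HashingVectorizer)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

VECTORIZERS = {"Tf-idf": TfidfVectorizer, "Count": CountVectorizer,
               "Hashing": HashingVectorizer}
CLASSIFIERS = {"DT": DecisionTreeClassifier, "RF": RandomForestClassifier,
               "LR": lambda: LogisticRegression(max_iter=1000)}

def run_grid(texts, labels, sizes=(2000, 5000, 20000, 50000, 150000)):
    """Report accuracy for every (vectorizer, classifier) pair at each subset size."""
    for size in sizes:
        X_tr, X_te, y_tr, y_te = train_test_split(texts[:size], labels[:size],
                                                  test_size=0.2, random_state=0)
        for v_name, vec_cls in VECTORIZERS.items():
            for c_name, clf_cls in CLASSIFIERS.items():
                model = make_pipeline(vec_cls(), clf_cls())
                model.fit(X_tr, y_tr)
                acc = accuracy_score(y_te, model.predict(X_te))
                print(f"size={size:6d}  {v_name:7s} + {c_name}: {acc:.4f}")
```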

Table 1. Comparative accuracy (%) of ML algorithms with word embedding techniques for text processing

ML Algorithm              Dataset Size   Tf-idf Vectorizer   CountVectorizer   Hashing Vectorizer
Decision Tree             2000           82                  82                97.5
                          5000           82.6                82.6              98.1
                          20000          84.42               84.42             97.88
                          50000          85.47               85.48             97.77
                          150000         80.26               80.26             97.47
Random Forest Classifier  2000           82                  82                98.5
                          5000           82.50               82.50             98.1
                          20000          84.45               84.48             98.32
                          50000          85.32               85.33             98.32
                          150000         80.26               80.60             98.02
Logistic Regression       2000           78.75               82                97
                          5000           77.70               81.70             97.20
                          20000          78.85               82.82             97.75
                          50000          80.12               83.86             97.7
                          150000         78.76               70.59             97.49

Table 1 shows the results of the comparative study using three well-known machine learning algorithms, namely Decision Tree, Random Forest Classifier, and Logistic Regression, with the Tf-idf Vectorizer, CountVectorizer, and Hashing Vectorizer word embedding methods. The evaluation is conducted using accuracy as the evaluation parameter. The Hashing Vectorizer word embedding method achieved the highest accuracy, 98.5%, with the Random Forest Classifier on a dataset of size 2000.
Fig. 1. Performance comparison for word embedding methods

Graph (a) of Fig. 1 shows the evolution of the accuracy of the TfidfVectorizer method with the number of entries in the training dataset. As the graph shows, the accuracy increases with the number of entries in the training dataset, and the best result is reached when the training dataset contains 50000 entries, with the best performance obtained for this model being 85%.
The second graph, (b), shows the evolution of CountVectorizer performance with increasing dataset size. The accuracy of CountVectorizer also increases with the number of inputs in the training dataset, but degrades beyond a dataset size of 50000; the best performance achieved for this model is 85.47%. The algorithms that give the best results with this method are DT and RF.
Graph (c) presents the evolution of HashingVectorizer performance with increasing dataset size. The accuracy increases with the number of inputs in the training dataset, and the best result, about 98%, is reached when the training dataset contains 150000 inputs. The HashingVectorizer method works effectively with all the algorithms tested.

5 Conclusion

This work presents the results of a comparative study of word embedding methods for text processing, namely Tf-idf Vectorizer, CountVectorizer, and Hashing Vectorizer.
The results show the impact of big data and of the size of the training data on the performance of machine learning algorithms combined with the three word embedding methods. As the dataset grows larger, the performance generally becomes higher; the best performance achieved by our model is 98%, and the best method is HashingVectorizer, which works effectively with all the algorithms tested. In future work, we will investigate semantic similarity for text categorization using genetic algorithms. We will also build a larger model for several languages.

References
1. Alammary, A.S.: Arabic Questions Classification Using Modified TF-IDF. IEEE Access 9,
95109–95122 (2021). https://doi.org/10.1109/ACCESS.2021.3094115
2. Al-Ansari, K.: Survey on word embedding techniques in natural language processing. 7
3. Aydoğan, M.: Comparison of word embedding methods for Turkish sentiment classification.
21
4. Guezzaz, A., Benkirane, S., Azrour, M., Khurram, S.: A reliable network intrusion detection
approach using decision tree with enhanced data quality. Secur. Commun. Netw. 2021, 1–8
(2021). https://doi.org/10.1155/2021/1230593
5. Haque, F., Md Manik, M.H., Hashem, M.M.A.: Opinion mining from bangla and phonetic
bangla reviews using vectorization methods. In: 2019 4th International Conference on Electri-
cal Information and Communication Technology (EICT), Khulna, Bangladesh. IEEE, pp. 1–6
(2019)
6. Haque, R., Islam, N., Islam, M., Ahsan, M.M.: A comparative analysis on suicidal ideation
detection using NLP, machine, and deep learning. Technologies 10, 57 (2022). https://doi.
org/10.3390/technologies10030057
7. Hilbert, S., et al.: Machine learning for the educational sciences. Rev. Educ. 9(3), e3310. https://doi.org/10.1002/rev3.3310
8. Jain, A., Sharma, Y., Kishor, K.: Financial administration system using ML Algorithm. 10
9. Kowsher, M., et al.: BnVec: towards the development of word embedding for Bangla language
processing. IJET 10, 95 (2021). https://doi.org/10.14419/ijet.v10i2.31538
10. Mahesh, B.: Machine Learning Algorithms - A Review 9, 7 (2018)
11. Periñán-Pascual, C.: Measuring associational thinking through word embeddings. Artif. Intell.
Rev. 55(3), 2065–2102 (2021). https://doi.org/10.1007/s10462-021-10056-6
12. Shah, P., Swaminarayan, P., Patel, M.: Sentiment analysis on film review in Gujarati language using machine learning. IJECE 12, 1030 (2022). https://doi.org/10.11591/ijece.v12i1.pp1030-1039
13. Tkatek, S.: A hybrid genetic algorithms and sequential simulated annealing for a constrained
personal reassignment problem to preferred posts. IJATCSE 9, 454–464 (2020). https://doi.
org/10.30534/ijatcse/2020/62912020
14. Tkatek, S., Belmzoukia, A., Nafai, S., Abouchabaka, J., Ibnou-ratib, Y.: Putting the world
back to work: an expert system using big data and artificial intelligence in combating the
spread of COVID-19 and similar contagious diseases. WOR 67, 557–572 (2020). https://doi.
org/10.3233/WOR-203309
