The Impact of Preprocessing On Word Embedding Quality: A Comparative Study
https://doi.org/10.1007/s10579-022-09620-5
Abstract
Data preprocessing is among the principal stages in virtually all text-based tasks. Recent approaches employ word embeddings in the majority of such tasks, where word co-occurrences form the basis of the word vector generation process. Word embedding techniques are primarily trained on large corpora, and the preprocessing of these corpora is crucial because it can alter the co-occurrence counts of words, and hence the quality of the generated word vectors. This highlights the significance of selecting the right preprocessing approach when working with word embeddings. The present study scrutinizes the effects of preprocessing on the quality of word embeddings for Persian and English corpora, with a focus on preprocessing approaches capable of altering the co-occurrence counts of words, such as the elimination of stopwords. The quality of the word vectors generated under different types of preprocessing is evaluated intrinsically by word similarity and word analogy; text classification and sentiment analysis are used as extrinsic evaluations. The study also presents four word similarity datasets for Persian, namely PER-RG-65, PER-RC-30, PER-Simlex-999, and PER-Mturk-287. Results obtained using GloVe, Word2vec (CBOW), and Fasttext word embeddings show that in certain tasks, such as text classification or solving the semantic questions of word analogy, the removal of punctuation and stopwords leads to higher performance, whereas in sentiment analysis and the syntactic questions of word analogy, retaining stopwords increases overall performance.
1 Introduction
These methods usually use large-scale corpora such as Wikipedia (https://dumps.wikimedia.org/) to learn word vectors. Such corpora can be preprocessed in several ways, and word co-occurrence statistics are modified by the various kinds of preprocessing: the deletion of stopwords, numbers, and punctuation alters the co-occurrence statistics. Preprocessing can therefore affect the quality of the word vectors extracted by word embedding methods, and the quality of the extracted word vectors, in turn, affects all the aforementioned natural language processing tasks.
Several studies have thus far examined the effect of preprocessing on different natural language processing tasks. Ayca and Hakan (2017) and Keerthi Kumar and Harish (2017) examined the effect of preprocessing on text categorization. The effect of text preprocessing on text classification and sentiment analysis, in the case where a neural network is used for these tasks, was considered by Camacho-Collados and Pilehvar (2018); in that research, a convolutional neural network was used for both tasks. Denny and Spirling (2018) studied the effect of text preprocessing on unsupervised learning to find optimal features. Angiani et al. (2016) scrutinized the effect of preprocessing on Twitter sentiment classification. Uysal and Gunal (2014) investigated the effect of preprocessing on text classification in different domains and languages. The effect of preprocessing on review spam detection was considered by Etaiwi and Naymat (2017) as a means for extracting suitable features for spam detection. Another approach investigated the impact of different kinds of preprocessing on document embedding (Yahi & Hacene, 2020). Preprocessing also has an impact on text summarization: Tohalino and Amancio (2018) utilized text segmentation, lemmatization, and unnecessary-word removal as effective preprocessing for text summarization, using a graph-based method for the task. In Marinho et al. (2017), the preprocessing steps of removing punctuation and stopwords, lemmatization, and part-of-speech tagging were applied to the task of authorship attribution using a graph-based approach. Lemmatization is important in graph-based methods because words with the same lemma are mapped to the same node; for generating general-purpose word embeddings, however, it is better to generate embeddings for all forms of a word.
This rather substantial impact of preprocessing on the quality of the word embeddings employed in many NLP tasks has, to the best of our knowledge, not been carefully investigated in previous works. The significance of this study lies in the fact that word embeddings are now used extensively in numerous areas of NLP, where the quality of the extracted word embeddings and representations can alter the accuracy of different tasks. Various preprocessing procedures, such as stopword and punctuation removal, can dramatically change the generated word embeddings due to their drastic effects on word co-occurrences and, therefore, on the corresponding relationships. For example, in the sentence "The tropics are the regions of Earth surrounding the Equator.", extracted from English Wikipedia, consider the word "tropics" and its neighbors in a window of size 5:

The tropics are the regions of Earth surrounding the Equator.
In this sentence, the words "The", "are", "the", "regions", "of", and "Earth" appear in the context of the word "tropics". The following lines show how the context of the word "tropics" changes with stopword removal and punctuation removal:

• Stopword removal: tropics regions Earth surrounding Equator.
• Punctuation and stopword removal: tropics regions Earth surrounding Equator

In this example, the effect of punctuation removal is smaller than that of stopword removal, but note that the token "Equator" differs from the token "Equator.", so the presence of punctuation can change a word form. Since there are large numbers of punctuation marks in a corpus, punctuation removal can strongly affect the context of a word.
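To make the window effect concrete, the following minimal Python sketch (an illustration only, with a simplified tokenizer and an abbreviated stopword list that are our own assumptions, not the pipeline used in this study) counts window-based co-occurrences for a target word before and after stopword removal:

```python
from collections import Counter

STOPWORDS = {"the", "are", "of", "a", "an", "is"}  # abbreviated illustrative list

def window_cooccurrences(tokens, target, window=5):
    """Count the words co-occurring with `target` within +/- `window` positions."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

tokens = "the tropics are the regions of earth surrounding the equator .".split()
print(window_cooccurrences(tokens, "tropics"))
# After stopword removal, more distant content words such as "equator"
# enter the 5-word window around "tropics":
filtered = [t for t in tokens if t not in STOPWORDS]
print(window_cooccurrences(filtered, "tropics"))
```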
Given this background, the present paper examines the impact of preprocessing on word embedding quality. As contextual methods generate embeddings for words within a sentence and are thus unsuitable for tasks such as word similarity and word analogy, they were not considered in the present paper.
The rest of the study is organized as follows: Sect. 2 presents the objectives and contributions of the paper. The three word embedding methods utilized to generate word embeddings are introduced in Sect. 3. The different types of preprocessing and their effects on word co-occurrence counts are described in Sect. 4. Section 5 gives a brief summary of the evaluation methods and datasets used, and finally, Sect. 6 elaborates on the experimental results.
2 Paper objectives
3 Methods
This section contains a brief introduction to three effective and popular word embedding methods: GloVe (Pennington et al., 2014), Word2vec (Mikolov et al., 2013a, b), and Fasttext (Bojanowski et al., 2016). These methods were applied to generate word embeddings in the experiments. They were chosen not only for their popularity but also because each captures a different aspect of co-occurrence information. Word2vec solely makes use of local co-occurrences of words, in which the words within a small window around the target word are used to predict that word. This type of co-occurrence misses certain valuable information, such as topical information (Huang et al., 2012). GloVe utilizes global co-occurrence counts. GloVe and Word2vec work at the word scale and are thus not very suitable for morphologically rich languages; moreover, both methods disregard unknown words (Bojanowski et al., 2016). The Fasttext method, however, operates at the sub-word level, is better suited for morphologically rich languages, and is capable of generating embeddings for unknown words. As Persian is a morphologically rich language, Fasttext was considered in this study as well. For more detailed information about these methods, please refer to the cited papers.
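As a brief illustration of the sub-word property (a minimal sketch using the gensim library on a toy corpus of our own, not the corpora or tooling of this study), Fasttext can compose a vector for an out-of-vocabulary word from its character n-grams:

```python
from gensim.models import FastText

# Toy corpus; the actual experiments use Wikipedia- and Hamshahri-scale corpora.
sentences = [["preprocessing", "changes", "cooccurrence", "counts"],
             ["embeddings", "are", "trained", "on", "large", "corpora"]]

model = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=10)

# "preprocessed" never occurs in the toy corpus, but Fasttext still builds a
# vector for it from shared character n-grams such as "pre" and "essing".
vec = model.wv["preprocessed"]
print(vec.shape)  # (100,)
```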
4 Text preprocessing
Virtually all natural language processing tasks are characterized by a preprocessing phase applied to the dataset. This preprocessing is necessary as most datasets contain HTML or XML tags, images, and a vast body of irrelevant and noisy information. The steps of preprocessing differ, however, from one task or one language to another. Generally, the preprocessing phase for an English text may contain the following steps:
1. Extracting pure text by eliminating HTML or XML tags, images, and links
2. Lowercasing: converting all words to lowercase
3. Tokenization: this preprocessing step is vital for all natural language processing tasks, including word embedding
Steps 1, 2, 3, 6, 7, and 8 are the same for English and Persian. For step 1, we used the WikiExtractor tool (https://github.com/attardi/wikiextractor). The crucial preprocessing step for Persian is to correct zero-width non-joiners (ZWNJs). In Persian, certain suffixes and prefixes attach to the main word with a ZWNJ. For example, "می/mi/" is a prefix for a verb that changes its tense, so /mi/ is joined to the verb with a ZWNJ. Sometimes in Persian texts, ZWNJs are mistakenly replaced with spaces; in tokenization, /mi/ is then considered an independent word (/mi/ has a homograph, "می/mey/", which means wine in Persian). There are multiple tools for the Persian language, such as Hazm (https://github.com/sobhe/hazm) and ParsiVar (Mohtaj et al., 2019), to correct the ZWNJs. This paper utilized the Hazm tool.
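As an illustration (a minimal sketch assuming the Hazm normalizer API and an example phrase of our own; the exact behavior depends on the Hazm version and its spacing options):

```python
from hazm import Normalizer

normalizer = Normalizer()
# "می روم" mistakenly uses a space after the verb prefix /mi/; the correct
# form joins the prefix with a ZWNJ: "می‌روم". The normalizer restores this.
text = "من می روم"
print(normalizer.normalize(text))  # expected: "من می‌روم"
```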
The basis of methods such as GloVe, Word2vec, and Fasttext is co-occurrence statistics. GloVe attempts to reconstruct the co-occurrence counts with the help of a bi-linear regression method, while the basis of Fasttext and Word2vec is the local co-occurrence of words. Preprocessing the corpus can change the co-occurrence statistics and thereby affect the quality of the generated word embeddings. The preprocessing steps performed on the Persian and English corpora are shown in Fig. 1.
As pre-trained word vectors are not task-specific, it is of utmost importance that embeddings be generated for as many words as possible. Thus, it appears that stopword removal is not a good choice when preprocessing a corpus used for the generation of word embeddings: since stopword removal changes the co-occurrence of words, it can also alter the meaning of a sentence. On the other hand, stopwords are not very relevant in most NLP tasks. The co-occurrence counts are computed in a window of 5 to 10 words around the target word, and if stopwords are not removed, certain valuable words may not fall within this 5–10-word window. As mentioned above, the deletion of punctuation and
numbers can change the co-occurrence counts, but which punctuation mark or which numeric value is involved plays no role in how the counts change. To examine the effect of removing punctuation and numbers on the quality of the generated word embeddings, two types of corpora were considered:
To compare the effect of the various preprocessing schemes on the quality of word vectors, several evaluation methods were utilized, falling into the categories of intrinsic and extrinsic evaluation. The intrinsic evaluations assess how well the generated word vectors measure the semantic relatedness and similarity between words. The main objective of this paper is to find which form of preprocessing improves the word vectors intrinsically. The standard intrinsic evaluation techniques used for the assessment of word vectors were word similarity and word analogy.
Because word embeddings are usually employed in natural language processing tasks, and the quality of word vectors affects the accuracy of those tasks, we also investigate how preprocessing the corpus before generating word embeddings affects the accuracy of natural language processing tasks. This kind of test is called extrinsic evaluation of word embeddings. For extrinsic evaluation, the word vectors were applied to sentiment analysis and text classification.
In a word similarity task, each dataset contains multiple word pairs, and each word pair has a pre-defined similarity score rated by humans. The similarity score between the embeddings of the two words is computed by cosine similarity. Next, the Pearson correlation coefficient between the human-rated similarity scores and the cosine similarity scores is computed.
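A minimal sketch of this evaluation follows (illustrative code; `embeddings` is an assumed dictionary from word to vector, and pairs with unknown words are skipped, a common practice that the paper does not spell out here):

```python
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_score(pairs, embeddings):
    """pairs: iterable of (word1, word2, human_score); returns Pearson's r."""
    human, model = [], []
    for w1, w2, gold in pairs:
        if w1 in embeddings and w2 in embeddings:  # skip unknown words
            human.append(gold)
            model.append(cosine(embeddings[w1], embeddings[w2]))
    r, _ = pearsonr(human, model)
    return r
```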
The goal of the word analogy task is to test the capacity of the generated word vectors to solve analogy questions. This task shows how word vectors can reveal linguistic regularities between words. Assume the goal is to answer the analogy question "man:woman::king:?", where vectors for man, woman, and king are available. First, a vector is obtained as vector(woman) − vector(man) + vector(king). Should the nearest vector to this result be vector(queen), it would show how word vectors can reveal the relationship in this analogy question. In English, the Google analogy questions (Mikolov et al., 2013a) are commonly used for evaluating word vectors on analogy questions. The dataset contains 19,544 question pairs, including 8869 semantic questions and 10,675 syntactic ones, covering 14 relation types, which are shown in Table 4. An analogy dataset has also been introduced for the Persian language (Zahedi et al., 2018), comprising 30,072 question pairs in 15 relations. The statistics for this dataset are shown in Table 5.

Table 5 Persian word analogy dataset statistics, including the type of relations, the number of word question pairs, and examples
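A minimal sketch of answering such a question by vector arithmetic (the common 3CosAdd formulation; `embeddings` is again an assumed word-to-vector dictionary, and the three query words are excluded from the candidate answers, as is standard):

```python
import numpy as np

def answer_analogy(a, b, c, embeddings):
    """Solve a:b::c:? by finding the word nearest to vec(b) - vec(a) + vec(c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):  # exclude the query words themselves
            continue
        sim = float(np.dot(vec, target) / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. answer_analogy("man", "woman", "king", embeddings) should return "queen"
```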
Table 7 The statistics of the 20-newsgroups dataset (classes, number of documents per class, and average number of words per document)

Table 9 The statistics of the IMDB dataset: Positive, 25,000 documents; Negative, 25,000 documents

Table 11 The statistics of the Hotel dataset: Positive, 6126 documents; Negative, 2380 documents
5.1 Extrinsic evaluation
For extrinsic evaluation, the word vectors were examined in two NLP tasks: text classification and sentiment analysis. For English text classification, the R8 and R52 datasets (both available at https://ana.cachopo.org/datasets-for-single-label-text-categorization) and the 20-newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) were used. The R8 dataset is a part of Reuters-21578 and contains 5485 documents for training and 2189 documents for testing; statistics on this dataset are listed in Table 6. The 20-newsgroups dataset is a collection of newsgroup posts, comprising 18,818 documents categorized into 20 classes; the statistics of this dataset are shown in Table 7. R52 is also a part of Reuters-21578 (Table 8) and contains 9100 documents in 52 classes, of which 6532 documents are used for training and the remainder for testing. The IMDB corpus (https://ai.stanford.edu/~amaas/data/sentiment/; Table 9) is commonly used for English sentiment analysis (25,000 documents for testing and 25,000 documents for training). For Persian text classification, the Hamshahri corpus (AleAhmad et al., 2009), collected from the Iranian newspaper "Hamshahri", was employed. The Hamshahri corpus contains more than 160,000 articles in 82 categories, as denoted
Table 14 Word similarity scores (Pearson correlation coefficients) for all four types of preprocessing of the Wikipedia dataset. ρG, ρW, and ρf show the Pearson correlation coefficients obtained by GloVe, Word2vec, and Fasttext, respectively

Dataset | Wikipedia_all_rem (ρG, ρW, ρf) | Wikipedia_punct (ρG, ρW, ρf) | Wikipedia_stop (ρG, ρW, ρf) | Wikipedia (ρG, ρW, ρf)
EN-MTurk-287 | 0.65 0.68 0.68 | 0.56 0.68 0.69 | 0.66 0.67 0.69 | 0.65 0.67 0.69
EN-MEN-TR-3k | 0.69 0.72 0.75 | 0.70 0.72 0.75 | 0.69 0.71 0.74 | 0.69 0.69 0.74
EN-RW-STANFORD | 0.35 0.40 0.45 | 0.36 0.42 0.45 | 0.35 0.42 0.45 | 0.35 0.39 0.445
EN-RG-65 | 0.76 0.79 0.79 | 0.745 0.81 0.79 | 0.73 0.79 0.78 | 0.73 0.77 0.795
EN-MC-30 | 0.7 0.73 0.8 | 0.71 0.75 0.8 | 0.66 0.73 0.81 | 0.62 0.72 0.79
EN-SimVerb-3500 | 0.16 0.21 0.21 | 0.173 0.2 0.21 | 0.17 0.22 0.22 | 0.18 0.19 0.22
EN-WS-353-REL | 0.52 0.57 0.66 | 0.52 0.57 0.66 | 0.54 0.56 0.65 | 0.53 0.55 0.65
EN-SIMLEX-999 | 0.31 0.33 0.33 | 0.32 0.32 0.33 | 0.315 0.32 0.32 | 0.32 0.29 0.31
EN-VERB-143 | 0.27 0.35 0.34 | 0.27 0.36 0.33 | 0.366 0.4 0.36 | 0.38 0.36 0.36
EN-WS-353-SIM | 0.68 0.74 0.79 | 0.68 0.74 0.77 | 0.68 0.73 0.77 | 0.66 0.72 0.76
SCWS | 0.55 0.65 0.65 | 0.56 0.64 0.65 | 0.57 0.65 0.65 | 0.57 0.64 0.65
EN-MTURK-771 | 0.624 0.62 0.64 | 0.62 0.61 0.64 | 0.62 0.61 0.64 | 0.63 0.59 0.64
EN-YP-130 | 0.396 0.33 0.43 | 0.41 0.33 0.41 | 0.47 0.36 0.44 | 0.46 0.29 0.45
EN-WS-353-ALL | 0.58 0.65 0.72 | 0.58 0.65 0.72 | 0.59 0.65 0.71 | 0.57 0.64 0.71
Average | 0.52 0.54 0.59 | 0.51 0.56 0.58 | 0.53 0.56 0.59 | 0.52 0.53 0.58

The boldface numbers show the best results for each row and only for the datasets solely comprised of verbs
in Table 10. For Persian sentiment analysis, the Hotel dataset (http://dataheart.ir/article/3414/%D9%85%D8%AC%D9%85%D9%88%D8%B9%D9%87-%D8%AF%D8%A7%D8%AF%D9%87-%D9%86%D8%B8%D8%B1%D8%A7%D8%AA-%D9%81%D8%A7%D8%B1%D8%B3%DB%8C-%D8%A8%D8%B1%DA%86%D8%B3%D8%A8-%DA%AF%D8%B2%D8%A7%D8%B1%DB%8C-%D8%B4%D8%AF%D9%87-%D9%87%D8%AA%D9%84) was used, which contains approximately 8500 comments by different people about Iranian hotels (6800 documents for training and 1706 documents for testing). More information about this dataset can be seen in Table 11.
The Wikipedia 2017 dataset (https://archive.org/details/enwiki-20170920) was used to learn English word embeddings. The statistics of this dataset under the four types of preprocessing are shown in Table 12.
For learning Persian word embeddings, the Hamshahri and Alaem_dadeha corpora were used. The results in Table 13 are reported for word embeddings trained on the Hamshahri and Hamshahri + Alaem_dadeha (Ham_alaem) corpora.
6 Experimental results
The experimental results of this study are provided for four natural language processing tasks. Word vectors were intrinsically evaluated using word similarity and word analogy, while text classification and sentiment analysis were used for the extrinsic evaluation of word vectors.
6.1 Experimental settings
The GloVe, Word2vec (CBOW), and Fasttext techniques were used to extract word vectors. The window size for all word embedding tools was set to 5 (five words before and five words after the target word), and the vector size for all experiments was set to 100. The vocab_size of GloVe for the English dataset was set at 400. X_max for GloVe was equal to 10, with α = 0.75. The maximum number of iterations for Word2vec and GloVe was set to 15, and the number of negative samples for Word2vec was set to 25. Default settings were retained for Fasttext.
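For reference, a comparable configuration can be sketched with the gensim library (an approximation under our own assumptions, not the exact tooling used in the paper; `corpus_sentences` is a placeholder for the tokenized training corpus):

```python
from gensim.models import FastText, Word2Vec

corpus_sentences = [["a", "toy", "sentence"]]  # placeholder; use the real corpus

# CBOW (sg=0) with the reported window size, dimensionality, negative samples,
# and number of iterations; min_count=1 only so the toy corpus trains.
w2v = Word2Vec(corpus_sentences, vector_size=100, window=5,
               sg=0, negative=25, epochs=15, min_count=1)

# Fasttext with default settings apart from the shared window and vector size.
ft = FastText(corpus_sentences, vector_size=100, window=5, min_count=1)
```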
6.2 Word similarity
Table 14 gives the overall results of the English word similarity analysis. For this table, the Pearson correlation coefficient was computed between the pre-defined similarity scores rated by humans and the inter-embedding scores computed by cosine similarity; ρG, ρW, and ρf denote the Pearson correlation coefficients achieved by GloVe, Word2vec, and Fasttext, respectively. As can be seen in the table, the average word similarity scores obtained by GloVe and Word2vec for Wikipedia_stop are higher than the others, while for Fasttext, the similarity scores are nearly the same for Wikipedia_all_rem and Wikipedia_stop. The results also show that, on average, stopwords strongly affect word similarity, whereas punctuation and numbers do not have a significant effect on the word similarity score. The deletion of stopwords changes the local co-occurrence counts and is therefore, in all likelihood, influential on the word vectors generated by Word2vec (CBOW), as CBOW uses the words around the target word in a local window to guess the target word (Mikolov et al., 2013b). Stopword deletion also decreases the counts in the global word co-occurrence matrix, pushing many co-occurrence counts below the threshold at which GloVe treats them as noisy; the generated word vectors are therefore sensitive to the deletion of stopwords from the corpus. This is owed to the objective function of GloVe, which penalizes counts below a certain threshold (Pennington et al., 2014). Fasttext, however, uses a modified version of Skipgram to work with character n-grams; therefore, local co-occurrences are also important in Fasttext (Bojanowski et al., 2016).
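For reference, GloVe's weighting function from Pennington et al. (2014), with the $x_{\max} = 10$ and $\alpha = 0.75$ used in the experimental settings above, down-weights rare (potentially noisy) co-occurrence counts $X_{ij}$ in the objective:

$$
f(x) = \begin{cases} (x/x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise,} \end{cases}
\qquad
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

Deleting stopwords shrinks many $X_{ij}$, moving them into the heavily down-weighted region below $x_{\max}$.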
EN-SimVerb-3500, EN-VERB-143, and EN-YP-130 are datasets solely comprised of verbs, for which the following principles hold: a verb has to agree with its subject in number and person; stopwords such as "is", "are", "have", "has", etc. show the number, person, and tense of a verb; and punctuation is affected by the number of a verb. For example, the commas in a sentence like "Sara, her sister, and her mother go on a picnic" indicate the number of the verb "go". As shown in Table 14, the word similarity scores for embeddings trained on Wikipedia and Wikipedia_stop are higher than the others for these datasets. This shows that retaining stopwords and punctuation in a corpus can improve the quality of word embeddings for verbs.
Table 15 Word similarity scores for all four types of preprocessing of the Ham_alaem dataset. ρG, ρW, and ρf show the Pearson correlation coefficients obtained by GloVe, Word2vec, and Fasttext, respectively

Dataset | Ham_alaem_all_rem (ρG, ρW, ρf) | Ham_alaem_punct (ρG, ρW, ρf) | Ham_alaem_stop (ρG, ρW, ρf) | Ham_alaem (ρG, ρW, ρf)
PER-MTurk-287 | 0.4842 0.5455 0.5223 | 0.4530 0.5661 0.5383 | 0.5105 0.5494 0.5150 | 0.4938 0.5657 0.5563
PER-MC-30 | 0.4597 0.7655 0.8233 | 0.4153 0.7624 0.7562 | 0.4163 0.7512 0.7407 | 0.3905 0.7035 0.7634
PER-Simlex-999 | 0.1723 0.2443 0.1963 | 0.1652 0.2512 0.1996 | 0.1645 0.2337 0.2054 | 0.1837 0.2444 0.1924
Semeval-17 | 0.4045 0.4835 0.4839 | 0.3741 0.4745 0.4892 | 0.4249 0.4668 0.5170 | 0.4119 0.4785 0.5114
PER-RG-65 | 0.6639 0.6555 0.7195 | 0.6323 0.6546 0.6612 | 0.6078 0.6307 0.6772 | 0.5986 0.6092 0.6983
Average | 0.44 0.54 0.55 | 0.41 0.54 0.53 | 0.42 0.52 0.53 | 0.41 0.52 0.54

The boldface numbers show the best average results for GloVe, Word2vec, and Fasttext among all preprocessing types
Table 16 Word similarity scores for all four types of preprocessing of the Hamshahri dataset. ρG, ρW, and ρf show the Pearson correlation coefficients obtained by GloVe, Word2vec, and Fasttext, respectively

Dataset | Hamshahri_all_rem (ρG, ρW, ρf) | Hamshahri_punct (ρG, ρW, ρf) | Hamshahri_stop (ρG, ρW, ρf) | Hamshahri (ρG, ρW, ρf)
PER-MTurk-287 | 0.6281 0.6287 0.6234 | 0.5967 0.6456 0.6170 | 0.6985 0.6618 0.6039 | 0.7139 0.6406 0.6479
PER-MC-30 | 0.1850 0.6718 0.6850 | 0.3018 0.5991 0.5771 | 0.4604 0.7521 0.5969 | 0.1696 0.4912 0.5132
PER-Simlex-999 | 0.3383 0.3815 0.3474 | 0.3439 0.3717 0.3522 | 0.3498 0.2337 0.3465 | 0.2794 0.3376 0.3353
Semeval-17 | 0.4889 0.4591 0.5954 | 0.4775 0.4591 0.5977 | 0.5052 0.4668 0.6021 | 0.4627 0.4955 0.5916
PER-RG-65 | 0.5759 0.6559 0.6325 | 0.5158 0.5794 0.6528 | 0.5339 0.6307 0.6390 | 0.4374 0.5395 0.2586
Average | 0.44 0.56 0.58 | 0.45 0.54 0.56 | 0.51 0.55 0.56 | 0.41 0.50 0.47

The boldface numbers show the best average results for GloVe, Word2vec, and Fasttext among all preprocessing types
Table 17 The word analogy accuracies (%) on four types of corpora obtained by GloVe, CBOW, and Fasttext

Type of relation | Wikipedia_all_rem (GloVe, CBOW, Fasttext) | Wikipedia_punct (GloVe, CBOW, Fasttext) | Wikipedia_stop (GloVe, CBOW, Fasttext) | Wikipedia (GloVe, CBOW, Fasttext)
Capital common country | 92.89 86.76 86.36 | 90.51 87.35 86.17 | 92.29 85.38 88.93 | 84.58 76.88 86.76
Capital world | 86.34 90 78.23 | 80.24 88.93 77.79 | 82.34 88.15 77.9 | 70.78 77.61 77.1
City in state | 32.14 17.55 13.39 | 8.2 19.52 11.43 | 7.27 16.74 12.36 | 6.24 13.74 11.78
Currency | 7.39 60.8 28.05 | 27.6 59.59 27.73 | 29.02 58.33 26.78 | 23.83 39.64 26.96
Family | 68.97 74.2 68.18 | 69.96 75.69 62.65 | 79.64 85.38 69.57 | 77.47 73.72 69.96
Gram1 adjective to adverb | 22.47 74.9 34.07 | 21.4 27.42 33.57 | 19.35 25.91 33.27 | 20.77 21.88 31.65
Gram2 opposite | 16.5 26.3 17.49 | 20.07 23.40 18.47 | 18.1 24.26 22.04 | 19.83 17.61 21.55
Gram3 comparative | 65.99 24.14 68.24 | 69.37 74.77 67.04 | 70.72 78.15 70.8 | 71.1 64.94 73.57
Gram4 superlative | 36.54 74.55 41.53 | 37.08 48.22 41 | 40.46 53.03 47.95 | 38.5 31.02 53.57
Gram5 present participle | 47.44 46.08 39.58 | 50.19 45.55 41.76 | 53.22 55.78 50.57 | 58.05 35.8 50.95
Gram6 nationality adjective | 89.31 46.69 85.24 | 89.37 87.62 84.24 | 87.3 86.68 83.99 | 83.43 80.11 84.49
Gram7 past tense | 48.72 87.24 51.73 | 49.87 59.1 52.63 | 46.92 60.26 55.58 | 46.09 47.69 54.04
Gram8 plural | 80.41 59.29 77.33 | 77.03 75.9 76.8 | 73.42 73.95 75.08 | 73.05 57.51 75.38
Gram9 plural verbs | 42.07 76.88 50 | 41.26 47.93 48.28 | 48.16 62.87 60.34 | 47.59 43.33 62.07
Average of semantic | 57.55 65.86 54.84 | 55.30 66.22 53.15 | 58.11 66.80 55.11 | 52.58 56.32 54.51
Average of syntactic | 50.21 57.65 51.80 | 50.79 54.85 51.59 | 51.10 58.20 55.50 | 50.99 44.86 56.30
Total accuracy | 52.66 60.38 52.82 | 52.30 58.64 52.11 | 53.44 61.06 55.37 | 51.52 48.68 55.70

The boldface numbers show the best average results for GloVe, Word2vec, and Fasttext among all preprocessing types
The word similarity results for Persian are available in Tables 15 and 16. In Table 15, on average, the results for Ham_alaem_all_rem are better than the others, while in Table 16, on average, the results for Hamshahri_all_rem and Hamshahri_stop are better than the others. For both corpora, the results for the corpus with no preprocessing were, in most cases, worse than the others. The Hamshahri and Alaem_dadeha corpora differ in the number of tokens and in domain: Alaem_dadeha contains informal language in some documents, while Hamshahri is from the news domain and its sentences are generally in formal writing. Thus, the patterns of results tend to differ between the Hamshahri and Ham_alaem corpora. In the Hamshahri corpus, the number of unknown words for the word similarity datasets was higher than in the Hamshahri + Alaem_dadeha (Ham_alaem) corpus; for example, of the 65 word pairs in PER-RG-65, 12 word pairs are unknown in the Ham_alaem dataset and 16 word pairs are unknown in the Hamshahri dataset. One is then likely to expect better results for the Hamshahri dataset than for the Ham_alaem dataset. The average similarity scores (over all tools) in descending order were Wikipedia_all_rem, Wikipedia_stop, Wikipedia_punct, and Wikipedia.
6.3 Word analogy
The accuracy results obtained in the word analogy task for the English corpus are shown in Table 17. The results for all embedding techniques, including GloVe, Fasttext, and CBOW, tend to improve on syntactic questions when the Wikipedia_stop corpus is used as the word embedding corpus. It can be seen that stopwords may positively affect the search for syntactic relations among words. The significance of this finding lies in the fact that most stopwords, such as determiners, pronouns, conjunctions, and prepositions, are function words, that is, words that define the grammatical relationships between other words in a sentence.
Wikipedia_stop also showed good performance in finding semantic relations between words. The results for Wikipedia_all_rem are better than those for Wikipedia_punct, yet worse than those for Wikipedia_stop; therefore, the presence of punctuation is not beneficial to the task of finding semantic relations between words. In general, using Wikipedia with no preprocessing as a corpus for word embedding in the word analogy task is not recommended; rather, it is suggested that certain preprocessing procedures be applied first.
The average syntactic accuracy results for Persian word analogy are reported in Tables 18 and 19. As can be seen in these tables, the average accuracy on syntactic questions is better when Hamshahri_stop is used as the word embedding corpus than with the other corpora. As in English, stopwords in Persian can also contribute to the search for syntactic relations among words. As most function words ("کلمات دستوری"/kalamate dasturi/) in Persian are stopwords, this result is significant.
Table 18 The word analogy accuracies (%) on word vectors trained on the Ham_alaem dataset

Type of relation | Ham_alaem_all_rem (GloVe, CBOW, Fasttext) | Ham_alaem_punct (GloVe, CBOW, Fasttext) | Ham_alaem_stop (GloVe, CBOW, Fasttext) | Ham_alaem (GloVe, CBOW, Fasttext)
Country-capital | 30.13 36.54 21.15 | 32.05 32.69 25 | 30.77 35.90 23.08 | 16.67 29.2 18.25
Province-capital | 13.66 16.67 6.53 | 11.64 16.07 5.71 | 12.09 13.29 5.03 | 12.06 12 4.95
Family | 43.96 54.4 53.3 | 42.86 52.75 45.05 | 41.76 51.1 51.65 | 40.11 39.3 52
gram_third_person | 41.13 50.65 55.41 | 41.77 47.84 58.87 | 62.12 68.18 75.11 | 61.47 71.1 74.3
gram_past | 13.33 11.19 25.95 | 13.10 10.24 26.67 | 35.71 30.09 47.84 | 36.36 30.3 37
gram_adj2adv | 2.85 6.51 11.41 | 2.67 5.17 9.54 | 3.85 7.93 10.70 | 2.76 5.23 8.5
gram_Noun-adverb | 0 0.15 0.92 | 0.15 0.15 2 | 0.79 0.93 1.32 | 0.93 0.89 1.84
gram_antonym | 5.42 20.20 18.6 | 7.39 19.95 19.33 | 6.77 17.61 18.35 | 6.53 16.64 19.23
gram_comparative | 2.91 8.07 23.02 | 2.91 8.07 18.39 | 6.61 13.62 32.94 | 4.63 12.3 6.2
gram_plural | 7.39 13.07 11.17 | 6.91 15.06 13.07 | 8.05 14.11 11.55 | 8.52 15.4 12.3
gram_firstperson | 9.56 9.19 27.21 | 12.5 9.93 29.41 | 18.38 18.01 36.03 | 10.29 12.05 19
Average accuracy of semantic questions | 29.25 35.87 26.99 | 28.85 33.83 25.25 | 28.20 33.43 26.60 | 22.946 26.8 25
Average accuracy of syntactic questions | 10.32 14.87 21.71 | 10.91 14.55 25.31 | 17.78 21.31 29.23 | 16.40 20.48 24.6
Total average | 15.48 20.6 23.15 | 15.84 19.81 25.29 | 20.62 24.61 28.51 | 18.185 22.2 24.76

The boldface numbers show the best average results for GloVe, Word2vec, and Fasttext among all preprocessing types
Table 19 The word analogy accuracies (%) for word vectors trained on the Hamshahri dataset

Type of relation | Hamshahri_all_rem (GloVe, CBOW, Fasttext) | Hamshahri_punct (GloVe, CBOW, Fasttext) | Hamshahri_stop (GloVe, CBOW, Fasttext) | Hamshahri (GloVe, CBOW, Fasttext)
Country-capital | 45.45 53.97 34.09 | 36.36 55.3 40.15 | 43.18 51.52 36.36 | 37.88 49.24 31.06
Province-capital | 18.78 29.10 7.94 | 18.78 26.85 8.33 | 18.39 24.34 8.6 | 13.89 18.78 7.14
Family | 60.61 56.06 34.09 | 56.82 53.79 37.88 | 66.67 46.21 39.39 | 44.70 46.97 40.15
gram_third_person | 34.12 39.47 54.47 | 33.68 42.89 53.95 | 49.74 52.89 60.26 | 42.37 55.26 59.47
gram_past | 15.44 12.87 29.78 | 16.54 12.5 26.47 | 34.64 25.16 38.24 | 27.12 20.92 30.07
gram_adj2adv | 1.06 5.82 9.26 | 1.32 5.03 10.71 | 2.83 5.91 12.44 | 2.22 5.42 11.45
gram_Noun-adverb | 0 0.22 0.22 | 0.22 0.22 0.43 | 0.77 0.62 1.08 | 0.92 0.46 1.23
gram_antonym | 8.50 16.17 16.50 | 7.50 14.83 14.5 | 10.71 15.67 15.83 | 9.83 15 17
gram_comparative | 4.97 7.31 15.5 | 4.09 8.48 15.5 | 7.60 15.2 25.44 | 8.77 18.42 26.61
gram_plural | 10.33 16.5 15.83 | 11 15 13.17 | 8.83 15.83 13 | 11.33 17.67 14.5
Average accuracy of semantic questions | 41.61 46.37 25.37 | 37.32 45.31 28.78 | 42.74 40.69 28.11 | 32.15 38.33 26.11
Average accuracy of syntactic questions | 10.63 14.05 20.22 | 10.62 18.03 19.24 | 16.44 18.75 27.77 | 14.65 19.02 22.9
Total average | 19.93 23.747 21.767 | 20.33 17.27 22.1 | 30.247 25.33 27.87 | 19.90 24.81 23.86

The boldface numbers show the best average results for GloVe, Word2vec, and Fasttext among all preprocessing types
Word vectors trained on the Hamshahri dataset show higher accuracy on syntactic questions for Hamshahri_stop, Hamshahri_all_rem, and Hamshahri_punct, while the best total average of accuracies (over all word analogy questions, semantic and syntactic) belongs to Hamshahri_stop for all tools used to generate word embeddings in this paper (GloVe, Fasttext, and CBOW).
As shown in Table 18, for the Ham_alaem dataset, which is larger than the Hamshahri dataset, the average accuracy on semantic questions for Ham_alaem_all_rem is slightly better than for Ham_alaem_stop, though the results are close, and the average accuracy on syntactic questions is best for Ham_alaem_stop. The total average accuracy (over all word analogy questions, semantic and syntactic) for Ham_alaem_stop was superior to the others.
6.4 Extrinsic evaluation
As mentioned in Sect. 5.1, text classification and sentiment analysis were used for
the extrinsic evaluation of word vectors in four types of preprocessing scenarios.
The results of these evaluations are described in this section.
6.4.1 Text classification
For text classification, long short-term memory (LSTM) and convolutional neural network (CNN) models were employed as classifiers. The structures of these neural networks are shown in Figs. 4 and 5. For the LSTM network, the Dropout1D rate was set to 0.4 and lstm_out to 196, along with the use of dropout and recurrent dropout. For the CNN network, the number of filters was set to 128 and the kernel size to 5. The results for English text classification obtained by CNN and LSTM are shown in Tables 20 and 21, respectively. In order to isolate the effect of the different kinds of preprocessing, stopwords and punctuation were retained in all of the corpora used for text classification, while Wikipedia, Wikipedia_stop, Wikipedia_punct, and Wikipedia_all_rem were used separately as word embedding corpora.
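A minimal sketch of the LSTM classifier along the lines described (assuming the tf.keras 2.x API; the vocabulary size, sequence length, number of classes, and pretrained embedding matrix are illustrative placeholders, the dropout rates inside the LSTM are not reported in the paper and are assumptions, and the reported Dropout1D corresponds to SpatialDropout1D here):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

vocab_size, seq_len, embed_dim, num_classes = 20000, 200, 100, 8  # placeholders
embedding_matrix = np.random.rand(vocab_size, embed_dim)  # pretrained vectors go here

model = Sequential([
    Embedding(vocab_size, embed_dim, weights=[embedding_matrix],
              input_length=seq_len, trainable=False),
    SpatialDropout1D(0.4),                          # reported Dropout1D rate of 0.4
    LSTM(196, dropout=0.2, recurrent_dropout=0.2),  # lstm_out = 196; rates assumed
    Dense(num_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```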
According to the text classification accuracy values in Table 20, Wikipedia_all_rem and Wikipedia_stop resulted in higher performance in most cases, while Wikipedia_punct failed to show good performance. In the majority of cases, the Wikipedia_all_rem corpus led to better results.
For Persian text classification, Ham_alaem, Ham_alaem_stop, Ham_alaem_all_rem, and Ham_alaem_punct were used separately as corpora for generating word embeddings. The results (classification accuracy) for Persian text classification are shown in Figs. 2 and 3. According to Fig. 2, the classification accuracies for Ham_alaem_all_rem and Ham_alaem_stop are better than those for the two remaining preprocessed corpora. For GloVe, the accuracy of Ham_alaem_stop is higher than the others, yet quite similar to Ham_alaem_all_rem; for CBOW and Fasttext, the accuracy of Ham_alaem_all_rem is higher than for the other corpora. According to Fig. 3, the accuracies for all types of preprocessing are the same when GloVe and CBOW are used as the word embedding method, while for Fasttext, the classification accuracies are higher when the Ham_alaem_all_rem corpus is used. Altogether, the removal of stopwords and punctuation from the corpora used to extract word vectors is useful for the text classification task. Stopwords comprise the most frequent words in a language and are abundantly found in virtually every document in a corpus. When stopwords appear among the features used to represent a document, the text classifier considers many dissimilar documents as similar, as they share many stopwords. This reiterates the fact that stopwords are not relevant features for the task of text classification.
Table 20 The CNN text classification accuracy (%) on Wikipedia, Wikipedia_all_rem, Wikipedia_punct, and Wikipedia_stop

Dataset | Wikipedia_all_rem (GloVe, CBOW, Fasttext) | Wikipedia_punct (GloVe, CBOW, Fasttext) | Wikipedia_stop (GloVe, CBOW, Fasttext) | Wikipedia (GloVe, CBOW, Fasttext)
R8 | 95.38 95.24 95.65 | 94.97 95 93.92 | 96.11 93.41 90.7 | 95.9 94.65 94.5
R52 | 86.6 87.6 85.32 | 83.12 84.9 82 | 87 86.8 84.76 | 84.6 87.1 84.88
20-newsgroups | 66.8 65.6 65.68 | 65.61 64 65.93 | 64.78 64.3 68.1 | 63.95 66.64 67.57
Table 21 The LSTM text classification accuracy (%) on Wikipedia, Wikipedia_all_rem, Wikipedia_punct, and Wikipedia_stop

Dataset | Wikipedia_all_rem (GloVe, CBOW, Fasttext) | Wikipedia_punct (GloVe, CBOW, Fasttext) | Wikipedia_stop (GloVe, CBOW, Fasttext) | Wikipedia (GloVe, CBOW, Fasttext)
R8 | 89.35 90.5 84.1 | 86.51 91.13 82.86 | 87.2 92 84.1 | 86.2 90.9 80
R52 | 76.74 80.32 70.93 | 76.43 80.00 74 | 75.84 80.17 71.44 | 71.75 76.3 70
20-newsgroups | 66.79 68.0 69.4 | 65.61 64 65.93 | 65.78 67.93 64.3 | 65.5 63.46 65.6
Fig. 2 Text classification accuracy (%) for the LSTM text classifier using four types of preprocessing of the Ham_alaem corpus. GloVe, Word2vec, and Fasttext are used as word embedding methods
Fig. 3 Text classification accuracy (%) for the CNN text classifier using four types of preprocessing of the Ham_alaem corpus. GloVe, Word2vec, and Fasttext are used as word embedding methods
Fig. 4 LSTM network structure for text classification and sentiment analysis
Fig. 5 Convolutional network structure for sentiment analysis and text classification (Embedding layer, Conv1D, Global max pooling, Dense, softmax)
Fig. 6 LSTM sentiment analysis accuracy (%) on the IMDB dataset for four types of preprocessing procedures on the Wikipedia dataset
Fig. 7 CNN sentiment analysis accuracy (%) on the IMDB dataset for four types of preprocessing procedures on the Wikipedia dataset
Fig. 8 CNN sentiment analysis F-measure (%) on the Hotel dataset with four types of preprocessing of the Ham_alaem dataset
Fig. 9 LSTM sentiment analysis F-measure (%) on the Hotel dataset with four preprocessed Ham_alaem corpora
6.4.2 Sentiment analysis
Long short-term memory (LSTM) and convolutional neural network (CNN) models were also applied for sentiment analysis. The structures of the LSTM and CNN networks are shown in Figs. 4 and 5, respectively. For the LSTM network, the Dropout1D rate was 0.4 and lstm_out was set to 196, with the inclusion of dropout and recurrent dropout. For the CNN network, the number of filters was set to 128 and the kernel size to 5, and the ReLU activation function was used for the Conv1D layers. The results for the English sentiment analysis tasks are shown in Figs. 6 and 7. For the CNN, Wikipedia_stop showed higher performance than Wikipedia_punct, Wikipedia_all_rem, and Wikipedia for GloVe, Fasttext, and CBOW. In the case of the LSTM neural network, the results for Wikipedia_stop were better than the others for word vectors obtained using GloVe, while the results on Wikipedia_all_rem, Wikipedia_punct, and Wikipedia_stop were rather similar. For Fasttext, the results of Wikipedia_stop were slightly superior to the others, although the results for all four types of preprocessing are relatively similar. For Word2vec, the best results were obtained for Wikipedia_all_rem, albeit very close to those for Wikipedia_stop. The results of CNN and LSTM on the sentiment analysis task indicate that retaining stopwords can be useful.
Stopwords also proved effective for sentiment analysis; for example, stopwords such as "not" or "don't" may change the polarity of a word. It is noteworthy that stopwords are not eliminated by default in the IMDB dataset. If stopwords are removed from the Wikipedia corpus, their embeddings are not generated, and therefore, whether or not they exist in the IMDB dataset, they have no effect on the sentiment of a document.
The results for Persian sentiment analysis on the Hotel dataset are shown in Figs. 8 and 9. The Hotel dataset is unbalanced, with far more positive examples than negative ones, which is why the F-measure is reported. Generally, the results on all corpora are very close to each other. However, as the figures show, the best F-measure results for CNN belong to Ham_alaem_stop, and for LSTM, with the exception of GloVe, they also belong to Ham_alaem_stop.
6.5 Discussion
The effect of text preprocessing on the process of generating word embeddings was explored in the previous sections. Four types of preprocessing were applied to the English and Persian corpora. For English, the Wikipedia corpus was used to train the GloVe, Word2vec, and Fasttext word embedding models; for Persian, the Hamshahri and Alaem_dadeha corpora were used to train the same models. The statistics of these corpora are shown in the "Evaluation methods and corpora" section. Two types of evaluation methods were employed: extrinsic evaluation (sentiment analysis and text classification) and intrinsic evaluation (word similarity and word analogy). The results were language- and task-dependent.
For Persian word similarity, the best average results were obtained for the Ham_alaem_all_rem corpus (punctuation, numbers, and stopwords all removed), whereas in English, the best results were achieved for the Wikipedia_stop corpus (punctuation and numbers removed but stopwords retained). This difference between the results for the Persian and English corpora is in all likelihood due to differences between the datasets: the Persian word similarity datasets, for instance, contain no verbs among their pairs, so stopwords cannot affect the results. According to Tables 15 and 16, as Persian is a morphologically rich language, the Pearson correlation coefficients (computed between the pre-defined similarity scores rated by humans and the inter-embedding scores computed by cosine similarity) are higher for Fasttext than for GloVe and CBOW. In the word analogy task, the best average accuracy on semantic questions was attained for Wikipedia_all_rem and Ham_alaem_all_rem. The best average accuracy on syntactic questions was achieved by Ham_alaem_stop, and in English, Wikipedia and Wikipedia_stop showed better results. The findings indicate that retaining stopwords leads to better results in the syntactic case, as most stopwords, such as determiners, pronouns, conjunctions, and prepositions, are function words: words that define the grammatical relationships between other words in a sentence.
Apropos of extrinsic evaluation, sentiment analysis and text classification tasks were implemented. The tasks were modeled using LSTM and CNN deep neural networks: the embeddings of the words of a document were fed to the networks as input, and a softmax layer was used as the last layer to perform the classification of the document. According to Tables 20 and 21, for text classification, Wikipedia_all_rem (English) and Ham_alaem_all_rem (Persian) generate better results in the majority of cases. This result is to be expected, as punctuation, numbers, and stopwords are not highly relevant features for representing a document in the task of text classification. For sentiment analysis, Wikipedia_stop and Ham_alaem_stop present higher accuracies than the other preprocessed corpora. Certain stopwords such as "not" and "don't" were shown to change the polarity of a word, and therefore the sentiment of a sentence. For example, in the sentence "The movie was not interesting", the word "not" changes the sentiment of the sentence from positive to negative.
If the words "was not" are removed from the sentence, the classifier may classify this sentence as positive.
The main assumption of this paper was that the preprocessing of a corpus, including steps such as removing stopwords, punctuation, and numbers, can change the co-occurrence of words, while word co-occurrences, in turn, are the basis of most word embedding tools; preprocessing could, therefore, change the quality of the word embeddings generated by such tools. The deletion of stopwords, punctuation, and numbers changes the local co-occurrence counts and thereby affects the word vectors generated using Word2vec (CBOW), since CBOW uses the words around the target word in a local window to guess the target word (Mikolov et al., 2013b). It also decreases the counts in the global word co-occurrence matrix, which is precisely why, in GloVe, many co-occurrence counts come to be treated as noisy, making the generated word vectors sensitive to the deletion of stopwords from the corpus; this is owed to the objective function of GloVe, which penalizes counts below a certain threshold (Pennington et al., 2014). Fasttext uses a modified version of Skipgram (Mikolov et al., 2013b) to work with character n-grams; therefore, local co-occurrences are also significant in Fasttext (Bojanowski et al., 2016). The results of this paper have the following implications:
• Retaining punctuation and numbers is not beneficial to any task except the word similarity of verbs, and even there the corresponding word similarity results show no significant change. When punctuation and numbers are removed, however, the size of the corpus decreases and the time complexity of corpus processing is reduced. Therefore, these irrelevant features can be removed to increase the speed of data processing.
• Retaining stopwords improves the accuracy of the word analogy task, the scores for verb similarity, and the sentiment analysis task. The removal of stopwords such as "not" or "don't" can change the polarity of words in sentiment analysis and can thereby alter the polarity of documents. For solving the syntactic and semantic questions of word analogy, most stopwords, such as determiners, pronouns, conjunctions, and prepositions, are function words: words that define the grammatical relationships between other words in a sentence. For measuring the similarity of verbs, stopwords can indicate the number and tense of a verb, and thereby affect word similarity.
• Removal of stopwords is beneficial to text classification. When stopwords appear among the features used to represent a document, the text classifier may consider many dissimilar documents as similar, as they share many stopwords. Stopwords are, therefore, not good features for the task of text classification.
• The intrinsic assessments show that retaining stopwords improves word vectors for finding analogy relations and similarities between verbs. However, a set of word vectors that is intrinsically better than another is not necessarily better for every downstream task. The results do show, though, that intrinsic evaluation performance correlates more strongly with sentiment analysis than with text classification.
7 Conclusions
The present study investigated the effects of different preprocessing procedures on the quality of word vectors generated using Fasttext, GloVe, and Word2vec as state-of-the-art word embedding techniques. Intrinsic and extrinsic evaluation methods and different Persian and English datasets were used in the experiments. The results show that the best type of corpus preprocessing for generating word embeddings is task-dependent. For example, in the task of Persian word analogy, removing punctuation and numbers while retaining stopwords leads to better results, whereas in the Persian word similarity task, removing all punctuation, numbers, and stopwords yields a higher Pearson correlation coefficient (computed between the pre-defined similarity scores rated by humans and the inter-embedding scores computed by cosine similarity). In general, the corpus_stop and corpus_all_rem variants generated better results for all the tasks, and therefore the retention of punctuation and numbers is not beneficial to the generation of word embeddings. In extrinsic evaluations such as text classification, removing punctuation, numbers, and stopwords improves the classification accuracy, as stopwords are frequent in all documents and documents from different classes may share similar stopwords; retaining stopwords thus misleads the classifier in distinguishing between documents from different classes. For the sentiment analysis task, it is recommended to retain stopwords, as stopwords such as "no" and "don't" change the polarity of words in documents. Finally, the intrinsic assessments show that retaining stopwords improves word vectors for finding analogy relations and similarities between verbs; however, a set of word vectors that is intrinsically better than another is not necessarily better for every downstream task, though the results show that intrinsic evaluation performance correlates more strongly with sentiment analysis than with text classification.
For future work, it is recommended to investigate the effects of other types of preprocessing, such as the removal of stopwords with respect to word frequency, on word embedding quality, as well as to generate other word similarity datasets for Persian, particularly for Persian verbs.
Acknowledgements The authors wish to express their thanks for the financial support of the Iran National Science Foundation (INSF), Project No. 97009308.
References
Abdi, A., Shamsuddin, S. M., Hasan, S., & Piran, J. (2019). Deep learning-based sentiment classifica-
tion of evaluative text based on Multi-feature fusion. Information Processing and Management, 56,
1245–1259. https://doi.org/10.1016/j.ipm.2019.02.018
Ajees, A. P., & Idicula, S. M. (2018). A named entity recognition system for malayalam using neural
networks. Procedia Computer Science, 143, 962–969. https://doi.org/10.1016/j.procs.2018.10.338
AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M., & Oroumchian, F. (2009). Hamshahri: A standard
Persian text collection. Knowledge-Based Systems, 22, 382–387. https://doi.org/10.1016/j.knosys.
2009.05.002
Alkhatlan, A., Kalita, J., & Alhaddad, A. (2018). Word sense disambiguation for arabic exploiting arabic
wordnet and word embedding. Procedia Computer Science, 142, 50–60. https://doi.org/10.1016/j.
procs.2018.10.460
Angiani, G., Ferrari, L., Fontanini, T., Fornacciari, P., Iotti, E., Magliani, F., & Manicardi, S. (2016). A
comparison between preprocessing techniques for sentiment analysis in Twitter. CEUR Workshop
Proceedings, 1748, 1–11. https://doi.org/10.1007/978-3-319-67008-9_31
Ayca, D., & Hakan, E. K. (2017). Effects of various preprocessing techniques on Turkish text categorization using n-gram features. In 2nd international conference on computer science and engineering (pp. 655–660). IEEE. https://doi.org/10.1109/UBMK.2017.8093491
Baker, S., Reichart, R., & Korhonen, A. (2014). An unsupervised model for instance level subcategoriza-
tion acquisition. In EMNLP 2014—2014 Conf. Empir. Methods Nat. Lang. Process. Proc. Conf. (pp.
278–289). https://doi.org/10.3115/v1/d14-1034.
Banik, D., Ekbal, A., Bhattacharyya, P., & Bhattacharyya, S. (2019). Assembling translations from multi-
engine machine translation outputs. Applied Soft Computing Journal, 78, 230–239. https://doi.org/
10.1016/j.asoc.2019.02.031
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword infor-
mation. Transactions of the Association for Computational Linguistics, 5, 135–146.
Bruni, E., Boleda, G., Baroni, M., Tran, N. K. (2012). Distributional semantics in technicolor. In 50th
Annu. Meet. Assoc. Comput. Linguist. ACL 2012—Proc. Conf. (vol. 1, pp. 136–145).
Camacho-Collados, J., & Pilehvar, M. T., 2018. On the role of text preprocessing in neural network archi-
tectures: An evaluation study on text categorization and sentiment analysis. In: Proceedings of the
2018 EMNLP Workshop BlackboxNLP: Analyzing and interpreting neural networks for (pp. 40–46).
Association for computational lingustics, Brussels. https://doi.org/10.18653/v1/w18-5406
Camacho-Collados, J., Pilehvar, M.T., Collier, N., & Navigli, R., 2017. SemEval-2017 Task 2: Multilin-
gual and cross-lingual semantic word similarity. In Proceedings of the 11th international workshop
on semantic evaluation (SemEval-2017) (pp. 15–26). Association for Computational Linguistics,
Vancouver, Canada. https://doi.org/10.18653/v1/s17-2002
Corrêa, E. A., & Amancio, D. R. (2019). Word sense induction using word embeddings and community
detection in complex networks. Physica a: Statistical Mechanics and Its Applications, 523, 180–
190. https://doi.org/10.1016/j.physa.2019.02.032
Denny, M. J., & Spirling, A. (2018). Text preprocessing for unsupervised learning: Why it matters, when
it misleads, and what to do about it. Political Analysis, 26, 168–189. https://doi.org/10.1017/pan.
2017.44
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 conference of the North Ameri-
can Chapter of the Association for Computational Linguistics: Human Language Technologies (pp.
4171–4186). Association for Computational Linguistics, Minneapolis, Minnesota. https://doi.org/
10.18653/v1/N19-1423
Enríquez, F., Troyano, J. A., & López-Solaz, T. (2016). An approach to the use of word embeddings in
an opinion classification task. Expert Systems with Applications, 66, 1–6. https://doi.org/10.1016/j.
eswa.2016.09.005
Esposito, M., Damiano, E., Minutolo, A., De Pietro, G., & Fujita, H. (2020). Hybrid query expansion
using lexical resources and word embeddings for sentence retrieval in question answering. Informa-
tion Sciences (NY), 514, 88–105. https://doi.org/10.1016/j.ins.2019.12.002
Etaiwi, W., & Awajan, A. (2020). Graph-based Arabic text semantic representation. Information Process-
ing and Management, 57, 102183. https://doi.org/10.1016/j.ipm.2019.102183
Etaiwi, W., & Naymat, G. (2017). The impact of applying different preprocessing steps on review spam
detection. Procedia Computer Science, 113, 273–279. https://doi.org/10.1016/j.procs.2017.08.368
Fernández-Reyes, F. C., Hermosillo-Valadez, J., & Montes-y-Gómez, M. (2018). A prospect-guided
global query expansion strategy using word embeddings. Information Processing and Management,
54, 1–13. https://doi.org/10.1016/j.ipm.2017.09.001
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E., 2001. Placing
search in context: The concept revisited. In Proc. 10th Int. Conf. World Wide Web, WWW 2001 (pp.
406–414). https://doi.org/10.1145/371920.372094
Gerz, D., Vulić, I., Hill, F., Reichart, R., & Korhonen, A., 2016. SimVerb-3500: A large-scale evalua-
tion set of verb similarity. In Proceedings of the 2016 conference on empirical methods in natural
language processing. (pp. 2173–2182). Association for Computational Linguistics, Austin, Texas.
https://doi.org/10.18653/v1/d16-1235
Halawi, G., Dror, G., Gabrilovich, E., & Koren, Y. (2012). Large-scale learning of word relatedness with
constraints. In Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (pp. 1406–1414). https://
doi.org/10.1145/2339530.2339751.
Hill, F., Reichart, R., & Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 70, 665–695. https://doi.org/10.1162/COLI_a_00237
Huang, E. H., Socher, R., Manning, C.D., & Ng, A. Y., 2012. Improving word representations via global
context and multiple word prototypes. In Proceedings of the 50th annual meeting of the association
for computational linguistics (pp. 873–882).
Kamkarhaghighi, M., & Makrehchi, M. (2017). Content Tree Word Embedding for document represen-
tation. Expert Systems with Applications, 90, 241–249. https://doi.org/10.1016/j.eswa.2017.08.021
Keerthi Kumar, H. M., & Harish, B. S. (2017). Classification of short text using various preprocessing
techniques: An empirical evaluation. In Advances in intelligent systems and computing techniques
(pp. 19–30). Springer.
Khattak, F. K., Jeblee, S., Pou-Prom, C., Abdalla, M., Meaney, C., & Rudzicz, F. (2019). A survey of
word embeddings for clinical text. Journal of Biomedical Informatics, 4, 100057. https://doi.org/10.
1016/j.yjbinx.2019.100057
Kwon, S., Ko, Y., & Seo, J. (2019). Effective vector representation for the Korean named-entity recogni-
tion. Pattern Recognition Letters, 117, 52–57. https://doi.org/10.1016/j.patrec.2018.11.019
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov,
V. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Luong, M.-T., Socher, R., & Manning, C. D. (2013). Better word representations with recursive neural
networks for morphology. In Proceedings of the seventeenth conference on computational natural
language learning (pp. 104–113). Association for Computational Linguistics.
Marinho, V. Q., Hirst, G., Amancio, D. R. (2017). Authorship attribution via network motifs identifica-
tion. In Proc.—2016 5th Brazilian Conf. Intell. Syst. BRACIS 2016 (pp. 355–360). https://doi.org/10.
1109/BRACIS.2016.071.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshop track (pp. 1–12).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119). https://doi.org/10.1162/jmlr.2003.3.4-5.951
Miller, G. A., & Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cog-
nitive Processes, 6, 1–28. https://doi.org/10.1080/01690969108406936
Mohtaj, S., Roshanfekr, B., Zafarian, A., & Asghari, H. (2019). Parsivar: A language processing toolkit
for Persian. In LREC 2018—11th international conference on language resources and evaluation
(pp. 1112–1118).
Othman, N., Faiz, R., & Smaïli, K. (2019). Enhancing question retrieval in community question answer-
ing using word embeddings. Procedia Computer Science, 159, 485–494. https://doi.org/10.1016/j.
procs.2019.09.203
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In:
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)
(pp. 1532–1543).
Pham, D. H., & Le, A. C. (2018). Exploiting multiple word embeddings and one-hot character vectors
for aspect-based sentiment analysis. International Journal of Approximate Reasoning, 103, 1–10.
https://doi.org/10.1016/j.ijar.2018.08.003
Radinsky, K., Agichtein, E., Gabrilovich, E., Markovitch, S. (2011). A word at a time: Computing word
relatedness using temporal semantic analysis. In Proc. 20th Int. Conf. World Wide Web, WWW 2011
(pp. 337–346). https://doi.org/10.1145/1963405.1963455.
Roy, D., Ganguly, D., Mitra, M., & Jones, G. J. F. (2019). Estimating Gaussian mixture models in the
local neighbourhood of embedded word vectors for query performance prediction. Information Pro-
cessing and Management, 56, 1026–1045. https://doi.org/10.1016/j.ipm.2018.10.009
Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the
ACM, 8, 627–633. https://doi.org/10.1145/365628.365657
Shuang, K., Zhang, Z., Loo, J., & Su, S. (2020). Convolution–deconvolution word embedding: An end-
to-end multi-prototype fusion embedding method for natural language processing. Information
Fusion, 53, 112–122. https://doi.org/10.1016/j.inffus.2019.06.009
Tohalino, J. V., & Amancio, D. R. (2018). Extractive multi-document summarization using multilayer
networks. Physica a: Statistical Mechanics and Its Applications, 503, 526–539. https://doi.org/10.
1016/j.physa.2018.03.013
Uysal, A. K., & Gunal, S. (2014). The impact of preprocessing on text classification. Information Pro-
cessing and Management, 50, 104–112. https://doi.org/10.1016/j.ipm.2013.08.006
Wang, Y., Liu, S., Afzal, N., Rastegar-Mojarad, M., Wang, L., Shen, F., Kingsbury, P., & Liu, H. (2018).
A comparison of word embeddings for the biomedical natural language processing. Journal of Bio-
medical Informatics, 87, 12–20. https://doi.org/10.1016/j.jbi.2018.09.008
Wu, C., Su, J., Chen, Y., & Shi, X. (2019). Boosting implicit discourse relation recognition with connec-
tive-based word embeddings. Neurocomputing, 369, 39–49. https://doi.org/10.1016/j.neucom.2019.
08.081
Yahi, N., & Hacene, B. (2020). Morphosyntactic preprocessing impact on document embedding: An
empirical study on semantic similarity. Emerging trends in intelligent computing and informatics.
IRICT 2019. Advances in intelligent systems and computing (pp. 118–126). Springer. https://doi.org/
10.1007/978-3-030-33582-3_12
Yang, D., & Powers, D. M. W. (2006). Verb similarity on the taxonomy of WordNet. In GWC 2006: 3rd international global wordnet conference, proceedings, Jeju Island, Korea (pp. 121–128).
Zahedi, M. S., Bokaei, M. H., Shoeleh, F., Yadollahi, M. M., Doostmohammadi, E., Farhoodi, M. (2018).
Persian word embedding evaluation benchmarks. In 26th Iran. Conf. Electr. Eng. ICEE 2018 (pp.
1583–1588). https://doi.org/10.1109/ICEE.2018.8472549