The Impact of Preprocessing On Word Embedding Quality: A Comparative Study
https://doi.org/10.1007/s10579-022-09620-5
Abstract
Data preprocessing is among the principal stages in virtually all text-based tasks. Recent approaches employ word embeddings in the majority of such tasks, where word co-occurrences form the basis of the word vector generation process. Word embedding techniques are primarily trained on large corpora, and the preprocessing of these corpora is crucial because it can alter the co-occurrence counts of words, and hence the quality of the generated word vectors. This highlights the significance of selecting the right preprocessing approach when working with word embeddings. The present study scrutinizes the effects of preprocessing on the quality of word embeddings for Persian and English corpora, with a focus on preprocessing approaches capable of altering the co-occurrence counts of words, such as the elimination of stopwords. The quality of the word vectors generated under different types of preprocessing is evaluated intrinsically by word similarity and word analogy; text classification and sentiment analysis are used as extrinsic evaluations. The study also presents four word similarity datasets for Persian, namely PER-RG-65, PER-RC-30, PER-Simlex-999, and PER-Mturk-287. Results obtained using GloVe, Word2vec (CBOW), and Fasttext word embeddings show that in certain tasks, such as text classification or solving the semantic questions of word analogy, the removal of punctuation and stopwords leads to higher performance, whereas in sentiment analysis and the syntactic questions of word analogy, retaining stopwords increases overall performance.
1 Introduction
These methods usually use large-scale corpora such as Wikipedia (https://dumps.wikimedia.org/) to learn word vectors. Such corpora can be preprocessed in several ways, and word co-occurrence statistics are modified by the various kinds of preprocessing: the deletion of stopwords, numbers, and punctuation alters the co-occurrence statistics. Preprocessing can therefore affect the quality of the word vectors extracted by word embedding methods, and the quality of the extracted word vectors, in turn, affects all the aforementioned natural language processing tasks.
Several studies have thus far examined the effect of preprocessing on different natural language processing tasks. Ayca and Hakan (2017) and Keerthi Kumar and Harish (2017) examined the effect of preprocessing on text categorization. The effect of text preprocessing on text classification and sentiment analysis, in the case where a neural network is used for these tasks, was considered by Camacho-Collados and Pilehvar (2018); in that research, a convolutional neural network was used for both tasks. Denny and Spirling (2018) studied the effect of text preprocessing on unsupervised learning to find optimal features. Angiani et al. (2016) scrutinized the effect of preprocessing on Twitter sentiment classification. Uysal and Gunal (2014) investigated the effect of preprocessing on text classification in different domains and languages. The effect of preprocessing on review spam detection was considered by Etaiwi and Naymat (2017) as a means for extracting suitable features for spam detection. Another approach investigated the impact of different kinds of preprocessing on document embedding (Yahi & Hacene, 2020). Preprocessing also has an impact on text summarization: Tohalino and Amancio (2018) utilized text segmentation, lemmatization, and unnecessary-word removal as effective preprocessing for text summarization, using a graph-based method for the task. In Marinho et al. (2017), the preprocessing steps of removing punctuation and stopwords, lemmatization, and part-of-speech tagging were applied to the task of authorship attribution using a graph-based approach. Lemmatization is important in graph-based methods because words with the same lemma are mapped to the same node; for generating general-purpose word embeddings, however, it is better to generate embeddings for all forms of a word.
This rather substantial impact of preprocessing on the quality of the word embeddings employed in many NLP tasks has, to the best of our knowledge, not been carefully investigated in previous works. The significance of this study lies in the fact that word embeddings are now used extensively in numerous areas of NLP, where the quality of the extracted word embeddings and representations can alter the accuracy of different tasks. Various preprocessing procedures, such as stopword and punctuation removal, can dramatically change the generated word embeddings due to their drastic effects on word co-occurrences and, therefore, on the corresponding relationships. For example, in the sentence "The tropics are the regions of Earth surrounding the Equator.", extracted from English Wikipedia, consider the word "tropics" and its neighbors in a window of size 5:

The tropics are the regions of Earth surrounding the Equator.
In this sentence, the words "The", "are", "the", "regions", "of", and "Earth" appear in the context of the word "tropics". The following lines show how the context of the word "tropics" changes with stopword removal and punctuation removal:

• Stopword removal: tropics regions Earth surrounding Equator.
• Punctuation and stopword removal: tropics regions Earth surrounding Equator

In this example, the effect of punctuation removal is smaller than that of stopword removal, but note that the token "Equator" differs from the token "Equator.", so the presence of punctuation can change a word form. Since there are large numbers of punctuation marks in a corpus, punctuation removal can strongly affect the context of a word.
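To make the window effect concrete, the following minimal Python sketch (an illustration only, with a simplified tokenizer and an abbreviated stopword list that are our own assumptions, not the pipeline used in this study) counts window-based co-occurrences for a target word before and after stopword removal:

```python
from collections import Counter

STOPWORDS = {"the", "are", "of", "a", "an", "is"}  # abbreviated illustrative list

def window_cooccurrences(tokens, target, window=5):
    """Count the words co-occurring with `target` within +/- `window` positions."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

tokens = "the tropics are the regions of earth surrounding the equator .".split()
print(window_cooccurrences(tokens, "tropics"))
# After stopword removal, more distant content words such as "equator"
# enter the 5-word window around "tropics":
filtered = [t for t in tokens if t not in STOPWORDS]
print(window_cooccurrences(filtered, "tropics"))
```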
Given this background, the present paper examines the impact of preprocessing on word embedding quality. As contextual methods generate embeddings for words within a sentence and are thus unsuitable for tasks such as word similarity and word analogy, they were not considered in the present paper.
The rest of the study is organized as follows: Sect. 2 presents the objectives and contributions of the paper. The three word embedding methods utilized to generate word embeddings are introduced in Sect. 3. The different types of preprocessing and their effects on word co-occurrence counts are described in Sect. 4. Section 5 gives a brief summary of the evaluation methods and datasets used, and finally, Sect. 6 elaborates on the experimental results.
2 Paper objectives
3 Methods
This section contains a brief introduction to three effective and popular word embedding methods: GloVe (Pennington et al., 2014), Word2vec (Mikolov et al., 2013a, b), and Fasttext (Bojanowski et al., 2016). These methods were applied to generate word embeddings in the experiments. They were chosen not only for their popularity but also because each captures a different aspect of co-occurrence information. Word2vec solely makes use of local co-occurrences of words, in which the words within a small window around the target word are used to predict that word. This type of co-occurrence misses certain valuable information, such as topical information (Huang et al., 2012). GloVe utilizes global co-occurrence counts. GloVe and Word2vec work at the word scale and are thus not very suitable for morphologically rich languages; moreover, both methods disregard unknown words (Bojanowski et al., 2016). The Fasttext method, however, operates at the sub-word level, is better suited for morphologically rich languages, and is capable of generating embeddings for unknown words. As Persian is a morphologically rich language, Fasttext was considered in this study as well. For more detailed information about these methods, please refer to the cited papers.
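As a brief illustration of the sub-word property (a minimal sketch using the gensim library on a toy corpus of our own, not the corpora or tooling of this study), Fasttext can compose a vector for an out-of-vocabulary word from its character n-grams:

```python
from gensim.models import FastText

# Toy corpus; the actual experiments use Wikipedia- and Hamshahri-scale corpora.
sentences = [["preprocessing", "changes", "cooccurrence", "counts"],
             ["embeddings", "are", "trained", "on", "large", "corpora"]]

model = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=10)

# "preprocessed" never occurs in the toy corpus, but Fasttext still builds a
# vector for it from shared character n-grams such as "pre" and "essing".
vec = model.wv["preprocessed"]
print(vec.shape)  # (100,)
```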
4 Text preprocessing
Virtually all natural language processing tasks are characterized by a preprocessing phase applied to the dataset. This preprocessing is necessary as most datasets contain HTML or XML tags, images, and a vast body of irrelevant and noisy information. The steps of preprocessing differ, however, from one task or one language to another. Generally, the preprocessing phase for an English text may contain the following steps:
1. Extracting pure text by eliminating HTML or XML tags, images, and links
2. Lowercasing: converting all words to lowercase
3. Tokenization: this preprocessing step is vital for all natural language processing tasks, including word embedding
Steps 1, 2, 3, 6, 7, and 8 are the same for English and Persian. For step 1, we used the WikiExtractor tool (https://github.com/attardi/wikiextractor). The crucial preprocessing step for Persian is to correct zero-width non-joiners (ZWNJs). In Persian, certain suffixes and prefixes attach to the main word with a ZWNJ. For example, "می/mi/" is a prefix for a verb that changes its tense, so /mi/ is joined to the verb with a ZWNJ. Sometimes in Persian texts, ZWNJs are mistakenly replaced with spaces; in tokenization, /mi/ is then considered an independent word (/mi/ has a homograph, "می/mey/", which means wine in Persian). There are multiple tools for the Persian language, such as Hazm (https://github.com/sobhe/hazm) and ParsiVar (Mohtaj et al., 2019), to correct the ZWNJs. This paper utilized the Hazm tool.
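As an illustration (a minimal sketch assuming the Hazm normalizer API and an example phrase of our own; the exact behavior depends on the Hazm version and its spacing options):

```python
from hazm import Normalizer

normalizer = Normalizer()
# "می روم" mistakenly uses a space after the verb prefix /mi/; the correct
# form joins the prefix with a ZWNJ: "می‌روم". The normalizer restores this.
text = "من می روم"
print(normalizer.normalize(text))  # expected: "من می‌روم"
```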
The basis of methods such as GloVe, Word2vec, and Fasttext is co-occurrence statistics. GloVe attempts to reconstruct the co-occurrence counts with the help of a bi-linear regression method, while the basis of Fasttext and Word2vec is the local co-occurrence of words. Preprocessing the corpus can change the co-occurrence statistics and thereby affect the quality of the generated word embeddings. The preprocessing steps performed on the Persian and English corpora are shown in Fig. 1.
As pre-trained word vectors are not task-specific, it is of utmost importance that embeddings be generated for as many words as possible. Thus, it appears that stopword removal is not a good choice when preprocessing a corpus used for the generation of word embeddings: since stopword removal changes the co-occurrence of words, it can also alter the meaning of a sentence. On the other hand, stopwords are not very relevant in most NLP tasks. The co-occurrence counts are computed in a window of 5 to 10 words around the target word, and if stopwords are not removed, certain valuable words may not fall within this 5–10-word window. As mentioned above, the deletion of punctuation and
numbers can change the co-occurrence counts, but which punctuation mark or which numeric value is involved plays no role in how the counts change. To examine the effect of removing punctuation and numbers on the quality of the generated word embeddings, two types of corpora were considered:
To compare the effect of the various preprocessing schemes on the quality of word vectors, several evaluation methods were utilized, falling into the categories of intrinsic and extrinsic evaluation. The intrinsic evaluations assess how well the generated word vectors measure the semantic relatedness and similarity between words. The main objective of this paper is to find which form of preprocessing improves the word vectors intrinsically. The standard intrinsic evaluation techniques used for the assessment of word vectors were word similarity and word analogy.
Because word embeddings are usually employed in natural language processing tasks, and the quality of word vectors affects the accuracy of those tasks, we also investigate how preprocessing the corpus before generating word embeddings affects the accuracy of natural language processing tasks. This kind of test is called extrinsic evaluation of word embeddings. For extrinsic evaluation, the word vectors were applied to sentiment analysis and text classification.
In a word similarity task, each dataset contains multiple word pairs, and each word pair has a pre-defined similarity score rated by humans. The similarity score between the embeddings of the two words is computed by cosine similarity. Next, the Pearson correlation coefficient between the human-rated similarity scores and the cosine similarity scores is computed.
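A minimal sketch of this evaluation follows (illustrative code; `embeddings` is an assumed dictionary from word to vector, and pairs with unknown words are skipped, a common practice that the paper does not spell out here):

```python
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_score(pairs, embeddings):
    """pairs: iterable of (word1, word2, human_score); returns Pearson's r."""
    human, model = [], []
    for w1, w2, gold in pairs:
        if w1 in embeddings and w2 in embeddings:  # skip unknown words
            human.append(gold)
            model.append(cosine(embeddings[w1], embeddings[w2]))
    r, _ = pearsonr(human, model)
    return r
```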
The goal of the word analogy task is to test the capacity of the generated word vectors to solve analogy questions. This task shows how word vectors can reveal linguistic regularities between words. Assume the goal is to answer the analogy question "man:woman::king:?", where vectors for man, woman, and king are available. First, a vector is obtained as vector(woman) − vector(man) + vector(king). Should the nearest vector to this result be vector(queen), it would show how word vectors can reveal the relationship in this analogy question. In English, the Google analogy questions (Mikolov et al., 2013a) are commonly used for evaluating word vectors on analogy questions. The dataset contains 19,544 question pairs, including 8869 semantic questions and 10,675 syntactic ones, covering 14 relation types, which are shown in Table 4. An analogy dataset has also been introduced for the Persian language (Zahedi et al., 2018), comprising 30,072 question pairs in 15 relations. The statistics for this dataset are shown in Table 5.

Table 5 Persian word analogy dataset statistics, including the type of relations, the number of word question pairs, and examples
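A minimal sketch of answering such a question by vector arithmetic (the common 3CosAdd formulation; `embeddings` is again an assumed word-to-vector dictionary, and the three query words are excluded from the candidate answers, as is standard):

```python
import numpy as np

def answer_analogy(a, b, c, embeddings):
    """Solve a:b::c:? by finding the word nearest to vec(b) - vec(a) + vec(c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):  # exclude the query words themselves
            continue
        sim = float(np.dot(vec, target) / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. answer_analogy("man", "woman", "king", embeddings) should return "queen"
```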
Table 7 The statistics of the 20-newsgroups dataset (classes, number of documents per class, and average number of words per document)

Table 9 The statistics of the IMDB dataset: Positive, 25,000 documents; Negative, 25,000 documents

Table 11 The statistics of the Hotel dataset: Positive, 6126 documents; Negative, 2380 documents
5.1 Extrinsic evaluation
For extrinsic evaluation, the word vectors were examined in two NLP tasks: text classification and sentiment analysis. For English text classification, the R8 and R52 datasets (both available at https://ana.cachopo.org/datasets-for-single-label-text-categorization) and the 20-newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) were used. The R8 dataset is a part of Reuters-21578 and contains 5485 documents for training and 2189 documents for testing; statistics on this dataset are listed in Table 6. The 20-newsgroups dataset is a collection of newsgroup posts, comprising 18,818 documents categorized into 20 classes; the statistics of this dataset are shown in Table 7. R52 is also a part of Reuters-21578 (Table 8) and contains 9100 documents in 52 classes, of which 6532 documents are used for training and the remainder for testing. The IMDB corpus (https://ai.stanford.edu/~amaas/data/sentiment/; Table 9) is commonly used for English sentiment analysis (25,000 documents for testing and 25,000 documents for training). For Persian text classification, the Hamshahri corpus (AleAhmad et al., 2009), collected from the Iranian newspaper "Hamshahri", was employed. The Hamshahri corpus contains more than 160,000 articles in 82 categories, as denoted
Table 14 Word similarity scores (Pearson correlation coefficients) for all four types of preprocessing of the Wikipedia dataset. ρG, ρW, and ρf show the Pearson correlation coefficients obtained by GloVe, Word2vec, and Fasttext, respectively

Dataset | Wikipedia_all_rem (ρG, ρW, ρf) | Wikipedia_punct (ρG, ρW, ρf) | Wikipedia_stop (ρG, ρW, ρf) | Wikipedia (ρG, ρW, ρf)
EN-MTurk-287 | 0.65 0.68 0.68 | 0.56 0.68 0.69 | 0.66 0.67 0.69 | 0.65 0.67 0.69
EN-MEN-TR-3k | 0.69 0.72 0.75 | 0.70 0.72 0.75 | 0.69 0.71 0.74 | 0.69 0.69 0.74
EN-RW-STANFORD | 0.35 0.40 0.45 | 0.36 0.42 0.45 | 0.35 0.42 0.45 | 0.35 0.39 0.445
EN-RG-65 | 0.76 0.79 0.79 | 0.745 0.81 0.79 | 0.73 0.79 0.78 | 0.73 0.77 0.795
EN-MC-30 | 0.7 0.73 0.8 | 0.71 0.75 0.8 | 0.66 0.73 0.81 | 0.62 0.72 0.79
EN-SimVerb-3500 | 0.16 0.21 0.21 | 0.173 0.2 0.21 | 0.17 0.22 0.22 | 0.18 0.19 0.22
EN-WS-353-REL | 0.52 0.57 0.66 | 0.52 0.57 0.66 | 0.54 0.56 0.65 | 0.53 0.55 0.65
EN-SIMLEX-999 | 0.31 0.33 0.33 | 0.32 0.32 0.33 | 0.315 0.32 0.32 | 0.32 0.29 0.31
EN-VERB-143 | 0.27 0.35 0.34 | 0.27 0.36 0.33 | 0.366 0.4 0.36 | 0.38 0.36 0.36
EN-WS-353-SIM | 0.68 0.74 0.79 | 0.68 0.74 0.77 | 0.68 0.73 0.77 | 0.66 0.72 0.76
SCWS | 0.55 0.65 0.65 | 0.56 0.64 0.65 | 0.57 0.65 0.65 | 0.57 0.64 0.65
EN-MTURK-771 | 0.624 0.62 0.64 | 0.62 0.61 0.64 | 0.62 0.61 0.64 | 0.63 0.59 0.64
EN-YP-130 | 0.396 0.33 0.43 | 0.41 0.33 0.41 | 0.47 0.36 0.44 | 0.46 0.29 0.45
EN-WS-353-ALL | 0.58 0.65 0.72 | 0.58 0.65 0.72 | 0.59 0.65 0.71 | 0.57 0.64 0.71
Average | 0.52 0.54 0.59 | 0.51 0.56 0.58 | 0.53 0.56 0.59 | 0.52 0.53 0.58

The boldface numbers show the best results for each row and only for the datasets solely comprised of verbs
in Table 10. For Persian sentiment analysis, the Hotel dataset (http://dataheart.ir/article/3414/%D9%85%D8%AC%D9%85%D9%88%D8%B9%D9%87-%D8%AF%D8%A7%D8%AF%D9%87-%D9%86%D8%B8%D8%B1%D8%A7%D8%AA-%D9%81%D8%A7%D8%B1%D8%B3%DB%8C-%D8%A8%D8%B1%DA%86%D8%B3%D8%A8-%DA%AF%D8%B2%D8%A7%D8%B1%DB%8C-%D8%B4%D8%AF%D9%87-%D9%87%D8%AA%D9%84) was used, which contains approximately 8500 comments by different people about Iranian hotels (6800 documents for training and 1706 documents for testing). More information about this dataset can be seen in Table 11.
The Wikipedia 2017 dataset (https://archive.org/details/enwiki-20170920) was used to learn English word embeddings. The statistics of this dataset under the four types of preprocessing are shown in Table 12.
For learning Persian word embeddings, the Hamshahri and Alaem_dadeha corpora were used. The results in Table 13 are reported for word embeddings trained on the Hamshahri and Hamshahri + Alaem_dadeha (Ham_alaem) corpora.
6 Experimental results
The experimental results of this study are provided for four natural language processing tasks. Word vectors were intrinsically evaluated using word similarity and word analogy, while text classification and sentiment analysis were used for the extrinsic evaluation of word vectors.
6.1 Experimental settings
The GloVe, Word2vec (CBOW), and Fasttext techniques were used to extract word vectors. The window size for all word embedding tools was set to 5 (five words before and five words after the target word), and the vector size for all experiments was set to 100. The vocab_size of GloVe for the English dataset was set at 400. X_max for GloVe was equal to 10, with α = 0.75. The maximum number of iterations for Word2vec and GloVe was set to 15, and the number of negative samples for Word2vec was set to 25. Default settings were retained for Fasttext.
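For reference, a comparable configuration can be sketched with the gensim library (an approximation under our own assumptions, not the exact tooling used in the paper; `corpus_sentences` is a placeholder for the tokenized training corpus):

```python
from gensim.models import FastText, Word2Vec

corpus_sentences = [["a", "toy", "sentence"]]  # placeholder; use the real corpus

# CBOW (sg=0) with the reported window size, dimensionality, negative samples,
# and number of iterations; min_count=1 only so the toy corpus trains.
w2v = Word2Vec(corpus_sentences, vector_size=100, window=5,
               sg=0, negative=25, epochs=15, min_count=1)

# Fasttext with default settings apart from the shared window and vector size.
ft = FastText(corpus_sentences, vector_size=100, window=5, min_count=1)
```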
6.2 Word similarity
Table 14 gives the overall results of the English word similarity analysis. For this table, the Pearson correlation coefficient was computed between the pre-defined similarity scores rated by humans and the inter-embedding scores computed by cosine similarity; ρG, ρW, and ρf denote the Pearson correlation coefficients achieved by GloVe, Word2vec, and Fasttext, respectively. As can be seen in the table, the average word similarity scores obtained by GloVe and Word2vec for Wikipedia_stop are higher than the others, while for Fasttext, the similarity scores are nearly the same for Wikipedia_all_rem and Wikipedia_stop. The results also show that, on average, stopwords strongly affect word similarity, whereas punctuation and numbers do not have a significant effect on the word similarity score. The deletion of stopwords changes the local co-occurrence counts and is therefore, in all likelihood, influential on the word vectors generated by Word2vec (CBOW), as CBOW uses the words around the target word in a local window to guess the target word (Mikolov et al., 2013b). Stopword deletion also decreases the counts in the global word co-occurrence matrix, pushing many co-occurrence counts below the threshold at which GloVe treats them as noisy; the generated word vectors are therefore sensitive to the deletion of stopwords from the corpus. This is owed to the objective function of GloVe, which penalizes counts below a certain threshold (Pennington et al., 2014). Fasttext, however, uses a modified version of Skipgram to work with character n-grams; therefore, local co-occurrences are also important in Fasttext (Bojanowski et al., 2016).
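For reference, GloVe's weighting function from Pennington et al. (2014), with the $x_{\max} = 10$ and $\alpha = 0.75$ used in the experimental settings above, down-weights rare (potentially noisy) co-occurrence counts $X_{ij}$ in the objective:

$$
f(x) = \begin{cases} (x/x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise,} \end{cases}
\qquad
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

Deleting stopwords shrinks many $X_{ij}$, moving them into the heavily down-weighted region below $x_{\max}$.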
EN-SimVerb-3500, EN-VERB-143, and EN-YP-130 are datasets solely comprised of verbs, for which the following principles hold: a verb has to agree with its subject in number and person; stopwords such as "is", "are", "have", "has", etc. show the number, person, and tense of a verb; and punctuation is affected by the number of a verb. For example, the commas in a sentence like "Sara, her sister, and her mother go on a picnic" indicate the number of the verb "go". As shown in Table 14, the word similarity scores for embeddings trained on Wikipedia and Wikipedia_stop are higher than the others for these datasets. This shows that retaining stopwords and punctuation in a corpus can improve the quality of word embeddings for verbs.
Table 15 Word similarity scores for all four types of preprocessing of the Ham_alaem dataset. ρG, ρW, and ρf show the Pearson correlation coefficients obtained by GloVe, Word2vec, and Fasttext, respectively

Dataset | Ham_alaem_all_rem (ρG, ρW, ρf) | Ham_alaem_punct (ρG, ρW, ρf) | Ham_alaem_stop (ρG, ρW, ρf) | Ham_alaem (ρG, ρW, ρf)
PER-MTurk-287 | 0.4842 0.5455 0.5223 | 0.4530 0.5661 0.5383 | 0.5105 0.5494 0.5150 | 0.4938 0.5657 0.5563
PER-MC-30 | 0.4597 0.7655 0.8233 | 0.4153 0.7624 0.7562 | 0.4163 0.7512 0.7407 | 0.3905 0.7035 0.7634
PER-Simlex-999 | 0.1723 0.2443 0.1963 | 0.1652 0.2512 0.1996 | 0.1645 0.2337 0.2054 | 0.1837 0.2444 0.1924
Semeval-17 | 0.4045 0.4835 0.4839 | 0.3741 0.4745 0.4892 | 0.4249 0.4668 0.5170 | 0.4119 0.4785 0.5114
PER-RG-65 | 0.6639 0.6555 0.7195 | 0.6323 0.6546 0.6612 | 0.6078 0.6307 0.6772 | 0.5986 0.6092 0.6983
Average | 0.44 0.54 0.55 | 0.41 0.54 0.53 | 0.42 0.52 0.53 | 0.41 0.52 0.54

The boldface numbers show the best average results for GloVe, Word2vec, and Fasttext among all preprocessing types
Table 16 Word similarity scores for all four types of preprocessing of the Hamshahri dataset. ρG, ρW, and ρf show the Pearson correlation coefficients obtained by GloVe, Word2vec, and Fasttext, respectively

Dataset | Hamshahri_all_rem (ρG, ρW, ρf) | Hamshahri_punct (ρG, ρW, ρf) | Hamshahri_stop (ρG, ρW, ρf) | Hamshahri (ρG, ρW, ρf)
PER-MTurk-287 | 0.6281 0.6287 0.6234 | 0.5967 0.6456 0.6170 | 0.6985 0.6618 0.6039 | 0.7139 0.6406 0.6479
PER-MC-30 | 0.1850 0.6718 0.6850 | 0.3018 0.5991 0.5771 | 0.4604 0.7521 0.5969 | 0.1696 0.4912 0.5132
PER-Simlex-999 | 0.3383 0.3815 0.3474 | 0.3439 0.3717 0.3522 | 0.3498 0.2337 0.3465 | 0.2794 0.3376 0.3353
Semeval-17 | 0.4889 0.4591 0.5954 | 0.4775 0.4591 0.5977 | 0.5052 0.4668 0.6021 | 0.4627 0.4955 0.5916
PER-RG-65 | 0.5759 0.6559 0.6325 | 0.5158 0.5794 0.6528 | 0.5339 0.6307 0.6390 | 0.4374 0.5395 0.2586
Average | 0.44 0.56 0.58 | 0.45 0.54 0.56 | 0.51 0.55 0.56 | 0.41 0.50 0.47

The boldface numbers show the best average results for GloVe, Word2vec, and Fasttext among all preprocessing types
Table 17 The word analogy accuracies (%) on four types of corpora obtained by GloVe, CBOW, and Fasttext

Type of relation | Wikipedia_all_rem (GloVe, CBOW, Fasttext) | Wikipedia_punct (GloVe, CBOW, Fasttext) | Wikipedia_stop (GloVe, CBOW, Fasttext) | Wikipedia (GloVe, CBOW, Fasttext)
Capital common country | 92.89 86.76 86.36 | 90.51 87.35 86.17 | 92.29 85.38 88.93 | 84.58 76.88 86.76
Capital world | 86.34 90 78.23 | 80.24 88.93 77.79 | 82.34 88.15 77.9 | 70.78 77.61 77.1
City in state | 32.14 17.55 13.39 | 8.2 19.52 11.43 | 7.27 16.74 12.36 | 6.24 13.74 11.78
Currency | 7.39 60.8 28.05 | 27.6 59.59 27.73 | 29.02 58.33 26.78 | 23.83 39.64 26.96
Family | 68.97 74.2 68.18 | 69.96 75.69 62.65 | 79.64 85.38 69.57 | 77.47 73.72 69.96
Gram1 adjective to adverb | 22.47 74.9 34.07 | 21.4 27.42 33.57 | 19.35 25.91 33.27 | 20.77 21.88 31.65
Gram2 opposite | 16.5 26.3 17.49 | 20.07 23.40 18.47 | 18.1 24.26 22.04 | 19.83 17.61 21.55
Gram3 comparative | 65.99 24.14 68.24 | 69.37 74.77 67.04 | 70.72 78.15 70.8 | 71.1 64.94 73.57
Gram4 superlative | 36.54 74.55 41.53 | 37.08 48.22 41 | 40.46 53.03 47.95 | 38.5 31.02 53.57
Gram5 present participle | 47.44 46.08 39.58 | 50.19 45.55 41.76 | 53.22 55.78 50.57 | 58.05 35.8 50.95
Gram6 nationality adjective | 89.31 46.69 85.24 | 89.37 87.62 84.24 | 87.3 86.68 83.99 | 83.43 80.11 84.49
Gram7 past tense | 48.72 87.24 51.73 | 49.87 59.1 52.63 | 46.92 60.26 55.58 | 46.09 47.69 54.04
Gram8 plural | 80.41 59.29 77.33 | 77.03 75.9 76.8 | 73.42 73.95 75.08 | 73.05 57.51 75.38
Gram9 plural verbs | 42.07 76.88 50 | 41.26 47.93 48.28 | 48.16 62.87 60.34 | 47.59 43.33 62.07
Average of semantic | 57.55 65.86 54.84 | 55.30 66.22 53.15 | 58.11 66.80 55.11 | 52.58 56.32 54.51
Average of syntactic | 50.21 57.65 51.80 | 50.79 54.85 51.59 | 51.10 58.20 55.50 | 50.99 44.86 56.30
Total accuracy | 52.66 60.38 52.82 | 52.30 58.64 52.11 | 53.44 61.06 55.37 | 51.52 48.68 55.70

The boldface numbers show the best average results for GloVe, Word2vec, and Fasttext among all preprocessing types
The word similarity results for Persian are available in Tables 15 and 16. In Table 15, on average, the results for Ham_alaem_all_rem are better than the others, while in Table 16, on average, the results for Hamshahri_all_rem and Hamshahri_stop are better than the others. For both corpora, the results for the corpus with no preprocessing were, in most cases, worse than the others. The Hamshahri and Alaem_dadeha corpora differ in the number of tokens and in domain: Alaem_dadeha contains informal language in some documents, while Hamshahri is from the news domain and its sentences are generally in formal writing. Thus, the patterns of results tend to differ between the Hamshahri and Ham_alaem corpora. In the Hamshahri corpus, the number of unknown words for the word similarity datasets was higher than in the Hamshahri + Alaem_dadeha (Ham_alaem) corpus; for example, of the 65 word pairs in PER-RG-65, 12 word pairs are unknown in the Ham_alaem dataset and 16 word pairs are unknown in the Hamshahri dataset. One is then likely to expect better results for the Hamshahri dataset than for the Ham_alaem dataset. The average similarity scores (over all tools) in descending order were Wikipedia_all_rem, Wikipedia_stop, Wikipedia_punct, and Wikipedia.
6.3 Word analogy
The accuracy results obtained in the word analogy task for the English corpus are shown in Table 17. The results for all embedding techniques, including GloVe, Fasttext, and CBOW, tend to improve on syntactic questions when the Wikipedia_stop corpus is used as the word embedding corpus. It can be seen that stopwords may positively affect the search for syntactic relations among words. The significance of this finding lies in the fact that most stopwords, such as determiners, pronouns, conjunctions, and prepositions, are function words, that is, words that define the grammatical relationships between other words in a sentence.
Wikipedia_stop also showed good performance in finding semantic relations between words. The results for Wikipedia_all_rem are better than those for Wikipedia_punct, yet worse than those for Wikipedia_stop; therefore, the presence of punctuation is not beneficial to the task of finding semantic relations between words. In general, using Wikipedia with no preprocessing as a corpus for word embedding in the word analogy task is not recommended; rather, it is suggested that certain preprocessing procedures be applied first.
The average syntactic accuracy results for Persian word analogy are reported in Tables 18 and 19. As can be seen in these tables, the average accuracy on syntactic questions is better when Hamshahri_stop is used as the word embedding corpus than with the other corpora. As in English, stopwords in Persian can also contribute to the search for syntactic relations among words. As most function words ("کلمات دستوری"/kalamate dasturi/) in Persian are stopwords, this result is significant.
Table 18 The word analogy accuracies (%) on word vectors trained on the Ham_alaem dataset

Type of relation | Ham_alaem_all_rem (GloVe, CBOW, Fasttext) | Ham_alaem_punct (GloVe, CBOW, Fasttext) | Ham_alaem_stop (GloVe, CBOW, Fasttext) | Ham_alaem (GloVe, CBOW, Fasttext)
Country-capital | 30.13 36.54 21.15 | 32.05 32.69 25 | 30.77 35.90 23.08 | 16.67 29.2 18.25
Province-capital | 13.66 16.67 6.53 | 11.64 16.07 5.71 | 12.09 13.29 5.03 | 12.06 12 4.95
Family | 43.96 54.4 53.3 | 42.86 52.75 45.05 | 41.76 51.1 51.65 | 40.11 39.3 52
gram_third_person | 41.13 50.65 55.41 | 41.77 47.84 58.87 | 62.12 68.18 75.11 | 61.47 71.1 74.3
gram_past | 13.33 11.19 25.95 | 13.10 10.24 26.67 | 35.71 30.09 47.84 | 36.36 30.3 37
gram_adj2adv | 2.85 6.51 11.41 | 2.67 5.17 9.54 | 3.85 7.93 10.70 | 2.76 5.23 8.5
gram_Noun-adverb | 0 0.15 0.92 | 0.15 0.15 2 | 0.79 0.93 1.32 | 0.93 0.89 1.84
gram_antonym | 5.42 20.20 18.6 | 7.39 19.95 19.33 | 6.77 17.61 18.35 | 6.53 16.64 19.23
gram_comparative | 2.91 8.07 23.02 | 2.91 8.07 18.39 | 6.61 13.62 32.94 | 4.63 12.3 6.2
gram_plural | 7.39 13.07 11.17 | 6.91 15.06 13.07 | 8.05 14.11 11.55 | 8.52 15.4 12.3
gram_firstperson | 9.56 9.19 27.21 | 12.5 9.93 29.41 | 18.38 18.01 36.03 | 10.29 12.05 19
Average accuracy of semantic questions | 29.25 35.87 26.99 | 28.85 33.83 25.25 | 28.20 33.43 26.60 | 22.946 26.8 25
Average accuracy of syntactic questions | 10.32 14.87 21.71 | 10.91 14.55 25.31 | 17.78 21.31 29.23 | 16.40 20.48 24.6
Total average | 15.48 20.6 23.15 | 15.84 19.81 25.29 | 20.62 24.61 28.51 | 18.185 22.2 24.76

The boldface numbers show the best average results for GloVe, Word2vec, and Fasttext among all preprocessing types
Table 19 The word analogy accuracies (%) for word vectors trained on the Hamshahri dataset

Type of relation | Hamshahri_all_rem (GloVe, CBOW, Fasttext) | Hamshahri_punct (GloVe, CBOW, Fasttext) | Hamshahri_stop (GloVe, CBOW, Fasttext) | Hamshahri (GloVe, CBOW, Fasttext)
Country-capital | 45.45 53.97 34.09 | 36.36 55.3 40.15 | 43.18 51.52 36.36 | 37.88 49.24 31.06
Province-capital | 18.78 29.10 7.94 | 18.78 26.85 8.33 | 18.39 24.34 8.6 | 13.89 18.78 7.14
Family | 60.61 56.06 34.09 | 56.82 53.79 37.88 | 66.67 46.21 39.39 | 44.70 46.97 40.15
gram_third_person | 34.12 39.47 54.47 | 33.68 42.89 53.95 | 49.74 52.89 60.26 | 42.37 55.26 59.47
gram_past | 15.44 12.87 29.78 | 16.54 12.5 26.47 | 34.64 25.16 38.24 | 27.12 20.92 30.07
gram_adj2adv | 1.06 5.82 9.26 | 1.32 5.03 10.71 | 2.83 5.91 12.44 | 2.22 5.42 11.45
gram_Noun-adverb | 0 0.22 0.22 | 0.22 0.22 0.43 | 0.77 0.62 1.08 | 0.92 0.46 1.23
gram_antonym | 8.50 16.17 16.50 | 7.50 14.83 14.5 | 10.71 15.67 15.83 | 9.83 15 17
gram_comparative | 4.97 7.31 15.5 | 4.09 8.48 15.5 | 7.60 15.2 25.44 | 8.77 18.42 26.61
gram_plural | 10.33 16.5 15.83 | 11 15 13.17 | 8.83 15.83 13 | 11.33 17.67 14.5
Average accuracy of semantic questions | 41.61 46.37 25.37 | 37.32 45.31 28.78 | 42.74 40.69 28.11 | 32.15 38.33 26.11
Average accuracy of syntactic questions | 10.63 14.05 20.22 | 10.62 18.03 19.24 | 16.44 18.75 27.77 | 14.65 19.02 22.9
Total average | 19.93 23.747 21.767 | 20.33 17.27 22.1 | 30.247 25.33 27.87 | 19.90 24.81 23.86

The boldface numbers show the best average results for GloVe, Word2vec, and Fasttext among all preprocessing types
Word vectors trained on the Hamshahri dataset show higher accuracy on syntactic questions for Hamshahri_stop, Hamshahri_all_rem, and Hamshahri_punct, while the best total average of accuracies (over all word analogy questions, semantic and syntactic) belongs to Hamshahri_stop for all tools used to generate word embeddings in this paper (GloVe, Fasttext, and CBOW).
As shown in Table 18, for the Ham_alaem dataset, which is larger than the Hamshahri dataset, the average accuracy on semantic questions for Ham_alaem_all_rem is slightly better than for Ham_alaem_stop, though the results are close, and the average accuracy on syntactic questions is best for Ham_alaem_stop. The total average accuracy (over all word analogy questions, semantic and syntactic) for Ham_alaem_stop was superior to the others.
6.4 Extrinsic evaluation
As mentioned in Sect. 5.1, text classification and sentiment analysis were used for
the extrinsic evaluation of word vectors in four types of preprocessing scenarios.
The results of these evaluations are described in this section.
6.4.1 Text classification
For text classification, long short-term memory (LSTM) and convolutional neural network (CNN) models were employed as classifiers. The structures of these neural networks are shown in Figs. 4 and 5. For the LSTM network, the Dropout1D rate was set to 0.4 and lstm_out to 196, along with the use of dropout and recurrent dropout. For the CNN network, the number of filters was set to 128 and the kernel size to 5. The results for English text classification obtained by CNN and LSTM are shown in Tables 20 and 21, respectively. In order to isolate the effect of the different kinds of preprocessing, stopwords and punctuation were retained in all of the corpora used for text classification, while Wikipedia, Wikipedia_stop, Wikipedia_punct, and Wikipedia_all_rem were used separately as word embedding corpora.
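A minimal sketch of the LSTM classifier along the lines described (assuming the tf.keras 2.x API; the vocabulary size, sequence length, number of classes, and pretrained embedding matrix are illustrative placeholders, the dropout rates inside the LSTM are not reported in the paper and are assumptions, and the reported Dropout1D corresponds to SpatialDropout1D here):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

vocab_size, seq_len, embed_dim, num_classes = 20000, 200, 100, 8  # placeholders
embedding_matrix = np.random.rand(vocab_size, embed_dim)  # pretrained vectors go here

model = Sequential([
    Embedding(vocab_size, embed_dim, weights=[embedding_matrix],
              input_length=seq_len, trainable=False),
    SpatialDropout1D(0.4),                          # reported Dropout1D rate of 0.4
    LSTM(196, dropout=0.2, recurrent_dropout=0.2),  # lstm_out = 196; rates assumed
    Dense(num_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```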
According to the text classification accuracy values in Table 20, Wikipedia_all_rem and Wikipedia_stop resulted in higher performance in most cases, while Wikipedia_punct failed to show good performance. In the majority of cases, the Wikipedia_all_rem corpus led to better results.
For Persian text classification, Ham_alaem, Ham_alaem_stop, Ham_alaem_all_rem, and Ham_alaem_punct were used separately as corpora for generating word embeddings. The results (classification accuracy) for Persian text classification are shown in Figs. 2 and 3. According to Fig. 2, the classification accuracies for Ham_alaem_all_rem and Ham_alaem_stop are better than those for the two remaining preprocessed corpora. For GloVe, the accuracy of Ham_alaem_stop is higher than the others, yet quite similar to Ham_alaem_all_rem; for CBOW and Fasttext, the accuracy of Ham_alaem_all_rem is higher than for the other corpora. According to Fig. 3, the accuracies for all types of preprocessing are the same when GloVe and CBOW are used as the word embedding method, while for Fasttext, the classification accuracies are higher when the Ham_alaem_all_rem corpus is used. Altogether, the removal of stopwords and punctuation from the corpora used to extract word vectors is useful for the text classification task. Stopwords comprise the most frequent words in a language and are abundantly found in virtually every document in a corpus. When stopwords appear among the features used to represent a document, the text classifier considers many dissimilar documents as similar, as they share many stopwords. This reiterates the fact that stopwords are not relevant features for the task of text classification.
Table 20 The CNN text classification accuracy (%) on Wikipedia, Wikipedia_all_rem, Wikipedia_punct, and Wikipedia_stop

Dataset | Wikipedia_all_rem (GloVe, CBOW, Fasttext) | Wikipedia_punct (GloVe, CBOW, Fasttext) | Wikipedia_stop (GloVe, CBOW, Fasttext) | Wikipedia (GloVe, CBOW, Fasttext)
R8 | 95.38 95.24 95.65 | 94.97 95 93.92 | 96.11 93.41 90.7 | 95.9 94.65 94.5
R52 | 86.6 87.6 85.32 | 83.12 84.9 82 | 87 86.8 84.76 | 84.6 87.1 84.88
20-newsgroups | 66.8 65.6 65.68 | 65.61 64 65.93 | 64.78 64.3 68.1 | 63.95 66.64 67.57
Table 21 The LSTM text classification accuracy (%) on Wikipedia, Wikipedia_all_rem, Wikipedia_punct, and Wikipedia_stop

Dataset | Wikipedia_all_rem (GloVe, CBOW, Fasttext) | Wikipedia_punct (GloVe, CBOW, Fasttext) | Wikipedia_stop (GloVe, CBOW, Fasttext) | Wikipedia (GloVe, CBOW, Fasttext)
R8 | 89.35 90.5 84.1 | 86.51 91.13 82.86 | 87.2 92 84.1 | 86.2 90.9 80
R52 | 76.74 80.32 70.93 | 76.43 80.00 74 | 75.84 80.17 71.44 | 71.75 76.3 70
20-newsgroups | 66.79 68.0 69.4 | 65.61 64 65.93 | 65.78 67.93 64.3 | 65.5 63.46 65.6
Fig. 2 Text classification accuracy (%) for the LSTM text classifier using four types of preprocessing of the Ham_alaem corpus. GloVe, Word2vec, and Fasttext are used as word embedding methods
Fig. 3 Text classification accuracy (%) for the CNN text classifier using four types of preprocessing of the Ham_alaem corpus. GloVe, Word2vec, and Fasttext are used as word embedding methods
Fig. 4 LSTM network structure for text classification and sentiment analysis
Fig. 5 Convolutional network structure for sentiment analysis and text classification (Embedding layer, Conv1D, Global max pooling, Dense, softmax)
Fig. 6 LSTM sentiment analysis accuracy (%) on the IMDB dataset for four types of preprocessing procedures on the Wikipedia dataset
Fig. 7 CNN sentiment analysis accuracy (%) on the IMDB dataset for four types of preprocessing procedures on the Wikipedia dataset
Fig. 8 CNN sentiment analysis F-measure (%) on the Hotel dataset with four types of preprocessing of the Ham_alaem dataset
Fig. 9 LSTM sentiment analysis F-measure (%) on the Hotel dataset with four preprocessed Ham_alaem corpora
6.4.2 Sentiment analysis
Long short-term memory (LSTM) and convolutional neural network (CNN) models were also applied for sentiment analysis. The structures of the LSTM and CNN networks are shown in Figs. 4 and 5, respectively. For the LSTM network, the Dropout1D rate was 0.4 and lstm_out was set to 196, with the inclusion of dropout and recurrent dropout. For the CNN network, the number of filters was set to 128 and the kernel size to 5, and the ReLU activation function was used for the Conv1D layers. The results for the English sentiment analysis tasks are shown in Figs. 6 and 7. For the CNN, Wikipedia_stop showed higher performance than Wikipedia_punct, Wikipedia_all_rem, and Wikipedia for GloVe, Fasttext, and CBOW. In the case of the LSTM neural network, the results for Wikipedia_stop were better than the others for word vectors obtained using GloVe, while the results on Wikipedia_all_rem, Wikipedia_punct, and Wikipedia_stop were rather similar. For Fasttext, the results of Wikipedia_stop were slightly superior to the others, although the results for all four types of preprocessing are relatively similar. For Word2vec, the best results were obtained for Wikipedia_all_rem, albeit very close to those for Wikipedia_stop. The results of CNN and LSTM on the sentiment analysis task indicate that retaining stopwords can be useful.
Stopwords also proved effective for sentiment analysis; for example, stopwords such as "not" or "don't" may change the polarity of a word. It is noteworthy that stopwords are not eliminated by default in the IMDB dataset. If stopwords are removed from the Wikipedia corpus, their embeddings are not generated, and therefore, whether or not they exist in the IMDB dataset, they have no effect on the sentiment of a document.
The results for Persian sentiment analysis on the Hotel dataset are shown in Figs. 8 and 9. The Hotel dataset is unbalanced, with far more positive examples than negative ones, which is why the F-measure is reported. Generally, the results on all corpora are very close to each other. However, as the figures show, the best F-measure results for CNN belong to Ham_alaem_stop, and for LSTM, with the exception of GloVe, they also belong to Ham_alaem_stop.
6.5 Discussion
The effect of text preprocessing on the process of generating word embeddings was explored in the previous sections. Four types of preprocessing were applied to the English and Persian corpora. For English, the Wikipedia corpus was used to train the GloVe, Word2vec, and Fasttext word embedding models; for Persian, the Hamshahri and Alaem_dadeha corpora were used to train the same models. The statistics of these corpora are shown in the "Evaluation methods and corpora" section. Two types of evaluation methods were employed: extrinsic evaluation (sentiment analysis and text classification) and intrinsic evaluation (word similarity and word analogy). The results were language- and task-dependent.
For Persian word similarity, the best average results were obtained for the Ham_alaem_all_rem corpus (punctuation, numbers, and stopwords all removed), whereas in English, the best results were achieved for the Wikipedia_stop corpus (punctuation and numbers removed but stopwords retained). This difference between the results for the Persian and English corpora is in all likelihood due to differences between the datasets: the Persian word similarity datasets, for instance, contain no verbs among their pairs, so stopwords cannot affect the results. According to Tables 15 and 16, as Persian is a morphologically rich language, the Pearson correlation coefficients (computed between the pre-defined similarity scores rated by humans and the inter-embedding scores computed by cosine similarity) are higher for Fasttext than for GloVe and CBOW. In the word analogy task, the best average accuracy on semantic questions was attained for Wikipedia_all_rem and Ham_alaem_all_rem. The best average accuracy on syntactic questions was achieved by Ham_alaem_stop, and in English, Wikipedia and Wikipedia_stop showed better results. The findings indicate that retaining stopwords leads to better results in the syntactic case, as most stopwords, such as determiners, pronouns, conjunctions, and prepositions, are function words: words that define the grammatical relationships between other words in a sentence.
Apropos of extrinsic evaluation, sentiment analysis and text classification tasks were implemented. The tasks were modeled using LSTM and CNN deep neural networks: the embeddings of the words of a document were fed to the networks as input, and a softmax layer was used as the last layer to perform the classification of the document. According to Tables 20 and 21, for text classification, Wikipedia_all_rem (English) and Ham_alaem_all_rem (Persian) generate better results in the majority of cases. This result is to be expected, as punctuation, numbers, and stopwords are not highly relevant features for representing a document in the task of text classification. For sentiment analysis, Wikipedia_stop and Ham_alaem_stop present higher accuracies than the other preprocessed corpora. Certain stopwords such as "not" and "don't" were shown to change the polarity of a word, and therefore the sentiment of a sentence. For example, in the sentence "The movie was not interesting", the word "not" changes the sentiment of the sentence from positive to negative.
If the words "was not" are removed from the sentence, the classifier may classify this sentence as positive.
The main assumption of this paper was that the preprocessing of a corpus, including steps such as removing stopwords, punctuation, and numbers, can change the co-occurrence of words, while word co-occurrences, in turn, are the basis of most word embedding tools; preprocessing could, therefore, change the quality of the word embeddings generated by such tools. The deletion of stopwords, punctuation, and numbers changes the local co-occurrence counts and thereby affects the word vectors generated using Word2vec (CBOW), since CBOW uses the words around the target word in a local window to guess the target word (Mikolov et al., 2013b). It also decreases the counts in the global word co-occurrence matrix, which is precisely why, in GloVe, many co-occurrence counts come to be treated as noisy, making the generated word vectors sensitive to the deletion of stopwords from the corpus; this is owed to the objective function of GloVe, which penalizes counts below a certain threshold (Pennington et al., 2014). Fasttext uses a modified version of Skipgram (Mikolov et al., 2013b) to work with character n-grams; therefore, local co-occurrences are also significant in Fasttext (Bojanowski et al., 2016). The results of this paper have the following implications:
• Retaining punctuation and numbers is not beneficial to any task except the word similarity of verbs, and even there the corresponding word similarity results show no significant change. When punctuation and numbers are removed, however, the size of the corpus decreases and the time complexity of corpus processing is reduced. Therefore, these irrelevant features can be removed to increase the speed of data processing.
• Retaining stopwords improves the accuracy of the word analogy task, the scores for verb similarity, and the sentiment analysis task. The removal of stopwords such as "not" or "don't" can change the polarity of words in sentiment analysis and can thereby alter the polarity of documents. For solving the syntactic and semantic questions of word analogy, most stopwords, such as determiners, pronouns, conjunctions, and prepositions, are function words: words that define the grammatical relationships between other words in a sentence. For measuring the similarity of verbs, stopwords can indicate the number and tense of a verb, and thereby affect word similarity.
• Removal of stopwords is beneficial to text classification. When stopwords appear among the features used to represent a document, the text classifier may consider many dissimilar documents as similar, as they share many stopwords. Stopwords are, therefore, not good features for the task of text classification.
• The intrinsic assessments show that retaining stopwords improves word vectors for finding analogy relations and similarities between verbs. However, a set of word vectors that is intrinsically better than another is not necessarily better for every downstream task. The results do show, though, that intrinsic evaluation performance correlates more strongly with sentiment analysis than with text classification.
7 Conclusions
The present study investigated the effects of different preprocessing procedures on the quality of word vectors generated using Fasttext, GloVe, and Word2vec as state-of-the-art word embedding techniques. Intrinsic and extrinsic evaluation methods and different Persian and English datasets were used in the experiments. The results show that the best type of corpus preprocessing for generating word embeddings is task-dependent. For example, in the task of Persian word analogy, removing punctuation and numbers while retaining stopwords leads to better results, whereas in the Persian word similarity task, removing all punctuation, numbers, and stopwords yields a higher Pearson correlation coefficient (computed between the pre-defined similarity scores rated by humans and the inter-embedding scores computed by cosine similarity). In general, the corpus_stop and corpus_all_rem variants generated better results for all the tasks, and therefore the retention of punctuation and numbers is not beneficial to the generation of word embeddings. In extrinsic evaluations such as text classification, removing punctuation, numbers, and stopwords improves the classification accuracy, as stopwords are frequent in all documents and documents from different classes may share similar stopwords; retaining stopwords thus misleads the classifier in distinguishing between documents from different classes. For the sentiment analysis task, it is recommended to retain stopwords, as stopwords such as "no" and "don't" change the polarity of words in documents. Finally, the intrinsic assessments show that retaining stopwords improves word vectors for finding analogy relations and similarities between verbs; however, a set of word vectors that is intrinsically better than another is not necessarily better for every downstream task, though the results show that intrinsic evaluation performance correlates more strongly with sentiment analysis than with text classification.
For future work, it is recommended to investigate the effects of other types of preprocessing, such as the removal of stopwords with respect to word frequency, on word embedding quality, as well as to generate other word similarity datasets for Persian, particularly for Persian verbs.
Acknowledgements The authors wish to express their thanks for the financial support of the Iran National Science Foundation (INSF), Project No. 97009308.
References
Abdi, A., Shamsuddin, S. M., Hasan, S., & Piran, J. (2019). Deep learning-based sentiment classifica-
tion of evaluative text based on Multi-feature fusion. Information Processing and Management, 56,
1245–1259. https://doi.org/10.1016/j.ipm.2019.02.018
Ajees, A. P., & Idicula, S. M. (2018). A named entity recognition system for malayalam using neural
networks. Procedia Computer Science, 143, 962–969. https://doi.org/10.1016/j.procs.2018.10.338
AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M., & Oroumchian, F. (2009). Hamshahri: A standard
Persian text collection. Knowledge-Based Systems, 22, 382–387. https://doi.org/10.1016/j.knosys.
2009.05.002
Alkhatlan, A., Kalita, J., & Alhaddad, A. (2018). Word sense disambiguation for arabic exploiting arabic
wordnet and word embedding. Procedia Computer Science, 142, 50–60. https://doi.org/10.1016/j.
procs.2018.10.460
Angiani, G., Ferrari, L., Fontanini, T., Fornacciari, P., Iotti, E., Magliani, F., & Manicardi, S. (2016). A
comparison between preprocessing techniques for sentiment analysis in Twitter. CEUR Workshop
Proceedings, 1748, 1–11. https://doi.org/10.1007/978-3-319-67008-9_31
Ayca, D., & Hakan, E. K. (2017). Effects of various preprocessing techniques on Turkish text categorization using n-gram features. In 2nd international conference on computer science and engineering (pp. 655–660). IEEE. https://doi.org/10.1109/UBMK.2017.8093491
Baker, S., Reichart, R., & Korhonen, A. (2014). An unsupervised model for instance level subcategoriza-
tion acquisition. In EMNLP 2014—2014 Conf. Empir. Methods Nat. Lang. Process. Proc. Conf. (pp.
278–289). https://doi.org/10.3115/v1/d14-1034.
Banik, D., Ekbal, A., Bhattacharyya, P., & Bhattacharyya, S. (2019). Assembling translations from multi-
engine machine translation outputs. Applied Soft Computing Journal, 78, 230–239. https://doi.org/
10.1016/j.asoc.2019.02.031
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword infor-
mation. Transactions of the Association for Computational Linguistics, 5, 135–146.
Bruni, E., Boleda, G., Baroni, M., Tran, N. K. (2012). Distributional semantics in technicolor. In 50th
Annu. Meet. Assoc. Comput. Linguist. ACL 2012—Proc. Conf. (vol. 1, pp. 136–145).
Camacho-Collados, J., & Pilehvar, M. T., 2018. On the role of text preprocessing in neural network archi-
tectures: An evaluation study on text categorization and sentiment analysis. In: Proceedings of the
2018 EMNLP Workshop BlackboxNLP: Analyzing and interpreting neural networks for (pp. 40–46).
Association for computational lingustics, Brussels. https://doi.org/10.18653/v1/w18-5406
Camacho-Collados, J., Pilehvar, M.T., Collier, N., & Navigli, R., 2017. SemEval-2017 Task 2: Multilin-
gual and cross-lingual semantic word similarity. In Proceedings of the 11th international workshop
on semantic evaluation (SemEval-2017) (pp. 15–26). Association for Computational Linguistics,
Vancouver, Canada. https://doi.org/10.18653/v1/s17-2002
Corrêa, E. A., & Amancio, D. R. (2019). Word sense induction using word embeddings and community
detection in complex networks. Physica a: Statistical Mechanics and Its Applications, 523, 180–
190. https://doi.org/10.1016/j.physa.2019.02.032
Denny, M. J., & Spirling, A. (2018). Text preprocessing for unsupervised learning: Why it matters, when
it misleads, and what to do about it. Political Analysis, 26, 168–189. https://doi.org/10.1017/pan.
2017.44
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 conference of the North Ameri-
can Chapter of the Association for Computational Linguistics: Human Language Technologies (pp.
4171–4186). Association for Computational Linguistics, Minneapolis, Minnesota. https://doi.org/
10.18653/v1/N19-1423
Enríquez, F., Troyano, J. A., & López-Solaz, T. (2016). An approach to the use of word embeddings in
an opinion classification task. Expert Systems with Applications, 66, 1–6. https://doi.org/10.1016/j.
eswa.2016.09.005
Esposito, M., Damiano, E., Minutolo, A., De Pietro, G., & Fujita, H. (2020). Hybrid query expansion
using lexical resources and word embeddings for sentence retrieval in question answering. Informa-
tion Sciences (NY), 514, 88–105. https://doi.org/10.1016/j.ins.2019.12.002
Etaiwi, W., & Awajan, A. (2020). Graph-based Arabic text semantic representation. Information Process-
ing and Management, 57, 102183. https://doi.org/10.1016/j.ipm.2019.102183
Etaiwi, W., & Naymat, G. (2017). The impact of applying different preprocessing steps on review spam
detection. Procedia Computer Science, 113, 273–279. https://doi.org/10.1016/j.procs.2017.08.368
Fernández-Reyes, F. C., Hermosillo-Valadez, J., & Montes-y-Gómez, M. (2018). A prospect-guided
global query expansion strategy using word embeddings. Information Processing and Management,
54, 1–13. https://doi.org/10.1016/j.ipm.2017.09.001
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E., 2001. Placing
search in context: The concept revisited. In Proc. 10th Int. Conf. World Wide Web, WWW 2001 (pp.
406–414). https://doi.org/10.1145/371920.372094
Gerz, D., Vulić, I., Hill, F., Reichart, R., & Korhonen, A., 2016. SimVerb-3500: A large-scale evalua-
tion set of verb similarity. In Proceedings of the 2016 conference on empirical methods in natural
language processing. (pp. 2173–2182). Association for Computational Linguistics, Austin, Texas.
https://doi.org/10.18653/v1/d16-1235
Halawi, G., Dror, G., Gabrilovich, E., & Koren, Y. (2012). Large-scale learning of word relatedness with
constraints. In Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (pp. 1406–1414). https://
doi.org/10.1145/2339530.2339751.
Hill, F., Reichart, R., & Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 70, 665–695. https://doi.org/10.1162/COLI_a_00237
Huang, E. H., Socher, R., Manning, C.D., & Ng, A. Y., 2012. Improving word representations via global
context and multiple word prototypes. In Proceedings of the 50th annual meeting of the association
for computational linguistics (pp. 873–882).
Kamkarhaghighi, M., & Makrehchi, M. (2017). Content Tree Word Embedding for document represen-
tation. Expert Systems with Applications, 90, 241–249. https://doi.org/10.1016/j.eswa.2017.08.021
Keerthi Kumar, H. M., & Harish, B. S. (2017). Classification of short text using various preprocessing
techniques: An empirical evaluation. In Advances in intelligent systems and computing techniques
(pp. 19–30). Springer.
Khattak, F. K., Jeblee, S., Pou-Prom, C., Abdalla, M., Meaney, C., & Rudzicz, F. (2019). A survey of
word embeddings for clinical text. Journal of Biomedical Informatics, 4, 100057. https://doi.org/10.
1016/j.yjbinx.2019.100057
Kwon, S., Ko, Y., & Seo, J. (2019). Effective vector representation for the Korean named-entity recogni-
tion. Pattern Recognition Letters, 117, 52–57. https://doi.org/10.1016/j.patrec.2018.11.019
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov,
V. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Luong, M.-T., Socher, R., & Manning, C. D. (2013). Better word representations with recursive neural
networks for morphology. In Proceedings of the seventeenth conference on computational natural
language learning (pp. 104–113). Association for Computational Linguistics.
Marinho, V. Q., Hirst, G., Amancio, D. R. (2017). Authorship attribution via network motifs identifica-
tion. In Proc.—2016 5th Brazilian Conf. Intell. Syst. BRACIS 2016 (pp. 355–360). https://doi.org/10.
1109/BRACIS.2016.071.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshop track (pp. 1–12).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119). https://doi.org/10.1162/jmlr.2003.3.4-5.951
Miller, G. A., & Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cog-
nitive Processes, 6, 1–28. https://doi.org/10.1080/01690969108406936
Mohtaj, S., Roshanfekr, B., Zafarian, A., & Asghari, H. (2019). Parsivar: A language processing toolkit
for Persian. In LREC 2018—11th international conference on language resources and evaluation
(pp. 1112–1118).
Othman, N., Faiz, R., & Smaïli, K. (2019). Enhancing question retrieval in community question answer-
ing using word embeddings. Procedia Computer Science, 159, 485–494. https://doi.org/10.1016/j.
procs.2019.09.203
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In:
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)
(pp. 1532–1543).
Pham, D. H., & Le, A. C. (2018). Exploiting multiple word embeddings and one-hot character vectors
for aspect-based sentiment analysis. International Journal of Approximate Reasoning, 103, 1–10.
https://doi.org/10.1016/j.ijar.2018.08.003
Radinsky, K., Agichtein, E., Gabrilovich, E., Markovitch, S. (2011). A word at a time: Computing word
relatedness using temporal semantic analysis. In Proc. 20th Int. Conf. World Wide Web, WWW 2011
(pp. 337–346). https://doi.org/10.1145/1963405.1963455.
Roy, D., Ganguly, D., Mitra, M., & Jones, G. J. F. (2019). Estimating Gaussian mixture models in the
local neighbourhood of embedded word vectors for query performance prediction. Information Pro-
cessing and Management, 56, 1026–1045. https://doi.org/10.1016/j.ipm.2018.10.009
Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the
ACM, 8, 627–633. https://doi.org/10.1145/365628.365657
Shuang, K., Zhang, Z., Loo, J., & Su, S. (2020). Convolution–deconvolution word embedding: An end-
to-end multi-prototype fusion embedding method for natural language processing. Information
Fusion, 53, 112–122. https://doi.org/10.1016/j.inffus.2019.06.009
Tohalino, J. V., & Amancio, D. R. (2018). Extractive multi-document summarization using multilayer
networks. Physica a: Statistical Mechanics and Its Applications, 503, 526–539. https://doi.org/10.
1016/j.physa.2018.03.013
Uysal, A. K., & Gunal, S. (2014). The impact of preprocessing on text classification. Information Pro-
cessing and Management, 50, 104–112. https://doi.org/10.1016/j.ipm.2013.08.006
Wang, Y., Liu, S., Afzal, N., Rastegar-Mojarad, M., Wang, L., Shen, F., Kingsbury, P., & Liu, H. (2018).
A comparison of word embeddings for the biomedical natural language processing. Journal of Bio-
medical Informatics, 87, 12–20. https://doi.org/10.1016/j.jbi.2018.09.008
Wu, C., Su, J., Chen, Y., & Shi, X. (2019). Boosting implicit discourse relation recognition with connec-
tive-based word embeddings. Neurocomputing, 369, 39–49. https://doi.org/10.1016/j.neucom.2019.
08.081
Yahi, N., & Hacene, B. (2020). Morphosyntactic preprocessing impact on document embedding: An
empirical study on semantic similarity. Emerging trends in intelligent computing and informatics.
IRICT 2019. Advances in intelligent systems and computing (pp. 118–126). Springer. https://doi.org/
10.1007/978-3-030-33582-3_12
Yang, D., & Powers, D. M. W. (2006). Verb similarity on the taxonomy of WordNet. In GWC 2006: 3rd international global wordnet conference, proceedings, Jeju Island, Korea (pp. 121–128).
Zahedi, M. S., Bokaei, M. H., Shoeleh, F., Yadollahi, M. M., Doostmohammadi, E., Farhoodi, M. (2018).
Persian word embedding evaluation benchmarks. In 26th Iran. Conf. Electr. Eng. ICEE 2018 (pp.
1583–1588). https://doi.org/10.1109/ICEE.2018.8472549