An Accuracy-Enhanced Light Stemmer For Arabic Text
Stemming is a key step in most text mining and information retrieval applications. Information extraction,
semantic annotation, as well as ontology learning are but a few examples where using a stemmer is a
must. While the use of light stemmers in Arabic texts has proven highly effective for the task of informa-
tion retrieval, this class of stemmers falls short of providing the accuracy required by many text mining
applications. This can be attributed to the fact that light stemmers employ a set of rules that they apply
indiscriminately and that they do not address stemming of broken plurals at all, even though this class
of plurals is very commonly used in Arabic texts. The goal of this work is to overcome these limitations.
The evaluation of the work shows that it significantly improves stemming accuracy. It also shows that by
improving stemming accuracy, tasks such as automatic annotation and keyphrase extraction can also be
significantly improved.
Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing; I.7.0
[Document and Text Processing]: General
General Terms: Experimentation, Algorithms
Additional Key Words and Phrases: Stemming, broken plurals, Arabic, heuristic rules
ACM Reference Format:
El-Beltagy, S. R. and Rafea, A. 2011. An accuracy-enhanced light stemmer for Arabic text. ACM Trans.
Speech Lang. Process. 7, 2, Article 2 (February 2011), 22 pages.
DOI = 10.1145/1921656.1921657 http://doi.acm.org/10.1145/1921656.1921657
1. INTRODUCTION
Stemming is a very common operation in almost any text mining application. The
importance of building a good stemmer lies in the fact that stemming can directly affect
the performance of any application of which it is a component. Research has shown that
stemming of Arabic terms is a particularly difficult task because of the language's highly
inflectional and derivational nature [Aljlayl and Frieder 2002; Chen and Gey 2002; Larkey et al.
2002, 2007; Nwesri et al. 2005].
Most work in this area either tries to reduce a given word to its root (aggressive
stemmers) or to identify a set of prefixes and suffixes, the removal of which can have a
positive impact on a given task such as information retrieval (light or simple stemmers).
The main advantage of using a light stemmer is that it is very simple to implement and
apply and, though not very accurate in its stemming performance, has proven highly
effective for the task of information retrieval. However, different applications have
This research was partially supported by the Center of Excellence of Data Mining and Computer Modeling
within the Egyptian Ministry of Communications and Information Technology (MCIT).
Authors’ addresses: S. R. El-Beltagy, Department of Computer Science, Faculty of Computers and Infor-
mation, Cairo University, Giza, Egypt; email: samhaa@computer.org; A. Rafea, Department of Computer
Science and Engineering, School of Sciences and Engineering, The American University in Cairo; email
rafea@aucegypt.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior specific permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2011 ACM 1550-4875/2011/02-ART2 $10.00
DOI 10.1145/1921656.1921657 http://doi.acm.org/10.1145/1921656.1921657
ACM Transactions on Speech and Language Processing, Vol. 7, No. 2, Article 2, Publication date: February 2011.
2:2 S. R. El-Beltagy and A. Rafea
different requirements from a stemmer, and what might work well with a certain class
of applications might not necessarily perform as well with other application classes.
Mention detection systems, for example, have different requirements from a stemmer
than information retrieval systems [Zitouni et al. 2005].
The goal of this work is to address applications that demand accuracy from a stemmer
at a low computational cost, where accuracy in stemming is defined as stemming a word
to its shortest possible form without compromising its meaning. Aggressive stemmers
do not meet this requirement because converting a word to its root can result in the
mapping of too many related terms, each with a unique meaning, to a single root. This
problem is known as over-stemming. While existing light stemmers produce words that
are closer to maintaining their meaning, they fail to remove some affixes and do not
handle broken plurals which are very common in the Arabic language [Larkey et al.
2007]. This problem is known as under-stemming [Paice 1996].
Examples of applications that require high stemming accuracy include applications
that try to detect ontology elements in a user’s free text input or query such as in
Laclavik et al. [2007] in order to convert it into a semantic query. Applications that
automatically add metadata to documents through the use of ontologies or thesauri
present another example. Examples of other tasks that can be affected by stemming
accuracy include terminology extraction and concept mining. The emphasis in all of
these applications is on correctly matching word forms that have the same meaning.
In these applications, missing such matches or incorrectly matching distinct terms
is usually very costly.
To achieve the previously stated goal of emphasizing accuracy without imposing a
high computational cost, this work proposes an approach that extends light stemming
mainly through the introduction of rules to handle broken plurals that often result in
the addition of infixes to a word as well as by addressing the limitations of existing
light stemmers which manifest in the form of indiscriminate removal of certain affixes.
This task is carried out without the use of a morphological analyzer and can be used
by any application that requires accuracy in stemming. The basic premise of this
work is that in any reasonably sized corpus, a word and its stem are both likely to
appear. This assumption is similar to the one adopted by Xu and Croft [1998] in their
work on corpus-based stemming.
The rest of this article is organized as follows: Section 2 presents an overview of the
Arabic language and more details on why stemming of this language is a complex task,
Section 3 defines the addressed problem in more detail, Section 4 presents related work,
Section 5 provides a conceptual overview of the presented work, Section 6 details the
followed procedure for generating a stem from an input word, Section 7 presents analy-
ses related to rules that were presented in Section 6 for handling broken plurals using
a large corpus, Section 8 describes experiments carried out to evaluate the presented
work and their results, and Section 9 concludes this work and provides future research
directions.
common use [Beesley 1996]. Derivation of a word having a certain part of speech from
a root takes place through the application of a pattern that essentially adds specific
letters to that word. Table I shows a number of words derived from the root (ktb).
Letters that have been added as a result of the derivational process are underlined.
Despite having the same root, each of these words has a totally different meaning. The
process of extracting the root from the derived form is usually carried out through the
use of a derivational morphological analyzer.
Another transformation that occurs on a word having a certain part of speech is
the inflectional process, which adds affixes to this word to indicate person, gender,
number, case, and tense. When added at the beginning of a word, an affix is known
as a prefix, when added at the end of the word it is known as a suffix, and when added
anywhere in the middle, it is known as an infix. In Arabic, gender is either feminine
or masculine. The feminine form of a noun is usually formed by adding a suffix to
the masculine form. There are also three possible ways to indicate number: in singular,
dual, or plural form. Dual and plural forms can have different suffixes depending on
whether the form is feminine or masculine and depending on its grammatical case.
There are three possible grammatical cases: nominative, genitive, and accusative, and
two possible tenses: perfect and imperfect [Chen and Gey 2002]. Examples of inflection
are presented in Table II. The process of extracting a word from this inflected form is
usually performed by inflectional morphological analyzers.
Affixes can also manifest in the form of particles that attach to the beginning of a word
or possessive pronouns that attach at the end. The use of particles that attach to the
beginning of a word is very common in Arabic. These particles can denote prepositions
and conjunctions. Out of the nine conjunction particles in Arabic, two attach to the
beginning of words, and out of the twenty available prepositions, five are inseparable
[Nwesri et al. 2005]. So it is common in Arabic to have a word with multiple affixes.
For example, the word (be-ikhtiarat-ehem) which translates into “with their
choices” and which is inflected from the word “choice” or (ikhtiar),
which itself is a derived word, has one prefix and two suffixes as follows.
Broken plurals are also quite common in Arabic. Unlike regular or sound plurals
which simply result in the addition of suffixes, broken plurals change the entire struc-
ture of a word’s singular form in order to convert it to a plural. This structural change
usually involves the addition of infixes as well as reordering or deletion of letters that
existed in the original word. Table III presents examples of broken plurals. While a
language like English has irregular plurals, it does not really have an equivalent of a
broken plural. There are known rules for changing words with given patterns to their
broken plural forms and vice versa, but these rules cannot be applied blindly. Goweder
et al. [2004b] have demonstrated through experimentation that the straightforward
application of rules that reduce broken plural patterns to their singular representation
results in low precision.
Having presented these aspects of the Arabic language, a root can be defined as the
three-letter origin of a word obtained by removing both inflectional and derivational
affixes, while a stem represents a derived word from which only inflectional affixes
have been removed.
of an affix that is actually part of the word. These stemming limitations can cause
problems in applications that have strict word matching requirements. The goal of this
work is to address stemming accuracy by avoiding over-stemming, under-stemming,
and misstemming without adding too much complexity to the stemming algorithm and
without using any type of morphological analysis. This work also mainly focuses on
stemming nouns. A stem in the context of this work is defined as the singular, and
whenever applicable, masculine form of a word which does not necessarily map to its
root. This definition is closely related to the definition of a lemma in linguistics, or the
dictionary form of a word [Manning et al. 2008]. Traditionally, the process of reducing
a word to its lemma is called lemmatization. However, lemmatizers rely more heavily
on linguistic features of a given text [Manning et al. 2008; Al-Shammari and Lin 2008;
Šnajder et al. 2008]. In this sense, they produce more accurate results, but at a higher
computational cost.
4. RELATED WORK
Because of the complexity of the Arabic language and the importance of stemming,
many approaches with various levels of complexity have been devised to address
stemming, examples of which can be found in Khoja and Garside [1999], Rogati et al. [2003],
Lee et al. [2003], Darwish [2002], Al Ameed et al. [2005], and Al-Shammari and Lin
[2008b]. Stemming for information retrieval has been particularly well researched
[Larkey and Connell 2001; Larkey et al. 2002, 2007; Darwish and Oard 2002; Chen
and Gey 2002; Al Kharashi and Al Sughaiyer 2004; Taghva et al. 2005; Moukdad 2006;
Harmanani et al. 2006].
In general, it can be stated that for the Arabic language, stemmers usually fall under
one of two classes: aggressive or light. Khoja and Garside [1999], for example, devised
an aggressive stemmer that reduces words to their roots. In their work, diacritics,
stop-words, punctuation marks, numbers, the inseparable conjunction prefix (and),
and the definite article (the) are all removed. Input words are also checked among
a large list of prefixes and suffixes, and the longest of these is stripped off, if found.
The resulting word is then compared to a list of patterns and if a match is found,
the root is produced. Taghva et al. [2005] identified three weaknesses in the Khoja
stemmer and developed a similar stemmer that overcomes them, also with the goal of
deriving the root of a word. The three identified limitations were: (1) the fact that the
Khoja stemmer uses a root dictionary, which can be difficult to maintain,
(2) that the stemmer would sometimes produce a root that is not related to the original
word (an incorrect root), and (3) that the stemmer would occasionally fail to remove
affixes that should have been removed. However, as stated before, the problem with
aggressive stemmers in general is that by definition they reduce words to their roots,
which results in losing the specific meaning of the original words. This makes this
class of stemmers poor candidates for applications where high accuracy in matching
between similar words is needed.
As stated in the Introduction, this work makes use of an input corpus to validate
word transformations. Using a corpus for stemming purposes is not a novel idea. In
fact, Xu and Croft had the same notion and demonstrated that in English this is a very
effective approach [Xu and Croft 1998]. In their work, Xu and Croft define corpus-based
stemming as the process of automatically modifying “equivalence classes to suit the
characteristics of a given text corpus,” their assumption being that a stemmer that can
adapt to a certain domain using the characteristics of its corpus should perform better
than one that cannot. Another assumption underlying their work is that words and
their stems are likely to occur in the same document, or, even more specifically, in the
same text window. Rather than use any linguistic knowledge to generate equivalence
classes, an n-gram model is employed to carry out this task. Larkey et al. [2002]
experimented with Xu and Croft’s approach for Arabic, but they recognized that the n-
gram approach is not the most appropriate one to use for a language such as Arabic. So
instead they formed classes of words that included all words that would reduce to the
same term if their vowels were removed. They then used the cooccurrence measures
suggested by Xu and Croft to split these classes even more. However, when they applied
this approach to IR, they found that it did not perform as well for Arabic as it did
for English. While this work adopts the same basic idea proposed by Xu and Croft, its
way of generating stems differs from that of Xu and Croft [1998] and of Larkey et al.
[2002].
The introduction of an Arabic information retrieval task in the Text Retrieval Con-
ference (TREC) in 2001 and 2002 resulted in the development of a series of Arabic
light stemmers. Even though this work does not directly address IR, light stemmers
are relevant to this work because they produce terms that are closer to the definition of
a stem as defined in Section 3, than do aggressive stemmers. Larkey et al. [2002, 2007]
have proposed a number of light stemmers with very minor differences between them.
Experiments with the TREC dataset revealed that the one that works best
is the light10 stemmer [Larkey et al. 2007]. Given an input word, the light10 stemmer
carries out the following steps to generate its stem.
(1) It normalizes the input word by removing punctuation, diacritics, and any character
that is not a letter. In this step, all forms of the letter “ ” (alf) are also changed into
the base representation of the letter, a final alef maqsura is changed to yeh, and a
final teh marbota into heh.
(2) It strips off the initial “ ” (waw, the Arabic conjunction equivalent of “and”) if the
length of the resulting word is equal to or greater than 3 characters.
(3) It strips off the definite articles if the resulting word is at
least two characters long.
(4) It strips off the suffixes if the resulting word is also at
least two characters long.
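These four steps can be sketched roughly as follows. This is an illustrative simplification: the affix lists below are a small subset chosen for the example, not the exact light10 tables of Larkey et al. [2007], and the normalization mirrors the description above.

```python
import re

# Illustrative affix subsets (assumptions; the published light10 lists differ).
DEFINITE_ARTICLES = ["وال", "بال", "كال", "فال", "ال", "لل"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ه", "ي"]

def normalize(word):
    """Step 1: keep Arabic letters only and unify letter variants."""
    word = re.sub(r"[^\u0621-\u064A]", "", word)            # drop diacritics, digits, punctuation
    word = re.sub(r"[\u0622\u0623\u0625]", "\u0627", word)  # alef variants -> bare alef
    if word.endswith("\u0649"):                             # final alef maqsura -> yeh
        word = word[:-1] + "\u064A"
    if word.endswith("\u0629"):                             # final teh marbota -> heh
        word = word[:-1] + "\u0647"
    return word

def light_stem(word):
    w = normalize(word)
    # Step 2: strip an initial waw only if at least 3 characters remain.
    if w.startswith("\u0648") and len(w) - 1 >= 3:
        w = w[1:]
    # Step 3: strip one definite-article prefix if at least 2 characters remain.
    for p in DEFINITE_ARTICLES:
        if w.startswith(p) and len(w) - len(p) >= 2:
            w = w[len(p):]
            break
    # Step 4: strip listed suffixes, each checked once, keeping 2 characters.
    for s in SUFFIXES:
        if w.endswith(s) and len(w) - len(s) >= 2:
            w = w[:-len(s)]
    return w
```

For example, under these assumed lists the word والكتاب (“and the book”) is reduced to كتاب (“book”).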
The main problem with the second step is that many Arabic words start with the
letter “ ” (waw) so removing this letter can result in removing part of the word. To over-
come this problem, Larkey et al. [2007] have imposed the stricter length restriction, but
this still does not guarantee that this prefix will not be stripped off when it shouldn’t.
Through experimenting with an Arabic IR dataset, Larkey et al. [2007] have shown
that the light10 stemmer performs significantly better than the Khoja stemmer. The
light10 stemmer was also compared to other Arabic light stemmers, including Al Stem
[Darwish and Oard 2002], the Buckwalter stemmers [Buckwalter 2003], and the Diab
stemmers [Diab et al. 2004], and produced the overall best results [Larkey et al. 2007]. Chen
and Gey [2002] built the Berkeley light stemmer, which is similar to light10. However,
the Berkeley stemmer introduced a larger set of prefixes and suffixes, and had different
restrictions on the length of the words resulting from stemming. Chen and Gey com-
pared their work to Al Stem [Darwish and Oard 2002] and demonstrated better
performance. However, they did not directly compare their work with light10.
Also relevant to this work is research carried out by Nwesri et al. [2005] in which
the authors propose and compare a set of techniques for removing conjunctions and
prepositions that attach to the beginning of a word. The relevancy derives from the fact
that in their work, Nwesri et al. [2005] follow a similar approach to the one adopted in
this article, as each of their proposed techniques involves some processing on an input
word, after which the result is checked against a lexicon to determine whether to strip
that word of a certain prefix.
Also highly relevant is the work of Goweder et al. [2004a, 2004b], in which the authors
tackle the difficult problem of identifying broken plurals and reducing them to their
singular forms. Goweder et al. [2004b] carry out a series of experiments on broken
plurals. In all experiments, input words are first lightly stemmed using a modified
version of the aggressive Khoja stemmer
[Khoja and Garside 1999]. In the first and least-accurate experiment, all forms that
fit the pattern of a broken plural were detected and analyzed to see whether words
that fit these patterns are in fact broken plurals. Having found that this technique
results in very low precision, an alternative method which adds further restrictions,
based on the authors’ observations of existing patterns, was adopted. Using this method
precision increased significantly. In the third variation, a machine learning approach
was adopted to automatically add restriction rules which further improved the results,
but the best results of all were obtained using a dictionary-based approach.
Like this work, the work of Al-Shammari and Lin [2008b] also tried to reduce stem-
ming errors, but the problem of broken plurals was not directly addressed in their
approach.
5. CONCEPTUAL OVERVIEW
As stated before, the goal of this work is to increase the accuracy of the stemming
process. To do so, in stripping prefixes, suffixes, and infixes, care is taken to ensure
that the resulting word is a valid transformation of the input word. The adopted
approach extends a light stemmer in a way that allows it to satisfy these accuracy
requirements. Even though the goal of this work is not directly
related to improving performance in IR, the stemmer to which this work is compared
and which we actually build upon is the light10 stemmer. The reason behind this is
that the light10 stemmer is probably the most accurate of existing stemmers. The set of
prefixes and suffixes addressed by the light10 stemmer is already limited and was
selected based on high occurrence frequency. This means that in most cases the
stemmer produces stems that maintain the meaning of the original words. Other light
stemmers remove more affixes indiscriminately, resulting in lower accuracy. Like this
work, the light10 stemmer also targets only nouns; this is not explicitly stated, but
can be deduced from the set of prefixes and suffixes that it removes. The main
weakness of the light10 stemmer, viewed from the perspective of accuracy, is its
inability to handle broken plurals, its indiscriminate removal of suffixes and the
conjunction “ ”, and its inability to handle other particles that attach to the beginning
of a word. It is precisely these limitations
that this work aims to address by extending the light10 stemmer.
To handle broken plurals, this work proposes a set of rules for detecting broken plural
patterns and transforming these to their singular forms. However, as stated before in
Section 2, just because a word conforms to a broken plural pattern does not always
mean that the proposed transformation is a valid one. So this work makes use of text
within a corpus for verifying whether to carry out such a transformation by checking to
see whether the word resulting from the proposed transformation exists in the corpus
or not. The same process is also applied for removing certain prefixes and suffixes. So,
if a word resulting from applying a transformation rule on an input word (a potential
stem), or from removing certain prefixes or suffixes, is found to have appeared in the
corpus, then this word is considered as a stem for the input word.
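This check can be sketched as follows. The function name is ours, and the Arabic forms used in the usage example (a broken plural and its singular) are supplied here as illustrative assumptions.

```python
def corpus_validated_stem(word, candidates, corpus_vocabulary):
    """Try each proposed transformation in order; accept the first one whose
    output actually occurs in the corpus, otherwise keep the word unchanged."""
    for candidate in candidates:
        if candidate != word and candidate in corpus_vocabulary:
            return candidate
    return word

# Example: the candidate singular is accepted only when the corpus contains it.
vocabulary = {"درس", "كتب"}
corpus_validated_stem("دروس", ["درس"], vocabulary)   # confirmed transformation
corpus_validated_stem("فلوس", ["فلس"], vocabulary)   # unconfirmed: word kept as-is
```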
It is important to note that this procedure is not error free, especially for broken
plural patterns. For example, the same rule that would reduce (lessons) to “ ”
(lesson) can reduce (sunset) to (west), which obviously is not its stem.
However, if both words are in the corpus (west) will be assumed to be the stem
for (sunset). To overcome this kind of problem, a stem list containing words
that should not be further conflated (such as ) should be maintained. Building
such lists from scratch is very expensive in terms of human effort and is not feasible
for general, non-domain-specific applications. A tool was developed to assist in the rapid
construction of such a list [El-Beltagy and Rafea 2009a]. In general, it can be stated
that in this work building a stemmer is a two-phase process. In the first phase (training
phase), a number of documents from an input corpus can be selected for the stem list
building process. The training phase is an optional step that a user can bypass, but
in this case accuracy will be degraded. In the second phase (operational phase), given
any word or document, stemming can be carried out by checking whether its potential
stem (obtained by applying appropriate rules) exists in the stem list or not.
Loading of stem lists is also a one-time process (for all documents) involving one step.
—For each stem list X, load the terms in X into the initially empty set StemList.
Depending on the required strictness of the stemmer, the original word whose con-
flated form cannot be matched to an entry in the stem list can be returned, or the
corresponding lightly stemmed version of the word may be returned.
We believe that for most applications the less strict option will yield better results.
The stemming of any term t is then carried out as shown in Figure 2.
6. STEMMING RULES
The presented stemmer works on three levels: prefix removal, suffix removal, and infix
removal. Each of these levels utilizes a different set of rules. In general, prefix removal
is first attempted, then suffix removal, followed by infix removal. Each of these steps
is described in the following sections.
single character. The pseudocode for removing either of these prefixes from a term t is
shown in Figure 3.
6.1.1. Compound Prefix Removal. Compound prefixes are shown in Table IV. For the
removal of any prefix, the length of the term to be stemmed minus the length of the
prefix has to be greater than or equal to two, otherwise no prefix removal is carried
out.
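The length condition can be sketched as follows. The prefix entries below are placeholders, since Table IV is not reproduced in this excerpt.

```python
# Hypothetical compound-prefix set; the actual entries come from Table IV.
COMPOUND_PREFIXES = ["وال", "فال", "بال", "كال", "ولل", "ال", "لل"]

def remove_compound_prefix(term):
    """Strip a matching compound prefix only if at least two characters
    would remain; otherwise return the term unchanged."""
    for prefix in COMPOUND_PREFIXES:
        if term.startswith(prefix) and len(term) - len(prefix) >= 2:
            return term[len(prefix):]
    return term
```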
A small experiment was carried out to test the effect of removing prefixes in the
compound prefix set from any word that starts with them if they satisfy the length re-
quirements mentioned earlier (except for prefix ). In this experiment nine documents,
making up a total of 19,291 words, were used. From these, a total 3,851 unique words
(excluding stop-words, single characters, and numbers) were extracted. The number of
words on which prefix removal was applied was 1,708 (44% of input words). This re-
sulted in the generation of 1,345 unique terms (a reduction of 21%). Words that started
with a prefix, along with their altered forms, were examined to assess the accuracy
of the compound prefix removal step. After identifying all cases from which prefixes
should not have been removed but which met the only requirement (which is the length
condition), it was found that these cases are quite rare and that the overall accuracy
of this step is about 98.7%. For this reason it was decided not to impose any further
rules on the removal of a prefix in the compound set. So if a word matches with a prefix
in the compound prefix set (except prefix ), then it is simply removed. There is no
particular order in which these prefixes should be checked.
The prefix , which is pronounced (la), is considered a special case and is handled
somewhat differently from all the other prefixes in the compound set. The reason for this
is that is made up of two letters which are followed by . On its own, the
character can serve as a causation prefix. So, when precedes a word it might be
for the purpose of negating that word, or it could be because the causation prefix
happens to be attached to a word starting with the letter . So the removal of this
particular prefix has to be validated by removing the first two letters of the word that
starts with it and then checking for the resulting word in the stem list or the document
in which the word has appeared. The pseudocode for the validation step is shown in
Figure 4.
If a match is found, then the resulting word is returned. If no match is found, then
the first letter is removed from the original word and the validation method is called
again. Again, if a match is found, the resulting word is returned, but if no match is
found the original word is returned unaltered.
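This two-step validation can be sketched as follows, assuming (as the pronunciation (la) suggests) that the prefix consists of lam followed by alef:

```python
LAM, ALEF = "\u0644", "\u0627"  # assumed letters of the (la) prefix

def strip_la_prefix(word, known_words):
    """Validated removal of the (la) prefix: first try dropping both letters
    (negation particle), then only the lam (causation prefix on an
    alef-initial word); otherwise return the word unaltered."""
    if word.startswith(LAM + ALEF) and len(word) > 3:
        without_both = word[2:]
        if without_both in known_words:
            return without_both
        without_lam = word[1:]
        if without_lam in known_words:
            return without_lam
    return word
```

Here `known_words` stands for the stem list or, failing that, the terms of the document in which the word appeared.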
6.1.2. Single-Letter Prefix Removal. As described before, it is common in Arabic for con-
junctions and prepositions to attach to the beginning of a word. Conjunctions and
prefixes addressed by this work are shown in Table V.
From this list, only the conjunction is removed by the light10 stemmer. Removing
these conjunctions and prepositions from a word without validating the removal
can often result in mistakes. This is because this removal means removing the first
letter of any word that starts with any of these prefixes. For example, the word
(baby) would be reduced to which has no meaning and the word (elephant)
would also be reduced to the meaningless word . This is perhaps why the only
prefix from this set that the light10 stemmer removes is which occurs often as a
conjunction attached to a word. But as stated before, the likelihood of error in case of
indiscriminate removal of these particles is quite high. So, to validate the removal of
these in training mode, the prefix is first removed and then a check is made to see if
the resulting word has appeared at least once as part of the input corpus. If the term is
found, then it is assumed that the prefix can be removed safely. This transformation is
then validated by the user. In stemming mode, the user or application can control the
way a transformation takes place through choosing from two different modes: strict
or relaxed. In strict mode, the prefix is removed and the resulting word is matched
against entries in the stem list. If a match is found, then the transformation is carried
out safely. If not, the suffix and infix removal steps are invoked on the transformed
word. If still no change occurs, then the stem is considered to be an original term. In
relaxed mode, all the preceding steps take place, but when no change occurs, terms in
the input document are checked as well to see whether a match for the transformed
word has appeared in the document. If a match is found, then the transformed word is
considered as a stem of the original. Figure 5 summarizes the process.
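The strict and relaxed modes can be sketched as follows. The single-letter prefix set is an assumption, and `further_stem` is a hypothetical hook standing in for the suffix and infix removal steps.

```python
SINGLE_PREFIXES = ("\u0648", "\u0641", "\u0628", "\u0644", "\u0643")  # assumed set

def strip_single_prefix(word, stem_list, document_terms=None,
                        further_stem=lambda w: w):
    """Remove a one-letter conjunction/preposition only when validated.
    Passing document_terms enables the relaxed mode's local-context check."""
    if len(word) > 2 and word.startswith(SINGLE_PREFIXES):
        candidate = word[1:]
        if candidate in stem_list:
            return candidate                 # confirmed against the stem list
        changed = further_stem(candidate)
        if changed != candidate:
            return changed                   # suffix/infix removal validated it
        if document_terms and candidate in document_terms:
            return candidate                 # relaxed mode: seen in the document
    return word
```

In strict mode (no `document_terms`), an unvalidated removal is simply undone and the original word is kept, which avoids mutilating words such as (elephant) that merely begin with a prefix-like letter.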
Handling of suffixes in the second suffix set is carried out in a similar manner. Here,
the original word is checked for each of the suffixes in the suffix set in Table VI in the
order from right to left. If the word ends with the suffix, the suffix is removed and the
resulting word is validated. If validated, it is returned. Otherwise, a check is made to
see whether the resulting word ends with a . If it does, then the is replaced
with a and the resulting word is validated again. The reason for this is that adding
some suffixes from suffix set 2 to a word that originally ends with a results in the
conversion of this to a in the process of suffix addition. So this step simply
reverses the process. For example, the word (cultivation) is converted to
(their cultivation) when adding the possessive pronoun (their). If the generated
term is validated, it is returned as the output of this step. Otherwise the original word
is checked for the next suffix in the list until all are exhausted. If no match is found in
this procedure, the original word is returned as-is.
match in this case also results in w’ being returned as a stem for word w, but in this
case with less certainty.
For all rules except R1 and R4, if w’ is not found in the stem list, then the letter is
appended to w’ and a match is attempted again. The same is also true when trying to
find a match in the local context. In order to ensure efficiency in the matching process,
all words in a stem list as well as in a local context are stored in a hash list. Rule 4
is different in that it actually adds a character, the ة (teh marbuta).
Because the addition of the teh marbuta character is very tricky to determine from a
local context, this rule is restricted to usage in conjunction with a stem list. Figure 6
summarizes the pattern detection and transformation procedure.
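Since both the stem list and the local context are kept in hash-based structures, each lookup is O(1). A sketch of the matching step (the `extra_letter` parameter stands in for the rule-specific appended character, and all names are assumptions):

```python
def match_stem(candidate, stem_list, local_context, extra_letter=None):
    """Look up a candidate stem, retrying with an appended letter for
    the rules that require it. A stem-list match is certain; a match
    found only in the local context is returned with less certainty."""
    for source, certain in ((stem_list, True), (local_context, False)):
        if candidate in source:
            return candidate, certain
        if extra_letter is not None and candidate + extra_letter in source:
            return candidate + extra_letter, certain
    return None  # no match in either source
```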
7. ANALYZING INFIX RULE PERFORMANCE
Collecting statistics on the number of times a certain broken plural rule matches
words in the text (fires) and the number of times it results in a transformation
(succeeds), together with the precision and recall of each rule, can provide insight
into the importance of each rule, as well as guidelines on how to improve the rules or
use them differently. In order to analyze the rules and patterns on an individual level, a
dataset consisting of 4,098 different news stories was used.1 Files in the dataset were
all loaded at first, and then individual words were stemmed. No stem list was used;
instead stemming relied completely on the local context (which is the aggregate of all
words loaded from input documents). The number of words contained in this dataset
was 1,202,692 (approximately 1.2 million words). After filtering out stop-words, a total
of 75,160 unique terms was retained. To analyze individual rules, the stemmer was
augmented with code that keeps track of the number of times a rule was fired and
the number of times it succeeded (resulted in a transformation). For each rule, a file
was generated containing words that matched the rule's pattern, the suggested
transformation, and whether or not the transformation actually took place. An example
in the form of a sample of the output generated for rule R1 is shown in Table IX. It is
important to reiterate that for a transformation to occur, the resulting candidate term must
have occurred at least once in the local context. In the future, investigation of a higher
threshold will be carried out to avoid mismatches taking place as a result of possible
typing errors. It is also important to note that in terms of precision, all obtained results
are likely to have been better had a smaller corpus or a domain-specific corpus been
used.
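The bookkeeping described above amounts to a pair of counters and a per-rule log; a minimal sketch (names are illustrative):

```python
from collections import Counter

class RuleStats:
    """Track how often each rule fires (its pattern matches a word) and
    how often it succeeds (the match yields an accepted transformation),
    keeping a log row per match for later inspection."""
    def __init__(self):
        self.fired = Counter()
        self.succeeded = Counter()
        self.rows = []  # (rule, word, candidate, transformed) tuples

    def record(self, rule, word, candidate, transformed):
        self.fired[rule] += 1
        if transformed:
            self.succeeded[rule] += 1
        self.rows.append((rule, word, candidate, transformed))
```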
The results of this analysis are summarized in Table X. In this table, the total
number of times each rule is fired is displayed, as well as the total number of times it
succeeded. Also displayed is the total number of unique words matching the rule. In
addition, stemming precision for each rule is presented as well as stemming recall and
the stemming F-score (which represents the harmonic mean of precision and recall).
Stemming precision is calculated based on the number of unique words and their
correctly obtained stems using the following equations.

Stemming precision for rule i = (number of unique words correctly stemmed by rule i) /
(total number of unique words stemmed by rule i)

Stemming recall for rule i = (number of unique words correctly stemmed by rule i) /
(number of unique words correctly stemmed by rule i + number of unique words that
should have been stemmed by rule i but were not)
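These definitions, plus the F-score from Table X, translate directly into code; a sketch:

```python
def rule_scores(correct, total_stemmed, missed):
    """Stemming precision, recall, and F-score for one rule.
    correct: unique words correctly stemmed by the rule;
    total_stemmed: unique words the rule stemmed, right or wrong;
    missed: unique words the rule should have stemmed but did not."""
    precision = correct / total_stemmed if total_stemmed else 0.0
    recall = correct / (correct + missed) if (correct + missed) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```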
When calculating precision and recall, any ambiguous conflations were counted as
wrongly transformed. R1 performed fairly well in terms of accuracy. An example of
a wrong transformation performed by this rule is on the name of the country الجزائر
(Algeria), which was transformed to جزيرة (Island). Rule R3 seems to be a rare rule.
Rule 5 was fired 12,599 times, but since it can only succeed if a match between a
potential stem and an entry in a stem list is found, no statistics for this rule could be
obtained. After looking at incorrect stems produced by Rule 5, it was concluded that
certain conditions can be added to increase its accuracy. Words matching with this
rule should not end with letters “ ”, “ ”, or , nor should they start with letters “ ”,
“ ”, “ ”, or “ ”. The same is also true for Rule 6 which has very low precision. By just
adding the “ ” restriction to this rule, precision immediately increased to 43.77%. Rule
7 mostly matched with verbs, but even when excluding these from the evaluation, the
precision was still very low, suggesting that this rule might best be confined to usage in
conjunction with a stem list. This analysis shows that for most of the presented rules,
using context information does in fact act as a valid safeguard against low precision.
8. EVALUATION
The design of the evaluation experiments targeted a number of specific questions which
are as follows.
(1) Does the proposed stemming methodology used in conjunction with a stem list
significantly improve stemming accuracy, as claimed?
(2) If a stem list has not been built, can using the same approach and relying only on
the local context of documents still enhance stemming accuracy?
(3) Even if accuracy is enhanced as claimed, can this really positively enhance the
performance of some real-life applications, and if so, to what extent?
Table XI. Comparison between Accuracy Results Obtained Using the Proposed Stemmer
with and without a Stem List, and a Light Stemmer

                     Proposed Stemmer    Proposed Stemmer       Light Stemmer
                     with a Stem List    without a Stem List
Document 1           90.4%               75%                    50.9%
Document 2           89.4%               67.4%                  45.8%
Document 3           93.1%               73.6%                  55.3%
Document 4           90.5%               67.6%                  47.2%
Average Accuracy     90.8% ± 1.59%       70.9% ± 3.96%          49.8% ± 4.25%
of stemming over a light stemmer, and that when using it without the stem list there is
an improvement of about 42.4%. Despite the fact that the dataset used is a small one,
using a t-test to calculate the significance of the difference between the results showed
that in both cases (with and without a stem list), the difference in accuracy
between the proposed stemmer and a light one is statistically significant. In the first
case (using a stem list) the p value was less than 0.0001 with a t value of 18.1 and
6 degrees of freedom, while in the second the p value was equal to 0.0003 with a t value
of 7.27, also with 6 degrees of freedom.
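The reported t value for the stem-list case can be reproduced from the per-document accuracies in Table XI with a pooled two-sample t statistic (the equal-variance form; the function name is illustrative):

```python
import math

def two_sample_t(a, b):
    """Pooled two-sample t statistic and degrees of freedom, as used to
    compare per-document accuracies (equal-variance assumption)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

# Accuracies from Table XI: proposed stemmer with stem list vs. light stemmer.
proposed = [90.4, 89.4, 93.1, 90.5]
light = [50.9, 45.8, 55.3, 47.2]
t, df = two_sample_t(proposed, light)  # t ≈ 18.1, df = 6, matching the text
```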
Table XII. Comparison between a Light Stemmer and Different Configurations of the Proposed Stemmer

                  Total # of Matches   Average Precision   Average Recall   Average # of Matches per Doc.
Light Stemmer     175                  0.175 ± 0.088       0.292 ± 0.244    1.75 ± 0.88
C1                180                  0.181 ± 0.082       0.297 ± 0.240    1.80 ± 0.82
C2                187                  0.188 ± 0.090       0.305 ± 0.239    1.87 ± 0.90
C3                192                  0.192 ± 0.092       0.312 ± 0.241    1.92 ± 0.92
C4                194                  0.194 ± 0.093       0.314 ± 0.241    1.94 ± 0.93
between the two results, the difference was found to be statistically significant with
p < 0.0001 (t = 7.6 and degrees of freedom = 6,382).
Another experiment was carried out to determine the effect of stemming on the
task of keyphrase extraction. In this experiment the KP-Miner system was used for
keyphrase extraction [El-Beltagy and Rafea 2009b]. The dataset used consisted of 100
randomly selected articles from the Arabic Wikipedia [Wikipedia 2008]. Keywords
for each article were obtained from the keyword metatag associated with each, but
numeric entries (mostly denoting year numbers) were ignored and so were Wikipedia-
related tags (such as article seed, for example). The average number of words per
document in this dataset is 804 ± 934 and the average number of keyphrases is 8.1 ±
3.2. The percentage of author-assigned keyphrases actually appearing within the body
of associated articles in this dataset is 81.8%. The KP-Miner system allows the user
to specify the number of keyphrases to extract for each input document. Setting this
number to 10, a comparison was held to see how many keyphrases would be correctly
extracted when using a light stemmer as opposed to using the proposed stemmer. Four
different configurations for the proposed stemmer were used.
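Under any of these configurations, a keyphrase counts as correctly extracted when its stemmed form matches a stemmed gold keyphrase; a sketch of that per-document check, where `stem_phrase` stands in for whichever stemmer configuration is being evaluated (the name is illustrative):

```python
def keyphrase_precision(extracted, gold, stem_phrase):
    """Fraction of extracted keyphrases whose stemmed form matches a
    stemmed gold keyphrase; with 10 phrases extracted per document
    this is precision at 10."""
    gold_stems = {stem_phrase(g) for g in gold}
    hits = sum(1 for e in extracted if stem_phrase(e) in gold_stems)
    return hits / len(extracted) if extracted else 0.0
```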
The results of comparing these configurations with each other and with a light
stemmer are shown in Table XII.
As can be seen from Table XII, the best result was obtained using the proposed
stemmer in conjunction with the stem list obtained from agricultural documents and a
light stemmer. These results show an approximately 11% improvement in keyphrase
extraction over the basic light stemmer, despite the fact that the stem list used is
not directly related to the document set from which keyphrases were extracted. When
using the t-test to compare average precision values, the difference between these
results turned out to be not statistically significant with p = 0.1394. Nevertheless,
another advantage offered by the use of the presented stemmer was the elimination
of an undesirable error. Before using the outlined stemmer, the keyphrase extractor
would commonly generate two phrases with the exact same meaning because one form
was the broken plural of the other; this was completely avoided through use of the
presented stemmer. Further experimentation with a larger dataset and with other
configurations of the proposed stemmer is planned.
ACKNOWLEDGMENTS
The authors wish to thank the two anonymous reviewers for their helpful comments and suggestions that
have contributed to the improvement of the quality of the article.
REFERENCES
AL AMEED, H. K., AL KETBI, S. O., AL KAABI, A. A., AL SHEBLI, K. S., AL SHAMSI, N. F., AL NUAIMI, N. H., AND
AL MUHAIRI, S. S. 2005. Arabic light stemmer: A new enhanced approach. In Proceedings of the 2nd
International Conference on Innovations in Information Technology (IIT’05).
AL KHARASHI, I. A. AND AL SUGHAIYER, I. A. 2004. Performance evaluation of an Arabic rule-based stemmer. In
Proceedings of the 17th National Computer Conference.
ALJLAYL, M. AND FRIEDER, O. 2002. On Arabic search: Improving the retrieval effectiveness via light stemming
approach. In Proceedings of the ACM 11th Conference on Information and Knowledge Management.
340–347.
AL-SHAMMARI, E. AND LIN, J. 2008a. A novel Arabic lemmatization algorithm. In Proceedings of Conference
AND’08. 113–118.
AL-SHAMMARI, E. AND LIN, J. 2008b. Towards an error-free Arabic stemming. In Proceedings of the ACM
International Conference on Information and Knowledge Management (CIKM-iNEWS’08). 9–16.
BEESLEY, K. R. 1996. Arabic finite-state morphological analysis and generation. In Proceedings of the 16th
Conference on Computational Linguistics. 89–94.
BUCKWALTER, T. 2003. Qamus: Arabic Lexicography. http://www.qamus.org/
CHEN, A. AND GEY, F. 2002. Building an Arabic stemmer for information retrieval. In Proceedings of the Text
Retrieval Conference (TREC’02). 631–639.
DARWISH, K. 2002. Building a shallow morphological analyzer in one day. In Proceedings of the ACL Workshop
on Computational Approaches to Semitic Languages.
DARWISH, K. AND OARD, D. W. 2002. CLIR experiments at Maryland for TREC-2002: Evidence combination for
Arabic-English retrieval. In Proceedings of the Text Retrieval Conference (TREC’02). 703–710.
DIAB, M., HACIOGLU, K., AND JURAFSKY, D. 2004. Automatic tagging of Arabic text: From raw text to base
phrase chunks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics on
Human Language Technologies North American Chapter of the Association for Computational Linguistics
(HLT-NAACL’04).
EL-BELTAGY, S. R. AND RAFEA, A. 2009a. A framework for the rapid development of list based domain specific
Arabic stemmers. In Proceedings of the 2nd International Conference on Arabic Language Resources and
Tools.
EL-BELTAGY, S. R. AND RAFEA, A. 2009b. KP-Miner: A keyphrase extraction system for English and Arabic
documents. Inform. Syst. 34, 132–144.
EL-BELTAGY, S. R., HAZMAM, M., AND RAFEA, A. 2007. Ontology based annotation of Web document segments.
In Proceedings of the 22nd Annual ACM Symposium on Applied Computing. 1362–1367.
FLORES, F. N., MOREIRA, V. P., AND HEUSER, C. A. 2010. Assessing the impact of stemming accuracy on informa-
tion retrieval. In Proceedings of the International Conference on Computational Processing of Portuguese
Language. Lecture Notes in Computer Science, vol. 6001. Springer, 11–20.
GOLDSMITH, J. A., HIGGINS, D., AND SOGLASNOVA, S. 2000. Automatic language-specific stemming in information
retrieval. In Revised Papers from the Workshop of Cross-Language Evaluation Forum on Cross-Language
Information Retrieval and Evaluation. 273–284.
GOWEDER, A., POESIO, M., AND DE ROECK, A. 2004a. Broken plural detection for Arabic information retrieval.
In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval
(SIGIR’04).
GOWEDER, A., POESIO, M., DE ROECK, A., AND REYNOLDS, J. 2004b. Identifying broken plurals in unvowelised
Arabic text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing
(EMNLP).
HARMANANI, H. M., KEIROUZ, W. T., AND RAHEEL, S. 2006. A rule-based extensible stemmer for information
retrieval with application to Arabic. Int. Arab J. Inform. Technol. 3, 3, 265–272.
HAZMAN, M., EL-BELTAGY, S. R., AND RAFEA, A. 2009. Ontology learning from domain specific Web documents.
Int. J. Metadata Semant. Ontol. 4, 1–2, 24–33.
KHOJA, S. AND GARSIDE, R. 1999. Stemming Arabic text. Tech. rep. Computing Department, Lancaster Uni-
versity, Lancaster, U.K.
LACLAVIK, M., SELENG, M., GATIAL, E., BALOGH, Z., AND HLUCHY, L. 2007. Ontology based text
annotation—OnTeA. In Proceedings of the Conference on Information Modeling and Knowledge Bases.
311–315.
LARKEY, L. S. AND CONNELL, M. E. 2001. Arabic information retrieval at UMass in TREC-10. In Proceedings
of the Text Retrieval Conference (TREC’01).
LARKEY, L. S., BALLESTEROS, L., AND CONNELL, M. E. 2002. Improving stemming for Arabic information retrieval:
Light stemming and co-occurrence analysis. In Proceedings of the Annual ACM Conference on Research
and Development in Information Retrieval (SIGIR’02).
LARKEY, L. S., BALLESTEROS, L., AND CONNELL, M. E. 2007. Light stemming for Arabic information retrieval. In
Arabic Computational Morphology, A. Soudi, A. van der Bosch, and G. Neumann, Eds. 221–243.
LEE, Y., PAPINENI, K., ROUKOS, S., EMAM, O., AND HASSAN, H. 2003. Language model based Arabic word seg-
mentation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.
399–406.
MANNING, C. D., RAGHAVAN, P., AND SCHÜTZE, H. 2008. Introduction to Information Retrieval. Cambridge Uni-
versity Press. Cambridge, U.K.
MOUKDAD, H. 2006. Stemming and root-based approaches to the retrieval of Arabic documents on the Web.
Webology 3, 1, Article 22. http://www.webology.ir/2006/v3n1/a22.html.
NWESRI, A., TAHAGHOGHI, S. M. M., AND SCHOLER, F. 2005. Stemming Arabic conjunctions and prepositions.
In Proceedings of the 12th International Symposium on String Processing and Information Retrieval
(SPIRE’05). Lecture Notes in Computer Science, vol. 3772, Springer, 206–217.
PAICE, C. D. 1996. Method for evaluation of stemming algorithms based on error counting. J. Amer. Soc.
Inform. Sci. 47, 632–649.
RAFEA, A. AND SHAALAN, K. 1993. Lexical analysis of inflected Arabic words using exhaustive search of an
augmented transition network. Softw. Pract. Exper. 23, 6, 567–588.
ROGATI, M., MCCARLEY, S., AND YANG, Y. 2003. Unsupervised learning of Arabic stemming using a parallel
corpus. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.
391–398.
ŠNAJDER, J., BAŠIĆ, B. D., AND TADIĆ, M. 2008. Automatic acquisition of inflectional lexica for morphological
normalization. Inform. Process. Manag. 44, 1720–1731.
TAGHVA, K., ELKHOURY, R., AND COOMBS, J. S. 2005. Arabic stemming without a root dictionary. ITCC 1, 152–157.
WIKIPEDIA. 2008. Wikipedia, the free encyclopedia. http://ar.wikipedia.org/wiki/Main_Page.
XU, J. AND CROFT, W. B. 1998. Corpus-based stemming using co-occurrence of word variants. ACM Trans.
Inform. Syst. 16, 1, 61–81.
ZITOUNI, I., SORENSEN, J., LUO, X., AND FLORIAN, R. 2005. The impact of morphological stemming on Arabic
mention detection and coreference resolution. In Proceedings of the ACL Workshop on Computational
Approaches to Semitic Languages. 63–70.