An Accuracy-Enhanced Light Stemmer for Arabic Text


SAMHAA R. EL-BELTAGY, Cairo University
AHMED RAFEA, The American University in Cairo

Stemming is a key step in most text mining and information retrieval applications. Information extraction,
semantic annotation, as well as ontology learning are but a few examples where using a stemmer is a
must. While the use of light stemmers in Arabic texts has proven highly effective for the task of informa-
tion retrieval, this class of stemmers falls short of providing the accuracy required by many text mining
applications. This can be attributed to the fact that light stemmers employ a set of rules that they apply
indiscriminately and that they do not address stemming of broken plurals at all, even though this class
of plurals is very commonly used in Arabic texts. The goal of this work is to overcome these limitations.
The evaluation of the work shows that it significantly improves stemming accuracy. It also shows that by
improving stemming accuracy, tasks such as automatic annotation and keyphrase extraction can also be
significantly improved.
Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing; I.7.0
[Document and Text Processing]: General
General Terms: Experimentation, Algorithms
Additional Key Words and Phrases: Stemming, broken plurals, Arabic, heuristic rules
ACM Reference Format:
El-Beltagy, S. R. and Rafea, A. 2011. An accuracy-enhanced light stemmer for Arabic text. ACM Trans.
Speech Lang. Process. 7, 2, Article 2 (February 2011), 22 pages.
DOI = 10.1145/1921656.1921657 http://doi.acm.org/10.1145/1921656.1921657

1. INTRODUCTION
Stemming is a very common operation in almost any text mining application. The
importance of building a good stemmer lies in the fact that stemming can directly affect
the performance of any application of which it is a component. Research has shown that
stemming of Arabic terms is a particularly difficult task because of its highly inflected
and derivational nature [Aljlayl and Frieder 2002; Chen and Gey 2002; Larkey et al.
2002, 2007; Nwesri et al. 2005].
Most work in this area either tries to reduce a given word to its root (aggressive
stemmers) or to identify a set of prefixes and suffixes, the removal of which can have a
positive impact on a given task such as information retrieval (light or simple stemmers).
The main advantage of using a light stemmer is that it is very simple to implement and
apply and, though not very accurate in its stemming performance, has proven highly
effective for the task of information retrieval. However, different applications have

This research was partially supported by the Center of Excellence of Data Mining and Computer Modeling
within the Egyptian Ministry of Communication and Information (MCIT).
Authors’ addresses: S. R. El-Beltagy, Department of Computer Science, Faculty of Computers and Infor-
mation, Cairo University, Giza, Egypt; email: samhaa@computer.org; A. Rafea, Department of Computer
Science and Engineering, School of Sciences and Engineering, The American University in Cairo; email
rafea@aucegypt.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior specific permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2011 ACM 1550-4875/2011/02-ART2 $10.00
DOI 10.1145/1921656.1921657 http://doi.acm.org/10.1145/1921656.1921657

ACM Transactions on Speech and Language Processing, Vol. 7, No. 2, Article 2, Publication date: February 2011.

different requirements from a stemmer, and what might work well with a certain class
of applications might not necessarily perform as well with other application classes.
Mention detection systems, for example, have different requirements from a stemmer
than information retrieval systems [Zitouni et al. 2005].
The goal of this work is to address applications that demand accuracy from a stemmer
at a low computational cost, where accuracy in stemming is defined as stemming a word
to its shortest possible form without compromising its meaning. Aggressive stemmers
do not meet this requirement because converting a word to its root can result in the
mapping of too many related terms, each with a unique meaning, to a single root. This
problem is known as over-stemming. While existing light stemmers produce words that
are closer to maintaining their meaning, they fail to remove some affixes and do not
handle broken plurals which are very common in the Arabic language [Larkey et al.
2007]. This problem is known as under-stemming [Paice 1996].
Examples of applications that require high stemming accuracy include applications
that try to detect ontology elements in a user’s free text input or query such as in
Laclavik et al. [2007] in order to convert it into a semantic query. Applications that
automatically add metadata to documents through the use of ontologies or thesauri
present another example. Examples of other tasks that can be affected by stemming
accuracy include terminology extraction and concept mining. The emphasis in all of
these applications is on correctly matching word forms that have the same meaning.
In these applications, missing such matches or incorrectly matching distinct terms
usually carries an expensive cost.
To achieve the previously stated goal of emphasizing accuracy without imposing a
high computational cost, this work proposes an approach that extends light stemming
mainly through the introduction of rules to handle broken plurals that often result in
the addition of infixes to a word as well as by addressing the limitations of existing
light stemmers which manifest in the form of indiscriminate removal of certain affixes.
This task is carried out without the use of a morphological analyzer and can be used
by any application that requires accuracy in stemming. The basic premise on which
this work is based is that in any reasonably sized corpus, a word and its stem are both
likely to appear in the corpus. This is a similar assumption to that adopted by Xu and
Croft [1998] in their work on corpus-based stemming.
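This corpus-attestation premise can be sketched as follows (a minimal illustration of the idea, not the authors' code; the function names and the Latin-script stand-in words are ours):

```python
# Sketch of the corpus-validation premise: a candidate transformation of a
# word is accepted only if the resulting form itself occurs in the corpus.

def build_vocabulary(documents):
    """Collect the set of unique tokens appearing in the corpus."""
    vocab = set()
    for doc in documents:
        vocab.update(doc.split())
    return vocab

def validate_stem(word, candidate, vocab):
    """Accept the candidate stem only if it is attested in the corpus."""
    return candidate if candidate in vocab else word

# Toy Latin-script stand-ins for Arabic word forms:
docs = ["lesson lessons reading", "lessons were read"]
vocab = build_vocabulary(docs)
print(validate_stem("lessons", "lesson", vocab))   # attested -> "lesson"
print(validate_stem("reading", "readin", vocab))   # not attested -> "reading"
```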
The rest of this article is organized as follows: Section 2 presents an overview of the
Arabic language and more details on why stemming of this language is a complex task,
Section 3 defines the addressed problem in more detail, Section 4 presents related work,
Section 5 provides a conceptual overview of the presented work, Section 6 details the
followed procedure for generating a stem from an input word, Section 7 presents analy-
ses related to rules that were presented in Section 6 for handling broken plurals using
a large corpus, Section 8 describes experiments carried out to evaluate the presented
work and their results, and Section 9 concludes this work and provides future research
directions.

2. CHARACTERISTICS OF THE ARABIC LANGUAGE


Arabic is a Semitic language that has a high inflectional and derivational nature. The
Arabic language is written from the right to the left and the way an Arabic letter is
written varies depending on where it appears within a word (beginning, middle, or at
the end). Word formation and the morphology of the Arabic language are significantly
different than any Latin-based language. In Arabic, a single root, which can be com-
posed of three to five consonant letters, can be used to generate a large number of words
(lexical forms) each having a specific semantic meaning [Rafea and Shaalan 1993]. In
theory, hundreds of words can be derived from a single root [Goweder et al. 2004; Al
Kharashi and Al Sughaiyer 2004]. It is estimated that there are approximately 10,000
roots from which all Arabic words are derived [Darwish 2002], 5,000 of which are still in

Table I. Examples of Words Derived from the Root (ktb)

Table II. Examples of Words Inflected from the Word (kateb)

common use [Beesley 1996]. Derivation of a word having a certain part of speech from
a root takes place through the application of a pattern that essentially adds specific
letters to that word. Table I shows a number of words derived from the root (ktb).
Letters that have been added as a result of the derivational process are underlined.
Despite having the same root, each of these words has a totally different meaning. The
process of extracting the root from the derived form is usually carried out through the
use of a derivational morphological analyzer.
Another transformation that occurs on a word having a certain part of speech is
the inflectional process, which adds affixes to this word to indicate person, gender,
number, case, and tense. When added at the beginning of a word, an affix is known
as a prefix, when added at the end of the word it is known as suffix, and when added
anywhere in the middle, it is known as an infix. In Arabic, gender is either feminine
or masculine. The feminine form of a noun is usually formulated by adding a suffix to
the masculine form. There are also three possible ways to indicate number: in singular,
dual, or plural form. Dual and plural forms can have different suffixes depending on
whether the form is feminine or masculine and depending on its grammatical case.
There are three possible grammatical cases: nominative, genitive, and accusative, and
two possible tenses: perfect and imperfect [Chen and Gey 2002]. Examples of inflection
are presented in Table II. The process of extracting a word from this inflected form is
usually performed by inflectional morphological analyzers.
Affixes can also manifest in the form of particles that attach to the beginning of a word
or possessive pronouns that attach at the end. The use of particles that attach to the
beginning of a word is very common in Arabic. These particles can denote prepositions
and conjunctions. Out of the nine conjunction particles in Arabic, two attach to the
beginning of words, and out of the twenty available prepositions, five are inseparable
[Nwesri et al. 2005]. So it is common in Arabic to have a word with multiple affixes.
For example, the word (be-ikhtiarat-ehem), which translates into "with their
choices" and which is inflected from the word "choice" or (ikhtiar),
which itself is a derived word, has one prefix and two suffixes.


Table III. Examples of Broken Plurals

Broken plurals are also quite common in Arabic. Unlike regular or sound plurals
which simply result in the addition of suffixes, broken plurals change the entire struc-
ture of a word’s singular form in order to convert it to a plural. This structural change
usually involves the addition of infixes as well as reordering or deletion of letters that
existed in the original word. Table III presents examples of broken plurals. While a
language like English has irregular plurals, it does not really have an equivalent of a
broken plural. There are known rules for changing words with given patterns to their
irregular plural forms and vice versa, but these rules cannot be applied blindly. Goweder
et al. [2004b] have demonstrated through experimentation that the straightforward
application of rules that reduce broken plural patterns to their singular representation
results in low precision.
Having presented these aspects of the Arabic language, a root can be defined as the
three-letter origin of a word obtained by removing both inflectional and derivational
affixes, while a stem represents a derived word from which only inflectional affixes
have been removed.

3. PROBLEM SCOPE AND DEFINITION


Broadly speaking, stemming can be defined as the process of removing affixes from
a word and any algorithm capable of carrying out this task is called a stemmer
[Goldsmith et al. 2000]. The difference between one stemmer and another is in the
extent and degree to which the stemmer removes affixes with different applications
posing different requirements from a stemming component. Removing all affixes (in-
flectional morphology) and reducing a word to its root (derivational morphology) is the
goal of aggressive or strong stemmers as well as morphological analyzers. Because of
the derivational nature of the Arabic language, reducing a word to its root often causes
the meaning of the original word to be lost. For example, using this approach the Arabic
words denoting library, desk, writer, writings, and books will all be reduced to the root
“write”. Arabic light stemmers, on the other hand, target a specific subset of prefixes
and suffixes to remove [Larkey and Connell 2001; Larkey et al. 2002, 2007; Darwish
and Oard 2002; Al Ameed et al. 2005; Chen and Gey 2002]. The result of applying a
light stemmer to a word can be a meaningless string, but this is acceptable because
the goal is not so much to obtain a semantically meaningful word as it is to group all
related words under the generated (conflated) label. This class of stemmers has proven
very effective for the task of information retrieval, as reported in Larkey et al. [2007].
However, according to Larkey et al. [2007], even in IR, “it is not clear what the correct
level of conflation should be.” They observe that light stemmers are too weak but that
equating all words that derive from the same word is undesirable and cite a major
weakness of light stemmers as their inability to recognize and conflate broken plurals.
So while aggressive stemmers tend to form large stem classes in which unrelated
words are grouped together (over-stemming), light stemmers often fail to conflate re-
lated words that belong together (under-stemming). Because of the indiscriminate re-
moval of affixes, light stemmers can also result in mis-stemming which is the removal


of an affix that is actually part of the word. These stemming limitations can cause
problems in applications that have strict word matching requirements. The goal of this
work is to address stemming accuracy by avoiding over-stemming, under-stemming,
and mis-stemming without adding too much complexity to the stemming algorithm and
without using any type of morphological analysis. This work also mainly focuses on
stemming nouns. A stem in the context of this work is defined as the singular, and
whenever applicable, masculine form of a word which does not necessarily map to its
root. This definition is closely related to the definition of a lemma in linguistics, or the
dictionary form of a word [Manning et al. 2008]. Traditionally, the process of reducing
a word to its lemma is called lemmatization. However, lemmatizers rely more heavily
on linguistic features of a given text [Manning et al. 2008; Al-Shammari and Lin 2008;
Šnajder et al. 2008]. In this sense, they produce more accurate results, but at a higher
computational cost.

4. RELATED WORK
Because of the complexity of the Arabic language and the importance of stemming,
many approaches with various levels of complexity have been devised to address stemming,
examples of which can be found in Khoja and Garside [1999], Rogati et al. [2003],
Lee et al. [2003], Darwish [2002], Al Ameed et al. [2005], and Al-Shammari and Lin
[2008b]. Stemming for information retrieval has been particularly well researched
[Larkey and Connell 2001; Larkey et al. 2002, 2007; Darwish and Oard 2002; Chen
and Gey 2002; Al Kharashi and Al Sughaiyer 2004; Taghva et al. 2005; Moukdad 2006;
Harmanani et al. 2006].
In general, it can be stated that for the Arabic language, stemmers usually fall under
one of two classes: aggressive or light. Khoja and Garside [1999], for example, devised
an aggressive stemmer that reduces words to their roots. In their work, diacritics,
stop-words, punctuation marks, numbers, the inseparable conjunction prefix (and),
and the definite article (the) are all removed. Input words are also checked among
a large list of prefixes and suffixes, and the longest of these is stripped off, if found.
The resulting word is then compared to a list of patterns and if a match is found,
the root is produced. Taghva et al. [2005] identified three weaknesses with the Khoja
stemmer and developed a similar stemmer that overcomes these weaknesses also with
the goal of deriving the root of a word. The three identified limitations were: (1) the
fact that the Khoja stemmer uses a root dictionary, which can be difficult to maintain,
(2) that the stemmer would sometimes produce a root that is not related to the original
word (an incorrect root), and (3) that the stemmer would occasionally fail to remove
affixes that should have been removed. However, as stated before, the problem with
aggressive stemmers in general is that by definition they reduce words to their roots,
which results in losing the specific meaning of the original words. This makes this
class of stemmers poor candidates for applications where high accuracy in matching
between similar words is needed.
As stated in the Introduction, this work makes use of an input corpus to validate
word transformations. Using a corpus for stemming purposes is not a novel idea. In
fact, Xu and Croft had the same notion and demonstrated that in English this is a very
effective approach [Xu and Croft 1998]. In their work, Xu and Croft define corpus-based
stemming as the process of automatically modifying “equivalence classes to suit the
characteristics of a given text corpus,” their assumption being that a stemmer that can
adapt to a certain domain using the characteristics of its corpus should perform better
than one that cannot. Another assumption underlying their work is that words and
their stems are likely to occur in the same document, or, even more specifically, in the
same text window. Rather than use any linguistic knowledge to generate equivalence


classes, an n-gram model is employed to carry out this task. Larkey et al. [2002]
experimented with Xu and Croft’s approach for Arabic, but they recognized that the n-
gram approach is not the most appropriate one to use for a language such as Arabic. So
instead they formed classes of words that included all words that would reduce to the
same term if their vowels were removed. They then used the cooccurrence measures
suggested by Xu and Croft to split these classes even more. However, when they applied
this on IR, they found that this approach did not perform as well for Arabic as it did
for English. While the same basic idea proposed by Xu and Croft is the one adopted in
this work, the way of generating stems is different from that proposed by Xu and Croft
[1998] or by Larkey et al. [2002].
The introduction of an Arabic information retrieval task in the Text Retrieval Con-
ference (TREC) in 2001 and 2002 resulted in the development of a series of Arabic
light stemmers. Even though this work does not directly address IR, light stemmers
are relevant to this work because they produce terms that are closer to the definition of
a stem as defined in Section 3, than do aggressive stemmers. Larkey et al. [2002, 2007]
have proposed a number of light stemmers with very minor differences between each.
But experimenting with the TREC dataset has revealed that the one that works best
is the light10 stemmer [Larkey et al. 2007]. Given an input word, the light10 stemmer
carries out the following steps to generate its stem.

(1) It normalizes the input word by removing punctuation, diacritics, and any character
that is not a letter. In this step, all forms of the letter alif are also changed into
the base representation of the letter, the final letter ى is changed to ي, and the
final ة into ه.
(2) It strips off the initial "و" (waw, the Arabic conjunction equivalent of "and") if the
length of the resulting word is equal to or greater than three characters.
(3) It strips off the definite articles if the resulting word is at
least two characters long.
(4) It strips off the suffixes if the resulting word is also at
least two characters long.
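The steps above can be sketched roughly as follows (our own reconstruction, not the published code; the affix lists and normalization details are reproduced from our reading of Larkey et al. [2007] and may differ in detail from the actual light10 implementation):

```python
# Rough sketch of light10-style stemming: normalize, strip a leading waw,
# strip a definite-article prefix, then strip listed suffixes, each subject
# to a minimum-length constraint on the result.
import re

PREFIXES = ["وال", "فال", "بال", "كال", "لل", "ال"]     # definite-article forms
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def normalize(word):
    word = re.sub(r"[\u064B-\u0652]", "", word)   # strip diacritics
    word = re.sub("[آأإ]", "ا", word)             # unify alif forms
    word = re.sub("ى$", "ي", word)                # final alif maqsura -> ya
    return re.sub("ة$", "ه", word)                # final ta marbuta -> ha

def light10(word):
    word = normalize(word)
    if word.startswith("و") and len(word) >= 4:   # leaves >= 3 letters
        word = word[1:]
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
    return word

print(light10("والكتاب"))   # -> كتاب ("and the book" -> "book")
```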

The main problem with the second step is that many Arabic words start with the
letter "و" (waw), so removing this letter can result in removing part of the word. To
overcome this problem, Larkey et al. [2007] have imposed the stricter length restriction, but
this still does not guarantee that this prefix will not be stripped off when it should not be.
Through experimenting with an Arabic IR dataset, Larkey et al. [2007] have shown
that the light10 stemmer performs significantly better than the Khoja stemmer. The
light10 stemmer was also compared to other Arabic light stemmers, including Al Stem
[Darwish and Oard 2002], the Buckwalter stemmers [Buckwalter 2003], and the Diab
stemmers [Diab et al. 2004], and produced the overall best results [Larkey et al. 2007]. Chen
and Gey [2002] built the Berkeley light stemmer, which is similar to light10. However,
the Berkeley stemmer introduced a larger set of prefixes and suffixes, and had different
restrictions on the length of the words resulting from stemming. Chen and Gey compared
their work to that of Al Stem [Darwish and Oard 2002] and demonstrated better
performance; however, they did not directly compare their work with light10.
Also relevant to this work is research carried out by Nwesri et al. [2005] in which
the authors propose and compare a set of techniques for removing conjunctions and
prepositions that attach to the beginning of a word. The relevancy derives from the fact
that in their work, Nwesri et al. [2005] follow a similar approach to the one adopted in
this article, as each of their proposed techniques involves some processing on an input
word, after which the result is checked against a lexicon to determine whether to strip
that word of a certain prefix.


Also highly relevant to this work is the work carried out by Goweder et al. [2004a,
2004b] in which the authors attempt the difficult problem of identifying broken plurals
and reducing them to their singular forms. Goweder et al. [2004b] carry
out a series of experiments on broken plurals. In all experiments, input words are
firstly lightly stemmed using a modified version of the aggressive Khoja stemmer
[Khoja and Garside 1999]. In the first and least-accurate experiment, all forms that
fit the pattern of a broken plural were detected and analyzed to see whether words
that fit these patterns are in fact broken plurals. Having found that this technique
results in very low precision, an alternative method which adds further restrictions,
based on the authors’ observations of existing patterns, was adopted. Using this method
precision increased significantly. In the third variation, a machine learning approach
was adopted to automatically add restriction rules which further improved the results,
but the best results of all were obtained using a dictionary-based approach.
Like this work, the work of Al-Shammari and Lin [2008b] also tried to reduce stem-
ming errors, but the problem of broken plurals was not directly addressed in their
approach.

5. CONCEPTUAL OVERVIEW
As stated before, the goal of this work is to increase the accuracy of the stemming
process. To do so, in stripping prefixes, suffixes, and infixes, care is taken to ensure
that the resulting word is a valid transformation of the input word. The approach that
has been adopted is one in which we extend a light stemmer in a way that would allow
it to satisfy accuracy requirements. Even though the goal of this work is not directly
related to improving performance in IR, the stemmer to which this work is compared
and which we actually build upon is the light10 stemmer. The reason behind this is
that the light10 stemmer is probably the most accurate of existing stemmers. The set of
prefixes and suffixes addressed by the light10 stemmer is already limited and has been
selected based on their high occurrence frequency, either as suffixes or prefixes. This
means that in most cases the stemmer produces stems that maintain the meaning
of the original words. Other light stemmers remove more affixes indiscriminately,
resulting in lower accuracy. Like this work, the light10 stemmer also only targets
nouns, which is not explicitly stated, but can be deduced by looking at the set of
prefixes and suffixes that it removes. The main weaknesses of the light10 stemmer, if
viewed from the perspective of accuracy, are its inability to handle broken plurals, its
indiscriminate removal of suffixes and the conjunction "و", and its inability to handle
other particles that attach to the beginning of a word. It is precisely these limitations
that this work aims to address by extending the light10 stemmer.
To handle broken plurals, this work proposes a set of rules for detecting broken plural
patterns and transforming these to their singular forms. However, as stated before in
Section 2, just because a word conforms to a broken plural pattern does not always
mean that the proposed transformation is a valid one. So this work makes use of text
within a corpus for verifying whether to carry out such a transformation by checking to
see whether the word resulting from the proposed transformation exists in the corpus
or not. The same process is also applied for removing certain prefixes and suffixes. So,
if a word resulting from applying a transformation rule on an input word (a potential
stem), or from removing certain prefixes or suffixes, is found to have appeared in the
corpus, then this word is considered as a stem for the input word.
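This verify-against-corpus idea can be sketched as follows (a hedged illustration; the single rule below, deleting a waw infix from a four-letter form as in the "lessons"-to-"lesson" example, is just one illustrative pattern and not the paper's full rule set):

```python
# Sketch: propose singular candidates by a broken-plural rule, then accept
# a candidate only if it is attested in the corpus vocabulary.

def singular_candidates(word):
    """Propose singular forms by deleting an internal waw (one toy rule)."""
    return [word[:i] + word[i + 1:] for i, ch in enumerate(word)
            if ch == "و" and 0 < i < len(word) - 1]

def stem_broken_plural(word, corpus_vocab):
    """Apply the rule only if the proposed singular occurs in the corpus."""
    for cand in singular_candidates(word):
        if cand in corpus_vocab:
            return cand
    return word

vocab = {"درس", "كتاب"}
print(stem_broken_plural("دروس", vocab))   # candidate attested -> درس
print(stem_broken_plural("فلوس", vocab))   # no attested singular -> unchanged
```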
It is important to note that this procedure is not error free, especially for broken
plural patterns. For example, the same rule that would reduce (lessons) to “ ”
(lesson) can reduce (sunset) to (west), which obviously is not its stem.
However, if both words are in the corpus, (west) will be assumed to be the stem


for (sunset). To overcome this kind of problem, a stem list containing words
that should not be further conflated (such as ) should be maintained. Building
such lists from scratch is very expensive in terms of human effort and is not feasible
in general nondomain-specific applications. A tool was developed to assist in the rapid
construction of such a list [El-Beltagy and Rafea 2009a]. In general, it can be stated
that in this work building a stemmer is a two-phase process. In the first phase (training
phase), a number of documents from an input corpus can be selected for the stem list
building process. The training phase is an optional step that a user can bypass, but
in this case accuracy will be degraded. In the second phase (operational phase), given
any word or document, stemming can be carried out by checking whether its potential
stem (obtained by applying appropriate rules) exists in the stem list or not.

5.1. The Training Phase: Building the Stem List


The input to the training phase is a subset of documents from a given corpus and the
output is a stem list that can be used in the operational phase. The user can also load
any previously built stem lists in this step to avoid duplicate efforts. In order to rapidly
build a stem list (which, as stated before, can have the effect of boosting stemming
accuracy) the following steps are then carried out.
(1) Load input documents and extract all unique terms into the initially empty set
LocalContext. This step includes removal of diacritics and normalization of letters,
as described in Larkey et al. [2001].
(2) For each unique term ti:
(a) generate term ti′ by removing prefixes from ti;
(b) generate term ti″ by removing suffixes from ti′;
(c) generate term ti‴ by removing infixes (added if the word is a broken plural) from ti″;
(d) store ti and the resulting potential stem ti‴ in a stem table.
(3) Display all term pairs (ti, potential stem of ti) stored in the stem table to the user
for validation (see Figure 1). The user can then approve suggested stems or correct
them.
(4) For each validated entry, store ONLY the validated stem in a text file. This now
constitutes the stem list.
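Steps (1) and (2) can be sketched as a small pipeline (our own naming; the three removal functions stand in for the rules detailed in Section 6):

```python
# Sketch of stem-table construction: collect unique terms, reduce each one
# progressively (prefixes, then suffixes, then infixes), and record the
# (term, candidate-stem) pair for later human validation.

def build_stem_table(documents, remove_prefixes, remove_suffixes, remove_infixes):
    terms = set()
    for doc in documents:
        terms.update(doc.split())          # step 1: unique terms
    table = {}
    for t in terms:                        # step 2: candidate stems
        t1 = remove_prefixes(t)
        t2 = remove_suffixes(t1)
        t3 = remove_infixes(t2)
        table[t] = t3
    return table                           # shown to the user in step 3

# Toy Latin-script illustration with stand-in removal rules:
table = build_stem_table(
    ["walbooks book"],
    lambda t: t[3:] if t.startswith("wal") else t,   # toy prefix rule
    lambda t: t.rstrip("s"),                         # toy suffix rule
    lambda t: t)                                     # no infix rule here
print(table["walbooks"])   # -> book
```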
The exact manner in which prefixes, suffixes, and infixes are removed is described in
the next section. It must be emphasized that this list is much more compact than stem
dictionaries where an association between a stem and each of its possible affix combi-
nations is stored. Here only the stemmed form is stored. So, if we have an entry in the
stem list for “ ” (lesson), any inflected form of this entry such as
will be detected and reduced to this entry without having to actually store any of its
inflected forms. The only reason that the original word and its stemmed form are both
shown to the user is so that the user can detect words that should not be further
conflated and allow these to be stored in the stem list.

5.2. The Operational Phase


The operational phase makes use of any previously built stem lists to stem an input
word or document. In this step, a user can also utilize the local context, which is
basically the document in which a word appears, in the stemming process. Setting the
local context is a one-time process for each unique document, consisting of the following
single step.
— Load input document d and extract all unique terms into the initially empty set
LocalContext. This step includes removal of diacritics and normalization of letters.


Fig. 1. Stemming validation interface.

Loading of stem lists is also a one-time process (for all documents) involving one step.

—For each stem list X, load the terms in X into the initially empty set StemList.

Depending on the required strictness of the stemmer, when a word's conflated form
cannot be matched to an entry in the stem list, either the original word or the
corresponding lightly stemmed version of the word may be returned.
We believe that for most applications the less strict option will yield better results.
The stemming of any term t is then carried out as shown in Figure 2.
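As a rough illustration of this decision logic (a sketch under our own naming and not a reproduction of Figure 2; the two stemming functions stand in for the light rules and the full prefix/suffix/infix rules of Section 6):

```python
# Sketch: conflate a term, accept the result only if it appears in the stem
# list or the local context; otherwise return the original word (strict) or
# its lightly stemmed form (less strict).

def stem_term(t, stem_list, local_context, light_stem, full_stem, strict=False):
    candidate = full_stem(t)               # prefixes + suffixes + infix rules
    if candidate in stem_list or candidate in local_context:
        return candidate
    # candidate not validated: strict mode keeps the original word,
    # otherwise fall back to the lightly stemmed form
    return t if strict else light_stem(t)

# Toy Latin-script illustration:
full_stem = lambda w: w.rstrip("s")
light_stem = lambda w: w[:-1] if w.endswith("s") else w
print(stem_term("lessons", {"lesson"}, set(), light_stem, full_stem))  # -> lesson
```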

6. STEMMING RULES
The presented stemmer works on three levels: prefix removal, suffix removal, and infix
removal. Each of these levels utilizes a different set of rules. In general, prefix removal
is first attempted, then suffix removal, followed by infix removal. Each of these steps
is described in the following sections.

6.1. Prefix Removal


Prefix removal is the first step to be carried out in either training or operational mode.
In our work, we have divided prefixes into two classes: compound prefixes, which consist
of more than one letter, and singular prefixes, or particles, that are made up of just a

2:10 S. R. El-Beltagy and A. Rafea

Fig. 2. General stemming algorithm for stemming a term t.

Fig. 3. Pseudocode for removing prefixes from a term t.

Table IV. Compound Prefixes and Their Meaning
Prefix Meaning
Negation prefix
The
And the
With the
Like the
Then the
For
And with the
And for
And like the
And then the

single character. The pseudocode for removing either of these prefixes from a term t is
shown in Figure 3.
6.1.1. Compound Prefix Removal. Compound prefixes are shown in Table IV. For the
removal of any prefix, the length of the term to be stemmed minus the length of the
prefix has to be greater than or equal to two; otherwise, no prefix removal is carried
out.
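A sketch of this step follows. The table's glyphs were lost in extraction, so the prefix list below is our reconstruction of the standard Arabic forms corresponding to the glosses in Table IV (an assumption); the prefixes are checked longest-first purely as an implementation convenience, since the text notes no particular order is required.

```python
# Reconstructed compound prefixes (assumption): the waw-compounds, then
# bal-, kal-, fal-, ll-, and al-. The special prefix (la) is excluded
# because it is validated separately, as described below.
COMPOUND_PREFIXES = [
    '\u0648\u0628\u0627\u0644', '\u0648\u0643\u0627\u0644',
    '\u0648\u0641\u0627\u0644', '\u0648\u0644\u0644',
    '\u0648\u0627\u0644', '\u0628\u0627\u0644', '\u0643\u0627\u0644',
    '\u0641\u0627\u0644', '\u0644\u0644', '\u0627\u0644',
]

def remove_compound_prefix(t):
    for p in COMPOUND_PREFIXES:
        # remove only if at least two characters remain after stripping
        if t.startswith(p) and len(t) - len(p) >= 2:
            return t[len(p):]
    return t
```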
A small experiment was carried out to test the effect of removing the prefixes in the
compound prefix set from any word that starts with them, provided the length
requirement mentioned earlier is satisfied (except for prefix ). In this experiment, nine
documents, making up a total of 19,291 words, were used. From these, a total of 3,851
unique words (excluding stop-words, single characters, and numbers) were extracted. The
number of words to which prefix removal was applied was 1,708 (44% of input words).
This resulted in the generation of 1,345 unique terms (a reduction of 21%). Words that
started with a prefix, along with their altered forms, were examined to assess the
accuracy of the compound prefix removal step. After identifying all cases from which prefixes


Fig. 4. Term validation pseudocode.

Table V. Single-Letter Prefixes and Their Meaning
Prefix Meaning
and
like
then
For or because
With or at

should not have been removed but which met the only requirement (the length condition),
it was found that these cases are quite rare and that the overall accuracy
of this step is about 98.7%. For this reason, it was decided not to impose any further
rules on the removal of prefixes in the compound set. So, if a word starts with a prefix
in the compound prefix set (except prefix ), that prefix is simply removed. There is no
particular order in which these prefixes should be checked.
The prefix , which is pronounced as (la), is considered a special case and is handled
somewhat differently from all the other prefixes in the compound set. The reason for this
is that is made up of two letters which are followed by . On its own, the
character can serve as a causation prefix. So, when precedes a word it might be
for the purpose of negating that word, or it could be because the causation prefix
happens to be attached to a word starting with the letter . So the removal of this
particular prefix has to be validated by removing the first two letters of the word that
starts with it and then checking for the resulting word in the stem list or the document
in which the word has appeared. The pseudocode for the validation step is shown in
Figure 4.
If a match is found, then the resulting word is returned. If no match is found, then
the first letter is removed from the original word and the validation method is called
again. Again, if a match is found, the resulting word is returned, but if no match is
found the original word is returned unaltered.
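This two-stage check can be sketched as follows. The prefix glyphs did not survive extraction, so we assume the prefix is lam-alef, and `validate` stands for the Figure 4 check against the stem list or the local context.

```python
LA = '\u0644\u0627'  # assumed lam-alef prefix

def validate(w, stem_list, local_context):
    # Figure 4-style validation: the candidate must be a known stem
    # or must itself occur in the source document
    return w in stem_list or w in local_context

def strip_la_prefix(t, stem_list, local_context):
    if t.startswith(LA) and len(t) >= 4:
        if validate(t[2:], stem_list, local_context):
            return t[2:]   # both letters were a (negation) prefix
        if validate(t[1:], stem_list, local_context):
            return t[1:]   # only the causative lam was a prefix
    return t               # no validated reading: leave unaltered
```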
6.1.2. Single-Letter Prefix Removal. As described before, it is common in Arabic for
conjunctions and prepositions to attach to the beginning of a word. The conjunctions
and prepositions addressed by this work are shown in Table V.
From this list, only the conjunction was removed by the light10 stemmer. Removing
these conjunctions and prepositions from a word without validating the removal
can often result in mistakes, because this removal means removing the first
letter of any word that starts with any of these prefixes. For example, the word
(baby) would be reduced to which has no meaning and the word (elephant)
would also be reduced to the meaningless word . This is perhaps why the only
prefix from this set that the light10 stemmer removes is which occurs often as a
conjunction attached to a word. But as stated before, the likelihood of error in case of
indiscriminate removal of these particles is quite high. So, to validate the removal of


Fig. 5. Single-prefix removal pseudocode.

Table VI. Suffix Sets Handled by the System


Suffix set 1
Suffix set 2

these in training mode, the prefix is first removed and then a check is made to see if
the resulting word has appeared at least once as part of the input corpus. If the term is
found, then it is assumed that the prefix can be removed safely. This transformation is
then validated by the user. In stemming mode, the user or application can control the
way a transformation takes place by choosing between two modes: strict
or relaxed. In strict mode, the prefix is removed and the resulting word is matched
against entries in the stem list. If a match is found, then the transformation is carried
out safely. If not, the suffix and infix removal steps are invoked on the transformed
word. If still no change occurs, then the stem is considered to be an original term. In
relaxed mode, all the preceding steps take place, but when no change occurs, terms in
the input document are checked as well to see whether a match for the transformed
word has appeared in the document. If a match is found, then the transformed word is
considered as a stem of the original. Figure 5 summarizes the process.
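A simplified sketch of this process follows. The single-letter prefixes are the standard Arabic forms matching the glosses in Table V (a reconstruction, since the glyphs were lost); note that the re-invocation of the suffix and infix steps on the transformed word, which Figure 5 includes, is omitted here for brevity.

```python
SINGLE_PREFIXES = '\u0648\u0643\u0641\u0644\u0628'  # w, k, f, l, b (Table V)

def remove_single_prefix(t, stem_list, local_context, relaxed=False):
    # Simplified sketch of Figure 5: strict mode validates against the
    # stem list only; relaxed mode also accepts a match in the document.
    if len(t) >= 3 and t[0] in SINGLE_PREFIXES:
        candidate = t[1:]
        if candidate in stem_list:
            return candidate
        if relaxed and candidate in local_context:
            return candidate
    return t
```

As in the paper's (elephant) example, a word whose first letter merely resembles a prefix is left intact because the truncated form validates against neither the stem list nor the document.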

6.2. Suffix Removal


The next step after prefix removal is suffix removal. Two suffix sets have been identified
and are handled by the system; these are shown in Table VI.
Each of these sets is treated differently. Again, for the removal of any suffix in either
set, the resulting term must be at least two characters long. If this condition is not met,
the input term is returned as-is.
If an input term ends with any of the suffixes in suffix set 1, the suffix is removed
and validation is carried out in a manner similar to that described before and presented
in Figure 4. In training mode, this means checking that the word resulting from the
removal of the suffix has appeared at least once in the input corpus. In operational
stemming mode, this means checking that the resulting word appears either in the
dictionary or in the input document. If a match is found, the resulting word is returned.
If no match is found, the character is added to the resulting word and validation is
carried out again. In case a match is found, the resulting word is returned. The reason
a second check is made with the suffix added to the generated word is that the
singular form of words having the suffixes in suffix set 1 sometimes has the as an
ending. For example, the plural word (cars) has the plural suffix but the
singular form of this word is not . If after carrying out this second check
still no match is found, then suffixes in suffix set 2 are checked.
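The retry logic described above can be sketched as follows. The published table's glyphs were lost, so the members of suffix set 1 shown here (the sound plural endings -at, -oon, -een) and the re-appended letter (teh marbuta) are assumptions based on the (cars) example.

```python
TEH_MARBUTA = '\u0629'
# Assumed members of suffix set 1: sound plural endings -at, -oon, -een
SUFFIX_SET_1 = ['\u0627\u062a', '\u0648\u0646', '\u064a\u0646']

def remove_set1_suffix(t, validate):
    for suf in SUFFIX_SET_1:
        if t.endswith(suf) and len(t) - len(suf) >= 2:
            base = t[:-len(suf)]
            if validate(base):
                return base
            # many -at plurals have a singular ending in teh marbuta,
            # e.g. sayyarat (cars) -> sayyara (car)
            if validate(base + TEH_MARBUTA):
                return base + TEH_MARBUTA
    return t
```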

ACM Transactions on Speech and Language Processing, Vol. 7, No. 2, Article 2, Publication date: February 2011.
An Accuracy-Enhanced Light Stemmer for Arabic Text 2:13

Table VII. Examples of Patterns Handled by the Algorithm

Handling of suffixes in the second suffix set is carried out in a similar manner. Here,
the original word is checked for each of the suffixes in the suffix set in Table VI in the
order from right to left. If the word ends with the suffix, the suffix is removed and the
resulting word is validated. If validated, it is returned. Otherwise, a check is made to
see whether the resulting word ends with a . If it does, then the is replaced
with a and the resulting word is validated again. The reason for this is that adding
some suffixes from suffix set 2 to a word that originally ends with a results in the
conversion of this to a in the process of suffix addition. So this step simply
reverses the process. For example, the word (cultivation) is converted to
(their cultivation) when the possessive pronoun (their) is added. If the generated
term is validated, it is returned as the output of this step. Otherwise the original word
is checked against the next suffix in the list until all are exhausted. If no match is found
by this procedure, the original word is returned as-is.
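The letter-reversal step described above can be sketched as follows. Since the glyphs were lost in extraction, the reversed conversion is assumed to be teh back to teh marbuta (consistent with the (cultivation)/(their cultivation) example), and the suffix set is passed in rather than reconstructed.

```python
TEH, TEH_MARBUTA = '\u062a', '\u0629'

def remove_set2_suffix(t, validate, suffix_set_2):
    # suffix_set_2: e.g. possessive pronouns such as -hum ("their");
    # the published set's glyphs were lost, so it is a parameter here.
    for suf in suffix_set_2:
        if t.endswith(suf) and len(t) - len(suf) >= 2:
            base = t[:-len(suf)]
            if validate(base):
                return base
            # attaching a possessive turns a final teh marbuta into teh;
            # reverse that conversion before validating again
            if base.endswith(TEH) and validate(base[:-1] + TEH_MARBUTA):
                return base[:-1] + TEH_MARBUTA
    return t
```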

6.3. Infix Removal (Broken Plural Detection and Handling)


After suffix removal, infix removal takes place. Infixes usually occur as a part of broken
or irregular plurals. While there is no straightforward way for handling irregular
plurals except through the use of dictionaries, some of the commonly used broken
plurals exhibit well-defined patterns that can be detected and transformed. Examples
of patterns detected and handled by the developed stemming algorithm are shown in
Table VII.
For each of the previous patterns, a transformation rule is defined to transform
the broken plural pattern to its singular form. Each time a pattern matches and a
transformation occurs, the validation takes place as previously described and shown
in Figure 4. If validation results in a match being found, then the transformed pattern
is assumed to be the stem of the original word. The rules are outlined in Table VIII.
These rules are only applied to words that are three characters or longer. For each
word w matching the condition of a rule, a word w’ is created by applying the rule
to w; w’ represents a candidate, or potential, stem for word w. When
a stem list is being utilized, then w’ is checked against entries in the list. If a match is
found, then w’ is determined to be the stem for word w. When no match is found or in
case a stem list is not being utilized, w’ is matched against words that have appeared
in the local context of word w (the document in which the term w has appeared). A


Table VIII. Pattern Detection and Transformation Rules

match in this case also results in w’ being returned as the stem of word w, but with
less certainty.
For all rules except R1 and R4, if w’ is not found in the stem list, then the letter is
appended to w’ and a match is attempted again. The same is also true when trying to
find a match in the local context. In order to ensure efficient matching, all words in
a stem list, as well as in a local context, are stored in a hash table. Rule 4 differs
in that it actually adds a character, the (teh marbota).
Because the addition of the teh marbota character is very tricky to determine from a
local context, this rule is restricted to use in conjunction with a stem list. Figure 6
summarizes the pattern detection and transformation procedure.
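The rule-application and validation procedure can be sketched generically as follows. Table VIII's contents were lost in extraction, so the example rule below is purely illustrative and of our own devising, not necessarily one of the paper's R1-R11; the retried trailing letter is assumed to be teh marbuta.

```python
def apply_broken_plural_rules(w, rules, stem_list, local_context):
    # Each rule is (matches, transform, retry_with_teh): the candidate w'
    # is validated first against the stem list, then the local context;
    # most rules (all but R1 and R4 in the paper) retry with a trailing
    # letter appended, assumed here to be teh marbuta.
    for matches, transform, retry_with_teh in rules:
        if len(w) >= 3 and matches(w):
            wp = transform(w)
            candidates = [wp, wp + '\u0629'] if retry_with_teh else [wp]
            for cand in candidates:
                if cand in stem_list or cand in local_context:
                    return cand
    return w

# Illustrative rule of our own (an assumption, not one of the paper's):
# five-letter plurals shaped alef-X-Y-alef-Z reduce to X-Y-Z,
# e.g. aqsam (sections) -> qsm (section).
example_rule = (
    lambda w: len(w) == 5 and w[0] == '\u0623' and w[3] == '\u0627',
    lambda w: w[1:3] + w[4],
    False,
)
```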
7. ANALYZING INFIX RULE PERFORMANCE
Collecting statistics on the number of times a certain broken plural rule matches
words in the text (fires) and the number of times it results in a transformation
(succeeds), as well as the precision and recall of each rule, can provide insight into the
importance of each rule, as well as guidance on how to improve the rules or use
them differently. In order to analyze the rules and patterns on an individual level, a
dataset consisting of 4,098 different news stories was used.1 Files in the dataset were

1 Dataset is available from http://www.claes.sci.eg/coe wm/data.htm.


Fig. 6. Pseudocode for infix removal.

Table IX. A Sample of the Output Obtained for Rule R1

all loaded at first, and then individual words were stemmed. No stem list was used;
instead stemming relied completely on the local context (which is the aggregate of all
words loaded from input documents). The number of words contained in this dataset
was 1,202,692 (approximately 1.2 million words). After filtering out stop-words, a total
of 75,160 unique terms was retained. To analyze individual rules, the stemmer was
augmented with code that keeps track of the number of times a rule was fired and
the number of times it succeeded (resulted in a transformation). For each rule, a file
was generated containing the words that matched the rule’s pattern, the suggested
transformation, and whether or not the transformation actually took place. A sample
of the output generated for rule R1 is shown in Table IX. It is important to reiterate
that for a transformation to occur, the resulting candidate term must have occurred at
least once in the local context. In the future, a higher threshold will be investigated
to avoid mismatches caused by possible typing errors. It is also important to note that,
in terms of precision, all obtained results would likely have been better had a smaller
or domain-specific corpus been used.
The results of this analysis are summarized in Table X. In this table, the total
number of times each rule is fired is displayed, as well as the total number of times it
succeeded. Also displayed is the total number of unique words matching the rule. In
addition, stemming precision for each rule is presented as well as stemming recall and
the stemming F-score (which represents the harmonic mean of precision and recall).


Table X. Statistics for Individual Rules


Rule Code # of times fired # of unique words # of times succeeded Precision Recall F-score
R1 345 81 193 94.64% 81.54% 87.6%
R2 69 17 59 100% 92.86% 96.3%
R3 118 20 26 85.71% 66.67% 75%
R4 1999 608 1291 47.65% 96.31% 63.7%
R6 5367 1552 1967 25.27% 92.6% 39.7%
R7 2255 962 1089 10.19% 100% 18.49%
R8 591 104 164 80% 94.12% 86.49%
R9 2001 466 1337 59.24% 94.02% 72.69%
R10 194 40 152 58.6% 94.44% 72.34%
R11 633 167 245 77.57% 96.49% 85.94%
Precision = (total number of correctly transformed words) / (total number of transformed words)
Recall = (total number of correctly transformed words) / (total number of words that should have been
transformed).

Stemming precision is calculated based on the number of unique words and their
correctly obtained stems using the following equation.
Stemming Precision for rule i = (number of unique words correctly stemmed by rule
i)/(total number of words stemmed by rule i).
Stemming recall for rule i = (number of unique words correctly stemmed by rule
i)/(number of unique words correctly stemmed by rule i + number of unique words that
should have been stemmed by rule i but were not).
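The F-score column of Table X can be reproduced directly from the precision and recall columns, since it is simply their harmonic mean; a quick spot-check against the published figures for R1 and R2:

```python
def f_score(precision, recall):
    # harmonic mean of precision and recall, as used in Table X
    return 2 * precision * recall / (precision + recall)

# Spot-check against Table X: R1 (P=94.64%, R=81.54%) gives ~87.6%,
# and R2 (P=100%, R=92.86%) gives ~96.3%, matching the F-score column.
r1 = f_score(0.9464, 0.8154)
r2 = f_score(1.0, 0.9286)
```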
When calculating precision and recall, any ambiguous conflations were counted as
wrongly transformed. R1 performed fairly well in terms of accuracy. An example of
a wrong transformation performed by this rule is the name of the country
(Algeria), which was transformed to (Island). Rule R3 appears to fire only rarely.
Rule 5 was fired 12,599 times, but since it can only succeed if a match between a
potential stem and an entry in a stem list is found, no statistics for this rule could be
obtained. After looking at incorrect stems produced by Rule 5, it was concluded that
certain conditions can be added to increase its accuracy. Words matching with this
rule should not end with letters “ ”, “ ”, or , nor should they start with letters “ ”,
“ ”, “ ”, or “ ”. The same is also true for Rule 6 which has very low precision. By just
adding the “ ” restriction to this rule, precision immediately increased to 43.77%. Rule
7 mostly matched with verbs, but even when excluding these from the evaluation, the
precision was still very low, suggesting that this rule might best be confined to usage in
conjunction with a stem list. This analysis shows that for most of the presented rules,
using context information does in fact act as a valid safeguard against low precision.

8. EVALUATION
The design of the evaluation experiments targeted a number of specific questions which
are as follows.

(1) Does the proposed stemming methodology used in conjunction with a stem list
significantly improve stemming accuracy, as claimed?
(2) If a stem list has not been built, can using the same approach and relying only on
the local context of documents still enhance stemming accuracy?
(3) Even if accuracy is enhanced as claimed, can this really positively enhance the
performance of some real-life applications, and if so, to what extent?

To answer each of the preceding questions, a series of experiments was conducted.


These are detailed in the following subsections.


Table XI. Comparison between Accuracy Results Obtained Using the Proposed Stemmer
with and without a Stem List, and a Light Stemmer
Proposed Stemmer Proposed Stemmer Light Stemmer
With a Stem List Without a Stem List
Document 1 90.4% 75% 50.9%
Document 2 89.4% 67.4% 45.8%
Document 3 93.1% 73.6% 55.3%
Document 4 90.5% 67.6% 47.2%
Average Accuracy 90.8% ± 1.59% 70.9% ± 3.96% 49.8% ± 4.25%

8.1. Determining Improvements in Accuracy


The experiments presented in this subsection were carried out to answer the first
two questions, which concern the significance of the improvements in accuracy using
the proposed stemmer with and without a stem list. For experimentation
purposes, the light10 stemmer was implemented in Java using the rules outlined in Larkey
et al. [2007], as described earlier. It is important to note that the light10 stemmer was
not designed to be an accurate stemmer, but rather to target a specific application,
information retrieval, whose performance it has significantly improved. Flores et al.
[2010] have in fact shown that higher stemming accuracy can sometimes degrade
information retrieval performance. So, the sole goal of this experiment is to demonstrate
that extending this light stemmer as previously described yields significant improvements
in accuracy, which can make it applicable to tasks that do require accuracy.
To build an agricultural stem list, a total of nine agricultural extension documents
were used as a training set. The total number of words in these documents was 19,291,
but unique nonstop-words amounted to only 3,869 terms. From these, a stem list
consisting of 1,268 terms was built as well as a dictionary for irregular terms consisting
of 68 terms. The whole process of reviewing the stems and building the stem list took
a little less than two hours. In the first part of this experiment, the proposed stemmer
was used in conjunction with the stem list to stem a totally different set of agricultural
documents. The set consisted of four different large documents and each document was
stemmed separately using the proposed stemmer. The total number of words in these
documents was 9,818 words while the average length of each document was 2,455 ±
405 words. In order to facilitate the manual review of the resulting stems, only unique
terms and their stems were shown to the evaluator. Stems resulting from the proposed
approach were shown side by side with stems resulting from the light stemmer. Invalid
entries, which include verbs, stop-words, English words, and misspelled entries, were
excluded from the evaluation process. However, words that can be either verbs or
nouns depending on the context were included and were treated as nouns. A total of
1,889 unique words were produced for all four documents, of which 1,524 words were
considered valid. A stem was considered correct only if it followed our definition
of a stem, which is that it represents the singular and, whenever possible, masculine
form of the input word. After manually reviewing all resulting stems, accuracy for each
method on each document was calculated using the following formula.
Accuracy = sum of correctly stemmed words/total number of valid words
The results of this step are shown in Table XI. In the second part of this experiment,
the same procedure outlined before was repeated, but this time the proposed stemmer
did not make use of a stem list. In the absence of a stem list, transformations on words
were carried out if the word resulting from the application of a transformation rule
was found in the local context (i.e., the words constituting the input document). The
evaluation of the produced stems can also be found in Table XI.
From the last row in Table XI, it can be seen that, on average, using the proposed
stemmer in conjunction with a stem list yields an 82.3% improvement in the accuracy


of stemming over a light stemmer, and that when using it without the stem list there is
an improvement of about 42.4%. Although the dataset used is small, a t-test on the
difference between the results showed that in both cases (with and without a stem list),
the difference in accuracy between the proposed stemmer and a light one is statistically
significant. In the first case (using a stem list) the p value was less than 0.0001 with a
t value of 18.1 and 6 degrees of freedom, while in the second the p value was 0.0003
with a t value of 7.27, also with 6 degrees of freedom.
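The last row of Table XI, and the improvement percentages quoted above, can be verified from the per-document figures; a quick check, using the table's rounded averages for the ratios as the text does:

```python
from statistics import mean, stdev

# Per-document accuracies from Table XI
with_list    = [90.4, 89.4, 93.1, 90.5]   # proposed stemmer with a stem list
without_list = [75.0, 67.4, 73.6, 67.6]   # proposed stemmer without a stem list
light        = [50.9, 45.8, 55.3, 47.2]   # light stemmer

# averages and sample standard deviations, matching Table XI's last row
averages = {name: (mean(v), stdev(v))
            for name, v in [('with', with_list), ('without', without_list),
                            ('light', light)]}

# relative improvements quoted in the text, computed from the rounded
# table averages (90.8, 70.9, 49.8)
improvement_with    = 100 * (90.8 / 49.8 - 1)   # ~82.3%
improvement_without = 100 * (70.9 / 49.8 - 1)   # ~42.4%
```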

8.2. Evaluating the Proposed Stemmer with Real-Life Applications


Having established that the proposed stemmer achieves better stemming accuracy,
the goal of the second experiment was to investigate whether such an improvement has
practical implications for real applications. One of the motivations for developing the
proposed stemming approach was to use it as part of an Arabic semantic annotation
system as well as an ontology learning system. Broadly
speaking, semantic annotation refers to the task of representing raw text that is in-
comprehensible to machines in a machine-processable format. Semantic annotation
systems vary in their complexity, from systems that simply try to identify concepts
in a document, to systems that try to identify concepts and their attributes as well
as their relationships to neighboring concepts [Laclavik et al. 2007; El-Beltagy et al.
2007; Hazman et al. 2009].
For our semantic annotation task, it was very important to maximize matches between
concepts in an ontology and their corresponding forms, but not with nonmatching
terms, as this would directly affect the performance of the system, as well as the
automatic evaluation of the ontology learning system, in which terms in an actual ontology
are compared to terms in the learned ontology. Using existing stemmers resulted
in missing certain important matches, for example, for the concepts
(“diseases”, “disease”) and (“fertilizers”, “fertilizer”), as well as
(“weeds”, “weed”). It may not be obvious in the English translations of these terms, but
they are all broken plurals. Existing stemmers also conflated independent concepts
such as and (irrigation, water for irrigation). However, the former case was
the more common one. So, it was only natural to test the developed stemmer on a
simplified version of a system that carries out automatic annotations. A description of
the full system can be found in El-Beltagy et al. [2007]. Towards this end, an
experiment was set up to annotate or assign metadata to section headings extracted from
agricultural documents using concepts from an agricultural ontology. The ontology
used consisted of 322 concepts. The section headings were automatically collected from
90 agricultural extension documents and amounted to 3,192 headings. Each heading
was tagged with zero or more concepts from the ontology if a match between phrases
in that heading and an entry in the ontology was found. Stemming of concepts and
headings is an essential preprocessing step in the matching process. Using the light10
stemmer in this step resulted in the generation of 3,260 correct concept annotations
and 0 incorrect annotations, which means that an average of 1.02 correct tags were
generated per heading with a standard deviation of 0.98 tags per heading. Substituting
the light10 stemmer with the proposed one and using the dictionary generated in the first
experiment, a total of 3,934 correct concept annotations were generated, and also 0
incorrect annotations. So, on average 1.23 correct tags were generated per heading
with a standard deviation of 1.04 tags. Comparing these results, it can be observed
that using the proposed approach resulted in an increase of 20.7% in correct tags. When
the average number of tags was used to establish the significance of the difference


Table XII. Comparison between a Light Stemmer and Different Configurations for Proposed Stemmer
Total # of Matches Average Precision Average Recall Average # of Matches per Doc.
Light Stemmer 175 0.175 +/− 0.088 0.292 +/− 0.244 1.75 +/− 0.88
C1 180 0.181 +/− 0.082 0.297 +/− 0.240 1.80 +/− 0.82
C2 187 0.188 +/− 0.090 0.305 +/− 0.239 1.87 +/− 0.90
C3 192 0.192 +/− 0.092 0.312 +/− 0.241 1.92 +/− 0.83
C4 194 0.194 +/− 0.093 0.314 +/− 0.241 1.94 +/− 0.93

between the two results, the difference was found to be statistically significant with
p < 0.0001 (t = 7.6 and degrees of freedom = 6,382).
Another experiment was carried out to determine the effect of stemming on the
task of keyphrase extraction. In this experiment the KP-Miner system was used for
keyphrase extraction [El-Beltagy and Rafea 2009b]. The used dataset consisted of 100
randomly collected articles from the Arabic Wikipedia [Wikipedia 2008]. Keywords
for each article were obtained from the keyword metatag associated with each, but
numeric entries (mostly denoting year numbers) were ignored and so were Wikipedia-
related tags (such as article seed, for example). The average number of words per
document in this dataset is 804 ± 934 and the average number of keyphrases is 8.1 ±
3.2. The percentage of author-assigned keyphrases actually appearing within the body
of associated articles in this dataset is 81.8%. The KP-Miner system allows the user
to specify the number of keyphrases to extract for each input document. Setting this
number to 10, a comparison was made to see how many keyphrases would be correctly
extracted when using a light stemmer as opposed to using the proposed stemmer. Four
different configurations for the proposed stemmer were used.

(1) C1: Stemmer is used with no stem list.


(2) C2: Stemmer is used with no stem list, but in conjunction with a light stemmer (i.e.,
stemming is enforced). This means that after carrying out the transformations
outlined before, if the resulting word is the same as the original word, a light
stemmer is invoked on the resulting word.
(3) C3: Stemmer is used with the stem list obtained from agricultural documents.
(4) C4: Stemmer is used with the stem list from agricultural documents in conjunction
with a light stemmer.

The results of comparing these configurations with each other and with a light
stemmer are shown in Table XII.
As can be seen from Table XII, the best result was obtained using the proposed
stemmer in conjunction with the stem list obtained from agricultural documents and a
light stemmer. These results show an approximately 11% improvement in keyphrase
extraction over the basic light stemmer, despite the fact that the used stem list is
not directly related to the document set from which keyphrases were extracted. When
using the t-test to compare average precision values, the difference between these
results turned out not to be statistically significant (p = 0.1394). Nevertheless,
another advantage offered by the presented stemmer was the elimination of an
undesirable mistake: previously, a common mistake of the keyphrase extractor was to
generate two phrases with the exact same meaning because one form was the broken
plural of the other; this was completely avoided with the presented stemmer. Further
experimentation with a larger dataset and with other configurations of the proposed
stemmer is planned.


9. CONCLUSION AND FUTURE WORK


This article has presented an approach that makes use of text within a corpus or a
document to verify whether or not to strip certain prefixes, suffixes, or infixes from a
word. The approach can be utilized for rapidly building stem lists that can greatly
improve stemming accuracy, as well as during the stemming process itself. For the
removal of infixes resulting from broken plurals, a set of transformation rules was
proposed. Experimenting with the transformation rules on a large corpus, and
analyzing the results for each individual rule, revealed that for most of the proposed rules,
using context information does in fact act as a valid safeguard against stemming
errors. The same analysis has shown that some rules are best used in conjunction with a
stem list because of their low overall average precision when applied in a local context.
Evaluation of the proposed stemmer has shown that it does in fact achieve
significantly higher accuracy when compared to a light stemmer, with or without the use of
a stem list. It has also shown that improved stemming accuracy leads to significant
improvement in the task of automatic annotation.
The main contribution of this work is the way in which it has extended a light
stemmer so as to allow it to handle broken plurals and validate questionable affix
removals. The presented approach for semiautomatically building compact stem lists
from an input corpus is also novel.
Future work will mainly focus on means of making the stemmer even more accurate
by addressing shortcomings identified in the analysis phase as well as by other means.
We will also investigate the possible addition of other broken plural patterns.
Experimenting with modifying the stemmer and testing it on the task of information retrieval
is also planned. In addition, some modifications to the stem building application, so as
to make the stem list building process even faster, are intended. One of the important
modifications is to offer alternative stems for each word in case a word matches with
more than one pattern.

ACKNOWLEDGMENTS
The authors wish to thank the two anonymous reviewers for their helpful comments and suggestions that
have contributed to the improvement of the quality of the article.

ACM Transactions on Speech and Language Processing, Vol. 7, No. 2, Article 2, Publication date: February 2011.

Received March 2010; revised December 2010; accepted December 2010