Classifying Arabic Web Pages Toolkit
each language [4]. Consequently, NLP techniques need to be applied to texts. Unfortunately, Arabic NLP is still in its initial stages compared to the work on the English language. NLP tasks may include tokenization, morphological analysis, part-of-speech (POS) tagging, and parsing [5]. Tokenization is the process of breaking the text into sentences and words. POS tagging determines the part-of-speech tag for each word of the text based on the context in which it appears [5] [8]. The most common set of POS tags contains seven different tags (Article, Noun, Verb, Adjective, Preposition, Number, and Proper Noun) [5]. Usually, POS taggers need a stemming process in order to perform morphological analysis of words. Parsing produces a parse tree of a sentence. After tokenizing the text into words, the number of resulting words should be reduced. This can be done by filtering and by lemmatization or stemming methods [8].

Filtering methods remove certain words from the document. This step enhances the data mining process, since all the removed words have low importance in the document. A standard filtering method is stop-word filtering. Stop words are very common words, or words that convey no meaning or content information, such as articles, conjunctions, prepositions, etc. [8]. In addition, words that occur extremely often are considered to carry little information for distinguishing between documents, and words that occur very rarely in the document can be removed too.

Stemming is an NLP technique used to combat the vocabulary mismatch problem and to simplify and minimize the number of words used in text mining tasks. Stemmers equate or conflate certain variant forms of the same word. For example, (كتب) which means write, (كتاب) which means book, and (مكتوب) which means written, are all derivations from the root (كتب), which carries the notion of writing [9]. In many languages stemming is primarily a process of suffix removal.

The author of [12] categorizes stemmers into two types: light and heavy stemmers. Light stemmers reduce words to their stems -- the word without affixes (including prefixes and suffixes). Heavy stemmers reduce words to their roots. They include light stemming, besides reducing the resulting stems to their roots. Roots are the smallest lexical unit of a word. For example, the word (معلمون) has the stem (معلم) and the root (علم). Each specific language has a custom-made stemmer. The nature of the Arabic language makes it very difficult to stem. Light stemmers are fast and simple, since they don't need any grammatical analysis [12]. Tim Buckwalter's Arabic Morphological Analyzer [2] is an example of them. Heavy stemmers are slower than light stemmers, because they need to analyze words, but they have the advantage of minimizing the size of the dictionary. The Shereen Khoja stemmer [1] is an example of them.

Additional filtering may be done on the text of an instance (i.e. the web page text) by taking into account only the most frequent n words, rather than the whole text.

3. THE PROPOSED APPROACH
The aim of this research is to produce a complete environment (toolkit) for Arabic web classification. The toolkit provides users with all they need to build the classification model and then classify any Arabic web page. It integrates many existing tools and techniques, as well as enhancements of some algorithms.

In this research, we also concentrated on finding the best Arabic web page classification algorithm. To achieve our goal, we divided our work into two stages: "The Learning Stage" and "The Prediction Stage". Each stage contains many tasks, all of which are available within the toolkit.

3.1 The Learning Stage
This is where the preprocessing tasks on the Arabic web pages are done. In the preprocessing phase, we apply different tasks in order to extract the features of each predefined category. We prepared a training set for each category and applied the following tasks to it:

Parsing. Parsing the Arabic web pages and extracting their text. This can be done in two ways: extract the full text of the web page, or extract the content of some selected tags. These methods were shown in [3] and experimented with in [7].

Stemming. Processing the extracted text using two methods: light stemming to extract the stem of each word, and heavy stemming to extract the root of each word. Our toolkit integrates several algorithms: the Khoja stemmer, which is a heavy stemmer, and AraMorph, which is a light stemmer. We also modified AraMorph to act as a heavy stemmer. Moreover, our toolkit offers the choice of stemming either the whole text of the parsed material or only its first N words.

Feature Extraction. Extracting the features of each category from the training set. This is achieved by finding the most frequent words (MFW), which are either stems or roots, of each category. The MFW of each category are found by first finding the MFW of each instance in the category's training set, appending them to a single file, and then selecting the MFW from this file. We then create the attribute-class relation file, which is ready to feed to a classifier in order to build the classification model.

Classification Algorithms. We used different classification algorithms to build the classification model. So far, in this paper we have used only two algorithms (C4.5 and AntMiner). Moreover, the environment provides all the classification algorithms that are available in Weka, such as neural networks, SVM, etc., and work is in progress to enrich the environment with other algorithms not available in Weka.

As shown above, we proposed different combinations of methods, all using the same data sets as input. This approach gives us the ability to determine the best techniques, parameters, and algorithms for classifying Arabic web pages.

3.2 The Prediction Stage
This is where the classification of Arabic web pages is actually done. In this stage we use a predetermined classification model (the result of the learning stage) to classify the web page with the chosen classification algorithm.

4. EXPERIMENT AND RESULTS
In our work, we have defined 5 classes (Banking, Health, Religious, Sport, and Technology). The instance collection process was identical for each category. The instances were selected completely at random in order to be representative of the real HTML world. In addition, we chose the web pages from several web sites; in practice, the maximum number is about six web pages from each site.
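As an illustration, the feature-extraction task described in Section 3.1 (finding the most frequent words after stop-word filtering) can be sketched in a few lines. This is only an illustrative sketch: the stop-word list and the toy English pages are hypothetical stand-ins for a real Arabic stop list and for the stemmed text of the training pages.

```python
from collections import Counter

# Hypothetical stop-word list; a real run would use an Arabic one.
STOP_WORDS = {"the", "a", "of", "in", "and"}

def most_frequent_words(pages, n=5):
    """Return the n most frequent non-stop words (the MFW features)
    over all training pages of one category."""
    counts = Counter()
    for text in pages:
        counts.update(w for w in text.lower().split() if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

# Toy training set for one category (a stand-in for stemmed page text).
pages = ["the bank raised interest rates",
         "interest rates and bank deposits",
         "deposits earn interest in the bank"]
print(most_frequent_words(pages, n=3))  # → ['bank', 'interest', 'rates']
```

In the toolkit, the selected words become the attributes of the attribute-class relation file that is fed to the classifier.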
We have collected 335 web pages from different websites. Each category has 67 web pages; two-thirds of them (the first 45 web pages) form the training set, and the remaining third (22 web pages) is left for the test set.

Using the proposed approach, we were able to test different setups of the web page classification process. As classification techniques, we mainly used AntMiner and J48, which is Weka's implementation of C4.5 [6]. The main comparison criterion is the accuracy rate of each classifier. To test the classification performance, we used two approaches: cross-validation and the train-and-test method. For cross-validation we used 10 folds, as is standard [15]. With AntMiner, the number of ants was kept at the default, which is 5.

The parameters we varied over the classification process were as follows. In parsing: extraction of the full text or of selected tags only. In preprocessing: use of heavy stemming to extract roots using two tools (AraMorph, Khoja stemmer), or use of light stemming to extract stems (AraMorph); in addition, stemming was applied either to the whole text or to the first 200 words only. In finding frequent words: we took 5 or 10 attributes from each category. Web page source: use of web pages from multiple web sites or from one web site.

We faced some difficulties in working with Arabic web pages which could affect the classification performance. Most Arabic web pages are badly programmed. In addition, Arabic web sites are not semantically rich in their HTML coding; researchers found that around 99% of Arabic websites do not implement any metadata standards at all [11].

The experimental results of classifying web pages from many web sites are discussed in detail in the following. Our data set contains 335 different web pages divided into 5 categories. The training set contains 45 instances for each category, while the remaining 22 instances are left for the test set.

Selecting five attributes from each category, we performed many classification performance tests, organized into two main divisions: testing by extracting the full text from web pages, and testing by extracting the content of some selected tags.

Extracting the Full Text.
Preprocessing of the Full Text. The comparison results obtained are summarized in the first part of Table 1. The first column shows the stemming method. The second and third columns show the accuracy rates of AntMiner and C4.5, respectively, using the train-and-test method. The fourth and fifth columns show the same accuracy rates computed using cross-validation. The table shows clearly that Khoja roots give better accuracy than AraMorph roots or stems. It also shows that AntMiner and C4.5 are comparable in accuracy rate.

Preprocessing of the First 200 Words. We also tested classification by taking only the first 200 words of the extracted text, using both AntMiner and C4.5. The comparison results obtained are summarized in the second part of Table 1. It shows clearly that AraMorph roots give the worst results and that, again, Khoja roots give the best accuracy. It also shows that AntMiner and C4.5 remain comparable.

Table 1 also shows a comparison of results using the extracted full text vs. the first 200 words. As Khoja roots give the best results, we use them for this comparison. It is clear from the comparison that taking the whole text gives better results than taking only the first 200 words. It also shows that AntMiner performs better than C4.5.

Table 1: Classification Performance While Considering the Full Text vs. the First 200 Words (accuracy %; AM = AntMiner, CV = 10-fold cross-validation)

    Parsed Material    AM      C4.5    AM (CV)   C4.5 (CV)
    Full Text:
    Light Stemming     73.64   64.55   83.09     85.78
    Khoja Roots        77.27   73.64   88.76     86.22
    AraMorph Roots     62.73   67.27   78.81     75.11
    First 200 Words:
    AraMorph Stems     63.64   60.91   73.79     72.89
    Khoja Roots        69.09   70.91   75.36     79.56
    AraMorph Roots     57.27   55.45   69.30     71.11

Extracting the Content of Selected Tags.
We also experimented with classification by extracting the content of some selected tags. The tags we chose were: TITLE, HEAD, HEADINGS, BOLD, ITALIC, UNDERLINE, and META data (descriptions, keywords). As a result, many instances in the training set yielded no information, and thus we removed their occurrence-vector rows. This reduced the size of the data set, and thus the evaluation was not very reliable. The results obtained are summarized in Table 2.

Table 2: Classification Performance Comparison -- Content of Selected Tags (accuracy %)

    Method             AM      C4.5    AM (CV)   C4.5 (CV)
    AraMorph Stems     44.44   40.74   87.50     86.08
    Khoja Roots        55.10   53.06   78.83     81.37
    AraMorph Roots     45.24   48.81   69.27     77.35

Finally, in our experiment on multiple web sites, we compared the classification performance of AntMiner and C4.5 using 5 or 10 attributes. Since the best performance was reached using the full text and Khoja roots, we made this comparison using those parameters. Table 3 summarizes the obtained results. It appears clearly that taking only the 5 most frequent words gives a better accuracy rate than taking the 10 most frequent words.

Table 3: Classification Performance Comparison -- Taking 5 vs. 10 Attributes (accuracy %)

    Parsed Material    AM      C4.5    AM (CV)   C4.5 (CV)
    5 Attributes       76.36   73.64   88.76     86.22
    10 Attributes      73.64   60.91   86.71     83.11

From our extensive experiments, as described above, we conclude that the best parameter setup to achieve the highest classification performance is as follows. In parsing: extracting the full text. In preprocessing: extracting roots with the Khoja stemmer. In finding frequent words: taking the 5 most frequent words from each category.
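The two evaluation protocols used above (train-and-test and 10-fold cross-validation) can be sketched as follows. The majority-class predictor here is a hypothetical stand-in for the real classifiers (C4.5 and AntMiner, taken from Weka); it serves only to show how the accuracy rates in the tables are computed.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of correctly classified instances."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def majority_class(labels):
    """Hypothetical stand-in classifier: always predicts the most
    common training label (the toolkit uses C4.5 / AntMiner instead)."""
    return Counter(labels).most_common(1)[0][0]

def cross_validation(labels, folds=10):
    """Mean accuracy of the majority-class baseline over k folds.
    Real CV would also shuffle or stratify the instances first."""
    scores = []
    for k in range(folds):
        test = labels[k::folds]                               # held-out fold
        train = [l for i, l in enumerate(labels) if i % folds != k]
        pred = majority_class(train)
        scores.append(accuracy([pred] * len(test), test))
    return sum(scores) / folds

# Toy label set: 6 sport pages and 4 health pages.
labels = ["sport"] * 6 + ["health"] * 4
print(round(cross_validation(labels), 2))  # → 0.6
```

With a train-and-test split, `accuracy` would instead be applied once, to the predictions on the held-out test instances (the 22 pages per category above).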
5. CONCLUSION AND FUTURE WORK
Classifying Arabic-language web pages is a pressing need. We faced and resolved a number of complications throughout this research, and expert-level knowledge was required in several fields.

Our work toward Arabic web page classification passed through the following major milestones. We started by gaining knowledge of the different areas concerned with the design and implementation of the project. We did not reinvent the wheel; instead, we utilized existing tools and technologies that suited our requirements. After establishing a strong foundation for the research, we collected the data instances. We then integrated all the required tools, along with our modifications and extensions, to perform the different tasks of the learning stage, and we created the classification model. This integration constitutes the Learning Environment. Finally, we built the Prediction Tool, which applies the best results found in the learning stage to classify a single instance.

We have many additional ideas and improvements for the environment, which we suggest as future work. One of the most significant improvements is expanding the number of predefined categories in order to support a wider variety of applications. Another is to enable individual users to extend the predefined categories by providing a data set and its corresponding category. A very important piece of future work is to improve the feature selection method, for example by taking only nouns, N-grams, etc. This development should make the tool reliable enough to be used as a web browser plug-in for filtering search results.

6. REFERENCES
[1] Shereen Khoja stemmer, November 2008. http://zeus.cs.pacificu.edu/shereen/research.htm.
[2] Tim Buckwalter's Arabic morphological analyzer, August 2008. http://www.qamus.org/ , http://www.nongnu.org/aramorph/.
[3] B. D. Davison and X. Qi. Web page classification: Features and algorithms. ACM Computing Surveys (CSUR), 41(2), 2009.
[4] H. A. do Prado and E. Ferneda. Emerging Technologies of Text Mining: Techniques and Applications. Information Science Reference, first edition, 2008.
[5] R. Feldman and J. Sanger. The Text Mining Handbook. Cambridge University Press, first edition, 2007.
[6] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, third edition, 2011.
[7] N. Holden and A. A. Freitas. Web page classification with an ant colony algorithm. Technical report, Computing Laboratory, University of Kent, 2006.
[8] A. Hotho, A. Nürnberger, and G. Paaß. A brief survey of text mining. LDV Forum - GLDV Journal for Computational Linguistics and Language Technology, 20:19-62, 2005.
[9] Y. Kadri and J.-Y. Nie. Effective stemming for Arabic information retrieval. Technical report, Laboratoire RALI, DIRO, Université de Montréal, Canada, 2006.
[10] R. Kosala and H. Blockeel. Web mining research: A survey. Technical report, Katholieke Universiteit Leuven, Belgium, 2000.
[11] K. A. Mohamed. The impact of metadata in web resources discovering. Online Information Review, 2005.
[12] A. F. Nwesri. Arabic text processing for indexing and retrieval. Technical report, School of Computer Science and Information Technology, RMIT University, Melbourne, Australia, 2007.
[13] R. S. Parpinelli, H. S. Lopes, and A. A. Freitas. Data mining with an ant colony optimization algorithm. IEEE Transactions on Evolutionary Computation, 6(4), 2002.
[14] J. Srivastava, P. Desikan, and V. Kumar. Web mining - accomplishments & future directions. In Web Mining - Accomplishments & Future Directions, pages 461-481. University of Minnesota, USA, 2004.
[15] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, second edition, 2005.