Abstract
A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents available for public or private access makes automatic text classification a necessity. While considerable effort has gone into Western languages, mostly English, little experimentation has been done with Arabic. This paper presents, first, an up-to-date review of work on Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The techniques identified in the literature review are illustrated by applying them to the proposed dataset. Across the feature selection methods, weighting methods, and classification algorithms tested, the support vector machine performed best on average, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the lowest was 61% for the Arabic Poems dataset.
Acknowledgments
This project was fully funded by King Abdulaziz City for Science and Technology via grant number 104-27-30. The authors would like to thank the two anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.
Cite this article
Khorsheed, M.S., Al-Thubaity, A.O. Comparative evaluation of text classification techniques using a large diverse Arabic dataset. Lang Resources & Evaluation 47, 513–538 (2013). https://doi.org/10.1007/s10579-013-9221-8
DOI: https://doi.org/10.1007/s10579-013-9221-8