Abstract
Term mismatch is a common limitation of traditional information retrieval (IR) models, in which relevance scores are estimated from exact matches between document and query terms. A good IR model should also consider distinct but semantically similar words in the matching process. In this paper, we propose a method to incorporate word embedding (WE) semantic similarities into existing probabilistic IR models for Arabic in order to deal with term mismatch. Experiments are performed on the standard Arabic TREC collection using three neural word embedding models. The results show that the extended IR models significantly outperform their baseline bag-of-words counterparts, whereas the differences among the evaluated neural word embedding models are not statistically significant. Moreover, the overall comparison shows that our extensions significantly outperform the Arabic WordNet-based semantic indexing approach and three recent WE-based IR language models.
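To make the general idea concrete, the following is a minimal sketch of one way embedding similarities could be folded into a probabilistic ranking function. It is an illustration under stated assumptions, not the formulation used in the paper: the choice of BM25 as the host model, the similarity threshold of 0.7, the function names (`semantic_tf`, `bm25_score`), and the toy English vocabulary and two-dimensional vectors are all hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_tf(query_term, doc_counts, vectors, threshold=0.7):
    """Exact-match count plus similarity-weighted counts of related terms.

    Assumption: only document terms whose cosine similarity to the query
    term reaches `threshold` contribute, each weighted by that similarity.
    """
    total = float(doc_counts.get(query_term, 0))
    q_vec = vectors.get(query_term)
    if q_vec is None:
        return total
    for term, count in doc_counts.items():
        if term == query_term or term not in vectors:
            continue
        sim = cosine(q_vec, vectors[term])
        if sim >= threshold:
            total += sim * count
    return total


def bm25_score(query, doc, docs, vectors=None, k1=1.2, b=0.75, threshold=0.7):
    """BM25 score of `doc`; uses the enhanced term frequency if vectors are given."""
    counts = Counter(doc)
    avg_len = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for q in query:
        df = sum(1 for d in docs if q in d)
        idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1.0)
        if vectors is not None:
            tf = semantic_tf(q, counts, vectors, threshold)
        else:
            tf = float(counts.get(q, 0))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score


# Toy data (hypothetical): the query term "car" never occurs in the collection.
docs = [["automobile", "repair"], ["banana", "bread"]]
vectors = {"car": [1.0, 0.0], "automobile": [0.9, 0.1], "banana": [0.0, 1.0]}

baseline = bm25_score(["car"], docs[0], docs)           # exact matching only
extended = bm25_score(["car"], docs[0], docs, vectors)  # with embedding similarities
```

In this toy setting, exact matching gives the first document no credit for the query "car", while the extended term frequency lets the similar term "automobile" contribute. The threshold limits the contribution to close neighbours so that weakly related terms do not add noise.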

Notes
LDC catalog number LDC2001T55.
Cite this article
El Mahdaouy, A., El Alaoui, S.O. & Gaussier, E. Improving Arabic information retrieval using word embedding similarities. Int J Speech Technol 21, 121–136 (2018). https://doi.org/10.1007/s10772-018-9492-y