Improving Arabic information retrieval using word embedding similarities

El Mahdaouy, Abdelkader; El Alaoui, Saïd Ouatik; Gaussier, Eric

doi:10.1007/s10772-018-9492-y

Published: 19 January 2018

Volume 21, pages 121–136, (2018)

International Journal of Speech Technology

Abstract

Term mismatch is a common limitation of traditional information retrieval (IR) models, in which relevance scores are estimated from exact matching between document and query terms. A good IR model should also take distinct but semantically similar words into account in the matching process. In this paper, we propose a method to incorporate word embedding (WE) semantic similarities into existing probabilistic IR models for Arabic in order to deal with term mismatch. Experiments are performed on the standard Arabic TREC collection using three neural word embedding models. The results show that the proposed extensions significantly outperform their baseline bag-of-words models, whereas the differences among the evaluated neural word embedding models are not statistically significant. Moreover, the overall comparison shows that our extensions significantly outperform the Arabic WordNet-based semantic indexing approach and three recent WE-based IR language models.
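The general idea of letting semantically similar words contribute to matching can be illustrated with a minimal sketch. This is not the authors' formulation, which the abstract does not give: the function `soft_term_frequency`, the similarity threshold of 0.7, and the two-dimensional toy vectors are all assumptions made for illustration, and English placeholder words stand in for Arabic terms. The sketch counts exact occurrences of a query term in full and adds similar document terms weighted by their cosine similarity.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def soft_term_frequency(query_term, doc_tokens, embeddings, threshold=0.7):
    """Hypothetical similarity-weighted term frequency.

    Exact matches count fully; other document terms contribute their
    cosine similarity to the query term when it reaches the threshold.
    The threshold value is an assumption, not taken from the paper.
    """
    counts = Counter(doc_tokens)
    total = 0.0
    for term, count in counts.items():
        if term == query_term:
            total += count
        elif query_term in embeddings and term in embeddings:
            sim = cosine(embeddings[query_term], embeddings[term])
            if sim >= threshold:
                total += sim * count
    return total


# Toy embeddings (made up for this example, not trained vectors).
toy_embeddings = {
    "car": [1.0, 0.0],
    "automobile": [0.8, 0.6],  # cosine with "car" is about 0.8
    "banana": [0.0, 1.0],      # cosine with "car" is 0.0
}

doc = ["automobile", "banana", "automobile"]
exact_tf = doc.count("car")  # exact matching finds no occurrence of "car"
soft_tf = soft_term_frequency("car", doc, toy_embeddings)  # "automobile" now contributes
```

With exact matching, the document above would receive no credit for the query term "car"; with the similarity-weighted count, the two occurrences of "automobile" contribute. A count of this kind could in principle replace the raw term frequency inside a probabilistic scoring function, which is one plausible reading of extending existing IR models with WE similarities.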

Author information

Authors and Affiliations

Laboratory of Informatics and Modeling, Faculty of Sciences Dhar el Mahraz, Sidi Mohamed Ben Abdellah University, Fez, Morocco
Abdelkader El Mahdaouy & Saïd Ouatik El Alaoui
Université Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000, Grenoble, France
Abdelkader El Mahdaouy & Eric Gaussier

Corresponding author

Correspondence to Abdelkader El Mahdaouy.

About this article

Cite this article

El Mahdaouy, A., El Alaoui, S.O. & Gaussier, E. Improving Arabic information retrieval using word embedding similarities. Int J Speech Technol 21, 121–136 (2018). https://doi.org/10.1007/s10772-018-9492-y

Received: 25 July 2017
Accepted: 11 January 2018
Published: 19 January 2018
Issue Date: March 2018
DOI: https://doi.org/10.1007/s10772-018-9492-y
