
Benchmarking Arabic AI with Large Language Models

Ahmed Abdelali,1∗ Hamdy Mubarak,1∗ Shammur Absar Chowdhury,1 Maram Hasanain,1


Basel Mousi,1 Sabri Boughorbel,1 Yassine El Kheir,1 Daniel Izham,2
Fahim Dalvi,1 Majd Hawasly,1 Nizi Nazar,1 Yousseif Elshahawy,2
Ahmed Ali,1 Nadir Durrani,1 Natasa Milic-Frayling,1 Firoj Alam1
1 Qatar Computing Research Institute, HBKU, Qatar   2 Kanari AI, Doha, Qatar
fialam@hbku.edu.qa
∗ Equal contribution.

Abstract

With large Foundation Models (FMs), language technologies (AI in general) are entering a new paradigm: eliminating the need for developing large-scale task-specific datasets and supporting a variety of tasks through set-ups ranging from zero-shot to few-shot learning. However, understanding FMs' capabilities requires a systematic benchmarking effort by comparing FMs' performance with the state-of-the-art (SOTA) task-specific models. With that goal, past work focused on the English language and included a few efforts with multiple languages. Our study contributes to ongoing research by evaluating FMs' performance for standard Arabic NLP and Speech processing, including a range of tasks from sequence tagging to content classification across diverse domains. We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM, addressing 33 unique tasks using 59 publicly available datasets, resulting in 96 test setups. For a few tasks, the FMs perform on par with or exceed the performance of the SOTA models, but for the majority they under-perform. Given the importance of prompts for FM performance, we discuss our prompt strategies in detail and elaborate on our findings. Our future work on Arabic AI will explore few-shot prompting, expand the range of tasks, and investigate additional open-source models.

1 Introduction

The recent breakthroughs in artificial intelligence (AI) can be attributed to the remarkable performance of foundation models (FMs) (Bommasani et al., 2022) across a spectrum of research areas (e.g., machine translation, question-answering, automatic speech recognition, text-to-speech generation) and application domains (e.g., law, healthcare, education, and psychology). FMs are elaborate networks with a large number of parameters, trained with vast amounts of data and often based on self-supervised learning. These models serve as a foundation for a wide range of downstream tasks.

Large Language Models (LLMs) are prominent examples of FMs, based on the Transformer network architecture (Vaswani et al., 2017). Trained to predict the subsequent token in a sequence, LLMs capture implicit and intricate information contained in the data. Moreover, when created using multilingual training data, the models capture linguistic nuances, phonological patterns, and semantic relationships across languages, strengthening their multilingual capabilities. However, understanding how their capabilities generalize across tasks and languages requires a systematic approach to evaluating LLMs.

Given the public access to ChatGPT and GPT-4 (OpenAI, 2023) and the models' abilities to perform diverse tasks, there have been many research initiatives to benchmark their performance on standard NLP tasks (Bubeck et al., 2023; Bang et al., 2023; Ahuja et al., 2023; Hendy et al., 2023). For example, the Holistic Evaluation of Language Models (HELM) project (Liang et al., 2022) pursued a holistic evaluation of LLMs for English in terms of the number of metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) and a comprehensive set of 42 scenarios with 30 prominent language models. BIG-Bench (Srivastava et al., 2022) introduced a large-scale benchmark (214 tasks, including non-English low-resource languages) focusing on the limitations of the current benchmarks. Bang et al. (2023) carried out an extensive ChatGPT evaluation on 8 NLP tasks with 21 datasets in a multi-task, multilingual, and multimodal setup. Ahuja et al. (2023) carried out a multilingual evaluation of GPT2.5 and BLOOMZ, comparing the performance with state-of-the-art (SOTA) models on 8 NLP tasks involving 33 languages. Hendy et al. (2023) evaluated several of OpenAI's GPT models for translation.
For speech, OpenAI's Whisper (Radford et al., 2022), Google's USM (Zhang et al., 2023), and other speech FMs are being explored by the speech community. They are general-purpose speech models with multilingual capabilities, designed for speech recognition (ASR) and other tasks. The benchmarking efforts include the Speech Processing Universal PERformance Benchmark (SUPERB) initiative (Yang et al., 2021), which includes a collection of benchmarking tools, resources, and a leaderboard for 10 tasks from six domains.

In this study, we benchmark FMs for Arabic NLP and Speech processing tasks for different domains and communication channels, focusing on Modern Standard Arabic (MSA) and Dialectal Arabic (DA). We use publicly available datasets and report details of our prompting approach, post-processing of LLMs' responses, and the challenges of each task.

Typically, hundreds or even thousands of annotated domain-specific and task-specific examples are needed to fine-tune LLMs. This may incur significant costs while still having limited generalization capabilities across dialects, domains, and tasks. Thus we selected 33 specific Arabic tasks (with 59 datasets and 96 test setups) to benchmark the performance of ChatGPT (GPT-3.5-turbo) for NLP tasks and the performance of Whisper and USM for Speech processing, in a zero-shot setting. We aim to understand (i) whether FMs can perform tasks without prior task-specific knowledge, (ii) how the performance and prompting strategies vary based on task complexity, from sequence tagging to classification, and (iii) how the performance compares with the current SOTA models.

As we continue research in this area, we present the findings, insights, and the unique contributions of the first benchmarking effort involving Arabic NLP and Arabic Speech processing, across domains (news articles, social media/tweets, meetings, telephony, and broadcast content) and a large number of publicly available datasets (59 datasets with 96 test setups):

Arabic NLP
• The first benchmarking of ChatGPT's zero-shot performance for Arabic NLP tasks of varying task complexities.
• An extensive comparison of ChatGPT's performance for MSA and DA, providing a valuable resource for NLP research with dialectal Arabic texts.

Arabic Speech Processing
• A comprehensive evaluation of Arabic speech recognition (ASR) systems for different domains (broadcast, meeting, and telephony) and dialects, including code-switching (multi-lingual) scenarios.
• The first benchmarking effort of Whisper and USM models for Arabic speech recognition.
• The first reported benchmark for a standard Arabic text-to-speech (TTS) generative model.

The performance of the models is compared to SOTA models for all targeted tasks, providing a strong reference point for future research on these tasks. All resources used in this study will be made publicly available to the community to scale up the effort.1 Our comprehensive benchmarking effort suggests that the LLMs perform worse across different tasks, dialects, and domains when compared with SOTA models in the zero-shot setting. We also observed a gap between the performance on MSA and on the dialectal datasets across different tasks. We hypothesize that MSA data is comparatively better represented in the LLM than the dialects, and there is a chance that the test data may have been ingested by the model during training. We also noticed that LLMs' performance highly depends on prompting strategies and post-processing.

1 http://arabicai.org/

To the best of our knowledge, this study represents the first benchmarking study that investigates ChatGPT and recent large Speech models (e.g., USM) within the Arabic language context. Our evaluation encompasses a diverse array of foundation models, tasks, and datasets, distinguishing it from previous benchmarks such as ORCA (Elmadany et al., 2022), ALUE (Seelawi et al., 2021), ArBERT (Abdul-Mageed et al., 2021), and AraBench (Sajjad et al., 2020). With this study, our aim is to provide valuable insights for the Arabic research community and practitioners, and enable them to make informed decisions regarding the necessity of task-specific adaptations/fine-tuning and dataset enhancements for new tasks. Ultimately, our work contributes to the advancement of the field as a whole.

The rest of the paper is organized as follows. Section 2 gives an overview of related work. In Section 3, we present the tasks and associated datasets. In Section 4, we provide the details of the experiments.
Section 5 presents the results with discussion, and we present the conclusion in Section 7.

2 Related Work

2.1 Models for NLP

Prior to the adoption of self-supervised learning, NLP models required large amounts of annotated data to acquire proficiency in specialized tasks. This presented a considerable limitation since the labeled data required for model training was not easily obtainable. Consequently, NLP models exhibited suboptimal performance as they struggled to generalize and support tasks that deviated from the learning parameters established during training.

Overcoming these limitations, OpenAI proposed a unique method of training a generative language model, called the Generative Pre-trained Transformer, GPT-1 (Radford et al., 2018), utilizing a large unlabeled dataset. The task-agnostic architecture is composed of 12-layer decoder-only transformers with 12 masked self-attention heads of dimension 64. The model is trained to generate text by predicting the next word in a sequence of tokens. GPT-2 (Radford et al., 2019) is a direct scale-up of GPT-1, both in the number of parameters and in pre-training dataset size. The GPT-2 model exhibits the capability for zero-shot task transfer.

Zero-shot learning can be identified as a particular subcategory of zero-shot task transfer, wherein no examples are provided. The model discerns the nature of the task solely based on the provided instruction. Contrary to the approach with GPT-1, which involved sequence rearrangement for fine-tuning, the GPT-2 model was presented with input in a format that necessitated the model to comprehend the task and produce corresponding responses. This approach was adopted to simulate the behavior characteristic of zero-shot task transfer.

GPT-3 is 100 times larger than GPT-2, with double the number of layers (96) and 175 billion parameters. GPT-3 differs from GPT-2 in using alternating dense and locally banded sparse attention. GPT-3 is trained mainly using a filtered version of the Common Crawl dataset. The filtering is based on similarity to high-quality reference corpora. Fuzzy deduplication at the document level is performed to remove redundancy and to avoid contamination of the held-out datasets used for validation and testing. In addition, high-quality datasets such as the WebText dataset and English-language Wikipedia are added to the training data to ensure diversity. In total, about 0.5 trillion byte-pair-encoded tokens are used for training. GPT-3 is trained with the Adam optimizer with global norm clipping and cosine decay for the learning rate. During training, a context window of 2048 tokens is used. For few-shot learning, K examples of context plus the correct answer or completion are given as input. The evaluation is performed on 42 different benchmarks covering natural language inference, reading comprehension, common sense reasoning, and closed-book question answering. The capability of GPT-3 in zero-shot learning is much better than in previous models. Emergent properties are noticeable, such as good generalization to unseen tasks, improved comprehension, and creativity.

ChatGPT is a closed-source model for which details on architecture, training procedures, and datasets are not available. ChatGPT is likely based on a Transformer model with an architecture similar to GPT-3. The multi-lingual capability of ChatGPT indicates that the training dataset is very large, diverse, and contains an extensive portion of conversational data. ChatGPT is aligned to follow human feedback (Ouyang et al., 2022). Specifically, it is fine-tuned using Reinforcement Learning from Human Feedback (RLHF), which has three steps: 1) the pre-trained model is fine-tuned using responses written by humans; 2) user rankings of multiple responses generated by the model are used to train a reward model; 3) Proximal Policy Optimization (PPO) is used to update the model weights based on the reward function.

2.2 Models for Speech Processing

LLMs have consistently showcased impressive capabilities spanning diverse domains and tasks. Nevertheless, they exhibit limitations when tasked with decoding complex audio/speech data or facilitating spoken dialogues. Notable among the challenges is the issue of data procurement: sourcing human-annotated speech data is both resource-intensive and time-consuming. Computational resources present a further obstacle: training multimodal LLMs from the outset is both computationally intensive and time-consuming.

Self-supervised learning has initiated a transformative era for speech processing and helped to address the challenge of scaling speech technologies across a multitude of languages using unlabeled speech data on the internet. Listed below are some notable large speech representation models trained with the self-supervised paradigm:
The wav2vec (Baevski et al., 2020) models use a self-supervised paradigm, leveraging the contrastive predictive coding (CPC) loss function to master speech representations devoid of transcription or segmentation requirements (Baevski et al., 2019). WavLM (Chen et al., 2022), released by Microsoft Research Asia, is a large-scale self-supervised pre-trained model proficient in addressing comprehensive downstream speech tasks such as Automatic Speech Recognition (ASR), Text-To-Speech (TTS), and speaker verification. WavLM jointly learns masked speech prediction and denoising during the pre-training phase, and utilizes gated relative position bias within the Transformer structure to effectively apprehend the sequential order of the input speech.

Whisper (Radford et al., 2022) is a general-purpose model specifically designed for speech recognition in environments characterized by noise interference or low-resource settings, demonstrating competency across a multitude of speech-related tasks. By employing weak supervision and adopting a minimalist approach towards data pre-processing, Whisper achieves superior performance, thus exemplifying the efficacy of deploying sophisticated machine learning methodologies in the sphere of speech processing.

The Universal Speech Model (USM) (Zhang et al., 2023) is a single large model proficient in performing automatic speech recognition (ASR) across a spectrum of over 100 languages. This is accomplished by pre-training the encoder constituent of the model on an expansive, unlabeled multilingual dataset comprising 12 million hours and extending over 300 languages. The model is then fine-tuned on a smaller, labeled dataset. This model demonstrates performance equivalent to, if not exceeding, that of other models for both in-domain and out-of-domain speech recognition tasks across a broad range of languages.

The VALL-E (Wang et al., 2023) model presents a novel approach to text-to-speech synthesis as a zero-shot system. It employs a language modeling approach, treating text-to-speech synthesis as a conditional language modeling task rather than continuous signal regression. This system leverages discrete codes derived from a readily available neural audio codec model and undergoes pre-training on 60,000 hours of English speech data, thus demonstrating robust in-context learning capabilities.

In addition to these representation models, 'AudioGPT' (Huang et al., 2023) was recently introduced. The model is specifically engineered for the understanding and generation of audio modalities within spoken dialogues. Rather than initiating the training of multimodal LLMs from ground zero, the system efficiently harnesses an array of pre-existing audio foundation models to process elements such as speech, music, ambient sound, and talking heads. By amalgamating the strengths of ChatGPT and audio-modality solvers, AudioGPT has demonstrated robust capabilities in managing audio information through four integral stages: modality transformation, task analysis, model assignment, and response generation.

2.3 Prompting

LLMs have shown great capability in solving various sets of language and reasoning tasks. By carefully designing and crafting prompts, it is possible to steer LLMs towards an improved response. Prompt engineering has emerged as the field of developing and optimizing prompts as input for language models. It offers an intuitive and natural interface for humans to interact with LLMs. As models can be sensitive to small modifications of the input, prompt engineering develops tools and methods to identify prompts that are robust and lead to high-performance results.

A prompt can contain one or more of the following elements. Instruction: describes the task and gives the instruction to be performed by the model. Context: gives additional information that can guide the model response. Input: the core question or task that is being solved. Output indicator: guides the model in restricting and formatting its response.

White et al. (2023) have presented a framework for documenting patterns for structuring prompts to solve a range of problems so that they can be adapted to different domains. Prompt engineers bear the responsibility of discerning the contexts that give rise to errors in LLMs. It is incumbent upon them to formulate prompting strategies to surmount these obstacles, and to conduct systematic evaluations of the efficacy of these strategies. Zamfirescu-Pereira et al. (2023) conducted an investigation that yielded findings with substantial implications for the design of LLM-based tools intended for non-AI-expert users.
This work also has repercussions for augmenting LLM-and-prompt literacy among both programming professionals and the wider public, offering fertile ground for future research endeavors. The integration of pre-trained LLMs with prompts has elicited renewed interest in prompt engineering. These prompts facilitate the model in producing desired outputs, thereby stretching the boundaries of achievable conversational User Experience (UX) for non-AI experts.

We mention a few techniques used in prompting. Chain-of-Thought Prompting: by providing examples of intermediate reasoning steps to solve the task, the model's skill in solving the task improves (Wei et al., 2022). By combining chain-of-thought with few-shot prompting, the achieved model performance surpasses fine-tuned LLMs. Automatic Prompting: several approaches have been proposed to automate the selection and design of prompts (Zhou et al., 2022; Shin et al., 2020). These methods define a template and a set of candidate instructions, and optimization approaches are proposed to identify the best prompt for a specific task or across several tasks.
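To make the prompt elements discussed above concrete, the following is a minimal, illustrative sketch of how an Instruction, Context, Input, and Output indicator could be combined into a single zero-shot prompt for an Arabic classification task; the wording, labels, and example tweet are assumptions for illustration, not the exact prompts used in this study.

```python
# Illustrative sketch (not the exact prompts used in this study): composing a
# zero-shot prompt from the four elements discussed above for an Arabic
# sentiment-classification example. Labels and wording are assumptions.

def build_zero_shot_prompt(text: str) -> str:
    instruction = (
        "Classify the sentiment of the following Arabic tweet as one of: "
        "positive, negative, or neutral."
    )
    context = "The tweet may be written in Modern Standard Arabic or a dialect."
    output_indicator = "Answer with only one label and no explanation."
    # Input: the core item the model must act on.
    return f"{instruction}\n{context}\nTweet: {text}\n{output_indicator}"


if __name__ == "__main__":
    print(build_zero_shot_prompt("مباراة رائعة اليوم"))
```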
2.4 Benchmarking Efforts

A benchmark establishes a criterion for the assessment of system performance across diverse tasks. Prior research has concentrated on the development of benchmarks specifically designed for the evaluation of singular tasks. Examples of this can be observed in the fields of sentiment analysis (SA) (Cer et al., 2017b; Farha and Magdy, 2021), named-entity recognition (NER) (Derczynski et al., 2017), part-of-speech tagging (POS) (Gimpel et al., 2011), natural language inference (NLI) (Williams et al., 2017), question answering (QA) (Longpre et al., 2021), and code understanding (Lu et al., 2021). Contemporary benchmarks commonly suggest a representative set of standard tasks for evaluation purposes. A concise comparison of Arabic-centric benchmarks is discussed below, with a particular focus on the diversity of tasks covered.

Prior Benchmarks on Arabic

There are a few existing benchmarks for Arabic NLP tasks that evaluated language models. ORCA (Elmadany et al., 2022) is the largest benchmark, collecting 60 datasets organized in seven task clusters, namely: (1) sentence classification, (2) text classification, (3) structured prediction, (4) semantic similarity, (5) natural language inference, (6) question-answering, and (7) word sense disambiguation. They include 18 multilingual and Arabic language models. This benchmark introduces a public leaderboard with a unified single metric (ORCA score), defined as the macro-average of the different scores across all tasks and task clusters. In the ORCA benchmark, ARBERTv2 achieved the highest score on Arabic tasks.

AraBench (Sajjad et al., 2020) is a dialectal Arabic-to-English machine translation evaluation suite that provides 4 coarse, 15 fine-grained, and 25 city-level dialect categories, belonging to diverse genres such as media, chat, and travel, with different levels of dialectness. Strong baselines using different training settings such as fine-tuning, back-translation, and data augmentation are reported.

The ALUE (Seelawi et al., 2021) benchmark provides 8 curated and previously published tasks in addition to privately held evaluation datasets. The benchmark includes a wide range of tasks such as emotion classification, hate speech, and fine-grained dialect identification. ArabicBERT achieved the best performance on 7 out of the 8 tasks. Other variants of BERT models, along with AraVec and FastText models, are included in the evaluation. ARLUE (Abdul-Mageed et al., 2021) benchmarks multi-dialectal Arabic language understanding evaluation using 42 datasets for six task clusters. The benchmarked models are variants of BERT and XLM. Fine-tuned models using ARLUE achieved the highest performance across all six task clusters.

In contrast to the previous benchmarking efforts, our work focuses on evaluating 32 Arabic tasks over 97 task-variants-datasets across two modalities with recent popular foundation models – Speech (Whisper and USM) and Text (ChatGPT). This work in progress investigates these models' performance on a wide range of domain datasets, covering both MSA and dialectal Arabic content.

3 Tasks and Datasets

In this section, we discuss the tasks and the associated datasets by grouping them based on the ACL-2022 tracks.2 In Tables 1 and 2, we provide a summarized description of the test sets used for evaluating textual and speech processing tasks, respectively.

2 https://www.2022.aclweb.org/callpapers
Dataset Task Domain Size
Word Segmentation, Syntax and Information Extraction
WikiNews Segmentation MSA 400 sentences
Samih et al. (2017) Segmentation Tweets (Dialects: EGY, 350 X 4 dialects
LEV, GLF, MGR)
WikiNews Lemmatization MSA 400 sentences
WikiNews Diacritization MSA 400 sentences
Darwish et al. (2018) Diacritization Dialects (Moroccan, 1,640 verses
Tunisian)
WikiNews POS MSA 400 sentences
Samih et al. (2017) POS Tweets (Dialects: EGY, 350 X 4 dialects
LEV, GLF, MGR)
XGLUE (Arabic) POS Web, Wikipedia 900 sentences
Conll2006 Parsing MSA 146 sentences
QADI Dialect Tweets 3,797
ANERcorp NER Tweets (Dialectal) 924 sentences
AQMAR NER Wikipedia 1,976 sentences
QASR NER Transcript 7,907 segments
Sentiment, Stylistic and Emotion Analysis
ArSAS Sentiment Tweets 4,213
SemEval2018 (Task E-c) Emotion Tweets (Dialectal) 1,518
Unified-FC Stance News articles 3,042 claim-article pairs
Khouja (2020) Stance News articles 379 headline pairs
News Categorization
ASND News Cat. Posts∗ 1,103
SANAD/Akhbarona News Cat. News articles 7,843
SANAD/AlArabiya News Cat. News articles 7,126
SANAD/AlKhaleej News Cat. News articles 4,550
Demographic/Protected Attributes
ASAD Name Info Wikidata 80,131
UL2C Location User loc. (Twitter) 28,317
ArabGend Gender Usernames (Twitter) 1,000
Ethics and NLP: Factuality, Disinformation and Harmful Content Detection
OffensEval 2020 Offensive lang. Tweets (Dialectal) 2,000
OSACT 2020 Hate Speech Tweets (Dialectal) 2,000
ASAD Adult Content Tweets (Dialectal) 10,000
ASAD Spam Tweets (Dialectal) 28,383
In-house Subjectivity News articles 297 sentences
WANLP23 Propaganda Tweets 323
CT–CWT–22 Checkworthiness Tweets (COVID19) 682
CT–CWT–22 Factuality Tweets (COVID19) 996
CT–CWT–22 Claim Tweets (COVID19) 1,248
CT–CWT–22 Harmful content Tweets (COVID19) 1,201
CT–CWT–22 Attention-worthy Tweets (COVID19) 1,186
Unified-FC Factuality News articles 422 claims
Khouja (2020) Claim News articles 456 headlines
Semantics
BTEC Paraphrasing MSA 500 sentences
STS2017.eval.v1.1-Track 1 STS Transcript 250
STS2017.eval.v1.1-Track 2 STS Transcript 250
Mawdoo3 Q2Q STS QS (Q2Q) Questions 3,715 question pairs
XNLI XNLI ANC 5,010
Question Answering (QA)
ARCD QA Wikipedia 702
MLQA QA Wikipedia 5,335
TyDi QA QA Wikipedia 921
XQuAD QA Wikipedia 1,190

Table 1: Summary of test sets and their sizes used in evaluation for the different textual tasks. ANC: American National Corpus. Posts∗: posts from Twitter, YouTube, and Facebook. News Cat.: News Categorization.
3.1 Sequence Tagging/Token classification

3.1.1 Segmentation

Segmentation is an important problem for a language like Arabic, which is rich in bound morphemes that change the tense of verbs, or represent pronouns and prepositions in nouns. It is a building block for NLP tasks such as search, part-of-speech tagging, parsing, and machine translation. The idea is to segment Arabic words into prefixes, stems, and suffixes, which can facilitate many other tasks.

Datasets

WikiNews For modern standard Arabic (MSA), we used the WikiNews dataset of Darwish and Mubarak (2016), which comprises 70 news articles in politics, economics, health, science and technology, sports, arts, and culture. The dataset has 400 sentences (18,271 words) in total.

Tweets For dialectal Arabic, we used the dataset in (Samih et al., 2017), which provides 1,400 tweets in Egyptian, Gulf, Levantine, and Maghrebi dialects, for a total of 25,708 annotated words.

3.1.2 Part-Of-Speech (POS) Tagging

Part-of-speech (POS) tagging is one of the fundamental components in the NLP pipeline. It helps in extracting higher-level information such as named entities, discourse, and syntactic parses.

Datasets

WikiNews For this task we used the WikiNews dataset tagged for POS (Darwish et al., 2017c) for modern standard Arabic.

Tweets For POS tagging with noisy texts and different dialects, we used the same dataset reported in (Samih et al., 2017) (see §3.1.1).

XGLUE We also used the Arabic part of the XGLUE benchmark (Liang et al., 2020) for POS tagging, which uses a subset of the Universal Dependencies Treebanks (v2.5) (Zeman et al., 2020).

3.1.3 Lemmatization

Lemmatization is another component in the NLP pipeline, which reduces words to their base or root form, known as a lemma. It takes into consideration the morphological analysis of the words, which uses the context and POS to convert a word to its simplest form. This task differs from segmentation, which only separates a word stem from prefixes and suffixes. In contrast, lemmatization requires returning the lexicon entry for a certain word, which may depend on POS tagging.

Dataset We used the WikiNews dataset tagged for lemmas (Mubarak, 2017) (see §3.1.1 for the details of the dataset).

3.1.4 Diacritization

Diacritization involves assigning the diacritics to each letter in an Arabic word within a sentence. Diacritical marks indicate the correct pronunciation and meaning of the written Arabic words. For example, different word diacritizations could transform a noun into a verb or vice versa.

Datasets

WikiNews We use a dataset of modern standard Arabic from (Mubarak et al., 2019) that comprises the fully diacritized WikiNews corpus (Darwish et al., 2017b).

Bibles This dataset includes translations of the New Testament into two Maghrebi sub-dialects: Moroccan and Tunisian (Darwish et al., 2018; Abdelali et al., 2019).

3.1.5 Parsing

Dependency parsing is the task of identifying syntactic and grammatical relations among the words in a sentence. These dependencies result in a hierarchical tree representation that captures the structure of the sentence at different levels.

Dataset For this task we used the Arabic part of the CoNLL-X 2006 shared task on dependency parsing (Buchholz and Marsi, 2006), which has 4,990 scoring tokens and uses the Prague Arabic Dependency Treebank (Hajic et al., 2004).

3.1.6 Named-Entity Recognition (NER)

This task involves identifying and classifying the words in a sentence that are proper names, names of places, and entities like organizations or products, amongst other things. This depends on understanding the context and the relations of a word or a collection of words in a sentence, and is key to tasks such as question answering.

Datasets

ANERCorp We used the test corpus of the ANERCorp dataset (Benajiba et al., 2007; Benajiba and Rosso, 2007), which contains 316 articles, 150,286 tokens, and 32,114 types, and classifies words into one of four classes (organization, location, person, and miscellaneous). We used the test split of the dataset for our evaluation.

AQMAR The dataset was developed as an evaluation suite for the named entity recognition task in Arabic (Schneider et al., 2012).
It consists of a collection of 28 Wikipedia articles with 74,000 tokens. We consider the articles corresponding to the test split for our evaluation.

QASR The QASR dataset consists of 70k words extracted from 2,000 hours of transcribed Arabic speech (Mubarak et al., 2021b).

3.1.7 Paraphrasing

This task involves rewriting text using different words and sentence structures while maintaining its original meaning. This is a complex language understanding task that requires the capability to suggest different words or even structures which preserve the intended meaning.

Dataset For this task, we used the modern standard Arabic part of the MADAR corpus of parallel sentences (Bouamor et al., 2018), which has 2,000 translated sentences from the BTEC corpus (Takezawa et al., 2007). We used back-translation with Google MT as SOTA, i.e., translating from Arabic to English and then back to Arabic.

3.2 Machine Translation (MT)

The machine translation evaluation set is a rich set that covers a variety of Arabic in addition to Modern Standard Arabic (MSA). The genres of the evaluation set also cover formal, informal, speech, and other modalities. These types and varieties allowed us to assess the system and reveal its potential and limitations. For this study, we focused on translating Arabic to English and used the datasets discussed below.

Datasets

MADAR Corpus This dataset consists of 2,000 sentences from the BTEC corpus translated to modern standard Arabic and four major dialects from 15 countries (Bouamor et al., 2018).

(Zbib et al., 2012) This dataset is collected from the Arabic-Dialect/English Parallel Text (APT), which consists of 2,000 sentences with 3.5 million tokens of translated dialectal Arabic (Zbib et al., 2012).

Multi-dialectal Parallel Corpus of Arabic (MDC) This dataset also consists of 2,000 sentences in Egyptian, Palestinian, Syrian, Jordanian, and Tunisian dialects and their English counterparts (Bouamor et al., 2014).

The Bible It consists of 8.2k parallel sentences translated into modern standard Arabic, and to Moroccan3 and Tunisian4 dialects (Abdelali et al., 2019).

3 The Morocco Bible Society https://www.biblesociety.ma
4 The United Bible Societies https://www.bible.com

Media Dataset The dataset consists of 7.5 hours of recordings collected from five public broadcasting channels that cover programs with Maghrebi, Lebanese, and Omani dialects, and MSA, with genres involving movies, news reports, and cultural programs. The recordings were transcribed and translated by a professional translation house (Sajjad et al., 2020).

3.3 Dialect Identification

Dialect is defined as the speaker's grammatical, lexical, and phonological variation in pronunciation (Etman and Beex, 2015). Automatic Dialect Identification (ADI) has become an important research area in order to improve certain applications and services, such as ASR and many downstream NLP tasks.

Dataset For this task, we used the QADI dataset containing a wide range of country-level Arabic dialects covering 18 different countries in the Middle East and North Africa region (Abdelali et al., 2020). It consists of 540,590 tweets from 2,525 users.

3.4 Sentiment, Stylistic and Emotion Analysis

3.4.1 Sentiment Analysis

Sentiment analysis has been an active research area and aims to analyze people's sentiment or opinion toward entities such as topics, events, individuals, issues, services, products, organizations, and their attributes (Liu and Zhang, 2012; Zhang et al., 2018). This task involves classifying the content into sentiment labels such as positive, neutral, and negative.

Dataset The ArSAS dataset consists of 21k Arabic tweets covering multiple topics that were collected, prepared, and annotated for six different classes of speech-act labels and four sentiment classes (Elmadany et al., 2018). For the experiments, we used only the sentiment labels from this dataset.

3.4.2 Emotion Recognition
Emotion recognition is the task of categorizing different types of content (e.g., text, speech, and visual) into different emotion labels (the six basic emotions (Ekman, 1971) or more fine-grained categories (Demszky et al., 2020)).

Dataset For the emotion recognition task we used SemEval-2018 Task 1: Affect in Tweets (Mohammad et al., 2018). The task is defined as classifying a tweet as one or more of eleven emotion labels, annotated in a multilabel (presence/absence of 11 emotions) setting.

3.4.3 Stance Detection

Stance is defined as the expression of the speaker's view and judgment toward a given argument or statement (Biber and Finegan, 1988). Given that social media platforms allow users to consume and disseminate information by expressing their views, enabling them to obtain instant feedback and explore others' views, it is important to characterize the stance expressed in a given piece of content. Automatic stance detection also allows for assessing public opinion on social media, particularly on different social and political issues such as abortion, climate change, and feminism, on which people express supportive or opposing opinions (ALDayel and Magdy, 2021; Küçük and Can, 2020). The task involves "classification as the stance of the producer of a piece of text, towards a target as either one of the three classes: {support, against, neither} or {agree, disagree, discuss, or unrelated}" (Küçük and Can, 2020).

Datasets

Unified-FC This dataset consists of claims collected from Verify.sy (false claims) and Reuters (true claims), which resulted in 422 claims. Based on these claims, documents were collected using the Google custom search API and filtered by computing claim-document similarity (Baly et al., 2018b). This approach resulted in 3,042 claim-document pairs, which were then annotated for stance (agree, disagree, discuss, unrelated) using the Appen crowd-sourcing platform.

Khouja (2020) developed a dataset by first sampling news titles from the Arabic News Texts (ANT) corpus (Chouigui et al., 2017) and then generating true and false claims. From these claims, stance (three classes – agree, disagree, other) was annotated from a pair of sentences using Amazon Mechanical Turk and Upwork. The dataset consists of 3,786 claim-reference pairs.

3.5 News Categorization

News text categorization was a popular task in the earlier days of NLP research (Sebastiani, 2002). The idea is to assign a category from C = {c1, ..., cn} to a document in D = {d1, ..., dn}. For news categorization, D is a set of news articles and C is a set of predefined categories. Most often a news article can be categorized into more than one category, and the models are trained in a multilabel setting. While earlier work mostly focused on news articles, lately the task has also been applied to the categorization of tweets in which news articles are shared as part of a tweet.
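As a concrete illustration of the multilabel setting described above, the sketch below binarizes per-document category sets and scores predictions with macro-averaged F1; the category names are placeholders and this is not the exact evaluation script used in this study.

```python
# Illustrative sketch of the multilabel formulation above: each document may
# receive several categories from C, and predictions can be scored with a
# label-wise (macro) F1. Category names are placeholders, not the ASND/SANAD
# label sets.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

C = ["politics", "sports", "health", "business"]           # predefined categories
gold = [{"politics", "business"}, {"sports"}, {"health"}]  # categories per document
pred = [{"politics"}, {"sports"}, {"health", "politics"}]  # model output per document

mlb = MultiLabelBinarizer(classes=C)
y_true = mlb.fit_transform(gold)   # shape: (num_documents, |C|) binary matrix
y_pred = mlb.transform(pred)

print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```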
Datasets

Social Media Posts ASND is a news tweets dataset (Chowdhury et al., 2020b), collected from Aljazeera news channel accounts on Twitter, Facebook, and YouTube. The dataset consists of twelve categories: art-and-entertainment, business-and-economy, crime-war-conflict, education, environment, health, human-rights-press-freedom, politics, science-and-technology, spiritual, sports, and others. We used the test split from each dataset for the evaluation.

Arabic News The SANAD corpus is a large collection of Arabic news articles collected from Akhbarona, AlKhaleej, and AlArabiya (Einea et al., 2019). The dataset has separate collections collected from different news media, each of which has six news categories: culture, finance, medical, politics, sports, and technology.

3.6 Demographic Attributes

Demographic information (e.g., gender, age, country of origin) is useful in many different applications such as understanding population characteristics, personalized advertising, and socio-cultural studies. Demographic information helps governments, businesses, and organizations understand their target audiences and plan accordingly.

3.6.1 Gender

Gender analysis can reveal important differences between male and female users such as topics of interest, gender gap, and preferences.

Dataset We used the ArabGend test set, which contains 1,000 names collected from Twitter (divided equally between males and females) (Mubarak et al., 2022).
3.6.2 Location

Identifying user locations is useful for many applications such as author profiling, dialect identification, and recommendation systems. Often, users on social media platforms, such as Twitter, declare their locations in noisy ways, and mapping these locations to countries is a challenging task.

Dataset We used the UL2C dataset, which contains 28K unique locations, as written by Arabic Twitter users, and their mappings to Arab countries (Mubarak and Hassan, 2021).

3.6.3 Name Info

Names contain important information about our identities and demographic characteristics, including factors like gender, nationality, and ethnicity. The purpose of this task is to predict the country of origin of a person given only their name.

Dataset We used an in-house dataset for mapping person names to world countries extracted from Wikipedia.5

5 Paper is under revision.

3.7 Factuality, Disinformation and Harmful Content Detection

3.7.1 Subjectivity Identification

A sentence is considered subjective when it is based on – or influenced by – personal feelings, tastes, or opinions. Otherwise, the sentence is considered objective (Antici et al., 2021). Given that the identification of subjectivity is subjective itself, it poses challenges in the annotation process. The complexity arises from the annotators' differing levels of expertise, their different interpretations, and their conscious and unconscious biases towards the content they annotate. The content can be text (e.g., a sentence or an article), an image, or multimodal content, consisting of opinionated, factual, or non-factual content. The annotation has typically been done using two labels, objective (OBJ) and subjective (SUBJ).

Dataset The dataset consists of sentences curated from news articles. It has been developed based on the existing AraFacts dataset (Ali et al., 2021), which contains claims verified by Arabic fact-checking websites, where each claim is associated with web pages propagating or negating the claim. The news articles were collected from different news media, automatically parsed, split into sentences, and filtered of poorly-formatted sentences using a rule-based approach. The dataset has been released as a part of Task 2 of the CLEF2023 CheckThat Lab (Barrón-Cedeño et al., 2023).

3.7.2 Propaganda Detection

Propaganda can be defined as a form of communication that aims to influence the opinions or the actions of people towards a specific goal; this is achieved utilizing well-defined rhetorical and psychological devices (Dimitrov et al., 2021). In different communication channels, propaganda (persuasion techniques) is conveyed through the use of diverse techniques (Miller, 1939), which range from leveraging the emotions of the audience, such as using emotional techniques, to logical fallacies such as straw man (misrepresenting someone's opinion), hidden ad-hominem fallacies, and red herring (presenting irrelevant data).

Dataset The dataset used for this study consists of Arabic tweets (Alam et al., 2022b) posted by different news media from Arab countries, such as Al Arabiya and Sky News Arabia from UAE, and Al Jazeera and Al Sharq from Qatar, and from five international Arabic news sources: Al-Hurra News, BBC Arabic, CNN Arabic, France 24, and Russia Today. The final annotated dataset consists of 930 tweets. Alam et al. (2022b) formulated the task as a multilabel and multiclass span-level classification task. For this study, we used the multilabel setup.

3.7.3 Check-worthiness Detection

Fact-checking is a time-consuming and complex process, and it often takes effort to determine whether a claim is important to check, irrespective of its potential to be misleading or not. Check-worthiness detection is the first step and a critical component of fact-checking systems (Nakov et al., 2021), and the aim is to facilitate manual fact-checking efforts by prioritizing the claims for the fact-checkers. Research on check-worthiness includes check-worthiness detection/ranking from political speeches, debates, and social media posts (Nakov et al., 2022a; Shaar et al., 2021). A check-worthy claim is usually defined by its importance to the public and journalists, and whether it can cause harm to an individual, organization, and/or society.

Dataset For this study, we used the Arabic subset of the dataset released with Task 1A (Arabic) of the CLEF2022 CheckThat Lab (Nakov et al., 2022b).
The dataset consists of 4,121 annotated tweets. The Arabic tweets were collected using keywords related to COVID-19, vaccines, and politics.

3.7.4 Claim Detection

Information shared in the mainstream and social media often contains misleading content. Claim detection has become an important problem in order to mitigate misinformation and disinformation in those media channels. A factual (verifiable) claim is a sentence claiming that something is true, and this can be verified using factually verifiable information such as statistics, specific examples, or personal testimony (Konstantinovskiy et al., 2021). Research on claim detection includes social media posts – text modality (Alam et al., 2021b), multimodality (Cheema et al., 2022), and news (Reddy et al., 2022).

Datasets

CT-CWT-22-Claim We used the Arabic subset of the dataset released with Task 1B of the CLEF2022 CheckThat Lab (Nakov et al., 2022a). The dataset has been annotated using a multi-question annotation schema (Alam et al., 2021a) and consists of tweets collected using COVID-19 related keywords. The dataset contains 6,214 tweets (Nakov et al., 2022b).

(Khouja, 2020) This dataset consists of 4,547 true and false claims, which were developed based on the Arabic News Texts (ANT) corpus. A sample of articles was modified to generate true and false claims using crowdsourcing.

3.7.5 Attention-worthiness Detection

On social media, people often tweet by blaming authorities, providing advice, and/or calling for action. It might be important for policy makers to respond to those posts. The purpose of this task is to categorize such information into one of the following categories: not interesting, not sure, harmfulness, other, blames authorities, contains advice, calls for action, discusses action taken, discusses cure, asks a question.

Dataset For this task, we used a subset of the dataset from Task 1D of the CLEF2022 CheckThat Lab (Nakov et al., 2022a), which contains 6,140 annotated tweets.

3.7.6 Factuality Detection

Fact-checking has emerged as an important research topic due to the large amount of fake news, rumors, and conspiracy theories that are spread in different social media channels to manipulate people's opinions or to influence the outcome of major events such as political elections (Darwish et al., 2017a; Baly et al., 2018b). While fact-checking has largely been done manually by fact-checkers for the sake of reliability, this does not scale well with the enormous amount of information shared online every day. Therefore, an automatic fact-checking system is important, and it has been used to facilitate human fact-checkers (Nakov et al., 2021). The task typically involves assessing the level of factual correctness in a news article, media outlet, or social media post. The content is generally judged to be of high, low, or mixed factual correctness, on a seven-point Likert scale6,7 or with binary labels {yes, no} (Baly et al., 2018a; Alam et al., 2021b).

6 https://mediabiasfactcheck.com
7 https://allsides.com

Datasets

News Articles We used the dataset developed by Baly et al. (2018a) in which false claims are extracted from verify-sy8 and true claims are extracted from http://ara.reuters.com. The dataset consists of 3,042 documents.

8 http://www.verify-sy.com

Tweets For claim detection from tweets, we used the same dataset (Alam et al., 2021b) discussed in 3.7.4. As mentioned earlier, this dataset was annotated using a multi-question annotation schema in which one of the questions was "does the tweet appear to contain false information?". The factuality label of the tweet has been defined based on the answer to this question. The Arabic dataset contains a total of 4,966 tweets.

3.7.7 Harmful Content Detection

For harmful content detection we adopted the task proposed in (Alam et al., 2021b; Nakov et al., 2022b), though the research on harmful content detection also includes identifying or detecting offensive, hate-speech, cyberbullying, violence, racist, misogynistic, and sexist content (Sharma et al., 2022; Alam et al., 2022a). Some of those harmful content detection tasks are addressed separately and discussed in the sections below.
Alam et al. (2021b) and Nakov et al. (2022b) proposed the task in the context of tweets, where the idea is to detect whether the content of the tweet aims to, and can, negatively affect society as a whole or specific person(s), company(s), or product(s), or spread rumors about them. Such content intends to harm or weaponize the information9 (Broniatowski et al., 2018).

9 The use of information as a weapon to spread misinformation and mislead people.

Dataset We used the Arabic dataset proposed in (Nakov et al., 2022b), which consists of a total of 6,155 annotated tweets.

3.7.8 Offensive Language Detection

The use of offensive language in social media has become a major problem, which can lead to real-world violence (Husain and Uzuner, 2021; Sap et al., 2019). The literature on offensive language detection has mainly focused on social media content, addressing a variety of languages. The task is mainly defined as determining whether the content (e.g., text, image, or multimodal) is offensive or not (Chowdhury et al., 2020c).

Dataset For this task, we used the dataset from SemEval-2020 Task 12 (OffensEval 2020) (Zampieri et al., 2020), which consists of 10,000 tweets, selected from a set of 660k Arabic tweets containing the vocative particle ("yA" – O) collected from April 15 to May 6, 2019.

3.7.9 Hate Speech Detection

Davidson et al. (2017) defined hate speech as "language that is used to express hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group". The literature on hate speech detection defines the task as detecting hate vs. non-hate in different types of content such as text, images, and multimodal content (Schmidt and Wiegand, 2017; Kiela et al., 2020; Gomez et al., 2020).

Dataset For this task, we used the OSACT4 dataset (Mubarak et al., 2020b), which consists of 10,000 tweets annotated with the labels hate-speech and not-hate-speech.

3.7.10 Adult Content Detection

Identifying this type of content is important for social media platforms to make them a safe place for users. In particular, this type of content poses a serious threat to vulnerable groups (e.g., younger age groups). The task typically involves detecting whether the textual content contains sensitive/adult content or whether an account shares such content.

Dataset We used the dataset discussed in (Mubarak et al., 2021a), which contains 10,000 tweets collected by first identifying Twitter accounts that post adult content. Tweets are manually annotated as adult and not-adult.

3.7.11 Spam Detection

Spam content in social media includes ads, malicious content, and any low-quality content (Ghanem et al., 2023). Spam detection is another important problem, as such content may often annoy and mislead users (Gao et al., 2012).

Dataset We used the dataset discussed in (Mubarak et al., 2020a) for Arabic spam detection, which contains 28K tweets manually labeled as spam and not-spam.

3.8 Semantic textual similarity

3.8.1 Textual Similarity

Semantic textual similarity is a measure used to determine if two sentences are semantically equivalent. The task involves generating numerical similarity scores for pairs of sentences, with performance evaluated based on the Pearson correlation between machine-generated scores and human judgments (Cer et al., 2017a). Two tasks were conducted to gauge the similarity between 250 pairs of Arabic sentences, as well as Arabic-English sentence pairs.

Dataset We used the SemEval-2017 Task 1 (Track 1: ar-ar and Track 2: ar-en) dataset (Cer et al., 2017a), which is a translated version (machine translation followed by post-editing by humans) of the SNLI dataset (Bowman et al., 2015).
Dataset For this task, we also used the OSACT4 3.8.2 Semantic Question Similarity
dataset (Mubarak et al., 2020b), which consists The idea of this task is to determine how similar
of 10,000 tweets with annotated label hate-speech, two questions are in terms of their meaning.
not-hate-speech.
Dataset We used Mawdoo3 Q2Q dataset
3.7.10 Adult Content Detection (NSURL-2019 task 8: Semantic question simi-
Identifying this type of content is important for larity in Arabic), which consists of 15,712 anno-
social media platforms to make a safe place for tated pairs of questions. Each pair is labeled as
users. Especially this type of content poses a seri- no semantic similarity (0) or semantically simi-
ous threat to other vulnerable groups (e.g., younger lar(1) (Seelawi et al., 2019).
age groups). The task typically involves detecting
and identifying whether the textual content con- 3.9 Natural Language Inference (NLI)
tains sensitive/adult content or account that share The XNLI task, known as Cross-lingual Natural
such content. Language Inference (Conneau et al., 2018), is a
9
The use of information as a weapon to spread misinfor- widely used benchmark in the field of natural lan-
mation and mislead people. guage processing (NLP). It involves determining
It involves determining the logical relationship between pairs of sentences written in different languages. Specifically, the task requires NLP models to determine whether a given hypothesis sentence is entailed, contradicted, or neutral in relation to a given premise sentence, across multiple languages. The XNLI task serves as a rigorous evaluation of the cross-lingual transfer capabilities of NLP models, assessing their ability to understand and reason in different languages within a multilingual context.

Dataset The dataset we used for this study is the Arabic translated version of the XNLI corpus (Conneau et al., 2018). For the annotation, 250 English sentences were selected from ten different sources, and the annotators were asked to produce three hypotheses per premise sentence. The resulting premises and hypotheses were then translated into 15 languages, and we used the Arabic version for this study.

3.10 Question Answering (QA)

This task involves answering questions in Arabic based on a given text.10 For this task, we use four different datasets consisting of (passage, question, answer) pairs.

10 This task is also referred to as machine reading comprehension, where the model is tested on its ability to extract answers from the given text.

Datasets

ARCD consists of 1,395 Arabic MSA questions posed by crowd-sourced workers along with text segments from Arabic Wikipedia. We use the test set only for our evaluation. The test set consists of 78 articles, 234 paragraphs, and 702 questions (Mozannar et al., 2019).

MLQA comprises multilingual question-answer instances in 7 languages: English, Arabic, Simplified Chinese, Hindi, German, Vietnamese, and Spanish. We used the Arabic QA pairs from this dataset, which consist of 2,389 articles, 4,646 paragraphs, and 5,335 questions (Lewis et al., 2019).

TyDi QA comprises 11 languages with 204K question-answer pairs. We used the data provided for the Gold Passage task, in which a passage that contains the answer is provided and the task is to predict the span that contains the answer. We used the Arabic split of the data, which contains 921 articles, 921 paragraphs, and 921 questions (Artetxe et al., 2019).

XQuAD comprises 240 paragraphs and 1,190 question-answer pairs from the development set of SQuAD v1.1 with their professional translations into ten languages: Hindi, Turkish, Arabic, Vietnamese, Thai, German, Greek, Russian, Spanish, and Chinese. We use the Arabic split of the data, which consists of 48 articles, 240 paragraphs, and 1,190 questions (Artetxe et al., 2019). We used the SQuAD version of all datasets along with the official SQuAD evaluation script.
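For reference on the scoring step, the sketch below computes the standard SQuAD-style exact-match and F1 scores with the Hugging Face evaluate package, which provides the same metrics as the official SQuAD evaluation script; the example IDs and texts are placeholders, and the actual runs may have used the original script directly.

```python
# Hedged sketch: scoring QA predictions with SQuAD-style exact match and F1
# via the Hugging Face `evaluate` package (same metrics as the official SQuAD
# script). IDs and texts below are placeholders.
import evaluate

squad_metric = evaluate.load("squad")

predictions = [{"id": "q1", "prediction_text": "Doha"}]
references = [{
    "id": "q1",
    "answers": {"text": ["Doha"], "answer_start": [42]},
}]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```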
3.11 Speech Processing

For this study, we address the speech modalities in the context of large foundation models, and we evaluate the following two tasks in this edition: (i) automatic speech recognition (ASR) and (ii) text-to-speech (TTS) models. In the future, we will scale the speech benchmark with speech translation (ST) and spoken Arabic dialect identification (ADI).

3.11.1 Speech Recognition

The primary objective of an ASR system is to transform spoken language into written text. The task itself is challenging due to the variability in human speech, which can be affected by factors such as accent, speaking style, code-switching, environmental factors like channels, and background noise, among others. Furthermore, the presence of language-related challenges, including complex morphology, unstandardized orthography, and a wide array of dialects used as a primary mode of communication, adds a layer of complexity to the task. Therefore, to properly benchmark Arabic ASR, we covered a wide range of domains encapsulating different speaking styles, dialects, and environments. For our study, we considered broadcast news, telephony, and meeting data for MSA, Egyptian, Moroccan Arabic, etc., in both monolingual and code-switching setups.
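For orientation, the following is a minimal sketch of how a single Arabic test recording could be transcribed with an off-the-shelf Whisper checkpoint and scored with word error rate (WER); the file path and reference transcript are placeholders, and the text normalization applied before scoring in the actual benchmark is not shown.

```python
# Hedged sketch (not the exact benchmark pipeline): transcribe one Arabic
# recording with openai-whisper and score it with WER using jiwer.
# "sample.wav" and the reference transcript are placeholders.
import whisper
import jiwer

model = whisper.load_model("medium")                    # one of the sizes evaluated: small/medium/large
result = model.transcribe("sample.wav", language="ar")  # force Arabic decoding
hypothesis = result["text"]

reference = "النص المرجعي للتسجيل"                       # gold transcript (placeholder)
print("WER:", jiwer.wer(reference, hypothesis))
```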
Datasets

MGB2 consists of 9.57 hours of multi-dialect speech data that was collected from Aljazeera TV programs and manually transcribed. The data consists of a mix of Modern Standard Arabic (MSA) and various dialects, including Egyptian, Levantine, Gulf, and North African (Ali et al., 2016).11

11 https://arabicspeech.org/mgb2

MGB3 is a collection of 5.78 hours of multi-genre speech data in Egyptian dialect. The data was collected from YouTube videos and manually transcribed (Ali et al., 2017).12
Dataset Task Domain Size
MGB2 ASR Broadcast (MSA) 9.57 hrs
MGB3 ASR Broadcast (EGY) 5.78 hrs
MGB5 ASR Broadcast (MOR) 1.40 hrs
QASR.CS ASR Broadcast (Mixed) → Code-switching 5.90 hrs
DACS ASR Broadcast (MSA-EGY) → Code-switching 1.50 hrs
ESCWA.CS ASR Meeting (Mixed DA - ENG) → Code-switching 2.80 hrs
CallHome ASR Telephony (EGY) 20 phone conversations
In-house TTS Mixed Topics (education, health, etc) 20 sentences
Table 2: Summary of the test sets and their sizes used in the evaluation of the speech processing tasks.
MGB5 is a collection of 1.4 hours of speech data in Moroccan dialect. The data was collected from YouTube videos and manually transcribed (Ali et al., 2019).13

ESCWA.CS is a 2.8-hour code-switching speech corpus collected over two days of meetings of the United Nations Economic and Social Commission for West Asia (ESCWA) in 2019 (Chowdhury et al., 2021).14

QASR.CS is a collection of 5.9 hours of code-switching speech extracted from the Arabic broadcast news data (QASR) to test the systems on code-switching. The dataset also includes some instances where the switch is between Arabic and French; however, such instances are very rare (Mubarak et al., 2021b).15

DACS is a collection of ≈1.5 hours of broadcast speech designed to evaluate the performance of ASR for code-switching between MSA and Egyptian dialect and vice versa (Chowdhury et al., 2020a).16

CallHome Egyptian is a speech corpus of telephone conversations between native speakers of Egyptian Arabic. It consists of 20 unscripted telephone conversations, each of which lasts between 5 and 30 minutes (Kumar et al., 2014).17

12 https://arabicspeech.org/mgb3
13 https://arabicspeech.org/mgb5
14 https://arabicspeech.org/escwa
15 https://arabicspeech.org/qasr
16 https://github.com/qcri/Arabic_speech_code_switching
17 https://catalog.ldc.upenn.edu/LDC97S45

3.11.2 Text to Speech

Speech synthesis, a.k.a. text-to-speech (TTS), helps users consume written output more easily and, in some cases, faster. Most state-of-the-art end-to-end TTS systems comprise three modules: a text front-end, an acoustic model, and a vocoder; however, there is ongoing research on combining the acoustic model and vocoder into a single neural network. The text front-end module normalizes the input text by converting digits, symbols, abbreviations, and acronyms into full words, and by processing words with special sounds, borrowed words, etc. This task is challenging in Arabic due to the missing diacritics in modern texts, as explained in Section 3.1.4. Therefore, the Arabic front-end of the TTS system is responsible for restoring the missing diacritics and for text normalization.

Dataset For MSA TTS, we create the first public test dataset, which comprises 20 sentences covering different topics such as psychology, education, and health. The average length of each sentence is 8 words. This data is used for both objective and subjective evaluation of Arabic TTS.

4 Methodology

For the purpose of benchmarking the Arabic tasks, we opt for zero-shot learning for both NLP and speech tasks. We benchmark this variety of tasks by leveraging ChatGPT (GPT-3.5-Turbo) for NLP, and Whisper (small, medium, and large), USM, and Amazon Polly for speech, and we compare their performance with the respective state-of-the-art models.

4.1 Model for NLP Tasks

In the zero-shot setting, the model – ChatGPT – is only given a natural language instruction describing the task and is asked to produce the expected output. The goal is to allow the LLM to build a context that helps narrow the inference space and produce more accurate output. For each task, we explored a number of prompts guided by the same instruction and format as recommended in the Azure OpenAI Studio Chat playground. After obtaining the best prompt, we used it to complete the evaluation using the OpenAI API from Azure Cognitive Services.
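Concretely, the evaluation loop for an NLP task reduces to sending the fixed task instruction together with each input text to the chat completion endpoint and collecting the raw response for later post-processing (Section 4.3). The sketch below illustrates this setup; the endpoint, deployment name, and the sentiment instruction shown here are placeholders rather than the exact configuration used in this study.

```python
import os
import openai

# Assumed Azure OpenAI configuration; endpoint and key come from the environment.
openai.api_type = "azure"
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
openai.api_key = os.environ["AZURE_OPENAI_KEY"]
openai.api_version = "2023-05-15"

PROMPT = ("Classify the sentiment of the following Arabic tweet as "
          "positive, negative, neutral, or mixed. Return only the label.")

def zero_shot_label(text: str) -> str:
    # One chat-completion call per test instance, with no in-context examples (zero-shot).
    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo",  # Azure deployment name (placeholder)
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"].strip()
```

Keeping the temperature low makes the responses easier to map onto the closed label set during post-processing.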
4.2 Models for Speech Tasks

Similar to the NLP tasks, we benchmark the large speech models in zero-shot settings. For the speech recognition task, we explore three different OpenAI Whisper models – small, medium, and large – along with Google's USM model (see Table 3). We compare these large models with KANARI's18 supervised state-of-the-art conformer-based offline ASR and RNN-T based streaming ASR.19 For the TTS task, we compare the state of the art in two public systems: the Amazon Polly TTS engine20 and the KANARI TTS system.21

18 https://fenek.ai/
19 https://arabicasr.kanari.ai/
20 https://aws.amazon.com/polly/
21 https://arabictts.kanari.ai/

Model Layers Width Heads Parameters
W.Small 12 768 12 244M
W.Medium 24 1024 16 769M
W.Large-v2 32 1280 20 1550M
USM 32 1526 16 2B

Table 3: Model parameters and architecture of the large pretrained ASR models. W. stands for OpenAI's Whisper (Radford et al., 2022) and USM is the Universal Speech Model from Google (Zhang et al., 2023).
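For the Whisper family, zero-shot evaluation amounts to loading a pretrained checkpoint and decoding each test recording with the language fixed to Arabic, without any Arabic-specific fine-tuning. The sketch below uses the open-source whisper package; the audio file path is a placeholder, and the post-processing applied to the hypotheses is described in Section 4.3.

```python
import whisper

# Load one of the evaluated checkpoints: "small", "medium", or "large-v2".
model = whisper.load_model("large-v2")

# Decode a single test recording; forcing the language avoids misdetection
# on short or code-switched segments (placeholder file path).
result = model.transcribe("mgb2_segment_0001.wav", language="ar", temperature=0.0)
print(result["text"])
```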
tag of the prediction i.e., B-PER predicts as PER-B,
which needed remapping the NER tags.
4.3 Prompts and Post Processing
Prompts design is the major challenge for zero-
shot learning setup and depending on the type of For speech recognition task, post-processing is
tasks such as token classification vs. sentence clas- a crucial component. Traditionally ASR is evalu-
sification the complexities varies. Designing the ated based on word error rate (WER) – an edit dis-
appropriate prompt ensure an accurate output. In tance based metric. The measure aligns the model’s
Appendix A.1, we provided prompts for the differ- output with reference transcription and penalizes
ent tasks. based on insertion, deletion, and substitution er-
For example, for the segmentation task, some of rors. Hence, the measure is unable to disambiguate
the output was not segmented based on linguistic some code-switching errors introduced by the mul-
information but rather more Byte-Pair Encoding tilingual writing along with other minor formatting
(BPE) like encoding. Based on that prompt is fur- differences. This high penalizing characteristics of
ther redesigned, which resulted in a better outcome. WER particularly poses a challenge in zero-shot
For factuality, disinformation, and harmful con- settings, where the model does not observe any par-
tent detection tasks, the challenges were different ticular in-domain/task data formatting beforehand.
than other tasks. One notable example is the pro- Therefore, to minimize the challenge we opt for
paganda detection task. The task requires deter- text-standardization by normalizing ‘alif’, ‘ya’ and
mining whether a text snippet contains propagan- ta-marbuta’. Moreover, to support multi-script ren-
distic language, and if it does, the model should dering, we created a simple Global Mapping File
detect which propaganda technique is used from a (GLM) to transliterate common recognized outputs.
pre-defined list of techniques. Even with our best To reduce the risk of overfiting the post-processing
efforts to design the prompt for this task, the model to the model’s transcription style, we adapted mini-
18
malist GLM (used in (Chowdhury et al., 2021)) and
https://fenek.ai/
19
https://arabicasr.kanari.ai/
normalization pipeline and applied it to all models.
20
https://aws.amazon.com/polly/ We designed it based on common confusion and
21
https://arabictts.kanari.ai/ English-Arabic transliteration pairs.
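To make this pipeline concrete, the sketch below shows the kind of character normalization, GLM lookup, and WER computation we are describing. The specific character mappings and the toy GLM entries are illustrative assumptions rather than the exact tables used in the benchmark, and the WER routine is a plain edit-distance implementation.

```python
import re

# Illustrative normalization: unify alif variants, alif maqsura/ya, and ta-marbuta/ha.
ALIF_VARIANTS = "\u0622\u0623\u0625"  # آ أ إ  -> ا
NORMALIZE = {**{c: "\u0627" for c in ALIF_VARIANTS}, "\u0649": "\u064A", "\u0629": "\u0647"}

# Toy GLM: map frequent Latin-script tokens to a single Arabic rendering (assumed entries).
GLM = {"ok": "اوكي", "corona": "كورونا"}

def normalize_text(text: str) -> str:
    text = "".join(NORMALIZE.get(ch, ch) for ch in text)
    tokens = [GLM.get(tok.lower(), tok) for tok in text.split()]
    return re.sub(r"\s+", " ", " ".join(tokens)).strip()

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Both reference and hypothesis go through the same normalization before scoring.
ref = normalize_text("ذهب الولد إلى المدرسة")
hyp = normalize_text("ذهب الولد الي المدرسه")
print(wer(ref, hyp))  # 0.0 after normalization, instead of being penalized twice
```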
4.4 Evaluation Metric

To measure the performance on each task, we follow the current state-of-the-art references and use the metric reported in the respective work. These range from accuracy (ACC), F1 (macro, micro, and weighted), word error rate (WER), character error rate (CER), and diacritization error rate (DER) to the mean opinion score (MOS) on naturalness, intelligibility, and vowelization for subjective evaluation.

5 Results and Discussion

In Tables 4, 5, 6, and 7, we report the results for the different NLP and speech related tasks. In the sections below, we summarize the results and challenges specific to each task group. The last column, ∆, represents the difference between SOTA and zero-shot performance.

5.1 Word Segmentation, Syntax and Information Extraction

In the first part of Table 4, we report results for the token classification (sequence tagging) tasks. For almost all tasks, the performance is below the SOTA. The prompt design and the post-processing were more challenging for these tasks. For the segmentation task, we observe that rare words are broken down into two or more subword tokens: for several example words, the expected clitic-based segmentation was not produced, and the system instead split the words at positions that are not linguistically accurate.

For the NER task, we noticed that the model misses tokens in some instances and tends to predict extra tokens in others. Such errors lead to misalignment between the inputs and the outputs, which affects the metric calculation. We deal with this issue by either truncating the prediction or padding it with the O class, depending on the length of the ground truth. We have similar observations for lemmatization and parsing.

For MT, the results in Table 5 indicate the shortcomings of these large models when tested on standard and dialectal Arabic. From the reported scores, we notice that ChatGPT is outperformed by the SOTA techniques. When investigated further, we observed that ChatGPT is penalized most of the time for inserting additional content (shown in blue in the example below) in its responses. This is often seen in the MSA MT test sets, especially in the Bible evaluation set, due to their availability on the web. For example,

Input: من هو الأعظم في ملكوت السماوات؟
Output: Who is the greatest in the kingdom of heaven? Who Is the Most Important Person in the Kingdom?

Input: الزواج والطلاق
Output: Marriage and Divorce, Jesus Teaches About Divorce

Such behavior indicates that the test data is contaminated, as the model might have ingested it during training. Furthermore, the findings from the zero-shot setting show that ChatGPT obtains better BLEU scores on the Gulf and Egyptian dialects than on the overall MSA sets. This behavior can be attributed to the lack of dialectal representation in the LLM, which prevents it from hallucinating extra content. For the Media genre, it is clear that conversational content is much harder to translate in general, and even more so for dialectal content.

5.2 Sentiment, Stylistic and Emotion Analysis

In the second group of Table 4, we report results for sentiment, emotion, and stance. The datasets for these tasks are tweet classification tasks. We observe that performance is below SOTA by margins between 19% and 58%. For these types of tasks, the model often provided additional text along with the label, for example, "Sentiment: Positive (because of the laughing emoji)". Providing the reason for the class label is in fact useful, but post-processing was needed in such cases to match the gold label.

5.3 News Categorization

For the news categorization experiments, we used four different datasets consisting of news articles in a multiclass classification setting. Across all datasets, zero-shot performance is lower than the current SOTA. As can be seen in Table 4, the gap varies significantly, ranging from 5% to 25%. As with the other tasks, we needed to post-process the output labels, as the API returned additional tokens. In many cases, we observed that the API returned the message "content is in Arabic" without providing any label. We also observed that it returns additional labels, which may be expected, as a news article may contain information representing multiple labels.
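The label-matching step mentioned above is mechanical but essential for a fair comparison. The following sketch shows the kind of post-processing we are describing: mapping a verbose free-text response onto the closed label set, and repairing and padding a sequence-tagging prediction to the gold length. The mapping tables are illustrative, not the full set used in the benchmark.

```python
import re

# Illustrative mapping from surface forms in the model response to gold labels.
SENTIMENT_MAP = {"positive": "POS", "negative": "NEG", "neutral": "NEU", "mixed": "MIX"}

def map_classification_output(response: str) -> str:
    # "Sentiment: Positive (because of the laughing emoji)" -> "POS"
    lowered = response.lower()
    for surface, gold in SENTIMENT_MAP.items():
        if re.search(rf"\b{surface}\b", lowered):
            return gold
    return "UNKNOWN"  # kept for error analysis; counted as incorrect during scoring

def align_tag_sequence(predicted: list[str], gold_length: int) -> list[str]:
    # Fix swapped tags such as "PER-B" -> "B-PER", then pad/truncate to the gold length.
    fixed = []
    for tag in predicted:
        parts = tag.split("-")
        if len(parts) == 2 and parts[0] in {"PER", "LOC", "ORG", "MISC"}:
            tag = f"{parts[1]}-{parts[0]}"
        fixed.append(tag)
    fixed = fixed[:gold_length]
    return fixed + ["O"] * (gold_length - len(fixed))

print(map_classification_output("Sentiment: Positive (because of the laughing emoji)"))
print(align_tag_sequence(["PER-B", "PER-I", "O"], 4))
```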
Task Dataset Metric Zero-shot SOTA ∆
Word Segmentation, Syntax and Information Extraction
Segmentation Samih et al. (2017) Acc (Avg) 0.688 0.931 0.243
Lemmatization WikiNews Acc 0.530 0.973 0.443
Diacritization WikiNews WER 0.308 0.045 -0.263
Diacritization Darwish et al. (2018) WER 0.367 0.031 -0.336
POS WikiNews Acc 0.810 0.953 0.143
POS Samih et al. (2017) Acc 0.379 0.892 0.513
POS XGLUE (Arabic) Acc 0.520 0.686 0.166
Parsing Conll2006 UAS 0.239 0.796 0.557
Dialect QADI Macro-F1 0.070 0.600 0.530
NER ANERcorp Macro-F1 0.185 0.550 0.365
NER AQMAR Macro F1 0.180 0.690 0.510
NER QASR Macro-F1 0.102 0.637 0.535
Sentiment, Stylistic and Emotion Analysis
Sentiment ArSAS Macro-F1 0.550 0.760 0.210
Emotion SemEval2018 (Task E-c) JS 0.395 0.541 0.146
Stance Unified-FC Macro-F1 0.232 0.558 0.326
Stance Khouja (2020) Macro-F1 0.620 0.767 0.147
News Categorization
News (Tweets) ASND Macro-F1 0.512 0.770 0.258
News articles SANAD/Akhbarona Acc 0.730 0.940 0.210
News articles SANAD/AlArabiya Acc 0.922 0.974 0.052
News articles SANAD/AlKhaleej Acc 0.864 0.969 0.105
Demographic/Protected Attributes
Name Info ASAD Weighted-F1 0.570 0.530 -0.040
Location UL2C Macro-F1 0.339 0.881 0.542
Gender ArabGend Macro-F1 0.390 0.820 0.430
Ethics and NLP: Factuality, Disinformation and Harmful Content Detection
Offensive lang. OffensEval 2020 Macro-F1 0.460 0.910 0.450
Hate Speech OSACT 2020 Macro-F1 0.430 0.820 0.390
Adult Content ASAD Macro-F1 0.460 0.880 0.420
Spam ASAD Macro-F1 0.440 0.989 0.549
Subjectivity In-house Macro-F1 0.670 0.730 0.060
Propaganda WANLP23 Micro-F1 0.353 0.649 0.296
Checkworthiness CT–CWT–22 F1 (POS) 0.526 0.628 0.102
Factuality CT–CWT–22 Weighted-F1 0.103 0.831 0.728
Claim CT–CWT–22 Acc 0.703 0.570 -0.133
Harmful content CT–CWT–22 F1 (POS) 0.471 0.557 0.086
Attention-worthy CT–CWT–22 Weighted-F1 0.258 0.206 -0.052
Factuality Unified-FC Macro-F1 0.306 - -
Claim Khouja (2020) Macro-F1 0.036 0.643 0.607
Semantics
Paraphrasing BTEC Fluency 0.946 0.972 0.026
Faithfulness 0.835 0.916 0.081
STS STS2017.eval.v1.1-Track 1 PC 0.789 0.744 -0.045
STS STS2017.eval.v1.1-Track 2 PC 0.808 0.749 -0.059
STS QS (Q2Q) Mawdoo3 Q2Q Micro-F1 0.895 0.959 0.064
XNLI (Arabic) XNLI Acc 0.489 0.648 0.159
Question answering (QA)
QA ARCD F1 0.502 0.501 -0.001
QA MLQA F1 0.376 0.584 0.208
QA TyDi QA F1 0.480 0.820 0.340
QA XQuAD F1 0.442 0.648 0.206

Table 4: Results on different tasks and datasets using zero-shot prompts. QS: Question similarity, PC: Pearson
Correlation, Conv. Text: Conversational text; JS: Jaccard Similarity. ∆ column shows the performance difference
between SOTA and ChatGPT.
Corpus Dia. SC City #Sent Zero-shot SOTA ∆
APT LEV lv - 1000 3.13 21.90 18.77
APT Nile eg - 1000 3.64 22.60 18.96
MADAR Gulf iq Baghdad 2000 27.60 29.10 1.50
MADAR Gulf iq Basra 2000 27.75 29.00 1.25
MADAR Gulf iq Mosul 2000 27.28 31.30 4.02
MADAR Gulf om Muscat 2000 34.29 39.50 5.21
MADAR Gulf qa Doha 2000 26.92 29.30 2.38
MADAR Gulf sa Jeddah 2000 27.66 29.40 1.74
MADAR Gulf sa Riyadh 2000 35.84 40.70 4.86
MADAR Gulf ye Sana’a 2000 27.12 31.40 4.28
MADAR LEV jo Amman 2000 22.79 35.10 12.31
MADAR LEV jo Salt 2000 27.43 34.90 7.47
MADAR LEV lb Beirut 2000 16.97 23.70 6.73
MADAR LEV ps Jerusalem 2000 28.24 33.60 5.36
MADAR LEV sy Aleppo 2000 27.31 34.30 6.99
MADAR LEV sy Damascus 2000 25.34 33.10 7.76
MADAR MGR dz Algiers 2000 16.89 21.30 4.41
MADAR MGR ly Benghazi 2000 28.26 32.00 3.74
MADAR MGR ly Tripoli 2000 23.21 25.90 2.69
MADAR MGR ma Fes 2000 25.38 29.90 4.52
MADAR MGR ma Rabat 2000 17.85 23.10 5.25
MADAR MGR tn Sfax 2000 13.41 13.80 0.39
MADAR MGR tn Tunis 2000 10.39 16.00 5.61
MADAR MSA ms - 2000 26.69 43.40 16.71
MADAR Nile eg Alexandria 2000 34.23 38.30 4.07
MADAR Nile eg Aswan 2000 24.06 30.40 6.34
MADAR Nile eg Cairo 2000 26.82 32.90 6.08
MADAR Nile sd Khartoum 2000 32.62 39.00 6.38
MDC LEV jo - 1000 3.35 17.70 14.35
MDC LEV ps - 1000 3.03 15.30 12.27
MDC LEV sy - 1000 3.28 19.90 16.62
MDC MGR tn - 1000 2.54 13.90 11.36
MDC MSA ms - 1000 4.88 20.40 15.52
Media Gulf om - 467 4.52 19.60 15.08
Media LEV lb - 250 3.58 16.80 13.22
Media MGR ma - 526 2.45 9.60 7.15
Media MSA ms - 637 9.26 29.70 20.44
Media MSA ms - 621 8.94 35.60 26.66
Bible MGR ma - 600 5.06 28.80 23.74
Bible MGR tn - 600 6.86 29.20 22.34
Bible MSA ms - 600 9.27 33.20 23.93
Bible MSA ms - 600 8.08 29.20 21.12
Table 5: Results (BLEU score) on machine translation for the different datasets using zero-shot prompts. The best result per row is boldfaced. The ∆ column shows the performance difference between SOTA and ChatGPT.
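The BLEU scores in Table 5 are corpus-level scores computed over the system outputs returned by the zero-shot prompts. A minimal sketch of this scoring with the sacrebleu package is shown below, assuming the hypotheses and references have already been loaded as parallel lists of sentences (the example strings are placeholders).

```python
import sacrebleu

# Hypotheses produced by the model and the corresponding reference translations
# (placeholder lists; in practice these are read from the MADAR/MDC/Media/Bible test files).
hypotheses = ["the boy went to school", "marriage and divorce"]
references = ["the boy went to school", "marriage and divorce"]

# sacrebleu expects a list of reference streams (one stream per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```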

5.4 Demographic/Protected Attributes

In these tasks, the model was asked to predict the country of origin for person names extracted from Wikipedia, map user locations (extracted from Twitter) to one of the Arab countries, and predict the gender of person names (extracted from Twitter). From Table 4, we observe that the model struggles with user-generated content from Twitter as opposed to the formal data from Wikipedia. In a few cases, the model provides messages indicating that it is unable to produce the expected output, e.g., "It is not possible to determine the gender of a person based solely on their name".

In location prediction, although the prompt asked the model to give only a single country code in ISO 3166-1 alpha-2 format without explanation, in many cases the model generated outputs in many formats with additional country names, e.g., "bahrain (bh)", "البحرين:bh", "muscat - om", "dz (algeria)", "others (palestine)", and "unk", which required post-processing code to standardize its output.

5.5 Ethics and NLP: Factuality, Disinformation and Harmful Content Detection

Our results in Table 4 show that ChatGPT generally struggled with the tasks under this category, with its lowest performance being for the claim factuality detection task in the zero-shot setup. This is expected given that, in the majority of the tasks, the model is operating over tweets, which are very short, usually informal, and often dialectal in the Arab world. The tasks themselves are generally challenging, requiring deep contextual analysis, reasoning abilities, and, in many cases, domain knowledge. For instance, determining a claim's veracity is a very intensive process that usually requires reasoning over information from multiple sources and modalities (e.g., text and audio), with some sources not even available online for models to access and use (Nakov et al., 2021; Das et al., 2023) (e.g., witness testimonies to an event). Although for this task we prompted the model to return Yes/No predictions of a claim's truthfulness, sometimes it explicitly expressed its shortcoming in predicting for such a complex task, responding with statements like: "not enough context provided to determine if the information is correct".

Issues to consider while handling ChatGPT's responses were not limited to parsing the responses for the sensitive category of tasks we are working with. Some of our tasks inherently require the model to operate over offensive language, profanities, adult content, etc. Such language generally goes against OpenAI's content management policy22 followed by ChatGPT. In many instances, ChatGPT raised an error regarding the type of language used in the text we were sending its way and did not return a prediction. This raises a question of how developers can employ such models over user-generated content that is expected to contain "unacceptable" language.

During our experiments, it was interesting to see ChatGPT failing to provide predictions in several cases, specifically mentioning that "the text is in arabic and i am not programmed to". Such instances demonstrate the need for a deeper understanding of the model's abilities on lower-resource languages like Arabic. It further presents an opportunity to study ways to improve the training of such LLMs for Arabic.

5.6 Semantics

The results for the different semantic tasks, reported in the second-to-last part of Table 4, show that the performance (Pearson correlation) for STS (tracks 1 and 2) is higher than SOTA. The performance for the paraphrasing and XNLI tasks is lower.

5.7 Question Answering (QA)

In the last part of Table 4, we report the QA results on four different tasks. For ARCD, the model achieved a score higher than that of SOTA by a small margin. However, for the other QA tasks under study, the model did not perform well.

5.8 Speech Recognition and Synthesis

In Table 6, we report the performance of ASR using the different datasets and models. We observe that Google's Universal Speech Model (USM) outperforms OpenAI's Whisper on all the datasets. The USM model also performs comparably with the standard task- and domain-specific ASR systems and is better equipped to handle cross-language and dialectal code-switching data from unseen domains compared to the SOTA models. It should be noted that the reported results, for both the USM and Whisper models, can be further improved with better model-specific post-processing to reduce the penalization of non-semantic differences between the reference and the hypothesis.

As for the text-to-speech task, we evaluated the transformer-based models on 20 test sentences, using both subjective and objective evaluation metrics. Three native speakers evaluated each sentence on a 5-point scale: 5 for excellent, 4 for good, 3 for fair, 2 for poor, and 1 for bad. We normalized the results to scores out of 10, as shown in Table 7. From the objective evaluation, we notice that Amazon Polly is significantly better in WER and CER; however, the human evaluators preferred the KANARI models for better diacritization. For the rest, both models performed comparably. In the future, we plan to increase the number of sentences to improve coverage and to consider other available TTS systems, such as Google TTS and ReadSpeaker.

6 Findings and Limitations

6.1 Findings

Our experimental results suggest a big gap between the performance of the LLM (ChatGPT) and the SOTA in zero-shot settings for most of the tasks. We observed only a handful of tasks outperforming SOTA in this challenging setting. Moreover, the LLM performance varies significantly between the MSA and dialectal test sets. For example, POS accuracy is 0.810 versus 0.379 on MSA and dialects, respectively, which indicates a large gap in LLM capabilities for low-resource languages/dialects.

22 https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/content-filter
Dataset | Domain (SR) / Dialect | Zero-Shot WER (W.Small / W.Medium / W.Large-v2 / USM) | SOTA Supervised WER (K.Offline / K.Streaming)
MGB2 | Broadcast (16kHz) / MSA | 46.70 / 33.00 / 26.20 / 15.70 | 11.40 / 11.90
MGB3 | Broadcast (16kHz) / EGY | 83.20 / 65.90 / 55.60 / 22.10 | 21.40 / 26.70
MGB5 | Broadcast (16kHz) / MOR | 135.20 / 116.90 / 89.40 / 51.20 | 44.10 / 49.20
QASR.CS | Broadcast (16kHz) / Mixed | 63.60 / 48.90 / 37.90 / 27.80 | 23.40 / 24.90
DACS | Broadcast (16kHz) / MSA-EGY | 61.90 / 48.70 / 34.20 / 14.30 | 15.90 / 21.30
ESCWA.CS | Meeting (16kHz) / Mixed | 101.50 / 69.30 / 60.00 / 45.70 | 49.80 / 48.00
CallHome | Telephony (8kHz) / EGY | 155.90 / 113.70 / 78.70 / 54.20 | 45.8* / 50.90

Table 6: Word Error Rate (WER) on ASR for different domain and dialect datasets in the zero-shot setup and the domain-specific supervised setup. W. stands for the OpenAI Whisper models, USM is the Universal Speech Model from Google, and K. stands for the KANARI models. * indicates that the model's input is at an 8kHz sampling rate and that the offline model was re-trained to accommodate telephony data. SOTA Supervised refers to fully supervised models trained with domain-specific data. Bold marks the best overall result and italics the best zero-shot result.
Model | Subjective (Diacritization / Naturalness / Intelligibility) | Objective (WER / CER)
Amazon | 8.2 / 8.3 / 9.8 | 19.1 / 4.4
KANARI | 9.5 / 8.6 / 9.8 | 30.1 / 7.2

Table 7: Subjective and objective evaluation of Arabic TTS. For the subjective evaluation, we report scores out of 10, where higher is better. For the objective evaluation, we report word and character error rates, where lower is better. Only significant differences in results are bolded.

This performance gap can also be attributed to the lack of dialectal representation in the model and to data contamination. As observed in the machine translation task on the Bible data, the results indicate that the GPT model hallucinates and inserts additional content in its responses, as the test data has already been ingested by the model during training. We cannot be sure whether the model has already been exposed to the benchmarking datasets, and we will explore this further in the future.

A similar pattern is noticed in the speech models: both Whisper (and its variants) and USM perform comparably with SOTA for MSA while showing a huge performance gap on dialects. We noticed that these large models also recognize the Egyptian dialect much better than the other dialects in zero-shot inference. Our observations suggest that these large models are better equipped to handle code-switching phenomena in spoken utterances than the supervised large transformer models.

The performance of the model is highly dependent on the prompting strategy. Designing the best prompt for each task is challenging and required several iterations. In many tasks, the output was not consistent across all instances of the datasets. For example, in many cases the model provides the desired labels; however, there are cases where the model outputs different kinds of error messages: (i) that it is trained only on English and cannot handle Arabic texts, (ii) that the response was filtered because the prompt triggered Azure OpenAI's content management policy, or (iii) it provides extra tokens or swapped tags (B-PER predicted as PER-B). These cases resulted in an extra layer of post-processing and filtering of the evaluation dataset.
Post-processing was needed for almost all tasks in order to match the gold labels; this includes reformatting the output, handling exceptions, and dealing with missing and unexpected values. Much like for the NLP tasks, post-processing the transcription output of the speech models is an important step. We noticed that the performance of the Whisper models is highly dependent on the post-processing. As these models (the Whisper family) are trained on a massive dataset created by weak supervision, the output is quite noisy and needs extra care in post-processing. In this study, we opted for a simple post-processing pipeline so that the process is not overfitted to task-specific data styles.

6.2 Limitations

The main focus of this study was to benchmark large language models for Arabic NLP and speech tasks. Given that this is work in progress, there are currently several limitations. In this edition, we managed to use a handful of large models: ChatGPT, USM, and the Whisper models, and we compared them to SOTA. Although a comparison to SOTA is a necessary and novel step, we will enrich our study by adding other models such as GPT-4, BARD, MMS, and other open models (e.g., BLOOM). We aimed to benchmark many tasks and datasets; in this work, we benchmarked 59 datasets with 96 test setups for 33 tasks. However, a limitation is that we did not benchmark all the available datasets; for example, the study reported in (Elmadany et al., 2022) benchmarked 19 sentiment datasets, whereas we only covered one. It is also possible that we missed many other Arabic NLP and speech tasks, which we will attempt to cover in the future. Our current results are also limited to zero-shot learning, for which performance highly depends on the prompt design, and which requires significant prompt engineering effort.

7 Conclusion and Future Studies

This study is the first large-scale benchmark that brings together both Arabic speech and NLP tasks in the same study. We report the performance of LLMs for a variety of tasks covering different domains and dialects. Our study also considers tasks with a wide range of complexity, ranging from token to text classification and across different application settings, from NER to sentiment, factuality and disinformation, ASR, and TTS, among others. We evaluate 33 tasks and 59 datasets with 96 test setups, which are very prominent for Arabic AI. We compare and report the performance on each task and dataset against SOTA, which will enable the community and practitioners of large language models to decide on their use of these models.

While this is a work in progress, we foresee that future work can include investigating few-shot learning with other open and closed models to evaluate their performance. As for the evaluation metrics, we only computed the ones reported in the SOTA works, which is limited. Future work should include other dimensions for evaluating LLMs, such as robustness, interpretability, bias, and toxicity.

References

Ahmed Abdelali, Mohammed Attia, Younes Samih, Kareem Darwish, and Hamdy Mubarak. 2019. Diacritization of Maghrebi Arabic sub-dialects. arXiv preprint arXiv:1810.06619.

Ahmed Abdelali, Hamdy Mubarak, Younes Samih, Sabit Hassan, and Kareem Darwish. 2020. Arabic dialect identification in the wild. arXiv preprint arXiv:2005.06557.

Muhammad Abdul-Mageed, AbdelRahim Elmadany, et al. 2021. ARBERT & MARBERT: Deep bidirectional transformers for Arabic. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7088–7105.

Kabir Ahuja, Rishav Hada, Millicent Ochieng, Prachi Jain, Harshita Diddee, Samuel Maina, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, et al. 2023. MEGA: Multilingual evaluation of generative AI. arXiv preprint arXiv:2303.12528.

Firoj Alam, Stefano Cresci, Tanmoy Chakraborty, Fabrizio Silvestri, Dimiter Dimitrov, Giovanni Da San Martino, Shaden Shaar, Hamed Firooz, and Preslav Nakov. 2022a. A survey on multimodal disinformation detection. In Proceedings of the 29th International Conference on Computational Linguistics, COLING '22, pages 6625–6643, Gyeongju, Republic of Korea.

Firoj Alam, Fahim Dalvi, Shaden Shaar, Nadir Durrani, Hamdy Mubarak, Alex Nikolov, Giovanni Da San Martino, Ahmed Abdelali, Hassan Sajjad, Kareem Darwish, and Preslav Nakov. 2021a. Fighting the COVID-19 infodemic in social media: A holistic perspective and a call to arms. In Proceedings of the International AAAI Conference on Web and Social Media, ICWSM '21, pages 913–922.

Firoj Alam, Hamdy Mubarak, Wajdi Zaghouani, Giovanni Da San Martino, and Preslav Nakov. 2022b. Overview of the WANLP 2022 shared task on propaganda detection in Arabic. arXiv preprint arXiv:2211.10057.
Firoj Alam, Shaden Shaar, Fahim Dalvi, Hassan Saj- sources. In Proceedings of the 2018 Conference on
jad, Alex Nikolov, Hamdy Mubarak, Giovanni Empirical Methods in Natural Language Processing,
Da San Martino, Ahmed Abdelali, Nadir Durrani, pages 3528–3539, Brussels, Belgium. Association
Kareem Darwish, Abdulaziz Al-Homaid, Wajdi Za- for Computational Linguistics.
ghouani, Tommaso Caselli, Gijs Danoe, Friso Stolk, Ramy Baly, Mitra Mohtarami, James Glass, Lluís
Britt Bruntink, and Preslav Nakov. 2021b. Fighting Màrquez, Alessandro Moschitti, and Preslav Nakov.
the COVID-19 infodemic: Modeling the perspective 2018b. Integrating stance detection and fact check-
of journalists, fact-checkers, social media platforms, ing in a unified corpus. In Proceedings of the 2018
policy makers, and the society. In Findings of the Conference of the North American Chapter of the
Association for Computational Linguistics: EMNLP Association for Computational Linguistics: Human
2021, pages 611–649, Punta Cana, Dominican Re- Language Technologies, Volume 2 (Short Papers).
public. Association for Computational Linguistics.
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wen-
Abeer ALDayel and Walid Magdy. 2021. Stance detec-
liang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei
tion on social media: State of the art and trends. In-
Ji, Tiezheng Yu, Willy Chung, et al. 2023. A mul-
formation Processing & Management, 58(4):102597.
titask, multilingual, multimodal evaluation of Chat-
Ahmed Ali, Peter Bell, James Glass, Yacine Messaoui, GPT on reasoning, hallucination, and interactivity.
Hamdy Mubarak, Steve Renals, and Yifan Zhang. arXiv preprint arXiv:2302.04023.
2016. The MGB-2 challenge: Arabic multi-dialect
broadcast media recognition. In 2016 IEEE Spoken Alberto Barrón-Cedeño, Firoj Alam, Tommaso Caselli,
Language Technology Workshop (SLT), pages 279– Giovanni Da San Martino, Tamer Elsayed, An-
284. IEEE. drea Galassi, Fatima Haouari, Federico Ruggeri, Ju-
lia Maria Struß, Rabindra Nath Nandi, et al. 2023.
Ahmed Ali, Suwon Shon, Younes Samih, Hamdy The clef-2023 checkthat! lab: Checkworthiness, sub-
Mubarak, Ahmed Abdelali, James Glass, Steve Re- jectivity, political bias, factuality, and authority. In
nals, and Khalid Choukri. 2019. The MGB-5 chal- Advances in Information Retrieval: 45th European
lenge: Recognition and dialect identification of di- Conference on Information Retrieval, ECIR 2023,
alectal Arabic speech. In 2019 IEEE Automatic Dublin, Ireland, April 2–6, 2023, Proceedings, Part
Speech Recognition and Understanding Workshop III, pages 506–517. Springer.
(ASRU), pages 1026–1033. IEEE.
Yassine Benajiba and Paolo Rosso. 2007. ANERsys 2.0:
Ahmed Ali, Stephan Vogel, and Steve Renals. 2017.
Conquering the NER task for the Arabic language
Speech recognition challenge in the wild: Arabic
by combining the maximum entropy with pos-tag
MGB-3. In 2017 IEEE Automatic Speech Recog-
information. In IICAI, pages 1814–1823.
nition and Understanding Workshop (ASRU), pages
316–322. IEEE. Yassine Benajiba, Paolo Rosso, and José Miguel
Zien Sheikh Ali, Watheq Mansour, Tamer Elsayed, and BenedíRuiz. 2007. ANERsys: An arabic named en-
Abdulaziz Al-Ali. 2021. Arafacts: the first large Ara- tity recognition system based on maximum entropy.
bic dataset of naturally occurring claims. In Proceed- In Computational Linguistics and Intelligent Text Pro-
ings of the Sixth Arabic Natural Language Processing cessing, pages 143–153, Berlin, Heidelberg. Springer
Workshop, pages 231–236. Berlin Heidelberg.
Francesco Antici, Luca Bolognini, Matteo Antonio In- Douglas Biber and Edward Finegan. 1988. Adverbial
ajetovic, Bogdan Ivasiuk, Andrea Galassi, and Fed- stance types in english. Discourse processes, 11(1):1–
erico Ruggeri. 2021. Subjectivita: An italian corpus 34.
for subjectivity detection in newspapers. In Experi- Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ
mental IR Meets Multilinguality, Multimodality, and Altman, Simran Arora, Sydney von Arx, Michael S.
Interaction, pages 40–52, Cham. Springer Interna- Bernstein, Jeannette Bohg, Antoine Bosselut, Emma
tional Publishing. Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas
Mikel Artetxe, Sebastian Ruder, and Dani Yo- Card, Rodrigo Castellon, Niladri Chatterji, Annie
gatama. 2019. On the cross-lingual transferabil- Chen, Kathleen Creel, Jared Quincy Davis, Dora
ity of monolingual representations. arXiv preprint Demszky, Chris Donahue, Moussa Doumbouya,
arXiv:1910.11856. Esin Durmus, Stefano Ermon, John Etchemendy,
Alexei Baevski, Steffen Schneider, and Michael Auli. Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor
2019. vq-wav2vec: Self-supervised learning of Gale, Lauren Gillespie, Karan Goel, Noah Goodman,
discrete speech representations. arXiv preprint Shelby Grossman, Neel Guha, Tatsunori Hashimoto,
arXiv:1910.05453. Peter Henderson, John Hewitt, Daniel E. Ho, Jenny
Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth
and Michael Auli. 2020. wav2vec 2.0: A framework Karamcheti, Geoff Keeling, Fereshte Khani, Omar
for self-supervised learning of speech representations. Khattab, Pang Wei Koh, Mark Krass, Ranjay Kr-
Advances in neural information processing systems, ishna, Rohith Kuditipudi, Ananya Kumar, Faisal Lad-
33:12449–12460. hak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle
Ramy Baly, Georgi Karadzhov, Dimitar Alexandrov, Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma,
James Glass, and Preslav Nakov. 2018a. Predict- Ali Malik, Christopher D. Manning, Suvir Mirchan-
ing factuality of reporting and bias of news media dani, Eric Mitchell, Zanele Munyikwa, Suraj Nair,
Avanika Narayan, Deepak Narayanan, Ben Newman, the 11th International Workshop on Semantic Evalu-
Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, ation (SemEval-2017), SemEval-2017, pages 1–14,
Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Pa- Vancouver, Canada.
padimitriou, Joon Sung Park, Chris Piech, Eva Porte-
Gullal Singh Cheema, Sherzod Hakimov, Abdul Sittar,
lance, Christopher Potts, Aditi Raghunathan, Rob
Eric Müller-Budack, Christian Otto, and Ralph Ew-
Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani,
erth. 2022. MM-claims: A dataset for multimodal
Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa
claim detection in social media. In Findings of the
Sadigh, Shiori Sagawa, Keshav Santhanam, Andy
Association for Computational Linguistics: NAACL
Shih, Krishnan Srinivasan, Alex Tamkin, Rohan
2022, pages 962–979, Seattle, United States. Associ-
Taori, Armin W. Thomas, Florian Tramèr, Rose E.
ation for Computational Linguistics.
Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai
Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan Sanyuan Chen, Chengyi Wang, Zhengyang Chen,
You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki
Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Kanda, Takuya Yoshioka, Xiong Xiao, et al. 2022.
Zhou, and Percy Liang. 2022. On the opportunities Wavlm: Large-scale self-supervised pre-training for
and risks of foundation models. full stack speech processing. IEEE Journal of Se-
lected Topics in Signal Processing, 16(6):1505–1518.
Houda Bouamor, Nizar Habash, and Kemal Oflazer.
2014. A multidialectal parallel corpus of Arabic. In Amina Chouigui, Oussama Ben Khiroun, and Bilel
LREC, pages 1240–1245. Elayeb. 2017. ANT corpus: An Arabic news text col-
lection for textual classification. In 2017 IEEE/ACS
Houda Bouamor, Nizar Habash, Mohammad Salameh,
14th International Conference on Computer Systems
Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim,
and Applications (AICCSA), pages 135–142. IEEE.
Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexan-
der Erdmann, et al. 2018. The MADAR Arabic Di- Shammur A Chowdhury, Younes Samih, Mohamed El-
alect Corpus and Lexicon. In LREC. desouki, and Ahmed Ali. 2020a. Effects of dialectal
code-switching on speech modules: A study using
Samuel R. Bowman, Gabor Angeli, Christopher Potts,
egyptian Arabic broadcast speech. Proc. Interspeech.
and Christopher D. Manning. 2015. A large anno-
tated corpus for learning natural language inference. Shammur Absar Chowdhury, Ahmed Abdelali, Ka-
In Proceedings of the 2015 Conference on Empiri- reem Darwish, Jung Soon-Gyo, Joni Salminen, and
cal Methods in Natural Language Processing, pages Bernard J Jansen. 2020b. Improving Arabic text cat-
632–642, Lisbon, Portugal. Association for Compu- egorization using transformer training diversification.
tational Linguistics. In Proceedings of the fifth arabic natural language
processing workshop, pages 226–236.
David A Broniatowski, Amelia M Jamison, SiHua Qi,
Lulwah AlKulaib, Tao Chen, Adrian Benton, San- Shammur Absar Chowdhury, Amir Hussein, Ahmed
dra C Quinn, and Mark Dredze. 2018. Weaponized Abdelali, and Ahmed Ali. 2021. Towards one
health communication: Twitter bots and Russian model to rule all: Multilingual strategy for di-
trolls amplify the vaccine debate. American jour- alectal code-switching Arabic asr. arXiv preprint
nal of public health, 108(10):1378–1384. arXiv:2105.14779.
Sébastien Bubeck, Varun Chandrasekaran, Ronen El- Shammur Absar Chowdhury, Hamdy Mubarak, Ahmed
dan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Pe- Abdelali, Soon-gyo Jung, Bernard J Jansen, and Joni
ter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Salminen. 2020c. A multi-platform arabic news com-
Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, ment dataset for offensive language detection. In
and Yi Zhang. 2023. Sparks of artificial general in- Proceedings of the Twelfth Language Resources and
telligence: Early experiments with GPT-4. Technical Evaluation Conference, pages 6203–6212.
report, Microsoft Research.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina
Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X Williams, Samuel Bowman, Holger Schwenk, and
shared task on multilingual dependency parsing. In Veselin Stoyanov. 2018. XNLI: Evaluating cross-
Proceedings of the tenth conference on computational lingual sentence representations. In Proceedings of
natural language learning (CoNLL-X), pages 149– the 2018 Conference on Empirical Methods in Natu-
164. ral Language Processing, EMNLP ’18, pages 2475–
2485.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-
Gazpio, and Lucia Specia. 2017a. SemEval-2017 Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak,
task 1: Semantic textual similarity multilingual and Younes Samih, and Mohammed Attia. 2018. Diacriti-
crosslingual focused evaluation. In Proceedings zation of moroccan and tunisian Arabic dialects: A
of the 11th International Workshop on Semantic crf approach. OSACT, 3:62.
Evaluation (SemEval-2017), pages 1–14, Vancouver,
Kareem Darwish, Dimitar Alexandrov, Preslav Nakov,
Canada. Association for Computational Linguistics.
and Yelena Mejova. 2017a. Seminar users in the
Daniel Cer, Mona Diab, Eneko E. Agirre, Iñigo Lopez- Arabic twitter sphere. In Social Informatics: 9th
Gazpio, and Lucia Specia. 2017b. SemEval-2017 International Conference, SocInfo 2017, Oxford, UK,
Task 1: Semantic Textual Similarity Multilingual and September 13-15, 2017, Proceedings, Part I 9, pages
Cross-lingual Focused Evaluation. In Proceedings of 91–108. Springer.
Kareem Darwish and Hamdy Mubarak. 2016. Farasa: A. Etman and A. A. Louis Beex. 2015. Language and di-
A new fast and accurate Arabic word segmenter. In alect identification: A survey. In 2015 SAI Intelligent
Proceedings of the Tenth International Conference Systems Conference (IntelliSys), pages 220–231.
on Language Resources and Evaluation (LREC’16), Ibrahim Abu Farha and Walid Magdy. 2021. Bench-
pages 1070–1074. marking transformer-based language models for Ara-
Kareem Darwish, Hamdy Mubarak, and Ahmed Abde- bic sentiment and sarcasm detection. In Proceedings
lali. 2017b. Arabic diacritization: Stats, rules, and of the sixth Arabic natural language processing work-
hacks. In Proceedings of the Third Arabic Natural shop, pages 21–31.
Language Processing Workshop, pages 9–17, Valen-
Hongyu Gao, Yan Chen, Kathy Lee, Diana Palsetia, and
cia, Spain. Association for Computational Linguis-
Alok N. Choudhary. 2012. Towards online spam fil-
tics.
tering in social networks. In Network and Distributed
Kareem Darwish, Hamdy Mubarak, Ahmed Abdelali, System Security Symposium, NDSS ’12, pages 1–16.
and Mohamed Eldesouki. 2017c. Arabic pos tag-
ging: Don’t abandon feature engineering just yet. Razan Ghanem, Hasan Erbay, and Khaled Bakour. 2023.
In Proceedings of the third arabic natural language Contents-based spam detection on social networks
processing workshop, pages 130–137. using roberta embedding and stacked blstm. SN Com-
puter Science, 4(4):380.
Anubrata Das, Houjiang Liu, Venelin Kovatchev, and
Matthew Lease. 2023. The state of human-centered Kevin Gimpel, Nathan Schneider, Brendan O’connor,
nlp technology for fact-checking. Information Pro- Dipanjan Das, Daniel P Mills, Jacob Eisenstein,
cessing & Management, 60(2):103219. Michael Heilman, Dani Yogatama, Jeffrey Flanigan,
and Noah A Smith. 2011. Part-of-speech tagging
Thomas Davidson, Dana Warmsley, Michael Macy, and for twitter: Annotation, features, and experiments.
Ingmar Weber. 2017. Automated hate speech detec- In Proceedings of the 49th annual meeting of the
tion and the problem of offensive language. Proceed- Association for Computational Linguistics: Human
ings of the International AAAI Conference on Web Language Technologies, pages 42–47.
and Social Media, 11(1):512–515.
Raul Gomez, Jaume Gibert, Lluis Gomez, and Dimos-
Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo thenis Karatzas. 2020. Exploring hate speech detec-
Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. tion in multimodal publications. In Proceedings of
2020. GoEmotions: A dataset of fine-grained emo- the IEEE/CVF winter conference on applications of
tions. In Proceedings of the 58th Annual Meeting of computer vision, pages 1470–1478.
the Association for Computational Linguistics, pages
4040–4054, Online. Association for Computational Jan Hajic, Otakar Smrz, Petr Zemánek, Jan Šnaidauf,
Linguistics. and Emanuel Beška. 2004. Prague Arabic depen-
Leon Derczynski, Eric Nichols, Marieke Van Erp, and dency treebank: Development in data and tools. In
Nut Limsopatham. 2017. Results of the wnut2017 Proc. of the NEMLAR Intern. Conf. on Arabic Lan-
shared task on novel and emerging entity recognition. guage Resources and Tools, volume 1.
In Proceedings of the 3rd Workshop on Noisy User- Amr Hendy, Mohamed Abdelrehim, Amr Sharaf,
generated Text, pages 140–147. Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita,
Dimitar Dimitrov, Bishr Bin Ali, Shaden Shaar, Firoj Young Jin Kim, Mohamed Afify, and Hany Hassan
Alam, Fabrizio Silvestri, Hamed Firooz, Preslav Awadalla. 2023. How good are GPT models at ma-
Nakov, and Giovanni Da San Martino. 2021. Detect- chine translation? a comprehensive evaluation. arXiv
ing propaganda techniques in memes. In Proceed- preprint arXiv:2302.09210.
ings of the 59th Annual Meeting of the Association for Rongjie Huang, Mingze Li, Dongchao Yang, Jia-
Computational Linguistics and the 11th International tong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu,
Joint Conference on Natural Language Processing Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. 2023.
(Volume 1: Long Papers), pages 6603–6617, Online. Audiogpt: Understanding and generating speech,
Association for Computational Linguistics. music, sound, and talking head. arXiv preprint
Omar Einea, Ashraf Elnagar, and Ridhwan Al Debsi. arXiv:2304.12995.
2019. Sanad: Single-label Arabic news articles Fatemah Husain and Ozlem Uzuner. 2021. A survey of
dataset for automatic text categorization. Data in offensive language detection for the Arabic language.
brief, 25:104076. ACM Transactions on Asian and Low-Resource Lan-
Paul Ekman. 1971. Universals and cultural differences guage Information Processing (TALLIP), 20(1):1–44.
in facial expressions of emotion. In Nebraska sympo- Jude Khouja. 2020. Stance prediction and claim veri-
sium on motivation. University of Nebraska Press. fication: An Arabic perspective. In Proceedings of
AbdelRahim Elmadany, El Moatez Billah Nagoudi, and the Third Workshop on Fact Extraction and VERifica-
Muhammad Abdul-Mageed. 2022. Orca: A challeng- tion (FEVER), pages 8–17, Online. Association for
ing benchmark for Arabic language understanding. Computational Linguistics.
arXiv preprint arXiv:2212.10758. Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj
AbdelRahim A. Elmadany, Hamdy Mubarak, and Walid Goswami, Amanpreet Singh, Pratik Ringshia, and
Magdy. 2018. ArSAS: An Arabic speech-act and Davide Testuggine. 2020. The hateful memes chal-
sentiment corpus of tweets. OSACT, 3:20. lenge: Detecting hate speech in multimodal memes.
In Advances in Neural Information Processing Sys- Hamdy Mubarak. 2017. Build fast and accu-
tems, volume 33, pages 2611–2624. rate lemmatization for Arabic. arXiv preprint
arXiv:1710.06700.
Lev Konstantinovskiy, Oliver Price, Mevan Babakar,
and Arkaitz Zubiaga. 2021. Toward automated Hamdy Mubarak, Ahmed Abdelali, Sabit Hassan, and
factchecking: Developing an annotation schema and Kareem Darwish. 2020a. Spam detection on Arabic
benchmark for consistent automated claim detection. twitter. In Social Informatics: 12th International
Digital Threats: Research and Practice, 2(2). Conference, SocInfo 2020, Pisa, Italy, October 6–9,
2020, Proceedings 12, pages 237–251. Springer.
Dilek Küçük and Fazli Can. 2020. Stance detection: A
survey. ACM Computing Surveys (CSUR), 53(1):1– Hamdy Mubarak, Ahmed Abdelali, Hassan Sajjad,
37. Younes Samih, and Kareem Darwish. 2019. Highly
effective Arabic diacritization using sequence to se-
Gaurav Kumar, Yuan Cao, Ryan Cotterell, Chris quence modeling. In Proceedings of the 2019 Con-
Callison-Burch, Daniel Povey, and Sanjeev Khudan- ference of the North American Chapter of the Asso-
pur. 2014. Translations of the callhome Egyptian ciation for Computational Linguistics: Human Lan-
Arabic corpus for conversational speech translation. guage Technologies, Volume 1 (Long and Short Pa-
In Proceedings of the 11th International Workshop pers), pages 2390–2395.
on Spoken Language Translation: Papers, pages 244–
Hamdy Mubarak, Shammur Absar Chowdhury, and
248, Lake Tahoe, California.
Firoj Alam. 2022. Arabgend: Gender analysis
Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian and inference on Arabic twitter. arXiv preprint
Riedel, and Holger Schwenk. 2019. Mlqa: Eval- arXiv:2203.00271.
uating cross-lingual extractive question answering. Hamdy Mubarak, Kareem Darwish, Walid Magdy,
arXiv preprint arXiv:1910.07475. Tamer Elsayed, and Hend Al-Khalifa. 2020b.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Overview of osact4 Arabic offensive language detec-
Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian tion shared task. In Proceedings of the 4th Workshop
Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Ku- on open-source arabic corpora and processing tools,
mar, et al. 2022. Holistic evaluation of language with a shared task on offensive language detection,
models. arXiv preprint arXiv:2211.09110. pages 48–52.
Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Hamdy Mubarak and Sabit Hassan. 2021. Ul2c: Map-
Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin ping user locations to countries on Arabic twitter. In
Jiang, Guihong Cao, et al. 2020. Xglue: A new Proceedings of the Sixth Arabic Natural Language
benchmark datasetfor cross-lingual pre-training, un- Processing Workshop, pages 145–153.
derstanding and generation. In Proceedings of the Hamdy Mubarak, Sabit Hassan, and Ahmed Abdelali.
2020 Conference on Empirical Methods in Natural 2021a. Adult content detection on Arabic twitter:
Language Processing (EMNLP), pages 6008–6018. Analysis and experiments. In Proceedings of the
Sixth Arabic Natural Language Processing Workshop,
Bing Liu and Lei Zhang. 2012. A survey of opinion
pages 136–144.
mining and sentiment analysis. In Mining text data,
pages 415–463. Springer. Hamdy Mubarak, Amir Hussein, Shammur Absar
Chowdhury, and Ahmed Ali. 2021b. QASR:
Shayne Longpre, Yi Lu, and Joachim Daiber. 2021. QCRI Aljazeera speech resource–a large scale an-
MKQA: A linguistically diverse benchmark for mul- notated Arabic speech corpus. arXiv preprint
tilingual open domain question answering. Transac- arXiv:2106.13000.
tions of the Association for Computational Linguis-
tics, 9:1389–1406. Preslav Nakov, Alberto Barrón-Cedeño, Giovanni
Da San Martino, Firoj Alam, Rubén Míguez, Tom-
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey maso Caselli, Mucahid Kutlu, Wajdi Zaghouani,
Svyatkovskiy, Ambrosio Blanco, Colin Clement, Chengkai Li, Shaden Shaar, Hamdy Mubarak, Alex
Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Nikolov, Yavuz Selim Kartal, and Javier Beltrán.
Codexglue: A machine learning benchmark dataset 2022a. Overview of the CLEF-2022 CheckThat!
for code understanding and generation. arXiv lab task 1 on identifying relevant claims in tweets. In
preprint arXiv:2102.04664. Working Notes of CLEF 2022—Conference and Labs
Clyde R. Miller. 1939. The Techniques of Propaganda. of the Evaluation Forum, CLEF ’2022.
From “How to Detect and Analyze Propaganda,” an Preslav Nakov, Alberto Barrón-Cedeño, Giovanni
address given at Town Hall. The Center for learning. da San Martino, Firoj Alam, Julia Maria Struß,
Thomas Mandl, Rubén Míguez, Tommaso Caselli,
Saif Mohammad, Felipe Bravo-Marquez, Mohammad
Mucahid Kutlu, Wajdi Zaghouani, et al. 2022b.
Salameh, and Svetlana Kiritchenko. 2018. Semeval-
Overview of the clef–2022 checkthat! lab on fighting
2018 task 1: Affect in tweets. In Proceedings of the
Appendix

A Prompts

In the sections below, we report the prompts we used for the different tasks.
A.1 Word Segmentation, Syntax and Information Extraction

Segmentation

A word can be composed of one root and one or multiple affixes, segment the following sentence into its morphological constituents: {inputSentence}. The output should be in the format: [{word: {prefix}(0,n)+root+{suffix}(0,n)}, ...]
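To make the use of these templates concrete, the following is a minimal sketch of how a template carrying an {inputSentence} placeholder might be instantiated before being sent to a chat model; the helper name, the shortened template text, and the reading of {prefix}(0,n) as "zero to n prefix segments" are illustrative assumptions, not the benchmark's actual harness.

```python
# Illustrative sketch only: instantiate a prompt template before sending it
# to a chat model. The helper name and the shortened template are assumptions.
SEGMENTATION_TEMPLATE = (
    "A word can be composed of one root and one or multiple affixes, "
    "segment the following sentence into its morphological constituents: "
    "{inputSentence}. The output should be in the format: "
    "[{word: {prefix}(0,n)+root+{suffix}(0,n)}, ...]"
)

def fill_template(template: str, sentence: str) -> str:
    # str.replace is used instead of str.format so that the literal braces
    # in the requested output format are left untouched.
    return template.replace("{inputSentence}", sentence)

if __name__ == "__main__":
    print(fill_template(SEGMENTATION_TEMPLATE, "<Arabic sentence here>"))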
Named Entity Recognition

Task Description: You are working as a named entity recognition expert and your task is to label a given Arabic text with named entity labels. Your task is to identify and label any named entities present in the text without any explanation. The named entity labels that you will be using are PER (person), LOC (location), ORG (organization), MISC (miscellaneous). You may encounter multi-word entities, so make sure to label each word of the entity with the appropriate prefix ('B' for the first word of an entity, 'I' for any non-initial word of an entity). For words which are not part of any named entity, you should return 'O'. Note: Your output format should be a list of tuples, where each tuple consists of a word from the input text and its corresponding named entity label. Input: {inputSentence}
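Because this prompt pins down the reply format (a list of word-label tuples), a small amount of post-processing is enough to recover token-level predictions. The sketch below is one plausible way to do it and is not taken from the paper's code.

```python
import ast
import re

def parse_ner_output(raw_reply: str):
    """Recover (word, label) pairs from a reply such as
    [('word1', 'B-PER'), ('word2', 'O')]."""
    try:
        parsed = ast.literal_eval(raw_reply.strip())
        return [(str(word), str(tag)) for word, tag in parsed]
    except (ValueError, SyntaxError, TypeError):
        # Fallback when the reply is close to, but not exactly, valid Python.
        return re.findall(r"\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)", raw_reply)
```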

A.2 Sentiment, Stylistic and Emotion Analysis

Sentiment analysis

Choose only one sentiment between: Positive, Negative, Neutral, or Mixed for this sentence: {inputSentence}

Emotion detection

Predict all the possible emotions in the following Arabic sentences without explanation and put them in a Python list. List of emotions are: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust.
Sentence: {inputSentence}
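Since the emotion prompt asks for the answer as a Python list drawn from a fixed inventory, the reply can be validated against that inventory. A possible post-processing step, written here as an assumption rather than the paper's own code:

```python
import ast

EMOTIONS = {
    "anger", "anticipation", "disgust", "fear", "joy", "love",
    "optimism", "pessimism", "sadness", "surprise", "trust",
}

def parse_emotions(raw_reply: str):
    """Keep only labels that belong to the prompt's emotion inventory."""
    try:
        candidates = ast.literal_eval(raw_reply.strip())
    except (ValueError, SyntaxError):
        candidates = raw_reply.replace("[", "").replace("]", "").split(",")
    return [c.strip(" '\"").lower() for c in candidates
            if isinstance(c, str) and c.strip(" '\"").lower() in EMOTIONS]
```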

A.3 Demographic/Protected Attributes

Gender

If the following person name can be considered as male, write 'm' without explanation, and if it can be considered as female, write 'f' without explanation.
{inputSentence}

Location

Map the following locations to one of the Arab countries. Write the country code in ISO 3166-1 alpha-2 format without explanation. If the country is outside Arab countries, write 'OTHERS', and if the location cannot be mapped to any country in the world, write 'UNK' without any explanation:
{inputSentence}

Name Info

Predict the country of citizenship of the following person name. Write the country code in ISO 3166-1 alpha-2 format without explanation.
{inputSentence}
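The Location and Name Info prompts both request an ISO 3166-1 alpha-2 code together with the special labels 'OTHERS' and 'UNK'. A hedged sketch of how such replies might be normalized before scoring (the fallback to 'UNK' is an assumption, not the paper's stated procedure):

```python
import re

SPECIAL_LABELS = {"OTHERS", "UNK", "MSA"}  # MSA is used by the dialect prompt later on

def normalize_code(raw_reply: str) -> str:
    """Reduce a free-form reply to a bare label: a two-letter code or a special label."""
    reply = raw_reply.strip().upper()
    if reply in SPECIAL_LABELS:
        return reply
    match = re.search(r"\b[A-Z]{2}\b", reply)
    return match.group(0) if match else "UNK"
```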
A.4 Question answering

Your task is to answer questions in Arabic based on a given context.
Note: Your answers should be spans extracted from the given context without any illustrations.
You don't need to provide a complete answer.
Context: {context}
Question: {question}
Answer:

A.5 Natural language inference

You are provided with a premise and a hypothesis. Your task is to classify the hypothesis as either true (entailment), false (contradiction), or unknown (neutral) based on the given premise. The output should be true, false or unknown.
Example:
Premise: [Arabic premise sentence]
Hypothesis: [Arabic hypothesis sentence]
Output: Contradiction

A.6 Semantic textual similarity

Regression

Given two sentences, produce a continuous valued similarity score on a scale from 0 to 5, with 0 indicating that the semantics of the sentences are completely independent and 5 signifying semantic equivalence. The output should be exactly in form Similarity score =.

Classification (question similarity)

Are the two questions below semantically similar? The output should be exactly in form yes or no.

A.7 Factuality, Disinformation, Harmful content

Subjectivity

system
You are an AI assistant that helps people find information.
text
Classify the sentence as Subjective or Objective. Provide only label.
sentence: {sentence}
label:

Checkworthiness

Classify the sentence as checkworthy or not checkworthy. Provide only label.

Claim detection

Does this sentence contain a factual claim? Answer only by yes or no.

Harmful content detection

Classify the sentence as harmful or not harmful. Answer only by yes or no. Provide only label.

Attention-worthy

Classify the sentence by whether it should get the attention of policymakers. Answer by yes or no. If the predicted label is yes then classify the sentence into one of the following categories: asks question, blame authorities, calls for action, Harmful, contains advice, discusses action taken, discusses cure, or other.
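Several of the prompts above expect a bare "yes" or "no". One way such replies could be mapped onto dataset labels is sketched below; the label strings and the default value are illustrative assumptions.

```python
def map_yes_no(raw_reply: str, yes_label: str, no_label: str, default: str = "unknown") -> str:
    """Map a free-form yes/no reply onto task-specific label names."""
    stripped = raw_reply.strip()
    first_token = stripped.lower().split()[0].rstrip(".,:;") if stripped else ""
    return {"yes": yes_label, "no": no_label}.get(first_token, default)

# e.g. for claim detection: map_yes_no(reply, "claim", "no-claim")
```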

A.8 Semantics

A.9 Question answering (QA)

Your task is to answer questions in Arabic based on a given context.
Note: Your answers should be spans extracted from the given context without any illustrations.
You don't need to provide a complete answer.
Context: {context}
Question: {question}
Answer:

A.10 Factuality, Disinformation, Harmful content

Offensive Language

if the following sentence is offensive, just say "OFF", otherwise, say just "NOT_OFF" without explanation:
{inputSentence}

Hate Speech

if the following sentence has hate speech, just say "HS", otherwise, say just "NOT_HS" without explanation:
{inputSentence}

Adult Content

Classify the following sentence as adult language (the language used in adult advertisement and porno advertisement) or not adult language without illustration. Sentence: xx Label: xx

Spam

If the following sentence can be classified as spam or contains an advertisement, write 'ADS' without explanation, otherwise write 'NOTADS' without explanation.
{inputSentence}

A.11 Dialect Identification

Write only the country code of the Arabic country in which this sentence is written in its dialect without any explanation. Write only the country code in ISO 3166-1 alpha-2 format without explanation. Write 'MSA' if the sentence is written in Modern Standard Arabic.
{inputSentence}
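For the tag-style outputs in A.10 (OFF/NOT_OFF, HS/NOT_HS, ADS/NOTADS), note that each negative tag contains the positive one as a substring, so any string matching has to test the negative tag first. A small sketch of this, again as an assumed post-processing step rather than the paper's code:

```python
def normalize_tag(raw_reply: str, positive: str, negative: str) -> str:
    """Map a reply onto one of two expected tags, e.g. OFF / NOT_OFF.

    The negative tag is checked first because 'NOT_OFF' contains 'OFF';
    replies matching neither tag default to the negative class (an assumption).
    """
    reply = raw_reply.strip().upper()
    if negative in reply:
        return negative
    if positive in reply:
        return positive
    return negative

# e.g. normalize_tag("The sentence is NOT_OFF.", "OFF", "NOT_OFF") -> "NOT_OFF"
```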

A.12 Sequence Tagging / Token Classification / Core Linguistics

Segmentation

Segment the following sentence by separating morphological parts with +:

Spell Checking

correct spelling errors in this Arabic sentence: {inputSentence}

POS

these are the segmentation and POS tags for a sample sentence:
[Arabic sample sentence]
[Arabic word]  [segmentation]  NOUN
[Arabic word]  [segmentation]  NOUN+NSUFF
[Arabic word]  [segmentation]  V
[Arabic word]  [segmentation]  NOUN+NSUFF
[Arabic word]  [segmentation]  NOUN
[Arabic word]  [segmentation]  DET+NOUN+NSUFF
[Arabic word]  [segmentation]  DET+ADJ+NSUFF
[Arabic word]  [segmentation]  PREP+NOUN
[Arabic word]  [segmentation]  DET+NOUN
[Arabic word]  [segmentation]  CONJ+DET+NOUN
get the segmentation and POS tags for this sentence: {inputSentence}
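This POS prompt carries an in-context worked example, so the full prompt can be assembled by prepending the example block to the instruction for the test sentence. A minimal sketch, with the example block passed in as a pre-formatted string (the helper name is illustrative):

```python
def build_one_shot_pos_prompt(example_block: str, test_sentence: str) -> str:
    """Prepend a worked segmentation/POS example to the instruction for a new sentence."""
    return (
        "these are the segmentation and POS tags for a sample sentence:\n"
        + example_block.rstrip() + "\n"
        + "get the segmentation and POS tags for this sentence: " + test_sentence
    )
```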

Assign POS tag to each morphological segment within each word. Group the tags for each word with +: {inputSentence}. The output should be in the format: [{word: label}, {word: label}]

Label the following sentence with its corresponding PENN Treebank POS Labels: {inputSentence}
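Since the first prompt above fixes the reply shape to [{word: label}, {word: label}], the word-tag pairs can be pulled out with a simple pattern; the regex below is an illustrative guess at a tolerant parser, not the evaluation code used in the paper.

```python
import re

def parse_word_label_pairs(raw_reply: str):
    """Extract (word, label) pairs from replies shaped like [{word: label}, {word: label}]."""
    return re.findall(r"\{\s*([^:{}]+?)\s*:\s*([^{}]+?)\s*\}", raw_reply)

# e.g. parse_word_label_pairs("[{book: NOUN}, {the+house: DET+NOUN}]")
#      -> [('book', 'NOUN'), ('the+house', 'DET+NOUN')]
```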

Lemmatization

for every word in the following sentence, write only the lemmas without diacritics in separate lines without explanation:
{inputSentence}

Diacritization

diacritize this Arabic sentence fully: "..."

Vowelize the following sentence: {inputSentence}. Words that can't be vowelized, put them back as they were.

Parsing

Given the following features (in order: ID, Form, Lemma, CPostTag, POSTag, Features), predict the Head of each token in the following sentence, which is either a value of a related ID or 0. A value of zero means the token attaches to the virtual root node:
{inputSentence}

Paraphrasing

rephrase this Arabic sentence: {inputSentence}
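The Parsing prompt above lists the per-token features in a fixed order (ID, Form, Lemma, CPostTag, POSTag, Features) but does not show how the token table is serialized; the sketch below assumes one tab-separated line per token with CoNLL-style underscores for missing fields, which is only one plausible choice.

```python
def format_parsing_input(tokens):
    """Serialize tokens as 'ID<TAB>Form<TAB>Lemma<TAB>CPostTag<TAB>POSTag<TAB>Features' lines.

    `tokens` is a list of dicts; missing fields fall back to the CoNLL-style
    underscore. The tab-separated, one-token-per-line layout is an assumption.
    """
    columns = ("id", "form", "lemma", "cpostag", "postag", "feats")
    return "\n".join(
        "\t".join(str(token.get(col, "_")) for col in columns) for token in tokens
    )

# e.g. format_parsing_input([{"id": 1, "form": "example", "postag": "NOUN"}])
```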
