Benchmarking Arabic AI With Large Language Models
Table 1: Summary of the test sets and their sizes used in evaluation for the different textual tasks. ANC: American National Corpus. Posts∗: posts from Twitter, YouTube, and Facebook. News Cat.: News Categorization.
that change the tense of verbs, or represent pronouns and prepositions in nouns. It is a building block for NLP tasks such as search, part-of-speech tagging, parsing, and machine translation. The idea is to segment Arabic words into prefixes, stems, and suffixes, which can facilitate many other tasks.

Datasets

WikiNews For modern standard Arabic (MSA), we used the WikiNews dataset (Darwish and Mubarak, 2016), which comprises 70 news articles in politics, economics, health, science and technology, sports, arts, and culture. The dataset has 400 sentences (18,271 words) in total.

Tweets For dialectal Arabic, we used the dataset of Samih et al. (2017), which provides 1,400 tweets in Egyptian, Gulf, Levantine, and Maghrebi dialects, for a total of 25,708 annotated words.

Dataset For lemmatization, we used the WikiNews dataset tagged for lemmas (Mubarak, 2017); see §3.1.1 for the details of the dataset.

3.1.4 Diacritization

Diacritization involves assigning the diacritics to each letter in an Arabic word within a sentence. Diacritical marks indicate the correct pronunciation and meaning of written Arabic words. For example, different diacritizations of the same word can transform a noun into a verb or vice versa.

Datasets

WikiNews We use a dataset of modern standard Arabic from Mubarak et al. (2019) that comprises the fully diacritized WikiNews corpus (Darwish et al., 2017b).

Bibles This dataset includes translations of the New Testament into two Maghrebi sub-dialects: Moroccan and Tunisian (Darwish et al., 2018; Abdelali et al., 2019).

The following datasets were used for the named entity recognition (NER) task.

AQMAR The dataset was developed as an evaluation suite for the named entity recognition task in Arabic. It consists of a collection of 28 Wikipedia articles with 74,000 tokens (Schneider et al., 2012). We consider the articles corresponding to the test split for our evaluation.

QASR The QASR dataset consists of 70k words extracted from 2,000 hours of transcribed Arabic speech (Mubarak et al., 2021b).

3.1.7 Paraphrasing

This task involves rewriting text using different words and sentence structures while maintaining its original meaning. It is a complex language understanding task that requires the capability to suggest different words or even structures that preserve the intended meaning.

Dataset For this task, we used the modern standard Arabic part of the MADAR corpus of parallel sentences (Bouamor et al., 2018), which has 2,000 translated sentences from the BTEC corpus (Takezawa et al., 2007). We used back-translation from Google MT as SOTA, i.e., translating from Arabic to English and then back to Arabic.

3.2 Machine Translation (MT)

The machine translation evaluation set is a rich set that covers several varieties of Arabic in addition to Modern Standard Arabic (MSA). The genres of the evaluation set also cover formal, informal, speech, and other modalities. These types and varieties allowed us to assess the system and reveal its potential and limitations. For this study, we focused on translating Arabic to English and used the datasets discussed below.

Datasets

MADAR Corpus This dataset consists of 2,000 sentences from the BTEC corpus translated to modern standard Arabic and four major dialects from 15 countries (Bouamor et al., 2018).

APT (Zbib et al., 2012) This dataset is collected from the Arabic-Dialect/English Parallel Text (APT), which consists of 2,000 sentences with 3.5 million tokens of translated dialectal Arabic (Zbib et al., 2012).

Multi-dialectal Parallel Corpus of Arabic (MDC) This dataset also consists of 2,000 sentences in Egyptian, Palestinian, Syrian, Jordanian, and Tunisian dialects and their English counterparts (Bouamor et al., 2014).

The Bible It consists of 8.2k parallel sentences translated into modern standard Arabic and into the Moroccan³ and Tunisian⁴ dialects (Abdelali et al., 2019).

³ The Morocco Bible Society: https://www.biblesociety.ma
⁴ The United Bible Societies: https://www.bible.com

Media Dataset The dataset consists of 7.5 hours of recordings collected from five public broadcasting channels that cover programs in Maghrebi, Lebanese, and Omani dialects as well as MSA, with genres involving movies, news reports, and cultural programs. The recordings were transcribed and translated by a professional translation house (Sajjad et al., 2020).

3.3 Dialect Identification

Dialect is defined as the speaker's grammatical, lexical, and phonological variation in pronunciation (Etman and Beex, 2015). Automatic Dialect Identification (ADI) has become an important research area for improving certain applications and services, such as ASR and many downstream NLP tasks.

Dataset For this task, we used the QADI dataset, which contains a wide range of country-level Arabic dialects covering 18 different countries in the Middle East and North Africa region (Abdelali et al., 2020). It consists of 540,590 tweets from 2,525 users.

3.4 Sentiment, Stylistic and Emotion Analysis

3.4.1 Sentiment Analysis

Sentiment analysis has been an active research area and aims to analyze people's sentiment or opinion toward entities such as topics, events, individuals, issues, services, products, organizations, and their attributes (Liu and Zhang, 2012; Zhang et al., 2018). This task involves classifying the content into sentiment labels such as positive, neutral, and negative.

Dataset The ArSAS dataset consists of 21k Arabic tweets covering multiple topics that were collected, prepared, and annotated for six different classes of speech-act labels and four sentiment classes (Elmadany et al., 2018). For the experiments, we used only the sentiment labels from this dataset.

3.4.2 Emotion Recognition

Emotion recognition is the task of categorizing different types of content (e.g., text, speech, and
visual) in different emotion labels (six basic emotions (Ekman, 1971) or more fine-grained categories (Demszky et al., 2020)).

Dataset For the emotion recognition task, we used SemEval-2018 Task 1: Affect in Tweets (Mohammad et al., 2018). The task is defined as classifying a tweet as one or more of eleven emotion labels, annotated in a multilabel (presence/absence of 11 emotions) setting.

3.4.3 Stance Detection

Stance is defined as the expression of the speaker's view and judgment toward a given argument or statement (Biber and Finegan, 1988). Given that social media platforms allow users to consume and disseminate information by expressing their views, enabling them to obtain instant feedback and explore others' views, it is important to characterize the stance expressed in a given piece of content. Automatic stance detection also allows for assessing public opinion on social media, particularly on different social and political issues such as abortion, climate change, and feminism, on which people express supportive or opposing opinions (ALDayel and Magdy, 2021; Küçük and Can, 2020). The task involves "classification as the stance of the producer of a piece of text, towards a target as either one of the three classes: {support, against, neither} or {agree, disagree, discuss, or unrelated}" (Küçük and Can, 2020).

Datasets

Unified-FC This dataset consists of claims collected from Verify.sy (false claims) and Reuters (true claims), which resulted in 422 claims. Based on these claims, documents were collected using the Google custom search API and filtered by computing claim-document similarity (Baly et al., 2018b). This approach resulted in 3,042 claim-document pairs, which were then annotated for stance (agree, disagree, discuss, unrelated) using the Appen crowd-sourcing platform.

Khouja (2020) developed a dataset by first sampling news titles from the Arabic News Texts (ANT) corpus (Chouigui et al., 2017) and then generating true and false claims. From these claims, stance (three classes: agree, disagree, other) was annotated for pairs of sentences using Amazon Mechanical Turk and Upwork. The dataset consists of 3,786 claim-reference pairs.

3.5 News Categorization

News text categorization was a popular task in the earlier days of NLP research (Sebastiani, 2002). The idea is to assign a category from C = {c1, ..., cn} to a document d ∈ D = {d1, ..., dn}. For news categorization, D is a set of news articles and C is a set of predefined categories. Most often a news article can be categorized into more than one category, and the models are trained in a multilabel setting. While earlier work mostly focused on news articles, lately the task has also been applied to the categorization of tweets in which news articles are shared as part of a tweet.

Datasets

Social Media Posts ASND is a news tweets dataset (Chowdhury et al., 2020b) collected from Aljazeera news channel accounts on Twitter, Facebook, and YouTube. The dataset consists of twelve categories: art-and-entertainment, business-and-economy, crime-war-conflict, education, environment, health, human-rights-press-freedom, politics, science-and-technology, spiritual, sports, and others. We used the test split from each dataset for the evaluation.

Arabic News The SANAD corpus is a large collection of Arabic news articles collected from Akhbarona, AlKhaleej, and AlArabiya (Einea et al., 2019). The dataset has separate collections from different news media, each of which has six news categories: culture, finance, medical, politics, sports, and technology.

3.6 Demographic Attributes

Demographic information (e.g., gender, age, country of origin) is useful in many different applications such as understanding population characteristics, personalized advertising, and socio-cultural studies. Demographic information helps governments, businesses, and organizations understand their target audiences and plan accordingly.

3.6.1 Gender

Gender analysis can reveal important differences between male and female users, such as topics of interest, the gender gap, and preferences.

Dataset We used the ArabGend test set, which contains 1,000 names collected from Twitter (divided equally between males and females) (Mubarak et al., 2022).
3.6.2 Location

Identifying user locations is useful for many applications such as author profiling, dialect identification, and recommendation systems. Often, users on social media platforms, such as Twitter, declare their locations in noisy ways, and mapping these locations to countries is a challenging task.

Dataset We used the UL2C dataset, which contains 28K unique locations, as written by Arabic Twitter users, and their mappings to Arab countries (Mubarak and Hassan, 2021).

3.6.3 Name Info

Names carry important information about our identities and demographic characteristics, including factors like gender, nationality, and ethnicity. The purpose of this task is to predict the country of origin of a person given only their name.

Dataset We used an in-house dataset for mapping person names to world countries extracted from Wikipedia.⁵

⁵ Paper is under revision.

3.7 Factuality, Disinformation and Harmful Content Detection

3.7.1 Subjectivity Identification

A sentence is considered subjective when it is based on, or influenced by, personal feelings, tastes, or opinions; otherwise, the sentence is considered objective (Antici et al., 2021). Given that the identification of subjectivity is itself subjective, it poses challenges in the annotation process. The complexity lies in the different levels of expertise of the annotators, their different interpretations, and their conscious and unconscious bias towards the content they annotate. The content can be text (e.g., a sentence or an article), an image, or multimodal content, consisting of opinionated, factual, or non-factual content. The annotation has typically been done using two labels, objective (OBJ) and subjective (SUBJ).

Dataset The dataset consists of sentences curated from news articles. It has been developed based on the existing AraFacts dataset (Ali et al., 2021), which contains claims verified by Arabic fact-checking websites, where each claim is associated with web pages propagating or negating the claim. The news articles were collected from different news media, automatically parsed, split into sentences, and filtered for poorly-formatted sentences using a rule-based approach. The dataset has been released as a part of Task 2 of the CLEF2023 CheckThat Lab (Barrón-Cedeño et al., 2023).

3.7.2 Propaganda Detection

Propaganda can be defined as a form of communication that aims to influence the opinions or the actions of people towards a specific goal; this is achieved using well-defined rhetorical and psychological devices (Dimitrov et al., 2021). In different communication channels, propaganda (persuasion techniques) is conveyed through diverse techniques (Miller, 1939), which range from leveraging the emotions of the audience, such as using emotional appeals, to logical fallacies such as straw man (misrepresenting someone's opinion), hidden ad-hominem fallacies, and red herring (presenting irrelevant data).

Dataset The dataset used for this study consists of Arabic tweets (Alam et al., 2022b) posted by news media from Arab countries, such as Al Arabiya and Sky News Arabia from UAE and Al Jazeera and Al Sharq from Qatar, and from five international Arabic news sources: Al-Hurra News, BBC Arabic, CNN Arabic, France 24, and Russia Today. The final annotated dataset consists of 930 tweets. Alam et al. (2022b) formulated the task as a multilabel and multiclass span-level classification task. For this study, we used the multilabel setup.

3.7.3 Check-worthiness Detection

Fact-checking is a time-consuming and complex process, and it often takes effort to determine whether a claim is important to check, irrespective of its potential to be misleading or not. Check-worthiness detection is the first step and a critical component of fact-checking systems (Nakov et al., 2021), and the aim is to facilitate manual fact-checking efforts by prioritizing the claims for the fact-checkers. Research on check-worthiness includes check-worthiness detection/ranking from political speeches, debates, and social media posts (Nakov et al., 2022a; Shaar et al., 2021). A check-worthy claim is usually defined by its importance to the public and journalists, and whether it can cause harm to an individual, organization, and/or society.

Dataset For this study, we used the Arabic subset of the dataset released with Task 1A (Arabic) of the CLEF2022 CheckThat Lab (Nakov et al., 2022b).
The dataset consists of 4,121 annotated tweets. The Arabic tweets were collected using keywords related to COVID-19, vaccines, and politics.

3.7.4 Claim Detection

Information shared in mainstream and social media often contains misleading content. Claim detection has become an important problem for mitigating misinformation and disinformation in those media channels. A factual (verifiable) claim is a sentence claiming that something is true, which can be verified using factually verifiable information such as statistics, specific examples, or personal testimony (Konstantinovskiy et al., 2021). Research on claim detection includes social media posts in the text modality (Alam et al., 2021b), multimodal content (Cheema et al., 2022), and news (Reddy et al., 2022).

Datasets

CT-CWT-22-Claim We used the Arabic subset of the dataset released with Task 1B of the CLEF2022 CheckThat Lab (Nakov et al., 2022a). The dataset has been annotated using a multi-question annotation schema (Alam et al., 2021a) and consists of tweets collected using COVID-19 related keywords. It contains 6,214 tweets (Nakov et al., 2022b).

Khouja (2020) This dataset consists of 4,547 true and false claims, which were developed based on the Arabic News Texts (ANT) corpus. A sample of articles was modified to generate true and false claims using crowdsourcing.

3.7.5 Attention-worthiness Detection

On social media, people most often tweet by blaming authorities, providing advice, and/or calling for action. It might be important for policy makers to respond to such posts. The purpose of this task is to categorize such information into one of the following categories: not interesting, not sure, harmful, other, blames authorities, contains advice, calls for action, discusses action taken, discusses cure, asks a question.

Dataset For this task, we used a subset of the dataset of Task 1D of the CLEF2022 CheckThat Lab (Nakov et al., 2022a), which contains 6,140 annotated tweets.

3.7.6 Factuality Detection

Fact-checking has emerged as an important research topic due to the large amount of fake news, rumors, and conspiracy theories that are spreading in different social media channels to manipulate people's opinions or to influence the outcome of major events such as political elections (Darwish et al., 2017a; Baly et al., 2018b). While fact-checking has largely been done by manual fact-checkers for reliability, this does not scale well given the enormous amount of information shared online every day. Therefore, automatic fact-checking systems are important and have been used to facilitate the work of human fact-checkers (Nakov et al., 2021). The task typically involves assessing the level of factual correctness of a news article, media outlet, or social media post. The content is generally judged to be of high, low, or mixed factual correctness, on a seven-point Likert scale⁶,⁷ or with binary labels {yes, no} (Baly et al., 2018a; Alam et al., 2021b).

⁶ https://mediabiasfactcheck.com
⁷ https://allsides.com

Datasets

News Articles We used the dataset developed by Baly et al. (2018a), in which false claims are extracted from verify-sy⁸ and true claims are extracted from http://ara.reuters.com. The dataset consists of 3,042 documents.

⁸ http://www.verify-sy.com

Tweets For factuality detection from tweets, we used the same dataset (Alam et al., 2021b) discussed in 3.7.4. As mentioned earlier, this dataset was annotated using a multi-question annotation schema in which one of the questions was "does the tweet appear to contain false information?". Based on the answer to this question, the factuality label of the tweet has been defined. The Arabic dataset contains a total of 4,966 tweets.

3.7.7 Harmful Content Detection

For harmful content detection, we adopted the task proposed in (Alam et al., 2021b; Nakov et al., 2022b), although research on harmful content detection also includes identifying offensive, hate-speech, cyberbullying, violent, racist, misogynistic, and sexist content (Sharma et al., 2022; Alam et al., 2022a); some of these harmful content detection tasks are addressed separately and discussed in the sections below. Alam et al. (2021b); Nakov et al. (2022b) proposed the task in the context of tweets, and the idea is to detect whether the content of the tweet aims to, and can, negatively affect society as a whole, specific
person(s), company(s), product(s), or spread rumors about them. The content intends to harm or weaponize the information⁹ (Broniatowski et al., 2018).

⁹ The use of information as a weapon to spread misinformation and mislead people.

Dataset We used the Arabic dataset proposed in (Nakov et al., 2022b), which consists of a total of 6,155 annotated tweets.

3.7.8 Offensive Language Detection

The use of offensive language in social media has become a major problem, which can lead to real-world violence (Husain and Uzuner, 2021; Sap et al., 2019). The literature on offensive language detection has mainly focused on social media content and addressed a variety of languages. The task is mainly defined as determining whether the content (e.g., text, image, or multimodal) is offensive or not (Chowdhury et al., 2020c).

Dataset For this task, we used the dataset from SemEval-2020 Task 12 (OffensEval 2020) (Zampieri et al., 2020), which consists of 10,000 tweets selected from a set of 660k Arabic tweets containing the vocative particle ("yA" – O) posted from April 15 to May 6, 2019.

3.7.9 Hate Speech Detection

Davidson et al. (2017) defined hate speech as "language that is used to express hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group". The literature on hate speech detection defines the task as detecting hate vs. non-hate in different types of content such as text, images, and multimodal content (Schmidt and Wiegand, 2017; Kiela et al., 2020; Gomez et al., 2020).

Dataset For this task, we also used the OSACT4 dataset (Mubarak et al., 2020b), which consists of 10,000 tweets annotated with the labels hate-speech and not-hate-speech.

3.7.10 Adult Content Detection

Identifying this type of content is important for social media platforms to remain a safe place for users. In particular, such content poses a serious threat to vulnerable groups (e.g., younger age groups). The task typically involves detecting whether textual content contains sensitive/adult content, or whether an account shares such content.

Dataset We used the dataset discussed in (Mubarak et al., 2021a), which contains 10,000 tweets collected by first identifying Twitter accounts that post adult content. Tweets are manually annotated as adult and not-adult.

3.7.11 Spam Detection

Spam content in social media includes ads, malicious content, and any low-quality content (Ghanem et al., 2023). Spam detection is another important problem, as such content may often annoy and mislead users (Gao et al., 2012).

Dataset We used the dataset discussed in (Mubarak et al., 2020a) for Arabic spam detection, which contains 28K tweets manually labeled as spam and not-spam.

3.8 Semantic Textual Similarity

3.8.1 Textual Similarity

Semantic textual similarity is a measure used to determine whether two sentences are semantically equivalent. The task involves generating numerical similarity scores for pairs of sentences, with performance evaluated based on the Pearson correlation between machine-generated scores and human judgments (Cer et al., 2017a). Two tasks were conducted to gauge the similarity between 250 pairs of Arabic sentences, as well as Arabic-English sentence pairs.

Dataset We used the SemEval-2017 Task 1 (Track 1: ar-ar and Track 2: ar-en) dataset (Cer et al., 2017a), which is a translated version (machine translation followed by human post-editing) of the SNLI dataset (Bowman et al., 2015).

3.8.2 Semantic Question Similarity

The idea of this task is to determine how similar two questions are in terms of their meaning.

Dataset We used the Mawdoo3 Q2Q dataset (NSURL-2019 Task 8: Semantic question similarity in Arabic), which consists of 15,712 annotated pairs of questions. Each pair is labeled as having no semantic similarity (0) or being semantically similar (1) (Seelawi et al., 2019).

3.9 Natural Language Inference (NLI)

The XNLI task, known as Cross-lingual Natural Language Inference (Conneau et al., 2018), is a widely used benchmark in the field of natural language processing (NLP). It involves determining
the logical relationship between pairs of sentences written in different languages. Specifically, the task requires NLP models to determine whether a given hypothesis sentence is entailed, contradicted, or neutral in relation to a given premise sentence, across multiple languages. The XNLI task serves as a rigorous evaluation of the cross-lingual transfer capabilities of NLP models, assessing their ability to understand and reason in different languages within a multilingual context.
Dataset The dataset we used for this study is the Arabic portion of the translated XNLI corpus (Conneau et al., 2018). For the annotation, 250 English sentences were selected from ten different sources, and annotators were asked to produce three hypotheses per sentence premise. The resulting premises and hypotheses were then translated into 15 languages, and we used the Arabic version for this study.
3.10 Question Answering (QA)

This task involves answering questions in Arabic based on a given text.¹⁰ For this task, we use four different datasets consisting of (passage, question, answer) triples.

¹⁰ This task is also referred to as machine reading comprehension, where the model is tested on its ability to extract answers from the given text.

Datasets

ARCD consists of 1,395 Arabic MSA questions posed by crowd-sourced workers along with text segments from Arabic Wikipedia. We use the test set only for our evaluation; it consists of 78 articles, 234 paragraphs, and 702 questions (Mozannar et al., 2019).

MLQA comprises multilingual question-answer instances in 7 languages: English, Arabic, Simplified Chinese, Hindi, German, Vietnamese, and Spanish. We used the Arabic QA pairs from this dataset, which consist of 2,389 articles, 4,646 paragraphs, and 5,335 questions (Lewis et al., 2019).

TyDi QA comprises 11 languages with 204K question-answer pairs. We used the data provided for the Gold Passage task, in which a passage that contains the answer is provided and the task is to predict the span that contains the answer. We used the Arabic split of the data, which contains 921 articles, 921 paragraphs, and 921 questions (Artetxe et al., 2019).

XQuAD comprises 240 paragraphs and 1,190 question-answer pairs from the development set of SQuAD v1.1, with their professional translations into ten languages: Hindi, Turkish, Arabic, Vietnamese, Thai, German, Greek, Russian, Spanish, and Chinese. We use the Arabic split of the data, which consists of 48 articles, 240 paragraphs, and 1,190 questions (Artetxe et al., 2019). We used the SQuAD version of all datasets along with the official SQuAD evaluation script.

3.11 Speech Processing

For this study, we address the speech modality in the context of large foundation models, and we evaluate the following two tasks in this edition: (i) automatic speech recognition (ASR); and (ii) text-to-speech (TTS). In the future, we will scale the speech benchmark with speech translation (ST) and spoken Arabic dialect identification (ADI).

3.11.1 Speech Recognition

The primary objective of an ASR system is to transform spoken language into written text. The task itself is challenging due to the variability in human speech, which can be affected by factors such as accent, speaking style, code-switching, environmental factors like channels, and background noise, among others. Furthermore, language-related challenges, including complex morphology, unstandardized orthography, and a wide array of dialects used as a primary mode of communication, add a layer of complexity to the task. Therefore, to properly benchmark Arabic ASR, we covered a wide range of domains encapsulating different speaking styles, dialects, and environments. For our study, we considered broadcast news, telephony, and meeting data for MSA, Egyptian, Moroccan Arabic, etc., in both monolingual and code-switching setups.

Datasets

MGB2 consists of 9.57 hours of multi-dialect speech data that was collected from Aljazeera TV programs and manually transcribed. The data consists of a mix of Modern Standard Arabic (MSA) and various dialects, including Egyptian, Levantine, Gulf, and North African (Ali et al., 2016).¹¹

¹¹ https://arabicspeech.org/mgb2
Dataset Task Domain Size
MGB2 ASR Broadcast (MSA) 9.57 hrs
MGB3 ASR Broadcast (EGY) 5.78 hrs
MGB5 ASR Broadcast (MOR) 1.40 hrs
QASR.CS ASR Broadcast (Mixed) → Code-switching 5.90 hrs
DACS ASR Broadcast (MSA-EGY) → Code-switching 1.50 hrs
ESCWA.CS ASR Meeting (Mixed DA - ENG) → Code-switching 2.80 hrs
CallHome ASR Telephony (EGY) 20 phone conversations
In-house TTS Mixed Topics (education, health, etc) 20 sentences
Table 2: Summary of the test sets and their sizes used in evaluation for the speech processing tasks.
MGB3 is a collection of 5.78 hours of multi-genre speech data in the Egyptian dialect. The data was collected from YouTube videos and manually transcribed (Ali et al., 2017).¹²

MGB5 is a collection of 1.4 hours of speech data in the Moroccan dialect. The data was collected from YouTube videos and manually transcribed (Ali et al., 2019).¹³

ESCWA.CS is a 2.8-hour speech code-switching corpus collected over two days of meetings of the United Nations Economic and Social Commission for West Asia (ESCWA) in 2019 (Chowdhury et al., 2021).¹⁴

QASR.CS is a collection of 5.9 hours of code-switching speech extracted from the Arabic broadcast news data (QASR) to test the systems on code-switching. The dataset also includes some instances where the switch is between Arabic and French; however, such instances are very rare (Mubarak et al., 2021b).¹⁵

DACS is a collection of ≈ 1.5 hours of broadcast speech designed to evaluate the performance of ASR on code-switching between MSA and Egyptian dialect and vice versa (Chowdhury et al., 2020a).¹⁶

CallHome Egyptian is a speech corpus of telephone conversations between native speakers of Egyptian Arabic. It consists of 20 unscripted telephone conversations, each of which lasts between 5 and 30 minutes (Kumar et al., 2014).¹⁷

¹² https://arabicspeech.org/mgb3
¹³ https://arabicspeech.org/mgb5
¹⁴ https://arabicspeech.org/escwa
¹⁵ https://arabicspeech.org/qasr
¹⁶ https://github.com/qcri/Arabic_speech_code_switching
¹⁷ https://catalog.ldc.upenn.edu/LDC97S45

3.11.2 Text to Speech

Speech synthesis, a.k.a. text to speech (TTS), helps users get the written output more easily and, in some cases, faster. Most state-of-the-art end-to-end TTS systems comprise three modules: a text front-end, an acoustic model, and a vocoder. However, there is ongoing research on combining the acoustic model and the vocoder in a single neural network. The text front-end module normalizes the input text by converting digits, symbols, abbreviations, and acronyms into full words, and by processing words with special sounds, borrowed words, etc. This task is challenging in Arabic due to the missing diacritics in modern texts, as explained in 3.1.4. Therefore, the Arabic front-end part of the TTS is responsible for restoring the missing diacritics and for text normalization.

Dataset For MSA TTS, we create the first public test dataset, which comprises 20 sentences covering different topics such as psychology, education, and health. The average length of each sentence is 8 words. This data is used for objective and subjective evaluation of Arabic TTS.

4 Methodology

For the purpose of benchmarking the Arabic tasks, we opt for zero-shot learning for both NLP and speech tasks. We benchmarked these varieties of tasks by leveraging ChatGPT (GPT-3.5-Turbo) for NLP, and Whisper (small, medium, and large), USM, and Amazon Polly for speech, and compared their performance with the respective state-of-the-art models.

4.1 Model for NLP Tasks

In the zero-shot setting, the model – ChatGPT – is only given a natural language instruction describing the task and asked to produce the expected output. The goal is to allow the LLM to build a context that will help narrow the inference space and produce more accurate output. For each task, we explored a number of prompts guided by the same instruction and format as recommended in the Azure OpenAI Studio Chat playground. After obtaining the best prompt, we used it to complete the evaluation using
the OpenAI API from Azure Cognitive Services.

4.2 Models for Speech Tasks

Similar to the NLP tasks, we benchmarked the large speech models in zero-shot settings. For the speech recognition task, we explored three different OpenAI Whisper models (small, medium, and large), along with Google's USM model (see Table 3). We compared these large models with the supervised KANARI¹⁸ state-of-the-art conformer-based offline and RNN-T based streaming ASR systems.¹⁹ For the TTS task, we compare two state-of-the-art public systems: the Amazon Polly TTS engine²⁰ and the KANARI TTS system.²¹

¹⁸ https://fenek.ai/
¹⁹ https://arabicasr.kanari.ai/
²⁰ https://aws.amazon.com/polly/
²¹ https://arabictts.kanari.ai/

Model Layers Width Heads Parameters
W.Small 12 768 12 244M
W.Medium 24 1024 16 769M
W.Large-v2 32 1280 20 1550M
USM 32 1526 16 2B

Table 3: Model parameters and architecture for large pretrained ASR models. W. stands for OpenAI's Whisper (Radford et al., 2022) and USM is the Universal Speech Model from Google (Zhang et al., 2023).

4.3 Prompts and Post-Processing

Prompt design is the major challenge in a zero-shot learning setup, and the complexity varies depending on the type of task, such as token classification vs. sentence classification. Designing an appropriate prompt ensures accurate output. In Appendix A.1, we provide the prompts for the different tasks.

For example, for the segmentation task, some of the output was not segmented based on linguistic information but rather resembled Byte-Pair Encoding (BPE)-like subword splits. Based on this, the prompt was further redesigned, which resulted in a better outcome.

For the factuality, disinformation, and harmful content detection tasks, the challenges were different from the other tasks. One notable example is the propaganda detection task. The task requires determining whether a text snippet contains propagandistic language, and if it does, the model should detect which propaganda technique is used from a pre-defined list of techniques. Even with our best efforts to design the prompt for this task, the model still produced very unexpected responses, sometimes incomplete names of propaganda techniques, or even techniques not among the provided list. Another challenge in designing prompts for these tasks is the issue of task subjectivity, where providing a crisp, clear classification task definition to the model is not possible. As an example, one of our tasks is to evaluate whether a tweet is offensive towards a person or an entity. In many instances, the model predicted tweets to be offensive, while in reality they were descriptive of the tweet author's mental or physical state, or they were just repeating common negative statements or Arabic proverbs not directed at anyone, indicating that the model's understanding of offensiveness is not in line with our definition.

For almost all NLP tasks, post-processing was needed in order to match the gold labels, which includes mapping prefixes and suffixes or filtering tokens. For example, for the POS tagging task, model outputs such as 'preposition', 'P', 'PREP', and 'PRP' are all mapped to the gold tag PREP. As another example, for NER, the model sometimes reverses the order within the predicted tag, i.e., B-PER is predicted as PER-B, which required remapping the NER tags (a small sketch of this kind of label normalization is shown below).
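The following minimal sketch illustrates this kind of output post-processing. The mapping tables, label sets, and function names are illustrative assumptions, not the exact rules used in our pipeline.

import re

# Illustrative variant -> canonical POS tag mapping (the real table is larger).
POS_MAP = {"preposition": "PREP", "p": "PREP", "prep": "PREP", "prp": "PREP"}

def normalize_pos(tag: str) -> str:
    """Map a free-form model output (e.g., 'preposition') to the gold tag set."""
    t = tag.strip().lower()
    return POS_MAP.get(t, t.upper())

def normalize_ner(tag: str) -> str:
    """Fix reversed BIO tags such as 'PER-B' -> 'B-PER'."""
    m = re.fullmatch(r"([A-Z]+)-([BI])", tag.strip().upper())
    return f"{m.group(2)}-{m.group(1)}" if m else tag.strip().upper()

LABELS = ["positive", "negative", "neutral"]  # illustrative label set

def extract_label(response: str, labels=LABELS, default="neutral") -> str:
    """Reduce a verbose response such as
    'Sentiment: Positive (because of the laughing emoji)' to a bare label."""
    text = response.lower()
    for label in labels:
        if re.search(rf"\b{re.escape(label)}\b", text):
            return label
    return default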
For the speech recognition task, post-processing is a crucial component. Traditionally, ASR is evaluated based on word error rate (WER), an edit-distance-based metric. The measure aligns the model's output with the reference transcription and penalizes insertion, deletion, and substitution errors. Hence, the measure is unable to disambiguate some code-switching errors introduced by multilingual writing, along with other minor formatting differences. This heavily penalizing characteristic of WER poses a particular challenge in zero-shot settings, where the model does not observe any in-domain/task data formatting beforehand. Therefore, to minimize this challenge, we opt for text standardization by normalizing 'alif', 'ya', and 'ta-marbuta'. Moreover, to support multi-script rendering, we created a simple Global Mapping File (GLM) to transliterate commonly recognized outputs. To reduce the risk of overfitting the post-processing to a particular model's transcription style, we adapted the minimalist GLM (used in (Chowdhury et al., 2021)) and normalization pipeline and applied it to all models. We designed it based on common confusions and English-Arabic transliteration pairs.
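The exact GLM and normalization pipeline are not reproduced here; the following minimal sketch, with assumed character mappings, illustrates the kind of orthographic standardization applied before scoring, together with a plain edit-distance WER computation.

import re

# Assumed normalization choices; not the exact GLM/normalization used in our experiments.
ALIF_VARIANTS = "\u0622\u0623\u0625\u0671"               # alif with madda/hamza/wasla -> bare alif
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")        # tanwin, short vowels, sukun, dagger alif

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text)
    for ch in ALIF_VARIANTS:
        text = text.replace(ch, "\u0627")                # bare alif
    text = text.replace("\u0649", "\u064A")              # alif maqsura -> ya
    text = text.replace("\u0629", "\u0647")              # ta marbuta -> ha
    return " ".join(text.split())

def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over normalized words."""
    r, h = normalize(ref).split(), normalize(hyp).split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)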
4.4 Evaluation Metrics

To measure the performance on each task, we followed the current state-of-the-art references and used the metric reported in the respective work. The metrics include accuracy (ACC), F1 (macro, micro, and weighted), word error rate (WER), character error rate (CER), diacritization error rate (DER), and mean opinion score (MOS) on naturalness, intelligibility, and vowelization for subjective evaluation.

5 Results and Discussion

In Tables 4, 5, 6, and 7, we report the results of the different NLP and speech related tasks. In the sections below, we summarize the results and the challenges specific to each task group. The last column, ∆, represents the difference between SOTA and zero-shot performance.

5.1 Word Segmentation, Syntax and Information Extraction

In the first part of Table 4, we report the token classification (sequence tagging) tasks. For almost all tasks, the performance is below the SOTA. The prompt design and the post-processing were more challenging for these tasks. For the segmentation task, we observe that rare words are broken down into two or more sub-word tokens: instead of being segmented into linguistically valid prefixes, stems, and suffixes, they are split into arbitrary sub-word pieces, which are not accurate.

For the NER task, we noticed that the model misses tokens in some instances and tends to predict extra tokens in other instances. Such errors lead to misalignment between the inputs and the outputs, which affects the metric calculation. We deal with this issue by either truncating the prediction or padding it with the O class, depending on the length of the ground truth. We have similar observations for lemmatization and parsing.

For MT, the results in Table 5 indicate the shortcomings of these large models when tested on standard and dialectal Arabic. From the reported measures, we notice that ChatGPT is outperformed by SOTA techniques. When investigated further, we observed that ChatGPT is penalized most of the time for inserting additional content in its response (illustrated in the examples below). This is often seen in the MSA MT test sets, especially in the Bible evaluation set, due to their availability on the web. For example:

Input: [Arabic source sentence from the Bible test set]
Output: Who is the greatest in the kingdom of heaven? Who Is the Most Important Person in the Kingdom?
Input: [Arabic source sentence from the Bible test set]
Output: Marriage and Divorce, Jesus Teaches About Divorce

Such behavior indicates that the test data is contaminated, as the model might have ingested the data during training. Furthermore, the findings from the zero-shot setting show that the ChatGPT model performs better on the Gulf and Egyptian dialects, with better BLEU scores than the overall MSA. This behavior can be attributed to the lack of dialectal representation in the LLM, which may keep it from hallucinating additional content. For the Media genre, it is clear that conversational content is much harder to translate in general, and even more so for the dialectal content.

5.2 Sentiment, Stylistic and Emotion Analysis

In the second group of Table 4, we report results for sentiment, emotion, and stance. The datasets for these tasks are tweet classification tasks. We observe that performance is below SOTA by margins between 19% and 58%. For these types of tasks, the model often provided additional text with the label, for example, "Sentiment: Positive (because of the laughing emoji)". Providing the reason for the class label is in fact useful, but post-processing was needed in such cases to match the gold label.

5.3 News Categorization

For the news categorization experiments, we used four different datasets consisting of news articles in a multiclass classification setting. Across all datasets, zero-shot performance is lower than the current SOTA. As can be seen in Table 4, the gaps vary significantly, ranging from 5% to 25%. Like for other tasks, we needed to post-process the output labels, as the API returned additional tokens. In many cases, we observed that the API returned a message such as "content is in Arabic" without providing any label. We also observed that it returns additional labels, which may be expected, as a news article may contain information representing multiple labels.
Task Dataset Metric Zero-shot SOTA ∆
Word Segmentation, Syntax and Information Extraction
Segmentation Samih et al. (2017) Acc (Avg) 0.688 0.931 0.243
Lemmatization WikiNews Acc 0.530 0.973 0.443
Diacritization WikiNews WER 0.308 0.045 -0.263
Diacritization Darwish et al. (2018) WER 0.367 0.031 -0.336
POS WikiNews Acc 0.810 0.953 0.143
POS Samih et al. (2017) Acc 0.379 0.892 0.513
POS XGLUE (Arabic) Acc 0.520 0.686 0.166
Parsing Conll2006 UAS 0.239 0.796 0.557
Dialect QADI Macro-F1 0.070 0.600 0.530
NER ANERcorp Macro-F1 0.185 0.550 0.365
NER AQMAR Macro F1 0.180 0.690 0.510
NER QASR Macro-F1 0.102 0.637 0.535
Sentiment, Stylistic and Emotion Analysis
Sentiment ArSAS Macro-F1 0.550 0.760 0.210
Emotion SemEval2018 (Task E-c) JS 0.395 0.541 0.146
Stance Unified-FC Macro-F1 0.232 0.558 0.326
Stance Khouja (2020) Macro-F1 0.620 0.767 0.147
News Categorization
News (Tweets) ASND Macro-F1 0.512 0.770 0.258
News articles SANAD/Akhbarona Acc 0.730 0.940 0.210
News articles SANAD/AlArabiya Acc 0.922 0.974 0.052
News articles SANAD/AlKhaleej Acc 0.864 0.969 0.105
Demographic/Protected Attributes
Name Info ASAD Weighted-F1 0.570 0.530 -0.040
Location UL2C Macro-F1 0.339 0.881 0.542
Gender ArabGend Macro-F1 0.390 0.820 0.430
Ethics and NLP: Factuality, Disinformation and Harmful Content Detection
Offensive lang. OffensEval 2020 Macro-F1 0.460 0.910 0.450
Hate Speech OSACT 2020 Macro-F1 0.430 0.820 0.390
Adult Content ASAD Macro-F1 0.460 0.880 0.420
Spam ASAD Macro-F1 0.440 0.989 0.549
Subjectivity In-house Macro-F1 0.670 0.730 0.060
Propaganda WANLP23 Micro-F1 0.353 0.649 0.296
Checkworthiness CT–CWT–22 F1 (POS) 0.526 0.628 0.102
Factuality CT–CWT–22 Weighted-F1 0.103 0.831 0.728
Claim CT–CWT–22 Acc 0.703 0.570 -0.133
Harmful content CT–CWT–22 F1 (POS) 0.471 0.557 0.086
Attention-worthy CT–CWT–22 Weighted-F1 0.258 0.206 -0.052
Factuality Unified-FC Macro-F1 0.306 - -
Claim Khouja (2020) Macro-F1 0.036 0.643 0.607
Semantics
Paraphrasing BTEC Fluency 0.946 0.972 0.026
Faithfulness 0.835 0.916 0.081
STS STS2017.eval.v1.1-Track 1 PC 0.789 0.744 -0.045
STS STS2017.eval.v1.1-Track 2 PC 0.808 0.749 -0.059
STS QS (Q2Q) Mawdoo3 Q2Q Micro-F1 0.895 0.959 0.064
XNLI (Arabic) XNLI Acc 0.489 0.648 0.159
Question answering (QA)
QA ARCD F1 0.502 0.501 -0.001
QA MLQA F1 0.376 0.584 0.208
QA TyDi QA F1 0.480 0.820 0.340
QA XQuAD F1 0.442 0.648 0.206
Table 4: Results on different tasks and datasets using zero-shot prompts. QS: Question similarity, PC: Pearson
Correlation, Conv. Text: Conversational text; JS: Jaccard Similarity. ∆ column shows the performance difference
between SOTA and ChatGPT.
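Table 4 reports JS (Jaccard similarity) for the multilabel emotion task. One common sample-averaged formulation is sketched below; whether this exact variant matches the official task scorer is an assumption, and the labels in the toy example are made up.

def jaccard_score_multilabel(gold_sets, pred_sets):
    """Mean Jaccard similarity over samples for multilabel predictions.
    gold_sets / pred_sets: lists of label sets, one per tweet."""
    total = 0.0
    for gold, pred in zip(gold_sets, pred_sets):
        union = gold | pred
        total += (len(gold & pred) / len(union)) if union else 1.0
    return total / len(gold_sets)

# Toy example (made-up labels):
gold = [{"joy", "optimism"}, {"anger"}]
pred = [{"joy"}, {"anger", "disgust"}]
print(round(jaccard_score_multilabel(gold, pred), 3))  # (0.5 + 0.5) / 2 = 0.5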
Corpus Dia. SC City #Sent Zero-shot SOTA ∆
APT LEV lv - 1000 3.13 21.90 18.77
APT Nile eg - 1000 3.64 22.60 18.96
MADAR Gulf iq Baghdad 2000 27.60 29.10 1.50
MADAR Gulf iq Basra 2000 27.75 29.00 1.25
MADAR Gulf iq Mosul 2000 27.28 31.30 4.02
MADAR Gulf om Muscat 2000 34.29 39.50 5.21
MADAR Gulf qa Doha 2000 26.92 29.30 2.38
MADAR Gulf sa Jeddah 2000 27.66 29.40 1.74
MADAR Gulf sa Riyadh 2000 35.84 40.70 4.86
MADAR Gulf ye Sana’a 2000 27.12 31.40 4.28
MADAR LEV jo Amman 2000 22.79 35.10 12.31
MADAR LEV jo Salt 2000 27.43 34.90 7.47
MADAR LEV lb Beirut 2000 16.97 23.70 6.73
MADAR LEV ps Jerusalem 2000 28.24 33.60 5.36
MADAR LEV sy Aleppo 2000 27.31 34.30 6.99
MADAR LEV sy Damascus 2000 25.34 33.10 7.76
MADAR MGR dz Algiers 2000 16.89 21.30 4.41
MADAR MGR ly Benghazi 2000 28.26 32.00 3.74
MADAR MGR ly Tripoli 2000 23.21 25.90 2.69
MADAR MGR ma Fes 2000 25.38 29.90 4.52
MADAR MGR ma Rabat 2000 17.85 23.10 5.25
MADAR MGR tn Sfax 2000 13.41 13.80 0.39
MADAR MGR tn Tunis 2000 10.39 16.00 5.61
MADAR MSA ms - 2000 26.69 43.40 16.71
MADAR Nile eg Alexandria 2000 34.23 38.30 4.07
MADAR Nile eg Aswan 2000 24.06 30.40 6.34
MADAR Nile eg Cairo 2000 26.82 32.90 6.08
MADAR Nile sd Khartoum 2000 32.62 39.00 6.38
MDC LEV jo - 1000 3.35 17.70 14.35
MDC LEV ps - 1000 3.03 15.30 12.27
MDC LEV sy - 1000 3.28 19.90 16.62
MDC MGR tn - 1000 2.54 13.90 11.36
MDC MSA ms - 1000 4.88 20.40 15.52
Media Gulf om - 467 4.52 19.60 15.08
Media LEV lb - 250 3.58 16.80 13.22
Media MGR ma - 526 2.45 9.60 7.15
Media MSA ms - 637 9.26 29.70 20.44
Media MSA ms - 621 8.94 35.60 26.66
Bible MGR ma - 600 5.06 28.80 23.74
Bible MGR tn - 600 6.86 29.20 22.34
Bible MSA ms - 600 9.27 33.20 23.93
Bible MSA ms - 600 8.08 29.20 21.12
Table 5: Results (BLEU score) on machine translation for different datasets using zero-shot prompts. The best result per row is boldfaced. The ∆ column shows the performance difference between SOTA and ChatGPT.
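Table 5 reports BLEU scores. The exact scorer and settings behind these numbers are not restated here, but a typical corpus-level computation with the sacrebleu toolkit looks like the following sketch; the file names are placeholders.

import sacrebleu

# Placeholder inputs: one hypothesis and one reference translation per line.
with open("hypotheses.en") as f:
    hyps = [line.strip() for line in f]
with open("references.en") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # hypotheses, list of reference streams
print(f"BLEU = {bleu.score:.2f}")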
5.4 Demographic/Protected Attributes

In these tasks, the model was asked to predict the country of origin for person names extracted from Wikipedia, map user locations (extracted from Twitter) to one of the Arab countries, and predict the gender of person names (extracted from Twitter). From Table 4, we observe that the model struggles with user-generated content from Twitter as opposed to the formal data from Wikipedia. In a few cases, the model provides messages indicating that it is unable to provide the expected output, e.g., "It is not possible to determine the gender of a person based solely on their name". In location prediction, although the prompt asked the model to give only a single country code in ISO 3166-1 alpha-2 format without explanation, in many cases the model generated outputs in many formats, with additional country names, e.g., "bahrain (bh)", "البحرين:bh", "muscat - om", "dz (algeria)", "others (palestine)", and "unk", which required post-processing code to standardize its output.

5.5 Ethics and NLP: Factuality, Disinformation and Harmful Content Detection

Our results in Table 4 show that ChatGPT generally struggled with the tasks under this category, with its lowest performance being on the claim factuality detection task in the zero-shot setup. This is expected, given that in the majority of the tasks the model is operating over tweets, which are very short, usually informal, and often dialectal in the Arab world. The tasks themselves are generally challenging, requiring deep contextual analysis, reasoning abilities, and, in many cases, domain knowledge. For instance, determining a claim's veracity is a very intensive process that usually requires reasoning over information from multiple sources and modalities (e.g., text and audio), with some sources not even available online for models to access and use (Nakov et al., 2021; Das et al., 2023) (e.g., witness testimonies to an event). Although for this task we prompted the model to return Yes/No predictions of a claim's truthfulness, sometimes it explicitly expressed its shortcomings in predicting for such a complex task, responding with statements like: "not enough context provided to determine if the information is correct".

Issues to consider while handling ChatGPT's responses were not limited to parsing responses for the sensitive category of tasks we are working with. Some of our tasks inherently require the model to operate over offensive language, profanities, adult content, etc. Such language generally goes against OpenAI's content management policy²² followed by ChatGPT. In many instances, ChatGPT raised an error regarding the type of language used in the text we were sending its way, and did not return a prediction. This raises a question on how developers can employ such models over user-generated content that is expected to contain "unacceptable" language.

During our experiments, it was interesting to see ChatGPT failing to provide predictions in several cases, specifically mentioning that "the text is in arabic and i am not programmed to". Such instances demonstrate a need for a deeper understanding of the model's abilities on lower-resource languages like Arabic. It further poses an opportunity to study ways to improve the training of such LLMs for Arabic.

²² https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/content-filter

5.6 Semantics

The results for the different semantic tasks, reported in the second-to-last part of Table 4, show that the performance (Pearson correlation) for STS (Tracks 1 and 2) is higher than SOTA. The performance for the paraphrasing and XNLI tasks is lower.

5.7 Question Answering (QA)

In the last part of Table 4, we report the QA results on four different tasks. For ARCD, the model achieved a score higher than that of SOTA by a small margin. However, for the other QA tasks under study, the model did not perform well.

5.8 Speech Recognition and Synthesis

In Table 6, we report the performance of ASR using different datasets and models. We observe that Google's Universal Speech Model (USM) outperforms OpenAI's Whisper on all the datasets. The USM model also performs comparably to the standard task- and domain-specific ASR systems and is better equipped to handle cross-language and dialectal code-switching data from unseen domains compared to the SOTA models. It should be noted that the reported results, for both the USM and Whisper models, can be further improved with better model-specific post-processing to reduce the penalization of non-semantic differences between the reference and the hypothesis.

As for the text-to-speech task, we evaluated the transformer-based models, with 20 test sentences, using both subjective and objective evaluation metrics. Three native speakers evaluated each sentence on a 5-point scale: 5 for excellent, 4 for good, 3 for fair, 2 for poor, and 1 for bad. We normalized the results to scores out of 10, as shown in Table 7 (a small sketch of this normalization appears after Table 6). From the objective evaluation, we noticed that Amazon Polly is significantly better in WER and CER; however, the human evaluators preferred the KANARI models for better diacritization. As for the rest, both models performed comparably. We plan to increase the number of sentences to improve coverage and to consider other available TTS systems, such as Google TTS and ReadSpeaker, in the future.

6 Findings and Limitations

6.1 Findings

Our experimental results suggest a big gap in the performance of the LLM (ChatGPT) in comparison to the SOTA in zero-shot settings for most of the tasks. We observed only a handful of tasks where it outperformed SOTA in this challenging setting. Moreover, the LLM performance varies significantly between the MSA and dialectal test sets. For example, POS accuracy is 0.810 versus 0.379 on MSA and dialects respectively, which indicates a large gap in the LLM for low-resource languages/dialects. This performance gap can also be attributed to the lack
Dataset (Domain/Sampling Rate, Dialect): zero-shot WER for W.Small / W.Medium / W.Large-v2 / USM; SOTA supervised WER for K.Offline / K.Streaming

MGB2 (Broadcast 16kHz, MSA): 46.70 / 33.00 / 26.20 / 15.70; 11.40 / 11.90
MGB3 (Broadcast 16kHz, EGY): 83.20 / 65.90 / 55.60 / 22.10; 21.40 / 26.70
MGB5 (Broadcast 16kHz, MOR): 135.20 / 116.90 / 89.40 / 51.20; 44.10 / 49.20
QASR.CS (Broadcast 16kHz, Mixed): 63.60 / 48.90 / 37.90 / 27.80; 23.40 / 24.90
DACS (Broadcast 16kHz, MSA-EGY): 61.90 / 48.70 / 34.20 / 14.30; 15.90 / 21.30
ESCWA.CS (Meeting 16kHz, Mixed): 101.50 / 69.30 / 60.00 / 45.70; 49.80 / 48.00
CallHome (Telephony 8kHz, EGY): 155.90 / 113.70 / 78.70 / 54.20; 45.8* / 50.90

Table 6: Word Error Rate (WER) on ASR for different domain and dialect datasets in the zero-shot setup and the domain-specific ASR setup. W. stands for OpenAI's Whisper models (Radford et al., 2022), USM is the Universal Speech Model from Google (Zhang et al., 2023), and K. stands for the KANARI models. * indicates that the model's input is at an 8kHz sampling rate and that the offline model was re-trained to accommodate telephony data. SOTA supervised represents fully supervised models trained with domain-specific data.
Semantic Question Similarity

Are the two questions below semantically similar? The output should be exactly in form yes or no.

A.7 Factuality, Disinformation, Harmful content

Subjectivity

system
You are an AI assistant that helps people find information.
text
Classify the sentence as Subjective or Objective. Provide only label.
sentence: {sentence}
label:

Checkworthiness

Classify the sentence as checkworthy or not checkworthy. Provide only label.

Claim detection

Does this sentence contain a factual claim? Answer only by yes or no.

A.10 Factuality, Disinformation, Harmful content

Offensive Language

if the following sentence is offensive, just say "OFF", otherwise, say just "NOT_OFF" without explanation:
{inputSentence}

Hate Speech

if the following sentence has hate speech, just say "HS", otherwise, say just "NOT_HS" without explanation:
{inputSentence}

Adult Content

Classify the following sentence as adult language (the language used in adult advertisement and porno advertisement) or not adult language without illustruation. Sentence: xx Label: xx

Spam

Paraphrasing

rephrase this Arabic sentence: {inputSentence}

POS

these are the segmentation and POS tags for a sample sentence:
[Arabic example sentence with its word-by-word segmentation and POS tags]
get the segmentation and POS tags for this sentence: {inputSentence}

Lemmatization

Diacritization

Parsing
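The prompts above are sent to the model as plain chat instructions, as described in Section 4.1. The following minimal sketch shows such a zero-shot call using the pre-1.0 openai Python SDK against an Azure OpenAI deployment; the deployment name, API version, and environment variables are illustrative assumptions rather than the exact configuration used in our experiments.

import os
import openai

# Assumed Azure OpenAI configuration (values are placeholders).
openai.api_type = "azure"
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
openai.api_key = os.environ["AZURE_OPENAI_KEY"]
openai.api_version = "2023-05-15"  # assumed API version

def zero_shot_offensive(sentence: str) -> str:
    """Zero-shot offensive-language classification using one of the prompts above."""
    prompt = (
        'if the following sentence is offensive, just say "OFF", '
        'otherwise, say just "NOT_OFF" without explanation:\n' + sentence
    )
    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo",  # hypothetical deployment name
        messages=[
            {"role": "system", "content": "You are an AI assistant that helps people find information."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"].strip()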