0% found this document useful (0 votes)

67 views23 pages

Survey Sign Language Production 2023

This document provides a survey of recent advances in Sign Language Production (SLP) using deep learning. It begins with an introduction to sign language and its importance as the dominant form of communication for deaf communities. It then discusses the need for bidirectional sign language translation systems between spoken and sign languages to improve communication. The survey reviews recent work on the two main components of such systems: Sign Language Recognition (SLR) and Sign Language Production (SLP). It focuses on advances in SLP using deep learning and provides an overview of challenges, methods, frameworks and evaluation for SLP research.

Uploaded by

Nguyễn Koi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

67 views23 pages

Survey Sign Language Production 2023

Uploaded by

Nguyễn Koi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 23

Expert Systems With Applications 243 (2024) 122846

Contents lists available at ScienceDirect

Expert Systems With Applications

journal homepage: www.elsevier.com/locate/eswa

Review

A survey on recent advances in Sign Language Production

Razieh Rastgoo a ,∗, Kourosh Kiani a , Sergio Escalera b , Vassilis Athitsos c , Mohammad Sabokrou d
a
Electrical and Computer Engineering Department, Semnan University, Semnan, 3513119111, Iran
b
Department of Mathematics and Informatics, University of Barcelona and Computer Vision Center, Spain
c
University of Texas at Arlington, TX 76019, United States
d
Institute for Research in Fundamental Sciences (IPM), Tehran, 193955746, Iran

ARTICLE INFO ABSTRACT

Keywords: Sign Language is the dominant form of communication language used in the Deaf and hearing-impaired
Sign Language Production community. To make easy and mutual communication between the hearing-impaired and the hearing
Sign Language Recognition communities, building a robust system capable of translating the spoken language into sign language and
Sign Language Translation
vice versa is fundamental. To this end, sign language recognition and production are two necessary parts for
Deep learning
making such a two-way system. Sign language recognition and production need to cope with some critical
Survey
Deaf
challenges. In this survey, we review recent advances in Sign Language Production (SLP) and related areas
using deep learning. To have more realistic perspectives to sign language, we present an introduction to the
Deaf culture, Deaf centers, the psychological perspective of sign language, and the main differences between
spoken language and sign language. Furthermore, we present the fundamental components of a bi-directional
sign language translation system, discussing the main challenges in this area. Also, the backbone architectures
and methods in SLP are briefly introduced and the proposed taxonomy of SLP is presented. Finally, a general
framework for SLP and performance evaluation, and also a discussion on the recent developments, advantages,
and limitations of SLP, commenting on possible lines for future research are presented.

1. Introduction people in society. According to the World Health Organization (WHO)

report in 2020, there are more than 466 million deaf people in the
Sign Language is the dominant form of communication language world (WHO: World Health Organization, 2022) using different forms
used in Deaf society (Rodriguez-Moreno, Martinez-Otzeta, & Sierra, of sign languages, such as American Sign Language (ASL) (Vicars,
2023). Most of the people in the hearing and deaf communities are not 2022), Argentine Sign Language, (Ethnologue, 2022a), Polish Sign Lan-
familiar with sign language. To touch on the necessity of this language guage (Salto, 2022), German Sign Language (Hanke, 2022), Greek Sign
for both hearing and Deaf people, let us imagine you are in a grocery Language (Ethnologue, 2022b), Spanish Sign Language (ASL, 2022),
store. What will happen if a deaf person asks you to help him/her? Chinese Sign Language (CSL, 2022), Korean Sign Language (Owlcation,
(see Fig. 1). This is a challenging situation. If you do not know sign 2022), Arabic Sign Language (Luqman & Sabri, 2019), and Persian Sign
language, it would be useful to be able to use an application to translate Language (Ethnologue, 2022c), just to mention a few. This report shows
a spoken language into sign language and vice versa. This example is the necessity of study, research, and technology development in this
just one situation among many others in which being familiar with area.
sign language is useful. Actually, developing efficient bidirectional sign
In general terms, the translation task is an important component
language translation systems requires expertise in a wide range of
in recent technologies developed by multinational institutions to aid
fields, including Computer Vision (CV), Computer Graphics (CG), Nat-
in internal and external communications (Artetxe & Schwenk, 2019;
ural Language Processing (NLP), Human-Computer Interaction (HCI),
Bragg et al., 2019). In more detail, there are two specific translation
Linguistics, and Deaf culture. While the vast majority of communica-
tasks in a bidirectional sign language translation system: Sign Language
tion technologies have been developed for spoken/written language,
Recognition (SLR) and Sign Language Production (SLP). The former is
sign languages have been excluded from most of these technologies.
Furthermore, most hearing people do not know sign language. The defined as a translation task from sign language into spoken language.
results will be the existence of many communication barriers for Deaf However, this task can be configured as a vision-based recognition task.

∗ Corresponding author.
E-mail addresses: rrastgoo@semnan.ac.ir (R. Rastgoo), kourosh.kiani@semnan.ac.ir (K. Kiani), sergio@maia.ub.es (S. Escalera), athitsos@uta.edu
(V. Athitsos), sabokro@ipm.ir (M. Sabokrou).

https://doi.org/10.1016/j.eswa.2023.122846
Received 30 December 2022; Received in revised form 3 June 2023; Accepted 2 December 2023
Available online 9 December 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

information (Stoll, Camgoz, Hadfield, & Bowden, 2020). Proposed

systems in SLR generally map signs into the spoken language in the
form of text transcription or speech (Rastgoo et al., 2021c). However,
SLP systems perform the reverse procedure. There are some accurate
and well-detailed surveys in the SLR (Ghanem, Conly, & Athitsos,
2017; Rastgoo et al., 2021c) but only one detailed discussion has been
presented in SLP (Rastgoo et al., 2021d). Here, we extend our previous
survey (Rastgoo et al., 2021d) and review recent advances in SLP and
related areas using deep learning. To have more realistic perspectives
to sign language, we present an introduction to the Deaf culture,
Deaf centers, the psychological perspective of sign language, and the
main differences between spoken language and sign language. Fur-
thermore, we present the fundamental components of a bi-directional
sign language translation system, discussing the main challenges in this
area. Also, the backbone architectures and methods in SLP are briefly
Fig. 1. A scenario where a deaf person asks for help (Auslan Language, 2022). introduced and the proposed taxonomy of SLP is presented. Finally,
a general framework for SLP and performance evaluation, and also a
discussion on the recent developments, advantages, and limitations of
SLP, commenting on possible lines for future research are presented.
The latter translates the spoken language into sign language. Both SLR
The remainder of this paper is organized as follows.
and SLP tasks are fundamental to make a bidirectional translation sys-
In Section 2, we present an introduction to the Deaf culture, some
tem applicable to real-life applications. Compared to other languages,
of the most familiar Deaf centers, the psychological perspective of
the translation task is more challenging in sign language. Some factors
sign language, the main differences between spoken language and sign
that contribute to such complexity are the unfamiliarity of people with
language, and also the fundamental components of a bi-directional
sign language, complex patterns in different signs, lack of a specific
sign language translation system. In Section 3, we dive into the SLP
standard for sign languages, and the challenges corresponding to vision-
problem, discussing the main challenges in this area. The backbone
based tasks. Furthermore, the visual variability of signs is challenging,
architectures and methods in SLP are briefly introduced in Section 4.
as it is affected by hand shape, palm orientation, movement, location,
Our taxonomy on SLP is presented in Section 5. A general framework
facial expressions, and other non-hand signals. These differences in
for SLP and performance evaluation, and also a conclusion on the
sign appearance produce a large intra-class variability and low inter-
recent developments, advantages, and limitations in SLP, commenting
class variability. This makes it hard to provide a robust and universal
on possible lines for future research are presented in Sections 6 and 7
system capable of recognizing different sign types. Another challenge is and 8, respectively.
developing a photo-realistic SLP system to generate the corresponding
sign digit/word/sentence from a text/voice in spoken language in a 2. A brief introduction to some concepts in SLP
real-world situation. The challenge corresponding to the grammatical
rules and linguistic structures of sign language needs to be considered. We introduce Deaf culture in this section. Some of the most familiar
Translation between spoken and sign language is a complex prob- Deaf centers are also presented. After that, we briefly discuss the psy-
lem. This is not a simple mapping problem from text/voice to signs chological perspective of sign language as well as the main differences
word by word. This challenge comes from the differences between between spoken language and sign language. Finally, we present the
the tokenization and ordering of words in spoken and sign languages. fundamental components of a bi-directional sign language translation
Another complex condition is related to the application area. Most of system.
the applications in sign language focus on sign language recognition in
different areas, such as robotics (Dawes, Penders, & Carbone, 2018), 2.1. Deaf culture
HCI (Bachmann, Weichert, & Rinkenauer, 2018), education (Darabkh,
Alturk, & Sweidan, 2018), computer games (Roccetti, Marfia, & Semer- Deaf culture is the heart of the Deaf community. Language and
aro, 2012), recognition of children with autism (Cai, Zhu, Tien Wu, culture are intertwined and inseparable passed down through gener-
Liu, & Hu, 2018), automatic sign-language interpretation (Yang, 2014), ations of Deaf people. The Deaf community is not based on geographic
decision support for medical diagnosis of motor skills disorders (Butt vicinity. Generally, it contains two groups: the culturally Deaf people
et al., 2018), home-based rehabilitation (Cohen, Voldman, Regazzoni, and the other individuals who use sign language. Actually, the Deaf
& Vitali, 2018; Morando, Ponte, Ferrara, & Dellepiane, 2018), and community brings together these two groups. Furthermore, the intu-
virtual reality (Vaitkevičius et al., 2019). This is due to the misun- ition behind the Deaf culture is that it helps the Deaf people who are
derstanding of hearing people. Most of them think that deaf people educated at residential Deaf schools to develop their own Deaf network
are much more comfortable with reading spoken language; therefore, once they graduate and keep in touch with everyone. Most of these Deaf
it is not necessary to translate the reading spoken language into sign people take on leadership positions in the Deaf community, organize
language. This is not true since there is no guarantee that a deaf person Deaf sports, community events, etc, and become the core of the Deaf
is familiar with the reading and writing forms of a spoken language. community. A key point is that this community needs to be sure that
Generally, these two forms of language are completely different from their language and heritage are passed on to other peers and the next
each other. Most sign languages, do not have a standard written form. generation. They also form connections with parents and siblings of
Nevertheless, despite the aforementioned challenges, some meth- Deaf children to extend the community circle for Deaf children.
ods have been proposed with different degrees of success in both Suppression of sign language communication is a sample of cruelty
SLR and SLP (Rastgoo, Kiani, & Escalera, 2021c; Rastgoo, Kiani, Es- against the Deaf community. One example is dated back to 1889 when
calera, & Sabokrou, 2021d). While SLR has rapidly advanced in re- an international congress of largely hearing educators of Deaf students
cent years (Rastgoo, Kiani, & Escalera, 2018, 2020a, 2020b, 2021a, announced that sign language should be substituted with spoken lan-
2021b; Rastgoo et al., 2021c; Rastgoo, Kiani, & Escalera, 2022y, guage (Lane, 2017). Afterward, oralism, as a teaching system for Deaf
2022z; Rastgoo-VQA et al., 2023x), SLP is still a very challenging people to communicate using speech and lip-reading instead of sign
problem, involving an interpretation between visual and linguistic language, was extensively enforced. Since then, Deaf communities have

2
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Fig. 2. A bi-directional sign language translation system.

struggled to use sign languages in schools, work, and public life (Geers interact with the Deaf community only on a professional basis. So, this
et al., 2017). Furthermore, linguistic advancements have assisted sign issue can be considered for hardware implementations of the proposed
languages so that they can be used as natural languages (Stokoe, systems in SLR and SLP. Developing such a system can make room
Casterline, & Croneberg, 1965). Also, the role of the legislation cannot for the Deaf community to overcome communication barriers and help
be ignored in helping to establish legal support for sign language them to keep motivated.
education and usage (UN General Assembly, 2006). Considering this
historical struggle can help the researchers to have a sense of the 2.4. Sign language vs. spoken language
necessity of the translation and recognition systems for sign languages
applicable to the real-life of the Deaf community (Bragg et al., 2019). Generally, there are two main types of languages used in the com-
munity: spoken and sign. While these two language types are different
2.2. Deafness centers from each other, both of them should be viewed as natural languages.
The main difference between them refers to the way that they convey
Some common Deafness centers and open sources of data related to information. The spoken language is understood as an auditory/vocal
the Deaf community are listed below: language. It can also be considered as an oral language. The various
sound patterns are used to convey a message. There are many linguistic
• World Health Organization (WHO) (WHO, 2022), elements in the spoken language, such as vowels, consonants, and
• National Institute on Deafness and Other Communication Disor- tones. Making changes in these elements can lead to different meanings
ders (NIDCD) (NIDCD, 2022), for the same set of words in the spoken language. In contrast to spoken
• Centers of Disease Control and Prevention (CDC, 2022), languages, gestures and facial expressions play key roles to convey
• National Deaf Center (NDC) (NDC, 2022), information in sign languages instead of vocal tracts. There are different
• Hearing, Speech, and Deaf Center (HSDC) (HSDC, 2022), sign languages in the world. Some of them are better known, such
• Center for Hearing and Deaf Services (HDS) (HDS, 2022), as American Sign Language (ASL). In every country, there are one
• Deaf and Hard of Hearing Program (DHHP, 2022), or more sign languages used by the Deaf community. While people
• Manchester Centre for Audiology and Deafness (ManCAD) (Man- think that sign languages have derived from spoken languages, they
CAD, 2022), are independent of natural languages that have evolved over time. Sign
• Northern Virginia Resource Center for Deaf and Hard of Hearing language is a complex language that has specific linguistic properties.
Persons (Virginia, 2022), SLR is affected by the structural properties of sign language and occurs
• National Center on Deaf-Blindness (NCDB) (NCDB, 2022). faster than spoken language recognition. While signs are articulated
slower than spoken words, the proposition rate for sign and speech
These centers aim to provide educational, clinical, and research is identical. It should be noted that both languages can be used to
services to the Deaf community. convey all sorts of information, such as news, conversations about daily
activities, stories, narrations, etc.
2.3. Psychological perspective of sign language
2.5. Bi-directional sign language translation system
As we stated before, developing an efficient bidirectional sign lan-
guage translation system requires the study of a wide range of fields, As already discussed, to make a bi-directional sign language transla-
including CV, CG, NLP, HCI, Linguistics, and Deaf culture. To this tion system, we need a system capable of translation from sign language
end, we present a brief discussion of the findings from developmental into a spoken language (SLR) and vice versa (SLP) (see Fig. 2). While
psychology, psycho-linguistics, cognitive psychology, and neuropsycho- SLR has rapidly advanced in recent years, SLP is still a challenging
logical studies. Recent studies of attention and perception show that problem. Since the details of SLR have been presented in some accurate
usage of sign language from an early age can boost some aspects of non- and well-detailed surveys (Ghanem et al., 2017; Rastgoo et al., 2021c),
language visual perception, such as motion perception. Furthermore, in this survey, we focus on SLP details of this bi-directional system and
neuropsychological and functional imaging studies indicate that left present more details of recent works in the SLP.
hemisphere regions are important in both sign and spoken language
processing. Aphasia can be occurred in signers due to left hemisphere 2.6. Conclusion
damage. Also, the existence of different modalities for language ex-
pression, such as oral–aural and manual–visual, makes room to explore In this section, we briefly reviewed some concepts related to the
different characteristics of human languages. Deaf community and its language. It is worth mentioning that sign lan-
From a pathological perspective, Deaf people have different degrees guage, as a language used in the Deaf community, should be viewed as
of hearing deviations from the standard/norm hearing level defined a natural language, similar to spoken language. To have bi-directional
for hearing people. Generally, four levels of deafness are defined: communication between hearing and Deaf people, we need a system
mild, moderate, severe, and profound hearing loss. This perspective is capable of translating from sign language into spoken language and vice
traditionally acceptable by a majority of non-deaf professionals who versa. Considering our findings in this section, it is necessary to know

3
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

that the Deaf community unifies two groups, the culturally Deaf people exposure can lead to mental flexibility, creative thinking, and com-
and the other individuals who use sign language, in the same group munication advantages (Hamers, 1998). Historically, sign language
to help Deaf people. Furthermore, the data from Deaf centers can be has not been incorporated in the education of Deaf children (Grosjean,
used for providing professional analysis of the Deaf community. To this 2010; Humphries, 2013; Swanwick, 2010). Some earlier sign languages
end, we need to also consider psychology, psycho-linguistics, cognitive were not natural. They just used signing to deliver the content for the
psychology, and neuropsychological findings in the Deaf community. Deaf individuals who failed within an oral-only approach. The dissat-
isfaction with the educational outcomes of the Deaf community was
3. SLP led to a bilingual design that placed sign language at the same level as
spoken/written language. To develop functional and bilingual systems,
SLP is one of the main components of a bidirectional sign language the full development of two languages is crucial. In such systems,
translation system. This system can be used to facilitate easy and the social and academic functions of both languages are considered.
clear communication between the hearing and the Deaf communities. Furthermore, the consistent and strategic usage of them is promoted in
Furthermore, the necessity of such systems can also be considered as a the environment. The final goal is to deliver content instruction in both
psychological perspective for the Deaf community. In this section, we languages making it a viable design for Deaf children (Gárate, 2014).
present more details on SLP. Application area: Millions of Deaf and hearing-impaired people
live across the world. The predominant part of them lives in low-income
3.1. Problem definition and developing countries with low access to suitable ear and hearing
care services. While hearing loss makes many difficulties in the corre-
The task of SLP can be defined as a video generation process from sponding community, many mainsprings of it can be prevented through
an input text. In more detail, given a spoken language sentence, 𝑆 𝑁 = public health measures. Rehabilitation, education, empowerment, and
{𝑤1 , 𝑤2 , … , 𝑤𝑁 }, it is expected that the model generates a video with M communication technology usage are some of the main solutions to
frames, 𝑉 𝑀 = {𝐹1 , 𝐹2 , … , 𝐹𝑀 }, including a sign language video corre- solve the communication barriers for the hearing-impaired community
sponding to the input sentence. Generally, there are some intermediate and use the full potential of the Deaf and hearing-impaired people.
steps for the SLP task. During these steps, the input sentence from To this end, we present compact information regarding the community
the spoken language is encoded into some representations to generate affected by hearing loss to make a promising insight and humanitarian
more accurate videos. We will review the proposed models and also the motivation in a research community. Considering the scope of this sur-
intermediate steps in SLP in this survey. vey, this information can help to develop communication technologies
compatible with the needs of the hearing-impaired community.
3.2. Challenges According to the World Health Organization (WHO) report, 1.5
billion people live with some degree of hearing loss. It is predicted
Here, we discuss the most important challenges in SLR. that by 2050 approximately 2.5 billion people will have some degree
Interpretation between visual and linguistic information: SLP of hearing loss (Deafness and hearing, 2022). Some critical points need
is still a very challenging problem, involving an interpretation between to be considered for application development in this area:
visual and linguistic information (Stoll, Camgoz, et al., 2020). Proposed
1. Geography: 80% of people with hearing loss live in low-income
systems in SLR generally map signs into the spoken language in the
countries. These people cannot easily access assistive technolo-
form of text transcription (Rastgoo et al., 2021c). However, SLP systems
gies to improve their communication quality.
perform the reverse procedure. The challenges regarding mapping from
2. Age: Another challenge is the hearing loss outbreak with age.
the lingual domain into the visual domain still remain.
Age is an important predictor of hearing loss among adults aged
Visual variability of signs: The visual variability of signs is one
20–69, with the maximum amount of hearing loss in the 60
of the challenges in SLP, which is affected by hand shape, palm
to 69 age group. Nearly 25% of people older than 60 years
orientation, movement, location, facial expressions, and other non-
are affected by hearing loss. This challenge can be considered
hand signals. These differences in sign appearance produce a large
for adopting assistive technologies with the special physical and
intra-class variability and low inter-class variability. This makes it hard
mental situations of people older than 60 years. Nearly 15%
to provide a robust and universal system.
of American adults (37.5 million) aged 18 face hearing loss.
Photo-realistic SLP system: Another challenge is generating a
Furthermore, about 2 to 3 out of every 1000 American children
photo-realistic sign video from a text or voice in spoken language in
are affected by hearing loss. Considering the difference in hear-
a real-world situation. This is important because it helps the generated
ing loss definition in different ages, the age factor needs to be
videos to be truly understandable and accepted by Deaf communities.
considered for application development. Generally, hearing loss
Thanks to the previous models based on graphical avatars and also
greater than 40 (dB) and 30 dB is defined for adults and children,
recent neural SLP works that produce skeleton pose sequences, we need
respectively. Due to this difference and the other communication
systems that are understandable and acceptable to Deaf viewers (Saun-
requirements corresponding to different age groups, the age
ders, Camgöz, & Bowden, 2020c).
factor is an important factor for application development in the
The grammatical rules and linguistic structures of the sign
field.
language: The challenge corresponding to the grammatical rules and
3. Gender: According to the reports of WHO, men are approxi-
linguistic structures of sign language is another critical challenge in this
mately twice as likely as women to have hearing loss among
area. Translation between spoken and sign language is a complex task.
adults aged 20–69. This is due to the fact that men usually work
This is not a simple word-to-word mapping problem from text/voice
in louder environments.
to sign. Another issue is the synchronized multi-modal nature of sign
4. Sign language: As we discussed in the previous sections, most
language that needs simultaneously moving hands and faces to convey
Deaf people are not familiar with sign language. This makes it
lexical and grammatical information.
hard to develop communication tools for sign language transla-
Bilingual education: The brain has no preference for any type
tion.
of languages. The only preference of the brain is that it expects to
receive input from a complete and natural language. In this way, both Considering all of these critical points, developing effective ap-
spoken and sign languages can be used as inputs for the brain. Being plications is challenging. Although different applications have been
bilingual, as a positive and desirable quality, the Deaf community developed in recent years (Bangham et al., 2000; Dangsaart, Narue-
can follow similar developmental paths as monolinguals. This dual domkul, Cercone, & Sirinaovakul, 2008; Grieve, 1999, 2002; Hanke,

4
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

2004; Huenerfauth, 2004, 2005; Jemni et al., 2022; Kanis, Zahradil, 2007), increasing the kernel size, linearly fusing multiple scales (Den-
Jurčíček, & Müller, 2006; Karpouzis, Caridakis, Fotinea, & Efthimiou, ton, Chintala, Szlam, & Fergus, 2015; Mathieu, Couprie, & LeCun,
2020; Veale & Conway, 1994; Zhao, Kipper, Schuler, Vogler, & Palmer, 2016), using dilated convolutions to include long-range spatial depen-
2000; Zij & Barker, 2003), more endeavor is necessary to develop real- dencies (Yu, Koltun, & Funkhouser, 2017), extending the receptive
time applications for bi-directional translation from sign language to fields (Chen & Koltun, 2017; Luo, Li, Urtasun, & Zemel, 2016), sub-
spoken language and vise versa. Furthermore, most of the applications sampling, or using residual connections (He, Zhang, Ren, & Sun, 2016;
in sign language focus on the recognition task, such as robotics (Dawes Villegas, Yang, Hong, Lin, & Lee, 2017). Another challenge is the lack of
et al., 2018), HCI (Bachmann et al., 2018), education (Darabkh et al., temporal learning corresponding to the image sequences. To properly
2018), computer games (Roccetti et al., 2012), recognition of children address this challenge, 3D convolutions are used as a promising alterna-
with autism (Cai et al., 2018), automatic sign-language interpreta- tive to recurrent modeling. Several models have been proposed to sign
tion (Yang, 2014), decision support for medical diagnosis of motor skills
language using 3D convolutions (hammadi et al., 2020; Rastgoo et al.,
disorders (Butt et al., 2018), home-based rehabilitation (Cohen et al.,
2020a; Sharma & Kumar, 2021; Sripairojthikoon & Harnsomburana,
2018; Morando et al., 2018), and virtual reality (Vaitkevičius et al.,
2019). However, the 3DCNN models are not generally as powerful as
2019). This is due to a common misunderstanding by hearing people
the sequence learning models such as RNN, Long Short-Term Memory
that Deaf people are much more comfortable with reading spoken
(LSTM), and Gated Recurrent Unit (GRU).
language; therefore, it is not necessary to translate the reading spoken
language into sign language. This is not true since there is no guarantee
that a Deaf person is familiar with the reading/writing forms of a 4.2. Transformer
spoken language. In some languages, these two forms are completely
different from each other.
The main intuition behind recurrent models is modeling the tem-
Real-time communication: For now, accessibility in SL is mainly
poral representation of sequential data, such as image sequences.
achieved by pre-recorded videos. This cannot enable real-time inter-
Deep recurrent networks demonstrated great success in different se-
action for the content provider. To have an automatic sign recogni-
quence learning tasks, such as machine translation (Siddique, Ahmed,
tion system applicable to a mutual interaction between a hearing-
impaired/Deaf user and a hearing user or a digital assistant in real-time, Talukder, & Uddin, 2020), speech recognition (Graves, Mohamed, &
we need low-complex and fast models. Using such models, the Deaf Hinton, 2013), video captioning (Pei et al., 2019), video prediction (Vil-
community can simply communicate with other people in different legas et al., 2017), SLR (Rastgoo et al., 2020a), and SLP (Rastgoo et al.,
locations, such as schools, banks, hospitals, trains, University, just to 2021d). However, there are some limitations in these networks, such
mention a few. A translation system could be vision-based or sensor- as vanishing and exploding gradient. To mitigate these challenges, the
based, depending on the type of input it receives. To date, most of the classical RNNs were extended to more sophisticated recurrent models,
current commercial systems for sign language translation are sensor- such as LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Cho,
based, which are expensive and not user-friendly. Vision-based sign Merrienboer, et al., 2014). Different works have explored different
translation systems are necessary but should overcome many challenges modifications of the extended recurrent models, such as applying
to build a system applicable to real-time communication. the LSTM-based models to the image space (Shi et al., 2015), using
Sign anonymization: The purpose of sign anonymization is to multidimensional LSTM (MD-LSTM) (Graves, Fernandez, & Schmid-
ensure that no personal information of the signers is shared with the huber, 2007), using the stacked recurrent layers to include abstract
community (Saunders, Camgoz, & Bowden, 2021). Furthermore, pro- spatio-temporal correlations (Finn, Goodfellow, & Levine, 2016; Lotter,
viding realistic, human-like, and anonymized animations would ensure Kreiman, & Cox, 2015), and addressing the duplicated recurrent repre-
higher acceptability and comprehensibility than actual signing avatars. sentations (Zhan, Zheng, Yue, Sha, & Lucey, 2019). In addition, the
In sign language, complete anonymization of the video data is not Transformer models have recently improved results due to using the
possible because both the face and hands of the signers must be fully self-attention mechanism and parallel computing. In most of the mod-
visible so that the content can be understandable. Since most Deaf els for SLP, a recurrent model is used for the temporal representation
people have challenges in communication through written content, the
of sequential data.
need for producing messages anonymously is an important demand of
them. As a result, video is the main communication modality used by
native signers. The development of virtual signers is thus expanding, in 4.3. Generative models
order to make written material on the internet more available to Deaf
users. Generative modeling is an unsupervised learning task in machine
learning. It involves automatically discovering and learning the regu-
4. Backbone architectures and methods larities or patterns in input data. Such a model can generate or output
plausible examples. Generally, there are two main categories for model
In this section, we review the most-used architectures and meth-
learning: discriminative and generative. While a discriminative model
ods in SLP: Convolutional Neural Networks (CNNs), Recurrent Neural
learns the decision boundaries between the classes, a generative model
Networks (RNNs), generative models, motion capture, and signing
learns the real distribution of each class. In other words, a generative
avatars.
model learns the joint probability distribution p(x,y) to predict the
4.1. CNNs conditional probability using the Bayes Theorem. On the other side,
a discriminative model learns the conditional probability distribution
One of the basic deep learning-based building blocks designed for p(y|x). Both of these models generally fall into supervised learning
visual reasoning is convolutional layers (Rezaei et al., 2023). Using problems. The goal of generative models is to generate new samples
these layers, CNNs effectively model the spatial structure of images (Le- from the same distribution, given some training data. In the learning
Cun, Bottou, Bengio, & Haffner, 1998). In SLP, CNNs are the foundation procedure, the distribution of the real and generated data gets closer
of the proposed models. However, CNNs performance faces some chal- to each other. This is done by explicitly, e.g. VAEs, or implicitly,
lenges. One challenge is corresponding to the limited receptive field, e.g. GANs, estimating a density function from the real data. In SLP,
introduced as a kernel size. As some solutions to this challenge, we generative models are used to generate more realistic and plausible
can take into account: stacking more convolutional layers (Jain et al., videos, considering sign language challenges.

5
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

4.4. Motion capture and signing avatars

Motion capture, mocap for short, is defined as the process of

recording the movement of objects/people (Naert, Larboulette, & Gibet,
2020b). Different application areas use mocap for the fulfillment of
their requirements, such as sports (Tamir & Oz, 2008), entertain-
ment (Bregler, 2007), gaming industry (Geng & Yu, 2003), robotics
(Field, Stirling, Naghdy, & Pan, 2009), automotive (Plantard, Shum,
Le Pierres, & Multon, 2017), and construction (Valero, Sivanathan,
Bosché, & Abdel-Wahab, 2017). In movie production and video game
development, mocap refers to recording actions of human actors and
utilizing that information to animate digital character models in a
2D/3D computer animation. During the mocap sessions, the movements
of one or more actors are sampled many times per second. The mocap
aims to record only the movements of the actors, not their visual
appearance. Finally, the animation data is mapped to a 3D model in
a way that the model performs the same actions as the actor.
The mocap process has several advantages over traditional com-
puter animation of a 3D model, such as lower latency for data record-
ing, the ability to produce a large amount of data within a given time,
and the ability to create complex movement and realistic physical inter- Fig. 3. The proposed taxonomy of the reviewed works in SLP.

actions. However, there are some challenges in mocap usage. The need
for special and expensive hardware/software to obtain and process the
data, the need for specific requirements for the space that the mocap Visual modality: RGB and skeleton are two common types of
process is operated in, and the need for re-recording data instead of input data used in SLP models. While RGB images/videos contain
manipulating it in facing problems are some of these challenges. high-resolution content, skeleton inputs decrease the input dimension
Signing avatars are an animated 3D model of the mocap data necessary to feed to the model and assists in making a low-complex and
obtained using signers. Animating can be manually defined, captured fast model. The spatial features corresponding to the input image can
from a human signer, or parametrically described. The signing avatars be extracted using computer vision-based techniques, especially deep
aim to assist the research community in making different applications learning-based models. In recent years, CNNs achieved outstanding
more accessible to the Deaf community. Furthermore, they will also performance for spatial feature extraction from an input image (Majidi,
help address the lack of human interpreters. The goal is not to replace Kiani, & Rastgoo, 2020). Furthermore, generative models, such as
human interpreters but rather to increase the amount of signed content Generative Adversarial Networks (GAN), can use CNNs as an encoder
available to Deaf users. As another application of signing avatars, they or decoder block to generate a sign image/video. Due to the temporal
can be used as assistive technologies for Deaf students in school. Using dimension of RGB video inputs, the processing of this input modality
these technologies, the interaction between Deaf and hearing students is more complicated than the RGB image input. Most of the proposed
will be much easier. models in SLP use the RGB video as input (Camgoz, Koller, Hadfield,
Recently, mocap data is used to edit and generate sign language & Bowden, 2020; Saunders, Camgöz, & Bowden, 2020a; Saunders,
samples. To this end, some motion edition operations, such as con- Camgoz, & Bowden, 2020d; Stoll, Camgoz, et al., 2020). An RGB sign
catenation and mixing, are applied to mocap data to compose new video can correspond to one sign word or some concatenated sign
utterances. This helps to facilitate the enrichment of the original mocap words, in the form of a sign sentence. GAN and LSTM are the most used
data, enhancing the natural look of the animation, and promoting the deep learning-based models in SLP for static and temporal learning in
avatar’s acceptability. However, manipulating existing movements does the visual input modalities. While successful results have been achieved
using these models, more effort is necessary to generate more lifelike
not guarantee the semantic consistency of the reconstructed signs. Em-
sign images/videos in order to improve the communication interface
ploying an expert user for constructing new utterances from linguistic
with the Deaf community.
patterns can be a primary solution to this challenge.
Lingual modality: Text input is the most common form of linguistic
modality. To process the input text, different models are used (See
5. SLP taxonomy
& Lamm, 2020; Sutskever, Vinyals, & Le, 2014). Among the deep
learning-based models, the Neural Machine Translation (NMT) model
In this section, we present a taxonomy that summarizes the main is the most used model for input text processing. The Seq2Seq mod-
concepts related to deep learning in SLP. We categorize recent works in els (Sutskever et al., 2014), such as Recurrent Neural Network (RNN)-
SLP providing separate discussions in each category. In the rest of this based models, proved their effectiveness in many tasks. While success-
section, we explain different input modalities, datasets, applications, ful results were achieved using these models, more effort is necessary to
and proposed models. Fig. 3 shows the proposed taxonomy described overcome the existing challenges in the translation task. One challenge
in this section. in translation tasks is related to domain adaptation due to different
word styles, translations, and meanings in different languages. Thus, a
5.1. Input modalities critical requirement of developing machine translation systems is to tar-
get a specific domain. Transfer learning, training the translation system
Generally, vision and language are two input modalities in SLP. in a general domain followed by fine-tuning in-domain data for a few
While the visual modality includes the captured image/video data, the epochs, is a common approach in coping with this challenge. Another
linguistic modality for the spoken language contains the text/audio challenge is regarding the amount of training data. Since the main
input from the natural language. CV and NLP techniques are necessary property of deep learning-based models is the mutual relation between
to process these input modalities. While the visual modality is used in the amount of data and model performance, a large amount of data
the training, the lingual modality is applicable in both the training and is necessary to provide a good generalization capability in the model.
testing of the proposed models. Another challenge is the poor performance of machine translation

6
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

that can be used for text-to-sign language translation. This dataset is an

extended version of the continuous sign language recognition dataset,
PHOENIX-2014 (Forster et al., 2012). RWTH-PHOENIX-Weather 2014T
includes a total of 8257 sequences performed by 9 signers. There
are 1066 sign glosses and 2887 spoken language vocabularies in this
dataset. Furthermore, the gloss annotations corresponding to the spo-
ken language sentences have been included in the dataset. The latter
dataset, How2Sing, is a recently released multi-modal dataset used for
speech-to-sign language translation. This dataset contains a total of
38 611 sequences and 4k vocabularies performed by 10 signers. Like
the former dataset, the annotation for sign glosses has been included in
this dataset. In addition to these two well-know datasets, to make room
for real-life applications using the larger domain datasets, Camgoz
et al. (2021) prepared six datasets contains of 190 h of footage on
the news. In this way, Deaf experts and interpreters were employed
to annotate 20 h of footage. These datasets are publicly available to
the research community. Furthermore, Brock and Nakadai (2018) col-
lected a dataset, including continuous sentence utterances in Japanese
Sign Language (JSL). A multi-modal data of high-resolution motion
capture data, video data, and both visual and gloss-like annotations
was obtained with the support of fluent Japanese signers. In addi-
tion, three different encoding schemes of annotations with respect to
directions, intonation, and non-manual information are included in
the dataset. Though RWTH-PHOENIX-Weather 2014T and How2Sign
provided SLP evaluation benchmarks, they are not enough for the
generalization of SLP models. Furthermore, these datasets just include
German and American sentences. In line with the aim of providing an
easy-to-use application for mutual communication between the Deaf
Fig. 4. Samples of some most-used datasets: (a) How2Sign, (b) RWTH-PHOENIX- and hearing communities, new large-scale datasets with enough variety
Weather 2014T, (c) ASLLVD, (d) ATIS Corpus.
and diversity in different sign languages are required. The point is
that the signs are generally dexterous and the signing procedure in-
volves different channels simultaneously, including arms, hands, body,
gaze, and facial expressions. Capturing such gestures requires a trade-
off between capture cost, measurement (space and time) accuracy,
and production spontaneity. Furthermore, different equipment is used
for data recordings, such as wired Cybergloves, Polhemus magnetic
sensors, headsets equipped with an infrared camera, emitting diodes
and reflectors. Synchronization between different channels captured by
the aforementioned devices has a key role in data collection and
annotation. Another challenge is related to the capturing complexity
of the hand movement using some capturing devices, such as Cyber-
gloves. Hard calibration and deviation during data recording are some
difficulties of these acquisition devices. The synchronization of external
devices, hand modeling accuracy, data loss, the noisy capturing process,
facial expression processing, gaze direction, and data annotation are
Fig. 5. SLP datasets in time. The number of samples for each dataset is shown in additional challenges. Given these challenges, providing a large and
brackets.
diverse dataset for SLP, including spoken language and sign language
annotations, is difficult. Figs. 4 and 5 show some samples and also the
timeline of existing datasets for SLP.
systems on uncommon and unseen words. To cope with these words, To make sense of the existence of spoken and sign language
byte-pair encoding, such as stemming or compound-splitting, can be datasets, we review some datasets in machine translation for spoken-
used for rare word translation. Another challenge is that machine
to-spoken language translation. Analyzing these datasets shows that
translation systems are not properly able to translate long sentences.
there are more spoken datasets with more variety in the sample and
However, the attention model (Vaswani et al., 2017) partially deals
language numbers, compared to sign language datasets. For example,
with this challenge for short sentences. Furthermore, the challenge
the MUSE dataset includes bilingual dictionaries for 110 language pairs.
regarding word alignment is more critical in reverse translation, that
For each language pair, the training and testing seed dictionaries in-
is translating back from the target language to the source language.
clude approximately 5000 and 1500 word pairs, respectively. Another
5.2. Datasets dataset, namely OpenSubtitles, is a collection of multilingual parallel
corpora obtained from a large database of movie and TV subtitles.
While there are some large-scale and annotated datasets avail- OpenSubtitles contains a total of 1689 bitexts spanning 2.6 billion
able for sign language recognition (Rastgoo et al., 2021c), there are sentences across 60 languages. Multi30K is a multi-modal dataset ob-
only a few publicly available large-scale datasets for SLP (Brock & tained from the Flickr30k dataset. This dataset includes 31,014 images
Nakadai, 2018; Camgoz et al., 2021). Two public datasets, RWTH- and the corresponding five English descriptions. The Flicker dataset
Phoenix-2014T (Camgöz, Hadfield, Koller, Ney, & Bowden, 2018) and includes 145,000 training, 5,070 validation, and 5000 test descriptions.
How2Sign (Duarte et al., 2020) are the most used datasets in sign lan- The Multi30K dataset aims to translate the Flicker30 descriptions into
guage translation. The former includes German sign language sentences German sentences. ASPEC dataset contains a Japanese-English paper

7
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Table 1
SLP datasets in time.
Type Dataset Nationality Level Content type Public Year
ASLLVD (Athitsos et al., 2018) English (US) Word Video, Gloss, Trans. Y 2008
ATIS Corpus (Bungeroth et al., 2008) Multilingual Sentence Video, Gloss, Trans. Y 2008
Dicta-Sign (Matthes et al., 2012) English (US) Word Video, Gloss, Trans. Y 2012
Sign ASL-LEX (Caselli, Sehyr, Cohen-Goldberg, & English (US) Word Video, Gloss, Trans. Y 2016
Emmorey, 2017)
RWTH-Phoenix-2014T (Camgöz et al., 2018) German Sentence Video, Gloss, Trans. Y 2018
Deep JSLC (Brock & Nakadai, 2018) Japanese Sentence Video, Gloss Y 2018
KETI (Ko, Kim, Jung, & Cho, 2019) Korean Sentence Video, Gloss, Trans. N 2019
Content4All (Camgoz et al., 2021) Swiss Sentence Video, Gloss Y 2021
How2Sign (Duarte et al., 2020) English (US) Sentence Video, Gloss, Trans, Speech. Y 2021
OpenSubtitles (Tiedemann, 2016) Multilingual (60) Sentence Video, Trans. Y 2016
Multi30K (Elliott, Frank, Simaéan, & Specia, English, German Sentence Image, Trans. Y 2016
2016)
Spoken ASPEC (Nakazawa et al., 2016) Japanese, English Sentence Text Y 2016
MUSE (Conneau, Lample, Ranzato, Denoyer, Multilingual (110) Word Text Y 2017
& Jégou, 2017; Lample, Conneau, Denoyer,
& Ranzato, 2017)
MTNT (Michel & Neubig, 2018) Japanese, French Sentence Text Y 2018
MLQA (Lewis, Oğuz, Rinott, Riedel, & Multilingual (7) Sentence Text Y 2019
Schwenk, 2019)

abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese- Machine Translation of Weather reports from English to ASL
Chinese paper corpus of 680K parallel sentences. MLQA is a cross- project: Using the freely available Perl modules, some packages are
lingual dataset containing over 5K Question Answering (QA) samples designed to employ ASL grammar rules and generate fluent ASL words
(12K in English) in SQuAD format in seven languages: English, Arabic, (Grieve, 1999).
German, Spanish, Hindi, Vietnamese, and Chinese. The MTNT dataset South African Sign Language Machine Translation (SASL-MT)
is a Machine Translation dataset that contains noisy comments on Red- project: Like the TEAM Project (Zhao et al., 2000), SASL-MT uses the
dit and professionally sourced translation. The translation is between rule-based transfer mechanism from English to ASL. The SASL-MT is
French, Japanese, and French, with between 7k and 37k sentences per freely available for the Deaf community in specific domains, such as
language pair. Table 1 summarizes the most-used datasets for SLP and clinics, hospitals, and police stations. While this project is still under
also the datasets for spoken-to-spoken language translation. development, no evaluation results have been reported (Zij & Barker,
2003).
5.3. Applications and technologies Multi-path architecture for Sign Language Machine Translation
(SLMT): Using the virtual reality scene, a multi-channel architecture
5.3.1. Applications is proposed to include supplementary information of ASL. This project
With the advent of potent methodologies and techniques in recent aims to generate spatially complex ASL words (Huenerfauth, 2004,
years, machine translation applications have become more efficient 2005).
and trustworthy. One of the early efforts on machine translation is Czech Sign Language Machine Translation: Using computer ani-
dated back to the sixties, when a model was proposed to translate mation techniques, hand articulations are generated using an automatic
from Russian to English. This model defined the machine translation process. Translation from spoken Czech to Signed Czech is a primary
task as a phase of encryption and decryption. Nowadays, the standard goal of this project. More than 3000 simple or linked sign vocabu-
machine translation models fall into three main categories: rule-based laries of Czech sign language are included in the dictionary of this
grammatical models, statistical models, and example-based models. project, which is a successful improvement in Czech sign language
Deep learning-based models, such as Seq2Seq and NMT models, fall translation (Hanke, 2004; Kanis et al., 2006).
into the third category, and showed promising results in SLP. Virtual signing, capture, animation, storage and transmission
To translate from a source language to a target language, a corpus (ViSiCAST) Translator: This project is proposed to translate from
is needed to perform some preprocessing steps, such as boundary English text into British Sign Language (BSL). Using the grammar
detection, word tokenization, and chunking. While there are different rules and symbolic representation, natural movements in the sing words
corpora for most spoken languages, sign language lacks of such a are modeled. This project has successfully developed an avatar-based
large and diverse corpus. Since Deaf people may not be able to read signing system for BSL (Bangham et al., 2000).
or write in spoken language, they need some tools for communication ZARDOZ System: This system is an English translation system
with other people in society. Furthermore, many interesting and useful using Artificial Intelligence knowledge representation, metaphorical
applications on the Internet are not accessible to the Deaf community. reasoning, and blackboard system architecture. The main advantage
However, we are still far from having applications accessible for Deaf of this system is its efficient performance in processing semantic
people with large vocabularies or sentences from real-world scenarios. information. The contributors of this project aim to improve this system
One of the main challenges for these applications is a license right for using intelligent linguistic technologies (Veale & Conway, 1994).
usage. Only some of these applications are freely available. Another Environment for Greek Sign Language Synthesis: This system
challenge is the lack of generalization of current applications, which are includes an educational platform for Deaf children. Virtual character
developed for the requirements of very specific application scenarios. animation techniques are used for sign sequence synthesis and lexicon-
Here, we present some of the most used projects in sign language grammatical processing of Greek sign language sequences (Karpouzis
translation. et al., 2020).
Translation from English to ASL by Machine (TEAM) project: An Thai-Thai Sign Machine Translation (TTSMT): This model is a
English translation system uses grammar rules to create an American multi-phase approach to translate Thai text into Thai Sign language.
Sign Language (ASL) syntactic structure. Using the signing avatar, this This system has been developed using the spatial grammatical order of
project achieved successful performance for generating aspectual and the sign words and evaluated on the frequently used sign words in daily
adverbial information in ASL (Zhao et al., 2000). communication (Dangsaart et al., 2008).

8
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Table 2
Statistical report on the existing technology products based on the
EASTIN database (Eastin, 2022).
Device type Product number
Hearing technology 300
Alerting devices 173
Communication support technology 223

various types of alerting devices, including clocks and wake-up alarm

systems, household device alerts, doorbell and telephone alerts, and
baby monitoring devices. These devices employ the remote receivers
placed around the house (Lucker & Hersh, 2003).
Communication support technology: Communication support
technologies aim to facilitate communication between different com-
munities. They generally fall into two main categories: telecommuni-
cation services and person-to-person interactions. The former category
contains a variety of standard technologies, such as physical and
virtual keyboards, touch screens, video calling, captioning for phone
Fig. 6. A glance at some projects in SLP: (a) SignSynth (Stoll, Hadfield, & Bowden, calls, voice-to-sign language translation, recognition text messaging,
2020), (b) TEAM Project (Zhao et al., 2000). and text-based technology (such as WhatsApp (WhatsApp Messenger,
2022), Facebook (FB) Messenger (Facebook Messenger, 2022), and
Snapchat (Snapchat, 2022)). The latter category covers person-to-
Web-Based Interpreter of Sign Language (WebSign): WebSign person interactions using picture boards, keyboards, touch screens,
is a project to develop a web-based tool for information processing. display panels, and speech-generating devices. Table 2 shows a statis-
It includes a Plug-in to play the ASL signs. WebSign, as an avatar- tical report on the existing technology products based on the EASTIN
based technology, generates a real-time and online interpretation in database (Eastin, 2022).
sign language that decreases the communication barriers between Deaf Promises and challenges: Actually, research on Deaf and hearing-
and hearing people (Jemni et al., 2022). impaired technologies coincides with research on mainstream technol-
Sign Language Synthesis Application (SignSynth): SignSynth is a ogy. While Deaf and hearing-impaired people contribute to mainstream
Deep Learning-based project for video generation from the human pose technology (WFD, 2022), most of the communication technologies only
sequences using a generative model. One of the main advantages of this support spoken or written languages, excluding sign language. They are
project is the capability of producing natural-looking sign videos using among the first adopters of recent technologies such as Skype (Skype,
a fully automatic approach (Stoll, Hadfield, & Bowden, 2020). Fig. 6 2022), Google Hangout (Hangout, 2022), FaceTime (FaceTime, 2022),
shows a summary view of some of these projects. Instant Messaging (IM) (WFD, 2022), text-to-speech and speech recog-
nition software (Siri, 2022)(Cortana, 2022), WhatsApp (WhatsApp Mes-
5.3.2. Assistive technologies for SLP senger, 2022), and Imo (Imo, 2022). While these tools have become an
In this section, we review the existing assistive technologies for important part of our life, Deaf and hearing-impaired people have many
Deaf and hearing-impaired people to get an insight into the researchers problems with using these technologies. Recent developments in SLR
in SLP and also make a bridge between them and the corresponding and SLP aim to facilitate bidirectional communication between Deaf
technology requirements. Considering the pros and cons corresponding and hearing people. The sensory substitution across sensory systems,
to the existing assistive technologies, researchers in SLP can develop such as vibratory and visual-auditory substitutions, is an active research
many more real-world technologies for Deaf people. These technologies area aiming to make a real-life perceptual experience of hearing for
fall into three device categories: hearing technology, alerting devices, Deaf and hearing-impaired people. For example, one can imagine a
and communication support technology. Here, we present a quick recap technology that assists a Deaf person go through a musical experience
on each category and discuss the promises and challenges. translated into another sensory modality. The recent advances in CV
Hearing technology: Hearing technology contains devices em- algorithms, especially deep learning models, made room to develop
ployed to enhance the sound level available to a listener. So, it is not some applications in sign language. As we presented in the previous
suitable for Deaf people with a complete loss of their hearing ability. sections, the recent advances in SLP are promising. However, more
The main devices used in this technology include Hearing Aids (HAs)
endeavor is indispensable to provide a fast processing model in an un-
devices, assistive listening devices, Personal Sound Amplification Prod-
controlled environment considering rapid hand motions. It is clear that
ucts (PSAPs), and cochlear implants (Hersh & Johnson, 2003). Here,
technology standardization and full interoperability among devices and
we present a quick definition of each device.
platforms are prerequisites to having real-life communication between
HAs devices: HAs devices are sound-amplifying devices employed
two communities.
to enhance the hearing quality for impaired-hearing people (Ehima,
2022).
Assistive Listening Devices (ALD): Like HAs, ALD are used to 5.4. Proposed models
amplify the sounds, especially in noisy backgrounds (NAD, 2022).
PSAPs: PSAPs are devices that increase sound levels and reduce In this section, we review recent works in SLP. These works are
background noise. presented and discussed in five categories: Avatar approaches, NMT
Cochlear Implant (CI): CI, considered as an artificial cochlea, is a approaches, Motion Graph (MG) approaches, Conditional image/video
surgically-implanted sensor for sound-to-electrical wave conversion. Generation approaches, and other approaches. Table 3, Table 4, and
Alerting devices: Since hearing ability is not obligatory in the Table 5 present a summary of the main characteristics and details
alerting or alarm systems, these systems can be used in the Deaf of the reviewed models. Furthermore, Table 6, Table 7, and Table 8
community. Alerting systems usually use light, vibrations, or a com- summarize the mathematical formulations as well as the approaches,
bination of them to make an alert notification for users. There are respectively.

9
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Table 3
A summary of the main characteristics of the reviewed models.
Query Available choices Most used
Methods Avatar, NMT, Motion Graph, Image/video Image/video generation
generation
Input modalities Image (RGB), Skeleton, Video, Text, Speech Text
Datasets PHOENIX14T, Czech news, Own datasets PHOENIX14T
Production modalities Isolated, Continuous Continuous
Architectures Static: GAN, AE, VAE Static: GAN
Dynamic: LSTM, GRU Dynamic: LSTM
Generative models AE, VAE, GAN GAN
Evaluation metrics Accuracy, Word Error Rate, BLEU, ROUGE BLEU
Features Face, Hand, Body, Fused features Fused features

Table 4
Summary of deep SLP models (Part 1).
Year Ref Feature Input modality Dataset Description
2011 Kipp, Heloir, and Nguyen Avatar RGB video ViSiCAST Pros. Proposing a gloss-based tool focusing on
(2011) the animation content evaluation using a new
metric for comparing avatars with human
signers. Cons. Need to include non-manual
features of human signers.
2016 McDonald et al. (2016) Avatar RGB video Own dataset Pros. Automatically adding realism to the
generated images, low computational
complexity. Cons. Need to place the position of
the shoulder and torso extension on the
position of the avatar’s elbow, rather than the
IK end-effector.
2016 Gibet, Lefebvre-Albaret, Avatar RGB video Own dataset Pros. Easy to understand with high viewer
Hamon, Brun, and Turki acceptance of the sign avatars. Cons. Limited
(2016b) to the small set of sign phrases.
2018 Camgoz, Hadfield, Koller, NMT RGB video PHOENIX-Weather 2014T Pros. Robust to jointly align, recognize, and
Ney, and Bowden (2018a) translate sign videos. Cons. Need to align the
signs in the spatial domain.
2018 Guo, Zhou, Li, and Wang NMT RGB video Own dataset Pros. Robust to align the word order
(2018) corresponding to visual content in sentences.
Cons. Need to generalize to additional datasets.
2020 Stoll, Camgoz, et al. (2020) NMT, MG Text PHOENIX14T Pros. Robust to minimal gloss and skeletal
level annotations for model training. Cons.
Model complexity is high.
2020 Saunders et al. (2020d) Others Text PHOENIX14 Pros. Robust to the dynamic length of output
sign sequence. Cons. Model performance can
be improved including non-manual information.
2020 Zelinka and Kanis (2020) Others Text Czech news Pros. Robust to the missing skeleton parts.
Cons. Model performance can be improved
including information of facial expressions.
2020 Saunders et al. (2020c) Others Text PHOENIX14T Pros. Robust to non-manual feature production.
Cons. Need to increase the realism of the
generated signs.
2020 Camgoz et al. (2020) Others Text PHOENIX14T Pros. No need to the gloss information. Cons.
Model complexity is high.
2020 Saunders et al. (2020a) Others Text PHOENIX14T Pros. Robust to manual feature production.
Cons. Need to increase the realism of the
generated signs.

Table 5
Summary of deep SLP models (Part 2).
Year Ref Feature Input modality Dataset Description
2021 Christopher, Kümmel, Others Text MS-ASL Pros. Performance improvement on
Ritter, and Hildebrand independent and contrasting signers using
(2021) synthesizing target poses. Cons. The generated
samples need to be improved to be more
realistic.
2022 Wencan, Zhao, He, and Others Text PHOENIX14T Pros. Using the unlabeled data for model
Zhang (2022) training. Cons. The modality imbalance issue
can still decrease the model performance.
2022 Tang, Hong, Guo, and Others Text PHOENIX14T Pros. Using the CTC optimization in the model
Wang (2022) guarantees semantic preservation in terms of
both pose and gloss. Cons. Model complexity is
high.
2022 Wellington, Alaniz, Others Text PHOENIX14T Pros. Independent control of diverse factors of
Hurtado, De Silva, and De variation for SLP by disentangling appearance
Bem (2022) and gestural communication parameters. Cons.
Model complexity is high.

10
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Table 6
A summary of the mathematical formulation for SLP (First part).
Ref. Formulation
𝑉 +𝑉
McDonald et al. (2016) 𝑉𝑅 = 𝐴𝑅 − 𝑆𝑅 , 𝑉𝐿 = 𝐴𝐿 − 𝑆𝐿 , 𝑉𝑟𝑒𝑎𝑐ℎ = 𝑅 2 𝐿
The parameters included in the model, showing the displacement vectors from the
right and left shoulders and also the average of these two vectors.
𝜖 ∏𝑀
Othman and Jemni (2011) 𝑝(𝑒, 𝑎|𝑓 ) = (𝑁+1) 𝑀 𝑗=1
𝑡(𝑒𝑗 |𝑓𝑎(𝑗) ),
∏𝑀
𝑝(𝑒, 𝑎|𝑓 ) = 𝜖 𝑗=1 𝑡(𝑒𝑗 |𝑓𝑎(𝑗) )𝑎(𝑎(𝑗|𝑗, 𝑁, 𝑀))
The translation probability of an English sentence to an ASL sentence using the
alignment function.
∏𝑇 ′
Bahdanau, Cho, and Bengio (2014) 𝑝(𝑦1 , … , 𝑦𝑇 ′ |𝑥1 , … , 𝑥𝑇 ) = 𝑡=1 𝑝(𝑦𝑡 |𝑣, 𝑦1 , … , 𝑦𝑡−1 )
Attention function (a distribution over sequences of all possible lengths.)
𝑒𝑥𝑝(𝑠𝑐𝑜𝑟𝑒(ℎ𝑢 ,𝑜𝑛 ))
Camgoz et al. (2018a) 𝛾𝑛𝑢 = ∑𝑁
𝑛′ =1
𝑒𝑥𝑝(𝑠𝑐𝑜𝑟𝑒(ℎ𝑢 ,𝑜𝑛′ ))
Attention-based Encoder-Decoder network with the attention weights.
∑𝑛
Kovar, Gleicher, and Pighin (2002) 𝑓 (𝑤) = 𝑓 ([𝑒1 , … , 𝑒𝑛 ]) = 𝑖=1 𝑔([𝑒1 , … 𝑒𝑖−1 ], 𝑒𝑖 )
The total error corresponds to the alignment and interpolation of the motions and
positions used between the joints.
∑𝑛
Arikan and Forsyth (2002) 𝑆(𝑒1 , … , 𝑒𝑛 ) = 𝑤𝑐 ∗ 𝑖=1 𝑐𝑜𝑠𝑡(𝑒𝑖 ) + 𝑤𝑓 ∗ 𝐹 + 𝑤𝑏 ∗ 𝐵 + 𝑤𝑗 ∗ 𝐽
A score function assigned to the distance between two consecutive frames using the
joint positions, velocities, and acceleration parameters.

Table 7
A summary of the mathematical formulation for SLP (Second part).
Ref. Formulation
Saunders et al. (2020a) min𝐺 max𝐷 𝐺𝐴𝑁 (𝐺, 𝐷) = [𝑙𝑜𝑔𝐷(𝑌 ∗ |𝑋)] + 𝐸[𝑙𝑜𝑔(1 − 𝐷(𝐺(𝑋)|𝑋))]
A minimax function to model the training process as an adversarial training scheme.
∑𝑛𝑎 ∑𝑛𝑏
𝑖=1
𝑤 ‖𝑎𝑖 −𝑏𝑗 ‖2𝐷
𝑗=1 𝑖,𝑗
Zelinka and Kanis (2020) 𝜀= ∑𝑛𝑎 ∑𝑛𝑏
𝑤
𝑖=1 𝑗=1 𝑖,𝑗

A loss function for sequence-to-sequence translation.

∑𝑘 ∑𝑘
Saunders et al. (2020c) 𝑇 𝑜𝑡𝑎𝑙 = min𝐺 ((max𝐷𝑖 𝑖=1 𝐺𝐴𝑁 (𝐺, 𝐷𝑖 )) + 𝜆𝐹 𝑀 𝑖=1 𝐹 𝑀 (𝐺, 𝐷𝑖 ) + 𝜆𝑉 𝐺𝐺 𝑉 𝐺𝐺 (𝐺(𝑦𝑡 , 𝐼 𝑆 )) +
𝜆𝐾𝐸𝑌 𝐾𝐸𝑌 (𝐺, 𝐷𝐻 ) + 𝜆𝑇 𝑇 (𝐺))
An adversarial loss for translation from text to skeletal pose data.
∑𝑡𝑒𝑛 ∑𝑡𝑒𝑛
1 ∑𝑁 𝑡=𝑡𝑏𝑛
𝑙(𝑦𝑛 ,𝑧𝑡 )
1 ∑𝑁 𝑡=𝑡𝑏𝑛
𝑆𝐾𝐿(𝑦𝑛,𝑓 𝑧𝑡,𝑓 )
Tornay, Camgöz, Bowden, and Doss (2020)𝑙𝑒𝑥 = 𝑁 𝑛=1 𝑡𝑒𝑛 −𝑡𝑏𝑛 +1
, 𝑓𝑓𝑜𝑟𝑚 = 𝑁 𝑛=1 𝑡𝑒𝑛 −𝑡𝑏𝑛 +1
Loss functions correspond to the linguistic aspects of sign language: the generated
lexeme and the generated forms.

Table 8 animations with smooth transitions between the signs (Gibet, 2020). To
A summary of the mathematical approaches for SLP.
capture the motion data of Deaf people, special cameras and sensors
Ref. Approach are used (Duarte & Gibet, 2010). Furthermore, a computing method
Othman and Jemni (2011), Stoll, Conditional probability function is used to transfer the body movements into the signing avatar (Kipp
Camgoz, et al. (2020)
et al., 2011).
Hochreiter and Schmidhuber Probability distribution
(1997) Two ways to derive the signing avatars include motion capture
Camgoz et al. (2018a), Bahdanau Attention data and parametrized glosses. In recent years, some works have been
et al. (2014), Zelinka and Kanis developed exploring avatars animated from parametrized glosses. Visi-
(2020)
Cast (Bangham et al., 2000), Tessa (Cox et al., 2002), eSign (Zwit-
Kovar et al. (2002) Distance between two point clouds
Arikan and Forsyth (2002) Distance between two consecutive frames serlood, Verlinden, Ros, & Schoot, 2005), dicta-sign (Efthimiou et al.,
Saunders et al. (2020a), Saunders minimax function and adversarial loss 2012), JASigning (Virtual Humans Group, 2017), and WebSign (Jemni
et al. (2020c) et al., 2022) are some of them. These works need the sign video
annotated via the transcription language (Walsh, Saunders, & Bow-
den, 2022), such as HamNoSys (Prillwitz, 1989) or SigML (Kennaway,
5.4.1. Avatar approaches 2013). However, under-articulated, unnatural movements, and miss-
In order to reduce the communication barriers between hearing ing non-manual information, such as eye gaze and facial expressions,
and hearing-impaired people, sign language interpreters are used as an are some challenges of the avatar approach (Naert, Larboulette, &
effective yet costly solution (Naert, Larboulette, & Gibet, 2020a). To Gibet, 2017). These challenges lead to misunderstanding the final sign
inform Deaf people quickly in cases where there is no interpreter language sequences (Gibet, Lefebvre-Albaret, Hamon, Brun, & Turki,
on hand, researchers are working on novel approaches to providing 2016a). Furthermore, due to the uncanny valley, the users do not
the content (Gibet, 2020; Larboulette & Gibet, 2018). One of these feel comfortable (Mori, MacDorman, & Kageki, 2012) with the robotic
approaches is signing avatars (Lucie, Larboulette, & Gibet, 2020). motion of the avatars (Ben, Camgoz, & Bowden, 2021a). To tackle
Avatar is a technique to display the signed conversation in the absence these problems, recent works focus on the annotation of non-manual
of videos corresponding to a human signer. To this end, 3D animated information such as the face, body, and facial expression (EblingJohn
models are employed, which can be stored more efficiently compared to & Glauert, 2013; Naert, Reverdy, Larboulette, & Gibet, 2018). For
videos. The movements of the fingers, hands, facial gestures, and body instance, Kipp et al. (2011) proposed two techniques, the torso and the
can be generated using the avatar (Naert, Larboulette, & Gibet, 2021). noise methods, to aid manual animation and supplement procedurally
This technique can be programmed to be used in different sign lan- generated avatar movements systems such as (Delorme & Braffort,
guages (Gibet & Marteau, 2023). With the advent of computer graphics 2009; Hanke, 2004). The first technique is an extension to any limb
in recent years, computers and smartphones can generate high-quality system and helps to automatically rotate the torso and spine of an

11
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

avatar. This rotation supports the specified arm motions from the As an example-based methodology, some research works have been
linguistic model. The second technique generates motion in the held developed by focusing on translation from text into sign language
joints. Evaluation results show the effectiveness of these techniques. using Artificial Neural Networks (ANNs), namely NMT (Bahdanau
Though, the accurate alignment and articulation of this information are et al., 2014). NMT uses ANNs to predict the likelihood of a word
challenging (Kipp et al., 2011; McDonald et al., 2016). More concretely, sequence, typically modeling entire sentences in a single integrated
three steps have been included in the proposed model by McDonald model. Seq2seq model (Cho, van Merrienboer, et al., 2014; Sutskever
et al. (2016): movement of the spine, spreading the effect over the et al., 2014), as one of the most interesting breakthroughs in neural ma-
spine, and shoulder movement. To this end, the following parameters chine translations, consists of two Recurrent Neural Networks (RNNs).
are included in the model: These RNNs form an encoder–decoder architecture to translate from a
source sequence to a target sequence. This model aims to overcome
𝑉𝑅 = 𝐴 𝑅 − 𝑆 𝑅 , (1) the challenges in problems whose input and output sequences have
different lengths with complicated and non-monotonic relationships.
𝑉𝐿 = 𝐴𝐿 − 𝑆𝐿 , (2) Considering the capabilities of the LSTM model (Hochreiter & Schmid-
huber, 1997) in learning the long-range temporal dependencies, the
𝑉𝑅 + 𝑉𝐿 seq2seq model improved the translation performance of the model.
𝑉𝑟𝑒𝑎𝑐ℎ = , (3)
2 More concretely, the LSTM aims to estimate the conditional probability
where 𝑉𝑅 , 𝑉𝐿 , 𝑎𝑛𝑑 𝑉𝑟𝑒𝑎𝑐ℎ are the displacement vectors from the right 𝑝(𝑦1 , … , 𝑦𝑇 ′ |𝑥1 , … , 𝑥𝑇 ), where (𝑥1 , … , 𝑥𝑇 ) and (𝑦1 , … , 𝑦𝑇 ′ ) are the input
and left shoulders and also the average of these two vectors, respec- and output sequences, respectively. The length of input and output
tively. To compute both the bend angle and direction, the torso must sequences may differ from each other. The LSTM network computes
be rotated in the direction of 𝑉𝑟𝑒𝑎𝑐ℎ . Experimental results indicated that the conditional probability by first obtaining the fixed dimensional
including such movements in the system would be highly beneficial. representation 𝑣 of the input sequence given by the last hidden state of
the LSTM, and then computing the probability of the output sequence
Using the data collected from motion capture, avatars can be more
with a standard LSTM formulation. The initial hidden state is set to the
usable and acceptable for reviewers (Gibet, 2018) (such as the Sign3D
representation 𝑣 of 𝑥1 , … , 𝑥𝑇 :
project by MocapLab (Gibet et al., 2016b)). Highly realistic results are
′
achieved by avatars, but the results are restricted to a small set of ∏
𝑇

phrases. This comes from the cost of data collection and annotation. 𝑝(𝑦1 , … , 𝑦𝑇 ′ |𝑥1 , … , 𝑥𝑇 ) = 𝑝(𝑦𝑡 |𝑣, 𝑦1 , … , 𝑦𝑡−1 ). (6)
𝑡=1
Furthermore, avatar data is not a scalable solution and needs expert
knowledge to perform a sanity check on the generated data. To cope where each 𝑝(𝑦𝑡 |𝑣, 𝑦1 , … , 𝑦𝑡−1 ) distribution is represented with a Soft-
with these problems and improve performance, deep learning-based max over all the words in the vocabulary. As a requirement, a special
models, as the latest machine translation developments, are used. Gen- End-Of-Sentence symbol ‘‘<EOS>’’ is necessary to enable the model to
erative models along with some graphical techniques, such as Motion define a distribution over sequences of all possible lengths. In addition
Graphs, are being recently employed (Stoll, Camgoz, et al., 2020). to the LSTM Network, the GRU model (Chung, Gulcehre, Cho, &
Bengio, 2014) can be used as an RNN cell. The seq2seq models proved
their effectiveness in many sequence generation tasks by obtaining
5.4.2. NMT approaches
nearly human-level performance. However, there are some drawbacks
Machine translators are a practical methodology for translation
to these models. One of them is corresponding to the fixed-size vec-
from one language to another (Kahlon & Singh, 2021; Khan, Abid,
tor representation of the input sequences with different lengths. The
& Abid, 2020; Luqman & Sabri, 2019). The first translator comes
vanishing gradient related to the long-term dependencies is another
back to the sixties when the Russian language was translated into
drawback of this model (Ko et al., 2019). To enhance the translation
English (Hutchins, 2005). The translation task requires preprocessing
performance of long sequences, Bahdanau et al. (2014) presented an
of the source language, including sentence boundary detection, word
effective attention mechanism. This mechanism was later improved
tokenization, and chunking. These preprocessing tasks are challenging,
by Luong, Pham, and Manning (2015).
especially in sign language. Sign Language Translation (SLT) aims
Regarding sign language, Camgoz et al. proposed a combination of
to produce/generate spoken language translations from sign language a seq2seq model with a CNN model to translate sign videos to spoken
considering different word orders and grammar. The ordering and the language sentences (Camgoz et al., 2018a). They used an attention-
number of glosses do not necessarily match the words of the spoken based Encoder-Decoder network with the attention weights defined as
language sentences (Li et al., 2022). follows:
Nowadays, there are different types of machine translators, mainly
𝑒𝑥𝑝(𝑠𝑐𝑜𝑟𝑒(ℎ𝑢 , 𝑜𝑛 ))
based on grammatical rules, statistics, and examples (Othman & Jemni, 𝛾𝑛𝑢 = ∑𝑁 (7)
2011). For instance, Othman and Jemni (2011) proposed a machine 𝑛′ =1 𝑒𝑥𝑝(𝑠𝑐𝑜𝑟𝑒(ℎ𝑢 , 𝑜𝑛′ ))
translation, namely IBM 1, by defining the translation probability for where ℎ𝑢 , 𝑜𝑛 , 𝑁, are the hidden state, output, and sequence length,
an English sentence 𝑓 = (𝑓1 , 𝑓2 , … , 𝑓𝑁 ) of length 𝑁 to an ASL sentence respectively. While results on the first continuous sign language trans-
𝑒 = (𝑒1 , 𝑒2 , … , 𝑒𝑀 ) of length 𝑀 with an alignment of each ASL word 𝑒𝑗 lation dataset, PHOENIX14T, showed promising results, it would be
to an English word 𝑓𝑖 , considering the alignment function 𝑎 ∶ 𝑗 → 𝑖 as interesting to extend the attention mechanisms to the spatial domain to
follows: align building blocks of signs with their spoken language translations.
In another work, Guo et al. (2018) designed a hybrid model including
𝜖 ∏
𝑀
𝑝(𝑒, 𝑎|𝑓 ) = 𝑀
𝑡(𝑒𝑗 |𝑓𝑎(𝑗) ), (4) the combination of a 3D Convolutional Neural Network (3DCNN) and
(𝑁 + 1) 𝑗=1 an LSTM-based (Hochreiter & Schmidhuber, 1997) encoder–decoder to
where t is a conditional probability function. The alignment function translate from sign videos to text outputs (see Fig. 7). Results on their
𝑎 maps each ASL output word 𝑗 to an English input position 𝑎(𝑗). own dataset showed a 0.071% improvement margin of the precision
The alignment probability distribution is also applied to this reverse metric compared to state-of-the-art models. However, unseen sentence
direction. The combination of these two steps defines the IBM 2 model translation is still a challenging problem with limited sentence data.
as follows: Dilated convolutions and Transformers are two approaches that are also
used for sign language translation (Kalchbrenner et al., 2016; Vaswani
∏
𝑀
et al., 2017). Stoll, Camgoz, et al. (2020) proposed a hybrid model for
𝑝(𝑒, 𝑎|𝑓 ) = 𝜖 𝑡(𝑒𝑗 |𝑓𝑎(𝑗) )𝑎(𝑎(𝑗|𝑗, 𝑁, 𝑀)). (5)
𝑗=1
automatic SLP using NMT, GANs, and motion generation. The proposed

12
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

𝑤𝑓 ∗ 𝐹 + 𝑤𝑏 ∗ 𝐵 + 𝑤𝑗 ∗ 𝐽 (9)

where 𝑤𝑐 , 𝑤𝑓 , 𝑤𝑏 , 𝑤𝑗 are weights for the quality (continuity) of the

motion, how well the length of the motion is satisfied, how well the
body constraints are satisfied, and how well the joints constraints
are defined. 𝐹 is the squared difference between the actual and the
required number of frames. 𝐵 is the squared distance between the
actual and the required position and orientation of the constraint. 𝐽
is the squared distance between the actual and the required position
of the constraint. In addition, the discontinuity between two clips was
calculated using a smoothing function. After summarizing the graph, a
random search was applied to the graph.
A two-layer representation of motion data in another approach was
proposed by Lee and Shin (1999). In the first layer, data was modeled
as a first-order Markov process and the transition probabilities were
Fig. 7. An overview of the model proposed by Guo et al. (2018): A hybrid model in- calculated using the distances of weighted joint angles and velocities. A
cluding the combination of a 3DCNN and an LSTM-based (Hochreiter & Schmidhuber, cluster analysis was performed on the second layer, namely the cluster
1997) encoder–decoder to translate from sign videos to text outputs.
forest, to generalize the motions. The proposed hierarchical motion
representation approach adapts the existing motion of a human-like
character to have desired features included by a set of constraints.
model generates sign videos from spoken language sentences with a Results confirm the relative improvement by employing a curve-fitting
minimal level of data annotation for training. This model first translates technique that minimizes a local approximation error.
spoken language sentences into sign pose sequences. Then, a generative A continuous SLP model has been proposed by Stoll, Camgoz, et al.
model is used to generate plausible sign language videos. Results on (2020) using pose data. The sign glosses were embedded to an MG with
the PHOENIX14T Sign Language Translation dataset show comparable the transition probabilities provided by an NMT decoder at each time
results compared to state-of-the-art alternatives (see Fig. 8). step (see Fig. 10).
While NMT-based methods achieved successful results in translation Although MG can generate plausible and controllable motion
tasks, some major challenges need to be solved. Domain adaptation is through a database of motion capture, it faces some challenges. The
the first challenge in this area. Since the translation between different first challenge is regarding access to data. To show the model potential
domains is affected by different rules, domain adaptation is a crucial with a truly diverse set of actions, a large set of data is necessary.
requirement in developing machine translation systems targeted to a The scalability and computational complexity of the graph to select the
specific use case. The second challenge is regarding the amount of best transitions are the other challenges in MG. Furthermore, since the
training data. Especially in deep learning-based models, increasing the number of edges leaving a node increases with the size of the graph,
amount of data can lead to better results. Another difficulty is dealing the branching factor in the search algorithm will increase as well.
with uncommon words. The translation models perform poorly on these
words. Words alignment and adjusting the beam search parameters are 5.4.4. Conditional image/video generation
the other challenges for NMT-based models. The promising results of The field of automatic image/video generation has experienced
current deep learning-based models set an underpin for future research a remarkable evolution in recent years. However, the task of video
in this area. generation is challenging since the content between consecutive frames
has to be consistent, showing a plausible motion. These challenges are
5.4.3. Motion Graph approaches
more difficult in SLP due to the need for human video generation. The
Motion Graph (MG), as a computer graphic method for dynamically
complexity and variety of actions and appearances in these videos are
animating characters, is defined as a directed graph constructed from
high and challenging. Controlling the content of the generated videos
motion capture data. MG can generate new sequences to satisfy specific
is crucial yet difficult.
goals. In SLP, MG can be combined with an NMT-based network to
With the recent advances in deep learning, the field of automatic
make a continuous text-to-pose translation. One of the early efforts
image/video generation has seen different approaches employing neu-
on MG is dated back to 2002 when a general framework was proposed
ral network-based architectures (Ben, Camgoz, & Bowden, 2022), such
by Kovar et al. (2002) for extracting particular graph walks that satisfy
as CNNs (Chen & Koltun, 2017; Oord, Kalchbrenner, Vinyals, et al.,
the user’s specifications. Distance between two frames was defined as a
2016), RNNs (Gregor, Danihelka, Graves, Rezende, & Wierstra, 2015;
distance between two point clouds (see Fig. 9). To make the transitions,
Oord, Kalchbrenner, & Kavukcuoglu, 2016), Variational Auto-Encoders
alignment and interpolation of the motions and positions were used
(VAEs) (Kingma & Welling, 2014), conditional VAEs (Yan, Yang, Sohn,
between the joints. The total error, 𝑓 (𝑤) of a path in the model is
& Lee, 2016), and GAN (Goodfellow et al., 2014). VAEs and GANs are
defined as follows:
generally combined to benefit from the VAE’s stability and the GAN’s
∑
𝑛
discriminative nature. Most relevant to SLP, a hybrid model, including
𝑓 (𝑤) = 𝑓 ([𝑒1 , … , 𝑒𝑛 ]) = 𝑔([𝑒1 , … , 𝑒𝑖−1 ], 𝑒𝑖 ), (8)
𝑖=1
a VAE and GAN combination, has been proposed for image generation
of people (Ma et al., 2017; Saunders, Camgoz, & Bowden, 2020b;
where 𝑔 is a scalar function that evaluates the additional error accrued
Siarohin, Sangineto, Lathuiliere, & Sebe, 2018) and video generation
by appending an edge 𝑒 to the existing path 𝑤, which may be the empty
of people performing sign language (Natarajan & Elakkiya, 2022; Saun-
path. Finally, the branch and bound search algorithm was applied to the
ders et al., 2020d; Stoll, Camgoz, Hadfield, & Bowden, 2018; Vasani,
graph.
Autee, Kalyani, & Karani, 2020). Furthermore, there are some models
In another work, Arikan and Forsyth (2002) used the joint positions,
for image/video generation that can be used in SLP. For example, Chen
velocities, and accelerations parameters to define the distance between
and Koltun (Chen, Papandreou, Kokkinos, Murphy, & Yuille, 2018)
two consecutive frames. Given a sequence of edges 𝑒1 , … , 𝑒𝑛 , a score is
proposed a CNN-based model to generate photographic images given
assigned to each path using the following function:
semantic label maps. Oord, Kalchbrenner, Vinyals, et al. (2016) pro-
∑
𝑛
posed a deep learning-based model, namely PixelRNNs, to sequentially
𝑆(𝑒1 , … , 𝑒𝑛 ) = 𝑤𝑐 ∗ 𝑐𝑜𝑠𝑡(𝑒𝑖 )+
generate the image pixels along the two spatial dimensions. Gregor
𝑖=1

13
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Fig. 8. An overview of the model proposed by Stoll, Camgoz, et al. (2020): A hybrid model to automatic SLP using NMT, GANs, and motion generation.

Fig. 9. An overview of the model proposed by Kovar et al. (2002): A general framework for extracting particular graph walks that satisfy the user’s specifications.

Fig. 10. An overview of the graph nodes in a model proposed by Stoll, Camgoz, et al. (2020) for SLP. Each node contains one or more motion primitives and a prior distribution.
The transition probability between two nodes is defined as the probability of motion primitive.

et al. (2015) developed an RNN-based architecture, including an en- sign sequences from spoken language sentences (see Fig. 11). They
coder and a decoder network to compress the real images presented formalized the model training process as an adversarial training scheme
during training and refine images after receiving codes. Karras, Laine, using a minimax game. To this end, the generator, 𝐺, aims to minimize
and Aila (2019) designed a deep generative model, entitled StyleGAN, the following equation, whilst D maximizes it:
to adjust the image style at each convolution layer. Kataoka, Matsubara,
and Uehara (2016) proposed a model using the combination of GAN min max 𝐺𝐴𝑁 (𝐺, 𝐷) =
𝐺 𝐷
and attention mechanism. Benefiting from the attention mechanism,
[𝑙𝑜𝑔𝐷(𝑌 ∗ |𝑋)] + 𝐸[𝑙𝑜𝑔(1 − 𝐷(𝐺(𝑋)|𝑋))] (10)
this model can generate images containing highly detailed content.
While deep learning-based generative models have recently achieved where 𝑌 ∗ and 𝐺(𝑋) are the ground truth and the produced sign pose
remarkable results (Ankith, Boggaram, Sharma, Ramanujan, & Bharathi, sequences, respectively. Results on the PHOENIX14T dataset show the
2022), there exist major challenges in their training. Mode collapse, effectiveness of the proposed approach. However, the model needs to
non-convergence and instability, suitable objective function, and op- further increase the realism of sign production by generating photo-
timization algorithms are some of these challenges. However, sev- realistic human signers. Furthermore, user studies in collaboration
eral strategies have been recently proposed to address a better de- with the Deaf are required to evaluate the reception of the produced
sign and optimization of them. Appropriate design of network archi- sign pose sequences.
tecture, proper objective functions, and optimization algorithms are In another work, Zelinka and Kanis (2020) designed a sign language
some of the proposed techniques to improve the performance of deep synthesis system focusing on skeletal data production. A feed-forward
learning-based models. transformer and a recurrent transformer, as deep learning-based mod-
els, along with the attention mechanism were used to enhance the
5.4.5. Other models model performance (see Fig. 12). The loss of the proposed model for
In addition to the previous categories, some models have been a sequence 𝑎 = (𝑎1 , … , 𝑎𝑛𝑎 ) and a sequence 𝑏 = (𝑏𝑎 , … , 𝑏𝑛𝑏 ) is defined as
proposed for SLP using different deep learning models (Christopher follows:
et al., 2021; Tang et al., 2022; Wellington et al., 2022; Wencan et al., ∑𝑛𝑎 ∑𝑛𝑏
𝑤 ‖𝑎𝑖 − 𝑏𝑗 ‖2𝐷
𝑗=1 𝑖,𝑗
2022). For example, Saunders et al. (2020a) proposed a Progressive 𝑖=1
𝜀= ∑𝑛𝑎 ∑𝑛𝑏 (11)
Transformers, as a deep learning-based model, to generate continuous 𝑖=1
𝑤
𝑗=1 𝑖,𝑗

14
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Fig. 11. An overview of the model proposed by Saunders et al. (2020a). In this model, a Conditional Adversarial Discriminator measures the realism of Sign Pose Sequences
produced by an SLP Generator.

corresponding to these linguistic aspects are calculated as follows:

∑𝑡𝑒𝑛
𝑙(𝑦 , 𝑧 )
1 ∑ 𝑡=𝑡𝑏𝑛 𝑛 𝑡
𝑁
𝑙𝑒𝑥 = (13)
𝑁 𝑛=1 𝑡𝑒𝑛 − 𝑡𝑏𝑛 + 1
∑𝑡𝑛 𝑒
𝑆𝐾𝐿(𝑦𝑛,𝑓 𝑧𝑡,𝑓 )
1 ∑ 𝑡=𝑡𝑏𝑛
𝑁
𝑓
𝑓 𝑜𝑟𝑚 = (14)
𝑁 𝑛=1 𝑡𝑒𝑛 − 𝑡𝑏𝑛 + 1

where 𝑓𝑓𝑜𝑟𝑚 is the state duration normalized form-level score for

each channel 𝑓 , 𝑙𝑒𝑥 is the state duration normalized lexeme-level
score, 𝑙(𝑦𝑛 , 𝑧𝑡 ) is the local score defined by Symmetric KL-divergence
(SKL) between the probability distributions. 𝑧 and 𝑦 are the test
sign production and the sequence of stacked categorical distributions
corresponding to the KL-HMM representing the target reference lexeme.
While results on the SMILE DSGS dataset show a promising lexeme and
form levels assessment, they need to focus on the assessment of a case
that the lexeme is correct but the form is incorrect.
Saunders et al. proposed a graph-based model, entitled Skeletal
Graph Self-Attention (SGSA), for SLP from the skeletal data. By apply-
ing a spatiotemporal adjacency matrix to the self-attention formulation,
the proposed model provides structure and context to each skeletal
joint. Results on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T)
dataset show that the model obtains state-of-the-art performance by
Fig. 12. Schematic diagram of a model proposed by Zelinka and Kanis (2020), the relative improvements of 7% compared to other models in the
including three blocks: (a) the feed-forward model for a text-to-signs translation for field (Ben, Camgoz, & Bowden, 2021b).
word-level features, (b) the feed-forward model for a text-to-signs translation for Using the capabilities of different methods in this category has
character-level features, and (c) the model for a sign-to-skeleton transformation.
led to successful results in SLP. However, the challenge of the model
complexity still remains as an open issue. Making a trade-off between
accuracy vs. task complexity is a key element.
𝑖=1,…,𝑛
where 𝑤(𝑎, 𝑏) = [𝑤𝑖,𝑗 ]𝑗=1,…,𝑛𝑎 is an attention matrix and ‖.‖2𝐷 is a chosen Tang et al. (2022) designed a Gloss Semantic-Enhanced Network
𝑏
metric. with Online Back-Translation (GEN-OBT) for gloss to pose production
Saunders et al. (2020c) proposed a generative-based model to gen- in sign language. More concretely, GEN-OBT includes a transformer-
erate photo-realistic continuous sign videos from text inputs. They based gloss encoder, a pose decoder, and an online reverse gloss
combined a transformer with a Mixture Density Network (MDN) to decoder. The gloss encoder aims to learn the global contextual depen-
manage the translation from text to skeletal pose. The adversarial loss dency of the entire gloss sequence. In this way, the gloss token is fused
of the proposed model is defined as follows: with the generated pose to make a unified gloss embedding vector,
∑
𝑘 which is used for generating the next pose. In addition, a CTC-based
𝑇 𝑜𝑡𝑎𝑙 = min ((max 𝐺𝐴𝑁 (𝐺, 𝐷𝑖 ))+ decoder is employed to convert the generated poses backward into
𝐺 𝐷𝑖
𝑖=1
glosses. Results on PHOENIX14T dataset confirm the efficiency of the
∑
𝑘
GEN-OBT model compared to the state-of-the-art models. Wellington
𝜆𝐹 𝑀 𝐹 𝑀 (𝐺, 𝐷𝑖 ) + 𝜆𝑉 𝐺𝐺
𝑖=1
et al. (2022) introduced a model, entitled SynLibras, for disentangling
appearance and gestural communication on image synthesis. Preserving
𝑉 𝐺𝐺 (𝐺(𝑦𝑡 , 𝐼 𝑆 )) + 𝜆𝐾𝐸𝑌 𝐾𝐸𝑌 (𝐺, 𝐷𝐻 ) + 𝜆𝑇 𝑇 (𝐺)) (12)
the appearance-based features, the suggested model performs a cross-
Tornay et al. (2020) designed an SLP assessment approach using language pose-transfer on the source signer. Furthermore, a dataset
multi-channel information (hand shape, hand movement, mouthing, with annotated poses of Libras signers performing single words is
facial expression). In this approach, two linguistic aspects are consid- also introduced in this work. Results on the RWTH-PHOENIX-Weather
ered: the generated lexeme and the generated forms. Two loss functions dataset show that the SynLibras, as the first method for Brazilian sign

15
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Fig. 13. A general framework for SLP.

language (Libras) synthesis, obtains promising results. Christopher et al. 7.1. Evaluation metrics and protocols
(2021) presented a generative-based model for SLP using GANs. This
model uses the human semantic parser of the Soft-Gated Warping- Generally, the evaluation metrics measure the output quality by
GAN to generate photo-realistic videos guided by region-level spatial comparing the system output against the ground truth output corre-
layouts. Experimental results on the MS-ASL dataset with over 200 sponding to the source data. In SLP, the visual/lingual evaluation met-
signers show performance improvement compared to related models in rics are used to evaluate the correctness of the generated visual/lingual
the field. Wencan et al. (2022) designed a model, namely DualSign, outputs:
as a semi-supervised two-staged SLP framework. This model utilizes Visual evaluation metrics: To evaluate the quality of the gener-
partially gloss-annotated text-pose pairs and monolingual gloss data. ated sign image/video, the Structural Similarity Index Measurement
Furthermore, a method, entitled Balanced Multi-Modal Multi-Task Dual (SSIM) (Wang et al., 2018a), Peak Signal-to-Noise Ratio (PSNR), and
Transformation (BM3T-DT), is proposed, which includes two models: Mean Squared Error (MSE), as three well-known metrics for assessing
a Multi-Modal T2G model (MM-T2G) and a Multi-Task G2P model image quality, are used in the proposed models for SLP. SSIM actually
(MT-G2P). These models are jointly trained by leveraging their task measures the perceptual difference between two images. In SLP, this
duality and unlabeled data. Results on the PHOENIX14T dataset show metric is used to compare the generated synthetic image to its ground
the efficiency of this model in the semi-supervised setting. truth image. PSNR and MSE are metrics used to assess the quality of
compressed images compared to their original. In SLP, the MSE is used
6. General framework for SLP to calculate the average squared error between a synthetic image and
its ground truth image. In contrast, PSNR measures the peak error in
dB, using the MSE metric.
SLP can be decomposed into some intermediate steps or addressed
Lingual evaluation metrics: Some of the most familiar machine
as an end-to-end translation task. As we reviewed in the previous
translation metrics like BLEU@N (Papineni, Roukos, Ward, & Zhu,
sections, there are different translation models applicable to sign
2002), METEOR (Denkowski & Lavie, 2014), ROUGE (Lin, 2004),
language. In this section, we present the common intermediate steps
CIDEr (Vedantam, Zitnick, & Parikh, 2015) are used to evaluate the
used in SLP (see Fig. 13).
translation performance of the proposed models in SLP. These metrics
Text/Speech to gloss translation: Gloss is defined as written infor-
have acceptable relevancy with human judgment. In the BLEU@N
mation of a sign word translated from the spoken language. It contains
metric, the matched N-grams between the machine-generated and
the facial and body grammar presented during the signing. For instance,
the ground truth answer are utilized to compute the precision score.
let us translate an English sentence, ‘‘I am Anna’’, into sign language.
BLEU@N metric is calculated for 𝑁 = 1 to 4, where shorter N-grams
To this end, we need to translate ‘‘I am’’ and ‘‘Anna’’ separately but
are used to fulfill the adequacy and longer N-gram matching accounts
finger-spelling for a letter-by-letter translation corresponding to ‘‘Anna’’
for fluency. ROUGE-L is another machine translation metric that scores
is needed. Finally, we have this: ‘‘EM FS-ANNA’’, where ‘‘FS’’ denotes
a machine-generated sentence using a recall-based criterion. CIDEr
the start of a finger-spelling sequence. While the gloss is not a correct
is a metric for evaluating machine-generated sentences using human
translation, it can provide suitable spoken language morphemes con-
consensus.
taining some conceptual information about the signs. The process of
the spoken-to-gloss translation can be seen as a sequence-to-sequence
7.2. Results
task. In this task, various models from speech recognition and NMT,
especially deep learning-based models, can be employed.
In this section, we report the quantitative results of the most rele-
Gloss to skeleton prediction: This step aims to generate the human
vant methods reviewed in the previous sections. We limited the quan-
pose information corresponding to the sign gloss sequences. To this end,
titative results to the most common metrics and datasets. The results
different parts of the human pose, including accurate finger locations,
are compacted in one table, given that there are only a few works
arm and torso position, and facial expressions, are considered. Like the
in SLP. ROUGE and BELU are the most used metrics for reporting
previous step, this step can benefit from recent developments in deep the results of the model evaluation. As Table 9 shows, most of the
learning. Attention-based models are one of the effective techniques proposed models for SLP are evaluated on the PHOENIX14T dataset.
employed for mapping from the textual input to the skeleton sequences. This dataset contains 8257 sequences being performed by 9 signers,
Skeleton to image/video synthesis: Two general approaches are which are annotated with both the sign glosses and spoken language
used in this step: animating an avatar and generating video frames. In translations. However, due to the limited number of signers in the
the first approach, the skeleton keypoints are used to animate an avatar. dataset, it is necessary to use one or more large-scale datasets to train
Motion smoothing and interpolation are two techniques used before the generation network. Using multiple datasets is motivated by the fact
final rendering. While the video generation from the skeleton keypoints that there is no single dataset that provides text-to-sign translations,
is hard, recent improvements in deep learning-based skeleton-to-video a broad range of signers of different appearances, and high-definition
translation are promising. signing content. Using datasets from different subject domains and
languages demonstrates the robustness and flexibility of the proposed
7. Performance evaluation methods, as it allows us to transfer knowledge between specialized
datasets. This makes the approach suitable for translating between
In this section, the results of the previously analyzed SLP models on different spoken and signed languages, as well as other problems, such
the most popular datasets are presented. as text-conditioned image and video generation.

16
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Table 9
Results of SLP models.
Model Acc CIDEr ROUGE METEOR WER BLEU-1 BLEU-2 BLEU-3 BLEU-4 MSE SSIM FID
S2G2T (Camgoz, – – 43.80 – – 43.29 30.39 22.82 18.13 – –
Hadfield, Koller,
Ney, & Bowden,
2018b)
HLSTM-attn (Guo 0.506 0.605 – 0.205 0.641 0.508 0.330 0.207 – – – –
et al., 2018)
Text2Gloss (Stoll, – – 48.10 – 4.53 50.67 32.25 21.54 15.26 – 0.727 64.01
Camgoz, et al.,
2020)
Symbolic – – 54.55 – – 55.18 37.10 26.24 19.10 – – –
Transformer
(Saunders et al.,
2020d)
Progressive – – 32.02 – – 31.80 19.19 13.51 10.43 – – –
Transformer
(Saunders et al.,
2020d)
NSLS (Zelinka & – – – – – – – – – 11.94 – –
Kanis, 2020)
SIGNGAN (Saunders – – 29.05 – – 27.63 19.26 14.84 12.18 – 0.759 27.75
et al., 2020d)
EDN (Chan, – – – – – – – – – – 0.737 41.54
Ginosar, Zhou, &
Efros, 2019)
vid2vid (Wang, – – – – – – – – – – 0.750 56.17
et al., 2018)
Pix2PixHD (Wang – – – – – – – – – – 0.737 42.57
et al., 2018a)

Currently, the proposed SLP systems cannot compete with existing assist in making a low-complex and fast model. GAN and LSTM are
avatar approaches. A large amount of high-resolution training data the two most used deep learning-based models in SLP for visual in-
is necessary to obtain results comparable with motion capture and puts. While successful results were achieved using these models, more
avatar-based approaches. However, the avatar-based approaches need effort is necessary to generate more lifelike and high-resolution sign
detailed annotations using task-specific transcription languages, which images/videos acceptable to the Deaf community. Among the deep
can only be provided by expert linguists. Animating the avatar it- learning-based models for lingual modality, the NMT model is the
self often involves a remarkable amount of hand-engineering. Motion most used model for input text processing. Other Seq2Seq models,
capture-based approaches require high-fidelity data, which needs to such as RNN-based models, proved their effectiveness in many tasks.
be captured, cleaned, and stored at remarkable cost, decreasing the While accurate results were achieved using these models, more effort
amount of data available, therefore, making this approach un-scalable. is necessary to overcome the existing challenges in the translation task,
Given that recent approaches use automatic feature extraction methods, such as domain adaptation, uncommon words, word alignment, and
we think that in short term these approaches will enable highly realis- word tokenization.
tic, and cost-effective translation of spoken languages to sign languages, Datasets: The lack of a large annotated dataset is one of the major
improving equal access for the Deaf and Hard of Hearing. Generating challenges in SLP. The collection and annotation of sign language data
high-resolution and signer-independent videos with signers of arbitrary is an expensive task that needs the collaboration of linguistic experts
appearance makes room to provide highly realistic, expressive, and and native speakers. While there are some publicly available datasets
end-to-end SLP systems applicable to real-world communications. for SLP (Athitsos et al., 2018; Brock & Nakadai, 2018; Bungeroth et al.,
Additionally, developing stronger data-processing strategies to pay at- 2008; Camgöz et al., 2018; Camgoz et al., 2021; Caselli et al., 2017;
tention to the intricate features of sign language data, such as the size Duarte et al., 2020; Ko et al., 2019; Matthes et al., 2012), they suffer
of motion and speed, can be effective. from weakly annotated data for sign language. Furthermore, most of
the available datasets in SLP contain a restricted domain of vocab-
8. Conclusion ularies/sentences. To facilitate real-world communication between
the Deaf and hearing communities, access to a large-scale continuous
In this survey, we presented a detailed review of the recent advance- sign language dataset, segmented on the sentence level, is necessary. In
ments in SLP. We presented a taxonomy that summarized the main such a dataset, a paired form of the continuous sign language sentence
concepts related to SLP. We categorized recent works in SLP, providing and the corresponding spoken language sentence needs to be included.
separate discussions in each category. The proposed taxonomy covered Just a few datasets meet these criteria (Camgoz et al., 2018a; Duarte
different input modalities, datasets, applications, and proposed models. et al., 2020; Ko et al., 2019; Zelinka & Kanis, 2020). The point is that
Here, we summarize the main findings: most of the aforementioned datasets cannot be used for end-to-end
Input modalities: Generally, vision and language modalities are translation (Camgoz et al., 2018a; Ko et al., 2019; Zelinka & Ka-
two input modalities in SLP. While the visual modality includes the nis, 2020). Two public datasets, RWTH-Phoenix-2014T and How2Sign,
captured image/video data used in the training, the linguistic modality are the most used datasets in SLP. The former includes German sign
contains the text input from natural language, which is applicable in language sentences that can be used for text-to-sign language trans-
both the training and testing of the proposed models. Both categories lation. The latter is a recently proposed multi-modal dataset used for
benefit from deep learning approaches to improve model performance. speech-to-sign language translation. Though RWTH-PHOENIX-Weather
RGB and skeleton are two common types of visual input data used in 2014T (Camgoz et al., 2018a) and How2Sign (Duarte, 2019) provided
SLP models. While RGB images/videos contain high-resolution content, the appropriate SLP evaluation benchmarks, they are not enough for
skeleton inputs decrease the parameter complexity of the model and the generalization of the SLP models. Furthermore, these datasets only

17
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Fig. 14. Translation results from Stoll, Camgoz, et al. (2020): (a) ‘‘Guten Abend liebe Zuschauer’’. (Good evening dear viewers), (b) ‘‘Im Norden maessiger Wind an den Kuesten
weht er teilweise frisch’’. (Mild winds in the north, at the coast it blows fresh in parts). Top row: Ground truth gloss and video, Bottom row: Generated gloss and video. This
model combines an NMT network and GAN for SLP.

include German and American sentences. Translating from the spoken data. Especially in deep learning-based models, increasing the amount
language to a large diversity of sign languages is a major challenge for of data can lead to better results. Another challenge is regarding the
the Deaf community. uncommon words. The translation models perform poorly on these
Applications: American Sign Language (ASL) is the most-used sign words. Words alignment and adjusting the beam search parameters are
language in developed applications for SLP. Since it may be hard for other challenges in NMT-based models.
Deaf people to read or write the spoken language, they need some Although MG can generate plausible and controllable motion
tools for communication with other people in society. Furthermore, through a database of motion capture, it faces some challenges. The
many interesting and useful applications on the internet are not ac- first challenge is regarding limited access to data. To show the model
cessible to the Deaf community. To tackle these challenges, some potential with a truly diverse set of actions, a large set of data is
projects have been proposed aiming to develop such tools. While these necessary. The scalability and computational complexity of the graph
applications successfully made a bridge between Deaf and hearing to select the best transitions are other challenges in MG. Furthermore,
communities, we are still far from having applications involving large since the number of edges leaving a node increases with the graph size,
vocabularies/sentences from complex real-world scenarios. One of the the branching factor in the search algorithm will increase as well. To
main challenges for these applications is a license right for usage. automatically adjust the graph configuration and rely on the training
Another challenge is regarding the application domain. Most of these data, instead of user interference, Graph Convolutional Network (GCN)
applications have been developed for very specific domains such as could be used along with some refining algorithms to adopt the graph
clinics, hospitals, and police stations. Improving the amount of avail- structure monotonically.
able data and its quality can benefit the creation of these needed While GANs have recently achieved remarkable results for im-
applications. Furthermore, understanding the Deaf culture is helpful to age/video generation, there exist major challenges in the training of
create systems that align with user needs and desires. GANs. Mode collapse, non-convergence and instability, suitable objec-
Proposed models: The proposed works in SLP can be catego- tive function, and optimization algorithm are some of these challenges.
rized into five categories: Avatar approaches, NMT approaches, MG However, several suggestions have been recently proposed to address
approaches, Conditional image/video generation approaches, and other the better design and optimization of GANs. Appropriate design of net-
approaches. Table 2 shows a summary of state-of-the-art deep SLP work architecture, proper objective functions, and optimization algo-
models. Some samples of the generated videos and gloss annotations rithms are some of the proposed techniques to improve the performance
are shown in Figs. 14, 15, and 16. Using the data collected from motion of GAN-based models. Finally, the challenge of the model complexity
capture, avatars can be more usable and acceptable for reviewers. still remains for hybrid models.
Avatars achieve highly realistic results but the results are restricted Limitations: In this survey, we presented recent advances in SLP
to a small set of phrases. This comes from the cost of data collection and related areas using deep learning. While successful results have
and annotation. Furthermore, avatar data is not a scalable solution and been achieved in SLP by recent deep learning-based models, there
needs expert knowledge to be inspected and polished. To cope with are some limitations that need to be addressed. The main challenge
these problems and improve performance, deep learning-based models is regarding the Multi-Signer (MS) generation that is necessary for
are used. providing real-world communication in the Deaf community. To this
While NMT-based methods achieved significant results in transla- end, we need to produce multiple signers of different appearances and
tion tasks, some major challenges need to be solved. Domain adaptation configurations. Another limitation is the possibility of high-resolution
is the first challenge in this area. Since translation in different domains and photo-realistic continuous sign language videos. Most of the pro-
needs different styles and requirements, it is a crucial requirement in posed models in SLP can only generate low-resolution sign samples.
developing machine translation systems targeted at a specific use case. Conditioning on human keypoints extracted from training data can
The second challenge is regarding the amount of available training decrease the parameter complexity of the model and assist to produce a

18
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Fig. 15. Translation results from Stoll, Camgoz, et al. (2020): Text from spoken language is translated to human pose sequences.

Fig. 16. Translation results from Kipp et al. (2011): A signing avatar is created using a character animation system. Top row: signing avatar, Bottom row: original video.

high-resolution video sign. However, avatar-based models can success- CRediT authorship contribution statement
fully generate high-resolution video samples, though they are complex
and expensive. In addition, the pruning algorithms of MG need to be Razieh Rastgoo: Work supervisors, Conceptualization, Supervision,
improved by including additional features of sign language, such as Writing – review & editing. Kourosh Kiani: Work supervisors, Super-
duration and speed of motion. vision, Review & editing. Sergio Escalera: Work supervisors, Super-
Future directions: While recent models in SLP presented promis- vision, Review & editing. Vassilis Athitsos: Work supervisors, Super-
ing results relying on deep learning capabilities, there is still much vision, Review & editing. Mohammad Sabokrou: Work supervisors,
room for improvement. Considering the discriminative power of self- Supervision, Review & editing.
attention, learning to fuse multiple input modalities to benefit from
multi-channel information, learning structured spatiotemporal patterns Declaration of competing interest
(such as Graph Neural Networks models), and employing domain-
specific prior knowledge of sign language are some possible future
The authors declare that they have no known competing finan-
directions in this area. Furthermore, there are some exciting assistive
cial interests or personal relationships that could have appeared to
technologies for Deaf and hearing-impaired people. A brief introduction
influence the work reported in this paper.
to these technologies can get insight to the researchers in SLP and
also make a bridge between them and the corresponding technol-
ogy requirements. These technologies fall into three device categories: Data availability
hearing technology, alerting devices, and communication support tech-
nology. For example, let us imagine a technology that assists a Deaf No data was used for the research described in the article.
person going through a musical experience translated into another
sensory modality. While the recent advances in SLP are promising,
Acknowledgments
more endeavor is indispensable to provide a fast processing model in
an uncontrolled environment considering rapid hand motions. It is clear
This work has been partially supported by the HIS Company and
that technology standardization and full interoperability among devices
Institute for Research in Fundamental Sciences (IPM) in Iran, Spanish
and platforms are prerequisites to having real-life communication be-
project PID2019-105093GB-I00 (MINECO/FEDER, UE), and CERCA
tween the hearing and hearing-impaired communities. Finally, since
Programme/Generalitat de Catalunya, and ICREA under the ICREA
providing the data annotation is also challenging, recently some efforts
Academia programme.
have been done by Rastgoo, Kiani, and Escalera (2022a), Rastgoo,
Kiani, Escalera, and Sabokrou (2022b) to overcome the annotation
bottleneck. To this end, Zero-Shot Learning (ZSL) is employed for SLR. Funding
Using this approach, we hope to get closer to real and accurate systems
for bidirectional communication between Deaf and hearing people in This research did not receive any specific grant from funding agen-
society. cies in the public, commercial, or not-for-profit sectors.

19
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

References Cohen, M., Voldman, I., Regazzoni, D., & Vitali, A. (2018). Hand rehabilitation via
gesture recognition using leap motion controller. In 11th international conference on
Ankith, B., Boggaram, A., Sharma, A., Ramanujan, A., & Bharathi, R. (2022). Sign human system interaction.
language translation systems: A systematic literature review. International Journal Conneau, A., Lample, G., Ranzato, M., Denoyer, L., & Jégou, H. (2017). Word
of Software Science and Computational Intelligence (IJSSCI), 14, 1–33. translation without parallel data. arXiv:1710.04087.
Arikan, O., & Forsyth, D. (2002). Interactive motion generation from examples. In Cortana, C. 2022. www.microsoft.com.
Proceedings of the 29th annual conference on computer graphics and interactive Cox, S., Lincoln, M., Tryggvason, J., Nakisa, M., Wells, M., Tutt, M., et al. (2002).
techniques (pp. 483–490). Tessa, a system to aid communication with deaf people. In Proceedings of the 5th
Artetxe, M., & Schwenk, H. (2019). Massively multilingual sentence embeddings for international ACM conference on assistive technologies (pp. 205–212).
zero-shot cross-lingual transfer and beyond. In Transactions of the Association for CSL (2022). Chinese sign language. https://www.startasl.com/chinese-sign-language/.
Computational Linguistics, vol. 7 (pp. 597–610). Dangsaart, S., Naruedomkul, K., Cercone, N., & Sirinaovakul, B. (2008). Intelligent
Athitsos, V., Neidle, C., Sclaroff, S., Nash, J., Stefan, A., Yuan, Q., et al. (2018). The Thai text – Thai sign translation for language learning. Computers and Education,
American sign language lexicon video dataset. In Proceedings of the IEEE conference 1125–1141.
on computer vision and pattern recognition (pp. 1–8). Darabkh, K., Alturk, F., & Sweidan, S. (2018). VRCDEA-TCS: 3D virtual reality cooper-
Auslan Language (2022). shopping-and-auslan. https://www.gcss.org.au/2018/07/ ative drawing educational application with textual chatting system. In Comput appl
shopping-and-auslan/. eng educ, vol. 26 (pp. 1677–1698).
Bachmann, D., Weichert, F., & Rinkenauer, G. (2018). Review of three-dimensional Dawes, F., Penders, J., & Carbone, G. (2018). Remote control of a robotic hand using
human-computer interaction with focus on the leap motion controller. Sensors. a leap sensor. In The international conference of IFToMM (pp. 332–341).
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly Deafness and hearing 2022. https://www.who.int/news-room/fact-sheets/detail/
learning to align and translate. In ICLR. deafness-and-hearing.
Bangham, J., Cox, S., Elliott, R., Glauert, J., Marshall, I., Rankov, S., et al. (2000). Delorme, M., & Braffort, M. (2009). Animation generation process for sign language syn-
Virtual signing: Capture, animation, storage and transmission – an overview of the thesis. In Advances in Computer-Human Interactions, Second International Conferences
ViSiCAST project. Speech and Language Processing for Disabled and Elderly People. on.
Ben, S., Camgoz, N., & Bowden, R. (2021a). Mixed signals: Sign language production Denkowski, M., & Lavie, A. (2014). Meteor universal: Language specific translation
via a mixture of motion primitives. In Proceedings of the IEEE/CVF international evaluation for any target language. Proceedings of the Ninth Workshop on Statistical
conference on computer vision (pp. 1919–1929). Machine Translation, 376—380.
Ben, S., Camgoz, N., & Bowden, R. (2021b). Skeletal graph self-attention: Embedding Denton, E., Chintala, S., Szlam, A., & Fergus, R. (2015). Deep generative image models
a skeleton inductive bias into sign language production. arXiv:2112.05277. using a laplacian pyramid of adversarial networks. Advances in Neural Information
Ben, S., Camgoz, N., & Bowden, R. (2022). Signing at scale: Learning to co-articulate Processing Systems (NIPS).
signs for large-scale photo-realistic sign language production. In Proceedings of the DHHP (2022). Deaf and hard of hearing program. https://www.childrenshospital.org/
IEEE/CVF conference on computer vision and pattern recognition (pp. 5141–5151). centers-and-services/programs/a-e/deaf-and-hard-of-hearing-program.
Bragg, D., et al. (2019). Sign language recognition, generation, and translation: An Duarte, A. (2019). Cross-modal neural sign language translation. In The 27th ACM
interdisciplinary perspective. In ASSETS ’19. international conference.
Bregler, C. (2007). Motion capture technology for entertainment. IEEE Signal Processing Duarte, K., & Gibet, S. (2010). Corpus design for signing avatars. In Workshop on
Magazine. representation and processing of sign languages: Corpora and sign language technologies
Brock, H., & Nakadai, K. (2018). Deep JSLC: A multimodal corpus collection for data- (pp. 1–3).
driven generation of Japanese sign language expressions. In Proceedings of the Duarte, A., Sh., P., Ghadiyaram, D., DeHaan, K., Metze, F., Torres, J., et al.
eleventh international conference on language resources and evaluation. (2020). How2Sign: A large-scale multimodal dataset for continuous American sign
Bungeroth, J., Stein, D., Dreuw, P., Ney, H., Morrissey, S., Way, A., et al. (2008). The language. In Sign language recognition, translation, and production workshop.
ATIS sign language corpus. In 6th International conference on language resources and Eastin, E. (2022). European assistive technology information network (EASTIN)
evaluation. database. http://www.eastin.eu.
Butt, A., Rovini, E., Dolciotti, C., De Petris, G., Bongioanni, P., Carboncini, M., et EblingJohn, S., & Glauert, G. (2013). Exploiting the full potential of jasigning to
al. (2018). Objective and automatic classification of parkinson disease with leap build an avatar signing train announcements. In 3rd International symposium on
motion controller. Biomedical Engineering. sign language translation and avatar technology (pp. 1–9).
Cai, S., Zhu, G., Tien Wu, Y., Liu, E., & Hu, X. (2018). A case study of gesture-based Efthimiou, E., Fotinea, S., Hanke, T., Glauert, J., Bowden, R., Braffort, A., et al. (2012).
games in enhancing the fine motor skills and recognition. In Interactive learning The dicta-sign wiki: Enabling web communication for the deaf. In International
environments, vol. 26. conference on computers for handicapped persons (pp. 205–212).
Camgöz, N., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018). RWTH-PHOENIX- Ehima, E. (2022). Guidance document for classification of hearing aids and accessories.
weather 2014 T: Parallel corpus of sign language video, Gloss and translation. In In European hearing instrument manufacturers association.
Proceedings of the IEEE conference on computer vision and pattern recognition. Elliott, D., Frank, S., Simaéan, K., & Specia, L. (2016). Multi30k: Multilingual english-
Camgoz, N., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018a). Neural sign german image descriptions. In Proceedings of the 5th workshop on vision and language
language translation. In Proceedings of the IEEE conference on computer vision and (pp. 70–74).
pattern recognition. Ethnologue (2022a). Argentine sign language. https://www.ethnologue.com/language/
Camgoz, N., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018b). Neural sign aed.
language translation. IEEE Conference on Computer Vision and Pattern Recognition. Ethnologue (2022b). Greek sign language. https://www.ethnologue.com/language/gss.
Camgoz, N., Koller, O., Hadfield, S., & Bowden, R. (2020). Multi-channel transformers Ethnologue (2022c). Persian sign language. https://www.ethnologue.com/language/
for multi-articulatory sign language translation. In ECCVW. psc.
Camgoz, N., Saunders, B., Rochette, G., Giovanelli, M., Inches, G., Nachtrab-Ribback, R., Facebook Messenger 2022. www.facebook.com.
et al. (2021). Content4All open research sign language translation datasets. In IEEE FaceTime, F. 2022. www.Appleapps.apple.com.
international conference on automatic face and gesture recognition. Field, M., Stirling, D., Naghdy, F., & Pan, Z. (2009). Motion capture in robotics review.
Caselli, N., Sehyr, Z., Cohen-Goldberg, A., & Emmorey, K. (2017). ASL-lex: A lexical In Proceedings of the 2009 IEEE international conference on control and automation;
database for ASL. In Behavior research methods, vol. 49 (pp. 784—801). Institute of electrical and electronics engineers (pp. 1697—1702).
CDC (2022). Centers for disease control and prevention. https://www.cdc.gov/. Finn, C., Goodfellow, I., & Levine, S. (2016). Unsupervised learning for physical
Chan, C., Ginosar, S., Zhou, T., & Efros, A. (2019). Everybody dance now. IEEE interaction through video prediction. Advances in Neural Information Processing
Conference on Computer Vision and Pattern Recognition. Systems (NIPS).
Chen, Q., & Koltun, V. (2017). Photographic image synthesis with cascaded refinement Forster, J., Schmidt, C., Hoyoux, T., Koller, O., Zelle, U., Piater, J., et al. (2012). RWTH-
networks. In ICCV (pp. 1511–1520). PHOENIX-weather: A large vocabulary sign language recognition and translation
Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. (2018). Deeplab: corpus. In Proceedings of the eighth international conference on language resources and
Semantic image segmentation with deep convolutional nets, atrous convolution, evaluation (pp. 3785—3789).
and fully connected crfs. In TPAMI, vol. 40. Gárate, M. (2014). Developing bilingual literacy in deaf children. Kurosio Publishe.
Cho, B., Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. Geers, A., et al. (2017). Early sign language exposure and cochlear implantation
(2014). Learning phrase representations using RNN encoder–decoder for statistical benefits. In Pediatrics, vol. 140.
machine translation. arXiv:1406.1078. Geng, W., & Yu, G. (2003). Reuse of motion capture data in animation: A review. In
Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., & Bengio, H. Proceedings of the lecture notes in computer science (pp. 620—629).
(2014). Learning phrase representations using RNN encoder-decoder for statistical Ghanem, S., Conly, C., & Athitsos, V. (2017). A survey on sign language recognition
machine translation. In EMNLP (pp. 1724—1734). using smartphones. In Proceedings of the 10th international conference on pervasive
Christopher, K., Kümmel, C., Ritter, D., & Hildebrand, K. (2021). Pose-guided sign technologies related to assistive environments, Island of Rhodes Greece.
language video GAN with dynamic lambda. arXiv:2105.02742. Gibet, S. (2018). Building french sign language motion capture corpora for signing
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated avatars. In Workshop on the representation and processing of sign languages: involving
recurrent neural networks on sequence modeling. arXiv:1412.3555. the language community.

20
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Gibet, S. (2020). Signing avatars: what challenges for the production of content in sign Kataoka, Y., Matsubara, T., & Uehara, K. (2016). Image generation using adversarial
languages? Disability. networks and attention mechanism. In IEEE/ACIS 15th international conference on
Gibet, S., Lefebvre-Albaret, F., Hamon, L., Brun, R., & Turki, A. (2016a). Interactive computer and information science.
editing in french sign language dedicated to virtual signers: Requirements and Kennaway, R. (2013). Avatar-independent scripting for real-time gesture animation,
challenges. In Universal access in the information society, vol. 15 (pp. 525–539). procedural animation of sign language. arXiv:1502.02961.
Gibet, S., Lefebvre-Albaret, F., Hamon, L., Brun, R., & Turki, A. (2016b). Interactive Khan, N., Abid, A., & Abid, K. (2020). A novel natural language processing (NLP)-
editing in french sign language dedicated to virtual signers: Requirements and based machine translation model for English to Pakistan sign language translation.
challenges. In Universal access in the information society, vol. 15 (pp. 525–539). In Cognitive computation, vol. 12 (pp. 748–765).
Gibet, S., & Marteau, P. (2023). Signing avatars-multimodal challenges for text-to-sign Kingma, D., & Welling, M. (2014). Auto-encoding variational bayes. In ICLR.
generation. In 2023 IEEE 17th international conference on automatic face and gesture Kipp, M., Heloir, A., & Nguyen, Q. (2011). Sign language avatars: Animation
recognition (pp. 1–8). and comprehensibility. In International workshop on intelligent virtual agents (pp.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et 113–126).
al. (2014). Attribute2image: Conditional image generation from visual attributes. Ko, S., Kim, C., Jung, H., & Cho, C. (2019). Neural sign language translation based on
Advances in Neural Information Processing Systems, 2672—2680. human keypoint estimation. In Applied sciences, vol. 9.
Graves, A., Fernandez, S., & Schmidhuber, J. (2007). Multi-dimensional recurrent neural Kovar, L., Gleicher, M., & Pighin, F. (2002). Motion graphs. In SIGGRAPH ’02:
networks. In ICANN. Proceedings of the 29th annual conference on computer graphics and interactive
Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent techniques (pp. 473–483).
neural networks. In ICASSP. Lample, G., Conneau, A., Denoyer, L., & Ranzato, M. (2017). Unsupervised machine
Gregor, K., Danihelka, I., Graves, A., Rezende, D., & Wierstra, D. (2015). Draw: A translation using monolingual corpora only. arXiv:1711.00043.
recurrent neural network for image generation. In Proceedings of machine learning Lane, H. (2017). A chronology of the oppression of sign language in France and the
research. United States. In Recent perspectives on American sign language (pp. 119—161).
Grieve, S. (1999). English to American sign language machine translation of weather Psychology Press.
reports. In Proceedings of the second high desert student conference in linguistics (pp. Larboulette, C., & Gibet, S. (2018). Avatar signers: what can they teach us?, J. Enaction
23–30). 2018: Day on enaction in animation. Simulation and Virtual Reality.
Grieve, S. (2002). SignSynth: A sign language synthesis application using Web3D and LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied
perl. In International gesture workshop (pp. 134–145). to document recognition. In Proceedings of the IEEE, vol. 86.
Grosjean, F. (2010). Bilingualism, biculturalism, and deafness. In International journal Lee, J., & Shin, S. (1999). A hierarchical approach to interactive motion editing
of bilingual education and bilingualism, vol. 13 (pp. 133–145). for human-like figures. A Hierarchical Approach to Interactive Motion Editing for
Guo, D., Zhou, W., Li, H., & Wang, M. (2018). Hierarchical LSTM for sign language Human-Like Figures, 39–48.
translation. In The thirty-second AAAI conference on artificial intelligence.
Lewis, P., Oğuz, B., Rinott, R., Riedel, S., & Schwenk, H. (2019). MLQA: Evaluating
Hamers, J. (1998). Cognitive and language development of bilingual children. In
cross-lingual extractive question answering. arXiv:1910.07475.
Parasnis, cultural, and language diversity and the deaf experience (pp. 51—75).
Li, D., Xu, C., Liu, L., Zhong, Y., Wang, R., Peterson, L., et al. (2022). Transcribing
Cambridge University Press.
natural languages for the deaf via neural editing programs. In Proceedings of the
Hammadi, M., Muhammad, G., Abdul, W., Alsulaiman, M., Bencherif, M., &
AAAI conference on artificial intelligence, vol. 36 (pp. 11991–11999).
Mekhtiche, M. (2020). Hand gesture recognition for sign language using 3DCNN.
Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. In
IEEE Access, 8, 79491–79509.
Proceedings of the workshop on text summarization branches out.
Hangout, G. 2022. www.hangout.google.com.
Lotter, W., Kreiman, G., & Cox, D. (2015). Unsupervised learning of visual structure
Hanke, T. (2004). HamNoSys – Representing sign language data in language resources
using predictive generative networks. arXiv:1511.06380.
and language processing contexts. In Workshop on the representation and processing
Lucie, N., Larboulette, C., & Gibet, S. (2020). SignSynth: Data-driven sign language
of sign languages. 4th international conference on language resources and evaluation.
video generation. In Computers and graphics, vol. 92 (pp. 76–98).
Hanke, T. (2022). German sign language. https://www.awhamburg.de/en/research/
Lucker, J., & Hersh, M. (2003). Alarm and alerting systems for hearing-impaired and
long-term-scientific-projects/dictionary-german-sign-language.html.
deaf people. In Assistive technology for the hearing-impaired, Deaf and Deaf blind (pp.
HDS (2022). Center for hearing and deaf services. https://www.nationaldeafcenter.org/.
215–255).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image
Luo, W., Li, Y., Urtasun, R., & Zemel, R. (2016). Understanding the effective recep-
recognition. In Proceedings of the IEEE conference on computer vision and pattern
tive field in deep convolutional neural networks. Advances in Neural Information
recognition.
Processing Systems (NIPS).
Hersh, M., & Johnson, M. (2003). Hearing-aid principles and technology. In Assistive
Luong, M., Pham, h., & Manning, C. (2015). Effective approaches to attention-based
technology for the hearing-impaired, deaf and deaf blind (pp. 71–116).
neural machine translation. arXiv:1508.04025.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. In Neural
computation, vol. 9. Luqman, H., & Sabri, A. (2019). Automatic translation of arabic text to arabic sign
HSDC (2022). Hearing, speech and deaf cente. https://www.nationaldeafcenter.org/. language. In Universal access in the information society, vol. 18 (pp. 939–951).
Huenerfauth, M. (2004). Spatial and planning models of ASL classifier predicates Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., & Van Gool, L. (2017). Pose guided
for machine translation. In The 10th international conference on theoretical and person image generation. Advances in Neural Information Processing Systems (NIPS).
methodological issues in machine translation (pp. 65–74). Majidi, N., Kiani, K., & Rastgoo, R. (2020). A deep model for super-resolution
Huenerfauth, M. (2005). American sign language generation: Multimodal NLG with enhancement from a single image. In Journal of AI and data mining, vol. 8 (pp.
multiple linguistic channels. In Proceedings of the ACL student research workshop 451–460).
(pp. 37–42). ManCAD (2022). Manchester centre for audiology and deafness. https://www.
Humphries, T. (2013). Schooling in American sign language: A paradigm shift from a research.manchester.ac.uk/portal/en/projects/manchester-centre-for-audiology-
deficit model to a bilingual model in deaf education. In Berkeley review of education, and-deafness-mancad(90f5d28b-08c9-4c76-891b-9340c290529f).html.
vol. 4 (pp. 7–33). Mathieu, M., Couprie, C., & LeCun, Y. (2016). Deep multi-scale video prediction beyond
Hutchins, J. (2005). History of machine translation. http://psychotransling.ucoz.com/- mean square error. In ICLR.
ld/0/11-Hutchins-survey.pdf. Matthes, S., Hanke, T., Regen, A., Storz, J., Worseck, S., Efthimiou, E., et al. (2012).
Imo, I. 2022. www.Imo.com. Dicta-sign–building a multilingual sign language corpus. In 5th LREC. Istanbul.
Jain, V., et al. (2007). Supervised learning of image restoration with convolutional McDonald, J., Wolfe, R., Schnepp, J., Hochgesang, J., Jamrozik, D., Stumbo, M., et al.
networks. In ICCV. (2016). An automated technique for real-time production of lifelike animations of
Jemni, M., Ghoul, O., Boulares, M., Yahia, N., Jaballah, K., Othman, A., et al. (2022). American sign language. In Universal access in the information society, vol. 15 (pp.
WebSign. http://www.latice.rnu.tn/websign/. 551–566).
Kahlon, N., & Singh, W. (2021). Machine translation from text to sign language: a Michel, P., & Neubig, G. (2018). MTNT: A testbed for machine translation of noisy
systematic review. Universal Access in the Information Society, 1–35. text. In Proceedings of the 2018 conference on empirical methods in natural language
Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A., Graves, A., & Kavukcuoglu, K. processing.
(2016). Neural machine translation in linear time. arXiv:1610.10099. Morando, M., Ponte, S., Ferrara, E., & Dellepiane, S. (2018). Definition of motion
Kanis, J., Zahradil, j., Jurčíček, F., & Müller, L. (2006). Czech-sign speech corpus for and biophysical indicators for home-based rehabilitation through serious games.
semantic based machine translation. In International conference on text, speech and In Information, vol. 9.
dialogue (pp. 613–620). Mori, M., MacDorman, K., & Kageki, N. (2012). The uncanny valley [from the field. In
Karpouzis, K., Caridakis, G., Fotinea, S.-E., & Efthimiou, E. (2020). Educational IEEE robotics and automation magazine, vol. 19 (pp. 98–100).
resources and implementation of a greek sign language synthesis architecture. In NAD (2022). National association of the deaf. Assistive Listening Systems and Devices.
Web3D technologies in learning, education and training (pp. 54–74). Naert, L., Larboulette, C., & Gibet, S. (2017). Coarticulation analysis for sign language
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for synthesis. In Universal access in human–computer interaction. Designing novel interac-
generative adversarial networks. In Proceedings of the IEEE conference on computer tions: 11th international conference, UAHCI 2017, held as part of HCI international
vision and pattern recognition. 2017, Vancouver, BC, Canada, July 9–14, 2017, proceedings, part II 11 (pp. 55–75).

21
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Naert, L., Larboulette, C., & Gibet, S. (2020a). Lsf-animal: A motion capture corpus in Salto (2022). Polish sign language. https://www.salto-youth.net/tools/otlas-partner-
french sign language designed for the animation of signing avatars. In Proceedings finding/organisation/association-of-polish-sign-language-interpreters.2561/.
of the 12th language resources and evaluation conference (pp. 6008–6017). Saunders, B., Camgöz, N., & Bowden, R. (2020a). Adversarial training for multi-channel
Naert, L., Larboulette, C., & Gibet, S. (2020b). A survey on the animation of signing sign language production. In BMVC.
avatars: From sign representation to utterance synthesis. In Computers and graphics, Saunders, B., Camgoz, N., & Bowden, R. (2020b). Everybody sign now: Translating
vol. 92 (pp. 76–98). spoken language to photo realistic sign language video. arXiv:2011.09846.
Naert, L., Larboulette, C., & Gibet, S. (2021). Motion synthesis and editing for the Saunders, B., Camgöz, N., & Bowden, R. (2020c). Everybody sign now: Translating
generation of new sign language content: Building new signs with phonological spoken language to photo realistic sign language video. arXiv:2011.09846.
recombination. In Machine translation, vol. 35 (pp. 405–430). Saunders, B., Camgoz, N., & Bowden, R. (2020d). Progressive transformers for
Naert, L., Reverdy, C., Larboulette, C., & Gibet, S. (2018). Per channel automatic end-to-end sign language production. In ECCV (pp. 687–705).
annotation of sign language motion capture data. In Workshop on the representation Saunders, B., Camgoz, N., & Bowden, R. (2021). AnonySIGN: Novel human appearance
and processing of sign languages: Involving the language community. synthesis for sign language video anonymization. In IEEE international conference
Nakazawa, T., Yaguchi, M., Uchimoto, K., Utiyama, M., Sumita, E., Kurohashi, S., et on automatic face and gesture recognition.
al. (2016). ASPEC: Asian scientific paper excerpt corpus. In Proceedings of the ninth See, A., & Lamm, M. (2020). Machine translation, sequence-to-sequence and attention.
international conference on language resources and evaluation (pp. 2204–2208). https://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture08-nmt.pdf.
Natarajan, B., & Elakkiya, R. (2022). Dynamic GAN for high-quality sign language Sharma, S., & Kumar, K. (2021). ASL-3DCNN: American sign language recognition tech-
video generation from skeletal poses using generative adversarial networks. In Soft nique using 3-d convolutional neural networks. In Multimedia tools and applications,
computing, vol. 26 (pp. 13153–13175). vol. 80 (pp. 26319–26331).
NCDB (2022). National center on deaf-blindness. https://www.nationaldb.org/. Shi, X., Chen, Z., Wang, H., Yeung, D., Wong, W., & Woo, W. (2015). Convolutional
NDC (2022). National deaf cente. https://www.nationaldeafcenter.org/. LSTM network: A machine learning approach for precipitation nowcasting. Advances
NIDCD (2022). National institute on deafness and other communication. https://www. in Neural Information Processing Systems (NIPS).
nidcd.nih.gov/. Siarohin, A., Sangineto, E., Lathuiliere, S., & Sebe, N. (2018). Deformable gans for pose-
Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural net- based human image generation. In Proceedings of the IEEE conference on computer
works. In Proceedings of the 33rd international conference on machine learning (pp. vision and pattern recognition.
1747–1756). Siddique, S., Ahmed, T., Talukder, R., & Uddin, M. (2020). English to bangla machine
Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., & Kavukcuoglu, K. translation using recurrent neural network. In International journal of future computer
(2016). Conditional image generation with pixel-cnn decoders. Advances in Neural and communication, vol. 9.
Information Processing Systems (NIPS). Siri, S. 2022. www.apple.com.
Othman, A., & Jemni, M. (2011). Statistical sign language machine translation: from Skype 2022. www.skype.com.
english written text to American sign language gloss. In IJCSI international journal Snapchat 2022. www.snapchat.com.
of computer science, vol. 8 (pp. 65–73). Sripairojthikoon, N., & Harnsomburana, J. (2019). Thai sign language recognition using
Owlcation (2022). Korean sign language. https://owlcation.com/humanities/Korean- 3D convolutional neural networks. In ICCCM 2019: Proceedings of the 2019 7th
Sign-Language. international conference on computer and communications management (pp. 186–189).
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic Start ASL (2022). Spanish sign language. https://www.startasl.com/spanish-sign-
evaluation of machine translation. In ACL’02: Proceedings of the 40th annual meeting language-ssl/.
on Association for Computational Linguistics (pp. 311—318). Stokoe, W., Casterline, D., & Croneberg, C. (1965). A dictionary of American sign language
Pei, W., Zhang, J., Wang, X., Ke, L., Shen, K., & Tai, Y. (2019). Memory-attended on linguistic principles. Linstok Press.
recurrent network for video captioning. In Proceedings of the IEEE conference on Stoll, S., Camgoz, N., Hadfield, S., & Bowden, R. (2018). Sign language production
computer vision and pattern recognition (pp. 8347–8356). using neural machine translation and generative adversarial networks. In BMVC.
Plantard, P., Shum, H., Le Pierres, A., & Multon, F. (2017). Validation of an ergonomic Stoll, S., Camgoz, N., Hadfield, S., & Bowden, R. (2020). Text2Sign: Towards sign
assessment method using Kinect data in real workplace conditions. In Appl. ergon., language production using neural machine translation and generative adversarial
vol. 65 (pp. 562—569). networks. In International journal of computer vision, vol. 128 (pp. 891–908).
Prillwitz, S. (1989). HamNoSys. Version 2.0. Hamburg notation system for sign languages. Stoll, S., Hadfield, S., & Bowden, R. (2020). SignSynth: Data-driven sign language video
An introductory guide. Hamburg Signum Press. generation. In Lecture notes in computer science: vol. 12538, Computer vision - ECCV
Rastgoo, R., Kiani, K., & Escalera, S. (2018). Multi-modal deep hand sign language 2020 workshops.
recognition in still images using restricted Boltzmann machine. In Entropy, vol. 20. Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural
Rastgoo, R., Kiani, K., & Escalera, S. (2020a). Hand sign language recognition using networks. Advances in Neural Information Processing Systems (NIPS).
multi-view hand skeleton. In Expert systems with applications, vol. 150. Swanwick, R. (2010). Bipolicy and practice in sign bilingual education: Develop-
Rastgoo, R., Kiani, K., & Escalera, S. (2020b). Video-based isolated hand sign language ment, challenges and directions. In International journal of bilingual education and
recognition using a deep cascaded model. In Multimedia tools and applications, vol. bilingualism, vol. 13 (pp. 147–158).
79 (pp. 22965—22987). Tamir, M., & Oz, G. (2008). Real-time objects tracking and motion capture in sports
Rastgoo, R., Kiani, K., & Escalera, S. (2021a). Hand pose aware multimodal isolated sign events. In U.S. patent application No. 11/909,080.
language recognition. In Multimedia tools and applications, vol. 80 (pp. 127—163). Tang, S., Hong, R., Guo, D., & Wang, M. (2022). Gloss semantic-enhanced network with
Rastgoo, R., Kiani, K., & Escalera, S. (2021b). Real-time isolated hand sign language online back-translation for sign language production. In Proceedings of the 30th ACM
recognition using deep networks and SVD. Journal of Ambient Intelligence and international conference on multimedia (pp. 5630–5638).
Humanized Computing. Tiedemann, J. (2016). Finding alternative translations in a large corpus of movie
Rastgoo, R., Kiani, K., & Escalera, S. (2021c). Sign language recognition: A deep survey. subtitles. In Proceedings of the 10th international conference on language resources
In Expert systems with application, vol. 164. Article 113794. and evaluation.
Rastgoo, R., Kiani, K., & Escalera, S. (2022a). ZS-SLR: Zero-shot sign language Tornay, S., Camgöz, N., Bowden, R., & Doss, M. (2020). A phonology-based approach
recognition from RGB-d videos. arXiv:2108.10059. for isolated sign production assessment in sign language. In ICMI ’20 companion:
Rastgoo, R., Kiani, K., & Escalera, S. (2022y). A non-anatomical graph structure for Companion publication of the 2020 international conference on multimodal interaction
isolated hand gesture separation in continuous gesture sequences. arXiv:2207. (pp. 102–106).
07619. UN General Assembly (2006). Convention on the rights of persons with disabilities. In
Rastgoo, R., Kiani, K., & Escalera, S. (2022z). Word separation in continuous sign GA res, vol. 61.
language using isolated signs and post-processing. arXiv:2204.00923. Vaitkevičius, A., Taroza, M., Blažauskas, T., Damaševičius, R., Maskeliūnas, R., &
Rastgoo, R., Kiani, K., & Escalera, S. (2023x). A deep co-attentive hand-based video Woźniak, M. (2019). Recognition of American sign language gestures in a virtual
question answering framework using multi-view skeleton. Multimedia Tools and reality using leap motion. In Appl. sci., vol. 9.
Applications 82, 1401–1429. Valero, P., Sivanathan, A., Bosché, F., & Abdel-Wahab, M. (2017). Analysis of con-
Rastgoo, R., Kiani, K., Escalera, S., & Sabokrou, M. (2021d). Sign language production: struction trade worker body motions using a wearable and wireless motion sensor
A review. In Proceedings of the IEEE/CVF conference on computer vision and pattern network. In Autom. constr., vol. 83 (pp. 48—55).
recognition (pp. 3451–3461). Vasani, N., Autee, P., Kalyani, S., & Karani, R. (2020). Generation of Indian sign lan-
Rastgoo, R., Kiani, K., Escalera, S., & Sabokrou, M. (2022b). Multi-modal zero-shot sign guage by sentence processing and generative adversarial networks. In International
language recognition. arXiv:2109.00796. conference on intelligent sustainable systems.
Rezaei, M., Rastgoo, R., & Athitsos, V. (2023). Trihorn-net: a model for accurate Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., et al. (2017).
depth-based 3d hand pose estimation. Expert Systems with Applications 223, 119922. Attention is all you need. Advances in Neural Information Processing Systems (NIPS).
Roccetti, M., Marfia, G., & Semeraro, A. (2012). Playing into the wild: A gesture-based Veale, T., & Conway, A. (1994). Cross modal comprehension in ZARDOZ an english to
interface for gaming in public spaces. In Journal of visual communication and image sign-language translation system. In INLG ’94 proceedings of the seventh international
representation, vol. 23 (pp. 426–440). workshop on natural language generation (pp. 249–252).
Rodriguez-Moreno, I., Martinez-Otzeta, J. M., & Sierra, B. (2023). HAKA: HierArchical Vedantam, R., Zitnick, C., & Parikh, D. (2015). Cider: Consensus-based image descrip-
knowledge acquisition in a sign language tutor. In Expert systems with applications, tion evaluation. In Proceedings of the workshop on text summarization branches out
vol. 215. (pp. 4566—4575).

22
R. Rastgoo et al. Expert Systems With Applications 243 (2024) 122846

Vicars, W. (2022). American sign language. http://www.lifeprint.com/. WFD, W. (2022). World federation of the deaf. In Working document on adoption and
Villegas, R., Yang, J., Hong, S., Lin, X., & Lee, H. (2017). Decomposing motion and adaptation of technologies and accessibility – Prepared by the WFD expert group on
content for natural video sequence prediction. In ICLR. accessibility and technology.
Virginia (2022). Northern virginia resource center for deaf and hard of hearing persons. WhatsApp Messenger 2022. www.whatsapp.com.
https://nvrc.org/. WHO (2022). World health organization. https://www.who.int/.
Virtual Humans Group (2017). Virtual humans research for sign language animation. WHO: World Health Organization (2022). Deafness and hearing loss. http://www.who.
In School of computing sciences (pp. 205–212). UEA Norwich, UK. int/mediacentre/factsheets/fs300/en/.
Walsh, H., Saunders, B., & Bowden, R. (2022). Changing the representation: Examining Yan, X., Yang, J., Sohn, K., & Lee, H. (2016). Attribute2image: Conditional image
language representation for neural sign language production. In Seventh international generation from visual attributes. In ECCV (pp. 776—791).
workshop on sign language translation and avatar technology: The junction of the visual Yang, H. (2014). Sign language recognition with the kinect sensor based on conditional
and the textual. random fields. Sensors, 15, 135–147.
Wang, T., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., et al. (2018). Video-to-video Yu, F., Koltun, V., & Funkhouser, T. (2017). Dilated residual networks. In Proceedings
synthesis. Advances in Neural Information Processing Systems (NIPS). of the IEEE conference on computer vision and pattern recognition.
Wang, T., Liu, M., Zhu, J., Tao, A., Kautz, J., & Catanzaro, B. (2018a). High-resolution Zelinka, J., & Kanis, J. (2020). Neural sign language synthesis: Words are our glosses.
image synthesis and semantic manipulation with conditional GANs. In Proceedings In WACV (pp. 3395–3403).
of the IEEE conference on computer vision and pattern recognition. Zhan, E., Zheng, S., Yue, Y., Sha, L., & Lucey, P. (2019). Generating multi-agent
Wellington, S., Alaniz, A., Hurtado, M., De Silva, B., & De Bem, R. (2022). SynLibras: trajectories using programmatic weak supervision. In ICLR.
A disentangled deep generative model for Brazilian sign language synthesis. In Zhao, L., Kipper, K., Schuler, W., Vogler, C., & Palmer, M. (2000). A machine translation
In 2022 35th SIBGRAPI conference on graphics, patterns and images, vol. 1 (pp. system from english to American sign language. In Conference of the Association for
210–215). machine translation in the Americas (pp. 54–67).
Wencan, H., Zhao, Z., He, J., & Zhang, M. (2022). DualSign: Semi-supervised sign Zij, L., & Barker, D. (2003). South African sign language machinee translation system.
language production with balanced multi-modal multi-task dual transformation. In In Proceedings of the second international conference on computer graphics, Virtual
Proceedings of the 30th ACM international conference on multimedia (pp. 5486–5495). reality, visualization and interaction in Africa (pp. 49–52).
Zwitserlood, I., Verlinden, M., Ros, J., & Schoot, S. (2005). Synthetic signing for the
deaf: eSIGN. (pp. 1–6). http://www.visicast.cmp.uea.ac.uk.

Metaphor in Communication, Science and Education
No ratings yet
Metaphor in Communication, Science and Education
332 pages
Natural Language Understanding
No ratings yet
Natural Language Understanding
675 pages
Programming Glossary
100% (2)
Programming Glossary
15 pages
1 Intro To NLP
100% (1)
1 Intro To NLP
46 pages
Merge and The Strong Minimalist Thesis
No ratings yet
Merge and The Strong Minimalist Thesis
88 pages
Machine Translation Abdulla Homiedan
No ratings yet
Machine Translation Abdulla Homiedan
19 pages
Divyesh 1
No ratings yet
Divyesh 1
4 pages
Assignment No.1 - Lexical and Structural Ambiguity - NLP
No ratings yet
Assignment No.1 - Lexical and Structural Ambiguity - NLP
5 pages
Deixis
67% (3)
Deixis
23 pages
Dependency Parsing
100% (11)
Dependency Parsing
127 pages
Recent Advances On Deep Learning For Sign Language
No ratings yet
Recent Advances On Deep Learning For Sign Language
52 pages
NLP 3 4 5
No ratings yet
NLP 3 4 5
105 pages
Computational Semiotics
No ratings yet
Computational Semiotics
289 pages
The Branches of Linguistics
No ratings yet
The Branches of Linguistics
6 pages
Paper 1
No ratings yet
Paper 1
19 pages
Logical Reasoning-Introduction
No ratings yet
Logical Reasoning-Introduction
28 pages
Langauage Model
No ratings yet
Langauage Model
148 pages
A Programming Language Is An Artificial Language Designed To Communicate Instructions To A Machine
100% (1)
A Programming Language Is An Artificial Language Designed To Communicate Instructions To A Machine
15 pages
LEEDS-HURWITZ - Semiotics and Communication
No ratings yet
LEEDS-HURWITZ - Semiotics and Communication
90 pages
Natural Language Processing
No ratings yet
Natural Language Processing
36 pages
The Evolution of Localization (Bert Esselink)
100% (2)
The Evolution of Localization (Bert Esselink)
4 pages
Technology Era in Literary Translation
100% (1)
Technology Era in Literary Translation
5 pages
Tree Adjoining Grammar
No ratings yet
Tree Adjoining Grammar
22 pages
Types
No ratings yet
Types
41 pages
NLP Unit Iv
No ratings yet
NLP Unit Iv
24 pages
Computer Assisted Language Learning: A Brief History of CALL
No ratings yet
Computer Assisted Language Learning: A Brief History of CALL
40 pages
WFP 0000102103
No ratings yet
WFP 0000102103
119 pages
Large Language Models: A Survey
No ratings yet
Large Language Models: A Survey
43 pages
ChatGPT Assignment
No ratings yet
ChatGPT Assignment
44 pages
LLM Paper 1707247828
No ratings yet
LLM Paper 1707247828
24 pages
19011517-011-PPT-Translation & Technology
No ratings yet
19011517-011-PPT-Translation & Technology
15 pages
Retrieve
No ratings yet
Retrieve
21 pages
Introduction To Compiler Design - Solutions
No ratings yet
Introduction To Compiler Design - Solutions
6 pages
ATN
100% (1)
ATN
16 pages
Artificial Intelligencefor Sign Language Translation ADesign
No ratings yet
Artificial Intelligencefor Sign Language Translation ADesign
26 pages
Computational Linguistics - Introduction
No ratings yet
Computational Linguistics - Introduction
50 pages
Object Oriented Programming Lab-06 (Inheritance and Friend Functions)
No ratings yet
Object Oriented Programming Lab-06 (Inheritance and Friend Functions)
5 pages
The Virtualsign Channel For The Communication Between Deaf and Hearing Users
No ratings yet
The Virtualsign Channel For The Communication Between Deaf and Hearing Users
8 pages
NLP Python Intro 1-3
100% (1)
NLP Python Intro 1-3
79 pages
Natural Language Processing PDF
No ratings yet
Natural Language Processing PDF
170 pages
Visual Interpretation of Hand Gestures For Human Computer Interaction A Review Pavlovic Pavlovic97pami PDF
No ratings yet
Visual Interpretation of Hand Gestures For Human Computer Interaction A Review Pavlovic Pavlovic97pami PDF
19 pages
An Introduction To Psycholinguistics Language, Mind and Brain
No ratings yet
An Introduction To Psycholinguistics Language, Mind and Brain
20 pages
Text Mining Digital Humanities Projects: Assessing Content Analysis Capabilities of Voyant Tools
No ratings yet
Text Mining Digital Humanities Projects: Assessing Content Analysis Capabilities of Voyant Tools
30 pages
Artificial Intelligence For Education
No ratings yet
Artificial Intelligence For Education
18 pages
Lec12-SEMANTIC PROCESSING
No ratings yet
Lec12-SEMANTIC PROCESSING
26 pages
Memories On The Move: Migration, Diasporas and Citizenship
No ratings yet
Memories On The Move: Migration, Diasporas and Citizenship
301 pages
Deep Learning For Sign Language Recognition Current Techniques Benchmarks and Open Issues
No ratings yet
Deep Learning For Sign Language Recognition Current Techniques Benchmarks and Open Issues
35 pages
Language
No ratings yet
Language
53 pages
Text, Context and Knowledge
No ratings yet
Text, Context and Knowledge
38 pages
How To Do Data Analysis For Thesis
100% (3)
How To Do Data Analysis For Thesis
6 pages
David Ausubel's Subsumption Theory
No ratings yet
David Ausubel's Subsumption Theory
11 pages
Towards Automatically Extracting Story Graphs From Natural Language Stories
No ratings yet
Towards Automatically Extracting Story Graphs From Natural Language Stories
8 pages
Language Acquisition - Steven Pinker
No ratings yet
Language Acquisition - Steven Pinker
53 pages
The Universality of Human Language: Psycholiinguistics
No ratings yet
The Universality of Human Language: Psycholiinguistics
10 pages
Irony Special LLC2012
No ratings yet
Irony Special LLC2012
13 pages
Word Semantics, Sentence Semantics and Utterance Semantics
No ratings yet
Word Semantics, Sentence Semantics and Utterance Semantics
11 pages
MYP Years 4-5 Assessment Criteria 1
100% (1)
MYP Years 4-5 Assessment Criteria 1
33 pages
Natural Language Processing
100% (2)
Natural Language Processing
9 pages
What Is An Adjunct
No ratings yet
What Is An Adjunct
6 pages
Theoretical Foundations
No ratings yet
Theoretical Foundations
43 pages
BESC-133 Sample Paper
No ratings yet
BESC-133 Sample Paper
7 pages
1.02 Methods of Organizing Information
No ratings yet
1.02 Methods of Organizing Information
2 pages
Application of First-Order Logic in Knowledge Based Systems PDF
No ratings yet
Application of First-Order Logic in Knowledge Based Systems PDF
7 pages
Senior Research Guidelines 23 1 20
No ratings yet
Senior Research Guidelines 23 1 20
28 pages
The Relationship Between Parental Involvement and Student Achievement
No ratings yet
The Relationship Between Parental Involvement and Student Achievement
14 pages
Workshop-NMAM-2025-July 10-14 2025
No ratings yet
Workshop-NMAM-2025-July 10-14 2025
2 pages
Comic Strips - Miscommunication and Misinformation
No ratings yet
Comic Strips - Miscommunication and Misinformation
3 pages
Cultural Promotion Through Language: A Case Study of English Textbooks at Secondary Level in Different Provinces of Pakistan
No ratings yet
Cultural Promotion Through Language: A Case Study of English Textbooks at Secondary Level in Different Provinces of Pakistan
16 pages
Readings About Foregrounding
No ratings yet
Readings About Foregrounding
2 pages
AOTA's Updated Occupational Profile Template
No ratings yet
AOTA's Updated Occupational Profile Template
3 pages
EJ1294348
No ratings yet
EJ1294348
17 pages
Introduction: Biology Today: Powerpoint Lectures For
No ratings yet
Introduction: Biology Today: Powerpoint Lectures For
12 pages
Chapter 9 - Learning To Be The Student
No ratings yet
Chapter 9 - Learning To Be The Student
4 pages
حل المشكلات في الرياضيات - مقال
No ratings yet
حل المشكلات في الرياضيات - مقال
19 pages
The Wise Mind
No ratings yet
The Wise Mind
6 pages
Network Data Management Model Based On NaÃ Ve Bayes Classiï Er and Deep Neural Networks in Heterogeneous Wireless
No ratings yet
Network Data Management Model Based On NaÃ Ve Bayes Classiï Er and Deep Neural Networks in Heterogeneous Wireless
11 pages
5E'S Semi-Detailed Lesson Plan in Media and Information Literacy
No ratings yet
5E'S Semi-Detailed Lesson Plan in Media and Information Literacy
2 pages
Supervisi Administrasi Pembelajaran Dalam Meningkatkan Mutu Pembelajaran Di Smks 6 Pertiwi Curup
No ratings yet
Supervisi Administrasi Pembelajaran Dalam Meningkatkan Mutu Pembelajaran Di Smks 6 Pertiwi Curup
12 pages
PurCom Chapter 2 Communication and Globalization
No ratings yet
PurCom Chapter 2 Communication and Globalization
19 pages
RPP SD
No ratings yet
RPP SD
7 pages
Syntactic Theory and The Evolution of Syntax: Draft Manuscript
No ratings yet
Syntactic Theory and The Evolution of Syntax: Draft Manuscript
36 pages
Media Logic: January 2016
No ratings yet
Media Logic: January 2016
8 pages
ADET
No ratings yet
ADET
6 pages
Lesson 13 Discussion 1
No ratings yet
Lesson 13 Discussion 1
3 pages
Unit 1 The Nature and Context of Social Research
No ratings yet
Unit 1 The Nature and Context of Social Research
48 pages
Collocations and Semantic Prosody
No ratings yet
Collocations and Semantic Prosody
26 pages
Componential Analysis (I) Classical Structuralism
No ratings yet
Componential Analysis (I) Classical Structuralism
23 pages
January February March April May: Tasks
No ratings yet
January February March April May: Tasks
4 pages
UBA21S1272
No ratings yet
UBA21S1272
1 page
Mitosis Rubric
No ratings yet
Mitosis Rubric
1 page

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Survey Sign Language Production 2023

Uploaded by

Survey Sign Language Production 2023

Uploaded by

Expert Systems With Applications 243 (2024) 122846

Contents lists available at ScienceDirect

Expert Systems With Applications

A survey on recent advances in Sign Language Production

ARTICLE INFO ABSTRACT

1. Introduction people in society. According to the World Health Organization (WHO)

information (Stoll, Camgoz, Hadfield, & Bowden, 2020). Proposed

Fig. 2. A bi-directional sign language translation system.

4.4. Motion capture and signing avatars

Motion capture, mocap for short, is defined as the process of

that can be used for text-to-sign language translation. This dataset is an

various types of alerting devices, including clocks and wake-up alarm

A loss function for sequence-to-sequence translation.

where 𝑤𝑐 , 𝑤𝑓 , 𝑤𝑏 , 𝑤𝑗 are weights for the quality (continuity) of the

corresponding to these linguistic aspects are calculated as follows:

where 𝑓𝑓𝑜𝑟𝑚 is the state duration normalized form-level score for

Fig. 13. A general framework for SLP.

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.