Survey Sign Language Production 2023
Review

Keywords: Sign Language Production, Sign Language Recognition, Sign Language Translation, Deep learning, Survey, Deaf

Abstract: Sign language is the dominant form of communication used in the Deaf and hearing-impaired community. To enable easy and mutual communication between the hearing-impaired and hearing communities, building a robust system capable of translating spoken language into sign language, and vice versa, is fundamental. To this end, sign language recognition and production are the two necessary parts of such a two-way system. Both need to cope with some critical challenges. In this survey, we review recent advances in Sign Language Production (SLP) and related areas using deep learning. To give a more realistic perspective on sign language, we present an introduction to Deaf culture, Deaf centers, the psychological perspective of sign language, and the main differences between spoken language and sign language. Furthermore, we present the fundamental components of a bi-directional sign language translation system and discuss the main challenges in this area. The backbone architectures and methods in SLP are briefly introduced, and the proposed taxonomy of SLP is presented. Finally, we present a general framework for SLP and performance evaluation, together with a discussion of recent developments, advantages, and limitations of SLP, and possible lines for future research.
∗ Corresponding author.
E-mail addresses: rrastgoo@semnan.ac.ir (R. Rastgoo), kourosh.kiani@semnan.ac.ir (K. Kiani), sergio@maia.ub.es (S. Escalera), athitsos@uta.edu
(V. Athitsos), sabokro@ipm.ir (M. Sabokrou).
https://doi.org/10.1016/j.eswa.2023.122846
Received 30 December 2022; Received in revised form 3 June 2023; Accepted 2 December 2023
Available online 9 December 2023
struggled to use sign languages in schools, work, and public life (Geers et al., 2017). Furthermore, linguistic advancements have assisted sign languages so that they can be used as natural languages (Stokoe, Casterline, & Croneberg, 1965). Also, the role of legislation in establishing legal support for sign language education and usage cannot be ignored (UN General Assembly, 2006). Considering this historical struggle can help researchers sense the necessity of translation and recognition systems for sign languages that are applicable to the real life of the Deaf community (Bragg et al., 2019).

2.2. Deafness centers

Some common Deafness centers and open sources of data related to the Deaf community are listed below:

• World Health Organization (WHO) (WHO, 2022),
• National Institute on Deafness and Other Communication Disorders (NIDCD) (NIDCD, 2022),
• Centers for Disease Control and Prevention (CDC, 2022),
• National Deaf Center (NDC) (NDC, 2022),
• Hearing, Speech, and Deaf Center (HSDC) (HSDC, 2022),
• Center for Hearing and Deaf Services (HDS) (HDS, 2022),
• Deaf and Hard of Hearing Program (DHHP, 2022),
• Manchester Centre for Audiology and Deafness (ManCAD) (ManCAD, 2022),
• Northern Virginia Resource Center for Deaf and Hard of Hearing Persons (Virginia, 2022),
• National Center on Deaf-Blindness (NCDB) (NCDB, 2022).

These centers aim to provide educational, clinical, and research services to the Deaf community.

2.3. Psychological perspective of sign language

As we stated before, developing an efficient bidirectional sign language translation system requires the study of a wide range of fields, including CV, CG, NLP, HCI, Linguistics, and Deaf culture. To this end, we present a brief discussion of findings from developmental psychology, psycho-linguistics, cognitive psychology, and neuropsychological studies. Recent studies of attention and perception show that using sign language from an early age can boost some aspects of non-language visual perception, such as motion perception. Furthermore, neuropsychological and functional imaging studies indicate that left hemisphere regions are important in both sign and spoken language processing. Aphasia can occur in signers due to left hemisphere damage. Also, the existence of different modalities for language expression, such as oral–aural and manual–visual, makes room to explore different characteristics of human languages.

From a pathological perspective, Deaf people have different degrees of hearing deviation from the standard/norm hearing level defined for hearing people. Generally, four levels of deafness are defined: mild, moderate, severe, and profound hearing loss. This perspective is traditionally accepted by a majority of non-deaf professionals who interact with the Deaf community only on a professional basis. So, this issue can be considered for hardware implementations of the proposed systems in SLR and SLP. Developing such a system can make room for the Deaf community to overcome communication barriers and help them to stay motivated.

2.4. Sign language vs. spoken language

Generally, there are two main types of languages used in the community: spoken and sign. While these two language types are different from each other, both of them should be viewed as natural languages. The main difference between them is the way they convey information. Spoken language is an auditory/vocal language; it can also be considered an oral language. Various sound patterns are used to convey a message. There are many linguistic elements in spoken language, such as vowels, consonants, and tones, and changes in these elements can lead to different meanings for the same set of words. In contrast to spoken languages, gestures and facial expressions, rather than the vocal tract, play the key roles in conveying information in sign languages. There are different sign languages in the world; some of them are better known, such as American Sign Language (ASL). In every country, there are one or more sign languages used by the Deaf community. While people often think that sign languages are derived from spoken languages, they are independent natural languages that have evolved over time. Sign language is a complex language with specific linguistic properties. SLR is affected by the structural properties of sign language and occurs faster than spoken language recognition. While signs are articulated more slowly than spoken words, the proposition rate for sign and speech is identical. It should be noted that both languages can be used to convey all sorts of information, such as news, conversations about daily activities, stories, narrations, etc.

2.5. Bi-directional sign language translation system

As already discussed, to make a bi-directional sign language translation system, we need a system capable of translating from sign language into a spoken language (SLR) and vice versa (SLP) (see Fig. 2). While SLR has advanced rapidly in recent years, SLP is still a challenging problem. Since the details of SLR have been presented in some accurate and well-detailed surveys (Ghanem et al., 2017; Rastgoo et al., 2021c), in this survey we focus on the SLP side of this bi-directional system and present more details of recent works in SLP.

2.6. Conclusion

In this section, we briefly reviewed some concepts related to the Deaf community and its language. It is worth mentioning that sign language, as a language used in the Deaf community, should be viewed as a natural language, similar to spoken language. To have bi-directional communication between hearing and Deaf people, we need a system capable of translating from sign language into spoken language and vice versa. Considering our findings in this section, it is necessary to know
that the Deaf community unifies two groups, the culturally Deaf people and the other individuals who use sign language, in the same group to help Deaf people. Furthermore, the data from Deaf centers can be used for providing professional analysis of the Deaf community. To this end, we also need to consider psychology, psycho-linguistics, cognitive psychology, and neuropsychological findings in the Deaf community.

3. SLP

SLP is one of the main components of a bidirectional sign language translation system. This system can be used to facilitate easy and clear communication between the hearing and the Deaf communities. Furthermore, the necessity of such systems can also be considered from a psychological perspective for the Deaf community. In this section, we present more details on SLP.

3.1. Problem definition

The task of SLP can be defined as a video generation process from an input text. In more detail, given a spoken language sentence, $S_N = \{w_1, w_2, \ldots, w_N\}$, the model is expected to generate a video with $M$ frames, $V_M = \{F_1, F_2, \ldots, F_M\}$, containing the sign language video corresponding to the input sentence. Generally, there are some intermediate steps in the SLP task. During these steps, the input sentence from the spoken language is encoded into some representations in order to generate more accurate videos. We review the proposed models and also the intermediate steps of SLP in this survey.
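To make the mapping concrete, the following minimal sketch illustrates the expected input/output shapes of an SLP system that turns a tokenized sentence of N words into a sequence of M frames. It is an illustrative interface only, not a model from the literature; the names SLPModel and generate_sign_video are hypothetical.

```python
from typing import List
import numpy as np

class SLPModel:
    """Illustrative SLP interface: a spoken-language sentence in, a sign video out."""

    def __init__(self, frame_height: int = 256, frame_width: int = 256):
        self.frame_height = frame_height
        self.frame_width = frame_width

    def generate_sign_video(self, sentence: List[str], frames_per_word: int = 8) -> np.ndarray:
        """Map S_N = {w_1, ..., w_N} to V_M = {F_1, ..., F_M}.

        A real model would encode the sentence (e.g., with an NMT encoder),
        predict an intermediate representation such as glosses or skeleton
        poses, and then render RGB frames. Here we only return a placeholder
        tensor with the expected shape (M, H, W, 3).
        """
        num_frames = frames_per_word * len(sentence)          # M grows with N
        return np.zeros((num_frames, self.frame_height, self.frame_width, 3), dtype=np.uint8)

# Usage: a 3-word sentence produces a 24-frame placeholder sign video.
video = SLPModel().generate_sign_video(["I", "am", "Anna"])
print(video.shape)   # (24, 256, 256, 3)
```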
3.2. Challenges

Here, we discuss the most important challenges in SLP.

Interpretation between visual and linguistic information: SLP is still a very challenging problem, involving an interpretation between visual and linguistic information (Stoll, Camgoz, et al., 2020). Proposed systems in SLR generally map signs into the spoken language in the form of text transcription (Rastgoo et al., 2021c). However, SLP systems perform the reverse procedure. The challenges regarding mapping from the lingual domain into the visual domain still remain.

Visual variability of signs: The visual variability of signs is one of the challenges in SLP, and is affected by hand shape, palm orientation, movement, location, facial expressions, and other non-hand signals. These differences in sign appearance produce a large intra-class variability and low inter-class variability. This makes it hard to provide a robust and universal system.

Photo-realistic SLP system: Another challenge is generating a photo-realistic sign video from a text or voice in spoken language in a real-world situation. This is important because it helps the generated videos to be truly understandable and accepted by Deaf communities. Beyond the previous models based on graphical avatars and the recent neural SLP works that produce skeleton pose sequences, we need systems that are understandable and acceptable to Deaf viewers (Saunders, Camgöz, & Bowden, 2020c).

The grammatical rules and linguistic structures of the sign language: The challenge corresponding to the grammatical rules and linguistic structures of sign language is another critical challenge in this area. Translation between spoken and sign language is a complex task; it is not a simple word-to-word mapping from text/voice to sign. Another issue is the synchronized multi-modal nature of sign language, which requires simultaneously moving hands and faces to convey lexical and grammatical information.

Bilingual education: The brain has no preference for any type of language. The only preference of the brain is that it expects to receive input from a complete and natural language. In this way, both spoken and sign languages can be used as inputs for the brain. Being bilingual is a positive and desirable quality, and the Deaf community can follow similar developmental paths as monolinguals. This dual exposure can lead to mental flexibility, creative thinking, and communication advantages (Hamers, 1998). Historically, sign language has not been incorporated in the education of Deaf children (Grosjean, 2010; Humphries, 2013; Swanwick, 2010). Some earlier sign languages were not natural; they just used signing to deliver the content to the Deaf individuals who failed within an oral-only approach. The dissatisfaction with the educational outcomes of the Deaf community led to a bilingual design that placed sign language at the same level as spoken/written language. To develop functional and bilingual systems, the full development of two languages is crucial. In such systems, the social and academic functions of both languages are considered. Furthermore, their consistent and strategic usage is promoted in the environment. The final goal is to deliver content instruction in both languages, making it a viable design for Deaf children (Gárate, 2014).

Application area: Millions of Deaf and hearing-impaired people live across the world. The predominant part of them lives in low-income and developing countries with low access to suitable ear and hearing care services. While hearing loss creates many difficulties in the corresponding community, many of its mainsprings can be prevented through public health measures. Rehabilitation, education, empowerment, and communication technology usage are some of the main solutions to overcome the communication barriers of the hearing-impaired community and use the full potential of Deaf and hearing-impaired people. To this end, we present compact information regarding the community affected by hearing loss, to provide insight and humanitarian motivation in the research community. Considering the scope of this survey, this information can help to develop communication technologies compatible with the needs of the hearing-impaired community.

According to the World Health Organization (WHO) report, 1.5 billion people live with some degree of hearing loss. It is predicted that by 2050 approximately 2.5 billion people will have some degree of hearing loss (Deafness and hearing, 2022). Some critical points need to be considered for application development in this area:

1. Geography: 80% of people with hearing loss live in low-income countries. These people cannot easily access assistive technologies to improve their communication quality.
2. Age: Another challenge is the hearing loss outbreak with age. Age is an important predictor of hearing loss among adults aged 20–69, with the maximum amount of hearing loss in the 60 to 69 age group. Nearly 25% of people older than 60 years are affected by hearing loss. This challenge can be considered for adapting assistive technologies to the special physical and mental situations of people older than 60 years. Nearly 15% of American adults (37.5 million) aged 18 and over face hearing loss. Furthermore, about 2 to 3 out of every 1000 American children are affected by hearing loss. Considering the difference in the definition of hearing loss at different ages, the age factor needs to be considered for application development. Generally, hearing loss is defined as a loss greater than 40 dB for adults and greater than 30 dB for children. Due to this difference and the other communication requirements corresponding to different age groups, the age factor is an important factor for application development in the field.
3. Gender: According to the reports of WHO, men are approximately twice as likely as women to have hearing loss among adults aged 20–69. This is due to the fact that men usually work in louder environments.
4. Sign language: As we discussed in the previous sections, most Deaf people are not familiar with sign language. This makes it hard to develop communication tools for sign language translation.

Considering all of these critical points, developing effective applications is challenging. Although different applications have been developed in recent years (Bangham et al., 2000; Dangsaart, Naruedomkul, Cercone, & Sirinaovakul, 2008; Grieve, 1999, 2002; Hanke,
2004; Huenerfauth, 2004, 2005; Jemni et al., 2022; Kanis, Zahradil, Jurčíček, & Müller, 2006; Karpouzis, Caridakis, Fotinea, & Efthimiou, 2020; Veale & Conway, 1994; Zhao, Kipper, Schuler, Vogler, & Palmer, 2000; Zij & Barker, 2003), more endeavor is necessary to develop real-time applications for bi-directional translation from sign language to spoken language and vice versa. Furthermore, most of the applications in sign language focus on the recognition task, such as robotics (Dawes et al., 2018), HCI (Bachmann et al., 2018), education (Darabkh et al., 2018), computer games (Roccetti et al., 2012), recognition of children with autism (Cai et al., 2018), automatic sign-language interpretation (Yang, 2014), decision support for medical diagnosis of motor skills disorders (Butt et al., 2018), home-based rehabilitation (Cohen et al., 2018; Morando et al., 2018), and virtual reality (Vaitkevičius et al., 2019). This is due to a common misunderstanding by hearing people that Deaf people are much more comfortable with reading spoken language and that, therefore, it is not necessary to translate written spoken language into sign language. This is not true, since there is no guarantee that a Deaf person is familiar with the reading/writing forms of a spoken language. In some languages, these two forms are completely different from each other.

Real-time communication: For now, accessibility in SL is mainly achieved by pre-recorded videos. This cannot enable real-time interaction for the content provider. To have an automatic sign recognition system applicable to a mutual interaction between a hearing-impaired/Deaf user and a hearing user or a digital assistant in real time, we need low-complexity and fast models. Using such models, the Deaf community can simply communicate with other people in different locations, such as schools, banks, hospitals, trains, and universities, just to mention a few. A translation system could be vision-based or sensor-based, depending on the type of input it receives. To date, most of the current commercial systems for sign language translation are sensor-based, which are expensive and not user-friendly. Vision-based sign translation systems are necessary, but they need to overcome many challenges to build a system applicable to real-time communication.

Sign anonymization: The purpose of sign anonymization is to ensure that no personal information of the signers is shared with the community (Saunders, Camgoz, & Bowden, 2021). Furthermore, providing realistic, human-like, and anonymized animations would ensure higher acceptability and comprehensibility than actual signing avatars. In sign language, complete anonymization of the video data is not possible, because both the face and hands of the signers must be fully visible so that the content can be understandable. Since most Deaf people have challenges in communication through written content, the ability to produce messages anonymously is an important demand of theirs. As a result, video is the main communication modality used by native signers. The development of virtual signers is thus expanding, in order to make written material on the internet more available to Deaf users.

4. Backbone architectures and methods

In this section, we review the most-used architectures and methods in SLP: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), generative models, motion capture, and signing avatars.

4.1. CNNs

One of the basic deep learning-based building blocks designed for visual reasoning is the convolutional layer (Rezaei et al., 2023). Using these layers, CNNs effectively model the spatial structure of images (LeCun, Bottou, Bengio, & Haffner, 1998). In SLP, CNNs are the foundation of the proposed models. However, CNN performance faces some challenges. One challenge corresponds to the limited receptive field, determined by the kernel size. Some solutions to this challenge are stacking more convolutional layers (Jain et al., 2007), increasing the kernel size, linearly fusing multiple scales (Denton, Chintala, Szlam, & Fergus, 2015; Mathieu, Couprie, & LeCun, 2016), using dilated convolutions to include long-range spatial dependencies (Yu, Koltun, & Funkhouser, 2017), extending the receptive fields (Chen & Koltun, 2017; Luo, Li, Urtasun, & Zemel, 2016), sub-sampling, or using residual connections (He, Zhang, Ren, & Sun, 2016; Villegas, Yang, Hong, Lin, & Lee, 2017). Another challenge is the lack of temporal learning corresponding to the image sequences. To properly address this challenge, 3D convolutions are used as a promising alternative to recurrent modeling. Several models have been proposed for sign language using 3D convolutions (hammadi et al., 2020; Rastgoo et al., 2020a; Sharma & Kumar, 2021; Sripairojthikoon & Harnsomburana, 2019). However, 3DCNN models are generally not as powerful as sequence learning models such as RNN, Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU).
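As a hedged illustration of the 3D-convolution alternative mentioned above, the following PyTorch sketch (an assumed toy architecture, not a specific model from the cited works) extracts a spatio-temporal feature vector from a short sign clip by convolving jointly over time, height, and width:

```python
import torch
import torch.nn as nn

class Sign3DCNN(nn.Module):
    """Toy 3D CNN: a clip of T RGB frames -> one spatio-temporal feature vector."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),   # joint (T, H, W) convolution
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),                   # halve T, H, and W
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                       # global spatio-temporal pooling
        )
        self.fc = nn.Linear(64, feature_dim)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, T, H, W)
        x = self.features(clip).flatten(1)
        return self.fc(x)

# Usage: a batch of two 16-frame 112x112 clips.
feats = Sign3DCNN()(torch.randn(2, 3, 16, 112, 112))
print(feats.shape)  # torch.Size([2, 256])
```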
4.2. Transformer

The main intuition behind recurrent models is modeling the temporal representation of sequential data, such as image sequences. Deep recurrent networks have demonstrated great success in different sequence learning tasks, such as machine translation (Siddique, Ahmed, Talukder, & Uddin, 2020), speech recognition (Graves, Mohamed, & Hinton, 2013), video captioning (Pei et al., 2019), video prediction (Villegas et al., 2017), SLR (Rastgoo et al., 2020a), and SLP (Rastgoo et al., 2021d). However, there are some limitations in these networks, such as vanishing and exploding gradients. To mitigate these challenges, classical RNNs were extended to more sophisticated recurrent models, such as LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Cho, Merrienboer, et al., 2014). Different works have explored different modifications of the extended recurrent models, such as applying LSTM-based models to the image space (Shi et al., 2015), using multidimensional LSTM (MD-LSTM) (Graves, Fernandez, & Schmidhuber, 2007), using stacked recurrent layers to include abstract spatio-temporal correlations (Finn, Goodfellow, & Levine, 2016; Lotter, Kreiman, & Cox, 2015), and addressing duplicated recurrent representations (Zhan, Zheng, Yue, Sha, & Lucey, 2019). In addition, Transformer models have recently improved results thanks to the self-attention mechanism and parallel computing. In most of the models for SLP, a recurrent model is used for the temporal representation of sequential data.
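The sketch below (an assumed minimal example, not taken from the cited works) shows the usual pattern in which per-frame CNN features are summarized over time by an LSTM; a Transformer encoder could replace the recurrent layer in the same position:

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Toy temporal model: a sequence of per-frame feature vectors -> one clip embedding."""

    def __init__(self, frame_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(frame_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, T, frame_dim), e.g. the output of a 2D CNN per frame
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]                      # last hidden state summarizes the sequence

# Usage: 2 clips, 16 frames each, 256-dim features per frame.
clip_embedding = TemporalEncoder()(torch.randn(2, 16, 256))
print(clip_embedding.shape)  # torch.Size([2, 128])
```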
4.3. Generative models

Generative modeling is an unsupervised learning task in machine learning. It involves automatically discovering and learning the regularities or patterns in input data, such that the model can generate or output plausible examples. Generally, there are two main categories of model learning: discriminative and generative. While a discriminative model learns the decision boundaries between the classes, a generative model learns the real distribution of each class. In other words, a generative model learns the joint probability distribution p(x,y) and predicts the conditional probability using Bayes' Theorem, whereas a discriminative model learns the conditional probability distribution p(y|x) directly. Both of these models generally fall into supervised learning problems. The goal of generative models is to generate new samples from the same distribution, given some training data. In the learning procedure, the distributions of the real and generated data get closer to each other. This is done by estimating a density function from the real data, either explicitly, e.g. VAEs, or implicitly, e.g. GANs. In SLP, generative models are used to generate more realistic and plausible videos, considering sign language challenges.
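As a hedged illustration of the implicit (GAN-style) density estimation mentioned above, the sketch below shows one adversarial update step for a generator/discriminator pair; the architectures are placeholder MLPs, not the networks used in any reviewed SLP model:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, data_dim)                # placeholder "real" samples

# Discriminator step: real samples -> label 1, generated samples -> label 0.
fake = G(torch.randn(32, latent_dim)).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator into predicting 1 on generated samples.
fake = G(torch.randn(32, latent_dim))
loss_g = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```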
actions. However, there are some challenges in mocap usage. The need for special and expensive hardware/software to obtain and process the data, the need for specific requirements of the space in which the mocap process is operated, and the need for re-recording data instead of manipulating it when facing problems are some of these challenges.

Signing avatars are animated 3D models of the mocap data obtained using signers. Animation can be manually defined, captured from a human signer, or parametrically described. The signing avatars aim to assist the research community in making different applications more accessible to the Deaf community. Furthermore, they will also help address the lack of human interpreters. The goal is not to replace human interpreters but rather to increase the amount of signed content available to Deaf users. As another application of signing avatars, they can be used as assistive technologies for Deaf students in school. Using these technologies, the interaction between Deaf and hearing students will be much easier.

Recently, mocap data has been used to edit and generate sign language samples. To this end, some motion editing operations, such as concatenation and mixing, are applied to mocap data to compose new utterances. This helps to facilitate the enrichment of the original mocap data, enhancing the natural look of the animation and promoting the avatar's acceptability. However, manipulating existing movements does not guarantee the semantic consistency of the reconstructed signs. Employing an expert user to construct new utterances from linguistic patterns can be a primary solution to this challenge.

5. SLP taxonomy

In this section, we present a taxonomy that summarizes the main concepts related to deep learning in SLP. We categorize recent works in SLP, providing separate discussions in each category. In the rest of this section, we explain different input modalities, datasets, applications, and proposed models. Fig. 3 shows the proposed taxonomy described in this section.

5.1. Input modalities

Generally, vision and language are the two input modalities in SLP. While the visual modality includes the captured image/video data, the linguistic modality for the spoken language contains the text/audio input from the natural language. CV and NLP techniques are necessary to process these input modalities. While the visual modality is used in training, the lingual modality is applicable in both the training and testing of the proposed models.

Visual modality: RGB and skeleton are two common types of input data used in SLP models. While RGB images/videos contain high-resolution content, skeleton inputs decrease the input dimension fed to the model and assist in making a low-complexity and fast model. The spatial features corresponding to the input image can be extracted using computer vision-based techniques, especially deep learning-based models. In recent years, CNNs have achieved outstanding performance for spatial feature extraction from an input image (Majidi, Kiani, & Rastgoo, 2020). Furthermore, generative models, such as Generative Adversarial Networks (GAN), can use CNNs as an encoder or decoder block to generate a sign image/video. Due to the temporal dimension of RGB video inputs, processing this input modality is more complicated than processing RGB image input. Most of the proposed models in SLP use RGB video as input (Camgoz, Koller, Hadfield, & Bowden, 2020; Saunders, Camgöz, & Bowden, 2020a; Saunders, Camgoz, & Bowden, 2020d; Stoll, Camgoz, et al., 2020). An RGB sign video can correspond to one sign word or several concatenated sign words, in the form of a sign sentence. GAN and LSTM are the most used deep learning-based models in SLP for static and temporal learning in the visual input modalities. While successful results have been achieved using these models, more effort is necessary to generate more lifelike sign images/videos in order to improve the communication interface with the Deaf community.
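To make the dimensionality argument concrete, the following sketch (purely illustrative numbers, not taken from any dataset) contrasts the size of an RGB clip with that of the same clip represented as 2D skeleton keypoints:

```python
import numpy as np

T, H, W = 64, 256, 256      # frames, height, width of a short sign clip
K = 67                       # assumed number of tracked keypoints (body + both hands)

rgb_clip = np.zeros((T, H, W, 3), dtype=np.uint8)        # raw visual modality
skeleton_clip = np.zeros((T, K, 2), dtype=np.float32)    # (x, y) per keypoint per frame

print(rgb_clip.size)        # 12,582,912 values per clip
print(skeleton_clip.size)   # 8,576 values per clip -> far cheaper to model
```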
Lingual modality: Text input is the most common form of linguistic modality. To process the input text, different models are used (See & Lamm, 2020; Sutskever, Vinyals, & Le, 2014). Among the deep learning-based models, the Neural Machine Translation (NMT) model is the most used model for input text processing. The Seq2Seq models (Sutskever et al., 2014), such as Recurrent Neural Network (RNN)-based models, have proved their effectiveness in many tasks. While successful results were achieved using these models, more effort is necessary to overcome the existing challenges in the translation task. One challenge in translation tasks is related to domain adaptation, due to different word styles, translations, and meanings in different languages. Thus, a critical requirement of developing machine translation systems is to target a specific domain. Transfer learning, i.e. training the translation system on a general domain followed by fine-tuning on in-domain data for a few epochs, is a common approach to coping with this challenge. Another challenge regards the amount of training data. Since a main property of deep learning-based models is the mutual relation between the amount of data and model performance, a large amount of data is necessary to provide a good generalization capability in the model. Another challenge is the poor performance of machine translation
Table 1
SLP datasets in time.

Type   | Dataset                                                    | Nationality       | Level    | Content type                 | Public | Year
Sign   | ASLLVD (Athitsos et al., 2018)                             | English (US)      | Word     | Video, Gloss, Trans.         | Y      | 2008
Sign   | ATIS Corpus (Bungeroth et al., 2008)                       | Multilingual      | Sentence | Video, Gloss, Trans.         | Y      | 2008
Sign   | Dicta-Sign (Matthes et al., 2012)                          | English (US)      | Word     | Video, Gloss, Trans.         | Y      | 2012
Sign   | ASL-LEX (Caselli, Sehyr, Cohen-Goldberg, & Emmorey, 2017)  | English (US)      | Word     | Video, Gloss, Trans.         | Y      | 2016
Sign   | RWTH-Phoenix-2014T (Camgöz et al., 2018)                   | German            | Sentence | Video, Gloss, Trans.         | Y      | 2018
Sign   | JSLC (Brock & Nakadai, 2018)                               | Japanese          | Sentence | Video, Gloss                 | Y      | 2018
Sign   | KETI (Ko, Kim, Jung, & Cho, 2019)                          | Korean            | Sentence | Video, Gloss, Trans.         | N      | 2019
Sign   | Content4All (Camgoz et al., 2021)                          | Swiss             | Sentence | Video, Gloss                 | Y      | 2021
Sign   | How2Sign (Duarte et al., 2020)                             | English (US)      | Sentence | Video, Gloss, Trans., Speech | Y      | 2021
Spoken | OpenSubtitles (Tiedemann, 2016)                            | Multilingual (60) | Sentence | Video, Trans.                | Y      | 2016
Spoken | Multi30K (Elliott, Frank, Simaéan, & Specia, 2016)         | English, German   | Sentence | Image, Trans.                | Y      | 2016
Spoken | ASPEC (Nakazawa et al., 2016)                              | Japanese, English | Sentence | Text                         | Y      | 2016
Spoken | MUSE (Conneau, Lample, Ranzato, Denoyer, & Jégou, 2017; Lample, Conneau, Denoyer, & Ranzato, 2017) | Multilingual (110) | Word | Text | Y | 2017
Spoken | MTNT (Michel & Neubig, 2018)                               | Japanese, French  | Sentence | Text                         | Y      | 2018
Spoken | MLQA (Lewis, Oğuz, Rinott, Riedel, & Schwenk, 2019)        | Multilingual (7)  | Sentence | Text                         | Y      | 2019
abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese–Chinese paper corpus of 680K parallel sentences. MLQA is a cross-lingual dataset containing over 5K Question Answering (QA) samples (12K in English) in SQuAD format in seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Chinese. The MTNT dataset is a Machine Translation dataset that contains noisy comments from Reddit and professionally sourced translations. The translations are between English, French, and Japanese, with between 7k and 37k sentences per language pair. Table 1 summarizes the most-used datasets for SLP and also the datasets for spoken-to-spoken language translation.

5.3. Applications and technologies

5.3.1. Applications

With the advent of potent methodologies and techniques in recent years, machine translation applications have become more efficient and trustworthy. One of the early efforts on machine translation dates back to the sixties, when a model was proposed to translate from Russian to English. This model defined the machine translation task as a phase of encryption and decryption. Nowadays, the standard machine translation models fall into three main categories: rule-based grammatical models, statistical models, and example-based models. Deep learning-based models, such as Seq2Seq and NMT models, fall into the third category and have shown promising results in SLP.

To translate from a source language to a target language, a corpus is needed to perform some preprocessing steps, such as boundary detection, word tokenization, and chunking. While there are different corpora for most spoken languages, sign language lacks such a large and diverse corpus. Since Deaf people may not be able to read or write in spoken language, they need some tools for communication with other people in society. Furthermore, many interesting and useful applications on the Internet are not accessible to the Deaf community. However, we are still far from having applications accessible to Deaf people with large vocabularies or sentences from real-world scenarios. One of the main challenges for these applications is the license right for usage; only some of these applications are freely available. Another challenge is the lack of generalization of current applications, which are developed for the requirements of very specific application scenarios. Here, we present some of the most used projects in sign language translation.

Translation from English to ASL by Machine (TEAM) project: An English translation system that uses grammar rules to create an American Sign Language (ASL) syntactic structure. Using a signing avatar, this project achieved successful performance in generating aspectual and adverbial information in ASL (Zhao et al., 2000).

Machine Translation of Weather reports from English to ASL project: Using freely available Perl modules, some packages are designed to employ ASL grammar rules and generate fluent ASL words (Grieve, 1999).

South African Sign Language Machine Translation (SASL-MT) project: Like the TEAM project (Zhao et al., 2000), SASL-MT uses a rule-based transfer mechanism from English to ASL. SASL-MT is freely available to the Deaf community in specific domains, such as clinics, hospitals, and police stations. While this project is still under development, no evaluation results have been reported (Zij & Barker, 2003).

Multi-path architecture for Sign Language Machine Translation (SLMT): Using a virtual reality scene, a multi-channel architecture is proposed to include supplementary information of ASL. This project aims to generate spatially complex ASL words (Huenerfauth, 2004, 2005).

Czech Sign Language Machine Translation: Using computer animation techniques, hand articulations are generated through an automatic process. Translation from spoken Czech to Signed Czech is a primary goal of this project. More than 3000 simple or linked sign vocabularies of Czech sign language are included in the dictionary of this project, which is a successful improvement in Czech sign language translation (Hanke, 2004; Kanis et al., 2006).

Virtual signing, capture, animation, storage and transmission (ViSiCAST) Translator: This project is proposed to translate from English text into British Sign Language (BSL). Using grammar rules and a symbolic representation, natural movements in the sign words are modeled. This project has successfully developed an avatar-based signing system for BSL (Bangham et al., 2000).

ZARDOZ System: This system is an English-to-sign-language translation system using Artificial Intelligence knowledge representation, metaphorical reasoning, and a blackboard system architecture. The main advantage of this system is its efficient performance in processing semantic information. The contributors of this project aim to improve this system using intelligent linguistic technologies (Veale & Conway, 1994).

Environment for Greek Sign Language Synthesis: This system includes an educational platform for Deaf children. Virtual character animation techniques are used for sign sequence synthesis and lexicon-grammatical processing of Greek sign language sequences (Karpouzis et al., 2020).

Thai-Thai Sign Machine Translation (TTSMT): This model is a multi-phase approach to translating Thai text into Thai Sign Language. The system has been developed using the spatial grammatical order of the sign words and evaluated on sign words frequently used in daily communication (Dangsaart et al., 2008).
Table 2
Statistical report on the existing technology products based on the EASTIN database (Eastin, 2022).

Device type                      | Product number
Hearing technology               | 300
Alerting devices                 | 173
Communication support technology | 223
Table 3
A summary of the main characteristics of the reviewed models.

Query                 | Available choices                                  | Most used
Methods               | Avatar, NMT, Motion Graph, Image/video generation  | Image/video generation
Input modalities      | Image (RGB), Skeleton, Video, Text, Speech         | Text
Datasets              | PHOENIX14T, Czech news, Own datasets               | PHOENIX14T
Production modalities | Isolated, Continuous                               | Continuous
Architectures         | Static: GAN, AE, VAE; Dynamic: LSTM, GRU           | Static: GAN; Dynamic: LSTM
Generative models     | AE, VAE, GAN                                       | GAN
Evaluation metrics    | Accuracy, Word Error Rate, BLEU, ROUGE             | BLEU
Features              | Face, Hand, Body, Fused features                   | Fused features
Table 4
Summary of deep SLP models (Part 1).

Year | Ref | Feature | Input modality | Dataset | Description

2011 | Kipp, Heloir, and Nguyen (2011) | Avatar | RGB video | ViSiCAST
  Pros. Proposes a gloss-based tool focusing on the animation content evaluation, using a new metric for comparing avatars with human signers. Cons. Need to include non-manual features of human signers.

2016 | McDonald et al. (2016) | Avatar | RGB video | Own dataset
  Pros. Automatically adds realism to the generated images, with low computational complexity. Cons. Need to place the position of the shoulder and torso extension on the position of the avatar's elbow, rather than the IK end-effector.

2016 | Gibet, Lefebvre-Albaret, Hamon, Brun, and Turki (2016b) | Avatar | RGB video | Own dataset
  Pros. Easy to understand, with high viewer acceptance of the sign avatars. Cons. Limited to a small set of sign phrases.

2018 | Camgoz, Hadfield, Koller, Ney, and Bowden (2018a) | NMT | RGB video | PHOENIX-Weather 2014T
  Pros. Robust to jointly align, recognize, and translate sign videos. Cons. Need to align the signs in the spatial domain.

2018 | Guo, Zhou, Li, and Wang (2018) | NMT | RGB video | Own dataset
  Pros. Robust to align the word order corresponding to visual content in sentences. Cons. Need to generalize to additional datasets.

2020 | Stoll, Camgoz, et al. (2020) | NMT, MG | Text | PHOENIX14T
  Pros. Robust to minimal gloss and skeletal-level annotations for model training. Cons. Model complexity is high.

2020 | Saunders et al. (2020d) | Others | Text | PHOENIX14
  Pros. Robust to the dynamic length of the output sign sequence. Cons. Model performance could be improved by including non-manual information.

2020 | Zelinka and Kanis (2020) | Others | Text | Czech news
  Pros. Robust to the missing skeleton parts. Cons. Model performance could be improved by including facial expression information.

2020 | Saunders et al. (2020c) | Others | Text | PHOENIX14T
  Pros. Robust to non-manual feature production. Cons. Need to increase the realism of the generated signs.

2020 | Camgoz et al. (2020) | Others | Text | PHOENIX14T
  Pros. No need for gloss information. Cons. Model complexity is high.

2020 | Saunders et al. (2020a) | Others | Text | PHOENIX14T
  Pros. Robust to manual feature production. Cons. Need to increase the realism of the generated signs.
Table 5
Summary of deep SLP models (Part 2).

Year | Ref | Feature | Input modality | Dataset | Description

2021 | Christopher, Kümmel, Ritter, and Hildebrand (2021) | Others | Text | MS-ASL
  Pros. Performance improvement on independent and contrasting signers using synthesized target poses. Cons. The generated samples need to be improved to be more realistic.

2022 | Wencan, Zhao, He, and Zhang (2022) | Others | Text | PHOENIX14T
  Pros. Uses unlabeled data for model training. Cons. The modality imbalance issue can still decrease the model performance.

2022 | Tang, Hong, Guo, and Wang (2022) | Others | Text | PHOENIX14T
  Pros. Using CTC optimization in the model guarantees semantic preservation in terms of both pose and gloss. Cons. Model complexity is high.

2022 | Wellington, Alaniz, Hurtado, De Silva, and De Bem (2022) | Others | Text | PHOENIX14T
  Pros. Independent control of diverse factors of variation for SLP by disentangling appearance and gestural communication parameters. Cons. Model complexity is high.
Table 6
A summary of the mathematical formulation for SLP (First part).

McDonald et al. (2016):
  $V_R = A_R - S_R$, $V_L = A_L - S_L$, $V_{reach} = (V_R + V_L)/2$
  The parameters included in the model, showing the displacement vectors from the right and left shoulders and also the average of these two vectors.

Othman and Jemni (2011):
  $p(e, a \mid f) = \frac{\epsilon}{(N+1)^M} \prod_{j=1}^{M} t(e_j \mid f_{a(j)})$,
  $p(e, a \mid f) = \epsilon \prod_{j=1}^{M} t(e_j \mid f_{a(j)})\, a(a(j) \mid j, N, M)$
  The translation probability of an English sentence to an ASL sentence using the alignment function.

Bahdanau, Cho, and Bengio (2014):
  $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})$
  Attention function (a distribution over sequences of all possible lengths).

Camgoz et al. (2018a):
  $\gamma_{nu} = \frac{\exp(\mathrm{score}(h_u, o_n))}{\sum_{n'=1}^{N} \exp(\mathrm{score}(h_u, o_{n'}))}$
  Attention-based Encoder-Decoder network with the attention weights.

Kovar, Gleicher, and Pighin (2002):
  $f(w) = f([e_1, \ldots, e_n]) = \sum_{i=1}^{n} g([e_1, \ldots, e_{i-1}], e_i)$
  The total error corresponding to the alignment and interpolation of the motions and positions used between the joints.

Arikan and Forsyth (2002):
  $S(e_1, \ldots, e_n) = w_c \sum_{i=1}^{n} \mathrm{cost}(e_i) + w_f F + w_b B + w_j J$
  A score function assigned to the distance between two consecutive frames using the joint positions, velocities, and acceleration parameters.
Table 7
A summary of the mathematical formulation for SLP (Second part).

Saunders et al. (2020a):
  $\min_G \max_D \mathcal{L}_{GAN}(G, D) = \mathbb{E}[\log D(Y^* \mid X)] + \mathbb{E}[\log(1 - D(G(X) \mid X))]$
  A minimax function to model the training process as an adversarial training scheme.

Zelinka and Kanis (2020):
  $\varepsilon = \frac{\sum_{i=1}^{n_a} \sum_{j=1}^{n_b} w_{i,j} \, \lVert a_i - b_j \rVert^2_D}{\sum_{i=1}^{n_a} \sum_{j=1}^{n_b} w_{i,j}}$
  The loss between a produced pose sequence a and a reference pose sequence b.
Table 8
A summary of the mathematical approaches for SLP.

Ref.                                                                     | Approach
Othman and Jemni (2011), Stoll, Camgoz, et al. (2020)                    | Conditional probability function
Hochreiter and Schmidhuber (1997)                                        | Probability distribution
Camgoz et al. (2018a), Bahdanau et al. (2014), Zelinka and Kanis (2020)  | Attention
Kovar et al. (2002)                                                      | Distance between two point clouds
Arikan and Forsyth (2002)                                                | Distance between two consecutive frames
Saunders et al. (2020a), Saunders et al. (2020c)                         | Minimax function and adversarial loss

5.4.1. Avatar approaches

In order to reduce the communication barriers between hearing and hearing-impaired people, sign language interpreters are used as an effective yet costly solution (Naert, Larboulette, & Gibet, 2020a). To inform Deaf people quickly in cases where no interpreter is on hand, researchers are working on novel approaches to providing the content (Gibet, 2020; Larboulette & Gibet, 2018). One of these approaches is signing avatars (Lucie, Larboulette, & Gibet, 2020). An avatar is a technique to display the signed conversation in the absence of videos corresponding to a human signer. To this end, 3D animated models are employed, which can be stored more efficiently than videos. The movements of the fingers, hands, facial gestures, and body can be generated using the avatar (Naert, Larboulette, & Gibet, 2021). This technique can be programmed to be used for different sign languages (Gibet & Marteau, 2023). With the advent of computer graphics in recent years, computers and smartphones can generate high-quality animations with smooth transitions between the signs (Gibet, 2020). To capture the motion data of Deaf people, special cameras and sensors are used (Duarte & Gibet, 2010). Furthermore, a computing method is used to transfer the body movements onto the signing avatar (Kipp et al., 2011).

Two ways to derive the signing avatars are motion capture data and parametrized glosses. In recent years, some works have explored avatars animated from parametrized glosses. VisiCast (Bangham et al., 2000), Tessa (Cox et al., 2002), eSign (Zwitserlood, Verlinden, Ros, & Schoot, 2005), Dicta-Sign (Efthimiou et al., 2012), JASigning (Virtual Humans Group, 2017), and WebSign (Jemni et al., 2022) are some of them. These works need the sign video annotated via a transcription language (Walsh, Saunders, & Bowden, 2022), such as HamNoSys (Prillwitz, 1989) or SigML (Kennaway, 2013). However, under-articulated, unnatural movements and missing non-manual information, such as eye gaze and facial expressions, are some challenges of the avatar approach (Naert, Larboulette, & Gibet, 2017). These challenges lead to misunderstanding of the final sign language sequences (Gibet, Lefebvre-Albaret, Hamon, Brun, & Turki, 2016a). Furthermore, due to the uncanny valley, users do not feel comfortable (Mori, MacDorman, & Kageki, 2012) with the robotic motion of the avatars (Ben, Camgoz, & Bowden, 2021a). To tackle these problems, recent works focus on the annotation of non-manual information such as the face, body, and facial expression (EblingJohn & Glauert, 2013; Naert, Reverdy, Larboulette, & Gibet, 2018). For instance, Kipp et al. (2011) proposed two techniques, the torso and the noise methods, to aid manual animation and supplement procedurally generated avatar movement systems such as (Delorme & Braffort, 2009; Hanke, 2004). The first technique is an extension to any limb system and helps to automatically rotate the torso and spine of an
avatar. This rotation supports the specified arm motions from the linguistic model. The second technique generates motion in the held joints. Evaluation results show the effectiveness of these techniques. However, the accurate alignment and articulation of this information are challenging (Kipp et al., 2011; McDonald et al., 2016). More concretely, three steps have been included in the model proposed by McDonald et al. (2016): movement of the spine, spreading the effect over the spine, and shoulder movement. To this end, the following parameters are included in the model:

$V_R = A_R - S_R$,   (1)

$V_L = A_L - S_L$,   (2)

$V_{reach} = \frac{V_R + V_L}{2}$,   (3)

where $V_R$, $V_L$, and $V_{reach}$ are the displacement vectors from the right and left shoulders and the average of these two vectors, respectively. To compute both the bend angle and direction, the torso must be rotated in the direction of $V_{reach}$. Experimental results indicated that including such movements in the system would be highly beneficial.

Using the data collected from motion capture, avatars can be made more usable and acceptable for viewers (Gibet, 2018) (such as the Sign3D project by MocapLab (Gibet et al., 2016b)). Highly realistic results are achieved by avatars, but the results are restricted to a small set of phrases. This comes from the cost of data collection and annotation. Furthermore, avatar data is not a scalable solution and needs expert knowledge to perform a sanity check on the generated data. To cope with these problems and improve performance, deep learning-based models, as the latest machine translation developments, are used. Generative models, along with some graphical techniques such as Motion Graphs, have recently been employed (Stoll, Camgoz, et al., 2020).

5.4.2. NMT approaches

Machine translators are a practical methodology for translation from one language to another (Kahlon & Singh, 2021; Khan, Abid, & Abid, 2020; Luqman & Sabri, 2019). The first translator dates back to the sixties, when the Russian language was translated into English (Hutchins, 2005). The translation task requires preprocessing of the source language, including sentence boundary detection, word tokenization, and chunking. These preprocessing tasks are challenging, especially in sign language. Sign Language Translation (SLT) aims to produce/generate spoken language translations from sign language, considering different word orders and grammar. The ordering and the number of glosses do not necessarily match the words of the spoken language sentences (Li et al., 2022).

Nowadays, there are different types of machine translators, mainly based on grammatical rules, statistics, and examples (Othman & Jemni, 2011). For instance, Othman and Jemni (2011) proposed a machine translation model, namely IBM 1, by defining the translation probability of an English sentence $f = (f_1, f_2, \ldots, f_N)$ of length $N$ to an ASL sentence $e = (e_1, e_2, \ldots, e_M)$ of length $M$, with an alignment of each ASL word $e_j$ to an English word $f_i$ given by the alignment function $a: j \to i$, as follows:

$p(e, a \mid f) = \frac{\epsilon}{(N+1)^M} \prod_{j=1}^{M} t(e_j \mid f_{a(j)})$,   (4)

where $t$ is a conditional probability function. The alignment function $a$ maps each ASL output word $j$ to an English input position $a(j)$. The alignment probability distribution is also applied in this reverse direction. The combination of these two steps defines the IBM 2 model as follows:

$p(e, a \mid f) = \epsilon \prod_{j=1}^{M} t(e_j \mid f_{a(j)})\, a(a(j) \mid j, N, M)$.   (5)
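As a hedged, toy illustration of Eq. (4), the snippet below evaluates the IBM Model 1 probability of one candidate alignment given a hand-made lexical translation table t(e_j | f_i); the table values, the sentences, and the gloss are invented for the example only:

```python
# Toy evaluation of Eq. (4): p(e, a | f) = eps / (N + 1)^M * prod_j t(e_j | f_a(j)).
# The lexical table and sentences below are invented purely for illustration.
epsilon = 1.0
f = ["i", "am", "anna"]                          # English (source) sentence, length N
e = ["ME", "ANNA"]                               # ASL gloss (target) sentence, length M
t = {("ME", "i"): 0.8, ("ANNA", "anna"): 0.9}    # t(e_j | f_i), assumed values
alignment = {0: 0, 1: 2}                         # ASL word j -> English position a(j)

N, M = len(f), len(e)
p = epsilon / (N + 1) ** M
for j, e_word in enumerate(e):
    p *= t.get((e_word, f[alignment[j]]), 1e-6)  # small floor for unseen pairs
print(p)   # 0.045 = 1/16 * 0.8 * 0.9
```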
As an example-based methodology, some research works have been developed focusing on translation from text into sign language using Artificial Neural Networks (ANNs), namely NMT (Bahdanau et al., 2014). NMT uses ANNs to predict the likelihood of a word sequence, typically modeling entire sentences in a single integrated model. The seq2seq model (Cho, van Merrienboer, et al., 2014; Sutskever et al., 2014), one of the most interesting breakthroughs in neural machine translation, consists of two Recurrent Neural Networks (RNNs). These RNNs form an encoder–decoder architecture to translate from a source sequence to a target sequence. This model aims to overcome the challenges of problems whose input and output sequences have different lengths with complicated and non-monotonic relationships. Considering the capabilities of the LSTM model (Hochreiter & Schmidhuber, 1997) in learning long-range temporal dependencies, the seq2seq model improved translation performance. More concretely, the LSTM aims to estimate the conditional probability $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$, where $(x_1, \ldots, x_T)$ and $(y_1, \ldots, y_{T'})$ are the input and output sequences, respectively. The lengths of the input and output sequences may differ from each other. The LSTM network computes the conditional probability by first obtaining the fixed-dimensional representation $v$ of the input sequence, given by the last hidden state of the LSTM, and then computing the probability of the output sequence with a standard LSTM formulation whose initial hidden state is set to the representation $v$ of $x_1, \ldots, x_T$:

$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})$,   (6)

where each $p(y_t \mid v, y_1, \ldots, y_{t-1})$ distribution is represented with a Softmax over all the words in the vocabulary. As a requirement, a special End-Of-Sentence symbol "<EOS>" is necessary to enable the model to define a distribution over sequences of all possible lengths. In addition to the LSTM network, the GRU model (Chung, Gulcehre, Cho, & Bengio, 2014) can be used as an RNN cell. The seq2seq models proved their effectiveness in many sequence generation tasks by obtaining nearly human-level performance. However, there are some drawbacks to these models. One of them corresponds to the fixed-size vector representation of input sequences with different lengths. The vanishing gradient related to long-term dependencies is another drawback of this model (Ko et al., 2019). To enhance the translation performance on long sequences, Bahdanau et al. (2014) presented an effective attention mechanism. This mechanism was later improved by Luong, Pham, and Manning (2015).

Regarding sign language, Camgoz et al. proposed a combination of a seq2seq model with a CNN model to translate sign videos to spoken language sentences (Camgoz et al., 2018a). They used an attention-based Encoder-Decoder network with the attention weights defined as follows:

$\gamma_{nu} = \frac{\exp(\mathrm{score}(h_u, o_n))}{\sum_{n'=1}^{N} \exp(\mathrm{score}(h_u, o_{n'}))}$,   (7)

where $h_u$, $o_n$, and $N$ are the hidden state, output, and sequence length, respectively. While results on the first continuous sign language translation dataset, PHOENIX14T, were promising, it would be interesting to extend the attention mechanisms to the spatial domain to align building blocks of signs with their spoken language translations.
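The attention weights in Eq. (7) are simply a softmax over alignment scores; the sketch below (a generic illustration with a dot-product score, not the exact score function of Camgoz et al. (2018a)) computes them for one decoding step:

```python
import numpy as np

def attention_weights(h_u: np.ndarray, encoder_outputs: np.ndarray) -> np.ndarray:
    """Eq. (7): softmax over score(h_u, o_n) for all encoder outputs o_1..o_N."""
    scores = encoder_outputs @ h_u              # dot-product score, one per o_n
    scores = scores - scores.max()              # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# Usage: one decoder hidden state attending over N = 5 encoder outputs of size 8.
gamma = attention_weights(np.random.randn(8), np.random.randn(5, 8))
print(gamma.shape, gamma.sum())   # (5,) 1.0
```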
In another work, Guo et al. (2018) designed a hybrid model combining a 3D Convolutional Neural Network (3DCNN) and an LSTM-based (Hochreiter & Schmidhuber, 1997) encoder–decoder to translate from sign videos to text outputs (see Fig. 7). Results on their own dataset showed a 0.071% improvement margin of the precision metric compared to state-of-the-art models. However, unseen sentence translation is still a challenging problem with limited sentence data. Dilated convolutions and Transformers are two approaches that are also used for sign language translation (Kalchbrenner et al., 2016; Vaswani et al., 2017). Stoll, Camgoz, et al. (2020) proposed a hybrid model for automatic SLP using NMT, GANs, and motion generation. The proposed
$S(e_1, \ldots, e_n) = w_c \sum_{i=1}^{n} \mathrm{cost}(e_i) + w_f F + w_b B + w_j J$   (9)
Fig. 8. An overview of the model proposed by Stoll, Camgoz, et al. (2020): A hybrid model to automatic SLP using NMT, GANs, and motion generation.
Fig. 9. An overview of the model proposed by Kovar et al. (2002): A general framework for extracting particular graph walks that satisfy the user’s specifications.
Fig. 10. An overview of the graph nodes in a model proposed by Stoll, Camgoz, et al. (2020) for SLP. Each node contains one or more motion primitives and a prior distribution.
The transition probability between two nodes is defined as the probability of motion primitive.
et al. (2015) developed an RNN-based architecture, including an encoder and a decoder network, to compress the real images presented during training and refine images after receiving codes. Karras, Laine, and Aila (2019) designed a deep generative model, entitled StyleGAN, to adjust the image style at each convolution layer. Kataoka, Matsubara, and Uehara (2016) proposed a model using the combination of a GAN and an attention mechanism. Benefiting from the attention mechanism, this model can generate images containing highly detailed content. While deep learning-based generative models have recently achieved remarkable results (Ankith, Boggaram, Sharma, Ramanujan, & Bharathi, 2022), there exist major challenges in their training. Mode collapse, non-convergence and instability, and the choice of suitable objective functions and optimization algorithms are some of these challenges. However, several strategies have recently been proposed for a better design and optimization of these models. Appropriate design of the network architecture, proper objective functions, and optimization algorithms are some of the proposed techniques to improve the performance of deep learning-based generative models.

5.4.5. Other models

In addition to the previous categories, some models have been proposed for SLP using different deep learning models (Christopher et al., 2021; Tang et al., 2022; Wellington et al., 2022; Wencan et al., 2022). For example, Saunders et al. (2020a) proposed Progressive Transformers, a deep learning-based model, to generate continuous sign sequences from spoken language sentences (see Fig. 11). They formalized the model training process as an adversarial training scheme using a minimax game. To this end, the generator $G$ aims to minimize the following equation, whilst $D$ maximizes it:

$\min_G \max_D \mathcal{L}_{GAN}(G, D) = \mathbb{E}[\log D(Y^* \mid X)] + \mathbb{E}[\log(1 - D(G(X) \mid X))]$,   (10)

where $Y^*$ and $G(X)$ are the ground truth and the produced sign pose sequences, respectively. Results on the PHOENIX14T dataset show the effectiveness of the proposed approach. However, the model needs to further increase the realism of sign production by generating photo-realistic human signers. Furthermore, user studies in collaboration with the Deaf community are required to evaluate the reception of the produced sign pose sequences.
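A hedged sketch of the conditional adversarial objective in Eq. (10) is given below; the discriminator is a placeholder MLP over concatenated condition and pose features, not the architecture used by Saunders et al. (2020a):

```python
import torch
import torch.nn as nn

cond_dim, pose_dim = 32, 50
D = nn.Sequential(nn.Linear(cond_dim + pose_dim, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def adversarial_losses(x_cond, y_real, y_fake):
    """Eq. (10): D scores (Y*, X) as real and (G(X), X) as fake; G tries the opposite."""
    real_logit = D(torch.cat([x_cond, y_real], dim=-1))
    fake_logit = D(torch.cat([x_cond, y_fake], dim=-1))
    loss_d = bce(real_logit, torch.ones_like(real_logit)) + \
             bce(fake_logit, torch.zeros_like(fake_logit))
    loss_g = bce(fake_logit, torch.ones_like(fake_logit))   # generator tries to fool D
    return loss_d, loss_g

loss_d, loss_g = adversarial_losses(torch.randn(8, cond_dim),
                                    torch.randn(8, pose_dim),
                                    torch.randn(8, pose_dim))
print(loss_d.item(), loss_g.item())
```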
tecture, proper objective functions, and optimization algorithms are In another work, Zelinka and Kanis (2020) designed a sign language
some of the proposed techniques to improve the performance of deep synthesis system focusing on skeletal data production. A feed-forward
learning-based models. transformer and a recurrent transformer, as deep learning-based mod-
els, along with the attention mechanism were used to enhance the
5.4.5. Other models model performance (see Fig. 12). The loss of the proposed model for
In addition to the previous categories, some models have been a sequence 𝑎 = (𝑎1 , … , 𝑎𝑛𝑎 ) and a sequence 𝑏 = (𝑏𝑎 , … , 𝑏𝑛𝑏 ) is defined as
proposed for SLP using different deep learning models (Christopher follows:
et al., 2021; Tang et al., 2022; Wellington et al., 2022; Wencan et al., ∑𝑛𝑎 ∑𝑛𝑏
𝑤 ‖𝑎𝑖 − 𝑏𝑗 ‖2𝐷
𝑗=1 𝑖,𝑗
2022). For example, Saunders et al. (2020a) proposed a Progressive 𝑖=1
𝜀= ∑𝑛𝑎 ∑𝑛𝑏 (11)
Transformers, as a deep learning-based model, to generate continuous 𝑖=1
𝑤
𝑗=1 𝑖,𝑗
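As a concrete reading of Eq. (11), the following NumPy sketch computes a weighted, normalized sum of squared frame-to-frame distances between a produced skeletal sequence and a reference sequence. The choice of the weights w[i, j] and of the distance behind the D subscript is left open here (plain squared Euclidean distance is used); this is an illustration under those assumptions, not the exact loss implementation of Zelinka and Kanis (2020).

```python
import numpy as np

def weighted_pairwise_loss(a, b, w):
    """Weighted, normalized sum of squared frame distances in the spirit of Eq. (11).
    a: (n_a, d) produced skeletal frames, b: (n_b, d) reference frames,
    w: (n_a, n_b) non-negative weights (e.g. from a soft alignment)."""
    diff = a[:, None, :] - b[None, :, :]      # (n_a, n_b, d) pairwise frame differences
    sq_dist = np.sum(diff ** 2, axis=-1)      # squared Euclidean distance per frame pair
    return np.sum(w * sq_dist) / np.sum(w)

# Toy usage with random sequences and uniform weights.
rng = np.random.default_rng(0)
a = rng.normal(size=(6, 42))                  # e.g. 6 frames of 42-D skeleton data
b = rng.normal(size=(8, 42))
w = np.ones((6, 8))
print(weighted_pairwise_loss(a, b, w))
```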
Fig. 11. An overview of the model proposed by Saunders et al. (2020a). In this model, a Conditional Adversarial Discriminator measures the realism of Sign Pose Sequences
produced by an SLP Generator.
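The adversarial scheme of Eq. (10) and Fig. 11 can be sketched as a single training step in PyTorch, assuming a generator G(X) that outputs a pose sequence and a conditional discriminator D(Y, X) that outputs a probability; the modules, tensors, and optimizers below are placeholders rather than the architecture of Saunders et al. (2020a). In practice, a regression term on the produced poses would typically accompany the adversarial loss.

```python
import torch

def adversarial_step(G, D, X, Y_star, opt_g, opt_d, eps=1e-8):
    """One minimax update in the spirit of Eq. (10); G, D are assumed modules."""
    # Discriminator step: maximize log D(Y*|X) + log(1 - D(G(X)|X)).
    opt_d.zero_grad()
    d_real = D(Y_star, X)
    d_fake = D(G(X).detach(), X)
    loss_d = -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()
    loss_d.backward()
    opt_d.step()

    # Generator step: minimize log(1 - D(G(X)|X)).
    opt_g.zero_grad()
    loss_g = torch.log(1.0 - D(G(X), X) + eps).mean()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```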
Wellington et al. (2022) presented SynLibras, a disentangled deep generative model for Brazilian sign language (Libras) synthesis, which obtains promising results. Christopher et al. (2021) presented a generative-based model for SLP using GANs. This model uses the human semantic parser of the Soft-Gated Warping-GAN to generate photo-realistic videos guided by region-level spatial layouts. Experimental results on the MS-ASL dataset with over 200 signers show performance improvement compared to related models in the field. Wencan et al. (2022) designed a model, namely DualSign, as a semi-supervised two-stage SLP framework. This model utilizes partially gloss-annotated text-pose pairs and monolingual gloss data. Furthermore, a method, entitled Balanced Multi-Modal Multi-Task Dual Transformation (BM3T-DT), is proposed, which includes two models: a Multi-Modal T2G model (MM-T2G) and a Multi-Task G2P model (MT-G2P). These models are jointly trained by leveraging their task duality and unlabeled data. Results on the PHOENIX14T dataset show the efficiency of this model in the semi-supervised setting.

6. General framework for SLP

SLP can be decomposed into some intermediate steps or addressed as an end-to-end translation task. As we reviewed in the previous sections, there are different translation models applicable to sign language. In this section, we present the common intermediate steps used in SLP (see Fig. 13).

Text/Speech to gloss translation: Gloss is defined as written information of a sign word translated from the spoken language. It contains the facial and body grammar presented during the signing. For instance, let us translate an English sentence, "I am Anna", into sign language. To this end, we need to translate "I am" and "Anna" separately, but finger-spelling for a letter-by-letter translation corresponding to "Anna" is needed. Finally, we have this: "EM FS-ANNA", where "FS" denotes the start of a finger-spelling sequence. While the gloss is not a correct translation, it can provide suitable spoken language morphemes containing some conceptual information about the signs. The process of the spoken-to-gloss translation can be seen as a sequence-to-sequence task. In this task, various models from speech recognition and NMT, especially deep learning-based models, can be employed.

Gloss to skeleton prediction: This step aims to generate the human pose information corresponding to the sign gloss sequences. To this end, different parts of the human pose, including accurate finger locations, arm and torso position, and facial expressions, are considered. Like the previous step, this step can benefit from recent developments in deep learning. Attention-based models are one of the effective techniques employed for mapping from the textual input to the skeleton sequences.

Skeleton to image/video synthesis: Two general approaches are used in this step: animating an avatar and generating video frames. In the first approach, the skeleton keypoints are used to animate an avatar. Motion smoothing and interpolation are two techniques used before the final rendering. While video generation from the skeleton keypoints is hard, recent improvements in deep learning-based skeleton-to-video translation are promising.

7. Performance evaluation

In this section, the results of the previously analyzed SLP models on the most popular datasets are presented.

7.1. Evaluation metrics and protocols

Generally, the evaluation metrics measure the output quality by comparing the system output against the ground truth output corresponding to the source data. In SLP, visual/lingual evaluation metrics are used to evaluate the correctness of the generated visual/lingual outputs:

Visual evaluation metrics: To evaluate the quality of the generated sign image/video, the Structural Similarity Index Measurement (SSIM) (Wang et al., 2018a), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE), as three well-known metrics for assessing image quality, are used in the proposed models for SLP. SSIM measures the perceptual difference between two images. In SLP, this metric is used to compare the generated synthetic image to its ground truth image. PSNR and MSE are metrics used to assess the quality of compressed images compared to their originals. In SLP, the MSE is used to calculate the average squared error between a synthetic image and its ground truth image. In contrast, PSNR measures the peak error in dB, using the MSE metric.

Lingual evaluation metrics: Some of the most familiar machine translation metrics, such as BLEU@N (Papineni, Roukos, Ward, & Zhu, 2002), METEOR (Denkowski & Lavie, 2014), ROUGE (Lin, 2004), and CIDEr (Vedantam, Zitnick, & Parikh, 2015), are used to evaluate the translation performance of the proposed models in SLP. These metrics have acceptable relevancy with human judgment. In the BLEU@N metric, the matched N-grams between the machine-generated answer and the ground truth answer are utilized to compute the precision score. The BLEU@N metric is calculated for N = 1 to 4, where shorter N-gram matching reflects adequacy and longer N-gram matching accounts for fluency. ROUGE-L is another machine translation metric that scores a machine-generated sentence using a recall-based criterion. CIDEr is a metric for evaluating machine-generated sentences using human consensus.

7.2. Results

In this section, we report the quantitative results of the most relevant methods reviewed in the previous sections. We limited the quantitative results to the most common metrics and datasets. The results are compacted in one table, given that there are only a few works in SLP. ROUGE and BLEU are the most used metrics for reporting the results of the model evaluation. As Table 9 shows, most of the proposed models for SLP are evaluated on the PHOENIX14T dataset. This dataset contains 8257 sequences performed by 9 signers, which are annotated with both the sign glosses and spoken language translations. However, due to the limited number of signers in the dataset, it is necessary to use one or more large-scale datasets to train the generation network. Using multiple datasets is motivated by the fact that there is no single dataset that provides text-to-sign translations, a broad range of signers of different appearances, and high-definition signing content. Using datasets from different subject domains and languages demonstrates the robustness and flexibility of the proposed methods, as it allows us to transfer knowledge between specialized datasets. This makes the approach suitable for translating between different spoken and signed languages, as well as for other problems, such as text-conditioned image and video generation.
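As a reference for how the BLEU@N numbers reported below are obtained, the snippet computes the clipped n-gram precision that BLEU is built on; full BLEU additionally combines the precisions for N = 1 to 4 with a brevity penalty. The gloss tokens in the usage example are hypothetical.

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision used inside BLEU@N (illustrative sketch only)."""
    hyp = [tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not hyp:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in Counter(hyp).items())
    return matched / len(hyp)

# Toy usage on tokenized gloss sequences (made-up example glosses).
hyp = "REGEN WIND NORD".split()
ref = "REGEN STARK WIND NORD".split()
print(ngram_precision(hyp, ref, 1), ngram_precision(hyp, ref, 2))
```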
Table 9
Results of SLP models.
| Model | Acc | CIDEr | ROUGE | METEOR | WER | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | MSE | SSIM | FID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S2G2T (Camgoz, Hadfield, Koller, Ney, & Bowden, 2018b) | – | – | 43.80 | – | – | 43.29 | 30.39 | 22.82 | 18.13 | – | – | – |
| HLSTM-attn (Guo et al., 2018) | 0.506 | 0.605 | – | 0.205 | 0.641 | 0.508 | 0.330 | 0.207 | – | – | – | – |
| Text2Gloss (Stoll, Camgoz, et al., 2020) | – | – | 48.10 | – | 4.53 | 50.67 | 32.25 | 21.54 | 15.26 | – | 0.727 | 64.01 |
| Symbolic Transformer (Saunders et al., 2020d) | – | – | 54.55 | – | – | 55.18 | 37.10 | 26.24 | 19.10 | – | – | – |
| Progressive Transformer (Saunders et al., 2020d) | – | – | 32.02 | – | – | 31.80 | 19.19 | 13.51 | 10.43 | – | – | – |
| NSLS (Zelinka & Kanis, 2020) | – | – | – | – | – | – | – | – | – | 11.94 | – | – |
| SIGNGAN (Saunders et al., 2020d) | – | – | 29.05 | – | – | 27.63 | 19.26 | 14.84 | 12.18 | – | 0.759 | 27.75 |
| EDN (Chan, Ginosar, Zhou, & Efros, 2019) | – | – | – | – | – | – | – | – | – | – | 0.737 | 41.54 |
| vid2vid (Wang et al., 2018) | – | – | – | – | – | – | – | – | – | – | 0.750 | 56.17 |
| Pix2PixHD (Wang et al., 2018a) | – | – | – | – | – | – | – | – | – | – | 0.737 | 42.57 |
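For the visual metrics listed in Table 9, a minimal NumPy sketch of MSE, PSNR, and a simplified (global, single-window) SSIM is given below; reported SSIM values are normally computed over local windows, so the SSIM here is only an approximation for illustration.

```python
import numpy as np

def mse(x, y):
    return np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)

def psnr(x, y, max_val=255.0):
    m = mse(x, y)
    return float("inf") if m == 0 else 10.0 * np.log10(max_val ** 2 / m)

def global_ssim(x, y, max_val=255.0):
    """Simplified global SSIM over the whole image (no local windows)."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

# Toy usage: compare a noisy synthetic frame against its ground truth frame.
rng = np.random.default_rng(0)
gt = rng.integers(0, 256, size=(64, 64, 3))
gen = np.clip(gt + rng.normal(0, 5, size=gt.shape), 0, 255)
print(mse(gt, gen), psnr(gt, gen), global_ssim(gt, gen))
```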
Currently, the proposed SLP systems cannot compete with existing avatar approaches. A large amount of high-resolution training data is necessary to obtain results comparable with motion capture and avatar-based approaches. However, the avatar-based approaches need detailed annotations using task-specific transcription languages, which can only be provided by expert linguists. Animating the avatar itself often involves a remarkable amount of hand-engineering. Motion capture-based approaches require high-fidelity data, which needs to be captured, cleaned, and stored at remarkable cost, decreasing the amount of data available and therefore making this approach unscalable. Given that recent approaches use automatic feature extraction methods, we think that in the short term these approaches will enable highly realistic and cost-effective translation of spoken languages to sign languages, improving equal access for the Deaf and Hard of Hearing. Generating high-resolution and signer-independent videos with signers of arbitrary appearance makes room to provide highly realistic, expressive, and end-to-end SLP systems applicable to real-world communications. Additionally, developing stronger data-processing strategies that pay attention to the intricate features of sign language data, such as the size of motion and speed, can be effective.

8. Conclusion

In this survey, we presented a detailed review of the recent advancements in SLP. We presented a taxonomy that summarized the main concepts related to SLP. We categorized recent works in SLP, providing separate discussions in each category. The proposed taxonomy covered different input modalities, datasets, applications, and proposed models. Here, we summarize the main findings:

Input modalities: Generally, vision and language modalities are the two input modalities in SLP. While the visual modality includes the captured image/video data used in training, the linguistic modality contains the text input from natural language, which is applicable in both the training and testing of the proposed models. Both categories benefit from deep learning approaches to improve model performance. RGB and skeleton are two common types of visual input data used in SLP models. While RGB images/videos contain high-resolution content, skeleton inputs decrease the parameter complexity of the model and assist in making a low-complexity and fast model. GAN and LSTM are the two most used deep learning-based models in SLP for visual inputs. While successful results were achieved using these models, more effort is necessary to generate more lifelike and high-resolution sign images/videos acceptable to the Deaf community. Among the deep learning-based models for the lingual modality, the NMT model is the most used model for input text processing. Other Seq2Seq models, such as RNN-based models, proved their effectiveness in many tasks. While accurate results were achieved using these models, more effort is necessary to overcome the existing challenges in the translation task, such as domain adaptation, uncommon words, word alignment, and word tokenization.

Datasets: The lack of a large annotated dataset is one of the major challenges in SLP. The collection and annotation of sign language data is an expensive task that needs the collaboration of linguistic experts and native speakers. While there are some publicly available datasets for SLP (Athitsos et al., 2018; Brock & Nakadai, 2018; Bungeroth et al., 2008; Camgöz et al., 2018; Camgoz et al., 2021; Caselli et al., 2017; Duarte et al., 2020; Ko et al., 2019; Matthes et al., 2012), they suffer from weakly annotated data for sign language. Furthermore, most of the available datasets in SLP contain a restricted domain of vocabularies/sentences. To facilitate real-world communication between the Deaf and hearing communities, access to a large-scale continuous sign language dataset, segmented at the sentence level, is necessary. In such a dataset, a paired form of the continuous sign language sentence and the corresponding spoken language sentence needs to be included. Just a few datasets meet these criteria (Camgoz et al., 2018a; Duarte et al., 2020; Ko et al., 2019; Zelinka & Kanis, 2020). The point is that most of the aforementioned datasets cannot be used for end-to-end translation (Camgoz et al., 2018a; Ko et al., 2019; Zelinka & Kanis, 2020). Two public datasets, RWTH-PHOENIX-Weather 2014T and How2Sign, are the most used datasets in SLP. The former includes German sign language sentences that can be used for text-to-sign language translation. The latter is a recently proposed multi-modal dataset used for speech-to-sign language translation. Though RWTH-PHOENIX-Weather 2014T (Camgoz et al., 2018a) and How2Sign (Duarte, 2019) provided appropriate SLP evaluation benchmarks, they are not enough for the generalization of the SLP models. Furthermore, these datasets only include German and American sentences. Translating from the spoken language to a large diversity of sign languages is a major challenge for the Deaf community.
Fig. 14. Translation results from Stoll, Camgoz, et al. (2020): (a) ‘‘Guten Abend liebe Zuschauer’’. (Good evening dear viewers), (b) ‘‘Im Norden maessiger Wind an den Kuesten
weht er teilweise frisch’’. (Mild winds in the north, at the coast it blows fresh in parts). Top row: Ground truth gloss and video, Bottom row: Generated gloss and video. This
model combines an NMT network and GAN for SLP.
Applications: American Sign Language (ASL) is the most-used sign language in developed applications for SLP. Since it may be hard for Deaf people to read or write the spoken language, they need some tools for communication with other people in society. Furthermore, many interesting and useful applications on the internet are not accessible to the Deaf community. To tackle these challenges, some projects have been proposed aiming to develop such tools. While these applications successfully made a bridge between the Deaf and hearing communities, we are still far from having applications involving large vocabularies/sentences from complex real-world scenarios. One of the main challenges for these applications is the license right for usage. Another challenge is regarding the application domain. Most of these applications have been developed for very specific domains, such as clinics, hospitals, and police stations. Improving the amount of available data and its quality can benefit the creation of these needed applications. Furthermore, understanding the Deaf culture is helpful to create systems that align with user needs and desires.

Proposed models: The proposed works in SLP can be categorized into five categories: Avatar approaches, NMT approaches, MG approaches, Conditional image/video generation approaches, and other approaches. Table 2 shows a summary of state-of-the-art deep SLP models. Some samples of the generated videos and gloss annotations are shown in Figs. 14, 15, and 16. Using the data collected from motion capture, avatars can be more usable and acceptable for viewers. Avatars achieve highly realistic results, but the results are restricted to a small set of phrases. This comes from the cost of data collection and annotation. Furthermore, avatar data is not a scalable solution and needs expert knowledge to be inspected and polished. To cope with these problems and improve performance, deep learning-based models are used.

While NMT-based methods achieved significant results in translation tasks, some major challenges need to be solved. Domain adaptation is the first challenge in this area. Since translation in different domains needs different styles and requirements, it is a crucial requirement in developing machine translation systems targeted at a specific use case. The second challenge is regarding the amount of available training data. Especially in deep learning-based models, increasing the amount of data can lead to better results. Another challenge is regarding uncommon words. The translation models perform poorly on these words. Word alignment and adjusting the beam search parameters are other challenges in NMT-based models.

Although MG can generate plausible and controllable motion through a database of motion capture, it faces some challenges. The first challenge is regarding limited access to data. To show the model potential with a truly diverse set of actions, a large set of data is necessary. The scalability and computational complexity of the graph to select the best transitions are other challenges in MG. Furthermore, since the number of edges leaving a node increases with the graph size, the branching factor in the search algorithm will increase as well. To automatically adjust the graph configuration and rely on the training data, instead of user intervention, a Graph Convolutional Network (GCN) could be used along with some refining algorithms to adapt the graph structure monotonically.

While GANs have recently achieved remarkable results for image/video generation, there exist major challenges in the training of GANs. Mode collapse, non-convergence and instability, suitable objective functions, and optimization algorithms are some of these challenges. However, several suggestions have been recently proposed to address the better design and optimization of GANs. Appropriate design of the network architecture, proper objective functions, and optimization algorithms are some of the proposed techniques to improve the performance of GAN-based models. Finally, the challenge of model complexity still remains for hybrid models.

Limitations: In this survey, we presented recent advances in SLP and related areas using deep learning. While successful results have been achieved in SLP by recent deep learning-based models, there are some limitations that need to be addressed. The main challenge is regarding Multi-Signer (MS) generation, which is necessary for providing real-world communication in the Deaf community. To this end, we need to produce multiple signers of different appearances and configurations. Another limitation is the possibility of high-resolution and photo-realistic continuous sign language videos. Most of the proposed models in SLP can only generate low-resolution sign samples. Conditioning on human keypoints extracted from training data can decrease the parameter complexity of the model and assist in producing a high-resolution sign video. However, avatar-based models can successfully generate high-resolution video samples, though they are complex and expensive. In addition, the pruning algorithms of MG need to be improved by including additional features of sign language, such as the duration and speed of motion.
Fig. 15. Translation results from Stoll, Camgoz, et al. (2020): Text from spoken language is translated to human pose sequences.
Fig. 16. Translation results from Kipp et al. (2011): A signing avatar is created using a character animation system. Top row: signing avatar, Bottom row: original video.
Future directions: While recent models in SLP presented promising results relying on deep learning capabilities, there is still much room for improvement. Considering the discriminative power of self-attention, learning to fuse multiple input modalities to benefit from multi-channel information, learning structured spatiotemporal patterns (such as with Graph Neural Network models), and employing domain-specific prior knowledge of sign language are some possible future directions in this area. Furthermore, there are some exciting assistive technologies for Deaf and hearing-impaired people. A brief introduction to these technologies can give insight to researchers in SLP and also make a bridge between them and the corresponding technology requirements. These technologies fall into three device categories: hearing technology, alerting devices, and communication support technology. For example, let us imagine a technology that assists a Deaf person going through a musical experience translated into another sensory modality. While the recent advances in SLP are promising, more endeavor is indispensable to provide a fast processing model in an uncontrolled environment considering rapid hand motions. It is clear that technology standardization and full interoperability among devices and platforms are prerequisites to having real-life communication between the hearing and hearing-impaired communities. Finally, since providing the data annotation is also challenging, recently some efforts have been made by Rastgoo, Kiani, and Escalera (2022a) and Rastgoo, Kiani, Escalera, and Sabokrou (2022b) to overcome the annotation bottleneck. To this end, Zero-Shot Learning (ZSL) is employed for SLR. Using this approach, we hope to get closer to real and accurate systems for bidirectional communication between Deaf and hearing people in society.

CRediT authorship contribution statement

Razieh Rastgoo: Work supervisors, Conceptualization, Supervision, Writing – review & editing. Kourosh Kiani: Work supervisors, Supervision, Review & editing. Sergio Escalera: Work supervisors, Supervision, Review & editing. Vassilis Athitsos: Work supervisors, Supervision, Review & editing. Mohammad Sabokrou: Work supervisors, Supervision, Review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

Acknowledgments

This work has been partially supported by the HIS Company and the Institute for Research in Fundamental Sciences (IPM) in Iran, Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE), CERCA Programme/Generalitat de Catalunya, and ICREA under the ICREA Academia programme.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References

Cohen, M., Voldman, I., Regazzoni, D., & Vitali, A. (2018). Hand rehabilitation via
gesture recognition using leap motion controller. In 11th international conference on
Ankith, B., Boggaram, A., Sharma, A., Ramanujan, A., & Bharathi, R. (2022). Sign human system interaction.
language translation systems: A systematic literature review. International Journal Conneau, A., Lample, G., Ranzato, M., Denoyer, L., & Jégou, H. (2017). Word
of Software Science and Computational Intelligence (IJSSCI), 14, 1–33. translation without parallel data. arXiv:1710.04087.
Arikan, O., & Forsyth, D. (2002). Interactive motion generation from examples. In Cortana, C. 2022. www.microsoft.com.
Proceedings of the 29th annual conference on computer graphics and interactive Cox, S., Lincoln, M., Tryggvason, J., Nakisa, M., Wells, M., Tutt, M., et al. (2002).
techniques (pp. 483–490). Tessa, a system to aid communication with deaf people. In Proceedings of the 5th
Artetxe, M., & Schwenk, H. (2019). Massively multilingual sentence embeddings for international ACM conference on assistive technologies (pp. 205–212).
zero-shot cross-lingual transfer and beyond. In Transactions of the Association for CSL (2022). Chinese sign language. https://www.startasl.com/chinese-sign-language/.
Computational Linguistics, vol. 7 (pp. 597–610). Dangsaart, S., Naruedomkul, K., Cercone, N., & Sirinaovakul, B. (2008). Intelligent
Athitsos, V., Neidle, C., Sclaroff, S., Nash, J., Stefan, A., Yuan, Q., et al. (2018). The Thai text – Thai sign translation for language learning. Computers and Education,
American sign language lexicon video dataset. In Proceedings of the IEEE conference 1125–1141.
on computer vision and pattern recognition (pp. 1–8). Darabkh, K., Alturk, F., & Sweidan, S. (2018). VRCDEA-TCS: 3D virtual reality cooper-
Auslan Language (2022). shopping-and-auslan. https://www.gcss.org.au/2018/07/ ative drawing educational application with textual chatting system. In Comput appl
shopping-and-auslan/. eng educ, vol. 26 (pp. 1677–1698).
Bachmann, D., Weichert, F., & Rinkenauer, G. (2018). Review of three-dimensional Dawes, F., Penders, J., & Carbone, G. (2018). Remote control of a robotic hand using
human-computer interaction with focus on the leap motion controller. Sensors. a leap sensor. In The international conference of IFToMM (pp. 332–341).
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly Deafness and hearing 2022. https://www.who.int/news-room/fact-sheets/detail/
learning to align and translate. In ICLR. deafness-and-hearing.
Bangham, J., Cox, S., Elliott, R., Glauert, J., Marshall, I., Rankov, S., et al. (2000). Delorme, M., & Braffort, M. (2009). Animation generation process for sign language syn-
Virtual signing: Capture, animation, storage and transmission – an overview of the thesis. In Advances in Computer-Human Interactions, Second International Conferences
ViSiCAST project. Speech and Language Processing for Disabled and Elderly People. on.
Ben, S., Camgoz, N., & Bowden, R. (2021a). Mixed signals: Sign language production Denkowski, M., & Lavie, A. (2014). Meteor universal: Language specific translation
via a mixture of motion primitives. In Proceedings of the IEEE/CVF international evaluation for any target language. Proceedings of the Ninth Workshop on Statistical
conference on computer vision (pp. 1919–1929). Machine Translation, 376—380.
Ben, S., Camgoz, N., & Bowden, R. (2021b). Skeletal graph self-attention: Embedding Denton, E., Chintala, S., Szlam, A., & Fergus, R. (2015). Deep generative image models
a skeleton inductive bias into sign language production. arXiv:2112.05277. using a laplacian pyramid of adversarial networks. Advances in Neural Information
Ben, S., Camgoz, N., & Bowden, R. (2022). Signing at scale: Learning to co-articulate Processing Systems (NIPS).
signs for large-scale photo-realistic sign language production. In Proceedings of the DHHP (2022). Deaf and hard of hearing program. https://www.childrenshospital.org/
IEEE/CVF conference on computer vision and pattern recognition (pp. 5141–5151). centers-and-services/programs/a-e/deaf-and-hard-of-hearing-program.
Bragg, D., et al. (2019). Sign language recognition, generation, and translation: An Duarte, A. (2019). Cross-modal neural sign language translation. In The 27th ACM
interdisciplinary perspective. In ASSETS ’19. international conference.
Bregler, C. (2007). Motion capture technology for entertainment. IEEE Signal Processing Duarte, K., & Gibet, S. (2010). Corpus design for signing avatars. In Workshop on
Magazine. representation and processing of sign languages: Corpora and sign language technologies
Brock, H., & Nakadai, K. (2018). Deep JSLC: A multimodal corpus collection for data- (pp. 1–3).
driven generation of Japanese sign language expressions. In Proceedings of the Duarte, A., Sh., P., Ghadiyaram, D., DeHaan, K., Metze, F., Torres, J., et al.
eleventh international conference on language resources and evaluation. (2020). How2Sign: A large-scale multimodal dataset for continuous American sign
Bungeroth, J., Stein, D., Dreuw, P., Ney, H., Morrissey, S., Way, A., et al. (2008). The language. In Sign language recognition, translation, and production workshop.
ATIS sign language corpus. In 6th International conference on language resources and Eastin, E. (2022). European assistive technology information network (EASTIN)
evaluation. database. http://www.eastin.eu.
Butt, A., Rovini, E., Dolciotti, C., De Petris, G., Bongioanni, P., Carboncini, M., et EblingJohn, S., & Glauert, G. (2013). Exploiting the full potential of jasigning to
al. (2018). Objective and automatic classification of parkinson disease with leap build an avatar signing train announcements. In 3rd International symposium on
motion controller. Biomedical Engineering. sign language translation and avatar technology (pp. 1–9).
Cai, S., Zhu, G., Tien Wu, Y., Liu, E., & Hu, X. (2018). A case study of gesture-based Efthimiou, E., Fotinea, S., Hanke, T., Glauert, J., Bowden, R., Braffort, A., et al. (2012).
games in enhancing the fine motor skills and recognition. In Interactive learning The dicta-sign wiki: Enabling web communication for the deaf. In International
environments, vol. 26. conference on computers for handicapped persons (pp. 205–212).
Camgöz, N., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018). RWTH-PHOENIX- Ehima, E. (2022). Guidance document for classification of hearing aids and accessories.
weather 2014 T: Parallel corpus of sign language video, Gloss and translation. In In European hearing instrument manufacturers association.
Proceedings of the IEEE conference on computer vision and pattern recognition. Elliott, D., Frank, S., Simaéan, K., & Specia, L. (2016). Multi30k: Multilingual english-
Camgoz, N., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018a). Neural sign german image descriptions. In Proceedings of the 5th workshop on vision and language
language translation. In Proceedings of the IEEE conference on computer vision and (pp. 70–74).
pattern recognition. Ethnologue (2022a). Argentine sign language. https://www.ethnologue.com/language/
Camgoz, N., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018b). Neural sign aed.
language translation. IEEE Conference on Computer Vision and Pattern Recognition. Ethnologue (2022b). Greek sign language. https://www.ethnologue.com/language/gss.
Camgoz, N., Koller, O., Hadfield, S., & Bowden, R. (2020). Multi-channel transformers Ethnologue (2022c). Persian sign language. https://www.ethnologue.com/language/
for multi-articulatory sign language translation. In ECCVW. psc.
Camgoz, N., Saunders, B., Rochette, G., Giovanelli, M., Inches, G., Nachtrab-Ribback, R., Facebook Messenger 2022. www.facebook.com.
et al. (2021). Content4All open research sign language translation datasets. In IEEE FaceTime, F. 2022. www.Appleapps.apple.com.
international conference on automatic face and gesture recognition. Field, M., Stirling, D., Naghdy, F., & Pan, Z. (2009). Motion capture in robotics review.
Caselli, N., Sehyr, Z., Cohen-Goldberg, A., & Emmorey, K. (2017). ASL-lex: A lexical In Proceedings of the 2009 IEEE international conference on control and automation;
database for ASL. In Behavior research methods, vol. 49 (pp. 784—801). Institute of electrical and electronics engineers (pp. 1697—1702).
CDC (2022). Centers for disease control and prevention. https://www.cdc.gov/. Finn, C., Goodfellow, I., & Levine, S. (2016). Unsupervised learning for physical
Chan, C., Ginosar, S., Zhou, T., & Efros, A. (2019). Everybody dance now. IEEE interaction through video prediction. Advances in Neural Information Processing
Conference on Computer Vision and Pattern Recognition. Systems (NIPS).
Chen, Q., & Koltun, V. (2017). Photographic image synthesis with cascaded refinement Forster, J., Schmidt, C., Hoyoux, T., Koller, O., Zelle, U., Piater, J., et al. (2012). RWTH-
networks. In ICCV (pp. 1511–1520). PHOENIX-weather: A large vocabulary sign language recognition and translation
Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. (2018). Deeplab: corpus. In Proceedings of the eighth international conference on language resources and
Semantic image segmentation with deep convolutional nets, atrous convolution, evaluation (pp. 3785—3789).
and fully connected crfs. In TPAMI, vol. 40. Gárate, M. (2014). Developing bilingual literacy in deaf children. Kurosio Publishe.
Cho, B., Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. Geers, A., et al. (2017). Early sign language exposure and cochlear implantation
(2014). Learning phrase representations using RNN encoder–decoder for statistical benefits. In Pediatrics, vol. 140.
machine translation. arXiv:1406.1078. Geng, W., & Yu, G. (2003). Reuse of motion capture data in animation: A review. In
Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., & Bengio, H. Proceedings of the lecture notes in computer science (pp. 620—629).
(2014). Learning phrase representations using RNN encoder-decoder for statistical Ghanem, S., Conly, C., & Athitsos, V. (2017). A survey on sign language recognition
machine translation. In EMNLP (pp. 1724—1734). using smartphones. In Proceedings of the 10th international conference on pervasive
Christopher, K., Kümmel, C., Ritter, D., & Hildebrand, K. (2021). Pose-guided sign technologies related to assistive environments, Island of Rhodes Greece.
language video GAN with dynamic lambda. arXiv:2105.02742. Gibet, S. (2018). Building french sign language motion capture corpora for signing
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated avatars. In Workshop on the representation and processing of sign languages: involving
recurrent neural networks on sequence modeling. arXiv:1412.3555. the language community.
Gibet, S. (2020). Signing avatars: what challenges for the production of content in sign Kataoka, Y., Matsubara, T., & Uehara, K. (2016). Image generation using adversarial
languages? Disability. networks and attention mechanism. In IEEE/ACIS 15th international conference on
Gibet, S., Lefebvre-Albaret, F., Hamon, L., Brun, R., & Turki, A. (2016a). Interactive computer and information science.
editing in french sign language dedicated to virtual signers: Requirements and Kennaway, R. (2013). Avatar-independent scripting for real-time gesture animation,
challenges. In Universal access in the information society, vol. 15 (pp. 525–539). procedural animation of sign language. arXiv:1502.02961.
Gibet, S., Lefebvre-Albaret, F., Hamon, L., Brun, R., & Turki, A. (2016b). Interactive Khan, N., Abid, A., & Abid, K. (2020). A novel natural language processing (NLP)-
editing in french sign language dedicated to virtual signers: Requirements and based machine translation model for English to Pakistan sign language translation.
challenges. In Universal access in the information society, vol. 15 (pp. 525–539). In Cognitive computation, vol. 12 (pp. 748–765).
Gibet, S., & Marteau, P. (2023). Signing avatars-multimodal challenges for text-to-sign Kingma, D., & Welling, M. (2014). Auto-encoding variational bayes. In ICLR.
generation. In 2023 IEEE 17th international conference on automatic face and gesture Kipp, M., Heloir, A., & Nguyen, Q. (2011). Sign language avatars: Animation
recognition (pp. 1–8). and comprehensibility. In International workshop on intelligent virtual agents (pp.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et 113–126).
al. (2014). Attribute2image: Conditional image generation from visual attributes. Ko, S., Kim, C., Jung, H., & Cho, C. (2019). Neural sign language translation based on
Advances in Neural Information Processing Systems, 2672—2680. human keypoint estimation. In Applied sciences, vol. 9.
Graves, A., Fernandez, S., & Schmidhuber, J. (2007). Multi-dimensional recurrent neural Kovar, L., Gleicher, M., & Pighin, F. (2002). Motion graphs. In SIGGRAPH ’02:
networks. In ICANN. Proceedings of the 29th annual conference on computer graphics and interactive
Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent techniques (pp. 473–483).
neural networks. In ICASSP. Lample, G., Conneau, A., Denoyer, L., & Ranzato, M. (2017). Unsupervised machine
Gregor, K., Danihelka, I., Graves, A., Rezende, D., & Wierstra, D. (2015). Draw: A translation using monolingual corpora only. arXiv:1711.00043.
recurrent neural network for image generation. In Proceedings of machine learning Lane, H. (2017). A chronology of the oppression of sign language in France and the
research. United States. In Recent perspectives on American sign language (pp. 119—161).
Grieve, S. (1999). English to American sign language machine translation of weather Psychology Press.
reports. In Proceedings of the second high desert student conference in linguistics (pp. Larboulette, C., & Gibet, S. (2018). Avatar signers: what can they teach us?, J. Enaction
23–30). 2018: Day on enaction in animation. Simulation and Virtual Reality.
Grieve, S. (2002). SignSynth: A sign language synthesis application using Web3D and LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied
perl. In International gesture workshop (pp. 134–145). to document recognition. In Proceedings of the IEEE, vol. 86.
Grosjean, F. (2010). Bilingualism, biculturalism, and deafness. In International journal Lee, J., & Shin, S. (1999). A hierarchical approach to interactive motion editing
of bilingual education and bilingualism, vol. 13 (pp. 133–145). for human-like figures. A Hierarchical Approach to Interactive Motion Editing for
Guo, D., Zhou, W., Li, H., & Wang, M. (2018). Hierarchical LSTM for sign language Human-Like Figures, 39–48.
translation. In The thirty-second AAAI conference on artificial intelligence.
Lewis, P., Oğuz, B., Rinott, R., Riedel, S., & Schwenk, H. (2019). MLQA: Evaluating
Hamers, J. (1998). Cognitive and language development of bilingual children. In
cross-lingual extractive question answering. arXiv:1910.07475.
Parasnis, cultural, and language diversity and the deaf experience (pp. 51—75).
Li, D., Xu, C., Liu, L., Zhong, Y., Wang, R., Peterson, L., et al. (2022). Transcribing
Cambridge University Press.
natural languages for the deaf via neural editing programs. In Proceedings of the
Hammadi, M., Muhammad, G., Abdul, W., Alsulaiman, M., Bencherif, M., &
AAAI conference on artificial intelligence, vol. 36 (pp. 11991–11999).
Mekhtiche, M. (2020). Hand gesture recognition for sign language using 3DCNN.
Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. In
IEEE Access, 8, 79491–79509.
Proceedings of the workshop on text summarization branches out.
Hangout, G. 2022. www.hangout.google.com.
Lotter, W., Kreiman, G., & Cox, D. (2015). Unsupervised learning of visual structure
Hanke, T. (2004). HamNoSys – Representing sign language data in language resources
using predictive generative networks. arXiv:1511.06380.
and language processing contexts. In Workshop on the representation and processing
Lucie, N., Larboulette, C., & Gibet, S. (2020). SignSynth: Data-driven sign language
of sign languages. 4th international conference on language resources and evaluation.
video generation. In Computers and graphics, vol. 92 (pp. 76–98).
Hanke, T. (2022). German sign language. https://www.awhamburg.de/en/research/
Lucker, J., & Hersh, M. (2003). Alarm and alerting systems for hearing-impaired and
long-term-scientific-projects/dictionary-german-sign-language.html.
deaf people. In Assistive technology for the hearing-impaired, Deaf and Deaf blind (pp.
HDS (2022). Center for hearing and deaf services. https://www.nationaldeafcenter.org/.
215–255).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image
Luo, W., Li, Y., Urtasun, R., & Zemel, R. (2016). Understanding the effective recep-
recognition. In Proceedings of the IEEE conference on computer vision and pattern
tive field in deep convolutional neural networks. Advances in Neural Information
recognition.
Processing Systems (NIPS).
Hersh, M., & Johnson, M. (2003). Hearing-aid principles and technology. In Assistive
Luong, M., Pham, h., & Manning, C. (2015). Effective approaches to attention-based
technology for the hearing-impaired, deaf and deaf blind (pp. 71–116).
neural machine translation. arXiv:1508.04025.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. In Neural
computation, vol. 9. Luqman, H., & Sabri, A. (2019). Automatic translation of arabic text to arabic sign
HSDC (2022). Hearing, speech and deaf cente. https://www.nationaldeafcenter.org/. language. In Universal access in the information society, vol. 18 (pp. 939–951).
Huenerfauth, M. (2004). Spatial and planning models of ASL classifier predicates Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., & Van Gool, L. (2017). Pose guided
for machine translation. In The 10th international conference on theoretical and person image generation. Advances in Neural Information Processing Systems (NIPS).
methodological issues in machine translation (pp. 65–74). Majidi, N., Kiani, K., & Rastgoo, R. (2020). A deep model for super-resolution
Huenerfauth, M. (2005). American sign language generation: Multimodal NLG with enhancement from a single image. In Journal of AI and data mining, vol. 8 (pp.
multiple linguistic channels. In Proceedings of the ACL student research workshop 451–460).
(pp. 37–42). ManCAD (2022). Manchester centre for audiology and deafness. https://www.
Humphries, T. (2013). Schooling in American sign language: A paradigm shift from a research.manchester.ac.uk/portal/en/projects/manchester-centre-for-audiology-
deficit model to a bilingual model in deaf education. In Berkeley review of education, and-deafness-mancad(90f5d28b-08c9-4c76-891b-9340c290529f).html.
vol. 4 (pp. 7–33). Mathieu, M., Couprie, C., & LeCun, Y. (2016). Deep multi-scale video prediction beyond
Hutchins, J. (2005). History of machine translation. http://psychotransling.ucoz.com/- mean square error. In ICLR.
ld/0/11-Hutchins-survey.pdf. Matthes, S., Hanke, T., Regen, A., Storz, J., Worseck, S., Efthimiou, E., et al. (2012).
Imo, I. 2022. www.Imo.com. Dicta-sign–building a multilingual sign language corpus. In 5th LREC. Istanbul.
Jain, V., et al. (2007). Supervised learning of image restoration with convolutional McDonald, J., Wolfe, R., Schnepp, J., Hochgesang, J., Jamrozik, D., Stumbo, M., et al.
networks. In ICCV. (2016). An automated technique for real-time production of lifelike animations of
Jemni, M., Ghoul, O., Boulares, M., Yahia, N., Jaballah, K., Othman, A., et al. (2022). American sign language. In Universal access in the information society, vol. 15 (pp.
WebSign. http://www.latice.rnu.tn/websign/. 551–566).
Kahlon, N., & Singh, W. (2021). Machine translation from text to sign language: a Michel, P., & Neubig, G. (2018). MTNT: A testbed for machine translation of noisy
systematic review. Universal Access in the Information Society, 1–35. text. In Proceedings of the 2018 conference on empirical methods in natural language
Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A., Graves, A., & Kavukcuoglu, K. processing.
(2016). Neural machine translation in linear time. arXiv:1610.10099. Morando, M., Ponte, S., Ferrara, E., & Dellepiane, S. (2018). Definition of motion
Kanis, J., Zahradil, j., Jurčíček, F., & Müller, L. (2006). Czech-sign speech corpus for and biophysical indicators for home-based rehabilitation through serious games.
semantic based machine translation. In International conference on text, speech and In Information, vol. 9.
dialogue (pp. 613–620). Mori, M., MacDorman, K., & Kageki, N. (2012). The uncanny valley [from the field. In
Karpouzis, K., Caridakis, G., Fotinea, S.-E., & Efthimiou, E. (2020). Educational IEEE robotics and automation magazine, vol. 19 (pp. 98–100).
resources and implementation of a greek sign language synthesis architecture. In NAD (2022). National association of the deaf. Assistive Listening Systems and Devices.
Web3D technologies in learning, education and training (pp. 54–74). Naert, L., Larboulette, C., & Gibet, S. (2017). Coarticulation analysis for sign language
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for synthesis. In Universal access in human–computer interaction. Designing novel interac-
generative adversarial networks. In Proceedings of the IEEE conference on computer tions: 11th international conference, UAHCI 2017, held as part of HCI international
vision and pattern recognition. 2017, Vancouver, BC, Canada, July 9–14, 2017, proceedings, part II 11 (pp. 55–75).
Naert, L., Larboulette, C., & Gibet, S. (2020a). Lsf-animal: A motion capture corpus in Salto (2022). Polish sign language. https://www.salto-youth.net/tools/otlas-partner-
french sign language designed for the animation of signing avatars. In Proceedings finding/organisation/association-of-polish-sign-language-interpreters.2561/.
of the 12th language resources and evaluation conference (pp. 6008–6017). Saunders, B., Camgöz, N., & Bowden, R. (2020a). Adversarial training for multi-channel
Naert, L., Larboulette, C., & Gibet, S. (2020b). A survey on the animation of signing sign language production. In BMVC.
avatars: From sign representation to utterance synthesis. In Computers and graphics, Saunders, B., Camgoz, N., & Bowden, R. (2020b). Everybody sign now: Translating
vol. 92 (pp. 76–98). spoken language to photo realistic sign language video. arXiv:2011.09846.
Naert, L., Larboulette, C., & Gibet, S. (2021). Motion synthesis and editing for the Saunders, B., Camgöz, N., & Bowden, R. (2020c). Everybody sign now: Translating
generation of new sign language content: Building new signs with phonological spoken language to photo realistic sign language video. arXiv:2011.09846.
recombination. In Machine translation, vol. 35 (pp. 405–430). Saunders, B., Camgoz, N., & Bowden, R. (2020d). Progressive transformers for
Naert, L., Reverdy, C., Larboulette, C., & Gibet, S. (2018). Per channel automatic end-to-end sign language production. In ECCV (pp. 687–705).
annotation of sign language motion capture data. In Workshop on the representation Saunders, B., Camgoz, N., & Bowden, R. (2021). AnonySIGN: Novel human appearance
and processing of sign languages: Involving the language community. synthesis for sign language video anonymization. In IEEE international conference
Nakazawa, T., Yaguchi, M., Uchimoto, K., Utiyama, M., Sumita, E., Kurohashi, S., et on automatic face and gesture recognition.
al. (2016). ASPEC: Asian scientific paper excerpt corpus. In Proceedings of the ninth See, A., & Lamm, M. (2020). Machine translation, sequence-to-sequence and attention.
international conference on language resources and evaluation (pp. 2204–2208). https://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture08-nmt.pdf.
Natarajan, B., & Elakkiya, R. (2022). Dynamic GAN for high-quality sign language Sharma, S., & Kumar, K. (2021). ASL-3DCNN: American sign language recognition tech-
video generation from skeletal poses using generative adversarial networks. In Soft nique using 3-d convolutional neural networks. In Multimedia tools and applications,
computing, vol. 26 (pp. 13153–13175). vol. 80 (pp. 26319–26331).
NCDB (2022). National center on deaf-blindness. https://www.nationaldb.org/. Shi, X., Chen, Z., Wang, H., Yeung, D., Wong, W., & Woo, W. (2015). Convolutional
NDC (2022). National deaf cente. https://www.nationaldeafcenter.org/. LSTM network: A machine learning approach for precipitation nowcasting. Advances
NIDCD (2022). National institute on deafness and other communication. https://www. in Neural Information Processing Systems (NIPS).
nidcd.nih.gov/. Siarohin, A., Sangineto, E., Lathuiliere, S., & Sebe, N. (2018). Deformable gans for pose-
Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural net- based human image generation. In Proceedings of the IEEE conference on computer
works. In Proceedings of the 33rd international conference on machine learning (pp. vision and pattern recognition.
1747–1756). Siddique, S., Ahmed, T., Talukder, R., & Uddin, M. (2020). English to bangla machine
Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., & Kavukcuoglu, K. translation using recurrent neural network. In International journal of future computer
(2016). Conditional image generation with pixel-cnn decoders. Advances in Neural and communication, vol. 9.
Information Processing Systems (NIPS). Siri, S. 2022. www.apple.com.
Othman, A., & Jemni, M. (2011). Statistical sign language machine translation: from Skype 2022. www.skype.com.
english written text to American sign language gloss. In IJCSI international journal Snapchat 2022. www.snapchat.com.
of computer science, vol. 8 (pp. 65–73). Sripairojthikoon, N., & Harnsomburana, J. (2019). Thai sign language recognition using
Owlcation (2022). Korean sign language. https://owlcation.com/humanities/Korean- 3D convolutional neural networks. In ICCCM 2019: Proceedings of the 2019 7th
Sign-Language. international conference on computer and communications management (pp. 186–189).
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic Start ASL (2022). Spanish sign language. https://www.startasl.com/spanish-sign-
evaluation of machine translation. In ACL’02: Proceedings of the 40th annual meeting language-ssl/.
on Association for Computational Linguistics (pp. 311—318). Stokoe, W., Casterline, D., & Croneberg, C. (1965). A dictionary of American sign language
Pei, W., Zhang, J., Wang, X., Ke, L., Shen, K., & Tai, Y. (2019). Memory-attended on linguistic principles. Linstok Press.
recurrent network for video captioning. In Proceedings of the IEEE conference on Stoll, S., Camgoz, N., Hadfield, S., & Bowden, R. (2018). Sign language production
computer vision and pattern recognition (pp. 8347–8356). using neural machine translation and generative adversarial networks. In BMVC.
Plantard, P., Shum, H., Le Pierres, A., & Multon, F. (2017). Validation of an ergonomic Stoll, S., Camgoz, N., Hadfield, S., & Bowden, R. (2020). Text2Sign: Towards sign
assessment method using Kinect data in real workplace conditions. In Appl. ergon., language production using neural machine translation and generative adversarial
vol. 65 (pp. 562—569). networks. In International journal of computer vision, vol. 128 (pp. 891–908).
Prillwitz, S. (1989). HamNoSys. Version 2.0. Hamburg notation system for sign languages. Stoll, S., Hadfield, S., & Bowden, R. (2020). SignSynth: Data-driven sign language video
An introductory guide. Hamburg Signum Press. generation. In Lecture notes in computer science: vol. 12538, Computer vision - ECCV
Rastgoo, R., Kiani, K., & Escalera, S. (2018). Multi-modal deep hand sign language 2020 workshops.
recognition in still images using restricted Boltzmann machine. In Entropy, vol. 20. Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural
Rastgoo, R., Kiani, K., & Escalera, S. (2020a). Hand sign language recognition using networks. Advances in Neural Information Processing Systems (NIPS).
multi-view hand skeleton. In Expert systems with applications, vol. 150. Swanwick, R. (2010). Bipolicy and practice in sign bilingual education: Develop-
Rastgoo, R., Kiani, K., & Escalera, S. (2020b). Video-based isolated hand sign language ment, challenges and directions. In International journal of bilingual education and
recognition using a deep cascaded model. In Multimedia tools and applications, vol. bilingualism, vol. 13 (pp. 147–158).
79 (pp. 22965—22987). Tamir, M., & Oz, G. (2008). Real-time objects tracking and motion capture in sports
Rastgoo, R., Kiani, K., & Escalera, S. (2021a). Hand pose aware multimodal isolated sign events. In U.S. patent application No. 11/909,080.
language recognition. In Multimedia tools and applications, vol. 80 (pp. 127—163). Tang, S., Hong, R., Guo, D., & Wang, M. (2022). Gloss semantic-enhanced network with
Rastgoo, R., Kiani, K., & Escalera, S. (2021b). Real-time isolated hand sign language online back-translation for sign language production. In Proceedings of the 30th ACM
recognition using deep networks and SVD. Journal of Ambient Intelligence and international conference on multimedia (pp. 5630–5638).
Humanized Computing. Tiedemann, J. (2016). Finding alternative translations in a large corpus of movie
Rastgoo, R., Kiani, K., & Escalera, S. (2021c). Sign language recognition: A deep survey. subtitles. In Proceedings of the 10th international conference on language resources
In Expert systems with application, vol. 164. Article 113794. and evaluation.
Rastgoo, R., Kiani, K., & Escalera, S. (2022a). ZS-SLR: Zero-shot sign language Tornay, S., Camgöz, N., Bowden, R., & Doss, M. (2020). A phonology-based approach
recognition from RGB-d videos. arXiv:2108.10059. for isolated sign production assessment in sign language. In ICMI ’20 companion:
Rastgoo, R., Kiani, K., & Escalera, S. (2022y). A non-anatomical graph structure for Companion publication of the 2020 international conference on multimodal interaction
isolated hand gesture separation in continuous gesture sequences. arXiv:2207. (pp. 102–106).
07619. UN General Assembly (2006). Convention on the rights of persons with disabilities. In
Rastgoo, R., Kiani, K., & Escalera, S. (2022z). Word separation in continuous sign GA res, vol. 61.
language using isolated signs and post-processing. arXiv:2204.00923. Vaitkevičius, A., Taroza, M., Blažauskas, T., Damaševičius, R., Maskeliūnas, R., &
Rastgoo, R., Kiani, K., & Escalera, S. (2023x). A deep co-attentive hand-based video Woźniak, M. (2019). Recognition of American sign language gestures in a virtual
question answering framework using multi-view skeleton. Multimedia Tools and reality using leap motion. In Appl. sci., vol. 9.
Applications 82, 1401–1429. Valero, P., Sivanathan, A., Bosché, F., & Abdel-Wahab, M. (2017). Analysis of con-
Rastgoo, R., Kiani, K., Escalera, S., & Sabokrou, M. (2021d). Sign language production: struction trade worker body motions using a wearable and wireless motion sensor
A review. In Proceedings of the IEEE/CVF conference on computer vision and pattern network. In Autom. constr., vol. 83 (pp. 48—55).
recognition (pp. 3451–3461). Vasani, N., Autee, P., Kalyani, S., & Karani, R. (2020). Generation of Indian sign lan-
Rastgoo, R., Kiani, K., Escalera, S., & Sabokrou, M. (2022b). Multi-modal zero-shot sign guage by sentence processing and generative adversarial networks. In International
language recognition. arXiv:2109.00796. conference on intelligent sustainable systems.
Rezaei, M., Rastgoo, R., & Athitsos, V. (2023). Trihorn-net: a model for accurate Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., et al. (2017).
depth-based 3d hand pose estimation. Expert Systems with Applications 223, 119922. Attention is all you need. Advances in Neural Information Processing Systems (NIPS).
Roccetti, M., Marfia, G., & Semeraro, A. (2012). Playing into the wild: A gesture-based Veale, T., & Conway, A. (1994). Cross modal comprehension in ZARDOZ an english to
interface for gaming in public spaces. In Journal of visual communication and image sign-language translation system. In INLG ’94 proceedings of the seventh international
representation, vol. 23 (pp. 426–440). workshop on natural language generation (pp. 249–252).
Rodriguez-Moreno, I., Martinez-Otzeta, J. M., & Sierra, B. (2023). HAKA: HierArchical Vedantam, R., Zitnick, C., & Parikh, D. (2015). Cider: Consensus-based image descrip-
knowledge acquisition in a sign language tutor. In Expert systems with applications, tion evaluation. In Proceedings of the workshop on text summarization branches out
vol. 215. (pp. 4566—4575).
Vicars, W. (2022). American sign language. http://www.lifeprint.com/. WFD, W. (2022). World federation of the deaf. In Working document on adoption and
Villegas, R., Yang, J., Hong, S., Lin, X., & Lee, H. (2017). Decomposing motion and adaptation of technologies and accessibility – Prepared by the WFD expert group on
content for natural video sequence prediction. In ICLR. accessibility and technology.
Virginia (2022). Northern virginia resource center for deaf and hard of hearing persons. WhatsApp Messenger 2022. www.whatsapp.com.
https://nvrc.org/. WHO (2022). World health organization. https://www.who.int/.
Virtual Humans Group (2017). Virtual humans research for sign language animation. WHO: World Health Organization (2022). Deafness and hearing loss. http://www.who.
In School of computing sciences (pp. 205–212). UEA Norwich, UK. int/mediacentre/factsheets/fs300/en/.
Walsh, H., Saunders, B., & Bowden, R. (2022). Changing the representation: Examining Yan, X., Yang, J., Sohn, K., & Lee, H. (2016). Attribute2image: Conditional image
language representation for neural sign language production. In Seventh international generation from visual attributes. In ECCV (pp. 776—791).
workshop on sign language translation and avatar technology: The junction of the visual Yang, H. (2014). Sign language recognition with the kinect sensor based on conditional
and the textual. random fields. Sensors, 15, 135–147.
Wang, T., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., et al. (2018). Video-to-video Yu, F., Koltun, V., & Funkhouser, T. (2017). Dilated residual networks. In Proceedings
synthesis. Advances in Neural Information Processing Systems (NIPS). of the IEEE conference on computer vision and pattern recognition.
Wang, T., Liu, M., Zhu, J., Tao, A., Kautz, J., & Catanzaro, B. (2018a). High-resolution Zelinka, J., & Kanis, J. (2020). Neural sign language synthesis: Words are our glosses.
image synthesis and semantic manipulation with conditional GANs. In Proceedings In WACV (pp. 3395–3403).
of the IEEE conference on computer vision and pattern recognition. Zhan, E., Zheng, S., Yue, Y., Sha, L., & Lucey, P. (2019). Generating multi-agent
Wellington, S., Alaniz, A., Hurtado, M., De Silva, B., & De Bem, R. (2022). SynLibras: trajectories using programmatic weak supervision. In ICLR.
A disentangled deep generative model for Brazilian sign language synthesis. In Zhao, L., Kipper, K., Schuler, W., Vogler, C., & Palmer, M. (2000). A machine translation
In 2022 35th SIBGRAPI conference on graphics, patterns and images, vol. 1 (pp. system from english to American sign language. In Conference of the Association for
210–215). machine translation in the Americas (pp. 54–67).
Wencan, H., Zhao, Z., He, J., & Zhang, M. (2022). DualSign: Semi-supervised sign Zij, L., & Barker, D. (2003). South African sign language machinee translation system.
language production with balanced multi-modal multi-task dual transformation. In In Proceedings of the second international conference on computer graphics, Virtual
Proceedings of the 30th ACM international conference on multimedia (pp. 5486–5495). reality, visualization and interaction in Africa (pp. 49–52).
Zwitserlood, I., Verlinden, M., Ros, J., & Schoot, S. (2005). Synthetic signing for the
deaf: eSIGN. (pp. 1–6). http://www.visicast.cmp.uea.ac.uk.