
Computers and Electrical Engineering 112 (2023) 109009

Contents lists available at ScienceDirect

Computers and Electrical Engineering


journal homepage: www.elsevier.com/locate/compeleceng

MIPA-ResGCN: a multi-input part attention enhanced residual graph convolutional framework for sign language recognition

Neelma Naz a,*, Hasan Sajid a, Sara Ali a, Osman Hasan a, Muhammad Khurram Ehsan b

a National University of Sciences and Technology, Islamabad 44000, Pakistan
b Faculty of Engineering Sciences, Bahria University Islamabad Campus, Islamabad 44000, Pakistan

* Corresponding author. E-mail address: neelma.naz@seecs.edu.pk (N. Naz).

https://doi.org/10.1016/j.compeleceng.2023.109009
Received 6 September 2023; Received in revised form 8 October 2023; Accepted 25 October 2023; Available online 4 November 2023
0045-7906/© 2023 Elsevier Ltd. All rights reserved.

A R T I C L E  I N F O

Keywords:
Sign language recognition
Pose sequence modeling
ResGCN
Part attention
Multi input architecture
Visualization

A B S T R A C T

Sign language (SL) is used as the primary mode of communication by individuals who experience deafness and speech disorders. However, SL creates an inordinate communication barrier as most people are not acquainted with it. To solve this problem, many technological solutions using wearable devices, video, and depth cameras have been put forth. The ubiquitous nature of cameras in contemporary devices has resulted in the emergence of sign language recognition (SLR) from video sequences as a viable and unobtrusive substitute. Nonetheless, SLR methods based on visual features, commonly known as appearance-based methods, present notable computational complexities. In response to these challenges, this study introduces an accurate and computationally efficient pose-based approach for SLR. Our proposed approach comprises three key stages: pose extraction, handcrafted feature generation, and feature space mapping and recognition. Initially, an efficient off-the-shelf pose extraction algorithm is employed to extract pose information for the various body parts of a subject captured in a video. Then, a multi-input stream is generated using handcrafted features, i.e., joints, bone lengths, and bone angles. Finally, an efficient and lightweight residual graph convolutional network (ResGCN), along with a novel part attention mechanism, is proposed to encode the body's spatial and temporal information in a compact feature space and recognize the signs performed. In addition to enabling effective learning during model training and offering cutting-edge accuracy, the proposed model significantly reduces computational complexity. Our proposed method is assessed on five challenging SL datasets, WLASL-100, WLASL-300, WLASL-1000, LSA-64, and MINDS-Libras, achieving state-of-the-art (SOTA) accuracies of 83.33 %, 72.90 %, 64.92 %, 100 ± 0 %, and 96.70 ± 1.07 %, respectively. Compared to previous approaches, we achieve superior performance while incurring a lower computational cost.

1. Introduction

Sign languages (SLs) are non-verbal forms of communication used by deaf and speech-impaired people all over the world to communicate with hearing individuals. These languages are largely conveyed through physical movements of the hands and arms, but head, lip, eye, and brow movements are also very helpful. Sign language recognition (SLR) endeavors to translate these visual signals into speech or text, ultimately serving as a medium to establish effective communication between deaf and hearing
people. This, in turn, increases accessibility of resources for the deaf and speech impaired population, providing them with more
opportunities. Automated SLR is an intriguing area of research that calls for knowledge in both computer vision and natural language
processing to effectively comprehend the spatiotemporal linguistic constructions of performed signs. SLR can also play a significant
role for human-computer interaction (HCI) to encourage interaction between people and machines. Isolated SLR (ISLR) and
Continuous SLR (CSLR) are two subcategories of Sign Language Recognition (SLR). While CSLR processes entire utterances comprising
multiple sign glosses for translation, ISLR classifies individual sign records into corresponding gloss categories.
A sign is made up of manual features, i.e., hand shape, palm orientation, and precise hand motions, and non-manual features, i.e., facial expressions and body posture, as shown in Fig. 1. As these features cover small areas of the entire video frame, background clutter can easily distract the network from learning discriminative spatiotemporal features and degrade the model's performance. Additionally, videos have redundancy along the temporal dimension and the motion between adjacent frames is not significant, which makes it difficult for models to learn embeddings by focusing on significant temporal regions. Despite the significant strides made by deep learning in advancing SLR systems, these challenges continue to pose significant obstacles to the development of highly accurate and generalizable SLR systems. Prior studies have demonstrated that the utilization of pre-trained 2-dimensional convolutional neural networks (2D-CNNs) as frame-level spatial feature extractors, along with a subsequent late fusion of the extracted features, can enhance the performance of video classification tasks to a significant extent. However, this approach ignores the temporal dependencies between neighbouring frames and thus leads to poor recognition performance. To accurately capture the temporal information, the idea of feeding high-level spatial features extracted using 2D-CNNs to recurrent neural networks (RNNs) has also been investigated in the literature. To effectively capture low-level and high-level spatial and temporal features, 3D-CNNs have also been employed [1]. However, 3D-CNNs have the drawback of a large computational cost and suffer from optimization problems because of joint time and space modeling. Despite their advantages, SLR methods using visual features (appearance-based methods) are compute-intensive and raise major concerns about the privacy of human signers. To address this issue, some of the latest works [2–6] leverage human pose information obtained using efficient pose extraction algorithms. While models utilizing skeletal data as input are computationally efficient, they provide lower levels of accuracy compared to appearance-based approaches, resulting in their limited usage in real-world applications.
To address the challenges posed by SOTA appearance-based and pose-based methods, we propose a novel three-step framework using a multi-input, attention enhanced graph convolutional network. This approach aims to achieve higher accuracy and computational efficiency in SLR. Our proposed approach incorporates pose information from the human hands and upper body, extracted using an off-the-shelf pose extraction model, as input. The key contributions of our study can be summarized as follows:

• We propose an SLR framework called MIPA-ResGCN (Multi Input, Part Attention Enhanced Residual Graph Convolutional Network), which significantly improves SLR performance by effectively modeling spatiotemporal feature representations.
• In MIPA-ResGCN, to improve SLR accuracy, we enrich the network's understanding of body parts and their connections by developing a multi-input system that encompasses joint and bone information. This, combined with an early fusion scheme, improves accuracy while effectively reducing the computational overhead. The gathered multi-inputs are then channeled into a ResGCN model, which leverages spatiotemporal graph convolutional (ST-GCN) blocks to capture spatial and temporal relationships within the sign sequence.
• To further enhance accuracy, we propose a novel part attention mechanism to eliminate irrelevant information and extract additional discriminative spatiotemporal features by focusing on the most significant body parts within a sign sequence. Additionally, we offer visualizations and explanations of the activated body joints to highlight the effectiveness of our proposed attention module.
• We perform comprehensive evaluations on five challenging sign language datasets: WLASL-100, WLASL-300, WLASL-1000, LSA-64, and MINDS-Libras. The results demonstrate that our proposed architecture exhibits superior performance in accuracy and computational efficiency when compared to other SOTA SLR techniques.

The rest of this article is structured as follows. Section-2 reviews the related works for SLR. Section-3 provides a detailed description of each subcomponent of our proposed SLR method, while Section-4 presents the experimental results, computational efficiency analysis, ablation studies, and visualizations and explanations. Finally, Section-5 concludes the paper.

Fig. 1. Manual and Non-Manual sign language (SL) features.


2. Literature review

Advancements in deep learning architectures, coupled with the availability of high-performance computing resources, have
enabled the development of deep models capable of processing multimodal data for SLR. The field of automatic SLR shares certain
areas of overlap with action recognition [7], leading to a considerable influence of action recognition network designs on methods
proposed for addressing the SLR problem. This section presents a review of prior studies concerning SLR, including the appearance-based and pose-based techniques employed for extracting spatiotemporal dependencies. Additionally, attention mechanisms used to emphasize the most salient information in the context of SLR are also reviewed.

2.1. Sign language recognition using wearable sensors

In general, the two main input modalities considered for SLR are wearable-sensor-based and vision-based, as shown in Fig. 2. The glove-based SL modality utilizes mechanical or optical sensors attached to a glove to capture electrical signals, which are used to detect the positions of the hands. In contrast, vision-based models rely on video data of signers to recognize different signs. In the literature, various wearable-sensor-based modalities have been used to capture the spatiotemporal motion patterns of performed signs. Some wearable systems perform SLR by fusing motion signals obtained using surface EMG (sEMG) sensors and inertial sensors [8]. Another wearable DataGlove device-based system was proposed in [9] for recognizing static signs within Malaysian SL. Nevertheless, these techniques are highly invasive, limit movement, and encroach upon daily activities.

2.2. Appearance based sign language recognition

In recent years, with the emergence of deep learning frameworks, several vision tasks including object detection, image classification, and action recognition [7] have greatly benefitted from 2D-CNNs. The field of SLR has also taken advantage of these advancements, and various vision-based models using deep learning have been proposed. An appearance-based baseline was introduced for SLR [10], wherein spatial features were extracted using a 2D-CNN model and subsequently fed to an LSTM model to learn temporal clues. In [2], spatial features extracted using a 2D-CNN were fed to a GRU network to extract temporal dependencies. A Grammatical Facial Expressions (GFE) recognition system was proposed in [11]. The proposed system robustly tracked and analysed 17 features associated with three categories of non-manual appearance-based features: facial expressions, head gestures, and eye-gaze, and a linear Support Vector Classifier was used to classify the expressions. A dynamic sign language word recognition model using an extreme learning machine (ELM) was proposed in [12].
In recent years, 3D-CNNs have been shown to encode spatial and temporal information effectively and accurately, making them a suitable choice for appearance-based SLR [2]. The field of SLR closely resembles activity recognition, and the SLR research community has greatly benefitted from algorithms proposed for activity recognition. The I3D design employed for SLR in [2,10] was one of many 3D-CNN action recognition architectures quickly adapted for SLR. In [13], a gait energy image (GEI) is used to encode body part motion in a compact feature space, and singular value decomposition (SVD) is combined with support vector machines (SVM) to make final predictions. To learn more complicated motions inside the signing area and to disregard the background of the videos, depth cameras have also been examined as a possible tool for this task. While appearance-based techniques have shown considerable improvement in SLR accuracy, they are often plagued by the high dimensionality of the data, which results in computationally intensive operations.

2.3. Pose based sign language recognition

Due to the reduced dimensionality of human skeletal joints, pose-based SLR is gaining interest among researchers. The pose-based methods are predicated on the idea that the signer's body, hands, and, in certain cases, face may provide sufficient information to identify the performed sign. Two pose-based SLR baselines using a gated recurrent unit (GRU) and a temporal graph convolutional network (TGCN) have been proposed in [2]. For each video frame, the data of 55 body and hand key points are concatenated into an input feature and fed to a two-layer stacked GRU in the first baseline, whereas in the second baseline the same input feature is fed to a TGCN network. In [3], the problem of SLR is addressed by using two separate networks for extracting spatial and temporal features: spatial features are extracted using GCNs, temporal features are extracted using a BERT model, and final predictions are generated using a late fusion scheme. In [4], a pose-based transformer architecture is investigated to address SLR. A modified GRU is proposed in [5] to encode spatiotemporal relationships and is tested with pose-based SL data. In [6], a spatiotemporal graph convolutional network is proposed for SLR. While pose-based SLR approaches significantly reduce computational complexity, they are currently less accurate than appearance-based methods.

Fig. 2. Data Modalities used for sign language recognition.

2.4. Attention mechanisms

Attention plays a vital role in the way humans perceive information. Concentration on task-critical discriminative information is a hallmark of the attention process. The inclusion of attention mechanisms, encompassing spatial, temporal, channel, and self-attention, can assist models in emphasizing the most significant information and consequently enhance the model's efficacy. Numerous attention schemes have been proposed for SLR. In [14], a self-attention mechanism has been proposed for efficient aggregation of hand features with their appropriate spatiotemporal context to effectively recognize sign language. Transformer-based encoder-decoder structures with channel-wise self-attention and multichannel attention were proposed in [15] for sign language recognition and translation. However, all these approaches use channel, spatial, joint, and frame attention [16], whereas our work introduces a novel part attention module which allows the model to learn distinctive features by focusing on the most relevant body regions.

2.5. Sign language datasets

There are more than 300 sign languages used by individuals with hearing and speech disabilities. A variety of publicly accessible
datasets are available for SLR. These datasets vary based on regional sign languages, continuous or isolated sign languages, data sizes,
signer counts, data collection methods, and signer dependencies. Table 1 lists the most recent and relevant visual isolated SL datasets. For each dataset, six attributes are specified: year, dataset name, country, number of sign classes, number of signers, and total number of video samples. Although these datasets target various sign languages, American Sign Language (ASL) has garnered
increasing attention owing to its popularity and usage. Expanding the sign classes is desirable to enhance the overall generalization of
the proposed methodologies for practical applications. We have tested our model on WLASL-100, WLASL-300, WLASL-1000, LSA-64,
and MINDS-Libras datasets.

3. Methodology

In this section, we provide details of our proposed pipeline and the individual components of the proposed architecture. As shown in Fig. 3, our pipeline consists of three stages. The first stage involves extracting hand and body pose information from the RGB video sequence. The second stage deals with data pre-processing and frame sampling. After pre-processing, multi-inputs consisting of joint and bone information are created and forwarded to the MIPA-ResGCN architecture for spatiotemporal feature extraction. Finally, a class label is predicted for the provided sequence. Further details of the proposed approach are provided in the subsequent sections.

3.1. Stage 1: Pose extraction

In the past, several techniques have been put forth to determine a human pose from RGB photos or video sequences [19,20]. It is
crucial to have a reliable pose estimator because SLR is reliant on hand shapes and locations. The proposed approach utilizes an
open-source framework called MediaPipe Holistic [19] which uses a hybrid architecture to construct pipelines for processing
perceptual data, such as images and videos. To estimate the pose of the face, body, and hands regions for every frame of the input
video, the MediaPipe Holistic incorporates three distinct models: MediaPipe holistic face landmarks, pose landmarks, and hand
landmarks detector. In our study, we utilized the hands and body pose information extracted using these submodules.
Thus, for an input video x_i ∈ R^{T×H×W×C}, where T, H, W, and C denote the number of frames, height, width, and number of channels of each frame respectively, the extracted pose has dimensions x_i ∈ R^{C×T×V}, where T, V, and C now denote the number of frames, the number of joints per frame, and the number of features per joint respectively.
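The sketch below illustrates this stage, assuming MediaPipe Holistic (mp.solutions.holistic) and OpenCV for frame decoding. The index range used to keep 23 upper-body/face pose landmarks and the placement of the two hands in the output array are illustrative assumptions, not the exact index list used in our implementation.

import cv2
import numpy as np
import mediapipe as mp

UPPER_BODY_IDS = list(range(23))   # assumption: first 23 pose landmarks (face/arm/torso subset)

def extract_pose(video_path: str) -> np.ndarray:
    """Return joint coordinates of shape (C=2, T, V=65) for one video."""
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        res = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        joints = np.zeros((65, 2), dtype=np.float32)        # undetected joints stay at zero
        if res.pose_landmarks:
            for k, idx in enumerate(UPPER_BODY_IDS):
                lm = res.pose_landmarks.landmark[idx]
                joints[k] = (lm.x, lm.y)                     # z (depth) is dropped as noisy
        for offset, hand in ((23, res.left_hand_landmarks), (44, res.right_hand_landmarks)):
            if hand:                                         # 21 landmarks per detected hand
                for k, lm in enumerate(hand.landmark):
                    joints[offset + k] = (lm.x, lm.y)
        frames.append(joints)
    cap.release()
    holistic.close()
    return np.stack(frames).transpose(2, 0, 1)               # (C, T, V)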

Table 1
Sign language datasets consisting of RGB videos.

Year  Dataset            Country    Classes  Signers  Samples
2016  LSA-64 [17]        Argentina  64       10       3200
2019  MS-ASL [10]        USA        1000     222      25,513
2020  WLASL [2]          USA        2000     119      21,803
2021  MINDS-Libras [18]  Brazil     20       12       1155

Fig. 3. A complete overview of the proposed approach.

3.2. Stage 2: Data preprocessing

Out of the 75 landmarks generated by this model, we only use data for 65 landmarks. This set comprises 23 landmarks covering the left and right arms, the torso, and significant face nodes (the lips, eyes, ears, and nose), plus 21 landmarks for each hand (4 landmarks for each finger and one for the wrist). Since lower body joints do not play a significant role in sign language recognition, they were discarded. Although MediaPipe provides 3D landmarks for each joint, the depth coordinates represented by the z dimension are not very accurate [19] and introduce noise. Hence, in our approach, we have exclusively utilized the 2D coordinates (x and y) of each joint. The MediaPipe Holistic pose estimation technique provides landmark coordinates normalized to [0, 1] with respect to the image width and height. To maintain a consistent scale, we shift these coordinates by [−0.5, −0.5] and multiply them by 2, which brings the coordinates close to zero mean and unit scale. A noise transform is used to augment the data by creating a perturbed copy of each input sequence. As input video lengths may vary, we sample a fixed sequence of 64 frames from each input video via a random-start sampling strategy. Thus, the pose sequence x_i forwarded as input to the model has dimensions x_i ∈ R^{C×T×V}.
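A minimal sketch of this pre-processing is shown below, assuming pose arrays shaped (C, T, V) with coordinates already in [0, 1]; zero-padding of clips shorter than 64 frames is an assumption, since the text only specifies the random-start sampling for longer clips.

import numpy as np

def preprocess(pose: np.ndarray, target_len: int = 64) -> np.ndarray:
    """Re-scale coordinates and sample a fixed-length clip; pose has shape (C, T, V)."""
    pose = (pose - 0.5) * 2.0                                 # shift by [-0.5, -0.5], scale by 2
    C, T, V = pose.shape
    if T >= target_len:
        start = np.random.randint(0, T - target_len + 1)      # random-start sampling
        pose = pose[:, start:start + target_len]
    else:                                                     # assumption: zero-pad short clips
        pad = np.zeros((C, target_len - T, V), dtype=pose.dtype)
        pose = np.concatenate([pose, pad], axis=1)
    return pose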

3.3. Stage 3: Proposed architecture

This section provides a detailed exposition of our proposed architecture, MIPA-ResGCN. We build on our previous work SIGNGRAPH [6], in which we introduced the ResGCN [21] approach for skeleton-based sign language recognition. The implementation of SIGNGRAPH resulted in a significant performance improvement for pose-based SLR. In this work, we have extended SIGNGRAPH [6] by introducing a multi-input architecture and an efficient part attention mechanism. To exhibit the efficacy of our model in learning the most distinctive features, we have also introduced the class activation map technique [22] to compute the activation of individual joints while performing a sign. In this section, graph convolution (GCN), the fundamental component of our architecture, is introduced first, followed by the details of the MIPA-ResGCN architecture.

3.3.1. Graph convolution


Firstly, we represent the human pose information as an undirected graph G = (V, Ɛ), where V = {v1, v2, …, vn} is the set of n nodes representing body and hand joints and Ɛ = {e1, e2, …, em} is the set of m edges representing the bones connecting these joints. The relationships between nodes and edges are modeled by an adjacency matrix A ∈ R^{n×n}, whose entry A_ij is 1 if node i is connected to node j and 0 otherwise. Each node in this graph has two channels representing the x and y coordinates. In a pose sequence, the spatial graph convolution (S-GCN) for each frame can be defined as Eq. (1):

    X_t^{(l+1)} = Σ_{d=0}^{D} M_d^{(l)} ⊗ ( X_t^{(l)} D_d^{-1/2} (A_d + I) D_d^{-1/2} θ_d^{(l)} )        (1)

where D is the maximum graph distance and is set at 2, X_t^{(l)} and X_t^{(l+1)} are the input and output features for frame t at layer l, ⊗ represents element-wise multiplication, A_d is the adjacency matrix of order d, I is the identity matrix modeling self-loops, D_d is the degree matrix used to normalize A_d, and M_d^{(l)} and θ_d^{(l)} are learnable weights. An L × 1 temporal convolutional layer (TCN) is used to collect the contextual cues embedded in adjacent frames for the purpose of extracting temporal characteristics, where L is a hyperparameter representing the temporal window size. These S-GCN and 2D-TCN layers are used to construct the basic and bottleneck blocks of the ResGCN architecture.
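The following is a minimal PyTorch sketch of the S-GCN in Eq. (1), under the common realization in which the normalized adjacencies D_d^{-1/2}(A_d + I)D_d^{-1/2} are precomputed and stacked into a tensor A, M_d is a learnable element-wise mask on the adjacency, and θ_d is implemented with 1 × 1 convolutions; this is an illustrative reading of Eq. (1), not the exact code of [21].

import torch
import torch.nn as nn

class SpatialGCN(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, A: torch.Tensor):
        super().__init__()
        # A: precomputed normalized adjacencies, shape (D+1, V, V)
        self.register_buffer("A", A)
        self.mask = nn.Parameter(torch.ones_like(A))            # M_d: learnable edge importance
        self.theta = nn.Conv2d(in_ch, out_ch * A.size(0), 1)    # theta_d realized as 1x1 convs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, V)
        n, _, t, v = x.shape
        x = self.theta(x).view(n, self.A.size(0), -1, t, v)     # split channels per distance d
        # sum over d and over neighbouring joints, as in Eq. (1)
        return torch.einsum("ndctv,dvw->nctw", x, self.A * self.mask)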

3.3.2. ResGCN architecture


To address the problem of SLR, we have used the ResGCN [21] architecture as the baseline method. The spatiotemporal graph convolution (ST-GCN) network serves as the foundation for constructing the basic and bottleneck blocks of this architecture.
Basic Block: The basic block consists of a spatial basic block connected in series with a temporal basic block. The spatial basic block is constructed using a spatial GCN (S-GCN) layer followed by a batch normalization (BN) and rectified linear unit (ReLU) activation layer, whereas the temporal basic block is composed of a 2D-TCN layer followed by a BN and ReLU activation layer. Fig. 4 shows the implementation of the Basic Block in detail.


Bottleneck Block: The ResGCN architecture incorporates a bottleneck structure that enables a faster implementation of the model for both training and inference. The bottleneck block is constructed using a spatial bottleneck block connected in series with a temporal bottleneck block. Implementing the bottleneck with a reduction rate R reduces the number of feature channels. The spatial bottleneck block is constructed by adding two 1 × 1 convolutional layers before and after a graph convolutional layer: the first 1 × 1 convolutional layer reduces the number of channels to (input channels / R) and forwards its output to the graph convolutional layer, and the second 1 × 1 convolutional layer increases the number of channels to (GCN layer output channels × R). The temporal bottleneck follows the same structure, with the difference that the GCN layer is replaced by a temporal 2D convolutional layer. Each bottleneck block uses a block residual mechanism to feed the input of the block to its output. The detailed implementation of the Bottleneck Block is presented in Fig. 5.
The spatial and temporal kernel sizes are kept as 3 and 9 respectively, and the reduction rate R is set to 4.
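A minimal sketch of the bottleneck block is given below, reusing the SpatialGCN module from the previous sketch; the placement of the reduction (mid = out_ch // R) and the omission of the 1 × 1 layers inside the temporal part are simplifying assumptions made for brevity.

import torch.nn as nn

def temporal_unit(ch: int, kernel: int = 9) -> nn.Sequential:
    # 2D-TCN: L x 1 convolution over the temporal axis, followed by BN and ReLU
    return nn.Sequential(
        nn.Conv2d(ch, ch, kernel_size=(kernel, 1), padding=((kernel - 1) // 2, 0)),
        nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

class BottleneckBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, A, reduction: int = 4):
        super().__init__()
        mid = out_ch // reduction                               # assumed placement of R
        self.spatial = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            SpatialGCN(mid, mid, A),                            # from the previous sketch
            nn.Conv2d(mid, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.temporal = temporal_unit(out_ch)
        self.residual = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        # block residual: the block input is added to the block output
        return self.temporal(self.spatial(x)) + self.residual(x)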
Multi-Branch Input: Most multi-branch networks operate by independently feeding data from various modalities to the same model and subsequently merging the outcomes of these streams to form the final decision. Although this approach is effective for data augmentation and enhances model performance, it leads to high computational expense and difficulties in hyperparameter tuning for large datasets. Thus, we have extracted features from both input branches using bottleneck blocks and concatenated them at an early stage of our model, as presented in Fig. 8. The concatenated features are then fed to one main branch to extract discriminative features. The pose-based action recognition framework in [21] divides the input features into three categories: joints, velocities, and bones. Velocity does not play a significant role in SLR, as a sign can be performed faster or slower without its meaning changing. Considering this, we have divided the input features into two categories: 1) joint positions and 2) bone features (bone lengths and angles). Suppose the original 2D coordinate set of a sign sequence, X_t ∈ R^{C×T×V}, where C, T, and V represent coordinates, frames, and nodes, is extracted using the pose extractor. The relative position r of each joint is calculated with respect to the center node c of the pose using Eq. (2):

    r = { v_{t,i} − v_{t,c} | i = 1, 2, …, V, t < T },   s.t.   v_{t,c} = mean(v_{t,rightshoulder}, v_{t,leftshoulder})        (2)

The original nodes and these relative positions are concatenated and sent as the joint position input to the first branch. Next, bone features consisting of bone lengths and bone angles are computed. The bone length l is computed by subtracting each joint v_{t,i} from its adjacent joint v_{t,adj} as in Eq. (3):

    l = { v_{t,i} − v_{t,adj} | i = 1, 2, …, V, t < T }        (3)

Finally, the bone angle is computed as in Eq. (4):

    α = { arccos( (v_{t,i} − v_{t,adj}) / √( Σ (v_{t,i} − v_{t,adj})² ) ) | i = 1, 2, …, V, t < T }        (4)

where the sum in the denominator runs over the coordinate channels, so that α gives the direction cosines of each bone.
These bone features are concatenated together and sent as input to the second branch.
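A minimal sketch of how the two input branches can be built from Eqs. (2)–(4) is shown below, assuming poses shaped (C = 2, T, V), a bone list pairing each joint with one adjacent joint, and shoulder indices passed in by the caller; the epsilon in the denominator is added only for numerical safety.

import numpy as np

def build_inputs(pose: np.ndarray, bones: list, r_shoulder: int, l_shoulder: int):
    """Return the joint branch and bone branch, each shaped (2C, T, V)."""
    C, T, V = pose.shape
    centre = 0.5 * (pose[:, :, r_shoulder] + pose[:, :, l_shoulder])   # v_{t,c} in Eq. (2)
    rel = pose - centre[:, :, None]                                    # relative positions r
    joint_branch = np.concatenate([pose, rel], axis=0)

    length = np.zeros_like(pose)                                       # bone vectors l, Eq. (3)
    for child, parent in bones:                                        # bones: [(joint, adjacent joint), ...]
        length[:, :, child] = pose[:, :, child] - pose[:, :, parent]
    norm = np.linalg.norm(length, axis=0, keepdims=True) + 1e-6        # numerical safety
    angle = np.arccos(np.clip(length / norm, -1.0, 1.0))               # bone angles alpha, Eq. (4)
    bone_branch = np.concatenate([length, angle], axis=0)
    return joint_branch, bone_branch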
Part Attention Module: Inspired by split attention model in [23], a part based attention module has been designed to capture the
significance of each body part throughout the entire sign sequence. We have manually divided input skeleton’s joints into five indi­
vidual body parts P = 5: face, left arm, right arm, left hand, and right hand, based upon each part’s corresponding joints as shown in
Fig. 6.
To compute part attentions, the first step involves applying average pooling in the temporal dimension to the entire skeleton. The
feature maps obtained are subsequently passed through a 2D convolutional layer, succeeded by a BN and ReLU Layer. Afterward, the
whole skeleton is divided into five body parts (previously mentioned) based upon the contributing joints of each part and attention
matrices are calculated using five 2D convolutional layers (one corresponding to each part), and a part-level SoftMax is used to identify
the most essential part. A final skeleton representation is formed by concatenating features of five parts with learnt attention weights.
The complete structure of part attention module is presented in Fig. 7.
Fig. 4. Basic Block implementation structure.

Fig. 5. Bottleneck Block implementation structure.

Fig. 6. An illustration depicting manually designed body parts.

Fig. 7. The structure of the proposed part attention block, where R (reduction rate) = 4 and C is the number of channels.

The attention block shown in Fig. 7 can be mathematically formulated as in Eqs. (5.a) and (5.b):

    x_p = γ( δ( p(x_in) θ ) θ_p )        (5.a)

    x_out = x_in ⊗ Concat( { x_p | p = 1, 2, …, P } )        (5.b)

where x_in and x_out represent the input and output feature maps, and γ(·), δ(·), and p(·) denote the SoftMax, ReLU, and temporal average pooling operations respectively. Here θ ∈ R^{C×(C/R)} and θ_p ∈ R^{(C/R)×C} are learnable parameters of the convolutional layers, ⊗ represents element-wise multiplication, C is the total number of channels in the layer, and R is the reduction rate, which is set to 4 after extensive experimentation. This choice strikes an ideal balance between accuracy and computational complexity: increasing R to 8 enhances computational efficiency but significantly lowers recognition accuracy, whereas decreasing R to 2 improves accuracy but leads to more parameters. Fig. 8 provides a complete overview of the proposed MIPA-ResGCN architecture, consisting of basic and bottleneck blocks and the part attention mechanism. The numbers presented on each block show the output dimensions of the feature maps after that block.
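A minimal PyTorch sketch of the part attention block in Eqs. (5.a)–(5.b) follows; the five part index lists stand in for the manual split of Fig. 6, and averaging each part's pooled features before its convolution θ_p, as well as broadcasting a part's weight over all of its joints, are implementation assumptions.

import torch
import torch.nn as nn

class PartAttention(nn.Module):
    def __init__(self, channels: int, parts, reduction: int = 4):
        super().__init__()
        self.parts = parts                                   # e.g. five lists of joint indices
        mid = channels // reduction
        self.squeeze = nn.Sequential(                        # theta, BN, ReLU
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.excite = nn.ModuleList(nn.Conv2d(mid, channels, 1) for _ in parts)  # theta_p
        self.softmax = nn.Softmax(dim=-1)                    # part-level SoftMax (gamma)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, V)
        pooled = x.mean(dim=2, keepdim=True)                 # p(.): temporal average pooling
        feat = self.squeeze(pooled)                          # (N, C/R, 1, V)
        scores = []
        for conv, idx in zip(self.excite, self.parts):
            part_feat = feat[:, :, :, idx].mean(dim=-1, keepdim=True)   # pool over part joints
            scores.append(conv(part_feat))                   # (N, C, 1, 1) per part
        weights = self.softmax(torch.stack(scores, dim=-1))  # softmax across the P parts
        att = torch.zeros_like(pooled)                       # (N, C, 1, V)
        for p, idx in enumerate(self.parts):
            att[:, :, :, idx] = weights[..., p]              # broadcast part weight to its joints
        return x * att                                       # Eq. (5.b): re-weight the input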

4. Results and discussions

In this section, we evaluate our proposed architecture on five publicly available SL datasets: WLASL-100, WLASL-300, and WLASL-1000 [2], LSA-64 [17], and MINDS-Libras [18], using four types of evaluation metrics: accuracy, precision, recall, and F1-score. We compare our results with SOTA approaches based on appearance and pose features. Additionally, we conduct ablation studies to explicate the individual contribution of each component of the proposed architecture to the overall performance.

Fig. 8. Complete overview of the Multi Input Part Attention enhanced ResGCN (MIPA-ResGCN) architecture.

4.1. Experimental setup

For our experiments, we have represented the human pose sequence as a graph, based on the spatial configuration described in [21]. The experiments are conducted on a single NVIDIA RTX 3080 GPU with the PyTorch framework. The model is trained for 350 epochs for WLASL-100, LSA-64, and MINDS-Libras, and for 600 epochs for WLASL-300 and WLASL-1000. The embedding size is set to 128 for WLASL-100, LSA-64, and MINDS-Libras, and to 300 and 512 for WLASL-300 and WLASL-1000 respectively. The model is trained using the Adam optimizer [24] and a cyclic learning rate scheduler with a learning rate of 0.01. The batch size is kept at 32. The temporal kernel size L, maximum graph distance D, and spatial kernel size are set to 9, 2, and 3 respectively. For model optimization, the cross-entropy loss given by Eq. (6) is used as the objective function:

    Cross-Entropy Loss = − Σ_{i=1}^{C} y_i log ŷ_i        (6)

where ŷ_i is the Softmax probability for the ith class and C represents the total number of classes.
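A minimal sketch of the training loop implied by this setup is given below (Adam, a cyclic learning rate scheduler around 0.01, batch size 32, cross-entropy loss). The model and train_loader are placeholders for the proposed network and dataset, and the base_lr of the cyclic schedule is an assumption since only the peak rate is stated.

import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, epochs: int = 350) -> None:
    """Train with the setup described above: Adam, cyclic LR around 0.01, cross-entropy."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.CyclicLR(
        optimizer, base_lr=1e-4, max_lr=0.01,     # base_lr is an assumption; only the peak is stated
        cycle_momentum=False)                     # required when pairing CyclicLR with Adam
    criterion = nn.CrossEntropyLoss()             # applies Softmax + Eq. (6) to the model logits

    for _ in range(epochs):
        for poses, labels in train_loader:        # poses: (32, C, T, V), labels: (32,)
            optimizer.zero_grad()
            loss = criterion(model(poses), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()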

4.2. Results on WLASL dataset

Dataset Description: WLASL (word-level ASL) is a recent and comprehensive dataset compiled from various online open sources, featuring a diverse range of signers, lighting conditions, and background variations. The dataset is divided into four subsets: WLASL-100, WLASL-300, WLASL-1000, and WLASL-2000, where the number in each subset's name corresponds to the number of sign glosses it contains. Our study employs the same training, validation, and testing protocols as specified by the authors of the dataset [2].

Table 2
Performance Comparison of proposed architecture with SOTA methods on WLASL-100, WLASL-300, WLASL-1000. Note: (T-1: top-1, T-5: top-5, T-10: top-10 accuracy in %).

Data Type          Model                WLASL-100              WLASL-300              WLASL-1000
                                        T-1    T-5    T-10     T-1    T-5    T-10     T-1    T-5    T-10
Appearance Based   I3D [2]              65.89  84.11  89.92    56.14  79.94  86.98    47.33  76.44  84.33
                   TK-3D Convnet [1]    77.55  91.42  –        68.75  89.41  –        –      –      –
                   Fusion 3 [25]        75.67  86.00  90.16    68.30  83.19  86.22    56.68  79.85  84.71
Pose Based         Pose-GRU [2]         46.51  76.74  85.66    33.68  64.37  76.05    30.01  58.42  70.15
                   Pose-TGCN [2]        55.43  78.68  87.60    38.32  67.51  79.64    34.86  61.73  71.91
                   GCN-BERT [3]         60.15  83.98  88.67    42.18  71.71  80.93    –      –      –
                   MOPGRU [5]           63.18  –      –        –      –      –        –      –      –
                   SPOTER [4]           63.18  –      –        43.78  –      –        –      –      –
                   SIGNGRAPH [6]        72.09  88.76  92.64    71.40  92.26  94.16    61.83  85.87  91.04
                   MIPA-ResGCN (Ours)   83.33  92.64  95.35    72.90  88.92  93.41    64.92  88.37  92.16
Comparison with SOTA methods: This study exhibits the top-1, top-5, and top-10 accuracies attained by our proposed architecture
on the WLASL-100, WLASL-300, and WLASL-1000 datasets. A comparative analysis between the proposed method and SOTA
appearance-based and pose-based approaches is provided in Table 2.
Vs. Pose based: As presented in Table 2, MIPA-ResGCN obtains excellent top-1 accuracies of 83.33 %, 72.90 %, and 64.92 %, thereby outperforming the SOTA pose-based method by 11.24 %, 1.50 %, and 3.09 % on WLASL-100, WLASL-300, and WLASL-1000 respectively.
Vs. Appearance based: The proposed method exhibits superior performance compared to the SOTA appearance-based method by
5.78 %, 4.15 %, and 8.24 % for WLASL-100, WLASL-300, and WLASL-1000 respectively. In Fig. 9, we present a confusion matrix
illustrating the per-class accuracy for the WLASL-100 dataset. To delve deeper into our proposed framework’s performance, we have
also calculated macro average and weighted average precision, recall, and F1-score for the dataset, as presented in Table 3. These
metrics provide a comprehensive overview of sign recognition accuracy for the WLASL-100 dataset.
The results presented in this section clearly demonstrate the excellent feature-learning capabilities of our proposed architecture, which can be attributed to several key factors. Firstly, our handcrafted features, including joints, bone lengths, and bone angles, serve as fine-grained and precise inputs, effectively representing hand, upper body, and face positions and motions. Secondly, processing this multi-input data using a multi-stream architecture, coupled with an efficient residual graph convolutional network, enhances the model's capacity to encode spatial and temporal cues more efficiently. Finally, the designed part attention module increases robustness to noise in joint data and imprecise skeleton information, while focusing on the most critical body parts (e.g., hands). These enhancements significantly improve recognition accuracy compared to other state-of-the-art (SOTA) methods.

4.3. Results on LSA-64 dataset

Dataset Description: The LSA-64 dataset targets a vocabulary of 64 different glosses from Argentinian sign language. It comprises 50 video samples per class, with each class signed by 10 non-expert signers. The 64 glosses include both verbs and nouns used in Argentinian sign language. The following setup is used for model training and evaluation. The data is divided into an 80:20 ratio for training and testing, as per the protocol used by the dataset authors. To determine the optimal model parameters, k-fold cross-validation is employed on the training set with k = 4. The results are reported as an average of five repetitions.

Fig. 9. Confusion matrix on WLASL-100 dataset.


Table 3
Precision, Recall, and F1-Score for WLASL-100, WLASL-300, and WLASL-1000 dataset.
                   WLASL-100                      WLASL-300                      WLASL-1000
                   Precision  Recall  F1-Score    Precision  Recall  F1-Score    Precision  Recall  F1-Score
Macro Average      0.85       0.83    0.82        0.75       0.73    0.71        0.63       0.64    0.61
Weighted Average   0.85       0.83    0.82        0.75       0.72    0.70        0.64       0.64    0.61

Comparison with SOTA methods: We report the top-1 accuracy of our architecture on the LSA-64 dataset, where it achieves SOTA performance with 100 ± 0 % accuracy. Table 4 presents a comparison of our method with the current SOTA approaches using appearance, pose, and hybrid modalities. The results marked with * are obtained as an average of 5 iterations. In Fig. 10(a), we present a confusion matrix illustrating the per-class accuracy for the LSA-64 dataset, clearly demonstrating the excellent performance of our proposed architecture towards sign recognition.

4.4. Results on MINDS-libras dataset

Dataset Description: The MINDS-Libras [18] dataset targets a vocabulary of 20 different glosses from Brazilian sign language. The dataset comprises 60 video recordings per category, performed by 12 distinct signers and recorded in a controlled environment using a Canon EOS Rebel T5i DSLR camera and a Microsoft Kinect v2 to capture RGB and RGB-D sequences. For our work we use only the RGB sequences to extract pose information. The setup proposed by [13] is used for model training and evaluation. The data is divided into a 75:25 ratio for training and testing, as per the protocol used by the dataset authors. A k-fold cross-validation is employed on the training set with k = 3 to find the best model parameters. The results are presented as an average of ten repetitions.
Comparison with SOTA methods: We present the top-1 accuracy achieved by our architecture on the MINDS-Libras dataset, demonstrating state-of-the-art performance with an accuracy of 96.70 ± 1.07 %. Table 5 offers a comprehensive comparison of our methodology with current state-of-the-art approaches. The macro average and weighted average precision, recall, and F1-score for the dataset are presented in Table 6. Additionally, Fig. 10(b) provides a detailed visual representation of the confusion matrix for each sign within the dataset. Our proposed methodology achieves a minimum of 83 % on all of the metrics used (precision, recall, and F1-score) when analyzed on a per-class basis, which can be considered very good model performance.

4.5. Performance analysis under various operating conditions

Incomplete Input Data: We employed an off-the-shelf pose extraction algorithm to capture joint data encompassing both hands, the upper body, and facial points. In certain scenarios, e.g., when encountering blurred frames, the pose estimator may miss specific landmarks, leading to incomplete input data. To address these instances and maintain a consistent fixed dimension for our model's input, any missed landmarks are populated with null values. Despite this, our model is able to learn effective spatiotemporal features by building inter-frame and intra-frame relationships, and therefore yields accurate results even in such scenarios, as shown in Fig. 11.
Data With Noise: Pose estimators often contend with the challenge of noisy data generation. To address this issue, we introduce zero-mean Gaussian noise with a standard deviation of 0.01 to the estimated key points during training and compute the model's performance on noisy and noiseless test sets. This approach aims to mitigate the impact of inaccuracies in the pose estimation process, leading to more robust and reliable recognition results. The results shown in Table 7 under various operating conditions demonstrate the robustness of our proposed framework.
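A minimal sketch of this augmentation is shown below: zero-mean Gaussian noise with σ = 0.01 added to every estimated key point of a pose sequence.

import numpy as np

def add_joint_noise(pose: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Return a copy of the pose sequence with zero-mean Gaussian noise on every joint."""
    return pose + np.random.normal(0.0, sigma, size=pose.shape).astype(pose.dtype)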
Data With Class Unbalance: In the context of our research, we have primarily leveraged publicly available datasets that inherently
maintain a balanced distribution of class data. However, for a comprehensive assessment of our model’s performance under a more
realistic scenario, we have engineered a class-imbalanced dataset by means of random oversampling of minority classes and concurrent random undersampling of majority classes. Within our training set, we have introduced class imbalance at a defined ratio of 1:4.
Subsequently, we have performed evaluation on the test set. The empirical findings, as presented in Table 8, show the robustness of our
model in this context, with achieved accuracies remaining comparable across both balanced and imbalanced class distributions.

Table 4
Performance Comparison of proposed architecture with SOTA methods on LSA-64 dataset.
Data Type           Model                Top-1 Accuracy
Appearance Based    LSTM+LDS [26]        98.09 ± 0.59 *
                    DeepSign CNN [27]    96.00
                    MEMP [28]            99.06
                    I3D                  98.91
Appearance + Pose   LSTM+DSC [29]        99.84 ± 0.19 *
                    ELM+MN CNN [12]      97.81
Pose Only           SPOTER [4]           100.00 ± 0 *
                    MOPGRU [5]           99.92
                    MIPA-ResGCN (ours)   100.00 ± 0 *


Fig. 10. Confusion matrix (a). LSA-64 dataset, (b). MINDS-Libras dataset. (The heatmap color scheme on far right is applicable to both matrices).

Table 5
Accuracy comparison with SOTA methods on MINDS-Libras dataset.
Data Type           Model                  Top-1 Accuracy
Appearance Based    CNN3D [26]             72.6
                    CNN 3D [27]            93.3 ± 1.69
                    GEI + SVD + SVM [13]   84.66 ± 1.78
Pose Based          MIPA-ResGCN (ours)     96.70 ± 1.07 %

Table 6
Precision, Recall, and F1-Score for MINDS-Libras Dataset.
                   Precision  Recall  F1-Score
Macro Average      0.98       0.97    0.97
Weighted Average   0.98       0.96    0.96

Fig. 11. Performance analysis under incomplete input data (Missing hands or Body joints).


Table 7
Performance Analysis for Noisy Data (Normal means no noise added).

Train              Test               Test Accuracy (%)
Normal             Normal             83.33
Normal             With Joints Noise  81.25
With Joints Noise  Normal             82.40
With Joints Noise  With Joints Noise  81.90

4.6. Failure cases

It is evident from Fig. 9 that our model shows excellent recognition performance across almost the entire WLASL-100 dataset; however, there are three signs, 'same', 'pizza', and 'school', which our model finds difficult to recognize. All instances of the sign "School" are recognized as "Paper" by the model. Upon investigation, it was observed that both signs are performed in the same way, as shown in Fig. 12(a), leading to difficulties in accurate recognition by the model. Moreover, significant variability in the execution of the signs "1. Same" and "2. Pizza" by various signers was observed, as shown in Fig. 12(b). These ambiguities in the dataset, 1) different words signed in the same way and 2) the same word signed differently by different signers, lead the model to inaccurate gloss predictions.

4.7. Computational and generalized performance analysis

In this study, we conducted a comparison of our proposed architecture (MIPA-ResGCN) with I3D (an appearance-based method) and SPOTER and SIGNGRAPH (pose-based methods) to evaluate computational efficiency and the model's generalization performance. To begin, we compared the number of model parameters: MIPA-ResGCN has 0.99 million parameters, while I3D has 12.4 million and SPOTER has 5.92 million. Next, we evaluated the computational complexity of each model by quantifying the number of floating-point operations and the average time taken to process each video during the inference stage, utilizing the FLOPs profiler feature of the DeepSpeed library [30]. The results shown in Fig. 13 demonstrate that our model has a much smaller number of parameters and a lower inference time than I3D and SPOTER, and compute performance comparable to SIGNGRAPH while improving accuracy by a large margin. The number of GFLOPs required is much smaller than for I3D and comparable to SPOTER and SIGNGRAPH.
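A minimal sketch of how such measurements can be taken is shown below, assuming DeepSpeed's FLOPs profiler entry point (get_model_profile) and simple wall-clock timing; the call signature follows the profiler's documented usage and may need adjusting for a given DeepSpeed version.

import time
import torch
from deepspeed.profiling.flops_profiler import get_model_profile

def profile_model(model: torch.nn.Module, sample: torch.Tensor, runs: int = 50):
    """Return FLOPs, parameter count, and average per-video inference time."""
    flops, macs, params = get_model_profile(model, args=[sample],
                                            print_profile=False, as_string=True)
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):                     # average latency over several forward passes
            model(sample)
        avg_time = (time.perf_counter() - start) / runs
    return flops, params, avg_time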
To assess the generalizability and robustness of our proposed model, we conducted a series of experiments wherein we trained the
model using smaller subsets of the training data and evaluated its performance on a fixed test set. These experiments were conducted
on the LSA-64 data, which was partitioned into training and test sets at an 80:20 ratio. MIPA-ResGCN, I3D, SIGNGRAPH and SPOTER
models were trained with different splits of training data. Training was conducted using subsets of the training data ranging from 10 %
to the complete dataset, with 20 % more data being added at each stage. To ensure uniform class distributions, training subsets were
selected using a uniform sampling method. The resulting models were then evaluated on the fixed test set. The results, shown in
Fig. 14, indicate that MIPA-ResGCN achieved an accuracy of 92 % on the test set when trained on just a 10 % split of the training data, while the SPOTER, SIGNGRAPH, and I3D models achieved accuracies of 88.68 %, 75 %, and 45.47 % respectively. The performance of MIPA-ResGCN continued to improve as the size of the training set increased, ultimately reaching an accuracy of 100 % when trained on 50 % of the training data. SIGNGRAPH and SPOTER achieved 100 % accuracy at the 70 % and 90 % splits of the training data respectively, and I3D achieved its highest accuracy of 98.91 % when trained on 100 % of the data. The results of the conducted experiments demonstrate that MIPA-ResGCN performs significantly better than SOTA SLR models even when trained with smaller data sizes. The underlying rationale
behind this behavior can be explained as follows: appearance-based models such as I3D demand a comprehensive understanding of
general concepts like human body movements for the interpretation of sign language. When trained on smaller datasets, the learning
process becomes more challenging, leading to noticeable decline in the model’s performance. Conversely, pose-based methods
leverage body and hand pose information, inherently encompassing human body mechanics, which aids in effective decoding.
However, SPOTER and SIGNGRAPH (pose-based methods) treat the entire pose as a single unit, increasing the time and data required for sign language decoding. In contrast, our proposed architecture employs a multi-input part attention structure that provides the model with more refined pose information and an attention-enhanced learning capability, which in turn facilitates rapid and accurate feature extraction, resulting in more generalized representations of signs.

4.8. Ablation studies

This section entails ablation studies that examine the impact of the various components introduced on top of the baseline ResGCN model on its overall performance. The proposed framework uses ResGCN with a reduction rate of R = 4 as the baseline model. The ResGCN model includes one basic and six bottleneck blocks followed by an average pooling layer and two fully connected layers. We have introduced two input branches consisting of joint and bone information, as explained in Section-3. To analyze the advantages of the proposed part attention (PA) mechanism, we have designed two comparative blocks, i.e., joint attention (JA) and frame attention (FA), following previous studies [16]. These attention blocks were originally created for solving activity recognition challenges, and we have adapted and extended them for SLR. Experimental results presented in Table 9 clearly demonstrate that our model greatly benefits from the multi-input structure: it improves the model's performance significantly for all datasets. These results also illustrate that the inclusion of the part attention mechanism enhances the model performance by a large margin. This primarily occurs due to the part attention module's increased resilience towards noisy skeleton joints and imprecise pose estimations in some frames. The overall best recognition accuracies are achieved by including both part attention and multi-inputs for the WLASL-100, WLASL-300, and WLASL-1000 datasets, as evident in Table 9.

Table 8
Performance Analysis Under Unbalanced Class Distributions.

Train        Test Accuracy (%)
Balanced     83.33
Unbalanced   81.01

Fig. 12. (a) Two different words (1. School, 2. Paper) performed in the same way by signers. (b) Words (1. Pizza, 2. Same): various instances of each word performed in entirely different ways by signers.

Fig. 13. Comparison of MIPA-ResGCN (proposed model), I3D, SIGNGRAPH and SPOTER in terms of number of parameters (millions), average FLOPs (G), and average inference time (sec).

4.9. Visualizations and explanations

To showcase the efficacy of our model in learning the most distinctive features, we applied the class activation map technique [22].
The activation maps presented in Fig. 15 depict the activated joints in various frames of a sequence. Joints with the highest activation
are represented using brighter colors and at a larger scale in the visualization. It is evident from the results presented in Fig. 15 that, out of the five body parts into which our skeleton was divided for the part attention mechanism, the model pays the highest attention to the left and right hands (hand joints are the most activated joints, as indicated by their larger scale and brighter colors), as hand locations and movements are indeed the most distinguishing features of sign language. Moreover, the model is also able to correctly capture the significance of each joint, e.g., the sign for "CHAIR" is performed by moving the index and middle fingers of the right hand up and then bringing them down to touch the same two fingers of the left hand. As can be seen in the frames for the sign class "CHAIR", our model gives the most attention to the joints of these fingers. These observations align with the understanding that sign language relies heavily on the shapes, locations, and orientations of the hands.
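A minimal sketch of a CAM-style computation of per-joint activations is given below, assuming the network exposes its last feature map of shape (N, C, T, V) and a final linear classifier with weights of shape (num_classes, C); this follows the spirit of [22] rather than reproducing our exact visualization code.

import torch

def joint_activations(feature_map: torch.Tensor, fc_weight: torch.Tensor,
                      class_id: int) -> torch.Tensor:
    """CAM-style activation per frame and joint for one class.

    feature_map: (N, C, T, V) output of the last ST-GCN block (before pooling).
    fc_weight:   (num_classes, C) weights of the final fully connected layer.
    """
    w = fc_weight[class_id]                                  # class-specific channel weights
    cam = torch.einsum("c,nctv->ntv", w, feature_map)        # weighted sum over channels
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)           # min-max normalize per sample
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)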


Fig. 14. Top-1 accuracies of the MIPA-ResGCN, I3D, SIGNGRAPH, and SPOTER models trained on six subsets of the training data and evaluated on
a fixed 20 % test split for LSA-64.

Table 9
Ablation Studies of model components in accuracy (%) for WLASL-100, WLASL-300, & WLASL-1000 datasets (MI: multi-input, PA: Part attention, FA:
Frame attention, JA: Joint attention).
Model                                        WLASL-100  WLASL-300  WLASL-1000
Baseline (ResGCN): using Bones only          48.06      47.75      46.59
Baseline: using Joints only                  72.09      71.40      61.83
Baseline: using Bones only + PA              70.16      56.14      52.40
Baseline: using Joints only + PA             75.19      71.71      63.54
Baseline + MI (using both Joints & Bones)    77.52      72.01      62.30
Baseline + MI + FA                           81.01      71.86      62.42
Baseline + MI + JA                           79.07      74.10      64.02
MIPA-ResGCN (Baseline + MI + PA)             83.33      72.90      64.92

Fig. 15. Activated joints for the examples of 'WALK' and 'CHAIR' signs. *Joints represented with a bigger scale and brighter color have the highest activation weights. (Better viewed in colour).


5. Conclusion

In this study, we propose an accurate and efficient framework for pose-based isolated sign language recognition. The architecture uses an efficient pose extractor to extract pose information, which is then divided into two branches, joints and bones, to construct the multi-input structure. This multi-input is forwarded to a ResGCN consisting of basic and bottleneck blocks and a novel part attention mechanism that forces the model to learn efficient spatiotemporal features by focusing on the most essential body parts and ignoring nodes with unnecessary information. Our results clearly demonstrate the model's ability to learn strong spatiotemporal dependencies, thereby providing SOTA accuracies with a pose-based method on the challenging WLASL, MINDS-Libras, and LSA-64 datasets. Our model provides a significant reduction in computational complexity and more generalizable results. Additionally, our visualizations of activated joints effectively illustrate that the model places a strong emphasis on the most crucial cues in sign language, namely hand shapes, locations, and orientations, thereby supporting the assertions made in Section-1. The proposed architecture will have a large influence on applications requiring gesture recognition. Our future work involves expanding the proposed architecture to incorporate appearance-based hand features, with the aim of enhancing recognition accuracy in critical scenarios where the same signs may be signed differently.

CRediT authorship contribution statement

Neelma Naz: Conceptualization, Investigation, Software, Methodology, Visualization, Writing – original draft. Hasan Sajid:
Conceptualization, Project administration, Supervision, Writing – review & editing. Sara Ali: Supervision, Writing – review & editing.
Osman Hasan: Supervision, Writing – review & editing. Muhammad Khurram Ehsan: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

References

[1] Li D, Yu X, Xu C, Petersson L, Li H. Transferring cross-domain knowledge for video sign language recognition. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition; 2020. p. 6205–14.
[2] Li D, Rodriguez C, Yu X, Li H. Word-level deep sign language recognition from video: a new large-scale dataset and methods comparison. In: Proceedings of the
IEEE/CVF winter conference on applications of computer vision; 2020. p. 1459–69.
[3] Tunga A, Nuthalapati SV, Wachs J. Pose-based sign language recognition using gcn and bert. In: Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision; 2021. p. 31–40.
[4] Boháček M, Hrúz M. Sign Pose-based Transformer for Word-level Sign Language Recognition. In: Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision; 2022. p. 182–91.
[5] Subramanian B, Olimov B, Naik SM, Kim S, Park KH, Kim J. An integrated mediapipe-optimized GRU model for Indian sign language recognition. Sci Rep 2022;
12:1–16.
[6] Naz N, Sajid H, Ali S, Hasan O, Ehsan MK. Signgraph: an Efficient and Accurate Pose-Based Graph Convolution Approach Toward Sign Language Recognition.
IEEE Access 2023;11:19135–47.
[7] Basak H, Kundu R, Singh PK, Ijaz MF, Woźniak M, Sarkar R. A union of deep learning and swarm-based optimization for 3D human action recognition. Sci Rep
2022;12:5494.
[8] Gupta R, Kumar A. Indian sign language recognition using wearable sensors and multi-label classification. Comput Electr Eng 2021;90:106898.
[9] Alrubayi AH, Ahmed MA, Zaidan A, Albahri AS, Zaidan B, Albahri OS, Alamoodi AH, Alazab M. A pattern recognition model for static gestures in malaysian sign
language based on machine learning techniques. Comput Electr Eng 2021;95:107383.
[10] Vaezi Joze HR, Koller O. MS-ASL: a large-scale data set and benchmark for understanding American sign language. arXiv preprint arXiv:1812.01053; 2018.
[11] Aleesa RS, Mohammadi HM, Monadjemi A, Hashim IA. Dataset classification: an efficient feature extraction approach for grammatical facial expression
recognition. Comput Electr Eng 2023;110:108891.
[12] Imran J, Raman B. Deep motion templates and extreme learning machine for sign language recognition. Vis Comput 2020;36:1233–46.
[13] Passos WL, Araujo GM, Gois JN, de Lima AA. A gait energy image-based system for Brazilian sign language recognition. IEEE Trans Circuits Syst Regul Pap 2021;
68:4761–71.
[14] Slimane FB, Bouguessa M. Context matters: self-attention for sign language recognition. In: Proceeding of the 25th international conference on pattern
recognition (ICPR); 2021. p. 7884–91.
[15] Camgoz NC, Koller O, Hadfield S, Bowden R. Multi-channel transformers for multi-articulatory sign language translation. In: Proceeding of the European
conference on computer vision; 2020. p. 301–19.
[16] Song S, Lan C, Xing J, Zeng W, Liu J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: Proceedings of the
thirty-first AAAI conference on artificial intelligence; 2017. p. 4263–70.
[17] Ronchetti F, Quiroga F, Estrebou CA, Lanzarini LC, Rosete A. LSA64: an Argentinian sign language dataset. In: Proceeding of the XXII congreso Argentino de
ciencias de la computación (CACIC 2016); 2016.
[18] Rezende TM, Almeida SGM, Guimarães FG. Development and validation of a Brazilian sign language database for human gesture recognition. Neural Comput
Appl 2021;33:10449–67.
[19] Grishchenko I, Bazarevsky V. MediaPipe Holistic; 2020. https://ai.googleblog.com/2020/12/mediapipe-holistic-simultaneous-face.html.
[20] Yan G, Woźniak M. Accurate key frame extraction algorithm of video action for aerobics online teaching. Mob Netw Appl 2022;27:1252–61.


[21] Song YF, Zhang Z, Shan C, Wang L. Stronger, faster and more explainable: a graph convolutional baseline for skeleton-based action recognition. In: Proceedings
of the 28th ACM international conference on multimedia; 2020. p. 1625–33.
[22] Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer
vision and pattern recognition; 2016. p. 2921–9.
[23] Zhang H, Wu C, Zhang Z, Zhu Y, Lin H, Zhang Z, Sun Y, He T, Mueller J, Manmatha R. Resnest: split-attention networks. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition; 2022. p. 2736–46.
[24] Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980; 2014.
[25] Hosain AA, Santhalingam PS, Pathak P, Rangwala H, Kosecka J. Hand pose guided 3d pooling for word-level sign language recognition. In: Proceedings of the
IEEE/CVF winter conference on applications of computer vision; 2021. p. 3429–39.
[26] Konstantinidis D, Dimitropoulos K, Daras P. Sign language recognition based on hand and body skeletal data. In: Proceedings of the 3DTV-conference: the true
vision-capture, transmission and display of 3D video (3DTV-CON); 2018. p. 1–4.
[27] Shah JA. Deepsign: a deep-learning architecture for sign language. Ph.D. thesis. Arlington, TX, USA: Univ. Texas; 2018.
[28] Zhang X, Li X. Dynamic gesture recognition based on MEMP network. Future Internet 2019;11:91.
[29] Konstantinidis D, Dimitropoulos K, Daras P. A deep learning approach for analyzing video and skeletal features in sign language recognition. In: Proceedings of
the IEEE international conference on imaging systems and techniques (IST); 2018. p. 1–6.
[30] Rasley J, Rajbhandari S, Ruwase O, He Y. Deepspeed: system optimizations enable training deep learning models with over 100 billion parameters. In:
Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining; 2020. p. 3505–6.
