A New Benchmark On American Sign Language Recognition Using Convolutional Neural Network
All content following this page was uploaded by Md Moklesur Rahman on 17 April 2020.
Abstract—The hearing-impaired (deaf) people use a set of signs, called a sign language, instead of speech for communication among themselves. However, it is very challenging for non-signers to communicate with this community using signs. It is therefore necessary to develop applications that recognize the gestures or actions of sign languages to ease communication between the hearing and the deaf communities. American Sign Language (ASL) is one of the most widely used sign languages in the world, and despite its importance, the existing methods for ASL recognition achieve only limited accuracy. The objective of this study is to propose a novel model that enhances the accuracy of the existing methods for ASL recognition. The study has been performed on the alphabet and numerals of four publicly available ASL datasets. After preprocessing, the images of the alphabet and numerals were fed to a newly proposed convolutional neural network (CNN) model, and the performance of this model in recognizing the numerals and alphabet of these datasets was evaluated. The proposed CNN model significantly (by 9%) improves the recognition accuracy of ASL reported by some existing prominent methods.

Index Terms—Hand gesture, American Sign Language, Convolutional neural network, Recognition, ASL.

I. INTRODUCTION

According to the World Health Organization (WHO) [1], the number of people with a hearing disability increased from 278 million in 2005 to 466 million in early 2018, and this number is expected to increase further by 2050 [1]. This deaf community uses a set of signs to express their language (called a sign language), which differs from nation to nation. In other words, a sign language (SL) is a nonverbal communication language that utilizes visual sign patterns made with the hands or other parts of the body, used primarily by people with a hearing disability. Sign languages (SLs) are full-fledged natural languages with their own lexicon and grammar. Different SLs, such as American Sign Language (ASL), Australian Sign Language, British Sign Language (BSL), Danish Sign Language, French Sign Language, and many others, have been developed for deaf communities. Although there are some striking similarities among the SLs, they are not mutually intelligible or universal. For example, ASL and BSL are different, even though both communities share the same verbal language. Hearing people find it extremely difficult to understand even the sign language of their own nation. Hence, trained SL interpreters are needed during medical and legal appointments, educational and training sessions, etc. The automatic recognition of an SL and its translation into a natural language can establish a proper communication interface between the hearing-impaired and hearing people.

ASL also predominates as a second language in the deaf communities of the United States and Canada [2]. According to the National Association of the Deaf (NAD) [3] in the United States of America, ASL is accepted by many high schools, colleges, and universities in fulfillment of modern and foreign language academic degree requirements. Besides North America, ASL is also used in many countries across the world, including parts of Southeast Asia and much of West Africa.

There are some works [4]–[6] already reported in the literature for the automatic recognition of ASL. Some of these methods have been studied on sample datasets with only a few samples, and some use the traditional shallow neural network approach for classification. Shallow neural networks require manual identification and selection of relevant features. The use of deep learning (DL) techniques for machine learning problems has significantly improved on the performance of traditional shallow neural networks, especially for image recognition and computer vision problems. DL is a subfield of machine learning in artificial intelligence (AI). It is a set of algorithms and models with high-level abstractions, realized through architectures composed of multiple nonlinear transformations. DL algorithms utilize huge amounts of data to extract features automatically, aiming to emulate the human brain's ability to learn, analyze, observe, and make inferences, especially for extremely difficult problems. DL architectures create relationships beyond immediate neighbors in the data, generate learning patterns, and extract representations directly from the data without human intervention. There are different deep
In this paper, four separate datasets of ASL have been utilized to analyze the performance of the proposed method. The Massey University Gesture dataset [6], called MU HandImages ASL, contains standard ASL hand gestures and consists of 2425 images in PNG format from 5 individuals. The sign language digit dataset [15] was collected from 218 students of Ankara Ayranci Anadolu High School, Turkey. There are 10 samples of each digit collected from each subject. The third dataset that we considered in this study is the ASL finger spelling dataset [5], collected by the Center for Vision, Speech and Signal Processing group at the University of Surrey, UK. The samples of this dataset are divided into color images and depth images. In our work, we consider only the color images, consisting of 24 static signs of the ASL alphabet (excluding the letters J and Z), which were acquired from 5 individuals in different sessions with similar lighting and backgrounds. The dataset contains over 65,000 images of the ASL alphabet. The fourth and last dataset that we considered is the ASL Alphabet dataset [16], which consists of 87,000 samples of 29 classes (26 for the letters A–Z and three special classes: delete, nothing, space). A few samples from this dataset are shown in Fig. 1.

In this study, the raw images are transformed into grayscale images. The gray levels of the input images are normalized by the maximum value of the gray-level range. The use of low-resolution images provides faster training without too much impact on the recognition rate. The images are resized to 64×64 pixels.

B. CNN Model Description

The architectural design of a CNN contributes to optimal performance through a proper selection of the convolution layers and the number of neurons. There are no universally accepted standard guidelines for selecting the number of neurons and convolution layers. Here, we have proposed a CNN architecture (which we call SLRNet-8) that maximizes the recognition accuracy. Our proposed SLRNet-8 consists of six convolution layers, three pooling layers, and a fully connected layer besides the input and output layers. The major steps of the proposed ASL recognition method are described in Fig. 3.
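The preprocessing described above (grayscale conversion, normalization by the maximum gray level, and resizing to 64×64) can be sketched in NumPy. The luminance weights and nearest-neighbour resizing below are our assumptions for illustration; the paper does not specify which conversion or resizing method was used:

```python
import numpy as np

def preprocess(image: np.ndarray, size: int = 64) -> np.ndarray:
    """Convert an RGB image to a normalized size x size grayscale image."""
    # Grayscale via standard luminance weights (assumed, not from the paper).
    gray = image[..., :3] @ np.array([0.299, 0.587, 0.114])
    # Normalize by the maximum of the gray-level range (255 for 8-bit images).
    gray = gray / 255.0
    # Nearest-neighbour resize to size x size (assumed resizing method).
    rows = (np.arange(size) * gray.shape[0] / size).astype(int)
    cols = (np.arange(size) * gray.shape[1] / size).astype(int)
    return gray[np.ix_(rows, cols)]

img = np.random.randint(0, 256, (100, 120, 3), dtype=np.uint8)
x = preprocess(img)  # shape (64, 64), values in [0, 1]
```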
Input Layer
The pre-processed images are fed directly to the network through its input layer. There are 4096 nodes in the input layer, each corresponding to one pixel of the image at resolution 64×64.
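As a trivial sanity check on these sizes (not code from the paper), a 64×64 image maps onto exactly 4096 input nodes:

```python
import numpy as np

# A preprocessed 64x64 grayscale image: one input node per pixel.
image = np.zeros((64, 64))
input_nodes = image.reshape(-1)
print(input_nodes.size)  # 4096
```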
Convolution Layer
Convolution is the first layer for extracting features from the input data and serves as a basic building block of the CNN. In the convolution layer, the kernels extract the salient features from the input data through forward and backward propagation. In our study, this operation is performed by shifting filters of dimension 3×3 and 5×5 over the input data matrix. At every shift, it executes element-wise matrix multiplications and then aggregates the results into a feature map.

[Fig. 2: Methodological steps of the proposed ASL recognition system.]

The number of kernels used in the convolutional layers may affect the performance of a CNN model. There are no standard guidelines for choosing the number of kernels in a convolution layer. In this work, we performed experiments using different numbers of kernels from 32 to 512 at different step sizes, and finally selected the combination which maximized the accuracy. A batch normalization (BN) [17] layer, which is responsible for accelerating the training process and reducing the internal covariate shift, follows some of the convolution layers.

Activation Function
In a CNN architecture, activation functions decide which nodes should fire at a given time. We applied the ReLU [18] activation function, which substitutes all negative values with 0 and leaves the positive values unchanged. The selection of ReLU was motivated by the learning time of the model. In training, ReLUs tend to be several times faster [19] than their equivalents (softplus, tanh, sigmoid), and ReLU can diminish the vanishing gradient problem. The ReLU function is expressed by:

ReLU(y) = max(0, y),   (1)

where y refers to the input to a neuron.

Pooling Layer
Pooling is a significant concept in deep learning. It makes the training of a CNN faster and reduces the memory size of the network by reducing the linkages between the convolutional layers. Here, we have used the max-pooling operation for this purpose. Max-pooling slides a window across the input space and outputs the largest value within that window. We opted for a 2×2 window size for the max-pooling operation. To avoid overlapping windows, the stride was set to 2. The resulting dimension of the max-pooling operation can be calculated by the following equation:

Nout = floor((Nin − F) / S) + 1,   (2)

where Nin, F, and S refer to the size of the input image, kernel, and stride, respectively.

Global Average Pooling Layer
The global average pooling (GAP) layer is very similar to the max-pooling layer. The only difference is that the entire area is replaced by its average value instead of the maximum value. The GAP layer drastically reduces the dimension: a tensor of size height×width×depth is reduced to 1×1×depth.

Fully Connected Layer
The feature map generated by the GAP layer is fed into the fully connected (FC) layer. In an FC layer, the neurons in one layer are connected to the neurons in another layer. The FC layer also behaves like a convolution layer with a filter of size 1×1.

Dropout Layer
Dropout is a regularization technique that sets input elements to zero with a given probability in a random manner. The over-fitting problem occurs when a model's training accuracy is much higher than its testing accuracy. In CNN models, a dropout layer following the FC layer helps prevent the over-fitting problem and enhances the performance [20], [21] by setting activations to zero at random during the training process. The dropout probability used in this study was 0.5.

Output Layer
The output of the classification model, i.e., the prediction of a class with a certain probability, is obtained at this layer. Note that the target class should have the highest probability. We set the number of neurons in the output layer equal to the number of categories. In the case of a multiclass classification problem, the Softmax function returns the probability of each class, where the target class will have the highest probability. The mathematical expression for the Softmax function is given by:

σ(x)_i = exp(x_i) / Σ_{k=1}^{K} exp(x_k),  for i = 1, 2, ..., K,   (3)

where x_i are the inputs to the Softmax layer from the previous FC layer and K is the number of classes.

V. TRAINING DETAILS

A. Data Augmentation
Data augmentation means increasing the number of samples as well as adding variations to the samples for better training. Traditional data augmentation techniques [22], [23] include rotation, scaling, shifting, and flipping. To keep the computational burden small, data augmentation is performed here by randomly changing the angle between −10° and 10°, zooming by 10%, and shifting by 10% in height and width. These parameters were chosen on a trial-and-error basis to provide optimum accuracy.

B. CNN Training
There were no distinct train or test sets for any of the datasets considered in this work. In this study, every dataset was randomly partitioned into K = 10 folds; K − 1 of them were used for training the model, and the remaining one was used for testing its performance. This process was repeated 10 times, and finally the average of all the accuracies was reported as the accuracy of the model. Here, we chose the cross-entropy cost function [24], and a gradient descent-based Adam optimizer [25] with a learning rate of 0.001 was selected. Our SLRNet-8 model was trained for up to 200 epochs with 64 steps per epoch and a batch size of 128. If the validation accuracy did not improve in six consecutive epochs, the learning rate was reduced to 75% of its previous value. We allowed early stopping, and training was halted if the validation loss did not improve for thirty consecutive epochs. The network weights were initialized with very small real numbers drawn from a normal distribution, with a weight decay rate of 1 × 10^−6. The model
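As a compact sanity check, ReLU (1), the pooling output size (2), and the Softmax (3) can be sketched in NumPy. This is a minimal illustration of the formulas, not the authors' code:

```python
import numpy as np

def relu(y):
    # Eq. (1): zero out negatives, keep positives unchanged.
    return np.maximum(0.0, y)

def pool_out(n_in, f, s):
    # Eq. (2): N_out = floor((N_in - F) / S) + 1
    return (n_in - f) // s + 1

def softmax(x):
    # Eq. (3); subtracting the max is a standard numerical-stability trick.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Three 2x2 max-poolings with stride 2, as in SLRNet-8:
# a 64-pixel side shrinks 64 -> 32 -> 16 -> 8.
n = 64
for _ in range(3):
    n = pool_out(n, 2, 2)
print(n)  # 8
```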
TABLE I: PERFORMANCE OF THE PROPOSED MODEL

Dataset               Category   Accuracy (%)
MU HandImages ASL     Digit      100
MU HandImages ASL     Alphabet   99.95
Sign Language Digits  Digit      99.90
ASL Alphabet          Alphabet   100
Finger Spelling       Alphabet   99.99
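The 10-fold evaluation protocol described in Sec. V-B can be sketched as follows. Here `evaluate` is a hypothetical stand-in for training SLRNet-8 on the training indices and measuring accuracy on the held-out fold; it is not part of the paper:

```python
import numpy as np

def kfold_mean_accuracy(n_samples, evaluate, k=10, seed=0):
    """Shuffle the sample indices, split them into k folds, train on
    k-1 folds, test on the remaining one, and average the accuracies."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        accs.append(evaluate(train_idx, test_idx))
    return float(np.mean(accs))

# With a dummy evaluator, the protocol itself can be exercised:
mean_acc = kfold_mean_accuracy(2425, lambda train_idx, test_idx: 0.99)
```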