Updated Research Paper
Abstract
"Real-Time Conversion for Sign to Text and Text to Using Machine Learning" aims to use
machine learning to create a system that can effortlessly translate sign language gestures into
text and convert text into natural-sounding speech in real time. This groundbreaking
development seeks to address the long-standing issue of communication accessibility for the
deaf and hard-of-hearing communities. By harnessing cutting-edge machine learning
techniques that integrate natural language processing and computer vision, this initiative aims
to break down the barriers and provide a two-way communication channel. This channel will
not only interpret sign language gestures but also transmit information through synthesized
speech and written text. To lay the foundation for this study, a comprehensive review of the
literature is conducted, exploring the progression of text generation, sign language
recognition, and text-to-speech synthesis over time. Building upon this knowledge, the
subsequent sections delve into the system architecture and techniques employed for text-to-
speech synthesis and sign language recognition.
1 Introduction
The significance of this study lies in its capacity to support the deaf community by providing
numerous avenues for easy communication. Leveraging the latest machine learning
techniques, such as natural language processing and computer vision, enables the system to
recognize sign language gestures and render them as written text and synthesized speech.
This study delves into the creation of powerful real-time language recognition models,
drawing on computer vision and deep learning [4]. Simultaneously, language
processing tools enable the transformation of written language into spoken language,
facilitating communication between signers and those who do not know sign language.
Feature extraction and representation [8] involve transforming an image into a three-
dimensional matrix. This matrix has dimensions equal to the image's height and width, with a
depth value assigned to each pixel. In the case of RGB images, there are three depth values,
while grayscale images have just one. These pixel values play a crucial role in helping
Convolutional Neural Networks (CNNs) extract useful features.
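As a minimal illustration of this representation, the sketch below (assuming OpenCV and a placeholder file name) loads an image both in colour and in grayscale and inspects the resulting pixel matrices:

```python
import cv2

# Load a placeholder image as colour (OpenCV stores channels in BGR order)
# and as grayscale; "hand_sign.jpg" is an illustrative file name.
color_img = cv2.imread("hand_sign.jpg", cv2.IMREAD_COLOR)
gray_img = cv2.imread("hand_sign.jpg", cv2.IMREAD_GRAYSCALE)

print(color_img.shape)  # (height, width, 3) -> three depth values per pixel
print(gray_img.shape)   # (height, width)    -> one intensity value per pixel

# Pixel values are integers in 0-255; CNNs typically expect them scaled to [0, 1].
normalized = gray_img.astype("float32") / 255.0
```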
An Artificial Neural Network (ANN) is a network of neurons that imitates the structure of
the human brain. Information is transmitted from one neuron to another through connections.
[1] The first layer of neurons receives inputs, processes them, and passes them on to the
hidden layers. After going through several levels of the hidden layers, the information
reaches the final output layer.
To work effectively, neural networks require training. Several learning strategies exist:
1. Unsupervised learning
2. Supervised learning
3. Reinforcement learning
By leveraging the high-level Keras API, model development and training become more
accessible.[33] TensorFlow also offers eager execution, allowing for instantaneous iteration
and intuitive debugging. Additionally, the Distribution Strategy API helps distribute training
across different hardware configurations for large-scale machine learning tasks without
altering the model definition.
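As a hedged sketch of what this looks like in code (assuming TensorFlow 2.x; the tiny model here is only a placeholder, not the architecture used in this work), wrapping model creation in a strategy scope is all that changes:

```python
import tensorflow as tf

# MirroredStrategy replicates training across the GPUs visible on one machine
# (it falls back to CPU if none are available).
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # The model definition itself is unchanged; only the surrounding scope differs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(128 * 128,)),
        tf.keras.layers.Dense(27, activation="softmax"),  # e.g. 26 letters + blank
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
```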
Keras, a Python library, serves as a wrapper around TensorFlow and facilitates the rapid
building and testing of neural networks with minimal code.[28] It assists with various data
types, such as text and images, and provides implementations of commonly used neural
network components like layers, objectives, activation functions, and optimizers.
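To make these building blocks concrete, the following is a minimal, illustrative sketch (layer sizes and the 27-class output are assumptions, not the exact model used in this work) of a Keras CNN for 128 x 128 grayscale inputs:

```python
from tensorflow.keras import layers, models

# A small illustrative CNN combining layers, activation functions, and pooling.
model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),           # grayscale input images
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(27, activation="softmax"),      # e.g. 26 ASL letters + blank
])
model.summary()
```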
OpenCV offers functionalities like image processing, video capture, and feature analysis,
including object and face recognition. While bindings for Python, Java, and
MATLAB/OCTAVE exist, the primary interface of OpenCV is written in C++.
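As one small, hedged example of these capabilities (using the Haar cascade file bundled with the OpenCV Python package), a webcam frame can be captured and faces detected in a few lines:

```python
import cv2

# Open the default webcam and grab a single frame.
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

if ret:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Haar cascade face detector shipped with the opencv-python package.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    face_detector = cv2.CascadeClassifier(cascade_path)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(f"Detected {len(faces)} face(s)")
```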
4 Literature Survey
Title: Sign Language to Text-Speech Translator Using Machine Learning [21]
Authors: Akshatha Rani K, Dr. N Manjanaik
Approach: Bridges the communication gap between deaf-mute individuals and others; utilizes
efficient hand tracking with MediaPipe; converts recognized signs to speech, aiding blind
individuals.
Outcome: Achieves 74% accuracy; recognizes almost all letters in ASL; addresses the
challenge of communication for deaf and mute individuals.
In recent years, there has been extensive research on hand gesture recognition. Through a
literature review, we have identified the fundamental stages involved in this process.
Firstly, let's discuss data collection. One method involves using sensory apparatus, such as
electromechanical devices, to provide precise hand configuration and position.[5] However,
this approach is not user-friendly and can be quite costly.
Next, we move on to data pre-processing and feature extraction for the vision-based
approach. A combination of background subtraction and threshold-based color detection is
used for hand detection.[1] Additionally, the AdaBoost face detector helps differentiate
between hands and faces, which have similar skin tones. Gaussian blur, also known as
Gaussian smoothing, is applied to extract the required training image. By utilizing the Open
Source Computer Vision library (OpenCV), we can easily apply this filter. Using instrumented gloves further
aids in obtaining accurate and concise data, while reducing computation time for pre-
processing.[8]
To improve the segmentation of images, color segmentation techniques have been explored.
However, the reliance on lighting conditions and the similarity between certain gestures pose
challenges. To address these issues, we decided to keep the hand's background as a stable
single color. This eliminates the need for segmentation based on skin color and enhances
accuracy for a large number of symbols.[22]
Now, let's talk about gesture classification. Hidden Markov Models (HMM) [15] are utilized
to categorize gestures, specifically addressing their dynamic components. By tracking skin-
color blobs corresponding to the hand, gestures can be extracted from a sequence of video
images. Differentiating between symbolic and deictic classes of gestures is the primary aim.
Statistical objects called blobs are employed in identifying homogeneous regions by
gathering pixels with skin tones.[28] For static hand gesture recognition, the Naïve Bayes
Classifier is employed, which categorizes gestures based on geometric-based invariants
extracted from segmented image data.[33] This method is independent of skin tone and
captures gestures in every frame of the video. Additionally, the K-nearest-neighbour algorithm,
assisted by the distance weighting algorithm (KNNDW), is utilized to classify gestures and
provide data for a locally weighted Naïve Bayes classifier.[29]
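To illustrate the distance-weighted KNN idea described above, here is a hedged sketch using scikit-learn and synthetic feature vectors; it is not the cited authors' implementation:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic geometric feature vectors standing in for segmented-hand descriptors.
rng = np.random.default_rng(0)
X_train = rng.random((200, 10))          # 200 samples, 10 features each
y_train = rng.integers(0, 5, size=200)   # 5 gesture classes

# weights="distance" makes closer neighbours count more, as in KNNDW.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)

X_query = rng.random((1, 10))
print("Predicted gesture class:", knn.predict(X_query)[0])
```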
6 Methodology
The method used by our system is based on vision. In this approach, there is no need for
artificial devices to aid in interaction, as all signs can be read using hand gestures.
[18] In our quest to find ready-made datasets for the project, we scoured multiple sources but
couldn't find any that met our requirements in terms of raw image formats. We did manage to
locate RGB value datasets, though. Given this situation, we made the decision to create our
own dataset. Here are the steps we followed: Utilizing the Open Source Computer Vision (OpenCV)
library, we captured around 800 pictures of each symbol in American Sign Language (ASL)
for training purposes. Additionally, we took approximately 200 pictures of each symbol for
testing purposes.
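A hedged sketch of how such a dataset can be captured with OpenCV follows; the directory layout, region-of-interest coordinates, and key bindings are illustrative assumptions rather than the exact script used:

```python
import os
import cv2

symbol = "A"                       # the ASL letter currently being recorded
out_dir = os.path.join("dataset", "train", symbol)
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(0)
count = 0
while count < 800:                 # roughly 800 training images per symbol
    ret, frame = cap.read()
    if not ret:
        break
    roi = frame[100:400, 100:400]  # fixed region of interest for the hand
    cv2.imshow("capture", roi)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("c"):            # press 'c' to save the current frame
        cv2.imwrite(os.path.join(out_dir, f"{count}.jpg"), roi)
        count += 1
    elif key == ord("q"):          # press 'q' to stop early
        break

cap.release()
cv2.destroyAllWindows()
```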
After capturing each image, we applied a Gaussian blur filter to the frame to reduce noise
before extracting features.
To predict the final symbol made by the user, our method utilizes two layers of algorithms.
Algorithm Layer 1:
We apply the Gaussian Blur filter and threshold to the image obtained from
OpenCV, in order to extract features.
The processed image is then fed into the CNN model for prediction. If a letter is
identified across more than 50 frames [18], it is printed and used to form a word.
The blank symbol represents the space between words.
Algorithm Layer 2:
When the detection count of a letter exceeds a predefined value, and the counts of all
other letters are at least a threshold distance below it, we print the letter and append it to
the current string. In our code, we set the count value at 50 and the difference threshold at 20.
[22]
[11] If an incorrect letter is predicted, we discard the dictionary that holds the detection
counts of the current symbol.
If the current buffer is empty, no space is detected. However, if the count of the blank
symbol (plain background) exceeds a certain value, the current word is appended to the
sentence and the end of the word is predicted by printing a space. A minimal sketch of this
two-layer logic follows below.
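In this sketch, variable names and structure are illustrative, while the thresholds of 50 and 20 are the values described above:

```python
COUNT_THRESHOLD = 50   # detections required before a letter is accepted
DIFF_THRESHOLD = 20    # a competing letter must trail by at least this much
BLANK_THRESHOLD = 50   # blank-frame count that signals the end of a word

letter_counts = {}     # per-symbol detection counts for the current letter
current_word = ""
sentence = ""

def handle_prediction(symbol):
    """Accumulate CNN predictions and emit letters/words when they are stable."""
    global letter_counts, current_word, sentence

    letter_counts[symbol] = letter_counts.get(symbol, 0) + 1

    if symbol == "blank":
        # Enough blank frames and a non-empty buffer: finish the word with a space.
        if current_word and letter_counts[symbol] > BLANK_THRESHOLD:
            sentence += current_word + " "
            current_word = ""
            letter_counts = {}
        return

    best = max(letter_counts, key=letter_counts.get)
    others = [c for s, c in letter_counts.items() if s != best]
    # Accept the letter only if it clearly dominates all alternatives.
    if letter_counts[best] > COUNT_THRESHOLD and all(
        letter_counts[best] - c > DIFF_THRESHOLD for c in others
    ):
        current_word += best
        letter_counts = {}          # reset counts for the next letter
```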
AutoCorrect Feature:
For every incorrectly input word, we utilize the Python library Hunspell_suggest to suggest
suitable alternatives. This allows us to present the user with a list of words matching the
current word, from which they can select a replacement to add to the sentence.[13] This not
only reduces spelling errors but also aids in predicting complex words.[2]
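A minimal sketch of this autocorrect step is shown below, assuming the pyhunspell bindings and typical Linux dictionary paths (both of which vary by system):

```python
import hunspell  # pyhunspell bindings around the Hunspell spell checker

# Dictionary paths are system-dependent; these are common Linux defaults.
checker = hunspell.HunSpell(
    "/usr/share/hunspell/en_US.dic",
    "/usr/share/hunspell/en_US.aff",
)

word = "helo"  # a word assembled from recognized letters
if not checker.spell(word):
    suggestions = checker.suggest(word)
    print("Did you mean:", suggestions[:5])
    # The user picks one of these suggestions to append to the sentence.
```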
To minimize unnecessary noise, we convert the input images from RGB to grayscale and
apply a Gaussian blur. After resizing the photos to 128 x 128 pixels, we use
adaptive thresholding to separate the hand from the background.
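A hedged sketch of this preprocessing chain with OpenCV (the kernel size and adaptive-threshold parameters are illustrative assumptions):

```python
import cv2

def preprocess(frame):
    """Convert a captured BGR frame into the 128 x 128 binary input image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)      # colour -> grayscale
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)          # suppress noise
    resized = cv2.resize(blurred, (128, 128))            # network input size
    # Adaptive thresholding separates the hand from the plain background.
    binary = cv2.adaptiveThreshold(
        resized, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV,
        blockSize=11, C=2,
    )
    return binary
```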
[17] Once the input images have been pre-processed, we feed them to our model for the
training and testing phases. The prediction layer makes an informed guess regarding which
class the image belongs to.
To ensure that the sum of values in each class adds up to 1, the output is normalized between
0 and 1 using the SoftMax function.
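As a small numerical illustration of this normalization, a NumPy sketch of the softmax function:

```python
import numpy as np

def softmax(logits):
    """Map raw prediction scores to probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())   # approximately [0.659 0.242 0.099], summing to 1.0
```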
The output from the prediction layer may slightly deviate from the actual value. To improve
accuracy, the network is trained using labelled data. One performance metric used in the
classification process is cross-entropy, a continuous function that equals zero when the
prediction matches the labelled value and increases as the prediction deviates from it.[25]
The aim is to minimize cross-entropy as much as possible. This is achieved by modifying the
neural network weights in the network layer. TensorFlow provides an integrated function for
computing cross-entropy. Once the cross-entropy function has been determined, we use
gradient descent, specifically the Adam Optimizer, to optimize it.[11]
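As a small, self-contained illustration of this behaviour (the numbers are made up for the example), TensorFlow's built-in categorical cross-entropy is near zero for a prediction that matches the one-hot label and grows as the prediction deviates; the Adam optimizer is then used to minimize it:

```python
import tensorflow as tf

# One-hot label for a 3-class example and two candidate predictions.
y_true = tf.constant([[0.0, 1.0, 0.0]])
good_pred = tf.constant([[0.05, 0.90, 0.05]])
bad_pred = tf.constant([[0.70, 0.20, 0.10]])

cce = tf.keras.losses.CategoricalCrossentropy()
print(float(cce(y_true, good_pred)))  # small loss, close to 0
print(float(cce(y_true, bad_pred)))   # noticeably larger loss

# During training, the Adam optimizer adjusts the weights to minimize this loss,
# e.g. via model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
#                        loss="categorical_crossentropy").
```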
7 Conclusion
A cognitive system that would be especially helpful for the deaf and mute community could
be developed by implementing the system as an image-processing-based sign language
translator. Adding more words, signed by a wider range of signers, to the dataset would help
create a far more dependable neural-network-based system.
After implementing the two layers of the algorithm, the confirmed and predicted symbols are
more likely to agree. As a result, we accurately identify almost every symbol, provided that it
is displayed correctly, there is no background noise, and the lighting conditions are
satisfactory.
As machine learning and artificial intelligence continue to advance, we can anticipate more
sophisticated and accurate speech-to-text and sign-to-text conversion software in the future.
This technology allows people with hearing and speech impairments to communicate more
effectively with others and helps bridge the communication gap between people who use
sign language and those who do not. Speech-to-text and sign-to-text technologies will also
have a significant impact on the education sector, allowing students with hearing and speech
disabilities to participate more actively in classroom discussions and lectures, and helping
teachers communicate more effectively with their students. Still, it is critical that this
technology be accessible to everyone, regardless of their socio-economic status. In summary,
the future of speech-to-text and sign-to-text technology is very bright, and we can anticipate
even more innovative and sophisticated solutions in the years ahead.
References
[1] Machine Learning-Based Sign Language Interpreter for the Deaf and Dumb, ISSN: 0970-2555, Issue 52, June 2023.
[2] Recognition of American Sign Language and its Translation from Text to Speech, Volume 11, Issue IX, September 20, 2023.
[3] Recognition of Sign Languages and Their Translation to Text and Speech, Volume 07, Issue 10, October 2023.
[4] A Framework for Machine Learning and Technique for Converting Speech to Instantaneous Sign Language for AR Glasses, Volume 03, Issue 10, October 2023.
[5] Sign Language to Speech Conversion, Volume 11, Issue X, October 20, 2023.
[6] Real-Time Translation from Sign Language to Text via Transfer Learning, December 2022.
[7] Translation from Sign Language to Text and Speech using CNN, Volume 03, Issue 05, May 2021.