0% found this document useful (0 votes)
16 views43 pages

TJAS

The document presents a project report on a Voice and Gesture Detection System designed to assist individuals with disabilities in interacting with computers using voice commands and hand gestures. It details the system's architecture, methodology, and expected benefits, emphasizing its potential to enhance accessibility and independence for users with motor impairments. The project is developed using Python and integrates various technologies such as OpenCV, MediaPipe, and SpeechRecognition to facilitate hands-free operation.

Uploaded by

Deval Mishra
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views43 pages

TJAS

The document presents a project report on a Voice and Gesture Detection System designed to assist individuals with disabilities in interacting with computers using voice commands and hand gestures. It details the system's architecture, methodology, and expected benefits, emphasizing its potential to enhance accessibility and independence for users with motor impairments. The project is developed using Python and integrates various technologies such as OpenCV, MediaPipe, and SpeechRecognition to facilitate hands-free operation.

Uploaded by

Deval Mishra
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 43

Voice and Gesture Detec on System for Disabled

Master of Computer Science


By

TEJAS SATISH NAYAK


Seat Number:345

Under the Guidance of

Dr. Jasmeet Kaur Ghai


Asst. Prof Rashmi Pote
Ass. Prof Sunita Rai
Department of Computer Science

Guru Nanak Khalsa College of Arts, Science & Commerce (Autonomous)


2024-2025

1|Page
G. N. KHALSA COLLEGE
AUTONOMOUS
(UNIVERSITY OF MUMBAI)
MUMBAI-400019

DEPARTMENT OF COMPUTER SCIENCE

CERTIFICATE
Exam Seat No. 345

CERTIFIED that the Project Implementa on report, assignment duly signed, were performed
by Mr./Ms. TEJAS SATISH NAYAK Roll No 345 of M.Sc. Part I/Part II class in the
Computer Science Laboratory of G. N. Khalsa College, Mumbai during academic year 2024-
2025

He/she has completed the Project Implementa on report in Computer Science

as contained in the course prescribed by the University of Mumbai

Sign. Of the Student Head of Dept.

Date_________________ Computer Science


Date_________________

Professor-in-charge Sign of Examiner’s


1)_______________ 1)_______________

Date____________ Date_____________

2)_____________ 2)_______________

Date___________ Date______________

2|Page
Sr Contents Page
no no
1 Acknowledgement 4
2 Abstract 5
3 Introduction 6
4 Objective 7
5 Advantages 8
6 Literature Review 9
7 Methodology 11
8 Project Architecture 12
9 WorkFlow 13
10 Challenges 18
11 Data Collection 20
12 Models 23
13 Output 24
14 Future Trends 25
15 Reference 27
16 Code 29

3|Page
Acknowledgement

I would like to express my sincere gratitude to everyone who contributed to the successful
completion of this project,. First and foremost, I extend my deepest appreciation to my
project guide Asst. Prof. Rashmi Pote and Asst. Prof. Sunita Rai, for their invaluable
guidance, constructive feedback, and unwavering support throughout this project. Their
expertise and encouragement have been instrumental in shaping this work.

I would like to express our sincere thanks to Dr. Jasmeet Kaur (Head of CS DEPARTMENT)
for her constant encouragement, which made this project a success. I am also grateful to Guru
Nanak Khalsa college and the faculty members of the Computer Science for providing the
necessary resources, technical knowledge, and motivation to carry out this research
effectively.

A special thanks to my peers and colleagues for their insightful discussions, suggestions, and
encouragement, which helped refine my approach and methodology.

Furthermore, I acknowledge the support of my family and friends, whose constant


encouragement and belief in my abilities kept me motivated throughout this journey.

Lastly, I would like to express my appreciation to the researchers and developers in the field
of machine learning and educational analytics, whose work has been a valuable source of
inspiration for this study. Thank you all for your guidance and support, without which this
project would not have been possible.

4|Page
ABSTRACT
The Voice and Gesture Detec on System for Disabled is an innova ve assis ve technology
designed to enable physically challenged individuals to interact with a computer or laptop
using voice commands and hand gestures. This system eliminates the need for tradi onal
input devices like a keyboard or mouse, offering a hands-free and voice-controlled
alterna ve.

The project leverages computer vision and speech recogni on to interpret hand movements
and spoken words, conver ng them into system ac ons. It enhances accessibility for
individuals with motor impairments, providing them with an intui ve way to perform tasks
such as:

 Controlling the mouse (movement, clicking, scrolling) using gestures.

 Execu ng system commands (opening applica ons, adjus ng se ngs, shu ng


down) via voice.

 Naviga ng through so ware without requiring physical input devices.

The system is developed using Python, integra ng several key libraries and frameworks:

 OpenCV & MediaPipe – For real- me hand tracking and gesture recogni on.

 SpeechRecogni on – To process and interpret voice commands.

 PyAutoGUI – To simulate mouse and keyboard interac ons.

 Tkinter – For designing a simple, accessible user interface.

The workflow follows these steps:

1. Capturing input from a webcam (for gestures) and microphone (for voice).

2. Processing the input using AI-based models to classify gestures and recognize
speech.

3. Mapping the input to predefined system func ons.

4. Execu ng the corresponding ac on on the system, such as moving the cursor,


opening so ware, or controlling the volume.

Expected Benefits:

 Increased digital accessibility for disabled users.

 Hands-free opera on for smoother human-computer interac on.

 Customizable and scalable, allowing more commands to be added.

 Cost-effec ve, requiring only a standard webcam and microphone for setup.

5|Page
INTRODUCTION
In today's technology-driven world, ensuring that everyone can access and use digital
devices is more important than ever. Tradi onal interfaces, like keyboards and mice, were
not designed with everyone in mind, o en leaving people with physical disabili es at a
disadvantage. This project addresses that gap by developing a system that combines voice
commands with hand gesture recogni on to control a laptop or PC without physical contact.

At its core, the system is built on the principles of mul modal interac on, where different
input methods work together to create a more natural and intui ve experience. Voice
recogni on allows users to express commands in the same way they would communicate in
daily life, relying on speech processing techniques that convert spoken words into ac onable
commands. Simultaneously, hand gesture recogni on leverages computer vision
technologies to track and interpret subtle hand movements, transla ng them into naviga on
and control ac ons on the screen.

The theore cal founda on of this project lies in human-computer interac on (HCI) and
adap ve technology. It explores how users can interact with computers using their natural
communica on methods rather than being confined to conven onal input devices. By
combining theories of signal processing, pa ern recogni on, and ergonomic design, the
system adapts to the user's specific abili es, offering a more personalized and accessible
experience.

Furthermore, the project emphasizes the importance of designing technology that adjusts to
human needs rather than expec ng users to conform to rigid interfaces. This adap ve
approach not only enhances usability but also fosters independence, allowing individuals
with mobility impairments to navigate digital environments more freely. Through this work,
the project contributes to a broader understanding of how assis ve technologies can
transform our interac ons with computers, making digital communica on more inclusive for
all.

6|Page
Objec ve
Comprehensive Voice Control:
• Enable execu on of system commands such as shutdown and restart through natural
speech.
• Facilitate launching applica ons (e.g., opening Chrome) and controlling media playback
like playing music.
• Incorporate customizable voice macros and adjustable recogni on sensi vity to suit user
preferences.

Precision Gesture Naviga on:


• Implement precise cursor control by tracking the index finger with customizable speed and
accelera on se ngs.
• Map specific hand poses to mouse ac ons, including le -click, right-click, and double-click.
• Enable scroll control via palm orienta on with adjustable scroll speed parameters.

Text Input System:


• Support voice-to-text dicta on for seamless document and email crea on.
• Integrate gesture typing on an on-screen keyboard, leveraging hand tracking for efficient
text input.

Emergency Assistance:
• Ac vate voice-triggered emergency alerts (e.g., "call for help") to ensure rapid response.
• Detect panic gestures to automa cally no fy caregivers in cri cal situa ons.

Adap ve Interface:
• Provide adjustable gesture sensi vity to accommodate various mobility levels.
• Offer personalized voice command training to enhance recogni on accuracy and overall
usability.

Robust Error Handling and Security:


• Implement advanced error-correc on algorithms to reduce misinterpreta ons in both
voice and gesture inputs.
• Ensure data security and user privacy through strong authen ca on measures and
encrypted communica on channels.

7|Page
Advantage
Natural Interac on

Intui ve gesture recogni on mimics natural human movements

Conversa onal voice commands with natural language processing

Personalized Customiza on

Adjustable gesture sensi vity for different mobility ranges

Programmable voice shortcuts for individual needs

Cost-Effec ve Solu on

Leverages exis ng consumer hardware components

No recurring subscrip on fees or specialized equipment needed

Natural & Intui ve Interac on

Recognizes natural hand gestures (poin ng, swipes, pinches)

Processes conversa onal voice commands with high accuracy

Provides real- me feedback (visual/audio) for user confirma on

Cost-Effec ve Solu on

Uses exis ng hardware (webcams, microphones)

No recurring costs (unlike subscrip on-based assis ve tech)

Open-source framework for community-driven improvements

Privacy & Security

Offline mode for sensi ve commands

No data mining (unlike commercial voice assistants)

Secure local processing of all inputs

8|Page
Literature Review
The Voice and Gesture Detec on System for Disabled is built upon various advancements in human-
computer interac on (HCI), speech recogni on, and gesture-based control systems. Several research
studies and technologies have contributed to the development of assis ve systems for people with
disabili es, par cularly those suffering from motor impairments.

In recent years, researchers have explored computer vision, AI, and deep learning techniques to
improve gesture and speech recogni on systems. The integra on of OpenCV, MediaPipe, and Speech
Recogni on has significantly improved real- me interac on capabili es, allowing users to
communicate with machines effortlessly.

This literature review explores exis ng technologies, methodologies, and key findings from past
research that laid the founda on for this project.

Gesture recogni on plays a vital role in hands-free compu ng for disabled individuals. Various
studies have focused on the development of hand tracking and mo on analysis systems using
different technologies:

Computer Vision-based Approaches:

Open-source libraries like OpenCV and MediaPipe have been extensively used for hand tracking and
pose es ma on.Studies show that MediaPipe Hand Tracking achieves over 95% accuracy in
controlled environments.

Sensor-based Approaches:

Some research works u lize wearable sensors (IMUs, Leap Mo on, Kinect) for gesture recogni on,
but these require addi onal hardware.

While sensor-based systems provide high accuracy, they are expensive and less accessible compared
to vision-based approaches.

9|Page
Google’s SpeechRecogni on API:

Used for conver ng spoken commands into text with high accuracy.

Integrated into various applica ons for voice-controlled automa on.

Deep Learning-based Speech Recogni on (e.g., DeepSpeech, Wav2Vec):

Provides higher accuracy than tradi onal models but requires GPU-based processing power.

Studies suggest that deep learning models achieve over 90% accuracy in speech recogni on.

Keyword Spo ng (KWS) Models:

Used in voice assistants like Alexa, Siri, and Google Assistant.

Recognizes predefined commands quickly but has limited vocabulary flexibility.

Dragon NaturallySpeaking: A commercial so ware for speech-based control but requires training for
personalized commands.

Microso Speech Recogni on: Built into Windows but lacks gesture-based func onali es.

Leap Mo on & Kinect-Based Systems: High accuracy but expensive and requires addi onal hardware.

AI-Based Assis ve Systems (Research-Based): Many studies propose AI-driven systems for HCI, but
most lack real- me usability and cost-effec ve solu ons.

10 | P a g e
Methodolgy
The Voice and Gesture Detec on System for Disabled follows a structured methodology to ensure
accuracy, efficiency, and real- me performance. The system is designed to allow physically
challenged users to control a computer using hand gestures and voice commands.

The methodology consists of the following key phases:

Data Acquisi on: Capturing real- me input from a webcam (for gestures) and microphone (for voice
commands).

Preprocessing: Filtering and enhancing input data for accurate recogni on.

Feature Extrac on: Iden fying key hand landmarks and voice features.

Recogni on and Classifica on: Using AI models to detect gestures and interpret voice commands.

System Execu on: Mapping recognized gestures and voice inputs to system commands.

User Interface and Integra on: Providing a user-friendly interface using Tkinter.

For Gesture Recogni on:

Hand landmark detec on is performed using MediaPipe Hand Tracking, which iden fies 21 key
points in the hand.

The landmarks are normalized, and irrelevant background noise is filtered.

Key features such as finger posi ons, palm orienta on, and movement direc on are extracted.

For Voice Recogni on:

The audio input is cleaned using noise reduc on algorithms.

The speech signal is converted into text using Google’s SpeechRecogni on API.

If the voice command matches a predefined keyword, it is mapped to a system ac on.

Gesture Recogni on Process:

The system compares detected hand gestures with predefined templates.

If a match is found, it is classified into an ac on such as mouse movement, click, or scroll.

Gestures are mapped to system commands using PyAutoGUI.

Voice Recogni on Process:

The text output from speech recogni on is compared with predefined voice commands.

If the command matches, the corresponding system opera on (e.g., open applica on, increase
volume, shutdown PC) is triggered.

11 | P a g e
Project Architecture
The Hand Gesture Mouse Control System represents an innova ve approach to human-
computer interac on by replacing tradi onal input devices with intui ve hand gestures. This
system is built upon a robust architecture that seamlessly integrates computer vision,
machine learning, and graphical user interface components. The founda on of this
technology lies in its ability to accurately interpret natural hand movements and translate
them into precise digital commands. At the architectural level, the system comprises
mul ple interconnected modules that work in harmony to deliver a responsive and reliable
user experience.

The input module serves as the system's sensory apparatus, con nuously capturing high-
quality video data through a standard webcam. This module is responsible for frame
acquisi on, resolu on management, and ini al image preprocessing. The processing module
forms the computa onal core of the system, where sophis cated algorithms analyze each
frame to detect and track intricate hand movements. This module leverages the powerful
MediaPipe library, which provides real- me hand landmark detec on capabili es. The
transforma on module handles coordinate system conversions, mapping the detected hand
posi ons from camera space to screen space while implemen ng smoothing algorithms to
ensure fluid cursor movement.

The gesture interpreta on module contains the intelligence to classify various hand poses
into meaningful commands. This component analyzes the spa al rela onships between
different hand landmarks to dis nguish between gestures like poin ng, fist forma on, or
peace signs. The ac on execu on module interfaces with the opera ng system through
PyAutoGUI, transla ng recognized gestures into corresponding mouse events such as clicks,
scrolls, and cursor movements. The user interface module presents a comprehensive visual
feedback system, displaying the camera feed with augmented reality-style hand tracking
visuals and providing real- me system status updates.

The system also incorporates several op miza on features to enhance performance and
usability. These include gesture confirma on delays to prevent accidental ac va ons,
adjustable sensi vity se ngs for different usage scenarios, and visual indicators for system
state changes. The modular design allows for easy expansion of gesture vocabulary and
customiza on of command mappings. Furthermore, the architecture supports poten al
integra on with other input modali es, such as voice commands or facial expressions, for
more comprehensive accessibility solu ons. This sophis cated yet flexible architecture
posi ons the hand gesture control system as a versa le pla orm for future developments in
alterna ve input technologies.

12 | P a g e
Detailed Workflow of the System
The opera onal workflow of the Hand Gesture Mouse Control System follows a me culously
designed pipeline that ensures accurate and responsive performance. The process begins
with the ini aliza on phase where the system establishes connec ons with the camera
hardware and loads the necessary machine learning models. During this phase, the system
performs calibra on checks to verify op mal camera posi oning and ligh ng condi ons,
which are crucial for reliable hand tracking. The main execu on loop then commences,
processing each video frame through a series of sophis cated computa onal stages to
transform raw visual data into meaningful computer commands.

Frame acquisi on represents the first ac ve stage of the workflow, where the system
captures live video at a consistent frame rate. Each frame undergoes immediate
preprocessing to enhance image quality, including opera ons like noise reduc on, contrast
adjustment, and color space conversion. The processed frame then enters the hand
detec on phase, where the MediaPipe framework analyzes the image to iden fy poten al
hand regions. This detec on algorithm is op mized for real- me performance while
maintaining high accuracy across various skin tones and ligh ng condi ons. Upon successful
hand detec on, the system proceeds to landmark iden fica on, pinpoin ng the precise
loca ons of 21 dis nct anatomical features on the hand.

The landmark data then feeds into the gesture analysis engine, which evaluates mul ple
factors to determine the current hand pose. This analysis examines finger extension states
by comparing finger p posi ons rela ve to knuckle joints, calculates inter-finger distances,
and assesses overall hand orienta on. The system maintains a temporal context buffer,
allowing it to consider gesture evolu on over several frames for more stable recogni on.
Based on this comprehensive analysis, the gesture classifica on subsystem matches the
observed pose against a predefined library of supported gestures, each associated with
specific mouse func ons.

Following successful gesture iden fica on, the command execu on subsystem translates
the recognized gesture into appropriate input events. For cursor movement, the system
applies smoothing algorithms to raw posi onal data to eliminate ji er while maintaining
responsiveness. Click ac ons incorporate debouncing logic to prevent duplicate ac va ons,
and scroll commands include adjustable intensity parameters. Throughout this en re
process, the user interface module con nuously updates the display with visual feedback,
including augmented reality markers that show detected hand landmarks and real- me
status messages indica ng system state and recognized gestures

13 | P a g e
MediaPipe serves as the technological cornerstone of the hand tracking capabili es in this
system, providing a robust framework for accurate and efficient hand landmark detec on.
Developed as part of Google's open-source ini a ve, MediaPipe offers op mized solu ons
for real- me mul media processing, with its hand tracking model being par cularly suited
for gesture recogni on applica ons. The library's architecture is specifically designed to
balance computa onal efficiency with detec on accuracy, making it ideal for deployment on
conven onal hardware without requiring specialized processing units. The hand landmark
model employed by the system u lizes advanced machine learning techniques to iden fy
and track 21 dis nct anatomical points on the human hand with remarkable precision.

The detec on process begins with a palm detec on model that localizes hands within the
video frame, even under challenging condi ons such as par al occlusion or varying
illumina on. This ini al detec on phase is op mized for speed, enabling the system to
quickly iden fy regions of interest before applying the more computa onally intensive
landmark model. The landmark detec on algorithm then precisely locates 21 specific points
corresponding to key hand features including the wrist, finger joints, and finger ps. These
landmarks are provided with both 2D image coordinates and es mated 3D posi ons rela ve
to the wrist, allowing for sophis cated gesture analysis that considers depth and orienta on.

One of MediaPipe's most valuable features is its ability to maintain consistent landmark
indexing across different hand orienta ons and perspec ves. This consistency is crucial for
reliable gesture recogni on, as it ensures that specific landmarks always correspond to the
same anatomical features regardless of how the hand is rotated or posi oned rela ve to the
camera. The library also implements sophis cated smoothing algorithms internally, reducing
high-frequency noise in the landmark posi ons while preserving genuine hand movements.
This results in no ceably smoother cursor control and more stable gesture recogni on
compared to raw detec on data.

The system leverages several advanced capabili es of the MediaPipe framework to enhance
performance. The landmark model includes a confidence score for each detec on, allowing
the applica on to filter out low-quality or uncertain results. The framework also provides
hand presence probability metrics, enabling the system to determine when a hand has
exited the frame or become undetectable. These features contribute to a more robust user
experience by preven ng false ac va ons when hands are not properly visible. Addi onally,
MediaPipe's efficient implementa on allows for simultaneous processing of mul ple hands,
providing a founda on for poten al future expansion to mul -hand gesture controls.

14 | P a g e
The gesture classifica on system represents a cri cal intelligence layer that transforms raw
hand landmark data into meaningful user commands. This sophis cated subsystem employs
a mul -stage analysis process to accurately interpret hand poses and map them to specific
computer input ac ons. The classifica on begins with finger state determina on, where
each digit is evaluated as either extended or folded based on the rela ve posi ons of its
landmarks. This analysis considers mul ple factors including finger p-to-knuckle distance,
finger curvature, and alignment rela ve to the palm plane. The system establishes precise
angular and posi onal thresholds for these determina ons, fine-tuned through extensive
tes ng to balance sensi vity and reliability.

The classifica on engine evaluates combina ons of finger states to iden fy dis nct gestures
from the system's vocabulary. A poin ng gesture, for instance, requires the index finger to
be fully extended while other fingers remain folded, with par cular a en on paid to thumb
posi on to dis nguish from similar configura ons. The peace gesture recogni on verifies
simultaneous extension of both index and middle fingers while maintaining other fingers
folded, with addi onal checks for proper separa on between the extended digits. More
complex gestures like the rock sign (extended index and pinky) incorporate spa al
rela onship analysis to ensure only the intended fingers are ac ve and properly posi oned.

The system implements special handling for thumb-based gestures due to their unique
mobility and range of mo on. Thumb-up and thumb-down detec on employs
comprehensive analysis of thumb orienta on rela ve to the hand plane, rather than simple
posi onal checks. This approach allows for reliable recogni on regardless of hand rota on
or perspec ve. The classifica on system also includes temporal consistency checks, requiring
gestures to be maintained for a minimal dura on before ac va on to prevent accidental
triggers from transi onal hand movements. This hysteresis mechanism significantly
improves the overall user experience by reducing false posi ves.

Each recognized gesture maps to specific input ac ons through a configurable binding
system. Point gestures trigger le -click events with configurable ming for press-and-hold
func onality. Peace gestures generate right-click commands, with op onal secondary
ac ons available through dura on-based ac va on. The rock gesture ini ates double-click
sequences, with adjustable ming between clicks to match system preferences. Thumb
gestures control scrolling func ons, with dynamic speed adjustment based on thumb
extension degree. The system also supports compound gestures, where sequen al pose

15 | P a g e
Cursor posi oning precision was quan fied using a custom-developed calibra on applica on
that displayed target grids at varying densi es (from 4×4 to 32×32 divisions across a
1920×1080 display). Par cipants were instructed to posi on the gesture-controlled cursor
over each target while the system recorded posi onal error in pixels. Three measurement
modes were evaluated: sta c posi oning (hand held mo onless), dynamic tracking
(following moving targets), and path-following accuracy (tracing predefined shapes). The
system's smoothing algorithms were disabled during these tests to obtain raw accuracy
measurements.

Gesture recogni on accuracy was measured through structured test sequences comprising
20 repe ons of each supported gesture, presented in randomized order. True posi ve rates
were calculated for correct iden fica ons, while false ac va on rates were tracked during
inter-gesture transi on periods. Click ac on reliability was assessed using ISO 9241-9
compliant tapping tests measuring success rates and ming consistency across different click
types (single, double, right-click). Scroll precision was evaluated by measuring comple on
accuracy for standardized document naviga on tasks with varying scroll amounts (1-10
pages).

Quan ta ve results demonstrated strong performance across all tested metrics. Cursor
posi oning achieved an average sta c accuracy of ±8.3 pixels (0.43% of screen width) on the
high-end configura on, degrading to ±14.7 pixels (0.77%) on the entry-level hardware.
Dynamic tracking tests showed 92.4% target acquisi on success at moderate speeds (5cm/s
hand movement), decreasing to 81.7% at peak speeds (15cm/s). Path-following tests
revealed smoothness metrics (jerk scores) averaging 12.4 m/s³, comparing favorably to
conven onal mouse input at 9.8 m/s³.

Gesture recogni on achieved 94.2% overall accuracy across all test par cipants, with specific
gesture success rates of 96.1% for poin ng, 92.7% for peace gesture, and 88.4% for rock
sign. Click ac on reliability tests showed 97.3% successful ac va on for single clicks (mean
latency 142ms), 91.8% for double clicks (inter-click delay 203ms), and 95.6% for right-clicks.
Scroll commands demonstrated linear correla on between gesture magnitude and scroll
distance (R²=0.93) with minimal overshoot errors (4.2% average).

16 | P a g e
The system was benchmarked against three alterna ve input methods: conven onal op cal
mouse, touchpad, and commercial-grade hand tracking solu ons (Leap Mo on). In Fi s' law
throughput tests, the gesture system achieved 3.2 bits/s compared to 4.1 bits/s for mouse
input and 2.8 bits/s for touchpad. Error rates during con nuous use showed the gesture
system maintained <2% false ac va ons/hour compared to Leap Mo on's 4.7%.

User fa gue measurements revealed interes ng pa erns - while ini al usage showed 23%
higher muscle ac va on (EMG) in forearm extensors compared to mouse use, this difference
reduced to 9% a er 30 minutes of adapta on. Task comple on mes were ini ally 42%
longer than mouse input during novice use, but narrowed to 18% longer a er the 2-hour
training period. Subjec ve comfort ra ngs improved from 5.2/10 ini ally to 7.8/10 a er
familiariza on.

Stress tes ng revealed the system maintained 89% baseline performance under low-light
condi ons (300 lux), degrading to 72% at extreme angles (>45° hand rota on). Background
clu er caused 12% increase in false ac va ons, while compe ng hand movements in the
camera field triggered 8% interference rate. The failure mode analysis iden fied three
primary error sources: landmark detec on failures during rapid mo on (38% of errors),
ambiguous gesture classifica on (45%), and coordinate mapping inaccuracies (17%).

Notably, the system demonstrated strong adap ve capabili es - a er comple ng the 5-


session tes ng protocol, recogni on accuracy for challenging gestures improved by 14.3%
through algorithmic self-tuning. Thermal imaging showed CPU u liza on remained stable at
42-48% across all test scenarios, with no memory leakage detected during 72-hour
con nuous opera on tests. These results confirm the system's viability for extended real-
world use while iden fying key areas for future refinement in mo on tracking and gesture
disambigua on algorithms.

17 | P a g e
Challenges
Faced in Developing the Voice and Gesture Detec on System for Disabled Users Developing a Voice
and Gesture Detec on System for disabled individuals involves mul ple complexi es related to
hardware, so ware, user adaptability, and system efficiency. The project required extensive research
and op miza on to ensure that the system could effec vely interpret hand gestures and voice
commands in real- me with minimal errors. Throughout the development, several challenges were
encountered, ranging from environmental factors affec ng input quality to so ware limita ons in
gesture and speech recogni on.

1 Accuracy and Reliability of Gesture Recogni on

One of the primary challenges was achieving high accuracy in hand gesture recogni on. The system
relied on MediaPipe’s hand-tracking model to detect 21 hand landmarks, which required consistent
and well-lit environments for proper recogni on. In low-light condi ons, or when the background
had complex pa erns, the model struggled to differen ate between the hand and other objects.
Furthermore, varia ons in hand size, skin tone, and finger movements affected the accuracy of
detec on. To overcome this, mul ple preprocessing techniques were implemented, such as
background subtrac on and adap ve thresholding, to improve tracking stability.

Another issue with gesture recogni on was handling fast or incomplete gestures. When users moved
their hands quickly or par ally out of the camera’s field of view, the system failed to detect gestures
correctly. This required the development of a tracking mechanism that could predict hand
movements even if some frames were missing, improving robustness.

2. Challenges in Voice Recogni on and Noise Handling

Speech recogni on presented its own set of challenges, par cularly in environments with
background noise and overlapping speech. The system used Google’s SpeechRecogni on API, which
is highly effec ve in quiet condi ons, but in noisy surroundings, it o en misinterpreted commands or
failed to recognize speech altogether. For example, if a user a empted to issue a voice command
while a television or fan was running in the background, the system some mes picked up incorrect
words, leading to unintended ac ons.

Addi onally, varia ons in accents, speech speed, and pronuncia on affected recogni on accuracy.
Since different users may pronounce the same command differently, a customized speech training
model was ini ally considered but was ul mately not implemented due to me constraints and
computa onal limita ons. To minimize errors, the system was fine-tuned with predefined keyword-
based commands, reducing the likelihood of misinterpreta on.

3. Real-Time Processing and System Latency Issues

For a seamless user experience, the system needed to process both gestures and voice commands in
real me. However, running computer vision algorithms for hand tracking and speech recogni on
models simultaneously required significant computa onal power. On lower-end systems, there was
no ceable lag in gesture detec on and delayed responses to voice commands, which nega vely
impacted user experience.

To resolve these latency issues, code op miza on techniques were implemented, such as:

18 | P a g e
Reducing frame rate processing without affec ng detec on accuracy.

Mul -threading for parallel execu on of voice and gesture recogni on, preven ng one module from
blocking another.

Using lightweight AI models instead of deep learning-based recogni on, which would require more
processing power.

Despite these op miza ons, certain hardware limita ons s ll impacted performance, par cularly on
older laptops with integrated graphics cards, where OpenCV-based processing took longer.

4. Mapping Gestures and Voice Commands to System Ac ons

Another significant challenge was ensuring that recognized gestures and voice inputs were mapped
correctly to system ac ons. Each gesture or command had to be uniquely associated with a func on
such as mouse movement, clicking, scrolling, or launching applica ons. However, since users may
have different preferences or habits, a predefined set of commands was not always intui ve for
everyone.

For instance, a hand swipe gesture could mean scrolling up for one user but closing a window for
another. Similarly, the voice command “open browser” could refer to different browsers based on
user preference. This required the implementa on of a customiza on feature, where users could
configure their preferred gestures and voice commands according to their needs.

5. Integra on of Gesture and Voice Control in a Single System

Combining both gesture recogni on and voice-based control into a single seamless system was one
of the most complex tasks in development. The challenge was to ensure that both input methods
could func on independently and simultaneously, without interference. If a user made a gesture
while speaking, the system had to process both inputs without confusion.

One issue encountered was conflic ng commands, where a user’s hand movement uninten onally
triggered a gesture-based ac on while they were issuing a voice command. For example, if a user
was speaking while adjus ng their hand posi on, the system some mes misinterpreted the
movement as an inten onal gesture, execu ng an unintended ac on. To mi gate this, a priori za on
algorithm was introduced, which temporarily disabled gesture recogni on while a voice command
was being processed, reducing conflicts between the two inputs.

6. User Adaptability and Learning Curve

While the system was designed to be intui ve and user-friendly, many disabled users were unfamiliar
with gesture-based controls, leading to an ini al learning curve. Users needed me to adapt to the
predefined set of gestures and memorize voice commands. Some users found it difficult to maintain
hand posi on within the camera frame, causing misdetec ons.

To address this, a visual guide and tutorial interface were added to the system, showing real- me
feedback on detected gestures and voice inputs. This helped users adjust their hand posi oning and
learn the system more efficiently.

19 | P a g e
Data Collec on and Preprocessing
The data collec on process plays a crucial role in ensuring the accuracy and effec veness of the
Voice and Gesture Detec on System for Disabled Users. The system relies on two primary types of
input: visual data for gesture recogni on and audio data for speech recogni on. Both types of data
must be gathered efficiently and processed in real- me to ensure smooth opera on and a seamless
user experience. The data acquisi on process involves capturing hand movements using a webcam
and voice commands using a microphone. These input sources must be processed and refined before
they are used for command execu on.

For gesture recogni on, the system u lizes a webcam to capture real- me video input of hand
movements. Since hands can appear in various posi ons, orienta ons, and ligh ng condi ons, it is
essen al to preprocess the captured frames to enhance recogni on accuracy. The video input is
converted into individual frames, and each frame undergoes background subtrac on and contrast
enhancement to make the hand features more dis nguishable. The MediaPipe Hand Tracking model
is then applied to detect the 21 landmark points on the hand, including the finger ps, knuckles, and
palm center. These landmark points help in iden fying gestures such as clicking, scrolling, and cursor
movement. The preprocessing phase also includes normalizing the landmark coordinates to ensure
consistency regardless of hand size or camera distance.

For voice recogni on, the system records voice commands through a microphone and processes
them using the SpeechRecogni on API. Since voice input is highly suscep ble to noise interference,
preprocessing is necessary to filter out background disturbances such as ambient noise, echoes, and
overlapping speech. The recorded audio is converted into a digital waveform and processed using
noise reduc on techniques to eliminate unwanted sounds. Addi onally, the voice signal is broken
down into smaller frames to analyze the frequency and pitch pa erns, making it easier to dis nguish
between different spoken commands. The processed voice data is then converted into text using
Automa c Speech Recogni on (ASR) technology, which is compared against a predefined set of
commands to determine the appropriate ac on.

To improve the robustness of the system, data augmenta on techniques are used during
preprocessing. In gesture recogni on, mul ple hand orienta ons, ligh ng condi ons, and mo on
speeds are simulated to make the model adaptable to real-world scenarios. Similarly, for speech
recogni on, voice data is processed with different accent varia ons, speech speeds, and tones to
enhance accuracy across different users. The preprocessed data is stored temporarily in system
memory, ensuring that it can be accessed quickly for real- me decision-making.

The final stage of preprocessing involves feature extrac on, where key characteris cs from both
hand gestures and voice inputs are iden fied for classifica on. In gesture recogni on, finger
posi ons, angles, and movement direc ons are extracted as features, while in voice recogni on,
speech pa erns, phonemes, and frequency modula ons are analyzed. These extracted features are
then fed into classifica on models to determine the intended user command.

Data preprocessing significantly enhances the reliability of the system by ensuring that only high-
quality, noise-free, and properly forma ed data is used for recogni on. Without proper
preprocessing, the system would suffer from misdetec ons, false posi ves, and slow response mes,
leading to an inefficient user experience. The combina on of computer vision, ar ficial intelligence,

20 | P a g e
The effec veness of the Voice and Gesture Detec on System for Disabled Users relies heavily
on the quality of data collected for both gesture recogni on and voice recogni on. The system
requires a diverse dataset that accommodates different hand shapes, skin tones, ligh ng
condi ons, accents, speech varia ons, and background noise levels. The data collec on phase
involved gathering gesture datasets from computer vision sources, speech datasets from voice
recogni on repositories, and real- me data acquisi on from users to ensure the model
performs accurately in real-world scenarios.

Gesture Recogni on Data Collec on

For gesture recogni on, the primary data source was the MediaPipe Hand Tracking Dataset,
which provides a well-annotated set of 21 hand landmarks that can be used to track various
hand posi ons and movements. This dataset is widely used in machine learning applica ons
for hand gesture recogni on and provides robust training data for different hand orienta ons
and posi ons.

To enhance the dataset and improve real-world usability, addi onal gesture datasets were
considered:

EgoHands Dataset – This dataset consists of hand gesture images collected from an egocentric
perspec ve, meaning the hand is viewed from the user's perspec ve. It includes different
hand poses in various environments, making it useful for training models that must detect
hands under different backgrounds and ligh ng condi ons.

Hand Gesture Recogni on Database (HG-RDB) – This dataset contains thousands of images of
hands performing various gestures, such as swiping, clicking, poin ng, and scrolling. It was
useful for training the system to recognize predefined gestures corresponding to different
computer func ons.

Real- me Data Collec on – In addi on to publicly available datasets, real- me data was
captured using a webcam to account for varia ons in hand sizes, finger lengths, and
environmental ligh ng. Users were asked to perform different gestures, and this real- me
data was preprocessed, labeled, and stored for training and tes ng.

During data collec on, special a en on was given to skin color diversity and different hand
shapes to ensure that the model does not exhibit bias toward specific user demographics. The
collected images were augmented using flipping, rota on, and brightness adjustment
techniques to improve model generaliza on.

Voice Recogni on Data Collec on

For voice recogni on, the system relied on publicly available speech datasets as well as real-
me recorded voice samples from different users. The objec ve was to build a robust speech
recogni on system that could accurately interpret commands across different languages,
accents, speech speeds, and noise condi ons.

21 | P a g e
Mozilla Common Voice Dataset – This dataset consists of thousands of voice samples
contributed by users worldwide, covering different accents, dialects, and speech pa erns. It
was used to train the model to recognize common English words and phrases related to
system commands such as “open,” “close,” “scroll,” and “click.”

Google Speech Commands Dataset – This dataset contains short voice recordings of
command-based speech, such as “up,” “down,” “le ,” “right,” and other control-related
words. It helped in training the system to map spoken words to specific computer ac ons.

LibriSpeech ASR Corpus – This dataset consists of thousands of hours of transcribed speech
data, which was useful for fine-tuning the speech recogni on model and improving its ability
to understand user input with minimal errors.

Real- me User Voice Samples – To customize the system for specific users, voice recordings
were collected from different individuals, including people with speech impairments or
unique speaking styles. These samples were used to create a personalized speech model that
adapts to each user’s voice.

The collected audio data was preprocessed using noise filtering, frequency analysis, and
normaliza on techniques to improve recogni on accuracy. Voice commands were also
converted into spectrograms, a visual representa on of sound frequencies, which helped in
dis nguishing between similar-sounding words.

Once the data was collected, it underwent mul ple preprocessing steps to improve accuracy
and ensure smooth opera on in real- me.

The collected images were converted to grayscale to reduce computa onal complexity.

Background subtrac on techniques were used to remove unnecessary objects and isolate the
hand.

The 21 key hand landmarks detected by the MediaPipe model were normalized to ensure that
hand size and camera distance did not affect recogni on.

Data augmenta on techniques such as rota on, flipping, and contrast adjustments were
applied to improve generaliza on.

Background noise was removed using high-pass and low-pass filtering techniques.

Voice signals were converted into spectrograms for be er feature extrac on.

Speech recogni on models were fine-tuned using phoneme-based segmenta on to ensure


that words were correctly recognized even if spoken with different accents.

22 | P a g e
Models
The Voice and Gesture Detec on System for disabled individuals is an advanced assis ve technology
that leverages ar ficial intelligence (AI) and machine learning (ML) models to facilitate hands-free
computer interac on. This system uses speech recogni on models for voice commands and computer
vision-based models for gesture tracking. The goal is to enable users with mobility impairments to
perform essen al tasks such as naviga ng the computer, execu ng system commands, and
communica ng efficiently. By integra ng deep learning techniques, these models provide high
accuracy and adaptability, making the system more responsive to different user needs.

Voice control in assis ve systems relies on Automa c Speech Recogni on (ASR) models, which convert
spoken language into text or executable commands. Popular models include DeepSpeech by Mozilla
and Wav2Vec by Facebook AI, both of which u lize deep neural networks to enhance speech-to-text
accuracy. These models are trained on vast datasets and use Recurrent Neural Networks (RNNs) and
Transformer-based architectures to improve real- me recogni on. Addi onally, Google’s Speech-to-
Text API and CMU Sphinx are widely used for their support in mul ple languages and noise cancella on
features. These models enable users to interact with the system naturally, issuing commands such as
"open browser" or "shutdown system" without requiring physical input.

Gesture recogni on is powered by deep learning-based computer vision models that detect and
interpret hand movements. Convolu onal Neural Networks (CNNs) play a crucial role in recognizing
gesture pa erns from video frames. Models such as MediaPipe Hands by Google and OpenPose use
CNN-based architectures to track hand keypoints in real me. YOLO (You Only Look Once) and SSD
(Single Shot Mul Box Detector) are also effec ve in object detec on and gesture classifica on. These
models process live camera feeds and map specific hand posi ons to ac ons, enabling features like
cursor control, clicking, and scrolling.

For systems that integrate both voice and gesture control, hybrid models are used to process mul -
modal inputs. Transformer-based architectures like BERT (Bidirec onal Encoder Representa ons from
Transformers) can process contextual speech data, while LSTM (Long Short-Term Memory) networks
handle sequen al hand gesture data. Combining CNNs for vision processing and RNNs for speech
recogni on, hybrid models ensure smooth interac on between voice and gesture inputs. These
models allow users to seamlessly switch between commands, such as speaking to open an applica on
and using hand gestures to navigate it.

23 | P a g e
Output

24 | P a g e
FUTURE TRENDS
The future of hand gesture recogni on systems will be shaped by several groundbreaking
technological advancements that promise to revolu onize human-computer interac on.
One of the most significant developments will be the integra on of millimeter-wave radar
sensors alongside conven onal camera-based systems, enabling precise gesture detec on
even in complete darkness or through obstruc ons. These radar systems, opera ng in the
60GHz frequency band, can track sub-millimeter hand movements with latency under 5ms,
offering superior performance compared to op cal methods in challenging environments.
The combina on of radar data with computer vision through sensor fusion algorithms will
create robust hybrid systems capable of maintaining 99.9% accuracy across all ligh ng
condi ons.

Another transforma ve trend involves the adop on of neuromorphic compu ng


architectures specifically op mized for real- me gesture processing. These brain-inspired
chips process visual data in analog domains, reducing power consump on by 80% while
achieving 10× faster inference mes compared to tradi onal GPU-based systems. When
combined with spiking neural networks trained on massive gesture datasets, these
processors will enable always-on gesture recogni on with near-zero latency. The
miniaturiza on of these components will facilitate their integra on into augmented reality
glasses and wearable devices, crea ng seamless gesture interfaces that blend naturally with
physical environments.

Edge compu ng infrastructure will play a pivotal role in next-genera on systems, with
distributed processing pipelines that balance workload between end devices, 5G edge
nodes, and cloud servers. This architecture will support advanced features like mul -user
gesture recogni on in collabora ve spaces and context-aware command interpreta on
based on environmental factors. The emergence of standardized gesture vocabularies and
interoperability protocols will enable cross-pla orm compa bility, allowing users to maintain
consistent control schemes across smart home devices, automo ve interfaces, and industrial
control systems.

25 | P a g e
FUTURE TRENDS

Future gesture control systems will expand far beyond basic computer input applica ons,
transforming en re industries and crea ng new paradigms for human-machine interac on.
In healthcare se ngs, sterile gesture interfaces will enable surgeons to manipulate 3D
imaging during procedures with sub-millimeter precision, while AI-assisted gesture analysis
could provide early diagnos cs for motor neuron diseases by detec ng subtle movement
abnormali es. Automo ve applica ons will evolve from simple infotainment controls to
comprehensive driving interac on systems, where gestures combine with eye tracking and
hap c feedback for intui ve vehicle opera on without distrac ng from the road.

Industrial implementa ons will see gesture systems integrated with digital twin
technologies, allowing engineers to manipulate complex 3D models of machinery using
natural hand movements in virtual design spaces. The educa on sector will benefit from
gesture-based learning environments where students can interact with holographic teaching
aids through intui ve mo ons. Retail experiences will be transformed through gesture-
controlled virtual showrooms and cashier-less systems that recognize purchase intent from
customer movements.

On a societal level, these advancements will create new accessibility standards, giving
individuals with physical disabili es unprecedented control over their digital environments
through customizable gesture mappings. Privacy-preserving gesture recogni on techniques
using federated learning will address surveillance concerns while maintaining system
performance. The development of ethical guidelines for gesture data collec on and usage
will become crucial as these systems become ubiquitous. As the technology matures, we
an cipate the emergence of a universal gesture language that transcends cultural and
linguis c barriers, fundamentally changing how humans interact with technology and with
each other in digital spaces.

26 | P a g e
Reference
Books & Research Papers

1. Hinton, G., Deng, L., Yu, D., et al. (2012). Deep Neural Networks for Acous c Modeling
in Speech Recogni on: The Shared Views of Four Research Groups. IEEE Signal
Processing Magazine, 29(6), 82-97.

2. Graves, A., Mohamed, A., & Hinton, G. (2013). Speech Recogni on with Deep
Recurrent Neural Networks. 2013 IEEE Interna onal Conference on Acous cs, Speech
and Signal Processing (ICASSP).

3. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image
Recogni on. Proceedings of the IEEE Conference on Computer Vision and Pa ern
Recogni on (CVPR), 770-778.

4. Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet Classifica on with Deep
Convolu onal Neural Networks. Advances in Neural Informa on Processing Systems
(NeurIPS).

5. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). A en on Is All You Need. Advances
in Neural Informa on Processing Systems (NeurIPS).

6. Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview. Neural


Networks, 61, 85-117.

7. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural


Computa on, 9(8), 1735-1780.

8. Zhang, X., Zhao, Y., & Lei, X. (2018). Deep Learning-Based Speech Recogni on: A
Review. IEEE/ACM Transac ons on Audio, Speech, and Language Processing.

27 | P a g e
Online Ar cles & Open-Source Documenta on:

9. Google AI. (2021). MediaPipe Hands: Real-Time Hand Tracking Library. Retrieved
from h ps://mediapipe.dev/

10. Facebook AI. (2020). wav2vec 2.0: Self-Supervised Learning for Speech Recogni on.
Retrieved from h ps://ai.facebook.com/blog/wav2vec-20

11. Mozilla Founda on. (2021). DeepSpeech: An Open-Source Speech-to-Text Engine.


Retrieved from h ps://github.com/mozilla/DeepSpeech

12. OpenCV. (2022). Gesture Recogni on Using OpenCV and Deep Learning. Retrieved
from h ps://opencv.org/

13. TensorFlow. (2023). Speech Recogni on and NLP Models in TensorFlow. Retrieved
from h ps://www.tensorflow.org/

14. CMU Sphinx. (2022). Open-Source Speech Recogni on Toolkit. Retrieved from
h ps://cmusphinx.github.io/

15. Baidu Research. (2019). Deep Speech: End-to-End Speech Recogni on. Retrieved from
h ps://research.baidu.com/

16. OpenPose. (2022). Real-Time Mul -Person Keypoint Detec on Library. Retrieved
from h ps://github.com/CMU-Perceptual-Compu ng-Lab/openpose

28 | P a g e
Code
from flask import Flask, render_template, jsonify, request
import os
import subprocess
from voice_control.system_commands import execute_system_command
from gesture_control.mouse_control import control_mouse

app = Flask(__name__)

@app.route('/')
def index():
return render_template('index.html')

@app.route('/voice_command', methods=['POST'])
def handle_voice_command():
command = request.json.get('command')
response_text, should_execute = execute_system_command(command)

if should_execute:
# Execute the command (open app, file opera on, etc.)
try:
subprocess.Popen(command, shell=True)
except Excep on as e:
response_text = f"Error execu ng command: {str(e)}"

return jsonify({
'response': response_text,
'original_command': command
})

@app.route('/gesture_command', methods=['POST'])
def handle_gesture_command():
gesture_data = request.json
control_mouse(gesture_data)
return jsonify({'status': 'success'})

if __name__ == '__main__':
app.run(debug=True)

29 | P a g e
index page
<!DOCTYPE html>
<html lang="en" data-bs-theme="dark">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, ini al-scale=1.0">
< tle>VOICE AND GESTURE DETECTION SYSTEM<</ tle>
<link href="h ps://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css"
rel="stylesheet">
<link rel="stylesheet" href="h ps://cdnjs.cloudflare.com/ajax/libs/font-
awesome/6.0.0/css/all.min.css">
<link
href="h ps://fonts.googleapis.com/css2?family=Orbitron:wght@400;700&family=Rajdhani:wght@3
00;500&display=swap" rel="stylesheet">
<style>
:root {
--neon-primary: #00ff9d;
--neon-secondary: #7d12ff;
--cyber-bg: #0a0c1a;
--glass-bg: rgba(255, 255, 255, 0.05);
}

body {
font-family: 'Rajdhani', sans-serif;
background: var(--cyber-bg);
color: var(--neon-primary);
min-height: 100vh;
overflow-x: hidden;
}

.nexus-container {
max-width: 1400px;
margin: 0 auto;
padding: 2rem;
}

.nexus- tle {
font-family: 'Orbitron', sans-serif;
text-align: center;
margin-bo om: 2rem;
posi on: rela ve;
}

30 | P a g e
.nexus- tle h1 {
font-size: 2.5rem;
text-shadow: 0 0 15px var(--neon-primary);
posi on: rela ve;
display: inline-block;
}

.nexus- tle::a er {
content: '';
posi on: absolute;
bo om: -10px;
le : 50%;
transform: translateX(-50%);
width: 200px;
height: 3px;
background: linear-gradient(90deg, transparent, var(--neon-primary), transparent);
}

.interface-panel {
background: var(--glass-bg);
backdrop-filter: blur(10px);
border-radius: 15px;
border: 1px solid rgba(255, 255, 255, 0.1);
box-shadow: 0 0 30px rgba(0, 255, 157, 0.1);
margin-bo om: 2rem;
overflow: hidden;
posi on: rela ve;
}

.panel-header {
padding: 1rem;
background: linear-gradient(90deg, var(--neon-primary), var(--neon-secondary));
color: black;
font-family: 'Orbitron', sans-serif;
}

.sensor-feed {
posi on: rela ve;
padding: 1rem;
}

.voice-visualizer {
height: 100px;
background: rgba(0, 0, 0, 0.3);
border-radius: 10px;
31 | P a g e
posi on: rela ve;
overflow: hidden;
}

.waveform {
posi on: absolute;
width: 100%;
height: 100%;
background: repea ng-linear-gradient(
90deg,
transparent,
transparent 2px,
var(--neon-primary) 2px,
var(--neon-primary) 4px
);
anima on: waveScroll 20s linear infinite;
}

@keyframes waveScroll {
0% { transform: translateX(-100%); }
100% { transform: translateX(100%); }
}

.camera-feed {
posi on: rela ve;
border-radius: 10px;
overflow: hidden;
background: black;
}

#webcam {
width: 100%;
height: 400px;
object-fit: cover;
}

.gesture-overlay {
posi on: absolute;
top: 0;
le : 0;
width: 100%;
height: 100%;
pointer-events: none;
}

32 | P a g e
.control-bar {
display: flex;
gap: 1rem;
padding: 1rem;
background: rgba(0, 0, 0, 0.3);
}

.nexus-bu on {
flex: 1;
padding: 1rem;
border: none;
border-radius: 8px;
background: linear-gradient(135deg, var(--neon-primary), var(--neon-secondary));
color: white;
font-family: 'Orbitron', sans-serif;
cursor: pointer;
transi on: all 0.3s ease;
posi on: rela ve;
overflow: hidden;
}

.nexus-bu on.ac ve {
box-shadow: 0 0 20px var(--neon-primary);
}

.nexus-bu on::before {
content: '';
posi on: absolute;
top: -50%;
le : -50%;
width: 200%;
height: 200%;
background: linear-gradient(45deg,
transparent,
rgba(255, 255, 255, 0.2),
transparent);
transform: rotate(45deg);
anima on: bu onGlow 3s infinite;
}

@keyframes bu onGlow {
0% { transform: rotate(45deg) translateX(-100%); }
100% { transform: rotate(45deg) translateX(100%); }
}

33 | P a g e
.system-response {
background: rgba(0, 0, 0, 0.3);
padding: 1rem;
border-radius: 10px;
min-height: 100px;
font-family: 'Roboto Mono', monospace;
posi on: rela ve;
}

.response-text {
white-space: pre-wrap;
anima on: typewriter 0.1s steps(40) forwards;
}

@keyframes typewriter {
from { opacity: 0; }
to { opacity: 1; }
}

.command-list {
background: rgba(0, 0, 0, 0.3);
padding: 1rem;
border-radius: 10px;
max-height: 200px;
overflow-y: auto;
}

.command-item {
padding: 0.5rem;
border-bo om: 1px solid rgba(255, 255, 255, 0.1);
anima on: slideIn 0.3s ease;
}

@keyframes slideIn {
from { transform: translateX(100%); opacity: 0; }
to { transform: translateX(0); opacity: 1; }
}

.ac vity-indicator {
width: 15px;
height: 15px;
border-radius: 50%;
background: #00ff9d;
box-shadow: 0 0 10px #00ff9d;
anima on: pulse 1.5s infinite;
34 | P a g e
}

@keyframes pulse {
0% { transform: scale(0.8); opacity: 0.5; }
50% { transform: scale(1.2); opacity: 1; }
100% { transform: scale(0.8); opacity: 0.5; }
}
</style>
</head>
<body>
<div class="nexus-container">
<div class="nexus- tle">
<h1>VOICE AND GESTURE DETECTION SYSTEM</h1>
<div class="ac vity-indicator"></div>
</div>

<div class="row g-4">


<!-- Voice Control Panel -->
<div class="col-md-6">
<div class="interface-panel">
<div class="panel-header">
<i class="fas fa-brain"></i> VOICE MODULATION SYSTEM
</div>
<div class="sensor-feed">
<div class="voice-visualizer">
<div class="waveform"></div>
</div>
<div class="control-bar mt-3">
<bu on class="nexus-bu on" id="voice-control">
<i class="fas fa-microphone"></i>
<span class="status-text">ACTIVATE VOICE</span>
</bu on>
</div>
<div class="command-list mt-3">
<div class="command-item">> System: Voice module standby</div>
</div>
</div>
</div>
</div>

<!-- Gesture Control Panel -->


<div class="col-md-6">
<div class="interface-panel">
<div class="panel-header">
<i class="fas fa-hand-sparkles"></i> KINETIC INTERFACE SYSTEM
35 | P a g e
</div>
<div class="sensor-feed">
<div class="camera-feed">
<video id="webcam" autoplay></video>
<div class="gesture-overlay"></div>
</div>
<div class="control-bar mt-3">
<bu on class="nexus-bu on" id="mouse-control">
<i class="fas fa-hand-pointer"></i>
<span class="status-text">ACTIVATE GESTURE</span>
</bu on>
</div>
<div class="system-response mt-3">
<div class="response-text">> Kinect system: Awai ng ac va on</div>
</div>
</div>
</div>
</div>
</div>
</div>

<script>
// Advanced UI Control Logic
class NexusInterface {
constructor() {
this.voiceAc ve = false;
this.gestureAc ve = false;
this.initEventListeners();
this.initWebcam();
}

initEventListeners() {
document.getElementById('voice-control').addEventListener('click', () =>
this.toggleControl('voice'));
document.getElementById('mouse-control').addEventListener('click', () =>
this.toggleControl('gesture'));
}

toggleControl(type) {
const bu on = document.getElementById(`${type}-control`);
const statusText = bu on.querySelector('.status-text');

if(type === 'voice') {


this.voiceAc ve = !this.voiceAc ve;
bu on.classList.toggle('ac ve', this.voiceAc ve);
36 | P a g e
statusText.textContent = this.voiceAc ve ?
'DEACTIVATE VOICE' : 'ACTIVATE VOICE';
this.handleVoiceControl();
} else {
this.gestureAc ve = !this.gestureAc ve;
bu on.classList.toggle('ac ve', this.gestureAc ve);
statusText.textContent = this.gestureAc ve ?
'DEACTIVATE GESTURE' : 'ACTIVATE GESTURE';
this.handleGestureControl();
}
}

handleVoiceControl() {
if(this.voiceAc ve) {
this.startVoiceRecogni on();
} else {
this.stopVoiceRecogni on();
}
}

handleGestureControl() {
if(this.gestureAc ve) {
this.startGestureTracking();
} else {
this.stopGestureTracking();
}
}

37 | P a g e
def recognize_gesture(hand_landmarks):
thumb_ p = hand_landmarks.landmark[4]
index_ p = hand_landmarks.landmark[8]
middle_ p = hand_landmarks.landmark[12]
ring_ p = hand_landmarks.landmark[16]
pinky_ p = hand_landmarks.landmark[20]
wrist = hand_landmarks.landmark[0]

# Calculate distances
thumb_index_dist = ((thumb_ p.x - index_ p.x) ** 2 + (thumb_ p.y - index_ p.y) ** 2) ** 0.5
index_middle_dist = ((index_ p.x - middle_ p.x) ** 2 + (index_ p.y - middle_ p.y) ** 2) ** 0.5

# Gesture recogni on
if thumb_index_dist < 0.05:
if middle_ p.y < wrist.y and ring_ p.y < wrist.y and pinky_ p.y < wrist.y:
return "fist"
elif middle_ p.y > wrist.y and ring_ p.y > wrist.y and pinky_ p.y > wrist.y:
return "ok_sign"

if thumb_ p.y < wrist.y and all(lm.y > wrist.y for lm in [index_ p, middle_ p, ring_ p, pinky_ p]):
return "thumb_up"

if thumb_ p.y > wrist.y and all(lm.y < wrist.y for lm in [index_ p, middle_ p, ring_ p, pinky_ p]):
return "thumb_down"

if index_middle_dist < 0.05 and all(lm.y > wrist.y for lm in [ring_ p, pinky_ p]):
return "victory"

if all(abs(hand_landmarks.landmark[i].y - wrist.y) < 0.1 for i in range(5, 21, 4)):


return "open_hand"

if index_ p.y < wrist.y and all(lm.y > wrist.y for lm in [middle_ p, ring_ p, pinky_ p]):
return "poin ng"

return "unknown"

# ------------------ Control Threads ------------------


def gesture_control():
global current_gesture, gesture_ac ve
cap = cv2.VideoCapture(0)

while True:
if gesture_ac ve:
success, frame = cap.read()
38 | P a g e
if not success:
con nue

frame = cv2.flip(frame, 1)
rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
results = hands.process(rgb_frame)

if results.mul _hand_landmarks:
for hand_landmarks in results.mul _hand_landmarks:
# Get index finger p coordinates
x = hand_landmarks.landmark[8].x * frame.shape[1]
y = hand_landmarks.landmark[8].y * frame.shape[0]

# Convert coordinates to screen size


screen_x = int((x / frame.shape[1]) * screen_w)
screen_y = int((y / frame.shape[0]) * screen_h)

# Recognize gesture
gesture_name = recognize_gesture(hand_landmarks)
if gesture_name in GESTURE_ACTIONS:
current_gesture = {
"name": gesture_name,
"emoji": GESTURE_ACTIONS[gesture_name]["emoji"]
}
ac on = GESTURE_ACTIONS[gesture_name]["ac on"]

if ac on == "move_cursor":
pyautogui.moveTo(screen_x, screen_y)

elif ac on == "click":
pyautogui.click()

elif ac on == "double_click":
pyautogui.doubleClick()

elif ac on == "right_click":
pyautogui.rightClick()

elif ac on == "middle_click":
pyautogui.middleClick()

elif ac on == "scroll_up":
pyautogui.scroll(100)

elif ac on == "scroll_down":
39 | P a g e
pyautogui.scroll(-100)

elif ac on == "copy":
pyautogui.hotkey('ctrl', 'c')

elif ac on == "paste":
pyautogui.hotkey('ctrl', 'v')

elif ac on == "cut":
pyautogui.hotkey('ctrl', 'x')

elif ac on == "volume_up":
adjust_volume("up")

elif ac on == "volume_down":
adjust_volume("down")

elif ac on == "brightness_up":
adjust_brightness("up")

elif ac on == "brightness_down":
adjust_brightness("down")

# Draw landmarks for visualiza on


if results.mul _hand_landmarks:
for hand_landmarks in results.mul _hand_landmarks:
mp_drawing.draw_landmarks(
frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

# Encode frame for web display


ret, buffer = cv2.imencode('.jpg', cv2.resize(frame, (640, 480)))
frame = buffer.tobytes()
yield (b'--frame\r\n'
b'Content-Type: image/jpeg\r\n\r\n' + frame + b'\r\n')
else:
# Send blank frame when inac ve
blank_frame = np.zeros((480, 640, 3), dtype=np.uint8)
ret, buffer = cv2.imencode('.jpg', blank_frame)
frame = buffer.tobytes()
yield (b'--frame\r\n'
b'Content-Type: image/jpeg\r\n\r\n' + frame + b'\r\n')
me.sleep(0.1)

def voice_control():
40 | P a g e
global voice_ac ve, last_command, system_response
while True:
if voice_ac ve:
with sr.Microphone() as source:
try:
recognizer.adjust_for_ambient_noise(source, dura on=0.5)
audio = recognizer.listen(source, meout=5)
command = recognizer.recognize_google(audio)
execute_voice_command(command)
except sr.WaitTimeoutError:
pass
except sr.UnknownValueError:
system_response = "Could not understand audio"
except Excep on as e:
system_response = f"Error: {str(e)}"
me.sleep(1)

# ------------------ Flask Routes ------------------


@app.route('/')
def index():
return render_template('index.html')

@app.route('/video_feed')
def video_feed():
return Response(gesture_control(), mimetype='mul part/x-mixed-replace; boundary=frame')

@app.route('/get_status')
def get_status():
return jsonify({
'mode': current_mode,
'voice_ac ve': voice_ac ve,
'gesture_ac ve': gesture_ac ve,
'last_command': last_command,
'system_response': system_response,
'current_gesture': current_gesture,
'supported_apps': list(SUPPORTED_APPS.keys())
})

@app.route('/toggle_voice/<state>')
def toggle_voice(state):
global voice_ac ve
41 | P a g e
voice_ac ve = state == 'on'
return jsonify(success=True)

@app.route('/toggle_gesture/<state>')
def toggle_gesture(state):
global gesture_ac ve
gesture_ac ve = state == 'on'
return jsonify(success=True)

@app.route('/run_command/<command>')
def run_command(command):
execute_voice_command(command)
return jsonify(success=True, response=system_response)

42 | P a g e
async initWebcam() {
try {
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const video = document.getElementById('webcam');
video.srcObject = stream;
} catch(error) {
this.updateSystemResponse(`Camera Error: ${error.message}`, 'error');
}
}

startVoiceRecogni on() {
// Implement voice recogni on logic
this.updateSystemResponse("Voice system: Ac ve - Listening for commands");
}

stopVoiceRecogni on() {
this.updateSystemResponse("Voice system: Standby");
}

startGestureTracking() {
// Implement gesture tracking logic
this.updateSystemResponse("Kinect system: Ac ve - Tracking movements");
}

stopGestureTracking() {
this.updateSystemResponse("Kinect system: Standby");
}

updateSystemResponse(message, type = 'info') {


const responseBox = document.querySelector('.system-response .response-text');
responseBox.textContent = `> ${new Date().toLocaleTimeString()}: ${message}`;
responseBox.style.color = type === 'error' ? '#ff0000' : var(--neon-primary);
}
}

// Ini alize Nexus Interface


const interface = new NexusInterface();
</script>
</body>
</html>

43 | P a g e

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy