
HUMAN HAND GESTURES

CAPTURING AND RECOGNITION VIA

CAMERA
A PROJECT REPORT

Submitted by

ARAVINTHAN.V(620816104009)
JEEVANANDHAN.S(620816104037)
PALPANDI.K(620816104068)
POOVARASAN.P.M(620816104072)

in partial fulfillment for the award of the degree


of

BACHELOR OF ENGINEERING

IN
COMPUTER SCIENCE AND ENGINEERING

GNANAMANI COLLEGE OF TECHNOLOGY


NAMAKKAL - 637 018

ANNA UNIVERSITY : CHENNAI 600 025

APRIL 2020
HUMAN HAND GESTURES
CAPTURING AND RECOGNITION VIA
CAMERA
A PROJECT REPORT

Submitted by

ARAVINTHAN.V(620816104009)
JEEVANANDHAN.S(620816104037)
PALPANDI.K(620816104068)
POOVARASAN.P.M(620816104072)

in partial fulfillment for the award of the degree


of

BACHELOR OF ENGINEERING

IN
COMPUTER SCIENCE AND ENGINEERING

GNANAMANI COLLEGE OF TECHNOLOGY


NAMAKKAL - 637 018

ANNA UNIVERSITY : CHENNAI 600 025


APRIL 2020
ANNA UNIVERSITY : CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “HUMAN HAND GESTURES
CAPTURING AND RECOGNITION VIA CAMERA” is the bonafide work
of “ARAVINTHAN.V (620816104009), JEEVANANDHAN.S (620816104037),
PALPANDI.K (620816104068), POOVARASAN.P.M (620816104072)” who
carried out the project work under my supervision.

SIGNATURE SIGNATURE
Dr.R.UMAMAHESWARI, Ph.D., Ms.J.JAYARANJANI,

HEAD OF THE DEPARTMENT SUPERVISOR

Head Of the Department, Assistant Professor,

Computer Science and Engineering, Computer Science and Engineering,

Gnanamani College of Technology, Gnanamani College of Technology,

Namakkal-637 018. Namakkal-637 018.

Submitted for the University project viva voce Examination held on


……………..

Internal Examiner External Examiner


ACKNOWLEDGEMENT

We are grateful to the Almighty for the grace and sustained blessings throughout the
project, which gave us immense strength to execute the work successfully. We would like to
express our deep sense of heartiest thanks to our beloved Chairman Dr.T.ARANGANNAL and
Chairperson Mrs.P.MALALEENA, Gnanamani Educational Institutions, Namakkal, for giving
us the opportunity to do and complete this project.

We would like to express our sincere gratitude to our Chief Executive
Officer Dr.K.VIVEKANANDAN and Chief Administrative Officer
Dr.P.PREMKUMAR, Gnanamani Educational Institutions, Namakkal, for providing us with
invaluable support.

We would like to express our deep sense of gratitude and profound thanks to our Principal,
Dr.T.K.KANNAN, Gnanamani College of Technology, Namakkal, for creating a beautiful
atmosphere which inspired us to take up this project.

We take this opportunity to convey our heartiest thanks to
Dr.R.UMAMAHESWARI, Professor & Head, Department of Computer
Science and Engineering, Gnanamani College of Technology, Namakkal, for
her valuable support, unflagging attention and direction, which kept this
project on track.

We are extremely grateful to our guide, Ms.J.JAYARANJANI, Assistant
Professor, Gnanamani College of Technology, Namakkal, for her much needed
guidance and especially for entrusting us with this project. We also extend our regards to our
Project Coordinators, Faculty Members, Parents and Friends, who offered unflinching
moral support for the completion of this project.
ABSTRACT

Deaf and mute people generally use sign language for communication, but they find it
difficult to communicate with others who do not understand sign language. Sign language plays a
major role in enabling mute people to communicate with hearing people, yet conveying a message
is hard because most hearing people are not trained in sign language, and in an emergency this
becomes even more difficult. Communication between deaf-mute and hearing people has therefore
always been a challenging task. We propose to develop a system that can convert the hand
gestures of a deaf-mute person into speech, so that sign language is translated into an audible
voice. We propose a multimodal deep learning architecture for sign language recognition which
effectively combines RGB-D input with two-stream spatiotemporal networks. Depth videos, as an
effective complement to RGB input, supply additional distance information about the signer's
hands. A novel sampling method called ARSS (Aligned Random Sampling in Segments) is put
forward to select and align optimal RGB-D video frames, which improves the capacity utilization
of the multimodal data and reduces redundancy. The hand ROI is obtained from the joint
information of the RGB data to provide local focus in the spatial stream. D-shift Net is proposed
for depth motion feature extraction in the temporal stream, fully utilizing the three-dimensional
motion information of the sign language. Finally, the recognized output is converted into text and
speech. This system eliminates the communication barrier between hearing-impaired or mute
people and hearing people.
TABLE OF CONTENTS
CHAPTER NO    TITLE    PAGE NO
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION 1
1.1 HUMAN COMPUTER INTERACTION 1
1.2 GESTURE RECOGNITION
1.3 GESTURE RECOGNITION ALGORITHMS
1.3.1 3D model based algorithms
1.3.2 Skeletal-based algorithms
1.3.3 Appearance-based models
1.4 APPLICATION BASED ON THE HANDS-FREE
INTERFACE
1.4.1 Interactive exposition
1.4.2 Non-verbal communication
1.5 COLOR MODELS
1.6 HAND MODELING FOR GESTURE RECOGNITION

2 PROBLEM DEFINITION AND DESCRIPTION 13
2.1 LITERATURE SURVEY
2.1.1 Systematic review of Kinect applications in
elderly care and stroke rehabilitation
2.1.2 ChaLearn gesture challenge: Design and first
results
2.1.3 Gesture recognition: A survey
2.1.4 Results and analysis of the ChaLearn gesture
challenge 2012
2.1.5 A robust background subtraction and shadow
detection
2.1.6 Naked image detection based on adaptive and
extensible skin color model
2.1.7 A framework for hand gesture recognition
based on accelerometer and EMG sensors
2.1.8 One-shot learning gesture recognition from
RGB-D data using bag of features
2.1.9 Discovering motion primitives for
unsupervised grouping and one-shot learning

2.2 EXISTING SYSTEM


2.2.1 Disadvantages
2.3 PROPOSED SYSTEM
2.4 FEASIBILITY STUDY
2.4.1 Technical feasibility
2.4.2 Economic feasibility
2.4.3 Operational feasibility
3 DEVELOPMENT ENVIRONMENT 36
3.1 HARDWARE REQUIREMENTS
3.2 SOFTWARE REQUIREMENTS
3.3 JAVA
3.3.1 Objectives Of Java
3.3.2 Java Server Pages- An Overview
3.3.3 Evolution Of Web Application
3.3.4 Benefits Of JSP
3.3.5 Servlets
3.3.6 Java Servlets
4 SYSTEM DESIGN AND IMPLEMENTATION 45
4.1 ACTIVITY DIAGRAM
5 MODULES DESCRIPTION 53
5.1 IMAGE ACQUISITION
5.2 FOREGROUND SEGMENTATION
5.3 FACE AND HAND DETECTION
5.4 HAND TRAJECTORY CLASSIFICATION
5.5 EVALUATION CRITERIA
6 SYSTEM TESTING 59
6.1 TESTING TYPES
6.2 SYSTEM TESTING
6.3 UNIT TESTING
6.4 INTEGRATION TESTING
6.5 VALIDATION TESTING
7 IMPLEMENTATION RESULTS 60
8 CONCLUSION 61
9 FUTURE ENHANCEMENTS 62
10 OUTPUT SCREENSHOTS 63
REFERENCE
LIST OF FIGURES

FIGURE NO    TITLE    PAGE NO

4.1    DATA FLOW DIAGRAM    45

LIST OF ABBREVIATIONS
ABBREVIATIONS    EXPANSIONS
RGB    Red Green Blue
HSV    Hue Saturation Value
CIE    International Commission on Illumination

CHAPTER 1

INTRODUCTION
1.1 Human Computer Interaction

Human–computer interaction (HCI) involves the study, planning, design and uses of the
interaction between people (users) and computers. It is often regarded as the intersection
of computer science, behavioral sciences, design, media studies, and several other fields of
study. Humans interact with computers in many ways, and the interface between humans and the
computers they use is crucial to facilitating this interaction. Desktop applications, internet
browsers, handheld computers, and computer kiosks make use of the prevalent graphical user
interfaces (GUI) of today. Voice user interfaces (VUI) are used for speech recognition and
synthesizing systems, and the emerging multi-modal and gestalt User Interfaces (GUI) allow
humans to engage with embodied character agents in a way that cannot be achieved with other
interface paradigms.

The Association for Computing Machinery defines human-computer interaction as "a


discipline concerned with the design, evaluation and implementation of interactive computing
systems for human use and with the study of major phenomena surrounding them". An important
facet of HCI is the securing of user satisfaction (or simply End User Computing Satisfaction).
"Because human–computer interaction studies a human and a machine in communication, it
draws from supporting knowledge on both the machine and the human side. On the machine
side, techniques in computer graphics, operating systems, programming languages, and
development environments are relevant. On the human side, communication theory, graphic and
industrial design disciplines, linguistics, social sciences, cognitive psychology, social
psychology, and human factors such as computer user satisfaction are relevant. And, of course,
engineering and design methods are relevant." Due to the multidisciplinary nature of HCI, people
with different backgrounds contribute to its success. HCI is also sometimes referred to as
human–machine interaction (HMI), man–machine interaction (MMI) or computer–human
interaction (CHI). A UCLA 2014 study of sixth graders and their use of screen-devices found a
lack of face-to-face contact deprived the youngsters of emotional cues including facial
expressions and body language.
Poorly designed human-machine interfaces can lead to many unexpected problems. A
classic example of this is the Three Mile Island accident, a nuclear meltdown accident, where
investigations concluded that the design of the human–machine interface was at least partially
responsible for the disaster. Similarly, accidents in aviation have resulted from manufacturers'
decisions to use non-standard flight instrument or throttle quadrant layouts: even though the new
designs were proposed to be superior in regard to basic human–machine interaction, pilots had
already internalized the "standard" layout, and thus the conceptually good idea actually had
undesirable results.

HCI (Human Computer Interaction) aims to improve the interactions between users and
computers by making computers more usable and receptive to users' needs. Specifically, HCI is
interested in: methodologies and processes for designing interfaces (i.e., given a task and a class of
users, design the best possible interface within given constraints, optimizing for a desired
property such as learnability or efficiency of use); methods for implementing interfaces (e.g.,
software toolkits and libraries); techniques for evaluating and comparing interfaces; developing
new interfaces and interaction techniques; and developing descriptive and predictive models and
theories of interaction. A long-term goal of HCI is to design systems that minimize the barrier
between the human's mental model of what they want to accomplish and the computer's support
of the user's task. Professional practitioners in HCI are usually designers concerned with the
practical application of design methodologies to problems in the world. Their work often
revolves around designing graphical user interfaces and web interfaces. Researchers in HCI are
interested in developing new design methodologies, experimenting with new devices,
prototyping new software systems, exploring new interaction paradigms, and developing models
and theories of interaction.

1.2 Gesture Recognition

Gesture recognition is a topic in computer science and language technology with the goal of
interpreting human gestures via mathematical algorithms. Gestures can originate from any bodily
motion or state but commonly originate from the face or hand. Current focuses in the field
include emotion recognition from the face and hand gesture recognition. Many approaches have
been made using cameras and computer vision algorithms to interpret sign language. However,
the identification and recognition of posture, gait, proxemics, and human behaviors is also the
subject of gesture recognition techniques. Gesture recognition can be seen as a way for
computers to begin to understand human body language, thus building a richer bridge between
machines and humans than primitive text user interfaces or even GUIs (graphical user
interfaces), which still limit the majority of input to keyboard and mouse.

Gesture recognition enables humans to communicate with the machine (HMI) and interact
naturally without any mechanical devices. Using the concept of gesture recognition, it is possible
to point a finger at the computer screen so that the cursor will move accordingly. This could
potentially make conventional input devices such as mouse, keyboards and even touch-screens
redundant. Gesture recognition can be conducted with techniques from computer vision and
image processing. The literature includes ongoing work in the computer vision field on capturing
gestures or more general human pose and movements by cameras connected to a computer.
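As a small illustration of the cursor-control idea above, the following sketch maps a fingertip position detected in a camera frame to screen coordinates and moves the mouse pointer with java.awt.Robot. It is only a minimal sketch: the fingertip coordinates are assumed to come from some detector (not shown), and the linear scaling is an illustrative choice, not the method used in this project.

import java.awt.AWTException;
import java.awt.Dimension;
import java.awt.Robot;
import java.awt.Toolkit;

// Minimal sketch: map a fingertip position detected in a camera frame to
// screen coordinates and move the mouse pointer accordingly.
// frameWidth/frameHeight are the resolution of the camera image in which the
// fingertip was detected; the detector itself is assumed and not shown.
public class FingertipCursorControl {

    public static void moveCursorTo(double fingerX, double fingerY,
                                    int frameWidth, int frameHeight) throws AWTException {
        Dimension screen = Toolkit.getDefaultToolkit().getScreenSize();
        // Scale camera coordinates to screen coordinates.
        int screenX = (int) (fingerX / frameWidth * screen.getWidth());
        int screenY = (int) (fingerY / frameHeight * screen.getHeight());
        new Robot().mouseMove(screenX, screenY);
    }
}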

Gesture recognition and pen computing: Pen computing not only reduces the hardware
footprint of a system, it also extends interaction to physical-world objects instead of digital
devices such as keyboards and mice, and it even suggests hardware designs that need no monitor
at all, an idea that may lead to the creation of holographic displays. The term gesture recognition
has been used to refer more narrowly to non-text-input handwriting symbols, such as inking on a
graphics tablet, multi-touch gestures, and mouse gesture recognition. This is computer interaction
through the drawing of symbols with a pointing device cursor.

In computer interfaces, two types of gestures are distinguished:[9] We consider online


gestures, which can also be regarded as direct manipulations like scaling and rotating. In
contrast, offline gestures are usually processed after the interaction is finished; e. g. a circle is
drawn to activate a context menu.

Offline gestures: Those gestures that are processed after the user interaction with the object.
An example is the gesture to activate a menu.

Online gestures: Direct manipulation gestures. They are used to scale or rotate a tangible
object.

The ability to track a person's movements and determine what gestures they may be
performing can be achieved through various tools. Although there is a large amount of research
done in image/video based gesture recognition, there is some variation within the tools and
environments used between implementations.

Wired gloves: These can provide input to the computer about the position and rotation of the
hands using magnetic or inertial tracking devices. Furthermore, some gloves can detect finger
bending with a high degree of accuracy (5-10 degrees), or even provide haptic feedback to the
user, which is a simulation of the sense of touch. The first commercially available hand-tracking
glove-type device was the Data Glove, a glove-type device which could detect hand position,
movement and finger bending. This uses fiber optic cables running down the back of the hand.
Light pulses are created and when the fingers are bent, light leaks through small cracks and the
loss is registered, giving an approximation of the hand pose.

Depth-aware cameras: Using specialized cameras such as structured light or time-of-flight


cameras, one can generate a depth map of what is being seen through the camera at a short range,
and use this data to approximate a 3d representation of what is being seen. These can be effective
for detection of hand gestures due to their short range capabilities.

Stereo cameras: Using two cameras whose relations to one another are known, a 3d
representation can be approximated by the output of the cameras. To get the cameras' relations,
one can use a positioning reference such as a lexian-stripe or infrared emitters. In combination
with direct motion measurement (6D-Vision) gestures can directly be detected.

Controller-based gestures: These controllers act as an extension of the body so that when
gestures are performed, some of their motion can be conveniently captured by software. Mouse
gestures are one such example, where the motion of the mouse is correlated to a symbol being
drawn by a person's hand, as is the Wii Remote or the Myo, which can study changes in
acceleration over time to represent gestures. Devices such as the LG Electronics Magic Wand,
the Loop and the Scoop use Hillcrest Labs' Free space technology, which uses MEMS
accelerometers, gyroscopes and other sensors to translate gestures into cursor movement. The
software also compensates for human tremor and inadvertent movement. Audio Cubes are
another example. The sensors of these smart light emitting cubes can be used to sense hands and
fingers as well as other objects nearby, and can be used to process data. Most applications are in
music and sound synthesis, but can be applied to other fields.
Single camera: A standard 2D camera can be used for gesture recognition where the
resources/environment would not be convenient for other forms of image-based recognition.
Earlier it was thought that single camera may not be as effective as stereo or depth aware
cameras, but some companies are challenging this theory. Software-based gesture recognition
technology using a standard 2D camera that can detect robust hand gestures, hand signs, as well
as track hands or fingertip at high accuracy has already been embedded in Lenovo’s Yoga
ultrabooks, Pantech’s Vega LTE smartphones, Hisense’s Smart TV models, among other
devices.
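The following sketch shows the starting point of any single-camera approach: grabbing frames from a standard 2D webcam using the OpenCV Java bindings (assumed to be installed, with the native library on the library path). The frame-processing step is left as a placeholder; it is not the pipeline implemented in this project.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.videoio.VideoCapture;

// Minimal sketch of grabbing frames from a standard 2D webcam with the
// OpenCV Java bindings; a gesture recognizer would process each frame here.
public class CameraCapture {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME); // load the native OpenCV library
        VideoCapture camera = new VideoCapture(0);    // device 0 = default webcam
        if (!camera.isOpened()) {
            System.err.println("Cannot open camera");
            return;
        }
        Mat frame = new Mat();
        for (int i = 0; i < 100 && camera.read(frame); i++) {
            // Hand the frame to segmentation / recognition here.
            System.out.println("Captured frame " + i + " of size " + frame.size());
        }
        camera.release();
    }
}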

Depending on the type of input data, the approach for interpreting a gesture can be carried out in
different ways. However, most of the techniques rely on key pointers represented in a 3D
coordinate system. Based on the relative motion of these, the gesture can be detected with high
accuracy, depending on the quality of the input and the algorithm's approach.

In order to interpret movements of the body, one has to classify them according to common
properties and the message the movements may express. For example, in sign language each
gesture represents a word or phrase. The taxonomy that seems very appropriate for Human-
Computer Interaction has been proposed by Quek in "Toward a Vision-Based Hand Gesture
Interface" He presents several interactive gesture systems in order to capture the whole space of
the gestures: 1. Manipulative; 2. Semaphoric; 3. Conversational.

Some literature distinguishes two different approaches in gesture recognition: a 3D-model-based
approach and an appearance-based one. The former makes use of 3D information about key elements of
the body parts in order to obtain several important parameters, like palm position or joint angles.
Appearance-based systems, on the other hand, use images or videos for direct interpretation.

A real hand (left) is interpreted as a collection of vertices and lines in the 3D mesh version
(right), and the software uses their relative position and interaction in order to infer the gesture.

1.3 Gesture Recognition algorithms

1.3.1 3D model-based algorithms

The 3D model approach can use volumetric or skeletal models, or even a combination of the two.
Volumetric approaches have been heavily used in computer animation industry and for computer
vision purposes. The models are generally created of complicated 3D surfaces, like NURBS or
polygon meshes.

The drawback of this method is that it is very computationally intensive, and systems for live
analysis are still to be developed. For the moment, a more interesting approach would be to map
simple primitive objects to the person’s most important body parts (for example cylinders for the arms
and neck, sphere for the head) and analyze the way these interact with each other. Furthermore,
some abstract structures like super-quadrics and generalized cylinders may be even more suitable
for approximating the body parts. The exciting thing about this approach is that the parameters
for these objects are quite simple. In order to better model the relation between these, we make
use of constraints and hierarchies between our objects.

The skeletal version (right) is effectively modelling the hand (left). This has fewer
parameters than the volumetric version and it's easier to compute, making it suitable for real-time
gesture analysis systems.

1.3.2 Skeletal-based algorithms

Instead of using intensive processing of the 3D models and dealing with a lot of
parameters, one can just use a simplified version of joint angle parameters along with segment
lengths. This is known as a skeletal representation of the body, where a virtual skeleton of the
person is computed and parts of the body are mapped to certain segments. The analysis here is
done using the position and orientation of these segments and the relation between each one of
them (for example the angle between the joints and the relative position or orientation).
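As a concrete example of such a parameter, the sketch below computes the angle at a joint from three 2D keypoints. The keypoints are assumed to come from some hand or body tracker; the code only illustrates the kind of quantity a skeletal model analyses.

// Sketch: one parameter a skeletal model analyses is the angle at a joint,
// computed here from three tracked 2D keypoints (e.g. wrist, knuckle,
// fingertip). The keypoints are assumed to come from some hand/body tracker.
public class JointAngle {

    /** Angle (in degrees) at point b formed by segments b->a and b->c. */
    public static double angleAt(double ax, double ay,
                                 double bx, double by,
                                 double cx, double cy) {
        double v1x = ax - bx, v1y = ay - by;   // vector from joint to first point
        double v2x = cx - bx, v2y = cy - by;   // vector from joint to second point
        double dot = v1x * v2x + v1y * v2y;
        double norms = Math.hypot(v1x, v1y) * Math.hypot(v2x, v2y);
        return Math.toDegrees(Math.acos(dot / norms));
    }

    public static void main(String[] args) {
        // Example: a finger bent at roughly a right angle.
        System.out.println(angleAt(0, 0, 5, 0, 5, 5)); // prints ~90.0
    }
}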

Advantages of using skeletal models:

 Algorithms are faster because only key parameters are analyzed.


 Pattern matching against a template database is possible
 Using key points allows the detection program to focus on the significant parts of the
body

These binary silhouette (left) or contour (right) images represent typical input for appearance-
based algorithms. They are compared with different hand templates and if they match, the
correspondent gesture is inferred.
1.3.3 Appearance-based models

These models don’t use a spatial representation of the body anymore, because they derive
the parameters directly from the images or videos using a template database. Some are based on
the deformable 2D templates of the human parts of the body, particularly hands. Deformable
templates are sets of points on the outline of an object, used as interpolation nodes for the
object’s outline approximation. One of the simplest interpolation functions is linear, which
performs an average shape from point sets, point variability parameters and external deformators.
These template-based models are mostly used for hand-tracking, but could also be of use for
simple gesture classification.

A second approach in gesture detecting using appearance-based models uses image


sequences as gesture templates. Parameters for this method are either the images themselves, or
certain features derived from these. Most of the time, only one (monoscopic) or two
(stereoscopic) views are used.
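A minimal sketch of this template-matching idea, assuming the OpenCV Java bindings and two placeholder image files (a binary hand silhouette and a stored template), is given below. Imgproc.matchShapes compares the Hu-moment signatures of the two shapes; a lower score means a closer match. This is an illustration, not the matching scheme of any particular system discussed here.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

// Sketch of appearance-based matching: compare a binary hand silhouette
// against a stored template silhouette using Hu-moment based shape matching.
// "input.png" and "template.png" are placeholder file names.
public class SilhouetteMatch {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Mat input = Imgcodecs.imread("input.png", Imgcodecs.IMREAD_GRAYSCALE);
        Mat templ = Imgcodecs.imread("template.png", Imgcodecs.IMREAD_GRAYSCALE);
        // matchShapes accepts binary images (or contours) and compares their
        // Hu-moment signatures; CONTOURS_MATCH_I1 is one of three metrics
        // (older OpenCV versions name it CV_CONTOURS_MATCH_I1).
        double score = Imgproc.matchShapes(input, templ,
                                           Imgproc.CONTOURS_MATCH_I1, 0.0);
        System.out.println("Dissimilarity score: " + score);
    }
}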

1.4 Applications based on the hands-free interface

1.4.1 Interactive expositions

Nowadays, expositions based on new ways of interaction need contact with the visitors
that play an important role in the exhibition contents. Museums and expositions are open to all
kind of visitors, therefore, these ‘‘sensing expositions’’ look forward to reaching the maximum
number of people. This is the case of ‘‘Galicia dixital’’, an exposition. Visitors go through all the
phases of the exposition sensing, touching and receiving multimodal feedback such as audio,
video, haptics, interactive images or virtual reality. In one phase there is a slider-puzzle with
images of Galicia to be solved. There are four computers connected enabling four users to
compete to complete the six puzzles included in the application.

Visitors use a touch-screen to interact with the slider-puzzle, but the characteristics of this
application make it possible to interact by means of the hands-free interface in a very easy
manner. Consequently, the application has been adapted to it and therefore, disabled people can
also play this game and participate in a more active way in the exposition.

1.4.2 Non-verbal communication


By means of human–computer interaction, one ambitious objective is to achieve
communication for people with speech disorders using new technologies. Nowadays, there are
different augmentative communication systems for people with speech limitations, ranging from
unaided communication such as sign languages, to computerized iconic languages with voice
output systems such as MinspeakTM. We present BlissSpeaker, an application based on a
symbolic graphical-visual system for nonverbal communication named Bliss. The Bliss
system can be used as an augmentative system or for completely replacing verbal
communication. It is commonly used by people with cerebral palsy, but with the following
learning aptitude requirements:

1. Cognitive abilities;

2. Good visual discrimination;

3. Possibility of indicating the desired symbol;

4. Good visual and auditory comprehension.

Some speech therapists use it in their sessions to help themselves with children with speech
disorders and to help in the prevention of linguistic and cognitive delays in crucial stages of a
child’s life. The Blissymbolics language is currently composed of over 2000 graphic symbols
that can be combined and re-combined to create new symbols. The number of symbols is
adaptable to the capabilities and necessities of the user, for example, BlissSpeaker has 92
symbols that correspond to the first set of Bliss symbols for preschool children. BlissSpeaker is
an application that verbally reproduces statements built using Bliss symbols, which allows a
more ‘‘natural’’ communication between a child using Bliss and a person that does not
understand or use these symbols, for example, the children’s relatives. The application can work
with any language, as long as there is an available compatible SAPI (Speech Application
Programming Interface). The system’s process is shown. The potential users of BlissSpeaker are
children with speech disorders; therefore, its operation is to be very simple and intuitive.
Moreover, audio, vision and traditional graphical user interfaces combined together configure a
very appealing multimodal interface that can help attract and involve the user in its use.
Furthermore, the use of the hands-free interface with BlissSpeaker will help to fulfil the third
requirement of Bliss user, which is the possibility of indicating the desired symbol. It will offer
children with upper-body physical disabilities and speech difficulties a way to communicate
themselves through an easy interface and their teachers or relatives will understand them better
due to the symbols’ vocal reproduction. Furthermore, the use of the new interface can make
learning of Bliss language more enjoyable and entertaining, and it also promotes the children’s
coordination, because the interface works with head motion. This system was evaluated in a
children’s scientific fair. The system was tested by more than 60 disabled and non-disabled
children from 6 to 14 years of age. A short explanation on how it works was given. They
operated the application with surprising ease and even if they had never seen Bliss symbols
before, they created statements that made sense and reproduced them for their classmates.
Children enjoyed interacting with the computer through the functionalities that the face-based
interface offered. Moreover, upper-body physical disabled children are grateful for the
opportunity of accessing a computer.
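Since the proposed system also ends by converting recognized gestures into speech, the voice-output step that applications like BlissSpeaker rely on can be sketched in Java. The sketch below uses the FreeTTS library as an illustrative assumption (the report's development environment only names Java in general); "kevin16" is the stock FreeTTS voice.

import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

// Minimal sketch of speaking a recognized word or sentence aloud using the
// FreeTTS library (chosen here only as an illustrative Java TTS option).
public class SpeakOutput {
    public static void speak(String text) {
        Voice voice = VoiceManager.getInstance().getVoice("kevin16"); // stock FreeTTS voice
        if (voice == null) {
            System.err.println("Voice not found; check that FreeTTS is on the classpath");
            return;
        }
        voice.allocate();          // load voice resources
        voice.speak(text);         // synthesize and play the text
        voice.deallocate();        // release resources
    }

    public static void main(String[] args) {
        speak("Hello, this is the recognized sign");
    }
}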

1.5 Color Models

The aim of the proposed project is to overcome the challenge of skin color detection for a
natural interface between user and machine. To detect skin color against a dynamic
background, various color models were studied for pixel-based skin detection. Three
color spaces have been chosen which are commonly used in computer vision applications.

RGB: The three primary colors red (R), green (G), and blue (B) are used. The main advantage
of this color space is simplicity. However, it is not perceptually uniform. It does not separate
luminance and chrominance, and the R, G, and B components are highly correlated.

HSV (Hue, Saturation, Value): Hue expresses the dominant color (such as red, green,
purple or yellow) of an area. Saturation measures the colorfulness of an area in proportion to its
brightness. The “intensity”, “lightness”, or “value” is related to the color luminance. This
model separates luminance from chrominance. It is a more intuitive way of describing
colors, and because the intensity is independent of the color information it is a very useful model
for computer vision. However, this model gives poor results where the brightness is very low. Other
similar color spaces are HSI and HSL (HLS).

CIE-Lab: This color space is defined by the International Commission on Illumination. It separates a
luminance variable L from two perceptually uniform chromaticity variables (a, b).
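The sketch below illustrates how these color spaces are used in practice for pixel-based skin detection with the OpenCV Java bindings: the input image is converted from RGB/BGR to HSV (and to CIE-Lab for comparison), and pixels inside a rough skin range are kept. The file name and threshold values are illustrative assumptions, not tuned values from this project.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Scalar;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

// Sketch of pixel-based skin detection in the HSV color space with the
// OpenCV Java bindings. The threshold values below are commonly quoted
// starting points, not tuned values from this project.
public class SkinDetection {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Mat bgr = Imgcodecs.imread("frame.png");             // placeholder input image (BGR)

        Mat hsv = new Mat();
        Imgproc.cvtColor(bgr, hsv, Imgproc.COLOR_BGR2HSV);   // RGB/BGR -> HSV
        Mat lab = new Mat();
        Imgproc.cvtColor(bgr, lab, Imgproc.COLOR_BGR2Lab);   // RGB/BGR -> CIE-Lab

        // Keep pixels whose hue/saturation/value fall inside a rough skin range.
        Mat skinMask = new Mat();
        Core.inRange(hsv, new Scalar(0, 48, 80), new Scalar(20, 255, 255), skinMask);

        Imgcodecs.imwrite("skin_mask.png", skinMask);
    }
}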
1.6 Hand modeling for Gesture recognition

Human hand is an articulated object with 27 bones and 5 fingers. Each of these fingers
consists of three joints. The four fingers (little, ring, middle and index) are aligned together and
connected to the wrist bones in one tie and at a distance there is the thumb. Thumb always stands
on the other side of the four fingers for any operation, like capturing, grasping, holding etc.
Human hand joints can be classified as flexion, twist, directive or spherical depending upon the
type of movement or possible rotation axes. In total, the human hand has approximately 27 degrees
of freedom. As a result, a large number of gestures can be generated. Therefore, for proper
recognition of the hand, it should be modeled in a manner understandable as an interface in
Human Computer Interaction (HCI). There are two types of gestures, Temporal (dynamic) and
Spatial (shape). Temporal models use Hidden Markov Models (HMM), Kalman Filters, Finite
State Machines and Neural Networks (NN). Hand modeling in the spatial domain can be further divided
into two categories, 2D (appearance based or view based) model and 3D based model. 2D hand
modeling can be represented by deformable templates, shape representation features, motion and
coloured markers. Shape representation feature is classified as geometric features (i.e. live
feature) and non-geometric feature. Geometric feature deals with location and position of
fingertips, location of palm and it can be processed separately. The non – geometric feature
includes colour, silhouette and textures, contour, edges, image moments and Eigen vectors. Non-
geometric features cannot be seen (blind features) individually and collective processing is
required. The deformable templates are flexible in nature and allow changes in shape of the
object up to certain limit for little variation in the hand shape. Image motion based model can be
obtained with respect to colour cues to track the hand. Coloured markers are also used for
tracking the hand and detecting the fingers/ fingertips to model the hand shape. Hand shape can
also be represented using 3D modeling. The hand shape in 3D can be volumetric, skeletal and
geometric models. Volumetric models are complex in nature and difficult for computation in
real-time applications, and they use a lot of parameters to represent the hand shape. Instead, simpler
geometric models, such as cylinders, ellipsoids and spheres, are considered as alternatives for
hand shape approximation. The skeletal model represents the hand structure in 3D
with a reduced set of parameters. Geometric models are used for hand animation and
real-time applications. Polygon meshes and cardboard models are examples of geometric
models.
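As an illustration of the geometric (contour-based) features mentioned above, the following sketch extracts the largest contour from a binary hand mask, computes its convex hull and convexity defects with the OpenCV Java bindings, and counts the deep defects that roughly correspond to the valleys between fingers. The file name and depth threshold are assumptions for the example only.

import java.util.ArrayList;
import java.util.List;

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfInt;
import org.opencv.core.MatOfInt4;
import org.opencv.core.MatOfPoint;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

// Sketch of extracting simple geometric hand features (contour, convex hull,
// convexity defects) from a binary hand mask such as the skin mask above.
// Deep convexity defects roughly correspond to the valleys between fingers.
public class HandContourFeatures {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Mat mask = Imgcodecs.imread("skin_mask.png", Imgcodecs.IMREAD_GRAYSCALE);

        List<MatOfPoint> contours = new ArrayList<>();
        Imgproc.findContours(mask, contours, new Mat(),
                             Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);
        if (contours.isEmpty()) {
            System.err.println("No contours found");
            return;
        }

        // Take the largest contour as the hand.
        MatOfPoint hand = contours.get(0);
        for (MatOfPoint c : contours) {
            if (Imgproc.contourArea(c) > Imgproc.contourArea(hand)) hand = c;
        }

        // Convex hull (as point indices) and convexity defects of the hand contour.
        MatOfInt hull = new MatOfInt();
        Imgproc.convexHull(hand, hull);
        MatOfInt4 defects = new MatOfInt4();
        Imgproc.convexityDefects(hand, hull, defects);

        // Each defect is (startIdx, endIdx, farthestPtIdx, fixptDepth); the depth is
        // stored as a fixed-point value, so divide by 256 to get pixels.
        int fingerValleys = 0;
        int[] d = defects.toArray();
        for (int i = 0; i < d.length; i += 4) {
            if (d[i + 3] / 256.0 > 20) fingerValleys++;  // illustrative depth threshold
        }
        System.out.println("Deep convexity defects (finger valleys): " + fingerValleys);
    }
}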
CHAPTER 2

PROBLEM DEFINITION AND DESCRIPTION

2.1 Existing system


Gesture is a form of non-verbal communication using various body parts, mostly the hand
and face. Gesture is the oldest method of communication among humans. Primitive men used to
communicate information about food or prey for hunting, sources of water, information about their
enemies, and requests for help among themselves through gestures. Gestures are still widely used
for different applications in different domains. These include human-robot interaction, sign
language recognition, interactive games, vision-based augmented reality and so on. Another major
application of gestures is found in the aviation industry, for guiding aircraft into the designated bay
after landing and for the cabin crew to make passengers aware of the safety features. For
communication by people at a visible but not audible distance (surveyors) and by
physically challenged people (mainly the deaf and mute), gesture is the only method. Posture is
another term often confused with gesture. Posture refers to only a single image corresponding to
a single command (such as stop), whereas a sequence of postures is called a gesture (such as moving
the screen to the left or right). Sometimes these are also called static (posture) and dynamic
(gesture). A posture is simple and needs less computational power, but a gesture (i.e. a dynamic one)
is complex and suitable for real environments. Though sometimes the face and other body parts are
used along with a single hand or both hands, hand gestures are the most popular for different
applications. With the advancement of human civilization, the difficulty of interpersonal
communication, not only in terms of language but also in terms of communication between
ordinary people and hearing-impaired people, is gradually being removed. If the development of
sign language is the first step, then the development of hand recognition systems using computer
vision is the second step. Several works have been carried out worldwide using Artificial
Intelligence for different sign languages. The main objective is to perform effective recognition,
detection and tracking of hands, and a color- and depth-based 3-D particle filter framework is
proposed to solve occlusion. Surfing the web, typing a letter, playing a video game or storing and
retrieving personal or official data are just a few examples of the use of computers or computer-
based devices. Due to increasing mass production and the constantly decreasing price of personal
computers, they will influence our everyday life even more in the near future. Nevertheless, in order
to efficiently utilize this new phenomenon, a myriad of studies have been carried out on
computer applications and their requirement for more and more interaction. The existing system
presents a novel technique for hand gesture recognition through human–computer interaction
based on shape analysis. The main objective of this effort is to explore the utility of a particle
filter-based approach to the recognition of hand gestures. The goal of static hand gesture
recognition is to classify the given hand gesture data, represented by some features, into some
predefined finite number of gesture classes. The proposed system presents a recognition
algorithm to recognize a set of six specific static hand gestures, namely: Open, Close, Cut, Paste,
Maximize, and Minimize. The hand gesture image is passed through three stages: preprocessing,
feature extraction, and classification. In the preprocessing stage some operations are applied to
extract the hand gesture from its background and prepare the hand gesture image for the feature
extraction stage. In the first method, the hand contour is used as a feature, which handles scaling
and translation problems (in some cases). The complex moment algorithm is, however, used
to describe the hand gesture and treat the rotation problem in addition to the scaling and
translation.
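The complex-moment algorithm itself is not reproduced here; as an illustrative stand-in for moment-based features that are tolerant of translation, scaling and rotation, the sketch below computes Hu's seven invariant moments of a binary hand image with the OpenCV Java bindings and packs them into a feature vector for a classifier.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;
import org.opencv.imgproc.Moments;

// Illustrative stand-in for the moment-based description above: Hu's seven
// invariant moments of a binary hand image are (largely) invariant to
// translation, scale and rotation, and can serve as a feature vector for a
// static-gesture classifier. This is not the exact algorithm of the cited system.
public class MomentFeatures {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Mat hand = Imgcodecs.imread("hand_binary.png", Imgcodecs.IMREAD_GRAYSCALE);

        Moments m = Imgproc.moments(hand, true);  // true = treat image as binary
        Mat hu = new Mat();
        Imgproc.HuMoments(m, hu);                 // 7x1 matrix of Hu moments

        double[] feature = new double[7];
        for (int i = 0; i < 7; i++) {
            feature[i] = hu.get(i, 0)[0];
        }
        // 'feature' would now be passed to a classifier (template matching,
        // nearest neighbour, neural network, etc.).
        System.out.println(java.util.Arrays.toString(feature));
    }
}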

Disadvantages

There are many challenges associated with the accuracy and usefulness of gesture
recognition software. For image-based gesture recognition there are limitations on the equipment
used and image noise. Images or video may not be under consistent lighting, or in the same
location. Items in the background or distinct features of the users may make recognition more
difficult.

The variety of implementations for image-based gesture recognition may also cause issues
for the viability of the technology in general usage. For example, an algorithm calibrated for one
camera may not work for a different camera. The amount of background noise also causes
tracking and recognition difficulties, especially when occlusions (partial and full) occur.
Furthermore, the distance from the camera, and the camera's resolution and quality, also cause
variations in recognition accuracy.

In order to capture human gestures by visual sensors, robust computer vision methods are
also required, for example for hand tracking and hand posture recognition or for capturing
movements of the head, facial expressions or gaze direction.

2.2 Literature survey

Title: Systematic review of Kinect applications in elderly care and stroke rehabilitation


Author: David Webster
As the Kinect is a relatively new piece of hardware, establishing the limitations of the
sensor within specific application scenarios is an ongoing process. Nevertheless, the authors provide
a list of current limitations of the Kinect, noted from their review of applications in elderly care
systems. Current Kinect-based fall risk reduction strategies are derived from gait-based, early
intervention methodologies and thus are only indirectly related to true fall prevention which
would require some form of feedback prior to a detected potential fall event. Occlusion in fall
detection algorithms, while partially accounted for through the methodologies of the various
systems discussed, is still a major challenge inherent in Kinect-based fall detection systems.
Current strategies focus on a subject who stands, sits, and falls in an ideal location of the
Kinect’s field of vision, while authentic falls in realistic home environment conditions are more
varied, therefore the current results should not be taken as normative. The Kinect sensor must be
fixed to a specific location and has a range of capture of roughly ten meters. This limitation
dictates that fall events must occur directly in front of the sensor’s physical location. While it has
been noted that a strategically placed array of Kinect sensors could mitigate this limitation, a
system utilizing this methodology has not yet been implemented and evaluated. Without careful
consideration of the opinions of a system’s proposed user base, concerns regarding ubiquitous
always-on video capture systems, such as the Kinect, may inhibit wide-scale system adoption.
During the review, it was noted that research related to the reception of alert support systems is
at an early phase, likely due to in-home hardware previously being cumbersome and expensive.
With the Kinect having the potential to be widely deployed in in-home monitoring
systems, this avenue of research has become more viable and relevant. The review then covers
applications of Kinect in stroke rehabilitation grouped under two categories: 1)
Evaluation of Kinect’s Spatial Accuracy, and 2) Kinect-based Rehabilitation Methods. These
categories follow the trend of the literature to first evaluate the Kinect sensor as a clinically
viable tool for rehabilitation. Motor function rehabilitation for stroke patients typically aims to
strengthen and retrain muscles to rejuvenate debilitated functions, but inadequate completion of
rehabilitation exercises drastically reduces the potential outcome of overall motor recovery.
These exercises are often unpleasant and/or painful, leading patients’ tolerance for exercise to
decrease. It was then noted that decreased tolerance or motivation often leads to intentional
and unintentional ‘cheating’ or, in the worst-case scenario, avoidance of rehabilitation exercises
altogether. The Kinect may contain the potential to overcome these barriers to in-home stroke
rehabilitation as an engaging and accurate markerless motion capture tool and controller
interface; however, a functional foundation of Kinect-based rehabilitation potential needs to be
established focusing on the underlying strategies of rehabilitation schemas rather than the
placating effects offered by serious games. However, some significant technological limitations
still present are: a fixed location sensor with a range of capture of only roughly ten meters; a
difficulty in fine movement capture; shoulder joint biomechanical accuracy, and fall risk
reduction methodologies that only utilize indirect, gait-based preemptive training. The directions
for future work are vast and have promise to enhance elderly care; stroke patient motivation to
accurately complete rehabilitation exercises; rehabilitation record keeping, and future medical
diagnostic and rehabilitation methods. Based on a review of the literature, the authors have reported a
summary of critical issues and suggestions for future work in this domain.
Title: ChaLearn Gesture Challenge: Design and First Results
Author: Isabelle Guyon
For the unpublished methods, the authors summarize the descriptions provided in the fact sheets.
Interestingly, all top ranking methods are based on techniques making no explicit detection and
tracking of humans or individual body parts. The winning team (alfnie) used a novel technique
called “Motion Signature analyses”, inspired by the neural mechanisms underlying information
processing in the visual system. This is an unpublished method using a sliding window to
perform simultaneously recognition and temporal segmentation, based solely on depth images.
The second best ranked participants (team Pennect) did not publish their method yet. From the
fact sheets we only know that it is an HMM-style method using HOG/HOF features with a
temporal segmentation based on candidate cuts. Only RGB images were used. The methods of
the two best ranking participants are quite fast. They claim a linear complexity in image size,
number of frames, and number of training examples. The third best ranked team (One Million
Monkeys) did not publish either, but they provided a high level description indicating that the
system uses a HMM in which a state is created for each frame of the gesture exemplars. The
state machine includes skips and self-loops to allow for variation in the speed of the gesture
execution. The most likely sequence of gestures is determined by a Viterbi search. Comparisons
between frames are based on the edges detected in each frame. Edges are associated with several
attributes including the X/Y coordinates, their orientation, their sharpness, their depth and
location in an area of change. In matching one frame against another, they find the nearest
neighbor in the second frame for every edge point in the first frame, and calculate the joint
probability of all the nearest neighbors using a simple Gaussian model. The system works
exclusively from the depth images. The system is one of the slowest proposed. Its processing
speed is linear in number of training examples but quadratic in image size and number of frames
per video. The fourth best ranked entrant, Manavender Malgireddy (immortals), published in these
proceedings a paper called “Detecting and Localizing Activities/Gestures in Video Sequences”.
They detect and localize activities from HOG/HOF features in unconstrained real-life video
sequences, a more complex problem than that of the challenge. To obtain real-life data, they used
video clips from the Human Motion Database (HMDB). The detection and localization paradigm
was adapted from the speech recognition community, where a keyword model is used for
detecting key phrases in speech. The method learns models for activities-of-interest and creates a
network of these models to detect keywords. According to the paper, the approach out-performed
all the current state-of-the-art classifiers when tested on publicly available datasets such as KTH
and HMDB. The final evaluation performance for the first round of the gesture challenge is
around 10% error, still far from human performance, which is below 2% error. However, the
progress made during round 1, starting at a baseline performance of 60% error indicates that the
objective of attaining or surpassing human performance could possibly be reached in the second
round. There is room for improvement particularly because the top two ranking participants used
only one modality (the first one depth only and the second one RGB only) and because many
gestures are recognizable only when details of hand posture are used, yet none of the methods
disclosed made use of such information. The most efficient techniques so far have used
sequences of features processed by graphical models of the HMM/CRF family, similar to
techniques used in speech recognition. No use of skeleton extraction or body part detection was
made. Rather, orientation and ad hoc features were extracted. It is possible that progress will also
be made in feature extraction by making better use of the development data for transfer learning.
Title: Gesture Recognition: A Survey
Author: Sushmita Mitra
Generally, there exist many-to-one mappings from concepts to gestures and vice versa.
Hence, gestures are ambiguous and incompletely specified. For example, to indicate the concept
“stop,” one can use gestures such as a raised hand with palm facing forward, or, an exaggerated
waving of both hands over the head. Similar to speech and handwriting, gestures vary between
individuals, and even for the same individual between different instances. There have been
varied approaches to handle gesture recognition, ranging from mathematical models based on
hidden Markov chains to tools or approaches based on soft computing. In addition to the
theoretical aspects, any practical implementation of gesture recognition typically requires the use
of different imaging and tracking devices or gadgets. These include instrumented gloves, body
suits, and marker based optical tracking. Traditional 2-D keyboard-, pen-, and mouse-oriented
graphical user interfaces are often not suitable for working in virtual environments. Rather,
devices that sense body (e.g., hand, head) position and orientation, direction of gaze, speech and
sound, facial expression, galvanic skin response, and other aspects of human behavior or state
can be used to model communication between a human and the environment. Gestures can be
static (the user assumes a certain pose or configuration) or dynamic (with pre-stroke, stroke, and
post-stroke phases). Some gestures also have both static and dynamic elements, as in sign
languages. Again, the automatic recognition of natural continuous gestures requires their
temporal segmentation. Often one needs to specify the start and end points of a gesture in terms
of the frames of movement, both in time and in space. Sometimes a gesture is also affected by
the context of preceding as well as following gestures. Moreover, gestures are often language
and culture-specific. They can broadly be of the following types: 1) hand and arm gestures:
recognition of hand poses, sign languages, and entertainment applications (allowing children to
play and interact in virtual environments); 2) head and face gestures: some examples are: a)
nodding or shaking of head; b) direction of eye gaze; c) raising the eyebrows; d) opening the
mouth to speak; e) winking, f) flaring the nostrils; and g) looks of surprise, happiness, disgust,
fear, anger, sadness, contempt, etc.; 3) body gestures: involvement of full body motion, as in: a)
tracking movements of two people interacting outdoors; b) analyzing movements of a dancer for
generating matching music and graphics; and c) recognizing human gaits for medical
rehabilitation and athletic training. Typically, the meaning of a gesture can be dependent on the
following: spatial information: where it occurs; pathic information: the path it takes; symbolic
information: the sign it makes; affective information: its emotional quality. Facial expressions
involve extracting sensitive features (related to emotional state) from facial landmarks such as
regions surrounding the mouth, nose, and eyes of a normalized image. Often dynamic image
frames of these regions are tracked to generate suitable features. The location, intensity, and
dynamics of the facial actions are important for recognizing an expression. Moreover, the
intensity measurement of spontaneous facial expressions is often more difficult than that of
posed facial expressions. More subtle cues such as hand tension, overall muscle tension,
locations of self-contact, and pupil dilation are sometimes used. In order to determine all these
aspects, the human body position, configuration (angles and rotations), and movement
(velocities) need to be sensed. This can be done either by using sensing devices attached to the
user. Those may be magnetic field trackers, instrumented (data) gloves, and body suits, or by
using cameras and computer vision techniques. Each sensing technology varies along several
dimensions, including accuracy, resolution, and latency, range of motion, user comfort, and cost.
Glove-based gestural interfaces typically require the user to wear a cumbersome device and carry
a load of cables connecting the device to a computer. This hinders the ease and naturalness of the
user’s interaction with the computer. Vision-based techniques, while overcoming this, need to
contend with other problems related to occlusion of parts of the user’s body. While tracking
devices can detect fast and subtle movements of the fingers when the user’s hand is moving, a
vision-based system will at best get a general sense of the type of finger motion. Again, vision-
based devices can handle properties such as texture and color for analyzing a gesture, while
tracking devices cannot. Vision-based techniques can also vary among themselves in: 1) the
number of cameras used; 2) their speed and latency; 3) the structure of environment (restrictions
such as lighting or speed of movement); 4) any user requirements (whether user must wear
anything special); 5) the low-level features used (edges, regions, silhouettes, moments,
histograms); 6) whether 2-D or 3-D representation is used; and 7) whether time is represented.
There is, however, an inherent loss in information whenever a 3-D image is projected to a 2-D
plane. Again, elaborate 3-D models involve prohibitive high dimensional parameter spaces. A
tracker also needs to handle changing shapes and sizes of the gesture-generating object (that
varies between individuals), other moving objects in the background, and noise. Gesture
recognition is an ideal example of multidisciplinary research. There are different tools for gesture
recognition, based on the approaches ranging from statistical modeling, computer vision and
pattern recognition, image processing, connectionist systems, etc. Most of the problems have
been addressed based on statistical modeling, such as PCA, HMMs, Kalman filtering, more
advanced particle filtering and condensation algorithms. FSM has been effectively employed in
modeling human gestures. Computer vision and pattern recognition techniques, involving feature
extraction, object detection, clustering, and classification, have been successfully used for many
gesture recognition systems. Image-processing techniques such as analysis and detection of
shape, texture, color, motion, optical flow, image enhancement, segmentation, and contour
modeling, have also been found to be effective. Connectionist approaches, involving multilayer
perceptron (MLP), time delay neural network (TDNN), and radial basis function network
(RBFN), have been utilized in gesture recognition as well. While static gesture (pose)
recognition can typically be accomplished by template matching, standard pattern recognition,
and neural networks, the dynamic gesture recognition problem involves the use of techniques
such as time-compressing templates, dynamic time warping, HMMs, and TDNN. In the rest of
this section, we discuss the principles and background of some of these popular tools used in
gesture recognition.
Title: Results and Analysis of the ChaLearn Gesture Challenge 2012
Author: I.Guyon
Gesture recognition is an important sub-problem in many computer vision
applications, including image/video indexing, robot navigation, video surveillance, computer
interfaces, and gaming. With simple gestures such as hand waving, gesture recognition could
enable controlling the lights or thermostat in your home or changing TV channels. The same
technology may even make it possible to automatically detect more complex human behaviors,
to allow surveillance systems to sound an alarm when someone is acting suspiciously, for
example, or to send help whenever a bedridden patient shows signs of distress. Gesture
recognition also provides excellent benchmarks for Adaptive and Intelligent Systems (AIS) and
computer vision algorithms. The recognition of continuous, natural gestures is very challenging
due to the multi-modal nature of the visual cues (e.g., movements of fingers and lips, facial
expressions, body pose), as well as technical limitations such as spatial and temporal resolution
and unreliable depth cues. Technical difficulties include reliably tracking hand, head and body
parts, and achieving 3D invariance. The competition the authors organized helped improve the accuracy
of gesture recognition using Microsoft Kinect motion sensor technology, a low cost 3D depth-
sensing camera. Most of the participants employed image enhancement and filtering techniques,
mostly denoising or outlier removal and background removal. Some reduced the image
resolution for faster processing. Notably, some of the top ranking participants did not do any
such low level preprocessing. The majority of the top ranking participants used HOG/HOF
features and/or ad-hoc hand crafted features, edge/corner detectors or SIFT/STIP features. The
latter use a bag-of-feature strategy, which ignores exact location of features and therefore
provides some robustness against translations. The winner of both rounds of the challenge claims
that his features are inspired by the human visual system. Very few participants resorted to using
body parts or trained features. Most participants used the depth image only, but about one third
used both RGB and depth images. Interestingly, the second place winner in round 1 used the
RGB image only. About one third of the participants did no dimensionality reduction at all and
one third resorted to feature selection. Other popular techniques included linear transforms (such
as PCA) and clustering.For temporal segmentation, most participants used candidate cuts based
on similarities with the resting position or based on amount of motion. All the top ranking
participants used recognition-based segmentation techniques (in which recognition and
segmentation are integrated). As gesture representation, all highest ranking participants used a
variable length sequence of feature vectors (sometimes in combination with other
representations). To handle such variable length representations, the highest ranking participants
used Hidden Markov Models (HMM), Conditional Random Fields (CRF) or other similar
graphical models. This is a state machine including skips and self-loops to allow for variation in
the speed of the gesture execution. The most likely sequence of gestures is determined by a
Viterbi search. Some highly ranked, but not top ranking, participants used a bag-of-word
representation or image templates, including motion energy or motion history representations.
The corresponding classifiers were usually nearest neighbors (using as metric the Euclidean
distance or correlation). One participant used a linear SVM. Many participants made use of the
development data to either learn features or gesture representations in the spirit of “transfer
learning”. Most participants claimed that the algorithmic complexity of their methods was linear
in image size, number of frames per video, and number of training examples. The median
execution time on the 20 batches of the final evaluation set was 2.5 hours, which is very
reasonable and close to real time performance. However, there were a few outliers and it took up
to 50 hours for the slowest code.
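As a rough illustration of the Viterbi decoding mentioned above, the following Java sketch recovers the most likely state path of a small HMM-style model by dynamic programming. It is only an illustrative example using log-probability inputs; the model structure, skips and self-loops of the actual challenge entries are not reproduced here.

// Minimal Viterbi decoding sketch (illustrative only; the model values are placeholders).
public class ViterbiSketch {
    // logTrans[i][j]: log-probability of moving from state i to state j
    // logEmit[t][j]:  log-likelihood of the observation at time t under state j
    static int[] viterbi(double[] logStart, double[][] logTrans, double[][] logEmit) {
        int T = logEmit.length, N = logStart.length;
        double[][] score = new double[T][N];
        int[][] back = new int[T][N];
        for (int j = 0; j < N; j++) score[0][j] = logStart[j] + logEmit[0][j];
        for (int t = 1; t < T; t++) {
            for (int j = 0; j < N; j++) {
                double best = Double.NEGATIVE_INFINITY; int arg = 0;
                for (int i = 0; i < N; i++) {
                    double s = score[t - 1][i] + logTrans[i][j];
                    if (s > best) { best = s; arg = i; }
                }
                score[t][j] = best + logEmit[t][j];
                back[t][j] = arg;
            }
        }
        // Backtrack the best-scoring path from the final time step.
        int[] path = new int[T];
        double best = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < N; j++) if (score[T - 1][j] > best) { best = score[T - 1][j]; path[T - 1] = j; }
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }
}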
Title: A Robust Background Subtraction and Shadow Detection
Author: Thanarat Horprasert
The capability of extracting moving objects from a video sequence is a fundamental and crucial problem of many vision systems, including video surveillance, traffic monitoring, human detection and tracking for video teleconferencing or human-machine interfaces, and video editing, among other applications. Typically, the common approach for discriminating a moving object from the background scene is background subtraction. The idea is to subtract the current image from a reference image, which is acquired from a static background during a period of time. The subtraction leaves only non-stationary or new objects, which include the objects' entire silhouette region. The technique has been used for years in many vision systems as a preprocessing step for object detection and tracking. The results of the existing algorithms are fairly good; in addition, many of them run in real time. However, many of these algorithms are susceptible to both global and local illumination changes such as shadows and highlights. These cause the consequent processes, e.g. tracking, recognition, etc., to fail. The accuracy and efficiency of the detection are crucial to those tasks. This problem is the underlying motivation of this work, which aims to develop a robust and efficiently computed background subtraction algorithm that is able to cope with local illumination changes, such as shadows and highlights, as well as global illumination changes. Being able to detect shadows is also very useful to many applications, especially in "shape from shadow" problems. The method must also address requirements of sensitivity, reliability, robustness, and speed of detection. In this paper, the authors present a novel algorithm for detecting moving objects from a static background scene that contains shading and shadows, using color images. They first propose a new computational color model (brightness distortion and chromaticity distortion) that helps distinguish a shaded background from the ordinary background or moving foreground objects, then propose an algorithm for pixel classification and threshold selection, and finally report experimental results and sample applications. One of the fundamental abilities of human vision is color constancy. Humans tend to assign a constant color to an object even under changes of illumination over time or space. The perceived color of a point in a scene depends on many factors, including the physical properties of the point on the surface of the object. Important physical properties of the surface in color vision are its surface spectral reflectance properties, which are invariant to changes of illumination, scene composition or geometry. On Lambertian, or perfectly matte, surfaces the perceived color is the product of illumination and surface spectral reflectance. This led to the idea of designing a color model that separates these two terms; in other words, one that separates the brightness from the chromaticity component. As the person moves, he both obscures the background and casts shadows on the floor and wall. Red pixels depict the shadow, and we can easily see how the shape of the shadow changes as the person moves. Although it is difficult to see, there are green pixels, which depict the highlighted background pixels, appearing along the edge of the person's sweater. Another figure shows a frame of an outdoor scene containing a person walking across a street. Although there are small motions of background objects, such as the small motions of leaves and the water surface, the result shows the robustness and reliability of the algorithm. A further indoor sequence shows a person moving in a room; in the middle of the sequence, the global illumination is changed by turning half of the fluorescent lamps off. The system is still able to detect the target successfully.
Title: Naked image detection based on adaptive and extensible skin color model
Author: Jiann-Shu Lee
In a relatively short period of time, the Internet has becomereadily accessible in most
organizations, schools and homes.Meanwhile, however, the problem of pornography through the
Internet access in the workplace, at home and in education hasconsiderably escalated. In the
workplace, the pornography relatedaccess not only costs companies millions in non-
businessInternet activities, but it also has led to shattering business reputationsand harassment
cases. Being anonymous and often anarchic,images that would be illegal to sell even in adult
bookstorescan be easily transferred to home through the Internet,causing juveniles to see those
obscene images intentionally orunintentionally. Therefore, how to effectively block or filter
outpornography has been arousing a serious concern in related researchareas.The mostly used
approach to blocking smut from the Internetis based on contextual keyword pattern matching
technologythat categorizes URLs by means of checking contexts ofweb pages and then traps the
websites assorted as the obscene.Although this method can successfully filter out a mass of
obscene websites, it is unable to deal with images, leading to its failure to detect those obscene
web sites containingnaked images instead of smut texts. Besides the threat comingfrom the web
sites, a lot of the e-mail image attachmentsare naked. Hence, the development of naked image
detectiontechnology is urgently desired to prevent juveniles from gettingaccess to pornographic
contents from the Internet morethoroughly. As can be seen in these methods, none of them
considerthe inference coming from special lighting and color altering.There exist a large number
of naked pictures taken under speciallighting. Usually, warm lighting is applied to make skintone
look more attractive, while human skin color deviatesfrom the normal case at the same time. If
the skin color modelcannot tolerate the deviation, it will tend to miss a lot of nakedpictures. On
the contrary, if the skin color model accommodatesthe deviation, an abundance of non-skin
objects likewood, desert sand, rock, foods, and the skin or fur of animalswould be detected in the
skin detection phase and deterioratesthe system performance. Accordingly, the above
mentionedapproaches suffer from the skin color deviation resulting fromspecial lighting, which
is often seen in the naked images.Dealing with the special lighting effect in the naked images isa
difficult task. If the skin color model tolerates the deviation,lots of non-skin objects would be
detected simultaneously.A feasible solution for the problem is to adapt the adoptedskin chroma
distribution to the lighting of the input image.Based on this concept, a new naked image
detection system isproposed. We develop a learning-based chromatic
distributionmatchingscheme that consists of the online samplingmechanism and the one-class-
one-net neural network. Basedon this approach, the object’s chroma distribution can be
onlinedetermined so that the skin color deviation coming from lightingcan be accommodated
without sacrificing the accuracy.The roughness feature is further applied to reject
confusioncoming from non-skin objects, so the skin area can be moreeffectively detected.
Several representative features inducedfrom the naked images are used to verify these skin
areas.Subsequently, the face detection process is employed to filterout those false candidates
coming from mug shots. The skin tone is formed by the interaction between skin andlight.
Therefore, the captured skin color in an image dependson the surrounding light in addition to the
intrinsic skin tone.To make naked images look more attractive, photographersusually apply
special lighting, thereby altering the chromadistribution of the skin tone. Hence, if the referenced
skinchroma distribution is gathered from the normal conditionsbeforehand, the corresponding
skin detection performancewill be dramatically degenerated.
Title: Principal motion: PCA-based reconstruction of motion histograms
Author: Hugo Jair Escalante
Principal motion is the implementation of a reconstruction approach to gesture recognition based on principal component analysis (PCA). The underlying idea is to perform PCA on the frames in each video from the vocabulary, storing the PCA models. Frames in test videos are projected into the PCA space and reconstructed back using each of the PCA models, one for each gesture in the vocabulary. Next, the reconstruction error is measured for each of the models, and a test video is assigned the gesture that obtains the lowest reconstruction error. The rest of this document provides more details about the principal motion object. The PCA reconstruction approach to gesture recognition is inspired by the one-class classification task, where the reconstruction error via PCA has been used to identify outliers. The method is also inspired by a recent method for spam classification. The underlying hypothesis of the method is that a test video will be better reconstructed with a PCA model that was obtained from another video containing the same gesture.
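The reconstruction-error test at the heart of this approach can be sketched in a few lines of Java. The PCA mean and (assumed orthonormal) component vectors of each gesture model are taken as already computed, and a test frame would be assigned the gesture whose model yields the lowest error; this is only a sketch, not the cited implementation.

// Sketch: reconstruction error of a vector x under a PCA model (mean + orthonormal components).
public class PcaReconstruction {
    static double reconstructionError(double[] x, double[] mean, double[][] comps) {
        int d = x.length, k = comps.length;          // comps[k][d]: one principal component per row
        double[] centered = new double[d];
        for (int j = 0; j < d; j++) centered[j] = x[j] - mean[j];
        double[] proj = new double[k];               // project onto the PCA subspace
        for (int c = 0; c < k; c++)
            for (int j = 0; j < d; j++) proj[c] += comps[c][j] * centered[j];
        double err = 0;                              // reconstruct and measure the residual
        for (int j = 0; j < d; j++) {
            double recon = mean[j];
            for (int c = 0; c < k; c++) recon += proj[c] * comps[c][j];
            double diff = x[j] - recon;
            err += diff * diff;
        }
        return Math.sqrt(err);
    }
}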
Title: A Framework for Hand Gesture Recognition Based on Accelerometer and EMG
Sensors
Author: Xu Zhang, Xiang Chen
Hand gesture recognition provides an intelligent, natural, and convenient way of human-computer interaction (HCI). Sign language recognition (SLR) and gesture-based control are two major applications for hand gesture recognition technologies. SLR aims to interpret sign languages automatically by a computer in order to help the deaf communicate with hearing society conveniently. Since sign language is a kind of highly structured and largely symbolic human gesture set, SLR also serves as a good basis for the development of general gesture-based HCI. In particular, most efforts on SLR are based on hidden Markov models (HMMs), which are employed as effective tools for the recognition of signals changing over time. On the other hand, gesture-based control translates gestures performed by human subjects into controlling commands as the input of terminal devices, which complete the interaction by providing acoustic, visual, or other feedback to the human subjects. Many previous researchers investigated various systems which could be controlled by hand gestures, such as media players, remote controllers, robots, and virtual objects or environments. According to the sensing technologies used to capture gestures, conventional research on hand gesture recognition can be categorized into two main groups: data glove-based and computer vision-based techniques. The multichannel signals recorded during the hand gesture actions which represent meaningful hand gestures are called active segments. The intelligent processing of hand gesture recognition needs to automatically determine the start and end points of active segments from continuous streams of input signals. The gesture data segmentation procedure is difficult due to movement epenthesis. The EMG signal level directly represents the level of muscle activity. As the hand movement switches from one gesture to another, the corresponding muscles relax for a while, and the amplitude of the EMG signal is momentarily very low during movement epenthesis. Thus, the use of EMG signal intensity helps to implement data segmentation in a multi-sensor system. In this method, only the multichannel EMG signals are used for determining the start and end points of active segments. The segmentation is based on a moving average algorithm and thresholding. The ACC signal stream is segmented synchronously with the EMG signal stream. Thus, the use of EMG helps the SLR system to automatically distinguish between valid gesture segments and movement epenthesis in continuous streams of input signals. The detection of active segments consists of four steps based on the instantaneous energy of the average signal of the multiple EMG channels.
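A simplified version of the moving-average-plus-threshold segmentation described above might look like the following sketch. The window length and threshold are placeholder parameters, and the real system operates on the averaged energy of the multichannel EMG rather than a single channel.

// Sketch: detect active segments in an energy signal by moving average + thresholding.
import java.util.ArrayList;
import java.util.List;

public class EmgSegmentation {
    // Returns [start, end] index pairs of segments whose smoothed energy exceeds the threshold.
    static List<int[]> activeSegments(double[] energy, int window, double threshold) {
        List<int[]> segments = new ArrayList<>();
        int start = -1;
        for (int t = 0; t < energy.length; t++) {
            double sum = 0; int n = 0;
            for (int k = Math.max(0, t - window + 1); k <= t; k++) { sum += energy[k]; n++; }
            boolean active = (sum / n) > threshold;
            if (active && start < 0) start = t;                                  // segment begins
            if (!active && start >= 0) { segments.add(new int[]{start, t - 1}); start = -1; }
        }
        if (start >= 0) segments.add(new int[]{start, energy.length - 1});       // still active at the end
        return segments;
    }
}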
Title: One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
Author: Jun Wan
The dynamic Bayesian network (DBN) includes HMMs and Kalman filters as special cases. One line of work defined five classes of gestures for HCI and developed a DBN-based model which used local features (contour, moment, height) and global features (velocity, orientation, distance) as observations; a DBN-based system was then proposed to control a media player or slide presentation, using local features (location, velocity) obtained by skin extraction and motion tracking to design the DBN inference. However, both HMM and DBN models assume that observations are conditionally independent given the motion class labels. This restriction makes it difficult or impossible to accommodate long-range dependencies among observations or multiple overlapping features of the observations. Therefore, conditional random fields (CRF) were proposed, which avoid the independence assumption between observations and allow nonlocal dependencies between states and observations. Hidden state variables were then incorporated into the CRF model, yielding the hidden conditional random field (HCRF), which was used for gesture recognition and shown to achieve better performance. Later, the latent-dynamic conditional random field (LDCRF) model was proposed, which combines the strengths of CRFs and HCRFs by capturing both extrinsic dynamics and intrinsic sub-structure; detailed comparisons have been reported. Another important approach, widely used in gesture recognition, is dynamic time warping (DTW). Early DTW-based methods were applied to isolated gesture recognition. An enhanced Level-Building DTW method was then proposed, which can handle the movement epenthesis problem and simultaneously segment and match signs in continuous sign language sentences. Besides these methods, other approaches are also widely used for gesture recognition, such as linguistic sub-units and topology-preserving self-organizing networks. Although the mentioned methods have delivered promising results, most of them assume that the local features (shape, velocity, orientation, position or trajectory) are detected well. However, the prior steps of hand detection and tracking remain major challenges in complex surroundings. Moreover, most of the mentioned methods need dozens or hundreds of training samples to achieve high recognition rates. For example, some authors used at least 50 samples for each class to train an HMM and obtained an average recognition rate of 96%. Besides, Yamato et al. suggested that the recognition rate becomes unstable if the number of samples is small. When there is only one training sample per class, it is difficult for those methods to satisfy the requirements of high-performance application systems. In recent years, BoF-based methods derived from object categorization and action recognition have become an important branch of gesture recognition. Dardas and Georganas proposed a method for real-time hand gesture recognition based on the standard BoF model, but they first needed to detect and track hands, which is difficult against a cluttered background; for example, when the hand and face overlap or the background is similar to skin color, hand detection may fail. Shen et al. extracted maximally stable extremal region (MSER) features from motion divergence fields. In this paper, we propose a unified framework based on bag of features for one-shot learning gesture recognition. The proposed method gives superior recognition performance compared with many existing approaches. A new feature, named 3D EMoSIFT, fuses RGB-D data to detect interest points and constructs a 3D gradient and motion space to calculate SIFT descriptors. Compared with existing features such as Cuboid, Harris3D, MoSIFT (Chen and Hauptmann) and 3D MoSIFT, it achieves competitive performance. Additionally, 3D EMoSIFT features are scale and rotation invariant and can capture more compact and richer video representations even though there is only one training sample for each gesture class. This paper also introduces SOMP to replace VQ in the descriptor coding stage, so that each feature can be represented by a linear combination of a small number of visual code words. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. Although the proposed method has achieved promising results, there are several avenues that can be explored. First, most of the existing local spatio-temporal features are extracted from a static background or a simple dynamic background; in future research, we will focus on extending 3D EMoSIFT to extract features from complex backgrounds, especially for one-shot learning gesture recognition. Next, to speed up processing, fast feature extraction can be achieved on a Graphics Processing Unit (GPU). We will also explore techniques to optimize the parameters, such as the codebook size and sparsity.
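The role of the coding stage mentioned above is easiest to see against the plain VQ baseline, in which every descriptor is assigned to its nearest codeword and a video is summarised as a normalised histogram of codeword counts. The sketch below shows only that baseline (the codebook is assumed to be learned elsewhere, e.g. by k-means); SOMP replaces the single nearest-codeword assignment with a sparse linear combination of codewords.

// Sketch: vector-quantisation coding of local descriptors into a bag-of-features histogram.
public class BagOfFeatures {
    static int nearestCodeword(double[] desc, double[][] codebook) {
        int best = 0; double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < codebook.length; c++) {
            double d = 0;
            for (int j = 0; j < desc.length; j++) {
                double diff = desc[j] - codebook[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }
    static double[] histogram(double[][] descriptors, double[][] codebook) {
        double[] h = new double[codebook.length];
        for (double[] desc : descriptors) h[nearestCodeword(desc, codebook)]++;
        double total = Math.max(1, descriptors.length);   // normalise so videos of different length compare
        for (int c = 0; c < h.length; c++) h[c] /= total;
        return h;
    }
}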
Title: Discovering Motion Primitives for Unsupervised Grouping and One-shot Learning of
Human Actions, Gestures, and Expressions
Author: Yang Yang, Imran Saleemi
Learning from few labeled examples should be an essential feature of any practical action recognition system, because the collection of a large number of examples for each of many diverse categories is an expensive and laborious task. Although humans are adept at learning new object and action categories, the same cannot be said about most existing computer vision methods, even though such capability is of significant importance. A majority of proposed recognition approaches require large amounts of labeled training data, while testing using either a leave-one-out or a train-test split scenario. In this paper, we put forth a discriminative yet flexible representation of gestures and actions that lends itself well to the task of learning from as few examples as possible. We further extend the idea of one-shot learning to attempt a perceptual grouping of unlabelled datasets and to obtain subsets of videos that correspond to a meaningful grouping of actions, for instance, recovering the original class-based partitions. This observation forms the basis of the proposed representation, with the underlying idea that intermediate features (action primitives) should: (a) span contiguous x-y-t volumes that are as large as possible, with smoothly varying motion, and (b) be flexible enough to allow deformations arising from the articulation of body parts. A byproduct of these properties is that the intermediate representation is conducive to human understanding. In other words, a meaningful action primitive is one which can be illustrated visually and described textually, e.g., 'left arm moving upwards', or 'right leg moving outwards and upwards', etc. We argue, and show experimentally, that such a representation is much more discriminative and makes the tasks of 'few-shot' action, gesture, or expression recognition, or unsupervised clustering, simpler as compared to traditional methods. This paper proposes such a representation based on motion primitives. A summary of the method to obtain the proposed representation follows: (i) when required, camera motion is compensated to obtain residual actor-only motion; (ii) a frame-difference-based foreground estimation and 'centralization' of the actor to remove translational motion is performed, resulting in a stack of rectangular image regions coarsely centered around the human; (iii) computation of optical flow to obtain 4D feature vectors (x, y, u, v); (iv) clustering of feature vectors to obtain the components of a Gaussian mixture; (v) spatio-temporal linking of Gaussian components, resulting in instances of primitive actions; and (vi) merging of primitive action instances to obtain the final statistical representation of the primitives. For supervised recognition, given a test video, instances of action primitives are detected in a similar fashion and labeled by comparing against the learned primitives. Sequences of observed primitives in training and test videos are represented as strings and matched using simple alignment to classify the test video. We also experimented with representing primitive sequences as histograms, followed by classifier learning, as well as with using temporal sequences of primitive labels to learn state transition models for each class. Compared to state-of-the-art action representations, the contributions of the proposed work are: completely unsupervised discovery of representative and discriminative action primitives without assuming any knowledge of the number of primitives present or their interpretation; a novel representation of human action primitives that captures the spatial layout, shape, temporal extent, as well as the motion flow of a primitive; a statistical description of primitives as motion patterns, thus providing a generative model capable of estimating the confidence of observing a specific motion at a specific point in space-time, and even of sampling; and a highly abstract, discriminative representation of primitives which can be labeled textually as components of an action, thus making the recognition task straightforward. The paper proposes a method that automatically discovers a flexible and meaningful vocabulary of actions from raw optical flow, learns statistical distributions of these primitives, and, because of the discriminative nature of the primitives, obtains very competitive results using the simplest recognition and classification schemes. The representation offers benefits such as recognition of unseen composite actions, insensitivity to occlusions (partial primitive lists), invariance to the splitting of primitives during learning, detection of cycle extents and counts, etc.
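The "simple alignment" used to match primitive label sequences can be illustrated with an ordinary edit-distance computation; a test video would then be assigned the class of the training sequence with the smallest distance. The integer labels below are purely illustrative, and this sketch is not the cited paper's exact matcher.

// Sketch: edit distance between two sequences of primitive labels, used for simple alignment matching.
public class PrimitiveAlignment {
    static int editDistance(int[] a, int[] b) {
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i;     // cost of deleting all of a
        for (int j = 0; j <= b.length; j++) d[0][j] = j;     // cost of inserting all of b
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int sub = d[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[a.length][b.length];
    }
}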
2.3 Proposed system

Gesture was the first mode of communication for primitive cave men. Later on, human civilization developed verbal communication very well, but non-verbal communication has still not lost its importance. Such non-verbal communication is used not only by physically challenged people, but also for different applications in diversified areas, such as aviation, surveying and music direction. It is the best method to interact with the computer without using other peripheral devices such as a keyboard or mouse. Researchers around the world are actively engaged in the development of robust and efficient gesture recognition systems, more specifically hand gesture recognition systems, for various applications. The major steps associated with a hand gesture recognition system are data acquisition, gesture modeling, feature extraction and hand gesture recognition. The importance of gesture recognition lies in building efficient human-machine interaction. Its applications range from sign language recognition through medical rehabilitation to virtual reality. Given the amount of literature on the problem of gesture recognition and the promising recognition rates reported, one would be led to believe that the problem is nearly solved. Sadly, this is not so. A main problem hampering most approaches is that they rely on several underlying assumptions that may be suitable in a controlled lab setting but do not generalize to arbitrary settings. Common assumptions include high-contrast stationary backgrounds and controlled ambient lighting conditions.

In this proposed system, a multimodal two-stream convolutional neural network is used to learn from the sign language videos to form robust features, and the fusion mode is optimized to achieve the final sign language recognition. We propose a sign language recognition method based on a multimodal two-stream neural network.

The main contributions are as follows: (1) we propose a sampling method named Align Random Sampling within Segments (ARSS), which samples RGB data for spatial feature extraction and samples aligned depth data for temporal feature extraction; (2) the D-shift Net is proposed as a depth motion feature extractor which adapts to the ARSS sampling method and makes full use of the temporal features of the depth data. With the proposed networks, we combine the temporal features and spatial features for sign language recognition.
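Our reading of the sampling idea is that each video is divided into equal segments and one frame index is drawn at random from every segment, with the same indices used for the RGB stream and the aligned depth stream. The sketch below reflects only that interpretation, not the exact algorithm of the cited method.

// Sketch: segment-wise random frame sampling (one index per segment), shared by RGB and aligned depth.
import java.util.Random;

public class SegmentSampling {
    static int[] sampleIndices(int totalFrames, int numSegments, long seed) {
        Random rng = new Random(seed);
        int[] indices = new int[numSegments];
        for (int s = 0; s < numSegments; s++) {
            int begin = s * totalFrames / numSegments;          // segment boundaries
            int end = (s + 1) * totalFrames / numSegments;
            indices[s] = begin + rng.nextInt(Math.max(1, end - begin));
        }
        return indices;                                         // the same indices index RGB and depth frames
    }
}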

2.4 Feasibility study

A feasibility study is carried out to select the best system that meets performance requirements.
The main aim of the feasibility study activity is to determine whether it would be financially and
technically feasible to develop the product. The feasibility study activity involves the analysis of
the problem and collection of all relevant information relating to the product such as the different
data items which would be input to the system, the processing required to be carried out on these
data, the output data required to be produced by the system as well as various constraints on the
behavior of the system.

Technical Feasibility

This is concerned with specifying equipment and software that will successfully satisfy the user
requirement. The technical needs of the system may vary considerably, but might include:

• The facility to produce outputs in a given time.

• Response time under certain conditions.

• Ability to process a certain volume of transaction at a particular speed.

• Facility to communicate data to distant locations.

In examining technical feasibility, the configuration of the system is given more importance than the actual make of the hardware. The configuration should give a complete picture of the system's requirements: how many workstations are required, how these units are interconnected so that they can operate and communicate smoothly, and what speeds of input and output should be achieved at a particular quality of printing.
Economic Feasibility

Economic analysis is the most frequently used technique for evaluating the effectiveness of a
proposed system. More commonly known as Cost / Benefit analysis, the procedure is to
determine the benefits and savings that are expected from a proposed system and compare them
with costs. If benefits outweigh costs, a decision is taken to design and implement the system.
Otherwise, further justification or alternatives to the proposed system will have to be made if it is to have a chance of being approved. This is an ongoing effort that improves in accuracy at each phase of the system life cycle.

Operational Feasibility

This is mainly related to human organizational and political aspects. The points to be considered
are:

• What changes will be brought with the system?

• What organizational structures are disturbed?

• What new skills will be required? Do the existing staff members have these skills? If not, can
they be trained in due course of time?

This feasibility study is carried out by a small group of people who are familiar with information
system technique and are skilled in system analysis and design process. Proposed projects are
beneficial only if they can be turned into information system that will meet the operating
requirements of the organization. This test of feasibility asks if the system will work when it is
developed and installed.
CHAPTER 3

Development Environment

3.1 Hardware Requirements

 Processor : Dual core processor, 2.6 GHz


 RAM : 1 GB
 Hard disk : 160 GB
 Compact Disk : 650 MB
 Keyboard : Standard keyboard
 Monitor : 15 inch color monitor

3.2 Software Requirements

 Operating System : Windows OS


 Language           : JAVA
 IDE : Net beans
 Backend : MYSQL

3.3 Java

This chapter is about the software language and the tools used in the development of the project.
The platform used here is JAVA. The Primary languages are JAVA, J2EE and J2ME. In this
project, J2EE is chosen for implementation. Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a
simpler object model and fewer low-level facilities. Java applications are typically compiled to
byte code that can run on any Java Virtual Machine (JVM) regardless of computer architecture.
Java is general-purpose, concurrent, class-based, and object-oriented, and is specifically designed
to have as few implementation dependencies as possible. It is intended to let application
developers "write once, run anywhere".

Java is considered by many as one of the most influential programming languages of the 20th century, and is widely used, from application software to web applications. The Java framework is a new platform-independent framework that simplifies Internet application development. Java technology's versatility, efficiency, platform portability, and security make it an ideal technology for network computing. From laptops to datacenters, game consoles to scientific supercomputers, cell phones to the Internet, Java is everywhere. Java is a small, simple, safe, object-oriented, interpreted (or dynamically optimized), byte-coded, architecture-neutral, garbage-collected, multithreaded programming language with strongly typed exception handling for writing distributed and dynamically extensible programs.

Java is an object-oriented programming language. Java is a high-level, third-generation language like C, FORTRAN, Smalltalk, Perl and many others. You can use Java to write computer applications that crunch numbers, process words, play games, store data or do any of the thousands of other things computer software can do.

Special programs called applets can be downloaded from the internet and run safely within a web browser. Java supports such applications, and the following features make it one of the best programming languages:

 It is simple and object oriented


 It helps to create user friendly interfaces.
 It is very dynamic.
 It supports multithreading.
 It is platform independent
 It is highly secure and robust.
 It supports internet programming

Java is a programming language originally developed by Sun Microsystems and released in


1995 as a core component of Sun's Java platform. The language derives much of its syntax from
C and C++ but has a simpler object model and fewer low-level facilities. Java applications are
typically compiled to byte code which can run on any Java virtual machine (JVM) regardless of
computer architecture.

The original and reference implementation Java compilers, virtual machines, and class libraries
were developed by Sun from 1995. As of May 2007, in compliance with the specifications of the
Java Community Process, Sun made available most of their Java technologies as free software
under the GNU General Public License. Others have also developed alternative implementations
of these Sun technologies, such as the GNU Compiler for Java and GNU Classpath.

The Java language was created by James Gosling in June 1991 for use in a set top box
project. The language was initially called Oak, after an oak tree that stood outside Gosling's
office - and also went by the name Green - and ended up later being renamed to Java, from a list
of random words. Gosling's goals were to implement a virtual machine and a language that had a
familiar C/C++ style of notation.

OBJECTIVES OF JAVA

Java has been tested, refined, extended, and proven by a dedicated community. And
numbering more than 6.5 million developers, it's the largest and most active on the planet. With
its versatility, efficiency, and portability, Java has become invaluable to developers by enabling
them to:

 Write software on one platform and run it on virtually any other platform
 Create programs to run within a Web browser and Web services
 Develop server-side applications for online forums, stores, polls, HTML forms
processing, and more
 Combine applications or services using the Java language to create highly customized
applications or services
 Write powerful and efficient applications for mobile phones, remote processors, low-cost
consumer products, and practically any other device with a digital heartbeat

Today, many colleges and universities offer courses in programming for the Java
platform. In addition, developers can also enhance their Java programming skills by reading
Sun's java.sun.com Web site, subscribing to Java technology-focused newsletters, using the Java
Tutorial and the New to Java Programming Center, and signing up for Web, virtual, or
instructor-led courses.    
Java Server Pages - An Overview

Java Server Pages, or JSP for short, is Sun's solution for developing dynamic web sites. JSP provides excellent server-side scripting support for creating database-driven web applications. JSP enables developers to insert Java code directly into a JSP file, which makes the development process very simple and its maintenance very easy. JSP pages are efficient: a page is loaded into the web server's memory on receiving the very first request, and subsequent calls are served within a very short period of time.

    In today's environment, most web sites serve dynamic pages based on user requests. A database is a very convenient way to store user data and other things, and JDBC provides excellent database connectivity in heterogeneous database environments. Using JSP and JDBC, it is very easy to develop database-driven web applications. Java is known for its characteristic of "write once, run anywhere," and JSP pages are platform independent.

Java Server Pages (JSP) technology is the Java platform technology for delivering dynamic
content to web clients in a portable, secure and well-defined way. The JavaServer Pages
specification extends the Java Servlet API to provide web application developers with a robust
framework for creating dynamic web content on the server using HTML and XML templates and Java code, which is secure, fast, and independent of server platforms.

JSP has been built on top of the Servlet API and utilizes Servlet semantics. JSP has become the
preferred request handler and response mechanism. Although JSP technology is going to be a
powerful successor to basic Servlets, they have an evolutionary relationship and can be used in a
cooperative and complementary manner.

Servlets are powerful and sometimes they are a bit cumbersome when it comes to generating
complex HTML. Most servlets contain a little code that handles application logic and a lot more
code that handles output formatting. This can make it difficult to separate and reuse portions of
the code when a different output format is needed. For these reasons, web application developers
turn towards JSP as their preferred servlet environment.
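A minimal JSP page of the kind described above, mixing static template HTML with embedded Java, might look as follows; the file name and the page content are only an example, not part of the implemented system.

<%-- hello.jsp : a minimal JSP page mixing static HTML with embedded Java --%>
<%@ page contentType="text/html; charset=UTF-8" %>
<html>
  <body>
    <h1>Gesture Recognition Demo</h1>
    <%-- scriptlet: plain Java executed on every request --%>
    <% String user = request.getParameter("user"); %>
    <p>Welcome <%= (user == null) ? "guest" : user %>,
       the server time is <%= new java.util.Date() %>.</p>
  </body>
</html>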
Evolution of Web Applications

Over the last few years, web server applications have evolved from static to dynamic
applications. This evolution became necessary due to some deficiencies in earlier web site
design. For example, to put more of business processes on the web, whether in business-to-
consumer (B2C) or business-to-business (B2B) markets, conventional web site design
technologies are not enough. The main issues, every developer faces when developing web
applications, are:

1. Scalability - a successful site will have more users, and as the number of users increases rapidly, the web applications have to scale correspondingly.

2. Integration of data and business logic - the web is just another way to conduct business, and so
it should be able to use the same middle-tier and data-access code.

3. Manageability - web sites just keep getting bigger and we need some viable mechanism to
manage the ever-increasing content and its interaction with business systems.

4. Personalization - adding a personal touch to the web page becomes an essential factor to keep
our customer coming back again. Knowing their preferences, allowing them to configure the
information they view, remembering their past transactions or frequent search keywords are all
important in providing feedback and interaction from what is otherwise a fairly one-sided
conversation.

Apart from these general needs for a business-oriented web site, the necessity for new
technologies to create robust, dynamic and compact server-side web applications has been
realized. The main characteristics of today's dynamic web server applications are as follows:

1. Serve HTML and XML, and stream data to the web client

2. Separate presentation, logic and data

3. Interface to databases, other Java applications, CORBA, directory and mail services
4. Make use of application server middleware to provide transactional support.

5. Track client sessions.

Benefits of JSP

One of the main reasons why the Java Server Pages technology has evolved into what it is today
and it is still evolving is the overwhelming technical need to simplify application design by
separating dynamic content from static template display data. Another benefit of utilizing JSP is that it allows a cleaner separation of the roles of web application/HTML designer and software developer. The JSP technology is blessed with a number of exciting benefits, which are
chronicled as follows:

1. The JSP technology is platform independent, in its dynamic web pages, its web servers, and its
underlying server components. That is, JSP pages perform perfectly without any hassle on any
platform, run on any web server, and web-enabled application server. The JSP pages can be
accessed from any web server.

2. The JSP technology emphasizes the use of reusable components. These components can be combined or manipulated towards developing more purposeful components and page designs, which definitely reduces development time. At development time, JSPs are very different from Servlets; however, they are compiled into Servlets at run time and executed by a JSP engine which is installed on a web-enabled application server such as BEA WebLogic or IBM WebSphere.

Servlets

Earlier, in client-server computing, each application had its own client program that worked as a user interface and needed to be installed on each user's personal computer. Most web applications use HTML/XHTML, which is supported by almost all browsers, and web pages are displayed to the client as static documents.

A web page merely displays static content and lets the user navigate through that content, but a web application provides a more interactive experience. Any computer running Servlets or JSP needs to have a container. A container is nothing but a piece of software responsible for loading, executing and unloading the Servlets and JSP.

While servlets can be used to extend the functionality of any Java-enabled server, they are mostly used to extend web servers, and they are an efficient replacement for CGI scripts. CGI was one of the earliest and most prominent server-side dynamic content solutions, so before going forward it is important to know the difference between CGI and Servlets.

Java Servlets

A Java Servlet is a generic server extension, meaning a Java class that can be loaded dynamically to expand the functionality of a server. Servlets are used with web servers and run inside a Java Virtual Machine (JVM) on the server, so they are safe and portable.

Unlike applets, they do not require support for Java in the web browser. Unlike CGI, servlets do not use multiple processes to handle separate requests; requests can be handled by separate threads within the same process. Servlets are also portable and platform independent.
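A minimal servlet of the kind described above only needs to extend HttpServlet and override doGet; the class name and the generated page below are illustrative, not part of the implemented system.

// Minimal servlet sketch: responds to GET requests with a small HTML page.
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<html><body><h1>Gesture recognition server is running</h1></body></html>");
    }
}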

A web server is the combination of a computer and the program installed on it. A web server interacts with the client through a web browser. It delivers web pages to the client and to an application by using the web browser and the HTTP protocol respectively.

A web server can also be defined as a package of a large number of programs installed on a computer connected to the Internet or an intranet for downloading requested files using the File Transfer Protocol, serving e-mail, and building and publishing web pages. A web server works on a client-server model. JSP and Servlets are gaining rapid acceptance as a means to provide dynamic content on the
Internet. With full access to the Java platform, running from the server in a secure manner, the
application possibilities are almost limitless. When JSPs are used with Enterprise JavaBeans
technology, e-commerce and database resources can be further enhanced to meet an enterprise's
needs for web applications providing secure transactions in an open platform. J2EE technology
as a whole makes it easy to develop, deploy and use web server applications instead of mingling
with other technologies such as CGI and ASP. There are many tools for facilitating quick web
software development and to easily convert existing server-side technologies to JSP and Servlets.
CHAPTER 4

System Design and Implementation

4.1 Data flow diagram

Image Acquisition (capture image)
        |
        v
Depth image / RGB image
        |
        v
Foreground segmentation (threshold segmentation -> threshold mask image)
        |
        v
Face and Hand detection
        |
        v
Track the hands
        |
        v
Application access
CHAPTER 5

Modules Description

• Image Acquisition

• Foreground segmentation

• Face and Hand detection

• Hand trajectory classification

• Evaluation criteria

5.1 Image Acquisition

For efficient hand gesture recognition, data acquisition should be as accurate as possible. A suitable input device should be selected for the data acquisition. There are a number of
input devices for data acquisition. Some of them are data gloves, marker, hand images (from
webcam/ stereo camera/ Kinect 3D sensor) and drawings. Data gloves are the devices for perfect
data input with high accuracy and high speed. It can provide accurate data of joint angle,
rotation, location etc. for application in different virtual reality environments. At present,
wireless data gloves are available commercially so as to remove the hindrance due to the cable.
Colored markers attached to the human skin are also used as input technique and hand
localization is done by the color localization. Input can also be fed to the system without any
external costly hardware, except a low-cost web camera. Bare hand (either single or double) is
used to generate the hand gesture and the camera captures the data easily and naturally (without
any contact). Sometimes drawing models are used to input commands to the system. The latest
addition to this list is the Microsoft Kinect 3D depth sensor. Kinect is a 3D motion-sensing input device widely used for gaming. In this module, we take input images from a web camera and capture hand and face images, acquiring both depth and color images. In 3D computer graphics, a depth map is an image or image channel that contains information relating to the distance of the surfaces of scene objects from a viewpoint. The term is related to, and may be analogous to, depth buffer, Z-buffer, Z-buffering and Z-depth. The "Z" in these latter terms refers to the convention that the central axis of view of a camera is in the direction of the camera's Z axis, and not to the absolute Z axis of a scene. We also use color mapping techniques to implement a function that maps (transforms) the colors of one (source) image to the colors of another (target) image. A color mapping may refer to the algorithm that produces the mapping function or to the algorithm that transforms the image colors.
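Since the report does not list the capture code itself, the following sketch shows one common way to grab a webcam frame from Java using the OpenCV bindings; the use of OpenCV, the device index 0 and the output file name are assumptions rather than details taken from the implemented system.

// Sketch: grabbing a frame from a webcam with the OpenCV Java bindings (assumed to be on the classpath).
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.videoio.VideoCapture;

public class CaptureDemo {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);   // load the native OpenCV library
        VideoCapture camera = new VideoCapture(0);      // device 0 = default webcam
        Mat frame = new Mat();
        if (camera.isOpened() && camera.read(frame)) {
            Imgcodecs.imwrite("frame.png", frame);      // store the captured colour frame
        }
        camera.release();
    }
}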

5.2 Foreground segmentation

Separating foreground objects from natural images and video plays an important role in
image and video editing tasks. Despite extensive study in the last two decades, this problem still
remains challenging. In particular, extracting a foreground object from the background in a static
image involves determining both full and partial pixel coverage, also known as extracting a
matte, which is a severely under-constrained problem. Segmenting spatio-temporal video objects
from a video sequence is even harder since extracted foregrounds on adjacent frames must be
both spatially and temporally coherent. Previous approaches for foreground extraction usually
require a large amount of user input and still suffer from inaccurate results and low
computational efficiency.

In the foreground segmentation section, the background is ruled out from the captured frames and the whole human body is kept as the foreground. In this module, we implement a thresholding approach. In computer vision, image segmentation is the process of partitioning a
digital image into multiple segments (sets of pixels, also known as super pixels). The goal of
segmentation is to simplify and/or change the representation of an image into something that is
more meaningful and easier to analyze. Image segmentation is typically used to locate objects
and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process
of assigning a label to every pixel in an image such that pixels with the same label share certain
characteristics. Thresholding is the simplest segmentation method. The pixels are partitioned
depending on their intensity value.
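A minimal thresholding step of the kind described above can be written with the OpenCV Java bindings as follows; the use of Otsu's method to pick the threshold automatically is an illustrative choice, not necessarily the exact setting of the implemented system.

// Sketch: grey-level conversion followed by binary thresholding (OpenCV Java bindings assumed).
import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

public class ForegroundThreshold {
    static Mat binaryMask(Mat colorFrame) {
        Mat gray = new Mat();
        Imgproc.cvtColor(colorFrame, gray, Imgproc.COLOR_BGR2GRAY);
        Mat mask = new Mat();
        // Otsu picks the threshold automatically; 0 and 255 are the output levels.
        Imgproc.threshold(gray, mask, 0, 255, Imgproc.THRESH_BINARY + Imgproc.THRESH_OTSU);
        return mask;
    }
}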

5.3 Face and hand detection

Face and hand detection was used to initialize the position of the face and hands for the
tracking phase. After initialization, both face and hands were tracked through video sequences by
HMM method.
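For the initialization step, one common choice (not necessarily the exact one used here) is a Haar cascade. The sketch below detects faces in a grey frame with the OpenCV CascadeClassifier; a similar cascade or a skin-colour step could seed the hand positions, and the cascade file path is an assumption.

// Sketch: face detection with a Haar cascade to initialise tracking (cascade file path is an assumption).
import org.opencv.core.Mat;
import org.opencv.core.MatOfRect;
import org.opencv.core.Rect;
import org.opencv.objdetect.CascadeClassifier;

public class FaceInitializer {
    static Rect[] detectFaces(Mat grayFrame) {
        CascadeClassifier cascade = new CascadeClassifier("haarcascade_frontalface_default.xml");
        MatOfRect faces = new MatOfRect();
        cascade.detectMultiScale(grayFrame, faces);   // returns bounding boxes of detected faces
        return faces.toArray();
    }
}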
5.4 Hand trajectory classification

Hand tracking results were segmented as trajectories, compared with motion models, and
decoded as commands for robotic control.

Neural networks are composed of simple elements operating in parallel. These elements
are inspired by biological nervous systems. As in nature, the network function is determined
largely by the connections between elements. We can train a neural network to perform a
particular function by adjusting the values of the connections (weights) between elements.
Commonly, neural networks are adjusted, or trained, so that a particular input leads to a specific target output. The network is adjusted, based on a comparison of the output and the target, until the network output matches the target. Typically, many such input/target pairs are used to train a network in this supervised learning (a training method studied in more detail in the following chapter).
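The weight-adjustment idea described above can be reduced to a very small example: a single neuron trained with the delta rule on input/target pairs. The learning rate and epoch count are arbitrary, and this is only a sketch of supervised training, not the network actually used for trajectory classification.

// Sketch: training a single neuron with the delta rule (supervised learning on input/target pairs).
public class DeltaRule {
    static double[] train(double[][] inputs, double[] targets, int epochs, double rate) {
        double[] w = new double[inputs[0].length + 1];           // last entry is the bias weight
        for (int e = 0; e < epochs; e++) {
            for (int n = 0; n < inputs.length; n++) {
                double out = w[w.length - 1];                    // bias
                for (int j = 0; j < inputs[n].length; j++) out += w[j] * inputs[n][j];
                double error = targets[n] - out;                 // compare output with target
                for (int j = 0; j < inputs[n].length; j++) w[j] += rate * error * inputs[n][j];
                w[w.length - 1] += rate * error;
            }
        }
        return w;
    }
}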

5.5 Evaluation criteria

The proposed system was able to detect fingertips even when they were in front of the palm, and it reconstructed a 3D image of the hand that was visually comparable. The system reported results 90-95% accurate for open fingers, which is quite acceptable, while for closed fingers accuracy was only 10-20%, because a closed or bent finger comes in front of the palm and skin colour detection cannot then distinguish the palm from the finger. Image quality and the operator were the main reasons for low detection rates; about 90% accuracy is claimed if the lighting conditions are good. Six different parameters were used to control the performance of the system, and when too much noise was found it could be controlled using two parameters called α and β respectively. Finally, about 90.45% accuracy is claimed, though hidden fingers were not detected in this approach.
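The accuracy figures quoted above are simple ratios of correctly recognised gestures to the total number tested; a small helper of the following form can be used to report them (the integer label encoding is illustrative).

// Sketch: accuracy as the percentage of correctly recognised gestures.
public class Evaluation {
    static double accuracyPercent(int[] predicted, int[] groundTruth) {
        int correct = 0;
        for (int i = 0; i < predicted.length; i++)
            if (predicted[i] == groundTruth[i]) correct++;
        return 100.0 * correct / predicted.length;
    }
}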
CHAPTER 6

System Testing

6.1 Testing types


A test case is a set of data that the system will process as normal input. The strategies that we have used in our project are:
System Testing
Testing is the stage of implementation which is aimed at ensuring that the system works accurately and efficiently before live operation commences. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, the goal will be achieved. The candidate system is subjected to a variety of tests: online response, volume, stress, recovery, security and usability tests. A series of tests is performed on the proposed system before the system is ready for user acceptance testing.
In this project, we perform system testing to check that the implemented platform, the Android SDK and Eclipse, is running properly to execute the code, and we use the web camera to capture the image.
Unit Testing

The procedure-level testing is done first. By giving improper inputs, the errors that occur are noted and eliminated. Then web-form-level testing is done.

Each and every module is checked in this unit testing phase. The controls used execute the code successfully without any execution or run-time errors. The unit tester checks each module's output.

Integration Testing

Testing is done for each module. After testing all the modules, the modules are integrated, and testing of the final system is done with test data specially designed to show that the system will operate successfully under all conditions. Thus, system testing is a confirmation that all is correct and an opportunity to show the user that the system works.
In this testing, the flow of each module is checked. The image acquisition, preprocessing, detection and recognition modules are integrated into the project. The function flow executes successfully from the first module to the final module. Integration testing provides a proof of concept by listing the successfully executed conditions.

Validation Testing

The final step involves validation testing, which determines whether the software functions as the user expects. The end user, rather than the system developer, conducts this test; most software developers use a process called alpha and beta testing to uncover defects that only the end user seems able to find. The completion of the entire project is based on the full satisfaction of the end users.
CHAPTER 7

Implementation Results

Implementation is the process that actually yields the lowest-level system elements in the
system hierarchy (system breakdown structure). System elements are made, bought, or reused.
Production involves the hardware fabrication processes of forming, removing, joining, and
finishing, the software realization processes of coding and testing, or the operational procedures
development processes for operators' roles. If implementation involves a production process, a
manufacturing system which uses the established technical and management processes may be
required. The purpose of the implementation process is to design and create (or fabricate) a
system element conforming to that element’s design properties and/or requirements. The element
is constructed employing appropriate technologies and industry practices. This process bridges
the system definition processes and the integration process.

This process may create several artifacts such as an implemented system

 implementation tools

 implementation procedures

 an implementation plan or strategy

 verification reports

 issue, anomaly, or trouble reports

 change requests (about design)


CHAPTER 8

Conclusion

The design of more natural and multimodal forms of interaction with computers or systems is an aim worth achieving. Vision-based interfaces can offer appealing solutions for introducing non-intrusive systems with interaction by means of gestures. In order to build reliable and robust perceptual user interfaces based on computer vision, certain practical constraints must be taken into account: the application must be capable of working well in any environment and should make use of low-cost devices. This work has proposed a new combination of several computer vision techniques for facial and hand feature detection and tracking and for gesture recognition, some of which have been improved and enhanced to reach greater stability and robustness. A hands-free interface able to replace the standard mouse motions and events has been developed using these techniques. Hand gesture recognition is finding applications in non-verbal communication between humans and computers, between able-bodied and physically challenged people, in 3D gaming, virtual reality, etc. With the increase in applications, gesture recognition systems demand a lot of research in different directions. Finally, we implemented effective and robust algorithms to solve the false merge and false labeling problems of hand tracking under interaction and occlusion.
CHAPTER 9
Future enhancements

In the future, we plan to improve the accuracy of the hand gesture recognition system and also to include eye-blink detection for accessing the system. A vision-based system for the detection of voluntary eye blinks would be implemented as a human-computer interface for people with disabilities. The algorithm would allow for eye-blink detection, estimation of the eye-blink duration and interpretation of a sequence of blinks in real time to control a non-intrusive human-computer interface. The detected eye blinks would be classified as short blinks (shorter than 200 ms) or long blinks (longer than 200 ms). Separate short eye blinks are assumed to be spontaneous and are not included in the designed eye-blink code.
CHAPTER 10
Output Screenshots

Gesture Recognition Screen
Gestures Capturing and Recognition via Camera
Preprocessing and Segmentation
ROI Extraction
Datasets

References

[1] M. R. Ahsan, “EMG signal classification for human computer interaction: A review,” Eur. J.
Sci. Res., vol. 33, no. 3, pp. 480–501, 2009.
[2] J. A. Jacko, “Human–computer interaction design and development approaches,” in Proc.
14th HCI Int. Conf., 2011, pp. 169–180.
[3] I. H. Moon, M. Lee, J. C. Ryu, and M. Mun, “Intelligent robotic wheelchair with EMG-,
gesture-, and voice-based interface," Intell. Robots Syst., vol. 4, pp. 3453–3458, 2003.
[4] M. Walters, S. Marcos, D. S. Syrdal, and K. Dautenhahn, “An interactive game with a robot:
People’s perceptions of robot faces and a gesture based user interface,” in Proc. 6th Int. Conf.
Adv. Computer–Human Interactions, 2013, pp. 123–128.
[5] O. Brdiczka, M. Langet, J. Maisonnasse, and J. L. Crowley, “Detecting human behavior
models from multimodal observation in a smart home,” IEEE Trans. Autom. Sci. Eng., vol. 6,
no. 4, pp. 588–597, Oct. 2009.
[6] M. A. Cook and J. M. Polgar, Cook & Hussey’s Assistive Technologies: Principles and
Practice, 3rd ed. Maryland Heights, MO, USA: Mosby Elsevier, 2008, pp. 3–33.
[7] G. R. S. Murthy, and R. S. Jadon, “A review of vision based hand gesture recognition,” Int. J.
Inform. Technol. Knowl. Manage., vol. 2, no. 2, pp. 405–410, 2009.
[8] D. Debuse, C. Gibb, and C. Chandler, “Effects of hippotherapy on people with cerebral palsy
from the users’ perspective: A qualitative study,” Physiotherapy Theory Practice, vol. 25, no. 3,
pp. 174–192, 2009.
[9] J. A. Sterba, B. T. Rogers, A. P. France, and D. A. Vokes, “Horseback riding in children with
cerebral palsy: Effect on gross motor function,” Develop. Med. Child Neurology, vol. 44, no. 5,
pp. 301–308, 2002.
[10] K. L. Kitto, “Development of a low-cost sip and puff mouse,” in Proc. 16th Annu. Conf.
RESNA, 1993, pp. 452–454.
