
ARTIFICIAL INTELLIGENCE FOR SPEECH RECOGNITION

CHAPTER 1

INTRODUCTION
The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it into text that an application understands. The application can then do one of two things: it can interpret the result of the recognition as a command, in which case it is a command and control application, or it can handle the recognized text simply as text, in which case it is considered a dictation application. The user speaks to the computer through a microphone; the system identifies the words and sends them to an NLP device for further processing. Once recognized, the words can be used in a variety of applications such as display, robotics, commands to computers, and dictation. No special commands or computer language are required, and there is no need to enter programs in a special language for creating software. Instead of talking to your computer, you are essentially talking to a web site, and you are doing this over the phone. So what exactly is speech recognition? Simply put, it is the process of converting spoken input to text; speech recognition is thus sometimes referred to as speech-to-text. Speech recognition allows you to provide input to an application with your voice. In the desktop world, you need a microphone to be able to do this; in the Voice XML world, all you need is a telephone.

When you dial the telephone number of a big company, you are likely to hear the sonorous voice of a cultured lady who responds to your call with great courtesy, saying “Welcome to company X. Please give me the extension number you want.” You pronounce the extension number, your name, and the name of the person you want to contact. If the called person accepts the call, the connection is made quickly. This is artificial intelligence at work: an automatic call-handling system is used without employing any telephone operator.
AI is the study of how to make computers perform tasks that, at present, are done better by humans. AI is an interdisciplinary field in which computer science intersects with philosophy, psychology, engineering and other fields. Humans make decisions based upon experience and intention. The integration of computers to mimic this learning process is the essence of what is known as Artificial Intelligence Integration.


CHAPTER 2

THE TECHNOLOGY
A human identity recognition system based on voice analysis has numerous potential applications. Automatic Speaker Recognition (ASR) is one such system: it can recognize a person based on his or her voice. This is achieved by implementing complex signal processing algorithms that run on a digital computer or a processor. The application is analogous to fingerprint recognition and other biometric recognition systems that are based on certain characteristics of a person.
There are several occasions when we want to identify a person from a given group of people even when the person is not present for physical examination. For example, when a person converses on a telephone, all we have is the person’s voice for analysis. It then makes sense to develop a recognition system based on voice.

Speaker recognition has typically been classified as either a verification task or an identification task. Speaker verification is usually the simpler of the two, since it involves comparing the input signal with a single stored reference pattern. The verification task therefore only requires the system to confirm whether the speaker is the person he or she claims to be. Speaker identification is more complex because the test speaker must be compared against a number of reference speakers to determine whether a match can be made. Not only must the input signal be examined to see whether it came from a known speaker, but the individual speaker must also be identified.

The identification of speakers remains a difficult task for a number of reasons. First, acquiring a unique speech signal can suffer as a result of variation in a speaker’s voice inputs and environmental factors: both the volume and the pace of speech can vary from one test to another. Also, unless initially constrained, an extensive vocabulary or unstructured grammar can affect results. Background noise must also be kept to a minimum so that a changing environment does not divert the speaker’s attention or alter the final voicing of a word or sentence. As a result, many restrictions and clarifications have been placed on speaker and speech recognition systems.


One such restriction involves using a closed set for speaker recognition. A closed set implies that only speakers within the original stored set will be asked to be identified. An open set would allow the additional possibility of a test speaker not coming from the initially trained set of speakers, thereby requiring the system to recognize the speaker as not belonging to the original set. An open-set system may also have the task of learning a new speaker and placing him or her within the original set for future reference.

Another common restriction involves using a text-dependent speaker recognition system. This type of system requires the speaker to utter a specific word or phrase, which is compared against the original set of like phrases. Text-independent recognition, which in most cases is more complex and difficult to perform, identifies the speaker regardless of the text or phrase spoken.

Once an utterance, or signal, has been recorded, it is usually necessary to process it to get
the voiced signal in a form that makes classification and recognition possible. Various methods
have included the use of power spectrum values, spectrum coefficients, linear predictive coding,
and a nearest neighbor distance algorithm. Tests have also shown that although spectrum
coefficients and linear predictive coding have given better results for conventional template and
statistical classification methods, power spectrum values have performed better when using
neural networks during the final recognition stages.
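As a rough sketch of the first of these feature types, the Python fragment below computes per-frame power spectrum values for a digitized utterance; the frame length, hop size, windowing, and sample rate are arbitrary illustrative choices, not parameters taken from any particular system.

```python
import numpy as np

def power_spectrum_features(signal, frame_len=400, hop=160):
    """Split a digitized utterance into overlapping frames and return the
    power spectrum of each frame (illustrative feature values only)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.fft.rfft(frame)             # frequency content of the frame
        frames.append(np.abs(spectrum) ** 2)      # power in each frequency bin
    return np.array(frames)

# Example: one second of stand-in "speech" sampled at 16 kHz
features = power_spectrum_features(np.random.randn(16000))
print(features.shape)   # (number of frames, number of frequency bins)
```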

Various methods have also been used to perform the classification and recognition of the processed speech signal. Statistical methods utilizing Hidden Markov Models, linear vector quantizers, or classical techniques such as template matching have produced encouraging, yet limited, success. More recent deployments using neural networks, while producing varied success rates, have offered more options regarding the types of inputs sent to the networks, as well as the ability to learn speakers in both an offline and an online manner. Although back-propagation networks have traditionally been used, more sophisticated networks, such as the ART 2 network, have also been implemented.

ASR can be broadly classified into four types:

1. Text-independent identification

2. Text-independent verification


3. Text-dependent identification

4. Text-dependent verification

Speaker identification is a procedure by which a speaker is identified from a group of ‘n’ people. It should be noted that a totally new speaker not belonging to the group could wrongly be identified as someone from within the group. Speaker verification is a procedure by which a speaker’s claimed identity is verified as being correct or not.

A fundamental requirement for any ASR system is gathering reference samples and finding certain features from the voice that are characteristic of a person. These feature vectors are then stored. When a new test sample is made available, the references are either searched to find the closest match (in the case of identification), or a threshold on a distance measure is checked (in the case of verification).
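The two decision rules can be expressed in a few lines. The sketch below is only an illustration: the Euclidean distance, the two-dimensional feature vectors, the speaker names, and the threshold value are all invented for the example rather than taken from a real ASR system.

```python
import numpy as np

def identify(test_vec, references):
    """Identification: return the enrolled speaker whose stored reference
    vector lies closest to the test vector."""
    distances = {name: np.linalg.norm(test_vec - ref)
                 for name, ref in references.items()}
    return min(distances, key=distances.get)

def verify(test_vec, claimed_ref, threshold=0.5):
    """Verification: accept the claimed identity only if the distance to
    that speaker's reference vector falls below a set threshold."""
    return np.linalg.norm(test_vec - claimed_ref) < threshold

# Invented two-dimensional "feature vectors" for two enrolled speakers
references = {"alice": np.array([1.0, 0.2]), "bob": np.array([0.1, 0.9])}
sample = np.array([0.9, 0.3])
print(identify(sample, references))        # -> 'alice' (closest reference)
print(verify(sample, references["bob"]))   # -> False (distance above threshold)
```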

The next aspect to be considered is text dependency. In a text-independent situation, the reference utterance and the test utterance are not the same. This type of recognition system finds its applications in criminology. In a text-dependent situation, the reference utterance and the test utterance are the same, which gives a higher degree of accuracy. This type of recognition system has applications where security is a matter of concern, such as access to a building, a lab, or a computer.

The dominant technology used in ASR is called the Hidden Markov Model, or HMM.
This technology recognizes speech by estimating the likelihood of each phoneme at contiguous,
small regions (frames) of the speech signal. Each word in a vocabulary list is specified in terms
of its component phonemes. A search procedure is used to determine the sequence of phonemes
with the highest likelihood. This search is constrained to only look for phoneme sequences that
correspond to words in the vocabulary list, and the phoneme sequence with the highest total
likelihood is identified with the word that was spoken. In standard HMMs, the likelihoods are
computed using a Gaussian Mixture Model; in the HMM/ANN framework, these values are
computed using an artificial neural network (ANN).
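The sketch below is a heavily simplified illustration of this constrained search: per-frame phoneme log-likelihoods are assumed to be given already, each vocabulary word is listed as its component phonemes, and a stripped-down dynamic programming alignment (a toy stand-in for a full Viterbi decoder with transition probabilities) scores each word. The vocabulary and scores are invented.

```python
# Hypothetical vocabulary: each word is specified as its component phonemes.
VOCAB = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}
NEG = float("-inf")

def word_log_likelihood(frame_scores, phonemes):
    """Best left-to-right alignment of a word's phoneme sequence to the
    frames, scored by dynamic programming (a toy Viterbi-style search).
    frame_scores: one dict per frame mapping phoneme -> log-likelihood."""
    T, N = len(frame_scores), len(phonemes)
    dp = [[NEG] * N for _ in range(T)]
    dp[0][0] = frame_scores[0].get(phonemes[0], NEG)
    for t in range(1, T):
        for j in range(N):
            stay = dp[t - 1][j]                            # remain in the same phoneme
            advance = dp[t - 1][j - 1] if j > 0 else NEG   # move on to the next phoneme
            dp[t][j] = max(stay, advance) + frame_scores[t].get(phonemes[j], NEG)
    return dp[T - 1][N - 1]          # path must end in the word's final phoneme

def recognize(frame_scores):
    """Return the vocabulary word whose phoneme sequence scores highest."""
    return max(VOCAB, key=lambda w: word_log_likelihood(frame_scores, VOCAB[w]))

# Four frames of made-up per-phoneme log-likelihoods
frames = [{"y": -1.0, "n": -3.0}, {"eh": -1.0, "ow": -2.0},
          {"eh": -1.0, "ow": -2.0}, {"s": -1.0, "ow": -2.0}]
print(recognize(frames))             # -> 'yes'
```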


CHAPTER 3

SPEECH RECOGNITION

The user speaks to the computer through a microphone; the system identifies the words and sends them to an NLP device for further processing. Once recognized, the words can be used in a variety of applications such as display, robotics, commands to computers, and dictation.
A word recognizer is a speech recognition system that identifies individual words. Early pioneering systems could recognize only individual letters and numbers. Today, the majority of speech recognition systems are word recognizers, with more than 95% recognition accuracy. Such systems are capable of recognizing a small vocabulary of single words or simple phrases. To enter data into a computer, one must speak the input information in clearly definable single words, with a pause between words. Continuous speech recognizers are far more difficult to build than word recognizers. With them, you speak complete sentences to the computer; the input is recognized and then processed by NLP. Such recognizers employ sophisticated, complex techniques to deal with continuous speech, because when one speaks continuously, most of the words slur together and it is difficult for the system to know where one word ends and the next begins. Unlike word recognizers, the information spoken is not recognized instantly by this type of system.

What is a speech recognition system?

A speech recognition system is a type of software that allows the user to have their
spoken words converted into written text in a computer application such as a word processor or
spreadsheet. The computer can also be controlled by the use of spoken commands.

Speech recognition software can be installed on a personal computer of appropriate specification. The user speaks into a microphone (a headphone microphone is usually supplied with the product). The software generally requires an initial training and enrolment process in order to teach the software to recognize the voice of the user. A voice profile is then produced that is unique to that individual. This procedure also helps the user to learn how to ‘speak’ to a computer.


3.1 Speech recognition process

Figure 3.1: Speech recognition process (dialogue with the user). Block diagram: the user's voice is captured by the speech recognition device and passed to NLP for understanding; the recognized output drives a display, dictating applications, commands to the computer, and inputs to robots and other expert systems.

After the training process, the user’s spoken words will produce text; the accuracy of this
will improve with further dictation and conscientious use of the correction procedure. With a
well-trained system, around 95% of the words spoken could be correctly interpreted. The system
can be trained to identify certain words and phrases and examine the user’s standard documents
in order to develop an accurate voice file for the individual.

However, there are many other factors that need to be considered in order to achieve a
high recognition rate. There is no doubt that the software works and can liberate many learners,
but the process can be far more time consuming than first time users may appreciate and the
results can often be poor. This can be very demotivating, and many users give up at this stage.
Quality support from someone who is able to show the user the most effective ways of using the
software is essential.
When using speech recognition software, the user’s expectations and the advertising on
the box may well be far higher than what will realistically be achieved. ‘You talk and it types’
can be achieved by some people only after a great deal of perseverance and hard work.


3.2 Terms and Concepts

Following are a few of the basic terms and concepts that are fundamental to speech
recognition. It is important to have a good understanding of these concepts when developing
Voice XML applications.

3.2.1 Utterances

When the user says something, this is known as an utterance. An utterance is any stream of speech between two periods of silence. Utterances are sent to the speech engine to be processed. Silence, in speech recognition, is almost as important as what is spoken, because silence delineates the start and end of an utterance. Here is how it works. The speech recognition engine is "listening" for speech input. When the engine detects audio input (in other words, a lack of silence), the beginning of an utterance is signaled.

Similarly, when the engine detects a certain amount of silence following the audio, the end of the utterance occurs. If the user does not say anything, the engine returns what is known as a silence timeout: an indication that there was no speech detected within the expected timeframe, and the application takes an appropriate action, such as reprompting the user for input. An utterance can be a single word, or it can contain multiple words (a phrase or a sentence).
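A toy version of this endpointing logic is sketched below; the frame energies, the silence threshold, and the trailing-silence count are invented values, and real engines use far more robust voice-activity detection.

```python
def find_utterance(frame_energies, silence_threshold=0.1, trailing_silence=3):
    """Return (start, end) frame indices of the first utterance, or None.
    Speech begins when the energy rises above the threshold; the utterance
    ends after `trailing_silence` consecutive quiet frames (a crude
    stand-in for the engine's silence timeout)."""
    start, quiet = None, 0
    for i, energy in enumerate(frame_energies):
        if start is None:
            if energy > silence_threshold:
                start = i                       # audio detected: utterance begins
        elif energy <= silence_threshold:
            quiet += 1
            if quiet >= trailing_silence:       # enough silence: utterance ends
                return start, i - trailing_silence + 1
        else:
            quiet = 0
    return None if start is None else (start, len(frame_energies))

energies = [0.02, 0.03, 0.8, 0.9, 0.7, 0.05, 0.04, 0.03, 0.02]
print(find_utterance(energies))   # -> (2, 5): frames 2-4 carry the speech
```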

3.2.2 Pronunciations

The speech recognition engine uses all sorts of data, statistical models, and algorithms to
convert spoken input into text. One piece of information that the speech recognition engine uses
to process a word is its pronunciation, which represents what the speech engine thinks a word
should sound like. Words can have multiple pronunciations associated with them. For example,
the word “the” has at least two pronunciations in U.S. English: “thee” and “thuh.” As a Voice XML application developer, you may want to provide multiple pronunciations for certain words and phrases to allow for variations in the ways your callers may speak them.


3.2.3 Grammars

As a Voice XML application developer, you must specify the words and phrases that
users can say to your application. These words and phrases are defined to the speech recognition
engine and are used in the recognition process. You can specify the valid words and phrases in a
number of different ways, but in Voice XML, you do this by specifying a grammar. A grammar
uses a particular syntax, or set of rules, to define the words and phrases that can be recognized by
the engine. A grammar can be as simple as a list of words, or it can be flexible enough to allow
such variability in what can be said that it approaches natural language capability.
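Purely for illustration, the sketch below treats a grammar as a flat list of allowed phrases and checks a recognized utterance against it; real Voice XML grammars are written in a dedicated grammar syntax rather than in application code, and the phrases and semantic tags here are invented.

```python
# Invented grammar for a call-routing prompt: the only phrases the engine
# is allowed to recognize at this point in the dialogue, each mapped to a
# hypothetical semantic result.
GRAMMAR = {
    "sales": "route_sales",
    "technical support": "route_support",
    "speak to an operator": "route_operator",
}

def match_grammar(utterance):
    """Return the semantic result for an in-grammar utterance, else None
    (an out-of-grammar utterance would normally trigger a reprompt)."""
    return GRAMMAR.get(utterance.strip().lower())

print(match_grammar("Technical Support"))   # -> 'route_support'
print(match_grammar("billing"))             # -> None (not in the grammar)
```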

3.2.4 Accuracy

The performance of a speech recognition system is measurable. Perhaps the most widely used measurement is accuracy. It is typically a quantitative measurement and can be calculated in several ways. Arguably the most important measurement of accuracy is whether the desired end result occurred. This measurement is useful in validating application design.

Another measurement of recognition accuracy is whether the engine recognized the utterance exactly as spoken. This measure of recognition accuracy is expressed as a percentage and represents the number of utterances recognized correctly out of the total number of utterances spoken. It is a useful measurement when validating grammar design.
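Computed this way, accuracy is simply the number of correctly recognized utterances divided by the total number of utterances, as in the small hypothetical tally below (the transcripts are invented).

```python
def recognition_accuracy(results):
    """Percentage of utterances whose recognized text exactly matches what
    was actually spoken."""
    correct = sum(1 for spoken, recognized in results if spoken == recognized)
    return 100.0 * correct / len(results)

# Invented test run: three utterances, one misrecognition
test_results = [("check balance", "check balance"),
                ("transfer funds", "transfer funds"),
                ("pay bill", "play bill")]
print(recognition_accuracy(test_results))   # -> 66.66... (2 of 3 correct)
```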

Recognition accuracy is an important measure for all speech recognition applications. It is tied to grammar design and to the acoustic environment of the user. You need to measure the recognition accuracy for your application, and may want to adjust your application and its grammars based on the results obtained when you test your application with typical users.


CHAPTER 4

SPEAKER INDEPENDENCE
The speech quality varies from person to person. It is therefore difficult to build an electronic system that recognizes everyone’s voice. By limiting the system to the voice of a single person, the system becomes not only simpler but also more reliable. The computer must be trained to the voice of that particular individual. Such a system is called a speaker-dependent system.
A speaker-independent system can be used by anybody and can recognize any voice, even though the characteristics vary widely from one speaker to another. Most of these systems are costly and complex, and they have very limited vocabularies. It is important to consider the environment in which the speech recognition system has to work. The grammar used by the speaker and accepted by the system, the noise level, the noise type, the position of the microphone, and the speed and manner of the user’s speech are some factors that may affect the quality of speech recognition.

4.1 Speaker Dependence vs Speaker Independence

Speaker Dependence describes the degree to which a speech recognition system requires
knowledge of a speaker’s individual voice characteristics to successfully process speech. The
speech recognition engine can “learn” how you speak words and phrases; it can be trained to
your voice.

Speech recognition systems that require a user to train the system to his/her voice are
known as speaker-dependent systems. If you are familiar with desktop dictation systems, most
are speaker dependent. Because they operate on very large vocabularies, dictation systems
perform much better when the speaker has spent the time to train the system to his/her voice.

Speech recognition systems that do not require a user to train the system are known as speaker-independent systems. Speech recognition in the Voice XML world must be speaker-independent. Think of how many users (hundreds, maybe thousands) may be calling into your web site; you cannot require that each caller train the system to his or her voice. The speech recognition system in a voice-enabled web application must successfully process the speech of many different callers without having to understand the individual voice characteristics of each caller.

4.1.1 Advantages of a speaker-independent system

The advantage of a speaker-independent system is obvious: anyone can use the system without first training it. However, its drawbacks are not so obvious. One limitation is the work that goes into creating the vocabulary templates. To create reliable speaker-independent templates, someone must collect and process numerous speech samples. This is a time-consuming task, and creating these templates is not a one-time effort. Speaker-independent templates are language-dependent, and the templates are sensitive not only to dissimilar languages but also to the differences between British and American English. Therefore, as part of your design activity, you would need to create a set of templates for each language or major dialect that your customers use. Speaker-independent systems also have a relatively fixed vocabulary because of the difficulty of creating a new template in the field at the user’s site.

4.1.2 The advantage of a speaker-dependent system

A speaker-dependent system requires the user to train the ASR system by providing examples of his or her own speech. Training can be a tedious process, but the system has the advantage of using templates that refer only to the specific user and not to some vague average voice. The result is language independence: you can say ja, si, or ya during training, as long as you are consistent. The drawback is that the speaker-dependent system must do more than simply match incoming speech to the templates; it must also include the resources to create those templates.

4.1.3 Which is better?

For a given amount of processing power, a speaker-dependent system tends to provide more accurate recognition than a speaker-independent system. This does not mean the speaker-independent system is inherently worse; the difference in performance stems from the speaker-independent templates having to encompass wide variations in speech.


4.2 System Configuration


Figures 4.1 and 4.2 show the identification system and the verification system configurations, respectively. The first part of the system consists of the data acquisition hardware that acquires the speech, performs some signal conditioning, digitizes it, and gives it to the computer/processor.

The second part consists of core signal processing and system identification techniques to extract speaker-specific features. These features are stored and are used at a later time for the actual identification/verification test. At this stage, the system is ready for identification or verification.

Now, when the test sample is uttered by one of the members of the group, the speech is digitized and the features are extracted. For identification, the distances between this vector and all the reference vectors are measured, and the closest vector is picked as the correct one. This vector corresponds to the person whom the system claims to have identified. For verification, the person claims his or her identity. The distance between the corresponding reference vector and the test vector is then computed. If the measured distance is less than a set threshold, the verification system accepts the speaker; if not, it rejects the speaker.

Figure 4.1: Speaker identification (block diagram: voice sample, ADC with signal conditioning, algorithm to select features, measurement of distance against the stored reference vectors, and decision making that outputs the identity of the person).


Figure 4.2: Speaker verification (block diagram: voice sample from a person claiming an identity, ADC with signal conditioning, algorithm to select features, measurement of distance against the claimed speaker's reference vector, and threshold comparison that decides whether the person is verified or not).

The voice input to the microphone produces an analogue speech signal. An analogue-to-digital converter (ADC) converts this speech signal into binary words that are compatible with a digital computer. The converted binary version is then stored in the system and compared with previously stored binary representations of words and phrases. The current input speech is compared, one pattern at a time, with the previously stored speech patterns as the computer searches. When a match occurs, recognition is achieved. The spoken word, in binary form, is written on a video screen or passed along to a natural language understanding processor for additional analysis.

Since most recognition systems are speaker-dependent, it is necessary to train a system to recognize the dialect of each new user. During training, the computer displays a word and the user reads it aloud. The computer digitizes the user’s voice and stores it. The speaker has to read aloud about 1,000 words. Based on these samples, the computer can predict how the user utters some words that are likely to be pronounced differently by different people.

The block diagram of a speaker-dependent word recognizer is shown in Figure 5.1. The user speaks into the microphone, which converts the sound into an electrical signal. The electrical analogue signal from the microphone is fed to an amplifier provided with automatic gain control (AGC) to produce an amplified output signal in a specific optimum voltage range, even when the input signal varies from feeble to loud.


The analogue signal, representing a spoken word, contains many individual frequencies of various amplitudes and different phases, which when blended together take the shape of a complex waveform. A set of filters is used to break this complex input signal into its component parts. Band-pass filters (BPF) pass frequencies only in a certain frequency range, rejecting all other frequencies. Generally, about sixteen filters are used; a simple system may contain a minimum of three filters. The more filters used, the higher the probability of accurate recognition.

Presently, switched-capacitor digital filters are used because these can be custom-built in integrated circuit form. They are smaller and cheaper than active filters using operational amplifiers. The filter output is then fed to the ADC to translate the analogue signal into digital words. The ADC samples the filter outputs many times a second. Each sample represents a different amplitude of the signal.

Evenly spaced vertical lines represent the amplitude of the audio filter output at the instant of sampling. Each value is then converted to a binary number proportional to the amplitude of the sample. A central processing unit controls the input circuits that are fed by the ADCs. A large RAM stores all the digital values in a buffer area. This digital information, representing the spoken word, is then accessed by the CPU for further processing. Normal speech has a frequency range of 200 Hz to 7 kHz. Recognizing a telephone call is more difficult, as it has a bandwidth limitation of 300 Hz to 3.3 kHz.

As explained earlier, the spoken words are processed by the filters and ADCs. The binary representation of each of these words becomes a template, or standard, against which future words are compared. These templates are stored in the memory. Once the storing process is completed, the system can go into its active mode and is capable of identifying spoken words. As each word is spoken, it is converted into its binary equivalent and stored in RAM. The computer then starts searching and compares the binary input pattern with the templates.

It is to be noted that even if the same speaker utters the same text, there are always slight variations in the amplitude or loudness of the signal, pitch, frequency, time gaps, and so on. For this reason, there is never a perfect match between the template and the binary input word. The pattern matching process therefore uses statistical techniques and is designed to look for the best fit.
The values of the binary input words are subtracted from the corresponding values in the templates. If both values are the same, the difference is zero and there is a perfect match. If not, the subtraction produces some difference, or error. The smaller the error, the better the match. When the best match occurs, the word is identified and displayed on the screen or used in some other manner.
The search process takes a considerable amount of time, as the CPU has to make many comparisons before recognition occurs. This necessitates the use of very high-speed processors. A large RAM is also required because, even though a spoken word may last only a few hundred milliseconds, it is translated into many thousands of digital words. It is important to note that the words and templates must be aligned correctly in time before the similarity score is computed. This process, termed dynamic time warping, recognizes that different speakers pronounce the same words at different speeds and elongate different parts of the same word. This is especially important for speaker-independent recognizers.
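A compact sketch of the dynamic time warping idea is given below, using a unit-cost distance on one-dimensional feature sequences; a real recognizer would warp multi-dimensional feature vectors and usually adds path constraints, so this is only meant to show how differing speaking rates are absorbed by the alignment.

```python
def dtw_distance(template, sample):
    """Dynamic time warping: cost of the best time alignment between a
    stored template and an input word, so that differences in speaking
    rate do not dominate the similarity score."""
    n, m = len(template), len(sample)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - sample[j - 1])     # local distance
            cost[i][j] = d + min(cost[i - 1][j],         # stretch the template
                                 cost[i][j - 1],         # stretch the sample
                                 cost[i - 1][j - 1])     # advance both
    return cost[n][m]

# The same "word" spoken more slowly still aligns with zero cost.
template = [1.0, 3.0, 4.0, 2.0]
slow_sample = [1.0, 1.0, 3.0, 3.0, 4.0, 2.0, 2.0]
print(dtw_distance(template, slow_sample))   # -> 0.0 (perfect warped match)
```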
Continuous speech recognizers are far more difficult to build than word recognizers. You
can speak complete sentences to the computer. The input will be recognized and, when processed
by NLP, understood. Such recognizers employ sophisticated, complex techniques to deal with
continuous speech, because when one speaks continuously, most of the words slur together and it
is difficult for the system to know where one word ends and the other begins. Unlike word
recognizers, the information spoken is not recognized instantly by this system.


CHAPTER 5

WORKING OF THE SYSTEM

The voice input to the microphone produces an analogue speech signal. An analogue-to-digital converter (ADC) converts this speech signal into binary words that are compatible with a digital computer. The converted binary version is then stored in the system and compared with previously stored binary representations of words and phrases. The current input speech is compared, one pattern at a time, with the previously stored speech patterns as the computer searches. When a match occurs, recognition is achieved. The spoken word, in binary form, is written on a video screen or passed along to a natural language understanding processor for additional analysis. Since most recognition systems are speaker-dependent, it is necessary to train a system to recognize the dialect of each new user. During training, the computer displays a word and the user reads it aloud. The computer digitizes the user’s voice and stores it. The speaker has to read aloud about 1,000 words. Based on these samples, the computer can predict how the user utters some words that are likely to be pronounced differently by different users.
The block diagram of a speaker-dependent word recognizer is shown in Figure 5.1. The user speaks into the microphone, which converts the sound into an electrical signal. The electrical analogue signal from the microphone is fed to an amplifier provided with automatic gain control (AGC) to produce an amplified output signal in a specific optimum voltage range, even when the input signal varies from feeble to loud.
The analogue signal, representing a spoken word, contains many individual frequencies of various amplitudes and different phases, which when blended together take the shape of a complex waveform. A set of filters is used to break this complex signal into its component parts. Band-pass filters (BPF) pass frequencies only in a certain frequency range, rejecting all other frequencies. Generally, about 16 filters are used; a simple system may contain a minimum of three filters. The more filters used, the higher the probability of accurate recognition. Presently, switched-capacitor digital filters are used because these can be custom-built in integrated circuit form. They are smaller and cheaper than active filters using operational amplifiers. The filter output is then fed to the ADC to translate the analogue signal into digital words. The ADC samples the filter output many times a second. Each sample represents a different amplitude of the signal. A CPU controls the input circuits that are fed by the ADCs. A large RAM stores all the digital values in a buffer area. This digital information, representing the spoken word, is then accessed by the CPU for further processing.


5.1 Speaker-dependent word recognizer

Figure 5.1: Speaker-dependent word recognizer (block diagram: the microphone signal passes through band-pass filters (BPF) and an ADC into the input circuits, and the digitized speech is held in RAM for the CPU to process).

Normal speech has a frequency range of 200 Hz to 7 kHz. Recognizing a telephone call is more difficult, as it has a bandwidth limitation of 300 Hz to 3.3 kHz. As explained earlier, the spoken words are processed by the filters and ADCs. The binary representation of each of these words becomes a template, or standard, against which future words are compared. These templates are stored in the memory. Once the storing process is completed, the system can go into its active mode and is capable of identifying the spoken words. As each word is spoken, it is converted into its binary equivalent and stored in RAM. The computer then starts searching and compares the binary input pattern with the templates. It is to be noted that even if the same speaker utters the same text, there are always slight variations in the amplitude or loudness of the signal, pitch, frequency, time gaps, and so on. For this reason, there is never a perfect match between the template and the binary input word. The pattern matching process therefore uses statistical techniques and is designed to look for the best fit.
The values of the binary input words are subtracted from the corresponding values in the templates. If both values are the same, the difference is zero and there is a perfect match. If not, the subtraction produces some difference, or error. The smaller the error, the better the match. When the best match occurs, the word is identified and displayed on the screen or used in some other manner.
The search process takes a considerable amount of time, as the CPU has to make many comparisons before recognition occurs. This necessitates the use of very high-speed processors. A large RAM is also required because, even though a spoken word may last only a few hundred milliseconds, it is translated into many thousands of digital words. It is important to note that the words and templates must be aligned correctly in time before the similarity score is computed. This process, termed dynamic time warping, recognizes that different speakers pronounce the same word at different speeds and elongate different parts of the same word. This is especially important for speaker-independent recognizers.
Now that we've discussed some of the basic terms and concepts involved in speech
recognition, let's put them together and take a look at how the speech recognition process works.
As you can probably imagine, the speech recognition engine has a rather complex task to handle,
that of taking raw audio input and translating it to recognized text that an application
understands. The major components discussed are:
• Audio input
• Grammar(s)
• Acoustic Model
• Recognized text
The first thing we want to take a look at is the audio input coming into the recognition
engine. It is important to understand that this audio stream is rarely pristine. It contains not only
the speech data (what was said) but also background noise. This noise can interfere with the
recognition process, and the speech engine must handle (and possibly even adapt to) the
environment within which the audio is spoken. As we've discussed, it is the job of the speech
recognition engine to convert spoken input into text. To do this, it employs all sorts of data,
statistics, and software algorithms. Its first job is to process the incoming audio signal and
convert it into a format best suited for further analysis.
Once the speech data is in the proper format, the engine searches for the best match. It does this by taking into consideration the words and phrases it knows about (the active grammars), along with its knowledge of the environment in which it is operating (for Voice XML, this is the telephony environment). The knowledge of the environment is provided in the form of an acoustic model. Once it identifies the most likely match for what was said, it returns what it recognized as a text string. Most speech engines try very hard to find a match, and are usually very "forgiving."


5.1.1 Acceptance and Rejection

When the recognition engine processes an utterance, it returns a result. The result can be
either of two states: acceptance or rejection. An accepted utterance is one in which the engine
returns recognized text. Whatever the caller says, the speech recognition engine tries very hard to
match the utterance to a word or phrase in the active grammar. Sometimes the match may be
poor because the caller said something that the application was not expecting, or the caller spoke
indistinctly. In these cases, the speech engine returns the closest match, which might be
incorrect. Some engines also return a confidence score along with the text to indicate the
likelihood that the returned text is correct. Not all utterances that are processed by the speech
engine are accepted. Acceptance or rejection is flagged by the engine with each processed
utterance.
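A minimal sketch of how an application might act on such a result is shown below; the result fields and the confidence cutoff are assumptions made for the example, since each engine exposes this information through its own API.

```python
def handle_result(result, min_confidence=0.5):
    """Decide what to do with a hypothetical engine result of the form
    {'text': str or None, 'confidence': float}."""
    if result["text"] is None:
        return "reprompt"                       # silence timeout: ask the caller again
    if result["confidence"] < min_confidence:
        return "reject"                         # poor match: treat the utterance as rejected
    return "accept: " + result["text"]          # recognized text is usable by the application

print(handle_result({"text": "transfer funds", "confidence": 0.92}))   # accept
print(handle_result({"text": "play bill", "confidence": 0.31}))        # reject
print(handle_result({"text": None, "confidence": 0.0}))                # reprompt
```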
5.2 What software is available?

There are a number of publishers of speech recognition software. New and improved
versions are regularly produced, and older versions are often sold at greatly reduced prices.
Invariably, the newest versions require the most modern computers of well above average
specification. Using the software on a computer with a lower specification means that it will run
very slowly and may well be impossible to use. There are two main types of speech recognition
software: discrete speech and continuous speech.
Discrete speech software is an older technology that requires the user to speak one word at a time. Dragon Dictate Classic Version 3 is one example of discrete speech software; it has fewer features, is simple to train and use, and will work on older computers. Continuous speech software allows the user to dictate normally. In fact, it works best when it hears complete sentences, as it interprets with more accuracy when it recognizes the context. The delivery can be varied by using short phrases and single words, following the natural pattern of speech.
5.3 What technical issues need to be considered when purchasing this system?

The latest versions of speech recognition software (as of September 2001) require a Pentium III processor and 256 MB of memory. Currently, Dragon NaturallySpeaking Version 4 and IBM ViaVoice Millennium Edition have been used in school settings. Very good results can be obtained with these on fast, high-memory machines. When purchasing a machine, it is worth mentioning to the supplier that it will be required for running speech recognition software.
Whether choosing a desktop or portable computer, it will also require a good quality
duplex (input and output) sound card. Poor sound quality will reduce the recognition accuracy.


The microphones supplied with the software may be perfectly adequate, but better results
can often be obtained by using a noise-cancelling microphone. In addition, mobile voice
recorders allow a number of users to produce dictation that can be downloaded to the main
speech recognition system, but be aware of some of the complexities of their use.
5.4 How does the technology differ from other technologies?

Speech recognition systems produce written text from the user’s dictation, without using,
or with only minimal use of, a traditional keyboard and mouse. This is an obvious benefit to
many people who, for any number of reasons, do not find it easy to use a keyboard, or whose
spelling and literacy skills would benefit from seeing accurate text.
The limitations to this type of software are that:

• It needs to be completely tailored to the user and trained by the user.
• It is often set up on one machine, and so can create difficulties for a user who works from many locations, for example from school and home.
• It depends on the user having the desire to produce text and being able to invest the time, training and perseverance necessary to achieve it.
• It is most successful for those competent in the art of dictation.
A speech recognition system is a powerful application in that the software’s recognition
of the user’s voice pattern and vocabulary improves with use. A useful tip is to ensure that voice
files can be backed up regularly.
5.5 What factors need to be considered when using speech recognition technology?
The Becta SEN Speech Recognition Project describes the key factors to success
as ‘The Three Ts’: Time, Technology and Training:

Time
Take time to choose the most appropriate software and hardware and match it to the user. One option for new users is to start with discrete speech software. The skills learned whilst using it can be transferred to more sophisticated speech recognition software. If the new user is unable to make effective use of discrete speech recognition software, then it is unlikely they will succeed with continuous speech software. Familiarization with the product and frequent breaks between talking are also helpful.


Training
With speech recognition systems, both the software and the user require training.
Patience and practice are required. The user needs to take things slowly, practicing putting their
thoughts into words before attempting to use the system.
Technology
The best results are generally achieved using a high-specification machine. Sound cards
and microphones are a key feature for success, as is access to technical support and advice.


CHAPTER 6

THE LIMITS OF SPEECH RECOGNITION


To improve speech recognition applications, designers must understand acoustic memory and prosody. Continued research and development should be able to improve certain speech input, output, and dialogue applications. Speech recognition and generation are sometimes helpful in environments that are hands-busy, eyes-busy, mobility-required, or hostile, and show promise for telephone-based services. Dictation input is increasingly accurate, but adoption outside the disabled-user community has been slow compared with visual interfaces. Obvious physical problems include fatigue from speaking continuously and the disruption in an office filled with people speaking.
By understanding the cognitive processes surrounding human “acoustic memory” and
processing, interface designers may be able to integrate speech more effectively and guide users
more successfully. By appreciating the differences between human-human interaction and
human-computer interaction, designers may then be able to choose appropriate applications for
human use of speech with computers. The key distinction may be the rich emotional content
conveyed by prosody, or the pacing, intonation, and amplitude in spoken language. The emotive
aspects of prosody are potent for human interaction but may be disruptive for human-computer
interaction. The syntactic aspects of prosody, such as rising tone for questions, are important for
a system’s recognition and generation of sentences.
Now consider human acoustic memory and processing. Short-term and working memory are sometimes called acoustic or verbal memory. The part of the human brain that transiently holds chunks of information and solves problems also supports speaking and listening. Therefore, working on tough problems is best done in quiet environments, without speaking or listening to someone. However, because physical activity is handled in another part of the brain, problem solving is compatible with routine physical activities like walking and driving. In short, humans speak and walk easily but find it more difficult to speak and think at the same time. Similarly, when operating a computer, most humans type (or move a mouse) and think, but find it more difficult to speak and think at the same time. Hand-eye coordination is accomplished in different brain structures, so typing or mouse movement can be performed in parallel with problem solving.
Product evaluators of an IBM dictation software package also noticed this phenomenon. They wrote that “thought for many people is very closely linked to language. In
keyboarding, users can continue to hone their words while their fingers output an earlier version.
In dictation, users may experience more interference between outputting their initial thought and
elaborating on it.” Developers of commercial speech recognition software packages recognize
this problem and often advise dictation of full paragraphs or documents, followed by a review or
proofreading phase to correct errors. Since speaking consumes precious cognitive resources, it is
difficult to solve problems at the same time. Proficient keyboard users can have higher levels of
parallelism in problem solving while performing data entry. This may explain why after 30 years
of ambitious attempts to provide military pilots with speech recognition in cockpits, aircraft
designers persist in using hand-input devices and visual displays. Complex functionality is built into the pilot's joystick, which has up to 17 functions, including pitch-roll-yaw controls, plus a rich set of buttons and triggers. Similarly, automobile controls may have turn signals, wiper settings, and washer buttons all built onto a single stick, and typical video camera controls may have dozens of settings that are adjustable through knobs and switches. Rich designs for hand
input can inform users and free their minds for status monitoring and problem solving.
The interfering effects of acoustic processing are a limiting factor for designers of speech
recognition, but the role of emotive prosody raises further concerns. The human voice has
evolved remarkably well to support human-human interaction. We admire and are inspired by
passionate speeches. We are moved by grief-choked eulogies and touched by a child’s calls as
we leave for work. A military commander may bark commands at troops, but there is as much
motivational force in the tone as there is information in the words. Loudly barking commands at
a computer is not likely to force it to shorten its response time or retract a dialogue box.
Promoters of "affective" computing, or recognizing, responding to, and making emotional displays, may recommend such strategies, though this approach seems misguided.


CONCLUSION
Speech recognition will revolutionize the way people conduct business over the Web and will, ultimately, differentiate world-class e-businesses. Voice XML ties speech recognition and telephony together and provides the technology with which businesses can develop and deploy voice-enabled Web solutions today. These solutions can greatly expand the accessibility of Web-based self-service transactions to customers who would otherwise not have access and, at the same time, leverage a business's existing Web investments. Speech recognition and Voice XML clearly represent the next wave of the Web. It is important to consider the environment in which the speech system has to work. The grammar used by the speaker and accepted by the system, the noise level, the noise type, the position of the microphone, and the speed and manner of the user's speech are some factors that may affect the quality of speech recognition. Since most recognition systems are speaker-dependent, it is necessary to train a system to recognize the dialect of each new user. During training, the computer displays a word and the user reads it aloud.



