
Voice Recognition with Neural Networks,

Type-2 Fuzzy Logic and Genetic Algorithms


1. Apeksha Reddy, SDMCET, Dharwad. Id: apeksha.r.r@gmail.com
2. Ashrit Mangesh R, SDMCET, Dharwad. Id: mangeshashrit@gmail.com

Abstract—We describe in this paper the use of neural networks, fuzzy logic and genetic algorithms for voice recognition. In particular, we consider the case of speaker recognition by analyzing the sound signals with the help of intelligent techniques, such as neural networks and fuzzy systems. We use the neural networks for analyzing the sound signal of an unknown speaker, and after this first step, a set of type-2 fuzzy rules is used for decision making. We need to use fuzzy logic due to the uncertainty of the decision process. We also use genetic algorithms to optimize the architecture of the neural networks. We illustrate our approach with a sample of sound signals from real speakers in our institution.

Index Terms—Type-2 Fuzzy Logic, Neural Networks, Genetic Algorithms, Voice Recognition.

I. INTRODUCTION

Speaker recognition, which can be classified into identification and verification, is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialling, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers [10].

Fig. 1 shows the basic components of speaker identification and verification systems. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Most applications in which a voice is used as the key to confirm the identity of a speaker are classified as speaker verification [11].

Fig. 1. Basic structure of speaker recognition systems: (a) speaker identification; (b) speaker verification.

Speaker recognition methods can also be divided into text-dependent and text-independent methods. The former require the speaker to say key words or sentences having the same text for both training and recognition trials, whereas the latter do not rely on a specific text being spoken.

Both text-dependent and text-independent methods share a problem, however. These systems can be easily deceived, because someone who plays back the recorded voice of a registered speaker saying the key words or sentences can be accepted as the registered speaker. To cope with this problem, there are methods in which a small set of words, such as digits, are used as key words and each user is prompted to utter a given sequence of key words that is randomly chosen every time the system is used. Yet even this method is not completely reliable, since it can be deceived with advanced electronic recording equipment that can reproduce key words in a requested order. Therefore, a text-prompted speaker recognition method has recently been proposed.

II. TRADITIONAL METHODS FOR SPEAKER RECOGNITION

Speaker identity is correlated with the physiological and behavioural characteristics of the speaker. These characteristics exist both in the spectral envelope (vocal tract characteristics) and in the supra-segmental features (voice source characteristics and dynamic features spanning several segments).

The most common short-term spectral measurements currently used are Linear Predictive Coding (LPC)-derived cepstral coefficients and their regression coefficients. A spectral envelope reconstructed from a truncated set of cepstral coefficients is much smoother than one reconstructed from LPC coefficients. Therefore it provides a stabler representation from one repetition to another of a particular speaker's utterances. As for the regression coefficients, typically the first- and second-order coefficients are extracted at every frame period to represent the spectral dynamics. These coefficients are derivatives of the time functions of the cepstral coefficients and are respectively called the delta- and delta-delta-cepstral coefficients.

A. Normalization Techniques

The most significant factor affecting automatic speaker recognition performance is variation in the signal characteristics from trial to trial (inter-session variability and variability over time). Variations arise from the speakers themselves, from differences in recording and transmission conditions, and from background noise. Speakers cannot repeat an utterance precisely the same way from trial to trial. It is well known that samples of the same utterance recorded in one session are much more highly correlated than samples recorded in separate sessions. There are also long-term changes in voices. It is important for speaker recognition systems to accommodate these variations. Two types of normalization techniques have been tried: one in the parameter domain, and the other in the distance/similarity domain.

B. Parameter-Domain Normalization

Spectral equalization, the so-called blind equalization method, is a typical normalization technique in the parameter domain that has been confirmed to be effective in reducing linear channel effects and long-term spectral variation. This method is especially effective for text-dependent speaker recognition applications that use sufficiently long utterances. Cepstral coefficients are averaged over the duration of an entire utterance and the averaged values are subtracted from the cepstral coefficients of each frame. Additive variation in the log spectral domain can be compensated for fairly well by this method. However, it unavoidably removes some text-dependent and speaker-specific features; therefore it is inappropriate for short utterances in speaker recognition applications.

C. Distance/Similarity-Domain Normalization

A normalization method for distance (similarity, likelihood) values using a likelihood ratio has been proposed. The likelihood ratio is defined as the ratio of two conditional probabilities of the observed measurements of the utterance: the first probability is the likelihood of the acoustic data given the claimed identity of the speaker, and the second is the likelihood given that the speaker is an imposter. The likelihood ratio normalization approximates optimal scoring in the Bayes sense.

A normalization method based on a posteriori probability has also been proposed. The difference between the normalization method based on the likelihood ratio and the method based on a posteriori probability is whether or not the claimed speaker is included in the speaker set for normalization; the speaker set used in the method based on the likelihood ratio does not include the claimed speaker, whereas the normalization term for the method based on a posteriori probability is calculated by using all the reference speakers, including the claimed speaker.

Experimental results indicate that the two normalization methods are almost equally effective. They both improve speaker separability and reduce the need for speaker-dependent or text-dependent thresholding, as compared with scoring using only a model of the claimed speaker.

A new method, in which the normalization term is approximated by the likelihood of a single mixture model representing the parameter distribution for all the reference speakers, has recently been proposed. An advantage of this method is that the computational cost of calculating the normalization term is very small, and this method has been confirmed to give much better results than either of the above-mentioned normalization methods.

D. Text-Dependent Speaker Recognition Methods

Text-dependent methods are usually based on template-matching techniques. In this approach, the input utterance is represented by a sequence of feature vectors, generally short-term spectral feature vectors. The time axes of the input utterance and each reference template or reference model of the registered speakers are aligned using a dynamic time warping (DTW) algorithm, and the degree of similarity between them, accumulated from the beginning to the end of the utterance, is calculated.

The hidden Markov model (HMM) can efficiently model statistical variation in spectral features. Therefore, HMM-based methods were introduced as extensions of the DTW-based methods, and have achieved significantly better recognition accuracies.

E. Text-Independent Speaker Recognition Methods

One of the most successful text-independent recognition methods is based on vector quantization (VQ). In this method, VQ code-books consisting of a small number of representative feature vectors are used as an efficient means of characterizing speaker-specific features. A speaker-specific code-book is generated by clustering the training feature vectors of each speaker. In the recognition stage, an input utterance is vector-quantized using the code-book of each reference speaker, and the VQ distortion accumulated over the entire input utterance is used to make the recognition decision.

Temporal variation in speech signal parameters over the long term can be represented by stochastic Markovian transitions between states. Therefore, methods using an ergodic HMM, where all possible transitions between states are allowed, have been proposed. Speech segments are classified into one of the broad phonetic categories corresponding to the HMM states. After the classification, appropriate features are selected.

In the training phase, reference templates are generated and verification thresholds are computed for each phonetic category. In the verification phase, after the phonetic categorization, a comparison with the reference template for each particular category provides a verification score for that category. The final verification score is a weighted linear combination of the scores from each category.

This method was extended to the richer class of mixture autoregressive (AR) HMMs. In these models, the states are described as a linear combination (mixture) of AR sources. It can be shown that mixture models are equivalent to a larger HMM with simple states, with additional constraints on the possible transitions between states.

It has been shown that a continuous ergodic HMM method is far superior to a discrete ergodic HMM method, and that a continuous ergodic HMM method is as robust as a VQ-based method when enough training data is available. However, when little data is available, the VQ-based method is more robust than a continuous HMM method. A method using statistical dynamic features has recently been proposed. In this method, a multivariate auto-regression (MAR) model is applied to the time series of cepstral vectors and used to characterize speakers. It was reported that identification and verification rates were almost the same as those obtained by an HMM-based method.

F. Text-Prompted Speaker Recognition Method

In the text-prompted speaker recognition method, the recognition system prompts each user with a new key sentence every time the system is used, and accepts the input utterance only when it decides that it was the registered speaker who repeated the prompted sentence. The sentence can be displayed as characters or spoken by a synthesized voice. Because the vocabulary is unlimited, prospective impostors cannot know in advance what sentence will be requested. Not only can this method accurately recognize speakers, but it can also reject utterances whose text differs from the prompted text, even if it is spoken by the registered speaker. A recorded voice can thus be correctly rejected.

This method is facilitated by using speaker-specific phoneme models as basic acoustic units. One of the major issues in applying this method is how to properly create these speaker-specific phoneme models from training utterances of a limited size. The phoneme models are represented by Gaussian-mixture continuous HMMs or tied-mixture HMMs, and they are made by adapting speaker-independent phoneme models to each speaker's voice. In order to properly adapt the models of phonemes that are not included in the training utterances, a new adaptation method based on tied-mixture HMMs was recently proposed.

In the recognition stage, the system concatenates the phoneme models of each registered speaker to create a sentence HMM, according to the prompted text. Then the likelihood of the input speech matching the sentence model is calculated and used for the speaker recognition decision. If the likelihood is high enough, the speaker is accepted as the claimed speaker.

Although many recent advances and successes in speaker recognition have been achieved, there are still many problems for which good solutions remain to be found. Most of these problems arise from variability, including speaker-generated variability and variability in channel and recording conditions. It is very important to investigate feature parameters that are stable over time, insensitive to variations in speaking manner, including the speaking rate and level, and robust against variations in voice quality due to causes such as voice disguise or colds. It is also important to develop methods to cope with the problem of distortion due to telephone sets and channels, and background and channel noises.

From the human-interface point of view, it is important to consider how the users should be prompted, and how recognition errors should be handled. Studies on ways to automatically extract the speech periods of each person separately from a dialogue involving more than two people have recently appeared as an extension of speaker recognition technology. This section was not intended to be a comprehensive review of speaker recognition technology; rather, it was intended to give an overview of recent advances and the problems which must be solved in the future.

G. Speaker Verification

The speaker-specific characteristics of speech are due to differences in physiological and behavioural aspects of the speech production system in humans. The main physiological aspect of the human speech production system is the vocal tract shape. The vocal tract modifies the spectral content of an acoustic wave as it passes through it, thereby producing speech. Hence, it is common in speaker verification systems to make use of features derived only from the vocal tract.

The acoustic wave is produced when the airflow from the lungs is carried by the trachea through the vocal folds. This source of excitation can be characterized as phonation, whispering, frication, compression, vibration, or a combination of these. Phonated excitation occurs when the airflow is modulated by the vocal folds. Whispered excitation is produced by airflow rushing through a small triangular opening between the arytenoid cartilages at the rear of the nearly closed vocal folds. Frication excitation is produced by constrictions in the vocal tract. Compression excitation results from releasing a completely closed and pressurized vocal tract. Vibration excitation is caused by air being forced through a closure other than the vocal folds, especially at the tongue. Speech produced by phonated excitation is called voiced, speech produced by phonated excitation plus frication is called mixed voiced, and speech produced by other types of excitation is called unvoiced.

Using cepstral analysis as described in the previous section, an utterance may be represented as a sequence of feature vectors. Utterances spoken by the same person but at different times result in similar yet different sequences of feature vectors. The purpose of voice modeling is to build a model that captures these variations in the extracted set of features.
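The regression (delta) coefficients used in such feature vectors, mentioned in Section II, follow a standard first-order regression formula. The sketch below is a minimal pure-Python illustration of that formula; the window size `K` and the toy input are our own choices for the example, not details taken from the paper.

```python
def delta(cepstra, K=2):
    """First-order regression (delta) coefficients of a cepstral sequence.

    cepstra: list of frames, each a list of cepstral coefficients.
    Uses d_t = sum_{k=1..K} k*(c_{t+k} - c_{t-k}) / (2*sum_{k=1..K} k^2),
    with edge frames clamped to the sequence boundaries.
    """
    T = len(cepstra)
    denom = 2 * sum(k * k for k in range(1, K + 1))
    deltas = []
    for t in range(T):
        frame = []
        for i in range(len(cepstra[0])):
            num = sum(
                k * (cepstra[min(t + k, T - 1)][i] - cepstra[max(t - k, 0)][i])
                for k in range(1, K + 1)
            )
            frame.append(num / denom)
        deltas.append(frame)
    return deltas

# Delta-delta coefficients are simply the delta of the deltas.
cepstra = [[float(t)] for t in range(6)]   # toy 1-D "cepstral" ramp
d1 = delta(cepstra)                        # interior frames of a ramp give slope 1.0
d2 = delta(d1)
```

On the linear toy ramp the interior delta values come out as the slope, which is a quick sanity check that the regression window is wired up correctly.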
There are two types of models that have been used extensively in speaker verification and speech recognition systems: stochastic models and template models. The stochastic model treats the speech production process as a parametric random process and assumes that the parameters of the underlying stochastic process can be estimated in a precise, well-defined manner. The template model attempts to model the speech production process in a non-parametric manner by retaining a number of sequences of feature vectors derived from multiple utterances of the same word by the same person. Template models dominated early work in speaker verification and speech recognition because the template model is intuitively more reasonable. However, recent work in stochastic models has demonstrated that these models are more flexible and hence allow for better modelling of the speech production process. A very popular stochastic model for modelling the speech production process is the Hidden Markov Model (HMM). HMMs are extensions of the conventional Markov models, wherein the observations are a probabilistic function of the state; i.e., the model is a doubly embedded stochastic process where the underlying stochastic process is not directly observable (it is hidden). The HMM can only be viewed through another set of stochastic processes that produce the sequence of observations.

The pattern matching process involves the comparison of a given set of input feature vectors against the speaker model for the claimed identity, and the computation of a matching score. For the Hidden Markov models discussed above, the matching score is the probability that a given set of feature vectors was generated by a specific model. We show in Figure 2 a schematic diagram of a typical speaker recognition system.

III. VOICE CAPTURING AND PROCESSING

The first step for achieving voice recognition is to capture the sound signal of the voice. We use a standard microphone for capturing the voice signal. After this, we use the sound recorder of the Windows operating system to record the sounds that belong to the database for the voices of different persons. A fixed time of recording is established to have homogeneity in the signals. We show in Figure 3 the sound signal recorder used in the experiments.

Fig. 3. Sound recorder used in the experiments.

After capturing the sound signals, these voice signals are digitized at a frequency of 8 kHz, and as a consequence we obtain a signal with 8008 sample points. This information is the one used for analyzing the voice.

We also used the Sound Forge 6.0 computer program for processing the sound signal. This program allows us to cancel noise in the signal, which may have come from environment noise or the sensitivity of the microphones. After using this computer program, we obtain a sound signal that is as pure as possible. The program can also use the fast Fourier transform for voice filtering. We show in Figure 4 the use of the computer program for a particular sound signal.

Fig. 4. Main window of the computer program for processing the signals.

We also show in Figure 5 the use of the Fast Fourier Transform (FFT) to obtain the spectral analysis of the word "way" in Spanish.

Fig. 5. Spectral analysis of a specific word using the FFT.

IV. NEURAL NETWORKS FOR VOICE RECOGNITION
We used the sound signals of 20 words in Spanish as training data for a supervised feed-forward neural network with one hidden layer. The training algorithm used was Resilient Backpropagation (trainrp), which has been used previously with good results. We show in Table 1 the results of the experiments with this type of neural network.

TABLE 1. RESULTS OF FEEDFORWARD NEURAL NETWORKS FOR 20 WORDS IN SPANISH.

The results of Table 1 are for the Resilient Backpropagation training algorithm because this was the fastest learning algorithm found in all the experiments (it required only 7% of the total time in the experiments). The comparison of the time performance with other training methods is shown in Figure 6.

Fig. 6. Comparison of the time performance of several training algorithms.

We now show in Table 2 a comparison of the recognition ability achieved with the different training algorithms for the supervised neural networks. We are showing average values of experiments performed with all the training algorithms. We can appreciate from this table that the resilient backpropagation algorithm is also the most accurate method, with a 92% average recognition rate.

TABLE 2. COMPARISON OF AVERAGE RECOGNITION OF FOUR TRAINING ALGORITHMS.

We describe below some simulation results of our approach for speaker recognition using neural networks. First, in Figure 7 we have the sound signal of the word "example" with noise. Next, in Fig. 8 we have the identification of the word "example" without noise. We also show in Fig. 9 the word "layer" with noise. In Fig. 10, we show the identification of the correct word "layer" without noise.

Fig. 7. Input signal of the word "example" with noise.

From Figures 7 to 10 it is clear that simple monolithic neural networks can be useful in voice recognition with a small number of words. Words can be identified even with added noise, with at least a 92% recognition rate (for 20 words). Of course, for a larger set of words the recognition rate goes down and computation time increases. For these reasons it is necessary to consider better methods for voice recognition.
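The resilient backpropagation algorithm singled out above adapts a separate step size per weight from the sign of successive gradients, which is what makes it fast in practice. The sketch below is a generic illustration of that update rule (an iRprop- style variant) on a made-up quadratic error surface; it is not the MATLAB trainrp routine used in the experiments, and the hyperparameter values are the commonly cited defaults, not values from the paper.

```python
def sign(x):
    return (x > 0) - (x < 0)

def rprop_minimize(grad, w, n_iter=200, step0=0.1,
                   eta_plus=1.2, eta_minus=0.5,
                   step_max=1.0, step_min=1e-6):
    """Minimize a function with the Rprop sign-based update rule."""
    steps = [step0] * len(w)           # one adaptive step size per weight
    g_prev = [0.0] * len(w)
    for _ in range(n_iter):
        g = grad(w)
        for i in range(len(w)):
            s = g[i] * g_prev[i]
            if s > 0:                  # gradient kept its sign: accelerate
                steps[i] = min(steps[i] * eta_plus, step_max)
            elif s < 0:                # sign flip (overshoot): shrink, skip move
                steps[i] = max(steps[i] * eta_minus, step_min)
                g[i] = 0.0
            # Move by the adaptive step size, ignoring gradient magnitude.
            w[i] -= sign(g[i]) * steps[i]
        g_prev = g
    return w

# Toy "error surface": E(w) = (w0 - 3)^2 + (w1 + 2)^2, gradient 2*(w - target).
w = rprop_minimize(lambda w: [2 * (w[0] - 3.0), 2 * (w[1] + 2.0)], [0.0, 0.0])
```

Because only the sign of the gradient is used, the method is insensitive to badly scaled error surfaces, which is one explanation for the speed advantage reported in the timing comparison above.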
Fig. 8. Identification of the word "example".

Fig. 9. Input signal of the word "layer" with noise added.

Fig. 10. Identification of the word "layer".

V. VOICE RECOGNITION WITH MODULAR NEURAL NETWORKS AND TYPE-2 FUZZY LOGIC

We can improve on the results obtained in the previous section by using modular neural networks, because modularity enables us to divide the recognition problem into simpler sub-problems, which can be more easily solved. We also use type-2 fuzzy logic to model the uncertainty in the results given by the neural networks from the same training data. We describe in this section our modular neural network approach with the use of type-2 fuzzy logic in the integration of results.

We now show some examples to illustrate the hybrid approach. We use two modules with one neural network each in this modular architecture. Each module is trained with the same data, but the results are somewhat different due to the uncertainty involved in the learning process. In all cases, we use neural networks with one hidden layer of 50 nodes and "trainrp" as the learning algorithm. The difference in the results is then used to create an interval type-2 fuzzy set that represents the uncertainty in the classification of the word. The first example is of the word "example", which is shown in Fig. 11.

Fig. 11. Sound signal of the word "example".
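The integration step just described can be sketched in a few lines: for each word, the two membership values (one per module) define an interval type-2 membership, and a crisp score is recovered by centroid defuzzification of the interval, which for a uniform interval reduces to its midpoint. This is an illustrative sketch, not the authors' implementation; the module outputs below are hypothetical numbers in the style of the worked example in this section.

```python
def interval_type2_decision(out_m1, out_m2):
    """Fuse two modules' membership vectors into interval type-2 sets
    and pick the word with the highest defuzzified membership."""
    intervals = [(min(a, b), max(a, b)) for a, b in zip(out_m1, out_m2)]
    # Centroid of an interval [lo, hi] with uniform membership is its midpoint.
    crisp = [(lo + hi) / 2.0 for lo, hi in intervals]
    best = max(range(len(crisp)), key=lambda i: crisp[i])
    return best, intervals[best], crisp[best]

# Hypothetical outputs of the two modules for a 4-word vocabulary:
m1 = [0.0023, 0.0101, 0.9901, 0.0007]
m2 = [0.0041, 0.0090, 0.9821, 0.0013]
word, interval, score = interval_type2_decision(m1, m2)
# word is the index of the recognized word; interval is its type-2
# membership interval; score is the defuzzified membership value.
```

The width of the interval is itself informative: a wide interval signals that the two modules disagree, i.e. that the classification is uncertain.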


Considering for now only 10 words in the training, we have that the first neural network gives the following results:

SSE = 4.17649e-005 (sum of squared errors)
Output = [0.0023, 0.0001, 0.0000, 0.0020, 0.0113, 0.0053, 0.0065, 0.9901, 0.0007, 0.0001]

The output can be interpreted as giving us the membership values of the given sound signal to each of the 10 different words in the database. In this case, we can appreciate that the value of 0.9901 is the membership value to the word "example", which is very close to 1. But if we now train a second neural network with the same architecture, the results will be different due to the different random initialization of the weights. We now give the results for the second neural network:

SSE = 0.0124899
Output = [0.0002, 0.0041, 0.0037, 0.0013, 0.0091, 0.0009, 0.0004, 0.9821, 0.0007, 0.0007]

We can note that now the membership value to the word "example" is 0.9821. With the two different values of membership, we can define an interval [0.9821, 0.9901], which gives us the uncertainty in the membership of the input signal belonging to the word "example" in the database. We have to use centroid defuzzification to obtain a single membership value. If we now repeat the same procedure for the whole database, we obtain the results shown in Table 3. In this table, we can see the results for a sample of 6 different words.

TABLE 3. SUMMARY OF RESULTS FOR THE TWO MODULES (M1 AND M2) FOR A SET OF WORDS IN SPANISH.

We now describe the complete modular neural network architecture (Fig. 12) for voice recognition, in which we now use three neural networks in each module. Also, each module only processes a part of the word, which is divided into three parts, one for each module.

Fig. 12. Complete modular neural network architecture for voice recognition.

We have also experimented with using a genetic algorithm for optimizing the number of layers and nodes of the neural networks of the modules, with very good results. The approach is very similar to the one described previously. We show in Fig. 13 an example of the use of a genetic algorithm for optimizing the number of layers and nodes of one of the neural networks in the modular architecture. In this figure we can appreciate the minimization of the fitness function, which takes into account two objectives: the sum of squared errors and the complexity of the neural network.

Fig. 13. Genetic algorithm showing the optimization of a neural network.

The same modular neural network approach was extended to the previous 20 words (mentioned in the previous section), and the recognition rate was improved to 100%, which shows the advantage of modularity and also of the utilization of type-2 fuzzy logic. We also have to say that computation time was reduced slightly due to the use of modularity.
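The genetic optimization of the network architecture described in this section can be sketched as follows. The fitness function here is a made-up quadratic surrogate standing in for the paper's two objectives (sum of squared errors of the trained network plus a complexity penalty), so that the sketch stays self-contained and runnable; the population size, operator rates, and the optimum at one hidden layer of 50 nodes are our own illustrative choices, not values from the paper.

```python
import random

def fitness(arch):
    """Surrogate fitness: stands in for SSE-plus-complexity of a trained
    network. Made up for this sketch, with its optimum at (1 layer, 50 nodes)."""
    layers, nodes = arch
    return (layers - 1) ** 2 + ((nodes - 50) / 10.0) ** 2

def genetic_search(pop_size=20, generations=40, seed=1):
    """Evolve (hidden layers, nodes per layer) pairs toward low fitness."""
    rng = random.Random(seed)
    pop = [(rng.randint(1, 4), rng.randint(5, 200)) for _ in range(pop_size)]
    best_initial = min(pop, key=fitness)
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]        # elitist truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])              # crossover: layers from a, nodes from b
            if rng.random() < 0.3:            # mutation: perturb both genes
                child = (max(1, child[0] + rng.choice([-1, 1])),
                         max(5, child[1] + rng.randint(-10, 10)))
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness), best_initial

best, best_initial = genetic_search()
```

Because the best individual of each generation is retained, the final architecture can never be worse than the best random initial guess, and in practice the search concentrates around small networks with low surrogate error, mirroring the fitness minimization shown in Fig. 13.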
VI. CONCLUSIONS

We have described in this paper an intelligent approach to pattern recognition for the case of speaker identification. We first described the use of monolithic neural networks for voice recognition. We then described a modular neural network approach with type-2 fuzzy logic. We have shown examples of words for which a correct identification was achieved. We have performed tests with about 20 different words, which were spoken by three different speakers. The results are very good for the monolithic neural network approach, and excellent for the modular neural network approach. We have considered increasing the database of words, and with the modular approach we have been able to achieve about a 96% recognition rate on over 100 words. We still have to make more tests with different words and levels of noise.

VII. ACKNOWLEDGEMENTS

1. We thank JNCE, Shivamogga for conducting Techzone 2k10 and giving us an opportunity to participate in the same.
2. We thank the Principal, Staff and the management of our college, SDMCET, Dharwad for continuously supporting us in completing this paper.

VIII. REFERENCES
[1] O. Castillo and P. Melin, "A New Approach for Plant Monitoring using Type-2 Fuzzy Logic and Fractal Theory", International Journal of General Systems, Taylor and Francis, Vol. 33, 2004, pp. 305-319.

[2] S. Furui, "Cepstral analysis technique for automatic speaker verification", IEEE Transactions on Acoustics, Speech and Signal Processing, 29(2), 1981, pp. 254-272.

[3] S. Furui, "Research on individuality features in speech waves and automatic speaker recognition techniques", Speech Communication, 5(2), 1986, pp. 183-197.

[4] S. Furui, "Speaker-independent isolated word recognition using dynamic features of the speech spectrum", IEEE Transactions on Acoustics, Speech and Signal Processing, 34(1), 1986, pp. 52-59.

[5] S. Furui, Digital Speech Processing, Synthesis, and Recognition, Marcel Dekker, New York, 1989.

[6] S. Furui, "Speaker-dependent-feature extraction, recognition and processing techniques", Speech Communication, 10(5-6), 1991, pp. 505-520.

[7] S. Furui, "An overview of speaker recognition technology", Proceedings of the ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, 1994, pp. 1-9.

[8] A. L. Higgins, L. Bahler, and J. Porter, "Speaker verification using randomized phrase prompting", Digital Signal Processing, Vol. 1, 1991, pp. 89-106.

[9] N. N. Karnik and J. M. Mendel, An Introduction to Type-2 Fuzzy Logic Systems, Technical Report, University of Southern California, 1998.

[10] T. Matsui and S. Furui, "Concatenated phoneme models for text-variable speaker recognition", Proceedings of ICASSP'93, 1993, pp. 391-394.

[11] T. Matsui and S. Furui, "Similarity normalization method for speaker verification based on a posteriori probability", Proceedings of the ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, 1994, pp. 59-62.

[12] P. Melin, M. L. Acosta, and C. Felix, "Pattern Recognition Using Fuzzy Logic and Neural Networks", Proceedings of IC-AI'03, Las Vegas, USA, 2003, pp. 221-227.

[13] P. Melin and O. Castillo, "A New Method for Adaptive Control of Non-Linear Plants Using Type-2 Fuzzy Logic and Neural Networks", International Journal of General Systems, Taylor and Francis, Vo
