4. Speech Recognition
Modern approaches to large vocabulary continuous speech recognition are surprisingly similar in terms of their high-level structure [111]. The work described herein is based on the CMU Sphinx 3.2 system, but the general approach is applicable to other speech recognizers [49,74]. The explanation of large vocabulary continuous speech recognition (LVCSR) in this chapter is based on a simple probabilistic model presented in [80,111].

The human vocal apparatus has mechanical limitations that prevent rapid changes to the sound generated by the vocal tract. As a result, speech signals may be considered stationary over short intervals, i.e., their spectral characteristics remain relatively unchanged for several milliseconds at a time. DSP techniques may be used to summarize the spectral characteristics of a speech signal into a sequence of acoustic observation vectors; typically, 100 such vectors are used to represent one second of speech. Speech recognition then becomes a statistical problem of deriving the word sequence that has the highest likelihood of corresponding to the observed sequence of acoustic vectors. This notion is captured by the equation:

    \hat{W} = \arg\max_{W} P(W \mid A)    (4.1)
Here, W = w_1, w_2, ..., w_n is a sequence of n words and A = a_1, a_2, ..., a_m is a sequence of m acoustic observation vectors. Equation 4.1 may be read as: \hat{W} is the particular word sequence that has the maximum a posteriori probability given the observation sequence A. Using Bayes' rule, this equation may be rewritten as:

    \hat{W} = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}    (4.2)

Here, P(A|W) denotes the probability of the acoustic vector sequence A given the word sequence W, P(W) denotes the probability with which the word sequence W occurs in the spoken language, and P(A) denotes the probability with which the acoustic vector sequence A occurs. Since P(A) is independent of the word sequence, \hat{W} can be computed without knowing it. Thus Equation 4.2 may be rewritten as:

    \hat{W} = \arg\max_{W} P(A \mid W)\, P(W)    (4.3)
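To make Equation 4.3 concrete, the sketch below (not drawn from Sphinx; the candidate word sequences and all scores are invented for illustration) works in the log domain and picks the candidate that maximizes log P(A|W) + log P(W):

    import math

    # Hypothetical scores for two candidate word sequences.  In a real
    # recognizer, log P(A|W) comes from evaluating an acoustic model
    # against the observation vectors A, and log P(W) from a language
    # model; here both are made-up numbers.
    acoustic_log_prob = {                      # stand-in for log P(A|W)
        ("recognize", "speech"): -120.0,
        ("wreck", "a", "nice", "beach"): -118.5,
    }
    language_log_prob = {                      # stand-in for log P(W)
        ("recognize", "speech"): math.log(1e-4),
        ("wreck", "a", "nice", "beach"): math.log(1e-7),
    }

    def decode(candidates):
        """Equation 4.3 in the log domain: argmax_W [log P(A|W) + log P(W)]."""
        return max(candidates,
                   key=lambda w: acoustic_log_prob[w] + language_log_prob[w])

    best = decode(acoustic_log_prob.keys())
    print(" ".join(best))   # "recognize speech" wins once P(W) is included

Working with log-probabilities avoids the numerical underflow that multiplying many small probabilities would cause; real decoders do the same.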
The set of DSP algorithms that convert the speech signal into the acoustic vector sequence A is commonly referred to as the front end. The quantity P(A|W) is generated by evaluating an acoustic model. The term P(W) is generated from a language model.
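As a rough illustration of how a front end arrives at about 100 observation vectors per second, the following sketch slices a 16 kHz signal into overlapping 25 ms windows advanced by 10 ms and summarizes each window's spectrum. The helper name, sampling rate, window length, hop size, and crude band-energy features are assumptions for illustration; a production front end such as Sphinx's computes mel-frequency cepstral coefficients instead.

    import numpy as np

    def frame_features(signal, sample_rate=16000, win_ms=25, hop_ms=10, n_bins=13):
        """Toy front end: split the signal into overlapping windows and
        summarize each window's spectrum as log band energies."""
        win = int(sample_rate * win_ms / 1000)   # samples per window (400)
        hop = int(sample_rate * hop_ms / 1000)   # samples per hop (160)
        frames = []
        for start in range(0, len(signal) - win + 1, hop):
            chunk = signal[start:start + win] * np.hamming(win)
            spectrum = np.abs(np.fft.rfft(chunk))
            # crude spectral summary: log energy in n_bins equal-width bands
            bands = np.array_split(spectrum, n_bins)
            frames.append(np.log([np.sum(b ** 2) + 1e-10 for b in bands]))
        return np.array(frames)                  # shape: (num_frames, n_bins)

    # one second of noise as a stand-in speech signal
    A = frame_features(np.random.randn(16000))
    print(A.shape)                               # (98, 13)

With a 10 ms hop, each second of audio yields roughly 100 frames, matching the rate quoted above, and the 25 ms window reflects the short-time stationarity assumption.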
Subsections
4.1 Front End
4.2 Acoustic Model
4.3 Language Model
4.4 Overall Operation
4.5 Architectural Implications