
DVP Slides

Dr. Ali Al Hajj Hassan


2 Credits

Electrical Department
Spring 2025
Sem VIII
DSPII- Voice Processing

Introduction & Fundamentals


Chapter 1

Dr. Ali Al-Hajj Hassan

Course Administration

· Instructor: Dr. Ali El-Hajj Hassan


· Email: ali.hajjhassan@net.usj.edu.lb

· Text-Book:
¹ Principles of Digital Audio, Ken Pohlmann, McGraw-Hill, 2011.

1
Course Content

· Structure of the speech signal.


· Speech production.

· Temporal analysis of the audio signal, Spectral analysis.

· Spectro-temporal analysis.

· Cepstral analysis.

· Estimation of the fundamental frequency (pitch) and


formants.

Chapters

· Chapter 1: Fundamentals of Speech Signals.


· Chapter 2: Speech Coding.

· Chapter 3: Linear Prediction Coding (LPC).

· Chapter 4: Speech Recognition Theory.

2
Chapter 1 Outline

· Acoustic Theory of Speech


¹ Voiced and Unvoiced Speech
¹ Voiced and Unvoiced Model

· Human Auditory System

· Speech Processing
¹ Pitch Period Estimation
¹ Short Time Analysis
¹ Convolution

Audio

· Physics describes sound as a form of energy that is transmitted


through a medium as pressure waves making rarefactions and
compressions.
· These pressure changes can be captured by an electromechanical
device, such as a microphone, and converted into an analog
signal.
· Typically, a recording device consists of a magnet surrounded by
a coil, which is attached to a diaphragm.
¹ As sound waves vibrate the diaphragm, the coil moves about the
magnet, which induces an electric current proportional to the
amplitude of the vibration.
¹ The electric current is captured as an analog signal, which can be
further stored on an analog recording device such as a magnetic
tape.

3
Audio

Digital Representation of Audio

· Analog audio signals are represented as waveforms,


either simple or complex.
· A simple waveform is a pure tone with single frequency

· A complex waveform has multiple frequencies or
sinusoidal waves combined together (the amplitude of the
complex signal is the superposition of the amplitudes of the
individual frequency components).
· Voice and music are, of course, complex waveforms.

4
Digital Representation of Audio

· To digitize, we need sampling and quantization.


· The process is known as PCM (pulse code modulation)

· Mono, stereo, or multichannel (surround)

· New technology: spatial audio

Digital Audio

· Started with one microphone to capture the sound.


· In the microphone, electric current proportional to the

sound is generated and stored.


· Binaural recording: a pair of microphones records two

different sounds or channels, which are then played back


on two different speakers without mixing or cross talk.
This creates the perception of spatial sound
· Stereophonic recording uses two or more channels.

5
Audio Formats

Audio Formats

6
Audio Formats

The Need for Audio Compression

7
Sound Amplitude

· Measured in decibels, which is a relative (not absolute)


measuring system.
· The decibel scale is a simple way of comparing two
sound intensity levels (watt per square centimeter)
· Defined as 10 log (I/Io), where I is the intensity of the
measured sound signal, and Io is an arbitrary reference
intensity.
· For audio purposes, Io is taken to be the minimum
intensity of sound that humans can hear, experimentally
measured to be 10^-16 W/cm^2.
· Any I with intensity lower than Io will therefore have a
negative decibel value, and vice versa.

Sound Amplitude

· If an audio signal has an intensity of 10^-9 W/cm^2, its
decibel level is 10 log(10^-9 / 10^-16) = 70 dB.
· Because the scale is logarithmic, an increase of 10 dB implies
that the sound intensity has gone up tenfold.
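As a quick check of these numbers, here is a minimal Python sketch (the function name and test values are illustrative, not from the slides) that computes the decibel level of an intensity relative to the 10^-16 W/cm^2 hearing threshold:

    import math

    I0 = 1e-16  # assumed reference intensity (threshold of hearing), in W/cm^2

    def intensity_to_db(intensity):
        """Decibel level of a sound intensity relative to I0."""
        return 10.0 * math.log10(intensity / I0)

    print(intensity_to_db(1e-9))   # 70.0 dB, matching the example above
    print(intensity_to_db(1e-16))  # 0.0 dB, the threshold itself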

8
Acoustic Theory of Speech
· Speech sounds are sensations of air pressure vibrations
produced by air exhaled from the lungs and modulated and
shaped by the vibrations of the glottal cords and the resonance
of the vocal tract as the air is pushed out through the lips and
nose.
· Speech is also a sequence of elementary acoustic sounds or
symbols known as phonemes that convey the spoken form of a
language.
· Speech sounds have a rich and multi-layered temporal-
spectral variation that conveys words, intention, expression,
intonation, accent, speaker identity, gender, age, style of
speaking, state of health of the speaker, and emotion.

17

Acoustic Theory of Speech


· The anatomy of the human speech
production consists of: the lungs,
larynx, vocal tract cavity, nasal
cavity, teeth, lips, and the
connecting tubes.
· It is useful to interpret speech
production in terms of acoustic
filtering. The three main cavities of
the speech production system are
nasal, oral, and pharyngeal
forming the main acoustic filter.
· The form and shape of the vocal
and nasal tracts change
continuously with time, creating
an acoustic filter with time-varying
frequency response.
18

9
Acoustic Theory of Speech

· Inside the larynx is one of the


most important components of
the speech production system
the vocal cords.
· Vocal cords are a pair of elastic

bands of muscle and mucous


membrane that open and close
rapidly during speech
production.

19

Acoustic Theory of Speech


· The vocal tract: Pharyngeal + oral cavities, where the form
and shape of the vocal and nasal tracts change
continuously with time creating an acoustic filter with
time-varying frequency response.
· As air from the lungs travels through the tracts, the
frequency spectrum is shaped by the frequency selectivity
of these tracts.
· Inside the larynx, the vocal cords serve to determine the
frequency of speech.

20

10
Source-Filter Model of Speech Production
· Speech sounds result from a combination of:
¹ a source of sound energy (the larynx) modulated by a time-
varying transfer function filter (vocal tract) determined by the
shape and size of the vocal tract.
· This results in a shaped spectrum with broadband energy
peaks.
· In this model the source of acoustic energy is at the larynx, and
the vocal tract serves as a time-varying filter whose shape
determines the phonetic content of the sounds.

21

Source-Filter Model of Speech Production


· The shape of the glottal pulses of air flow is determined by the
manner and the duration of the opening and closing of the glottal
folds in each cycle of voice sounds and contributes to the perception
of the voice quality and the speaker’s identity.
· The glottal pulse consists of an open phase and a closed phase during
a pulse.
· The modeling and estimation of the glottal pulse is one of the
ongoing challenges of speech processing research in order to estimate
the period.
· The period T0 of the glottal waveform is called the pitch period;
F0=1/T0 is the fundamental frequency (pitch) of speech harmonics.
· The pitch is an important parameter used in speech recognition.

22
The Liljencrants-Fant (LF)
model of a glottal pulse
11
Acoustic Theory of Speech

· Whereas the source model of speech signals captures and explains


the detailed fine structure of speech spectrum, the filter model
captures and explains the envelope of the speech spectrum.
· The reflective and resonance characteristics of the physical space, such
as the vocal tract, through which a sound wave propagates change the
spectrum of the sound and its perception.
· The vocal tract space composed of the oral and the nasal cavities and
airways can be viewed as a time-varying acoustic filter that amplifies
and filters the sound energy and shapes its frequency spectrum.
· The resonance frequencies of the vocal tract tube are called formant
frequencies or simply formants, which depend on the shape and
dimensions of the vocal tract.
· The vocal tract is slowly varying i.e. changes occur slowly compared
to the pitch period.
23

Voiced and Unvoiced Speech


· There are two broad types of speech sounds:
¹ voiced sounds, like an “e”, are essentially due to vibrations of
the vocal cords (Periodic air pulses), they are well modeled
by sums of sinusoids
¹ unvoiced sounds, like “s”, are more noise-like (Force air
through a constriction in vocal tract, producing turbulence).

24

12
Voiced and Unvoiced Speech
· For many speech applications, it is important to distinguish between voiced and
unvoiced speech.
· There are two methods for doing it:
¹ Short-time energy function: Split the speech signal x(n) into blocks (frames) of 10-20 ms and
calculate the energy within each block. The amplitude of unvoiced segments is
noticeably lower than that of the voiced segments. The short-time energy of the frame
ending at time instant m reflects the amplitude variation and is

      E(m) = sum from n = m-N+1 to m of x^2(n)

¹ Zero-crossing rate: The rate at which the speech signal crosses zero can provide
information about the source of its creation. Unvoiced speech has a much higher ZCR
than voiced speech, because most of the energy in unvoiced speech is found in
higher frequencies than in voiced speech, implying a higher ZCR for the former. The
zero-crossing rate of the frame ending at time instant m is defined by

      ZCR(m) = (1/2) * sum from n = m-N+1 to m of |sgn(x(n)) - sgn(x(n-1))|

A sketch of both measures in code is given below.

25
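A minimal Python sketch of the two measures above (the frame length, sampling rate, and toy signals are illustrative assumptions, not values from the slides):

    import numpy as np

    def short_time_energy(x, m, N):
        """Energy of the frame of length N ending at sample m."""
        frame = x[m - N + 1 : m + 1]
        return float(np.sum(frame ** 2))

    def zero_crossing_rate(x, m, N):
        """(1/2) * sum of |sgn(x(n)) - sgn(x(n-1))| over the frame ending at m."""
        frame = np.sign(x[m - N + 1 : m + 1])
        return 0.5 * float(np.sum(np.abs(np.diff(frame))))

    # Toy comparison: a low-frequency periodic ("voiced-like") signal versus
    # low-amplitude noise ("unvoiced-like"), both at an assumed 8 kHz rate.
    fs, N = 8000, 160                       # 20 ms frames
    t = np.arange(2 * N) / fs
    voiced = np.sin(2 * np.pi * 100 * t)
    unvoiced = 0.1 * np.random.randn(2 * N)
    for name, s in (("voiced", voiced), ("unvoiced", unvoiced)):
        print(name, short_time_energy(s, 2 * N - 1, N), zero_crossing_rate(s, 2 * N - 1, N))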

Voiced and Unvoiced Speech

· The challenge is to find the best threshold.

26

13
Voiced and Unvoiced Model
· A representation of the excitation in terms of separate source
generators for voiced and unvoiced speech.
· The unvoiced excitation is assumed to be a random noise sequence.
· The voiced excitation is assumed to be a periodic impulse train with
impulses spaced by the pitch period.
· Each voiced sound is approximately periodic, but different sounds
are different periodic signals.
· we can model the vocal tract as an LTI (Linear Time-Invariant) filter
over short time intervals, and over the timescale of phonemes (10, 20
and 30 ms), the impulse response, frequency response, and system
function of the system remains relatively constant.

27

Voiced and Unvoiced Model


· The waveform for this particular
phoneme will then be the
convolution of the driving periodic
pulse train x(t) with the impulse
response v(t).
· The magnitude of its spectrum
|S(f)| will be the product of X(f)
and the magnitude response |V
(f)|.
· The maxima of |S(f)| are the
formant frequencies of the
phoneme.
· It is common to process speech in
blocks (also called “frames”) over
which the properties of the speech
waveform can be assumed to
remain relatively constant.
28

14
Human Auditory System
· Sound enters the ear and travels through the ear canal.
· The eardrum (transductor) receives the sound pulses and conveys the pressure changes
via middle ear bones and cochlea to auditory nerve endings.
· The middle ear, consisting of the eardrum and very small bones, is connected to a small
membrane attached to the fluid-filled cochlea.
· The middle ear’s main function is to convert the acoustic air pressure vibrations in the ear
canal to fluid pressure changes in the cochlea.
· The inner ear starts with the oval window at one end of the cochlea.
· The mechanical vibrations from the middle ear excite the fluid in the cochlea via this
membrane.
· The cochlea is a long helically coiled and tapered tube filled with fluid.
· Internally, the cochlea is lined with hair cells that sense frequency changes propagated
through its fluid, and the nerves transform the mechanical vibrations of fluid to electrical
signals.

Absolute Threshold (AT)


· It is the minimum detectable level of that sound in the absence of any other
external sounds.
· It characterizes the amount of energy needed in a pure tone such that it can be
detected by a listener in a noiseless environment.
· This curve gives the decibel levels for all frequencies at which the sound at that
frequency is just audible.
· This curve shows the threshold of human hearing; it varies from person to
person.
· To create such a curve, a human subject is kept in a quiet room and a tone at a
particular frequency is generated so that the subject can hear it. Its level is then reduced
to zero and increased until it is just barely audible, and the decibel level for that frequency
is noted.
· Repeating this process for all the frequencies generates the curve.

15
Frequency Domain Limits

· The human auditory system is capable of perceiving


frequencies between 20 Hz and 20 KHz.
· Higher frequencies (ultrasonic, above 20 KHz) are not

perceived by the ear.


· The dynamic range of hearing, defined as the ratio of the
maximum sound amplitude to the quietest sound
humans can hear is around 120 decibels.
· Prior to compression appropriate filters can be utilized to

ensure that only the “audible frequency content” is


input to the encoder.

Time Domain Limits

· Normally, events separated by more than 30


milliseconds can be resolved separately.
· The perception of simultaneous events (less than 30

milliseconds apart) is resolved in the frequency


domain.

16
Masking/ Hiding

· If the incoming audio signal has a frequency whose


energy level is below the threshold curve, it will not be
heard.
· This is important in processing and compression because

if a frequency is ultimately not going to be perceived,


there is no need to assign any bits to it.
· However, if the input signal has frequency whose

amplitude level is above the threshold curve, it must be


assigned bits during processing and compression.

Masking/ Hiding
· In addition, the presence of a perceptible frequency changes the
shape of the threshold curve.
· All the audible levels were recorded in the presence of a 1 KHz
frequency at 45 dB.
· From the curve, it can be seen that when there is a 1 KHz tone
present, other frequency tones less than 45 dB in the neighborhood
of 1 KHz are not perceived.
· A neighborhood tone of 1.1 KHz at 20 dB would normally have been
heard well (in quiet), but now will not be heard as a distinct tone in
the presence of the stronger 1 KHz tone. The 1 KHz tone is masking
or hiding the 1.1 KHz neighborhood tone for the decibel levels
discussed.
· Here the 1 KHz tone is known as the masker and any other tone in
the neighborhood, which is masked, is known as the maskee.

17
Masking/ Hiding

· Masking refers to the phenomenon where one sound is rendered


inaudible because of the presence of other sounds.
· The presence of a single tone, for instance, can mask the neighboring
signals with the masking capability inversely proportional to the
absolute difference in frequency.
· In general, masking capability increases with the intensity of the
reference signal, or the single tone in this case.
· The features of the masking curve depend on each individual and can
be measured in practice by putting a subject in a laboratory
environment and asking for his/her perception of a certain sound
tuned to some amplitude and frequency values in the presence of a
reference tone.

Masking/ Hiding

· The higher the frequency, the larger the spectral spread that it hides
· When a 1 KHz and 1.1 KHz tone are simultaneously present at 45 dB
and 20 dB respectively, the 1 KHz tone masks the 1.1 KHz tone and,
hence, the higher tone cannot be heard.
· If the 45 dB masking tone is removed and only the maskee 20 dB tone
is present, the ear should perceive it because it is above the threshold
curve in quiet.

18
Masking/ Hiding

· The human auditory system is naturally divided into


different bands such that the ear cannot perceptually
resolve two frequencies in the same band.
· Experiments indicate that this critical bandwidth remains
constant, 100 Hz in width at frequencies below 500 Hz.
· But this width increases linearly for frequencies above
500 Hz, from 100 Hz to about 6 KHz.
· Although this is an empirical formulation, it shows that
the ear operates like a set of band-pass filters, each
allowing a limited range of frequencies through and
blocking all others.

Masking/ Hiding

Masking effects with multiple maskers. In the presence of multiple


maskers, the threshold curve gets raised, masking a lot more
frequencies
19
Masking/ Hiding

· When a 1 KHz and 1.1 KHz tone are simultaneously present at


45 dB and 20 dB respectively, the 1 KHz tone masks the 1.1
KHz tone and, hence, the higher tone cannot be heard.
· If the 45 dB masking tone is removed and only the maskee 20
dB tone is present, the ear should perceive it because it is above
the threshold curve in quiet.
· However, the 20 dB tone is not heard instantly but only after a
finite small amount of time.
· The amount of time depends on the intensity difference
between the masker and the maskee.
· The larger the masking tone’s intensity, the longer it will take
for the maskee to be heard after the masking tone’s removal.

Analog & Digital Signals

· A signal is analog if it can be represented by a continuous function


· Digital signals are represented by a discrete set of values at specific
instances of the input domain (time, space or both)
· These specific instances are usually regular

20
Advantages of Digital over Analog

· Create complex and interactive content


- Ex: access a pixel in an image, or a group of pixels in a region or even a section of a sound track.
- Different digital media types can be combined to create richer content (not easy in analog medium).

· Stored digital signals don't degrade over time
(not often the case with analog signals)
- Ex: VHS tapes lose image quality with repeated usage.

· Digital data can be efficiently compressed and transmitted across


networks

· Easy to store digital signals


- Magnetic media (portable 3.5 inch floppy disks), hard drives, flash drives, memory cards, …
- Digital data from any source can be stored on a common medium (not the case with analog data).

In brief: digital data is preferred because it offers better quality and higher fidelity, allows the creation of mixed content, and can be compressed, distributed,
stored and retrieved easily.

Analog to Digital Conversion


· Two steps to convert from A to D:
Sampling
¹

¹ Quantization

To convert back from D to A, we have to do interpolation.

Original Reconstructed

42
Note: the most desirable property in ADC is to ensure that the rendered analog signal is very
similar to the initial analog signal (e.g., when the digital content is rendered on an end
device such as a CRT monitor).
ADC: Sampling

· In brief, sampling means choosing "some" values, i.e. samples, from the
signal rather than all of its values.
· Given a signal x(t) and a sampling period T, the sampled signal is
xs(n) = x(nT): xs(1) = x(T), xs(2) = x(2T), and so on.

Note:
- T is a critical parameter. Should it be the same for every signal?
- Decreasing T (increasing the sampling frequency f) increases the number of samples and thus the storage requirement.
- T too large => undersampled => aliasing.
- T too small => oversampled => large amount of storage (possibly redundant information).

ADC: Sampling

· Sampling is done in:


¹ 1D (time) for Sound signals
¹ 2D (x,y) for Image signals
¹ 3D (x,y,time) for Video signals

44

22
ADC: Quantization

· Encoding the signal value at every sampled location with a


pre-defined precision, i.e. number of levels
· How many levels should we use? How many bits should be
used?
· A signal whose values range from 0 to 8 can be represented
by 2 bits, 3 bits or 4 bits … which is better?

45

ADC: Quantization

· Examine the
difference:
sampled at same frequency and
quantized with:
- 8 bits (256 levels)
- 4 bits (16 levels)
- 3 bits (8 levels).

46

23
ADC: Quantization
· The entire range R of the signal is represented by a finite
number of bits b.
· Q[] maps the continuous value xs(n) to the nearest digital
value xq(n) using b bits.
· With b bits, we have 2^b levels (3 bits => 8 levels).
· Quantization step: Delta = R / 2^b.
· As b increases, the quantization error decreases.

Question: how many bits b should represent each sample? Is it the same for all signals?
Answer: it depends on the signal type and its usage.
47
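A minimal uniform quantizer sketch in Python (the signal range, test tone, and bit depths are illustrative assumptions):

    import numpy as np

    def quantize(x, b, x_min=-1.0, x_max=1.0):
        """Map each sample of x to one of 2**b uniformly spaced levels."""
        levels = 2 ** b
        delta = (x_max - x_min) / levels              # quantization step: R / 2^b
        idx = np.clip(np.round((x - x_min) / delta), 0, levels - 1)
        return x_min + idx * delta                    # reconstructed (quantized) values

    x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)   # 440 Hz tone at 8 kHz
    for b in (3, 4, 8, 16):
        err = x - quantize(x, b)
        print(b, "bits -> RMS quantization error:", np.sqrt(np.mean(err ** 2)))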

ADC: Quantization
· What are the challenges?
· Audio signals (music) are quantized using 16 bits.

· Audio signals (speech): 8 bits is enough.

· Some signals require non-uniform quantization if the
distribution of their values is non-uniform.

48

24
Bit Rate
· Number of bits being produced per second.
· Important when storing digital signals and transmitting over
networks with varying bandwidths.

49

Signal Types
· Continuous, smooth, non-smooth, symmetric, Finite
support, periodic…

50

25
Linear Time Invariant (LTI) Systems
· A system with input x and output y is LTI if it is linear (superposition
holds) and time-invariant (a time shift of the input produces the same
time shift of the output).
· For an LTI system, the output is the convolution of the input signal
with the system's impulse response.

51

Linear Time Invariant (LTI) Systems


· Relationship between time and frequency domains for LTI
systems:

· In the time domain, the system is characterized by its impulse
response.
· In the frequency domain, it is characterized by its transfer function,
which is the Fourier transform of the impulse response.
26
Dirac Delta Function
· Sharp peak at 0, and value 0 everywhere else.
· Integral of 1: the area under δ(t) is 1.
· Sifting property: the integral of f(t) δ(t - T) over all t equals f(T);
the delta function "sifts out" the value of the signal at time T.
· It follows that, using the symmetry of the delta function, convolution
with δ(t - T) "time-delays" f(t) by any amount T: f(t) * δ(t - T) = f(t - T).

Nyquist Rate
· Look at these signals, and tell me how many samples we need
from each one to fully reconstruct it back to its analog form.

· Nyquist rate: The signal has to be sampled using a sampling


frequency greater than twice the maximal frequency occurring in
the signal.
· For a signal with a maximal frequency of 10 kHz, the sampling should
happen at a rate above 20 kHz (the Nyquist rate).
· Consider a sine wave with 1 Hz frequency, and try sampling it at 1 Hz and at
2 Hz (see the sketch below).
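A numerical check of the 1 Hz example in the last bullet; a small Python sketch (the sampling rates and sample counts are illustrative):

    import numpy as np

    f = 1.0  # frequency of the sine x(t) = sin(2*pi*f*t), in Hz

    def sample(fs, n_samples=8):
        """Return n_samples of the 1 Hz sine taken at sampling rate fs."""
        n = np.arange(n_samples)
        return np.round(np.sin(2 * np.pi * f * n / fs), 3)

    print(sample(fs=1.0))   # all zeros: sampled once per period, the oscillation is lost
    print(sample(fs=2.0))   # still all zeros: samples land exactly on the zero crossings
    print(sample(fs=8.0))   # above the Nyquist rate: the waveform is clearly captured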
Nyquist Rate

· If sampling frequency is higher than Nyquist, no problem!


· But, more data to store and/or transmit

· If sampling is lower than Nyquist, signal is not well

captured which results in artifacts when converted back to


analog Æ Aliasing.
· Aliasing is the effect that makes helicopter blades appear to rotate
slower: our eyes can't sample fast enough.

55

Digital Filters
· Filters remove unwanted parts of the signal before sampling,
such as background noise beyond 4 kHz in a song.
· There are analog (Circuit components) and digital filters

(PC or DSP chip).

28
Digital Filters Advantages
· Programmable
· Designed and tested on PCs

· Combined with other filters easily

· Not temperature and wear-out dependent like analog

ones
· Handle low-freq accurately

· Versatile in general

Fourier Transform

· Every periodic continuous signal can be expressed as a


weighted combination of sinusoid (cosine and sine) waves.
· The weights are called Fourier Series coefficients, or

spectral components.
· Transform the signal from Time to Frequency Domain.

29
Fourier Transform
· Every periodic continuous signal of period T can be expressed as a
weighted combination of sinusoid (sine and cosine) waves:

      x(t) = sum over i of [ Ai sin(wi t) + Bi cos(wi t) ],  with wi = i(2π)/T

· Ai and Bi are the coefficients.
· We can choose sine, cosine or both.

Fourier Transform
· If, for example, we choose only cosines:

      Bi = (2/T) ∫ x(t) cos(wi t) dt, integrated over one period T

· Bi gives the importance or weight of the ith frequency
wi = i(2π)/T.
· If the integral evaluates to non-zero, the frequency wi
is present in the signal.
· If it evaluates to zero or a very small value, it is not
present.
30
Fourier Transform Examples

Discrete Fourier Transform

If only cosines:

31
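To make the integral test above concrete for a sampled signal, here is a small Python sketch (the test signal, duration, and sampling rate are illustrative) that approximates the cosine coefficient Bi by a discrete sum:

    import numpy as np

    T, fs = 1.0, 1000                       # assumed period of 1 s, sampled at 1 kHz
    t = np.arange(0, T, 1 / fs)
    x = 3 * np.cos(2 * np.pi * 5 * t) + 1 * np.cos(2 * np.pi * 12 * t)

    def cosine_coefficient(x, i):
        """Weight Bi of the i-th frequency wi = i*2*pi/T via a Riemann sum of the integral."""
        wi = i * 2 * np.pi / T
        return (2 / T) * np.sum(x * np.cos(wi * t)) / fs

    for i in (3, 5, 12):
        print(i, round(cosine_coefficient(x, i), 3))   # ~0, ~3 and ~1: only 5 Hz and 12 Hz are present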
Speech Processing

· Speech processing is the study of speech signals and their


processing methods. This includes a number of related areas:
¹ Speech coding: compression of speech signals for
telecommunication
¹ Speech recognition: extracting the linguistic content of the speech
signal
¹ Speech synthesis: computer-generated speech (e.g., from text)
· The basic and commonly used signal processing techniques in
speech analysis are:
¹ Pitch period estimation,
¹ All-pole/all-zero filters,
¹ Convolution.
63

Speech Processing
· Properties of speech signals change with time. To
process them effectively it is necessary to work on a
frame-by-frame basis, where a frame consists of a
certain number of samples.
· The actual duration of the frame is known as length.
Typically, length is selected between 10 and 30 ms (or
80 and 240 samples).
· Within this short interval, properties of the signal
remain roughly constant.
· Thus, many signal processing techniques are adapted
to this context when deployed to speech coding
applications.
64

32
Speech Processing:
Pitch Period Estimation

· One of the most important parameters in speech analysis,


synthesis, and coding applications is the fundamental
frequency, or pitch, of voiced speech.
· The pitch frequency is directly related to the speaker and sets
the unique characteristic of a person.
· For men, the possible pitch frequency range is usually found
somewhere between 50 and 250 Hz, while for women the range
usually falls between 120 and 500 Hz. In terms of period, the
range for a male is 4 to 20 ms, while for a female it is 2 to 8 ms.
· Pitch period must be estimated at every frame. By comparing a
frame with past samples, it is possible to identify the period in
which the signal repeats itself, resulting in an estimate of the
actual pitch period.
65

Speech Processing:
Pitch Period Estimation

· Design of a pitch period estimation algorithm is a


complex undertaking due to lack of perfect periodicity,
interference with formants of the vocal tract, uncertainty
of the starting instance of a voiced segment, and other
real-world elements such as noise and echo.
· Many techniques have been proposed for the estimation

of pitch period, from these methods there are:


¹ The AutoCorrelation Method.
¹ The Magnitude Difference Function.

66

33
Speech Processing:
Pitch Period Estimation
The AutoCorrelation Method
· To perform the estimation on the signal s(n), with n being
the time index.
· Consider the frame that ends at time instant m, where the

length of the frame is equal to N (i.e., from n= m-N+1 to


m).
· The autocorrelation value reflects the similarity between

the frame s(n), from n= m-N+1 to m, with respect to the


time-shifted version s(n- l), where l is a positive integer
representing a time lag.
· The range of lag is selected so that it covers a wide range

of pitch period values.


67

Speech Processing:
Pitch Period Estimation
The AutoCorrelation Method
· For instance, for l =20 to 150, the possible pitch frequency
values range from 53 to 400 Hz at 8 kHz sampling rate.
· By calculating the autocorrelation values for the entire range of
lag, it is possible to find the value of lag associated with the
highest autocorrelation representing the pitch period estimate,
since, in theory, autocorrelation is maximized when the lag is
equal to the pitch period.
· It is important to mention that, in practice, the speech signal is
often lowpass filtered before being used as input for pitch
period estimation. Since the fundamental frequency associated
with voicing is located in the low-frequency region (<500 Hz),
lowpass filtering eliminates the interfering high-frequency
components as well as out-of-band noise, leading to a more
accurate estimate.
68

34
Speech Processing:
Pitch Period Estimation
The AutoCorrelation Method
· This figure shows a voiced portion of a speech
waveform and the autocorrelation values obtained for
l = 20 to 150, m = 1500 and N = 180. The lag corresponding
to the highest peak is 71, which is the pitch period estimate.

69
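A direct Python sketch of the autocorrelation method (the synthetic test signal with a 71-sample period is an illustrative assumption; the lag range, m, and N follow the example above):

    import numpy as np

    def autocorr_pitch(s, m, N, lag_min=20, lag_max=150):
        """Lag in [lag_min, lag_max] maximizing R(l, m) = sum_{n=m-N+1..m} s(n) s(n-l)."""
        frame = s[m - N + 1 : m + 1]
        best_lag, best_val = lag_min, -np.inf
        for l in range(lag_min, lag_max + 1):
            r = float(np.dot(frame, s[m - N + 1 - l : m + 1 - l]))
            if r > best_val:
                best_lag, best_val = l, r
        return best_lag

    # Synthetic "voiced" signal with a 71-sample period (about 112.7 Hz at 8 kHz).
    period = 71
    n = np.arange(2000)
    s = np.sin(2 * np.pi * n / period) + 0.3 * np.sin(4 * np.pi * n / period)
    print(autocorr_pitch(s, m=1500, N=180))   # expected: 71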

Speech Processing:
Pitch Period Estimation
Magnitude Difference Function
· One drawback of the autocorrelation method is the need for
multiplication, which is relatively expensive to
implement, especially on processors with limited
functionality. To overcome this problem, the magnitude
difference function (MDF) was introduced:

      MDF(l, m) = sum from n = m-N+1 to m of |s(n) - s(n-l)|

· For short segments of voiced speech it is reasonable to expect
that s(n) - s(n-l) is small for l = 0, ±T, ±2T, . . . , with T being the
signal's period. Thus, by computing the magnitude difference
function for the lag range of interest, one can estimate the
period by locating the lag value associated with the minimum
magnitude difference.
70

35
Speech Processing:
Pitch Period Estimation
Magnitude Difference Function
· Consider the same portion as in the autocorrelation example. The
plot of the MDF shows that the lowest value occurs at l = 70. Compared
with the previous result, the present method yields a slightly lower
estimate.
· The methods discussed so far can only find pitch period values that
are multiples of the sampling period, which is 0.125 ms at 8 kHz. In many
applications, higher resolution is necessary to achieve good
performance. Other signal processing techniques can be introduced
to extend the resolution beyond the limits set by the fixed sampling rate.

71

Speech Processing:
The All-Pole and All- Zero Filter
· Consider the filters with system functions

      H(z) = 1 / A(z),   A(z) = 1 + a1 z^-1 + a2 z^-2 + … + aM z^-M

¹ M is the order of the filter and the ai are the filter's
coefficients.
· H(z) and A(z) are the inverse of each other. H(z) is an all-
pole filter since only poles are present, while A(z) is an all-
zero filter.
· Letting x(n) be the input to the filter and y(n) the output,
the time-domain equations corresponding to the H and A
filters respectively are

      y(n) = x(n) - a1 y(n-1) - … - aM y(n-M)        (all-pole, H)
      y(n) = x(n) + a1 x(n-1) + … + aM x(n-M)        (all-zero, A)
72
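A sample-by-sample Python sketch of both filters under the sign convention above (the coefficient values are illustrative):

    import numpy as np

    def all_zero_filter(a, x):
        """FIR: y(n) = x(n) + a1*x(n-1) + ... + aM*x(n-M)."""
        y = np.zeros(len(x))
        for n in range(len(x)):
            y[n] = x[n] + sum(a[i] * x[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        return y

    def all_pole_filter(a, x):
        """IIR: y(n) = x(n) - a1*y(n-1) - ... - aM*y(n-M)."""
        y = np.zeros(len(x))
        for n in range(len(x)):
            y[n] = x[n] - sum(a[i] * y[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        return y

    a = [-0.9, 0.4]                   # illustrative coefficients a1, a2
    x = np.zeros(10); x[0] = 1.0      # unit impulse input
    print(all_zero_filter(a, x))      # only M+1 = 3 nontrivial samples (FIR behaviour)
    print(all_pole_filter(a, x))      # response keeps going: infinite impulse response (IIR)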

36
Speech Processing:
The All-Pole and All- Zero Filter
· A signal flow graph for direct form implementation of both filters.
· Note that the impulse response of an all-pole filter has an infinite
number of samples with nontrivial values due to the fact that the
scaled and delayed versions of the output samples are added back to
the input samples.
· This is referred to as an infinite-impulse-response (IIR) filter. For the
all-zero filter, however, the impulse response only has M+1 nontrivial
samples (the rest are zeros) and is known as a finite-impulse-response
(FIR) filter.

73

Speech Processing:
The All-Pole and All- Zero Filter
Calculation of Output Sequence on a Frame-by Frame Basis

74

37
Speech Processing:
The All-Pole and All- Zero Filter
State-Save Method

75

Speech Processing:
The All-Pole and All- Zero Filter
Zero-Input Zero-State Method

76

38
Speech Processing:
Short time Spectral Analysis

· The short-time Fourier transform plays a fundamental role in
frequency-domain analysis of the speech signal. It is used to
represent the time-varying properties of the speech waveform in the
frequency domain. The time-dependent Fourier transform is

      X(k, ω) = sum over n of w(k - n) x(n) e^{-jωn}

¹ where w(k − n) is a real window sequence used to isolate the portion
of the input signal that will be analyzed at a particular time index k.
· During the analysis of speech signals, the shape and length of the
window can affect the frequency representation of speech.
· Thus, the role of the window in speech processing is to determine
the portion of the speech signal to be processed.
· This is done by zeroing out the signal outside the region of interest.
77
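A minimal Python sketch of the windowed spectrum of a single frame (the Hamming window, frame position and length, and the 500 Hz test tone are illustrative choices):

    import numpy as np

    def short_time_spectrum(x, k, N):
        """DFT of the Hamming-windowed frame of length N ending at index k."""
        frame = x[k - N + 1 : k + 1] * np.hamming(N)   # window isolates the region of interest
        return np.fft.rfft(frame)

    fs = 8000
    n = np.arange(4000)
    x = np.sin(2 * np.pi * 500 * n / fs)               # 500 Hz tone
    X = short_time_spectrum(x, k=2000, N=256)
    peak_bin = int(np.argmax(np.abs(X)))
    print(peak_bin * fs / 256)                         # ~500 Hz: the tone's frequency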

Speech Processing:
Short time Spectral Analysis

· There are many possible windows, including rectangular,
Bartlett, Hamming, Hanning, Blackman, Kaiser, etc. The
time-domain shapes of these window functions are
illustrated in the figure.

78

39
Speech Processing:
Convolution
· Given a linear time-invariant (LTI) system with impulse response h(n), and
denoting the input as x(n) and the output as y(n), the convolution sum between
x(n) and h(n) is one of the fundamental relations in signal processing and is given
by

      y(n) = sum over k of x(k) h(n - k)

· We explore the use of the convolution sum for the calculation of the output
sequence, with emphasis on the all-pole filter. A straightforward way to find the
impulse response sequence is by using the time-domain difference equation of the
all-pole filter when the input is a single impulse, x(n) = δ(n):

      h(n) = δ(n) - a1 h(n-1) - a2 h(n-2) - … - aM h(n-M)

· Proceeding on a sample-by-sample basis, the impulse response sequence is
determined by the filter coefficients. We have: h(n) = 0 for n < 0, h(0) = 1,
h(1) = -a1 h(0) = -a1, h(2) = -a1 h(1) - a2, …, h(M-1) = -a1 h(M-2) - … - aM-2 h(1) - aM-1.
79
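The relation can be checked numerically; a short Python sketch (coefficients and input are illustrative) builds the impulse response by the recursion above and reproduces the all-pole filter output through the convolution sum:

    import numpy as np

    def all_pole_impulse_response(a, length):
        """h(0) = 1, h(n) = -a1*h(n-1) - ... - aM*h(n-M) for n >= 1."""
        h = np.zeros(length)
        h[0] = 1.0
        for n in range(1, length):
            h[n] = -sum(a[i] * h[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        return h

    a = [-1.2, 0.5]                          # illustrative a1, a2 (stable filter)
    x = np.random.randn(50)                  # arbitrary input sequence
    h = all_pole_impulse_response(a, 200)    # long enough to approximate the IIR response

    # Output via the convolution sum y(n) = sum_k x(k) h(n-k), truncated to len(x).
    y_conv = np.convolve(x, h)[: len(x)]

    # Output via the difference equation y(n) = x(n) - a1*y(n-1) - a2*y(n-2).
    y_rec = np.zeros(len(x))
    for n in range(len(x)):
        y_rec[n] = x[n] - sum(a[i] * y_rec[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)

    print(np.max(np.abs(y_conv - y_rec)))    # essentially zero: both computations agree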

Thank you For Listening


Digital Voice Processing II
Chapter 1

40
41
DSPII- Voice Processing

Speech Coding
Chapter 2

Dr. Ali Al-Hajj Hassan

Outline
· Introduction of Speech Coding
· Desirable Properties of a Speech Coder
· Coding Delay
· Classification of Speech Coders
· Waveform Coders
· DPCM
· ADPCM
· Linear Prediction
· Auto-Regressive (AR) Model
· Linear Prediction Problem
· Error Minimization
· Minimum Mean-Squared Prediction Error
· Estimation of LP Coefficients
· Levinson-Durbin
· Leroux-Gueguen

· LPC
· LPC CODER
· FS1015 LPC CODER
· LPC DECODER
2
· LIMITATIONS OF LPC

42
Introduction of Speech Coding

· Speech coding is a procedure to represent a digitized
speech signal using as few bits as possible while maintaining
a reasonable level of speech quality. This
procedure is also known as speech compression.
· Due to the increasing demand for speech communication,
speech coding technology has received increasing levels
of interest from the research, standardization, and
business communities.
· Advances in microelectronics and the wide availability of
low-cost programmable processors and dedicated chips
have enabled rapid technology transfer from research to
product development.
3

Introduction of Speech Coding

· The block diagram of a speech coding system to support


telecommunication applications.
· The output is a discrete-time speech signal whose sample

values are also discretized.

43
Introduction of Speech Coding
· The frequency content is limited to between 300 and 3400 Hz.
· According to the Nyquist theorem, the sampling frequency
must be at least twice the bandwidth of the continuous-time
signal in order to avoid aliasing.
· 8 kHz is commonly selected as the standard sampling
frequency for speech signals.
· To convert the analog samples to a digital format using
uniform quantization while maintaining quality, more than 8
bits/sample is necessary.
· The use of 16 bits/sample provides a quality that is considered
high.
· The bit rate at the input is 64 kbps for 8 bits/sample (128 kbps
for 16 bits/sample).
5

Introduction of Speech Coding


· The source encoder attempts to reduce the input bit-rate.
The output of the source encoder represents the encoded
digital speech and in general has substantially lower bit-
rate than the input.
· The encoded digital speech data is further processed by
the channel encoder, providing error protection to the
bit-stream before transmission to the communication
channel, where various noise and interference can
sabotage the reliability of the transmitted data.
· In this course we focus on the design of the source
encoder and source decoder.
· The decoder takes the encoded bit-stream as its input to
produce the output speech signal.
6

44
Desirable Properties of a Speech Coder
· The main goal of speech coding is either to maximize the
perceived quality at a particular bit-rate, or to minimize
the bit-rate for a particular perceptual quality.
· The appropriate bit-rate at which speech should be
transmitted or stored depends on the cost of
transmission or storage, the cost of coding
(compressing) the digital speech signal, and the speech
quality requirements.
· The bit-rate is reduced by representing the speech signal
(or parameters of a speech production model) with
reduced precision and by removing inherent redundancy
from the signal, resulting therefore in a lossy coding
scheme.
7

Desirable Properties of a Speech Coder


·Low bit-rate: The lower the bit-rate of the encoded bit-
stream, the less bandwidth is required for transmission,
leading to a more efficient system. This requirement is in
constant conflict with speech quality. In practice, a trade-
off is required.
· High speech quality: The decoded speech should have a
quality acceptable for the target application.
· Robustness across different speakers/languages: The
underlying technique of the speech coder should be
general enough to model different speakers and different
languages adequately. Note that this is not a trivial task,
since each voice signal has its unique characteristics.
8

45
Desirable Properties of a Speech Coder
·Robustness in the presence of channel errors: This is
crucial for digital communication systems where channel
errors will have a negative impact on speech quality.
· Low memory size and low computational complexity
(efficient realization): In order for the speech coder to be
practicable, costs associated with its implementation must
be low; these include the amount of memory needed for
its operation, as well as computational demand.
· Low coding delay: In the process of speech encoding and
decoding, delay is introduced, which is the time shift
between the input speech of the encoder with respect to
the output speech of the decoder. An excessive delay
creates problems with real-time two-way conversations.
9

Coding Delay
· Encoder buffering delay: many speech encoders require
the collection of a certain number of samples before
processing. The typical linear prediction (LP)-based
coders need to gather one frame of samples ranging from
160 to 240 samples, or 20 to 30 ms, before proceeding with
the actual encoding process.

10

46
Coding Delay

· Encoder processing delay: the encoder consumes a certain


amount of time to process the buffered data and construct
the bit-stream. This delay can be shortened by increasing
the computational power of the underlying platform and
by utilizing efficient algorithms. The processing delay
must be shorter than the buffering delay, otherwise the
encoder will not be able to handle data from the next
frame.

11

Coding Delay
· Transmission delay: also known as decoder buffering
delay, since it is the amount of time that the decoder must
wait in order to collect all bits related to a particular
frame so as to start the decoding process.
· There are only 2 transmission modes:
¹ In constant mode, the bits are transmitted synchronously at a
fixed rate, which is given by the number of bits corresponding to
one frame divided by the length of the frame. This mode of
operation is dominant for most classical digital communication
systems, such as wired telephone networks.
¹ In burst mode, all bits associated with a particular frame are
completely sent within an interval that is shorter than the encoder
buffering delay. This mode is inherent to packetized network and
the internet, where data are grouped and sent as packets.
12

47
Coding Delay

· Decoder processing delay: this is the time required to


decode in order to produce one frame of synthetic speech.
As for the case of the encoder processing delay, its upper
limit is given by the encoder buffering delay, since a
whole frame of synthetic speech data must be completed
within this time frame in order to be ready for the next
frame.

13

Classification of Speech Coders


· By bit rate: All speech coders are designed to reduce the
reference bit-rate of 128 kbps toward lower values.
Depending on the bit-rate of the encoded bit-stream, it is
common to classify the speech coders according to the
following table.

14

48
Classification of Speech Coders

· By coding techniques: waveform coder, parametric coder


or hybrid.
· Waveform coders try to preserve the original shape of the

signal waveform, and hence it can generally be applied to


any signal source. These coders are better suited for high
bit-rate coding, since performance drops sharply with
decreasing bit-rate. In practice, these coders work best at
a bit-rate of 32 kbps and higher (examples: PCM, Adaptive
Differential PCM (ADPCM)).

15

Classification of Speech Coders

· The parametric coder makes no attempt to preserve the


original shape of the waveform. The speech signal is
assumed to be generated from a model, which is
controlled by some parameters. During encoding,
parameters of the model are estimated from the input
speech signal, and transmitted as the encoded bit-stream.
The most successful model, however, is based on linear
prediction (LP).

16

49
Classification of Speech Coders

· Hybrid coder combines the strength of a waveform coder


with that of a parametric coder.

17

Classification of Speech Coders


· By mode: Single-mode coders or multimode coders.
· Single-mode coders are those that apply a specific, fixed
encoding mechanism at all times, leading to a constant
bit-rate for the encoded bit-stream.
· Multimode coders were invented to take advantage of
the dynamic nature of the speech signal, and to adapt to
the time-varying network conditions. In this
configuration, one of several distinct coding modes is
selected, with the selection done by source control, when
it is based on the local statistics of the input speech, or
network control, when the switching obeys some external
commands in response to network needs or channel
conditions.
18

50
Waveform Coders
· The main coder in this category is Pulse Code Modulation
(PCM)
· process of quantizing the samples of a discrete-time signal
· PCM is the most obvious method developed for the digital coding
of waveforms
· A quantizer with a non-uniform transfer characteristic yields
higher performance and forms the core of the ITU-T G.711 PCM
standard (μ-law, A-law).
· The compression and expansion characteristics use μ = 255 and
A = 87.56; 8 bits/sample is adopted, leading to a bit-rate of 64 kbps
at an 8 kHz sampling frequency. The narrowband speech coding
standards related to PCM recommended by ITU-T are shown in the following table.

19
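A compact Python sketch of μ-law companding with μ = 255 (the uniform requantization of the companded value is an illustrative simplification, not the exact G.711 bit layout):

    import numpy as np

    MU = 255.0

    def mu_law_compress(x):
        """Map samples in [-1, 1] through the mu-law characteristic."""
        return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

    def mu_law_expand(y):
        """Inverse mapping back to the linear domain."""
        return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

    x = np.linspace(-1.0, 1.0, 5)
    y = np.round(mu_law_compress(x) * 127) / 127    # crude 8-bit quantization of the companded value
    print(mu_law_expand(y))                         # approximately recovers x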

DPCM

· DPCM encodes the PCM values as differences between


the current and the previous value. This general approach
is refined by predicting the current sample based on the p
previous samples.

20

51
ADPCM

· DPCM system described above has a fixed predictor and


a fixed quantizer; much can be gained by adapting the
system to track the time-varying behavior of the input.
· Adaptation can be performed on the quantizer, on the

predictor, or on both. They can change the prediction


order.
· The resulting system is called Adaptive Differential PCM

(ADPCM).

21

ADPCM

22

52
Linear Prediction (LP)
· Linear prediction (LP) forms an integral part of almost all
modern day speech coding algorithms.
· The fundamental idea is that a speech sample can be
approximated as a linear combination of past samples.
· The basic assumption is that speech can be modeled as an
autoregressive (AR) signal, which in practice has been found
to be appropriate.
· Linear prediction analysis is an estimation procedure to find
the AR parameters, given samples of the signal.
· LP is an identification technique where parameters of a system
are found from the observation.
· The goal is to minimize the error between the real and the
predicted value.
23

Auto-Regressive (AR) Model

· The sequence values x(n), x(n-1), …, x(n-M) represent the
realization of an autoregressive (AR) process of order M if they
satisfy the difference equation
      x(n) + a1 x(n-1) + a2 x(n-2) + … + aM x(n-M) = e(n)
· where the constants a1, a2, …, aM are known as the AR
parameters and e(n) represents a white noise process.
· The above equation can be written as
      x(n) = -a1 x(n-1) - a2 x(n-2) - … - aM x(n-M) + e(n)
· The present value of the process, x(n), is equal to a linear
combination of past values of the process, x(n-1), …, x(n-M),
plus an error term e(n). The process x(n) is said to be regressed
on x(n-1), …, x(n-M); in particular, x(n) is regressed on previous
values of itself, hence the name autoregressive.
24
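A small Python sketch that generates an AR signal by driving the difference equation above with white noise (the order, coefficients, and length are illustrative):

    import numpy as np

    def generate_ar(a, n_samples, noise_std=1.0, seed=0):
        """x(n) = -a1*x(n-1) - ... - aM*x(n-M) + e(n), with e(n) white noise."""
        rng = np.random.default_rng(seed)
        e = rng.normal(0.0, noise_std, n_samples)
        x = np.zeros(n_samples)
        for n in range(n_samples):
            x[n] = e[n] - sum(a[i] * x[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        return x

    x = generate_ar(a=[-1.5, 0.7], n_samples=1000)   # a stable AR(2) example
    print(x[:5])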

53
Moving Average (MA) Model

· The moving average process x(n) of order k satisfies the
difference equation
      x(n) = v(n) + b1 v(n-1) + b2 v(n-2) + … + bk v(n-k)
¹ where the constants b1, b2, …, bk are known as the
MA parameters and v(n) represents a white noise
process.
· Thus, an MA process is formed by a linear combination of
(k+1) white noise samples.

25

Auto-Regressive - Moving Average


(ARMA) Model
· The autoregressive moving average process x(n) of orders
(M, k) satisfies the difference equation
      x(n) + a1 x(n-1) + a2 x(n-2) + … + aM x(n-M) = v(n) + b1 v(n-1) + b2 v(n-2) + … + bk v(n-k)
¹ where the constants a1, …, aM, b1, …, bk are the ARMA
parameters, with v(n) a white noise process.
· The ARMA model is the most flexible of the three linear
models; however, its design and analysis are more
difficult than for the AR or the MA model.
· We adopt the AR model, as it is the most appropriate for speech
modeling.
26

54
Linear Prediction Problem
· The parameters of an AR model are estimated from the
signal itself.
· The white noise signal x(n) is filtered by the AR process

synthesizer with the AR parameters denoted by ai to


obtain s(n), i.e. the AR signal.
· The predicted sample is a linear combination of past samples,
      ŝ(n) = -a1 s(n-1) - a2 s(n-2) - … - aM s(n-M)
· The prediction error is equal to the difference between the
actual sample and the predicted one:
      e(n) = s(n) - ŝ(n)

27

Error Minimization
· The problem is now how to estimate the coefficients.
· This estimation is done by selecting the coefficients (ai) that
minimize the mean-squared prediction error J:
      J = E[e²(n)] = E[(s(n) + a1 s(n-1) + … + aM s(n-M))²]
· The optimal LP coefficients can be found by setting the
partial derivatives of J with respect to the ak to zero; that is,
∂J/∂ak = 0 for k = 1, 2, 3, …, M.
· J is characterized by a unique minimum. If the prediction
order M is known, the condition can be rearranged in the following
form (the normal equations):
      sum over i = 1..M of ai Rs(|k - i|) = -Rs(k),   k = 1, 2, …, M
28

55
Error Minimization

· This defines the optimal linear prediction coefficients in terms
of the autocorrelation Rs(j) of the signal s(n).
· In matrix form, the equation becomes Rs a = -r, where Rs is the
correlation matrix and r is the vector of autocorrelations Rs(1), …, Rs(M).
· This allows finding the optimal linear prediction
coefficients if the autocorrelation values Rs(j) of s(n) are known
for j = 1 to M: a = -Rs^{-1} r.
· Different algorithms to solve this equation will be presented in this chapter.


29

Prediction Gain

· The prediction gain of a predictor is the ratio
between the variance of the input signal and the variance
of the prediction error, expressed in decibels (dB):
      PG = 10 log10( σs² / σe² )
· Prediction gain is a measure of the predictor's
performance.
· A lower prediction error leads to a higher gain and thus a better
predictor.
30

56
Minimum Mean-Squared Prediction Error
· Now suppose that the prediction error is the same
as the white noise used to generate the AR signal s(n).
· The mean-squared error is then minimized, with
      E[e²(n)] = E[x²(n)] = σx²
· The prediction gain is maximized.
· Taking into account the AR parameters used to generate the signal
s(n), the minimized mean-squared error is
      Jmin = Rs(0) + a1 Rs(1) + … + aM Rs(M)
31

Estimation of LP Coefficients
· In general, inverting a matrix is computationally
demanding and time consuming.
· Efficient algorithms are available to solve the equation,
which take advantage of the special structure of the
correlation matrix.
· Levinson–Durbin algorithm
· Leroux–Gueguen algorithm
· both suitable for practical implementation of LP analysis
· Levinson–Durbin algorithm is implemented under a
floating-point environment
· Leroux–Gueguen algorithm is better suited for fixed-
point implementation.
32

57
Levinson-Durbin
· The correlation matrix is Toeplitz: it is invariant under
interchange of its columns and of its rows.
· The correlation matrix of a given size contains as subblocks all
the lower order correlation matrices.
· Based on such properties, the Levinson–Durbin algorithm is an
iterative–recursive process where the solution of the zero-order
predictor is first found, which is then used to find the solution
of the first-order predictor; this process is repeated one step at
a time until the desired order is reached.
· The minimum mean-squared prediction error achievable with
a zero-order predictor is given by the autocorrelation of the
signal at lag zero, or the variance of the signal itself E(0) = J0 =
R(0) . This is considered as initialization and will be used by
the predictor of order one.
33

Levinson-Durbin
· The Levinson–Durbin algorithm:
· Inputs to the algorithm are the autocorrelation coefficients R(l); the outputs
are the LP coefficients and the reflection coefficients (RC) kl:
¹ Initialization: l = 0, J0 = R(0)
¹ Recursion: for l = 1, 2, 3, …, M
º Step 1. Compute the lth RC:
      kl = -[ R(l) + a1^(l-1) R(l-1) + … + a(l-1)^(l-1) R(1) ] / Jl-1
º Step 2. Calculate the LP coefficients for the lth-order predictor:
      al^(l) = kl ;   ai^(l) = ai^(l-1) + kl a(l-i)^(l-1),  i = 1, …, l-1
º Step 3. Compute the minimum mean-squared prediction error associated with the lth-
order solution: Jl = Jl-1 (1 - kl²); set l = l + 1 and return to Step 1.
¹ The final LP coefficients are ai^(M), i = 1, 2, …, M. (A sketch in code follows.)
34
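A direct Python transcription of these steps, followed by an illustrative check on a synthetic AR(2) signal (the test coefficients and signal length are assumptions for the demo, not part of the algorithm):

    import numpy as np

    def levinson_durbin(R, M):
        """Solve for LP coefficients a_1..a_M from autocorrelations R[0..M].
        Returns (a, reflection_coefficients, J_M)."""
        a = np.zeros(M + 1)          # a[i] holds a_i; a[0] is unused
        k = np.zeros(M + 1)
        J = R[0]                     # J0 = R(0)
        for l in range(1, M + 1):
            acc = R[l] + sum(a[i] * R[l - i] for i in range(1, l))
            k[l] = -acc / J                              # Step 1: l-th reflection coefficient
            a_new = a.copy()
            a_new[l] = k[l]                              # Step 2: update LP coefficients
            for i in range(1, l):
                a_new[i] = a[i] + k[l] * a[l - i]
            a = a_new
            J = J * (1.0 - k[l] ** 2)                    # Step 3: prediction error update
        return a[1:], k[1:], J

    # Illustrative check: recover the coefficients of a known AR(2) process.
    true_a = [-1.5, 0.7]
    rng = np.random.default_rng(0)
    x = np.zeros(20000)
    e = rng.normal(size=20000)
    for n in range(20000):
        x[n] = e[n] - sum(true_a[i] * x[n - 1 - i] for i in range(2) if n - 1 - i >= 0)
    R = [float(np.dot(x[: len(x) - l], x[l:])) / len(x) for l in range(3)]
    a_hat, k_hat, Jm = levinson_durbin(R, M=2)
    print(a_hat)   # close to [-1.5, 0.7]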

58
Levinson-Durbin
· The Levinson–Durbin algorithm can present some
difficulties for fixed-point implementation. The difficulty lies
in the values of the LP coefficients, since they possess a
large dynamic range and a bound on their magnitudes
cannot be found on a theoretical basis.
· For the implementation of the Levinson–Durbin algorithm
in a fixed-point environment, careful planning is
necessary to ensure that all variables stay within the
allowed range.
· If we consider the set of RCs ki, i = 1, . . . , M, the
corresponding LP coefficients ai can be found directly
from the Step 2 equations of the Levinson–Durbin
algorithm.
35

Leroux-Gueguen
· Leroux and Gueguen proposed a method to compute the RCs from
the autocorrelation values without dealing directly with the LP
coefficients.
· Hence, problems related to dynamic range in a fixed-point
environment are eliminated.
· The algorithm can be summarized as follows:

36

59
LPC

· Linear Prediction Coding (LPC) belongs to the class of


parametric coders where the synthetic speech does not
assume the shape of the original signal in the time
domain.
· The speech signal is characterized in terms of a set of

model parameters.
· LPC algorithm is one of the earliest standardized coders

that works at low bit-rate (2.4 kbps, FS1015 standard).


This coder was created around 1984 to provide secure
communication in military applications.

37

LPC
· Linear prediction coding relies on a highly simplified model for
speech production. The speech signal is characterized in terms of a
set of model parameters.
· The driving input of the filter or excitation signal is modeled as either
an impulse train (voiced speech) or random noise (unvoiced speech).
· Depending on the voiced or unvoiced state of the signal, the switch is
set to the proper location so that the appropriate input is selected.
· Energy level of the output is controlled by the gain parameter. For a
short enough length of the frame, properties of the signal essentially
remain constant.

38

60
LPC

· In each frame, parameters of the model are estimated


from the speech samples;
¹ Voicing: whether the frame is voiced or unvoiced, 1 bit.
¹ Gain: mainly related to the energy level of the frame.
¹ Filter coefficients: specify the response of the synthesis
filter.
¹ Pitch period: in the case of voiced frame.

39

LPC

· The parameter estimation process is repeated for each


frame, with the results representing information on the
frame. Thus, instead of transmitting the PCM samples,
parameters of the model are sent.
· By carefully allocating bits for each parameter so as to

minimize distortion, an impressive compression ratio can


be achieved.
· The bit-rate of 2.4kbps for the FS1015 coder is

approximately 53 times lower than the corresponding bit-


rate for 16-bit PCM.
40

61
FS1015 LPC Coder

· The following table summarizes the bit allocation scheme


for the FS1015 LPC coder.

· The scheme utilizes a total of 54 bits per frame.


· Frame length of 22.5 ms.

· Bit-rate 2400 bits/s (bps).


41

LPC Coder

· The input speech is first segmented into non-overlapping frames.
· A pre-emphasis filter is used to adjust the spectrum of the input signal.
· The voicing detector classifies the current frame as voiced or unvoiced and
outputs one bit indicating the voicing state.
42

62
LPC Coder
· The pre-emphasized signal is used for LP analysis, where ten LPCs are derived.
· These coefficients are quantized, with the indices transmitted as information
of the frame.
· The quantized LPCs are used to build the prediction-error filter, which filters
the pre-emphasized speech to obtain the prediction-error signal at its output.
43

LPC Coder
· The pitch period is estimated from the prediction-error signal if and only if
the frame is voiced.
· The power of the prediction-error sequence is calculated next; its computation
differs for voiced and unvoiced frames.
· The voicing bit, pitch period index, power index, and LPC index are packed
together to form the bit-stream of the LPC coder.
44

63
LPC Decoder
· It is assumed that the output of the impulse train generator is composed of a
series of unit-amplitude impulses, while the white noise generator has
unit-variance output.
45

LIMITATIONS OF LPC

· The LPC coder has relatively low computational cost and


makes the low bit-rate speech coder a practical reality.
· This model is also highly inaccurate in various

circumstances. The limitations of the LPC model are


given below and are targets for improvement by the next
generation of speech coders.
· In many instances, a speech frame cannot be classified as strictly
voiced or strictly unvoiced. Indeed, there are transition frames
(voiced to unvoiced and unvoiced to voiced) that the LPC model
fails to correctly sort. This inaccuracy of the model generates
annoying artifacts such as buzzes and tonal noises.
46

64
LIMITATIONS OF LPC
· The use of strictly random noise or a strictly periodic impulse
train as excitation does not match practical observations using
real speech signals. The excitation signal can be observed in
practice as prediction error and is obtained by filtering the
speech signal using the prediction-error filter. In general, the
excitation for unvoiced frames can be reasonably approximated
with white noise. For voiced frames, however, the excitation
signal is a combination of a quasiperiodic component with
noise. Thus, the use of an impulse train is a coarse
approximation that degrades the naturalness of synthetic
speech. For the FS1015 coder, the excitation pulses are obtained
by exciting an all pass filter using an impulse train. The above
discussion is applicable for a typical prediction order of ten,
such as the case of the FS1015 coder.
47

LIMITATIONS OF LPC

· No phase information of the original signal is preserved:


neither voiced nor unvoiced frames have explicit
parameters containing indications about the phase. The
synthetic speech sounds like the original because the
magnitude spectrum, or power spectral density, is similar
to the original signal. Even though a human listener is
relatively insensitive to the phase, retaining some phase
information adds naturalness to the synthetic speech,
leading to an improvement in quality.

48

65
LIMITATIONS OF LPC

· The approach used to synthesize voiced frames, where an


impulse train is used as excitation to a synthesis filter
with coefficients obtained by LP analysis, is a violation of
the foundation of AR modeling. The violation introduces
spectral distortion into the synthetic speech, which
becomes more and more severe as the pitch period
decreases. This is the reason why the LPC coder does not
work well for low-pitch-period or high-pitch frequency
talkers, like women and children. For the typical male,
however, spectral distortion is moderate, and reasonable
quality can be obtained with the LPC coder.
49

LIMITATIONS OF LPC

· Due to the poor quality of the LPC coder, it is no longer


active for communication purposes. However, there are
still applications where the LPC principle is appropriate,
such as low-quality reproduction and speech synthesis.
As we will see in next chapter, the principles of the LPC
coder can be refined so as to create coders with higher
performance such as code-excited linear prediction
(CELP) coder, mixed excitation linear prediction (MELP),
etc.

50

66
Thank you For Listening
Digital Signal Processing II
Chapter 2

67
DSPII- Voice Processing

LPC-based Speech Coding Techniques


Chapter 3

Dr. Ali Al-Hajj Hassan

Outline
· Introduction to LPC Coder
· Vector Quantization
· Analysis-by-Synthesis Principle
· Long-Term and Short-Term LP Model for Speech Synthesis
· Code-Excited Linear Prediction
· Perceptual Weighting
· Encoder Operation
· Decoder Operation
· Excitation Signal
· Excitation CodeBook Search
· FS1016 CELP
· Special CodeBooks for CELP
· ACELP
· Mixed Excitation Linear Prediction (MELP)
· Period Jitter
· Pulse Shaping
· Mixed Excitation
2

68
Introduction to LPC Coder

· Major contributions of the LPC coder can be summarized


as follows:
¹ Demonstration of low bit-rate speech coding based on
linear prediction.
¹ Use of a simple speech production model to achieve high
coding efficiency.
¹ Provision of a reference framework for next generation
speech coders.

Introduction to LPC Coder

· The principles of the LPC coder can be refined so as to


create coders with higher performance.
· The Code-Excited Linear Prediction (CELP) is essentially

an improved LPC coder, with the incorporation of a pitch


synthesis filter to avoid the strict voiced/unvoiced
classification, and an excitation codebook to partially
retain phase information.
· The mixed excitation linear prediction (MELP) coder
employs a more sophisticated excitation signal to
improve naturalness of the synthetic speech.

69
Vector Quantization

· Vector quantization (VQ) concerns the mapping in


multidimensional space from a (possibly continuous-
amplitude) source ensemble to a discrete ensemble.
· The mapping function proceeds according to some
distortion criterion or metric employed to measure the
performance of VQ.
· By definition, a vector quantizer Q of dimension M and size N is a mapping from an M-dimensional vector X onto one of N reproduction points. The set containing these N M-dimensional outputs, called codevectors or codewords, is known as the codebook of the quantizer.
5

Vector Quantization

· In VQ, vectors of a certain dimension form the input to


the vector quantizer.
· At both the encoder and decoder of the quantizer there is

a set of vectors (codebook) having the same dimension as


the input vector.
· The vectors in this codebook, known as codevectors, are

selected to be representative of the population of input


vectors.

70
Vector Quantization

· At the encoder, the input vector is compared to each


codevector in order to find the closest match.
· The elements of this codevector represent the quantized
vector.
· A binary index is transmitted to the decoder in order to

inform about the selected codevector. Because the


decoder has exactly the same codebook, it can retrieve the
codevector given its binary index.
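To make the encoder and decoder steps concrete, here is a minimal Python sketch of a nearest-codevector search under a squared-error criterion; the codebook values and dimensions are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def vq_encode(x, codebook):
    """Return the index of the codevector closest to x (squared-error criterion)."""
    distances = np.sum((codebook - x) ** 2, axis=1)   # codebook: (N, M) array
    return int(np.argmin(distances))

def vq_decode(index, codebook):
    """Decoder: retrieve the quantized vector from its transmitted index."""
    return codebook[index]

# Hypothetical codebook of N = 4 codevectors of dimension M = 2
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([0.9, 0.2])
idx = vq_encode(x, codebook)        # binary index transmitted to the decoder
x_hat = vq_decode(idx, codebook)    # reconstruction at the decoder
```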

Analysis-by-Synthesis Principle

· In a speech coder, the speech signal is represented by a


combination of parameters: gain, filter coefficients,
voicing strengths, etc.
· Two systems can be realized:

· Open loop.
· Closed loop.

71
Analysis-by-Synthesis Principle

· In an open-loop system, the parameters are extracted


from the input signal, which are quantized and later used
for synthesis. This is the principle studied in previous
chapter (LPC).
· In closed loop system, a more effective method is to use

the parameters to synthesize the signal during encoding


and fine-tune them so as to generate the most accurate
reconstruction.

Analysis-by-Synthesis Principle

· A closed-loop optimization procedure is done to choose


the best parameters so as to match as much as possible
the synthetic speech with the original speech.
· In the Figure the signal is synthesized during encoding
for analysis purposes, the principle is known as analysis-
by-synthesis.

10

72
Long-Term and Short-Term LP Model
for Speech Synthesis
· The parameters of the two predictors, in the long-term
and short-term linear prediction model for speech
production shown in the Figure, are estimated from the
original speech signal.
· The long-term predictor is responsible for generating
correlation between samples that are one pitch period
apart.

11

Long-Term and Short-Term LP Model


for Speech Synthesis

· The filter with system function:
H_p(z) = 1 / (1 + b·z^(−T))
· This filter describes the effect of the long-term predictor in

synthesis. It is known as the long-term synthesis filter or


pitch synthesis filter.

12

73
Long-Term and Short-Term LP Model
for Speech Synthesis
· The short-term predictor recreates the correlation present
between nearby samples, with a typical prediction order equal
to ten.
· The synthesis filter associated with the short-term predictor
has a system function:
H_f(z) = 1 / (1 + Σ_{i=1}^{M} a_i·z^(−i))

· This filter is known as the formant synthesis filter since it


generates the envelope of the spectrum in a way similar to the
vocal track tube, with resonant frequencies known simply as
formants.
· The gain g is usually found by comparing the power level of
the synthesized speech signal to the original level.
13

Code-Excited Linear Prediction (CELP)


· The CELP coder is based on the analysis-by-synthesis
principle, where the excitation sequences contained in a
codebook are selected according to a closed-loop
method.
· Other parameters, such as the filter coefficients, are
determined in an open-loop fashion.

14

74
Code-Excited Linear Prediction (CELP)

· A commonly used error criterion, such as the sum of


squared error, can be applied to select the final excitation
sequence; hence, waveform matching in the time domain
is performed, leading to a partial preservation of phase
information.

15

Code-Excited Linear Prediction (CELP)


· CELP Speech Production:
· The excitation signal is selected by a closed-loop search

procedure and applied to the synthesis filters.


· The synthesized waveform is compared to the original

speech segment, the distortion is measured, and the


process is repeated for all excitation codevectors stored
in a codebook.

16

75
Code-Excited Linear Prediction (CELP)

· The index of the best excitation sequence is transmitted


to the decoder, which retrieves the excitation codevectors
from a codebook identical to that at the encoder.
· The extracted excitation is scaled to the appropriate level

and filtered by the cascade connection of pitch synthesis


filter and formant synthesis filter to yield the synthetic
speech.

17

Code-Excited Linear Prediction (CELP)

· The pitch synthesis filter creates periodicity in the signal


associated with the fundamental pitch frequency, and the
formant synthesis filter generates the spectral envelope.
· The CELP coder relies on the long-term (pitch synthesis

filter) and short-term (formant synthesis filter) linear


prediction models.

18

76
Perceptual Weighting
· In the CELP speech production model, the formant synthesis filter has the system function:
H_f(z) = 1 / A(z) = 1 / (1 + Σ_{i=1}^{M} a_i·z^(−i))

· with A(z) denoting the system function of the formant analysis filter.
· If the excitation codebook contains a total of L excitation
codevectors, the encoder will pass through the loop L times
for each short segment of input speech; a mean-squared error
value is calculated after each pass.
· The excitation codevector providing the lowest error is
selected at the end.
· The dimension of the codevector depends on the length of the
speech segment under consideration.
19

Perceptual Weighting

· Note that in CELP, the masking phenomenon in the


human auditory system can be explored to yield a more
suitable error measure.
· In a typical speech spectrum, the amount of noise at the
peaks can be higher than the amount of noise at the
valleys.
· A simple way of controlling the noise spectrum is by
filtering the error signal through a weighting filter
before minimization.
· The filter amplifies the error signal spectrum in non-
formant regions of the speech spectrum, while
attenuating the error signal spectrum in formant regions.
20

77
Perceptual Weighting Filter

· CELP Encoder With Perceptual Weighting Filter:

21

Perceptual Weighting
· One efficient way to implement the weighting filter is by using
the system function:
W(z) = A(z) / A(z/γ) = (1 + Σ_{i=1}^{M} a_i·z^(−i)) / (1 + Σ_{i=1}^{M} a_i·γ^i·z^(−i))
· where γ is a constant in the interval [0, 1] that determines the degree to
which the error is de-emphasized in any frequency region.
· If γ ->1 , W(z) -> 1 , and hence no modification of the error
spectrum is performed.
· If γ -> 0 , W(z) -> A(z), which is the formant analysis filter.
· The most suitable value of γ is selected subjectively by
listening tests, and for 8-kHz sampling, γ is usually between
0.8 and 0.9.
22
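A minimal sketch, assuming the LPC coefficients a_i of A(z) are already available, of how the weighting filter W(z) = A(z)/A(z/γ) could be applied to an error signal; the example coefficients and γ = 0.85 are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def perceptual_weighting(error, a, gamma=0.85):
    """Apply W(z) = A(z) / A(z/gamma), with A(z) = 1 + sum_i a_i z^-i."""
    num = np.concatenate(([1.0], a))                                       # A(z)
    den = np.concatenate(([1.0], a * gamma ** np.arange(1, len(a) + 1)))   # A(z/gamma)
    return lfilter(num, den, error)

a = np.array([-1.2, 0.5])              # illustrative 2nd-order LPC coefficients
error = np.random.randn(60)            # one subframe of error samples (placeholder)
weighted_error = perceptual_weighting(error, a)
```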

78
Perceptual Weighting Filter

· Since the formant synthesis filter and the weighting filter


are in cascade, they can be merged to form the modified

formant synthesis filter H_f(z) = 1 / A(z/γ), as shown in the
Figure.

23

Encoder Operation

· Input speech signal is segmented into frames and


subframes.
· The scheme of four subframes in one frame is a popular
choice.
· Length of the frame is usually around 20 to 30 ms, while

for the subframe it is in the range of 5 to 7.5 ms.


· Short-term LP analysis is performed on each frame to
yield the LPC.
· Long-term LP analysis is applied to each subframe.

24

79
Encoder Operation

25

Encoder Operation

· Input to short-term LP analysis is normally the original


speech, or preemphasized speech.
· Input to long-term LP analysis is often the short-term

prediction error.
· Coefficients of the perceptual weighting filter, pitch

synthesis filter, and modified formant synthesis filter are


known after this step.
· The excitation sequence can now be determined. The

length of each excitation codevector is equal to that of the


subframe; thus, an excitation codebook search is
performed once every subframe.
26

80
Encoder Operation

· The search procedure begins with the generation of an


ensemble of filtered excitation sequences with the
corresponding gains.
· Mean-squared error (or sum of squared error) is

computed for each sequence, and the codevector and


gain associated with the lowest error are selected.

27

Encoder Operation

· The index of
excitation
codebook, gain,
long-term LP
parameters, and
LPC are
encoded, packed,
and transmitted
as the CELP bit-
stream.
28

81
Decoder Operation
· It basically unpacks and decodes various parameters
from the bit-stream, which are directed to the
corresponding block so as to synthesize the speech.
· A post-filter is added at the end to enhance the quality of
the resultant signal by attenuating the noise components
in the spectral valleys, thus enhancing the overall
spectrum.

29

Excitation Signal
· For efficient temporal analysis, a speech frame is usually
divided into a number of subframes, typically four.
· For each subframe, the excitation signal is generated and the
error is minimized to find the optimum excitation.
· The excitation varies between the pulse train and the random
noise.

30

82
Excitation Signal

· A general form of the excitation signal e(n) can be


expressed as
e(n) = ev(n) + eu(n), 0 ≤ n ≤ Nsub − 1 ,
¹ where eu(n) is the excitation from a fixed or secondary
codebook given by eu(n) = Guck(n), 0 ≤ n ≤ Nsub − 1 ,
¹ where Gu is the gain, Nsub is the length of the excitation
vector (or the subframe), and ck(n) is the nth-element of kth-
vector in the codebook.
· ev(n) is the excitation from the long term prediction filter
and given through the minimization procedure.
31

Excitation Signal

· The weighted error ew(n) can be described as


ew(n) = xw(n) − x̂w(n).
· The squared error of ew(n) is given by
Ew = Σ_{n=0}^{Nsub−1} ew²(n)

32

83
Excitation CodeBook Search

· Excitation codebook search is the most computationally


intensive part of CELP coding.
· Several ideas have been proposed and tested surrounding

the topic, with the sole purpose of accelerating the search


process without compromising significantly the output
quality.

33

Excitation CodeBook Search


· The search procedure is repeated for every input subframe and
consists of :
¹ Filter the input speech subframe with the perceptual weighting
filter.
¹ For each codevector in the excitation codebook:
º Calculate the optimal gain and scale the codevector using the value
found.
º Filter the scaled excitation codevector with the pitch synthesis filter.
º Filter the pitch synthesis filter’s output with the modified formant
synthesis filter.
º Subtract the perceptually filtered input speech from the modified
formant synthesis filter’s output; the result represents an error sequence.
º Calculate the energy of the error sequence.
· The index of the excitation codevector associated with the lowest
error energy is retained as information on the input subframe.
34
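The following Python sketch mirrors the per-subframe search loop described above; for simplicity the optimal gain is computed after filtering (equivalent here because the filters are linear), and the codebook, pitch and formant filter coefficients are assumed to be given.

```python
import numpy as np
from scipy.signal import lfilter

def search_codebook(target_w, codebook, pitch_den, formant_den):
    """Return (index, gain) of the excitation codevector whose synthesized,
    perceptually weighted output best matches the weighted target signal."""
    best_index, best_gain, best_energy = -1, 0.0, np.inf
    for k, cv in enumerate(codebook):
        y = lfilter([1.0], pitch_den, cv)        # pitch synthesis filter
        y = lfilter([1.0], formant_den, y)       # modified formant synthesis filter
        gain = np.dot(target_w, y) / (np.dot(y, y) + 1e-12)   # optimal gain
        err = target_w - gain * y
        energy = np.dot(err, err)                # error energy for this codevector
        if energy < best_energy:
            best_index, best_gain, best_energy = k, gain, energy
    return best_index, best_gain
```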

84
FS1016 CELP

· In 1984, the U.S. Department of Defense initiated a


program to develop a new secure voice communication
system to supplement the existing FS1015 LPC coder.
· Between 1988 and 1989, the 4.8-kbps CELP coder was

selected and became an official federal standard in 1991


known as FS1016.
· Besides the basic principles of CELP coding, the FS1016

contains features and modifications to improve both the


speech quality as well as the computational efficiency.

35

FS1016 CELP

· The concept of the adaptive codebook (ACB) was


developed as a modification to reduce complexity, but
still utilize the same principle to minimize the weighted
difference through a closed-loop analysis-by-synthesis
approach.
· The excitation codebook, as it is known above, is renamed

the stochastic codebook (SCB); at this moment we


assume that it contains samples with fixed values
originating from a white noise source.
· For the FS1016 coder, the pitch synthesis filter is bypassed.

36

85
FS1016 CELP

· The bit allocation scheme of the FS1016 coder.

38

Special CodeBooks for CELP


· Binary codebooks: contain codevectors with only binary components, i.e. zeros and ones.
· Ternary codebooks: the components of the codevectors

are chosen from the set {-1,0,1}. This offers more


flexibility than the binary codebook
· Overlapping codebooks: have codevectors that are a

shifted version of the preceding vector plus some extra


components.
¹ An example of a 6-dimensional codebook with overlap 1
would be w1 = (d1, d2, d3, d4, d5, d6), w2 = (d2, d3, d4, d5, d6, d7),
w3 = (d3, d4, d5, d6, d7, d8), and so on.

39

86
ACELP

· Another attempt to reduce the computational cost of


standard CELP coders is Algebraic CELP or ACELP.
· The term ‘‘algebraic’’ essentially means the use of simple

algebra or mathematical rules to create the excitation


codevectors, with the rules being addition and shifting.
· Based on the approach, there is no need to physically

store the entire codebook, resulting in significant


memory saving.

40

ACELP

· Several standardized coders are available:


¹ ITU-T G.723.1 Multipulse Maximum Likelihood
Quantization (MP-MLQ)/ ACELP (1995).
¹ ITU-T G.729 Conjugate Structure (CS)–ACELP (1995).
¹ ETSI Adaptive Multirate (AMR) ACELP (1999).
¹ TIA IS641 ACELP (1996).
¹ ETSI GSM Enhanced Full Rate (EFR) ACELP (1996).

41

87
ACELP

· Such vocoders are used for different applications that depend


on the bit rate, robustness to channel errors, algorithm delay,
complexity, and sampling rate.
· G.729 and G.723.1 are widely used in real-time
communications over the Internet due to their low-bit rates
and high qualities.
· The AMR is mainly used for the GSM and the WCDMA
wireless systems for its flexible rate adaptation to error
conditions of wireless channels.
· The codebook vector consists of a set of interleaved
permutation codes containing few nonzero elements. The
ACELP fixed-codebook structures have been used in G.729 and
G.723.1 low-bit rate at 5.3 kbps, and WCDMA AMR.
42

ACELP

· mk is the pulse position, k is the pulse number, the interleaving depth


is 5.
· In this codebook, each codevector contains four nonzero pulses
indexed by ik .
· Each pulse can have either the amplitudes of +1 or −1, and can
assume the positions given in the Figure.
· The codevector, ck , is determined by placing four unit pulses at the
locations mk multiplied with their signs (±1) as follows:
c_k(n) = s0·δ(n − m0) + s1·δ(n − m1) + s2·δ(n − m2) + s3·δ(n − m3)
· where n = 0, 1, …, 39 and δ(n) is a unit pulse.
43
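A small Python sketch of this construction; the pulse positions and signs below are hypothetical values, chosen only to illustrate the placement rule for a 40-sample subframe.

```python
import numpy as np

def build_acelp_codevector(positions, signs, n_samples=40):
    """Place four signed unit pulses: c_k(n) = sum_k s_k * delta(n - m_k)."""
    c = np.zeros(n_samples)
    for m, s in zip(positions, signs):
        c[m] += s
    return c

# Hypothetical pulse positions m_k and signs s_k
ck = build_acelp_codevector(positions=[2, 11, 23, 34], signs=[+1, -1, +1, +1])
```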
88
Mixed Excitation Linear Prediction (MELP)

· MELP coder is designed to overcome some of the


limitations of LPC.
· It utilizes a more sophisticated speech production
model, with additional parameters to capture the
underlying signal dynamics with improved accuracy.
· The essential idea is the generation of a mixed excitation
signal as input to the synthesis filter, where the
‘‘mixing’’ refers to the combination of a filtered periodic
pulse sequence with a filtered noise sequence.
· The benefits require an extra computational cost,
practicable only with the powerful digital signal
processors.
44

Mixed Excitation Linear Prediction (MELP)

· The main improvements of the MELP model with respect


to the LPC model are :
¹ PERIOD JITTER
¹ PULSE SHAPING

¹ MIXED EXCITATION

45

89
Mixed Excitation Linear Prediction (MELP)

· MELP Model of Speech Production:

46

Period Jitter

· A randomly generated period jitter is used to perturb the


value of the pitch period so as to generate an aperiodic
impulse train.
· One of the fundamental limitations in LPC is the strict
classification of a speech frame into two classes: unvoiced
and voiced.
· The MELP coder extends the number of classes into three:

unvoiced, voiced, and jittery voiced.


· The latter state corresponds to the case when the

excitation is aperiodic but not completely random, which


is often encountered in voicing transitions.
47

90
Period Jitter

· This jittery voiced state is controlled in the MELP model


by the pitch jitter parameter and is essentially a random
number.
· Experimentally, it was found that a period jitter uniformly

distributed up to ±25% of the pitch period produced


good results.
· The short isolated tones, often encountered in LPC coded
speech due to misclassification of voicing state, are
reduced to a minimum.

48

Pulse Shaping

· Shape of the excitation pulse for periodic excitation is


extracted from the input speech signal and transmitted as
information on the frame.
· In the simplest form of LPC coding, voiced excitation

consists of a train of impulses; which differs a great deal


from real world cases, where each excitation pulse
possesses a certain shape, different from an ideal impulse.
· The shape of the pulse contains important information

and is captured by the MELP coder through Fourier


magnitudes of the prediction error.
49

91
Pulse Shaping
· These quantities are used to generate the impulse
response of the pulse generation filter, responsible for the
synthesis of periodic excitation.
· Note that Fourier magnitudes are calculated only when the frame
is voiced or jittery voiced.

Illustration
of signals
associated
with the
pulse
generation
filter. 50

Mixed Excitation

· Periodic excitation and noise excitation are first filtered


using the pulse shaping filter and noise shaping filter,
respectively; with the filters’ outputs added together to
form the total excitation, known as the mixed excitation.
· This is the core idea of MELP and is based on practical

observations where the prediction-error sequence is a


combination of a pulse train with noise.
· Thus, the MELP model is much more realistic than the

LPC model, where the excitation is either impulse train or


noise.
51

92
Mixed Excitation
· In the Figure, the frequency responses of the shaping
filters are controlled by a set of parameters called voicing
strengths, which measure the amount of ‘‘voicedness’’.
· The responses of these filters are variable with time,

with their parameters estimated from the input speech


signal, and transmitted as information on the frame.

52

Mixed Excitation
· A tenth-order LP analysis is performed on the input
speech signal using a 200- sample (25-ms) Hamming
window centered on the last sample in the current frame.
· The autocorrelation method is utilized together with the
Levinson–Durbin algorithm.
· The coefficients are quantized and used to calculate the
prediction-error signal.

The Bit
Allocation
for the
MELP Coder

53

93
Thank you For Listening
Digital Signal Processing II
Chapter 3

94
95
DSPII- Voice Processing

Speech Recognition
Chapter 4

Dr. Ali Al-Hajj Hassan

Outline

· Introduction to Speech Recognition


· Problem Formulation
· Speech Units
· Entropy of Speech
· Feature Analysis
· Discrete Cosine Transform Function
· Cepstral Coefficients
· Statistical Acoustic Models
· Hidden Markov Models (HMMS)
· Dynamic Time Warping (DTW)
· Voice-Activated Name Dialing
· Resolutions of Speech Features and Models
· Spectral-Temporal Feature Resolution
· Statistical Model Resolution
· Model Context Resolution

96
Introduction to Speech Recognition
· Speech recognition systems have a wide range of applications
from the relatively simple isolated-word recognition systems
for name-dialing, automated customer service and voice-
control of cars and machines to continuous speech
recognition.
· Like any pattern recognition problem, the fundamental
problem in speech recognition is the speech pattern
variability.
· Speech recognition methods aim to model, and where
possible reduce, the effects of the sources of speech
variability.
· The most challenging sources of variations in speech are
speaker characteristics including accent and background
noise.
3

Introduction to Speech Recognition


· In general the sources of speech variability are as follows:
¹ Duration variability: No two spoken realizations of a word,
even by the same person, have the same duration.
¹ Spectral variability: No two spoken realizations of a word, even
by the same person, have the same spectral-temporal trajectory.
¹ Speaker variability: Speech is affected by the anatomical
characteristics, gender, health, and emotional state of the speaker.
¹ Accent: Speaker accent can have a major effect on speech
characteristics and on speech recognition performance.
¹ Contextual variability: The characteristic of a speech unit is
affected by the acoustic and phonetic context of the units
preceding or succeeding it.
¹ Noise: Speech recognition is affected by noise, echo, channel
distortion and adverse environment.
4

97
Introduction to Speech Recognition

· In terms of the continuity of the input speech that a


speech recognition system can handle, there are broadly
two types of speech recognition systems:
¹ Isolated-word recognition systems, with short pauses
between spoken words, are primarily used in small
vocabulary command control applications.
¹ Continuous speech recognition is the recognition and
transcription of naturally spoken speech. Continuous
speech recognition systems are usually based on word
model networks constructed from compilation of phoneme
models using a phonetic transcription dictionary.
5

Problem Formulation

· The speech recognition problem can be stated as follows:


given a stream of speech features X extracted from the
acoustic realization of a spoken sentence, and a network
of pre-trained speech models Γ, decode the most likely
spoken word sequence W=[w1,w2, …, w N].
· Note that speech model network Γ contains models of

acoustic features representation of words and can also


include a model of language grammar.

98
Problem Formulation

· Formally, the problem of recognition of words from a


sequence of acoustic speech features can be expressed in
terms of maximization of a probability function as:
Ŵ = arg max_W f(W | X, Γ)

¹ where f(·) is the conditional probability of a sequence of


words W given a sequence of speech features X.
· In practice speech is processed sequentially and the
occurrence of a word, a phoneme or a speech frame is
conditioned on the previous words, phonemes and
speech frames.
7

Problem Formulation

· Block diagram of the overall speech recognition system.


· The input speech signal, s[n], is converted to the sequence
of feature vectors, X = {x1,x2, . . . ,xT }, by the feature
analysis block (also denoted spectral analysis).
8

99
Problem Formulation

· The feature vectors are computed on a frame-by-frame


basis using predefined techniques to represent the short-
time spectral characteristics.
· The pattern classification block (also denoted as the
decoding and search block) decodes the sequence of
feature vectors into a symbolic representation that is the
maximum likelihood string that could have produced the
input sequence of feature vectors.
· The pattern recognition system uses a set of acoustic
models (represented as hidden Markov models) and a
word lexicon to provide the acoustic match score for each
proposed string.
9

Problem Formulation

· A language model (N-gram) is used to compute a


language model score for each proposed word string.
· The final block in the process is a confidence scoring

process (also denoted as an utterance verification block),


which is used to provide a confidence score for each
individual word in the recognized string.

10

100
Speech Unit
· Every communication system has a set of elementary symbols
(or alphabet) from which larger units such as words and
sentences are constructed.
· For example, in digital communication the basic alphabet is
“1” and “0”, and in written English the basic units are A to Z.
· The elementary linguistic unit of spoken speech is called a
phoneme and its acoustic realization is called a phone.
· There are between 60 to 80 phonemes in spoken English; the
exact number of phonemes depends on the dialect.
· In automatic speech processing the number of phonemes can
be reduced to between 40 to 60 phonemes depending on the
dialect.
11

Speech Unit

· Phonetic units are not produced in isolation and that


their articulation and temporal-spectral “shape” is
affected by the context of the preceding and succeeding
phones as well as the linguistic, expressional and tonal
context in which they are produced.
· For speech recognition context dependent triphone units
are used. A triphone is a phone in the context of a
preceding and a following phones.
· Assuming that there are about 40 phonemes, theoretically
there will be about 40×40×40 = 64,000 triphones.
· Due to linguistic constraints some triphones do not occur
in practice.
12

101
Entropy of Speech
· Entropy of an information source gives the theoretical
lower bound for the number of binary bits required to
encode the source.
· The entropy of a set of communication symbols is

obtained as the probabilistic average of log2 of the


probabilities of the symbols.
· The entropy of a random variable X with M states or
symbols X=[x1, ..., xM] and the state or symbol
probabilities [p1, ..., pM], where PX(xi) = pi, is given by
H(X) = − Σ_{i=1}^{M} PX(xi)·log2 PX(xi)
13
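A direct Python transcription of this definition; the example symbol probabilities are an assumption for illustration.

```python
import numpy as np

def entropy_bits(probabilities):
    """H(X) = -sum_i p_i * log2(p_i), in bits per symbol."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]                      # symbols with zero probability contribute nothing
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.25, 0.25]))   # 1.5 bits
```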

Entropy of Speech

· We can consider the entropy of speech at several levels;


· such as the entropy of words contained in a sequence of speech or
the entropy of intonation or the entropy of the speech signal
features.
· The calculation of the entropy of speech is complicated as
speech signals simultaneously carry various forms of
information such as phonemes, topic, intonation signals,
accent, speaker voice, etc.

14

102
Feature Analysis
· The feature analysis subsystem converts time-domain raw
speech samples into a compact and efficient sequence of
spectral-temporal feature vectors that retain the phonemic
information but discard some of the variations due to speaker
variability and noise.
· The most widely used features for speech recognition are
cepstral feature vectors which are obtained from a Discrete
Cosine Transform (DCT) function of the logarithm of
magnitude spectrum of speech.

15

Feature Analysis

· The temporal dynamics of speech parameters, i.e. the


direction and the rate of change of time-variation of
speech features, play an important role in improving the
accuracy of speech recognition.
· Temporal dynamics are often modeled by the first and the

second order differences of cepstral coefficients.

16

103
Discrete Cosine Transform Function

· There are several tools that can be used to transform a


given sequence into different representation.
· A Discrete Cosine Transform (DCT) expresses a finite

sequence of data points in terms of a sum of cosine


functions oscillating at different frequencies.
· The most commonly used form is given by
X(k) = Σ_{n=0}^{N−1} x(n)·cos[ (π/N)·(n + 1/2)·k ],  k = 0, …, N−1

17
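A direct (unoptimized) Python transcription of this form; library routines such as SciPy's DCT-II produce the same result up to constant scaling.

```python
import numpy as np

def dct_type2(x):
    """X(k) = sum_{n=0}^{N-1} x(n) * cos(pi/N * (n + 1/2) * k), k = 0..N-1."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi / N * (n + 0.5) * k)) for k in range(N)])
```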

DCT vs DFT
· DCT outperforms DFT in terms of speech energy
compaction.
· A clean speech is divided into frames with 50%
overlapping, and the transform is performed.
· The speech is reconstructed.
· The Mean Square Error (MSE) is computed.
· DCT has the added advantage of higher spectral
resolution than the DFT for the same window size.
¹ For a window size of N, DCT has N independent spectral
components while DFT produces only N/2+1 independent
spectral components, as the other components are just
complex conjugates.

104
DCT vs DFT

· The Global SNR of the speech reconstructed using the DCT and the DFT can also be compared against the Global SNR of the noisy speech.
· For these reasons, the DCT is typically used to extract the cepstral vector from the speech signal.

Cepstral Coefficients

· The most commonly used features are cepstral


coefficients that are derived from an Inverse Discrete
Fourier Transform (IDFT) of logarithm of short-term
power spectrum of a speech segment (with a typical
segment length of 20-25 ms) as:
c(n) = Σ_{k=0}^{N−1} ln|X(k)|·e^(j·2πnk/N)
· where X(k) is the FFT-spectrum of speech x(n).

20

105
Cepstral Coefficients
· As the spectrum of real valued speech is symmetric, the
DFT can be replaced by a Discrete Cosine Transform
(DCT) as:
c(n) = DCT( ln|X(k)| )
· The cepstral parameters encode the shape of the log spectrum.

· For example the coefficient c(0) is given by


c(0) = log |X (0)| +..+ log |X (N −1)|= log(|X (0)|×..× |X (N −1)|)

· c(0) is the average of log magnitude spectrum, or equivalently the


geometric mean of magnitude spectrum.
· c(1) describes the tilt or the slope of the log spectrum.
· Every cepstral coefficient gives an important information to
conserve the accuracy of the speech.
21
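A minimal sketch of this analysis for a single frame, assuming a Hamming window and an FFT magnitude spectrum; the frame length, number of retained coefficients, and the small flooring constant are illustrative choices.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_coefficients(frame, n_coeffs=13):
    """Cepstrum as the DCT of the log magnitude spectrum of one speech frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    log_spectrum = np.log(spectrum + 1e-10)     # floor to avoid log(0)
    return dct(log_spectrum, type=2, norm='ortho')[:n_coeffs]

frame = np.random.randn(200)                    # placeholder for a 25-ms frame at 8 kHz
c = cepstral_coefficients(frame)
```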

Cepstral Coefficients
· Some useful properties of cepstrum features are as follows:
¹ The lower index cepstral coefficients represent the spectral
envelop of speech, whereas the higher indexed coefficients
represent fine details (i.e. excitation) of the speech spectrum.
¹ Logarithmic compression of the dynamic range of spectrum,
benefiting lower power higher frequency speech components.
¹ Insensitivity to loudness variations of speech if the coefficient c(0)
is discarded.
¹ As the distribution of the power spectrum is approximately
lognormal, the logarithm of power spectrum, and hence the
cepstral coefficients, are approximately Gaussian.
¹ The cepstral coefficients are relatively de-correlated allowing
simplified modeling assumptions.
22

106
Cepstral Coefficients
· A widely used form of cepstrum is Mel Frequency
Cepstral Coefficients (MFCC), for scaling or
normalization of the frequencies.
· To obtain MFCC features, the spectral magnitude of FFT
frequency bins are averaged within frequency bands spaced
according to the mel scale which is based on a model of human
auditory perception. The scale is approximately linear up to
about 1000 Hz and approximates the sensitivity of the human ear
as:
fmel = 1125.log(0.0016 f +1)
· where fmel is the mel-scaled frequency of the original frequency f in
Hz.

23

Cepstral Coefficients

· Cepstrum coefficients can also be derived from linear


prediction model parameters. If p is the order of the LPC
analysis, the LP-cepstrum coefficients are given by:
c(n) = −a(n) + Σ_{k=1}^{n−1} (1 − k/n)·a(k)·c(n−k),  1 ≤ n ≤ p

24

107
Cepstral Coefficients
· Temporal difference features are a simple and effective means
for description of the trajectory of speech parameters in time.
· For speech recognition relatively simple cepstral difference features
are defined by:
∂c(m)= c(m+1) − c(m−1) and ∂∂c(m)=∂c(m+1) − ∂c(m−1)
¹ where ∂c(m) is the first order time-difference of cepstral features,
in speech processing it is also referred as the velocity features.
¹ The second order time difference of cepstral features ∂∂c(m) is also
referred to as the acceleration features.
· Dynamic or difference features are effective in improving
speech recognition.
· The use of difference features is also important for improving
quality in speech coding, synthesis and enhancement
applications and the perceptual quality in speech synthesis.
25
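A short sketch of these velocity and acceleration features computed over a matrix of cepstral vectors; repeating the first and last frames at the edges is an assumption, not specified in the slides.

```python
import numpy as np

def delta(c):
    """dc(m) = c(m+1) - c(m-1), applied frame-wise; c has shape (n_frames, n_coeffs)."""
    padded = np.pad(c, ((1, 1), (0, 0)), mode='edge')
    return padded[2:] - padded[:-2]

cepstra = np.random.randn(100, 13)     # illustrative sequence of cepstral vectors
velocity = delta(cepstra)              # first-order differences
acceleration = delta(velocity)         # second-order differences
```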

Cepstral Coefficients

108
Statistical Acoustic Models
· For speech recognition, an efficient set of statistical
acoustic models is needed to:
¹ capture the mean and variance of the spectral-temporal
trajectory of speech sounds,
¹ discriminate between different speech sounds.
· Such models for recognizing sequences include:
¹ Hidden Markov Models (HMMs): are commonly used for
medium to large vocabulary speech recognition systems.
¹ Dynamic Time Warping (DTW) Method: mostly used for
isolated-word applications, such as small-vocabulary isolated-word recognition for name-dialing.
27

Statistical Acoustic Models

· The statistical acoustic models should take into


consideration:
· Context-independent models or context-dependent models
where several models are used for each word (or phoneme) to
capture the variations in the acoustic production of the words
caused by the variations of different acoustic context within which
words occurs.
¹ For large vocabulary and continuous speech recognition,
sub-word models are used due to their efficiency for
training and adaptation of models, and for the ease of
expansion of the vocabulary size.
28

109
Markov Chain

Definitions
· A stochastic process {X(t)} is a function of time whose value depends on the outcome of a random experiment.
· At each time t Є T, X(t) is a random variable.
· E: state space = the set of values that the stochastic process X(t) can take at each instant t Є T
¹ Discret.
¹ Continuous.
· T set of times:
¹ Discret.
¹ Continuous.
· 4 different types of stochastic processes.
29

Markov Chain

· A stochastic process X with a discrete state space is a counting process if and only if any particular realization
of this process is an increasing function:
¹ with X(t0) =0.
¹ X(t1) ≤ X(t2) for t1<t2.
· A stochastic process X is said to have independent
increments if and only if the random variables X(tn) - X(tn-
1),..., X(t3) - X(t2), X(t2) - X(t1), X(t1) - X(t0) are independent
whatever t0<t1<t2 < ... <tn-2 <tn-1<tn.
· A stochastic process X is said to be stationary if and only
if the random variables X(tn) - X(tn-1) and X(t1) - X(t0) are
distributed according to the same law whatever n in Z.
30

110
Markov Chain

Poisson Process
· X a stochastic process with continuous time and discrete
space state is a Poisson process with parameter λ if and
only if:
¹ X is a counting process.
¹ X is a stationary process and has independent increments.
¹ P( X(s+t) − X(s) = k ) = ((λt)^k / k!)·e^(−λt),  k = 0, 1, 2, 3, ...
º Ex: P( X(s+t) − X(s) = 0 ) = ((λt)^0 / 0!)·e^(−λt) = e^(−λt)

31

Discrete Time Markov Chain (DTMC)

· A stochastic process {Xn}, n Є N, with a discrete state space and discrete
time is a Discrete Time Markov Chain if and only if:
¹ P[Xn=j | Xn-1=in-1, Xn-2=in-2, …, X0=i0] = P[Xn=j | Xn-1=in-1]
º where in denotes a state and n the time index.
· We are only interested in homogeneous DTMCs, i.e. the transition
probabilities from one state (previous state) to another do not depend on n.
· Pij = P[Xn=j | Xn-1=i] is the Transition Probability from state i to state j.
· Σj Pij = 1 (for all i); the Pij form the transition matrix P = [Pij], i,j Є E.
· Representation
· E={1, 2, 3, 4}
¹ P23 = P41 = 1
¹ P12+ P14=1
¹ P33+ P31=1
32

111
Example

· E={1, 2, 3, 4}
¹ P23 = P41 = 1
¹ P12+ P14=1
¹ P33+ P31=1
· P is a transition matrix.

33

Example
· Transition matrix (the corresponding three-state diagram is shown in the figure):
P = [ 1/2  1/2   0  ]
    [ 1/2   0   1/2 ]
    [ 1/3  2/3   0  ]
34

112
Markov Chain
DTMC Analysis
· The analysis of the transient regime of a DTMC consists of determining the vector S(n) of state probabilities Sj(n) = P[Xn = j], j Є E, i.e. the probability that the process {Xn}, n Є N, is in state j at the nth step.
· S(n) = [S1(n), S2(n), S3(n), …]; this vector depends on P and S(0):
S(n) = S(n-1)·P
S(n) = S(0)·[P]^n

35
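The transient analysis S(n) = S(0)·P^n can be checked numerically; the sketch below uses the example transition matrix from these slides.

```python
import numpy as np

P = np.array([[1/2, 1/2, 0],
              [1/2, 0, 1/2],
              [1/3, 2/3, 0]])

def state_distribution(S0, P, n):
    """S(n) = S(0) P^n."""
    return S0 @ np.linalg.matrix_power(P, n)

S0 = np.array([1.0, 0.0, 0.0])          # start in state 1
print(state_distribution(S0, P, 1))     # [0.5  0.5  0.  ]
print(state_distribution(S0, P, 2))     # [0.5  0.25 0.25]
```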

Example
· Transition matrix (same three-state chain as before):
P = [ 1/2  1/2   0  ]
    [ 1/2   0   1/2 ]
    [ 1/3  2/3   0  ]
· Suppose at t0 = 0: S1(0) = 1, S2(0) = 0, S3(0) = 0.
· That means the stochastic process is in State 1 at t = 0.
· After 1 transition, in which state will we be?
S1(1) = 1/2 = S1(0)·P11
S2(1) = 1/2 = S1(0)·P12
S3(1) = 0 = S1(0)·P13
S(2) = S(1)·P = (1/2  1/2  0)·P = (1/2  1/4  1/4)
36

113
Markov Chain
Classification of states
· A DTMC is irreducible if and only if from any state i we can reach any state j in a finite number of steps.
¹ For all i,j Є E, there exists m such that Pij(m) ≠ 0.
· A state j is periodic if we can only return to it after a number of steps that is a multiple of k > 1.
¹ There exists k > 1 such that Pjj(m) = 0 for every m that is not a multiple of k.
¹ The period of state j is the largest integer k verifying this property.
¹ The period of a DTMC is equal to the GCD of the periods of its states.
¹ A DTMC is aperiodic if its period is equal to 1.
38

Markov Chain
Steady State
· The analysis of the steady state of a DTMC consists of
examining the probability vector π(n) when n tends
towards infinity.
· Calculation:

¹ π(n) = π(0)·[P]^n; P^n can be calculated by diagonalizing P.


· In an irreducible and aperiodic DTMC, the steady state
vector of probabilities πj = lim πj(n) always exists and is
independent of the initial distribution π(0) so we can
solve:
π = π·P and Σ πi = 1
39

114
Example
· Transition matrix (same three-state chain as before):
P = [ 1/2  1/2   0  ]
    [ 1/2   0   1/2 ]
    [ 1/3  2/3   0  ]
· Is this chain irreducible? Is this chain aperiodic?
· Solve π = π·P and Σ πi = 1:
(π1 π2 π3)·P = (π1 π2 π3)
π1 = 1/2 π1 + 1/2 π2 + 1/3 π3
π2 = 1/2 π1 + 0 + 2/3 π3
π3 = 1/2 π2
π1 + π2 + π3 = 1
· Solving these equations gives π1 = 8/17, π2 = 6/17, π3 = 3/17.
π = (8/17  6/17  3/17)
40
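The same stationary distribution can be obtained numerically by solving π = π·P together with Σ πi = 1 as an (over-determined) linear system:

```python
import numpy as np

P = np.array([[1/2, 1/2, 0],
              [1/2, 0, 1/2],
              [1/3, 2/3, 0]])

# (P^T - I) pi = 0 plus the normalization constraint sum(pi) = 1
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)       # approximately [8/17, 6/17, 3/17]
```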

Example
· Let us represent the state of the weather by a 1st-order,
ergodic Markov model, M:
¹ State 1: rain
¹ State 2: cloud
¹ State 3: sun
º With state transition probabilities;
· Given today is sunny (x1=3) what is
the probability with model M of
observing the sequence of
weather states “sun-sun-rain-cloud-sun”?

41

115
Example
· P(X|M) = P(X = {3, 3, 1, 2, 3} | M)
= P(x1=3)·P(x2=3|x1=3)·P(x3=1|x2=3)·P(x4=2|x3=1)·P(x5=3|x4=2)
= S3·P33·P31·P12·P23
· With S = (2/11, 6/11, 3/11):
P(X|M) = 3/11 · 0.6 · 0.2 · 0.3 · 0.1 = 0.00098

42

Hidden Markov Models (HMMS)


Introduction
· The Probability of state i generating a discrete observation

ot, which has one of a finite set of values is


bi(ot) = P(ot | xt = i)
· Probability distribution of a continuous observation ot,

which can have one of infinite set of values is


bi(ot) = p(ot | xt = i)

43

116
Hidden Markov Model (HMM)

1. Number of different states N, x ∈ {1, …, N};
2. Number of different events (observations or features) K, k ∈ {1, …, K};
3. Initial-state probabilities,
   π = {πi = P(x1 = i)} for 1 ≤ i ≤ N;
4. State-transition probabilities matrix,
   A = {aij = P(xt = j | xt−1 = i)} for 1 ≤ i, j ≤ N;
5. Discrete output probabilities matrix,
   B = {bi(k) = P(ot = k | xt = i)} for 1 ≤ i ≤ N and 1 ≤ k ≤ K.
44

Hidden Markov Models (HMMS)


· HMMs are important for speech recognition: we cannot observe the state (the letter or phoneme) directly, but we can observe the features (cepstral coefficients) and analyze them to infer the state; this inference is the recognition task.
· HMMs are also one of the tools of machine learning in AI (for pattern recognition).






45

117
Hidden Markov Models (HMMS)

· The state sequence X = {1, 1, 2, 3, 3, 4} produces the set of
observations O = {o1, o2, …, o6}:
P(X|λ) = π1·a11·a12·a23·a33·a34
P(O|X, λ) = b1(o1)·b1(o2)·b2(o3)·b3(o4)·b3(o5)·b4(o6)
P(O, X|λ) = π1·b1(o1)·a11·b1(o2)·a12·b2(o3)·…





46

Hidden Markov Models (HMMS)







47

118
Hidden Markov Models (HMMS)

· The triple λ = (π, A, B) thus defines an HMM.


· We shall refer to λ as the model and the model parameter

set interchangeably without ambiguity.


· In the development of the HMM methodology, the

following problems are of particular interest.

52

Left-to-Right HMM
· In speech recognition field, we use left-to-right models
for HMM.
· Left-to-right models obviously have no transitions from

right to left, or more formally, from higher-index states to


lower-index states.
· This implies that the rightmost state is an absorbing state:

once it has been reached, there is no way to get out.


· Every other state is transient, because the chain will

eventually leave that state without ever returning to it.

53

119
Left-to-Right HMM
· Left-to-right models have several advantages.
· For time-varying signals like speech, the nature of the signal is
reflected in the model topology.
· The number of possible paths through the model is reduced
significantly, simplifying computations; because:
¹ the number of states and transitions is relatively small,
¹ estimating transition and emission probabilities becomes more
tractable,
¹ fewer parameters to estimate implies more reliable estimates.
· Finally, practical experience in speech recognition shows that
the left-to-right topology performs better than any other
topology.
54

Example

55

120
Hidden Markov Models (HMMS)
· Three tasks within HMM framework:
¹ Evaluation problem: Given the observation sequence O and a model λ,
how do we efficiently evaluate the probability of O being produced by
the source model λ, P(Observations|Model), i.e. P (O|λ)?
¹ In our case, this means checking, for each letter, the probability that it produces a specific observed cepstral coefficient.

¹ Decoding problem: How to deduce from O the most likely state


sequence x (letter) in a meaningful manner given the model. It`s the
recognition.

¹ Training problem: We should optimize the template by training the


parameters in the methods.
¹ The objective is to construct a model that best fits the training data (or
best represents the source that produced the data).
¹ This is an estimation problem: given the observation O, how do we
solve the inverse problem of estimating the parameters in λ.

56

Evaluation Problem
· One can simply evaluate this probability directly from the
definition.
· It is obtained by summing the joint probability over all

possible sequences x as follows:


 ࡼ ࡻȁࣅ  ൌ ෍ ࡼ ࡻȁ࢞ǡ ࡮ ‫࢞ ࡼ ڄ‬ȁ࡭ǡ ࣊
‫ܔܔ܉‬᩹࢞

 ൌ ෍ ࣊࢞૚ ࢈࢞૚ ࢕૚ ࢇ࢞૚࢞૛ ࢈࢞૛ ࢕૛ ‫ ڄ‬ǥ ‫ࢀ࢞ࢇ ڄ‬ష૚ ࢞ࢀ ࢈࢞ࢀ ࢕ࢀ


࢞૚ ǡ࢞૛ ǡǤǤǤǡ࢞ࢀ

 ൌ ෍ ࣊࢞૚ ෑ ࢇ࢚࢞ష૚࢚࢞ ࢈࢚࢞ ࢕࢚


࢞૚ ǡ࢞૛ ǡǤǤǤǡ࢞ࢀ ࢚ୀ૛

57

121
Evaluation Problem

58

Example

59

122
Evaluation Problem

· Compute likelihood of
a set of observations
with a given model
(Trellis).
· αt(i) is the probability

of partial observations
sequence o1, o2, …, ot
(until time t) when
being in state i at time
t.
62

Evaluation Problem

· The forward procedure can be summarized by the following steps to calculate the
forward likelihood αt(i) = p(xt = i, o1..t | λ):

1. Initialization
   Set t = 1;
   α1(i) = πi·bi(o1),  1 ≤ i ≤ N
2. Induction
   αt+1(j) = bj(ot+1)·Σ_{i=1}^{N} αt(i)·aij,  1 ≤ j ≤ N
3. Update time
   Set t = t + 1;
   Return to step 2 if t < T;
4. Termination
   P(O|λ) = Σ_{i=1}^{N} αT(i)

63
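A compact Python sketch of this forward procedure for a discrete-output HMM; the array layout (pi, A, B) is an assumed representation of the model parameters λ = (π, A, B).

```python
import numpy as np

def forward(pi, A, B, observations):
    """Return P(O | lambda) for a discrete-output HMM.

    pi: (N,) initial-state probabilities
    A:  (N, N) transition matrix, A[i, j] = a_ij
    B:  (N, K) output matrix, B[i, k] = b_i(k)
    observations: sequence of symbols in 0..K-1
    """
    alpha = pi * B[:, observations[0]]          # initialization: alpha_1(i) = pi_i b_i(o_1)
    for o in observations[1:]:                  # induction over t
        alpha = B[:, o] * (alpha @ A)           # alpha_{t+1}(j) = b_j(o) * sum_i alpha_t(i) a_ij
    return float(alpha.sum())                   # termination: P(O|lambda) = sum_i alpha_T(i)
```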

123
Example

64

Decoding Problem
· There are several ways to define the decoding objective.
· The most trivial choice is, following the Bayesian

framework, to maximize the instantaneous posteriori


probability zt(i) = P(xt = i | O, λ), i.e. we decode the state at
time t by choosing q̄t to be
q̄t = arg max_{1≤i≤N} zt(i)

· The forward algorithm allows the probability of an HMM


to be evaluated with respect to a given observation
sequence, but it does not give any indication as to the
underlying state sequence.
65

124
Decoding Problem

· As a remedy, we will present the Viterbi algorithm,


which can be used to find the most likely path through a
model, this is known as sequential decoding.
· The Viterbi algorithm is similar in structure to the
forward procedure presented in the previous section.
· Both have a similar kind of induction step, but whereas
the forward algorithm sums probabilities over the state
space, the Viterbi algorithm takes the maximum.
· Consider the highest probability at time t along any single
path that ends in state i, given by:
δt(i) = max_{x1,x2,...,xt−1} P(x1 x2 … xt−1, xt = i, o1 o2 … ot | λ)
66

Decoding Problem

· By induction, at t+1 we have
δt+1(j) = bj(ot+1)·max_{1≤i≤N} [ δt(i)·aij ]

· Since there can be several preceding states i that


maximize the probability, the auxiliary variable ψj(t) is
actually a set of states.
· In practice, it suffices to select just one of the
maximizing states and discard the rest.
· ψ*(T) specifies the state (sT) where the most likely path
ends. The whole path can be retrieved by following the
back pointers.
67

125
Decoding Problem
· The complete Viterbi algorithm is given below:
1. Initialization
 ‫ ࢚ܜ܍܁‬ൌ ᩷૛Ǣ
 ࢾ૚ ࢏ ൌ ᩷ ࣊࢏ ࢈࢏ ࢕૚ ǡ ᩷૚ ൑ ࢏ ൑ ࡺ
 ࣒૚ ࢏ ൌ ᩷૙ǡ ᩷૚ ൑ ࢏ ൑ ࡺ
2. Induction
ࢾ࢚ ࢐ ൌ ࢈࢐ ࢕࢚ ࢓ࢇ࢞૚ஸ࢏ஸࡺ ࢾ࢚ି૚ ࢏ ࢇ࢏࢐ ǡ ૚ ൑ ࢐ ൑ ࡺ
࣒࢚ ࢐ ൌ ‫࢞ࢇ࢓܏ܚ܉‬૚ஸ࢏ஸࡺ ࢾ࢚ି૚ ࢏ ࢇ࢏࢐ ǡ ૚ ൑ ࢐ ൑ ࡺ
3. Update Time
 ‫ܜ܍܁‬࢚᩹ ൌ ࢚ ൅ ૚Ǣ
 ‫ܖܚܝܜ܍܀‬᩹‫ܗܜ‬᩹‫ܘ܍ܜܛ‬᩹૛ܑ᩹܎࢚᩹ ൑ ࢀǢ
 ‫܍ܛܑܟܚ܍ܐܜ۽‬ǡ ‫ܕܐܜܑܚܗ܏ܔ܉܍ܐܜ܍ܜ܉ܖܑܕܚ܍ܜ‬ሺ܏‫ܘ܍ܜܛܗܜܗ‬૝ሻǤ
4. Termination
ࡼ‫ כ‬ൌ ࢓ࢇ࢞૚ஸ࢏ஸࡺ ࢾࢀ ࢏
ࢗ‫ ࢀכ‬ൌ ‫࢞ࢇ࢓܏ܚ܉‬૚ஸ࢏ஸࡺ ࢾࢀ ࢏
5. Path (state sequence) Backtracking
(a) Initialization
Set t = T-1
(b) Backtracking
‫כ‬
࢚ࢗ‫ כ‬ൌ ࣒࢚ା૚ ࢚ࢗା૚

(c) Update Time


 ‫ܜ܍܁‬࢚᩹ ൌ ࢚ െ ૚Ǣ
 ‫ܖܚܝܜ܍܀‬᩹‫ܗܜ‬᩹‫ܘ܍ܜܛ‬᩹ ࢈ ܑ᩹܎࢚᩹ ൒ ૚Ǣ
 ‫܍ܛܑܟܚ܍ܐܜ۽‬ǡ ‫ܕܐܜܑܚܗ܏ܔ܉܍ܐܜ܍ܜ܉ܖܑܕܚ܍ܜ‬Ǥ
68
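A compact Python sketch of the Viterbi recursion and backtracking, using the same assumed (pi, A, B) array representation as in the forward sketch.

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Return the most likely state sequence and its probability."""
    N, T = len(pi), len(observations)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, observations[0]]                    # initialization
    for t in range(1, T):                                    # induction
        scores = delta[t - 1][:, None] * A                   # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = B[:, observations[t]] * np.max(scores, axis=0)
    path = np.zeros(T, dtype=int)                            # backtracking
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, float(np.max(delta[-1]))
```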

Training Problem
· The objective is to construct a model that best fits the training
data, or best represents the source that produced the data.
· This is an estimation problem: given the observation O, how
do we solve the inverse problem of estimating the parameters
in λ.
· We often follow the method of maximum likelihood (ML), i.e.
we choose λ* = (π, A, B) such that P(O|λ) is maximized for the
given training sequence O .
· The best model in maximum likelihood sense is therefore the
one that is most probable to generate the given observations.

69

126
Dynamic Time Warping (DTW)

· Speech is a time-varying process in which the duration


of a word and its sub-words varies randomly.
· Hence a method is required to find the best time

alignment between a sequence of vector features


representing a spoken word and the model candidates.
· The best time-alignment between two sequences of

vectors may be defined as the alignment with the


minimum Euclidean distance.
· For isolated-word recognition the time-alignment

method used is dynamic-time warping (DTW).


70

Dynamic Time Warping (DTW)


· The best matching template is the one with the lowest
distance path aligning the input pattern to the template.
· For illustration of DTW: consider a point (i,j) in the time-time
matrix (where i indexes the input pattern frame, and j the
template frame), then previous point must have been (i-1,j-1),
(i-1,j) or (i,j-1).
· The key idea in dynamic programming is that at point (i,j) we
continue with the lowest accumulated distance path from (i-1,
j-1), (i-1,j) or (i,j-1).
· The DTW algorithm operates in a time-synchronous manner:
each column of the time matrix is considered in succession
(equivalent to processing the input frame-by-frame) so that, for
a template of length N, the maximum number of paths
considered at any time is N.
71

127
Dynamic Time Warping (DTW)

· If D(i,j) is the global distance up to (i,j) and the local distance at


(i,j) is d(i,j) then we have the recursive relation:

D(i, j) = min[D(i −1, j), D(i, j −1), D(i −1, j −1)] + d(i, j)

· Given that D(1,1) = d(1,1), we have the basis for an efficient


recursive algorithm for computing D(i,j).
· The final global distance D(n,N) gives us the overall matching
score of the template with the input.
· The input word is then recognized as the word corresponding
to the template with the lowest matching score.
72
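A minimal Python sketch of this recursion; a padded border with D(0,0) = 0 replaces the explicit D(1,1) = d(1,1) initialization, which is equivalent.

```python
import numpy as np

def dtw_distance(x, y):
    """Global DTW distance between an input pattern x (n frames) and a template y (m frames)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])                       # local distance d(i, j)
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])  # recursion
    return D[n, m]                                                        # overall matching score

# Recognition: the input word is assigned to the template with the lowest dtw_distance.
```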

Resolutions of Speech Features and Models

· The resolution of the features and models employed in a


speech recognition system depends on the following
factors:
¹ Acoustic feature resolution, i.e. spectral-temporal
resolution.
¹ Acoustic model resolution; i.e. the smallest segment
modeled.
¹ Statistical model resolution and Contextual resolution.

75

128
Spectral-Temporal Feature Resolution

· The spectral and temporal resolution of speech features


are determined by the following factors:
· (i) the speech signal window size for feature extraction (typically
25 ms),
· (ii) the rate at which speech features are sampled (usually every 5
to 10 ms), and
· (iii) speech feature vector dimensions; typically 13 cepstral
features plus 13 first difference and 13 second difference
cepstral features.

76

Statistical Model Resolution

· Model resolution is determined by:


¹ (i) the number of models,
¹ (ii) the number of states per model,
¹ (iii) the number of sub-state models per state.

· For example, when using hidden Markov models, each


HMM has N states (typically N=3-5), and the distribution
of feature vectors within each state is modeled by a
mixture of M multi-variate Gaussian densities.

77

129
Model Context Resolution

· The acoustic production of a speech unit is affected by its


acoustic context; that is by the preceding and succeeding
speech units.
· In phoneme-based systems, context-dependent triphones
are used.
· Since there are about 40 phonemes, the total number of
triphones is 40³ = 64,000, although many of these cannot
occur due to linguistic constraints.
· The states of triphone models are often clustered and tied
to reduce the total number of parameters and hence
obtain a compromise between contextual resolution and
the number of model parameters used.
79

Thank you For Listening


Digital Signal Processing II
Chapter 4

130
131
DSPII- Speech Processing

Exercises
Chapter 1

Dr. Ali Al-Hajj Hassan

Exercise

132
Exercise

Exercise

133
DSPII- Speech Processing

Exercises
Chapter 2

Dr. Ali Al-Hajj Hassan

Exercise

134
Exercise

DSPII- Speech Processing

Exercises
Chapter 3

Dr. Ali Al-Hajj Hassan

135
Exercise

Exercise

136
DSPII- Speech Processing

Exercises
Chapter 4

Dr. Ali Al-Hajj Hassan

Exercise

137
Exercise

Exercise

138
Thank you For Listening
Digital Signal Processing II
Speech Processing- Exercises

139
