DSP II - DVP - CDP 2pp
Electrical Engineering Department
Spring 2025
Sem VIII
DSPII- Voice Processing
Course Administration
· Text-Book:
¹ Principles of Digital Audio, Ken Pohlmann, McGraw-Hill, 2011.
Course Content
· Spectro-temporal analysis.
· Cepstral analysis.
Chapters
Chapter 1 Outline
· Speech Processing
¹ Pitch Period Estimation
¹ Short Time Analysis
¹ Convolution
Audio
Digital Representation of Audio
Digital Audio
Audio Formats
Sound Amplitude
Acoustic Theory of Speech
· Speech sounds are sensations of air pressure vibrations
produced by air exhaled from the lungs and modulated and
shaped by the vibrations of the glottal cords and the resonance
of the vocal tract as the air is pushed out through the lips and
nose.
· Speech is also a sequence of elementary acoustic sounds or
symbols known as phonemes that convey the spoken form of a
language.
· Speech sounds have a rich, multi-layered temporal-spectral variation that conveys words, intention, expression, intonation, accent, speaker identity, gender, age, style of speaking, the speaker's state of health, and emotion.
Acoustic Theory of Speech
Source-Filter Model of Speech Production
· Speech sounds result from a combination of:
¹ a source of sound energy (the larynx) modulated by a time-varying filter (the vocal tract), whose transfer function is determined by the shape and size of the vocal tract.
· This results in a shaped spectrum with broadband energy
peaks.
· In this model the source of acoustic energy is at the larynx, and
the vocal tract serves as a time-varying filter whose shape
determines the phonetic content of the sounds.
The Liljencrants-Fant (LF)
model of a glottal pulse
Acoustic Theory of Speech
Voiced and Unvoiced Speech
· For many speech applications, it is important to distinguish between voiced and unvoiced speech.
· There are two common methods for doing this:
¹ Short-time energy function: split the speech signal x(n) into blocks of 10-20 ms and calculate the energy within each block. The amplitude of unvoiced segments is noticeably lower than that of voiced segments. The short-time energy of the frame of N samples ending at time instant m reflects this amplitude variation and is
E(m) = Σ_{n=m−N+1}^{m} x²(n)
¹ Zero-crossing rate: the rate at which the speech signal crosses zero can provide information about the source of its creation. Unvoiced speech has a much higher ZCR than voiced speech, because most of the energy in unvoiced speech is found at higher frequencies than in voiced speech, implying a higher ZCR for the former. The zero-crossing rate of the frame ending at time instant m is defined by
ZCR(m) = (1/2)·Σ_{n=m−N+1}^{m} |sgn(x(n)) − sgn(x(n−1))|
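A minimal Python sketch of both measures (the signals, frame size, and median-based decision rule are illustrative assumptions, not part of any standard):

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split x into consecutive non-overlapping frames."""
    n = len(x) // frame_len
    return x[:n * frame_len].reshape(n, frame_len)

def short_time_energy(x, frame_len):
    """Energy of each frame: sum of squared samples."""
    return np.sum(frame_signal(x, frame_len) ** 2, axis=1)

def zero_crossing_rate(x, frame_len):
    """Average number of sign changes per sample in each frame."""
    s = np.sign(frame_signal(x, frame_len))
    s[s == 0] = 1
    return np.mean(np.abs(np.diff(s, axis=1)) / 2, axis=1)

# Demo at 8 kHz with 20 ms (160-sample) frames: a strong low-frequency tone
# stands in for voiced speech, weak wideband noise for unvoiced speech.
fs, frame_len = 8000, 160
t = np.arange(fs) / fs
x = np.concatenate([0.8 * np.sin(2 * np.pi * 120 * t),   # "voiced" second
                    0.1 * np.random.randn(fs)])          # "unvoiced" second

e, z = short_time_energy(x, frame_len), zero_crossing_rate(x, frame_len)
is_voiced = (e > np.median(e)) & (z < np.median(z))      # crude decision rule
print(is_voiced[:5], is_voiced[-5:])                     # True..., False...
```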
Voiced and Unvoiced Model
· A representation of the excitation in terms of separate source
generators for voiced and unvoiced speech.
· The unvoiced excitation is assumed to be a random noise sequence.
· The voiced excitation is assumed to be a periodic impulse train with
impulses spaced by the pitch period.
· Each voiced sound is approximately periodic, but different sounds
are different periodic signals.
· We can model the vocal tract as an LTI (Linear Time-Invariant) filter over short time intervals: over the timescale of phonemes (10, 20, or 30 ms), the impulse response, frequency response, and system function of the system remain relatively constant.
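A short sketch of the two excitation generators in this model (the 71-sample pitch period and frame length are illustrative numbers):

```python
import numpy as np

def voiced_excitation(n_samples, pitch_period):
    """Periodic impulse train with impulses spaced by the pitch period."""
    e = np.zeros(n_samples)
    e[::pitch_period] = 1.0
    return e

def unvoiced_excitation(n_samples, seed=0):
    """Random (white Gaussian) noise sequence."""
    return np.random.default_rng(seed).standard_normal(n_samples)

v = voiced_excitation(240, 71)   # 30 ms at 8 kHz, ~113 Hz fundamental
u = unvoiced_excitation(240)
print(np.nonzero(v)[0])          # impulses at 0, 71, 142, 213
```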
Human Auditory System
· Sound enters the ear and travels through the ear canal.
· The eardrum (a transducer) receives the sound pulses and conveys the pressure changes via the middle ear bones and the cochlea to the auditory nerve endings.
· The middle ear, consisting of the eardrum and very small bones, is connected to a small
membrane attached to the fluid-filled cochlea.
· The middle ear’s main function is to convert the acoustic air pressure vibrations in the ear
canal to fluid pressure changes in the cochlea.
· The inner ear starts with the oval window at one end of the cochlea.
· The mechanical vibrations from the middle ear excite the fluid in the cochlea via this membrane.
· The cochlea is a long helically coiled and tapered tube filled with fluid.
· Internally, the cochlea is lined with hair cells that sense frequency changes propagated
through its fluid, and the nerves transform the mechanical vibrations of fluid to electrical
signals.
Frequency Domain Limits
Masking/Hiding
· In addition, the presence of a perceptible frequency changes the shape of the threshold curve.
· All the audible levels were recorded in the presence of a 1 kHz tone at 45 dB.
· From the curve, it can be seen that when a 1 kHz tone is present, other frequency tones below 45 dB in the neighborhood of 1 kHz are not perceived.
· A neighboring tone of 1.1 kHz at 20 dB would normally be heard well (in quiet), but will not be heard as a distinct tone in the presence of the stronger 1 kHz tone: the 1 kHz tone is masking or hiding the 1.1 kHz neighboring tone at the decibel levels discussed.
· Here the 1 kHz tone is known as the masker, and any other tone in the neighborhood, which is masked, is known as the maskee.
Masking/Hiding
· The higher the frequency, the larger the spectral spread that it hides.
· When a 1 kHz and a 1.1 kHz tone are simultaneously present at 45 dB and 20 dB respectively, the 1 kHz tone masks the 1.1 kHz tone and, hence, the higher tone cannot be heard.
· If the 45 dB masking tone is removed and only the 20 dB maskee tone is present, the ear should perceive it, because it is above the threshold-in-quiet curve.
Masking/Hiding
Advantages of Digital over Analog
Brief: digital data is preferred because it offers better quality and higher fidelity, and can be mixed, compressed, distributed, stored, and retrieved easily.
¹ Quantization (original vs. reconstructed signal)
Note: the most desirable property in ADC is to ensure that the rendered analog signal is very similar to the initial analog signal. Example: the end device on which the digital content is rendered, e.g. a CRT monitor.
ADC: Sampling
ADC: Quantization
ADC: Quantization
· Examine the difference when the same signal is sampled at the same frequency and quantized with:
- 8 bits (256 levels)
- 4 bits (16 levels)
- 3 bits (8 levels)
ADC: Quantization
· The entire range R of the signal is represented by a finite
number of bits b.
Question: how many bits b should represent each sample? Is it the same for all signals?
Answer: depends on signal type and its usage.
ADC: Quantization
· What are the challenges?
· Audio signals (music) are quantized using 16 bits/sample.
Bit Rate
· The number of bits being produced per second; for example, 8000 samples/s × 8 bits/sample = 64 kbps.
· Important when storing digital signals and transmitting them over networks with varying bandwidths.
Signal Types
· Continuous, smooth, non-smooth, symmetric, finite support, periodic…
Linear Time Invariant (LTI) Systems
· A system with input x and output y is LTI if it is both linear (superposition holds) and time-invariant (a time shift of the input only time-shifts the output).
· Sifting property: the delta function “sifts out” the value of the signal at time T:
∫ x(t)·δ(t − T) dt = x(T)
· It follows that, used with convolution, the delta function can time-shift a signal (using its symmetry):
x(t) * δ(t − T) = x(t − T)
Nyquist Rate
· Look at these signals, and tell me how many samples we need
from each one to fully reconstruct it back to its analog form.
Digital Filters
· Remove unwanted parts of the signal before sampling, such as background noise beyond 4 kHz in a song.
· Filters can be analog (circuit components) or digital.
Digital Filters Advantages
· Programmable
· Designed and tested on PCs
· Handle low frequencies accurately
· Versatile in general
Fourier Transform
· Decompose the signal into its spectral components.
· Transform the signal from the time domain to the frequency domain.
Fourier Transform
· Every periodic continuous signal can be expressed as a
weighted combination of sinusoid (cosine and sine)
waves.
Fourier Transform
· If we choose only cosines, for example, only the even (symmetric) components of the signal are present.
Fourier Transform Examples
If only cosines:
Speech Processing
· Properties of speech signals change with time. To
process them effectively it is necessary to work on a
frame-by-frame basis, where a frame consists of a
certain number of samples.
· The duration of a frame is known as its length. Typically, the length is selected between 10 and 30 ms (80 to 240 samples at 8 kHz).
· Within this short interval, properties of the signal
remain roughly constant.
· Thus, many signal processing techniques are adapted
to this context when deployed to speech coding
applications.
Speech Processing:
Pitch Period Estimation
The AutoCorrelation Method
· To perform the estimation on the signal s(n), with n being the time index, consider the frame of N samples that ends at time instant m. The autocorrelation for lag l is
R(l, m) = Σ_{n=m−N+1}^{m} s(n)·s(n−l)
· For instance, for l =20 to 150, the possible pitch frequency
values range from 53 to 400 Hz at 8 kHz sampling rate.
· By calculating the autocorrelation values for the entire range of
lag, it is possible to find the value of lag associated with the
highest autocorrelation representing the pitch period estimate,
since, in theory, autocorrelation is maximized when the lag is
equal to the pitch period.
· It is important to mention that, in practice, the speech signal is
often lowpass filtered before being used as input for pitch
period estimation. Since the fundamental frequency associated
with voicing is located in the low-frequency region (<500 Hz),
lowpass filtering eliminates the interfering high-frequency
components as well as out-of-band noise, leading to a more
accurate estimate.
· The figure shows a voiced portion of a speech waveform and the autocorrelation values obtained for l = 20 to 150, m = 1500, and N = 180. The lag corresponding to the highest peak is 71, which is the pitch period estimate.
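A sketch of the method in Python, run on a synthetic periodic signal instead of real speech (the signal construction is illustrative; the frame indexing follows the definition above):

```python
import numpy as np

def autocorr_pitch(s, m, N, lag_min=20, lag_max=150):
    """Pitch period estimate for the N-sample frame of s ending at index m:
    R(l, m) = sum_{n=m-N+1}^{m} s[n]*s[n-l]; return the lag maximizing R."""
    lags = np.arange(lag_min, lag_max + 1)
    frame = s[m - N + 1 : m + 1]
    R = [np.dot(frame, s[m - N + 1 - l : m + 1 - l]) for l in lags]
    return lags[int(np.argmax(R))]

# Synthetic "voiced" signal: impulse train with a 71-sample period,
# smoothed into pulses.
fs = 8000
s = np.zeros(2 * fs)
s[::71] = 1.0
s = np.convolve(s, np.hanning(50), mode="same")
print(autocorr_pitch(s, m=1500, N=180))   # -> 71
```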
Speech Processing:
Pitch Period Estimation
Magnitude Difference Function
· One drawback of the autocorrelation method is the need for multiplication, which is relatively expensive to implement, especially on processors with limited functionality. To overcome this problem, the magnitude difference function was introduced:
MDF(l, m) = Σ_{n=m−N+1}^{m} |s(n) − s(n−l)|
· Consider the same speech portion as in the autocorrelation example. The plot of the MDF shows that the lowest MDF occurs at l = 70. Compared with the previous result, the present method yields a slightly lower estimate.
· The methods discussed so far can only find pitch period values that are multiples of the sampling period (0.125 ms at 8 kHz sampling). In many applications, higher resolution is necessary to achieve good performance. Other signal processing techniques can be introduced to extend the resolution beyond the limits set by the fixed sampling rate.
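A matching sketch of the MDF estimator, using the definition given above (same caveats as the autocorrelation example):

```python
import numpy as np

def mdf_pitch(s, m, N, lag_min=20, lag_max=150):
    """Multiplication-free pitch estimate: the lag MINIMIZING
    MDF(l, m) = sum_{n=m-N+1}^{m} |s[n] - s[n-l]| is the estimate."""
    lags = np.arange(lag_min, lag_max + 1)
    frame = s[m - N + 1 : m + 1]
    mdf = [np.sum(np.abs(frame - s[m - N + 1 - l : m + 1 - l])) for l in lags]
    return lags[int(np.argmin(mdf))]
```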
Speech Processing:
The All-Pole and All-Zero Filter
· Consider the filters with system functions:
H(z) = 1 / (1 − Σ_{i=1}^{M} a_i·z^{−i})   (all-pole)
H(z) = Σ_{i=0}^{M} b_i·z^{−i}   (all-zero)
· A signal flow graph for direct form implementation of both filters.
· Note that the impulse response of an all-pole filter has an infinite
number of samples with nontrivial values due to the fact that the
scaled and delayed versions of the output samples are added back to
the input samples.
· This is referred to as an infinite-impulse-response (IIR) filter. For the
all-zero filter, however, the impulse response only has M+1 nontrivial
samples (the rest are zeros) and is known as a finite-impulse-response
(FIR) filter.
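The contrast between the two impulse responses can be seen in a few lines (the filter coefficients are made up; only stability matters here):

```python
import numpy as np
from scipy.signal import lfilter

a = np.array([0.9, -0.2])           # hypothetical predictor coefficients
impulse = np.zeros(20); impulse[0] = 1.0

# All-pole (IIR): H(z) = 1 / (1 - 0.9 z^-1 + 0.2 z^-2)
h_pole = lfilter([1.0], np.concatenate(([1.0], -a)), impulse)
# All-zero (FIR): H(z) = 1 + 0.5 z^-1 + 0.25 z^-2, so M = 2
h_zero = lfilter([1.0, 0.5, 0.25], [1.0], impulse)

print(np.count_nonzero(h_pole))     # 20: the IIR response never dies out
print(np.count_nonzero(h_zero))     # 3 = M + 1 nontrivial samples
```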
Speech Processing:
The All-Pole and All-Zero Filter
Calculation of the Output Sequence on a Frame-by-Frame Basis
Speech Processing:
The All-Pole and All-Zero Filter
State-Save Method
Speech Processing:
The All-Pole and All-Zero Filter
Zero-Input Zero-State Method
Speech Processing:
Short-Time Spectral Analysis
Speech Processing:
Convolution
· Given the linear time-invariant (LTI) system with impulse response h(n), and denoting the input as x(n) and the output as y(n), the convolution sum between x(n) and h(n) is one of the fundamental relations in signal processing, given by
y(n) = x(n) * h(n) = Σ_k x(k)·h(n − k)
· We explore the use of the convolution sum for the calculation of the output sequence, with emphasis on the all-pole filter. A straightforward way to find the impulse response sequence is to use the time-domain difference equation given in section 4.2 with a single impulse as input, x(n) = δ(n); that is,
h(n) = δ(n) + Σ_{i=1}^{M} a_i·h(n − i)
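A sketch that builds h(n) from this recursion and checks that convolving with it reproduces the direct difference-equation output (coefficients are illustrative, and h is truncated, so the match holds only up to truncation error):

```python
import numpy as np

def allpole_impulse_response(a, n_samples):
    """h(n) = delta(n) + sum_i a[i] * h(n-1-i): the difference
    equation driven by a single impulse."""
    h = np.zeros(n_samples)
    for n in range(n_samples):
        h[n] = (1.0 if n == 0 else 0.0) + sum(
            a[i] * h[n - 1 - i] for i in range(min(len(a), n)))
    return h

a = [0.9, -0.2]                          # hypothetical AR coefficients
h = allpole_impulse_response(a, 50)      # truncated impulse response
x = np.random.randn(100)
y = np.convolve(x, h)[:100]              # output via the convolution sum
y2 = np.zeros(100)                       # output via the recursion itself
for n in range(100):
    y2[n] = x[n] + sum(a[i] * y2[n - 1 - i] for i in range(min(len(a), n)))
print(np.allclose(y, y2))                # True (up to truncation of h)
```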
DSPII- Speech Processing
Speech Coding
Chapter 2
Outline
· Introduction to Speech Coding
· Desirable Properties of a Speech Coder
· Coding Delay
· Classification of Speech Coders
· Waveform Coders
· DPCM
· ADPCM
· Linear Prediction
· Auto-Regressive (AR) Model
· Linear Prediction Problem
· Error Minimization
· Minimum Mean-Squared Prediction Error
· Estimation of LP Coefficients
· Levinson-Durbin
· Leroux-Gueguen
· LPC
· LPC CODER
· FS1015 LPC CODER
· LPC DECODER
· LIMITATIONS OF LPC
Introduction to Speech Coding
· The frequency content of (narrowband) speech is limited to between 300 and 3400 Hz.
· According to the Nyquist theorem, the sampling frequency
must be at least twice the bandwidth of the continuous-time
signal in order to avoid aliasing.
· 8 kHz is commonly selected as the standard sampling
frequency for speech signals.
· To convert the analog samples to digital format using uniform quantization while maintaining quality, more than 8 bits/sample is necessary.
· The use of 16 bits/sample provides a quality that is considered
high.
· The bit rate at the input is 64 kbps for 8 bits/sample (128 kbps
for 16 bits/sample).
Desirable Properties of a Speech Coder
· The main goal of speech coding is either to maximize the
perceived quality at a particular bit-rate, or to minimize
the bit-rate for a particular perceptual quality.
· The appropriate bit-rate at which speech should be
transmitted or stored depends on the cost of
transmission or storage, the cost of coding
(compressing) the digital speech signal, and the speech
quality requirements.
· The bit-rate is reduced by representing the speech signal
(or parameters of a speech production model) with
reduced precision and by removing inherent redundancy
from the signal, resulting therefore in a lossy coding
scheme.
Desirable Properties of a Speech Coder
· Robustness in the presence of channel errors: This is
crucial for digital communication systems where channel
errors will have a negative impact on speech quality.
· Low memory size and low computational complexity
(efficient realization): In order for the speech coder to be
practicable, costs associated with its implementation must
be low; these include the amount of memory needed for
its operation, as well as computational demand.
· Low coding delay: In the process of speech encoding and
decoding, delay is introduced, which is the time shift
between the input speech of the encoder with respect to
the output speech of the decoder. An excessive delay
creates problems with real-time two-way conversations.
Coding Delay
· Encoder buffering delay: many speech encoders require
the collection of a certain number of samples before
processing. The typical linear prediction (LP)-based
coders need to gather one frame of samples ranging from
160 to 240 samples, or 20 to 30 ms, before proceeding with
the actual encoding process.
Coding Delay
Coding Delay
· Transmission delay: also known as decoder buffering
delay, since it is the amount of time that the decoder must
wait in order to collect all bits related to a particular
frame so as to start the decoding process.
· There are two transmission modes:
¹ In constant mode, the bits are transmitted synchronously at a
fixed rate, which is given by the number of bits corresponding to
one frame divided by the length of the frame. This mode of
operation is dominant for most classical digital communication
systems, such as wired telephone networks.
¹ In burst mode, all bits associated with a particular frame are
completely sent within an interval that is shorter than the encoder
buffering delay. This mode is inherent to packetized networks and
the internet, where data are grouped and sent as packets.
Coding Delay
Classification of Speech Coders
Classification of Speech Coders
Waveform Coders
· The main coder in this category is Pulse Code Modulation (PCM): the process of quantizing the samples of a discrete-time signal.
· PCM is the most obvious method developed for the digital coding of waveforms.
· A quantizer with a non-uniform transfer characteristic yields higher performance and forms the core of the ITU-T G.711 PCM standard (μ-law, A-law).
· The compression and expansion characteristics are for μ = 255 and A = 87.56; 8 bits/sample is adopted, leading to a bit-rate of 64 kbps at a sampling frequency of 8 kHz. Several narrowband speech coding standards related to PCM are recommended by ITU-T.
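The μ-law characteristic itself is compact enough to sketch directly (μ = 255 as in G.711; input samples assumed normalized to [−1, 1]):

```python
import numpy as np

def mu_law_compress(x, mu=255):
    """y = sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255):
    """Inverse (expansion) characteristic."""
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 5)
print(np.allclose(mu_law_expand(mu_law_compress(x)), x))  # True
```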
DPCM
ADPCM
· Adaptive Differential Pulse Code Modulation (ADPCM).
Linear Prediction (LP)
· Linear prediction (LP) forms an integral part of almost all
modern day speech coding algorithms.
· The fundamental idea is that a speech sample can be
approximated as a linear combination of past samples.
· The basic assumption is that speech can be modeled as an
autoregressive (AR) signal, which in practice has been found
to be appropriate.
· Linear prediction analysis is an estimation procedure to find
the AR parameters, given samples of the signal.
· LP is an identification technique where parameters of a system
are found from the observation.
· The goal is to minimize the error between the real and the
predicted value.
Moving Average (MA) Model
Linear Prediction Problem
· The parameters of an AR model are estimated from the
signal itself.
· The white noise signal x(n) is filtered by the AR process to produce the signal s(n):
s(n) = Σ_{i=1}^{M} a_i·s(n − i) + x(n)
Error Minimization
· The problem is now how to estimate the coefficients.
· This estimation is done by selecting the coefficients (a_i) that minimize the mean-squared prediction error J:
J = E[e²(n)] = E[(s(n) − Σ_{i=1}^{M} a_i·s(n − i))²]
Error Minimization
Prediction Gain
· The prediction gain PG = 10·log10(σs²/σe²) is the ratio, in dB, between the variance of the signal and the variance of the prediction error; it measures the performance of the predictor.
Minimum Mean-Squared Prediction Error
· Now suppose that the prediction error is the same as the white noise used to generate the AR signal s(n).
· The mean-squared error is minimized, with
E[e²(n)] = E[x²(n)] = σx²
· The prediction gain is maximized.
· Taking into account the AR parameters used to generate the signal s(n), the minimized mean-squared error is Jmin = σx².
Estimation of LP Coefficients
· In general, inverting a matrix is computationally demanding and time-consuming.
· Efficient algorithms are available to solve the equation,
which take advantage of the special structure of the
correlation matrix.
· Levinson–Durbin algorithm
· Leroux–Gueguen algorithm
· both suitable for practical implementation of LP analysis
· Levinson–Durbin algorithm is implemented under a
floating-point environment
· Leroux–Gueguen algorithm is better suited for fixed-
point implementation.
Levinson-Durbin
· The correlation matrix is Toeplitz: it is invariant under interchange of its columns and then of its rows.
· The correlation matrix of a given size contains as subblocks all
the lower order correlation matrices.
· Based on these properties, the Levinson–Durbin algorithm is an
iterative–recursive process where the solution of the zero-order
predictor is first found, which is then used to find the solution
of the first-order predictor; this process is repeated one step at
a time until the desired order is reached.
· The minimum mean-squared prediction error achievable with
a zero-order predictor is given by the autocorrelation of the
signal at lag zero, or the variance of the signal itself E(0) = J0 =
R(0) . This is considered as initialization and will be used by
the predictor of order one.
Levinson-Durbin
· The Levinson–Durbin algorithm:
· Inputs to the algorithm are the autocorrelation coefficients R(l); the outputs are the LP coefficients and the reflection coefficients (RCs) k_l:
¹ Initialization: l = 0, J_0 = R(0)
¹ Recursion: for l = 1, 2, 3, …, M
º Step 1. Compute the l-th RC: k_l = [R(l) − Σ_{i=1}^{l−1} a_i^(l−1)·R(l−i)] / J_{l−1}
º Step 2. Update the LP coefficients: a_l^(l) = k_l; a_i^(l) = a_i^(l−1) − k_l·a_{l−i}^(l−1) for i = 1, …, l−1
º Step 3. Compute the minimum mean-squared prediction error associated with the l-th order solution: J_l = J_{l−1}·(1 − k_l²); set l = l + 1; return to Step 1
¹ The final LP coefficients are a_i^(M), i = 1, 2, …, M.
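A compact implementation of the recursion is sketched below. One sign convention must be assumed (here the predictor is ŝ(n) = Σ a_i·s(n−i), matching the formulas above; other texts flip the signs of the a_i):

```python
import numpy as np

def levinson_durbin(R):
    """LP coefficients from autocorrelations R[0..M] (length M+1)."""
    M = len(R) - 1
    a = np.zeros(M + 1)               # a[1..l]: current LP coefficients
    k = np.zeros(M + 1)               # reflection coefficients k[1..M]
    J = R[0]                          # J_0 = R(0)
    for l in range(1, M + 1):
        # Step 1: l-th reflection coefficient
        k[l] = (R[l] - np.dot(a[1:l], R[l - 1:0:-1])) / J
        # Step 2: update the LP coefficients
        a_prev = a.copy()
        a[l] = k[l]
        for i in range(1, l):
            a[i] = a_prev[i] - k[l] * a_prev[l - i]
        # Step 3: J_l = J_{l-1} * (1 - k_l^2)
        J *= 1.0 - k[l] ** 2
    return a[1:], k[1:], J

# Check on an AR(2) source with a1 = 0.9, a2 = -0.2 (Yule-Walker values)
a, k, J = levinson_durbin(np.array([1.0, 0.75, 0.475]))
print(a)                              # -> [ 0.9 -0.2]
```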
Levinson-Durbin
· The Levinson–Durbin algorithm can present some difficulties for fixed-point implementation; the difficulty lies in the values of the LP coefficients, since they possess a large dynamic range and a bound on their magnitudes cannot be found on a theoretical basis.
· For the implementation of the Levinson–Durbin algorithm in a fixed-point environment, careful planning is necessary to ensure that all variables are within the allowed range.
· Given the set of RCs k_i, i = 1, …, M, the corresponding LP coefficients a_i can be found directly from the Step 2 equations of the Levinson–Durbin algorithm.
Leroux-Gueguen
· Leroux and Gueguen proposed a method to compute the RCs from
the autocorrelation values without dealing directly with the LP
coefficients.
· Hence, problems related to dynamic range in a fixed-point
environment are eliminated.
· The algorithm computes the RCs through intermediate quantities of bounded magnitude, making it well suited to fixed-point arithmetic.
LPC
· Linear Predictive Coding (LPC) characterizes the speech signal in terms of a set of model parameters.
· The LPC algorithm is one of the earliest standardized coders.
LPC
· Linear prediction coding relies on a highly simplified model for
speech production. The speech signal is characterized in terms of a
set of model parameters.
· The driving input of the filter or excitation signal is modeled as either
an impulse train (voiced speech) or random noise (unvoiced speech).
· Depending on the voiced or unvoiced state of the signal, the switch is
set to the proper location so that the appropriate input is selected.
· Energy level of the output is controlled by the gain parameter. For a
short enough length of the frame, properties of the signal essentially
remain constant.
LPC
FS1015 LPC Coder
LPC Coder
· The pre-emphasized signal is used for LP analysis, where ten LPCs are derived.
· These coefficients are quantized, with the indices transmitted as information of the frame.
· The quantized LPCs are used to build the prediction-error filter, which filters the pre-emphasized speech to obtain the prediction-error signal at its output.
LPC Coder
· The pitch period is estimated from the prediction-error signal if and only if the frame is voiced.
· The power of the prediction-error sequence is calculated next; it is computed differently for voiced and unvoiced frames.
· The voicing bit, pitch period index, power index, and LPC index are packed together to form the bit-stream of the LPC coder.
LPC Decoder
· It is assumed that the output of the impulse train generator is a series of unit-amplitude impulses, while the white noise generator has unit-variance output.
LIMITATIONS OF LPC
LIMITATIONS OF LPC
· The use of strictly random noise or a strictly periodic impulse
train as excitation does not match practical observations using
real speech signals. The excitation signal can be observed in
practice as prediction error and is obtained by filtering the
speech signal using the prediction-error filter. In general, the
excitation for unvoiced frames can be reasonably approximated
with white noise. For voiced frames, however, the excitation
signal is a combination of a quasiperiodic component with
noise. Thus, the use of an impulse train is a coarse
approximation that degrades the naturalness of synthetic
speech. For the FS1015 coder, the excitation pulses are obtained
by exciting an allpass filter using an impulse train. The above
discussion is applicable for a typical prediction order of ten,
such as the case of the FS1015 coder.
LIMITATIONS OF LPC
Thank you For Listening
Digital Signal Processing II
Chapter 2
DSPII- Speech Processing
Outline
· Introduction to LPC Coder
· Vector Quantization
· Analysis-by-Synthesis Principle
· Long-Term and Short-Term LP Model for Speech Synthesis
· Code-Excited Linear Prediction
· Perceptual Weighting
· Encoder Operation
· Decoder Operation
· Excitation Signal
· Excitation CodeBook Search
· FS1016 CELP
· Special CodeBooks for CELP
· ACELP
· Mixed Excitation Linear Prediction (MELP)
· Period Jitter
· Pulse Shaping
· Mixed Excitation
Introduction to LPC Coder
Vector Quantization
Analysis-by-Synthesis Principle
· Open loop.
· Closed loop.
Long-Term and Short-Term LP Model
for Speech Synthesis
· The parameters of the two predictors, in the long-term
and short-term linear prediction model for speech
production shown in the Figure, are estimated from the
original speech signal.
· The long-term predictor is responsible for generating
correlation between samples that are one pitch period
apart.
· The short-term predictor recreates the correlation present
between nearby samples, with a typical prediction order equal
to ten.
· The synthesis filter associated with the short-term predictor
has a system function:
H_f(z) = 1 / (1 − Σ_{i=1}^{M} a_i·z^{−i})
Code-Excited Linear Prediction (CELP)
Perceptual Weighting
· The CELP speech production function:
H_f(z) = 1/A(z) = 1 / (1 − Σ_{i=1}^{M} a_i·z^{−i})
· with A(z) denoting the system function of the formant analysis filter.
· If the excitation codebook contains a total of L excitation
codevectors, the encoder will pass through the loop L times
for each short segment of input speech; a mean-squared error
value is calculated after each pass.
· The excitation codevector providing the lowest error is
selected at the end.
· The dimension of the codevector depends on the length of the
speech segment under consideration.
Perceptual Weighting Filter
Perceptual Weighting
· One efficient way to implement the weighting filter is with the system function
W(z) = A(z) / A(z/γ) = (1 − Σ_{i=1}^{M} a_i·z^{−i}) / (1 − Σ_{i=1}^{M} a_i·γ^i·z^{−i})
· where γ is a constant in the interval [0, 1] that determines the degree to which the error is de-emphasized in any frequency region.
· If γ → 1, then W(z) → 1, and hence no modification of the error spectrum is performed.
· If γ → 0, then W(z) → A(z), which is the formant analysis filter.
· The most suitable value of γ is selected subjectively by listening tests; for 8 kHz sampling, γ is usually between 0.8 and 0.9.
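Since W(z) only rescales the coefficients of A(z), applying it is a one-line transformation; the LP coefficients below are made up for illustration:

```python
import numpy as np
from scipy.signal import lfilter

a = np.array([0.9, -0.2])                  # hypothetical LP coefficients
A = np.concatenate(([1.0], -a))            # A(z) = 1 - sum a_i z^-i
gamma = 0.85                               # typical for 8 kHz sampling
A_gamma = A * gamma ** np.arange(len(A))   # A(z/gamma): a_i -> a_i * gamma^i

x = np.random.randn(160)                   # one 20 ms frame at 8 kHz
weighted = lfilter(A, A_gamma, x)          # apply W(z) = A(z) / A(z/gamma)
```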
Perceptual Weighting Filter
Encoder Operation
· Coefficients of the perceptual weighting filter and the pitch parameters are derived from the prediction error.
· The index of the excitation codebook, gain, long-term LP parameters, and LPC are encoded, packed, and transmitted as the CELP bit-stream.
Decoder Operation
· It basically unpacks and decodes various parameters
from the bit-stream, which are directed to the
corresponding block so as to synthesize the speech.
· A post-filter is added at the end to enhance the quality of
the resultant signal by attenuating the noise components
in the spectral valleys, thus enhancing the overall
spectrum.
Excitation Signal
· For efficient temporal analysis, a speech frame is usually divided into a number of subframes, typically four.
· For each subframe, the excitation signal is generated and the
error is minimized to find the optimum excitation.
· The excitation varies between the pulse train and the random
noise.
Excitation Signal
Excitation CodeBook Search
FS1016 CELP
ACELP
· c(n) = s₀·δ(n−m₀) + s₁·δ(n−m₁) + s₂·δ(n−m₂) + s₃·δ(n−m₃)
· n = 0, 1, …, 39; δ(n) is a unit pulse and m_k are the four pulse positions.
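Constructing such a codevector is straightforward; the pulse positions and signs below are arbitrary (in a real ACELP coder, e.g. G.729, each pulse is restricted to a fixed track of allowed positions):

```python
import numpy as np

def acelp_codevector(positions, signs, n=40):
    """c(n) = sum_k s_k * delta(n - m_k) over a 40-sample subframe."""
    c = np.zeros(n)
    for m, s in zip(positions, signs):
        c[m] += s
    return c

c = acelp_codevector(positions=[3, 11, 22, 38], signs=[+1, -1, +1, -1])
print(np.nonzero(c)[0])   # [ 3 11 22 38]
```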
Mixed Excitation Linear Prediction (MELP)
¹ MIXED EXCITATION
Period Jitter
Pulse Shaping
· These quantities are used to generate the impulse
response of the pulse generation filter, responsible for the
synthesis of periodic excitation.
· Note that Fourier magnitudes are calculated only when the frame
is voiced or jittery voiced.
Illustration of signals associated with the pulse generation filter.
Mixed Excitation
· In the figure, the frequency responses of the shaping filters are controlled by a set of parameters called voicing strengths, which measure the amount of “voicedness”.
· The responses of these filters vary with time.
Mixed Excitation
· A tenth-order LP analysis is performed on the input speech signal using a 200-sample (25 ms) Hamming window centered on the last sample in the current frame.
· The autocorrelation method is utilized together with the
Levinson–Durbin algorithm.
· The coefficients are quantized and used to calculate the
prediction-error signal.
The bit allocation for the MELP coder.
Thank you For Listening
Digital Signal Processing II
Chapter 3
DSPII- Speech Processing
Speech Recognition
Chapter 4
Outline
Introduction to Speech Recognition
· Speech recognition systems have a wide range of applications, from relatively simple isolated-word recognition systems for name-dialing, automated customer service, and voice-control of cars and machines, to continuous speech recognition.
· Like any pattern recognition problem, the fundamental
problem in speech recognition is the speech pattern
variability.
· Speech recognition methods aim to model, and where
possible reduce, the effects of the sources of speech
variability.
· The most challenging sources of variations in speech are
speaker characteristics including accent and background
noise.
Problem Formulation
· Find the word sequence W that is most probable given the acoustic observation X and the model set Λ:
Ŵ = argmax_W f(W | X, Λ)
Speech Unit
· Every communication system has a set of elementary symbols
(or alphabet) from which larger units such as words and
sentences are constructed.
· For example, in digital communication the basic symbols are “1” and “0”, and in written English the basic units are A to Z.
· The elementary linguistic unit of spoken speech is called a
phoneme and its acoustic realization is called a phone.
· There are between 60 and 80 phonemes in spoken English; the
exact number of phonemes depends on the dialect.
· In automatic speech processing the number of phonemes can
be reduced to between 40 and 60 phonemes, depending on the
dialect.
Entropy of Speech
· Entropy of an information source gives the theoretical
lower bound for the number of binary bits required to
encode the source.
· The entropy of a set of N communication symbols is
H(X) = − Σ_{i=1}^{N} P_X(x_i)·log P_X(x_i)
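A two-line check of the formula (base-2 logarithm, so the result is in bits per symbol):

```python
import numpy as np

def entropy_bits(p):
    """H(X) = -sum_i P(x_i) * log2 P(x_i); zero-probability terms dropped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))   # 1.0 bit: fair binary source
print(entropy_bits([0.9, 0.1]))   # ~0.47 bits: a skewed source needs fewer
```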
Feature Analysis
· The feature analysis subsystem converts time-domain raw
speech samples into a compact and efficient sequence of
spectral-temporal feature vectors that retain the phonemic
information but discard some of the variations due to speaker
variability and noise.
· The most widely used features for speech recognition are
cepstral feature vectors which are obtained from a Discrete
Cosine Transform (DCT) function of the logarithm of
magnitude spectrum of speech.
Discrete Cosine Transform Function
DCT vs DFT
· DCT outperforms DFT in terms of speech energy
compaction.
· A clean speech signal is divided into frames with 50% overlap, and the transform is performed.
· The speech is reconstructed.
· The Mean Square Error (MSE) is computed.
· DCT has the added advantage of higher spectral
resolution than the DFT for the same window size.
¹ For a window size of N, DCT has N independent spectral
components while DFT produces only N/2+1 independent
spectral components, as the other components are just
complex conjugates.
DCT vs DFT
· A DCT-based analysis chain is typically used to extract the cepstral vector from the speech signal.
Cepstral Coefficients
Cepstral Coefficients
· As the spectrum of real-valued speech is symmetric, the DFT can be replaced by a Discrete Cosine Transform (DCT) as:
c(m) = DCT{ln |X(k)|}
· The cepstral parameters encode the shape of the log spectrum.
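A minimal sketch of this cepstral analysis chain (window choice, FFT size, and the number of retained coefficients are illustrative):

```python
import numpy as np
from scipy.fft import dct

def cepstral_coefficients(frame, n_coeffs=13):
    """DCT of the log magnitude spectrum of one windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    return dct(np.log(spectrum + 1e-10), type=2, norm="ortho")[:n_coeffs]

frame = np.random.randn(256)                # stand-in for a real speech frame
print(cepstral_coefficients(frame).shape)   # (13,)
```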
Cepstral Coefficients
· Some useful properties of cepstrum features are as follows:
¹ The lower-index cepstral coefficients represent the spectral envelope of speech, whereas the higher-indexed coefficients represent fine detail (i.e. excitation) of the speech spectrum.
¹ Logarithmic compression of the dynamic range of the spectrum benefits the lower-power, higher-frequency speech components.
¹ Insensitivity to loudness variations of speech if the coefficient c(0)
is discarded.
¹ As the distribution of the power spectrum is approximately
lognormal, the logarithm of power spectrum, and hence the
cepstral coefficients, are approximately Gaussian.
¹ The cepstral coefficients are relatively de-correlated allowing
simplified modeling assumptions.
Cepstral Coefficients
· A widely used form of cepstrum is Mel Frequency
Cepstral Coefficients (MFCC), for scaling or
normalization of the frequencies.
· To obtain MFCC features, the spectral magnitude of FFT
frequency bins are averaged within frequency bands spaced
according to the mel scale which is based on a model of human
auditory perception. The scale is approximately linear up to
about 1000 Hz and approximates the sensitivity of the human ear
as:
fmel = 1125·log(1 + 0.0016·f)
· where fmel is the mel-scaled frequency of the original frequency f in
Hz.
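With the constants quoted on the slide (and assuming a natural logarithm), the conversion is:

```python
import numpy as np

def hz_to_mel(f):
    """Mel-scaled frequency, using the slide's constants."""
    return 1125.0 * np.log(1 + 0.0016 * np.asarray(f, dtype=float))

print(hz_to_mel([100, 500, 1000, 4000]))   # roughly linear up to ~1 kHz
```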
Cepstral Coefficients
· The cepstral coefficients can also be obtained recursively from the LP coefficients:
c(n) = −a_n − (1/n)·Σ_{k=1}^{n−1} k·c(k)·a_{n−k},  n ≥ 1
Cepstral Coefficients
· Temporal difference features are a simple and effective means
for description of the trajectory of speech parameters in time.
· For speech recognition relatively simple cepstral difference features
are defined by:
∂c(m)= c(m+1) − c(m−1) and ∂∂c(m)=∂c(m+1) − ∂c(m−1)
¹ where ∂c(m) is the first order time-difference of cepstral features,
in speech processing it is also referred as the velocity features.
¹ The second order time difference of cepstral features ∂∂c(m) is also
referred to as the acceleration features.
· Dynamic or difference features are effective in improving
speech recognition.
· The use of difference features is also important for improving
quality in speech coding, synthesis and enhancement
applications and the perceptual quality in speech synthesis.
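The difference features are a one-line operation on a matrix of cepstral vectors (edge frames are handled here by replication, which is an arbitrary choice):

```python
import numpy as np

def delta_features(c):
    """dc(m) = c(m+1) - c(m-1) along the frame axis of a
    (frames x coefficients) array."""
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    return padded[2:] - padded[:-2]

c = np.random.randn(100, 13)             # stand-in cepstral trajectory
velocity = delta_features(c)             # first-order ("velocity") features
acceleration = delta_features(velocity)  # second-order features
print(velocity.shape, acceleration.shape)
```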
Statistical Acoustic Models
· For speech recognition, an efficient set of statistical
acoustic models is needed to:
¹ capture the mean and variance of the spectral-temporal
trajectory of speech sounds,
¹ discriminate between different speech sounds.
· Such models for recognizing sequences include:
¹ Hidden Markov Models (HMMs): are commonly used for
medium to large vocabulary speech recognition systems.
¹ Dynamic Time Warping (DTW) method: mostly used for isolated-word applications, such as small-vocabulary isolated-word recognition in name-dialing, where it is practicable at low cost with digital signal processing hardware.
Markov Chain
Definitions
· A stochastic process {X(t)}, is a function of time where the
value depends on the random experiment.
· At each time t Є T, X(t) is a random variable.
· E: state space = the set of values that the stochastic process X(t) can take at each instant t ∈ T:
¹ Discrete.
¹ Continuous.
· T: set of times:
¹ Discrete.
¹ Continuous.
· 4 different types of stochastic processes.
Markov Chain
Poisson Process
· X, a stochastic process with continuous time and a discrete state space, is a Poisson process with parameter λ if and only if:
¹ X is a counting process.
¹ X is a stationary process and has independent increments.
¹ P(X(t+s) − X(s) = k) = ((λt)^k / k!)·e^{−λt},  k = 0, 1, 2, 3, …
º Ex: P(X(s+t) − X(s) = 0) = e^{−λt}
Example
· E={1, 2, 3, 4}
¹ P23 = P41 = 1
¹ P12+ P14=1
¹ P33+ P31=1
· P is a transition matrix.
Example
      | 1/2  1/2   0  |
P =   | 1/2   0   1/2 |
      | 1/3  2/3   0  |
(state transition diagram over states 1, 2, 3 with these probabilities)
Markov Chain
DTMC Analysis
· The analysis of the transient regime of a DTMC consists of determining the vector S^(n) of state probabilities S_j^(n) = P(X_n = j), j ∈ E: the probability that the process {X_n}, n ∈ N, is in state j at the nth step.
· S^(n) = [S_1^(n), S_2^(n), S_3^(n), …]; this vector depends on P and S^(0):
S^(n) = S^(n−1)·P
S^(n) = S^(0)·P^n
Example
      | 1/2  1/2   0  |
P =   | 1/2   0   1/2 |
      | 1/3  2/3   0  |
Suppose at t0 = 0: S_1^(0) = 1, S_2^(0) = 0, S_3^(0) = 0; that is, the process starts in state 1.
After one transition, where will we be? At which state?
S_1^(1) = S_1^(0)·P11 = 1/2
S_2^(1) = S_1^(0)·P12 = 1/2
S_3^(1) = S_1^(0)·P13 = 0
S^(2) = S^(1)·P = (1/2  1/2  0)·P = (1/2  1/4  1/4)
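The same computation in numpy, reproducing the distributions above:

```python
import numpy as np

P = np.array([[1/2, 1/2, 0],
              [1/2, 0, 1/2],
              [1/3, 2/3, 0]])
S0 = np.array([1.0, 0.0, 0.0])            # start in state 1

print(S0 @ P)                             # S(1) = [0.5  0.5  0.  ]
print(S0 @ np.linalg.matrix_power(P, 2))  # S(2) = [0.5  0.25 0.25]
```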
Markov Chain
Classification of states
· A DTMC is irreducible if and only if from any state i we can reach any state j in a finite number of steps:
¹ for all i, j ∈ E, there exists m such that Pij(m) ≠ 0.
· A state j is periodic if we can only return to it after a number of steps that is a multiple of some k > 1:
¹ there exists k > 1 such that Pjj(m) = 0 for m not a multiple of k.
¹ The period of state j is the largest integer k verifying this property.
¹ The period of a DTMC is equal to the GCD of the periods of its states.
¹ A DTMC is aperiodic if its period is equal to 1.
Markov Chain
Steady State
· The analysis of the steady state of a DTMC consists of examining the probability vector π^(n) as n tends to infinity.
· Calculation: solve π = π·P subject to Σ_j π_j = 1.
Example
      | 1/2  1/2   0  |
P =   | 1/2   0   1/2 |
      | 1/3  2/3   0  |
Is this chain irreducible? Is this chain aperiodic?
Example
· Let us represent the state of the weather by a 1st-order, ergodic Markov model M:
¹ State 1: rain
¹ State 2: cloud
¹ State 3: sun
º with the given state transition probabilities.
· Given today is sunny (x1 = 3), what is the probability, with model M, of observing the sequence of weather states “sun-sun-rain-cloud-sun”?
Example
P(X | M) = P(X = {3, 3, 1, 2, 3} | M)
= P(x1=3)·P(x2=3 | x1=3)·P(x3=1 | x2=3)·P(x4=2 | x3=1)·P(x5=3 | x4=2)
= S3·P33·P31·P12·P23
With S = (2/11, 6/11, 3/11):
P(X | M) = 3/11 · 0.6 · 0.2 · 0.3 · 0.1 ≈ 0.00098
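The arithmetic can be checked in a few lines; note that the slide only specifies the four transition probabilities used in the product, so the remaining entries of P below are made up to complete the rows:

```python
import numpy as np

P = np.array([[0.4, 0.3, 0.3],    # rain:  P12 = 0.3 given; rest hypothetical
              [0.3, 0.6, 0.1],    # cloud: P23 = 0.1 given; rest hypothetical
              [0.2, 0.2, 0.6]])   # sun:   P31 = 0.2, P33 = 0.6 given
S = np.array([2/11, 6/11, 3/11])  # state probabilities quoted on the slide

p = S[2] * P[2, 2] * P[2, 0] * P[0, 1] * P[1, 2]   # S3*P33*P31*P12*P23
print(round(p, 5))                # 0.00098
```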
Hidden Markov Model (HMM)
1. Number of states N, with state x ∈ {1, …, N};
2. Observation (emission) probabilities, B = {b_i(o)}, the probability of emitting observation o in state i;
3. Initial-state probabilities, π = {π_i = P(x_1 = i)}, 1 ≤ i ≤ N;
4. State-transition probabilities, A = {a_ij = P(x_t = j | x_{t−1} = i)}, for 1 ≤ i, j ≤ N.
Hidden Markov Models (HMMS)
Left-to-Right HMM
· In the speech recognition field, left-to-right models are used for HMMs.
· Left-to-right models obviously have no transitions from a higher-index state back to a lower-index state.
Left-to-Right HMM
· Left-to-right models have several advantages.
· For time-varying signals like speech, the nature of the signal is
reflected in the model topology.
· The number of possible paths through the model is reduced
significantly, simplifying computations; because:
¹ the number of states and transitions is relatively small,
¹ estimating transition and emission probabilities becomes more
tractable,
¹ fewer parameters to estimate implies more reliable estimates.
· Finally, practical experience in speech recognition shows that
the left-to-right topology performs better than any other
topology.
Example
Hidden Markov Models (HMMS)
· Three tasks within HMM framework:
¹ Evaluation problem: Given the observation sequence O and a model λ,
how do we efficiently evaluate the probability of O being produced by
the source model λ, P(Observations|Model), i.e. P (O|λ)?
¹ In our case, this means checking, for each candidate unit (e.g. a letter or phone model), the probability that it produced a given sequence of cepstral coefficients.
Evaluation Problem
· One can simply evaluate this probability directly from the
definition.
· It is obtained by summing the joint probability over all possible state sequences.
Evaluation Problem
Example
Evaluation Problem
· Compute the likelihood of a set of observations with a given model (trellis).
· α_t(i) is the probability of the partial observation sequence o1, o2, …, ot (until time t) when being in state i at time t.
Evaluation Problem
· The forward procedure can be summarized by the following steps to calculate the forward likelihood, with α_t(i) = P(o1, o2, …, ot, x_t = i | λ):
1. Initialization
Set t = 1;
α_1(i) = π_i·b_i(o1), 1 ≤ i ≤ N
2. Induction
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i)·a_ij]·b_j(o_{t+1}), 1 ≤ j ≤ N; set t = t + 1 and repeat until t = T
3. Termination
P(O | λ) = Σ_{i=1}^{N} α_T(i)
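A direct implementation of the three steps on a toy model (all probabilities are made up):

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(O | lambda): pi is (N,), A is (N, N) with a_ij = P(j | i),
    B is (N, K) with b_i(o), obs is a list of observation indices."""
    alpha = pi * B[:, obs[0]]            # initialization (t = 1)
    for o in obs[1:]:                    # induction
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()                   # termination

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_likelihood(pi, A, B, obs=[0, 1, 0]))
```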
Example
Decoding Problem
· There are several ways to define the decoding objective.
· The most trivial choice, following the Bayesian approach, is to pick the state sequence that is most probable given the observations.
Decoding Problem
· The complete Viterbi algorithm is given below:
1. Initialization
Set t = 1;
δ_1(i) = π_i·b_i(o1), 1 ≤ i ≤ N
ψ_1(i) = 0, 1 ≤ i ≤ N
2. Induction
δ_t(j) = b_j(o_t)·max_{1≤i≤N}[δ_{t−1}(i)·a_ij], 1 ≤ j ≤ N
ψ_t(j) = argmax_{1≤i≤N}[δ_{t−1}(i)·a_ij], 1 ≤ j ≤ N
3. Update time
Set t = t + 1;
Return to Step 2 if t ≤ T; otherwise, terminate the algorithm (go to Step 4).
4. Termination
P* = max_{1≤i≤N} δ_T(i)
x_T* = argmax_{1≤i≤N} δ_T(i)
5. Path (state sequence) backtracking
(a) Initialization: set t = T − 1
(b) Backtracking: x_t* = ψ_{t+1}(x_{t+1}*); set t = t − 1 and repeat (b) while t ≥ 1
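The same steps in code, on the toy model used for the forward procedure (probabilities are made up; a practical implementation would work with log probabilities to avoid underflow):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path and its probability."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                  # initialization
    for t in range(1, T):                         # induction
        scores = delta[t - 1][:, None] * A        # delta_{t-1}(i) * a_ij
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] * B[:, obs[t]]
    path = np.zeros(T, dtype=int)                 # termination
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):                # backtracking
        path[t] = psi[t + 1][path[t + 1]]
    return path, delta[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, obs=[0, 1, 0]))
```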
Training Problem
· The objective is to construct a model that best fits the training
data, or best represents the source that produced the data.
· This is an estimation problem: given the observation O, how
do we solve the inverse problem of estimating the parameters
in λ.
· We often follow the method of maximum likelihood (ML), i.e.
we choose λ* = (π, A, B) such that P(O|λ) is maximized for the
given training sequence O .
· The best model in maximum likelihood sense is therefore the
one that is most probable to generate the given observations.
Dynamic Time Warping (DTW)
D(i, j) = min[D(i −1, j), D(i, j −1), D(i −1, j −1)] + d(i, j)
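A direct implementation of this recursion (Euclidean distance on scalars; feature vectors would use a vector distance instead):

```python
import numpy as np

def dtw_distance(x, y):
    """D(i, j) = min(D(i-1, j), D(i, j-1), D(i-1, j-1)) + d(i, j)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])          # local distance d(i, j)
            D[i, j] = min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]) + d
    return D[n, m]

# Two versions of the same shape at different speaking rates align cheaply
a = np.sin(np.linspace(0, 3, 60))
b = np.sin(np.linspace(0, 3, 90))
print(dtw_distance(a, b))
```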
Spectral-Temporal Feature Resolution
Model Context Resolution
DSPII- Speech Processing
Exercises
Chapter 1
Exercise
Exercise
DSPII- Speech Processing
Exercises
Chapter 2
Exercise
Exercise
Exercises
Chapter 3
Exercise
DSPII- Speech Processing
Exercises
Chapter 4
Exercise
Exercise
Thank you For Listening
Digital Signal Processing II
Speech Processing- Exercises