NLP Unit V

The document discusses the fundamentals of speech recognition and automatic speaker recognition, detailing the stages of analysis, feature extraction, modeling, and testing. It covers various speech analysis techniques, feature extraction methods, and pattern comparison techniques, including temporal and spectral analysis. Additionally, it explains the mathematical and perceptual considerations in measuring spectral dissimilarity for effective speech pattern recognition.

UNIT 5

The goal of speech recognition is for a machine to be able to "hear," "understand," and "act upon" spoken information. The earliest speech recognition systems were attempted in the early 1950s at Bell Laboratories, where Davis, Biddulph, and Balashek developed an isolated-digit recognition system for a single speaker.

The goal of automatic speaker recognition is to analyze the speech signal and to extract, characterize, and recognize information about the speaker's identity. A speaker recognition system may be viewed as working in four stages:

1. Analysis

2. Feature extraction

3. Modeling

4. Testing

1. Speech Analysis Techniques

Speech data contains different types of information that reveal a speaker's identity. This includes speaker-specific information due to the vocal tract, the excitation source, and behavioral features. The information about behavioral features is also embedded in the signal and can be used for speaker recognition. The speech analysis stage deals with choosing a suitable frame size for segmenting the speech signal for further analysis and feature extraction. Speech analysis is carried out with the following three techniques.

1.1 Segmentation analysis - In this case, speech is analyzed using a frame size of 20 ms with a shift of 2.5 ms to extract speaker information. Such studies are used to extract vocal tract information for speaker recognition.

1.2 Sub-segmental analysis - Speech analyzed using a frame size of 5 ms with a shift of 2.5 ms is known as sub-segmental analysis. This technique is mainly used to analyze and extract the characteristics of the excitation source.

1.3 Supra-segmental analysis - In this case, speech is analyzed using a frame size of 250 ms with a shift of 6.25 ms. This technique is used to analyze the characteristics due to the behavioral traits of the speaker. A framing sketch covering all three cases is shown below.
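The following is a minimal sketch of how the three framing schemes above could be implemented; NumPy, the 8 kHz sampling rate, and the function name frame_signal are illustrative assumptions, not part of the original notes.

import numpy as np

def frame_signal(signal, fs, frame_ms, shift_ms):
    # Split a 1-D speech signal into overlapping frames.
    frame_len = int(fs * frame_ms / 1000)
    shift_len = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift_len)
    return np.stack([signal[i * shift_len : i * shift_len + frame_len]
                     for i in range(n_frames)])

fs = 8000
speech = np.random.randn(fs)  # one second of placeholder audio
segmental = frame_signal(speech, fs, 20, 2.5)          # vocal tract information
sub_segmental = frame_signal(speech, fs, 5, 2.5)       # excitation source
supra_segmental = frame_signal(speech, fs, 250, 6.25)  # behavioral traits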

2. FEATURE EXTRACTION METHODS

Features of speech play a vital part in distinguishing one speaker from others. Feature extraction reduces the size of the speech representation without damaging the information-bearing power of the speech signal.

Before the features are extracted, a sequence of preprocessing steps is first carried out. The first preprocessing step is pre-emphasis. This is achieved by passing the signal through a filter, usually a first-order finite impulse response (FIR) filter. This is succeeded by frame blocking, a method of partitioning the speech signal into frames; it removes the acoustic interference present at the start and end of the speech signal.

The framed speech signal is then windowed. A suitable window function is applied to minimize discontinuities at the start and end of each frame. The two most common categories of windows are the Hamming and rectangular windows. Windowing increases the sharpness of harmonics and eliminates discontinuities in the signal by tapering the beginning and end of the frame toward zero. It also reduces the spectral distortion created by the overlap.
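A minimal preprocessing sketch, assuming NumPy: pre-emphasis with a first-order FIR filter followed by Hamming windowing of one frame. The coefficient 0.97 is a conventional choice, not a value from these notes.

import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1], a first-order FIR pre-emphasis filter
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

frame = np.random.randn(160)  # placeholder 20 ms frame at 8 kHz
windowed = pre_emphasis(frame) * np.hamming(len(frame))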

Feature extraction is the process of obtaining different features, such as power, pitch, and vocal tract configuration, from the speech signal. Feature extraction involves analysis of the speech signal. Broadly, feature extraction techniques are classified as temporal analysis and spectral analysis techniques. In temporal analysis the speech waveform itself is used for analysis. In spectral analysis a spectral representation of the speech signal is used for analysis.

The temporal features (time-domain features) are simple to extract and have an easy physical interpretation, for example: the energy of the signal, zero-crossing rate, maximum amplitude, minimum energy, etc.

The spectral features (frequency-based features) are obtained by converting the time-based signal into the frequency domain using the Fourier transform, for example: fundamental frequency, frequency components, spectral centroid, spectral flux, spectral density, spectral roll-off, etc. These features can be used to identify notes, pitch, rhythm, and melody. One representative feature of each kind is sketched below.
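As an illustration of the two families, a small sketch (assuming NumPy; both function names are hypothetical) computes one temporal feature, the zero-crossing rate, and one spectral feature, the spectral centroid, for a single frame:

import numpy as np

def zero_crossing_rate(frame):
    # fraction of adjacent sample pairs whose sign differs
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2

def spectral_centroid(frame, fs):
    # magnitude-weighted mean frequency of the frame's spectrum
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return float(np.sum(freqs * mag) / np.sum(mag))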

2.1 Spectral Analysis techniques

2.1.1 Critical Band Filter Bank Analysis


It is one of the most fundamental concepts in speech processing and can be regarded as a crude model of the initial stages of transduction in the human auditory system. A critical band filter bank is simply a bank of linear-phase FIR bandpass filters arranged linearly along the Bark (or mel) scale. The bandwidths are chosen to be equal to the critical bandwidth at the corresponding center frequency. The Bark (critical band rate) scale and the mel scale are perceptual frequency scales defined as

Bark = 13 atan(0.76 f / 1000) + 3.5 atan((f / 7500)^2)    (1)

mel frequency = 2595 log10(1 + f / 700)    (2)

An expression for the critical bandwidth is

BWcritical = 25 + 75 [1 + 1.4 (f / 1000)^2]^0.69    (3)
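A direct transcription of Eqs. (1)-(3), assuming NumPy; frequencies are in Hz:

import numpy as np

def bark(f):
    # Eq. (1): Bark (critical band rate) scale
    return 13 * np.arctan(0.76 * f / 1000) + 3.5 * np.arctan((f / 7500) ** 2)

def mel(f):
    # Eq. (2): mel scale
    return 2595 * np.log10(1 + f / 700)

def critical_bandwidth(f):
    # Eq. (3): critical bandwidth at center frequency f
    return 25 + 75 * (1 + 1.4 * (f / 1000) ** 2) ** 0.69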

2.1.2. Cepstral Analysis


The cepstrum is computed by taking the inverse discrete Fourier transform (IDFT) of the logarithm of the magnitude of the discrete Fourier transform of a finite-length input signal:

c(n) = (1/N) ∑_{k=0}^{N−1} log|S(k)| exp(j2πnk/N)

where S(k) is the DFT of the input signal s(n); c(n) is defined as the cepstrum.

In speech recognition, cepstral analysis is used for formant tracking and pitch (f0) detection. The samples of c(n) in its first 3 ms describe the vocal tract response v(n) and can be separated from the excitation. The speech is viewed as voiced if c(n) exhibits sharp periodic pulses, and the interval between these pulses is taken as the pitch period. If no such structure is visible in c(n), the speech is considered unvoiced.
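A minimal real-cepstrum sketch following the definition above, assuming NumPy; the small floor added to the magnitude is a practical guard, not part of the definition:

import numpy as np

def real_cepstrum(frame):
    # IDFT of the log-magnitude of the DFT of the frame
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # floor avoids log(0)
    return np.fft.ifft(log_mag).real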

2.1.3 Mel Cepstrum Analysis


This analysis technique uses the cepstrum with a nonlinear frequency axis following the mel scale. To obtain the mel cepstrum, the speech waveform s(n) is first windowed with an analysis window w(n) and its DFT S(k) is computed. The magnitude of S(k) is then weighted by a series of mel filter frequency responses whose center frequencies and bandwidths roughly match those of the auditory critical band filters.

The next step in determining the mel cepstrum is to compute the energy in each mel filter:

Emel(n, l) = (1/Al) ∑_{k=Ll}^{Ul} |Vl(k) S(k)|^2

where Ul and Ll are the upper and lower frequency indices over which each filter Vl(k) is nonzero, and Al is the energy of the filter, which normalizes the filters according to their varying bandwidths so as to give equal energy for a flat input spectrum. The real cepstrum associated with Emel(n, l) is referred to as the mel cepstrum and is computed for the speech frame at time n as

Cmel(n, m) = (1/N) ∑_{l=0}^{N−1} log{Emel(n, l)} cos[2πm(l + 1/2)/N]

Such mel cepstral coefficients Cmel provide an alternative representation of speech spectra that exploits auditory principles as well as the decorrelating property of the cepstrum.
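A compact sketch of the two steps above, assuming NumPy and a precomputed triangular mel filterbank V of shape (number of filters, FFT bins); constructing the filterbank itself is omitted, and the names are illustrative:

import numpy as np

def mel_cepstrum(frame, V, n_coeffs=13):
    S = np.fft.rfft(frame * np.hamming(len(frame)))
    E = np.log(V @ (np.abs(S) ** 2) + 1e-12)       # filterbank energies Emel(n, l)
    L = len(E)
    m = np.arange(n_coeffs)[:, None]
    l = np.arange(L)[None, :]
    basis = np.cos(2 * np.pi * m * (l + 0.5) / L)  # cosine transform as above
    return (basis @ E) / L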

2.1.4 Linear Predictive Coding (LPC) Analysis


The basic idea behind linear predictive coding (LPC) analysis is that a speech sample can be approximated as a linear combination of past speech samples. By minimizing the sum of the squared differences (over a finite interval) between the actual speech samples and the linearly predicted ones, a unique set of predictor coefficients is determined. Speech is modeled as the output of a linear, time-varying system excited by either quasi-periodic pulses (during voiced speech) or random noise (during unvoiced speech). The linear prediction method provides a robust, reliable, and accurate way of estimating the parameters that characterize the linear time-varying system representing the vocal tract. A very important LPC parameter set, derived directly from the LPC coefficients αk, is the set of LPC cepstral coefficients cm. The recursion used for this is
cm = αm + ∑_{k=1}^{m−1} (k/m) ck αm−k,   1 ≤ m ≤ p

cm = ∑_{k=1}^{m−1} (k/m) ck αm−k,   m > p   (with αk = 0 for k > p)
This method is efficient, as it does not require explicit cepstral computation; it combines the decorrelating property of the cepstrum with the computational efficiency of LPC analysis.
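A sketch of the recursion above, assuming NumPy; the array a holds the predictor coefficients α1..αp, and the gain term (which affects only c0) is omitted:

import numpy as np

def lpc_to_cepstrum(a, n_coeffs):
    p = len(a)
    c = np.zeros(n_coeffs)
    for m in range(1, n_coeffs + 1):
        acc = a[m - 1] if m <= p else 0.0  # alpha_m term present only for m <= p
        for k in range(1, m):
            if m - k <= p:                 # alpha_{m-k} = 0 beyond order p
                acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c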

2.2 Temporal Analysis


It involves processing the waveform of the speech signal directly. It requires less computation than spectral analysis but is limited to simple speech parameters, e.g., power and periodicity.

2.2.1 Power Estimation


The use of some sort of power measure in speech recognition is fairly standard today. Power is rather simple to compute. It is computed on a frame-by-frame basis as

P(n) = (1/Ns) ∑_{m=0}^{Ns−1} (w(m) s(n − Ns/2 + m))^2

where Ns is the number of samples used to compute the power, s(n) denotes the signal, w(m) denotes the window function, and n denotes the sample index of the center of the window. In most speech recognition systems the Hamming window is almost exclusively used. Rather than using power directly, speech recognition systems use the logarithm of power multiplied by 10, defined as the power in decibels, in an effort to emulate the logarithmic response of the human auditory system. It is calculated as

Power in dB = 10 log10(P(n))    (4)

The major significance of P(n) is that it provides a basis for distinguishing voiced speech segments from unvoiced speech segments. The values of P(n) for unvoiced segments are significantly smaller than those for voiced segments. The power can be used to locate approximately the time at which voiced speech becomes unvoiced and vice versa.
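A direct sketch of the framewise power and its dB form, assuming NumPy; the small floor in the logarithm is a practical guard for silent frames:

import numpy as np

def frame_power_db(signal, n, Ns):
    # power of the Hamming-windowed frame centered at sample n, in dB
    w = np.hamming(Ns)
    seg = signal[n - Ns // 2 : n - Ns // 2 + Ns]
    P = np.mean((w * seg) ** 2)
    return 10 * np.log10(P + 1e-12)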

2.2.2 Fundamental Frequency Estimation


Fundamental frequency (f0), or pitch, is defined as the frequency at which the vocal cords vibrate during a voiced sound. Fundamental frequency has long been a difficult parameter to estimate reliably from the speech signal. Previously it was neglected for a number of reasons, including the large computational burden required for accurate estimation, the concern that unreliable estimation would be a barrier to achieving high performance, and the difficulty of characterizing the complex interactions between f0 and supra-segmental phenomena. It is useful in speech recognition of tonal languages (e.g., Chinese) and languages that have some tonal components (e.g., Japanese). Fundamental frequency is often processed on a logarithmic scale, rather than a linear scale, to match the resolution of the human auditory system. Of the various algorithms to estimate f0, we will consider two widely used ones: the Gold and Rabiner algorithm and the cepstrum-based pitch determination algorithm.

Gold and Rabiner algorithm


It is one of the earliest and simplest algorithms for f0 estimation. In this algorithm the speech signal is processed so as to create a number of impulse trains which retain the periodicity of the original signal and discard features that are irrelevant to the pitch detection process. This enables the use of very simple pitch detectors to estimate the period of each impulse train. The estimates of several of these pitch detectors are logically combined to infer the period of the speech waveform. The algorithm can be efficiently implemented either in special-purpose hardware or on a general-purpose computer.

Cepstrum based Pitch Determination


In the cepstrum, we observe that for voiced speech there is a peak at the fundamental period of the input speech segment; no such peak appears in the cepstrum of an unvoiced speech segment. If the cepstrum peak is above a preset threshold, the input speech is likely voiced, and the position of the peak is a good estimate of the pitch period; its inverse provides f0. If the peak does not exceed the threshold, the input speech segment is likely unvoiced.
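A minimal cepstrum-based pitch detector along these lines, reusing the real_cepstrum() sketch from Section 2.1.2; the search range and threshold are illustrative assumptions:

import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=400.0, thresh=0.1):
    c = real_cepstrum(frame)
    lo, hi = int(fs / fmax), int(fs / fmin)  # plausible pitch periods in samples
    peak = lo + int(np.argmax(c[lo:hi]))
    if c[peak] > thresh:
        return fs / peak                     # voiced: f0 in Hz
    return None                              # unvoiced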

3. DIFFERENT PATTERN COMPARISON TECHNIQUES


In a speech recognition system, speech patterns are compared to determine their similarity. Pattern comparison can be done in a wide variety of ways. The comparison between the phoneme features of an arbitrary piece of speech and those predefined for various phoneme classes can be made via a tree, as shown in the figure below.

[Figure: phoneme classification tree. Phonemes divide into vowels, diphthongs (AY, OY, AW, EY), and consonants. Vowels split into front (IY, IH, EH, AE), mid (AA, ER, AH, AX, AO), and back (UW, UH, OW). Consonants split into nasals (M, N, NG), stops (voiced: B, D, G; unvoiced: P, T, K), fricatives (voiced and unvoiced), whisper (H), affricates, liquids (L, R), and glides/semivowels (W, Y).]

The pattern-based approach is of great interest in speech recognition systems. We define a test pattern T as the concatenation of spectral frames over the duration of the speech, such that

T = {t1, t2, t3, ..., tl}

where ti is the spectral vector of the input speech at time i and l is the total number of frames of speech. In a similar manner we define a set of reference patterns {R1, R2, R3, ..., Rv}. Each reference pattern Rj, 1 ≤ j ≤ v, is compared with T in order to identify the reference pattern that has the minimum dissimilarity, and the spoken input is associated with this pattern.

Various techniques exist for pattern comparison in speech recognition systems.

3.1 Speech Distortion Measures


3.1.1 Mathematical Considerations
A key component of most pattern-comparison algorithms is a prescribed measurement of dissimilarity between two feature vectors. This measurement of dissimilarity can be handled with mathematical rigor if the patterns are visualized in a vector space.

Assume we have two feature vectors x, y defined on a vector space χ. We define a metric or distance function d on the vector space χ as a real-valued function on the Cartesian product χ × χ such that

0 ≤ d(x, y) < ∞ for x, y ∈ χ, and d(x, y) = 0 if and only if x = y

d(x, y) = d(y, x) for x, y ∈ χ

d(x, y) ≤ d(x, z) + d(y, z) for x, y, z ∈ χ

In addition, a distance function is called invariant if d(x + z, y + z) = d(x, y).

The properties in the above three equations are commonly referred to as positive definiteness, symmetry, and the triangle inequality, respectively. A metric with these properties ensures a high degree of mathematical tractability. If a measure of difference d satisfies only the positive definiteness property, we customarily call it a distortion measure, particularly when the vectors are representations of signal spectra. The sketch below checks these properties for a familiar example.
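As a concrete illustration (assuming NumPy), the Euclidean distance between feature vectors satisfies all three metric properties as well as invariance:

import numpy as np

def d(x, y):
    return float(np.linalg.norm(x - y))

x, y, z = np.random.randn(3, 12)             # three placeholder 12-dim feature vectors
assert d(x, y) >= 0 and d(x, x) == 0         # positive definiteness
assert np.isclose(d(x, y), d(y, x))          # symmetry
assert d(x, y) <= d(x, z) + d(y, z)          # triangle inequality
assert np.isclose(d(x + z, y + z), d(x, y))  # invariance under translation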

3.1.2 Perceptual Considerations


The choice of an appropriate measure of spectral dissimilarity, or distance measure, rests on the concept of subjective judgment of sound difference. A phonetically relevant distance measure has the property that spectral changes that perceptually make the sounds being compared seem different should be associated with a large distance.

Consider two spectral representations S(ω) and S′(ω) and a distance measure d(S, S′). Spectral changes that perceptually lead to sounds judged as phonetically different include (a) significant differences in formant locations, i.e., the spectral resonances of S(ω) and S′(ω) occur at very different frequencies, and (b) significant differences in formant bandwidths, i.e., the frequency widths of the spectral resonances of S(ω) and S′(ω) are very different.

3.1.3 Spectral Distortion Measures


Measuring the difference between two speech patterns in terms of average or accumulated spectral distortion is a very reasonable way of comparing patterns, both in terms of its mathematical tractability and its computational efficiency. For computational convenience, when processing a time-varying signal such as speech, we often use the short-term autocorrelation defined over a frame of speech samples {x(i), i = 0, 1, ..., N−1}:

r(n) = ∑_{i=0}^{N−1−n} x(i) x(i + n) for n = 0, 1, ..., N − 1, and r(n) = 0 for n ≥ N
3.2 Log Spectral Distance
The log-spectral distance (LSD), also referred to as log-spectral distortion or root mean square log-spectral distance, is a distance measure (expressed in dB) between two spectra. The log-spectral distance between power spectra P(ω) and P̂(ω) is commonly defined as the root mean square of their level difference in dB:

DLS = sqrt( (1/2π) ∫_{−π}^{π} [10 log10(P(ω)/P̂(ω))]^2 dω )

The log spectral distance is symmetric.

In speech coding, the log spectral distortion for a given frame is defined as the root mean square difference between the original LPC log power spectrum and the quantized or interpolated LPC log power spectrum. Usually the average spectral distortion over a large number of frames is calculated and used as the measure of performance of quantization or interpolation.

Thus, the measure gives the log spectral distance between a speech signal and a distorted version of it, and it can also be computed for a specified sub-band. It is used for evaluating the quality of processed speech in comparison to the original speech.
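A discrete-frequency sketch of the log-spectral distance, assuming NumPy; the integral above is approximated by a mean over FFT bins, and the small floors guard against log of zero:

import numpy as np

def log_spectral_distance(frame_ref, frame_deg):
    P = np.abs(np.fft.rfft(frame_ref)) ** 2 + 1e-12
    P_hat = np.abs(np.fft.rfft(frame_deg)) ** 2 + 1e-12
    diff_db = 10 * np.log10(P / P_hat)
    return float(np.sqrt(np.mean(diff_db ** 2)))  # RMS log-spectral distance in dB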

3.3 Cepstral Distance


The cepstrum is a common transform used to gain information from a person's speech signal. It can be used to separate the excitation signal (which contains the words and the pitch) from the transfer function (which contains the voice quality). The word cepstrum has the first syllable "ceps," which is "spec" with the letters rearranged; thus the words cepstrum and spectrum are related through an interesting naming convention that we will use again. The figure below provides a block diagram showing how a signal is converted to the cepstral domain.

The complex cepstrum of a signal is defined as the Fourier transform of the log of the signal spectrum. For a power spectrum (magnitude-squared Fourier transform) S(ω), which is symmetric with respect to ω = 0 and is periodic for a sampled data sequence, the Fourier series representation of log S(ω) can be expressed as

log S(ω) = ∑_{n=−∞}^{∞} cn e^(−jnω)

where cn = c−n are real and are often referred to as the cepstral coefficients.
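A truncated cepstral-distance sketch built on these coefficients, reusing the real_cepstrum() sketch from Section 2.1.2; keeping only the first L coefficients is a common practical truncation, and the names are illustrative:

import numpy as np

def cepstral_distance(frame_a, frame_b, L=13):
    ca = real_cepstrum(frame_a)[:L]
    cb = real_cepstrum(frame_b)[:L]
    return float(np.sqrt(np.sum((ca - cb) ** 2)))  # Euclidean distance in cepstral space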

3.4 Weighted Cepstral Distance


A weighted cepstral distance measure has been proposed and tested in a speaker-independent isolated word recognition system using standard DTW (dynamic time warping) techniques. The measure is a statistically weighted distance measure with weights equal to the inverse variance of the cepstral coefficients. The experimental results show that the weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance and the log likelihood ratio distance measures across two different databases: recognition accuracy was more than 3% higher than that obtained using the simple Euclidean cepstral distance measure and about 2% higher than the results using the log likelihood ratio distance measure. The recognition error rate obtained using the weighted cepstral distance measure was about 1 percent for digit recognition, less than one fourth of the error rate obtained using the simple Euclidean cepstral distance measure and about one third of that obtained using the log likelihood ratio distance measure. The most significant performance characteristic of the weighted cepstral distance was that it tended to equalize the performance of the recognizer across different talkers.

The weighted cepstral distance measure dWCEP is described by the following equation:

dWCEP = ∑_{i=1}^{p} w(i) (ct(i) − cr(i))^2

where w(i) is the inverse of the ith diagonal element vii of the covariance matrix V. The measure dWCEP is a weighted Euclidean distance measure in which each individual cepstral component c(i) is variance-equalized by the weight w(i).
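A direct sketch of dWCEP, assuming NumPy; here w would hold the inverse diagonal elements of the cepstral covariance matrix V, estimated from training data:

import numpy as np

def weighted_cepstral_distance(c_test, c_ref, w):
    # weighted Euclidean distance; w[i] = 1 / v_ii variance-equalizes each component
    return float(np.sum(w * (c_test - c_ref) ** 2))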
