NLP Unit V
The goal of speech recognition is for a machine to be able to "hear," "understand," and "act upon" spoken
information. The earliest speech recognition systems were attempted in the early 1950s at Bell Laboratories,
where Davis, Biddulph, and Balashek developed an isolated-digit recognition system for a single speaker.
The goal of automatic speaker recognition is to analyze, extract, characterize, and recognize information about the
speaker's identity. A speaker recognition system may be viewed as working in four stages:
1. Analysis
2. Feature extraction
3. Modeling
4. Testing
Speech data contains different types of information that reveal a speaker's identity. This includes speaker-specific
information due to the vocal tract, the excitation source, and behavioral features. The information about behavioral
features is also embedded in the signal and can be used for speaker recognition. The speech analysis stage deals
with selecting a suitable frame size for segmenting the speech signal for further analysis and feature extraction.
Speech analysis is carried out using the following three techniques.
1.1 Segmental analysis - In this case speech is analyzed using a frame size of 20 ms with a shift of 2.5 ms to
extract speaker information. Such studies are mainly used to extract the vocal tract information of the speaker for recognition.
1.2 Sub-segmental analysis - Speech analyzed using a frame size of 5 ms with a shift of 2.5 ms is known as sub-segmental
analysis. This technique is mainly used to analyze and extract the characteristics of the excitation source.
1.3 Supra-segmental analysis - In this case, speech is analyzed using a frame size of 250 ms with a shift of 6.25
ms. This technique is used to analyze the characteristics due to the behavioral traits of the speaker.
Features of speech play a vital part in distinguishing one speaker from others. Feature extraction reduces the
dimensionality of the speech signal without causing any damage to its discriminative power.
Before the features are extracted, a sequence of preprocessing phases is first carried out. The first
preprocessing step is pre-emphasis. This is achieved by passing the signal through a first-order finite impulse
response (FIR) filter. This is succeeded by frame blocking, a method of partitioning the
speech signal into frames; it removes the acoustic interference present at the start and end of the speech signal.
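A minimal sketch of these two preprocessing steps in Python, assuming NumPy; the 0.97 coefficient and the 25 ms frame / 10 ms shift values are common illustrative choices, not prescribed by the text:

import numpy as np

def pre_emphasis(x, alpha=0.97):
    # First-order FIR pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_blocking(x, sr, frame_ms=25, shift_ms=10):
    # Partition the signal into overlapping frames of frame_ms milliseconds
    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])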
The framed speech signal is then windowed. A suitable window function is applied to minimize
discontinuities at the start and end of each frame. The two most common categories of windows are the Hamming and
rectangular windows. Windowing increases the sharpness of harmonics and eliminates signal discontinuities by tapering
the beginning and end of each frame toward zero. It also reduces the spectral distortion caused by the overlap.
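A short sketch of this windowing step, applied to frames as produced by a frame-blocking step like the one above:

import numpy as np

def apply_window(frames, kind="hamming"):
    # Taper each frame toward zero at both ends to reduce spectral
    # leakage from the discontinuities at the frame boundaries
    n = frames.shape[1]
    window = np.hamming(n) if kind == "hamming" else np.ones(n)  # rectangular
    return frames * window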
Feature extraction is the process of obtaining different features such as power, pitch, and vocal tract
configuration from the speech signal. Feature extraction involves analysis of the speech signal. Broadly,
feature extraction techniques are classified into temporal analysis and spectral analysis techniques. In
temporal analysis the speech waveform itself is used for analysis; in spectral analysis a spectral
representation of the speech signal is used for analysis.
Temporal features (time-domain features) are simple to extract and have an easy physical interpretation;
examples include the energy of the signal, zero-crossing rate, maximum amplitude, and minimum energy.
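For illustration, a sketch computing two of these time-domain features per frame (NumPy; frame layout as above):

import numpy as np

def short_time_energy(frames):
    # Energy of each frame: sum of squared samples
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    # Fraction of adjacent sample pairs whose signs differ
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)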
Spectral features (frequency-based features) are obtained by converting the time-domain signal into the
frequency domain using the Fourier transform; examples include fundamental frequency, frequency components, spectral
centroid, spectral flux, spectral density, and spectral roll-off. These features can be used to identify notes,
pitch, rhythm, and melody.
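A sketch of one such frequency-domain feature, the spectral centroid, computed from the magnitude spectrum of each frame; the exact normalization is an illustrative choice:

import numpy as np

def spectral_centroid(frames, sr):
    # Magnitude spectrum of each frame
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    # Centroid: magnitude-weighted mean frequency of the spectrum
    return (mag @ freqs) / (np.sum(mag, axis=1) + 1e-10)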
In speech recognition, cepstral analysis is used for formant tracking and pitch (F0) detection. The samples of the
cepstrum c(n) in its first 3 ms describe the vocal tract component v(n) and can be separated from the excitation. The
latter is viewed as voiced if c(n) exhibits sharp periodic pulses; the interval between these pulses is then taken as
the pitch period. If no such structure is visible in c(n), the speech is considered unvoiced.
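A minimal sketch of this voiced/unvoiced and pitch decision via the real cepstrum; the pitch search range and the peak threshold below are illustrative assumptions:

import numpy as np

def cepstral_pitch(frame, sr, fmin=50.0, fmax=400.0, threshold=0.1):
    # Real cepstrum: inverse FFT of the log magnitude spectrum
    spectrum = np.abs(np.fft.fft(frame)) + 1e-10
    cep = np.fft.ifft(np.log(spectrum)).real
    # Look for a sharp peak at quefrencies of plausible pitch periods
    lo, hi = int(sr / fmax), int(sr / fmin)
    peak = lo + int(np.argmax(cep[lo:hi]))
    if cep[peak] > threshold:     # periodic pulses present -> voiced
        return sr / peak          # pitch estimate in Hz
    return None                   # unvoiced: no periodic structure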
where U_l and L_l are the upper and lower frequency indices over which each filter is nonzero, and A_l is the energy
of the filter, which normalizes the filters according to their varying bandwidths so as to give equal energy for a flat
input spectrum. The real cepstrum associated with E_mel(n, l) is referred to as the mel-cepstrum and is computed for
the speech frame at time n as
C_mel(n, m) = (1/N) ∑_{l=0}^{N−1} log{E_mel(n, l)} cos[πm(l + 1/2)/N]
Such mel-cepstral coefficients C_mel provide an alternative representation of speech spectra which exploits auditory
principles as well as the decorrelating property of the cepstrum.
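A from-scratch sketch of this computation (NumPy). The filter count, FFT size, and the triangular filterbank construction are illustrative assumptions; in practice a library routine is typically used:

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_cepstrum(frame, sr, n_filters=26, n_ceps=13, n_fft=512):
    # Power spectrum of one analysis frame
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2

    # Triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(1, n_filters + 1):
        left, center, right = bins[l - 1], bins[l], bins[l + 1]
        for k in range(left, center):
            fbank[l - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[l - 1, k] = (right - k) / max(right - center, 1)

    # Log filterbank energies E_mel(n, l), then DCT -> mel-cepstrum
    E = np.log(fbank @ spec + 1e-10)
    m = np.arange(n_ceps)[:, None]
    l = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * m * (l + 0.5) / n_filters)
    return (dct @ E) / n_filters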
P(n) = (1/N_s) ∑_{m=0}^{N_s−1} [w(m) s(n − N_s/2 + m)]²
where N_s is the number of samples used to compute the power, s(n) denotes the signal, w(m) denotes the
window function, and n denotes the sample index of the center of the window. In most speech
recognition systems the Hamming window is almost exclusively used. Rather than using power directly,
speech recognition systems use the logarithm of power multiplied by 10, defined as the power in decibels,
in an effort to emulate the logarithmic response of the human auditory system. It is calculated as

P_dB(n) = 10 log₁₀ P(n)
The major significance of P(n) is that it provides a basis for distinguishing voiced speech segments from
unvoiced speech segments. The values of P(n) for unvoiced segments are significantly smaller than those
for voiced segments. The power can be used to locate approximately the time at which voiced speech
becomes unvoiced and vice versa.
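A sketch of P(n), its decibel form, and a simple voiced/unvoiced decision; the window length and the −30 dB threshold below are illustrative values, not from the text:

import numpy as np

def short_time_power_db(x, n, Ns=400):
    # P(n) = (1/Ns) * sum_m [w(m) * s(n - Ns/2 + m)]^2, Hamming-windowed;
    # assumes n is at least Ns/2 samples away from either end of x
    w = np.hamming(Ns)
    seg = x[n - Ns // 2 : n - Ns // 2 + Ns]
    p = np.sum((w * seg) ** 2) / Ns
    return 10.0 * np.log10(p + 1e-12)   # power in decibels

def is_voiced(x, n, threshold_db=-30.0):
    # Voiced segments carry markedly more power than unvoiced ones
    return short_time_power_db(x, n) > threshold_db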
Classification of phonemes can be done in a wide variety of ways. The comparison between the phoneme features of an
arbitrary piece of speech and those predefined for the various phoneme classes can be made via a tree, as shown in the figure below.
[Figure: Phoneme classification tree. Phonemes split into vowels (front: IY, IH, EH, AE; mid: AA, ER, AH, AX, AO; back: UW, UH, OW), diphthongs (AY, OY, AW, EY), semivowels (liquids: L, R; glides: W, Y), and consonants (nasals: M, N, NG; stops: voiced B, D, G and unvoiced P, T, K; fricatives: voiced and unvoiced; whisper: H; affricates).]
The pattern-based approach is of great interest in speech recognition systems. We define a test pattern T as the
concatenation of spectral frames over the duration of the speech, such that

T = {t1, t2, t3, ..., tl}

where ti is the spectral vector of the input speech at time i and l is the total number of frames of speech. In
a similar manner we define a set of reference patterns {R1, R2, R3, ..., Rv}, where each reference pattern
Rj, 1 ≤ j ≤ v, is compared with the test pattern in order to identify the reference pattern that has the minimum
dissimilarity, and to associate the spoken input with this pattern.
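A sketch of this nearest-reference decision, using a simple frame-wise Euclidean dissimilarity in place of a full time-alignment procedure such as dynamic time warping; it assumes test and reference patterns have the same number of frames:

import numpy as np

def dissimilarity(T, R):
    # Average Euclidean distance between corresponding spectral frames
    return np.mean(np.linalg.norm(T - R, axis=1))

def recognize(T, references):
    # Associate the input with the reference of minimum dissimilarity
    scores = [dissimilarity(T, R) for R in references]
    return int(np.argmin(scores))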
Various techniques exist for pattern comparison in speech recognition systems.
Assume we have two feature vectors x, y defined on a vector space χ. We define a metric or distance function
d on the vector space χ as a real-valued function on the Cartesian product χ × χ such that:

1. 0 ≤ d(x, y) < ∞ for x, y ∈ χ, and d(x, y) = 0 if and only if x = y
2. d(x, y) = d(y, x) for x, y ∈ χ
3. d(x, y) ≤ d(x, z) + d(z, y) for x, y, z ∈ χ

The properties in the above three equations are commonly referred to as the positive definiteness, the symmetry,
and the triangle inequality conditions, respectively. A metric with the above properties ensures a high
degree of mathematical tractability.
If a measure of difference d satisfies only the positive definiteness property, we customarily call it a distortion
measure, particularly when the vectors are representations of signal spectra.
Consider two spectral representations S(ω) and S′(ω) compared using a distance measure d(S, S′). Spectral changes
that perceptually lead to sounds judged as being phonetically different include: (a) significant differences
in formant locations, i.e., the spectral resonances of S(ω) and S′(ω) occur at very different frequencies; and
(b) significant differences in formant bandwidths, i.e., the frequency widths of the spectral resonances of S(ω)
and S′(ω) are very different.
Given a finite-length signal {x(i), i = 0, 1, ..., N−1}, the associated autocorrelation sequence is

r_n = { r_N(n),  n = 0, 1, ..., N−1
        0,       n ≥ N }
3.2 Log Spectral Distance
The log-spectral distance (LSD), also referred to as log-spectral distortion or root mean square log-spectral
distance, is a distance measure (expressed in dB) between two spectra. The log-spectral distance between spectra
P(ω) and P̂(ω) is defined as

D_LS = √{ (1/2π) ∫_{−π}^{π} [10 log₁₀ P(ω) − 10 log₁₀ P̂(ω)]² dω }

where P(ω) and P̂(ω) are the two power spectra. The log-spectral distance is symmetric.
In speech coding, the log spectral distortion for a given frame is defined as the root mean square difference between
the original LPC log power spectrum and the quantized or interpolated LPC log power spectrum. Usually the average
spectral distortion over a large number of frames is calculated and used as the measure of the performance
of quantization or interpolation.
Thus, it calculates the log spectral distance between a speech signal and a distorted version of it, and it can
also calculate this distance for a specified sub-band. This measure is used to evaluate the quality of
processed speech in comparison to the original speech.
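A sketch of this distance between two power spectra sampled on a uniform frequency grid; the integral is approximated by a mean over the spectral bins:

import numpy as np

def log_spectral_distance(P, P_hat):
    # RMS difference of the two log power spectra, in dB
    diff = 10.0 * np.log10(P + 1e-10) - 10.0 * np.log10(P_hat + 1e-10)
    return np.sqrt(np.mean(diff ** 2))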
The complex cepstrum of a signal is defined as the Fourier transform of the logarithm of the signal spectrum. For a
power spectrum (magnitude-squared Fourier transform) S(ω), which is symmetric with respect to ω = 0 and is periodic
for a sampled-data sequence, the Fourier series representation of log S(ω) can be expressed as

log S(ω) = ∑_{n=−∞}^{∞} c_n e^{−jnω}

where c_n = c_{−n} are real and are often referred to as the cepstral coefficients.
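A sketch of recovering these cepstral coefficients numerically as the inverse Fourier transform of log S(ω); c_n = c_{−n} holds because the log power spectrum is real and even:

import numpy as np

def cepstral_coefficients(frame, n_coeffs=13):
    # Log of the power spectrum (symmetric about omega = 0, periodic)
    log_S = np.log(np.abs(np.fft.fft(frame)) ** 2 + 1e-10)
    # Fourier series coefficients of log S(w); real-valued by symmetry
    c = np.fft.ifft(log_S).real
    return c[:n_coeffs]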
The weighted cepstral distance measure d_WCEP is described by the following equation:

d_WCEP = ∑_{i=1}^{p} w(i) [c_t(i) − c_r(i)]²
where w(i) is the inverse of the i-th diagonal element v_ii of the covariance matrix V. The measure d_WCEP is a
weighted Euclidean distance measure in which each individual cepstral component c(i) is variance-equalized by the
weight w(i).
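A sketch of d_WCEP, assuming the covariance matrix V has already been estimated from training cepstra:

import numpy as np

def weighted_cepstral_distance(c_t, c_r, V):
    # w(i) is the inverse of the i-th diagonal element of V,
    # variance-equalizing each cepstral component
    w = 1.0 / np.diag(V)
    return np.sum(w * (c_t - c_r) ** 2)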