Scale Transform in Speech Analysis
Abstract— In this paper, we study the scale transform of the spectral envelope of speech utterances by different speakers. This study is motivated by the hypothesis that the formant frequencies of different speakers are approximately related by a scaling constant for a given vowel. The scale transform has the fundamental property that the magnitudes of the scale transform of a function X(f) and of its scaled version √α X(αf) are the same. The methods presented here are useful in reducing variations in acoustic features. We show that F-ratio tests indicate better separability of vowels using scale-transform based features than mel-transform based features. The data used in the comparison of the different features consist of 200 utterances of four vowels extracted from the TIMIT data base.

Index Terms— Formants, scale cepstrum, speaker normalization, speech analysis, speech front-end.

I. INTRODUCTION

… of each other, i.e.,

F_i' = \alpha F_i   (2)

where α is the scale factor. While the uniform tube model is not the best model for the vocal tract, it illustrates the general features investigated within this paper, namely that there is scaling in the frequency domain, as our previous work has shown [8].

The paper is organized as follows. In the next section we discuss some of the properties of the scale transform that are relevant to this paper. In Section III we detail a method to obtain a smoothed estimate of the formant envelope. Subsequently, the discrete implementation of the scale-cepstrum is explained. In Section V, we describe the simulations that we have performed to compare the separability of vowels when scale-cepstral and mel-cepstral coefficients are used as features.
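The invariance property quoted in the abstract, that |D(c)| is unchanged when X(f) is replaced by √α X(αf), can be checked numerically. The sketch below is ours, not code from the paper: the test function, band limits, and grid size are arbitrary illustrative choices, and `scale_transform` is a hypothetical helper that evaluates the transform by sampling uniformly in t = ln f.

```python
import numpy as np

def scale_transform(x_of_f, c_values, f_lo=1e-2, f_hi=1e3, n=4096):
    """Numerically approximate D(c) = (2*pi)^(-1/2) * integral of
    x(f) * f^(-jc - 1/2) df by sampling uniformly in t = ln f."""
    t = np.linspace(np.log(f_lo), np.log(f_hi), n)
    dt = t[1] - t[0]
    f = np.exp(t)
    # with f = e^t the integrand becomes x(e^t) * e^{t/2} * e^{-jct}
    g = x_of_f(f) * np.exp(t / 2.0)
    return (dt / np.sqrt(2.0 * np.pi)) * np.array(
        [np.sum(g * np.exp(-1j * c * t)) for c in c_values])

x = lambda f: np.exp(-np.log(f) ** 2)        # a smooth test "spectral envelope"
alpha = 1.7
x_scaled = lambda f: np.sqrt(alpha) * x(alpha * f)

c = np.linspace(-5.0, 5.0, 11)
D1 = np.abs(scale_transform(x, c))
D2 = np.abs(scale_transform(x_scaled, c))
print(np.max(np.abs(D1 - D2)))               # close to zero: the magnitudes agree
```

Analytically, the scale transform of √α x(αf) is α^{jc} D(c), and |α^{jc}| = 1 for real c, which is why the two magnitude vectors coincide up to discretization error.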
of the scale transform. We have previously used the scale transform to study the effect of pitch variation and the resulting broadening of pitch harmonics [10], [11].

Before we describe the scale-transform based procedure, we discuss the method used to obtain a smoothed estimate of the formant envelope.

III. ESTIMATION OF FORMANT ENVELOPE

According to the source-filter model of speech production, vowels are produced by the vocal-tract filter driven by the pitch excitation. In the spectral domain this corresponds to the product of the spectrum of the vocal-tract filter and the spectrum of the pitch, i.e.,

S(f) = H(f)P(f)   (7)

where S(f), H(f), and P(f) are the observed spectrum, the frequency response of the vocal tract, and the spectrum of the pitch excitation function, respectively.

Vowels are almost completely described by the first three resonances (formants) of the vocal-tract frequency representation H(f). These resonances are affected by the length of the pharyngeal-oral tract, the constriction along the tract, and the narrowness of the tract.

Since we are interested only in the vocal-tract response, we would like to remove the effects of pitch excitation. In this paper, the following procedure proposed by Nelson [12] is used to suppress the effects of pitch; the method is similar to averaged-periodogram techniques [13]. Each frame of speech is segmented into overlapping subframes, and each subframe is Hamming windowed. For the purposes of this paper, we have chosen the subframes to be 96 samples long (for speech sampled at 16 kHz), and the overlap between subframes is 64 samples. We estimate the sample autocorrelation function for each subframe and average over the available subframes. This averaged autocorrelation estimate is then Hamming windowed and used to compute the scale-cepstrum described in the next section. We denote the windowed average autocorrelation estimate as R_w(\tau).

In the proposed analysis method, pitch is effectively suppressed since the duration of each subframe is less than the expected pitch interval. For every subframe that contains an individual pitch pulse there is a broadband energy contribution to the spectrum of that subframe, but not to any other subframe. The result is that the averaged spectrum contains all of the formant structure but almost none of the pitch structure.

The scale-cepstrum is obtained by computing the scale transform of \ln|X(f)|, where X(f) is the Fourier transform of R_w(\tau), the windowed averaged autocorrelation estimate; we denote the result by D(c). In the calculation of the scale-cepstrum, the analytic spectrum is used rather than the symmetric spectrum, since the scale properties are not valid for a symmetric log/mel-warped spectrum. The reason for using the logarithm operation is that it provides a more parsimonious representation in the scale-cepstral domain. Note that the logarithm operation affects only the magnitude of the spectral components. Therefore, formant envelopes that are frequency-scaled versions of each other continue to remain so even after the logarithm is taken. The magnitude of the scale-cepstrum is then used as a feature vector.

IV. DISCRETE IMPLEMENTATION OF SCALE-CEPSTRUM

Since the sampling frequency of the TIMIT data base is 16 kHz, for the computations in this paper we assume that the signal is bandlimited to 100–7000 Hz. The scale-cepstrum may therefore be represented as

D(c) = \frac{1}{\sqrt{2\pi}} \int_{100}^{7000} \ln|X(f)|\, f^{-jc - 1/2}\, df.   (8)

Using the substitution of variables f = e^t, we have

D(c) = \frac{1}{\sqrt{2\pi}} \int_{\ln 100}^{\ln 7000} \ln|X(e^t)|\, e^{t/2}\, e^{-jct}\, dt   (9)

which is the conventional Fourier transform of \ln|X(e^t)|\, e^{t/2}. For digital implementation, we sample in the t = \ln f domain and obtain an expression which can be easily implemented using the fast Fourier transform (FFT), i.e.,

D(c_k) \approx \frac{\Delta t}{\sqrt{2\pi}}\, e^{-j c_k \ln 100} \sum_{n=0}^{N-1} \ln|X(e^{t_n})|\, e^{t_n/2}\, e^{-j 2\pi k n / N}   (10)

where t_n = \ln 100 + n\,\Delta t, \Delta t = \ln(7000/100)/N, and c_k = 2\pi k/(N\,\Delta t). The phase term e^{-j c_k \ln 100} can be ignored, since it does not contribute to the magnitude of D(c_k). X(e^{t_n}) can be easily computed from the time-lag samples of the smoothed formant envelope as

X(e^{t_n}) = \sum_{m} R_w(mT)\, e^{-j 2\pi e^{t_n} m T}   (11)

where T is the sampling period in the time-lag domain. The magnitudes of the scale-cepstral coefficients, i.e., |D(c_k)|, are used as features that describe the various vowels.

V. COMPARISON OF FEATURES

In this section, we compare the separability of vowel classes when scale-cepstral and mel-cepstral coefficients are used as features. We point out that when we refer to scale-cepstral coefficients as features, we assume that the magnitudes |D(c_k)| are used in the feature vector. In comparing the separability afforded by the different cepstral features, a generalized F-ratio method is used [14], [15]. In deriving the F-ratio separability, let \mu_i and \Sigma_i denote the mean feature vector and sample covariance matrix, respectively, of the ith phoneme class. We assume equal probability of the phoneme classes. Let \mu = (1/L)\sum_{i=1}^{L}\mu_i, where L denotes the number of phoneme classes being compared.
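The envelope-estimation and discrete scale-cepstrum steps of Sections III and IV can be read as a single pipeline. The sketch below is our own reconstruction under stated assumptions (96-sample Hamming subframes with 64-sample overlap, the log spectrum resampled uniformly in ln f over 100–7000 Hz, and a plain FFT realizing the scale transform); all function and variable names are ours, and a synthetic harmonic frame stands in for real TIMIT data.

```python
import numpy as np

def smoothed_envelope_spectrum(frame, sub_len=96, hop=32, n_fft=1024):
    """Averaged-periodogram style envelope estimate: Hamming-window
    overlapping subframes, average their sample autocorrelations,
    Hamming-window the average, then take its one-sided spectrum."""
    win = np.hamming(sub_len)
    acfs = []
    for start in range(0, len(frame) - sub_len + 1, hop):
        s = frame[start:start + sub_len] * win
        # sample autocorrelation at nonnegative lags 0..sub_len-1
        acfs.append(np.correlate(s, s, mode='full')[sub_len - 1:])
    r_avg = np.mean(acfs, axis=0) * np.hamming(2 * sub_len - 1)[sub_len - 1:]
    return np.abs(np.fft.rfft(r_avg, n_fft))

def scale_cepstrum(frame, fs=16000, f_lo=100.0, f_hi=7000.0, n_warp=512):
    """Magnitude of the scale transform of the log envelope, computed by
    resampling the log spectrum on a grid uniform in t = ln f so that
    an ordinary FFT realizes the transform."""
    spec = smoothed_envelope_spectrum(frame)
    freqs = np.linspace(0.0, fs / 2.0, len(spec))
    t = np.linspace(np.log(f_lo), np.log(f_hi), n_warp)
    log_env = np.log(np.interp(np.exp(t), freqs, spec) + 1e-12)
    # weight by e^{t/2}, as required by the scale-transform kernel
    return np.abs(np.fft.fft(log_env * np.exp(t / 2.0)))[:n_warp // 2]

# usage: a synthetic voiced-like frame (harmonics of a 120-Hz pitch)
fs = 16000
time = np.arange(512) / fs
frame = sum(np.sin(2 * np.pi * 120 * k * time) / k for k in range(1, 30))
feat = scale_cepstrum(frame)
```

The overall constant \Delta t/\sqrt{2\pi} and the ignorable phase term are omitted here, since only the magnitudes enter the feature vector.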
42 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 7, NO. 1, JANUARY 1999
Fig. 1. Separability between phonemes using scale-cepstral (indicated by “- -”) and mel-cepstral (indicated by “--”) coefficients, based on 200 utterances of clean speech. The data are taken from dialect region seven of the TIMIT training set and are spoken by 22 male and 13 female talkers. It is seen that the scale-cepstrum provides better separability for most vowels. The few cases in which it is less than the mel-cepstrum may be explained by recalling that the scale-cepstrum is based on the assumption of linear scaling, while the mel-cepstrum uses a warping based on psychoacoustics. It is hoped that the use of such warping functions will further improve the scale-cepstrum.
We then compute the within-class and between-class scatter matrices, W and B, respectively, as

W = \frac{1}{L} \sum_{i=1}^{L} \Sigma_i   (12)

and

B = \frac{1}{L} \sum_{i=1}^{L} (\mu_i - \mu)(\mu_i - \mu)^T.   (13)

The separability criterion is then given by

\mathcal{F} = \mathrm{tr}(W^{-1} B).   (14)

Simulation Results: The data consist of a total of 200 utterances of each vowel from 22 male and 13 female speakers from dialect region seven of the TIMIT training set. /ae/, /iy/, /ih/, and /ow/ are the four vowels considered for the comparison of the different cepstra. Each utterance is chosen so that the corresponding phoneme is relatively stationary over at least 768 samples, and the middle 512 samples are used in the computation of the different cepstra. Each utterance therefore corresponds to 32 ms of speech, since the TIMIT data are sampled at 16 kHz. The scale-cepstral and mel-cepstral coefficients of clean and noisy utterances are computed. The noisy utterances are simulated by adding artificially generated white Gaussian noise. The signal-to-noise ratio (SNR) is defined as the ratio of the energy in the utterance to the noise energy.
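The scatter matrices and separability criterion of (12)–(14) amount to a few lines of linear algebra. A minimal sketch, assuming equal class priors and the trace criterion; the Gaussian toy classes are ours, not TIMIT data:

```python
import numpy as np

def f_ratio(classes):
    """Generalized F-ratio separability: within-class scatter W and
    between-class scatter B with equal priors, criterion tr(W^{-1} B).
    `classes` is a list of (n_i, d) arrays of feature vectors."""
    L = len(classes)
    mus = [c.mean(axis=0) for c in classes]
    mu = np.mean(mus, axis=0)                        # grand mean, equal priors
    W = sum(np.cov(c, rowvar=False, bias=True) for c in classes) / L
    B = sum(np.outer(m - mu, m - mu) for m in mus) / L
    return np.trace(np.linalg.solve(W, B))

# usage: widely separated classes yield a much larger criterion
rng = np.random.default_rng(0)
near = [rng.normal(0.0, 1.0, (200, 5)), rng.normal(0.2, 1.0, (200, 5))]
far = [rng.normal(0.0, 1.0, (200, 5)), rng.normal(3.0, 1.0, (200, 5))]
print(f_ratio(near), f_ratio(far))   # the second value is much larger
```

A larger criterion means the class means are far apart relative to the within-class spread, which is exactly what the comparisons in Figs. 1–3 measure.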
UMESH et al.: SPEECH ANALYSIS 43
Fig. 2. Separability between phonemes using scale-cepstral (indicated by “- -”) and mel-cepstral (indicated by “--”) coefficients, based on 200 utterances at 18 dB SNR. The data are taken from dialect region seven of the TIMIT training set and are spoken by 22 male and 13 female talkers. Note that the scale-cepstrum provides better separability than the mel-cepstrum.
In our experiments, we used SNR’s of 6 and 18 dB.

The scale-cepstrum is computed using the following values of the parameters: , , , and . Note that is interpolated by a factor of two.

The mel-cepstrum is implemented using the program in the signal processing information base [16] written by Slaney. Forty triangular filters were used. The center frequencies of the first 13 linearly spaced filters are 66.67 Hz apart, starting at 133.34 Hz. The center frequencies of the other 27 filters are chosen to have a ratio of 1.0711703 between successive filters. This covers the frequency band up to 8 kHz. Each filter’s magnitude frequency response has a triangular shape that is maximum at the center frequency and linearly decreases to zero at the center frequencies of the two adjacent filters. The vector of energies is computed by weighting the discrete Fourier transform (DFT) coefficients by the magnitude response of the filterbank. The mel-cepstral coefficients are then obtained by computing the inverse cosine transform of the vector of log energies.

In all cepstra, the zeroth coefficient is not used, since it is roughly a measure of the spectral energy. Coefficients 1–20 are used to measure the separability between the different phoneme classes. We remind the reader that the magnitudes of the scale-cepstral coefficients are used as elements of the feature vector. We normalize the vector of coefficients to unit energy. The size of the feature vector is varied from 1 to 20 coefficients and the separability measure is computed. Figs. 1–3 show the separability measure as a function of the number of coefficients for clean and noisy speech.
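The filterbank described above can be sketched directly from the stated parameters. This is our approximation of Slaney's construction, not his code: in particular, the handling of the two edge filters is our own choice, and the inverse cosine transform is implemented directly from the DCT-II definition.

```python
import numpy as np

def mel_filterbank(n_fft=512, fs=16000, n_lin=13, n_log=27,
                   f0=133.34, df=66.67, ratio=1.0711703):
    """Triangular filterbank: 13 linearly spaced centers 66.67 Hz apart
    from 133.34 Hz, then 27 centers in geometric progression."""
    centers = [f0 + i * df for i in range(n_lin)]
    while len(centers) < n_lin + n_log:
        centers.append(centers[-1] * ratio)
    # filter i rises from centers[i-1] to centers[i], falls to centers[i+1];
    # the outermost edges below are our assumed extrapolation
    edges = [centers[0] - df] + centers + [centers[-1] * ratio]
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    fb = np.zeros((n_lin + n_log, len(freqs)))
    for i in range(n_lin + n_log):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (freqs - lo) / (ctr - lo)
        down = (hi - freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(up, down))   # triangular shape
    return fb

def mel_cepstrum(frame, n_fft=512):
    fb = mel_filterbank(n_fft)
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    log_e = np.log(fb @ mag + 1e-12)
    # inverse cosine transform (DCT-II) of the log filterbank energies
    n = len(log_e)
    k = np.arange(n)
    dct = np.cos(np.pi * np.outer(k, 2 * np.arange(n) + 1) / (2 * n))
    return dct @ log_e

coeffs = mel_cepstrum(np.sin(2 * np.pi * 1000 * np.arange(512) / 16000.0))
```

As in the text, a feature vector would then drop the zeroth coefficient and keep coefficients 1–20.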
Fig. 3. Separability between phonemes using scale-cepstral (indicated by “- -”) and mel-cepstral (indicated by “--”) coefficients, based on 200 utterances at 6 dB SNR. The data are taken from dialect region seven of the TIMIT training set and are spoken by 22 male and 13 female talkers. It is observed that the scale-cepstrum is robust to noise, and the degradation in separability compared to clean speech is small.
From the figures, it is clear that the scale-cepstrum provides better separability than the mel-cepstrum for most vowels. Further, it is also seen to be robust to noise. It is also intuitively satisfying that vowels closer together in the vowel triangle are much more difficult to separate (e.g., /iy/ and /ih/) than those that are far apart (e.g., /iy/ and /ow/). For clean speech, the separability for a few vowels is less than with the mel-cepstrum; this may be partly explained by recalling that for the scale-cepstrum we have made a linear frequency-scaling assumption, i.e., α is independent of frequency. Our recent studies show that α is actually frequency dependent and that the corresponding warping function is similar to mel-warping [17]. It is hoped that using such a warping function may further improve the features.

VI. CONCLUSION

We have studied the scale transform of the formant envelopes of utterances of vowels by different speakers. This study was motivated by the hypothesis that the formant frequencies between different speakers are approximately related by a scaling constant for any given phoneme. We have described a procedure for the discrete implementation of the scale transform on the formant envelope, which we estimate using averaged-periodogram techniques. Our results on vowels indicate that scale-cepstral coefficients provide better separation between vowels than mel-cepstral coefficients. The
data used in the comparison were 200 utterances of four vowels from the TIMIT data base, and the generalized F-ratio was used as the criterion for comparing the features. We have shown that the scale-cepstral methods are useful in devising features that take into account variations due to scaling in the frequency domain.

REFERENCES

[1] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 357–366, Aug. 1980.
[2] C. Jankowski, H.-D. H. Vo, and R. P. Lippmann, “A comparison of signal processing front ends for automatic word recognition,” IEEE Trans. Speech Audio Processing, vol. 3, pp. 286–293, July 1995.
[3] D. J. Nelson, “Wavelet based analysis on speech signals,” in Proc. IEEE Conf. Automatic Speech Recognition, Snowbird, UT, Dec. 1993, pp. 89–90.
[4] ——, “Alternate spectral analysis methods in speech analysis,” in Proc. 13th Ann. Speech Research Symp., Johns Hopkins Univ., Baltimore, MD, June 1993, pp. 304–315.
[5] T. Kamm, G. Andreou, and J. Cohen, “Vocal tract normalization in speech recognition: Compensating for systematic speaker variability,” in Proc. 15th Ann. Speech Research Symp., Johns Hopkins Univ., Baltimore, MD, June 1995, pp. 175–178.
[6] P. Bamberg, “Vocal tract normalization,” Tech. Rep., Verbex, 1981.
[7] E. Eide and H. Gish, “A parametric approach to vocal tract length normalization,” in Proc. IEEE ICASSP’96, Atlanta, GA, 1996, pp. 346–349.
[8] L. Cohen, N. Marinovic, S. Umesh, and D. Nelson, “Scale-invariant speech analysis,” in Proc. Int. Soc. Optical Engineering, San Diego, CA, 1995, vol. 2569, pp. 522–537.
[9] L. Cohen, “The scale representation,” IEEE Trans. Signal Processing, vol. 41, pp. 3275–3292, Dec. 1993.
[10] N. Marinovic, L. Cohen, and S. Umesh, “Scale and harmonic type signals,” in Proc. Int. Soc. Optical Engineering, San Diego, CA, 1994, vol. 2303, pp. 411–418.
[11] ——, “Joint representation in time and frequency scale for harmonic type signals,” in Proc. IEEE Int. Symp. Time-Frequency and Time-Scale Analysis, Philadelphia, PA, 1994, pp. 84–87.
[12] D. Nelson, “Correlation based speech formant recovery,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Munich, Germany, Apr. 1997, pp. 1643–1646.
[13] A. H. Nuttall and G. C. Carter, “Spectral estimation using combined time and lag weighting,” Proc. IEEE, vol. 70, pp. 1115–1125, Sept. 1982.
[14] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1990.
[15] T. Parsons, Voice and Speech Processing. New York: McGraw-Hill, 1987.
[16] D. H. Johnson and P. N. Shami, “The signal processing information base,” IEEE Signal Processing Mag., pp. 36–42, Oct. 1993.
[17] S. Umesh, L. Cohen, and D. Nelson, “Frequency warping and psychoacoustic frequency scales,” in Proc. Int. Soc. Optical Engineering, San Diego, CA, 1996, vol. 2825, pp. 530–539.

Leon Cohen (SM’91–F’98) was born in Barcelona, Spain, on March 23, 1940. He received the B.S. degree from The City College of New York in 1958 and the Ph.D. degree from Yale University, New Haven, CT, in 1966. He has contributed to the fields of signal analysis, astronomy, physics, and mathematics.

Nenad Marinovic was born in Belgrade, Yugoslavia. He received the Dipl.-Ing. degree in electrical engineering from the University of Belgrade in 1975 and the Ph.D. degree in electrical engineering from the City University of New York in 1986. For almost 20 years, until 1996, he did signal processing research in both industry and academia. His main interests were signal theory, statistical signal processing, and their applications in biomedical imaging, computer vision, radar, sonar, underwater acoustics, and speech. From 1986 to 1996, he was a Professor in the Department of Electrical Engineering, City College of the City University of New York, where his research contributing to this paper was carried out. He has since moved to the financial industry, where he is currently involved in stochastic modeling of securities markets for the measurement and management of financial risks.

Douglas J. Nelson (M’98) was born in Minneapolis, MN, on November 5, 1945. He received the B.A. degree in mathematics from the University of Minnesota, Minneapolis, in 1967, and the Ph.D. degree in mathematics from Stanford University, Stanford, CA, in December 1972. He was a Visiting Assistant Professor of Mathematics at Carnegie Mellon University, Pittsburgh, PA, from September 1972 to September 1975. Since November 1975, he has been with the National Security Agency, Fort Meade, MD, where he has been involved in the development of signal processing algorithms for radar, communication signals, and speech.