
Scale Transform in Speech Analysis


S. Umesh, Member, IEEE, Leon Cohen, Fellow, IEEE, Nenad Marinovic, and Douglas J. Nelson, Member, IEEE

Abstract—In this paper, we study the scale transform of the spectral envelope of speech utterances by different speakers. This study is motivated by the hypothesis that the formant frequencies of different speakers are approximately related by a scaling constant for a given vowel. The scale transform has the fundamental property that the magnitudes of the scale transform of a function X(f) and of its scaled version √α X(αf) are the same. The methods presented here are useful in reducing variations in acoustic features. We show that F-ratio tests indicate better separability of vowels when scale-transform based features are used rather than mel-transform based features. The data used in the comparison of the different features consist of 200 utterances of four vowels extracted from the TIMIT data base.

Index Terms—Formants, scale cepstrum, speaker normalization, speech analysis, speech front-end.

Manuscript received April 12, 1996; revised October 5, 1998. This work was supported in part by the Department of Defense through the NSA HBCU/MI Program. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Joseph Picone.
S. Umesh was with the City University of New York, New York, NY 10021 USA. He is now with the Department of Electrical Engineering, Indian Institute of Technology, Kanpur 208 016, India.
L. Cohen is with Hunter College, City University of New York, New York, NY 10021 USA (e-mail: lcchc@cunyvm.cuny.edu).
N. Marinovic is with City College, City University of New York, New York, NY 10021 USA.
D. Nelson is with the U.S. Department of Defense, Ft. Meade, MD 20755 USA.

I. INTRODUCTION

There have been a number of acoustic features that have been studied in speech analysis [1], [2]. In this paper, we study features based on the scale transform. The scale-transform based features are motivated by vocal-tract normalization methods. The relationship of vocal-tract scaling to cepstral coefficients and mel warping, and the effect of scale on recognition performance, was suggested by Nelson [3], [4]. This principle was applied in recognition experiments by Kamm et al. [5]. Such normalization techniques are necessary, since different speakers have different formant frequencies for the same vowel. A main source of this variability among different speakers is the difference in vocal-tract lengths [3]-[7]. As a simple model of this notion, consider a uniform tube with length L. The frequency spectrum is given by

X_L(f) = 1 / cos(2πfL/v)                                   (1)

where v is the velocity of sound. Hence, if we have two speakers, A and B, with vocal-tract lengths L_A and L_B, their respective spectra are scaled versions of each other, i.e.,

X_A(f) = X_B(αf),   α = L_A / L_B                          (2)

where α is the scale factor. While the uniform tube model is not the best model for the vocal tract, it illustrates the general feature investigated within this paper, namely that there is scaling in the frequency domain, as our previous work has shown [8].
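As a small numerical check of (1) and (2), the resonances implied by the uniform-tube spectrum can be computed for two vocal-tract lengths; the sketch below is illustrative only (the sound speed and tube lengths are assumed values, not data from the paper) and shows that the two sets of formants differ by the constant factor α = L_A/L_B.

import numpy as np

# Illustrative check of the uniform-tube scaling argument; v, L_A, L_B are assumptions.
v = 340.0                  # assumed speed of sound, m/s
L_A, L_B = 0.145, 0.170    # hypothetical vocal-tract lengths, m

n = np.arange(1, 4)        # first three resonances (formants) of a closed-open tube
F_A = (2 * n - 1) * v / (4 * L_A)
F_B = (2 * n - 1) * v / (4 * L_B)

alpha = L_A / L_B
print(F_A)                            # approx. [ 586.2  1758.6  2931.0 ] Hz
print(F_B)                            # approx. [ 500.0  1500.0  2500.0 ] Hz
print(np.allclose(F_B, alpha * F_A))  # True: a pure frequency scaling, as in (2)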
The paper is organized as follows. In the next section we discuss some of the properties of the scale transform that are relevant to this paper. In Section III we detail a method to obtain a smoothed estimate of the formant envelope. Subsequently, the discrete implementation of the scale-cepstrum is explained. In Section V, we describe the simulations that we have performed to compare the separability of vowels when scale-cepstral and mel-cepstral coefficients are used as features.

II. THE SCALE TRANSFORM

Before we describe the computation of the scale-transform based features, we briefly review some important properties of the scale transform [9]. The scale transform of a function X(f) is given by

D(c) = (1/√(2π)) ∫₀^∞ X(f) f^(−jc − 1/2) df                (3)

and the inverse scale transform is

X(f) = (1/√(2π)) ∫_{−∞}^{∞} D(c) f^(jc − 1/2) dc.          (4)

A basic property of the scale transform is that the magnitudes of the scale transform of a function X(f) and of its normalized scaled version √α X(αf) are equal. (Note that α < 1 corresponds to dilation, while α > 1 corresponds to compression.) To show this, consider the scale transform of √α X(αf), i.e.,

D_α(c) = (1/√(2π)) ∫₀^∞ √α X(αf) f^(−jc − 1/2) df.         (5)

Using the substitution of variables f′ = αf, we have

D_α(c) = e^(jc ln α) (1/√(2π)) ∫₀^∞ X(f′) f′^(−jc − 1/2) df′ = e^(jc ln α) D(c).   (6)

Hence, the magnitude of the scale transform of X(f) and that of its scaled version are the same. The scaling constant contributes only to the phase and does not appear in the magnitude of the scale transform.

We have previously used the scale transform to study the effect of pitch variation and the resulting broadening of pitch harmonics [10], [11].

Before we describe the scale-transform based procedure, we discuss the method used to obtain a smoothed estimate of the formant envelope.

III. ESTIMATION OF FORMANT ENVELOPE

According to the source-filter model of speech production, vowels are produced by the vocal-tract filter driven by the pitch excitation. In the spectral domain this corresponds to the product of the spectrum of the vocal-tract filter and the spectrum of the pitch, i.e.,

S(f) = V(f) E(f)                                           (7)

where S(f), V(f), and E(f) are the observed spectrum, the frequency response of the vocal tract, and the spectrum of the pitch-excitation function, respectively.

Vowels are almost completely described by the first three resonances (formants) of the vocal-tract frequency representation V(f). These resonances are affected by the length of the pharyngeal-oral tract, the constriction along the tract, and the narrowness of the tract.

Since we are interested only in the vocal-tract response, we would like to remove the effects of the pitch excitation. In this paper, the following procedure proposed by Nelson [12] is used to suppress the effects of pitch. This method is similar to averaged-periodogram techniques [13]. In the proposed method, each frame of speech is segmented into overlapping subframes, and each subframe is Hamming windowed. For the purposes of this paper, we have chosen the subframes to be 96 samples long (for speech sampled at 16 kHz), and the overlap between subframes is 64 samples. We estimate the sample autocorrelation function of each subframe and average over the available subframes. This averaged autocorrelation estimate is then Hamming windowed and is used to compute the scale-cepstrum described in the next section. We denote the windowed average autocorrelation estimate by r_w.

In the proposed analysis method, pitch is effectively suppressed since the duration of each subframe is less than the expected pitch interval. For every subframe that contains an individual pitch pulse there is a broadband energy contribution to the spectrum of that subframe but not to any other subframe. The result is that the averaged spectrum contains all of the formant structure but almost none of the pitch structure.

The scale-cepstrum is obtained by computing the scale transform of log|S(f)|, where S(f) is the Fourier transform of r_w, the windowed averaged autocorrelation estimate. In the calculation of the scale-cepstrum, the analytic spectrum is used rather than the symmetric spectrum, since the scale properties are not valid for a symmetric log/mel-warped spectrum. The reason for using the logarithm operation is that it provides a more parsimonious representation in the scale-cepstral domain. Note that the logarithm operation affects only the magnitude of the spectral components. Therefore, formant envelopes that are frequency-scaled versions of each other continue to remain so even after the logarithm is taken. The magnitude of the scale-cepstrum is then used as a feature vector.
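A sketch of this averaging procedure is given below. It follows the text directly (96-sample Hamming-windowed subframes with a 64-sample overlap, averaging of the per-subframe sample autocorrelations, and a final window on the lag axis); the biased autocorrelation normalization, the number of retained lags, and the use of the decaying half of a Hamming window as the lag window are assumptions made for illustration.

import numpy as np

def smoothed_autocorrelation(frame, sub_len=96, overlap=64, n_lags=96):
    # Pitch-suppressing envelope estimate: average windowed-subframe autocorrelations.
    hop = sub_len - overlap                        # 32-sample hop between subframes
    win = np.hamming(sub_len)
    acfs = []
    for start in range(0, len(frame) - sub_len + 1, hop):
        x = frame[start:start + sub_len] * win     # Hamming-windowed subframe
        r = np.correlate(x, x, mode="full")[sub_len - 1:]   # lags 0 .. sub_len-1
        acfs.append(r / sub_len)                   # biased sample autocorrelation (assumed)
    r_avg = np.mean(acfs, axis=0)                  # average over the available subframes
    lag_win = np.hamming(2 * n_lags)[n_lags:]      # decaying half-window on the lag axis (assumed)
    return r_avg[:n_lags] * lag_win

# Example on a synthetic 512-sample "voiced" frame (for illustration only).
fs = 16000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 120 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 600 * t))
r_w = smoothed_autocorrelation(frame)
print(r_w.shape)   # (96,): smoothed, windowed autocorrelation estimate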
IV. DISCRETE IMPLEMENTATION OF THE SCALE-CEPSTRUM

Since the sampling frequency of the TIMIT data base is 16 kHz, for the computations in this paper we assume that the signal is bandlimited between 100-7000 Hz. The scale-cepstrum may therefore be represented as

D(c) = (1/√(2π)) ∫_{f_l}^{f_h} log|S(f)| f^(−jc − 1/2) df        (8)

with f_l = 100 Hz and f_h = 7000 Hz. Using the substitution of variables λ = ln f, we have

D(c) = (1/√(2π)) ∫_{ln f_l}^{ln f_h} e^(λ/2) log|S(e^λ)| e^(−jcλ) dλ        (9)

which is the conventional Fourier transform of e^(λ/2) log|S(e^λ)|. For digital implementation, we sample in the λ = ln f domain and obtain an expression, (10), which can be easily implemented using the fast Fourier transform (FFT). The phase term can be ignored, since it does not contribute to the magnitude of D(c). S(f) can be easily computed from the time-lag samples of the smoothed formant envelope as

S(f) = Σ_n r_w(nT) e^(−j2πfnT)                                    (11)

where T is the sampling period in the time-lag domain. The magnitudes of the scale-cepstral coefficients, i.e., |D(c)|, are used as features that describe the various vowels.
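The steps of this section can be strung together as follows. The sketch is ours, not the paper's program: the FFT sizes, the density of the log-frequency grid, and the number of coefficients retained are illustrative choices, and the unit-energy normalization mentioned in the next section is omitted here.

import numpy as np

def scale_cepstrum(r_w, fs=16000, f_lo=100.0, f_hi=7000.0,
                   n_fft=2048, n_lambda=512, n_coeffs=20):
    # S(f): Fourier transform of the windowed averaged autocorrelation, as in (11).
    S = np.fft.rfft(r_w, n=n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # log|S(f)| restricted to the assumed 100-7000 Hz band.
    band = (freqs >= f_lo) & (freqs <= f_hi)
    log_mag = np.log(np.abs(S[band]) + 1e-12)

    # Resample uniformly in lambda = ln f, following the substitution used for (9).
    lam = np.log(freqs[band])
    lam_grid = np.linspace(lam[0], lam[-1], n_lambda)
    warped = np.interp(lam_grid, lam, log_mag)

    # Weight by exp(lambda/2) and take FFT magnitudes; since only magnitudes are
    # kept, the band-edge phase term plays no role.
    weighted = np.exp(lam_grid / 2.0) * warped
    D = np.fft.rfft(weighted)
    return np.abs(D[1:n_coeffs + 1])   # drop the zeroth coefficient, keep 1..20

# Usage with the r_w computed in the previous sketch:
# features = scale_cepstrum(r_w)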
V. COMPARISON OF FEATURES

In this section, we compare the separability of vowel classes when scale-cepstral and mel-cepstral coefficients are used as features. We point out that when we refer to scale-cepstral coefficients as features, we assume that the magnitudes |D(c)| are used in the feature vector. In comparing the separability afforded by the different cepstral features, a generalized F-ratio method is used [14], [15].

Fig. 1. Separability between phonemes using scale-cepstral and mel-cepstral coefficients, based on 200 utterances of clean speech. The data are taken from dialect region seven of the TIMIT training set and are spoken by 22 male and 13 female talkers. It is seen that the scale-cepstrum provides better separability for most vowels. The few cases for which it is less than the mel-cepstrum may be explained by recalling that the scale-cepstrum is based on the assumption of linear scaling, while the mel-cepstrum uses a warping based on psychoacoustics. It is hoped that the use of such warping functions will further improve the scale-cepstrum.

In deriving the F-ratio separability, let μ_i and C_i denote the mean feature vector and sample covariance matrix, respectively, of the ith phoneme class. We assume equal probability of the phoneme classes. Let μ_0 = (1/M) Σ_{i=1}^{M} μ_i, where M denotes the number of phoneme classes being compared. We then compute the within-class and between-class scatter matrices, S_w and S_b, respectively, as

S_w = (1/M) Σ_{i=1}^{M} C_i                                       (12)

and

S_b = (1/M) Σ_{i=1}^{M} (μ_i − μ_0)(μ_i − μ_0)^T.                 (13)

The separability criterion is then given by

F = tr(S_w^(−1) S_b).                                             (14)
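A compact sketch of this computation is given below; it assumes the Fukunaga-style trace criterion written above, with equal class priors, so the exact normalization may differ from the paper's.

import numpy as np

def f_ratio_separability(class_features):
    # class_features: list of (n_i, d) arrays of feature vectors, one per phoneme class.
    M = len(class_features)
    means = [x.mean(axis=0) for x in class_features]
    covs = [np.cov(x, rowvar=False) for x in class_features]
    mu0 = np.mean(means, axis=0)                        # overall mean (equal priors)

    Sw = np.mean(covs, axis=0)                          # within-class scatter, as in (12)
    Sb = np.mean([np.outer(m - mu0, m - mu0) for m in means], axis=0)   # as in (13)
    return np.trace(np.linalg.solve(Sw, Sb))            # F = tr(Sw^-1 Sb), as in (14)

# Example with synthetic data: 4 classes, 50 tokens each, 20-dimensional features.
rng = np.random.default_rng(0)
classes = [rng.normal(loc=i, scale=1.0, size=(50, 20)) for i in range(4)]
print(f_ratio_separability(classes))   # larger values indicate better-separated classes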
Simulation Results: The data consist of a total of 200 utterances of each vowel from 22 male and 13 female speakers from dialect region seven of the TIMIT training set. The four vowels considered for the comparison of the different cepstra are /ae/, /iy/, /ih/, and /ow/. Each utterance is chosen so that the corresponding phoneme is relatively stationary over at least 768 samples, and the middle 512 samples are used in the computation of the different cepstra. Therefore, each utterance corresponds to 32 ms of speech, since the TIMIT data are sampled at 16 kHz. The scale-cepstral and mel-cepstral coefficients of clean and noisy utterances are computed. The noisy utterances are simulated by adding artificially generated white Gaussian noise.

Fig. 2. Separability between phonemes using scale-cepstral and mel-cepstral coefficients, based on 200 utterances at 18 dB SNR. The data are taken from dialect region seven of the TIMIT training set and are spoken by 22 male and 13 female talkers. Note that the scale-cepstrum provides better separability than the mel-cepstrum.

The signal-to-noise ratio (SNR) is defined as the ratio of the energy in the utterance to the noise energy. In our experiments, we used SNRs of 6 and 18 dB.

The scale-cepstrum is computed using fixed values of the analysis parameters described above; note that an interpolation by a factor of two is used.

The mel-cepstrum is implemented using the program in the signal processing information base [16] written by Slaney. Forty triangular filters are used. The center frequencies of the first 13 linearly spaced filters are 66.67 Hz apart, starting at 133.34 Hz. The center frequencies of the other 27 filters are chosen to have a ratio of 1.0711703 between successive filters. This covers the frequency band up to 8 kHz. Each filter's magnitude frequency response has a triangular shape that is maximum at the center frequency and decreases linearly to zero at the center frequencies of the two adjacent filters. The vector of energies is computed by weighting the discrete Fourier transform (DFT) coefficients by the magnitude response of the filterbank. The mel-cepstral coefficients are then obtained by computing the inverse cosine transform of the vector of log energies.
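To make the mel-cepstral baseline concrete, the sketch below builds a 40-filter triangular filterbank from the center-frequency spacing quoted above and applies log energies followed by an inverse cosine transform. It is a simplified reimplementation for illustration, not Slaney's program from [16]; the band-edge handling, filter normalization, and scaling are assumptions.

import numpy as np

def mel_center_frequencies(n_linear=13, n_log=27, f0=133.34, step=66.67, ratio=1.0711703):
    # 13 linearly spaced centers, then 27 centers in a constant ratio (see text).
    linear = f0 + step * np.arange(n_linear)
    log_part = linear[-1] * ratio ** np.arange(1, n_log + 1)
    return np.concatenate([linear, log_part])          # 40 center frequencies

def mel_cepstrum(power_spectrum, freqs, n_coeffs=20):
    centers = mel_center_frequencies()
    # Each filter spans the centers of its two neighbors (edge values are assumptions).
    edges = np.concatenate([[centers[0] - 66.67], centers, [centers[-1] * 1.0711703]])
    energies = np.zeros(len(centers))
    for i in range(len(centers)):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0.0, None)     # rising edge
        down = np.clip((hi - freqs) / (hi - mid), 0.0, None)   # falling edge
        energies[i] = np.sum(np.minimum(up, down) * power_spectrum)
    log_e = np.log(energies + 1e-12)
    # Inverse cosine transform (DCT) of the log energies; the zeroth coefficient is dropped.
    k = np.arange(len(centers))
    basis = np.cos(np.pi * np.outer(np.arange(1, n_coeffs + 1), k + 0.5) / len(centers))
    return basis @ log_e

# Usage (with a hypothetical 1024-point frame sampled at 16 kHz):
# mfcc = mel_cepstrum(np.abs(np.fft.rfft(frame, 1024)) ** 2, np.fft.rfftfreq(1024, 1 / 16000))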
In all cepstra the zeroth coefficient is not used, since it is roughly a measure of the spectral energy. Coefficients 1-20 are used to measure the separability between the different phoneme classes. We remind the reader that the magnitudes of the scale-cepstral coefficients are used as elements of the feature vector. We normalize the vector of coefficients to unit energy. The size of the feature vector is varied from 1 to 20 coefficients and the separability measure is computed.

Fig. 3. Separability between phonemes using scale-cepstral and mel-cepstral coefficients, based on 200 utterances at 6 dB SNR. The data are taken from dialect region seven of the TIMIT training set and are spoken by 22 male and 13 female talkers. It is observed that the scale-cepstrum is robust to noise, and the degradation in separability compared to clean speech is small.

Figs. 1-3 show the separability measure F as a function of the number of coefficients for clean and noisy speech.

From the figures, it is clear that the scale-cepstrum provides better separability than the mel-cepstrum for most vowels. Further, it is also seen to be robust to noise. It is also intuitively satisfying to note that vowels closer together in the vowel triangle are much more difficult to separate (e.g., /iy/ and /ih/) than those that are far apart (e.g., /iy/ and /ow/). For clean speech, the separability for a few vowels is less than that of the mel-cepstrum, and this may be partly explained by recalling that for the scale-cepstrum we have made a linear frequency-scaling assumption, i.e., the scale factor α is independent of frequency. Our recent studies show that α is actually frequency dependent, and the corresponding warping function is similar to mel-warping [17]. It is hoped that using such a warping function may further improve the features.

VI. CONCLUSION

We have studied the scale transform of the formant envelopes of utterances of vowels by different speakers. This was motivated by the hypothesis that the formant frequencies of different speakers are approximately related by a scaling constant for any given phoneme. We have described a procedure for the discrete implementation of the scale transform on the formant envelope, which we estimate using averaged-periodogram techniques. Our results on vowels indicate that the scale-cepstral coefficients provide better separation between vowels than mel-cepstral coefficients.

The data used in the comparison were 200 utterances of four vowels from the TIMIT data base, and the generalized F-ratio was used as the criterion for the comparison of the features. We have shown that the scale-cepstral methods are useful in devising features that take into account variations due to scaling in the frequency domain.

REFERENCES

[1] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 357-366, Aug. 1980.
[2] C. Jankowski, H.-D. H. Vo, and R. P. Lippmann, "A comparison of signal processing front ends for automatic word recognition," IEEE Trans. Speech Audio Processing, vol. 3, pp. 286-293, July 1995.
[3] D. J. Nelson, "Wavelet based analysis on speech signals," in Proc. IEEE Conf. Automatic Speech Recognition, Snowbird, UT, Dec. 1993, pp. 89-90.
[4] ——, "Alternate spectral analysis methods in speech analysis," in Proc. 13th Ann. Speech Research Symp., Johns Hopkins Univ., Baltimore, MD, June 1993, pp. 304-315.
[5] T. Kamm, G. Andreou, and J. Cohen, "Vocal tract normalization in speech recognition: Compensating for systematic speaker variability," in Proc. 15th Ann. Speech Research Symp., Johns Hopkins Univ., Baltimore, MD, June 1995, pp. 175-178.
[6] P. Bamberg, "Vocal tract normalization," Tech. Rep., Verbex, 1981.
[7] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization," in Proc. IEEE ICASSP'96, Atlanta, GA, pp. 346-349.
[8] L. Cohen, N. Marinovic, S. Umesh, and D. Nelson, "Scale-invariant speech analysis," in Proc. Int. Soc. Optical Engineering, San Diego, CA, 1995, vol. 2569, pp. 522-537.
[9] L. Cohen, "The scale representation," IEEE Trans. Signal Processing, vol. 41, pp. 3275-3292, Dec. 1993.
[10] N. Marinovic, L. Cohen, and S. Umesh, "Scale and harmonic type signals," in Proc. Int. Soc. Optical Engineering, San Diego, CA, 1994, vol. 2303, pp. 411-418.
[11] ——, "Joint representation in time and frequency scale for harmonic type signals," in Proc. IEEE Int. Symp. Time-Frequency and Time-Scale Analysis, Philadelphia, PA, 1994, pp. 84-87.
[12] D. Nelson, "Correlation based speech formant recovery," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Munich, Germany, Apr. 1997, pp. 1643-1646.
[13] A. H. Nuttall and G. C. Carter, "Spectral estimation using combined time and lag weighting," Proc. IEEE, vol. 70, pp. 1115-1125, Sept. 1982.
[14] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1990.
[15] T. Parsons, Voice and Speech Processing. New York: McGraw-Hill, 1987.
[16] D. H. Johnson and P. N. Shami, "The signal processing information base," IEEE Signal Processing Mag., pp. 36-42, Oct. 1993.
[17] S. Umesh, L. Cohen, and D. Nelson, "Frequency warping and psychoacoustic frequency scales," in Proc. Int. Soc. Optical Engineering, San Diego, CA, 1996, vol. 2825, pp. 530-539.

Leon Cohen (SM'91-F'98) was born in Barcelona, Spain, on March 23, 1940. He received the B.S. degree from The City College of New York in 1958 and the Ph.D. degree from Yale University, New Haven, CT, in 1966. He has contributed to the fields of signal analysis, astronomy, physics, and mathematics.

Nenad Marinovic was born in Belgrade, Yugoslavia. He received the Dipl.-Ing. degree in electrical engineering from the University of Belgrade in 1975 and the Ph.D. degree in electrical engineering from the City University of New York in 1986. For almost 20 years, until 1996, he did signal processing research in both industry and academia. His main interests were in signal theory, statistical signal processing, and their applications in biomedical imaging, computer vision, radar, sonar, underwater acoustics, and speech. From 1986 to 1996, he was a Professor in the Department of Electrical Engineering, City College of the City University of New York, where his research contributing to this paper was carried out. Since then, he has moved to the financial industry, where he is currently involved in stochastic modeling of securities markets for the measurement and management of financial risks.

Douglas J. Nelson (M'98) was born in Minneapolis, MN, on November 5, 1945. He received the B.A. degree in mathematics from the University of Minnesota, Minneapolis, in 1967, and the Ph.D. degree in mathematics from Stanford University, Stanford, CA, in December 1972. He was a Visiting Assistant Professor of Mathematics at Carnegie Mellon University, Pittsburgh, PA, from September 1972 to September 1975. Since November 1975, he has been with the National Security Agency, Fort Meade, MD, where he has been involved in the development of signal processing algorithms for radar, communication signals, and speech.

S. Umesh (S'92-M'94) was born in Bangalore, India. He received the B.E. (Hons.) degree in electrical and electronics engineering from the Birla Institute of Technology and Science, India, in 1987, the M.E. (Hons.) degree in electronics engineering from Madras Institute of Technology, India, in 1989, and the Ph.D. degree in electrical engineering from the University of Rhode Island, Kingston, in 1993.

He is currently an Assistant Professor in the Department of Electrical Engineering, Indian Institute of Technology, Kanpur, India. From June 1994 to May 1996, he was with Hunter College, City University of New York, as a Research Associate. His main research interests are in the area of statistical signal processing. His current research interests are in developing robust acoustic features for applications in speech processing.

Dr. Umesh is a recipient of the Indian AICTE Career Award for Young Teachers.
