Workshop on Acoustic Voice Analysis

PROCEEDINGS

EDITED BY DARRELL WONG, PH.D.

Sponsored by the
National Center for Voice and Speech
for the
National Institute on Deafness and
Other Communication Disorders

Conference Site: Wilbur James Gould Voice Research Center,
The Denver Center for the Performing Arts, Denver, Colorado
The National Center for Voice and Speech is a multi-site, interdisciplinary organization dedicated to delivering state-of-the-art voice and speech research to practitioners, trainees and the general public. Members of the consortium are The University of Iowa, The Denver Center for the Performing Arts, The University of Wisconsin-Madison and The University of Utah.
The NCVS gratefully acknowledges its source of support: Grant P60 DC00976 from the National Institute on Deafness and Other Communication Disorders, a division of the National Institutes of Health.
FOREWORD
Speech scientists have traditionally studied the acoustic signal captured by a microphone. In the health sciences, the human voice has been studied as a way of revealing information about the health of an individual. While there has been considerable progress in the analysis of speech and voice signals for the diagnosis and documentation of vocal disorders, some concern has been expressed regarding the need to reach a consensus on the utility, feasibility, and standardization of analysis methods.
The workshop proceedings are intended to be a step toward this process. A Workshop
on Acoustic Voice Analysis was held on the 17th and 18th of February, 1994, in Denver,
Colorado. The site was the Wilbur James Gould Voice Research Center (then known as The
Recording and Research Center), a division of The Denver Center for the Performing Arts
(DCPA). Sponsorship and financial support were provided by the National Center for Voice and Speech (NCVS) and the DCPA. The NCVS is a research and training center funded by the National Institute on Deafness and Other Communication Disorders.
The Proceedings consist of written versions of topics discussed during the workshop.
Some of the papers may be found in other journals. As a rule, previously published material
was accepted, since the workshop was seen as a summary as much as a venue for new ideas.
Attendance and contributions were by invitation, so that a broad spectrum of the voice analysis community could be represented. While the audience and list of papers do not exhaustively represent the community, we were able to present perspectives from industry representatives, speech clinicians, speech science academicians, and medical personnel.
The topics presented include recording techniques, file formats, perturbation statistics, extraction algorithms, nomenclature and classification, and the nature of perturbation. As a
result, the papers ranged in style from technical summaries and algorithm descriptions to
perspectives and commentaries.
As part of this Foreword, mention should be made regarding the organization of the
manuscript. Each paper is identified using the first few letters of the primary author's name.
The first paper, HESS, by Dr. Wolfgang Hess, is the keynote address for the Workshop. He was given the task of introducing the concepts involved in pitch determination, generally considered the heart of perturbation analysis. Because perturbations are viewed as deviations from the steady state, the demarcation of fundamental periods is crucial. The next four papers (identified as TALK, MIL1, DEL, and QI) discuss the topics of pitch (or F0) marking, pitch perturbation, and amplitude perturbation. Following that are three papers (RAB, GER, and LEM), which discuss the utility of perturbation measures and the statistical methodology for defining the 'normal limits' of speech characteristics.
The paper EPN discusses a method of pitch marking, jitter measurement, and their
results as applied to aged voices. In JIA and HUA, protocols and observations are made
regarding the capturing of voice samples in the context of reducing subject frequency and
intensity variability. In KHE, a discussion and demonstration of new methods in spectral estimation are presented, while WON presents a qualitative discussion on the sources of perturbation from a biomechanical perspective. The latter paper was not presented during the
workshop, but it has been submitted by the editor as a relevant topic. The short summary in
MILD makes suggestions on hardware selection in the context of different types of voice
processing. MIL2 and CUR discuss file formats, while WINH discusses microphone selection
and placement as it affects perturbation measurements.
Finally, Dr. Ingo Titze has written a summary statement (TITZE), the first part of which may be considered his personal perspective on the analysis, nomenclature and classification of voice data. The second part of the statement is a set of recommendations and a glossary of terminology. Only the recommendations should be viewed as majority opinion. The summary
statement can be obtained as a separate document from the NCVS.
The proceedings have focused on a very narrow set of issues which are important to the
voice analysis community. We hope that the results are informative, and that our efforts will at
least generate discussion, if not a consensus, in the community.
Darrell Wong,
June, 1995.
TABLE OF CONTENTS
Suggestion for a Pitch Extraction Method and File Format for Pathological Voice Data
Dimitar D. Deliyski DEL

How We Do It: Automated Target Matching and Data Selection Procedure in Voice Sample Acquisition
Jack Jiang, David Hanson and Jie Chen JIA

An Assessment of the Viability of Multimedia File Formats for Voice Data Use
Timothy W. Curran CUR

Summary Statement
Ingo R. Titze TITZE
Wolfgang J. Hess
Institute for Communications Research and Phonetics (IKP), University of Bonn
Poppelsdorfer Allee 47, D-53115 Bonn, Germany
wgh@uni-bonn.de
Abstract. This paper presents a survey of methods for pitch determination of speech signals, with special emphasis on time-domain methods. As speech is a time-variant signal, the result of the measurement will depend on the method applied. This implies that we first define what is subsumed under the term pitch. From the point of view of speech production, this is the rate of vocal fold vibration or the duration of individual laryngeal excursion cycles, which is measured in the time domain by algorithms that are able to track the signal period by period. From a more signal-oriented point of view, where the emphasis is laid on the periodicity of voiced speech signals, this will be fundamental period (duration) or, if the measurement is carried out in the frequency domain, fundamental frequency. Pitch determination algorithms (PDAs) which follow this definition usually operate on the basis of some short-time, i.e., frame-to-frame representation. After a short review of these PDAs, a survey of time-domain algorithms is presented. These include methods such as structural analysis of the speech signal with or without preprocessing, determination of individual periods from the first partial of the signal, determination of the point of glottal closure, and multi-channel approaches. Some remarks on glottal inverse filtering are added. The paper then discusses the issue of error analysis. Errors in pitch determination are classified into gross errors and measurement inaccuracies, and a main problem for any algorithm, when it encounters an estimate that seems to be wrong, is to decide reliably whether this is due to a measurement failure or to a momentary irregularity of the signal. The paper also addresses the possibility of using an instrument that directly measures the laryngeal excitation, notably a laryngograph, for gaining reference contours against which PDAs can be evaluated or trained.
Pitch, i.e., fundamental frequency (or rate of vocal-fold vibration) F0 as well as fundamental period T0, takes on a key position in the acoustic speech signal. The prosodic information of an utterance is predominantly determined by this parameter. The ear is by an order of magnitude more sensitive to changes of fundamental frequency than to changes of other speech signal parameters (Flanagan and Saslow, 1958; Klatt, 1973; Harris and Umeda, 1987). The quality of vocoder speech as well as of synthetic speech (when natural-speech units are used) is essentially influenced by the quality and faultlessness of the pitch measurement (Gold, 1977). Hence the importance of this parameter calls for good and reliable measurement methods.
Besides voicing determination, pitch determination is one of the two subproblems of voice source analysis. In voiced speech, the vocal cords vibrate in a quasi-periodic way.
Speech segments with voiceless excitation are generated by turbulent air flow at a constriction or by the release of a closure in the vocal tract. The parameters we have to determine in voice source analysis are the manner of excitation, i.e., the presence of a voiced excitation and the presence of a voiceless excitation, a problem which is referred to as voicing determination, and, for the segments of the speech signal where a voiced excitation is present, pitch determination.
Automatic pitch determination has a rather long history which goes back even beyond the times of vocoding (e.g., Grützmacher and Lottermoser, 1937). The most important developments leading to today's state of the art were made in the sixties and seventies; the methods that are reviewed in this paper are extensively discussed in (Hess, 1983). Since then, few absolutely new principles have been invented; a number of methods, however, were improved and refined, whereas other solutions were revived that required an amount of computational effort appearing unrealistic at the time the algorithm was first developed. On the other hand, new techniques such as neural networks or, even more recently, the wavelet transform have initiated new developments, especially in time-domain pitch determination, where further improvements are to be expected in the near future.
Fig. 1. Example of a speech signal (after pitch determination). Beginning of the utterance "Algorithms and devices for pitch determination". Speaker: male; undistorted signal. Scale: 250 ms per line. The analysis was done using the algorithm by Hess (1979); voiceless segments, pauses, and pitch period boundaries ("markers") are indicated. Markers indicated by short lines were found irregular.
With speech corpora coming into use that contain many labeled and processed speech data, researchers nowadays tend toward thoroughly examining and checking the performance of their algorithms.
At first glance the task looks simple: one just has to detect the fundamental frequency of a quasi-periodic signal. Dealing with speech signals, however, the assumption of quasi-periodicity is often far from reality. Figure 1 shows an arbitrary (but typical) example of a speech signal. For a number of reasons, the task of pitch determination must be counted among the most difficult problems in speech analysis.
1) In principle, speech is a nonstationary process; the momentary position of the vocal tract may change abruptly at any time. This leads to drastic variations in the temporal structure of the signal, even between subsequent pitch periods.
2) Due to the flexibility of articulatory gestures and the wide variety of voices, there exist a multitude of possible temporal structures. Narrow-band formants at low harmonics (especially at the second or third harmonic) are a particular source of trouble.
3) For an arbitrary speech signal uttered by an unknown speaker, the fundamental frequency can vary over a range of almost four octaves (50 to 800 Hz). Especially for female voices, F0 thus often coincides with the first formant (the latter ranging from about 200 Hz to 1400 Hz). This causes problems when inverse filtering techniques are applied.
4) The excitation signal itself is not always regular (see Fig. 2). Even under normal conditions, i.e., when the voice is neither hoarse nor pathologic, the glottal waveform exhibits occasional irregularities (Dolansky and Tjernlund, 1968; Fujimura, 1968; Lieberman, 1963). In addition, the voice may temporarily fall into vocal fry or laryngealization (Hollien, 1974), which is a nonpathologic mode of voice excitation with rather large and irregular intervals between subsequent glottal pulses. Such laryngealizations are deliberately used by many speakers as boundary signals or substitutions for glottal stops (Huber, 1988) and may therefore occur anywhere in fluent speech.

Fig. 2. Speech signals in different phonation: vocal fry (upper line); modal register, i.e., normal speech (lower line, with pitch period delimiters). Signal: sustained vowel [e], speaker WGH (male). Scale: 20 ms per line.
5) Additional problems arise in speech communication systems where the signal is often
distorted or band limited (for instance, in the telephone channel). This may be detrimental
for some applications. For voice quality measurement or vocal jitter determination, for
instance, even the inevitable phase distortions introduced by an ordinary analog tape or
cassette recorder (cf. Fig. 3) may be intolerable.
Literally hundreds of methods for pitch determination have been developed. This paper
will give a survey of the prevailing principles and discuss selected methods in more detail.
First, we will deal with possible definitions of the parameter pitch itself (Sect. 1), followed
by a gross categorization of the various principles of its determination (Sect. 2). After that
we will go into a more detailed discussion of individual principles and individual solutions.
Section 3 will present a brief survey of short-term analysis methods, and then Sect. 4 will
deal more extensively with time-domain methods. Sections 5 and 6 will finally discuss
problems of error analysis and evaluation, accurate voicing and pitch determination using
instruments such as the laryngograph, and various applications.
As to the realization, we will not distinguish between a hardware device (whether analog or digital) and an algorithmic solution: they are all regarded as pitch determination algorithms (PDAs). In addition we will separate the problems of pitch determination and voicing determination, although the two are often realized within the same algorithm; we will assume in the following that a voiced/unvoiced decision has already been made, and that there are only voiced signals to be processed by the respective PDA.
Fig. 3a-c. Phase distortions in analog tape recordings. (a) Rectangular waveform at F0 = 75 Hz, 150 Hz, and 400 Hz; (b) same waveform after recording on a high-quality analog tape recorder; (c) same waveform after rerecording it with the same recorder in backward direction.
Pitch can be measured in many ways. If the signal is completely stationary and periodic, all these strategies - provided they operate correctly - lead to identical results. Since the speech signal is nonstationary and time variant, however, aspects of strategy such as the starting point of the measurement, the length of the measuring interval, the way of averaging (if any), or the operating domain (time, frequency, lag, etc.) of an individual algorithm start influencing the results and may lead to estimates that differ from algorithm to algorithm even if all these results are "correct" and "accurate." Before entering a discussion on individual methods, we must therefore have a look at the parameter pitch and provide a clear definition of what should be measured and what is actually measured.
A word on terminology first. There are three points of view for looking at a speech processing problem (Zwicker et al., 1967): the production, the signal-processing, and the perception points of view, respectively. In the actual case of pitch determination, the production point of view is obviously oriented toward the generation of the excitation signal in the larynx; we will thus have to start from a time-domain representation of the waveform as a train of laryngeal pulses. If an algorithm or device works in a speech-production-oriented way, it measures individual laryngeal excitation cycles or, if some averaging is performed, it determines the rate of vocal-fold vibration. The signal-processing point of view can be characterized in such a way that (quasi-)periodicity is observed in the signal, wherever that signal comes from, and that the task is just to extract those features that best represent this periodicity. The pertinent terms are fundamental frequency or fundamental period. If individual cycles are determined, we may (somewhat inconsistently) speak of pitch periods or simply of periods. The perception point of view leads to a frequency-domain representation since pitch sensation corresponds to a frequency and not to an average period or a sequence of periods (Goldstein, 1973; Terhardt, 1979; Plomp, 1976). This point of view is associated with the original meaning of the term pitch. However, the term pitch has consistently been used as some kind of "common denominator", i.e., as a general name for all those terms mentioned before, at least in the technical literature (Kohler, 1982). In addition, psychoacousticians have started to create new terms for describing the aspects of pitch perception, such as spectral pitch or virtual pitch (Terhardt, 1979), mostly because they felt it necessary to specify partial aspects of the complex phenomenon of pitch perception more precisely, but also in order to avoid confusion. In the following, we will therefore use the term pitch in this wider sense wherever a more restricted description is undesirable or impossible, and use the more precise terms otherwise.
Defining the different representations of pitch, it appears reasonable to proceed from production to perception. Going in that direction, we will start at a local and detailed representation and arrive at a more global representation in the case of the perception-oriented view. The basic definitions could thus read as follows (Hess, 1983:475, 1992; Hess and Indefrey, 1987):
T0 is defined as the elapsed time between two successive laryngeal pulses. Measurement starts at a well specified point within the glottal cycle, preferably at the point of glottal closure or, if the glottis does not close completely, at the point where the glottal area reaches its minimum. (1)
PDAs that obey this definition will be able to locate the point of glottal closure and to delimit individual laryngeal excitation cycles. This task, which usually forms part of a glottal inverse filter, goes far beyond the scope of ordinary pitch determination; if the speech signal alone is available for the analysis, reliable results are to be expected only for selected algorithms and only if the signal is totally undistorted. With the aid of an instrument this problem can be solved in a more general way.
T0 is defined as the elapsed time between two successive laryngeal pulses. Measurement starts at an arbitrary point within the glottal cycle. Which point that is depends on the individual method, but for a given PDA this point is always located at the same position within the glottal cycle. (2)
Ordinary time-domain PDAs follow this definition. The reference point can be a significant extreme, a certain zero crossing, an excursion cycle, and so on. This is not necessarily the point of glottal closure itself. Usually, however, it is possible to derive the point of glottal closure from this reference point when the signal is undistorted. Yet the presence of phase distortions can destroy even this possibility. PDAs that follow this definition usually track the signal period by period in a synchronous way, and a commonly used term (although somewhat inconsistent with the definition of the term pitch as given above) for what is measured here is individual pitch periods.
T0 is defined as the elapsed time between two successive laryngeal cycles. Measurement starts at an arbitrary instant which is fixed according to external conditions, and ends when a complete cycle has elapsed. (3)
This is an incremental definition of T0. T0 is still defined as the length of an individual period, but no longer from the speech production point of view, since the definition has nothing to do with an individual excitation cycle. The synchronous way of processing is maintained, but the phase relations between the laryngeal waveform and the markers, i.e., the pitch period delimiters at the output of the algorithm, are lost. Once a reference point in time has been established, it will be kept only as long as the measurement is correct and as long as voicing continues. If there is a measurement error, or if voicing ceases, the location of the reference point is lost, and the next reference point may be completely different with respect to its position within the excitation cycle.
T0 is defined as the average length of several periods, i.e., as the average elapsed time between a small number of successive laryngeal cycles. In which way the averaging is performed, and how many periods are involved, is a matter of the individual algorithm. (4a)
This is the standard definition of T0 for any PDA that applies stationary short-term analysis, including the implementations of frequency-domain PDAs. Well-known methods, such as the cepstrum (Noll, 1967) or autocorrelation (Rabiner, 1977), follow this definition. The corresponding frequency-domain definition reads as follows.
F0 is defined as the fundamental frequency of an (approximately) harmonic pattern in the (short-term) spectral representation of the signal. It depends on the particular short-term transformation applied. (4b)
This definition is principally different from the previous ones. Above all, it is a long-term definition (Terhardt et al., 1982). The pitch perception theories were developed for stationary complex sounds and were only extended toward short pulse trains with varying amplitude patterns and constant frequencies, but not toward signals with varying fundamental frequency. Except for some investigations which indicate that the difference limen for F0 changes goes up by at least an order of magnitude when time-variant stimuli are involved (Harris and Umeda, 1987; 't Hart, 1981), the question of the behavior of the human ear with respect to short-term pitch perception is only partially answered, and our knowledge about what kind of short-term "analysis" is executed in the human ear and how it is executed is still incomplete. Hence even those PDAs that claim to be perception-oriented (e.g., Duifhuis et al., 1982; Hermes, 1988) enter the frequency domain in a similar way as in definition (4b), i.e., by a standard short-term transformation such as the discrete Fourier transform (DFT) with previous windowing of the signal.
Since the results of individual algorithms may be different according to the definition they follow, and since the definitions (1) through (5) are partly given in the time (or lag) domain, partly in the frequency domain, it is necessary to reestablish the relation between the time- and frequency-domain representations of pitch,

F0 = 1 / T0 ,   (6)

in such a way that when a measurement is carried out in one of the domains, however T0 or F0 are defined there, the representation in the other domain will always be established by this equation.
When the basic extractor operates directly on the original speech signal, the PDA operates in the time domain. It will thus measure T0 according to one of the definitions (1) through (3). In all other cases, somewhere in the preprocessor the time domain is left. Since the speech signal is time variant, this cannot be done other than by a short-term transformation; in this case we will usually determine T0 or F0 according to definitions (4a,b) or (5); in some rare cases (for instance, the AMDF) definition (3) may apply as well. Accordingly, we have the two PDA categories: a) time-domain PDAs, and b) short-term analysis PDAs.
Fig. 4. Methods of short-term analysis (short-time analysis) pitch determination: correlation techniques, frequency-domain (harmonic) analysis, and maximum-likelihood approaches. [Time and lag scales are identical; the frequency scale in the box Harmonic Analysis was magnified.]
One group of short-term analysis PDAs evaluates a distance function, for instance the average magnitude difference function AMDF (Sobolev and Baronin, 1968; Ross et al., 1974):

AMDF(d) = (1/N) sum_n |x(n) - x(n+d)| .

If the signal were strictly periodic, the distance function would take on a value of zero at d = T0. For the quasi-periodic speech signal there will be a strong minimum in the AMDF at this value of the lag (delay time) d. In contrast to all other short-term PDAs, where the estimate of T0 or F0 is indicated by a maximum whose position and value have to be determined, the minimum has an ideal target value of 0 so that we only need to determine its position. For this reason, distance functions do not require (quasi-)stationarity within the measuring interval; they can cope with very short frames of one pitch period or even less.
This principle thus represents the only short-term analysis PDA which is able to follow definition (3) (Moser and Kittel, 1977). The AMDF has also been successfully applied to the linear-prediction residual (Un and Yang, 1977).
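To make the principle concrete, here is a minimal sketch of an AMDF-based T0 estimator (our illustration, not part of the original paper; the function and parameter names are ours, and the 50-800 Hz search range follows the four-octave span quoted above):

    import numpy as np

    def amdf_period(x, fs, f0_min=50.0, f0_max=800.0):
        # Search lags within the plausible F0 range.
        d_min = int(fs / f0_max)
        d_max = int(fs / f0_min)
        n = len(x) - d_max            # samples usable at every lag
        amdf = np.array([np.mean(np.abs(x[:n] - x[d:d + n]))
                         for d in range(d_min, d_max + 1)])
        # The AMDF has a strong minimum at d = T0; locate it.
        d_best = d_min + int(np.argmin(amdf))
        return d_best / fs            # T0 in seconds; F0 = 1/T0

Because the ideal target value of the minimum is known to be zero, the frame passed to amdf_period may be as short as about one period, as noted above.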
The frequency-domain methods are also split up into two groups. Direct determination of F0 as the location of the lowest peak in the power spectrum is unreliable and inaccurate. It is thus preferred to investigate the harmonic structure of the signal. One way to do this is spectral compression, which computes the fundamental frequency as the greatest common divisor of all harmonics. The power spectrum is compressed along the frequency axis by a factor of two, three, etc. and then added to the original power spectrum. This operation gives a peak at F0 resulting from the coherent additive contribution of the higher harmonics (Schroeder, 1968; Noll, 1970; Martin, 1981, 1987). Some of these PDAs stem from theories and functional models of pitch perception in the human ear (Terhardt, 1979; Terhardt et al., 1982; Duifhuis et al., 1982; Hermes, 1988). The second frequency-domain technique leads back into the time domain. Instead of transforming the power spectrum itself (which would lead to the autocorrelation function), however, the inverse transform is performed on the logarithmic power spectrum. This results in the well known cepstrum (Noll, 1967), which shows a distinct peak at the "quefrency" (lag) d = T0.
Finally we have to mention the least-squares ("maximum likelihood") approach. This is originally a mathematical procedure to separate a periodic signal of unknown period T0 from Gaussian noise within a finite signal (Noll, 1970). Since the speech signal is not periodic, and since neither the background noise nor the aperiodic components of the speech signal itself can be expected to be Gaussian, the approach has to be slightly modified in order to work in a PDA (Wise et al., 1976; Friedman, 1977).
In summary, short-term analysis PDAs provide a sequence of average pitch estimates rather than a measurement of individual periods. They are not very sensitive to phase distortions or to the absence of the first partial.
3.2 Example: Double-Transform PDA with Nonlinear Distortion in the Frequency Domain
The sensitivity to strong first formants, especially when they coincide with the second or third harmonic, is one of the big problems in pitch determination. This problem is suitably met by some procedure of spectral flattening.
Spectral flattening can be achieved in several ways. One of them is time-domain nonlinear distortion, such as center clipping (Sondhi, 1968; Rabiner, 1977). A second way is linear spectral distortion by inverse filtering (Markel, 1972; Un and Yang, 1977). A third way is frequency-domain amplitude compression by nonlinear distortion of the spectrum. This algorithm operates as follows: 1) short-term analysis and transform into the frequency domain via a suitable discrete Fourier transform, 2) nonlinear distortion in the frequency domain, and 3) inverse Fourier transform. The resulting domain is again equivalent to the time domain; to avoid confusion, we will henceforth call it the lag domain.
Two members of this group were already mentioned: the autocorrelation PDA (Rabiner, 1977) and the cepstrum PDA (Noll, 1967), which are more closely related than one might conclude from the presentation in Fig. 4. It is well known that the autocorrelation function can be computed as the inverse Fourier transform of the power spectrum. Here,
the distortion consists in taking the squared magnitude of the complex spectrum. The cepstrum, on the other hand, uses the logarithm of the spectrum. The two methods therefore differ only in the characteristics of the respective nonlinear distortions applied in the spectral domain. The cepstrum PDA is known to be rather insensitive to strong formants at higher harmonics but to develop a certain sensitivity with respect to additive noise. The autocorrelation PDA, on the other hand, is insensitive to noise but rather sensitive to strong formants. Regarding the slope of the distortion characteristic, we observe the dynamic range of the spectrum being expanded by squaring the spectrum for the autocorrelation PDA, whereas the spectrum is substantially flattened by taking the logarithm. The two requirements - robustness against strong formants and robustness against additive (white) noise - are in some way contradictory. Expanding the dynamic range of the spectrum emphasizes strong individual components, such as formants, and suppresses wideband noise, whereas spectral flattening equalizes strong components and, at the same time, raises the level of low-energy regions in the spectrum, thus raising the level of the noise as well. Thus it is worthwhile to look for other characteristics that perform spectral amplitude compression. Sreenivas (1981) proposes the 4th root of the power spectrum instead of the logarithm. For larger amplitudes this characteristic behaves very much like the logarithm; for small amplitudes, however, it has the advantage of going to zero and not to minus infinity. Weiss et al. (1966) use the amplitude spectrum, i.e., the magnitude of the complex spectrum.
Indefrey et al. (1985) implemented these principles together with optional preprocessing to systematically investigate the performance of these PDAs. The four nonlinear spectral functions mentioned before (power spectrum, amplitude spectrum, fourth root of power spectrum, and logarithm) were, among other tests, evaluated using signals with added noise at various noise levels. The PDA was found to break down somewhere between -6 and -12 dB SNR. This value is consistent with data reported elsewhere in the literature for related PDAs (Schroeder, 1968; Noll, 1970; Wise et al., 1976) and shows that there exist a number of short-term PDAs that are extremely noise resistant.
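The common skeleton of these double-transform PDAs can be sketched as follows (our illustration; the Hann window and the search range are assumptions, and the nonlinearity is selectable among the four characteristics just compared):

    import numpy as np

    def lag_function(x, nonlinearity="root4"):
        # Short-term transform into the frequency domain.
        spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
        if nonlinearity == "power":      # |X|^2: autocorrelation function
            g = spec ** 2
        elif nonlinearity == "log":      # log|X|: cepstrum-like
            g = np.log(spec + 1e-12)     # guard against log(0)
        elif nonlinearity == "root4":    # 4th root of the power spectrum
            g = np.sqrt(spec)            # (Sreenivas, 1981)
        else:                            # |X|: amplitude spectrum
            g = spec                     # (Weiss et al., 1966)
        return np.fft.irfft(g)           # back into the lag domain

    def estimate_t0(x, fs, f0_min=50.0, f0_max=800.0):
        r = lag_function(x)
        d_min, d_max = int(fs / f0_max), int(fs / f0_min)
        return (d_min + int(np.argmax(r[d_min:d_max + 1]))) / fs

Setting nonlinearity to "power" or "log" reproduces the autocorrelation and cepstrum PDAs as special cases, which makes the trade-off between formant robustness and noise robustness directly observable.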
Knowing that many errors arise from a mismatch during short-term analysis (which results in too few or too many pitch periods within a given frame), Fujisaki et al. (1986) investigated the relations between the error rate, the frame length, and the actual value of T0 for an autocorrelation PDA which operates on the LP residual. The optimum occurs when the frame contains about three pitch periods. Since this value is different for every individual voice, a fixed-frame PDA runs nonoptimally in most situations. For an exponential window, however, this optimum converges to a time constant of about 10 ms for all voices. For a number of PDAs, for example the autocorrelation PDA, such a window permits recursive updating of the autocorrelation function, i.e., even sample-by-sample pitch estimation without excessive computational effort.
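A sketch of such a recursively updated, exponentially windowed autocorrelation (our illustration; the 10 ms time constant is the voice-independent optimum quoted above):

    import numpy as np

    def recursive_acf(x, fs, d_max, tau=0.010):
        # r[d] <- alpha * r[d] + x[n] * x[n-d], one update per sample
        alpha = np.exp(-1.0 / (tau * fs))   # exponential-window decay
        r = np.zeros(d_max + 1)
        snapshots = []
        for n in range(len(x)):
            for d in range(min(n, d_max) + 1):
                r[d] = alpha * r[d] + x[n] * x[n - d]
            snapshots.append(r.copy())      # a pitch estimate is thus
        return snapshots                    # available at every sample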
This category of PDAs is less homogeneous than that of the short-term analysis methods. One way to split them up is according to how the burden of data reduction is distributed between the preprocessor and the basic extractor. Doing this, we find most time-domain PDAs between two extremes (Fig. 5):
1) The burden is imposed on the preprocessor. In the extreme case, only the waveform of the first harmonic is offered to the basic extractor.
2) The burden is imposed on the basic extractor, which then has to cope with the whole complexity of the temporal signal structure. In the extreme case, the preprocessor is totally omitted.

Fig. 5. Categorization of time-domain pitch determination: multi-channel analysis (different principles; different pitch ranges), extraction of the fundamental harmonic (nonzero threshold; exponential decay), and structural analysis (structure simplification; sequence of extremes).
Time-domain PDAs are principally able to track the signal period by period. At the output of the basic extractor we find a sequence of period boundaries (pitch markers). Since the local information on pitch is taken from each period individually, time-domain PDAs are more sensitive to local signal degradations and thus less reliable than the majority of their short-term analysis counterparts. On the other hand, time-domain PDAs may still operate correctly even when the signal itself is irregular due to temporary voice perturbation or laryngealization.
A pitch period is the truncated response of the vocal tract to an individual glottal impulse. Since the vocal tract behaves like a lossy linear system, its impulse response consists of a sum of exponentially damped oscillations. It is therefore to be expected that the magnitude of the significant peaks in the signal is greater at the beginning of the period than toward the end (Fig. 6). Appropriate investigation of the signal peaks (maxima and/or minima) leads to an indication of periodicity.
There are problems associated with this approach, however. First, the frequencies of the dominant damped waveforms are determined by the local formant pattern and may change abruptly. Second, the damping of the formants, particularly of a low first formant, is often quite weak and can be overrun by temporary changes of the signal level. Third, if the signal is phase distorted, different formants may be excited at different points in time. These problems are surmountable, but they lead to relatively complicated algorithmic solutions which have to take account of a great variety of temporal structures. Since most of the program instructions are decisions, however, these PDAs run relatively fast. The usual way to carry out the analysis is the following (Reddy, 1967; N.J. Miller, 1975; D. Howard, 1989); a sketch follows the list.
1) Do a moderate low-pass filtering to remove the influence of higher formants.
2) Determine all the local maxima and minima.
3) Exclude those extremes which are found insignificant until one significant point per period is left.
4) Reject obviously incorrect points by local correction.
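A compact sketch of these four steps (our illustration; the low-pass cutoff, the prominence test, and the minimum peak distance are simplified stand-ins for the more elaborate significance rules of the cited algorithms):

    import numpy as np
    from scipy.signal import butter, filtfilt, find_peaks

    def pick_period_markers(x, fs, f0_max=800.0):
        # 1) moderate low-pass filtering against higher formants
        b, a = butter(4, 1000.0 / (fs / 2))
        y = filtfilt(b, a, x)
        # 2)+3) keep only significant maxima, at most one per period
        peaks, _ = find_peaks(y,
                              prominence=0.3 * np.max(np.abs(y)),
                              distance=int(fs / f0_max))
        # 4) local correction of obviously incorrect points is omitted
        return peaks          # sample indices of the period markers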
A more recent approach (I. Howard et al., 1989) uses a neural network operating on the output of a filter bank, similar to a spectrogram (cf. Fig. 11). The basic extractor consists of a four-layer perceptron structure, the input layer comprising 41 successive samples with 9 channels each. Two hidden layers with 10 units each and a fully connected network are followed by a one-unit output layer which is intended to yield an impulse when the network encounters a signal structure associated with the instant of glottal closure. The network is trained using (differentiated) output signals of a laryngograph (cf. Sect. 6.2) as reference data. Such a structure has the advantage that it can be based upon several features occurring at different instants during a pitch period. It was shown to outperform conventional devices of the same type, for instance the peak-picking PDA (D. Howard, 1989; see next paragraph), which was evaluated for comparison.
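The described network structure can be written down directly; in the sketch below (our illustration; the sigmoid activations are an assumption, the layer sizes follow the text):

    import torch.nn as nn

    # 41 successive samples x 9 channels = 369 inputs; two hidden
    # layers of 10 units; one output unit intended to pulse at the
    # instant of glottal closure.
    glottal_closure_net = nn.Sequential(
        nn.Linear(41 * 9, 10), nn.Sigmoid(),
        nn.Linear(10, 10), nn.Sigmoid(),
        nn.Linear(10, 1), nn.Sigmoid(),
    )
    # Training would regress the output onto (differentiated)
    # laryngograph pulses (cf. Sect. 6.2).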
A different solution originates from the analog domain (Dolansky, 1955; Filip, 1969; Winckel, 1964). The envelope of the period is modeled by a cascade of analog differentiators and diode-resistance-capacitance circuits with short rise time constants and comparatively long decay time constants. These circuits emphasize the principal peaks of the signal and suppress all the others. The performance, however, strongly depends on the proper adjustment of the decay time constants. For that reason this relatively simple device works well only for a restricted range of F0 (about 2 octaves). A manual range switch or something similar is required if a wider range of F0 is to be analyzed. Due to its simplicity, this principle has been revived in a recent application for cochlear prostheses (D. Howard, 1989). Using a logarithmic amplifier, Howard's PDA avoids a lot of problems associated with the older devices, and his device compares favorably to a number of other PDAs tested for this special application.
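In digital form, the principle amounts to a threshold that rises instantly with each principal peak and then decays exponentially; a minimal sketch (our illustration; the decay constant and the 0.7 re-arm factor are arbitrary choices that mimic the restricted, roughly two-octave range discussed above):

    import numpy as np

    def decay_markers(x, fs, tau_decay=0.008):
        alpha = np.exp(-1.0 / (tau_decay * fs))   # long decay constant
        level, markers = 0.0, []
        for n, v in enumerate(x):
            level *= alpha                 # slow exponential decay
            if v > level:                  # signal overtakes envelope
                if v > 0 and level < 0.7 * v:
                    markers.append(n)      # principal peak -> marker
                level = v                  # short rise time constant
        return markers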
F0 can be detected in the signal via the waveform of the fundamental harmonic. If present in the signal, this harmonic is extracted from the signal by extensive low-pass filtering in the preprocessor. The basic extractor can then be relatively simple. Figure 7 shows the principle of three basic extractors: the zero-crossings analysis basic extractor as the simplest device, the nonzero threshold basic extractor, and finally the threshold analysis basic extractor with hysteresis. The zero-crossings analysis basic extractor sets a marker whenever the zero axis is crossed with a defined polarity. This requires that the input waveform have two and only two zero crossings per period. The threshold analysis basic extractor sets a marker whenever a given nonzero threshold is exceeded. The threshold analysis basic extractor with hysteresis acts like the normal threshold analysis basic extractor except that the marker is not set again before a second (lower) threshold has been crossed in the opposite direction. This more elaborate device requires a lesser degree of low-pass filtering in the preprocessor.
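A sketch of the most elaborate of the three variants, the threshold analysis with hysteresis (our illustration; the two thresholds are passed in explicitly rather than adapted to the signal amplitude):

    def hysteresis_markers(x, upper, lower):
        markers, armed = [], True
        for n in range(1, len(x)):
            # fire once per period on the upward crossing of 'upper'
            if armed and x[n - 1] < upper <= x[n]:
                markers.append(n)
                armed = False
            # re-arm only after the signal falls below 'lower'
            elif not armed and x[n] < lower:
                armed = True
        return markers

The zero-crossings extractor is the special case upper = lower = 0 with a fixed polarity convention; the plain nonzero-threshold extractor re-arms at the same threshold (lower = upper).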
The requirement of extensive low-pass filtering is one of two weak points of this otherwise fast and simple principle. For the zero-crossings analysis basic extractor an attenuation of 18 dB/octave is necessary within the range of F0 to be determined (McKinney, 1965; cf. also Fig. 7). Accordingly, the amplitude of the signal at the basic extractor will vary by more than 50 dB due to the variations of F0 alone. This dynamic range, increased by the intrinsic dynamic range of the signal (at least another 30 dB), is too much for the PDA to work correctly over the whole range of F0. The application of a zero-crossings analysis basic extractor thus limits the possible fundamental frequency range. For the threshold analysis basic extractor the problem is not so acute, but the fact that the threshold must be adapted to the overall signal amplitude complicates the design of the PDA. In addition there is a systematic measurement artefact associated with the threshold-analysis basic extractor when the amplitude of the input signal varies and the threshold is not properly adapted (Fig. 8). Another inaccuracy (Fig. 9) is intrinsic to the first partial of the signal. When F0 is close to the formant F1, variations of that formant result in time-variant phase distortions of the first partial which will locally change the period duration and with it the T0 estimate. These inaccuracies are in the order of a few percent; yet they may be intolerable if the respective application requires high accuracy.
Fig. 8. Measurement errors introduced by changes of the signal amplitude in conjunction with a threshold-analysis basic extractor. Input signal: sinusoid (upper line). (Bottom line) Deviation of the estimate Te (threshold as in upper line). Only a zero-crossings analysis basic extractor or a maximum detector avoids this error.

Fig. 9. Measurement errors introduced by changes of the formant F1: example. Input signal: sinusoid with the frequency f = F0 = 420 Hz (upper line). The formant changes from 300 to 650 Hz (this corresponds to a transition [i-a]). The measurement of T0 is influenced by the phase shift caused by the formant change. (Third line) Deviation of the estimate Te when a zero-crossings analysis basic extractor is used; (bottom line) same for a threshold-analysis basic extractor (threshold as in second line).

In a number of applications, such as voice quality measurement or the preparation of reference elements for time-domain speech synthesis (Charpentier and Moulines, 1989), where the signals are expected to be clean, the use of a PDA applying first-partial processing may be advantageous. Dologlou and Carayannis (1989) developed a PDA that overcomes a great deal of the problems associated with the filter necessary to isolate the first partial. An adaptive linear-phase low-pass filter is applied in the preprocessor. This filter consists of a variable-length cascade of second-order filters with a double zero in the z plane at z = -1. These filters are consecutively applied to the input signal; after each step the algorithm tests whether the higher harmonics are sufficiently attenuated; if so, the filtering stops. T0 is then derived from the remaining first partial by a simple maximum detector. Very low-frequency noise is tolerable since it barely influences the positions of the maxima.
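A sketch of this adaptive cascade (our illustration; the spectral stopping test is a simplified stand-in for the published criterion):

    import numpy as np

    def isolate_first_partial(x, max_stages=200, residual=0.01):
        y = np.asarray(x, dtype=float)
        for _ in range(max_stages):
            spec = np.abs(np.fft.rfft(y * np.hanning(len(y)))) ** 2
            k0 = int(np.argmax(spec[1:])) + 1        # strongest line
            # stop when energy above the strongest line is negligible
            if spec[int(1.5 * k0):].sum() < residual * spec.sum():
                break
            # one stage: linear-phase FIR with a double zero at z = -1,
            # y[n] = (x[n] + 2 x[n-1] + x[n-2]) / 4
            y = np.convolve(y, [1.0, 2.0, 1.0], mode="same") / 4.0
        return y    # T0 then follows from a simple maximum detector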
The second weak point is that this principle is a priori restricted to environments where the first harmonic is present in the signal. There are many applications where this is the case (for instance, in voice quality measurement). If such a PDA, however, is to be applied to processing band-limited signals, the first harmonic must be enhanced or reconstructed. One way to do this is nonlinear distortion. In that respect, many proposals have been made from the beginning on (e.g., Grützmacher and Lottermoser, 1937; Risberg et al., 1960). No single nonlinear characteristic, however, is able to enhance the first harmonic of the signal in an optimal way for any situation, i.e., for any speaker or environmental condition (McKinney, 1965; Hess, 1979); some of them work well in a constrained environment (for instance only with band-limited signals or male voices) or in a realization where several channels with different nonlinear functions are combined (Hess, 1979).
Algorithms of this type take on some intermediate position between the principles of structural analysis and fundamental harmonic extraction. The majority of these algorithms follow one of two principles: a) inverse filtering, and b) epoch detection. Both these principles deal with the fact that the laryngeal excitation function has a temporal structure which is much simpler and more regular than the temporal structure of the speech signal itself, and both methods, when they work, are able to follow definition (1) if the signal is not grossly phase distorted.
The inverse filter approach cancels the transfer function of the vocal tract and thus reconstructs the laryngeal excitation function (cf. also Sect. 5.1). If one is interested in pitch only and not in the excitation function itself, a crude approximation of the inverse filter is sufficient. Such an approximation is realized, for instance, when the analysis is confined to the first formant (Hess, 1976). The inverse filter approach has one weak point which occurs frequently with female voices. When F0 is high, it may coincide with the first formant. If the
inverse filter is not blocked, it then removes the fundamental harmonic (which is extremely strong in this case) from the signal and causes the PDA to fail.
The second principle, epoch detection, is based upon the fact that at the beginning of each laryngeal pulse there is a discontinuity (in the form of an impulse) in the second derivative of the excitation function. Usually this discontinuity cannot be reliably detected in the speech signal due to phase distortions which occur when the waveform passes the vocal tract. The signal is thus first phase shifted by 90° (applying a Hilbert transform). The squares of the original and the phase-shifted signals are then added and yield a new signal which represents the instantaneous amplitude of the signal and now shows a distinct peak at the time when the discontinuity in the excitation function occurs. The original method works only when the spectrum of the investigated signal is flat to some extent. To enforce spectral flatness, the analyzed signal is, for instance, band-limited to high frequencies well above the narrow-band lower formants (Ananthapadmanabha and Yegnanarayana, 1975). Another way is to analyze the LPC residual (Ananthapadmanabha and Yegnanarayana, 1979) or to filter the signal into subbands (De Mori et al., 1977).
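The core of the epoch detector just described fits in a few lines (our illustration; it presumes the spectral flattening mentioned above, e.g., operating on the LPC residual):

    import numpy as np
    from scipy.signal import hilbert, find_peaks

    def epoch_candidates(x, fs, f0_max=800.0):
        analytic = hilbert(x)              # x + j*Hilbert{x} (90° shift)
        envelope = np.abs(analytic) ** 2   # x^2 + (Hilbert{x})^2
        # peaks of the instantaneous amplitude mark the excitation
        # discontinuities; allow at most one epoch per shortest period
        peaks, _ = find_peaks(envelope, distance=int(fs / f0_max))
        return peaks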
The epoch detection principle depends on the presence of a discontinuity in the second derivative of the laryngeal excitation function. This discontinuity is often weak, especially in back vowels like [u], when a formant exactly coincides with the first or a higher harmonic, or when speech is uttered with a soft or a falsetto voice. In two more recent approaches (Di Francesco and Moulines, 1989; Cheng and O'Shaughnessy, 1989), this drawback was overcome by the finding that the global statistical properties of the waveform change with glottal opening and closing as well. These PDAs, which exploit different features of the signal and were developed independently of each other, derive and apply a generalized maximum-likelihood measure that indicates the instant of glottal closure more precisely than previous epoch-detection PDAs (cf. also Sect. 5.2).
Except for the algorithmic investigation of the temporal structure and, nowadays, epoch detection, most simple time-domain PDAs are restricted with respect to the range of F0 or the type of signal to be processed. One way to increase the range or the reliability of these PDAs is to implement several of them in parallel and to perform some decision as to which one has the "correct" output. The partial PDAs may be identical in design, and each of them may process a subrange of F0 (McKinney, 1965; Leon and Martin, 1969). On the other hand, they may apply different principles without restriction of the frequency range. The PDAs by Risberg et al. (1960) or by Hess (1979), for instance, use several nonlinear functions to enhance the first harmonic in different ways. Gold and Rabiner (1969) combine several simple peak detection basic extractors together with a pattern-matching procedure. The selection criteria used to find the most likely channel are defined by a certain channel hierarchy, by a regularity check applying a minimum-frequency selection principle (Risberg et al., 1960; Hess, 1979), by statistical measures (Bruno et al., 1982), or by syntactic rules (De Mori et al., 1977). The selection is continuously checked so that the PDA is able to change its choice at any time.
Fig. 11. Realization of a PDA with subbands and rectification (Yaggi, 1962). Such a configuration was later used by Howard and Walliker (1989) as input to a neural network. [Panels: input signal; the signal after band-pass filtering, rectification, and smoothing in 19 channels with center frequencies of 120, 257, 392, 525, 658, 791, 923, 1060, 1206, 1370, 1552, 1751, 1970, 2211, 2476, 2765, 3081, 3425, and 3800 Hz; output signal.]
One problem with time-domain multichannel PDAs is that the individual channels often mark period boundaries at different instants in time, for instance when significant maxima and minima are exploited independently of each other (Gold and Rabiner, 1969). Unless there is a special synchronization routine (Hess, 1979), such PDAs are no longer able to correctly synchronize themselves with the signal and thus have to operate according to the incremental definition (3) or even the short-term definition (4), although they pertain to the time-domain category.
Multi-channel preprocessing by a filter bank dates back to the days of the channel vocoder, where the spectral analyzer could also be used as a preprocessor for pitch determination. If the bandwidths of the channels are not too great, there will not be more than
one partial in the lower channels and not more than one formant in the mid and upper channels each. A PDA can thus easily extract the fundamental harmonic once it knows in which channel it is to be found. On the other hand, the filter bank output, taken as a whole, behaves in a way similar to a wide-band spectrogram. Those channels which carry the waveforms of the formants coherently reveal maxima of the envelope at the beginning of each pitch period after the instant of glottal closure. This feature can also be exploited for a subsequent structural analysis.
One of the first PDAs of this kind was developed and investigated by Yaggi (1962). Yaggi, however, reported problems with phase distortions in the filter bank. With today's digital filter technology such filter banks can be built as linear-phase networks, and the recent wavelet transform (cf. the PDA by Kadambe and Boudreaux-Bartels, 1990), which may be applied like a bank of octave filters, provides another effective means for its implementation. Such a preprocessor (with 9 channels) also serves as the input for the PDA by I. Howard et al. (1989), where the basic extractor is realized by a neural network which performs a structural analysis and is trained to determine the instant of glottal closure.
Glottal inverse filtering is the approximative reconstruction of the excitation signal (the glottal waveform) from the speech signal. From the linear model of speech production we know that the voiced speech signal x(n) can be thought of as being generated by the pulse generator characterized by its z transform P(z). The pertinent pulse sequence p(n) passes the glottal shaping filter G(z), at the output of which we have the glottal excitation signal s(n). This signal excites the supraglottal system consisting of the vocal tract V(z) and the radiation component R(z). In terms of transfer functions we obtain

X(z) = A P(z) G(z) V(z) R(z) ,

where A represents the overall amplitude. A PDA, in this model, can be defined as a device which determines P(z) from X(z). For glottal inverse filtering the task would then read

S(z) = X(z) / [V(z) R(z)] .

Thus a filter has to be applied whose transfer function reverts the influence of the vocal tract and the radiation component.
In speech production the radiation component is the low-impedance load which terminates the vocal tract; the volume velocity of the air flow at the lips (and the nose) is converted into sound pressure in the distant field. In a first approximation, which is valid for lower frequencies where the wavelength is large compared to the diameter of the mouth opening, this conversion involves a differentiation, causing a zero at zero frequency. In the inverse filter this zero is reverted by an integrator component, i.e., by a first-order recursive filter with a pole near z = 1. For reasons of stability, the pole must stay inside the unit circle.
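In code, this inverse radiation component is a one-line recursive filter (our illustration; the pole value 0.995 matches the integrator quoted in Fig. 12):

    from scipy.signal import lfilter

    def inverse_radiation(x, pole=0.995):
        # y[n] = x[n] + pole * y[n-1]: first-order integrator with the
        # pole inside the unit circle for stability
        return lfilter([1.0], [1.0, -pole], x)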
Fig. 12a-h. Inverse-filter analysis. (a) Signal: sustained vowel /e/, speaker LJB (male), 32 ms per line; (b) waveform of the formant F1; (c-e) same as (b), this time for the formants F2-F4; (f) differentiated output signal of the inverse filter; (g) output signal of the inverse filter; (h) reconstructed glottal excitation signal, filtered by the inverse filter and the integrator. The inverse filter was tuned to the following formant frequencies and bandwidths: F1=357 Hz, F2=2056 Hz, F3=2493 Hz, F4=3500 Hz; B1=26 Hz, B2=40 Hz, B3=150 Hz, B4=250 Hz. The transfer function of the integrator filter used is 1/(1 - 0.995 z^-1). All signals were normalized before plotting. The numbers on the right-hand side of (a-e) indicate the amplitude of the signal and the individual formants; the amplitude of the signal was normalized to a value of 10000.
As glottal inverse filtering is intended to yield a waveform rather than certain instants in time, the signal must not at all be phase distorted at low frequencies - a condition nowadays easily met by digital recording equipment.
Glottal inverse filtering requires accurate determination of all formants. For this the following principles have been implemented:
1) individual determination of the different formants, mostly in an interactive way (e.g., Lindqvist, 1965);
2) automatic formant measurement by nonstationary linear-prediction (LPC) analysis during the closed-glottis interval (Wong et al., 1979; Alku, 1992); and
3) cepstrum techniques.
In the classical method (e.g. Lindqvist, 1965), which is carried out in an interactive way, an
antiresonance circuit (i.e., a second-order filter with a complex zero) is provided to cancel each formant individually. The input signal is confined to stationary vowels with significant high-frequency components and formants that are well separable, such as [a] or [e]. A crude formant analysis provides reasonable initial estimates. Then the antiresonance filters are manually adjusted to the frequencies and bandwidths of the individual formants. Figure 12 shows an example.
A glottal inverse filter using linear-prediction (LPC) analysis was proposed by Wong, Markel, and Gray (1979). Linear prediction models the speech tract as a digital all-pole filter,

x(n) = e(n) + sum_{i=1}^{k} a_i x(n-i) ,   (10)
and determines the filter coefficients in such a way that the filter optimally matches the structure of the signal. "Optimally," in this respect, means that the filter has been optimized according to a given criterion. The criterion mostly used involves minimizing the short-term energy of the prediction error, i.e., the energy of the residual signal e(n) within the frame analyzed. This criterion must be further confined for this special application.
Equation (10) says that a sample x(n) can be approximately predicted as the weighted sum of the k previous samples of the signal x; e(n) will be the prediction error at the instant n. From the speech production point of view, if x(n) is the speech signal, and if the
filter is to serve as a model for the speech tract, then e(n) represents some kind of excitation signal; however, e(n) is usually not identical with the glottal waveform. LPC analysis can be used here when the algorithm is modified in such a way that e(n) represents the glottal waveform itself or at least a waveform having a defined relation to it. The most straightforward way to achieve this is to verify that the LPC filter transfer function A(z) represents the transfer function V(z) of the vocal tract; in this case the residual signal e(n) represents the glottal waveform except for the radiation component, whose reciprocal must be supplied in the form of the first-order integrator filter already known from the earlier discussion in this section.
If A(z) is to represent the vocal-tract transfer function V(z), it is necessary to be certain that the poles of A(z) represent formants only and nothing else. This leads to a modification of the LPC algorithm which involves the following two steps.
1) The poles of A(z) have to be explicitly determined after the analysis; poles that do not pertain to a formant must be excluded from the inverse filter. Routines which perform this task are standard in most scientific program libraries. Once the poles are explicitly known, one can easily assign them to the formants as far as possible and exclude the remainder. One can also exclude a whole frame from further processing if the LPC algorithm has obviously missed a formant (this happens, for instance, when two real poles are supplied instead of a low-frequency or high-frequency formant).
2) In order to represent the vocal-tract transfer function V(z) as accurately as possible,
the LPC analysis should be carried out during the closed-glottis interval only. During the
open-glottis interval the subglottal system and the vocal tract are coupled via the glottis.
This coupling affects the transfer function of the supraglottal system: subglottal formants
Fig. 13a-f. Glottal inverse filter by Wong et al. (1977, 1979): example of performance. (a) Signal: sustained vowel /e/, male speaker, 32 ms per line; same signal as in Fig. 12; (b) prediction error depending on the starting point q of the frame, with the maximum of the normalized error indicated on the right-hand side; (c) reconstructed glottal waveform (the integrator being the same as in Fig. 12); (d) differentiated output signal of the inverse filter; (e) locations of the poles of A(z) in the z plane for those cases where A(z) was found appropriate to serve for use in the inverse filter; (f) locations of the poles of A(z) for all other cases; the frame selected for computation of the inverse filter is marked. All the frames which pertain to (e) have been marked in (b) by a short continuation line below the baseline. Formant frequencies and bandwidths for the inverse filter applied: F1=352 Hz, F2=2081 Hz, F3=2652 Hz, F4=3733 Hz; B1=10 Hz, B2=109 Hz, B3=193 Hz, B4=246 Hz. The constraints of the LPC analysis to separate those frames which are suited for selection for the inverse filter (e) from the remainder (f) are rather simple. A frame was excluded from selection when 1) the pertinent LPC filter was not stable, 2) fewer than 4 formants were detected in the frame, 3) the frame contained a formant frequency below 250 Hz, or 4) one or several formants had excessively large bandwidths. Although there is some variance in the estimates in (e), the formant frequencies and bandwidths are determined rather consistently for the pertinent frames.
and antiformants are added to the overall transfer function, and the frequencies and band-
widths of the vocal-tract formants are slightly changed. (Wakita and Fant, 1978). For nor
mal LPC analysis the global estimate is sufficient; here, however, greater accuracy is re
quired.
HES-24
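A rough sketch of the pole screening of step 1, using the exclusion rules quoted in the caption above (the numerical bandwidth ceiling is our assumption, since the caption only says "excessively large"):

    import numpy as np

    def screen_lpc_poles(A, fs, f_min=250.0, bw_max=500.0, n_required=4):
        # Poles of the LPC model are the roots of A(z); each complex pole
        # pair above the real axis maps to a candidate formant.
        poles = np.roots(A)
        if np.any(np.abs(poles) >= 1.0):                  # rule 1: unstable
            return None
        cands = []
        for p in poles[poles.imag > 0]:
            f = np.angle(p) * fs / (2.0 * np.pi)          # frequency in Hz
            bw = -np.log(np.abs(p)) * fs / np.pi          # -3 dB bandwidth
            cands.append((f, bw))
        formants = sorted((f, bw) for f, bw in cands
                          if f >= f_min and bw <= bw_max) # rules 3 and 4
        if len(formants) < n_required:                    # rule 2
            return None                                   # exclude the frame
        return formants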
Compared to an ordinary frame for LPC, the closed-glottis interval is rather short, so that
the covariance method of linear prediction has to be applied. If the assumption holds that
the vocal tract is not excited during the closed-glottis interval, the prediction error will be
very low in this case, since the vocal tract then represents a linear passive all-pole system.
To determine the closed-glottis interval, therefore, the LPC analysis (using the covariance
method and a frame length K which guarantees that k+K+1 does not exceed the length of
the closed-glottis interval) must be carried out at each sample individually (i.e., using a
frame interval equal to the sampling interval of the signal). A low prediction error then indicates
that the frame is totally embedded in the closed-glottis interval (cf. Fig. 13).
An alternative criterion for the selection of the closed-glottis interval is the stability of
the modeled filter A(z). During the closed-glottis interval the waveforms pertaining to the
formants always decay; in this case the LPC filter A(z) will be stable. On the other hand,
an unstable filter A(z) indicates that there is strong excitation within the analysis interval.
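A sketch of this sample-by-sample covariance analysis (a minimal illustration; frame length, predictor order, and the use of a least-squares solver are our choices):

    import numpy as np

    def cov_lp_error(x, q, K, k):
        # Covariance-method LP over the frame x[q : q+k+K]: predict each of
        # the K samples x[q+k] ... x[q+k+K-1] from its k predecessors and
        # return the normalized prediction-error energy.
        n = np.arange(q + k, q + k + K)
        X = np.column_stack([x[n - i] for i in range(1, k + 1)])
        a, *_ = np.linalg.lstsq(X, x[n], rcond=None)
        e = x[n] - X @ a
        return float(e @ e) / max(float(x[n] @ x[n]), 1e-12)

    x = np.random.randn(2000)        # stand-in for a vowel segment
    err = [cov_lp_error(x, q, K=30, k=12) for q in range(len(x) - 50)]
    # minima of err mark frames embedded in the closed-glottis interval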
A problem with the algorithm by Wong et al. (1979) is that it requires an LP analysis
over the closed-glottis interval. In some voices the closed-glottis interval is very short, or
the glottis never closes completely at all. This degrades the estimate of the formants
and thus the performance of the algorithm. Alku (1992) developed a glottal inverse filter
that allows us to perform an iterative LP analysis more globally. First the general slope of
the spectrum is approximately flattened by an inverse filter of order 1 to yield an optimal
starting point for formant estimation. An LP analysis is carried out over that filtered signal
to yield a representation of the transfer function V(z) of the vocal tract. The original signal
is then inverse filtered with this filter and passed through an integrator filter. This yields a
reasonable estimate of the glottal waveform, which is then refined in a second iteration
that is almost identical to the first part of the algorithm. It is only now, however, that the
frame length is confined to exactly one pitch period, ranging from one point of maximal
glottal opening (which is determined from the glottal waveform estimate) to the next one.
Again the spectrum is flattened using a low-order inverse LP filter, and the vocal-tract
transfer function is estimated. Since the algorithm now acts period synchronously, the results
are much more accurate than in the first step. Again the original signal is inverse filtered
with 1/V(z) and passed through an integrator filter to cancel the effect of lip radiation;
this yields the final estimate for the glottal waveform.
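A compressed, single-pass sketch of this iterative inverse filtering idea (the period-synchronous second pass is omitted, and the filter orders and integrator leak are assumptions of ours):

    import numpy as np
    from scipy.signal import lfilter

    def lp_inverse(x, order):
        # Least-squares linear prediction; returns the inverse filter A(z).
        n = np.arange(order, len(x))
        X = np.column_stack([x[n - i] for i in range(1, order + 1)])
        a, *_ = np.linalg.lstsq(X, x[n], rcond=None)
        return np.concatenate(([1.0], -a))

    def iaif_first_pass(x, vt_order=12):
        # 1) flatten the overall spectral slope with an order-1 inverse filter
        g1 = lp_inverse(x, 1)
        x_flat = lfilter(g1, [1.0], x)
        # 2) estimate the vocal-tract filter V(z) on the flattened signal
        A = lp_inverse(x_flat, vt_order)
        # 3) inverse filter the ORIGINAL signal with 1/V(z), then integrate
        #    to cancel lip radiation: glottal waveform estimate
        resid = lfilter(A, [1.0], x)
        return lfilter([1.0], [1.0, -0.99], resid)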
Among all events that characterize the pitch period, the instant of glottal closure (IGC) occupies
a key position. Due to the Bernoulli force exerted on the vocal cords by the air flow
in the glottis during the open-glottis interval, the vocal cords are so strongly forced together
that they close abruptly and remain closed for about half the glottal cycle (for details see
the discussion in Sect. 3.1). The air flow is abruptly terminated; this causes a discontinuity
in the time derivative of the glottal volume velocity. All formants, particularly the higher
ones, are thus simultaneously excited at the IGC. It is thus justified, from the speech production
point of view, to define the beginning of the pitch period in the speech signal to coincide
with the IGC.
The IGC is rather prominent in normal phonation, i.e., modal register and medium
voice effort. It is rather prominent during vocal fry as well. For soft voices as well as for the
falsetto register glottal closure still occurs, but somewhat more gradually. In some special
cases (breathy voice, certain voice pathologies) the glottis never closes completely. This
kind of speech is characterized by weak higher formants. On the other hand, the instant of
glottal opening, which passes rather smoothly most of the time, tends to exhibit a second
discontinuity (and thus tends to become a second point of excitation) when the voice effort
is high.
We can thus expect that the IGC usually represents the most significant and - at the
same time - the most easily detectable single event within the pitch period when a reference
point with respect to the excitation function is required. In spite of this, the task of
IGC determination is not at all trivial.
Scanning the PDAs discussed up to now, we see that the algorithms that apply structural
simplification (in particular epoch detection) are best suited for IGC determination. In
principle most time-domain PDAs place their markers at positions which have some defined
relation to the excitation signal. But in many cases this relation is time variant, since
it depends on the momentary state of the vocal tract. In addition, IGC determination implies
the detection of a discontinuity, which is wide-band information and which is thus
masked both by narrow-band formants and by high-frequency attenuation in the signal.
The PDA by Ananthapadmanabha and Yegnanarayana (1979) raises the question of the
phase of the excitation signal. The ideal case is given when the excitation pulse has a unipolar
peak. If the excitation signal is phase shifted by 90°, the IGC coincides with a zero
crossing of the excitation pulse, and the amplitude of the pulse is much reduced. This difficulty
is overcome by investigating the instantaneous magnitude of the signal, which is pulse-like
when the spectrum of the signal investigated is approximately flat.
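One common route to the instantaneous magnitude is the envelope of the analytic signal via the Hilbert transform (a one-line sketch of ours; the paper does not prescribe this particular implementation):

    import numpy as np
    from scipy.signal import hilbert

    def instantaneous_magnitude(x):
        # Magnitude of the analytic signal; for an approximately flat
        # spectrum this envelope is pulse-like around each excitation
        # instant, regardless of the phase of the excitation pulse.
        return np.abs(hilbert(x))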
The already-mentioned PDA by I. Howard et al. (1989), which applies a neural network
for structural analysis of the output of a filter bank, can be trained toward detecting the
IGC. The neural network performs some kind of holistic scan of the structural properties
of the signal segment at its input layer and fires at the moment for which it has been
trained. This means that the temporal assignment between the temporal structure of the
signal and the instant at which the device signals a pitch-period boundary is arbitrary and
a matter of training. The PDA will thus be trained to detect the IGC when the desired output
of the neural net has such a shape that it is close to unity at the IGC and close to zero
everywhere else. The differentiated output signal of a laryngograph, after suitable normalization,
has this property (cf. Sect. 6.2).
To evaluate the performance of a measuring device, one should have another instrument
with at least the same accuracy. If this is not available, at least objective criteria - or data
- are required to check and adjust the behavior of the new device. In pitch and voicing
determination both these bases of comparison are tedious to generate. There is no PDA
which operates without errors (Rabiner et al., 1976). There is no reference algorithm, even
with instrumental support, that goes completely without manual inspection or control
(Krishnamurthy and Childers, 1986; Hess and Indefrey, 1987). Only rather recently have speech
databases with reference pitch contours and voicing information become available
(e.g., Carre et al., 1984; Picone et al., 1987), and only then did designers of new PDAs start
providing detailed data on the performance of their algorithms (e.g., Fujisaki et al., 1986;
Indefrey et al., 1985).
When a PDA is equipped with an error-detecting routine (and the majority of PDAs are,
even if no postprocessor is used), and when it detects that an individual estimate may be
wrong, it is usually not able to reliably decide whether this situation is a true measurement
error - which should be corrected or at least indicated - or a signal irregularity, where the
estimate may be correct and should be preserved as it is. This inability of most PDAs to
distinguish between the different sources of errorlike situations is one of the great problems
in pitch determination yet unsolved.
Measurement inaccuracies cause a noisiness of the obtained T0 or F0 contour. They are
small deviations from the correct value but can nevertheless be annoying to the listener.
Again there are three main causes.
1) Inaccurate determination of the key feature. This applies especially to algorithms that
exploit the temporal structure of the signal, for instance when the key feature is a principal
maximum whose position within a pitch period depends on the formant F1.
2) Intrinsic measurement inaccuracies, such as the ones introduced by sampling in digital
systems.
3) "Errors" from small fluctuations of the voice (jitter or shimmer), which contribute to
the perception of "naturalness" and should thus be preserved (or even measured).
Voicing errors are misclassifications of the VDA. We have to distinguish between voiced-to-unvoiced
errors, where a frame is classified unvoiced although it is in fact voiced, and
unvoiced-to-voiced errors with the opposite direction of misclassification. This scheme, as established
by Rabiner et al. (1976), does not take into account mixed excitation. Voiced-to-unvoiced
errors and unvoiced-to-voiced errors must be regarded separately because they are
perceptually not equivalent (Viswanathan and Russell, 1984), and the reasons leading to
such errors in an actual implementation may be different and even contradictory.
Fig. 14a-c. Speech signal (a), laryngogram (b), and differentiated laryngogram (c). The markers
delimiting the individual periods were derived from the maxima of (c). Signal: transition [ja];
speaker WGH (male)
The vibration of the vocal folds within each pitch period causes the laryngeal conductance
to become time variant; thus the HF current
is amplitude modulated. In the receiver the current is demodulated and amplified. Finally,
the resulting signal is high-pass filtered in order to remove unwanted low-frequency components
due to vertical movement of the larynx.
Figure 14 shows an example of the laryngogram (the output signal of the laryngograph)
together with the pertinent speech signal. In contrast to the speech signal, the laryngogram
is hardly affected by the momentary position of the vocal tract, and the changes in shape
or amplitude are comparatively small. Since every glottal cycle is represented by a single
pulse, the use of the laryngograph reliably suppresses gross period determination errors. In
addition, it supplies the basis for a good voiced-unvoiced discrimination, since the laryngogram
is almost zero during unvoiced segments, where the glottis is always open. Nonetheless,
the laryngograph is not free of problems: it may fail temporarily or permanently
for some individual speakers, or it may miss the beginning or end of a voiced segment by
a short interval, for instance when the vocal folds, during the silent phase of a plosive, continue
to oscillate without producing a signal, or when voicing is resumed after a plosive
and the glottis does not completely close during the first periods (Childers and Krishnamurthy,
1985). For this reason, visual inspection of the reference contour is necessary
even with this configuration; these checks, however, can be confined to limited segments
of the signal.
What key feature is best used for delimiting the individual periods? According to the
theory of voice excitation (van den Berg, 1958; cf. also Stevens, 1977), the instant of glottal
closure is the point of maximum vocal-tract excitation, and it is justified to define this
instant to be the beginning of a pitch period. In the laryngogram this feature is well documented.
As long as the glottis is open, the conductance of the larynx takes on a minimum,
and the laryngogram is low and almost flat. When the glottis closes, the laryngeal conductance
goes up, and the laryngogram shows a steep upward slope. The point of inflection
during the steep rise of the laryngogram, i.e., the instant of the maximum change of the
laryngeal conductance, was found best suited to serve as the reference point for this event.
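A rough sketch of deriving such period markers from the differentiated laryngogram, as in Fig. 14 (not the authors' algorithm; the height threshold and upper F0 bound are hypothetical):

    import numpy as np
    from scipy.signal import find_peaks

    def closure_markers(lx, fs, f0_max=500.0):
        # The maxima of the differentiated laryngogram mark the instants of
        # fastest conductance rise, i.e., glottal closure (cf. Fig. 14c).
        dlx = np.diff(lx)
        peaks, _ = find_peaks(dlx, height=0.3 * dlx.max(),
                              distance=max(1, int(fs / f0_max)))
        return peaks          # period boundaries, in samples

Parabolic interpolation of the derivative around each peak is one way to refine the marker position to a fraction of the sampling interval.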
To push measurement inaccuracies due to signal sampling below the difference limen
for the perception of F0 changes over the whole range of F0, a temporal resolution corresponding
to a sampling frequency of more than 100 kHz is required. The strategy of the algorithm
is as follows.
Due to the absence of reliable criteria and systematic guidelines, rather few publications
on early PDAs included a quantitative evaluation of the algorithms presented. The main
results of the classic study by Rabiner et al. (1976) read as follows.
1) None of the PDAs involved worked without errors, even under good recording conditions.
Each PDA had its own "favorite" error; nevertheless, every error condition actually
occurred for every one of the PDAs.
2) Almost any gross error is perceptible; in addition, unnatural noisiness of a pitch contour
is well perceived.
3) The subjective evaluation did not match the preference of the objective evaluation.
In fact, none of the objective criteria (number of gross errors, noisiness of the pitch contour,
voicing errors) correlated well with the subjective scale of preference.
Hence the question of which errors in pitch and voicing determination are the really annoying
ones for the human ear remained open. This issue was further pursued by Viswanathan
and Russell (1984), who developed objective evaluation methods that are closely correlated
to the subjective judgments. The individual error categories are weighted according
to the consistency of the error, i.e., the number of consecutive erroneous frames, the momentary
signal energy, the magnitude of the error, and the special context.
Indefrey et al. (1985), concentrating on the evaluation of PDAs only, investigated several
short-term PDAs in various configurations. Some of the results were shown further
above (Sect. 3.2). In a sequel, Indefrey (1987) added several other PDAs to this evaluation.
He showed that in many situations different short-term analysis PDAs behave in a complementary
way, so that combining them into a multi-channel PDA could lead to a better overall
performance.
7. Aspects of Application
The area of speech communication systems is one of the important application areas of
pitch determination. Other areas include a) phonetics and linguistics (including musicology),
i.e., the measurement of pitch contours as carriers of prosodic, phonetic, and musical
information; b) education: training aids for the deaf or teaching aids for foreign languages;
and c) the application as a diagnostic aid in voice pathology and phoniatrics. Here determination
of source parameters from the signal can serve as a quick and easily accessible
help for voice diagnostics and for examining the progress of voice therapy. In phoniatric
practice direct measurement and investigation of the speech organs is usual and natural,
and pitch determination instruments are a most valuable aid; deriving source parameters
from the signal, however, is a promising alternative, in particular for early detection of developing
voice diseases and for diagnostic evaluation of slight pathologies (Davis, 1978).
Each of these applications has a different profile of requirements (Hess, 1983:521).
With respect to these requirements the respective applications can be subdivided according
to whether the human ear is the final "customer" of a measured pitch contour or not.
If the human ear is at the end of the chain the PDA is part of, it is crucial to know whether
a time delay for manual correction is permitted or not. There is no time in vocoder
systems or in an electronic musical instrument or in the recent application of speech-processing
hearing prostheses, e.g., cochlear implants (D. Howard, 1989; Fourcin et al., 1983).
There is time for manual correction, on the other hand, in high-quality speech synthesis
systems which concatenate original speech data in parametric or waveform-coded representation
and need accurate pitch determination to manipulate pitch and duration (e.g.,
Charpentier and Moulines, 1989). Even a laryngograph may be applied for such a purpose
(Krishnamurthy and Childers, 1986). In the last few years, powerful waveform coding
schemes which need no PDA at all, or only a very rudimentary one, have been developed
that make a vocoder unnecessary in many applications. Those applications which will
continue to require a PDA in speech communication systems, such as hearing prostheses
or high-quality speech synthesis from stored data, are more fault tolerant than the vocoder.
Future developments in the domain of pitch and voicing determination are thus likely to
move away from the search for a new principle that is able "to solve everything" toward
improved implementations of known algorithms that are cheap, fast, and robust at the
same time.
References
Dolansky L. O. (1955): "An instantaneous pitch-period indicator." J. Acoust. Soc. Am. 27, 67-72
Dolansky L. O., Tjernlund P. (1968): "On certain irregularities of voiced speech waveforms." IEEE Trans. AU-16, 51-56
Dologlou I., Carayannis G. (1989): "Pitch detection based on zero-phase filtering." Speech Communication 5, 309-318
Duifhuis H., Willems L. F., Sluyter R. J. (1982): "Measurement of pitch in speech: an implementation of Goldstein's theory of pitch perception." J. Acoust. Soc. Am. 71, 1568-1580
Filip M. (1969): "Envelope periodicity detection." J. Acoust. Soc. Am. 45, 719-732
Flanagan J. L., Saslow M. G. (1958): "Pitch discrimination for synthetic vowels." J. Acoust. Soc. Am. 30, 435-442
Fourcin A. J., Abberton E. (1971): "First applications of a new laryngograph." Medical and Biological Illustration 21, 172-182
Fourcin A. J., Douek E., Moore B., Rosen S., Walliker J., Howard D. M., Abberton E., Frampton S. (1983): "Speech perception with promontory stimulation." Ann. NY Acad. Sci. 405, 280-294
Friedman D. H. (1977): "Pseudo-maximum-likelihood speech pitch extraction." IEEE Trans. ASSP-25, 213-221
Fujimura O. (1968): "An approximation to voice aperiodicity." IEEE Trans. AU-16, 68-72
Fujisaki H., Hirose K., Shimizu K. (1986): "A new system for reliable pitch extraction of speech." Proc. IEEE ICASSP-86, paper 34.16 (IEEE, New York)
Gold B. (1977): "Digital Speech Networks." Proc. IEEE 65, 1636-1658
Gold B., Rabiner L. R. (1969): "Parallel processing techniques for estimating pitch periods of
speech in the time domain." J. Acoust. Soc. Am. 46, 442-448
Goldstein J. L. (1973): "An optimum processor theory for the central formation of the pitch of
complex tones." J. Acoust. Soc. Am. 54, 1496-1516
Grützmacher M., Lottermoser W. (1937): "Über ein Verfahren zur trägheitsfreien Aufzeichnung von Melodiekurven." Akustische Z. 2, 242-248
Harris M. S., Umeda N. (1987): "Difference limens for fundamental frequency contours in sentences." J. Acoust. Soc. Am. 81, 1139-1145
Hart J. 't (1981): "Differential sensitivity to pitch distance, particularly in speech." J. Acoust. Soc. Am. 69, 811-822
Hermes D. J. (1988): "Measurement of pitch by subharmonic summation." J. Acoust. Soc. Am. 83, 257-264
Hess W. J. (1976): "A pitch-synchronous digital feature extraction system for phonemic recognition of speech." IEEE Trans. ASSP-24, 14-25
Hess W. J. (1979): "Time-domain pitch period extraction of speech signals using three nonlinear
digital filters." Proc. IEEE ICASSP-79, 773-776 (IEEE, New York)
Hess W. J. (1983): Pitch determination of speech signals - algorithms and devices (Springer, Berlin)
Hess W. J. (1992): "Pitch and voicing determination." In Advances in speech signal processing; ed.
by M. M. Sondhi and S. Furui (Marcel Dekker, New York), 3-48
Hess W. J., Indefrey H. (1984): "Accurate pitch determination of speech signals by means of a laryngograph." Proc. IEEE ICASSP-84, paper 18B.1 (IEEE, New York)
Hess W. J., Indefrey H. (1987): "Accurate time-domain pitch determination of speech signals by
means of a laryngograph." Speech Commun. 6, 55-68
Hollien H. (1974): "On vocal registers." J. Phonetics 2, 125-143
Howard D. M. (1989): "Peak-picking fundamental period estimation for hearing prostheses." J.
Acoust. Soc. Am. 86, 902-910
Howard I. S., Huckvale M. A. (1988): "Speech fundamental period estimation using a trainable
pattern classifier." Proc. Speech-88 (FASE), Edinburgh, (CEP Consultants, Edinburgh)
Rabiner L. R. (1977): "On the use of autocorrelation analysis for pitch detection." IEEE Trans. ASSP-25, 24-33
Rabiner L. R., Cheng M. J., Rosenberg A. E., McGonegal C. A. (1976): "A comparative study of several pitch detection algorithms." IEEE Trans. ASSP-24, 399-413
Reddy D. R. (1967): "Pitch period determination of speech sounds." Commun. ACM 10, 343-348
Risberg A., Moller A., Fujisaki H. (1960): "Voice fundamental frequency tracking." STL-QPSR
#1, 3-5 (Royal Inst. of Technol., Stockholm)
Ross M. J., Shaffer H. L., Cohen A., Freudberg R., Manley H. J. (1974): "Average magnitude difference function pitch extractor." IEEE Trans. ASSP-22, 353-361
Schroeder M. R. (1968): "Period histogram and product spectrum: new methods for fundamental-frequency measurement." J. Acoust. Soc. Am. 43, 829-834
Secrest B. G., Doddington G. R. (1982): "Postprocessing techniques for voice pitch trackers." Proc. IEEE ICASSP-82, 111-175 (IEEE, New York)
Sobolev V. N., Baronin S. P. (1968): "Investigation of the shift method for pitch determination."
Elektrosvyaz 12, 30-36 (in Russian)
Sondhi M. M. (1968): "New methods of pitch extraction." IEEE Trans. AU-16, 262-266
Sreenivas T. V. (1981): Pitch estimation of aperiodic and noisy speech signals. (Diss., Dept. of Electr.
Eng., Indian Inst. of Technology, Bombay)
Stevens K. N. (1977): "Physics of laryngeal behavior and larynx modes." Phonetica 34, 264-279
Stevens K. N., Kalikow D. N., Willemain T. R. (1975): "A miniature accelerometer for detecting
glottal waveforms and nasalization." J. Speech Hear. Res. 18, 594-599
Terhardt E. (1979): "Calculating virtual pitch." Hearing Research 1, 155-182
Terhardt E., Stoll G., Seewann M. (1982): "Algorithm for extraction of pitch and pitch salience
from complex tonal signals." J. Acoust. Soc. Am. 71, 679-688
Un C. K., Yang S. C. (1977): "A pitch extraction algorithm based on LPC inverse filtering and
AMDF." IEEE Trans. ASSP-25, 565-572
Viswanathan V. R., Russell W. H. (1984): Subjective and objective evaluation of pitch extractors
for LPC and harmonic-deviations vocoders (BBN Report # 5726, Bolt Beranek and Newman
Cambridge, MA, USA)
Wakita H., Fant G. (1978): "Toward a better vocal-tract model." STL-QPSR #1, 9-29 (Royal Inst. of Technol., Stockholm)
Weiss M. R., Vogel R. P., Harris C. M. (1966): "Implementation of a pitch-extractor of the double-spectrum-analysis type." J. Acoust. Soc. Am. 40, 657-662
Winckel F. (1964): "Tonhöhenextraktor für Sprache mit Gleichstromanzeige." Phonetica 11, 248-256
Wise J. D., Caprio J. R., Parks T. W. (1976): "Maximum likelihood pitch estimation." IEEE Trans. ASSP-24, 418-423
Wong D. Y., Markel J. D., Gray A. H. (1979): "Least-squares glottal inverse filtering from the acoustic speech waveform." IEEE Trans. ASSP-27, 350-355
Yaggi L. A. (1962): Full duplex digital vocoder (Texas Instruments, Dallas, TX; 5P14-A62)
Zwicker E., Hess W., Terhardt E. (1967): "Erkennung gesprochener Zahlworte mit Funktionsmodell und Rechenanlage." Kybernetik 3, 267-272
TALK-1
David Talkin
Entropic Research Laboratory, Inc.
In the following discussion, we transform the speech signal using the cross-correlation
function (CCF). Given a sampled speech signal

    s_m, m = 0, 1, 2, 3, ...,

with sampling interval T, analysis frame interval t, and window size w, at each frame we
advance z = t/T samples, with n = w/T samples per window. The CCF at frame i and lag k is

    phi_{i,k} = ( sum_{m=j}^{j+n-1} s_m s_{m+k} ) / sqrt( e_j e_{j+k} ),
        j = iz,  i = 0, ..., M-1,  k = 0, ..., K-1,

where

    e_j = sum_{m=j}^{j+n-1} s_m^2

and i is the frame index for M frames. Note that -1.0 <= phi_{i,k} <= 1.0.
We refer to the value of k as the lag and to i as the frame index. We
can represent phi_{i,k} graphically by assigning lag to the ordinate, frame index (or
time) to the abscissa, and the value of phi at the corresponding time and lag to the
degree of shading, with dark shading representing high values (close to 1.0) and
white representing low values (close to -1.0). These graphical representations
are referred to as correlograms.
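A direct Python transcription of the definition above (a sketch of ours, not the implementation used in the paper; the frame interval, window size, and lag range are hypothetical settings):

    import numpy as np

    def ccf(s, fs, t=0.01, w=0.0075, max_lag=0.02):
        # phi[i, k] per the definition above.
        z, n, K = int(t * fs), int(w * fs), int(max_lag * fs)
        M = (len(s) - n - K) // z
        phi = np.zeros((M, K))
        for i in range(M):
            j = i * z
            a = s[j:j + n]
            ea = float(np.dot(a, a))
            for k in range(K):
                b = s[j + k:j + k + n]
                phi[i, k] = np.dot(a, b) / np.sqrt(ea * np.dot(b, b) + 1e-20)
        return phi   # display as an image (time vs. lag) for a correlogram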
An utterance containing clear and problematic voiced speech sections with
the corresponding CCFs and correlogram may be seen in Figure 1. The only
local evidence for the true F0 is the location and height of maxima in the CCF. A
segment of unvoiced speech and its CCF may be seen in Figure 1(D). Note that,
in general, the CCF of voiced speech has maxima with comparable amplitudes
at lag intervals corresponding to integer multiples of the fundamental period, while
the CCF of unvoiced speech has its most prominent maximum at zero lag. If the
CCF for the problematic case is viewed in a larger temporal context, as in the
correlogram of Figure 1(B), the location of the local maximum corresponding
to the "true" F0 becomes more evident.
Note the following general observations regarding speech and speech CCFs:
2. When multiple maxima in phi exist and have values close to 1.0, the maximum
corresponding to the shortest period is usually the correct choice.
3. True phi maxima in temporally adjacent analysis frames are usually located
at comparable lags, since F0 is a slowly-varying function of time.
7. The short-time spectra of voiced and unvoiced speech frames are usually
quite different.
close to zero for unvoiced frames. The constant gamma permits adjustment of the
likelihood of a voiced decision.
The inter-frame F0 transition cost delta at frame i, when hypotheses j and k at the
current and previous frames are both voiced, penalizes changes of the period candidate
between frames. The indices

    q_{i,j} = k_min,

where k_min at each frame are the indices k which minimize the cumulative path cost D,
are saved as back pointers so that the optimal state sequence can be retrieved. Back
pointers from each state at frame i may be traced backwards until they converge to a
common, globally optimal state at frame i - l, where l is the latency of the decision.
In practice, this decision latency for the F0 estimation problem is rarely greater than
100 ms. Thus, it is feasible to implement F0 estimators using this algorithm that can
operate continuously, in real time, with modest delay. Finally, the F0 estimate for the
frame is derived from the candidate j which results in the minimum value of D in the
region of convergence.
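The exact local and transition costs are not fully legible in this copy, so the following Python sketch shows only the generic dynamic-programming recursion with back pointers that the text describes; local_cost and trans_cost are caller-supplied assumptions:

    import numpy as np

    def viterbi_track(local_cost, trans_cost):
        # local_cost[i][j]: cost of candidate j at frame i;
        # trans_cost(i, j, k): cost of moving from candidate k (frame i-1)
        # to candidate j (frame i).
        M = len(local_cost)
        D = [np.asarray(local_cost[0], dtype=float)]
        back = []
        for i in range(1, M):
            Di = np.empty(len(local_cost[i]))
            bi = np.empty(len(local_cost[i]), dtype=int)
            for j, d in enumerate(local_cost[i]):
                tot = D[-1] + np.array([trans_cost(i, j, k)
                                        for k in range(len(D[-1]))])
                bi[j] = int(np.argmin(tot))
                Di[j] = d + tot[bi[j]]
            D.append(Di); back.append(bi)
        # trace back pointers from the best final state
        path = [int(np.argmin(D[-1]))]
        for bi in reversed(back):
            path.append(int(bi[path[-1]]))
        return path[::-1]    # candidate index chosen at each frame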
1.3 Discussion
Reasonable values for the constants in the algorithm may be determined using
hill-climbing techniques on a standard speech database where the F0 and voicing
state have been hand marked (or otherwise reliably determined, for instance
using electroglottography). Fortunately, the performance of the algorithm is
only weakly sensitive to the exact parameter values once the general operating region
has been found.
This algorithm permits estimation of F0 on a cycle-by-cycle basis, since t,
the frame step size, and w, the correlation window size, can both be set smaller
than the expected fundamental period. This is in contrast to autocorrelation-based
approaches, where the autocorrelation window must be several glottal
periods long.
A variety of inter-frame spectral distance measures can serve as the basis for
the "stationarity" measure S_i. Secrest and Doddington suggest the use of LPC
log area ratios [6]. Good results have been obtained with a stationarity measure
S_i computed from the Itakura-Saito distortion and the RMS levels of adjacent frames,
where i is the index of the current frame, rms_i is the signal RMS in frame i, and
itakura(i,j) is the Itakura-Saito distortion measure [3] between frames i and j.
The precision of the F0 estimation can be considerably improved by parabolic
interpolation of the CCF. If a parabola is fit to the three points comprising the
peak in the CCF, the peak of the parabola is a good estimate of the "true"
peak of the corresponding continuous CCF. Thus, instead of using the computationally
expensive approach of increasing the rate at which the speech signal
is sampled, one can apply interpolation to the few peaks in the CCF that are
finally identified as F0 period markers.
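A minimal sketch of this three-point parabolic refinement (a helper of ours, not code from the paper):

    import numpy as np

    def refine_peak(phi, k):
        # Fit a parabola to (k-1, k, k+1) around a CCF maximum at integer
        # lag k; return the interpolated lag and interpolated peak value.
        y0, y1, y2 = phi[k - 1], phi[k], phi[k + 1]
        denom = y0 - 2.0 * y1 + y2
        if denom == 0.0:
            return float(k), y1
        d = 0.5 * (y0 - y2) / denom
        return k + d, y1 - 0.25 * (y0 - y2) * d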
It is important that DC and other very low frequency noise components be
removed from the signal prior to application of the CCF. Otherwise, these can
generate very high correlation values in unvoiced and "silent" regions of the
signal, incorrectly encouraging a "voiced" decision. A high-pass filter with zero
response at 0 Hz and a half-power corner frequency at 80 Hz has been found to
be quite effective.
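A sketch of such a filter using scipy (only the 80 Hz half-power corner is from the text; the filter order and the use of zero-phase filtering are our assumptions):

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    fs = 16000                                  # hypothetical sampling rate
    sos = butter(4, 80.0, btype="highpass", fs=fs, output="sos")
    x = np.random.randn(fs)                     # stand-in for a speech signal
    x_hp = sosfiltfilt(sos, x)                  # zero response at 0 Hz (DC)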
The computational load of the dynamic programming (DP) can be reduced
by limiting the number of candidates considered at each frame. The DP load
grows as the square of the number of candidates (states) in each frame. Thus,
instead of considering all local maxima in the CCF as period candidates, only
the highest N need be considered, where N is on the order of 10-20. This
significantly reduces the load in the unvoiced regions, where there are many
local maxima, none of which will ultimately contribute to a period estimation.
The computational load of the CCF may be reduced by performing it in two
stages. Note that for a given window duration and frame rate, the cost of computing
phi grows as the square of the speech sample rate. Thus, initial estimates
of the CCF peak locations can be made on a sample-rate-reduced version of the
speech signal. The peak locations can then be refined by recomputing the CCF
at the higher sample rate only in the vicinity of the initial peak estimates and
for only the most promising peaks.
Figure 1
Waveform (A), correlogram (B) and cross-correlation functions (C, D, E)
based on a female voice saying "Are any sub...". The cross-correlation plots C,
D and E, which were computed at .83 sec, .5 sec and .68 sec, respectively, show
correlation values as a function of correlation lag, with zero lag at the extreme
left in each plot. In C the "true" peak corresponding to F0 is actually lower in
amplitude than the peak at twice the true period. In D, the true peak is the
highest non-zero-lag peak. Note that the non-zero-lag peaks in the correlation
function based on unvoiced speech, seen in E, are all considerably lower than
the zero-lag peak. The correlogram, B, shows the correlation value plotted as
a function of time (horizontal axis) and lag (vertical axis). Correlations close to
one are shown in black; minus one in white. When the time context surrounding
the problematic correlation function in E is taken into account by examining
the correlogram in B, the correct peak choice is obvious.
References
[6] Secrest, B. G. and Doddington, G. R., "An integrated pitch tracking algorithm
for speech systems," ICASSP 1983, pp. 1352-1355, Boston.
MIL 1-1
Paul H. Milenkovic
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
1415 Johnson Drive
Madison, Wisconsin 53706
Abstract
Milenkovic (1987) describes a waveform model for the measurement of the
aperiodicity of a voiced speech waveform. The model contains a periodic component,
which may vary in amplitude between pitch periods, and a periodicity
error (also called the noise component), which has a constant mean-square
value across pitch periods. Minimizing the mean square of the periodicity error
provides estimates of 1) pitch period (used to determine jitter), 2) amplitude
variation of the periodic component (used to determine shimmer), and 3) magnitude
of the aperiodicity noise (used to determine voice SNR).
This report describes how minimizing the periodicity error is equivalent to
performing a rotation transformation on signal vectors from two adjoining pitch
periods. This transformation is known as SVD in signal processing (Haykin,
1991) and principal components analysis in statistics (Nash, 1979). This connection
gives a more numerically stable formula for computing the minimum
mean-square error. It also provides a geometric interpretation of the periodic
and noise components in relation to the signal vectors, proving the existence of
the periodic and noise components in the form required by the model.
1 Introduction
The simpler waveform-matching procedure minimizes the mean square of the error
e(t) = s(t) - K sp(t) by adjusting amplitude factor K and pitch period tp (see Qi and Shipp, 1992 for a
related method).
The objection to the simpler procedure is how it works when both s(t) and sp(t)
contain aperiodicity noise. The simpler procedure may work for radar where sp(t)
is the noise-free outbound pulse and s(t) is the noisy pulse. When both s and sp
contain noise, the simpler procedure will result in a complicated relation between the
minimum mean-square value of e and the true magnitude of the noise contained in
both s and sp. In addition, the relationship between K and waveform shimmer is
complicated on account of the bias introduced by the noise.
The more complicated procedure has the advantage that it gives correctly scaled
estimates of shimmer and aperiodicity noise if a particular waveform model holds
true. This report also shows that the seemingly peculiar algorithm of Milenkovic
performs a vector rotation and is therefore identical to the well-known procedure for
singular value decomposition (SVD) (Haykin, 1991). SVD is also known as principal
components analysis, which has an extensive theoretical rationale (Nash, 1979). SVD
leads to a geometrical interpretation of signal and noise components, which proves
that the model has an exact least-squares instead of only an approximate least mean-square
solution as originally supposed.
To widen the application of SVD-based waveform matching, this report purposefully
leaves open many other details of a voice analysis system. The initial pitch
estimate is such a detail. The waveform matching procedure assumes a rough estimate
of the pitch period obtained by other means, and varies tp to refine the estimate. Another
open issue is the question of aligning the analysis frame on pitch epochs. The method
of Milenkovic (1987) employs a sliding analysis frame. In a voice analysis system with
a reliable means of determining the glottal epoch, the methods described in this report
are also applicable to an analysis frame that is aligned on that epoch.
2 Methods
This section of the report 1) describes a waveform periodicity model and reviews the
minimum MSE estimate of the model parameters, 2) shows how this model can be
reexpressed as a rotation transformation applied to a pair of signal vectors and how
the minimum MSE solution can be expressed as the calculation of the optimal rotation
that performs SVD, and 3) summarizes this result in the form of a numerically stable
recipe for calculation.
The waveform s(t) is the speech signal and sp(t) = s(t - tp) is the signal from one
pitch period before. The quantity tp is the estimated pitch period, and tp can be
adjusted for a best waveform match between pitch periods. A model of waveform
periodicity separates s(t) into a periodic component p(t) and a periodicity error (or
noise component) e(t) according to

    s(t) = p(t) + e(t).

Furthermore, because the periodic component can vary in amplitude between pitch
periods according to p(t) = K p(t - tp), the sampled signals are collected into vectors

    s   = [ s(n_0 T), ..., s((n_0 + n_p - 1) T) ],
    s_p = [ s(n_0 T - tp), ..., s((n_0 + n_p - 1) T - tp) ],

where T is the interval between waveform samples, n_0 is the integer index controlling the
position of the analysis frame, and n_p is the number of samples in a pitch-period-long
frame. In a similar manner, vectors e and e_p contain samples of the periodicity error
signals e(t) and e_p(t). The vectors s and s_p have actual numerical values. The vectors
e and e_p are only theoretical constructs in the model, but we can estimate their vector
magnitudes from observations of s and s_p.
In the difference s - K s_p, the periodic component p (the vector of samples of p(t - tp))
simply cancels out, resulting in

    s - K s_p = e - K e_p.
Next, assume that e and e_p are of equal vector magnitude according to E = e e^T =
e_p e_p^T and that they are orthogonal according to e e_p^T = 0; this is a statement of
statistical independence of the noise components in each pitch period. The symbol T
denotes vector transpose and e e^T = ||e||^2 denotes the Cartesian dot product formula
for the vector norm square. That e and e_p are orthogonal and of equal norm permits
equating

    ||s - K s_p||^2 = ||e - K e_p||^2 = (1 + K^2) E,                  (9)

while expanding the left-hand side gives

    ||s - K s_p||^2 = s s^T - 2K s s_p^T + K^2 s_p s_p^T,             (10)

so that E = (s s^T - 2K s s_p^T + K^2 s_p s_p^T) / (1 + K^2). Defining

    q = s s^T - s_p s_p^T,                                            (13)
    r = s s_p^T,                                                      (14)

setting dE/dK = 2 (r K^2 - q K - r) / (1 + K^2)^2 = 0 reduces to the quadratic
r K^2 - q K - r = 0, with solution

    K = R ± sqrt(R^2 + 1),  R = q / (2r).                             (16)
We take the + branch of ± because that gives a positive value of K, the usual situation
with a voiced speech waveform.
This concludes the review of Milenkovic (1987). Next, this solution is reexpressed
as a rotation transformation.
Define the rotated noise vector

    e_r = ( -1 / sqrt(1 + K^2) ) (e - K e_p) = -s e + c e_p,          (17)

where

    s = 1 / sqrt(1 + K^2),  c = K / sqrt(1 + K^2),                    (18)

and c^2 + s^2 = 1, the necessary and sufficient condition for s and c to be the sine
and cosine of an angle. It also follows that

    1 + K^2 = 2 (1 + R^2 + R sqrt(1 + R^2))
            = 2 sqrt(1 + R^2) (sqrt(1 + R^2) + R),                    (21)

and, remembering that R = q/(2r), it follows from sqrt(1 + R^2) = sqrt(4r^2 + q^2)/(2r) that

    c^2 = K^2 / (1 + K^2) = ( sqrt(4r^2 + q^2) + q ) / ( 2 sqrt(4r^2 + q^2) )
        = (v + q) / (2v),                                             (22)

where v = sqrt(4r^2 + q^2).
Taking the positive branch of the square root and setting

    c = sqrt( (v + q) / (2v) ),                                       (23)

it follows that

    s = ±sqrt( (v - q) / (2v) ) = r / (vc),                           (24)

where we take the branch of the square root having the same sign as r. This insures
the correct result when the signal crosscorrelation r < 0, a rare occurrence with voiced
speech that we need to account for anyway. In summary,

    r = s s_p^T,                                                      (25)
    q = s s^T - s_p s_p^T,                                            (26)
    v = sqrt(4r^2 + q^2).                                             (27)
The coefficient r is the crosscorrelation between the two pitch periods while q is the
signal energy difference between pitch periods.
If q >= 0, the condition where the signal energy is greater than in the previous
pitch period, compute

    c = sqrt( (v + q) / (2v) ),  s = r / (vc).                        (28)

If q < 0, the condition where the signal energy is less than in the previous pitch
period, compute

    s = SGN(r) sqrt( (v - q) / (2v) ),  c = r / (vs).                 (29)

The minimized noise energy is then

    E = s^2 s s^T - 2 s c r + c^2 s_p s_p^T.                          (30)
In the special case of equal-amplitude pitch periods, s s^T = s_p s_p^T, c = s = 1/sqrt(2),
and the expression simplifies to

    E = s s^T - s s_p^T.                                              (31)
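Collecting the recipe (25)-(30), as reconstructed above, into a short Python function (a sketch; variable names are ours, and nonzero, correlated signal vectors are assumed):

    import numpy as np

    def rotation_match(s_vec, sp_vec):
        # r, q, v per (25)-(27); c, s per (28)-(29); E per (30).
        r = float(np.dot(s_vec, sp_vec))
        q = float(np.dot(s_vec, s_vec) - np.dot(sp_vec, sp_vec))
        v = float(np.hypot(2.0 * r, q))          # sqrt(4 r^2 + q^2)
        if q >= 0.0:
            c = np.sqrt((v + q) / (2.0 * v))
            sn = r / (v * c)
        else:
            sn = np.sign(r) * np.sqrt((v - q) / (2.0 * v))
            c = r / (v * sn)
        K = c / sn                               # amplitude factor
        E = (sn * sn * np.dot(s_vec, s_vec) - 2.0 * sn * c * r
             + c * c * np.dot(sp_vec, sp_vec))   # noise energy
        return c, sn, K, E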
3 Results

The rotation transformation collects the results above as

    [ s_r ]   [  c  s ] [ s   ]
    [ e_r ] = [ -s  c ] [ s_p ].                                      (32)

The subscript r reminds us that the matrix is a unitary rotation matrix. According
to the theory of principal components, when c and s satisfy c^2 + s^2 = 1 and ||e_r||^2 is a
minimum, s_r and e_r are orthogonal principal components. The vectors s, s_p, s_r, and
e_r all lie on an ellipse, with s_r and e_r marking the major and minor axes.
The existence of the periodicity model is proved by geometric construction. We
express the major principal component as

    s_r = p_r + e_r',                                                 (33)

where p_r and e_r' are mutually orthogonal vectors selected from the subspace of vectors
orthogonal to e_r and where ||e_r'||^2 = ||e_r||^2. Breaking the major principal component
down in this way is possible because ||s_r||^2 > ||e_r||^2, by virtue of which component is
major (the bigger one) and minor (the smaller one). The magnitudes of p_r and e_r'
are uniquely determined (note that ||p_r||^2 = ||s_r||^2 - ||e_r'||^2 = ||s_r||^2 - ||e_r||^2 on account
of orthogonality and equality of norms), but the vectors p_r and e_r' are not unique.
That is acceptable, because we are only interested in the magnitudes for SNR calculations
and do not need to recover the actual vectors.
The original vectors are recovered from the principal components according to the
inverse rotation

    s   = c s_r - s e_r,
    s_p = s s_r + c e_r.                                              (34)
When the rotation matrix is applied to orthogonal vectors of equal magnitude, the
rotation ellipse becomes a circle, and the rotated vectors remain orthogonal with the
same magnitude. This allows us to express the periodicity error and periodic components
of the waveform in terms of principal components. The periodicity error vectors e
and e_p are orthogonal, equal-magnitude, rotated versions of the minor principal component
e_r and the geometrically constructed vector e_r'. The major principal component
s_r is the sum of the two constructed vectors p_r and e_r', and the periodic components
are given by p = s p_r and Kp = c p_r for K = c/s.
This geometric construction takes two principal components and generates
three vectors: the periodic component (its version scaled by K counts as the same
vector) and two independent periodicity error components. As a result of this two-to-three
mapping, the construction is not unique, but it exists; the minimum norm
of the minor principal component makes the periodicity error components minimum
norm; the three vector elements of the periodicity model exist; and the norms (vector
magnitudes) are unique.
The proposed definition of periodicity SNR is the average of the energy in the
signal for each pitch period divided by the energy in the periodicity error:

    SNR = (1/2) ( ||s||^2 + ||s_p||^2 ) / ||e_r||^2.                  (35)

The corresponding harmonics-to-noise ratio follows from ||p_r||^2 = ||s_r||^2 - ||e_r||^2:

    HNR = ( ||s_r||^2 - ||e_r||^2 ) / ( 2 ||e_r||^2 )
        = (1/2) ( ||s_r||^2 / ||e_r||^2 - 1 ).                        (36)
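Continuing the earlier sketch, (35) and (36) can be evaluated from the same quantities (again under the reconstructed formulas; the identity used in the comment follows from energy preservation under rotation):

    import numpy as np

    def snr_hnr(s_vec, sp_vec, E):
        # Rotation preserves total energy, so
        # ||s_r||^2 = ||s||^2 + ||s_p||^2 - ||e_r||^2, with ||e_r||^2 = E.
        total = float(np.dot(s_vec, s_vec) + np.dot(sp_vec, sp_vec))
        snr = total / (2.0 * E)                  # (35)
        hnr = 0.5 * ((total - E) / E - 1.0)      # (36)
        return snr, hnr

    # usage: c, s, K, E = rotation_match(s_vec, sp_vec)
    #        snr, hnr = snr_hnr(s_vec, sp_vec, E)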
4 Conclusions
References
Muta, H., Baer, T., Wagatsuma, K., Muraoka, T., and Fukuda, H. (1988). "A
pitch-synchronous analysis of hoarseness in running speech," J. Acoust. Soc.
Am. 84, 1292-1301.
Nash, J. C. (1979). Compact Numerical Methods for Computers, John Wiley, New
York.
Qi, Y., and Shipp, T. (1992). "An adaptive method for tracking voicing irregularities,"
J. Acoust. Soc. Am. 91, 3471-3477.
DEL-1
Dimitar D. Deliyski
ABSTRACT
A method for quantitative evaluation of these random functions is described. The method
computes some of their statistical characteristics which can be useful in assessing voice in clinical
practice. More than 33 acoustic parameters are computed, such as: average fundamental
frequency, phonatory frequency range, several frequency and amplitude short- and long-term
perturbation and variation measures, noise-to-harmonic ratio, voice turbulence and soft phonation
indexes, quantitative measures of voice breaks, sub-harmonic components and vocal tremors. This
set of parameters, which corresponds to the model, allows a multi-dimensional voice quality
assessment. A computer system based on the above model and method was developed for the CSL
model 4300 (Kay Elemetrics Corp.). A group of 68 people with normal and disordered voices
was analyzed using the system in order to define normative values for the acoustic voice
parameters.
The file format for voice data used by Kay Elemetrics Corp. is described. This format, which is
very similar to a multi-media format supported by Microsoft, allows all the information
and associated data to be kept in a single file.
1.INTRODUCTION
The classic way to describe the acoustics of human speech is by using the Linear Model of Speech
Production [1, 2], where the voice signal is presented as a result of a periodic impulse sequence
(excitation) filtered by the glottis, the vocal tract and the lips.
However, the real voice contains irregular components which are (probably) due to the chaotic
nature of the laryngeal mechanism [3]. A voice without irregularity is not perceived as human,
which is why the advanced speech synthesizers based on the linear model introduce some pitch
irregularity [4, 14].
Voice pathology can cause increased noise components in the voice signal, such as: fundamental
frequency and amplitude irregularities and variations with different patterns, sub-harmonic
frequency components, turbulent noise, voice breaks and tremors [2, 5-8]. Understanding the
acoustics of these changes is the key to the development of methods for the evaluation of
pathologic voices. A formal expression of these changes is given by the Extended Acoustic Model
of the Pathological Voice Production [9] in Fig.1.
Fig.1: Extended Acoustic Model of the Pathological Voice Production (excitation, glottis (vocal
folds), vocal tract, feedbacks).
The discrete-time formal representation of the model describes the excitation function
as a modulated impulse sequence, where the frequency-modulating (FM) function phi(n) and the
amplitude-modulating (AM) function a(n) are random time functions; n = 0, 1, ..., infinity is discrete time
(samples); To is the period of the sequence (samples); delta(n) is a Kronecker delta function
(delta(0) = 1, delta(n != 0) = 0); and the carrier sequence is the impulse train

    e(n) = sum_{i=0}^{infinity} delta(n - i To).
The glottal excitation is obtained by convolving the modulated impulse sequence with the glottal
pulse: u'(n) = e'(n) * g(n). The White Noise Generator (WNG) adds components w(n) which model
the turbulent components, and the Voice Break Switch (VBS) describes the interruptions of the voice
generation, where: g(n) is the impulse response of the glottal filter, Go is a scale factor, T is the sampling
period (sec.), Ac and Fc are amplitude and frequency break thresholds, and c1 and c2 are comparators. The
convolution of u'(n), the impulse response of the vocal tract filter v(n) and the impulse response of
the lip-radiation filter l(n) results in the modeled voice signal

    x(n) = u'(n) * v(n) * l(n),

where v(n) and l(n) are considered invariable, because it is assumed that the laryngeal pathology
does not affect the vocal tract and the lips.
All of a(n), phi(n) and w(n) are random time functions. Therefore the task of acoustic evaluation of
pathological voices can be regarded as the extraction of specific statistical parameters of these
functions which have clinical significance. The method described below includes three separate
parts: pitch extraction (demodulation), noise evaluation and long-term components (tremor)
analysis.
3.PITCH EXTRACTION
The amplitude and frequency demodulation curves of the voice signal contain information about
the time-domain behavior of a(n) and phi(n). The period-to-period pitch extraction [10] is the
classic type of demodulation used for evaluation of voice pathology [7, 8]. However, the
irregularity of the disordered voice makes the pitch extraction inaccurate, often impossible.
In order to provide reliable data an adaptive time-domain pitch-synchronous method for pitch
extraction was developed. It consists of the following main steps: fundamental frequency (Fo)
estimation, Fo verification, period-to-period Fo-extraction and computation of time-domain voice
parameters.
The Fo-estimation provides preliminary information about the pitch. It is based on short-term
autocorrelation analysis with non-linear sgn-coding [11] of the voice signal x(n):

    x'(i) =  1  if x(i) >= Pmax;
    x'(i) = -1  if x(i) <= Pmin;
    x'(i) =  0  if Pmin < x(i) < Pmax;

where Pmax = Kp*Amax, Pmin = Kp*Amin, and Amax and Amin are the global extremes of the
current window of the voice signal x(n). The length of
the autocorrelation window is 30 ms or 10 ms, depending on the Fo-extraction range (67-625 Hz or
200-1000 Hz). The sampling rate is 50 kHz, and every window is low-pass filtered at 1500 Hz
before coding. The value of the coding threshold at this stage of the analysis is Kp = 0.78, in order
to eliminate the incorrect classification of harmonic components as Fo [12]. The current
window is considered to be voiced, with period To = tau_max, if the global maximum satisfies
R(tau_max) > Kd*R(0), where the voiced/unvoiced threshold value is Kd = 0.27 [12].
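A rough Python sketch of this sgn-coded autocorrelation stage (the 1500 Hz pre-filter is omitted, and the -1 coding branch is our assumption, since that line is garbled in this copy):

    import numpy as np

    def f0_estimate(x, fs=50000, kp=0.78, kd=0.27, f_lo=67.0, f_hi=625.0):
        # Three-level sgn-coding of one analysis window, then
        # autocorrelation; returns Fo in Hz, or None for unvoiced.
        pmax, pmin = kp * x.max(), kp * x.min()
        xc = np.where(x >= pmax, 1.0, np.where(x <= pmin, -1.0, 0.0))
        r = np.correlate(xc, xc, mode="full")[len(xc) - 1:]
        if r[0] <= 0.0:
            return None
        lo, hi = int(fs / f_hi), int(fs / f_lo)
        tau = lo + int(np.argmax(r[lo:hi]))
        if r[tau] > kd * r[0]:
            return fs / tau
        return None

    # e.g., one 30 ms window at 50 kHz: f0_estimate(x[0:1500])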
The Fo-verification procedure is similar to the Fo-estimation. The autocorrelation function is
computed again for the same windows at Kp = 0.45, in order to suppress the influence of
components sub-harmonic to Fo. The results are compared to the previous step, and the decision
about the correct To is made for all windows where a difference is discovered.
A period-to-period Fo-extraction is made on the original signal x(n) using a peak-to-peak
measurement. It is synchronous with the verified pitch and voiced/unvoiced results
computed in the previous steps. A linear 5-point interpolation is applied to the final period-to-period
Fo-data in order to increase the resolution. This increased resolution is necessary for
meaningful frequency perturbation measurements. The peak-to-peak amplitude is also extracted
for every period.
The following time-domain voice parameters are computed from the extracted pitch data:
Fundamental frequency information measurements: Average Fundamental Frequency Fo
/Hz/ [2], Average Pitch Period To /ms/, Highest Fundamental Frequency Fhi /Hz/, Lowest
Fundamental Frequency Flo /Hz/, Standard Deviation of the Fundamental Frequency STD /Hz/
[5], Phonatory Fundamental Frequency Range PFR /semi-tones/, Length of Analyzed Data
Sample Tsam /sec/ and Number of Pitch Periods PER.
Short and long-term frequency perturbation functions: Absolute Jitter Jita /us/ [13], Jitter
Percent Jitt /%/ [13], Relative Average Perturbation RAP /%/ [7], Pitch Period Perturbation
Quotient PPQ /%/ [8], Smoothed Pitch Period Perturbation Quotient sPPQ /%/ and
Fundamental Frequency Coefficient of Variation vFo /%/ [5].
Short and long-term amplitude perturbation functions: Shimmer in dB ShdB /dB/ [13],
Shimmer Percent Shim /%/ [13], Amplitude Perturbation Quotient APQ /%/ [8], Smoothed
Amplitude Perturbation Quotient sAPQ /%/ and Peak-to-Peak Amplitude Coefficient of
Variation vAm /%/ [5].
Voice break related measurements: Degree of Voice Breaks DVB /%/ [15] - the ratio of the
total length of areas representing voice breaks to the time of the complete voiced sample; and
Number of Voice Breaks NVB. The criterion for a voice break area can be a missing impulse for the
current period or an extreme irregularity of the pitch period.
Voice irregularity related measurements: Degree of Irregular Vocalization DUV /%/ [15] - the
ratio of the number of autocorrelation windows classified as unvoiced to the total number of
autocorrelation windows; and Number of Unvoiced Segments NUV.
4.NOISE EVALUATION
The analysis of the voice signal in the frequency domain provides another approach to the
evaluation of its irregularity (noise). The amount of in-harmonic spectral components correlates with
the perception of hoarseness of the pathological voice [16]. To evaluate the level of noise
components and separate the turbulent noise correlating with the intensity of the function w(n), a
pitch-synchronous frequency-domain method was developed. The following parameters are
extracted: Noise-to-Harmonic Ratio NHR - a general evaluation of the noise presence in the
analyzed signal (including amplitude and frequency variations, turbulence noise, sub-harmonic
components and/or voice breaks); Voice Turbulence Index VTI - mostly correlating with the
turbulence components caused by incomplete or loose adduction of the vocal folds; and Soft
Phonation Index SPI - an evaluation of the poorness of high-frequency harmonic components, which
may be an indication of loosely adducted vocal folds during phonation.
1. Selection of two groups of windows of 81.92 ms (4096 points) of the voice signal. The first
group includes a sequence of windows of the voiced areas in the analyzed signal with a half-window
overlap. The second group includes four non-contiguous windows, where the
frequency and amplitude perturbations are the lowest for the signal.
2. For every window in both groups the following steps apply: low-pass filtering (cutoff 6000 Hz,
order 22, Hamming window), downsampling to 12.5 kHz and conversion of the real signal into an
analytic one using Hilbert filtering; computation of the power spectrum of the window using a
1024-point complex Fast Fourier Transform (FFT) on the analytic signal; calculation of the
average fundamental frequency within the current window from the time-domain analysis data
and synchronous harmonic/in-harmonic separation; computation of the current window's NHR,
SPI and VTI. NHR is a ratio of the in-harmonic energy in the range 1500-4500 Hz to the
harmonic spectral energy (70-4500 Hz), and SPI is a ratio of the lower-frequency (70-1600 Hz)
to the higher-frequency (1600-4500 Hz) harmonic energy, for the first group of windows. VTI is
a ratio of the spectral in-harmonic high-frequency energy (2800-5800 Hz) to the spectral
harmonic energy (70-4500 Hz), for the second group of windows.
5.TREMOR ANALYSIS
The pitch extraction process yields the amplitude and frequency demodulation curves of the voice
signal. These curves contain information about the long-term amplitude and frequency variability
(tremor) of the voice signal [17]. Methods for frequency and amplitude tremor analysis were
developed. The algorithm for frequency tremor analysis includes the following steps (a sketch
follows this list):
1. Division of the Fo-data resulting from pitch extraction into windows of 2 sec. length with 1
sec. step overlap.
2. Application of the following procedures to every window: low-pass filtering of the Fo-data
(cutoff 30 Hz) and downsampling to 400 Hz; calculation of the total energy of the resulting
signals; subtraction of the DC-component and computation of the autocorrelation function on
the residual signal; division of the autocorrelation data by the total energy and accumulation of
the results from every window. The maxima of the resulting autocorrelation curve show the
intensity and frequency of the long-term (up to 30 Hz) frequency-modulating components.
3. Calculation of the Fo-Tremor Intensity Index FTRI /%/ - the value of the global maximum of
the average autocorrelation curve - and the corresponding position, the Fo-Tremor Frequency Fftr
/Hz/.
The same method applies for computation of the Amplitude Tremor Intensity Index ATRI /%/ and
the Amplitude-Tremor Frequency Fatr /Hz/ from the peak-to-peak amplitude data resulting from
pitch extraction.
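A minimal single-window Python sketch of step 2 (the 30 Hz low-pass filtering and the accumulation over windows are omitted; rate is the post-downsampling rate of 400 Hz):

    import numpy as np

    def tremor_peak(f0_window, rate=400.0, f_max=30.0):
        # Normalize the DC-removed autocorrelation of one Fo window by the
        # total energy; the global maximum gives the tremor intensity and
        # its lag the tremor frequency.
        energy = float(np.dot(f0_window, f0_window))
        x = f0_window - np.mean(f0_window)
        r = np.correlate(x, x, mode="full")[len(x) - 1:] / max(energy, 1e-12)
        lag0 = max(1, int(rate / f_max))     # components up to 30 Hz only
        lag = lag0 + int(np.argmax(r[lag0:]))
        return 100.0 * r[lag], rate / lag    # FTRI-like %, Fftr-like Hz

    # e.g., one 2 s window at 400 Hz:
    # ftri, fftr = tremor_peak(120.0 + np.random.randn(800))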
6.APPLICATION
Based on the model and the methods described above, a Multi-Dimensional Voice Program
MDVP was developed utilizing the Computerized Speech Lab (CSL) model 4300 (Kay Elemetrics
Corp.). CSL, a hardware/software system which uses an MS-DOS based computer as host,
includes signal conditioning, 16-bit A/D converters, dual digital signal processors (DSP16A &
TMS32025) and support peripherals. The MDVP system computes a set of 33 acoustic voice
parameters in about 16 seconds and provides flexible routines for graphical representation of the
results [Fig.2-3]. Also, a user-upgradable voice database allows automatic comparison of the
current results with different nosological groups.
Fig.2: MDVP display of the voice waveform (view A), period-to-period fundamental frequency (B),
peak-to-peak amplitude (C), Fo-tremor (D) and amplitude-tremor (E) autocorrelation curves,
long-term average linear spectrum of the voiced areas of the signal (F) and histogram of the
distribution of Fo (G).
Fig.3: Multi-Dimensional Diagram display of the acoustic parameters. The area within the circle
shows the normative threshold range and the polygon - the currently computed values.
In order to extract the normative threshold values of the acoustic parameters, sustained phonation
of the vowel 'a' by 15 persons (7m, 8f) with normal voice production and by 53 patients (25m, 28f)
with laryngeal diseases was analyzed using the MDVP system. The following nosological groups
were included in the study: laryngeal cancer, benign neoplasms, chronic laryngitis, functional
dysphonia and paralysis of a recurrent nerve. The computed normative threshold values for this
database are:
The normative values may vary depending on the nosological groups included in the specific
study. A separate database is recommended to be selected or created for different applications.
7.FILE FORMAT
The format of sampled data files used by Kay Elemetrics Corp. was developed to meet the
requirement for a single file that would contain any information that may be associated with a
piece of sampled data and could be expanded to include additional features as those were
incorporated into the program, without rendering previous data files obsolete. A single file is
advantageous because it keeps all information about a recording in one file. Separate files to
describe a recording can be confusing and inadvertently separated. This file format is very flexible
and is designed to be changeable to accommodate future requirements. Under exploration, for
example, is the inclusion of videostroboscopic images with the file, so that acoustics and images of
the vocal cords can be viewed in synchronization with spectrograms and waveform displays. This
new capability, unforeseen when the CSL was first developed, can be accommodated with the
CSL file format without rendering previous data files obsolete.
Additionally, it was necessary that the format could be readily identified by any program
attempting to read the file to determine that the file was, in fact, an appropriate sampled data file.
DEL-9
Toward these ends, a format made up of a number of nested, named data BLOCKS was developed. The specification may be expanded by defining additional BLOCK types to accommodate new features, and identifiability is provided since the name and placement of each BLOCK is specified by the file format and may be quickly checked as the file is read. This means, among other things, that it is not necessary to specify a particular filename extension in order to identify the file type to a program, so the extension may be put to better use as a classification aid for the user if desired.
A sampled data file which conforms to this specification contains the string "FORM" as the first 4 characters in the file (the FILE TITLE), followed by a BLOCK containing all data for the file. The BLOCK following the FILE TITLE may (and most certainly will) contain one or more nested BLOCKS. An example of the Kay Elemetrics data file structure is shown in Fig. 4.
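To make the nested-BLOCK idea concrete, the following minimal sketch walks the top level of such a file. It assumes, purely for illustration, a RIFF-style layout in which each BLOCK carries a 4-character ASCII name followed by a 4-byte little-endian byte count; the actual NSP layout may differ in detail, and the function name read_blocks is ours, not Kay's.

    import struct

    def read_blocks(path):
        """Walk the top-level named BLOCKs of a FORM-prefixed sampled data file.
        Assumes a RIFF-like layout (4-char name + 4-byte length); illustrative only."""
        with open(path, "rb") as f:
            if f.read(4) != b"FORM":                 # check the FILE TITLE
                raise ValueError("not a FORM-style sampled data file")
            while True:
                header = f.read(8)
                if len(header) < 8:                  # end of file reached
                    break
                name, size = struct.unpack("<4sI", header)
                yield name.decode("ascii"), f.read(size)

Because every BLOCK is named and its placement is specified, a reader can verify file identity from the first bytes alone, which is what frees the filename extension for user classification.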
Currently Kay's NSP data file format used in CSL can accommodate the following information: creation date, time and title, sampling rate, signal length, signal levels for each channel, sampled data from up to four channels, IPA phonetic transcription, named tags and voiced impulse markers for each channel, palatometric data and a comment field. Under consideration is the inclusion of synchronous videostroboscopic images, signals associated with swallowing, the patient's case history data, clinical evaluation and acoustic analysis results. The format is also intended to accommodate several channels of data with different sampling rates.
The NSP format is very similar and easily convertible to the RIFF format, which is supported by Microsoft as a multimedia format. The products in the CSL family also support input and output in several other file formats, such as TIMIT, ILS, DAT-tape, headerless binary, and a flexible generic binary format with a header set by the user.
REFERENCES
[1]. Fant, C.G.M. Acoustic Theory of Speech Production. The Hague: Mouton, 1960.
[2]. Davis, S. Acoustic characteristics of normal and pathological voices. Speech and Language: Research and Theory. Academic Press, NY, 1979.
[3]. Titze, I., Baken, R., Herzel, H. Evidence of chaos in vocal fold vibration. In: Vocal Fold Physiology, edited by Ingo Titze. Singular Publishing, USA, 1993.
[4]. Klatt, D.H., Klatt, L.C. Analysis, synthesis, and perception of voice quality variations among female and male talkers. J. Acoust. Soc. Am. 87(2), 820-857, 1990.
[5]. Hirano, M. Clinical Examination of Voice. Springer Verlag, Vienna, 1981.
[6]. Kent, R. Vocal tract acoustics. Journal of Voice, Vol. 7, No. 2, 97-117, 1993.
[7]. Koike, Y. Application of some acoustic measures for the evaluation of laryngeal dysfunction. Stud. Phonol. VII, 17-23, 1973.
[8]. Koike, Y., Takahashi, H., Calcaterra, T. Acoustic measures for detecting laryngeal pathology. Acta Otolaryngol. 84, 105-117, 1977.
[9]. Deliyski, D. Digital Processing of Voice Signals in the Diagnosis of Laryngeal Diseases. Doctoral Dissertation, Bulgarian Academy of Sciences, Institute of Industrial Cybernetics and Robotics, Sofia, Bulgaria /in Bulgarian/, 1990.
[10]. Hess, W. Pitch Determination of Speech Signals. Springer Verlag, NY, 1983.
[11]. Rabiner, L. On the use of autocorrelation analysis for pitch detection. IEEE Trans. ASSP, Vol. ASSP-22, No. 3, 1974.
[12]. Deliyski, D. Investigation of the autocorrelation function characteristics in pathologic voice signal analysis. 3rd International Conf. on Statistical Theory of Communications STS'88, Varna, Bulgaria, pp. 17 /in Russian/, 1988.
[13]. Pinto, N., Titze, I. Unification of perturbation measures in speech signals. J. Acoust. Soc. Amer. 87(3), 1278-1289, 1990.
[14]. Hillenbrand, J. Perception of aperiodicities in synthetically generated voices. J. Acoust. Soc. Amer. 83(6), 2361-2371, 1988.
[15]. Nikolov, Z., Deliyski, D., Drumeva, L., Boyanov, B. Computer system for diagnostics of pathological voices. In Proc.: XXI-st Congress of the International Association of Logopedics and Phoniatrics, Prague, Czechoslovakia, Vol. 1, 973-976, 1989.
[16]. Kasuya, H., Ogawa, S., Mashima, K., Ebihara, S. Normalized noise energy as an acoustic measure to evaluate pathologic voices. J. Acoust. Soc. Amer. 80(5), 1986.
[17]. Winholtz, W., Ramig, L. Vocal tremor analysis with the vocal demodulator. J. Speech Hearing Res., Vol. 35, 562-573, 1992.
Qi-1
Wolfgang J. Hess
Institut für Kommunikationsforschung und Phonetik
Universität Bonn
May 4, 1994
Abstract
I. Introduction
Laryngeal diseases and disorders may cause disturbances in the voice signal. One significant disturbance is the presence of noise (Horii, 1980; Yumoto et al., 1982; Hillenbrand, 1987). The level of noise present in human voice often is difficult to quantify, in part, because the voice signal is complex and quasi-periodic (Kasuya et al., 1986; Muta et al., 1988; Qi, 1992). Because of the complex, quasi-periodic nature of human voice, many well-defined concepts in signal processing may not be directly applicable to the analysis of human voice signals. For example, the fundamental frequency of a periodic signal, f(t) = f(t + nT), n ∈ Z, is defined as 1/T. In theory, this definition cannot be applied to human voice signals because these signals are not truly periodic. Similar problems exist for the amplitude of voice signals as well. For example, it is known that the amplitude of a sinusoid refers to the maximum positive or negative excursion of the sinusoid from zero. The amplitude of a complex, periodic signal often refers to the amplitude of each sinusoidal component of the complex signal (Oppenheim and Schafer, 1989). The amplitude of a complex, quasi-periodic voice signal is not well identified. In this paper, we use the term fundamental period to refer to the duration between acoustic events that correspond to one cycle of vocal fold or voice source vibration. Fundamental frequency (f0) is the inverse of the fundamental period. The term amplitude is used to refer to the value of the voice signal at any instant in time. Amplitude perturbation refers to the total random variation in amplitude within one fundamental period.
Zero-padding can be used when the level of f0 perturbation is relatively small (Yumoto et al., 1982). When the perturbation in fundamental frequency is relatively large, for example in pathological voices, the zero-padding normalization method should not be used, because the computed variance in amplitude will be significantly inflated by f0 perturbation. By way of example, two periods of a voice signal are shown in Figure 1a and their difference is shown in Figure 1b. As can be seen, the amplitude differences between these two periods are primarily due to the difference in temporal structure of the signals. If one period is compressed or stretched, the amplitude perturbation or difference between the two periods is negligible (see Figures 1c and 1d).
One of us has recently suggested that voice amplitude perturbation should be estimated as the ensemble variance in amplitude after all periods are optimally aligned in time (Qi, 1992). In this earlier work, optimal time-alignment of fundamental periods was accomplished using an end-point-constrained, dynamic programming procedure, in which the end-points of each period were aligned first, i.e., prior to optimal time alignment of every point within a period. This method of estimating amplitude perturbation was shown to be highly accurate even when relatively large f0 perturbations were added to voice signals. An assumption underlying this method of time-normalization is that the boundaries of each fundamental period can be determined accurately.
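As a sketch of that idea, assuming the periods have already been time-normalized to a common length (the function name ensemble_snr and the particular SNR definition are our illustrative choices, not necessarily those of the original work):

    import numpy as np

    def ensemble_snr(periods):
        """SNR from an ensemble of aligned, equal-length fundamental periods:
        the ensemble mean is taken as the signal, and the ensemble variance
        about it as the (perturbation) noise."""
        periods = np.asarray(periods, dtype=float)   # shape: (n_periods, period_len)
        mean_period = periods.mean(axis=0)           # ensemble mean = signal estimate
        noise = periods - mean_period                # deviations = amplitude perturbation
        signal_power = np.sum(mean_period ** 2)
        noise_power = np.mean(np.sum(noise ** 2, axis=1))
        return 10.0 * np.log10(signal_power / noise_power)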
A. Zero-Phase Transformation
• Compute the magnitude spectrum and set the phase of each frequency component to zero.
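In code, this single step could look as follows (a sketch; the function name is ours):

    import numpy as np

    def zero_phase_transform(period):
        """Zero-phase transformation of one fundamental period: keep the
        magnitude spectrum, set every phase to zero, return to time domain."""
        magnitude = np.abs(np.fft.fft(period))   # magnitude spectrum, phase discarded
        return np.real(np.fft.ifft(magnitude))   # symmetric, zero-phase waveform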
The algorithm for unconstrained DP (Brown and Rabiner, 1982) is similar to that for constrained DP (Qi, 1992), except for the processing of the starting and ending points of each period. The process of time-normalization can be viewed as the search for an optimal matching path through the lattice of points (see Figure 3). The specific algorithm used in this work is briefly summarized below:
1. At the ith step in the horizontal direction, the lower limit and the upper limit for searching in the vertical direction were given by max(1, i − δ) and min(N, i + δ), respectively. These searching boundaries defined a polygon, shown in Figure 3.
2. Within these searching limits, the path connecting each point (i, j) to previous points in the lattice was determined by minimizing the total cost (rms difference):
(a) Starting with all points on the searching border (i = 1 or j = 1). Because these points have no predecessors, the squared difference between the samples at these points was computed as the starting cost.
(b) Looping through all (i, j). In each loop, the costs from the current point (i, j) to the predecessors (i−1, j), (i, j−1), and (i−1, j−1) were computed. The connection between point (i, j) and one of its predecessors was made such that the cost for making the connection plus the cost for reaching that predecessor was minimized. This minimum cost was stored as the cost for reaching point (i, j).
(c) When i = M, j = N, the search was terminated and a complete path could be retrieved from the point with minimum total cost on the ending border of the searching limits (i = M, N−δ ≤ j ≤ N and j = N, M−δ ≤ i ≤ M).
3. The final amplitude difference between any two periods was equal to the minimum total cost on the ending border of the searching limits.
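The three steps can be sketched compactly as below. The banded search and the three-predecessor recursion follow the summary literally; the band definition and the name udp_cost are our illustrative choices, not the authors' exact implementation.

    import numpy as np

    def udp_cost(x, y, delta=5):
        """Minimum total squared-difference cost for aligning two periods x, y,
        searched within a band of half-width delta around the diagonal, with
        unconstrained starting and ending borders (a sketch of the summary above)."""
        m, n = len(x), len(y)
        cost = np.full((m, n), np.inf)
        for i in range(m):
            for j in range(max(0, i - delta), min(n, i + delta + 1)):
                d = (x[i] - y[j]) ** 2
                if i == 0 or j == 0:                         # starting border
                    cost[i, j] = d
                else:                                        # best of 3 predecessors
                    cost[i, j] = d + min(cost[i - 1, j],
                                         cost[i, j - 1],
                                         cost[i - 1, j - 1])
        # minimum total cost on the ending border of the searching limits
        return min(cost[m - 1, max(0, n - delta - 1):].min(),
                   cost[max(0, m - delta - 1):, n - 1].min())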
The vowel /a/ was synthesized with a formant synthesizer. The synthesizer was a 5-formant, autoregressive (all-pole) digital filter whose coefficients were determined by 5 given pairs of formant frequencies and bandwidths (Rabiner and Schafer, 1978). The unperturbed excitation source to the synthesizer was an equally-spaced impulse train. The amplitude of the impulse was set to 1000. Controlled perturbations were superimposed on the impulse train, and the synthesis was performed by convolving the perturbed excitation source with the impulse response of the autoregressive filter. The sampling frequency of the synthesizer was 16 kHz. Twenty periods were synthesized for each SNR computation.
Noise was introduced by adding a zero-mean, Gaussian random number to each sample of the synthesized signal; the noise level was controlled by the standard deviation of the Gaussian distribution, given as the percentage of the impulse amplitude. Fundamental frequency perturbation was introduced by adding a zero-mean, uniformly distributed random number to each period of the impulse train. The level of f0 perturbation was controlled by the standard deviation of the random number generator, given as the percentage of the average period. Error in period determination was introduced by adding another zero-mean, uniformly distributed random number to the known location of each impulse after the vowel had been synthesized. The level of the error was controlled by the maximum of the random number generator, given in number of samples.
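The synthesis just described might be sketched as follows. The formant values in the usage line are textbook placeholders for /a/, and the uniform jitter here is parameterized by its half-range rather than by the standard deviation used in the study; synthesize_vowel is our name.

    import numpy as np
    from scipy.signal import lfilter

    def synthesize_vowel(formants, bandwidths, fs=16000, f0=120.0,
                         n_periods=20, jitter_pct=0.0, seed=0):
        """Excite an all-pole (autoregressive) filter with a perturbed impulse train."""
        rng = np.random.default_rng(seed)
        a = np.array([1.0])                       # build denominator from resonators
        for f, bw in zip(formants, bandwidths):
            r = np.exp(-np.pi * bw / fs)
            a = np.convolve(a, [1.0, -2.0 * r * np.cos(2 * np.pi * f / fs), r * r])
        period = fs / f0
        onsets, t = [], 0.0
        for _ in range(n_periods):                # impulse train with f0 perturbation
            onsets.append(int(round(t)))
            t += period * (1.0 + (jitter_pct / 100.0) * rng.uniform(-1, 1))
        excitation = np.zeros(onsets[-1] + int(period))
        excitation[onsets] = 1000.0               # impulse amplitude = 1000
        return lfilter([1.0], a, excitation), onsets

    # e.g., an /a/-like vowel with 5% f0 perturbation:
    signal, onsets = synthesize_vowel([730, 1090, 2440, 3500, 4500],
                                      [60, 110, 140, 180, 220], jitter_pct=5.0)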
To evaluate the effects of noise and PBD error on the computed SNRs, the average f0 (at 120 Hz and 220 Hz, respectively) and the degree of f0 perturbation (5%) were held constant. The standard deviation of the Gaussian random noise was increased from 1% to 25% of the impulse amplitude (1000) in incremental steps. The maximum of the random number generator for producing PBD error was varied from 0 samples to 10 samples in incremental steps of 1 sample.
Natural voices were used to further evaluate SNR estimation. The natural voices were used only to determine the effect of PBD errors on the computed SNRs. Sixteen non-smoking, healthy adults (8 men and 8 women) provided voice samples. Each subject produced a sustained /a/ at a constant, comfortable intensity level for a duration of more than 1 second. The microphone (ASTATIC, CTM-80) was placed about 10 cm in front of the subject's mouth. A pistonphone (GENRAD, Minical-1987) was used to record a calibration tone prior to each recording session, and all recordings were made in a quiet room. The recorded productions were digitized into a computer (SUN, Sparc10/30) at a sampling frequency of 16 kHz and a
quantization level of 16 bits. The signal was passed through an anti-aliasing filter
with a cut-off frequency of about 7.5 kHz prior to the digitization. A waveform editor
(Speech Acoustic Lab, Ocean) was used to select a stable 20-period segment for each
subject.
The period boundaries of the selected voice segments were determined from the
residue signal of linear predictive (LP) inverse filtering. The order of the LP filter
was 12. The autocorrelation method and the Hamming window were used in the LP
analysis. The window length was 256 points and the window step size was 128 points.
The location of period boundaries was identified using a time-delayed, peak-picking
algorithm. Time-delay was introduced to ensure that each maximum located was
global within a given time bracket and demarcated boundaries of the fundamental
periods. These period marks were assumed to be the correct period boundaries.
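A sketch of this boundary-marking front end, using the stated analysis settings (12th-order LP, autocorrelation method, Hamming window, 256-point frames, 128-point steps); the frame-by-frame inverse filtering and the name lp_residual are our illustrative shortcuts.

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lp_residual(signal, order=12, frame=256, step=128):
        """Residue of linear-predictive inverse filtering; period boundaries
        can then be picked from peaks of this residual."""
        residual = np.zeros(len(signal))
        window = np.hamming(frame)
        for start in range(0, len(signal) - frame + 1, step):
            x = signal[start:start + frame] * window
            r = np.correlate(x, x, mode="full")[frame - 1:frame + order]
            a = solve_toeplitz(r[:order], r[1:order + 1])  # LP coefficients
            inverse = np.concatenate(([1.0], -a))          # A(z) = 1 - sum a_k z^-k
            residual[start:start + step] = lfilter(
                inverse, [1.0], signal[start:start + frame])[:step]
        return residual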
Error in PBD was introduced by adding a zero-mean, uniformly distributed random number to the absolute time locations of the detected period boundaries. The level of PBD error was controlled by the maximum of the random number generator. This maximum varied from 0 samples to 10 samples, in incremental steps of 1 sample. The altered locations were used as the period boundaries for SNR computation. The SNRs were computed in the same manner as described earlier for synthetic voices.
A two-step procedure was used in the statistical analyses of the computed SNRs. First, a polynomial regression (3rd order) of SNR as a function of PBD error was made for each subject. Second, an analysis of variance (ANOVA) was undertaken to assess the effects of gender and method of SNR estimation on the coefficients of the regression polynomial. In the ANOVA, the dependent variables were the coefficients of the regression polynomial. The independent variables were gender group, method of SNR estimation, and the interaction between gender group and method of SNR estimation.
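Step 1 can be sketched directly; the arrays below are hypothetical stand-ins for one subject's computed SNRs, not the study's data.

    import numpy as np

    pbd_error = np.arange(11.0)                      # 0..10 samples of boundary error
    snr = np.array([20.1, 19.6, 18.8, 18.1, 17.0, 16.2,
                    15.1, 14.3, 13.2, 12.4, 11.5])   # hypothetical SNRs (dB)
    coeffs = np.polyfit(pbd_error, snr, deg=3)       # cubic, quadratic, linear, intercept

The four fitted coefficients per subject then serve as the dependent variables in the step-2 ANOVA.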
The computed and known SNRs of the synthetic signals are plotted as a function of noise level in Figure 4, as a function of f0 perturbation in Figure 5, and as a function of PBD error in Figures 6, 7 and 8. For synthetic voice signals, ZPT- and UDP-based SNRs accurately estimate known SNRs over a wide range of signal conditions/perturbations. ZP-based SNRs significantly underestimate known SNRs (see Figure 4). ZPT- and UDP-based calculations of SNR were not significantly influenced by level of f0 perturbation, whereas ZP-based SNR was significantly influenced by level of f0 perturbation (see Figure 5). UDP-based calculations of SNR were also not significantly influenced by PBD error. ZPT-based calculations of SNR decreased slightly as level of PBD error increased, when perturbations in f0 and amplitude were small (see Figure 6). ZP-based calculations of SNR were significantly influenced by level of PBD error (see Figures 6, 7 and 8).
The computed SNRs for each of the 16 natural voices are plotted as a function of PBD error in Figure 9. The means and standard deviations of the regression parameters of the functions plotted in Figure 9 are tabulated in Table I; the ANOVA results are summarized in Table II. The ANOVA results indicate that gender did not exert a significant effect on the regression parameters and that the interaction between gender and SNR estimation method was not significant. Method of SNR estimation did exert a significant influence on the regression parameters. Post hoc testing revealed that significant method differences in intercept and linear slope existed between ZP- and ZPT-based and between ZP- and UDP-based estimations. Significant differences in the quadratic and cubic terms were found between ZP- and UDP-based and between ZPT- and UDP-based estimations. ZP-based SNR functions exhibited a significantly smaller intercept and a significantly larger negative slope than ZPT- and UDP-based functions (see Figure 9). ZP- and ZPT-based functions exhibited significantly larger quadratic and cubic terms than UDP-based functions.
Three major findings emerge from these analyses. First, as expected, ZP-based SNRs generally did not provide accurate estimation of amplitude perturbations of synthetic and natural samples. Second, ZPT-based SNRs provided accurate estimation of amplitude perturbations of synthetic samples; however, estimation of amplitude perturbations present in natural voices measured by this method was significantly influenced by the level of PBD error. Third, UDP-based SNRs provided an accurate estimation of amplitude perturbations of synthetic and natural samples that was not significantly influenced by f0 perturbation and PBD error.
This work was motivated, in part, by our need to develop a method of measuring amplitude perturbation that did not depend upon accurate determination of period boundaries. In this paper and in prior work (Qi, 1992), we have attempted to address that need.
V. Acknowledgment
This work was supported, in part, by a grant from the National Institute on Deafness and Other Communication Disorders.
References
Kasuya, H., Ogawa, S., Mashima, K., and Ebihara, S. (1986). Normalized noise energy as an acoustic measure to evaluate pathologic voice. J. Acoust. Soc. Amer., 80:1329.
Muta, H., Baer, T., Wagatsuma, K., Muraoka, T., and Fukuda, H. (1988). A pitch-synchronous analysis of hoarseness in running speech. J. Acoust. Soc. Amer., 84:1292-1301.
Qi, Y. (1992). Time normalization in voice analysis. J. Acoust. Soc. Amer., 92:2569-2576.
Rabiner, L., Rosenberg, A., and Levinson, S. (1978). Considerations in dynamic time warping algorithms for discrete word recognition. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-26:575-582.
Figure 2. Sample signals before (top) and after (bottom) zero-phase transformation.
Figure 3. Lattice of points and the polygonal searching limits for the dynamic programming time alignment, with border points (1, 1+δ), (1+δ, 1), (M−δ, N), and (M, N−δ).
Figures 4-8. Computed and known SNRs plotted against noise level, f0 perturbation, and maximum error in period determination (in samples).
TABLE I. Results of Linear Regression Analysis
TABLE II. Results of Analysis of Variance
RAB-1
Acoustic analysis is often favored over perceptual evaluation of pathologic voice because it is considered objective, and thus reliable. "Subjective" ratings of voice quality are not highly regarded as either clinical or research tools: because of problems with intra- and interjudge reliability (e.g., Ludlow, 1981; Cullinan et al., 1963), because they are considered to lack objectivity and do not require great technical sophistication (Weismer & Liss, 1991), and because there is no accepted set of perceptual scales used by clinicians (e.g., Jensen, 1965; Yumoto et al., 1982). In part because of these views, so-called "objective," non-perceptual measures for vocal assessment have received much more attention in voice research. The assumption seems to be that some day acoustic measures may function in the place of perceptual assessment, thus alleviating concerns about listener unreliability.
However, recent studies suggest that this traditional bias in favor of acoustic analyses of
voice may be unwarranted. Perceptual data (Kreiman et al., 1993; Gerratt et al., 1993) indicate
that much of the noise in listeners' ratings is in fact predictable, and thus potentially controllable.
Further, a study comparing several systems for perturbation measurement (Bielamowicz et al.,
1993) suggested that agreement among different systems may be worse than assumed.
Bielamowicz et al. compared values of jitter and shimmer produced by C-Speech (ver. 3.1), Kay
CSL, SoundScope (ver. 1.09), and by an interactive hand marking system2 developed at the VA
Medical Center in West Los Angeles, for 50 voices ranging from normal to severely pathologic.
Results for jitter are summarized in Figure 1. Analysis packages varied in their level of overall
agreement, with Pearson's r for pairs of algorithms ranging from .21 to .77. However, even
systems whose jitter measurements were moderately correlated did not necessarily produce the
1 This research was supported in part by NIDCD grant # DC 01797 and by VA Merit Review
Funds. Address correspondence to Jody Kreiman, VA Medical Center, West Los Angeles,
Audiology and Speech (126), Wilshire & Sawtelle Blvds., Los Angeles, CA 90073.
2 In this system, a waveform landmark (positive peak, negative peak, or zero crossing) that
could be identified reliably from cycle to cycle was selected by hand. Perturbation measures
were calculated using linear or parabolic interpolation, as appropriate (Titze et al., 1987).
same numbers for a given voice. None of the lines in Figure 1 has a slope near 1, indicating that no two systems produced equivalent jitter values for the same voices.
Two questions emerge from these findings. First, how do perceptual ratings of voice
quality actually compare to acoustic measurements in reliability? Second, how similar are
perceptual and acoustic analyses in their characteristics as measurement systems? That is, how
similar are the patterns of agreement and disagreement among raters to those among analysis
systems? To address these questions, we asked listeners to rate the roughness of the voice
samples examined by Bielamowicz et al. (1993), and compared their ratings to the jitter
measurements produced by the four analysis systems studied there.
METHOD
Listeners
Ten listeners participated in this experiment. Each had a minimum of two years' experience evaluating pathologic voice quality.
Stimuli
Fifty voices (29 male and 21 female) were selected from an existing library of samples.
These voices were also used in the study by Bielamowicz et al. (1993) described above. Voices
ranged from normal to severely disordered, with approximately the same number of voices at
each of 5 severity levels.
Voice samples were originally obtained by asking speakers to sustain the vowel /a/.
Utterances were low-pass filtered at 8 kHz, and a 2-second sample was digitized at 20 kHz from
the middle of each utterance. Prior to the listening tests, digitized segments were normalized for
peak voltage, and onsets and offsets were smoothed by 50 ms ramps to eliminate click artifacts.
Procedure
Listeners rated each voice twice, although they were not informed that any voices were
repeated. Stimuli were rerandomized for each listener and were presented at a comfortable
listening level in free field.
RESULTS
Intrarater Agreement
Levels of test-retest agreement were acceptable for all listeners. Across listeners the
correlation (Pearson's r) between the first and second ratings ranged from .75 to .90, with a mean
of .83 (sd = .06). On the average, the first and second ratings differed by 9.8 mm (sd = 9.04).
Matched sample t-tests compared the first and second ratings of each voice, and indicated
that ratings drifted significantly within a listening session. On the average, voices sounded
significantly rougher at the second presentation than at the first (t = -7.56, df = 499, p < .01 one-
tailed). Differences between the first and second ratings were also significant for 5 of the 10
individual raters (p < .01, adjusted for multiple comparisons). This drift is consistent with
previous studies using unanchored rating protocols (Kreiman et al., 1993; Gerratt et al., 1993).
Of course, computer-based algorithms will always produce identical results under
identical conditions. However, changes in analysis parameters within a given system did
produce differences in results (Bielamowicz et al., 1993). Repeated independent analyses of 8
voices using the interactive hand-marking system produced mean jitter values within 0.05 ms in
all cases (mean difference = 0.01 ms; sd = 0.02); percent jitter values were within 1% in all cases
(mean difference = 0.20%; sd = 0.35%). (One voice was rejected as unmarkable in both
analyses.) Mean jitter values produced by tokenized and untokenized analyses in C-Speech were
correlated at .80; CSL analyses with tolerances of 1 and 20 ms were correlated at .47.
Interrater Reliability
Pairs of raters varied considerably in the extent to which their ratings agreed. Interrater
agreement ranged from .32 - .90 (as measured by Pearson's r), with a mean of .71 (sd = .14),
compared to a range of .21 to .77 for the different analysis systems (Figure 1). The intraclass
correlation (ICC) was calculated using a mixed model ANOVA treating voices and listeners as
random effects and presentations (first vs. second) as a fixed effect (model (2,1); e.g., Ebel,
1951; Shrout & Fleiss, 1979). This statistic reflects the overall cohesiveness of a group of raters,
as compared to the pairwise comparisons above, and reflects the extent to which the present data
might generalize to a new random sample of listeners. For the present data, the ICC = .64,
consistent with the variability seen in the pairwise comparisons. Confidence intervals about the
ICC were calculated using the formula in Shrout and Fleiss (1979). With 95% certainty, the true
ICC value fell in the range .54 < p < .75.
Examination of patterns of agreement among pairs of raters suggested that subjects fell
into two distinct "populations." One group included 7 raters; the other included 3. A one-way
ANOVA showed that pairs of raters drawn from a single population agreed significantly better
than pairs drawn from different populations (F(l,43) = 52.30, p < .01). Within a hypothetical
population of raters, Pearson's r for pairs of raters ranged from .61 to .90, with a mean of .81 (sd
= .07); across populations, r ranged from .32 to .81, with a mean of .60 (sd = .12).
Figure 2 shows the width of the 95% confidence interval (in mm) about the mean rating
of each voice, plotted against the mean rating for that voice. The better the agreement among
raters, the smaller the confidence interval. This figure shows the typical pattern (cf. Kreiman et
al., 1993) of better agreement among raters (i.e., narrower confidence intervals) for voices at
scale extremes, and worse agreement for moderately severe voices. The width of the confidence
intervals ranged from 4.4 mm (± 2.2 mm) to 21.8 mm (i.e., ± 10.9 mm), for the 75 mm scale
used here. In contrast, Figure 3 shows the 95% confidence intervals (in percent) around the
mean of the percent jitter values produced by the different acoustic analysis systems.
Uncertainty about measured jitter (indicated by larger confidence intervals) increases as a linear function of the mean value (F(1,47) = 878.11, p < .01; r2 = .95). Confidence interval width ranged from 0.48% to more than 44%.3
Figure 3. Variability in measured jitter as a function of the mean of values produced by four analysis systems.
Figures 4 and 5. Measurement variability plotted against rated severity, for listeners and for analysis systems.
Figures 4 and 5 show how measurement uncertainty varies with severity of (perceived)
pathology, for listeners and analysis systems respectively. For listeners, the range of variability
in ratings increased slightly for voices with moderately severe pathology, again consistent with
previous studies (Kreiman et al., 1993). That is, uncertainty about reliability is greatest for
voices in the mid-range of pathology, and least for voices with mild or extremely severe
pathology. In contrast, the "variability of the variability" associated with measured jitter
increases with severity for the analysis packages, although packages apparently agreed about
some voices at all severity levels.
Figure 6 shows the confidence intervals around mean voice ratings, plotted against the
confidence intervals for mean jitter values. Data on both axes have been log transformed. This
figure indicates that listeners tend to be most reliable when acoustic measures are most
unreliable, and vice versa4.
DISCUSSION
Levels of intrarater agreement in this study compare well to those in the literature (e.g.,
Kreiman et al., 1993; Gerratt et al., 1993), and represent good performance by experienced
listeners. At least some test-retest disagreement is caused by systematic drift in ratings, which
may be controllable by "anchored" paradigms using fixed comparison stimuli, as we have
recently proposed (Gerratt et al., 1993). In one sense, intra-system reliability is not a serious
issue for acoustic analyses, because computer-based algorithms will always produce identical
results under identical conditions. However, changes in analysis parameters within a given
system did produce differences in results (Bielamowicz et al., 1993). Recall that the correlation
between listeners' first and second ratings of the voices ranged from .75 to .90. The correlation
for analyses with different parameters within a given package ranged from .47 to .80. Thus
across voices performance for even the worst listeners compared well with that of the most
consistent analysis systems.
Across pairs of listeners, interrater reliability (measured by Pearson's r) ranged from .32
to .90; the ICC was .64. This compares well to Pearson's r for pairs of analysis systems, which
ranged from .21 to .77. However, agreement levels among listeners improved greatly (r = .61 to
.90) when listeners were compared only to others drawn from the same hypothetical "population
of raters." The finding that listeners agreed and disagreed in groups is consistent with the view that individual listeners attend to different aspects of these complex stimuli.5
3 Mean jitter values produced by C-Speech were converted to percent jitter for this analysis.
4 The regression is significant (F(l,47) = 14.88, p < .01); r2 = .24.
Figure 6. Confidence intervals around mean voice ratings plotted against confidence intervals for mean jitter values (both axes log transformed).
When listeners were compared only with others drawn from the same population, reliability improved. Variability in ratings remained fairly constant across levels of severity,
while variability in measured jitter increased dramatically with severity. These results suggest
that acoustic measures have advantages over perceptual measures for discriminating among
essentially normal voices. However, these advantages disappear once signals become irregular.
5 In particular, listeners varied in how they handled breathy turbulent noise and tremor.
We therefore question the clinical assumption that acoustic measures may reasonably substitute
for perceptual evaluation in the assessment of pathological vocal quality.
Our results suggest that measured jitter is a function of both signals and algorithms,
much as perceptual measures are a function of both signals and listeners. While standardization
of analysis techniques would solve the problem of disagreements among systems, a standard
protocol will still represent a mapping between signals and measured values. The critical issue
then becomes defining the "correct" algorithm, the choice of which must depend not only on
technical considerations, but also on the purpose for which these measures are intended. As long
as acoustic measures are used to detect or define pathology, to aid in diagnosis, to measure the
extent of pathology, or to monitor treatment, they must reflect listeners' perceptions reasonably
well. Standardization without attention to the characteristics of the application will result in
measurements which are not useful.
REFERENCES
Bielamowicz, S., Kreiman, J., Gerratt, B.R., Dauer, M.S., and Berke, G.S. (1993). A
comparison of voice analysis systems for perturbation measurement. Paper presented at
the 125th Meeting of the Acoustical Society of America, Ottawa.
Cullinan, W.L., Prather, E.M., & Williams, D.E. (1963). Comparison of procedures for scaling
severity of stuttering. Journal of Speech and Hearing Research, 6, 187-194.
Ebel, R. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 407-424.
Gerratt, B.R., Kreiman, J., Antonanzas-Barroso, N., & Berke, G.S. (1993). Comparing internal
and external standards in voice quality judgments. Journal of Speech and Hearing
Research, 36, 14-20.
Hillenbrand, J. (1988). Perception of aperiodicities in synthetically generated voices. Journal of
the Acoustical Society of America, 83, 2361-2371.
Jensen, P.J. (1965). Adequacy of terminology for clinical judgment of voice quality deviation.
The Eye, Ear, Nose and Throat Monthly, 44 (December), 77-82.
Kreiman, J., Gerratt, B.R., & Berke, G.S. (in press). The multidimensional nature of pathologic
vocal quality. To appear in Journal of the Acoustical Society of America.
RAB-10
Kreiman, J., Gerratt, B.R., Kempster, G., Erman, A., & Berke, G.S. (1993). Perceptual
evaluation of voice quality: Review, tutorial, and a framework for future research.
Journal of Speech and Hearing Research, 36, 21-40.
Ludlow, C. (1981). Research needs for the assessment of phonatory function. ASHA Reports,
11, 3-8.
Shrout, P., & Fleiss, J. (1979). Intraclass correlations: Uses in assessing rater reliability.
Psychological Bulletin, 86, 420-428.
Titze, I., Horii, Y., & Scherer, R.C. (1987). Some technical considerations in voice perturbation
measurements. Journal of Speech and Hearing Research, 30, 252-260.
Weismer, G., & Liss, J. (1991). Reductionism is a dead-end in speech research: Perspectives on
a new direction. In C. Moore, K. Yorkston, & D. Beukelman (Eds.), Dysarthria and
apraxia of speech: Perspectives on management (pp. 15-27). Baltimore: Brookes.
Wendahl, R. (1966). Laryngeal analog synthesis of jitter and shimmer: Auditory parameters of
harshness. Folia Phoniatrica, 18, 98-108.
Yumoto, E., Gould, W.J., & Baer, T. (1982). Harmonics-to-noise ratio as an index of the degree
of hoarseness. Journal of the Acoustical Society of America, 71, 1544-1550.
GER-1
Acoustic analysis is becoming the preferred means of documenting normal and abnormal
vocal qualities. These measures have long been popular in research applications, and the
availability of off-the-shelf, automated programs now permits researchers and clinicians to
generate acoustic values in almost real-time fashion. Thus, use of these measures is becoming
increasingly common in the clinic. Reflecting the increasing popularity and availability of
acoustic measures, speech scientists and clinicians have focused much energy on technical and
theoretical aspects of measuring vocal jitter and other aspects of vocal quality. We are here
today, a little over 30 years after Lieberman's (1963) original paper on jitter, finally talking
about ways to standardize these measures.
Although standardization is a worthy and noble goal, discussion of a measure's utility should be a prerequisite to the investment of more resources and effort. The utility of acoustic
measures is occasionally questioned in the literature, particularly with respect to analyses of
pathological voice (e.g., Hillenbrand, 1987), yet very little serious discussion has taken place in
the scores of papers that have appeared in recent years. One notable exception is Catford, who
argued in 1977 that the study of the acoustic signal without direct regard for the underlying
physiology or its perception by a listener is without much purpose. An analogy to this argument
is the study of the ink used in writing. Normally, the ink is useful only to the extent that its
brightness, color, texture, and pattern convey information from the writer to the reader. The
careful investigation of these ink qualities in themselves is not usually informative about either
the writer or the reader. Similarly, acoustic measures may shed light on physiologic or
aerodynamic processes in speech; however, direct measures of these processes provide much
better information, and are widely available.
What are the theoretical significance and practical uses of acoustic measures of vocal
quality? This paper will focus on vocal perturbation, because of its great popularity for both
1 This research was supported in part by NIDCD grant #DC01797, and by VA research funds.
Address correspondence to Bruce R. Gerratt, Audiology and Speech Pathology (126), VA
Medical Center, West Los Angeles, Wilshire & Sawtelle Blvds., Los Angeles, CA 90073.
voice researchers and clinicians and because knowledge of the vocal period is a requirement for
a number of other popular measures of voice. However, to varying degrees, our concerns apply
to many other acoustic measures of voice quality. We will argue that perturbation measures
have never been shown to correlate well with perceived vocal quality, have never been
convincingly demonstrated to distinguish among pathological diagnoses, and in fact do not even
consistently differentiate normal from pathological signals. Further, making these measures for
voices that deviate from normal periodicity is technically difficult, if not impossible. In fact, the
logic of measuring periodic deviation breaks down as the voices increasingly deviate from
periodicity.
However, understanding the relationship between acoustic and perceptual measures has
proven difficult. Voices that are quite similar in quality can have quite different perturbation
measures, and voices that are quite different in quality can have perturbation levels which are
similar.
In fact, studies examining the correlation between perturbation measures and perceptual
qualities in disordered and normal speakers have had consistently negative results (e.g. Arends et
al., 1990; Eskenazi et al., 1990; Heiberger & Horii, 1982; see Ludlow et al., 1987 for review).
Studies using synthetic stimuli or using imitations of pathological qualities produced by normal
speakers reported higher correlations (e.g. Coleman & Wendahl, 1967; Hillenbrand, 1988;
Wendahl 1963, 1966). These better results are probably explained by the fact that the stimuli
used vary primarily in only one dimension.
In contrast, pathological voices are perceptually complex, with many vocal qualities co-occurring and interacting. Studies using multivariate techniques and pathologic speakers have
reported better correlations between perturbation measures and perceptual dimensions in
multidimensional contexts (e.g. Kempster et al., 1991; Kreiman et al., in press; Eskanazi et al.,
1990). Presumably, these improved correlations reflect the use of more appropriate perceptual
models.
Thus, it appears that there is not a simple one-to-one correspondence between one
perturbation measure and one perceptual quality. Traditional approaches seeking such
associations imply far too simple a model of quality perception. Even traditional "qualities"
such as breathiness and roughness may in fact be multidimensional, as we have recently argued.
For example, voices described as breathy can include a diverse number of qualities which should
not be funnelled into one perceptual construct. In fact, we found that a large source of listener
variability (and an associated reduction in reliability) is that when listeners rate a voice on a
perceptual scale, they pay attention to different dimensions of that quality (Kreiman et al., in
press).
Such listener effects limit the correspondence obtainable between acoustic signals and qualities, because acoustic signals do not provide information about these listener- and protocol-dependent effects.
Figure 1. Acoustic waveform from a severely pathological voice: This 120 msec signal taken
from a sustained production of /a/ corresponds to a severely breathy, moderately rough vocal
quality, produced by a 39 year old man with chronic unilateral vocal fold paralysis after several
attempts at Teflon augmentation.
Importantly, supraperiodic voice signals are fairly common among pathological and
normal speakers. Klatt and Klatt (1990) reported that bicyclicity occurred in more than 25% of
the utterances they examined. Both human operators and machine algorithms have great
difficulty in measuring periodicity in these signals. The result is great disagreement among
programs and among people. Thus, the practical problem of actually making the measurement
in many of the voices which researchers and clinicians want to study also seriously reduces the
utility of the measures.
Figure 2. Bicyclic phonation (FO/2 subharmonic). There is noticeable period doubling in this
120 msec signal of a sustained /a/ from a 79 year old woman with vocal hyperfunction.
Boundaries of vocal periods are marked by arrows. The corresponding auditory impression is a
buzzing, mechanical, rough quality.
Figure 3. Diplophonic phonation (biphonation) - a high frequency wave (260 Hz) modulated by
one of much lower frequency (44 Hz): This 120 msec signal represents a sustained /a/ produced
by a 29 year old woman with a unilateral vocal fold paralysis. The corresponding auditory
impression is rough, pulsed, and quite complex, with ambiguous pitch.
Conclusion
We have argued that categorizing and describing the acoustic signal is far less important
than understanding its relationship to the other levels within the speech chain. However, we
have presented some discouraging arguments regarding the possibility of relating acoustic
measures to these other levels. Problems in correlating these measures to physiology appear
unsolvable at present, an observation which has been made repeatedly by many. Significant
theoretical and practical problems exist in relating acoustic measures to vocal quality perception, although ultimately these may someday be alleviated by developing better perceptual models which include careful attention to interactions among the signal, the listener, and the listening
task. Finally, periodicity is difficult to define for many voices, and some point exists beyond
which vocal cycles cannot be identified reliably. Measuring jitter makes little sense for this
category of voice. However, it is unclear at what point periodicity truly disappears, and it is
unclear how such a point might be defined consistently. The limits of periodicity need better
definition so that a user can know the level of confidence associated with a measurement result.
Until these concerns regarding measurement utility are fully addressed, standardization of
measures based upon vocal periodicity may proceed to an uncertain goal.
REFERENCES
Arends, N., Povel, D-J., van Os, E., and Speth, L. (1990). Predicting voice quality of deaf
speakers on the basis of glottal characteristics. Journal of Speech and Hearing Research,
33,116-122.
Coleman, F., and Wendahl, R. W. (1967). Vocal roughness and stimulus duration. Speech
Monographs, 34, 85-92.
Eskenazi, L., Childers, D. G., and Hicks, D. M. (1990). Acoustic correlates of vocal quality.
Journal of Speech and Hearing Research, 33, 298-306.
Gerratt, B. R., Kreiman, J., Antonanzas-Barroso, N., and Berke, G. S. (1993). Comparing internal and external standards in voice quality judgments. Journal of Speech and Hearing Research, 36, 14-20.
Hecker, M. H. L., and Kreul, E. J. (1971). Descriptions of the speech of patients with cancer of
the vocal folds. Part I: Measures of fundamental frequency. Journal of the Acoustical
Society of America, 49, 1275-1282.
Heiberger, V. L., and Horii, Y. (1982). Jitter and shimmer in sustained phonation. In N. J. Lass
(editor), Speech and Language: Advances in Basic Research and Practice, Vol. 7
(Academic Press, New York), pp. 299-332.
Hillenbrand, J. (1987). A methodological study of perturbation and additive noise in
synthetically generated voice signals. Journal of Speech and Hearing Research, 30, 448-
461.
Hirano, M. (1989). Objective evaluation of the human voice: Clinical aspects. Folia
Phoniatrica, 41, 89-144.
GER-7
Hirano, M., Hibi, S., Yoshida, T., Hirade, Y., Kasuya, H., and Kikuchi, Y. (1988). Acoustic
analysis of pathological voice. Acta Otolaryngologica (Stockholm), 105, 432-438.
Kempster, G. B., Kistler, D. J., and Hillenbrand, J. (1991). Multidimensional scaling analysis of
dysphonia in two speaker groups. Journal of Speech and Hearing Research, 34, 534-543.
Klatt, D. H., and Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87, 820-857.
Titze, I. R., Horii, Y., and Scherer, R. C. (1987). Some technical considerations in voice
perturbation measurements. Journal of Speech and Hearing Research, 30, 252-260.
Wendahl, R. W. (1963). Laryngeal analog synthesis of harsh voice quality. Folia Phoniatrica,
15,241.
Wendahl, R. W. (1966). Some parameters of auditory roughness. Folia Phoniatrica, 18, 26-32.
LEM-1
Division of Biostatistics
Department of Statistics
Yarmouk University
Irbid, Jordan
Introduction
Clinicians are often disenchanted by the possible diagnostic value of measures of voice and speech for various medical disorders associated with voice and speech. However, before the diagnostic value of these measures can be evaluated, we must establish normal ranges for these measures under standard conditions and across a variety of healthy subpopulations. If the percentiles are to be estimated by crude maximum likelihood estimates, the random sample sizes required to establish normal ranges within acceptable limits are usually 400 to 2000 subjects per subpopulation. If one employs resampling methods, one can substantially reduce the necessary sample sizes. This translates into large savings in cost and time, making the establishment of normal limits more feasible than ever before.
Measures of voice perturbation are typically skewed and bounded below by 0.0, such as
jitter, shimmer, the harmonic-to-noise ratio and the coefficients of variation for frequency and
amplitude. For measures such as these one is typically required to sample at least 800 subjects to
establish a normal range. Focusing on these four measures of voice perturbation, we will review
the concept of normal range, present the data to be used for our examples, highlight resampling
options and present antithetic resampling results for examples from a sample of 47 disease-free
males (Ramig and Ringel, 1983).
This work has been supported in part by the National Center for Voice and Speech, NCVS, with support from the National Institute on Deafness and Other Communication Disorders (Grant P60 DC00976) and in part by the University of Iowa College of Medicine Departments of Preventive Medicine and Environmental Health, and Otolaryngology.
Background
By definition, the normal range of a continuous variable contains 95% of all disease-free individuals in a population. Recognize that abnormal and disease-free are not synonymous: by definition, 5% of the disease-free people must have abnormal values. Conversely, a person whose value falls within the normal range is not guaranteed to be free of associated medical problems; changes within the normal range may be pathologic and indicative of a medical problem. The diagnostic value of a variable depends upon the distribution of the variable for people with a certain medical condition and the prevalence of the medical condition in the population that gets referred for evaluation. If everyone in the general population gets tested and the prevalence in the general population is low, the positive predictive value will be minimal even for sensitivities in the range of 99%.
Different populations may have different normal ranges. Males and females certainly have different normal ranges for fundamental frequency, and one expects them to have different normal ranges for other variables as well. Especially for variables which measure maximum-performance speech tasks, one will find that the normal ranges change with age (Ramig and Ringel, 1983). Age-related changes have been reported for fundamental frequency, maximum phonation range and average jitter by Mysak (1959), Endres, Bambach and Flosser (1971), Segre (1971), Hollien and Shipp (1972), and Wilcox and Horii (1980). For acoustic voice measures, the normal range typically becomes wider as the upper limit changes more rapidly than the lower limit, reflecting an increase in intersubject variability with age. If the measure is a perturbation measure bounded below by 0.0, the lower limit should be set at 0.0 and not change at all; the upper limit should be the 95th percentile (P95) and may be expected to increase with age. Other measures, such as maximum duration of sustained phonation or maximum phonation range, should have a normal range bounded below by P2.5 and above by P97.5, and these can be expected to decrease with age. There are also ethnic differences in some normal ranges.
It may not be desirable to be in or stay in the normal range. Just as basketball players prefer to be abnormally tall, singers and orators value voices with abnormally high or low pitch or abnormally wide pitch range. Olympic records are set by people with abnormal abilities. If a normal range is changing with age and you were normal when young, maintaining the youthful level may itself become abnormal with age.
The normal range is not to be confused with the normal distribution, even though the normal range is often explained using the normal distribution. In the normal-distribution case one typically wants the middle 95% of disease-free individuals defined as normal, leaving 2.5% on each side. In general one can split the 5% disproportionately, provided it makes sense pathologically, as long as 95% of the disease-free individuals are captured. Perturbation measures are skewed. Telling somebody that they are abnormal with a jitter of 0.1%, or with any other perturbation measure very close to 0, has no diagnostic value when the measure is associated with a disorder; one only wants to consider a person abnormal if the value is excessively high. For perturbation measures we therefore recommend using the 95th percentile (P95) as the upper normal limit and 0.0 as the lower limit.
Ramig and Ringel (1983) present perturbation measure data conditional upon 47 male subjects' exercise level. Their objective was to contrast healthy men who were active vs. inactive. The subject pool was partitioned into thirds by activity level, with the middle 1/3 excluded. Thus, the data do not represent a random sample of men, and the presented estimates should tend to overestimate the upper limits of the normal ranges.
Jitter and shimmer are computed from N consecutive cycles as

    ( Σ_{i=2}^{N} |x_i − x_{i−1}| ) / (N − 1),  where N is the number of consecutive cycles.

For jitter, x_i is the frequency of the ith cycle; for shimmer, x_i is the amplitude of the ith cycle. The coefficients of variation for frequency and amplitude are determined by dividing the standard deviation of the cycle frequencies or amplitudes, respectively, by the corresponding mean.
Jitter and shimmer are means of absolute values and thus always nonnegative.
Coefficients of variation in frequency and amplitude are nonnegative with very skewed
distributions and are typically analyzed inappropriately using normal distribution theory.
Similarly, the harmonic-to-noise ratio must be nonnegative and is very skewed.
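A direct transcription of these definitions (the function name and input format are ours; inputs are per-cycle frequency and amplitude tracks):

    import numpy as np

    def perturbation_measures(freqs, amps):
        """Jitter, shimmer and coefficients of variation from per-cycle data."""
        freqs = np.asarray(freqs, dtype=float)
        amps = np.asarray(amps, dtype=float)
        jitter = np.mean(np.abs(np.diff(freqs)))    # sum |x_i - x_{i-1}| / (N - 1)
        shimmer = np.mean(np.abs(np.diff(amps)))
        cv_freq = np.std(freqs) / np.mean(freqs)    # coefficient of variation
        cv_amp = np.std(amps) / np.mean(amps)
        return jitter, shimmer, cv_freq, cv_amp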
We present the data from Ramig and Ringel (1983) using stem-and-leaf plots, which retain all the data, unlike histograms where the actual values of the data are lost. The data are also ordered, and the plot reflects the shape of the distribution. Consider Figure 1, where we present the jitter data. The stem is the middle column of numbers and the leaves are to the right. There are 47 leaves for the 47 subjects. For jitter the leaf unit is the hundredth of a percent. The first row, 6 | 2 | 455789, represents 0.24%, 0.25%, 0.25%, 0.27%, 0.28% and 0.29%, the six lowest jitters (leaves), with the 6 on the left representing the cumulative frequency through the first level. The totals on the left are cumulative both from the minimum value at the top and from the maximum value at the bottom. The (9) on the left of the third row indicates the number of leaves on the branch with the median value (P50) of 0.43%, which is easily identified by using the cumulative counts either from the top or the bottom. The data range from a minimum of 0.24% to a maximum of 1.13%, represented by the last leaf at the bottom of the figure. The crude maximum likelihood estimates of the percentiles are simply the corresponding order statistics of the observed sample.
Figure 1: Stem-and-leaf plot of jitter for the 47 subjects (leaf unit = 0.01%)

      6 |  2 | 455789
     18 |  3 | 022223556666
    (9) |  4 | 111223789
     20 |  5 | 1249
     16 |  6 | 23789
     11 |  7 | 2347
      7 |  8 | 78
      5 |  9 | 478
      2 | 10 | 6
      1 | 11 | 3
The stem-and-leaf plots for shimmer, harmonic-to-noise ratio, and coefficients of variation
for amplitude and frequency appear in Figures 2-5, respectively. The shimmer values range from
0.8% through 7.2% with a median value P50 of 1.7% and P95 is 6.1%. The harmonic-to-noise
ratios vary from 14.7 to 25.6 with a P50 of 20.6 and P95 equal to 24.6. The coefficient of
amplitude ranges from 2.0 to 15.3 with P50 = 6.2 and P95 = 11.3; and the coefficient of frequency
ranges from 0.52 to 2.16 with P50 = 0.94 and P95 = 1.61.
Figures 2-5: Stem-and-leaf plots for shimmer, harmonic-to-noise ratio, and the coefficients of variation for amplitude and frequency.
Resampling Methods
We will briefly review the bootstrap (or uniform resampling) and antithetic resampling methods. The bootstrap is a uniform resampling of your data, where you randomly sample the original data with replacement. The technique is very useful whenever the sampling distribution of an estimator is unknown (Efron, 1990). From this group of 47 observations, randomly sample 47 observations with replacement; some observations will be selected more than once and some will not be selected at all. One random sample doesn't provide you with an estimate of the standard deviation of the sampling distribution of the estimator. Thus, you take, say, 100 random samples with replacement from your original data. The distribution of the 100 estimates approximates the sampling distribution of the estimator, and confidence intervals can then be established for the estimated parameter even when they may not be available theoretically.
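A minimal sketch of this procedure for P95, with 100 resamples as in the text (bootstrap_p95 is our name):

    import numpy as np

    def bootstrap_p95(data, n_resamples=100, seed=0):
        """Uniform (bootstrap) resampling estimate of the 95th percentile and
        the standard deviation of its sampling distribution."""
        rng = np.random.default_rng(seed)
        data = np.asarray(data, dtype=float)
        estimates = [np.percentile(rng.choice(data, size=data.size, replace=True), 95)
                     for _ in range(n_resamples)]
        return np.mean(estimates), np.std(estimates)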
To understand antithetic resampling, let x[1], x[2], x[3], ..., x[n] represent the original n observations sorted from the minimum x[1] to the maximum x[n]. If there are ties, they are replicated as in the stem-and-leaf figures. Each antithetic sample estimate begins with a uniform resampling just as in the bootstrap. Now, each sample is paired with another sample where the ranks are reversed in the following sense: if x[i] is in the initial sample, then x[n−i+1] is in the antithetic pair. For n = 47, there are the same number of x[46]'s in the paired sample as there are x[2]'s in the first sample. The estimates from the antithetic paired samples are negatively correlated, since an overestimate of the parameter in one sample provides a tendency to underestimate the parameter in the other sample of the antithetic pair. The two estimates in each pair are averaged to obtain the estimate from the pair. Taking the estimates from 100 antithetic pairs, one obtains an approximation of the sampling distribution of the antithetic resampling estimator. Since the antithetic pair estimates are negatively correlated, one is much better off taking a random resampling of 100 antithetic pairs than taking 200 bootstrapped resamples.
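The pairing can be sketched as follows: each uniform resample of ranks is mirrored (rank i paired with rank n−i+1) and the two P95 estimates are averaged (antithetic_p95 is our name).

    import numpy as np

    def antithetic_p95(data, n_pairs=100, seed=0):
        """Antithetic resampling estimate of the 95th percentile."""
        rng = np.random.default_rng(seed)
        x = np.sort(np.asarray(data, dtype=float))   # x[0] .. x[n-1] by rank
        n = x.size
        pair_means = []
        for _ in range(n_pairs):
            ranks = rng.integers(0, n, size=n)       # uniform resample of ranks
            p1 = np.percentile(x[ranks], 95)
            p2 = np.percentile(x[n - 1 - ranks], 95) # reversed-rank (antithetic) sample
            pair_means.append(0.5 * (p1 + p2))
        return np.mean(pair_means), np.std(pair_means)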
We performed extensive Monte Carlo studies across a variety of symmetric and skewed distributions and sample sizes (64 and 100) to compare crude maximum likelihood estimates, bootstrap resampling, importance resampling and antithetic resampling. We found that the antithetic resampling estimate was the least biased and had the smallest mean square error (MSE) about the actual value of P95 being estimated. The MSE of the estimates of P95 using antithetic resampling was at least 17% less than that of the crude estimates for all of the underlying distributions and sample sizes. To highlight the accuracy when estimating P95 = 9.488 from a Chi-square distribution with 4 degrees of freedom, based upon 1000 simulations the mean crude maximum likelihood estimate and the mean antithetic resampling estimate were 8.928 and 9.344, respectively.
When the actual parametric form of the underlying distribution function is known, solving for the maximum likelihood estimator is recommended over antithetic resampling. However, the benefit is small, and if the underlying distribution is questionable or possibly misspecified, the resampling approach is safer. Further gains may be achievable through importance sampling with appropriately chosen weights (Kim-Anh and Hall, 1991; Hall, 1991).
Results
Having taken 100 random resamples from each of the observed distributions of perturbation measures, we present the crude maximum likelihood estimates and antithetic resampling estimates of P95 in Table 1. In addition, we include an estimated standard error of P95 (the standard deviation of the antithetic estimates), the corresponding confidence intervals and an estimated relative efficiency. To compute the relative efficiency, we divided the variance of the 200 unpaired bootstrap estimates by the variance of the 100 estimates from the antithetic pairs.
The crude and antithetic resampling estimates of P95 varied depending upon the observed tails of the perturbation measure distributions. The greatest difference appeared with shimmer, where the crude estimate is 15% greater than the antithetic estimate. Even though the estimates may not vary dramatically, note in the stem-and-leaf plots how dramatically one could vary the crude estimates by adding a single observation to the tail of the distribution.
The 95% confidence intervals for P95 are fairly wide with a sample size as small as 47.
There is also a 0.090 probability with n = 47 that all observations were less than P95 (since
0.95^47 ≈ 0.090). When this occurs the estimates are obviously underestimates. With a minimum
random sample size of 100, this probability drops substantially to 0.0059. One should not attempt to establish normal limits
with sample sizes of 47 even with antithetic resampling.
The minimum estimated relative efficiency is 2.2. Thus, one will have a savings of at least
55% in the number of subjects required to obtain an estimate of P95 for these perturbation
measures. With n = 47 we have antithetic estimates which would have required sample sizes of at
least 103 (47 x 2.2) for crude estimates. If one would otherwise be required to randomly sample 800 disease-
free individuals in a subpopulation, one now needs only 364 to obtain the same degree of
accuracy. If one wants to estimate percentiles further from the median, the relative efficiency
decreases as the negative correlations approach 0. If one wants to estimate percentiles closer to
the median, the negative correlations approach -1 and the relative efficiency increases
dramatically.
The correlations of the antithetic pairs for jitter, shimmer and harmonic-to-noise ratio are
-0.181, -0.074 and -0.076, respectively. For the coefficients of variation the correlations are
-0.149 and -0.084 for amplitude and frequency, respectively. These negative correlations were
crucial to obtain smaller variances through antithetic resampling than for crude estimates or
bootstrap resampling estimates. The benefit becomes apparent when you consider Figures 6-7,
where the resampled estimates for the coefficient of variation of amplitude are the means for the
pairs in Figure 6 and the individual unpaired estimates in Figure 7. When paired, the standard
deviation is 0.798; while unpaired, the standard deviation is 1.217.
Figure 6: Stem-and-leaf plot for the 100 antithetic pair estimates of the coefficient of variation of amplitude (leaf unit = 0.10)

Figure 7: Stem-and-leaf plot for the 200 unpaired estimates of the coefficient of variation of amplitude (leaf unit = 0.10)
Discussion
The diagnostic value of any voice characteristic for a specific voice or speech disorder
depends upon knowledge of the normal range of the characteristic for disease-free individuals, as
well as the distribution of the variable for subjects with the disorder. Current dissatisfaction with
the diagnostic value of many measures coexists with a lack of knowledge of the normal range.
For measures of perturbation especially, there will be many conditions where there is substantial
overlap between the perturbation measure distributions of the disease-free and those with the
disorder, and in such cases there will be little diagnostic value, if any. The diagnostic value varies both by voice
or speech disorder and voice characteristic. With the advent of antithetic resampling one can
practically establish normal limits for voice measures using substantially fewer disease-free
subjects. Since one must expect the normal limits to vary by age, gender and ethnicity, the
savings is magnified many times when establishing a comprehensive set of normal limits.
It is crucial that the set of voice measures that has diagnostic value for any specific
disorder be as small as possible, in order to limit the chance of diagnosing an excess
number of disease-free individuals as having a voice disorder, that is, of having a low specificity.
Thus, for each voice disorder one needs to restrict the number of voice characteristics considered
to have diagnostic value to at most four or five. Given the correlation between many voice
measures, the preferred sets should have measures which are orthogonal to each other.
If one's level of physical activity affects the perturbation measures assessed, then the
estimates in Table 1 are likely overestimates of the true P95's. This would be the case if one
should expect fewer than 5% of the 24 males excluded from the Ramig and Ringel (1983) study to
have values less than the estimated P95's. Since the Ramig and Ringel sample of 47 was not a
random sample of disease-free subjects, the emphasis of this article is not on the estimated values
but upon the antithetic resampling method used to estimate the normal ranges.
In speech analysis, subjects are at a premium, since data collection can be expensive and time
consuming; thus we require resampling methods such as antithetic resampling to obtain good
estimates of the parameters of interest. This is true for other measures of speech performance
besides percentiles for disease-free subjects. This technique should be expanded to estimation
within tokens and across tokens to properly evaluate individual patients.
References
Do, Kim-Anh and Hall, Peter (1991). On importance sampling for the bootstrap. Biometrika,
78(1): 161-167.
Efron, Bradley (1990). More efficient bootstrap computations. Journal of the American
Statistical Association, 85(409): 79-89.
Endres, W., Bambach, W. and Flosser, G. (1971). Voice spectrograms as a function of age, voice
disguise and voice imitation. Journal of the Acoustical Society of America, 49: 1842-1848.
Hall, Peter (1991). Bahadur representations for uniform resampling and importance resampling,
with applications to asymptotic relative efficiency. The Annals of Statistics, 19(2): 1062-1072.
Hammersley, J. M. and Handscomb, D. C. (1964). Monte Carlo Methods. John Wiley, New
York.
Hinkley, D. V. and Shi, S. (1989). Importance sampling and the nested bootstrap. Biometrika,
76(3): 435-446.
Hollien, H. and Shipp, T. (1972). Speaking fundamental frequency and chronological age in males.
Journal of Speech and Hearing Research, 15: 155-159.
Johns, M. Vernon (1988). Importance sampling for bootstrap confidence intervals. Journal of
the American Statistical Association, 83(403): 709-714.
Mysak, E. D. (1959). Pitch and duration characteristics of older males. Journal of Speech and
Hearing Research, 2: 46-54.
Ramig, L. A. and Ringel, R. L. (1983). Effects of physiological aging on selected acoustic
characteristics of voice. Journal of Speech and Hearing Research, 26: 22-30.
Segre, R. (1971). Senescence of the voice. Eye, Ear, Nose and Throat Monthly, 50: 223-233.
Serfling, Robert J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley &
Sons, New York.
Wilcox, K. A. and Horii, Y. (1980). Age and changes in vocal jitter. Journal of Gerontology, 35:
194-198.
EPN-1
1. Introduction
As data for a study of changes in voice with age, Arthur House has two sets of recordings
of three male subjects, the first made in 1960, when the subjects were all about 37 years old,
and the second in 1990, when they were about 67. Certain findings about vowel duration and
amplitude have already been reported (House and Stevens 1993). I have been working with
Dr. House on a continuation of the study, measuring other speech features such as intonation
patterns and excitation characteristics. Here two aspects of this work are discussed: automatic
location of pitch epochs, and estimation of jitter given these epochs.
(Note: the phrase "pitch period" is used in the literature ambiguously: sometimes it means
the length of time between glottal closures, and sometimes it means the segment of speech
starting at one glottal closure and ending at the next. Here I use "pitch period" for the first, and
"pitch epoch" for the second.)
There are three talkers, AH, JM, and KS. Each talker produced the same set of bisyllabic
nonsense tokens, consisting of a vowel embedded between identical consonants and preceded by
an unstressed /HAX/. Thus typical stimuli sound like "hubob", "hutat", "hugig". Each talker
went through exactly the same set of utterances in his "old" recording as he had done in his
"young" recording (but in a slightly different order).
Tokens were recorded on tape, then digitized at a 10 kHz sampling rate, 12 bits per sample.
Hum and noise are negligible.
In the full study, there are 10 vowels and 23 consonants. For this paper, I selected 12
consonants and three vowels, /AA/, /IY/, and /UW/, and extracted from each token a sub-
token consisting of the "middle half" of the vowel; that is, the signal starting 1/4 of the way into
the vowel and ending 3/4 of the way into the vowel (according to Dr. House's hand marking).
Table 1 gives the total number of pitch epochs per vowel.
               AH            JM            KS
    Vowel   young  old    young  old    young  old
    /AA/     172   170     169   177     135   133
    /IY/     146   154     144   144     113   104
    /UW/     165   170     146   143     113   107

Table 1: Number of pitch epochs per vowel, per talker, per age
If we were observing the larynx, we might take as pitch period the time between occurrences of
some well-defined event in the laryngeal cycle. However we are not observing the larynx, or even
capturing the signal at the larynx; we have to make do with a poorly-understood transformation
of that signal.
If pitch and amplitude are steady, and formants and excitation are not changing, intervals
between identical events in successive epochs might be a usable approximation to the desired
laryngeal intervals. An example would be the time of occurrence of the largest peak in the pitch
epoch. We would also expect, in this case, that the interval at which the signal autocorrelation
has a maximum will be the same as this interval between well-defined events; and indeed, peak-
picking and autocorrelation, the two most widely used methods for finding pitch period, usually
yield very similar estimates (see e.g. Titze and Liang 1993).
However, if any or all of these properties are changing, it's a whole new ball game. If pitch is
changing, what exactly do we mean by "pitch period"? At the larynx, there is perhaps a physical
event that occurs once per "cycle"; but in the speech signal, what part of two successive epochs
should we use as benchmarks to measure the interval? Or if we use autocorrelation, how should it be
done? (Because the length and type of correlation certainly affect the location of its maximum.)
If formants are moving, or the excitation is changing, successive epochs look different, and
there may be no event in two successive epochs to use for an interval measurement. If, in the
speech signal, we could find the moment of a laryngeal event such as closure, we could use that
as the benchmark; I know no way of doing this reliably, especially when the form of the epoch is
changing.
The tokens in this study contain all these sources of variability. They are spoken with "list
intonation", the style in which subjects produce every word or phrase in the list with falling pitch.
(The pitch may actually rise a little at the start.) Figure 1 illustrates this; you can see that the
pitch periods at the end of the utterance are longer than those at the beginning. (You can also
see that amplitude is decreasing, another feature of this speaking style.)
Figure 1: A typical token
Since the vowel is surrounded by consonants, the formants move, especially at the beginning
and end of the vowel. In Figure 1 you can see how formant movement is causing differences in
shape of successive pitch epochs. There are also some tokens with change in excitation, and it
is these that give the most trouble to the pitch tracker and epoch finder described below.
If we needed to determine jitter absolutely (or clinically), these difficulties would be formidable.
However what we want in our study is a comparison between jitter now and jitter 30 years ago,
or jitter of AH and jitter of KS, say. And until and unless we look at jitter for vowels in their
individual contexts, we have plenty of data; at least 100 pitch epochs for every talker/vowel.
Thus (and perhaps this applies in the clinical case as well) consistency is more important than
accuracy. We have so far considered four working definitions of jitter, all of which depend on a
certain epoch-finding algorithm, which is described in the next Section. The measures of jitter
are discussed in Section 5.
Our algorithm for finding pitch epochs involves two kinds of speech analysis and two dynamic
programs. It is complicated because in real speech, no single pitch-finding method (known to
me) finds all individual pitch epochs reliably, without occasionally making an unacceptable error.
Since in our study we have a great number of tokens to deal with, and cannot look at all pitch
tracks for all tokens, and since it doesn't take many doubled or halved pitch periods to bias
a statistical test, we need a pitch-epoch-finder that we can trust to operate without error on
reasonably clean speech. The algorithm proceeds as follows:
1) Form the smoothed, half-wave-rectified LPC residue for the entire token;
2) From the residue, estimate pitch period every 100 samples;
3) Find all the peaks in the rectified LPC residue;
4) From peaks, select "pitch pulses" that divide the token into epochs;
5) At each pitch pulse, reestimate pitch period from the original speech.
4.1 The LPC Residue
The LPC residue is created by doing a 12-coefficient LPC at centisecond (100 sample) intervals,
using 150 Hanning-windowed samples, and putting each 100 samples back through its local
inverse LPC filter to obtain the current 100 samples of residue. The first line in Figure 2 shows
a typical speech token, the second line shows its LPC residue.
The next step is a mild low-pass filtering of the LPC residue. The filter is a crude one: the
output is the sum of the amplitudes of the last 5 samples, plus the sum of the last 6 samples,
plus the sum of the last 7 samples.
In preparation for picking peaks, the residue is tested to determine whether the positive-going
peaks or the negative-going ones are more prominent. The sum of squares of the positive values
in the residue is compared with the sum of squares of the negative ones. (Fourth powers might be
better.) If the negative total is larger, the residue is inverted. Then, since only positive peaks will
be used, the signal is half-wave rectified (negative values are set to zero). The rectified smoothed
residue is shown as the third line of Figure 2.
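The following sketch (Python with numpy/scipy; the function names and framing details are illustrative, not the author's Fortran/C) computes the residue, applies the crude low-pass filter described above, and performs the polarity test and half-wave rectification.

    import numpy as np
    from scipy.signal import lfilter

    def lpc_coeffs(frame, order=12):
        # Autocorrelation-method LPC solved by the Levinson-Durbin recursion.
        n = len(frame)
        r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
        a = np.zeros(order + 1)
        a[0], err = 1.0, r[0]
        for i in range(1, order + 1):
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a[1:i] += k * a[1:i][::-1]
            a[i] = k
            err *= 1.0 - k * k
        return a

    def lpc_residue(x, step=100, win_len=150, order=12):
        # 12-coefficient LPC every 100 samples on 150 Hanning-windowed
        # samples; each 100-sample block is passed through its local
        # inverse filter A(z) to produce the residue.
        res = np.zeros(len(x))
        win = np.hanning(win_len)
        for s in range(0, len(x) - win_len, step):
            a = lpc_coeffs(x[s:s + win_len] * win, order)
            res[s:s + step] = lfilter(a, [1.0], x[s:s + step])
        return res

    def smooth_and_rectify(res):
        # Crude low pass: the sum of the last 5 samples, plus the sum of
        # the last 6, plus the sum of the last 7 -- an FIR kernel 3,3,3,3,3,2,1.
        sm = lfilter([3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 1.0], [1.0], res)
        # Invert if negative peaks dominate (sum-of-squares test), then
        # half-wave rectify (set negative values to zero).
        if np.sum(sm[sm < 0] ** 2) > np.sum(sm[sm > 0] ** 2):
            sm = -sm
        return np.maximum(sm, 0.0)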
Figure 2: A speech token, its LPC residue, the rectified smoothed residue, the normalized peaks, and the token with pitch pulses superimposed
Pitch is computed every 100 samples, using as input signal the low-passed LPC residue. The
pitch finder is a correlation pitch detector with a dynamic-program pitch tracker, which returns
at every time the "best" 11-long sequence of pitch periods ending at that time. Graphs of pitch
tracks on all the material used in this study show that the pitch program made no egregious error
on any token.
Next, all the peaks in the rectified residue are located and normalized. Normalization consists
in expressing the height of a peak as the ratio of its height to that of the tallest peak within a
span of the local pitch period. The fourth line in Figure 2 shows the normalized peaks.
Pitch and normalized peaks are passed to a dynamic-program "pitch-pulse finder", which
accepts a sequence of candidate peaks (times and amplitudes), and finds the "best" subsequence,
where goodness is a function of both consistency of the interval between chosen peaks, and their
amplitudes. This subroutine, too, produces no egregious error on any token. The last line in
Figure 2 shows the speech again, with pulses superimposed. There is one pulse per epoch.
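A dynamic program of this general shape is sketched below (Python; the scoring function and the fixed pitch period are illustrative stand-ins for the author's unspecified cost, which likewise trades off peak amplitude against interval consistency and uses the local pitch estimate).

    import numpy as np

    def pick_pulses(times, amps, period, w_int=1.0):
        # Best-subsequence dynamic program over candidate peaks: ending a
        # pulse train at peak j scores its amplitude plus the best
        # predecessor score, minus a penalty for deviation of the implied
        # interval from the pitch period.
        n = len(times)
        score = np.array(amps, dtype=float)   # any peak may start a train
        back = np.full(n, -1)
        for j in range(n):
            for i in range(j):
                gap = times[j] - times[i]
                if gap > 1.8 * period:        # too distant to be adjacent pulses
                    continue
                s = score[i] + amps[j] - w_int * abs(gap - period) / period
                if s > score[j]:
                    score[j], back[j] = s, i
        j, path = int(np.argmax(score)), []
        while j != -1:                        # trace back the best subsequence
            path.append(j)
            j = back[j]
        return [times[p] for p in reversed(path)]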
The last step in the algorithm is a more precise determination of pitch period. At each chosen
pulse, as described above, there is a local (crude) pitch period P. At that pulse the signal is
autocorrelated at every (integral) lag L in the range (P - P/2) to (P + P/2). More precisely, each
lag L is tested by forming the dot product of the L-long stretch to the left of the pulse
with the L-long stretch to the right of the pulse, normalized by the power in the two L-long
stretches. Figure 3 illustrates the process. Panel (a) shows the signal, centered at the local
"pulse". Panel (b) is the autocorrelation function of this signal, computed at integral lags. (For
a discussion of this technique see e.g. Hirose et al. 1992.)
Finally, cubic spline interpolation is done near the maximum of the autocorrelation function,
creating values every 10 microseconds. Panel (c) is a blowup of the region inside the rectangle in
panel (b), showing the interpolated points. The abscissa at which the maximum occurs is defined
to be the pitch period.
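In outline (Python with numpy/scipy; the helper name and the spline neighborhood size are illustrative), the refinement step looks like this; at the 10 kHz sampling rate used here, a 0.1-sample grid corresponds to the 10-microsecond spacing mentioned above.

    import numpy as np
    from scipy.interpolate import CubicSpline

    def refine_period(x, pulse, p_crude):
        # Score each integral lag L in (P - P/2, P + P/2) by the dot
        # product of the L samples left of the pulse with the L samples to
        # its right, normalized by the power of the two stretches.
        # Assumes the pulse is at least 1.5 * p_crude samples from each end
        # and that at least four lags fall in the spline neighborhood.
        lags = np.arange(p_crude - p_crude // 2, p_crude + p_crude // 2 + 1)
        score = np.empty(len(lags))
        for j, L in enumerate(lags):
            left, right = x[pulse - L:pulse], x[pulse:pulse + L]
            score[j] = np.dot(left, right) / np.sqrt(
                np.dot(left, left) * np.dot(right, right))
        # Cubic-spline interpolation near the maximum, on a 0.1-sample grid.
        k = int(np.argmax(score))
        lo, hi = max(0, k - 3), min(len(lags), k + 4)
        cs = CubicSpline(lags[lo:hi], score[lo:hi])
        grid = np.arange(lags[lo], lags[hi - 1] + 0.05, 0.1)
        return float(grid[np.argmax(cs(grid))])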
Figure 3: (a) the signal centered at the local pulse; (b) its autocorrelation at integral lags; (c) a blowup of the region near the maximum, showing the interpolated points
5. Defining Jitter
Now that successive pitch periods for a token are known, we would like a number that
indicates how much the sequence of pitch periods departs from a "smooth" sequence. There is
little agreement among voice analysts on how to measure "roughness" in such sequences, even
in sustained monotone vowels (see e.g. Karnell et al. 1991). The simplest measure is just the sum
of (absolute) differences between adjacent pitch periods; this puts all its emphasis on adjacent
epochs, and none on epochs two or more apart.
(Voice analysts are not alone in having no obviously correct way to express roughness in a
sequence of observations. Every experimental science that produces data of this kind has this
problem, and many a statistician has blunted his spear (or obtained a Ph.D.) trying to find a
satisfactory answer to this ill-posed question.)
Four definitions of jitter have been looked at to date; they give different numbers on any one
token, but when used in a comparative way they all seem to tell about the same story.
A simple procedure, found in many papers on jitter, is based on the absolute difference between
the period now and the average of the previous period and the following period. That is, if three
successive pitch periods are P(n-1), P(n), and P(n+1), the local contribution to jitter is

J1(n) = |P(n) - (P(n-1) + P(n+1))/2|

If epochs are short, then a given difference from expected is perhaps more significant than if
they are long; one way to take this into account is to express the difference as a percentage of
the local pitch period, and define jitter as the average of these percentages. Our first measure
of jitter, then, is

Jitter = (100/N) Σ_n J1(n)/P(n)
Distance from average has a disturbing property. In places where pitch is changing fast, the
contributions to total jitter tend to be larger than those from regions of slow change. In Figure 4,
the solid curve is pitch period, and the dotted one is the average of the two surrounding pitch
periods. The vertical distance of the point at epoch 8 to its "expected" value is large, but it is
really quite close to the line joining its two neighbors. (The vertical distance can be reduced by
averaging in the current period as well as its two neighbors.)
Figure 4: Pitch period (solid) and the average of the two surrounding periods (dotted)
To overcome this difficulty, one can take as the contribution of each point to the total
jitter its distance from the line joining its two neighbors. If three consecutive periods are again
P(n-1), P(n), and P(n+1), this distance is

J2(n) = |2P(n) - P(n-1) - P(n+1)| / sqrt(4 + (P(n+1) - P(n-1))^2)

It is now perhaps inappropriate to normalize by the size of the local pitch period. Our second
working definition is

Jitter = (1/N) Σ_n J2(n)
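Both measures reduce to a few lines of code (a sketch in Python with numpy; the names follow the formulas above):

    import numpy as np

    def jitter_1(P):
        # Average of |P(n) - mean of its neighbors| as a percentage of P(n).
        P = np.asarray(P, dtype=float)
        j1 = np.abs(P[1:-1] - 0.5 * (P[:-2] + P[2:]))
        return 100.0 * np.mean(j1 / P[1:-1])

    def jitter_2(P):
        # Average perpendicular distance from each point to the line
        # joining its two neighbors (not normalized).
        P = np.asarray(P, dtype=float)
        num = np.abs(2.0 * P[1:-1] - P[:-2] - P[2:])
        den = np.sqrt(4.0 + (P[2:] - P[:-2]) ** 2)
        return float(np.mean(num / den))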
There are many algorithms for approximating a sequence of points by a "smooth curve". Use
of any particular algorithm is based on a desire by the owner of a set of data to capture some
feature of the data. In a recent paper on characterizing intonation patterns in spoken Mandarin,
S. Chen and Y. Wang (Chen and Wang, 1990) describe an algorithm they consider appropriate to
their problem, based on Legendre polynomials. On the sequences of pitch periods for the tokens
in our study, except in a few instances, their algorithm produces a curve that, to the eye, is indeed
a "smooth" version of the plotted data.
Figure 5 shows three such curves (dotted line) superimposed on the data they purport to
approximate (solid line). Panel (a) shows an unusually good fit, panel (b) an unusually bad fit,
and panel (c) a typical fit.
Figure 5: Legendre approximations (dotted) superimposed on pitch-period data (solid): (a) an unusually good fit, (b) an unusually bad fit, (c) a typical fit
If we call the ordinates of the approximating curve {C(n)}, and view C(n) as the "expected"
value of the nth period, then the distance analogous to that defined in Section 5.1 is

J3(n) = |P(n) - C(n)|

and the third measure of jitter is again the average of these distances as percentages of the local period.
The fourth and final measure is again a response to the problem of having larger deviations
where the pitch periods are changing rapidly. The program finds the distance from the plotted
data point P(n) to the nearest point on the curve {C(n)}, and jitter is the average of the
absolute distances (not normalized).
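A sketch of measures three and four (Python with numpy; the polynomial degree and the dense-sampling approximation of "nearest point on the curve" are illustrative choices, and Chen and Wang's fitting procedure differs in detail):

    import numpy as np

    def jitter_3_and_4(P, deg=5):
        P = np.asarray(P, dtype=float)
        n = np.arange(len(P), dtype=float)
        # Smooth curve C(n) fit to the period sequence.
        C = np.polynomial.legendre.Legendre.fit(n, P, deg)
        # Measure 3: vertical distance to the curve, as a percentage of P(n).
        j3 = 100.0 * np.mean(np.abs(P - C(n)) / P)
        # Measure 4: distance to the nearest point on the curve, found by
        # sampling the curve densely; jitter is the mean absolute distance.
        t = np.linspace(0.0, len(P) - 1.0, 50 * len(P))
        ct = C(t)
        j4 = np.mean([np.min(np.hypot(t - i, ct - p)) for i, p in zip(n, P)])
        return float(j3), float(j4)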
6 Results
This is not a report on the results of the study on aging. What is of interest here is the
consistency of a jitter measurement, that being our principal desideratum. Table 2 shows average
jitter (of the four kinds) for each talker, at each age, saying each vowel.

Table 2: Average of the four kinds of jitter, per talker, per age, per vowel
7 Conclusions
The pitch-epoch finding algorithm is adequate. It is large and computationally intensive, but
that is not a problem on today's computers. The author would be happy to share code (some
Fortran, some C) for this or any other part of the computations described above.
For our purposes, there seems little to choose among the four measures of jitter described in
this paper. Each is fairly consistent across a given condition, and they are fairly consistent with
each other. We will continue investigating all four measures, and perhaps try others, on the rest
of the collected data.
It would have been good to make direct digital recordings rather than tape recordings. Wow
and flutter are probably not affecting average jitter (the law of large numbers is in our favor),
but they surely affect the variance of the jitter. (Variance measurements on the small amount of
data in this study are not encouraging.) This makes it fruitless to try to measure jitter on small
samples, such as one phoneme in one context by one talker.
8 References
Chen, Sin-Horng and Wang, Yih-Ru (1990). Vector quantization of pitch information in
Mandarin speech. IEEE Trans. Comm., Vol. 38, No. 9.
Hirose, K., Fujisaki, H., and Seto, S. (1992). A scheme for pitch extraction of speech using
autocorrelation function with frame length proportional to the time lag. Proc. ICASSP 1992,
pp. I-149 to I-152.
House, A. S. and Stevens, K. N. (1993). Speech production: Thirty years after. Jour. Acoust.
Soc. Amer., Vol. 94, No. 3, Pt. 2 (Abstract).
Karnell, M. P., Scherer, R. S., and Fischer, L. B. (1991). Comparison of acoustic voice
perturbation measures among three independent laboratories. Journal of Speech and Hearing
Research, Vol. 34.
Titze, I. R. and Liang, H. (1993). Comparison of F0 extraction methods for high-precision
voice perturbation measurements. Journal of Speech and Hearing Research, Vol. 36.
JIA-1
Introduction:
Figure 1
Figure 2: Jitter and shimmer (n = 11)
Methods:
Figure 3: System block diagram (preamplifier, RMS/DC converter, EGG, F/V converter; intensity target area versus frequency)
In our custom-made software, data acquisition has two modes that are based on the triggering
strategy. These are a manual mode and an automatic target-matching mode. In the automatic
target-matching mode, there are two steps to the data acquisition: if the frequency and the
amplitude of the phonation stayed within the target for as long as 2 seconds (see Figure 5), then
the last steady section of phonation in the temporary file was saved as a file with a given name.
The rest of the data in the temporary file was discarded. If the frequency and the amplitude were
not steady for 2 seconds, the program reset itself every 120 seconds by rejecting all the data in
the temporary file and starting a new acquisition.

Figure 4: Phonetogram display, intensity (dB) versus frequency, showing the target area, the cursor, and the real phonetogram

One convenient feature of our system was its "backward" data acquisition. When the
investigator wanted to acquire a section of patient voice data with specific characteristics, it was
difficult or tedious to do in a conventional system. Because the investigator did not always
know ahead of time when the right combination of frequency and intensity would be achieved,
it was necessary to make the acquisition time long enough to wait for the desired sample. After
the acquisition, it was necessary to search what was sometimes a huge file for the desired sample,
another tedious task.

The backward acquisition avoids this. In this mode, the investigator specifies the section
he/she wants to acquire and starts the acquisition in the background.

Figure 5: Frequency contour of the acoustic signal with the target range, showing the retained segment (saved as a file) and the rejected segment, for forward acquisition and for backward acquisition from the trigger point
Discussion:
Reference:
Fred D. Minifie
Hideki Kasuya
Summary
The purpose of this paper is to present the results of a controlled study of the day-to-day
variabilities of three acoustic parameters (jitter, shimmer, and normalized noise energy), and two
electroglottographic (EGG) parameters (contact quotient and contact quotient perturbation) for
vowels produced at three vocal efforts (soft, normal, loud). Data were obtained using a
sophisticated bilinear interpolation pitch detection method. A repeated measures design required
subjects to produce the vowels /ae/ and /a/ five times a day over three days at each vocal effort
level. The jitter, shimmer, and normalized noise energy (NNE) values from acoustic measures and
Contact Quotient (CQ) and Contact Quotient Perturbation (CQP) values varied significantly
among the three vocal effort levels. The clinical implication of this finding is that vocal effort
variability must be taken into account if representative measures are to be obtained for clinical
use.
Key Words
Jitter, shimmer, normalized noise energy, contact quotient, contact quotient perturbation
Introduction
Scientists have long known that clinical use of acoustic and electroglottographic (EGG)
measures provides a convenient and non-invasive way to evaluate laryngeal function (Davis,
1976; Aronson, 1980; Huang and Hu 1988). The three acoustic measures that have received the
most attention in the literature as indicators of vocal function are cycle-to-cycle variations in
fundamental period (jitter), cycle-to-cycle variations in amplitude (shimmer), and normalized
noise energy (NNE) (Hirano, Matsushita, and Hiki, 1976; Kasuya, Ogawa, and
Kikuchi, 1986). There are four EGG measures that also provide useful information about
normal and pathological vocal function: contact quotient (CQ), contact quotient perturbation
(CQP), and cycle-to-cycle fundamental period and amplitude variations from the EGG (EGG-jitter
and EGG-shimmer) (Baken, 1987; Huang 1988; Huang, Minifie, & Lin 1992). The usefulness of such measures as
indicators of vocal function is dependent upon the reliability and the sensitivity of the measures
to changes in vocalizations. Previous studies have looked at intrasubject variability of vocal jitter
in voice signals from day to day (Linville, 1988; Higgins and Saxman, 1989), the relationship of
vocal jitter to voice intensity levels (Titze, Horii & Scherer, 1987), vocal jitter changes with the
aging voice (Brown, Morris, and Michael, 1989), and differences in vocal jitter from vowel to
vowel (Orlikoff and Huang 1991). Similar studies need to be done to indicate the relative stability
of each of the acoustic measures and EGG measures used to evaluate vocal function.
Accurate characterization of acoustic and EGG measures is essential not only in the
evaluation of vocal pathologies, but also in the accurate modeling of the voice source for
speech synthesis (Fant, 1980). Two of the major questions about acoustic measures and EGG
measures remain unresolved: 1) how do these measures change with changes in vocal effort level,
and 2) what is the day-to-day variability in these measures? One way to address these questions
is through a controlled study of normal speakers. An understanding of the variability of voice perturbation
of normal speakers at different vocal efforts over time is needed, therefore, before acoustic
measures and EGG measures can be used appropriately as clinical measures for voice assessments.
There are three purposes for this study: 1) to introduce a sophisticated bilinear
interpolation pitch detection method, 2) to use the new pitch period detection method to evaluate
the stability of acoustic and EGG measures of vocal function during changes in vocal effort level,
and 3) to evaluate the stability of such measures from day to day.
Similarly, the detection of glottal noise requires an accurate pitch period marker in order to match
the waveform across adjacent cycles. Various F0 extraction methods have been summarized by Hess (1983). They can be
classified in two major categories: 1) event-detection methods, such as the peak-picking and zero-
crossing methods; and 2) waveform-matching methods, such as autocorrelation, the average
magnitude difference function, cepstral analysis, and harmonic compression.
Milenkovic (1987) found that greater reliability and accuracy could be obtained by matching the
entire waveshape across adjacent cycles rather than by identifying isolated events, like zero-
crossing and peak-picking. Kasuya (1986, 1989) developed a rather accurate pitch detection
method based on a cycle-to-cycle amplitude magnitude difference function (AMDF). This pitch
detection method compares the sampled data points for an entire waveform with those from
adjacent cycles.
Since the measures of jitter, shimmer, NNE, CQ, and CQP are based on the cycle-to-cycle
similarity of the wave form, accurate determination of the pitch period is of crucial importance.
This paper presents a new method for determining pitch periods, from which jitter, shimmer,
NNE, CQ and CQP are measured. The method incorporates a bilinear interpolation procedure
into the average magnitude difference function (AMDF) to evaluate the cycle-to-cycle waveform
similarity in sustained vowel utterances. This method was selected based on experiments with
synthetic speech showing that the method performs better than several other methods taken for
comparison. The method of bilinear interpolation of sample points on the average magnitude
difference function (AMDF) is shown in Figure 1. The pitch markers (q1, q2, q3, ..., qn-1, qn or
c1, c2, c3, ..., cn-1, cn), shown at the top of the figure, are estimated using an automatic method based on
zero crossings of the vowel wave form. The method then locates the pitch boundary on the basis
of the AMDF to indicate the beginning of each pitch period. In this case, six points, shown at the
bottom of the figure, around a primary dip in the AMDF are separated into two groups; one group
includes the minimum AMDF point, while the other includes the second minimum point. Two lines
are obtained from the two groups on the basis of the least mean square criterion. The point where
the two lines cross is regarded as the real pitch boundary, the beginning of a real pitch period
(P1, P2, P3, ..., Pn-1, Pn).
Figure 1. Schematic illustration of the pitch detection method with bilinear interpolation
of the average magnitude difference function (AMDF).
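A minimal sketch of the interpolation step (Python with numpy; the rule used here for splitting the six points into a falling group and a rising group is an assumption about a detail the text leaves open):

    import numpy as np

    def bilinear_boundary(lags, amdf):
        # Fit least-squares lines to the two three-point groups around the
        # primary AMDF dip (one group containing the minimum, the other
        # the second minimum) and return the abscissa where the two lines
        # cross, taken as the real pitch boundary.
        k = int(np.argmin(amdf))
        g1 = slice(k - 2, k + 1)       # three points ending at the minimum
        g2 = slice(k + 1, k + 4)       # next three points
        m1, b1 = np.polyfit(lags[g1], amdf[g1], 1)
        m2, b2 = np.polyfit(lags[g2], amdf[g2], 1)
        return (b2 - b1) / (m1 - m2)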
PQ = 100 x [ (1/(N-k+1)) Σ_{n=1}^{N-k+1} | (1/k) Σ_{i=0}^{k-1} p(n+i) - p(n+m-1) | ] / [ (1/N) Σ_{n=1}^{N} p(n) ]    (1)
where N is the number of periods, k is the length of the moving average (an odd integer greater
than one), and m = (k+1)/2. In our system k = 5 and m = 3. If p(n) is the pitch period of the
acoustic signal, then PQ is the pitch period perturbation quotient (jitter); if p(n) is the
peak-to-peak amplitude of the acoustic signal, then PQ is the amplitude perturbation quotient
(shimmer); and if p(n) is the contact quotient sequence of the EGG signal, then PQ is the contact
quotient perturbation quotient (CQP).
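Equation (1) translates directly into code (a sketch in Python with numpy; pass pitch periods for jitter, peak-to-peak amplitudes for shimmer, or the CQ sequence for CQP):

    import numpy as np

    def perturbation_quotient(p, k=5):
        # Mean absolute deviation of each value from the k-point moving
        # average centered on it, as a percentage of the overall mean (Eq. 1).
        p = np.asarray(p, dtype=float)
        m = (k + 1) // 2
        dev = [abs(np.mean(p[n:n + k]) - p[n + m - 1])
               for n in range(len(p) - k + 1)]
        return 100.0 * np.mean(dev) / np.mean(p)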
These jitter and shimmer values are measured from the pitch period and peak-to-peak
amplitudes, respectively. After computation of the perturbation measures, the pitch period of
each glottal vibration included in the count was displayed. More than 50 cycles were used for
each perturbation analysis as supported by Titze (1987). Only segments that had pitch period
fluctuations within 10% in either a positive or negative direction of the mean pitch period were
analyzed. This criterion was used so that only very steady wave form segments would be
analyzed for all subjects, thus minimizing variability due to selection of cycles for analysis. If no
segment consisting of at least 50 cycles could be found to fit this criterion, no perturbation values
were computed for that vowel production. Only a few of the normal vowel prolongations were
rejected by following this criterion. This criterion is more problematic in analyses of pathological
voices because many abnormal voices have relatively few stable segments with pitch period
fluctuations within 10% of the mean pitch period. Such extremely variant voices cannot be
analyzed. Accuracy of the new pitch detection method will be discussed with synthesized steady
vowels later.
With respect to the EGG-jitter and EGG-shimmer, Haji (1986) found that they were nearly
equivalent to the jitter and shimmer obtained from the acoustic signal, so the EGG-jitter and
EGG-shimmer data are not reported in this paper. The CQ measure from the EGG signal
provides unique information about vocal fold behavior that is, for the most part, invisible to other
available techniques (Baken 1987; Orlikoff and Baken 1990). The CQP measure provides
precise information about the rate, symmetry and regularity of the vocal fold contact phase during
vocal fold vibration (Huang, Minifie and Lin 1992, 1993). It is for these reasons that the CQ,
CQP measures were chosen in our study. Rothenberg (1988) has suggested the use of variable
baseline crossings with interpolation of a criterion level to demarcate the EGG contact phase. In
the present study, a baseline at 25% of the peak-to-peak EGG amplitude of each wave, associated
with the EGG minimal contact phase, is selected for measuring the CQ and CQP.
The method of noise energy measurement used in this experiment provides more insight
into perturbation measurement. The relative magnitude of noise included in the voice signal is
evaluated using an acoustic measurement NNE (normalized noise energy) described by Kasuya
(1986). We have chosen to use the normalized noise energy measure because it can differentiate
among normal and pathological voices more sensitively than does the harmonic-to-noise ratio
(Kasuya 1993; Hirano 1989). An adaptive comb filtering method is used in NNE for estimating
vocal noise in normal and pathological voices. (This procedure was initially investigated for the
enhancement of degraded speech due to additive white noise.) The NNE (dB) is given by the
equation:

NNE = 10 log10 [ Σ_n w(n)^2 / Σ_n x(n)^2 ] + BL    (2)

where w(n) and x(n) are respectively the estimated vocal turbulent noise component and the
original voice waveform, and BL is a constant for compensating for the amount of noise energy.
B. Test Signals
The accuracy of the pitch detection method was tested using periodic synthesized signals of the form

y(n) = Σ_{k=1}^{M} A(k) sin(2πkn/T + Φ(k))    (3)
where A(k) is the amplitude of the k-th harmonic component, which simulates a vowel /ae/ as in "bat",
Φ(k) is the phase of the k-th harmonic component, M is the number of harmonics, and T is the
normalized pitch period (points). In our system, Φ(k) = 0 and M = 23. The T is defined by the
following equation:
T = Fs x P    (4)
where Fs is the sampling frequency and P is the pitch period. For simulating a child voice, T is
allowed to vary from 133 to 134 points with a step of 0.2, which corresponds to a change from
3.325 to 3.35 ms. For simulating a female voice, T is allowed to vary from 174 to 175 points,
which corresponds to a change from 4.35 to 4.375 ms. Similarly, for simulating a male voice, T is
allowed to vary from 333 to 334 points, corresponding to a change from 8.325 to 8.35 ms.
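The test signals are easy to reproduce (a sketch in Python with numpy; the flat harmonic amplitudes stand in for the /ae/ spectrum, which is not tabulated here):

    import numpy as np

    def harmonic_signal(T, M=23, n_samples=8000, phase=0.0):
        # y(n) = sum over k of A(k) * sin(2*pi*k*n/T + phase), Eq. (3);
        # T is the pitch period in samples and need not be an integer.
        n = np.arange(n_samples)
        y = np.zeros(n_samples)
        for k in range(1, M + 1):
            y += np.sin(2.0 * np.pi * k * n / T + phase)
        return y

    # Simulated male voice: T near 333 points, i.e. P = T / Fs = 8.325 ms
    # at the implied Fs = 40 kHz (Eq. 4).
    y = harmonic_signal(333.4)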
Results showing the accuracy of the pitch detection method, with and without
interpolation, are provided in Table 1. Here, two interpolation methods on the AMDF were
employed in order to determine which method provides the more precise pitch period extraction.
The two methods are: parabolic interpolation, and interpolation with bilinear approximation. The
measures obtained using these interpolation methods were compared to measures derived when
no interpolation was used.
The results obtained from these test signals, which include constant pitch periods at both
integer and non-integer multiples of the sampling interval, allow us to draw the following conclusions about
pitch period detection. First, the standard deviations of data obtained via the parabolic and
bilinear interpolation methods were always smaller than the standard deviation when no
interpolation was used. The second observation is that the bias of the interpolation methods was
generally smaller than the bias obtained with the "no interpolation" method. Third, the bias of
bilinear interpolation method was always smaller than the bias obtained with the parabolic
method. Thus, it appears clear from Table 1 that the bilinear method is superior to the parabolic
method. Also, Table 1 indicates that both of the interpolation methods are better than the no
interpolation method.
To assess the influence of additive noise on the measurement, white-noise signals were scaled
appropriately and then added point-for-point to the above periodic synthesized signals y(n). As
the signal-to-noise ratio (SNR) decreased, reflected by increasing the amount of white noise
added point-for-point to the synthesized signals, jitter and the normalized RMS error of the pitch
period clearly increase, as shown in Figure 2. Also, it is clear that variations in jitter imposed by
noise are relatively small at high SNRs.

Figure 2. Jitter, normalized RMS and bias errors as a function of signal-to-noise ratio
(SNR) of synthesized signals with white noise.
The next investigation was to examine the influence of three vocal efforts on the measures
of jitter, shimmer, NNE, CQ, and CQP over time. Three male subjects pronounced the sustained
vowels /ae/ and /a/ at three vocal efforts (soft, normal and loud) five times a day on three different
days. Jitter, shimmer, NNE, CQ and CQP were measured from each vowel sample.
While these acoustic and EGG measures may be easy to obtain, useful in analysis of the
voice disorders and helpful in measuring progress during therapy, it is important to understand
how these measures change during different vocalization conditions. Hence, our interest is how
these measures change during voice production at different vocal effort levels.
A. Subjects
Subjects were three normal male adults with no history of voice disorders or present
complaint of voice disorders. All subjects were in good health on each day of testing.
B. Stimuli
We manipulated vocal effort level by having subjects produce the vowels /ae/ and /a/ at
"soft", "normal", and "loud" vocal levels. Using a repeated measures design, each of three adult
male normal talkers produced five replications of each vowel, at each vocal effort level, on each
of three different days. These utterances were produced under two different conditions: 1)
spontaneous vowel productions, and 2) imitative vowel production.
Spontaneous vowel production: Each subject was first directed to sustain the vowel /ae/ as in
"bat" five times with each utterance lasting for more than 3 seconds at normal effort. Then, the
subject was asked to repeat the sustained vowel five times again with each production lasting
more than 3 seconds, at a soft vocal effort. Similarly, five replications of vowel were obtained at
loud vocal effort. The same procedure was used to obtain tokens of the vowel /a/ at each of the
three vocal effort levels.
Imitative vowel production: Each subject was required to sustain each vowel (/ae/ and /a/) in
imitation of synthetically generated vowel tokens at each of the intensity levels: loud=75 dB SPL,
normal=72 dB SPL, and soft=67 dB SPL.
Each subject for this experiment was seated in a sound-proof room (IAC 1200) and
comfortably positioned in a head rest so that a condenser microphone (SONY ECM22-P)
remained at a fixed position relative to the mouth. An electroglottograph was placed over the
thyroid lamina to obtain electroglottographic signals. During the recording of both acoustic and
EGG signals into a computer, no attempt was made to control fundamental frequency.
Each vocal token produced by the subjects in this experiment was digitized at a sampling
frequency of 22050 Hz per channel with an accuracy of 16 bits/sample, and analyzed by using the
software: Voice Evaluation and Therapy (VET 2.00) from Tiger Electronics (Huang, Minifie &
Lin 1992). Only the middle portions of the vowel at each vocal effort were used for analysis.
C. Results
The results of this experiment can be seen in the following series of figures. Figure 3
shows the results for the vowel /ae/ produced in a natural, spontaneous manner. Each of the bar
graphs shows the means and standard deviations of the data obtained for each vocal effort level
condition: soft, normal, and loud. For example, it can be seen in Figure 3(a) that jitter decreases
with increasing vocal effort level. Similarly, shimmer reduces with increasing vocal effort level
(Figure 3(b)). Please note that we have used the acronym NNE (Kasuya 1986) to represent
normalized noise energy (or what we have referred to above as glottal noise energy). Obviously
this graph has to be interpreted in light of the fact that noise energy is measured in relation to the
amplitude of the harmonic energy in the vowel. Therefore, the minus values indicate how many
decibels below the signal energy the noise energy lies (e.g., a smaller minus value
indicates a larger amount of noise than does a larger minus value). Figure 3(c) shows that as vocal
effort level is increased, the relative amount of noise in the vocalization decreases.
If we look at the Contact Quotient graph (Figure 3(d)), it can be observed that at loud
levels of phonation, the vocal folds are closed for a considerably longer percentage of each vocal
Figure 3. Jitter, Shimmer, NNE, CQ, and CQP from a sustained vowel /ae/ as a function of three
vocal efforts in the spontaneous vowel production.
Figure 4. Jitter, Shimmer, NNE, CQ, and CQP from a sustained vowel /a/ as a function of three
vocal efforts in the spontaneous vowel production.
Figure 5. Jitter, Shimmer, NNE, CQ, and CQP from a sustained vowel /ae/ as a function of three
vocal efforts in the imitative vowel production.
Figure 6. Jitter, Shimmer, NNE, CQ, and CQP from a sustained vowel /a/ as a function of three
vocal efforts in the imitative vowel production.
cycle than during the normal and soft levels of phonation. And in the contact quotient
perturbation graph (Figure 3(e)) we see that the percentage of contact quotient perturbation
is lowest at the loud vocal effort level.
Figure 4(a)-(e) shows the data obtained from spontaneous productions of the vowel /a/.
While the values of the various measures may differ slightly from those obtained for the vowel
/ae/, the patterns of change across vocal effort levels are similar.
Figure 5 shows the vowel /ae/ produced at different vocal effort level conditions, in
imitative response to target acoustic models produced by voice synthesis to reflect vocal effort
levels. The target vowels for the loud, normal, and soft conditions were synthesized at 75, 72,
and 68 dB SPL, respectively. Similar patterns of change are observed in these imitative
productions with changes in the vocal effort level. Figure 6 shows similar results for the /a/ vowel produced at
different vocal effort levels in imitative response to the acoustic targets produced by vowel
synthesis.
Shown in Table 2 are the results of numerous analyses of variance applied to the measures
of jitter, shimmer, normalized noise energy, contact quotient, and contact quotient perturbation.
Perhaps the most important finding from this study is related to changes occurring from changes
in vocal effort level. This table shows that in all cases, changes in vocal effort level caused
significant changes in the three acoustic measures and in both EGG measures.
Discussion
In this paper, we have discussed the development of a computer program for the
measurement of jitter, shimmer, normalized noise energy, contact quotient, and contact quotient
perturbation (jitter, shimmer, NNE, CQ, and CQP), based on a newly developed pitch detection
method. Both parabolic and bilinear interpolation methods on the AMDF provide an obvious
advantage for the estimation of pitch period when compared to peak-picking and zero-crossing
procedures. If a relatively low sampling rate is used, such as 11025 Hz, interpolation will provide
an even greater advantage over the "no interpolation" procedure.
Our primary interest was to investigate the influence of vocal effort on vocal perturbation,
glottal noise, CQ, and CQP measurements, and to study the day-to-day variability of each
measure. During our "every other day" sampling procedure for obtaining the jitter, shimmer,
NNE, CQ, and CQP values associated with the three vocal efforts, we observed that, in most cases,
the loud vocal effort produced the lowest values. These results suggest that it is very important
to control vocal effort when analyzing vocalizations; we assume that the same conclusion would
apply to vocalizations produced by normal and pathological subjects. On the other hand, it
appears reasonable to analyze only very steady utterances from subjects in order to get a better
approximation of a speaker's typical perturbation value. This criterion may make it impossible to
measure the vowel productions of some pathological speakers. As Titze (1987) has suggested, it
appears best to use a voice sample at least 20-30 cycles in duration when measuring jitter and
shimmer in normal speakers. Whether or not this is the case with some, or all, pathologic
speakers is uncertain. What is clear is that a longer sample duration is needed in order to obtain a
more stable estimate of perturbation measures. Certainly, longer vowel duration is desirable, but
at a cost of increased processing time. The results of the present study suggest that more vowel
repetitions are needed to determine a speaker's typical production of a given vowel. The first
vowel produced during a given recording session usually yields the highest amount of variability.
Whether tokens are recorded on a high-quality digital audio tape recorder or digitized directly
into a computer prior to analysis also appears to have a noticeable effect on the measures obtained.
The take-home message from this experiment is that if these acoustic and EGG measures
are to be taken in the clinic, and used to compare the patient's performance from one point in time
to another, it is important to have the vocalizations produced at the same vocal effort level.
Secondly, Table 2 shows that in most cases there was variability in these measures from day to
day. Thus, it may be important to obtain recordings from several days in order to obtain a good
indication of "average" subject performance. Finally, it should be pointed out that this experiment
was designed to investigate how these measures varied during vocalizations produced by normal
talkers, under the prescribed conditions It would be of considerable clinical importance to
determine whether patients with voice disorders produce similar changes. Further investigations
with both normal and pathologic speakers should begin to provide an answer.
Acknowledgments
We would like to thank Dr. Y. Kikuchi at Utsunomiya University and Dr. Robert Orlikoff
at Memphis State University for their suggestions regarding this research project.
References
Publication.
2. Boone, D. R. and McFarlane, S. (1988). The Voice and Voice Therapy. 4th Edition,
Prentice Hall.
3. Brown, Jr. W.S, Morris, R.J. and Michael, J. F. (1989). Vocal jitter in young adult and
Press.
speech, SCRL Monograph 13, Speech Communication Research Laboratory, Inc., Santa
Barbara.
7. Haji, T., Horiguchi, S., Baer, T., and Gould, W. J. (1986). Frequency and amplitude
8. Hirano, M., Matsushita, H., Hiki, S. (1976). Acoustic analysis for voice disorders: a basic
conception for the use of acoustic measurements for the diagnosis in voice disorders.
10. Hirano M., (1989). "Objective evaluation of the human Voice: Clinical Aspects", Folia.
11. Huang, Z. and Hu, N. (1988). Research for Laryngeal Cancer Evaluation and Diagnosis",
12. Huang, Z. (1988). A Review of Speech Analysis and Synthesis System to Evaluate
13. Huang, Z., Minifie, F. and Lin, X. (1992). Objective Evaluation of Pathological Voices:
A Preliminary Clinical Decision Program. Paper presented at ASHA, San Antonio, Texas,
Nov. 1992.
14. Huang, Z., Minifie, F. and Lin, X. (1992). An Integrated Clinical Program for Voice
Evaluation and Therapy. Paper presented at ASHA, San Antonio, Texas, Nov. 1992.
15. Huang, Z., Minifie, F. and Lin, X. (1993). Measures of Vocal Function During Changes
in Vocal Effort Level. Paper presented at ASHA, Anaheim, California, Nov. 1993.
16. Kasuya, H., Ogawa, S. and Kikuchi, Y. (1986) An Acoustic Analysis of Pathological
17. Kasuya, H., Zue, W., and Endo, Y. (1993). Measurements of Laryngeal Turbulent Noise
18. Higgins, M. B., and Saxman, J. H. (1989). A comparison of intrasubject variation across
19. Minifie, F., Hixon, T. J. and Williams, F. (1973). Normal Aspects of Speech, Hearing,
20. Orlikoff, R. F. and Baken, R. J. (1990). Consideration of the relationship between the
21. Orlikoff, R. F. and Huang, Z. (1991). Influence of Vowel Production on Acoustic and
Nov. 1991.
22. Rothenberg, M., and Mahshie, J.J. Monitoring vocal fold abduction through vocal
24. Titze, I., Horii, Y., and Scherer, R. (1987). Some technical considerations in voice
25. Zhu, Minghui and Huang, Z. (1990) Glottal Source, Speaking and Singing Voice and
Portions of this work were supported by grants to Dr. Jamieson from the Natural Sciences and
Engineering Research Council of Canada and the Ontario Ministry of Health.
KHE-2
Abstract
This paper compares four spectral analysis techniques which are now available in commercial
software packages: the short-time Fourier transform (STFT); two autoregressive
methods, the autocorrelation and modified covariance methods; and a generalized time-
frequency representation based on the Wigner distribution which uses a cone-shaped kernel
(TFRCK). With the TFRCK, good time and frequency resolution can be obtained simultaneously.
This desirable feature is not possible with any of the other three methods. In
addition, when white Gaussian noise is added to the signal, it is shown that this method is
able to provide an unbiased estimate of the signal without noise.
1 Introduction
Over the past decade, there have been significant advances in the methods available for the
analysis of speech and other complex, time-varying signals. Several of these methods are now
widely available to researchers through inexpensive software packages [4]; [9]; [6]. The choice
of the analysis method has implications for the results obtained, but these implications are
not obvious to researchers and clinicians who wish to analyze the characteristics of speech
and other time-varying complex signals.
One of the most widely used approaches is the spectrogram, evaluated using a short-time
Fourier transform. This technique has been found suitable for some applications, but it may
not accurately characterize the signal under analysis. In particular, high resolution in both
time and frequency cannot be achieved simultaneously.
In an effort to improve the analysis flexibility and to provide more accurate time-varying
spectral estimation of speech signals, alternative techniques based on models of the speech
production system have been developed. One example is autoregressive modeling of speech,
where the signal is represented by an all-pole filter [5]. This quasi-stationary approach gives
better results than the STFT for speech signals but is still inadequate because the signals
analyzed (e.g., speech) are often nonstationary.
As a consequence of these limitations, there has been increased interest in the use of
KHE-3
spectral analysis methods which provide greater time-frequency resolution than conventional
spectral methods. A generalized time-frequency representation where the kernel has the form
of a cone has been developed in an attempt to improve the results of the previous methods
[12]. In this nonstationary approach, time and frequency resolution are independent.
In this paper, the accuracy and usefulness of four spectral estimation methods are com
pared for synthetic signals and natural speech. The algorithms considered are the STFT,
the autocorrelation and modified covariance autoregressive methods, and the TFRCK.

2 Generalized Time-Frequency Representations

The class of generalized time-frequency representations of a signal x(t) can be written as

C(t, f) = ∫∫ φ(t - u, τ) x(u + τ/2) x*(u - τ/2) e^{-j2πfτ} du dτ    (1)

or, equivalently, in terms of the spectrum,

C(t, f) = ∫∫ Φ(ξ, f - ν) X(ν + ξ/2) X*(ν - ξ/2) e^{j2πξt} dξ dν    (2)

where X(f) is the Fourier transform of x(t) and Φ(ξ, f) is the two-dimensional Fourier
transform of φ(t, τ). Even though this representation is not always positive, it is considered
as a time-frequency representation for specific choices of the kernel. Each of these represen
tations is characterized by a special form of the kernel function, so that the properties of a
distribution are related to the properties of its kernel. In this paper, two distributions will
be analyzed - the short-time Fourier transform and a representation where the support of
the kernel has the form of a cone.
The short-time Fourier transform belongs to the class of generalized time-frequency
representations [3]. The interpretation from the generalized time-frequency point of view
allows us to find ways to try to overcome its limitations. The spectrogram is defined as

S(t, f) = | ∫ x(u) h(u - t) e^{-j2πfu} du |^2    (3)

where h(t) is a real and symmetric window. The kernel of the STFT takes the following
form:

φ(t, τ) = h(t + τ/2) h(t - τ/2)    (4)
The supports of the kernel in the t and τ directions are both proportional to the duration
of the window h(t). Eq. (4) thus shows that smoothing is introduced by the kernel in both time and frequency. The kernel
of the STFT (Eq. (4)) depends on t. When it is convolved with the signal correlation (Eq.
(1)), time smoothing appears. To increase time resolution, the support of the kernel in the
t direction has to be decreased. To realize that, the window duration has to be reduced,
which will reduce the support of the kernel in the τ direction. Since the Fourier transform
is performed in the τ direction, frequency resolution will decrease. Thus, increasing time
resolution will decrease frequency resolution. Using a similar reasoning, it could be shown
that time resolution will be reduced if frequency resolution is increased. Ideally, to overcome
this tradeoff, kernels should be designed to be separable functions in t and r, so that altering
the support of the kernel in one dimension will not affect the support in the other dimension.
Although the STFT has many drawbacks, it continues to be used widely. Among the
class of the generalized time-frequency representations, only the STFT always gives positive
values for the power spectral density. Thus, it is able to represent the energy distribution
of the signal being analyzed. In addition, the STFT is computationally efficient so that
real-time implementations are possible.
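The tradeoff is easy to demonstrate (a sketch in Python with scipy; the test signal and window lengths are arbitrary choices):

    import numpy as np
    from scipy.signal import stft

    fs = 10000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * (500 * t + 1000 * t ** 2))  # 500 -> 2500 Hz chirp
    # Short window: sharp in time, smeared in frequency.
    f1, t1, S_short = stft(x, fs, nperseg=64)
    # Long window: sharp in frequency, smeared in time.
    f2, t2, S_long = stft(x, fs, nperseg=1024)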
The generalized time-frequency framework makes clear the reasons behind the limitations of the
spectrogram and how to overcome these limitations. Eqs. (1) and (2) show that different
representations can be generated by specifying different kernels.
To obtain good time and good frequency resolution simultaneously, an obvious choice
for the kernel is one which has the narrowest possible width in time and in frequency, so
that when it is convolved with the signal correlation (Eq. (1)) and the spectrum correlation
(Eq. (2)), it does not introduce any smoothing. The function that has the narrowest width
is the Dirac delta function. With φ(t, τ) = δ(t), the two-dimensional Fourier transform of
this kernel is Φ(ξ, f) = δ(f). This kernel gives the sharpest possible time and frequency
resolution, and the resulting representation is the Wigner distribution. When single-component
signals are analyzed with the Wigner distribution, the results are satisfying, and frequency
changes are tracked correctly without any smearing. However, with multicomponent signals,
(such as speech), the Wigner distribution introduces interfering cross terms, artifacts which
make the interpretation of the results very difficult. The presence of cross terms is due to
the nonlinear nature of the Wigner distribution. To be able to use the Wigner distribution
to analyze speech, further processing is therefore essential.
In an effort to reduce the interference terms of the Wigner distribution while simultaneously
trying to preserve its desirable properties, sophisticated smoothing functions have been developed.
These functions are able to reduce the undesirable effects of the interference terms. A generalized
time-frequency representation whose kernel has the support of a cone, the cone kernel,
was proposed for nonstationary signals [12]. This distribution overcomes the limitations
of the spectrogram and the artifact problem of the Wigner distribution and has been shown
to provide high resolution in both time and frequency while simultaneously attenuating the
interference terms [8]. The support of the kernel used has the form of a cone in the t-r
plane. Mathematically, the cone kernel is defined as [12]

φ(t, τ) = g(τ)   if |t| <= a|τ|,
          0      otherwise,    (5)

where a is a parameter used to specify the slopes of the cone and g(τ) is a tapering window.
The width of the kernel along the r-axis specifies the frequency resolution which is
inversely proportional to the length of the window T. The width of the kernel along the
t-axis is independent of T which makes the time and frequency resolution independent.
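As a concrete illustration of this geometry, the sketch below constructs a discrete cone-shaped kernel on a (t, τ) grid, using the support condition as reconstructed above; the grid size, the slope parameter a, and the Hamming taper for g(τ) are illustrative assumptions, not values from [12]:

    import numpy as np

    def cone_kernel(n=129, a=1.0):
        """Discrete cone kernel: phi(t, tau) = g(tau) where |t| <= a|tau|."""
        lags = np.arange(n) - n // 2
        T, TAU = np.meshgrid(lags, lags, indexing='ij')  # t rows, tau columns
        g = np.hamming(n)                                # tapering window g(tau)
        return np.where(np.abs(T) <= a * np.abs(TAU), g[np.newaxis, :], 0.0)

    phi = cone_kernel()
    # The row through t = 0 has the full tau support (frequency resolution is
    # set by the window length), while the support in t shrinks toward tau = 0.
    print(phi.shape)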
Some specific properties of the time-frequency representation using a cone kernel are
examined in [10]. In particular, to satisfy the finite time support property, a time-frequency
representation should be zero whenever the signal is zero. This property is violated when
smoothing is applied in an effort to attenuate the interference terms. An example is the
spectrogram, where, due to the smearing in time, the values are not always zero when the
signal is zero. In contrast,
this property is satisfied with the representation using a cone kernel. In addition, if the
signal contains white noise, the representation using a cone kernel is able to produce an
unbiased estimate of the same representation of the signal without noise [10]. In other
time-frequency representations, the power spectral density of the noise is usually added to
the time-frequency representation of the signal. Nonnegativity is not preserved in the cone
kernel representation or in the Wigner distribution. Therefore, these representations cannot
be considered as energy distributions but serve as high resolution analyses of signals in time
and frequency. Thus, the representation with a cone kernel is suitable for analyzing speech
signals where there is a need to resolve two closely-spaced formants, or to track a
rapidly-changing spectral peak. Simultaneously, good estimates of the time of occurrence of events
are possible with this technique.
In an attempt to improve the resolution and spectral fidelity of the FFT, particularly for
short data segments, several alternative spectral estimation techniques have been developed.
These techniques represent the data to be analyzed by a model and are termed parametric
methods.
Many problems associated with the FFT are attributed to the assumptions made about
data falling outside the measurement interval. The finite data sequence may be viewed as a
sequence of infinite length multiplied by a finite length window. The use of only these data
implicitly assumes the unmeasured data are zero, which is usually not the case.
Alternative spectral estimation procedures are designed to alleviate the inherent
limitations of the FFT approach. Rather than assuming that the data outside the window are
zero, a more reasonable assumption is made. The process which generated the data can be
modeled, and the model is then used to improve the estimate of the data falling outside the
window.
Many approaches exist to determine the model parameters. Two particular cases are
considered here, the autocorrelation method and the modified covariance method. Details
of these methods may be found in [5].

As with the STFT, the autoregressive techniques use a quasi-stationary approach to analyze
nonstationary signals such as speech [11]. As a consequence, the exact time of occurrence of
events cannot be determined precisely.
A serious problem with the autoregressive spectral estimation technique is its sensitivity
to additive noise in the signal [5]. For the case of two sinusoids in noise, the resolution
decreases as the signal-to-noise ratio decreases [5]. In addition, the spectral peaks are
broadened and displaced from their true positions. This noise sensitivity occurs because the
autoregressive technique uses an all-pole model to represent the signal, and this model is
not appropriate for a signal containing additive noise.
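For reference, a minimal textbook implementation of the autocorrelation (Yule-Walker) method is sketched below; this is a generic illustration, not the CSRE implementation, and the model order is an arbitrary choice:

    import numpy as np

    def ar_spectrum(x, order=12, nfft=512, fs=10000.0):
        """Autocorrelation (Yule-Walker) AR spectral estimate."""
        x = x - np.mean(x)
        r = np.correlate(x, x, mode='full')[len(x) - 1:] / len(x)
        # Yule-Walker normal equations: R a = r[1..p].
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R, r[1:order + 1])       # AR coefficients
        sigma2 = r[0] - np.dot(a, r[1:order + 1])    # prediction error power
        # Evaluate sigma^2 / |A(e^jw)|^2 on a frequency grid.
        w = 2 * np.pi * np.arange(nfft // 2) / nfft
        A = 1 - sum(a[k] * np.exp(-1j * w * (k + 1)) for k in range(order))
        return w * fs / (2 * np.pi), sigma2 / np.abs(A) ** 2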
A synthetic signal in clean and noisy conditions and a speech signal are used to evaluate
the performance of the four algorithms discussed above, as implemented in the CSRE
speech analysis system [4]. The speech token was chosen to have characteristics which are
traditionally hard to identify, including closely-spaced formants, rapid formant transitions,
and brief components such as consonant bursts.
In each figure, the top window shows a time-frequency representation of the signal and
the middle window displays the time-domain waveform. The bottom-right window shows
the spectral slice at the time step where the marker in the top window is positioned and the
bottom-left window displays the data used to calculate this spectral slice.
A test signal consisting of three sinusoids was synthesized at a sampling frequency of 10 kHz. The first 500 ms comprised a
1 kHz tone added to a 1.25 kHz tone; the next 500 ms comprised a 3 kHz tone only (i.e., the
3 kHz tone did not overlap in time with the other two tones). For the noisy signal, white
Gaussian noise was added at an SNR of 0 dB.
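The test signal is straightforward to reproduce; the sketch below synthesizes it as described (unit amplitudes and the Gaussian noise generator are assumptions):

    import numpy as np

    fs = 10000                        # Hz
    t1 = np.arange(0.0, 0.5, 1 / fs)  # first 500 ms
    t2 = np.arange(0.5, 1.0, 1 / fs)  # next 500 ms

    seg1 = np.sin(2 * np.pi * 1000 * t1) + np.sin(2 * np.pi * 1250 * t1)
    seg2 = np.sin(2 * np.pi * 3000 * t2)
    clean = np.concatenate([seg1, seg2])

    # White Gaussian noise at 0 dB SNR: noise power equals signal power.
    noise = np.random.randn(len(clean))
    noise *= np.sqrt(np.mean(clean ** 2) / np.mean(noise ** 2))
    noisy = clean + noise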
The synthesized signal was analyzed with each of the different algorithms. Fig. 1 displays
a narrowband spectrogram and Fig. 2 shows a wide band spectrogram of the same signal.
The spectrogram was evaluated with a STFT. Fig. 1 shows that the 1 kHz tone and the
1.25 kHz tone are resolved in frequency but that they appear to overlap in time with the 3
kHz tone. Time resolution was better in Fig. 2 than in Fig. 1, but the separate tracks of the
1 kHz and 1.25 kHz tones are obscured (i.e., frequency resolution is much worse than in Fig.
1). Thus, the STFT may hide some characteristics of the signal (e.g., the two low-frequency
tones cannot be distinguished in the wideband spectrogram) or produce others which do not
truly exist (the narrowband spectrogram gives the illusion that the 3 tones overlap in time).
Analysis with the autoregressive technique using the autocorrelation method generates
results similar to those for the STFT. A short window gives good time resolution at the
expense of frequency resolution, while the opposite happens with a long window duration.
No results were obtained for this signal with the autoregressive technique using the modified
covariance algorithm, because this algorithm suffered from ill-conditioning with this signal.
A generalized time frequency distribution with a cone-shaped kernel was used to analyze
the same signal, with the results displayed in Figure 3. As can be seen, this method provided
good time resolution with no loss of good frequency resolution. The track of the 1 kHz tone is
clearly distinguishable from that of the 1.25 kHz tone. There is no overlap in time between the
low frequency tones and the 3 kHz tone. Interfering cross terms appear for a short duration
during the transition, but they are attenuated by about 30 dB relative to the power of the
signal. With multicomponent signals, the distribution using a cone kernel appears to be able
to provide the excellent time-frequency resolution of the Wigner distribution, without the
interfering cross terms in time or in frequency. Among the four techniques considered, this
approach yields the most lucid representations of the signal. Timing information as well as
the spectral content of the signal can be estimated with almost no error.
White noise at 0 dB SNR was added to the signal. The results of the autoregressive
techniques were the worst: the noise heavily distorted the signal, and it was not possible to
distinguish the signal components. The results of the autoregressive techniques are therefore
not shown for this case; instead, the comparison is made between the best representation among
the three conventional methods and the TFRCK. The results of the analysis with the STFT
are shown in Fig. 4. The display is dominated by the power spectral density of the noise
in the composite signal. The spectral slice centered at time 592.1 ms and displayed in the
bottom-right window shows the high level of the noise. Clearly, it is difficult to locate the
tone in this display.
The results obtained with the TFRCK are shown in Fig. 5. As can be seen, there is no
noise around the tracks of the tones and the power spectral density of the noise was not added
to the spectrum of the signal as occurred in the case reviewed above. This result confirms
that the TFRCK can provide an unbiased estimate of the signal even in the presence of noise
[10].
With speech signals, the results with the STFT were the worst. The results of the two
autoregressive methods were very similar for the analysis window sizes used; therefore, to
be concise, the comparison was carried out between the autocorrelation method and the TFRCK.
The analysis parameters in each case were chosen to provide the best representation of the
signal for a given method. Fig. 6 and Fig. 7 show the results of analyzing the bisyllabic
utterance /agil/ with the autocorrelation method using a short window and a long window, respectively.
In Fig. 6, the frequency resolution is poor: the 3rd and 4th formants of the /il/ portion
are not distinguishable. In contrast, frequency resolution is improved in Fig. 7, but time
resolution is reduced: the /g/ appears wider in Fig. 7 than in Fig. 6. In addition,
in Fig. 7, the beginning of the formant transitions from the /g/ to the /il/ is obscured and,
compared to Fig. 6, the onset of the formant transitions is clearly shifted in time.
The results with the TFRCK algorithm are shown in Fig. 8. Frequency resolution is good
in this case and the tracks of the 3rd and 4th formants in the /il/ portion can be seen clearly.
Simultaneously, good time resolution is preserved: the width of the /g/ and the beginning of
the formant transitions are defined as well as, if not better than, in the previous analysis when
a small window was used. For more detailed analysis, the "ZOOM IN" feature of CSRE was
used to examine the portion of the displays around the /g/ release segment. This is shown
in Figs. 9, 10 and 11 for the autocorrelation method with short and long analysis intervals
and for the TFRCK respectively. In Fig. 9, the 2nd, 3rd and 4th formants are difficult
to discern. While in Fig. 10 the formant tracks are distinguishable, the release is clearly
longer in duration than in the previous case. The beginning of the release in this display
is also displaced relative to the waveform in the middle window. Because of the poor time
resolution in this case, an error of about 10 ms could be made in locating burst onset from
the time-frequency display of Fig. 10.
Fig. 11 shows that the TFRCK both separates the formants and simultaneously correctly
estimates the duration of the release relative to the time-domain waveform. For this natural
speech example, the TFRCK is therefore able to provide good time and frequency resolution
simultaneously which was not possible with the other techniques considered.
5 Conclusion
Among the spectral analysis techniques considered, the TFRCK approach was found to
provide superior time-frequency resolution and appears to be better suited for the analysis
of nonstationary signals. Good time resolution was obtained without degrading frequency
resolution. In contrast to other spectral analysis techniques, when the signal was degraded
by noise, this approach continued to provide an unbiased estimate of the signal without noise.
The time-frequency representation with a cone kernel thus reveals important characteristics
of nonstationary signals such as speech.
Figure Captions
Fig. 3: Results of analyzing the signal described in Fig. 1 using the TFRCK method.
Fig. 4: Spectrogram of the signal described in Figure 1 in a background of Gaussian noise
with SNR=0 dB.
Fig. 5: Results of processing the signal of Figure 4 with the TFRCK method.
Fig. 6: Spectral analysis of the utterance /agil/, using the autocorrelation method of
autoregressive modeling. Results are shown for a short analysis window.
Fig. 7: Spectral analysis of the utterance /agil/, using the autocorrelation method of
autoregressive modeling. Results are shown for a long analysis window.
Fig. 8: Results of analyzing the signal used in Fig. 6 using the TFRCK method.
Fig. 9: Magnification of the display in Fig. 6 showing the segment around the release burst
in more detail.
Fig. 10: Magnification of the display in Fig. 7 showing the segment around the release
burst in more detail.
Fig. 11: Magnification of the display in Fig. 8 showing the segment around the release
burst in more detail.
References
[2] Cohen, L. 1989. "Time-Frequency distributions - A Review," Proc. IEEE, 77, no. 7,
941-981.
[4] Jamieson, D. G., Ramji, K., Kheirallah, I. and Nearey T. M. 1992. "CSRE: A Speech
Research Environment," In Ohala, J. J., Nearey, T. M., Derwing, B. L., Hodge, M.M.,
and Wiebe, G. E. (Editors) Proc. of the Second Int. Conf. on Spoken Language Processing,
1127-1130.
[5] Kay, S. M. and Marple, S. L. Jr. 1981. "Spectrum Analysis - A Modern Perspective,"
Proc. IEEE, 69, no. 11, 1380-1419.
[6] Kay Elemetrics. Computerized Speech Lab. [speech analysis system]. Pine Brook, NJ.
[7] Lim J. S. and Oppenheim A. V. 1988. Advanced Topics in Signal Processing (Prentice
Hall, Englewood Cliffs, New Jersey), pp. 14-39.
[8] Loughlin, P. J., Pitton, J. W., and Atlas, L. E. 1991. "New Properties to Alleviate
Interference in Time-Frequency Representations," IEEE Int. Conf. Acoust., Speech,
Signal Processing, 3205-3208.
[10] Oh, S. and Marks, R. J. 1992. "Some Properties of the Generalized Time-Frequency
Representation with Cone-Shaped Kernel," IEEE Trans. Signal Processing, 40, no. 7,
1735-1745.
[12] Zhao, Y., Atlas, L. E., and Marks, R. J. 1990. "The Use of Cone-Shaped Kernels for
Generalized Time-Frequency Representations of Nonstationary Signals," IEEE Trans.
Acoust., Speech, Signal Processing, 38, no. 7, 1084-1091.
WON-1
INTRODUCTION
Measurements of jitter and shimmer on voice signals are obtained for the purpose of
discerning perturbations in the oscillatory behaviour of the vocal folds. Jitter is defined as the
average cycle-to-cycle change in the fundamental period length, and shimmer is the average cycle-
to-cycle change in amplitude, but there is no generally accepted definition of amplitude for a
complex signal. Traditional definitions rely on specific events in time, such as the largest positive
peak to the largest negative peak, or the largest negative peak value. Other definitions include the root-
mean-squared (RMS) value of each cycle (given that the fundamental period markers have been
correctly placed) used by Kempster and Kistler (1984) and Hillenbrand (1987), or the gain factor
of a least-mean-square waveform fit (Milenkovic, 1987). Shimmer arising from time-aliasing
has been observed by both Hillenbrand and Milenkovic and found to vary with vowel and Fo. The
concept of time-aliasing stems from the impulsive nature of the source-filter model of speech. The
vocal tract is modeled as a linear filter excited by an impulse train. From cycle to cycle, preceding
vocal tract impulse responses overlap and add to the current cycle. When the train is aperiodic, the
changing phase relationships between consecutive impulse responses cause the signal to become
distorted relative to previous cycles, resulting in measurable shimmer. Since all of the amplitude
definitions described above extract measurements at a stage where the time-aliasing has already
occurred, measured shimmer cannot be solely attributed to amplitude modulation of the vocal
fold oscillations. It would thus be useful to
1 This work has been funded by the National Institutes of Health, Grant DC 00387-08.
devise techniques that can ignore or overcome aperiodic time aliasing as a shimmer source, or else
identify and analyze other signals that more accurately reflect the fold behaviour.
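For orientation, the conventional cycle-to-cycle measures can be stated compactly in code; the sketch below is one common realization (mean absolute first differences, with shimmer expressed in dB), assuming the period and amplitude sequences have already been extracted:

    import numpy as np

    def jitter_abs(periods):
        """Mean absolute cycle-to-cycle change in fundamental period (s)."""
        periods = np.asarray(periods, dtype=float)
        return np.mean(np.abs(np.diff(periods)))

    def shimmer_db(amplitudes):
        """Mean absolute cycle-to-cycle amplitude change, in dB."""
        a = np.asarray(amplitudes, dtype=float)
        return np.mean(np.abs(20 * np.log10(a[1:] / a[:-1])))

    # e.g., period lengths and peak-to-peak amplitudes from marked cycles:
    print(jitter_abs([0.0100, 0.0101, 0.0099, 0.0100]))
    print(shimmer_db([1.00, 0.97, 1.02, 0.99]))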
In this study the second alternative is pursued. We examined the behaviour of data
obtained from a time domain simulation model that generates signals at the level of the vocal folds
and throughout the vocal tract. This enabled us to characterize possible sources of shimmer
induced by FM modulation of a driven model of vocal fold tissue displacement. It was found that
vocal tract time aliasing is only one of a number of sources of shimmer originating from different
mechanisms and affecting various signals. Shimmer also appears to be a result of the nonlinear
interactions between the transglottal acoustic pressure, the mucosal wave velocity, and the tissue
displacement. The signals examined include the supraglottal (input)
pressure Pin, the subglottal pressure Ps, the minimum glottal area Ag, the glottal flow U and its
time derivative (dU in this paper), and finally the output pressure from the vocal tract Po. In this
paper, each of these signals is examined in turn.
An interactive computer simulation of the vocal fold and vocal tract system has been used
to model the behaviour of the folds under conditions of FM subharmonic modulation of the vocal
fold tissue displacement. Subharmonic modulation of order 1/2 was chosen so that any changes in
the plots could be observed by inspection. An FM extent level of 30% was chosen to exaggerate
the effects, although it is acknowledged that this is much greater than values typically found in
human phonation. The model, SPEAK-model 2, was developed at the University of Iowa by one of
the authors. It incorporates source-tract interaction, but does not incorporate self oscillation.
Instead, there is direct control of tissue displacement by a mathematical driving function. Empirical
relationships previously described in the literature involving Fo, vocal fold length and thickness,
mucosal wave velocity, lung pressure, and transglottal pressure are used to define the behaviour of
the system. The effects of modulation are demonstrated as mechanisms supported by these
equations. Figures from the simulations are used to illustrate the phenomena.
Modeling equations
A full description of the model appears in (Titze 1995). A brief outline is given here.
Consider the top view of the vocal folds (Figure 1). The vocal processes are at x = ±ξo, at y = 0.
The lowest mode of vibration occurs in the y-direction (a half sinusoid). When ξo is small,
complete closure can occur, while for large values a glottal chink occurs, causing flow leakage (the
folds do not fully close).
For the simple case without jitter, the edges of the vocal folds oscillate sinusoidally in time
with an angular frequency ω = 2πFo. The glottal width at any point on the y axis (as the folds
oscillate) is

    g(y,t) = 2[ξo(1 - y/L) + ξm sin(ωt) sin(πy/L)]                (1)

where L is the vocal fold length, and ξm is the maximum vocal fold displacement (at y = L/2). The
first term in (1) is the prephonatory initial displacement.
To make the model respond in a manner similar to the human folds, relationships extracted
from the literature have been utilized. For example, it is known that ξm is dependent on lung
pressure and Fo. The following rule is adopted from Titze (1988):

    ξm = 17.4 PL^0.5 Fo^-1.6                (2)
Figure 2 shows a three dimensional view of the folds. A z-axis has been added to describe
the motion of the mucosal wave as it travels from the bottom to the top of the folds. If the vertical
dimension is sliced into layers, each layer k can have its own value of initial condition ξok and
maximum displacement ξmk. The propagation of the mucosal surface wave can then be obtained
by replacing sin(ωt) in equation (1) with sin(ω(t - z/c)), where z is the vertical point under
consideration and c is the wave velocity. If the vertical axis is sliced into N layers, the phase
is assumed to vary linearly between layers from the bottom (k = 1) to the top (k = N)
of the z-axis:

    φk = ω[t - (k - 1)T / ((N - 1)c)]                (3)

where φk replaces ωt in equation (1) and T is the thickness from bottom to top. Titze, Jiang and
Hsiao (1993) conducted an experiment to measure c. They examined the motion of two sutured
points on a canine hemi-larynx placed in the vertical dimension. The time required for the
displacement of the superior suture to 'catch up' to the displacement of the inferior suture was
measured by observing the time taken to pass a certain point. This time was interpreted as an
indication of the wave propagation speed. In the results section of the paper, the authors discuss the
fact that the inter-suture distance (and the overall thickness of the folds) may vary during vibration
as a function of Fo.
We present an equation which takes this thickness variation into account:

    c = a Fo T / T0                (4)

where T0 is the nominal thickness at rest, T is the thickness during vibration for a given Fo, and a
is a constant parameter of value 0.01 meters (Titze, 1995).
From the above driving equations, the displacement at any point (y, k) may be obtained:

    g(y,k,t) = 2[ξok(1 - y/L) + ξmk sin(φk) sin(πy/L)]                (5)
The glottal area between the symmetric folds at any layer k can be calculated by integration, and
the minimum glottal area Ag, observable by a photoglottogram, can be estimated by finding the
minimum area layer at each point in time.
The glottal flow is determined by the transglottal pressure. The orifice area Ag, the subglottal
pressure Ps, and the supraglottal pressure Pin determine the flow U via the equation

    U = Ag sqrt(2(Ps - Pin) / (kt ρ))                (6)

where kt is the transglottal pressure coefficient (assumed constant for this study), and ρ is the air
density. An /a/ vocal tract shape is modeled using wave reflection equations (Liljencrants, 1985).
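A rough numerical sketch of Equations (1)-(6), as reconstructed above, is given below; all parameter values are illustrative, the supraglottal pressure Pin is set to zero, and no vocal tract load is attached, so this reproduces only the kinematics, not the SPEAK simulation itself:

    import numpy as np

    # Illustrative constants (not the SPEAK-model 2 values).
    L, T0, N = 0.016, 0.003, 21        # fold length (m), thickness (m), layers
    F0, a = 125.0, 0.01                # fundamental (Hz); constant a (m), Eq. (4)
    xi0, xim = 0.0001, 0.001           # initial and maximum displacement (m)
    c = a * F0 * T0 / T0               # Eq. (4) with T = T0

    def width(y, k, t):
        """Glottal width, Eq. (5), with linear layer phase, Eq. (3); k = 0..N-1."""
        phik = 2 * np.pi * F0 * (t - k * T0 / ((N - 1) * c))
        return 2 * (xi0 * (1 - y / L) + xim * np.sin(phik) * np.sin(np.pi * y / L))

    y = np.linspace(0.0, L, 64)
    t = np.linspace(0.0, 0.02, 400)
    # Layer areas by integrating width over y; Ag is the minimum-area layer.
    areas = np.array([[np.trapz(np.clip(width(y, k, ti), 0.0, None), y)
                       for k in range(N)] for ti in t])
    Ag = areas.min(axis=1)

    # Eq. (6) with Pin = 0: U = Ag * sqrt(2 * (Ps - Pin) / (kt * rho)).
    Ps, kt, rho = 800.0, 1.0, 1.14     # Pa; -; kg/m^3 (illustrative)
    U = Ag * np.sqrt(2.0 * Ps / (kt * rho))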
FREQUENCY MODULATION
For frequency modulation, the constant fundamental frequency in equation (1) is replaced by a
time-varying instantaneous value Fo', which may be written as

    Fo'(t) = Fo[1 + E sin(2π Fm t)]                (7)

g(y,t) in equation (1) thus becomes frequency modulated at a modulation frequency Fm and an
extent E, proportional to Fm/Fo. Figure 3 illustrates the basic shape of the modulated displacement
when Fo is 125 Hz, Fm is 62.5 Hz, and E is 0.2 (20%). The effects of Equation 7 propagate
through the model in the following ways.
If Equation 7 is assumed, the mucosal wave velocity c now varies with Fo' (via Equation
4). The modulation effect then propagates into Equations 3 and 5. Before we discuss what happens
in our model, let us examine what happens in a simpler system of equations. Consider the signal in
Figure 3. It might represent the oscillation of the bottom layer (k=l) of the folds. If a constant time
delayed version of this signal is used to represent the top layer (k=N) of the vocal folds, amplitude
variations will occur in the minimum glottal area wave Ag. These variations in the minimum glottal
area are depicted in Figure 4. Such peak-to-peak variations in the 'minimum aperture' appear
regardless of the relative sizes of the waves. However, consider the top of the folds to be oscillating
with a constant phase delay relationship. The minimum glottal area exhibits no amplitude
variations here (Figure 5). It can be shown that the minimum glottal area in Figure 5 will always be
free of amplitude variations, regardless of the relative sizes of the top and bottom displacements.
This is provided that a constant phase lag is maintained between the layers and that each of the
layers carries the same modulated waveform.
In our vocal fold model, if c is proportional to the 'nominal' average value Fo, it can be
shown analytically that a constant time delay mucosal wave will be observed (see Appendix).
Observations of the mucosal wave, e.g., Titze (1989), report a constant phase delay, although an FM situation is not
usually considered. It remains to be demonstrated which type of delay - constant time or constant
phase - occurs in human phonation.
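The contrast between the two delay types can be reproduced outside the full model; the sketch below mirrors the construction behind Figures 4 and 5 (Fo, Fm, E, the 1 ms delay, and the 60-degree lag come from the figure captions; the sinusoidal FM phase law is an assumption):

    import numpy as np

    fs, F0, Fm, E = 40000, 125.0, 62.5, 0.2
    t = np.arange(0.0, 0.05, 1 / fs)

    def fm_phase(tt):
        """Phase of f_inst = F0 * (1 + E * sin(2*pi*Fm*t)), integrated."""
        return 2 * np.pi * F0 * tt - (E * F0 / Fm) * np.cos(2 * np.pi * Fm * tt)

    bottom = np.sin(fm_phase(t))
    top_time = np.sin(fm_phase(t - 0.001))        # constant 1 ms time delay
    top_phase = np.sin(fm_phase(t) - np.pi / 3)   # constant 60 deg phase lag

    # 'Minimum aperture' between the two layers at each instant.
    ap_time = np.minimum(bottom, top_time)
    ap_phase = np.minimum(bottom, top_phase)
    # Peak heights of ap_time vary cycle to cycle (cf. Figure 4);
    # peak heights of ap_phase stay constant (cf. Figure 5).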
Figure 6 demonstrates what happens in the driven vocal fold model when the nominal
average value of Fo is used for both the maximum vocal fold displacement and the speed of
mucosal wave propagation. The displacement signal x exhibits FM behaviour, but no amplitude
perturbation. The glottal area waveforms (nonminimum) for three points on the z axis (bottom
(k=1), middle (k=10), and top (k=21)) show a similar result to Figure 4. The FM subharmonic in x
produces amplitude perturbations in minimum glottal area Ag, the flow U, the derivative with
respect to time dU and on into other signals in the vocal tract. Note that the contact area CA does
not exhibit any amplitude perturbations, and that the x and Ag1, Ag10, and Ag21 plots are on an
expanded time scale relative to the other plots, so that the effect of overlapping Ag waves can be
seen.
Figure 7 demonstrates the situation when the instantaneous Fo' is used for the mucosal
wave speed. A constant maximum amplitude is assumed (time-varying ξm will be discussed later).
The three glottal area waves exhibit no visible minimum aperture amplitude variation in Ag or CA,
as predicted by the previous discussion. The glottal flow U, however, does exhibit maximum
amplitude variations, which then propagate into dU and on into the vocal tract. The cause of this
is explained below.
The glottal flow in Figure 7 demonstrates amplitude perturbation even though Ag does not.
This occurs because of a combination of two effects. First, from Equation 6, the glottal flow U is
related to Ag and the transglottal pressure Ps-Pin. Pin is related to vocal tract loading, which is
usually inductive, causing the flow to skew to the right (i.e. it is delayed relative to Ag). As a
result, the peak of the flow wave U occurs after the peak of Ag. While the peaks of Ag are the
same height, the negative closing slope varies due to the FM modulation. As a result, the value of
the flow at its peak varies from cycle to cycle, producing shimmer in U.

Modulation of ξm
If ξm were constant, the glottal width equation (1) would vary sinusoidally for constant
Fo, and in a modulated manner (Figure 3) for Fo'. When a time-varying relationship between ξm
and Fo' is assumed for Equation 2 (again assuming that these equations apply to dynamically
modulated conditions), direct amplitude modulation of the maximum tissue displacement occurs.
Figure 9 assumes a constant phase delay (Fo' in Equation 4) and dynamically varying ξm (Fo' in
Equation 2). The displacement x shows both amplitude and frequency modulation. As a result,
amplitude perturbation appears in all subsequent signals (except for CA).
DISCUSSION
This study has identified two potential laryngeal sources for jitter-induced shimmer, and
suggested that caution be used regarding the interpretation of another. One source is the
modulation of the maximum amplitude of vibration (ξm), which directly influences the amplitude
of the glottal areas. The second source is the slope of the minimum glottal area, which determines
the peak in the flow wave (due to the inductive load of the vocal tract). It should also be noted that
it is the slope of the flow wave, not the peak, which is closely tied to the impulsive excitation of the
vocal tract pressure wave. Often it is this peak in Po that is marked in voice waveform analysis.
If mucosal wave velocity c is dependent on Fo', it has been demonstrated here that a
constant phase delay rather than a constant time delay occurs between the top and bottom of the
folds, making it unlikely (in this model of vibration) that minimum glottal area (Ag) amplitude
perturbation due to a constant time delay mucosal wave is a source of shimmer.
It should be remembered, however, that the Fo' and ξm modulations likely occur
simultaneously, resulting in Ag perturbation as illustrated in Figure 9. In a real vocal fold, where
changes in the stiffness of the muscle are likely to be the source of perturbation, it is probable that
both modulations are present.
The contact area signal CA cannot by itself indicate whether the tissue displacement exhibits
shimmer or not. It is a measure of the electrical conductivity measured from one side of the folds to
the other. In all the examples that were given, complete closure was achieved (the z layers were
preconfigured so that the initial ξok were close together), causing CA to reach a maximum for all
cycles. Figure 10 illustrates the case where this assumption is relaxed, resulting in incomplete
closure for some of the layers in the z-axis. As a result, CA exhibits amplitude perturbation. It
should thus be noted that the ability of CA to demonstrate glottal displacement amplitude
perturbation is limited due to 'saturation', because it is inherently a measure of behaviour during
the closed phase of the cycle.
Since the electroglottograph does not directly reflect maximum tissue displacement, we
should ask which signal is likely to most accurately describe the tissue displacement. Assuming
that the vocal fold mucosal wave is a constant phase delay phenomenon, the glottal area signal
will directly describe the displacement. This is best measured by the photoglottograph.
SUMMARY
The interactions among several voice production variables have been qualitatively
described for a driven model of vocal fold displacement. The mechanisms identified here suggest
that shimmer in the glottal flow U occurs from slope changes in Ag, while direct modulation of the
maximum displacement ξm produces amplitude perturbation in the glottal area itself. While the
mechanism due to slope changes in Ag is likely to be common to all vocal fold models, the
empirical equations relating ξm and c to Fo are not used in self-oscillating models, since Fo is not
directly controlled. An examination of self-oscillating models, in which the physiological variables
are the control parameters, would be a natural extension of this work.
APPENDIX
Consider the tissue displacement equations ξt and ξb for the top and bottom of the fold:

    ξb(t) = ξob + ξm sin(ωo' t)                        (A1)
    ξt(t) = ξot + ξm sin(ωo' (t - T/c))                (A2)

where ξob and ξot are the initial displacements, ωo' is the fundamental frequency (possibly time
varying), t is the time variable, T is the variable vertical thickness of the fold, and c is the velocity
of the mucosal wave.
If Fo is assumed constant, c = aFoT/T0, so that the delay T/c = T0/(aFo) is fixed, and a constant
time delay equation replaces A2:

    ξt(t) = ξot + ξm sin(ωo (t - T0/(aFo)))            (A3)

On the other hand, if Fo (and therefore c) is assumed time varying, then c = aFo'T/T0, or
c = aωo'T/(2πT0), and A2 becomes a constant phase lag equation:

    ωo' T/c = 2πT0/a                                   (A4)
    ξt(t) = ξot + ξm sin(ωo' t - 2πT0/a)               (A5)
BIBLIOGRAPHY
Hillenbrand, J., "A Methodological Study of Perturbation and Additive Noise in Synthetically
Generated Signals". Journal of Speech and Hearing Research, 30 (4), 448-461, December 1987.
Kempster, G.B., and Kistler, D.J., "Perceptual Dimensions of Dysphonic Voices". Journal of the
Acoustical Society of America, 75 (Suppl. 1) S8 (A), 1984.
Liljencrants, J., "Dynamic Line Analogs for Speech Synthesis". Quarterly Progress and Status
Report, STL-QPSR, 1/1985, 1-14. Speech Transmission Laboratory, Royal Institute of Technology
(KTH), Stockholm, Sweden.
Milenkovic, P., "Least Mean Square Measures of Voice Perturbation", Journal of Speech and
Hearing Research, 30 (4), 529-538, December 1987.
Oppenheim, A., and Schafer, R., Discrete-Time Signal Processing. Prentice-Hall, Englewood
Cliffs, New Jersey, 1989.
Qi, Y.Y., Weinberg, B., Bi, N., and Hess, W.J., "Minimizing the Effect of Period Determination
on the Computation of Amplitude Perturbation in Voice", NCVS Acoustic Voice Analysis
Workshop Proceedings, Denver, Colorado, Feb. 17-18, 1994.
Titze, I.R., "The Physics of Small Amplitude Oscillation of the Vocal Folds". Journal of the
Acoustical Society of America, 83 (4), 1536-1552, 1988.
Titze, I.R., Jiang, J., and Hsiao, T.Y., "Measurement of Mucosal Wave Propagation and Vertical
Phase Difference in Vocal Fold Vibration", Annals of Otol. Rhinol. Laryngol., 102, 58-63, 1993.
Titze, I.R., "On the Relation Between the Subglottal Pressure and Fundamental Frequency in
Phonation". Journal of the Acoustical Society ofAmerica,^ (2), 901-906,1989.
Figure 3. Subharmonic frequency modulated sinusoid. Fo = 125 Hz, Fm = 62.5 Hz, E = 20%.
Figure 4. Time-delayed FM subharmonic (1/2) modulated sine waves (time delay = 1 ms) and the
resulting minimum aperture wave. Fo = 125 Hz, Fm = 62.5 Hz, E = 20%.
Figure 5. Phase-delayed FM subharmonic (1/2) modulated sine waves (phase lag = 60 deg) and the
resulting minimum aperture wave. Fo = 125 Hz, Fm = 62.5 Hz, E = 20%.
Figure 6. SPEAK generated subharmonic (1/2) array of signals (Ps - Pin, dU, CA, Ag21, x). Assumes
Fo is constant for Equations 2 and 4, causing a constant time delay between displacements.
Figure 7. SPEAK generated subharmonic (1/2) array of signals (Ps - Pin, dU, Ag21). Assumes Fo is
constant for Equation 2, and varies for Equation 4, causing a constant phase delay between
displacements.
Figure 8. SPEAK generated subharmonic (1/3) array of signals. Assumes Fo is constant for
Equation 2 and varies for Equation 4, causing a constant phase delay between displacements, but
uses a 1/3 subharmonic to generate displacement slope changes.
Figure 9. SPEAK generated subharmonic (1/2) array of signals. Assumes Fo' (time varying) is
used for Equations 2 and 4, causing a constant phase delay between displacements, and amplitude
modulation of the displacement.
Figure 10. SPEAK generated subharmonic (1/2) array of signals. Assumes Fo' (time varying) is
used for Equations 2 and 4, causing a constant phase delay between displacements, and amplitude
modulation of the displacement. Lower folds do not close.
MUD-1
A Guide to Selecting A/D Hardware
Source
The source may be, e.g., Microphone, Electroglottogram (EGG), Photoglottogram (PGG),
Glottal Flow, Pressure.
Amplification (optional)
Must meet requirements of the converter (some converters have built-in gain stages).
Filter (recommended)
Low-pass at 1/2 of the sampling rate (or less). The filter may be omitted only if there is no
high-frequency energy. Some converters will low-pass automatically once the sampling rate is
selected.
A/D converter
Specifications
A. Miscellaneous
- number of channels
- max sampling length
- input range
B. Bandwidth
- highest sampling rate
- lowest sampling rate
- fixed or variable sampling rates
C. Signal/noise
- total noise is the sum of all parts in the chain
- the more bits, the better

Source Types
(1) Microphone

Purpose          Freq response      S/N (dB)
Auditory         20 Hz - 20 kHz     80
F0 Extraction    F0/4 - 8*F0
RMS
Jitter           F0/4 - 20*F0
Shimmer          F0/4 - 20*F0
Harm/Noise       F0/4 - 20*F0
Inverse Filter   3 Hz - 8*F0

(2) EGG or PGG

Purpose          Freq response      S/N (dB)
F0 Extraction    F0/4 - 8*F0
Wave Shape       3 Hz - 20*F0       high
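Since the frequency-response entries are simple multiples of F0, the band required for a given task can be computed directly; a small illustrative helper follows (the purpose keys paraphrase the table rows):

    def recording_band(f0, purpose):
        """Recommended analysis band (Hz) from the tables above."""
        bands = {
            'f0_extraction':  (f0 / 4, 8 * f0),
            'jitter':         (f0 / 4, 20 * f0),
            'shimmer':        (f0 / 4, 20 * f0),
            'harm_noise':     (f0 / 4, 20 * f0),
            'inverse_filter': (3.0, 8 * f0),
        }
        return bands[purpose]

    lo, hi = recording_band(100.0, 'jitter')  # 25 Hz - 2000 Hz
    min_sampling_rate = 2 * hi                # Nyquist: sample at >= 4 kHz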
MIL2-1
Paul H. Milenkovic
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
1415 Johnson Drive
Madison, Wisconsin 53706
Abstract
CBatch is a software program for the DOS operating system that can feed
a sequence of acoustic data files through a user-supplied acoustic analysis pro
gram. CBatch takes care of all of the translating required to read and write a
variety of widely-used waveform file formats. The analysis program is a software
filter that reads and writes to the DOS standard input and output data
streams, and CBatch intercepts these streams and redirects them to waveform
files. The analysis program obtains information typically stored in file headers
(sample rate, number of samples, data range and units) by writing keyword
strings to standard output and by reading alphanumeric responses from standard
input. The analysis program writes a keyword string to tell CBatch that
it is ready for waveform data, and then it reads blocks of waveform samples from
standard input and it optionally writes processed blocks of samples (such as a
pitch trace) to standard output.
One approach is to define a single standard file format that all programs could read and write.
The other is to develop a standard interface between the acoustic analysis program and a shell
program that can translate the many existing file formats. This report concentrates on this
second approach.
The concept is to implement the acoustic analysis algorithm as a software filter.
The software filter has an exceedingly simple interface: it reads blocks of bytes from
the operating system-designated standard input stream and it writes blocks of bytes
to the standard output stream. The software filter has control over how many bytes
it wants to read or write. A shell program intercepts the standard input and output
data streams and takes care of what files are to be processed and in which format.
There is an additional embellishment to the software filter. The filter program
makes requests of the shell program by writing alphanumeric keywords to standard
output and it reads the shell's alphanumeric responses from standard input. This
communication is primarily for obtaining information from the waveform file header
- sample rate, data units, number of samples - without having to know the details of
the header format. This communication can also be used for signalling. For example,
the filter can tell the shell when it wants to switch over to reading and writing blocks
of waveform samples.
In the usual situation of a main program calling a subroutine, the main program
asks the questions and the subroutine provides the responses. In this situation, it is
the subordinate filter that is asking the questions and the supervisory shell that is
providing the responses. This makes the filter much simpler because it only needs to
ask for the things it needs to know in the order it needs to know them. It also makes
the filter interface extensible. Adding new keywords does not invalidate existing filters
which only know about the old keywords. Finally, it makes the interface interactive.
The filter can indicate whether it wants both the input and output data streams
attached to waveforms or whether it wants input only so it can write the output in
its own format.
The CBatch program is a shell that employs the proposed filter interface to implement
batch processing of waveform files in a variety of formats. CBatch is a program
written in Turbo Pascal (Version 4 or later) that runs under DOS (Version 3 or later).
The complete source will be made available through E-mail, an FTP server, or on a
floppy disk: the source is ready but details of distribution are being arranged.
NCVS92 is the format developed at the National Center for Voice and
Speech (NCVS) for the dissemination of test data for acoustic analysis
algorithms. It also has an ASCII keyword free-format header.
This type of header is quite compatible with SPHERE, and there
was discussion at the February 1994 Denver meeting that SPHERE
should replace this format. This format contains keywords for data
units and data range (translation from binary into real-world units
like Volts or dB) which are desirable to add to SPHERE.
RIFF is the Microsoft multimedia waveform file format. The files typically
have the .WAV extension. Software packages sold with the
popular sound cards for multimedia (Sound Blaster 16, Pro-Audio,
Ensoniqs, etc.) employ this format.
UW XRMB is the file format used to collect both acoustic and pellet
track signals in the University of Wisconsin X-ray Microbeam system.
The files have the .DF extension.
ASCII is alphanumeric text, desirable for the output of pitch traces for
statistical analysis.
CBatch is invoked with a command line of the form

    CBATCH params fname cmd

where params is a list of parameters separated by spaces, fname is either a valid DOS
path name (directory and file name) or wildcard specification (such as *.wav), and
cmd is the name of your filter program followed by the parameters required by that
program.
The entry fname specifies one or more waveform files. If the filter uses both input
and output, those are the input files, and the output will go to files of the same name
but in the current directory. That way you can analyze files on CD-ROM (such as
TIMIT) and have the results go to files in the current directory on your fixed disk. If
the filter specifies a file extension, the output will go to files having that extension.
The parameters are
/H:h overrides the autodetection of the input file header and strips off h
16-bit words from the beginning of the file.
/0:o overrides the input file header specification for conversion between
offset binary and two's-complement numeric formats. For 12 bit
offset binary, specify /0:2048.
/R:r specifies the number of bits of data resolution. If the input file
header does not specify resolution, the value r is assumed. If the
input file header specifies resolution, the data will be shifted to r
bits.
/B overrides the input file header to specify that input data needs to be
byte-swapped (for waveforms collected on SUN or Macintosh com
puters).
/N:n overrides the input file header and specifies that the data contains
n channels of sample interleaved waveforms.
/D:d is the decimation factor. The parameter /D:2 means that the input
file was downsampled (decimated) to contain only every second
sample, so the input waveform will be upsampled by repeating every
sample.
/s:s overrides the input file header specification of the sampling rate in
kHz.
/r:r overrides the input file header specification for the data range. For
example, /R:12 (12 bit resolution) and /r:20 (data range of 20)
means a sample value of -2048 corresponds to -10 (Volts or whatever
other units are specified) and 2048 corresponds to 10.
/u:uname overrides the input file header specification of the units name
(such as /u:Volts or /u:ml/s).
/F:rname means that the fname entry specifies one or more directories,
and that rname designates the same file name that occurs in those
directories. For example, if the files are c:\rec001\speech.wav,
c:\rec002\speech.wav, ..., specify the parameter /F:speech.wav
and use c:\rec??? for your fname entry.
/A:a specifies the number of decimal digits accuracy for ASCII output.
Example:
applies the pitch analysis filter cptacf to all the files in the directory c:\speech
having the .wav extension. The pitch analysis takes its parameter settings from the
file cptacf .sav. The analysis results will be in ASCII format, sampled in run-length
fashion (when the value changes) but no more than once every 2 ms. The output
files will have the same name as the waveform files, but they will go into the current
directory and have the .f0 extension.
The source code distribution includes the sample filter program rectify .pas, which
performs a full-wave rectification of the input signal. It also includes the Turbo Pascal
unit file stdio.pas, which simplifies use of the filter interface.
The basic structure of a filter program is to write keyword queries to standard
output and to read alphanumeric responses to the queries from standard input. When
this communication phase is done, write the query initiate_binary_IO (the procedure
call StdCmd('') substitutes initiate_binary_IO for the null string) to initiate
the data phase. Then simply read blocks of 16 bit samples from standard input and
optionally write blocks of 16 bit samples to standard output: the number of samples
to read and write is under your control. If you are doing only input, keep reading
samples until standard input returns zero samples (this action resets the interface).
If you are doing input and output, write zero samples to standard output as your last
call (this also resets the interface).
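To make the structure concrete, here is a sketch of such a filter in Python rather than Turbo Pascal (the keyword strings come from the text, but the block size and the exact framing of the binary phase are assumptions of this illustration; the real sample filter is rectify.pas):

    import struct
    import sys

    BLOCK = 256  # samples per read, chosen arbitrarily

    def query(keyword):
        """Write a keyword query; read the shell's alphanumeric response."""
        sys.stdout.write(keyword + '\n')
        sys.stdout.flush()
        return sys.stdin.buffer.readline().decode().strip()

    res = int(query('RES?'))        # significant bits per 16-bit sample
    # (res is read for illustration; a real filter might mask unused bits.)
    sys.stdout.write('initiate_binary_IO\n')   # switch to the data phase
    sys.stdout.flush()

    while True:
        raw = sys.stdin.buffer.read(BLOCK * 2)       # 16-bit samples
        if not raw:                                  # zero samples read: reset
            break
        n = len(raw) // 2
        samples = struct.unpack('<%dh' % n, raw)
        out = [min(abs(s), 32767) for s in samples]  # full-wave rectify, clamped
        sys.stdout.buffer.write(struct.pack('<%dh' % n, *out))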
CBatch recognizes these keywords. The first set of keywords are the communication
phase dialog preamble: they must precede the other keywords and, if used, they
must be in the order listed below.
extension=ext specifies that the output file has extension ext (up to three
letters).
The following keywords make up the body of the dialog and may occur in any
order after the preamble.
RES? returns the number of significant bits in the 16 bit waveform samples
(controlled by CBatch /R:r parameter).
fname? returns the data file name, with the designated extension (from
extension=ext) tacked on. If you have selected filter input only and
are outputting in your own file format, use this query to get a file
name. This is also useful if you want to do your own input formatting
and have selected filter output only.
sname? returns the actual input file name without changing its extension.
This is useful if you need to label analysis results with their file of
origin.
DOUBLE:textquery prompts the user with the text string textquery for a
decimal (double precision data type) value. This is useful if the filter
needs to query the user for analysis settings. An alternative is for
the filter to read such settings from the command line or from a file.
sample_ms=s sets the waveform sample interval in ms. Use this if you
have selected output only and are doing your own file read.
buffer_ms=b sets the output file duration in ms. Use this if you have
selected output only and are doing your own file read.
units_name=uname sets the output file units name. If you are doing pitch
analysis, you may want to set units_name=Hz.
range=r sets the output file range (use range = 1000 for pitch analysis).
CBatch is a Turbo Pascal (Version 4 or later) program divided into a main program and a
set of Turbo Pascal units (separately compilable modules containing procedures and
data type declarations).
February 1994
CUR-2
Introduction
A. Market Gateway
B. Management Gateway
C. Technology Gateway
SUMMARY STATEMENT
BY INGO R. TITZE, PH.D.
FOREWORD
A workshop was held on the 17th and 18th of February, 1994, in Denver, Colorado to reach
better agreement on purpose and methods of acoustic analysis of voice signals. Sponsorship was by
the National Center for Voice and Speech, a research and training center funded by the National
Institute on Deafness and Other Communication Disorders, and The Denver Center for the Perform-
ing Arts. Topics included definitions and nomenclature in voice analysis, algorithms for extraction
of parameters, high fidelity recording of microphone signals, computer file structures, sharing of
data bases, and development of test signals. Attendance and contributions were by invitation, keep-
ing in mind a balance between industry and academia. The following contributors were present:
Dr. Wong, Coordinator of Technology Transfer at the National Center for Voice and Speech,
acted as chairman of the workshop and editor of the proceedings. Dr. Titze, Director of the National
Center for Voice and Speech and Executive Director of the WJ Gould Voice Research Center, led
most of the discussions and served as author of the Summary Statement. In this Summary Statement,
only the Recommendations (pp 26-30) should be viewed as majority opinion. All other materials are
explanatory and the opinion of the author. The full proceedings may be obtained by writing to the
National Center for Voice and Speech, Wendell Johnson Speech and Hearing Center, The University
of Iowa, Iowa City, Iowa 52242.
1 The Wilbur James Gould Voice Research Center is a division of The Denver Center for the Performing Arts.
Foreword..................................................................................................................................................2
Introduction..............................................................................................................................................4
Signal Typing..........................................................................................................................................18
Test Utterances.......................................................................................................................................24
Summary of Recommendations
A. Classification of Signals and General Analysis Approach.............................................................26
B. Extraction of Cyclic Parameter Contours and Perturbation Measures...........................................26
C. Test Utterances for Voice Analysis..................................................................................................28
D. Acquisition of Acoustic Voice Signals............................................................................................28
E. File Formats....................................................................................................................................29
F. Data Base Sharing...........................................................................................................................30
G. Data Base Management..................................................................................................................30
Glossary of Terms...................................................................................................................................31
References..............................................................................................................................................35
INTRODUCTION
Analysis of acoustic signals of the human voice has many purposes. From a technological
standpoint, there is an ever-growing need to store, code, transmit, and synthesize voice signals. The
telecommunications industry has dichotomized transmission of information into either voice or data,
suggesting that voice signals are a class of their own. From a basic science standpoint, investigators
have traditionally studied the microphone signal to understand speech production and perception,
given that the acoustic signal is the common link between them. Finally, from a health science
standpoint, the human voice has been shown to carry much information about the general health and
well-being of an individual. Our voice reveals who we are and how we feel, giving considerable
insight into the structure and function of certain parts of the body.
This workshop was limited to voice analysis rather than speech analysis, the focus being on
the extraction of information about the source of sound from a microphone signal. Thus, no attempt
was made to discuss or summarize general speech analysis dealing with vocal tract information. For
a complete review of speech analysis, the reader is referred to the three volumes of selected papers
published by the Acoustical Society of America (Miller et al., 1991; Atal et al., 1991 and Kent et al.,
1991).
More specifically, the workshop was a response to an urgency expressed by a group of
voice scientists, voice clinicians, and manufacturers of instrumentation to reach some consensus on
utility, feasibility, and standardization of voice perturbation methods. There has been much expec-
tation and much disappointment in what perturbation analysis can offer for diagnosis and assessment
of voice disorders. This workshop gives some of the underlying reasons for both the high expecta-
tion and the limited success.
Perturbation analysis is based on the premise that small fluctuations in frequency, ampli-
tude, and waveshape are always present in a voice signal, reflecting the internal “noises” of the
human body. Every attempt on the part of the speaker to produce a perfectly steady sound results in
an aperiodic waveform. Movements of tissue and air are modulated by the irregular internal motion
of electrical impulses, fluids, and cells within an organ. Thus, what might appear to be steady
movement or posture on a macroscopic scale is often pulsatile movement on a microscopic scale, as
evidenced by twitching of muscles, expansion and contraction of blood vessels, and beating of cilia
to transport fluids. If we could shrink to microscopic dimension and travel through the human body,
we would see that much of the physical plant (the hydraulic, electrical, and chemical systems) exhib-
its complex back-and-forth motions (oscillations). These micromovements impose fluctuations on
what would otherwise be smooth and steady activity.
Voice production can be thought of as the activation of an entire system of coupled oscilla-
tors. The intent to vocalize activates motor commands that are responsible for the neural inputs to an
array of biological oscillators (Figure 1).
Figure 1. A list of biological oscillators involved in voice production and factors that may
influence them.
NOMENCLATURE AND DEFINITIONS
We begin with a few terms that describe the general phenomenon of irregularity in the human
voice, but do not (and probably should not) have precise mathematical definitions.
Descriptive Terminology
A perturbation is usually thought to be a minor disturbance, or a temporary change, from
an expected behavior. For example, if something is expected to move in a circular orbit but assumes
a slightly elliptical path, we say the circular orbit is perturbed. If a person is chewing and encounters
a small, hard object in the food, the normal chewing motions are momentarily perturbed. Perturba-
tions are usually such that they do not alter the qualitative appearance of a visual or temporal pattern,
at least not indefinitely. They are small irregularities that are for the most part overlooked.
A fluctuation suggests a more severe deviation from a pattern. It reflects an inherent insta-
bility in the system. Whereas a perturbed system usually returns to normal (it is attracted to a stable
state), a fluctuating system is somewhat out of control; it cannot find a stable state. Examples are a
hand tremor, a flag blowing in the wind, or a car fishtailing on a slippery road. Closer to home in
terms of the human voice, a vocal tremor or vibrato may be described as a fluctuation in fundamental
frequency and amplitude. It is more than a perturbation because there is no ultimate stabilization of
fundamental frequency or intensity toward some constant value. The tremor or vibrato is a pattern
itself, rather than a small deviation from a pattern.
Variability is the ability of someone or something to vary, by design or by accident. More
formally, it is the amount of variation as determined by a statistical measure. In a golf swing, a basic
motion may be repeated over and over again, but conditions of the ground surface, the weather, the
ball, the club, or the player may alter the precise motion. Thus, variability may cause the final result
(the resting position of the ball) to be far from the expected result. However, depending on how
intelligently human variability is used, the final result can also be better than expected. If the player
uses variability in muscle activity to compensate for wind and surface variability intelligently, the
overall deviation (in the final ball position) may be less than the deviation that would be obtained by
a perfectly consistent robot. Thus, variability may be used to fight variability, but it can also have a
catastrophic effect if allowed to run rampant. (For a discussion of variability in speech, see Perkell
& Klatt, 1986).
Jitter refers to a short-term (cycle-to-cycle) perturbation in the fundamental frequency of
the voice. Some of the early investigators (e.g., Lieberman, 1961, 1963) displayed speech wave-
forms oscillographically and saw that no two periods were exactly alike. The fundamental fre-
quency appeared jittery; hence, the term jitter. Shimmer was then invented as a companion word for
amplitude-jitter; i.e., a short-term (cycle-to-cycle) perturbation in amplitude (Wendahl, 1966).
roughness, but more often there is a lack of periodicity. Breathiness is a vocal quality that contains
the sound of breathing (expiration, in particular) during phonation. Acoustically, there is a signifi-
cant component of noise in the signal due to glottal air turbulence. Sometimes the term hoarseness
is used to describe the combination of roughness and breathiness.
The terms described thus far - perturbation, fluctuation, variability, jitter, shimmer, tremor,
wow, vibrato, flutter, roughness, breathiness, hoarseness, and several others defined in the glossary -
have no mathematical definitions. No numbers or physical units of measurement need to be attached
to them, although some of them can be rated psychophysically. Nevertheless, they serve a purpose in
describing vocal phenomena and the associated physical processes. At this point, some additional
terms will be reviewed that have mathematical definitions.
A signal f(t) is periodic if

    f(t + nTo) = f(t)                (1)

where n is any positive integer and To is the period. To must be the smallest value possible to be
deemed the fundamental period. Equation (1) can never be strictly satisfied in a voice signal. All
vocal events tend to be aperiodic. The term quasi-periodic is sometimes used to suggest that there is
only a small deviation from periodicity. It must be kept in mind, however, that quasi-periodicity is
simply a special case of aperiodicity. Furthermore, in physics the term quasiperiodic has the special
meaning of the superposition of two or more periodic signals with incommensurate (non-integer
ratio) frequencies. Hence, we prefer not to use the term, but adopt nearly-periodic to avoid confu-
sion.
A series of events is termed cyclic if the events recur, but not necessarily in periodic fashion.
A cyclic event is recognized on the basis of a pattern that involves neighboring points on a waveform
(e.g., a zero crossing, a maximum value, a minimum value).
A cyclic parameter is a construct of cyclic events (e.g., inter-pulse-interval, open quotient,
skewing quotient, peak-to-zero amplitude, peak-to-peak amplitude, maximum flow declination rate).
Some of these parameters are identifiable only after the acoustic waveform has been inverse filtered,
which is the process of removing the vocal tract resonances from the waveform to obtain the glottal
airflow (Rothenberg, 1973). In a sinusoidal waveform, the amplitude A, the period T, and the fre-
quency 1/T are obvious cyclic parameters and have precise definitions. In a complex periodic wave-
form, the fundamental period To and fundamental frequency Fo = 1/To also have exact definitions
(equation 1), but amplitude can be defined in a variety of ways. Traditionally, the peak value (maxi-
mum positive or negative) and the peak-to-peak value (maximum positive to maximum negative)
have been used. As alternatives, Hillenbrand (1987) used the root-mean-squared (RMS) intensity in each cycle.
Figure 2. A fundamental frequency (Fo) profile used for perturbation analysis. The subject was a normal adult male phonating a steady [a] vowel at approximately 100 Hz for about 12 seconds.
Note in Figure 2 that there is one burst of instability in the middle of the contour. Over the rest of the utterance, the Fo variation
was considerably smaller. (Other graphs in Figure 2 will be discussed later).
Now let xi represent an arbitrary cyclic parameter, for which some stylistic contours are
illustrated in Figure 3. Part (a) shows an irregular contour, similar to that of Figure 2 just discussed,
but with fewer cycles. Part (b) shows a regular “up-down” pattern that is often seen in voice signals,
and parts (c) and (d) show a linear and sinusoidal trend, respectively. The “up-down” pattern in part
(b) suggests the presence of a subharmonic frequency Fo/2, or a period doubling 2To. Clearly, if only
every other point were plotted in the contour, a constant would result and periodicity would be
achieved. Thus, the true period is doubled. In equation (1), period doubling is represented by using
only the even values of n.
Figure 4a demonstrates an amplitude modulation (AM) and Figure 4b a frequency modula-
tion (FM) of a series of glottal pulses. Mathematically, the modulation extent is defined as

mAM = (A1 - A2) / (A1 + A2)

for sinusoidal amplitude modulation and

mFM = (T1 - T2) / (T1 + T2)

for sinusoidal frequency modulation, where A1 and A2 are the largest and smallest amplitudes, re-
spectively, and T1 and T2 are the largest and smallest periods in the signal. Note that modulation
extent approaches 1.0 (100%) when either A2 or T2 approaches zero. Such an extreme condition
violates a basic principle of modulation, however, because the carrier signal momentarily loses its
amplitude completely for AM, whereas the frequency (1/T) momentarily approaches infinity for FM.
Practical modulations are usually well below 100%. In a vocal vibrato, for example, a 3% frequency
modulation is typical. Amplitude modulations can be larger in vocal signals, but seldom exceed
50%.
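As a sketch of how these definitions translate into computation (the function names are ours), the modulation extents can be estimated directly from the extracted amplitude and period sequences:

```python
import numpy as np

def am_extent(amplitudes):
    """(A1 - A2) / (A1 + A2) for sinusoidal amplitude modulation."""
    A1, A2 = np.max(amplitudes), np.min(amplitudes)
    return (A1 - A2) / (A1 + A2)

def fm_extent(periods):
    """(T1 - T2) / (T1 + T2) for sinusoidal frequency modulation."""
    T1, T2 = np.max(periods), np.min(periods)
    return (T1 - T2) / (T1 + T2)

# Example: a vibrato-like 3% frequency modulation at 5 Hz on a 100 Hz carrier.
i = np.arange(500)
T = 0.010 * (1.0 + 0.03 * np.sin(2 * np.pi * 5 * 0.010 * i))
print(f"FM extent = {fm_extent(T):.3f}")   # approximately 0.030
```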
For modulation extent to be measurable in a voice signal, the modulation frequency Fm (the
number of modulation cycles per second) should be well below the carrier frequency Fc = Fo. (In the
theoretical limit, Fm/Fc is governed by the Nyquist criterion, Fm < Fc/2). If Fm is too high, there is insufficient
sampling of the modulation envelope and large errors may occur in its detection. Such is the case
with subharmonic modulations, which are often undersampled in a voice signal (note that there are
only two points per modulation cycle in Figure 3b). Vibrato and tremor, on the other hand, are usually ad-
equately sampled because their frequencies are naturally well below Fo (see Figure 3d as an ex-
ample).
If the mean value is intended to be a constant, as in steady vowel phonation, then a zeroth-order
perturbation of the i-th cycle can be defined as

P0i = xi - x̄ ,  i = 1, 2, ..., N .   (5)

(The term zeroth-order is used because a constant is basically a zero-order pattern or trend). Higher-
order perturbation functions are defined as the following finite differences:

P1i = xi - xi-1 ,   (6)

P2i = xi - (xi+1 + xi-1)/2 .   (7)
In general, since the first subscript represents the order n of the perturbation function and the second
subscript represents the i-th cycle, higher-order (n+1) perturbation functions are generated recur-
sively as

Pn+1,i = (Pn,i - Pn,i+1) / K ,
where K is a normalization factor that keeps the coefficient of xi positive and unity in each perturba-
tion function. Note that with this normalization, all perturbation functions are zero when xi is a
constant.
The perturbation functions can be used to remove known or assumed trends in the cyclic
parameter contour. The zeroth-order perturbation function removes nothing, the first order perturba-
tion function removes a constant (the mean value x̄), the second order function removes a linear
trend, the third order function removes a quadratic trend, and so on. In general, the n-th order
perturbation function removes a polynomial trend of order n-1 in the contour.
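Under the difference conventions reconstructed in equations (5)-(7), the first few perturbation functions can be sketched as follows (a minimal illustration of the idea, not the workshop's reference implementation):

```python
import numpy as np

def p0(x):
    """Zeroth-order perturbation: deviation from the global mean (equation 5)."""
    x = np.asarray(x, dtype=float)
    return x - x.mean()

def p1(x):
    """First-order perturbation: backward difference xi - xi-1 (equation 6)."""
    x = np.asarray(x, dtype=float)
    return x[1:] - x[:-1]

def p2(x):
    """Second-order perturbation: xi - (xi+1 + xi-1)/2 (equation 7)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] - 0.5 * (x[2:] + x[:-2])

# For a linear trend xi = a + k*i, p1 extracts k and p2 removes the trend.
i = np.arange(20)
x = 100.0 + 0.5 * i
print(np.allclose(p1(x), 0.5), np.allclose(p2(x), 0.0))   # True True
```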
Consider a linear trend as shown in Figure 3c. It is represented by the relation

xi = a + k·i ,

where a is a constant and k is the rise per cycle. It is easily seen from equation (6) that P1i = k and that all higher-order
perturbation functions in this example are zero. Thus, the first order perturbation function extracts
the linear trend, whereas the higher order perturbation functions remove it. The second graph from
the top in Figure 2 shows a second order perturbation function computed from a human voice. The
scaling is smaller than that of the contour because it is an absolute scaling (±10% deviation from the
mean value). Note that the short-term fluctuations of the contour are retained, but the long-term
trends are removed. For example, the gradual downward slope of the Fo contour in the beginning
one-third of the utterance has been removed. So has the tremorous variation that is most noticeable
in the middle of the contour. All that is left in the second-order perturbation is the short-term “noise”.
If a linear trend is deliberately produced by the voice, such as a uniform Fo glide between
two pitches in a specified amount of time, then k is a known quantity. It can simply be inserted into
the perturbation formulas. For example, the first-order perturbation then becomes

P1i = xi - xi-1 - k ,

which is now known as the deviation from a linear trend. If a linear trend is suspected as an inherent
pattern, but k is not known, it can be computed from the data by linear regression. This is a well-
known statistical procedure (Hays, 1988). Furthermore, all patterns with forward predictability (e.g.,
a sinusoid, a damped sinusoid, an exponential) can collectively be removed by linear predictive
coding (LPC), with only random (or unpredictable) events remaining in the residual perturbation
function. LPC analysis is based on the assumption that xi can be predicted from a weighted sum of
M previous samples,

xi ≈ a1·xi-1 + a2·xi-2 + ... + aM·xi-M ,
where the a’s (the predictor coefficients) are determined by a linear least squares fit to the contour
(Markel & Gray, 1976).
Perturbation functions can also be referenced to a local rather than a global mean:

Pi = xi - [1/(2m+1)] Σj xi+j ,  j = -m, ..., m .

This function computes the deviation from a local mean. If 2m + 1 = N, the total number of cycles
in the window, the perturbation function becomes P0i (equation 5). If m = 1, then the summation
becomes the three-cycle local average used by Koike (1973). For a two-cycle local average, the j =
0 value is omitted and P2i is obtained (equation 7). An 11-cycle average (m = 5) has also been used
(Takahashi & Koike, 1975).
The autocorrelation function of the cyclic parameter contour serves a purpose contrary to
that of a trend remover (such as the second-order perturbation function). It removes the short-term
cycle-to-cycle “noise” but keeps the long term patterns. Mathematically, the autocorrelation func-
tion is computed as

r(j) = <(xi - x̄)(xi+j - x̄)> / <(xi - x̄)²> ,  j = 0, 1, 2, ... ,
where the brackets indicate average (expected) values over a fixed window of observation. Basically,
the autocorrelation function is the contour multiplied by a delayed version of itself, the delay being
one period, two periods, three periods, and so on (Rabiner & Schafer, 1978; Bendat & Piersol, 1986).
In Figure 2 (third waveform from top on left side), the computation was done from 0 delay periods to
597 delay periods. The autocorrelation is always maximum for 0 delay periods (the function corre-
lates perfectly with itself if not delayed), where it has the value 1.0. At all other points, it is greater
than -1.0 and less than +1.0 if properly normalized. Note that the fluctuation seen in the autocorrelation
function indicates that a small amount of a “vibrato” is present in the subject’s voice. This is percep-
tually below threshold. The subject intended to produce a straight tone, but since he was vocally
trained to sing with vibrato, he could not completely suppress it. This is a good example, then, of a
case in which acoustic analysis “digs out” something that is easily lost in both the raw Fo contour and
the auditory perception.
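A normalized autocorrelation of the contour, with the delay measured in cycles, can be sketched as follows (assuming one contour sample per cycle; the function name is ours):

```python
import numpy as np

def contour_autocorrelation(x, max_lag):
    """Normalized autocorrelation of a cyclic-parameter contour;
    r(0) = 1 and |r(j)| <= 1 for all delays j (in cycles)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - j], x[j:]) / denom
                     for j in range(max_lag + 1)])

# A 5 Hz vibrato on a 100 Hz contour repeats every ~20 cycles, so the
# autocorrelation shows peaks at delays of 20, 40, 60, ... cycles.
i = np.arange(1200)
fo = 100.0 + 0.5 * np.sin(2 * np.pi * 5 * i / 100.0)
r = contour_autocorrelation(fo, 100)
print(np.argmax(r[10:30]) + 10)   # ~20
```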
The histogram (bottom left corner in Figure 2) shows a distribution of the cyclic parameter
values for all of the 1,195 cycles. On the vertical axis is the number of occurrences of the
parameter value in a narrow range (bin). Note that the greatest number of occurrences of Fo is near
the midrange value (99.8 Hz), whereas large deviations from the midrange occur infrequently. The
distribution is nearly Gaussian, suggesting that perturbations are primarily random. In contrast, the
distribution would be bimodal (two major peaks) if a subharmonic or a strong vibrato were present in
the Fo contour.
Finally, the power spectrum of the parameter contour (bottom right) is a useful display of
the dominant frequencies that modulate the contour. Note that a frequency of about 5 Hz stands out
in this spectrum. This is the frequency of the small amount of vibrato in the voice. All other peaks
in the power spectrum are at least 10 dB lower and do not represent significant components. Again,
subharmonics, tremors, or any other modulations can easily be detected in this type of display.
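Treating the contour as a signal sampled once per cycle (so that its effective sampling rate is approximately the mean Fo), a power spectrum of the kind shown in Figure 2 can be sketched as follows (a minimal illustration, not the workshop's implementation):

```python
import numpy as np

def contour_power_spectrum(x, fo_mean):
    """Power spectrum of a cyclic-parameter contour sampled once per cycle."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # remove the DC component
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fo_mean)
    return freqs, spectrum

# Example: a 5 Hz vibrato on a 100 Hz contour shows a peak near 5 Hz.
i = np.arange(1200)
fo = 100.0 + 0.5 * np.sin(2 * np.pi * 5 * i / 100.0)
f, P = contour_power_spectrum(fo, 100.0)
print(f[np.argmax(P)])   # ~5 Hz
```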
In summary, a cyclic parameter profile of the type shown in Figure 2 is a useful tool in voice
analysis. It helps to quantify visually what is perceived aurally. A similar profile can be constructed
for amplitude variation or for any other cyclic parameter (open quotient, maximum flow declination,
skewing quotient, etc.).
Perturbation Measures
A perturbation measure is an effective value of the overall perturbation in the cyclic con-
tour. For example, the standard deviation from the mean is

σ0 = [ (1/N) Σi (xi - x̄)² ]^(1/2) .

This measure can also be identified as the root-mean-squared (RMS) value of the zeroth-order per-
turbation function (recall equation 5).
The mean rectified value, or mean absolute value, of the zeroth-order perturbation is de-
fined as

δ0 = (1/N) Σi |P0i| .

This measure of perturbation is fundamentally not much different from σ0, but it is a little easier to
compute because it does not involve squares and square roots. Also, it does not weight outliers (large
deviations from the mean) as heavily as σ0 because first-power terms rather than second-power terms
are used in the summation.
In general, a collection of perturbation measures can be written as

σn = [ (1/Nn) Σi Pni² ]^(1/2)  and  δn = (1/Nn) Σi |Pni| ,

where Nn is the number of cycles for which the n-th order perturbation function is defined; the
measures are usually expressed as a percentage of the mean x̄, with δ1 being the most frequently
used measure in the literature. In Figure 2, σ0 has the value of 0.832%, σ1 has the value of 0.419%,
and δ2 has the value of 0.316%.
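In code, the whole family reduces to two one-liners applied to any perturbation function (assuming, as above, normalization by the contour mean to yield percentages):

```python
import numpy as np

def sigma_n(p, mean_x):
    """RMS (sigma-type) perturbation measure, in percent of the contour mean."""
    return 100.0 * np.sqrt(np.mean(np.square(p))) / mean_x

def delta_n(p, mean_x):
    """Mean rectified (delta-type) perturbation measure, in percent of the mean."""
    return 100.0 * np.mean(np.abs(p)) / mean_x

# Using the perturbation-function sketches given earlier:
#   sigma0 = sigma_n(p0(fo), fo.mean())    # standard deviation from the mean
#   delta1 = delta_n(p1(fo), fo.mean())    # the most common jitter measure
```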
A related measure that is not tied to a single cyclic parameter is the harmonics-to-noise ratio
(HNR) of Yumoto, Gould, and Baer (1982). The harmonic energy is defined as

H = N ∫ fA²(t) dt  (integral taken from 0 to T),

where N is the number of cycles, T is the greatest period found among the N cycles, and fA is the
average acoustic waveform per cycle (obtained by padding all cycles to the maximum period with
zeros and averaging point by point from event marker to event marker). The noise energy is then
defined as

Ne = Σi ∫ [fi(t) - fA(t)]² dt  (i = 1, ..., N; integral from 0 to T),

where fi is the waveform in the i-th cycle, and the harmonics to noise ratio is

HNR = 10 log10 (H / Ne) .
If the HNR is used as a perturbation measure, it needs to be noted that this measure is not specific to
any cyclic parameter. Therein lies its asset as well as its liability. One cannot tell if the period, the
amplitude, or the waveshape is perturbed. Simple Gaussian noise added to a periodic waveform can
decrease the HNR, as will jitter or shimmer. Thus, the measure correlates best with an overall per-
ception of “noisiness and roughness” in the signal, regardless of what the source might be. New
approaches described by Qi (1992) and Qi et al. (1995) include a time-base correction that mini-
mizes the effect of jitter as a contributor to noise. Thus, these approaches begin to separate the
sources of noise in the HNR measure.
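A sketch of the Yumoto-style HNR computation follows, assuming the waveform has already been segmented into per-cycle arrays by an event marker (the function name is ours):

```python
import numpy as np

def hnr_db(cycles):
    """Harmonics-to-noise ratio from a list of per-cycle waveforms (1-D
    arrays), zero-padded to the longest cycle (after Yumoto et al., 1982)."""
    T = max(len(c) for c in cycles)
    padded = np.array([np.pad(np.asarray(c, float), (0, T - len(c)))
                       for c in cycles])
    f_avg = padded.mean(axis=0)                  # average waveform fA
    H = len(cycles) * np.sum(f_avg ** 2)         # harmonic energy
    Ne = np.sum((padded - f_avg) ** 2)           # noise energy
    return 10.0 * np.log10(H / Ne)

# Example: 50 identical 10 ms cycles (100 Hz at 8 kHz) plus weak white noise.
rng = np.random.default_rng(0)
t = np.arange(80) / 8000.0
cycles = [np.sin(2 * np.pi * 100 * t) + 0.01 * rng.standard_normal(80)
          for _ in range(50)]
print(f"HNR = {hnr_db(cycles):.1f} dB")
```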
SIGNAL TYPING
The most interesting voice signals are encountered when vocal fold vibration is highly influ-
enced by nonlinearity in tissue and air movement, or when coupled oscillator modes become
desynchronized. For example, two modes of the same vocal fold, or two modes between opposite
folds, may compete for dominance. A resolution to the mode conflict is what we have described as
period-n phonation, whereby each mode is allowed to have its turn, so to speak, making the overall
period much longer. Another resolution is a long-range modulation (over several cycles), the fre-
quency of which is incommensurate with Fo. In some cases, however, there is no resolution at all in
terms of any real or apparent periodicity, and oscillation becomes chaotic.
In the language of nonlinear dynamics, a qualitative change in the behavior of a dynamical
system is known as a bifurcation. It usually occurs when some parameter of the vibrating system is
changed gradually (e.g., lung pressure, vocal fold tension, or asymmetry between the vocal folds).
Figure 5 shows sketches of how glottal flow waveforms transform after two successive bifurcations.
The first bifurcation is seen as a period doubling (part a to part b) whereas the second is seen as a
total loss of periodicity (part b to part c).
On this basis, voice signals can be grouped into three types.
Type 1 signals - nearly-periodic signals that display no qualitative changes in the analysis
segment; if modulating frequencies or subharmonics are present, their energies are an order of mag-
nitude below the energy of the fundamental frequency.
Type 2 signals - signals with qualitative changes (bifurcations) in the analysis segment, or
signals with subharmonic frequencies or modulating frequencies whose energies approach the en-
ergy of the fundamental frequency; there is therefore no obvious single fundamental frequency
throughout the segment.
Type 3 signals - signals with no apparent periodic structure.
A spectrogram is useful in making the classification. For example, Figure 6 shows a spec-
trogram of a patient with hyperfunctional childhood dysphonia. The fundamental frequency is 300
Hz. Bifurcations can be seen to occur around 400 ms (the beginning of a period-3 phonation),
around 900 ms (return to the original), and around 1100-1200 ms (beginning of a mixture between
period-3 and period-4 phonation). The signal is therefore classified as type 2.
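For signal typing, a narrow-band spectrogram can be produced with standard tools. The sketch below (ours) uses scipy and assumes a mono waveform y at sampling rate fs; a window of roughly 40 ms gives the frequency resolution (about 25 Hz) needed to resolve subharmonics of a 300 Hz fundamental.

```python
import numpy as np
from scipy.signal import spectrogram

def narrowband_spectrogram(y, fs, window_ms=40.0):
    """Narrow-band spectrogram in dB; a long analysis window trades time
    resolution for the frequency resolution needed to see subharmonics."""
    nper = int(fs * window_ms / 1000.0)
    f, t, Sxx = spectrogram(y, fs=fs, nperseg=nper,
                            noverlap=nper // 2, window="hann")
    return f, t, 10.0 * np.log10(Sxx + 1e-12)
```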
Figure 6. Narrow-band computer spectrogram for a patient with hyperfunctional childhood dysphonia. Abrupt transitions to different phonatory regimes are visible, indicating bifurcations in vocal fold vibration.
Figure 7. Fundamental frequency (Fo) profile for the patient with hyperfunctional childhood dysphonia.
A fundamental frequency profile, similar to that of Figure 2, is shown for this dysphonic
patient in Figure 7. Note that bifurcations can be identified in the Fo contour as segments where the
Fo extractor is uncertain about the constant 298 Hz value. In two cycles the extracted Fo drops down
to 98 Hz, close to the Fo/3 subharmonic. In one case, the extracted Fo jumps to 420 Hz. In general,
Fo is extracted reliably only in the three segments where the waveform is nearly periodic.
The second-order perturbation function has wild fluctuations. It is clear from this display
that a single perturbation measure for the entire segment is meaningless and that the visual displays
carry more information than can be characterized by a single number.
As another example, analysis was performed on the waveform of a patient with unilateral
laryngeal nerve paralysis (Figure 8). The waveform itself shows intermittent segments of low fre-
quency modulation (segments b and d). The fundamental frequency is 285 Hz and the modulation
frequency is 32 Hz. If only segments a, c, and e had been acquired and analyzed, the signal would
have been classified type 1. As it is, it is clearly a type 2 signal.
Figure 8. Microphone signal of a patient with unilateral laryngeal nerve paralysis. Parts (a) to (e) should be viewed serially, 200 ms per segment, for a total of 1 s (after Herzel et al., 1994).
Figure 9. Narrow-band spectrogram of a patient with unilateral laryngeal nerve paralysis, corresponding to the waveform in Figure 8.
In the Fo profile, shown in Figure 10, the Fo contour again shows some large fluctuations in
the segments where 30 Hz modulation takes place. The Fo extractor is trying to recognize the pres-
ence of a 285 Hz fundamental, but gets confused with the modulation frequency. The second order
perturbation function again exhibits large fluctuations (much greater than ±10%), indicating that
perturbation measures will be unreliable. Finally, the power spectrum of the Fo contour shows the
modulation frequency as a strong peak between 30 and 40 Hz.
Figure 10. Fundamental frequency (Fo) profile for the patient with unilateral laryngeal nerve paralysis, corresponding to Figures 8 and 9.
TEST UTTERANCES
The traditional clinical goals of constructing test utterances are to determine (1) how voice
affects speech intelligibility and communication effectiveness and (2) what insight can be gained
about laryngeal health or general body condition. An additional pedagogical goal would be to deter-
mine (3) how the effectiveness of vocal training can be quantified.
Historically, clinicians have used a battery of test utterances that progress from vowels to
isolated syllables or words to complete sentences or paragraphs. Almost everyone agrees that the
tasks must reveal control of pitch, loudness, and some aspect of vocal quality. In addition, the
interaction among respiratory, phonatory, and articulatory components of speech is important to
most clinicians.
Table 1 shows a set of utterances. The top half of the table lists a variety of nonspeech
utterances, and the bottom half lists some speech utterances. The battery includes most of the utter-
ances used historically but expands the list significantly in the direction of dynamic testing. Phonatory
glides are introduced for the assessment of coordinated muscle activity in the larynx and respiratory
system.
All utterances may be customized to an individual’s Voice Range Profile (VRP). This VRP
should be obtained first to establish the bounds for further testing. Low, medium, and high pitch can
then be defined as some percentage of the Fo range, say 10%, 50%, and 80%. The same can be done
to define soft, medium, and loud intensity. With these definitions, sustained vowels are elicited at
strategic locations within the VRP to determine phonatory stability. This is followed by [s] and [z]
consonants for respiratory competence. Finally, a series of pitch, loudness, adduction, and register
glides are executed to determine range, speed, accuracy, and stability in phonation. Tests of this kind
were discussed by Kent et al. (1987).
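The percentage convention can be made explicit in a few lines; a linear-in-Hz reading of "percentage of the Fo range" is assumed here (a semitone scale would be an equally defensible choice), and the function name is ours.

```python
def vrp_targets(fo_min, fo_max, fractions=(0.10, 0.50, 0.80)):
    """Low/medium/high test frequencies as fractions of the VRP Fo range."""
    return [fo_min + f * (fo_max - fo_min) for f in fractions]

# Example: a VRP spanning 80-400 Hz yields targets of 112, 240, and 336 Hz.
print(vrp_targets(80.0, 400.0))
```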
In the second half of the table, speech and song material is used with increasing phonetic,
emotional, and artistic complexity. After traditional counting, an all-voiced sentence is first used to
test F o control independent of adductory control. This is followed by a sentence with frequent voic-
ing onset and offset tailored to specific larynges. The “Rainbow Passage,” an often-used paragraph
in speech diagnostics for English, is then administered as a de facto standard. At this point, some
parent-child speech is attempted. Exaggerated Fo, intensity, and register patterns emerge in this test
as subjects mimic typical parentese, such as that found in the “Goldilocks” story. Further testing of
extreme Fo and intensity patterns (with highly expressive vocalizations) comes with a dramatic reci-
tation, such as one of Shakespeare’s soliloquies. Finally, a portion of a familiar song (“Happy Birth-
day”) is sung in both modal and falsetto register to examine “heavy” and “light” production in a
singing mode. The use of falsetto singing has been found to be useful in detecting swelling of vocal
fold tissue (Bastian et al., 1990).
Table 1
Proposed Test Utterances
NONSPEECH
Voice Range Profile defines test frequencies and intensities (low = 10% of Fo range, medium = 50% of Fo range, high = 80% of Fo
range; soft = 10% of intensity range, medium = 50% of intensity range, loud = 80% of intensity range)
Pitch Glides
1. low-high-low, one octave, 0.25 Hz
2. low-high-low, one octave, 1.0 Hz
3. low-high-low, one octave, maximum rate
Loudness Glides
1. soft-loud-soft, 0.25 Hz
2. soft-loud-soft, 1.0 Hz
3. soft-loud-soft, maximum rate
Register Glides
1. modal-pulse-modal, 0.1 Hz
2. modal-falsetto-modal, 0.1 Hz
3. modal-falsetto-modal, maximum rate, as in yodeling
SPEECH
Counting from 1 to 100, comfortable pitch and loudness
All-voiced sentence, “Where are you going?”, soft, medium, loud
Sentence with frequent voice onset and offset, “The blue spot is on the key again”, soft, medium, loud
Oral reading of “Rainbow Passage”
Descriptive speech, “Cookie Theft” picture
Parent-child speech, “Goldilocks and The Three Little Bears”
Dramatic speech involving deep emotions (fear, anger, sadness, happiness, disgust)
Singing part of “Happy Birthday to you”, modal and falsetto register
SUMMARY OF RECOMMENDATIONS
The workshop participants discussed and approved a number of recommendations. They are
divided into several subheadings dealing with classification of signals, extraction of cyclic param-
eters, test utterances, acquisition of signals, file formats, and data base sharing. Whenever references
are given, they are not intended to be the original or most authoritative, but those that contain more
detailed explanations by the workshop participants and their colleagues.
C. Test Utterances for Voice Analysis
Test utterances for acoustic voice analysis can be classified as (a) sustained vowels and
sustained voiced consonants, (b) vowels and voiced consonants with prescribed patterns of a cyclic
parameter (e.g., glides, scales, etc.), or (c) speech utterances.
C1. Sustained vowels should continue to be used for voice perturbation analysis because they
elicit a stationary process in vocal fold vibration.
C2. If utterances with prescribed patterns (e.g., Fo glides, intensity glides, etc.) are used, the
patterns should be removed in the analysis and not included as part of the perturbation measure.
C3. Whenever possible, a high vowel ([i] or [u]) and a low vowel ([a] or [ɑ]) should be used to
report voice perturbation because source-vocal tract interactions are vowel dependent and can there-
fore influence laryngeal behavior.
C4. Multiple tokens of a sustained vowel (on the order of 10) are necessary to obtain reliable
perturbation measures (Scherer et al., in press). Generally, the number of tokens required increases
with the size of the perturbation measure.
C5. Since voice perturbations vary with Fo, intensity, and voice quality, these quantities should
be defined whenever inter- and intra-subject differences are reported.
E. File Formats
A number of file formats exist for speech and voice data (e.g. SPHERE, ILS, RIFF, Kay's
NSP, CSRE40, CSpeech and NCVS92). These formats have been developed over many years and
have a number of adherents.
E1. SPeech HEader REsources (SPHERE), developed by the National Institute of Standards
and Technology (NIST), has the potential for high usage within the general scientific community,
and is recommended. It is currently being used for the dissemination of the Texas Instruments-MIT-
NIST (TIMIT) speech database. It contains a 1024-byte ASCII header followed by the data (which
may be compressed). The header consists of a fixed-format portion identifying the header type and
the length of the header. Following this is the object-oriented free format portion of the header,
which describes such characteristics as sampling rate, channel count, and coding method. Software
utilities have been provided by NIST for reading, writing and compressing data files. Information
and software are available through Jon Fiscus, National Institute of Standards and Technology, Bldg.
225, Room A-216, Gaithersburg, Maryland 20899.
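A hedged sketch of reading such a header follows; the "name -type value" field syntax matches common SPHERE usage, but the field names and details should be checked against the NIST documentation for a given file.

```python
def read_sphere_header(path):
    """Parse the 1024-byte ASCII SPHERE header into a name -> value dict."""
    with open(path, "rb") as fh:
        header = fh.read(1024).decode("ascii", errors="replace")
    fields = {}
    for line in header.splitlines():
        if line.strip() == "end_head":
            break
        parts = line.split(None, 2)          # expected: name, -type, value
        if len(parts) == 3 and parts[1].startswith("-"):
            fields[parts[0]] = parts[2]
    return fields

# Typical fields (per common SPHERE usage): sample_rate, channel_count,
# sample_n_bytes, sample_coding.
```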
E2. If the data are to be used outside the general scientific community, consist of multiple
sources (e.g. video and audio), or require compatibility with common PC based sound cards, the
Microsoft RIFF format (which defines WAV files) is recommended. The RIFF format is very similar
to Kay Elemetrics' NSP format, which has been used widely in clinically-based voice laboratories.
Kay provides utilities for conversion between RIFF and NSP.
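RIFF/WAV files can be read with Python's standard-library wave module; the following is a minimal sketch for getting 16-bit PCM samples into the analyses above.

```python
import wave
import numpy as np

def read_wav(path):
    """Return (samples scaled to [-1, 1], sampling rate) for 16-bit PCM WAV."""
    with wave.open(path, "rb") as wf:
        fs = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
        width = wf.getsampwidth()
    if width == 2:                            # 16-bit PCM
        y = np.frombuffer(raw, dtype=np.int16).astype(float) / 32768.0
    else:
        raise ValueError(f"unhandled sample width: {width}")
    return y, fs
```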
E3. If neither of these formats is suitable, it is recommended that the format chosen conform
to a structure in which the header and data are isolated, so that others may strip the header to gain
access to the data. NCVS92, ILS, RIFF, and SPHERE are some of the formats that adhere to this
principle.
Crossover Frequency: The fundamental frequency for which there is an equal probability for perception of
two adjacent registers.
Cyclic Parameter: Any quantity that is defined within a cycle (e.g. amplitude, period, open quotient, skewing
quotient in the context of any periodic repetition of the event).
Dichrotic: See biphonation.
Diplophonia: Phonation in which the pitch is supplemented with another pitch that corresponds to a frequency
an octave lower; some roughness is usually perceived; dynamically, there is a period doubling (an
Fo/2 subharmonic).
Divergent Glottis: The glottis widens from bottom to top.
Dysphonic: Abnormal in phonation.
Falsetto Register: A register in which the voice is perceived to be continuous (non-pulsed) and weak in timbre;
acoustically, the fundamental carries the greatest amount of energy; physiologically, only partial con-
tact is made between the vocal folds, especially vertically.
Fluctuation: A back and forth irregular movement, usually indicating instability in a system.
Flutter: Phonation with amplitude or frequency modulations (or both) in the 8-12 Hz range; also called bleat,
as in the bleating of a lamb.
Forced Oscillation: Oscillation imposed on a system by an external periodic source.
Free Oscillation: An oscillation without any imposed driving forces.
Frequency: The number of events per second; in a sinusoid, the number of cycles (2π radians) per second.
Fundamental Period: In a periodic signal, the smallest value To that satisfies the relation f(t+To)=f(t) for all
time t; in a voice signal, instantaneous To is the time between two cyclic (recurring) events, whereas
average To is the smallest constant inter-event duration that best matches a series of prominent recurring
events.
Fundamental Pitch: In a voiced sound, the lowest perceived pitch associated with vocal fold vibration.
Fundamental Frequency: The inverse of fundamental period.
Glottalized Voice: A voice that contains frequent transient sounds (clicks) that result from relatively forceful
adduction or abduction during phonation.
Glottis: The airspace between the vocal folds.
Harmonic Frequencies: Frequencies that are related to the fundamental frequency by an integer ratio.
Histogram: A display of the number of times a variable takes on a certain value, or a small range of values, in
its total range; also known as the distribution density of the variable.
Hoarse Voice: The combination of rough voice and breathy voice.
Honky (Nasal) Voice: A voice quality associated with the excessive acoustic energy coupling to the nasal tract;
acoustically, nasality is characterized by a low-frequency murmur and spectral zeros.
Jitter: A short-term (cycle-to-cycle) variation in the fundamental frequency of a signal.
Lift: A transition point along a pitch scale where vocal production becomes easier (lifted). The term is used to
describe register transitions.
Loft: A suggested term for the highest (loftiest) register; usually referred to as falsetto voice.
Loudness: The psychoacoustic perceptual measure of a sound on a strong-weak continuum; the primary
acoustic correlate is sound pressure level.
Mean: The value obtained by adding up N numbers and dividing by N.
Mean Rectified: The value obtained by first rectifying (taking the absolute value of) a set of numbers and then
taking the mean.
Median: The value obtained by forming a histogram of a set of numbers and finding the point for which the
numbers of entries above and below it are equal.
Median Rectified: The value obtained by first rectifying (taking the absolute value of) a set of numbers and
then finding the median.
Rough Voice: An uneven, bumpy quality that appears to be unsteady in the short-term, but stationary in the
long-term; acoustically, the waveform is often aperiodic, with the modes of vibration lacking synchrony,
but voices with subharmonics can also be perceived as rough.
Self-Sustained Oscillation: An oscillation that continues indefinitely without a periodic driving force; since
the net energy loss per cycle must be zero, self-sustained oscillation requires an energy source.
Shimmer: A short-term (cycle-to-cycle) variation in the amplitude of a signal.
Spectral Slope: A measure of how rapidly energy decreases with increasing frequency, or, for periodic
waveforms, with increasing harmonic number. Also known as spectral tilt or spectral roll-off.
Stationarity: The property of a signal that suggests no long-term drifts; the autocorrelation function
<x(t) * x(t+δ)> depends only on δ, not on t, and decays to zero with increasing δ; the spectrogram
remains constant over time.
Strained (Tense) Voice: A voice that appears effortful; visually, hyperfunction of the neck muscles is apparent;
the entire larynx seems compressed.
Strohbass: Literal translation from German, “straw bass”, because of its perceptual similarity to crackling
straw; it is effectively the pulse register when used in singing.
Subharmonic Frequencies: Frequencies that lie between or below the harmonic frequencies and are rational
divisions of the fundamental frequency (e.g. 1/2, 1/3) or their integer multiples.
Temporal Gap Transition: The transition from a continuous sound to a series of pulses in the perception of
vocal registers.
Tremor: A 1-15 Hz modulation of a cyclic parameter (e.g. amplitude or fundamental frequency), either of a
neurologic origin or an interaction between neurological and biomechanical properties of the vocal
folds. See flutter, vibrato, and wow.
Trill: A rapid alternation of a primary note with a secondary note (usually a semitone or a tone higher); used as
an ornament in music.
Trillo: A rapid repetition of the same note in the 8-12 Hz range; used as an ornament in music.
Twangy Voice: A sharp, bright quality, as produced by a plucked string. Twang is often attributed to nasality,
but it is probably more laryngeally-based. It is often part of a dialect or singing style.
Variability: Literally, the ability of something to vary, by design or by accident. More formally, the amount of
variation as determined by a statistical measure.
Ventricular Phonation: Phonation with the false vocal folds; unless intentional, it is generally considered an
abnormal muscle pattern dysphonia associated with hyperactivity in the false fold region.
Vibrato: A natural ingredient of a singing voice, especially in classical Western singing; acoustically, a 4-7 Hz
sinusoidal modulation of Fo and/or intensity; the modulation extent is typically ±3% in frequency, but
varies considerably in amplitude. Physiologically, the origin of natural vibrato lies in laryngeal muscle
contraction rather than lung pressure modulations.
Whisper: Speech produced by turbulent glottal airflow in the absence of vocal fold vibration.
Whistle Register: A register in which the sound is perceived as a whistle, usually high in pitch and flute-like in
quality; physiologically, the claim is that a posterior glottal gap can serve as an orifice for vortex
shedding and an epilaryngeal resonator can reinforce the sound, but the resonance mechanism is as
yet speculative.
Wobble: See wow.
Wow (Wobble): Phonation with amplitude and/or frequency modulations in the 1-3 Hz range.
Yawny Voice: A quality associated with a lowered larynx and widened pharynx, as in a yawn.
Acknowledgement
The author has been greatly influenced by the writings of (and personal communication with) Dr.
Hanspeter Herzel. He read the manuscript with interest and care and made many suggestions.
References

Aronson, A., Ramig, L., Winholtz, W., & Silber, S. (1992). Rapid voice tremor, or “flutter”, in amyotrophic lateral sclerosis.
Annals of Otology, Rhinology & Laryngology, 101(6), 511-518.
Atal, B., Miller, J., & Kent, R. (1991). Papers in Speech Communication: Speech Processing. Woodbury, NY: Acoustical
Society of America.
Baken, R. J. (1990). Irregularity of vocal period and amplitude: A first approach to the fractal analysis of voice. Journal
of Voice, 4(3), 185-197.
Bastian, R. W., Keidar, A., & Verdolini-Marston, K. (1990). Simple vocal tasks for detecting vocal fold swelling. Journal
of Voice, 4(2), 172-183.
Bendat, J., & Piersol, A. (Eds.). (1986). Random Data: Analysis and Measurement Procedures. New York: John Wiley and
Sons.
Bergé, P., Pomeau, Y. & Vidal, C. (1984). Order Within Chaos: Toward A Deterministic Approach to Turbulence. New York:
John Wiley & Sons.
Berry, D., Herzel, H., Titze, I.R., & Krischer, K. (1994). Interpretation of biomechanical simulations of normal and chaotic
vocal fold oscillations with empirical eigenfunctions. Journal of the Acoustical Society of America, 95(6), 3595-
3604.
Cox, N. (1989). Technical considerations in computations of spectral harmonics-to-noise ratio for sustained vowels.
Journal of Speech and Hearing Research, 32(1), 203-218.
Deem, J.F., Manning, W.H., Knack, J.V., & Matesich, J.S. (1989). The automatic extraction of pitch perturbation using
microcomputers: Some methodological considerations. Journal of Speech and Hearing Research, 32, 689-697.
Doherty, E., & Shipp, T. (1988). Tape recorder effects on jitter and shimmer extraction. Journal of Speech and Hearing
Research, 31, 485-490.
Gerratt, B.R., & Kreiman, J. (1995). The utility of acoustic measures of voice quality. In D. Wong (Ed.), Workshop on
Acoustic Voice Analysis. Iowa City, IA: National Center for Voice and Speech.
Hakes, J., Doherty, E., & Shipp, T. (1990). Trillo rates exhibited by professional early music singers. Journal of Voice, 4(4),
305-308.
Hays, W. (1988). Statistics, 4th Ed. New York: Holt, Rinehart & Winston, Inc.
Herzel, H., Steinecke, I., Mende, W., & Wermke, K. (1991). Chaos and bifurcations during voiced speech. In E. Mosekilde
(Ed.), Complexity, Chaos, and Biological Evolution (pp 41-50). New York: Plenum Press.
Herzel, H., Berry, D., Titze, I.R., & Saleh, M. (1994). Analysis of vocal disorders with methods from nonlinear dynamics.
Journal of Speech and Hearing Research, 37(5), 1001-1007.
Hess, W. (1983). Pitch Determination of Speech Signals: Algorithms and Devices. Berlin, Heidelberg, New York, Toronto:
Springer-Verlag.
Hess, W.J. (1995). Pitch determination of speech signals - with special emphasis on time domain methods. In D. Wong
(Ed.), Workshop on Acoustic Voice Analysis. Iowa City, IA: National Center for Voice and Speech.
Hillenbrand, J. (1987). A methodological study of perturbation and additive noise in synthetically generated voice signals.
Journal of Speech and Hearing Research, 30, 448-461.
Kasuya, H., Ogawa, S., Mashima, K., & Ebihara, S. (1986). Normalized noise energy as an acoustic measure to evaluate
pathologic voice. Journal of the Acoustical Society of America, 80, 1329-1334.
Kent, R. D., Kent, J. F., & Rosenbek, J. C. (1987). Maximum performance tests of speech production. Journal of Speech
and Hearing Disorders, 52, 367-387.
Kent, R., Atal, B., & Miller, J. (1991). Papers in Speech Communication: Speech Production. Woodbury, NY: Acoustical
Society of America.
Klingholz, F. (1987). The measurement of the signal-to-noise ratio (SNR) in continuous speech. Speech Communication,
6, 15-26.
Koda, J. & Ludlow, C. (1992). Evaluation of laryngeal muscle activation in patients with voice tremor. Otolaryngology -
Head and Neck Surgery, 107(5), 684-696.
Koike, Y. (1973). Application of some acoustic measures for the evaluation of laryngeal dysfunction. Stud. Phonol., VII,
17-23.
Lemke, J., & Samawi, H.M. (1995). Establishment of normal limits for speech characteristics. In D. Wong (Ed.), Workshop
on Acoustic Voice Analysis. Iowa City, IA: National Center for Voice and Speech.
Lieberman, P. (1961). Perturbations in vocal pitch. Journal of the Acoustical Society of America, 33, 597-602.
Lieberman, P. (1963). Some acoustic measures of the fundamental periodicity of normal and pathologic larynges. Journal
of the Acoustical Society of America, 35, 344-353.
Markel, J.D. & Gray, A.H., Jr. (1976). Linear Prediction of Speech. New York: Springer-Verlag.
Milenkovic, P. (1987). Least mean square measures of voice perturbation. Journal of Speech and Hearing Research, 30,
529-538.
Milenkovic, P.H. (1995). Rotation-based measure of voice aperiodicity. In D. Wong (Ed.), Workshop on Acoustic Voice
Analysis. Iowa City, IA: National Center for Voice and Speech.
Miller, J., Kent, R., & Atal, B. (1991). Papers in Speech Communication: Speech Perception. Woodbury, NY: Acoustical
Society of America.
Moon, F.C. (1987). Chaotic Vibrations: An Introduction for Applied Scientists and Engineers. New York: John Wiley &
Sons.
Niimi, S., Horiguchi, S., Kobayashi, N., & Yamada, M. (1988). Electromyographic study of vibrato and tremolo in singing.
In O. Fujimura (Ed.), Voice Production, Mechanisms and Functions (pp. 403-414). New York: Raven Press.
Orlikoff, R. (1990). Heartbeat-related fundamental frequency and amplitude variation in healthy young and elderly male
voices. Journal of Voice, 4(4), 322-328.
Perkell, J.S. & Klatt, D.H. (1986). Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence Erlbaum
Associates, Publishers.
Pinto, N. & Titze, I. (1990). Unification of perturbation measures in speech analysis. Journal of the Acoustical Society of
America, 87(3), 1278-1289.
Qi, Y. (1992). Time normalization in voice analysis. Journal of the Acoustical Society of America, 92, 2569-2576.
Qi, Y.Y., Weinberg, B., Bi, N., & Hess, W.K. (1995). Minimizing the effect of period determination on the computation of
amplitude perturbation in voice. In D. Wong (Ed.), Workshop on Acoustic Voice Analysis. Iowa City, IA:
National Center for Voice and Speech.
Rabiner, L.R., & Schafer, R.W. (1978). Digital Processing of Speech Signals. Englewood Cliffs NJ: Prentice-Hall.
Rabinov, C.R., Kreiman, J., & Gerratt, B.R. (1995). Comparing reliability of perceptual and acoustic measures of voice.
In D. Wong (Ed.), Workshop on Acoustic Voice Analysis. Iowa City, IA: National Center for Voice and Speech.
Ramig, L. & Shipp, T. (1987). Comparative measures of vocal tremor and vocal vibrato. Journal of Voice, 1(2), 162-167.
Rothenberg, M. (1973). A new inverse-filtering technique for deriving the glottal air flow waveform during voicing.
Journal of the Acoustical Society of America, 53(6), 1632-1645.
Scherer, R., Vail, V., & Guo, C. (in press). Required number of tokens to establish reliable voice perturbation values.
Journal of Speech and Hearing Research.
Takahashi, H., & Koike, Y. (1975). Some perceptual dimensions and acoustic correlates of pathological voices. Acta
Otolaryngologica (Stockholm), Suppl. 338, 2-24.
Talkin, D. (1995). Cross correlation and dynamic programming for estimation of fundamental frequency. In D. Wong
(Ed.), Workshop on Acoustic Voice Analysis. Iowa City, IA: National Center for Voice and Speech.
Terhardt, E. (1974). On the perception of periodic sound fluctuations (roughness). Acustica, 30, 201-213.
Titze, I.R., Horii, Y., & Scherer, R.C. (1987). Some technical considerations in voice perturbation measurements. Journal
of Speech and Hearing Research, 30, 252-260.
Titze, I.R. (1991). A model for neurologic sources of aperiodicity in vocal fold vibration. Journal of Speech and Hearing
Research, 34, 460-472.
Titze, I., Baken, R. & Herzel, H. (1993). Evidence of chaos in vocal fold vibration. In I. Titze (Ed.), Vocal Fold Physiology:
Frontiers in Basic Science (pp 143-188). San Diego: Singular Publishing Group.
Titze, I. & Liang, H. (1993). Comparison of Fo extraction methods for high precision voice perturbation measurements.
Journal of Speech and Hearing Research, 36(6), 1120-1133.
Titze, I.R., & Winholtz, W.S. (1993). The effect of microphone type and placement on voice perturbation measurements.
Journal of Speech and Hearing Research, 36(6), 1177-1190.
Titze, I.R., Solomon, N.P., Luschei, E.S., & Hirano, M. (1994). Interference between normal vibrato and artificial stimu-
lation of laryngeal muscles at near vibrato rates. Journal of Voice, 8(3), 215-223.
Wendahl, R.W. (1966). Laryngeal analog synthesis of jitter and shimmer auditory parameters of harshness. Folia Phoniatrica,
18, 99-108.
Winholtz, W., & Titze, I. (in press). Miniature head mount microphone for acoustic analysis. Journal of Speech and
Hearing Research.
Yumoto, E., Gould, W. J., & Baer, T. (1982). The harmonics-to-noise ratio as an index of the degree of hoarseness. Journal
of the Acoustical Society of America, 71, 1544-1550.