
Modulation spectra of natural sounds and ethological theories of auditory processing

Nandini C. Singh, and Frédéric E. Theunissen

Citation: The Journal of the Acoustical Society of America 114, 3394 (2003); doi: 10.1121/1.1624067
View online: https://doi.org/10.1121/1.1624067
View Table of Contents: https://asa.scitation.org/toc/jas/114/6
Published by the Acoustical Society of America

Modulation spectra of natural sounds and ethological theories of auditory processing

Nandini C. Singh(a) and Frédéric E. Theunissen(b)
Department of Psychology and Neuroscience Institute, University of California, Berkeley, 3210 Tolman Hall, Berkeley, California 94720-1650

(a) Current affiliation: National Brain Research Center; Sector-15, Part II; Gurgaon, 122 001, Haryana, India.
(b) Author to whom correspondence should be addressed. Electronic mail: fet@socrates.berkeley.edu

(Received 23 April 2003; revised 28 August 2003; accepted 15 September 2003)


The modulation statistics of natural sound ensembles were analyzed by calculating the probability
distributions of the amplitude envelope of the sounds and their time-frequency correlations given by
the modulation spectra. These modulation spectra were obtained by calculating the two-dimensional
Fourier transform of the autocorrelation matrix of the sound stimulus in its spectrographic
representation. Since temporal bandwidth and spectral bandwidth are conjugate variables, it is
shown that the joint modulation spectrum of sound occupies a restricted space: sounds cannot have
rapid temporal and spectral modulations simultaneously. Within this restricted space, it is shown that
natural sounds have a characteristic signature. Natural sounds, in general, are low-passed, showing
most of their modulation energy for low temporal and spectral modulations. Animal vocalizations
and human speech are further characterized by the fact that most of the spectral modulation power
is found only for low temporal modulation. Similarly, the distribution of the amplitude envelopes
also exhibits characteristic shapes for natural sounds, reflecting the high probability of epochs with
no sound, systematic differences across frequencies, and a relatively uniform distribution for the log
of the amplitudes for vocalizations. It is postulated that the auditory system as well as engineering
applications may exploit these statistical properties to obtain an efficient representation of
behaviorally relevant sounds. To test such a hypothesis we show how to create synthetic sounds with
first and second order envelope statistics identical to those found in natural sounds. © 2003 Acoustical Society of America. [DOI: 10.1121/1.1624067]

PACS numbers: 43.80.Ka, 43.64.Bt, 43.64.Qh [WA]    Pages: 3394–3411

I. INTRODUCTION

Natural sounds span a restricted range of all possible sounds just as natural scenes only represent a small subset of all possible images (Attneave, 1954; Field, 1987). This phenomenology can be quantified by calculating the degree of statistical redundancy found in natural sounds. The use of this redundancy is clearly demonstrated by the multiple forms of compression that are available for the digital storage of music and that result in relatively little perceptual degradation (Painter and Spanias, 2000). We will argue that the characterization of the statistics of natural sounds is also potentially important for understanding acoustical perception and its underlying neuro-physiological basis. A theory of neural representation and neural computation in sensory systems that takes into account the natural environment, as originally proposed by Attneave (1954) and Barlow (1961), has been fruitful in advancing our understanding of the visual system, and we propose that a similar approach will lead to insights in auditory science. This theoretical framework leads to a series of predictions and experiments that have now demonstrated how neural computations and representations in the early stages of the visual system are adapted to the processing of natural scenes (reviewed in Simoncelli and Olshausen, 2001). For example, the spatio-temporal receptive fields of visual neurons have been shown to perform optimal filtering operations on natural images (van Hateren, 1992a; Dan et al., 1996).

The use of natural sounds for understanding auditory processing has, for the most part, followed a different path. On one hand, auditory neuroethologists were pioneers in the use of behaviorally relevant stimuli to probe the physiology of the sensory systems. This approach led to the classic discoveries of pulse-echo tuned neurons in the bat (Suga et al., 1978), song selective neurons in songbirds (Margoliash, 1983) and call selective neurons in the primate (Newman and Wollberg, 1978). In this respect, the auditory system appears to be at least as "selective" for specific natural sounds as the visual system is for specific natural images. On the other hand, a systematic study of the statistical structure that characterizes these natural vocalizations, and that would yield theoretical predictions for the response properties of single neurons or networks of auditory neurons, has not been pursued to the same degree as in the visual modality.

Two studies have taken this systematic approach by analyzing the statistics of the sound pressure waveform. In an initial study, Rieke et al. (1995) demonstrated that auditory nerve fibers in the frog transmitted information more efficiently when the power spectrum of broadband sounds matched the power spectrum of the natural frog call. More recently, by examining the higher-order statistics of natural sounds, Lewicki found that the basis set that best represented the independent components of vocalizations was obtained by a Fourier decomposition, whereas the basis set that best represented the independent components of environmental sounds was obtained by a wavelet decomposition.

The biological basis set generated by the filtering properties of the cochlea and the hair cells fell in the middle of these two solutions, suggesting that the initial stage of auditory processing could have evolved to be optimized to the different statistics of these two important groups of natural sounds (Lewicki, 2002).

Here we extend this theoretical approach to the auditory computations and representations not of the sound pressure waveform but of the spectro-temporal amplitude envelopes that are obtained by the decomposition of sound into frequency channels. This decomposition is performed in biological systems by the cochlea and in engineering applications, such as speech recognition, by the use of filter banks. The importance of the spectro-temporal amplitude envelopes of sound in capturing the significant statistics of natural sounds, as well as in predicting the neural response of higher-level auditory neurons, is very well documented. First, spectrograms are used extensively in the analysis of animal vocalizations, not only because they provide a clear pictorial representation of the different types of vocal gestures, but also because the spectrographic representation is a better pictorial match of our perception of the sound than any plot of the sound pressure waveform. For the same reasons, time-frequency representations are used extensively in preprocessing stages of speech recognition or sound compression algorithms (Painter and Spanias, 2000). Second, the importance of the statistical structure of these envelopes for speech perception has clearly been demonstrated. Degradation of this structure along either the spectral or temporal dimension results in loss of intelligibility (Drullman et al., 1994; Drullman, 1995; Shannon et al., 1995). Similarly, psychophysical studies have shown that humans are particularly sensitive to either temporal modulations alone (Viemeister, 1979), spectral modulations alone (Green, 1986) or the joint spectro-temporal modulations (Chi et al., 1999) of these amplitude envelopes, and that this sensitivity is restricted to relatively low modulation rates. Finally, whereas auditory neurons in the auditory midbrain and forebrain are not sensitive to the phase of the sound pressure waveform, they do acquire novel temporal and spectral amplitude modulation tuning that is not observed at the lower levels of the auditory processing stream (Popper and Fay, 1992). For these reasons, the characterization of the response properties of higher-level auditory neurons has included their response to amplitude modulated tones (Phillips and Hall, 1987; Eggermont, 2002), spectrally modulated sounds (Schreiner and Calhoun, 1994; Calhoun and Schreiner, 1998) and more recently to complex spectro-temporal stimuli which are used to extract the joint spectral-temporal receptive fields (STRFs) of the neurons (Eggermont et al., 1983; deCharms et al., 1998; Theunissen et al., 2000; Depireux et al., 2001; Sen et al., 2001; Escabi and Schreiner, 2002; Miller et al., 2002).

Since both auditory perception and the responses of auditory neurons seem to be particularly sensitive to the structure in sound amplitude envelopes, it becomes crucial to describe the statistical nature of this structure in natural sounds. Attias and Schreiner (1997) have begun to study the second order statistics of amplitude envelopes along the temporal dimension, but very little is known about the joint statistics of the spectro-temporal modulations of natural sounds. For this reason, we investigated the lower order joint statistics of three different ensembles of natural sounds: human speech, zebra finch song and environmental sounds. As was done in Attias and Schreiner (1997), we calculated and fitted the probability distributions of amplitudes for such envelopes. We then calculated the joint second order statistics of the amplitude envelopes, which we call the modulation spectrum. We found that natural sounds have a characteristic modulation spectrum and discuss the implications of our results for an ethologically based theory of auditory processing.

II. METHODS

A. Estimating modulation spectra

Figure 1 illustrates how the modulation spectrum of a sound ensemble is defined. First, a specific time-frequency representation of the sound is calculated. For example, in Fig. 1(a), a spectrographic representation is chosen to display the spectral and temporal structure present in a zebra finch song. This time-frequency representation can be expressed in its Fourier domain. On the right panel, the 2-D image made by the spectrogram is shown as a weighted sum of sinusoidal gratings of variable period, orientation and phase. Each spectrographic "grating" corresponds to a particular broadband sound called a ripple sound. The ripple sounds are characterized by their sinusoidal amplitude modulations in time and in frequency. The function describing the amplitude envelope for each frequency band f of a particular ripple sound is written as

S_i(t, f) = A_i cos(2π ω_{t,i} t + 2π ω_{f,i} f + φ_i).   (1)

The spectrogram for the sound of interest can then be written as a sum of such ripple components (or the equivalent integral in a continuous formulation):

S(t, f) = A_0 + Σ_i S_i(t, f).

A_i determines the relative strength of the modulation depth (relative to the dc term A_0) for that particular ripple sound component. The parameter ω_t describes the modulation frequency of the amplitude envelope along the temporal dimension and has units of Hz. In this report, it is referred to as the temporal modulation frequency or simply the modulation frequency. In other reports it has also been named ripple velocity or drifting velocity. Since ripple velocity has been used to describe the number of frequency units spanned per second by the ripple (ω_t/ω_f), we will avoid the use of that term. The parameter ω_f describes the modulation frequency of the amplitude envelope along the spectral dimension and has units of 1/Hz or, for wavelet time-frequency representations, 1/oct. In this report, it is referred to as the spectral modulation frequency. It has also been called ripple density or ripple peak density (Chi et al., 1999; Klein et al., 2000; Depireux et al., 2001; Escabi and Schreiner, 2002). φ_i is the initial phase of the ripple. Note that although we use the symbol ω, the modulation frequencies in Eq. (1) and in the rest of the paper are specified in units of oscillation frequencies and not in angular frequencies.
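As a concrete illustration of Eq. (1), the short Python sketch below builds the spectrographic envelope of a single ripple component on a discrete time-frequency grid. The grid spacings and parameter values are arbitrary choices for illustration only and are not taken from the paper.

```python
import numpy as np

# Illustrative sketch of Eq. (1): the spectrographic envelope of a single
# ripple component. All grid spacings and parameter values are assumptions
# made for this example.
t = np.arange(0.0, 2.0, 0.001)            # time axis (s)
f = np.arange(250.0, 8000.0, 125.0)       # frequency axis (Hz)
w_t, w_f, phi, A = 10.0, 0.001, 0.0, 1.0  # 10 Hz temporal, 1 cycle/kHz spectral modulation

T, F = np.meshgrid(t, f)                  # rows index frequency bands, columns index time
S_ripple = A * np.cos(2 * np.pi * w_t * T + 2 * np.pi * w_f * F + phi)
# S_ripple[j, k] is the envelope of the band centered at f[j] at time t[k].
```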

FIG. 1. Definition of the modulation spectrum. Panel (a) illustrates the decomposition of a sound represented by its spectrogram into its Fourier components: a sum of ripple sounds, which can be thought of as the acoustic analog of visual gratings. Each subfigure on the right side of the equation is the spectrogram of a single ripple sound component. The sound shown is a song from a zebra finch. As illustrated in (b), ripples are characterized by their temporal modulation, ω_t (Hz), their spectral modulation, ω_f (1/Hz or 1/oct), and their phase. A single point in a two-dimensional Cartesian plot can be used to represent the ripple sound components of a given spectral and temporal modulation, irrespective of phase. (c) To calculate the modulation spectrum, a representative group of sounds is decomposed into its ripple components and the power density of each ripple is estimated and plotted with gray-scale on the two-dimensional Cartesian plot. For the modulation spectrum shown in (c), we used 20 zebra finch songs of approximately 2 s each. The details of the calculations are illustrated in Fig. 2.

The modulation spectrum then shows the density distribution of amplitudes A_i of the component ripple sounds for an ensemble of sounds as a function of ω_t and ω_f. Figure 1(c) shows the modulation spectrum for an ensemble of zebra finch song. For time-frequency representations that yield a real valued amplitude envelope, the modulation spectrum is symmetric along the origin and therefore can be shown in two quadrants. Ripple sounds where ω_f = 0 are broadband noise that is sinusoidally modulated in amplitude at a frequency given by ω_t. Ripple sounds where ω_t = 0 are constant sounds with a sinusoidal frequency spectrum where the distance between peaks in the spectrum is given by 1/ω_f. Ripple sounds where ω_t · ω_f ≥ 0 are down-sweeps and are shown in the upper right quadrant. Ripple sounds where ω_t · ω_f ≤ 0 are up-sweeps and are shown in the upper left quadrant [see Fig. 1(b)]. As we will explain below, the range of possible values for ω_t and ω_f is restricted because of the mathematical nature of time-frequency representations.

The calculation of the modulation spectrum is similar to that of the standard frequency spectrum except that it requires the additional preprocessing step of calculating the spectrogram of the sound (or any other time-frequency representation) before calculating the modulus square of the 2-D Fourier transform of the ensemble of spectrograms [Fig. 1(c)]. As is the case for the frequency spectrum, the same result can be obtained by first estimating the auto-correlation function and then calculating the real valued 2-D Fourier transform. Figure 2 illustrates the entire calculation process using this second approach. A spectrogram is first obtained by decomposing the sound into an ensemble of narrow-band signals obtained from the output of a filter bank. The amplitude envelope of each narrow-band signal is obtained from the analytical signal (Flanagan, 1980; Cohen, 1995; Theunissen and Doupe, 1998). The value of the amplitude envelope calculated in that fashion is identical to the amplitude obtained in a short-time Fourier transform of a segment of sound centered at t and windowed with a function given by the Fourier transform of the gain function of the particular filter in the filter bank (Flanagan, 1980). As described in more detail below, further transformations can then be applied to the amplitude of the envelopes before calculating the modulation spectrum.
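The following Python sketch shows one way to obtain such narrow-band amplitude envelopes with a Gaussian filter bank and the analytic signal. The filter implementation (a frequency-domain Gaussian gain applied via the FFT) and all parameter values are our own illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.signal import hilbert

def gaussian_band_envelopes(x, fs, centers_hz, sigma_hz):
    """Amplitude envelopes of Gaussian frequency bands (illustrative sketch).

    Each band is isolated by multiplying the signal spectrum with a Gaussian
    gain of standard deviation sigma_hz centered on each center frequency;
    the envelope is the magnitude of the analytic signal of the band output.
    """
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    X = np.fft.rfft(x)
    envelopes = []
    for fc in centers_hz:
        gain = np.exp(-0.5 * ((freqs - fc) / sigma_hz) ** 2)
        band = np.fft.irfft(X * gain, n)          # narrow-band signal
        envelopes.append(np.abs(hilbert(band)))   # analytic-signal amplitude
    return np.array(envelopes)                    # shape: (n_bands, n_samples)

# Example usage (x, fs are placeholders): bands spaced by one standard
# deviation (125 Hz), as described in the text.
# env = gaussian_band_envelopes(x, fs=32000,
#                               centers_hz=np.arange(250, 8000, 125), sigma_hz=125)
```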

FIG. 2. Calculation of the modulation spectrum. A spectrographic representation of the sounds in the ensemble is first obtained by a decomposition into frequency bands using a bank of Gaussian filters. This decomposition results in a set of narrow-band signals with time-varying amplitude envelopes. The spectrogram is a pictorial representation of this time-varying envelope. The stimulus auto-correlation matrix is then obtained by cross-correlating the amplitude envelope of a particular band with the amplitude envelope of all the other bands, including itself. These cross-correlation functions are then averaged for all functions with equal frequency band offsets (df) and collapsed into an auto-correlation matrix, which shows the correlations as a function of time delay (x axis) and frequency band offset (y axis). The two-dimensional Fourier transformation of this auto-correlation matrix is calculated to obtain the modulation spectrum of the sound ensemble.
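A minimal numerical sketch of this calculation is given below, assuming the band envelopes have already been computed (for example with the previous sketch). For simplicity it follows the direct route mentioned in the text (the modulus square of the 2-D Fourier transform of windowed segments of the spectrographic envelope) rather than the auto-correlation route of Fig. 2, which the text states gives the same result; the segment length, windowing and averaging details are our assumptions.

```python
import numpy as np

def modulation_spectrum(env, dt, df_hz, seg_len):
    """Ensemble modulation spectrum (illustrative sketch).

    env     : (n_bands, n_samples) array of (log-)amplitude envelopes, mean removed
    dt      : sampling period of the envelope along time (s)
    df_hz   : spacing of the frequency bands (Hz)
    seg_len : number of time samples per analysis segment
    """
    n_bands, n_samples = env.shape
    window = np.hanning(seg_len)
    power, n_segs = None, 0
    for start in range(0, n_samples - seg_len + 1, seg_len):
        seg = env[:, start:start + seg_len] * window[None, :]
        spec = np.abs(np.fft.fftshift(np.fft.fft2(seg))) ** 2
        power = spec if power is None else power + spec
        n_segs += 1
    power /= n_segs
    # Axis 1 of `power` is the temporal modulation w_t, axis 0 the spectral modulation w_f.
    w_t = np.fft.fftshift(np.fft.fftfreq(seg_len, d=dt))      # Hz
    w_f = np.fft.fftshift(np.fft.fftfreq(n_bands, d=df_hz))   # 1/Hz
    return w_t, w_f, power
```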

The amplitude envelopes (or their transformed values) in each band are then used to estimate an autocorrelation matrix, which shows the average product of the amplitude at frequency f and time t with the amplitude at frequency f + df and time t + dt. The average is taken over all times t and frequencies f. The 2-D Fourier transform of this auto-correlation matrix yields the modulation spectrum P_MS(ω_t, ω_f), where ω_t are the temporal frequencies corresponding to dt and ω_f are the spectral frequencies corresponding to df. For a modulation spectrum based on a wavelet decomposition, a similar calculation is done but with a filter bank of logarithmically spaced filters and fixed octave widths.

In our studies, we used three separate filter banks. In all three cases, the filters had Gaussian shapes and each filter bank was characterized by the fixed bandwidth of the filters. We used widths of 62.5, 125 and 250 Hz measured as the standard deviation parameter of the Gaussian function describing the gain of each filter. The filters were equally spaced on the frequency axis and separated from each other by one standard deviation. The auto-correlation matrix was calculated for time delays of ±300 ms. Before taking the 2-D Fourier transform, the auto-correlation matrix was multiplied by a Hanning window. For most of our analyses, we calculated the modulation spectra using a log transformation on the amplitude values and we subtracted the mean log amplitude before windowing. The log transformation was used because, as shown here and previously (Attias and Schreiner, 1997), the distribution of envelope amplitudes of natural sounds has a strong exponential component. The last plot in Fig. 2 shows a graphical representation of the modulation spectrum of a zebra finch song obtained in our calculations with the 125-Hz bandwidth filter bank and the log transform. As seen in the figure, the modulation spectrum has a low-pass characteristic both in temporal and spectral modulations and is slightly asymmetric with more energy for down-sweeps than up-sweeps.

B. Time-frequency scale and the estimation of modulation spectra

The bandwidth of the filters in the filter bank has a direct effect on the band occupancy of the amplitude envelope. In each frequency band, the temporal modulation spectrum of the amplitude square of the envelope is restricted to frequencies below the bandwidth of the filter (Flanagan, 1980). Therefore, high frequency temporal modulations can only be observed with wide bandwidth filters. Similarly, the spectral amplitude modulations for a given temporal window along frequency space are restricted to the modulation frequencies below the bandwidth given by the temporal window. Therefore, high frequency spectral modulations can only be measured with wide temporal windows. Since the temporal window is given by the Fourier transform of the gain function of the filter in the filter bank (Flanagan, 1980), high frequency spectral modulations can only be measured with narrow band-pass filters. These properties are another form of the well-known compromise between time and frequency resolution in time-frequency representations (Cohen, 1995).

Because of the time-frequency trade-off in resolution, one cannot generate a spectrographic representation that exhibits both high spectral and high temporal frequency modulations. Since spectrographic representations can be designed to be invertible (up to a single absolute phase), physical sounds that have simultaneously high spectral and high temporal amplitude modulations in the spectrographic time-frequency representation do not exist. More specifically, the uncertainty principle tells us that the product of the bandwidth, σ_f, and duration, σ_t, of the sound sample (the windowed signal used in the spectrographic decomposition) must satisfy the following inequality (Cohen, 1995):

σ_f σ_t ≥ 1/(4π).

FIG. 3. Schematics showing the physically plausible region of power for the modulation spectrum and the region sampled by choosing a particular time-frequency scale. The figure in the top row shows the region of physically plausible power for modulation spectra obtained from a spectrographic representation of sound. Since time and frequency are conjugate variables, the uncertainty principle imposes restrictions on the simultaneous sampling of fine temporal and spectral modulation frequencies (see text). The hyperbolic curves given by ω_t ω_f = π enclose the allowed region shaded in gray. The middle row shows the actual region sampled by choosing a particular bandwidth for the Gaussian filters in the filter bank: only the region within an ellipse can be sampled at a time. For spectrograms, the bounds of the ellipse are given by a second set of hyperbolic curves defined by ω_t ω_f = π/2. Choosing narrow bands in the spectrogram (narrow filters in the filter bank) results in a relatively low value for the highest temporal modulation that can be measured and a relatively high value for the highest spectral modulation that can be measured, and vice versa. This principle is illustrated by estimating the modulation spectrum of white noise, shown in the bottom row. The modulation spectrum of white noise should fill the entire allowed region but, depending on the time-frequency scale chosen, we find power only within the region defined by the ellipse shown in bold. The contour lines show the areas that encompass 50%, 80%, 90% and 95% of the total power.

Since we are measuring modulation frequencies (spectral and temporal) by estimating the correlations across multiple measurements of such samples, we can rewrite this inequality by specifying the upper frequency limits of the temporal and spectral modulations, which are given by max(ω_t) = 1/(2σ_t) and max(ω_f) = 1/(2σ_f). Sounds are therefore restricted to the range of modulation frequencies given by |ω_t · ω_f| ≤ π.

We provide an illustration of these properties by calculating the modulation spectrum of white noise for the three different bandwidths that we used in our filter bank (see Fig. 3). White noise includes all modulation frequencies and should therefore fill uniformly the entire area given by |ω_t · ω_f| ≤ π, as shown in the top panel (gray area labeled "Allowed Region"). However, when we choose a particular time-frequency scale for our spectrographic representation, we are effectively only measuring modulation frequencies that are found in a subarea within this allowed region. As mentioned above, the maximum temporal and spectral modulation frequencies are given by the bandwidth of the filter in the filter bank. For Gaussian-shaped filters, the bandwidth of nonzero power is theoretically infinite but the power in the higher frequency modulations quickly decreases. The effective bandwidth of the filter, defined mathematically as the square root of the average squared deviation from the center frequency, is simply the standard deviation parameter of the Gaussian filter in the filter bank, σ_filt. The bandwidth of the spectral modulation is then given by the maximum spectral modulation: BW(ω_f) = max(ω_f) = 1/(2σ_filt). The time window that corresponds to the Gaussian filter in the filter bank is also a Gaussian function with standard deviation parameter (also the effective duration of the window) given by σ_t = 1/(2πσ_f). The bandwidth of the temporal modulations is then given by the maximum temporal modulation: BW(ω_t) = max(ω_t) = 1/(2σ_t) = πσ_f. Thus, when a spectrogram is obtained with Gaussian filters, the product of the temporal modulation bandwidth and spectral modulation bandwidth is BW(ω_f) · BW(ω_t) = π/2 ≤ π, as required by the uncertainty principle. The hyperbola describing this function within the allowed region is shown in the middle plots of Fig. 3. In addition, for a given time-frequency scale set by the parameter describing the width of the filters in the filter bank (or equivalently the width of the time window), the modulation spectrum will be restricted to frequencies given by the specific BW(ω_t) and BW(ω_f) corresponding to σ_f. For a Gaussian filter, the regions sampled are not rectangular but, as shown in Fig. 3, ellipsoidal, with major and minor axes given by √2 BW(ω_t) and √2 BW(ω_f).
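These relationships are easy to check numerically; the short Python snippet below reproduces the bandwidth values quoted in the text for the three Gaussian filter banks (σ_filt = 62.5, 125 and 250 Hz) and verifies that their product equals π/2.

```python
import numpy as np

# For a Gaussian filter bank with standard deviation sigma_filt (Hz):
#   BW(w_f) = 1 / (2 * sigma_filt)   spectral modulation limit (1/Hz)
#   BW(w_t) = pi * sigma_filt        temporal modulation limit (Hz)
# and their product is pi/2, as required by the uncertainty principle.
for sigma_filt in (62.5, 125.0, 250.0):
    bw_wf = 1.0 / (2.0 * sigma_filt)
    bw_wt = np.pi * sigma_filt
    print(f"sigma_filt = {sigma_filt:6.1f} Hz:"
          f" BW(w_t) = {bw_wt:7.2f} Hz,"
          f" BW(w_f) = {bw_wf * 1000:4.1f} /kHz,"
          f" product = {bw_wt * bw_wf:.4f} (pi/2 = {np.pi / 2:.4f})")
# Prints BW(w_t) = 196.35, 392.70, 785.40 Hz and BW(w_f) = 8, 4, 2 /kHz,
# matching the values given in the text.
```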

Figure 3 shows the modulation spectrum for white noise calculated for the three values of the time-frequency scale parameter used in the spectrograms in our analyses, which was determined by the width of the filter in the filter bank (σ_filt = 62.5, 125, 250 Hz). The corresponding spectral-temporal modulation ellipse was then determined by BW(ω_t) (196.35, 392.7, 785.4 Hz) and BW(ω_f) (8, 4, 2 kHz⁻¹). All of the significant estimated energy in the modulation spectrum fell within this ellipse. For white noise, a large area at the center of the ellipse was uniformly sampled, illustrating the fact that white noise also has a white modulation spectrum. Note, however, that the power in the modulation does decrease significantly before reaching the edge of the ellipsoidal area. The modulation spectra for white noise obtained at different time-frequency scales have the same geometric shape but occupy a different modulation frequency area. Clearly, for white noise, there is no appropriate time-frequency scale since the entire range would be required to properly describe the spectral and temporal modulations actually present in the sound. In this case, the modulation spectra obtained from a spectrographic representation are uniquely a reflection of the shape and bandwidth of the filters in the filter bank. For other sound ensembles, however, there might be time-frequency scales that include most of the energy in the spectral and temporal modulations present in the sounds. For those sound ensembles, one could therefore estimate the modulation spectrum from a single spectrographic representation with the realization that modulation frequencies that would fall outside the sampled ellipse would be filtered out.

We attempted to estimate a measure of the optimal time-frequency scale of the spectrogram by measuring the entropy of the power density function given by the modulation spectrum. We reasoned that the modulation spectrum with the highest entropy would be the one that included most of the temporal and spectral modulation structure found in the song. To calculate the entropy, we transformed the modulation spectra into a discrete probability function p(ω_t, ω_f) that specified the probability of occupancy of a discrete subdivision of the modulation spectrum defined by small rectangles with corners at (ω_t − dω_t, ω_f − dω_f) and (ω_t + dω_t, ω_f + dω_f). The limits for ω_t and ω_f were chosen to cover the space given by the corresponding sample ellipse for each bandwidth, and dω_t and dω_f were set for all three bandwidths at dω_t = 1.66 Hz and dω_f = 0.065 cycles/kHz. The entropy of the probability distribution is then obtained with

H(P) = Σ_{ω_t} Σ_{ω_f} −p(ω_t, ω_f) log₂(p(ω_t, ω_f)).

This entropy measure has units of bits but its absolute value is not interpretable since it is dependent on the size of dω_t and dω_f and on the choice of units. We are using the measure solely in a relative manner to compare the variability in the modulation spectrum obtained at different time-frequency scales.

C. Descriptive quantifiers of modulation spectra

To quantitatively describe some of the structure observed in the modulation spectra of our sound ensembles, we used a small set of simple measures that estimated the separability, symmetry, low-pass quality and shape. For these four measures, we used the modulation spectra of the log amplitude with the mean level subtracted, as explained in Sec. II.A. We also calculated separately the relative power of the dc component of the linear amplitudes to yield a measure of modulation depth. Similar measures have also been used to quantify the modulation spectrum of the spectral-temporal receptive field of auditory neurons (Depireux et al., 2001).

1. Separability

A fully separable modulation spectrum is one that will factorize into a function of ω_t and a function of ω_f over all quadrants: P_MS(ω_t, ω_f) = G(ω_t)·H(ω_f). A separable modulation spectrum signifies that the probability of occurrence of joint spectral-temporal modulations (down-sweeps or up-sweeps) is expected from the average probability of the spectral or temporal modulations measured separately. To quantify the separability, we calculated the singular value decomposition of the modulation spectrum,

P_MS(ω_t, ω_f) = Σ_{i=1}^{n} λ_i g_i(ω_t)·h_i(ω_f),   λ_1 > λ_2 > … > λ_n,

and calculated the ratio of the first singular value relative to the overall power given by the sum of all singular values:

α_sep = λ_1 / Σ_{i=1}^{n} λ_i.

When the modulation spectrum is fully separable, α_sep will be close to 1.

2. Asymmetry

The modulation spectrum will be asymmetric if there are more down-sweeps than up-sweeps in the sound ensemble. We quantified the asymmetry by calculating the relative power in the first and second quadrants:

α_asym = (P_down − P_up) / (P_down + P_up),

where P_down is the total power in the upper right quadrant (of positive ω_t and ω_f) and P_up is the total power in the upper left quadrant (of negative ω_t and positive ω_f). If α_asym is close to 0, the modulation spectrum is symmetric. If α_asym is positive, then there are more down-sweeps than up-sweeps in the sound ensemble, and vice versa.
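A compact sketch of these two quantifiers is given below, assuming the modulation spectrum is available as a 2-D array indexed by (ω_f, ω_t) with the dc component at the center. The array layout and function names are our assumptions, and the jack-knife error estimation used in the paper is omitted.

```python
import numpy as np

def separability_index(mod_spec):
    """alpha_sep: first singular value divided by the sum of all singular values."""
    s = np.linalg.svd(mod_spec, compute_uv=False)
    return s[0] / s.sum()

def asymmetry_index(mod_spec, w_t, w_f):
    """alpha_asym: relative power of down-sweeps vs up-sweeps.

    mod_spec is indexed as [i_wf, i_wt]; w_t and w_f are the modulation
    frequency axes. Down-sweeps occupy the quadrant w_t > 0, w_f > 0 and
    up-sweeps the quadrant w_t < 0, w_f > 0 (the lower quadrants are
    redundant by symmetry).
    """
    p_down = mod_spec[np.ix_(w_f > 0, w_t > 0)].sum()
    p_up = mod_spec[np.ix_(w_f > 0, w_t < 0)].sum()
    return (p_down - p_up) / (p_down + p_up)
```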

3. Low-pass coefficient and starriness

We observed that in natural sounds most of the energy is concentrated in the low temporal and spectral frequencies and that the energies in the higher temporal and spectral modulations are not distributed uniformly but instead are concentrated along the axes. In particular, for animal vocalizations and human speech, most of the spectral modulations are found only for low temporal modulations. To quantify these effects, we calculated two parameters. The first parameter measures the energy in the low frequencies relative to the total energy:

α_low = P_low / P_total,

where

P_low = ∫_{−Δω_t}^{+Δω_t} ∫_{−Δω_f}^{+Δω_f} P(ω_t, ω_f) dω_t dω_f.

We chose Δω_t = 10 Hz and Δω_f = 0.195 kHz⁻¹.

The second parameter measures the relative energy of the modulation spectrum that excludes the regions of joint high temporal and spectral frequency as well as the region of very low joint temporal and spectral frequencies calculated with P_low. This area is found next to the x axis and y axis and includes the high temporal modulations but only at low spectral modulations, and vice versa. It is calculated with

α_star = (P_{Δω_t} + P_{Δω_f} − 2 P_low) / (P_total − P_low),

where

P_{Δω_t} = ∫_{−∞}^{+∞} ∫_{−Δω_f}^{+Δω_f} P(ω_t, ω_f) dω_t dω_f

is the total modulation power in a band of spectral frequencies limited by Δω_f, and similarly for P_{Δω_f}. P_total is the total power in the modulation spectrum.

4. Shape separability

The measure of separability defined above is critically dependent on the power distribution. Since natural sounds have a high concentration of power in the low frequencies, we found that α_sep was relatively high for all the natural sound ensembles. However, we could also observe, and further quantify with the starriness parameter, that the energy outside the low spectral and temporal frequencies was not uniformly distributed. To examine the shape of the distributions, we calculated the separability for an occupancy matrix: given a contour line defined by the percent of the total power within the contour, we set all the values within the contour to have a value of 1 and all values outside to have a value of 0. We then calculated a separability index for this occupancy matrix.

5. Modulation depth

A measure of modulation depth can be estimated by looking at the ratio between the dc power and the power in the rest of the frequencies:

α_mod = √[(P_total − P_dc) / P_dc],

where P_total is the total power and P_dc is the power at dc, i.e., the power at ω_t = 0 and ω_f = 0. We used the square root because modulation depth is usually defined from the amplitude of the envelopes. This measure was applied to the modulation spectrum obtained without the logarithmic transformation.

D. Generating synthetic sounds from a modulation spectrum

We describe a straightforward methodology to generate complex sounds that can match both the frequency spectrum and the modulation spectrum of a sound ensemble. Our interest is in designing synthetic sounds that match the first and second order modulation statistics of natural sounds. In particular, these sounds can be used to estimate the spectro-temporal tuning of auditory neurons as well as their potential sensitivity to the phase of the modulations that are present in natural vocalizations. Similarly, one could use these synthetic sounds to study perceptual sensitivity to specific spectral-temporal modulations or phase in humans or animals.

The method is similar to the method used by Klein et al. (2000) and Escabi and Schreiner (2002) to generate synthetic sounds with a band-limited flat modulation spectrum, which has been called noise ripple. For noise ripple, the space of ω_t and ω_f is sampled uniformly within some frequency bounds. In our case, we wanted to match our sampling to the modulation spectrum obtained from a particular sound ensemble. For this purpose, we sampled the desired modulation spectrum by normalizing the power spectral density to obtain a probability density function and randomly choosing N distinct pairs of values for ω_t and ω_f from that distribution.

To generate the function that describes the amplitude envelope for our synthetic sound, we then obtained the envelope function for a sum of ripple sounds (see Sec. II.A):

S(t, f) = Σ_{i=1}^{N} cos(2π ω_{t,i} t + 2π ω_{f,i} f + φ_i),

where φ_i is a random phase for each ripple component. In our implementation we generated synthetic sound ensembles made of 20 synthetic sounds, each 2 s in duration. To synthesize the envelope of each noise ripple sound, we used N = 100 ripple components.

An ensemble of sounds with an amplitude envelope given by S(t, f) will have the same modulation spectrum as the original sound ensemble but will also have, on average, a flat frequency spectrum and, on average, a flat temporal envelope. The flat average temporal envelope will also be found in the original sound (if the sound ensemble is stationary in time) but the average flat frequency spectrum is unlikely to be found in natural sounds. To match the overall frequency spectrum and the dc value of the modulation spectra (the modulation depth), we normalized S(t, f) by the average standard deviation of the amplitude modulation in each frequency band f and added the mean amplitude envelope. Calling A(f) the average amplitude in each frequency band measured in the original ensemble, σ(f) the standard deviation in each frequency band measured in the original ensemble, and σ_S(f) the standard deviation obtained from the ensemble of S(t, f) functions for the synthetic ensemble, we generate a new function for the amplitude envelopes given by

S_Norm(t, f) = A(f) + [σ(f)/σ_S(f)] S(t, f).
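The sketch below illustrates the envelope-synthesis steps described so far: sampling (ω_t, ω_f) pairs from a normalized modulation spectrum, summing random-phase ripple components, and rescaling each band to the target mean and standard deviation. The function and variable names are ours, and the per-band statistics A(f) and σ(f) are assumed to have been measured from the original ensemble beforehand.

```python
import numpy as np

def synthetic_envelope(mod_spec, w_t, w_f, t, f, A_f, sigma_f,
                       n_ripples=100, rng=None):
    """Synthetic spectro-temporal envelope S_Norm(t, f) (illustrative sketch).

    mod_spec     : 2-D modulation spectrum indexed as [i_wf, i_wt], used as a
                   sampling density for the ripple components
    w_t, w_f     : modulation frequency axes (Hz and 1/Hz)
    t, f         : time (s) and frequency (Hz) axes of the envelope to build
    A_f, sigma_f : target mean and standard deviation per frequency band,
                   measured from the original ensemble
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample distinct (w_t, w_f) pairs with probability proportional to power.
    p = (mod_spec / mod_spec.sum()).ravel()
    idx = rng.choice(p.size, size=n_ripples, replace=False, p=p)
    i_wf, i_wt = np.unravel_index(idx, mod_spec.shape)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=n_ripples)

    # Sum of random-phase ripple components, S(t, f) of Sec. II.D.
    T, F = np.meshgrid(t, f)                     # rows index frequency bands
    S = np.zeros_like(T)
    for wt_i, wf_i, phi in zip(w_t[i_wt], w_f[i_wf], phases):
        S += np.cos(2 * np.pi * wt_i * T + 2 * np.pi * wf_i * F + phi)

    # Rescale each band: S_Norm = A(f) + [sigma(f)/sigma_S(f)] S(t, f).
    sigma_S = S.std(axis=1, keepdims=True)
    return A_f[:, None] + (sigma_f[:, None] / sigma_S) * S
```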

Finally, to synthesize the sound, a direct method or a more precise iterative method can be used. For the direct method, one simply creates an ensemble of carrier frequencies that will be modulated by S_Norm. To prevent any artifacts in periodicity, the frequencies of the carriers should be chosen randomly with a uniform distribution between the lower and upper bounds of the desired frequency range. In our case, we set the lower frequency bound f_min = 250 Hz and the upper bound f_max = 8 kHz. Each carrier sound has the form

s_i(t) = cos(2π f_i t + θ_i),

where f_i is a random frequency between f_min and f_max, θ_i is a random phase, and i = 1 to Nc. We found that Nc = 1000 carrier frequencies were more than sufficient to sample our range of frequencies. The synthetic sound is finally given by

s_syn(t) = Σ_{i=1}^{Nc} S_Norm(t, f_i) s_i(t).

The more precise iterative method is called spectrographic inversion and effectively involves iteratively adjusting the phase of the carrier sounds, θ_i, in order to minimize the difference between the desired spectrogram S_Norm and the spectrogram obtained from the synthesized sound (Griffin and Lim, 1984). We used the implementation of the Griffin and Lim algorithm provided by Malcolm Slaney as a Matlab program (1994).

When we used the simple direct method, we found that the ensemble of synthesized sounds had a very similar frequency spectrum, modulation depth and modulation spectrum as the original sounds. However, the spectrogram obtained from specific sample sounds from the synthetic ensemble could be quite different from the desired spectrogram due to random phase interferences. The iterative method yielded a much better one-to-one match of the spectrogram and a slight improvement on the match between the ensemble modulation spectra of the natural and synthesized sounds.

E. Natural sound ensembles

We analyzed the statistics of three natural sound ensembles: two types of animal vocalizations (speech and zebra finch song) and an ensemble of environmental sounds. The speech ensemble was made of 20 sentences chosen randomly from the audio-visual speech test library recorded by the Otolaryngology Department at the University of Iowa (Tyler et al., 1990). The sentence corpus consists of 100 short complete sentences read by six different adult male and female speakers. Examples of these sentences are "It rained all day yesterday," "The book tells a story," "The mother reads a paper" and "They have only one son." The sentences are read out of context, in an acoustically controlled environment. The total length of sound sampled was approximately 40 s. These speech signals have been used previously in speech perception research (Shannon et al., 1995; Dorman et al., 1997).

The zebra finch song ensemble consisted of the songs of 20 different adult males (age > 100 days) that were raised by their parents in a large zebra finch colony in our laboratory. The recordings were obtained in a noise-free environment by isolating individual male birds in a sound-proof recording chamber. Multiple samples of each song were obtained and a particularly clean exemplar was chosen. Each song lasted approximately 2 s and the 20-song ensemble was approximately 40 s in duration.

The ensemble of environmental sounds was 45 s in duration and consisted of a rustling brush, crunching leaves and twigs, rain, fire, and forest and stream sounds. These were recorded and provided to us by Michael Lewicki. Lewicki used these sounds to study the higher order statistics of the sound pressure waveform (Lewicki, 2002).

III. RESULTS

We examined the statistics of the temporal-spectral envelope obtained from spectrographic representations of speech, zebra finch song and environmental sounds.

A. Probability distributions of the modulation amplitude

We first examined the probability distribution of the amplitude of the modulation envelopes (Fig. 4). In this analysis, we used a spectrographic representation based on our intermediate value for the time-frequency scale [BW(ω_t) = 392.7 Hz]. The modulation envelopes were obtained as described in Sec. II.A. The middle row in Fig. 4 shows the distribution obtained for p(A) in each frequency band. The distribution of amplitudes for all three natural sounds is strikingly different from that of white noise. The distribution of amplitudes for Gaussian white noise is given by the Rayleigh distribution: p(A) ∝ A e^{−A²}. The fit of our data with the theoretical distribution is shown in the bottom panel and the two distributions are indistinguishable (K-S test). On the other hand, the distributions for the natural sounds examined here have a strong exponential component and are best fitted with an exponential distribution or a gamma distribution. The exponential distribution gave good fits for song and speech and for the higher frequencies of environmental sounds. The gamma distribution gave good fits for the lower frequencies of the environmental sounds. The exponential shape of these distributions reflects the fact that, for vocalizations and for the higher frequencies of environmental sounds, there is a finite probability of finding sounds that are arbitrarily soft, as in, for example, the silent pauses between speech syllables. We also found that there were systematic trends as functions of the center frequency of the band. Some of the differences can be explained simply by changes in amplitude and not by changes in the shape of the probability distribution, and these differences would not appear if we had normalized our probability distributions by their variance. For example, the coefficient of the exponential fit decreased from low to high frequencies, reflecting the higher probability of soft sounds in the upper frequency range.
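The comparison described above between a Rayleigh fit for white noise and an exponential fit for natural sounds can be reproduced with standard tools; the snippet below is a hedged illustration using scipy.stats on the envelope amplitudes of one frequency band. The function name, the scale normalization, and the use of fitted parameters in the K-S test are our own choices, not the authors' procedure.

```python
import numpy as np
from scipy import stats

def compare_amplitude_fits(amplitudes):
    """Fit Rayleigh and exponential models to envelope amplitudes and report
    Kolmogorov-Smirnov statistics (illustrative sketch)."""
    a = np.asarray(amplitudes, dtype=float)
    a = a / a.mean()                      # normalize the scale for convenience

    loc_r, scale_r = stats.rayleigh.fit(a, floc=0)
    ks_rayleigh = stats.kstest(a, "rayleigh", args=(loc_r, scale_r))

    loc_e, scale_e = stats.expon.fit(a, floc=0)
    ks_expon = stats.kstest(a, "expon", args=(loc_e, scale_e))

    return {"rayleigh": ks_rayleigh, "exponential": ks_expon}
```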


FIG. 4. Amplitude distributions in different sound ensembles. Probability density functions were estimated for the amplitude (bottom two rows) and the log amplitude (top row) of the envelope of the four sound ensembles examined in this study. The amplitudes of the envelopes were obtained by calculating the Hilbert transform of the decomposition of the sound into narrow bands, as shown in Fig. 2 and explained in Sec. II.A. The filter bandwidth of σ_filt = 125 Hz was used [BW(ω_t) = 392.7 Hz]. The gray scale lines show the results for the different frequency bands, with the dark black line corresponding to f = 875 Hz and the lightest gray line corresponding to f = 7312.5 Hz. In the top row the probability distribution of the log amplitude is shown with the x axis in dB units and with 0 dB corresponding to the mean of the distribution in each band. The middle row shows the probability distribution of the amplitude. The bottom row shows the probability distribution for the frequency band centered at 2500 Hz (crosses) and the best fit (solid) given by an exponential distribution for the natural sounds and given by the Rayleigh distribution for white noise.

However, for speech and birdsong, we also observed systematic changes in the shape of the probability distribution, as described in more detail in the next paragraph. Finally, although the fits were good in a mean square sense, the data showed systematic deviations from either the exponential form or the gamma form, and these two theoretical models were rejected by the K-S test. The theoretical distribution of these natural sounds is therefore complex, potentially composed of multiple components.

The probability distributions were also examined on a logarithmic scale, as shown in the top row of Fig. 4, where we plotted p(log(A)) as a function of log(A). Here, we can also identify features that distinguish the distributions of the natural sounds from those of Gaussian white noise. First, for zebra finch song and the lower frequency bands in speech, the distributions are approximately rectangular (low kurtosis) with similar probability values in a 40-dB range (−20 to 20 dB). The distribution for zebra finch song has two peaks (bi-modal) corresponding to syllables and intersyllable silences. The relative probability of each peak changes as a function of frequency since there are more song syllables with energy only in the lower frequency range. The distribution for speech sounds in the lower frequency range also has low kurtosis but it is not bimodal. Instead the distribution is asymmetric with an almost linear decrease in probability as a function of sound intensity. As frequency increases, the kurtosis also increases, reflecting the much steeper slope in the linear region: the kurtosis is below 3 for frequencies below 3500 Hz and above 3 for all frequencies above. The mean kurtosis above 6000 Hz is 4.6. In comparison, the distribution of the log amplitude for the environmental sounds is more symmetric and has a kurtosis close to that of the normal distribution (mean kurtosis across all frequency bands is 3.26, relative to 3 for the normal distribution).

B. Modulation spectra of natural sounds and time-frequency scale

We calculated the modulation spectrum of the three natural sound ensembles using the methodology described in Sec. II.A and Sec. II.B. Figure 5 shows the modulation spectra for zebra finch song, speech and environmental sounds calculated for the three values of time-frequency scale that we investigated [BW(ω_t) = 196.35 Hz, BW(ω_t) = 392.7 Hz and BW(ω_t) = 785.4 Hz]. The three time-frequency scales capture the principal features in the spectra since most of the energy is found at low spectral and temporal modulations. We can also visually verify the validity of the chosen range of time-frequency scales by noting that, for all three ensembles, the energy for the fastest temporal modulations decays to zero for the wide temporal bandwidth filter [BW(ω_t) = 785.4 Hz].


FIG. 5. Modulation spectra for three different sound ensembles—zebra finch song, speech and environmental sounds—obtained at three different time-frequency scales. The modulation spectra of these three sound ensembles are shown for an effective temporal bandwidth [BW(ω_t)] of 196.35, 321 and 642 Hz. The contours are drawn to circle 50%, 80% and 90% of the total power. The small inset zooms in on the modulation spectra for the lower frequency range. The 50% contour is shown in white on these insets. Since a large fraction of the energy in the modulation spectra of these natural sounds is concentrated in the low frequencies, the 50% contour is identical at all three time-frequency scales. Small differences are observed for the 80% contour and large differences for the 90% contour, illustrating the fact that the time-frequency compromise affects approximately 20% of the modulation power found at the higher spectral and temporal modulations.

Similarly, the energy for the fastest spectral modulations decays to zero for the narrow temporal bandwidth filter [BW(ω_t) = 196.35 Hz]. Nonetheless, the smallest temporal bandwidth misses some of the fast temporal modulations and the widest frequency bandwidth misses some of the fast spectral modulations observed in these natural sounds.

We wanted to find a single time-frequency scale that yielded the best compromise for the representation of the fastest spectral-temporal modulations so that we could quantitatively describe and compare the spectra of these three natural sound ensembles. For this purpose, we calculated an information theoretic entropy measure as described in Sec. II.B. The results of that analysis are shown in Fig. 6. The entropy measures were similar for all three time-frequency scales, reflecting the fact that the probability distributions are well captured at any one of the three scales or that similar compromises are achieved. The highest entropy for the speech ensemble was obtained at the widest temporal bandwidth, BW(ω_t) = 785.4 Hz, whereas for the zebra finch song ensemble the entropy was highest at the narrowest bandwidth, BW(ω_t) = 196.35 Hz. The differences, however, were not statistically significant. In the remainder of the paper, we show the results of our analysis of the modulation spectra at the intermediate time-frequency scale [BW(ω_t) = 392.7 Hz], but very similar results were obtained at all three scales. To further estimate the effect of the potential compromise, we also calculated the power in the modulation spectra found outside the sampled ellipse. The power outside the area sampled by the intermediate time-frequency scale and found in the area sampled by the BW(ω_t) = 196.35 Hz scale was 1.6% of the total power for zebra finch song, 2.3% for human speech, and 3.7% for environmental sounds.

C. Modulation spectra of natural sounds

Within the allowed space imposed by the uncertainty principle and the time-frequency scale used in the measurements, we observed that the modulations in natural sounds have a characteristic distribution. As mentioned above, for all three sound ensembles, most of the energy is found for low spectral and temporal modulations. These natural sounds, and in particular the vocalizations (zebra finch song and human speech), are further characterized by nonoval distributions, reflecting the fact that most of the high frequency spectral modulation power is found at the very lowest temporal modulations and vice versa.

In other words, there is a scarcity of sounds with both high spectral and high temporal modulations. This property is best seen in Fig. 7, where we display a contour plot of the modulation spectra of the three natural sound ensembles and of white noise for comparison. In this figure, we show all four quadrants of the modulation spectrum to visually emphasize the shape difference. The white noise contours are oval, reflecting the Gaussian-shaped filters, which are symmetric in time-frequency; the relative length of the temporal and spectral axes is determined by the time-frequency scale set by the bandwidth of the filters. For the natural sound ensembles, the contours that bound 50% of the energy are very close to the origin, reflecting the low-passed property. The contours that bound 70% and 80% of the total energy draw a star-shaped pattern reflecting the low probability of finding sound components with jointly high spectral and high temporal modulations. We quantified these observations by calculating various parameters describing the shape of these spectra, as described in Sec. II.C. The results of these analyses are shown in Figs. 8 and 9.

First, the separability index shows that the three natural sound ensembles are quite separable. Only the speech ensemble, with an index of 0.84, is significantly different from the index found for the white noise ensemble, which is completely separable [Fig. 8(a)]. On the other hand, our starriness index, which calculates the relative energy in the low temporal modulation band (the band of ±10 Hz along the y axis) added to the energy in a low spectral modulation band (±0.195 kHz⁻¹ along the x axis), is much larger in natural sounds than it is in white noise, reflecting the star-shaped pattern observed in Fig. 7 [Fig. 8(d)]. Although these two results seem contradictory, they have a simple explanation. As quantified by the low pass coefficient, a large fraction of the modulation energy spectrum (64% for zebra finch song, 61% for speech and 51% for environmental sounds) is found at the very low spectral and temporal modulations, and the modulation distribution in this area is highly separable, as illustrated by the circular 50% contour in Fig. 7 [Fig. 8(c)].

FIG. 6. Entropy of the modulation spectrum. The entropy of the modulation spectrum was calculated by treating it as a discrete probability distribution. The entropy was calculated to investigate the optimal time-frequency scale, which would be defined as the scale with the highest entropy value. The error bars show the standard error of the measure obtained by the jack-knife resampling method on the sound ensemble. The legend shows the temporal bandwidth BW(ω_t) in Hz.


FIG. 7. Modulation spectra displayed as contour plots. The modulation spectra for zebra finch song, speech, environmental sounds, and white noise are displayed as contour plots. The contours are drawn at the fixed power values that enclose 50%, 60%, 70%, 80%, and 90% of the total power.
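The contour levels in Fig. 7 are defined implicitly by the fraction of total power they enclose. One simple way to find such a level, shown below as a hedged sketch (the function name and the sorting strategy are our own, not taken from the paper), is to sort the modulation-power bins and locate the value at which the cumulative power first reaches the desired fraction:

import numpy as np

def power_contour_level(mod_spectrum, fraction):
    # Return the power value such that the set of bins at or above that
    # value encloses `fraction` (e.g., 0.5 for the 50% contour) of the total power.
    p = np.sort(np.asarray(mod_spectrum, dtype=float).ravel())[::-1]  # descending
    cumulative = np.cumsum(p) / p.sum()
    idx = int(np.searchsorted(cumulative, fraction))
    return p[min(idx, p.size - 1)]

# Example: levels = [power_contour_level(ms, f) for f in (0.5, 0.6, 0.7, 0.8, 0.9)]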

FIG. 8. Separability, asymmetry, low-pass, and starriness coefficients. Four quantifiers that measure different aspects of the shape of the modulation spectrum were calculated for each ensemble (ZF: zebra finch song, SP: speech, ES: environmental sounds, and WN: white noise): (A) separability, (B) asymmetry, (C) low-pass coefficient, and (D) starriness. The error bars show one standard error obtained with the jack-knife resampling technique. See Sec. II.C for the exact definition of the coefficients.

However, when one looks at the tails of the distributions found at the high spectral and high temporal modulations, the natural sounds lack the joint high spectral and temporal power found in the random signal. To measure this effect, we calculated the separability of an occupancy matrix defined by the contours that bound 50% to 90% of the total energy. This analysis showed that speech and environmental sounds are inseparable relative to noise for energy found between the 60% and 80% contours, and zebra finch song for energies found between the 80% and 90% contours: the power in the higher modulation frequencies of the spectrum, which makes up approximately 30% of the energy for speech and environmental sounds and 20% of the energy for song, is particularly concentrated along the spectral and temporal modulation axes.
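The exact definitions of the shape quantifiers are given in Sec. II.C; as a rough, assumption-laden sketch of the band-energy measures described above (the cutoffs ±10 Hz and ±0.195 kHz⁻¹ are the values quoted in the text, while the array layout and the treatment of the central overlap are our own choices), one could compute:

import numpy as np

def shape_coefficients(ms, wt, wf, wt_cut=10.0, wf_cut=0.195):
    # ms: 2-D modulation power indexed [spectral, temporal];
    # wt: temporal modulation frequencies (Hz) for the columns of ms;
    # wf: spectral modulation frequencies (1/kHz) for the rows of ms.
    total = ms.sum()
    low_t = np.abs(wt) <= wt_cut        # strip of low temporal modulations
    low_f = np.abs(wf) <= wf_cut        # strip of low spectral modulations
    center = ms[np.ix_(low_f, low_t)].sum()
    # starriness: energy in the two low-modulation strips (center counted once)
    starriness = (ms[:, low_t].sum() + ms[low_f, :].sum() - center) / total
    # low-pass coefficient: energy in the jointly low-modulation region
    low_pass = center / total
    return starriness, low_pass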
FIG. 9. Occupancy separability at different thresholds. An occupancy separability index was defined by calculating the separability coefficient for the space covered by the modulation spectrum. The space covered was defined by the contour line that bounded a given percent of the total power (shown on the x axis). The error bars show one standard error obtained with the jack-knife resampling technique.

Besides the low-pass filter characteristics and the lack of jointly high spectral and temporal modulations, the modulation spectra of speech and environmental sounds are remarkably symmetric. These sounds have equal representations of up-sweep and down-sweep ripple sound components (see Figs. 5 and 7). Zebra finch song, on the other hand, exhibits some asymmetry, with slightly more energy in down-sweeps (see Fig. 5). These observations are reflected in the asymmetry index, which measures the relative difference between the two quadrants [Fig. 8(b)]: only the zebra finch song ensemble shows a value of asymmetry that is different from zero.

Finally, we measured the modulation depth of the amplitude envelopes for the four sound ensembles. The modulation depth is traditionally defined as one minus the value of the amplitude minimum relative to the maximum: a signal with a modulation depth of 1 is intermittently silent. We used an alternative measure in which we quantified the modulation depth of our signals from their modulation spectra. To estimate the "size" of the joint temporal and spectral modulations, we calculated the square root of the ratio of the non-dc power to the dc power (see Sec. II.C). Although the actual amplitude modulation observed in the time domain will also depend on the phase of the ripple components of the sound, our measure will be large for signals that are dominated by non-dc ripple components. For example, one would expect isolated animal vocalizations to show larger amplitude modulations than white noise or environmental sounds. Indeed, we found that speech had the largest modulation depth (13.8 ± 2.0), followed by zebra finch song (5.9 ± 0.2), environmental sounds (4.5 ± 0.2), and white noise (3.92 ± 0.006).
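A minimal sketch of this spectral estimate of modulation depth (it assumes the modulation spectrum is stored with the dc bin at the center of the array; the exact definition is in Sec. II.C and is not reproduced here):

import numpy as np

def modulation_depth(ms):
    # Square root of the ratio of non-dc to dc modulation power.
    ms = np.asarray(ms, dtype=float)
    dc = ms[ms.shape[0] // 2, ms.shape[1] // 2]   # bin at (wf = 0, wt = 0)
    non_dc = ms.sum() - dc
    return float(np.sqrt(non_dc / dc))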

FIG. 10. Average temporal modulation power and spectral modulation power. The average temporal modulation power (upper row) and spectral modulation power (bottom row) are plotted as a function of modulation frequency on a log-log plot for zebra finch song, speech, environmental sounds, and noise. The average spectral and average temporal modulations were obtained with a singular value decomposition of the joint modulation spectrum. The data are plotted with a solid line and the power-function fit with a dashed line.

Ultimately, one might want to fit the modulation spectra with a function or a theoretical model. We began this process by fitting the average temporal [P(ωt)] and average spectral [P(ωf)] components of the modulation spectrum with a power-law function: P(ω) ∝ ω^(−α). The average spectral and temporal components were obtained from the first g(ωt) and h(ωf) functions calculated in the singular value decomposition of the modulation spectrum that was used for the separability analysis (see Sec. II.C). Figure 10 shows these functions (solid lines) and the corresponding fits (dashed lines) for all four sound ensembles. The fits were performed for temporal modulations between 3 and 100 Hz and for spectral modulations between 0.1 and 1 kHz⁻¹. This range corresponded to the area of the modulation spectrum where white noise had a flat distribution, as shown in Fig. 10. Note that, although our effective bandwidth set by the standard deviation parameter of the Gaussian-shaped filters is given by BW(ωt) = 397.2 Hz and BW(ωf) = 4 kHz⁻¹, the boundaries of the areas that showed flat modulation power for white noise are approximately 1/4 of the total effective bandwidth. Our estimation of the shape of the power distribution must be restricted to that central area or it will be affected by the shape and bandwidth of the filters.

The average temporal components of the modulation spectrum for all three natural sounds were well fitted by the power law (with R² > 0.9 and P < 10⁻⁴ in all cases). The vocalizations had steeper slope coefficients, with values of α close to 2 (ZF song: α = 2.26, lower 95% = 2.15, upper 95% = 2.36; speech: α = 1.6, lower 95% = 1.48, upper 95% = 1.72). The slope for environmental sounds was between that of vocalizations and white noise (env sounds: α = 0.78, lower 95% = 0.72, upper 95% = 0.84). The approximate 1/ωt² relationship is reminiscent of the 1/f² relationship found for spatial frequencies in natural images, and its significance is discussed below.

The average spectral components of the modulation spectrum could also be fitted reasonably well with the power-law function, although additional structure can clearly be observed for zebra finch song (ZF) and the environmental sounds (R² = 0.95, P < 10⁻⁴ for speech; R² = 0.57, P < 10⁻³ for ZF song; R² = 0.54, P = 0.001 for env sounds). The slope of the power law was shallower than for the temporal component: closer to 1 for zebra finch song (ZF song: α = 1, lower 95% = 0.5, upper 95% = 1.6) and close to 1.5 for speech and environmental sounds (speech: α = 1.52, lower 95% = 1.32, upper 95% = 1.72; env sounds: α = 1.4, lower 95% = 0.6, upper 95% = 2.1).
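One standard way to obtain such power-law fits (a sketch of a common procedure, not necessarily the exact fitting method used by the authors) is ordinary least squares on a log-log scale, with the exponent given by the negative of the fitted slope:

import numpy as np

def fit_power_law(freqs, power, fmin, fmax):
    # Fit P(w) ~ w**(-alpha) over [fmin, fmax] by least squares in log-log
    # coordinates; return the exponent alpha and the R^2 of the fit.
    keep = (freqs >= fmin) & (freqs <= fmax) & (power > 0)
    x, y = np.log10(freqs[keep]), np.log10(power[keep])
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r2 = 1.0 - residuals.var() / y.var()
    return -slope, r2

# e.g., alpha_t, r2_t = fit_power_law(wt, P_wt, 3.0, 100.0)   # temporal component
#       alpha_f, r2_f = fit_power_law(wf, P_wf, 0.1, 1.0)     # spectral component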
D. Synthetic sounds with matched modulation spectrum

A final goal of our analysis was to demonstrate how one could synthesize sounds that have modulation spectra similar to those of arbitrary sound ensembles but different phases in their ripple components. Using the methodology described in Sec. II.D, we generated synthetic zebra finch song, which we called song ripples, and synthetic speech, which we called speech ripples. The top row of Fig. 11 shows the modulation spectra of zebra finch song and speech, and the bottom row shows the modulation spectra obtained from 400 s of synthetic song ripples and speech ripples. The contour lines surround the areas that enclose 50%, 80%, and 90% of the power, and the gray scale showing the power is logarithmic. By visual inspection, one can see that the match is relatively good. In particular, the areas of high power are practically identical.

FIG. 11. Modulation spectrum of natural sounds and their synthetic models. The top panels show the modulation spectra for zebra finch song and speech. The bottom panels show the corresponding modulation spectra of the synthetic sounds (song ripples and speech ripples) that we generated by sampling the spectra of the original sound. The contour lines surround the areas that enclose 50%, 80%, and 90% of the total power. The 50% contour lines are drawn in white.

The contours that enclose 80% and 90% of the total power are different in the synthetic and natural ensembles, but one should realize that they correspond to contours drawn at 0.2% and 0.03% of the maximum power for speech and at 0.3% and 0.04% of the maximum power for zebra finch song. These areas of modulation space are therefore infrequently sampled in our synthesis. To better evaluate the quality of the fit, we calculated the cross-correlation coefficient between the modulation spectrum of the natural sound and that of the synthetic sound: for zebra finch song and song ripples it is 0.9649, and for human speech and speech ripples it is 0.9545.
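One plausible reading of this similarity measure, written as a hedged sketch (the paper does not spell out the exact normalization, so the Pearson form below is our assumption), is the normalized correlation between the two spectra taken as flattened arrays:

import numpy as np

def spectrum_correlation(ms_natural, ms_synthetic):
    # Pearson correlation coefficient between two modulation spectra.
    a = np.asarray(ms_natural, dtype=float).ravel()
    b = np.asarray(ms_synthetic, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))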
Spectrograms of exemplars of the song ripples and speech ripples are shown in Fig. 12.

FIG. 12. Representative spectrograms of a natural zebra finch song and a speech sample (top row) and a synthetic song ripple and speech ripple (bottom row). Frequency (1-8 kHz) is shown as a function of time (0-2 s).
These sounds have a distinct quality that can be described as zebra-finch-song-like and speechlike. Similarly, one can observe, in the spectrograms, similar temporal and spectral modulations in the natural and synthetic ensembles. On the other hand, it is clear that the phase of the ripple components plays an important role in the natural sounds. For example, in the temporal domain, the presence of complete silence is more common in the natural vocalizations than in the ripple sounds. The phase of the ripple sound components also plays a crucial role in the spectral domain, both in generating natural harmonic stacks (all cosine components) and in determining the exact frequency location of the formants in speech. For these reasons, and also because the correlations are calculated in 300-ms windows (and 300-ms bits of speech sound parsed together randomly would be unintelligible), the speech ripple sounds are completely unintelligible.

IV. DISCUSSION AND THEORETICAL PREDICTIONS

We have shown that the statistical redundancies that are observed in natural sounds can be, in part, described by the lower-order joint temporal and spectral statistics of the envelopes of the sounds obtained from a time-frequency decomposition of the sound. The distribution of the amplitude of the envelopes of natural sounds has a strong exponential component, which distinguishes it from that of white noise. The modulation spectrum, given by the 2-D Fourier transform of the autocorrelation matrix of the amplitude envelopes, shows that the spectro-temporal envelope modulations in natural sounds are concentrated in the low frequencies, that the average temporal and spectral modulation power can be fitted with a power law, and that vocalizations have most of their power in the higher spectral modulation frequencies concentrated at low temporal modulation frequencies.
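By the Wiener-Khinchin relation, the 2-D Fourier transform of the autocorrelation matrix of the envelopes is equivalent to the averaged squared magnitude of the 2-D Fourier transform of the mean-subtracted spectrographic representation itself. The sketch below illustrates this equivalence schematically; the paper's actual pipeline (Gaussian filter bank, log amplitudes, 300-ms segments, as described in Sec. II) is not reproduced exactly, and the segmenting scheme is our simplification:

import numpy as np

def modulation_spectrum(spectrogram, n_segments):
    # spectrogram: 2-D array of amplitude envelopes indexed [frequency, time].
    # The modulation spectrum is estimated as the average power of the
    # 2-D Fourier transform over non-overlapping temporal segments.
    n_freq, n_time = spectrogram.shape
    seg_len = n_time // n_segments
    acc = np.zeros((n_freq, seg_len))
    for k in range(n_segments):
        seg = spectrogram[:, k * seg_len:(k + 1) * seg_len]
        seg = seg - seg.mean()                    # remove the dc component
        acc += np.abs(np.fft.fft2(seg)) ** 2      # power of the 2-D transform
    # shift so that zero temporal and spectral modulation sits at the center
    return np.fft.fftshift(acc / n_segments)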
A. Probability distribution and second-order statistics of amplitude envelopes

Our results complement and support previous work that analyzed the statistics of natural sounds. The exponential form of the distribution of amplitude envelopes in speech has been known and exploited in speech processing applications for a long time (e.g., Paez and Glisson, 1972). More recently, Attias and Schreiner (1997) analyzed the amplitude distributions and the temporal modulation power in a larger ensemble of natural sounds, which included symphonic music, speech, cat vocalizations, and environmental sounds. They also noted the exponential form of the distribution of amplitudes in the natural sounds, which reflects the high probability of finding arbitrarily soft sounds. In their analysis, they found that the distributions in different frequency bands were very similar, whereas we found systematic differences in the shape of the probability distribution as a function of the center frequency. However, simple methodological differences between our analyses and theirs could explain the discrepancy: first, we used a smaller and more homogeneous ensemble of sounds than they did and, second, we used a filter bank with filters of linearly spaced center frequencies and fixed bandwidth, whereas they used logarithmically spaced filters of 1/8-oct bandwidth.

Attias and Schreiner also calculated the power spectrum of the temporal modulations and fitted their data with a modified power law. Their results are similar to those we found by fitting the temporal component of the joint modulation spectrum. In their analysis, they found, as we did, that the power coefficient, α, was between 1 and 2.5. Similar results were also found in the analysis of the distribution of the overall amplitude envelope of speech and music (Voss and Clarke, 1975; Brillinger and Irizarry, 1998). In addition, we found that environmental sounds and animal vocalizations can be segregated into different groups based on these statistics as well as on those obtained from the joint time-frequency modulation spectrum (see below). First, the distributions of the log amplitude of the envelopes of animal vocalizations have low kurtosis, exhibiting a relatively uniform distribution over a 40-dB range of sound intensity, whereas environmental sounds have a log distribution of amplitudes that is approximately normal. Second, animal vocalizations exhibit temporal modulation power relationships that are approximately 1/ω², whereas environmental sounds are characterized by a significantly flatter power curve. A similar separation of natural sounds into these two broad classes was also suggested by Lewicki (2002), who analyzed the statistically independent components of the sound pressure waveform in these two classes of sounds. He found that vocalizations were best decomposed by Fourier filters and environmental sounds by wavelet filters (Lewicki, 2002).

B. Joint spectro-temporal statistics

In addition, we calculated the joint spectro-temporal modulation spectrum of the natural sounds. It is the natural extension of the purely temporal power-spectral analysis performed on amplitude envelopes or on the overall loudness, as described above, and of the purely spectral power-spectral analysis of the cepstrum performed for speech analysis. We showed how to calculate the joint modulation spectrum and how its calculation depends on the choice of the time-frequency representation and on the time-frequency scale of the chosen filters. We chose to perform our analysis using linearly spaced filters with Gaussian shape and fixed bandwidth. The advantages of that representation are that (1) it is symmetric in time and frequency, (2) the limits of the sampled space are well defined, and (3) harmonic sounds, which are very common in animal vocalizations, have a well-localized occupancy on the spectral axis of the modulation spectrum at the value corresponding to the inverse of their fundamental frequency. Since harmonic sounds are critical in acoustical behaviors (e.g., Lohr and Dooling, 1998) and contribute significantly to the statistical redundancy of the sound, the Fourier filter decomposition is, in our opinion, superior to the wavelet transformation for the characterization of the statistical properties of natural vocalizations. On the other hand, the wavelet transform has the advantage of spanning multiple time-frequency scales. The wavelet transform could therefore be more efficient for estimating the modulation spectrum of sounds that have a broad range of modulation components, such as environmental sounds. These facts and hypotheses are well supported by the analysis performed on the higher-order statistics of the sound pressure waveform of vocalizations versus environmental sounds mentioned above (Lewicki, 2002).

Moreover, a form of wavelet decomposition is performed by the mammalian cochlea, and the frequency sampling throughout most of the auditory system is approximately uniform on a logarithmic scale. For this reason, the spectro-temporal receptive fields of mammalian auditory neurons have also, until now, been exclusively estimated with a wavelet decomposition (e.g., Depireux et al., 2001; Escabi and Schreiner, 2002). It would therefore be of great value to repeat the statistical analysis performed here using wavelet transforms or other biologically inspired time-frequency representations, such as those obtained from models of the auditory system (Chi et al., 1999). Note, however, that, because of the physical limits on the frequency range of sounds (and of hearing), a wavelet transform will still result in a modulation spectrum that is bounded to a particular region of spectral and temporal modulations. Also, within the sampled region, different frequencies would be analyzed at different scales. Therefore, the results obtained in such analyses would have to be carefully compared to those obtained for white noise and for colored noise with the same overall frequency power as the ensemble of interest.

Our data suggest that the modulation spectrum for vocalizations can also be distinguished from that of environmental sounds. The vocalizations studied here show significant power in spectral modulations in a narrow band of temporal modulation frequencies (<5 Hz). This power corresponds to the voiced sections of speech and to the harmonic sounds found in zebra finch song. In both vocalizations, there is a scarcity of sound components with both high spectral and high temporal modulations, giving the modulation spectra for vocalizations a characteristic star shape. Environmental sounds are principally characterized by their low-passed quality, both spectrally and temporally, with a power-law function describing both the average temporal and spectral modulation power. The power coefficient describing temporal modulations is smaller (flatter curve) and the power coefficient describing spectral modulations is greater (steeper curve) for environmental sounds than for vocalizations. We also found that the zebra finch vocalizations are asymmetric, with more energy in the down-sweeps than in the up-sweeps. We found similar results for other animal vocalizations (Bengalese finch song and bat calls; data not shown) and expect this asymmetry to be a common feature of the vocalizations of some animal species. It was strikingly absent from human speech, however.

C. Implications for audio processing

The statistical structure of the spectro-temporal envelopes of natural sounds could have direct implications for various forms of sound processing, such as sound compression algorithms for storing music on digital media or speech pre-processing for auditory prosthetics. For example, most current sound compression algorithms (such as MP-3) use standard entropy compression methods on a time-frequency representation of the sound obtained with a filter bank (Painter and Spanias, 2000). The fact that these methods can obtain relatively high compression factors demonstrates that the redundancy in sounds like human music is well captured in the amplitude envelopes of the sound. Since the entropy compression is performed for short segments of time (and therefore optimized for the statistical redundancy at that point in time), it is not clear how the compression could be improved by prior knowledge of the average modulation spectra. On the other hand, information from the modulation spectra could be used to choose appropriate time-frequency decompositions for different classes of sound. Similarly, knowledge of the differences in modulation spectra for speech versus environmental sounds could be of use for designing preprocessing modules in auditory prosthetics or hearing aids that maximize signal-to-noise ratios for particular signals and noises (see also Chi et al., 1999).

D. Implications for neural coding

Our principal interest is to generate a theoretical framework with which to study the processing of complex sounds in the auditory system of animals. Following the school of thought started by Barlow (1961), we argue that the auditory system has evolved to process behaviorally relevant sounds. For that reason, we expect that the neural representations and computations in the auditory system will be affected by the statistics of behaviorally relevant sounds. In particular, we postulate that the statistics of the spectro-temporal amplitude envelopes of the sound are important for auditory brain areas that are responsible for sound identity. Our results generate various testable hypotheses, which are, in some cases, at least qualitatively supported by psychoacoustical or physiological data.

1. Amplitude coding

The nature of the distributions of the amplitude envelopes leads to a first set of hypotheses. The relatively flat distribution obtained for the log of the amplitude of the envelopes for vocalizations suggests that, in order to discriminate among natural sounds by their amplitude level, an approximately logarithmic amplitude-response curve should be used, as described by Weber's law. Although the psychoacoustical literature on the subject of sound loudness is complex, it is generally accepted that a power law with a coefficient around 0.6 relates loudness and sound pressure amplitude (Stevens, 1956). This power law is not as compressive as the log function, but it will "flatten out" the distribution of amplitudes. Similarly, a compressive nonlinearity is a common property of the auditory system, found both at the level of the basilar membrane (Schlauch et al., 1998; Ruggero and Rich, 1991) and in many auditory neurons (Sachs and Abbas, 1974; Palmer and Evans, 1982; Phillips, 1990). In addition, synthetic stimuli that have the natural amplitude distribution were shown to increase the coding efficiency of auditory neurons in the cat inferior colliculus (Attias and Schreiner, 1998) and in auditory neurons of the grasshopper (Machens et al., 2001).
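As a toy illustration of this flattening effect (not an analysis from the paper; the exponential stand-in for envelope amplitudes and the kurtosis measure are our assumptions), one can check numerically that both a Stevens-like power law and a log transform reduce the peakedness of an exponential-like amplitude distribution:

import numpy as np

rng = np.random.default_rng(0)
amp = rng.exponential(scale=1.0, size=100_000)   # stand-in envelope amplitudes

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

print(excess_kurtosis(amp))                 # large and positive for raw amplitudes
print(excess_kurtosis(amp ** 0.6))          # reduced by the compressive power law
print(excess_kurtosis(np.log(amp + 1e-12))) # also reduced by the log transform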
2. Spectro-temporal modulation coding

The characteristic modulation spectrum of natural sounds leads to a second set of hypotheses on neural coding, in which the spectro-temporal receptive fields (STRFs) of auditory neurons would be matched to the modulation spectrum of behaviorally relevant sounds. This "matching" could take various forms.

In the simplest form, a matched-filter hypothesis, we would expect an approximate one-to-one match between the modulation spectrum of behaviorally relevant sounds and the ensemble modulation transfer function of an auditory processing stage. The modulation transfer function (MTF; also called the ripple transfer function or RTF) of a neuron is given by the amplitude of the 2-D Fourier transform of its STRF. The MTF is the equivalent of the gain response (or Bode plot) of one-dimensional filters and shows the spectro-temporal modulations in the sound that will drive the cell (see also Chi et al., 1999; Miller et al., 2002). Given that the modulation spectra of all natural sounds are concentrated in the low frequencies, we would expect to find a concentration of best temporal modulation and best spectral modulation tuning in the low frequencies. This concentration has been observed experimentally in the mammalian thalamus and cortex (Miller et al., 2002), although the quantification of the match has not yet been performed. A second line of evidence for neural tuning to the modulation spectrum of natural sounds was described in songbirds. In the auditory forebrain, neurons that are selective for conspecific song show the greatest responses to synthetic sounds that have modulation spectra similar to those of the natural vocalizations. Moreover, to obtain responses similar in strength to those to the natural sounds, matching the modulation spectrum was more critical than matching the frequency spectrum (Grace et al., 2003).
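Since the text defines the MTF directly as the amplitude of the 2-D Fourier transform of the STRF, the corresponding computation is essentially a one-liner; the sketch below (the array layout is assumed, not specified by the paper) simply re-centers the result so that the zero-modulation point sits in the middle:

import numpy as np

def modulation_transfer_function(strf):
    # strf: 2-D spectro-temporal receptive field indexed [frequency, time].
    # The MTF is the amplitude of its 2-D Fourier transform.
    return np.fft.fftshift(np.abs(np.fft.fft2(strf)))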
An additional prediction can be made from the power-law relationship between power and temporal modulation frequency that was observed in this analysis (Fig. 10): if one desires equal driving power for each neuron, then the approximate 1/ωt² relationship found for zebra finch song and speech requires the bandwidth of the temporal modulation tuning to be fixed in octave units. Recent data on the temporal MTF are also consistent with this hypothesis (Miller et al., 2002). The 1/ωt² relationship is reminiscent of the 1/f² power relationship found in natural images as a function of the spatial frequency, f (Field, 1987). For natural images, the 1/f² relationship implies that the second-order statistics of natural images are scale invariant. The equivalent statement for natural vocalizations is that the second-order temporal statistics of the amplitude envelopes are invariant to time compression or dilation. This mathematical relationship might explain the perceptual effect whereby sped-up vocalizations of a particular animal species often sound like the vocalizations of a different animal species.

A more complex form of matching between the modulation spectrum and the MTF of neurons is predicted if one theorizes a spectral whitening of the input space. In this framework, neurons effectively amplify stimulus regions of low stimulus power to generate a white response output for each neuron or for an ensemble of neurons. The result is that the entropy of the output is maximized and the transmission capacity of the neuronal channel is optimized from an information-theoretic perspective (Atick, 1992; van Hateren, 1992b). Experimental data in support of this theoretical framework have been found in the visual systems of insects (van Hateren, 1992a) and mammals (Dan et al., 1996). With this perspective, one would then expect asymmetric MTFs of neurons, with an amplification of the higher temporal modulation frequencies relative to the lower temporal modulations. In these models, the degree of amplification of the higher frequencies depends critically on the signal-to-noise ratio of the input to the neuron. For flat noise power, the theory predicts that at low signal-to-noise ratios the amplification of higher frequencies is reduced or nonexistent and the neurons are low-pass filters, whereas at higher signal-to-noise ratios the amplification comes into play and the neurons become band-pass filters (van Hateren, 1992b). Our data on the modulation spectra of natural sounds could be used to extend such noise analyses. If, for example, the role of an auditory area were to encode speech in a noisy background of environmental sounds, we would expect to find MTFs that amplify the areas of high signal-to-noise ratio and filter out the areas of low signal-to-noise ratio: more explicitly, MTFs would be tuned to the intermediate modulation frequencies that are present in speech but not so dominant in the environmental sounds. Interestingly, both psychophysically determined thresholds for ripple stimuli (Chi et al., 1999) and ensemble MTFs of neurons in the auditory thalamus and cortex of the cat (Miller et al., 2002) exhibit this type of band-pass filtering.
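The core of the whitening argument can be made concrete with a toy calculation (a deliberately simplified sketch, not the model of Atick or van Hateren): choose a gain g(ω) that flattens the total output power; with a natural-sound-like 1/ω² stimulus spectrum and a flat noise floor, the gain then rises with modulation frequency until the noise limits further amplification.

import numpy as np

w = np.linspace(1.0, 100.0, 200)       # temporal modulation frequency (Hz)
S = 1.0 / w ** 2                       # stimulus modulation power, ~1/w^2
for N in (1e-6, 1e-2):                 # low-noise vs high-noise input
    gain = np.sqrt(1.0 / (S + N))      # makes the output power g^2*(S+N) flat
    # Where S >> N the gain grows with w (weak high modulation frequencies are
    # amplified); once S << N the gain saturates near 1/sqrt(N).
    print(N, gain[0], gain[-1])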
E. Concluding remarks

In summary, the statistics of the envelopes of natural sounds are characteristically different from those of white-noise stimuli. This statistical structure allows us to make predictions on theories of auditory processing based on natural sound statistics. Current psychological and physiological data are, in a qualitative fashion, consistent with these predictions. But these theories remain hypothetical until experimental data are generated to prove or disprove them directly. Also, it is almost certain that different coding hypotheses will apply to different stages of the auditory system and that other biological and physical constraints will be of importance. Given the recent work in the analysis of natural sounds (Attias and Schreiner, 1997; Brillinger and Irizarry, 1998; Lewicki, 2002) and in the development of analytical tools to generate complex synthetic sounds and to extract auditory receptive fields from responses to such sounds or to natural sounds (Theunissen and Doupe, 1998; Klein et al., 2000; Theunissen et al., 2001; Escabi and Schreiner, 2002), we are now in a position where we can directly test this theoretical framework and further advance our understanding of the computations occurring in the auditory system for sound identification.

ACKNOWLEDGMENTS

The authors would like to thank Sarah Woolley, Noopur Amin, and Lee Miller for critical comments on a previous version of the manuscript. The paper was greatly improved by following the suggestions and addressing the criticisms of two anonymous reviewers. The work was funded by NIMH Grant Nos. MH-58189 and MH-66990 to FET.

Atick, J. (1992). "Could information theory provide an ecological theory of sensory processing?" Network 3, 213-251.
Attias, H., and Schreiner, C. E. (1997). "Temporal low-order statistics of natural sounds," Adv. Neural Info. Process. Syst. 9, 27-33.

Attias, H., and Schreiner, C. E. (1998). "Coding of naturalistic stimuli by auditory midbrain neurons," in Advances in Neural Information Processing Systems (MIT, Cambridge, MA).
Attneave, F. (1954). "Some informational aspects of visual perception," Psychol. Rev. 61, 183-193.
Barlow, H. B. (1961). "Possible principles underlying the transformation of sensory messages," in Sensory Communication, edited by W. A. Rosenbluth (MIT, Cambridge, MA), pp. 217-234.
Brillinger, D. R., and Irizarry, R. A. (1998). "An investigation of the second- and higher-order spectra of music," Signal Process. 65, 161-179.
Calhoun, B., and Schreiner, C. (1998). "Spectral envelope coding in cat primary auditory cortex: linear and non-linear effects of stimulus characteristics," Eur. J. Neurosci. 10, 926-940.
Chi, T., Gao, Y., Guyton, M. C., Ru, P., and Shamma, S. (1999). "Spectro-temporal modulation transfer functions and speech intelligibility," J. Acoust. Soc. Am. 106, 2719-2732.
Cohen, L. (1995). Time-Frequency Analysis (Prentice Hall, Englewood Cliffs, NJ).
Dan, Y., Atick, J. J., and Reid, R. C. (1996). "Efficient coding of natural scenes in the lateral geniculate nucleus: experimental test of a computational theory," J. Neurosci. 16, 3351-3362.
deCharms, R. C., Blake, D. T., and Merzenich, M. M. (1998). "Optimizing sound features for cortical neurons," Science 280, 1439-1443.
Depireux, D. A., Simon, J. Z., Klein, D. J., and Shamma, S. A. (2001). "Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex," J. Neurophysiol. 85, 1220-1234.
Dorman, M. F., Loizou, P. C., and Rainey, D. (1997). "Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs," J. Acoust. Soc. Am. 102, 2403-2411.
Drullman, R. (1995). "Temporal envelope and fine structure cues for speech intelligibility," J. Acoust. Soc. Am. 97, 585-592.
Drullman, R., Festen, J. M., and Plomp, R. (1994). "Effect of temporal envelope smearing on speech reception," J. Acoust. Soc. Am. 95, 1053-1064.
Eggermont, J. J. (2002). "Temporal modulation transfer functions in cat primary auditory cortex: separating stimulus effects from neural mechanisms," J. Neurophysiol. 87, 305-321.
Eggermont, J. J., Aertsen, A. M., and Johannesma, P. I. (1983). "Prediction of the responses of auditory neurons in the midbrain of the grass frog based on the spectro-temporal receptive field," Hear. Res. 10, 191-202.
Escabi, M. A., and Schreiner, C. E. (2002). "Nonlinear spectrotemporal sound analysis by neurons in the auditory midbrain," J. Neurosci. 22, 4114-4131.
Field, D. J. (1987). "Relations between the statistics of natural images and the response properties of cortical cells," J. Opt. Soc. Am. A 4, 2379-2394.
Flanagan, J. L. (1980). "Parametric coding of speech spectra," J. Acoust. Soc. Am. 68, 412-419.
Grace, J. A., Amin, N., Singh, N. C., and Theunissen, F. E. (2003). "Selectivity for conspecific song in the zebra finch auditory forebrain," J. Neurophysiol. 89, 472-487.
Green, D. (1986). "Frequency and the detection of spectral shape change," in Auditory Frequency Selectivity, edited by B. C. Moore and R. Patterson (Plenum, Cambridge), pp. 351-359.
Griffin, D., and Lim, J. (1984). "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process. 32, 236-242.
Klein, D. J., Depireux, D. A., Simon, J. Z., and Shamma, S. A. (2000). "Robust spectro-temporal reverse correlation for the auditory system: Optimizing stimulus design," J. Comput. Neurosci. 9, 85-111.
Lewicki, M. S. (2002). "Efficient coding of natural sounds," Nat. Neurosci. 5, 356-363.
Lohr, B., and Dooling, R. J. (1998). "Detection of changes in timbre and harmonicity in complex sounds by zebra finches (Taeniopygia guttata) and budgerigars (Melopsittacus undulatus)," J. Comp. Psychol. 112, 36-47.
Machens, C. K., Stemmler, M. B., Prinz, P., Krahe, R., Ronacher, B., and Herz, A. V. (2001). "Representation of acoustic communication signals by insect auditory receptor neurons," J. Neurosci. 21(9), 3215-3227.
Margoliash, D. (1983). "Acoustic parameters underlying the responses of song-specific neurons in the white-crowned sparrow," J. Neurosci. 3, 1039-1057.
Miller, L. M., Escabi, M. A., Read, H. L., and Schreiner, C. E. (2002). "Spectrotemporal receptive fields in the lemniscal auditory thalamus and cortex," J. Neurophysiol. 87, 516-527.
Newman, J., and Wollberg, Z. (1978). "Multiple coding of species-specific vocalizations in the auditory cortex of squirrel monkeys," Brain Res. 54, 287-304.
Paez, M. D., and Glisson, T. H. (1972). "Minimum mean-squared-error quantization in speech PCM and DPCM systems," IEEE Trans. Commun. COM-20(2), 225-230.
Painter, T., and Spanias, A. (2000). "Perceptual coding of digital audio," Proc. IEEE 88, 451-513.
Palmer, A. R., and Evans, E. F. (1982). "Intensity coding in the auditory periphery of the cat: responses of cochlear nerve and cochlear nucleus neurons to signals in the presence of bandstop masking noise," Hear. Res. 7, 305-323.
Phillips, D. P. (1990). "Neural representation of sound amplitude in the auditory cortex: effects of noise masking," Behav. Brain Res. 37, 197-214.
Phillips, D. P., and Hall, S. E. (1987). "Responses of single neurons in cat auditory cortex to time-varying stimuli: linear amplitude modulations," Exp. Brain Res. 67, 479-492.
Popper, A. N., and Fay, R. R. (1992). The Mammalian Auditory Pathway: Neurophysiology (Springer-Verlag, New York).
Rieke, F., Bodnar, D. A., and Bialek, W. (1995). "Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents," Proc. R. Soc. London, Ser. B 262, 259-265.
Ruggero, M. A., and Rich, N. C. (1991). "Application of a commercially-manufactured Doppler-shift laser velocimeter to the measurement of basilar-membrane vibration," Hear. Res. 51, 215-230.
Sachs, M. B., and Abbas, P. J. (1974). "Rate versus level functions for auditory-nerve fibers in cats: tone-burst stimuli," J. Acoust. Soc. Am. 56, 1835-1847.
Schlauch, R. S., DiGiovanni, J. J., and Ries, D. T. (1998). "Basilar membrane nonlinearity and loudness," J. Acoust. Soc. Am. 103, 2010-2020.
Schreiner, C. E., and Calhoun, B. M. (1994). "Spectral envelope coding in cat primary auditory cortex: properties of ripple transfer functions," Aud. Neurosci. 1, 39-61.
Sen, K., Theunissen, F. E., and Doupe, A. J. (2001). "Feature analysis of natural sounds in the songbird auditory forebrain," J. Neurophysiol. 86, 1445-1458.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). "Speech recognition with primarily temporal cues," Science 270, 303-304.
Simoncelli, E. P., and Olshausen, B. A. (2001). "Natural image statistics and neural representation," Annu. Rev. Neurosci. 24, 1193-1216.
Slaney, M. (1994). "An introduction to auditory model inversion," Interval Technical Report IRC1994-014.
Stevens, S. S. (1956). "The direct estimation of sensory magnitudes: loudness," Am. J. Psychol. 69, 1-25.
Suga, N., O'Neill, W. E., and Manabe, T. (1978). "Cortical neurons sensitive to combinations of information-bearing elements of biosonar signals in the moustache bat," Science 200, 778-781.
Theunissen, F. E., and Doupe, A. J. (1998). "Temporal and spectral sensitivity of complex auditory neurons in the nucleus HVc of male zebra finches," J. Neurosci. 18, 3786-3802.
Theunissen, F. E., Sen, K., and Doupe, A. J. (2000). "Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds," J. Neurosci. 20, 2315-2331.
Theunissen, F. E., David, S. V., Singh, N. C., Hsu, A., Vinje, W., and Gallant, J. L. (2001). "Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli," Network Comput. Neural Syst. 12, 1-28.
Tyler, R. S., Preece, J. P., and Tye-Murray, K. (1990). "Iowa Audiovisual Speech Perception Tests," Department of Otolaryngology, The University of Iowa, Iowa City, IA 52242.
van Hateren, J. H. (1992a). "Theoretical predictions of spatiotemporal receptive fields of fly LMCs, and experimental validation," J. Comp. Physiol. [A] 171, 157-170.
van Hateren, J. H. (1992b). "A theory of maximizing sensory information," Biol. Cybern. 68, 23-29.
Viemeister, N. F. (1979). "Temporal modulation transfer functions based upon modulation thresholds," J. Acoust. Soc. Am. 66, 1364-1380.
Voss, R. F., and Clarke, J. (1975). "1/f noise in music and speech," Nature (London) 258, 317-318.
