3-15, 2015.
Introduction
The information in sound is carried by variations in the air pressure over time,
which for many sound sources can be modelled as a superposition of sine wave
oscillations of different frequencies. To capture this information by auditory perception or signal processing, the sound signal has to be processed over some
non-infinitesimal amount of time and in the case of a spectral analysis also over
some range of frequencies. Such a region over time or over the spectro-temporal
domain is referred to as a temporal or spectro-temporal receptive field (Aertsen
and Johannesma [1]; Miller et al. [2]).
The subject of this article is to show how a principled theory for auditory
receptive fields can be developed based on scale-space theory. Our aim is to
express auditory operations that (i) are well localized over time and frequencies
and (ii) allow for well-founded handling of temporal phenomena that occur at
different temporal scales as well as (iii) receptive fields that operate over different
ranges of frequencies in such a way that operations over different ranges of
frequencies can be related in a well-defined manner.
Support from the Swedish Research Council contracts 2010-4766, 2012-4685 and
2014-4083, a KTH CSC Small Visionary Project and the EU project SkAT-VG
FET-Open grant 618067 is gratefully acknowledged.
When applied to the definition of a spectrogram, alternatively to the formulation of an idealized cochlea model, the scale-space approach can be used for
deriving the Gabor (Gabor [3]; Wolfe et al. [4]) and Gammatone (Johannesma
[5]; Patterson et al. [6]) approaches for computing local windowed Fourier transforms as specific cases of a complex-valued scale-space transform over different
frequencies. In addition, the scale-space approach to defining spectrograms leads
to a new family of generalized Gammatone filters, where the time constants of
the individual first-order integrators coupled in cascade are not equal, as for regular Gammatone filters, but instead distributed logarithmically over temporal
scales. This allows for different trade-offs between, for example, the frequency selectivity of the spectrogram and the temporal delay of time-causal receptive fields.
When applied to a logarithmic transformation of the spectrogram, as motivated from the desire of handling sound signals of different strength (sound
pressure) in an invariant manner and with a logarithmic transformation of the
frequencies as motivated by the desire of enabling invariance properties under a
frequency shift, such as transposing a musical piece by one octave, the theory also
allows for the formulation of spectro-temporal receptive fields at higher levels
in the auditory hierarchy in terms of spectro-temporal derivatives of spectro-temporal smoothing operations as obtained from scale-space theory.
Such second-layer receptive fields can be used for (i) computing basic auditory features such as onset detection, partial tone enhancement and formants,
and (ii) generating predictions of auditory receptive fields qualitatively similar to
biological receptive fields as measured by cell recordings in the inferior colliculus
(ICC) and the primary auditory cortex (A1) (Miller et al. [2]; Qiu et al. [7];
Elhilali et al. [8]; Atencio and Schreiner [9]).
In this concise summary of the theory, we emphasize the scale-space aspects
of auditory receptive fields. A more extensive treatment is given in [10].
Multi-scale spectrograms
A basic question in this context concerns how to choose the window function.
Would any choice of window function w do? Specifically, how long should the
effective integration time be? A priori there may be no principled reason
for preferring a particular duration of the temporal window function for the
windowed Fourier transform over some other temporal duration. Specifically,
different temporal durations may be appropriate for different auditory tasks, such
as a preference for a short temporal duration for onset detection and a preference
for a longer temporal duration to separate sounds with nearby frequencies.
If we apply a scale-space approach to this problem and associate a temporal
window scale $\tau$ with any spectrogram, let us require that we should be able
to relate spectrograms computed for different temporal window sizes between
scales. If we assume a continuum of temporal window scales, then a semi-group
structure $w(\cdot;\ \tau_2) = w(\cdot;\ \tau_2 - \tau_1) * w(\cdot;\ \tau_1)$ on the window functions implies a
cascade property between the spectrograms
$$S(\cdot, \omega;\ \tau_2) = w(\cdot;\ \tau_2 - \tau_1) * S(\cdot, \omega;\ \tau_1). \tag{2}$$
If we instead assume a discrete set of temporal window scales, with each temporal
window function $w(\cdot;\ \tau_n)$ at a coarser scale defined as the composition of a
set of primitive temporal window functions $(\Delta w)(\cdot;\ k)$ such that $w(\cdot;\ \tau_n) = *_{k=1}^{n} (\Delta w)(\cdot;\ k)$,
then we obtain a Markov property of the following type
$$S(\cdot, \omega;\ \tau_n) = (\Delta w)(\cdot;\ m \mapsto n) * S(\cdot, \omega;\ \tau_m). \tag{3}$$
For pre-recorded sound signals we may in principle take the liberty of accessing
the virtual future in relation to any time moment. For real-time audio processing
or when modelling biological auditory perception there is on the other hand
no way to access the future. For real-time audio models, the temporal window
functions must therefore be time-causal such that $w(t;\ \tau) = 0$ for $t < 0$.
In the case of non-causal time and a continuum of temporal window scales, let
us assume that the window functions in addition should guarantee non-creation
of new structure in the sense of non-enhancement of local extrema in either of
the real or purely imaginary channels. Then, it follows from general results in
(Lindeberg [11], eq. (45)) that the temporal window function must be Gaussian
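$$g(t;\ \tau) = \frac{1}{\sqrt{2\pi\tau}}\, e^{-t^2/(2\tau)}. \tag{4}$$
In the time-causal case, the temporal window function is instead obtained by
coupling truncated exponential kernels in cascade
$$h_{composed}(\cdot;\ \mu) = *_{k=1}^{K} h_{exp}(\cdot;\ \mu_k) \tag{5}$$
where $\mu = (\mu_1, \dots, \mu_K)$ and
$$h_{exp}(t;\ \mu_k) = \begin{cases} \frac{1}{\mu_k}\, e^{-t/\mu_k} & t \geq 0 \\ 0 & t < 0 \end{cases} \tag{6}$$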
Thereby the convolution kernels in temporal scale spaces for a general time-varying
signal are used as scale-dependent window functions for defining windowed
Fourier transforms of different temporal extent. Specifically, this scale-space
approach allows for the definition of windowed Fourier transforms for all
temporal extents in such a way that a windowed Fourier transform at any coarse
temporal scale can be related to a windowed Fourier transform at any finer temporal
scale using the cascade property (2) or the Markov property (3) derived
from the underlying scale-space kernels. Combined with the additional scale-space
property of non-creation of new structures with increasing scale, this
guarantees well-founded theoretical properties between corresponding windowed
Fourier transforms at different temporal scales.
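As a hedged numerical illustration of the cascade property (2), the following sketch computes one frequency channel of a Gaussian-windowed spectrogram at two temporal scales and checks that the coarser channel is reproduced by smoothing the finer one with a Gaussian of scale $\tau_2 - \tau_1$. It assumes the windowed Fourier transform $S(t, \omega;\ \tau) = \int f(t')\, w(t - t';\ \tau)\, e^{-i\omega t'}\, dt'$; the function names, the sampling rate and the test signal are illustrative choices and not taken from the text.

```python
import numpy as np

def gaussian_kernel(tau, dt, n_sigmas=6.0):
    """Sampled Gaussian window g(t; tau) with variance tau, normalized to unit sum."""
    half = int(np.ceil(n_sigmas * np.sqrt(tau) / dt))
    t = np.arange(-half, half + 1) * dt
    g = np.exp(-t**2 / (2.0 * tau))
    return g / g.sum()

def spectrogram_channel(f, omega, tau, dt):
    """One channel S(., omega; tau): convolve f(t) exp(-i omega t) with g(.; tau)."""
    t = np.arange(len(f)) * dt
    demodulated = f * np.exp(-1j * omega * t)
    return np.convolve(demodulated, gaussian_kernel(tau, dt), mode="same")

def cascade_error(dt=1.0 / 8000.0, tau1=(2e-3) ** 2, tau2=(4e-3) ** 2):
    """Relative deviation between S(.; tau2) and g(.; tau2 - tau1) * S(.; tau1)."""
    t = np.arange(4000) * dt
    # Toy signal (assumption): two nearby sine waves.
    f = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 470.0 * t)
    omega = 2 * np.pi * 440.0
    direct = spectrogram_channel(f, omega, tau2, dt)
    fine = spectrogram_channel(f, omega, tau1, dt)
    cascaded = np.convolve(fine, gaussian_kernel(tau2 - tau1, dt), mode="same")
    interior = slice(500, -500)  # skip samples influenced by zero padding at the ends
    diff = np.max(np.abs(direct[interior] - cascaded[interior]))
    return diff / np.max(np.abs(direct[interior]))

if __name__ == "__main__":
    print(f"relative cascade error: {cascade_error():.2e}")
```

The sampled Gaussian obeys the semi-group property only approximately, so the two results agree up to a small discretization and truncation error rather than exactly.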
Relations to Gabor functions. By rewriting the expressions (1) and (4) for the
complex-valued spectrogram based on the Gaussian temporal scale space as
$$S_g(t, \omega;\ \tau) = e^{-i\omega t} \int_{t'=-\infty}^{\infty} g(t - t';\ \tau)\, e^{i\omega(t - t')} f(t')\, dt' \tag{7}$$
it can be seen that up to a phase shift this multi-scale spectrogram can equivalently be interpreted as the convolution of the original auditory signal $f$ by
Gabor functions [3] of the form
$$G(t, \omega;\ \tau) = g(t;\ \tau)\, e^{i\omega t}. \tag{8}$$
Such Gabor functions have been previously used for analyzing auditory signals
by several authors, including Wolfe et al. [4] and Heckmann et al. [13].
Relations to Gammatone filters. In the special case when the time constants
of the $K$ truncated exponential filters that are coupled in cascade are all equal,
$\mu_k = \mu$, the multi-scale spectrogram defined by (1) and (5) is given by [10]
$$S_h(t, \omega;\ \mu, K) = e^{-i\omega t} \int_{t'=-\infty}^{t} \frac{(t - t')^{K-1}\, e^{-(t - t')/\mu}}{\mu^K\, \Gamma(K)}\, e^{i\omega(t - t')} f(t')\, dt' \tag{9}$$
and does up to a phase shift correspond to convolution of the input signal $f$ by
filters of the form (for $t \geq 0$)
$$h_{cos}(t, \omega;\ \mu, K) = \frac{t^{K-1}\, e^{-t/\mu}}{\mu^K\, \Gamma(K)} \cos \omega t, \tag{10}$$
$$h_{sin}(t, \omega;\ \mu, K) = \frac{t^{K-1}\, e^{-t/\mu}}{\mu^K\, \Gamma(K)} \sin \omega t. \tag{11}$$
For comparison, the Gammatone filter with parameters $a$ and $b$ and frequency
$f_c$ is defined according to $\gamma(t) = a\, t^{n-1}\, e^{-2\pi b t} \cos(2\pi f_c t + \phi)$. By identifying the
parameters $a = 1/(\mu^K\, \Gamma(K))$, $b = 1/(2\pi\mu)$ and $\omega = 2\pi f_c$ (with order $n = K$), it follows that we can
derive the Gammatone filter as a special case of applying a time-causal scale-space
representation with discrete scale levels to the projections $f \cos \omega t$ and
$f \sin \omega t$ of an auditory signal $f(t)$ onto a complex sine wave $e^{i\omega t}$.
Gammatone filter banks are also commonly used in audio processing (Johannesma [5]; Patterson et al. [6]; Ngamkham et al. [14]).
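The relation between cascaded truncated exponentials and Gammatone filters can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it approximates the cascade of K equal truncated exponential kernels by Riemann-sum convolution, compares it with the closed form $t^{K-1} e^{-t/\mu}/(\mu^K \Gamma(K))$, and confirms that the cosine-modulated kernel coincides with a Gammatone filter for $a = 1/(\mu^K \Gamma(K))$, $b = 1/(2\pi\mu)$, $n = K$ and zero phase. All function names and numerical values are hypothetical.

```python
import math
import numpy as np

def truncated_exponential(t, mu):
    """h_exp(t; mu) = exp(-t/mu)/mu for t >= 0 and 0 otherwise."""
    return np.where(t >= 0, np.exp(-np.maximum(t, 0.0) / mu) / mu, 0.0)

def cascade_equal(t, mu, K, dt):
    """Riemann-sum approximation of K equal truncated exponentials in cascade."""
    h = truncated_exponential(t, mu)
    out = h.copy()
    for _ in range(K - 1):
        out = np.convolve(out, h)[: len(t)] * dt
    return out

def gamma_kernel(t, mu, K):
    """Closed form t^(K-1) exp(-t/mu) / (mu^K Gamma(K)) for t >= 0."""
    return t ** (K - 1) * np.exp(-t / mu) / (mu**K * math.gamma(K))

def gammatone(t, a, b, fc, n, phi=0.0):
    """Gammatone filter a t^(n-1) exp(-2 pi b t) cos(2 pi fc t + phi)."""
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)

mu, K, dt = 5e-3, 4, 2e-5          # illustrative values, in seconds
t = np.arange(4000) * dt           # time axis, t >= 0 only
numeric = cascade_equal(t, mu, K, dt)
closed = gamma_kernel(t, mu, K)
# Deviation relative to the kernel peak; limited by the Riemann-sum discretization.
rel_err = np.max(np.abs(numeric - closed)) / closed.max()

omega = 2 * np.pi * 1000.0         # illustrative angular frequency
h_cos = closed * np.cos(omega * t)
gt = gammatone(t, a=1.0 / (mu**K * math.gamma(K)), b=1.0 / (2 * np.pi * mu),
               fc=omega / (2 * np.pi), n=K)
```

With these settings the numerically cascaded kernel stays within a few percent of the closed form, and `h_cos` and `gt` agree to floating-point precision.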
(12)
(13)
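With a logarithmic distribution of the temporal scale levels, the scale levels
and the time constants of the individual filters are given by
$$\tau_k = c^{2(k-K)}\, \tau_{\max} \quad (1 \leq k \leq K) \tag{14}$$
$$\mu_1 = c^{1-K} \sqrt{\tau_{\max}} \tag{15}$$
$$\mu_k = \sqrt{\Delta\tau_k} = \sqrt{\tau_k - \tau_{k-1}} = c^{k-K-1} \sqrt{c^2 - 1}\, \sqrt{\tau_{\max}} \quad (2 \leq k \leq K) \tag{16}$$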
By comparing graphs of the underlying temporal scale-space kernels [16], one
finds that filters based on truncated exponentials with a logarithmic distribution
of the intermediate temporal scales allow for a faster temporal response compared to the corresponding filters based on truncated exponentials with equal
time constants. Thereby, these generalized Gammatone filters allow for additional degrees of freedom to obtain different trade-offs between the frequency
selectivity and the temporal delay of time-causal window functions by varying
the number of levels $K$ and the distribution parameter $c$ for a given $\tau_{\max}$.
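The trade-off can be made concrete with a small calculation. The sketch below assumes the logarithmic distribution $\tau_k = c^{2(k-K)} \tau_{\max}$ with $\mu_k = \sqrt{\tau_k - \tau_{k-1}}$ (taking $\tau_0 = 0$), and uses the sum of the time constants, which is the temporal mean of the composed kernel, as a rough proxy for the temporal delay. That proxy, the function names and the parameter values are assumptions made for illustration.

```python
import numpy as np

def log_time_constants(tau_max, K, c):
    """Time constants mu_k = sqrt(tau_k - tau_{k-1}) for tau_k = c^(2(k-K)) tau_max."""
    k = np.arange(1, K + 1)
    tau = c ** (2.0 * (k - K)) * tau_max
    return np.sqrt(np.diff(tau, prepend=0.0))

def equal_time_constants(tau_max, K):
    """Equal time constants with the same total variance sum(mu_k^2) = tau_max."""
    return np.full(K, np.sqrt(tau_max / K))

tau_max, K, c = 1.0, 4, 2.0        # illustrative values
mu_log = log_time_constants(tau_max, K, c)
mu_eq = equal_time_constants(tau_max, K)

# Each truncated exponential has mean mu_k and variance mu_k^2; both add under
# convolution, so the sums below describe the composed kernels.
delay_log = mu_log.sum()
delay_eq = mu_eq.sum()
```

Both sets of time constants give the same composed variance $\tau_{\max}$, while the logarithmic distribution gives the smaller summed time constant, in line with the faster temporal response described above.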
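Frequency-dependent window scale. To guarantee basic covariance properties of
the spectrogram under a frequency shift $\omega \mapsto \gamma\,\omega$ (a rescaling of the frequency by a constant factor $\gamma$), it is natural to let the temporal
window scale $\tau$ vary with the frequency $\omega$ in such a way that the temporal window
scale in units of $\sigma = \sqrt{\tau}$ is proportional to the wavelength $\lambda = 2\pi/\omega$, so that $\sigma = n\,\lambda$ for a constant $n$:
$$\tau = \left( \frac{2\pi n}{\omega} \right)^2 \tag{17}$$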
Given that a spectrogram has been computed by a first layer of auditory receptive
fields, we define a second layer of receptive fields by operating on the spectrogram
with 2-D spectro-temporal filters in a structurally similar way as visual receptive
fields are applied to time-varying visual input (see overview in Lindeberg [17]).
3.1
(19)
then the influence on the receptive field responses of the constants $a$ and $R$
$$\mathcal{A}\, S_{dB} = \partial_{t^\alpha} \partial_{\nu^\beta} T\, (S_{dB} + 20 \log_{10} a - 20 \log_{10} R) = \partial_{t^\alpha} \partial_{\nu^\beta} T\, S_{dB} + 0 + 0 \tag{20}$$
will be eliminated if the constants $a$ and $R$ do not depend on time $t$ or the
logarithmic frequency $\nu$, implying invariance of the second-layer receptive field
responses to variations in the sound pressure or the distance to a sound source.
Since logarithmic frequencies constitute a natural metric for relating frequencies of sound and there is an approximately logarithmic distribution of frequencies both on the basilar membrane and in the auditory cortex, it is natural to
express these derived receptive fields in terms of logarithmic frequencies
$$\nu = \nu_0 + C \log \frac{\omega}{\omega_0} \tag{21}$$
for some constants $C$ and $\omega_0$, where specifically $\nu_0 = 69$, $C = 12/\log 2$ and
$\omega_0 = 2\pi \cdot 440$ correspond to the MIDI standard.
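A minimal sketch of this logarithmic frequency mapping with the MIDI constants is given below. The function and constant names are hypothetical; only the values 69, 12/log 2 and 2π·440 come from the text.

```python
import math

NU0 = 69.0                          # MIDI note number of the reference tone
C = 12.0 / math.log(2.0)            # 12 semitones per octave
OMEGA0 = 2.0 * math.pi * 440.0      # angular frequency of 440 Hz

def log_frequency(omega):
    """Logarithmic frequency nu = nu0 + C log(omega / omega0), in semitones."""
    return NU0 + C * math.log(omega / OMEGA0)

def angular_frequency(nu):
    """Inverse mapping from logarithmic frequency nu back to omega."""
    return OMEGA0 * math.exp((nu - NU0) / C)

if __name__ == "__main__":
    for hz in (220.0, 440.0, 880.0):
        print(hz, log_frequency(2.0 * math.pi * hz))
```

Doubling the frequency adds 12 to $\nu$, so a transposition by one octave becomes a pure translation along the logarithmic frequency axis.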
This logarithmic parameterization implies that a shift in frequency, caused
by e.g. transposing a piece of music by one octave or varying the fundamental
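$$A(t, \nu;\ \Sigma) = \partial_{t^\alpha} \partial_{\nu^\beta} \left( g(\nu - vt;\ s)\, T(t;\ \tau) \right) \tag{22}$$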
where
- $\partial_{t^\alpha}$ represents a temporal derivative operator of order $\alpha$ with respect to time
  $t$, which could alternatively be replaced by a glissando-adapted temporal
  derivative of the form $\partial_{\bar{t}} = \partial_t + v\, \partial_\nu$,
- $\partial_{\nu^\beta}$ represents a logspectral derivative operator of order $\beta$ with respect to
  logarithmic frequency $\nu$,
- $T(t;\ \tau)$ represents a temporal smoothing kernel with temporal scale parameter
  $\tau$, which should either be (i) a temporal Gaussian kernel $g(t;\ \tau)$ (4) or
  (ii) the equivalent kernel $h_{composed}(t;\ \mu)$ according to (5) and corresponding
  to a set of truncated exponential kernels coupled in cascade,
- $g(\nu - vt;\ s)$ represents a Gaussian spectral smoothing kernel over logarithmic
  frequencies $\nu$ with logspectral scale parameter $s$ and $v$ representing a glissando
  parameter making it possible to adapt the receptive fields to variations
  in frequency $\nu' = \nu + vt$ over time, and
- the spectro-temporal covariance matrix $\Sigma$ in the left hand expression for
  spectro-temporal receptive fields comprises the temporal scale parameter
  $\tau$, the logspectral scale parameter $s$ and the glissando parameter $v$.
Thereby, the spectro-temporal receptive fields (22) constitute a combination of
a Gaussian scale-space concept over the logspectral dimension with purely temporal receptive fields obtained by either a non-causal Gaussian temporal scale
space or a time-causal scale space obtained by coupling truncated exponential
kernels/first-order integrators in cascade (see figure 2, columns 2-3).
The proofs concerning spectro-temporal receptive fields are similar to those
regarding spatio-temporal receptive fields over a 1+1-D spatio-temporal domain
with the spatial dimension replaced by a logspectral dimension.
Fig. 1. (top left) Spectrogram of a male voice that reads zero five four one (from
the TIDigits database) computed with generalized Gammatone functions. (top right)
Onset enhancement by first-order temporal derivatives. (bottom left) Enhancement of
partial tones by second-order logspectral derivatives using separable receptive fields.
(bottom right) Enhancement of partial tones by the maximum of second-order logspectral derivatives over a filter bank of glissando-adapted receptive fields. Note the better
ability of the glissando-adapted receptive fields to capture rapid frequency variations.
3.4
In the following, we will show examples of auditory features that can be defined
from a second layer of auditory receptive fields of this form:
Onset enhancement. Computation of first-order temporal derivatives
$D_t(t, \nu;\ \tau, s) = \sqrt{\tau}\, \partial_t T(t, \nu;\ \tau, s)$, where $\sqrt{\tau}$ is a scale normalization factor to approximate
scale-normalized derivatives (Lindeberg [18]). To select receptive field responses that
correspond to onsets only, we add the non-linear logical operation $D_t > 0$ such
that $A_{onset} S_{dB} = D_t S_{dB}$ if $D_t S_{dB} > 0$ and 0 otherwise (see figure 1, top right).
Enhancement of partials. Computation of second-order logspectral derivatives
$D_{\nu\nu}(t, \nu;\ \tau, s) = s\, \partial_{\nu\nu} T(t, \nu;\ \tau, s)$, where the factor $s$ is a scale normalization
factor for scale-normalized derivatives in the Gaussian scale space (Lindeberg [18]).
Depending on the value of the logspectral scale parameter $s$, this operation may
either enhance partial tones or formants. This operation is naturally combined
with the (non-linear) logical operation $D_{\nu\nu} < 0$ such that $A_{band} S_{dB} = D_{\nu\nu} S_{dB}$
if $D_{\nu\nu} S_{dB} < 0$ and 0 otherwise (see figure 1, bottom left).
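The two feature operations can be sketched on a synthetic logarithmic spectrogram. The code below is a simplified illustration, not the authors' implementation: it uses a separable, non-causal Gaussian smoothing with edge replication, finite-difference derivatives, and a hypothetical test pattern (a single partial at $\nu = 60$ that starts halfway through). All names and parameter values are assumptions.

```python
import numpy as np

def smooth_along(x, variance, step, axis):
    """Gaussian smoothing along one axis, with edge replication at the borders."""
    sigma = np.sqrt(variance) / step            # standard deviation in samples
    half = int(np.ceil(4.0 * sigma))
    k = np.arange(-half, half + 1)
    kernel = np.exp(-k**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()
    pad = [(0, 0)] * x.ndim
    pad[axis] = (half, half)
    padded = np.pad(x, pad, mode="edge")
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)

def second_layer_features(S_dB, dt, dnu, tau, s):
    """Return (onset map, partial-tone map) from a dB spectrogram indexed [nu, t]."""
    T = smooth_along(smooth_along(S_dB, tau, dt, axis=1), s, dnu, axis=0)
    D_t = np.sqrt(tau) * np.gradient(T, dt, axis=1)              # scale-normalized d/dt
    D_nunu = s * np.gradient(np.gradient(T, dnu, axis=0), dnu, axis=0)
    onset = np.where(D_t > 0, D_t, 0.0)       # keep rising responses only
    band = np.where(D_nunu < 0, D_nunu, 0.0)  # keep ridge-like responses only
    return onset, band

# Synthetic dB spectrogram (assumption): -60 dB floor plus one partial at nu = 60
# that switches on at time index 100.
dt, dnu = 0.005, 0.25                          # seconds, semitones
nu = np.linspace(40.0, 80.0, 161)
n_t, onset_index = 200, 100
active = (np.arange(n_t) >= onset_index).astype(float)
S_dB = -60.0 + 40.0 * np.exp(-(nu[:, None] - 60.0) ** 2 / (2 * 0.5**2)) * active[None, :]

onset, band = second_layer_features(S_dB, dt, dnu, tau=0.02**2, s=0.5**2)
# Adding a constant offset in dB (a change in sound pressure) should not matter.
onset_shifted, band_shifted = second_layer_features(S_dB + 6.0, dt, dnu, tau=0.02**2, s=0.5**2)
```

In this toy case the onset map peaks at the start of the partial and the partial-tone map is most negative along the ridge at $\nu = 60$; both maps are unchanged by the constant dB offset, which illustrates the invariance to sound pressure.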
In the central nucleus of the inferior colliculus (ICC) of cats, Qiu et al. [7]
report that about 60 % of the neurons can be described as separable in the
time-frequency domain (see figure 2, top row), whereas the remaining neurons
are either obliquely oriented (see figure 2, second row) or contain multiple excitatory/inhibitory subfields. This overall structure is nicely compatible with the
treatment in section 3.4, where the second-layer receptive fields are expressed in
terms of spectro-temporal derivatives of either time-frequency separable spectro-temporal smoothing operations or corresponding glissando-adapted features as
motivated by the structural requirements in section 3.2.
Qualitatively similar shapes of receptive fields can be measured from neurons
in the primary auditory cortex (see figure 2, third row, as well as Miller et al.
[2] regarding binaural receptive fields). Specifically, the use of multiple temporal
and spectral scales as a main component in the model is in good agreement with
biological receptive fields having different degrees of spectral tuning ranging from
narrow to broad and different temporal extent (see figure 2, rows 4-5).
We have presented a theory for how idealized models of auditory receptive fields
can be derived from structural constraints (scale-space axioms) on the first stages
of auditory processing. The theory includes (i) the definition of multi-scale spectrograms at different temporal scales in such a way that a spectrogram at any
coarser temporal scale can be related to a corresponding spectrogram at any
finer temporal scale using theoretically well-defined scale-space operations, and
additionally (ii) how a second layer of spectro-temporal receptive fields can be
defined over a logarithmically transformed spectrogram in such a way that the
resulting spectro-temporal receptive fields obey invariance or covariance properties under natural sound transformations including temporal shifts, variations in
the sound pressure, the distance between the sound source and the observer, a
shift in the frequencies of auditory stimuli or glissando transformations. Specifically, theoretical arguments have been presented showing how these idealized
receptive fields are constrained to the presented forms from symmetry properties
of the environment in combination with assumptions about the internal structure of auditory operations as motivated from requirements of handling different
temporal and spectral scales in a theoretically well-founded manner.
We propose that this theory should be of wide general interest for the audio
processing community by providing theoretically well-founded and provably
invariant/covariant audio operations for processing sound signals and, for
computational modelling or measurements of receptive fields, auditory invariances,
theoretical biology and psychophysics, by serving as a general theoretical
foundation for understanding how receptive fields in ICC and A1 support invariant
auditory processes at higher levels in the auditory hierarchy.
[Figure 2: panels of biological receptive fields (including "A1 receptive field", "Broadly tuned A1 RF" and "Narrowly tuned A1 RF") next to the corresponding "Time-causal model" and "Gaussian model" panels; axes: Time (ms), Frequency (kHz) and Log Frequency (semitones).]
Fig. 2. (top row left) A separable monaural spectro-temporal receptive field in the
central nucleus of the inferior colliculus (ICC) of cat as reported by Qiu et al. [7].
(second row left) A non-separable spectro-temporal receptive field in the central nucleus
of the inferior colliculus (ICC) of cat as reported by Qiu et al. [7]. (third row left)
A separable spectro-temporal receptive field in the primary auditory cortex (A1) of
ferret as reported by Elhilali et al. [8]. (fourth and bottom rows left) Spectro-temporal
receptive fields of broadly and narrowly tuned neurons in the primary auditory cortex
(A1) of cats as reported by Atencio and Schreiner [9]. (middle and right columns) Time-causal
and non-causal receptive field models according to eq. (22). (Figures reprinted
from [10] with permission.)
References
1. Aertsen, A.M.H.J., Johannesma, P.I.M.: The spectro-temporal receptive field: A
functional characterization of auditory neurons. Biol. Cyb. 42 (1981) 133-143
2. Miller, L.M., Escabi, M.A., Read, H.L., Schreiner, C.: Spectrotemporal receptive
fields in the lemniscal auditory thalamus and cortex. J. Neurophys. 87 (2001)
516-527
3. Gabor, D.: Theory of communication. J. of the IEE 93 (1946) 429-457
4. Wolfe, P.J., Godsill, S.J., Dörfler, M.: Multi-Gabor dictionaries for audio time-frequency
analysis. Appl. of Signal Proc. to Audio and Acoustics. (2001) 43-46
5. Johannesma, P.I.M.: The pre-response stimulus ensemble of neurons in the cochlear
nucleus. In: IPO Symposium on Hearing Theory, Eindhoven, (1972) 58-69
6. Patterson, R.D., Nimmo-Smith, I., Holdsworth, J., Rice, P.: An efficient auditory
filterbank based on the gammatone function. In: A meeting of the IOC Speech
Group on Auditory Modelling at RSRE. Volume 2:7. (1987)
7. Qiu, A., Schreiner, C.E., Escabi, M.A.: Gabor analysis of auditory midbrain receptive
fields: Spectro-temporal and binaural composition. J. of Neurophysiology
90 (2003) 456-476
8. Elhilali, M., Fritz, J., Chi, T.S., Shamma, S.: Auditory cortical receptive fields:
Stable entities with plastic abilities. J. of Neuroscience 27 (2007) 10372-10382
9. Atencio, C.A., Schreiner, C.E.: Spectrotemporal processing in spectral tuning
modules of cat primary auditory cortex. PLOS ONE 7 (2012) e31537
10. Lindeberg, T., Friberg, A.: Idealized computational models of auditory receptive
fields. PLOS ONE 10(3):e0119032 (2015) 1-58, preprint at arXiv:1404.2037.
11. Lindeberg, T.: Generalized Gaussian scale-space axiomatics comprising linear
scale-space, affine scale-space and spatio-temporal scale-space. J. of Mathematical
Imaging and Vision 40 (2011) 36-81
12. Lindeberg, T., Fagerström, D.: Scale-space with causal time direction. In: European
Conf. on Computer Vision, Springer LNCS Vol. 1064 (1996) 229-240
13. Heckmann, M., Domont, X., Joublin, F., Goerick, C.: A hierarchical framework for
spectro-temporal feature extraction. Speech Communication 53 (2011) 736-752
14. Ngamkham, W., Sawigun, C., Hiseni, S., Serdijn, W.A.: Analog complex gammatone
filter for cochlear implant channels. In: ISCAS (2010) 969-972
15. Koenderink, J.J.: Scale-time. Biological Cybernetics 58 (1988) 159-162
16. Lindeberg, T.: Separable time-causal and time-recursive receptive fields. In: Scale
Space and Variational Methods in Computer Vision, Springer LNCS Vol. 9087
(2015) 90-102
17. Lindeberg, T.: A computational theory of visual receptive fields. Biological
Cybernetics 107 (2013) 589-635
18. Lindeberg, T.: Feature detection with automatic scale selection. Int. J. of Computer
Vision 30 (1998) 77-116