
Proc SSVM 2015: Scale Space and Variational Methods in Computer Vision, Springer LNCS vol 9087, pages 3-15, 2015.

Scale-space theory for auditory signals


Tony Lindeberg¹ and Anders Friberg²
¹ Department of Computational Biology, ² Department of Speech, Music and Hearing
School of Computer Science and Communication*
KTH Royal Institute of Technology, Stockholm, Sweden

Abstract. We show how the axiomatic structure of scale-space theory
can be applied to the auditory domain and be used for deriving idealized
models of auditory receptive fields via scale-space principles. For defining a time-frequency transformation of a purely temporal signal, it is
shown that the scale-space framework allows for a new way of deriving
the Gabor and Gammatone filters as well as a novel family of generalized
Gammatone filters with additional degrees of freedom to obtain different trade-offs between the spectral selectivity and the temporal delay of
time-causal window functions. Applied to the definition of a second layer
of receptive fields from the spectrogram, it is shown that the scale-space
framework leads to two canonical families of spectro-temporal receptive
fields, using a combination of Gaussian filters over the logspectral domain with either Gaussian filters or a cascade of first-order integrators
over the temporal domain. These spectro-temporal receptive fields can
be either separable over the time-frequency domain or be adapted to
local glissando transformations that represent variations in logarithmic
frequencies over time. Such idealized models of auditory receptive fields
respect auditory invariances, can be used for computing basic auditory
features for audio processing and lead to predictions about auditory receptive fields with good qualitative similarity to biological receptive fields
in the inferior colliculus (ICC) and the primary auditory cortex (A1).

1 Introduction

The information in sound is carried by variations in the air pressure over time,
which for many sound sources can be modelled as a superposition of sine wave
oscillations of different frequencies. To capture this information by auditory perception or signal processing, the sound signal has to be processed over some
non-infinitesimal amount of time and in the case of a spectral analysis also over
some range of frequencies. Such a region over time or over the spectro-temporal
domain is referred to as a temporal or spectro-temporal receptive field (Aertsen
and Johannesma [1]; Miller et al. [2]).
The subject of this article is to show how a principled theory for auditory
receptive fields can be developed based on scale-space theory. Our aim is to
express auditory operations that (i) are well localized over time and frequencies
and (ii) allow for well-founded handling of temporal phenomena that occur at
different temporal scales as well as (iii) receptive fields that operate over different
ranges of frequencies in such a way that operations over different ranges of
frequencies can be related in a well-defined manner.

* Support from the Swedish Research Council contracts 2010-4766, 2012-4685 and
2014-4083, a KTH CSC Small Visionary Project and the EU project SkAT-VG
FET-Open grant 618067 is gratefully acknowledged.

When applied to the definition of a spectrogram, alternatively to the formulation of an idealized cochlea model, the scale-space approach can be used for
deriving the Gabor (Gabor [3]; Wolfe et al. [4]) and Gamma-tone (Johannesma
[5]; Patterson et al. [6]) approaches for computing local windowed Fourier transforms as specific cases of a complex-valued scale-space transform over different
frequencies. In addition, the scale-space approach to defining spectrograms leads
to a new family of generalized Gamma-tone filters, where the time constants of
the individual first-order integrators coupled in cascade are not equal as for regular Gamma-tone filters but instead distributed logarithmically over temporal
scales and allowing for different trade-offs in terms of e.g. the frequency selectivity of the spectrogram and the temporal delay of time-causal receptive fields.
When applied to a logarithmic transformation of the spectrogram, as motivated from the desire of handling sound signals of different strength (sound
pressure) in an invariant manner and with a logarithmic transformation of the
frequencies as motivated by the desire of enabling invariance properties under a
frequency shift, such as transposing a musical piece by one octave, the theory also
allows for the formulation of spectro-temporal receptive fields at higher levels
in the auditory hierarchy in terms of spectro-temporal derivatives of spectro-temporal smoothing operations as obtained from scale-space theory.
Such second-layer receptive fields can be used for (i) computing basic auditory features such as onset detection, partial tone enhancement and formants,
and (ii) generating predictions of auditory receptive fields qualitatively similar to
biological receptive fields as measured by cell recordings in the inferior colliculus
(ICC) and the primary auditory cortex (A1) (Miller et al. [2]; Qiu et al. [7];
Elhilali et al. [8]; Atencio and Schreiner [9]).
In this concise summary of the theory, we emphasize the scale-space aspects
of auditory receptive fields. A more extensive treatment is given in [10].

2 Multi-scale spectrograms

To capture the frequency content in an auditory signal $f : \mathbb{R} \to \mathbb{R}$, the notion of
spectrograms or locally windowed Fourier transforms constitutes a natural tool
$$S(t, \omega; \tau) = \int_{t'=-\infty}^{\infty} f(t')\, e^{-i \omega t'}\, w(t - t'; \tau)\, dt'. \tag{1}$$

A basic question in this context concerns how to choose the window function.
Would any choice of window function w do? Specifically, how long should the
effective integration time be? A priori there may be no principled reason
for preferring a particular duration of the temporal window function for the
windowed Fourier transform over some other temporal duration. Specifically,
different temporal durations may be appropriate for different auditory tasks, such
as a preference for a short temporal duration for onset detection and a preference
for a longer temporal duration to separate sounds with nearby frequencies.
If we apply a scale-space approach to this problem and associate a temporal
window scale $\tau$ with any spectrogram, let us require that we should be able
to relate spectrograms computed for different temporal window sizes between
scales. If we assume a continuum of temporal window scales, then a semi-group
structure $w(\cdot; \tau_2) = w(\cdot; \tau_2 - \tau_1) * w(\cdot; \tau_1)$ on the window functions implies a
cascade property between the spectrograms
$$S(\cdot, \omega; \tau_2) = w(\cdot; \tau_2 - \tau_1) * S(\cdot, \omega; \tau_1). \tag{2}$$
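
As a short sketch of why the semi-group structure on the window functions transfers to the spectrograms, one can read the spectrogram (1) as a convolution of the window with the demodulated signal. The auxiliary symbol $h_\omega$ below is introduced here purely for illustration and is not notation from the original derivation:

```latex
\begin{aligned}
h_\omega(t) &= f(t)\, e^{-i \omega t},
\qquad
S(\cdot, \omega; \tau) = w(\cdot; \tau) * h_\omega, \\
S(\cdot, \omega; \tau_2)
  &= w(\cdot; \tau_2) * h_\omega \\
  &= \bigl( w(\cdot; \tau_2 - \tau_1) * w(\cdot; \tau_1) \bigr) * h_\omega \\
  &= w(\cdot; \tau_2 - \tau_1) * \bigl( w(\cdot; \tau_1) * h_\omega \bigr) \\
  &= w(\cdot; \tau_2 - \tau_1) * S(\cdot, \omega; \tau_1).
\end{aligned}
```

The only ingredients are the semi-group property of the window and the associativity of convolution.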

If we instead assume a discrete set of temporal window scales, with each temporal
window function $w(\cdot; \tau_n)$ at a coarser scale defined as the composition of a
set of primitive temporal window functions $(\Delta w)(\cdot; \tau_k)$ such that $w(\cdot; \tau_n) = *_{k=1}^{n} (\Delta w)(\cdot; \tau_k)$,
then we obtain a Markov property of the following type
$$S(\cdot, \omega; \tau_n) = (\Delta w)(\cdot; \tau_m \mapsto \tau_n) * S(\cdot, \omega; \tau_m). \tag{3}$$

For pre-recorded sound signals we may in principle take the liberty of accessing
the virtual future in relation to any time moment. For real-time audio processing
or when modelling biological auditory perception there is on the other hand
no way to access the future. For real-time audio models, the temporal window
functions must therefore be time-causal such that $w(t; \tau) = 0$ for $t < 0$.
In the case of non-causal time and a continuum of temporal window scales, let
us assume that the window functions in addition should guarantee non-creation
of new structure in the sense of non-enhancement of local extrema in either of
the real or purely imaginary channels. Then, it follows from general results in
(Lindeberg [11], eq. (45)) that the temporal window function must be Gaussian
$$g(t; \tau) = \frac{1}{\sqrt{2 \pi \tau}}\, e^{-(t - \delta)^2 / 2 \tau} \tag{4}$$

with = 0 and = 0 where we without loss of generality can set 0 = 1.


If we in the case of time-causal data and a discrete set of temporal window
scales assume that the temporal window functions should guarantee non-creation
of new structure in the sense of guaranteeing non-creation of new local extrema
in either of the real or purely imaginary channels, then it follows from general
results in (Lindeberg and Fagerström [12], eq. (8)) that the temporal window
functions should be given by a cascade of truncated exponential functions
$$h_{composed}(t; \mu) = *_{k=1}^{K} h_{exp}(t; \mu_k) \tag{5}$$
where $\mu = (\mu_1, \ldots, \mu_K)$ and
$$h_{exp}(t; \mu_k) = \begin{cases} \frac{1}{\mu_k}\, e^{-t/\mu_k} & t \geq 0 \\ 0 & t < 0 \end{cases} \tag{6}$$


Thereby the convolution kernels in temporal scale spaces for a general time-varying signal are used as scale-dependent window functions for defining windowed Fourier transforms of different temporal extent. Specifically, this scale-space approach allows for the definition of windowed Fourier transforms for all
temporal extents in such a way that a windowed Fourier transform at any coarse
temporal scale can be related to a windowed Fourier transform at any finer temporal scale using the cascade property (2) or the Markov property (3) derived
from the underlying scale-space kernels. Combined with the additional scale-space properties of non-creation of new structures with increasing scale, this
guarantees well-founded theoretical properties between corresponding windowed
Fourier transforms at different temporal scales.
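
To illustrate the time-causal case, the following minimal sketch implements a cascade of first-order integrators as recursive filters. The discretization (a backward-Euler update with coefficient `dt / (mu + dt)`), the function names and the numerical values are assumptions made here for illustration; the text above only specifies the continuous kernels (5) and (6).

```python
def first_order_integrator(x, mu, dt):
    """Recursive filter approximating convolution with a truncated exponential.

    Assumption: backward-Euler discretization of mu * y' + y = x, which gives
    unit DC gain and a non-negative, purely causal impulse response.
    """
    alpha = dt / (mu + dt)
    y, state = [], 0.0
    for sample in x:
        state += alpha * (sample - state)
        y.append(state)
    return y


def cascade(x, mus, dt):
    """Couple first-order integrators in cascade, one per time constant."""
    for mu in mus:
        x = first_order_integrator(x, mu, dt)
    return x


# Hypothetical example: unit impulse at sample 10, three time constants in seconds.
dt = 0.001
impulse = [0.0] * 2000
impulse[10] = 1.0
response = cascade(impulse, [0.005, 0.010, 0.020], dt)
```

The response is zero before the impulse (time-causality), never negative, and sums to approximately one, so that adding a further stage maps a finer temporal scale to a coarser one in the spirit of the Markov property (3).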
Relations to Gabor functions. By rewriting the expressions (1) and (4) for the
complex-valued spectrogram based on the Gaussian temporal scale space as
$$S_g(t, \omega; \tau) = e^{-i \omega t} \int_{t'=-\infty}^{\infty} g(t - t'; \tau)\, e^{i \omega (t - t')}\, f(t')\, dt' \tag{7}$$
it can be seen that up to a phase shift this multi-scale spectrogram can equivalently be interpreted as the convolution of the original auditory signal f by
Gabor functions [3] of the form
$$G(t, \omega; \tau) = g(t; \tau)\, e^{i \omega t}. \tag{8}$$

Such Gabor functions have been previously used for analyzing auditory signals
by several authors, including Wolfe et al. [4] and Heckmann et al. [13].
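
The equivalence between the windowed Fourier transform (1) and the Gabor convolution (7) can be checked numerically. The sketch below uses plain Riemann sums, assumes a zero-delay Gaussian window, and uses a hypothetical test signal (a 440 Hz sine sampled at 8 kHz); none of these choices come from the original text.

```python
import cmath
import math


def gaussian(t, tau):
    """Gaussian window g(t; tau) with variance tau (zero delay assumed)."""
    return math.exp(-t * t / (2.0 * tau)) / math.sqrt(2.0 * math.pi * tau)


def spectrogram_direct(f, dt, n, omega, tau):
    """Riemann-sum approximation of eq. (1) at time t = n * dt."""
    t = n * dt
    total = sum(f[m] * cmath.exp(-1j * omega * m * dt) * gaussian(t - m * dt, tau)
                for m in range(len(f)))
    return total * dt


def spectrogram_gabor(f, dt, n, omega, tau):
    """Eq. (7): convolve f with the Gabor function (8), then apply the phase factor."""
    t = n * dt
    conv = sum(gaussian(t - m * dt, tau) * cmath.exp(1j * omega * (t - m * dt)) * f[m]
               for m in range(len(f)))
    return cmath.exp(-1j * omega * t) * conv * dt


# Hypothetical test signal: 50 ms of a 440 Hz sine sampled at 8 kHz.
dt = 1.0 / 8000.0
signal = [math.sin(2.0 * math.pi * 440.0 * m * dt) for m in range(400)]
omega = 2.0 * math.pi * 440.0
tau = 0.005 ** 2  # window standard deviation of 5 ms

s_direct = spectrogram_direct(signal, dt, 200, omega, tau)
s_gabor = spectrogram_gabor(signal, dt, 200, omega, tau)
```

Both routes give the same complex value up to rounding, and its magnitude is close to 0.5, as expected for a unit-amplitude sine analysed at its own frequency.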
Relations to Gammatone filters. In the special case when the time constants
of the K truncated exponential filters that are coupled in cascade are all equal
$\mu_k = \mu$, then the multi-scale spectrogram defined by (1) and (5) is given by [10]
$$S_h(t, \omega; \mu, K) = e^{-i \omega t} \int_{t'=-\infty}^{t} \frac{(t - t')^{K-1}\, e^{-(t - t')/\mu}}{\mu^K\, \Gamma(K)}\, e^{i \omega (t - t')}\, f(t')\, dt' \tag{9}$$
and does up to a phase shift correspond to convolution of the input signal f by
filters of the form
$$h_{cos}(t, \omega; \mu, K) = \frac{t^{K-1}\, e^{-t/\mu}}{\mu^K\, \Gamma(K)} \cos \omega t, \tag{10}$$
$$h_{sin}(t, \omega; \mu, K) = \frac{t^{K-1}\, e^{-t/\mu}}{\mu^K\, \Gamma(K)} \sin \omega t. \tag{11}$$

For comparison, the Gammatone filter with parameters a and b and frequency
$\xi$ is defined according to $\gamma(t) = a\, t^{n-1} e^{-2 \pi b t} \cos(2 \pi \xi t + \varphi)$. By identifying the
parameters $a = 1/(\mu^K \Gamma(K))$, $b = 1/(2 \pi \mu)$ and $\omega = 2 \pi \xi$, it follows that we can
derive the Gammatone filter as a special case of applying a time-causal scale-space representation with discrete scale levels to the projections $f \cos \omega t$ and
$f \sin \omega t$ of an auditory signal $f(t)$ onto a complex sine wave $e^{i \omega t}$.
Gammatone filter banks are also commonly used in audio processing (Johannesma [5]; Patterson et al. [6]; Ngamkham et al. [14]).
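
The parameter identification can be verified numerically. In this sketch the function names, the variable `freq_hz` and the numerical values are illustrative assumptions; the filter order n is taken equal to K and the phase is set to zero, which is what matching the cosine filter (10) term by term implies.

```python
import math


def h_cos(t, omega, mu, K):
    """Cosine filter of eq. (10): K equal first-order integrators times cos(omega t)."""
    if t < 0:
        return 0.0
    envelope = t ** (K - 1) * math.exp(-t / mu) / (mu ** K * math.gamma(K))
    return envelope * math.cos(omega * t)


def gammatone(t, a, b, freq_hz, n, phase=0.0):
    """Standard Gammatone form a t^(n-1) exp(-2 pi b t) cos(2 pi freq t + phase)."""
    if t < 0:
        return 0.0
    return (a * t ** (n - 1) * math.exp(-2.0 * math.pi * b * t)
            * math.cos(2.0 * math.pi * freq_hz * t + phase))


# Hypothetical values: time constant 4 ms, order 4, centre frequency 1 kHz.
mu, K, freq_hz = 0.004, 4, 1000.0

# Identification of parameters stated in the text (with n = K, zero phase).
a = 1.0 / (mu ** K * math.gamma(K))
b = 1.0 / (2.0 * math.pi * mu)
omega = 2.0 * math.pi * freq_hz

max_diff = max(abs(h_cos(k * 1e-4, omega, mu, K) - gammatone(k * 1e-4, a, b, freq_hz, K))
               for k in range(500))
```

With these identifications the two expressions agree to within floating-point rounding over the sampled interval.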


Generalized Gammatone filters. By allowing for different time constants in the
primitive truncated exponential filters, we obtain generalized Gammatone filters
$$h_{cos}(t, \omega; \mu) = h_{composed}(t; \mu) \cos \omega t \tag{12}$$
$$h_{sin}(t, \omega; \mu) = h_{composed}(t; \mu) \sin \omega t \tag{13}$$

with $h_{composed}$ according to (5) and $\mu = (\mu_1, \ldots, \mu_K)$. If we have the freedom of
choosing the minimum temporal window scale $\tau_{min}$ freely, we can parameterize
the intermediate temporal scale levels using a parameter c > 1 such that [16]
$$\tau_k = c^{2(k-K)}\, \tau_{max} \quad (1 \leq k \leq K) \tag{14}$$
which shares some qualitative similarities to the logarithmic transformation of
the past used in the scale-time model proposed by Koenderink [15].
By the additive property of variances (which for a primitive truncated exponential filter (6) with time constant $\mu_k$ is given by $\mu_k^2$) under convolution this
implies that time constants of the individual first-order integrators will be [16]
$$\mu_1 = c^{1-K} \sqrt{\tau_{max}} \tag{15}$$
$$\mu_k = \sqrt{\tau_k - \tau_{k-1}} = c^{k-K-1} \sqrt{c^2 - 1}\, \sqrt{\tau_{max}} \quad (2 \leq k \leq K) \tag{16}$$
By comparing graphs of the underlying temporal scale-space kernels [16], one
finds that filters based on truncated exponentials with a logarithmic distribution
of the intermediate temporal scales allow for a faster temporal response compared to the corresponding filters based on truncated exponentials with equal
time constants. Thereby, these generalized Gammatone filters allow for additional degrees of freedom to obtain different trade-offs between the frequency
selectivity and the temporal delay of time-causal window functions by varying
the number of levels K and the distribution parameter c for a given $\tau_{max}$.
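
As a small worked example of the logarithmic distribution of scale levels, the sketch below computes the scale levels (14) and the time constants (15)-(16) and checks that the variances of the integrators add up to the intended scale levels. The function names and the values of `tau_max`, `c` and `K` are chosen here only for illustration.

```python
import math


def scale_levels(tau_max, c, K):
    """Temporal scale levels tau_k = c^(2(k-K)) tau_max for k = 1..K, eq. (14)."""
    return [c ** (2 * (k - K)) * tau_max for k in range(1, K + 1)]


def time_constants(tau_max, c, K):
    """Time constants mu_k of the first-order integrators, eqs. (15) and (16)."""
    mus = [c ** (1 - K) * math.sqrt(tau_max)]
    mus += [c ** (k - K - 1) * math.sqrt(c * c - 1.0) * math.sqrt(tau_max)
            for k in range(2, K + 1)]
    return mus


# Hypothetical values: tau_max = (100 ms)^2, distribution parameter c = 2, K = 4 levels.
tau_max, c, K = 0.01, 2.0, 4
taus = scale_levels(tau_max, c, K)
mus = time_constants(tau_max, c, K)

# Variances add under convolution, so the running sum of mu_k^2 should equal tau_k.
cumulative = [sum(m * m for m in mus[:k]) for k in range(1, K + 1)]
```

Here `cumulative` reproduces `taus` level by level, and the last entry equals `tau_max`.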
Frequency-dependent window scale. To guarantee basic covariance properties of
the spectrogram under a frequency shift $\omega \mapsto \gamma\, \omega$, it is natural to let the temporal
window scale vary with the frequency $\omega$ in such a way that the temporal window
scale in units of $\sigma = \sqrt{\tau}$ is proportional to the wavelength $\lambda = 2\pi/\omega$
$$\tau = \left( \frac{2 \pi n}{\omega} \right)^2 \tag{17}$$
where n is a parameter. By such frequency dependent temporal window scale,
the spectral selectivity in the spectrogram (the width of a spectral band) will be
independent of the frequency $\omega$. This is a prerequisite for the desirable property
that a shift by one octave of a musical piece should imply that the corresponding
spectrogram should appear similar while shifted by one octave, if the frequency
axis of the spectrogram is parameterized on a logarithmic scale.
Additionally, to prevent the temporal window scale from being too short for
high frequencies or too long at low frequencies, we also introduce soft lower and
upper bounds on the temporal window scale. Thereby, self-similarity will only
hold within a limited range of frequencies.
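
A minimal sketch of the self-similar part of this rule is given below. It implements only eq. (17); the soft lower and upper bounds are not specified in this text and are therefore left out. The function name and the sample frequencies are illustrative assumptions.

```python
import math


def window_scale(omega, n):
    """Temporal window scale tau = (2 pi n / omega)^2 of eq. (17), without soft bounds."""
    return (2.0 * math.pi * n / omega) ** 2


# Hypothetical values: n = 8 wavelengths, three frequencies one octave apart.
n = 8.0
omegas = [2.0 * math.pi * freq for freq in (220.0, 440.0, 880.0)]

# The window width sigma = sqrt(tau) measured in wavelengths should equal n everywhere.
widths_in_wavelengths = [math.sqrt(window_scale(w, n)) / (2.0 * math.pi / w)
                         for w in omegas]

# Going up one octave divides the temporal scale (a variance) by four.
octave_ratio = window_scale(omegas[0], n) / window_scale(omegas[1], n)
```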


3 Second-layer receptive fields over the spectrogram

Given that a spectrogram has been computed by a first layer of auditory receptive
fields, we define a second layer of receptive fields by operating on the spectrogram
with 2-D spectro-temporal filters in a structurally similar way as visual receptive
fields are applied to time-varying visual input (see overview in Lindeberg [17]).
3.1 Invariances by logarithmic transformations of the spectrogram

Prior to the definition of receptive fields from the spectrogram, it is natural to
allow for a self-similar logarithmic transformation of the magnitude values
$$S_{dB} = 20 \log_{10} \left( \frac{|S|}{S_0} \right). \tag{18}$$
Then, a multiplicative transformation of the sound pressure $f \mapsto a f$, corresponding to $|S| \mapsto a |S|$, or an inversely proportional reduction in the sound pressure of
the signal from a single auditory point source as function of distance $f \mapsto f/R$,
corresponding to $|S| \mapsto |S|/R$, are both transformed into the addition or subtraction of a
constant in the logarithmic magnitude.
If we operate on the logarithmically transformed spectrogram by a receptive
field $A_\Sigma$ that is based on a combination of a spectro-temporal smoothing operation $T_\Sigma$ with logspectral and temporal scale parameters as determined by a
spectro-temporal covariance matrix $\Sigma$, temporal and/or logspectral derivatives
$\partial_{t^\alpha \nu^\beta}$ of orders $\alpha$ and $\beta$ with at least one of $\alpha > 0$ or $\beta > 0$
$$A_\Sigma S_{dB} = \partial_{t^\alpha \nu^\beta} T_\Sigma S_{dB} \tag{19}$$
then the influence on the receptive field responses of the constants a and R
$$A_\Sigma S_{dB} = \partial_{t^\alpha \nu^\beta} T_\Sigma (S_{dB} + 20 \log_{10} a - 20 \log_{10} R) = \partial_{t^\alpha \nu^\beta} T_\Sigma S_{dB} + 0 + 0 \tag{20}$$
will be eliminated if the constants a and R do not depend on time t or the
logarithmic frequency $\nu$, implying invariance of the second-layer receptive field
responses to variations in the sound pressure or the distance to a sound source.
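
The invariance argument can be illustrated with a few numbers. In this sketch a plain finite difference stands in for the derivative of the smoothed spectrogram (the smoothing step is omitted), and the magnitude values, the gain `a` and the distance `R` are hypothetical.

```python
import math


def to_db(magnitudes, s0=1.0):
    """Logarithmic magnitude transformation of eq. (18)."""
    return [20.0 * math.log10(m / s0) for m in magnitudes]


def first_difference(values):
    """Finite-difference stand-in for a first-order derivative (assumption)."""
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]


# Hypothetical spectrogram magnitudes along time (or along log frequency).
magnitudes = [1.0, 2.0, 4.0, 3.0, 1.5]
a, R = 3.0, 7.0  # constant gain and source distance

d_reference = first_difference(to_db(magnitudes))
d_transformed = first_difference(to_db([a * m / R for m in magnitudes]))
```

The constant offsets 20 log10(a) and 20 log10(R) cancel in the differences, so `d_reference` and `d_transformed` agree up to rounding.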
Since logarithmic frequencies constitute a natural metric for relating frequencies of sound and there is an approximately logarithmic distribution of frequencies both on the basilar membrane and in the auditory cortex, it is natural to
express these derived receptive fields in terms of logarithmic frequencies
$$\nu = \nu_0 + C \log \left( \frac{\omega}{\omega_0} \right) \tag{21}$$
for some constants C and $\omega_0$, where specifically $\nu_0 = 69$, $C = 12/\log 2$ and
$\omega_0 = 2 \pi \cdot 440$ correspond to the MIDI standard.
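
For concreteness, the MIDI parameterization of the logarithmic frequency can be written as a short function. The function name and the test frequencies are illustrative; the default constants are those quoted above.

```python
import math


def log_frequency(omega, nu0=69.0, C=12.0 / math.log(2.0), omega0=2.0 * math.pi * 440.0):
    """Logarithmic frequency of eq. (21), by default in MIDI semitone units."""
    return nu0 + C * math.log(omega / omega0)


a4 = log_frequency(2.0 * math.pi * 440.0)  # concert pitch A4
a5 = log_frequency(2.0 * math.pi * 880.0)  # one octave above
```

The reference tone at 440 Hz maps to 69, and doubling the frequency adds 12 semitones, so a multiplicative frequency change becomes a translation.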
This logarithmic parameterization implies that a shift in frequency, caused
by e.g. transposing a piece of music by one octave or varying the fundamental
frequency in singing resulting in a multiplicative transformation of the harmonics
(overtones), corresponds to a mere translation in logarithmic frequency.
Note, however, that some properties of voice or instruments, such as the formant structure in speech or physical resonances in instruments, are independent
of the fundamental frequency and therefore not frequency invariant.
3.2 Structural requirements on second-layer receptive fields

Given such a logarithmically transformed spectrogram, we define a family of
second-layer spectro-temporal receptive fields $A(t, \nu; \Sigma)$ that are to operate
on the transformed spectrogram $S_{dB}(t, \nu; \tau)$ and be parameterized by some
multi-dimensional spectro-temporal scale parameter $\Sigma$ comprising smoothing
over time t and logarithmic frequencies $\nu$, and obeying:
(i) linearity over the logarithmic spectrogram to ensure that (a) the multiplicative relations of the magnitude of the spectrogram that are mapped to linear
relations by the logarithmic transformation (18) are preserved as linear relations over the receptive field responses and (b) the scale-space properties
imposed to ensure non-creation of new structures in smoothed spectrograms
as defined by spectro-temporal smoothing kernels do also transfer to spectro-temporal derivatives of these.
(ii) shift-invariance with respect to translations over time $t \mapsto t + \Delta t$ and logarithmic frequencies $\nu \mapsto \nu + \Delta \nu$ such that all temporal moments and all
logarithmic frequencies are treated in a similar manner. Temporal shift invariance implies that an auditory stimulus should be perceived in a similar
manner irrespective of when it occurs. Shift-invariance in the logarithmic
frequency domain implies that, for example, a piece of music should be perceived in a similar manner if it is transposed by e.g. one octave.
(iii.a) For pre-recorded sound signals, for which we can take the freedom of accessing data from the virtual future in relation to any time moment, we impose a
continuous semi-group structure over spectro-temporal scales on the second-layer receptive fields $T(\cdot, \cdot; \Sigma_2) = T(\cdot, \cdot; \Sigma_2 - \Sigma_1) * T(\cdot, \cdot; \Sigma_1)$ corresponding
to an additive structure over the multi-dimensional scale parameter $\Sigma$.
(iii.b) For time-causal signals, we require a continuous semi-group structure over
logspectral scales s, $T(\cdot; s_2) = T(\cdot; s_2 - s_1) * T(\cdot; s_1)$, and a Markov property
between adjacent temporal scales $\tau$, $T(\cdot; \tau_{k+1}) = (\Delta T)(\cdot; k) * T(\cdot; \tau_k)$.
(iv.a) For the non-causal spectrogram (7) we require non-enhancement of local
extrema in the sense that if for some scale $\Sigma_0$ the point $(t_0, \nu_0)$ is a local
maximum (minimum) for the mapping $(t, \nu) \mapsto (A\, S_{dB})(t, \nu; \tau, \Sigma_0)$ then
the value at this point must not increase (decrease) with increasing scale $\Sigma$.
(iv.b) For the time-causal spectrogram generated by (10)-(11) or (12)-(13) we
require: (iv.b1) the smoothing operation over the logspectral domain to satisfy non-enhancement of local extrema in the sense that if at some logspectral scale $s_0$ a point $\nu_0$ is a local maximum (minimum) of the mapping
$\nu \mapsto (A\, S_{dB})(\nu; \tau, s_0)$ obtained by disregarding the temporal variations,
then the value at this point must not increase (decrease) with increasing
logspectral scale s, and (iv.b2) the purely temporal smoothing operation to
be a time-causal scale-space kernel guaranteeing non-creation of new local
extrema under an increase of the temporal scale parameter $\tau$.
(v) glissando covariance in the sense that if two local patches of two spectrograms are related by a local glissando transformation $S' = G_v S$ of the form
$\nu' = \nu + v t$ and corresponding to frequencies that vary smoothly over time,
such as during singing or for instruments with continuous pitch control,
then it should be possible to relate the local spectro-temporal receptive
field responses such that $A_{G_v(\Sigma)}\, G_v S = G_v A_\Sigma S$ for some transformation
$\Sigma' = G_v(\Sigma)$ of the spectro-temporal scale parameters $\Sigma$.
3.3 Idealized models for spectro-temporal receptive fields

Given these structural requirements, it follows from derivations similar to those
that are used for constraining visual receptive fields given structural requirements on a visual front-end (Lindeberg [17]) that the second layer of auditory
receptive fields should be based on spectro-temporal receptive fields of the form
$$A(t, \nu; \Sigma) = \partial_{t^\alpha \nu^\beta} \left( g(\nu - v t; s)\, T(t; \tau) \right) \tag{22}$$

where
- $\partial_{t^\alpha}$ represents a temporal derivative operator of order $\alpha$ with respect to time
  t which could alternatively be replaced by a glissando-adapted temporal
  derivative of the form $\partial_{\bar{t}} = \partial_t + v\, \partial_\nu$,
- $\partial_{\nu^\beta}$ represents a logspectral derivative operator of order $\beta$ with respect to
  logarithmic frequency $\nu$,
- $T(t; \tau)$ represents a temporal smoothing kernel with temporal scale parameter $\tau$, which should either be (i) a temporal Gaussian kernel $g(t; \tau)$ (4) or
  (ii) the equivalent kernel $h_{composed}(t; \mu)$ according to (5) and corresponding
  to a set of truncated exponential kernels coupled in cascade,
- $g(\nu - v t; s)$ represents a Gaussian spectral smoothing kernel over logarithmic
  frequencies $\nu$ with logspectral scale parameter s and v representing a glissando parameter making it possible to adapt the receptive fields to variations
  in frequency $\nu' = \nu + v t$ over time, and
- the spectro-temporal covariance matrix $\Sigma$ in the left hand expression for
  spectro-temporal receptive fields comprises both the temporal scale parameter $\tau$, the logspectral scale parameter s and the glissando parameter v.
Thereby, the spectro-temporal receptive fields (22) constitute a combination of
a Gaussian scale-space concept over the logspectral dimension with purely temporal receptive fields obtained by either a non-causal Gaussian temporal scale
space or a time-causal scale space obtained by coupling truncated exponential
kernels/first-order integrators in cascade (see figure 2, columns 2-3).
The proofs concerning spectro-temporal receptive fields are similar to those
regarding spatio-temporal receptive fields over a 1+1-D spatio-temporal domain
with the spatial dimension replaced by a logspectral dimension.
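
To make the structure of these receptive fields more concrete, the sketch below samples the non-causal Gaussian case on a grid. It covers only the smoothing kernel and a second-order logspectral derivative (orders 0 in time and 2 in log frequency), uses zero temporal delay, and all function names, grid settings and parameter values are assumptions chosen for illustration.

```python
import math


def gaussian(x, variance):
    """One-dimensional Gaussian with the given variance."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def smoothing_kernel(t, nu, tau, s, v):
    """g(nu - v t; s) g(t; tau): glissando-adapted smoothing, Gaussian temporal case."""
    return gaussian(nu - v * t, s) * gaussian(t, tau)


def partial_tone_kernel(t, nu, tau, s, v):
    """Second-order logspectral derivative of the smoothing kernel (analytic form)."""
    x = nu - v * t
    return (x * x - s) / (s * s) * smoothing_kernel(t, nu, tau, s, v)


# Hypothetical parameters: 20 ms temporal std, 1 semitone logspectral std,
# glissando of 20 semitones per second.
tau, s, v = 0.02 ** 2, 1.0, 20.0
dt, dnu = 0.005, 0.25
times = [-0.1 + i * dt for i in range(41)]
nus = [-6.0 + j * dnu for j in range(49)]

# The smoothing kernel should integrate to about one, its derivative to about zero.
total = sum(smoothing_kernel(t, nu, tau, s, v) for t in times for nu in nus) * dt * dnu
derivative_sum = sum(partial_tone_kernel(t, nu, tau, s, v)
                     for t in times for nu in nus) * dt * dnu

# At a fixed time the kernel peaks on the glissando line nu = v t.
t_probe = times[30]
ridge_nu = max(nus, key=lambda nu: smoothing_kernel(t_probe, nu, tau, s, v))
```

Setting `v = 0` gives the separable case; a non-zero `v` tilts the kernel along the line $\nu = v t$, which is what the check on `ridge_nu` shows.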


Fig. 1. (top left) Spectrogram of a male voice that reads "zero five four one" (from
the TIDigits database) computed with generalized Gammatone functions. (top right)
Onset enhancement by first-order temporal derivatives. (bottom left) Enhancement of
partial tones by second-order logspectral derivatives using separable receptive fields.
(bottom right) Enhancement of partial tones by the maximum of second-order logspectral derivatives over a filter bank of glissando-adapted receptive fields. Note the better
ability of the glissando-adapted receptive fields to capture rapid frequency variations.

3.4 Auditory features from second-layer receptive fields

In the following, we will show examples of auditory features that can be defined
from a second layer of auditory receptive fields of this form:
Onset enhancement. Computation of first-order temporal derivatives $D_t(t, \nu; \tau, s) = \sqrt{\tau}\, \partial_t T(t, \nu; \tau, s)$
where $\sqrt{\tau}$ is a scale normalization factor to approximate scale-normalized derivatives (Lindeberg [18]). To select receptive field responses that
correspond to onsets only, we add the non-linear logical operation $D_t > 0$ such
that $A_{onset} S_{dB} = D_t S_{dB}$ if $D_t S_{dB} > 0$ and 0 otherwise (see figure 1, top right).
Enhancement of partials. Computation of second-order logspectral derivatives
$D_{\nu\nu}(t, \nu; \tau, s) = s\, \partial_{\nu\nu} T(t, \nu; \tau, s)$ where the factor s is a scale normalization factor for scale-normalized derivatives in the Gaussian scale space (Lindeberg [18]).
Depending on the value of the logspectral scale parameter s, this operation may
either enhance partial tones or formants. This operation is naturally combined
with the (non-linear) logical operation $D_{\nu\nu} < 0$ such that $A_{band} S_{dB} = D_{\nu\nu} S_{dB}$
if $D_{\nu\nu} S_{dB} < 0$ and 0 otherwise (see figure 1, bottom left).
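
The two rectified feature detectors can be sketched on toy data as follows. Finite differences on raw samples stand in for derivatives of the smoothed spectrogram (the smoothing step is omitted), and the function names and the decibel values are hypothetical.

```python
import math


def onset_response(temporal_profile, tau, dt):
    """sqrt(tau)-normalized forward difference in time, keeping positive values only."""
    derivative = [math.sqrt(tau) * (temporal_profile[i + 1] - temporal_profile[i]) / dt
                  for i in range(len(temporal_profile) - 1)]
    return [d if d > 0 else 0.0 for d in derivative]


def band_response(spectral_column, s, dnu):
    """s-normalized second difference in log frequency, keeping negative values only."""
    second = [s * (spectral_column[i + 1] - 2.0 * spectral_column[i]
                   + spectral_column[i - 1]) / dnu ** 2
              for i in range(1, len(spectral_column) - 1)]
    return [d if d < 0 else 0.0 for d in second]


# Hypothetical dB values: a tone that switches on and off (one frequency over time),
# and a single spectral peak (one time instant over log frequency).
temporal_profile = [-60.0, -60.0, -20.0, -10.0, -10.0, -30.0, -60.0]
spectral_column = [-40.0, -35.0, -10.0, -35.0, -40.0]

onsets = onset_response(temporal_profile, tau=1e-4, dt=0.01)
bands = band_response(spectral_column, s=1.0, dnu=1.0)
```

Only the rising part of the temporal profile produces a non-zero onset response, and only the spectral peak produces a non-zero (negative) band response.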


Enhancement of partials using filter bank of glissando-adapted receptive fields.


To more accurately capture the harmonic components in sound for which the
frequencies vary rapidly over time, we use a filter bank of receptive fields that
are adapted to different glissando values v, which are combined by taking the
maximum over all glissando-adapted filter responses (see figure 1, bottom right).

4 Relations to biological receptive fields

In the central nucleus of the inferior colliculus (ICC) of cats, Qiu et al. [7]
report that about 60 % of the neurons can be described as separable in the
time-frequency domain (see figure 2, top row), whereas the remaining neurons
are either obliquely oriented (see figure 2, second row) or contain multiple excitatory/inhibitory subfields. This overall structure is nicely compatible with the
treatment in section 3.4, where the second-layer receptive fields are expressed in
terms of spectro-temporal derivatives of either time-frequency separable spectro-temporal smoothing operations or corresponding glissando-adapted features as
motivated by the structural requirements in section 3.2.
Qualitatively similar shapes of receptive fields can be measured from neurons
in the primary auditory cortex (see figure 2, third row, as well as Miller et al.
[2] regarding binaural receptive fields). Specifically, the use of multiple temporal
and spectral scales as a main component in the model is in good agreement with
biological receptive fields having different degrees of spectral tuning ranging from
narrow to broad and different temporal extent (see figure 2, rows 4-5).

5 Summary and discussion

We have presented a theory for how idealized models of auditory receptive fields
can be derived from structural constraints (scale-space axioms) on the first stages
of auditory processing. The theory includes (i) the definition of multi-scale spectrograms at different temporal scales in such a way that a spectrogram at any
coarser temporal scale can be related to a corresponding spectrogram at any
finer temporal scale using theoretically well-defined scale-space operations, and
additionally (ii) how a second layer of spectro-temporal receptive fields can be
defined over a logarithmically transformed spectrogram in such a way that the
resulting spectro-temporal receptive fields obey invariance or covariance properties under natural sound transformations including temporal shifts, variations in
the sound pressure, the distance between the sound source and the observer, a
shift in the frequencies of auditory stimuli or glissando transformations. Specifically, theoretical arguments have been presented showing how these idealized
receptive fields are constrained to the presented forms from symmetry properties
of the environment in combination with assumptions about the internal structure of auditory operations as motivated from requirements of handling different
temporal and spectral scales in a theoretically well-founded manner.
We propose that this theory should be of wide general interest for the audio processing community by providing theoretically well-founded and provably
invariant/covariant audio operations for processing sound signals and for computational modelling or measurements of receptive fields, auditory invariances,
theoretical biology and psychophysics, by serving as a general theoretical foundation and understanding of how receptive fields in ICC and A1 support invariant
auditory processes at higher levels in the auditory hierarchy.

[Figure 2 panels: each row shows a measured receptive field (ICC, A1, broadly tuned A1, narrowly tuned A1) next to a time-causal model and a Gaussian model; horizontal axes: Time (ms); vertical axes: Log Frequency (octave or semitones) or Frequency (kHz).]

Fig. 2. (top row left) A separable monaural spectro-temporal receptive field in the
central nucleus of the inferior colliculus (ICC) of cat as reported by Qiu et al. [7].
(second row left) A non-separable spectro-temporal receptive field in the central nucleus
of the inferior colliculus (ICC) of cat as reported by Qiu et al. [7]. (third row left)
A separable spectro-temporal receptive field in the primary auditory cortex (A1) of
ferret as reported by Elhilali et al. [8]. (fourth and bottom rows left) Spectro-temporal
receptive fields of broadly and narrowly tuned neurons in the primary auditory cortex
(A1) of cats as reported by Atencio and Schreiner [9]. (middle and right columns) Time-causal and non-causal receptive field models according to eq. (22). (Figures reprinted
from [10] with permission.)

References
1. Aertsen, A.M.H.J., Johannesma, P.I.M.: The spectro-temporal receptive field: A functional characterization of auditory neurons. Biol. Cyb. 42 (1981) 133-143
2. Miller, L.M., Escabi, M.A., Read, H.L., Schreiner, C.: Spectrotemporal receptive fields in the lemniscal auditory thalamus and cortex. J. Neurophys. 87 (2001) 516-527
3. Gabor, D.: Theory of communication. J. of the IEE 93 (1946) 429-457
4. Wolfe, P.J., Godsill, S.J., Dörfler, M.: Multi-Gabor dictionaries for audio time-frequency analysis. Appl. of Signal Proc. to Audio and Acoustics. (2001) 43-46
5. Johannesma, P.I.M.: The pre-response stimulus ensemble of neurons in the cochlear nucleus. In: IPO Symposium on Hearing Theory, Eindhoven, (1972) 58-69
6. Patterson, R.D., Nimmo-Smith, I., Holdsworth, J., Rice, P.: An efficient auditory filterbank based on the gammatone function. In: A meeting of the IOC Speech Group on Auditory Modelling at RSRE. Volume 2:7. (1987)
7. Qiu, A., Schreiner, C.E., Escabi, M.A.: Gabor analysis of auditory midbrain receptive fields: Spectro-temporal and binaural composition. J. of Neurophysiology 90 (2003) 456-476
8. Elhilali, M., Fritz, J., Chi, T.S., Shamma, S.: Auditory cortical receptive fields: Stable entities with plastic abilities. J. of Neuroscience 27 (2007) 10372-10382
9. Atencio, C.A., Schreiner, C.E.: Spectrotemporal processing in spectral tuning modules of cat primary auditory cortex. PLOS ONE 7 (2012) e31537
10. Lindeberg, T., Friberg, A.: Idealized computational models of auditory receptive fields. PLOS ONE 10(3):e0119032 (2015) 1-58, preprint at arXiv:1404.2037.
11. Lindeberg, T.: Generalized Gaussian scale-space axiomatics comprising linear scale-space, affine scale-space and spatio-temporal scale-space. J. of Mathematical Imaging and Vision 40 (2011) 36-81
12. Lindeberg, T., Fagerström, D.: Scale-space with causal time direction. In: European Conf. on Computer Vision, Springer LNCS Vol. 1064 (1996) 229-240
13. Heckmann, M., Domont, X., Joublin, F., Goerick, C.: A hierarchical framework for spectro-temporal feature extraction. Speech Communication 53 (2011) 736-752
14. Ngamkham, W., Sawigun, C., Hiseni, S., Serdijn, W.A.: Analog complex gammatone filter for cochlear implant channels. In: ISCAS (2010) 969-972
15. Koenderink, J.J.: Scale-time. Biological Cybernetics 58 (1988) 159-162
16. Lindeberg, T.: Separable time-causal and time-recursive receptive fields. In: Scale Space and Variational Methods in Computer Vision, Springer LNCS Vol. 9087 (2015) 90-102
17. Lindeberg, T.: A computational theory of visual receptive fields. Biological Cybernetics 107 (2013) 589-635
18. Lindeberg, T.: Feature detection with automatic scale selection. Int. J. of Computer Vision 30 (1998) 77-116
