HRTF Individualization: A Survey
2 authors:
All content following this page was uploaded by Corentin Guezenoc on 19 July 2022.
in 2014 [4]. It is however often sped up, down to 20 min according to Rugeles in 2016 [3], using
interleaved multiple sine sweeps as proposed by Majdak et al. in 2007 [8]. A promising and
increasingly popular approach is the one proposed by Enzner in 2008 [5]. Based on continuous
azimuth-wise rotation and adaptive filtering, this new paradigm allowed the measurement time to
be reduced considerably further: according to his work, it would take only 4 min with that method
to measure a whole HRIR set with a spatial resolution comparable to that of Rugeles's system [3].

tion requirements to a set of consumer-grade smartphone pictures [16]. Since the mid-2000s, the
major computation techniques have been the Fast-Multipole-accelerated Boundary Element Method
(FM-BEM) [17–19] for the harmonic domain and the Finite Difference Time Domain (FDTD) method
[20, 21] for the time domain, though other methods such as the Finite Element Method (FEM) [22]
and the more exotic raytracing [23] and Differential Pressure Synthesis (DPS) [24] have been used
since the late 1990s, 2006 and 2003, respectively. We take a particular interest here in the
accuracy of the 3D geometry used for simulation, the computing time and the perceptual relevance
of the calculated HRTFs.
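As a rough illustration of the time-domain family of methods named above, the following is a
minimal one-dimensional acoustic FDTD sketch in Python. It is not taken from any of the cited
works: actual HRTF simulations are three-dimensional and include the head and pinna geometry,
whereas this toy example only propagates a Gaussian pressure pulse along a line closed by two
rigid walls. The function name `fdtd_1d`, the grid size, the cell size and the Courant number are
all assumptions chosen for illustration.

```python
import numpy as np


def fdtd_1d(n_cells=400, n_steps=150, courant=0.9, src=200, width=8.0,
            c=343.0, rho=1.2, dx=0.005):
    """Toy 1D acoustic FDTD (staggered leapfrog scheme); all defaults are illustrative."""
    dt = courant * dx / c                      # time step from the Courant number (<= 1 in 1D)
    cells = np.arange(n_cells)
    # Initial condition: Gaussian pressure pulse centred on cell `src`, fluid at rest.
    p = np.exp(-0.5 * ((cells - src) / width) ** 2)
    # u[i] is the particle velocity between p[i-1] and p[i];
    # u[0] and u[-1] stay at zero, which models rigid walls.
    u = np.zeros(n_cells + 1)
    for _ in range(n_steps):
        # Velocity update from the pressure gradient (momentum equation).
        u[1:-1] -= dt / (rho * dx) * (p[1:] - p[:-1])
        # Pressure update from the velocity divergence (continuity equation).
        p -= rho * c ** 2 * dt / dx * (u[1:] - u[:-1])
    return p


pressure = fdtd_1d()
# Locate the right-going half of the pulse.
right_peak = 200 + int(np.argmax(pressure[200:]))
print(right_peak, float(pressure[right_peak]))
```

With these assumed values, the initial pulse splits into two halves of roughly half the initial
amplitude, and the right-going half should have travelled about 0.9 × 150 = 135 cells. The Courant
number is kept below 1 because the explicit scheme becomes unstable above that limit in 1D.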
C. Directional imprecision due to subject movement
IV. INDIRECT INDIVIDUALIZATION BASED ON ANTHROPOMETRIC DATA

Though more convenient than acoustic measurement, HRTF calculation still requires specialized
equipment and non-negligible mesh processing and computing time. Hence, based on the fact that
HRTF sets rely heavily on morphology, many studies have explored the idea of a low-cost HRTF
individualization methodology based on anthropometric measurements. We distinguish three
sub-categories: adaptation, selection and regression.

A. Adaptation

One way to do so is to take a non-individual set and to adapt it, i.e. to alter it in order to
make it more suitable for the subject at hand. Based on the idea that the most prominent
morphological difference between two individuals is size, Middlebrooks and colleagues [28]
proposed in 1999 to adapt a generic HRTF set through frequency scaling. In 2000 [29], they
reported that the scaling factor could be estimated from a combination of head and pinnae
measurements through linear regression. In both cases, perceptual evaluations performed on 9 to
11 subjects reported localization performance to be improved compared to no individualization but
to be worse than with the subject's own measured HRTF set. Later on in 2005 and 2008, other
researchers [30, 31] combined frequency scaling with

C. Regression

Going further, another approach to devising low-cost HRTF individualization based on morphology
is the estimation of an HRTF set from anthropometric measurements of the listener. To this end,
multiple linear regression has been widely used. Among such work, the HRTF sets have often, since
the early 2000s, been compressed using statistical modeling such as Principal Component Analysis
(PCA) [35, 36] and Independent Component Analysis (ICA) [37]. Some, such as Bilinski et al. in
2014 [38], have instead chosen to predict an HRTF set by linear combination of HRTF sets using
the coefficients of a model of anthropometric parameters. Surprisingly, among the studies
reviewed for this article, only that of Hu et al. [36] featured a perceptual evaluation and,
while the results were encouraging, they did not put elevation perception to the test. Since the
late 2000s, nonlinear regression models have been used too; they have typically relied on neural
networks coupled to various data compression techniques including PCA [39], High-Order SVD [40]
and Isomap [41]. However, none of these studies carried out any perceptual evaluation of the
estimated HRTF sets.

V. INDIRECT INDIVIDUALIZATION BASED ON PERCEPTUAL FEEDBACK

If methods for indirect individualization based on morphological data are practical for the
end-user and provide
individualization, they can be subject to morphological measurement errors. Indeed, the
morphological data acquisition is done by the user: measurements as well as pictures can be taken
incorrectly. As the subjective perception of spatialization is the ultimate goal, an alternative
is to propose a low-cost individualization method that is based on the listener's feedback. Much
as in Section IV, we distinguish two categories: selection and adaptation.

A. Selection

A natural strategy that has been well explored in the literature since the late 1990s is to help
the listener select the best non-individual HRTF set among a database [42, 43]. All studies
reviewed for this article evaluated the selected HRTF set perceptually, with results indicating
that the selected set was better than a non-individual one but worse than a subject's own set.
However, it should be noted that Seeber et al. [42] did not put elevation perception to the test
in their study. Reported tuning times ranged from 15 min [42] to more than 35 min [43]. In
parallel, in order to improve the relevance and duration of the tuning procedure, it has been
proposed to cluster the database a priori based on either objective [44] or perceptual [43]
criteria.

B. Adaptation

A non-individual HRTF set, sometimes elected through a previous selection procedure, can be
adapted based on perceptual feedback from the listener. We distinguish three ways to adapt an
HRTF set: frequency scaling, filter-design-based tuning and statistical-model-based tuning.

1. Frequency scaling

As mentioned in IV A, Middlebrooks et al. explored in 1999 [28] the idea of adapting a generic
HRTF set through frequency scaling and reported in the companion study [45] an improvement in
localization performance compared to no scaling. In their 2000 study [29], they reported that the
scaling factor could be tuned by the listener through a 20-min tuning session, with localization
performance similar to that of previous methods for obtaining the scaling factor (minimization of
a spectrum-based metric and anthropometric measurements). This tuning method has the advantage of
offering one single tuning lever for the whole HRTF set and of bringing some perceptual
improvement.

2. Filter-design-based tuning

Some work [46, 47] proposed in 1998 and 2000, respectively, to rely on the tuning of filters to
adapt a given HRTF set. We have distinguished two directions. First, direction dependence was not
handled [46], which meant the adaptation was rather rough as it is basically an equalization of
the whole HRTF set. Second, the listener-driven filter design had to be done for each direction
separately [47] and thus the number of parameters to tune for a whole set was too high to expect
a tuning procedure in a reasonable amount of time. Indeed, Runkle et al. [47] did not present any
perceptual evaluation of their solution, while Tan and Gan [46] presented some encouraging
perceptual results but did not evaluate criteria other than the ones used for tuning, i.e.
front-back reversal and sense of elevation.

3. Statistical-model-based tuning

Alternatively, a lot of work has proposed to rely on a statistical model, with the goal of
reducing the number of tuning parameters while still being able to cover most of the database's
HRTF space.

The main statistical modeling method used in the literature is Principal Component Analysis
(PCA), for its ease of interpretation as well as for its low implementation and computing
complexity. Most studies [48–50], in 2008, 2008 and 2015 respectively, proposed a procedure that
allowed the tuning of an HRTF in one direction at a time. The number of parameters was reduced to
3 to 5 principal component (PC) weights per direction, making it possible for the listener to
tune each direction in a reasonable amount of time. These studies all reported a localization
performance improvement over non-individual HRTFs, although the number of subjects was rather
small (3 and 4 respectively) for [48] and [49] and elevation perception was not evaluated in
[50]. However, these tuning procedures had to be performed direction by direction and thus did
not allow a whole HRTF set to be tuned in a reasonable amount of time (only 9 to 10 directions
were tuned). Hölzl, in his 2014 Master Thesis [51], proposed a solution to that flaw by applying
Spherical Harmonics (SH) to the direction-dependent PC weights. However, no subjective evaluation
of this method was proposed, and even though the overall problem dimension was reduced to
5 PC weights × 9 SH coefficients = 45, it is still a high number of parameters to tune. Moreover,
the combination of spherical harmonics coefficients and principal component weights is rather
counter-intuitive and hard to comprehend for the end-user.

In 2017, Yamamoto and Igarashi [52] proposed a state-of-the-art method that relied on the
modeling of HRTF sets thanks to a variational autoencoder neural network. The tuning procedure
consisted in a gradient descent optimization of the network's weights where the cost was
determined at every iteration by the user's rating of two HRTF sets presented to them by the
algorithm. They conducted a preference test in which the participants graded HRTF sets pair by
pair in a double-blind manner. The baseline condition was a best fit non-individual HRTF set
elected among the database in a previous preference test
procedure. The outcome was a significant improvement over an optimal non-individual HRTF set for
18 participants out of 20, although the nonstandard nature of the perceptual testing methodology
makes it hard to compare those results with other studies'.

VI. DISCUSSION

As of today, acoustic measurement remains the reference method to acquire individual HRTFs thanks
to significant perceptual assessment against real sound sources [10, 12, 13], as summarized in
Table I. As such, it has been used as ground truth by all other families of HRTF
individualization methods. Nevertheless, in spite of recent major advances in terms of
acquisition time, it is impractical for consumer-grade applications because of the cost and
difficulty of transporting the measurement equipment.

On the other hand, in spite of the professional-grade scanning equipment and the few hours of
processing needed, numerical simulation allows the data acquisition step to be mobile and more
comfortable for the user. Furthermore, the scanning equipment may be reduced to a simple
smartphone for consumer-grade applications by relying on 2D-to-3D reconstruction technologies
[16]. In addition, simulation is a powerful tool for investigating and understanding the link
between morphology and HRTFs. Major technical limitations such as computing time, 3D geometry
acquisition and re-meshing have mostly been overcome. However, although objective [17, 18, 27]
and subjective [6, 25] evaluations showed rather promising results, perceptual studies that
compared calculated HRTFs with measured ones were surprisingly rare and featured only 2 to 3
subjects (cf. Table I). In addition, some objective observations underlined the possibility of
perceptual defects in the produced HRTFs. Hence, despite a lot of work on HRTF simulation for
thirty years, and in particular since the first full-band calculations ten years ago, computed
HRTFs would merit wider-ranging perceptual studies, both in number of studies and of
participants. Possible causes for simulation-related problems include an inaccurate geometry
acquisition (depending on the scanning process) and/or a wrong modeling of the acoustics problem.

With the goal of developing solutions that are more user-friendly, the idea of individualizing
HRTFs from simpler morphological data has been widely explored in the literature. This has the
advantage of relying on little equipment and on an easy data acquisition process, usually a
smartphone and the shooting of one or a few pictures. However, as reported in Table I, the
perceptual results are mixed. On one side, the simple methods, namely selection and adaptation by
frequency scaling and/or set rotation, have demonstrated some perceptual improvement compared to
no individualization, thanks to studies that featured 6 to 11 participants [29, 34]. On the other
side, we cannot conclude on the quality of the HRTFs produced by more complex methods, such as
linear and nonlinear regression between anthropometric measurements and HRTF sets. Indeed, in the
latter category we found only a single perceptual study [36], and that one did not test elevation
perception. In other words, there is a lack of perceptual results for statistics-based methods,
which may well indicate that the databases are not large enough: all the studies reviewed here
used similarly-sized databases of 43 to 50 subjects. Thus, a key to their improvement may well
reside in larger databases. However, to the best of our knowledge the matter of their ideal size
remains an open one. More generally for the anthropometrics-based approach, errors may also come
from the fact that the measurement step is handed over to the end-user and from the unclear
relevance of the choice of the anthropometric parameters to predict HRTFs.

Alternatively, researchers have investigated the possibility of individualizing an HRTF set based
on the listener's subjective feedback. This approach has the double advantage of including the
listener and their perceptions in the individualization process while avoiding errors related to
data acquisition. Accordingly, the vast majority of such studies provide subjective evaluations
(cf. Table I). On the one hand, the simple techniques, which include selection and adaptation by
frequency scaling, have shown perceptual improvement over no individualization in studies that
gathered 7 to 11 listeners [29, 42]. On the other hand, the more complex methods, i.e. the
statistical-model-based ones, have been widely used in order to reduce the number of tuning
parameters in the most relevant manner. To this end, PCA models have been used in the majority of
studies [48–50]. While the models that were used needed to be tuned direction by direction and
thus the tuning of a whole HRTF set was impractical, they have shown encouraging results in their
localization tests, though some [48, 49] featured only 3 to 4 subjects and the other [50] only
included azimuthal directions. As for Yamamoto and Igarashi [52], the result of their 20-listener
preference test was altogether promising, but it would merit a more standard subjective
evaluation to be able to compare it to other studies. For further advances,
statistical-model-based approaches, as in the case of anthropometry-based indirect methods, may
very well benefit from larger databases. Indeed, it would then be particularly interesting to
attempt PCA modeling of whole HRTF sets and to use its weights as tuning parameters. Yamamoto and
Igarashi's [52] method seems promising as well but would benefit from a more conventional
perceptual evaluation methodology such as localization testing.

Approach                             Eval. type    Baseline   Nsubj  τperc  Results
Numerical simulation [6, 25]         Localization  IAC        3      25     Promising but would merit more studies & subjects
Filter-design-based adaptation,      Localization  IAC, NIAC  3-6    80     Promising but would merit more standard studies
statistical-model-based adaptation   Preference    BFAC       20            & more subjects
[48–50, 52]

TABLE I: Overview of perceptual evaluations for the major HRTF individualization approaches.
The columns describe the following features, from left to right: type of evaluation (Eval. type),
condition(s) used as ground truth (Baseline), number of participants (Nsubj), proportion of
studies that carried out a perceptual evaluation (τperc) and results of the perceptual studies.
Acronyms RS, IAC, NIAC and BFAC stand respectively for Real sound Sources, stimuli binauralized
using Individual Acoustic HRTFs, stimuli binauralized using Non-Individual Acoustic HRTFs and
stimuli binauralized using a Best Fit non-individual Acoustic HRTF set elected among the database
in a previous preference test procedure.

VII. CONCLUSION

In this paper we reviewed the state of the art of what has been done so far to tackle the problem
of HRTF individualization for the end-user. We distinguished four families of methods, namely
acoustic measurement, numerical simulation, indirect individualization from morphology and
indirect individualization from perceptual feedback. We summarized their specific advantages and
disadvantages and took stock of the current advances while identifying some leads for
improvement. In particular, we took a special interest in the existence and outcome of related
perceptual studies. Overall, significant perceptual results are rather scarce, though not for all
approaches (cf. Table I), which tends to indicate that a lot of work remains to be done to reach
an efficient end-user-friendly solution to HRTF individualization.
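To make the idea of using PCA weights as tuning parameters (Sections V B 3 and VI) more concrete,
here is a minimal Python sketch. It is only an illustration under stated assumptions, not the
procedure of any cited study: the data are random synthetic vectors standing in for flattened
HRTF sets, the helper names `fit_pca` and `reconstruct` are hypothetical, and the sizes
(45 subjects, 9 directions, 64 frequency bins, 5 components) are arbitrary choices.

```python
import numpy as np


def fit_pca(data, n_components):
    """Fit PCA via SVD; `data` holds one flattened (synthetic) HRTF set per row."""
    mean = data.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    components = vt[:n_components]
    # PC weights: coordinates of each centred set in the retained basis.
    weights = (data - mean) @ components.T
    return mean, components, weights


def reconstruct(mean, components, weights):
    """Rebuild one or several sets from their PC weights."""
    return mean + weights @ components


# Synthetic stand-in for a database (assumption: NOT real HRTF data).
rng = np.random.default_rng(0)
n_subjects, n_directions, n_bins, n_pcs = 45, 9, 64, 5
latent = rng.normal(size=(n_subjects, n_pcs))
basis = rng.normal(size=(n_pcs, n_directions * n_bins))
noise = 0.01 * rng.normal(size=(n_subjects, n_directions * n_bins))
data = latent @ basis + noise

mean, components, weights = fit_pca(data, n_pcs)
approx = reconstruct(mean, components, weights)
# Relative reconstruction error with respect to the centred data.
rel_error = np.linalg.norm(data - approx) / np.linalg.norm(data - mean)

# "Tuning": nudge the first PC weight of subject 0 and rebuild the whole set at once.
tuned_weights = weights[0].copy()
tuned_weights[0] += 1.0
tuned_set = reconstruct(mean, components, tuned_weights).reshape(n_directions, n_bins)
print(weights.shape, tuned_set.shape, rel_error)
```

In this sketch a whole set is described by 5 numbers, so a listener would adjust 5 levers rather
than one set of weights per direction. Whether so few components can represent real HRTF sets
well enough is precisely the open question tied to database size raised in the discussion.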
[1] H. Møller, Applied Acoustics 36, 171 (1992), URL http://www.sciencedirect.com/science/article/pii/0003682X9290046U.
[2] E. M. Wenzel, M. Arruda, D. J. Kistler, and F. L. Wightman, JASA 94, 111 (1993), URL http://asa.scitation.org/doi/10.1121/1.407089.
[3] F. Rugeles Ospina, PhD Thesis, Universite Pierre et Marie Curie / Orange Labs (2016), URL https://hal.archives-ouvertes.fr/tel-01537182.
[4] T. Carpentier, H. Bahu, M. Noisternig, and O. Warusfel, in 7th Forum Acusticum (EAA) (2014), URL https://hal.archives-ouvertes.fr/hal-01247583/.
[5] G. Enzner, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2008), pp. 393–396.
[6] P. Mokhtari, R. Nishimura, and H. Takemoto, in Proceedings of the 14th International Conference on Auditory Display (Paris, France, 2008).
[7] E. H. A. Langendijk and A. W. Bronkhorst, JASA 107, 528 (1999), URL http://asa.scitation.org/doi/abs/10.1121/1.428321.
[8] P. Majdak, P. Balazs, and B. Laback, JAES 55, 623 (2007).
[9] T. Hirahara, H. Sagara, I. Toshima, and M. Otani, Acoustical Science and Technology 31, 165 (2010), URL http://joi.jlc.jst.go.jp/JST.JSTAGE/ast/31.165?from=CrossRef.
[10] P. Majdak, M. J. Goupell, and B. Laback, Attention, Perception, & Psychophysics 72, 454 (2010), URL https://link.springer.com/article/10.3758/APP.72.2.454.
[11] F. Denk, J. Heeren, S. D. Ewert, B. Kollmeier, and S. M. Ernst, in DAGA (Kiel, 2017).
[12] F. L. Wightman and D. J. Kistler, JASA 85, 868 (1989).
[13] H. Møller, M. F. Sørensen, C. B. Jensen, and D. Hammershøi, JAES 44, 451 (1996), URL http://www.aes.org/e-lib/browse.cfm?elib=7897.
[14] R. L. Martin, K. I. McAnally, and M. A. Senova, JAES 49, 14 (2001), URL http://www.aes.org/e-lib/browse.cfm?elib=10204.
[15] H. Bahu, Ph.D. thesis, Universite Pierre et Marie Curie / IRCAM (2016), URL http://www.theses.fr/2016PA066452.
[16] S. Kaneko, T. Suenaga, and S. Sekine, in AES International Conference on Audio for Virtual and Augmented Reality (Audio Engineering Society, 2016), URL http://www.aes.org/e-lib/browse.cfm?elib=18509.
[17] N. A. Gumerov, R. Duraiswami, and D. N. Zotkin, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2007), vol. 1, pp. I–165.
[18] W. Kreuzer, P. Majdak, and Z. Chen, The Journal of the Acoustical Society of America 126, 1280 (2009), URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3061451/.
[19] S. Ghorbal, T. Auclair, C. Soladié, and R. Séguier, in Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17) (Edinburgh, 2017).
[20] P. Mokhtari, H. Takemoto, R. Nishimura, and H. Kato, in Audio Engineering Society Convention 123 (Audio Engineering Society, 2007), URL http://www.aes.org/e-lib/online/browse.cfm?elib=14298.
[21] S. Prepelită, M. Geronazzo, F. Avanzini, and L. Savioja, The Journal of the Acoustical Society of America 139, 2489 (2016), URL http://asa.scitation.org/doi/full/10.1121/1.4947546.
[22] T. Huttunen, E. T. Seppälä, O. Kirkeby, A. Kärkkäinen, and L. Kärkkäinen, J. Comp. Acous. 15, 429 (2007), URL http://www.worldscientific.com/doi/abs/10.1142/S0218396X07003469.
[23] N. Röber, S. Andres, and M. Masuch (2006).
[24] Y. Tao, A. I. Tew, and S. J. Porter, JAES 51, 647 (2003), URL http://www.aes.org/e-lib/browse.cfm?elib=12212.
[25] H. Ziegelwanger, P. Majdak, and W. Kreuzer, The Journal of the Acoustical Society of America 138, 208 (2015), URL http://asa.scitation.org/doi/10.1121/1.4922518.
[26] H. Ziegelwanger, W. Kreuzer, and P. Majdak, Applied Acoustics 114, 99 (2016), URL http://www.sciencedirect.com/science/article/pii/S0003682X1630192X.
[27] H. Ziegelwanger, A. Reichinger, and P. Majdak, in International Congress on Acoustics (ICA) (Acoustical Society of America, 2013), vol. 19.
[28] J. C. Middlebrooks, The Journal of the Acoustical Society of America 106, 1480 (1999), URL http://asa.scitation.org/doi/abs/10.1121/1.427176.
[29] J. C. Middlebrooks, E. A. Macpherson, and Z. A. Onsan, The Journal of the Acoustical Society of America 108, 3088 (2000).
[30] K. Maki and S. Furukawa, The Journal of the Acoustical Society of America 118, 2392 (2005).
[31] P. Guillon, R. Nicol, and L. Simon, in Audio Engineering Society Convention 125 (Audio Engineering Society, 2008), URL http://www.aes.org/e-lib/browse.cfm?elib=14761.
[32] V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano, in Applications of Signal Processing to Audio and Acoustics, 2001 IEEE Workshop on the (IEEE, 2001), pp. 99–102.
[33] D. N. Zotkin, R. Duraiswami, and L. S. Davis (Kyoto, Japan, 2002), URL https://smartech.gatech.edu/handle/1853/51348.
[34] S.-N. Yao, T. Collins, and C. Liang, Archives of Acoustics 42, 365 (2017).
[35] C. Jin, P. Leong, J. Leung, A. Corderoy, and S. Carlile, in Proceedings of the First IEEE Pacific-Rim Conference on Multimedia (2000), pp. 235–238.
[36] H. Hu, L. Zhou, J. Zhang, H. Ma, and Z. Wu, in 2006 International Conference on Computational Intelligence and Security (2006), vol. 2, pp. 1829–1832, URL http://sci-hub.la/10.1109/ICCIAS.2006.295380.
[37] Q. H. Huang and Q. L. Zhuang, Electronics Letters 45, 1002 (2009).
[38] P. Bilinski, J. Ahrens, M. R. Thomas, I. J. Tashev, and J. C. Platt, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014), pp. 4468–4472.
[39] H. Hu, L. Zhou, H. Ma, and Z. Wu, Applied Acoustics 69, 163 (2008), URL http://linkinghub.elsevier.com/retrieve/pii/S0003682X07000965.
[40] L. Li and Q. Huang, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013), pp. 3707–3710, URL http://ieeexplore.ieee.org/abstract/document/6638350/.
[41] F. Grijalva, L. Martini, S. Goldenstein, and D. Florencio, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014), pp. 4473–4477.
[42] B. U. Seeber and H. Fastl, in International Conference on Auditory Display (ICAD) (Boston, MA, USA, 2003).
[43] B. F. Katz and G. Parseihian, The Journal of the Acoustical Society of America 131, EL99 (2012), URL http://asa.scitation.org/doi/abs/10.1121/1.3672641.
[44] B. Xie, X. Zhong, and N. He, Applied Acoustics 94, 1 (2015).
[45] J. C. Middlebrooks, The Journal of the Acoustical Society of America 106, 1493 (1999), URL http://asa.scitation.org/doi/abs/10.1121/1.427147.
[46] C.-J. Tan and W.-S. Gan, Electronics letters 34, 2387 (1998), URL http://ieeexplore.ieee.org/abstract/document/744001/.
[47] P. Runkle, A. Yendiki, and G. H. Wakefield, in International Conference on Auditory Display (ICAD) (Georgia Institute of Technology, 2000), URL https://smartech.gatech.edu/handle/1853/50665.
[48] K. H. Shin and Y. Park, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 91, 345 (2008).
[49] S. Hwang, Y. Park, and Y.-s. Park, Acta Acustica united with Acustica 94, 965 (2008), URL http://openurl.ingenta.com/content/xref?genre=article&issn=1610-1928&volume=94&issue=6&spage=965.
[50] K. J. Fink and L. Ray, Applied Acoustics 87, 162 (2015), URL http://linkinghub.elsevier.com/retrieve/pii/S0003682X14001753.
[51] J. Hölzl, Master Thesis, Graz University of Technology (2014).
[52] K. Yamamoto and T. Igarashi, ACM Transactions on Graphics 36, 1 (2017), URL http://dl.acm.org/citation.cfm?doid=3130800.3130838.