Abstract
Speech is the natural mode of communication and the easiest way of expressing human emotions. Emotional speech is characterized by features such as the f0 contour, intensity, speaking rate, and voice quality; together these features are called prosody. Prosody is generally modified by pitch and time scaling. Unlike voice conversion, where spectral conversion is the main concern, emotional speech conversion is more sensitive to prosody. Several techniques, both linear and nonlinear, have been used for transforming speech. Our hypothesis is that the quality of emotional speech conversion can be improved by estimating a nonlinear relationship between the neutral and emotional speech feature vectors. In this work, a quadratic multivariate polynomial (QMP) model is explored for transforming neutral speech into emotional target speech. Both subjective and objective analyses were carried out to evaluate the transformed emotional speech, using the comparison mean opinion score (CMOS), mean opinion score (MOS), identification rate, root-mean-square error, and Mahalanobis distance. For the Toronto emotional database, the CMOS analysis indicates that the transformed speech can partly be perceived as the target emotion, except for neutral/sad conversion. Moreover, the MOS scores and spectrograms indicate good quality of the transformed speech. For the German database, except for neutral/boredom conversion, the proposed technique obtains a better CMOS score than the gross and initial–middle–final methods, though a lower score than the syllable method. Nevertheless, the QMP technique is simple and easy to implement, yields transformed speech of good quality, and can estimate the transformation function from a limited number of training utterances.
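As a rough illustration of the idea (a minimal sketch, not the authors' implementation), the snippet below fits a quadratic multivariate polynomial mapping between time-aligned neutral and emotional feature frames by ordinary least squares and scores the result with RMSE and Mahalanobis distance. In the paper the frames would be STRAIGHT-derived spectral and prosodic parameters; here the data are synthetic and all function names are illustrative assumptions.

```python
import numpy as np

def qmp_features(X):
    """Quadratic multivariate polynomial expansion of row feature vectors:
    a constant term, all linear terms, and all pairwise products
    (squares included)."""
    n, d = X.shape
    iu, ju = np.triu_indices(d)      # index pairs i <= j
    quad = X[:, iu] * X[:, ju]       # x_i * x_j cross and square terms
    return np.hstack([np.ones((n, 1)), X, quad])

def fit_qmp(X_neutral, Y_emotional):
    """Least-squares estimate of the QMP mapping from time-aligned
    neutral feature frames (rows of X) to emotional frames (rows of Y)."""
    Phi = qmp_features(X_neutral)
    W, *_ = np.linalg.lstsq(Phi, Y_emotional, rcond=None)
    return W

def apply_qmp(X_neutral, W):
    """Transform neutral frames with a previously fitted QMP mapping."""
    return qmp_features(X_neutral) @ W

# Toy usage with synthetic 5-dimensional "feature" frames.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))                                 # neutral
Y = 0.6 * X + 0.2 * X**2 + 0.05 * rng.standard_normal((200, 5))   # emotional
W = fit_qmp(X, Y)
Y_hat = apply_qmp(X, W)

# Objective scores of the kind used in the paper: RMSE and the average
# Mahalanobis distance of converted frames from the target distribution.
rmse = np.sqrt(np.mean((Y - Y_hat) ** 2))
mu, cov_inv = Y.mean(axis=0), np.linalg.inv(np.cov(Y, rowvar=False))
mahal = np.mean([np.sqrt((y - mu) @ cov_inv @ (y - mu)) for y in Y_hat])
print(f"RMSE: {rmse:.4f}  Mahalanobis: {mahal:.4f}")
```

Note that the quadratic expansion grows as d(d+1)/2 in the feature dimension, which is one reason such a mapping can be estimated from a limited number of training utterances when the feature vectors are compact.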

References
M. Abe, S. Nakamura, K. Shikano, H. Kuwabara, Voice conversion through vector quantization. J. Acoust. Soc. Jpn. (E) 11(2), 71–76 (1990)
Y. Adachi, S. Kawamoto, S. Morishima, S. Nakamura, Perceptual similarity measurement of speech by combination of acoustic features, in Proceedings IEEE International Conference Acoustics, Speech and Signal Processing, (2008), pp. 4861–4864
R. Aihara, R. Takashima, T. Takiguchi, Y. Ariki, GMM-based emotional voice conversion using spectrum and prosody features. Am. J. Signal Process. 2, 134–138 (2012)
R. Aihara, R. Ueda, T. Takiguchi, Y. Ariki, Exemplar-based emotional voice conversion using non-negative matrix factorization, in Proceedings IEEE Asia-Pacific Signal and Information Processing Association, (2014), pp. 1–7
M. Bulut, et al., Investigating the role of phoneme-level modifications in emotional speech resynthesis, in Proceedings INTERSPEECH, (2005), pp. 801–804
F. Burkhardt, W.F. Sendlmeier, Verification of acoustical correlates of emotional speech using formant synthesis, in Tutorial and Research Workshop on Speech and Emotion, (2000), pp. 151–156
F. Burkhardt, N. Campbell, Emotional speech synthesis, in Oxford Handbook of Affective Computing, ed. by R.A. Calvo, S.K. D'Mello, J. Gratch, A. Kappas (Oxford University Press, 2014), p. 286
F. Burkhardt, A. Paeschke, M. Rolfes, W.F. Sendlmeier, B. Weiss, A database of German emotional speech, in Proceedings INTERSPEECH, (2005), pp. 1517–1520
L. Cen, P. Chan, M. Dong, H. Li, Generating emotional speech from neutral speech, in Proceedings 7th International Symposium on Chinese Spoken Language Processing, (2010), pp. 383–386
R.R. Chang, X.Q. Yu, Y.Y. Yuan, W.G. Wan, Emotional analysis and synthesis of human voice based on STRAIGHT. Appl. Mech. Mater. 536, 105–110 (2014)
Y. Chen, M. Chu, E. Chang, J. Liu, R. Liu, Voice conversion with smoothed GMM and MAP adaptation, in Eurospeech, (2003), pp. 2413–2416
R. Cowie et al., Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 18, 32–80 (2001)
E.A. Cudney et al., An evaluation of Mahalanobis–Taguchi system and neural network for multivariate pattern recognition. J. Ind. Syst. Eng. 1, 139–150 (2007)
S. Desai, A.W. Black, B. Yegnanarayana, K. Prahallad, Spectral mapping using artificial neural networks for voice conversion. IEEE Trans. Audio Speech Lang. Process. 18, 954–964 (2010)
K. Dupuis, M.K. Pichora-Fuller, Toronto Emotional Speech Set (TESS) (Psychology Department, University of Toronto, Toronto, 2010)
T. En-Najjary, O. Rosec, T. Chonavel, A voice conversion method based on joint pitch and spectral envelope transformation, in Proceedings INTERSPEECH (2004)
D. Erro, A. Moreno, A. Bonafonte, Voice conversion based on weighted frequency warping. IEEE Trans. Audio Speech Lang. Process. 18, 922–931 (2010)
H. Fujisaki, Information, prosody, and modeling with emphasis on tonal features of speech, in Speech Prosody, (2004), pp. 1–10
K.I. Funahashi, On the approximate realization of continuous mappings by neural networks. Neural Netw. 2, 183–192 (1989)
D. Govind, S.R.M. Prasanna, B. Yegnanarayana, Neutral to target emotion conversion using source and suprasegmental information, in Proceedings INTERSPEECH, (2011), pp. 2969–2972
R.C. Guido et al., A neural-wavelet architecture for voice conversion. Neurocomputing 71, 174–180 (2007)
A. Haque, K.S. Rao, Analysis and modification of spectral energy for neutral to sad emotion conversion, in Proceedings IEEE 8th International Conference on Contemporary Computing, (2015), pp. 263–268
A. Haque, K.S. Rao, Modification of energy spectra, epoch parameters and prosody for emotion conversion in speech. Int. J. Speech Technol. (2016), pp. 1–11
E. Helander, T. Virtanen, J. Nurminen, M. Gabbouj, Voice conversion using partial least squares regression. IEEE Trans. Audio Speech Lang. Process. 18, 912–921 (2010)
W.J. Holmes, J.N. Holmes, M.W. Judd, Extension of the bandwidth of the JSRU parallel-formant synthesizer for high quality synthesis of male and female speech, in Proceedings IEEE International Conference Acoustics, Speech, and Signal Processing, (1990), pp. 313–316
A. Iida, N. Campbell, S. Iga, F. Higuchi, M. Yasumura, A speech synthesis system with emotion for assisting communication, in Tutorial and Research Workshop on Speech and Emotion, (2000), pp. 167–172
T. Irino, Y. Minami, T. Nakatani, M. Tsuzaki, H. Tagawa, Evaluation of a speech recognition/generation method based on HMM and STRAIGHT, in Proceedings INTERSPEECH (2002)
H. Kawahara, M. Morise, Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework. Sadhana Acad. Proc. Eng. Sci. 36, 713–727 (2011)
H. Kawahara, I. Masuda-Katsuse, A. de Cheveigné, Restructuring speech representations using a pitch adaptive time frequency smoothing and an instantaneous frequency based f0 extraction: possible role of repetitive structure in sounds. Speech Commun. 27, 187–207 (1999)
L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition (Pearson Education India, Delhi, 2008)
P.K. Lehana, P.C. Pandey, Transformation of short-term spectral envelope of speech signal using multivariate polynomial modeling, in Proceedings National Conference on Communications, NCC, (2011)
P.K. Lehana, Spectral mapping using multivariate polynomial modeling for voice conversion, Ph.D. Thesis, Department of Electrical Engineering, IIT Bombay, India (2013)
Z.H. Ling, L. Deng, D. Yu, Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process. 21, 2129–2139 (2013)
K. Liu, J. Zhang, Y. Yan, High quality voice conversion through phoneme-based linear mapping functions with STRAIGHT for mandarin, in Proceedings 4th International Conference on Fuzzy Systems and Knowledge Discovery, (2007), pp. 410–414
Z. Luo, J. Chen, T. Nakashika, T. Takiguchi, Y. Ariki, Emotional voice conversion using neural networks with different temporal scales of f0 based on wavelet transform, in Proceedings 9th ISCA Speech Synthesis Workshop (2016), pp. 140–145
Z. Luo, T. Takiguchi, Y. Ariki, Emotional voice conversion using deep neural networks with MCC and F0 features, in Proceedings IEEE 15th International Conference Computer and Information Science, (2016), pp. 1–5
P.C. Mahalanobis, On the generalized distance in statistics, in Proceedings of the National Institute of Sciences of India, (1936), pp. 49–55
T. Masuko, K. Tokuda, T. Kobayashi, S. Imai, Voice characteristics conversion for HMM-based speech synthesis system. Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 3, 1611–1614 (1997)
A. Mouchtaris, S.S. Narayanan, C. Kyriakakis, Multichannel audio synthesis by subband-based spectral conversion and parameter adaptation. IEEE Trans. Speech Audio Process. 13, 263–274 (2005)
T. Nakashika, R. Takashima, T. Takiguchi, Y. Ariki, Voice conversion in high-order eigen space using deep belief nets, in Proceedings INTERSPEECH, (2013), pp. 369–372
J. Nirmal, M. Zaveri, S. Patnaik, P. Kachare, Voice conversion using general regression neural network. Appl. Soft Comput. 24, 1–12 (2014)
H.K. Palo, M.N. Mohanty, M. Chandra, Efficient feature combination techniques for emotional speech classification. Int. J. Speech Technol. 19, 135–150 (2016)
B.S. Pathak, M. Sayankar, A. Panat, Emotion transformation from neutral to 3 emotions of speech signal using DWT and adaptive filtering techniques, in Proceedings IEEE 11th India Conference: Emerging Trends and Innovation in Technology, (2014)
K.R. Scherer, Vocal communication of emotion: a review of research paradigms. Speech Commun. 40, 227–256 (2003)
M. Schröder, Emotional speech synthesis: a review, in Proceedings INTERSPEECH, (2001), pp. 561–564
J.B. Singh, R. Khanna, P. Lehana, Effect of MFCC based features for speech signal alignments. Int. J. Nat. Lang. Comput. 2 (2013)
Y. Stylianou, O. Cappé, E. Moulines, Continuous probabilistic transform for voice conversion. IEEE Trans. Speech Audio Process. 6, 131–142 (1998)
D. Sundermann, A. Bonafonte, H. Ney, A study on residual prediction techniques for voice conversion. Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 1, 1–13 (2005)
T. Toda, H. Saruwatari, K. Shikano, Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum. Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 2, 841–844 (2001)
K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura, Speech parameter generation algorithms for HMM-based speech synthesis. Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 3, 1315–1318 (2000)
O. Türk, M. Schröder, A comparison of voice conversion methods for transforming voice quality in emotional speech synthesis, in Proceedings INTERSPEECH, (2008), pp. 2282–2285
O. Türk, L.M. Arslan, Voice conversion methods for vocal tract and pitch contour modification, in Proceedings INTERSPEECH (2003)
O. Türk, Cross-lingual voice conversion. Ph.D. dissertation, Boğaziçi University, (2007)
H. Valbret, E. Moulines, J.P. Tubach, Voice transformation using PSOLA technique. Speech Commun. 11, 175–187 (1992)
C. Veaux, X. Rodet, Intonation conversion from neutral to expressive speech, in Proceedings INTERSPEECH (2011), pp. 2765–2768
F. Villavicencio, A. Röbel, X. Rodet, Extending efficient spectral envelope modeling to Mel-frequency based representation, in Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, (2008), pp. 1625–1628
Z. Wu, Spectral mapping for voice conversion, Ph.D. dissertation, School of Computer Engineering, Nanyang Technological University, (2015)
J. Yadav, K.S. Rao, Prosodic mapping using neural networks for emotion conversion in Hindi language. Circuits Syst. Signal Process. 35, 139–162 (2016)
H. Zen, K. Tokuda, A.W. Black, Statistical parametric speech synthesis. Speech Commun. 51, 1039–1064 (2009)
Acknowledgements
The authors would like to thank Prof. Hideki Kawahara, Wakayama University, for his assistance with STRAIGHT.
Cite this article
Singh, J.B., Lehana, P. STRAIGHT-Based Emotion Conversion Using Quadratic Multivariate Polynomial. Circuits Syst Signal Process 37, 2179–2193 (2018). https://doi.org/10.1007/s00034-017-0660-0