Abstract
The most widely used training target in supervised speech enhancement is the masking-based approach, which maps noisy speech to a mask defined over time–frequency (T–F) units and has a marked impact on the performance of supervised learning algorithms. Traditional T–F masks such as the ideal ratio mask (IRM) perform strongly but enhance only in the magnitude domain. The bounded IRM with phase constraint (BIRMP) incorporates the phase difference but does not exploit inter-channel correlation, while the proposed ratio mask (pRM) considers channel correlation but is computed only in the magnitude domain. This work proposes a new mask, the phase correlation ideal ratio mask (PCIRM), which incorporates both the inter-channel correlation and the phase differences among the noisy speech (\(N_\mathrm{S}\)), noise (N), and clean speech (\(C_\mathrm{S}\)). Accounting for these factors increases the proportion of \(C_\mathrm{S}\) and decreases the proportion of unwanted noise in the speech components, and conversely for the noise components, making the mask more precise. Experiments are conducted at different SNR levels using the TIMIT and NOISEX-92 datasets, and the results are compared with existing state-of-the-art approaches. The results show that the proposed mask outperforms BIRMP and pRM in terms of speech quality and intelligibility.
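The abstract does not give the PCIRM formula itself, but the contrast it draws (magnitude-only IRM versus a phase-aware ratio mask) can be illustrated with a minimal sketch. The code below computes a conventional IRM from clean and noise magnitudes, and a phase-sensitive ratio mask in the style of Erdogan et al., which scales the magnitude ratio by the cosine of the clean/noisy phase difference; the function names and the toy complex T–F matrices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-12):
    """IRM: ratio of clean-speech energy to total energy per T-F unit.
    Uses only magnitudes, so phase information is discarded."""
    return clean_mag**2 / (clean_mag**2 + noise_mag**2 + eps)

def phase_sensitive_mask(clean, noisy, eps=1e-12):
    """Phase-sensitive ratio mask (Erdogan et al. style, shown here as a
    generic phase-aware example): magnitude ratio scaled by the cosine of
    the clean/noisy phase difference, clipped to [0, 1]."""
    ratio = np.abs(clean) / (np.abs(noisy) + eps)
    cos_diff = np.cos(np.angle(clean) - np.angle(noisy))
    return np.clip(ratio * cos_diff, 0.0, 1.0)

# Toy complex T-F units standing in for STFT coefficients.
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
noise = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
noisy = clean + noise  # additive-noise model

irm = ideal_ratio_mask(np.abs(clean), np.abs(noise))
psm = phase_sensitive_mask(clean, noisy)
```

Applying either mask to the noisy magnitude spectrogram attenuates noise-dominated T–F units; the phase-aware variant additionally down-weights units where the clean and noisy phases disagree, which is the kind of information a magnitude-only mask cannot use.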
Sowjanya, D., Sivapatham, S., Kar, A. et al. Mask Estimation Using Phase Information and Inter-channel Correlation for Speech Enhancement. Circuits Syst Signal Process 41, 4117–4135 (2022). https://doi.org/10.1007/s00034-022-01981-0