Abstract
The most widely used training target in supervised speech enhancement is the masking-based approach, which maps noisy speech to a mask defined over time–frequency (T–F) units and has a marked impact on the performance of supervised learning algorithms. Traditional T–F masks such as the ideal ratio mask (IRM) perform strongly but enhance only in the magnitude domain. The bounded IRM with phase constraint (BIRMP) incorporates the phase difference but does not exploit inter-channel correlation, while the proposed ratio mask (pRM) considers channel correlation but is computed only in the magnitude domain. This work proposes a new mask, the phase correlation ideal ratio mask (PCIRM), which incorporates both the inter-channel correlation and the phase differences among the noisy speech (\(N_\mathrm{S}\)), noise (N), and clean speech (\(C_\mathrm{S}\)). Accounting for these factors increases the proportion of \(C_\mathrm{S}\) and decreases the proportion of unwanted noise in the speech components, and conversely for the noise components, making the mask more precise. Experiments are conducted at different SNR levels using the TIMIT and NOISEX-92 datasets, and the results are compared with existing state-of-the-art approaches. The results show that the proposed mask outperforms BIRMP and pRM in terms of speech quality and intelligibility.
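The abstract does not give the PCIRM formula itself, but the contrast it draws (magnitude-only IRM versus a phase-aware ratio mask) can be illustrated with a minimal sketch. The code below computes a conventional IRM from clean and noise magnitudes, and a phase-sensitive ratio mask in the style of Erdogan et al., which scales the magnitude ratio by the cosine of the clean/noisy phase difference; the function names and the toy complex T–F matrices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-12):
    """IRM: ratio of clean-speech energy to total energy per T-F unit.
    Uses only magnitudes, so phase information is discarded."""
    return clean_mag**2 / (clean_mag**2 + noise_mag**2 + eps)

def phase_sensitive_mask(clean, noisy, eps=1e-12):
    """Phase-sensitive ratio mask (Erdogan et al. style, shown here as a
    generic phase-aware example): magnitude ratio scaled by the cosine of
    the clean/noisy phase difference, clipped to [0, 1]."""
    ratio = np.abs(clean) / (np.abs(noisy) + eps)
    cos_diff = np.cos(np.angle(clean) - np.angle(noisy))
    return np.clip(ratio * cos_diff, 0.0, 1.0)

# Toy complex T-F units standing in for STFT coefficients.
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
noise = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
noisy = clean + noise  # additive-noise model

irm = ideal_ratio_mask(np.abs(clean), np.abs(noise))
psm = phase_sensitive_mask(clean, noisy)
```

Applying either mask to the noisy magnitude spectrogram attenuates noise-dominated T–F units; the phase-aware variant additionally down-weights units where the clean and noisy phases disagree, which is the kind of information a magnitude-only mask cannot use.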
Sowjanya, D., Sivapatham, S., Kar, A. et al. Mask Estimation Using Phase Information and Inter-channel Correlation for Speech Enhancement. Circuits Syst Signal Process 41, 4117–4135 (2022). https://doi.org/10.1007/s00034-022-01981-0