Mask Estimation Using Phase Information and Inter-channel Correlation for Speech Enhancement

Published in: Circuits, Systems, and Signal Processing

Abstract

In supervised speech-enhancement algorithms, the most commonly used training target is the masking-based approach, which maps noisy speech to time–frequency (T–F) units and has a substantial impact on performance. Traditional T–F masks such as the ideal ratio mask (IRM) perform strongly but operate only in the magnitude domain. The bounded IRM with phase constraint (BIRMP) incorporates the phase difference but does not exploit inter-channel correlation, while the proposed ratio mask (pRM) considers channel correlation but is computed only in the magnitude domain. This work proposes a new mask, the phase correlation ideal ratio mask (PCIRM), which incorporates both the inter-channel correlation and the phase differences among the noisy speech (\(N_\mathrm{S}\)), the noise (N), and the clean speech (\(C_\mathrm{S}\)). Accounting for these factors increases the proportion of \(C_\mathrm{S}\) and decreases the proportion of unwanted noise in the speech-dominated components, and conversely for the noise-dominated components, making the mask more precise. Experiments are conducted under different SNR levels using the TIMIT and NOISEX-92 datasets and compared against existing state-of-the-art approaches. The results show that the proposed mask outperforms BIRMP and pRM in terms of both speech quality and intelligibility.
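To make the masking-based training targets concrete, the following is a minimal NumPy sketch of the standard magnitude-domain IRM and a generic phase-sensitive mask that scales by the cosine of the clean/noisy phase difference. Both definitions follow the common formulations in the speech-separation literature; neither is the paper's PCIRM, whose exact formulation additionally uses inter-channel correlation and is not reproduced in this abstract.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5):
    """Standard IRM per T-F unit: (S^2 / (S^2 + N^2))^beta, magnitude-only."""
    return (clean_mag**2 / (clean_mag**2 + noise_mag**2 + 1e-12)) ** beta

def phase_sensitive_mask(clean_stft, noisy_stft):
    """Generic phase-sensitive mask: magnitude ratio weighted by
    cos(phase difference), clipped to [0, 1]. Illustration only."""
    theta = np.angle(clean_stft) - np.angle(noisy_stft)
    m = (np.abs(clean_stft) / (np.abs(noisy_stft) + 1e-12)) * np.cos(theta)
    return np.clip(m, 0.0, 1.0)

# Toy example: one 257-bin spectral frame of synthetic clean speech plus noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(257) + 1j * rng.standard_normal(257)
noise = rng.standard_normal(257) + 1j * rng.standard_normal(257)
noisy = clean + noise

irm = ideal_ratio_mask(np.abs(clean), np.abs(noise))
psm = phase_sensitive_mask(clean, noisy)
enhanced = psm * noisy  # enhancement = mask applied to the noisy spectrum
```

In practice the mask is estimated by a network from the noisy input alone; the oracle quantities above are only available at training time, which is why masks such as these serve as training targets rather than runtime algorithms.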




Corresponding author

Correspondence to Asutosh Kar.



Cite this article

Sowjanya, D., Sivapatham, S., Kar, A. et al. Mask Estimation Using Phase Information and Inter-channel Correlation for Speech Enhancement. Circuits Syst Signal Process 41, 4117–4135 (2022). https://doi.org/10.1007/s00034-022-01981-0

