In this paper, we present a new technique for extracting a noise-robust representation of speech signals, called the spectro-temporal power spectrum. The technique applies a simple 2-D filter to the speech spectrogram to highlight the movements of spectral peaks. Because speech spectral peaks constitute the high-SNR (signal-to-noise ratio) regions of the speech spectrogram, we expect that applying the filter will improve recognition performance. In addition, the 2-D filter encodes the spectro-temporal information around each frequency component into the frequency representation of the speech signal. This information helps the recognizer to better identify the true state to which each frame should be allocated. Experimental results on the Aurora 2 task show that, when combined with cepstral mean and variance normalization, the proposed features yield error rate improvements of about 40% and 35% for test sets A and B, respectively, in comparison with the baseline system. Further improvement was achieved when the proposed features were extracted from enhanced spectra obtained by applying the advanced front-end routine. Moreover, a phone recognition task evaluated on the TIMIT database showed that the proposed method outperforms the baseline methods. These improvements are obtained with a very simple and easy-to-implement routine, which makes the method suitable for practical systems.
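The general idea of filtering a spectrogram in two dimensions can be illustrated with a minimal Python sketch. The abstract does not specify the filter coefficients, so the 3x3 averaging kernel below is purely a hypothetical stand-in, not the filter proposed in the paper; the function names, frame length, hop size, and test signal are likewise assumptions chosen for illustration. The sketch only shows the mechanics: each output value combines the time-frequency neighbourhood of the corresponding spectrogram bin.

```python
import numpy as np

def power_spectrogram(signal, frame_len=256, hop=128):
    """Return a (frames, bins) power spectrogram using a Hamming window."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def filter_2d(spec, kernel):
    """Apply an odd-sized 2-D kernel over (time, frequency) with edge padding."""
    kt, kf = kernel.shape
    pt, pf = kt // 2, kf // 2
    padded = np.pad(spec, ((pt, pt), (pf, pf)), mode="edge")
    out = np.zeros_like(spec, dtype=float)
    # Accumulate shifted copies of the spectrogram weighted by the kernel.
    for dt in range(kt):
        for df in range(kf):
            out += kernel[dt, df] * padded[dt:dt + spec.shape[0],
                                           df:df + spec.shape[1]]
    return out

# Hypothetical kernel (assumption): plain 3x3 averaging, NOT the paper's filter.
KERNEL = np.full((3, 3), 1.0 / 9.0)

# Synthetic test signal (assumption): a 1 kHz tone in white noise, 8 kHz sampling.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
noisy = np.sin(2 * np.pi * 1000 * t) + 0.5 * rng.standard_normal(t.size)

spec = power_spectrogram(noisy)
filtered = filter_2d(spec, KERNEL)
```

In a recognizer front end, the filtered spectrogram would take the place of the ordinary power spectrum before the usual feature extraction steps; the kernel shape and weights are the part that the paper itself defines.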

This work was supported in part by a grant from the Iran Telecommunication Research Center (ITRC).
Riazati Seresht, H., Ahadi, S.M. & Seyedin, S. Spectro-temporal Power Spectrum Features for Noise Robust ASR. Circuits Syst Signal Process 36, 3222–3242 (2017). https://doi.org/10.1007/s00034-016-0434-0