Abstract
A common technique in visual object recognition is to sparsely encode low-level input against a feature dictionary and then spatially pool the resulting codes over local neighbourhoods. While some methods stack these stages in alternating layers within hierarchies, the two stages alone can also produce state-of-the-art results. Following its success in vision, this framework is moving into speech and audio processing tasks. We investigate the effect of architectural choices when the framework is applied to a spoken digit recognition task. We find that unsupervised learning of the features has a negligible effect on classification, with the number and size of the features being a greater determinant of recognition performance. Finally, we show that, given an optimised architecture, sparse coding performs comparably with Hidden Markov Models (HMMs) and outperforms K-means clustering.
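As a minimal sketch of the two-stage pipeline the abstract describes, the following Python fragment sparse-codes spectrogram patches against a dictionary and then max-pools the codes over local neighbourhoods. It is illustrative only: the random (unlearned) dictionary, the patch dimension, number of atoms, sparsity level, and pooling width are assumptions rather than the paper's settings, and scikit-learn's sparse_encode with Orthogonal Matching Pursuit stands in for whichever encoder the study used.

import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)

# Fake low-level input: 200 flattened patches of a mel spectrogram
# (64 dimensions each); stands in for real speech features.
patches = rng.standard_normal((200, 64))

# Random (unlearned) dictionary of 128 unit-norm atoms; the paper's
# finding is that learning the atoms matters less than their number
# and size.
dictionary = rng.standard_normal((128, 64))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

# Stage 1: sparse encoding via OMP, keeping at most 5 nonzero
# coefficients per patch.
codes = sparse_encode(patches, dictionary,
                      algorithm="omp", n_nonzero_coefs=5)

# Stage 2: max pooling over local neighbourhoods of 10 consecutive
# patches, yielding one pooled feature vector per neighbourhood.
pooled = codes.reshape(20, 10, 128).max(axis=1)
print(pooled.shape)  # (20, 128) -> features fed to a linear classifier

The pooled vectors would then be passed to a classifier (e.g. a linear SVM) to produce the digit labels.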
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
O’Donnell, F., Triefenbach, F., Martens, JP., Schrauwen, B. (2012). Effects of Architecture Choices on Sparse Coding in Speech Recognition. In: Villa, A.E.P., Duch, W., Érdi, P., Masulli, F., Palm, G. (eds) Artificial Neural Networks and Machine Learning – ICANN 2012. ICANN 2012. Lecture Notes in Computer Science, vol 7552. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33269-2_79
DOI: https://doi.org/10.1007/978-3-642-33269-2_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33268-5
Online ISBN: 978-3-642-33269-2
eBook Packages: Computer Science (R0)