Abstract
Sequence labeling is a common machine-learning task that requires not only the most likely label for each local input but also the most suitable annotation for the input sequence as a whole. A model for this task must therefore handle both local spatial features and temporal-dependence features effectively. Furthermore, in tasks such as speech recognition and handwritten text recognition, the label sequence is often much shorter than the input sequence. In this paper, we propose a novel deep neural network architecture that combines convolutional, pooling, and recurrent layers in a unified framework, forming a convolutional recurrent neural network (CRNN) for sequence labeling tasks with variable-length inputs and outputs. Specifically, we design a novel CRNN that jointly extracts local spatial features and long-distance temporal-dependence features from the sequence, introduce pooling along the time axis to map a long input to a shorter output while also reducing the model's complexity, and adopt a Connectionist Temporal Classification (CTC) layer to enable end-to-end sequence labeling. Experiments on phoneme sequence recognition and handwritten character sequence recognition show that our method achieves strong performance with a simpler architecture and a more efficient training and labeling procedure.
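The architecture described above can be made concrete with a short sketch. Below is a minimal PyTorch illustration of a convolution + temporal-pooling + bidirectional-recurrent stack trained with a CTC loss; the layer sizes, kernel widths, pooling factors, and feature/label dimensions are illustrative assumptions, not the exact configuration reported in the paper.

    # Minimal PyTorch sketch of a CRNN with temporal pooling and a CTC loss.
    # Layer sizes, kernel widths, and pooling factors are illustrative assumptions.
    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        def __init__(self, n_features, n_classes, hidden=128):
            super().__init__()
            # 1-D convolutions extract local spatial features from each frame window.
            self.conv = nn.Sequential(
                nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool1d(2),              # pooling along time: halves the sequence length
                nn.Conv1d(64, 128, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool1d(2),              # total temporal reduction: 4x
            )
            # A bidirectional recurrent layer models long-distance temporal dependence.
            self.rnn = nn.LSTM(128, hidden, bidirectional=True, batch_first=True)
            # Linear projection to per-frame class scores (including the CTC blank).
            self.fc = nn.Linear(2 * hidden, n_classes + 1)

        def forward(self, x):                 # x: (batch, time, n_features)
            h = self.conv(x.transpose(1, 2))  # -> (batch, channels, time/4)
            h, _ = self.rnn(h.transpose(1, 2))
            return self.fc(h)                 # (batch, time/4, n_classes + 1)

    # Training step with CTC: log-probs must be shaped (time, batch, classes).
    model = CRNN(n_features=40, n_classes=61)       # 61 phoneme labels is an assumed example
    ctc = nn.CTCLoss(blank=61, zero_infinity=True)
    x = torch.randn(8, 200, 40)                     # 8 utterances, 200 frames, 40-dim features
    targets = torch.randint(0, 61, (8, 30))         # label sequences shorter than the input
    logits = model(x).log_softmax(dim=-1).transpose(0, 1)
    loss = ctc(logits, targets,
               input_lengths=torch.full((8,), logits.size(0), dtype=torch.long),
               target_lengths=torch.full((8,), 30, dtype=torch.long))
    loss.backward()

Pooling along time shortens the frame sequence before the recurrent layer, which is what lets a long input (e.g., acoustic frames) be aligned by CTC with a much shorter label sequence while also reducing computation in the recurrent layer.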
Rights and permissions
This is an open access article distributed under the CC BY-NC 4.0 license (https://creativecommons.org/licenses/by-nc/4.0/).
About this article
Cite this article
Huang, X., Qiao, L., Yu, W. et al. End-to-End Sequence Labeling via Convolutional Recurrent Neural Network with a Connectionist Temporal Classification Layer. Int J Comput Intell Syst 13, 341–351 (2020). https://doi.org/10.2991/ijcis.d.200316.001