Abstract
Convolutional neural networks have achieved great success in many computer vision tasks. However, action recognition in videos remains challenging due to the intrinsically complicated spatio-temporal correlations and the computational cost of video data. Existing methods usually neglect the fusion of long-term spatio-temporal information. In this paper, we propose a novel hybrid spatio-temporal convolutional network for action recognition. Specifically, we integrate three different types of streams into the network: (1) the image stream learns appearance information from still images; (2) the optical flow stream captures motion information from optical flow frames; (3) the dynamic image stream explores appearance and motion information simultaneously from generated dynamic images. Finally, a weighted fusion strategy at the softmax layer makes the class decision. With the help of these three streams, we can take full advantage of the spatio-temporal information in videos. Extensive experiments on two popular human action recognition datasets demonstrate the superiority of the proposed method over several state-of-the-art approaches.
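To make the description above concrete, the following minimal sketch (Python/NumPy, not the paper's actual implementation) illustrates two of the ingredients: constructing a dynamic image from a frame sequence via approximate rank pooling in the spirit of Bilen et al., and the weighted fusion of the three streams' class scores at the softmax layer. The coefficients alpha_t = 2t - T - 1 follow the approximate rank pooling formulation; the stream weights `w` and all function names are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

def dynamic_image(frames):
    """Approximate rank pooling: collapse a frame sequence of shape
    (T, H, W, C) into a single dynamic image via a weighted sum with
    coefficients alpha_t = 2t - T - 1, t = 1..T."""
    T = len(frames)
    alphas = 2 * np.arange(1, T + 1) - T - 1                      # (T,)
    d = np.tensordot(alphas, np.asarray(frames, dtype=np.float64), axes=1)
    # Rescale to [0, 255] so the result can be fed to an image CNN.
    d = (d - d.min()) / max(d.max() - d.min(), 1e-8)
    return (255 * d).astype(np.uint8)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_streams(image_logits, flow_logits, dynamic_logits,
                 w=(1.0, 1.5, 1.0)):
    """Weighted late fusion at the softmax layer: convert each stream's
    class scores to probabilities, then take a weighted average.
    The weights here are placeholders, not the paper's values."""
    probs = [softmax(s) for s in (image_logits, flow_logits, dynamic_logits)]
    fused = sum(wi * pi for wi, pi in zip(w, probs)) / sum(w)
    return fused.argmax(axis=-1)   # predicted class per video
```

In practice, each stream would be a separately trained CNN whose pre-softmax class scores serve as the logits, and the fusion weights would be tuned on a validation set.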
Acknowledgements
The authors would like to thank the Editor-in-Chief, the handling associate editor, and all anonymous reviewers for their consideration and suggestions. This work was supported by the National Natural Science Foundation of China (61572388).
Cite this article
Wang, H., Yang, Y., Yang, E. et al. Exploring hybrid spatio-temporal convolutional networks for human action recognition. Multimed Tools Appl 76, 15065–15081 (2017). https://doi.org/10.1007/s11042-017-4514-3