Abstract
We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach.
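The spotting framework described above is analogous to filler-model keyword spotting in speech: each candidate window is scored under a gesture HMM and under a generic filler (background) model, and the gesture is flagged when the log-likelihood ratio exceeds a threshold. The following is a minimal sketch of that idea, not the chapter's exact formulation; the function names, the single-window interface, and the thresholding rule are illustrative assumptions:

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the forward algorithm with log-sum-exp for numerical stability.
    pi: (K,) initial state probs; A: (K, K) transitions; B: (K, V) emissions."""
    log_alpha = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, len(obs)):
        m = log_alpha.max()
        # log sum_i alpha_i(t-1) * A[i, j], computed stably, then emit obs[t]
        log_alpha = m + np.log(np.exp(log_alpha - m) @ A) + np.log(B[:, obs[t]])
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

def spot_gesture(window, gesture_hmm, filler_hmm, threshold=0.0):
    """Flag a window as containing the gesture when the gesture model
    out-scores the generic filler model by more than `threshold`."""
    ll_gesture = forward_log_likelihood(window, *gesture_hmm)
    ll_filler = forward_log_likelihood(window, *filler_hmm)
    return (ll_gesture - ll_filler) > threshold
```

In practice the filler model is trained on all non-gesture (background) footage, so it absorbs generic motion and only windows that the dedicated gesture model explains distinctly better are reported as detections.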
Editors: Isabelle Guyon and Vassilis Athitsos.
Notes
1. In the context of activity spotting, we use the term gestures rather than activities, only to be consistent with the terminology of the ChaLearn Gesture Challenge.
2. An implementation can be found at http://www.irisa.fr/vista/Equipe/People/Laptev/download.html#stip.
3. States are modeled as multinomials because our input observables take discrete values.
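Note 3's modeling choice can be sketched concretely: when the observables are indices into a discrete codebook (e.g. quantized visual words), each hidden state's emission distribution is simply a multinomial row over the codebook. The sizes and the Dirichlet initialization below are illustrative assumptions, not values from the chapter:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

K, V = 3, 50  # number of hidden states and codebook size (illustrative)
# Emission matrix: row k is a multinomial over the V discrete symbols.
B = rng.dirichlet(np.ones(V), size=K)

def emit(state):
    """Sample one discrete observation (a codebook index) from a state's
    multinomial emission distribution."""
    return int(rng.choice(V, p=B[state]))
```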
References
J.K. Aggarwal, M.S. Ryoo, Human activity analysis: a review. ACM Comput. Surv. 43, 1–16 (2011)
Y. Benabbas, A. Lablack, N. Ihaddadene, C. Djeraba, Action recognition using direction models of motion, in Proceedings of the 2010 International Conference on Pattern Recognition, 2010, pp. 4295–4298
H. Bilen, V.P. Namboodiri, L. Van Gool, Action recognition: a region based approach, in Proceedings of the 2011 IEEE Workshop on the Applications of Computer Vision, 2011, pp. 294–300
D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
M. Bregonzio, S. Gong, T. Xiang, Recognising action as clouds of space-time interest points, in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1948–1955
ChaLearn Gesture Dataset (CGD2011), ChaLearn, California, 2011. http://gesture.chalearn.org/2011-one-shot-learning
N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in Proceedings of the 2005 IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893
K.G. Derpanis, M. Sizintsev, K. Cannons, R.P. Wildes, Efficient action spotting based on a spacetime oriented structure representation, in Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1990–1997
P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in Proceedings of the 2005 IEEE Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005, pp. 65–72
A. Gilbert, J. Illingworth, R. Bowden, Action recognition using mined hierarchical compound features. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 883–897 (2011)
S. Gong, T. Xiang, Recognition of group activities using dynamic probabilistic networks, in Proceedings of the 2003 IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2003, pp. 742–749
G. Heinrich, Parameter estimation for text analysis. Technical report, University of Leipzig, 2008
T. Hospedales, S.G. Gong, T. Xiang, A Markov clustering topic model for mining behaviour in video, in Proceedings of the 2009 International Conference on Computer Vision, 2009, pp. 1165–1172
A. Kläser, M. Marszalek, C. Schmid, A spatio-temporal descriptor based on 3d-gradients, in Proceedings of the 2008 British Machine Vision Conference (2008)
A. Kovashka, K. Grauman, Learning a hierarchy of discriminative space-time neighborhood features for human action recognition, in Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2046–2053
H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: a large video database for human motion recognition, in Proceedings of the 2011 International Conference on Computer Vision, 2011
I. Laptev, On space-time interest points. Int. J. Comput. Vis. 64, 107–123 (2005)
I. Laptev, T. Lindeberg, Space-time interest points, in Proceedings of the 2003 International Conference on Computer Vision, 2003, pp. 432–439
I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8
P. Matikainen, M. Hebert, R. Sukthankar, Trajectons: Action recognition through the motion analysis of tracked features, in Proceedings of the 2009 IEEE Workshop on Video-Oriented Object and Event Classification (2009)
P. Matikainen, M. Hebert, R. Sukthankar, Representing pairwise spatial and temporal relations for action recognition, in Proceedings of the 2010 European Conference on Computer Vision, 2010
R. Messing, C. Pal, H. Kautz, Activity recognition using the velocity histories of tracked keypoints, in Proceedings of the 2009 International Conference on Computer Vision, 2009
P. Natarajan, R. Nevatia, Coupled hidden semi Markov models for activity recognition, in Proceedings of the IEEE Workshop on Motion and Video Computing, 2007
N.T. Nguyen, D.Q. Phung, S. Venkatesh, Learning and detecting activities from movement trajectories using the hierarchical hidden Markov models, in Proceedings of the 2005 IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 955–960
E. Nowak, F. Jurie, B. Triggs, Sampling strategies for bag-of-features image classification, in Proceedings of the 2006 European Conference on Computer Vision, 2006, pp. 490–503
N. Oliver, E. Horvitz, A. Garg, Layered representations for human activity recognition, in Proceedings of the 2002 IEEE International Conference on Multimodal Interfaces, 2002, pp. 3–8
N.M. Oliver, B. Rosario, A.P. Pentland, A Bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 831–843 (2000)
J.R. Rohlicek, W. Russell, S. Roukos, H. Gish, Continuous hidden Markov modeling for speaker-independent word spotting, in Proceedings of the 1989 International Conference on Acoustics, Speech, and Signal Processing, 1989, pp. 627–630
R. Rose, D. Paul, A hidden Markov model based keyword recognition system, in Proceedings of the 1990 International Conference on Acoustics, Speech, and Signal Processing, 1990
C. Schüldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, in Proceedings of the 2004 International Conference on Pattern Recognition, 2004, pp. 32–36
P. Scovanner, S. Ali, M. Shah, A 3-dimensional sift descriptor and its application to action recognition, in Proceedings of the ACM International Conference on Multimedia, 2007, pp. 357–360
University of Central Florida, Computer Vision Lab, UCF50 dataset, 2010. http://server.cs.ucf.edu/~vision/data/UCF50.rar
H. Wang, M.M. Ullah, A. Kläser, I. Laptev, C. Schmid, Evaluation of local spatio-temporal features for action recognition, in Proceedings of the 2009 British Machine Vision Conference, 2009
H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Action recognition by dense trajectories, in Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3169–3176
G. Willems, T. Tuytelaars, L. Van Gool, An efficient dense and scale-invariant spatio-temporal interest point detector, in Proceedings of the 2008 European Conference on Computer Vision, 2008, pp. 650–663
J. Yamato, J. Ohya, K. Ishii, Recognizing human action in time-sequential images using hidden Markov model, in Proceedings of the 1992 IEEE Conference on Computer Vision and Pattern Recognition, 1992, pp. 379–385
L. Yeffet, L. Wolf, Local trinary patterns for human action recognition, in Proceedings of the 2009 International Conference on Computer Vision, 2009
J. Yuan, Z. Liu, Y. Wu, Discriminative subvolume search for efficient action detection, in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009
Acknowledgements
The authors wish to thank the associate editors and anonymous referees for all their advice about the structure, references, experimental illustration and interpretation of this manuscript. The work benefited significantly from our participation in the ChaLearn challenge as well as the accompanying workshops.
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Malgireddy, M.R., Nwogu, I., Govindaraju, V. (2017). Language-Motivated Approaches to Action Recognition. In: Escalera, S., Guyon, I., Athitsos, V. (eds) Gesture Recognition. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-57021-1_5
DOI: https://doi.org/10.1007/978-3-319-57021-1_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57020-4
Online ISBN: 978-3-319-57021-1
eBook Packages: Computer Science (R0)