Abstract
While video-based activity analysis and recognition have received much attention, a large body of existing work deals with activities of a single subject. The main objective of this paper is the modeling and recognition of coordinated multi-subject activities, or group activities, which arise in applications such as surveillance, sports, and biological monitoring records. Unlike earlier attempts, which model the complex spatio-temporal constraints among multiple subjects with a parametric Bayesian network, we propose a compact and discriminative descriptor, referred to as the Temporal Interaction Matrix, for representing a coordinated group motion pattern. Moreover, we characterize the space of Temporal Interaction Matrices using the Discriminative Temporal Interaction Manifold (DTIM) and use it as a framework within which we develop a data-driven strategy to characterize the group motion pattern without employing specific domain knowledge. In particular, we establish probability densities on the DTIM for compactly describing the statistical properties of the coordinations and interactions among multiple subjects in a group activity. For each class of group activity, we learn a multi-modal density function on the DTIM. A maximum a posteriori (MAP) classifier on the manifold is then designed for recognizing new activities. In addition, we extend this model so that participants can be explicitly distinguished from non-participants. We demonstrate how the framework can be applied to motions represented by point trajectories as well as to articulated human actions represented by images. Experiments on both cases show the effectiveness of the proposed approach.










References
Aggarwal, J. K., & Ryoo, M. S. (2011). Human activity analysis: a review. ACM Computing Surveys, 43(3).
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. London: Oxford University Press.
Amer, M., & Todorovic, S. (2011). A chains model for localizing group activities in videos. In IEEE international conference on computer vision, Barcelona, Spain.
Choi, W., Shahid, K., & Savarese, S. (2009). What are they doing?: Collective activity classification using spatio-temporal relationship among people. In 9th international workshop on visual surveillance, Kyoto, Japan.
Choi, W., Shahid, K., & Savarese, S. (2011). Learning context for collective activity recognition. In IEEE conference on computer vision and pattern recognition, Colorado Springs, CO.
Cutler, R., & Davis, L. (2000). Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 781–796.
Dollar, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In Joint IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, Beijing, China.
Dryden, I. L., & Mardia, K. V. (1998). Statistical shape analysis. New York: Wiley.
Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Gong, S., & Xiang, T. (2003). Recognition of group activities using dynamic probabilistic networks. In IEEE international conference on computer vision, Nice, France.
Grant, M., & Boyd, S. (2011). CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx.
Hakeem, A., & Shah, M. (2007). Learning, detection and representation of multi-agent events in videos. Artificial Intelligence, 171, 586–605.
Hongeng, S., & Nevatia, R. (2001). Multi-agent event recognition. In IEEE international conference on computer vision, Vancouver, BC.
Hoogs, A., Bush, S., Brooksby, G., Perera, A., Dausch, M., & Krahnstoever, N. (2008). Detecting semantic group activities using relational clustering. In IEEE workshop on motion and video computing, Copper Mountain, CO.
Huang, C., Shih, H., & Chao, C. (2006). Semantic analysis of soccer video using dynamic bayesian network. IEEE Transactions on Multimedia, 8(4), 749–760.
Intille, S., & Bobick, A. (2001). Recognizing planned, multiperson action. Computer Vision and Image Understanding, 81, 414–445.
Ivanov, Y., & Bobick, A. (2000). Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 852–872.
Joo, S., & Chellappa, R. (2007). A multiple-hypothesis approach for multiobject visual tracking. IEEE Transactions on Image Processing, 16(11), 2849–2854.
Junejo, I. N., Dexter, E., Laptev, I., & Perez, P. (2011). View independent action recognition from temporal self-similarities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 172–185.
Kass, R., & Vos, P. (1997). Geometric foundations of asymptotic inference. New York: Wiley.
Khan, S. M., & Shah, M. (2005). Detecting group activities using rigidity of formation. In ACM multimedia, Singapore.
Kim, K., Lee, D., & Essa, I. (2012). Detecting regions of interest in dynamic scenes with camera motions. In IEEE conference on computer vision and pattern recognition, Providence, RI.
Kim, M., & Pavlovic, V. (2006). Discriminative learning of mixture of bayesian network classifiers for sequence classification. In IEEE conference on computer vision and pattern recognition, New York, NY.
Klaser, A., Marszalek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3d gradients. In British machine vision conference, Leeds, UK.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2, 83–97.
Lan, T., Wang, Y., Yang, W., & Mori, G. (2010). Beyond actions: discriminative models for contextual group activities. In Neural information processing systems, Vancouver, BC.
Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64, 107–123.
Lazarescu, M., & Venkatesh, S. (2003). Using camera motion to identify different types of American football plays. In IEEE international conference on multimedia and expo, Baltimore, MD (pp. 181–184).
Li, R., Chellappa, R., & Zhou, S. (2009). Learning multi-modal densities on discriminative temporal interaction manifold for group activity recognition. In IEEE conference on computer vision and pattern recognition, Miami, FL.
libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (2012).
Liu, T., Ma, W., & Zhang, H. (2005). Effective feature extraction for play detection in American football video. In Multimedia modeling, Melbourne, Australia.
Liu, X., & Chua, C. (2006). Multi-agent activity recognition using observation decomposed hidden Markov models. Image and Vision Computing, 24(2), 166–175.
Ma, X., Bashir, F., Khokhar, A., & Schonfeld, D. (2009). Event analysis based on multiple interactive motion trajectories. IEEE Transactions on Circuits and Systems for Video Technology, 19, 397–406.
Moeslund, T. B., Hilton, A., & Kruger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104, 90–126.
Morariu, V., & Davis, L. (2011). Multi-agent event recognition in structured scenarios. In IEEE conference on computer vision and pattern recognition, Colorado Springs, CO.
Ni, B., Yan, S., & Kassim, A. (2009). Recognizing human group activities by localized causalities. In IEEE conference on computer vision and pattern recognition, Miami, FL.
Pennec, X. (2006). Intrinsic statistics on riemannian manifolds: basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1), 127–154.
Perse, M., Kristan, M., Kovacic, S., Vuckovic, G., & Pers, J. (2009). A trajectory-based analysis of coordinated team activity in a basketball game. Computer Vision and Image Understanding, 113(5), 612–621.
Poppe, R. (2010). A survey on vision-based human action recognition. Image and Vision Computing, 28(6), 976–990.
Rosset, S., & Segal, E. (2002). Boosting density estimation. In Neural information processing systems, Vancouver, BC.
Ryoo, M. S., & Aggarwal, J. K. (2009). Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. In IEEE international conference on computer vision, Kyoto, Japan.
Ryoo, M. S., & Aggarwal, J. K. (2011). Stochastic representation and recognition of high-level group activities. International Journal of Computer Vision, 93, 183–200.
Scovanner, P., Ali, S., & Shah, M. (2007). A 3-dimensional sift descriptor and its application to action recognition. In ACM multimedia, Augsburg, Germany.
Srivastava, A., Jermyn, I., & Joshi, S. (2007). Riemannian analysis of probability density functions with applications in vision. In IEEE conference on computer vision and pattern recognition, Minneapolis, MN.
Swears, E., & Hoogs, A. (2009). Learning and recognizing American football plays. In Snowbird learning workshop, Snowbird, UT.
Vaswani, N., Roy-Chowdhury, A., & Chellappa, R. (2005). Shape activity: a continuous-state HMM for moving/deforming shapes with application to abnormal activity detection. IEEE Transactions on Image Processing, 14, 1603–1616.
Veeraraghavan, A., Chellappa, R., & Srinivasan, M. (2008). Shape and behavior encoded tracking of bee dances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3), 463–476.
Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: a survey. ACM Computing Surveys, 38(4), 1–45.
Zhang, D., Gatica-Perez, D., Bengio, S., & McCowan, I. (2006). Modeling individual and group actions in meetings with layered HMMs. IEEE Transactions on Multimedia, 8, 509–520.
Zhou, Y., Yan, S., & Huang, T. S. (2008). Pair-activity classification by bi-trajectories analysis. In IEEE conference on computer vision and pattern recognition, Anchorage, AK.
Acknowledgements
Li and Chellappa were supported by the DARPA VIRAT program and a MURI program from the Office of Naval Research under the grant N00014-10-1-0934.
Appendix
7.1 Derivation of (37)
By elementary calculus it is obvious that

Since \(\epsilon f\) is a local deviation from \(P_{J}\) in the Taylor expansion, we may assume \(\epsilon f\ll P_{J}\). Hence we ignore the terms of \(\epsilon f\) and have the approximation

Note that the best ϵ is determined after f is learned.
7.2 Derivation of the Expansion of (40) in Sect. 4
Consider the exemplar set \(\{z_{1},z_{2},\ldots,z_{M}\}\), where the tuple \(z_{i}\triangleq(X_{i}, Y_{i}, \bar{W}_{i}, \bar{X}_{i}, c_{i})\). With these exemplars and assuming uniform prior probabilities for the classes, we have

where the last equality comes from the fact that, given an exemplar \(z\) (and hence \(Y\)), the matching \(W\) between it and the testing interaction \(\bar{Y}\) no longer depends on the class label \(c\). Furthermore, note that by the definition of \(z\), its component \(\bar{W}\) already encodes the matching between \(\bar{Y}\) and \(Y\). Therefore, we assume that the optimal matching \(W\) depends only on \(\bar{W}\) given \(z\), and consequently \(W\) and \(\bar{Y}\) are conditionally independent given \(z\). As a result, we have
At the same time, note that given the matching \(\bar{W}\) in \(z\), \(P\) subjects can be selected from \(\bar{Y}\), which reduces it to a \(P\times P\) Temporal Interaction Matrix \(\bar{X}\). As a result, we have
Now, the first integrand on the right-hand side of (46) can be written as
and the second integrand can also be straightforwardly simplified as
Eventually, (46) can be expressed as

and the objective (40) simply becomes

7.3 Solution to (42) or (43)
A unified representation for the optimization problems (42) and (43) can be written as

where \(\mathbf{w}\) is the \(PQ\times 1\) vectorization of the matrix \(W\), obtained by stacking the columns of \(W\). The constant vectors and matrices \(\mathbf{c}\), \(H\), \(E\), \(\mathbf{e}\), \(F\), \(\mathbf{f}\) collect the remaining coefficients of the objective and constraints in the original optimization problem (42) or (43).
Note that although the objective is quadratic in \(\mathbf{w}\), it is not necessarily convex or concave, and the elements of \(\mathbf{w}\) are restricted to 0 or 1. Instead of tackling it directly, we solve the following optimization problem:

where \(\hat{\mathbf{c}}=[\sigma_{1},\sigma_{2},\ldots,\sigma_{PQ}]^{T}\), \(\hat{H}=\operatorname{diag}\{-\sigma_{1},-\sigma_{2},\ldots,-\sigma_{PQ}\}\), and \(\sigma_{i}\) is a sufficiently large number satisfying \(\sigma_{i}>\sum^{PQ}_{j=1,j\neq i}|H_{ij}|+H_{ii}\).
Note that \(\hat{H}\) defined in this way gives \(\hat{H}+H\) a strictly dominant negative diagonal, so the quadratic term \(\hat{H}+H\) is strictly negative definite. Therefore, (54) is a concave programming problem over the convex unit hypercube \([0,1]^{P\times Q}\) and achieves its minimum at one of the feasible vertices satisfying the linear equality and inequality constraints. These feasible vertices are exactly the feasible solutions of (53), and at these vertices the objective of (54) equals that of (53) because the terms contributed by \(\hat{\mathbf{c}}\) and \(\hat{H}\) cancel. Hence, solving (54), which can be handled much more efficiently, yields the exact solution of the original problem (53). To solve (54), we simply employ the optimization software CVX (Grant and Boyd 2011).
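The cancellation argument can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the objective has the form \(\mathbf{c}^{T}\mathbf{w}+\mathbf{w}^{T}H\mathbf{w}\) with a symmetric \(H\) (the displayed forms of (53) and (54) are not reproduced here), it omits the linear constraints, and it does not call CVX. The helper names `concavify` and `objective`, the random test data, and the margin of 1.0 added to \(\sigma_{i}\) are all illustrative assumptions. For binary \(w_{i}\) we have \(w_{i}^{2}=w_{i}\), so \(\hat{\mathbf{c}}^{T}\mathbf{w}+\mathbf{w}^{T}\hat{H}\mathbf{w}=\sum_{i}\sigma_{i}(w_{i}-w_{i}^{2})=0\), which is what the script verifies.

```python
import itertools

import numpy as np


def concavify(H):
    """Build (c_hat, H_hat) from a symmetric H, following the definitions in the text."""
    # Sum of absolute off-diagonal entries in each row.
    off_diag = np.sum(np.abs(H), axis=1) - np.abs(np.diag(H))
    # sigma_i must strictly exceed sum_{j != i} |H_ij| + H_ii.
    # The margin of 1.0 is an arbitrary illustrative choice.
    sigma = off_diag + np.diag(H) + 1.0
    return sigma, np.diag(-sigma)


def objective(c, H, w):
    """Assumed quadratic objective c^T w + w^T H w."""
    return float(c @ w + w @ H @ w)


rng = np.random.default_rng(0)
n = 6  # stands in for PQ; kept small so all 2^n binary vectors can be enumerated
A = rng.normal(size=(n, n))
H = (A + A.T) / 2.0  # symmetric and, in general, indefinite
c = rng.normal(size=n)

c_hat, H_hat = concavify(H)

# Largest eigenvalue of H + H_hat; negative means the modified quadratic is concave.
max_eig = float(np.linalg.eigvalsh(H + H_hat).max())

# Largest difference between the two objectives over all binary vectors.
max_gap = max(
    abs(objective(c, H, w) - objective(c + c_hat, H + H_hat, w))
    for w in (np.array(b, dtype=float) for b in itertools.product([0, 1], repeat=n))
)

print("largest eigenvalue of H + H_hat:", max_eig)
print("largest objective gap on binary vectors:", max_gap)
```

With the margin chosen above, Gershgorin's circle theorem bounds every eigenvalue of \(H+\hat{H}\) by \(-1\), and the objective gap on binary vectors is zero up to floating-point rounding. Enumerating all binary vectors is only feasible for tiny \(PQ\); it serves here purely to illustrate why the modified and original objectives agree at the vertices.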
Cite this article
Li, R., Chellappa, R. & Zhou, S.K. Recognizing Interactive Group Activities Using Temporal Interaction Matrices and Their Riemannian Statistics. Int J Comput Vis 101, 305–328 (2013). https://doi.org/10.1007/s11263-012-0573-0
DOI: https://doi.org/10.1007/s11263-012-0573-0