Recognizing Interactive Group Activities Using Temporal Interaction Matrices and Their Riemannian Statistics

Li, Ruonan; Chellappa, Rama; Zhou, Shaohua Kevin

doi:10.1007/s11263-012-0573-0

Published: 21 September 2012

International Journal of Computer Vision, Volume 101, pages 305–328 (2013)

Abstract

While video-based activity analysis and recognition has received much attention, a large body of existing work deals with activities of a single subject. The main objective of this paper is the modeling and recognition of coordinated multi-subject activities, or group activities, which arise in applications such as surveillance, sports, and biological monitoring records. Unlike earlier attempts, which model the complex spatio-temporal constraints among multiple subjects with a parametric Bayesian network, we propose a compact and discriminative descriptor, referred to as the Temporal Interaction Matrix, for representing a coordinated group motion pattern. Moreover, we characterize the space of the Temporal Interaction Matrices using the Discriminative Temporal Interaction Manifold (DTIM), and use it as a framework within which we develop a data-driven strategy to characterize the group motion pattern without employing specific domain knowledge. In particular, we establish probability densities on the DTIM for compactly describing the statistical properties of the coordinations and interactions among multiple subjects in a group activity. For each class of group activity, we learn a multi-modal density function on the DTIM. A Maximum a Posteriori (MAP) classifier on the manifold is then designed for recognizing new activities. In addition, we extend this model to one that explicitly distinguishes participants from non-participants. We demonstrate how the framework can be applied to motions represented by point trajectories as well as to articulated human actions represented by images. Experiments on both cases show the effectiveness of the proposed approach.

References

Aggarwal, J. K., & Ryoo, M. S. (2011). Human activity analysis: a review. ACM Computing Surveys, 43(3).
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. London: Oxford University Press.
Amer, M., & Todorovic, S. (2011). A chains model for localizing group activities in videos. In IEEE international conference on computer vision, Barcelona, Spain.
Choi, W., Shahid, K., & Savarese, S. (2009). What are they doing?: Collective activity classification using spatio-temporal relationship among people. In 9th international workshop on visual surveillance, Kyoto, Japan.
Choi, W., Shahid, K., & Savarese, S. (2011). Learning context for collective activity recognition. In IEEE conference on computer vision and pattern recognition, Colorado Springs, CO.
Cutler, R., & Davis, L. (2000). Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 781–796.
Dollar, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In Joint IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, Beijing, China.
Dryden, I. L., & Mardia, K. V. (1998). Statistical shape analysis. New York: Wiley.
Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Gong, S., & Xiang, T. (2003). Recognition of group activities using dynamic probabilistic networks. In IEEE international conference on computer vision, Nice, France.
Grant, M., & Boyd, S. (2011). CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx.
Hakeem, A., & Shah, M. (2007). Learning, detection and representation of multi-agent events in videos. Artificial Intelligence, 171, 586–605.
Hongeng, S., & Nevatia, R. (2001). Multi-agent event recognition. In IEEE international conference on computer vision, Vancouver, BC.
Hoogs, A., Bush, S., Brooksby, G., Perera, A., Dausch, M., & Krahnstoever, N. (2008). Detecting semantic group activities using relational clustering. In IEEE workshop on motion and video computing, Copper Mountain, CO.
Huang, C., Shih, H., & Chao, C. (2006). Semantic analysis of soccer video using dynamic Bayesian network. IEEE Transactions on Multimedia, 8(4), 749–760.
Intille, S., & Bobick, A. (2001). Recognizing planned, multiperson action. Computer Vision and Image Understanding, 81, 414–445.
Ivanov, Y., & Bobick, A. (2000). Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 852–872.
Joo, S., & Chellappa, R. (2007). A multiple-hypothesis approach for multiobject visual tracking. IEEE Transactions on Image Processing, 16(11), 2849–2854.
Junejo, I. N., Dexter, E., Laptev, I., & Perez, P. (2011). View independent action recognition from temporal self-similarities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 172–185.
Kass, R., & Vos, P. (1997). Geometric foundations of asymptotic inference. New York: Wiley.
Khan, S. M., & Shah, M. (2005). Detecting group activities using rigidity of formation. In ACM multimedia, Singapore.
Kim, K., Lee, D., & Essa, I. (2012). Detecting regions of interest in dynamic scenes with camera motions. In IEEE conference on computer vision and pattern recognition, Providence, RI.
Kim, M., & Pavlovic, V. (2006). Discriminative learning of mixture of Bayesian network classifiers for sequence classification. In IEEE conference on computer vision and pattern recognition, New York, NY.
Klaser, A., Marszalek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3D gradients. In British machine vision conference, Leeds, UK.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2, 83–97.
Lan, T., Wang, Y., Yang, W., & Mori, G. (2010). Beyond actions: discriminative models for contextual group activities. In Neural information processing systems, Vancouver, BC.
Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64, 107–123.
Lazarescu, M., & Venkatesh, S. (2003). Using camera motion to identify different types of American football plays. In IEEE international conference on multimedia and expo, Baltimore, MD (pp. 181–184).
Li, R., Chellappa, R., & Zhou, S. (2009). Learning multi-modal densities on discriminative temporal interaction manifold for group activity recognition. In IEEE conference on computer vision and pattern recognition, Miami, FL.
libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (2012).
Liu, T., Ma, W., & Zhang, H. (2005). Effective feature extraction for play detection in American football video. In Multimedia modeling, Melbourne, Australia.
Liu, X., & Chua, C. (2006). Multi-agent activity recognition using observation decomposed hidden Markov models. Image and Vision Computing, 24(2), 166–175.
Ma, X., Bashir, F., Khokhar, A., & Schonfeld, D. (2009). Event analysis based on multiple interactive motion trajectories. IEEE Transactions on Circuits and Systems for Video Technology, 19, 397–406.
Moeslund, T. B., Hilton, A., & Kruger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104, 90–126.
Morariu, V., & Davis, L. (2011). Multi-agent event recognition in structured scenarios. In IEEE conference on computer vision and pattern recognition, Colorado Springs, CO.
Ni, B., Yan, S., & Kassim, A. (2009). Recognizing human group activities by localized causalities. In IEEE conference on computer vision and pattern recognition, Miami, FL.
Pennec, X. (2006). Intrinsic statistics on Riemannian manifolds: basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1), 127–154.
Perse, M., Kristan, M., Kovacic, S., Vuckovic, G., & Pers, J. (2009). A trajectory-based analysis of coordinated team activity in a basketball game. Computer Vision and Image Understanding, 113(5), 612–621.
Poppe, R. (2010). A survey on vision-based human action recognition. Image and Vision Computing, 28(6), 976–990.
Rosset, S., & Segal, E. (2002). Boosting density estimation. In Neural information processing systems, Vancouver, BC.
Ryoo, M. S., & Aggarwal, J. K. (2009). Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. In IEEE international conference on computer vision, Kyoto, Japan.
Ryoo, M. S., & Aggarwal, J. K. (2011). Stochastic representation and recognition of high-level group activities. International Journal of Computer Vision, 93, 183–200.
Scovanner, P., Ali, S., & Shah, M. (2007). A 3-dimensional SIFT descriptor and its application to action recognition. In ACM multimedia, Augsburg, Germany.
Srivastava, A., Jermyn, I., & Joshi, S. (2007). Riemannian analysis of probability density functions with applications in vision. In IEEE conference on computer vision and pattern recognition, Minneapolis, MN.
Swears, E., & Hoogs, A. (2009). Learning and recognizing American football plays. In Snowbird learning workshop, Snowbird, UT.
Vaswani, N., Roy-Chowdhury, A., & Chellappa, R. (2005). Shape activity: a continuous-state HMM for moving/deforming shapes with application to abnormal activity detection. IEEE Transactions on Image Processing, 14, 1603–1616.
Veeraraghavan, A., Chellappa, R., & Srinivasan, M. (2008). Shape and behavior encoded tracking of bee dances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3), 463–476.
Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: a survey. ACM Computing Surveys, 38(4), 1–45.
Zhang, D., Gatica-Perez, D., Bengio, S., & McCowan, I. (2006). Modeling individual and group actions in meetings with layered HMMs. IEEE Transactions on Multimedia, 8, 509–520.
Zhou, Y., Yan, S., & Huang, T. S. (2008). Pair-activity classification by bi-trajectories analysis. In IEEE conference on computer vision and pattern recognition, Anchorage, AK.

Acknowledgements

Li and Chellappa were supported by the DARPA VIRAT program and a MURI program from the Office of Naval Research under the grant N00014-10-1-0934.

Author information

Authors and Affiliations

Harvard School of Engineering and Applied Sciences, Cambridge, MA, 02138, USA
Ruonan Li
Center for Automation Research, UMIACS, and the Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, 20742, USA
Rama Chellappa
Corporate Research & Technology, Siemens Corporation, Princeton, NJ, 08540, USA
Shaohua Kevin Zhou

Corresponding author

Correspondence to Ruonan Li.

Appendix

7.1 Derivation of (37)

By elementary calculus it is obvious that

(44)

Since $\epsilon f$ is a local deviation from $P^{J}$ in Taylor's expansion, we may assume $\epsilon f\ll P^{J}$. Hence we ignore the terms in $\epsilon f$ and have the approximation

(45)

Note that the best ϵ is determined after f is learned.

7.2 Derivation of the Expansion of (40) in Sect. 4

Consider the exemplar set $\{z_{1},z_{2},\ldots,z_{M}\}$, where the tuple $z_{i}\triangleq(X_{i}, Y_{i}, \bar{W}_{i}, \bar{X}_{i}, c_{i})$. With these exemplars, and assuming uniform prior probabilities for the classes, we have

(46)

where the last equality comes from the fact that, given an exemplar $z$ (i.e., $Y$), the matching $W$ between it and the testing interaction $\bar{Y}$ no longer depends on the class label $c$. Furthermore, according to the definition of $z$, its component $\bar{W}$ already encodes the matching between $\bar{Y}$ and $Y$. Therefore, we assume that the optimal matching $W$ depends only on $\bar{W}$ given $z$, and consequently that $W$ and $\bar{Y}$ are conditionally independent given $z$. As a result, we have

$$ P(\bar{Y},W|z)=P(W|z)P(\bar{Y}|z)=P(W|\bar{W})P(\bar{Y}|z). \tag{47} $$

At the same time, note that given the matching $\bar{W}$ in $z$, $P$ subjects can be selected from $\bar{Y}$, which is then reduced to a $P\times P$ Temporal Interaction Matrix $\bar{X}$. As a result, we have

$$ P(\bar{Y}|z)=P(\bar{X}|z)=P(\bar{X}|X). \tag{48} $$

Now, the first integrand on the rightmost side of (46) can be written as

$$ P(\bar{Y},W|z)=P(W|\bar{W})P(\bar{X}|X), \tag{49} $$

and the second integrand can also be straightforwardly simplified as

$$ P(z|c)=P(X|c). \tag{50} $$
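As a worked step, and assuming the two integrands of (46) enter as a product (an assumption here, since it depends on the exact form of (46)), combining (49) and (50) gives, for each exemplar $z$,

$$ P(\bar{Y},W|z)\,P(z|c)=P(W|\bar{W})\,P(\bar{X}|X)\,P(X|c). $$

Each factor on the right involves only one pairing: the two matchings, the two Temporal Interaction Matrices, or the exemplar matrix and the class label.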

Eventually, (46) can be expressed as

(51)

and the objective (40) simply becomes

(52)

7.3 Solution to (42) or (43)

A unified representation for the optimization problems (42) and (43) can be written as

(53)

where $w$ is the $PQ\times 1$ vectorization of the matrix $W$, obtained by stacking the columns of $W$. The constant vectors and matrices $c, H, E, e, F, f$ encode the remaining coefficients of the objective and constraints of the original optimization problem (42) or (43).
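The column-stacking vectorization can be illustrated with a minimal NumPy sketch. The sizes `P` and `Q`, the example matrix `W`, and the names `E_rows` and `F_cols` are hypothetical; the actual constraint matrices of (42) and (43) are not reproduced here. The sketch only shows how row sums and column sums of $W$ become linear maps acting on $w$, which is the kind of encoding that matrices such as $E$ and $F$ could carry.

```python
import numpy as np

# Hypothetical sizes and matching matrix (not taken from the paper).
P, Q = 2, 3
W = np.array([[1, 0, 0],
              [0, 0, 1]])

# Column-stacking vectorization: w has length P*Q.
w = W.reshape(-1, order="F")

# Row sums of W as a linear map on w: W @ 1_Q = (1_Q^T kron I_P) w.
E_rows = np.kron(np.ones((1, Q)), np.eye(P))

# Column sums of W as a linear map on w: 1_P^T @ W = (I_Q kron 1_P^T) w.
F_cols = np.kron(np.eye(Q), np.ones((1, P)))

row_sums = E_rows @ w
col_sums = F_cols @ w
```

With this layout, a requirement such as "each row of $W$ sums to one" becomes a linear equality on $w$, and "each column sums to at most one" becomes a linear inequality.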

Note that although the objective is quadratic in $w$, it is not necessarily convex or concave, and the elements of $w$ are restricted to 0 or 1. Instead of tackling this problem directly, we solve the following optimization problem

(54)

where $\hat{\mathbf{c}}=[\sigma_{1},\sigma_{2},\ldots,\sigma_{PQ}]^{T}$, $\hat{H}=\operatorname{diag}\{-\sigma_{1},-\sigma_{2},\ldots,-\sigma_{PQ}\}$, and $\sigma_{i}$ is a sufficiently large number satisfying $\sigma_{i}>\sum^{PQ}_{j=1,j\neq i}|H_{ij}|+H_{ii}$.

Note that $\hat{H}$ defined in this way gives $\hat{H}+H$ a negative, strictly dominant diagonal, so the quadratic term $\hat{H}+H$ is strictly negative definite. Therefore, (54) is a concave programming problem over the convex unit hypercube $[0,1]^{PQ}$ and achieves its minimum at one of the feasible vertices satisfying the linear equality and inequality constraints. These feasible vertices are exactly the feasible solutions of (53), and at them the objective of (54) equals that of (53): since $w_{i}^{2}=w_{i}$ whenever $w_{i}\in\{0,1\}$, the terms contributed by $\hat{\mathbf{c}}$ and $\hat{H}$ cancel. Hence, by solving the much more efficient problem (54) we obtain the exact solution of the original problem (53). To solve (54), we simply employ the optimization software CVX (Grant and Boyd 2011).
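The two properties used in this argument can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it assumes the objective of (53) has the form $c^{T}w+w^{T}Hw$ with symmetric $H$, uses random hypothetical data, ignores the linear constraints, and does not call CVX. It builds $\hat{\mathbf{c}}$ and $\hat{H}$ from the stated bound on $\sigma_{i}$, then verifies that $H+\hat{H}$ is negative definite and that the two objectives agree at every binary vertex.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 6  # stands in for P*Q; hypothetical size

# Hypothetical problem data: symmetric H (an assumption) and linear term c.
A = rng.normal(size=(n, n))
H = (A + A.T) / 2
c = rng.normal(size=n)

# sigma_i > sum_{j != i} |H_ij| + H_ii; the margin of 1.0 is arbitrary.
off_diag = np.abs(H).sum(axis=1) - np.abs(np.diag(H))
sigma = off_diag + np.diag(H) + 1.0
c_hat = sigma
H_hat = np.diag(-sigma)

def objective_53(w):
    # Assumed form of the original binary quadratic objective.
    return c @ w + w @ H @ w

def objective_54(w):
    # Relaxed objective with the added linear and diagonal terms.
    return (c + c_hat) @ w + w @ (H + H_hat) @ w

# Largest eigenvalue of the modified quadratic term (negative if concave).
max_eig = np.linalg.eigvalsh(H + H_hat).max()

# Largest disagreement between the two objectives over all binary vertices.
vertices = [np.array(v, dtype=float)
            for v in itertools.product([0, 1], repeat=n)]
max_gap = max(abs(objective_53(v) - objective_54(v)) for v in vertices)
```

Under these assumptions `max_eig` is negative and `max_gap` is zero up to floating-point error, which is the cancellation the text attributes to $\hat{\mathbf{c}}$.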

About this article

Cite this article

Li, R., Chellappa, R. & Zhou, S.K. Recognizing Interactive Group Activities Using Temporal Interaction Matrices and Their Riemannian Statistics. Int J Comput Vis 101, 305–328 (2013). https://doi.org/10.1007/s11263-012-0573-0

Received: 30 August 2011
Accepted: 07 September 2012
Published: 21 September 2012
Issue Date: January 2013
DOI: https://doi.org/10.1007/s11263-012-0573-0
