
Attention-Based Two-Phase Model for Video Action Detection

  • Conference paper

Computer Analysis of Images and Patterns (CAIP 2017)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10425)

Abstract

This paper considers the task of action detection in long untrimmed videos. Existing methods tend to process every frame or fragment of the whole video before making detection decisions, which is not only time-consuming but also computationally burdensome. Instead, we present an attention-based model that performs action detection by watching only a few fragments; its cost is independent of the video length, so it can be applied to real-world videos. Our approach is inspired by the observation that humans focus their attention sequentially on different frames of a video to quickly narrow down the temporal extent in which an action occurs. Our model is a two-phase architecture. In the first phase, a temporal proposal network predicts temporal proposals for multi-category actions: it observes a fixed number of locations in the video, predicts action bounds, and learns a location-transfer policy. In the second phase, a well-trained classifier extracts visual information from the proposals, classifies each action, and decides whether to adopt each proposal. We evaluate our model on the ActivityNet dataset and show that it significantly outperforms the baseline.

This work was supported by Shenzhen Peacock Plan (20130408-183003656).
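
To make the two-phase design concrete, the sketch below mirrors the pipeline the abstract describes: a recurrent proposal network takes a fixed number of glimpses, emitting candidate action bounds and sampling the next viewing location from a learned policy, and a separate classifier scores each proposal. This is a minimal illustration in PyTorch under our own assumptions; the module names, layer sizes, Gaussian location policy, and precomputed fragment features are hypothetical, not the authors' implementation.

```python
# Minimal sketch of the two-phase pipeline (illustrative, not the paper's code).
import torch
import torch.nn as nn

class TemporalProposalNetwork(nn.Module):
    """Phase 1: take a fixed number of glimpses and emit temporal proposals."""

    def __init__(self, feat_dim=512, hid_dim=256, num_glimpses=6):
        super().__init__()
        self.num_glimpses = num_glimpses
        # Input: the feature of the observed fragment plus the current location.
        self.rnn = nn.GRUCell(feat_dim + 1, hid_dim)
        self.bounds_head = nn.Linear(hid_dim, 2)  # normalized (start, end)
        self.loc_head = nn.Linear(hid_dim, 1)     # mean of the next-location policy

    def forward(self, video_feats):
        # video_feats: (T, feat_dim) precomputed per-fragment features.
        T = video_feats.size(0)
        h = video_feats.new_zeros(1, self.rnn.hidden_size)
        loc = video_feats.new_full((1, 1), 0.5)   # start from the middle
        proposals, log_probs = [], []
        for _ in range(self.num_glimpses):        # fixed glimpse budget
            idx = int(loc.item() * (T - 1))
            x = torch.cat([video_feats[idx:idx + 1], loc], dim=1)
            h = self.rnn(x, h)
            proposals.append(torch.sigmoid(self.bounds_head(h)))
            # Stochastic location-transfer policy, sampled per glimpse.
            mean = torch.sigmoid(self.loc_head(h))
            dist = torch.distributions.Normal(mean, 0.1)
            loc = dist.sample().clamp(0.0, 1.0)
            log_probs.append(dist.log_prob(loc))  # for a REINFORCE-style update
        return proposals, log_probs

class ProposalClassifier(nn.Module):
    """Phase 2: classify a proposal and score whether to adopt it."""

    def __init__(self, feat_dim=512, num_classes=200):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, num_classes)
        self.adopt_head = nn.Linear(feat_dim, 1)

    def forward(self, proposal_feat):
        # proposal_feat: (N, feat_dim) pooled features of the proposed segments.
        return self.cls_head(proposal_feat), torch.sigmoid(self.adopt_head(proposal_feat))

# Toy run: 100 fragments of 512-d features, 6 glimpses, then classify proposals.
feats = torch.randn(100, 512)
proposals, log_probs = TemporalProposalNetwork()(feats)
scores, adopt = ProposalClassifier()(torch.randn(len(proposals), 512))
```

In such a setup, the bounds and classification heads take ordinary supervised losses, while the sampled locations are non-differentiable, so the location head would be trained with a policy-gradient (REINFORCE-style) objective using the stored log-probabilities, matching the learned "location-transfer policy" the abstract mentions.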



Author information

Correspondence to Wenmin Wang.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Chen, X., Wang, W., Li, W., Wang, J. (2017). Attention-Based Two-Phase Model for Video Action Detection. In: Felsberg, M., Heyden, A., Krüger, N. (eds) Computer Analysis of Images and Patterns. CAIP 2017. Lecture Notes in Computer Science, vol 10425. Springer, Cham. https://doi.org/10.1007/978-3-319-64698-5_8


  • DOI: https://doi.org/10.1007/978-3-319-64698-5_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64697-8

  • Online ISBN: 978-3-319-64698-5

  • eBook Packages: Computer Science (R0)
