Abstract
We present a probabilistic generative model for simultaneously recognizing daily actions and predicting gaze locations in videos recorded from an egocentric camera. We focus on activities requiring eye-hand coordination and model the spatio-temporal relationship between the gaze point, the scene objects, and the action label. Our model captures the fact that the distribution of both visual features and object occurrences in the vicinity of the gaze point is correlated with the verb-object pair describing the action. It explicitly incorporates known properties of gaze behavior from the psychology literature, such as the temporal delay between fixation and manipulation events. We present an inference method that can predict the best sequence of gaze locations and the associated action label from an input sequence of images. We demonstrate improvements in action recognition rates and gaze prediction accuracy relative to state-of-the-art methods on two new datasets that contain egocentric videos of daily activities and gaze.
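To make the joint inference idea concrete, the following is a minimal sketch, not the authors' implementation, of one way to couple action recognition with gaze prediction: for each candidate action label, a Viterbi pass recovers the highest-scoring gaze track under per-frame, action-conditioned appearance log-likelihoods and a smooth-motion prior, and the action with the best joint score wins. All names here (viterbi_gaze_track, recognize_action, appearance_loglik, motion_scale) are illustrative assumptions; the simple Gaussian motion penalty stands in for the paper's richer spatio-temporal model and omits, e.g., the fixation-manipulation delay.

```python
import numpy as np

def viterbi_gaze_track(frame_loglik, centers, motion_scale=20.0):
    """Best gaze-cell sequence under per-frame likelihoods plus smooth motion.

    frame_loglik : (T, K) array, log p(features_t | gaze = cell k, action)
    centers      : (K, 2) array, pixel coordinates of K candidate gaze cells
    """
    T, K = frame_loglik.shape
    # Transition log-prior: gaze tends to move smoothly between frames,
    # so penalize large jumps by squared pixel distance.
    d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    trans = -d2 / (2.0 * motion_scale ** 2)           # (K_prev, K_next)

    score = frame_loglik[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans                 # rows: prev, cols: next
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + frame_loglik[t]

    # Backtrack the MAP gaze sequence and return it with its joint log-score.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())

def recognize_action(appearance_loglik, centers, action_log_prior):
    """Pick the (action, gaze track) pair maximizing the joint log-score.

    appearance_loglik : dict mapping action label -> (T, K) log-likelihoods
    action_log_prior  : dict mapping action label -> log prior probability
    """
    best = None
    for action, loglik in appearance_loglik.items():
        track, s = viterbi_gaze_track(loglik, centers)
        s += action_log_prior.get(action, 0.0)
        if best is None or s > best[2]:
            best = (action, track, s)
    return best  # (label, gaze track as cell indices, joint log-score)
```

In this simplification the action label enters only through the appearance likelihoods and prior; the paper's model additionally ties the action to object occurrences around the gaze point, which would amount to extra per-frame terms in frame_loglik.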
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fathi, A., Li, Y., Rehg, J.M. (2012). Learning to Recognize Daily Actions Using Gaze. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds) Computer Vision – ECCV 2012. ECCV 2012. Lecture Notes in Computer Science, vol 7572. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33718-5_23
DOI: https://doi.org/10.1007/978-3-642-33718-5_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33717-8
Online ISBN: 978-3-642-33718-5