Action Recognition 2

* Corresponding author. Email: nkliuyifang@gmail.com
ABSTRACT

Action recognition is a key technology for building interactive metaverses. With the rapid development of deep learning, action recognition methods have also advanced greatly. Researchers design and implement backbones from multiple standpoints, which leads to a diversity of methods and to new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) Two-Stream networks and their variants, which, specifically in this paper, use RGB video frames and the optical flow modality as input; 2) 3D convolutional networks, which exploit the RGB modality directly so that extracting separate motion information is no longer necessary; 3) Transformer-based methods, which introduce the model from natural language processing into computer vision and video understanding. We offer objective insights in this review and hope to provide a reference for future research.

Fig. 1. Over the last decade, many video action recognition datasets with various labels have been proposed, which contributes to the advancement of action recognition tasks.

Index Terms— Video understanding, Action recognition
1. INTRODUCTION
Video action recognition is a foundational technology for building the metaverse because it meets the need for immersive interactive experiences [1]. With the development of deep learning and computing power, deep neural networks have gradually taken a dominant place in computer vision. The Convolutional Neural Network (CNN) was primarily designed for image classification. Due to its great success in the image domain, CNN-based methods were extended to video understanding. The Transformer was then introduced into computer vision and also achieved great success, which led to research on Transformer-based methods in video understanding and action recognition.

CNNs have shown remarkable performance in tasks related to still images, such as image classification and semantic segmentation. The Two-Stream Networks method [2] introduces an additional pathway that takes optical flow as input, which encodes the temporal information in videos. The success of Two-Stream networks inspired many follow-up studies [3, 4]. Another family of CNN methods fuses temporal information with spatial features by directly using 3D convolutional filters [5], and research on 3D convolutional networks has also advanced greatly [6, 7, 8, 9]. The Transformer has proven effective in computer vision even without convolutional filters; its attention mechanism allows models to perform video action recognition more accurately in metaverse scenarios, and many Transformer-based methods have been proposed in recent years [10, 11, 12, 13, 14, 15].

The rapid development of video action recognition comes with various methods and complex challenges, which motivates this review. In this paper: 1) We overview the development of video action recognition and introduce popular benchmarks. 2) We review several notable deep neural network methods for video action recognition, including Transformer-based networks. 3) We summarize future challenges and promising directions in the video action recognition field.
2. BACKBONES OF ACTION RECOGNITION

2.1. Two-Stream Networks in Action Recognition

2.1.1. Two-Stream Networks

The two-stream network extends a single-stream network by using both a spatial stream and a temporal stream to extract video information. Two-Stream Neural Networks [2], proposed in 2014, expanded the image-domain CNN into a spatial stream and a temporal stream. As shown in Fig. 2(a), the spatial stream is the same as a CNN in the image domain and performs action recognition in space. The temporal stream takes a stacked optical flow, which represents the temporal components of the video. After the two streams are processed in parallel, the predictions of the two CNNs are fused into a class score. This fusion is referred to as late fusion [16], meaning there is no information interaction during feature extraction.

The additional temporal stream makes it possible for CNNs to match traditional handcrafted features and inspired much follow-up research. TDD [3] improves CNN performance with trajectory-constrained pooling, which benefits from both deep networks and handcrafted features. TSN [4] then models long-range temporal information by sparse sampling.
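As a concrete illustration of the late-fusion design described above, the following PyTorch sketch builds two small CNNs, one for an RGB frame and one for a stack of optical-flow fields, and averages their class scores. The layer sizes and the flow-stack length are illustrative assumptions, not the exact architecture of [2].

    import torch
    import torch.nn as nn

    # Minimal two-stream sketch: one CNN sees an RGB frame, the other a stack
    # of L=10 optical-flow fields (2L channels); scores are fused at the end.
    def make_stream(in_channels, num_classes):
        return nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    class TwoStreamNet(nn.Module):
        def __init__(self, num_classes=51, flow_stack=10):
            super().__init__()
            self.spatial = make_stream(3, num_classes)                 # RGB frame
            self.temporal = make_stream(2 * flow_stack, num_classes)   # stacked flow

        def forward(self, rgb, flow):
            # Late fusion [16]: the two streams never exchange features;
            # only their class scores are combined.
            return (self.spatial(rgb).softmax(-1) + self.temporal(flow).softmax(-1)) / 2

    model = TwoStreamNet()
    scores = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))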
2.1.2. Multi-Stream Networks

Action recognition can also benefit from pose estimation information. P-CNN [17] aggregates the appearance and motion information of the human pose by tracking human body parts; in particular, it uses the positions of joints to represent body regions with CNN descriptors and constructs the P-CNN feature. PoTion [18] uses the movement of human joints over a video clip as semantic keypoints to represent human action. [19] proposes the dynamic pose image (DPI) as a compact pose feature for human action recognition; based on joint estimation maps, DPI captures richer information about human body parts than pose-based methods that use only joint locations. [20] introduces salient directed graphs with Time-Salient and Space-Salient pairwise features for efficient real-time human action recognition. Object information can also benefit action recognition. R*CNN [21] adapts RCNN to predict actions from more than one region, observing that action videos usually come with contextual cues such as objects and scenes, which provide an additional source of information for video understanding.
2.2. 3D CNNs

2.2.1. Inspiration from the Image Domain

The general pipeline of a 3D CNN is shown in Fig. 2(b) and closely resembles the image-domain pipeline. One of the most challenging issues of 3D convolution is that the large number of parameters makes training and convergence of the model harder. In 2017, Carreira et al. [6] proposed the I3D model, whose architecture is inspired by Inception [22] and can be initialized by inflating the 2D convolutions of the Inception network to 3D convolutions. The inflating technique makes it possible for 3D CNNs to utilize existing large-scale image datasets such as ImageNet. The newly proposed, larger Kinetics dataset was also used to train the I3D network and played an important role in its convergence. Using only RGB as input, the I3D network achieves better accuracy on the HMDB-51 dataset than the earlier 3D convolution approaches [23, 5].
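The inflation step itself is simple to express in code. Below is a minimal PyTorch sketch of inflating a pretrained 2D convolution into a 3D one by repeating its kernel along a new temporal axis and rescaling by the temporal extent; it illustrates the idea described above rather than the full I3D [6] model.

    import torch
    import torch.nn as nn

    # Inflate a 2D kernel (C_out, C_in, k, k) into a 3D kernel by repeating it
    # t times along a new temporal axis and dividing by t, so the inflated
    # filter initially responds to a repeated still image as the 2D one did.
    def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
        conv3d = nn.Conv3d(
            conv2d.in_channels, conv2d.out_channels,
            kernel_size=(time_dim, *conv2d.kernel_size),
            stride=(1, *conv2d.stride),
            padding=(time_dim // 2, *conv2d.padding),
            bias=conv2d.bias is not None,
        )
        with torch.no_grad():
            w2d = conv2d.weight                                  # (C_out, C_in, k, k)
            conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim)
            if conv2d.bias is not None:
                conv3d.bias.copy_(conv2d.bias)
        return conv3d

    conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # e.g. pretrained on ImageNet
    conv3d = inflate_conv2d(conv2d)                                # operates on (B, C, T, H, W) clips
    out = conv3d(torch.randn(1, 3, 16, 112, 112))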
2.2.2. Spatiotemporal Semantic Information

Highlighting the significance of spatio-temporal information, various approaches have been proposed to analyze video features. Qiu et al. [24] introduced the P3D model, which simplifies optimization by separating the spatial and temporal convolution parameters. Similarly, Tran et al. [25] developed the R(2+1)D model, which factorizes a standard t × d × d convolution into a spatial 1 × d × d convolution followed by a temporal t × 1 × 1 convolution with Mi intermediate channels, enhancing temporal modeling and network trainability. Xie et al. [26] presented the S3D model, which also separates 3D convolutions into spatial and temporal parts and mixes 2D and 3D convolutions in a top-heavy structure for improved prediction accuracy.
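A factorized block of this kind can be written in a few lines. The following PyTorch sketch shows one "(2+1)D" unit in the spirit of P3D [24] and R(2+1)D [25]; the intermediate width is an illustrative assumption (R(2+1)D chooses it so that the parameter count matches the full 3D filter).

    import torch
    import torch.nn as nn

    # A t x d x d 3D convolution replaced by a spatial 1 x d x d convolution
    # followed by a temporal t x 1 x 1 convolution, with a nonlinearity between.
    class Conv2Plus1D(nn.Module):
        def __init__(self, c_in, c_out, d=3, t=3, mid=64):
            super().__init__()
            self.spatial = nn.Conv3d(c_in, mid, kernel_size=(1, d, d),
                                     padding=(0, d // 2, d // 2), bias=False)
            self.relu = nn.ReLU(inplace=True)
            self.temporal = nn.Conv3d(mid, c_out, kernel_size=(t, 1, 1),
                                      padding=(t // 2, 0, 0), bias=False)

        def forward(self, x):            # x: (B, C, T, H, W)
            return self.temporal(self.relu(self.spatial(x)))

    block = Conv2Plus1D(3, 64)
    out = block(torch.randn(1, 3, 16, 56, 56))   # -> (1, 64, 16, 56, 56)

The extra nonlinearity between the two factors is part of what the R(2+1)D paper credits for the improved trainability mentioned above.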
To address the limited receptive field of the convolution kernel, Wang et al. [27] introduced the Non-local module for global spatio-temporal feature aggregation. Inspired by biological mechanisms, Feichtenhofer et al. [8] designed the SlowFast network, leveraging residual connections and Non-local modules for effective video analysis. Feichtenhofer [9] further proposed the X3D model, optimizing computational efficiency and parameter count through a machine learning-inspired feature selection procedure.
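The Non-local operation referenced above amounts to a self-attention layer over all space-time positions. A minimal PyTorch sketch of an embedded-Gaussian non-local block in the spirit of [27] follows; the channel sizes are illustrative.

    import torch
    import torch.nn as nn

    # Every space-time position attends to every other position, giving a
    # global receptive field in a single layer; a residual connection keeps
    # the block easy to drop into an existing 3D CNN.
    class NonLocalBlock(nn.Module):
        def __init__(self, channels, inter=None):
            super().__init__()
            inter = inter or channels // 2
            self.theta = nn.Conv3d(channels, inter, 1)
            self.phi = nn.Conv3d(channels, inter, 1)
            self.g = nn.Conv3d(channels, inter, 1)
            self.out = nn.Conv3d(inter, channels, 1)

        def forward(self, x):                              # x: (B, C, T, H, W)
            b, c, t, h, w = x.shape
            q = self.theta(x).flatten(2).transpose(1, 2)   # (B, THW, C')
            k = self.phi(x).flatten(2)                     # (B, C', THW)
            v = self.g(x).flatten(2).transpose(1, 2)       # (B, THW, C')
            attn = torch.softmax(q @ k, dim=-1)            # (B, THW, THW)
            y = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
            return x + self.out(y)                         # residual connection

    block = NonLocalBlock(64)
    out = block(torch.randn(1, 64, 4, 14, 14))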
Fig. 2. We review deep neural network backbones in video action recognition: the general architectures of (a) Two-stream networks and (b) 3D CNNs, as well as two ways of improving Transformers for action recognition: designing different kinds of attention mechanisms (c) and introducing multi-scale/multi-view features (d) into the model.
2.3. Transformer-based Neural Networks

2.3.1. Transformer-based Architectures and Spatiotemporal Attention Design

Transformer-based architectures have been progressively applied to texts, images, and videos, with significant achievements. For videos, designing the space-time attention module is a critical aspect of vision transformer architectures, as illustrated in Fig. 2(c). Recent transformer-based video action recognition architectures introduce four primary approaches to spatiotemporal attention: split, factorized, joint, and redesigned.

The split approach, as seen in VTN [11], uses a spatial backbone, a temporal attention encoder, and a classification head, thereby separating spatial and temporal processing. STAM, proposed by Sharir et al. [28], follows a similar approach by combining a temporal aggregating transformer with spatial transformers. In contrast, Arnab et al. [12] explored different attention mechanisms and advocated a factorized encoder in ViViT to combine spatial and temporal information efficiently, addressing the computational cost of joint space-time attention.

Joint spatiotemporal attention is implemented in the Video Swin Transformer by Liu et al. [14], which uses shifted windows [29] in its self-attention module to balance speed and accuracy, demonstrating strong performance across various recognition tasks.

Additional redesigned attention mechanisms include Trajectory Attention by Patrick et al. [30] in the Motionformer, which aggregates information along motion paths, and Space-time Mixing Attention by Bulat et al. [31], which enhances the efficiency of ViT at minimal computational cost.
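To make the split and factorized designs above concrete, the sketch below shows a "divided" space-time attention layer in the spirit of these factorized schemes (e.g., TimeSformer [10] and the factorized ViViT encoder [12]): tokens first attend over time at each spatial location and then over space within each frame, instead of one joint attention over all T*N tokens. The embedding width and head count are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Divided space-time attention over patch tokens of shape (B, T, N, D).
    class DividedSpaceTimeAttention(nn.Module):
        def __init__(self, dim=192, heads=3):
            super().__init__()
            self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):                                  # x: (B, T, N, D)
            b, t, n, d = x.shape
            xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)    # attend over T per location
            xt = xt + self.time_attn(xt, xt, xt, need_weights=False)[0]
            xs = xt.reshape(b, n, t, d).permute(0, 2, 1, 3).reshape(b * t, n, d)
            xs = xs + self.space_attn(xs, xs, xs, need_weights=False)[0]  # attend over N per frame
            return xs.reshape(b, t, n, d)

    attn = DividedSpaceTimeAttention()
    out = attn(torch.randn(2, 8, 196, 192))        # 8 frames x 14x14 patches

Compared with joint attention, the cost per layer drops from O((TN)^2) to O(T^2 N + N^2 T) token interactions, which is the efficiency argument made by the factorized designs.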
2.3.2. Multiscale and Multiview Transformers

Drawing on the pyramid concept from multiscale processing, researchers developed similar options for vision transformers. Fig. 2(d) highlights Fan et al.'s [13] blend of transformers with CNN-style multiscale features, leading to MViT. This model introduces a hierarchical structure with Multi-Head Pooling Attention that adapts the resolution, expanding channel capacity while lowering spatial resolution in stages to build a multiscale feature pyramid. Wu et al. [15] further evolved this design with MeMViT, enhancing long-term video recognition by reusing previously stored memory; according to the authors, the approach is adaptable to various transformer models. Yan et al. [32] diverged by developing a Multiview Transformer, which processes different video views through distinct encoders and achieves its best results with cross-view attention, effectively balancing accuracy and computational demands for ViT variants. VideoMAE V2 [33] achieves state-of-the-art performance on various downstream tasks through a carefully designed masking strategy. Hiera [34] achieves a simpler and more efficient hierarchical vision transformer that excels across multiple visual tasks by leveraging strong visual pretext-task (MAE) pretraining to eliminate complex components traditionally used in hierarchical vision transformers.
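Masked-autoencoder pretraining of this kind relies on a video-specific masking pattern. The sketch below shows the "tube" random masking commonly used by masked video autoencoders such as VideoMAE; the dual-masking strategy that VideoMAE V2 [33] adds on the decoder side is not shown, so treat this as an illustration of the general idea rather than that paper's exact scheme.

    import torch

    # One spatial mask is sampled per video and shared by all frames, so a
    # masked patch is hidden for the whole clip and cannot be trivially
    # copied from a neighboring frame.
    def tube_mask(batch, frames, patches_per_frame, ratio=0.9):
        keep = int(patches_per_frame * (1 - ratio))
        noise = torch.rand(batch, patches_per_frame)            # one mask per video
        keep_idx = noise.argsort(dim=1)[:, :keep]               # indices of visible patches
        mask = torch.ones(batch, patches_per_frame, dtype=torch.bool)
        mask[torch.arange(batch).unsqueeze(1), keep_idx] = False  # True = masked
        return mask.unsqueeze(1).expand(batch, frames, patches_per_frame)

    mask = tube_mask(batch=2, frames=8, patches_per_frame=196)
    visible_per_clip = (~mask[0, 0]).sum().item() * 8            # 19 visible patches x 8 frames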
2.3.3. Integration of Transformer and CNN

Before the emergence of pure-transformer models in computer vision, initial efforts focused on enhancing CNNs with self-attention to improve long-range dependency modeling. NLNet, as noted by Liu et al., was a forerunner in adding self-attention to CNNs for pixel-level long-range dependencies. Following NLNet, various efforts sought to refine this design by simplifying or completely overhauling the non-local block; for instance, Cao et al. enhanced NLNet's global context capture with a query-independent approach. With increasing interest in combining Transformers and CNNs, new methods emerged. Xiao et al. showed that adding a convolutional stem to image transformers improves optimization stability and performance without losing computational efficiency, which encourages the use of early convolution layers in ViT architectures for video analysis. Li et al. introduced UniFormer [35], which learns spatiotemporal representations by balancing global dependency capture and local redundancy reduction across ViT and CNN designs.
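The convolutional-stem idea is easy to illustrate: replace the single large-stride patch-embedding convolution with a short stack of small-stride convolutions that produces the same token grid. The PyTorch sketch below follows the spirit of Xiao et al.'s design; the channel widths are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Four stride-2 convolutions give the same 16x downsampling as a single
    # stride-16 patchify layer, but with an easier-to-optimize early stage.
    class ConvStem(nn.Module):
        def __init__(self, dim=192):
            super().__init__()
            chans = [3, 24, 48, 96, dim]
            layers = []
            for c_in, c_out in zip(chans[:-1], chans[1:]):
                layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                           nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
            self.stem = nn.Sequential(*layers)

        def forward(self, frames):                 # frames: (B, 3, H, W)
            x = self.stem(frames)                  # (B, dim, H/16, W/16)
            return x.flatten(2).transpose(1, 2)    # (B, N, dim) tokens for the transformer

    tokens = ConvStem()(torch.randn(2, 3, 224, 224))   # -> (2, 196, 192)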
Table 1. Comparison of different network architectures. Accuracy is reported in % on HMDB51, Kinetics-400 (K400), and Something-Something V2 (SSv2).

Two-Stream Networks    Year  Params (M)  HMDB51  K400  SSv2
TN [2]                 2014  36.6        59.4    -     -
TDD [3]                2015  -           63.2    -     -
TSN [4]                2016  -           68.5    73.9  -
DOVF [36]              2017  -           71.7    -     -
TLE [37]               2017  -           71.7    -     -
ActionVLAD [38]        2017  -           66.9    -     -
TSM [39]               2019  24.3        73.5    74.1  61.7

3D CNNs                Year  Params (M)  HMDB51  K400  SSv2
C3D [5]                2015  34.8        56.8    59.5  -
I3D [6]                2017  -           74.8    71.1  -
Two-Stream I3D [6]     2017  25.0        80.9    74.2  -
P3D [24]               2017  98.0        -       71.6  -
ResNet3D [7]           2018  -           70.2    65.1  -
NL I3D [27]            2018  61.0        -       77.7  -
R(2+1)D [25]           2018  118.2       74.5    72.0  -
S3D [26]               2018  8.8         75.9    74.7  -
SlowFast [8]           2019  62.8        -       79.8  -
X3D [9]                2020  20.3        -       80.4  -

Transformer-based      Year  Params (M)  HMDB51  K400  SSv2
VTN [11]               2021  114.0       -       79.8  -
TimeSformer [10]       2021  121.4       -       80.7  62.4
STAM [28]              2021  96.0        -       80.5  -
ViViT-L [12]           2021  -           -       81.7  65.9
MViT-B [13]            2021  36.6        -       81.2  67.7
Motionformer [30]      2021  -           -       81.1  67.1
X-ViT [31]             2021  92.0        -       80.2  67.2
Swin-L [14]            2021  200.0       -       84.9  -
UniFormer [35]         2022  49.8        -       83.0  71.2
MTV-B [32]             2022  310.0       -       89.9  68.5
MVD [40]               2022  87.8        -       87.2  77.3
InternVideo [41]       2022  1000.0      -       91.1  77.2
Side4Video [42]        2023  4.0         -       88.6  75.5
Hiera [34]             2023  673.4       -       87.8  76.5
VideoMAE V2 [33]       2023  1011.0      88.1    90.0  77.0

3. COMPARISON

We compare various methods and discuss different network backbones and benchmarks. For backbones, we start by comparing models with the same base encoder before collectively evaluating all models. For benchmarks, we select HMDB-51 [43], Kinetics-400 [44], and Something-Something V2 [45] due to their widespread use.

The results shown in Tab. 1 demonstrate significant progress in the field of video action recognition over the last decade, thanks both to advancements in network architecture design and to the introduction of larger datasets. While VideoMAE V2 [33] achieves outstanding performance, particularly setting a new benchmark on HMDB-51 with the highest accuracy, it is essential to note that on Kinetics-400 and Something-Something V2 other models lead. Specifically, InternVideo [41] shows exceptional results on Kinetics-400, and MVD [40] achieves remarkable accuracy on Something-Something V2, highlighting the effectiveness of transformer architectures and their adaptations for video understanding. Transformer-based methods have generally outperformed two-stream networks and 3D CNNs, showcasing the advantage of leveraging extensive parameters and innovative spatiotemporal attention mechanisms. Among these transformer-based approaches, models like Swin-L [14], UniFormer [35], and MVD [40] not only demonstrate the power of joint spatiotemporal processing but also underscore the evolving landscape of video action recognition, where large-scale datasets and complex architectures drive improvements in model accuracy and efficiency.

Although these methods have reached the pinnacle of accuracy in action recognition, the real-time requirements of the metaverse mean that a balance between accuracy and the number of parameters must be struck in practical applications. For example, the Side4Video [42] method maintains high accuracy (88.6 on K400 and 75.5 on SSv2) with a small number of parameters (4M).

4. CONCLUSION

In reviewing deep neural networks for video action recognition, we explored three primary families of methods: Two-stream networks, 3D Convolutional Neural Networks (CNNs), and Transformer-based approaches. This survey underscores the crucial role of spatial-temporal information across all of them. Two-stream networks combine spatial-temporal information by processing spatial and temporal information separately through dedicated RGB and optical flow streams. 3D CNNs treat videos as three-dimensional matrices and use three-dimensional convolution kernels to fuse spatial and temporal information. Transformer-based networks decompose video matrices into small patches and then fuse the spatial and temporal information within them through attention blocks. Although these methods have achieved solid results on real-world video action recognition, applying them to metaverse scenarios requires new, large datasets for training. At the same time, the computation of the models needs to be further optimized to meet the real-time requirements of the metaverse. We hope our review will provide a clear and objective reference for future explorations.

5. ACKNOWLEDGEMENT

This work was supported by the National Natural Science Foundation of China (No. 62203476) and the Natural Science Foundation of Shenzhen (No. JCYJ20230807120801002).

6. REFERENCES
[1] Stylianos Mystakidis, "Metaverse," Encyclopedia, vol. 2, no. 1, pp. 486–497, 2022.
[2] Karen Simonyan and Andrew Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, vol. 27, 2014.
[3] Limin Wang, Yu Qiao, and Xiaoou Tang, "Action recognition with trajectory-pooled deep-convolutional descriptors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[4] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in European Conference on Computer Vision. Springer, 2016, pp. 20–36.
[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[6] Joao Carreira and Andrew Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[7] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh, "Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555.
[8] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He, "Slowfast networks for video recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6202–6211.
[9] Christoph Feichtenhofer, "X3d: Expanding architectures for efficient video recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 203–213.
[10] Gedas Bertasius, Heng Wang, and Lorenzo Torresani, "Is space-time attention all you need for video understanding?," arXiv preprint arXiv:2102.05095, 2021.
[11] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann, "Video transformer network," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3163–3172.
[12] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid, "Vivit: A video vision transformer," in International Conference on Computer Vision (ICCV), 2021.
[13] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer, "Multiscale vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6824–6835.
[14] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu, "Video swin transformer," arXiv preprint arXiv:2106.13230, 2021.
[15] Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer, "Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition," arXiv preprint arXiv:2201.08383, 2022.
[16] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[17] Guilhem Chéron, Ivan Laptev, and Cordelia Schmid, "P-cnn: Pose-based cnn features for action recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3218–3226.
[18] Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, and Cordelia Schmid, "Potion: Pose motion representation for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7024–7033.
[19] Mengyuan Liu, Fanyang Meng, Chen Chen, and Songtao Wu, "Joint dynamic pose image and space time reversal for human action recognition from videos," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 8762–8769.
[20] Mengyuan Liu, Hong Liu, Qianru Sun, Tianwei Zhang, and Runwei Ding, "Salient pairwise spatio-temporal interest points for real-time activity recognition," CAAI Transactions on Intelligence Technology, vol. 1, no. 1, pp. 14–29, 2016.
[21] Georgia Gkioxari, Ross Girshick, and Jitendra Malik, "Contextual action recognition with r*cnn," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1080–1088.
[22] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[23] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu, "3d convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2012.
[24] Zhaofan Qiu, Ting Yao, and Tao Mei, "Learning spatio-temporal representation with pseudo-3d residual networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5533–5541.
[25] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri, "A closer look at spatiotemporal convolutions for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.
[26] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy, "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 305–321.
[27] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[28] Gilad Sharir, Asaf Noy, and Lihi Zelnik-Manor, "An image is worth 16x16 words, what is a video worth?," 2021.
[29] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[30] Mandela Patrick, Dylan Campbell, Yuki Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, and João F Henriques, "Keeping your eye on the ball: Trajectory attention in video transformers," in Advances in Neural Information Processing Systems, vol. 34, 2021.
[31] Adrian Bulat, Juan Manuel Perez Rua, Swathikiran Sudhakaran, Brais Martinez, and Georgios Tzimiropoulos, "Space-time mixing attention for video transformer," in Advances in Neural Information Processing Systems, vol. 34, 2021.
[32] Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, and Cordelia Schmid, "Multiview transformers for video recognition," arXiv preprint arXiv:2201.04288, 2022.
[33] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao, "Videomae v2: Scaling video masked autoencoders with dual masking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 14549–14560.
[34] Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al., "Hiera: A hierarchical vision transformer without the bells-and-whistles," in International Conference on Machine Learning. PMLR, 2023, pp. 29441–29454.
[35] Kunchang Li, Yali Wang, Gao Peng, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao, "Uniformer: Unified transformer for efficient spatial-temporal representation learning," in International Conference on Learning Representations, 2022.
[36] Zhenzhong Lan, Yi Zhu, Alexander G Hauptmann, and Shawn Newsam, "Deep local video feature for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 1–7.
[37] Ali Diba, Vivek Sharma, and Luc Van Gool, "Deep temporal linear encoding networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2329–2338.
[38] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell, "Actionvlad: Learning spatio-temporal aggregation for action classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 971–980.
[39] Ji Lin, Chuang Gan, and Song Han, "Tsm: Temporal shift module for efficient video understanding," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7083–7093.
[40] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, and Yu-Gang Jiang, "Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6312–6322.
[41] Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al., "Internvideo: General video foundation models via generative and discriminative learning," arXiv preprint arXiv:2212.03191, 2022.
[42] Huanjin Yao, Wenhao Wu, and Zhiheng Li, "Side4video: Spatial-temporal side network for memory-efficient image-to-video transfer learning," arXiv preprint arXiv:2311.15769, 2023.
[43] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre, "Hmdb: A large video database for human motion recognition," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2556–2563.
[44] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[45] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al., "The "something something" video database for learning and evaluating visual common sense," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5842–5850.