Action Recognition 2

* Corresponding author. Email: nkliuyifang@gmail.com
ABSTRACT

Action recognition is a key technology for building interactive metaverses. With the rapid development of deep learning, action recognition methods have also advanced greatly. Researchers design and implement backbones from multiple standpoints, which leads to a diversity of methods and to new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) Two-Stream networks and their variants, which, specifically in this paper, use RGB video frames and the optical flow modality as input; 2) 3D convolutional networks, which exploit the RGB modality directly so that extracting separate motion information is no longer necessary; 3) Transformer-based methods, which introduce the model from natural language processing into computer vision and video understanding. We offer objective insights in this review and hope to provide a reference for future research.

Fig. 1. Over the last decade, many video action recognition datasets with various labels have been proposed, which contributes to the advancement of action recognition tasks.

Index Terms— Video understanding, Action recognition
1. INTRODUCTION
Video action recognition is a foundational technology for building the metaverse because it meets the need for immersive interactive experiences [1]. With the development of deep learning and computing power, deep neural networks have gradually taken a dominant place in computer vision. The Convolutional Neural Network (CNN) was primarily designed for image classification. Due to its great success in the image domain, CNN-based methods were extended to video understanding. The Transformer was then introduced into computer vision and also achieved great success, which led to research on Transformer-based methods in video understanding and action recognition.

CNNs have shown remarkable performance in tasks related to still images, such as image classification and semantic segmentation. The Two-Stream Networks method [2] introduces an additional pathway that takes optical flow as input, which encodes the temporal information in videos. The success of Two-Stream networks inspired many follow-up studies [3, 4]. Another family of CNN methods fuses temporal information with spatial features by directly using 3D convolutional filters [5], and research on 3D convolutional networks has also advanced greatly [6, 7, 8, 9]. The Transformer has proven effective in computer vision even without convolutional filters; its attention mechanism allows models to perform video action recognition more accurately in metaverse scenarios, and many Transformer-based methods have been proposed in recent years [10, 11, 12, 13, 14, 15].

The rapid development of video action recognition comes with various methods and complex challenges, which motivates this review. In this paper: 1) We overview the development of video action recognition and introduce popular benchmarks. 2) We review several notable deep neural network methods for video action recognition, including Transformer-based networks. 3) We summarize future challenges and promising directions in the video action recognition field.
2. BACKBONES OF ACTION RECOGNITION

2.1. Two-Stream Networks in Action Recognition

2.1.1. Two-Stream Networks

The two-stream network extends a single-stream network by using both a spatial stream and a temporal stream to extract video information. Two-Stream Neural Networks [2], proposed in 2014, expanded the image-domain CNN into a spatial stream and a temporal stream. As shown in Fig. 2(a), the spatial stream is the same as a CNN in the image domain and performs action recognition in space. The temporal stream takes a stacked optical flow, which represents the temporal components of the video. After the two streams are processed in parallel, the predictions of the two CNNs are fused into a class score. This fusion is referred to as late fusion [16], meaning there is no information interaction during feature extraction.

The additional temporal stream makes it possible for CNNs to match traditional handcrafted features and inspired much follow-up research. TDD [3] improves CNN performance with trajectory-constrained pooling, which benefits from both deep networks and handcrafted features. TSN [4] then models long-range temporal information by sparse sampling.
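As a concrete illustration of the late-fusion design described above, the following PyTorch sketch builds two small CNNs, one for an RGB frame and one for a stack of optical-flow fields, and averages their class scores. The layer sizes and the flow-stack length are illustrative assumptions, not the exact architecture of [2].

    import torch
    import torch.nn as nn

    # Minimal two-stream sketch: one CNN sees an RGB frame, the other a stack
    # of L=10 optical-flow fields (2L channels); scores are fused at the end.
    def make_stream(in_channels, num_classes):
        return nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    class TwoStreamNet(nn.Module):
        def __init__(self, num_classes=51, flow_stack=10):
            super().__init__()
            self.spatial = make_stream(3, num_classes)                 # RGB frame
            self.temporal = make_stream(2 * flow_stack, num_classes)   # stacked flow

        def forward(self, rgb, flow):
            # Late fusion [16]: the two streams never exchange features;
            # only their class scores are combined.
            return (self.spatial(rgb).softmax(-1) + self.temporal(flow).softmax(-1)) / 2

    model = TwoStreamNet()
    scores = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))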
2.1.2. Multi-Stream Networks

Action recognition can also benefit from pose estimation information. P-CNN [17] aggregates the appearance and motion information of the human pose by tracking human body parts; in particular, it uses the positions of joints to represent body regions with CNN descriptors and constructs the P-CNN feature. PoTion [18] uses the movement of human joints over a video clip as semantic keypoints to represent human action. [19] proposes the dynamic pose image (DPI) as a compact pose feature for human action recognition; based on joint estimation maps, DPI captures richer information about human body parts than pose-based methods that use only joint locations. [20] introduces salient directed graphs with Time-Salient and Space-Salient pairwise features for efficient real-time human action recognition. Object information can also benefit action recognition. R*CNN [21] adapts RCNN to predict actions from more than one region, observing that action videos usually come with contextual cues such as objects and scenes, which provide an additional source of information for video understanding.
2.2. 3D CNNs

2.2.1. Inspiration from the Image Domain

The general pipeline of a 3D CNN is shown in Fig. 2(b) and closely resembles the image-domain pipeline. One of the most challenging issues of 3D convolution is that the large number of parameters makes training and convergence of the model harder. In 2017, Carreira et al. [6] proposed the I3D model, whose architecture is inspired by Inception [22] and can be initialized by inflating the 2D convolutions of the Inception network to 3D convolutions. The inflating technique makes it possible for 3D CNNs to utilize existing large-scale image datasets such as ImageNet. The newly proposed, larger Kinetics dataset was also used to train the I3D network and played an important role in its convergence. Using only RGB as input, the I3D network achieves better accuracy on the HMDB-51 dataset than the earlier 3D convolution approaches [23, 5].
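The inflation step itself is simple to express in code. Below is a minimal PyTorch sketch of inflating a pretrained 2D convolution into a 3D one by repeating its kernel along a new temporal axis and rescaling by the temporal extent; it illustrates the idea described above rather than the full I3D [6] model.

    import torch
    import torch.nn as nn

    # Inflate a 2D kernel (C_out, C_in, k, k) into a 3D kernel by repeating it
    # t times along a new temporal axis and dividing by t, so the inflated
    # filter initially responds to a repeated still image as the 2D one did.
    def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
        conv3d = nn.Conv3d(
            conv2d.in_channels, conv2d.out_channels,
            kernel_size=(time_dim, *conv2d.kernel_size),
            stride=(1, *conv2d.stride),
            padding=(time_dim // 2, *conv2d.padding),
            bias=conv2d.bias is not None,
        )
        with torch.no_grad():
            w2d = conv2d.weight                                  # (C_out, C_in, k, k)
            conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim)
            if conv2d.bias is not None:
                conv3d.bias.copy_(conv2d.bias)
        return conv3d

    conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # e.g. pretrained on ImageNet
    conv3d = inflate_conv2d(conv2d)                                # operates on (B, C, T, H, W) clips
    out = conv3d(torch.randn(1, 3, 16, 112, 112))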
2.2.2. Spatiotemporal Semantic Information

Highlighting the significance of spatio-temporal information, various approaches have been proposed to analyze video features. Qiu et al. [24] introduced the P3D model, which simplifies optimization by separating the spatial and temporal convolution parameters. Similarly, Tran et al. [25] developed the R(2+1)D model, which factorizes a standard t × d × d convolution into a spatial 1 × d × d convolution followed by a temporal t × 1 × 1 convolution with Mi intermediate channels, enhancing temporal modeling and network trainability. Xie et al. [26] presented the S3D model, which also separates 3D convolutions into spatial and temporal parts and mixes 2D and 3D convolutions in a top-heavy structure for improved prediction accuracy.
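A factorized block of this kind can be written in a few lines. The following PyTorch sketch shows one "(2+1)D" unit in the spirit of P3D [24] and R(2+1)D [25]; the intermediate width is an illustrative assumption (R(2+1)D chooses it so that the parameter count matches the full 3D filter).

    import torch
    import torch.nn as nn

    # A t x d x d 3D convolution replaced by a spatial 1 x d x d convolution
    # followed by a temporal t x 1 x 1 convolution, with a nonlinearity between.
    class Conv2Plus1D(nn.Module):
        def __init__(self, c_in, c_out, d=3, t=3, mid=64):
            super().__init__()
            self.spatial = nn.Conv3d(c_in, mid, kernel_size=(1, d, d),
                                     padding=(0, d // 2, d // 2), bias=False)
            self.relu = nn.ReLU(inplace=True)
            self.temporal = nn.Conv3d(mid, c_out, kernel_size=(t, 1, 1),
                                      padding=(t // 2, 0, 0), bias=False)

        def forward(self, x):            # x: (B, C, T, H, W)
            return self.temporal(self.relu(self.spatial(x)))

    block = Conv2Plus1D(3, 64)
    out = block(torch.randn(1, 3, 16, 56, 56))   # -> (1, 64, 16, 56, 56)

The extra nonlinearity between the two factors is part of what the R(2+1)D paper credits for the improved trainability mentioned above.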
To address the limited receptive field of the convolution kernel, Wang et al. [27] introduced the Non-local module for global spatio-temporal feature aggregation. Inspired by biological mechanisms, Feichtenhofer et al. [8] designed the SlowFast network, leveraging residual connections and Non-local modules for effective video analysis. Feichtenhofer [9] further proposed the X3D model, optimizing computational efficiency and parameter count through a machine learning-inspired feature selection procedure.
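The Non-local operation referenced above amounts to a self-attention layer over all space-time positions. A minimal PyTorch sketch of an embedded-Gaussian non-local block in the spirit of [27] follows; the channel sizes are illustrative.

    import torch
    import torch.nn as nn

    # Every space-time position attends to every other position, giving a
    # global receptive field in a single layer; a residual connection keeps
    # the block easy to drop into an existing 3D CNN.
    class NonLocalBlock(nn.Module):
        def __init__(self, channels, inter=None):
            super().__init__()
            inter = inter or channels // 2
            self.theta = nn.Conv3d(channels, inter, 1)
            self.phi = nn.Conv3d(channels, inter, 1)
            self.g = nn.Conv3d(channels, inter, 1)
            self.out = nn.Conv3d(inter, channels, 1)

        def forward(self, x):                              # x: (B, C, T, H, W)
            b, c, t, h, w = x.shape
            q = self.theta(x).flatten(2).transpose(1, 2)   # (B, THW, C')
            k = self.phi(x).flatten(2)                     # (B, C', THW)
            v = self.g(x).flatten(2).transpose(1, 2)       # (B, THW, C')
            attn = torch.softmax(q @ k, dim=-1)            # (B, THW, THW)
            y = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
            return x + self.out(y)                         # residual connection

    block = NonLocalBlock(64)
    out = block(torch.randn(1, 64, 4, 14, 14))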
Fig. 2. We review deep neural network backbones in video action recognition: the general architectures of (a) Two-stream networks and (b) 3D CNNs, as well as two ways of improving Transformers for action recognition: designing different kinds of attention mechanisms (c) and introducing multi-scale/multi-view features (d) into the model.
2.3. Transformer-based Neural Networks

2.3.1. Transformer-based Architectures and Spatiotemporal Attention Design

Transformer-based architectures have been progressively applied to texts, images, and videos, with significant achievements. For videos, designing the space-time attention module is a critical aspect of vision transformer architectures, as illustrated in Fig. 2(c). Recent transformer-based video action recognition architectures introduce four primary approaches to spatiotemporal attention: split, factorized, joint, and redesigned.

The split approach, as seen in VTN [11], uses a spatial backbone, a temporal attention encoder, and a classification head, thereby separating spatial and temporal processing. STAM, proposed by Sharir et al. [28], follows a similar approach by combining a temporal aggregating transformer with spatial transformers. In contrast, Arnab et al. [12] explored different attention mechanisms and advocated a factorized encoder in ViViT to combine spatial and temporal information efficiently, addressing the computational cost of joint space-time attention.

Joint spatiotemporal attention is implemented in the Video Swin Transformer by Liu et al. [14], which uses shifted windows [29] in its self-attention module to balance speed and accuracy, demonstrating strong performance across various recognition tasks.

Additional redesigned attention mechanisms include Trajectory Attention by Patrick et al. [30] in the Motionformer, which aggregates information along motion paths, and Space-time Mixing Attention by Bulat et al. [31], which enhances the efficiency of ViT at minimal computational cost.
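To make the split and factorized designs above concrete, the sketch below shows a "divided" space-time attention layer in the spirit of these factorized schemes (e.g., TimeSformer [10] and the factorized ViViT encoder [12]): tokens first attend over time at each spatial location and then over space within each frame, instead of one joint attention over all T*N tokens. The embedding width and head count are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Divided space-time attention over patch tokens of shape (B, T, N, D).
    class DividedSpaceTimeAttention(nn.Module):
        def __init__(self, dim=192, heads=3):
            super().__init__()
            self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):                                  # x: (B, T, N, D)
            b, t, n, d = x.shape
            xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)    # attend over T per location
            xt = xt + self.time_attn(xt, xt, xt, need_weights=False)[0]
            xs = xt.reshape(b, n, t, d).permute(0, 2, 1, 3).reshape(b * t, n, d)
            xs = xs + self.space_attn(xs, xs, xs, need_weights=False)[0]  # attend over N per frame
            return xs.reshape(b, t, n, d)

    attn = DividedSpaceTimeAttention()
    out = attn(torch.randn(2, 8, 196, 192))        # 8 frames x 14x14 patches

Compared with joint attention, the cost per layer drops from O((TN)^2) to O(T^2 N + N^2 T) token interactions, which is the efficiency argument made by the factorized designs.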
2.3.2. Multiscale and Multiview Transformers

Drawing on the pyramid concept from multiscale processing, researchers developed similar options for vision transformers. Fig. 2(d) highlights Fan et al.'s [13] blend of transformers with CNN-style multiscale features, leading to MViT. This model introduces a hierarchical structure with Multi-Head Pooling Attention that adapts the resolution, expanding channel capacity while lowering spatial resolution in stages to build a multiscale feature pyramid. Wu et al. [15] further evolved this design with MeMViT, enhancing long-term video recognition by reusing previously stored memory; according to the authors, the approach is adaptable to various transformer models. Yan et al. [32] diverged by developing a Multiview Transformer, which processes different video views through distinct encoders and achieves its best results with cross-view attention, effectively balancing accuracy and computational demands for ViT variants. VideoMAE V2 [33] achieves state-of-the-art performance on various downstream tasks through a carefully designed masking strategy. Hiera [34] achieves a simpler and more efficient hierarchical vision transformer that excels across multiple visual tasks by leveraging strong visual pretext-task (MAE) pretraining to eliminate complex components traditionally used in hierarchical vision transformers.
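Masked-autoencoder pretraining of this kind relies on a video-specific masking pattern. The sketch below shows the "tube" random masking commonly used by masked video autoencoders such as VideoMAE; the dual-masking strategy that VideoMAE V2 [33] adds on the decoder side is not shown, so treat this as an illustration of the general idea rather than that paper's exact scheme.

    import torch

    # One spatial mask is sampled per video and shared by all frames, so a
    # masked patch is hidden for the whole clip and cannot be trivially
    # copied from a neighboring frame.
    def tube_mask(batch, frames, patches_per_frame, ratio=0.9):
        keep = int(patches_per_frame * (1 - ratio))
        noise = torch.rand(batch, patches_per_frame)            # one mask per video
        keep_idx = noise.argsort(dim=1)[:, :keep]               # indices of visible patches
        mask = torch.ones(batch, patches_per_frame, dtype=torch.bool)
        mask[torch.arange(batch).unsqueeze(1), keep_idx] = False  # True = masked
        return mask.unsqueeze(1).expand(batch, frames, patches_per_frame)

    mask = tube_mask(batch=2, frames=8, patches_per_frame=196)
    visible_per_clip = (~mask[0, 0]).sum().item() * 8            # 19 visible patches x 8 frames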
2.3.3. Integration of Transformer and CNN

Before the emergence of pure-transformer models in computer vision, initial efforts focused on enhancing CNNs with self-attention to improve long-range dependency modeling. NLNet, as noted by Liu et al., was a forerunner in adding self-attention to CNNs for pixel-level long-range dependencies. Following NLNet, various efforts sought to refine this design by simplifying or completely overhauling the non-local block; for instance, Cao et al. enhanced NLNet's global context capture with a query-independent approach. With increasing interest in combining Transformers and CNNs, new methods emerged. Xiao et al. showed that adding a convolutional stem to image transformers improves optimization stability and performance without losing computational efficiency, which encourages the use of early convolution layers in ViT architectures for video analysis. Li et al. introduced UniFormer [35], which learns spatiotemporal representations by balancing global dependency capture and local redundancy reduction across ViT and CNN designs.
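The convolutional-stem idea is easy to illustrate: replace the single large-stride patch-embedding convolution with a short stack of small-stride convolutions that produces the same token grid. The PyTorch sketch below follows the spirit of Xiao et al.'s design; the channel widths are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Four stride-2 convolutions give the same 16x downsampling as a single
    # stride-16 patchify layer, but with an easier-to-optimize early stage.
    class ConvStem(nn.Module):
        def __init__(self, dim=192):
            super().__init__()
            chans = [3, 24, 48, 96, dim]
            layers = []
            for c_in, c_out in zip(chans[:-1], chans[1:]):
                layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                           nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
            self.stem = nn.Sequential(*layers)

        def forward(self, frames):                 # frames: (B, 3, H, W)
            x = self.stem(frames)                  # (B, dim, H/16, W/16)
            return x.flatten(2).transpose(1, 2)    # (B, N, dim) tokens for the transformer

    tokens = ConvStem()(torch.randn(2, 3, 224, 224))   # -> (2, 196, 192)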
Table 1. Comparison of different network architectures. Accuracy is reported in % on HMDB51, Kinetics-400 (K400), and Something-Something V2 (SSv2).

Two-Stream Networks    Year  Params (M)  HMDB51  K400  SSv2
TN [2]                 2014  36.6        59.4    -     -
TDD [3]                2015  -           63.2    -     -
TSN [4]                2016  -           68.5    73.9  -
DOVF [36]              2017  -           71.7    -     -
TLE [37]               2017  -           71.7    -     -
ActionVLAD [38]        2017  -           66.9    -     -
TSM [39]               2019  24.3        73.5    74.1  61.7

3D CNNs                Year  Params (M)  HMDB51  K400  SSv2
C3D [5]                2015  34.8        56.8    59.5  -
I3D [6]                2017  -           74.8    71.1  -
Two-Stream I3D [6]     2017  25.0        80.9    74.2  -
P3D [24]               2017  98.0        -       71.6  -
ResNet3D [7]           2018  -           70.2    65.1  -
NL I3D [27]            2018  61.0        -       77.7  -
R(2+1)D [25]           2018  118.2       74.5    72.0  -
S3D [26]               2018  8.8         75.9    74.7  -
SlowFast [8]           2019  62.8        -       79.8  -
X3D [9]                2020  20.3        -       80.4  -

Transformer-based      Year  Params (M)  HMDB51  K400  SSv2
VTN [11]               2021  114.0       -       79.8  -
TimeSformer [10]       2021  121.4       -       80.7  62.4
STAM [28]              2021  96.0        -       80.5  -
ViViT-L [12]           2021  -           -       81.7  65.9
MViT-B [13]            2021  36.6        -       81.2  67.7
Motionformer [30]      2021  -           -       81.1  67.1
X-ViT [31]             2021  92.0        -       80.2  67.2
Swin-L [14]            2021  200.0       -       84.9  -
UniFormer [35]         2022  49.8        -       83.0  71.2
MTV-B [32]             2022  310.0       -       89.9  68.5
MVD [40]               2022  87.8        -       87.2  77.3
InternVideo [41]       2022  1000.0      -       91.1  77.2
Side4Video [42]        2023  4.0         -       88.6  75.5
Hiera [34]             2023  673.4       -       87.8  76.5
VideoMAE V2 [33]       2023  1011.0      88.1    90.0  77.0

3. COMPARISON

We compare various methods and discuss different network backbones and benchmarks. For backbones, we start by comparing models with the same base encoder before collectively evaluating all models. For benchmarks, we select HMDB-51 [43], Kinetics-400 [44], and Something-Something V2 [45] due to their widespread use.

The results shown in Tab. 1 demonstrate significant progress in the field of video action recognition over the last decade, thanks both to advancements in network architecture design and to the introduction of larger datasets. While VideoMAE V2 [33] achieves outstanding performance, particularly setting a new benchmark on HMDB-51 with the highest accuracy, it is essential to note that on Kinetics-400 and Something-Something V2 other models lead. Specifically, InternVideo [41] shows exceptional results on Kinetics-400, and MVD [40] achieves remarkable accuracy on Something-Something V2, highlighting the effectiveness of transformer architectures and their adaptations for video understanding. Transformer-based methods have generally outperformed two-stream networks and 3D CNNs, showcasing the advantage of leveraging extensive parameters and innovative spatiotemporal attention mechanisms. Among these transformer-based approaches, models like Swin-L [14], UniFormer [35], and MVD [40] not only demonstrate the power of joint spatiotemporal processing but also underscore the evolving landscape of video action recognition, where large-scale datasets and complex architectures drive improvements in model accuracy and efficiency.

Although these methods have reached the pinnacle of accuracy in action recognition, the real-time requirements of the metaverse mean that a balance between accuracy and the number of parameters must be struck in practical applications. For example, the Side4Video [42] method maintains high accuracy (88.6 on K400 and 75.5 on SSv2) with a small number of parameters (4M).

4. CONCLUSION

In reviewing deep neural networks for video action recognition, we explored three primary families of methods: Two-stream networks, 3D Convolutional Neural Networks (CNNs), and Transformer-based approaches. This survey underscores the crucial role of spatial-temporal information across all of them. Two-stream networks combine spatial-temporal information by processing spatial and temporal information separately through dedicated RGB and optical flow streams. 3D CNNs treat videos as three-dimensional matrices and use three-dimensional convolution kernels to fuse spatial and temporal information. Transformer-based networks decompose video matrices into small patches and then fuse the spatial and temporal information within them through attention blocks. Although these methods have achieved solid results on real-world video action recognition, applying them to metaverse scenarios requires new, large datasets for training. At the same time, the computation of the models needs to be further optimized to meet the real-time requirements of the metaverse. We hope our review will provide a clear and objective reference for future explorations.

5. ACKNOWLEDGEMENT

This work was supported by the National Natural Science Foundation of China (No. 62203476) and the Natural Science Foundation of Shenzhen (No. JCYJ20230807120801002).

6. REFERENCES
[1] Stylianos Mystakidis, "Metaverse," Encyclopedia, vol. 2, no. 1, pp. 486–497, 2022.
[2] Karen Simonyan and Andrew Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, vol. 27, 2014.
[3] Limin Wang, Yu Qiao, and Xiaoou Tang, "Action recognition with trajectory-pooled deep-convolutional descriptors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[4] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in European Conference on Computer Vision. Springer, 2016, pp. 20–36.
[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[6] Joao Carreira and Andrew Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[7] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh, "Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555.
[8] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He, "Slowfast networks for video recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6202–6211.
[9] Christoph Feichtenhofer, "X3d: Expanding architectures for efficient video recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 203–213.
[10] Gedas Bertasius, Heng Wang, and Lorenzo Torresani, "Is space-time attention all you need for video understanding?," arXiv preprint arXiv:2102.05095, 2021.
[11] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann, "Video transformer network," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3163–3172.
[12] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid, "Vivit: A video vision transformer," in International Conference on Computer Vision (ICCV), 2021.
[13] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer, "Multiscale vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6824–6835.
[14] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu, "Video swin transformer," arXiv preprint arXiv:2106.13230, 2021.
[15] Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer, "Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition," arXiv preprint arXiv:2201.08383, 2022.
[16] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[17] Guilhem Chéron, Ivan Laptev, and Cordelia Schmid, "P-cnn: Pose-based cnn features for action recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3218–3226.
[18] Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, and Cordelia Schmid, "Potion: Pose motion representation for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7024–7033.
[19] Mengyuan Liu, Fanyang Meng, Chen Chen, and Songtao Wu, "Joint dynamic pose image and space time reversal for human action recognition from videos," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 8762–8769.
[20] Mengyuan Liu, Hong Liu, Qianru Sun, Tianwei Zhang, and Runwei Ding, "Salient pairwise spatio-temporal interest points for real-time activity recognition," CAAI Transactions on Intelligence Technology, vol. 1, no. 1, pp. 14–29, 2016.
[21] Georgia Gkioxari, Ross Girshick, and Jitendra Malik, "Contextual action recognition with r*cnn," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1080–1088.
[22] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[23] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu, "3d convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2012.
[24] Zhaofan Qiu, Ting Yao, and Tao Mei, "Learning spatio-temporal representation with pseudo-3d residual networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5533–5541.
[25] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri, "A closer look at spatiotemporal convolutions for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.
[26] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy, "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 305–321.
[27] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[28] Gilad Sharir, Asaf Noy, and Lihi Zelnik-Manor, "An image is worth 16x16 words, what is a video worth?," 2021.
[29] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[30] Mandela Patrick, Dylan Campbell, Yuki Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, and João F Henriques, "Keeping your eye on the ball: Trajectory attention in video transformers," in Advances in Neural Information Processing Systems, vol. 34, 2021.
[31] Adrian Bulat, Juan Manuel Perez Rua, Swathikiran Sudhakaran, Brais Martinez, and Georgios Tzimiropoulos, "Space-time mixing attention for video transformer," in Advances in Neural Information Processing Systems, vol. 34, 2021.
[32] Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, and Cordelia Schmid, "Multiview transformers for video recognition," arXiv preprint arXiv:2201.04288, 2022.
[33] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao, "Videomae v2: Scaling video masked autoencoders with dual masking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 14549–14560.
[34] Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al., "Hiera: A hierarchical vision transformer without the bells-and-whistles," in International Conference on Machine Learning. PMLR, 2023, pp. 29441–29454.
[35] Kunchang Li, Yali Wang, Gao Peng, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao, "Uniformer: Unified transformer for efficient spatial-temporal representation learning," in International Conference on Learning Representations, 2022.
[36] Zhenzhong Lan, Yi Zhu, Alexander G Hauptmann, and Shawn Newsam, "Deep local video feature for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 1–7.
[37] Ali Diba, Vivek Sharma, and Luc Van Gool, "Deep temporal linear encoding networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2329–2338.
[38] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell, "Actionvlad: Learning spatio-temporal aggregation for action classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 971–980.
[39] Ji Lin, Chuang Gan, and Song Han, "Tsm: Temporal shift module for efficient video understanding," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7083–7093.
[40] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, and Yu-Gang Jiang, "Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6312–6322.
[41] Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al., "Internvideo: General video foundation models via generative and discriminative learning," arXiv preprint arXiv:2212.03191, 2022.
[42] Huanjin Yao, Wenhao Wu, and Zhiheng Li, "Side4video: Spatial-temporal side network for memory-efficient image-to-video transfer learning," arXiv preprint arXiv:2311.15769, 2023.
[43] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre, "Hmdb: A large video database for human motion recognition," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2556–2563.
[44] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[45] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al., "The "something something" video database for learning and evaluating visual common sense," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5842–5850.