
ViViT: A Video Vision Transformer

Anurag Arnab* Mostafa Dehghani* Georg Heigold Chen Sun Mario Lučić† Cordelia Schmid†
Google Research
{aarnab, dehghani, heigold, chensun, lucic, cordelias}@google.com

arXiv:2103.15691v1 [cs.CV] 29 Mar 2021

* Equal contribution
† Equal advising

Abstract

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we will release code and models.

1. Introduction

Approaches based on deep convolutional neural networks have advanced the state-of-the-art across many standard datasets for vision problems since AlexNet [35]. At the same time, the most prominent architecture of choice in sequence-to-sequence modelling (e.g. in natural language processing) is the transformer [65], which does not use convolutions, but is based on multi-headed self-attention. This operation is particularly effective at modelling long-range dependencies and allows the model to attend over all elements in the input sequence. This is in stark contrast to convolutions, where the corresponding "receptive field" is limited and grows linearly with the depth of the network.

The success of attention-based models in NLP has recently inspired approaches in computer vision to integrate transformers into CNNs [72, 5], as well as some attempts to replace convolutions completely [46, 50]. However, it is only very recently, with the Vision Transformer (ViT) [15], that a pure-transformer based architecture has outperformed its convolutional counterparts in image classification. Dosovitskiy et al. [15] closely followed the original transformer architecture of [65], and noticed that its main benefits were observed at large scale: as transformers lack some of the inductive biases of convolutions (such as translational equivariance), they seem to require more data [15] or stronger regularisation [61].

Inspired by ViT, and the fact that attention-based architectures are an intuitive choice for modelling long-range contextual relationships in video, we develop several transformer-based models for video classification. Currently, the most performant models are based on deep 3D convolutional architectures [6, 17, 18], which were a natural extension of image classification CNNs [24, 57]. Recently, these models were augmented by incorporating self-attention into their later layers to better capture long-range dependencies [72, 20, 76].

As shown in Fig. 1, we propose pure-transformer models for video classification. The main operation performed in this architecture is self-attention, and it is computed on a sequence of spatio-temporal tokens that we extract from the input video. To effectively process the large number of spatio-temporal tokens that may be encountered in video, we present several methods of factorising our model along spatial and temporal dimensions to increase efficiency and scalability. Furthermore, to train our model effectively on smaller datasets, we show how to regularise our model during training and leverage pretrained image models.

We also note that convolutional models have been developed by the community for several years, and there are thus many "best practices" associated with such models. As pure-transformer models present different characteristics, we need to determine the best design choices for such architectures. We conduct a thorough ablation analysis of tokenisation strategies, model architecture and regularisation methods. Informed by this analysis, we achieve state-of-the-art results on multiple standard video classification benchmarks, including Kinetics 400 and 600 [32], Epic Kitchens 100 [11], Something-Something v2 [23] and Moments in Time [42]. We will release code and models.
Figure 1: We propose a pure-transformer architecture for video classification, inspired by the recent success of such models for images [15]. To effectively process a large number of spatio-temporal tokens, we develop several model variants which factorise different components of the transformer encoder over the spatial and temporal dimensions. As shown on the right, these factorisations correspond to different attention patterns over space and time.

2. Related Work

Architectures for video understanding have mirrored advances in image recognition. Early video research used hand-crafted features to encode appearance and motion information [38, 66]. The success of AlexNet on ImageNet [35, 13] initially led to the repurposing of 2D image convolutional networks (CNNs) for video as "two-stream" networks [31, 53, 44]. These models processed RGB frames and optical flow images independently before fusing them at the end. The availability of larger video classification datasets such as Kinetics [32] subsequently facilitated the training of spatio-temporal 3D CNNs [6, 19, 62], which have significantly more parameters and thus require larger training datasets. As 3D convolutional networks require significantly more computation than their image counterparts, many architectures factorise convolutions across spatial and temporal dimensions and/or use grouped convolutions [56, 63, 64, 78, 17]. We also leverage factorisation of the spatial and temporal dimensions of videos to increase efficiency, but in the context of transformer-based models.

Concurrently, in natural language processing (NLP), Vaswani et al. [65] achieved state-of-the-art results by replacing convolutions and recurrent networks with the transformer, a network consisting only of self-attention, layer normalisation and multilayer perceptron (MLP) operations. Current state-of-the-art architectures in NLP [14, 49] remain transformer-based, and have been scaled to web-scale datasets [3]. Many variants of the transformer have also been proposed to reduce the computational cost of self-attention when processing longer sequences [8, 9, 34, 59, 60, 70] and to improve parameter efficiency [37, 12]. Although self-attention has been employed extensively in computer vision, it has, in contrast, typically been incorporated as a layer at the end or in the later stages of the network [72, 5, 29, 74, 80], or used to augment residual blocks [27, 4, 7, 54] within a ResNet architecture [24].

Although previous works attempted to replace convolutions in vision architectures [46, 50, 52], it is only very recently that Dosovitskiy et al. [15] showed with their ViT architecture that pure-transformer networks, similar to those employed in NLP, can achieve state-of-the-art results for image classification too. The authors showed that such models are only effective at large scale, as transformers lack some of the inductive biases of convolutional networks (such as translational equivariance), and thus require datasets larger than the common ImageNet ILSVRC dataset [13] to train. ViT has inspired a large amount of follow-up work in the community, and we note that there are a number of concurrent approaches on extending it to other tasks in computer vision [68, 71, 81, 82] and improving its data-efficiency [61, 45]. In particular, [2, 43] have also proposed transformer-based models for video.

In this paper, we develop pure-transformer architectures for video classification. We propose several variants of our model, including those that are more efficient because they factorise the spatial and temporal dimensions of the input video. We also show how additional regularisation and pretrained models can be used to combat the fact that video datasets are not as large as the image counterparts that ViT was originally trained on. Furthermore, we outperform the state-of-the-art across five popular datasets.

3. Video Vision Transformers

We start by summarising the recently proposed Vision Transformer [15] in Sec. 3.1, and then discuss two approaches for extracting tokens from video in Sec. 3.2. Finally, we develop several transformer-based architectures for video classification in Sec. 3.3 and 3.4.
3.1. Overview of Vision Transformers (ViT)

Vision Transformer (ViT) [15] adapts the transformer architecture of [65] to process 2D images with minimal changes. In particular, ViT extracts N non-overlapping image patches, x_i ∈ R^{h×w}, performs a linear projection and then rasterises them into 1D tokens z_i ∈ R^d. The sequence of tokens input to the following transformer encoder is

    z = [z_cls, E x_1, E x_2, . . . , E x_N] + p,    (1)

where the projection by E is equivalent to a 2D convolution. As shown in Fig. 1, an optional learned classification token z_cls is prepended to this sequence, and its representation at the final layer of the encoder serves as the final representation used by the classification layer [14]. In addition, a learned positional embedding, p ∈ R^{N×d}, is added to the tokens to retain positional information, as the subsequent self-attention operations in the transformer are permutation invariant. The tokens are then passed through an encoder consisting of a sequence of L transformer layers. Each layer ℓ comprises Multi-Headed Self-Attention [65], layer normalisation (LN) [1] and MLP blocks as follows:

    y^ℓ = MSA(LN(z^ℓ)) + z^ℓ,          (2)
    z^{ℓ+1} = MLP(LN(y^ℓ)) + y^ℓ.      (3)

The MLP consists of two linear projections separated by a GELU non-linearity [25], and the token dimensionality, d, remains fixed throughout all layers. Finally, a linear classifier is used to classify the encoded input based on z^L_cls ∈ R^d, if it was prepended to the input, or a global average pooling of all the tokens, z^L, otherwise.

As the transformer [65], which forms the basis of ViT [15], is a flexible architecture that can operate on any sequence of input tokens z ∈ R^{N×d}, we describe strategies for tokenising videos next.
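To make Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch of a pre-norm encoder of this form. It is an illustrative assumption on our part (class names, layer sizes and the use of nn.MultiheadAttention are ours), not the released ViViT implementation.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One pre-norm transformer layer, following Eqs. (2)-(3)."""
    def __init__(self, d_model=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, d_model))

    def forward(self, z):
        h = self.norm1(z)
        y = self.attn(h, h, h, need_weights=False)[0] + z   # y = MSA(LN(z)) + z
        return self.mlp(self.norm2(y)) + y                  # z' = MLP(LN(y)) + y

class ViTEncoder(nn.Module):
    """Prepends a CLS token, adds a learned positional embedding p (Eq. 1),
    and applies L transformer layers."""
    def __init__(self, num_tokens, d_model=768, depth=12, num_heads=12):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, num_tokens + 1, d_model))
        self.layers = nn.ModuleList(
            [TransformerLayer(d_model, num_heads) for _ in range(depth)])

    def forward(self, tokens):               # tokens: (B, N, d), already projected by E
        z = torch.cat([self.cls.expand(tokens.shape[0], -1, -1), tokens], dim=1) + self.pos
        for layer in self.layers:
            z = layer(z)
        return z[:, 0]                       # encoded CLS token used by the classifier
```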
3.2. Embedding video clips

We consider two simple methods for mapping a video V ∈ R^{T×H×W×C} to a sequence of tokens z̃ ∈ R^{n_t×n_h×n_w×d}. We then add the positional embedding and reshape into R^{N×d} to obtain z, the input to the transformer.

Uniform frame sampling. As illustrated in Fig. 2, a straightforward method of tokenising the input video is to uniformly sample n_t frames from the input video clip, embed each 2D frame independently using the same method as ViT [15], and concatenate all these tokens together. Concretely, if n_h · n_w non-overlapping image patches are extracted from each frame, as in [15], then a total of n_t · n_h · n_w tokens will be forwarded through the transformer encoder. Intuitively, this process may be seen as simply constructing a large 2D image to be tokenised following ViT. We note that this is the input embedding method employed by the concurrent work of [2].

Figure 2: Uniform frame sampling: We simply sample n_t frames, and embed each 2D frame independently following ViT [15].

Tubelet embedding. An alternate method, as shown in Fig. 3, is to extract non-overlapping, spatio-temporal "tubes" from the input volume, and to linearly project these to R^d. This method is an extension of ViT's embedding to 3D, and corresponds to a 3D convolution. For a tubelet of dimension t × h × w, n_t = ⌊T/t⌋, n_h = ⌊H/h⌋ and n_w = ⌊W/w⌋ tokens are extracted from the temporal, height and width dimensions respectively. Smaller tubelet dimensions thus result in more tokens, which increases the computation. Intuitively, this method fuses spatio-temporal information during tokenisation, in contrast to "Uniform frame sampling", where temporal information from different frames is fused by the transformer.

Figure 3: Tubelet embedding. We extract and linearly embed non-overlapping tubelets that span the spatio-temporal input volume.
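Since tubelet embedding corresponds to a 3D convolution, it can be sketched in a few lines of PyTorch. The tubelet size 2×16×16 and the module name below are illustrative assumptions, chosen only to show the token-count arithmetic.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Map a video (B, C, T, H, W) to (B, n_t*n_h*n_w, d) tokens with a single
    3D convolution whose kernel and stride both equal the tubelet size."""
    def __init__(self, d_model=768, tubelet=(2, 16, 16), in_channels=3):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, d_model, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        x = self.proj(video)                 # (B, d, n_t, n_h, n_w)
        return x.flatten(2).transpose(1, 2)  # (B, n_t*n_h*n_w, d)

# A 32-frame 224x224 clip with tubelet 2x16x16 gives 16 * 14 * 14 = 3136 tokens.
tokens = TubeletEmbedding()(torch.randn(1, 3, 32, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 768])
```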
3.3. Transformer Models for Video

As illustrated in Fig. 1, we propose multiple transformer-based architectures. We begin with a straightforward extension of ViT [15] that models pairwise interactions between all spatio-temporal tokens, and then develop more efficient variants which factorise the spatial and temporal dimensions of the input video at various levels of the transformer architecture.

Model 1: Spatio-temporal attention. This model simply forwards all spatio-temporal tokens extracted from the video, z^0, through the transformer encoder. We note that this has also been explored concurrently by [2] in their "Joint Space-Time" model. In contrast to CNN architectures, where the receptive field grows linearly with the number of layers, each transformer layer models all pairwise interactions between all spatio-temporal tokens, and it thus models long-range interactions across the video from the first layer. However, as it models all pairwise interactions, Multi-Headed Self-Attention (MSA) [65] has quadratic complexity with respect to the number of tokens. This complexity is pertinent for video, as the number of tokens increases linearly with the number of input frames, and motivates the development of more efficient architectures next.
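The quadratic cost is easy to make concrete with a small (hypothetical) helper that only counts tokens; the tubelet size and input resolution below are assumptions matching the earlier sketch.

```python
# Token count for the unfactorised model: all n_t * n_h * n_w tubelet tokens attend
# to each other, so self-attention cost scales with the square of this number.
def num_tokens(T, H, W, tubelet=(2, 16, 16)):
    t, h, w = tubelet
    return (T // t) * (H // h) * (W // w)

n = num_tokens(32, 224, 224)   # 16 * 14 * 14 = 3136 tokens
print(n, n * n)                # 3136 tokens, ~9.8M token pairs per head and per layer
```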
Figure 4: Factorised encoder (Model 2). This model consists of two transformer encoders in series: the first models interactions between tokens extracted from the same temporal index to produce a latent representation per time-index. The second transformer models interactions between time steps. It thus corresponds to a "late fusion" of spatial and temporal information.

Figure 5: Factorised self-attention (Model 3). Within each transformer block, the multi-headed self-attention operation is factorised into two operations (indicated by striped boxes) that first only compute self-attention spatially, and then temporally.
Model 2: Factorised encoder. As shown in Fig. 4, this model consists of two separate transformer encoders. The first, spatial encoder only models interactions between tokens extracted from the same temporal index. A representation for each temporal index, h_i ∈ R^d, is obtained after L_s layers: this is the encoded classification token, z^{L_s}_cls, if it was prepended to the input (Eq. 1), or a global average pooling from the tokens output by the spatial encoder, z^{L_s}, otherwise. The frame-level representations, h_i, are concatenated into H ∈ R^{n_t×d}, and then forwarded through a temporal encoder consisting of L_t transformer layers to model interactions between tokens from different temporal indices. The output token of this encoder is then finally classified.

This architecture corresponds to a "late fusion" [31, 53, 69, 43] of temporal information, and the initial spatial encoder is identical to the one used for image classification. It is thus analogous to CNN architectures such as [21, 31, 69, 83] which first extract per-frame features, and then aggregate them into a final representation before classifying them. Although this model has more transformer layers than Model 1 (and thus more parameters), it requires fewer floating point operations (FLOPs), as the two separate transformer blocks have a complexity of O((n_h · n_w)^2 + n_t^2) compared to O((n_t · n_h · n_w)^2) for Model 1.

Model 3: Factorised self-attention. This model, in contrast, contains the same number of transformer layers as Model 1. However, instead of computing multi-headed self-attention across all pairs of tokens, z^ℓ, at layer ℓ, we factorise the operation to first only compute self-attention spatially (among all tokens extracted from the same temporal index), and then temporally (among all tokens extracted from the same spatial index), as shown in Fig. 5. Each self-attention block in the transformer thus models spatio-temporal interactions, but does so more efficiently than Model 1 by factorising the operation over two smaller sets of elements, thus achieving the same computational complexity as Model 2. We note that factorising attention over input dimensions has also been explored in [26, 75], and concurrently in the context of video by [2] in their "Divided Space-Time" model.

This operation can be performed efficiently by reshaping the tokens z from R^{1×n_t·n_h·n_w·d} to R^{n_t×n_h·n_w·d} (denoted by z_s) to compute spatial self-attention. Similarly, the input to temporal self-attention, z_t, is reshaped to R^{n_h·n_w×n_t·d}. Here we assume the leading dimension is the "batch dimension". Our factorised self-attention is defined as

    y_s^ℓ = MSA(LN(z_s^ℓ)) + z_s^ℓ,      (4)
    y_t^ℓ = MSA(LN(y_s^ℓ)) + y_s^ℓ,      (5)
    z^{ℓ+1} = MLP(LN(y_t^ℓ)) + y_t^ℓ.    (6)

We observed that the order of spatial-then-temporal self-attention or temporal-then-spatial self-attention does not make a difference, provided that the model parameters are initialised as described in Sec. 3.4. Note that the number of parameters, however, increases compared to Model 1, as there is an additional self-attention layer (cf. Eq. 7). We do not use a classification token in this model, to avoid ambiguities when reshaping the input tokens between spatial and temporal dimensions.
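The reshaping in Eqs. (4)-(6) can be written directly as one transformer block. The PyTorch sketch below is our own illustrative rendering (pre-norm residuals, nn.MultiheadAttention, hypothetical class name), not the authors' code; it assumes tokens are laid out with the temporal index major.

```python
import torch
import torch.nn as nn

class FactorisedSelfAttentionBlock(nn.Module):
    """One Model-3 style block (Eqs. 4-6): spatial attention, temporal attention, MLP."""
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        self.norm_s = nn.LayerNorm(d_model)
        self.attn_s = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.attn_t = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm_m = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, z, nt, ns):                    # z: (B, nt*ns, d)
        B, _, d = z.shape
        # Eq. (4): attend among the ns tokens that share a temporal index
        zs = z.reshape(B * nt, ns, d)
        h = self.norm_s(zs)
        ys = self.attn_s(h, h, h, need_weights=False)[0] + zs
        # Eq. (5): attend among the nt tokens that share a spatial index
        yt = ys.reshape(B, nt, ns, d).transpose(1, 2).reshape(B * ns, nt, d)
        h = self.norm_t(yt)
        yt = self.attn_t(h, h, h, need_weights=False)[0] + yt
        # Eq. (6): MLP with residual, then restore the (B, nt*ns, d) layout
        out = self.mlp(self.norm_m(yt)) + yt
        return out.reshape(B, ns, nt, d).transpose(1, 2).reshape(B, nt * ns, d)
```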
Model 4: Factorised dot-product attention. Finally, we develop a model which has the same computational complexity as Models 2 and 3, while retaining the same number of parameters as the unfactorised Model 1. The factorisation of the spatial and temporal dimensions is similar in spirit to Model 3, but we factorise the multi-head dot-product attention operation instead (Fig. 6). Concretely, we compute attention weights for each token separately over the spatial and temporal dimensions using different heads. First, we note that the attention operation for each head is defined as

    Attention(Q, K, V) = Softmax(QK^T / √d_k) V.    (7)

In self-attention, the queries Q = XW_q, keys K = XW_k and values V = XW_v are linear projections of the input X, with X, Q, K, V ∈ R^{N×d}. Note that in the unfactorised case (Model 1), the spatial and temporal dimensions are merged as N = n_t · n_h · n_w.

The main idea here is to modify the keys and values for each query to only attend over tokens from the same spatial and temporal index by constructing K_s, V_s ∈ R^{n_h·n_w×d} and K_t, V_t ∈ R^{n_t×d}, namely the keys and values corresponding to these dimensions. Then, for half of the attention heads, we attend over tokens from the spatial dimension by computing Y_s = Attention(Q, K_s, V_s), and for the rest we attend over the temporal dimension by computing Y_t = Attention(Q, K_t, V_t). Given that we are only changing the attention neighbourhood for each query, the attention operation has the same dimension as in the unfactorised case, namely Y_s, Y_t ∈ R^{N×d}. We then combine the outputs of the multiple heads by concatenating them and using a linear projection [65], Y = Concat(Y_s, Y_t) W_O.

Figure 6: Factorised dot-product attention (Model 4). For half of the heads, we compute dot-product attention over only the spatial axes, and for the other half, over only the temporal axis.
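One way to realise this head splitting is sketched below. It assumes PyTorch 2.x (for torch.nn.functional.scaled_dot_product_attention) and an even number of heads, and is an illustration of Eq. (7) with restricted key/value neighbourhoods rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorisedDotProductAttention(nn.Module):
    """Half of the heads attend within a temporal index (spatial attention),
    the other half within a spatial index (temporal attention); outputs are
    concatenated and projected: Y = Concat(Ys, Yt) W_O."""
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        assert num_heads % 2 == 0
        self.h, self.dk = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, z, nt, ns):                    # z: (B, nt*ns, d)
        B, N, d = z.shape
        q, k, v = self.qkv(z).chunk(3, dim=-1)

        def split_heads(x):                          # -> (B, heads, nt, ns, dk)
            return x.reshape(B, nt, ns, self.h, self.dk).permute(0, 3, 1, 2, 4)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        half = self.h // 2
        # spatial heads: attention over the ns axis (keys/values share a temporal index)
        ys = F.scaled_dot_product_attention(q[:, :half], k[:, :half], v[:, :half])
        # temporal heads: attention over the nt axis (keys/values share a spatial index)
        qt, kt, vt = (x[:, half:].transpose(2, 3) for x in (q, k, v))
        yt = F.scaled_dot_product_attention(qt, kt, vt).transpose(2, 3)
        y = torch.cat([ys, yt], dim=1)               # (B, heads, nt, ns, dk)
        y = y.permute(0, 2, 3, 1, 4).reshape(B, N, d)
        return self.out(y)
```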

3.4. Initialisation by leveraging pretrained models

ViT [15] has been shown to only be effective when trained on large-scale datasets, as transformers lack some of the inductive biases of convolutional networks [15]. However, even the largest video datasets, such as Kinetics [32], have several orders of magnitude fewer labelled examples than their image counterparts [13, 36, 55]. As a result, training large models from scratch to high accuracy is extremely challenging. To sidestep this issue, and to enable more efficient training, we initialise our video models from pretrained image models. However, this raises several practical questions, specifically on how to initialise parameters that are not present in, or incompatible with, image models. We now discuss several effective strategies to initialise these large-scale video classification models.

Positional embeddings. A positional embedding p is added to each input token (Eq. 1). However, our video models have n_t times more tokens than the pretrained image model. As a result, we initialise the positional embeddings by "repeating" them temporally from R^{n_w·n_h×d} to R^{n_t·n_h·n_w×d}. Therefore, at initialisation, all tokens with the same spatial index have the same embedding, which is then fine-tuned.

Embedding weights, E. When using the "tubelet embedding" tokenisation method (Sec. 3.2), the embedding filter E is a 3D tensor, compared to the 2D tensor in the pretrained model, E_image. A common approach for initialising 3D convolutional filters from 2D filters for video classification is to "inflate" them by replicating the filters along the temporal dimension and averaging them [6, 19] as

    E = (1/t) [E_image, . . . , E_image, . . . , E_image].    (8)

We consider an additional strategy, which we denote as "central frame initialisation", where E is initialised with zeroes along all temporal positions, except at the centre ⌊t/2⌋:

    E = [0, . . . , E_image, . . . , 0].    (9)

Therefore, the 3D convolutional filter effectively behaves like "Uniform frame sampling" (Sec. 3.2) at initialisation, while also enabling the model to learn to aggregate temporal information from multiple frames as training progresses.

Transformer weights for Model 3. The transformer block in Model 3 (Fig. 5) differs from the pretrained ViT model [15] in that it contains two multi-headed self-attention (MSA) modules. In this case, we initialise the spatial MSA module from the pretrained module, and initialise all weights of the temporal MSA with zeroes, such that Eq. 5 behaves as a residual connection [24] at initialisation.
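The three strategies above can be combined into a single helper. The sketch below uses placeholder tensor names (E_image, p_image) and assumes the tubelet projection is an nn.Conv3d and the temporal MSA an nn.MultiheadAttention; it illustrates Eq. (9), the temporal repetition of p, and the zero initialisation, and is not the released code.

```python
import torch
import torch.nn as nn

def init_from_image_vit(tubelet_conv: nn.Conv3d, E_image: torch.Tensor,
                        p_image: torch.Tensor, nt: int,
                        temporal_msa: nn.MultiheadAttention) -> torch.Tensor:
    """Illustrative initialisation from a pretrained image ViT (names are placeholders)."""
    t = tubelet_conv.kernel_size[0]
    with torch.no_grad():
        # Eq. (9): zeros everywhere except the centre temporal slice
        tubelet_conv.weight.zero_()
        tubelet_conv.weight[:, :, t // 2] = E_image     # E_image: (d, C, h, w)

        # positional embeddings: repeat the (n_h*n_w, d) image table n_t times
        p_video = p_image.repeat(nt, 1)                 # (n_t*n_h*n_w, d)

        # temporal MSA set to zero so Eq. (5) starts as a residual identity
        temporal_msa.in_proj_weight.zero_()
        temporal_msa.in_proj_bias.zero_()
        temporal_msa.out_proj.weight.zero_()
        temporal_msa.out_proj.bias.zero_()
    return p_video
```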
4. Empirical evaluation

We first present our experimental setup and implementation details in Sec. 4.1, before ablating various components of our model in Sec. 4.2. We then present state-of-the-art results on five datasets in Sec. 4.3.

4.1. Experimental Setup

Network architecture and training. Our backbone architecture follows that of ViT [15] and BERT [14]. We consider ViT-Base (ViT-B, L=12, N_H=12, d=3072), ViT-Large (ViT-L, L=24, N_H=16, d=4096) and ViT-Huge (ViT-H, L=32, N_H=16, d=5120), where L is the number of transformer layers, each with a self-attention block of N_H heads and hidden dimension d. We also apply the same naming scheme to our models (e.g., ViViT-B/16x2 denotes a ViT-Base backbone with a tubelet size of h×w×t = 16×16×2). In all experiments, the tubelet height and width are equal. Note that smaller tubelet sizes correspond to more tokens at the input, and thus more computation.

We train our models using synchronous SGD with momentum, a cosine learning rate schedule and TPU-v3 accelerators. We initialise our models from a ViT image model trained either on ImageNet-21K [13] (unless otherwise specified) or the larger JFT [55] dataset. Exact experimental hyperparameters are detailed in the appendix.

Datasets. We evaluate the performance of our proposed models on a diverse set of video classification datasets:

Kinetics [32] consists of 10-second videos sampled at 25fps from YouTube. We evaluate on both Kinetics 400 and 600, containing 400 and 600 classes respectively. As these are dynamic datasets (videos may be removed from YouTube), we note that our dataset sizes are approximately 267,000 and 446,000 videos respectively.

Epic Kitchens-100 consists of egocentric videos capturing daily kitchen activities, spanning 100 hours and 90,000 clips [11]. We report results following the standard "action recognition" protocol. Here, each video is labelled with a "verb" and a "noun", and we therefore predict both categories using a single network with two "heads". The top-scoring verb and noun pair predicted by the network forms an "action", and action accuracy is the primary metric.

Moments in Time [42] consists of 800,000 3-second YouTube clips that capture the gist of a dynamic scene involving animals, objects, people or natural phenomena.

Something-Something v2 (SSv2) [23] contains 220,000 videos, with durations ranging from 2 to 6 seconds. In contrast to the other datasets, the objects and backgrounds in the videos are consistent across different action classes, and this dataset thus places more emphasis on a model's ability to recognise fine-grained motion cues.

Inference. The input to our network is a video clip of 32 frames using a stride of 2, unless otherwise mentioned, similar to [18, 17]. Following common practice, at inference time we process multiple views of a longer video and average per-view logits to obtain the final result. Unless otherwise specified, we use a total of 4 views per video (as this is sufficient to "see" the entire video clip across the various datasets), and ablate these and other design choices next.

Table 1: Comparison of input encoding methods using ViViT-B and spatio-temporal attention on Kinetics. Further details in text.

                                  Top-1 accuracy
Uniform frame sampling            78.5
Tubelet embedding
  Random initialisation [22]      73.2
  Filter inflation [6]            77.6
  Central frame                   79.2

Table 2: Comparison of model architectures using ViViT-B as the backbone, and a tubelet size of 16 × 2. We report Top-1 accuracy on Kinetics 400 (K400) and action accuracy on Epic Kitchens (EK). Runtime is during inference on a TPU-v3.

                               K400   EK     FLOPs (×10^9)   Params (×10^6)   Runtime (ms)
Model 1: Spatio-temporal       80.0   43.1   455.2           88.9             58.9
Model 2: Fact. encoder         78.8   43.7   284.4           100.7            17.4
Model 3: Fact. self-attention  77.4   39.1   372.3           117.3            31.7
Model 4: Fact. dot product     76.3   39.5   277.1           88.9             22.9
Model 2: Ave. pool baseline    75.8   38.8   283.9           86.7             17.3

Table 3: The effect of varying the number of temporal transformers, L_t, in the Factorised encoder model (Model 2). We report the Top-1 accuracy on Kinetics 400. Note that L_t = 0 corresponds to the "average pooling baseline".

L_t     0      1      4      8      12
Top-1   75.8   78.6   78.8   78.8   78.9
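The multi-view protocol above amounts to averaging per-view logits. A minimal sketch (the function name and the list-of-clips interface are our own assumptions):

```python
import torch

@torch.no_grad()
def multi_view_inference(model, views):
    """'views' is a list of clips (e.g. 4 temporal x 3 spatial crops) from one video,
    each of shape (C, T, H, W); per-view logits are averaged for the final prediction."""
    logits = torch.stack([model(v.unsqueeze(0)) for v in views], dim=0)  # (V, 1, classes)
    return logits.mean(dim=0).squeeze(0)                                 # (classes,)
```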
4.2. Ablation study

Input encoding. We first consider the effect of different input encoding methods (Sec. 3.2) using our unfactorised model (Model 1) and ViViT-B on Kinetics 400. As we pass 32-frame inputs to the network, sampling 8 frames and extracting tubelets of length t = 4 correspond to the same number of tokens in both cases. Table 1 shows that tubelet embedding initialised using the "central frame" method (Eq. 9) performs well, outperforming the commonly-used "filter inflation" initialisation method [6, 19] by 1.6%, and "uniform frame sampling" by 0.7%. We therefore use this encoding method for all subsequent experiments.

Model variants. We compare our proposed model variants (Sec. 3.3) across the Kinetics 400 and Epic Kitchens datasets, both in terms of accuracy and efficiency, in Tab. 2. In all cases, we use the "Base" backbone and a tubelet size of 16 × 2. Model 2 ("Factorised encoder") has an additional hyperparameter, the number of temporal transformers, L_t. We set L_t = 4 for all experiments and show in Tab. 3 that the model is not sensitive to this choice.

The unfactorised model (Model 1) performs the best on Kinetics 400. However, it can also overfit on smaller datasets such as Epic Kitchens, where we find our "Factorised encoder" (Model 2) to perform the best. We also consider an additional baseline (last row), based on Model 2, where we do not use any temporal transformer, and simply average pool the frame-level representations from the spatial encoder before classifying. This average pooling baseline performs the worst, and has a larger accuracy drop on Epic Kitchens, suggesting that this dataset requires more detailed modelling of temporal relations.
As described in Sec. 3.3, all factorised variants of our model use significantly fewer FLOPs than the unfactorised Model 1, as the attention is computed separately over the spatial and temporal dimensions. Model 4 adds no additional parameters to the unfactorised Model 1, and uses the least compute. The temporal transformer encoder in Model 2 operates on only n_t tokens, which is why there is barely a change in compute and runtime over the average pooling baseline, even though it improves the accuracy substantially (3% on Kinetics and 4.9% on Epic Kitchens). Finally, Model 3 requires more compute and parameters than the other factorised models, as its additional self-attention block means that it performs another query-, key-, value- and output-projection in each transformer layer [65].

Model regularisation. Pure-transformer architectures such as ViT [15] are known to require large training datasets, and we observed overfitting on smaller datasets like Epic Kitchens and SSv2, even when using an ImageNet-pretrained model. In order to effectively train our models on such datasets, we employed several regularisation strategies that we ablate using our "Factorised encoder" model in Tab. 4. We note that these regularisers were originally proposed for training CNNs, and that [61] have recently explored them for training ViT for image classification.

Table 4: The effect of progressively adding regularisation (each row includes all methods above it) on Top-1 action accuracy on Epic Kitchens. We use a Factorised encoder model with tubelet size 16 × 2.

                                    Top-1 accuracy
Random crop, flip, colour jitter    38.4
+ Kinetics 400 initialisation       39.6
+ Stochastic depth [28]             40.2
+ Random augment [10]               41.1
+ Label smoothing [58]              43.1
+ Mixup [79]                        43.7

Each row of Tab. 4 includes all the methods from the rows above it, and we observe progressive improvements from adding each regulariser. Overall, we obtain a substantial overall improvement of 5.3% on Epic Kitchens. We also achieve a similar improvement of 5%, from 60.4% to 65.4%, on SSv2 by using all the regularisation in Tab. 4. Note that the Kinetics-pretrained models that we initialise from are from Tab. 2, and that all Epic Kitchens models in Tab. 2 were trained with all the regularisers in Tab. 4. For larger datasets like Kinetics and Moments in Time, we do not use these additional regularisers (we use only the first row of Tab. 4), as we obtain state-of-the-art results without them. The appendix contains hyperparameter values and additional details for all regularisers.

Varying the backbone. Figure 7 compares the ViViT-B and ViViT-L backbones for the unfactorised spatio-temporal model. We observe consistent improvements in accuracy as the backbone capacity increases. As expected, the compute also grows as a function of the backbone size.

Figure 7: The effect of the backbone architecture on (a) accuracy and (b) computation on Kinetics 400, for the spatio-temporal attention model (Model 1).

Varying the number of tokens. We first analyse the performance as a function of the number of tokens along the temporal dimension in Fig. 8. We observe that using smaller input tubelet sizes (and therefore more tokens) leads to consistent accuracy improvements across all of our model architectures. At the same time, computation in terms of FLOPs increases accordingly, and the unfactorised model (Model 1) is impacted the most.

Figure 8: The effect of varying the number of temporal tokens on (a) accuracy and (b) computation on Kinetics 400, for different variants of our model with a ViViT-B backbone.

We then vary the number of tokens fed into the model by increasing the spatial crop size from the default of 224 to 320 in Tab. 5. As expected, there is a consistent increase in both accuracy and computation. We note that when comparing to prior work we consistently obtain state-of-the-art results (Sec. 4.3) using a spatial resolution of 224, but we also highlight that further improvements can be obtained at higher spatial resolutions.

Table 5: The effect of spatial resolution on the performance of ViViT-L/16x2 and spatio-temporal attention on Kinetics 400.

Crop size   224    288    320
Accuracy    80.3   80.7   81.0
GFLOPs      1446   2919   3992
Runtime     58.9   147.6  238.8
Table 6: Comparisons to state-of-the-art across multiple datasets. For "views", x × y denotes x temporal crops and y spatial crops. "320" denotes models trained and tested with a spatial resolution of 320 instead of 224.

(a) Kinetics 400
Method                        Top 1   Top 5   Views
blVNet [16]                   73.5    91.2    –
STM [30]                      73.7    91.6    –
TEA [39]                      76.1    92.5    10 × 3
TSM-ResNeXt-101 [40]          76.3    –       –
I3D NL [72]                   77.7    93.3    10 × 3
CorrNet-101 [67]              79.2    –       10 × 3
ip-CSN-152 [63]               79.2    93.8    10 × 3
LGD-3D R101 [48]              79.4    94.4    –
SlowFast R101-NL [18]         79.8    93.9    10 × 3
X3D-XXL [17]                  80.4    94.6    10 × 3
TimeSformer-L [2]             80.7    94.7    1 × 3
ViViT-L/16x2                  80.6    94.7    4 × 3
ViViT-L/16x2 320              81.3    94.7    4 × 3
Methods with large-scale pretraining
ip-CSN-152 [63] (IG [41])     82.5    95.3    10 × 3
ViViT-L/16x2 (JFT)            82.8    95.5    4 × 3
ViViT-L/16x2 320 (JFT)        83.5    95.5    4 × 3
ViViT-H/16x2 (JFT)            84.8    95.8    4 × 3

(b) Kinetics 600
Method                        Top 1   Top 5   Views
AttentionNAS [73]             79.8    94.4    –
LGD-3D R101 [48]              81.5    95.6    –
SlowFast R101-NL [18]         81.8    95.1    10 × 3
X3D-XL [17]                   81.9    95.5    10 × 3
TimeSformer-HR [2]            82.4    96.0    –
ViViT-L/16x2                  82.5    95.6    4 × 3
ViViT-L/16x2 320              83.0    95.7    4 × 3
ViViT-L/16x2 (JFT)            84.3    96.2    4 × 3
ViViT-H/16x2 (JFT)            85.8    96.5    4 × 3

(c) Moments in Time
Method                        Top 1   Top 5
TSN [69]                      25.3    50.1
TRN [83]                      28.3    53.4
I3D [6]                       29.5    56.1
blVNet [16]                   31.4    59.3
AssembleNet-101 [51]          34.3    62.7
ViViT-L/16x2                  38.0    64.9

(d) Epic Kitchens 100 (Top 1 accuracy)
Method                        Action  Verb   Noun
TSN [69]                      33.2    60.2   46.0
TRN [83]                      35.3    65.9   45.4
TBN [33]                      36.7    66.0   47.2
TSM [40]                      38.3    67.9   49.0
SlowFast [18]                 38.5    65.6   50.0
ViViT-L/16x2 Fact. encoder    44.0    66.4   56.8

(e) Something-Something v2
Method                        Top 1   Top 5
TRN [83]                      48.8    77.6
SlowFast [17, 77]             61.7    –
TimeSformer-HR [2]            62.5    –
TSM [40]                      63.4    88.5
STM [30]                      64.2    89.8
TEA [39]                      65.1    –
blVNet [16]                   65.2    90.3
ViViT-L/16x2 Fact. encoder    65.4    89.8

Varying the number of input frames. In our experiments so far, we have kept the number of input frames fixed to 32. We now ablate this choice whilst effectively keeping the amount of computation to process a single clip fixed. This is done by increasing the tubelet length t in proportion to the number of input frames, such that the number of tokens processed by the network is constant.

Figure 9 shows that as we increase the number of frames input to the network, and enlarge the tubelet length, t, accordingly, our accuracy from processing a single clip increases, as the network incorporates longer temporal context. However, common practice on datasets such as Kinetics [18, 72, 39] is to average results over multiple, shorter "views" of the same video clip. Following this multi-view testing protocol, processing shorter clips is actually more advantageous. We also observe from Fig. 9 that the accuracy saturates once the number of views is sufficient to cover the whole video, and that this is consistent across models with different numbers of input frames. Consequently, we use 32 frames, sampled with a stride of 2, as our network input for all experiments, and use 4 temporal views for multi-view testing. Finally, we note that as our models can process longer videos without increasing the number of tokens, they offer an efficient method for processing longer videos than those considered by existing video classification datasets; we keep this as an avenue for future work.

Figure 9: The effect of varying the number of frames input to the network when keeping the number of tokens constant by adjusting the tubelet length t. We use ViViT-B, and spatio-temporal attention on Kinetics 400. A Kinetics video contains 250 frames (10 seconds sampled at 25 fps), and the accuracy for each model saturates once the number of equidistant temporal views is sufficient to "see" the whole video clip.

4.3. Comparison to state-of-the-art

Based on our ablation studies in the previous section, we compare to the current state-of-the-art using two of our model variants. We use the unfactorised spatio-temporal attention model (Model 1) for the larger datasets, Kinetics and Moments in Time. For the smaller datasets, Epic Kitchens and SSv2, we use our Factorised encoder model (Model 2).

Kinetics. Tables 6a and 6b show that our spatio-temporal attention models outperform the state-of-the-art on Kinetics 400 and 600 respectively. Following standard practice, we take 3 spatial crops (left, centre and right) [18, 17, 63, 72] for each temporal view, and notably, we require significantly fewer views than previous CNN-based methods.

We surpass the previous CNN-based state-of-the-art using ViViT-L/16x2 pretrained on ImageNet, and achieve comparable accuracy to [2], who concurrently proposed a pure-transformer architecture. We then obtain further improvements by increasing the spatial resolution from 224 to 320, as expected given the ablation in Tab. 5. Moreover, by initialising our backbones from models pretrained on the larger JFT dataset [55], we obtain further improvements. Although these models are not directly comparable to previous work, we do also outperform [63] who pretrained on the large-scale Instagram dataset [41]. Our best model uses a ViViT-H backbone pretrained on JFT and significantly advances the best reported results on Kinetics 400 and 600 to 84.8% and 85.8%, respectively.
Moments in Time. We surpass the state-of-the-art by a significant margin, as shown in Tab. 6c. We note that the videos in this dataset are diverse and contain significant label noise, making this task challenging and leading to lower accuracies than on other datasets.

Epic Kitchens 100. Table 6d shows that our Factorised encoder model outperforms previous methods by a significant margin. In addition, our model obtains substantial improvements for Top-1 accuracy of "noun" classes, and the only method which achieves higher "verb" accuracy used optical flow as an additional input modality [40, 47]. Furthermore, all variants of our model presented in Tab. 2 outperformed the existing state-of-the-art on action accuracy. We note that we use the same model to predict verbs and nouns using two separate "heads", and for simplicity, we do not use separate loss weights for each head.

Something-Something v2 (SSv2). Finally, Tab. 6e shows that we achieve state-of-the-art Top-1 accuracy with our Factorised encoder model (Model 2), albeit with a smaller margin compared to previous methods. Notably, our Factorised encoder model significantly outperforms the concurrent TimeSformer [2] method by 2.9%, which also proposes a pure-transformer model, but does not consider our Factorised encoder variant or our additional regularisation. SSv2 differs from the other datasets in that the backgrounds and objects are quite similar across different classes, meaning that recognising fine-grained motion patterns is necessary to distinguish classes from each other. Our results suggest that capturing these fine-grained motions is an area of improvement and future work for our model. We also note an inverse correlation between the relative performance of previous methods on SSv2 (Tab. 6e) and Kinetics (Tab. 6a), suggesting that these two datasets evaluate complementary characteristics of a model.

5. Conclusion and Future Work

We have presented four pure-transformer models for video classification, with different accuracy and efficiency profiles, achieving state-of-the-art results across five popular datasets. Furthermore, we have shown how to effectively regularise such high-capacity models for training on smaller datasets and thoroughly ablated our main design choices. Future work is to remove our dependence on image-pretrained models. Finally, going beyond video classification towards more complex tasks is a clear next step.

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In arXiv preprint arXiv:1607.06450, 2016.
[2] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In arXiv preprint arXiv:2102.05095, 2021.
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, et al. Language models are few-shot learners. In NeurIPS, 2020.
[4] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In CVPR Workshops, 2019.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[6] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, 2017.
[7] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A2-nets: Double attention networks. In NeurIPS, 2018.
[8] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. In arXiv preprint arXiv:1904.10509, 2019.
[9] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In arXiv preprint arXiv:2009.14794, 2020.
[10] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In arXiv preprint arXiv:1909.13719, 2019.
[11] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision. In arXiv preprint arXiv:2006.13256, 2020.
[12] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In ICLR, 2019.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[16] Quanfu Fan, Chun-Fu Chen, Hilde Kuehne, Marco Pistoia, and David Cox. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. In NeurIPS, 2019.
[17] Christoph Feichtenhofer. X3D: Expanding architectures for efficient video recognition. In CVPR, 2020.
[18] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, 2019.
[19] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In NeurIPS, 2016.
[20] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In CVPR, 2019.
[21] Rohit Girdhar and Deva Ramanan. Attentional pooling for action recognition. In NeurIPS, 2017.
[22] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[23] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[25] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). In arXiv preprint arXiv:1606.08415, 2016.
[26] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. In arXiv preprint arXiv:1912.12180, 2019.
[27] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[28] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[29] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
[30] Boyuan Jiang, Mengmeng Wang, Weihao Gan, Wei Wu, and Junjie Yan. STM: Spatiotemporal and motion encoding for action recognition. In ICCV, 2019.
[31] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[32] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. In arXiv preprint arXiv:1705.06950, 2017.
[33] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In ICCV, 2019.
[34] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020.
[35] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, volume 25, 2012.
[36] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
[37] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR, 2020.
[38] Ivan Laptev. On space-time interest points. IJCV, 64(2-3), 2005.
[39] Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. TEA: Temporal excitation and aggregation for action recognition. In CVPR, 2020.
[40] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In ICCV, 2019.
[41] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
[42] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. PAMI, 42(2):502–508, 2019.
[43] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. In arXiv preprint arXiv:2102.00719, 2021.
[44] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
[45] Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, and Jianfei Cai. Scalable visual transformers with hierarchical pooling. In arXiv preprint arXiv:2103.10619, 2021.
[46] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, 2018.
[47] Will Price and Dima Damen. An evaluation of action recognition models on epic-kitchens. In arXiv preprint arXiv:1908.00867, 2019.
[48] Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, and Tao Mei. Learning spatio-temporal representation with local and global diffusion. In CVPR, 2019.
[49] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
[50] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[51] Michael S Ryoo, AJ Piergiovanni, Mingxing Tan, and Anelia Angelova. Assemblenet: Searching for multi-stream neural connectivity in video architectures. In ICLR, 2020.
[52] Zhuoran Shen, Irwan Bello, Raviteja Vemulapalli, Xuhui Jia, and Ching-Hui Chen. Global self-attention networks for image recognition. In arXiv preprint arXiv:2010.03019, 2021.
[53] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.
[54] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. In CVPR, 2021.
[55] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[56] Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
[57] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[58] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[59] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. In arXiv preprint arXiv:2011.04006, 2020.
[60] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. In arXiv preprint arXiv:2009.06732, 2020.
[61] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In arXiv preprint arXiv:2012.12877, 2020.
[62] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[63] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In ICCV, 2019.
[64] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
[65] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[66] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1), 2013.
[67] Heng Wang, Du Tran, Lorenzo Torresani, and Matt Feiszli. Video modeling with correlation networks. In CVPR, 2020.
[68] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Max-deeplab: End-to-end panoptic segmentation with mask transformers. In arXiv preprint arXiv:2012.00759, 2020.
[69] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[70] Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. In arXiv preprint arXiv:2006.04768, 2020.
[71] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In arXiv preprint arXiv:2102.12122, 2021.
[72] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[73] Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S Ryoo, Anelia Angelova, Kris M Kitani, and Wei Hua. Attentionnas: Spatiotemporal attention cell search for video classification. In ECCV, 2020.
[74] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In arXiv preprint arXiv:2011.14503, 2020.
[75] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In ICLR, 2020.
[76] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In CVPR, 2019.
[77] Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, and Philipp Krahenbuhl. A multigrid method for efficiently training video models. In CVPR, 2020.
[78] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
[79] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.
[80] Li Zhang, Dan Xu, Anurag Arnab, and Philip HS Torr. Dynamic graph message passing networks. In CVPR, 2020.
[81] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, and Vladlen Koltun. Point transformer. In arXiv preprint arXiv:2012.09164, 2020.
[82] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In arXiv preprint arXiv:2012.15840, 2020.
[83] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, 2018.
Appendix

A. Additional experimental details

In this appendix, we provide additional experimental details. Section A.1 provides additional details about the regularisers we used, and Sec. A.2 details the training hyperparameters used for our experiments.

A.1. Further details about regularisers

In this section, we provide additional details and list the hyperparameters of the additional regularisers that we employed in Tab. 4. Hyperparameter values for all our experiments are listed in Tab. 7.

Stochastic depth. Stochastic depth regularisation was originally proposed for training very deep residual networks [28]. Intuitively, the outputs of a layer, ℓ, are "dropped out" with probability p_drop(ℓ) during training, by setting the output of the layer to be equal to its input. Following [28], we linearly increase the probability of dropping a layer according to its depth within the network,

    p_drop(ℓ) = (ℓ / L) · p_drop,    (10)

where ℓ is the index of the layer in the network, and L is the total number of layers.
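A hedged sketch of Eq. (10) as a wrapper module is given below. For simplicity it drops the whole layer for the entire batch during training, which is a simplification of the per-sample formulation in [28]; the class name is ours.

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Layer l of L is skipped with probability p_drop * l / L during training,
    in which case its output equals its input (Eq. 10)."""
    def __init__(self, layer: nn.Module, layer_idx: int, num_layers: int, p_drop: float):
        super().__init__()
        self.layer = layer
        self.p = p_drop * layer_idx / num_layers

    def forward(self, x):
        if self.training and torch.rand(()) < self.p:
            return x            # drop the whole layer: identity mapping
        return self.layer(x)
```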

Random augment. Random augment [10] randomly applies data augmentation transformations sequentially to an input example. We follow the public implementation (https://github.com/tensorflow/models/blob/master/official/vision/beta/ops/augment.py), but modify the data augmentation operations to be temporally consistent throughout the video (in other words, the same transformation is applied to each frame of the video). The authors define two hyperparameters for Random augment: "number of layers", the number of augmentation transformations to apply sequentially to a video, and "magnitude", the strength of the transformation that is shared across all augmentation operations. Our values for these parameters are shown in Tab. 7.

Label smoothing. Label smoothing was proposed by [58], originally to regularise the training of Inception-v3. Concretely, the label distribution used during training, ỹ, is a mixture of the one-hot ground-truth label, y, and a uniform distribution, u, to encourage the network to produce less confident predictions during training:

    ỹ = (1 − λ) y + λ u.    (11)

There is therefore one scalar hyperparameter, λ ∈ [0, 1].

Mixup. Mixup [79] constructs virtual training examples which are a convex combination of pairs of training examples and their labels. Concretely, given (x_i, y_i) and (x_j, y_j), where x_i denotes an input vector and y_i a one-hot input label, mixup constructs the virtual training example

    x̃ = α x_i + (1 − α) x_j,
    ỹ = α y_i + (1 − α) y_j.    (12)

Our choice of the hyperparameter α ∈ [0, 1] is detailed in Tab. 7.
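Both regularisers can be written in a few lines. The sketch below follows Eqs. (11) and (12); drawing the mixup coefficient from Beta(α, α) is an assumption borrowed from the original mixup paper [79], as the text above only states the form of the convex combination.

```python
import torch

def smooth_labels(y_onehot: torch.Tensor, lam: float) -> torch.Tensor:
    """Eq. (11): mix the one-hot target with a uniform distribution over classes."""
    num_classes = y_onehot.shape[-1]
    return (1.0 - lam) * y_onehot + lam / num_classes

def mixup(x: torch.Tensor, y_onehot: torch.Tensor, alpha: float):
    """Eq. (12): convex combination of a batch with a shuffled copy of itself.
    The mixing weight is sampled from Beta(alpha, alpha) (assumed, following [79])."""
    a = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.shape[0])
    return a * x + (1 - a) * x[perm], a * y_onehot + (1 - a) * y_onehot[perm]
```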
A.2. Training hyperparameters

Table 7 details the hyperparameters for all of our experiments. We use synchronous SGD with momentum, a cosine learning rate schedule with linear warmup, and a batch size of 64 for all experiments. As aforementioned, we only employed additional regularisation when training on the smaller Epic Kitchens and Something-Something v2 datasets.

Table 7: Training hyperparameters for experiments in the main paper. "–" indicates that the regularisation method was not used at all. Values which are constant across all columns are listed once. Datasets are denoted as follows: K400: Kinetics 400. K600: Kinetics 600. MiT: Moments in Time. EK: Epic Kitchens. SSv2: Something-Something v2.

                                          K400    K600    MiT     EK      SSv2
Optimisation
Optimiser                                 Synchronous SGD
Momentum                                  0.9
Batch size                                64
Learning rate schedule                    cosine with linear warmup
Linear warmup epochs                      2.5
Base learning rate                        0.1     0.1     0.25    0.5     0.5
Epochs                                    30      30      10      50      35
Data augmentation
Random crop probability                   1.0
Random flip probability                   0.5
Scale jitter probability                  1.0
Maximum scale                             1.33
Minimum scale                             0.9
Colour jitter probability                 0.8     0.8     0.8     –       –
Rand augment number of layers [10]        –       –       –       2       2
Rand augment magnitude [10]               –       –       –       15      20
Other regularisation
Stochastic droplayer rate, p_drop [28]    –       –       –       0.2     0.3
Label smoothing λ [58]                    –       –       –       0.2     0.3
Mixup α [79]                              –       –       –       0.1     0.3
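For convenience, the Epic Kitchens column of Table 7 can be transcribed into a configuration dictionary; the key names below are our own (hypothetical), only the values come from the table.

```python
# Hypothetical config transcribing the Epic Kitchens (EK) column of Table 7.
EK_TRAINING_CONFIG = {
    "optimiser": "synchronous SGD",
    "momentum": 0.9,
    "batch_size": 64,
    "lr_schedule": "cosine with linear warmup",
    "warmup_epochs": 2.5,
    "base_lr": 0.5,
    "epochs": 50,
    "random_crop_prob": 1.0,
    "random_flip_prob": 0.5,
    "scale_jitter_prob": 1.0,
    "max_scale": 1.33,
    "min_scale": 0.9,
    "colour_jitter_prob": None,      # not used for EK
    "randaugment_num_layers": 2,
    "randaugment_magnitude": 15,
    "stochastic_droplayer_rate": 0.2,
    "label_smoothing": 0.2,
    "mixup_alpha": 0.1,
}
```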
