03 - ViViT - A Video Vision Transformer
Anurag Arnab* Mostafa Dehghani* Georg Heigold Chen Sun Mario Lučić† Cordelia Schmid†
Google Research
{aarnab, dehghani, heigold, chensun, lucic, cordelias}@google.com
Figure 1: We propose a pure-transformer architecture for video classification, inspired by the recent success of such models for images [15].
To effectively process a large number of spatio-temporal tokens, we develop several model variants which factorise different components
of the transformer encoder over the spatial- and temporal-dimensions. As shown on the right, these factorisations correspond to different
attention patterns over space and time.
Architectures for video understanding have mirrored advances in image recognition. Early video research used hand-crafted features to encode appearance and motion information [38, 66]. The success of AlexNet on ImageNet [35, 13] initially led to the repurposing of 2D image convolutional networks (CNNs) for video as “two-stream” networks [31, 53, 44]. These models processed RGB frames and optical flow images independently before fusing them at the end. The availability of larger video classification datasets such as Kinetics [32] subsequently facilitated the training of spatio-temporal 3D CNNs [6, 19, 62], which have significantly more parameters and thus require larger training datasets. As 3D convolutional networks require significantly more computation than their image counterparts, many architectures factorise convolutions across spatial and temporal dimensions and/or use grouped convolutions [56, 63, 64, 78, 17]. We also leverage factorisation of the spatial and temporal dimensions of videos to increase efficiency, but in the context of transformer-based models.

Concurrently, in natural language processing (NLP), Vaswani et al. [65] achieved state-of-the-art results by replacing convolutions and recurrent networks with the transformer network that consists only of self-attention, layer normalisation and multilayer perceptron (MLP) operations. Current state-of-the-art architectures in NLP [14, 49] remain transformer-based, and have been scaled to web-scale datasets [3]. Many variants of the transformer have also been proposed to reduce the computational cost of self-attention when processing longer sequences [8, 9, 34, 59, 60, 70] and to improve parameter efficiency [37, 12]. Although self-attention has been employed extensively in computer vision, it has, in contrast, been typically incorporated as a layer at the end or in the later stages of the network [72, 5, 29, 74, 80] or to augment residual blocks.

Although previous works attempted to replace convolutions in vision architectures [46, 50, 52], it is only very recently that Dosovitskiy et al. [15] showed with their ViT architecture that pure-transformer networks, similar to those employed in NLP, can achieve state-of-the-art results for image classification too. The authors showed that such models are only effective at large scale, as transformers lack some of the inductive biases of convolutional networks (such as translational equivariance), and thus require datasets larger than the common ImageNet ILSVRC dataset [13] to train. ViT has inspired a large amount of follow-up work in the community, and we note that there are a number of concurrent approaches on extending it to other tasks in computer vision [68, 71, 81, 82] and improving its data-efficiency [61, 45]. In particular, [2, 43] have also proposed transformer-based models for video.

In this paper, we develop pure-transformer architectures for video classification. We propose several variants of our model, including those that are more efficient by factorising the spatial and temporal dimensions of the input video. We also show how additional regularisation and pretrained models can be used to combat the fact that video datasets are not as large as the image counterparts that ViT was originally trained on. Furthermore, we outperform the state-of-the-art across five popular datasets.

3. Video Vision Transformers

We start by summarising the recently proposed Vision Transformer [15] in Sec. 3.1, and then discuss two approaches for extracting tokens from video in Sec. 3.2. Finally, we develop several transformer-based architectures for video classification in Sec. 3.3 and 3.4.
3.1. Overview of Vision Transformers (ViT)
Figure 6: Factorised dot-product attention (Model 4). For half of the heads, we compute dot-product attention over only the spatial axes, and for the other half, over only the temporal axis.

The approach is similar to Model 3, but we factorise the multi-head dot-product attention operation instead (Fig. 6). Concretely, we compute attention weights for each token separately over the spatial- and temporal-dimensions using different heads. First, we note that the attention operation for each head is defined as

Attention(Q, K, V) = Softmax(QK^⊤ / √d_k) V.   (7)

In self-attention, the queries Q = XW_q, keys K = XW_k, and values V = XW_v are linear projections of the input X, with X, Q, K, V ∈ R^{N×d}. Note that in the unfactorised case (Model 1), the spatial and temporal dimensions are merged as N = n_t · n_h · n_w.

The main idea here is to modify the keys and values for each query to only attend over tokens from the same spatial- and temporal index by constructing K_s, V_s ∈ R^{n_h·n_w×d} and K_t, V_t ∈ R^{n_t×d}, namely the keys and values corresponding to these dimensions. Then, for half of the attention heads, we attend over tokens from the spatial dimension by computing Y_s = Attention(Q, K_s, V_s), and for the rest we attend over the temporal dimension by computing Y_t = Attention(Q, K_t, V_t). Given that we are only changing the attention neighbourhood for each query, the attention operation has the same dimension as in the unfactorised case, namely Y_s, Y_t ∈ R^{N×d}. We then combine the outputs of multiple heads by concatenating them and using a linear projection [65], Y = Concat(Y_s, Y_t) W_O.
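To make the factorisation concrete, below is a minimal NumPy sketch of the factorised dot-product attention described above. It assumes tokens are ordered frame-by-frame (time-major), omits the CLS token, and uses illustrative names (wq, wk, wv, wo, num_heads); it is a sketch under these assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention (Eq. 7); q, k, v: (..., tokens, d_k).
    dk = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(dk)
    return softmax(scores) @ v

def factorised_dot_product_attention(x, wq, wk, wv, wo, nt, nh, nw, num_heads):
    """Model 4 sketch: half the heads attend over space, half over time.

    x: (N, d) tokens with N = nt * nh * nw, ordered frame-by-frame.
    wq, wk, wv, wo: (d, d) projection matrices.
    """
    n, d = x.shape
    assert n == nt * nh * nw and num_heads % 2 == 0
    dh = d // num_heads
    # Project and split into heads: (num_heads, N, dh).
    q = (x @ wq).reshape(n, num_heads, dh).transpose(1, 0, 2)
    k = (x @ wk).reshape(n, num_heads, dh).transpose(1, 0, 2)
    v = (x @ wv).reshape(n, num_heads, dh).transpose(1, 0, 2)

    half = num_heads // 2
    # Spatial heads: each query attends over the nh*nw tokens of its own frame.
    qs = q[:half].reshape(half, nt, nh * nw, dh)
    ks = k[:half].reshape(half, nt, nh * nw, dh)
    vs = v[:half].reshape(half, nt, nh * nw, dh)
    ys = attention(qs, ks, vs).reshape(half, n, dh)

    # Temporal heads: each query attends over the nt tokens at its spatial index.
    qt = q[half:].reshape(half, nt, nh * nw, dh).transpose(0, 2, 1, 3)
    kt = k[half:].reshape(half, nt, nh * nw, dh).transpose(0, 2, 1, 3)
    vt = v[half:].reshape(half, nt, nh * nw, dh).transpose(0, 2, 1, 3)
    yt = attention(qt, kt, vt).transpose(0, 2, 1, 3).reshape(half, n, dh)

    # Y = Concat(Ys, Yt) WO: concatenate the heads and project.
    y = np.concatenate([ys, yt], axis=0)        # (num_heads, N, dh)
    y = y.transpose(1, 0, 2).reshape(n, d)
    return y @ wo
```

As in the text, no parameters are added relative to unfactorised attention; only the attention neighbourhood of each head changes.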
We initialise our video models from ViT models pretrained on images, which requires adapting several of the weights, as described next.

Positional embeddings  Our video models have n_t times more tokens than the pretrained image model. As a result, we initialise the positional embeddings by “repeating” them temporally from R^{n_w·n_h×d} to R^{n_t·n_h·n_w×d}. Therefore, at initialisation, all tokens with the same spatial index have the same embedding, which is then fine-tuned.
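As an illustration, a minimal sketch of this temporal “repeating”, assuming the pretrained table has shape (n_h·n_w, d) and the same time-major token ordering as above (the CLS token embedding, ignored here, would simply be copied over):

```python
import numpy as np

def repeat_positional_embeddings(pos_image, nt):
    """Tile pretrained image positional embeddings along time.

    pos_image: (nh * nw, d) embeddings from the pretrained image model.
    Returns:   (nt * nh * nw, d), so that at initialisation all tokens
               with the same spatial index share the same embedding.
    """
    return np.tile(pos_image, (nt, 1))
```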
Embedding weights, E  When using the “tubelet embedding” tokenisation method (Sec. 3.2), the embedding filter E is a 3D tensor, compared to the 2D tensor in the pretrained model, E_image. A common approach for initialising 3D convolutional filters from 2D filters for video classification is to “inflate” them by replicating the filters along the temporal dimension and averaging them [6, 19] as

E = (1/t) [E_image, . . . , E_image, . . . , E_image].   (8)

We consider an additional strategy, which we denote as “central frame initialisation”, where E is initialised with zeroes along all temporal positions, except at the centre ⌊t/2⌋,

E = [0, . . . , E_image, . . . , 0].   (9)

Therefore, the 3D convolutional filter effectively behaves like “Uniform frame sampling” (Sec. 3.2) at initialisation, while also enabling the model to learn to aggregate temporal information from multiple frames as training progresses.

Transformer weights for Model 3  The transformer block in Model 3 (Fig. 5) differs from the pretrained ViT model [15] in that it contains two multi-headed self-attention (MSA) modules. In this case, we initialise the spatial MSA module from the pretrained module, and initialise all weights of the temporal MSA with zeroes, such that Eq. 5 behaves as a residual connection [24] at initialisation.
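A sketch of the two tubelet-embedding initialisations (Eqs. 8 and 9), assuming, for illustration only, that the embedding filter is stored with layout (output channels, time, height, width, input channels); the layout and function names are not taken from the paper:

```python
import numpy as np

def inflate_filter(e_image, t):
    """Filter inflation (Eq. 8): replicate the 2D filter over t frames and average.

    e_image: (d, h, w, c) pretrained 2D embedding filter.
    Returns: (d, t, h, w, c) 3D tubelet-embedding filter.
    """
    return np.repeat(e_image[:, None], t, axis=1) / t

def central_frame_init(e_image, t):
    """Central frame initialisation (Eq. 9): zeros everywhere except the centre frame."""
    e = np.zeros((e_image.shape[0], t) + e_image.shape[1:], dtype=e_image.dtype)
    e[:, t // 2] = e_image
    return e
```

In both cases the filters remain fully learnable; the initialisation only controls how the model behaves at the start of training.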
As described in Sec. 3.3, all factorised variants of our model use significantly fewer FLOPs than the unfactorised Model 1, as the attention is computed separately over the spatial- and temporal-dimensions. Model 4 adds no additional parameters to the unfactorised Model 1, and uses the least compute. The temporal transformer encoder in Model 2 operates on only n_t tokens, which is why there is barely a change in compute and runtime over the average pooling baseline, even though it improves the accuracy substantially (3% on Kinetics and 4.9% on Epic Kitchens). Finally, Model 3 requires more compute and parameters than the other factorised models, as its additional self-attention block means that it performs another query-, key-, value- and output-projection in each transformer layer [65].

Model regularisation  Pure-transformer architectures such as ViT [15] are known to require large training datasets, and we observed overfitting on smaller datasets like Epic Kitchens and SSv2, even when using an ImageNet pretrained model. In order to effectively train our models on such datasets, we employed several regularisation strategies that we ablate using our “Factorised encoder” model. The appendix contains hyperparameter values and additional details.

Varying the number of tokens  We first analyse the performance as a function of the number of tokens along the temporal dimension in Fig. 8. We observe that using smaller input tubelet sizes (and therefore more tokens) leads to consistent accuracy improvements across all of our model architectures. At the same time, computation in terms of FLOPs increases accordingly, and the unfactorised model (Model 1) is impacted the most.

We then vary the number of tokens fed into the model by increasing the spatial crop-size from the default of 224 to 320 in Tab. 5. As expected, there is a consistent increase in both accuracy and computation. We note that when comparing to prior work we consistently obtain state-of-the-art results (Sec. 4.3) using a spatial resolution of 224, but we also highlight that further improvements can be obtained at higher spatial resolutions.

Varying the number of input frames  In our experiments so far, we have kept the number of input frames fixed at 32. We now ablate this choice whilst effectively keeping the amount of computation to process a single clip fixed. This is done by increasing the tubelet length t in proportion to the number of input frames, so that the total number of tokens remains unchanged.
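As a rough, back-of-the-envelope illustration of why the unfactorised model is affected most by token count, the sketch below computes the number of tokens and the approximate number of query-key pairs per attention layer. The defaults assumed here (32 frames, 224×224 crops, 16×16 patches with tubelet length 2, as suggested by the ViViT-L/16x2 naming) are for illustration only and ignore the CLS token, constant factors and the MLP.

```python
def token_counts(frames=32, height=224, width=224, t=2, h=16, w=16):
    # Number of tokens along each axis for a given tubelet size.
    nt, nh, nw = frames // t, height // h, width // w
    return nt, nh, nw

nt, nh, nw = token_counts()
n = nt * nh * nw                      # total spatio-temporal tokens
unfactorised = n ** 2                 # Model 1: every token attends to every token
factorised = n * (nh * nw) + n * nt   # factorised variants: attend over space, then time

print(f"tokens per clip: {n}; "
      f"unfactorised attention ~{unfactorised:.2e} query-key pairs; "
      f"factorised ~{factorised:.2e}")
```

Halving the tubelet length doubles n, which quadruples the unfactorised cost but only roughly doubles the factorised one, matching the trend described above.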
Table 6: Comparisons to state-of-the-art across multiple datasets. For “views”, x × y denotes x temporal crops and y spatial crops. “320”
denotes models trained and tested with a spatial resolution of 320 instead of 224.
(a) Kinetics 400
Method                        Top 1   Top 5   Views
blVNet [16]                   73.5    91.2    –
STM [30]                      73.7    91.6    –
TEA [39]                      76.1    92.5    10 × 3
TSM-ResNeXt-101 [40]          76.3    –       –
I3D NL [72]                   77.7    93.3    10 × 3
CorrNet-101 [67]              79.2    –       10 × 3
ip-CSN-152 [63]               79.2    93.8    10 × 3
LGD-3D R101 [48]              79.4    94.4    –
SlowFast R101-NL [18]         79.8    93.9    10 × 3
X3D-XXL [17]                  80.4    94.6    10 × 3
TimeSformer-L [2]             80.7    94.7    1 × 3
ViViT-L/16x2                  80.6    94.7    4 × 3
ViViT-L/16x2 320              81.3    94.7    4 × 3
Methods with large-scale pretraining
ip-CSN-152 [63] (IG [41])     82.5    95.3    10 × 3
ViViT-L/16x2 (JFT)            82.8    95.5    4 × 3
ViViT-L/16x2 320 (JFT)        83.5    95.5    4 × 3
ViViT-H/16x2 (JFT)            84.8    95.8    4 × 3

(b) Kinetics 600
Method                        Top 1   Top 5   Views
AttentionNAS [73]             79.8    94.4    –
LGD-3D R101 [48]              81.5    95.6    –
SlowFast R101-NL [18]         81.8    95.1    10 × 3
X3D-XL [17]                   81.9    95.5    10 × 3
TimeSformer-HR [2]            82.4    96.0    –
ViViT-L/16x2                  82.5    95.6    4 × 3
ViViT-L/16x2 320              83.0    95.7    4 × 3
ViViT-L/16x2 (JFT)            84.3    96.2    4 × 3
ViViT-H/16x2 (JFT)            85.8    96.5    4 × 3

(c) Moments in Time
Method                        Top 1   Top 5
TSN [69]                      25.3    50.1
TRN [83]                      28.3    53.4
I3D [6]                       29.5    56.1
blVNet [16]                   31.4    59.3
AssembleNet-101 [51]          34.3    62.7
ViViT-L/16x2                  38.0    64.9

(d) Epic Kitchens 100 (Top 1 accuracy)
Method                        Action  Verb    Noun
TSN [69]                      33.2    60.2    46.0
TRN [83]                      35.3    65.9    45.4
TBN [33]                      36.7    66.0    47.2
TSM [40]                      38.3    67.9    49.0
SlowFast [18]                 38.5    65.6    50.0
ViViT-L/16x2 Fact. encoder    44.0    66.4    56.8

(e) Something-Something v2
Method                        Top 1   Top 5
TRN [83]                      48.8    77.6
SlowFast [17, 77]             61.7    –
TimeSformer-HR [2]            62.5    –
TSM [40]                      63.4    88.5
STM [30]                      64.2    89.8
TEA [39]                      65.1    –
blVNet [16]                   65.2    90.3
ViViT-L/16x2 Fact. encoder    65.4    89.8
As these models can process longer videos without increasing the number of tokens, they offer an efficient method for processing longer videos than those considered by existing video classification datasets.

The stochastic droplayer rate grows linearly with the depth of a layer,

p_drop(ℓ) = (ℓ / L) · p_drop,   (10)

where ℓ is the index of the layer in the network, and L is the total number of layers.
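A small sketch of the schedule in Eq. 10, indexing layers from 1 and assuming, purely for the example, a 24-layer encoder with the SSv2 setting p_drop = 0.3 from Tab. 7:

```python
def droplayer_rates(p_drop, num_layers):
    """Per-layer drop probabilities from Eq. 10: p_drop(l) = (l / L) * p_drop."""
    return [layer / num_layers * p_drop for layer in range(1, num_layers + 1)]

# First four layers for p_drop = 0.3 and L = 24 (assumed depth):
print(droplayer_rates(0.3, 24)[:4])   # [0.0125, 0.025, 0.0375, 0.05]
```

Early layers are thus dropped rarely, while the final layer is dropped with the full rate p_drop.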
Table 7: Training hyperparameters for experiments in the main paper. “–” indicates that the regularisation method was not used at all. Values
which are constant across all columns are listed once. Datasets are denoted as follows: K400: Kinetics 400. K600: Kinetics 600. MiT:
Moments in Time. EK: Epic Kitchens. SSv2: Something-Something v2.
K400 K600 MiT EK SSv2
Optimisation
Optimiser Synchronous SGD
Momentum 0.9
Batch size 64
Learning rate schedule cosine with linear warmup
Linear warmup epochs 2.5
Base learning rate 0.1 0.1 0.25 0.5 0.5
Epochs 30 30 10 50 35
Data augmentation
Random crop probability 1.0
Random flip probability 0.5
Scale jitter probability 1.0
Maximum scale 1.33
Minimum scale 0.9
Colour jitter probability 0.8 0.8 0.8 – –
Rand augment number of layers [10] – – – 2 2
Rand augment magnitude [10] – – – 15 20
Other regularisation
Stochastic droplayer rate, pdrop [28] – – – 0.2 0.3
Label smoothing λ [58] – – – 0.2 0.3
Mixup α [79] – – – 0.1 0.3
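For concreteness, one way to realise the “cosine with linear warmup” learning-rate schedule in Tab. 7, using the K400 column (base learning rate 0.1, 2.5 warmup epochs, 30 epochs). The exact implementation details (e.g. step- vs. epoch-based warmup) are not specified in the table, so this is only a sketch under those assumptions:

```python
import math

def learning_rate(step, steps_per_epoch, base_lr=0.1,
                  warmup_epochs=2.5, total_epochs=30):
    """Cosine decay with linear warmup, matching the K400 column of Tab. 7."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        # Linear warmup from 0 to the base learning rate.
        return base_lr * step / warmup_steps
    # Cosine decay from the base learning rate to 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```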