V-JEPA
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and
introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without
the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision.
The models are trained on 2 million videos collected from public datasets and are evaluated on downstream
image and video tasks. Our results show that learning by predicting video features leads to versatile visual
representations that perform well on both motion and appearance-based tasks, without adaptation of the
model’s parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos,
obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
Figure 1 Frozen Evaluation. V-JEPA (ViT-L/16 and ViT-H/16) compared with pixel-prediction baselines (OmniMAE ViT-H/16, VideoMAE ViT-H/16, Hiera-H, VideoMAEv2 ViT-g/14) on Something-Something-v2 and Kinetics-400, alongside the SOTA fine-tuned task-specific models on each benchmark (MVD on SSv2, UniFormer on K400).

1 Introduction

Humans possess the remarkable ability to map low-level signals originating from the retina into a semantic spatio-temporal understanding of the world; synthesizing notions such as objects and global motion (Spelke et al., 1995). A long-standing goal of the machine learning community is to identify the principles or objectives that may guide such learning in humans.
To that end, we pretrain a family of V-JEPA models on a dataset of 2 million videos collected from publicly available datasets by combining a masked modeling prediction task with a joint-embedding predictive architecture (see Figure 2). We measure performance on several downstream image and video tasks, using both frozen evaluation and end-to-end fine-tuning. Our findings suggest that feature prediction can indeed serve as an effective stand-alone objective for unsupervised learning from video, while using significantly shorter training schedules than pixel prediction methods. Specifically:

• Feature prediction leads to versatile visual representations that perform well across downstream image and video tasks without adaptation of the model’s weights; i.e., using a frozen backbone. V-JEPA achieves the best performance among methods we consider (+6% accuracy) on the Something-Something-v2 task, which requires fine-grained temporal understanding. V-JEPA is also competitive on tasks like Kinetics-400, where appearance-based features are sufficient and hence state-of-the-art image models such as DINOv2 excel (Figure 1 and Table 6).

• Models trained with feature prediction are superior to pixel prediction approaches under a frozen evaluation protocol (attentive probing) and are competitive with pixel prediction under full fine-tuning, while using significantly shorter training schedules (Tables 5 and 6).

• Models trained with feature prediction are more label-efficient than pixel prediction approaches. Decreasing the available number of labeled examples results in an increase in the performance gap between V-JEPA and pixel-reconstruction models (Table 7).

2 Related Works

Slow Features. One way to encourage temporally adjacent representations to be predictive of each other is to ensure that they vary slowly over time. Early works targeting predictive features encouraged representations of individual video frames to be locally temporally invariant, while preventing representation collapse by using spectral methods, as in SFA (Wiskott and Sejnowski, 2002), SSA (Kayser et al., 2001), and Simulated Fixations (Zou et al., 2012). More recently, Goroshin et al. (2015); Wang et al. (2010) train a siamese convolutional network to map the representations of two subsequent frames to the same point, while encouraging distant frames to have diverse representations via a pairwise margin loss and a triplet loss, respectively. Other works (Oord et al., 2018; Surís et al., 2021; Feichtenhofer et al., 2021) implement temporal invariance using noise-contrastive estimation (Gutmann and Hyvärinen, 2012). Our exploration in this paper goes beyond temporal invariance and explores feature prediction using masked modeling.

Predictive Features. Going beyond local invariance, a family of works trains a predictor network to map the representation of a frame or clip at one time-step to a distinct representation at another time-step. Srivastava et al. (2015); Vondrick et al. (2016); Wang et al. (2023b) train such a video feature predictor network on top of a frozen pretrained image or video encoder. Unfreezing the target feature extractor, several methods train the video encoder and the predictor network simultaneously, while preventing collapse by using a supervised action forecasting loss (Girdhar and Grauman, 2021), or by using the representations of distant clips as negative samples in a contrastive loss (Han et al., 2019, 2020; Tan et al., 2023), often focusing on small convolutional encoders (Han et al., 2019, 2020). The idea of learning a representation by predicting missing information in feature space is also core to the joint-embedding predictive architecture (JEPA) (LeCun, 2022), which combines a siamese encoder with a predictor network. JEPAs have been successfully instantiated in several modalities, such as with audio data (Baevski et al., 2022b) and image data (Zhou et al., 2021; Oquab et al., 2023; Assran et al., 2023). In this work, we extend this paradigm to video data by leveraging recent advances in self-supervised learning.

Advances in Self-Supervised Learning. The use of vision transformers (Dosovitskiy et al., 2020; Li et al., 2022) has become standard practice in self-supervised learning with joint-embedding architectures (Chen et al., 2021; Caron et al., 2021; Oquab et al., 2023; Zhou et al., 2021; Assran et al., 2022), and unlocked masked image modeling in pixel space by parameterizing the pixel decoder as a transformer with learnable mask tokens (Dosovitskiy et al., 2020; Xie et al., 2021; He et al., 2021; Bao et al., 2021), demonstrating a step-change in the representation quality of autoencoding methods (Vincent et al., 2010). This line of generative methods was subsequently extended to video data using spatio-temporal masking (Tong et al., 2022; Feichtenhofer et al., 2022; Wang et al., 2023a; Kalluri et al., 2023; Gupta et al., 2023). It was also recently shown that the representations of masked image autoencoders could be significantly improved by using learnable pooling mechanisms based on cross-attention (Chen et al., 2022). Finally, through careful selection of design choices, the non-contrastive collapse prevention strategy in BYOL (Grill et al., 2020) was recently made to work with image feature prediction methods (Baevski et al., 2022b; Assran et al., 2023), which demonstrated the ability to learn representations that can be leveraged for various downstream tasks without relying on invariance to hand-crafted image transformations.
Feature Prediction versus Pixel Reconstruction. Approaches that predict in pixel space must dedicate significant model capacity and compute to capture all the low-level detail in the visual input. By contrast, approaches that predict in latent space have the flexibility to eliminate irrelevant or unpredictable pixel-level details from the target representation (Vondrick et al., 2016). Predicting in representation space has been shown to lead to versatile representations that perform well across many downstream tasks through linear probing or low-shot adaptation (Assran et al., 2023; Oquab et al., 2023; Assran et al., 2022), while demonstrating an efficiency gain during pretraining compared to pixel-level reconstruction (Assran et al., 2023; Baevski et al., 2022b,a). The works of Baevski et al. (2022a,b) additionally show that predicting in representation space results in competitive end-to-end fine-tuning performance in the image, audio and text domains. In this work, we extend these findings to the video modality.

3 Methodology: Video-JEPA

Figure 2 Joint-Embedding Predictive Architecture: an x-encoder and a y-encoder produce representations of the two parts of the video, and a predictor, conditioned on z, outputs a prediction ŝy that is compared to the target representation sy through a loss D(ŝy, sy).

… computed from another part of the video, x. The predictor network Pϕ(·), which maps the representation of x to the representation of y, is trained simultaneously with the encoder, and is provided specification of the spatio-temporal positions of y through the conditioning variable z ← ∆y. Naively implementing the objective using the regression

minimizeθ,ϕ ∥Pϕ(Eθ(x), ∆y) − Eθ(y)∥1,

would admit a trivial solution, where the encoder outputs a constant representation, regardless of its input. In practice, we use the following modified objective to prevent representation collapse,

minimizeθ,ϕ ∥Pϕ(Eθ(x), ∆y) − sg(Ēθ(y))∥1,   (1)

where sg(·) denotes a stop-gradient operation, which does not backpropagate through its argument, and Ēθ(·) is an exponential moving average of the network Eθ(·). The use of an exponential-moving-average feature extractor along with a stop-gradient and a predictor has been used as a collapse prevention strategy for image pretraining (Grill et al., 2020), and studied empirically (Xie et al., 2021) and theoretically (Tian et al., 2021). In fact, the objective in equation (1) is similar to the loss of Assran et al. (2023) used for image pretraining, but we modify it to use an ℓ1 regression, which we found to be more stable.
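To make the collapse-prevention mechanism concrete, the following is a minimal PyTorch-style sketch of one training step under equation (1). The module names (x_encoder, predictor, y_encoder), the predictor's target_positions argument, and the index tensors are illustrative placeholders, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def vjepa_step(x_encoder, predictor, y_encoder, video_tokens, ctx_idx, tgt_idx):
    """One V-JEPA training step (sketch). `video_tokens` holds the full token
    sequence; `ctx_idx`/`tgt_idx` index the unmasked (x) and masked (y) tokens."""
    # Encode only the visible (unmasked) tokens.
    z_x = x_encoder(video_tokens[:, ctx_idx])

    # Predict the representations of the masked tokens, conditioned on their positions.
    pred = predictor(z_x, target_positions=tgt_idx)

    # Targets come from the exponential-moving-average y-encoder; the
    # stop-gradient (no_grad) is what prevents representation collapse.
    with torch.no_grad():
        targets = y_encoder(video_tokens)[:, tgt_idx]

    loss = F.l1_loss(pred, targets)   # ℓ1 regression in feature space
    loss.backward()                   # gradients reach the x-encoder and predictor only
    return loss
```

After each optimizer step, the y-encoder weights would then be updated as an exponential moving average of the x-encoder weights rather than by gradient descent (see Appendix B).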
Figure 3 V-JEPA. Training operates on a video clip of T frames with spatial resolution H × W , flattened into a sequence
of L tokens. (Left to right): We first obtain the input of the x-encoder by dropping tokens from the video clip. The
x-encoder then processes the masked video sequence, and outputs an embedding vector for each input token. Next, the
outputs of the x-encoder are concatenated with a set of learnable mask tokens containing positional embeddings of the masked
spatio-temporal patches. The predictor network processes the combined token sequence, and outputs an embedding vector for
each mask token. The outputs of the predictor are then regressed to the prediction targets using an L1 loss. The prediction
targets correspond to the output of the y-encoder.
3.2 Prediction Task: Predicting y from x

The feature prediction task is based on a masked modeling formulation (He et al., 2021; Tong et al., 2022); i.e., regions x and y from the video are sampled using masking. To sample y from a video, we sample several (possibly overlapping) spatially continuous blocks with various aspect ratios and repeat the spatial blocks across the entire temporal dimension of the video; x is taken to be the complement. Masking a large continuous block that covers the full temporal dimension limits information leakage due to the spatial and temporal redundancy of videos, and results in a harder prediction task (Tong et al., 2022).

We leverage two types of masks: short-range masks, where we take the union of 8 randomly sampled target blocks covering 15% of each frame, and long-range masks, where we take the union of 2 randomly sampled target blocks covering 70% of each frame. In both cases, the aspect ratio for all sampled blocks is randomly chosen in the range (0.75, 1.5). Given that both short-range and long-range masks are produced by sampling many blocks and taking their union, the result is an average masking ratio of ∼90%. We refer to our masking strategy as multi-block, and compare it to other possible masking strategies in Section 4.

3.3 Network Parameterization

We use a Vision Transformer (ViT) (Dosovitskiy et al., 2020; Arnab et al., 2021) as our video backbone. To process a video with a transformer network, we split the video clip into a 3D grid of L spatio-temporal patches, where a patch consists of a 16 × 16 pixel block spanning 2 consecutive frames; we refer to these spatio-temporal patches as tokens. This sequence of tokens is then directly processed by the stack of transformer blocks. Since the inputs x and y correspond to masked regions of a video, we apply the video masks by simply dropping a subset of the tokens. We apply masking at the input of the x-encoder, and at the output of the y-encoder to construct contextualized targets (Baevski et al., 2022b). The encoder is parameterized using standard ViT networks, while the predictor is a narrow transformer implemented using 12 blocks with an embedding dimension of 384. Taking inspiration from masked autoencoders (He et al., 2021), our predictor takes as input the sequence of embeddings produced by the x-encoder as well as a sequence of learnable mask tokens with positional embeddings indicating the spatio-temporal positions of the y tokens. The output of the predictor is an embedding vector for each mask token; see Figure 3 and refer to Appendix B for more details.

3.4 Pretraining Data and Evaluation Setup

Pretraining. We combine several public datasets to construct an unsupervised video pretraining dataset, which we refer to as VideoMix2M. Specifically, we combine the videos from HowTo100M (HT) (Miech et al., 2019), Kinetics-400/600/700 (K710) (Kay et al., 2017), and Something-Something-v2 (SSv2) (Goyal et al., 2017), and remove any overlap with the validation sets of Kinetics-400/600/700 and Something-Something-v2, resulting in approximately 2 million videos. We train a ViT-L/16, a ViT-H/16, and a ViT-H/16384 transformer model on VideoMix2M. We use a batch size of 3072 for the ViT-L/16 and ViT-H/16 models, and a batch size of 2400 for the ViT-H/16384 model. Each model takes as input a video clip of 16 frames sampled with a frameskip of 4, corresponding to roughly 3 second clips on average. The ViT-L/16 and ViT-H/16 process the video at a spatial resolution of 224, while the ViT-H/16384 uses an input resolution of 384; cf. Appendix C.
Table 1 Pixels vs. Featurized Targets. We ablate the effect of computing the prediction loss in feature space vs pixel space. All
models are trained on VideoMix2M for 90K iterations with a batch size of 3072 using the multi-block prediction task. We
examine downstream performance using a frozen backbone with attentive probing, and report top-1 accuracy using a single
center view. We also examine end-to-end fine-tuning performance of the models on K400. Predicting in feature space provides a consistent improvement over pixel space prediction.
Table 2 Pretraining Data Distribution. We pretrain all models for 90K iterations using a batch size of 3072, and evaluate
downstream performance of the frozen backbones with an attentive probe using a single center view. Average performance
across tasks increases with the pretraining dataset size.
Evaluations. Pretrained models are evaluated on downstream video and image tasks. On video tasks, we use a subset of the VideoGLUE benchmark (Yuan et al., 2023) to test for various capabilities; specifically, we investigate action recognition on Kinetics-400 (K400) (Kay et al., 2017), motion classification on Something-Something-v2 (SSv2) (Goyal et al., 2017), and action localization on AVA (Gu et al., 2018). Action classification on Kinetics evaluates the appearance-based understanding of the model, as many action classes in the dataset can be inferred from the presence of specific objects in the video (Sevilla-Lara et al., 2021). Motion classification on Something-Something-v2 evaluates the temporal understanding of the model, as action classes in the dataset are decoupled from the appearance/presence of specific objects in the video (Goyal et al., 2017). Finally, action localization on AVA evaluates the ability of the model to understand and localize motions in the video. We follow standard practice and report accuracy on K400 and SSv2 by sampling several spatial and temporal views. For static image tasks, we explore object recognition on ImageNet (Russakovsky et al., 2015), scene classification on Places205 (Zhou et al., 2014), and fine-grained recognition on iNaturalist 2021 (Van Horn et al., 2018).

4 What Matters for Learning Representations from Video?

In this section we isolate the contributions of several design choices, including: a) the use of a feature prediction versus pixel prediction objective, b) the construction of the pretraining data distribution, c) the feature pooling strategy for leveraging the model’s representations in downstream tasks, and d) the masking strategy, towards identifying: what to predict from what?

4.1 Predicting Representations versus Pixels

We first ablate the effect of computing the prediction loss in representation space. We train a pair of ViT-L/16 models using either a V-JEPA feature prediction loss, or a mean-squared error loss with the normalized pixel values, as in masked autoencoders (He et al., 2021), and perform a sweep over the learning rate and weight decay schedules for both approaches. All models are pretrained on VideoMix2M for 90K iterations with a batch size of 3072 using multi-block masking. We examine performance on Kinetics-400 (K400), Something-Something-v2 (SSv2), and ImageNet-1K (IN1K), using a frozen backbone with an attentive probe, and report top-1 accuracy using a single center view. We also examine end-to-end fine-tuning performance of the models on Kinetics-400. Results of this comparison are reported in Table 1 and indicate that predicting in feature space provides a consistent performance improvement over pixel space prediction in both frozen evaluation of the video backbone, as well as end-to-end fine-tuning.

4.2 Pretraining Data Distribution

Next we study the impact of the pretraining data distribution in Table 2.
Table 3 Average Pooling vs. Adaptive Pooling. We pool the feature map output by the frozen V-JEPA encoder using an attentive probe, which is then fed into a linear classifier for downstream supervised tasks (K400 and SSv2). We evaluate two pooling strategies: 1) average pooling (Avg.), and 2) attentive pooling (Att.). Results are reported using a single center view. Using adaptive pooling with a cross-attention layer leads to improvements of +17.3 points on K400 and +16.1 points on SSv2.

Frozen Evaluation
                      K400 (16×1×1)     SSv2 (16×1×1)
Method   Arch.        Avg.    Att.      Avg.    Att.
V-JEPA   ViT-L/16     56.7    73.7      50.1    66.2

Table 4 Ablating Prediction Task. Models are ViT-L/16 networks pretrained on K710 and SSv2 and evaluated with an attentive probe using a single center view. The region x is sampled by masking spatio-temporal regions in the video; y is the mask complement. 1) random-tube[r]: x is obtained by masking a fraction r of tubes (spatial patches extended across the entire temporal duration) from the video, 2) causal multi-block[p]: x is restricted to the first p frames of the 16-frame video, which are then masked with a random set of spatio-temporal blocks, 3) multi-block: x is obtained by masking a random set of spatio-temporal blocks from the entire video. Best performance is obtained by using multi-block masking.

Frozen Evaluation
                           K400         SSv2         IN1K
Masking                    (16×1×1)     (16×1×1)
random-tube[0.9]           51.5         46.4         55.6
causal multi-block[6]      61.3         49.8         66.9
causal multi-block[12]     71.9         63.6         72.2
multi-block                72.9         67.4         72.8

Leveraging large-scale datasets has been critical for enabling the surge of advancements in other modalities, such as text and images (Kaplan et al., 2020; Cherti et al., 2023). We investigate whether a similar trend holds for video data. To control for the possible confounding variable of compute budget, we pretrain all models in Table 2 for 90K iterations using a batch size of 3072. We report downstream results on K400, SSv2, and IN1K using a frozen backbone with an attentive probe, and report top-1 accuracy using a single center view.

Table 2 shows that average performance across tasks monotonically increases as we increase the size of the pretraining dataset, but the best task-specific performance is obtained by independently selecting the pretraining data for each specific downstream task. For instance, the L/16 obtains its best SSv2 performance when pretrained on K710+SSv2, its best K400 performance when pretrained only on K710, and its best IN1K performance when pretrained only on K710+HT. The best average performance across all tasks is achieved by pretraining on VideoMix2M, which combines all the data sources. Similarly, the H/16 pretrained on K710+SSv2 achieves a greater K400 score than the H/16 pretrained on VideoMix2M; however, the top-performing H/16 on average is pretrained on VideoMix2M.

4.3 Evaluation: Attentive Probing

Next we explore the feature pooling strategy for applying the model’s representations in downstream tasks. Since the prediction objective in equation (1) is unnormalized, there is no a priori reason for the encoder to yield a linearly separable subspace (Chen et al., 2020). Thus, rather than using a linear operation (averaging) to pool the features output by the frozen backbone, we explore a learnable non-linear pooling strategy. Specifically, when evaluating the frozen pretrained backbone on downstream tasks, we learn a cross-attention layer with a learnable query token. The output of the cross-attention layer is then added back to the query token (residual connection), and then fed into a two-layer MLP with a single GeLU activation, followed by a LayerNorm, and finally a linear classifier.

In Table 3 we see that using adaptive pooling with a learnable cross-attention layer leads to a significant improvement of +17 points on K400 and +16.1 points on SSv2. Using an attentive probe is also beneficial for other baseline models, as reported in Appendix E.

4.4 Prediction Task: Predicting y from x

We conduct an ablation on the masking strategy used in V-JEPA pretraining. We examine the following masking strategies: random-tube[r], in which x is obtained by removing a random fraction r of tubes (spatial patches extended across the entire temporal duration) from the video; causal multi-block[p], in which x is restricted to the first p frames of the 16-frame video, which are then masked with a random set of spatio-temporal blocks; and multi-block, in which x is obtained by masking a random set of spatio-temporal blocks from the entire video. Spatio-temporal blocks are sampled using the parameters described in Section 3.2; an ablation on the size and quantity of masked spatio-temporal blocks is provided in Appendix E.4.

Table 4 indicates that the best results are obtained by sampling x using a multi-block strategy, wherein the network is forced to make predictions after removing large continuous blocks in the video. When x is only sampled from the first few frames of the video, as in the causal multi-block strategy, we observe a decrease in downstream performance. Finally, the random-tube strategy, wherein 90% of the tubes in the video are randomly masked, leads to features of low semantic quality when combined with our feature prediction objective.
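For reference, below are minimal sketches of the two alternative maskers in this ablation, defined over an (8, 14, 14) grid of spatio-temporal tokens; these are illustrative reconstructions from the descriptions above, not the exact samplers used in the paper.

```python
import numpy as np

def random_tube_mask(r, grid_t=8, grid_h=14, grid_w=14):
    """random-tube[r]: mask a random fraction r of spatial positions ("tubes")
    across the entire temporal duration of the clip."""
    n_spatial = grid_h * grid_w
    chosen = np.random.choice(n_spatial, size=int(r * n_spatial), replace=False)
    mask = np.zeros(n_spatial, dtype=bool)
    mask[chosen] = True
    return np.broadcast_to(mask.reshape(1, grid_h, grid_w), (grid_t, grid_h, grid_w))

def causal_restriction(p, tubelet_size=2, grid_t=8):
    """causal multi-block[p]: x may only be sampled from the first p frames,
    i.e. the first p // tubelet_size temporal token slices; the remaining
    slices are always masked (OR-ed with a multi-block spatial mask)."""
    always_masked = np.zeros(grid_t, dtype=bool)
    always_masked[p // tubelet_size:] = True
    return always_masked
```

With r = 0.9 the first sampler reproduces the random-tube[0.9] setting, and p ∈ {6, 12} corresponds to the two causal variants in Table 4.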
Table 5 Comparison with Pixel Prediction Methods. We compare V-JEPA with OmniMAE (Girdhar et al., 2023), VideoMAE (Tong et al., 2022), and Hiera (Ryali et al., 2023), which leverage a pixel-reconstruction loss. All models are trained using a ViT-L architecture or a comparable Hiera-L. We evaluate the approaches on downstream image tasks (IN1K, Places205, iNat21) and video tasks (K400, SSv2, AVA) in both frozen evaluation (with a frozen backbone) and end-to-end fine-tuning. All models are evaluated at resolution 224. On K400 and SSv2 we follow the standard practice of reporting accuracy from several spatial and temporal views from the video. In frozen evaluation, V-JEPA outperforms the baselines on all downstream tasks, except ImageNet, where the model achieves 74.8% compared to 75.1% of an OmniMAE model trained directly on ImageNet. V-JEPA also achieves the best fine-tuning performance among all ViT-L models and matches the Hiera-L on SSv2. The V-JEPA results are achieved while processing significantly fewer examples during pretraining.
Table 6 Comparison with State-of-the-Art Models. We compare V-JEPA with state-of-the-art baselines in frozen evaluation
with an attentive probe on downstream image tasks (IN1K, Places205, iNat21) and video tasks (K400, SSv2, AVA). All models
are evaluated at resolution 224, except I-JEPA512 and V-JEPA384 which are evaluated respectively at resolution 512 and
384. On K400 and SSv2 we follow the standard practice of reporting accuracy from several spatial and temporal views
from the video. Compared to other video baselines, V-JEPA exhibits a consistent improvement across all downstream tasks.
Compared to image-models that excel under the frozen evaluation, V-JEPA shows a significant performance improvement on
tasks requiring motion understanding (+21 points on SSv2), and reduces the gap between video and image models on tasks
requiring static appearance-based features.
5 Comparison with Prior Work

In Section 5.1, we investigate the impact of feature prediction by comparing V-JEPA with video approaches that rely on pixel prediction, while using a similar architecture for all baselines. Subsequently, in Section 5.2, we remove the architectural constraint and report the best performance across architectures for self-supervised video and image pretraining approaches. Finally, we explore the label-efficiency of V-JEPA relative to other self-supervised video pretraining approaches in Section 5.3. We further detail the evaluation setup in Appendix D.

5.1 Comparison with Pixel Prediction

To investigate the effectiveness of feature prediction pretraining, we first compare V-JEPA to video masked modeling approaches relying on a pixel prediction loss. We control for the possible confounding factor of model architecture by evaluating all models using either a ViT-L/16 encoder or a Hiera-L encoder, which has a similar number of parameters. For the pixel prediction baselines we consider VideoMAE (Tong et al., 2022; Wang et al., 2023a), which trains vision transformer autoencoders exclusively on video, Hiera (Ryali et al., 2023), which trains a hierarchical transformer autoencoder on video, and OmniMAE (Girdhar et al., 2023), which trains a vision transformer autoencoder on static images and video simultaneously.

Table 5 examines both frozen evaluation with an attentive probe on downstream video and image tasks, as well as end-to-end fine-tuning. In frozen evaluation, V-JEPA outperforms the baselines on all downstream tasks, except ImageNet, where we achieve 74.8% compared to 75.1% of an OmniMAE model trained directly on ImageNet; hence, V-JEPA achieves comparable ImageNet performance despite only pretraining on video.
Figure 4 SSv2 fine-tuning performance vs. Samples Seen. We report SSv2 fine-tuning for V-JEPA and pixel-reconstruction baselines using a ViT-L/16 or Hiera-L architecture. V-JEPA outperforms all pixel-reconstruction methods using a ViT-L/16 and matches the Hiera-L performance while seeing significantly fewer samples during pretraining.

Figure 5 SSv2 frozen-evaluation performance vs. Pretraining Time. Wallclock times for all methods are measured on a single GPU with a batch size of 10 clips, using the official codebases for VideoMAE and VideoMAEv2, and linearly extrapolated assuming a global batch size of 2400 samples. However, note that the SSv2 accuracies of video pixel prediction methods are actually obtained with small batch sizes and significantly longer training schedules. V-JEPA outperforms pixel-reconstruction methods while training significantly faster.
Table 7 Low-Shot Frozen Evaluation. Comparing V-JEPA to other video models in frozen evaluation on Kinetics-400 and
Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive
probe. We train the probes in several low-shot settings: using either 5% of the train set, 10%, or 50%, and take 3 random
splits in each setting to obtain more robust metrics, resulting in 9 different evaluation experiments for each model. We report
the mean performances and standard deviation using the K400 and SSv2 validation sets. V-JEPA is more label-efficient than
other models; specifically, decreasing the available number of labeled examples from each class increases the performance gap
between V-JEPA and the baselines.
Frozen Evaluation
                             K400 (16×8×3)                           SSv2 (16×2×3)
Method        Arch.          5%           10%          50%           5%           10%          50%
MVD           ViT-L/16       62.6 ± 0.2   68.3 ± 0.2   77.2 ± 0.3    42.9 ± 0.8   49.5 ± 0.6   61.0 ± 0.2
VideoMAE      ViT-H/16       62.3 ± 0.3   68.5 ± 0.2   78.2 ± 0.1    41.4 ± 0.8   48.1 ± 0.2   60.5 ± 0.4
VideoMAEv2    ViT-g/14       37.0 ± 0.3   48.8 ± 0.4   67.8 ± 0.1    28.0 ± 1.0   37.3 ± 0.3   54.0 ± 0.3
V-JEPA        ViT-H/16       67.0 ± 0.2   72.1 ± 0.1   80.2 ± 0.2    51.9 ± 0.3   57.5 ± 0.4   67.3 ± 0.2
V-JEPA        ViT-H/16384    68.2 ± 0.2   72.8 ± 0.2   80.6 ± 0.2    54.0 ± 0.2   59.3 ± 0.5   67.9 ± 0.2
…layer attentive probe, which can be further improved to 77.9% using a two-layer attentive probe. More generally, we hypothesize that the datasets used to train V-JEPA and other video models are too constrained and lack the visual diversity of the internet-scale pretraining data used by the image models; as such, there is value in focusing future work on building diverse publicly available video datasets.

…to 54.0% top-1 when we reduce the number of labeled examples by a factor of 10× (from roughly 440 examples per class to 48 examples per class). By contrast, VideoMAEv2 drops by 26% to 28.0% top-1, VideoMAE drops by 19.1% to 41.4% top-1, and MVD drops by 18.1% to 42.9% top-1.
(a) Visualization Methodology. We train a conditional diffusion model to decode the V-JEPA feature-space predictions to interpretable pixels; the pretrained V-JEPA encoder and predictor networks are kept frozen in this process. The decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video.

(b) Visualizations. First Row: Masked videos used as input to the V-JEPA models (a pretrained ViT-H/16 encoder and its corresponding predictor network). Other rows: Bounding boxes contain various samples from the decoder overlaid on the original video. V-JEPA is not a generative model and the decoder does not have access to the context (first row), so we do not expect samples to exactly match the input. This experiment qualitatively illustrates what information is encoded and predicted by V-JEPA. In particular, characteristics that are common across samples represent information that is encoded in the V-JEPA predictions. V-JEPA generates predictions that are spatially and temporally coherent with the unmasked regions of the video. The predictions also capture consistent motion through time.
References

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE international conference on computer vision, 2021.

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. arXiv preprint arXiv:2204.07141, 2022.

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.

Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli. Efficient self-supervised learning with contextualized target representations for vision, speech and language. arXiv preprint arXiv:2212.07525, 2022a.

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555, 2022b.

Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.

Pietro Berkes and Laurenz Wiskott. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of vision, 5(6):9–9, 2005.

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026, 2022.

Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021.

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.

Ekin Dogus Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, and Kaiming He. A large-scale study on unsupervised spatiotemporal representation learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2021.

Christoph Feichtenhofer, Yanghao Li, Kaiming He, et al. Masked autoencoders as spatiotemporal learners. Advances in neural information processing systems, 35:35946–35958, 2022.

David J Field. What is the goal of sensory coding? Neural computation, 6(4):559–601, 1994.

Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Learning representations by predicting bags of visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6928–6938, 2020.

Rohit Girdhar and Kristen Grauman. Anticipative video transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13505–13515, 2021.

Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Omnimae: Single model masked pretraining on images and videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10406–10417, 2023.

Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE international conference on computer vision, pages 4086–4093, 2015.

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.

Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056, 2018.

Agrim Gupta, Jiajun Wu, Jia Deng, and Li Fei-Fei. Siamese masked autoencoders. arXiv preprint arXiv:2305.14344, 2023.

Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of machine learning research, 13(2), 2012.

Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.

Tengda Han, Weidi Xie, and Andrew Zisserman. Memory-augmented dense predictive coding for video representation learning. In European conference on computer vision, pages 312–329. Springer, 2020.

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.

Geoffrey E Hinton. Connectionist learning procedures. In Machine learning, pages 555–610. Elsevier, 1989.

Tarun Kalluri, Deepak Pathak, Manmohan Chandraker, and Du Tran. Flavr: Flow-agnostic video representations for fast frame interpolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2071–2082, 2023.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

Christoph Kayser, Wolfgang Einhäuser, Olaf Dümmer, Peter König, and Konrad Körding. Extracting slow subspaces from natural videos leads to complex cells. In Artificial Neural Networks—ICANN 2001, pages 1075–1080. Springer, 2001.

Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. 2016.

Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. 2017.

Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. 2022.

Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE international conference on computer vision, pages 667–676, 2017.

Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676, 2022.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69–84. Springer, 2016.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

Nikhil Parthasarathy, SM Eslami, João Carreira, and Olivier J Hénaff. Self-supervised video pretraining yields strong image representations. arXiv preprint arXiv:2210.06433, 2022.

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.

Silvia L Pintea, Jan C van Gemert, and Arnold WM Smeulders. Déja vu: Motion prediction in static images. In Computer Vision–ECCV 2014, pages 172–187. Springer, 2014.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79–87, 1999.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. arXiv preprint arXiv:2306.00989, 2023.

Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, and Lorenzo Torresani. Only time can tell: Discovering temporal data for temporal modeling. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 535–544, 2021.

Elizabeth S Spelke, Peter Vishton, and Claes Von Hofsten. Object perception, object-directed action, and physical knowledge in infancy. 1995.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852. PMLR, 2015.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7464–7473, 2019.

Dídac Surís, Ruoshi Liu, and Carl Vondrick. Learning the predictability of the future. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12607–12617, 2021.

Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A Plummer, Kate Saenko, Karl Ridgeway, and Lorenzo Torresani. Multiscale video pretraining for long-term activity forecasting. arXiv preprint arXiv:2307.12854, 2023.

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780, 2017.

Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In International Conference on Machine Learning, pages 10268–10278. PMLR, 2021.

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022.

Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 1096–1103, 2008.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.

Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 98–106, 2016.

Fei Wang, Ping Li, and Arnd Christian Konig. Learning a bi-stochastic data similarity matrix. In 2010 IEEE International Conference on Data Mining, pages 551–560. IEEE, 2010.

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023a.

Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, and Yu-Gang Jiang. Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6312–6322, 2023b.

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.

Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural computation, 14(4):715–770, 2002.

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3733–3742, 2018.

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886, 2021.

Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10334–10343, 2019.

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021.

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.

Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, et al. Videoglue: Video general understanding evaluation of foundation models. arXiv preprint arXiv:2307.03166, 2023.

Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022.

Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. https://proceedings.neurips.cc/paper/2014/file/3fe94a002317b5f9259f82690aeea4cd-Paper.pdf.

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.

Will Zou, Shenghuo Zhu, Kai Yu, and Andrew Ng. Deep learning of invariant features via simulated fixations in video. Advances in neural information processing systems, 25, 2012.
Appendix
B Extended Description of V-JEPA
In this section, we provide an in-depth description of our approach V-JEPA that is illustrated in Figure 3.
Input. Unless stated otherwise, during pretraining, we always randomly sample a clip of 16 frames from
each input video with a temporal stride of 4 between sampled frames. An input video clip therefore covers 64 frames
in total, or roughly 2 seconds of a given video running at 30 frames per second. We then resize the video’s spatial
dimensions to 224 × 224, resulting in an overall shape of 16 × 224 × 224 × 3 for the entire clip. Since ViT networks
process a 1D sequence of tokens, we must convert an input video clip into a 1D token sequence. To do so, we apply a
3D convolution comprising d filters of size 2 × 16 × 16 with a temporal stride of 2 and a spatial stride of 16, resulting
in a tensor of shape 8 × 14 × 14 × d. Next we add absolute 3D sin-cos positional embeddings to the spatio-temporal
feature map and flatten it, resulting in a 1D token sequence of shape 1568 × d. This process is demonstrated in
Figure 7.
Figure 7 V-JEPA training operates on a video clip flattened into a sequence of tokens. To convert a video clip of size
16 × 224 × 224 × 3 into a 1D token sequence, we apply a 3D convolution comprising d filters of size 2 × 16 × 16 with a temporal
stride of 2 and a spatial stride of 16, resulting in a tensor of shape 8 × 14 × 14 × d. Next we add absolute 3D sin-cos positional
embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape 1568 × d.
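A minimal sketch of this tokenization step in PyTorch; the embedding dimension d is a placeholder, and the positional-embedding helper is omitted:

```python
import torch
import torch.nn as nn

d = 1024                                    # embedding dimension (placeholder)
patchify = nn.Conv3d(in_channels=3, out_channels=d,
                     kernel_size=(2, 16, 16), stride=(2, 16, 16))

clip = torch.randn(1, 3, 16, 224, 224)      # (batch, channels, frames, height, width)
feat = patchify(clip)                       # -> (1, d, 8, 14, 14)
tokens = feat.flatten(2).transpose(1, 2)    # -> (1, 1568, d) token sequence

# Absolute 3D sin-cos positional embeddings would be added to `tokens` here,
# before the sequence is passed to the transformer blocks.
```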
V-JEPA. We sample both a video clip, and a video mask in each iteration. We denote a video clip represented as
a 1D token sequence of length L = 1568 by xL = (x1 , . . . , xL ). Similarly, given a mask of M < L patches, leaving
N = L − M patches unmasked, we denote the indices of masked patches by (i1 , . . . , iM ) and its complement (the
indices of unmasked patches) by (j1 , . . . , jN ).
Computing the x-representations. To compute the V-JEPA loss, we first produce the x-representations by masking
the video clip and feeding it into the x-encoder; we denote the masked video by xN = (xj1 , . . . , xjN ). Applying the x-
encoder Eθ (·) to the masked clip gives a sequence of patch representations, denoted as zN = Eθ (xN ) = (zj1 , . . . , zjN ).
Predicting the target. Next, the V-JEPA predictor network Pϕ (·, ·) takes as input the tokens produced by the
x-encoder and predicts the missing regions in the video clip, which are specified by a set of learnable mask tokens.
Specifically, the mask tokens are parameterized as the sum of a shared learnable vector and an absolute 3D
sin-cos positional embedding, denoted by mM = (mi1 , . . . , miM ). The output of the predictor is thus given by,
ŝM = Pϕ (zN , mM ) = (ŝi1 , . . . , ŝiM ), corresponding to a d-dimensional output for each of the M masked patches.
Computing the y-representations. Finally to compute the prediction targets, the entire unmasked video clip is
processed by the y-encoder to obtain a set of target representations, denoted by sL = E θ (xL ) = (s1 , . . . , sL ). The
V-JEPA loss is now computed as

Loss = (1/M) Σ_{k ∈ (i1, ..., iM)} ∥ŝk − sk∥1,   (2)
which is simply the average L1 distance between the output of the predictor and the y-encoder. We then compute a
gradient update with respect to the parameters of the x-encoder, θ, and the predictor, ϕ, and subsequently update
the parameters of the y-encoder as an exponential moving average of the context encoder weights (Polyak average).
Table 8 pretraining hyper-parameters for V-JEPA.
Multi-Mask Prediction. To increase the efficiency of V-JEPA, we use a multi-masking strategy (Caron et al.,
2020; Baevski et al., 2022a), which enables us to amortize the cost of the target computation. As mentioned in
Section 3, for a given video clip, we sample 2 different masks, short-range and long-range. While we need to forward
propagate the x-encoder and predictor separately for each mask, we only need to compute the y-representation once.
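A sketch of how the target computation is amortized across the two masks, reusing the placeholder module names from the earlier sketch (x_encoder, predictor, y_encoder); the mask variables are assumed to be (context-index, target-index) pairs:

```python
import torch
import torch.nn.functional as F

# Compute the y-representation once per clip...
with torch.no_grad():
    targets = y_encoder(video_tokens)            # full, unmasked clip

# ...then reuse it for both the short-range and the long-range mask.
loss = 0.0
for ctx_idx, tgt_idx in (short_range_mask, long_range_mask):
    z_x = x_encoder(video_tokens[:, ctx_idx])    # separate forward pass per mask
    pred = predictor(z_x, target_positions=tgt_idx)
    loss = loss + F.l1_loss(pred, targets[:, tgt_idx])
```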
C Pretraining details
In this section, we report V-JEPA pretraining details. Table 8 summarizes the main hyperparameters used during
pretraining.
Architectures. We use Vision Transformer (Dosovitskiy et al., 2020) (ViT) architectures for the x-encoder and
y-encoder. We train three V-JEPA encoders: a ViT-L/16224 , a ViT-H/16224 and a ViT-H/16384 . All three encoders
take as input a short video clip of 16 frames with a temporal stride of 4 between consecutive frames. The subscripts,
224 and 384, indicate the spatial resolution of the video clip. V-JEPA flattens the video clip into a sequence of
non-overlapping spatio-temporal patches of size 16 × 16 × 2 (see Figure 7). For all three models, the predictor is
designed as a narrow ViT architecture, consisting of 12 transformer blocks with an embedding dimension of 384. For
simplicity, we keep the number of self-attention heads in the predictor equal to that of the backbone used for the
context-encoder/target-encoder. V-JEPA is pretrained without using a [cls] token.
Optimization. We use AdamW (Loshchilov and Hutter, 2017) to optimize the x-encoder and predictor weights.
The ViT-L/16224 and ViT-H/16224 models use a batch size of 3072 while the ViT-H/16384 uses a batch size of
2400. Models are trained for a total of 90,000 iterations. The learning rate is linearly increased from 2 × 10−4
to 6.25 × 10−4 during the first 12, 000 iterations of pretraining, and decayed to 10−6 following a cosine schedule.
Table 9 Frozen Evaluation hyper-parameters.
Weight-decay is also linearly increased from 0.04 to 0.4 throughout pretraining. The y-encoder weights are initialized
identically to the x-encoder, and subsequently updated as an exponential moving average (EMA) (Tarvainen and
Valpola, 2017) of the x-encoder weights using a momentum value which starts at 0.998 and is linearly increased to
1.0 during training (Caron et al., 2021; Assran et al., 2022). We scale all hyper-parameter schedules 25% beyond
the actual training schedule. Specifically, the learning rate schedule, weight-decay schedule, and EMA schedule
are computed assuming a training length of 112,500 iterations, even though we only train our model for 90,000
iterations. We found the last 25% of the default scheduler period to update hyper-parameters too aggressively, and
simply truncating the schedulers improved performance.
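The schedules described above can be written as simple functions of the iteration index. Below is a sketch under the stated values (learning rate warmed up from 2e-4 to 6.25e-4 and cosine-decayed to 1e-6, weight decay 0.04 to 0.4, EMA momentum 0.998 to 1.0, all computed over 112,500 iterations but only applied for 90,000); the exact functional forms in the released code may differ:

```python
import math

TOTAL, USED, WARMUP = 112_500, 90_000, 12_000   # schedule length, used steps, warmup

def lr_at(step):
    if step < WARMUP:                            # linear warmup
        return 2e-4 + (6.25e-4 - 2e-4) * step / WARMUP
    t = (step - WARMUP) / (TOTAL - WARMUP)       # cosine decay over the full schedule
    return 1e-6 + 0.5 * (6.25e-4 - 1e-6) * (1 + math.cos(math.pi * t))

def wd_at(step):
    return 0.04 + (0.4 - 0.04) * step / TOTAL    # linearly increasing weight decay

def ema_momentum_at(step):
    return 0.998 + (1.0 - 0.998) * step / TOTAL  # linearly increasing EMA momentum

# Training stops at USED = 90,000 iterations, so only the first 80% of each
# schedule is ever applied (the truncation described above).
```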
Masking. As described in Section 3, we propose a 3D Multi-Block masking strategy. We use two types of masks:
short-range masks, where we take the union of 8 randomly sampled target blocks with a spatial scale of 0.15, and
long-range masks, where we take the union of 2 randomly sampled target blocks with a spatial scale of 0.7. In both
cases, the aspect ratio for all sampled blocks is randomly chosen in the range (0.75, 1.5).
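A sketch of the 3D multi-block sampler under these settings (blocks repeated across the full temporal dimension of an 8 × 14 × 14 token grid); this is an illustrative reconstruction, not the paper's exact sampler:

```python
import math
import random
import numpy as np

def sample_multiblock_mask(num_blocks, spatial_scale, grid_t=8, grid_h=14, grid_w=14,
                           aspect_ratio=(0.75, 1.5)):
    """Return a boolean (T, H, W) array that is True for masked (target) tokens.
    Each block spans the full temporal dimension; the mask is the union of blocks."""
    mask = np.zeros((grid_t, grid_h, grid_w), dtype=bool)
    for _ in range(num_blocks):
        ar = random.uniform(*aspect_ratio)
        area = spatial_scale * grid_h * grid_w
        h = min(grid_h, max(1, round(math.sqrt(area * ar))))
        w = min(grid_w, max(1, round(math.sqrt(area / ar))))
        top = random.randint(0, grid_h - h)
        left = random.randint(0, grid_w - w)
        mask[:, top:top + h, left:left + w] = True   # repeat across all frames
    return mask

short_range = sample_multiblock_mask(num_blocks=8, spatial_scale=0.15)
long_range = sample_multiblock_mask(num_blocks=2, spatial_scale=0.70)
```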
D Evaluation details
where Wk, Wv ∈ Rd×d are the key and value matrices, and q ∈ Rd is a learnable query token. The output of the cross-attention is then added back to the query token (residual connection), and then fed into a two-layer MLP with a single GeLU activation, followed by a LayerNorm, and finally a linear classifier. The parameters of the cross-attention block are jointly learned with those of the linear classifier for the downstream task, while the encoder parameters are kept frozen. Note that, in practice, we actually use an attentive probe with 12 heads, each of dimension 12. In
Appendix E we show that baselines benefit from the attentive probing protocol.
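A sketch of the attentive probe as just described: a single learnable query token, one cross-attention layer, a residual connection back to the query, a two-layer MLP with a GeLU, a LayerNorm, and a linear classifier. The head count and initialization here are placeholders (the paper's probe uses 12 heads of dimension 12), so treat this as an illustrative reimplementation:

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim, num_classes, num_heads=16):   # dim must be divisible by num_heads
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learnable query token
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, tokens):                 # tokens: (B, N, dim) frozen encoder features
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.cross_attn(q, tokens, tokens)
        pooled = pooled + q                    # residual connection to the query token
        pooled = self.norm(self.mlp(pooled))
        return self.classifier(pooled.squeeze(1))

# probe = AttentiveProbe(dim=1024, num_classes=400)   # e.g. ViT-L features, K400 classes
```

Only the probe's parameters are trained; the encoder producing `tokens` stays frozen.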
Optimization. For all tasks, we use the AdamW optimizer with a cosine scheduler (no warmup) that decays the
learning rate from 0.001 to 0. We use a fixed weight-decay of 0.01 and apply simple data augmentations (random
resized crops and horizontal flips) during training of the attentive probe, except on image tasks, where we apply
AutoAugment (Dogus Cubuk et al., 2019). Table 9 reports the hyperparameters for each downstream evaluation.
Extension to multiple clips. Unless stated otherwise, our attentive probe takes 8 clips of 16 frames as input on
Kinetics, and 2 clips of 16 frames on Something-Something-v2, to increase the temporal coverage of the video.
Table 10 Frozen Detection hyper-parameters.
Specifically, we first divide a video into 8 (or 2) equal-length temporal segments, and sample 1 clip at random per
segment. The video encoder Eθ processes each clip separately and produces a clip-level feature map. The feature
maps for each clip are then concatenated together and fed to the attentive probe. At test time, we average the
predictions of 3 spatial views, following standard practice in video classification.
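A minimal sketch of the per-segment clip sampling, where the helper name and exact boundary handling are assumptions:

```python
import random

def sample_clip_starts(num_frames, num_segments=8, clip_len=16, frame_stride=4):
    """Divide a video into equal-length temporal segments and pick one clip start
    per segment (a sketch; exact boundary handling is an assumption)."""
    span = clip_len * frame_stride               # frames covered by one clip
    seg_len = num_frames // num_segments
    starts = []
    for s in range(num_segments):
        lo = s * seg_len
        hi = max(lo, min(num_frames - span, (s + 1) * seg_len - 1))
        starts.append(random.randint(lo, hi))
    return starts

print(sample_clip_starts(num_frames=2400))       # e.g. a 30 fps, 80-second video
# Each clip is encoded separately; the clip-level feature maps are then
# concatenated along the token axis before being fed to the attentive probe.
```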
Application of video models to images. To evaluate the video models on image tasks, we simply duplicate input
images to generate still video clips of 16 frames. We perform this duplication simply for convenience when evaluating
the video models; however, we find this step to be unnecessary in general. Given a video tokenizer implemented as a
3D-conv with a temporal stride of 2, it is sufficient to duplicate the image into a 2-frame video clip. This results
in the same number of input tokens as that produced by a static image model with a 2D-conv tokenizer.
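For example, assuming a standard (B, C, T, H, W) tensor layout, the duplication amounts to a repeat along the temporal axis:

```python
import torch

image = torch.randn(1, 3, 224, 224)                    # (B, C, H, W)

# Evaluation convenience: repeat the image along a new time axis to form a still clip.
clip16 = image.unsqueeze(2).repeat(1, 1, 16, 1, 1)     # (B, C, 16, H, W)

# With a 3D-conv tokenizer of temporal stride 2, duplicating to just 2 frames already
# yields (224 / 16) * (224 / 16) = 196 tokens, matching a 2D patch-16 image tokenizer.
clip2 = image.unsqueeze(2).repeat(1, 1, 2, 1, 1)       # (B, C, 2, H, W)
```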
Application of image models to videos. To evaluate image models such as DINOv2 and OpenCLIP on video tasks,
we simply process each frame independently with the image encoder to produce a frame-level feature map. The
feature maps for each frame are then concatenated and fed to the attentive probe, just as we do with the clip-level
feature maps when evaluating video models.
D.3 Finetuning
Following Tong et al. (2022), we finetune our model end-to-end with a linear classification head, using a layer-wise
learning-rate decay scheme and mixup in the data augmentation pipeline. We provide all hyper-parameters for both K400
and SSv2 in Table 11.
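A minimal sketch of such a layer-wise learning-rate decay, with an illustrative decay factor (the actual values are given in Table 11):

```python
def layerwise_lrs(num_blocks, base_lr, decay=0.75):
    """Per-layer learning rates for finetuning: the classification head (last entry)
    uses base_lr, and earlier transformer blocks are scaled down geometrically.
    The decay factor and indexing convention here are illustrative assumptions."""
    # index 0 = patch embedding, index num_blocks + 1 = classification head
    return [base_lr * decay ** (num_blocks + 1 - i) for i in range(num_blocks + 2)]

lrs = layerwise_lrs(num_blocks=24, base_lr=1e-3)   # e.g. a ViT-L/16 with 24 blocks
print(lrs[0], lrs[-1])                             # smallest (patch embed) vs. head lr
```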
E Extra Results
Table 11 Finetuning Evaluation hyper-parameters.
Table 12 Linear vs. Attentive Probe Evaluation for V-JEPA and VideoMAE. We evaluate the effect of linear (Lin.)
and attentive (Att.) probing when adapting V-JEPA and VideoMAE to the K400 (16 × 5 × 3) and SSv2 (16 × 2 × 2) tasks. Both V-JEPA and
VideoMAE benefit from using a non-linear attentive probe.
Method      Arch.       K400 Lin.   K400 Att.   SSv2 Lin.   SSv2 Att.
VideoMAE    ViT-L/16    52.5        77.8        41.3        61.2
V-JEPA      ViT-L/16    56.7        80.8        50.1        69.5
Table 13 Linear vs. Attentive Probe Evaluation for DINOv2 and OpenCLIP. We evaluate the effect of linear (Lin.)
and attentive probing (Att.) when adapting DINOv2 and OpenCLIP. Image baselines benefit from using an attentive probing
strategy. Results shown in gray are reported from the linear probe evaluation in Oquab et al. (2023).
One Clip vs Multiple clips. We examine the impact of changing the temporal coverage of a model during downstream
evaluation on K400 action classification. In Table 14, we evaluate VideoMAE and V-JEPA models using an attentive
probe with access to either the feature map of 1 clip randomly sampled from the video, or the concatenated feature
map of 8 clips randomly sampled from the video. To sample 8 clips from a video, we first divide the video into 8
equal-length temporal segments, and sample 1 clip at random from each segment. A single clip corresponds to ≈ 2
seconds of a video on average, while 8 clips correspond to ≈ 16 seconds. The video encoder processes each clip
separately to produce a clip-level feature map, and the clip-level feature maps are then concatenated at the input to the attentive probe.
Increasing the temporal coverage from 1 clip per video to 8 clips improves the performance of both V-JEPA and
VideoMAE on K400 action classification. We therefore use the multiclip attentive probing setup as our default
evaluation pipeline.
E.2 Finetuning
In Table 15, we evaluate V-JEPA using finetuning (separately) on K400 and SSv2. We compare V-JEPA with
VideoMAEv2 (Wang et al., 2023a), VideoMAE (Tong et al., 2022) and MVD (Wang et al., 2023b) using a ViT-L/16
or a ViT-H/16 architecture. V-JEPA obtains competitive performance using a finetuning protocol. With a ViT-H/16
architecture, V-JEPA outperforms VideoMAE by +1.2% and VideoMAEv2 by +0.3% on the SSv2 dataset, while obtaining
comparable performance on K400. V-JEPA also obtains performance similar to MVD on the SSv2 dataset. The
MVD model achieves the best performance across models on the K400 dataset, and is trained using the image
dataset ImageNet1K, in contrast to the other methods in the table, which only use video data. Additionally, MVD
requires processing significantly more samples during pretraining due to the cost of training its teacher encoder
networks in a pre-pretraining step.
Table 14 Temporal Coverage on Kinetics-400. We evaluate the effect of temporal coverage on K400. We train an attentive
probe on K400 using either 1 clip (≈ 2 seconds of a video) or 8 clips (≈ 16 seconds of a video). To sample N clips, we first
divide a video into N equal-length temporal segments and sample one clip at random per segment. The video encoder processes
each clip in parallel and all the encoder output tokens are concatenated at the input of the attentive probe. Increasing the
temporal coverage from 1 clip per video to 8 clips significantly improves the performance for both our VideoMAE baseline
and V-JEPA.
Table 15 Finetuning results. We evaluate a V-JEPA model with the finetuning protocol on the K400 and SSv2 datasets
using 16 frames per clip and multi-view fusion (5×3 or 2×3) for inference. The #Samples Seen entry corresponds to the
number of video clips processed during pretraining, which is larger than the size of the pretraining dataset for multi-epoch
training. We compare V-JEPA with different video self-supervised learning approaches. We report the VideoMAEv2 results
without instruction-tuning for consistency with the other approaches. V-JEPA obtains competitive performance using the
finetuning protocol.
We examine our multi-masking strategy and find sampling two masks for each clip (long-range and short-range) to
be more effective than sampling just a single mask for each clip.
In Figure 8c, we explore different average spatial and temporal masking ratios, i.e., the fraction of a clip's
spatial/temporal extent that is covered by a mask on average. Recall that each mask is constructed by sampling several
(possibly overlapping) blocks and taking their union. We change the average spatial or temporal masking ratio by
changing the spatial or temporal size of the blocks, as well as the overall number of blocks. We found that low spatial or
temporal coverage results in a trivial prediction task, which degrades downstream performance. Based on those
results, we sample masks that remove roughly 90% of the frame and extend along the entire temporal dimension of
the clip by default.
In Figure 8b, we explore different block sizes given an effective spatial masking ratio of 90% and a temporal ratio of
100%. We keep the masking ratio approximately constant by changing the block size and the number of blocks at the
same time. We find that sampling several blocks performs better than sampling a single large block. Figure 9
visually illustrates the effect of sampling several smaller blocks to construct a mask.
In Figure 8a, we explore the effect of varying the number of masks sampled per clip. We find that sampling two
masks for each clip, with different spatial block sizes for each, is more effective than sampling just a single mask.
We hypothesize that this masking strategy induces complementary prediction tasks. We use this two-mask sampling as
our default.
Table 16 Sample efficiency. We compare the sample efficiency of pretraining various state-of-the-art image and video models.
The #Samples Seen entry corresponds to the number of samples (image or video clips) processed by the network during
pretraining, which is larger than the size of the pretraining dataset for multi-epoch training. The V-JEPA results in this
paper are obtained while processing an order of magnitude fewer samples than previous methods.
[Figure 8: three panels reporting K400 accuracy, (a) Ablating Number of Masks per Sample, (b) Ablating Number of Blocks per Mask, and (c) Ablating Masking Ratio, the latter showing curves for temporal masking ratios of 50%, 75%, and 100% as a function of the spatial masking ratio.]
Figure 8 Masking Strategy Ablation. Evaluating a linear probe on a ViT-B/16 pretrained with V-JEPA on K400 under
various 3D Multi-Block masking settings. We examine the impact of (a) sampling several masks per video, (b) varying the
number of blocks in a mask, and (c) varying the average spatial and temporal masking ratio. A temporal masking ratio of
100% extends the spatial mask across all the frames in the clip. We find it important to maintain a high spatial and temporal
masking ratio during pretraining.
Figure 9 Illustration of masks with varying numbers of blocks and block sizes. Each mask is constructed by sampling several (possibly
overlapping) blocks and taking their union.