V-JEPA
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and
introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without
the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision.
The models are trained on 2 million videos collected from public datasets and are evaluated on downstream
image and video tasks. Our results show that learning by predicting video features leads to versatile visual
representations that perform well on both motion and appearance-based tasks, without adaptation of the
model’s parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos,
obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
Figure 1 Frozen Evaluation. V-JEPA (ViT-L/16 and ViT-H/16) compared with pixel-prediction baselines (OmniMAE ViT-H/16, VideoMAE ViT-H/16, Hiera-H, VideoMAEv2 ViT-g/14) on Something-Something-v2 and Kinetics-400, alongside the SOTA fine-tuned task-specific models on each benchmark (MVD on SSv2, UniFormer on K400).

1 Introduction

Humans possess the remarkable ability to map low-level signals originating from the retina into a semantic spatio-temporal understanding of the world; synthesizing notions such as objects and global motion (Spelke et al., 1995). A long-standing goal of the machine learning community is to identify the principles or objectives that may guide such learning in humans.
To that end, we pretrain a family of V-JEPA models on a dataset of 2 million videos collected from publicly available datasets by combining a masked modeling prediction task with a joint-embedding predictive architecture (see Figure 2). We measure performance on several downstream image and video tasks, using both frozen evaluation and end-to-end fine-tuning. Our findings suggest that feature prediction can indeed serve as an effective stand-alone objective for unsupervised learning from video, while using significantly shorter training schedules than pixel prediction methods. Specifically:

• Feature prediction leads to versatile visual representations that perform well across downstream image and video tasks without adaptation of the model’s weights; i.e., using a frozen backbone. V-JEPA achieves the best performance among methods we consider (+6% accuracy) on the Something-Something-v2 task, which requires fine-grained temporal understanding. V-JEPA is also competitive on tasks like Kinetics-400, where appearance-based features are sufficient and hence state-of-the-art image models such as DINOv2 excel (Figure 1 and Table 6).

• Models trained with feature prediction are superior to pixel prediction approaches under a frozen evaluation protocol (attentive probing) and are competitive with pixel prediction under full fine-tuning, while using significantly shorter training schedules (Tables 5 and 6).

• Models trained with feature prediction are more label-efficient than pixel prediction approaches. Decreasing the available number of labeled examples results in an increase in the performance gap between V-JEPA and pixel-reconstruction models (Table 7).

2 Related Works

Slow Features. One way to encourage temporally adjacent representations to be predictive of each other is to ensure that they vary slowly over time. Early works targeting predictive features encouraged representations of individual video frames to be locally temporally invariant, while preventing representation collapse by using spectral methods, as in SFA (Wiskott and Sejnowski, 2002), SSA (Kayser et al., 2001), and Simulated Fixations (Zou et al., 2012). More recently, Goroshin et al. (2015); Wang et al. (2010) train a siamese convolutional network to map the representations of two subsequent frames to the same point, while encouraging distant frames to have diverse representations via a pairwise margin loss and a triplet loss, respectively. Other works (Oord et al., 2018; Surís et al., 2021; Feichtenhofer et al., 2021) implement temporal invariance using noise-contrastive estimation (Gutmann and Hyvärinen, 2012). Our exploration in this paper goes beyond temporal invariance and explores feature prediction using masked modeling.

Predictive Features. Going beyond local invariance, a family of works trains a predictor network to map the representation of a frame or clip at one time-step to a distinct representation at another time-step. Srivastava et al. (2015); Vondrick et al. (2016); Wang et al. (2023b) train such a video feature predictor network on top of a frozen pretrained image or video encoder. Unfreezing the target feature extractor, several methods train the video encoder and the predictor network simultaneously, while preventing collapse by using a supervised action forecasting loss (Girdhar and Grauman, 2021), or by using the representations of distant clips as negative samples in a contrastive loss (Han et al., 2019, 2020; Tan et al., 2023), often focusing on small convolutional encoders (Han et al., 2019, 2020). The idea of learning a representation by predicting missing information in feature space is also core to the joint-embedding predictive architecture (JEPA) (LeCun, 2022), which combines a siamese encoder with a predictor network. JEPAs have been successfully instantiated in several modalities, such as with audio data (Baevski et al., 2022b) and image data (Zhou et al., 2021; Oquab et al., 2023; Assran et al., 2023). In this work, we extend this paradigm to video data by leveraging recent advances in self-supervised learning.

Advances in Self-Supervised Learning. The use of vision transformers (Dosovitskiy et al., 2020; Li et al., 2022) has become standard practice in self-supervised learning with joint-embedding architectures (Chen et al., 2021; Caron et al., 2021; Oquab et al., 2023; Zhou et al., 2021; Assran et al., 2022), and unlocked masked image modeling in pixel space by parameterizing the pixel decoder as a transformer with learnable mask tokens (Dosovitskiy et al., 2020; Xie et al., 2021; He et al., 2021; Bao et al., 2021), demonstrating a step-change in the representation quality of autoencoding methods (Vincent et al., 2010). This line of generative methods was subsequently extended to video data using spatio-temporal masking (Tong et al., 2022; Feichtenhofer et al., 2022; Wang et al., 2023a; Kalluri et al., 2023; Gupta et al., 2023). It was also recently shown that the representations of masked image autoencoders could be significantly improved by using learnable pooling mechanisms based on cross-attention (Chen et al., 2022). Finally, through careful selection of design choices, the non-contrastive collapse prevention strategy in BYOL (Grill et al., 2020) was recently made to work with image feature prediction methods (Baevski et al., 2022b; Assran et al., 2023), which demonstrated the ability to learn representations that can be leveraged for various downstream tasks without relying on invariance to hand-crafted image transformations.
Feature Prediction versus Pixel Reconstruction. Approaches that predict in pixel space must dedicate significant model capacity and compute to capture all the low-level detail in the visual input. By contrast, approaches that predict in latent space have the flexibility to eliminate irrelevant or unpredictable pixel-level details from the target representation (Vondrick et al., 2016). Predicting in representation space has been shown to lead to versatile representations that perform well across many downstream tasks through linear probing or low-shot adaptation (Assran et al., 2023; Oquab et al., 2023; Assran et al., 2022), while demonstrating an efficiency gain during pretraining compared to pixel-level reconstruction (Assran et al., 2023; Baevski et al., 2022b,a). The works of Baevski et al. (2022a,b) additionally show that predicting in representation space results in competitive end-to-end fine-tuning performance in the image, audio and text domains. In this work, we extend these findings to the video modality.

3 Methodology: Video-JEPA

Figure 2 Joint-Embedding Predictive Architecture: an x-encoder and a y-encoder produce representations of the two parts of the video, and a predictor, conditioned on z, outputs a prediction ŝy that is compared to the target representation sy through a loss D(ŝy, sy).

… computed from another part of the video, x. The predictor network Pϕ(·), which maps the representation of x to the representation of y, is trained simultaneously with the encoder, and is provided specification of the spatio-temporal positions of y through the conditioning variable z ← ∆y. Naively implementing the objective using the regression

minimizeθ,ϕ ∥Pϕ(Eθ(x), ∆y) − Eθ(y)∥1,

would admit a trivial solution, where the encoder outputs a constant representation, regardless of its input. In practice, we use the following modified objective to prevent representation collapse,

minimizeθ,ϕ ∥Pϕ(Eθ(x), ∆y) − sg(Ēθ(y))∥1,   (1)

where sg(·) denotes a stop-gradient operation, which does not backpropagate through its argument, and Ēθ(·) is an exponential moving average of the network Eθ(·). The use of an exponential-moving-average feature extractor along with a stop-gradient and a predictor has been used as a collapse prevention strategy for image pretraining (Grill et al., 2020), and studied empirically (Xie et al., 2021) and theoretically (Tian et al., 2021). In fact, the objective in equation (1) is similar to the loss of Assran et al. (2023) used for image pretraining, but we modify it to use an ℓ1 regression, which we found to be more stable.
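To make the collapse-prevention mechanism concrete, the following is a minimal PyTorch-style sketch of one training step under equation (1). The module names (x_encoder, predictor, y_encoder), the predictor's target_positions argument, and the index tensors are illustrative placeholders, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def vjepa_step(x_encoder, predictor, y_encoder, video_tokens, ctx_idx, tgt_idx):
    """One V-JEPA training step (sketch). `video_tokens` holds the full token
    sequence; `ctx_idx`/`tgt_idx` index the unmasked (x) and masked (y) tokens."""
    # Encode only the visible (unmasked) tokens.
    z_x = x_encoder(video_tokens[:, ctx_idx])

    # Predict the representations of the masked tokens, conditioned on their positions.
    pred = predictor(z_x, target_positions=tgt_idx)

    # Targets come from the exponential-moving-average y-encoder; the
    # stop-gradient (no_grad) is what prevents representation collapse.
    with torch.no_grad():
        targets = y_encoder(video_tokens)[:, tgt_idx]

    loss = F.l1_loss(pred, targets)   # ℓ1 regression in feature space
    loss.backward()                   # gradients reach the x-encoder and predictor only
    return loss
```

After each optimizer step, the y-encoder weights would then be updated as an exponential moving average of the x-encoder weights rather than by gradient descent (see Appendix B).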
Figure 3 V-JEPA. Training operates on a video clip of T frames with spatial resolution H × W , flattened into a sequence
of L tokens. (Left to right): We first obtain the input of the x-encoder by dropping tokens from the video clip. The
x-encoder then processes the masked video sequence, and outputs an embedding vector for each input token. Next, the
outputs of the x-encoder are concatenated with a set of learnable mask tokens containing positional embeddings of the masked
spatio-temporal patches. The predictor network processes the combined token sequence, and outputs an embedding vector for
each mask token. The outputs of the predictor are then regressed to the prediction targets using an L1 loss. The prediction
targets correspond to the output of the y-encoder.
3.2 Prediction Task: Predicting y from x

The feature prediction task is based on a masked modeling formulation (He et al., 2021; Tong et al., 2022); i.e., regions x and y from the video are sampled using masking. To sample y from a video, we sample several (possibly overlapping) spatially continuous blocks with various aspect ratios and repeat the spatial blocks across the entire temporal dimension of the video; x is taken to be the complement. Masking a large continuous block that covers the full temporal dimension limits information leakage due to the spatial and temporal redundancy of videos, and results in a harder prediction task (Tong et al., 2022).

We leverage two types of masks: short-range masks, where we take the union of 8 randomly sampled target blocks covering 15% of each frame, and long-range masks, where we take the union of 2 randomly sampled target blocks covering 70% of each frame. In both cases, the aspect ratio for all sampled blocks is randomly chosen in the range (0.75, 1.5). Given that both short-range and long-range masks are produced by sampling many blocks and taking their union, the result is an average masking ratio of ∼90%. We refer to our masking strategy as multi-block, and compare it to other possible masking strategies in Section 4.

3.3 Network Parameterization

We use a Vision Transformer (ViT) (Dosovitskiy et al., 2020; Arnab et al., 2021) as our video backbone. To process a video with a transformer network, we split the video clip into a 3D grid of L spatio-temporal patches, where a patch consists of a 16 × 16 pixel block spanning 2 consecutive frames; we refer to these spatio-temporal patches as tokens. This sequence of tokens is then directly processed by the stack of transformer blocks. Since the inputs x and y correspond to masked regions of a video, we apply the video masks by simply dropping a subset of the tokens. We apply masking at the input of the x-encoder, and at the output of the y-encoder to construct contextualized targets (Baevski et al., 2022b). The encoder is parameterized using standard ViT networks, while the predictor is a narrow transformer implemented using 12 blocks with an embedding dimension of 384. Taking inspiration from masked autoencoders (He et al., 2021), our predictor takes as input the sequence of embeddings produced by the x-encoder as well as a sequence of learnable mask tokens with positional embeddings indicating the spatio-temporal positions of the y tokens. The output of the predictor is an embedding vector for each mask token; see Figure 3 and refer to Appendix B for more details.

3.4 Pretraining Data and Evaluation Setup

Pretraining. We combine several public datasets to construct an unsupervised video pretraining dataset, which we refer to as VideoMix2M. Specifically, we combine the videos from HowTo100M (HT) (Miech et al., 2019), Kinetics-400/600/700 (K710) (Kay et al., 2017), and Something-Something-v2 (SSv2) (Goyal et al., 2017), and remove any overlap with the validation sets of Kinetics-400/600/700 and Something-Something-v2, resulting in approximately 2 million videos. We train a ViT-L/16, a ViT-H/16, and a ViT-H/16384 transformer model on VideoMix2M. We use a batch size of 3072 for the ViT-L/16 and ViT-H/16 models, and a batch size of 2400 for the ViT-H/16384 model. Each model takes as input a video clip of 16 frames sampled with a frameskip of 4, corresponding to roughly 3 second clips on average. The ViT-L/16 and ViT-H/16 process the video at a spatial resolution of 224, while the ViT-H/16384 uses an input resolution of 384; cf. Appendix C.
Table 1 Pixels vs. Featurized Targets. We ablate the effect of computing the prediction loss in feature space vs pixel space. All
models are trained on VideoMix2M for 90K iterations with a batch size of 3072 using the multi-block prediction task. We
examine downstream performance using a frozen backbone with attentive probing, and report top-1 accuracy using a single
center view. We also examine end-to-end fine-tuning performance of the models on K400. Predicting in feature space provides a consistent improvement over pixel space prediction.
Table 2 Pretraining Data Distribution. We pretrain all models for 90K iterations using a batch size of 3072, and evaluate
downstream performance of the frozen backbones with an attentive probe using a single center view. Average performance
across tasks increases with the pretraining dataset size.
Evaluations. Pretrained models are evaluated on downstream video and image tasks. On video tasks, we use a subset of the VideoGLUE benchmark (Yuan et al., 2023) to test for various capabilities; specifically, we investigate action recognition on Kinetics-400 (K400) (Kay et al., 2017), motion classification on Something-Something-v2 (SSv2) (Goyal et al., 2017), and action localization on AVA (Gu et al., 2018). Action classification on Kinetics evaluates the appearance-based understanding of the model, as many action classes in the dataset can be inferred from the presence of specific objects in the video (Sevilla-Lara et al., 2021). Motion classification on Something-Something-v2 evaluates the temporal understanding of the model, as action classes in the dataset are decoupled from the appearance/presence of specific objects in the video (Goyal et al., 2017). Finally, action localization on AVA evaluates the ability of the model to understand and localize motions in the video. We follow standard practice and report accuracy on K400 and SSv2 by sampling several spatial and temporal views. For static image tasks, we explore object recognition on ImageNet (Russakovsky et al., 2015), scene classification on Places205 (Zhou et al., 2014), and fine-grained recognition on iNaturalist 2021 (Van Horn et al., 2018).

4 What Matters for Learning Representations from Video?

In this section we isolate the contributions of several design choices, including: a) the use of a feature prediction versus pixel prediction objective, b) the construction of the pretraining data distribution, c) the feature pooling strategy for leveraging the model’s representations in downstream tasks, and d) the masking strategy, towards identifying: what to predict from what?

4.1 Predicting Representations versus Pixels

We first ablate the effect of computing the prediction loss in representation space. We train a pair of ViT-L/16 models using either a V-JEPA feature prediction loss, or a mean-squared error loss with the normalized pixel values, as in masked autoencoders (He et al., 2021), and perform a sweep over the learning rate and weight decay schedules for both approaches. All models are pretrained on VideoMix2M for 90K iterations with a batch size of 3072 using multi-block masking. We examine performance on Kinetics-400 (K400), Something-Something-v2 (SSv2), and ImageNet-1K (IN1K), using a frozen backbone with an attentive probe, and report top-1 accuracy using a single center view. We also examine end-to-end fine-tuning performance of the models on Kinetics-400. Results of this comparison are reported in Table 1 and indicate that predicting in feature space provides a consistent performance improvement over pixel space prediction in both frozen evaluation of the video backbone, as well as end-to-end fine-tuning.

4.2 Pretraining Data Distribution

Next we study the impact of the pretraining data distribution in Table 2.
Table 3 Average Pooling vs. Adaptive Pooling. We pool the feature map output by the frozen V-JEPA encoder using an attentive probe, which is then fed into a linear classifier for downstream supervised tasks (K400 and SSv2). We evaluate two pooling strategies: 1) average pooling (Avg.), and 2) attentive pooling (Att.). Results are reported using a single center view. Using adaptive pooling with a cross-attention layer leads to improvements of +17.3 points on K400 and +16.1 points on SSv2.

Frozen Evaluation
                      K400 (16×1×1)     SSv2 (16×1×1)
Method   Arch.        Avg.    Att.      Avg.    Att.
V-JEPA   ViT-L/16     56.7    73.7      50.1    66.2

Table 4 Ablating Prediction Task. Models are ViT-L/16 networks pretrained on K710 and SSv2 and evaluated with an attentive probe using a single center view. The region x is sampled by masking spatio-temporal regions in the video; y is the mask complement. 1) random-tube[r]: x is obtained by masking a fraction r of tubes (spatial patches extended across the entire temporal duration) from the video, 2) causal multi-block[p]: x is restricted to the first p frames of the 16-frame video, which are then masked with a random set of spatio-temporal blocks, 3) multi-block: x is obtained by masking a random set of spatio-temporal blocks from the entire video. Best performance is obtained by using multi-block masking.

Frozen Evaluation
                           K400         SSv2         IN1K
Masking                    (16×1×1)     (16×1×1)
random-tube[0.9]           51.5         46.4         55.6
causal multi-block[6]      61.3         49.8         66.9
causal multi-block[12]     71.9         63.6         72.2
multi-block                72.9         67.4         72.8

Leveraging large-scale datasets has been critical for enabling the surge of advancements in other modalities, such as text and images (Kaplan et al., 2020; Cherti et al., 2023). We investigate whether a similar trend holds for video data. To control for the possible confounding variable of compute budget, we pretrain all models in Table 2 for 90K iterations using a batch size of 3072. We report downstream results on K400, SSv2, and IN1K using a frozen backbone with an attentive probe, and report top-1 accuracy using a single center view.

Table 2 shows that average performance across tasks monotonically increases as we increase the size of the pretraining dataset, but the best task-specific performance is obtained by independently selecting the pretraining data for each specific downstream task. For instance, the L/16 obtains its best SSv2 performance when pretrained on K710+SSv2, its best K400 performance when pretrained only on K710, and its best IN1K performance when pretrained only on K710+HT. The best average performance across all tasks is achieved by pretraining on VideoMix2M, which combines all the data sources. Similarly, the H/16 pretrained on K710+SSv2 achieves a greater K400 score than the H/16 pretrained on VideoMix2M; however, the top-performing H/16 on average is pretrained on VideoMix2M.

4.3 Evaluation: Attentive Probing

Next we explore the feature pooling strategy for applying the model’s representations in downstream tasks. Since the prediction objective in equation (1) is unnormalized, there is no a priori reason for the encoder to yield a linearly separable subspace (Chen et al., 2020). Thus, rather than using a linear operation (averaging) to pool the features output by the frozen backbone, we explore a learnable non-linear pooling strategy. Specifically, when evaluating the frozen pretrained backbone on downstream tasks, we learn a cross-attention layer with a learnable query token. The output of the cross-attention layer is then added back to the query token (residual connection), and then fed into a two-layer MLP with a single GeLU activation, followed by a LayerNorm, and finally a linear classifier.

In Table 3 we see that using adaptive pooling with a learnable cross-attention layer leads to a significant improvement of +17 points on K400 and +16.1 points on SSv2. Using an attentive probe is also beneficial for other baseline models, as reported in Appendix E.

4.4 Prediction Task: Predicting y from x

We conduct an ablation on the masking strategy used in V-JEPA pretraining. We examine the following masking strategies: random-tube[r], in which x is obtained by removing a random fraction r of tubes (spatial patches extended across the entire temporal duration) from the video; causal multi-block[p], in which x is restricted to the first p frames of the 16-frame video, which are then masked with a random set of spatio-temporal blocks; and multi-block, in which x is obtained by masking a random set of spatio-temporal blocks from the entire video. Spatio-temporal blocks are sampled using the parameters described in Section 3.2; an ablation on the size and quantity of masked spatio-temporal blocks is provided in Appendix E.4.

Table 4 indicates that the best results are obtained by sampling x using a multi-block strategy, wherein the network is forced to make predictions after removing large continuous blocks in the video. When x is only sampled from the first few frames of the video, as in the causal multi-block strategy, we observe a decrease in downstream performance. Finally, the random-tube strategy, wherein 90% of the tubes in the video are randomly masked, leads to features of low semantic quality when combined with our feature prediction objective.
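For reference, below are minimal sketches of the two alternative maskers in this ablation, defined over an (8, 14, 14) grid of spatio-temporal tokens; these are illustrative reconstructions from the descriptions above, not the exact samplers used in the paper.

```python
import numpy as np

def random_tube_mask(r, grid_t=8, grid_h=14, grid_w=14):
    """random-tube[r]: mask a random fraction r of spatial positions ("tubes")
    across the entire temporal duration of the clip."""
    n_spatial = grid_h * grid_w
    chosen = np.random.choice(n_spatial, size=int(r * n_spatial), replace=False)
    mask = np.zeros(n_spatial, dtype=bool)
    mask[chosen] = True
    return np.broadcast_to(mask.reshape(1, grid_h, grid_w), (grid_t, grid_h, grid_w))

def causal_restriction(p, tubelet_size=2, grid_t=8):
    """causal multi-block[p]: x may only be sampled from the first p frames,
    i.e. the first p // tubelet_size temporal token slices; the remaining
    slices are always masked (OR-ed with a multi-block spatial mask)."""
    always_masked = np.zeros(grid_t, dtype=bool)
    always_masked[p // tubelet_size:] = True
    return always_masked
```

With r = 0.9 the first sampler reproduces the random-tube[0.9] setting, and p ∈ {6, 12} corresponds to the two causal variants in Table 4.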
Table 5 Comparison with Pixel Prediction Methods. We compare V-JEPA with OmniMAE (Girdhar et al., 2023), VideoMAE (Tong et al., 2022), and Hiera (Ryali et al., 2023), which leverage a pixel-reconstruction loss. All models are trained using a ViT-L architecture or a comparable Hiera-L. We evaluate the approaches on downstream image tasks (IN1K, Places205, iNat21) and video tasks (K400, SSv2, AVA) in both frozen evaluation (with a frozen backbone) and end-to-end fine-tuning. All models are evaluated at resolution 224. On K400 and SSv2 we follow the standard practice of reporting accuracy from several spatial and temporal views from the video. In frozen evaluation, V-JEPA outperforms the baselines on all downstream tasks, except ImageNet, where the model achieves 74.8% compared to 75.1% of an OmniMAE model trained directly on ImageNet. V-JEPA also achieves the best fine-tuning performance among all ViT-L models and matches the Hiera-L on SSv2. The V-JEPA results are achieved while processing significantly fewer examples during pretraining.
Table 6 Comparison with State-of-the-Art Models. We compare V-JEPA with state-of-the-art baselines in frozen evaluation
with an attentive probe on downstream image tasks (IN1K, Places205, iNat21) and video tasks (K400, SSv2, AVA). All models
are evaluated at resolution 224, except I-JEPA512 and V-JEPA384 which are evaluated respectively at resolution 512 and
384. On K400 and SSv2 we follow the standard practice of reporting accuracy from several spatial and temporal views
from the video. Compared to other video baselines, V-JEPA exhibits a consistent improvement across all downstream tasks.
Compared to image-models that excel under the frozen evaluation, V-JEPA shows a significant performance improvement on
tasks requiring motion understanding (+21 points on SSv2), and reduces the gap between video and image models on tasks
requiring static appearance-based features.
5 Comparison with Prior Work

In Section 5.1, we investigate the impact of feature prediction by comparing V-JEPA with video approaches that rely on pixel prediction, while using a similar architecture for all baselines. Subsequently, in Section 5.2, we remove the architectural constraint and report the best performance across architectures for self-supervised video and image pretraining approaches. Finally, we explore the label-efficiency of V-JEPA relative to other self-supervised video pretraining approaches in Section 5.3. We further detail the evaluation setup in Appendix D.

5.1 Comparison with Pixel Prediction

To investigate the effectiveness of feature prediction pretraining, we first compare V-JEPA to video masked modeling approaches relying on a pixel prediction loss. We control for the possible confounding factor of model architecture by evaluating all models using either a ViT-L/16 encoder or a Hiera-L encoder, which has a similar number of parameters. For the pixel prediction baselines we consider VideoMAE (Tong et al., 2022; Wang et al., 2023a), which trains vision transformer autoencoders exclusively on video, Hiera (Ryali et al., 2023), which trains a hierarchical transformer autoencoder on video, and OmniMAE (Girdhar et al., 2023), which trains a vision transformer autoencoder on static images and video simultaneously.

Table 5 examines both frozen evaluation with an attentive probe on downstream video and image tasks, as well as end-to-end fine-tuning. In frozen evaluation, V-JEPA outperforms the baselines on all downstream tasks, except ImageNet, where we achieve 74.8% compared to 75.1% of an OmniMAE model trained directly on ImageNet; hence, V-JEPA achieves comparable ImageNet performance despite only pretraining on video.
Figure 4 SSv2 fine-tuning performance vs. Samples Seen. We report SSv2 fine-tuning for V-JEPA and pixel-reconstruction baselines using a ViT-L/16 or Hiera-L architecture. V-JEPA outperforms all pixel-reconstruction methods using a ViT-L/16 and matches the Hiera-L performance while seeing significantly fewer samples during pretraining.

Figure 5 SSv2 frozen-evaluation performance vs. Pretraining Time. Wallclock times for all methods are measured on a single GPU with a batch size of 10 clips, using the official codebases for VideoMAE and VideoMAEv2, and linearly extrapolated assuming a global batch size of 2400 samples. However, note that the SSv2 accuracies of video pixel prediction methods are actually obtained with small batch sizes and significantly longer training schedules. V-JEPA outperforms pixel-reconstruction methods while training significantly faster.
Table 7 Low-Shot Frozen Evaluation. Comparing V-JEPA to other video models in frozen evaluation on Kinetics-400 and
Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive
probe. We train the probes in several low-shot settings: using either 5% of the train set, 10%, or 50%, and take 3 random
splits in each setting to obtain more robust metrics, resulting in 9 different evaluation experiments for each model. We report
the mean performances and standard deviation using the K400 and SSv2 validation sets. V-JEPA is more label-efficient than
other models; specifically, decreasing the available number of labeled examples from each class increases the performance gap
between V-JEPA and the baselines.
Frozen Evaluation
                             K400 (16×8×3)                           SSv2 (16×2×3)
Method        Arch.          5%           10%          50%           5%           10%          50%
MVD           ViT-L/16       62.6 ± 0.2   68.3 ± 0.2   77.2 ± 0.3    42.9 ± 0.8   49.5 ± 0.6   61.0 ± 0.2
VideoMAE      ViT-H/16       62.3 ± 0.3   68.5 ± 0.2   78.2 ± 0.1    41.4 ± 0.8   48.1 ± 0.2   60.5 ± 0.4
VideoMAEv2    ViT-g/14       37.0 ± 0.3   48.8 ± 0.4   67.8 ± 0.1    28.0 ± 1.0   37.3 ± 0.3   54.0 ± 0.3
V-JEPA        ViT-H/16       67.0 ± 0.2   72.1 ± 0.1   80.2 ± 0.2    51.9 ± 0.3   57.5 ± 0.4   67.3 ± 0.2
V-JEPA        ViT-H/16384    68.2 ± 0.2   72.8 ± 0.2   80.6 ± 0.2    54.0 ± 0.2   59.3 ± 0.5   67.9 ± 0.2
…layer attentive probe, which can be further improved to 77.9% using a two-layer attentive probe. More generally, we hypothesize that the datasets used to train V-JEPA and other video models are too constrained and lack the visual diversity of the internet-scale pretraining data used by the image models; as such, there is value in focusing future work on building diverse publicly available video datasets.

…to 54.0% top-1 when we reduce the number of labeled examples by a factor of 10× (from roughly 440 examples per class to 48 examples per class). By contrast, VideoMAEv2 drops by 26% to 28.0% top-1, VideoMAE drops by 19.1% to 41.4% top-1, and MVD drops by 18.1% to 42.9% top-1.
(a) Visualization Methodology. We train a conditional diffusion model to decode the V-JEPA feature-space predictions to interpretable pixels; the pretrained V-JEPA encoder and predictor networks are kept frozen in this process. The decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video.

(b) Visualizations. First Row: Masked videos used as input to the V-JEPA models (a pretrained ViT-H/16 encoder and its corresponding predictor network). Other rows: Bounding boxes contain various samples from the decoder overlaid on the original video. V-JEPA is not a generative model and the decoder does not have access to the context (first row), so we do not expect samples to exactly match the input. This experiment qualitatively illustrates what information is encoded and predicted by V-JEPA. In particular, characteristics that are common across samples represent information that is encoded in the V-JEPA predictions. V-JEPA generates predictions that are spatially and temporally coherent with the unmasked regions of the video. The predictions also capture consistent motion through time.
References

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE international conference on computer vision, 2021.

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. arXiv preprint arXiv:2204.07141, 2022.

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.

Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli. Efficient self-supervised learning with contextualized target representations for vision, speech and language. arXiv preprint arXiv:2212.07525, 2022a.

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555, 2022b.

Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.

Pietro Berkes and Laurenz Wiskott. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of vision, 5(6):9–9, 2005.

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026, 2022.

Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021.

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.

Ekin Dogus Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, and Kaiming He. A large-scale study on unsupervised spatiotemporal representation learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2021.

Christoph Feichtenhofer, Yanghao Li, Kaiming He, et al. Masked autoencoders as spatiotemporal learners. Advances in neural information processing systems, 35:35946–35958, 2022.

David J Field. What is the goal of sensory coding? Neural computation, 6(4):559–601, 1994.

Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Learning representations by predicting bags of visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6928–6938, 2020.

Rohit Girdhar and Kristen Grauman. Anticipative video transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13505–13515, 2021.

Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Omnimae: Single model masked pretraining on images and videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10406–10417, 2023.

Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE international conference on computer vision, pages 4086–4093, 2015.

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.

Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056, 2018.

Agrim Gupta, Jiajun Wu, Jia Deng, and Li Fei-Fei. Siamese masked autoencoders. arXiv preprint arXiv:2305.14344, 2023.

Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of machine learning research, 13(2), 2012.

Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.

Tengda Han, Weidi Xie, and Andrew Zisserman. Memory-augmented dense predictive coding for video representation learning. In European conference on computer vision, pages 312–329. Springer, 2020.

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.

Geoffrey E Hinton. Connectionist learning procedures. In Machine learning, pages 555–610. Elsevier, 1989.

Tarun Kalluri, Deepak Pathak, Manmohan Chandraker, and Du Tran. Flavr: Flow-agnostic video representations for fast frame interpolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2071–2082, 2023.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

Christoph Kayser, Wolfgang Einhäuser, Olaf Dümmer, Peter König, and Konrad Körding. Extracting slow subspaces from natural videos leads to complex cells. In Artificial Neural Networks—ICANN 2001, pages 1075–1080. Springer, 2001.

Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. 2016.

Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. 2017.

Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. 2022.

Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE international conference on computer vision, pages 667–676, 2017.

Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676, 2022.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69–84. Springer, 2016.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

Nikhil Parthasarathy, SM Eslami, João Carreira, and Olivier J Hénaff. Self-supervised video pretraining yields strong image representations. arXiv preprint arXiv:2210.06433, 2022.

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.

Silvia L Pintea, Jan C van Gemert, and Arnold WM Smeulders. Déja vu: Motion prediction in static images. In Computer Vision–ECCV 2014, pages 172–187. Springer, 2014.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79–87, 1999.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. arXiv preprint arXiv:2306.00989, 2023.

Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, and Lorenzo Torresani. Only time can tell: Discovering temporal data for temporal modeling. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 535–544, 2021.

Elizabeth S Spelke, Peter Vishton, and Claes Von Hofsten. Object perception, object-directed action, and physical knowledge in infancy. 1995.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852. PMLR, 2015.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7464–7473, 2019.

Dídac Surís, Ruoshi Liu, and Carl Vondrick. Learning the predictability of the future. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12607–12617, 2021.

Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A Plummer, Kate Saenko, Karl Ridgeway, and Lorenzo Torresani. Multiscale video pretraining for long-term activity forecasting. arXiv preprint arXiv:2307.12854, 2023.

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780, 2017.

Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In International Conference on Machine Learning, pages 10268–10278. PMLR, 2021.

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022.

Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 1096–1103, 2008.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.

Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 98–106, 2016.

Fei Wang, Ping Li, and Arnd Christian Konig. Learning a bi-stochastic data similarity matrix. In 2010 IEEE International Conference on Data Mining, pages 551–560. IEEE, 2010.

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023a.

Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, and Yu-Gang Jiang. Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6312–6322, 2023b.

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.

Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural computation, 14(4):715–770, 2002.

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3733–3742, 2018.

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886, 2021.

Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10334–10343, 2019.

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021.

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.

Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, et al. Videoglue: Video general understanding evaluation of foundation models. arXiv preprint arXiv:2307.03166, 2023.

Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022.

Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. https://proceedings.neurips.cc/paper/2014/file/3fe94a002317b5f9259f82690aeea4cd-Paper.pdf.

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.

Will Zou, Shenghuo Zhu, Kai Yu, and Andrew Ng. Deep learning of invariant features via simulated fixations in video. Advances in neural information processing systems, 25, 2012.
Appendix
B Extended Description of V-JEPA
In this section, we provide an in-depth description of our approach V-JEPA that is illustrated in Figure 3.
Input. Unless stated otherwise, during pretraining, we always randomly sample a clip of 16 frames from
each input video with a temporal stride of 4 between sampled frames. An input video clip therefore covers 64 frames
in total, or roughly 2 seconds of a given video running at 30 frames per second. We then resize the video’s spatial
dimensions to 224 × 224, resulting in an overall shape of 16 × 224 × 224 × 3 for the entire clip. Since ViT networks
process a 1D sequence of tokens, we must convert an input video clip into a 1D token sequence. To do so, we apply a
3D convolution comprising d filters of size 2 × 16 × 16 with a temporal stride of 2 and a spatial stride of 16, resulting
in a tensor of shape 8 × 14 × 14 × d. Next we add absolute 3D sin-cos positional embeddings to the spatio-temporal
feature map and flatten it, resulting in a 1D token sequence of shape 1568 × d. This process is demonstrated in
Figure 7.
Figure 7 V-JEPA training operates on a video clip flattened into a sequence of tokens. To convert a video clip of size
16 × 224 × 224 × 3 into a 1D token sequence, we apply a 3D convolution comprising d filters of size 2 × 16 × 16 with a temporal
stride of 2 and a spatial stride of 16, resulting in a tensor of shape 8 × 14 × 14 × d. Next we add absolute 3D sin-cos positional
embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape 1568 × d.
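A minimal sketch of this tokenization step in PyTorch; the embedding dimension d is a placeholder, and the positional-embedding helper is omitted:

```python
import torch
import torch.nn as nn

d = 1024                                    # embedding dimension (placeholder)
patchify = nn.Conv3d(in_channels=3, out_channels=d,
                     kernel_size=(2, 16, 16), stride=(2, 16, 16))

clip = torch.randn(1, 3, 16, 224, 224)      # (batch, channels, frames, height, width)
feat = patchify(clip)                       # -> (1, d, 8, 14, 14)
tokens = feat.flatten(2).transpose(1, 2)    # -> (1, 1568, d) token sequence

# Absolute 3D sin-cos positional embeddings would be added to `tokens` here,
# before the sequence is passed to the transformer blocks.
```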
V-JEPA. We sample both a video clip, and a video mask in each iteration. We denote a video clip represented as
a 1D token sequence of length L = 1568 by xL = (x1 , . . . , xL ). Similarly, given a mask of M < L patches, leaving
N = L − M patches unmasked, we denote the indices of masked patches by (i1 , . . . , iM ) and its complement (the
indices of unmasked patches) by (j1 , . . . , jN ).
Computing the x-representations. To compute the V-JEPA loss, we first produce the x-representations by masking
the video clip and feeding it into the x-encoder; we denote the masked video by xN = (xj1 , . . . , xjN ). Applying the x-
encoder Eθ (·) to the masked clip gives a sequence of patch representations, denoted as zN = Eθ (xN ) = (zj1 , . . . , zjN ).
Predicting the target. Next, the V-JEPA predictor network Pϕ (·, ·) takes as input the tokens produced by the
x-encoder and predicts the missing regions in the video clip, which are specified by a set of learnable mask tokens.
Specifically, the mask tokens are parameterized as the sum of a shared learnable vector and an absolute 3D
sin-cos positional embedding, denoted by mM = (mi1 , . . . , miM ). The output of the predictor is thus given by,
ŝM = Pϕ (zN , mM ) = (ŝi1 , . . . , ŝiM ), corresponding to a d-dimensional output for each of the M masked patches.
Computing the y-representations. Finally to compute the prediction targets, the entire unmasked video clip is
processed by the y-encoder to obtain a set of target representations, denoted by sL = E θ (xL ) = (s1 , . . . , sL ). The
V-JEPA loss is now computed as

Loss = (1/M) Σ_{k ∈ (i1, ..., iM)} ∥ŝk − sk∥1,   (2)
which is simply the average L1 distance between the output of the predictor and the y-encoder. We then compute a
gradient update with respect to the parameters of the x-encoder, θ, and the predictor, ϕ, and subsequently update
the parameters of the y-encoder as an exponential moving average of the context encoder weights (Polyak average).
Table 8 pretraining hyper-parameters for V-JEPA.
Multi-Mask Prediction. To increase the efficiency of V-JEPA, we use a multi-masking strategy (Caron et al.,
2020; Baevski et al., 2022a), which enables us to amortize the cost of the target computation. As mentioned in
Section 3, for a given video clip, we sample 2 different masks, short-range and long-range. While we need to forward
propagate the x-encoder and predictor separately for each mask, we only need to compute the y-representation once.
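A sketch of how the target computation is amortized across the two masks, reusing the placeholder module names from the earlier sketch (x_encoder, predictor, y_encoder); the mask variables are assumed to be (context-index, target-index) pairs:

```python
import torch
import torch.nn.functional as F

# Compute the y-representation once per clip...
with torch.no_grad():
    targets = y_encoder(video_tokens)            # full, unmasked clip

# ...then reuse it for both the short-range and the long-range mask.
loss = 0.0
for ctx_idx, tgt_idx in (short_range_mask, long_range_mask):
    z_x = x_encoder(video_tokens[:, ctx_idx])    # separate forward pass per mask
    pred = predictor(z_x, target_positions=tgt_idx)
    loss = loss + F.l1_loss(pred, targets[:, tgt_idx])
```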
C Pretraining details
In this section, we report V-JEPA pretraining details. Table 8 summarizes the main hyperparameters used during
pretraining.
Architectures. We use Vision Transformer (Dosovitskiy et al., 2020) (ViT) architectures for the x-encoder and
y-encoder. We train three V-JEPA encoders: a ViT-L/16224 , a ViT-H/16224 and a ViT-H/16384 . All three encoders
take as input a short video clip of 16 frames with a temporal stride of 4 between consecutive frames. The subscripts,
224 and 384, indicate the spatial resolution of the video clip. V-JEPA flattens the video clip into a sequence of
non-overlapping spatio-temporal patches of size 16 × 16 × 2 (see Figure 7). For all three models, the predictor is
designed as a narrow ViT architecture, consisting of 12 transformer blocks with an embedding dimension of 384. For
simplicity, we keep the number of self-attention heads in the predictor equal to that of the backbone used for the
context-encoder/target-encoder. V-JEPA is pretrained without using a [cls] token.
Optimization. We use AdamW (Loshchilov and Hutter, 2017) to optimize the x-encoder and predictor weights.
The ViT-L/16224 and ViT-H/16224 models use a batch size of 3072 while the ViT-H/16384 uses a batch size of
2400. Models are trained for a total of 90,000 iterations. The learning rate is linearly increased from 2 × 10−4
to 6.25 × 10−4 during the first 12, 000 iterations of pretraining, and decayed to 10−6 following a cosine schedule.
Table 9 Frozen Evaluation hyper-parameters.
Weight-decay is also linearly increased from 0.04 to 0.4 throughout pretraining. The y-encoder weights are initialized
identically to the x-encoder, and subsequently updated as an exponential moving average (EMA) (Tarvainen and
Valpola, 2017) of the x-encoder weights using a momentum value which starts at 0.998 and is linearly increased to
1.0 during training (Caron et al., 2021; Assran et al., 2022). We scale all hyper-parameter schedules 25% beyond
the actual training schedule. Specifically, the learning rate schedule, weight-decay schedule, and EMA schedule
are computed assuming a training length of 112,500 iterations, even though we only train our model for 90,000
iterations. We found the last 25% of the default scheduler period to update hyper-parameters too aggressively, and
simply truncating the schedulers improved performance.
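The schedules described above can be written as simple functions of the iteration index. Below is a sketch under the stated values (learning rate warmed up from 2e-4 to 6.25e-4 and cosine-decayed to 1e-6, weight decay 0.04 to 0.4, EMA momentum 0.998 to 1.0, all computed over 112,500 iterations but only applied for 90,000); the exact functional forms in the released code may differ:

```python
import math

TOTAL, USED, WARMUP = 112_500, 90_000, 12_000   # schedule length, used steps, warmup

def lr_at(step):
    if step < WARMUP:                            # linear warmup
        return 2e-4 + (6.25e-4 - 2e-4) * step / WARMUP
    t = (step - WARMUP) / (TOTAL - WARMUP)       # cosine decay over the full schedule
    return 1e-6 + 0.5 * (6.25e-4 - 1e-6) * (1 + math.cos(math.pi * t))

def wd_at(step):
    return 0.04 + (0.4 - 0.04) * step / TOTAL    # linearly increasing weight decay

def ema_momentum_at(step):
    return 0.998 + (1.0 - 0.998) * step / TOTAL  # linearly increasing EMA momentum

# Training stops at USED = 90,000 iterations, so only the first 80% of each
# schedule is ever applied (the truncation described above).
```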
Masking. As described in Section 3, we propose a 3D Multi-Block masking strategy. We use two types of masks:
short-range masks, where we take the union of 8 randomly sampled target blocks with a spatial scale of 0.15, and
long-range masks, where we take the union of 2 randomly sampled target blocks with a spatial scale of 0.7. In both
cases, the aspect ratio for all sampled blocks is randomly chosen in the range (0.75, 1.5).
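A sketch of the 3D multi-block sampler under these settings (blocks repeated across the full temporal dimension of an 8 × 14 × 14 token grid); this is an illustrative reconstruction, not the paper's exact sampler:

```python
import math
import random
import numpy as np

def sample_multiblock_mask(num_blocks, spatial_scale, grid_t=8, grid_h=14, grid_w=14,
                           aspect_ratio=(0.75, 1.5)):
    """Return a boolean (T, H, W) array that is True for masked (target) tokens.
    Each block spans the full temporal dimension; the mask is the union of blocks."""
    mask = np.zeros((grid_t, grid_h, grid_w), dtype=bool)
    for _ in range(num_blocks):
        ar = random.uniform(*aspect_ratio)
        area = spatial_scale * grid_h * grid_w
        h = min(grid_h, max(1, round(math.sqrt(area * ar))))
        w = min(grid_w, max(1, round(math.sqrt(area / ar))))
        top = random.randint(0, grid_h - h)
        left = random.randint(0, grid_w - w)
        mask[:, top:top + h, left:left + w] = True   # repeat across all frames
    return mask

short_range = sample_multiblock_mask(num_blocks=8, spatial_scale=0.15)
long_range = sample_multiblock_mask(num_blocks=2, spatial_scale=0.70)
```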
D Evaluation details
where Wk, Wv ∈ Rd×d are the key and value matrices, and q ∈ Rd is a learnable query token. The output of the cross-attention is then added back to the query token (residual connection), and then fed into a two-layer MLP with a single GeLU activation, followed by a LayerNorm, and finally a linear classifier. The parameters of the cross-attention block are jointly learned with those of the linear classifier for the downstream task, while the encoder parameters are kept frozen. Note that, in practice, we actually use an attentive probe with 12 heads, each of dimension 12. In
Appendix E we show that baselines benefit from the attentive probing protocol.
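A sketch of the attentive probe as just described: a single learnable query token, one cross-attention layer, a residual connection back to the query, a two-layer MLP with a GeLU, a LayerNorm, and a linear classifier. The head count and initialization here are placeholders (the paper's probe uses 12 heads of dimension 12), so treat this as an illustrative reimplementation:

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim, num_classes, num_heads=16):   # dim must be divisible by num_heads
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learnable query token
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, tokens):                 # tokens: (B, N, dim) frozen encoder features
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.cross_attn(q, tokens, tokens)
        pooled = pooled + q                    # residual connection to the query token
        pooled = self.norm(self.mlp(pooled))
        return self.classifier(pooled.squeeze(1))

# probe = AttentiveProbe(dim=1024, num_classes=400)   # e.g. ViT-L features, K400 classes
```

Only the probe's parameters are trained; the encoder producing `tokens` stays frozen.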
Optimization. For all tasks, we use the AdamW optimizer with a cosine scheduler (no warmup) that decays the
learning rate from 0.001 to 0. We use a fixed weight-decay of 0.01 and apply simple data augmentations (random
resized crops and horizontal flips) during training of the attentive probe, except on image tasks, where we apply
AutoAugment (Dogus Cubuk et al., 2019). Table 9 reports the hyperparameters for each downstream evaluation.
Extension to multiple clips. Unless stated otherwise, our attentive probe takes 8 clips of 16 frames as input on
Kinetics, and 2 clips of 16 frames on Something-Something-v2, to increase the temporal coverage of the video.
Table 10 Frozen Detection hyper-parameters.
Specifically, we first divide a video into 8 (or 2) equal-length temporal segments, and sample 1 clip at random per
segment. The video encoder Eθ processes each clip separately and produces a clip-level feature map. The feature
maps for each clip are then concatenated together and fed to the attentive probe. At test time, we average the
predictions of 3 spatial views, following standard practice in video classification.
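A minimal sketch of the per-segment clip sampling, where the helper name and exact boundary handling are assumptions:

```python
import random

def sample_clip_starts(num_frames, num_segments=8, clip_len=16, frame_stride=4):
    """Divide a video into equal-length temporal segments and pick one clip start
    per segment (a sketch; exact boundary handling is an assumption)."""
    span = clip_len * frame_stride               # frames covered by one clip
    seg_len = num_frames // num_segments
    starts = []
    for s in range(num_segments):
        lo = s * seg_len
        hi = max(lo, min(num_frames - span, (s + 1) * seg_len - 1))
        starts.append(random.randint(lo, hi))
    return starts

print(sample_clip_starts(num_frames=2400))       # e.g. a 30 fps, 80-second video
# Each clip is encoded separately; the clip-level feature maps are then
# concatenated along the token axis before being fed to the attentive probe.
```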
Application of video models to images. To evaluate the video models on image tasks, we simply duplicate input
images to generate still video clips of 16 frames. We perform this duplication simply for convenience when evaluating
the video models; however, we find this step to be unnecessary in general. Given a video tokenizer implemented as a
3D-conv with a temporal stride of 2, it is sufficient to duplicate the image into a 2-frame video clip. This results
in the same number of input tokens as that produced by a static image model with a 2D-conv tokenizer.
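For example, assuming a standard (B, C, T, H, W) tensor layout, the duplication amounts to a repeat along the temporal axis:

```python
import torch

image = torch.randn(1, 3, 224, 224)                    # (B, C, H, W)

# Evaluation convenience: repeat the image along a new time axis to form a still clip.
clip16 = image.unsqueeze(2).repeat(1, 1, 16, 1, 1)     # (B, C, 16, H, W)

# With a 3D-conv tokenizer of temporal stride 2, duplicating to just 2 frames already
# yields (224 / 16) * (224 / 16) = 196 tokens, matching a 2D patch-16 image tokenizer.
clip2 = image.unsqueeze(2).repeat(1, 1, 2, 1, 1)       # (B, C, 2, H, W)
```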
Application of image models to videos. To evaluate image models such as DINOv2 and OpenCLIP on video tasks,
we simply process each frame independently with the image encoder to produce a frame-level feature map. The
feature maps for each frame are then concatenated and fed to the attentive probe, just as we do with the clip-level
feature maps when evaluating video models.
D.3 Finetuning
Following Tong et al. (2022), we finetune our model end-to-end with a linear classification head, using a layer-wise
learning-rate decay scheme and mixup in the data augmentation pipeline. We provide all hyper-parameters for both K400
and SSv2 in Table 11.
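A minimal sketch of such a layer-wise learning-rate decay, with an illustrative decay factor (the actual values are given in Table 11):

```python
def layerwise_lrs(num_blocks, base_lr, decay=0.75):
    """Per-layer learning rates for finetuning: the classification head (last entry)
    uses base_lr, and earlier transformer blocks are scaled down geometrically.
    The decay factor and indexing convention here are illustrative assumptions."""
    # index 0 = patch embedding, index num_blocks + 1 = classification head
    return [base_lr * decay ** (num_blocks + 1 - i) for i in range(num_blocks + 2)]

lrs = layerwise_lrs(num_blocks=24, base_lr=1e-3)   # e.g. a ViT-L/16 with 24 blocks
print(lrs[0], lrs[-1])                             # smallest (patch embed) vs. head lr
```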
E Extra Results
Table 11 Finetuning Evaluation hyper-parameters.
Table 12 Linear vs. Attentive Probe Evaluation for V-JEPA and VideoMAE. We evaluate the effect of linear (Lin.)
and attentive (Att.) probing when adapting V-JEPA and VideoMAE to the K400 (16 × 5 × 3) and SSv2 (16 × 2 × 2) tasks. Both V-JEPA and
VideoMAE benefit from using a non-linear attentive probe.
Method      Arch.       K400 Lin.   K400 Att.   SSv2 Lin.   SSv2 Att.
VideoMAE    ViT-L/16    52.5        77.8        41.3        61.2
V-JEPA      ViT-L/16    56.7        80.8        50.1        69.5
Table 13 Linear vs. Attentive Probe Evaluation for DINOv2 and OpenCLIP. We evaluate the effect of linear (Lin.)
and attentive probing (Att.) when adapting DINOv2 and OpenCLIP. Image baselines benefit from using an attentive probing
strategy. Results shown in gray are reported from the linear probe evaluation in Oquab et al. (2023).
One Clip vs Multiple clips. We examine the impact of changing the temporal coverage of a model during downstream
evaluation on K400 action classification. In Table 14, we evaluate VideoMAE and V-JEPA models using an attentive
probe with access to either the feature map of 1 clip randomly sampled from the video, or the concatenated feature
map of 8 clips randomly sampled from the video. To sample 8 clips from a video, we first divide the video into 8
equal-length temporal segments, and sample 1 clip at random from each segment. A single clip corresponds to ≈ 2
seconds of a video on average, while 8 clips correspond to ≈ 16 seconds. The video encoder processes each clip
separately to produce a clip-level feature map, and the clip-level feature maps are then concatenated at the input to the attentive probe.
Increasing the temporal coverage from 1 clip per video to 8 clips improves the performance of both V-JEPA and
VideoMAE on K400 action classification. We therefore use the multiclip attentive probing setup as our default
evaluation pipeline.
E.2 Finetuning
In Table 15, we evaluate V-JEPA using finetuning (separately) on K400 and SSv2. We compare V-JEPA with
VideoMAEv2 (Wang et al., 2023a), VideoMAE (Tong et al., 2022) and MVD (Wang et al., 2023b) using a ViT-L/16
or a ViT-H/16 architecture. V-JEPA obtains competitive performance using a finetuning protocol. With a ViT-H/16
architecture, V-JEPA outperforms VideoMAE by +1.2% and VideoMAEv2 by +0.3% on the SSv2 dataset, while obtaining
comparable performance on K400. V-JEPA also obtains performance similar to MVD on the SSv2 dataset. The
MVD model achieves the best performance across models on the K400 dataset, and is trained using the image
dataset ImageNet1K, in contrast to the other methods in the table, which only use video data. Additionally, MVD
requires processing significantly more samples during pretraining due to the cost of training its teacher encoder
networks in a pre-pretraining step.
Table 14 Temporal Coverage on Kinetics-400. We evaluate the effect of temporal coverage on K400. We train an attentive
probe on K400 using either 1 clip (≈ 2 seconds of a video) or 8 clips (≈ 16 seconds of a video). To sample N clips, we first
divide a video into N equal-length temporal segments and sample one clip at random per segment. The video encoder processes
each clip in parallel and all the encoder output tokens are concatenated at the input of the attentive probe. Increasing the
temporal coverage from 1 clip per video to 8 clips significantly improves the performance for both our VideoMAE baseline
and V-JEPA.
Table 15 Finetuning results. We evaluate a V-JEPA model with the finetuning protocol on the K400 and SSv2 datasets
using 16 frames per clip and multi-view fusion (5×3 or 2×3) for inference. The #Samples Seen entry corresponds to the
number of video clips processed during pretraining, which is larger than the size of the pretraining dataset for multi-epoch
training. We compare V-JEPA with different video self-supervised learning approaches. We report the VideoMAEv2 results
without instruction-tuning for consistency with the other approaches. V-JEPA obtains competitive performance using the
finetuning protocol.
We examine our multi-masking strategy and find sampling two masks for each clip (long-range and short-range) to
be more effective than sampling just a single mask for each clip.
In Figure 8c, we explore different average spatial and temporal masking ratios, i.e., the fraction of a clip's
spatial/temporal extent that is covered by a mask on average. Recall that each mask is constructed by sampling several
(possibly overlapping) blocks and taking their union. We change the average spatial or temporal masking ratio by
changing the spatial or temporal size of the blocks, as well as the overall number of blocks. We found that low spatial or
temporal coverage results in a trivial prediction task, which degrades downstream performance. Based on those
results, we sample masks that remove roughly 90% of the frame and extend along the entire temporal dimension of
the clip by default.
In Figure 8b, we explore different block sizes given an effective spatial masking ratio of 90% and a temporal ratio of
100%. We keep the masking ratio approximately constant by changing the block size and the number of blocks at the
same time. We find that sampling several blocks performs better than sampling a single large block. Figure 9
visually illustrates the effect of sampling several smaller blocks to construct a mask.
In Figure 8a, we explore the effect of varying the number of masks sampled per clip. We find that sampling two
masks for each clip, with different spatial block sizes for each, is more effective than sampling just a single mask.
We hypothesize that this masking strategy induces complementary prediction tasks. We use this two-mask sampling as
our default.
Table 16 Sample efficiency. We compare the sample efficiency of pretraining various state-of-the-art image and video models.
The #Samples Seen entry corresponds to the number of samples (image or video clips) processed by the network during
pretraining, which is larger than the size of the pretraining dataset for multi-epoch training. The V-JEPA results in this
paper are obtained while processing an order of magnitude fewer samples than previous methods.
[Figure 8: three panels reporting K400 accuracy, (a) Ablating Number of Masks per Sample, (b) Ablating Number of Blocks per Mask, and (c) Ablating Masking Ratio, the latter showing curves for temporal masking ratios of 50%, 75%, and 100% as a function of the spatial masking ratio.]
Figure 8 Masking Strategy Ablation. Evaluating a linear probe on a ViT-B/16 pretrained with V-JEPA on K400 under
various 3D Multi-Block masking settings. We examine the impact of (a) sampling several masks per video, (b) varying the
number of blocks in a mask, and (c) varying the average spatial and temporal masking ratio. A temporal masking ratio of
100% extends the spatial mask across all the frames in the clip. We find it important to maintain a high spatial and temporal
masking ratio during pretraining.
Figure 9 Illustration of masks with varying numbers of blocks and block sizes. Each mask is constructed by sampling several (possibly
overlapping) blocks and taking their union.