2301.08243 I-Jepa (2023)
Meta AI (FAIR) McGill University Mila, Quebec AI Institute New York University
arXiv:2301.08243v3 [cs.CV] 13 Apr 2023
[Figure 1 residue: plot legend entries include iBOT (ViT-S/16, 800ep), MAE, and data2vec, with ViT-H/14 (1600ep) and ViT-L/16 (1600ep) variants.]
(a) Joint-Embedding Architecture (b) Generative Architecture (c) Joint-Embedding Predictive Architecture
Figure 2. Common architectures for self-supervised learning, in which the system learns to capture the relationships between its inputs. The objective is to assign a high energy (large scalar value) to incompatible inputs, and to assign a low energy (low scalar value) to compatible inputs. (a) Joint-Embedding Architectures learn to output similar embeddings for compatible inputs x, y and dissimilar embeddings for incompatible inputs. (b) Generative Architectures learn to directly reconstruct a signal y from a compatible signal x, using a decoder network that is conditioned on additional (possibly latent) variables z to facilitate reconstruction. (c) Joint-Embedding Predictive Architectures learn to predict the embeddings of a signal y from a compatible signal x, using a predictor network that is conditioned on additional (possibly latent) variables z to facilitate prediction.
resulting representations are typically of a lower semantic level and underperform invariance-based pretraining in off-the-shelf evaluations (e.g., linear-probing) and in transfer settings with limited supervision for semantic classification tasks [4]. Consequently, a more involved adaptation mechanism (e.g., end-to-end fine-tuning) is required to reap the full advantage of these methods.

In this work, we explore how to improve the semantic level of self-supervised representations without using extra prior knowledge encoded through image transformations. To that end, we introduce a joint-embedding predictive architecture [48] for images (I-JEPA). An illustration of the method is provided in Figure 3. The idea behind I-JEPA is to predict missing information in an abstract representation space; e.g., given a single context block, predict the representations of various target blocks in the same image, where target representations are computed by a learned target-encoder network.

Compared to generative methods that predict in pixel/token space, I-JEPA makes use of abstract prediction targets for which unnecessary pixel-level details are potentially eliminated, thereby leading the model to learn more semantic features. Another core design choice to guide I-JEPA towards producing semantic representations is the proposed multi-block masking strategy. Specifically, we demonstrate the importance of predicting sufficiently large target blocks in the image, using an informative (spatially distributed) context block.

Through an extensive empirical evaluation, we demonstrate that:

• I-JEPA learns strong off-the-shelf representations without the use of hand-crafted view augmentations (cf. Fig. 1). I-JEPA outperforms pixel-reconstruction methods such as MAE [36] on ImageNet-1K linear probing, semi-supervised 1% ImageNet-1K, and semantic transfer tasks.

• I-JEPA is competitive with view-invariant pretraining approaches on semantic tasks and achieves better performance on low-level vision tasks such as object counting and depth prediction (Sections 5 and 6). By using a simpler model with less rigid inductive bias, I-JEPA is applicable to a wider set of tasks.

• I-JEPA is also scalable and efficient (Section 7). Pretraining a ViT-H/14 on ImageNet requires less than 1200 GPU hours, which is over 2.5× faster than a ViT-S/16 pretrained with iBOT [79] and over 10× more efficient than a ViT-H/14 pretrained with MAE. Predicting in representation space significantly reduces the total computation needed for self-supervised pretraining.

2. Background

Self-supervised learning is an approach to representation learning in which a system learns to capture the relationships between its inputs. This objective can be readily described using the framework of Energy-Based Models (EBMs) [49], in which the self-supervised objective is to assign a high energy to incompatible inputs, and to assign a low energy to compatible inputs. Many existing generative and non-generative approaches to self-supervised learning can indeed be cast in this framework; see Figure 2.

Joint-Embedding Architectures. Invariance-based pretraining can be cast in the framework of EBMs using a Joint-Embedding Architecture (JEA), which learns to output similar embeddings for compatible inputs, x, y, and dissimilar embeddings for incompatible inputs; see Figure 2a. In the context of image-based pretraining, compatible x, y pairs are typically constructed by randomly applying hand-crafted data augmentations to the same input image [20]. The main challenge with JEAs is representation collapse, wherein the energy landscape is flat (i.e., the encoder produces a constant output regardless of the input). During the past few years, several approaches have been investigated to prevent representation collapse, such as contrastive losses that explicitly push apart embeddings of negative examples [15, 24, 37], non-contrastive losses that minimize the informational redundancy across embeddings [10, 74], and clustering-based approaches that maximize the entropy of the average embedding [4, 5, 18]. There are also heuristic approaches that leverage an asymmetric architectural design between the x-encoder and y-encoder to avoid collapse [8, 24, 35].
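To make the distinction between the three architectures of Figure 2 concrete, here is a minimal PyTorch-style sketch of the corresponding energy functions; the encoder, decoder, and predictor arguments are hypothetical stand-in modules, not the networks used in this paper.

```python
import torch.nn.functional as F

def jea_energy(x_encoder, y_encoder, x, y):
    # (a) Joint-Embedding Architecture: distance between the two embeddings.
    return F.mse_loss(x_encoder(x), y_encoder(y))

def generative_energy(decoder, x, y, z):
    # (b) Generative Architecture: reconstruction error in signal space,
    # with the decoder conditioned on (possibly latent) variables z.
    return F.mse_loss(decoder(x, z), y)

def jepa_energy(x_encoder, y_encoder, predictor, x, y, z):
    # (c) Joint-Embedding Predictive Architecture: prediction error in
    # embedding space, conditioned on z (in I-JEPA, positional mask tokens).
    return F.mse_loss(predictor(x_encoder(x), z), y_encoder(y))
```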
patch-level representation. Typically, we set M equal to 4, and sample the blocks with a random aspect ratio in the range (0.75, 1.5) and random scale in the range (0.15, 0.2). Note that the target blocks are obtained by masking the output of the target-encoder, not the input. This distinction is crucial to ensure target representations of a high semantic level; see, e.g., [8].

Context. Recall, the goal behind I-JEPA is to predict the target block representations from a single context block. To obtain the context in I-JEPA, we first sample a single block x from the image with a random scale in the range (0.85, 1.0) and unit aspect ratio. We denote by B_x the mask associated with the context block x. Since the target blocks are sampled independently from the context block, there may be significant overlap. To ensure a non-trivial prediction task, we remove any overlapping regions from the context block. Figure 4 shows examples of various context and target blocks in practice. Next, the masked context block, x, is fed through the context encoder f_θ to obtain a corresponding patch-level representation s_x = {s_{x_j}}_{j ∈ B_x}.

Prediction. Given the output of the context encoder, s_x, we wish to predict the M target block representations s_y(1), ..., s_y(M). To that end, for a given target block s_y(i) corresponding to a target mask B_i, the predictor g_φ(·, ·) takes as input the output of the context encoder s_x and a mask token for each patch we wish to predict, {m_j}_{j ∈ B_i}, and outputs a patch-level prediction ŝ_y(i) = {ŝ_{y_j}}_{j ∈ B_i} = g_φ(s_x, {m_j}_{j ∈ B_i}). The mask tokens are parameterized by a shared learnable vector with an added positional embedding.

Loss. The loss is the average L2 distance between the predicted patch-level representations ŝ_y(i) and the target patch-level representations s_y(i):

$$\frac{1}{M}\sum_{i=1}^{M} D\!\left(\hat{s}_y(i),\, s_y(i)\right) \;=\; \frac{1}{M}\sum_{i=1}^{M}\sum_{j \in B_i} \big\lVert \hat{s}_{y_j} - s_{y_j} \big\rVert_2^2 .$$
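A simplified PyTorch sketch of this procedure is shown below. It assumes patchified inputs, mask indices shared across the batch, and illustrative module names (context_encoder, target_encoder, predictor); it mirrors the description above (context encoded from the visible patches only, targets taken by masking the output of an EMA target encoder, mask tokens built from a shared learnable vector plus positional embeddings) rather than reproducing the released implementation.

```python
import torch
import torch.nn.functional as F

def ijepa_loss(context_encoder, target_encoder, predictor,
               mask_token, pos_embed, patches, ctx_idx, tgt_blocks):
    """Simplified I-JEPA objective for one mini-batch.

    patches:    [B, N, D_in] patchified images
    ctx_idx:    LongTensor [K_ctx], indices of context patches (overlap with
                the target blocks already removed); shared across the batch
                here for simplicity
    tgt_blocks: list of M LongTensors with the patch indices of each target block
    """
    # Context representation: encode only the visible context patches (f_theta).
    s_x = context_encoder(patches[:, ctx_idx], ctx_idx)        # [B, K_ctx, D]

    # Target representations: run the EMA target encoder on the FULL image and
    # mask its *output* (not its input); no gradients flow to the target encoder.
    with torch.no_grad():
        s_full = target_encoder(patches)                        # [B, N, D]

    loss = 0.0
    for tgt_idx in tgt_blocks:                                  # M target blocks
        # One mask token per target patch: a shared learnable vector plus the
        # positional embedding of the patch to be predicted.
        m = (mask_token + pos_embed[tgt_idx]).unsqueeze(0)      # [1, K_i, D]
        m = m.expand(patches.size(0), -1, -1)                   # [B, K_i, D]

        s_y_hat = predictor(s_x, m)                             # [B, K_i, D]
        s_y = s_full[:, tgt_idx]                                # [B, K_i, D]
        loss = loss + F.mse_loss(s_y_hat, s_y)                  # squared L2 per patch

    return loss / len(tgt_blocks)
```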
4. Related Work

A long line of work has explored visual representation learning by predicting the values of missing or corrupted sensory inputs. Denoising autoencoders use random noise as input corruption [67]. Context encoders regress an entire image region based on its surrounding [57]. Other works cast image colorization as a denoising task [46, 47, 77]. The idea of image denoising has recently been revisited in the context of masked image modelling [9, 36, 71], where a Vision Transformer [29] is used to reconstruct missing input patches. The work on Masked Autoencoders (MAE) [36] proposed an efficient architecture that only requires the encoder to process visible image patches. By reconstructing missing patches in pixel space, MAE achieves strong performance when fine-tuned end-to-end on large labeled datasets and exhibits good scaling properties. BEiT [9] predicts the value of missing patches in a tokenized space; specifically, tokenizing image patches using a frozen discrete VAE, which is trained on a dataset containing 250 million images [58]. Yet, pixel-level pre-training has been shown to outperform BEiT for fine-tuning [36]. Another work, SimMIM [71], explores reconstruction targets based on the classic Histogram of Gradients [27] feature space, and demonstrates some advantage over pixel-space reconstruction. Different from those works, our representation space is learned during training through a Joint-Embedding Predictive Architecture. Our goal is to learn semantic representations that do not require extensive fine-tuning on downstream tasks.

Closest to our work are data2vec [8] and Context Autoencoders [25]. The data2vec method learns to predict the representation of missing patches computed through an online target encoder; by avoiding handcrafted augmentations, the method can be applied to diverse modalities with promising results in vision, text and speech. Context Autoencoders use an encoder/decoder architecture optimized via the sum of a reconstruction loss and an alignment constraint, which enforces predictability of missing patches in representation space. Compared to these methods, I-JEPA exhibits significant improvements in computational efficiency and learns more semantic off-the-shelf representations. Concurrent to our work, data2vec-v2 [7] explores efficient architectures for learning with various modalities.

We also compare I-JEPA with various methods based on joint-embedding architectures; e.g., DINO [18], MSN [4] and iBOT [79]. These methods rely on hand-crafted data augmentations during pretraining to learn semantic image representations. The work on MSN [4] uses masking as an additional data-augmentation during pretraining, while iBOT combines a data2vec-style patch-level reconstruction loss with the DINO view-invariance loss. Common to these approaches is the need to process multiple user-generated views of each input image, thereby hindering scalability. By contrast, I-JEPA only requires processing a single view of each image. We find that a ViT-Huge/14 trained with I-JEPA requires less computational effort than a ViT-Small/16 trained with iBOT.

Method           Arch.         Epochs  Top-1
Methods without view data augmentations
data2vec [8]     ViT-L/16      1600    77.3
MAE [36]         ViT-B/16      1600    68.0
                 ViT-L/16      1600    76.0
                 ViT-H/14      1600    77.2
CAE [22]         ViT-B/16      1600    70.4
                 ViT-L/16      1600    78.1
I-JEPA           ViT-B/16      600     72.9
                 ViT-L/16      600     77.5
                 ViT-H/14      300     79.3
                 ViT-H/16_448  300     81.1
Methods using extra view data augmentations
SimCLR v2 [21]   RN152 (2×)    800     79.1
DINO [18]        ViT-B/8       300     80.1
iBOT [79]        ViT-L/16      250     81.0

Table 1. ImageNet. Linear-evaluation on ImageNet-1K (the ViT-H/16_448 is pretrained at a resolution of 448 × 448). I-JEPA improves linear probing performance compared to other methods that do not rely on hand-crafted view data-augmentations during pretraining. Moreover, I-JEPA demonstrates good scalability — the larger I-JEPA model matches the performance of view-invariance approaches without requiring view data-augmentations.

Method           Arch.         Epochs  Top-1
Methods without view data augmentations
data2vec [8]     ViT-L/16      1600    73.3
MAE [36]         ViT-L/16      1600    67.1
                 ViT-H/14      1600    71.5
I-JEPA           ViT-L/16      600     69.4
                 ViT-H/14      300     73.3
                 ViT-H/16_448  300     77.3
Methods using extra view data augmentations
iBOT [79]        ViT-B/16      400     69.7
DINO [18]        ViT-B/8       300     70.0
SimCLR v2 [21]   RN151 (2×)    800     70.2
BYOL [35]        RN200 (2×)    800     71.2
MSN [4]          ViT-B/4       300     75.7

Table 2. ImageNet-1%. Semi-supervised evaluation on ImageNet-1K using only 1% of the available labels. Models are adapted via fine-tuning or linear-probing, depending on whichever works best for each respective method. ViT-H/16_448 is pretrained at a resolution of 448 × 448. I-JEPA pretraining outperforms MAE, which also does not rely on hand-crafted data-augmentations during pretraining. Moreover, I-JEPA benefits from scale. A ViT-H/16 trained at resolution 448 surpasses previous methods, including methods that leverage extra hand-crafted data-augmentations.

5. Image Classification

To demonstrate that I-JEPA learns high-level representations without relying on hand-crafted data-augmentations, we report results on various image classification tasks using the linear probing and partial fine-tuning protocols. In this section, we consider self-supervised models that have been pretrained on the ImageNet-1K dataset [60]. Pretraining and evaluation implementation details are described in Appendix A. All I-JEPA models are trained at resolution 224 × 224 pixels, unless stated otherwise.

ImageNet-1K. Table 1 shows performance on the common ImageNet-1K linear-evaluation benchmark. After self-supervised pretraining, the model weights are frozen and a linear classifier is trained on top using the full ImageNet-1K training set. Compared to popular methods such as Masked Autoencoders (MAE) [36], Context Autoencoders (CAE) [22], and data2vec [8], which also do not rely on extensive hand-crafted data-augmentations during pretraining, we see that I-JEPA significantly improves linear probing performance, while using less computational effort (see Section 7). By leveraging the improved efficiency of I-JEPA, we can train larger models that outperform the best CAE model while using a fraction of the compute. I-JEPA also benefits from scale; in particular, a ViT-H/16 trained at resolution 448 × 448 pixels matches the performance of view-invariant approaches such as iBOT [79], despite avoiding the use of hand-crafted data-augmentations.
Method        Arch.     CIFAR100  Places205  iNat18
Methods without view data augmentations
data2vec [8]  ViT-L/16  81.6      54.6       28.1
MAE [36]      ViT-H/14  77.3      55.0       32.9
I-JEPA        ViT-H/14  87.5      58.4       47.6
Methods using extra view data augmentations
DINO [18]     ViT-B/8   84.9      57.9       55.9
iBOT [79]     ViT-L/16  88.3      60.4       57.3

Table 3. Linear-probe transfer for image classification. Linear-evaluation on downstream image classification tasks. I-JEPA significantly outperforms previous methods that also do not use augmentations (MAE and data2vec), and decreases the gap with the best view-invariance-based methods that leverage hand-crafted data augmentations during pretraining.

Method        Arch.     Clevr/Count  Clevr/Dist
Methods without view data augmentations
data2vec [8]  ViT-L/16  85.3         71.3
MAE [36]      ViT-H/14  90.5         72.4
I-JEPA        ViT-H/14  86.7         72.4
Methods using extra data augmentations
DINO [18]     ViT-B/8   86.6         53.4
iBOT [79]     ViT-L/16  85.7         62.8

Table 4. Linear-probe transfer for low-level tasks. Linear-evaluation on downstream low-level tasks consisting of object counting (Clevr/Count) and depth prediction (Clevr/Dist). The I-JEPA method effectively captures low-level image features during pretraining and outperforms view-invariance based methods on tasks such as object counting and depth prediction.
Low-Shot ImageNet-1K. Table 2 shows performance on the 1% ImageNet benchmark. Here the idea is to adapt the pretrained models for ImageNet classification using only 1% of the available ImageNet labels, corresponding to roughly 12 or 13 images per class. Models are adapted via fine-tuning or linear-probing, depending on whichever works best for each respective method. I-JEPA outperforms MAE while requiring fewer pretraining epochs when using a similar encoder architecture. I-JEPA, using a ViT-H/14 architecture, matches the performance of a ViT-L/16 pretrained with data2vec [8], while using significantly less computational effort (see Section 7). By increasing the image input resolution, I-JEPA outperforms previous methods, including joint-embedding methods that do leverage extra hand-crafted data-augmentations during pretraining, such as MSN [4], DINO [17], and iBOT [79].

Transfer learning. Table 3 shows performance on various downstream image classification tasks using a linear probe. I-JEPA significantly outperforms previous methods that do not use augmentations (MAE and data2vec), and decreases the gap with the best view-invariance-based methods, which leverage hand-crafted data augmentations during pretraining, even surpassing the popular DINO [18] on CIFAR100 and Places205 with a linear probe.

6. Local Prediction Tasks

As demonstrated in Section 5, I-JEPA learns semantic image representations that significantly improve the downstream image classification performance of previous methods, such as MAE and data2vec. Additionally, I-JEPA benefits from scale and can close the gap with, and even surpass, view-invariance based methods that leverage extra hand-crafted data augmentations. In this section, we find that I-JEPA also learns local image features and surpasses view-invariance based methods on low-level and dense prediction tasks, such as object counting and depth prediction.

Table 4 shows performance on various low-level tasks using a linear probe. After pretraining, the encoder weights are frozen and a linear model is trained on top to perform object-counting and depth prediction on the Clevr dataset [43]. Compared to view-invariance methods such as DINO and iBOT, the I-JEPA method effectively captures low-level image features during pretraining and outperforms them in object counting (Clevr/Count) and (by a large margin) depth prediction (Clevr/Dist).

7. Scalability

Model Efficiency. I-JEPA is highly scalable compared to previous approaches. Figure 5 shows semi-supervised evaluation on 1% ImageNet-1K as a function of GPU hours. I-JEPA requires less compute than previous methods and achieves strong performance without relying on hand-crafted data-augmentations. Compared to reconstruction-based methods, such as MAE, which directly use pixels as targets, I-JEPA introduces extra overhead by computing targets in representation space (about 7% slower time per iteration). However, since I-JEPA converges in roughly 5× fewer iterations, we still see significant compute savings in practice. Compared to view-invariance based methods, such as iBOT, which rely on hand-crafted data augmentations to create and process multiple views of each image, I-JEPA also runs significantly faster. In particular, a huge I-JEPA model (ViT-H/14) requires less compute than a small iBOT model (ViT-S/16).

Scaling data size. We also find I-JEPA to benefit from pretraining with larger datasets. Table 5 shows transfer performance when increasing the size of the pretraining dataset (IN1k versus IN22k).
Pretrain  Arch.     CIFAR100  Places205  iNat18  Clevr/Count  Clevr/Dist
IN1k      ViT-H/14  87.5      58.4       47.6    86.7         72.4
IN22k     ViT-H/14  89.5      57.8       50.5    88.6         75.0
IN22k     ViT-G/16  89.5      59.1       55.3    86.7         73.0

Table 5. Ablating dataset and model size. Evaluating the impact of pretraining dataset size and model size on transfer tasks. I-JEPA benefits from larger, more diverse datasets. When increasing the size of the pretraining dataset (IN1k versus IN22k), we see a performance improvement for the ViT-H/14 model. We observe a further performance improvement on semantic tasks by training a larger ViT-G/16 model on ImageNet-22k. The ViT-H/14 is trained for 300 epochs on IN1k and the equivalent of 900 IN1k epochs on IN22k. The ViT-G/16 is trained for the equivalent of 600 IN1k epochs.
Scaling model size. Table 5 also shows that I-JEPA benefits from larger model size when pretraining on IN22k. Pretraining a ViT-G/16 significantly improves the downstream performance on image classification tasks such as Places205 and iNat18 compared to a ViT-H/14 model, but does not improve performance on low-level downstream tasks — the ViT-G/16 uses larger input patches, which can be detrimental for the local prediction tasks.

Masking strategy. Table 6 compares our multi-block masking with other masking strategies: rasterized masking, where the image is split into four large quadrants and the goal is to use one quadrant as a context to predict the other three quadrants, and the traditional block and random masking typically used in reconstruction-based methods. In block masking, the target is a single image block and the context is the image complement; in random masking, the target is a set of random image patches and the context is the image complement. We find multi-block masking helpful for guiding I-JEPA to learn semantic representations. Additional ablations on multi-block masking can be found in Appendix C.
Figure 6. Visualization of I-JEPA predictor representations. For each image: the first column contains the original image; the second column contains the context image, which is processed by a pretrained I-JEPA ViT-H/14 encoder. Green bounding boxes in subsequent columns contain samples from a generative model decoding the output of the pretrained I-JEPA predictor, which is conditioned on positional mask tokens corresponding to the location of the green bounding box. Qualities that are common across samples represent information that is contained in the I-JEPA prediction. The I-JEPA predictor is correctly capturing positional uncertainty and producing high-level object parts with a correct pose (e.g., the back of the bird and the top of a car). Qualities that vary across samples represent information that is not contained in the representation. In this case, the I-JEPA predictor discards the precise low-level details as well as background information.
Mask         Target Type       Freq.  Context Type                   Avg. Ratio∗  Top-1
multi-block  Block(0.15, 0.2)  4      Block(0.85, 1.0) × Complement  0.25         54.2
rasterized   Quadrant          3      Complement                     0.25         15.5
block        Block(0.6)        1      Complement                     0.4          20.2
random       Random(0.6)       1      Complement                     0.4          17.6
∗ Avg. Ratio is the average number of patches in the context block relative to the total number of patches in the image.

Table 6. Ablating masking strategy. Linear evaluation on ImageNet-1K using only 1% of the available labels after I-JEPA pretraining of a ViT-B/16 for 300 epochs. Comparison of the proposed multi-block masking strategy. In rasterized masking, the image is split into four large quadrants; one quadrant is used as a context to predict the other three quadrants. In block masking, the target is a single image block and the context is the image complement. In random masking, the target is a set of random image patches and the context is the image complement. The proposed multi-block masking strategy is helpful for guiding I-JEPA to learn semantic representations.
Targets                Arch.     Epochs  Top-1
Target-Encoder Output  ViT-L/16  500     66.9
Pixels                 ViT-L/16  800     40.7
A. Implementation Details

Optimization. We use AdamW [51] to optimize the context-encoder and predictor weights. Our default batch size is 2048, and the learning rate is linearly increased from 10⁻⁴ to 10⁻³ during the first 15 epochs of pretraining, and decayed to 10⁻⁶ following a cosine schedule thereafter. Following [4, 18], the weight-decay is linearly increased from 0.04 to 0.4 throughout pretraining. The target-encoder weights are identical to the context-encoder weights at initialization, and updated via an exponential moving average thereafter [4, 18, 23, 35, 37, 61]. We use a momentum value of 0.996, and linearly increase this value to 1.0 throughout pretraining, following [4, 18].
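As a sketch, the three schedules described above (learning rate, weight decay, and target-encoder EMA momentum) and the EMA update can be written as follows; the function and argument names are illustrative, not the released implementation.

```python
import math
import torch

def ijepa_schedules(step, total_steps, warmup_steps):
    """Per-iteration schedule values for the recipe above (a sketch)."""
    # Learning rate: linear warmup from 1e-4 to 1e-3, then cosine decay to 1e-6.
    if step < warmup_steps:
        lr = 1e-4 + (1e-3 - 1e-4) * step / warmup_steps
    else:
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        lr = 1e-6 + 0.5 * (1e-3 - 1e-6) * (1.0 + math.cos(math.pi * t))
    # Weight decay: linearly increased from 0.04 to 0.4 over pretraining.
    wd = 0.04 + (0.4 - 0.04) * step / total_steps
    # Target-encoder EMA momentum: linearly increased from 0.996 to 1.0.
    momentum = 0.996 + (1.0 - 0.996) * step / total_steps
    return lr, wd, momentum

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum):
    """target <- m * target + (1 - m) * context, applied parameter-wise."""
    for p_t, p_c in zip(target_encoder.parameters(),
                        context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c.detach(), alpha=1.0 - momentum)
```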
Masking. By default, we sample 4 possibly overlapping target block masks with random scale in the range (0.15, 0.2) and
aspect ratio in the range (0.75, 1.5). We sample 1 context block mask with random scale in the range (0.85, 1.0) and unit
aspect ratio. We subsequently eliminate any regions in the context block mask that overlap with any of the 4 target block
masks. The context-block mask and target-block masks are sampled independently for each image in the mini-batch. To
ensure efficient batch processing, we restrict the size of all context masks on a co-located GPU to be identical. Similarly, we
restrict the size of all target masks on a co-located GPU to be identical. The mask-sampler is efficiently implemented in only
a few lines of code in PyTorch [56] using a batch-collator function, which runs in the data loader processes. In short, in each
iteration, the data loader returns a mini-batch of images and a set of context and target masks for each image, identifying the
patch indices to keep for the context and target views.
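A minimal sketch of such a mask sampler is shown below; it samples the target and context blocks on the patch grid with the scale and aspect-ratio ranges given above and removes the overlap, but omits the per-GPU size matching and the batch-collator plumbing.

```python
import random
import torch

def _sample_block(h, w, scale_range, aspect_range):
    """Sample a rectangular block on an h x w patch grid; return a boolean mask."""
    num_patches = random.uniform(*scale_range) * h * w
    aspect = random.uniform(*aspect_range)                  # height / width
    bh = max(1, min(h, round((num_patches * aspect) ** 0.5)))
    bw = max(1, min(w, round((num_patches / aspect) ** 0.5)))
    top, left = random.randint(0, h - bh), random.randint(0, w - bw)
    mask = torch.zeros(h, w, dtype=torch.bool)
    mask[top:top + bh, left:left + bw] = True
    return mask

def sample_ijepa_masks(h=14, w=14, num_targets=4):
    """Multi-block masking: 4 target blocks (scale (0.15, 0.2), aspect (0.75, 1.5))
    and 1 context block (scale (0.85, 1.0), unit aspect) with target overlap removed.
    Returns flattened patch indices for the context and each target block."""
    targets = [_sample_block(h, w, (0.15, 0.2), (0.75, 1.5))
               for _ in range(num_targets)]
    context = _sample_block(h, w, (0.85, 1.0), (1.0, 1.0))
    for t in targets:
        context &= ~t                                       # drop overlapping patches
    target_idx = [t.flatten().nonzero(as_tuple=False).squeeze(1) for t in targets]
    context_idx = context.flatten().nonzero(as_tuple=False).squeeze(1)
    return context_idx, target_idx
```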
ImageNet evaluations. To evaluate I-JEPA on ImageNet [60], we adapt the VISSL recipe to use average-pooled repre-
sentations instead of the [cls] token. Following MAE [36], we use the LARS [72] optimizer with a batch-size of 16384,
and train the linear probe for 50 epochs. We use a learning rate with a step-wise decay, dividing it by a factor of 10 every 15
epochs, and sweep three different reference learning rates [0.01, 0.05, 0.001], and two weight decay values [0.0005, 0.0].
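A sketch of this probe setup is given below; since LARS is not part of torch.optim, plain SGD with momentum is used here only as a stand-in, and the encoder is assumed to already return patch-level tokens.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def pooled_features(encoder, patches):
    """Frozen-backbone features: average-pool the patch-level outputs of the
    encoder (instead of using a [cls] token)."""
    return encoder(patches).mean(dim=1)                     # [B, N, D] -> [B, D]

def make_linear_probe(feat_dim, num_classes=1000, ref_lr=0.05, wd=0.0005):
    head = nn.Linear(feat_dim, num_classes)
    # The recipe above uses LARS with a batch size of 16384; SGD with momentum
    # is a stand-in here, since torch.optim provides no LARS optimizer.
    optimizer = torch.optim.SGD(head.parameters(), lr=ref_lr,
                                momentum=0.9, weight_decay=wd)
    # Step-wise decay: divide the learning rate by 10 every 15 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)
    return head, optimizer, scheduler
```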
Low-shot evaluation. To evaluate our model on the ImageNet-1% low-shot task, we adapt the fine-tuning protocol of MAE [36]. We fine-tune our ViT-L/H models for 50 epochs on ImageNet-1% with the AdamW optimizer and a cosine learning rate scheduler. We use a batch size of 512, a layer-wise learning rate decay of 0.75, and label smoothing of 0.1. We use the default randaugment data-augmentations as in MAE. In contrast to the fine-tuning done with MAE, we do not use mixup, cutmix, random erasing or drop path. For I-JEPA, we use a learning rate / weight decay of 3e−5 / 5e−2 for the ViT-L/16, 3e−5 / 4e−1 for the ViT-H/14, and 3e−5 / 4e−1 for the ViT-H/16_448. A similar fine-tuning strategy for low-shot learning has been explored by Semi-ViT in the context of semi-supervised learning [16].
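For illustration, the layer-wise learning-rate decay used above can be set up as follows; the blocks/head attribute names are assumptions about the ViT implementation, not the exact recipe used here.

```python
import torch
import torch.nn as nn

def layerwise_lr_groups(vit, base_lr=3e-5, weight_decay=5e-2, layer_decay=0.75):
    """Parameter groups with layer-wise learning-rate decay: blocks closer to the
    input get smaller learning rates. Assumes the ViT exposes `blocks` (a
    ModuleList) and a classification `head`."""
    num_layers = len(vit.blocks)
    groups = []
    for i, block in enumerate(vit.blocks):
        scale = layer_decay ** (num_layers - i)     # shallower layer -> smaller lr
        groups.append({"params": list(block.parameters()),
                       "lr": base_lr * scale,
                       "weight_decay": weight_decay})
    groups.append({"params": list(vit.head.parameters()),
                   "lr": base_lr,
                   "weight_decay": weight_decay})
    return groups

# Illustrative usage with the other ingredients mentioned above:
# optimizer = torch.optim.AdamW(layerwise_lr_groups(model))
# criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```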
Table 8. Ablation of the target block size for multi-block masking. Linear evaluation on 1% ImageNet-1K (using only 1% of the available labels); ablating the multi-block target size during I-JEPA pretraining of a ViT-B/16 for 300 epochs. Predicting larger (semantic) blocks improves the low-shot accuracy as long as the context is sufficiently informative.

Target Scale  Freq.  Context Scale  Top-1
(0.15, 0.2)   4      (0.40, 1.0)    31.2
(0.15, 0.2)   4      (0.65, 1.0)    47.1
(0.15, 0.2)   4      (0.75, 1.0)    49.3
(0.15, 0.2)   4      (0.85, 1.0)    54.2

Table 9. Ablation of the context size for multi-block masking. Linear evaluation on 1% ImageNet-1K (using only 1% of the available labels); ablating the multi-block context size during I-JEPA pretraining of a ViT-B/16 for 300 epochs. Reducing the multi-block context size degrades the low-shot performance.

Target Scale  Freq.  Context Scale  Top-1
(0.15, 0.2)   1      (0.85, 1.0)    9.0
(0.15, 0.2)   2      (0.85, 1.0)    22.0
(0.15, 0.2)   3      (0.85, 1.0)    48.5
(0.15, 0.2)   4      (0.85, 1.0)    54.2

Table 10. Ablation of the number of targets for multi-block masking. Linear evaluation on 1% ImageNet-1K (using only 1% of the available labels); ablating the multi-block number of targets during I-JEPA pretraining of a ViT-B/16 for 300 epochs. Increasing the number of target blocks improves the low-shot accuracy.
C. Additional Ablations
This section follows the same experimental protocol as Section 9. We report the result of a linear probe with a frozen
backbone, trained on the low-shot 1% ImageNet-1K benchmark.
Multi-block masking strategy. We present an extended ablation of the multi-block masking strategy where we change the target block scale (Table 8), the context scale (Table 9) and the number of target blocks (Table 10). We train a ViT-B/16
for 300 epochs using I-JEPA with various multi-block settings and compare performance on the 1% ImageNet-1K
benchmark using a linear probe. In short, we find that it is important to predict several relatively large (semantic) target
blocks, and to use a sufficiently informative (spatially distributed) context block.
Masking at the output of the target-encoder. An important design choice in I-JEPA is that the target blocks
are obtained by masking the output of the target-encoder, not the input. Table 11 shows the effect of this design choice
on the semantic level of the learned representations when pretraining a ViT-H/16 using I-JEPA for 300 epochs. In the case
where masking is applied to the input, we forward-propagate through the target-encoder once for each target region. Masking
the output of the target-encoder during pretraining results in more semantic prediction targets and improves linear probing
performance.
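The difference between the two variants amounts to where the target mask is applied; an illustrative sketch with simplified signatures (the function names are hypothetical):

```python
import torch

@torch.no_grad()
def target_reps_output_masking(target_encoder, patches, tgt_idx):
    # Default I-JEPA: encode the FULL image, then select the target-block
    # patches from the target-encoder *output*.
    return target_encoder(patches)[:, tgt_idx]      # [B, K, D]

@torch.no_grad()
def target_reps_input_masking(target_encoder, patches, tgt_idx):
    # Ablation: mask the *input* instead, i.e., encode only the target patches
    # (one forward pass per target region), so each target representation is
    # computed without the surrounding image context.
    return target_encoder(patches[:, tgt_idx])      # [B, K, D]
```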
Target Masking Arch. Epochs Top-1
Output ViT-H/16 300 67.3
Input ViT-H/16 300 56.1
Table 11. Ablating masking output of target encoder. Linear evaluation on ImageNet-1K using only 1% of the available labels; ablating
the effect of masking the target-encoder output during I-JEPA pretraining of a ViT-H/16 for 300 epochs. Masking the output of the
target-encoder during pretraining significantly improves the linear probing performance of the pretrained representations.
Predictor depth. We examine the impact of the predictor depth on the downstream low-shot performance in Table 12.
We pretrain a ViT-L/16 for 500 epochs using either a 6-layer predictor network or a 12-layer predictor network. The model
pretrained using a deeper predictor shows a significant improvement in downstream low-shot performance compared to the
model pretrained with a shallower predictor.
Table 12. Ablating the predictor depth. Linear evaluation on ImageNet-1K using only 1% of the available labels; ablating the effect of the predictor depth for a ViT-L/16 pretrained for 500 epochs. Increasing the predictor depth leads to a significant improvement in the linear probe performance of the pretrained representations.
Weight decay. In Table 13, we evaluate the impact of weight-decay during pretraining. We explore two weight decay strategies: linearly increasing the weight-decay from 0.04 to 0.4, or using a fixed weight-decay of 0.05. Using a smaller weight decay during pretraining improves the downstream performance on ImageNet-1% when fine-tuning. However, this also leads to a degradation of performance in linear evaluation. In the main paper, we use the first weight decay strategy as it improves performance on linear-evaluation downstream tasks.

Table 13. Ablating the pretraining weight-decay. We compare our default pretraining weight decay strategy, where we linearly increase the weight-decay from 0.04 to 0.4, to using a fixed weight decay of 0.05. Using a smaller weight-decay during pretraining can improve the fine-tuning performance on ImageNet-1%. However, it also leads to a drop of performance in linear evaluation.
Predictor width. We explore the impact of the predictor width in Table 14. We compare I-JEPA using a ViT-L encoder and a predictor with 384 channels to a similar model using a predictor with 1024 channels. Note that the ViT-L encoder has 1024 channels. Using a bottleneck in the predictor width improves the downstream performance on ImageNet-1%.

Table 14. Ablating the predictor width. We report results on ImageNet-1K 1% using fine-tuning. We compare two predictors having a width of either 384 or 1024. Note the I-JEPA encoder is a ViT-L with 1024 channels. Having a width bottleneck in the predictor improves the downstream performance.
Table 15. Fine-tuning on the full ImageNet dataset. I-JEPA achieves competitive performance. I-JEPA is close to the MAE approach despite being trained for 5.3× fewer epochs than MAE.
E. RCDM Visualizations
To visualize the representations of a pretrained neural network in pixel space, we use the RCDM framework [13]. The
RCDM framework trains a decoder network hω , comprising a generative diffusion model, to reconstruct an image x from
the representation vector of that image sx and a noisy version of that image x̂ := x + ϵ, where ϵ is an additive noise vector.
Concretely, the decoder objective is to minimize the loss function ∥hω (x̂, sx )−ϵ∥. We train each RCDM network for 300,000
iterations using the default hyperparameters [13]. After training the decoder, one can subsequently feed the representation
vector of an unseen test image sy into the decoder along with various random noise vectors to generate several pixel-level
visualizations of the representation, thus providing insight into the features captured in the representations of the pretrained
network. Qualities that are common across samples represent information that is contained in the representation. On the
other hand, qualities that vary across samples represent information that is not contained in the representations.
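A minimal sketch of this conditional denoising objective is shown below; it uses a single fixed noise level for illustration, whereas RCDM trains a full diffusion model with the usual noise schedule, and the decoder/encoder arguments are placeholders.

```python
import torch
import torch.nn.functional as F

def rcdm_decoder_loss(decoder, encoder, x):
    """Conditional denoising objective described above: the decoder h_omega
    predicts the additive noise eps from the noisy image x + eps and the frozen
    representation s_x of the clean image."""
    with torch.no_grad():
        s_x = encoder(x)                # representation of the clean image
    eps = torch.randn_like(x)           # additive noise vector
    eps_hat = decoder(x + eps, s_x)     # decoder conditioned on s_x
    return F.mse_loss(eps_hat, eps)     # || h_omega(x_hat, s_x) - eps ||
```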
In Figure 6, the visualizations are obtained by feeding the average-pooled output of the predictor, conditioned on a specific
target region, into the decoder network, along with various random noise vectors. In Figures 7 and 8, the visualizations are
obtained by feeding the average-pooled output of the target-encoder into the decoder network, along with various random
noise vectors.
In Figure 7, we visualize the average-pooled I-JEPA representations at the output of our ViT-H/14 target-encoder. The first
column contains the original image, while subsequent columns contain synthetic samples obtained by feeding the average-
pooled representation of the image into the decoder along with various random noise vectors. Figure 7 suggests that the I-
JEPA target-encoder is able to correctly capture the high-level information regarding objects and their poses, while discarding
low-level image details and background information.
Figure 8 shows similar visualizations, but when using an MSN [4] pretrained ViT-L/7 target-encoder to compute the
image representations. The MSN method trains a context- and target-encoder using a Joint-Embedding Architecture to
enforce invariance of global image representations to various hand crafted data augmentations and missing patches. While
the MSN pretrained network is able to capture high level semantic information about the image in the first column, it also
exhibits higher variability in the generated samples, e.g., variability in the object pose, object scale, and number of instances.
In short, the MSN pretrained network discards much of the local structure in the image, which is in stark contrast to I-JEPA, which
retains information about much of the local structure in the input image.
Figure 7. Visualization of I-JEPA target-encoder representations. For each image: the first column contains the original image; subsequent columns contain samples from a generative model decoding the average-pooled output of a pretrained I-JEPA target-encoder. Qualities that are common across samples represent information that is contained in the I-JEPA representation. I-JEPA is able to correctly capture the high-level information regarding objects and their poses. Qualities that vary across samples represent information that is not contained in the representation. The I-JEPA encoder discards the precise low-level details as well as background information.
Figure 8. Visualization of MSN target-encoder representations. For each image: first column contains the original image; subsequent
columns contain samples from a generative model decoding the output of a frozen MSN encoder [4]. Qualities that are common across
samples represent information that is contained in the representation. Qualities that vary across samples represent information that is not
captured by MSN. Compared to I-JEPA, MSN samples show higher variability. MSN retains less information from the input. In particular,
it discards global structure information such as the object pose or even number of instances.