
Better plain ViT baselines for ImageNet-1k

Lucas Beyer Xiaohua Zhai Alexander Kolesnikov


Google Research, Brain Team Zürich

https://github.com/google-research/big_vision
arXiv:2205.01580v1 [cs.CV] 3 May 2022

Abstract

It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data. Surprisingly, we find this is not the case and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models. Notably, 90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reach 80% in less than one day.

[Figure 1: two panels plotting ImageNet-1k top-1 accuracy [%] for ViT (improved), ViT (original), ViT (AugReg), ViT (DeiT), ViT (DeiT III), ResNet50 (orig), ResNet50 (BiT), and R50 (strikes back); left panel x-axis: epochs of training (90, 150, 300); right panel x-axis: TPUv3-8 wallclock training time (6h30m, 10h50m, 21h40m); the original ViT S/16, B/16, and B/32 points are annotated "Orig.".]

Figure 1. Comparison of the ViT model for this note to state-of-the-art ViT and ResNet models. The left plot demonstrates how performance depends on the total number of epochs, while the right plot uses TPUv3-8 wallclock time to measure compute. We observe that our simple setting is highly competitive, even to the canonical ResNet-50 setups.

1. Introduction

The ViT paper [4] focused solely on the aspect of large-scale pre-training, where ViT models outshine well-tuned ResNet [6] (BiT [8]) models. The addition of results when pre-training only on ImageNet-1k was an afterthought, mostly to ablate the effect of data scale. Nevertheless, ImageNet-1k remains a key testbed in computer vision research, and it is highly beneficial to have as simple and effective a baseline as possible.

Thus, coupled with the release of the big_vision codebase used to develop ViT [4], MLP-Mixer [14], ViT-G [19], LiT [20], and a variety of other research projects, we now provide a new baseline that stays true to the original ViT's simplicity while reaching results competitive with similar approaches [15, 17] and the concurrent [16], which also strives for simplification.

2. Experimental setup

We focus entirely on the ImageNet-1k dataset (ILSVRC-2012) for both (pre)training and evaluation. We stick to the original ViT model architecture due to its widespread acceptance [1, 2, 5, 9, 15], simplicity, and scalability, and revisit only a few very minor details, none of which are novel. We choose to focus on the smaller ViT-S/16 variant introduced by [15], as we believe it provides a good tradeoff between iteration velocity on commonly available hardware and final accuracy. However, when more compute and data are available, we highly recommend iterating with ViT-B/32 or ViT-B/16 instead [12, 19], and note that increasing the patch size is almost equivalent to reducing the image resolution.

All experiments use "inception crop" [13] at 224px² resolution, random horizontal flips, RandAugment [3], and Mixup augmentations. We train on the first 99% of the training data and keep the remaining 1% as a "minival" set, to encourage the community to stop selecting design choices on the validation (de-facto test) set. The full setup is shown in Appendix A.
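To make the data split concrete, the snippet below (an illustration added for this write-up, not code from the paper) expresses the 99%/1% train/minival split using TensorFlow Datasets slicing syntax; this is what the 'train[:99%]' and 'train[99%:]' strings in Listing 1 denote. The dataset name follows the config; note that imagenet2012 has to be downloaded manually before tfds can prepare it.

import tensorflow_datasets as tfds

# The first 99% of the official training set is used for training ...
train_ds = tfds.load('imagenet2012', split='train[:99%]')
# ... and the held-out last 1% serves as "minival" for all design decisions.
minival_ds = tfds.load('imagenet2012', split='train[99%:]')
# The official validation split is treated as a de-facto test set.
val_ds = tfds.load('imagenet2012', split='validation')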

3. Results

The results for our improved setup are shown in Figure 1, along with a few related important baselines. It is clear that a simple, standard ViT trained this way can match both the seminal 90-epoch ResNet50 baseline and more modern ResNet [17] and ViT [16] training setups. Furthermore, on a small TPUv3-8 node, the 90-epoch run takes only 6h30m, and one can reach 80% accuracy in less than a day when training for 300 epochs.
The main differences from [4, 12] are a batch-size of 1024 instead of 4096, the use of global average-pooling (GAP) instead of a class token [2, 11], fixed 2D sin-cos position embeddings [2], and the introduction of a small amount of RandAugment [3] and Mixup [21] (level 10 and probability 0.2 respectively, which is less than [12]). These small changes lead to significantly better performance than that originally reported in [4].
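As an illustration of the fixed 2D sin-cos position embeddings (posemb='sincos2d' in Listing 1), here is a small JAX sketch. The temperature of 10,000 and the [sin(x), cos(x), sin(y), cos(y)] feature ordering are common conventions and should be read as assumptions rather than a verbatim excerpt of the big_vision code.

import jax.numpy as jnp

def posemb_sincos_2d(h, w, dim, temperature=10_000.0):
  # h, w: patch-grid size; dim: embedding width (must be divisible by 4).
  assert dim % 4 == 0, 'dim must be a multiple of 4'
  y, x = jnp.mgrid[:h, :w]                          # integer patch coordinates
  omega = jnp.arange(dim // 4) / (dim // 4 - 1)
  omega = 1.0 / (temperature ** omega)              # dim // 4 frequencies
  y = y.flatten()[:, None] * omega[None, :]
  x = x.flatten()[:, None] * omega[None, :]
  # One fixed embedding per patch; no learned parameters, unlike the ablated "learned" variant.
  return jnp.concatenate([jnp.sin(x), jnp.cos(x), jnp.sin(y), jnp.cos(y)], axis=1)

# ViT-S/16 at 224px resolution: a 14x14 patch grid and width 384.
pe = posemb_sincos_2d(14, 14, 384)                  # shape (196, 384)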
Notably absent from this baseline are further architectural changes, regularizers such as dropout or stochastic depth [7], advanced optimization schemes such as SAM [10], extra augmentations such as CutMix [18], repeated augmentations [15], or blurring, "tricks" such as high-resolution fine-tuning or checkpoint averaging, as well as supervision from a strong teacher via knowledge distillation.

Table 1 shows an ablation of the various minor changes we propose. It exemplifies how a collection of almost trivial changes can accumulate to an important overall improvement. The only change which makes no significant difference in classification accuracy is whether the classification head is a single linear layer, or an MLP with one hidden tanh layer as in the original Transformer formulation.

Table 1. Ablation of our trivial modifications.

                                 90ep   150ep   300ep
Our improvements                 76.5   78.5    80.0
  no RandAug+MixUp               73.6   73.7    73.7
  Posemb: sincos2d → learned     75.0   78.0    79.6
  Batch-size: 1024 → 4096        74.7   77.3    78.6
  Global Avgpool → [cls] token   75.0   76.9    78.2
  Head: MLP → linear             76.7   78.6    79.8
Original + RandAug + MixUp       71.6   74.8    76.1
Original                         66.8   67.2    67.1

Table 2. A few more standard metrics.

                             Top-1   ReaL   v2
Original (90ep)              66.8    72.8   52.2
Our improvements (90ep)      76.5    83.1   64.2
Our improvements (150ep)     78.5    84.5   66.4
Our improvements (300ep)     80.0    85.4   68.3
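The following sketch (added for illustration; function names and parameter shapes are assumptions, not the paper's code) contrasts the readout and head variants ablated in Table 1: global average pooling versus a [cls] token, and a single linear head versus the original Transformer-style MLP head with one hidden tanh layer.

import jax
import jax.numpy as jnp

def readout(tokens, pool_type='gap'):
  # 'gap': average all tokens; 'cls': read the prepended class token.
  return tokens.mean(axis=0) if pool_type == 'gap' else tokens[0]

def linear_head(feat, w_out, b_out):
  return feat @ w_out + b_out

def mlp_head(feat, w_hid, b_hid, w_out, b_out):
  # One hidden tanh layer before the logits, as in the original formulation.
  return jnp.tanh(feat @ w_hid + b_hid) @ w_out + b_out

# Example with ViT-S/16-like shapes: 196 tokens of width 384, 1000 classes.
tokens = jax.random.normal(jax.random.PRNGKey(0), (196, 384))
w_out, b_out = jnp.zeros((384, 1000)), jnp.zeros((1000,))
logits = linear_head(readout(tokens, 'gap'), w_out, b_out)   # shape (1000,)

Per Table 1, the head choice barely moves accuracy, while switching from GAP back to a [cls] token costs roughly 1.5-2 points at every training duration.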
4. Conclusion

It is always worth striving for simplicity.

Acknowledgements. We thank Daniel Suo and Naman Agarwal for nudging for 90 epochs and feedback on the report, as well as the Google Brain team for a supportive research environment.

References

[1] Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, and Denny Zhou. A simple single-scale vision transformer for object localization and instance segmentation. CoRR, abs/2112.09747, 2021.
[2] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In International Conference on Computer Vision (ICCV), 2021.
[3] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. CoRR, abs/1909.13719, 2019.
[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
[5] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. CoRR, abs/2111.06377, 2021.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[7] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. CoRR, abs/1603.09382, 2016.
[8] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General visual representation learning. In European Conference on Computer Vision (ECCV), 2020.
[9] Yanghao Li, Hanzi Mao, Ross B. Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. CoRR, abs/2203.16527, 2022.
[10] Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, and Yang You. Towards efficient and scalable sharpness-aware minimization. CoRR, abs/2203.02714, 2022.
[11] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? CoRR, abs/2108.08810, 2021.
[12] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. CoRR, abs/2106.10270, 2021.
[13] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[14] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34, 2021.
[15] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (ICML), 2021.
[16] Hugo Touvron, Matthieu Cord, and Hervé Jégou. DeiT III: Revenge of the ViT. CoRR, abs/2204.07118, 2022.
[17] Ross Wightman, Hugo Touvron, and Hervé Jégou. ResNet strikes back: An improved training procedure in timm. CoRR, abs/2110.00476, 2021.
[18] Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. CutMix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), 2019.
[19] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[20] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[21] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.

A. big_vision experiment configuration

def get_config():
  config = mlc.ConfigDict()

  config.dataset = 'imagenet2012'
  config.train_split = 'train[:99%]'
  config.cache_raw = True
  config.shuffle_buffer_size = 250_000
  config.num_classes = 1000
  config.loss = 'softmax_xent'
  config.batch_size = 1024
  config.num_epochs = 90

  pp_common = (
      '|value_range(-1, 1)'
      '|onehot(1000, key="{lbl}", key_result="labels")'
      '|keep("image", "labels")'
  )
  config.pp_train = (
      'decode_jpeg_and_inception_crop(224)' +
      '|flip_lr|randaug(2,10)' +
      pp_common.format(lbl='label')
  )
  pp_eval = 'decode|resize_small(256)|central_crop(224)' + pp_common

  config.log_training_steps = 50
  config.log_eval_steps = 1000
  config.checkpoint_steps = 1000

  # Model section
  config.model_name = 'vit'
  config.model = dict(
      variant='S/16',
      rep_size=True,
      pool_type='gap',
      posemb='sincos2d',
  )

  # Optimizer section
  config.grad_clip_norm = 1.0
  config.optax_name = 'scale_by_adam'
  config.optax = dict(mu_dtype='bfloat16')
  config.lr = 0.001
  config.wd = 0.0001
  config.schedule = dict(warmup_steps=10_000, decay_type='cosine')
  config.mixup = dict(p=0.2, fold_in=None)

  # Eval section
  config.evals = [
      ('minival', 'classification'),
      ('val', 'classification'),
      ('real', 'classification'),
      ('v2', 'classification'),
  ]
  eval_common = dict(
      pp_fn=pp_eval.format(lbl='label'),
      loss_name=config.loss,
      log_steps=1000,
  )

  config.minival = dict(**eval_common)
  config.minival.dataset = 'imagenet2012'
  config.minival.split = 'train[99%:]'
  config.minival.prefix = 'minival_'

  config.val = dict(**eval_common)
  config.val.dataset = 'imagenet2012'
  config.val.split = 'validation'
  config.val.prefix = 'val_'

  config.real = dict(**eval_common)
  config.real.dataset = 'imagenet2012_real'
  config.real.split = 'validation'
  config.real.pp_fn = pp_eval.format(lbl='real_label')
  config.real.prefix = 'real_'

  config.v2 = dict(**eval_common)
  config.v2.dataset = 'imagenet_v2'
  config.v2.split = 'test'
  config.v2.prefix = 'v2_'

  return config

Listing 1. Full recommended config.

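Similarly, config.mixup = dict(p=0.2, fold_in=None) switches Mixup [21] on with probability 0.2. The sketch below is an assumption-level illustration of that behaviour: the Beta parameter a=0.2 is a placeholder (the note only specifies the application probability), and big_vision's actual implementation may pair examples differently.

import jax
import jax.numpy as jnp

def maybe_mixup(rng, images, labels, p=0.2, a=0.2):
  # labels are one-hot, as produced by the onehot(...) preprocessing op.
  rng_apply, rng_lam = jax.random.split(rng)
  lam = jax.random.beta(rng_lam, a, a)                           # mixing coefficient in [0, 1]
  lam = jnp.where(jax.random.bernoulli(rng_apply, p), lam, 1.0)  # apply with probability p
  # Mix each example with another one from the same batch (reversed order).
  images = lam * images + (1.0 - lam) * images[::-1]
  labels = lam * labels + (1.0 - lam) * labels[::-1]
  return images, labels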