Normalizer Free Networks

Abstract

Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples. Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7× faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%. In addition, Normalizer-Free models attain significantly better performance than their batch-normalized counterparts when fine-tuning on ImageNet after large-scale pre-training on a dataset of 300 million labeled images, with our best models obtaining an accuracy of 89.2%.²

Figure 1. ImageNet Validation Accuracy vs Training Latency. All numbers are single-model, single crop. Our NFNet-F1 model achieves comparable accuracy to an EffNet-B7 while being 8.7× faster to train. Our NFNet-F5 model has similar training latency to EffNet-B7, but achieves a state-of-the-art 86.0% top-1 accuracy on ImageNet. We further improve on this using Sharpness Aware Minimization (Foret et al., 2021) to achieve 86.5% top-1 accuracy. [The plot shows top-1 accuracy against training latency (s/step) on TPUv3 at a per-device batch size of 32, for NFNet-F0 through F5 and for EffNet, LambdaNet, BoTNet, and DeIT baselines.]

¹DeepMind, London, United Kingdom. Correspondence to: Andrew Brock <ajbrock@google.com>.
²Code available at https://github.com/deepmind/deepmind-research/tree/master/nfnets
1. Introduction

The vast majority of recent models in computer vision are variants of deep residual networks (He et al., 2016b;a), trained with batch normalization (Ioffe & Szegedy, 2015). The combination of these two architectural innovations has enabled practitioners to train significantly deeper networks which can achieve higher accuracies on both the training set and the test set. Batch normalization also smoothens the loss landscape (Santurkar et al., 2018), which enables stable training with larger learning rates and at larger batch sizes (Bjorck et al., 2018; De & Smith, 2020), and it can have a regularizing effect (Hoffer et al., 2017; Luo et al., 2018).

However, batch normalization has three significant practical disadvantages. First, it is a surprisingly expensive computational primitive, which incurs memory overhead (Rota Bulò et al., 2018), and significantly increases the time required to evaluate the gradient in some networks (Gitman & Ginsburg, 2017). Second, it introduces a discrepancy between the behaviour of the model during training and at inference time (Summers & Dinneen, 2019; Singh & Shrivastava, 2019), introducing hidden hyper-parameters that have to be tuned. Third, and most importantly, batch normalization breaks the independence between training examples in the minibatch.

This third property has a range of negative consequences. For instance, practitioners have found that batch normalized networks are often difficult to replicate precisely on different hardware, and batch normalization is often the cause of subtle implementation errors, especially during distributed training (Pham et al., 2019). Furthermore, batch normalization cannot be used for some tasks, since the interaction between training examples in a batch enables the network to 'cheat' certain loss functions. For example, batch normalization requires specific care to prevent information leakage in
some contrastive learning algorithms (Chen et al., 2020; He et al., 2020). This is a major concern for sequence modeling tasks as well, which has driven language models to adopt alternative normalizers (Ba et al., 2016; Vaswani et al., 2017). The performance of batch-normalized networks can also degrade if the batch statistics have a large variance during training (Shen et al., 2020). Finally, the performance of batch normalization is sensitive to the batch size, and batch normalized networks perform poorly when the batch size is too small (Hoffer et al., 2017; Ioffe, 2017; Wu & He, 2018), which limits the maximum model size we can train on finite hardware. We expand on the challenges associated with batch normalization in Appendix B.

Therefore, although batch normalization has enabled the deep learning community to make substantial gains in recent years, we anticipate that in the long term it is likely to impede progress. We believe the community should seek to identify a simple alternative which achieves competitive test accuracies and can be used for a wide range of tasks. Although a number of alternative normalizers have been proposed (Ba et al., 2016; Wu & He, 2018; Huang et al., 2020), these alternatives often achieve inferior test accuracies and introduce their own disadvantages, such as additional compute costs at inference. Fortunately, in recent years two promising research themes have emerged. The first studies the origin of the benefits of batch normalization during training (Balduzzi et al., 2017; Santurkar et al., 2018; Bjorck et al., 2018; Luo et al., 2018; Yang et al., 2019; Jacot et al., 2019; De & Smith, 2020), while the second seeks to train deep ResNets to competitive accuracies without normalization layers (Hanin & Rolnick, 2018; Zhang et al., 2019a; De & Smith, 2020; Shao et al., 2020; Brock et al., 2021).

A key theme in many of these works is that it is possible to train very deep ResNets without normalization by suppressing the scale of the hidden activations on the residual branch. The simplest way to achieve this is to introduce a learnable scalar at the end of each residual branch, initialized to zero (Goyal et al., 2017; Zhang et al., 2019a; De & Smith, 2020; Bachlechner et al., 2020). However this trick alone is not sufficient to obtain competitive test accuracies on challenging benchmarks. Another line of work has shown that ReLU activations introduce a 'mean shift', which causes the hidden activations of different training examples to become increasingly correlated as the network depth increases (Huang et al., 2017; Jacot et al., 2019). In a recent work, Brock et al. (2021) introduced "Normalizer-Free" ResNets, which suppress the residual branch at initialization and apply Scaled Weight Standardization (Qiao et al., 2019) to remove the mean shift. With additional regularization, these unnormalized networks match the performance of batch-normalized ResNets (He et al., 2016a) on ImageNet (Russakovsky et al., 2015), but they are not stable at large batch sizes and do not match the performance of EfficientNets (Tan & Le, 2019), the current state of the art (Gong et al., 2020). This paper builds on this line of work and seeks to address these central limitations. Our main contributions are as follows:

• We propose Adaptive Gradient Clipping (AGC), which clips gradients based on the unit-wise ratio of gradient norms to parameter norms, and we demonstrate that AGC allows us to train Normalizer-Free Networks with larger batch sizes and stronger data augmentations.

• We design a family of Normalizer-Free ResNets, called NFNets, which set new state-of-the-art validation accuracies on ImageNet for a range of training latencies (see Figure 1). Our NFNet-F1 model achieves similar accuracy to EfficientNet-B7 while being 8.7× faster to train, and our largest model sets a new overall state of the art without extra data of 86.5% top-1 accuracy.

• We show that NFNets achieve substantially higher validation accuracies than batch-normalized networks when fine-tuning on ImageNet after pre-training on a large private dataset of 300 million labelled images. Our best model achieves 89.2% top-1 after fine-tuning.

The paper is structured as follows. We discuss the benefits of batch normalization in Section 2, and recent work seeking to train ResNets without normalization in Section 3. We introduce AGC in Section 4, and we describe how we developed our new state-of-the-art architectures in Section 5. Finally, we present our experimental results in Section 6.

2. Understanding Batch Normalization

In order to train networks without normalization to competitive accuracy, we must understand the benefits batch normalization brings during training, and identify alternative strategies to recover these benefits. Here we list the four main benefits which have been identified by prior work.

Batch normalization downscales the residual branch: The combination of skip connections (Srivastava et al., 2015; He et al., 2016b;a) and batch normalization (Ioffe & Szegedy, 2015) enables us to train significantly deeper networks with thousands of layers (Zhang et al., 2019a). This benefit arises because batch normalization, when placed on the residual branch (as is typical), reduces the scale of hidden activations on the residual branches at initialization (De & Smith, 2020). This biases the signal towards the skip path, which ensures that the network has well-behaved gradients early in training, enabling efficient optimization (Balduzzi et al., 2017; Hanin & Rolnick, 2018; Yang et al., 2019).

Batch normalization eliminates mean-shift: Activation functions like ReLUs or GELUs (Hendrycks & Gimpel, 2016), which are not anti-symmetric, have non-zero mean activations. Consequently, the inner product between the
activations of independent training examples immediately after the non-linearity is typically large and positive, even if the inner product between the input features is close to zero. This issue compounds as the network depth increases, and introduces a 'mean-shift' in the activations of different training examples on any single channel proportional to the network depth (De & Smith, 2020), which can cause deep networks to predict the same label for all training examples at initialization (Jacot et al., 2019). Batch normalization ensures the mean activation on each channel is zero across the current batch, eliminating mean shift (Brock et al., 2021).

Batch normalization has a regularizing effect: It is widely believed that batch normalization also acts as a regularizer enhancing test set accuracy, due to the noise in the batch statistics which are computed on a subset of the training data (Luo et al., 2018). Consistent with this perspective, the test accuracy of batch-normalized networks can often be improved by tuning the batch size, or by using ghost batch normalization in distributed training (Hoffer et al., 2017).

Batch normalization allows efficient large-batch training: Batch normalization smoothens the loss landscape (Santurkar et al., 2018), and this increases the largest stable learning rate (Bjorck et al., 2018). While this property does not have practical benefits when the batch size is small (De & Smith, 2020), the ability to train at larger learning rates is essential if one wishes to train efficiently with large batch sizes. Although large-batch training does not achieve higher test accuracies within a fixed epoch budget (Smith et al., 2020), it does achieve a given test accuracy in fewer parameter updates, significantly improving training speed when parallelized across multiple devices (Goyal et al., 2017).

3. Towards Removing Batch Normalization

Many authors have attempted to train deep ResNets to competitive accuracies without normalization, by recovering one or more of the benefits of batch normalization described above. Most of these works suppress the scale of the activations on the residual branch at initialization, by introducing either small constants or learnable scalars (Hanin & Rolnick, 2018; Zhang et al., 2019a; De & Smith, 2020; Shao et al., 2020). Additionally, Zhang et al. (2019a) and De & Smith (2020) observed that the performance of unnormalized ResNets can be improved with additional regularization. However only recovering these two benefits of batch normalization is not sufficient to achieve competitive test accuracies on challenging benchmarks (De & Smith, 2020).

In this work, we adopt and build on "Normalizer-Free ResNets" (NF-ResNets) (Brock et al., 2021), a class of pre-activation ResNets (He et al., 2016a) which can be trained to competitive training and test accuracies without normalization layers. NF-ResNets employ a residual block of the form h_{i+1} = h_i + α f_i(h_i/β_i), where h_i denotes the inputs to the ith residual block, and f_i denotes the function computed by the ith residual branch. The function f_i is parameterized to be variance preserving at initialization, such that Var(f_i(z)) = Var(z) for all i. The scalar α specifies the rate at which the variance of the activations increases after each residual block (at initialization), and is typically set to a small value like α = 0.2. The scalar β_i is determined by predicting the standard deviation of the inputs to the ith residual block, β_i = √Var(h_i), where Var(h_{i+1}) = Var(h_i) + α², except for transition blocks (where spatial downsampling occurs), for which the skip path operates on the downscaled input (h_i/β_i), and the expected variance is reset after the transition block to Var(h_{i+1}) = 1 + α². The outputs of squeeze-excite layers (Hu et al., 2018) are multiplied by a factor of 2. Empirically, Brock et al. (2021) found it was also beneficial to include a learnable scalar initialized to zero at the end of each residual branch ('SkipInit' (De & Smith, 2020)).
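As a concrete illustration, the following is a minimal sketch of this residual form and its analytic variance bookkeeping. The residual branches f_i are left as placeholders (in the real networks they are built from the Scaled Weight Standardized convolutions described next), and the transition-block handling and SkipInit gain are only noted in comments.

```python
import jax.numpy as jnp

def nf_stage(h, branches, alpha=0.2, expected_var=1.0):
    """One stage of Normalizer-Free residual blocks: h_{i+1} = h_i + alpha * f_i(h_i / beta_i).

    `branches` is a list of residual functions f_i, assumed variance-preserving at
    initialisation (placeholders here). `expected_var` tracks Var(h_i) analytically.
    """
    for f in branches:
        beta = jnp.sqrt(expected_var)              # beta_i = sqrt(Var(h_i)), predicted analytically
        h = h + alpha * f(h / beta)                # residual update
        expected_var = expected_var + alpha ** 2   # Var(h_{i+1}) = Var(h_i) + alpha^2
    return h, expected_var

# Toy usage: identity branches are trivially variance preserving.
h = jnp.ones((8, 16))
h, var = nf_stage(h, branches=[lambda x: x] * 3)
# At a transition (downsampling) block the skip path also operates on h / beta, and the
# expected variance is reset to 1 + alpha**2 afterwards. In the real blocks a learnable
# SkipInit scalar, initialised to zero, additionally multiplies the branch output.
```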
In addition, Brock et al. (2021) prevent the emergence of a mean-shift in the hidden activations by introducing Scaled Weight Standardization (a minor modification of Weight Standardization (Huang et al., 2017; Qiao et al., 2019)). This technique reparameterizes the convolutional layers as:

Ŵ_ij = (W_ij − μ_i) / (√N σ_i),    (1)

where μ_i = (1/N) Σ_j W_ij, σ_i² = (1/N) Σ_j (W_ij − μ_i)², and N denotes the fan-in. The activation functions are also scaled by a non-linearity specific scalar gain γ, which ensures that the combination of the γ-scaled activation function and a Scaled Weight Standardized layer is variance preserving. For ReLUs, γ = √(2/(1 − (1/π))) (Arpit et al., 2016). We refer the reader to Brock et al. (2021) for a description of how to compute γ for other non-linearities.
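A minimal sketch of this reparameterization and the γ-scaled ReLU, assuming a weight stored as [output units, fan-in] (a convolution kernel would first be reshaped so the channel and spatial dimensions form the fan-in axis). The small eps constant is our own addition for numerical stability and is not part of Eq. (1).

```python
import jax
import jax.numpy as jnp

def scaled_weight_standardization(w, eps=1e-5):
    """W_hat_ij = (W_ij - mu_i) / (sqrt(N) * sigma_i), statistics over the fan-in axis j."""
    fan_in = w.shape[-1]
    mu = jnp.mean(w, axis=-1, keepdims=True)
    sigma = jnp.std(w, axis=-1, keepdims=True)
    return (w - mu) / (jnp.sqrt(fan_in) * (sigma + eps))   # eps: assumed small stability constant

def scaled_relu(x):
    """ReLU scaled by gamma = sqrt(2 / (1 - 1/pi)) so that the combination with a
    Scaled Weight Standardized layer is variance preserving."""
    gamma = jnp.sqrt(2.0 / (1.0 - 1.0 / jnp.pi))
    return gamma * jnp.maximum(x, 0.0)

# Usage on a random fully connected layer:
w = jax.random.normal(jax.random.PRNGKey(0), (256, 128))
x = jax.random.normal(jax.random.PRNGKey(1), (32, 128))
h = scaled_relu(x @ scaled_weight_standardization(w).T)
```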
With additional regularization (Dropout (Srivastava et al., 2014) and Stochastic Depth (Huang et al., 2016)), Normalizer-Free ResNets match the test accuracies achieved by batch normalized pre-activation ResNets on ImageNet at batch size 1024. They also significantly outperform their batch normalized counterparts when the batch size is very small, but they perform worse than batch normalized networks for large batch sizes (4096 or higher). Crucially, they do not match the performance of state-of-the-art networks like EfficientNets (Tan & Le, 2019; Gong et al., 2020).

4. Adaptive Gradient Clipping for Efficient Large-Batch Training

To scale NF-ResNets to larger batch sizes, we explore a range of gradient clipping strategies (Pascanu et al., 2013). Gradient clipping is often used in language modeling to stabilize training (Merity et al., 2018), and recent work shows that it allows training with larger learning rates than unclipped training (Zhang et al., 2020), which is particularly important for poorly conditioned loss landscapes or when training with large batch sizes, since in these settings the optimal learning rate is constrained by the maximum stable learning rate (Smith et al., 2020). We therefore hypothesize that gradient clipping should help scale NF-ResNets efficiently to the large-batch setting.
Gradient clipping is typically performed by constraining the norm of the gradient (Pascanu et al., 2013). Specifically, for gradient vector G = ∂L/∂θ, where L denotes the loss and θ denotes a vector with all model parameters, the standard clipping algorithm clips the gradient before updating θ as:

G → λ G/‖G‖ if ‖G‖ > λ, and G → G otherwise.    (2)

The clipping threshold λ is a hyper-parameter which must be tuned. Empirically, we found that while this clipping algorithm enabled us to train at higher batch sizes than before, training stability was extremely sensitive to the choice of the clipping threshold, requiring fine-grained tuning when varying the model depth, the batch size, or the learning rate.
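A minimal sketch of this standard rule, treating all parameters as a single flattened gradient vector (written here against JAX pytrees, matching the framework used by our training code):

```python
import jax
import jax.numpy as jnp

def clip_by_global_norm(grads, lam):
    """Standard clipping (Eq. 2): rescale G to norm lam when ||G|| > lam, else leave it unchanged.
    `grads` is a pytree of arrays treated as one flattened vector."""
    leaves = jax.tree_util.tree_leaves(grads)
    global_norm = jnp.sqrt(sum(jnp.sum(g ** 2) for g in leaves))
    scale = jnp.where(global_norm > lam, lam / global_norm, 1.0)
    return jax.tree_util.tree_map(lambda g: g * scale, grads)
```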
To overcome this issue, we introduce "Adaptive Gradient Clipping" (AGC), which we now describe. Let W^ℓ ∈ R^{N×M} denote the weight matrix of the ℓth layer, G^ℓ ∈ R^{N×M} denote the gradient with respect to W^ℓ, and ‖·‖_F denote the Frobenius norm, i.e., ‖W^ℓ‖_F = √(Σ_i^N Σ_j^M (W^ℓ_{i,j})²). The AGC algorithm is motivated by the observation that the ratio of the norm of the gradients G^ℓ to the norm of the weights W^ℓ of layer ℓ, ‖G^ℓ‖_F / ‖W^ℓ‖_F, provides a simple measure of how much a single gradient descent step will change the original weights W^ℓ. For instance, if we train using gradient descent without momentum, then ‖ΔW^ℓ‖ / ‖W^ℓ‖ = h ‖G^ℓ‖_F / ‖W^ℓ‖_F, where the parameter update for the ℓth layer is ΔW^ℓ = −h G^ℓ and h is the learning rate. Motivated by this, AGC clips gradients based on the unit-wise ratio of gradient norms to parameter norms, where the unit-wise norms are computed over the fan-in extent (including the channel and spatial dimensions).
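The sketch below implements unit-wise adaptive clipping as described above, using the λ = 0.01 and ε = 10⁻³ values quoted in Appendix A.1. It assumes each weight is stored with its output-unit axis first (transpose or adjust the axes for other layouts); the clamp on the parameter norm and the small floor on the gradient norm are our own guards against degenerate divisions.

```python
import jax
import jax.numpy as jnp

def unitwise_norm(x):
    """Frobenius norm over the fan-in extent: all axes except the leading (output unit) axis."""
    if x.ndim <= 1:                                   # biases, gains
        return jnp.abs(x)
    axes = tuple(range(1, x.ndim))                    # channel and spatial dimensions
    return jnp.sqrt(jnp.sum(x ** 2, axis=axes, keepdims=True))

def adaptive_gradient_clip(grads, params, lam=0.01, eps=1e-3):
    """Sketch of AGC: clip each unit's gradient whenever its norm exceeds lam times
    the corresponding parameter norm (clamped below by eps)."""
    def clip_one(g, w):
        w_norm = jnp.maximum(unitwise_norm(w), eps)
        g_norm = unitwise_norm(g)
        clipped = g * (lam * w_norm / jnp.maximum(g_norm, 1e-6))
        return jnp.where(g_norm > lam * w_norm, clipped, g)
    return jax.tree_util.tree_map(clip_one, grads, params)
```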
Using AGC, we can train NF-ResNets stably with larger batch sizes (up to 4096), as well as with very strong data augmentations like RandAugment (Cubuk et al., 2020) for which NF-ResNets without AGC fail to train (Brock et al., 2021). Note that the optimal clipping parameter λ may depend on the choice of optimizer, learning rate and batch size. Empirically, we find λ should be smaller for larger batches.

AGC is closely related to a recent line of work studying "normalized optimizers" (You et al., 2017; Bernstein et al., 2020; You et al., 2019), which ignore the scale of the gradient by choosing an adaptive learning rate inversely proportional to the gradient norm. In particular, You et al. (2017) propose LARS, a momentum variant which sets the norm of the parameter update to be a fixed ratio of the parameter norm, completely ignoring the gradient magnitude. AGC can be interpreted as a relaxation of normalized optimizers, which imposes a maximum update size based on the parameter norm but does not simultaneously impose a lower-bound on the update size or ignore the gradient magnitude. Although we are also able to stably train at high batch sizes with LARS, we found that doing so degrades performance.

Figure 2. (a) AGC efficiently scales NF-ResNets to larger batch sizes. (b) The performance across different clipping thresholds λ. [Both panels plot ImageNet top-1 accuracy, against batch size in (a) and clipping threshold in (b).]

Figure 2(b) shows performance for different clipping thresholds λ across a range of batch sizes on ResNet50. We see that smaller (stronger) clipping thresholds are necessary for stability at higher batch sizes. We provide additional ablation details in Appendix D.

Next, we study whether or not AGC is beneficial for all layers. Using batch size 4096 and a clipping threshold λ = 0.01, we remove AGC from different combinations of the first convolution, the final linear layer, and every block in any given set of the residual stages. For example, one experiment may remove clipping in the linear layer and all the blocks in the second and fourth stages. Two key trends emerge: first, it is always better to not clip the final linear layer. Second, it is often possible to train stably without clipping the initial convolution, but the weights of all four stages must be clipped to achieve stability when training at batch size 4096 with the default learning rate of 1.6. For the rest of this paper (and for our ablations in Figure 2), we apply AGC to every layer except for the final linear layer.
5. Normalizer-Free Architectures with Improved Accuracy and Training Speed

In the previous section we introduced AGC, a gradient clipping method which allows us to train efficiently with large batch sizes and strong data augmentations. Equipped with this technique, we now seek to design Normalizer-Free architectures with state-of-the-art accuracy and training speed.

The current state of the art on image classification is generally held by the EfficientNet family of models (Tan & Le, 2019), which are based on a variant of inverted bottleneck blocks (Sandler et al., 2018) with a backbone and model scaling strategy derived from neural architecture search. These models are optimized to maximize test accuracy while minimizing parameter and FLOP counts, but their low theoretical compute complexity does not translate into improved training speed on modern accelerators. Despite having 10× fewer FLOPS than a ResNet-50, an EffNet-B0 has similar training latency and final performance when trained on GPU or TPU.

The choice of which metric to optimize (theoretical FLOPS, inference latency on a target device, or training latency on an accelerator) is a matter of preference, and the nature of each metric will yield different design requirements. In this work we choose to focus on manually designing models which are optimized for training latency on existing accelerators, as in Radosavovic et al. (2020). It is possible that future accelerators may be able to take full advantage of the potential training speed that largely goes unrealized with models like EfficientNets, so we believe this direction should not be ignored (Hooker, 2020); however, we anticipate that developing models with improved training speed on current hardware will be beneficial for accelerating research. We note that accelerators like GPUs and TPUs tend to favor dense computation, and while there are differences between these two platforms, they have enough in common that models designed for one device are likely to train fast on the other.

We therefore explore the space of model design by manually searching for design trends which yield improvements to the Pareto front of holdout top-1 accuracy on ImageNet against actual training latency on device. This section describes the changes which we found to work well to this end (with more details in Appendix C), while the ideas which we found to work poorly are described in Appendix E. A summary of these modifications is presented in Figure 3, and the effect they have on holdout accuracy is presented in Table 2.

Figure 3. Summary of NFNet bottleneck block design and architectural differences. See Figure 5 in Appendix C for more details. [The diagram compares stage widths (ResNet: [256, 512, 1024, 2048]; NFNet: [256, 512, 1536, 1536]) and stage depths (ResNet: [3, 4, 6, 3], [3, 4, 23, 3], ...; NFNet: [1, 2, 6, 3] × N).]

Table 1. NFNet family depths, drop rates, and input resolutions.

Variant   Depth              Dropout   Train    Test
F0        [1, 2, 6, 3]       0.2       192px    256px
F1        [2, 4, 12, 6]      0.3       224px    320px
F2        [3, 6, 18, 9]      0.4       256px    352px
F3        [4, 8, 24, 12]     0.4       320px    416px
F4        [5, 10, 30, 15]    0.5       384px    512px
F5        [6, 12, 36, 18]    0.5       416px    544px
F6        [7, 14, 42, 21]    0.5       448px    576px
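The depth column follows a single pattern: variant FN uses the F0 stage depths [1, 2, 6, 3] multiplied by N + 1, as the short snippet below reproduces.

```python
# Depth pattern for NFNet variants: F0 uses stage depths [1, 2, 6, 3] and each
# subsequent variant FN multiplies this pattern by N + 1 (Table 1 above).
base = [1, 2, 6, 3]
for n in range(7):  # F0 ... F6
    print(f"F{n}:", [(n + 1) * d for d in base])
```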
We begin with an SE-ResNeXt-D model (Xie et al., 2017; Hu et al., 2018; He et al., 2019) with GELU activations (Hendrycks & Gimpel, 2016), which we found to be a surprisingly strong baseline for Normalizer-Free Networks. We make the following changes. First, we set the group width (the number of channels each output unit is connected to) in the 3 × 3 convs to 128, regardless of block width. Smaller group widths reduce theoretical FLOPS, but the reduction in compute density means that on many modern accelerators no actual speedup is realized. On TPUv3 for example, an SE-ResNeXt-50 with a group width of 8 trains at the same speed as an SE-ResNeXt-50 with a group width of 128 unless the per-device batch size is 128 or larger (Google, 2021), which is often not realizable due to memory constraints.

Next, we make two changes to the model backbone. First, we note that the default depth scaling pattern for ResNets (e.g., the method by which one increases depth to construct a ResNet101 or ResNet200 from a ResNet50) involves non-uniformly increasing the number of layers in the second
Table 3. ImageNet Accuracy comparison for NFNets and a representative set of models, including SENet (Hu et al., 2018), LambdaNet (Bello, 2021), BoTNet (Srinivas et al., 2021), and DeIT (Touvron et al., 2020). Except for results using SAM, our results are averaged over three random seeds. Latencies are given as the time in milliseconds required to perform a single full training step on TPU or GPU (V100).

clipping threshold of 0.01, and a learning rate which linearly increases from 0 to 1.6 over 5 epochs, before decaying to zero with cosine annealing (Loshchilov & Hutter, 2017). From the first three rows of Table 2, we can see that the two changes we make to the model each result in slight improvements to performance with only minor changes in training latency (see Table 6 in the Appendix for latencies).

Next, we evaluate the effects of progressively adding stronger augmentations, combining MixUp (Zhang et al., 2017), RandAugment (RA, (Cubuk et al., 2020)) and CutMix (Yun et al., 2019). We apply RA with 4 layers and scale the magnitude with the resolution of the images, following Cubuk et al. (2020). We find that this scaling is particularly important, as if the magnitude is set too high relative to the image size (for example, using a magnitude of 20 on images of resolution 224) then most of the augmented images will be completely blank. See Appendix A for a complete description of these magnitudes and how they are selected. We show in Table 2 that these data augmentations substantially improve performance. Finally, in the last row of Table 2, we additionally present the performance of our full model ablated to use the default ResNet stage widths, demonstrating that our slightly modified pattern in the third and fourth stages does yield improvements under direct comparison.

For completeness, in Table 6 of the Appendix we also report the performance of our model architectures when trained with batch normalization instead of the NF strategy. These models achieve slightly lower test accuracies than their NF counterparts and they are between 20% and 40% slower to train, even when using highly optimized batch normalization implementations without cross-replica syncing. Furthermore, we found that the larger model variants F4 and F5
References

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-Milne, S. JAX: composable transformations of Python+NumPy programs, 2018.

Gueguen, L., Sergeev, A., Kadlec, B., Liu, R., and Yosinski, J. Faster neural networks straight from jpeg. Advances in Neural Information Processing Systems, 31:3933–3944, 2018.

Hanin, B. and Rolnick, D. How to start training: The effect of initialization and architecture. In Advances in Neural Information Processing Systems, pp. 571–581, 2018.

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with numpy. Nature, 585(7825):357–362, Sep 2020. ISSN 1476-4687.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016b.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.

He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., and Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567, 2019.

Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

Hennigan, T., Cai, T., Norman, T., and Babuschkin, I. Haiku: Sonnet for JAX, 2020. URL http://github.com/deepmind/dm-haiku.

Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1731–1741, 2017.

Hooker, S. The hardware lottery. arXiv preprint arXiv:2009.06489, 2020.

Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. In European conference on computer vision, pp. 646–661. Springer, 2016.

Huang, L., Liu, X., Liu, Y., Lang, B., and Tao, D. Centered weight normalization in accelerating training of deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2803–2811, 2017.

Huang, L., Qin, J., Zhou, Y., Zhu, F., Liu, L., and Shao, L. Normalization techniques in training dnns: Methodology, analysis and application. arXiv preprint arXiv:2009.12836, 2020.

Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. arXiv preprint arXiv:1702.03275, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Jacot, A., Gabriel, F., and Hongler, C. Freeze and chaos for dnns: an ntk view of batch normalization, checkerboard and boundary effects. arXiv preprint arXiv:1907.05715, 2019.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Large scale learning of general visual representations for transfer. arXiv preprint arXiv:1912.11370, 2019.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.

LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.

Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Luo, P., Wang, X., Shao, W., and Peng, Z. Towards understanding regularization in batch normalization. arXiv preprint arXiv:1809.00846, 2018.

Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and Van Der Maaten, L. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision ECCV, pp. 181–196, 2018.
Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018.

Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN USSR, pp. (269), 543–547, 1983.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318, 2013.

Pham, H., Xie, Q., Dai, Z., and Le, Q. V. Meta pseudo labels. arXiv preprint arXiv:2003.10580, 2020.

Pham, H. V., Lutellier, T., Qi, W., and Tan, L. Cradle: cross-backend validation to detect and localize bugs in deep learning libraries. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 1027–1038. IEEE, 2019.

Polyak, B. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, pp. 4(5):1–17, 1964.

Qiao, S., Wang, H., Liu, C., Shen, W., and Yuille, A. Weight standardization. arXiv preprint arXiv:1903.10520, 2019.

Qin, J., Fang, J., Zhang, Q., Liu, W., Wang, X., and Wang, X. Resizemix: Mixing data with preserved object information and true labels. arXiv preprint arXiv:2012.11101, 2020.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International Conference on Learning Representations, ICLR, 2016.

Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428–10436, 2020.

Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems, 30:6076–6085, 2017a.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In International conference on machine learning, pp. 2847–2854. PMLR, 2017b.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 22(3):400–407, 1951.

Rota Bulò, S., Porzi, L., and Kontschieder, P. In-place activated batchnorm for memory-optimized training of dnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5639–5647, 2018.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. IJCV, 115:211–252, 2015.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520, 2018.

Sandler, M., Baccash, J., Zhmoginov, A., and Howard, A. Non-discriminative data or weak model? on the relative importance of data and model resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0, 2019.

Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493, 2018.

Shao, J., Hu, K., Wang, C., Xue, X., and Raj, B. Is normalization indispensable for training deep neural network? Advances in Neural Information Processing Systems, 33, 2020.

Shen, S., Yao, Z., Gholami, A., Mahoney, M., and Keutzer, K. Powernorm: Rethinking batch normalization in transformers. In International Conference on Machine Learning, pp. 8741–8751. PMLR, 2020.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR, 2015.

Singh, S. and Shrivastava, A. Evalnorm: Estimating batch normalization statistics for evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3633–3641, 2019.

Smith, S., Elsen, E., and De, S. On the generalization benefit of noise in stochastic gradient descent. In International Conference on Machine Learning, pp. 9058–9067. PMLR, 2020.

Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P., and Vaswani, A. Bottleneck transformers for visual recognition. arXiv preprint arXiv:2101.11605, 2021.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

Summers, C. and Dinneen, M. J. Four things everyone should know to improve batch normalization. arXiv preprint arXiv:1906.03548, 2019.

Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147, 2013.

Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016a.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826, 2016b.

Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114, 2019.

Tieleman, T. and Hinton, G. Rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, pp. 4(2):26–31, 2012.

Touvron, H., Vedaldi, A., Douze, M., and Jégou, H. Fixing the train-test resolution discrepancy. In Advances in Neural Information Processing Systems, pp. 8252–8262, 2019.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.

Wu, Y. and He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.

Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698, 2020.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500, 2017.

Yang, G., Pennington, J., Rao, V., Sohl-Dickstein, J., and Schoenholz, S. S. A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129, 2019.

You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.

You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training bert in 76 minutes. In 7th International Conference on Learning Representations, ICLR, 2019.

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6023–6032, 2019.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Zhang, H., Dauphin, Y. N., and Ma, T. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019a.

Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. In International conference on machine learning, pp. 7354–7363. PMLR, 2019b.

Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In 8th International Conference on Learning Representations, ICLR, 2020. URL https://openreview.net/forum?id=BJgnXpVYwS.
A. Experiment Details

A.1. ImageNet Experiment Settings

Figure 4. ImageNet Validation Accuracy vs. Test GFLOPs. All numbers are single-model, single crop. Our NFNet models are competitive with large EfficientNet variants for a given FLOPs budget, despite being optimized for training latency.

For ImageNet experiments (Russakovsky et al., 2015), we train on the standard ILSVRC2012 training split, which comprises 1281167 images from 1000 classes. Our baseline training preprocessing follows Szegedy et al. (2016b), with distorted bounding box crops and random horizontal flips (Simonyan & Zisserman, 2015), with all other augmentations being applied in addition to this. We train using the categorical softmax cross-entropy loss with label smoothing of 0.1 (Szegedy et al., 2016b), and optimize our networks using stochastic gradient descent (Robbins & Monro, 1951) with Nesterov's momentum (Nesterov, 1983; Sutskever et al., 2013), using a momentum coefficient of 0.9. Our training code is available at https://github.com/deepmind/deepmind-research/tree/master/nfnets, and is written using numpy (Harris et al., 2020), JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020), and the DeepMind JAX Ecosystem (Babuschkin et al., 2020).

We employ weight decay in the standard style (not decoupled as in Loshchilov & Hutter (2017)), with a weight decay coefficient of 2 × 10⁻⁵ for NFNets. Critically, weight decay is not applied to the affine gains or biases in the weight-standardized convolutional layers, or to the SkipInit gains. We apply a Dropout rate specific to each NFNet variant as in Tan & Le (2019), and use Stochastic Depth with a rate of 0.25 for all variants, again similar to Tan & Le (2019).

We use a learning rate which warms up from 0 to its maximal value over the first 5 epochs, where the maximal value is chosen as 0.1 × B/256, with B the batch size, following Goyal et al. (2017). After warmup, the learning rate is annealed to zero with cosine decay over the rest of training (Loshchilov & Hutter, 2016). We employ AGC with λ = 0.01 and ε = 10⁻³ for every parameter except the fully-connected weight of the linear classifier layer.
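A sketch of this schedule (the function and argument names are illustrative):

```python
import math

def nfnet_learning_rate(step, total_steps, batch_size, warmup_epochs=5, steps_per_epoch=None):
    """Linear warmup from 0 to 0.1 * B / 256 over the first 5 epochs, then cosine decay to zero."""
    if steps_per_epoch is None:
        steps_per_epoch = 1281167 // batch_size        # ImageNet training set size
    max_lr = 0.1 * batch_size / 256.0
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# e.g. batch size 4096 for 360 epochs gives roughly 112,000 steps in total.
```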
By default, we train with a batch size of 4096 for 360 epochs, a common training schedule which has the same number of total training steps (roughly 112,000) as training with a batch size of 1024 for 90 epochs. We found that training for longer sometimes improved results, but that this was not always consistent across models or training settings; all results reported in this work employ the 360 epoch schedule. Unlike Tan & Le (2019) we do not perform early stopping. We employ an exponential moving average of the model parameters (similar to Polyak averaging (Polyak, 1964)), with a decay rate of 0.99999 which, following Tan & Le (2019), follows a warmup schedule where the decay is equal to min(0.99999, (1 + t)/(10 + t)).
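A sketch of the warmed-up EMA decay and parameter average (parameter containers are simplified to flat dicts of arrays or floats):

```python
def ema_decay(t, max_decay=0.99999):
    """Warmed-up EMA decay from the text: min(0.99999, (1 + t) / (10 + t)) at step t."""
    return min(max_decay, (1.0 + t) / (10.0 + t))

def ema_update(ema_params, params, t):
    """Exponential moving average of model parameters (a sketch)."""
    d = ema_decay(t)
    return {k: d * ema_params[k] + (1.0 - d) * params[k] for k in params}
```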
We train on TPU using bfloat16 activations to save memory and improve speed. This means that we keep the parameters and optimizer state (the momentum buffer) in float32, but compute activations and gradients in bfloat16 during forward- and backpropagation. We cast the logits to float32 before computing the loss to aid numerical stability. We cast gradients back to float32 before summing them across devices, which helps prevent compounding accumulation error and ensures the parameter update is computed in float32.

For evaluation we follow the most common style of single-crop preprocessing: we resize the raw image (with bicubic interpolation) to be 32 pixels larger than the target resolution, then crop to the target resolution (Simonyan & Zisserman, 2015). While this is the most commonly employed variant, we note that an alternative method exists where a padded center crop is taken and then resized to the target resolution (Szegedy et al., 2016a; Tan & Le, 2019). We find this alternative to work marginally worse than the standard choice of resizing before cropping. No test time augmentation, multi-crop evaluation, or model ensembling is applied.
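A sketch of this evaluation preprocessing, under the assumption that "32 pixels larger than the target resolution" means resizing the shorter side to target_size + 32 before taking the centre crop:

```python
import jax
import jax.numpy as jnp

def eval_preprocess(image, target_size):
    """Bicubic resize so the short side is target_size + 32, then centre crop to
    target_size x target_size (expects an HWC image array)."""
    h, w, c = image.shape
    scale = (target_size + 32) / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = jax.image.resize(image.astype(jnp.float32), (new_h, new_w, c), method="bicubic")
    top = (new_h - target_size) // 2
    left = (new_w - target_size) // 2
    return resized[top:top + target_size, left:left + target_size, :]
```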
A.2. Measuring Training Latency

We measure training latency as the actual observed wall-clock time required to perform a training step at a given per-device batch size. To accomplish this, we run the full training loop for 5000 steps, then take the median time required to perform a single training step. We choose the median as the mean would also incorporate the initial speed ramp-up at the beginning of training, so the median is more
robust to these types of variations during measurement and better reflects the speed observed during a full training run.
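A sketch of this measurement procedure; `train_step` stands for any callable that performs one full update (with JAX, it should block on its outputs so that asynchronous dispatch does not hide the true step time):

```python
import time
import statistics

def measure_training_latency(train_step, num_steps=5000):
    """Time each training step and report the median, which ignores the initial ramp-up."""
    times = []
    for _ in range(num_steps):
        start = time.perf_counter()
        train_step()                       # should block until the step has finished
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```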
We remove dataloading as a consideration by having the training loop operate on tensors which are already loaded onto the device. This is consistent with how we train NFNets in practice, since our data pipeline is optimized to ensure we are never input-bound.

For measuring speed on TPUv3, we run on 32 devices with a batch size of 32 per device, and sync gradients between replicas, meaning that our training latency is representative of the actual speed we can obtain in practice with distributed training. We employ bfloat16 training for all models, as described above. For some of our larger models, this batch size of 32 per device does not fit into the 16GB of device memory, so we allow the compiler to engage automatic rematerialization (also known as gradient checkpointing). Additional speed may be obtainable by careful tuning of manual rematerialization.

For measuring speed on GPU, we run on a single V100 GPU using float16 training to engage the card's tensorcores, which strongly accelerates training. Unlike TPUv3, we do not consider the cost of cross-device communication for GPU, which will vary substantially depending on the hardware configuration of the interlinks available to the user. As with TPUv3, some of our models do not fit in memory at this batch size, but we instead employ gradient accumulation to mimic the full batch size. This appears to be less efficient than rematerialization for large models (specifically for our F5 variant and for EfficientNet-B7), so we expect that manually applying rematerialization would potentially yield GPU speedups in this case, but require extra engineering effort.

We report results from our own measurements for all models except for SENets (Hu et al., 2018), BoTNets (Srinivas et al., 2021), and DeIT (Touvron et al., 2020), which we instead borrow from Srinivas et al. (2021). We report slightly different training latencies for small EfficientNet variants because we report the wallclock time, whereas Srinivas et al. (2021) report the "compute time" which will ignore cross-device communication. For very small models the inter-device communication costs can be non-negligible relative to the compute time, especially for EfficientNets which employ cross-replica batch normalization. For larger models this cost is generally negligible on hardware like TPUv3 with very fast interconnects, so in practice one can expect that the compute time for models like BoTNets will be the same regardless of the reporting methodology used.

Table 5. Comparing ImageNet transfer performance for models which use extra data for large-scale pre-training. Meta-Pseudo-Labels results are from Pham et al. (2020), ViT results are from Dosovitskiy et al. (2021), BiT results are from Kolesnikov et al. (2019). Noisy Student results (Xie et al., 2020) are taken from the improved versions reported in Foret et al. (2021) which employ SAM. IG-940M (Mahajan et al., 2018) results are taken from the improved versions reported in Touvron et al. (2019).

A.3. Augmentations

Our full NFNet training recipe applies "baseline" preprocessing (sampling distorted bounding boxes and applying random horizontal flips), RandAugment (RA, Cubuk et al. (2020)), which we apply to all images in a batch, MixUp (Zhang et al., 2017), which we apply to half the images in a batch with α = 0.2, and CutMix (Yun et al., 2019), which we apply to the other half of the images in the batch.

Following Qin et al. (2020) we apply RandAugment after applying MixUp or CutMix. We apply RA with 4 layers (meaning 4 augmentations are chosen), which is substantially stronger than the common default of 2 layers, and following Cubuk et al. (2020) we pick the magnitude of the RA augmentation based on the training resolution of the images. If the augmentation magnitude is set too high relative to the image resolution, then certain operations (such as shearing) can result in many images being completely blank, which will impede training. For NFNet variants F0 through F6, the chosen RA magnitudes are [5, 10, 10, 15, 15, 15, 15], respectively.

The combination of MixUp, CutMix, and RA results in an intense level of augmentation which progressively benefits NFNets, but does not appear to benefit other models like EfficientNets over a baseline of just using well-tuned RA.
High-Performance Normalizer-Free ResNets
We hypothesize that this is because our models lack the et al. (2017), which is warmed up over 5,000 steps and
implicit regularization of batch normalization, and similar then decayed to zero with cosine annealing through the
to how they are more amenable to large-scale pre-training, rest of training. We fine-tune ResNets on ImageNet with
they are accordingly also more amenable to stronger data a batch size of 2048 for 15,000 steps using a learning rate
augmentations. of 0.1 (again employing a 5000 step warmup and cosine
decay, but not applying the batch size scaling of Goyal et al.
A.4. Accelerating Sharpness-Aware Minimization (2017)), no weight decay, no DropOut, and no Stochastic
Depth. For fine-tuning we apply EMA with decay 0.9999
Sharpness-Aware Minimization (SAM, Foret et al. (2021)) and the decay warmup described above. Due to the expense
has been shown to improve the performance of various clas- of this experiment we only run a single random seed for
sifier models by seeking flat minima which are hypothesized each model (fine-tuning three separate times at each of the
to generalize better. However, by default it is expensive to fine-tune resolutions of 224, 320, and 384 pixels).
apply as it requires two evaluations of the gradient: one
for a step of gradient ascent to attain “noised” parameters, We find, contrary to (Dosovitskiy et al., 2021), that a large
and then one to attain the gradients with respect to the weight decay is harmful during pre-training, and that instead
noised parameters, which are used to update the actual pa- very small weight decays are important so that the models
rameters. We experimented with ameliorating this cost by are not constrained when trying to capture the information
only employing 20% of the batch to compute the gradients in a large scale dataset. Contrary to Dosovitskiy et al. (2021)
for the ascent step, which we found to result in equivalent we also find that Adam is not as performant as SGD in this
performance while only increasing the training latency by setting. We believe this reflects in the fact that our base-
20%-40% instead of by 100%. We also tried using SAM line batch-normalized ResNets substantially outperform the
where the batch of data used to compute the ascent step baselines reported in Dosovitskiy et al. (2021) despite oth-
was a different batch from the one used to compute the de- erwise similar pre-training and fine-tuning configurations.
scent step, but found that this destroyed all the benefits of For reference, Dosovitskiy et al. (2021) report a ResNet-50
SAM. This indicates that it is necessary for the ascent step transfer accuracy of 77.54% when fine-tuned at 384px reso-
to be computed using the same batch (or a subset thereof) lution, whereas we obtain an accuracy of 79.9% in the same
as is used to compute the descent step. As noted in Foret setting for BN-ResNet-50 and 81.1% for NF-ResNet-50.
et al. (2021), we found that SAM worked best in a dis- The full set of accuracies for these ResNet models is avail-
tributed setup where the gradients used for the ascent step able in Table 4. We recommend future work on large-scale
are not synced between replicas (meaning a separate copy pre-training to begin with a weight decay of zero and con-
of the “noised” parameters is kept on each replica and used sider lightly increasing it, rather than starting with a large
to compute the local descent gradients). We note that this value of weight decay and experimenting with decreasing it.
phenomenon can also be mimicked on fewer devices, or a For NFNet models, we pre-train with a batch size of
single device, by employing gradient accumulation (itera- 4096. For NFNet-F4, we pre-train for 40 epochs, and
tively computing noised parameters and then accumulating for NFNet-F4+ we pre-train for 20 epochs. The F4+
the gradients to be used for descent). model is a wider variant, constructed from the F4 model
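A sketch of SAM with a reduced ascent batch as described above; `loss_fn(params, batch)` is an assumed scalar loss function, and the 20% subset is simply taken from the front of the batch for illustration.

```python
import jax
import jax.numpy as jnp

def sam_gradient(loss_fn, params, batch, rho=0.05, ascent_fraction=0.2):
    """Compute a SAM descent gradient, using only a fraction of the batch for the ascent step."""
    n_ascent = max(1, int(ascent_fraction * batch.shape[0]))
    ascent_batch = batch[:n_ascent]                      # subset of the same batch

    # Step 1: gradient ascent direction from the subset, to obtain the "noised" parameters.
    g = jax.grad(loss_fn)(params, ascent_batch)
    g_norm = jnp.sqrt(sum(jnp.sum(x ** 2) for x in jax.tree_util.tree_leaves(g)))
    noised = jax.tree_util.tree_map(lambda p, gi: p + rho * gi / (g_norm + 1e-12), params, g)

    # Step 2: descent gradient evaluated at the noised parameters, using the full batch.
    return jax.grad(loss_fn)(noised, batch)
```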
by using a channel pattern of [384, 768, 2048, 2048] in-
A.5. Large Scale Pre-Training Details stead of [256, 512, 1536, 1536] and keeping all other hyper-
Our large scale pre-training is performed on JFT-300m (Sun parameters the same. We find that both models obtain about
et al., 2017), a dataset of 300 million labeled images span- the same training latency (around 830ms per step when
ning roughly 18,000 classes. We pre-train all models at training with a per-core batch size of 32), but that the F4
resolution 224 (regardless of the native model resolution model needs the additional pre-training time to reach the
for a given NFNet variant) using the same optimizer set- same final performance as the F4+ model. This indicates
tings as for our ImageNet experiments (as described in Ap- that (given sufficient pre-training data) it is more efficient to
pendix A.1) with the exception of using a smaller weight train larger models with a shorter epoch budget than to train
decay (10−5 for BN and NF-ResNets, and 10−6 for all smaller models for longer, consistent with the observations
NFNet models). We briefly tried pre-training at larger im- in (Kaplan et al., 2020).
age resolutions and found that this was not worth the added We fine-tune NFNet models for 15,000 steps at a batch
pre-trainining expense. We do not use any augmentations size of 2048 using a learning rate of 0.1, which is warmed
except for baseline random crops and flips, nor do we use up from zero over 5000 steps, then annealed to zero with
any exponential moving averages during pre-training. cosine decay through the rest of training. We use SAM with
For ResNet models, we pre-train with a batch size of 1024 ρ = 0.05, weight decay of 10−5 , a DropOut rate of 0.25, and
for 10 epochs using a learning rate of 0.4 following Goyal a stochastic depth rate of 0.1. We found that we could obtain
B. Downsides of Batch Normalization

Batch normalization provides a range of benefits, which we discussed in Section 2 of the main text, but it also has a number of disadvantages that motivated this work on normalizer-free networks. We discussed some of the disadvantages of batch normalization in Section 1. In addition, here we enumerate some documented errors and challenges in the implementation of batch normalization in popular frameworks and published work. A number of these errors are identified by Pham et al. (2019), an academic paper on automated testing which discovers two such implementation errors in Keras and one in the CNTK toolkit.

One example is a long-standing bug in certain versions of Keras, whose consequence is that even if a user sets the batch normalization layers to testing mode (as is common when freezing the layers for fine-tuning for downstream tasks) the batch normalization statistics will continue to update, contrary to user expectations. This implementation error is raised in this github issue and this github issue.

The discrepancy between batch normalization train and test behavior has had direct impact several times in previous work. For example, both DCGAN (Radford et al., 2016) and SAGAN (Zhang et al., 2019b) reported results and released code where batch normalization was run in training mode at test time as noted here and here,³ and consequently their reported results depend on the batch size used to generate samples.

Subtle differences in batch normalization implementations can also hamper reproducibility. For example, the EfficientNet training code uses a form of cross-replica BatchNorm where the number of devices used to compute statistics varies nonlinearly with the total number of devices (as seen here), and consequently, even given the same code, exact reproduction can be difficult without access to the same hardware. Additionally, the EfficientNet code takes a moving average of the running batch normalization statistics, which in practice means that it takes a moving average of a moving average, compounding the averaging horizon in a way that may be unexpected.

As discussed in the main text, breaking the independence between training examples causes issues in contrastive learning setups like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020). Both models have to deal with the potential for intra-batch information leakage negatively impacting the contrastive objective. MoCo seeks to resolve this by shuffling examples between devices when computing batch statistics, which introduces implementation complexity and makes it challenging to exactly reproduce their results on different hardware. SimCLR seeks to resolve this via the use of cross-replica batch normalization.

³Note that no 'u' or 's' values are passed into the batch normalization op here, meaning that running statistics are not accumulated.
Table 6. Detailed Model ablation table. Each entry reports ImageNet Top-1 on the left, and TPUv3 training latency on the right.
F0 F1 F2 F3
Baseline 80.4% 58.0ms 81.7% 116.0ms 82.0% 211.7ms 82.3% 369.5ms
+ Modified Width 80.9% 64.1ms 81.8% 133.9ms 82.0% 252.2ms 82.3% 441.5ms
+ Second Conv 81.3% 73.3ms 82.2% 158.5ms 82.4% 295.8ms 82.7% 532.2ms
+ MixUp 82.2% 73.3ms 82.9% 158.5ms 83.1% 295.8ms 83.5% 532.2ms
+ RandAugment 83.2% 73.3ms 84.6% 158.5ms 84.8% 295.8ms 85.0% 532.2ms
+ CutMix 83.6% 73.3ms 84.7% 158.5ms 85.1% 295.8ms 85.7% 532.2ms
Default Width + Augs 83.1% 65.9ms 84.5% 137.4ms 85.0% 248.8ms 85.5% 452.2ms
-NF, + BN 83.4% 111.7ms 84.4% 258.0ms 85.1% 396.3ms 85.5% 617.7ms
Figure 5. Detailed view of an NFNet transition block. The bottleneck ratio is 0.5, while the group width (the number of channels per group, C/G) in the 3 × 3 convolutions is fixed at 128 regardless of the number of channels. Note that in this block, the skip path takes in the signal after the variance downscaling with β and the scaled nonlinearity.

Figure 6. Detailed view of an NFNet non-transition block. The bottleneck ratio is 0.5, while the group width (the number of channels per group, C/G) in the 3 × 3 convolutions is fixed at 128 regardless of the number of channels. Note that in this block, the skip path takes in the signal before the variance downscaling with β.
[Figure: ImageNet top-1 accuracy vs. clipping threshold for ResNet200 at batch sizes 256 to 4096.]