High-Performance Large-Scale Image Recognition Without Normalization

Andrew Brock¹  Soham De¹  Samuel L. Smith¹  Karen Simonyan¹

arXiv:2102.06171v1 [cs.CV] 11 Feb 2021

Abstract

Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples. Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7× faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%. In addition, Normalizer-Free models attain significantly better performance than their batch-normalized counterparts when fine-tuning on ImageNet after large-scale pre-training on a dataset of 300 million labeled images, with our best models obtaining an accuracy of 89.2%.²

Figure 1. ImageNet Validation Accuracy vs Training Latency. All numbers are single-model, single crop. Our NFNet-F1 model achieves comparable accuracy to an EffNet-B7 while being 8.7× faster to train. Our NFNet-F5 model has similar training latency to EffNet-B7, but achieves a state-of-the-art 86.0% top-1 accuracy on ImageNet. We further improve on this using Sharpness Aware Minimization (Foret et al., 2021) to achieve 86.5% top-1 accuracy. [Plot: ImageNet Top-1 Accuracy (%) against Training Latency (s/step) on TPUv3 with batch size 32 per device, comparing NFNet-F0 through F5 with EffNet-B2/B5/B7, LambdaNet-152/420, BoTNet-59/128-T7, and DeIT-224/384.]

¹ DeepMind, London, United Kingdom. Correspondence to: Andrew Brock <ajbrock@google.com>.
² Code available at https://github.com/deepmind/deepmind-research/tree/master/nfnets

1. Introduction

The vast majority of recent models in computer vision are variants of deep residual networks (He et al., 2016b;a), trained with batch normalization (Ioffe & Szegedy, 2015). The combination of these two architectural innovations has enabled practitioners to train significantly deeper networks which can achieve higher accuracies on both the training set and the test set. Batch normalization also smoothens the loss landscape (Santurkar et al., 2018), which enables stable training with larger learning rates and at larger batch sizes (Bjorck et al., 2018; De & Smith, 2020), and it can have a regularizing effect (Hoffer et al., 2017; Luo et al., 2018).

However, batch normalization has three significant practical disadvantages. First, it is a surprisingly expensive computational primitive, which incurs memory overhead (Rota Bulò et al., 2018), and significantly increases the time required to evaluate the gradient in some networks (Gitman & Ginsburg, 2017). Second, it introduces a discrepancy between the behaviour of the model during training and at inference time (Summers & Dinneen, 2019; Singh & Shrivastava, 2019), introducing hidden hyper-parameters that have to be tuned. Third, and most importantly, batch normalization breaks the independence between training examples in the minibatch.

This third property has a range of negative consequences. For instance, practitioners have found that batch-normalized networks are often difficult to replicate precisely on different hardware, and batch normalization is often the cause of subtle implementation errors, especially during distributed training (Pham et al., 2019). Furthermore, batch normalization cannot be used for some tasks, since the interaction between training examples in a batch enables the network to 'cheat' certain loss functions. For example, batch normalization requires specific care to prevent information leakage in some contrastive learning algorithms (Chen et al., 2020; He et al., 2020). This is a major concern for sequence modeling tasks as well, which has driven language models to adopt alternative normalizers (Ba et al., 2016; Vaswani et al., 2017). The performance of batch-normalized networks can also degrade if the batch statistics have a large variance during training (Shen et al., 2020). Finally, the performance of batch normalization is sensitive to the batch size, and batch-normalized networks perform poorly when the batch size is too small (Hoffer et al., 2017; Ioffe, 2017; Wu & He, 2018), which limits the maximum model size we can train on finite hardware. We expand on the challenges associated with batch normalization in Appendix B.
Therefore, although batch normalization has enabled the deep learning community to make substantial gains in recent years, we anticipate that in the long term it is likely to impede progress. We believe the community should seek to identify a simple alternative which achieves competitive test accuracies and can be used for a wide range of tasks. Although a number of alternative normalizers have been proposed (Ba et al., 2016; Wu & He, 2018; Huang et al., 2020), these alternatives often achieve inferior test accuracies and introduce their own disadvantages, such as additional compute costs at inference. Fortunately, in recent years two promising research themes have emerged. The first studies the origin of the benefits of batch normalization during training (Balduzzi et al., 2017; Santurkar et al., 2018; Bjorck et al., 2018; Luo et al., 2018; Yang et al., 2019; Jacot et al., 2019; De & Smith, 2020), while the second seeks to train deep ResNets to competitive accuracies without normalization layers (Hanin & Rolnick, 2018; Zhang et al., 2019a; De & Smith, 2020; Shao et al., 2020; Brock et al., 2021).

A key theme in many of these works is that it is possible to train very deep ResNets without normalization by suppressing the scale of the hidden activations on the residual branch. The simplest way to achieve this is to introduce a learnable scalar at the end of each residual branch, initialized to zero (Goyal et al., 2017; Zhang et al., 2019a; De & Smith, 2020; Bachlechner et al., 2020). However this trick alone is not sufficient to obtain competitive test accuracies on challenging benchmarks. Another line of work has shown that ReLU activations introduce a 'mean shift', which causes the hidden activations of different training examples to become increasingly correlated as the network depth increases (Huang et al., 2017; Jacot et al., 2019). In a recent work, Brock et al. (2021) introduced "Normalizer-Free" ResNets, which suppress the residual branch at initialization and apply Scaled Weight Standardization (Qiao et al., 2019) to remove the mean shift. With additional regularization, these unnormalized networks match the performance of batch-normalized ResNets (He et al., 2016a) on ImageNet (Russakovsky et al., 2015), but they are not stable at large batch sizes and do not match the performance of EfficientNets (Tan & Le, 2019), the current state of the art (Gong et al., 2020). This paper builds on this line of work and seeks to address these central limitations. Our main contributions are as follows:

• We propose Adaptive Gradient Clipping (AGC), which clips gradients based on the unit-wise ratio of gradient norms to parameter norms, and we demonstrate that AGC allows us to train Normalizer-Free Networks with larger batch sizes and stronger data augmentations.

• We design a family of Normalizer-Free ResNets, called NFNets, which set new state-of-the-art validation accuracies on ImageNet for a range of training latencies (see Figure 1). Our NFNet-F1 model achieves similar accuracy to EfficientNet-B7 while being 8.7× faster to train, and our largest model sets a new overall state of the art without extra data of 86.5% top-1 accuracy.

• We show that NFNets achieve substantially higher validation accuracies than batch-normalized networks when fine-tuning on ImageNet after pre-training on a large private dataset of 300 million labelled images. Our best model achieves 89.2% top-1 after fine-tuning.

The paper is structured as follows. We discuss the benefits of batch normalization in Section 2, and recent work seeking to train ResNets without normalization in Section 3. We introduce AGC in Section 4, and we describe how we developed our new state-of-the-art architectures in Section 5. Finally, we present our experimental results in Section 6.

2. Understanding Batch Normalization

In order to train networks without normalization to competitive accuracy, we must understand the benefits batch normalization brings during training, and identify alternative strategies to recover these benefits. Here we list the four main benefits which have been identified by prior work.

Batch normalization downscales the residual branch: The combination of skip connections (Srivastava et al., 2015; He et al., 2016b;a) and batch normalization (Ioffe & Szegedy, 2015) enables us to train significantly deeper networks with thousands of layers (Zhang et al., 2019a). This benefit arises because batch normalization, when placed on the residual branch (as is typical), reduces the scale of hidden activations on the residual branches at initialization (De & Smith, 2020). This biases the signal towards the skip path, which ensures that the network has well-behaved gradients early in training, enabling efficient optimization (Balduzzi et al., 2017; Hanin & Rolnick, 2018; Yang et al., 2019).

Batch normalization eliminates mean-shift: Activation functions like ReLUs or GELUs (Hendrycks & Gimpel, 2016), which are not anti-symmetric, have non-zero mean activations. Consequently, the inner product between the activations of independent training examples immediately after the non-linearity is typically large and positive, even if the inner product between the input features is close to zero. This issue compounds as the network depth increases, and introduces a 'mean-shift' in the activations of different training examples on any single channel proportional to the network depth (De & Smith, 2020), which can cause deep networks to predict the same label for all training examples at initialization (Jacot et al., 2019). Batch normalization ensures the mean activation on each channel is zero across the current batch, eliminating mean shift (Brock et al., 2021).
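To make the mean-shift effect concrete, the short NumPy sketch below (our illustration, not code from the paper) propagates two independent inputs through a deep He-initialized ReLU network with no normalization and prints the cosine similarity of their hidden activations, which climbs towards one as the depth increases.

    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 512, 50

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Two independent inputs drawn from a standard Gaussian.
    h1, h2 = rng.standard_normal(width), rng.standard_normal(width)
    print(f"depth  0: cosine similarity = {cosine(h1, h2):.3f}")

    for layer in range(1, depth + 1):
        # He-initialized linear layer followed by ReLU; no normalization anywhere.
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        h1, h2 = np.maximum(W @ h1, 0.0), np.maximum(W @ h2, 0.0)
        if layer % 10 == 0:
            print(f"depth {layer:2d}: cosine similarity = {cosine(h1, h2):.3f}")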
Batch normalization has a regularizing effect: It is widely believed that batch normalization also acts as a regularizer enhancing test set accuracy, due to the noise in the batch statistics which are computed on a subset of the training data (Luo et al., 2018). Consistent with this perspective, the test accuracy of batch-normalized networks can often be improved by tuning the batch size, or by using ghost batch normalization in distributed training (Hoffer et al., 2017).

Batch normalization allows efficient large-batch training: Batch normalization smoothens the loss landscape (Santurkar et al., 2018), and this increases the largest stable learning rate (Bjorck et al., 2018). While this property does not have practical benefits when the batch size is small (De & Smith, 2020), the ability to train at larger learning rates is essential if one wishes to train efficiently with large batch sizes. Although large-batch training does not achieve higher test accuracies within a fixed epoch budget (Smith et al., 2020), it does achieve a given test accuracy in fewer parameter updates, significantly improving training speed when parallelized across multiple devices (Goyal et al., 2017).

3. Towards Removing Batch Normalization

Many authors have attempted to train deep ResNets to competitive accuracies without normalization, by recovering one or more of the benefits of batch normalization described above. Most of these works suppress the scale of the activations on the residual branch at initialization, by introducing either small constants or learnable scalars (Hanin & Rolnick, 2018; Zhang et al., 2019a; De & Smith, 2020; Shao et al., 2020). Additionally, Zhang et al. (2019a) and De & Smith (2020) observed that the performance of unnormalized ResNets can be improved with additional regularization. However only recovering these two benefits of batch normalization is not sufficient to achieve competitive test accuracies on challenging benchmarks (De & Smith, 2020).

In this work, we adopt and build on "Normalizer-Free ResNets" (NF-ResNets) (Brock et al., 2021), a class of pre-activation ResNets (He et al., 2016a) which can be trained to competitive training and test accuracies without normalization layers. NF-ResNets employ a residual block of the form h_{i+1} = h_i + α·f_i(h_i/β_i), where h_i denotes the inputs to the i-th residual block, and f_i denotes the function computed by the i-th residual branch. The function f_i is parameterized to be variance preserving at initialization, such that Var(f_i(z)) = Var(z) for all i. The scalar α specifies the rate at which the variance of the activations increases after each residual block (at initialization), and is typically set to a small value like α = 0.2. The scalar β_i is determined by predicting the standard deviation of the inputs to the i-th residual block, β_i = √Var(h_i), where Var(h_{i+1}) = Var(h_i) + α², except for transition blocks (where spatial downsampling occurs), for which the skip path operates on the downscaled input (h_i/β_i), and the expected variance is reset after the transition block to Var(h_{i+1}) = 1 + α². The outputs of squeeze-excite layers (Hu et al., 2018) are multiplied by a factor of 2. Empirically, Brock et al. (2021) found it was also beneficial to include a learnable scalar initialized to zero at the end of each residual branch ('SkipInit' (De & Smith, 2020)).
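The variance bookkeeping described above can be sketched in a few lines of Python (a minimal illustration under our own simplifying assumptions: residual_branches stands in for the variance-preserving functions f_i, and the transition-block handling is reduced to its scaling logic, omitting the actual downsampling convolutions).

    import numpy as np

    def nf_residual_stage(h, residual_branches, alpha=0.2, var_in=1.0,
                          transition_indices=()):
        # Sketch of the Normalizer-Free update h_{i+1} = h_i + alpha * f_i(h_i / beta_i).
        # residual_branches: variance-preserving callables f_i (assumed, not defined here).
        # transition_indices: blocks with spatial downsampling, where the skip path also
        # operates on the downscaled input and the expected variance is reset.
        var = var_in
        for i, f in enumerate(residual_branches):
            beta = np.sqrt(var)                # beta_i = sqrt(Var(h_i)), predicted analytically
            if i in transition_indices:
                h_down = h / beta              # both paths see the downscaled input
                h = h_down + alpha * f(h_down)
                var = 1.0 + alpha ** 2         # variance reset after the transition block
            else:
                h = h + alpha * f(h / beta)
                var = var + alpha ** 2         # Var(h_{i+1}) = Var(h_i) + alpha^2
        return h, var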
In addition, Brock et al. (2021) prevent the emergence of a mean-shift in the hidden activations by introducing Scaled Weight Standardization (a minor modification of Weight Standardization (Huang et al., 2017; Qiao et al., 2019)). This technique reparameterizes the convolutional layers as:

    Ŵ_ij = (W_ij − μ_i) / (√N · σ_i),    (1)

where μ_i = (1/N)·Σ_j W_ij, σ_i² = (1/N)·Σ_j (W_ij − μ_i)², and N denotes the fan-in. The activation functions are also scaled by a non-linearity specific scalar gain γ, which ensures that the combination of the γ-scaled activation function and a Scaled Weight Standardized layer is variance preserving. For ReLUs, γ = √(2/(1 − (1/π))) (Arpit et al., 2016). We refer the reader to Brock et al. (2021) for a description of how to compute γ for other non-linearities.
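A minimal NumPy sketch of Equation (1) and the ReLU gain γ (our illustration; the [out_units, fan_in] shape convention, with convolutional kernels flattened over their channel and spatial dimensions, is an assumption on our part):

    import numpy as np

    def scaled_weight_standardization(weight):
        # Eq. (1): W_hat_ij = (W_ij - mu_i) / (sqrt(N) * sigma_i), with the mean and
        # standard deviation taken over the fan-in of each output unit i, N = fan-in.
        # weight: array of shape [out_units, fan_in].
        fan_in = weight.shape[-1]
        mu = weight.mean(axis=-1, keepdims=True)
        sigma = weight.std(axis=-1, keepdims=True)
        # The tiny constant is for numerical safety only; it is not part of Eq. (1).
        return (weight - mu) / (np.sqrt(fan_in) * sigma + 1e-10)

    # Non-linearity specific gain for ReLU: gamma = sqrt(2 / (1 - 1/pi)), so that
    # gamma * relu(.) after a Scaled Weight Standardized layer is variance preserving.
    gamma_relu = np.sqrt(2.0 / (1.0 - 1.0 / np.pi))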
With additional regularization (Dropout (Srivastava et al., 2014) and Stochastic Depth (Huang et al., 2016)), Normalizer-Free ResNets match the test accuracies achieved by batch-normalized pre-activation ResNets on ImageNet at batch size 1024. They also significantly outperform their batch-normalized counterparts when the batch size is very small, but they perform worse than batch-normalized networks for large batch sizes (4096 or higher). Crucially, they do not match the performance of state-of-the-art networks like EfficientNets (Tan & Le, 2019; Gong et al., 2020).

4. Adaptive Gradient Clipping for Efficient Large-Batch Training

To scale NF-ResNets to larger batch sizes, we explore a range of gradient clipping strategies (Pascanu et al., 2013). Gradient clipping is often used in language modeling to stabilize training (Merity et al., 2018), and recent work shows that it allows training with larger learning rates compared to gradient descent, accelerating convergence (Zhang et al., 2020).
This is particularly important for poorly conditioned loss landscapes or when training with large batch sizes, since in these settings the optimal learning rate is constrained by the maximum stable learning rate (Smith et al., 2020). We therefore hypothesize that gradient clipping should help scale NF-ResNets efficiently to the large-batch setting.

Figure 2. (a) AGC efficiently scales NF-ResNets to larger batch sizes. (b) The performance across different clipping thresholds λ. [Plots of ImageNet top-1 accuracy: (a) BatchNorm vs NF-ResNet vs NF-ResNet+AGC for ResNet50 and ResNet200 at batch sizes 256 to 4096; (b) ResNet50 accuracy at batch sizes 256 to 4096 across clipping thresholds 0.01 to 0.16.]

Gradient clipping is typically performed by constraining the norm of the gradient (Pascanu et al., 2013). Specifically, for gradient vector G = ∂L/∂θ, where L denotes the loss and θ denotes a vector with all model parameters, the standard clipping algorithm clips the gradient before updating θ as:

    G → λ·G/‖G‖ if ‖G‖ > λ, and G otherwise.    (2)

The clipping threshold λ is a hyper-parameter which must be tuned. Empirically, we found that while this clipping algorithm enabled us to train at higher batch sizes than before, training stability was extremely sensitive to the choice of the clipping threshold, requiring fine-grained tuning when varying the model depth, the batch size, or the learning rate.
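For reference, Equation (2) amounts to a single rescaling of the flattened gradient vector; a minimal sketch:

    import numpy as np

    def clip_by_global_norm(grad, lam):
        # Eq. (2): rescale the full gradient vector to norm lam whenever ||G|| exceeds lam.
        norm = np.linalg.norm(grad)
        return grad * (lam / norm) if norm > lam else grad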
To overcome this issue, we introduce "Adaptive Gradient Clipping" (AGC), which we now describe. Let W^ℓ ∈ R^{N×M} denote the weight matrix of the ℓ-th layer, G^ℓ ∈ R^{N×M} denote the gradient with respect to W^ℓ, and ‖·‖_F denote the Frobenius norm, i.e., ‖W^ℓ‖_F = √(Σ_i Σ_j (W^ℓ_{i,j})²). The AGC algorithm is motivated by the observation that the ratio of the norm of the gradients G^ℓ to the norm of the weights W^ℓ of layer ℓ, ‖G^ℓ‖_F/‖W^ℓ‖_F, provides a simple measure of how much a single gradient descent step will change the original weights W^ℓ. For instance, if we train using gradient descent without momentum, then ‖ΔW^ℓ‖/‖W^ℓ‖ = h·‖G^ℓ‖_F/‖W^ℓ‖_F, where the parameter update for the ℓ-th layer is given by ΔW^ℓ = −h·G^ℓ, and h is the learning rate.

Intuitively, we expect training to become unstable if (‖ΔW^ℓ‖/‖W^ℓ‖) is large, which motivates a clipping strategy based on the ratio ‖G^ℓ‖_F/‖W^ℓ‖_F. However in practice, we clip gradients based on the unit-wise ratios of gradient norms to parameter norms, which we found to perform better empirically than taking layer-wise norm ratios. Specifically, in our AGC algorithm, each unit i of the gradient of the ℓ-th layer, G^ℓ_i (defined as the i-th row of matrix G^ℓ), is clipped as:

    G^ℓ_i → λ·(‖W^ℓ_i‖*_F / ‖G^ℓ_i‖_F)·G^ℓ_i if ‖G^ℓ_i‖_F/‖W^ℓ_i‖*_F > λ, and G^ℓ_i otherwise.    (3)

The clipping threshold λ is a scalar hyperparameter, and we define ‖W_i‖*_F = max(‖W_i‖_F, ε), with default ε = 10⁻³, which prevents zero-initialized parameters from always having their gradients clipped to zero. For parameters in convolutional filters, we evaluate the unit-wise norms over the fan-in extent (including the channel and spatial dimensions). Using AGC, we can train NF-ResNets stably with larger batch sizes (up to 4096), as well as with very strong data augmentations like RandAugment (Cubuk et al., 2020) for which NF-ResNets without AGC fail to train (Brock et al., 2021). Note that the optimal clipping parameter λ may depend on the choice of optimizer, learning rate and batch size. Empirically, we find λ should be smaller for larger batches.
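Equation (3) can be sketched as follows (our illustration, not the paper's implementation; we treat the leading axis of each array as the unit dimension and flatten the remaining axes into the fan-in extent, following the remark above about convolutional filters).

    import numpy as np

    def _unit_norms(x):
        # Frobenius norm of each unit (row), with trailing axes treated as the
        # fan-in extent (channel and spatial dimensions for convolutional filters).
        flat = x.reshape(x.shape[0], -1)
        return np.sqrt((flat ** 2).sum(axis=1)).reshape((-1,) + (1,) * (x.ndim - 1))

    def adaptive_grad_clip(grad, weight, lam=0.01, eps=1e-3):
        # Eq. (3): clip unit i whenever ||G_i||_F / max(||W_i||_F, eps) exceeds lam.
        w_norm = np.maximum(_unit_norms(weight), eps)
        g_norm = _unit_norms(grad)
        clipped = grad * (lam * w_norm / np.maximum(g_norm, 1e-12))  # tiny constant avoids 0/0
        return np.where(g_norm > lam * w_norm, clipped, grad)

In the training setup described later, this transform would be applied to the gradient of every layer except the final linear classifier weight.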
at the best clipping threshold λ for each batch size. We find
The clipping threshold λ is a scalar hyperparameter, and we that AGC helps scale NF-ResNets to large batch sizes while
define kWi k?F = max(kWi kF , ), with default  = 10−3 , maintaining performance comparable or better than batch-
which prevents zero-initialized parameters from always hav- normalized networks on both ResNet50 and ResNet200. As
ing their gradients clipped to zero. For parameters in con- anticipated, the benefits of using AGC are smaller when the
volutional filters, we evaluate the unit-wise norms over the batch size is small. In Figure 2(b), we show performance
Next, we study whether or not AGC is beneficial for all layers. Using batch size 4096 and a clipping threshold λ = 0.01, we remove AGC from different combinations of the first convolution, the final linear layer, and every block in any given set of the residual stages. For example, one experiment may remove clipping in the linear layer and all the blocks in the second and fourth stages. Two key trends emerge: first, it is always better to not clip the final linear layer. Second, it is often possible to train stably without clipping the initial convolution, but the weights of all four stages must be clipped to achieve stability when training at batch size 4096 with the default learning rate of 1.6. For the rest of this paper (and for our ablations in Figure 2), we apply AGC to every layer except for the final linear layer.

Figure 3. Summary of NFNet bottleneck block design and architectural differences. See Figure 5 in Appendix C for more details. [Diagram: a 1×1 → 3×3 → 3×3 → 1×1 bottleneck block, with the input scaled by 1/β and the branch output scaled by α before the skip addition. Stage widths: ResNet [256, 512, 1024, 2048] vs NFNet [256, 512, 1536, 1536]. Stage depths: ResNet [3, 4, 6, 3], [3, 4, 23, 3], etc. vs NFNet [1, 2, 6, 3] * N.]

Table 1. NFNet family depths, drop rates, and input resolutions.

Variant  Depth             Dropout  Train   Test
F0       [1, 2, 6, 3]      0.2      192px   256px
F1       [2, 4, 12, 6]     0.3      224px   320px
F2       [3, 6, 18, 9]     0.4      256px   352px
F3       [4, 8, 24, 12]    0.4      320px   416px
F4       [5, 10, 30, 15]   0.5      384px   512px
F5       [6, 12, 36, 18]   0.5      416px   544px
F6       [7, 14, 42, 21]   0.5      448px   576px
actual training latency on device. This section describes the
5. Normalizer-Free Architectures with changes which we found to work well to this end (with more
Improved Accuracy and Training Speed details in Appendix C), while the ideas which we found to
work poorly are described in Appendix E. A summary of
In the previous section we introduced AGC, a gradient clip- these modifications is presented in Figure 3, and the effect
ping method which allows us to train efficiently with large they have on holdout accuracy is presented in Table 2.
batch sizes and strong data augmentations. Equipped with
We begin with an SE-ResNeXt-D model (Xie et al., 2017;
this technique, we now seek to design Normalizer-Free ar-
Hu et al., 2018; He et al., 2019) with GELU activations
chitectures with state-of-the-art accuracy and training speed.
(Hendrycks & Gimpel, 2016), which we found to be a sur-
The current state of the art on image classification is gener- prisingly strong baseline for Normalizer-Free Networks. We
ally held by the EfficientNet family of models (Tan & Le, make the following changes. First, we set the group width
2019), which are based on a variant of inverted bottleneck (the number of channels each output unit is connected to) in
blocks (Sandler et al., 2018) with a backbone and model scal- the 3 × 3 convs to 128, regardless of block width. Smaller
ing strategy derived from neural architecture search. These group widths reduce theoretical FLOPS, but the reduction in
models are optimized to maximize test accuracy while mini- compute density means that on many modern accelerators
mizing parameter and FLOP counts, but their low theoretical no actual speedup is realized. On TPUv3 for example, an
compute complexity does not translate into improved train- SE-ResNeXt-50 with a group width of 8 trains at the same
ing speed on modern accelerators. Despite having 10x fewer speed as an SE-ResNeXt-50 with a group width of 128 un-
FLOPS than a ResNet-50, an EffNet-B0 has similar training less the per-device batch size is 128 or larger (Google, 2021),
latency and final performance when trained on GPU or TPU. which is often not realizable due to memory constraints.
The choice of which metric to optimize– theoretical FLOPS, Next, we make two changes to the model backbone. First,
inference latency on a target device, or training latency on an we note that the default depth scaling pattern for ResNets
accelerator–is a matter of preference, and the nature of each (e.g., the method by which one increases depth to construct
metric will yield different design requirements. In this work a ResNet101 or ResNet200 from a ResNet50) involves non-
we choose to focus on manually designing models which uniformly increasing the number of layers in the second
Table 2. The effect of architectural modifications and data augmentation on ImageNet Top-1 accuracy (averaged over 3 seeds).

                        F0    F1    F2    F3
Baseline                80.4  81.7  82.0  82.3
+ Modified Width        80.9  81.8  82.0  82.3
+ Second Conv           81.3  82.2  82.4  82.7
+ MixUp                 82.2  82.9  83.1  83.5
+ RandAugment           83.2  84.6  84.8  85.0
+ CutMix                83.6  84.7  85.1  85.7
Default Width + Augs    83.1  84.5  85.0  85.5

Next, we make two changes to the model backbone. First, we note that the default depth scaling pattern for ResNets (e.g., the method by which one increases depth to construct a ResNet101 or ResNet200 from a ResNet50) involves non-uniformly increasing the number of layers in the second and third stages, while maintaining 3 blocks in the first and fourth stages, where 'stage' refers to a sequence of residual blocks whose activations are the same width and have the same resolution. We find that this strategy is suboptimal. Layers in early stages operate at higher resolution, require more memory and compute, and tend to learn localized, task-general features (Krizhevsky et al., 2012), while layers in later stages operate at lower resolutions, contain most of the model's parameters, and learn more task-specific features (Raghu et al., 2017a). However, being overly parsimonious with early stages (such as through aggressive downsampling) can hurt performance, since the model needs enough capacity to extract good local features (Raghu et al., 2017b). It is also desirable to have a simple scaling rule for constructing deeper variants (Tan & Le, 2019). With these principles in mind, we explored several choices of backbone for our smallest model variant, named F0, before settling on the simple pattern [1, 2, 6, 3] (indicating how many bottleneck blocks to allocate to each stage). We construct deeper variants by multiplying the depth of each stage by a scalar N, so that, for example, variant F1 has a depth pattern [2, 4, 12, 6], and variant F4 has a depth pattern [5, 10, 30, 15].
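The resulting depth rule is simple enough to state directly in code (a sketch; the mapping of variant index to the scalar N is read off Table 1):

    def nfnet_stage_depths(variant_index):
        # F0 uses [1, 2, 6, 3]; variant F<i> multiplies each stage depth by N = i + 1.
        return [d * (variant_index + 1) for d in (1, 2, 6, 3)]

    assert nfnet_stage_depths(0) == [1, 2, 6, 3]     # F0
    assert nfnet_stage_depths(1) == [2, 4, 12, 6]    # F1
    assert nfnet_stage_depths(4) == [5, 10, 30, 15]  # F4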
In addition, we reconsider the default width pattern in ResNets, where the first stage has 256 channels which are doubled at each subsequent stage, resulting in a pattern [256, 512, 1024, 2048]. Employing our depth patterns described above, we considered a range of alternative patterns (taking inspiration from Radosavovic et al. (2020)) but found that only one choice was better than this default: [256, 512, 1536, 1536]. This width pattern is designed to increase capacity in the third stage while slightly reducing capacity in the fourth stage, roughly preserving training speed. Consistent with our chosen depth pattern and the default design of ResNets, we find that the third stage tends to be the best place to add capacity, which we hypothesize is due to this stage being deep enough to have a large receptive field and access to deeper levels of the feature hierarchy, while having a slightly higher resolution than the final stage.

We also consider the structure of the bottleneck residual block itself. We considered a variety of pre-existing and novel modifications (see Appendix E) but found that the best improvement came from adding an additional 3 × 3 grouped conv after the first (with accompanying nonlinearity). This additional convolution minimally impacts FLOPS and has almost no impact on training time on our target accelerators.

Finally, we establish a scaling strategy to produce model variants at different compute budgets. The EfficientNet scaling strategy (Tan & Le, 2019) is to jointly scale model width, depth, and input resolution, which works extremely well for base models with very slim MobileNet-like backbones. However we find that width scaling is ineffective for ResNet backbones, consistent with Bello (2021), who attains strong performance when only scaling depth and input resolution. We therefore also adopt the latter strategy, using the fixed width pattern mentioned above, scaling depth as described above, and scaling training resolution such that each variant is approximately half as fast to train as its predecessor. Following Touvron et al. (2019), we evaluate images at inference at a slightly higher resolution than we train at, chosen for each variant as approximately 33% larger than the train resolution. We do not fine-tune at this higher resolution.

We also find that it is helpful to increase the regularization strength as the model capacity rises. However modifying the weight decay or stochastic depth rate was not effective, and instead we scale the drop rate of Dropout (Srivastava et al., 2014), following Tan & Le (2019). This step is particularly important as our models lack the implicit regularization of batch normalization, and without explicit regularization tend to dramatically overfit. Our resulting models are highly performant and, despite being optimized for training latency, remain competitive with larger EfficientNet variants in terms of FLOPs vs accuracy (although not in terms of parameters vs accuracy), as shown in Figure 4 in Appendix A.

5.1. Summary

Our training recipe can be summarized as follows: First, apply the Normalizer-Free setup of Brock et al. (2021) to an SE-ResNeXt-D, with modified width and depth patterns, and a second spatial convolution. Second, apply AGC to every parameter except for the linear weight of the classifier layer. For batch size 1024 to 4096, set λ = 0.01, and make use of strong regularization and data augmentation. See Table 1 for additional information on each model variant.
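The "every parameter except the classifier weight" rule in this recipe can be expressed as a simple filter over named parameters (a sketch reusing the adaptive_grad_clip function sketched in Section 4; the parameter name used for the classifier weight here is hypothetical):

    def clip_gradients(grads, params, lam=0.01, eps=1e-3,
                       classifier_key="linear_classifier/w"):
        # Apply AGC to every parameter except the final linear classifier weight,
        # as prescribed by the recipe above.
        return {
            name: grad if name == classifier_key
            else adaptive_grad_clip(grad, params[name], lam, eps)
            for name, grad in grads.items()
        }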
6. Experiments

6.1. Evaluating NFNets on ImageNet

We now turn our attention to evaluating our NFNet models on ImageNet, beginning with an ablation of our architectural modifications when training for 360 epochs at batch size 4096. We use Nesterov's Momentum with a momentum coefficient of 0.9, AGC as described in Section 4 with a clipping threshold of 0.01, and a learning rate which linearly increases from 0 to 1.6 over 5 epochs, before decaying to zero with cosine annealing (Loshchilov & Hutter, 2017).
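A sketch of this schedule as a function of the training step (our illustration; the default steps_per_epoch value simply approximates ImageNet's roughly 1.28 million training images divided by the batch size of 4096):

    import math

    def nfnet_learning_rate(step, peak_lr=1.6, warmup_epochs=5,
                            total_epochs=360, steps_per_epoch=312):
        # Linear warmup from 0 to peak_lr over 5 epochs, then cosine annealing to zero.
        warmup_steps = warmup_epochs * steps_per_epoch
        total_steps = total_epochs * steps_per_epoch
        if step < warmup_steps:
            return peak_lr * step / warmup_steps
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))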
Table 3. ImageNet Accuracy comparison for NFNets and a representative set of models, including SENet (Hu et al., 2018), LambdaNet (Bello, 2021), BoTNet (Srinivas et al., 2021), and DeIT (Touvron et al., 2020). Except for results using SAM, our results are averaged over three random seeds. Latencies are given as the time in milliseconds required to perform a single full training step on TPU or GPU (V100).

Model #FLOPs #Params Top-1 Top-5 TPUv3 Train GPU Train
ResNet-50 4.10B 26.0M 78.6 94.3 41.6ms 35.3ms
EffNet-B0 0.39B 5.3M 77.1 93.3 51.1ms 44.8ms
SENet-50 4.09B 28.0M 79.4 94.6 64.3ms 59.4ms
NFNet-F0 12.38B 71.5M 83.6 96.8 73.3ms 56.7ms
EffNet-B3 1.80B 12.0M 81.6 95.7 129.5ms 116.6ms
LambdaNet-152 − 51.5M 83.0 96.3 138.3ms 135.2ms
SENet-152 19.04B 66.6M 83.1 96.4 149.9ms 151.2ms
BoTNet-110 10.90B 54.7M 82.8 96.3 181.3ms −
NFNet-F1 35.54B 132.6M 84.7 97.1 158.5ms 133.9ms
EffNet-B4 4.20B 19.0M 82.9 96.4 245.9ms 221.6ms
BoTNet-128-T5 19.30B 75.1M 83.5 96.5 355.2ms −
NFNet-F2 62.59B 193.8M 85.1 97.3 295.8ms 226.3ms
SENet-350 52.90B 115.2M 83.8 96.6 593.6ms −
EffNet-B5 9.90B 30.0M 83.7 96.7 450.5ms 458.9ms
LambdaNet-350 − 105.8M 84.5 97.0 471.4ms −
BoTNet-77-T6 23.30B 53.9M 84.0 96.7 578.1ms −
NFNet-F3 114.76B 254.9M 85.7 97.5 532.2ms 524.5ms
LambdaNet-420 − 124.8M 84.8 97.0 593.9ms −
EffNet-B6 19.00B 43.0M 84.0 96.8 775.7ms 868.2ms
BoTNet-128-T7 45.80B 75.1M 84.7 97.0 804.5ms −
NFNet-F4 215.24B 316.1M 85.9 97.6 1033.3ms 1190.6ms
EffNet-B7 37.00B 66.0M 84.7 97.0 1397.0ms 1753.3ms
DeIT 1000 epochs − 87.0M 85.2 − − −
EffNet-B8+MaxUp 62.50B 87.4M 85.8 − − −
NFNet-F5 289.76B 377.2M 86.0 97.6 1398.5ms 2177.1ms
NFNet-F5+SAM 289.76B 377.2M 86.3 97.9 1958.0ms −
NFNet-F6+SAM 377.28B 438.4M 86.5 97.9 2774.1ms −

From the first three rows of Table 2, we can see that the two changes we make to the model each result in slight improvements to performance with only minor changes in training latency (see Table 6 in the Appendix for latencies).

Next, we evaluate the effects of progressively adding stronger augmentations, combining MixUp (Zhang et al., 2017), RandAugment (RA, (Cubuk et al., 2020)) and CutMix (Yun et al., 2019). We apply RA with 4 layers and scale the magnitude with the resolution of the images, following Cubuk et al. (2020). We find that this scaling is particularly important, as if the magnitude is set too high relative to the image size (for example, using a magnitude of 20 on images of resolution 224) then most of the augmented images will be completely blank. See Appendix A for a complete description of these magnitudes and how they are selected. We show in Table 2 that these data augmentations substantially improve performance. Finally, in the last row of Table 2, we additionally present the performance of our full model ablated to use the default ResNet stage widths, demonstrating that our slightly modified pattern in the third and fourth stages does yield improvements under direct comparison.
For completeness, in Table 6 of the Appendix we also report the performance of our model architectures when trained with batch normalization instead of the NF strategy. These models achieve slightly lower test accuracies than their NF counterparts and they are between 20% and 40% slower to train, even when using highly optimized batch normalization implementations without cross-replica syncing. Furthermore, we found that the larger model variants F4 and F5 were not stable when training with batch normalization, with or without AGC. We attribute this to the necessity of using bfloat16 training to fit these larger models in memory, which may introduce numerical imprecision that interacts poorly with the computation of batch normalization statistics.

We provide a detailed summary of the size, training latency (on TPUv3 and V100 with tensorcores), and ImageNet validation accuracy of six model variants, NFNet-F0 through F5, along with comparisons to other models with similar training latencies, in Table 3. Our NFNet-F5 model attains a top-1 validation accuracy of 86.0%, improving over the previous state of the art, EfficientNet-B8 with MaxUp (Gong et al., 2020), by a small margin, and our NFNet-F1 model matches the 84.7% of EfficientNet-B7 with RA (Cubuk et al., 2020) while being 8.7 times faster to train. See Appendix A for details of how we measure training latency.

Our models also benefit from the recently proposed Sharpness-Aware Minimization (SAM, (Foret et al., 2021)). SAM is not part of our standard training pipeline, as by default it doubles the training time and typically can only be used for distributed training. However we make a small modification to the SAM procedure to reduce this cost to 20-40% increased training time (explained in Appendix A) and employ it to train our two largest model variants, resulting in an NFNet-F5 that attains 86.3% top-1, and an NFNet-F6 that attains 86.5% top-1, substantially improving over the existing state of the art on ImageNet without extra data.

Finally, we also evaluated the performance of our data augmentation strategy on EfficientNets. We find that while RA strongly improves EfficientNets' performance over baseline augmentation, increasing the number of layers beyond 2 or adding MixUp and CutMix does not further improve their performance, suggesting that our performance improvements are difficult to obtain by simply using stronger data augmentations. We also find that using SGD with cosine annealing instead of RMSProp (Tieleman & Hinton, 2012) with step decay severely degrades EfficientNet performance, indicating that our performance improvements are also not simply due to the selection of a different optimizer.

6.2. Evaluating NFNets under Transfer

Unnormalized networks do not share the implicit regularization effect of batch normalization, and on datasets like ImageNet (Russakovsky et al., 2015) they tend to overfit unless explicitly regularized (Zhang et al., 2019a; De & Smith, 2020; Brock et al., 2021). However, when pre-training on extremely large scale datasets, such regularization may not only be unnecessary, but also harmful to performance, reducing the model's ability to devote its full capacity to the training set. We hypothesize that this may make Normalizer-Free networks naturally better suited to transfer learning after large-scale pre-training, and investigate this via pre-training on a large dataset of 300 million labeled images.

We pre-train a range of batch-normalized and NF-ResNets for 10 epochs on this large dataset, then fine-tune all layers on ImageNet simultaneously, using a batch size of 2048 and a small learning rate of 0.1 with cosine annealing for 15,000 steps, for input image resolutions in the range [224, 320, 384]. As shown in Table 4, Normalizer-Free networks outperform their batch-normalized counterparts in every single case, typically by a margin of around 1% absolute top-1. This suggests that in the transfer learning regime, removing batch normalization can directly benefit final performance.

Table 4. ImageNet Transfer Top-1 accuracy after pre-training.

                224px  320px  384px
BN-ResNet-50    78.1   79.6   79.9
NF-ResNet-50    79.5   80.9   81.1
BN-ResNet-101   80.8   82.2   82.5
NF-ResNet-101   81.4   82.7   83.2
BN-ResNet-152   81.8   83.1   83.4
NF-ResNet-152   82.7   83.6   84.0
BN-ResNet-200   81.8   83.1   83.5
NF-ResNet-200   82.9   84.1   84.3

We perform this same experiment using our NFNet models, pre-training an NFNet-F4 and a slightly wider variant which we denote NFNet-F4+ (see Appendix C). As shown in Table 5 of the appendix, with 20 epochs of pre-training our NFNet-F4+ attains an ImageNet top-1 accuracy of 89.2%. This is the second highest validation accuracy achieved to date with extra training data, second only to a strong recent semi-supervised learning baseline (Pham et al., 2020), and the highest accuracy achieved using transfer learning.

Conclusion

We show for the first time that image recognition models, trained without normalization layers, can not only match the classification accuracies of the best batch-normalized models on large-scale datasets, but also substantially exceed them, while still being faster to train. To achieve this, we introduce Adaptive Gradient Clipping, a simple clipping algorithm which stabilizes large-batch training and enables us to optimize unnormalized networks with strong data augmentations. Leveraging this technique and simple architecture design principles, we develop a family of models which attain state-of-the-art performance on ImageNet without extra data, while being substantially faster to train than competing approaches. We also show that Normalizer-Free models are better suited to fine-tuning after pre-training on very large scale datasets than their batch-normalized counterparts.
Acknowledgements

We would like to thank Aäron van den Oord, Sander Dieleman, Erich Elsen, Guillaume Desjardins, Michael Figurnov, Nikolay Savinov, Omar Rivasplata, Relja Arandjelović, and Rishub Jain for helpful discussions and guidance. Additionally, we would like to thank Blake Hechtman, Tim Shen, Peter Hawkins, and James Bradbury for assistance with developing highly performant JAX code.

References

Arpit, D., Zhou, Y., Kota, B., and Govindaraju, V. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In International Conference on Machine Learning, pp. 1168-1176, 2016.
Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Babuschkin, I., Baumli, K., Bell, A., Bhupatiraju, S., Bruce, J., Buchlovsky, P., Budden, D., Cai, T., Clark, A., Danihelka, I., Fantacci, C., Godwin, J., Jones, C., Hennigan, T., Hessel, M., Kapturowski, S., Keck, T., Kemaev, I., King, M., Martens, L., Mikulik, V., Norman, T., Quan, J., Papamakarios, G., Ring, R., Ruiz, F., Sanchez, A., Schneider, R., Sezener, E., Spencer, S., Srinivasan, S., Stokowiec, W., and Viola, F. The DeepMind JAX Ecosystem, 2020. URL http://github.com/deepmind.
Bachlechner, T., Majumder, B. P., Mao, H. H., Cottrell, G. W., and McAuley, J. Rezero is all you need: Fast convergence at large depth. arXiv preprint arXiv:2003.04887, 2020.
Balduzzi, D., Frean, M., Leary, L., Lewis, J., Ma, K. W.-D., and McWilliams, B. The shattered gradients problem: If resnets are the answer, then what is the question? In International Conference on Machine Learning, pp. 342-350, 2017.
Bello, I. Lambdanetworks: Modeling long-range interactions without attention. In International Conference on Learning Representations, ICLR, 2021. URL https://openreview.net/forum?id=xTJEN-ggl1b.
Bernstein, J., Vahdat, A., Yue, Y., and Liu, M.-Y. On the distance between two neural networks and the stability of learning. arXiv preprint arXiv:2002.03432, 2020.
Bjorck, N., Gomes, C. P., Selman, B., and Weinberger, K. Q. Understanding batch normalization. In Advances in Neural Information Processing Systems, pp. 7694-7705, 2018.
Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-Milne, S. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
Brock, A., De, S., and Smith, S. L. Characterizing signal propagation to close the performance gap in unnormalized resnets. In 9th International Conference on Learning Representations, ICLR, 2021.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597-1607. PMLR, 2020.
Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702-703, 2020.
De, S. and Smith, S. Batch normalization biases residual blocks towards the identity function in deep networks. Advances in Neural Information Processing Systems, 33, 2020.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In 9th International Conference on Learning Representations, ICLR, 2021. URL https://openreview.net/forum?id=6Tm1mposlrM.
Gitman, I. and Ginsburg, B. Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. arXiv preprint arXiv:1709.08145, 2017.
Gong, C., Ren, T., Ye, M., and Liu, Q. Maxup: A simple way to improve generalization of neural network training. arXiv preprint arXiv:2002.09024, 2020.
Google. Cloud TPU Performance Guide. https://cloud.google.com/tpu/docs/performance-guide, 2021.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Gueguen, L., Sergeev, A., Kadlec, B., Liu, R., and Yosinski, J. Faster neural networks straight from jpeg. Advances in Neural Information Processing Systems, 31:3933-3944, 2018.
Hanin, B. and Rolnick, D. How to start training: The effect of initialization and architecture. In Advances in Neural Information Processing Systems, pp. 571-581, 2018.
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with numpy. Nature, 585(7825):357-362, Sep 2020. ISSN 1476-4687.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630-645. Springer, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016b.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729-9738, 2020.
He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., and Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558-567, 2019.
Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
Hennigan, T., Cai, T., Norman, T., and Babuschkin, I. Haiku: Sonnet for JAX, 2020. URL http://github.com/deepmind/dm-haiku.
Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1731-1741, 2017.
Hooker, S. The hardware lottery. arXiv preprint arXiv:2009.06489, 2020.
Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141, 2018.
Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646-661. Springer, 2016.
Huang, L., Liu, X., Liu, Y., Lang, B., and Tao, D. Centered weight normalization in accelerating training of deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2803-2811, 2017.
Huang, L., Qin, J., Zhou, Y., Zhu, F., Liu, L., and Shao, L. Normalization techniques in training dnns: Methodology, analysis and application. arXiv preprint arXiv:2009.12836, 2020.
Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. arXiv preprint arXiv:1702.03275, 2017.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
Jacot, A., Gabriel, F., and Hongler, C. Freeze and chaos for dnns: an ntk view of batch normalization, checkerboard and boundary effects. arXiv preprint arXiv:1907.05715, 2019.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Large scale learning of general visual representations for transfer. arXiv preprint arXiv:1912.11370, 2019.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097-1105, 2012.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9-48. Springer, 2012.
Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
Luo, P., Wang, X., Shao, W., and Peng, Z. Towards understanding regularization in batch normalization. arXiv preprint arXiv:1809.00846, 2018.
Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and Van Der Maaten, L. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision, ECCV, pp. 181-196, 2018.
Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018.
Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN USSR, (269):543-547, 1983.
Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310-1318, 2013.
Pham, H., Xie, Q., Dai, Z., and Le, Q. V. Meta pseudo labels. arXiv preprint arXiv:2003.10580, 2020.
Pham, H. V., Lutellier, T., Qi, W., and Tan, L. Cradle: cross-backend validation to detect and localize bugs in deep learning libraries. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 1027-1038. IEEE, 2019.
Polyak, B. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.
Qiao, S., Wang, H., Liu, C., Shen, W., and Yuille, A. Weight standardization. arXiv preprint arXiv:1903.10520, 2019.
Qin, J., Fang, J., Zhang, Q., Liu, W., Wang, X., and Wang, X. Resizemix: Mixing data with preserved object information and true labels. arXiv preprint arXiv:2012.11101, 2020.
Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International Conference on Learning Representations, ICLR, 2016.
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428-10436, 2020.
Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems, 30:6076-6085, 2017a.
Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In International Conference on Machine Learning, pp. 2847-2854. PMLR, 2017b.
Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
Rota Bulò, S., Porzi, L., and Kontschieder, P. In-place activated batchnorm for memory-optimized training of dnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5639-5647, 2018.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. IJCV, 115:211-252, 2015.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520, 2018.
Sandler, M., Baccash, J., Zhmoginov, A., and Howard, A. Non-discriminative data or weak model? on the relative importance of data and model resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0-0, 2019.
Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483-2493, 2018.
Shao, J., Hu, K., Wang, C., Xue, X., and Raj, B. Is normalization indispensable for training deep neural network? Advances in Neural Information Processing Systems, 33, 2020.
Shen, S., Yao, Z., Gholami, A., Mahoney, M., and Keutzer, K. Powernorm: Rethinking batch normalization in transformers. In International Conference on Machine Learning, pp. 8741-8751. PMLR, 2020.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR, 2015.
Singh, S. and Shrivastava, A. Evalnorm: Estimating batch normalization statistics for evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3633-3641, 2019.
Smith, S., Elsen, E., and De, S. On the generalization benefit of noise in stochastic gradient descent. In International Conference on Machine Learning, pp. 9058-9067. PMLR, 2020.
Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P., and Vaswani, A. Bottleneck transformers for visual recognition. arXiv preprint arXiv:2101.11605, 2021.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.
Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
Summers, C. and Dinneen, M. J. Four things everyone should know to improve batch normalization. arXiv preprint arXiv:1906.03548, 2019.
Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139-1147, 2013.
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016a.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818-2826, 2016b.
Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105-6114, 2019.
Tieleman, T. and Hinton, G. Rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26-31, 2012.
Touvron, H., Vedaldi, A., Douze, M., and Jégou, H. Fixing the train-test resolution discrepancy. In Advances in Neural Information Processing Systems, pp. 8252-8262, 2019.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
Wu, Y. and He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3-19, 2018.
Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687-10698, 2020.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492-1500, 2017.
Yang, G., Pennington, J., Rao, V., Sohl-Dickstein, J., and Schoenholz, S. S. A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129, 2019.
You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training bert in 76 minutes. In 7th International Conference on Learning Representations, ICLR, 2019.
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6023-6032, 2019.
Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Zhang, H., Dauphin, Y. N., and Ma, T. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019a.
Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354-7363. PMLR, 2019b.
Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In 8th International Conference on Learning Representations, ICLR, 2020. URL https://openreview.net/forum?id=BJgnXpVYwS.
A. Experiment Details

A.1. ImageNet Experiment Settings

Figure 4. ImageNet Validation Accuracy vs. Test GFLOPs. All numbers are single-model, single crop. Our NFNet models are competitive with large EfficientNet variants for a given FLOPs budget, despite being optimized for training latency.

For ImageNet experiments (Russakovsky et al., 2015), we train on the standard ILSVRC2012 training split, which comprises 1,281,167 images from 1000 classes. Our baseline training preprocessing follows Szegedy et al. (2016b), with distorted bounding box crops and random horizontal flips (Simonyan & Zisserman, 2015), with all other augmentations being applied in addition to this. We train using the categorical softmax cross-entropy loss with label smoothing of 0.1 (Szegedy et al., 2016b), and optimize our networks using stochastic gradient descent (Robbins & Monro, 1951) with Nesterov's momentum (Nesterov, 1983; Sutskever et al., 2013), using a momentum coefficient of 0.9. Our training code is available at https://github.com/deepmind/deepmind-research/tree/master/nfnets, and is written using numpy (Harris et al., 2020), JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020), and the DeepMind JAX Ecosystem (Babuschkin et al., 2020).

We employ weight decay in the standard style (not decoupled as in Loshchilov & Hutter (2017)), with a weight decay coefficient of 2 × 10−5 for NFNets. Critically, weight decay is not applied to the affine gains or biases in the weight-standardized convolutional layers, or to the SkipInit gains. We apply a Dropout rate specific to each NFNet variant as in Tan & Le (2019), and use Stochastic Depth with a rate of 0.25 for all variants, again similar to Tan & Le (2019).

We use a learning rate which warms up from 0 to its maximal value over the first 5 epochs, where the maximal value is chosen as 0.1 × B/256, with B the batch size, following Goyal et al. (2017). After warmup, the learning rate is annealed to zero with cosine decay over the rest of training (Loshchilov & Hutter, 2016). We employ AGC with λ = 0.01 and ε = 10−3 for every parameter except the fully-connected weight of the linear classifier layer.

By default, we train with a batch size of 4096 for 360 epochs, a common training schedule which has the same number of total training steps (roughly 112,000) as training with a batch size of 1024 for 90 epochs. We found that training for longer sometimes improved results, but that this was not always consistent across models or training settings; all results reported in this work employ the 360 epoch schedule. Unlike Tan & Le (2019), we do not perform early stopping. We employ an exponential moving average of the model parameters (similar to Polyak averaging (Polyak, 1964)) with a decay rate of 0.99999 which, following Tan & Le (2019), follows a warmup schedule where the decay is equal to min(0.99999, (1 + t)/(10 + t)).

We train on TPU using bfloat16 activations to save memory and improve speed. This means that we keep the parameters and optimizer state (the momentum buffer) in float32, but compute activations and gradients in bfloat16 during forward- and backpropagation. We cast the logits to float32 before computing the loss to aid numerical stability. We cast gradients back to float32 before summing them across devices, which helps prevent compounding accumulation error and ensures the parameter update is computed in float32.

For evaluation we follow the most common style of single-crop preprocessing: we resize the raw image (with bicubic interpolation) to be 32 pixels larger than the target resolution, then crop to the target resolution (Simonyan & Zisserman, 2015). While this is the most commonly employed variant, we note that an alternative method exists where a padded center crop is taken and then resized to the target resolution (Szegedy et al., 2016a; Tan & Le, 2019). We find this alternative to work marginally worse than the standard choice of resizing before cropping. No test-time augmentation, multi-crop evaluation, or model ensembling is applied.
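As a concrete illustration of the schedule described above, the following is a minimal sketch of the learning rate and EMA decay rules; it is illustrative pseudocode rather than an excerpt from our released code, and the function names are placeholders.

    import math

    def learning_rate(step, batch_size, steps_per_epoch, total_steps):
        # Maximal learning rate scales with batch size: 0.1 * B / 256.
        max_lr = 0.1 * batch_size / 256
        warmup_steps = 5 * steps_per_epoch  # linear warmup from 0 over 5 epochs
        if step < warmup_steps:
            return max_lr * step / warmup_steps
        # Cosine decay to zero over the remaining training steps.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

    def ema_decay(step, max_decay=0.99999):
        # EMA decay warmup: min(0.99999, (1 + t) / (10 + t)).
        return min(max_decay, (1.0 + step) / (10.0 + step))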
Table 5. Comparing ImageNet transfer performance for models which use extra data for large-scale pre-training. Meta-Pseudo-Labels results are from Pham et al. (2020), ViT results are from Dosovitskiy et al. (2021), and BiT results are from Kolesnikov et al. (2019). Noisy Student results (Xie et al., 2020) are taken from the improved versions reported in Foret et al. (2021) which employ SAM. IG-940M (Mahajan et al., 2018) results are taken from the improved versions reported in Touvron et al. (2019).

Model                            #FLOPS   #Params   ImageNet Top-1   TPUv3-core-days
NFNet-F4+ (ours)                 367B     527M      89.2             1.86k
NFNet-F4 (ours)                  215B     316M      89.2             3.7k
EffNet-L2 + Meta Pseudo Labels   -        480M      90.2             22.5k
EffNet-L2 + NoisyStudent + SAM   -        480M      88.6             12.3k
ViT-H/14                         -        632M      88.55 ± 0.04     2.5k
ViT-L/16                         -        307M      87.76 ± 0.03     0.68k
BiT-L ResNet152x4                -        928M      87.54 ± 0.02     9.9k
ResNeXt-101 32x48d (IG-940M)     -        829M      86.4             -
A.2. Measuring Training Latency

We measure training latency as the actual observed wall-clock time required to perform a training step at a given per-device batch size. To accomplish this, we run the full training loop for 5000 steps, then take the median time required to perform a single training step. We choose the median as the mean would also incorporate the initial speed ramp-up at the beginning of training, so the median is more robust to these types of variations during measurement and better reflects the speed observed during a full training run. We remove dataloading as a consideration by having the training loop operate on tensors which are already loaded onto the device. This is consistent with how we train NFNets in practice, since our data pipeline is optimized to ensure we are never input-bound.

For measuring speed on TPUv3, we run on 32 devices with a batch size of 32 per device, and sync gradients between replicas, meaning that our training latency is representative of the actual speed we can obtain in practice with distributed training. We employ bfloat16 training for all models, as described above. For some of our larger models, this batch size of 32 per device does not fit into the 16GB of device memory, so we allow the compiler to engage automatic rematerialization (also known as gradient checkpointing). Additional speed may be obtainable by careful tuning of manual rematerialization.

For measuring speed on GPU, we run on a single V100 GPU using float16 training to engage the card's tensorcores, which strongly accelerates training. Unlike TPUv3, we do not consider the cost of cross-device communication for GPU, which will vary substantially depending on the hardware configuration of the interlinks available to the user. As with TPUv3, some of our models do not fit in memory at this batch size, but we instead employ gradient accumulation to mimic the full batch size. This appears to be less efficient than rematerialization for large models (specifically for our F5 variant and for EfficientNet-B7), so we expect that manually applying rematerialization would potentially yield GPU speedups in this case, but require extra engineering effort.

We report results from our own measurements for all models except for SENets (Hu et al., 2018), BoTNets (Srinivas et al., 2021), and DeIT (Touvron et al., 2020), which we instead borrow from Srinivas et al. (2021). We report slightly different training latencies for small EfficientNet variants because we report the wallclock time, whereas Srinivas et al. (2021) report the "compute time", which will ignore cross-device communication. For very small models the inter-device communication costs can be non-negligible relative to the compute time, especially for EfficientNets which employ cross-replica batch normalization. For larger models this cost is generally negligible on hardware like TPUv3 with very fast interconnects, so in practice one can expect that the compute time for models like BoTNets will be the same regardless of the reporting methodology used.

A.3. Augmentations

Our full NFNet training recipe applies "baseline" preprocessing (sampling distorted bounding boxes and applying random horizontal flips), RandAugment (RA, Cubuk et al. (2020)), which we apply to all images in a batch, MixUp (Zhang et al., 2017), which we apply to half the images in a batch with α = 0.2, and CutMix (Yun et al., 2019), which we apply to the other half of the images in the batch.

Following Qin et al. (2020), we apply RandAugment after applying MixUp or CutMix. We apply RA with 4 layers (meaning 4 augmentations are chosen), which is substantially stronger than the common default of 2 layers, and following Cubuk et al. (2020) we pick the magnitude of the RA augmentation based on the training resolution of the images. If the augmentation magnitude is set too high relative to the image resolution, then certain operations (such as shearing) can result in many images being completely blank, which will impede training. For NFNet variants F0 through F6, the chosen RA magnitudes are [5, 10, 10, 15, 15, 15, 15], respectively.

The combination of MixUp, CutMix, and RA results in an intense level of augmentation which progressively benefits NFNets, but does not appear to benefit other models like EfficientNets over a baseline of just using well-tuned RA. We hypothesize that this is because our models lack the implicit regularization of batch normalization, and similar to how they are more amenable to large-scale pre-training, they are accordingly also more amenable to stronger data augmentations.
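The ordering described above can be sketched for a single batch as follows; this is illustrative pseudocode, and the helpers rand_augment, mixup, and cutmix are assumed to be provided by an augmentation library rather than defined here.

    import numpy as np

    def augment_batch(images, labels, ra_layers=4, ra_magnitude=10, mixup_alpha=0.2):
        # images: [B, H, W, C] after baseline preprocessing; labels: [B, classes] one-hot.
        half = images.shape[0] // 2

        # MixUp on one half of the batch, CutMix on the other half.
        imgs_a, labs_a = mixup(images[:half], labels[:half], alpha=mixup_alpha)
        imgs_b, labs_b = cutmix(images[half:], labels[half:])
        images = np.concatenate([imgs_a, imgs_b], axis=0)
        labels = np.concatenate([labs_a, labs_b], axis=0)

        # Following Qin et al. (2020), RandAugment is applied after MixUp/CutMix,
        # with 4 layers and a magnitude chosen based on the training resolution.
        images = rand_augment(images, num_layers=ra_layers, magnitude=ra_magnitude)
        return images, labels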
A.4. Accelerating Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM, Foret et al. (2021)) has been shown to improve the performance of various classifier models by seeking flat minima which are hypothesized to generalize better. However, by default it is expensive to apply, as it requires two evaluations of the gradient: one for a step of gradient ascent to attain "noised" parameters, and then one to attain the gradients with respect to the noised parameters, which are used to update the actual parameters. We experimented with ameliorating this cost by only employing 20% of the batch to compute the gradients for the ascent step, which we found to result in equivalent performance while only increasing the training latency by 20%-40% instead of by 100%. We also tried using SAM where the batch of data used to compute the ascent step was a different batch from the one used to compute the descent step, but found that this destroyed all the benefits of SAM. This indicates that it is necessary for the ascent step to be computed using the same batch (or a subset thereof) as is used to compute the descent step. As noted in Foret et al. (2021), we found that SAM worked best in a distributed setup where the gradients used for the ascent step are not synced between replicas (meaning a separate copy of the "noised" parameters is kept on each replica and used to compute the local descent gradients). We note that this phenomenon can also be mimicked on fewer devices, or a single device, by employing gradient accumulation (iteratively computing noised parameters and then accumulating the gradients to be used for descent).
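The cheaper ascent step can be sketched as follows for a single device; loss_fn and the parameter pytree are assumed, and this simplified version omits the distributed, per-replica handling of the noised parameters described above.

    import jax
    import jax.numpy as jnp

    def sam_gradients(params, images, labels, rho=0.05, ascent_fraction=0.2):
        # Ascent step: compute gradients on only 20% of the batch.
        n = max(1, int(ascent_fraction * images.shape[0]))
        g_ascent = jax.grad(loss_fn)(params, images[:n], labels[:n])

        # Normalize the ascent direction and form the "noised" parameters.
        g_norm = jnp.sqrt(sum(jnp.sum(jnp.square(g))
                              for g in jax.tree_util.tree_leaves(g_ascent)))
        noised = jax.tree_util.tree_map(
            lambda p, g: p + rho * g / (g_norm + 1e-12), params, g_ascent)

        # Descent step: gradients at the noised parameters, computed on the full
        # batch, which must contain the subset used for the ascent step.
        return jax.grad(loss_fn)(noised, images, labels)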
A.5. Large Scale Pre-Training Details

Our large scale pre-training is performed on JFT-300m (Sun et al., 2017), a dataset of 300 million labeled images spanning roughly 18,000 classes. We pre-train all models at resolution 224 (regardless of the native model resolution for a given NFNet variant) using the same optimizer settings as for our ImageNet experiments (as described in Appendix A.1), with the exception of using a smaller weight decay (10−5 for BN- and NF-ResNets, and 10−6 for all NFNet models). We briefly tried pre-training at larger image resolutions and found that this was not worth the added pre-training expense. We do not use any augmentations except for baseline random crops and flips, nor do we use any exponential moving averages during pre-training.

For ResNet models, we pre-train with a batch size of 1024 for 10 epochs using a learning rate of 0.4 following Goyal et al. (2017), which is warmed up over 5,000 steps and then decayed to zero with cosine annealing through the rest of training. We fine-tune ResNets on ImageNet with a batch size of 2048 for 15,000 steps using a learning rate of 0.1 (again employing a 5000 step warmup and cosine decay, but not applying the batch size scaling of Goyal et al. (2017)), no weight decay, no DropOut, and no Stochastic Depth. For fine-tuning we apply EMA with decay 0.9999 and the decay warmup described above. Due to the expense of this experiment we only run a single random seed for each model (fine-tuning three separate times, at each of the fine-tune resolutions of 224, 320, and 384 pixels).

We find, contrary to Dosovitskiy et al. (2021), that a large weight decay is harmful during pre-training, and that instead very small weight decays are important so that the models are not constrained when trying to capture the information in a large-scale dataset. Contrary to Dosovitskiy et al. (2021), we also find that Adam is not as performant as SGD in this setting. We believe this is reflected in the fact that our baseline batch-normalized ResNets substantially outperform the baselines reported in Dosovitskiy et al. (2021), despite otherwise similar pre-training and fine-tuning configurations. For reference, Dosovitskiy et al. (2021) report a ResNet-50 transfer accuracy of 77.54% when fine-tuned at 384px resolution, whereas we obtain an accuracy of 79.9% in the same setting for BN-ResNet-50 and 81.1% for NF-ResNet-50. The full set of accuracies for these ResNet models is available in Table 4. We recommend future work on large-scale pre-training to begin with a weight decay of zero and consider lightly increasing it, rather than starting with a large value of weight decay and experimenting with decreasing it.

For NFNet models, we pre-train with a batch size of 4096. For NFNet-F4, we pre-train for 40 epochs, and for NFNet-F4+ we pre-train for 20 epochs. The F4+ model is a wider variant, constructed from the F4 model by using a channel pattern of [384, 768, 2048, 2048] instead of [256, 512, 1536, 1536] and keeping all other hyper-parameters the same. We find that both models obtain about the same training latency (around 830ms per step when training with a per-core batch size of 32), but that the F4 model needs the additional pre-training time to reach the same final performance as the F4+ model. This indicates that (given sufficient pre-training data) it is more efficient to train larger models with a shorter epoch budget than to train smaller models for longer, consistent with the observations in Kaplan et al. (2020).

We fine-tune NFNet models for 15,000 steps at a batch size of 2048 using a learning rate of 0.1, which is warmed up from zero over 5000 steps, then annealed to zero with cosine decay through the rest of training. We use SAM with ρ = 0.05, weight decay of 10−5, a DropOut rate of 0.25, and a stochastic depth rate of 0.1. We found that we could obtain similar results using the same regularization setup as for ResNets (no weight decay, DropOut, or Stochastic Depth), but that this mild degree of augmentation was slightly more performant. As with our ResNet fine-tuning, we employ an exponential moving average of the parameters with EMA decay warmup. The results of this experiment, compared against other models which are pre-trained on large-scale datasets, are available in Table 5.
B. Downsides of Batch Normalization

Batch normalization provides a range of benefits, which we discussed in Section 2 of the main text, but it also has a number of disadvantages that motivated this work on normalizer-free networks. We discussed some of these disadvantages in Section 1. In addition, here we enumerate some documented errors and challenges in the implementation of batch normalization in popular frameworks and published work. A number of these errors are identified by Pham et al. (2019), an academic paper on automated testing which discovers two such implementation errors in Keras and one in the CNTK toolkit.

One example is a long-standing bug in certain versions of Keras, whose consequence is that even if a user sets the batch normalization layers to testing mode (as is common when freezing the layers for fine-tuning on downstream tasks), the batch normalization statistics will continue to update, contrary to user expectations. This implementation error is raised in this github issue and this github issue.

The discrepancy between batch normalization train and test behavior has had a direct impact several times in previous work. For example, both DCGAN (Radford et al., 2016) and SAGAN (Zhang et al., 2019b) reported results and released code where batch normalization was run in training mode at test time, as noted here and here,3 and consequently their reported results depend on the batch size used to generate samples.

Subtle differences in batch normalization implementations can also hamper reproducibility. For example, the EfficientNet training code uses a form of cross-replica BatchNorm where the number of devices used to compute statistics varies nonlinearly with the total number of devices (as seen here), and consequently, even given the same code, exact reproduction can be difficult without access to the same hardware. Additionally, the EfficientNet code takes a moving average of the running batch normalization statistics, which in practice means that it takes a moving average of a moving average, compounding the averaging horizon in a way that may be unexpected.

As discussed in the main text, breaking the independence between training examples causes issues in contrastive learning setups like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020). Both models have to deal with the potential for intra-batch information leakage negatively impacting the contrastive objective. MoCo seeks to resolve this by shuffling examples between devices when computing batch statistics, which introduces implementation complexity and makes it challenging to exactly reproduce their results on different hardware. SimCLR seeks to resolve this via the use of cross-replica batch normalization.

3 Note that no 'u' or 's' values are passed into the batch normalization op here, meaning that running statistics are not accumulated.
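As a small self-contained illustration of this train/test discrepancy (not code from any of the frameworks discussed above), running batch normalization in training mode makes the output for a fixed example depend on its batch-mates:

    import numpy as np

    def batchnorm_train_mode(x, eps=1e-5):
        # Normalizes each channel with statistics computed from the current batch.
        mean = x.mean(axis=0, keepdims=True)
        var = x.var(axis=0, keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    rng = np.random.default_rng(0)
    example = rng.normal(size=(1, 8))  # one fixed example with 8 channels
    batch_a = np.concatenate([example, rng.normal(size=(3, 8))])
    batch_b = np.concatenate([example, 5.0 + rng.normal(size=(3, 8))])

    # The same example yields different outputs depending on the rest of the batch,
    # which is why sampling in training mode depends on the generation batch size.
    print(batchnorm_train_mode(batch_a)[0])
    print(batchnorm_train_mode(batch_b)[0])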
C. Model Details

Our NFNet model is a modified SE-ResNeXt-D (He et al., 2016b;a; Xie et al., 2017; Hu et al., 2018; He et al., 2019). The input to the model is an H × W RGB image which has been normalized by the per-channel mean and standard deviation of the entire ImageNet (Russakovsky et al., 2015) training set, as is standard in most image classifiers. The model has an initial "stem" comprised of a 3 × 3 stride 2 convolution with 16 channels, two 3 × 3 stride 1 convolutions with 32 and 64 channels respectively, and a final 3 × 3 stride 2 convolution with 128 channels. A nonlinearity is placed between each convolution in the stem, but importantly not after the final convolution in the stem. By default we use GELU (Hendrycks & Gimpel, 2016), although most common nonlinearities like ReLU or SiLU appear to have similar performance. All our nonlinearities are rescaled to be approximately variance-preserving following Brock et al. (2021) using a fixed scalar gain, for which we provide reference values in our source code.

Following the stem are four residual "stages", where the number of blocks per stage is [1, 2, 6, 3] for our baseline F0 variant, and each subsequent variant has this number multiplied by N (where N = 1 for F0). The residual stages begin with a "transition" block (as shown in Figure 5) followed by standard residual blocks (as shown in Figure 6). In all but the first stage, the transition block downsamples (with 2 × 2 average pooling on the skip path and by striding the first 3 × 3 convolution on the main path) and changes the output channel count (via a 1 × 1 shortcut convolution on the skip path). He et al. (2019) identified that the use of 2 × 2 average pooling improves performance over using a strided 1 × 1 convolution on the skip path (which merely subsamples the activation). Note that this is slightly different from Bello (2021), which uses a 3 × 3 average pooling kernel with stride 2.

All blocks employ the pre-activation ResNe(X)t bottleneck pattern with an added 3 × 3 grouped convolution inside the bottleneck. This means that the main path comprises a 1 × 1 convolution whose output channel count is equal to 0.5× the output channel count for the block, two 3 × 3 grouped convolutions with group width 128 (with the first strided in transition blocks), and a final 1 × 1 convolution whose output channel count is equal to the block output channel count.

Following the last 1 × 1 convolution is a Squeeze & Excite layer (Hu et al., 2018), which globally average pools the activation, applies two linear layers with an interleaved scaled nonlinearity to the pooled activation, applies a sigmoid, then rescales the tensor channel-wise by twice the value of this sigmoid. Concretely, the output of this layer is 2σ(FC(GELU(FC(pool(h))))) × h. The non-standard scalar multiplier of 2 is used following Brock et al. (2021) to maintain signal variance.

After all of the residual stages, we apply a 1 × 1 expansion convolution that doubles the channel count, similar to the final expansion convolution in EfficientNets (Tan & Le, 2019), then global average pooling. This layer is primarily helpful when using very thin networks, as it is typically desirable to have the dimensionality of the final activation vectors (which the classifier layer receives) be greater than or equal to the number of classes, but we retain it in our wider networks to benefit future work which might seek to train very thin networks based on our backbones. We tried replacing this convolution with a fully connected layer after the average pooling, but found that this was not helpful.

The final layer is a fully-connected classifier layer with learnable biases which outputs a 1000-way class vector (which can be softmaxed in order to obtain normalized class probabilities). We initialize this layer's weight with a standard deviation of 0.01 following Goyal et al. (2017). We found that initializing the weight with zeros, as is sometimes done, could lead to instabilities when training with very large numbers of output classes.

No activation normalization layers are used anywhere in our residual blocks. Instead, we employ the Normalizer-Free variance downscaling strategy (Brock et al., 2021). This means that the input to the main path of the residual block is multiplied by 1/β, where β is the analytically predicted standard deviation of the signal at that block at initialization, and the output of the block is multiplied by a scalar hyperparameter α, typically set to a small value like α = 0.2. As in Brock et al. (2021), we compute the expected empirical variance at residual block ℓ analytically using Var(xℓ) = Var(xℓ−1) + α², with Var(x₀) = 1, resulting in βℓ = √Var(xℓ). We also mimic the variance reset that happens in the transition blocks of batch-normalized networks by having the shortcut convolution in transition layers operate on (xℓ/βℓ) rather than xℓ (see Figure 5). This ensures unit signal variance at the start of each stage (Var(xℓ+1) = 1 + α²).

Additionally, following Brock et al. (2021), we also employ SkipInit (De & Smith, 2020), a learnable zero-initialized scalar gain in addition to α, which results in the residual block being initialized to the identity (except in transition layers), similar to Goyal et al. (2017); Zhang et al. (2019a); Bachlechner et al. (2020), which we find to improve stability for very deep networks. While this will result in the signal propagation at initialization not actually following the expected variance as computed above, we find that the variance downscaling and the α scalar are still beneficial for stability.

All convolutions employ Scaled Weight Standardization (Brock et al., 2021), with a learnable affine gain applied to the standardized weight and a learnable affine bias applied to the output of the convolution operation.
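A minimal sketch of this reparameterization is given below, assuming a [k, k, C_in, C_out] kernel layout; the variable names are illustrative and the exact epsilon handling follows our understanding of Brock et al. (2021) rather than being copied from our released Haiku code.

    import numpy as np

    def scaled_ws_weight(weight, gain, eps=1e-4):
        # weight: [k, k, C_in, C_out]; gain: learnable per-output-channel affine gain
        # (weight decay is not applied to this gain).
        fan_in = np.prod(weight.shape[:-1])
        mean = weight.mean(axis=(0, 1, 2), keepdims=True)
        var = weight.var(axis=(0, 1, 2), keepdims=True)
        # Standardize each output unit's fan-in weights, scaled so that the
        # convolution approximately preserves variance at initialization.
        return gain * (weight - mean) / np.sqrt(var * fan_in + eps)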
Table 6. Detailed model ablation table. Each entry reports ImageNet Top-1 accuracy on the left and TPUv3 training latency on the right.

                       F0               F1                F2                F3
Baseline               80.4%  58.0ms    81.7%  116.0ms    82.0%  211.7ms    82.3%  369.5ms
+ Modified Width       80.9%  64.1ms    81.8%  133.9ms    82.0%  252.2ms    82.3%  441.5ms
+ Second Conv          81.3%  73.3ms    82.2%  158.5ms    82.4%  295.8ms    82.7%  532.2ms
+ MixUp                82.2%  73.3ms    82.9%  158.5ms    83.1%  295.8ms    83.5%  532.2ms
+ RandAugment          83.2%  73.3ms    84.6%  158.5ms    84.8%  295.8ms    85.0%  532.2ms
+ CutMix               83.6%  73.3ms    84.7%  158.5ms    85.1%  295.8ms    85.7%  532.2ms
Default Width + Augs   83.1%  65.9ms    84.5%  137.4ms    85.0%  248.8ms    85.5%  452.2ms
-NF, + BN              83.4%  111.7ms   84.4%  258.0ms    85.1%  396.3ms    85.5%  617.7ms
Critically, weight decay is not applied to these affine gains or biases, or to the SkipInit gains. The S&E layers do not employ weight standardization on their fully connected layers, nor does the fully-connected classifier layer's weight. We initialize the underlying weights for these layers using LeCun initialization (LeCun et al., 2012).
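Putting the variance downscaling pieces above together, a stage can be sketched as follows; shortcut and the residual_branch callables are placeholders, and this bookkeeping (rather than our actual Haiku modules) is only meant to make the β recursion concrete.

    import numpy as np

    def nf_stage(x, residual_branches, alpha=0.2):
        # Expected signal variance entering the stage; the transition block resets
        # it to 1 because both of its branches see the downscaled input x / beta.
        expected_var = 1.0
        for i, branch in enumerate(residual_branches):
            beta = np.sqrt(expected_var)  # beta_l = sqrt(Var(x_l))
            if i == 0:
                # Transition block: the shortcut convolution also operates on x / beta.
                x = shortcut(x / beta) + alpha * branch(x / beta)
                expected_var = 1.0 + alpha ** 2
            else:
                # Non-transition block: the skip path takes x directly.
                x = x + alpha * branch(x / beta)
                expected_var = expected_var + alpha ** 2
        return x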
Figure 5. Detailed view of an NFNet transition block. The bottleneck ratio is 0.5, while the group width (the number of channels per group, C/G) in the 3 × 3 convolutions is fixed at 128 regardless of the number of channels. Note that in this block, the skip path takes in the signal after the variance downscaling with β and the scaled nonlinearity.

Figure 6. Detailed view of an NFNet non-transition block. The bottleneck ratio is 0.5, while the group width (the number of channels per group, C/G) in the 3 × 3 convolutions is fixed at 128 regardless of the number of channels. Note that in this block, the skip path takes in the signal before the variance downscaling with β.
Figure 7. Performance across different clipping thresholds λ of AGC for different batch sizes on ResNet200. (Axes: clipping threshold from 0.01 to 0.16 on the x-axis; ImageNet Top-1 accuracy on the y-axis; curves for batch sizes 256, 512, 1024, 2048, and 4096.)
D. Additional AGC Ablations

In Figure 7, we show performance for different clipping thresholds λ across a range of batch sizes on ResNet200, using the same training setup described in Section 4.1. In both Figure 7 and Figure 2, we run NF-ResNets with AGC for 5 independent runs, and report the average of the best 4 of these 5 runs. This ensures that our results are robust to outliers and failed training runs.

As in Figure 2(b), we see that smaller clipping thresholds are necessary for stability at higher batch sizes on the ResNet200. For all our experiments in Section 6 where we use batch size 4096, we use a clipping threshold of λ = 0.01.
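For reference, a minimal sketch of the AGC rule being ablated here is given below; the unit-wise norm axes shown are illustrative (per output unit), and the released code should be consulted for the exact conventions.

    import numpy as np

    def unitwise_norm(x):
        # Norm per output unit, e.g. per output channel for a conv kernel or per
        # output unit for an [in, out] linear weight; vectors use their own norm.
        if x.ndim <= 1:
            return np.sqrt(np.sum(np.square(x)))
        return np.sqrt(np.sum(np.square(x), axis=tuple(range(x.ndim - 1)), keepdims=True))

    def adaptive_gradient_clip(grad, param, clipping=0.01, eps=1e-3):
        g_norm = unitwise_norm(grad)
        p_norm = np.maximum(unitwise_norm(param), eps)
        # Rescale units whose gradient-to-parameter norm ratio exceeds lambda.
        scaled = grad * (clipping * p_norm / np.maximum(g_norm, 1e-6))
        return np.where(g_norm > clipping * p_norm, scaled, grad)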
E. Negative Results

Figure 8. Comparison of standard (left) and "straddling" (right) Squeeze & Excite blocks. Both forms of S&E block allow for full cross-channel connectivity, but only in the form of a scalar multiplier per channel.

In the course of developing the NFNet architecture we experimented with strategies impacting a range of model design aspects, including rules for picking backbone width and depth, bottleneck compression or expansion ratios, choice of group width, the placement of Squeeze & Excite (S&E) layers, and more. In this section we present select insights from what we found not to work well. As in Section 5, our goal here was to improve the Pareto front of top-1 holdout accuracy versus training speed.

First, we considered patterns where, for a given choice of backbone, we allowed the group width or number of groups in the 3 × 3 convolutions to differ across stages, or similarly allowed the bottleneck ratio to vary across stages. We also considered varying whether the transition blocks would have their bottleneck ratios be a function of the number of block output channels (as in ResNet models) or of the number of block input channels (as in many mobile models). For example, one model family variant used a group width of [8, 8, 16, 16] in each of the four stages, with a bottleneck ratio of 0.25 (with the transition blocks using the bottleneck width based on the input channel count) in the first two stages and 0.5 in the latter two stages (with the transition blocks here using the bottleneck width based on the output channel count).

While we occasionally found that some of this variance could be helpful (for example, using inverted bottleneck blocks in the first stage yielded occasional but inconsistent improvements), we broadly found such heterogeneity to be unnecessary, and to confound attempts to reason out interpretable design patterns. We expect that these design aspects could yield better models if incorporated into large-scale architecture search to obtain individual models at given compute budget targets, but to be less useful for manual design. Our final NFNet designs are largely homogeneous with respect to these parameters, with only width and stage depth varying between stages, and ResNet-style bottleneck widths (where the channel count of the 3 × 3 convolutions is the number of output channels times the bottleneck ratio).

We explored aggressive downsampling strategies, such as operating on 8 × 8 DCT coefficients as in Gueguen et al. (2018) instead of using the standard ResNet stem. While this is an effective way to improve model speed, we found that any improvements in model speed came at the cost of model accuracy. This appears to hold true even when this downsampling is done with an invertible operation (e.g. an orthogonal strided transform like the DCT) such that no information is lost. This is arguably consistent with the observations in Sandler et al. (2019), suggesting that model "internal resolution" is a more important quantity to consider in this respect, but we did not explore this direction in further detail.

We next considered trying to improve speed by making our 1 × 1 dense convolutions into grouped convolutions. This normally causes sharp performance degradation, as these layers are responsible for the flow of information across all channels (as the other convolutions are grouped), and removing their full connectivity substantially reduces model expressivity. To ameliorate this we considered applying straddled Squeeze & Excite layers (Figure 8), where the input to the S&E is the input to the convolution, but the output of the S&E multiplies the output of the convolution. This is in contrast to the normal Squeeze & Excite formulation, which simply operates directly on an activation (i.e. it takes in a value, and its output is used to multiply that same value). Both forms of S&E block help to restore cross-channel connectivity (albeit not as strongly as using a fully connected convolution) more cheaply than fully-connected layers, as they operate on globally average-pooled activations.

Employing grouped 1 × 1 convolutions with a small number of groups (2 or 4) paired with S&E layers slightly reduces accuracy, improves theoretical FLOPS, and reduces parameter counts, with the straddled S&E block resulting in slightly improved accuracy relative to the standard S&E block. However, we found there was no choice of 1 × 1 group width or group count which maintained comparable accuracy while reducing training latency. Using a high group count (and therefore a small group width) substantially reduces FLOPS and parameter counts, but also substantially reduces model performance, indicating that incorporating these S&E layers helps but does not fully recover the expressivity of dense 1 × 1 convolutions.

Finally, we did not experiment with any attention variants (Bello, 2021; Srinivas et al., 2021), and we expect that our results could likely be improved by adopting these strategies into our models.
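To make the distinction between the two S&E placements concrete, a minimal sketch is given below; grouped_conv1x1 and squeeze_excite are placeholders for the corresponding layers rather than our implementation.

    def standard_se(x):
        # Standard S&E: the channel-wise gate is computed from, and applied to,
        # the same activation.
        h = grouped_conv1x1(x)
        return squeeze_excite(h) * h

    def straddled_se(x):
        # Straddled S&E: the gate is computed from the convolution's input but
        # multiplies the convolution's output, restoring some cross-channel
        # connectivity that grouping the 1x1 convolution removed.
        h = grouped_conv1x1(x)
        return squeeze_excite(x) * h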