
Published as a conference paper at ICLR 2019

FIXUP INITIALIZATION:
RESIDUAL LEARNING WITHOUT NORMALIZATION
Hongyi Zhang∗ Yann N. Dauphin† Tengyu Ma‡
MIT Google Brain Stanford University
hongyiz@mit.edu yann@dauphin.io tengyuma@stanford.edu

arXiv:1901.09321v2 [cs.LG] 12 Mar 2019

ABSTRACT

Normalization layers are a staple in state-of-the-art deep neural network architectures.
They are widely believed to stabilize training, enable higher learning
rate, accelerate convergence and improve generalization, though the reason for
their effectiveness is still an active research topic. In this work, we challenge the
commonly-held beliefs by showing that none of the perceived benefits is unique
to normalization. Specifically, we propose fixed-update initialization (Fixup), an
initialization motivated by solving the exploding and vanishing gradient problem
at the beginning of training via properly rescaling a standard initialization. We
find training residual networks with Fixup to be as stable as training with nor-
malization — even for networks with 10,000 layers. Furthermore, with proper
regularization, Fixup enables residual networks without normalization to achieve
state-of-the-art performance in image classification and machine translation.

1 INTRODUCTION

Artificial intelligence applications have witnessed major advances in recent years. At the core of
this revolution is the development of novel neural network models and their training techniques. For
example, since the landmark work of He et al. (2016), most of the state-of-the-art image recognition
systems are built upon a deep stack of network blocks consisting of convolutional layers and additive
skip connections, with some normalization mechanism (e.g., batch normalization (Ioffe & Szegedy,
2015)) to facilitate training and generalization. Besides image classification, various normalization
techniques (Ulyanov et al., 2016; Ba et al., 2016; Salimans & Kingma, 2016; Wu & He, 2018) have
been found essential to achieving good performance on other tasks, such as machine translation
(Vaswani et al., 2017) and generative modeling (Zhu et al., 2017). They are widely believed to have
multiple benefits for training very deep neural networks, including stabilizing learning, enabling
higher learning rate, accelerating convergence, and improving generalization.
Despite the enormous empirical success of training deep networks with normalization, and recent
progress on understanding the working of batch normalization (Santurkar et al., 2018), there is
currently no general consensus on why these normalization techniques help training residual neural
networks. Intrigued by this topic, in this work we study
(i) without normalization, can a deep residual network be trained reliably? (And if so,)
(ii) without normalization, can a deep residual network be trained with the same learning rate,
converge at the same speed, and generalize equally well (or even better)?
Perhaps surprisingly, we find the answers to both questions are Yes. In particular, we show:

• Why normalization helps training. We derive a lower bound for the gradient norm of a residual
network at initialization, which explains why with standard initializations, normalization tech-
niques are essential for training deep residual networks at maximal learning rate. (Section 2)


∗ Work done at Facebook. Equal contribution.
† Work done at Facebook. Equal contribution.
‡ Work done at Facebook.


• Training without normalization. We propose Fixup, a method that rescales the standard initial-
ization of residual branches by adjusting for the network architecture. Fixup enables training very
deep residual networks stably at maximal learning rate without normalization. (Section 3)
• Image classification. We apply Fixup to replace batch normalization on image classification
benchmarks CIFAR-10 (with Wide-ResNet) and ImageNet (with ResNet), and find Fixup with
proper regularization matches the well-tuned baseline trained with normalization. (Section 4.2)
• Machine translation. We apply Fixup to replace layer normalization on machine translation
benchmarks IWSLT and WMT using the Transformer model, and find it outperforms the baseline
and achieves new state-of-the-art results on the same architecture. (Section 4.3)

[Figure 1 block diagrams. Legend: multipliers initialized at 1, biases initialized at 0, marked 3x3 convolutions scaled down by √L. Panels, left to right: (He et al., 2016), Fixup w/o bias, Fixup.]

Figure 1: Left: ResNet basic block. Batch normalization (Ioffe & Szegedy, 2015) layers are marked
in red. Middle: A simple network block that trains stably when stacked together. Right: Fixup
further improves by adding bias parameters. (See Section 3 for details.)
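For concreteness, the following Python (PyTorch) sketch shows one possible layout of the right-most block in Figure 1: a scalar bias before each convolution and each activation, a single scalar multiplier on the residual branch, and no normalization layers. This is our own illustrative module; the class and attribute names (FixupBasicBlock, bias1a, etc.) are hypothetical and not taken from the authors' code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FixupBasicBlock(nn.Module):
        """Sketch of a Fixup basic block (Figure 1, right): no normalization,
        scalar biases before each conv/activation, one scalar multiplier."""

        def __init__(self, channels):
            super().__init__()
            # Scalar biases (initialized at 0), as prescribed by Fixup Rule 3.
            self.bias1a = nn.Parameter(torch.zeros(1))
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bias1b = nn.Parameter(torch.zeros(1))
            self.bias2a = nn.Parameter(torch.zeros(1))
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            # One scalar multiplier per branch (initialized at 1), plus a final bias.
            self.scale = nn.Parameter(torch.ones(1))
            self.bias2b = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            out = self.conv1(x + self.bias1a)
            out = F.relu(out + self.bias1b)
            out = self.conv2(out + self.bias2a)
            out = out * self.scale + self.bias2b
            return F.relu(out + x)  # residual addition followed by ReLU

    # Example usage: a 64-channel block applied to a dummy feature map.
    block = FixupBasicBlock(64)
    out = block(torch.randn(2, 64, 8, 8))  # shape preserved: (2, 64, 8, 8)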
In the remainder of this paper, we first analyze the exploding gradient problem of residual networks
at initialization in Section 2. To solve this problem, we develop Fixup in Section 3. In Section 4 we
quantify the properties of Fixup and compare it against state-of-the-art normalization methods on
real world benchmarks. A comparison with related work is presented in Section 5.

2 PROBLEM: RESNET WITH STANDARD INITIALIZATIONS LEAD TO EXPLODING GRADIENTS
Standard initialization methods (Glorot & Bengio, 2010; He et al., 2015; Xiao et al., 2018) attempt
to set the initial parameters of the network such that the activations neither vanish nor explode.
Unfortunately, it has been observed that without normalization techniques such as BatchNorm they
do not account properly for the effect of residual connections and this causes exploding gradients.
Balduzzi et al. (2017) characterizes this problem for ReLU networks, and we will generalize this to
residual networks with positively homogeneous activation functions. A plain (i.e. without normalization
layers) ResNet with residual blocks {F_1, ..., F_L} and input x_0 computes the activations as

    x_l = x_0 + Σ_{i=0}^{l−1} F_i(x_i).    (1)

ResNet output variance grows exponentially with depth. Here we only consider the initial-
ization, view the input x0 as fixed, and consider the randomness of the weight initialization. We
analyze the variance of each layer xl , denoted by Var[xl ] (which is technically defined as the sum
of the variance of all the coordinates of xl .) For simplicity we assume the blocks are initialized to be
zero mean, i.e., E[F_l(x_l) | x_l] = 0. By x_{l+1} = x_l + F_l(x_l) and the law of total variance, we have
Var[x_{l+1}] = E[Var[F_l(x_l) | x_l]] + Var[x_l]. The ResNet structure prevents x_l from vanishing by forcing
the variance to grow with depth, i.e. Var[x_l] < Var[x_{l+1}] if E[Var[F_l(x_l) | x_l]] > 0. Yet, combined
with initialization methods such as He et al. (2015), the output variance of each residual branch
Var[F_l(x_l) | x_l] will be about the same as its input variance Var[x_l], and thus Var[x_{l+1}] ≈ 2Var[x_l].
This causes the output variance to explode exponentially with depth without normalization (Hanin &
Rolnick, 2018) for positively homogeneous blocks (see Definition 1). This is detrimental to learning
because it can in turn cause gradient explosion.
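This exponential variance growth is easy to reproduce numerically. The following Python sketch is our own minimal example (not from the paper): each residual branch is a two-layer linear-ReLU-linear map, with the layer before the ReLU using He initialization and the final layer scaled to roughly preserve variance, so each branch contributes about its input variance and Var[x_l] grows on the order of 2^l.

    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 256, 30
    x0 = rng.standard_normal(width)
    x = x0.copy()

    for l in range(1, depth + 1):
        # Two-layer residual branch: std sqrt(2/fan_in) before the ReLU (He init),
        # std sqrt(1/fan_in) for the last layer (no ReLU follows it inside the
        # branch), so the branch output variance roughly equals its input variance.
        W1 = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        W2 = rng.standard_normal((width, width)) * np.sqrt(1.0 / width)
        x = x + W2 @ np.maximum(W1 @ x, 0.0)
        if l % 10 == 0:
            # Var[x_l] / Var[x_0] should be on the order of 2**l.
            print(l, x.var() / x0.var(), 2.0 ** l)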
As we will show, at initialization, the gradient norm of certain activations and weight tensors is
lower bounded by the cross-entropy loss up to some constant. Intuitively, this implies that blowup
in the logits will cause gradient explosion. Our result applies to convolutional and linear weights
in a neural network with ReLU nonlinearity (e.g., feed-forward network, CNN), possibly with skip
connections (e.g., ResNet, DenseNet), but without any normalization.
Our analysis utilizes properties of positively homogeneous functions, which we now introduce.
Definition 1 (positively homogeneous function of first degree). A function f : R^m → R^n is called
positively homogeneous (of first degree) (p.h.) if for any input x ∈ R^m and α > 0, f(αx) = αf(x).
Definition 2 (positively homogeneous set of first degree). Let θ = {θ_i}_{i∈S} be the set of parameters
of f(x) and θ_ph = {θ_i}_{i∈S_ph} with S_ph ⊂ S. We call θ_ph a positively homogeneous set (of first degree) (p.h.
set) if for any α > 0, f(x; θ \ θ_ph, αθ_ph) = αf(x; θ \ θ_ph, θ_ph), where αθ_ph denotes {αθ_i}_{i∈S_ph}.

Intuitively, a p.h. set is a set of parameters θ_ph in a function f such that for any fixed input x and fixed
parameters θ \ θ_ph, f̄(θ_ph) ≜ f(x; θ \ θ_ph, θ_ph) is a p.h. function.
Examples of p.h. functions are ubiquitous in neural networks, including various kinds of linear op-
erations without bias (fully-connected (FC) and convolution layers, pooling, addition, concatenation
and dropout etc.) as well as ReLU nonlinearity. Moreover, we have the following claim:
Proposition 1. A function that is the composition of p.h. functions is itself p.h.
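Positive homogeneity is also easy to check numerically. The snippet below is our own minimal example: a small bias-free function built from linear maps, a ReLU, and a skip connection, i.e. a composition of p.h. functions as in Proposition 1.

    import numpy as np

    rng = np.random.default_rng(1)
    W1, W2, W3 = (rng.standard_normal((8, 8)) for _ in range(3))

    def f(x):
        # Bias-free network with a skip connection: every operation (matrix
        # multiply, ReLU, addition) is p.h., so the composition is p.h. too.
        h = x + W2 @ np.maximum(W1 @ x, 0.0)
        return W3 @ h

    x = rng.standard_normal(8)
    alpha = 3.7
    assert np.allclose(f(alpha * x), alpha * f(x))  # f(αx) = αf(x) for α > 0
    print("positive homogeneity holds for alpha =", alpha)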

We study classification problems with c classes and the cross-entropy loss. We use f to denote a
neural network function except for the softmax layer. The cross-entropy loss is defined as
ℓ(z, y) ≜ −y^T(z − logsumexp(z)), where y is the one-hot label vector, z ≜ f(x) ∈ R^c is the logits
vector whose i-th element is z_i, and logsumexp(z) ≜ log(Σ_{i∈[c]} exp(z_i)). Consider a minibatch
of training examples D_M = {(x^(m), y^(m))}_{m=1}^M and the average cross-entropy loss
ℓ_avg(D_M) ≜ (1/M) Σ_{m=1}^M ℓ(f(x^(m)), y^(m)), where we use the superscript (m) to index quantities
referring to the m-th example. ‖·‖ denotes any valid norm. We only make the following assumptions
about the network f:
1. f is a sequential composition of network blocks {f_i}_{i=1}^L, i.e. f(x_0) = f_L(f_{L−1}(... f_1(x_0))),
each of which is composed of p.h. functions.
2. Weight elements in the FC layer are i.i.d. sampled from a zero-mean symmetric distribution.
These assumptions hold at initialization if we remove all the normalization layers in a residual
network with ReLU nonlinearity, assuming all the biases are initialized at 0.
Our results are summarized in the following two theorems, whose proofs are listed in the appendix:
Theorem 1. Denote the input to the i-th block by x_{i−1}. With Assumption 1, we have

    ‖∂ℓ/∂x_{i−1}‖ ≥ (ℓ(z, y) − H(p)) / ‖x_{i−1}‖,    (2)

where p is the softmax probabilities and H denotes the Shannon entropy.

Since H(p) is upper bounded by log(c) and kxi−1 k is small in the lower blocks, blowup in the loss
will cause large gradient norm with respect to the lower block input. Our second theorem proves a
lower bound on the gradient norm of a p.h. set in a network.
Theorem 2. With Assumption 1, we have

    ‖∂ℓ_avg/∂θ_ph‖ ≥ (1/(M‖θ_ph‖)) Σ_{m=1}^M (ℓ(z^(m), y^(m)) − H(p^(m))) ≜ G(θ_ph).    (3)

Furthermore, with Assumptions 1 and 2, we have

    E[G(θ_ph)] ≥ (E[max_{i∈[c]} z_i] − log(c)) / ‖θ_ph‖.    (4)


It remains to identify such p.h. sets in a neural network. In Figure 2 we provide three examples
of p.h. sets in a ResNet without normalization. Theorem 2 suggests that these layers would suffer
from the exploding gradient problem, if the logits z blow up at initialization, which unfortunately
would occur in a ResNet without normalization if initialized in a traditional way. This motivates us
to introduce a new initialization in the next section.
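As an informal numerical sanity check of the bound in Equation (2) (our own sketch, not an experiment from the paper), the snippet below uses PyTorch autograd on a toy bias-free residual block followed by a linear classifier, and confirms that ‖∂ℓ/∂x‖·‖x‖ is at least ℓ(z, y) − H(p).

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    W1 = torch.randn(64, 64) * (2.0 / 64) ** 0.5
    W2 = torch.randn(64, 64) * (2.0 / 64) ** 0.5
    Wfc = torch.randn(10, 64) / 64 ** 0.5

    x = torch.randn(64, requires_grad=True)
    # Bias-free residual block followed by a linear classifier (p.h. in x).
    z = Wfc @ (x + W2 @ F.relu(W1 @ x))
    y = torch.zeros(10)
    y[3] = 1.0                                            # one-hot label
    loss = -(y * (z - torch.logsumexp(z, dim=0))).sum()   # cross-entropy

    p = torch.softmax(z, dim=0)
    entropy = -(p * torch.log(p)).sum()
    loss.backward()

    lhs = x.grad.norm() * x.detach().norm()               # ||dl/dx|| * ||x||
    rhs = loss.detach() - entropy.detach()                # l(z, y) - H(p)
    print(float(lhs), ">=", float(rhs), bool(lhs >= rhs))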


Figure 2: Examples of p.h. sets in a ResNet without normalization: (1) the first convolution layer
before max pooling; (2) the fully connected layer before softmax; (3) the union of a spatial down-
sampling layer in the backbone and a convolution layer in its corresponding residual branch.

3 FIXUP: UPDATE A RESIDUAL NETWORK Θ(η) PER SGD STEP

Our analysis in the previous section points out the failure mode of standard initializations for training
deep residual networks: the gradient norm of certain layers is in expectation lower bounded by a
quantity that increases indefinitely with the network depth. However, escaping this failure mode
does not necessarily lead us to successful training — after all, it is the whole network as a function
that we care about, rather than a layer or a network block. In this section, we propose a top-down
design of a new initialization that ensures proper update scale to the network function, by simply
rescaling a standard initialization. To start, we denote the learning rate by η and set our goal:

f (x; θ) is updated by Θ(η) per SGD step after initialization as η → 0.



That is, ‖∆f(x)‖ = Θ(η), where ∆f(x) ≜ f(x; θ − η ∂ℓ(f(x), y)/∂θ) − f(x; θ).

Put another way, our goal is to design an initialization such that SGD updates to the network function
are in the right scale and independent of the depth.
We define the Shortcut as the shortest path from input to output in a residual network. The Shortcut
is typically a shallow network with a few trainable layers.1 We assume the Shortcut is initialized
using a standard method, and focus on the initialization of the residual branches.

Residual branches update the network in sync. To start, we first make an important observa-
tion that the SGD update to each residual branch changes the network output in highly correlated
directions. This implies that if a residual network has L residual branches, then an SGD step to each
residual branch should change the network output by Θ(η/L) on average to achieve an overall Θ(η)
update. We defer the formal statement and its proof until Appendix B.1.

Study of a scalar branch. Next we study how to initialize a residual branch with m layers so
that its SGD update changes the network output by Θ(η/L). We assume m is a small positive
integer (e.g., 2 or 3). As we are only concerned about the scale of the update, it is sufficiently
instructive to study the scalar case, i.e., F(x) = (∏_{i=1}^m a_i) x where a_1, ..., a_m, x ∈ R^+. For
example, the standard initialization methods typically initialize each layer so that the output (after
nonlinear activation) preserves the input variance, which can be modeled as setting ∀i ∈ [m], ai = 1.
In turn, setting ai to a positive number other than 1 corresponds to rescaling the i-th layer by ai .
Through deriving the constraints for F (x) to make Θ(η/L) updates, we will also discover how to
rescale the weight layers of a standard initialization as desired. In particular, we show the SGD
1 For example, in the ResNet architecture (e.g., ResNet-50, ResNet-101 or ResNet-152) for ImageNet
classification, the Shortcut is always a 6-layer network with five convolution layers and one fully-connected
layer, irrespective of the total depth of the whole network.


update to F(x) is Θ(η/L) if and only if the initialization satisfies the following constraint:

    (∏_{i∈[m]\{j}} a_i) x = Θ(1/√L),  where j ∈ arg min_k a_k    (5)

We defer the derivation until Appendix B.2.


Equation (5) suggests new methods to initialize a residual branch through rescaling the standard
initialization of the i-th layer in a residual branch by its corresponding scalar a_i. For example, we
could set ∀i ∈ [m], a_i = L^{−1/(2m−2)}. Alternatively, we could start the residual branch as a zero
function by setting a_m = 0 and ∀i ∈ [m − 1], a_i = L^{−1/(2m−2)}. In the second option, the residual
branch does not need to “unlearn” its potentially bad random initial state, which can be beneficial
for learning. Therefore, we use the latter option in our experiments, unless otherwise specified.
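As a quick numerical check of this choice (our own, not from the paper): with the uniform rescaling a_i = L^{−1/(2m−2)}, the product of the m − 1 factors that remain after excluding the smallest one is L^{−(m−1)/(2m−2)} = L^{−1/2}, which is exactly the Θ(1/√L) scale required by Equation (5) when x = Θ(1); the zero-last-layer variant gives the same product once a_m = 0 is excluded.

    # Numeric check that a_i = L**(-1/(2m-2)) satisfies Equation (5): excluding
    # the smallest of m equal factors leaves a product equal to L**(-1/2).
    for L in (10, 100, 1000, 10000):
        for m in (2, 3):
            a = L ** (-1.0 / (2 * m - 2))
            prod_excluding_min = a ** (m - 1)
            print(f"L={L:>5}  m={m}  product={prod_excluding_min:.6f}  "
                  f"1/sqrt(L)={L ** -0.5:.6f}")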

The effects of biases and multipliers. With proper rescaling of the weights in all the residual
branches, a residual network is supposed to be updated by Θ(η) per SGD step — our goal is
achieved. However, in order to match the training performance of a corresponding network with
normalization, there are two more things to consider: biases and multipliers.
Using biases in the linear and convolution layers is a common practice. In normalization methods,
bias and scale parameters are typically used to restore the representation power after normalization.2
Intuitively, because the preferred input/output mean of a weight layer may be different from the
preferred output/input mean of an activation layer, it also helps to insert bias terms in a residual
network without normalization. Empirically, we find that inserting just one scalar bias before each
weight layer and nonlinear activation layer significantly improves the training performance.
Multipliers scale the output of a residual branch, similar to the scale parameters in batch normaliza-
tion. They have an interesting effect on the learning dynamics of weight layers in the same branch.
Specifically, as the stochastic gradient of a layer is typically almost orthogonal to its weight, learn-
ing rate decay tends to cause the weight norm equilibrium to shrink when combined with L2 weight
decay (van Laarhoven, 2017). In a branch with multipliers, this in turn causes the growth of the mul-
tipliers, increasing the effective learning rate of other layers. In particular, we observe that inserting
just one scalar multiplier per residual branch mimics the weight norm dynamics of a network with
normalization, and spares us the search of a new learning rate schedule.
Put together, we propose the following method to train residual networks without normalization:

Fixup initialization (or: How to train a deep residual network without normalization)
1. Initialize the classification layer and the last layer of each residual branch to 0.
2. Initialize every other layer using a standard method (e.g., He et al. (2015)), and scale only
   the weight layers inside residual branches by L^{−1/(2m−2)}.
3. Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at
0) before each convolution, linear, and element-wise activation layer.

It is important to note that Rule 2 of Fixup is the essential part as predicted by Equation (5). Indeed,
we observe that using Rule 2 alone is sufficient and necessary for training extremely deep residual
networks. On the other hand, Rule 1 and Rule 3 provide further improvements for training so as to
match the performance of a residual network with normalization layers, as explained in the text
above.3 Ablation experiments confirm our claims (see Appendix C.1).
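The three rules translate directly into a short initialization routine. The sketch below is our own illustration (it assumes the hypothetical FixupBasicBlock module sketched after Figure 1, two weight layers per branch so m = 2, and a final nn.Linear classifier); it applies He initialization to the first convolution of each branch scaled by L^{−1/(2m−2)}, zero-initializes the last convolution of each branch and the classification layer, and leaves the scalar biases at 0 and multipliers at 1.

    import torch
    import torch.nn as nn

    def fixup_init(model, num_branches, m=2):
        """Apply the three Fixup rules to a network built from FixupBasicBlock
        modules and a final nn.Linear classifier.  num_branches is L, the number
        of residual branches; m is the number of weight layers per branch."""
        scale = num_branches ** (-1.0 / (2 * m - 2))  # Rule 2 rescaling factor
        for module in model.modules():
            if isinstance(module, FixupBasicBlock):
                # Rule 2: standard He init for the first conv, rescaled.
                nn.init.kaiming_normal_(module.conv1.weight, nonlinearity='relu')
                with torch.no_grad():
                    module.conv1.weight.mul_(scale)
                # Rule 1: the last layer of each residual branch starts at zero.
                nn.init.zeros_(module.conv2.weight)
            elif isinstance(module, nn.Linear):
                # Rule 1: the classification layer starts at zero.
                nn.init.zeros_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
        # Rule 3 (scalar biases at 0, multipliers at 1) is already satisfied by
        # the parameter initial values inside FixupBasicBlock.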

2 For example, in batch normalization the gamma and beta parameters are used to affine-transform the
normalized activations per channel.
3 It is worth noting that the design of Fixup is a simplification of the common practice, in that we only
introduce O(K) parameters beyond convolution and linear weights (since we remove bias terms from
convolution and linear layers), whereas the common practice includes O(KC) (Ioffe & Szegedy, 2015;
Salimans & Kingma, 2016) or O(KCWH) (Ba et al., 2016) additional parameters, where K is the number of
layers, C is the max number of channels per layer, and W, H are the spatial dimensions of the largest feature
maps.


Our initialization and network design are consistent with recent theoretical work (Hardt & Ma, 2016;
Li et al., 2018) which, in much more simplified settings such as linearized residual nets and
quadratic neural nets, proposes that small initialization tends to stabilize optimization and help
generalization. However, our approach suggests that more delicate control of the scale of the
initialization is beneficial.4

4 EXPERIMENTS
4.1 TRAINING AT INCREASING DEPTH

One of the key advantages of BatchNorm is that it leads to fast training even for very deep models
(Ioffe & Szegedy, 2015). Here we will determine if we can match this desirable property by relying
only on proper initialization. We propose to evaluate how each method affects training very deep
nets by measuring the test accuracy after the first epoch as we increase depth. In particular, we
use the wide residual network (WRN) architecture with width 1 and the default weight decay 5e−4
(Zagoruyko & Komodakis, 2016). We specifically use the default learning rate of 0.1 because the
ability to use high learning rates is considered to be important to the success of BatchNorm. We
compare Fixup against three baseline methods: (1) rescale the output of each residual block by 1/√2
(Balduzzi et al., 2017), (2) post-process an orthogonal initialization such that the output variance of
each residual block is close to 1 (Layer-sequential unit-variance orthogonal initialization, or LSUV)
(Mishkin & Matas, 2015), (3) batch normalization (Ioffe & Szegedy, 2015). We use the default
batch size of 128 up to 1000 layers, with a batch size of 64 for 10,000 layers. We limit our budget
of epochs to 1 due to the computational strain of evaluating models with up to 10,000 layers.

[Figure 3 plot: first-epoch test accuracy (%) versus depth (10 to 10,000) for 1/√2-scaling, LSUV, BatchNorm, and Fixup.]

Figure 3: Depth of residual networks versus test accuracy at the first epoch for various methods on
CIFAR-10 with the default BatchNorm learning rate. We observe that Fixup is able to train very
deep networks with the same learning rate as batch normalization. (Higher is better.)
Figure 3 shows the test accuracy at the first epoch as depth increases. Observe that Fixup matches
the performance of BatchNorm at the first epoch, even with 10,000 layers. LSUV and 1/√2-scaling
are not able to train with the same learning rate as BatchNorm past 100 layers.

4.2 IMAGE CLASSIFICATION

In this section, we evaluate the ability of Fixup to replace batch normalization in image classification
applications. On the CIFAR-10 dataset, we first test on ResNet-110 (He et al., 2016) with default
hyper-parameters; results are shown in Table 1. Fixup obtains a 7% relative improvement in test error
compared with standard initialization; however, we note a substantial difference in the difficulty of
training. While the network with Fixup is trained with the same learning rate and converges as fast as
the network with batch normalization, we fail to train a Xavier-initialized ResNet-110 with 0.1x the
maximal learning rate.5 The test error gap in Table 1 is likely due to the regularization effect of BatchNorm
4 For example, a learning rate smaller than our choice would also stabilize the training, but lead to a lower
convergence rate.
5 Personal communication with the authors of (Shang et al., 2017) confirms our observation, and reveals that
the Xavier-initialized network needs more epochs to converge.


rather than difficulty in optimization; when we train Fixup networks with better regularization, the
test error gap disappears and we obtain state-of-the-art results on CIFAR-10 and SVHN without
normalization layers (see Appendix C.2).

Dataset     ResNet-110                            Normalization   Large η   Test Error (%)
CIFAR-10    w/ BatchNorm (He et al., 2016)        Yes             Yes       6.61
            w/ Xavier Init (Shang et al., 2017)   No              No        7.78
            w/ Fixup-init                         No              Yes       7.24

Table 1: Results on CIFAR-10 with ResNet-110 (mean/median of 5 runs; lower is better).

On the ImageNet dataset, we benchmark Fixup with the ResNet-50 and ResNet-101 architectures
(He et al., 2016), trained for 100 epochs and 200 epochs respectively. Similar to our finding on
the CIFAR-10 dataset, we observe that (1) training with Fixup is fast and stable with the default
hyperparameters, (2) Fixup alone significantly improves the test error of standard initialization, and
(3) there is a large test error gap between Fixup and BatchNorm. Further inspection reveals that
Fixup initialized models obtain significantly lower training error compared with BatchNorm models
(see Appendix C.3), i.e., Fixup suffers from overfitting. We therefore apply stronger regularization
to the Fixup models using Mixup (Zhang et al., 2017). We find it is beneficial to reduce the learning
rate of the scalar multiplier and bias by 10x when additional large regularization is used. Best
Mixup coefficients are found through cross-validation: they are 0.2, 0.1 and 0.7 for BatchNorm,
GroupNorm (Wu & He, 2018) and Fixup respectively. We present the results in Table 2, noting that
with better regularization, the performance of Fixup is on par with GroupNorm.

Model        Method                                    Normalization   Test Error (%)
ResNet-50    BatchNorm (Goyal et al., 2017)            Yes             23.6
             BatchNorm + Mixup (Zhang et al., 2017)    Yes             23.3
             GroupNorm + Mixup                         Yes             23.9
             Xavier Init (Shang et al., 2017)          No              31.5
             Fixup-init                                No              27.6
             Fixup-init + Mixup                        No              24.0
ResNet-101   BatchNorm (Zhang et al., 2017)            Yes             22.0
             BatchNorm + Mixup (Zhang et al., 2017)    Yes             20.8
             GroupNorm + Mixup                         Yes             21.4
             Fixup-init + Mixup                        No              21.4

Table 2: ImageNet test results using the ResNet architecture. (Lower is better.)
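Since mixup is the main additional regularizer used here, a minimal generic PyTorch version of the mixup objective (Zhang et al., 2017) is sketched below for reference. This is our own illustration rather than the authors' training code; the default α = 0.7 simply mirrors the cross-validated Fixup coefficient reported above.

    import numpy as np
    import torch
    import torch.nn.functional as F

    def mixup_loss(model, x, y, alpha=0.7):
        """One mixup training objective: interpolate inputs and targets with a
        coefficient drawn from Beta(alpha, alpha).  y holds integer class labels."""
        lam = float(np.random.beta(alpha, alpha))
        perm = torch.randperm(x.size(0), device=x.device)
        mixed_x = lam * x + (1.0 - lam) * x[perm]
        logits = model(mixed_x)
        return (lam * F.cross_entropy(logits, y)
                + (1.0 - lam) * F.cross_entropy(logits, y[perm]))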

4.3 MACHINE TRANSLATION

To demonstrate the generality of Fixup, we also apply it to replace layer normalization (Ba et al.,
2016) in Transformer (Vaswani et al., 2017), a state-of-the-art neural network for machine trans-
lation. Specifically, we use the fairseq library (Gehring et al., 2017) and follow the Fixup tem-
plate in Section 3 to modify the baseline model. We evaluate on two standard machine translation
datasets, IWSLT German-English (de-en) and WMT English-German (en-de) following the setup
of Ott et al. (2018). For the IWSLT de-en dataset, we cross-validate the dropout probability from
{0.3, 0.4, 0.5, 0.6} and find 0.5 to be optimal for both Fixup and the LayerNorm baseline. For the
WMT’16 en-de dataset, we use dropout probability 0.4. All models are trained for 200k updates.
It was reported (Chen et al., 2018) that “Layer normalization is most critical to stabilize the training
process... removing layer normalization results in unstable training runs”. However we find training
with Fixup to be very stable and as fast as the baseline model. Results are shown in Table 3.
Surprisingly, we find the models do not suffer from overfitting when LayerNorm is replaced by
Fixup, thanks to the strong regularization effect of dropout. Instead, Fixup matches or surpasses
the state-of-the-art results using the Transformer model on both datasets.


Dataset        Model                          Normalization   BLEU
IWSLT DE-EN    (Deng et al., 2018)            Yes             33.1
               LayerNorm                      Yes             34.2
               Fixup-init                     No              34.5
WMT EN-DE      (Vaswani et al., 2017)         Yes             28.4
               LayerNorm (Ott et al., 2018)   Yes             29.3
               Fixup-init                     No              29.3

Table 3: Comparing Fixup vs. LayerNorm for machine translation tasks. (Higher is better.)

5 RELATED WORK
Normalization methods. Normalization methods have enabled training very deep residual net-
works, and are currently an essential building block of the most successful deep learning architec-
tures. All normalization methods for training neural networks explicitly normalize (i.e. standardize)
some component (activations or weights) by dividing the activations or weights by some real number
computed from their statistics and/or subtracting from the activations some statistic (typically
the mean).6 In contrast, Fixup does not compute statistics (mean, variance or
norm) at initialization or during any phase of training, hence is not a normalization method.

Theoretical analysis of deep networks. Training very deep neural networks is an important the-
oretical problem. Early works study the propagation of variance in the forward and backward pass
for different activation functions (Glorot & Bengio, 2010; He et al., 2015).
Recently, the study of dynamical isometry (Saxe et al., 2013) provides a more detailed characteriza-
tion of the forward and backward signal propagation at initialization (Pennington et al., 2017; Hanin,
2018), enabling training 10,000-layer CNNs from scratch (Xiao et al., 2018). For residual networks,
activation scale (Hanin & Rolnick, 2018), gradient variance (Balduzzi et al., 2017) and dynamical
isometry property (Yang & Schoenholz, 2017) have been studied. Our analysis in Section 2 leads
to a similar conclusion as previous work, namely that the standard initialization for residual networks is
problematic. However, our use of positive homogeneity for lower bounding the gradient norm of a
neural network is novel, and applies to a broad class of neural network architectures (e.g., ResNet,
DenseNet) and initialization methods (e.g., Xavier, LSUV) with simple assumptions and proof.
Hardt & Ma (2016) analyze the optimization landscape (loss surface) of linearized residual nets in
the neighborhood around the zero initialization where all the critical points are proved to be global
minima. Yang & Schoenholz (2017) study the effect of the initialization of residual nets on the test
performance and point out that the Xavier or He initialization scheme is not optimal. In this paper, we
give a concrete recipe for the initialization scheme with which we can train deep residual networks
without batch normalization successfully.

Understanding batch normalization. Despite its popularity in practice, batch normalization has
not been well understood. Ioffe & Szegedy (2015) attributed its success to “reducing internal covari-
ate shift”, whereas Santurkar et al. (2018) argued that its effect may be “smoothing loss surface”.
Our analysis in Section 2 corroborates the latter idea of Santurkar et al. (2018) by showing that
standard initialization leads to very steep loss surface at initialization. Moreover, we empirically
showed in Section 3 that steep loss surface may be alleviated for residual networks by using smaller
initialization than the standard ones such as Xavier or He’s initialization in residual branches. van
Laarhoven (2017); Hoffer et al. (2018) studied the effect of (batch) normalization and weight decay
on the effective learning rate. Their results inspire us to include a multiplier in each residual branch.

ResNet initialization in practice. Gehring et al. (2017); Balduzzi et al. (2017) proposed to address
the initialization problem of residual nets by using the recurrence x_l = √(1/2) (x_{l−1} + F_l(x_{l−1})).
Mishkin & Matas (2015) proposed a data-dependent initialization to mimic the effect of batch nor-
malization in the first forward pass. While both methods limit the scale of activation and gradient,
they would fail to train stably at the maximal learning rate for very deep residual networks, since
6 For reference, we include a brief history of normalization methods in Appendix D.


they fail to consider the accumulation of highly correlated updates contributed by different residual
branches to the network function (Appendix B.1). Srivastava et al. (2015); Hardt & Ma (2016);
Goyal et al. (2017); Kingma & Dhariwal (2018) found that initializing the residual branches at (or
close to) zero helped optimization. Our results support their observation in general, but Equation (5)
suggests additional subtleties when choosing a good initialization scheme.

6 CONCLUSION
In this work, we study how to train a deep residual network reliably without normalization. Our
theory in Section 2 suggests that the exploding gradient problem at initialization in a positively
homogeneous network such as ResNet is directly linked to the blowup of logits. In Section 3 we
develop Fixup initialization to ensure the whole network as well as each residual branch gets up-
dates of proper scale, based on a top-down analysis. Extensive experiments on real world datasets
demonstrate that Fixup matches normalization techniques in training deep residual networks, and
achieves state-of-the-art test performance with proper regularization.
Our work opens up new possibilities for both theory and applications. Can we analyze the training
dynamics of Fixup, which may potentially be simpler than analyzing models with batch normalization?
Could we apply or extend the initialization scheme to other applications of deep learning?
It would also be very interesting to understand the regularization benefits of various normalization
methods, and to develop better regularizers to further improve the test performance of Fixup.

ACKNOWLEDGMENTS
The authors would like to thank Yuxin Wu, Kaiming He, Aleksander Madry and the anonymous
reviewers for their helpful feedback.

REFERENCES
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams.
The shattered gradients problem: If resnets are the answer, then what is the question? arXiv
preprint arXiv:1702.08591, 2017.
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster,
Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. The best of both worlds: Combin-
ing recent advances in neural machine translation. arXiv preprint arXiv:1804.09849, 2018.
Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander M Rush. Latent alignment
and variational attention. Thirty-second Conference on Neural Information Processing Systems
(NIPS), 2018.
Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks
with cutout. arXiv preprint arXiv:1708.04552, 2017.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional
Sequence to Sequence Learning. In Proc. of ICML, 2017.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the thirteenth international conference on artificial intelligence and
statistics, pp. 249–256, 2010.
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, An-
drew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: training imagenet
in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Benjamin Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? arXiv
preprint arXiv:1801.03744, 2018.


Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture.
arXiv preprint arXiv:1803.01719, 2018.
Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231,
2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In Proceedings of the IEEE international
conference on computer vision, pp. 1026–1034, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
770–778, 2016.
David J Heeger. Normalization of cell responses in cat striate cortex. Visual neuroscience, 9(2):
181–197, 1992.
Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate
normalization schemes in deep networks. arXiv preprint arXiv:1803.01814, 2018.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions.
arXiv preprint arXiv:1807.03039, 2018.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo-
lutional neural networks. In Advances in neural information processing systems, pp. 1097–1105,
2012.
Chen-Yu Lee, Patrick W Gallagher, and Zhuowen Tu. Generalizing pooling functions in convo-
lutional neural networks: Mixed, gated, and tree. In Artificial Intelligence and Statistics, pp.
464–472, 2016.
Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized
matrix recovery. Conference on Learning Theory (COLT), 2018.
Siwei Lyu and Eero P Simoncelli. Nonlinear image representation using divisive normalization.
In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8.
IEEE, 2008.
Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422,
2015.
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation.
arXiv preprint arXiv:1806.00187, 2018.
Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep
learning through dynamical isometry: theory and practice. In Advances in neural information
processing systems, pp. 4785–4795, 2017.
Nicolas Pinto, David D Cox, and James J DiCarlo. Why is real-world visual object recognition
hard? PLoS computational biology, 4(1):e27, 2008.
Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accel-
erate training of deep neural networks. In Advances in Neural Information Processing Systems,
pp. 901–909, 2016.
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch
normalization help optimization? (No, it is not about internal covariate shift). arXiv preprint
arXiv:1805.11604, 2018.
Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynam-
ics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.


Wenling Shang, Justin Chiu, and Kihyuk Sohn. Exploring normalization in deep residual networks
with concatenated rectified linear units. In AAAI, pp. 1509–1516, 2017.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint
arXiv:1505.00387, 2015.

Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing
ingredient for fast stylization. CoRR, abs/1607.08022, 2016.

Twan van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint
arXiv:1706.05350, 2017.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Infor-
mation Processing Systems, pp. 5998–6008, 2017.

Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision
(ECCV), September 2018.

Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz, and Jeffrey Penning-
ton. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla
convolutional neural networks. arXiv preprint arXiv:1806.05393, 2018.

Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Shakedrop regularization. arXiv preprint
arXiv:1802.02375, 2018.

Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances
in neural information processing systems, pp. 7103–7114, 2017.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint
arXiv:1605.07146, 2016.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical
risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation
using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE Interna-
tional Conference on, 2017.

A PROOFS FOR SECTION 2

A.1 GRADIENT NORM LOWER BOUND FOR THE INPUT TO A NETWORK BLOCK

Proof of Theorem 1. We use f_{i→j} to denote the composition f_j ∘ f_{j−1} ∘ ··· ∘ f_i, so that z =
f_{i→L}(x_{i−1}) for all i ∈ [L]. Note that z is p.h. with respect to the input of each network block,
i.e. f_{i→L}((1 + ε)x_{i−1}) = (1 + ε)f_{i→L}(x_{i−1}) for ε > −1. This allows us to compute the gradient
of the cross-entropy loss with respect to the scaling factor ε at ε = 0 as

    ∂/∂ε |_{ε=0} ℓ(f_{i→L}((1 + ε)x_{i−1}), y) = (∂ℓ/∂z)(∂f_{i→L}/∂ε) = −y^T z + p^T z = ℓ(z, y) − H(p)    (6)

Since the gradient ℓ2 norm ‖∂ℓ/∂x_{i−1}‖ must be greater than the directional derivative
∂/∂t ℓ(f_{i→L}(x_{i−1} + t x_{i−1}/‖x_{i−1}‖), y), defining ε = t/‖x_{i−1}‖ we have

    ‖∂ℓ/∂x_{i−1}‖ ≥ |∂/∂ε ℓ(f_{i→L}(x_{i−1} + ε x_{i−1}), y)| · |∂ε/∂t| = (ℓ(z, y) − H(p)) / ‖x_{i−1}‖.    (7)


A.2 GRADIENT NORM LOWER BOUND FOR POSITIVELY HOMOGENEOUS SETS

Proof of Theorem 2. The proof idea is similar. Recall that if θ_ph is a p.h. set, then f̄^(m)(θ_ph) ≜
f(x^(m); θ \ θ_ph, θ_ph) is a p.h. function. We therefore have

    ∂/∂ε |_{ε=0} ℓ_avg(D_M; (1 + ε)θ_ph) = (1/M) Σ_{m=1}^M (∂ℓ/∂z^(m))(∂f̄^(m)/∂ε) = (1/M) Σ_{m=1}^M (ℓ(z^(m), y^(m)) − H(p^(m)))    (8)

hence we again invoke the directional derivative argument to show

    ‖∂ℓ_avg/∂θ_ph‖ ≥ (1/(M‖θ_ph‖)) Σ_{m=1}^M (ℓ(z^(m), y^(m)) − H(p^(m))) ≜ G(θ_ph).    (9)

In order to estimate the scale of this lower bound, recall that the FC layer weights are i.i.d. sampled
from a symmetric, mean-zero distribution, therefore z has a symmetric probability density function
with mean 0. We hence have

    E[ℓ(z, y)] = E[−y^T(z − logsumexp(z))] ≥ E[y^T(max_{i∈[c]} z_i − z)] = E[max_{i∈[c]} z_i]    (10)

where the inequality uses the fact that logsumexp(z) ≥ max_{i∈[c]} z_i; the last equality is due to y
and z being independent at initialization and E[z] = 0. Using the trivial bound E[H(p)] ≤ log(c), we get

    E[G(θ_ph)] ≥ (E[max_{i∈[c]} z_i] − log(c)) / ‖θ_ph‖    (11)

which shows that the gradient norm of a p.h. set is of the order Ω(E[max_{i∈[c]} z_i]) at initialization.

B PROOFS FOR SECTION 3

B.1 RESIDUAL BRANCHES UPDATE THE NETWORK IN SYNC

A common theme in previous analysis of residual networks is the scale of activation and gradient
(Balduzzi et al., 2017; Yang & Schoenholz, 2017; Hanin & Rolnick, 2018). However, it is more
important to consider the scale of actual change to the network function made by a (stochastic)
gradient descent step. If the updates to different layers cancel out each other, the network would
be stable as a whole despite drastic changes in different layers; if, on the other hand, the updates
to different layers align with each other, the whole network may incur a drastic change in one step,
even if each layer only changes a tiny amount. We now provide analysis showing that the latter
scenario more accurately describes what happens in reality at initialization.
For our result in this section, we make the following assumptions:
• f is a sequential composition of network blocks {f_i}_{i=1}^L, i.e. f(x_0) = f_L(f_{L−1}(... f_1(x_0))),
consisting of fully-connected weight layers, ReLU activation functions and residual branches.
• fL is a fully-connected layer with weights i.i.d. sampled from a zero-mean distribution.
• There is no bias parameter in f .
For l < L, let x_{l−1} be the input to f_l and F_l(x_{l−1}) be a branch in f_l with m_l layers. Without loss of
generality, we study the following specific form of network architecture:

    F_l(x_{l−1}) = (ReLU ∘ W_l^{(m_l)} ∘ ··· ∘ ReLU ∘ W_l^{(1)})(x_{l−1}),    (m_l ReLU layers in total)
    f_l(x_{l−1}) = x_{l−1} + F_l(x_{l−1}).

For the last block we denote m_L = 1 and f_L(x_{L−1}) = F_L(x_{L−1}) = W_L^{(1)} x_{L−1}.
Furthermore, we always choose 0 as the gradient of ReLU when its input is 0. As such, with input x,
the output and gradient of ReLU(x) can be simply written as D_{1[x>0]} x, where D_{1[x>0]} is a diagonal
matrix with diagonal entries corresponding to 1[x > 0]. Denote the preactivation of the i-th layer
(i.e. the input to the i-th ReLU) in the l-th block by x_l^{(i)}. We define the following terms to simplify
our presentation:

    F_l^{(i−)} ≜ D_{1[x_l^{(i−1)} > 0]} W_l^{(i−1)} ··· D_{1[x_l^{(1)} > 0]} W_l^{(1)} x_{l−1},    l < L, i ∈ [m_l]
    F_l^{(i+)} ≜ D_{1[x_l^{(m_l)} > 0]} W_l^{(m_l)} ··· D_{1[x_l^{(i)} > 0]},    l < L, i ∈ [m_l]
    F_L^{(1−)} ≜ x_{L−1}
    F_L^{(1+)} ≜ I

We have the following result on the gradient update to f :


Theorem 3. With the above assumptions, suppose we update the network parameters by ∆θ =
−η ∂ℓ(f(x_0; θ), y)/∂θ. Then the update to the network output, ∆f(x_0) ≜ f(x_0; θ + ∆θ) − f(x_0; θ), is

    ∆f(x_0) = −η Σ_{l=1}^L Σ_{i=1}^{m_l} J_{li} (∂ℓ/∂z) + O(η²),    where J_{li} ≜ ‖F_l^{(i−)}‖² (∂f/∂x_l)^T F_l^{(i+)} (F_l^{(i+)})^T (∂f/∂x_l)    (12)

and z ≜ f(x_0) ∈ R^c is the logits.


Let us discuss the implication of this result before delving into the proof. As each J_{li} is a c × c real
symmetric positive semi-definite matrix, the trace norm of each J_{li} equals its trace. Similarly, the
trace norm of J ≜ Σ_l Σ_i J_{li} equals the trace of the sum of all the J_{li} as well, which scales linearly
with the number of residual branches L. Since the output z has no (or little) correlation with the target
y at the start of training, ∂ℓ/∂z is a vector in some random direction. It then follows that the expected
update scale is proportional to the trace norm of J, which is proportional to L as well as the average
trace of J_{li}. Simply put, to allow the whole network to be updated by Θ(η) per step independent of
depth, we need to ensure each residual branch contributes only a Θ(η/L) update on average.

Proof. The first insight to prove our result is to note that, conditioning on a specific input x_0, we
can replace each ReLU activation layer by a diagonal matrix without changing the forward and
backward pass. (In fact, this is valid even after we apply a gradient descent update, as long as the
learning rate η > 0 is sufficiently small so that all positive preactivations remain positive. This
observation will be essential for our later analysis.) We thus have that the gradient w.r.t. the i-th
weight layer in the l-th block is

    ∂ℓ/∂Vec(W_l^{(i)}) = (∂x_l/∂Vec(W_l^{(i)})) (∂f/∂x_l) (∂ℓ/∂z) = (F_l^{(i−)} ⊗ I) (F_l^{(i+)})^T (∂f/∂x_l) (∂ℓ/∂z),    (13)

where ⊗ denotes the Kronecker product. The second insight is to note that with our assumptions, a
network block and its gradient w.r.t. its input have the following relation:

    f_l(x_{l−1}) = (∂f_l/∂x_{l−1}) · x_{l−1}.    (14)

We then plug Equation (13) into the gradient update ∆θ = −η ∂ℓ(f(x_0; θ), y)/∂θ, and recalculate the
forward pass f(x_0; θ + ∆θ). The theorem follows by applying Equation (14) and a first-order Taylor
series expansion in a small neighborhood of η = 0 where f(x_0; θ + ∆θ) is smooth w.r.t. η.

B.2 WHAT SCALAR BRANCH HAS Θ(η/L) UPDATES?

For this section, we focus on the proper initialization of a scalar branch F(x) = (∏_{i=1}^m a_i) x. We
have the following result:

Theorem 4. Assuming ∀i, a_i ≥ 0, x = Θ(1) and ∂ℓ/∂F(x) = Θ(1), then ∆F(x) ≜ F(x; θ − η ∂ℓ/∂θ) −
F(x; θ) is Θ(η/L) if and only if

    (∏_{k∈[m]\{j}} a_k) x = Θ(1/√L),  where j ∈ arg min_k a_k    (15)


Proof. We start by calculating the gradient of each parameter:

    ∂ℓ/∂a_i = (∂ℓ/∂F) (∏_{k∈[m]\{i}} a_k) x    (16)

and a first-order approximation of ∆F(x):

    ∆F(x) = −η (∂ℓ/∂F(x)) (F(x))² Σ_{i=1}^m (1/a_i²)    (17)

where we conveniently abuse some notation by defining

    (1/a_i) F(x) ≜ (∏_{k∈[m]\{i}} a_k) x,    if a_i = 0.    (18)

Denoting Σ_{i=1}^m 1/a_i² as M and min_k a_k as A, we have

    (F(x))² · (1/A²) ≤ (F(x))² M ≤ (F(x))² · (m/A²)    (19)

and therefore, by rearranging Equation (17) and letting ∆F(x) = Θ(η/L), we get

    (F(x))² · (1/A²) = Θ(∆F(x) / (η ∂ℓ/∂F(x))) = Θ(1/L)    (20)

i.e. F(x)/A = Θ(1/√L). Hence the “only if” part is proved. For the “if” part, we apply Equation (19)
to Equation (17) and observe that by Equation (15)

    ∆F(x) = Θ(η (F(x))² · (1/A²)) = Θ(η/L)    (21)

The result of this theorem provides useful guidance on how to rescale the standard initialization to
achieve the desired update scale for the network function.
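The scaling in Theorem 4 can also be illustrated numerically. In the toy example below (our own sketch, m = 2, treating ∂ℓ/∂F as the constant 1), one SGD step on the factors of the scalar branch F(x) = a_1 a_2 x changes F by roughly η under the standard choice a_1 = a_2 = 1, but only by roughly η/L under the rescaling of Equation (15) with a zero last factor.

    import numpy as np

    def one_step_update(a, x=1.0, eta=0.1, grad_out=1.0):
        """Return |Delta F(x)| after one SGD step on the scalar branch
        F(x) = (prod_i a_i) * x, with d(loss)/dF = grad_out."""
        a = np.asarray(a, dtype=float)
        f_before = np.prod(a) * x
        # Gradient of the loss w.r.t. a_i is grad_out * x * prod_{k != i} a_k.
        grads = np.array([grad_out * x * np.prod(np.delete(a, i))
                          for i in range(len(a))])
        a_new = a - eta * grads
        return abs(np.prod(a_new) * x - f_before)

    L, m, eta = 1000, 2, 0.1
    standard = [1.0, 1.0]                          # standard init: a_i = 1
    rescaled = [L ** (-1.0 / (2 * m - 2)), 0.0]    # zero last layer, Eq. (15)
    print("standard :", one_step_update(standard, eta=eta), " vs eta   =", eta)
    print("rescaled :", one_step_update(rescaled, eta=eta), " vs eta/L =", eta / L)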

C ADDITIONAL EXPERIMENTS
C.1 ABLATION STUDIES OF FIXUP

In this section we present the training curves of different architecture designs and initialization
schemes. Specifically, we compare the training accuracy of batch normalization, Fixup, as well as
a few ablated options: (1) removing the bias parameters in the network; (2) using 0.1x the suggested
initialization scale and no bias parameters; (3) using 10x the suggested initialization scale and no bias
parameters; and (4) removing all the residual branches. The results are shown in Figure 4. We see that
initializing the residual branch layers at a smaller scale (or all zero) slows down learning, whereas
training fails when initializing them at a larger scale; we also see the clear benefit of adding bias
parameters in the network.

C.2 CIFAR AND SVHN WITH BETTER REGULARIZATION

We perform additional experiments to validate our hypothesis that the gap in test error between
Fixup and batch normalization is primarily due to overfitting. To combat overfitting, we use Mixup
(Zhang et al., 2017) and Cutout (DeVries & Taylor, 2017) with default hyperparameters as addi-
tional regularization. On the CIFAR-10 dataset, we perform experiments with WideResNet-40-10
and on SVHN we use WideResNet-16-12 (Zagoruyko & Komodakis, 2016), all with the default
hyperparameters. We observe in Table 4 that models trained with Fixup and strong regularization
are competitive with state-of-the-art methods on CIFAR-10 and SVHN, as well as our baseline with
batch normalization.


[Figure 4 plot: minibatch training accuracy (%) versus batch index (0 to 1200) for BatchNorm, Fixup, L^{−1/(2m−2)} scaling with no bias, 0.1x and 10x that scale with no bias, and no residual branches.]
Figure 4: Minibatch training accuracy of ResNet-110 on CIFAR-10 dataset with different config-
urations in the first 3 epochs. We use minibatch size of 128 and smooth the curves using 10-step
moving average.

Dataset     Model                            Normalization   Test Error (%)
CIFAR-10    (Zagoruyko & Komodakis, 2016)    Yes             3.8
            (Yamada et al., 2018)            Yes             2.3
            BatchNorm + Mixup + Cutout       Yes             2.5
            (Graham, 2014)                   No              3.5
            Fixup-init + Mixup + Cutout      No              2.3
SVHN        (Zagoruyko & Komodakis, 2016)    Yes             1.5
            (DeVries & Taylor, 2017)         Yes             1.3
            BatchNorm + Mixup + Cutout       Yes             1.4
            (Lee et al., 2016)               No              1.7
            Fixup-init + Mixup + Cutout      No              1.4

Table 4: Additional results on CIFAR-10, SVHN datasets.


C.3 TRAINING AND TEST CURVES ON IMAGENET

Figure 5 shows that without additional regularization Fixup fits the training set very well, but overfits
significantly. We see in Figure 6 that Fixup is competitive with networks trained with normalization
when the Mixup regularizer is used.

[Figure 5 plots: train error (%) and test error (%) versus epochs (0 to 100) for BatchNorm, GroupNorm, and Fixup.]

Figure 5: Training and test errors on ImageNet using ResNet-50 without additional regularization.
We observe that Fixup is able to better fit the training data, which leads to overfitting; more
regularization is needed. Results of BatchNorm and GroupNorm reproduced from (Wu & He, 2018).

[Figure 6 plot: test error (%) versus epochs (0 to 100) for BatchNorm + Mixup, GroupNorm + Mixup, and Fixup + Mixup.]

Figure 6: Test error of ResNet-50 on ImageNet with Mixup (Zhang et al., 2017). Fixup closely
matches the final results yielded by the use of GroupNorm, without any normalization.

D ADDITIONAL REFERENCES: A BRIEF HISTORY OF NORMALIZATION METHODS

The first use of normalization in neural networks appears in the modeling of biological visual system
and dates back at least to Heeger (1992) in neuroscience and to Pinto et al. (2008); Lyu & Simon-
celli (2008) in computer vision, where each neuron output is divided by the sum (or norm) of all of
the outputs, a module called divisive normalization. Recent popular normalization methods, such
as local response normalization (Krizhevsky et al., 2012), batch normalization (Ioffe & Szegedy,
2015) and layer normalization (Ba et al., 2016) mostly follow this tradition of dividing the neuron
activations by their certain summary statistics, often also with the activation mean subtracted. An
exception is weight normalization (Salimans & Kingma, 2016), which instead divides the weight
parameters by their statistics, specifically the weight norm; weight normalization also adopts the
idea of activation normalization for weight initialization. The recently proposed actnorm (Kingma
& Dhariwal, 2018) removes the normalization of weight parameters, but still uses activation
normalization to initialize the affine transformation layers.
