
Published as a conference paper at ICLR 2019

FIXUP INITIALIZATION:
RESIDUAL LEARNING WITHOUT NORMALIZATION
Hongyi Zhang∗ Yann N. Dauphin† Tengyu Ma‡
MIT Google Brain Stanford University
hongyiz@mit.edu yann@dauphin.io tengyuma@stanford.edu

arXiv:1901.09321v2 [cs.LG] 12 Mar 2019

ABSTRACT

Normalization layers are a staple in state-of-the-art deep neural network architectures.
They are widely believed to stabilize training, enable higher learning
rate, accelerate convergence and improve generalization, though the reason for
their effectiveness is still an active research topic. In this work, we challenge the
commonly-held beliefs by showing that none of the perceived benefits is unique
to normalization. Specifically, we propose fixed-update initialization (Fixup), an
initialization motivated by solving the exploding and vanishing gradient problem
at the beginning of training via properly rescaling a standard initialization. We
find training residual networks with Fixup to be as stable as training with nor-
malization — even for networks with 10,000 layers. Furthermore, with proper
regularization, Fixup enables residual networks without normalization to achieve
state-of-the-art performance in image classification and machine translation.

1 INTRODUCTION

Artificial intelligence applications have witnessed major advances in recent years. At the core of
this revolution is the development of novel neural network models and their training techniques. For
example, since the landmark work of He et al. (2016), most of the state-of-the-art image recognition
systems are built upon a deep stack of network blocks consisting of convolutional layers and additive
skip connections, with some normalization mechanism (e.g., batch normalization (Ioffe & Szegedy,
2015)) to facilitate training and generalization. Besides image classification, various normalization
techniques (Ulyanov et al., 2016; Ba et al., 2016; Salimans & Kingma, 2016; Wu & He, 2018) have
been found essential to achieving good performance on other tasks, such as machine translation
(Vaswani et al., 2017) and generative modeling (Zhu et al., 2017). They are widely believed to have
multiple benefits for training very deep neural networks, including stabilizing learning, enabling
higher learning rate, accelerating convergence, and improving generalization.
Despite the enormous empirical success of training deep networks with normalization, and recent
progress on understanding the working of batch normalization (Santurkar et al., 2018), there is
currently no general consensus on why these normalization techniques help training residual neural
networks. Intrigued by this topic, in this work we study
(i) without normalization, can a deep residual network be trained reliably? (And if so,)
(ii) without normalization, can a deep residual network be trained with the same learning rate,
converge at the same speed, and generalize equally well (or even better)?
Perhaps surprisingly, we find the answers to both questions are Yes. In particular, we show:

• Why normalization helps training. We derive a lower bound for the gradient norm of a residual
network at initialization, which explains why with standard initializations, normalization tech-
niques are essential for training deep residual networks at maximal learning rate. (Section 2)


∗ Work done at Facebook. Equal contribution.
† Work done at Facebook. Equal contribution.
‡ Work done at Facebook.


• Training without normalization. We propose Fixup, a method that rescales the standard initial-
ization of residual branches by adjusting for the network architecture. Fixup enables training very
deep residual networks stably at maximal learning rate without normalization. (Section 3)
• Image classification. We apply Fixup to replace batch normalization on image classification
benchmarks CIFAR-10 (with Wide-ResNet) and ImageNet (with ResNet), and find Fixup with
proper regularization matches the well-tuned baseline trained with normalization. (Section 4.2)
• Machine translation. We apply Fixup to replace layer normalization on machine translation
benchmarks IWSLT and WMT using the Transformer model, and find it outperforms the baseline
and achieves new state-of-the-art results on the same architecture. (Section 4.3)

[Figure 1 block diagrams. Legend: multipliers initialized at 1, biases initialized at 0, marked 3x3 convolutions scaled down by √L. Panels, left to right: (He et al., 2016), Fixup w/o bias, Fixup.]

Figure 1: Left: ResNet basic block. Batch normalization (Ioffe & Szegedy, 2015) layers are marked
in red. Middle: A simple network block that trains stably when stacked together. Right: Fixup
further improves by adding bias parameters. (See Section 3 for details.)
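For concreteness, the following Python (PyTorch) sketch shows one possible layout of the right-most block in Figure 1: a scalar bias before each convolution and each activation, a single scalar multiplier on the residual branch, and no normalization layers. This is our own illustrative module; the class and attribute names (FixupBasicBlock, bias1a, etc.) are hypothetical and not taken from the authors' code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FixupBasicBlock(nn.Module):
        """Sketch of a Fixup basic block (Figure 1, right): no normalization,
        scalar biases before each conv/activation, one scalar multiplier."""

        def __init__(self, channels):
            super().__init__()
            # Scalar biases (initialized at 0), as prescribed by Fixup Rule 3.
            self.bias1a = nn.Parameter(torch.zeros(1))
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bias1b = nn.Parameter(torch.zeros(1))
            self.bias2a = nn.Parameter(torch.zeros(1))
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            # One scalar multiplier per branch (initialized at 1), plus a final bias.
            self.scale = nn.Parameter(torch.ones(1))
            self.bias2b = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            out = self.conv1(x + self.bias1a)
            out = F.relu(out + self.bias1b)
            out = self.conv2(out + self.bias2a)
            out = out * self.scale + self.bias2b
            return F.relu(out + x)  # residual addition followed by ReLU

    # Example usage: a 64-channel block applied to a dummy feature map.
    block = FixupBasicBlock(64)
    out = block(torch.randn(2, 64, 8, 8))  # shape preserved: (2, 64, 8, 8)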
In the remainder of this paper, we first analyze the exploding gradient problem of residual networks
at initialization in Section 2. To solve this problem, we develop Fixup in Section 3. In Section 4 we
quantify the properties of Fixup and compare it against state-of-the-art normalization methods on
real world benchmarks. A comparison with related work is presented in Section 5.

2 PROBLEM: RESNET WITH STANDARD INITIALIZATIONS LEAD TO EXPLODING GRADIENTS
Standard initialization methods (Glorot & Bengio, 2010; He et al., 2015; Xiao et al., 2018) attempt
to set the initial parameters of the network such that the activations neither vanish nor explode.
Unfortunately, it has been observed that without normalization techniques such as BatchNorm they
do not account properly for the effect of residual connections and this causes exploding gradients.
Balduzzi et al. (2017) characterizes this problem for ReLU networks, and we will generalize this to
residual networks with positively homogeneous activation functions. A plain (i.e. without normalization
layers) ResNet with residual blocks {F_1, ..., F_L} and input x_0 computes the activations as

    x_l = x_0 + Σ_{i=0}^{l−1} F_i(x_i).    (1)

ResNet output variance grows exponentially with depth. Here we only consider the initial-
ization, view the input x0 as fixed, and consider the randomness of the weight initialization. We
analyze the variance of each layer xl , denoted by Var[xl ] (which is technically defined as the sum
of the variance of all the coordinates of xl .) For simplicity we assume the blocks are initialized to be
zero mean, i.e., E[F_l(x_l) | x_l] = 0. By x_{l+1} = x_l + F_l(x_l) and the law of total variance, we have
Var[x_{l+1}] = E[Var[F_l(x_l) | x_l]] + Var[x_l]. The ResNet structure prevents x_l from vanishing by forcing
the variance to grow with depth, i.e. Var[x_l] < Var[x_{l+1}] if E[Var[F_l(x_l) | x_l]] > 0. Yet, combined
with initialization methods such as He et al. (2015), the output variance of each residual branch
Var[F_l(x_l) | x_l] will be about the same as its input variance Var[x_l], and thus Var[x_{l+1}] ≈ 2Var[x_l].
This causes the output variance to explode exponentially with depth without normalization (Hanin &
Rolnick, 2018) for positively homogeneous blocks (see Definition 1). This is detrimental to learning
because it can in turn cause gradient explosion.
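This exponential variance growth is easy to reproduce numerically. The following Python sketch is our own minimal example (not from the paper): each residual branch is a two-layer linear-ReLU-linear map, with the layer before the ReLU using He initialization and the final layer scaled to roughly preserve variance, so each branch contributes about its input variance and Var[x_l] grows on the order of 2^l.

    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 256, 30
    x0 = rng.standard_normal(width)
    x = x0.copy()

    for l in range(1, depth + 1):
        # Two-layer residual branch: std sqrt(2/fan_in) before the ReLU (He init),
        # std sqrt(1/fan_in) for the last layer (no ReLU follows it inside the
        # branch), so the branch output variance roughly equals its input variance.
        W1 = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        W2 = rng.standard_normal((width, width)) * np.sqrt(1.0 / width)
        x = x + W2 @ np.maximum(W1 @ x, 0.0)
        if l % 10 == 0:
            # Var[x_l] / Var[x_0] should be on the order of 2**l.
            print(l, x.var() / x0.var(), 2.0 ** l)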
As we will show, at initialization, the gradient norm of certain activations and weight tensors is
lower bounded by the cross-entropy loss up to some constant. Intuitively, this implies that blowup
in the logits will cause gradient explosion. Our result applies to convolutional and linear weights
in a neural network with ReLU nonlinearity (e.g., feed-forward network, CNN), possibly with skip
connections (e.g., ResNet, DenseNet), but without any normalization.
Our analysis utilizes properties of positively homogeneous functions, which we now introduce.
Definition 1 (positively homogeneous function of first degree). A function f : R^m → R^n is called
positively homogeneous (of first degree) (p.h.) if for any input x ∈ R^m and α > 0, f(αx) = αf(x).
Definition 2 (positively homogeneous set of first degree). Let θ = {θ_i}_{i∈S} be the set of parameters
of f(x) and θ_ph = {θ_i}_{i∈S_ph} with S_ph ⊂ S. We call θ_ph a positively homogeneous set (of first degree) (p.h.
set) if for any α > 0, f(x; θ \ θ_ph, αθ_ph) = αf(x; θ \ θ_ph, θ_ph), where αθ_ph denotes {αθ_i}_{i∈S_ph}.

Intuitively, a p.h. set is a set of parameters θ_ph in a function f such that for any fixed input x and fixed
parameters θ \ θ_ph, f̄(θ_ph) ≜ f(x; θ \ θ_ph, θ_ph) is a p.h. function.
Examples of p.h. functions are ubiquitous in neural networks, including various kinds of linear op-
erations without bias (fully-connected (FC) and convolution layers, pooling, addition, concatenation
and dropout etc.) as well as ReLU nonlinearity. Moreover, we have the following claim:
Proposition 1. A function that is the composition of p.h. functions is itself p.h.
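Positive homogeneity is also easy to check numerically. The snippet below is our own minimal example: a small bias-free function built from linear maps, a ReLU, and a skip connection, i.e. a composition of p.h. functions as in Proposition 1.

    import numpy as np

    rng = np.random.default_rng(1)
    W1, W2, W3 = (rng.standard_normal((8, 8)) for _ in range(3))

    def f(x):
        # Bias-free network with a skip connection: every operation (matrix
        # multiply, ReLU, addition) is p.h., so the composition is p.h. too.
        h = x + W2 @ np.maximum(W1 @ x, 0.0)
        return W3 @ h

    x = rng.standard_normal(8)
    alpha = 3.7
    assert np.allclose(f(alpha * x), alpha * f(x))  # f(αx) = αf(x) for α > 0
    print("positive homogeneity holds for alpha =", alpha)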

We study classification problems with c classes and the cross-entropy loss. We use f to denote a
neural network function except for the softmax layer. The cross-entropy loss is defined as
ℓ(z, y) ≜ −y^T(z − logsumexp(z)), where y is the one-hot label vector, z ≜ f(x) ∈ R^c is the logits
vector whose i-th element is z_i, and logsumexp(z) ≜ log(Σ_{i∈[c]} exp(z_i)). Consider a minibatch
of training examples D_M = {(x^(m), y^(m))}_{m=1}^M and the average cross-entropy loss
ℓ_avg(D_M) ≜ (1/M) Σ_{m=1}^M ℓ(f(x^(m)), y^(m)), where we use the superscript (m) to index quantities
referring to the m-th example. ‖·‖ denotes any valid norm. We only make the following assumptions
about the network f:
1. f is a sequential composition of network blocks {f_i}_{i=1}^L, i.e. f(x_0) = f_L(f_{L−1}(... f_1(x_0))),
each of which is composed of p.h. functions.
2. Weight elements in the FC layer are i.i.d. sampled from a zero-mean symmetric distribution.
These assumptions hold at initialization if we remove all the normalization layers in a residual
network with ReLU nonlinearity, assuming all the biases are initialized at 0.
Our results are summarized in the following two theorems, whose proofs are listed in the appendix:
Theorem 1. Denote the input to the i-th block by x_{i−1}. With Assumption 1, we have

    ‖∂ℓ/∂x_{i−1}‖ ≥ (ℓ(z, y) − H(p)) / ‖x_{i−1}‖,    (2)

where p is the softmax probabilities and H denotes the Shannon entropy.

Since H(p) is upper bounded by log(c) and kxi−1 k is small in the lower blocks, blowup in the loss
will cause large gradient norm with respect to the lower block input. Our second theorem proves a
lower bound on the gradient norm of a p.h. set in a network.
Theorem 2. With Assumption 1, we have

    ‖∂ℓ_avg/∂θ_ph‖ ≥ (1/(M‖θ_ph‖)) Σ_{m=1}^M (ℓ(z^(m), y^(m)) − H(p^(m))) ≜ G(θ_ph).    (3)

Furthermore, with Assumptions 1 and 2, we have

    E[G(θ_ph)] ≥ (E[max_{i∈[c]} z_i] − log(c)) / ‖θ_ph‖.    (4)


It remains to identify such p.h. sets in a neural network. In Figure 2 we provide three examples
of p.h. sets in a ResNet without normalization. Theorem 2 suggests that these layers would suffer
from the exploding gradient problem, if the logits z blow up at initialization, which unfortunately
would occur in a ResNet without normalization if initialized in a traditional way. This motivates us
to introduce a new initialization in the next section.
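As an informal numerical sanity check of the bound in Equation (2) (our own sketch, not an experiment from the paper), the snippet below uses PyTorch autograd on a toy bias-free residual block followed by a linear classifier, and confirms that ‖∂ℓ/∂x‖·‖x‖ is at least ℓ(z, y) − H(p).

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    W1 = torch.randn(64, 64) * (2.0 / 64) ** 0.5
    W2 = torch.randn(64, 64) * (2.0 / 64) ** 0.5
    Wfc = torch.randn(10, 64) / 64 ** 0.5

    x = torch.randn(64, requires_grad=True)
    # Bias-free residual block followed by a linear classifier (p.h. in x).
    z = Wfc @ (x + W2 @ F.relu(W1 @ x))
    y = torch.zeros(10)
    y[3] = 1.0                                            # one-hot label
    loss = -(y * (z - torch.logsumexp(z, dim=0))).sum()   # cross-entropy

    p = torch.softmax(z, dim=0)
    entropy = -(p * torch.log(p)).sum()
    loss.backward()

    lhs = x.grad.norm() * x.detach().norm()               # ||dl/dx|| * ||x||
    rhs = loss.detach() - entropy.detach()                # l(z, y) - H(p)
    print(float(lhs), ">=", float(rhs), bool(lhs >= rhs))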


Figure 2: Examples of p.h. sets in a ResNet without normalization: (1) the first convolution layer
before max pooling; (2) the fully connected layer before softmax; (3) the union of a spatial down-
sampling layer in the backbone and a convolution layer in its corresponding residual branch.

3 FIXUP: UPDATE A RESIDUAL NETWORK Θ(η) PER SGD STEP

Our analysis in the previous section points out the failure mode of standard initializations for training
deep residual networks: the gradient norm of certain layers is in expectation lower bounded by a
quantity that increases indefinitely with the network depth. However, escaping this failure mode
does not necessarily lead us to successful training — after all, it is the whole network as a function
that we care about, rather than a layer or a network block. In this section, we propose a top-down
design of a new initialization that ensures proper update scale to the network function, by simply
rescaling a standard initialization. To start, we denote the learning rate by η and set our goal:

f (x; θ) is updated by Θ(η) per SGD step after initialization as η → 0.



That is, ‖∆f(x)‖ = Θ(η), where ∆f(x) ≜ f(x; θ − η ∂ℓ(f(x), y)/∂θ) − f(x; θ).

Put another way, our goal is to design an initialization such that SGD updates to the network function
are in the right scale and independent of the depth.
We define the Shortcut as the shortest path from input to output in a residual network. The Shortcut
is typically a shallow network with a few trainable layers.1 We assume the Shortcut is initialized
using a standard method, and focus on the initialization of the residual branches.

Residual branches update the network in sync. To start, we first make an important observa-
tion that the SGD update to each residual branch changes the network output in highly correlated
directions. This implies that if a residual network has L residual branches, then an SGD step to each
residual branch should change the network output by Θ(η/L) on average to achieve an overall Θ(η)
update. We defer the formal statement and its proof until Appendix B.1.

Study of a scalar branch. Next we study how to initialize a residual branch with m layers so
that its SGD update changes the network output by Θ(η/L). We assume m is a small positive
integer (e.g., 2 or 3). As we are only concerned about the scale of the update, it is sufficiently
instructive to study the scalar case, i.e., F(x) = (∏_{i=1}^m a_i) x where a_1, ..., a_m, x ∈ R^+. For
example, the standard initialization methods typically initialize each layer so that the output (after
nonlinear activation) preserves the input variance, which can be modeled as setting ∀i ∈ [m], ai = 1.
In turn, setting ai to a positive number other than 1 corresponds to rescaling the i-th layer by ai .
Through deriving the constraints for F (x) to make Θ(η/L) updates, we will also discover how to
rescale the weight layers of a standard initialization as desired. In particular, we show the SGD
1 For example, in the ResNet architecture (e.g., ResNet-50, ResNet-101 or ResNet-152) for ImageNet
classification, the Shortcut is always a 6-layer network with five convolution layers and one fully-connected
layer, irrespective of the total depth of the whole network.


update to F(x) is Θ(η/L) if and only if the initialization satisfies the following constraint:

    (∏_{i∈[m]\{j}} a_i) x = Θ(1/√L),  where j ∈ arg min_k a_k    (5)

We defer the derivation until Appendix B.2.


Equation (5) suggests new methods to initialize a residual branch through rescaling the standard
initialization of the i-th layer in a residual branch by its corresponding scalar a_i. For example, we
could set ∀i ∈ [m], a_i = L^{−1/(2m−2)}. Alternatively, we could start the residual branch as a zero
function by setting a_m = 0 and ∀i ∈ [m − 1], a_i = L^{−1/(2m−2)}. In the second option, the residual
branch does not need to “unlearn” its potentially bad random initial state, which can be beneficial
for learning. Therefore, we use the latter option in our experiments, unless otherwise specified.
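As a quick numerical check of this choice (our own, not from the paper): with the uniform rescaling a_i = L^{−1/(2m−2)}, the product of the m − 1 factors that remain after excluding the smallest one is L^{−(m−1)/(2m−2)} = L^{−1/2}, which is exactly the Θ(1/√L) scale required by Equation (5) when x = Θ(1); the zero-last-layer variant gives the same product once a_m = 0 is excluded.

    # Numeric check that a_i = L**(-1/(2m-2)) satisfies Equation (5): excluding
    # the smallest of m equal factors leaves a product equal to L**(-1/2).
    for L in (10, 100, 1000, 10000):
        for m in (2, 3):
            a = L ** (-1.0 / (2 * m - 2))
            prod_excluding_min = a ** (m - 1)
            print(f"L={L:>5}  m={m}  product={prod_excluding_min:.6f}  "
                  f"1/sqrt(L)={L ** -0.5:.6f}")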

The effects of biases and multipliers. With proper rescaling of the weights in all the residual
branches, a residual network is supposed to be updated by Θ(η) per SGD step — our goal is
achieved. However, in order to match the training performance of a corresponding network with
normalization, there are two more things to consider: biases and multipliers.
Using biases in the linear and convolution layers is a common practice. In normalization methods,
bias and scale parameters are typically used to restore the representation power after normalization.2
Intuitively, because the preferred input/output mean of a weight layer may be different from the
preferred output/input mean of an activation layer, it also helps to insert bias terms in a residual
network without normalization. Empirically, we find that inserting just one scalar bias before each
weight layer and nonlinear activation layer significantly improves the training performance.
Multipliers scale the output of a residual branch, similar to the scale parameters in batch normaliza-
tion. They have an interesting effect on the learning dynamics of weight layers in the same branch.
Specifically, as the stochastic gradient of a layer is typically almost orthogonal to its weight, learn-
ing rate decay tends to cause the weight norm equilibrium to shrink when combined with L2 weight
decay (van Laarhoven, 2017). In a branch with multipliers, this in turn causes the growth of the mul-
tipliers, increasing the effective learning rate of other layers. In particular, we observe that inserting
just one scalar multiplier per residual branch mimics the weight norm dynamics of a network with
normalization, and spares us the search of a new learning rate schedule.
Put together, we propose the following method to train residual networks without normalization:

Fixup initialization (or: How to train a deep residual network without normalization)
1. Initialize the classification layer and the last layer of each residual branch to 0.
2. Initialize every other layer using a standard method (e.g., He et al. (2015)), and scale only
   the weight layers inside residual branches by L^{−1/(2m−2)}.
3. Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at
0) before each convolution, linear, and element-wise activation layer.

It is important to note that Rule 2 of Fixup is the essential part as predicted by Equation (5). Indeed,
we observe that using Rule 2 alone is sufficient and necessary for training extremely deep residual
networks. On the other hand, Rule 1 and Rule 3 provide further improvements for training so as to
match the performance of a residual network with normalization layers, as explained in the text
above.3 Ablation experiments confirm our claims (see Appendix C.1).
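The three rules translate directly into a short initialization routine. The sketch below is our own illustration (it assumes the hypothetical FixupBasicBlock module sketched after Figure 1, two weight layers per branch so m = 2, and a final nn.Linear classifier); it applies He initialization to the first convolution of each branch scaled by L^{−1/(2m−2)}, zero-initializes the last convolution of each branch and the classification layer, and leaves the scalar biases at 0 and multipliers at 1.

    import torch
    import torch.nn as nn

    def fixup_init(model, num_branches, m=2):
        """Apply the three Fixup rules to a network built from FixupBasicBlock
        modules and a final nn.Linear classifier.  num_branches is L, the number
        of residual branches; m is the number of weight layers per branch."""
        scale = num_branches ** (-1.0 / (2 * m - 2))  # Rule 2 rescaling factor
        for module in model.modules():
            if isinstance(module, FixupBasicBlock):
                # Rule 2: standard He init for the first conv, rescaled.
                nn.init.kaiming_normal_(module.conv1.weight, nonlinearity='relu')
                with torch.no_grad():
                    module.conv1.weight.mul_(scale)
                # Rule 1: the last layer of each residual branch starts at zero.
                nn.init.zeros_(module.conv2.weight)
            elif isinstance(module, nn.Linear):
                # Rule 1: the classification layer starts at zero.
                nn.init.zeros_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
        # Rule 3 (scalar biases at 0, multipliers at 1) is already satisfied by
        # the parameter initial values inside FixupBasicBlock.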

2 For example, in batch normalization the gamma and beta parameters are used to affine-transform the
normalized activations per channel.
3 It is worth noting that the design of Fixup is a simplification of the common practice, in that we only
introduce O(K) parameters beyond convolution and linear weights (since we remove bias terms from
convolution and linear layers), whereas the common practice includes O(KC) (Ioffe & Szegedy, 2015;
Salimans & Kingma, 2016) or O(KCWH) (Ba et al., 2016) additional parameters, where K is the number of
layers, C is the max number of channels per layer, and W, H are the spatial dimensions of the largest feature
maps.


Our initialization and network design are consistent with recent theoretical work (Hardt & Ma, 2016;
Li et al., 2018) which, in much more simplified settings such as linearized residual nets and
quadratic neural nets, proposes that small initialization tends to stabilize optimization and help
generalization. However, our approach suggests that more delicate control of the scale of the
initialization is beneficial.4

4 EXPERIMENTS
4.1 TRAINING AT INCREASING DEPTH

One of the key advantages of BatchNorm is that it leads to fast training even for very deep models
(Ioffe & Szegedy, 2015). Here we will determine if we can match this desirable property by relying
only on proper initialization. We propose to evaluate how each method affects training very deep
nets by measuring the test accuracy after the first epoch as we increase depth. In particular, we
use the wide residual network (WRN) architecture with width 1 and the default weight decay 5e−4
(Zagoruyko & Komodakis, 2016). We specifically use the default learning rate of 0.1 because the
ability to use high learning rates is considered to be important to the success of BatchNorm. We
compare Fixup against three baseline methods: (1) rescale the output of each residual block by 1/√2
(Balduzzi et al., 2017), (2) post-process an orthogonal initialization such that the output variance of
each residual block is close to 1 (Layer-sequential unit-variance orthogonal initialization, or LSUV)
(Mishkin & Matas, 2015), (3) batch normalization (Ioffe & Szegedy, 2015). We use the default
batch size of 128 up to 1000 layers, with a batch size of 64 for 10,000 layers. We limit our budget
of epochs to 1 due to the computational strain of evaluating models with up to 10,000 layers.

[Figure 3 plot: first-epoch test accuracy (%) versus depth (10 to 10,000) for 1/√2-scaling, LSUV, BatchNorm, and Fixup.]

Figure 3: Depth of residual networks versus test accuracy at the first epoch for various methods on
CIFAR-10 with the default BatchNorm learning rate. We observe that Fixup is able to train very
deep networks with the same learning rate as batch normalization. (Higher is better.)
Figure 3 shows the test accuracy at the first epoch as depth increases. Observe that Fixup matches
the performance of BatchNorm at the first epoch, even with 10,000 layers. LSUV and 1/√2-scaling
are not able to train with the same learning rate as BatchNorm past 100 layers.

4.2 IMAGE CLASSIFICATION

In this section, we evaluate the ability of Fixup to replace batch normalization in image classification
applications. On the CIFAR-10 dataset, we first test on ResNet-110 (He et al., 2016) with default
hyper-parameters; results are shown in Table 1. Fixup obtains a 7% relative improvement in test error
compared with standard initialization; however, we note a substantial difference in the difficulty of
training. While the network with Fixup is trained with the same learning rate and converges as fast as
the network with batch normalization, we fail to train a Xavier-initialized ResNet-110 with 0.1x the
maximal learning rate.5 The test error gap in Table 1 is likely due to the regularization effect of BatchNorm
4 For example, a learning rate smaller than our choice would also stabilize the training, but lead to a lower
convergence rate.
5 Personal communication with the authors of (Shang et al., 2017) confirms our observation, and reveals that
the Xavier-initialized network needs more epochs to converge.


rather than difficulty in optimization; when we train Fixup networks with better regularization, the
test error gap disappears and we obtain state-of-the-art results on CIFAR-10 and SVHN without
normalization layers (see Appendix C.2).

Dataset     ResNet-110                            Normalization   Large η   Test Error (%)
CIFAR-10    w/ BatchNorm (He et al., 2016)        Yes             Yes       6.61
            w/ Xavier Init (Shang et al., 2017)   No              No        7.78
            w/ Fixup-init                         No              Yes       7.24

Table 1: Results on CIFAR-10 with ResNet-110 (mean/median of 5 runs; lower is better).

On the ImageNet dataset, we benchmark Fixup with the ResNet-50 and ResNet-101 architectures
(He et al., 2016), trained for 100 epochs and 200 epochs respectively. Similar to our finding on
the CIFAR-10 dataset, we observe that (1) training with Fixup is fast and stable with the default
hyperparameters, (2) Fixup alone significantly improves the test error of standard initialization, and
(3) there is a large test error gap between Fixup and BatchNorm. Further inspection reveals that
Fixup initialized models obtain significantly lower training error compared with BatchNorm models
(see Appendix C.3), i.e., Fixup suffers from overfitting. We therefore apply stronger regularization
to the Fixup models using Mixup (Zhang et al., 2017). We find it is beneficial to reduce the learning
rate of the scalar multiplier and bias by 10x when additional large regularization is used. Best
Mixup coefficients are found through cross-validation: they are 0.2, 0.1 and 0.7 for BatchNorm,
GroupNorm (Wu & He, 2018) and Fixup respectively. We present the results in Table 2, noting that
with better regularization, the performance of Fixup is on par with GroupNorm.

Model        Method                                    Normalization   Test Error (%)
ResNet-50    BatchNorm (Goyal et al., 2017)            Yes             23.6
             BatchNorm + Mixup (Zhang et al., 2017)    Yes             23.3
             GroupNorm + Mixup                         Yes             23.9
             Xavier Init (Shang et al., 2017)          No              31.5
             Fixup-init                                No              27.6
             Fixup-init + Mixup                        No              24.0
ResNet-101   BatchNorm (Zhang et al., 2017)            Yes             22.0
             BatchNorm + Mixup (Zhang et al., 2017)    Yes             20.8
             GroupNorm + Mixup                         Yes             21.4
             Fixup-init + Mixup                        No              21.4

Table 2: ImageNet test results using the ResNet architecture. (Lower is better.)
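Since mixup is the main additional regularizer used here, a minimal generic PyTorch version of the mixup objective (Zhang et al., 2017) is sketched below for reference. This is our own illustration rather than the authors' training code; the default α = 0.7 simply mirrors the cross-validated Fixup coefficient reported above.

    import numpy as np
    import torch
    import torch.nn.functional as F

    def mixup_loss(model, x, y, alpha=0.7):
        """One mixup training objective: interpolate inputs and targets with a
        coefficient drawn from Beta(alpha, alpha).  y holds integer class labels."""
        lam = float(np.random.beta(alpha, alpha))
        perm = torch.randperm(x.size(0), device=x.device)
        mixed_x = lam * x + (1.0 - lam) * x[perm]
        logits = model(mixed_x)
        return (lam * F.cross_entropy(logits, y)
                + (1.0 - lam) * F.cross_entropy(logits, y[perm]))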

4.3 MACHINE TRANSLATION

To demonstrate the generality of Fixup, we also apply it to replace layer normalization (Ba et al.,
2016) in Transformer (Vaswani et al., 2017), a state-of-the-art neural network for machine trans-
lation. Specifically, we use the fairseq library (Gehring et al., 2017) and follow the Fixup tem-
plate in Section 3 to modify the baseline model. We evaluate on two standard machine translation
datasets, IWSLT German-English (de-en) and WMT English-German (en-de) following the setup
of Ott et al. (2018). For the IWSLT de-en dataset, we cross-validate the dropout probability from
{0.3, 0.4, 0.5, 0.6} and find 0.5 to be optimal for both Fixup and the LayerNorm baseline. For the
WMT’16 en-de dataset, we use dropout probability 0.4. All models are trained for 200k updates.
It was reported (Chen et al., 2018) that “Layer normalization is most critical to stabilize the training
process... removing layer normalization results in unstable training runs”. However we find training
with Fixup to be very stable and as fast as the baseline model. Results are shown in Table 3.
Surprisingly, we find the models do not suffer from overfitting when LayerNorm is replaced by
Fixup, thanks to the strong regularization effect of dropout. Instead, Fixup matches or surpasses
the state-of-the-art results using the Transformer model on both datasets.


Dataset        Model                          Normalization   BLEU
IWSLT DE-EN    (Deng et al., 2018)            Yes             33.1
               LayerNorm                      Yes             34.2
               Fixup-init                     No              34.5
WMT EN-DE      (Vaswani et al., 2017)         Yes             28.4
               LayerNorm (Ott et al., 2018)   Yes             29.3
               Fixup-init                     No              29.3

Table 3: Comparing Fixup vs. LayerNorm for machine translation tasks. (Higher is better.)

5 RELATED WORK
Normalization methods. Normalization methods have enabled training very deep residual net-
works, and are currently an essential building block of the most successful deep learning architec-
tures. All normalization methods for training neural networks explicitly normalize (i.e. standardize)
some component (activations or weights) by dividing the activations or weights by some real number
computed from their statistics and/or subtracting from the activations some statistic (typically
the mean).6 In contrast, Fixup does not compute statistics (mean, variance or
norm) at initialization or during any phase of training, hence is not a normalization method.

Theoretical analysis of deep networks. Training very deep neural networks is an important the-
oretical problem. Early works study the propagation of variance in the forward and backward pass
for different activation functions (Glorot & Bengio, 2010; He et al., 2015).
Recently, the study of dynamical isometry (Saxe et al., 2013) provides a more detailed characteriza-
tion of the forward and backward signal propagation at initialization (Pennington et al., 2017; Hanin,
2018), enabling training 10,000-layer CNNs from scratch (Xiao et al., 2018). For residual networks,
activation scale (Hanin & Rolnick, 2018), gradient variance (Balduzzi et al., 2017) and dynamical
isometry property (Yang & Schoenholz, 2017) have been studied. Our analysis in Section 2 leads
to a similar conclusion as previous work, namely that the standard initialization for residual networks is
problematic. However, our use of positive homogeneity for lower bounding the gradient norm of a
neural network is novel, and applies to a broad class of neural network architectures (e.g., ResNet,
DenseNet) and initialization methods (e.g., Xavier, LSUV) with simple assumptions and proof.
Hardt & Ma (2016) analyze the optimization landscape (loss surface) of linearized residual nets in
the neighborhood around the zero initialization where all the critical points are proved to be global
minima. Yang & Schoenholz (2017) study the effect of the initialization of residual nets on the test
performance and point out that the Xavier or He initialization scheme is not optimal. In this paper, we
give a concrete recipe for the initialization scheme with which we can train deep residual networks
without batch normalization successfully.

Understanding batch normalization. Despite its popularity in practice, batch normalization has
not been well understood. Ioffe & Szegedy (2015) attributed its success to “reducing internal covari-
ate shift”, whereas Santurkar et al. (2018) argued that its effect may be “smoothing loss surface”.
Our analysis in Section 2 corroborates the latter idea of Santurkar et al. (2018) by showing that
standard initialization leads to very steep loss surface at initialization. Moreover, we empirically
showed in Section 3 that steep loss surface may be alleviated for residual networks by using smaller
initialization than the standard ones such as Xavier or He’s initialization in residual branches. van
Laarhoven (2017); Hoffer et al. (2018) studied the effect of (batch) normalization and weight decay
on the effective learning rate. Their results inspire us to include a multiplier in each residual branch.

ResNet initialization in practice. Gehring et al. (2017); Balduzzi et al. (2017) proposed to address
the initialization problem of residual nets by using the recurrence x_l = √(1/2) (x_{l−1} + F_l(x_{l−1})).
Mishkin & Matas (2015) proposed a data-dependent initialization to mimic the effect of batch nor-
malization in the first forward pass. While both methods limit the scale of activation and gradient,
they would fail to train stably at the maximal learning rate for very deep residual networks, since
6 For reference, we include a brief history of normalization methods in Appendix D.


they fail to consider the accumulation of highly correlated updates contributed by different residual
branches to the network function (Appendix B.1). Srivastava et al. (2015); Hardt & Ma (2016);
Goyal et al. (2017); Kingma & Dhariwal (2018) found that initializing the residual branches at (or
close to) zero helped optimization. Our results support their observation in general, but Equation (5)
suggests additional subtleties when choosing a good initialization scheme.

6 CONCLUSION
In this work, we study how to train a deep residual network reliably without normalization. Our
theory in Section 2 suggests that the exploding gradient problem at initialization in a positively
homogeneous network such as ResNet is directly linked to the blowup of logits. In Section 3 we
develop Fixup initialization to ensure the whole network as well as each residual branch gets up-
dates of proper scale, based on a top-down analysis. Extensive experiments on real world datasets
demonstrate that Fixup matches normalization techniques in training deep residual networks, and
achieves state-of-the-art test performance with proper regularization.
Our work opens up new possibilities for both theory and applications. Can we analyze the training
dynamics of Fixup, which may potentially be simpler than analyzing models with batch normalization?
Could we apply or extend the initialization scheme to other applications of deep learning?
It would also be very interesting to understand the regularization benefits of various normalization
methods, and to develop better regularizers to further improve the test performance of Fixup.

ACKNOWLEDGMENTS
The authors would like to thank Yuxin Wu, Kaiming He, Aleksander Madry and the anonymous
reviewers for their helpful feedback.

REFERENCES
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams.
The shattered gradients problem: If resnets are the answer, then what is the question? arXiv
preprint arXiv:1702.08591, 2017.
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster,
Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. The best of both worlds: Combin-
ing recent advances in neural machine translation. arXiv preprint arXiv:1804.09849, 2018.
Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander M Rush. Latent alignment
and variational attention. Thirty-second Conference on Neural Information Processing Systems
(NIPS), 2018.
Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks
with cutout. arXiv preprint arXiv:1708.04552, 2017.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional
Sequence to Sequence Learning. In Proc. of ICML, 2017.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the thirteenth international conference on artificial intelligence and
statistics, pp. 249–256, 2010.
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, An-
drew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: training imagenet
in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Benjamin Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? arXiv
preprint arXiv:1801.03744, 2018.


Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture.
arXiv preprint arXiv:1803.01719, 2018.
Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231,
2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In Proceedings of the IEEE international
conference on computer vision, pp. 1026–1034, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
770–778, 2016.
David J Heeger. Normalization of cell responses in cat striate cortex. Visual neuroscience, 9(2):
181–197, 1992.
Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate
normalization schemes in deep networks. arXiv preprint arXiv:1803.01814, 2018.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions.
arXiv preprint arXiv:1807.03039, 2018.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo-
lutional neural networks. In Advances in neural information processing systems, pp. 1097–1105,
2012.
Chen-Yu Lee, Patrick W Gallagher, and Zhuowen Tu. Generalizing pooling functions in convo-
lutional neural networks: Mixed, gated, and tree. In Artificial Intelligence and Statistics, pp.
464–472, 2016.
Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized
matrix recovery. Conference on Learning Theory (COLT), 2018.
Siwei Lyu and Eero P Simoncelli. Nonlinear image representation using divisive normalization.
In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8.
IEEE, 2008.
Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422,
2015.
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation.
arXiv preprint arXiv:1806.00187, 2018.
Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep
learning through dynamical isometry: theory and practice. In Advances in neural information
processing systems, pp. 4785–4795, 2017.
Nicolas Pinto, David D Cox, and James J DiCarlo. Why is real-world visual object recognition
hard? PLoS computational biology, 4(1):e27, 2008.
Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accel-
erate training of deep neural networks. In Advances in Neural Information Processing Systems,
pp. 901–909, 2016.
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch
normalization help optimization? (No, it is not about internal covariate shift). arXiv preprint
arXiv:1805.11604, 2018.
Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynam-
ics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.


Wenling Shang, Justin Chiu, and Kihyuk Sohn. Exploring normalization in deep residual networks
with concatenated rectified linear units. In AAAI, pp. 1509–1516, 2017.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint
arXiv:1505.00387, 2015.

Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing
ingredient for fast stylization. CoRR, abs/1607.08022, 2016.

Twan van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint
arXiv:1706.05350, 2017.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Infor-
mation Processing Systems, pp. 5998–6008, 2017.

Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision
(ECCV), September 2018.

Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz, and Jeffrey Penning-
ton. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla
convolutional neural networks. arXiv preprint arXiv:1806.05393, 2018.

Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Shakedrop regularization. arXiv preprint
arXiv:1802.02375, 2018.

Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances
in neural information processing systems, pp. 7103–7114, 2017.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint
arXiv:1605.07146, 2016.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical
risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation
using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE Interna-
tional Conference on, 2017.

A PROOFS FOR SECTION 2

A.1 GRADIENT NORM LOWER BOUND FOR THE INPUT TO A NETWORK BLOCK

Proof of Theorem 1. We use f_{i→j} to denote the composition f_j ∘ f_{j−1} ∘ ··· ∘ f_i, so that z =
f_{i→L}(x_{i−1}) for all i ∈ [L]. Note that z is p.h. with respect to the input of each network block,
i.e. f_{i→L}((1 + ε)x_{i−1}) = (1 + ε)f_{i→L}(x_{i−1}) for ε > −1. This allows us to compute the gradient
of the cross-entropy loss with respect to the scaling factor ε at ε = 0 as

    ∂/∂ε |_{ε=0} ℓ(f_{i→L}((1 + ε)x_{i−1}), y) = (∂ℓ/∂z)(∂f_{i→L}/∂ε) = −y^T z + p^T z = ℓ(z, y) − H(p)    (6)

Since the gradient ℓ2 norm ‖∂ℓ/∂x_{i−1}‖ must be greater than the directional derivative
∂/∂t ℓ(f_{i→L}(x_{i−1} + t x_{i−1}/‖x_{i−1}‖), y), defining ε = t/‖x_{i−1}‖ we have

    ‖∂ℓ/∂x_{i−1}‖ ≥ |∂/∂ε ℓ(f_{i→L}(x_{i−1} + ε x_{i−1}), y)| · |∂ε/∂t| = (ℓ(z, y) − H(p)) / ‖x_{i−1}‖.    (7)


A.2 GRADIENT NORM LOWER BOUND FOR POSITIVELY HOMOGENEOUS SETS

Proof of Theorem 2. The proof idea is similar. Recall that if θ_ph is a p.h. set, then f̄^(m)(θ_ph) ≜
f(x^(m); θ \ θ_ph, θ_ph) is a p.h. function. We therefore have

    ∂/∂ε |_{ε=0} ℓ_avg(D_M; (1 + ε)θ_ph) = (1/M) Σ_{m=1}^M (∂ℓ/∂z^(m))(∂f̄^(m)/∂ε) = (1/M) Σ_{m=1}^M (ℓ(z^(m), y^(m)) − H(p^(m)))    (8)

hence we again invoke the directional derivative argument to show

    ‖∂ℓ_avg/∂θ_ph‖ ≥ (1/(M‖θ_ph‖)) Σ_{m=1}^M (ℓ(z^(m), y^(m)) − H(p^(m))) ≜ G(θ_ph).    (9)

In order to estimate the scale of this lower bound, recall that the FC layer weights are i.i.d. sampled
from a symmetric, mean-zero distribution, therefore z has a symmetric probability density function
with mean 0. We hence have

    E[ℓ(z, y)] = E[−y^T(z − logsumexp(z))] ≥ E[y^T(max_{i∈[c]} z_i − z)] = E[max_{i∈[c]} z_i]    (10)

where the inequality uses the fact that logsumexp(z) ≥ max_{i∈[c]} z_i; the last equality is due to y
and z being independent at initialization and E[z] = 0. Using the trivial bound E[H(p)] ≤ log(c), we get

    E[G(θ_ph)] ≥ (E[max_{i∈[c]} z_i] − log(c)) / ‖θ_ph‖    (11)

which shows that the gradient norm of a p.h. set is of the order Ω(E[max_{i∈[c]} z_i]) at initialization.

B PROOFS FOR SECTION 3

B.1 RESIDUAL BRANCHES UPDATE THE NETWORK IN SYNC

A common theme in previous analysis of residual networks is the scale of activation and gradient
(Balduzzi et al., 2017; Yang & Schoenholz, 2017; Hanin & Rolnick, 2018). However, it is more
important to consider the scale of actual change to the network function made by a (stochastic)
gradient descent step. If the updates to different layers cancel out each other, the network would
be stable as a whole despite drastic changes in different layers; if, on the other hand, the updates
to different layers align with each other, the whole network may incur a drastic change in one step,
even if each layer only changes a tiny amount. We now provide analysis showing that the latter
scenario more accurately describes what happens in reality at initialization.
For our result in this section, we make the following assumptions:
• f is a sequential composition of network blocks {f_i}_{i=1}^L, i.e. f(x_0) = f_L(f_{L−1}(... f_1(x_0))),
consisting of fully-connected weight layers, ReLU activation functions and residual branches.
• fL is a fully-connected layer with weights i.i.d. sampled from a zero-mean distribution.
• There is no bias parameter in f .
For l < L, let x_{l−1} be the input to f_l and F_l(x_{l−1}) be a branch in f_l with m_l layers. Without loss of
generality, we study the following specific form of network architecture:

    F_l(x_{l−1}) = (ReLU ∘ W_l^{(m_l)} ∘ ··· ∘ ReLU ∘ W_l^{(1)})(x_{l−1}),    (m_l ReLU layers in total)
    f_l(x_{l−1}) = x_{l−1} + F_l(x_{l−1}).

For the last block we denote m_L = 1 and f_L(x_{L−1}) = F_L(x_{L−1}) = W_L^{(1)} x_{L−1}.
Furthermore, we always choose 0 as the gradient of ReLU when its input is 0. As such, with input x,
the output and gradient of ReLU(x) can be simply written as D_{1[x>0]} x, where D_{1[x>0]} is a diagonal
matrix with diagonal entries corresponding to 1[x > 0]. Denote the preactivation of the i-th layer
(i.e. the input to the i-th ReLU) in the l-th block by x_l^{(i)}. We define the following terms to simplify
our presentation:

    F_l^{(i−)} ≜ D_{1[x_l^{(i−1)} > 0]} W_l^{(i−1)} ··· D_{1[x_l^{(1)} > 0]} W_l^{(1)} x_{l−1},    l < L, i ∈ [m_l]
    F_l^{(i+)} ≜ D_{1[x_l^{(m_l)} > 0]} W_l^{(m_l)} ··· D_{1[x_l^{(i)} > 0]},    l < L, i ∈ [m_l]
    F_L^{(1−)} ≜ x_{L−1}
    F_L^{(1+)} ≜ I

We have the following result on the gradient update to f :


Theorem 3. With the above assumptions, suppose we update the network parameters by ∆θ =
−η ∂ℓ(f(x_0; θ), y)/∂θ. Then the update to the network output, ∆f(x_0) ≜ f(x_0; θ + ∆θ) − f(x_0; θ), is

    ∆f(x_0) = −η Σ_{l=1}^L Σ_{i=1}^{m_l} J_{li} (∂ℓ/∂z) + O(η²),    where J_{li} ≜ ‖F_l^{(i−)}‖² (∂f/∂x_l)^T F_l^{(i+)} (F_l^{(i+)})^T (∂f/∂x_l)    (12)

and z ≜ f(x_0) ∈ R^c is the logits.


Let us discuss the implication of this result before delving into the proof. As each J_{li} is a c × c real
symmetric positive semi-definite matrix, the trace norm of each J_{li} equals its trace. Similarly, the
trace norm of J ≜ Σ_l Σ_i J_{li} equals the trace of the sum of all the J_{li} as well, which scales linearly
with the number of residual branches L. Since the output z has no (or little) correlation with the target
y at the start of training, ∂ℓ/∂z is a vector in some random direction. It then follows that the expected
update scale is proportional to the trace norm of J, which is proportional to L as well as the average
trace of J_{li}. Simply put, to allow the whole network to be updated by Θ(η) per step independent of
depth, we need to ensure each residual branch contributes only a Θ(η/L) update on average.

Proof. The first insight to prove our result is to note that, conditioning on a specific input x_0, we
can replace each ReLU activation layer by a diagonal matrix without changing the forward and
backward pass. (In fact, this is valid even after we apply a gradient descent update, as long as the
learning rate η > 0 is sufficiently small so that all positive preactivations remain positive. This
observation will be essential for our later analysis.) We thus have that the gradient w.r.t. the i-th
weight layer in the l-th block is

    ∂ℓ/∂Vec(W_l^{(i)}) = (∂x_l/∂Vec(W_l^{(i)})) (∂f/∂x_l) (∂ℓ/∂z) = (F_l^{(i−)} ⊗ I) (F_l^{(i+)})^T (∂f/∂x_l) (∂ℓ/∂z),    (13)

where ⊗ denotes the Kronecker product. The second insight is to note that with our assumptions, a
network block and its gradient w.r.t. its input have the following relation:

    f_l(x_{l−1}) = (∂f_l/∂x_{l−1}) · x_{l−1}.    (14)

We then plug Equation (13) into the gradient update ∆θ = −η ∂ℓ(f(x_0; θ), y)/∂θ, and recalculate the
forward pass f(x_0; θ + ∆θ). The theorem follows by applying Equation (14) and a first-order Taylor
series expansion in a small neighborhood of η = 0 where f(x_0; θ + ∆θ) is smooth w.r.t. η.

B.2 WHAT SCALAR BRANCH HAS Θ(η/L) UPDATES?

For this section, we focus on the proper initialization of a scalar branch F(x) = (∏_{i=1}^m a_i) x. We
have the following result:

Theorem 4. Assuming ∀i, a_i ≥ 0, x = Θ(1) and ∂ℓ/∂F(x) = Θ(1), then ∆F(x) ≜ F(x; θ − η ∂ℓ/∂θ) −
F(x; θ) is Θ(η/L) if and only if

    (∏_{k∈[m]\{j}} a_k) x = Θ(1/√L),  where j ∈ arg min_k a_k    (15)


Proof. We start by calculating the gradient of each parameter:

    ∂ℓ/∂a_i = (∂ℓ/∂F) (∏_{k∈[m]\{i}} a_k) x    (16)

and a first-order approximation of ∆F(x):

    ∆F(x) = −η (∂ℓ/∂F(x)) (F(x))² Σ_{i=1}^m (1/a_i²)    (17)

where we conveniently abuse some notation by defining

    (1/a_i) F(x) ≜ (∏_{k∈[m]\{i}} a_k) x,    if a_i = 0.    (18)

Denoting Σ_{i=1}^m 1/a_i² as M and min_k a_k as A, we have

    (F(x))² · (1/A²) ≤ (F(x))² M ≤ (F(x))² · (m/A²)    (19)

and therefore, by rearranging Equation (17) and letting ∆F(x) = Θ(η/L), we get

    (F(x))² · (1/A²) = Θ(∆F(x) / (η ∂ℓ/∂F(x))) = Θ(1/L)    (20)

i.e. F(x)/A = Θ(1/√L). Hence the “only if” part is proved. For the “if” part, we apply Equation (19)
to Equation (17) and observe that by Equation (15)

    ∆F(x) = Θ(η (F(x))² · (1/A²)) = Θ(η/L)    (21)

The result of this theorem provides useful guidance on how to rescale the standard initialization to
achieve the desired update scale for the network function.
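The scaling in Theorem 4 can also be illustrated numerically. In the toy example below (our own sketch, m = 2, treating ∂ℓ/∂F as the constant 1), one SGD step on the factors of the scalar branch F(x) = a_1 a_2 x changes F by roughly η under the standard choice a_1 = a_2 = 1, but only by roughly η/L under the rescaling of Equation (15) with a zero last factor.

    import numpy as np

    def one_step_update(a, x=1.0, eta=0.1, grad_out=1.0):
        """Return |Delta F(x)| after one SGD step on the scalar branch
        F(x) = (prod_i a_i) * x, with d(loss)/dF = grad_out."""
        a = np.asarray(a, dtype=float)
        f_before = np.prod(a) * x
        # Gradient of the loss w.r.t. a_i is grad_out * x * prod_{k != i} a_k.
        grads = np.array([grad_out * x * np.prod(np.delete(a, i))
                          for i in range(len(a))])
        a_new = a - eta * grads
        return abs(np.prod(a_new) * x - f_before)

    L, m, eta = 1000, 2, 0.1
    standard = [1.0, 1.0]                          # standard init: a_i = 1
    rescaled = [L ** (-1.0 / (2 * m - 2)), 0.0]    # zero last layer, Eq. (15)
    print("standard :", one_step_update(standard, eta=eta), " vs eta   =", eta)
    print("rescaled :", one_step_update(rescaled, eta=eta), " vs eta/L =", eta / L)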

C ADDITIONAL EXPERIMENTS
C.1 ABLATION STUDIES OF FIXUP

In this section we present the training curves of different architecture designs and initialization
schemes. Specifically, we compare the training accuracy of batch normalization, Fixup, as well as
a few ablated options: (1) removing the bias parameters in the network; (2) using 0.1x the suggested
initialization scale and no bias parameters; (3) using 10x the suggested initialization scale and no bias
parameters; and (4) removing all the residual branches. The results are shown in Figure 4. We see that
initializing the residual branch layers at a smaller scale (or all zero) slows down learning, whereas
training fails when initializing them at a larger scale; we also see the clear benefit of adding bias
parameters in the network.

C.2 CIFAR AND SVHN WITH BETTER REGULARIZATION

We perform additional experiments to validate our hypothesis that the gap in test error between
Fixup and batch normalization is primarily due to overfitting. To combat overfitting, we use Mixup
(Zhang et al., 2017) and Cutout (DeVries & Taylor, 2017) with default hyperparameters as addi-
tional regularization. On the CIFAR-10 dataset, we perform experiments with WideResNet-40-10
and on SVHN we use WideResNet-16-12 (Zagoruyko & Komodakis, 2016), all with the default
hyperparameters. We observe in Table 4 that models trained with Fixup and strong regularization
are competitive with state-of-the-art methods on CIFAR-10 and SVHN, as well as our baseline with
batch normalization.


[Figure 4 plot: minibatch training accuracy (%) versus batch index (0 to 1200) for BatchNorm, Fixup, L^{−1/(2m−2)} scaling with no bias, 0.1x and 10x that scale with no bias, and no residual branches.]
Figure 4: Minibatch training accuracy of ResNet-110 on CIFAR-10 dataset with different config-
urations in the first 3 epochs. We use minibatch size of 128 and smooth the curves using 10-step
moving average.

Dataset     Model                            Normalization   Test Error (%)
CIFAR-10    (Zagoruyko & Komodakis, 2016)    Yes             3.8
            (Yamada et al., 2018)            Yes             2.3
            BatchNorm + Mixup + Cutout       Yes             2.5
            (Graham, 2014)                   No              3.5
            Fixup-init + Mixup + Cutout      No              2.3
SVHN        (Zagoruyko & Komodakis, 2016)    Yes             1.5
            (DeVries & Taylor, 2017)         Yes             1.3
            BatchNorm + Mixup + Cutout       Yes             1.4
            (Lee et al., 2016)               No              1.7
            Fixup-init + Mixup + Cutout      No              1.4

Table 4: Additional results on CIFAR-10, SVHN datasets.


C.3 TRAINING AND TEST CURVES ON IMAGENET

Figure 5 shows that without additional regularization Fixup fits the training set very well, but overfits
significantly. We see in Figure 6 that Fixup is competitive with networks trained with normalization
when the Mixup regularizer is used.

[Figure 5 plots: train error (%) and test error (%) versus epochs (0 to 100) for BatchNorm, GroupNorm, and Fixup.]

Figure 5: Training and test errors on ImageNet using ResNet-50 without additional regularization.
We observe that Fixup is able to better fit the training data, which leads to overfitting; more
regularization is needed. Results of BatchNorm and GroupNorm reproduced from (Wu & He, 2018).

[Figure 6 plot: test error (%) versus epochs (0 to 100) for BatchNorm + Mixup, GroupNorm + Mixup, and Fixup + Mixup.]

Figure 6: Test error of ResNet-50 on ImageNet with Mixup (Zhang et al., 2017). Fixup closely
matches the final results yielded by the use of GroupNorm, without any normalization.

D ADDITIONAL REFERENCES: A BRIEF HISTORY OF NORMALIZATION METHODS

The first use of normalization in neural networks appears in the modeling of biological visual system
and dates back at least to Heeger (1992) in neuroscience and to Pinto et al. (2008); Lyu & Simon-
celli (2008) in computer vision, where each neuron output is divided by the sum (or norm) of all of
the outputs, a module called divisive normalization. Recent popular normalization methods, such
as local response normalization (Krizhevsky et al., 2012), batch normalization (Ioffe & Szegedy,
2015) and layer normalization (Ba et al., 2016) mostly follow this tradition of dividing the neuron
activations by their certain summary statistics, often also with the activation mean subtracted. An
exception is weight normalization (Salimans & Kingma, 2016), which instead divides the weight
parameters by their statistics, specifically the weight norm; weight normalization also adopts the
idea of activation normalization for weight initialization. The recently proposed actnorm (Kingma
& Dhariwal, 2018) removes the normalization of weight parameters, but still uses activation
normalization to initialize the affine transformation layers.
