Modern Convolutional Neural Networks
For instance, NVIDIA’s GeForce 256 from 1999 was able to process at most 480 million floating-point operations, such as additions and multiplications, per second (MFLOPS), without any meaningful programming framework for operations beyond games.
Obtain an interesting dataset. In the early days, these datasets required expensive sensors. For
instance, the Apple QuickTake 100 of 1994 sported a whopping 0.3 megapixel resolution, capable
of storing up to 8 images, all for the price of $1000
Preprocess the dataset with hand-crafted features based on some knowledge of optics,
geometry, other analytic tools, and occasionally on the serendipitous discoveries by lucky
graduate students
Feed the data through a standard set of feature extractors such as the SIFT (scale-invariant
feature transform) (Lowe, 2004), the SURF (speeded up robust features) (Bay et al., 2006), or
any number of other hand-tuned pipelines
Dump the resulting representations into your favorite classifier, likely a linear model or kernel method, to train a classifier.
Representation Learning
The first modern CNN (Krizhevsky et al., 2012), named AlexNet (Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton), is largely an evolutionary improvement over LeNet.
Another aspect was that the images were at a relatively high resolution of 224 × 224 pixels, unlike the 80-million-image TinyImages dataset (Torralba et al., 2008), consisting of 32 × 32 pixel thumbnails.
NVIDIA and ATI had begun optimizing GPUs for general computing
operations (Fernando, 2004), going as far as to market them as
general-purpose GPUs (GPGPUs).
GPUs
GPU cores are much simpler, which makes them more energy efficient.
AlexNet
After the final convolutional layer, there are two huge fully connected layers with 4096 outputs.
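As a hedged PyTorch-style sketch of just this dense head (layer widths follow the description above; dropout with p = 0.5 and the 1000-class ImageNet output are assumptions based on the original design, not a verbatim reproduction):

```python
from torch import nn

# A sketch of AlexNet's dense head: after the final convolutional stage is
# flattened, two 4096-unit fully connected layers (with dropout) feed a
# 1000-way classifier.
alexnet_head = nn.Sequential(
    nn.Flatten(),
    nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.LazyLinear(1000))
```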
AlexNet
Activation Functions
This makes the model more robust and the larger sample size effectively reduces overfitting.
Discussion
Reviewing the architecture, we see that AlexNet has an Achilles heel when it comes to efficiency: the last two hidden layers require matrices of size 6400 × 4096 and 4096 × 4096, respectively.
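To see the scale of the problem: those two weight matrices alone hold 6400 × 4096 = 26,214,400 and 4096 × 4096 = 16,777,216 parameters, i.e. roughly 43 million in total, or about 170 MB in single precision (ignoring biases and the output layer).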
The idea of using blocks first emerged from the Visual Geometry Group (VGG) at Oxford University, in their eponymously-named VGG network (Simonyan and Zisserman, 2014).
VGG Blocks
The basic building block of CNNs is a sequence of the following: a convolutional layer with padding to maintain the resolution, a nonlinearity such as a ReLU, and a pooling layer such as max-pooling to reduce the resolution.
One of the problems with this approach is that the spatial resolution
decreases quite rapidly.
VGG Blocks
The key idea of Simonyan and Zisserman (2014) was to use multiple
convolutions in between downsampling via max-pooling in the form of
a block
VGG Blocks
In a rather detailed analysis they showed that deep and narrow networks significantly outperform their shallow counterparts.
This set deep learning on a quest for ever deeper networks with over 100
layers for typical applications
VGG Blocks
A VGG block consists of a sequence of convolutions with 3 × 3 kernels with padding of 1
followed by a 2 × 2 max-pooling layer with stride of 2
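A minimal PyTorch-style sketch of such a block (the helper name vgg_block and the ReLU placement are illustrative, not taken from the paper):

```python
from torch import nn

# A VGG block as described above: `num_convs` 3x3 convolutions with padding 1,
# each followed by ReLU, then a single 2x2 max-pooling layer with stride 2.
def vgg_block(num_convs, in_channels, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels,
                                kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)
```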
VGG Network
From AlexNet to VGG. The key difference is that VGG consists of blocks of
layers, whereas AlexNet’s layers are all designed individually.
VGG Network
VGG defines a family of networks rather than just a specific manifestation.
Network in Network (NiN)
LeNet, AlexNet, and VGG all share a common design pattern: extract features
exploiting spatial structure via a sequence of convolutions and pooling layers and
post-process the representations via fully connected layers
The fully connected layers at the end of the architecture consume tremendous numbers of parameters; moreover, fully connected layers cannot be added earlier in the network to increase nonlinearity without destroying the spatial structure.
NiN
The network in network (NiN) blocks (Lin et al., 2013) offer an
alternative, capable of solving both problems in one simple strategy
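As a hedged PyTorch-style sketch (the helper name nin_block is illustrative; LazyConv2d simply infers the input channel count), a NiN block pairs one ordinary convolution with two 1 × 1 convolutions that act as per-pixel fully connected layers:

```python
from torch import nn

# A NiN block: one k x k convolution followed by two 1x1 convolutions with
# ReLUs, acting like a per-pixel MLP over the channel dimension.
def nin_block(out_channels, kernel_size, stride, padding):
    return nn.Sequential(
        nn.LazyConv2d(out_channels, kernel_size, stride, padding), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU())
```

In the full NiN design the dense head disappears entirely: the last block emits one channel per class and a global average pooling layer (e.g. nn.AdaptiveAvgPool2d((1, 1))) reduces each channel to a single logit, which is what removes the parameter-hungry fully connected layers.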
NiN Blocks
It was arguably also the first network that exhibited a clear distinction among the stem (data ingest), body (data processing), and head (prediction) in a CNN.
This design pattern has persisted ever since in the design of deep networks: the stem is given by the first two or three convolutions that operate on the image.
GoogLeNet
Finally, the head maps the features obtained so far to the required classification, segmentation, detection, or tracking problem at hand.
GoogLeNet
Inception Blocks
The basic convolutional block in GoogLeNet is called an Inception block,
stemming from the meme “we need to go deeper” from the movie
Inception.
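A hedged PyTorch-style sketch of the four parallel branches (the per-branch channel counts c1–c4 are hyperparameters that differ across the nine Inception blocks of GoogLeNet; ReLUs after every convolution follow common practice):

```python
import torch
from torch import nn
import torch.nn.functional as F

# An Inception block: four parallel branches whose outputs are concatenated
# along the channel dimension.
class Inception(nn.Module):
    def __init__(self, c1, c2, c3, c4):
        super().__init__()
        # Branch 1: a single 1x1 convolution.
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2: 1x1 convolution followed by a 3x3 convolution.
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3: 1x1 convolution followed by a 5x5 convolution.
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4: 3x3 max-pooling followed by a 1x1 convolution.
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        return torch.cat((b1, b2, b3, b4), dim=1)
```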
Adaptive solvers such as AdaGrad (Duchi et al., 2011), Adam (Kingma and
Ba, 2014), Yogi (Zaheer et al., 2018), or Distributed Shampoo (Anil et al.,
2020) aim to address this from the viewpoint of optimization, e.g., by
adding aspects of second-order methods.
Next, we apply a scale coefficient and an offset to recover the lost degrees of freedom.
$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}}}{\hat{\boldsymbol{\sigma}}_{\mathcal{B}}} + \boldsymbol{\beta}.$$
We calculate μ̂_ℬ and σ̂_ℬ as follows:

$$\hat{\boldsymbol{\mu}}_{\mathcal{B}} = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x} \quad \text{and} \quad \hat{\boldsymbol{\sigma}}_{\mathcal{B}}^2 = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \left(\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}}\right)^2 + \epsilon.$$
Teye et al. (2018) and Luo et al. (2018) related the properties of batch normalization to
Bayesian priors and penalties, respectively.
In particular, this sheds some light on the puzzle of why batch normalization works best for
moderate minibatch sizes in the 50–100 range.
This particular size of minibatch seems to inject just the "right amount" of noise per layer, both in terms of scale via σ̂ and in terms of offset via μ̂: a larger minibatch regularizes less due to the more stable estimates, whereas tiny minibatches destroy useful signal due to high variance.
Training Deep Networks
Once the model is trained, we can calculate the means and
variances of each layer’s variables based on the entire dataset.
h = ϕ(BN(Wx + b)) .
Convolutional Layers
Similarly, with convolutional layers, we can apply batch normalization
after the convolution but before the nonlinear activation function
Assume that our minibatches contain m examples and that for each
channel, the output of the convolution has height p and width q
For convolutional layers, we carry out each batch normalization over the
m ⋅ p ⋅ q elements per output channel simultaneously.
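A minimal training-mode sketch of this computation (a hypothetical helper, not the library API; gamma and beta are assumed to broadcast against X, e.g. shape (features,) in the fully connected case and (1, channels, 1, 1) in the convolutional case):

```python
import torch

def batch_norm_sketch(X, gamma, beta, eps=1e-5):
    # Fully connected: X has shape (batch, features); normalize over the batch.
    # Convolutional: X has shape (batch, channels, height, width); normalize
    # over the m * p * q elements per channel, i.e. dims (0, 2, 3).
    if X.dim() == 2:
        mean = X.mean(dim=0)
        var = ((X - mean) ** 2).mean(dim=0)
    else:
        mean = X.mean(dim=(0, 2, 3), keepdim=True)
        var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
    X_hat = (X - mean) / torch.sqrt(var + eps)  # normalize
    return gamma * X_hat + beta                 # scale and shift
```

At prediction time one uses running estimates of the mean and variance instead, which is what built-in layers such as nn.BatchNorm1d and nn.BatchNorm2d track for you.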
Layer Normalization
Note that in the context of convolutions the batch normalization is well defined even for minibatches of size 1: after all, we have all the locations across an image to average.
This consideration led Ba et al. (2016) to introduce the notion of layer normalization.
It works just like a batch norm, only that it is applied to one observation at a time. For
an n-dimensional vector x, layer norms are given by
$$\mathbf{x} \to \mathrm{LN}(\mathbf{x}) = \frac{\mathbf{x} - \hat{\mu}}{\hat{\sigma}},$$

where scaling and offset are applied coefficient-wise and given by

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \hat{\mu}\right)^2 + \epsilon.$$
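A tiny sketch matching the formula above for a single vector (placing ϵ inside the square root mirrors the batch-norm convention used here; the built-in nn.LayerNorm additionally learns a coefficient-wise scale and offset):

```python
import torch

def layer_norm_sketch(x, eps=1e-5):
    # Normalize one n-dimensional observation using its own mean and variance.
    mu = x.mean()
    sigma = torch.sqrt(((x - mu) ** 2).mean() + eps)
    return (x - mu) / sigma

x = torch.randn(6)
print(layer_norm_sketch(x))
```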
Layer Normalization
One of the major benefits of using layer normalization is that it prevents divergence.
Discussion
Intuitively, batch normalization is thought to make the optimization
landscape smoother
Recall that we do not even know why simpler deep neural networks generalize well in the first place.
Discussion
The original paper proposing batch normalization (Ioffe and Szegedy, 2015), in addition to introducing a powerful and useful tool, offered an explanation for why it works: by reducing internal covariate shift.
However, there were two problems with this explanation. First, this drift in the distribution of variable values during training is very different from covariate shift, rendering the name a misnomer.
Discussion
Second, the explanation offers an under-specified intuition but leaves the question of why precisely this technique works open, wanting for a rigorous explanation.
Summary
Batch normalization is slightly different for fully connected layers
than for convolutional layers.
For more robust models that are less sensitive to input perturbations,
consider removing batch normalization (Wang et al., 2022).
Function Classes
Consider ℱ, the class of functions that a specific network architecture can reach.
That is, for all f ∈ ℱ there exists some set of parameters that can
be obtained through training on a suitable dataset
Function Classes
$$f^*_{\mathcal{F}} = \operatorname*{argmin}_{f} L(X, y, f) \quad \text{subject to } f \in \mathcal{F}.$$
Function Classes
We know that regularization (Morozov, 1984, Tikhonov and Arsenin, 1977) may control the complexity of ℱ and achieve consistency, so a larger size of training data generally leads to a better f*_ℱ.
Function Classes
However, if ℱ ⊈ ℱ′, there is no guarantee that this should even happen. In fact, f*_ℱ′ might well be worse.


Function Classes
For deep neural networks, if we can train the newly-added layer into an identity
function f(x) = x, the new model will be as effective as the original model
This is the question that He et al. (2016) considered when working on very
deep computer vision models
At the heart of their proposed residual network (ResNet) is the idea that
every additional layer should more easily contain the identity function as one of
its elements.
These considerations are rather profound but they led to a surprisingly simple
solution, a residual block.
Function Classes
ResNet won the ImageNet Large Scale Visual Recognition Challenge in 2015.
Residual blocks have been added to recurrent networks (Kim et al., 2017, Prakash et al.,
2016).
Likewise, Transformers (Vaswani et al., 2017) use them to stack many layers of networks efficiently.
Residual connections are also used in graph neural networks (Kipf and Welling, 2016) and, as a basic concept, they have been used extensively in computer vision (Redmon and Farhadi, 2018, Ren et al., 2015).
Note that residual networks are predated by highway networks (Srivastava et al.,
2015) that share some of the motivation, albeit without the elegant parametrization around
the identity function.
Residual Blocks
In a regular block (left), the portion within the dotted-line box must directly learn the mapping f(x). In a
residual block (right), the portion within the dotted-line box needs to learn the residual mapping g(x) = f(x)
- x, making the identity mapping f(x) = x easier to learn.
Residual Blocks
ResNet block with and without 1x1 convolution, which transforms the input into
the desired shape for the addition operation.
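A hedged PyTorch-style sketch of such a residual block, broadly following He et al. (2016) (batch normalization after each convolution; the optional 1 × 1 convolution reshapes the input for the addition):

```python
import torch
from torch import nn
import torch.nn.functional as F

# A residual block: two 3x3 convolutions with batch normalization compute
# g(x); the input x is added back before the final ReLU. An optional 1x1
# convolution reshapes x so the addition is well defined.
class Residual(nn.Module):
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3,
                                   padding=1, stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        self.conv3 = (nn.LazyConv2d(num_channels, kernel_size=1,
                                    stride=strides) if use_1x1conv else None)
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        if self.conv3:
            x = self.conv3(x)
        return F.relu(y + x)
```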
ResNet Model
The original ResNet paper (He et al., 2016) allowed for up to 152 layers
One way to think about this: just as a Taylor expansion decomposes a function into a leading term plus successively refined corrections,

$$f(x) = f(0) + x \cdot \left[ f'(0) + x \cdot \left[ \frac{f''(0)}{2!} + x \cdot \left[ \frac{f'''(0)}{3!} + \cdots \right] \right] \right],$$

ResNet decomposes each block into the identity plus a learned residual:

$$f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x}).$$






The main difference between ResNet (left) and DenseNet (right) in cross-layer
connections: use of addition and use of concatenation.
From ResNet to DenseNet
As a result, we perform a mapping from x to its values after applying an increasingly complex sequence of functions:

$$\mathbf{x} \to \left[\mathbf{x},\, f_1(\mathbf{x}),\, f_2\!\left(\left[\mathbf{x}, f_1(\mathbf{x})\right]\right),\, f_3\!\left(\left[\mathbf{x}, f_1(\mathbf{x}), f_2\!\left(\left[\mathbf{x}, f_1(\mathbf{x})\right]\right)\right]\right),\, \ldots\right].$$
Dense connections in DenseNet. Note how the dimensionality increases with depth.
Dense Blocks
The main components that comprise a DenseNet are dense blocks
and transition layers
Transition Layers
Dense Block
A 5-layer dense block with a growth rate of k = 4. Each layer takes all
preceding feature-maps as input.
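A hedged PyTorch-style sketch of a dense block (the BN → ReLU → 3 × 3 convolution ordering follows the original DenseNet design; num_channels plays the role of the growth rate k):

```python
import torch
from torch import nn

def conv_block(num_channels):
    # BN -> ReLU -> 3x3 convolution: the unit repeated inside a dense block.
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=3, padding=1))

# A dense block: each layer's output (k new feature maps) is concatenated
# with its input, so the channel count grows by k per layer.
class DenseBlock(nn.Module):
    def __init__(self, num_convs, num_channels):
        super().__init__()
        self.net = nn.Sequential(*[conv_block(num_channels)
                                   for _ in range(num_convs)])

    def forward(self, x):
        for blk in self.net:
            y = blk(x)
            x = torch.cat((x, y), dim=1)  # concatenate along channels
        return x
```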
DenseNet
A deep DenseNet with three dense blocks. The layers between two adjacent blocks
are referred to as transition layers and change feature-map sizes via
convolution and pooling.
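A matching sketch of a transition layer, which uses a 1 × 1 convolution to reduce the channel count and 2 × 2 average pooling with stride 2 to halve the spatial resolution:

```python
from torch import nn

# A DenseNet transition layer: shrink channels with a 1x1 convolution, then
# halve height and width with average pooling.
def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
```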