Modern Convolutional Neural Networks

While the idea of deep neural networks is quite simple, performance can vary wildly across architectures and hyperparameter choices.

Neural network architectures are the product of intuition, a few mathematical insights, and a lot of trial and error.

Antiquity (in AI time)


Although the inputs to convolutional networks consist of raw or lightly-processed pixel values, practitioners would never feed raw pixels into traditional models.

Instead, typical computer vision pipelines relied on manually engineered feature extraction, such as SIFT (Lowe, 2004), SURF (Bay et al., 2006), and bags of visual words (Sivic and Zisserman, 2003).

Rather than learning the features, the features were crafted by hand.


Antiquity (in AI time)


Although some neural network accelerators were available in the 1990s, they were not yet sufficiently powerful to make deep multichannel, multilayer CNNs with a large number of parameters practical.

For instance, NVIDIA's GeForce 256 from 1999 was able to process at most 480 million floating-point operations per second (480 MFLOPS), such as additions and multiplications, without any meaningful programming framework for operations beyond games.

Today's accelerators are able to perform in excess of 1000 TFLOPs per device.

Antiquity (in AI time)


Moreover, datasets were still relatively small: OCR on 60,000 low-resolution 28 × 28 pixel images was considered a highly challenging task.

Added to these obstacles, key tricks for training neural networks, including parameter initialization heuristics (Glorot and Bengio, 2010), clever variants of stochastic gradient descent (Kingma and Ba, 2014), non-squashing activation functions (Nair and Hinton, 2010), and effective regularization techniques (Srivastava et al., 2014), were still missing.

Antiquity (in AI time)


Thus, rather than training end-to-end systems, classical pipelines looked more like this:

Obtain an interesting dataset. In the early days, these datasets required expensive sensors. For instance, the Apple QuickTake 100 of 1994 sported a whopping 0.3 megapixel resolution, capable of storing up to 8 images, all for the price of $1000.

Preprocess the dataset with hand-crafted features based on some knowledge of optics, geometry, other analytic tools, and occasionally on the serendipitous discoveries of lucky graduate students.

Feed the data through a standard set of feature extractors such as SIFT (scale-invariant feature transform) (Lowe, 2004), SURF (speeded up robust features) (Bay et al., 2006), or any number of other hand-tuned pipelines.

Dump the resulting representations into your favorite classifier, likely a linear model or kernel method, to train a classifier.

Representation Learning
The first modern CNN (Krizhevsky et al., 2012), named AlexNet (Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton), is largely an evolutionary improvement over LeNet.

Image filters learned by the first layer of AlexNet.
Missing Ingredient: Data
Given the limited storage capacity of computers, the relative expense of
sensors, and the comparatively tighter research budgets in the 1990s,
most research relied on tiny datasets.

In 2009, the ImageNet dataset was released (Deng et al., 2009), challenging researchers to learn models from 1 million examples, 1000 each from 1000 distinct categories of objects.

Another aspect was that the images were at a relatively high resolution of 224 × 224 pixels, unlike the 80-million-image TinyImages dataset (Torralba et al., 2008), consisting of 32 × 32 pixel thumbnails.

Missing Ingredient: Hardware


Graphics processing units (GPUs) proved to be a game changer in making deep learning feasible.

In particular, they were optimized for high-throughput 4 × 4 matrix–vector products, which are needed for many computer graphics tasks.

NVIDIA and ATI had begun optimizing GPUs for general computing operations (Fernando, 2004), going as far as to market them as general-purpose GPUs (GPGPUs).

GPUs
GPU cores are much simpler than CPU cores, which makes them more energy efficient.

GPUs have memory buses that are at least 10 times as wide as those of many CPUs.

AlexNet
After the final convolutional layer, there are two huge fully connected layers with 4096 outputs each.

Because of the limited memory in early GPUs, the original AlexNet used a dual-data-stream design, so that each of its two GPUs could be responsible for storing and computing only its half of the model.

From LeNet (left) to AlexNet (right).

AlexNet
Activation Functions

AlexNet changed the sigmoid activation function to the simpler ReLU activation function.

On the one hand, the computation of the ReLU activation function is simpler.

On the other hand, the ReLU activation function makes model training easier when using different parameter initialization methods.

Capacity Control and Preprocessing


AlexNet controls the model complexity of the fully connected layer by dropout, while LeNet only uses weight decay.

To augment the data even further, the training loop of AlexNet added a great deal of image augmentation, such as flipping, clipping, and color changes.

This makes the model more robust, and the larger sample size effectively reduces overfitting.
Discussion
Reviewing the architecture, we see that AlexNet has an Achilles heel when it comes to efficiency: the last two hidden layers require weight matrices of size 6400 × 4096 and 4096 × 4096, respectively.

This corresponds to 164 MB of memory and 81 MFLOPs of computation, both of which are a nontrivial outlay, especially on smaller devices such as mobile phones.
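As a rough back-of-the-envelope check (an illustration only, assuming 4-byte float32 weights and counting a multiply and an add per weight for a single example):

6400 · 4096 + 4096 · 4096 ≈ 43 × 10⁶ parameters → 43 × 10⁶ · 4 bytes ≈ 164 MiB of memory, and ≈ 2 · 43 × 10⁶ ≈ 86 million floating-point operations per example, in rough agreement with the figures above.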

Note that even though the number of parameters by far exceeds the amount of training data, there is hardly any overfitting.

Networks Using Blocks (VGG)


The design of neural network architectures has grown progressively more abstract, with researchers moving from thinking in terms of individual neurons to whole layers, and now to blocks, i.e., repeating patterns of layers.

A decade later, this has progressed further: researchers now take entire trained models and repurpose them for different, albeit related, tasks.

Such large pretrained models are typically called foundation models (Bommasani et al., 2021).

Networks Using Blocks (VGG)

The idea of using blocks first emerged from the Visual Geometry Group (VGG) at Oxford University, in their eponymously named VGG network (Simonyan and Zisserman, 2014).

It is easy to implement these repeated structures in code with any modern deep learning framework by using loops and subroutines.

VGG Blocks
The basic building block of CNNs is a sequence of the following:

a convolutional layer with padding to maintain the resolution,

a nonlinearity such as a ReLU,

a pooling layer such as max-pooling to reduce the resolution.

One of the problems with this approach is that the spatial resolution decreases quite rapidly.

In particular, this imposes a hard limit of log2 d convolutional layers on the network (for inputs of spatial dimension d) before all dimensions are used up.

VGG Blocks
The key idea of Simonyan and Zisserman (2014) was to use multiple convolutions in between downsampling via max-pooling, in the form of a block.

They were primarily interested in whether deep or wide networks perform better.

For instance, the successive application of two 3 × 3 convolutions touches the same pixels as a single 5 × 5 convolution does.

VGG Blocks
In a rather detailed analysis they showed that deep and narrow networks significantly outperform their shallow counterparts.

This set deep learning on a quest for ever deeper networks, with over 100 layers for typical applications.

Stacking 3 × 3 convolutions has become a gold standard in later deep networks (a design decision only recently revisited by Liu et al. (2022)).

Consequently, fast implementations for small convolutions have become a staple on GPUs (Lavin and Gray, 2016).

VGG Blocks
A VGG block consists of a sequence of convolutions with 3 × 3 kernels with padding of 1, followed by a 2 × 2 max-pooling layer with stride of 2.

import tensorflow as tf

def vgg_block(num_convs, num_channels):
    # A VGG block: num_convs 3x3 convolutions (padding 'same'),
    # followed by a single 2x2 max-pooling layer with stride 2.
    blk = tf.keras.models.Sequential()
    for _ in range(num_convs):
        blk.add(tf.keras.layers.Conv2D(num_channels, kernel_size=3,
                                       padding='same', activation='relu'))
    blk.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))
    return blk
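As an illustration of how such blocks compose into a full network (a sketch, assuming the vgg_block and import above; the (num_convs, num_channels) pairs follow the common VGG-11 configuration and the head mirrors AlexNet's fully connected layers):

def vgg(conv_arch=((1, 64), (1, 128), (2, 256), (2, 512), (2, 512)),
        num_classes=10):
    # Convolutional part: a stack of VGG blocks.
    net = tf.keras.models.Sequential()
    for (num_convs, num_channels) in conv_arch:
        net.add(vgg_block(num_convs, num_channels))
    # Fully connected head.
    net.add(tf.keras.layers.Flatten())
    net.add(tf.keras.layers.Dense(4096, activation='relu'))
    net.add(tf.keras.layers.Dropout(0.5))
    net.add(tf.keras.layers.Dense(4096, activation='relu'))
    net.add(tf.keras.layers.Dropout(0.5))
    net.add(tf.keras.layers.Dense(num_classes))
    return net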

VGG Network

From AlexNet to VGG. The key difference is that VGG consists of blocks of
layers, whereas AlexNet’s layers are all designed individually.
VGG Network
VGG defines a family of networks rather than just a specific manifestation.

Simonyan and Zisserman (2014) described several other variants of VGG.

In fact, it has become the norm to propose families of networks with different speed–accuracy trade-offs when introducing a new architecture.
Network in Network (NiN)
LeNet, AlexNet, and VGG all share a common design pattern: extract features exploiting spatial structure via a sequence of convolutions and pooling layers, and post-process the representations via fully connected layers.

This design poses two major challenges:

The fully connected layers at the end of the architecture consume tremendous numbers of parameters.

It is equally impossible to add fully connected layers earlier in the network to increase the degree of nonlinearity: doing so would destroy the spatial structure and require potentially even more memory.

NiN
The network in network (NiN) blocks (Lin et al., 2013) offer an alternative, capable of solving both problems in one simple strategy.

They were proposed based on a very simple insight (sketched in code below):

use 1 × 1 convolutions to add local nonlinearities across the channel activations;

use global average pooling to integrate across all locations in the last representation layer.
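A minimal sketch of a NiN block in the same Keras style as the VGG code above (the kernel size, strides, and padding of the first convolution are left as parameters):

import tensorflow as tf

def nin_block(num_channels, kernel_size, strides, padding):
    # One ordinary convolution followed by two 1x1 convolutions, each with a
    # ReLU, adding local nonlinearities across the channel activations.
    return tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(num_channels, kernel_size, strides=strides,
                               padding=padding, activation='relu'),
        tf.keras.layers.Conv2D(num_channels, 1, activation='relu'),
        tf.keras.layers.Conv2D(num_channels, 1, activation='relu')])

In the last stage, the number of output channels is set to the number of classes, and a GlobalAveragePooling2D layer then produces one logit per class.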

NiN Blocks

Comparing the architectures of VGG and NiN, and of their blocks.


NiN Model
NiN avoids fully connected layers altogether.

Instead, NiN uses a NiN block with a number of output channels equal to the number of label classes, followed by a global average pooling layer, yielding a vector of logits.

This design significantly reduces the number of required model parameters, albeit at the expense of a potential increase in training time.

Multi-Branch Networks (GoogLeNet)


In 2014, GoogLeNet won the ImageNet Challenge (Szegedy et al., 2015), using a structure that combined the strengths of NiN (Lin et al., 2013), repeated blocks (Simonyan and Zisserman, 2014), and a cocktail of convolution kernels.

It was arguably also the first network that exhibited a clear distinction among the stem (data ingest), body (data processing), and head (prediction) in a CNN.

This design pattern has persisted ever since in the design of deep networks: the stem is given by the first two or three convolutions that operate on the image.
GoogLeNet

This is followed by a body of convolutional blocks.

Finally, the head maps the features obtained so far to the required classification, segmentation, detection, or tracking problem at hand.

The key contribution in GoogLeNet was the design of the network body.

GoogLeNet

It solved the problem of selecting convolution kernels in an ingenious way.

While other works tried to identify which convolution, ranging from 1 × 1 to 11 × 11, would be best, it simply concatenated multi-branch convolutions.

Inception Blocks
The basic convolutional block in GoogLeNet is called an Inception block,
stemming from the meme “we need to go deeper” from the movie
Inception.

Structure of the Inception block.
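The figure's four parallel branches can be sketched as follows (a hedged illustration in the same Keras style; the channel counts c1 through c4 are free parameters here, not the exact GoogLeNet values):

import tensorflow as tf

def inception_block(x, c1, c2, c3, c4):
    # Branch 1: a single 1x1 convolution.
    b1 = tf.keras.layers.Conv2D(c1, 1, activation='relu')(x)
    # Branch 2: 1x1 convolution to reduce channels, then a 3x3 convolution.
    b2 = tf.keras.layers.Conv2D(c2[0], 1, activation='relu')(x)
    b2 = tf.keras.layers.Conv2D(c2[1], 3, padding='same', activation='relu')(b2)
    # Branch 3: 1x1 convolution to reduce channels, then a 5x5 convolution.
    b3 = tf.keras.layers.Conv2D(c3[0], 1, activation='relu')(x)
    b3 = tf.keras.layers.Conv2D(c3[1], 5, padding='same', activation='relu')(b3)
    # Branch 4: 3x3 max-pooling, then a 1x1 convolution.
    b4 = tf.keras.layers.MaxPool2D(3, strides=1, padding='same')(x)
    b4 = tf.keras.layers.Conv2D(c4, 1, activation='relu')(b4)
    # Concatenate all branch outputs along the channel dimension.
    return tf.keras.layers.Concatenate()([b1, b2, b3, b4])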


GoogLeNet Model

The GoogLeNet architecture.


Batch Normalization
A popular and effective technique that consistently accelerates the convergence of deep networks (Ioffe and Szegedy, 2015).

Together with residual blocks, batch normalization has made it possible for practitioners to routinely train networks with over 100 layers.

A secondary benefit of batch normalization lies in its inherent regularization.

Training Deep Networks


Data preprocessing techniques are beneficial for keeping the estimation problem well controlled.

It is only natural to ask whether a corresponding normalization step inside a deep network might not be beneficial.

For a typical MLP or CNN, as we train, the variables in intermediate layers may take values with widely varying magnitudes: whether along the layers from input to output, across units in the same layer, or over time due to our updates to the model parameters.

Training Deep Networks


The inventors of batch normalization (Ioffe and Szegedy, 2015) postulated informally that this drift in the distribution of such variables could hamper the convergence of the network.

Intuitively, we might conjecture that if one layer has variable activations that are 100 times those of another layer, this might necessitate compensatory adjustments in the learning rates.

Adaptive solvers such as AdaGrad (Duchi et al., 2011), Adam (Kingma and Ba, 2014), Yogi (Zaheer et al., 2018), or Distributed Shampoo (Anil et al., 2020) aim to address this from the viewpoint of optimization, e.g., by adding aspects of second-order methods.

Training Deep Networks


The alternative is to prevent the problem from occurring, simply by adaptive normalization.

Deeper networks are complex and tend to be more liable to overfitting.

A common technique for regularization is noise injection.

As it turns out, quite serendipitously, batch normalization conveys all three benefits: preprocessing, numerical stability, and regularization.

Training Deep Networks


Batch normalization is applied to individual layers, or optionally, to all of them:

In each training iteration, we first normalize the inputs by subtracting their mean and dividing by their standard deviation, where both are estimated based on the statistics of the current minibatch.

Next, we apply a scale coefficient and an offset to recover the lost degrees of freedom.

It is precisely due to this normalization based on batch statistics that batch normalization derives its name.

Training Deep Networks


Denote by ℬ a minibatch and let x ∈ ℬ be an input to batch normalization (BN). In this case batch normalization is defined as follows:

BN(x) = γ ⊙ (x − μ̂_ℬ) / σ̂_ℬ + β

μ̂_ℬ is the sample mean and σ̂_ℬ is the sample standard deviation of the minibatch ℬ.

We recover the lost degrees of freedom by including an elementwise scale parameter γ and shift parameter β that have the same shape as x.

Training Deep Networks


The variable magnitudes for intermediate layers cannot diverge during training, since batch normalization actively centers and rescales them back to a given mean and size.

Practical experience confirms that batch normalization seems to allow for more aggressive learning rates.

We calculate μ̂_ℬ and σ̂_ℬ as follows:

μ̂_ℬ = (1 / |ℬ|) ∑_{x∈ℬ} x  and  σ̂_ℬ² = (1 / |ℬ|) ∑_{x∈ℬ} (x − μ̂_ℬ)² + ϵ
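A minimal from-scratch sketch of these two equations in training mode (a simplified illustration; the moving averages used at prediction time and the framework's built-in tf.keras.layers.BatchNormalization are omitted):

import tensorflow as tf

def batch_norm_train(X, gamma, beta, eps=1e-5):
    # Per-feature mean and (biased) variance over the minibatch dimension.
    mean = tf.reduce_mean(X, axis=0)
    var = tf.reduce_mean(tf.square(X - mean), axis=0)
    # Normalize, then restore the lost degrees of freedom via gamma and beta.
    X_hat = (X - mean) / tf.sqrt(var + eps)
    return gamma * X_hat + beta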

Training Deep Networks


For reasons that are not yet well characterized theoretically, various sources of noise in optimization often lead to faster training and less overfitting: this variation appears to act as a form of regularization.

Teye et al. (2018) and Luo et al. (2018) related the properties of batch normalization to Bayesian priors and penalties, respectively.

In particular, this sheds some light on the puzzle of why batch normalization works best for moderate minibatch sizes in the 50–100 range.

This particular size of minibatch seems to inject just the "right amount" of noise per layer, both in terms of scale via σ̂ and in terms of offset via μ̂: a larger minibatch regularizes less due to the more stable estimates, whereas tiny minibatches destroy useful signal due to high variance.
Training Deep Networks
Once the model is trained, we can calculate the means and variances of each layer's variables based on the entire dataset.

Indeed, this is standard practice for models employing batch normalization; thus batch normalization layers function differently in training mode (normalizing by minibatch statistics) than in prediction mode (normalizing by dataset statistics).

Fully Connected Layers


When applying batch normalization to fully connected layers, Ioffe and Szegedy (2015), in their original paper, inserted batch normalization after the affine transformation and before the nonlinear activation function.

Later applications experimented with inserting batch normalization right after activation functions.

Denoting the input to the fully connected layer by x, the affine transformation by Wx + b, and the activation function by ϕ, we can express the computation of a batch-normalization-enabled fully connected layer output h as follows:

h = ϕ(BN(Wx + b))
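In Keras, this ordering can be sketched as follows (an illustration only; the layer width of 256 is arbitrary):

import tensorflow as tf

# Dense layer without activation, then batch normalization, then the
# nonlinearity, i.e. h = phi(BN(Wx + b)).
fc = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu')])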

Convolutional Layers
Similarly, with convolutional layers, we can apply batch normalization after the convolution but before the nonlinear activation function.

The key difference from batch normalization in fully connected layers is that we apply the operation on a per-channel basis across all locations.

Assume that our minibatches contain m examples and that for each channel, the output of the convolution has height p and width q.

For convolutional layers, we carry out each batch normalization over the m · p · q elements per output channel simultaneously (see the sketch below).
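In an NHWC layout, this per-channel reduction can be sketched as follows (training-mode statistics only; the learned scale and offset are omitted for brevity):

import tensorflow as tf

def conv_batch_norm_train(X, eps=1e-5):
    # X: convolution output of shape (m, p, q, channels).
    # Reduce over batch, height, and width, so each channel's mean and
    # variance are estimated from m * p * q elements.
    mean, var = tf.nn.moments(X, axes=[0, 1, 2], keepdims=True)
    return (X - mean) / tf.sqrt(var + eps)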

Layer Normalization
Note that in the context of convolutions, batch normalization is well defined even for minibatches of size 1: after all, we have all the locations across an image to average over.

This consideration led Ba et al. (2016) to introduce the notion of layer normalization.

It works just like batch norm, only that it is applied to one observation at a time. For an n-dimensional vector x, layer norms are given by

x → LN(x) = (x − μ̂) / σ̂,

where scaling and offset are applied coefficient-wise and given by

μ̂ = (1/n) ∑_{i=1}^{n} x_i  and  σ̂² = (1/n) ∑_{i=1}^{n} (x_i − μ̂)² + ϵ
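A minimal sketch of this per-observation normalization (Keras also provides tf.keras.layers.LayerNormalization, which additionally learns a scale and offset):

import tensorflow as tf

def layer_norm(x, eps=1e-5):
    # Normalize each observation over its own feature axis (the last axis),
    # independently of the other examples in the minibatch.
    mean = tf.reduce_mean(x, axis=-1, keepdims=True)
    var = tf.reduce_mean(tf.square(x - mean), axis=-1, keepdims=True)
    return (x - mean) / tf.sqrt(var + eps)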

Layer Normalization
One of the major benefits of using layer normalization is that it prevents divergence.

After all, ignoring ϵ, the output of the layer normalization is scale independent:

LN(x) ≈ LN(αx) for any choice of α ≠ 0.

This becomes an equality for |α| → ∞.

Another advantage of layer normalization is that it does not depend on the minibatch size.

Discussion
Intuitively, batch normalization is thought to make the optimization landscape smoother.

However, we must be careful to distinguish between speculative intuitions and true explanations for the phenomena that we observe when training deep models.

Recall that we do not even know why simpler deep neural networks generalize well in the first place.

Discussion
The original paper proposing batch normalization (Ioffe and Szegedy, 2015), in addition to introducing a powerful and useful tool, offered an explanation for why it works: by reducing internal covariate shift.

Presumably, by internal covariate shift the authors meant something like the notion that the distribution of variable values changes over the course of training.

However, there were two problems with this explanation. First, this drift is very different from covariate shift, rendering the name a misnomer.

Discussion
Second, the explanation offers an under-specified intuition but leaves the question of why precisely this technique works an open question wanting for a rigorous explanation.

Following the success of batch normalization, its explanation in terms of internal covariate shift has repeatedly surfaced in debates in the technical literature and broader discourse about how to present machine learning research.

In a memorable speech given while accepting a Test of Time Award at the 2017 NeurIPS conference, Ali Rahimi used internal covariate shift as a focal point in an argument likening the modern practice of deep learning to alchemy.
Summary
Batch normalization is slightly different for fully connected layers than for convolutional layers.

In fact, for convolutional layers, layer normalization can sometimes be used as an alternative.

Like a dropout layer, batch normalization layers have different behaviors in training mode than in prediction mode.

For more robust models that are less sensitive to input perturbations, consider removing batch normalization (Wang et al., 2022).

Function Classes
Consider ℱ, the class of functions that a specific network architecture can reach.

That is, for all f ∈ ℱ there exists some set of parameters that can be obtained through training on a suitable dataset.

Let's assume that f* is the "truth" function that we really would like to find.

If it is in ℱ, we are in good shape, but typically we will not be quite so lucky.
Function Classes

Instead, we will try to find some f*_ℱ, which is our best bet within ℱ.

For instance, given a dataset with features X and labels y, we might try finding it by solving the following optimization problem:

f*_ℱ = argmin_f L(X, y, f) subject to f ∈ ℱ.

Function Classes
We know that regularization (Morozov, 1984; Tikhonov and Arsenin, 1977) may control the complexity of ℱ and achieve consistency, so a larger amount of training data generally leads to a better f*_ℱ.

It is only reasonable to assume that if we design a different and more powerful architecture ℱ′ we should arrive at a better outcome.

In other words, we would expect that f*_ℱ′ is "better" than f*_ℱ.


Function Classes
However, if ℱ ⊈ ℱ′, there is no guarantee that this should even happen.

In fact, f*_ℱ′ might well be worse.


Function Classes
For deep neural networks, if we can train the newly added layer into an identity function f(x) = x, the new model will be as effective as the original model.

This is the question that He et al. (2016) considered when working on very deep computer vision models.

At the heart of their proposed residual network (ResNet) is the idea that every additional layer should more easily contain the identity function as one of its elements.

These considerations are rather profound, but they led to a surprisingly simple solution: a residual block.

Function Classes
ResNet won the ImageNet Large Scale Visual Recognition Challenge in 2015.

Residual blocks have been added to recurrent networks (Kim et al., 2017; Prakash et al., 2016).

Likewise, Transformers (Vaswani et al., 2017) use them to stack many layers of networks efficiently.

The same idea is used in graph neural networks (Kipf and Welling, 2016) and, as a basic concept, it has been used extensively in computer vision (Redmon and Farhadi, 2018; Ren et al., 2015).

Note that residual networks are predated by highway networks (Srivastava et al., 2015), which share some of the motivation, albeit without the elegant parametrization around the identity function.

Residual Blocks

In a regular block (left), the portion within the dotted-line box must directly learn the mapping f(x). In a
residual block (right), the portion within the dotted-line box needs to learn the residual mapping g(x) = f(x)
- x, making the identity mapping f(x) = x easier to learn.
Residual Blocks

ResNet block with and without 1x1 convolution, which transforms the input into
the desired shape for the addition operation.
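A sketch of such a residual block in the same Keras style (mirroring the figure: an optional 1×1 convolution reshapes the input so it can be added to the branch output; the batch normalization placement follows common ResNet practice):

import tensorflow as tf

class Residual(tf.keras.layers.Layer):
    """Residual block: output = ReLU(g(x) + x), so learning the identity
    mapping only requires driving g toward zero."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(num_channels, 3, padding='same',
                                            strides=strides)
        self.conv2 = tf.keras.layers.Conv2D(num_channels, 3, padding='same')
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()
        # Optional 1x1 convolution to bring x into the shape needed for addition.
        self.conv3 = (tf.keras.layers.Conv2D(num_channels, 1, strides=strides)
                      if use_1x1conv else None)

    def call(self, x):
        y = tf.keras.activations.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        if self.conv3 is not None:
            x = self.conv3(x)
        return tf.keras.activations.relu(y + x)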
ResNet Model

The ResNet-18 architecture.


ResNeXt
Applying the idea of multiple independent groups to the ResNet block
led to the design of ResNeXt (Xie et al., 2017).
Summary and Discussion

Nested function classes are desirable since they allow us to obtain strictly more powerful, rather than merely subtly different, function classes when adding capacity.

We can train an effective deep neural network by having residual blocks.

The original ResNet paper (He et al., 2016) allowed for up to 152 layers.

Another benefit of residual networks is that they allow us to add layers, initialized as the identity function, during the training process.

Densely Connected Networks (DenseNet)


ResNet significantly changed the view of how to parametrize the functions in deep networks.

DenseNet (dense convolutional network) is to some extent the logical extension of this (Huang et al., 2017).

DenseNet is characterized by both the connectivity pattern, where each layer connects to all the preceding layers, and the concatenation operation to preserve and reuse features from earlier layers.

From ResNet to DenseNet

Recall the Taylor expansion for functions. At the point x = 0 it can be written as

f(x) = f(0) + x · [ f′(0) + x · [ f′′(0)/2! + x · [ f′′′(0)/3! + ⋯ ] ] ].

In a similar vein, ResNet decomposes functions into

f(x) = x + g(x).

From ResNet to DenseNet

The main difference between ResNet (left) and DenseNet (right) in cross-layer
connections: use of addition and use of concatenation.
From ResNet to DenseNet
As a result, we perform a mapping from x to its values after applying an increasingly complex sequence of functions:

x → [x, f1(x), f2([x, f1(x)]), f3([x, f1(x), f2([x, f1(x)])]), …].

Dense connections in DenseNet. Note how the dimensionality increases with depth.

Dense Blocks
The main components that comprise a DenseNet are dense blocks and transition layers.

A dense block consists of multiple convolution blocks, each using the same number of output channels (see the sketch below).

The number of convolution block channels controls the growth in the number of output channels relative to the number of input channels. This is also referred to as the growth rate.
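A minimal sketch of a dense block (num_channels plays the role of the growth rate; each convolution block's output is concatenated with its input along the channel axis, so the channel count grows with every block):

import tensorflow as tf

def conv_block(num_channels):
    # BN -> ReLU -> 3x3 convolution, as commonly used inside dense blocks.
    return tf.keras.models.Sequential([
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation('relu'),
        tf.keras.layers.Conv2D(num_channels, 3, padding='same')])

class DenseBlock(tf.keras.layers.Layer):
    def __init__(self, num_convs, num_channels):
        super().__init__()
        self.blocks = [conv_block(num_channels) for _ in range(num_convs)]

    def call(self, x):
        for blk in self.blocks:
            y = blk(x)
            # Concatenate input and output along the channel dimension,
            # preserving and reusing features from earlier layers.
            x = tf.concat([x, y], axis=-1)
        return x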

Transition Layers

A transition layer is used to control the complexity of the model.

It reduces the number of channels by using a 1 × 1 convolution.

Moreover, it halves the height and width via average pooling with a stride of 2 (a sketch follows).
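Continuing the dense block sketch above, a transition layer could look like this (the BN and ReLU before the 1×1 convolution follow common DenseNet practice, not a requirement of the slides):

def transition_block(num_channels):
    # 1x1 convolution shrinks the channel count; stride-2 average pooling
    # halves the height and width.
    return tf.keras.models.Sequential([
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation('relu'),
        tf.keras.layers.Conv2D(num_channels, 1),
        tf.keras.layers.AvgPool2D(pool_size=2, strides=2)])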

Dense Block

A 5-layer dense block with a growth rate of k = 4. Each layer takes all
preceding feature-maps as input.
DenseNet

A deep DenseNet with three dense blocks. The layers between two adjacent blocks
are referred to as transition layers and change feature-map sizes via
convolution and pooling.
