Convolutional Neural Networks
Wolfgang Ecker, Lorenzo Servadei, Sebastian Schober, Daniela Lopera, Yezi Yang
Agenda
Intro to Convolutional Neural Networks
Convolutional Neural Networks – Layers and Structures
Fully Connected Layer
• How to process a small image with FC layers?
Stretch the image into a 3072x1 input vector, multiply it by a 10x3072 weight matrix W, and obtain a 10x1 activation.
1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
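A minimal NumPy sketch of the fully connected layer above, following the shapes in the figure (the 3072-vector comes from a 32x32x3 image; the bias term is my addition and not shown on the slide):

```python
# FC layer sketch: 3072-dimensional input, 10x3072 weight matrix,
# each output value is one row of W dotted with the input.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))        # small input image (random stand-in data)
x = image.reshape(3072)                # stretch to a 3072-dimensional vector
W = rng.random((10, 3072))             # weight matrix: 10 x 3072
b = rng.random(10)                     # one bias per output neuron (assumed)

activation = W @ x + b                 # 10 values; each is a 3072-dim dot product
print(activation.shape)                # (10,)
```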
Problems using FC layers on images
• How to process a normal image with FC layers?
A 1000x1000x3 image fully connected to a 1000-neuron layer needs 1000 · 1000 · 3 · 1000 = 3 billion weights in one layer!
Solution: weight sharing!
Why Convolution Layer?
What are convolutions?
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
$f$: red, $g$: blue, $f * g$: green (colors refer to the figure).
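A hedged sketch of the discrete counterpart of the convolution above, using NumPy's built-in `np.convolve` (the signal values are chosen only for illustration):

```python
# Discrete convolution: (f * g)[n] = sum_k f[k] * g[n - k]
import numpy as np

f = np.array([0.0, 1.0, 2.0, 1.0, 0.0])   # example signal (values assumed)
g = np.array([1.0, 1.0, 1.0]) / 3.0        # simple 3-tap averaging kernel

print(np.convolve(f, g, mode="full"))      # full (f * g), length len(f) + len(g) - 1
print(np.convolve(f, g, mode="same"))      # trimmed to len(f), as used for filtering
```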
Image filters
Edge detection: $\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}$    Box blur: $\frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$
Sharpen: $\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}$    Gaussian blur: $\frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$
Each kernel is convolved with the input image.
https://en.wikipedia.org/wiki/Kernel_(image_processing)
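A sketch of applying these kernels to a grayscale image, assuming SciPy is available; the random array stands in for a real picture:

```python
# Apply the classic image-processing kernels above with 2D convolution.
import numpy as np
from scipy.signal import convolve2d

edge    = np.array([[-1, -1, -1], [-1,  8, -1], [-1, -1, -1]], dtype=float)
box     = np.ones((3, 3)) / 9.0
sharpen = np.array([[ 0, -1,  0], [-1,  5, -1], [ 0, -1,  0]], dtype=float)
gauss   = np.array([[ 1,  2,  1], [ 2,  4,  2], [ 1,  2,  1]], dtype=float) / 16.0

img = np.random.default_rng(0).random((64, 64))   # stand-in grayscale image

for name, k in [("edge", edge), ("box blur", box),
                ("sharpen", sharpen), ("gaussian blur", gauss)]:
    out = convolve2d(img, k, mode="same", boundary="symm")
    print(name, out.shape)                        # each filtered output stays 64x64
```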
Convolution Layer
32x32x3 image -> preserve spatial structure: 32 (height) x 32 (width) x 3 (depth).
Convolution Layer
32x32x3 image, 5x5x3 filter. Convolve the filter with the image, i.e. slide it over the image spatially and compute dot products. Filters always extend the full depth of the input volume.
http://www.songho.ca/dsp/convolution/convolution.html#convolution_2d
Convolution Layer
32x32x3 image, 5x5x3 filter.
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias).
Convolution Layer: activation map
Convolving one 5x5x3 filter over all spatial locations of the 32x32x3 image produces a 28x28x1 activation map.
Convolution Layer
Consider a second (green) filter: a second 5x5x3 filter produces a second 28x28x1 activation map.
Convolution Layer
For example, if we had 6 5x5 filters, we'll get 6 separate activation maps; stacking them gives an output volume of size 28x28x6.
Preview: A ConvNet is a sequence of Convolution Layers, interspersed with activation functions.
32x32x3 input -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6
Preview: A ConvNet is a sequence of Convolutional Layers, interspersed with activation functions.
32x32x3 input -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> ...
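A minimal sketch of this CONV/ReLU sequence, assuming PyTorch is available; the shape checks mirror the numbers above:

```python
# 32x32x3 -> six 5x5x3 filters -> 28x28x6 -> ten 5x5x6 filters -> 24x24x10
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5),   # 6 filters of 5x5x3
    nn.ReLU(),
    nn.Conv2d(in_channels=6, out_channels=10, kernel_size=5),  # 10 filters of 5x5x6
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)       # one 32x32x3 image (channels-first layout)
print(net[0](x).shape)              # torch.Size([1, 6, 28, 28])
print(net(x).shape)                 # torch.Size([1, 10, 24, 24])
```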
Preview
One filter => one activation map.
A closer look at spatial dimensions:
A 5x5x3 filter over a 32x32x3 image produces a 28x28x1 activation map.
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 1 => 5x5 output.
7x7 input (spatially), assume a 3x3 filter applied with stride 2 => 3x3 output!
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 3?
Doesn't fit! We cannot apply a 3x3 filter to a 7x7 input with stride 3.
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
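A small helper implementing the formula above, extended with the zero padding introduced on the next slides (padding added to both sides); a sketch, not part of the original deck:

```python
def conv_output_size(N, F, stride, pad=0):
    """Spatial output size of a conv/pool layer; None if the filter doesn't fit evenly."""
    span = N + 2 * pad - F
    if span % stride != 0:
        return None          # e.g. a 3x3 filter on a 7x7 input with stride 3
    return span // stride + 1

print(conv_output_size(7, 3, 1))          # 5
print(conv_output_size(7, 3, 2))          # 3
print(conv_output_size(7, 3, 3))          # None: 2.33 doesn't fit
print(conv_output_size(7, 3, 1, pad=1))   # 7 (zero padding preserves the size)
```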
In practice: Common to zero pad the border.
e.g. input 7x7, 3x3 filter applied with stride 1, pad with a 1-pixel border => what is the output?
(recall: (N - F) / stride + 1)
7x7 output!
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F - 1)/2, which preserves the spatial size:
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
Remember back to…
E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 ...). Shrinking too fast is not good; it doesn't work well.
Examples time:
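As a concrete example of the output-size and parameter-count arithmetic, here is a sketch for one assumed configuration (input volume 32x32x3, ten 5x5 filters, stride 1, pad 2; these numbers are illustrative, not taken from the deck):

```python
F, K, stride, pad = 5, 10, 1, 2      # filter size, number of filters (assumed example)
N, depth_in = 32, 3                  # input spatial size and depth (assumed example)

out = (N + 2 * pad - F) // stride + 1
params = (F * F * depth_in + 1) * K  # +1 for the bias of each filter

print(out, out, K)        # output volume: 32 x 32 x 10 (this padding preserves size)
print(params)             # 760 parameters: (5*5*3 + 1) * 10
```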
Common settings:
(btw, 1x1 convolution layers make perfect sense)
56x56x64 input -> 1x1 CONV with 32 filters -> 56x56x32 output (each filter has size 1x1x64 and performs a 64-dimensional dot product).
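A minimal PyTorch sketch of the 1x1 convolution above (spatial size preserved, depth reduced from 64 to 32):

```python
import torch
import torch.nn as nn

conv1x1 = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=1)
x = torch.randn(1, 64, 56, 56)       # 56x56x64 input
print(conv1x1(x).shape)              # torch.Size([1, 32, 56, 56])
```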
Dilated Convolution
http://www.icst.pku.edu.cn/struct/Projects/joint_rain_removal.html
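A hedged PyTorch sketch of a dilated convolution: the filter taps are spaced by the dilation factor, so a 3x3 kernel with dilation 2 covers a 5x5 region and enlarges the receptive field without extra parameters:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)
dense   = nn.Conv2d(1, 1, kernel_size=3, dilation=1)   # covers a 3x3 region
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)   # covers a 5x5 region

print(dense(x).shape)     # torch.Size([1, 1, 30, 30])  (32 - 3 + 1)
print(dilated(x).shape)   # torch.Size([1, 1, 28, 28])  (32 - 5 + 1, effective 5x5)
```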
The brain/neuron view of CONV Layer
32x32x3 image, 5x5x3 filter.
1 number: the result of taking a dot product between the filter and this part of the image (i.e. a 5*5*3 = 75-dimensional dot product). Each such value can be seen as the output of a neuron with a local receptive field, and all neurons in one activation map share the same weights.
Two more layers to go: POOL/FC
Pooling layer
Max Pooling
Single depth slice (4x4 input):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
max pool with 2x2 filters and stride 2 =>
6 8
3 4
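A quick PyTorch check of exactly this max-pooling example (2x2 filters, stride 2):

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).squeeze())    # tensor([[6., 8.], [3., 4.]])
```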
Common settings:
F = 2, S = 2
F = 3, S = 2
Translation Equivariance and Invariance
Fully Connected Layer (FC layer)
- Contains neurons that connect to the entire input volume, as in ordinary Neural Networks.
Reminder: Fully Connected Layer
Stretch the input to a 3072x1 vector and multiply by the 10x3072 weight matrix W to get a 10x1 activation.
1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
Summary
Classic Architectures
Review: LeNet-5
[LeCun et al., 1998]
Case Study: AlexNet
[Krizhevsky et al. 2012]
Input: 227x227x3 images.
First layer (CONV1): 96 11x11 filters applied at stride 4. Output volume: ?x?x? => 55x55x96, since (227 - 11)/4 + 1 = 55.
Case Study: AlexNet
[Krizhevsky et al. 2012]
After CONV1 (11x11 filters, stride 4): 55x55x96.
Second layer (MAX POOL1): 3x3 filters applied at stride 2 => 27x27x96, since (55 - 3)/2 + 1 = 27.
Case Study: AlexNet
[Krizhevsky et al. 2012]
Later layers: three 3x3 CONV layers at stride 1, followed by a 3x3 MAX POOL at stride 2.
Case Study: AlexNet
[Krizhevsky et al. 2012]
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
ZFNet: Improved hyperparameters over AlexNet
ZFNet [Zeiler and Fergus, 2013]
AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 16.4% -> 11.7%
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Deeper Networks
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Small filters, deeper networks: stacks of 3x3 conv layers with periodic pooling, ending in Pool, FC 4096, FC 4096, FC 1000, Softmax.
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Q: What is the effective receptive field of three 3x3 conv (stride 1) layers?
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer, but is deeper, has more non-linearities, and uses fewer parameters: 3·(3²·C²) vs. 7²·C² for C channels per layer.
VGG-16
(Architecture figure) Input 224x224x3 -> [3x3 conv, 64] x2 -> Pool -> [3x3 conv, 128] x2 -> Pool -> [3x3 conv, 256] x3 -> Pool -> [3x3 conv, 512] x3 -> Pool -> [3x3 conv, 512] x3 -> Pool -> FC 4096 -> FC 4096 -> FC 1000 -> Softmax.
(VGG-19: slightly better, more memory.)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Deeper Networks
Inception Layer
Inception layer
[Szegedy et al., 2014]
› Not sure of the filter size? Use them all!
Apply parallel filter operations on the input from the previous layer:
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- A pooling operation (3x3 max pooling)
Concatenate all filter outputs together depth-wise ("filter concatenation").
This is the naive Inception module.
Inception layer
[Szegedy et al., 2014]
Q: What is the problem with this? [Hint: computational complexity]
Example: the module input is 28x28x256, with parallel 1x1 conv (128 filters), 3x3 conv (192 filters), 5x5 conv (96 filters), and 3x3 pooling.
The branch outputs are 28x28x128, 28x28x192, 28x28x96, and 28x28x256; after filter concatenation the output is 28x28x(128+192+96+256) = 28x28x672.
The convolutions over the full 256-channel input are very expensive, and because pooling preserves the input depth, the total depth can only grow after every module.
Reminder: 1x1 convolutions
56x56x64 input -> 1x1 CONV with 32 filters -> 56x56x32 (each filter has size 1x1x64 and performs a 64-dimensional dot product).
Preserves spatial dimensions, reduces depth!
Inception layer
[Szegedy et al., 2014]
Naive Inception module vs. Inception module with dimension reduction: add 1x1 conv "bottleneck" layers that reduce the depth before the expensive 3x3 and 5x5 convolutions, and after the pooling branch.
Inception layer
[Szegedy et al., 2014]
Using the same parallel layers as the naive example, and adding "1x1 conv, 64 filter" bottlenecks:
Branch outputs: 28x28x128 (1x1, 128), 28x28x192 (3x3, 192), 28x28x96 (5x5, 96), 28x28x64 (1x1, 64 after the 3x3 pool); the bottlenecks produce 28x28x64 from the 28x28x256 input.
Module output: 28x28x480.
Conv Ops:
[1x1 conv, 64]  28x28x64x1x1x256
[1x1 conv, 64]  28x28x64x1x1x256
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x64
[5x5 conv, 96]  28x28x96x5x5x64
[1x1 conv, 64]  28x28x64x1x1x256
Total: 358M ops
Case Study: GoogLeNet
[Szegedy et al., 2014]
- 22 layers
- Efficient "Inception" module
- No FC layers
- Only 5 million parameters! 12x less than AlexNet
- ILSVRC'14 classification winner (6.7% top 5 error)
Case Study: GoogLeNet
[Szegedy et al., 2014]
Full GoogLeNet architecture:
- Stem network: Conv - Pool - 2x Conv - Pool
- Stacked Inception modules
- Classifier output (removed the expensive FC layers!)
- 22 total layers with weights (including each parallel layer in an Inception module)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Skip Connections
“Revolution of Depth”
The problem of depth
(Plots: training error and test error vs. iterations.) The 56-layer plain network has higher training error and higher test error than the 20-layer network, so the problem is not overfitting: deeper plain networks are simply harder to optimize.
Residual block
[He et al., 2015]
Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping.
"Plain" layers: x -> conv -> relu -> conv -> H(x).
Residual block: x -> conv -> relu -> conv -> F(x); the output is F(x) + x via an identity skip connection, followed by relu.
H(x) = F(x) + x: use the layers to fit the residual F(x) = H(x) - x instead of fitting H(x) directly.
Why does the residual block work?
$x^{L+1} = f(W^{L+1} \cdot x^{L} + b^{L+1} + x^{L-1})$
What happens if $W^{L+1}$ and $b^{L+1}$ are (approximately) zero?
$x^{L+1} = f(x^{L-1})$
We kept the same values and only added a non-linearity.
• The identity is easy for the residual block to learn.
• So adding the block is guaranteed not to hurt performance; it can only improve it.
Case Study: ResNet
[He et al., 2015]
Full ResNet architecture: Input -> 7x7 conv, 64, /2 -> Pool -> stacked residual blocks (each with two 3x3 conv layers, e.g. 3x3 conv, 64) -> ... -> Pool -> FC 1000 -> Softmax.
Case Study: ResNet
[He et al., 2015]
The network begins with an additional conv layer: 7x7 conv, 64, stride 2, followed by pooling.
Case Study: ResNet
[He et al., 2015]
No FC layers besides the final FC 1000 that outputs the class scores.
Case Study: ResNet
[He et al., 2015]
For deeper networks (ResNet-50+), use a "bottleneck" layer to improve efficiency (similar to GoogLeNet): 28x28x256 input -> 1x1 conv, 64 (reduce depth) -> 3x3 conv, 64 (the 3x3 conv operates over only 64 feature maps) -> 1x1 conv, 256 filters (projects back to 256 feature maps) -> 28x28x256 output.
Case Study: ResNet
[He et al., 2015]
Very deep networks using residual connections:
- 152-layer model for ImageNet
- ILSVRC'15 classification winner
- Swept all classification and detection competitions in ILSVRC'15 and COCO'15
Case Study: ResNet
[He et al., 2015]
- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used
Case Study: ResNet
[He et al., 2015]
Experimental Results:
- Able to train very deep networks without degrading (152 layers on ImageNet, 1202 on CIFAR)
- Deeper networks now achieve lower training error, as expected
- Swept 1st place in all ILSVRC and COCO 2015 competitions
ILSVRC 2015 classification winner (3.6% top 5 error) -- better than "human performance"! (Russakovsky 2014)
Comparing complexity...
(Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.)
- Inception-v4: ResNet + Inception!
- VGG: highest memory, most operations
- GoogLeNet: most efficient
- AlexNet: smaller compute, still memory heavy, lower accuracy
- ResNet: moderate efficiency depending on model, highest accuracy
Forward pass time and power consumption (figure)
Other architectures to know...
Improving ResNets...
Beyond ResNets...
Densely Connected Convolutional Networks (DenseNet): dense blocks in which each conv layer's output is concatenated with the inputs of the layers that follow; this strengthens feature propagation and encourages feature reuse.
Efficient networks...
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
[Iandola et al. 2017]
SqueezeNet
Fire module (figure): Input (64 channels) -> "squeeze" 1x1 conv (16 filters) -> "expand" 1x1 conv (64 filters) and 3x3 conv (64 filters) in parallel -> Concat/Eltwise -> Output (128 channels).
Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016
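A hedged PyTorch sketch of the Fire module in the figure (the ReLU placement is my assumption; the channel counts follow the figure):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch=64, squeeze=16, expand=64):
        super().__init__()
        self.squeeze   = nn.Conv2d(in_ch, squeeze, kernel_size=1)             # 1x1 squeeze
        self.expand1x1 = nn.Conv2d(squeeze, expand, kernel_size=1)            # 1x1 expand
        self.expand3x3 = nn.Conv2d(squeeze, expand, kernel_size=3, padding=1) # 3x3 expand
        self.relu = nn.ReLU()

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        # Concatenate the two expand branches along the depth dimension.
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)

x = torch.randn(1, 64, 28, 28)
print(Fire()(x).shape)    # torch.Size([1, 128, 28, 28]): 64 + 64 expand outputs
```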
Xception Module
Depthwise separable convolutions: channel-wise spatial (3x3) convolutions combined with 1x1 cross-channel convolutions.
https://arxiv.org/abs/1610.02357
Temporal Convolutional Networks (TCN)
Exploit dilated convolutions and residual connections to cover a longer history and stabilize gradients in deep networks.
Also...
- Temporal Convolutional Networks
- DenseNet
- SqueezeNet
- Xception Module
Sources
• http://cs231n.stanford.edu/2017/syllabus.html
• https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6509978