Convolutional Neural Networks (CNN) : Convolutions
applying filters to the input data; essentially, the filters replace the weights, with the different kernel convolution operations being responsible for the filter effect
the same weights are used for different parts of the image; intuitively, if a feature of one image is interesting, it will probably also be interesting in another image
Convolutions
convolve (German: falten, "to fold") = applying a filter to a function; filter in the sense of a matrix/grid of values that alters the output of a given function
sliding the filter kernel from left to right, multiplying and summing up all overlapping fields
applying the same filter to all pixels of an image is the idea of weight sharing
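As a minimal sketch of this sliding-window operation (illustrative NumPy code, not from the lecture; the averaging kernel is just an example):

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image; at every position, multiply the
    # overlapping fields elementwise and sum them up (valid mode, stride 1).
    # As usual in deep learning, the kernel is not flipped.
    H, W = image.shape
    F, _ = kernel.shape
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+F, j:j+F] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging filter
print(conv2d(image, kernel).shape)                # (3, 3)
```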
Convolution on Images
image of 5x5 with a convolutional filter of size 3x3 generating an output of size 3x3
same procedure as before: slide filter over image and apply filter through dot product at every position, resulting in z_i = w^T x_i + b
where the weights w represent the (flattened) filter and x_i the flattened image patch, both of dimension (5·5·3)×1 for e.g. a 5x5x3 filter on a 3-channel image; note that each output z_i is a scalar (dimension 1)
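A sketch of this dot-product view, assuming a 5x5x3 filter sliding over a 32x32x3 image (variable names are illustrative):

```python
import numpy as np

image = np.random.rand(32, 32, 3)   # H x W x C input
w = np.random.rand(5 * 5 * 3)       # 5x5x3 filter flattened to (5*5*3)x1
b = 0.1

z = np.zeros((28, 28))              # 32 - 5 + 1 = 28 positions per axis
for i in range(28):
    for j in range(28):
        x_i = image[i:i+5, j:j+5, :].ravel()  # flattened image patch
        z[i, j] = w @ x_i + b                 # z_i = w^T x_i + b, a scalar
```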
Convolution Layer
def.: applying different filters to the same image; every filter we apply to the image creates a new activation map, e.g. applying 2 filters (of size 5x5) to a 32 x 32 x 3 image results in a 28 x 28 x 2 output volume
layer defined by filter width & height; the depth is implicitly given by the input depth (the dot product runs over all input channels)
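The same example as a short PyTorch sketch (the 5x5 filter size is an assumption implied by 32 → 28):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)    # one 32x32x3 image in NCHW layout
conv = nn.Conv2d(in_channels=3,  # depth implicitly given by the input
                 out_channels=2, # 2 filters -> 2 activation maps
                 kernel_size=5)  # 5x5 filters, no padding
print(conv(x).shape)             # torch.Size([1, 2, 28, 28])
```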
without padding the outputs shrink with every layer, which is not a good idea
padding assures that corner pixels are considered as well and that image sizes don't shrink as quickly as they otherwise would → most common padding: zero-padding, leading to output size:
((N + 2·P − F) / S + 1) × ((N + 2·P − F) / S + 1)
N: width of the image
F: width of the filter
P: amount of padding; with stride 1, P should usually be set to P = (F − 1)/2 to preserve the input size
S: stride
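A quick sanity check of the output-size formula (hypothetical helper, not from the lecture):

```python
def conv_output_size(N, F, P, S):
    # (N + 2*P - F) / S + 1 per spatial dimension
    return (N + 2 * P - F) // S + 1

print(conv_output_size(N=32, F=5, P=0, S=1))  # 28 -> no padding shrinks the output
print(conv_output_size(N=32, F=5, P=2, S=1))  # 32 -> P = (F-1)/2 preserves the size
```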
Pooling
another operator heavily used in CNNs
using padding assures that the images don't shrink as we apply the filters; pooling allows us to shrink the feature maps anyway, but only when required → reducing feature map size
Different ways:
Max Pooling: define equally sized regions within the input and then create a new pooled output consisting of the highest number from each corresponding input region
if more than one highest number exists within a region, just take either one
Average Pooling: averaging all values of a region instead of taking the max value
conv layer = feature extraction, computing a feature in a given region; pooling layer = feature selection, picking the strongest activation in a region
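Both pooling variants as a short PyTorch sketch (region size 2x2 is just an example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 2, 28, 28)                    # feature maps from a conv layer
max_pool = nn.MaxPool2d(kernel_size=2, stride=2) # strongest activation per region
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2) # average value per region
print(max_pool(x).shape)                         # torch.Size([1, 2, 14, 14])
print(avg_pool(x).shape)                         # torch.Size([1, 2, 14, 14])
```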
Other properties
CNN prototype; an FC layer applies brute force, connecting everything with everything, not using shared weights and thus not applying the convolutional inductive bias
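To make the brute-force point concrete, a comparison of parameter counts for the 32x32x3 example from above (layer sizes are illustrative):

```python
import torch.nn as nn

fc = nn.Linear(32 * 32 * 3, 28 * 28 * 2)  # everything connected with everything
conv = nn.Conv2d(3, 2, kernel_size=5)     # shared 5x5 weights
print(sum(p.numel() for p in fc.parameters()))    # 4818464 (~4.8M)
print(sum(p.numel() for p in conv.parameters()))  # 152
```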
Receptive Field
describes the region of input pixels from which a pixel of a feature map has been computed (through the stacked dot products)
the deeper one goes into a network, the bigger the receptive field becomes
preferably, use more layers with smaller filters (e.g. 3 layers with filter size 3x3) as this also injects more non-linearity (with every additional layer) and uses fewer weights → less overfitting
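A small check of how the receptive field grows when stacking 3x3, stride-1 convolutions (hypothetical helper):

```python
def receptive_field(num_layers, kernel_size=3):
    # stride 1 throughout: every extra layer adds (kernel_size - 1) pixels
    return 1 + num_layers * (kernel_size - 1)

print(receptive_field(3))  # 7 -> three 3x3 layers see as much as one 7x7 filter,
                           # with 3*9 = 27 weights per channel pair instead of 49
```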
Classic Architectures
LeNet
top-1 score: checking if the sample's top class, i.e. the one with the highest predicted probability, is the same as the target label
AlexNet
1000 outputs for 1000 classes: in order to get from the spatial data of size 6x6x256 to the class scores, we use fully connected layers converting the data into 9216 values, then 4096, again 4096, and finally 1000
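The fully connected head described above as a PyTorch sketch (the ReLUs between the linear layers are an assumption):

```python
import torch.nn as nn

head = nn.Sequential(
    nn.Flatten(),               # 6x6x256 spatial data -> 9216 values
    nn.Linear(9216, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),      # 1000 outputs for 1000 classes
)
```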
VGGNet simplifies AlexNet by fixing CONV = 3x3 filters with stride 1 & MAXPOOL = 2x2 filters with stride 2
again switching between CONV & POOL in 16 layers; again width & height decrease and the number of filters increases as we go deeper, resulting in 138 million parameters
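The parameter count can be verified with torchvision's reference implementation:

```python
import torchvision.models as models

vgg16 = models.vgg16()
print(sum(p.numel() for p in vgg16.parameters()))  # 138357544 -> ~138 million
```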
Residual Block - how can we train very deep nets (i.e. more layers) while
keeping training stable?
ResNet Block
ResNets come with a set of good network design choices - mostly used in computer vision networks to classify images
if the weights of the residual branch become 0, the output of layer L+1 will be equal to the input from layer L-1, so nothing changes; without such skip connections, vanishing gradients are the reason why we can't have an unlimited number of main layers
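A minimal residual block sketch in PyTorch (simplified; the real ResNet block also uses batch normalization):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))
        # if the residual branch outputs 0, the block reduces to the identity
        return self.relu(x + residual)

print(ResidualBlock(64)(torch.randn(1, 64, 8, 8)).shape)  # shape unchanged
```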
1x1 Convolutions
on a single channel it simply scales the input by a constant while keeping its dimensions; across many channels it computes a per-position dot product over the depth, so with fewer filters than input channels it reduces the depth of the volume
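For example, scaling down the depth of a volume (channel counts are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)             # deep input volume
reduce = nn.Conv2d(256, 64, kernel_size=1)  # 1x1 convolution over the channels
print(reduce(x).shape)                      # torch.Size([1, 64, 28, 28])
```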
Inception Layer
core idea: too many feature maps result in huge computational costs; reduce the number of channels with 1x1 convolutions
finding the perfect number of filters → choose them all: same convolutions with different sizes + 3x3 max pooling with stride 1, concatenated (see the sketch below)
GoogLeNet uses inception blocks with an extra max pool layer added to reduce dimensionality
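A simplified inception block sketch (branch widths are illustrative; the real GoogLeNet block additionally places 1x1 reductions before the larger filters):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        # "choose them all": parallel same convolutions of different sizes
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b2 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.b3 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)
        self.b4 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # concatenate all branches along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 32, 28, 28)
print(InceptionBlock(32)(x).shape)  # torch.Size([1, 80, 28, 28]) -> 16+16+16+32
```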
a fully convolutional network takes the activation/feature maps in the last few layers and turns that information into a classification result, performing unpooling to recover spatial resolution
U-Net
from left (contraction path, i.e. encoder) to right (expansion path, i.e. decoder): performing a series of convolutions (feature extraction) and pooling operations (feature selection) → during encoding we lose spatial detail, therefore the encoder results are copied to the decoder such that it also has the previous information
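A heavily simplified encoder/decoder sketch with one skip connection (real U-Nets stack several such levels):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(1, 16, 3, padding=1)          # feature extraction
        self.pool = nn.MaxPool2d(2)                        # feature selection, loses detail
        self.up = nn.ConvTranspose2d(16, 16, 2, stride=2)  # expansion path
        self.dec = nn.Conv2d(32, 1, 3, padding=1)          # 32 = upsampled + copied features

    def forward(self, x):
        skip = self.enc(x)                # encoder result, copied to the decoder
        y = self.up(self.pool(skip))
        y = torch.cat([y, skip], dim=1)   # hand the lost spatial detail back
        return self.dec(y)

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])
```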