Why Convolutions?: Till Now in MLP
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

# Define the neural network architecture
model = Sequential()
# Flatten layer converts the 28x28 image matrix into a 784-element vector
model.add(Flatten(input_shape=(28, 28)))
model.add(Dense(512, activation='relu'))
model.add(Dense(512, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.summary()
• Params after the flatten layer = 0, because this layer only flattens the image to a vector
for feeding into the input layer. The weights haven’t been added yet.
• Params after layer 1 = (784 nodes in input layer) × (512 in hidden layer 1) + (512
connections to biases) = 401,920.
• Params after layer 2 = (512 nodes in hidden layer 1) × (512 in hidden layer 2) + (512
connections to biases) = 262,656.
• Params after layer 3 = (512 nodes in hidden layer 2) × (10 in output layer) + (10 connections to biases) = 5,130.
• Total params in the network = 401,920 + 262,656 + 5,130 = 669,706.
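These totals can be sanity-checked with a few lines of Python (a quick sketch, not part of the original notebook):

# Verify the MLP parameter counts: weights + biases for each pair of layers
layer_sizes = [28 * 28, 512, 512, 10]   # input, hidden 1, hidden 2, output

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_in * n_out + n_out       # weight matrix plus one bias per output node
    print(f"{n_in} -> {n_out}: {params:,} params")
    total += params

print(f"Total: {total:,} params")       # 669,706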
Why Convolutions?
LOSS OF SPATIAL FEATURES
The spatial features of a 2D image are lost when it is flattened to a 1D vector. Before feeding an image to the hidden layers of an MLP, we must flatten the image matrix into a 1D vector, as we saw in the mini project. This means that all of the image's 2D structure, including the relationships between neighboring pixels, is discarded.
# Download a sample image from the COCO dataset
!wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg

import cv2
from google.colab.patches import cv2_imshow   # Colab helper for displaying OpenCV images

im = cv2.imread("./input.jpg")
cv2_imshow(im)
Image Matrix
im   # display the raw pixel array (output truncated)
For example, an image with dimensions 1,000 × 1,000 yields 1 million weights for each node in the first hidden layer.
• So if the first hidden layer has 1,000 neurons, this yields 1 billion parameters, even though the network is only one layer deep. You can imagine the computational complexity of optimizing 1 billion parameters after only the first layer.
A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers (often
with a subsampling step) and then followed by one or more fully connected layers as in a
standard multilayer neural network. The architecture of a CNN is designed to take advantage of
the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved
with local connections and tied weights followed by some form of pooling which results in
translation invariant features. Another benefit of CNNs is that they are easier to train and have
many fewer parameters than fully connected networks with the same number of hidden units. In this article we will discuss the architecture of a CNN and the backpropagation algorithm used to compute the gradient with respect to the model's parameters for gradient-based optimization.
Matrix Calculation
Padding Concept
Stride Concept
Feature Accumulation
Feature Aggregation
Convolution Operation
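A minimal NumPy sketch of these ideas, using an assumed 5x5 input: zero padding, a stride, and the multiply-and-sum that aggregates each patch into one output value.

import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Naive 2D convolution (really cross-correlation, as used in CNNs)."""
    if pad:
        image = np.pad(image, pad, mode='constant')      # padding: border of zeros
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1        # output dimension
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)           # aggregate the patch into one value
    return out

image = np.arange(25).reshape(5, 5)                      # assumed toy 5x5 input
kernel = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])

print(conv2d(image, kernel))            # no padding, stride 1: 3x3 output
print(conv2d(image, kernel, pad=1))     # zero padding of 1: 5x5 output (same size)
print(conv2d(image, kernel, stride=2))  # stride 2: 2x2 output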
The CNN Complete Network Overview
import matplotlib.pyplot as plt
%matplotlib inline

# Convert the BGR image loaded by OpenCV to RGB so matplotlib displays the colors correctly
image = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
plt.imshow(image)
# Isolate RGB channels
r = image[:,:,0]
g = image[:,:,1]
b = image[:,:,2]
Focusing on Filters
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import cv2
import numpy as np
%matplotlib inline
plt.imshow(image)
# Convert to grayscale for filtering
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
plt.imshow(gray, cmap='gray')
Best Place to Explore Kernels
Kernels
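As a concrete example of applying a hand-crafted kernel, the sketch below runs a Sobel-style edge-detection filter over the grayscale image produced above using cv2.filter2D (the specific kernel values are an assumption for illustration, not from the original notebook):

import numpy as np
import cv2
import matplotlib.pyplot as plt

# Sobel-style kernel that responds strongly to vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

# Convolve the grayscale image with the kernel (float output keeps negative responses)
filtered = cv2.filter2D(gray.astype(np.float32), -1, sobel_x)

plt.imshow(filtered, cmap='gray')
plt.title('Vertical edges picked out by the Sobel-x kernel')
plt.show()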
Intuition
Let's develop better intuition for how Convolutional Neural Networks (CNN) work. We'll
examine how humans classify images, and then see how CNNs use similar approaches.
Let’s say we wanted to classify the following image of a dog as a Golden Retriever:
One thing we do is identify certain parts of the dog, such as the nose, the eyes, and the fur. We essentially break up the image into smaller pieces, recognize the smaller pieces, and then combine those pieces to get an idea of the overall dog.
In this case, we might break down the image into a combination of the following:
• A nose
• Two eyes
• Golden fur
Broadly speaking, this is what a CNN learns to do. It learns to recognize basic lines and curves,
then shapes and blobs, and then increasingly complex objects within the image. Finally, the CNN
classifies the image by combining the larger, more complex objects.
With deep learning, we don't actually program the CNN to recognize these specific features.
Rather, the CNN learns on its own to recognize such objects through forward propagation and
backpropagation!
It's amazing how well a CNN can learn to classify images, even though we never program the
CNN with information about specific features to look for.
An example of what each layer in a CNN might recognize when classifying a picture of a dog
A CNN might have several layers, and each layer might capture a different level in the hierarchy
of objects. The first layer is the lowest level in the hierarchy, where the CNN generally classifies
small parts of the image into simple shapes like horizontal and vertical lines and simple blobs of
colors. The subsequent layers tend to be higher levels in the hierarchy and generally classify
more complex ideas like shapes (combinations of lines), and eventually full objects like dogs.
Once again, the CNN learns all of this on its own. We don't ever have to tell the CNN to go
looking for lines or curves or noses or fur. The CNN just learns from the training set and
discovers which characteristics of a Golden Retriever are worth looking for.
Filters
Breaking up an Image
The first step for a CNN is to break up the image into smaller pieces. We do this by selecting a
width and height that defines a filter.
The filter looks at small pieces, or patches, of the image. These patches are the same size as the
filter.
A CNN uses filters to split an image into smaller patches. The size of these patches matches the
filter size.
We then simply slide this filter horizontally or vertically to focus on a different piece of the
image.
The amount by which the filter slides is referred to as the 'stride'. The stride is a hyperparameter
which the engineer can tune. Increasing the stride reduces the size of your model by reducing
the number of total patches each layer observes. However, this usually comes with a reduction
in accuracy.
Let’s look at an example. In this zoomed in image of the dog, we first start with the patch
outlined in red. The width and height of our filter define the size of this square.
We then move the square over to the right by a given stride (2 in this case) to get another patch.
We move our square to the right by two pixels to create another patch.
What's important here is that we are grouping together adjacent pixels and treating them as a
collective.
By taking advantage of this local structure, our CNN learns to classify local patterns, like shapes
and objects, in an image.
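To make the sliding-window idea concrete, the sketch below extracts every patch a 3x3 filter with stride 2 would see from an assumed 8x8 image (plain NumPy, not from the original notebook):

import numpy as np

image = np.arange(64).reshape(8, 8)   # assumed toy 8x8 "image"
filter_size, stride = 3, 2

patches = []
for top in range(0, image.shape[0] - filter_size + 1, stride):
    for left in range(0, image.shape[1] - filter_size + 1, stride):
        patches.append(image[top:top+filter_size, left:left+filter_size])

# 3 positions per axis -> 9 patches; a larger stride would give fewer patches
print(len(patches))   # 9
print(patches[0])     # the top-left 3x3 patch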
Filter Depth
It's common to have more than one filter. Different filters pick up different qualities of a patch. For example, one filter might look for a particular color, while another might look for an object of a particular shape. The number of filters in a convolutional layer is called the filter depth.
In the above example, a patch is connected to a neuron in the next layer. Source: Michael Nielsen.
How many neurons does each patch connect to? That’s dependent on our filter depth. If we have a depth of k, we connect each patch of pixels to k neurons in the next layer. This gives us the height of k in the next layer, as shown below. In practice, k is a hyperparameter we tune, and most CNNs tend to pick the same starting values.
Choosing a filter depth of k connects each patch to k neurons in the next layer
But why connect a single patch to multiple neurons in the next layer? Isn’t one neuron good
enough?
Multiple neurons can be useful because a patch can have multiple interesting characteristics that
we want to capture.
For example, one patch might include some white teeth, some blonde whiskers, and part of a
red tongue. In that case, we might want a filter depth of at least three - one for each of teeth,
whiskers, and tongue.
This patch of the dog has many interesting features we may want to capture. These include the
presence of teeth, the presence of whiskers, and the pink color of the tongue.
Having multiple neurons for a given patch ensures that our CNN can capture whatever characteristics it learns are important.
Remember that the CNN isn't "programmed" to look for certain characteristics. Rather, it learns
on its own which characteristics to notice.
Parameters
Parameter Sharing
The weights, w, are shared across patches for a given layer in a CNN to detect the cat above
regardless of where in the image it is located.
When we are trying to classify a picture of a cat, we don’t care where in the image a cat is. If it’s
in the top left or the bottom right, it’s still a cat in our eyes. We would like our CNNs to also
possess this ability known as translation invariance. How can we achieve this?
As we saw earlier, the classification of a given patch in an image is determined by the weights
and biases corresponding to that patch.
If we want a cat that’s in the top left patch to be classified in the same way as a cat in the bottom
right patch, we need the weights and biases corresponding to those patches to be the same, so
that they are classified the same way.
This is exactly what we do in CNNs. The weights and biases we learn for a given output layer are
shared across all patches in a given input layer. Note that as we increase the depth of our filter,
the number of weights and biases we have to learn still increases, as the weights aren't shared
across the output channels.
There’s an additional benefit to sharing our parameters. If we did not reuse the same weights
across all patches, we would have to learn new parameters for every single patch and hidden
layer neuron pair. This does not scale well, especially for higher fidelity images. Thus, sharing
parameters not only helps us with translation invariance, but also gives us a smaller, more
scalable model.
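To see how much parameter sharing buys us, compare a convolutional layer with a fully connected layer that produces the same number of outputs (a rough sketch; the 28x28 input and 32 filters are assumed values, chosen to match the MNIST model later in this section):

# Conv2D: 3x3 kernels, 1 input channel, 32 filters; the weights are shared across every patch
conv_params = (3 * 3 * 1 + 1) * 32                             # 320 parameters, independent of image size

# Dense: flattened 28x28 input fully connected to 28*28*32 outputs; nothing is shared
dense_params = (28 * 28) * (28 * 28 * 32) + (28 * 28 * 32)     # about 19.7 million parameters

print(conv_params, dense_params)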
Padding
Let's say we have a 5x5 grid (as shown above) and a filter of size 3x3 with a stride of 1. What's
the width and height of the next layer? We see that we can fit at most three patches in each
direction, giving us a dimension of 3x3 in our next layer. As we can see, the width and height of
each subsequent layer decreases in such a scheme.
In an ideal world, we'd be able to maintain the same width and height across layers so that we
can continue to add layers without worrying about the dimensionality shrinking and so that we
have consistency. How might we achieve this? One way is to simply add a border of 0s to our original 5x5 image. You can see what this looks like in the image below:
This would expand our original image to a 7x7. With this, we now see how our next layer's size is
again a 5x5, keeping our dimensionality consistent.
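In general, with an n × n input, an f × f filter, padding p, and stride s, each output dimension is (n + 2p - f)/s + 1, rounded down. A two-line check of the 5x5 example above (a sketch, not from the original notebook):

def output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1      # spatial size of a conv/pool output

print(output_size(5, 3))        # no padding ('valid'): 3
print(output_size(5, 3, p=1))   # border of 0s ('same'): 5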
Visualizing CNNs
Let’s look at an example CNN to see how it works in action.
The CNN we will look at is trained on ImageNet as described in this paper by Zeiler and Fergus.
In the images below (from the same paper), we’ll see what each layer in this network detects
and see how each layer detects more and more complex ideas.
Layer 1
Example patterns that cause activations in the first layer of the network.
These range from simple diagonal lines (top left) to green blobs (bottom middle).
The images above are from Matthew Zeiler and Rob Fergus' deep visualization toolbox, which
lets us visualize what each layer in a CNN focuses on.
Each image in the above grid represents a pattern that causes the neurons in the first layer to
activate - in other words, they are patterns that the first layer recognizes. The top left image
shows a -45 degree line, while the middle top square shows a +45 degree line. These squares
are shown below again for reference:
As visualized here, the first layer of the CNN can recognize -45 degree lines.
The first layer of the CNN is also able to recognize +45 degree lines, like the one
above.
Let's now see some example images that cause such activations. The below grid of images all
activated the -45 degree line. Notice how they are all selected despite the fact that they have
different colors, gradients, and patterns.
Example patches that activate the -45 degree line detector in the first layer.
So, the first layer of our CNN clearly picks out very simple shapes and patterns like lines and
blobs.
Layer 2
A visualization of the second layer in the CNN. Notice how we are picking up more complex
ideas like circles and stripes. The gray grid on the left represents how this layer of the CNN
activates (or "what it sees") based on the corresponding images from the grid on the right.
As you see in the image above, the second layer of the CNN recognizes circles (second row,
second column), stripes (first row, second column), and rectangles (bottom right).
The CNN learns to do this on its own. There is no special instruction for the CNN to focus on
more complex objects in deeper layers. That's just how it normally works out when you feed
training data into a CNN.
Layer 3
A visualization of the third layer in the CNN. The gray grid on the left represents how this layer
of the CNN activates (or "what it sees") based on the corresponding images from the grid on the
right.
The third layer picks out complex combinations of features from the second layer. These include
things like grids, and honeycombs (top left), wheels (second row, second column), and even
faces (third row, third column).
Layer 5
A visualization of the fifth and final layer of the CNN. The gray grid on the left represents how
this layer of the CNN activates (or "what it sees") based on the corresponding images from the
grid on the right.
We'll skip layer 4, which continues this progression, and jump right to the fifth and final layer of
this CNN.
The last layer picks out the highest order ideas that we care about for classification, like dog
faces, bird faces, and bicycles.
Convolutions and Max Pooling
Convolutions
The above is an example of a convolution with a 3x3 filter and a stride of 1 being applied to data
with a range of 0 to 1. The convolution for each 3x3 section is calculated against the weight,
[[1, 0, 1], [0, 1, 0], [1, 0, 1]], then a bias is added to create the convolved
feature on the right. In this case, the bias is zero.
In TensorFlow's low-level convolution ops, the stride is an array of 4 elements: the first element is the stride across the batch dimension, and the last element is the stride across the features (channels). It's good practice to remove the batches or features you want to skip from the dataset rather than use the stride to skip them, so you can always set the first and last elements of the stride to 1 in order to use all batches and features.
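This 4-element stride is the format used by TensorFlow's tf.nn.conv2d, whose input is ordered (batch, height, width, channels). A sketch applying the 3x3 weight from the example above (the 5x5 input values are an assumption):

import numpy as np
import tensorflow as tf

x = np.arange(25, dtype=np.float32).reshape(1, 5, 5, 1)     # (batch, height, width, channels)
w = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]], dtype=np.float32).reshape(3, 3, 1, 1)   # (h, w, in_channels, out_channels)

# First and last stride elements are 1: use every batch and every feature
out = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='VALID')
print(out.numpy().squeeze())    # the 3x3 convolved feature (bias is zero, as in the example)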
Max Pooling
The above is an example of max pooling with a 2x2 filter and stride of 2. The left square is the
input and the right square is the output. The four 2x2 colors in input represents each time the
filter was applied to create the max on the right side. For example, [[1, 1], [5, 6]]
becomes 6 and [[3, 2], [1, 2]] becomes 3.
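The same example in code (only the two quoted patches appear above; the full 4x4 input is assumed from the standard illustration):

import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# 2x2 max pooling with stride 2: take the max of each non-overlapping 2x2 block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 8]
                #  [3 4]]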
num_classes = 10
# print first ten (integer-valued) training labels
print('Integer-valued labels:')
print(y_train[:10])
Integer-valued labels:
[5 0 4 1 9 2 1 3 1 4]
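The one-hot labels printed next are presumably produced with Keras' to_categorical utility; a minimal sketch of that step:

from tensorflow.keras.utils import to_categorical

# Convert integer labels to one-hot vectors of length num_classes
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

print('One-hot labels:')
print(y_train[:10])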
One-hot labels:
[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]
There are some additional, optional arguments that you might like to tune:
• strides - The stride of the convolution. If you don't specify anything, strides is set to 1.
• padding - One of 'valid' or 'same'. If you don't specify anything, padding is set to 'valid'.
• activation - Typically 'relu'. If you don't specify anything, no activation is applied. You are
strongly encouraged to add a ReLU activation function to every convolutional layer in
your networks.
**Things to remember**
• Always add a ReLU activation function to the Conv2D layers in your CNN. With the
exception of the final layer in the network, Dense layers should also have a ReLU
activation function.
• When constructing a network for classification, the final layer in the network should be a
Dense layer with a softmax activation function. The number of nodes in the final layer
should equal the total number of classes in the dataset.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, GlobalAveragePooling2D

model = Sequential()
# CONV_1: add CONV layer with RELU activation and depth = 32 kernels
model.add(Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', input_shape=(28, 28, 1)))
# POOL_1: downsample the image to choose the best features
model.add(MaxPooling2D(pool_size=(2, 2)))
# CONV_2 + POOL_2: second convolution block with depth = 64 kernels (shown in the summary below)
model.add(Conv2D(64, kernel_size=(3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# Flatten the feature maps and classify with fully connected layers
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.summary()
Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_32 (Conv2D)           (None, 28, 28, 32)        320
_________________________________________________________________
max_pooling2d_14 (MaxPooling (None, 14, 14, 32)        0
_________________________________________________________________
conv2d_33 (Conv2D)           (None, 14, 14, 64)        18496
_________________________________________________________________
max_pooling2d_15 (MaxPooling (None, 7, 7, 64)          0
_________________________________________________________________
flatten_9 (Flatten)          (None, 3136)              0
_________________________________________________________________
dense_18 (Dense)             (None, 64)                200768
_________________________________________________________________
dense_19 (Dense)             (None, 10)                650
=================================================================
Total params: 220,234
Trainable params: 220,234
Non-trainable params: 0
_________________________________________________________________
Things to notice:
• The network begins with a sequence of two convolutional layers, followed by max
pooling layers.
• The final layer has one entry for each object class in the dataset, and has a softmax
activation function, so that it returns probabilities.
• The Conv2D depth increases from the input layer of 1 to 32 to 64.
• We also want to decrease the height and width - This is where maxpooling comes in.
Notice that the image dimensions decrease from 28 to 14 after the pooling layer.
• You can see that every output shape has None in place of the batch size. This is so the batch size can be chosen at runtime.
• Finally, we add one or more fully connected layers to determine what object is contained in the image. For instance, if wheels were found in the last max pooling layer, this FC layer will transform that information to predict that a car is present in the image with higher probability. If there were eyes, legs, and a tail, then this could mean that there is a dog in the image.
8. Compile the Model
# compile the model
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
metrics=['accuracy'])
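9. Train the Model

The epoch log below and the checkpoint file loaded in step 10 suggest the model was trained with a ModelCheckpoint callback that saves the weights with the best validation accuracy. A hedged sketch of that step (the batch size, epoch count, and the x_valid/y_valid validation arrays are assumptions):

from tensorflow.keras.callbacks import ModelCheckpoint

# Save the weights whenever validation accuracy improves
checkpointer = ModelCheckpoint(filepath='model.weights.best.hdf5',
                               monitor='val_accuracy',
                               save_best_only=True,
                               save_weights_only=True,
                               verbose=1)

history = model.fit(x_train, y_train,
                    batch_size=64,                        # assumed batch size
                    epochs=10,
                    validation_data=(x_valid, y_valid),   # assumed validation data
                    callbacks=[checkpointer],
                    verbose=2)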
Epoch 1/10
938/938 - 35s - loss: 0.1542 - accuracy: 0.9514 - val_loss: 0.0729 -
val_accuracy: 0.9753
10. Load the Model with the Best Classification Accuracy on the
Validation Set
# load the weights that yielded the best validation accuracy
model.load_weights('model.weights.best.hdf5')
# Visualize the first 36 training images
fig = plt.figure(figsize=(20,5))
for i in range(36):
    ax = fig.add_subplot(3, 12, i + 1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(x_train[i]))
**Tip:** When using gradient descent, you should ensure that all features have a similar scale; otherwise, training will take much longer to converge.
x_train = x_train.astype('float32')/255
x_test = x_test.astype('float32')/255
model = Sequential()

# Three rounds of convolution + max pooling; the depth grows 16 -> 32 -> 64
model.add(Conv2D(filters=16, kernel_size=3, padding='same', activation='relu',
                 input_shape=(32, 32, 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.3))

# Flatten the feature maps and classify with fully connected layers
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(10, activation='softmax'))

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 32, 32, 16)        448
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 16, 16, 16)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 16, 16, 32)        4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 8, 8, 32)          0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 8, 8, 64)          18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 4, 4, 64)          0
_________________________________________________________________
dropout (Dropout)            (None, 4, 4, 64)          0
_________________________________________________________________
flatten (Flatten)            (None, 1024)              0
_________________________________________________________________
dense (Dense)                (None, 500)               512500
_________________________________________________________________
dropout_1 (Dropout)          (None, 500)               0
_________________________________________________________________
dense_1 (Dense)              (None, 10)                5010
=================================================================
Total params: 541,094
Trainable params: 541,094
Non-trainable params: 0
_________________________________________________________________