M4 Ia2
Convolutional Networks: The Convolution Operation, Motivation, Pooling, Convolution and Pooling as an
Infinitely Strong Prior, Variants of the Basic Convolution Function, Structured Outputs, Data Types,
Efficient Convolution Algorithms, Random or Unsupervised Features.
What is a CNN?
Convolutional Neural Networks (ConvNets or CNNs) are a category of neural networks that have proven very
effective in areas such as image recognition and classification. ConvNets have been successful at identifying
faces, objects, and traffic signs, as well as powering vision in robots and self-driving cars. ConvNets are
therefore an important tool for most machine learning practitioners today.
Suppose we have a noisy sensor signal x(t) and smooth it by taking a weighted average of nearby
measurements, with a weighting function w(a) that gives more weight to recent measurements. This operation
is called convolution. The convolution operation is typically denoted with an asterisk:

s(t) = (x * w)(t)

In convolutional network terminology, the first argument (in this example, the function x) to the convolution
is often referred to as the input and the second argument (in this example, the function w) as the kernel. The
output is sometimes referred to as the feature map.
In our example, the idea of a laser sensor that can provide measurements at every instant in time is not
realistic. Usually, when we work with data on a computer, time will be discretized, and our sensor will provide
data at regular intervals. In our example, it might be more realistic to assume that our laser provides a
measurement once per second. The time index t can then take on only integer values, and the discrete
convolution becomes

s(t) = (x * w)(t) = Σ_a x(a) w(t − a)
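The discrete convolution above can be sketched directly in code. This is a minimal illustration (the function name `conv1d` and the toy signal and kernel values are our own, not from the source):

```python
import numpy as np

# A minimal sketch of valid-mode 1-D discrete convolution:
# s(t) = sum_a x(a) * w(t - a), computed only where the kernel
# fully overlaps the signal.
def conv1d(x, w):
    k = len(w)
    w_flipped = w[::-1]  # convolution flips the kernel before sliding it
    return np.array([np.dot(x[t:t + k], w_flipped)
                     for t in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical sensor readings
w = np.array([0.5, 0.25, 0.25])          # hypothetical averaging weights
print(conv1d(x, w))
```

The result matches NumPy's built-in `np.convolve(x, w, mode="valid")`, which performs the same flip-and-slide computation.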
For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional
kernel K:

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)
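The two-dimensional case can be sketched the same way, by flipping the kernel and sliding it over the image (the function `conv2d` and the toy image and kernel are illustrative, not from the source):

```python
import numpy as np

# A minimal sketch of valid-mode 2-D convolution: flip the kernel in both
# dimensions, then slide it over the image, taking elementwise products
# wherever the kernel fully overlaps the image.
def conv2d(I, K):
    kh, kw = K.shape
    Kf = K[::-1, ::-1]  # flip the kernel for true convolution
    out_h = I.shape[0] - kh + 1
    out_w = I.shape[1] - kw + 1
    S = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * Kf)
    return S

I = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
K = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 kernel
print(conv2d(I, K))                           # 3x3 feature map
```

Note that a 4×4 input with a 2×2 kernel yields a 3×3 feature map: the output shrinks by (kernel size − 1) in each dimension under valid-mode convolution.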
9.2 Motivation
Motivation for Convolution in Machine Learning
Convolution leverages sparse interactions, parameter sharing and equivariant representations, enhancing
efficiency in machine learning systems. Convolutional layers can handle variable-sized inputs.
1. Sparse Interactions (Sparse Connectivity):
o Sparse interactions mean that each output unit interacts with only a small subset of input units,
rather than all.
o Traditional neural networks use dense connections, where every input interacts with every output,
increasing memory and computation needs.
o Convolutional networks achieve sparse connectivity by making the kernel much smaller than the
input (e.g., a kernel of just a few pixels can detect meaningful features such as edges in a large
image), which limits connections and reduces parameter storage.
o Fewer connections reduce memory requirements and computation time, making convolution
efficient for large inputs like images.
2. Parameter Sharing:
o Parameter sharing means that the same parameters (weights) are reused across different parts of
the input.
o In traditional networks, each parameter in the weight matrix is used once per output calculation.
Convolutional networks, however, apply the same kernel parameters across different input
locations.
o Parameter sharing reduces storage needs: only the k kernel parameters must be stored, rather
than a separate weight for every (input, output) pair.
o The runtime of forward propagation is unchanged (still O(k × n) for n outputs), but the memory
requirements drop dramatically.
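The savings from sparse connectivity and parameter sharing can be sketched with a quick parameter count, here assuming a hypothetical 100×100 grayscale input mapped to an output of the same spatial size:

```python
# Parameter count: dense layer vs. convolutional layer, for a hypothetical
# 100x100 input producing a 100x100 output.
in_h, in_w = 100, 100
n_inputs = in_h * in_w       # 10,000 input units
n_outputs = in_h * in_w      # 10,000 output units

dense_params = n_inputs * n_outputs  # one weight per (input, output) pair
conv_params = 3 * 3                  # one shared 3x3 kernel

print(dense_params)  # 100,000,000 weights
print(conv_params)   # 9 weights
```

The dense layer needs one hundred million weights; the convolutional layer needs nine, reused at every spatial location.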
3. Equivariance to Translation:
o Equivariance means that when an input is transformed (e.g., shifted), the output changes in a
predictable way, typically mirroring the input shift.
o Convolution is translation-equivariant, meaning that shifting the input results in an equivalent shift
in the output.
o For time series data, this creates a timeline of feature occurrences; for images, it creates a 2-D map
showing where features appear.
o This property is useful for detecting recurring patterns (e.g., edges) across the input in a consistent
manner.
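Translation equivariance can be checked numerically. Here is a small sketch with a hypothetical 1-D impulse signal, using `np.roll` as a stand-in for translation: shifting then convolving gives the same result as convolving then shifting.

```python
import numpy as np

# Demonstration that 1-D convolution is translation-equivariant:
# conv(shift(x)) == shift(conv(x)), provided the feature stays away
# from the boundaries.
x = np.zeros(20)
x[5] = 1.0                     # an impulse at position 5
w = np.array([1.0, 2.0, 1.0])  # hypothetical smoothing kernel

def shift(s, n):
    return np.roll(s, n)       # circular shift stands in for translation

a = np.convolve(shift(x, 3), w, mode="same")  # shift, then convolve
b = shift(np.convolve(x, w, mode="same"), 3)  # convolve, then shift
print(np.allclose(a, b))      # True
```

Both orders produce the same response, now centered three positions later, which is exactly the "timeline of feature occurrences" behavior described above.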
Combining sparse connectivity and parameter sharing, convolutional networks efficiently detect edges and
other features across an image.
Handling variable-sized data – Convolution allows for processing inputs of varying sizes, which the fixed-
shape weight matrices of traditional networks cannot handle efficiently.
Limitations:
o Convolution is not naturally equivariant to transformations such as scaling or rotation; other
mechanisms are required to handle these.
o Parameter sharing may not be ideal in cases where specific regions of the input (like different parts of a
face) need distinct features.
9.3 Pooling
A convolutional layer typically has three stages:
Convolution Stage: Applies multiple convolutions in parallel to produce a set of linear activations.
Detector Stage: Each linear activation undergoes a nonlinear function (like ReLU).
Pooling Stage: Applies a pooling function to summarize nearby outputs.
o Complex Terminology: Each convolutional layer has multiple stages (convolution, detector,
pooling).
o Simple Terminology: Each stage in the process is treated as its own layer (e.g., convolution
layer, detector layer, pooling layer).
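The three stages can be sketched end to end on a hypothetical 1-D input (the signal, kernel, and window width are illustrative):

```python
import numpy as np

# A minimal sketch of one convolutional layer's three stages on a 1-D input.
x = np.array([0.0, 1.0, -2.0, 3.0, -1.0, 2.0, 0.0, 1.0])
w = np.array([1.0, -1.0])  # hypothetical kernel

# Convolution stage: a set of linear activations.
z = np.convolve(x, w, mode="valid")

# Detector stage: elementwise nonlinearity (ReLU).
a = np.maximum(z, 0.0)

# Pooling stage: max pooling over non-overlapping windows of width 2.
pooled = a[: len(a) // 2 * 2].reshape(-1, 2).max(axis=1)
print(pooled)  # [1. 5. 3.]
```

Each stage feeds the next: the 8-sample input becomes 7 linear activations, which are rectified and then summarized down to 3 pooled values.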
Pooling in the context of convolutional neural networks (CNNs) is a down-sampling operation that reduces
the spatial dimensions of feature maps (i.e., the width and height), which helps control overfitting, reduces
the computational load, and makes the representation approximately invariant to small translations of the
input. Pooling replaces the output at a given location with a summary statistic of its neighbouring locations,
thereby condensing the information.
Types of Pooling:
Max Pooling: Captures the maximum value in a specified neighbourhood.
Average Pooling: Takes the average within a neighbourhood.
L2 Norm Pooling: Uses the L2 norm of values within the region.
Weighted Average: Averages values with weights based on proximity to the central pixel.
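The first three pooling types can be sketched on a toy 1-D array of detector-stage activations (the values and the window width of 3 are illustrative):

```python
import numpy as np

# Max, average, and L2-norm pooling over non-overlapping windows of width 3.
a = np.array([1.0, 5.0, 2.0, 4.0, 4.0, 4.0])
windows = a.reshape(-1, 3)  # two windows of width 3

max_pool = windows.max(axis=1)                 # maximum in each window
avg_pool = windows.mean(axis=1)                # average of each window
l2_pool = np.sqrt((windows ** 2).sum(axis=1))  # L2 norm of each window

print(max_pool, avg_pool, l2_pool)
```

Note that max pooling gives the same answer (4.0) for the second window whether its values are [4, 4, 4] or a shifted pattern like [0, 4, 8]'s maximum region, which is why max pooling in particular contributes to translation invariance.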