Convolutional Neural Networks
The architecture of a ConvNet is analogous to that of the connectivity pattern of Neurons in the
Human Brain and was inspired by the organization of the Visual Cortex. Individual neurons
respond to stimuli only in a restricted region of the visual field known as the Receptive Field. A
collection of such fields overlaps to cover the entire visual area.
A ConvNet is able to successfully capture the Spatial and Temporal dependencies in an image
through the application of relevant filters. The architecture achieves a better fit to the image
dataset because it reduces the number of parameters involved and reuses weights across the image.
In other words, the network can be trained to capture the structure of an image more effectively.
The objective of the Convolution Operation is to extract features, such as edges, from the input
image. ConvNets need not be limited to only one Convolutional Layer.
Conventionally, the first ConvLayer is responsible for capturing the Low-Level features such as
edges, color, gradient orientation, etc. With added layers, the architecture adapts to the High-
Level features as well, giving us a network that has a holistic understanding of images in the
dataset, similar to how we would.
Convolutional neural networks are distinguished from other neural networks by their superior
performance with image, speech, or audio signal inputs. They have three main types of layers,
which are:
Convolutional layer
Pooling layer
Fully-connected (FC) layer
The convolutional layer is the first layer of a convolutional network. While convolutional layers
can be followed by additional convolutional layers or pooling layers, the fully-connected layer is
the final layer. With each layer, the CNN increases in complexity, identifying greater
portions of the image. Earlier layers focus on simple features, such as colors and edges. As the
image data progresses through the layers of the CNN, it starts to recognize larger elements or
shapes of the object until it finally identifies the intended object.
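To make this layer ordering concrete, here is a minimal sketch of the conventional stack
(convolutional layers, pooling layers, then a fully-connected layer at the end), assuming the
TensorFlow/Keras API. The input shape (32x32 RGB) and the ten output classes are illustrative
placeholders, not a tuned architecture.

# A minimal sketch of the conventional CNN stack, assuming TensorFlow/Keras.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),               # height, width, depth (RGB)
    layers.Conv2D(32, (3, 3), activation="relu"),  # early layer: low-level features
    layers.MaxPooling2D((2, 2)),                   # reduce spatial size
    layers.Conv2D(64, (3, 3), activation="relu"),  # deeper layer: higher-level features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # feature maps -> vector
    layers.Dense(10, activation="softmax"),        # fully-connected classification head
])
model.summary()                                    # prints the layer-by-layer shapes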
1. Convolutional layer
The convolutional layer is the core building block of a CNN, and it is where the majority of
computation occurs. It requires a few components, which are input data, a filter, and a feature
map. Let’s assume that the input is a color image, which is made up of a 3D matrix of pixels.
This means that the input has three dimensions: a height, a width, and a depth, where the depth
corresponds to the RGB color channels of the image. We also have a feature detector, also known
as a kernel or a filter, which moves across the receptive fields of the image, checking whether
the feature is present. This process is known as a convolution.
The convolutional layer is thus the first layer to extract features from an input image.
Convolution preserves the relationship between pixels by learning image features using small
squares of input data. It is a mathematical operation that takes two inputs: an image matrix and
a filter or kernel.
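As a concrete sketch of the operation, the NumPy code below slides a small kernel over a grayscale
image matrix and sums the element-wise products at each position (a "valid" convolution with no
padding and stride 1, computed as CNN libraries typically do, i.e. as cross-correlation). The
image and kernel values are illustrative.

import numpy as np

def convolve2d(image, kernel):
    # Valid convolution: slide the kernel over the image, no padding, stride 1.
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the receptive field by the kernel element-wise and sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Illustrative 5x5 image with a vertical edge, and a Sobel-like 3x3 kernel.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

print(convolve2d(image, kernel))  # 3x3 feature map; strong response at the edge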
2. Pooling layer
Like the convolutional layer, the pooling layer also sweeps a kernel or filter across the input.
But unlike the convolutional layer, the pooling layer has no weights to learn; it reduces the
dimensions of its input, which also results in some information loss. On the positive side, this
reduction lowers complexity and improves the efficiency of the CNN.
The Pooling layer is responsible for reducing the spatial size of the Convolved Feature. This
decreases the computational power required to process the data by reducing its dimensions. There
are two main types of pooling: average pooling, which takes the average value in each window, and
max pooling, which takes the maximum. Max pooling is the more common choice in practice, as shown
in the sketch below.
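Under the same NumPy assumptions as above, this sketch applies max pooling with a 2x2 window and
stride 2, keeping only the largest value from each window; a comment notes the one-line change
that would give average pooling instead.

import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    # Slide a size x size window across the feature map and keep its maximum.
    oh = (feature_map.shape[0] - size) // stride + 1
    ow = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()  # window.mean() would give average pooling
    return out

# Illustrative 4x4 feature map, e.g. the output of a convolutional layer.
fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 4, 3, 8]], dtype=float)

print(max_pool2d(fm))  # [[6. 4.]
                       #  [7. 9.]]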
3. Fully-connected layer
In the fully-connected (FC) layer, we flatten the output of the previous layers into a vector and
feed it into a fully connected layer, just as in a standard neural network.
The feature map matrix is converted into a vector (x1, x2, x3, …). The fully connected layers
combine these features together to create a model. Finally, we have an activation function such
as softmax or sigmoid to classify the outputs as cat, dog, car, truck, etc.
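Continuing the NumPy sketch, the classification head below flattens a pooled feature map into a
vector, applies one fully connected layer, and turns the resulting scores into class probabilities
with softmax. The weights and the four class labels are illustrative placeholders, not trained
values.

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

feature_map = rng.standard_normal((2, 2))        # e.g. the pooled output from above
x = feature_map.flatten()                        # vector (x1, x2, x3, ...)

classes = ["cat", "dog", "car", "truck"]         # illustrative class labels
W = rng.standard_normal((len(classes), x.size))  # untrained placeholder weights
b = np.zeros(len(classes))

probs = softmax(W @ x + b)                       # probabilities summing to 1
print(dict(zip(classes, probs.round(3))))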
Summary