Unit 3
Convolutional Neural Networks (CNNs) are a type of deep learning model primarily used for
processing structured, grid-like data such as images. They are particularly powerful for tasks that
involve visual perception. Here's a basic overview of how CNNs work and their key
components:
Convolution Layers:
This layer applies a number of filters to the input. These filters help the network in identifying
various features in the data, such as edges, textures, or specific objects in case of image data.
Each filter convolves across the input data, computing dot products between the entries of the
filter and the input, producing a feature map.
After each convolution operation, a nonlinear layer (such as the rectified linear unit, ReLU) is
applied to introduce nonlinearity into the model, enabling it to learn more complex patterns.
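To make this concrete, here is a minimal sketch in PyTorch (the library choice and the layer sizes are illustrative assumptions, not from the text) of a convolution layer followed by ReLU:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)          # one 32x32 RGB image (batch, channels, H, W)
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)  # 8 learnable 3x3 filters
feature_maps = torch.relu(conv(x))      # convolve, then apply the ReLU nonlinearity
print(feature_maps.shape)               # torch.Size([1, 8, 30, 30]): one map per filter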
Pooling Layers:
Pooling (also known as subsampling or downsampling) reduces the dimensionality of each
feature map but retains the most important information.
Fully Connected Layers:
After several convolution and pooling layers, the high-level reasoning in the neural network is
done through fully connected layers. Neurons in a fully connected layer have connections to all
activations in the previous layer, and their activations can thus depend on the entire input.
Output Layer:
The final layer, typically a softmax layer, provides the 'output' of the network, which could be a
class label in classification tasks or a set of values in regression.
CNNs learn through backpropagation and an optimization algorithm such as Stochastic Gradient
Descent (SGD) or Adam. During training, the network adjusts its weights to minimize the error
in its predictions compared to the actual data.
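As a rough illustration of that training process, the following PyTorch sketch runs a single SGD step on a toy model with made-up data (the model, sizes, and batch are all hypothetical):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 30 * 30, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 32, 32)      # a stand-in mini-batch of four images
labels = torch.randint(0, 10, (4,))     # stand-in class labels

optimizer.zero_grad()
loss = loss_fn(model(images), labels)   # compare predictions with the actual targets
loss.backward()                         # backpropagation computes the gradients
optimizer.step()                        # SGD adjusts the weights to reduce the error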
Applications of CNNs:
Face Recognition: Identifying and verifying individuals from their facial features.
Image Generation: Creating new images (e.g., deepfakes, artistic generation).
CNNs have revolutionized the field of computer vision due to their efficiency and accuracy in
processing image data. Their ability to learn hierarchical feature representations makes them
particularly suited for complex visual tasks.
A convolutional neural network (CNN) is a type of artificial neural network used primarily for
image recognition and processing, due to its ability to recognize patterns in images. A CNN is a
powerful tool but typically requires large amounts of labeled data for training.
The architecture of a Convolutional Neural Network (CNN) typically consists of a series of layers
designed to process and extract features from input data, such as images. Here's a basic overview
of the typical architecture of a CNN:
Input Layer:
The input layer represents the raw input data, which is usually an image in the case of computer
vision tasks.
The dimensions of the input layer correspond to the dimensions of the input data (e.g., width,
height, and depth for images).
Convolution Layers:
Convolution layers are responsible for learning features from the input data. Each convolution
layer applies a set of filters (also known as kernels) to the input data, creating feature maps that
highlight important patterns in the data. These filters are small spatially (along width and height),
but extend through the full depth of the input volume.
An activation function is applied element-wise to the output of each convolution layer. Common
choices for activation functions include Rectified Linear Unit (ReLU), which introduces non-
linearity into the model.
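For instance, in PyTorch (an illustrative choice), a layer with 16 filters of size 5x5 over an RGB input stores weights whose depth matches the 3 input channels:

import torch.nn as nn

# Each of the 16 filters is 5x5 spatially but spans all 3 input channels,
# so the weight tensor has shape (out_channels, in_channels, kH, kW).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)
print(conv.weight.shape)   # torch.Size([16, 3, 5, 5])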
Pooling Layers:
Pooling layers downsample the feature maps generated by the convolution layers. Common
pooling operations include max pooling and average pooling, which reduce the spatial
dimensions of the input, helping to reduce computation and overfitting.
Fully Connected Layers:
After several convolution and pooling layers, the high-level reasoning in the neural network is
done through fully connected layers. Neurons in a fully connected layer have connections to all
activations in the previous layer, and their activations can thus depend on the entire input.
Output Layer:
The output layer produces the final output of the network, which could be a class label in
classification tasks or a set of values in regression tasks. The number of neurons in the output
layer depends on the specific task (e.g., the number of classes in classification tasks).
In classification tasks, the softmax activation function is often used in the output layer to convert
raw scores into class probabilities.
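A quick sketch of what softmax does to raw scores (the logit values here are made up):

import torch

logits = torch.tensor([2.0, 1.0, 0.1])     # raw scores from the last layer
probs = torch.softmax(logits, dim=0)       # converted to class probabilities
print(probs, probs.sum())                  # values in (0, 1) that sum to 1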
Loss Function:
A loss function is used to measure the difference between the network's predictions and the
actual target values. Common loss functions include categorical cross-entropy for classification
and mean squared error for regression.
Optimization Algorithm:
An optimization algorithm (e.g., stochastic gradient descent) is used to minimize the loss
function by adjusting the weights of the network during training.
This basic architecture can be customized and extended for specific applications and datasets.
For example, deeper networks with more layers can capture more complex features, but they
also require more computational resources and are prone to overfitting if not trained carefully.
There are also various architectural innovations, like skip connections (e.g., in ResNet) and
attention mechanisms, that have been introduced to improve the performance of CNNs in
different tasks.
What is a CNN?
Convolutional Neural Networks (CNN or ConvNet) are a type of multi-layer neural network that
is meant to discern visual patterns from pixel images. In a CNN, 'convolution' refers to a
mathematical operation: a type of linear operation in which two functions are combined to
produce a third function that expresses how the shape of one is modified by the other. In
simple terms, a small filter matrix is slid over the image matrix, multiplying and summing the
overlapping entries to produce an output that is used to extract information from the image.
A CNN is similar to other neural networks, but because it uses a sequence of convolution
layers, it adds a layer of complexity to the equation. A CNN cannot function without
convolution layers.
The ConvNet's job is to compress the images into a format that is easier to process while
preserving elements that are important for obtaining a decent prediction. This is critical for
designing an architecture that is capable of learning features while also being scalable to large
datasets. A convolutional neural network (ConvNet for short) has three layers which are its
building blocks; let's have a look:
Convolution Layer (CONV): These layers are the foundation of a CNN, and they are in charge of
executing convolution operations. The kernel/filter (a matrix) is the component in this layer that
performs the convolution operation. The kernel slides horizontally and vertically across the
image, moving by the stride, until the complete image is scanned. The kernel is smaller than the
image spatially, but it extends through its full depth. This means that if the image has three
(RGB) channels, the kernel's height and width will be modest spatially, but its depth will span all
three channels. Other than convolution, there is another important part of convolution layers,
known as the nonlinear activation function. The outputs of linear operations like convolution are
passed through a nonlinear activation function. Smooth nonlinear functions such as the sigmoid
or hyperbolic tangent (tanh) were formerly used because they are mathematical representations
of the behavior of biological neurons, but the rectified linear unit (ReLU) is now the most
commonly used nonlinear activation function: f(x) = max(0, x)
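In code, ReLU is a one-liner; a NumPy sketch:

import numpy as np

def relu(x):
    # Rectified linear unit: f(x) = max(0, x), applied element-wise.
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 3.]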
Pooling Layer (POOL): This layer is in charge of reducing dimensionality. It aids in reducing
the amount of computing power required to process the data. Pooling can be divided into two
types: maximum pooling and average pooling. The maximum value from the area covered by the
kernel on the image is returned by max pooling. The average of all the values in the part of the
image covered by the kernel is returned by average pooling.
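The difference between the two pooling types can be seen on a small example (a PyTorch sketch with an illustrative 4x4 input):

import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [0., 1., 2., 3.],
                    [1., 0., 4., 5.]]]])   # shape (1, 1, 4, 4)

print(nn.MaxPool2d(2)(x))   # max of each 2x2 region -> [[4., 8.], [1., 5.]]
print(nn.AvgPool2d(2)(x))   # mean of each 2x2 region -> [[2.5, 6.5], [0.5, 3.5]]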
Fully Connected Layer (FC): The fully connected layer (FC) works with a flattened input,
which means that each input value is connected to every neuron. The flattened vector is then
passed through a few additional FC layers, where the usual mathematical operations are
performed; the classification procedure takes place at this point. FC layers are frequently found
near the end of CNN architectures, if they are present.
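A minimal sketch of flattening feature maps into a fully connected layer (the sizes are illustrative assumptions):

import torch
import torch.nn as nn

feature_maps = torch.randn(1, 8, 4, 4)    # output of earlier conv/pool layers (assumed sizes)
flat = feature_maps.flatten(start_dim=1)  # shape (1, 128): every value feeds every neuron
fc = nn.Linear(128, 10)                   # fully connected layer, e.g. for 10 classes
scores = fc(flat)
print(scores.shape)                       # torch.Size([1, 10])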
Along with the above layers, there are some additional terms that are part of CNN
architecture.
Activation Function: The last fully connected layer's activation function is frequently distinct
from the others; each task requires the selection of an appropriate activation function. An
activation function used in multiclass classification problems is the softmax function, which
normalizes the real-valued outputs of the last fully connected layer into target class probabilities,
where each value ranges between 0 and 1 and all values sum to 1.
Dropout Layers: The dropout layer is a mask that nullifies some neurons' contributions to the
following layer while leaving all others unchanged. A dropout layer can be applied to the input
vector, nullifying some of its properties; however, it can also be applied to a hidden layer,
nullifying some hidden neurons. Dropout layers are critical in CNN training because they
prevent overfitting on the training data. Without them, the first batch of training data has a
disproportionately large impact on learning, and the learning of features that occur only in
later samples or batches would be suppressed.
Now that you have a good understanding of the building blocks of a CNN, let's have a look at
some of the popular CNN architectures.
LeNet Architecture
The LeNet architecture is simple and modest, making it ideal for teaching the fundamentals of
CNNs. It can even run on a CPU (if your system lacks a decent GPU), making it an excellent
"first CNN." It's one of the first and most extensively used CNN designs, and it has been used to
successfully recognize handwritten digits. The LeNet-5 CNN architecture has seven layers:
three convolution layers, two subsampling layers, and two fully connected layers make up the
layer composition.
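The following PyTorch sketch approximates that layer composition (details such as the original activations and subsampling operations are simplified here, so treat it as an illustration rather than a faithful reproduction):

import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.Tanh(),      # C1: 6 feature maps, 28x28 from a 32x32 input
    nn.AvgPool2d(2),                    # S2: subsampling -> 14x14
    nn.Conv2d(6, 16, 5), nn.Tanh(),     # C3: 16 feature maps, 10x10
    nn.AvgPool2d(2),                    # S4: subsampling -> 5x5
    nn.Conv2d(16, 120, 5), nn.Tanh(),   # C5: third convolution layer, 1x1
    nn.Flatten(),
    nn.Linear(120, 84), nn.Tanh(),      # F6: first fully connected layer
    nn.Linear(84, 10),                  # output: 10 digit classes
)
print(lenet5(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])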
AlexNet Architecture
AlexNet's architecture was extremely similar to LeNet's. It was the first convolutional network to
employ the graphics processing unit (GPU) to improve performance. Convolution filters and a
nonlinear activation function termed ReLU (Rectified Linear Unit) are used in each convolution
layer. Max pooling is done using the pooling layers. Due to the presence of fully connected
layers, the input size is fixed. The AlexNet architecture was created with large-scale image
datasets in mind, and it produced state-of-the-art results when it was first released. It has 60
million parameters in all.
VGG Architecture
While prior AlexNet derivatives focused on smaller window sizes and strides in the first
convolution layer, VGG takes a different approach to CNN design. It takes as input a 224×224-
pixel RGB image. To keep the input image size consistent for the ImageNet competition, the
authors cropped out the central 224×224 patch of each image. The receptive field of the
convolution layers in VGG is quite small (3×3). The convolution stride is set at 1 pixel in order
to preserve spatial resolution after convolution. VGG contains three fully connected layers, the
first two of which each have 4096 channels and the third of which has 1000 channels, one for
each class. Due to its adaptability for a variety of tasks, including object detection, the VGG
CNN model serves as a good baseline for many applications in computer vision, although it is
computationally expensive compared with later architectures.
Following are some of the advantages of a Convolutional Neural Network:
It performs parameter sharing and uses special convolution and pooling operations.
CNN models can now run on a wide range of devices, making them broadly appealing.
It finds the relevant features without the need for human intervention.
It can be utilized in a variety of industries to execute key tasks such as facial recognition,
document analysis, climate comprehension, image recognition, and item identification, among
others.
You can extract valuable features from an already-trained CNN, with its learned weights, by
feeding in your own data and tuning the CNN slightly for a specific purpose.
MOTIVATION LAYER
It seems there might be a misunderstanding with the term "motivation layer." In the context of
neural networks, including Convolutional Neural Networks (CNNs), there is no standard layer
referred to as a "motivation layer."
Input Layer: Receives input data, such as images in the case of CNNs.
Convolution Layers: Apply learned filters to the input, producing feature maps.
Pooling Layers: Reduce the spatial dimensions of the input, aiding in feature selection.
Fully Connected Layers: Neurons in these layers have connections to all activations in the
previous layer, performing high-level reasoning.
Output Layer: Produces the final output of the network, which could be a class label in
classification tasks or a set of values in regression tasks.
Figure 61 shows the metamodel of motivational concepts. It includes the actual motivations or
intentions – i.e., goals, principles, requirements, and constraints – and the sources of these
intentions, i.e., stakeholders, drivers, and assessments.
Motivational elements are related to the core elements via the requirement or constraint concept.
It is essential to understand the factors, often referred to as drivers, which influence the
motivational elements. They can originate from either inside or outside the enterprise. Internal
drivers, also called concerns, are associated with stakeholders, which can be some individual
human being or some group of human beings, such as a project team, enterprise, or society.
Examples of such internal drivers are customer satisfaction, compliance with legislation, or
profitability. It is common for enterprises to undertake an assessment of these drivers, e.g., using
a SWOT analysis, in order to respond in the best way.
The actual motivations are represented by
goals, principles, requirements, and constraints. Goals represent some desired result – or end –
that a stakeholder wants to achieve; e.g., increasing customer satisfaction by 10%. Principles and
requirements represent desired properties of solutions – or means – to realize the
goals. Principles are normative guidelines that guide the design of all possible solutions in a
given context. For example, the principle “Data should be stored only once” represents a means
to achieve the goal of “Data consistency” and applies to all possible designs of the organization’s
architecture. Requirements represent formal statements of need, expressed by stakeholders,
which must be met by the architecture or solutions. For example, the requirement “Use a single
CRM system” conforms to the aforementioned principle by applying it to the current
organization’s architecture in the context of the management of customer data.
FILTERS
In the context of Convolutional Neural Networks (CNNs), filters, also known as kernels, play a
crucial role. They are fundamental components used in the convolutional layers to extract
features from the input data, such as images. Here's a detailed look at how filters work in CNNs:
Filters in CNNs are small matrices of weights. These weights are learned during the training
process. Each filter is designed to detect specific features in the input data, such as edges, colors,
textures, or more complex patterns in deeper layers.
The size of a filter is typically much smaller than the size of the input data. Common dimensions
are 3x3, 5x5, or 7x7, but this can vary. Filters have a depth that matches the depth of the input
data. For example, for an RGB image (which has a depth of 3), each filter also has a depth of 3.
During the forward pass, each filter is convolved across the width and height of the input
volume, computing the dot product between the entries of the filter and the input at any position.
As a filter slides over the input data, it produces a 2-dimensional activation map (or feature map)
that gives the responses of that filter at every spatial position.
Feature Maps:
The feature map obtained by convolving a filter represents the presence of the features detected
by that filter across the input. Different filters detect different features, resulting in different
feature maps for the same input.
Learning Process:
Through the process of backpropagation, the CNN adjusts the values of these filters to minimize
the loss function. This learning process enables the filters to become feature detectors, adapting
to extract relevant features for the task at hand.
The stride controls how much the filter moves across the input. A stride of 1 moves the filter one
pixel at a time, while a stride of 2 moves it two pixels, and so on. Padding can be added to the
input volume to control the spatial size of the output volumes, allowing deeper layers to retain a
larger spatial footprint of the input.
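A naive NumPy sketch of this sliding-window computation, with stride and zero padding as parameters (the filter values below are illustrative; deep-learning libraries implement the same operation far more efficiently):

import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # Naive 2D cross-correlation (the "convolution" used in CNNs) with stride
    # and zero padding, for illustration only.
    image = np.pad(image, padding)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # dot product of filter and input patch
    return out

edge_filter = np.array([[1., 0., -1.]] * 3)      # a simple vertical-edge detector
image = np.random.rand(6, 6)
print(conv2d(image, edge_filter, stride=2, padding=1).shape)   # (3, 3)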
In deeper layers of the network, filters can detect more complex features, as they receive input
from feature maps created by earlier layers that represent more basic features. The stacking of
convolution layers allows CNNs to learn a hierarchy of features, from simple edges and textures
in early layers to more complex, abstract concepts in deeper layers. In summary, filters are a vital
part of CNN architecture, enabling these networks to automatically and adaptively learn spatial
hierarchies of features from input data, which is a cornerstone of their success in tasks like image
and video recognition, image segmentation, and other computer vision tasks.
What Is a Filter?
A filter is a circuit capable of passing (or amplifying) certain frequencies while attenuating other
frequencies. Thus, a filter can extract important frequencies from signals that also contain
undesirable or irrelevant frequencies. In the field of electronics, there are many practical
applications for filters. Examples include:
Radio communications: Filters enable radio receivers to only "see" the desired signal while
rejecting all other signals (assuming that the other signals have different frequency content).
DC power supplies: Filters are used to eliminate undesired high frequencies (i.e., noise) that are
present on AC input lines. Additionally, filters are used on a power supply's output to reduce
ripple.
Audio electronics: A crossover network is a network of filters used to channel low-frequency
audio to woofers, mid-range frequencies to midrange speakers, and high-frequency sounds to
tweeters.
Analog-to-digital conversion: Filters are placed in front of an ADC input to minimize aliasing.
Four Major Types of Filters
The four primary types of filters include the low-pass filter, the high-pass filter, the band-pass
filter, and the notch filter (or the band-reject or band-stop filter). Take note, however, that the
terms "low" and "high" do not refer to any absolute values of frequency, but rather, they are
relative values with respect to the cutoff frequency.
Figure 1 gives a general idea of how each of these four filters works.
Note: A notch filter is a band-stop filter with a narrow bandwidth. Notch filters are used to
attenuate a narrow range of frequencies. Below are some technical terms that are commonly used
when describing filter response curves:
-3 dB frequency (f3dB). This term, pronounced "minus 3 dB frequency," corresponds to the
input frequency that causes the output signal to drop by -3 dB relative to the input signal. The
-3 dB frequency is also referred to as the cutoff frequency. It is the frequency at which the output
power is reduced by one-half (which is why this frequency is also called the "half-power
frequency"), or at which the output voltage is the input voltage multiplied by 1/√2. For low-pass
and high-pass filters, there is only one -3 dB frequency. However, there are two -3 dB
frequencies for band-pass and notch filters, normally referred to as f1 and f2.
Center frequency (f0). The center frequency, a term used for band-pass and notch filters, is a
central frequency between the upper and lower cutoff frequencies. The center frequency is
commonly defined as the arithmetic mean, f0 = (f1 + f2)/2, or the geometric mean,
f0 = √(f1 × f2), of the lower and upper cutoff frequencies.
Bandwidth (β or B.W.). The bandwidth is the width of the pass band, and the pass band is the
band of frequencies that do not experience significant attenuation when moving from the input of
the filter to the output of the filter.
Stop band frequency (fs). This is a particular frequency at which the attenuation reaches a
specified value.
For low-pass and high-pass filters, frequencies beyond the stop band frequency are referred to as
the stop band. For band-pass and notch filters, two stop band frequencies exist. The frequencies
between these two stop band frequencies are referred to as the stop band.
Quality factor (Q): The quality factor of a filter conveys its damping characteristics. In the time
domain, damping corresponds to the amount of oscillation in the system’s step response. In the
frequency domain, higher Q corresponds to more (positive or negative) peaking in the system’s
magnitude response. For a band-pass or notch filter, Q represents the ratio between the center
frequency and the -3 dB bandwidth (i.e., the distance between f1 and f2).
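As a quick worked example (the cutoff frequencies are made up), these quantities can be computed directly:

import math

# Descriptive quantities for a hypothetical band-pass filter
# with cutoff frequencies f1 = 900 Hz and f2 = 1100 Hz.
f1, f2 = 900.0, 1100.0
bandwidth = f2 - f1                    # pass-band width (Hz)
f0_arith = (f1 + f2) / 2               # center frequency, arithmetic mean
f0_geom = math.sqrt(f1 * f2)           # center frequency, geometric mean
Q = f0_arith / bandwidth               # quality factor: center frequency over -3 dB bandwidth
print(bandwidth, f0_arith, round(f0_geom, 1), Q)   # 200.0 1000.0 995.0 5.0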
PARAMETER SHARING
Basic Concept:
In a CNN, instead of having unique weights for every pixel in the input data, a convolution layer
uses the same filter (set of weights) across the entire input. This is what is meant by parameter
sharing. This filter is convolved across the width and height of the input image, or feature map,
applying the same weights at each position.
Efficiency in Learning:
Parameter sharing dramatically reduces the number of free parameters compared to a fully
connected layer, where each input pixel would be connected to each neuron with a unique
weight. This efficiency makes CNNs particularly suitable for high-dimensional inputs like
images.
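A concrete comparison in PyTorch (the sizes are illustrative assumptions): a small convolution layer versus a fully connected layer mapping the same input to an output of similar spatial extent:

import torch.nn as nn

# Parameter sharing: a conv layer reuses the same small filters everywhere,
# while a fully connected layer needs a unique weight per input-output pair.
conv = nn.Conv2d(3, 16, kernel_size=3)       # 16*(3*3*3) + 16 biases = 448 parameters
fc = nn.Linear(3 * 32 * 32, 16 * 30 * 30)    # a 32x32 RGB input to 16 30x30 maps

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv), count(fc))   # 448 vs 44,251,200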
Detection of Features Regardless of Position:
Since the same filter is applied across the entire input, the network can detect a feature regardless
of its position in the input image. For example, if a filter learns to recognize an edge in one part
of the image, it can recognize the same edge in a different part of the image. This property is
known as translation invariance.
By having fewer parameters, the risk of overfitting (where the model learns the noise in the
training data instead of the actual pattern) is reduced. This makes CNNs more generalizable to
new, unseen data.
Despite using shared parameters, CNNs can learn hierarchies of increasingly complex features.
Lower layers might learn simple features like edges and textures, while higher layers learn more
complex features like patterns or object parts.
During backpropagation, the gradients from all positions where a filter was applied are summed
up, and this cumulative gradient is used to update the filter weights. This process takes into
account how the filter performed across the entire input.
Parameter sharing is one reason why CNNs can afford to be deep (have many layers); the
number of parameters does not explode with the addition of more layers. In summary, parameter
sharing in CNNs is an efficient way to learn features from images and other high-dimensional
data. It allows the network to be both deep and computationally efficient while also being robust
to overfitting, making it ideal for tasks in computer vision and related areas.
REGULARIZATION
Regularization in machine learning and deep learning is a technique used to prevent overfitting,
where a model performs well on training data but poorly on unseen data. Overfitting often occurs
in complex models with a large number of parameters, such as deep neural networks.
Regularization techniques aim to simplify the model to make it more generalizable. Here are
some common regularization methods:
L1 and L2 Regularization:
These are the most common forms of regularization. They work by adding a penalty term to the
loss function.
L1 Regularization (Lasso): Adds the absolute value of the magnitude of the coefficients as the
penalty term to the loss function. It can lead to feature selection as some weights can become
zero.
L2 Regularization (Ridge): Adds the squared magnitude of the coefficients as the penalty term.
It generally leads to smaller and distributed weight values but doesn’t set them to zero.
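A sketch of adding these penalties to a loss in PyTorch (the model, the placeholder task loss, and the regularization strength are all illustrative; PyTorch's SGD also offers a weight_decay option that implements the L2 penalty):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                 # any model; a linear layer keeps the sketch short
data_loss = torch.tensor(0.0)            # placeholder for the usual task loss
lam = 1e-4                               # regularization strength (a tunable hyperparameter)

l1 = sum(p.abs().sum() for p in model.parameters())   # L1: sum of |w|, encourages sparsity
l2 = sum(p.pow(2).sum() for p in model.parameters())  # L2: sum of w^2, shrinks weights
loss = data_loss + lam * l1 + lam * l2   # penalty terms added to the loss function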
Dropout:
Dropout is a widely used regularization technique in neural networks, especially deep neural
networks. During training, dropout randomly 'drops' (sets to zero) a proportion of the neurons in
a layer, forcing the network to learn redundant representations and preventing reliance on any
one feature. At test time, dropout is not applied; instead, the outputs (or weights) are scaled by
the keep probability so that the expected activations match those seen during training.
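A small PyTorch sketch of this behavior; note that modern frameworks typically use "inverted dropout," which scales the surviving activations up during training instead of scaling down at test time:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)    # each neuron is zeroed with probability 0.5 during training
x = torch.ones(1, 8)

drop.train()
print(drop(x))    # roughly half the entries are 0; survivors are scaled by 1/(1-p)
drop.eval()
print(drop(x))    # at test time dropout is disabled; the input passes through unchanged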
Early Stopping:
Early stopping involves halting the training process before the model begins to overfit. This is
typically done by monitoring the model's performance on a validation set and stopping the
training when the performance on the validation set starts to degrade.
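A minimal sketch of the early-stopping logic over a pre-recorded validation-loss curve (the numbers are made up; in practice each value would come from evaluating the model after an epoch):

val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.60, 0.59]
best_val, patience, bad_epochs = float("inf"), 2, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0   # improvement: reset the counter
    else:
        bad_epochs += 1                      # validation performance degraded
        if bad_epochs >= patience:
            print(f"early stop at epoch {epoch}, best val loss {best_val}")
            break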
Data Augmentation:
Data augmentation expands the training set with label-preserving transformations of the existing
data (e.g., flips, crops, rotations, small color shifts). Exposing the model to these variations acts
as a regularizer, since the network cannot simply memorize individual training images.
Batch Normalization:
While primarily used to help in faster convergence of the training process, batch normalization
can also have a regularizing effect. It normalizes the output of a previous activation layer by
subtracting the batch mean and dividing by the batch standard deviation.
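The normalization step itself is a couple of lines (a PyTorch sketch; a full BatchNorm layer additionally learns a per-feature scale and shift):

import torch

x = torch.randn(32, 8)                            # 32 samples, 8 activations each
mean, var = x.mean(dim=0), x.var(dim=0, unbiased=False)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)       # subtract batch mean, divide by batch std
print(x_hat.mean(dim=0).abs().max())              # ~0: each feature is now zero-mean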
Noise Injection:
Adding noise to inputs or hidden layers during training can also act as a form of regularization,
forcing the network to learn more robust features.
Ensemble Methods:
Techniques like bagging and boosting, where multiple models are trained and their predictions
are combined, can also be seen as forms of regularization as they generally lead to more robust
and generalized models. The choice of regularization technique(s) can depend on the specific
problem, the type of model being used, and the nature of the dataset. It's often beneficial to
experiment with different methods and combinations to find what works best for a particular
scenario.
Several Convolutional Neural Network (CNN) architectures have gained popularity, especially in
the field of computer vision, due to their outstanding performance in tasks like image
classification, object detection, and more. Here's a brief overview of some of the most influential
and widely used CNN architectures:
LeNet-5:
Developed by Yann LeCun in the late 1990s, LeNet-5 is one of the earliest CNN architectures.
Primarily used for handwritten digit recognition (e.g., the MNIST dataset), it consists of
convolution layers followed by subsampling (pooling) layers, and fully connected layers.
AlexNet:
Designed by Alex Krizhevsky and published in 2012, AlexNet significantly advanced the field
of deep learning, particularly in image classification. It features deeper layers compared to LeNet
and introduced key concepts such as ReLU activations and dropout for regularization.
VGG:
Developed by the Visual Geometry Group at Oxford (hence VGG), this model was a runner-up in
the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2014. VGG is known for its
simplicity, using only 3x3 convolution layers stacked on top of each other in increasing depth,
and was one of the first to show that depth is a critical component of a good model.
GoogLeNet (Inception v1):
Introduced in 2014, GoogLeNet (or Inception v1) won the ILSVRC 2014. It introduced the
concept of the "Inception module," which dramatically reduced the number of parameters in the
network (compared to AlexNet and VGG).
ResNet:
Developed by Microsoft Research, ResNet won the ILSVRC 2015. It introduced residual blocks,
allowing the training of extremely deep networks (up to 152 layers) by using skip connections to
mitigate the vanishing gradient problem.
Xception:
An extension of the Inception architecture, it replaces Inception modules with depthwise
separable convolutions. Xception stands for "Extreme Inception" and was shown to outperform
Inception modules on multiple benchmarks.
Inception-v3 and v4:
These are further improvements on the Inception model, introducing more efficient and
sophisticated Inception modules.
DenseNet:
Similar to ResNet, DenseNet also makes use of skip connections. However, instead of summing
outputs from previous layers, DenseNet concatenates them, leading to a more densely connected
network.
MobileNets:
Designed for mobile and embedded vision applications, MobileNets use depthwise separable
convolutions to build lightweight deep neural networks.
EfficientNet:
EfficientNet, a more recent architecture, scales up CNNs in a more structured manner, using a
compound coefficient to scale depth, width, and resolution uniformly.
These architectures have been influential in pushing the boundaries of what's possible in
computer vision and have also inspired many variations and improvements. They serve as both
practical solutions for real-world applications and as foundational models for further research in
the field.
RESNET
ResNet, short for Residual Network, is a type of deep neural network architecture that is
designed to address the problem of vanishing gradients in very deep networks. It was introduced
by Kaiming He et al. in their paper "Deep Residual Learning for Image Recognition" in 2015. The
key innovation of ResNet is the use of residual connections, which allow the network to learn
residual functions with respect to the input instead of learning the desired underlying mapping
directly. This is achieved by adding shortcut connections that skip one or more layers, allowing
the network to bypass the usual forward propagation path and directly propagate the input to
deeper layers. This helps in mitigating the vanishing gradient problem and enables the training of
very deep networks (hundreds of layers) effectively. ResNet has been widely adopted in various
computer vision tasks, especially for image classification, where it has achieved state-of-the-art
performance on benchmark datasets like ImageNet. It has also been used as a backbone
architecture for other tasks such as object detection, semantic segmentation, and more, due to its
effectiveness in learning hierarchical features from visual data.
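A minimal sketch of a residual block in PyTorch (batch normalization, which real ResNet blocks include, is omitted here for brevity):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # A basic residual block sketch: output = F(x) + x (the shortcut connection).
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = torch.relu(self.conv1(x))
        out = self.conv2(out)
        return torch.relu(out + x)   # skip connection: the input bypasses both conv layers

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 8, 8)).shape)   # torch.Size([1, 16, 8, 8])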
ALEXNET APPLICATIONS
AlexNet is a deep convolutional neural network architecture that gained significant attention
after winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. Here
are some applications and uses of AlexNet:
Image Classification: AlexNet was originally designed for image classification tasks, where it
achieved state-of-the-art performance at the time of its introduction. It can be used to classify
images into various categories, such as identifying objects in photographs.
Object Detection: The architecture of AlexNet can also be adapted for object detection tasks,
where the goal is not only to classify the objects in an image but also to locate and outline them.
This is commonly used in applications like self-driving cars, surveillance, and augmented reality.
Feature Extraction: The convolution layers of AlexNet can be used as a feature extractor. By
removing the fully connected layers and using the output of the last convolution layer, known as
the "bottleneck features," AlexNet can be used to extract features from images. These features
can then be used as inputs to other machine learning models for various tasks.
Medical Image Analysis: AlexNet and similar convolutional neural network architectures have
been applied to medical image analysis tasks, such as identifying diseases from medical images
like X-rays, MRIs, and CT scans. The ability of deep learning models to learn complex patterns
in images makes them well-suited for such tasks.
Natural Language Processing (NLP): While AlexNet is primarily designed for image-related
tasks, its underlying principles, especially the use of convolution layers, have inspired
architectures in NLP tasks such as text classification and sentiment analysis. The idea of using
deep learning for feature extraction and hierarchical representation learning has been influential
across various domains.
Overall, Alex Net's impact extends beyond image classification, influencing the development of
deep learning architectures and their applications in a wide range of fields.
Alex Net. The architecture consists of eight layers: five convolution layers and three fully-
connected layers. But this isn’t what makes Alex Net special; these are some of the features used
that are new approaches to convolution neural networks:
ReLU Nonlinearity. AlexNet uses Rectified Linear Units (ReLU) instead of the tanh function,
which was standard at the time. ReLU's advantage is in training time; a CNN using ReLU was
able to reach a 25% error rate on the CIFAR-10 dataset six times faster than a CNN using tanh.
Multiple GPUs. Back in the day, GPUs were still rolling around with 3 gigabytes of memory
(nowadays that would be a rookie number). This was especially bad because the training set had
1.2 million images. AlexNet allows for multi-GPU training by putting half of the model's
neurons on one GPU and the other half on another GPU. Not only does this mean that a bigger
model can be trained, but it also cuts down on the training time.
Overlapping Pooling. CNNs traditionally “pool” outputs of neighboring groups of neurons with
no overlapping. However, when the authors introduced overlap, they saw a reduction in error by
about 0.5% and found that models with overlapping pooling generally find it harder to overfit.
The Overfitting Problem. AlexNet had 60 million parameters, a major issue in terms of
overfitting. Two methods were employed to reduce overfitting:
Data Augmentation. The authors used label-preserving transformations to make their data more
varied. Specifically, they generated image translations and horizontal reflections, which
increased the size of the training set by a factor of 2048. They also performed Principal
Component Analysis (PCA) on the RGB pixel values to change the intensities of the RGB
channels, which reduced the top-1 error rate by more than 1%.
Dropout. This technique consists of “turning off” neurons with a predetermined probability (e.g.
50%). This means that every iteration uses a different sample of the model’s parameters, which
forces each neuron to have more robust features that can be used with other random neurons.
However, dropout also increases the training time needed for the model’s convergence.
The Results. On the 2010 version of the ImageNet competition, the best model achieved 47.1%
top-1 error and 28.2% top-5 error. AlexNet vastly outpaced this with a 37.5% top-1 error and a
17.0% top-5 error. AlexNet is able to recognize off-center objects, and most of its top five
classes for each image are reasonable. AlexNet won the 2012 ImageNet competition with a
top-5 error rate of 15.3%, compared to the second-place top-5 error rate of 26.2%.
[Figure: AlexNet's most probable labels on eight ImageNet images, with the correct label
written under each image and the assigned probabilities shown as bars. Image credit:
Krizhevsky et al., the original authors of the AlexNet paper.]
What Now? AlexNet is an incredibly powerful model capable of achieving high accuracies on
very challenging datasets. However, removing any of the convolution layers will drastically
degrade AlexNet's performance. AlexNet was a leading architecture for object-recognition tasks
and has had huge applications in the computer vision sector of artificial intelligence. As a
milestone in making deep learning more widely applicable, AlexNet can also be credited with
bringing deep learning to adjacent fields such as natural language processing and medical image
analysis.