A Study On Effects of Data Augmentation in Detection
Suman Bhurtel
Jan, 2022
1. Introduction
This workflow can be seen in many developments of deep learning architectures (Lin et al. 2014, LeCun and Cortes 2010, Krizhevsky et al. 2012, Simonyan and Zisserman 2014). The developers of various deep learning architectures have explored numerous optimization techniques; however, these optimization approaches focus mostly on the network architecture itself (Huang et al. 2016), or more specifically on increasing the depth of the network (Simonyan and Zisserman 2014).
Data augmentation is not a new term in the field of deep learning and computer vision; it has accompanied the field since its early development and is widely accepted as an optimization approach. Most deep learning algorithms implement data augmentation to some extent. Altering data during the training phase is, however, a less practiced approach in the optimization process. Altering data during training reduces overfitting of the model and also enhances the quality of the model by effectively increasing the amount of data. There is no upper limit on how much data a machine needs to learn efficiently; however, data for a specific problem are always limited and, in most cases, insufficient. To cover these pitfalls of data limitation, data augmentation can be a convenient solution. Many studies have suggested that through data augmentation we can increase the data and make machine learning more efficient. Most contemporary deep learning applications use the same data augmentation techniques first used by Krizhevsky et al. (2012). This leaves a significantly larger space for research on deep learning models and their relation to data augmentation.
Although data augmentation is widely practiced in research and applications, very little research has investigated the relation between it and model optimization (Simonyan and Zisserman 2014). This neglected area of research has resulted in an absence of specific guidelines or best practices for applying data augmentation. The design of data augmentation pipelines is a combination of researchers' experience and the trial-and-error method. The data augmentation pipeline first used by Krizhevsky et al. (2012) in the ImageNet competition is often reused today without alteration; examples of this practice are Huang et al. (2016), He et al. (2015), and Simonyan and Zisserman (2014).
Researchers have not addressed data augmentation to the extent it deserves. I have so far not found any published scientific research studying the effect of data augmentation on detection accuracy, or in other words on the optimization of deep learning models. Simard et al. (2003), Perez and Wang (2017), and Pawara et al. (2017) have published related papers paying some attention to data augmentation, but they focused their work on specific datasets, applications, or only rough effects; their works do not give any concrete guidelines for the meaningful application of data augmentation techniques.
Since data has a vital influence on model architecture and the learning process, so might data augmentation. Therefore, filling these research gaps is significant for proposing model design priorities more accurately and justifiably, thus leading to faster, more computationally cost-effective, and more efficient deep learning models for computer vision tasks.
2. Basics
This chapter of the thesis is divided into two sections. The first section discusses the application of deep learning to computer vision problems, on which the foundation of this thesis is based. It begins with a short introduction to deep learning and computer vision, followed by a discussion of object classification problems. This section also highlights the Convolutional Neural Network (CNN) architecture and its design patterns. Moreover, I include an overview of preprocessing techniques along with the focused data augmentation techniques, their application, and their functionality in achieving optimized deep learning architectures. The second part of this chapter discusses the mathematics behind the techniques and tricks that can be utilized to boost CNN performance.
Figure 2.2 The convolutional kernel extracts features from the source layer, mapping them into the destination layer. Image by (Podareanu et al. 2019).
The classification task requires a model to assign a label to provided data points (objects). Almost all computer vision architectures are examples of classification problems. For instance, the architectures of Simonyan and Zisserman (2014) and He et al. (2015) were first utilized for the ImageNet classification problem. These architectures were later adopted for complex tasks like object detection (Girshick 2015), segmentation (Shelhamer et al. 2017), or similarity measurement (Shen et al. 2017).
The application of CNNs for object detection is rising rapidly, with various models being developed that make CNNs more accurate, efficient, and robust. Hence, the classification problem is significant in research works, and I will therefore focus on deep classification architectures for the experimental part of this thesis.
2.4.1 Common Training and Testing Methodologies
Deep learning is a subdomain of machine learning; therefore, it follows the general concept of training a model with data in order to produce an output. The common and widely used methodology in the deep learning process is the training and testing process. First, the data is preprocessed and training, validation, and testing datasets are created. The training dataset is used for training, the validation dataset is used for optimization, and the testing dataset is used for the final validation of the learning process. This split into training, validation, and testing data is done in order to reduce overfitting to the training and validation data, which can significantly reduce the performance of learning; overfitting means the model has learned very specific features from the dataset, like one particular orientation, some noise patterns, or even the exact pixel values of all images in the training and validation datasets (Bishop 2006). To overcome the overfitting problem in the learning process, various parameters are used, for example the learning rate, dropout probability, etc. Preprocessing steps, postprocessing steps, and frameworks are also important factors in optimizing the performance of a model (Richter et al. 2018). Normalization or scaling of values, padding of images, and noise reduction are also widely used techniques to make the information (data) more expressive in order to achieve higher model performance. Other common and widely used methodologies include methods like Principal Component Analysis and Factor Analysis.
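To make the split described above concrete, the following minimal sketch divides a dataset into training, validation, and testing subsets. It assumes NumPy arrays of images and labels; the 70/15/15 ratio is an illustrative choice, not a setting taken from this thesis.

import numpy as np

def train_val_test_split(images, labels, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the data once, then cut it into train/validation/test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(images))          # random order, reproducible via seed
    n_test = int(len(images) * test_frac)
    n_val = int(len(images) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((images[train_idx], labels[train_idx]),
            (images[val_idx], labels[val_idx]),
            (images[test_idx], labels[test_idx]))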
Data augmentation distorts the training dataset in ways that encourage the learning of desired features and discourage the learning of undesired features from the training dataset. Data augmentation is carried out by applying classical computer vision algorithms in a randomized manner to the training dataset, creating distorted data points in the process of training. For example, random rotation of an image can be applied to prevent the model from learning a specific orientation of the object as a feature. Hence, data augmentation is used as a method of teaching the algorithm to ignore unwanted features like orientation or size, and to focus on features that are invariant to the applied distortions (Perez and Wang 2017).
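As an illustration of such a randomized pipeline, the following sketch uses Keras' ImageDataGenerator; the specific parameter values are illustrative assumptions, not settings from this thesis.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each parameter enables one randomized distortion, applied on the fly per batch.
datagen = ImageDataGenerator(
    rotation_range=20,        # random rotation in [-20, 20] degrees
    width_shift_range=0.1,    # random horizontal translation (fraction of width)
    height_shift_range=0.1,   # random vertical translation (fraction of height)
    horizontal_flip=True,     # random horizontal mirroring
    zoom_range=0.15,          # random zoom in [0.85, 1.15]
)

# x_train and y_train are assumed NumPy arrays of images and labels:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10)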
Additive Noise
Additive noise is an augmentation technique where element-wise noise drawn from a particular distribution is added to an object minibatch. In other words, in additive noising, pixel or color values are changed with a random component on every pixel of the image using some function. For example, pepper noise sets random pixel values to zero, while Gaussian noise perturbs each value by a random amount drawn from a Gaussian distribution. Using noise in augmentation is a general way of partially concealing the information in the data points; this concealment helps the model learn features that are independent of particular pixel patterns. We can see how noise can obscure information from the original image to produce more images with different pixel values while maintaining the significant features of the image.
Figure 2.5.1 a) Original image, b) Salt-and-pepper noisy image, c) Gaussian noisy image, and d) Poisson noisy image. Image by (Janaki et al., 2012).
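To make the technique concrete, here is a minimal NumPy sketch of Gaussian and salt-and-pepper noise; the noise strengths are illustrative assumptions.

import numpy as np

def gaussian_noise(img, sigma=10.0, seed=None):
    """Add zero-mean Gaussian noise to a uint8 image."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def salt_and_pepper(img, amount=0.02, seed=None):
    """Set a random fraction of pixels to 0 (pepper) or 255 (salt)."""
    rng = np.random.default_rng(seed)
    noisy = img.copy()
    mask = rng.random(img.shape[:2])
    noisy[mask < amount / 2] = 0        # pepper
    noisy[mask > 1 - amount / 2] = 255  # salt
    return noisy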
Intensity Shift
The intensity shift augmentation technique is especially applicable for enhancing the generalization of deep learning models trained on images with variation in brightness. It amounts to a basic brightness change of the image: if the intensity shifts towards a positive value, the brightness increases; if it shifts towards a negative value, the brightness decreases. Below is an example of an intensity shift in an image.
The above illustration includes the shifting of color channels, contrast, hue, saturation, and brightness. These shifts in the image cause the model to learn the invariant features of the images, which is extremely useful when dealing with images from various sources and of varying quality. During the literature review phase of this thesis, I did not find any published paper on deep learning architectures mentioning the application of intensity shift; surprisingly, I found implementations of intensity shift in Keras and TensorFlow, as well as in PyTorch under the name ColorJitter.
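For reference, a minimal sketch of the PyTorch implementation mentioned above; the shift ranges and the input file name are illustrative assumptions.

from torchvision import transforms
from PIL import Image

# ColorJitter randomly perturbs brightness, contrast, saturation, and hue.
jitter = transforms.ColorJitter(
    brightness=0.4,   # brightness factor sampled uniformly from [0.6, 1.4]
    contrast=0.4,
    saturation=0.4,
    hue=0.1,          # hue shift sampled uniformly from [-0.1, 0.1]
)

img = Image.open("example.jpg")   # hypothetical input file
augmented = jitter(img)           # a new randomly shifted image on each call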
Blur
Blurring is similar to a convolution operation, as it is performed with a kernel that transforms each color channel individually in depth. In the various blurring methods, such as Gaussian blur or box blur, the kernel values are fixed depending on the chosen kernel size. Blur is widely used, first to obscure small features of the object that could lead the model to learn insignificant features, and secondly to reduce noise. This yields overall smoother objects and more homogeneous color channels in the image.
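A minimal sketch of both blur variants using OpenCV; the kernel size and file name are illustrative assumptions.

import cv2

img = cv2.imread("example.jpg")            # hypothetical input file

# Gaussian blur: kernel values follow a 2D Gaussian; sigma derived from size.
gaussian = cv2.GaussianBlur(img, (5, 5), 0)

# Box blur: every kernel value is 1/(5*5), i.e. a plain local average.
box = cv2.blur(img, (5, 5))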
Random Cropping
Cropping means cutting the image to a fixed pixel size. Szegedy et al. (2017b) used uninformed positional cropping, also known as random cropping, where the cropping mask is moved over the image in a systematic fashion; Krizhevsky et al. (2012) and Simonyan and Zisserman (2014) have also applied uninformed cropping.
Informed cropping can be found in the frameworks of Girshick et al. (2013), Girshick (2015), and Ren et al. (2015), which use region proposal systems to crop the image. In this thesis, I consider cropping as a data augmentation technique; regardless of multi-crop evaluation, the production of multiple data points, or the exact process of selecting the cropped area, cropping varies the image and removes biases towards the position of the object, enabling the model to identify partially hidden features.
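A minimal sketch of uninformed (random) cropping in NumPy; the crop size is an illustrative assumption.

import numpy as np

def random_crop(img, crop_h=224, crop_w=224, seed=None):
    """Cut a crop_h x crop_w window from a uniformly random position."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_h + 1)    # uninformed: position is uniform
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]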
Random Rotation
Random rotation is an example of a geometric transformation technique in which each image is rotated randomly by various angles. One of the purposes of random rotation is to make the neural network invariant to rotation; a lack of rotation invariance results in false predictions, and it is very unlikely that testing images have the same angle or orientation as the training samples. This problem was addressed by Sabour et al. (2017) in their CapsuleNet architecture, however limited to the MNIST problem. Applying random rotation to the training dataset is a simple technique for the model to learn invariant features, as already mentioned above. We can adjust the degree of rotation invariance by changing the range of the rotation angle. After the image is rotated, the empty space is filled by padding, as illustrated in the figure below:
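A minimal sketch using torchvision; the angle range, fill value, and file name are illustrative assumptions.

from torchvision import transforms
from PIL import Image

# Each call draws an angle uniformly from [-30, 30] degrees;
# fill=0 pads the empty corners left by the rotation with black.
rotate = transforms.RandomRotation(degrees=30, fill=0)

img = Image.open("example.jpg")   # hypothetical input file
augmented = rotate(img)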
Random Shearing
Random shearing is also an affine transformation, in which the pixel displacement is proportional to the pixel's index value:

x' = x + shx · y
y' = shy · x + y

The displacements of x and y (giving x' and y') are controlled by the magnitudes of shx and shy; this warps the image linearly along a diagonal axis, as clearly illustrated in the figure below:
Again, I have found no application of shearing in the literature so far, but it is implemented in Keras, TensorFlow, and PyTorch.
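A minimal sketch applying the shear equations above with OpenCV; the shear factors and file name are illustrative assumptions.

import cv2
import numpy as np

img = cv2.imread("example.jpg")       # hypothetical input file
h, w = img.shape[:2]

shx, shy = 0.2, 0.0                   # shear only along the x-axis here
# Affine matrix implementing x' = x + shx*y, y' = shy*x + y
M = np.float32([[1, shx, 0],
                [shy, 1, 0]])
sheared = cv2.warpAffine(img, M, (w, h))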
Translation
Figure 2.5.7 Illustration of image translation.
X’ = Tx + x
Y’ = Ty + y
The x and y coordinates are translated by the integer values Tx and Ty, which displace the whole image; unlike cropping, which the illustration above may resemble, translation does not make the image smaller than the original, only the coordinates of the pixel values are changed. Although translation is implemented in Keras, PyTorch, MXNet, and TensorFlow, it is not as popular a technique as cropping; this is also the reason most of the literature does not prioritize this technique.
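A minimal sketch of translation with OpenCV, mirroring the equations above; the offsets and file name are illustrative assumptions.

import cv2
import numpy as np

img = cv2.imread("example.jpg")     # hypothetical input file
h, w = img.shape[:2]

tx, ty = 30, -15                    # shift 30 px right, 15 px up
# Affine matrix implementing x' = x + Tx, y' = y + Ty
M = np.float32([[1, 0, tx],
                [0, 1, ty]])
translated = cv2.warpAffine(img, M, (w, h))  # vacated border is zero-padded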
Flipping
There are mainly two kinds of flipping in practice: horizontal flipping and vertical flipping. Flipping is also called mirroring; like rotation, flipping mirrors the image around an axis, such as the vertical or horizontal center. The following functions clarify the technique of flipping:
xhor = xmax - x
yhor = y
xver = x
yver = ymax - y
where xmax and ymax are the highest coordinates of the x-axis and y-axis, respectively. Flipping has been a popular and widely used technique ever since the development of deep neural networks for computer vision problems (Krizhevsky et al. 2012, Simonyan and Zisserman 2014, Szegedy et al. 2014b and He et al. 2015). I also found in various published papers that the horizontal flip is used more often than the vertical flip; the reason might be that a horizontally flipped image has a more realistic layout.
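In NumPy, both flips reduce to reversing one axis, a minimal sketch:

import numpy as np

def horizontal_flip(img):
    """Mirror around the vertical center: xhor = xmax - x."""
    return img[:, ::-1]

def vertical_flip(img):
    """Mirror around the horizontal center: yver = ymax - y."""
    return img[::-1, :]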
Zooming
Zooming is another very popular technique, in which a region of the image is enlarged or shrunken with the same aspect ratio as the original image. Random zooming towards different points in the image forces the model to learn scale invariance. In other words, zooming helps the model learn the features of data points regardless of scaling factors. Simonyan and Zisserman (2014) used zooming for multi-scale evaluation. TensorFlow, Keras, PyTorch, and MXNet have implementations of zoom.
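A minimal crop-and-resize sketch of random zoom-in with OpenCV; the zoom range is an illustrative assumption.

import cv2
import numpy as np

def random_zoom(img, max_zoom=1.3, seed=None):
    """Zoom in by cropping a random sub-window and resizing it back."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    factor = rng.uniform(1.0, max_zoom)          # 1.0 means no zoom
    crop_h, crop_w = int(h / factor), int(w / factor)
    top = rng.integers(0, h - crop_h + 1)        # random zoom center
    left = rng.integers(0, w - crop_w + 1)
    crop = img[top:top + crop_h, left:left + crop_w]
    return cv2.resize(crop, (w, h))              # back to the original size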
Static Application
Augmented images are generated once and stored alongside the original dataset, so the augmentation happens before training.
Online Application
Augmented images are generated on the fly during training, within frameworks like Keras or TensorFlow; see the sketch below.
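Keras' ImageDataGenerator can serve both modes, a minimal sketch; directory paths, parameters, and the x_train/y_train arrays are illustrative assumptions.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)

# Online application: batches are augmented in memory during training.
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10)

# Static application: the same generator writes augmented copies to disk,
# where they can be stored and reused as an enlarged dataset.
for i, _ in enumerate(datagen.flow(x_train, batch_size=32,
                                   save_to_dir="augmented/",
                                   save_format="png")):
    if i >= 10:   # stop after ten batches of saved images
        break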
2.7 Model
This section of the thesis highlights the convolutional neural network architecture used for the experiment of the thesis. I will focus on introducing the model, elaborating its structure, and describing its relevance to deep learning and computer vision problems. The motivation behind choosing this architecture for the experiment will also be discussed.
VGGNet
VGGNet (Simonyan and Zisserman 2014) was the first runner-up in image classification at the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2014. VGGNet was developed by Simonyan and Zisserman, researchers of the Visual Geometry Group (VGG) at the University of Oxford. After its development, VGGNet established certain architectural design patterns for deep neural networks that are considered standard design even today. The idea of increasing the depth of a convolutional neural network architecture to increase accuracy was first highlighted by VGG. VGG used smaller filter sizes in comparison to the other popular CNNs of that time, AlexNet (ILSVRC winner 2012) and ZFNet (ILSVRC winner 2013): Krizhevsky et al. (2012) used 11x11 and 5x5 filters in AlexNet, and Zeiler and Fergus (2014) used 7x7 filters in ZFNet. The larger filter sizes increased network complexity; by reducing the filter size and increasing the depth, VGGNet reduced the number of parameters, which simplified the network's complexity. VGGNet also standardized the use of the ReLU (rectified linear unit) activation function, which is a highly practiced default in many models today. VGGNet has a pyramidal structure of networks, which is illustrated below.
Figure 2.7.1 The pyramidal structure of the VGG16 architecture, showing the decreasing spatial size of the feature maps and the increasing number of filters per stack. Image by (Simonyan and Zisserman 2014).
The configurations, shown in the following figure, are ordered by increasing depth from left to right and denoted A to E. The shallowest configuration, A, has 11 layers: 8 convolutional (Conv) layers and 3 fully connected (FC) layers. The depth increases from A to E as more layers are added; the added layers are shown in bold in the illustration. The parameters of a convolutional layer are represented as "conv<receptive field size>-<number of channels>".
Figure 2.7.2 Convolutional neural network configurations. Image by (Simonyan and Zisserman 2014).
As mentioned above, VGGNet has many layers and a large number of fully connected nodes, so the deployment of VGGNet is very slow. However, due to its popularity and proven efficiency, it is used in many image classification tasks. There are smaller architectures like SqueezeNet and GoogLeNet; however, I am using VGGNet in my experiment. VGGNet has a simpler architecture than GoogLeNet: the heterogeneous topology of GoogLeNet needs to be customized in each module, resulting in complexity. One more drawback of GoogLeNet is that it drastically reduces the feature space in its bottleneck layers, which often leads to a loss of significant information (Khan et al. 2020).
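For orientation, a minimal sketch of instantiating VGG16 via Keras, assuming the pretrained ImageNet weights are used; the default input is 224x224 RGB.

from tensorflow.keras.applications import VGG16

# Load the 16-layer VGG configuration (configuration D in Figure 2.7.2)
# with weights pretrained on ImageNet.
model = VGG16(weights="imagenet")
model.summary()   # prints the Conv/FC stack and the parameter counts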
Experiment
# Based on https://github.com/experiencor/keras-yolo3