Saleh Albelwi

DISSERTATION

UNIVERSITY OF BRIDGEPORT
CONNECTICUT

April 2018
HYPERPARAMETER OPTIMIZATION OF DEEP
CONVOLUTIONAL NEURAL NETWORKS
ARCHITECTURES FOR OBJECT RECOGNITION
ABSTRACT
Convolutional Neural Networks (CNNs) have shown promising results in difficult deep learning tasks. However, the success of a CNN depends heavily on its architecture design, which is a challenging, time-consuming process that requires expert knowledge and effort, due to a large number of architectural design choices. In this dissertation, we present an optimization framework built around a new objective function that combines the error rate and the information learnt by a set of feature maps using deconvnet. This objective function improves performance by guiding the CNN search through better visualization of learnt features via
deconvnet. The actual optimization of the objective function is carried out via the Nelder-
Mead Method (NMM). Further, our new objective function results in much faster
convergence towards a better architecture. The proposed framework has the ability to
explore a CNN architecture's numerous design choices in an efficient way and also allows for effective parallel and distributed execution. Experimental results demonstrate that the CNN architecture designed with our approach outperforms several
existing approaches in terms of its error rate. Our results are also competitive with state-
of-the-art results on the MNIST dataset and perform reasonably against the state-of-the-
art results on the CIFAR-10 and CIFAR-100 datasets. Our approach plays a significant role in increasing the depth, reducing the size of strides, and constraining some convolutional layers not to be followed by pooling layers, in order to find a CNN architecture that produces a high level of accuracy.
Moreover, we evaluate the effectiveness of reducing the size of the training set on
CNNs using a variety of instance selection methods to speed up the training time. We
then study how these methods impact classification accuracy. Many instance selection
methods require a long run-time to obtain a subset of the representative dataset, especially
if the training set is large and has high dimensionality. One example of these algorithms is Random Mutation Hill Climbing (RMHC); we propose an improved version of RMHC that runs much faster than the original algorithm while achieving the same accuracy.
ACKNOWLEDGEMENTS
My thanks are wholly devoted to God, who has helped me complete this work and achieve my goal. Special thanks go to my wife and my kids; I could never have achieved this without them.
I would like to express special thanks to my supervisor Dr. Ausif Mahmood for
his constant guidance, comments, and valuable time. Without his support, I would not
have been able to finish this work. My appreciation also goes to Dr. Khaled Elleithy for
his feedback and support, and to my committee members Dr. Miad Faezipour, Dr. Prabir
Patra, Dr. Xingguo Xiong, and Dr. Saeid Moslehpour for their valuable time and
suggestions.
TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
2.10 Analysis of optimized instance selection algorithms on large datasets with CNNs
2.11 A Deep Architecture for Face Recognition Based on Multiple Feature Extractors
3.2 Analysis of optimized instance selection algorithms on large datasets with CNNs
3.3 A Deep Architecture for Face Recognition based on Multiple Feature Extractors
4.2 Analysis of optimized instance selection algorithms on large datasets with CNNs
4.2.2 Training methodology
4.3 A Deep Architecture for Face Recognition Based on Multiple Feature Extractors
REFERENCES
LIST OF TABLES

LIST OF FIGURES

Figure 3.2  The top part illustrates the deconvnet layer on the left, attached to the convolutional layer on the right; the bottom part illustrates the pooling and unpooling operations [14]
Figure 3.3  Visualization from the last convolutional layer for three different CNN architectures
Figure 4.1  CIFAR-10 dataset; each row shows different images of one class
CHAPTER 1: INTRODUCTION
In recent years, Convolutional Neural Networks (CNNs) have achieved great success in a variety of areas such as computer vision [1-3] and natural language processing [4-6]. CNNs are biologically inspired by the structure of mammals' visual cortexes as presented in Hubel and Wiesel's model [7]. In 1998, LeCun et al. followed this idea and adapted it to computer vision. CNNs are typically comprised of different types of layers, including convolutional, pooling, and fully-connected layers. By stacking many of these layers, CNNs can automatically learn feature representations that are highly
discriminative without requiring hand-crafted features [8, 9]. In 2012, Krizhevsky et al.
[1] proposed AlexNet, a deep CNN architecture consisting of seven hidden layers with around 60 million parameters, trained on the ImageNet dataset [10]. AlexNet achieved a test error of 15.3%, as compared to the 26.2% obtained by the second-place entry.
AlexNet’s impressive result increased the popularity of CNNs within the computer vision
community. Other motivators that renewed interest in CNNs include the number of large
datasets, fast computation with Graphics Processing Units (GPUs), and powerful
regularization techniques such as Dropout [11]. The success of CNNs has motivated
many to apply them to solving other problems, such as the detection of extreme climate events.
Some works have tried to tune the AlexNet architecture design to achieve better
accuracy. For example, in [14], state-of-the-art results are obtained in 2013 by making
the filter size and stride in the first convolutional layer smaller. Then, [3] significantly
improved accuracy by designing a very deep CNN architecture with 16 layers. The
authors pointed out that increasing the depth of the CNN architecture is critical for
achieving better accuracy. However, the works in [15, 16] showed that increasing the depth beyond a certain point can harm accuracy, since a very deep network makes the network more difficult to optimize and more prone to overfitting [18]. There is no single architecture that suits all problems, because the architecture will be different from one dataset to another. Therefore, the
architecture design needs to be adjusted for each dataset. Setting the hyperparameters
properly for a new dataset and/or application is critical [19]. Hyperparameters that
specify a CNN’s structure include: the number of layers, the filter sizes, the number of
feature maps, stride, pooling regions, pooling sizes, the number of fully-connected layers,
and the number of units in each fully-connected layer. The selection process often relies
on trial and error, and hyperparameters are tuned empirically. Repeating this process
many times is ineffective and can be very time-consuming for large datasets. Recently, automatic methods have been proposed that formulate hyperparameter selection as an optimization problem. These automatic methods have produced results exceeding those accomplished by human experts [20, 21]. They utilize prior knowledge to select the next set of hyperparameters to evaluate.
In this dissertation, we present an efficient optimization framework that aims to design a high-performing CNN architecture for a given dataset automatically. The framework uses deconvnet to visualize the information learnt by the feature maps. The deconvnet produces a reconstructed image that includes
the activated parts of the input image. A good visualization shows that the CNN model
has learnt properly, whereas a poor visualization shows ineffective learning. We use a
correlation coefficient based on Fast Fourier Transport (FFT) to measure the similarity
between the original images and their reconstructions. The quality of the reconstruction,
using the correlation coefficient and the error rate, is combined into a new objective
function to guide the search into promising CNN architecture designs. We use the Nelder-
Mead Method (NMM) to automate the search for a high-performing CNN architecture
through a large search space by minimizing the proposed objective function. We exploit the parallelism of NMM to accelerate the optimization process.

Designing a CNN architecture is a challenge, as there are numerous design choices that impact the performance [19].
Determining the proper architecture design is a challenge because it differs for each
dataset and therefore each one will require adjustments. Many structural
hyperparameters are involved in these decisions, such as depth (which includes the
number of convolutional and fully-connected layers), the number of filters, stride (step-
size that the filter must be moved), pooling locations and sizes, and the number of units in each fully-connected layer. It is difficult to determine the best hyperparameter combination for a given dataset because it is not well understood how these
hyperparameters interact with each other to influence the accuracy of the resulting model. There is no analytical way to determine the proper hyperparameters for a given dataset, so the selection relies on trial and error. Therefore, practitioners and non-expert users often employ a grid or random search to find the best combination of hyperparameters to yield a better design, which is very time-consuming. Overall, hand-crafting a CNN architecture requires expert knowledge and effort, due to the large number of architectural design choices. In this dissertation, we introduce a framework that automatically designs a CNN architecture for a given dataset that will maximize the performance. This allows non-expert users and
practitioners to find a good architecture for a given dataset in reasonable time without
hand-crafting it.
1.3 Contributions
The main contributions of this dissertation are as follows. We propose an efficient optimization framework that automatically searches for a high-performing CNN architecture for a given problem through a very large search space
without any human intervention. This framework also allows for an effective parallel and
distributed execution.
We introduce a novel objective function that exploits the error rate on the
validation set and the quality of the feature visualization via deconvnet. This objective
function adjusts the CNN architecture design, which reduces the classification error and
enhances the reconstruction via the use of visualization feature maps at the same time.
Further, our new objective function results in much faster convergence towards a better
architecture.
Instance selection is a subfield in machine learning that aims to reduce the size of
the training set. One example of these algorithms is Random Mutation Hill Climbing
(RMHC). We propose a new version of RMHC that works quickly and has the same accuracy as the original algorithm.

Some of the best current facial recognition approaches use feature extraction
techniques based on Principal Component Analysis (PCA), Local Binary Patterns (LBP), and neural networks (NN). We combine the power of Multiple Classifiers (MC) and deep learning to build a system that uses different feature extraction algorithms: PCA, LBP+PCA, and LBP+NN. The features from the above three techniques are
concatenated to form a joint feature vector. This feature vector is fed into a deep Stacked Sparse Autoencoder (SSA) to produce the final classification.
CHAPTER 2: BACKGROUND AND LITERATURE REVIEW
Deep learning has achieved impressive results in many areas, such as computer vision and natural language processing [3-6, 10]. Deep learning uses multiple hidden layers of non-linear processing to learn a hierarchy of feature representations, where features at higher levels of the hierarchy are composed from lower-level features [6]. With enough
such transformations, very complex functions can be learned. For object recognition,
higher layers of representation amplify aspects of the inputs that are important for
discrimination and suppress irrelevant variation [26]. The pixels of the image are fed into the first layer, which can learn low-level features such as points, edges, and curves. In
subsequent layers, these features are combined into a measure of the likely presence of
higher level features; for example, lines are combined into shapes, which are then
combined into more complex shapes. Once this is done, the network provides a
probability that these high-level features comprise a particular object or scene. Deep
learning is motivated by understanding how the human brain processes information. The
brain is organized as a deep architecture with many layers that manipulate the input through multiple levels of abstraction.

The main aspect of deep learning is learning discriminative features from the raw
data automatically without human-engineered features. The popular models for deep
learning include Deep Belief Network (DBN), Recurrent Neural Network (RNN),
Stacked Autoencoder (SA), and Convolutional Neural Networks (CNN) [9, 28].
Training deep models is typically performed with backpropagation and gradient descent, which update the parameters of deep learning algorithms to find the parameters (weights w and biases b) that minimize a certain loss function in order to map the inputs to the desired outputs. Backpropagation consists of two phases: a forward phase and a backward phase.

During the forward phase, the algorithm forwards through the network layers to
compute the outputs. As a result, the error of the loss function is compared to the expected
outputs. During the backward phase, the model computes the gradient of the loss function
with respect to the current parameters, after which the parameters are updated by taking a step in the negative direction of the gradient.

The forward phase starts by feeding the inputs through the first layer, thus producing
output activations for the successive layer. This procedure is repeated until the loss
function at the last layer is computed. During the backward phase, the last layer calculates
the derivative with respect to its own learnable parameters as well as its own input, which
serves as the upstream derivative for the previous layer. This procedure is repeated until the first layer is reached.

Gradient descent can be categorized into two main methods: Batch Gradient
Descent (BGD) and Stochastic Gradient Descent (SGD). The main difference between
both approaches is the size of the sample to consider for calculating the gradient. BGD
uses the entire training set to update the gradient at each iteration, while SGD computes the gradient for each training example $(x^{(i)}, y^{(i)})$. Since the gradient of BGD is calculated
for the whole training set, it can be very slow and expensive, particularly when the size
of the training set is very large. However, the convergence is smoother and the
termination is more easily detectable. SGD is less expensive; however, it suffers from
noisy steps and its frequent updates can make the loss function fluctuate heavily [31].
SGD with mini-batch takes the best of both BGD and SGD. It updates the gradient using a small batch of $m'$ training examples at a time, which reduces the variance of the parameter updates and can lead to more stable convergence. In addition, SGD with mini-batch benefits from the parallelism available in GPUs, which are frequently used in deep learning frameworks such as Theano and TensorFlow. The mini-batch size $m'$ is defined by the user and can be up to a few
hundred examples. The estimated gradient of SGD with mini-batch is computed as:

$$\nabla_w = \frac{1}{m'} \nabla_w \sum_{i=1}^{m'} \ell\left(x^{(i)}, y^{(i)}, w\right) \qquad (2.1)$$

$$\nabla_b = \frac{1}{m'} \nabla_b \sum_{i=1}^{m'} \ell\left(x^{(i)}, y^{(i)}, b\right) \qquad (2.2)$$
where $\ell(x^{(i)}, y^{(i)}, w)$ is the loss function over the mini-batch samples $\mathbb{Q}$ selected
from the training set. Once the gradients of the loss are computed with backpropagation
with respect to the parameters, they are used to perform a gradient descent parameter
update along the downhill direction of the gradient in order to decrease the loss function
as follows:
$$w = w - \epsilon \cdot \nabla_w \qquad (2.3)$$

$$b = b - \epsilon \cdot \nabla_b \qquad (2.4)$$
where $\epsilon$ is the learning rate, a hyperparameter that controls the step size of the update. Algorithm (2.1) highlights the essential steps of SGD with mini-batch; for instance, the gradient with respect to $b$ is computed as $\nabla_b = \frac{1}{m'} \nabla_b \sum_{i=1}^{m'} \ell(x^{(i)}, y^{(i)}, b)$.
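To make the update rule concrete, the following minimal NumPy sketch applies one epoch of mini-batch SGD to a linear model with a squared-error loss; the model, loss, and variable names are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np

def sgd_minibatch_epoch(X, y, w, b, lr=0.01, batch_size=32):
    """One epoch of mini-batch SGD for a linear model with squared-error loss."""
    n = X.shape[0]
    idx = np.random.permutation(n)          # shuffle the training set
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        # Forward phase: predictions and error on the mini-batch
        pred = xb @ w + b
        err = pred - yb
        # Backward phase: average gradients over the m' examples (Eqs. 2.1-2.2)
        grad_w = xb.T @ err / len(batch)
        grad_b = err.mean()
        # Parameter update along the negative gradient (Eqs. 2.3-2.4)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny usage example on synthetic data
X = np.random.randn(256, 4)
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.1 * np.random.randn(256)
w, b = sgd_minibatch_epoch(X, y, np.zeros(4), 0.0)
```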
A CNN is a subclass of neural networks that takes advantage of the spatial structure of the inputs. A typical CNN architecture is composed of a stack of convolutional layers and pooling layers (often each pooling layer is placed after a convolutional layer). The last layers are a small number of fully-connected layers, and the final layer is a softmax classifier.
The critical advantage of CNNs is that they are trained end-to-end, from raw pixels to final class labels, without relying on human-crafted features [10, 11]. Since 2012, many research efforts have improved the performance of CNNs in different directions, e.g., layer design, activation functions, and regularization, or have applied CNNs in other areas [12, 13]. CNNs have been implemented
using large data sets such as MNIST [33], CIFAR-10/100 [34], and ImageNet [35] for
image recognition.
Convolutional layers apply a set of learnable filters (kernels) that aim to extract local features from the input. Each kernel is used to calculate a feature map.
The units of the feature maps can only connect to a small region of the input, called the
receptive field. A new feature map is typically generated by sliding a filter over the input and computing the dot product (which is similar to the convolution operation), followed by a non-linear activation function:

$$x_j^{(l)} = f\left(w_f^{(l)} * x^{(l-1)} + b_j^{(l)}\right) \qquad (2.5)$$

where $*$ is the convolution operation, $w_f^{(l)}$ is a convolution filter with size $F \times F$, $x^{(l-1)}$ is the output of the previous layer, $b_j^{(l)}$ is the shared bias of the feature map, and $f$ is a non-linear activation function.
During the backward phase, we compute the gradient of the loss function with
respect to the weights (𝑤) and biases (𝑏) of the respective layer as follows:
$$\nabla_{w_f^{(l)}} \ell = \sum_{F,F} \left(\nabla_{x^{(l+1)}} \ell\right)_{F,F} \left(x_{F,F}^{(l)} * w_f^{(l)}\right) \qquad (2.6)$$

$$\nabla_{b_f^{(l)}} \ell = \sum_{F,F} \left(\nabla_{x^{(l+1)}} \ell\right)_{F,F} \left(x_{F,F}^{(l)} * b_f^{(l)}\right) \qquad (2.7)$$
All units share the same weights (filters) among each feature map. The advantage
of sharing weights is the reduced number of parameters and the ability to detect the same feature regardless of its location in the input.

Several nonlinear activation functions are available, such as sigmoid, tanh, and
ReLU. However, ReLU [f(x) = max (0, x)] is preferable because it makes training faster
relative to the others [1, 37]. The size of the output feature map is based on the filter size
and stride, so when we convolve the input image with a size of (H × H) over a filter with
a size of (F × F) and a stride of (S), then the output size of (W × W) is given by:
$$W = \left\lfloor \frac{H - F}{S} \right\rfloor + 1 \qquad (2.8)$$
The hyperparameters of each convolutional layer are the filter size, the number of learnable filters, and the stride. These hyperparameters must be chosen carefully in order to obtain good performance.
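As a quick check of Equation (2.8), the following small helper computes the output size for a given input, filter size, and stride; the function name and example values are illustrative only.

```python
def conv_output_size(h, f, s):
    """Output width of a convolution per Equation (2.8): W = floor((H - F) / S) + 1."""
    return (h - f) // s + 1

# e.g., a 32x32 input convolved with a 5x5 filter at stride 1 gives a 28x28 feature map
assert conv_output_size(32, 5, 1) == 28
# the same input with stride 2 gives a 14x14 feature map
assert conv_output_size(32, 5, 2) == 14
```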
Pooling splits the inputs into disjoint regions with a size of (R × R) to produce one output
from each region [38]. Pooling can be max or average based. If a given input with a size
of (W × W) is fed to the pooling layer, then the output size will be obtained by:
$$P = \left\lfloor \frac{W}{R} \right\rfloor \qquad (2.9)$$
During the forward phase, the maximum value of each non-overlapping block of the input is selected and passed to the output.
Max pooling does not have any learnable parameters. During the backward phase,
the gradient from the next layer is passed back only to the neuron that achieved the max
value; all of the other neurons receive zero gradient.
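The following minimal NumPy sketch illustrates non-overlapping max pooling and its backward pass; it also records the argmax "switch" positions, which are the same bookkeeping later reused by deconvnet unpooling. The function names and loop-based implementation are assumptions for clarity, not an optimized version.

```python
import numpy as np

def maxpool_forward(x, r):
    """r x r non-overlapping max pooling; records the argmax 'switch' positions."""
    h, w = x.shape
    ph, pw = h // r, w // r
    out = np.zeros((ph, pw))
    switches = np.zeros((ph, pw, 2), dtype=int)
    for i in range(ph):
        for j in range(pw):
            block = x[i*r:(i+1)*r, j*r:(j+1)*r]
            k = np.unravel_index(np.argmax(block), block.shape)
            out[i, j] = block[k]
            switches[i, j] = (i*r + k[0], j*r + k[1])
    return out, switches

def maxpool_backward(grad_out, switches, input_shape):
    """Backward pass: each upstream gradient flows only to the max location;
    all other positions receive zero gradient."""
    grad_in = np.zeros(input_shape)
    ph, pw = grad_out.shape
    for i in range(ph):
        for j in range(pw):
            r_idx, c_idx = switches[i, j]
            grad_in[r_idx, c_idx] += grad_out[i, j]
    return grad_in

x = np.random.randn(4, 4)
y, sw = maxpool_forward(x, 2)
g = maxpool_backward(np.ones_like(y), sw, x.shape)
```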
The top layers of CNNs are one or more fully-connected layers similar to a feed-
forward neural network, which aims to extract the global features of the inputs. Units of
these layers are connected to all of the hidden units in the preceding layer. The outputs of a fully-connected layer are computed as:

$$x^{(l)} = f\left(w^{(l)} \cdot x^{(l-1)} + b^{(l)}\right)$$

where $\cdot$ is a dot product, $x^{(l-1)}$ is the output of the previous layer, and $x^{(l)}$, $w^{(l)}$, and $b^{(l)}$ denote the activations, weights, and biases of the current layer $(l)$, respectively. During the backward phase, the gradient is calculated with respect to the weights and biases, as in the convolutional layer.
The fully-connected layer has only one hyperparameter, which is the number of
neurons (the number of learnable parameters connecting the input to the output).
The last layer is a softmax classifier, which estimates the posterior probability of
each class label over K classes as shown in Equation (2.14) [27].
$$y_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)} \qquad (2.14)$$
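A minimal, numerically stable NumPy version of Equation (2.14) is sketched below; the max-subtraction trick is a standard implementation detail and an addition here, not part of the equation.

```python
import numpy as np

def softmax(z):
    """Posterior class probabilities per Equation (2.14), with the usual
    max-subtraction trick for numerical stability (does not change the result)."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
assert abs(probs.sum() - 1.0) < 1e-9   # probabilities over K classes sum to one
```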
In this dissertation, our learning algorithm for the CNN ($\Lambda$) is specified by a vector of hyperparameters $\lambda$, defined as follows:

$$\lambda = \left( \left(\lambda_1^i, \lambda_2^i, \lambda_3^i, \lambda_4^i\right)_{i=1,\dots,M_C},\ \left(\lambda_5^j\right)_{j=1,\dots,N_F} \right) \qquad (2.15)$$

where $\Psi$ defines the domain for each hyperparameter, $M_C$ is the number of convolutional layers, and $N_F$ is the number of fully-connected layers (i.e., the depth $= M_C + N_F$). For each convolutional layer $i$, four hyperparameters must be identified: $\lambda_1^i$ is the number of filters, $\lambda_2^i$ is the filter size (receptive field size), and $\lambda_3^i$ defines the pooling locations and sizes. If $\lambda_3^i$ is equal to one, this means there is no pooling layer placed after convolutional layer $i$; otherwise, there is a pooling layer after convolutional layer $i$ and the value $\lambda_3^i$ defines the pooling region size. $\lambda_4^i$ is the stride step, and $\lambda_5^j$ is the number of units in fully-connected layer $j$. We also use $\ell(\Lambda, T_{TR}, T_V)$ to refer to the validation loss (e.g., classification error) obtained when we train model $\Lambda$ with the training set ($T_{TR}$) and evaluate it on the validation set ($T_V$).
The goal of our framework is to find the best hyperparameters $\lambda^*$ that design the architecture for a given dataset automatically. The main structural hyperparameters are described below:
Depth: defines the number of convolutional layers ($M_C$) and the number of fully-connected layers ($N_F$).
Filter size: the height and width of each filter. Generally, the sizes of the filters are small.
Number of filters: defines output volume and controls the number of learnable filters
connected to the same region of the input volume. Each filter detects a different feature in
the input.
Pooling layer location: this defines whether the current convolutional layer is followed by
a pooling layer.
Pooling region size: in deep learning frameworks such as Keras and TensorFlow, the hyperparameters of the pooling layer are the filter size and the stride. In our work, the pooling region size is equivalent to the filter size, and we always assume that the stride is equal to the filter size, which means the pooling regions do not overlap.
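The sketch below shows one possible way to encode the hyperparameter vector of Equation (2.15) in code, e.g. as a candidate vertex for a derivative-free optimizer; the concrete values, tuple layout, and helper name are illustrative assumptions.

```python
# Illustrative encoding of the hyperparameter vector of Equation (2.15):
# each convolutional layer i contributes (lambda1..lambda4) and each
# fully-connected layer j contributes lambda5. Values below are arbitrary examples.
conv_layers = [
    # (num_filters, filter_size, pool_size (1 = no pooling), stride)
    (32, 5, 2, 1),
    (64, 3, 1, 1),   # no pooling after this layer
    (64, 3, 2, 1),
]
fc_layers = [512, 256]   # number of units in each fully-connected layer

def flatten_lambda(conv_layers, fc_layers):
    """Flatten the architecture description into a single vector lambda."""
    vec = [v for layer in conv_layers for v in layer]
    vec.extend(fc_layers)
    return vec

lam = flatten_lambda(conv_layers, fc_layers)   # depth = M_C + N_F = 3 + 2 = 5
```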
2.4.1 Literature Review on CNN Architecture Design
A traditional approach to choosing an architecture is cross-validation, which runs multiple architectures and selects the best one based on its performance on
the validation set. However, cross-validation can only guarantee the selection of the best
architecture amongst architectures that are composed manually through a large number of trials. Another common approach is grid search, which tries all possible combinations through a manually-defined range for
each hyperparameter. The drawback of a grid search is its expensive computation, which
increases exponentially with the number of hyperparameters and the depth of exploration
desired [40]. Recently, random search [41], which selects hyperparameters randomly in
a defined search space, has reported better results than grid search and requires less
computation time. However, neither random nor grid search use previous evaluations to
select the next set of hyperparameters for testing to improve upon the desired architecture.
Bayesian Optimization (BO) builds a probabilistic model ℳ of the objective function based on the previous evaluations of the objective function f. Popular techniques that implement BO are Spearmint [21], which uses a Gaussian process model for ℳ, and Sequential Model-based Algorithm Configuration (SMAC), which models ℳ with random forests rather than a Gaussian process. According to [44], BO methods are limited because they work poorly when the search space is high-dimensional, and each evaluation of the objective function is computationally expensive. The work in [21] used BO with a Gaussian process to optimize nine
hyperparameters of a CNN, including the learning rate, epoch, initial weights of the
convolutional and fully-connected layers, and the response contrast normalization parameters; these are training hyperparameters and are not related to the CNN architecture itself. Similarly, Ref. [24, 45, 46] optimized continuous hyperparameters such as the learning rate, momentum, and weight decay for each iteration to improve the convergence
speed of backpropagation. In addition, early stopping [52, 53] can be used when the error
rate on a validation set or training set has not improved, or when the error rate increases
for a number of epochs. In [54], an effective technique is proposed to initialize the weights of deep networks. Evolutionary algorithms have also been applied to optimize the hyperparameters of machine learning algorithms. In [25], a genetic algorithm is used to optimize the filter sizes and
the number of filters in the convolutional layers. Their architectures consisted of three
convolutional layers and one fully-connected layer. Since several hyperparameters were
not optimized, such as depth, pooling regions and sizes, the error rate was high, around
25%. Particle Swarm Optimization (PSO) is used to optimize the feed-forward neural
network's architecture design [55]. Soft computing techniques are used to solve different
real applications, such as rainfall and forecasting prediction [56, 57]. PSO is widely used
for optimizing rainfall–runoff modeling. For example, Ref. [58] utilized PSO as well as
extreme learning machines in the selection of data-driven input variables. Similarly, [59]
used PSO for multiple ensemble pruning. However, the drawback of evolutionary
algorithms is that the computation cost is very high, since each population member or
particle is an instance of a CNN architecture, and each one must be trained, adjusted, and evaluated. Earlier approaches optimized the number of units only for the fully-connected layers of artificial neural networks.
Recently, interest in architecture design for deep learning has increased. The
proposed work in [61] applied reinforcement learning and recurrent neural networks to
explore architectures, which have shown impressive results. Ref. [62] proposed a method that sequentially determines the type of each layer and its hyperparameters. Ref. [63] used a genetic algorithm to evolve CNN architectures, managing problems in filter sizes through zeroth-order interpolation. Each experiment
was distributed to over 250 parallel workers to find the best architecture. Reinforcement
learning, based on Q-learning [64], was used to search the architecture design by
discovering one layer at a time, where the depth is decided by the user. However, these
promising results were achieved only with significant computational resources and a long
execution time.
Visualization techniques can be used on the feature maps to monitor the evolution of features during training and thus discover problems in
a trained CNN. As a result, the work presented in [14] visualized the second layer of the
AlexNet model, which showed aliasing artifacts. They improved its performance by
reducing the stride and kernel size in the first layer. However, potential problems in the
CNN architecture are diagnosed manually, which requires expert knowledge. The
2.5 Regularization
Overfitting occurs when a model fits the training data well, but its error on unseen data does not reduce. There are several techniques to combat this problem, including L1, L2 weight
decay, KL-sparsity, early stopping, data augmentation, and dropout. Dropout has proven
itself an effective method to reduce overfitting due to its ability to provide a better
generalization on the testing set. Because dropout is such a powerful technique, it has become widely adopted.

Dropout [11] is a powerful technique for regularizing fully-connected layers within neural networks or CNNs. The idea of dropout is that each neuron is selected randomly with probability p to be dropped (its activation is set to zero) for each training case. This helps to prevent hidden neurons from co-adapting with each other too much, forcing the model to learn from a subset of hidden neurons. The error is back-propagated only through the remaining neurons that are not dropped. On the other hand, we can look at dropout as training an ensemble of many sub-networks that share weights.
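A minimal NumPy sketch of dropout applied to a layer's activations is given below; the "inverted" rescaling by 1/(1-p) is one common variant and an assumption here, chosen so that no change is needed at test time.

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: each unit is dropped with probability p during training,
    and the surviving activations are rescaled so no change is needed at test time."""
    if not training or p == 0.0:
        return activations
    mask = (np.random.rand(*activations.shape) >= p)   # keep with probability 1 - p
    return activations * mask / (1.0 - p)

h = np.random.randn(4, 8)              # activations of a fully-connected layer
h_train = dropout(h, p=0.5)            # roughly half the units are zeroed out
h_test = dropout(h, training=False)    # identity at test time
```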
Early stopping [52, 53] is a kind of regularization that helps to avoid overfitting
by monitoring the performance of the model on the validation set. Once the performance
on the validation dataset decreases or saturates for a number of iterations, training is stopped.
Weight initialization [18] is a critical step in CNNs that influences the training
process. In order to initialize the model’s parameters properly, the weights must be within
a reasonable range before the training process begins. As a result, this will make the
convergence faster. Several weight initialization methods have been proposed, including
random initialization, naive initialization, and Xavier initialization. The two most widely used are described below.

Naive initialization: the weights are initialized from a Gaussian distribution with zero mean and a small standard deviation.

Xavier initialization [54] has become the default technique for weight initialization because it makes the network converge much faster than other approaches. The weight sets it produces are also more consistent than those produced by other techniques.
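The sketch below shows the uniform variant of Xavier (Glorot) initialization for one weight matrix; the uniform formulation and the layer sizes in the example are assumptions, not necessarily the exact variant used in the dissertation.

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot initialization (uniform variant): weights are drawn from
    U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)), so the variance of
    activations and gradients stays roughly constant across layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

W1 = xavier_init(784, 256)   # e.g. first fully-connected layer of an MNIST model
b1 = np.zeros(256)           # biases are commonly initialized to zero
```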
There are three main methods for understanding and visualizing CNNs as follows:
Visualizing the layer activations: this shows the activations of the feature maps during the forward pass. The drawback, however, is that some activation maps’
outputs are zero for the input images, which indicates dead filters. Additionally,
the size of the activation maps is not equal to the input image, especially in higher
layers.
Retrieving images that maximally activate a neuron: this strategy feeds a large
set of images through the network and then keeps track of which images maximize
the activations of the neurons. However, the limitation of this technique is that the
ReLU activation function does not always have semantic meaning by itself. This
method can involve a high computational cost to find the images that maximize each activation.

Visualization via deconvnet: this technique maps the activations of a given feature map back down to the input pixel space for a trained CNN. This results in a reconstructed
image the same size as the input image. It contains the regions of the input image
that were learned by a given feature map. A visualization similar to the input
image indicates that the CNN architecture learned properly. Since the
reconstructed image is the same size as the input, this allows us to measure the
similarity between the inputs and their reconstructions effectively [67] (details in Section 5.3).
Several methods are used to compare the similarity between two images or
vectors. The most widely used are Euclidean distance, mutual information, and a
correlation coefficient. Each one of these methods has advantages and disadvantages.
The Euclidean distance between two images is based on the sum of the squared intensity differences between corresponding pixels; it is sensitive to noise in the data, so any tiny errors will produce inaccurate results. The Euclidean distance between two vectors p and q is given by:
$$d(p, q) = \sqrt{\sum_{i=1}^{y} (p_i - q_i)^2} \qquad (2.17)$$
Mutual information measures the amount of information that one image contains about the other. The value of mutual information will be large when the similarity between a pair of images is high. The mutual information between two images X and Y is given by:
$$MI(X, Y) = \sum_{x=0}^{255} \sum_{y=0}^{255} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \qquad (2.18)$$
where p(x) and p(y) are the marginal distributions of X and Y respectively, and p(x, y) is their joint distribution. Mutual information detects the non-linear dependence between two images. The drawback of mutual information is its relatively high computational cost.

Correlation distance measures the linear dependence between two images; the Fast Fourier Transform can be used to compute it efficiently (see Section 3.1.3).
A Sparse Autoencoder (SA) is an unsupervised neural network that learns to approximate the identity function f(x) = x by making the target outputs equal to the inputs during training, thereby learning a useful representation of the data. SA includes encoder and decoder steps. The encoder maps the input vector x to a hidden representation y:
$$y = f(Wx + b) \qquad (2.19)$$
The decoder maps the hidden representation y back into a reconstruction z of the input:

$$z = f(W' y + b) \simeq x \qquad (2.20)$$
The parameters are learned by minimizing the average reconstruction error $L(z, x) = \lVert z - x \rVert^2$. SA imposes sparsity on many hidden neurons' outputs to make them zero or close to zero in order to discover an interesting feature representation and to remove redundant and noisy information from the inputs
[69].
The first term of the cost function in Equation (2.21) is an average sum-of-squares error, which describes the discrepancy between the inputs $x^{(i)}$ and their reconstructions $z^{(i)}$ over the entire set of training samples. The second term is weight decay, which is a
regularization technique for preventing overfitting. The last term is an extra penalty term that provides a sparsity constraint, where ρ is the sparsity parameter and typically takes a
small value, n is the number of neurons in the hidden layer, and the index j sums over the hidden units. The penalty is the Kullback-Leibler (KL) divergence between $\hat{\rho}_j$, which is the average activation (averaged over the training data) of hidden unit j, and the sparsity parameter ρ:
$$KL(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \qquad (2.22)$$
Typically, ρ is set to a small value (e.g., 0.05). Therefore, we would like the average activation of each hidden neuron j to be close
to 0.05. To satisfy this constraint, the hidden unit’s activations must mostly be near zero.
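The sketch below computes the KL sparsity penalty of Equation (2.22) and a sparse-autoencoder-style cost that combines the three terms described above; the weighting coefficients (weight_decay, beta) and function names are assumptions, not necessarily the dissertation's notation in Equation (2.21).

```python
import numpy as np

def kl_sparsity(rho, rho_hat):
    """KL divergence of Equation (2.22), summed over the hidden units."""
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)   # avoid log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparse_ae_cost(x, z, W, rho_hat, rho=0.05, weight_decay=1e-4, beta=3.0):
    """Sparse autoencoder cost: reconstruction error + weight decay + sparsity penalty."""
    recon = np.mean(np.sum((z - x) ** 2, axis=1))   # average sum-of-squares error
    decay = 0.5 * weight_decay * np.sum(W ** 2)     # L2 weight decay
    sparsity = beta * kl_sparsity(rho, rho_hat)     # KL penalty (Eq. 2.22)
    return recon + decay + sparsity
```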
Figure 2.2. Sparse autoencoder structure: the number of units in the input layer is equal to the number of units in the output layer.
2.10 Analysis of optimized instance selection algorithms on large datasets with CNNs

It is common for a training set to contain instances that are not useful. Therefore, it is often possible to obtain acceptable performance and reduce the training time without these non-useful instances; this process is called instance selection [70, 71].
Instance selection aims to choose a subset ($T_S$) from the training set ($T_{TR}$), where $|T_S| < |T_{TR}|$, that achieves acceptable accuracy without significant performance degradation compared to using the entire training set ($T_{TR}$). This training set might
contain superfluous instances which can be redundant or noisy. Removing these instances does not degrade performance. In addition, reducing the training set will shrink the amount of computation and memory storage, especially if the training set is large with high dimensionality. Each instance
in a training set can be either a border instance or an interior instance. A border instance is one whose k nearest neighbors (k-NN) belong to other classes; such instances usually lie closer to the decision boundary. An interior instance is one whose k nearest neighbors belong to the same class [73].
Instance selection techniques can be categorized into three groups: condensation, edition, and hybrid [74]. Condensation techniques aim to retain border instances. This makes the accuracy over the training set high, but it might reduce the generalization
accuracy over the testing set. The reduction rate is high in condensation methods [75].
Edition methods aim to discard the border instances and retain interior instances.
Consequently, this leads to a smoother decision boundary between classes, which can
improve the classifier accuracy over the testing set. Finally, hybrid methods seek to
choose a subset of the training set containing both border and interior instances to maintain or improve the generalization accuracy. In terms of search direction, instance selection algorithms can be incremental, decremental, or mixed. Incremental methods start with an empty subset $T_S = \emptyset$, and add each instance to $T_S$ from $T_{TR}$
if it qualifies for some criteria. Decremental search starts with 𝑇𝑆 =𝑇𝑇𝑅 and removes any
instance 𝐼𝑖𝑚𝑔 from 𝑇𝑆 if it does not fulfill specific criteria. Mixed search starts with pre-
selected subset $T_S$ and can iteratively add or remove any instance that meets the specific criteria
[71, 74].
All instance selection methods work under the following assumption: 𝑇𝑇𝑅 is the
training set; 𝐼𝑖𝑚𝑔_𝑖 is i-th instance in 𝑇𝑇𝑅 . Subset 𝑇𝑆 is selected from 𝑇𝑇𝑅 ; 𝐼𝑖𝑚𝑔_𝑗 is j-th
instance in $T_S$. $T_V$ is the validation set. $T_{TS}$ is the testing set. Typically, the accuracy of the subset is measured with a 1-NN classifier, and the Euclidean distance function is used to calculate the similarity between two instances q and p.

One of the earliest algorithms proposed for instance selection is an incremental method that starts with adding one instance of
each class to the subset $T_S$ randomly from the training set $T_{TR}$. Then, each instance $I_{img}$ in $T_{TR}$ is classified using the instances in $T_S$. If the instance $I_{img}$ is incorrectly classified, it will be added to $T_S$. This guarantees all instances in $T_{TR}$ are classified correctly. Based on this criterion, noisy instances will be retained because they are commonly classified incorrectly.
The Edited Nearest Neighbor (ENN) [78] algorithm starts with 𝑇𝑆 =𝑇𝑇𝑅 , and then
each instance 𝐼𝑖𝑚𝑔 in 𝑇𝑆 is removed from 𝑇𝑆 if it does not agree with the majority of k-NN
(e.g., k = 3). The ENN discards noisy instances as well as border instances to yield smoother decision boundaries.

All k-NN [79] belongs to the family of ENN. The algorithm works as follows: for i = 1 to k, each instance misclassified by its k-NN is flagged as bad. Once the loop has ended, all instances flagged as bad are removed.

Skalak [80] exploited the Random Mutation Hill Climbing (RMHC) method [81] to
select the subset $T_S$ from $T_{TR}$. This algorithm has two parameters that should be defined by the user in advance: (1) $N_S$, the size of the subset or training sample $T_S$, and (2) $N_{iter}$, the number of iterations. The algorithm is based on encoding the instances in $T_{TR}$ into a binary string format. Each bit represents one instance, where $N_S$ randomly chosen bits are set to one (they represent $T_S$). For $N_{iter}$ iterations, the algorithm randomly mutates a single bit from zero to one. The accuracy is then computed; if the change increases the accuracy, it is kept, otherwise it is rolled back. This algorithm gives a high chance of increasing the size of $T_S$ beyond $N_S$. A variant instead keeps the size fixed: first, select $N_S$ instances randomly from $T_{TR}$ to represent $T_S$. For $N_{iter}$ iterations, the algorithm
replaces one instance selected randomly from 𝑇𝑆 with instance selected randomly from
(𝑇𝑇𝑅 -𝑇𝑆 ). If the change improves the accuracy on the testing data using 1-NN, the change
will be maintained; otherwise, the change will be rolled back. The size of $T_S$ in this way remains fixed at $N_S$.
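A minimal sketch of this fixed-size RMHC variant with a 1-NN accuracy estimate is given below; the evaluation helper, dataset shapes, and default parameter values are assumptions for illustration.

```python
import numpy as np

def knn1_accuracy(train_X, train_y, test_X, test_y):
    """Accuracy of a 1-NN classifier whose reference set is (train_X, train_y)."""
    correct = 0
    for x, label in zip(test_X, test_y):
        nearest = np.argmin(np.linalg.norm(train_X - x, axis=1))
        correct += (train_y[nearest] == label)
    return correct / len(test_y)

def rmhc_select(X, y, test_X, test_y, n_s=100, n_iter=500, rng=None):
    """Fixed-size RMHC: swap one selected instance for an unselected one and keep
    the swap only if 1-NN accuracy on the evaluation set improves."""
    rng = rng or np.random.default_rng(0)
    selected = rng.choice(len(X), size=n_s, replace=False)
    best_acc = knn1_accuracy(X[selected], y[selected], test_X, test_y)
    for _ in range(n_iter):
        out_pos = rng.integers(n_s)                           # instance to remove
        candidates = np.setdiff1d(np.arange(len(X)), selected)
        new_idx = rng.choice(candidates)                      # instance to add
        trial = selected.copy()
        trial[out_pos] = new_idx
        acc = knn1_accuracy(X[trial], y[trial], test_X, test_y)
        if acc > best_acc:                                    # keep improving swaps only
            selected, best_acc = trial, acc
    return selected
```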
2.11 A Deep Architecture for Face Recognition Based on Multiple
Feature Extractors
In recent decades, face recognition has been widely explored in the areas of
computer vision and image analysis due to its numerous application domains such as
surveillance, smart cards, law enforcement, access control, and information security.
Despite the number of face recognition algorithms that have been developed, face recognition is still a very challenging task with respect to changes in facial expression, illumination, pose, and other changes in facial appearance. Combining Multiple Classifiers (MC) in one system has
become a new direction which integrates many information sources and is likely to enhance the performance. The combination of MC can be applied at two levels: the
decision level and the feature level. The decision level addresses how to combine the
outputs of MC. In the feature level, each classifier produces a new representation that will
be concatenated in a feature vector to be fed into the new classifier [83]. This is the
approach we follow in our work. MC is only useful if combined classifiers are mutually
complementary, and they do not make coincident errors [84]. MC is a very effective
solution for classification problems involving a lot of classes and noisy input data [85].
In general, an MC system has three main topologies: parallel, serial and hybrid
[86]. The design of an MC system consists of two main steps: the classifier ensemble and
fuser. The classifier ensemble defines the selection of combined classifiers to be most
effective. The fuser step combines the results that are obtained by each individual
classifier [87]. The final classification can be greatly improved using deep learning.
In this dissertation, we also employ the power of combining MC and deep learning
to build a system that uses different feature extraction algorithms, namely PCA,
LBP+PCA, and LBP+NN, to ensure that each classifier produces its own basis of representation. A joint feature vector is formed by concatenating the outputs of the above three MCs. This joint feature vector is then fed into a deep SA
with two hidden layers to generate the classification results with probabilistic distribution
to approximate the probability of each class label. We present the results of a series of
experiments for different existing MC systems, replacing different classifiers with SSA, to evaluate the effectiveness of our implementation.
Lu et al [88] combined the basis vectors of PCA, ICA and LDA. The authors used
sum rule as well as RBF network strategies to integrate the outputs of three classifiers,
using matching scores. The outputs of the classifiers are concatenated into one vector to
be used as input to the RBF network to get the final decision. All feature extractors used
in this system are holistic techniques which utilize the information of the entire face to be
projected into subspace. The drawback of this approach is that the classification results
(not features) are combined from each classifier, whereas in this work, we combine the
individual features from different classifiers in a balanced way into a deep learning based
final classifier. Thus, the entire feature information from each individual classifier is preserved until the final result is produced.
Another system combined local image sampling, a self-organizing map (SOM) neural network, and convolutional neural networks
(CNN). SOM is used for reducing dimensionality and invariance to small changes in the
images. CNN provides partial distortion invariance. Replacing SOM by the Karhunen-
Loeve transform produced a slightly worse result. While this approach produces good
results and to some extent is invariant to small changes in the input image, the
classification depends only on features obtained from dimensionality reduction. The best
reported result on the ORL dataset was 96.2%, whereas the approach in this work combines multiple complementary feature extractors before the final classification.
Eleyan and Demirel [90] proposed two systems for face recognition; PCA
followed by a neural network (NN), and LDA followed by an NN. PCA and LDA are used to extract features, which are then classified by the NN, and the performance of LDA+NN is compared with PCA+NN.
Lone et al. [91] developed a single system for face recognition that combines four techniques: PCA, Discrete Cosine Transform (DCT), Template Matching using correlation (Corr), and Partitioned Iterative Function System (PIFS). In
addition, they compared the results by combining two techniques of PCA-DCT, and three
techniques based on PCA-DCT-Corr. The results show that combining four algorithms gives better results than combining two or three.

Liong et al. [92] proposed a new technique, called deep PCA, to obtain a deep
representation of data, which will be discriminant and better for recognition. The
approach comprises two layers, which are whitening and PCA. The outputs of the first layer are fed into the second layer to perform whitening and PCA again.
CHAPTER 3: RESEARCH PLAN
In hyperparameter optimization, the dominant approach to evaluating multiple models is the minimization of the error rate on
the validation set. In this work, we describe a better approach based on introducing a new
objective function that exploits the error rate as well as the visualization results from
feature activations via deconvnet. Another advantage of our objective function is that it
does not get stuck easily in local minima during optimization using NMM. Our approach
obtains a final architecture that outperforms others that use the error rate objective
function alone.
Our framework consists of several components: instance selection, the NMM, deconvolutional networks, the correlation coefficient, and the objective function. NMM guides the CNN architecture search by minimizing the proposed objective function, and web services help to obtain a high-performing CNN architecture for a given dataset. For large datasets, we use instance selection and statistics to determine the optimal, reduced training sample size. The main components of our framework are shown in Figure 3.1. In order to accelerate the optimization process, we employ parallel computation based on web services.
Figure 3.1. General components and a flowchart of our framework for discovering a high-performing CNN
architecture
Training deep CNN architectures with a large training set involves a high
computational time. The large dataset may contain redundant or useless images. In
machine learning, a common approach of dealing with a large dataset is instance
selection, which aims to choose a subset or sample (𝑇𝑆 ) of the training set (𝑇𝑇𝑅 ) to
achieve acceptable accuracy as if the whole training set was being used. Many instance
selection algorithms have been proposed and reviewed in [71]. Albelwi and Mahmood
[76] evaluated and analyzed the performance of different instance selection algorithms
on CNNs. In this framework, for very large datasets, we employ instance selection based
on Random Mutation Hill Climbing (RMHC) [80] as a preprocessing step to select the
training sample (𝑇𝑆 ) which will be used during the exploration phase to find a high-
performing architecture. The reason for selecting RMHC is that the user can predefine
the size of the training sample, which is not possible with other algorithms. We employ
statistics to determine the most representative sample size, which is critical to obtaining
accurate results.
In statistics, calculating the optimal size of a sample depends on two main factors:
the margin of error and confidence level. The margin of error defines the maximum range
of error between the results of the whole population (training set) and the result of a
sample (training sample). The confidence level measures the reliability of the results of
the training sample, which reflects the training set. Typical confidence level values are
90%, 95%, or 99%. We use a 95% confidence interval in determining the optimal size of the training sample.
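As a sketch of how a sample size can be derived from a margin of error and a confidence level, the function below uses the common Cochran-style formula with a finite-population correction; the exact formula used in the dissertation is not shown here, so this is an assumption.

```python
import math

def sample_size(population, margin_of_error=0.05, z=1.96, p=0.5):
    """Cochran-style sample size with finite-population correction.
    z = 1.96 corresponds to a 95% confidence level; p = 0.5 is the most
    conservative proportion assumption."""
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# e.g., a training set of 50,000 images with a 1% margin of error at 95% confidence
print(sample_size(50_000, margin_of_error=0.01))   # roughly 8,000 images
```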
Recently, there has been a dramatic interest in the use of visualization methods to
explore the inner operations of a CNN, which enables us to understand what the neurons
have learned. There are several visualization approaches. A simple technique called layer
activation shows the activations of the feature maps [93] as a bitmap. However, to trace
what has been detected in a CNN is very difficult. Another technique is activation
maximization [94], which retrieves the images that maximally activate the neuron. The
limitation of this method is that the ReLU activation function does not always have a semantic meaning by itself. A third technique, deconvnet [14], shows the parts of the input image that are learned by a given feature map. The deconvnet provides a more informative visualization and also allows us to diagnose potential problems with the architecture
design.
Deconvnet maps the activations of a given feature in higher layers back into the input space of a trained CNN. The
output of deconvnet is a reconstructed image that displays the activated parts of the input
image learned by a given feature map. Visualization is useful for evaluating the behavior of a CNN: a good visualization shows that the model has learnt properly, whereas a poor visualization shows ineffective learning. Thus, it can help tune
the CNN architecture design accordingly in order to enhance its performance. We attach
a deconvnet layer with each convolutional layer similar to [14], as illustrated at the top of
Figure 3.2. Deconvnet applies the same operations of a CNN but in reverse, including
unpooling, a non-linear activation function (in our framework, ReLU), and filtering.
Figure 3.2. The top part illustrates the deconvnet layer on the left, attached to the convolutional
layer on the right. The bottom part illustrates the pooling and unpooling operations [14].
The deconvnet process involves a standard forward pass through the CNN layers
until it reaches the desired layer that contains the selected feature map to be visualized.
In a max pooling operation, it is important to record the locations of the maxima of each
pooling region in switch variables because max pooling is non-invertible. All feature
maps in a desired layer will be set to zero except the one that is to be visualized. Now we
can use deconvnet operations to go back to the input space for performing reconstruction.
Unpooling aims to reconstruct the original size of the activations by using switch
variables to return the activation from the layer above to its original position in the
pooling layer, as shown at the bottom of Figure 3.2, thereby preserving the structure of
the stimulus. Then, the output of the unpooling passes through the ReLU function.
Finally, deconvnet applies a convolution operation on the rectified, unpooled maps with
transposed filters in the corresponding convolutional layer. Consequently, the result of
deconvnet is a reconstructed image that contains the activated pieces of the input that
were learnt. Figure 3.3 displays the visualization of different CNN architectures. As
shown, the quality of the visualization varies from one architecture to another compared
to the original images in grayscale. For example, CNN architecture 1 shows very good
visualization; this gives a positive indication about the architecture design. On the other
hand, CNN architecture 3 shows poor visualization, indicating this architecture has
and also evaluating different architectures with criteria besides the classification error on
the validation set. Once the reconstructed image is obtained, we use the correlation
coefficient to measure the similarity between the input image and its reconstructed image
Figure 3.3. Visualization from the last convolutional layer for three different CNN architectures. Grayscale
input images are visualized after preprocessing.
3.1.3 Correlation Coefficient
The correlation coefficient (Corr) [95] measures the level of similarity between
two images or independent variables. The correlation coefficient is maximal when two
images are highly similar. The correlation coefficient between two images A and B is
given by:
$$Corr(A, B) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{a_i - \bar{a}}{\sigma_a} \right) \left( \frac{b_i - \bar{b}}{\sigma_b} \right) \qquad (3.1)$$
where 𝑎̅ and 𝑏̅ are the averages of A and B respectively, 𝜎𝑎 denotes the standard
deviation of A, and $\sigma_b$ denotes the standard deviation of B. The Fast Fourier Transform (FFT) can be used to compute the correlation coefficient with much higher computational speed as compared to Equation (3.1) [96, 97]. The correlation coefficient between A and B is computed by locating the maximum value of the following expression:

$$\mathcal{F}^{-1}\left( \mathcal{F}(A) \circ \mathcal{F}^{*}(B) \right)$$

where $\mathcal{F}$ denotes the Fourier transform, $\mathcal{F}^{-1}$ its inverse, $\mathcal{F}^{*}$ the complex conjugate, and $\circ$ implies element-by-element multiplication. This approach
reduces the time complexity of computing the correlation from $O(N^2)$ to $O(N \log N)$.
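A minimal NumPy sketch of this FFT-based correlation between an input image and its reconstruction is shown below; the exact normalization and scaling details are assumptions rather than the framework's precise implementation.

```python
import numpy as np

def fft_correlation(a, b):
    """Peak normalized cross-correlation between two equal-size images, computed
    via the FFT (correlation theorem): corr = max( IFFT( FFT(a) * conj(FFT(b)) ) )."""
    a = (a - a.mean()) / (a.std() + 1e-8)     # zero-mean, unit-variance normalization
    b = (b - b.mean()) / (b.std() + 1e-8)
    cross = np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))).real
    return cross.max() / a.size               # scale by the number of pixels

img = np.random.rand(28, 28)
recon = img + 0.05 * np.random.randn(28, 28)  # a good reconstruction scores near 1
print(fft_correlation(img, recon), fft_correlation(img, np.random.rand(28, 28)))
```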
Once the training of the CNN is complete, we compute the error rate (Err) on the
validation set, and choose Nfm feature maps at random from the last layer to visualize their
learned parts using deconvnet. The motivation behind selecting the last convolutional
layer is that it should show the highest level of visualization as compared to preceding
layers. We choose Nimg images from the training sample at random to test the deconvnet.
The correlation coefficient is used to calculate the similarity between the input images
Nimg and their reconstructions. Since each image of Nimg has a correlation coefficient
(Corr) value, the results of all Corr values are accumulated in a scalar value called
($Corr_{Res}$). Algorithm 3.1 summarizes the procedure for training and evaluating a CNN architecture.
Existing works on hyperparameter optimization for deep CNNs generally use the
error rate on the validation set to decide whether one architecture design is better than
another during the exploration phase. Since there is a variation in performance on the
same architecture from one validation set to another, the model design cannot always be judged reliably by the error rate alone. Therefore, we introduce a new objective function computed from the error rate (Err) on the validation set as well as the correlation results ($Corr_{Res}$)
obtained from deconvnet. The new objective function can be written as:
$$f(\lambda) = \eta \left(1 - Corr_{Res}\right) + (1 - \eta)\, Err \qquad (3.3)$$

where $\eta$ is a weighting parameter that balances the contribution of Err and $Corr_{Res}$. The key reason to subtract $Corr_{Res}$ from one is to minimize both terms, turning the search into an optimization problem that needs to be minimized. Therefore, the objective function aims to find a CNN architecture that minimizes the classification error and provides a high level of visualization. We use the NMM to guide our search in a promising direction for minimizing this objective function.
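The following small sketch evaluates Equation (3.3) for a candidate architecture; the example error and correlation values, and the default weight eta, are illustrative assumptions.

```python
def objective(err, corr_res, eta=0.5):
    """Objective of Equation (3.3): smaller is better. 'eta' trades off the error
    rate against the deconvnet reconstruction quality (both in [0, 1])."""
    return eta * (1.0 - corr_res) + (1.0 - eta) * err

# An architecture with 8% validation error and high-quality reconstructions
# scores better (lower) than one with 7% error but poor reconstructions.
print(objective(err=0.08, corr_res=0.90))   # 0.09
print(objective(err=0.07, corr_res=0.60))   # 0.235
```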
The Nelder-Mead Method is a derivative-free technique widely used for solving optimization problems based on the values of the objective function when the derivative information is unknown. NMM uses the concept of a simplex of n+1 vertices in an n-dimensional search space. In this framework, let [Z1, Z2, …, Zn+1] refer to the simplex vertices, where each vertex represents a CNN architecture. The vertices are sorted in ascending order based on the value of the objective function, f(Z1) ≤ f(Z2) ≤ … ≤ f(Zn+1), so that Z1 is the best vertex, which provides the best CNN architecture, and Zn+1 is the worst vertex. NMM seeks to find the best hyperparameters $\lambda^*$ that design a CNN architecture that minimizes the objective function:
$$\lambda^* = \arg\min_{\lambda \in \Psi} f(\lambda) \qquad (3.4)$$
NMM updates the simplex using four operations: reflection, expansion, contraction, and shrinkage, as shown in Figure 3.4. Each is associated with a scalar coefficient (reflection α, expansion γ, contraction ρ, and shrinkage σ). At each iteration, NMM tries to update the current simplex to generate a new simplex which decreases the value of the objective function. NMM replaces the worst vertex with the best one found from the reflected, expanded, or contracted vertices. Otherwise, all vertices of the simplex, except the best, will shrink around the best vertex. These processes are repeated until the stop criterion is met. The vertex producing the
lowest objective function value is the best solution that is returned. The main challenge
is choosing among the many derivative-free optimizers, such as genetic algorithms or the pattern search method, etc. The reason for selecting NMM is that it is faster than other derivative-free
optimization algorithms, because in each iteration, only a few vertices are evaluated.
Further, NMM is easy to parallelize with a small number of workers to accelerate the
execution time.
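The sketch below computes the four NMM candidate updates on a simplex of hyperparameter vectors, matching the formulas used in Algorithms 3.2 and 3.3; the coefficient values (alpha=1, gamma=2, rho=0.5, sigma=0.5) are standard defaults and an assumption here.

```python
import numpy as np

def nmm_candidates(simplex, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5):
    """Given a simplex sorted so that simplex[0] is the best vertex and simplex[-1]
    the worst, compute the reflected, expanded, and contracted candidate vertices,
    plus the shrunken simplex."""
    best, worst = simplex[0], simplex[-1]
    centroid = simplex[:-1].mean(axis=0)              # centroid excluding the worst vertex
    reflected = centroid + alpha * (centroid - worst)
    expanded = reflected + gamma * (reflected - centroid)
    contracted = rho * reflected + (1 - rho) * centroid
    shrunk = best + sigma * (simplex - best)          # every vertex moves toward the best
    return reflected, expanded, contracted, shrunk

# Example: a simplex of 3 vertices in a 2-D hyperparameter space
simplex = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 5.0]])
R, E, Con, S = nmm_candidates(simplex)
```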
During the calculation of any vertex of NMM, we add some constraints to make the output values positive integers. The value of $Corr_{Res}$ is normalized between the minimum and maximum value of the error rate in each iteration of NMM. This is critical for keeping the two terms of the objective function on a comparable scale.
Figure 3.4. Nelder Mead method operations: reflection, expansion, contraction, and shrinkage.
Below, we provide details of our proposed framework based on the serial NMM
and the new objective function in Alg. 3.2 to obtain a good CNN architecture.
14: Scale Corr_Res of f(R)
15: Compute f(R) based on Equation (3.3)
16: Scale values of Corr_Res between the max and min of Err of all vertices of Z; compute f(Zi) based on Equation (3.3) for all vertices
17: If f(B) <= f(R) < f(Zn+1):
18:     Zn+1 = R
19: Else
20:     If f(R) < f(B):
21:         Expansion: E = R + γ(R − C)
22:         Train E according to Alg. 3.1
23:         Scale Corr_Res of f(E)
24:         If f(E) < f(R):
25:             Zn+1 = E
26:         Else:
27:             Zn+1 = R
28:     Else:
29:         b = true
30:         If f(R) <= f(A):
31:             Contraction: Con = ρR + (1 − ρ)C
32:             Train Con according to Alg. 3.1
33:             Scale Corr_Res of f(Con)
34:             If f(Con) < f(R):
35:                 Zn+1 = Con
36:                 b = false
37:         If b = true:
38:             Shrink toward the best vertex:
39:             for i = 2 to n+1:
40:                 Zi = B + σ(Zi − B)
41:                 Train Zi according to Alg. 3.1
42: end while
43: Return Z[1]
3.1.1 Accelerating Processing Time with Parallelism
Since serial NMM executes the vertices sequentially one vertex at a time, the
optimization processing time is very expensive for deep CNN models. For this reason, it
is necessary to utilize parallel computing to reduce the execution time; NMM provides a
high degree of parallelism since there are no dependences between the vertices. In most
iterations of NMM, the worst vertex is replaced with either the reflected, expanded, or contracted vertex; these three candidate vertices can be trained simultaneously on distributed workers. There are two main types of parallelism models: asynchronous and synchronous master-slave models.
Asynchronous master-slave model: The workers never stop to wait for any slower
workers. However, it does not work exactly like a serial NMM. This technique is not
suitable for our work because in some steps in the NMM, there is dependence in the
calculation that requires all workers to stop. In addition, the implementation requires strict
conditions and complex programming to ensure the program works properly. The final
running of several vertices on workers while the master machine controls the whole
optimization procedure. The master cannot move to the next step until all of the workers
finish their tasks. A synchronous NMM has the same properties as a serial NMM, but it
works faster. A serial NMM requires small changes in the implementation to work in a
parallel way.
Web services enable communication between distributed applications written in different programming languages and running
on heterogeneous platforms, such as operating systems and hardware over the internet
[69, 70]. There are two popular methods for building a web service application for machine-to-machine interaction: Simple Object Access Protocol (SOAP) and Representational State Transfer (RESTful). We use RESTful [71] to create web services between the master and the worker machines. The RESTful service submits CNN hyperparameters to worker machines. Each worker
builds and trains the architecture, computes the error rate and the correlation coefficient
results, and returns both results to the master computer. Moreover, when shrinkage is
selected, we run three vertices at the same time. This has a significant impact in reducing
the computation time. Our framework is described in Algorithm 3.3, which details and summarizes the parallel version of the optimization procedure.
14: Z = order the vertices so that f(Z1) ≤ f(Z2) ≤ … ≤ f(Zn+1)
15: Set B = Z1, A = Zn, W = Zn+1
16: Compute the centroid C of all vertices except the worst: C = (1/n) Σ_{i=1..n} Zi
17: Compute reflected vertex:  R = C + α(C − W)
18: Compute expanded vertex:   E = R + γ(R − C)
19: Compute contracted vertex: Con = ρR + (1 − ρ)C
20: Train R, E, and Con simultaneously on workers 1, 2, and 3 according to Alg. 3.1
21: Normalize Corr_Res of R, E, and Con between the max and min of Err of the vertices of Z
22: Compute f(R), f(E), and f(Con) based on Equation (3.3)
23: If f(B) <= f(R) < f(W):
24:     Zn+1 = R
25: Else If f(R) < f(B):
26:     If f(E) < f(R):
27:         Zn+1 = E
28:     Else:
29:         Zn+1 = R
30: Else:
31:     d = true
32:     If f(R) <= f(A):
33:         If f(Con) < f(R):
34:             Zn+1 = Con
35:             d = false
36:     If d = true:
37:         Shrink toward the best vertex:
38:         L = ⌈n/3⌉
39:         For k = 2 to n+1:   # do not include the best vertex
40:             Zk = B + σ(Zk − B)
41:         For j = 1 to L:
42:             Train each group of 3 vertices of (Z2:n+1) in parallel on workers 1, 2, and 3 according to Alg. 3.1
43: Return Z[1]
3.2 Analysis of optimized instance selection algorithms on large datasets with CNNs

In the original RMHC, evaluating the training sample $T_S$ is very expensive. This is because each image in
$T_{TS}$ is classified using 1-NN over all images in $T_S$ to compute the accuracy. However, adding or
removing an image will affect the neighbors of the image that was added or removed in
the testing set 𝑇𝑇𝑆 . Thus, there is no need to calculate the accuracy of all images in the
testing set $T_{TS}$. We improved RMHC to make it work faster than the original RMHC while keeping the same accuracy, as follows. We first define k with a big value (e.g., k = 50). The training sample of size $N_S$ is selected randomly from the training set. In each
iteration of 𝑁𝑖𝑡𝑒𝑟 : choose one image 𝐼𝑖𝑚𝑔1 randomly from 𝑇𝑆 to be removed and replaced
with another image 𝐼𝑖𝑚𝑔2 that selected randomly from (𝑇𝑇𝑅 − 𝑇𝑆 ). Then, we find the k-
NN images of 𝐼𝑖𝑚𝑔1 in the testing set 𝑇𝑇𝑆 and store them in (A). Also, we find the k-NN
images of 𝐼𝑖𝑚𝑔2 in the testing set 𝑇𝑇𝑆 and store them in (B). Now, we compute the
accuracy of A and B before the change (accuracy1) and after the change (accuracy2). If accuracy2 > accuracy1, the change is maintained; otherwise, it is rolled back. This means we compute the accuracy over only 4k images of the testing set instead of the whole testing
set $|T_{TS}|$ (see Algorithm 3.4). The experimental results show that our approach is much faster than the original RMHC while achieving the same accuracy. We also study how reducing the size of the training set affects classification accuracy.
Algorithm 3.4 lists the improved RMHC procedure. Its key steps are: find the k-NN images of $I_{img1}$ and $I_{img2}$ in the testing set $T_{TS}$ and store them in (A) and (B); compute the accuracy of A and B before and after the swap; keep the swap only if the accuracy improves, otherwise roll it back; and finally return $T_S$.
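A hedged sketch of one iteration of this locality-based speedup is given below: instead of re-evaluating the whole testing set after a swap, only the k nearest testing images of the removed and the added instance are re-scored. The helper names and parameter defaults are assumptions.

```python
import numpy as np

def nn1_acc(ref_X, ref_y, qry_X, qry_y):
    """1-NN accuracy of queries against the reference set."""
    hits = sum(ref_y[np.argmin(np.linalg.norm(ref_X - x, axis=1))] == t
               for x, t in zip(qry_X, qry_y))
    return hits / len(qry_y)

def fast_rmhc_step(X, y, test_X, test_y, selected, k=50, rng=None):
    """One iteration of the improved RMHC: evaluate a candidate swap only on the
    k nearest testing images of the removed and of the added instance (about 4k
    accuracy computations in total instead of the whole testing set)."""
    rng = rng or np.random.default_rng()
    out_pos = rng.integers(len(selected))
    img1 = X[selected[out_pos]]                                    # instance to remove
    candidates = np.setdiff1d(np.arange(len(X)), selected)
    new_idx = rng.choice(candidates)
    img2 = X[new_idx]                                              # instance to add
    # Local neighborhoods A and B in the testing set
    A = np.argsort(np.linalg.norm(test_X - img1, axis=1))[:k]
    B = np.argsort(np.linalg.norm(test_X - img2, axis=1))[:k]
    local = np.union1d(A, B)
    before = nn1_acc(X[selected], y[selected], test_X[local], test_y[local])
    trial = selected.copy()
    trial[out_pos] = new_idx
    after = nn1_acc(X[trial], y[trial], test_X[local], test_y[local])
    return trial if after > before else selected                   # keep or roll back
```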
3.3 A Deep Architecture for Face Recognition based on Multiple
Feature Extractors
Our face recognition system combines three different feature extractors such that each yields similar-size feature vectors, i.e., k features
from each approach. This is done so that each feature type has equal importance until the
final classification. The output of the three feature extractors i.e., PCA, LBP+PCA and
LBP+NN is combined to produce a new feature vector of length 3k. Note that the outputs
from the LBP feature extractor are further fed into another PCA and NN to ensure that
each classifier produces k features. The outputs of NN are taken from the neurons in the
hidden layer after training it on the training dataset. The joint feature vector is fed into a Stacked Sparse Autoencoder for the final classification. We present the details of the different modules in our recognition system in the following subsections.
Principal Component Analysis (PCA) is a linear transformation that converts high-dimensional data into low-dimensional data by finding the directions that maximize the variation in the data, while keeping the most significant information.

In PCA, a given training set of M samples $[x_1, x_2, \dots, x_M]$ is transformed into $n \times 1$ vectors. The mean vector $\bar{x}$ is subtracted from each sample, and the matrix
of covariance (C) is used for computing the eigenvalues and eigenvectors. To achieve dimensionality reduction, only the k eigenvectors that correspond to the largest eigenvalues are selected. These eigenvectors are combined into a matrix $U = [u_1, u_2, \dots, u_k]$. Each eigenvector in the U matrix is referred to as an eigenface. We then project each data point x onto the k eigenfaces to reduce the input dimensionality from n to a k-dimensional subspace:
$$P = U^T (x - \bar{x}) \qquad (3.6)$$
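A minimal NumPy sketch of the eigenface computation and the projection of Equation (3.6) follows; the data shapes and function names are illustrative assumptions.

```python
import numpy as np

def pca_eigenfaces(X, k):
    """X: (M, n) matrix of flattened face images. Returns the mean face and the
    top-k eigenvectors (eigenfaces) of the covariance matrix."""
    mean = X.mean(axis=0)
    centered = X - mean
    cov = np.cov(centered, rowvar=False)              # n x n covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    U = eigvecs[:, ::-1][:, :k]                       # k eigenvectors with largest eigenvalues
    return mean, U

def project(x, mean, U):
    """Equation (3.6): P = U^T (x - x_bar)."""
    return U.T @ (x - mean)

X = np.random.rand(40, 64)         # e.g. 40 faces, each flattened to 64 pixels
mean, U = pca_eigenfaces(X, k=10)
features = project(X[0], mean, U)  # 10-dimensional PCA feature vector
```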
Originally, LBP [100] was developed as a texture descriptor. The main idea of
LBP is that each center pixel will be compared with surrounding pixels. If the pixel value
of the center is greater than or equal to its neighbor, then it is given a value of 1, otherwise a value of 0. The resulting bits are combined into a binary number by selecting one neighbor pixel as a start point and then moving in a clockwise direction, as shown in Figure 3.6. The center pixel is then assigned the decimal value of this binary number. For face recognition, the LBP descriptor is computed by dividing the face into (m × n) blocks. Each block generates a local texture descriptor
producing 59 histogram vectors for sampling in a 3×3 grid i.e.., each center pixel has 8
neighbors; referred to as 8 points with radius 1 - P(8, 1). The final vector for each image
is equal to m × n × 59 [18]. To make the feature size of each feature extractor to k features,
the outputs of LBP are further fed to a PCA, and separately to an NN to make each
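The descriptor can be sketched as follows. The sketch follows the comparison convention stated above (bit = 1 when the center is greater than or equal to the neighbor) and the 59-bin uniform-pattern histogram for P(8, 1); block sizes and any other details of the actual implementation are assumptions of this sketch.

import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise

def lbp_code(img, r, c):
    # Compare the center pixel with its 8 neighbors (radius 1) and build a binary code.
    center = img[r, c]
    bits = [1 if center >= img[r + dr, c + dc] else 0 for dr, dc in OFFSETS]
    return sum(b << i for i, b in enumerate(bits))

def is_uniform(code):
    # A pattern is uniform if it has at most two circular 0/1 transitions.
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

# The 58 uniform codes get bins 0..57; all non-uniform codes share bin 58 (59 bins in total).
UNIFORM_BIN = {c: i for i, c in enumerate(c for c in range(256) if is_uniform(c))}

def lbp_histogram(block):
    hist = np.zeros(59)
    for r in range(1, block.shape[0] - 1):
        for c in range(1, block.shape[1] - 1):
            hist[UNIFORM_BIN.get(lbp_code(block, r, c), 58)] += 1
    return hist

def lbp_descriptor(img, m, n):
    # Divide the face into m x n blocks and concatenate their 59-bin histograms (length m*n*59).
    bh, bw = img.shape[0] // m, img.shape[1] // n
    blocks = [img[i*bh:(i+1)*bh, j*bw:(j+1)*bw] for i in range(m) for j in range(n)]
    return np.concatenate([lbp_histogram(b) for b in blocks])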
A stacked sparse autoencoder (SSA) consists of multiple layers of SA in which the outputs of each layer are connected to the inputs of the subsequent layer. The outputs of the last hidden layer are fed into the softmax classifier for the classification task to estimate the probability of each class label among the 𝐾 classes. Training the SSA proceeds in two stages. In the first stage, each hidden layer is pre-trained separately via SA in order to learn discriminative features from the input coming from the previous layer. After training a hidden layer using SA, we take the weights and biases of the encoder layer in the SA as the initialization of the corresponding hidden layer of the SSA. Once all weights and biases are initialized via SA, the second stage of SSA training is supervised fine-tuning, which trains the entire network, similar to a traditional neural network, to minimize the prediction error and map the inputs to the desired outputs as closely as possible.
Figure 3.7. The architecture of a stacked sparse autoencoder (SSA) consisting of two hidden layers.
Algorithm 3.4. Training SSA procedure
1: Inputs: joint feature vectors of the database, 𝜌, , h
2: Step 1: pre-training hidden layers:
3: For i = 0 to h:  # number of hidden layers
4:     Initialize (Wi, bi) randomly for hidden layer i
5:     Find the parameters (Wi, bi) of hidden layer i using SA by minimizing Equation 2.21
6: End For
7: Step 2: Training the softmax classifier
8: Train the softmax classifier (the inputs are the outputs of the last hidden layer)
9: Step 3: Fine-tune the whole network with supervised learning
10: Backpropagation with gradient descent
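The listing can be turned into the following runnable Python sketch. It is a simplified illustration of the procedure rather than the actual implementation: the sparsity penalty of Equation 2.21 is omitted (plain sigmoid autoencoders are used), plain batch gradient descent is used throughout, and the default hyperparameters are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=200, seed=0):
    # One-hidden-layer autoencoder trained with batch gradient descent on the squared error.
    rng = np.random.RandomState(seed)
    n_in = X.shape[1]
    W1, b1 = rng.normal(0, 0.05, (n_in, n_hidden)), np.zeros(n_hidden)
    W2, b2 = rng.normal(0, 0.05, (n_hidden, n_in)), np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)                        # encoder
        Xr = sigmoid(H @ W2 + b2)                       # decoder (reconstruction)
        d2 = (Xr - X) * Xr * (1 - Xr)                   # output-layer delta
        d1 = (d2 @ W2.T) * H * (1 - H)                  # hidden-layer delta
        W2 -= lr * H.T @ d2 / len(X); b2 -= lr * d2.mean(axis=0)
        W1 -= lr * X.T @ d1 / len(X); b1 -= lr * d1.mean(axis=0)
    return W1, b1                                       # encoder weights and biases

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def train_ssa(X, y, hidden_sizes, n_classes, lr=0.1, epochs=200):
    # Step 1: greedy layer-wise pre-training; keep each encoder as the initialization.
    layers, A = [], X
    for h in hidden_sizes:
        W, b = train_autoencoder(A, h)
        layers.append([W, b])
        A = sigmoid(A @ W + b)
    # Step 2: train the softmax classifier on the outputs of the last hidden layer.
    rng = np.random.RandomState(0)
    Ws, bs = rng.normal(0, 0.05, (A.shape[1], n_classes)), np.zeros(n_classes)
    Y = np.eye(n_classes)[y]
    for _ in range(epochs):
        P = softmax(A @ Ws + bs)
        Ws -= lr * A.T @ (P - Y) / len(A); bs -= lr * (P - Y).mean(axis=0)
    # Step 3: fine-tune the whole network with supervised backpropagation.
    for _ in range(epochs):
        acts = [X]
        for W, b in layers:
            acts.append(sigmoid(acts[-1] @ W + b))
        delta = (softmax(acts[-1] @ Ws + bs) - Y) / len(X)   # softmax / cross-entropy delta
        gWs, gbs = acts[-1].T @ delta, delta.sum(axis=0)
        delta = (delta @ Ws.T) * acts[-1] * (1 - acts[-1])
        Ws -= lr * gWs; bs -= lr * gbs
        for i in range(len(layers) - 1, -1, -1):
            W, b = layers[i]
            gW, gb = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:
                delta = (delta @ W.T) * acts[i] * (1 - acts[i])
            layers[i] = [W - lr * gW, b - lr * gb]
    return layers, Ws, bs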
CHAPTER 4: IMPLEMENTATION AND RESULTS
4.1.1 Datasets
We evaluate our framework on the CIFAR-10, CIFAR-100, and MNIST datasets. The CIFAR-10 dataset [102] has 10 classes of 32 × 32 color images. There are 50K images for training and 10K images for testing. CIFAR-100 [102] is the same size as CIFAR-10, except the number of classes is 100. Each class of CIFAR-100 contains 500 images for training and 100 for testing, which makes it more challenging. We apply the same preprocessing on both the CIFAR-10 and CIFAR-100 datasets, i.e., we normalize the pixels in each image by subtracting the mean pixel value and dividing by the standard deviation. Then we apply ZCA whitening with an epsilon value of 0.01 for both datasets. Another dataset used is MNIST [33], which consists of the handwritten digits 0–9. Each digit image is 28 × 28 in size, and there are 10 classes. MNIST contains 60K images for training and 10K for testing. The normalization is done similarly to the CIFAR datasets. Most previous works evaluate CNN architecture design using four image classification datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN; however, previous works usually select and test the performance of their approaches on only two or three of these datasets. The advantage of selecting these datasets is that we can compare our results with others instead of re-implementing their approaches.
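The preprocessing described above can be sketched as follows. Whether the mean/standard-deviation normalization is applied per image or over the whole dataset is an assumption of this sketch, as is fitting the whitening transform on the training set only.

import numpy as np

def preprocess(train, test, eps=0.01):
    def standardize(X):
        # Subtract the mean pixel value of each image and divide by its standard deviation.
        X = X.reshape(len(X), -1).astype(np.float64)
        return (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-8)
    Xtr, Xte = standardize(train), standardize(test)
    # ZCA whitening with epsilon = 0.01, fitted on the training set and applied to both sets.
    mean = Xtr.mean(axis=0)
    cov = np.cov(Xtr - mean, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return (Xtr - mean) @ W, (Xte - mean) @ W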
Figure 4.1. The CIFAR-10 dataset; each row shows different images of one class.
Our framework is implemented in Python using the Theano library [103], which automatically computes gradients and allows the use of a GPU to accelerate the computation. During the exploration phase via NMM, we select a training sample TS using the RMHC algorithm, with a sample size based on a margin of error of 1 and a confidence level of 95%. Then, we select 8,000 images.
Training Settings: We use SGD to train the CNN architectures. The learning rate is set to 0.08 for the first 25 epochs and 0.008 for the remaining epochs; these values are selected after a small grid search over different values on the validation set. We set the
batch size to 32 images and the weight decay to 0.0005. The weights of all layers are initialized according to the Xavier initialization technique [54], and the biases are set to zero. The advantage of Xavier initialization is that it makes the network converge much faster than other approaches, and the weight sets it produces are more consistent than those produced by other techniques. We apply ReLU to all layers and employ early stopping, based on the performance on the validation set, to prevent overfitting. Once the error rate increases or saturates for a number of iterations, training stops. Since training a CNN is expensive and some designs perform poorly, early stopping saves time by terminating poor architecture designs early. Dropout [11] is applied to the fully-connected layers, as is common practice. During the exploration phase of NMM, each experiment is run for 35 epochs. Once the best CNN architecture is obtained, we train it on the full training set TTR for a larger number of epochs.
NMM must first construct the initial simplex with n + 1 vertices. However, this number is different for each dataset. In order to define n for a given dataset, we initialize 80 random CNN architectures.
NMM will then initialize a new initial simplex (Z0) with n + 1 vertices. For all datasets, we set the value of the correlation coefficient parameter to 𝜂 = 0.20. We select at random Nfm = 10 feature maps from the last convolutional layer to visualize their learned features and Nimg = 100 images from the training sample to assess the quality of the visualization.
Table 4.1 summarizes the hyperparameter initialization ranges for the initial layers. However, the actual number of convolutional layers is controlled by the size of the input image, the filter sizes, strides, and pooling region sizes according to Equations 2.8 and 2.9. Some CNN architectures may not result in feasible configurations based on the initial hyperparameter selections: after a set of convolutional layers, the size of the feature maps (W) or pooling output (P) may become 1, so the higher convolutional layers are automatically eliminated. Therefore, the depth varies across CNN architectures.
For example, the hyperparameters of the following CNN architecture, which consists of six convolutional layers and two fully-connected layers, are initialized randomly from Table 4.1:
[[85, 3, 2, 1], [93, 5, 1, 1], [72, 3, 2, 2], [61, 7, 2, 1], [83, 7, 2, 3], [69, 3, 2, 3], [715, 554]]
For an input image of size 32 × 32, the framework designs a CNN architecture with only three convolutional layers, because the output size of a fourth layer would be negative. Thus, the remaining convolutional layers from the fourth layer onward are deleted:
[[85, 3, 2, 1], [93, 5, 1, 1], [72, 3, 2, 2], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [715, 554]]
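The elimination rule can be sketched as follows. The sketch assumes the common no-padding forms of the output-size formulas (convolution: floor((W − F)/S) + 1, pooling: floor(W/P)), which may differ in detail from Equations 2.8 and 2.9; under these assumptions it reproduces the example above.

def feasible_depth(input_size, conv_layers):
    # conv_layers: one [num_filters, filter_size, stride, pool_size] entry per layer.
    w, kept, feasible = input_size, [], True
    for nf, f, s, p in conv_layers:
        if feasible:
            w_conv = (w - f) // s + 1                  # convolution output size (no padding)
            if w_conv >= 1:
                w = w_conv // p if p > 1 else w_conv   # pooling output size
                kept.append([nf, f, s, p])
                continue
            feasible = False                           # this layer is infeasible
        kept.append([0, 0, 0, 0])                      # eliminate this and all higher layers
    return kept

# The example above: for a 32 x 32 input, only the first three layers survive.
layers = [[85, 3, 2, 1], [93, 5, 1, 1], [72, 3, 2, 2], [61, 7, 2, 1], [83, 7, 2, 3], [69, 3, 2, 3]]
print(feasible_depth(32, layers))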
We first compare our proposed objective function against the error rate objective function. After initializing a simplex (Z0) of NMM, we optimize the architecture using NMM based on the proposed objective function (error rate as well as visualization). Then, from the same initialization (Z0), we execute NMM based on the error rate objective function alone by setting 𝜂 to zero. Table 4.2 compares the error rates of five experiment runs obtained from the best CNN architectures found using the two objective functions on CIFAR-10 and CIFAR-100, respectively. The results illustrate that our new objective function outperforms the optimization based on the error rate objective function alone. Error rate averages of 15.87% and 40.70% are obtained with our objective function, compared to 17.69% and 42.72% when using the error rate objective function alone, on CIFAR-10 and CIFAR-100 respectively. Our objective function searches for an architecture that minimizes the error and improves the visualization of learned features, which guides the search toward a better sub-region of the space.
Expt. Num.   Error Rate Based on the Error Rate Objective Function   Error Rate Based on Our Objective Function
Results comparison on CIFAR-10
1            18.10%                                                  15.27%
2            18.15%                                                  16.65%
3            17.81%                                                  16.14%
4            17.12%                                                  15.52%
5            17.27%                                                  15.79%
Avg.         17.69%                                                  15.87%
Results comparison on CIFAR-100
1            42.10%                                                  41.21%
2            43.84%                                                  40.68%
3            42.44%                                                  40.15%
4            42.98%                                                  41.37%
5            42.26%                                                  40.12%
Avg.         42.72%                                                  40.70%
Table 4.2. Error rate comparisons between the top CNN architectures obtained by our objective function and the error rate objective function via NMM.
We also compare the performance of our approach with existing methods for designing CNN architectures, such as random search, genetic algorithms, and Bayesian optimization. For CIFAR-10, [102] reported that the best hand-crafted CNN architecture design tuned by human experts obtained an 18% error rate. In [25], genetic algorithms are used to optimize the filter sizes and the number of filters for the convolutional layers; this achieved a 25% error rate. In [104], SMAC is implemented to optimize the number of filters, filter sizes, pooling region sizes, and fully-connected layers with a fixed depth; it achieved an error rate of 17.47%. As shown in Table 4.3, our method outperforms the others with an error rate of 15.27%. For CIFAR-100, we implement random search by picking CNN architectures from Table 4.1, and the lowest error rate that we obtained is 44.97%. In [104], which implemented SMAC, an error rate of 42.21% is reported. Our approach outperforms these methods with a lower error rate.
Table 4.3. Error rate comparison for different methods of designing CNN architectures on
CIFAR-10 and CIFAR-100. These results are achieved without data augmentation.
In each iteration, NMM keeps the best CNN architecture discovered so far that minimizes the value of the objective function; however, it can get stuck in a local minimum. Figure 4.2 shows the value of the best CNN architecture versus the
iteration using both objective functions, i.e., the objective function based on the error rate alone and our new, combined objective function. In many iterations, NMM paired with the error rate objective function is unable to pick a new, better-performing CNN architecture and gets stuck in local minima early. The number of architectures that yield a higher performance during the optimization process is only 12 and 10 out of 25, on CIFAR-10 and CIFAR-100 respectively. In contrast, NMM with our new objective function finds 19 and 17 better-performing architectures on CIFAR-10 and CIFAR-100 respectively, and it does not get stuck early in a local minimum, as shown in Figure 4.2 (b). The main hallmark of our new objective function is that it is based on both the error rate and the correlation coefficient (obtained from the visualization of the CNN via deconvnet), which gives the search more information about each candidate architecture.
Figure 4.2. Objective functions progress during the iterations of NMM. (a) CIFAR-10; (b) CIFAR-100.
We also analyze the best CNN architectures obtained by both objective functions using NMM on CIFAR-10 and CIFAR-100. We took the average of the hyperparameters of the best CNN architectures. As shown in Figure 4.3 (a), the CNN architectures obtained by our objective function are deeper than the architectures obtained by the error rate objective function alone. Moreover, some convolutional layers are not followed by pooling layers. As a result, we found that reducing the number of pooling layers yields a better visualization and results in adding more layers. On the other hand, the architectures obtained by the error rate objective function alone tend to be shallower.
Figure 4.3. The average of the best CNN architectures obtained by both objective functions. (a) The
architecture averages for our framework; (b) The architecture averages for the error rate objective function.
To reduce the execution time, we parallelized the NMM algorithm on three distributed computers. The running time decreases almost linearly (3×) compared to the serial implementation, as shown in Table 4.4. In the parallel NMM model, the master cannot move to the next step until all of the workers finish their jobs, so the other workers wait until the largest CNN architecture finishes training; this creates a minor delay. In addition, the run-time of Corr based on Equation 3.1 is 0.05 seconds, while based on the FFT in Equation 3.2 it is 0.01 seconds.
Table 4.4. Comparison of execution time by serial NMM and parallel NMM for Architecture Optimization.
We also compare our results with state-of-the-art methods and recent works on architecture search on three datasets. As seen in Table 4.5, we achieve competitive results on MNIST, with an error rate of 0.42%. The results on CIFAR-10 and CIFAR-100 are obtained after applying data augmentation. Recent architecture search techniques [61-64] show good results; however, these promising results were only possible with substantial computation resources and long execution times. For example, GA [63] used genetic algorithms to discover the network architecture for a given task; each experiment was distributed over 250 parallel workers. Ref. [61] used reinforcement learning (RL) to tune a total of 12,800 different architectures to find the best on CIFAR-10, and the task took over three weeks. In contrast, our framework uses only three workers and requires tuning an average of 260 different CNN architectures in around one day; it is also possible to run our framework on a single computer. Therefore, our approach is comparable to these methods while requiring significantly fewer resources and less processing time. Some methods, such as Residual Networks (ResNet) [105], achieve state-of-the-art results because their structure differs from a plain CNN. However, it is possible to apply our framework to ResNet to find the best architecture.
Table 4.5. Error rate comparisons with state-of-the-art methods and recent works on architecture design
search. We report results for CIFAR-10 and CIFAR-100 after applying data augmentation and results for
MNIST without any data augmentation.
We also compared random selection against RMHC for the same sample size; we found that RMHC achieved better results by about 2%. We found that our new objective function is very effective in guiding a large search space into a sub-region that yields a final, high-performing CNN architecture design. Pooling layers and large strides show a poor visualization, so our framework restricts the placement of pooling layers: it does not follow every convolutional layer with a pooling layer, and it shrinks the size of the strides. Moreover, this results in increased depth. We found that the framework allows navigation through a large search space without any restriction on depth until it finds a promising CNN architecture. Our framework can also be applied in other problem domains where images and visualization are involved, such as image registration.
Power analysis can be used to calculate the required minimum sample size. We use power analysis to verify that our sample size can attain adequate statistical power, using the result of our first experiment to conduct the analysis.
There are two kinds of statistical hypotheses: the null hypothesis (H0) and the alternative hypothesis (H1). The result of the test is either to accept H0 or to reject H0. The statistical hypothesis of our work is that our proposed objective function performs better than the error rate objective function (Equation 3.3). There are several types of tests; we compute the required sample size as follows:
S = (N × p(1 − p)) / ([(N − 1) × (d² / z²)] + p(1 − p))                 (4.2)
where d is the effect size, p is the required power (commonly 80%), z is the t-value of 𝛼/2, and N is the total size of the training set. In our case, N is 50,000 for CIFAR-10, p = 80%, d = 3%, significance (α) = 5%, level of confidence (1 − α) = 95%, and z = 1.96. According to Equation 4.2, the sample size (S) is calculated as follows:
S = (50,000 × 0.8(1 − 0.8)) / ([(50,000 − 1) × (0.03² / 1.96²)] + 0.8(1 − 0.8)) = 8,000 / 1.4615 ≃ 5474
According to this result, a sample size of 5474 is adequate. The parameters below are then computed:
Mean (X̄) = Σx / n = 0.8473
Variance (σ²) = Σ(x − x̄)² / n = 0.1293
Significance (α) = 0.05
Critical t = 1.653
Standard Error (SE) = σ / √S = 0.3597 / √5474 = 0.00486
t value = (X̄ − H0) / SE = (0.8473 − 0.5) / 0.00486 = 71.46
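The hypothesis test above reduces to the following arithmetic (values taken from the text):

import math

mean, variance, S, H0 = 0.8473, 0.1293, 5474, 0.5
sigma = math.sqrt(variance)            # 0.3597
se = sigma / math.sqrt(S)              # standard error, about 0.00486
t_value = (mean - H0) / se             # about 71.46
critical_t = 1.653                     # one-tailed critical value, alpha = 0.05
print(t_value, t_value > critical_t)   # True, so H0 is rejected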
H0 is rejected when the t value is greater than the critical t. In our case, the t value (71.46) > critical t (1.653); thus, we reject H0 and accept H1. Furthermore, the power analysis shows that a sample of 5474 is sufficient to demonstrate that our proposed method is effective.
4.2 Analysis of optimized instance selection algorithms on large datasets with CNNs
4.2.1 Dataset
We evaluate the instance selection methods on the CIFAR-10 dataset using the CNN shown in Table 4.6 as the base architecture. We split the training set into two parts: 40,000 images for training and 10,000 images for validation. The preprocessing step only normalizes pixel values to the range [0, 1].
Most of the training parameters are similar to Krizhevsky et al. [1]. We trained our network using stochastic gradient descent (SGD) with a mini-batch size of 32 instances and a weight decay of 0.0005. We initialized the learning rate at 0.01 for iteration < 100,000 and 0.001 otherwise (iteration = number of training batches × epoch + mini-batch index). We initialized the weights of all layers from a zero-mean normal distribution with a standard deviation of 0.05, and all biases to 0. The dropout rate is 0.5 for the fully-connected layers, and the ReLU activation function is used for all layers. The total number of epochs for training the network is 120. All inputs were normalized to [0, 1]. Finally, the last layer is a softmax layer. The code for the different instance selection methods and the CNNs is written in Python.
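The learning-rate schedule amounts to the following rule (a direct restatement of the definition above):

def learning_rate(epoch, minibatch_index, num_training_batches):
    # iteration = number of training batches x epoch + mini-batch index
    iteration = num_training_batches * epoch + minibatch_index
    return 0.01 if iteration < 100000 else 0.001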
Layer No.   Layer type        Feature map size   Kernel size
1           Color image       32×32×3            -
2           Convolutional     28×28×30           5×5
3           Max pooling       14×14×30           2×2
4           Convolutional     10×10×40           5×5
5           Max pooling       5×5×40             2×2
6           Fully connected   200×1              -
7           Softmax layer     10×1               -
Table 4.6. The CNN architecture used in the instance selection experiments.
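For reference, the architecture of Table 4.6 can be written out as the following PyTorch module. This is only an illustration of the layer sizes; the dissertation's implementation uses Theano, and its exact training code is not reproduced here.

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 30, kernel_size=5),    # 32x32x3  -> 28x28x30
    nn.ReLU(),
    nn.MaxPool2d(2),                    # 28x28x30 -> 14x14x30
    nn.Conv2d(30, 40, kernel_size=5),   # 14x14x30 -> 10x10x40
    nn.ReLU(),
    nn.MaxPool2d(2),                    # 10x10x40 -> 5x5x40
    nn.Flatten(),
    nn.Linear(40 * 5 * 5, 200),         # fully connected layer with 200 units
    nn.ReLU(),
    nn.Dropout(0.5),                    # dropout on the fully-connected layer
    nn.Linear(200, 10),                 # 10-way output; softmax is applied via the loss
)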
Table 4.7 shows the accuracy of the different instance selection methods. The results are obtained from the average of five experiments with the same architecture as shown in Table 4.6. The reduction rate depends on the technique. ConNN gives the best accuracy compared to the other instance selection algorithms. It retains noisy instances; this means that CIFAR-10 contains a lot of noisy and border instances, and for this reason the reduction rate is only 25%. In contrast, the reduction rate of ENN is high, but its accuracy is low.
We compared the running time of the original RMHC with our improved approach. Table 4.8 shows the running time of one iteration (replacement) for both approaches. As shown, our approach takes 7.9 seconds per iteration, compared to 150.86 seconds for the original RMHC.
Table 4.8. Running time comparison between the original RMHC and our proposed RMHC approach for one iteration.
Suppose k = 100, the size of the testing set is |𝑇𝑇𝑆| = 10,000, |𝑇𝑆| = 3,000, and 𝑁𝑖𝑡𝑒𝑟 = 200. The total number of operations for the original RMHC is |𝑇𝑇𝑆| × |𝑇𝑆| × 𝑁𝑖𝑡𝑒𝑟 = 10,000 × 3,000 × 200 = 6 × 10⁹, while the total number of operations for our method is 4k × |𝑇𝑆| × 𝑁𝑖𝑡𝑒𝑟 = 400 × 3,000 × 200 = 2.4 × 10⁸. Therefore, the running speed of our approach is 25× faster than the original RMHC. We investigated further to find the relationship between the running speed and the value of k. As shown in Figure 4.4, we found that increasing the value of k reduces the running speed. The running speed of our approach equals that of the original RMHC when k = 2,500.
Figure 4.4. The running speed with different values of k.
In this section, we evaluated several instance selection methods for reducing the training set size in order to shorten the training time. Among the techniques that were evaluated, condensed nearest neighbor gave an acceptable result compared to the result on the entire dataset. We conclude from the comparison that retaining noisy and border instances is more beneficial to CNN accuracy than removing them.
4.3 A Deep Architecture for Face Recognition Based on Multiple Feature Extractors
4.3.1 Datasets
The ORL (Olivetti Research Labs) database of faces [109] is a set of grayscale images, each with a size of 112 × 92 pixels. It consists of 400 images of 40 people. The images were taken under different lighting conditions, with different facial expressions and different facial details, such as whether the subject is wearing glasses. In our experiment, we used the first 5 images of each person for training and the remaining 5 for testing.
The AR face database [110] has color images of 126 individuals (70 men and 56 women). The images have high variation in facial expressions, illumination, and occlusion with scarves and sunglasses. The images were taken in two sessions separated by two weeks, and each session contains 13 images per person. The first session was used for training, whereas the second session was used for testing. We used the same images that were used in [111], in which all images were cropped and resized to 120 × 165 pixels for 50 individuals.
The system was written in Python. We used gradient descent to train the neural networks, the SA, and the SSA. The learning rate is set to 0.01 for all networks, with a weight decay of 0.0005 and a momentum of 0.9. We initialized the weights according to the Xavier initialization technique [54], and the biases are set to zero. The total number of epochs is 200. We apply the sigmoid activation function to all layers. The number of features, k, is the same for each feature extractor.
In some classifiers, such as PCA and LBP, recognition is done by calculating the Euclidean distance (L2) between the feature vector of each image in the testing set and those of all training images. The training image with the minimum distance is chosen as the recognized face.
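This distance-based recognition step is a nearest-neighbor search, sketched below with illustrative names:

import numpy as np

def recognize(test_vec, train_vecs, train_labels):
    # Choose the training image whose feature vector has the minimum Euclidean (L2) distance.
    distances = np.linalg.norm(train_vecs - test_vec, axis=1)
    return train_labels[int(np.argmin(distances))]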
The activation function in our proposed approach is the sigmoid function (with outputs in the range [0, 1]). Therefore, the outputs of PCA and LBP in the first stage are normalized between 0 and 1 before being forwarded into the NN or SSA.
Tables 4.9 and 4.10 show the results from a series of experiments designed to evaluate the effectiveness of our proposed system of combining MC with SSA. Table 4.9 shows the results of different individual classifiers and MC methods. The results clearly show that our approach provides the highest accuracy compared to the other techniques. After extracting feature vectors of length 3k, we also investigated whether we could replace some popular classifiers, such as NN, Support Vector Machine (SVM) [112], and logistic regression [113], with SSA as the classifier in the last stage. This comparison is necessary to show the effectiveness of SSA as a classifier in the last phase compared to other classifiers. It can be seen in Table 4.10 that SSA outperforms the other classifiers, with an accuracy of 98% and 81.67% on the ORL and AR databases respectively. Figure 4.5 summarizes Tables 4.9 and 4.10 by comparing our approach to all individual classifiers and MC systems.
Classifying test images based on the joint feature vector extracted from the three different extraction techniques using the Euclidean distance shows low accuracy. This means the joint feature vectors do not provide diverse features across different classes. Thus, SSA can detect new features from these joint feature vectors in a hierarchical way to make the classes more separable. This improvement comes from two factors: 1) unsupervised pre-training by the sparse autoencoder extracts more useful features from the joint feature vectors and initializes the weights, and 2) supervised training in the next step modifies the boundaries between the classes and minimizes the classification error.
Table 4.9. Performance of different classifiers on the ORL and AR databases, including individual classifiers
and MC systems.
Table 4.10. Performance of classifiers when the last-stage classifier is replaced with SSA.
In summary, we propose combining the features obtained from diverse approaches into a deep architecture for the final classification. Even though we present results obtained from combining features from the PCA and LBP approaches, our technique and implementation are open to accepting features from any new or existing feature extraction approach. The key is to concatenate the features from the individual techniques in such a way that no single technique biases the result in its favor. We accomplish this by ensuring that the features from all individual techniques have the same dimensionality and are scaled between 0 and 1. The combined features are fed to a stacked sparse autoencoder to classify the results correctly. If a feature extraction technique results in a relatively larger (or smaller) number of features, it can be reduced (or expanded) to k features in the same way that we have provided. In our detailed testing on face datasets, we accomplish better results than individual classifiers and other MC combination methods.
Figure 4.5. Performance of all proposed classifiers in Tables 4.9 and 4.10 (PCA, LBP, LBP+PCA, LBP+NN, the joint feature vector, MC+SVM, MC+NN, MC+Logistic regression, and our system MC+SSA) on the AR and ORL databases.
CHAPTER 5: CONCLUSIONS
CNNs have achieved remarkable success in a variety of applications in computer vision and speech processing. Despite this, it is still challenging to design a well-performing CNN architecture for a given task.
In previous work on optimization, the error rate alone is used as the objective
function; i.e., the error rate on the validation set is used to decide whether one architecture
design is better than another. In this work, we present a new objective function that
utilizes information from the error rate on the validation set as well as the quality of the
feature visualization obtained from deconvnet. The objective function aims to find a CNN
architecture that minimizes the error rate and provides a high-quality visualization of the learned features. Our experiments show that using the visualization information obtained from deconvnet in addition to the error rate in the objective function produces a superior
performance. The advantage of the proposed objective function is that it does not get
stuck in local minima easily as compared to the error rate objective function. Further, our
new objective function results in much faster convergence towards a better architecture.
To speed up the exploration phase, the error rate in our objective function is evaluated on a representative training sample subset. Our framework is limited by the fact that it cannot be applied outside of the imaging or vision field, because part of our objective function relies on the visualization of learned features. We also evaluated the effectiveness of several instance selection methods in reducing the training set size in order to shorten the running time. In addition, we proposed a new approach that makes RMHC work faster while retaining the same accuracy as the original. In future work, we plan to extend the objective function to take more of the context of the images into account. We will also explore cancellation criteria to discover and avoid poorly performing architectures early.
REFERENCES
[2] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for
scene labeling," IEEE transactions on pattern analysis and machine intelligence, vol.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale
[5] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint
arXiv:1408.5882, 2014.
[7] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture of monkey
striate cortex," The Journal of physiology, vol. 195, no. 1, pp. 215-243, 1968.
[9] Y. Bengio, "Learning Deep Architectures for AI," Foundations and Trends® in
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale
[12] Y. Liu et al., "Application of Deep Convolutional Neural Networks for Detecting
[13] A. Esteva et al., "Dermatologist-level classification of skin cancer with deep neural
818-833, 2014.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition,"
2368-2376, 2015.
[17] K. He and J. Sun, "Convolutional neural networks at constrained time cost," in
arXiv:1512.07108, 2015.
57735-633-2, ed. Beijing, China: AAAI Press, 2013, pp. 1924-1931, 2013.
[23] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep
arXiv:1603.06560, 2016.
[25] S. R. Young, D. C. Rose, T. P. Karnowski, S.-H. Lim, and R. M. Patton, "Optimizing
[26] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, p. 436-444,
2015.
[27] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural
networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-
2329, 2017.
[28] L. Deng, "Three classes of deep learning architectures and their applications: a tutorial
[29] Y. LeCun et al., "Efficient BackProp," presented at the Neural Networks: Tricks of
Networks," Master thesis, Computer science and engineering KTH Royal Institute of
[31] I. Chakroun, T. Haber, and T. J. Ashby, "SW-SGD: The Sliding Window Stochastic
Gradient Descent Algorithm," Procedia Computer Science, vol. 108, pp. 2318-2322,
2017.
[32] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT press
Cambridge, 2016.
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to
document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[34] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images,"
[35] D. Jia, D. Wei, R. Socher, L. Li-Jia, L. Kai, and F.-F. Li, "ImageNet: A large-scale
[36] T. Chen, R. Xu, Y. He, and X. Wang, "A gloss composition and context clustering
based distributed word sense representation model," Entropy, vol. 17, no. 9, pp. 6007-
6024, 2015.
[37] V. Nair and G. E. Hinton, "Rectified linear units improve restricted boltzmann
[38] A. Aldhaheri and J. Lee, "Event detection on large social media using temporal
Conference (CCWC), 2017, Las Vegas, USA, 2017, pp. 1-6: IEEE.
[39] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and
model selection," in Ijcai, vol. 14, no. 2, pp. 1137-1145: Stanford, CA, 1995.
Search and Simulation for Lung Texture Classification in CT Using Hadoop," Journal
[41] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," The
Journal of Machine Learning Research, vol. 13, no. 1, pp. 281-305, 2012.
2015.
[47] L.-W. Chan and F. Fallside, "An adaptive training algorithm for back propagation
networks," Computer speech & language, vol. 2, no. 3-4, pp. 205-218, 1987.
neural network modeling," in Neural Networks: Tricks of the Trade, Springer 1998, pp.
113-132, 1998.
[50] C.-C. Yu and B.-D. Liu, "A backpropagation algorithm with adaptive learning rate and
Neural Networks. IJCNN'02., Honolulu, HI, USA, vol. 2, pp. 1218-1223, 2002.
arXiv:1212.5701, 2012.
conjugate gradient, and early stopping," in NIPS, Denver, CO, USA, pp. 402-408, ,
2000.
[53] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent
processing, icassp 2013, Vancouver, BC, Canada, 2013, pp. 6645-6649, 2013.
[54] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward
[55] B. A. Garro and R. A. Vázquez, "Designing artificial neural networks using particle
[56] K. Chau and C. Wu, "A hybrid model coupled with singular spectrum analysis for
daily rainfall prediction," Journal of Hydroinformatics, vol. 12, no. 4, pp. 458-473,
2010.
[57] W.-c. Wang, K.-w. Chau, D.-m. Xu, and X.-Y. Chen, "Improving forecasting accuracy
of annual runoff time series using ARIMA based on EEMD decomposition," Water
[58] R. Taormina and K.-W. Chau, "Data-driven input variable selection for rainfall–runoff
[59] J. Zhang and K.-W. Chau, "Multilayer Ensemble Pruning via Novel Multi-sub-swarm
Particle Swarm Optimization," J. UCS, vol. 15, no. 4, pp. 840-858, 2009.
[60] P. Kulkarni, J. Zepeda, F. Jurie, P. Pérez, and L. Chevallier, "Learning the Structure
[61] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv
arXiv:1703.00548, 2017.
arXiv:1703.01041, 2017.
[64] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing Neural Network Architectures
[66] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate
on computer vision and pattern recognition, Columbus, Ohio, pp. 580-587, 2014.
mid and high level feature learning," in IEEE International Conference on Computer
[69] A. Ng, "Sparse autoencoder," CS294A Lecture notes, vol. 72, no. 2011, pp. 1-19, 2011.
in Artificial Intelligence and Soft Computing, Berlin, Heidelberg, Springer 2012, pp.
64-72, 2012.
review of instance selection methods," Artificial Intelligence Review, vol. 34, no. 2, pp.
133-143, 2010.
[72] H. Liu and H. Motoda, "Data Reduction via Instance Selection," in Instance Selection
and Construction for Data Mining, H. Liu and H. Motoda, Eds. Boston, MA: Springer
[73] Z. Zhang, F. Eyben, J. Deng, and B. Schuller, "An agreement and sparseness-based
Proc. 5th Int. Workshop Emotion Social Signals, Sentiment, Linked Open Data,
[75] X. Sun and P. K. Chan, "An Analysis of Instance Selection for Neural Networks to
datasets with Deep Convolutional Neural Networks," in IEEE Long Island Systems,
Applications and Technology Conference (LISAT), 2016, Farmingdale, New York, pp.
1-5, 2016.
[77] P. Hart, "The condensed nearest neighbor rule (Corresp.)," IEEE Transactions on
[78] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data,"
IEEE Transactions on Systems, Man and Cybernetics , no. 3, pp. 408-421, 1972.
[79] I. Tomek, "An experiment with the edited nearest-neighbor rule," IEEE Transactions
[80] D. B. Skalak, "Prototype and feature selection by sampling and random mutation hill
USA, 1994.
[81] C. H. Papadimitriou and K. Steiglitz, Combinatorial optimization: algorithms and
[82] A. F. Abate, M. Nappi, D. Riccio, and G. Sabatino, "2D and 3D face recognition: A
survey," Pattern recognition letters, vol. 28, no. 14, pp. 1885-1906, 2007.
[83] C. Y. Suen and L. Lam, "Multiple Classifier Combination Methodologies for Different
Output Levels," in Multiple Classifier Systems, Berlin, Heidelberg, Springer 2000, pp.
52-66, 2000.
[84] M. Woźniak, M. Graña, and E. Corchado, "A survey of multiple classifier systems as
systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, , vol. 16,
Multiple Classifier Systems, Berlin, Heidelberg, Springer 2000, pp. 77-86, 2000.
[87] L. I. Kuncheva, Combining pattern classifiers: methods and algorithms. John Wiley &
Sons, 2004.
[88] X. Lu, Y. Wang, and A. K. Jain, "Combining classifiers for face recognition," in
[90] A. Eleyan and H. Demirel, "PCA and LDA based face recognition using feedforward
Representation, Classification and Security, Istanbul, Turkey, Springer 2006, pp. 199-
206, 2006.
[92] V. E. Liong, J. Lu, and G. Wang, "Face recognition using Deep PCA," in 9th IEEE
American Society for Information Science and Technology, vol. 54, no. 6, pp. 550-560,
2003.
[96] A. Dragomir et al., "Acoustic Detection of Coronary Occlusions before and after Stent
Placement Using an Electronic Stethoscope," Entropy, vol. 18, no. 8, p. 281, 2016.
[97] K. Katoh, K. Misawa, K. i. Kuma, and T. Miyata, "MAFFT: a novel method for rapid
multiple sequence alignment based on fast Fourier transform," Nucleic acids research,
[98] J. A. Nelder and R. Mead, "A simplex method for function minimization," The
[99] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," in. IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, Maui, HI,
[100] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary
[101] O. A. Vătămanu and M. Jivulescu, "Image classification using local binary pattern
operators for static images," in 2013 IEEE 8th International Symposium on Applied
178, 2013.
[102] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny
[103] J. Bergstra et al., "Theano: A CPU and GPU math compiler in Python," in Proc. 9th
Curves," in Proceedings of the Twenty-Fourth International Joint Conference on
[105] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image
[108] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in
[109] A. T. L. Cambridge. (1992-1994). The ORL Database of Faces. Accessed on: May.
[110] A. Martınez and R. Benavente, "The AR face database," Rapport technique, vol. 24,
1998.
[111] A. M. Martínez and A. C. Kak, "Pca versus lda," IEEE Transactions on Pattern
Analysis and Machine Intelligence, , vol. 23, no. 2, pp. 228-233, 2001.
[112] C. J. Burges, "A tutorial on support vector machines for pattern recognition," Data
[113] D. W. Hosmer Jr and S. Lemeshow, Applied logistic regression. John Wiley & Sons,
2004.
[114] M. Daneshzand, J. Almotiri, R. Baashirah, and K. Elleithy, "Explaining dopamine