A Brief Survey and an Application of Semantic Image Segmentation
1 Introduction
With advances in technology, modern camera systems can be deployed in many places,
from mobile phones to surveillance systems and autonomous vehicles, and can obtain very
high quality images at low cost [1]. This increases the demand for systems that can
interpret and understand these images.
The interpretation of images has been approached in various ways for years. However,
the core process of reviewing images to identify objects and assess their importance
remains the same [2]. Learning problems based on visual information are generally
separated into three categories: image classification [3], object localization
and detection [4], and semantic segmentation [5].
Semantic image segmentation is the process of mapping and classifying the natural
world for many critical applications such as autonomous driving, robotic navigation,
localization, and scene understanding. Semantic segmentation, which is pixel-level
labeling for image classification, is an important technique for scene understanding,
because each pixel is labeled as belonging to a given semantic class. A typical urban
scene consists of classes such as street lamp, traffic light, car, pedestrian, barrier
and sidewalk.
Autonomous driving will be one of the revolutionary technologies of the near future
in terms of its impact on the lives of people living in industrially developed
countries [6]. Many research communities have contributed to the development of
autonomous driving systems thanks to the rapidly increasing performance of vision-based
algorithms such as object detection, road segmentation and recognition of traffic
signals. An autonomous vehicle must sense its surroundings and act safely to reach
a given target. Such functionality is carried out by using several types of classifiers.
Approximately up to the end of 2010, the identification of visual phenomena was
formulated as a two-stage problem. The first stage is to extract features from the
image. Extensive efforts have been made to extract features as visual descriptors,
and consequently descriptors obtained by algorithms such as the Scale Invariant
Feature Transform (SIFT) [7], Local Binary Patterns (LBP) [8] and the Histogram of
Oriented Gradients (HOG) [9] have become widely accepted. The second stage involves
using or designing a classifier. Artificial Neural Networks (ANNs) are one of the most
important classifiers. ANNs are not a new approach; their history goes back about 60
years. Until the 1990s, ANNs used in various fields did not provide satisfactory
results on nonlinear problems, so for a certain period there were relatively few
studies on ANNs. In 2006, Hinton et al. [10] showed that deep ANNs could be trained
effectively and achieved successful results, which were soon extended to problems such
as speech recognition. Thus, ANNs came to the fore again in the scientific world. For
a time, researchers thought that ANNs would be the solution to problems in most areas,
but they soon realized that this was not the case for various reasons, such as the
difficulty of training multi-layer ANNs. The researchers therefore turned to new
approaches that find the most accurate class boundaries in feature space and input
space, such as the Support Vector Machine (SVM) [11], AdaBoost [12], and spherical and
elliptical classifiers [13], using the features obtained from the first stage. In
addition to over-detailed class models that facilitate the search for completely
accurate boundaries, methods for transforming the feature space, such as Principal
Component Analysis (PCA) and kernel mapping, have also been developed.
Later, in image recognition competitions such as the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC), ANN-based systems took the lead and began to win
first place every year by a large margin over other systems. As time progressed,
very large amounts of data, especially via the internet, began to be produced and
stored in digital form. Central Processing Units (CPUs) have been too slow to process
this huge amount of data. With the developments in GPU technology, such computations
can be performed much faster by using the parallel computing architecture of the
graphics processor. With this increase in processing power, the use of deeper neural
networks has become widespread in practice. As a result, the term "Deep Learning" has
emerged as a new approach in machine learning.
Deep learning is the family of methods based on ANNs with deep architectures, that is,
with an increased number of hidden layers. At each layer of such a deep architecture,
features of the problem are learned, and these learned features form the input to the
layer above. This creates a structure from the bottom layer to the top layer in which
features are learned from the simplest to the most complex. It is useful to consider
the vision system in the human brain to understand this structure. The signals coming
from the eyes through the nerves are evaluated in a multi-layer hierarchical structure.
At the first layer after the eyes, local and basic features of the image, such as edges
and corners, are determined. By combining these features, details such as mouth and nose
are detected at the next layer, and features covering the overall image, such as faces,
persons and the locations of objects, are determined at the subsequent layers. The
Convolutional Neural Network (CNN) approach, which combines both feature extraction and
classification capabilities in computer vision applications, works in this way.
Deep learning has raised the success of artificial intelligence applications developed
in recent years to very high levels. Deep learning is used in many areas such as
computer vision, speech recognition, natural language processing and embedded systems.
In the ILSVRC, which has been carried out using huge datasets in recent years, the
competitors have turned to CNN approaches and achieved great success [14]. Companies
such as Google [15], Facebook [16], Microsoft [17] and Baidu [18] have recognized the
progress in deep learning and carried out studies on this topic with great investments.
A graphical representation of the search interest in "Deep Learning" on the Google
search engine over the last 5 years is shown in Fig. 1.
Fig. 1. Search interest of the “Deep Learning” on the Google search engine in the last 5 years
[19]
2 Deep Learning
Deep learning is a fast-growing and popular machine learning approach in the field of
artificial intelligence for building models that allow machines to perceive and
understand large quantities of data, such as images and sound. Basically, this approach
is based on deep architectures, which are structurally more complex ANNs. The term deep
architecture refers to ANNs whose number of hidden layers has been increased.
Deep learning algorithms are distinguished from existing machine learning algorithms
in that they need a very large amount of data and hardware with very high computational
power that can handle this data volume. In recent years, the number of labeled images,
especially in the field of computer vision, has increased enormously. The deep learning
approach has attracted much attention thanks to the great progress in GPU-based
parallel computing power. GPUs with thousands of compute cores provide 10 to 100 times
the application performance of CPUs when processing these data [43]. Nowadays, deep
learning has many application areas, mainly automatic speech recognition, image
recognition and natural language processing.
There are many different types of deep learning architecture. Basically, deep learning
architectures can be categorized as in Fig. 2 [44].
2.1.1 Neuron
Just as biological neural networks consist of neurons, ANNs consist of artificial
neurons. The neuron can be considered the basic computational unit of an ANN. Neurons
are also called nodes or units. The structure of an artificial neuron is shown in Fig. 3.
Fig. 3. The structure of an artificial neuron: inputs x1, …, xn with weights w1, …, wn, a bias b, a transfer function, an activation function, and the output y
Inputs are the information coming to a neuron from the external world. They are
determined by the samples from which the network is to learn. Weights indicate the
importance of the information arriving at a neuron and its effect on that neuron.
There is a separate weight for each input. For example, the weight w1 in Fig. 3 shows
the effect of input x1 on the neuron. Whether a weight is large or small does not by
itself mean that it is important or insignificant. The transfer function calculates the
net input arriving at a neuron. Although there are many transfer functions, the most
commonly used is the weighted sum: each incoming value is multiplied by its own weight
and the products are summed. The activation function processes the net input and
determines the output the neuron will generate in response. The generated output is
sent to the external world or to another neuron. If desired, a neuron may also feed its
own output back as an input to itself.
The activation function is usually chosen to be a nonlinear function. Its purpose is to
introduce nonlinearity into the output of the neuron, as in (1). A key characteristic
of ANNs is nonlinearity, which is due to the nonlinearity of the activation functions.
$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$  (1)
An important point when choosing the activation function is that the derivative of the
function should be easy to calculate, so that the computations can be carried out quickly.
In the literature, there are many activation functions such as linear, step, sigmoid,
hyperbolic tangent (tanh), Rectified Linear Unit (ReLU) and threshold functions.
However, sigmoid, tanh and ReLU activation functions are usually used in ANN ap-
plications.
The sigmoid activation function, expressed by (2), is a continuous and differentiable
function. It is one of the most used activation functions in ANNs. This function
generates a value between 0 and 1 for each input value.
$\sigma(x) = \frac{e^{x}}{1 + e^{x}}$  (2)
The tanh activation function, expressed by (3), is similar to the sigmoid activation
function, but its output values range from -1 to 1.
The ReLU activation function, expressed by (4), generates an output by thresholding
each input value at 0. It has the characteristic shown in Fig. 4. Recently, its use in
ANNs has become very popular.
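As a concrete illustration, the following minimal NumPy sketch implements the sigmoid, tanh and ReLU functions described by (2)-(4); the function names and sample inputs are our own choices for the example.

```python
import numpy as np

def sigmoid(x):
    # Eq. (2): maps any real input to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Eq. (3): like the sigmoid, but the output ranges from -1 to 1
    return np.tanh(x)

def relu(x):
    # Eq. (4): negative inputs are thresholded at 0, positive inputs pass unchanged
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))  # values between 0 and 1
print(tanh(x))     # values between -1 and 1
print(relu(x))     # negative inputs become 0
```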
Fig. 6. A deep feedforward ANN model
The ability to learn from an information source is one of the most important features
of ANNs. In multi-layer neural networks, learning takes place by changing the weights
at each step. Therefore, how the weights are determined is important. Since the
information is stored across the entire network, the weight value of a single neuron
is not meaningful by itself; the weights of the whole network should take the most
appropriate values. The process of finding these weights is the training of the
network. In short, the network learns by finding the most appropriate values of the
weights. In addition, there are a number of considerations to be taken into account
when designing multi-layer neural networks, such as the number of hidden layers, the
number of neurons in each hidden layer, reaching the optimal solution in a reasonable
time, and testing the network accuracy [45].
2.2.1 Caffe
The Caffe deep learning framework, originally created by Yangqing Jia, is developed by
Berkeley AI Research (BAIR) and community contributors. Caffe was designed with speed
and modularity in mind [47].
Caffe is often preferred in industrial and academic research applications. The most
important reason for this is its ability to process data quickly: Caffe can process
over 60 million images per day with a single NVIDIA K40 GPU, which makes it one of the
fastest accessible CNN implementations available [47].
2.2.2 Torch
Written in LuaJIT, Torch is a scientific computing framework that provides extensive
support for machine learning algorithms. It is an easy and efficient library because it
is written in LuaJIT and builds on a C/CUDA implementation [48]. The library, which can
use numerical optimization methods, contains various neural network and energy-based
models. It is also open source and provides fast and efficient GPU support.
Torch is constantly being developed and is being used by various companies such
as Facebook, Google and Twitter.
2.2.3 Theano
Theano is a Python library that efficiently defines, evaluates, and optimizes
mathematical expressions involving tensors [49]. Since it is integrated with the NumPy
library, it can easily perform intensive mathematical operations. It also offers the
option to generate dynamic C code, allowing the user to evaluate expressions more
quickly.
2.2.4 TensorFlow
TensorFlow is an open source deep learning library that performs numerical computations
using data flow graphs. It was developed by Google primarily to conduct research on
machine learning and deep neural networks [50]. With its flexible architecture,
TensorFlow allows computation to be deployed to one or more CPUs or GPUs on a server,
mobile or desktop device with a single Application Programming Interface (API).
Popular services such as Snapchat, Twitter, Google and eBay also make use of TensorFlow.
2.2.5 Keras
Keras is a modular Python library built on top of the TensorFlow and Theano deep
learning libraries [51]. These two underlying libraries provide the ability to run on
the GPU or CPU. By making minor changes in the Keras configuration file, it is possible
to use either TensorFlow or Theano in the background.
Keras is very useful because it simplifies the interfaces of TensorFlow and Theano, so
applications can be developed more easily than with these two libraries directly. Keras
is very commonly used in image processing applications.
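As an illustration of how compact Keras code is, the following is a minimal sketch of a small CNN classifier in the Keras 2 style; the layer sizes, input shape and number of classes are arbitrary choices for the example and are not taken from this paper.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# A small CNN: convolution + pooling for feature extraction,
# fully connected layers for classification.
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))                     # regularization, cf. Section 3.1.6
model.add(Dense(10, activation='softmax'))  # 10 classes, arbitrary for the example

model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```

Depending on the backend configured for Keras, the same script runs on TensorFlow or Theano without any change to the model definition.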
2.2.6 DIGITS
In 2015, NVIDIA introduced the CUDA Deep Neural Network library (cuDNN) [52] due to the
growing importance of deep neural networks in both industry and academia, and the great
role of GPUs. In 2016, Jen-Hsun Huang, NVIDIA CEO and founder, introduced the Deep
Learning GPU Training System (DIGITS) at the GPU Technology Conference.
DIGITS is a deep learning GPU training system that helps users to develop and test
CNNs. It supports GPU acceleration using cuDNN to greatly reduce training time, and it
provides a web interface with visualization support for Caffe, Torch and TensorFlow.
DIGITS supports many training tasks, including image classification, semantic
segmentation and object detection. Fig. 7 shows the main console window, where datasets
can be generated from images and prepared for training. In DIGITS, once a dataset is
available, the network model can be configured and training can begin. DIGITS also
provides the necessary tools for network optimization. Network configuration settings
can be followed, and accuracy can be maximized by changing parameters such as biases,
activation functions and layers.
Fig. 7. DIGITS main console (create dataset, choose dataset, configure network, start training)
3 Convolutional Neural Networks
CNNs, introduced by LeCun in 1989 for computer vision applications, are a type of
multi-layer feedforward ANN [53]. Nowadays, CNNs are becoming increasingly popular
among deep learning methods because they can successfully learn models for many
computer vision applications such as object detection, object recognition, and
semantic image segmentation.
CNNs can be thought of as classifiers that extract hierarchical features from raw data.
In a CNN, images are given as input to the network, and learning takes place
automatically with a feature hierarchy created without using any separate feature
extraction method.
3.1 Architecture
In feedforward ANNs, all neurons in a layer are connected to all neurons of the next
layer. Such layers are called fully connected layers. In a CNN, in addition to fully
connected layers, convolution is applied to the input image to generate an output. This
arises from local connectivity: each local region of the input layer is connected to a
neuron in the next layer. Thus, the input image is convolved with each learned filter
of the layer to generate a different feature map. The feature maps become more
insensitive to rotation and distortion by forming increasingly complex generalizations
towards the higher layers. In addition, the feature maps obtained in the convolutional
layer are passed through a pooling layer in order to reduce the spatial dimensionality
while keeping the important features. The final layer is always a classifier that
generates class probability values as output. The final output of the convolutional and
pooling layers is fed into one or more fully connected layers, and the output
prediction is then obtained in the classifier layer, where activation functions such as
Softmax are used.
A simple CNN architecture is a combination of convolutional, pooling and fully
connected layers, as in Fig. 8.
Fig. 8. A simple CNN architecture combining convolutional, pooling and fully connected layers (input, successive feature maps, output). Convolving an input volume of size Di×Hi×Wi with N filters of size f, repeated for each filter, produces an output volume of size Do×Ho×Wo, as given by (5)-(7).
$D_o = N$  (5)

$H_o = \frac{H_i - f + 2p}{s} + 1$  (6)

$W_o = \frac{W_i - f + 2p}{s} + 1$  (7)
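As an illustration of (5)-(7), the following minimal Python sketch computes the output volume of a convolutional layer; here Hi and Wi are the input height and width, f the filter size, p the padding, s the stride and N the number of filters (the example layer parameters are our own, not taken from the paper).

```python
def conv_output_size(h_i, w_i, f, p, s, n_filters):
    """Output volume of a convolutional layer, following Eqs. (5)-(7)."""
    # Integer division is used when the expression does not divide evenly.
    h_o = (h_i - f + 2 * p) // s + 1
    w_o = (w_i - f + 2 * p) // s + 1
    d_o = n_filters
    return d_o, h_o, w_o

# Example: a 224x224 input, 11x11 filters, padding 0, stride 4, 96 filters
# (an AlexNet-like first layer) -> (96, 54, 54)
print(conv_output_size(224, 224, f=11, p=0, s=4, n_filters=96))
```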
Fig. 10 shows a 3×3 filter to be slid over a 5×5 image matrix representing a binary
image. The filter slides from left to right and continues until the end of the matrix.
In this paper, the stride is taken as 1. By sliding the filter in this way, the process
is completed and the final state of the feature map is obtained, as shown in Fig. 10.
Fig. 10. Image matrix and final state of the feature map [56]
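A convolution of this kind can be sketched in a few lines of NumPy. The following illustrative example slides a 3×3 filter over a 5×5 binary image with stride 1; the filter and image values are random rather than those of Fig. 10.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and sum the elementwise
    products at each position, as CNN convolutional layers do."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.randint(0, 2, size=(5, 5))   # a 5x5 binary image
kernel = np.random.randint(0, 2, size=(3, 3))  # a 3x3 filter
print(convolve2d(image, kernel, stride=1))     # 3x3 feature map
```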
Fig. 12. The pooling operation with the 2×2 filter, stride 2, subsampling a 64×224×224 feature map to 64×112×112 [5]
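The pooling operation of Fig. 12 can likewise be sketched in NumPy. The following illustrative max pooling function halves the spatial dimensions of a feature map with a 2×2 window and stride 2.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """2x2 max pooling with stride 2 keeps the largest value of each window,
    halving each spatial dimension of the feature map."""
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out

fmap = np.random.rand(224, 224)
print(max_pool(fmap).shape)  # (112, 112), as in Fig. 12
```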
3.1.3 Rectified Linear Unit Layer
Generally, the outputs of the convolutional layer are fed into activation functions.
The nonlinearity layers proposed for this purpose can be composed of functions such as
sigmoid, tanh and ReLU. ReLU has been found to be more effective in CNNs and is often
preferred [57].
A ReLU layer thresholds negative inputs at 0 and passes positive inputs unchanged, as
described by (8).
$f(x) = \begin{cases} x, & x \ge 0 \\ 0, & \text{otherwise} \end{cases}$  (8)
where x is the input of the ReLU and f(x) is the rectified output.
In the ReLU layer, the operation is performed separately for each pixel value. For
example, if the black areas of the input feature map are represented by negative pixel
values and the white areas by positive values, the output of the ReLU is as shown in
Fig. 13.
3.1.5 Classifier
A classifier is chosen by considering the problem at hand and the data used. In this
paper, the Softmax function is used, which produces predictions over mutually exclusive
classes. For a binary class problem, the Softmax function reduces to logistic
regression. The Softmax function gives the probability in (9) that a certain input
belongs to a certain class c.
$p_c = \frac{e^{s_c}}{\sum_{i=1}^{C} e^{s_i}}$  (9)
where the s_i are the network outputs obtained from the previous layers of the CNN for
each of the C classes. For a single input, the sum of the probabilities over all
classes is always equal to 1. The loss is defined as the negative logarithm of the
Softmax probability of the correct class; this is the cross entropy loss.
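As a concrete illustration of (9), the following NumPy sketch computes the Softmax probabilities for one vector of class scores and verifies that they sum to 1; the score values are arbitrary.

```python
import numpy as np

def softmax(scores):
    """Eq. (9): class probabilities from the raw network outputs s."""
    e = np.exp(scores - np.max(scores))  # shift by the maximum for numerical stability
    return e / np.sum(e)

s = np.array([2.0, 1.0, 0.1])  # example class scores for one input
p = softmax(s)
print(p)        # approximately [0.659, 0.242, 0.099]
print(p.sum())  # always 1.0
```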
3.1.6 Regularization
Overfitting to the training data is a major issue, especially when dealing with deep
neural networks that are powerful enough to memorize the training set on their own.
Overfitting must be avoided, and the methods developed for this purpose are called
regularization methods.
Dropout is a simple and effective regularization strategy integrated into the training
phase. Dropout, introduced in [59], is implemented as dropout layers characterized by a
probability value. A dropout probability of 0.5 is a reasonable default value that has
proven to be sufficiently effective [59].
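The effect of a dropout layer can be sketched as follows. This illustrative NumPy snippet uses the "inverted dropout" formulation, in which the kept activations are rescaled during training so that nothing needs to change at test time; this is an implementation choice of the sketch, not a requirement of [59].

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    """Randomly zero each activation with probability p during training;
    surviving activations are scaled by 1/(1-p) (inverted dropout)."""
    if not training:
        return activations  # at test time the layer is a no-op
    mask = (np.random.rand(*activations.shape) >= p) / (1.0 - p)
    return activations * mask

a = np.ones((4, 4))
print(dropout(a, p=0.5))  # roughly half the entries zeroed, the rest scaled to 2.0
```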
3.2 Training
The learning process in CNNs can be divided into 4 basic steps:
1. Forward computation,
2. Error/loss optimization,
3. Backpropagation,
4. Parameter updates.
In the forward computation, lower layers composed of convolutional or pooling layers
are typically followed by higher fully connected layers. The network returns class
outputs that encode the probability of the input belonging to each class. The outputs
of the network may be unscaled scores, as in the SVM classifier, or negative
log-probabilities, as in the Softmax classifier. For semantic segmentation, a class
output is produced for each pixel in the image.
The class outputs provided by the network are then subjected to optimization by
adjusting the values of the learned parameters, such as the weight filters and biases.
The uncertainty in determining which set of parameters is ideal is quantified by a loss
function, so that training can be formulated as an optimization problem. For each
vector of class outputs s, the cross entropy loss of the Softmax classifier is
calculated as given in (10).
$L_i = -\log\left(\frac{e^{s_c}}{\sum_{j=1}^{C} e^{s_j}}\right)$  (10)
where the term inside the logarithm is the Softmax probability q defined in (9) for the
correct class; (10) is thus the cross entropy between the predicted distribution q and
the true distribution p. The total loss is calculated by (12).
$L = \sum_{i=1}^{N} L_i + \lambda R(W)$  (12)
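As a small illustration of (10) and (12), the following NumPy sketch computes the cross entropy loss of the Softmax classifier for a few samples and adds a regularization term; taking R(W) as the squared L2 norm of the weights is one common choice and an assumption of this sketch.

```python
import numpy as np

def cross_entropy_loss(scores, true_class):
    """Eq. (10): negative log of the Softmax probability of the true class."""
    e = np.exp(scores - np.max(scores))
    probs = e / np.sum(e)
    return -np.log(probs[true_class])

def total_loss(all_scores, labels, weights, lam=1e-4):
    """Eq. (12): data loss summed over the N samples plus lambda * R(W),
    with R(W) taken here as the squared L2 norm of the weights."""
    data_loss = sum(cross_entropy_loss(s, y) for s, y in zip(all_scores, labels))
    reg_loss = lam * np.sum(weights ** 2)
    return data_loss + reg_loss

scores = np.array([[2.0, 1.0, 0.1], [0.3, 2.5, 0.2]])  # outputs for two samples
labels = [0, 1]                                        # their true classes
W = np.random.randn(10, 3)                             # example weight matrix
print(total_loss(scores, labels, W))
```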
Fig. 16. Converting fully connected layers of CNN to convolutional layers (classes such as sky, tree, road and grass) [20]
In a fully connected layer, each output neuron calculates a weighted sum over all input
values, while in a convolutional layer each filter calculates a weighted sum over the
values in its receptive field. Although these operations appear to be the same, they
are identical only when the layer input has the same size as the receptive field. If
the input is larger than the receptive field, the convolutional layer slides the window
over the input and calculates another weighted sum, and this repeats until the input
image has been scanned from left to right and from top to bottom. To replace a fully
connected layer with a corresponding convolutional layer, the size of the filters must
be set to the input size of the layer, and as many filters as there are neurons in the
fully connected layer must be used.
All fully connected layers in the AlexNet architecture can be converted to
corresponding convolutional layers to obtain an FCN architecture. This FCN has the same
number of learned parameters and the same computational complexity as the underlying
CNN. Convolutionalizing a basic CNN brings considerable flexibility: the FCN model is
no longer limited to a fixed input size of 224×224, as in AlexNet. The FCN can process
larger images by scanning them thoroughly, like a sliding window, and the model
generates one probability distribution per 224×224 window rather than a single
probability distribution for the entire input. Thus, the output of the network becomes
a tensor of the form N×H×W,
where N is the number of classes, H is the number of sliding windows (filters) along
the vertical axis, and W is the number of sliding windows along the horizontal axis.
In summary, the first significant step in the design of the FCN is completed by adding
two spatial dimensions to the output of the classification network.
As will be seen in the design of an FCN that generates a class probability distribution
per window, the number of windows depends on the size of the input image, the size of
the window, and the stride used between windows when the input image is scanned.
Ideally, a semantic image segmentation model should generate a probability distribution
per pixel of the image. When the input image passes through the sequential layers of
the convolutionalized AlexNet, coarse features are extracted. The purpose of semantic
image segmentation is to interpolate these coarse features in order to reconstruct a
fine classification for each pixel of the input. This can easily be done with
deconvolutional layers, which perform the inverse operation of their convolutional
counterparts: given the output of a convolutional layer, the deconvolutional layer
finds an input that generates that output. As noted earlier, the stride parameter in a
convolutional or pooling layer is a measure of how far the window is slid when the
input is processed, and hence of how much the output is subsampled. In contrast, the
stride parameter in a deconvolutional layer is a measure of how much the output is
upsampled. The output volume of the deconvolutional layer, Do×Ho×Wo, is calculated
using equations (13), (14) and (15).
$D_o = N$  (13)

$H_o = s(H_i - 1) + f - 2p$  (14)

$W_o = s(W_i - 1) + f - 2p$  (15)
where s is the stride, p the padding, f the filter size, Hi and Wi the input sizes, and
N the number of channels.
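The following minimal Python sketch evaluates (13)-(15). The example parameters (a 7×7 input map, a 64×64 filter, stride 32, padding 16 and 21 channels) are illustrative values chosen so that the output is upsampled back to 224×224; they are not taken from the paper.

```python
def deconv_output_size(h_i, w_i, f, p, s, n_channels):
    """Output volume of a deconvolutional layer, following Eqs. (13)-(15)."""
    h_o = s * (h_i - 1) + f - 2 * p
    w_o = s * (w_i - 1) + f - 2 * p
    return n_channels, h_o, w_o

# A deconvolution with stride 32 upsamples a coarse map roughly 32 times:
# a 7x7 map with a 64x64 filter, stride 32, padding 16 -> 224x224.
print(deconv_output_size(7, 7, f=64, p=16, s=32, n_channels=21))  # (21, 224, 224)
```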
It is important to know by how much the activation of the last convolutional layer in
the FCN architecture must be upsampled to obtain an output of the same size as the
input image. The upsampling layer added to create FCN-AlexNet increases the output of
the previous convolutional layer by a factor of 32. This means that, in practice, the
network makes a single prediction per 32×32 pixel block, which causes the contours of
objects in the image to be segmented only roughly. Fig. 17 shows the FCN-AlexNet
architecture.
The article in [20] presents the idea of a skip architecture to address this
limitation. The skip connections in this architecture redirect the outputs of the
pooling layers pool3 and pool4 of the FCN derived from VGG-16 directly into the
network, as shown in Fig. 18. These pooling layers operate on low-level features and
can capture finer details.
The FCN architectures proposed in [20] are called FCN-8s, FCN-16s and FCN-32s according
to how the skip connections are applied, with the fully connected layers of VGG-16
converted into corresponding convolutional layers. A visualization of these
architectures is shown in Fig. 18.
Fig. 18. The FCN-32s, FCN-16s and FCN-8s architectures (image → conv1–pool5 → conv6-7): the conv7 output is upsampled 32× to give the FCN-32s prediction; 2× upsampled conv7 is combined with pool4 and upsampled 16× for FCN-16s; 4× upsampled conv7 and 2× upsampled pool4 are combined with pool3 and upsampled 8× for FCN-8s
5 Experimental Studies
In this paper, a semantic image segmentation application, which is useful for
autonomous vehicles, was carried out to observe the performance of FCNs in semantic
image segmentation. Four popular FCN architectures were used separately for the
application: FCN-AlexNet, FCN-8s, FCN-16s and FCN-32s.
The applications were implemented using the Caffe framework within the DIGITS platform
on the SYNTHIA-Rand-CVPR16 dataset, and the segmentation performances of the FCN
architectures used in the experimental studies were compared. The studies were carried
out on a desktop computer with a 4th generation Intel® Core i5 3.4 GHz processor, 8 GB
RAM and an NVIDIA GTX Titan X Pascal 12 GB GDDR5X graphics card. Thanks to the CUDA
support of the graphics card, GPU-based parallel computing power was utilized in the
computations required for the application.
Fig. 19. Samples from the SYNTHIA-Rand-CVPR16 dataset: (a) Sample images, (b) Ground
truth images with semantic labels
Table 1. Training parameters

Parameter            Value
Base learning rate   0.0001
Momentum             0.9
Weight decay         10⁻⁶
Batch size           1
Gamma                0.1
Maximum iteration    321780
Stochastic Gradient Descent (SGD) is selected as the solver type and GPU as the solver
mode. The number of epochs is set to 30; an epoch is a single pass through the full
training set. Thus, for 10726 training images and a batch size of 1, one epoch is
completed in 10726 iterations, and the maximum number of iterations for 30 epochs is
321780, as indicated in Table 1.
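This bookkeeping is simple enough to verify directly; the following few lines reproduce the iteration count in Table 1.

```python
train_images = 10726
batch_size = 1
epochs = 30

iterations_per_epoch = train_images // batch_size  # 10726 iterations per epoch
max_iterations = iterations_per_epoch * epochs     # 321780 iterations for 30 epochs
print(max_iterations)
```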
Initially, the FCN-AlexNet model is trained using random weight initialization in
DIGITS, and the results in Fig. 20 are obtained.
Fig. 20. Training/validation loss and validation accuracy when training FCN-AlexNet using
random weight initialization
Fig. 21. A sample visualization of semantic image segmentation in DIGITS with FCN-AlexNet
trained using random weight initialization: (a) Sample image, (b) Ground truth image, (c) Infer-
ence
Fig. 21 shows that the building is the most common object class in the
SYNTHIA-Rand-CVPR16 dataset, and that the network has learned to achieve approximately
35% accuracy simply by labeling everything as building.
There are several commonly accepted ways to improve a network that does not fit the
training set [64]. These are:
• Increasing the learning rate and reducing the batch size,
• Increasing the size of the network model,
• Transfer learning.
Information learned by one deep network can be used to improve the performance of
another network, and this approach is very successful in computer vision applications.
For this reason, transfer learning was used while learning the models required for the
application.
Recent advances in machine learning and computer vision have been achieved in part
through the use of common benchmarks. Training a model does not have to start from
randomly initialized weights. Transfer learning is the reuse of the knowledge that a
network has learned on one dataset to improve the performance of another network [65].
A network is trained on some data and gains knowledge from it, compiled as the weights
of the network, and these weights can be transferred to another network. In other
words, instead of training the network from scratch, learned features can be
transferred to the network.
Transfer learning is often preferred in the computer vision field, since many low-level
features learned by CNNs, such as lines, corners, shapes and textures, can be applied
immediately to any dataset.
Models trained and tested on high-variance standard datasets usually owe their success
to strong features [65]. Transfer learning allows the use of a model with fairly
general weights learned on a large dataset such as ImageNet, and fine-tuning then
adapts the network to the task at hand.
It is very reasonable to transfer learning from an image classification dataset such as
ImageNet, since image segmentation is classification at the pixel level. This process
is quite easy with Caffe. However, Caffe cannot automatically carry the weights from
AlexNet over to FCN-AlexNet, because AlexNet and FCN-AlexNet store their weights in
different formats. Moving these weights can be done using the Python script
net_surgery.py in the DIGITS repository on GitHub. The function of net_surgery.py is to
transfer the weights of the fully connected layers to their convolutional
equivalents [64].
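The essence of such a weight transfer can be illustrated with a NumPy reshape. The sketch below is not the actual net_surgery.py script; the layer dimensions (a fully connected layer mapping a 256×6×6 volume to 4096 outputs) are hypothetical AlexNet-like values chosen for the example.

```python
import numpy as np

# Hypothetical dimensions: a fully connected layer that takes a 256x6x6 input
# volume and produces 4096 outputs, stored as a (4096, 9216) weight matrix.
fc_weights = np.random.randn(4096, 256 * 6 * 6)

# The corresponding convolutional layer uses 4096 filters of size 256x6x6,
# i.e. a (4096, 256, 6, 6) weight tensor; the values are simply rearranged.
conv_weights = fc_weights.reshape(4096, 256, 6, 6)

# Every filter contains exactly the weights of the matching output neuron.
print(conv_weights.shape)                                   # (4096, 256, 6, 6)
print(np.allclose(conv_weights[0].ravel(), fc_weights[0]))  # True
```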
Another possible problem is how to initialize the upsampling layer added to create
FCN-AlexNet, since this layer is not part of the original AlexNet model. In [20], it is
recommended that the corresponding weights be randomly initialized and learned by the
network. Later, however, the authors realized that these weights can simply be
initialized with bilinear interpolation, so that the layer initially acts like a
magnifying glass [64].
As previously mentioned, the FCN-AlexNet model was then trained using a pre-trained
model obtained by adapting the AlexNet model trained on the ImageNet dataset, and the
results in Fig. 22 were obtained.
Fig. 22. Training/validation loss and validation accuracy when training FCN-AlexNet using
pre-trained model
Fig. 22 shows that, using the pre-trained FCN-AlexNet model, the validation accuracy
quickly exceeded 90%, and the model achieved its highest accuracy of 92.4628% in the
29th epoch. This means that 92.4628% of the pixels in the validation set are labeled
correctly by the model obtained in the 29th epoch. This is considerably higher than the
accuracy of the FCN-AlexNet model initialized with random weights.
When tested on sample images using the model obtained in the 29th epoch, the semantic
image segmentation was performed far more satisfactorily, with different object classes
detected as shown in Fig. 23 and Fig. 24. However, it can clearly be seen that the
object contours are very rough.
Fig. 23. A sample visualization of semantic image segmentation in DIGITS with FCN-AlexNet
trained using pre-trained model-1: (a) Sample image, (b) Ground truth image, (c) Inference
Fig. 24. A sample visualization of semantic image segmentation in DIGITS with FCN-AlexNet
trained using pre-trained model-2: (a) Sample image, (b) Ground truth image, (c) Inference
The FCN-8s network is used to further improve the precision and accuracy of the
segmentation model. Using a model pre-trained on the PASCAL VOC dataset, the validation
accuracy of FCN-8s quickly exceeded 94%, as shown in Fig. 25. The model reached its
highest accuracy of 96% in the 30th epoch.
Fig. 25. Training/validation loss and validation accuracy when training FCN-8s using pre-
trained model
More importantly, when the model obtained in the 30th epoch is tested on sample images,
much sharper object contours are obtained, as shown in Fig. 26 and Fig. 27.
Fig. 26. A sample visualization of semantic image segmentation in DIGITS with FCN-8s
trained using pre-trained model-1: (a) Sample image, (b) Ground truth image, (c) Inference
Fig. 27. A sample visualization of semantic image segmentation in DIGITS with FCN-8s
trained using pre-trained model-2 (a) Sample image, (b) Ground truth image, (c) Inference
The FCN-8s architecture has been shown to provide segmentation with sharper object
contours than FCN-AlexNet, which makes predictions in 32×32 pixel blocks, since FCN-8s
can make predictions down to 8×8 pixel blocks. Similarly, models were trained with the
FCN-16s and FCN-32s architectures, and Fig. 28 and Fig. 29 show that their validation
accuracy exceeded 94% as rapidly as that of FCN-8s. The highest validation accuracies,
95.4111% and 94.2595% respectively, were reached in the 30th epoch, as with FCN-8s.
Besides, Fig. 30 shows a comparison of the segmentation inferences of the FCN-AlexNet,
FCN-8s, FCN-16s and FCN-32s architectures on the same selected images.
Fig. 28. Training/validation loss and validation accuracy when training FCN-16s using pre-
trained model
Fig. 29. Training/validation loss and validation accuracy when training FCN-32s using pre-
trained model
Fig. 30. Comparison of segmentation inferences according to the used FCN architectures (FCN-AlexNet, FCN-8s, FCN-16s and FCN-32s) for sample images, shown together with the sample images and ground truth
When the segmentation results are analyzed against the sample and ground truth images
in Fig. 30, it can be seen that the object contours are roughly segmented by the
FCN-AlexNet model. Moreover, the fact that fine details such as poles could not be
distinguished and segmented reveals another limitation of this model for the
application. With the FCN-8s model, in contrast to FCN-AlexNet, object contours are
segmented sharply and the segmentation inferences are more similar to the ground truth
images. Furthermore, the fact that the object classes can be detected completely
indicates that FCN-8s is useful. Although the FCN-16s model is not as sharp as FCN-8s,
it can be seen that the object contours are segmented successfully. Finally, when the
segmentation inferences of the FCN-32s model are analyzed, the segmentations are very
close to those of FCN-AlexNet, but with small differences it may be a somewhat more
useful model.
The training times of the trained models for semantic image segmentation in this
paper are given in Fig. 31.
Fig. 31. Training times of the models
It can be seen that the training time of FCN-AlexNet is considerably lower than those
of the other FCN models, which have very similar training times.
6 Conclusions
For the application, the FCNs were first trained separately and the validation
accuracies of the trained network models were compared on the SYNTHIA-Rand-CVPR16
dataset. Approximately 80% of the images in the dataset were used for the training
phase and the rest were used during the validation phase to assess the validity of the
models. In addition, the image segmentation inferences in this paper are visualized to
see how precisely the FCN architectures used can segment objects.
Maximum validation accuracies of 92.4628%, 96.015%, 95.4111% and 94.2595% are achieved
with the FCN-AlexNet, FCN-8s, FCN-16s and FCN-32s models trained using the weights of
pre-trained models, respectively. Although all four models can be regarded as
successful at first sight, since their accuracies are above 90%, the object contours
are roughly segmented in the segmentation performed with the FCN-AlexNet model. The
impossibility of segmenting some object classes with small pixel areas is another
limitation of the FCN-AlexNet model. The segmentation inferences of the FCN-32s model
are also very close to those of FCN-AlexNet, but somewhat better results can be
obtained with this model. With the FCN-8s model, however, object contours are sharply
segmented and the segmentation inferences are more similar to the ground truth images.
Although the FCN-16s model is not as sharp as FCN-8s, the object contours are
segmented successfully compared with the others.
When the training times of the FCN models are compared, the training time of
FCN-AlexNet is about one-fourth of that of the other FCN models, which have very
similar training times. However, considering that the model is trained only once for
the application, training time does not play a very important role in the choice of the
appropriate model. Therefore, it can be stated that the most suitable model for the
application is FCN-8s.
The experimental results show that FCNs, as deep learning approaches, are suitable for
semantic image segmentation applications. In addition, it has been shown that FCNs are
network structures that can address many pixel-level applications, especially semantic
image segmentation.
References
1. Weisbin, R.C., et al.: Autonomous Rover Technology for Mars Sample Return. Artificial
Intelligence, Robotics and Automation in Space 440, 1 (1999)
2. Colwell, R.N.: History and Place of Photographic Interpretation. Manual of Photographic
Interpretation 2, 33–48 (1997)
3. Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. Book in Preparation for MIT
Press (2016)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Seattle, WA, USA, 770–778 (2016)
5. Karpathy, A.: Convolutional Neural Networks for Visual Recognition, Course Notes,
http://cs231n.github.io/convolutional-networks/ (accessed April 5,
2017)
6. Van Woensel, L., Archer, G., Panades-Estruch, L., Vrscaj, D.: Ten Technologies which
could Change Our Lives. Technical Report, European Parlimentary Research Service
(EPRC), Brussels, Belgium (2015)
7. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International
Journal of Computer Vision (IJCV) 60(2), 91–110 (2004)
8. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution Gray Scale and Rotation Invariant
Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis
and Machine Intelligence 24(7), 971–987 (2002)
9. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1, San Die-
go, CA, USA, 886–893 (2005)
10. Hinton, G.E., Osindero, S., Teh, Y.W.: A Fast Learning Algorithm for Deep Belief Nets.
Neural Computation 18(7), 1527–1554 (2006)
11. Cortes, C., and Vapnik, V.N.: Support Vector Networks. Machine Learning 20(3), 273–
297 (1995)
12. Freund, Y., Schapire, R.E.: A Decision-Theoretic Generalization of On-line Learning and
an Application to Boosting. Journal of Computer and System Sciences 55, 119–139 (1997)
13. Uçar, A., Demir, Y., Güzeliş, C.: A Penalty Function Method for Designing Efficient Ro-
bust Classifiers with Input Space Optimal Separating Surfaces. Turkish Journal of Electri-
cal Engineering and Computer Sciences 22(6), 1664–1685 (2014)
14. Russakovsky, O., et al.: Imagenet Large Scale Visual Recognition Challenge. International
Journal of Computer Vision, Springer Berlin, Heidelberg 115(3), 211–252 (2015)
15. Le, Q.V.: Building High-level Features using Large Scale Unsupervised Learning. Pro-
ceedings of the IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), Vancouver, BC, Canada, 8595–8598 (2013)
16. Taigman, Y., Yang, M., Ranzato, M.A., Wolf, L.: Deepface: Closing the Gap to Human-
level Performance in Face Verification. Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Colombus, Ohio, USA, 1701–1708 (2014)
17. Deng, L., Yu, D.: Deep Learning: Methods and Applications. Foundations and Trends® in
Signal Processing 7, 197–387 (2014)
18. Amodei, D., et al.: Deep Speech 2: End-to-End Speech Recognition in English and Manda-
rin. Proceedings of the International Conference on Machine Learning (ICML), New York,
USA, 173–182 (2016)
19. Trend Search of “Deep Learning” in Google,
https://trends.google.com/trend/explore?q=deep%20learning (ac-
cessed April 12, 2017)
20. Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmen-
tation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Boston, Massachusetts, USA, 3431–3440 (2015)
21. Chen, L.C., et al.: Semantic Image Segmentation with Deep Convolutional Nets and Fully
Connected CRFs. Proceedings of the International Conference on Learning Representa-
tions (ICLR), San Diego, CA, USA, 1–14 (2015)
22. Noh, H., Hong, S., Han, B.: Learning Deconvolution Network for Semantic Segmentation.
Proceedings of the IEEE International Conference on Computer Vision (ICCV), Los
Alamitos, CA, USA, 1520–1528 (2015)
23. Zheng, S., et al.: Conditional Random Fields as Recurrent Neural Networks. Proceedings
of the IEEE International Conference on Computer Vision (ICCV), Los Alamitos, CA,
USA, 1529–1537 (2015)
24. Papandreou, G., Chen, L.C., Murphy, K., Yuille, A.L.: Weakly-and Semi-supervised
Learning of a DCNN for Semantic Image Segmentation. Proceedings of the IEEE Interna-
tional Conference on Computer Vision (ICCV), Los Alamitos, CA, USA, 1742–1750
(2015)
25. Yu, F., Koltun, V.: Multi-scale Context Aggregation by Dilated Convolutions. Proceedings
of the International Conference on Learning Representations (ICLR), San Juan, Puerto Ri-
co, 1–13 (2016)
26. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning Hierarchical Features for Scene
Labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8), 1915–
1929 (2013)
27. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A Deep Convolutional Encoder-
Decoder Architecture for Image Segmentation. arXiv preprint arXiv:1511.00561 (2015)
28. Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian Segnet: Model Uncertainty in Deep
Convolutional Encoder-Decoder Architectures for Scene Understanding. arXiv preprint
arXiv:1511.02680 (2015)
29. Fourure, D., et al.: Semantic Segmentation via Multi-task, Multi-domain Learning. Joint
IAPR International Workshop on Statistical Techniques in Pattern Recognition (SPR) and
Structural and Syntactic Pattern Recognition (SSPR), Mérida, Mexico, 333–343 (2016)
30. Treml, M., et al.: Speeding up Semantic Segmentation for Autonomous Driving. Proceed-
ings of the Conference on Neural Information Processing Systems (NIPS), Barcelona,
Spain, 1–7 (2016)
31. Hoffman, J., Wang, D., Yu, F., Darrell, T.: FCNs in the Wild: Pixel-level Adversarial and
Constraint-based Adaptation. arXiv preprint arXiv:1612.02649 (2016)
32. Marmanis, D., et al.: Semantic Segmentation of Aerial Images with an Ensemble of CNSS.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
3, 473–480 (2016)
33. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-scale Image
Recognition. Proceedings of the International Conference on Learning Representations
(ICLR), San Diego, CA, USA, 1–14 (2015)
34. Song, S., Lichtenberg, S.P., Xiao, J.: Sun RGB-D: A RGB-D Scene Understanding
Benchmark Suite. Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Boston, Massachusetts, USA, 567–576 (2015)
35. Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic Object Classes in Video: A High-
definition Ground Truth Database. Pattern Recognition Letters 30(2), 88–97 (2009)
36. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision Meets Robotics: the KITTI Dataset.
The International Journal of Robotics Research 32(11), 1231–1237 (2013)
37. Iandola, F.N., et al.: SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters
and <0.5 MB Model Size. arXiv preprint arXiv:1602.07360 (2016)
38. Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P.: Learning to Refine Object Segments.
Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, Neth-
erlands, 75–91 (2016)
39. Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for Data: Ground Truth from Com-
puter Games. Proceedings of the European Conference on Computer Vision (ECCV), Am-
sterdam, Netherlands, 102–118 (2016)
40. Ros, G., et al.: The SYNTHIA Dataset: A Large Collection of Synthetic Images for Seman-
tic Segmentation of Urban Scenes. Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), Seattle, WA, USA, 3234–3243 (2016)
41. Cordts, M., et al.: The Cityscapes Dataset for Semantic Urban Scene Understanding. Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Seattle, WA, USA, 3213–3223 (2016)
42. Marmanis, D., et al.: Semantic Segmentation of Aerial Images with an Ensemble of CNSS.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
3, 473–480 (2016)
43. Machine Learning, http://www.nvidia.com/object/machine-
learning.html (accessed April 25, 2017)
44. Deep Learning Architectures, https://qph.ec.quoracdn.net/main-qimg-
4fbecaea0b4043d5450a1ca0ebe30623 (accessed May 1, 2017)
45. Haykin, S.S.: Neural Networks: A Comprehensive Foundation. Tsinghua University Press
(2001)
46. CUDA, http://www.nvidia.com/object/cuda_home_new.html (accessed
May 5, 2017)
47. BVLC Caffe, http://caffe.berkeleyvision.org/ (accessed May 7, 2017)
48. What is Torch?, http://torch.ch/ (accessed May 7, 2017)
49. Introduction to the Python Deep Learning Library Theano,
http://machinelearningmastery.com/introduction-python-deep-
learning-library-theano/ (accessed May 8, 2017)
50. About TensorFlow, https://www.tensorflow.org/ (accessed May 9, 2017)
51. Keras: Deep Learning Library for Theano and TensorFlow, https://keras.io/ (ac-
cessed May 9, 2017)
52. NVIDIA CuDNN, https://developer.nvidia.com/cudnn (accessed May 10,
2017)
53. LeCun, Y.: Backpropagation Applied to Handwritten ZIP Code Recognition. Neural
Computation 1(4), 541–551 (1989)
54. Shivaprakash, M.: Semantic Segmentation of Satellite Images using Deep Learning. Mas-
ter’s Thesis, Czech Technical University in Prague & Luleå University of Technology, In-
stitute of Science, Prague, Czech Republic (2016)
55. Dumoulin, V., Visin, F.: A Guide to Convolution Arithmetic for Deep Learning. arXiv
preprint arXiv:1603.07285 (2016)
56. Convolution,
https://leonardoaraujosantos.gitbooks.io/artificial-
inteligence/content/convolution.html (accessed May 15, 2017)
57. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet Classification with Deep Convolu-
tional Neural Networks. Advances in Neural Information Processing Systems, 1097–1105
(2012)
58. An Intuitive Explanation of Convolutional Neural Networks,
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-
convnets/ (accessed May 16, 2017)
59. Srivastava, N., et al.: Dropout: A Simple Way to Prevent Neural Networks from Overfit-
ting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)
60. Bengio, Y.: Practical Recommendations for Gradient-based Training of Deep Architec-
tures. Neural networks: Tricks of the Trade, Springer Berlin, Heidelberg, 437–478 (2012)
61. Szegedy, C., et al.: Going Deeper with Convolutions. Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Boston, Massachusetts, USA, 1–9
(2015)
62. Full-day CVPR 2013 Tutorial,
http://mpawankumar.info/tutorials/cvpr2013/ (accessed May 22, 2017)
63. Unity Development Platform, Web page: https://unity3d.com/
64. Image Segmentation using DIGITS 5,
https://devblogs.nvidia.com/parallelforall/image-
segmentation-using-digits-5/ (accessed May 25, 2017)
65. Yosinski, J., et al.: How Transferable are Features in Deep Neural Networks?. Advances in
Neural Information Processing Systems, 3320–3328 (2014)