
Transfer Learning With Adaptive Fine-Tuning

The document discusses a novel method called Differential Evolution based Fine-Tuning (DEFT) for selecting fine-tunable layers in transfer learning, particularly in medical imaging tasks such as identifying osteosarcoma. The DEFT method aims to improve classification performance by automatically determining which layers of a pre-trained convolutional neural network should be fine-tuned, outperforming conventional methods by a significant margin. This research addresses the challenges of adapting deep learning models to new tasks with limited datasets, highlighting the importance of layer selection in achieving high predictive accuracy.


Transfer Learning With Adaptive Fine-Tuning

Article in IEEE Access · October 2020


DOI: 10.1109/ACCESS.2020.3034343



Received September 2, 2020, accepted October 8, 2020, date of publication October 28, 2020, date of current version November 10, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3034343

Transfer Learning With Adaptive Fine-Tuning


GREGA VRBANČIČ, (Graduate Student Member, IEEE),
AND VILI PODGORELEC, (Member, IEEE)
Faculty of Electrical Engineering and Computer Science, University of Maribor, 2000 Maribor, Slovenia
Corresponding author: Grega Vrbančič (grega.vrbancic@um.si)
This work was supported by the Slovenian Research Agency (Research Core Funding) under Grant P2-0057.

ABSTRACT With the utilization of deep learning approaches, the key factors for a successful application are
sufficient datasets with reliable ground truth, which are generally not easy to obtain, especially in the field of
medicine. In recent years, this issue has been commonly addressed with the exploitation of transfer learning
via fine-tuning, which enables us to start with a model, pre-trained for a specific task, and then fine-tune
(train) only certain layers of the neural network for a related but different target task. However, the selection
of fine-tunable layers is one of the major problems of such an approach. Since there is no general rule
on how to select layers in order to achieve the highest possible performance, we developed the Differential
Evolution based Fine-Tuning (DEFT) method for the selection of fine-tunable layers for a target dataset under
the given constraints. The method was evaluated against the problem of identifying the osteosarcoma from
the medical imaging dataset. The performance was compared against a conventionally trained convolutional
neural network, a pre-trained model, and the model trained using a fine-tuning approach with manually
handpicked fine-tunable layers. In terms of classification accuracy, our proposed method outperformed the
compared methods by a margin of 4.45% to 32.75%.

INDEX TERMS Deep learning, fine-tuning, medical imaging, optimization, transfer learning.

The associate editor coordinating the review of this manuscript and approving it for publication was Liang Ding.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 196197

I. INTRODUCTION
In recent years the expansion of deep learning has been tremendous, with a significant impact on almost every field. The outstanding classification performance of modern deep learning approaches has attracted many researchers from various fields such as agriculture [1], seismology [2], information security [3], software engineering [4], computational biology [5], healthcare [6], and medicine [7] to employ modern deep learning techniques to solve different domain problems. One such field, where the increase of deep learning applications has led to a number of domain problems, is medical image processing [8]–[10]. However, especially in the medical image processing field, there is a problem with regard to the availability of large datasets, with reliable ground truth, which are needed in order to build a good predictive model with high predictive performance, utilizing deep learning methods.
Nowadays, the lack of adequate datasets is most commonly addressed by the utilization of transfer learning approaches [10]–[13]. Transfer learning enables us to transfer the knowledge of a model, previously trained for a specific task, to a new model, trying to tackle a similar but not equivalent task. Recent studies have shown that such approaches, especially when utilizing deep convolutional neural networks (CNN), are quite beneficial in terms of not needing a large dataset. In addition to that, the time complexity of training is also reduced since models are already somewhat pre-trained. For example, in [11] the authors achieved great performance results when classifying mammographic tumors with the use of transfer learning trained only on 607 full-field digital mammographic images. In [14] the authors utilize transfer learning to effectively classify images for macular degeneration and diabetic retinopathy.
However, with the application of transfer learning, there are also downsides. In order to appropriately adapt a pre-trained model to a new task, its specific parts (layers) need to be retrained, while the others need to remain unchanged. This adaptation is usually made with fine-tuning approaches when facing a problem in determining which of the layers should be enabled for training (fine-tuning) and which ones to leave frozen [15]. Besides, there is a common problem when setting the hyper-parameter values, the same as in the conventional training of deep neural networks. All these issues have a direct impact on training capabilities and also on the classification performance. If the hyper-parameter
values are not set correctly, the outcome of a model is most likely to be unsatisfying [16]. Currently, there is no general rule or recipe to follow in order to determine which layers to fine-tune or which hyper-parameter settings to use. Most of the decisions are based on previous experiences of dealing with such problems. Solving the above-mentioned issues is most commonly a recurring, time-consuming process in which various settings are tried and tested out to find the ones that will result in a high predictive performance of the model.
The motivation behind the selection of fine-tunable layers is based on the empirical evidence that the initial (bottom) layers of CNNs preserve more abstract, generic features, applicable to a broader range of tasks as presented in [15], [17], [18]. In contrast, layers toward the end (top) of a CNN tend to provide more specific, task-related features. Therefore, it should generally be more reasonable to fine-tune more top layers. However, recent studies [19] show that it is not really clear if restricting the fine-tuning to the last layers is the best option. Azizpour et al. [17] suggested that the success of knowledge transfer depends on the dissimilarity between the primary task for which the CNN was trained and the new target task to which we would like to transfer knowledge. When utilizing a fine-tuning strategy, the CNNs are most often pre-trained on an ImageNet dataset [20] or datasets similar to ImageNet. The distance between natural images in an ImageNet dataset and medical imaging datasets is by no means negligible, so at this point, the question regarding which layers of CNN to fine-tune remains.
The mentioned empirical findings motivated us to take a look at the problem from a different angle. We found an analogy for this problem in the feature selection techniques and mechanisms utilized in machine learning, which are well explored. The problem of feature selection could be easily translated to the problem of finding the most optimal selection of layers for a given neural network architecture, enabled for fine-tuning, which would produce a predictive model achieving the best classification performance. Based on those grounds, we set our goals to develop a new straightforward, automatic and CNN architecture agnostic Differential Evolution based Fine-Tuning (DEFT) method. The DEFT method features an adaptive layer selection mechanism for finding the most optimal combination of fine-tunable layers for achieving high classification performance in transfer learning tasks.
The proposed method is evaluated against the task of identifying the osteosarcoma from Hematoxylin and eosin (H & E) stained images. The performance of the DEFT method is compared against a conventionally trained CNN model, a pre-trained CNN model with all layers being fine-tuned, a CNN model with handpicked layers fine-tuned, and also specific state-of-the-art methods evaluated on the same dataset.
We summarize our contributions as follows:
• We propose a novel adaptive fine-tuning mechanism for transfer learning, which automatically determines how many and which layers of a CNN to fine-tune for a given set of images.
• We conducted an empirical evaluation of the proposed method tackling the problem of identifying osteosarcoma from medical images.
• We performed an extensive performance analysis and comparison of the results obtained from conducted experiments.
• We analyzed the impact of different selections of fine-tunable layers on the model's performance.

II. RELATED WORK
The problem of layer selection, when employing transfer learning with fine-tuning, has been receiving a lot of attention in recent years. With the general popularization of deep learning techniques, transfer learning with fine-tuning has become the most common strategy for transferring knowledge in the context of deep learning, which enables researchers and practitioners to apply such deep learning methods to various domain problems more quickly.
Various methods, following different strategies and approaches, were presented in recent studies to improve the standard fine-tuning. In general, there are two approaches addressing the aforementioned problem. The first approach focuses on selecting the input samples relevant to the target task, as presented in [21]–[23]. The other approach, which we also used in this paper, focuses on the selection of the portions, or layers, of the network in order to optimize information extraction from the pre-trained network. Standard techniques adopting this approach either fine-tune all network layers as presented in [24], or only fine-tune the last few layers or blocks of the network as presented in [10], [25]. Another interesting technique proposed in [26], [27] is to use a pre-trained network as a feature extractor with a classifier such as SVM on top of it. While those fine-tuning methods have proven that they are capable of delivering a promising performance when compared to conventionally trained networks, the biggest drawback of such techniques is their inability to adapt automatically. Therefore, to apply such a method, one needs to adjust various parameters manually or perform layer selection by hand, which is commonly seen as a challenging, burdensome task.
In 2018, Guo et al. [19] presented a method more related to ours, which also works in an adaptive manner. The method features the automatic layer selection per target instance. Introducing the policy network, the authors achieved the adaptiveness of the method. The policy network is used to make routing decisions on whether to pass the image through the fine-tuned layers or through the pre-trained layers. Combined with ResNet CNN architecture's ability to be resilient to residual block swapping and dropping [28], the method delivers encouraging classification performance. Although the authors stated that the method could be applied to different neural network architectures, such an application would most certainly not be a straightforward task because of the method's heavy dependence on the mentioned ResNet architecture abilities.
In contrast to the mentioned methods, our proposed DEFT
method features an automatic adaptive layer selection mech-
anism, which works per target dataset, while also being CNN
architecture agnostic, since it does not rely whatsoever on the
capabilities of a particular architecture.

III. METHODS
A. TRANSFER LEARNING
Conventional machine learning approaches make future pre-
dictions based on the statistical models, trained on previ-
ously collected, labeled, or unlabeled data. An approach that
utilizes the labeled data in the process of model training is
most commonly referred to as supervised learning, while the
utilization of unlabeled data in the process of model train-
ing is commonly known as unsupervised or self-supervised
training. When dealing with small, insufficient sets of labeled
data, building a good classifier is a hard and burdensome task.
Many studies [29]–[31] have been conducted to tackle this
issue, utilizing semi-supervised training or some variation of
such an approach where the usage of a large unlabeled set of
data and a small set of labeled data is combined. The most
common issue with such approaches is the assumption that
the labeled and unlabeled data distributions are the same [32].
In contrast to semi-supervised approaches, transfer learning enables the domains, tasks and data distributions to be different.
The first studies to focus on transfer learning date back to 1995 [33]. However, it can be found under different names such as inductive transfer [34], multitask learning [35], incremental/cumulative learning [36], with one of the most closely related learning techniques for the transfer learning approach being the multitask learning framework [35]. In general, the transfer learning technique can be defined as the improvement of learning a new task through the transfer of knowledge from a related task that has already been learned. In machine learning terms, as presented in Fig. 1, transfer learning roughly translates to transferring the weights of the already trained model, specialized for a specific task, to the model solving a different, but related task [37]. With the expansion of the deep learning field, transfer learning also gained momentum, due to the requirement common to all deep neural networks: the need for large datasets. Additionally, the process of training deep neural network architectures requires a lot of computational power and thus is a time-consuming task. In such cases, with the utilization of the transfer learning techniques, one could benefit significantly in terms of time complexity as well as in terms of the large, required dataset.

FIGURE 1. The conceptual diagram of transfer learning technique.

In general, transfer learning techniques are used in two ways: one being the approach where the weights of the pre-trained model are preserved (frozen) on some of the layers and fine-tuned (trained) in the remaining layers, and the other being the approach where the pre-trained deep neural network is utilized as a feature extractor, while the extracted features are fed to the classifier of choice [38]. Our research focuses on the first-mentioned transfer learning approach or strategy known as fine-tuning, the mechanism which is presented in depth in the following section.

B. FINE-TUNING
A general transfer learning approach is to train a base network and then copy its first n layers to the first n layers of the target network. The new target network's remaining layers are most commonly randomly initialized and trained for a specific target task. We can also choose to backpropagate the errors from the new task into the base features to fine-tune them for the new task, or we can freeze some of the feature layers, which are not going to be trained (fine-tuned) against the new task [15].
The fine-tuning approach is one of the most popular transfer learning strategies among applications in neural networks. Its use was pioneered in [39] by transferring knowledge from a generative to a discriminative model, thereby achieving high generalization. The initial pipeline was composed of a pre-trained network where the last classifier layer was replaced with a randomly initialized one.
Nowadays, the concept of the fine-tuning strategy is quite similar. While we do not, in general, replace the last classifier layer with a randomly initialized one, we most commonly enable fine-tuning of some of the layers in a pre-trained neural network instead of replacing them. However, at this point, the question of how to select fine-tunable layers for a pre-trained network model arises.
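The copy, re-initialize, and freeze steps described above can be illustrated with a minimal sketch. It is not the authors' Keras code: the `Layer` class and the `build_target` helper are hypothetical stand-ins that model a network as a plain list of layers, each with a `trainable` flag, and the weight values are arbitrary.

```python
import random
from dataclasses import dataclass


@dataclass
class Layer:
    """Toy stand-in for a network layer: a name, weights, and a trainable flag."""
    name: str
    weights: list
    trainable: bool = True


def build_target(base, n_copied, n_frozen, seed=0):
    """Copy the first n_copied layers of `base`, re-initialize the remaining
    ones, and freeze the first n_frozen of the copied layers."""
    if not 0 <= n_frozen <= n_copied <= len(base):
        raise ValueError("expected 0 <= n_frozen <= n_copied <= len(base)")
    rng = random.Random(seed)
    target = []
    for idx, layer in enumerate(base):
        if idx < n_copied:
            weights = list(layer.weights)  # transferred (pre-trained) weights
        else:
            # Assumed initialization range, chosen only for illustration.
            weights = [rng.uniform(-0.1, 0.1) for _ in layer.weights]
        target.append(Layer(layer.name, weights, trainable=idx >= n_frozen))
    return target


# Hypothetical five-layer base network with placeholder weights.
base = [Layer(f"block{i}", [float(i)] * 3) for i in range(5)]
target = build_target(base, n_copied=4, n_frozen=2)
for layer in target:
    print(layer.name, "trainable" if layer.trainable else "frozen")
```

In this toy setting the open question raised in the text corresponds to choosing which `trainable` flags to set; the sketch fixes them by position only, which is the manual strategy that an adaptive method would replace.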

The last few layers of a deep neural network are usually fine-tuned, while the remaining initial layers are kept frozen with their initial pre-trained values. The motivation behind such a strategy is driven by a combination of size-limited datasets and some empirical evidence that the initial layers (bottom layers) of a deep neural network preserve more abstract, generic features. Such features are commonly applicable to a broader range of tasks, while the layers closer to the top provide more specific task-related features [19]. However, while the selection of fine-tunable layers is still a more or less manual process, which most commonly requires a tremendous amount of experimenting, and while there are a significant number of empirical studies [10], [18], [40] which show great success with regard to limiting the fine-tuning to the last few layers, there are also some recent studies [19], [28] which diminish the assumption that the early or middle layer features should be shared. Based on the mentioned empirical findings and given that there is no general rule or recipe to follow when selecting fine-tunable layers, we decided to tackle the problem from the perspective of representing it as an optimization problem and tried to solve it utilizing a well-known optimization meta-heuristic algorithm.

C. DIFFERENTIAL EVOLUTION
Differential Evolution (DE) is one of the most popular population-based meta-heuristic algorithms, introduced by Storn and Price in 1997 [41]. Thanks to many wins at international competitions, DE is considered one of the most appropriate algorithms for continuous optimization. Besides the general popularity of the DE algorithm for solving optimization tasks, it was also successfully applied to various machine learning problems, such as the hyper-parameter optimization problem [42], the problem of designing neural network architecture [43] or the feature selection problem [44].
The DE algorithm is composed of Np real-coded vectors and three operators: mutation, crossover and selection. The Np real-coded vectors represent the candidate solutions (individuals), which can be formally defined as presented in Eq. 1, where each element of the solution is in the interval $x_{i,j}^{(t)} \in [x_j^{(L)}, x_j^{(U)}]$, while $x_j^{(L)}$ and $x_j^{(U)}$ denote the lower and upper bounds of the j-th variable, respectively.

$$x_i^{(t)} = (x_{i,1}^{(t)}, \ldots, x_{i,n}^{(t)}), \quad \text{for } i = 1, \ldots, Np, \tag{1}$$

The DE's basic strategy consists of mutation, crossover, and selection operations. The mutation operation can be formally expressed as follows:

$$u_i^{(t)} = x_{r1}^{(t)} + F \cdot (x_{r2}^{(t)} - x_{r3}^{(t)}), \quad \text{for } i = 1, \ldots, Np, \tag{2}$$

where F represents the scaling factor as a positive real number that scales the rate of modification while r1, r2 and r3 are randomly selected values in the interval 1 . . . Np.
In order to increase the diversity of the parameter vectors, the crossover operator is introduced as presented in Eq. 3, where CR ∈ [0.0, 1.0] controls the fraction of parameters that are copied to the trial solution.

$$w_{i,j}^{(t+1)} = \begin{cases} u_{i,j}^{(t)} & \text{if } \mathrm{rand}_j(0, 1) \le CR \lor j = j_{rand}, \\ x_{i,j}^{(t)} & \text{otherwise,} \end{cases} \tag{3}$$

Finally, the selection operator is utilized to decide whether a produced vector should become a generation member utilizing the greedy criterion. The selection could be formally expressed as follows:

$$x_i^{(t+1)} = \begin{cases} w_i^{(t+1)} & \text{if } f(w_i^{(t+1)}) \le f(x_i^{(t)}), \\ x_i^{(t)} & \text{otherwise,} \end{cases} \tag{4}$$

where $f(x_i^{(t)})$ denotes the fitness function defined for solving a specific optimization problem. The DE algorithm works in an iterative manner, where each produced solution is evaluated using the given fitness function. Based on the selection operator, presented in Eq. 4, the DE algorithm's search mechanism seeks solutions for ever-better fitness scores. Naturally, the DE algorithm is optimized for solving minimization problems, and therefore the fitness function should be tailored to give the lower score to the preferred solutions.
The choice of DE parameters, namely Np, F, and CR, can have an enormous impact on optimization performance. The Np parameter, as presented, denotes the population size of real-coded vectors (individuals) on top of which the mutation, crossover, and selection operators are applied. Based on research by Piotrowski [45], too small a population size limits the number of available moves, which may lead to stagnation (the population stops proceeding towards the optimum, although population diversity remains high) or premature convergence. On the contrary, a vast population slows down individuals' clustering and frequently wastes many function calls on almost random explorative moves. The F parameter is a scale factor which controls the length of the exploitation vector and thus determines how far from point $x_i$ the offspring should be generated. Based on research from Das et al. [46], a good initial choice of F is 0.5, while the effective range of F is usually between 0.4 and 1. The DE parameter CR controls how many parameters are expected to change in a population member. When CR is set to a low value, a small number of parameters are changed in each generation, and the step-wise movement tends to be orthogonal to the current coordinate axes. On the other hand, high values of CR cause most of the mutant vector directions to be inherited, prohibiting the generation of axis orthogonal steps [46].

IV. ADAPTIVE FINE-TUNING
To tackle the problem of finding and selecting which layers of a given CNN architecture to fine-tune, we have developed the adaptive DEFT (Differential Evolution based Fine-Tuning) method. As the problem of finding and selecting the layers to fine-tune can be easily translated into an optimization problem, we adopted the DE algorithm for the purposes of finding the most optimal solution to the problem. The solution represents an optimal combination of layers, which should be
selected for fine-tuning in order to train the best performing CNN.
In Fig. 2 the conceptual diagram of the DEFT method is presented. The DEFT method is composed of two components: the layers selection mechanism and the evaluation of selected layers. The layers selection mechanism is responsible for providing the individual, i.e., array of binary values, where each value in the array reflects whether the corresponding layer of CNN architecture is selected for fine-tuning or not. Based on the provided individual, the evaluation component fine-tunes the CNN model and evaluates its predictive performance. The evaluation of CNN is performed using a fitness function, for which we utilized the well-known categorical cross-entropy (CCE) loss. The performance evaluation score (fitness value) is passed back to the layers selection mechanism, based on which the new individual is produced. The process reiterates, trying to minimize the fitness value, until the maximum number of model evaluations is reached. The maximum number of model evaluations is a parameter which can be set manually and represents a total number of produced individuals within the DE loop. The best individual (the selection of layers that produced the best performing CNN, i.e., the one with the lowest categorical cross-entropy loss) is deemed to be optimal under the given circumstances. The optimal found selection of layers is then used for fine-tuning the final CNN.

FIGURE 2. The conceptual diagram of the Differential Evolution based Fine-Tuning (DEFT) method.

Each of the DEFT components is explained in detail in the following subsections.

A. LAYERS SELECTION MECHANISM
The layers selection mechanism is based on the DE algorithm, modified for the task of selecting which layers of CNN architecture will be enabled for fine-tuning and which will remain frozen. In order to be able to perform the layer selection using the DE algorithm, it is mandatory to introduce the modifications to the representation of the individuals for the DE optimization process.
The individuals in the proposed DEFT method are represented as an array (a vector) containing real values, which can be formally expressed as presented in Eq. 5, for i = 0, . . . , Np, where each layer $x_{i,j}^{(t)}$ for j = 0, . . . , N is selected from the interval [0, 1] and N represents the total number of layers of the selected CNN architecture.

$$x_i^{(t)} = (x_{i,0}^{(t)}, \ldots, x_{i,N}^{(t)}, Th_i^{(t)}), \tag{5}$$

For the purpose of obtaining more diverse individuals, we exploited the moving threshold method initially presented in [47]. The mechanism behind the moving threshold method is that the threshold value is part of an individual solution vector and is therefore set dynamically for each solution candidate, which removes the need for setting it manually. In addition to that, it provides us with potentially more diverse individuals. The Th denotes the threshold value, based on which the mapping function presented in Eq. 6 determines whether a corresponding layer is enabled for fine-tuning or not.

$$s_{i,j}^{(t)} = \begin{cases} 1, & \text{if } x_{i,j}^{(t)} > Th_i^{(t)} \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

The individual vector $s_i$ presents the mapped array of binary values (0 and 1) for the i-th individual, where the value 0 for the j-th layer indicates that the layer is not selected for fine-tuning, while the value 1 indicates that the j-th layer is selected for fine-tuning.
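The layer selection loop can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' NiaPy-based implementation: `to_mask` follows the threshold mapping of Eq. 6, `deft_search` is a hypothetical DE/rand/1/bin loop following Eqs. 2 to 4, and `toy_fitness` is an invented stand-in for the expensive fine-tune-and-validate step. The choices CR = 0.9, clipping of genes to [0, 1], and drawing r1, r2, r3 as distinct indices different from i are assumptions of the sketch.

```python
import random


def to_mask(individual):
    """Eq. 6: a layer is marked 1 (fine-tune) when its value exceeds the
    threshold stored as the last element of the individual (Eq. 5)."""
    *values, threshold = individual
    return [1 if value > threshold else 0 for value in values]


def toy_fitness(mask, wanted):
    """Hypothetical stand-in for the fine-tune-and-validate step:
    counts layers whose state differs from `wanted`. Lower is better."""
    return sum(m != w for m, w in zip(mask, wanted))


def deft_search(n_layers, fitness, pop_size=10, F=0.5, CR=0.9,
                max_evals=300, seed=1):
    """DE/rand/1/bin over real vectors in [0, 1]; returns (mask, score)."""
    rng = random.Random(seed)
    dim = n_layers + 1  # one gene per layer plus the moving threshold
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(to_mask(ind)) for ind in pop]
    evals = pop_size
    while evals < max_evals:
        next_pop, next_scores = list(pop), list(scores)
        for i in range(pop_size):
            if evals >= max_evals:
                break
            # Assumption: r1, r2, r3 are distinct and differ from i.
            r1, r2, r3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = rng.randrange(dim)
            trial = []
            for j in range(dim):
                if rng.random() <= CR or j == j_rand:
                    # Mutated gene (Eq. 2) taken over by crossover (Eq. 3).
                    gene = pop[r1][j] + F * (pop[r2][j] - pop[r3][j])
                    trial.append(min(1.0, max(0.0, gene)))  # assumed clipping
                else:
                    trial.append(pop[i][j])
            trial_score = fitness(to_mask(trial))
            evals += 1
            if trial_score <= scores[i]:  # greedy selection (Eq. 4)
                next_pop[i], next_scores[i] = trial, trial_score
        pop, scores = next_pop, next_scores
    best = min(range(pop_size), key=scores.__getitem__)
    return to_mask(pop[best]), scores[best]


wanted = [0, 0, 0, 1, 1, 0, 1, 1]  # arbitrary target, for illustration only
mask, score = deft_search(len(wanted), lambda m: toy_fitness(m, wanted))
print(mask, score)
```

Because the threshold is just another gene, mutation and crossover move it together with the layer values, so no cut-off has to be fixed by hand. In the method itself the fitness call would fine-tune a CNN with the selected layers and return its loss, which is why the evaluation budget (`max_evals` here) matters far more than in this toy setting.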

B. EVALUATION OF SELECTED LAYERS TABLE 1. The basic information about Osteosarcoma data from UT
Southwestern/UT Dallas for Viable and Necrotic Tumor Assessment
For each produced individual s – described in the previous dataset class distribution.
subsection – the fine-tuning of CNN is conducted. To deter-
mine how good or bad the produced individual is, we define
a fitness function L. The fitness function is calculated after
CNN model fine-tuning based on the given individual is
finished and returned to the DE algorithm in order to enable
the DE to find better individuals. The fine-tuning is conducted
for a maximum number of epochs utilizing the early stopping
technique to stop the fine-tuning of not-so-promising layer
• DEFT where the transfer learning approach with our
selections. The maximum number of epochs is a parameter
proposed method was utilized to select the most optimal
that can be set manually; in general, it should be set with
fine-tunable layers.
regard to a target dataset. To evaluate the performance of a
model, trained by fine-tuning the layers selected by the pro- To objectively evaluate the performance of compared
duced individual, we adopted a well-known categorical cross- methods, the experiments were conducted in two different
entropy (CCE) loss function [48], which can be formally scenarios. In the first scenario, the conventional, pretrained,
expressed as follows: and baseline methods are utilizing the same early stopping
technique as the DEFT. In contrast, in the second scenario,
M C no early stopping criteria is used for the three compared
1 XX
L = CCE = − yij log(pij ) (7) methods.
M
i=1 j=1 All of the conducted experiments were implemented in the
Python programming language with the use of the follow-
where i indexes samples from the total of M samples, ing libraries: Keras [49] with Tensorflow [50] backend for
j indexes classes from total ofPC classes, y denotes the sample developing and training CNNs, NiaPy [51] for providing a DE
label, and pij ∈ (0, 1) : j pij = 1∀i, j represents the algorithm implementation, and PyCM [52] for classification
prediction for a sample. performance metrics calculation.
The experiments were executed on a single Intel Core
C. TRAINING THE FINAL CNN MODEL
To obtain the best-performing selection of layers, a subset of 80% of the training set was used within the DE loop to train the CNN models defined by the DE algorithm’s individual solutions. The remaining 20% of the training set was then used to evaluate the trained CNN models using the fitness function in Eq. 7.
After all candidate solutions are evaluated, the best-performing one is selected, and its combination of layers is used to fine-tune the final CNN. In contrast to the models trained on candidate solutions, this final CNN model is trained from the start on the whole training set. This final CNN model is the result of the DEFT method.

V. EXPERIMENTAL SETUP
To evaluate the performance of the proposed adaptive DEFT method, we took an experimental approach. Experiments were conducted with four methods, the three compared methods and our proposed DEFT method, on the task of identifying osteosarcoma from hematoxylin and eosin (H&E) stained osteosarcoma images:
• conventional, where the conventional approach for training a CNN was utilized,
• pretrained, where the CNN was trained using the conventional approach with pre-trained weights,
• baseline, where the transfer learning approach with the fine-tuning of handpicked layers of the CNN architecture was used, and

i7-6700K based PC, with a 4-core (8-thread) CPU running at 4 GHz, 64 GB of RAM, and three Nvidia GeForce Titan X Pascal GPUs, each with 12 GB of dedicated GDDR5 memory, running the Linux Mint 19 operating system.

A. DATASET
The proposed DEFT method is evaluated on the problem of identifying osteosarcoma from hematoxylin and eosin (H&E) stained osteosarcoma images. We used the publicly available dataset Osteosarcoma data from UT Southwestern/UT Dallas for Viable and Necrotic Tumor Assessment [53], the properties of which are presented in Table 1. The data were collected by clinical scientists at the University of Texas Southwestern Medical Center, Dallas. Archival samples from 50 patients treated at the Children’s Medical Center, Dallas, between 1995 and 2015 were used to create this dataset, from which 942 histology glass slides were digitized into whole slide images (WSI). From those, two pathologists manually selected 40 WSIs representing the heterogeneity of the tumor and the response characteristics under study. Thirty 1024 x 1024 pixel image tiles at 10X magnification, as suggested by the pathologists, were randomly selected from each WSI. From the resulting 1,200 image tiles, 66 irrelevant tiles, such as those falling in non-tissue or ink-mark regions and blurry images, were removed. The randomization of tile generation was conducted to remove any bias in the dataset prepared for feature generation and the subsequent machine/deep-learning steps. Two medical experts performed

196202 VOLUME 8, 2020


G. Vrbančič, V. Podgorelec: Transfer Learning With Adaptive Fine-Tuning

TABLE 2. VGG19 convolutional base. The convolutional layers are denoted as "convolution (kernel size), number of kernels".
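As a rough illustration of how a selection of fine-tunable layers can be applied to the convolutional base, the sketch below builds trainable flags for the 16 convolutional layers of VGG19 and marks only the last block as fine-tunable. The layer names follow the Keras naming scheme (block1_conv1 to block5_conv4), which is an assumption made for illustration, and the helpers `baseline_mask` and `apply_mask` are hypothetical names that do not come from the paper.

```python
# Convolutional layers per block in VGG19 (configuration E): 2, 2, 4, 4, 4.
CONVS_PER_BLOCK = [2, 2, 4, 4, 4]

# Layer names in the Keras naming scheme (an assumption for illustration).
CONV_LAYERS = [
    f"block{b}_conv{c}"
    for b, n in enumerate(CONVS_PER_BLOCK, start=1)
    for c in range(1, n + 1)
]


def baseline_mask(layers):
    """Hand-picked selection: fine-tune only the last block (block5)."""
    return [name.startswith("block5") for name in layers]


def apply_mask(layers, mask):
    """Map each layer name to a trainable flag; False means frozen."""
    if len(layers) != len(mask):
        raise ValueError("mask length must match the number of layers")
    return dict(zip(layers, mask))


flags = apply_mask(CONV_LAYERS, baseline_mask(CONV_LAYERS))
trainable = [name for name, on in flags.items() if on]
print(len(CONV_LAYERS), trainable)
```

An adaptively chosen selection would supply a different boolean vector of the same length in place of `baseline_mask`.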

FIGURE 3. Sample images from the dataset. Sample a) shows a non-tumor image, b) a necrotic tumor image, and c) a viable tumor image.

the labeling and annotation, each labeling half of the image tiles [54].
In the data pre-processing phase, images were scaled down to 224 × 224 pixels, which is the default input size of the utilized VGG19 CNN architecture. Additionally, numerous studies have shown that such scaling, when utilizing VGG19 architectures, results in encouraging classification performance [10], [12], [40]. Fig. 3 shows samples of the individual target classes, namely non-tumor, necrotic tumor, and viable tumor. The viable tumor can be identified by densely aggregated nuclei (cells), while the necrotic tumor appears in images as disintegrated nuclei with less color density than the viable tumor. In medical terms, the necrotic tumor denotes the dying parts of the tumor cells, while the viable tumor denotes the H&E stained tissue images where the tumor cells are capable of normal growth.

B. CONVOLUTIONAL NEURAL NETWORK
As a CNN architecture, we adapted the original VGG19 CNN architecture presented in Fig. 4, initially presented in [55] (denoted as configuration E in the original paper). At the bottom of the VGG19 CNN convolutional base is an input layer consuming the 224 × 224 pixel RGB images. The input layer is followed by five convolutional blocks presented in Table 2. Each convolutional block comprises several convolutional layers chained one after another, followed by a max-pooling layer.
The choice of the VGG19 CNN architecture is based on the encouraging results presented in various studies where the VGG19 CNN with transfer learning was utilized for different target tasks. For example, the VGG19 is adopted for automatic glaucoma classification in [12], for breast cancer histology image classification in [13], and for hologram classification for molecular diagnostics in [56].
In the case of the pretrained, baseline and DEFT experiments, the VGG19 convolutional base was pre-trained on the ImageNet dataset, while in the case of the conventional experiment, the CNN is trained from scratch.

C. PARAMETER SETTINGS
1) CNN SETTINGS
For the conventional method, the VGG19 convolutional base is used with random weight initialization; for the pretrained method all layers were fine-tuned, while for the baseline method the selection of layers enabled for fine-tuning was done manually. Based on the encouraging results from various studies [10], [40], we followed the strategy of fine-tuning only the last convolutional block (in our case block5 of the VGG19 architecture) in the convolutional base, while the layers towards the beginning of the convolutional base were kept frozen. For the DEFT method, the fine-tunable layers were dynamically chosen by utilizing the proposed adaptive approach.
On top of the convolutional base from the VGG19 CNN architecture, after the last convolutional block, in each experiment, the following randomly initialized feed-forward layers are added and trained in order to perform the classification task (see Fig. 4):
• flatten layer,
• dropout layer with dropout probability set to 0.8,
• dense layer with 512 neurons and ReLU activation function,
• dropout layer with dropout probability set to 0.5, and
• dense layer with 3 neurons (number of target classes) and Softmax activation function.

2) TRAINING PARAMETERS
For each of the conducted experiments, the training parameters were set as presented in Table 3. The training parameters’ values were picked based on previous experience with utilizing CNNs for various image recognition tasks [10], [57], [58]. Each experiment was trained using the Adam optimizer with the initial learning rate set to 1 × 10^-4 and the batch size set to 128. For the conventional, pretrained and baseline


FIGURE 4. The presentation of the utilized CNN architecture. The convolutional base (Blocks 1 - 5) is adopted from VGG19 CNN
architecture. The dropout layers and fully connected layers are denoted with DO and FC, respectively.
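To give a sense of the size of the classifier head shown in Fig. 4, the following sketch counts its trainable parameters. It assumes that the VGG19 convolutional base maps a 224 × 224 RGB input to a 7 × 7 × 512 feature map (five 2 × 2 poolings and 512 kernels in the last block); the helper name `dense_params` is hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Assumed output of the VGG19 convolutional base for a 224 x 224 input:
# five 2x2 max-poolings give 224 / 2**5 = 7, with 512 feature maps.
side = 224 // 2 ** 5
flat = side * side * 512          # size after the flatten layer

# Dropout layers have no trainable parameters.
fc1 = dense_params(flat, 512)     # dense layer with 512 neurons (ReLU)
fc2 = dense_params(512, 3)        # dense layer with 3 neurons (softmax)

print(flat, fc1, fc2, fc1 + fc2)
```

Under these assumptions nearly all of the head's parameters sit in the first dense layer, which is one plausible reason for placing a strong dropout directly in front of it.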

experiments, the training was performed for 500 epochs. For the DEFT experiment, the maximum number of epochs was set to 20, which means that the model was fine-tuned for up to 20 epochs for each produced individual. Additionally, early stopping was utilized, configured to stop the training if the loss value did not improve for three consecutive epochs. The maximum number of epochs was selected experimentally, as in the vast majority of cases the early stopping criterion was met before reaching 20 epochs; in general, it should be set with regard to the target dataset.

TABLE 3. Used training parameters for the conducted experiments.

3) DEFT PARAMETERS
For the proposed DEFT method, we set the parameters as presented in Table 4. The total number of function (model) evaluations was set to 50, while each individual had a maximum of 20 epochs to achieve the lowest (best) possible loss. The combination of layers for fine-tuning that trained the CNN with the lowest (best) loss value was then used to fine-tune the final CNN from the beginning for the whole 20 epochs; all the other fine-tuned models were discarded. Due to the non-deterministic nature of the underlying DE, the experiment conducted with the DEFT method was executed in 10 independent runs. Therefore, all the reported results are the averages across all those runs.

TABLE 4. Used parameter values for the DEFT method.

D. EVALUATION METHOD AND METRICS
To objectively evaluate the performance of the proposed DEFT method against the compared methods, we conducted a gold-standard 10-fold cross-validation procedure, in which the given dataset was divided into training and testing sets at a ratio of 90:10. The process was repeated ten times, each time leaving a different 10% of the initial dataset for testing. The 10-fold cross-validation provides a more reliable evaluation of the classification performance since, for each experiment, we train the CNN on ten different training sets and, more importantly, evaluate the trained model on ten test sets. It enables us to observe how the method behaves when trained and evaluated on different subsets of the target dataset. Additionally, such an approach can also highlight the potential layer selection bias of the used method.
In the process of conducting the 10-fold cross-validation, we calculated well-known classification metrics such as accuracy, the F1 macro and micro scores, and AUC. As we are dealing with a multi-class classification problem, we utilized the AUC metric for multi-class problems, namely AUNP, and Cohen’s Kappa coefficient, which handles multi-class as well as imbalanced classification problems. Since our dataset is both multi-class and somewhat imbalanced, Cohen’s Kappa coefficient was a reasonable choice as one of the performance metrics.
The AUNP metric is calculated as the AUC of each class against the rest, using the a priori class distribution. The AUC (Area Under the ROC Curve) of a binary classifier, formally defined in Eq. 8, is equivalent to the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. The f(i, j) represents the actual probability of example i being of class j, where we assume that f(i, j) always takes a value of 0 or 1 and is strictly an indicator function. With $m_j = \sum_{i=1}^{m} f(i, j)$, we denote the number of examples of class j, and with p(i, j), we denote


the prior probability of class j. In other words, p(i, j) represents the estimated probability of example i being of class j, taking values in [0, 1]. The l(·) is a comparison function satisfying l(a, b) = 1 if a > b, l(a, b) = 0 if a < b, and l(a, b) = 0.5 if a = b [59].

$$\mathrm{AUC}(j, k) = \frac{\sum_{i=1}^{m} f(i, j)\, l(p(i, j), p(t, j))}{m_j \cdot m_k} \tag{8}$$

The AUNP metric, introduced in 2001 by Fawcett [60], computes the AUC treating a c-dimensional classifier as c two-dimensional classifiers, taking into account the prior probability of each class, p(j). Formally, the AUNP can be expressed as follows:

$$\mathrm{AUNP} = \sum_{j=1}^{c} p(j)\, \mathrm{AUC}(j, \mathrm{rest}_j), \tag{9}$$

where rest_j represents all classes different from class j.

VI. RESULTS
Using the presented experimental setup, evaluation method and classification metrics, we obtained the results that are presented and discussed in detail in the following subsections.

A. CLASSIFICATION PERFORMANCE COMPARISON
The classification results obtained from the conducted experiments on identifying osteosarcoma from H&E stained osteosarcoma images, using the dataset Osteosarcoma data from UT Southwestern/UT Dallas for Viable and Necrotic Tumor Assessment, are presented in Table 5 and Table 6. Table 5 shows the averages of various classification performance metrics over ten folds when using the three compared methods (conventional, pretrained and baseline), which were trained for up to 500 epochs with the aid of early stopping criteria. On the other hand, the results presented in Table 6 show the same metrics for the case where all three compared methods were trained for a full 500 epochs. Our DEFT method uses the same approach (up to 50 iterations of 20 epochs with early stopping) in both cases.

TABLE 5. Averages of classification performance metrics over 10 folds when classifiers were trained for up to 500 epochs with early stopping criteria.

TABLE 6. Averages of classification performance metrics over 10 folds when classifiers were trained for full 500 epochs.

In both cases, for all the classification metrics (accuracy, AUNP, F1-macro and Cohen’s Kappa coefficient), the results of the DEFT method stand out, outperforming all the compared methods by a wide margin, although the difference is smaller in cases where the compared methods are trained for a full 500 epochs. For all three compared methods, the results are better when they were trained for 500 epochs. In terms of accuracy, the DEFT method outperforms the second-best conventional method by a margin of 4.5% and the other two methods by a margin of 17.1–21.8%. Focusing on AUNP, F1-macro and Cohen’s Kappa coefficient, the performance improvements are 3.5–17.4%, 4.3–29.9% and 6.9–34.8%, respectively.
It is interesting to observe that in cases where early stopping was used, the results of all three compared methods are quite similar, with the best of them being the baseline. However, in the case of a full 500 epochs of training, the conventional method outperformed the other two methods by a significant margin, while the other two performed very similarly. This is an intriguing behavior since, due to the small dataset, we would expect the pretrained and especially the baseline method to perform better overall than the conventionally trained one, given the high risk of extreme over-fitting of such deep CNN architectures trained on a small dataset. Also, the handpicked selection of fine-tunable layers of the baseline method was based on previous empirical results in which such a selection of layers proved to be successful.

1) CLASSIFICATION PERFORMANCE METRICS: ACCURACY, F1-SCORE, AUNP AND KAPPA
In order to perform a more in-depth classification performance analysis of our proposed DEFT method, we present the performance comparison among the compared classifier methods on 10 folds using box-dot-plot visualizations for accuracy, F1-score, AUNP and Cohen’s Kappa coefficient. For each metric, we performed two experiments: a) first, we compared the results of training the classifiers for up to 500 epochs using the early stopping criteria, and b) second, we compared the results of training the classifiers for a full 500 epochs.
Fig. 5 presents the accuracy performance comparison among the compared methods in the case of using early stopping. We can easily observe that the DEFT method outperforms the three compared methods, both in terms of mean accuracy over 10 folds and in terms of the standard deviation of the accuracy. The small standard deviation of accuracy also shows the capability of our method to generalize well. The second-best method in terms of overall accuracy seems to be the baseline, followed by the pretrained


FIGURE 5. Accuracy of compared methods trained for up to 500 epochs with early stopping; each dot represents one fold.

FIGURE 6. Accuracy of compared methods trained for full 500 epochs; each dot represents one fold.

FIGURE 7. F1-macro of compared methods trained for up to 500 epochs with early stopping; each dot represents one fold.

FIGURE 8. F1-macro of compared methods trained for a full 500 epochs; each dot represents one fold.

one, while the conventional method performed the worst. The accuracy results of the three compared methods are in line with expectations. While the conventional method, trained from scratch, was not able to achieve good performance within a few epochs, the pretrained method benefited from the pre-trained weights. The baseline method benefited even more, as the CNN with the pre-trained weights was fine-tuned only on reasonably hand-picked layers rather than on all layers as in the pretrained method.
Although all four methods used the same early stopping criteria and could theoretically have been trained for the same number of epochs, it turned out that the three compared methods used far fewer epochs to train (see Table 5). In order to make a fairer comparison, in the second experiment, we fixed the number of epochs at 500 for all three compared methods by removing the early stopping mechanism (Fig. 6). We can see that the results of all three compared methods improved, but they were still not able to outperform our DEFT method, both with regard to the overall accuracy and the standard deviation. The method that benefited the most from prolonged training was the conventional one, while the other two methods clearly finished in local minima in several folds. This benefit could be attributed to the quite drastic regularization applied in the top layers with the utilization of two dropout layers. The side effect of the applied regularization in the case of the conventional method was a slow but steady convergence throughout training. In contrast, for the methods that utilized fine-tuning, some sort of regularization was necessary to mitigate over-fitting, which is generally quite a common effect when dealing with transfer learning approaches.
Very similar results can also be observed for the other three performance metrics: F1-score, AUNP, and kappa.
The F1-score can, in general, be interpreted as the harmonic mean of the precision and recall scores. When dealing with a multi-class problem, we obtain a per-class F1-score, which we would like to represent in the form of one value representing the classifier’s performance. One possible way to achieve that is to use a macro-averaged F1-score, which is computed as the simple arithmetic mean of the per-class F1-scores. The F1-score results are presented in Fig. 7 and Fig. 8.
The results of AUNP metric performance for all compared methods are presented in Fig. 9 and Fig. 10. AUNP is a metric that combines the AUC measure of each class against the rest, using the a priori class distribution. It is one of the most common AUC variations when dealing with multi-class classifiers, as in our case.
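The two aggregate metrics described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the AUC is computed in the standard pairwise form, as the fraction of positive/negative pairs that are ranked correctly with ties counted as 0.5, class labels are assumed to be integer column indices into the probability matrix, and the toy data and function names are made up.

```python
from collections import Counter


def binary_auc(scores, positives):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as 0.5, matching the comparison function l(a, b).
    """
    pos = [s for s, is_pos in zip(scores, positives) if is_pos]
    neg = [s for s, is_pos in zip(scores, positives) if not is_pos]
    total = 0.0
    for p in pos:
        for n in neg:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos) * len(neg))


def aunp(probs, labels):
    """Prior-weighted average of one-vs-rest AUCs.

    probs[i][j] is the estimated probability that example i is of class j.
    """
    counts = Counter(labels)
    m = len(labels)
    result = 0.0
    for j, m_j in counts.items():
        scores = [row[j] for row in probs]
        positives = [y == j for y in labels]
        result += (m_j / m) * binary_auc(scores, positives)
    return result


def macro_f1(y_true, y_pred):
    """Arithmetic mean of per-class F1-scores."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)


# Toy three-class example (made-up numbers).
probs = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
labels = [0, 0, 1, 2]
print(aunp(probs, labels), macro_f1(labels, [0, 1, 1, 2]))
```

Because each one-vs-rest AUC is weighted by the class prior, frequent classes dominate AUNP, whereas the macro-averaged F1-score gives every class the same weight.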


FIGURE 9. AUNP of compared methods trained for up to 500 epochs with early stopping; each dot represents one fold.

FIGURE 10. AUNP of compared methods trained for a full 500 epochs; each dot represents one fold.

FIGURE 11. Kappa of compared methods trained for up to 500 epochs with early stopping; each dot represents one fold.

FIGURE 12. Kappa of compared methods trained for a full 500 epochs; each dot represents one fold.
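Cohen's Kappa, shown in Figs. 11 and 12, relates the observed accuracy to the accuracy expected by chance. The sketch below computes it from label lists and maps the value to the verbal categories of Fleiss's characterization [61] (below 0.40 poor, 0.40 to 0.75 fair to good, above 0.75 excellent). The function names are hypothetical and the example labels are made up.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement, i.e. plain accuracy.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected agreement if labels and predictions were independent.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (p_o - p_e) / (1 - p_e)


def fleiss_category(kappa):
    """Verbal agreement category for a kappa value."""
    if kappa < 0.40:
        return "poor"
    if kappa <= 0.75:
        return "fair to good"
    return "excellent"


k = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0])
print(k, fleiss_category(k))
```

In this toy case three of four predictions are correct, yet half of that agreement is already expected by chance, which is why kappa is noticeably lower than the accuracy.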

Fig. 11 and Fig. 12 show a comparison of Cohen’s Kappa coefficient values for all of the compared methods. Essentially, the Cohen’s Kappa metric compares the observed accuracy with the expected accuracy (random chance), which is generally less misleading than the accuracy metric itself, because Cohen’s Kappa takes random chance into account. Fleiss’s characterization [61] translates Kappa coefficient values into poor agreement when the value is < 0.40, fair to good agreement when the value is 0.40–0.75, and excellent agreement when the value is > 0.75. Following this characterization, our proposed DEFT method achieves, on average, an excellent agreement with an average of 0.7866. On the other hand, the remaining compared methods achieved, on average, a poor agreement in the case of using early stopping (see Table 5) and fair to good agreement in the case of not using early stopping (see Table 6).

2) COMPUTATIONAL AND TIME COMPLEXITY: NUMBER OF EPOCHS AND TOTAL TRAINING TIME
When focusing on the number of epochs, in cases when the early stopping criteria were applied, the proposed DEFT method consumed many more epochs due to its iterative nature. For the DEFT method, such behavior was expected. On the other hand, we expected the other methods to use more epochs before stopping, but it turned out that the training stopped after no more than 10 epochs on average (see Table 5). When the full 500 epochs were used for training, the three compared methods’ classification performance improved, but they were still unable to outperform our proposed DEFT method.
The overall time spent on training was highly correlated with the number of used epochs, although it turned out that there were some differences. Interestingly, although trained for the same number of epochs, the baseline method seems to consume the least time, followed by the pretrained and conventional methods, while our proposed DEFT method used the most time. This could be attributed to the fact that the DE optimization algorithm’s inner workings consume a significant amount of time.
The analysis of the total training time is presented in Fig. 13. Observing it, we can easily see that the time consumption does not vary greatly on a per-fold basis, which is encouraging and indicates that the layer selection mechanism of our DEFT method is capable of finding the best solution under the given constraints. Additionally, it also


TABLE 7. Average ranks (the best result is shown in bold) and statistical comparison (all differences are significant) of the four methods trained with
early stopping.

TABLE 8. Average ranks (the best result is shown in bold) and statistical comparison (only time and epochs have significant differences) of the four
methods trained for full 500 epochs.
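The average ranks and the Friedman statistic behind comparisons of this kind can be sketched as follows. This is an illustrative plain-Python version that uses the chi-square form of the statistic described by Demšar [62] without a tie correction; the scores are made up, higher scores are assumed to be better, and the function names are hypothetical.

```python
def average_ranks(row, higher_is_better=True):
    """Rank the values in one fold (1 = best); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=higher_is_better)
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman(results):
    """Return (average rank per method, Friedman chi-square statistic).

    results[f][m] is the score of method m on fold f.
    """
    n, k = len(results), len(results[0])
    per_fold = [average_ranks(row) for row in results]
    avg = [sum(r[m] for r in per_fold) / n for m in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(r ** 2 for r in avg) - k * (k + 1) ** 2 / 4)
    return avg, chi2


# Made-up accuracies of three methods on three folds.
scores = [[0.90, 0.80, 0.70], [0.91, 0.82, 0.75], [0.88, 0.79, 0.60]]
print(friedman(scores))
```

The statistic is compared against the chi-square distribution with k − 1 degrees of freedom; for four methods and α = 0.05 the critical value is about 7.815.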

TABLE 9. Accuracy performance comparison with similar studies.

gives us a solid basis to further investigate how to make the selection mechanism even more effective in terms of consumed epochs and time.

FIGURE 13. Time spent training a full 500 epochs; each dot represents one fold.

3) STATISTICAL COMPARISON
To evaluate the statistical significance of these results, we first applied the Friedman test, as suggested by Demšar [62], by calculating the asymptotic significance for the four compared methods on all 10 folds (using α=0.05). As the results are not normally distributed, the Friedman test was applied, which is a non-parametric statistical test used to detect differences in the results of various methods across multiple test attempts. When early stopping was used, the results of the performed Friedman test, with regard to all six metrics, show that the differences between the four methods are statistically significant (Table 7). In the case of using the full 500 epochs for training, the Friedman test results show significant differences for time and number of epochs (Table 8).
To further compare the results of our DEFT method with the remaining three methods, the Wilcoxon signed-rank test was applied next (using α=0.05), as suggested by Demšar [62]. If the Wilcoxon test resulted in a statistically significant difference between two methods, the method with the better average rank could be regarded as the better method. Our proposed DEFT method achieved the highest average rank (which is the best) among the four methods for accuracy, AUNP, F1, and kappa, both with and without early stopping (see also Figs. 5–12). In the case of using early stopping, our DEFT method performed the worst in both the time and the number of epochs spent on training. When the classifiers were trained for the full 500 epochs, however, DEFT used the lowest number of epochs, but still performed a bit slower than the remaining three methods (see also Fig. 13).

4) A COMPARISON WITH OTHER SIMILAR STUDIES
The problem of identifying osteosarcoma has already been addressed in previous studies using the same dataset. However, the authors of those studies split the original images into smaller patches and performed classification over those patches, taking into account additional contextual information about the tumor. Even so, the reported results are quite similar to ours, as presented in Table 9. In [63], the authors proposed a CNN architecture with


7 layers, 3 of them being convolutional layers, 3 of them being max-pooling layers, and two being fully connected layers. In terms of accuracy, our proposed method outperforms the mentioned one by a margin of 2.57%. In [64] and [54], the reported accuracy was somewhat higher than it is in our case (by 5.83% and 4.63%, respectively). However, we must also consider that the authors in [54] use mechanisms for the expert-guided generation of features, while our proposed DEFT method does not utilize any domain expert knowledge in the process of building the predictive model.

B. LAYER SELECTION ANALYSIS
In this section, we analyze the DEFT method’s selection of fine-tunable layers and the performance of the various solutions produced by the method. As presented in previous sections, there is no general rule or recipe to follow when selecting which layers of a CNN to fine-tune and which ones to leave frozen. We therefore looked more closely into the performance of the various layer combinations evaluated by our method in the process of finding the best solution under the given constraints.
Fig. 14 shows a heat-map of layer selection probabilities averaged from the best five found solutions to the worst five found solutions by our DEFT method. The rationale behind grouping the individual-based candidate models is that, interestingly, the best-performing individuals are quite different in terms of selected layers but still deliver similar classification performance. As can be seen from the figure, the most obvious observation is that the better-performing individuals in general have a lower selection probability rate than the worse-performing individual-based candidate models. In other words, the CNN architectures with fewer layers enabled for fine-tuning achieve better performance results than the ones with more layers. This effect is also consistent with the low classification performance achieved by the pretrained method, where all convolutional layers are being trained. An interesting observation is also the occurrence of higher selection probabilities for layers 2, 8, 14, 16, and 17 when looking at the best-performing individuals, which could be explained by the fact that the features extracted on those layers have a more significant impact on the final prediction than the others. We can also generally observe relatively low selection probabilities for layers towards the end, which weakens the assumption that early or middle layer features should be shared, as has already been noted in previous sections.

FIGURE 14. Layer selection analysis, based on the DEFT method results. The y-axis denotes the average selections from the best 5 performing individual candidates to the worst 5 performing ones. On the x-axis there are numbers denoting the layers of the VGG19 architecture.

C. THREATS TO VALIDITY
Commonly, in the machine learning field, the validity threats often relate to the diversity, quality and quantity of the data. Since all supervised machine learning methods, techniques, and approaches rely on how the given data is labeled or pre-classified, for our research we picked a dataset that was collected by clinical scientists and labeled by two medical experts to minimize potential threats to validity. Nevertheless, our obtained results and findings may not generalize to all specific situations.
Splitting the data into a training and test set could also be a potential threat to validity. To reduce the possibility of such a threat, we adopted the well-known 10-fold cross-validation procedure.
Due to the stochastic nature of our proposed DEFT method, in order to reduce the internal threat to validity, the experiment conducted with the DEFT method was executed in 10 runs, and the reported performance metrics were the averages of those runs.

VII. CONCLUSION
In this work, we presented a novel DEFT adaptive method for transfer learning with fine-tuning, featuring a layer selection mechanism based on the DE algorithm. The method addresses the problem of selecting which layers of a given CNN architecture to fine-tune and which ones to leave frozen, in order to achieve the best possible classification performance. The exploited DE algorithm enables the DEFT method to find the best layer selection solution given the constraints and available dataset in an automatic, straightforward, adaptive manner.
The presented method was evaluated against three other methods, one where the CNN is trained conventionally (from scratch), one where the CNN was trained using the conventional approach with convolutional layers being pre-trained on the ImageNet dataset, and one where transfer learning was utilized and the fine-tunable layers were handpicked based on general recommendations and our previous experience. For our experiments’ target task, we selected the image classification task of identifying osteosarcoma from H&E stained osteosarcoma images. The results obtained from the conducted experiments show that our proposed DEFT method outperformed the compared methods in all predictive performance metrics. On the other hand, the proposed method is significantly more time-consuming, as it utilizes an iterative optimization mechanism.
The DEFT method could be easily utilized with any kind of task where a standard fine-tuning methodology is applicable. Furthermore, the proposed DEFT method could also


be utilized with any other CNN architecture regardless of the number of convolutional layers, provided that method parameters such as the dimension of the problem and the number of function evaluations are adjusted. Generally, the latter should be increased when utilizing deeper CNN architectures, since such architectures feature a larger number of layers, which translates to a larger search space and increased time complexity in order to find the most suitable combination of layers selected for fine-tuning. Similar to the utilization of more conventional methods, when applying our proposed DEFT method to various datasets with a different number of samples, one should also revise and appropriately adapt the classifier layers as well as the initial training parameters to achieve the best possible outcome.
In the future, we would like to extend our research to the utilization of various optimization algorithms such as the Firefly algorithm or Particle Swarm Optimization. We would also like to apply the proposed DEFT method to different medical imaging datasets, and possibly also to other image classification tasks. To analyse how the method performs regardless of the classification domain, it would be useful to test it on several datasets and multiple tasks, including those from other domains. Additionally, we would like to explore how to make the layer selection mechanism more efficient, which would enable the DEFT method to deliver better classification performance with lower time complexity, closer to that of conventional training. Finally, we would also like to investigate the possibilities of combining our DEFT method with active learning approaches.

REFERENCES
[1] A. Kamilaris and F. X. Prenafeta-Boldú, "Deep learning in agriculture: A survey," Comput. Electron. Agricult., vol. 147, pp. 70–90, Apr. 2018.
[2] Q. Kong, D. T. Trugman, Z. E. Ross, M. J. Bianco, B. J. Meade, and P. Gerstoft, "Machine learning in seismology: Turning data into insights," Seismol. Res. Lett., vol. 90, no. 1, pp. 3–14, Jan. 2019.
[10] G. Vrbančič, M. Zorman, and V. Podgorelec, "Transfer learning tuning utilizing grey wolf optimizer for identification of brain hemorrhage from head CT images," in Proc. StuCoSReC: 6th Student Comput. Sci. Res. Conf., 2019, pp. 61–66.
[11] B. Q. Huynh, H. Li, and M. L. Giger, "Digital mammographic tumor classification using transfer learning from deep convolutional neural networks," J. Med. Imag., vol. 3, no. 3, Aug. 2016, Art. no. 034501.
[12] J. J. Gómez-Valverde, A. Antón, G. Fatti, B. Liefers, A. Herranz, A. Santos, C. I. Sánchez, and M. J. Ledesma-Carbayo, "Automatic glaucoma classification using color fundus images based on convolutional neural networks and transfer learning," Biomed. Opt. Express, vol. 10, no. 2, pp. 892–913, 2019.
[13] R. Mehra, "Breast cancer histology images classification: Training from scratch or transfer learning?" ICT Express, vol. 4, no. 4, pp. 247–254, 2018.
[14] D. S. Kermany, M. Goldbaum, W. Cai, C. C. Valentim, H. Liang, S. L. Baxter, A. McKeown, G. Yang, X. Wu, and F. Yan, "Identifying medical diagnoses and treatable diseases by image-based deep learning," Cell, vol. 172, no. 5, pp. 1122–1131, 2018.
[15] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3320–3328.
[16] G. Vrbancic, I. J. Fister, and V. Podgorelec, "Parameter setting for deep neural networks using swarm intelligence on phishing Websites classification," Int. J. Artif. Intell. Tools, vol. 28, no. 6, Oct. 2019, Art. no. 1960008.
[17] H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, "Factors of transferability for a generic ConvNet representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 9, pp. 1790–1802, Sep. 2016.
[18] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, "Convolutional neural networks for medical image analysis: Full training or fine tuning?" IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1299–1312, May 2016.
[19] Y. Guo, H. Shi, A. Kumar, K. Grauman, T. Rosing, and R. Feris, "SpotTune: Transfer learning through adaptive fine-tuning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4805–4814.
[20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[21] L. Zhu, S. O. Arik, Y. Yang, and T. Pfister, "Learning to transfer learn: Reinforcement learning-based selection for adaptive transfer learning," arXiv, New York, NY, USA, Tech. Rep., 2020.
[22] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie, "Large scale fine-grained categorization and domain-specific transfer learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4109–4118.
[23] W. Ge and Y. Yu, "Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1086–1095.
[3] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, ‘‘A deep learning [24] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘‘Rich feature hierarchies
approach for network intrusion detection system,’’ in Proc. 9th EAI Int. for accurate object detection and semantic segmentation,’’ in Proc. IEEE
Conf. Bio-Inspired Inf. Commun. Technol. (formerly BIONETICS), 2016, Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
pp. 21–26. [25] M. Long, Y. Cao, J. Wang, and M. Jordan, ‘‘Learning transferable features
[4] J. Flisar and V. Podgorelec, ‘‘Identification of self-admitted technical debt with deep adaptation networks,’’ in Proc. Int. Conf. Mach. Learn., 2015,
using enhanced feature selection based on word embedding,’’ IEEE Access, pp. 97–105.
vol. 7, pp. 106475–106494, 2019. [26] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, ‘‘CNN features
Off-the-shelf: An astounding baseline for recognition,’’ in Proc. IEEE
[5] C. Angermueller, T. Pärnamaa, L. Parts, and O. Stegle, ‘‘Deep learn-
Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2014, pp. 806–813.
ing for computational biology,’’ Mol. Syst. Biol., vol. 12, no. 7, p. 878,
[27] Z. Shi, H. Hao, M. Zhao, Y. Feng, L. He, Y. Wang, and K. Suzuki, ‘‘A
Jul. 2016.
deep CNN based transfer learning method for false positive reduction,’’
[6] R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley, ‘‘Deep learning Multimedia Tools Appl., vol. 78, no. 1, pp. 1017–1033, Jan. 2019.
for healthcare: Review, opportunities and challenges,’’ Briefings Bioinf., [28] A. Veit, M. J. Wilber, and S. Belongie, ‘‘Residual networks behave like
vol. 19, no. 6, pp. 1236–1246, Nov. 2018. ensembles of relatively shallow networks,’’ in Proc. Adv. Neural Inf. Pro-
[7] T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, cess. Syst., 2016, pp. 550–558.
B. T. Do, G. P. Way, E. Ferrero, P.-M. Agapow, M. Zietz, and [29] X. J. Zhu, ‘‘Semi-supervised learning literature survey,’’ Dept. Comput.
M. M. Hoffman, ‘‘Opportunities and obstacles for deep learning in Sci., Univ. Wisconsin-Madison, Madison, WI, USA, Tech. Rep. 1530,
biology and medicine,’’ J. Roy. Soc. Interface, vol. 15, no. 141, 2018, 2005.
Art. no. 20170387. [30] T. N. Kipf and M. Welling, ‘‘Semi-supervised classification with graph
[8] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoo- convolutional networks,’’ 2016, arXiv:1609.02907. [Online]. Available:
rian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, ‘‘A http://arxiv.org/abs/1609.02907
survey on deep learning in medical image analysis,’’ Med. Image Anal., [31] L. I. Kuncheva and J. J. Rodriguez, ‘‘Classifier ensembles with a random
vol. 42, pp. 60–88, Dec. 2017. linear oracle,’’ IEEE Trans. Knowl. Data Eng., vol. 19, no. 4, pp. 500–508,
[9] M. I. Razzak, S. Naz, and A. Zaib, ‘‘Deep learning for medical image Apr. 2007.
processing: Overview, challenges and the future,’’ in Classification in [32] S. J. Pan and Q. Yang, ‘‘A survey on transfer learning,’’ IEEE Trans. Knowl.
BioApps. Cham, Switzerland: Springer, 2018, pp. 323–350. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.

196210 VOLUME 8, 2020


G. Vrbančič, V. Podgorelec: Transfer Learning With Adaptive Fine-Tuning

[33] J. Y. Ching, A. K. C. Wong, and K. C. C. Chan, “Class-dependent discretization for inductive learning from continuous and mixed-mode data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 17, no. 7, pp. 641–651, Jul. 1995.
[34] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, “Inductive learning algorithms and representations for text categorization,” in Proc. 7th Int. Conf. Inf. Knowl. Manage. (CIKM), 1998, pp. 148–152.
[35] Q. Yang, C. Ling, X. Chai, and R. Pan, “Test-cost sensitive classification on data with missing values,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 5, pp. 626–638, May 2006.
[36] X. Zhu and X. Wu, “Class noise handling for effective cost-sensitive learning by cost-guided iterative classification filtering,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1435–1440, Oct. 2006.
[37] M. Hussain, J. J. Bird, and D. R. Faria, “A study on CNN transfer learning for image classification,” in Proc. UK Workshop Comput. Intell. Cham, Switzerland: Springer, 2018, pp. 191–202.
[38] K. Nogueira, O. A. B. Penatti, and J. A. dos Santos, “Towards better exploiting convolutional neural networks for remote sensing scene classification,” Pattern Recognit., vol. 61, pp. 539–556, Jan. 2017.
[39] G. E. Hinton, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.
[40] H. Chougrad, H. Zouaki, and O. Alheyane, “Deep convolutional neural networks for breast cancer screening,” Comput. Methods Programs Biomed., vol. 157, pp. 19–30, Apr. 2018.
[41] R. Storn and K. Price, “Differential evolution–A simple and efficient heuristic for global optimization over continuous spaces,” J. Global Optim., vol. 11, no. 4, pp. 341–359, 1997.
[42] B. Nakisa, M. N. Rastgoo, A. Rakotonirainy, F. Maire, and V. Chandran, “Long short term memory hyperparameter optimization for a neural network based emotion recognition framework,” IEEE Access, vol. 6, pp. 49325–49338, 2018.
[43] N. Xue, I. Triguero, G. P. Figueredo, and D. Landa-Silva, “Evolving deep CNN-LSTMs for inventory time series prediction,” in Proc. IEEE Congr. Evol. Comput. (CEC), Jun. 2019, pp. 1517–1524.
[44] L. Brezočnik, I. Fister, and G. Vrbančič, “Applying differential evolution with threshold mechanism for feature selection on a phishing Websites classification,” in New Trends in Databases and Information Systems, T. Welzer, J. Eder, V. Podgorelec, R. Wrembel, M. Ivanović, J. Gamper, M. Morzy, T. Tzouramanis, J. Darmont, and A. K. Latifić, Eds. Cham, Switzerland: Springer, 2019, pp. 11–18.
[45] A. P. Piotrowski, “Review of differential evolution population size,” Swarm Evol. Comput., vol. 32, pp. 1–24, Feb. 2017.
[46] S. Das and P. N. Suganthan, “Differential evolution: A survey of the State-of-the-Art,” IEEE Trans. Evol. Comput., vol. 15, no. 1, pp. 4–31, Feb. 2011.
[47] D. Fister, I. Fister, T. Jagric, I. Fister, and J. Brest, “A novel self-adaptive differential evolution for feature selection using threshold mechanism,” in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Nov. 2018, pp. 17–24.
[48] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.
[49] F. Chollet. (2015). Keras. [Online]. Available: https://keras.io
[50] (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. [Online]. Available: https://www.tensorflow.org/
[51] G. Vrbančič, L. Brezočnik, U. Mlakar, D. Fister, and I. Fister, Jr., “NiaPy: Python microframework for building nature-inspired algorithms,” J. Open Source Softw., vol. 3, no. 23, p. 613, Mar. 2018, doi: 10.21105/joss.00613.
[52] S. Haghighi, M. Jasemi, S. Hessabi, and A. Zolanvari, “PyCM: Multiclass confusion matrix library in Python,” J. Open Source Softw., vol. 3, no. 25, p. 729, May 2018, doi: 10.21105/joss.00729.
[53] P. Leavey, A. Sengupta, D. Rakheja, O. Daescu, H. B. Arunachalam, and R. Mishra, “Osteosarcoma data from UT Southwestern/UT Dallas for viable and necrotic tumor assessment [data set],” Cancer Imag. Arch., Fayetteville, AR, USA, Tech. Rep., 2019.
[54] H. B. Arunachalam, R. Mishra, O. Daescu, K. Cederberg, D. Rakheja, A. Sengupta, D. Leonard, R. Hallac, and P. Leavey, “Viable and necrotic tumor assessment from whole slide images of osteosarcoma using machine-learning and deep-learning models,” PLoS ONE, vol. 14, no. 4, Apr. 2019, Art. no. e0210706.
[55] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[56] S.-J. Kim, C. Wang, B. Zhao, H. Im, J. Min, H. J. Choi, J. Tadros, N. R. Choi, C. M. Castro, R. Weissleder, H. Lee, and K. Lee, “Deep transfer learning-based hologram classification for molecular diagnostics,” Sci. Rep., vol. 8, no. 1, pp. 1–12, Dec. 2018.
[57] G. Vrbancic and V. Podgorelec, “Automatic classification of motor impairment neural disorders from EEG signals using deep convolutional neural networks,” Elektronika ir Elektrotechnika, vol. 24, no. 4, pp. 3–7, Aug. 2018. [Online]. Available: http://eejournal.ktu.lt/index.php/elt/article/view/21469
[58] G. Vrbancic, I. J. Fister, and V. Podgorelec, “Automatic detection of heartbeats in heart sound signals using deep convolutional neural networks,” Elektronika ir Elektrotechnika, vol. 25, no. 3, pp. 71–76, Jun. 2019. [Online]. Available: http://eejournal.ktu.lt/index.php/elt/article/view/23680
[59] C. Ferri, J. Hernández-Orallo, and R. Modroiu, “An experimental comparison of performance measures for classification,” Pattern Recognit. Lett., vol. 30, no. 1, pp. 27–38, 2009.
[60] T. Fawcett, “Using rule sets to maximize ROC performance,” in Proc. IEEE Int. Conf. Data Mining, Nov./Dec. 2001, pp. 131–138.
[61] J. L. Fleiss, B. Levin, and M. C. Paik, Statistical Methods for Rates and Proportions, 3rd ed. Hoboken, NJ, USA: Wiley, 2003.
[62] J. Demsar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, Jan. 2006.
[63] R. Mishra, O. Daescu, P. Leavey, D. Rakheja, and A. Sengupta, “Histopathological diagnosis for viable and non-viable tumor prediction for osteosarcoma using convolutional neural network,” in Proc. Int. Symp. Bioinf. Res. Appl. Springer, 2017, pp. 12–23.
[64] R. Mishra, O. Daescu, P. Leavey, and D. Rakheja, “Convolutional neural network for histopathological analysis of osteosarcoma,” J. Comput. Biol., vol. 25, no. 3, pp. 313–325, 2018.

GREGA VRBANČIČ (Graduate Student Member, IEEE) received the B.Sc. and M.Eng. degrees in informatics and communication technologies from the University of Maribor, in 2015 and 2017, respectively, where he is currently pursuing the Ph.D. degree in computer science. He is also a Young Researcher with the Faculty of Electrical Engineering and Computer Science, University of Maribor. He is the author of three peer-reviewed scientific journal articles and eight conference papers. He has been involved in several industrial research and development projects. His research interests include deep learning, especially convolutional neural networks, focusing on training strategies and transfer learning.

VILI PODGORELEC (Member, IEEE) received the Ph.D. degree from the University of Maribor, Slovenia, in 2001.

He worked as a Visiting Professor and/or a Researcher at several universities around the world, including the University of Osaka, Japan; the Federal University of Sao Paulo, Brazil; the University of Nantes, France; the University of La Laguna, Spain; the University of Madeira, Portugal; the University of Applied Sciences Seinäjoki, Finland; and the University of Applied Sciences Valencia, Spain. He is currently a Professor of computer science with the University of Maribor. He has been involved in AI and intelligent systems for 20 years, where he gained professional experience in the implementation of many scientific and industrial research and development projects related to the analysis, design, implementation, integration, and evaluation of intelligent information systems. He has authored more than 50 peer-reviewed scientific journal articles, more than 100 conference papers, three books, and several book chapters on machine learning, computational intelligence, data science, medical informatics, and software engineering. He has top expertise in AI and machine learning methods and algorithms in the field of transparent data-driven decision making (especially in medicine), in-depth knowledge of system integration technologies and methods applied to intelligent data analysis, classification and prediction of human-centered data, as well as large experience in designing and implementing information retrieval, natural language processing and text mining solutions for academia, industrial partners and international companies using state-of-the-art approaches, and methods and tools. He received several international awards and grants for his research activities.

