TransferLearningwithAdaptiveFine Tuning
TransferLearningwithAdaptiveFine Tuning
net/publication/345713638
CITATIONS READS
195 1,927
2 authors:
All content following this page was uploaded by Grega Vrbančič on 01 December 2020.
ABSTRACT With the utilization of deep learning approaches, the key factors for a successful application are
sufficient datasets with reliable ground truth, which are generally not easy to obtain, especially in the field of
medicine. In recent years, this issue has been commonly addressed with the exploitation of transfer learning
via fine-tuning, which enables us to start with a model, pre-trained for a specific task, and then fine-tune
(train) only certain layers of the neural network for a related but different target task. However, the selection
of fine-tunable layers is one of the major problems of such an approach. Since there is no general rule
on how to select layers in order to achieve the highest possible performance, we developed the Differential
Evolution based Fine-Tuning (DEFT) method for the selection of fine-tunable layers for a target dataset under
the given constraints. The method was evaluated against the problem of identifying the osteosarcoma from
the medical imaging dataset. The performance was compared against a conventionally trained convolutional
neural network, a pre-trained model, and the model trained using a fine-tuning approach with manually
handpicked fine-tunable layers. In terms of classification accuracy, our proposed method outperformed the
compared methods by a margin of 4.45% to 32.75%.
INDEX TERMS Deep learning, fine-tuning, medical imaging, optimization, transfer learning.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 196197
G. Vrbančič, V. podgorelec: Transfer Learning With Adaptive Fine-Tuning
values are not set correctly, the outcome of a model is most many and which layers of a CNN to fine-tune for a given
likely to be unsatisfying [16]. Currently, there is no general set of images.
rule or recipe to follow in order to determine which layers • We conducted an empirical evaluation of the proposed
to fine-tune or which hyper-parameter settings to use. Most method tackling the problem of identifying osteosar-
of the decisions are based on previous experiences of deal- coma from medical images.
ing with such problems. Solving the above-mentioned issues • We performed an extensive performance analysis and
is most commonly a recurring, time-consuming process in comparison of the results obtained from conducted
which various settings are tried and tested out to find the experiments.
ones that will result in a high predictive performance of the • We analyzed the impact of different selections of fine-
model. tunable layers on the model’s performance.
The motivation behind the selection of fine-tunable layers
is based on the empirical evidence that the initial (bottom) II. RELATED WORK
layers of CNNs preserve more abstract, generic features, The problem of layer selection, when employing transfer
applicable to a broader range of tasks as presented in [15], learning with fine-tuning, has been receiving a lot of attention
[17], [18]. In contrast, layers toward the end (top) of a CNN in recent years. With the general popularization of deep learn-
tend to provide more specific, task-related features. There- ing techniques, transfer learning with fine-tuning has become
fore, it should generally be more reasonable to fine-tune more the most common strategy for transferring knowledge in
top layers. However, recent studies [19] show that it is not the context of deep learning, which enables researchers and
really clear if restricting the fine-tuning to the last layers is practitioners to apply such deep learning methods to various
the best option. Azizpour et al. [17] suggested that the success domain problems more quickly.
of knowledge transfer depends on the dissimilarity between Various methods, following different strategies and
the primary task for which the CNN was trained, and new approaches, were presented in recent studies to improve the
target task for which we would like to transfer knowledge. standard fine-tuning. In general, there are two approaches
When utilizing a fine-tuning strategy, the CNNs are most addressing the aforementioned problem. The first approach
often pre-trained on an ImageNet dataset [20] or datasets focuses on selecting the input samples relevant to the target
similar to ImageNet. The distance between natural images task, as presented in [21]–[23]. The other approach – that
in an ImageNet dataset and medical imaging datasets is by we also used in this paper – focuses on the selection of
no means negligible, so at this point, the question regarding the portions, or layers, of the network in order to opti-
which layers of CNN to fine-tune remains. mize information extraction from the pre-trained network.
The mentioned empirical findings motivated us to take a Standard techniques adopting this approach either fine-tune
look at the problem from a different angle. We found an all network layers as presented in [24], or only fine-tune
analogy for this problem in the feature selection techniques the last few layers or blocks of the network as presented
and mechanisms utilized in machine learning, which are in [10], [25]. Another interesting technique proposed in [26],
well explored. The problem of feature selection could be [27] is to use a pre-trained network as a feature extractor
easily translated to the problem of finding the most optimal with a classifier such as SVM on top of it. While those
selection of layers for a given neural network architecture, fine-tuning methods have proven that they are capable of
enabled for fine-tuning, which would produce a predictive delivering a promising performance when compared to con-
model achieving the best classification performance. Based ventionally trained networks, the biggest drawback of such
on those grounds, we set our goals to develop a new straight- techniques is their inability to adapt automatically. There-
forward, automatic and CNN architecture agnostic Differen- fore, to apply such a method, one needs to adjust various
tial Evolution based Fine-Tuning (DEFT) method. The DEFT parameters manually or perform layer selection by hand,
method features an adaptive layer selection mechanism for which is commonly seen as a challenging, burdensome
finding the most optimal combination of fine-tunable lay- task.
ers for achieving high classification performance in transfer In 2018, Guo et al. [19] presented a method more related
learning tasks. to ours, which also works in an adaptive manner. The method
The proposed method is evaluated against the task of features the automatic layer selection per target instance.
identifying the osteosarcoma from Hematoxylin and eosin Introducing the policy network, the authors achieved the
(H & E) stained images. The performance of the DEFT adaptiveness of the method. The policy network is used
method is compared against a conventional trained CNN to make routing decisions on whether to pass the image
model, a pre-trained CNN model with all layers being fine- through the fine-tuned layers or through the pre-trained lay-
tuned, a CNN model with handpicked layers fine-tuned, and ers. Combined with ResNet CNN architecture’s ability to
also specific state-of-the-art methods evaluated on the same be resilient to residual block swapping and dropping [28],
dataset. the method delivers encouraging classification performance.
We summarize our contributions as follows: Although the authors stated that the method could be applied
• We propose a novel adaptive fine-tuning mechanism for to different neural network architectures, such an application
transfer learning, which automatically determines how would most certainly not be a straight-forward task because
III. METHODS
A. TRANSFER LEARNING
Conventional machine learning approaches make future pre-
dictions based on the statistical models, trained on previ-
ously collected, labeled, or unlabeled data. An approach that
utilizes the labeled data in the process of model training is
most commonly referred to as supervised learning, while the
utilization of unlabeled data in the process of model train-
ing is commonly known as unsupervised or self-supervised
training. When dealing with small, insufficient sets of labeled
data, building a good classifier is a hard and burdensome task.
Many studies [29]–[31] have been conducted to tackle this
issue, utilizing semi-supervised training or some variation of
such an approach where the usage of a large unlabeled set of
data and a small set of labeled data is combined. The most
common issue with such approaches is the assumption that
the labeled and unlabeled data distributions are the same [32].
In contrast to semi-supervised approaches, transfer learn- FIGURE 1. The conceptual diagram of transfer learning technique.
ing enables the domains, tasks and data distributions to be
different. are fed to the classifier of choice [38]. Our research focuses
The first studies to focus on transfer learning date back on the first-mentioned transfer learning approach or strategy
to 1995 [33]. However, it can be found under different known as fine-tuning – the mechanism which is presented
names such as inductive transfer [34], multitask learning [35], in-depth in the following section.
incremental/cumulative learning [36], with one of the most
closely related learning techniques for the transfer learn- B. FINE-TUNING
ing approach being the multitask learning framework [35]. A general transfer learning approach is to train a base network
In general, the transfer learning technique can be defined as and then copy its first n layers to the first n layers of the
the improvement of learning a new task through the trans- target network. The new target network’s remaining layers
fer of knowledge from a related task that has already been are most commonly randomly initialized and trained for a
learned. In machine learning terms, as presented in Fig. 1, specific target task. We can also choose to backpropagate the
transfer learning roughly translates to transferring the weights errors from the new task into the base features to fine-tune
of the already trained model, specialized for a specific task, them for the new task, or we can freeze some of the feature
to the model solving a different, but related task [37]. With layers, which are not going to be trained (fine-tuned) against
the expansion of the deep learning field, transfer learning also the new task [15].
gained momentum, due to the requirements common to all The fine-tuning approach is one of the most popular trans-
deep neural networks – the need for large datasets. Addition- fer learning strategies among applications in neural networks.
ally, the process of training deep neural network architectures Its use was pioneered in [39] by transferring knowledge from
requires a lot of computational power and thus is a time- a generative to a discriminative model, thereby achieving
consuming task. In such cases, with the utilization of the high generalization. The initial pipeline was composed out
transfer learning techniques, one could benefit significantly of a pre-trained network where the last classifier layer was
in terms of time complexity as well as in terms of the large, replaced with a randomly initialized one.
required dataset. Nowadays, the concept of fine-tuning a strategy is quite
In general, transfer learning techniques are used in two similar. While we do not, in general, replace the last classi-
ways – one being the approach where the weights of the pre- fier layer with randomly initialized one, we most commonly
trained model are preserved (frozen) on some of the layers enable fine-tuning of some of the layers in a pre-trained
and fine-tuned (trained) in the remaining layers, and the other neural network instead of replacing them. However, at this
being the approach where the pre-trained deep neural network point, the question of how to select fine-tunable layers for a
is utilized as a feature extractor, while the extracted features pre-trained network model arises.
The last few layers of a deep neural network are usually are copied to the trial solution.
fine-tuned, while the remaining initial layers are kept frozen ( (t)
with their initial pre-trained values. The motivation behind (t+1) u randj (0, 1) ≤ CR ∨ j = jrand ,
wi,j = i,j (t) (3)
such a strategy is driven by a combination of size-limited xi,j otherwise,
datasets and some empirical evidence that the initial layers Finally, the selection operator is utilized to decide whether
(bottom layers) of a deep neural network preserve more a produced vector should become a generation member uti-
abstract, generic features. Such features are commonly appli- lizing the greedy criterion. The selection could be formally
cable to a broader range of tasks, while the layers closer to the expressed as follows:
top provide more specific task-related features [19]. However, (
(t) (t) (t)
while the selection of fine-tunable layers is still a more or less (t+1) wi if f (wi ) ≤ f (xi ),
xi = (t) (4)
manual process, which most commonly requires a tremen- xi otherwise .
dous amount of experimenting, and while there are significant
(t)
number of empirical studies [10], [18], [40] which show where f (xi ) denotes the fitness function defined for solving
great success with regard to limiting the fine-tuning of the a specific optimization problem. The DE algorithm works in
last few layers, there are also some recent studies [19], [28] an iterative manner, where each produced solution is evalu-
which diminish the assumption that the early or middle layer ated using the given fitness function. Based on the selection
features should be shared. Based on the mentioned empirical operator, presented in Eq. 4, the DE algorithm’s search mech-
findings and given that there is no general rule or recipe anism seeks solutions for ever-better fitness scores. Naturally,
to follow when selecting fine-tunable layers, we decided to the DE algorithm is optimized for solving the minimization
tackle the problem from the perspective of representing it as optimization problems, and therefore the fitness function
an optimization problem and tried to solve it utilizing a well- should be tailored to give the lower score to the preferred
known optimization meta-heuristic algorithm. solutions.
The choice of DE parameters, namely Np, F, and CR can
C. DIFFERENTIAL EVOLUTION have a enourmous impact on optimization performance. The
Differential Evolution (DE) is one of the most popular Np parameter, as presented, denotes the population size of
population-based meta-heuristic algorithms, introduced by real-coded vectors (individuals) on top of which the muta-
Storn and Price in 1997 [41]. Thanks to many wins at interna- tion, crossover, and selection operators are applied. Based on
tional competitions, DE is considered one of the most appro- research by Piotrowski [45], a too small population size limits
priate algorithms for continuous optimization. Besides the the number of available moves, which may lead to stagna-
general popularity of the DE algorithm for solving optimiza- tion (the population stops proceeding towards the optimum,
tion tasks, it was also successfully applied to various machine although population diversity remains high) or premature
learning problems, such as the hyper-parameter optimization convergence. On the contrary, a vast population slows down
problem [42], the problem of designing neural network archi- individuals’ clustering and frequently wastes many function
tecture [43] or the feature selection problem [44]. calls on almost random explorative moves. The F parameter
The DE algorithm is composed of Np real-coded vectors is a scale factor which controls the length of the exploitation
and three operators: mutations, crossovers and selections. vector and thus determines how far from point xi the offspring
The Np real-coded vectors are representing the candidate should be generated. Based on research from Das et al. [46],
solutions (individuals) which can be formally defined as a good initial choice of F is 0.5, while the effective range of F
presented in Eq. 1, where each element of the solution is in is usually between 0.4 and 1. The DE parameter CR controls
(t) (L) (U ) (L) (U )
the interval xi,1 ∈ [xi , xi ], while xi and xi denote the how many parameters are expected to change in a population
lower and upper bounds of the i-th variable, respectively. member. When CR is set to a low value, a small number of
parameters are changed in each generation, and the step-wise
(t) (t) (t)
xi = (xi,1 , . . . , xi,n ), for i = 1, . . . , Np, (1) movement tends to be orthogonal to the current coordinate
axes. On the other hand, high values of CR cause most of
The DE’s basic strategy consists of mutation, crossover, the mutant vector directions to be inherited, prohibiting the
and selection operations. The mutation operation can be for- generation of axis orthogonal steps [46].
mally expressed as follows:
IV. ADAPTIVE FINE-TUNING
(t) (t) (t) (t)
ui = xr1 + F· (xr2 − xr3 ), for i = 1, . . . , Np, (2) To tackle the problem of finding and selecting which layers of
given a CNN architecture to fine-tune, we have developed the
where F represents the scaling factor as a positive real number adaptive DEFT (Differential Evolution based Fine-Tuning)
that scales the rate of modification while r1, r2 and r3 are method. As the problem of finding and selecting the layers
randomly selected values in the interval 1 . . . Np. to fine-tune can be easily translated into an optimization
In order to increase the diversity of the parameter vectors, problem, we adopted the DE algorithm for the purposes of
the crossover operator is introduced as presented in Eq. 3, finding the most optimal solution to the problem. The solution
where CR ∈ [0.0, 1.0] controls the fraction of parameters that represents an optimal combination of layers, which should be
B. EVALUATION OF SELECTED LAYERS TABLE 1. The basic information about Osteosarcoma data from UT
Southwestern/UT Dallas for Viable and Necrotic Tumor Assessment
For each produced individual s – described in the previous dataset class distribution.
subsection – the fine-tuning of CNN is conducted. To deter-
mine how good or bad the produced individual is, we define
a fitness function L. The fitness function is calculated after
CNN model fine-tuning based on the given individual is
finished and returned to the DE algorithm in order to enable
the DE to find better individuals. The fine-tuning is conducted
for a maximum number of epochs utilizing the early stopping
technique to stop the fine-tuning of not-so-promising layer
• DEFT where the transfer learning approach with our
selections. The maximum number of epochs is a parameter
proposed method was utilized to select the most optimal
that can be set manually; in general, it should be set with
fine-tunable layers.
regard to a target dataset. To evaluate the performance of a
model, trained by fine-tuning the layers selected by the pro- To objectively evaluate the performance of compared
duced individual, we adopted a well-known categorical cross- methods, the experiments were conducted in two different
entropy (CCE) loss function [48], which can be formally scenarios. In the first scenario, the conventional, pretrained,
expressed as follows: and baseline methods are utilizing the same early stopping
technique as the DEFT. In contrast, in the second scenario,
M C no early stopping criteria is used for the three compared
1 XX
L = CCE = − yij log(pij ) (7) methods.
M
i=1 j=1 All of the conducted experiments were implemented in the
Python programming language with the use of the follow-
where i indexes samples from the total of M samples, ing libraries: Keras [49] with Tensorflow [50] backend for
j indexes classes from total ofPC classes, y denotes the sample developing and training CNNs, NiaPy [51] for providing a DE
label, and pij ∈ (0, 1) : j pij = 1∀i, j represents the algorithm implementation, and PyCM [52] for classification
prediction for a sample. performance metrics calculation.
The experiments were executed on a single Intel Core
C. TRAINING THE FINAL CNN MODEL i7-6700K based PC, with 4 cores (8 threads) CPU running at
To obtain the best performing selection of layers, a subset 4 GHz, with 64 GB of RAM, and three Nvidia GeForce Titan
of 80% of the training set was used within the DE loop for X Pascal GPUs each with 12 GB of dedicated GDDR5 mem-
training CNN models based on the DE algorithm’s individual ory, running the Linux Mint 19 operating system.
solution. The remaining 20% of the training set is then used
to evaluate the performance of trained CNN models using the A. DATASET
fitness function in Eq. 7. The proposed DEFT method is evaluated on the prob-
After all candidate solutions are evaluated, the best per- lem of identifying the osteosarcoma from hematoxylin
forming one is selected, based on which the combination and eosin (H & E) stained osteosarcoma images. We used
of layers selected for fine-tuning is used to fine-tune the a publicly available dataset Osteosarcoma data from UT
final CNN. In contrast to the models trained on candidate Southwestern/UT Dallas for Viable and Necrotic Tumor
solutions, this final CNN model is trained from the start upon Assessment [53], the properties of which are presented
the whole training set. This final CNN model is the result of in Table 1. The data were collected by clinical scientists
the DEFT method. at the University of Texas Southwestern Medical Center,
Dallas. The archival samples of 50 patients treated at the
V. EXPERIMENTAL SETUP Children’s Medical Center, Dallas, between 1995 and 2015,
To evaluate the performance of the proposed adaptive DEFT were used to create this dataset from which 942 histology
method, the experimental approach was utilized. Experiments glass slides were digitized into whole slide images (WSI).
were conducted with four methods – the three compared From those, two pathologists manually selected 40 WSIs
methods and our proposed DEFT method, tackling the task representing the heterogeneity of a tumor and response char-
of identifying the osteosarcoma from Hematoxylin and eosin acteristics under study. Thirty 1024 x 1024 pixel image tiles
(H & E) stained osteosarcoma images: at 10X magnification factor, as seggested by the patholo-
• conventional where the conventional approach for train- gists, were randomly selected from each WSI. From the
ing a CNN was utilized, resulting 1,200 images tiles, 66 irrelevant image tiles such
• pretrained where the CNN was trained using the con- as image tiles falling in non-tissue, ink-mark regions, and
ventional approach with pre-trained weights, blurry images were removed. The performed randomiza-
• baseline where the transfer learning approach with the tion of tile-generation was conducted to remove any bias in
fine-tuning of handpicked layers of CNN architecture the dataset, prepared for feature-generation and subsequent
was used, and machine/deep-learning steps. Two medical experts performed
FIGURE 3. Sample images from the dataset. The a) sample represents the
non-tumor image, b) represents the necrotic tumor image while c)
represents the viable tumor image.
FIGURE 4. The presentation of the utilized CNN architecture. The convolutional base (Blocks 1 - 5) is adopted from VGG19 CNN
architecture. The dropout layers and fully connected layers are denoted with DO and FC, respectively.
TABLE 3. Used training parameters for the conducted experiments. Therefore, all the reported results are the averages across all
those runs.
TABLE 5. Averages of classification performance metrics over 10 folds trained up to 500 epochs with the aid of early stopping
when classifiers were trained for up to 500 epochs with early stopping
criteria. criteria. On the other hand, the results presented in Table 6
reveal the same metrics for the case where all three com-
pared methods had been trained for a full 500 epochs. Our
DEFT method uses the same approach (up to 50 iterations
of 20 epochs with early stopping) in both cases.
In both cases, for all the classification metrics (accuracy,
AUNP, F1-macro and Cohen’s Kappa coefficient), the results
of the DEFT method stand out, outperforming all the com-
pared methods by a great margin, although the difference is
smaller in cases where compared methods are trained for a
TABLE 6. Averages of classification performance metrics over 10 folds
when classifiers were trained for full 500 epochs. full 500 epochs. For all three compared methods, the results
are better in cases when they were trained for 500 epochs.
In terms of accuracy, the DEFT method outperforms the
second-best conventional method by a margin of 4.5% and
the other two methods by a margin of 17.1-21.8%. Focus-
ing on AUNP, F1-macro and Cohen’s kappa coefficient the
performance improvements are 3.5-17.4%, 4.3-29.9% and
6.9-34.8% respectively.
It is interesting to observe that in cases where early stop-
ping was used, the results of all three compared methods
are quite similar, with the best of them being the baseline.
However, in the case of a full 500 epochs of training, the con-
the prior probability of class j. In other words, the p(i, j) ventional method outperformed the other two methods by
represents the estimated probability of example i to be of class a significant margin, while the other two performed very
j taking the values in [0, 1]. The l(·) is a comparison function similarly. This is an intriguing behavior since, due to the small
satisfying l(a, b) = 1 if a > b, l(a, b) = 0 if a < b and dataset, we would expect that the pretrained and especially
l(a, b) = 0.5 if a = b [59]. the baseline method would, overall, perform better than the
Pm conventionally trained one, due to the high risk of extreme
f (i, j)l(p(i, j), p(t, j))
AUC(j, k) = i=1 (8) over-fitting of such deep CNN architectures trained against a
mj · mk small dataset. Also, the handpicked selection of fine-tunable
The AUNP metric introduced in 2001 by Fawcett [60] layers of a baseline method was made based on the previous
computes the AUC treating a c-dimensional classifier as c empirical results in which such a selection of layers proved
two-dimensional classifiers, taking into account the prior to be successful.
probability of each class (p(j)). Formally, the AUNP can be
expressed as follows: 1) CLASSIFICATION PERFORMANCE METRICS: ACCURACY,
F1-SCORE, AUNP AND KAPPA
c
X In order to perform a more in-depth classification perfor-
AUNP = p(j)AUC(j, restj ), (9)
mance analysis of our proposed DEFT method, we present
j=1
the performance comparison among the compared classifier
where restj represents all classes different from class j. methods on 10 folds using the box-dot-plot visualizations for
accuracy, F1-score, AUNP and Cohen’s kappa coefficient.
VI. RESULTS For each metric, we performed two experiments: a) first,
Using the presented experiment setup, evaluation method we compared the results of training the classifiers for up to
and classification metrics, we obtained the results, which are 500 epochs using the early stopping criteria, and b) second,
presented and discussed in detail in the following subsections. we compared the results of training the classifiers for a full
500 epochs.
A. CLASSIFICATION PERFORMANCE COMPARISON Fig. 5 presents the accuracy performance comparison
The classification results obtained from the conducted exper- among the compared methods in the case of using early
iments identifying the osteosarcoma from H&E stained stopping. We can easily observe that the DEFT method out-
osteosarcoma images, using the dataset Osteosarcoma data performs the three compared methods, both in terms of mean
from the UT Southwestern/UT Dallas for Viable and Necrotic accuracy over 10-folds as well as in terms of the standard
Tumor Assessment, are presented in Table 5 and Table 6. deviation of the accuracy. The small standard deviation of
Table 5 shows the averages of various classification perfor- accuracy also shows the capability of our method to gen-
mance metrics over ten folds when using the three compared eralize well. The second best method in terms of overall
methods (conventional, pretrained and baseline), which were accuracy seems to be the baseline, followed by the pretrained
FIGURE 5. Accuracy of compared methods trained for up to 500 epochs FIGURE 7. F1-macro of compared methods trained for up to 500 epochs
with early stopping; each dot represents one fold. with early stopping; each dot represents one fold.
FIGURE 6. Accuracy of compared methods trained for full 500 epochs; FIGURE 8. F1-macro of compared methods trained for a full 500 epochs;
each dot represents one fold. each dot represents one fold.
one, while the conventional method performed the worst. The dropout layers. The side effect of the applied regularization
accuracy results of the three compared methods are in line in the case of the conventional method resulted in a slow but
with expectations. While the conventional method, trained steady convergence throughout training. In contrast, for the
from scratch, was not able to achieve good performance methods that utilized fine-tuning, some sort of regularization
within a few epochs, the pretrained method benefited from was necessary to mitigate the over-fitting, which is generally
the pre-trained weights. The baseline method benefited even a quite common effect when dealing with transfer learning
more as the CNN with the pre-trained weights was fine-tuned approaches.
with regard to reasonably hand-picked layers only rather than Very similar results can also be observed for the other three
all layers as the pretrained method. performance metrics: F1-score, AUNP, and kappa.
Although all four methods used the same early stopping The F1-score can be, in general, interpreted as a har-
criteria and could have been theoretically trained for the same monic mean of the precision and recall score. When dealing
amount of epochs, it turned out that the three compared meth- with a multi-classification problem, we obtain a per class
ods used much fewer epochs to train (see Table 5). In order F1-score, which we would like to represent in the form of one
to make a more fair comparison, in the second experiment, value representing the classifiers’ performance. One possible
we fixed the number of epochs at 500 for all three compared way to achieve that is to use a macro-averaged F1-score,
methods by removing the early stopping mechanism (Fig. 6). which is computed as the simple arithmetic mean of per-class
We can see that the results of all three compared methods F1-scores. The F1-score results are presented in Fig. 7 and
improved, but were still not able to outperform our DEFT Fig. 8.
method, both with regard to the overall accuracy and stan- The results of AUNP metric performance for all compared
dard deviation. The method which benefited the most from methods are presented in Fig. 9 and Fig. 10. AUNP is a
prolonged training was the conventional, while the other two metric that combines the AUC measure of each class against
methods clearly finished in local minima in several folds. the rest, using the a priori class distribution. It is one of the
This benefit could be attributed to the quite drastic regular- most common AUC variations when dealing with multi-class
ization applied in the top layers with the utilization of two classifiers, as in our case.
FIGURE 9. AUNP of compared methods trained for up to 500 epochs with FIGURE 11. Kappa of compared methods trained for up to 500 epochs
early stopping; each dot represents one fold. with early stopping; each dot represents one fold.
FIGURE 10. AUNP of compared methods trained for a full 500 epochs; FIGURE 12. Kappa of compared methods trained for a full 500 epochs;
each dot represents one fold. each dot represents one fold.
Fig. 11 and Fig. 12 shows a comparison of Cohen’s Kappa consumed a lot more epochs due to its iterative nature. For
coefficient values for all of the compared methods. Essen- the DEFT method, such behavior was expected. On the other
tially, the Cohen’s Kappa metric compares an observed accu- hand, we expected other methods to use more epochs before
racy with an expected accuracy (random chance), which is stopping, but it turned out that the training stopped after no
generally less misleading than the accuracy metric itself, due more than 10 epochs on average (see Table 5). When the
to the Cohen’s Kappa taking random chance into account. full 500 epochs were used for training, the three compared
Fleiss’s characterization [61] of the Kappa coefficient val- methods’ classification performance improved but were still
ues are translated into a poor agreement when the value unable to outperform our proposed DEFT method.
is < 0.40, fair to a good agreement when the value is The overall time spent on training was highly correlated
0.40 − 0.75 and an excellent agreement when the value with the number of used epochs, although it turned out that
is > 0.75. Following the above-mentioned characterization there were some differences. Interestingly, although trained
of Cohen’s Kappa coefficient values, our proposed DEFT for the same amount of epochs, the baseline method seems
method, on average, achieves an excellent agreement with to consume the least amount of time, being followed by the
an average of 0.7866. On the other hand, the remaining pretrained and conventional methods, while our proposed
compared methods achieved, on average, a poor agreement DEFT method used the highest amount of time. This could
in the case of using early stopping (see Table 5) and fair to be attributed to the fact that the DE optimization algorithm’s
good agreement in the case of not using early stopping (see inner workings consume a significant amount of time.
Table 6). The analysis of the total spent training time is presented
in Fig. 13. Observing it, we can easily see that the time
2) COMPUTATIONAL AND TIME COMPLEXITY: NUMBER OF consumption does not vary greatly on a per fold basis, which
EPOCHS AND TOTAL TRAINING TIME is encouraging and proves that the layer selection mechanism
When focusing on the number of epochs, in cases when early of our DEFT method is capable of finding the most optimal
stopping criteria was applied, the proposed DEFT method solution based on the given constraints. Additionally, it also
TABLE 7. Average ranks (the best result is shown in bold) and statistical comparison (all differences are significant) of the four methods trained with
early stopping.
TABLE 8. Average ranks (the best result is shown in bold) and statistical comparison (only time and epochs have significant differences) of the four
methods trained for full 500 epochs.
C. THREATS TO VALIDITY
Commonly, in the machine learning field, the validity threats
often relate to the diversity, quality and quantity of the data.
Since all supervised machine learning methods, techniques,
FIGURE 14. Layer selection analysis, based on a DEFT method results. The
y-axis denotes the average selections from the best 5 performing and approaches rely on how the given data is labeled or pre-
individual candidates to the worst 5 performing ones. On the x-axis there classified, for our research we picked the dataset that was
are numbers denoting the layers of the VGG19 architecture.
collected by clinical scientists and labeled by two medical
experts to minimize potential threats to validity. Nevertheless,
7 layers, 3 of them being convolutional layers, 3 of them our obtained results and findings may not be generalized to
being maximization pooling layers, and two being fully con- all specific situations.
nected layers. In terms of accuracy, our proposed method Splitting the data into a training and test set could also be a
outperforms the mentioned one by a margin of 2.57%. In [64] potential a threat to validity. To reduce the possibility of such
and [54], the reported accuracy was somewhat higher than it a threat, we adopted a well-known 10-fold cross-validation
is in our case (by 5.83% and 4.63%, respectively). However, procedure.
we must also consider that the authors in [54] are using mech- Due to the stochastic nature of our proposed DEFT
anisms for the expert-guided generation of features, while on method, in order to reduce the internal threat to validity,
the other side, our proposed DEFT method does not utilize the experiment conducted with the DEFT method was exe-
any domain expert knowledge in the process of building the cuted in 10 runs, and the reported performance metrics were
predictive model. the averages of those runs.
be utilized with any other CNN architecture regardless of [10] G. Vrbančič, M. Zorman, and V. Podgorelec, ‘‘Transfer learning tuning
the number of convolutional layers, with respect to adjusting utilizing grey wolf optimizer for identification of brain hemorrhage from
head ct images,’’ in Proc. StuCoSReC: 6th Student Comput. Sci. Res. Conf.,
method parameters such as the dimension of the problem and 2019, pp. 61–66.
the number of function evaluations. Generally, the latter one [11] B. Q. Huynh, H. Li, and M. L. Giger, ‘‘Digital mammographic tumor
should be increased when utilizing deeper CNN architectures classification using transfer learning from deep convolutional neural net-
works,’’ J. Med. Imag., vol. 3, no. 3, Aug. 2016, Art. no. 034501.
since such architectures feature a larger number of layers, [12] J. J. Gómez-Valverde, A. Antón, G. Fatti, B. Liefers, A. Herranz,
which translates to a larger search space and increased time A. Santos, C. I. Sánchez, and M. J. Ledesma-Carbayo, ‘‘Automatic glau-
complexity in order to find the most suitable combination coma classification using color fundus images based on convolutional
neural networks and transfer learning,’’ Biomed. Opt. Express, vol. 10,
of layers selected for fine-tuning. Similar to the utilization no. 2, pp. 892–913, 2019.
of more conventional methods, when applying our proposed [13] R. Mehra, ‘‘Breast cancer histology images classification: Training from
DEFT method against various datasets with a different num- scratch or transfer learning?’’ ICT Express, vol. 4, no. 4, pp. 247–254,
2018.
ber of samples, one should also revise and appropriately adapt [14] D. S. Kermany, M. Goldbaum, W. Cai, C. C. Valentim, H. Liang,
the classifier layers as well as initial training parameters to S. L. Baxter, A. McKeown, G. Yang, X. Wu, and F. Yan, ‘‘Identifying
achieve the best possible outcome. medical diagnoses and treatable diseases by image-based deep learning,’’
Cell, vol. 172, no. 5, pp. 1122–1131, 2018.
In the future, we would like to extend our research to [15] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, ‘‘How transferable are
the utilization of various optimization algorithms such as features in deep neural networks?’’ in Proc. Adv. Neural Inf. Process. Syst.,
the Firefly algorithm or the Particle Swarm Optimization. 2014, pp. 3320–3328.
[16] G. Vrbancic, I. J. Fister, and V. Podgorelec, ‘‘Parameter setting for deep
We would also like to apply the proposed DEFT method on neural networks using swarm intelligence on phishing Websites classifica-
different medical imaging datasets, and possibly also to other tion,’’ Int. J. Artif. Intell. Tools, vol. 28, no. 6, Oct. 2019, Art. no. 1960008.
image classification tasks. To analyse how the method per- [17] H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, ‘‘Fac-
tors of transferability for a generic ConvNet representation,’’ IEEE Trans.
forms regardless of the classification domain, it would have Pattern Anal. Mach. Intell., vol. 38, no. 9, pp. 1790–1802, Sep. 2016.
been useful to test it on several datasets on multiple tasks, [18] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall,
including those from other domains. Additionally, we would M. B. Gotway, and J. Liang, ‘‘Convolutional neural networks for medical
image analysis: Full training or fine tuning?’’ IEEE Trans. Med. Imag.,
like to explore the possibilities of how to make the layers vol. 35, no. 5, pp. 1299–1312, May 2016.
selection mechanism more efficient, which would enable the [19] Y. Guo, H. Shi, A. Kumar, K. Grauman, T. Rosing, and R. Feris, ‘‘Spot-
DEFT method to deliver better classification performance Tune: Transfer learning through adaptive fine-tuning,’’ in Proc. IEEE/CVF
Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4805–4814.
with lower time complexity, closer to the one where con- [20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ‘‘ImageNet:
ventional training is utilized. Finally, we would also like to A large-scale hierarchical image database,’’ in Proc. IEEE Conf. Comput.
investigate the possibilities of combining our DEFT method Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[21] L. Zhu, S. O. Arik, Y. Yang, and T. Pfister, ‘‘Learning to transfer learn:
with active learning approaches. Reinforcement learning-based selection for adaptive transfer learning,’’
arXiv, New York, NY, USA, Tech. Rep., 2020.
[22] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie, ‘‘Large scale
REFERENCES
fine-grained categorization and domain-specific transfer learning,’’ in
[1] A. Kamilaris and F. X. Prenafeta-Boldú, ‘‘Deep learning in agriculture: A Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
survey,’’ Comput. Electron. Agricult., vol. 147, pp. 70–90, Apr. 2018. pp. 4109–4118.
[2] Q. Kong, D. T. Trugman, Z. E. Ross, M. J. Bianco, B. J. Meade, and [23] W. Ge and Y. Yu, ‘‘Borrowing treasures from the wealthy: Deep transfer
P. Gerstoft, ‘‘Machine learning in seismology: Turning data into insights,’’ learning through selective joint fine-tuning,’’ in Proc. IEEE Conf. Comput.
Seismol. Res. Lett., vol. 90, no. 1, pp. 3–14, Jan. 2019. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1086–1095.
[3] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, ‘‘A deep learning [24] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘‘Rich feature hierarchies
approach for network intrusion detection system,’’ in Proc. 9th EAI Int. for accurate object detection and semantic segmentation,’’ in Proc. IEEE
Conf. Bio-Inspired Inf. Commun. Technol. (formerly BIONETICS), 2016, Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
pp. 21–26. [25] M. Long, Y. Cao, J. Wang, and M. Jordan, ‘‘Learning transferable features
[4] J. Flisar and V. Podgorelec, ‘‘Identification of self-admitted technical debt with deep adaptation networks,’’ in Proc. Int. Conf. Mach. Learn., 2015,
using enhanced feature selection based on word embedding,’’ IEEE Access, pp. 97–105.
vol. 7, pp. 106475–106494, 2019. [26] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, ‘‘CNN features
Off-the-shelf: An astounding baseline for recognition,’’ in Proc. IEEE
[5] C. Angermueller, T. Pärnamaa, L. Parts, and O. Stegle, ‘‘Deep learn-
Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2014, pp. 806–813.
ing for computational biology,’’ Mol. Syst. Biol., vol. 12, no. 7, p. 878,
[27] Z. Shi, H. Hao, M. Zhao, Y. Feng, L. He, Y. Wang, and K. Suzuki, ‘‘A
Jul. 2016.
deep CNN based transfer learning method for false positive reduction,’’
[6] R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley, ‘‘Deep learning Multimedia Tools Appl., vol. 78, no. 1, pp. 1017–1033, Jan. 2019.
for healthcare: Review, opportunities and challenges,’’ Briefings Bioinf., [28] A. Veit, M. J. Wilber, and S. Belongie, ‘‘Residual networks behave like
vol. 19, no. 6, pp. 1236–1246, Nov. 2018. ensembles of relatively shallow networks,’’ in Proc. Adv. Neural Inf. Pro-
[7] T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, cess. Syst., 2016, pp. 550–558.
B. T. Do, G. P. Way, E. Ferrero, P.-M. Agapow, M. Zietz, and [29] X. J. Zhu, ‘‘Semi-supervised learning literature survey,’’ Dept. Comput.
M. M. Hoffman, ‘‘Opportunities and obstacles for deep learning in Sci., Univ. Wisconsin-Madison, Madison, WI, USA, Tech. Rep. 1530,
biology and medicine,’’ J. Roy. Soc. Interface, vol. 15, no. 141, 2018, 2005.
Art. no. 20170387. [30] T. N. Kipf and M. Welling, ‘‘Semi-supervised classification with graph
[8] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoo- convolutional networks,’’ 2016, arXiv:1609.02907. [Online]. Available:
rian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, ‘‘A http://arxiv.org/abs/1609.02907
survey on deep learning in medical image analysis,’’ Med. Image Anal., [31] L. I. Kuncheva and J. J. Rodriguez, ‘‘Classifier ensembles with a random
vol. 42, pp. 60–88, Dec. 2017. linear oracle,’’ IEEE Trans. Knowl. Data Eng., vol. 19, no. 4, pp. 500–508,
[9] M. I. Razzak, S. Naz, and A. Zaib, ‘‘Deep learning for medical image Apr. 2007.
processing: Overview, challenges and the future,’’ in Classification in [32] S. J. Pan and Q. Yang, ‘‘A survey on transfer learning,’’ IEEE Trans. Knowl.
BioApps. Cham, Switzerland: Springer, 2018, pp. 323–350. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[33] J. Y. Ching, A. K. C. Wong, and K. C. C. Chan, ‘‘Class-dependent dis- [57] G. Vrbancic and V. Podgorelec, ‘‘Automatic classification of motor
cretization for inductive learning from continuous and mixed-mode data,’’ impairment neural disorders from EEG signals using deep convolu-
IEEE Trans. Pattern Anal. Mach. Intell., vol. 17, no. 7, pp. 641–651, tional neural networks,’’ Elektronika ir Elektrotechnika, vol. 24, no. 4,
Jul. 1995. pp. 3–7, Aug. 2018. [Online]. Available: http://eejournal.ktu.lt/index.
[34] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, ‘‘Inductive learning php/elt/article/view/21469
algorithms and representations for text categorization,’’ in Proc. 7th Int. [58] G. Vrbancic, I. J. Fister, and V. Podgorelec, ‘‘Automatic detec-
Conf. Inf. Knowl. Manage. (CIKM), 1998, pp. 148–152. tion of heartbeats in heart sound signals using deep convolutional
[35] Q. Yang, C. Ling, X. Chai, and R. Pan, ‘‘Test-cost sensitive classification neural networks,’’ Elektronika ir Elektrotechnika, vol. 25, no. 3,
on data with missing values,’’ IEEE Trans. Knowl. Data Eng., vol. 18, no. 5, pp. 71–76, Jun. 2019. [Online]. Available: http://eejournal.ktu.lt/index.
pp. 626–638, May 2006. php/elt/article/view/23680
[36] X. Zhu and X. Wu, ‘‘Class noise handling for effective cost-sensitive learn- [59] C. Ferri, J. Hernández-Orallo, and R. Modroiu, ‘‘An experimental compar-
ing by cost-guided iterative classification filtering,’’ IEEE Trans. Knowl. ison of performance measures for classification,’’ Pattern Recognit. Lett.,
Data Eng., vol. 18, no. 10, pp. 1435–1440, Oct. 2006. vol. 30, no. 1, pp. 27–38, 2009.
[37] M. Hussain, J. J. Bird, and D. R. Faria, ‘‘A study on CNN transfer learning [60] T. Fawcett, ‘‘Using rule sets to maximize ROC performance,’’ in Proc.
for image classification,’’ in Proc. UK Workshop Comput. Intell. Cham, IEEE Int. Conf. Data Mining, Nov./Dec. 2001, pp. 131–138.
Switzerland: Springer, 2018, pp. 191–202. [61] J. L. Fleiss, B. Levin, and M. C. Paik, Statistical Methods for Rates and
[38] K. Nogueira, O. A. B. Penatti, and J. A. dos Santos, ‘‘Towards better Proportions, 3rd ed. Hoboken, NJ, USA: Wiley, 2003.
exploiting convolutional neural networks for remote sensing scene clas- [62] J. Demsar, ‘‘Statistical comparisons of classifiers over multiple data sets,’’
sification,’’ Pattern Recognit., vol. 61, pp. 539–556, Jan. 2017. J. Mach. Learn. Res., vol. 7, pp. 1–30, Jan. 2006.
[39] G. E. Hinton, ‘‘Reducing the dimensionality of data with neural networks,’’ [63] R. Mishra, O. Daescu, P. Leavey, D. Rakheja, and A. Sengupta,
Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006. ‘‘Histopathological diagnosis for viable and non-viable tumor prediction
[40] H. Chougrad, H. Zouaki, and O. Alheyane, ‘‘Deep convolutional neu- for osteosarcoma using convolutional neural network,’’ in Proc. Int. Symp.
ral networks for breast cancer screening,’’ Comput. Methods Programs Bioinf. Res. Appl. Springer, 2017, pp. 12–23.
Biomed., vol. 157, pp. 19–30, Apr. 2018. [64] R. Mishra, O. Daescu, P. Leavey, and D. Rakheja, ‘‘Convolutional neural
[41] R. Storn and K. Price, ‘‘Differential evolution–A simple and efficient network for histopathological analysis of osteosarcoma,’’ J. Comput. Biol.,
heuristic for global optimization over continuous spaces,’’ J. Global vol. 25, no. 3, pp. 313–325, 2018.
Optim., vol. 11, no. 4, pp. 341–359, 1997.
[42] B. Nakisa, M. N. Rastgoo, A. Rakotonirainy, F. Maire, and
V. Chandran, ‘‘Long short term memory hyperparameter optimization for GREGA VRBANČIČ (Graduate Student Member,
a neural network based emotion recognition framework,’’ IEEE Access, IEEE) received the B.Sc. and M.Eng. degrees in
vol. 6, pp. 49325–49338, 2018. informatics and communication technologies from
[43] N. Xue, I. Triguero, G. P. Figueredo, and D. Landa-Silva, ‘‘Evolving deep the University of Maribor, in 2015 and 2017,
CNN-LSTMs for inventory time series prediction,’’ in Proc. IEEE Congr. respectively, where he is currently pursuing the
Evol. Comput. (CEC), Jun. 2019, pp. 1517–1524. Ph.D. degree in computer science. He is also a
[44] L. Brezočnik, I. Fister, and G. Vrbančič, ‘‘Applying differential evolution Young Researcher with the Faculty of Electrical
with threshold mechanism for feature selection on a phishing Websites Engineering and Computer Science, University of
classification,’’ in New Trends in Databases and Information Systems, Maribor. He is the author of three peer-reviewed
T. Welzer, J. Eder, V. Podgorelec, R. Wrembel, M. Ivanović, J. Gamper, scientific journal articles and eight conference
M. Morzy, T. Tzouramanis, J. Darmont, and A. K. Latifić, Eds. Cham, papers. He has been involved in several industrial research and development
Switzerland: Springer, 2019, pp. 11–18. projects. His research interests include deep learning, especially convolu-
[45] A. P. Piotrowski, ‘‘Review of differential evolution population size,’’
tional neural networks, focusing on training strategies and transfer learning.
Swarm Evol. Comput., vol. 32, pp. 1–24, Feb. 2017.
[46] S. Das and P. N. Suganthan, ‘‘Differential evolution: A survey of the
State-of-the-Art,’’ IEEE Trans. Evol. Comput., vol. 15, no. 1, pp. 4–31, VILI PODGORELEC (Member, IEEE) received
Feb. 2011. the Ph.D. degree from the University of Maribor,
[47] D. Fister, I. Fister, T. Jagric, I. Fister, and J. Brest, ‘‘A novel self-adaptive Slovenia, in 2001.
differential evolution for feature selection using threshold mechanism,’’ in He worked as a Visiting Professor and/or
Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Nov. 2018, pp. 17–24. a Researcher at several universities around the
[48] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, world, including the University of Osaka, Japan;
USA: Springer-Verlag, 2006. the Federal University of Sao Paulo, Brazil; the
[49] F. Chollet. (2015). Keras. [Online]. Available: https://keras.io University of Nantes, France; the University of La
[50] (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous
Laguna, Spain; the University of Madeira, Portu-
Systems. [Online]. Available: https://www.tensorflow.org/
[51] G. Vrbančič, L. Brezočnik, U. Mlakar, D. Fister, and I. Fister, Jr., ‘‘NiaPy: gal; the University of Applied Sciences Seinäjoki,
Python microframework for building nature-inspired algorithms,’’ J. Open Finland; and the University of Applied Sciences Valencia, Spain. He is
Source Softw., vol. 3, no. 23, p. 613, Mar. 2018, doi: 10.21105/joss.00613. currently a Professor of computer science with the University of Maribor.
[52] S. Haghighi, M. Jasemi, S. Hessabi, and A. Zolanvari, ‘‘PyCM: Multiclass He has been involved in AI and intelligent systems for 20 years, where
confusion matrix library in Python,’’ J. Open Source Softw., vol. 3, no. 25, he gained professional experience in the implementation of many scientific
p. 729, May 2018, doi: 10.21105/joss.00729. and industrial research and development projects related to the analysis,
[53] P. Leavey, A. Sengupta, D. Rakheja, O. Daescu, H. B. Arunachalam, design, implementation, integration, and evaluation of intelligent informa-
and R. Mishra, ‘‘Osteosarcoma data from ut southwestern/UT Dallas for tion systems. He has authored more than 50 peer-reviewed scientific journal
viable and necrotic tumor assessment [data set],’’ Cancer Imag. Arch., articles, more than 100 conference papers, three books, and several book
Fayetteville, AR, USA, Tech. Rep., 2019. chapters on machine learning, computational intelligence, data science,
[54] H. B. Arunachalam, R. Mishra, O. Daescu, K. Cederberg, D. Rakheja, medical informatics, and software engineering. He has top expertise in AI
A. Sengupta, D. Leonard, R. Hallac, and P. Leavey, ‘‘Viable and
and machine learning methods and algorithms in the field of transparent
necrotic tumor assessment from whole slide images of osteosarcoma using
data-driven decision making (especially in medicine), in-depth knowledge
machine-learning and deep-learning models,’’ PLoS ONE, vol. 14, no. 4,
Apr. 2019, Art. no. e0210706.
of system integration technologies and methods applied to intelligent data
[55] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for analysis, classification and prediction of human-centered data, as well as
large-scale image recognition,’’ 2014, arXiv:1409.1556. [Online]. Avail- large experience in designing and implementing information retrieval, nat-
able: http://arxiv.org/abs/1409.1556 ural language processing and text mining solutions for academia, industrial
[56] S.-J. Kim, C. Wang, B. Zhao, H. Im, J. Min, H. J. Choi, J. Tadros, partners and international companies using state-of-the-art approaches, and
N. R. Choi, C. M. Castro, R. Weissleder, H. Lee, and K. Lee, ‘‘Deep methods and tools. He received several international awards and grants for
transfer learning-based hologram classification for molecular diagnostics,’’ his research activities.
Sci. Rep., vol. 8, no. 1, pp. 1–12, Dec. 2018.