Kanhere Thesis 2023
by
Baltimore, Maryland
May, 2023
Segmentation is one of the primary tasks in deep learning for medical imag-
rithms using deep learning have created highly accurate segmentation models
for the delineation of several organs of the body on various imaging modalities.
However, training such models has several limitations, such as the need for
a large amount of well-annotated data and high-quality images, and the need
to create a separate segmentation model for each use case. Federated learning
has the potential to solve both of these problems by aggregating the knowl-
edge from several local models into a global model that can be used to segment
multiple target regions. To this end, in this thesis, I propose SegViz, a feder-
non-i.i.d datasets with partial annotations. SegViz using FedBN as the aggre-
ABSTRACT
with dice scores of 0.93, 0.83, 0.55, and 0.75 for segmentation of liver, spleen,
pancreas, and kidneys, respectively, significantly (p < 0.05) better (except liver
and spleen) than the dice scores of 0.88, 0.79, 0.46, and 0.64 for the baseline
models. In contrast, the central aggregation model performed significantly
(p < 0.05) worse, with Dice scores of 0.65, 0, 0.55, and 0.68. Furthermore, to
reduce the need for a large number of training samples, I show that coresets
built using group sampling can effectively reduce the total number of training
samples. For the Liver MSD dataset, coresets of 45, 27, and 18 samples achieve
Dice scores of 0.90, 0.89, and 0.87 on the external BTCV dataset, compared with
a Dice score of 0.89 for a model trained on the entire training set.
Similarly, for spleen segmentation, coresets of size 20,10,6 samples yield dice
tire training set. In the last chapter, I conclude and discuss the future scope
Primary Advisors:
Dr. Paul Yi
Center
Reviewer Panel:
Acknowledgments
First and foremost, I’m extremely grateful to my advisors Dr. Paul Yi, and
Dr. Vishwa Parekh for their constant support and guidance, for providing me
with the freedom to work on projects that intrinsically motivate me, and for
grateful to my reviewers, Dr. Cheng Ting Lin, Dr. Chien-Ming Huang, and
Dr. Mathias Unberath for their guidance, and for their time in reviewing my
thesis.
I would also like to thank all the students, faculty, and researchers at the
not be the same without the support from all the friendly and helpful folks at
the Biomedical Engineering Department. I’m also very grateful to Dr. Graeme
Warren, and Dr. Sudip Gupta from the Carey Business School, for giving me
From UM2ii, I would also like to sincerely thank Dr. Peter Kamel, Pranav
Kulkarni, Sean Garin, and Vivian Zhang for all their support on several projects.
I’m very thankful to my friend Neha John for helping me improve my writ-
ing. Finally, a heartfelt thank you to all my friends at Hopkins who helped me
survive grad school with the late-night car drives, impromptu chai sessions,
Dedication
whom I’m greatly indebted for all their love, support, and guidance, to my
Contents
Abstract
Acknowledgments
1 Introduction
2 Federated Learning
3 FL for Distributed Heterogeneous Datasets with Incomplete Annotations
3.1 Introduction
3.2 Results
4 Coresets for Data Efficient Training of Medical Image Segmentation Algorithms
4.1.1 Data
4.1.2 Methods
4.1.3 Results
5 Discussion
Bibliography
List of Tables
3.1 Mean Dice score performance of all the experiments on the in-federation internal validation dataset. The standard deviation values are in parentheses.
3.2 Mean Dice score performance of all the experiments on the out-of-federation BTCV dataset. The standard deviation values are in parentheses.
3.3 P-values comparison using Paired statistical t-test between baseline, central aggregated model, and best performing SegViz (FedBN+FT) models for the internal validation dataset. The significant values (p < 0.05) are highlighted in bold.
3.4 P-values comparison using Paired statistical t-test between baseline, central aggregated model, and best performing SegViz (FedBN+FT) models for the external BTCV dataset. The significant values (p < 0.05) are highlighted in bold.
4.1 Paired t-test comparison between the baseline (all training samples) with the coreset constructions of varying sample sizes. The significant values (p < 0.05) are highlighted in bold.
List of Figures
4.1 Figure depicting 2 main clusters generated for the spleen dataset after T-SNE [2]
4.2 Figure depicting 9 main clusters generated for the liver dataset after T-SNE [2]
4.3 Figure showing the Dice similarity index for the baseline Liver and coreset experiments. Higher is better.
4.4 Figure showing the Dice similarity index for the baseline Spleen and coreset experiments. Higher is better.
Chapter 1
Introduction
population over 60 years of age will be 22% by 2050, nearly double that of
beneficiaries. This gap in the demand for radiologists can delay diagnosis
for the patient and can also decrease the reliability of the radiological
service. Thus, the rising number of radiological images acquired for a
patient’s diagnosis and the need for increased efficiency within the
diagnostic workflow have created the need for automating the process of medical
forms the basis for many downstream applications, including diagnosis, sur-
from a lung CT image, we can characterize the shape and size of the tumor for
diagnosis as well as map the progression of the tumor to assess the response of
consuming and expensive task. In [3], the authors show that on average, a
would take well over 10 years if they were to manually annotate 104 target
gorithms for each use case would potentially result in hundreds of models,
target regions in a medical image. Prior research has also demonstrated that
the clinical diagnostic workflow, leading to quicker turnaround times for pa-
age Analysis
for computer vision can be attributed to the AlexNet paper, the first large-scale
conventional methods in computer vision and win the ImageNet 2012 chal-
and object detection. There have been several versions of neural networks that
tial and semantic details of the region of interest from the input image. One
important network architecture that is commonly used is the U-Net [4]. The
UNet architecture was the first at the time to solve the problem of extracting
spatial and semantic features from an image while maintaining good resolution.
The UNet architecture consisted of three main blocks - The encoder, the
ers followed by pooling layers that extract low-level feature information while
resolution. Thus, the network extracts the semantic information from the
encoder and the localized spatial information from the decoder to generate an
output at the same resolution as the input, which forms the final
segmentation.
As a result, many large-scale datasets have been curated and released for
the segmentation of different organ types and tumor structures [5]. However,
each of these datasets has been curated for a specific use case and therefore,
aggregate several datasets from different sources, each having a different dis-
tribution, which can hinder the performance of the deep learning model during
training.
The challenge of training several individual models separately and the need
Federated learning (FL) has gained importance in recent years for solving
this challenge by collaboratively training one global model from several local
models without data sharing. The global model can aggregate knowledge from
each local model and share the knowledge with each model allowing knowledge
imaging center may focus on related but different tasks - suppose one center is
ing such a situation because these two datasets would contain images with a
in 1.1. Such a situation, where one dataset has only a few organs annotated
while another dataset contains no overlapping annotations with the first one
I also discuss how we can tune this FL setup for personalized high-performance
Another major problem with training deep learning models is the require-
ment for a large amount of training data. In the medical scenario, it is not
fer learning is the most common solution to address the problem of data in-
for transfer learning to generalize well on the new dataset, large amounts of
heterogeneous data are still required. Moreover, transfer learning also suffers
intelligently select a subset of training data, such that there is no major drop
in performance?
need to first identify the inherent distribution of features that are important,
In this thesis, I show that coresets created by group sampling with the
popular k-means clustering algorithm can effectively reduce the total
sample size required for training a deep learning model without a major drop
in performance.
ing to the readers. Then, I describe the types of federated learning sys-
Then, I show that group sampling can be used to effectively capture the
used to train deep learning models using a much smaller cohort without
Chapter 4 and coresets and discuss the limitations and future goals.
Chapter 2
Federated Learning
multi-centric, composed of images from several institutions such that each in-
tation, and object detection. Some of the advantages of this approach are -
the trained model can be run on several devices at scale. For example,
Although this approach can build AI systems with high accuracy, it usually
has several constraints that limit its scalability and deployment in a real-world
scenario -
• It requires all the training data to be stored in one location. This
is not always ideal in a medical scenario where data privacy and data
the data to create the ground truth annotations to train the model. Deep
learning models are very sensitive to the quality of ground truth anno-
tinuously maintain, curate and transfer the large-scale training data which
Edge devices such as Mobile phones, tablets, and other IoT devices are in-
clinical workflows for sharing medical images for further downstream analysis
tasks. These days, radiologists can connect their edge devices to view medical
reports and images over the internet via a secure network. In telemedicine,
hospital or data center. This centralized system cannot achieve high network
scalability due to the growing amounts of health data and IoMT devices in con-
latency. Moreover, relying on a central server or other third parties for data
work rather than being centralized, such a centralized setup for the AI in-
frastructure may no longer be feasible in the future as the amount of data in-
that was first introduced by Google in 2016 based on the principles of “focused
client sites which host their own datasets and a central server that coordinates
and communicates with each client. Each client then trains a model locally on
its dataset and sends the updates to the central server. During each round of
communication, only the weights of the locally trained model are shared with
the central server, not the data. The central server aggregates the weights
from all the clients using a predefined strategy, and the updated weights are
then shared back with the clients. Federated learning has gained traction in
recent years, especially within the healthcare industry, because client sites
like hospitals can preserve privacy by keeping their data in-house.
Such a system is common when every client has a database with the same
feature space but a different sample space [6]. Healthcare clients can take
part in training a shared global model using their datasets, which have
separate sample spaces but share the same feature space. In this system,
each client can train the same core AI model and send updates to the
server. The server subsequently aggregates the local updates from each client
Such a system is common when the datasets have different feature spaces
but the same sample space. It arises in settings where several clients
coordinate the training of an AI model using a central dataset, such that
each client is trained for a different task but using the same features
The last setup for FL is one in which we have different sample spaces as well
Feature values are computed from several feature spaces based on a single
local datasets. In a practical setting, we can see that such a setup can assist
learning setup. A central server is required that will orchestrate the entire
• Model sharing and broadcasting - The server broadcasts the latest global
• Client local training - Using the local data and the latest training scheme,
• Weight aggregation - After local training is complete, the new weights are
shared with the server where they are aggregated using the server-side
aggregation strategy.
• Global model update - The server updates the global model based on the
The Federated Averaging algorithm was proposed by [7] and is one of the
most common and widely used FL aggregation strategies. Based on the success
of SGD optimization in deep learning, the authors talk about adapting SGD for
computing the gradient of the loss over all the client data (called FedSGD).
Their work suggests that during the federated learning process, the shared
B as the local batch size, E as the number of epochs, and η as the learning rate
is given by -
– The model at the server then aggregates all the model parameters
submitted by the clients using $w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^{k}$
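The server-side aggregation rule, a sample-weighted average of the client parameters, can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the thesis implementation: parameters are represented as plain lists of floats keyed by a hypothetical layer name, `n_k` is taken to be the number of training samples on client `k`, and `n` is their sum.

```python
from typing import Dict, List

# Hypothetical container: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


def fedavg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Sample-weighted average of client parameters: w = sum_k (n_k / n) * w_k."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need exactly one sample count per client")
    total = sum(client_sizes)  # n, the total number of samples across clients
    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(n_k / total * w[name][i] for w, n_k in zip(client_weights, client_sizes))
            for i in range(length)
        ]
    return averaged


# Toy example: two clients holding 30 and 10 samples.
clients = [{"conv.weight": [1.0, 2.0]}, {"conv.weight": [3.0, 6.0]}]
sizes = [30, 10]
global_weights = fedavg(clients, sizes)
print(global_weights)
```

The client with more samples pulls the global parameters toward its own values, which is the intended effect of the `n_k / n` weighting.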
• The main advantage of FL is to move away from the need for direct access
to the central training data. It allows the global model to become robust
its resources to train the global model, and as such the overall compute
scales at no extra cost. Over time, the global model is trained on an ever-
increasing dataset while all the computations are handled by the clients.
geneity between the clients is a major challenge for FL. In [8], the authors
cal imaging scenarios, as each client node can have non-i.i.d. data because
the images can be acquired using different protocols and have inherent data
distribution shifts.
clients, communication within the network can be slow when the size of the
models is large. This, coupled with the need for regular updates to the
local model, also puts an additional constraint on the number
communicating the model updates to the central server can reveal in-
Chapter 3
FL for Distributed Heterogeneous Datasets with
Incomplete Annotations
3.1 Introduction
quires high skill, and is an expensive effort, especially for 3D images [9]. One
knowledge from similar partially annotated datasets from multiple groups can
CHAPTER 3. FL FOR DISTRIBUTED HETEROGENEOUS DATASETS
WITH INCOMPLETE ANNOTATIONS
ing [10]. Knowledge aggregation would not only save time but also allow differ-
ent groups to benefit from each other’s annotations without explicitly sharing
segmentation models using partial labels. The works of [15, 16] show how to
create subsets of the partially labeled datasets to create fully labeled subsets.
encoder and task-specific decoders that are trained separately. This approach
requires all the data to be hosted locally and is not realistic in a real-world med-
ical scenario owing to patient privacy concerns and storage and computational
limitations.
The work of [19, 20] has shown promise in developing multi-task segmen-
tation models using multi-scale feature abstraction but the proposed method
requires all the data to be centrally located. We know that such a situation is
not realistic in a medical scenario not only because of privacy and data shar-
many distinct activities should the model be trained for. In [13], the authors
proposed technique was developed and evaluated for different anatomical re-
in [11] was focused on segmentation of the same anatomical structure, and the
The work of [20] introduced a real FL setup for segmentation using par-
tial labels where client nodes were trained on specific sub-networks for their
specific tasks using a shared decoder. However, this method is not scalable
and again, needs knowledge of all the tasks to be trained. It was for the first
time in the work of [14] that knowledge aggregation was introduced using a
anatomical structures on the external test set. For optimal performance, the
Thus, to create an FL setup that can generalize well to non-i.i.d data, and
accurately segment target organs for heterogeneous datasets with partial la-
illustrated in Figure 3.1. The global SegViz model is initialized at the server
with two distinct blocks - a representation block and a task block. The goal
derlying dataset while the goal of the task block is to learn individual tasks
the SegViz model, comprising the representation block and a subset of the task
block representing the client’s tasks. During training, the weights of the rep-
resentation block are always aggregated by the server and redistributed back
to the client nodes. In contrast, the weights of each task block are copied
directly from the client node that holds the corresponding task, thereby
preserving the task-related information for each node in its task block.
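The split between the two blocks can be sketched as follows. This is an illustrative sketch only: the parameter names, the `rep.` prefix used to mark the representation block, and the unweighted mean are assumptions made for the example, and each task head is assumed to live on exactly one client.

```python
from typing import Dict, List

# Hypothetical container: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


def aggregate_segviz(clients: Dict[str, Weights], rep_prefix: str = "rep.") -> Weights:
    """Average representation-block parameters; copy task-block parameters as-is."""
    global_weights: Weights = {}
    first = next(iter(clients.values()))
    rep_names = [name for name in first if name.startswith(rep_prefix)]

    # Representation block: every client holds it, so it is averaged element-wise.
    for name in rep_names:
        stacked = [weights[name] for weights in clients.values()]
        global_weights[name] = [sum(vals) / len(stacked) for vals in zip(*stacked)]

    # Task block: each head is copied from the client that owns the task.
    for weights in clients.values():
        for name, values in weights.items():
            if not name.startswith(rep_prefix):
                global_weights[name] = list(values)
    return global_weights


clients = {
    "liver_site": {"rep.conv1": [0.2, 0.4], "task.liver": [1.0]},
    "spleen_site": {"rep.conv1": [0.6, 0.8], "task.spleen": [2.0]},
}
merged = aggregate_segviz(clients)
print(sorted(merged))
```

The resulting global model carries one shared representation block and one task head per organ, even though no single client ever saw annotations for more than its own task.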
The SegViz framework was evaluated using four publicly available datasets
from the Medical Segmentation Decathlon (MSD) challenge [5]. The Spleen
spleen annotations out of which only the 41 training set volumes were used.
The Liver MSD dataset consists of 201 3D CT volumes with liver and liver
tumor annotations out of which only 131 training set volumes were consid-
which only 282 training volumes were used. Lastly, from the 2019 Kidney Tu-
mor Segmentation Challenge dataset [21], I used 210 3D CT volumes from the
training dataset. The training and internal validation splits were considered
from the overall training data in an 80:20 split. I considered all 30 training
image volumes from the Beyond the Cranial Vault (BTCV) dataset [22] as an
external test set for all our experiments. As the test set only contains labels for
the organs, all disease labels were ignored from each task, and the models are
During pre-processing, all the image volumes were first resized to 256 × 256
× 128, and the intensity values were normalized between 0 and 1. All the volumes
were resampled to a constant spacing of (1.5, 1.5, 2.0). Then, I extracted random
foreground patches of size 128 × 128 × 32 from each volume such that the center
modified version of the multi-head 3D-UNet [23] configuration for all our ex-
the end. The U-Net also contains 2 residual units at the layers and uses Batch
ture with each head consisting of two layers, including the final classification
layer. The SegViz model was implemented using the MONAI [24] framework
and the pre-processing and training using Pytorch. The SegViz model archi-
During training, all weights are initialized using LeCun initialization. The
batch size was set to 2 and the learning rate was initially set to 1e-4 with the
Adam optimizer and CosineAnnealingLR [25] as the scheduler. The Dice Loss
was used as the loss function. The average Dice Score was chosen as the final
evaluation metric. Each model was trained for a total of 500 epochs.
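For readers unfamiliar with the metric, the Dice score and the corresponding loss can be written down directly from their definition. The sketch below is a plain-Python illustration on flattened binary masks, not the MONAI or PyTorch code used for training; the smoothing constant `eps` is an assumption added to avoid division by zero on empty masks.

```python
from typing import Sequence


def dice_score(pred: Sequence[float], target: Sequence[float], eps: float = 1e-8) -> float:
    """Dice = 2 * |A intersect B| / (|A| + |B|) for flattened masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (denominator + eps)


def dice_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Loss that decreases as overlap increases: 1 - Dice."""
    return 1.0 - dice_score(pred, target)


# Toy example: the prediction marks two voxels, one of which is correct.
prediction = [1, 1, 0, 0]
ground_truth = [1, 0, 0, 0]
print(round(dice_score(prediction, ground_truth), 3))
```

Because the same function accepts probabilities as well as hard labels, it can serve both as an evaluation metric on thresholded masks and as a differentiable loss on soft predictions.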
For every task, I trained a single U-Net model based on the SegViz model
architecture on the training dataset after the 80:20 split and data augmentation
a single model trained on the training dataset for the liver, spleen, pancreas,
and kidneys.
datasets together to create a central repository of all the data. This is con-
sidered as a lower bound because naive aggregation of the data in the case
individual model trained for each dataset separately. I set up our centrally
Figure 3.1: Illustration of the proposed SegViz framework: Client nodes up-
date the global meta-model where knowledge aggregation occurs after every 10
iterations of the local model. The weights of the global model are then shared
with the client models allowing both nodes to share knowledge without sharing
data.
where each client node in the setup represents an isolated group having one
of the datasets. The same UNet configuration with the SegViz architecture was
used at each client. Apart from the same pre-processing steps as the baseline
and scaling. While training the local models, after every 10 epochs, the
global model collects the weights of all but the last two convolutional layers
from each client and averages them, following the FedAvg algorithm. The
updated weights are then shared back
as our FedAvg implementation. Making sure that our global model is gener-
data and it does so by not aggregating the batch norm layers during knowledge
transfer.
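The idea of leaving batch norm layers out of the aggregation can be sketched as below. This is a hedged illustration rather than the thesis code: identifying normalization parameters by the substring `bn` in a hypothetical parameter name is an assumption made for the example, as is the unweighted mean.

```python
from typing import Callable, Dict, List

# Hypothetical container: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


def fedbn_average(
    client_weights: List[Weights],
    is_norm_layer: Callable[[str], bool] = lambda name: "bn" in name,
) -> Weights:
    """Average every parameter except normalization ones, which are never shared."""
    shared = [name for name in client_weights[0] if not is_norm_layer(name)]
    return {
        name: [
            sum(vals) / len(client_weights)
            for vals in zip(*(w[name] for w in client_weights))
        ]
        for name in shared
    }


def apply_global(local: Weights, global_update: Weights) -> Weights:
    """Overwrite shared parameters only; local batch norm parameters are kept."""
    merged = dict(local)
    merged.update(global_update)
    return merged


clients = [
    {"conv.weight": [1.0], "bn.weight": [10.0]},
    {"conv.weight": [3.0], "bn.weight": [20.0]},
]
update = fedbn_average(clients)
print(apply_global(clients[0], update))
```

Each client ends up with the shared convolutional weights but retains its own normalization parameters, which is what lets the local models stay adapted to their own intensity distributions.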
Local fine-tuning In [26, 27], the authors demonstrated the need for fine-
models (keeping the representation block frozen) on the local datasets to im-
prove task-specific performance of each task block while keeping the same rep-
resentation block.
3.2 Results
As shown in Figure 3.3, the FedBN model with fine-tuning performs the best
test set. The SegViz framework using FedBN with fine-tuning segmented the
BTCV test set with dice scores of 0.93, 0.83, 0.55, and 0.75 for segmentation
better (except liver and spleen) than the dice scores of 0.88, 0.79, 0.46, and 0.64
for the baseline models. In contrast, the central aggregation model performed
significantly (p < 0.05) worse on the test dataset, with Dice scores of 0.65, 0,
0.55, and 0.68. Note that the model trained on the centrally aggregated data
did not generalize to the spleen label because the overall model became biased
toward the liver and pancreas labels, which contain more samples per label. I
have included the statistical t-test results between the baseline and the best-
Figure 3.3: A comparison of the ground truth segmentation masks with the
masks generated by the baseline and SegViz FedBN+FT models for two differ-
ent images. Note that the individual baselines are computed separately and
are shown here together for illustration only. The Red color signifies the Liver,
Blue for the Spleen, Green for the Kidneys, and Yellow for the Pancreas
Table 3.1: Mean Dice score performance of all the experiments on the in-
federation internal validation dataset. The standard deviation values are in
parentheses.
Table 3.2: Mean Dice score performance of all the experiments on the out-of-
federation BTCV dataset. The standard deviation values are in parentheses.
Table 3.3: P-values comparison using Paired statistical t-test between base-
line, central aggregated model, and best performing SegViz (FedBN+FT) mod-
els for the internal validation dataset. The significant values (p < 0.05) are
highlighted in bold.
Table 3.4: P-values comparison using Paired statistical t-test between base-
line, central aggregated model, and best performing SegViz (FedBN+FT) mod-
els for the external BTCV dataset. The significant values (p < 0.05) are high-
lighted in bold.
Chapter 4
Coresets for Data Efficient Training of Medical Image
Segmentation Algorithms
icians use their smartphones/tablets to view reports and sometimes even mod-
need to efficiently deploy machine learning models on such edge devices within
CHAPTER 4. CORESETS FOR DATA EFFICIENT TRAINING OF
MEDICAL IMAGE SEGMENTATION ALGORITHMS
the sample size of the training data. Coresets essentially represent a smaller
version of the original dataset which is used to train the model such that it
cantly reduce the amount of data needed to train models locally leading further
to lower training times, reduced memory requirements, and reduced power con-
sumption.
Moreover, in recent years, the field of machine learning has seen rapid ad-
k-means or k-medians clustering [29], using naive Bayes and nearest
neighbors [30], Nyström methods [31, 32], and Bayesian inference [33] have been
In this chapter, I show that coresets can be used to efficiently reduce the
sample size required for training 3D segmentation models for two different
Let us define a few terms before formally defining our problem statement -
Query space : For every tuple (P, w, X , f ) where P represents a finite set of
the importance, and f being a distance function that approximates the pseudo-
$$\sum_{p \in P} w(p)\, f(p, x)$$
Rn → R
that
tions of each point is still time-consuming and thus a more practical approach
niques for creating a coreset. Unlike other methods, uniform sampling time
sub-linear time in the input does not provide (1 + ϵ) multiplicative errors com-
every observation. One shortcoming of this technique is that outliers that are
drastically different are often ignored whereas it might be possible that the
outlier is a crucial point in the data and needs to be included manually as part
the sensitivities of the points are also uniform, then sensitivity sampling is also
tered or divided into several groups, whereby the behavior of the observations
group, but they might also differ significantly between groups. In such situa-
each group. Then, sensitivity sampling can be approximated to divide the data
fixed number of points from each group uniformly and at random. Thus for m
points and k groups, I randomly sample m/k points from each group which is
In the next section, I show how coresets can be applied successfully to med-
ical image segmentation tasks. Using uniform sampling, I show how coresets
can be built for the task of pediatric airways segmentation on Magnetic Res-
onance (MR) images, and using group sampling, I show how coresets created
by using the k-means algorithm can be applied to segment the liver and spleen
pling
We demonstrate how group sampling can be used to create coresets for seg-
4.1.1 Data
For the experiment, I used imaging data for both organs from the Medical
Segmentation Decathlon challenge. The Liver MSD and Spleen MSD datasets
have already been described in the SegViz section and the same data finger-
print was used to run this experiment. The Liver MSD dataset contains 201
4.1.2 Methods
For the Liver dataset, I select 5, 3, and 2 samples per cluster,
while for the Spleen dataset, I sample 10, 5, and 3 samples per cluster. For
both datasets, I extracted coresets using the weighted sampling strategy de-
treating each image as a sample point, I partition all the points in space using
clustering. Furthermore, since the overall distribution of all the images in the
training set is known, where every image is sourced using the same acqui-
this strategy, the overall sample space for the Liver data can be divided into 9
clusters and for Spleen into 2 clusters. The results of the clustering algorithm
Using the coresets constructed with the steps above, I trained a 3d-full-res
model using the standard nnUNet pipeline. The only modification was that each fold
Figure 4.1: Figure depicting 2 main clusters generated for the spleen dataset
after T-SNE [2]
Figure 4.2: Figure depicting 9 main clusters generated for the liver dataset
after T-SNE [2]
was trained only for 100 epochs and the rest of the pipeline was kept the same.
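The group sampling step used to build the coresets can be sketched as follows: given a cluster label for every training image, draw a fixed number of images uniformly at random from each cluster. This is a minimal sketch under stated assumptions, not the thesis code: the cluster labels are taken as already computed (for example by k-means), the function name and the fixed random seed are hypothetical, and the toy cluster sizes are made up.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def group_sample(cluster_labels: Sequence[Hashable], per_cluster: int, seed: int = 0) -> List[int]:
    """Return sorted indices of a coreset with `per_cluster` items from each cluster."""
    rng = random.Random(seed)
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        groups[label].append(index)

    coreset: List[int] = []
    for members in groups.values():
        # A cluster smaller than the quota contributes all of its members.
        coreset.extend(rng.sample(members, min(per_cluster, len(members))))
    return sorted(coreset)


# Toy labels for 9 clusters with 10 images each (sizes are illustrative only).
liver_labels = [i % 9 for i in range(90)]
sizes = {k: len(group_sample(liver_labels, k)) for k in (5, 3, 2)}
print(sizes)
```

With 9 clusters and 5, 3, or 2 samples per cluster the coreset contains 45, 27, or 18 images, and with 2 clusters and 10, 5, or 3 samples per cluster it contains 20, 10, or 6 images, matching the coreset sizes reported for the Liver and Spleen datasets.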
The nnUNet pipeline [16], is a seminal paper in the automated image anal-
main idea behind nnUNet is the generation of dynamic templates for auto-
UNet architectures to train the model - The 2D UNet, the 3D UNet with a full
resolution field of view and a two-stage 3D cascaded UNet version where the
first stage of the cascade trains on low-resolution while the next stage trains on
full resolution images. The blueprint parameters that had been determined in
the fingerprint are then used to create the training schedule such as the learn-
pipelines for dataset-specific adaptations like patch size, batch size, the pre-
Given an image volume, nnUNet extracts the central nonzero region first.
Next, nnUNet constructs the fingerprint that is specific to the dataset at hand
and computes several properties such as all image sizes before and after crop-
ping, the imaging modality for the task, the voxel spacing for the image and the
ground truth annotation, and the number of classes that are required for seg-
ing - For CT it computes the mean and standard deviation, the 5-percentile,
After creating the dataset fingerprint, nnUNet creates a set of fixed param-
eters that do not change for any dataset. This includes parameters such as
the architecture templates for the various configurations of the standard UNet
model, the learning rate scheduler, and the various data augmentation tech-
niques.
The principal idea of the nnUNet paper states that the standard UNet con-
any dataset. nnUNet has preconfigured scripts for the standard UNet config-
uration in 2D, 3D, and cascaded 3D setups. All the configs use the standard
All the model configurations favor larger patch sizes over larger batch sizes
tion function. The configurations are initialized with 32 feature maps at the
Standard nnUNet pipelines are trained for 1000 epochs with 250 iterations
per epoch. I train nnUNets only for 100 epochs for all our experiments. The
Adam optimizer with an initial learning rate of 0.01 is used together with Nes-
version of the Dice loss together with the cross-entropy loss was used as the
loss function.
centile clipping. Image resampling and target spacing modifications are done
using the standard nnUNet pipeline, which defaults to third-order spline
interpolation for resampling and the median spacing for each axis depending on
4.1.3 Results
For the test set, I used all 30 training images from the BTCV dataset. I
computed the Dice similarity index between the ground truth annotations from
the BTCV dataset and the predictions of each nnUNet model for each coreset con-
struction.
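The two overlap metrics reported in this section can be computed from a pair of binary masks as sketched below. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
import numpy as np

def dice_and_jaccard(pred, truth):
    """Return (Dice, Jaccard) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    if total == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0, 1.0
    union = total - intersection
    return 2.0 * intersection / total, intersection / union

# Toy example: two 2x2 squares that overlap in one column.
truth = np.zeros((4, 4), dtype=bool)
truth[0:2, 0:2] = True
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 1:3] = True

dice, jaccard = dice_and_jaccard(pred, truth)
```

For a single pair of masks the two metrics are related by J = D / (2 - D). Averages over a test set do not have to satisfy this identity exactly, which is why both values are reported.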
The baseline for both datasets is an nnUNet model trained on all
images from the training set. For the Liver dataset, the baseline evaluated on the
BTCV dataset gives a Dice index of 0.897 with a Jaccard coefficient of 0.822,
while for the Spleen dataset, the Dice index is 0.865 with a Jaccard coefficient
of 0.79.
For the Spleen dataset, the coreset with 20 samples yields a Dice index of
0.862 with a Jaccard coefficient of 0.787. For the coreset with 10
samples, the Dice index is 0.76 with a Jaccard coefficient of 0.692, while for the
Figure 4.3: Figure showing the Dice similarity index for the baseline Liver
and coreset experiments. Higher is better.
Figure 4.4: Figure showing the Dice similarity index for the baseline Spleen
and coreset experiments. Higher is better.
Table 4.1: Paired t-test comparison between the baseline (all training samples)
and the coreset constructions of varying sample sizes. The significant values
(p < 0.05) are highlighted in bold.
6-sample coreset, the Dice index is 0.741 with a Jaccard coefficient of 0.668.
For the Liver dataset, the coreset with 45 samples yields a Dice index of
0.903 with a Jaccard coefficient of 0.832. For the coreset with 27
samples, the Dice index is 0.898 with a Jaccard coefficient of 0.824, and for the
coreset with 18 samples, the Dice index is 0.872 with a Jaccard coefficient of
0.795.
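The paired t-test used to compare the baseline with each coreset model (Table 4.1) operates on per-case score differences. A minimal sketch of the test statistic is given below using only the Python standard library. The per-case Dice values are made up for illustration and are not results from this thesis. In practice the p-value would be obtained from the t distribution, for example with `scipy.stats.ttest_rel`.

```python
import math
import statistics

def paired_t_statistic(baseline, coreset):
    """Return (t, degrees_of_freedom) for paired per-case scores."""
    if len(baseline) != len(coreset):
        raise ValueError("paired samples must have the same length")
    diffs = [b - c for b, c in zip(baseline, coreset)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Hypothetical per-case Dice scores on the same five test images.
baseline_scores = [0.90, 0.88, 0.91, 0.87, 0.89]
coreset_scores = [0.89, 0.88, 0.90, 0.85, 0.89]

t_stat, dof = paired_t_statistic(baseline_scores, coreset_scores)

# Two-sided critical value of the t distribution for alpha = 0.05, dof = 4.
T_CRITICAL_DOF4 = 2.776
is_significant = abs(t_stat) > T_CRITICAL_DOF4
```

Pairing is appropriate here because every model is evaluated on the same test images, so the per-image difficulty cancels out in the differences.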
A comparison of the Dice scores from the baseline and coreset experiments
is shown in Figures 4.3 and 4.4. The paired t-test results are shown in Table 4.1
and suggest that the differences between the baseline and coreset liver models
are not significant while the baseline vs
Chapter 5
Discussion
tions has many potential benefits. For example, SegViz can potentially reduce
labeling time by 1/η where η is the number of distinct labels in the distributed
data sets by allowing the transfer of knowledge between each client. This
would not only save time but also allow different research groups to potentially
ods such as nnUNet. Our global model and local models are only 18 MB in
trained models easy to deploy in a real-world scenario where the model weights
can be communicated using edge devices without the need for high computing
in its implementations, such as using a learning rate decay and random affine
It is important to note that SegViz using FedAvg can create a global
when using FedAvg as the aggregating strategy, we can have a single global
model that has knowledge of all the tasks being trained using the participating
clients. This is not true for the FedBN model because, during training with the
from the knowledge aggregation rounds. This provides the SegViz maintainers
with the choice of deploying FedAvg if a global model is desired or FedBN for
built on working with CT images only, and I have yet to extend it to other modali-
ties. Also, the current framework focuses on segmenting organs that are
present in the same field of view and not organs in different fields of view
(for example, brain and liver). For the coresets framework, I only employed
group sampling for the creation of the coresets, which may not generalize well
to all datasets. Also, the testing dataset used contains only 30 samples from
In the future, I would like to extend our experiments using a modality that
is less stable than CT such as MRI. I would like to scale our experiment with
even more nodes where some nodes can have a different modality as well. I
would also like to investigate the real-world performance of our FL setup where
client nodes can join and drop contact with the server at any point in time while
tient privacy.
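The distinction drawn above between FedAvg and FedBN as aggregation strategies can be sketched as follows. This is a simplified illustration and not the SegViz implementation: model states are plain dictionaries of NumPy arrays, the helper names are hypothetical, and batch-normalization parameters are identified by a `bn` component in the parameter name, which is an assumed naming convention.

```python
import numpy as np

def is_batch_norm(name):
    """Assumed convention: BN parameters have 'bn' as a dotted name component."""
    return "bn" in name.split(".")

def aggregate(client_states, client_sizes, skip_bn=False):
    """Sample-size-weighted average of client parameters.

    skip_bn=False behaves like FedAvg (every parameter is averaged);
    skip_bn=True behaves like FedBN (BN parameters are not shared).
    """
    total = float(sum(client_sizes))
    keys = [k for k in client_states[0] if not (skip_bn and is_batch_norm(k))]
    return {
        k: sum(state[k] * (n / total) for state, n in zip(client_states, client_sizes))
        for k in keys
    }

def apply_shared(local_state, shared_state):
    """Overwrite only the shared parameters; anything else stays local."""
    updated = dict(local_state)
    updated.update(shared_state)
    return updated

# Two toy clients holding 1 and 3 training samples respectively.
client_a = {"conv.weight": np.array([1.0, 3.0]), "bn.weight": np.array([1.0])}
client_b = {"conv.weight": np.array([3.0, 5.0]), "bn.weight": np.array([2.0])}
sizes = [1, 3]

fedavg_global = aggregate([client_a, client_b], sizes)               # one global model
fedbn_shared = aggregate([client_a, client_b], sizes, skip_bn=True)  # BN left out
client_a_after_fedbn = apply_shared(client_a, fedbn_shared)
```

With FedAvg every client ends up with the same full set of weights, which is what makes a single global model possible. With FedBN each client keeps its own normalization statistics, so the resulting models differ per client.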
With our coreset experiments, I show that group sampling using k-means
coreset S from the entire superset of training data such that a model trained
on the coreset can still converge successfully. I compare the overall dice in-
dex on an external test bench of several trained models using different coreset
utilizing only a fraction of the total training cases. While I acknowledge that
our liver coreset results are not statistically significant, I would like to per-
determine if our results are not significant at all, or not significant due to the
In the future, I will experiment with several different strategies for building
All our code for building SegViz and all other models in comparison is avail-
trained models for each task in the same repository. The data used to repro-
when the datasets in use are heterogeneous and contain partial labels. The
results from the coresets experiments show that we can intelligently reduce
the number of training samples required to train deep learning models suc-
cessfully, thus saving time, space, and compute costs to train large-scale deep
learning models.
Bibliography
Available: https://doi.org/10.5281/zenodo.2526396
[2] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal
pp. 234–241.
53
BIBLIOGRAPHY
Z. Lin, O. Dobre, and W.-J. Hwang, “Federated learning for smart health-
care: A survey,” ACM Computing Surveys (CSUR), vol. 55, no. 3, pp. 1–37,
2022.
arXiv:2003.00295, 2020.
[8] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of
2020.
54
BIBLIOGRAPHY
Fuh, P.-T. Chen, K.-L. Liu et al., “Multi-task federated learning for het-
joint anatomical priors,” Medical Image Analysis, vol. 81, p. 102556, 2022.
[14] C. Shen, P. Wang, D. Yang, D. Xu, M. Oda, P.-T. Chen, K.-L. Liu, W.-C. Liao,
C.-S. Fuh, K. Mori et al., “Joint multi organ and tumor segmentation from
pp. 58–67.
[15] Q. Yu, Y. Shi, J. Sun, Y. Gao, J. Zhu, and Y. Dai, “Crossbar-net: A novel
55
BIBLIOGRAPHY
ages,” IEEE transactions on image processing, vol. 28, no. 8, pp. 4060–
4074, 2019.
2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Pro-
[18] S. Chen, K. Ma, and Y. Zheng, “Med3d: Transfer learning for 3d medical
[19] J. Zhang, Y. Xie, Y. Xia, and C. Shen, “Dodnet: Learning to segment multi-
56
BIBLIOGRAPHY
challenge data: 300 kidney tumor cases with clinical context, ct semantic
2019.
[25] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm
57
BIBLIOGRAPHY
arXiv:1909.12488, 2019.
arXiv:2001.01523, 2020.
[29] S. Har-Peled and S. Mazumdar, “On coresets for k-means and k-median
[30] K. Wei, R. Iyer, and J. Bilmes, “Submodularity in data subset selection and
vival,” Academy of Management journal, vol. 47, no. 4, pp. 501–522, 2004.
58
BIBLIOGRAPHY
[32] C. Musco and C. Musco, “Recursive sampling for the nystrom method,”
[34] S. Har-Peled and S. Mazumdar, “Fast algorithms for computing the small-
tive experience replay compression using coresets for lifelong deep rein-
2023.
59
Adway Kanhere
akanher1@jhu.edu
linkedin.com/in/adwaykanhere
github.com/adwaykanhere
EDUCATION
2023 Master of Science in Engineering Degree, Biomedical Engineering
The Johns Hopkins University and School of Medicine, USA.
Specialization in Biomedical Data Science - Thesis track
2020 Bachelor of Engineering Degree, Medical Electronics
M.S Ramaiah Institute of Technology, India
PROFESSIONAL EXPERIENCE
09/2022 – present Bioinformatics Software Engineer-I
University of Maryland Medical Intelligent Imaging Center (UM2ii), USA
• Developed an AI and deep learning algorithm for federated-learning (FL) in Radiology - SegViz, for medical
that can connect directly to a hospital PACS and is loaded with automatic segmentation and active
learning capabilities.
• Defined and maintained software pipelines for ingesting and automating data collection, pre-processing,
and AI model development workflows for several clinically translatable medical image segmentation
tasks.
• Developed a software pipeline for ingesting, processing, and developing an AI model on pediatric
management to meet the lab's high throughput and scalable computing needs.
• Assisted in writing technical papers and patents to communicate key project results in high-impact
CT images.
• Developed a deep learning based pipeline for progression-free survival, overall survival prediction and
production
03/2022 – 05/2023 Graduate Teaching Assistant
Johns Hopkins University - Carey Business School, USA
Courses proctored
• Big Data Machine Learning Spring 2022
native communications platform, and interfacing with the native web application for high latency wireless
transfer of sensor data at long ranges.
• Raised $8,000 in pre-seed funding and won the first prize at the Johns Hopkins New Venture Challenge
2022 (HOPSTART).
12/2021 – 05/2022 Graduate Research Assistant
Johns Hopkins University & School of Medicine
• Programmed a deep learning-based pipeline for transforming smartphone-based photos of chest X-Ray
images to digital X-Ray images using a Pix2Pix Generative Adversarial Network (GAN).
• Programmed a deep learning-based pipeline for localizing the region of the thorax in chest X-Ray images
• Remodeled the existing codebase to standardize and speed up the loading and pre-processing of audio data.
• Created a pipeline for Bayesian Hyperparameter tuning of the existing CNN models and improved the performance of the baseline CNN
from 64% to 88% accuracy. Extended the same pipeline to improve the performance of all CNN models implemented in the previous
study.
Detecting Abnormalities in chest X-Ray images with Convolutional Neural Networks
• Implemented a DENSENET-181 model to identify 14 abnormalities on Chest X-ray images with transfer learning
• Deployed model to an online application with an interactive interface for real-time analysis.
CERTIFICATES
Deep learning Specialization
AI for Medical Diagnosis
AWS Machine learning Foundations
Introduction to Machine Learning in Production
Successful Negotiation: Essential Strategies and Skills