FEDERATED LEARNING BASED MEDICAL IMAGE

SEGMENTATION FOR HETEROGENEOUS DATA SETS

WITH PARTIAL ANNOTATIONS

by

Adway Uday Kanhere

A thesis submitted to The Johns Hopkins University in conformity with the

requirements for the degree of Master of Science in Engineering.

Baltimore, Maryland

May, 2023

© 2023 Adway Uday Kanhere

All rights reserved


Abstract

Segmentation is one of the primary tasks in deep learning for medical imag-

ing, owing to its multiple downstream clinical applications in diagnosis, prog-

nosis, and treatment planning. However, generating manual annotations for

medical images is time-consuming, requires high skill, and is an expensive

effort, especially for 3D images. In recent years, Artificial Intelligence algo-

rithms using deep learning have created highly accurate segmentation models

for the delineation of several organs of the body on various imaging modalities.

However, there are several limitations to training such models such as requir-

ing a large amount of well-annotated data and high-quality images and the

need for creating a segmentation model for each use case. Federated learning

has the potential to solve both of these problems by aggregating the knowl-

edge from several local models into a global model that can be used to segment

multiple target regions. To this end, in this thesis, I propose SegViz, a feder-

ated learning-based framework to train a segmentation model from distributed

non-i.i.d datasets with partial annotations. SegViz using FedBN as the aggre-

gation strategy demonstrated excellent performance on the external BTCV set

with dice scores of 0.93, 0.83, 0.55, and 0.75 for segmentation of liver, spleen,

pancreas, and kidneys, respectively, significantly (p < 0.05) better (except liver

and spleen) than the dice scores of 0.88, 0.79, 0.46, and 0.64 for the baseline

models. In contrast, the central aggregation model performed significantly

(p < 0.05) worse, with dice scores of 0.65, 0, 0.55, and 0.68. Furthermore, in order

to reduce the number of samples needed during training, I show that coresets

built using group sampling can effectively reduce the total number of training

samples. For liver segmentation on the MSD dataset, coresets of 45, 27, and 18

samples yield dice scores of 0.90, 0.89, and 0.87 on the external BTCV dataset,

compared to a dice score of 0.89 for a model trained on the entire training set.

Similarly, for spleen segmentation, coresets of 20, 10, and 6 samples yield dice

scores of 0.86, 0.76, and 0.69, respectively, compared to a dice score of 0.86 on

the entire training set. In the last chapter, I conclude and discuss the future scope

and limitations of my work.

Primary Advisors:

Dr. Paul Yi

Director, University of Maryland Medical Intelligent Imaging Center

Assistant Professor of Diagnostic Radiology and Nuclear Medicine

University of Maryland School of Medicine

Adjunct Assistant Research Scientist of the Malone Center

for Engineering in Healthcare

Johns Hopkins University and School of Medicine

Dr. Vishwa Parekh

Technical Director, University of Maryland Medical Intelligent Imaging

Center

Assistant Professor of Diagnostic Radiology and Nuclear Medicine

University of Maryland School of Medicine

Reviewer Panel:

Dr. Cheng Ting Lin

Associate Professor of Radiology and Radiological Science

Department of Diagnostic Radiology

Johns Hopkins University and School of Medicine

Dr. Chien-Ming Huang

John C. Malone Assistant Professor of Computer Science

Malone Center for Engineering in Healthcare

Johns Hopkins University

Dr. Mathias Unberath

Assistant Professor of Computer Science

Laboratory for Computational Sensing and Robotics

Malone Center for Engineering in Healthcare

Johns Hopkins University and School of Medicine

Acknowledgments

First and foremost, I’m extremely grateful to my advisors Dr. Paul Yi, and

Dr. Vishwa Parekh for their constant support and guidance, for providing me

with the freedom to work on projects that intrinsically motivate me, and for

fostering a collaborative and friendly environment at UM2ii. I’m also very

grateful to my reviewers, Dr. Cheng Ting Lin, Dr. Chien-Ming Huang, and

Dr. Mathias Unberath for their guidance, and for their time in reviewing my

thesis.

I would also like to thank all the students, faculty, and researchers at the

Johns Hopkins University and School of Medicine. My time at Hopkins would

not be the same without the support from all the friendly and helpful folks at

the Biomedical Engineering Department. I’m also very grateful to Dr. Graeme

Warren, and Dr. Sudip Gupta from the Carey Business School, for giving me

several opportunities to teach and develop my communication skills.

From UM2ii, I would also like to sincerely thank Dr. Peter Kamel, Pranav

Kulkarni, Sean Garin, and Vivian Zhang for all their support on several projects.

I’m very thankful to my friend Neha John for helping me improve my writ-

ing. Finally, a heartfelt thank you to all my friends at Hopkins who helped me

survive grad school with the late-night car drives, impromptu chai sessions,

and travel memories.

Dedication

This thesis is dedicated to my parents, Uday Kanhere and Uma Kanhere, to

whom I’m greatly indebted for all their love, support, and guidance, to my

Sadhguru - Sri Ganapati Sachchidananda Swamiji for his spiritual guidance

and blessings in my life, and to my late grandfather Parshuram Kanhere who

would be really proud of me today. Pranams!

Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
1.1 Medical Image Segmentation
1.2 Artificial Intelligence For Medical Image Analysis
1.3 Thesis Outline

2 Federated Learning
2.1 Central Learning
2.2 Federated Learning
2.2.1 Types of FL Systems
2.2.1.1 Horizontal FL Systems
2.2.1.2 Vertical FL System
2.2.1.3 Federated Transfer Learning
2.2.2 General FL Algorithm
2.2.2.1 Federated Averaging (FedAvg) Algorithm

3 FL for Distributed Heterogeneous Datasets with Incomplete Annotations
3.1 Introduction
3.1.1 Model Architecture
3.1.1.1 Individual Baseline Implementation
3.1.1.2 Central Aggregation Implementation
3.1.1.3 SegViz Implementation
3.2 Results

4 Coresets For Data Efficient Training of Medical Image Segmentation Algorithms
4.1 Coreset Creation Using Group Sampling
4.1.1 Data
4.1.2 Methods
4.1.2.1 Coreset Creation
4.1.2.2 Model Training
4.1.2.3 The nnUNet Training Pipeline
4.1.2.4 Dataset Fingerprint Extraction
4.1.2.5 Blueprint Parameters
4.1.2.6 Architecture Templates
4.1.2.7 Training Schedule
4.1.2.8 Inferred Parameters
4.1.3 Results

5 Discussion
5.1 Data and Code Availability

Bibliography

List of Tables

3.1 Mean Dice score performance of all the experiments on the in-federation internal validation dataset. The standard deviation values are in parentheses.

3.2 Mean Dice score performance of all the experiments on the out-of-federation BTCV dataset. The standard deviation values are in parentheses.

3.3 P-values comparison using Paired statistical t-test between baseline, central aggregated model, and best performing SegViz (FedBN+FT) models for the internal validation dataset. The significant values (p < 0.05) are highlighted in bold.

3.4 P-values comparison using Paired statistical t-test between baseline, central aggregated model, and best performing SegViz (FedBN+FT) models for the external BTCV dataset. The significant values (p < 0.05) are highlighted in bold.

4.1 Paired t-test comparison between the baseline (all training samples) with the coreset constructions of varying sample sizes. The significant values (p < 0.05) are highlighted in bold.

List of Figures

1.1 Illustration of an example setup with nodes containing datasets with a similar field of view but different and incomplete annotations.

3.1 Illustration of the proposed SegViz framework: Client nodes update the global meta-model where knowledge aggregation occurs after every 10 iterations of the local model. The weights of the global model are then shared with the client models allowing both nodes to share knowledge without sharing data.

3.2 Illustration of the modified 3D-UNet configuration: The representation block refers to all the layers except the last two convolutional layers, while the task block refers to the final two layers including the classifier. The classifier (in green) can be fine-tuned for segmenting either of the four organs. Image generated using [1]

3.3 A comparison of the ground truth segmentation masks with the masks generated by the baseline and SegViz FedBN+FT models for two different images. Note that the individual baselines are computed separately and are shown here together for illustration only. The Red color signifies the Liver, Blue for the Spleen, Green for the Kidneys, and Yellow for the Pancreas.

4.1 Figure depicting 2 main clusters generated for the spleen dataset after T-SNE [2]

4.2 Figure depicting 9 main clusters generated for the liver dataset after T-SNE [2]

4.3 Figure showing the Dice similarity index for the baseline Liver and coreset experiments. Higher is better.

4.4 Figure showing the Dice similarity index for the baseline Spleen and coreset experiments. Higher is better.

Chapter 1

Introduction

1.1 Medical Image Segmentation

According to the World Health Organization, the proportion of the world’s

population over 60 years of age will be 22% by 2050, nearly double that of

2015. While the number of active radiologists in the workforce is increasing

steadily, the Radiological Society of North America (RSNA) highlights a shortage

of radiologists in the United States relative to the growing number of Medicare

beneficiaries. This gap in the demand for radiologists can cause delayed diag-

nosis of the disease for the patient and can also decrease the reliability of the

radiological service. Thus, the rising number of radiological images that are ac-

quired for a patient’s diagnosis and the need for increased efficiency within the

diagnostic workflow has created the need for automating the process of medical

image analysis in radiology.

Medical image analysis in radiology refers to the process of using advanced

computer algorithms to extract information from several radiological images

such as digital X-Ray images, Magnetic Resonance (MR) images, Computed

Tomography (CT) images, and Positron Emission Tomography (PET) images.

Image segmentation refers to the process of delineating the region of inter-

est in an image, and is a fundamental task in medical image analysis as it

forms the basis for many downstream applications, including diagnosis, sur-

vival prognosis, treatment response planning, image reconstruction, and re-

sponse assessment. For example, by segmenting the region of a lung tumor

from a lung CT image, we can characterize the shape and size of the tumor for

diagnosis as well as map the progression of the tumor to assess the response of

the patient to targeted radiotherapy.

Manual delineation of the target organ of interest by a radiologist is a time-

consuming and expensive task. In [3], the authors show that on average, a

radiologist requires 10 minutes to manually annotate a particular region, and

would take well over 10 years if they were to manually annotate 104 target

regions on 1024 patient images. Consequently, developing and deploying al-

gorithms for each use case would potentially result in hundreds of models,

thereby limiting their clinical utility – imagine deploying a different algorithm

for every type of cancer, injury, and other diseases.

Hence there is a strong need for automating the process of delineation of

target regions in a medical image. Prior research has also demonstrated that

the utilization of automated medical image analysis can supplement and enhance

the clinical diagnostic workflow, leading to quicker turnaround times for pa-

tients and reduced clinician workload.

1.2 Artificial Intelligence For Medical Image Analysis

Artificial intelligence, particularly deep learning, has played a pivotal role

in the advancement of medical image analysis. The success of deep learning

for computer vision can be traced to AlexNet, the first large-scale

convolutional neural network (CNN) to surpass the previous state-of-the-art

conventional methods in computer vision and win the ImageNet 2012 challenge.

CNNs have since dominated the field in classification, segmentation, and object

detection. Several convolutional architectures have followed, with an increasing

number of layers and modifications such as residual connections and skip connections.

For a segmentation model to be effective, it must be capable of learning spa-

tial and semantic details of the region of interest from the input image. One

important network architecture that is commonly used is the U-Net [4]. The

U-Net architecture was the first at the time to solve the problem of extracting

spatial and semantic features from an image while maintaining good resolution.

The U-Net architecture consists of three main blocks: the encoder, the

decoder, and skip connections between the encoder and decoder.

Similar to traditional CNNs, the encoder block comprises convolutional lay-

ers followed by pooling layers that extract low-level feature information while

decreasing the resolution. The semantic information available at the bottle-

neck is then gradually rebuilt in the decoder network to provide high-accuracy

segmentation maps at the image-level resolution. Leveraging skip connections,

the U-Net architecture concatenates encoder feature maps of progressively higher

resolution with the upsampled semantic maps in the decoder. Thus the network

is able to combine semantic information with localized spatial information and

generate a final segmentation at the same resolution as the input.
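The encoder, bottleneck, decoder, and skip-connection structure described above can be made concrete by tracking tensor shapes only. The following is a minimal sketch, not the architecture used in this thesis: the function name `unet_shape_trace`, the base channel count of 16, the depth of 3, and the channel-doubling rule are all illustrative assumptions, and no convolution is actually computed.

```python
def unet_shape_trace(size, base_channels=16, depth=3):
    """Trace (name, channels, spatial size) through a U-Net-style network.

    Shape bookkeeping only; all hyperparameters are illustrative assumptions.
    """
    if size % (2 ** depth) != 0:
        raise ValueError("size must be divisible by 2**depth")
    trace, skips = [], []
    channels = base_channels
    # Encoder: each level keeps its feature map for the skip connection,
    # then pooling halves the resolution and the next level doubles the channels.
    for level in range(depth):
        trace.append((f"enc{level}", channels, size))
        skips.append((channels, size))
        size //= 2
        channels *= 2
    trace.append(("bottleneck", channels, size))
    # Decoder: upsample, concatenate the encoder map of the same resolution,
    # then reduce back to that level's channel count.
    for level in reversed(range(depth)):
        skip_channels, skip_size = skips[level]
        size *= 2
        assert size == skip_size  # skip and upsampled maps must align spatially
        up_channels = channels // 2
        trace.append((f"dec{level}_concat", up_channels + skip_channels, size))
        channels = skip_channels
        trace.append((f"dec{level}", channels, size))
    return trace


for name, channels, size in unet_shape_trace(64):
    print(f"{name:12s} channels={channels:4d} size={size}")
```

The trace shows that the output returns to the input resolution, and that every concatenation step temporarily doubles the channel count because an encoder map is stacked onto the upsampled decoder map.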

In order to train deep learning models with high performance, we need a

large amount of high-quality annotated data. Moreover, this problem becomes

more difficult when creating annotations for several organs/regions of interest.

As a result, many large-scale datasets have been curated and released for

the segmentation of different organ types and tumor structures [5]. However,

each of these datasets has been curated for a specific use case and therefore

focuses on segmenting only a particular organ or tumor subset in the body.

Moreover, when we want to segment multiple objects in the image, we need to

aggregate several datasets from different sources, each having a different dis-

tribution, which can hinder the performance of the deep learning model during

training.

Figure 1.1: Illustration of an example setup with nodes containing datasets

with a similar field of view but different and incomplete annotations.

The challenge of training several individual models separately and the need

for large-scale multi-organ datasets can be addressed by training multi-task

segmentation models from distributed datasets using collaborative learning.

Federated learning (FL) has gained importance in recent years for solving

this challenge by collaboratively training one global model from several local

models without data sharing. The global model can aggregate knowledge from

each local model and share the knowledge with each model allowing knowledge

sharing along with personalized training. In recent years, federated learning

has been shown to be an effective strategy to train models in a distributed

setting by focusing on the following main principles: dealing with data

imbalance, data heterogeneity across nodes, variability in communication, and

balancing communication between a large number of nodes sharing their weights.

However, using FL to aggregate knowledge from datasets curated at different

imaging centers is still challenging and non-trivial, as each imaging center may

focus on related but different tasks. Suppose one center is training a liver

segmentation model while another center is training a spleen segmentation

model from CT scans. FL has been shown to be suboptimal in such a situation

because these two datasets would contain images with a similar field of view

but different and incomplete annotations, as illustrated in Figure 1.1. Such a

situation, where one dataset has only a few organs annotated while another

dataset contains no overlapping annotations with the first one, is very common

in medical imaging.

In this thesis, I discuss creating an FL setup that gives excellent performance

on distributed, heterogeneous datasets that contain partial annotations.

I also discuss how we can tune this FL setup for personalized high-performance

and make it robust to non-i.i.d data.

Another major problem with training deep learning models is the require-

ment for a large amount of training data. In the medical scenario, it is not

possible to obtain a large number of manual annotations, as discussed. Transfer

learning is the most common solution to the problem of inadequate data for

training deep learning models. Nevertheless, for transfer learning to generalize

well on the new dataset, large amounts of heterogeneous data are still required.

Moreover, transfer learning also suffers from data distribution shifts and

catastrophic forgetting, leading to suboptimal models. This raises a crucial,

non-trivial question: can we intelligently select a subset of training data such

that there is no major drop in performance?

Coresets are smaller representations of a larger dataset that capture the

important inherent characteristics of the main dataset. Creating coresets that

capture this inherent feature distribution is non-trivial, as we need to first

identify the distribution of the important features and then select only the

samples that are relevant.

In this thesis, I show that creating coresets with group sampling based on

the popular k-means clustering algorithm can effectively reduce the total

sample size required for training a deep learning model without a major drop

in performance.
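To illustrate the idea of group sampling with k-means, the sketch below clusters feature vectors and then draws the same fraction of samples from every cluster, so that the coreset roughly preserves the cluster proportions of the full dataset. It is a simplified illustration under stated assumptions rather than the pipeline used in Chapter 4: the helper names `kmeans` and `group_sample_coreset`, the 2-D toy features, and the proportional per-cluster sampling rule are all hypothetical.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on tuples of floats; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def group_sample_coreset(points, k, fraction, seed=0):
    """Pick `fraction` of the samples from every k-means cluster (assumed rule)."""
    rng = random.Random(seed)
    labels = kmeans(points, k, seed=seed)
    coreset = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            n_pick = max(1, round(fraction * len(members)))
            coreset.extend(rng.sample(members, n_pick))
    return sorted(coreset), labels


# Toy features: two well-separated groups of 10 samples each.
features = [(i * 0.01, 0.0) for i in range(10)] + [(10.0 + i * 0.01, 10.0) for i in range(10)]
coreset_idx, cluster_labels = group_sample_coreset(features, k=2, fraction=0.3)
print(coreset_idx)
```

In this toy setting the coreset holds three indices from each group, whereas purely random sampling of six items could by chance under-represent one group.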

1.3 Thesis Outline

• In Chapter 2, I describe the concept of central learning & federated learn-

ing to the readers. Then, I describe the types of federated learning sys-

tems and the popular Federated Averaging (FedAvg) strategy.

• In Chapter 3, I introduce the concept of SegViz: The federated learning

framework for distributed heterogeneous datasets with partial annota-

tions. Then, I describe the model architecture, methodology, and imple-

mentation of the baseline, centrally aggregated, and SegViz-based mod-

els. Finally, I discuss the performance of using SegViz compared to the

other models on an external out-of-federation test bench data set.

• In Chapter 4, I show how to apply coresets for annotation-efficient training

of deep learning models and their applicability to real-world scenarios.

Then, I show that group sampling can be used to effectively capture the

native data distribution in a large cohort of medical images and can be

used to train deep learning models using a much smaller cohort without

a significant drop in performance.

• Lastly, in Chapter 5, I summarize and interpret the results of Chapter 3

and Chapter 4, and discuss the limitations and future goals.

Chapter 2

Federated Learning

2.1 Central Learning

Centralized learning refers to the process of training a deep learning model

using a centralized database of images for a specific application. The cen-

tralized database can be curated such that it can be mono-modal, containing

only images belonging to one particular modality such as CT images, or multi-

modal, containing images of different modalities. The database can also be

multi-centric, composed of images from several institutions such that each in-

stitution acquires data using a custom protocol and scanner configuration.

Centralized training is the most common approach used in training deep

learning models for specific applications such as image classification, segmen-

tation, and object detection. Some of the advantages of this approach are -

• Centralized learning can be configured to build highly accurate and spe-

cialized models for data from a particular distribution.

• The trained model can be deployed as a Model-as-a-Service (MaaS), where

it can be run on several devices at scale, for example through cloud

services such as Amazon Web Services (AWS).

Although this approach can build AI systems with high accuracy, it usually

has several constraints that limit its scalability and deployment in a real-world

scenario -

• It requires all the training data to be stored in one location. This

is not always ideal in a medical scenario where data privacy and data

sharing laws are very strict.

• Once we have the data in one location, we need to manually annotate

the data to create the ground truth annotations to train the model. Deep

learning models are very sensitive to the quality of ground truth anno-

tations and their performance can be degraded significantly if there are

lapses in correctly annotating the ground truth.

• A large amount of computing and storage resources are required to con-

tinuously maintain, curate and transfer the large-scale training data which

can lead to increased costs.

2.2 Federated Learning

Edge devices such as mobile phones, tablets, and other IoT devices are

increasingly being used in hospitals to monitor the health of patients and in

clinical workflows for sharing medical images for further downstream analysis

tasks. These days, radiologists can connect their edge devices to view medical

reports and images over the internet via a secure network. In telemedicine,

AI algorithms are often deployed on edge systems to assist clinicians in the

diagnosis and remote treatment planning.

Traditionally, such edge applications in medical settings, called the Internet of

Medical Things (IoMT), rely on a central repository of medical data hosted at a

hospital or data center. This centralized system cannot achieve high network

scalability due to the growing amounts of health data and IoMT devices in con-

temporary healthcare networks, which results in suboptimal communication

latency. Moreover, relying on a central server or other third parties for data

learning creates serious privacy concerns.

Furthermore, as healthcare data are distributed within a large IoMT net-

work rather than being centralized, such a centralized setup for the AI in-

frastructure may no longer be feasible in the future as the amount of data in-

creases. In order to provide scalable and privacy-preserving intelligent health-

care applications at the network edge, it is vital to move toward distributed

AI techniques for training as well as deployment of AI-assisted tools. In this

regard, Federated Learning (FL) has been shown to be a promising solution for

the challenge of building distributed and privacy-preserving AI-based tools for

medical image analytics. Federated learning is a machine learning framework

that was first introduced by Google in 2016 based on the principles of “focused

collection” proposed by the 2012 White House report on consumer privacy.

In a typical Federated learning setup, a “federation” is created using the

client sites which host their own datasets and a central server that coordinates

and communicates with each client. Each client then trains a model locally on

its dataset and sends the updates to the central server. During the round of

communication, only the weights of the locally trained model are shared with

the central server and not the data. The central server aggregates the weights

from all the clients using a predefined strategy and these updated weights are

then shared back with the clients. Federated Learning has gained traction in

recent years, especially within the healthcare industry, due to its high capacity

for maintaining privacy with client sites like hospitals by keeping their data

in-house.

2.2.1 Types of FL Systems

2.2.1.1 Horizontal FL Systems

Such a system is common when every client has a database with the same

feature space but a different sample space [6]. Healthcare clients can take

part in training a shared global model using their own datasets, which have

separate sample spaces but share the same feature space. In this system, each

client trains the same core AI model and sends its updates to the server. The

server subsequently aggregates the local updates from all clients and transmits

the aggregated result back to each of them.

2.2.1.2 Vertical FL System

Such a system is common when the datasets have different feature spaces

but the same sample spaces. Such systems are common in settings where sev-

eral clients can coordinate the training of an AI model using a central dataset

such that each client is trained for a different task but using the same features

from the central dataset.
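The difference between the horizontal and vertical settings comes down to how one table of samples and features is partitioned across clients. The sketch below is only an illustration: the record fields (`age`, `ct_volume_ml`, `lab_alt`) and the helper names `horizontal_split` and `vertical_split` are hypothetical and do not come from any dataset used in this thesis.

```python
def horizontal_split(records, n_clients):
    """Horizontal FL: clients hold different samples with the same features."""
    return [records[i::n_clients] for i in range(n_clients)]


def vertical_split(records, feature_groups):
    """Vertical FL: clients hold the same samples but different features."""
    return [
        [{"id": r["id"], **{f: r[f] for f in group}} for r in records]
        for group in feature_groups
    ]


# Hypothetical patient records, used only to show the two partitions.
records = [
    {"id": 1, "age": 64, "ct_volume_ml": 1510.0, "lab_alt": 31},
    {"id": 2, "age": 58, "ct_volume_ml": 1422.5, "lab_alt": 45},
    {"id": 3, "age": 71, "ct_volume_ml": 1688.0, "lab_alt": 28},
    {"id": 4, "age": 49, "ct_volume_ml": 1375.0, "lab_alt": 52},
]

by_rows = horizontal_split(records, n_clients=2)
by_columns = vertical_split(records, [["age"], ["ct_volume_ml", "lab_alt"]])

print([r["id"] for r in by_rows[0]], [r["id"] for r in by_rows[1]])
print(sorted(by_columns[0][0]), sorted(by_columns[1][0]))
```

In the horizontal split the clients share the keys of every record but hold disjoint patient ids; in the vertical split every client sees all patient ids but only its own feature columns.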

2.2.1.3 Federated Transfer Learning

The last setup for FL is one in which we have different sample spaces as well

as different feature spaces to train a global model across different datasets.

Using a transfer learning approach, features from the different feature spaces

are mapped to a single common representation, which is then used for training

on the local datasets. In a practical setting, we can see that such a setup can assist

in disease diagnosis by partnering with regions that have several hospitals

with a broad range of patients (sample space) as well as several therapeutic

approaches (feature space).

2.2.2 General FL Algorithm

In this section, we will discuss the process of training a model in a federated

learning setup. A central server is required that will orchestrate the entire

process. The main training steps are -

• Selection of clients - The central server samples a certain number of

clients to participate in the federated training.

• Model sharing and broadcasting - The server broadcasts the latest global

model weights and the training scheme to the clients.

• Client local training - Using the local data and the latest training scheme,

the clients train a local model.

• Weight aggregation - After local training is complete, the new weights are

shared with the server where they are aggregated using the server-side

aggregation strategy.

• Global model update - The server updates the global model based on the

new aggregated weights computed from the clients.
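The five steps above can be sketched as a single server-side function with pluggable local training and aggregation. This is a toy illustration under stated assumptions: the function names (`run_round`, `local_train`, `mean_aggregate`), the one-parameter "model" that estimates a mean, and the unweighted averaging are all placeholders, not the training scheme used later in this thesis.

```python
import random


def run_round(global_weights, clients, local_train, aggregate, n_selected, rng):
    """One communication round of a generic FL loop."""
    # 1. Selection of clients.
    selected = rng.sample(clients, n_selected)
    updates = []
    for client in selected:
        # 2. Broadcast: each client starts from a copy of the global weights.
        start = list(global_weights)
        # 3. Client local training, using only that client's local data.
        updates.append(local_train(start, client))
    # 4. Weight aggregation with the server-side strategy.
    # 5. The aggregated weights become the new global model.
    return aggregate(updates)


def local_train(weights, client):
    """Toy local step: move halfway toward the mean of the local data."""
    target = sum(client["data"]) / len(client["data"])
    return [w + 0.5 * (target - w) for w in weights]


def mean_aggregate(updates):
    """Placeholder strategy: unweighted average of the client weights."""
    return [sum(ws) / len(ws) for ws in zip(*updates)]


clients = [{"data": [1.0, 3.0]}, {"data": [5.0]}, {"data": [9.0, 11.0]}]
rng = random.Random(0)
weights = [0.0]
for _ in range(20):
    weights = run_round(weights, clients, local_train, mean_aggregate, 3, rng)
print(weights)
```

Only weights cross the client boundary in this loop; the `data` lists are read exclusively inside `local_train`, which mirrors the privacy argument made for FL above.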

2.2.2.1 Federated Averaging (FedAvg) Algorithm

The Federated Averaging algorithm was proposed by [7] and is one of the

most common and widely used FL aggregation strategies. Building on the success

of SGD optimization in deep learning, the authors adapt SGD to the federated

optimization problem by selecting a fraction C of the clients in each round and

computing the gradient of the loss over all the data held by these clients (called

FedSGD). Their work also suggests that naive parameter averaging works well

when the clients start from a shared model initialization.

The general algorithm of running Federated Averaging assuming K clients,

B as the local batch size, E as the number of epochs, and η as the learning rate

is given by -

• At first, randomly initialize the model on the server

• Perform the following operation for every round of communication:

– First select a random set of clients

– Allow every chosen client to perform local steps of gradient descent

in parallel such that $g_k = \nabla F_k(w_t)$

– Return the weights after gradient descent to the server

– The model at the server then aggregates all the model parameters

submitted by the clients using $w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^{k}$,

where $n_k$ is the number of training samples at client $k$ and $n = \sum_{k} n_k$
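The weighted aggregation step can be written out directly. The sketch below is a toy instance under stated assumptions: every client is selected in every round, the local batch size is one sample, the model is a single slope fitted by least squares, and the names `local_sgd` and `fedavg` are illustrative rather than taken from any FL library.

```python
def local_sgd(weights, data, lr, epochs):
    """E local epochs of per-sample SGD (B = 1) on the loss (a*x - y)**2."""
    (a,) = weights
    for _ in range(epochs):
        for x, y in data:
            a -= lr * 2.0 * (a * x - y) * x  # gradient step on one sample
    return [a]


def fedavg(client_weights, client_sizes):
    """Aggregate client weights, each weighted by its share n_k / n of the data."""
    n = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(n_k / n * w[i] for w, n_k in zip(client_weights, client_sizes))
        for i in range(dim)
    ]


# Two toy clients whose (x, y) pairs all follow y = 2x; the first holds more data.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 2.0)]]
w = [0.0]  # shared initialization on the server
for _ in range(30):
    local = [local_sgd(w, data, lr=0.05, epochs=2) for data in clients]
    w = fedavg(local, [len(data) for data in clients])
print(w)
```

Weighting by $n_k / n$ means a client with more samples pulls the global model further toward its local solution than a client with few samples, which is the only difference from a plain average.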

There are several advantages to federated learning -

• The main advantage of FL is to move away from the need for direct access

to the central training data. It allows the global model to become robust

by learning from a diverse set of training samples. Moreover, due to the

decentralized nature of the FL setup, in a practical scenario it allows

different medical institutions to learn a global model, thus allowing them

to share knowledge about different tasks.

• FL is more scalable than centralized learning. In FL, each client uses

its resources to train the global model, and as such the overall compute

scales at no extra cost. Over time, the global model is trained on an ever-

increasing dataset while all the computations are handled by the clients.

In centralized learning, such a situation would demand large computing

and storage resources and would thus not be scalable.

However, FL also has the following disadvantages -

• Statistical Heterogeneity - It has been shown that statistical data hetero-

geneity between the clients is a major challenge for FL. In [8], the authors

show that FedAvg will suffer from significant degradation in performance

when deployed in a non-i.i.d. setting. Such a situation is common in medical
imaging scenarios, as each client node can hold non-i.i.d. data: the images
can be acquired using different protocols and have inherent data distribution
shifts.

• Expensive Communication - Communication between the central server

and the local models can be a bottleneck when deploying FL systems in

the wild. Although federated networks can comprise several 1000s of

clients, communication within the network can be slow when the

size of the models is large. This coupled with the need for regular up-

dates to the local model also puts an additional constraint on the number

of communication rounds required between the server and the clients.

• Privacy - Although FL setups can demonstrate data privacy between clients,

communicating the model updates to the central server can reveal in-

formation about the clients. Moreover, FL may introduce training time

adversarial attacks on clients such as data poisoning, model update poi-

soning, and inference attacks.

Chapter 3

FL for Distributed Heterogeneous Datasets with Incomplete Annotations

3.1 Introduction

Generating manual annotations for medical images is time-consuming, re-

quires high skill, and is an expensive effort, especially for 3D images [9]. One

potential solution is to curate datasets with partial annotations, wherein only

a subset of structures is annotated for each image or volume. Furthermore,

knowledge from similar partially annotated datasets from multiple groups can

be aggregated to collaboratively train global models using Federated Learn-

ing [10]. Knowledge aggregation would not only save time but also allow differ-

ent groups to benefit from each other’s annotations without explicitly sharing

them. Consequently, different techniques have been proposed in the literature

for aggregating knowledge from heterogeneous datasets with partial, incom-

plete labels [11–14].

There has been considerable research in the past on developing multi-task

segmentation models using partial labels. The works of [15, 16] show how to

create subsets of the partially labeled datasets to create fully labeled subsets.

However, this strategy requires very heavy computational resources. Another

approach as described by [17,18] is to design a multi-task head with a common

encoder and task-specific decoders that are trained separately. This approach

requires all the data to be hosted locally and is not realistic in a real-world med-

ical scenario owing to patient privacy concerns and storage and computational

limitations.

The work of [19, 20] has shown promise in developing multi-task segmen-

tation models using multi-scale feature abstraction but the proposed method

requires all the data to be centrally located. We know that such a situation is

not realistic in a medical scenario not only because of privacy and data
sharing restrictions but also because it is impossible to anticipate in advance
how many distinct tasks the model should be trained for. In [13], the authors

developed a multi-task multi-domain deep segmentation model for the segmen-

tation of pediatric imaging datasets with excellent performance. However, the

proposed technique was developed and evaluated for different anatomical re-

gions in the body with no overlapping field of view or incomplete annotations.

Similarly, the cross-domain medical image segmentation technique developed

in [11] was focused on segmentation of the same anatomical structure, and the

proposed technique was not developed to tackle incomplete annotations.

The work of [20] introduced a real FL setup for segmentation using par-

tial labels where client nodes were trained on specific sub-networks for their

specific tasks using a shared decoder. However, this method is not scalable

and again, needs knowledge of all the tasks to be trained. It was for the first

time in the work of [14] that knowledge aggregation was introduced using a

single network in a federated manner. The global federated learning frame-

work developed in their work, however, failed to accurately segment different

anatomical structures on the external test set. For optimal performance, the

authors used an ensemble of multiple local federated learning models, making

it computationally expensive and practically challenging.

Thus, to create an FL setup that can generalize well to non-i.i.d data, and

accurately segment target organs for heterogeneous datasets with partial la-

bels, I developed SegViz - A multi-task federated learning framework to learn

a diverse set of tasks from distributed nodes with incomplete annotations, as

illustrated in Figure 3.1. The global SegViz model is initialized at the server

with two distinct blocks - a representation block and a task block. The goal

of the representation block is to learn a generalized representation of the un-

derlying dataset while the goal of the task block is to learn individual tasks

distributed across different nodes. Every client is initialized with a subset of

the SegViz model, comprising the representation block and a subset of the task

block representing the client’s tasks. During training, the weights of the rep-

resentation block are always aggregated by the server and redistributed back

to the client nodes. On the other hand, the weights of the task block are di-

rectly copied from the corresponding client nodes containing the corresponding

task, thereby preserving the task-related information for each node in their

task block.
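The split between aggregated and copied weights described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the SegViz implementation: weights are plain dictionaries, the `rep.` and `task.` key prefixes and the function name `aggregate_segviz` are hypothetical, and the representation block is combined with an unweighted mean.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def aggregate_segviz(clients: Dict[str, Weights], rep_prefix: str = "rep.") -> Weights:
    """Average representation-block parameters; copy task-block parameters as-is."""
    global_model: Weights = {}
    client_list = list(clients.values())
    # Representation block: every client holds these keys, so they are averaged.
    rep_keys = [k for k in client_list[0] if k.startswith(rep_prefix)]
    for key in rep_keys:
        columns = zip(*(w[key] for w in client_list))
        global_model[key] = [sum(col) / len(client_list) for col in columns]
    # Task block: each head is assumed to live on one client, so it is copied directly.
    for weights in client_list:
        for key, value in weights.items():
            if not key.startswith(rep_prefix):
                global_model[key] = list(value)
    return global_model


clients = {
    "liver_node": {"rep.conv": [1.0, 2.0], "task.liver": [0.5]},
    "spleen_node": {"rep.conv": [3.0, 4.0], "task.spleen": [0.9]},
}
global_model = aggregate_segviz(clients)
print(global_model["rep.conv"])  # [2.0, 3.0]
print(sorted(global_model))      # ['rep.conv', 'task.liver', 'task.spleen']
```

The resulting global dictionary holds one shared representation block and one task head per client task, mirroring the two-block layout of the global model.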

The SegViz framework was evaluated using four publicly available datasets

from the Medical Segmentation Decathlon (MSD) challenge [5]. The Spleen
MSD dataset consists of 61 3D computed tomography (CT) volumes with

spleen annotations out of which only the 41 training set volumes were used.

The Liver MSD dataset consists of 201 3D CT volumes with liver and liver

tumor annotations out of which only 131 training set volumes were consid-

ered. Similarly, the Pancreas MSD dataset consists of 420 3D CT volumes of

which only 282 training volumes were used. Lastly, from the 2019 Kidney Tu-

mor Segmentation Challenge dataset [21], I used 210 3D CT volumes from the

training dataset. The training and internal validation splits were considered

from the overall training data in an 80:20 split. I considered all 30 training

image volumes from the Beyond the Cranial Vault (BTCV) dataset [22] as an

external test set for all our experiments. As the test set only contains labels for

the organs, all disease labels were ignored from each task, and the models are

trained to only segment the active regions of each organ.

During pre-processing, all the image volumes were first resized to 256 × 256

× 128, and the intensity values normalized between 0 and 1. All the volumes

were resampled to a constant spacing of (1.5, 1.5, 2.0). Then, I extracted random
foreground patches of size 128 × 128 × 32 from each volume such that the center
voxel of each patch belonged to either the foreground or background class.

3.1.1 Model Architecture

The backbone of the SegViz model architecture was constructed using a

modified version of the multi-head 3D-UNet [23] configuration for all our ex-

periments. Each U-Net has 5 layers with down/up-sampling at each layer by a

factor of 2. Unlike how U-Net implementations typically operate, these down

or up-sampling operations happen at the beginning of each block instead of at

the end. The U-Net also contains 2 residual units at the layers and uses Batch

Normalization at each layer. The task block comprised a multi-head architec-

ture with each head consisting of two layers, including the final classification

layer. The SegViz model was implemented using the MONAI [24] framework

and the pre-processing and training were implemented using PyTorch. The SegViz
model architecture is illustrated in Figure 3.2.

During training, all weights are initialized using LeCun initialization. The

batch size was set to 2 and the learning rate was initially set to 1e-4 with the

Adam optimizer and CosineAnnealingLR [25] as the scheduler. The Dice Loss

was used as the loss function. The average Dice Score was chosen as the final

evaluation metric. Each model was trained for a total of 500 epochs.
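For reference, the Dice score and the corresponding Dice loss can be sketched on flattened binary masks as below. This is a minimal illustration, assuming hard 0/1 masks and a small smoothing constant `eps`; the actual experiments rely on the framework's own loss and metric implementations, which operate on soft predictions.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int], eps: float = 1e-8) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def dice_loss(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice loss is one minus the Dice score, so perfect overlap gives zero loss."""
    return 1.0 - dice_score(pred, target)


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(round(dice_score(pred, target), 3))  # 0.667
```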

3.1.1.1 Individual Baseline Implementation

For every task, I trained a single U-Net model based on the SegViz model
architecture on the training dataset after the 80:20 split, with data augmentation
such as random affine transformations (rotation and scaling). Hence, there is
one individually trained model each for the liver, spleen, pancreas, and kidneys.

3.1.1.2 Central Aggregation Implementation

As a lower bound for a multi-task segmentation setup, I combined all four

datasets together to create a central repository of all the data. This is con-

sidered as a lower bound because naive aggregation of the data in the case

of partial annotations would lead to suboptimal performance compared to an

individual model trained for each dataset separately. I set up our centrally
aggregated model using the same steps as our baseline implementation.

Figure 3.1: Illustration of the proposed SegViz framework: Client nodes update
the global meta-model where knowledge aggregation occurs after every 10
iterations of the local model. The weights of the global model are then shared
with the client models allowing both nodes to share knowledge without sharing
data.

Figure 3.2: Illustration of the modified 3D-UNet configuration: The representation
block refers to all the layers except the last two convolutional layers,
while the task block refers to the final two layers including the classifier. The
classifier (in green) can be fine-tuned for segmenting any of the four organs.
Image generated using [1].

3.1.1.3 SegViz Implementation

FedAvg: I use the popular FedAvg algorithm to construct an FL setup

where each client node in the setup represents an isolated group having one

of the datasets. The same UNet configuration with the SegViz architecture was

used at each client. Apart from the same pre-processing steps as the baseline

implementation, I also added random affine transformations such as rotation

and scaling. While training the local models, after every 10 epochs, following

the FedAvg algorithm, the global model gets all but the last two convolutional

layers’ weights and averages them. The updated weights are then shared back

to all the local models.

FedBN: I investigated the popular FedBN algorithm in a similar setup

as our FedAvg implementation. Making sure that our global model is gener-

alizable to non-i.i.d data is especially important in medical imaging as data

from different centers is obtained using different scanners/protocols. FedBN
has been shown to be more successful than other FL algorithms such as FedAvg
and FedProx in creating a global model that generalizes well to non-i.i.d.
data, and it does so by not aggregating the batch norm layers during knowledge
transfer.
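The difference between FedBN and FedAvg comes down to which parameters take part in a communication round. The sketch below illustrates this with plain dictionaries; it is not the code used for the experiments, and identifying batch norm parameters by a `bn` substring in the (hypothetical) key names is an assumption made purely for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def fedbn_round(clients: List[Weights], bn_marker: str = "bn") -> List[Weights]:
    """Average all shared parameters except batch-norm ones, which stay local."""
    shared_keys = [k for k in clients[0] if bn_marker not in k]
    averaged = {
        k: [sum(col) / len(clients) for col in zip(*(c[k] for c in clients))]
        for k in shared_keys
    }
    # Each client receives a copy of the averaged weights but keeps its own
    # batch-norm parameters untouched.
    return [{**c, **{k: list(v) for k, v in averaged.items()}} for c in clients]


clients = [
    {"conv.weight": [1.0], "bn.running_mean": [0.1]},
    {"conv.weight": [3.0], "bn.running_mean": [0.7]},
]
updated = fedbn_round(clients)
print(updated[0])  # {'conv.weight': [2.0], 'bn.running_mean': [0.1]}
print(updated[1])  # {'conv.weight': [2.0], 'bn.running_mean': [0.7]}
```

Because the normalization parameters never leave the client, each node ends up with a model adapted to its own intensity distribution.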

Local fine-tuning: In [26, 27], the authors demonstrated the need for
fine-tuning in FL models in order to reduce the effect of catastrophic forgetting and
stabilize personalized performance. I also fine-tuned the FedAvg and FedBN
models (keeping the representation block frozen) on the local datasets to
improve the task-specific performance of each task block while keeping the same
representation block.

3.2 Results

As shown in Figure 3.3, the FedBN model with fine-tuning performs the best
on the in-federation internal validation set as well as the out-of-federation BTCV
test set. The SegViz framework using FedBN with fine-tuning segmented the
BTCV test set with Dice scores of 0.93, 0.83, 0.55, and 0.75 for segmentation
of the liver, spleen, pancreas, and kidneys, respectively, significantly (p < 0.05)
better (except for the liver and spleen) than the Dice scores of 0.88, 0.79, 0.46,
and 0.64 for the baseline models. In contrast, the central aggregation model
performed significantly (p < 0.05) worse on the test dataset, with Dice scores of
0.65, 0, 0.55, and 0.68. Note that the model trained on the centrally aggregated
data did not generalize to the spleen label because the overall model became biased
toward the liver and pancreas labels, which contain more samples per label. I
have included the statistical t-test results between the baseline and the
best-performing models in Table 3.3 and Table 3.4.

Figure 3.3: A comparison of the ground truth segmentation masks with the
masks generated by the baseline and SegViz FedBN+FT models for two differ-
ent images. Note that the individual baselines are computed separately and
are shown here together for illustration only. The Red color signifies the Liver,
Blue for the Spleen, Green for the Kidneys, and Yellow for the Pancreas

Table 3.1: Mean Dice score performance of all the experiments on the in-
federation internal validation dataset. The standard deviation values are in
parentheses.

Table 3.2: Mean Dice score performance of all the experiments on the out-of-
federation BTCV dataset. The standard deviation values are in parentheses.

Table 3.3: P-values comparison using Paired statistical t-test between base-
line, central aggregated model, and best performing SegViz (FedBN+FT) mod-
els for the internal validation dataset. The significant values (p < 0.05) are
highlighted in bold.

Table 3.4: P-values comparison using Paired statistical t-test between base-
line, central aggregated model, and best performing SegViz (FedBN+FT) mod-
els for the external BTCV dataset. The significant values (p < 0.05) are high-
lighted in bold.

Chapter 4

Coresets For Data Efficient Training of Medical Image Segmentation Algorithms

There is an increasing demand for edge computing devices such as smart-

phones, smartwatches, and home automation systems. It is common to see clin-

icians use their smartphones/tablets to view reports and sometimes even mod-

ify an existing annotation. Current deep-learning models for segmentation are

too large to be trained or locally fine-tuned on such systems. Modern segmentation
frameworks such as nnUNet and MONAI also require high-performance
computing resources such as GPUs to process these large datasets. Hence, there is a

need to efficiently deploy machine learning models on such edge devices within

their limited computational constraints.

Coresets are a foundational technique in machine learning that reduces the

computational requirement for training a machine learning model by reducing

the sample size of the training data. Coresets essentially represent a smaller

version of the original dataset which is used to train the model such that it

captures the essential information of the original dataset with no significant

drop in performance compared to training on the entire data.

By extracting a coreset of a larger dataset, such edge devices can signifi-

cantly reduce the amount of data needed to train models locally leading further

to lower training times, reduced memory requirements, and reduced power con-

sumption.

Moreover, in recent years, the field of machine learning has seen rapid ad-

vancements in the deployment of DL algorithms in the wild. Coresets for data

efficient training were introduced by the authors in [28]. Coresets based on

k-means or k-medians clustering [29], using naive Bayes and nearest neighbors
[30], Nyström methods [31, 32], and Bayesian inference [33] have been

shown to provide high-probability solutions [34, 35] to specific problems.

In this chapter, I show that coresets can be used to efficiently reduce the

sample size required for training 3D segmentation models for two different

organ segmentation tasks - Segmenting the airways in pediatric MR images

and segmenting the liver and spleen in chest CT images.

Let us define a few terms before formally defining our problem statement -

Query space: A query space is a tuple $(P, w, \mathcal{X}, f)$, where $P$ represents a
finite set of inputs, $\mathcal{X}$ represents a (possibly infinite) set of queries, $w$ is a
weight function that determines the importance of each point, and $f$ is a
function that gives the pseudo-distance between a point and a query.

Coreset: We define a coreset $C$ as a smaller weighted version of $P$ that
approximates the sum of weighted distances

$$\sum_{p \in P} w(p)\, f(p, x)$$

or, more generally, any cost function $\mathbb{R}^n \to \mathbb{R}$ that maps the $n = |P|$ distances
$(f(p, x))_{p \in P}$ to a non-negative total cost. It turns out that finding sensitivity
approximations of each point is still time-consuming and thus a more practical approach

is to build a procedure that builds on the main intuition of sensitivity sampling

instead of actually performing sensitivity sampling.
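Returning to the definition of a coreset given above, the approximation property is commonly written as an ε-coreset guarantee. The statement below is the standard textbook form and not a result derived in this thesis; the coreset weights $u$ and the error parameter $\varepsilon \in (0, 1)$ are notation assumed here, and $\mathcal{X}$ denotes the query set X of the query space.

```latex
\left| \sum_{p \in P} w(p)\, f(p, x) \;-\; \sum_{c \in C} u(c)\, f(c, x) \right|
\;\le\; \varepsilon \sum_{p \in P} w(p)\, f(p, x)
\qquad \text{for all } x \in \mathcal{X}.
```

In words, evaluating any query on the small weighted set $C$ gives a cost within a factor of $(1 \pm \varepsilon)$ of the cost on the full set $P$.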

Sensitivity/Uniform Sampling: This is one of the most common techniques
for creating a coreset. Unlike other methods, uniform sampling runs in time
sub-linear in the input size, but it does not provide the (1 + ϵ) multiplicative
error guarantees of other techniques. Sensitivity sampling assigns a different
sensitivity to every observation. One shortcoming of sampling uniformly is that
outliers that are drastically different are often ignored, whereas it might be
possible that the outlier is a crucial point in the data and needs to be included
manually as part of the coreset. However, if the dataset is roughly uniform in
distribution, i.e. the sensitivities of the points are also uniform, then sensitivity
sampling is also achieved by selecting observations uniformly at random.

Group Sampling : The ability of the dataset’s observations to be clus-

tered or divided into several groups, whereby the behavior of the observations

is similar within a group but significantly different across groups, is a common

phenomenon. The sensitivities in this situation are comparable within each

group, but they might also differ significantly between groups. In such situa-

tions, sensitivity sampling would take a similar number of observations from

each group. Then, if a natural a priori partition is known, sensitivity sampling
can be approximated by dividing the data into the known groups and sampling a
fixed number of points from each group uniformly at random. Thus, for a coreset
of m points and k groups, I randomly sample m/k points from each group, each of
which is then weighted in proportion to the number of points in its group.
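The group sampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the groups are given as a dictionary of hypothetical item identifiers, `group_sample` is an assumed helper name, and each drawn point receives the weight (group size) / (number of draws from that group), which is one natural way to make the weights proportional to the group size.

```python
import random
from typing import Dict, List, Tuple


def group_sample(groups: Dict[str, List[str]], m: int, seed: int = 0) -> List[Tuple[str, float]]:
    """Draw m/k points uniformly from each of k groups and weight them by group size."""
    per_group = m // len(groups)
    if per_group < 1:
        raise ValueError("m must be at least the number of groups")
    rng = random.Random(seed)
    coreset: List[Tuple[str, float]] = []
    for members in groups.values():
        chosen = rng.sample(members, min(per_group, len(members)))
        weight = len(members) / len(chosen)  # each draw stands in for this many points
        coreset.extend((item, weight) for item in chosen)
    return coreset


# Hypothetical groups of image identifiers with 6 and 2 members.
groups = {"a": [f"a{i}" for i in range(6)], "b": ["b0", "b1"]}
coreset = group_sample(groups, m=4)
print(len(coreset), sum(weight for _, weight in coreset))  # 4 8.0
```

With this weighting, the total weight of the coreset equals the number of points in the original dataset, so weighted sums over the coreset stay on the same scale as sums over the full data.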

In the next section, I show how coresets can be applied successfully to med-

ical image segmentation tasks. Using uniform sampling, I show how coresets

can be built for the task of pediatric airways segmentation on Magnetic Res-

onance (MR) images, and using group sampling, I show how coresets created

by using the k-means algorithm can be applied to segment the liver and spleen

organs from chest CT images.

4.1 Coreset Creation Using Group Sampling

We demonstrate how group sampling can be used to create coresets for seg-

menting the liver and spleen from chest CT images.

4.1.1 Data

For the experiment, I used imaging data for both organs from the Medical

Segmentation Decathlon challenge. The Liver MSD and Spleen MSD datasets

have already been described in the SegViz section and the same data finger-

print was used to run this experiment. The Liver MSD dataset contains 201

CT images, while the Spleen MSD dataset contains 61 CT images.

4.1.2 Methods

4.1.2.1 Coreset Creation

For the Liver dataset, I select sample sizes of 5, 3, and 2 samples per cluster,
while for the Spleen dataset, I sample 10, 5, and 3 samples per cluster. For
both datasets, I extracted coresets using the weighted sampling strategy
described in [36]: given a set of images of size N and a compression rate of R,
treating each image as a sample point, I partition all the points in space using
the k-means++ algorithm. I employ k-means++ clustering because we want to
manually choose the number of clusters to be N/R, as it indicates the compression
ratio. Furthermore, since the overall distribution of all the images in the
training set is known, where every image is sourced using the same acquisition
protocols and hence has no outliers, and is one-dimensional in nature,
k-means++ clustering will provide quick and accurate convergence. Using
this strategy, the overall sample space for the Liver data can be divided into 9
clusters and for the Spleen data into 2 clusters. The results of the clustering
algorithm for both datasets are shown in Figures 4.2 and 4.1.
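The clustering-then-sampling procedure can be sketched in plain Python as below. This is an illustrative sketch, not the pipeline used for the experiments: each image is reduced to a single hypothetical scalar feature, the helper names `kmeans_pp` and `sample_per_cluster` are assumptions, and a library implementation of k-means++ would normally be used in practice.

```python
import random
from typing import Dict, List


def kmeans_pp(points: List[float], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Cluster 1-D points with k-means++ seeding followed by Lloyd iterations."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # k-means++ seeding: pick the next center with probability proportional to D(x)^2.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (p - centers[j]) ** 2) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = sum(members) / len(members)
    return labels


def sample_per_cluster(labels: List[int], per_cluster: int, seed: int = 0) -> List[int]:
    """Return indices of up to `per_cluster` randomly chosen points from every cluster."""
    rng = random.Random(seed)
    by_cluster: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        by_cluster.setdefault(lab, []).append(idx)
    picked: List[int] = []
    for members in by_cluster.values():
        picked.extend(rng.sample(members, min(per_cluster, len(members))))
    return sorted(picked)


# Hypothetical scalar feature per image, forming two well-separated groups.
features = [0.1, 0.2, 0.15, 0.12, 5.0, 5.1, 4.9, 5.2]
labels = kmeans_pp(features, k=2)
coreset_indices = sample_per_cluster(labels, per_cluster=2)
print(len(set(labels)), len(coreset_indices))  # 2 4
```

Here `k` plays the role of N/R and `per_cluster` the number of samples drawn per cluster, so the coreset size is their product whenever every cluster has enough members.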

4.1.2.2 Model Training

Using the coresets constructed with the steps above, I trained a 3D
full-resolution (3d_fullres) model using the standard nnUNet pipeline. The only
modification was that each fold was trained only for 100 epochs and the rest of
the pipeline was kept the same.

Figure 4.1: Figure depicting 2 main clusters generated for the spleen dataset
after T-SNE [2]

Figure 4.2: Figure depicting 9 main clusters generated for the liver dataset
after T-SNE [2]

4.1.2.3 The nnUNet Training Pipeline

The nnUNet pipeline [16] was introduced in a seminal paper in the automated
image analysis domain. It is an open-source framework developed by the German
Cancer Research Center (DKFZ) for out-of-the-box medical image segmentation. The

main idea behind nnUNet is the generation of dynamic templates for auto-

matic segmentation that can be adapted to any new dataset. Compared to

traditional machine learning techniques which rely on the knowledge of the

necessary model architecture and the right configuration of hyperparameters,

nnUNet focuses on determining a dataset fingerprint of key pre-determined

hyperparameters. nnUNet then can configure 3 different types of standard

UNet architectures to train the model - The 2D UNet, the 3D UNet with a full

resolution field of view and a two-stage 3D cascaded UNet version where the

first stage of the cascade trains on low-resolution while the next stage trains on

full resolution images. The blueprint parameters that had been determined in

the fingerprint are then used to create the training schedule such as the learn-

ing rate schedule, loss function, architecture template, and hyperparameters.

nnUNet also calculates another set of parameters called inferred parameters

that build on top of the blueprint parameters to configure specialized training

pipelines for dataset-specific adaptations such as patch size, batch size, and
pre-processing techniques such as normalization and standardization.

4.1.2.4 Dataset Fingerprint Extraction

Given an image volume, nnUNet extracts the central nonzero region first.

Next, nnUNet constructs the fingerprint that is specific to the dataset at hand

and computes several properties such as all image sizes before and after crop-

ping, the imaging modality for the task, the voxel spacing for the image and the

ground truth annotation, and the number of classes that are required for
segmentation. At this stage, nnUNet also performs modality-specific preprocessing:
for CT, it computes the mean and standard deviation, the 0.5th percentile,
and the 99.5th percentile of the intensity values.

4.1.2.5 Blueprint Parameters

After creating the dataset fingerprint, nnUNet creates a set of fixed param-

eters that do not change for any dataset. This includes parameters such as

the architecture templates for the various configurations of the standard UNet

model, the learning rate scheduler, and the various data augmentation tech-

niques.

4.1.2.6 Architecture Templates

The principal idea of the nnUNet paper is that the standard UNet configuration
with proper adaptation can achieve state-of-the-art results on virtually
any dataset. nnUNet has preconfigured scripts for the standard UNet configuration
in 2D, 3D, and cascaded 3D setups. All the configurations use the standard
encoder-decoder architecture with skip connections and a constant stride of 1.
All the model configurations favor larger patch sizes over larger batch sizes,
keeping in mind the hardware limitations. Instead of batch normalization, all
configurations use instance normalization, together with the Leaky ReLU
activation function. The configurations are initialized with 32 feature maps at the

highest resolution which is doubled with each downsampling.

4.1.2.7 Training Schedule

Standard nnUNet pipelines are trained for 1000 epochs with 250 iterations

per epoch. I train nnUNets only for 100 epochs for all our experiments. The

stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01
is used together with Nesterov momentum (µ = 0.99). A learning rate scheduler with

decay based on the total number of epochs is also implemented. A modified

version of the Dice loss together with the cross-entropy loss was used as the

loss function.
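A learning rate decay driven by the total number of epochs can be sketched as a polynomial schedule. The snippet below is only an illustration: the function name `poly_lr` is hypothetical, and the exponent of 0.9 is an assumption based on the published nnU-Net defaults rather than a value stated in this section.

```python
def poly_lr(epoch: int, max_epochs: int, initial_lr: float = 0.01, exponent: float = 0.9) -> float:
    """Polynomial decay: lr = initial_lr * (1 - epoch / max_epochs) ** exponent."""
    return initial_lr * (1.0 - epoch / max_epochs) ** exponent


# Learning rate at the start, middle, and final epoch of a 100-epoch run.
schedule = [poly_lr(epoch, max_epochs=100) for epoch in (0, 50, 99)]
print([round(lr, 5) for lr in schedule])
```

Shortening training from 1000 to 100 epochs compresses the same decay curve into fewer epochs, since the schedule depends only on the ratio of the current epoch to the total.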

4.1.2.8 Inferred Parameters

For CT images, intensity normalization is done using global dataset per-

centile clipping. Image resampling and target spacing modifications are done

using the standard nnUNet pipeline which would default to third-order spline

interpolation for resampling and median spacing for each axis depending on

the training cases.
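Global percentile clipping for CT intensities can be sketched as follows. This is a simplified illustration and not the nnUNet code: the helper names are hypothetical, the statistics are computed over a flat list of foreground intensities pooled across the dataset, and the 0.5 and 99.5 percentile defaults are assumptions taken from the published nnU-Net defaults.

```python
import statistics
from typing import List, Sequence


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile (q in [0, 100]) of pre-sorted values."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def normalize_ct(voxels: Sequence[float], foreground: Sequence[float],
                 low_q: float = 0.5, high_q: float = 99.5) -> List[float]:
    """Clip to global foreground percentiles, then z-score with the foreground mean/std."""
    fg = sorted(foreground)
    lo, hi = percentile(fg, low_q), percentile(fg, high_q)
    mean = statistics.fmean(fg)
    std = max(statistics.pstdev(fg), 1e-8)  # guard against constant foreground
    return [(min(max(v, lo), hi) - mean) / std for v in voxels]


# Hypothetical foreground intensities 0..200 and three voxels to normalize.
foreground = [float(v) for v in range(201)]
normalized = normalize_ct([-1000.0, 100.0, 3000.0], foreground)
print([round(v, 3) for v in normalized])
```

Extreme values such as air or metal artifacts are clipped to the percentile bounds before scaling, so they cannot dominate the normalized intensity range.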

4.1.3 Results

For the test set, I used all 30 training images from the BTCV dataset. I

compared the dice similarity index between the ground truth annotations from

the BTCV dataset and predictions of each nnUNet model for each coreset con-

struction.

The baseline for both datasets is a nnUNet model that is trained on all

images from the training set. For the Liver dataset, the performance on the

BTCV dataset gives a Dice index of 0.897 with a Jaccard coefficient of 0.822

while for the Spleen dataset, the Dice index is 0.865 with a Jaccard coefficient

of 0.79.

For the Spleen dataset - We see that for the coreset with 20 samples, the

Dice index is 0.862 with a Jaccard coefficient of 0.787. For the coreset with 10

samples, the Dice index is 0.76 with a Jaccard coefficient of 0.692, while for the
6-sample coreset, the Dice index is 0.741 with a Jaccard coefficient of 0.668.

Figure 4.3: Figure showing the Dice similarity index for the baseline Liver
and coreset experiments. Higher is better.

Figure 4.4: Figure showing the Dice similarity index for the baseline Spleen
and coreset experiments. Higher is better.

Table 4.1: Paired t-test comparison between the baseline (all training samples)
with the coreset constructions of varying sample sizes. The significant values
(p < 0.05) are highlighted in bold

For the Liver dataset - We see that for the coreset with 45 samples, the

Dice index is 0.903 with a Jaccard coefficient of 0.832. For the coreset with 27

samples, the Dice index is 0.898 with a Jaccard coefficient of 0.824 and for the

coreset with 18 samples, the Dice index is 0.872 with a Jaccard coefficient of

0.795.

A comparison of the dice scores from the baselines and coreset experiments

is shown in Figures 4.3 and 4.4. The paired t-test results are shown in Table 4.1
and suggest that the differences between the baseline and coreset liver models are
not significant, while the differences between the baseline and coreset spleen
models are significant for sample sizes 10 and 6.

Chapter 5

Discussion

The SegViz framework proposed in this work demonstrated excellent perfor-

mance in aggregating knowledge from heterogeneous datasets with different,

incomplete labels. Our approach successfully aggregated knowledge from all

nodes with little to no drop in the performance of the global meta-model in

terms of the average dice score.

The comparable performance between the SegViz segmentations and multiple
baseline model segmentations illustrates a preliminary example of constructing
a single global multi-task segmentation model with clinical applicability
from dispersed datasets with disjoint partial annotations.

Image segmentation from heterogeneous datasets with incomplete annota-

tions has many potential benefits. For example, SegViz can potentially reduce

labeling time by 1/η where η is the number of distinct labels in the distributed

data sets by allowing the transfer of knowledge between each client. This

would not only save time but also allow different research groups to potentially

benefit from each other’s annotations without explicitly sharing them.

SegViz-trained models using the MONAI framework also demonstrate a
smaller storage footprint compared to models trained by other state-of-the-art
methods such as nnUNet. Our global model and local models are only 18 MB in
size, whereas nnUNet pre-trained models, which are an ensemble by default,
occupy on average approximately 250-300 MB of storage. This makes
SegViz-trained models easy to deploy in a real-world scenario where the model weights
can be communicated using edge devices without the need for high computing
requirements and large servers.

I believe the success of SegViz can be attributed to several inherent advantages

in its implementations, such as using a learning rate decay and random affine

transformations during training which makes it more robust to non-i.i.d data.

Moreover, extending our FL implementations with fine-tuning allows for cre-

ating stable, high-performing personalized local models.

It is important to note that SegViz using FedAvg can create a global

model that can be extended to contain a multi-head classifier block. Thus,

when using FedAvg as the aggregating strategy, we can have a single global

model that has knowledge of all the tasks being trained using the participating

clients. This is not true for the FedBN model because, during training with the

FedBN strategy, we are forcing the batch normalization layers to be excluded

from the knowledge aggregation rounds. This provides the SegViz maintainers

with the choice of deploying FedAvg if a global model is desired or FedBN for

high-performance and personalized training.
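The difference between the two aggregation strategies can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SegViz implementation: model weights are represented as plain dictionaries of float lists, the parameter names (`conv1.weight`, `bn1.weight`) are hypothetical, any key containing `bn` is assumed to be a batch normalization parameter, and clients are weighted equally rather than by dataset size.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # hypothetical stand-in for a model state dict


def is_batch_norm(name: str) -> bool:
    # Assumption for this sketch: batch-norm parameters have "bn" in their name.
    return "bn" in name


def aggregate(clients: List[Weights], skip_batch_norm: bool) -> List[Weights]:
    """Average parameters across clients and return each client's updated weights.

    skip_batch_norm=False mimics FedAvg: every parameter is averaged, so all
    clients end up with the same (global) weights.
    skip_batch_norm=True mimics FedBN: batch-norm parameters stay local.
    """
    n = len(clients)
    updated = [dict(c) for c in clients]
    for name in clients[0]:
        if skip_batch_norm and is_batch_norm(name):
            continue  # keep each client's own batch-norm parameters
        mean = [sum(c[name][i] for c in clients) / n
                for i in range(len(clients[0][name]))]
        for u in updated:
            u[name] = list(mean)
    return updated


client_a = {"conv1.weight": [1.0, 3.0], "bn1.weight": [0.5]}
client_b = {"conv1.weight": [3.0, 5.0], "bn1.weight": [1.5]}

fedavg = aggregate([client_a, client_b], skip_batch_norm=False)
fedbn = aggregate([client_a, client_b], skip_batch_norm=True)
```

After the FedAvg round both clients hold identical weights, which is why a single global model exists. After the FedBN round the `bn1.weight` entries still differ, so each client keeps its own personalized model.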

Our study has a few limitations. The SegViz framework currently works with CT images only, and I have yet to extend it to other modalities. The current framework also focuses on segmenting organs that are present in the same field of view, not organs in different fields of view (for example, brain and liver). For the coresets framework, I employed only group sampling to create the coresets, which may not generalize well to all datasets. Finally, the test dataset contains only 30 samples from one particular distribution, and it would be beneficial to test the performance of the models on several other non-i.i.d. datasets.

In the future, I would like to extend our experiments to a modality that is less stable than CT, such as MRI. I would also like to scale our experiments to more nodes, some of which may hold a different modality. I would further like to investigate the real-world performance of our FL setup, where client nodes can join and drop contact with the server at any point in time, without a drop in performance. I am actively working toward integrating differential privacy methods into our existing FL setup to further improve patient privacy.


With our coreset experiments, I show that group sampling using k-means clustering as the grouping strategy can be used effectively to construct a coreset S from the entire superset of training data such that a model trained on the coreset still converges successfully. I compare the overall Dice index of several models trained with different coreset sampling configurations on an external test set and show that we can achieve similar performance using only a fraction of the total training cases. While I acknowledge that our liver coreset results are not statistically significant, I would like to perform a power analysis after incorporating additional sampling techniques to determine whether the results are not significant at all or are not significant only because of the current choice of 30 images in our test set.
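To make the planned power analysis concrete, here is a rough sketch based on a normal approximation for a paired, two-sided comparison of Dice scores. Only the test-set size of 30 comes from the text. The standardized effect size of 0.3 (mean paired difference divided by its standard deviation), the significance level of 0.05 and the target power of 0.8 are illustrative assumptions, not values measured in this thesis.

```python
import math
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def approx_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided paired test (normal approximation)."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n)
    # Probability of landing in either rejection region under the alternative.
    return (1 - _norm.cdf(z_crit - shift)) + _norm.cdf(-z_crit - shift)


def required_n(effect_size: float, power: float = 0.8, alpha: float = 0.05) -> int:
    """Smallest n reaching the target power under the same approximation."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    return math.ceil(((z_crit + z_power) / effect_size) ** 2)


ASSUMED_EFFECT = 0.3  # hypothetical standardized difference in Dice
power_at_30 = approx_power(ASSUMED_EFFECT, n=30)
n_for_80 = required_n(ASSUMED_EFFECT)
```

Under these assumed values, 30 test images give a power well below the conventional 0.8 target, and `required_n` suggests a noticeably larger test set. A different assumed effect size would change both numbers, so the effect size should be estimated from the observed paired differences before drawing conclusions.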

In the future, I will experiment with several different strategies for building coresets, prioritizing random sampling and inverse CDF sampling, and compare their performance against group sampling to determine which method provides superior convergence and stable performance.
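A comparison of this kind can be prototyped on toy data before running it on image features. The sketch below contrasts group sampling (k-means groups, then a fixed fraction drawn from each group) with a uniform random baseline of the same budget. Everything here is a simplified assumption: the two-dimensional `features`, `k=2`, the 20% fraction and the helper names are hypothetical, the k-means is a plain Lloyd's iteration rather than the clustering used in the experiments, and inverse CDF sampling is left out.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, rng: random.Random,
           iters: int = 20) -> List[int]:
    """Plain Lloyd's k-means; returns a cluster index for each point."""
    centers = rng.sample(list(points), k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # leave an empty cluster's center unchanged
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


def group_sample(points: Sequence[Point], k: int, fraction: float,
                 seed: int = 0) -> List[int]:
    """Pick roughly `fraction` of each k-means group; returns point indices."""
    rng = random.Random(seed)
    labels = kmeans(points, k, rng)
    chosen: List[int] = []
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            take = max(1, round(fraction * len(members)))
            chosen.extend(rng.sample(members, take))
    return sorted(chosen)


def random_sample(n_points: int, fraction: float, seed: int = 0) -> List[int]:
    """Uniform random baseline with the same overall budget."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_points), max(1, round(fraction * n_points))))


# Toy "image features": two well-separated groups of unequal size.
features = [(0.0 + 0.01 * i, 0.0) for i in range(40)] + \
           [(10.0 + 0.01 * i, 1.0) for i in range(10)]
coreset_idx = group_sample(features, k=2, fraction=0.2)
baseline_idx = random_sample(len(features), fraction=0.2)
```

Group sampling guarantees that the small group contributes its proportional share to the coreset, whereas the random baseline may over- or under-represent it by chance. Whether that difference matters for segmentation performance is exactly what the proposed comparison would need to establish.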

5.1 Data and Code Availability

All our code for building SegViz and all other models in comparison is available at https://github.com/UM2ii/SegViz. The pre-trained models for each task are available in the same repository. The Medical Segmentation Decathlon data used to reproduce the experiments is available at http://medicaldecathlon.com/. The external BTCV test set can be downloaded from https://www.synapse.org/#!Synapse:syn3193805/wiki/89480. Please use the Abdomen.zip file.

In conclusion, the results from the SegViz experiments demonstrate the success of SegViz as a first step toward building an accurate global model using FL when the datasets in use are heterogeneous and contain partial labels. The results from the coreset experiments show that we can intelligently reduce the number of training samples required to train deep learning models successfully, thus saving the time, space, and compute costs of training large-scale deep learning models.

Bibliography

[1] H. Iqbal, “Harisiqbal88/plotneuralnet v1.0.0,” Dec. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.2526396

[2] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.

[3] J. Wasserthal, M. Meyer, H.-C. Breit, J. Cyriac, S. Yang, and M. Segeroth, “Totalsegmentator: robust segmentation of 104 anatomical structures in ct images,” arXiv preprint arXiv:2208.05868, 2022.

[4] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.

[5] M. Antonelli, A. Reinke, S. Bakas, K. Farahani, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. Menze, O. Ronneberger, R. M. Summers et al., “The medical segmentation decathlon,” Nature communications, vol. 13, no. 1, pp. 1–13, 2022.

[6] D. C. Nguyen, Q.-V. Pham, P. N. Pathirana, M. Ding, A. Seneviratne, Z. Lin, O. Dobre, and W.-J. Hwang, “Federated learning for smart healthcare: A survey,” ACM Computing Surveys (CSUR), vol. 55, no. 3, pp. 1–37, 2022.

[7] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” arXiv preprint arXiv:2003.00295, 2020.

[8] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of fedavg on non-iid data,” arXiv preprint arXiv:1907.02189, 2019.

[9] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding, “Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation,” Medical Image Analysis, vol. 63, p. 101693, 2020.

[10] A. Chowdhury, H. Kassem, N. Padoy, R. Umeton, and A. Karargyris, “A review of medical federated learning: Applications in oncology and cancer research,” in International MICCAI Brainlesion Workshop. Springer, 2022, pp. 3–24.


[11] V. S. Parekh, S. Lai, V. Braverman, J. Leal, S. Rowe, J. J. Pillai, and M. A. Jacobs, “Cross-domain federated learning in medical imaging,” arXiv preprint arXiv:2112.10001, 2021.

[12] C. Shen, P. Wang, H. R. Roth, D. Yang, D. Xu, M. Oda, W. Wang, C.-S. Fuh, P.-T. Chen, K.-L. Liu et al., “Multi-task federated learning for heterogeneous pancreas segmentation,” in Clinical Image-Based Procedures, Distributed and Collaborative Learning, Artificial Intelligence for Combating COVID-19 and Secure and Privacy-Preserving Machine Learning. Springer, 2021, pp. 101–110.

[13] A. Boutillon, P.-H. Conze, C. Pons, V. Burdin, and B. Borotikar, “Generalizable multi-task, multi-domain deep segmentation of sparse pediatric imaging datasets via multi-scale contrastive regularization and multi-joint anatomical priors,” Medical Image Analysis, vol. 81, p. 102556, 2022.

[14] C. Shen, P. Wang, D. Yang, D. Xu, M. Oda, P.-T. Chen, K.-L. Liu, W.-C. Liao, C.-S. Fuh, K. Mori et al., “Joint multi organ and tumor segmentation from partial labels using federated learning,” in International Workshop on Distributed, Collaborative, and Federated Learning, Workshop on Affordable Healthcare and AI for Resource Diverse Global Health. Springer, 2022, pp. 58–67.

[15] Q. Yu, Y. Shi, J. Sun, Y. Gao, J. Zhu, and Y. Dai, “Crossbar-net: A novel convolutional neural network for kidney tumor segmentation in ct images,” IEEE transactions on image processing, vol. 28, no. 8, pp. 4060–4074, 2019.

[16] F. Isensee, P. F. Jäger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “Automated design of deep learning methods for biomedical image segmentation,” arXiv preprint arXiv:1904.08128, 2019.

[17] R. Huang, Y. Zheng, Z. Hu, S. Zhang, and H. Li, “Multi-organ segmentation via co-training weight-averaged models from few-organ datasets,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part IV 23. Springer, 2020, pp. 146–155.

[18] S. Chen, K. Ma, and Y. Zheng, “Med3d: Transfer learning for 3d medical image analysis,” arXiv preprint arXiv:1904.00625, 2019.

[19] J. Zhang, Y. Xie, Y. Xia, and C. Shen, “Dodnet: Learning to segment multi-organ and tumors from multiple partially labeled datasets,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1195–1204.

[20] X. Xu and P. Yan, “Federated multi-organ segmentation with partially labeled data,” arXiv preprint arXiv:2206.07156, 2022.


[21] N. Heller, N. Sathianathen, A. Kalapara, E. Walczak, K. Moore, H. Kaluzniak, J. Rosenberg, P. Blake, Z. Rengel, M. Oestreich et al., “The kits19 challenge data: 300 kidney tumor cases with clinical context, ct semantic segmentations, and surgical outcomes,” arXiv preprint arXiv:1904.00445, 2019.

[22] B. Landman, Z. Xu, J. Iglesias, M. Styner, T. Langerak, and A. Klein, “Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge,” in Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, vol. 5, 2015, p. 12.

[23] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in International conference on medical image computing and computer-assisted intervention. Springer, 2016, pp. 424–432.

[24] M. J. Cardoso, W. Li, R. Brown, N. Ma, E. Kerfoot, Y. Wang, B. Murray, A. Myronenko, C. Zhao, D. Yang et al., “Monai: An open-source framework for deep learning in healthcare,” arXiv preprint arXiv:2211.02701, 2022.

[25] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.

[26] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, “Improving federated learning personalization via model agnostic meta learning,” arXiv preprint arXiv:1909.12488, 2019.

[27] P. P. Liang, T. Liu, L. Ziyin, N. B. Allen, R. P. Auerbach, D. Brent, R. Salakhutdinov, and L.-P. Morency, “Think locally, act globally: Federated learning with local and global representations,” arXiv preprint arXiv:2001.01523, 2020.

[28] B. Mirzasoleiman, J. Bilmes, and J. Leskovec, “Coresets for data-efficient training of machine learning models,” in International Conference on Machine Learning. PMLR, 2020, pp. 6950–6960.

[29] S. Har-Peled and S. Mazumdar, “On coresets for k-means and k-median clustering,” in Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, 2004, pp. 291–300.

[30] K. Wei, R. Iyer, and J. Bilmes, “Submodularity in data subset selection and active learning,” in International conference on machine learning. PMLR, 2015, pp. 1954–1963.

[31] R. Agarwal, R. Echambadi, A. M. Franco, and M. B. Sarkar, “Knowledge transfer through inheritance: Spin-out generation, development, and survival,” Academy of Management journal, vol. 47, no. 4, pp. 501–522, 2004.


[32] C. Musco and C. Musco, “Recursive sampling for the Nyström method,” Advances in neural information processing systems, vol. 30, 2017.

[33] T. Campbell and T. Broderick, “Bayesian coreset construction via greedy iterative geodesic ascent,” in International Conference on Machine Learning. PMLR, 2018, pp. 698–706.

[34] S. Har-Peled and S. Mazumdar, “Fast algorithms for computing the smallest k-enclosing circle,” Algorithmica, vol. 41, pp. 147–157, 2005.

[35] O. Bachem, M. Lucic, and A. Krause, “Practical coreset constructions for machine learning,” arXiv preprint arXiv:1703.06476, 2017.

[36] G. Zheng, S. Zhou, V. Braverman, M. A. Jacobs, and V. S. Parekh, “Selective experience replay compression using coresets for lifelong deep reinforcement learning in medical imaging,” arXiv preprint arXiv:2302.11510, 2023.

Adway Kanhere
akanher1@jhu.edu
linkedin.com/in/adwaykanhere
github.com/adwaykanhere

EDUCATION
2023 Master of Science in Engineering Degree, Biomedical Engineering
The Johns Hopkins University and School of Medicine, USA.
Specialization in Biomedical Data Science - Thesis track
2020 Bachelor of Engineering Degree, Medical Electronics
M.S Ramaiah Institute of Technology, India

PROFESSIONAL EXPERIENCE
09/2022 – present Bioinformatics Software Engineer-I
University of Maryland Medical Intelligent Imaging Center (UM2ii), USA
Developed an AI and deep learning algorithm for federated learning (FL) in Radiology - SegViz, for medical image segmentation of heterogeneous datasets with incomplete annotations.


Built and deployed an open-source web-based DICOM viewer for medical image visualization and analysis

that can connect directly to a hospital PACS and is loaded with automatic segmentation and active
learning capabilities.
Defined and maintained software pipelines for ingesting and automating data collection, pre-processing,

and AI model development workflows for several clinically translatable medical image segmentation
tasks.
Developed a software pipeline for ingesting, processing, and developing an AI model on pediatric

radiology imaging data from collaborators within the UM Medical Center.


Standardized and maintained the data and AI infrastructure using GCP, AWS, and on-prem resource management to meet the lab's high throughput and scalable computing needs.
Assisted in writing technical papers and patents to communicate key project results in high-impact

conferences such as MedNeurIPS, SIIM, and MICCAI.


06/2022 – 08/2022 Data Science Intern - Early-Clinical Development
Regeneron Pharmaceuticals, USA
Developed a deep learning based pipeline for automated segmentation of lung tumor volumes from lung

CT images.
Developed a deep learning based pipeline for progression-free survival, overall survival prediction and

response to immunotherapy for patients with non-small cell lung cancer.


Designed and deployed an internal web application based GUI to translate the trained AI model into

production
03/2022 – 05/2023 Graduate Teaching Assistant
Johns Hopkins University - Carey Business School, USA
Courses proctored
Big Data Machine Learning Spring 2022

Linear Econometrics for Finance Fall 2022


Operations Management Fall 2022


Data Analytics Fall 2022


Python for Data Analytics Spring 2023


Big Data Machine Learning Spring 2023


Empirical Finance Spring 2023


01/2022 – 12/2022 Co-Founder and CTO


FarmPlus, LLC, USA
Designed solutions for a smart cow tag - designing specialized system-on-chip (SoC), building the cloud-

native communications platform, and interfacing with the native web application for high latency wireless
transfer of sensor data at long ranges.
Raised $8,000 in pre-seed funding and won the first prize at the Johns Hopkins New Venture Challenge

2022 (HOPSTART).
12/2021 – 05/2022 Graduate Research Assistant
Johns Hopkins University & School of Medicine
Programmed a deep learning-based pipeline for transforming smartphone-based photos of chest X-Ray

images to digital X-Ray images using a Pix2Pix Generative Adversarial Network (GAN).
Programmed a deep learning-based pipeline for localizing the region of the thorax in chest X-Ray images

using Detectron2's pipeline.


Analyzed the performance of existing state-of-the-art CNN-based algorithms in TorchXRayVision to generalize the prediction of lung abnormalities on smartphone photographs of chest X-rays compared to synthetically modified digital chest X-ray images.



SKILLS
Python (Numpy, Scipy, Sklearn, Matplotlib, Seaborn) | Deep learning (Pytorch, Keras, Tensorflow) | R & R Studio
Statistical Analytics | Git & Github Version Control | Docker | AWS | GCP | SQL | Leadership
Analytical and Problem solving | Jira | Agile

CONFERENCES AND PUBLICATIONS


Optimizing Acute Stroke Segmentation: Do Additional Sequences Matter for Deep Learning Algorithms?
Kamel, P., Kanhere, A., Kulkarni, P., Parekh, V., & Yi, P. H.
2023 Society for Imaging Informatics in Medicine Annual Meeting
SegViz: A Federated Learning Framework for Medical Image Segmentation from Distributed Datasets with Different and
Incomplete Annotations
Adway U. Kanhere, Pranav Kulkarni, Paul H. Yi, Vishwa S. Parekh https://doi.org/10.48550/arXiv.2301.07074
Surgical Aggregation: A Federated Learning Framework for Harmonizing Distributed Datasets with Diverse Tasks
Pranav Kulkarni, Adway U. Kanhere, Paul H. Yi, Vishwa S. Parekh https://doi.org/10.48550/arXiv.2301.06683
From Competition to Collaboration: Making Toy Datasets on Kaggle Clinically Useful for Chest X-Ray Diagnosis Using Federated
Learning
Pranav Kulkarni, Adway U. Kanhere, Paul H. Yi, Vishwa S. Parekh - Accepted to Med-NeurIPS 2022 https://doi.org/10.48550/arXiv.2211.06212
The Impact of Standard Image Preprocessing on Deep Learning Models for Chest Radiographs: An Overlooked Source of
Performance Variability
Kargilis D, Kanhere A, Murphy Z, Hafey C, Parekh VS, Yi PH. Podium Presentation
2022 Conference on Machine Intelligence in Medical Imaging, Society for Imaging Informatics in Medicine

OPEN SOURCE CONTRIBUTIONS & PROJECTS


When are Deep Networks really better than Decision Forests at small sample sizes, and how?
Neuro Data Design - Johns Hopkins Biomedical Engineering
Assessed the conceptual & empirical comparisons between decision forests & deep networks for audio data on the FSDK-18 dataset.

Remodeled the existing codebase to standardize and speed up the loading and pre-processing of audio data.

Created a pipeline for Bayesian Hyperparameter tuning of the existing CNN models and improved the performance of the baseline CNN

from 64% to 88% accuracy. Extended the same pipeline to improve the performance of all CNN models implemented in the previous
study.
Detecting Abnormalities in chest X-Ray images with Convolutional Neural Networks
Implemented a DENSENET-181 model to identify 14 abnormalities on Chest X-ray images with transfer learning

Deployed model to an online application with an interactive interface for real-time analysis.



Design of EEG acquisition circuit and automatic error correction using deep learning
Developed an automatic error correction system using a deep neural network based on visual stimuli for a Brain-Computer Interface

(BCI) operating on EEG signals using the P300 task paradigm.


Designed an alternate elastic head cap to acquire EEG signals from the brain instead of traditional hard-mesh caps from OpenBCI.


CERTIFICATES
Deep learning Specialization AI for Medical Diagnosis
AWS Machine learning Foundations Introduction to Machine Learning in Production
Successful Negotiation: Essential Strategies and Skills

