MedSyn: Text-Guided Anatomy-Aware Synthesis

Abstract— This paper introduces an innovative methodology for producing high-quality 3D lung CT images guided by textual information. While diffusion-based generative models are increasingly used in medical imaging, current state-of-the-art approaches are limited to low-resolution outputs and underutilize the abundant information in radiology reports. Radiology reports can enhance the generation process by providing additional guidance and offering fine-grained control over the synthesis of images. Nevertheless, expanding text-guided generation to high-resolution 3D images poses significant memory and anatomical detail-preserving challenges. To address the memory issue, we introduce a hierarchical scheme that uses a modified UNet architecture. We start by synthesizing low-resolution images conditioned on the text, which serve as a foundation for subsequent generators that produce the complete volumetric data. To ensure the anatomical plausibility of the generated samples, we provide further guidance by generating vascular, airway, and lobular segmentation masks in conjunction with the CT images. The model demonstrates the capability to use textual input and segmentation tasks to generate synthesized images. Algorithmic comparative assessments and blind evaluations conducted by 10 board-certified radiologists indicate that our approach exhibits superior performance compared to the most advanced models based on GAN and diffusion techniques, especially in accurately retaining crucial anatomical features such as fissure lines and airways. This innovation opens new possibilities; this study focuses on two main objectives: (1) the development of a method for creating images based on textual prompts and anatomical components, and (2) the capability to generate new images conditioned on anatomical elements. These advances in image generation can be applied to enhance numerous downstream tasks.

Index Terms— Diffusion Model, Text-guided image generation, 3D image generation, Lung CT, Volume Synthesis with Radiology Report, Controllable Synthesis.

This work is equally contributed by Y. Xu, L. Sun and W. Peng and was partially supported by NIH Award Number 1R01HL141813-01, NSF 1839332 Tripod+X, and SAP SE. We are also grateful for the computational resources provided by Pittsburgh Supercomputing grant number TG-ASC170024.
Y. Xu, L. Sun, S. Jia and K. Batmanghelich are with the Department of Electrical and Computer Engineering, Boston University, Boston, MA 02215 (email: {yanwuxu,lisun,brucejia,batman}@bu.edu).
W. Peng is with the Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA 94305 (email: wepeng@stanford.edu).
K. Morrison, A. Perer and M. Eslami are with the Human-Computer Interaction Institute, Carnegie Mellon University, Pittsburgh, PA 15213 (email: kcmorris@cs.cmu.edu, adamperer@cmu.edu, meslami@andrew.cmu.edu).
A. Zandifar is with the University of Pittsburgh Medical Center, Pittsburgh, PA 15213 (email: zandifara@upmc.edu).
S. Visweswaran is with the Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206 (email: shv3@pitt.edu).

I. INTRODUCTION

Denoising Diffusion Probabilistic Models (DDPM) [1], also known as score-based generative models [2], have emerged as powerful tools in both computer vision and medical imaging due to their stability during training and exceptional generation quality. State-of-the-art image generation tools, such as Imagen [3] and Latent Diffusion Models (LDMs) [4], can employ text prompts to provide fine-grained guidance during image generation. This capability is promising for medical imaging, encompassing applications like privacy-preserving data generation, image augmentation, black-box uncertainty quantification, and explanation. Although methods such as RoentGen [5] have illustrated the potential of 2D cross-modality generative models conditioned on text prompts, there are no known text-guided volumetric image generation techniques for medical imaging. Extending such an approach to 3D presents challenges, including high memory demands and preserving crucial anatomical details. This paper aims to address these challenges.

To enhance the image resolution of generative models, increasing the voxel count within a fixed field of view is memory-intensive [6]. Synthesizing 3D volumetric images at high resolutions (e.g., 256 × 256 × 256) demands substantial memory since the neural network must store intermediate activations for backpropagation. To preserve essential anatomical details and condition on input text prompts, a high-capacity network is essential, further compounding the issue. Although GANs [6]–[9] and conditional Denoising Diffusion Probabilistic Models (cDPMs) [10]–[12] have set benchmarks in volumetric medical imaging synthesis, each has its limitations. GANs, despite their inference efficiency, can compromise sample diversity and sometimes produce anatomically implausible artifacts. Conversely, diffusion models, built to boost sample diversity through iterative denoising, grapple with memory constraints, primarily due to their resource-intensive 3D attention UNet. The iterative denoising coupled with the sequential sub-volume generation in cDPMs makes them time-intensive at the inference stage. As a result, many diffusion models are relegated to 2D or low-resolution volumetric applications. Incorporating text prompts, such as radiology reports, into image generation presents another challenge.
Fig. 1. Overview of our generative model, MedSyn. Using a hierarchical approach, we first generate a 64 × 64 × 64 low-resolution volume, along with its anatomical components, conditioned on Gaussian noise ϵ and the radiology report. The low-resolution volumes are then seamlessly upscaled to a detailed 256 × 256 × 256 resolution.
Generative models utilizing text prompts as conditions demand a high-capacity denoiser to map the subtle and occasionally ambiguous pathologies mentioned in the reports to visual patterns. Enhancing the capacity of the UNet further increases memory usage.

In 3D medical image synthesis, limitations often manifest as "hallucinations," leading to potential biases and inaccuracies, which are critically concerning in medical settings. Our paper concentrates on generating Computed Tomography (CT) images of the lung. Hallucinations may manifest as missing fissure lines separating lung lobes or the creation of implausible airway and vessel structures. This problem is less pronounced in X-ray image generation (e.g., RoentGen [5]) since the 2D X-ray projection obscures many fine details, like the lung's lobular structure, airways, and vessels. However, the issue becomes more pronounced in 3D image generation. Training on hundreds of thousands of 3D medical images is not feasible, necessitating a strong prior to constrain the space of generative models.

To address these challenges, we propose MedSyn, a model tailored for high-resolution, text-guided, anatomy-aware volumetric generation. Leveraging a large dataset of approximately 9,000 3D chest CT scans paired with radiology reports, we employ a hierarchical training approach for efficient cross-modality generation. Given tokenized radiology reports as input, our method initiates with a low-resolution synthesis (64 × 64 × 64), which is then fed to a super-resolution module that upscales the image to 256 × 256 × 256. We have modified the UNet to bolster the network's capacity (i.e., the number of parameters) without significantly increasing memory requirements. MedSyn enhances controllability by harnessing textual priors from radiology reports to guide synthesis. We further regularize the generator by creating segmentation masks of the lung's airways, vessels, and lobular structure as additional output channels alongside the synthetic CT images to preserve detailed anatomical features. The resultant model can condition not only on the text but also on single or paired anatomical structures. Our algorithmic experiments and blind evaluations conducted by 10 board-certified radiologists highlight the superior generative quality and efficiency of the proposed method compared to baseline techniques. To the best of our knowledge, this is the first work to empirically evaluate a text-to-medical-image model with radiologists. We also delve into the significance of the components within our proposed modules. Our code and pre-trained model are publicly available at https://github.com/batmanlab/MedSyn

II. RELATED WORKS

A. Image Synthesis Based on Text Prompt

Text-conditional image generation enables new applications and improves the diversity of generated images compared to models that are not conditioned on text. The most relevant studies in the domain of multi-modal generation include Imagen [3], Latent Diffusion (Stable Diffusion) [4], Video Diffusion [13], and RoentGen [14]. For natural images, both Imagen and Latent Diffusion pioneered text-conditional diffusion models, producing high-fidelity 2D natural images. DALL·E 2 [15] uses a pre-trained CLIP model to extract features, which guide text-to-image generation. Video Diffusion introduces a method to generate videos using score-guided sequential diffusion models, accepting text prompts as conditional inputs. In the medical imaging domain, RoentGen enhanced the Stable Diffusion prior to achieve exceptional generative quality for chest X-ray scans. TauPETGen [16] proposes to utilize text descriptions as a condition for generating 2D tau PET images. However, these methodologies cannot be seamlessly integrated into multi-modal 3D medical volume generation because of their inherent 2D orientation or the challenges associated with efficient high-resolution 3D volume creation. In addition, most previous methods adopt language models pre-trained on generic text and then fine-tune them on biomedical text. In contrast, we utilize a language encoder model [17] trained specifically on biomedical text data that learns domain-specific vocabulary and semantics.

B. Generative Models for 3D Medical Imaging

Generative models have emerged as a powerful tool in the field of medical imaging, offering a range of applications and benefits. Previous work [18], [19] leverages 3D GANs for volume generation. However, the generated images are limited to the small size of 128 × 128 × 128 or below, due to insufficient
(Fig. 2 components: a multi-channel low-resolution input and output; the efficient low-resolution base model (parameter size: 700M) built from upsample/downsample blocks with skip connections, pseudo-3D 1×3×3 convolutions, 3×3×3 3D convolutions, 2D spatial attention, temporal attention, and cross-attention over text tokens; the radiology report (Findings and Impression) is tokenized ([CLS] ... [SEP] ...) and encoded by a frozen Medical BERT.)
Fig. 2. This figure shows our efficient low-resolution generative model with clinical tokens as input. In this process, we train the denoising diffusion UNet and fix the pre-trained text feature extractor of Medical BERT. Note that our low-resolution base model has a large capacity of 700 million parameters.
B. Text-Conditioned Volume Generation

This section discusses how to incorporate the text embedding from the Medical BERT. Direct training of high-resolution diffusion models (such as 256 × 256 × 256 for a given field of view) is highly memory-demanding and thus not feasible. We propose an efficient hierarchical model with a two-phase process. In the first phase, we generate a low-resolution volume (64 × 64 × 64) conditioned on the radiology report. In the second phase, the model outputs a high-resolution 256 × 256 × 256 volume from a 3D super-resolution module, which only takes the low-resolution volume as input. The low-resolution image ensures the volumetric consistency of the final images. Using the radiology reports as a condition only for low-resolution image generation is advantageous since we do not need to compromise between resolution and model capacity.

1) Low-Resolution Volume Generation Conditioned on Reports: To generate data conditioned on specific signals c, e.g., text information, we need to reformulate the unconditional DDPM objective in Equation 4 with the conditions,

L = − Σ_{t>0} E_{q(x_0) q(x_t | x_0, c)} [ KL( q(x_{t−1} | x_t, x_0, c) || p_θ(x_{t−1} | x_t, c) ) ].  (6)

We first downsample the CT volume x_0 by 4× to produce the low-resolution volume x^l_0. Then, given the extracted features from the radiology report, we can finalize the training loss for the text-conditional diffusion model as follows:

L_low = E_{q(x_0, f_text) q(x^l_t | x^l_0)} || G(x^l_t, f_text; θ) − x^l_0 ||_2^2,  (7)

where G is the denoising model with cross-attention modules and θ denotes its parameters. We inject the text-conditional information into the denoising network, which reconstructs the original x^l_0 via a cross-attention mechanism.

2) Super-Resolution Model: Our model puts most of the learning capacity in the low-resolution module. We designed a lightweight diffusion UNet for the super-resolution module, which takes the low-resolution input and outputs the full high-resolution volume. For the super-resolution module, we design the loss to match the denoising distribution as follows:

L_sup = − Σ_{t>0} E_{q(x_0, x^l_0) q(x_t | x_0)} [ KL( q(x_{t−1} | x_t, x^l_0) || p_θ(x_{t−1} | x_t, x^l_0) ) ],

which we optimize through its approximate form

L_sup^approx = E_{q(x_0, x^l_0) q(x_t | x_0)} [ || H(x_t, x^l_0; φ) − x_0 ||_2^2 ],  (8)

where H is the super-resolution denoising module with parameters φ. Note that we do not include the additional text information in the super-resolution module, to save computational cost.
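To make the two-phase objective concrete, the following PyTorch-style sketch illustrates the denoising reconstruction losses for the low-resolution, text-conditioned model G (Eq. 7) and the super-resolution model H (Eq. 8). The module interfaces and noise schedule handling are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def low_res_loss(G, x0_low, f_text, alphas_cumprod):
    """Eq. (7): G denoises a noisy low-res volume conditioned on report features."""
    b = x0_low.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0_low.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1, 1)
    noise = torch.randn_like(x0_low)
    xt = a_bar.sqrt() * x0_low + (1 - a_bar).sqrt() * noise   # forward diffusion q(x_t | x_0)
    x0_pred = G(xt, t, f_text)                                # cross-attention on text features
    return F.mse_loss(x0_pred, x0_low)

def super_res_loss(H, x0, x0_low, alphas_cumprod):
    """Eq. (8): H denoises the full-resolution volume conditioned on the low-res volume."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    x0_pred = H(xt, t, x0_low)                                # no text conditioning here
    return F.mse_loss(x0_pred, x0)
```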
described in the clinical report, which enlarges the difficulty
where G is the denoising model with cross-attention modules of CT synthesis purely based on radiology reports. Therefore,
and θ is its parameters. We inject the text conditional infor- we propose to model the volume generation jointly with
mation into the denoising network to reconstruct the original the anatomical structure generation, all conditioned on the
xl0 via a cross-attention mechanism. radiology reports. To this end, we extract the shape information
for the core anatomical structures, i.e., the lung lobes, the airways, and the vessels. We choose commonly used pre-trained segmentation tools to provide stable shape information for these three structures: we use lungmask [26] to segment lobes from CT volumes, TotalSegmentator [27] to segment vessels, and NaviAirway [28] for airway segmentation. We denote the segmentation maps as l, a, v for the lung lobes, airways, and vessels. To enable the previous model G(·; θ) to jointly synthesize the CT volume x along with its structures l, a, v, we simply add three more channels to the input and output layers of model G, while still conditioning on the time step t. We then directly concatenate them in the channel dimension and construct a diffusion process on four-channel 3D volumes. Note that all the shape segmentations are paired with the volumes in the concatenation operation. We define the concatenation operation as concat(·) and the newly constructed volume as x′ = concat(x, l, a, v). Then, with minor modifications to Equations 7 and 8, we can write down our joint generation training objectives for the low-resolution phase and the high-resolution phase as follows:

L_low = E_{q(x′_0, f_text) q(x′_t | x′_0)} || G(x′_t, f_text; θ) − x′_0 ||_2^2,  (9)
L_sup = E_{q(x′_0, x′^l_0) q(x′_t | x′_0)} || H(x′_t, x′^l_0; φ) − x′_0 ||_2^2.  (10)
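As a concrete illustration, the four-channel volume x′ can be assembled by stacking the CT volume with its segmentation maps along the channel axis; the tensor shapes and mask scaling below are assumptions for illustration, not the released data format.

```python
import torch

def build_joint_volume(ct, lobes, airway, vessels):
    """Stack CT and anatomy masks into a 4-channel volume x' = concat(x, l, a, v).

    ct, lobes, airway, vessels: tensors of shape (B, 1, D, H, W), with the CT
    normalized to [-1, 1] and the masks rescaled to the same range.
    """
    x_prime = torch.cat([ct, lobes, airway, vessels], dim=1)  # (B, 4, D, H, W)
    return x_prime

# The joint objectives (9) and (10) reuse the losses from the earlier sketch,
# simply replacing x0 / x0_low with the concatenated x' volumes, e.g.:
# loss = low_res_loss(G, build_joint_volume(ct_low, l_low, a_low, v_low), f_text, alphas_cumprod)
```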
D. Anatomically Controllable Synthesis

Our model differs significantly from the conditional modelling of ControlNet [29]: we model the volume generation jointly with the structure information, p(x, c), while ControlNet generates data conditioned on the structure, p(x|c). Ours is more flexible, as MedSyn can still generate data when the predefined structures are unavailable, which is not feasible for ControlNet. Furthermore, if we marginalize one or multiple components in the joint denoising process, such as fixing l given a predefined lobe structure as input, we can achieve exactly what ControlNet can do. Further, if we fix x, we can obtain the structure outputs from our model, such as segmentation maps. We will show these advantages of our model in the experimental section.
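A minimal sketch of this conditioning-by-fixing idea during sampling is shown below, assuming an x0-prediction denoiser and a DDIM-style update. Clamping the known channel to its noised version at every step is one common strategy and is an illustrative assumption rather than the exact released sampler.

```python
import torch

@torch.no_grad()
def sample_with_fixed_lobes(G, f_text, lobes, timesteps, alphas_cumprod, shape):
    """DDIM-like sampling of the 4-channel volume while clamping channel 1 (lobes).

    timesteps: decreasing list of integer steps, e.g. [999, 979, ..., 0].
    """
    x = torch.randn(shape, device=lobes.device)                # (B, 4, D, H, W)
    for i, t in enumerate(timesteps):
        a_bar = alphas_cumprod[t]
        # Clamp the lobe channel to the known segmentation, noised to level t,
        # so the remaining channels are denoised consistently with it.
        x[:, 1:2] = a_bar.sqrt() * lobes + (1 - a_bar).sqrt() * torch.randn_like(lobes)
        t_batch = torch.full((shape[0],), t, device=x.device, dtype=torch.long)
        x0_pred = G(x, t_batch, f_text)                        # predict the clean 4-channel volume
        eps = (x - a_bar.sqrt() * x0_pred) / (1 - a_bar).sqrt()
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < len(timesteps) else x.new_tensor(1.0)
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
    x[:, 1:2] = lobes                                          # return with the exact lobe mask
    return x
```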
E. Efficient 3D Attention UNet

Although we propose an efficient two-phase text-to-volume generation, like other works in 3D generation [10] we still face memory issues when generating such high-resolution volumes. Sequential generation with conditional diffusion models [10], [11] is one solution, but it easily introduces new issues. Therefore, we design a new base neural architecture for much more efficient volume generation. Compared with the common 3D attention UNet [13] for video generation, we build an encoder-decoder with pure convolutions and move all the attention mechanisms to the bottom of the UNet. In this way, we propose a more efficient base model structure that drops the computational burden of the attention mechanism while still benefiting from the latent space, where the spatial resolution is much lower. This increases the parameter count by roughly 10× but largely improves computational efficiency. For the super-resolution network, we remove almost all the attention modules and keep one temporal attention at the bottom of the UNet, which makes it feasible for 256 × 256 × 256 volume inputs.
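The memory-saving building blocks referenced in Fig. 2 can be sketched as follows: a pseudo-3D convolution (an in-plane 1×3×3 convolution factorized from the depth direction) in the high-resolution stages, with full self-attention reserved for the low-resolution bottleneck. Exact channel sizes and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Factorized 3D convolution: in-plane 1x3x3 followed by a 3x1x1 pass along depth."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.depth = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):                      # x: (B, C, D, H, W)
        return self.depth(self.spatial(x))

class BottleneckAttention(nn.Module):
    """Self-attention applied only at the UNet bottom, where D*H*W is small."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, D, H, W)
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, D*H*W, C)
        normed = self.norm(tokens)
        tokens = tokens + self.attn(normed, normed, normed)[0]
        return tokens.transpose(1, 2).view(b, c, d, h, w)
```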
F. Implementation Details

To pre-train our text encoder, we use 209,683 reports without paired images and 7,728 reports with paired images as the training set. We pre-train our text encoder for 5 epochs. Our training objective for the diffusion models is a simple reconstruction loss: the ℓ2 pixel-level reconstruction between the ground truth and the prediction at different denoising time steps, without any other terms such as a perceptual loss [30]. Following the common choice for training diffusion models, we use a continuous cosine time scheduler [1]. For the training of both the low-resolution diffusion model and the super-resolution model, the number of time steps is set to 1000. During inference, we use 50 DDIM steps for the low-resolution diffusion model and 20 DDIM steps for the super-resolution model. For the optimizer, we apply AdamW [31] with a learning rate of 1 × 10^−4 and β = {0.9, 0.999}, clipping gradient norms larger than 1. We apply mixed precision when optimizing the models to make training more efficient. Gradient accumulation is applied during training to scale up the training batch size, as diffusion models are sensitive to small training batch sizes [32]. Ultimately, we train our 700M-parameter low-resolution base model with a batch size of 64 and the super-resolution model with a batch size of 32 on four NVIDIA A6000 GPUs. Both models converged after 40k iterations of training.
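The optimization recipe above can be summarized in a short PyTorch-style training loop; the accumulation factor, loss interface, and scaler usage are illustrative assumptions consistent with the stated settings.

```python
import torch

def infinite(loader):
    while True:
        for batch in loader:
            yield batch

def train(model, loader, steps=40_000, accum=8):
    """AdamW (lr 1e-4, betas 0.9/0.999), gradient-norm clipping at 1.0,
    mixed precision, and gradient accumulation for a large effective batch size."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    scaler = torch.cuda.amp.GradScaler()
    data = infinite(loader)
    for step in range(steps):
        opt.zero_grad(set_to_none=True)
        for _ in range(accum):                       # gradient accumulation
            batch = next(data)
            with torch.cuda.amp.autocast():           # mixed-precision forward/backward
                loss = model.loss(batch) / accum      # model.loss(...) is a placeholder
            scaler.scale(loss).backward()
        scaler.unscale_(opt)                           # so clipping sees true gradient norms
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(opt)
        scaler.update()
```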
V. EXPERIMENTS

This section presents a comprehensive evaluation of the proposed generative model, MedSyn. We first describe the dataset used in the experiments. Then, MedSyn is compared with state-of-the-art GANs and diffusion models, including WGAN [33], α-GAN [18], HA-GAN [6], and Medical Diffusion [34]. Extensive comparisons and analyses are then given to evaluate the effectiveness of our method, qualitatively and quantitatively. For the training of all baseline methods, we use the authors' implementations. We made minimal modifications to the code to adapt it to our dataset. All models are trained from scratch for a fair comparison with our method.

A. Dataset

We conduct experiments on a large-scale 3D dataset, which contains 3D thorax computerized tomography (CT) images and associated radiology reports from 8,752 subjects. The dataset also contains 209,683 reports without corresponding images. The images and reports were collected by the University of Pittsburgh Medical Center and have been de-identified. We randomly split our dataset of 8,752 subjects into a training set consisting of 7,728 subjects (88%) and a validation set of 1,024 subjects (12%). The images have been pre-aligned using affine registration and re-sampled to 1 mm³. We resize the images to 256 × 256 × 256. We use nearest-neighbor downsampling to reduce the scans by 4× to train the
Fig. 3. Randomly generated images (from HA-GAN and Medical Diffusion) and the real images. The first two columns show axial and coronal
slices, which use the HU range of [-1024, 600]. The last column shows the zoom-in region and uses HU range of [-1024, -250] to highlight the lung
details. Our method is the only one that can preserve delicate anatomical details, including fissures, as indicated by the arrows.
low-resolution base model. The Hounsfield units of the CT images have been calibrated, and air density correction has been applied. The Hounsfield Units (HU) are mapped to the intensity window of [−1024, 600] and normalized to [−1, 1].
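The intensity preprocessing and the 4× downsampling can be expressed as small NumPy-style helpers; the function names and nearest-neighbor striding are illustrative assumptions.

```python
import numpy as np

def preprocess_ct(hu_volume):
    """Clip calibrated HU values to the [-1024, 600] window and map them to [-1, 1]."""
    clipped = np.clip(hu_volume, -1024.0, 600.0)
    normalized = 2.0 * (clipped + 1024.0) / (600.0 + 1024.0) - 1.0
    return normalized.astype(np.float32)

def downsample_4x(volume):
    """Nearest-neighbor 4x downsampling used to build the low-resolution training target."""
    return volume[::4, ::4, ::4]
```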
B. Evaluation for Synthesis Quality

TABLE I
Quantitative comparison with different methods. Our method outperforms baseline methods in terms of distance metrics and preserves airways better.

Method              FID↓    MMD↓    Airway (×10^4 mm^3)↑
WGAN                0.070   0.094   1.07±0.64
α-GAN               0.028   0.057   1.14±0.68
HA-GAN              0.023   0.054   2.04±0.73
Medical Diffusion   0.013   0.022   1.77±0.93
Ours                0.009   0.019   3.34±1.19
Ours w/o shape      -       -       1.99±1.05
Real                -       -       4.58±1.45

1) Quantitative Evaluation: If the synthetic images are realistic, then their distribution should be indistinguishable from that of the real images. Therefore, we can quantitatively evaluate the quality of the synthetic images by measuring the distance to the real data, using the Fréchet Inception Distance (FID) [35] and Maximum Mean Discrepancy (MMD) [36]. The lower the FID/MMD value, the more similar the synthetic images are to the real ones. We use a sample size of 1,024 for computing the FID and MMD scores. For our method, we use randomly selected reports as the condition. To compute FID and MMD scores for 3D CT scans, like [6], we leverage a 3D ResNet model pre-trained on medical images [37] for feature extraction. As shown in Table I, our method achieves lower FID and MMD than the baselines, which implies that our diffusion model generates more realistic images.
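For reference, the FID reported here is the standard Fréchet distance between Gaussians fitted to feature sets, only with features extracted by a pre-trained 3D ResNet instead of a 2D Inception network; the helper below is a generic sketch of that computation, not the exact evaluation script.

```python
import numpy as np
from scipy import linalg

def fid_from_features(real_feats, fake_feats):
    """Fréchet distance between Gaussians fitted to real and synthetic feature sets.

    real_feats, fake_feats: arrays of shape (N, D) extracted, e.g., with a
    pre-trained 3D ResNet applied to real and generated CT volumes.
    """
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```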
Quantitative Evaluation on Anatomical Details: While metrics like FID and MMD are widely used in the literature and empirically work well for natural images, they highlight semantic-level similarity (distance) but may ignore subtle yet important anatomical details in medical images, as implied by the small FID/MMD gap between the different methods. Their real differences, as Fig. 3 later shows, can be much bigger when taking into account the anatomical details we focus on. Therefore, we evaluate how well the generated images preserve the anatomical details. Specifically, we
Fig. 4. Images conditionally generated with disease-related prompts. We show the real images in the first two columns. Then we extract disease-related mentions from their associated reports as conditions to generate images, which are shown in the third and fourth columns. We also show samples synthesized by conditioning on prompts with the disease description reversed in the last two columns. Four slices are shown for each subject. The generated images are conditioned on text only. The prompt pairs shown are:
- Pleural effusion: "There are large pleural effusions seen. There is no airspace opacity or pneumothorax. There is no evidence of suspicious pulmonary nodule or mass." / Reversed: "There is no airspace opacity, effusion or pneumothorax. There is no evidence of suspicious pulmonary nodule or mass."
- Consolidation: "There is extensive consolidation seen. No pulmonary nodules are noted. Bone windowed images demonstrate no lytic or blastic lesions. No evidence of pulmonary embolus." / Reversed: "No consolidation is identified. No pulmonary nodules are noted. Bone windowed images demonstrate no lytic or blastic lesions. No evidence of pulmonary embolus."
- Cardiomegaly: "There is no significant mediastinal lymphadenopathy. There is moderate cardiomegaly. The visualized upper abdominal organs are unremarkable. There is minimal perihepatic free fluid." / Reversed: "There is no significant mediastinal lymphadenopathy. There is no cardiomegaly. The visualized upper abdominal organs are unremarkable. There is minimal perihepatic free fluid."
use TotalSegmentator [38] to segment vessels and airways from the generated images and the real images, and measure their volumes. The results are shown in Table I. We also perform statistical tests (one-tailed two-sample t-tests) on the evaluation results. At the significance level of p < 0.05, the results are significant for all three conditions, which further confirms the effectiveness of our model for prompt-guided generation.
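For instance, the airway volume reported in Table I can be computed from a binary segmentation mask and the voxel spacing; the helper below is a small illustrative sketch, not the exact evaluation script.

```python
import numpy as np

def segmented_volume_mm3(mask, spacing=(1.0, 1.0, 1.0)):
    """Volume of a binary segmentation (e.g., airway or vessel mask) in mm^3.

    mask: boolean/0-1 array of shape (D, H, W); spacing: voxel size in mm.
    """
    voxel_volume = float(np.prod(spacing))
    return float(mask.astype(bool).sum()) * voxel_volume
```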
2) Qualitative Evaluation: To qualitatively analyze the results, we show examples of synthetic images from the current state-of-the-art GAN [6] and diffusion model [21]. As shown in Fig. 3, although the synthetic images from the different methods are all close to the real ones in overall appearance, only our MedSyn consistently produces anatomically plausible CT scans upon closer inspection, showcasing its superiority.

C. Evaluation for Conditional Generation

We evaluate the alignment between generated images and disease-related prompts, specifically on pleural effusion, bullous emphysema, and cardiomegaly. First, we build prompt pairs by selecting prompts from real reports that contain descriptions of a certain pathology (e.g., "There are large pleural effusions seen") and reversing the description (e.g., "There is no pleural effusion") to build its prompt pair. The prompts used here can be found in Fig. 4. Then, we use our model to generate images conditioned on the original prompts and the modified prompts, respectively. Conditioned on each prompt, we generate 32 CT volumes and perform quantitative analysis to measure the alignment between the synthetic images and the abnormality condition specified in the prompt. For pleural effusion, we use TotalSegmentator [38] to segment the effusion from the generated images and measure its volume. For bullous emphysema, we measure the %LAA-950 (the percentage of low-attenuation areas below a threshold of -950 Hounsfield units) of the generated images. For cardiomegaly, we use TotalSegmentator [38] to segment the heart and lung regions from the CT volume, and then we compute the cardiothoracic ratio (CTR) as the maximal cardiac width divided by the maximal thoracic width at the same axial scan level. The evaluation results for pleural effusion, bullous emphysema, and cardiomegaly are shown in Tables II, III, and IV, respectively. For pleural effusion, we found that when conditioning on the prompt with "large effusion," the generated images show a greater volume of pleural effusion compared to images synthesized with a prompt containing "no effusion." For bullous emphysema, we found that the generated images conditioned on prompts mentioning bullae have higher %LAA-950 values than those conditioned on prompts containing "no bullae," which suggests more severe emphysema. For cardiomegaly, we found that when conditioning on the prompt with "There is cardiomegaly," the generated images have a higher CTR, which suggests a greater degree of cardiomegaly. We also provide the distribution of CTR in Fig. 5.

TABLE II
Evaluation for conditional generation of pleural effusion. We measure the segmented volume of pleural effusion from generated images conditioned on different prompts.

Prompt type       Pleural effusion volume (L)
No effusion       0.00±.00
Large effusion    1.73±.22

TABLE III
Evaluation for conditional generation of bullous emphysema. The results show that the bullae mentioning can increase the emphysema volume in generated volumes.

Prompt type   %LAA-950
No bullae     0.019±.018
With bullae   1.4±3.5

TABLE IV
Evaluation for conditional generation of cardiomegaly. The results show that the cardiomegaly mentioning can increase the heart size in generated volumes.

Prompt type         CTR
No cardiomegaly     0.48±.06
With cardiomegaly   0.75±.24

Fig. 5. Distribution of the cardiothoracic ratio (CTR) for generated images (frequency histogram, prompts with vs. without cardiomegaly).
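The two abnormality measures used above can be written compactly; the mask encodings and axis conventions below are illustrative assumptions.

```python
import numpy as np

def laa950_percent(hu_volume, lung_mask):
    """%LAA-950: percentage of lung voxels with attenuation below -950 HU."""
    lung_hu = hu_volume[lung_mask.astype(bool)]
    return 100.0 * float((lung_hu < -950.0).sum()) / max(lung_hu.size, 1)

def cardiothoracic_ratio(heart_mask, thorax_mask):
    """CTR: maximal cardiac width divided by the thoracic width at the same axial level.

    Masks are (D, H, W) arrays with axis 0 the axial (z) direction and axis 2 the
    left-right width direction.
    """
    best_z, best_heart_w = None, 0
    for z in range(heart_mask.shape[0]):
        cols = np.where(heart_mask[z].any(axis=0))[0]          # columns containing heart
        if cols.size and cols.max() - cols.min() + 1 > best_heart_w:
            best_heart_w = cols.max() - cols.min() + 1
            best_z = z
    if best_z is None:
        return 0.0
    tcols = np.where(thorax_mask[best_z].any(axis=0))[0]
    thorax_w = tcols.max() - tcols.min() + 1 if tcols.size else 0
    return best_heart_w / thorax_w if thorax_w else 0.0
```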
For the qualitative examples generated from our model, we chose the three distinct prompts paired with negative prompts to show the prompting effect on the synthetic images. In Fig. 4, we show the volumes from the real and synthetic data with the text description, together with the synthetic data generated from the negative prompts. Our model shows the ability to generate unseen data and to control the generative process through prompting.

D. Controllable Synthesis via Conditioning on Anatomical Structures
Fig. 6. Controlled volume synthesis via the anatomical priors. The first column shows the anatomical mask used as the condition. The second
column shows the corresponding real images. The remaining columns show samples of conditionally generated images. The results show that the
generated images can preserve the conditioning anatomical structures.
SARLE labeler [40] to parse the reports in the validation set of our UPMC dataset to derive labels for lung opacity and pleural effusion. Next, we randomly sampled 100 reports parsed as positive and negative for the two diseases mentioned above. Then, we use the 200 reports as conditions and feed them into our MedSyn model to generate additional samples. Finally, we use the synthesized samples as extra samples to train the classification models. We perform 5-fold cross-validation. The results are shown in Table VI. We found that when using augmented samples from our MedSyn, the performance of the classifier improves.

TABLE VI
Evaluation for data augmentation. The baseline model is trained only with real RadChest data. We also augment the training set with 200 MedSyn-generated samples, and report the accuracy and F1 score.

Method                  Pleural effusion        Lung opacity
                        Accuracy%↑   F1↑        Accuracy%↑   F1↑
Baseline                90.7±3.2     0.79±.05   61.0±2.2     0.72±.03
Augmented w/ MedSyn     94.0±.2      0.84±.01   62.0±1.5     0.75±.00
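A schematic of this augmentation protocol is shown below; the feature representation and classifier (a simple logistic regression) are placeholders chosen for illustration, and the synthetic samples are added only to the training folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate_with_augmentation(X_real, y_real, X_syn, y_syn, seed=0):
    """5-fold CV where report-conditioned synthetic samples augment the training folds only."""
    accs, f1s = [], []
    splitter = StratifiedKFold(5, shuffle=True, random_state=seed)
    for train_idx, test_idx in splitter.split(X_real, y_real):
        X_train = np.concatenate([X_real[train_idx], X_syn])   # augment training data
        y_train = np.concatenate([y_real[train_idx], y_syn])
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        pred = clf.predict(X_real[test_idx])                    # evaluate on real data only
        accs.append(accuracy_score(y_real[test_idx], pred))
        f1s.append(f1_score(y_real[test_idx], pred))
    return float(np.mean(accs)), float(np.mean(f1s))
```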
F. Evaluation by Radiologists

To complement the evaluations of our method, we designed a blind evaluation survey that elicits board-certified radiologists' opinions on the anatomical feasibility of the structures generated by our approach, in comparison with those generated by existing methods, including Medical Diffusion [34] and HA-GAN [6], and with real CT scans. The survey also measures how accurately radiologists can recognize pathologies in CT scans generated by pathology prompts using our method. To achieve a blind evaluation of the generated CT scans, we intentionally did not mention to the radiologists that some of the CT scans they were about to interpret were generated by AI, to avoid results biased by potentially negative perceptions towards AI.

Overall, our findings from the survey with 10 radiologists with varying years of experience (4–23 years) reveal that radiologists can correctly recognize the pathologies defined by the prompt in the CT scans generated by our method with high accuracy. Additionally, our findings reveal that our method generates CT scans with fissure lines and lobe structures that are significantly more anatomically feasible than those in CT scans generated by the Medical Diffusion [34] and HA-GAN [6] methods. Our findings also reveal that our method generates CT scans with airway structures that are statistically indistinguishable from those in real CT scans and more anatomically feasible than those in CT scans generated by the Medical Diffusion method. We expand upon the experiment design that led to these findings and additional statistical analyses below.

1) Experiment Design: We designed an online survey to elicit radiologists' opinions on how accurately our method represents different diseases and how our method compares to generated CT images from related works and real CT images. To explore this, our survey consisted of pathology recognition and anatomical feasibility evaluation tasks. We designed an interface for both tasks to simulate how radiologists view CT scans by showing them the axial, coronal, and sagittal views. Our interface, built on a medical research image viewer called Papaya (https://github.com/rii-mango/Papaya), is embedded into a Qualtrics survey. We chose to use Papaya due to the limitations of Qualtrics: professional viewers such as ITK-SNAP or 3D Slicer cannot be embedded into an anonymized survey platform like Qualtrics. Hence, we had limitations on the contrast window that radiologists use when viewing vessels, and the evaluation of the vessels therefore has limitations in terms of visualization. Our interface did allow radiologists to adjust the contrast of the images and to swap the main image with one of the other two views.

All radiologists were provided an instruction video to show them how to interact with the interface before they started the survey. During the instruction video, all radiologists were told that they would review CT scans "that may belong to different patients and were acquired based on different image acquisition devices and image reconstruction methods". Additionally, one board-certified radiologist and one of the authors were available while the participant took the survey to help address any confusion or technical issues.

We recruited 10 board-certified radiologists with varying years of experience to participate in our study. All radiologists were recruited through our professional network.

a) Pathology Recognition Task: The pathology recognition task in the survey asks radiologists to identify the most prominent finding in the CT scan from a selection of options: consolidation, pleural effusion, cardiomegaly, no abnormalities, and other abnormalities not listed. They could leave a note in an optional text field if they felt multiple prominent or other findings were present. Radiologists are shown six CT scans in total: five CT scans generated by our method and one real CT scan. We generated five CT scans, including one for cardiomegaly, one for consolidation, two for pleural effusion, and one for no abnormalities. The real CT scan shown is randomly selected to present one of those conditions. The six CT scans are shown to the radiologist in a random order.

b) Anatomical Feasibility Task: The anatomical feasibility task specifically asks radiologists to rank four CT scans based on how well they preserve a given anatomical structure, considering which looks most like a real image, where 1 is best preserved and 4 is least preserved. This task evaluates the realism of three categories of anatomical structures: the lobe structure along with the fissure lines, the vessel structures, and the airway structures. For each category, we show the radiologists four different sets of four CT scans to rank. The four sets and the four CT scans within a given set are presented in a randomized order. Lastly, the anatomical structure to be identified is shown in a randomized order.

2) Analysis: For the pathology recognition task, we calculate the total number of radiologists that correctly identify the pathology used in the prompt to generate that CT scan. We cross-check the radiologists' open-ended responses for correctness with a radiologist within our professional network.
For the anatomical feasibility task, we collect four data points from each participant for the rank of each method in each of the three categories (lobe structure along with lung fissures, vessel structure, and airway structure). With this ranked data, we calculate the frequency with which participants assigned each method to each rank and calculate the mean rank. We use a non-parametric Friedman test to determine whether the mean ranks are significantly different from each other and use a Bonferroni correction to adjust the p-values for multiple comparisons. We set our significance threshold at p ≤ 0.05. We calculate Kendall's Coefficient of Concordance to determine how consistent the rankings are across the radiologists.
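The ranking analysis can be sketched with standard SciPy routines; the choice of a Wilcoxon signed-rank test for the pairwise post-hoc comparisons is an illustrative assumption, since the paper does not name the specific pairwise test.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def analyze_rankings(ranks):
    """ranks: (n_radiologists, n_methods) array of ranks (1 = best) for one structure."""
    n, k = ranks.shape
    # Friedman test on the paired rankings across methods.
    friedman_stat, friedman_p = stats.friedmanchisquare(*[ranks[:, j] for j in range(k)])
    # Kendall's coefficient of concordance W, derived from the Friedman statistic.
    kendalls_w = friedman_stat / (n * (k - 1))
    # Pairwise post-hoc comparisons with Bonferroni correction.
    pairs = list(combinations(range(k), 2))
    pairwise_p = {}
    for i, j in pairs:
        _, p = stats.wilcoxon(ranks[:, i], ranks[:, j], zero_method="zsplit")
        pairwise_p[(i, j)] = min(1.0, p * len(pairs))           # Bonferroni-adjusted
    return friedman_p, kendalls_w, ranks.mean(axis=0), pairwise_p
```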
3) Findings: The radiologists who completed the survey included four senior-level radiologists with more than 15 years of experience in radiology and six junior-level radiologists with four to eight years of experience in radiology. One of the radiologists is currently in their residency, and another is a faculty member. The survey took the radiologists 37.47 minutes on average to complete (std = 12.06 minutes). Below are additional statistics on each task.

a) Pathology Recognition Task: Figure 7 shows that every radiologist's interpretation was consistent with the pathology of the real CT scan and with the prompted pathology in the generated CT scan representing consolidation. Nine of ten radiologists correctly interpreted the pathology for the CT scans generated with a pathology prompt for cardiomegaly, pleural effusion, and normal conditions. Eight out of 10 radiologists correctly recognized the other pleural effusion case. One radiologist interpreted this CT scan to present consolidation, while another mentioned another abnormality that was not listed.

(Fig. 7: number of radiologists interpreting each CT scan correctly.)

patients can have with the imaging machines. The W from Kendall's Coefficient Test is 0.528, indicating moderate agreement among the rankings provided by the radiologists for this structure (p < .001), validating the effect of these findings.

Fig. 8. The mean ranks for each method and each structure. We show the p-values for our method compared to the other methods and the real CT images. P-values for comparisons that do not include our method are not included. (Methods: HA-GAN, Medical Diffusion, Our Method, Real CT; categories: lobe structure & lung fissures, airway structure, vessel structure; significance levels: p < 0.1 *, p < 0.05 **, p < 0.001 ***.)

For the airway structures, CT scans generated by our method contain airway structures that are significantly more anatomically feasible than those generated by the Medical Diffusion method (p < .05). Interestingly, the airway structures generated by our method are statistically indistinguishable from those in real CT scans (p = 0.1). The radiologists, on average, ranked the anatomical feasibility of the airway structures in the real CT images as 1.6 and in our method as 2.28 (see Figure 8).

For the vessel structure, the real CT scans are significantly better than all three methods (p = 0.0), and our method is indistinguishable from the Medical Diffusion and HA-GAN
using a language model specifically pre-trained on biomedical text improves the sensitivity of the diffusion model to text prompts.

TABLE VII
Evaluation for conditional generation of pleural effusion. Our biomedical language model is more sensitive to the effusion mentioning in the prompt.

Measured effusion volume (mL)   Generic pre-trained LM   Biomedical pre-trained LM
Prompting w/o effusion          0.6±1.4                  0.0±0.0
Prompting w/ effusion           2.4±5.9                  1725.7±219.5

VI. DISCUSSION

In this study, we achieve high-fidelity, anatomy-aware synthesis of volumetric lung CT scans using guidance from radiology reports. Nonetheless, our model has certain limitations. First, the anatomical structures of the lobes, airways, and vessels are derived from pre-trained segmentation networks. This method may not always align with the ground truth, particularly concerning detailed airway and vessel structures. Second, the radiology reports only provide a condition on the low-resolution images. Therefore, if the text condition mentions subtle changes that require high resolution, the model will likely be unable to generate them. In other words, our current model is good at generating global and large-scale changes. While our approach demonstrates remarkable adaptability in generating volumetric lung CT scans from radiology reports, evaluating more intricate lung diseases remains challenging due to the complexities presented in the reports. Such evaluations might best be deferred to radiology experts. Therefore, we conducted a blind user study with radiologists in Section V-F, which verified the quality and fidelity of the generated images. During inference, our models support both generation with only one conditioning type and generation with both text and shape conditioning. If the user thinks there might be a conflict between the text prompt and the anatomical shape conditioning, he/she can choose to use only one conditioning.

The existing conditional generative models utilized in medical imaging have limited capabilities. They can either accommodate discrete conditions (such as the presence or absence of a disease) or are limited to only 2D images (i.e., X-ray images) when conditioned on text. Utilizing free-style text as a condition can yield substantial enhancements in the diversity of the generated samples. One can control the pathology and the anatomical location, severity, size, and many other aspects of the pathology. This technique could mitigate the longstanding issue of releasing extensive medical imaging datasets. For example, collecting datasets for rare pathologies is challenging. Synthetic samples from a well-trained generative model conditioned on a radiology report can be viewed as the second-best replacement in such a scenario. While our approach cannot entirely substitute the release of the real data, collaborative efforts within the medical imaging community to refine this model on diverse datasets and share it can significantly mitigate this issue in the foreseeable future.

Our goal is to enhance the memory efficiency of the diffusion model without compromising its fidelity. Memory demand poses a significant bottleneck for generating high-resolution 3D medical images. To address this challenge, we introduce a two-stage stacked diffusion model. Unlike latent diffusion models (LDMs), our hierarchical approach conducts denoising operations directly on image pixels, offering finer control over image generation and preserving higher-fidelity details. Our hierarchical diffusion model incurs higher computational costs when compared to LDMs. There exists a natural trade-off between fidelity and computational efficiency. However, in our scenario, the advantages of heightened fidelity surpass the disadvantages of slower generation times. Recently, there have been advancements in methods [42], [43] that markedly decrease the necessary denoising steps, sometimes even to just one step. We anticipate that integrating these methods could narrow the disparity between our approach and LDMs, and we leave this as a prospect for our future work.

The proposed model has several fascinating applications that could be pursued in the future. Data augmentation is the most obvious application of conditional generative models [44], [45]. One can use the generative model as a building block for model explanation, as suggested in [46]–[48]. Generated samples can be used to audit the uncertainty of pre-trained deep learning models by conditioning on the pathology, changing various aspects of the anatomy, and assessing the DL model's output distribution. One can deploy such an approach to evaluate and improve the out-of-sample distribution of the DL model for various tasks such as classification and segmentation [49]–[52]. Since our model can condition on anatomical segmentation and generate a consistent volumetric image, one can use synthetic data to train a data-free and robust segmentation method similar to [53], [54].

VII. CONCLUSION

Our research takes a significant step forward by synthesizing high-resolution 3D CT lung scans guided by detailed radiological and anatomical information. While GANs and cDPMs have set benchmarks, they come with inherent limitations, particularly when generating intricate chest CT scan details. Our proposed MedSyn model addresses these challenges using a comprehensive dataset and a hierarchical training approach. Innovative architectural designs not only overcome previous constraints but also pioneer anatomy-conscious volumetric generation. Future work can leverage our model to enhance clinical applications.

REFERENCES

[1] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[2] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," in International Conference on Learning Representations, 2021.
[3] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, "Photorealistic text-to-image diffusion models with deep language understanding," 2022.
[4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
[5] P. Chambon, C. Bluethgen, J.-B. Delbrouck, R. Van der Sluijs, M. Połacin, J. M. Z. Chaves, T. M. Abraham, S. Purohit, C. P. Langlotz, and A. Chaudhari, "Roentgen: Vision-language foundation model for chest x-ray generation," arXiv preprint arXiv:2211.12737, 2022.
[6] L. Sun, J. Chen, Y. Xu, M. Gong, K. Yu, and K. Batmanghelich, "Hierarchical amortized gan for 3d high resolution medical image synthesis," IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 8, pp. 3966–3975, 2022.
[7] H. Shan, Y. Zhang, Q. Yang, U. Kruger, M. K. Kalra, L. Sun, W. Cong, and G. Wang, "3-d convolutional encoder-decoder network for low-dose ct via transfer learning from a 2-d trained network," IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1522–1534, 2018.
[8] W. Jin, M. Fatehi, K. Abhishek, M. Mallya, B. Toyota, and G. Hamarneh, "Applying artificial intelligence to glioma imaging: Advances and challenges," arXiv preprint arXiv:1911.12886, 2019.
[9] M. D. Cirillo, D. Abramian, and A. Eklund, "Vox2vox: 3d-gan for brain tumour segmentation," arXiv preprint arXiv:2003.13653, 2020.
[10] W. Peng, E. Adeli, T. Bosschieter, S. H. Park, Q. Zhao, and K. M. Pohl, "Generating realistic brain mris via a conditional diffusion probabilistic model," 2023.
[11] J. S. Yoon, C. Zhang, H.-I. Suk, J. Guo, and X. Li, "SADM: Sequence-aware diffusion model for longitudinal medical image generation," in Lecture Notes in Computer Science. Springer Nature Switzerland, 2023, pp. 388–400.
[12] W. H. Pinaya, M. S. Graham, E. Kerfoot, P.-D. Tudosiu, J. Dafflon, V. Fernandez, P. Sanchez, J. Wolleb, P. F. da Costa, A. Patel et al., "Generative ai for medical imaging: extending the monai framework," arXiv preprint arXiv:2307.15208, 2023.
[13] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, "Video diffusion models," arXiv:2204.03458, 2022.
[14] P. Chambon, C. Bluethgen, J.-B. Delbrouck, R. Van der Sluijs, M. Połacin, J. M. Z. Chaves, T. M. Abraham, S. Purohit, C. P. Langlotz, and A. Chaudhari, "Roentgen: vision-language foundation model for chest x-ray generation," arXiv preprint arXiv:2211.12737, 2022.
[15] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical text-conditional image generation with clip latents," arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022.
[16] S.-I. Jang, C. Lois, E. Thibault, J. A. Becker, Y. Dong, M. D. Normandin, J. C. Price, K. A. Johnson, G. E. Fakhri, and K. Gong, "Taupetgen: Text-conditional tau pet image synthesis based on latent diffusion models," arXiv preprint arXiv:2306.11984, 2023.
[17] B. Boecking, N. Usuyama, S. Bannur, D. C. Castro, A. Schwaighofer, S. Hyland, M. Wetscherek, T. Naumann, A. Nori, J. Alvarez-Valle et al., "Making the most of text semantics to improve biomedical vision–language processing," in European Conference on Computer Vision. Springer, 2022, pp. 1–21.
[18] G. Kwon, C. Han, and D.-s. Kim, "Generation of 3d brain mri using auto-encoding generative adversarial networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 118–126.
[19] S. Xing, H. Sinha, and S. J. Hwang, "Cycle consistent embedding of 3d brains with auto-encoding generative adversarial networks," in Medical Imaging with Deep Learning, 2021.
[20] K. Han, Y. Xiong, C. You, P. Khosravi, S. Sun, X. Yan, J. Duncan, and X. Xie, "Medgen3d: A deep generative framework for paired 3d image and mask generation," 2023.
[21] F. Khader, G. Mueller-Franzes, S. T. Arasteh, T. Han et al., "Denoising diffusion probabilistic models for 3d medical image generation," Scientific Reports, vol. 13, no. 7303, 2023.
[22] P. Esser, R. Rombach, and B. Ommer, "Taming transformers for high-resolution image synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12873–12883.
[23] W. H. Pinaya, P.-D. Tudosiu, J. Dafflon, P. F. Da Costa, V. Fernandez, P. Nachev, S. Ourselin, and M. J. Cardoso, "Brain imaging generation with latent diffusion models," in MICCAI Workshop on Deep Generative Models. Springer, 2022, pp. 117–126.
[24] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in International Conference on Machine Learning. PMLR, 2015, pp. 2256–2265.
[25] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[26] J. Hofmanninger, F. Prayer, J. Pan, S. Röhrich, H. Prosch, and G. Langs, "Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem," European Radiology Experimental, vol. 4, no. 1, pp. 1–13, 2020.
[27] J. Wasserthal, H.-C. Breit, M. T. Meyer, M. Pradella, D. Hinck, A. W. Sauter, T. Heye, D. Boll, J. Cyriac, S. Yang et al., "Totalsegmentator: robust segmentation of 104 anatomical structures in ct images," arXiv preprint arXiv:2208.05868, 2022.
[28] A. Wang, T. C. C. Tam, H. M. Poon, K.-C. Yu, and W.-N. Lee, "Naviairway: a bronchiole-sensitive deep learning-based airway segmentation pipeline," arXiv preprint arXiv:2203.04294, 2022.
[29] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," 2023.
[30] S. Lin and X. Yang, "Diffusion model with perceptual loss," arXiv preprint arXiv:2401.00110, 2023.
[31] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2019.
[32] S. Ghalebikesabi, L. Berrada, S. Gowal, I. Ktena, R. Stanforth, J. Hayes, S. De, S. L. Smith, O. Wiles, and B. Balle, "Differentially private diffusion models generate useful synthetic images," arXiv preprint arXiv:2302.13861, 2023.
[33] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of wasserstein gans," in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
[34] F. Khader, G. Mueller-Franzes, S. T. Arasteh, T. Han, C. Haarburger, M. Schulze-Hagen, P. Schad, S. Engelhardt, B. Baessler, S. Foersch, J. Stegmaier, C. Kuhl, S. Nebelung, J. N. Kather, and D. Truhn, "Medical diffusion - denoising diffusion probabilistic models for 3d medical image generation," 2022. [Online]. Available: https://arxiv.org/abs/2211.03364
[35] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "Gans trained by a two time-scale update rule converge to a local nash equilibrium," in Advances in Neural Information Processing Systems, 2017.
[36] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, "A kernel two-sample test," Journal of Machine Learning Research, vol. 13, no. Mar, pp. 723–773, 2012.
[37] S. Chen, K. Ma, and Y. Zheng, "Med3d: Transfer learning for 3d medical image analysis," arXiv preprint arXiv:1904.00625, 2019.
[38] J. Wasserthal, H.-C. Breit, M. T. Meyer, M. Pradella, D. Hinck, A. W. Sauter, T. Heye, D. T. Boll, J. Cyriac, S. Yang, M. Bach, and M. Segeroth, "Totalsegmentator: Robust segmentation of 104 anatomic structures in ct images," Radiology: Artificial Intelligence, vol. 5, no. 5, p. e230024, 2023.
[39] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, "Gan-based synthetic medical image augmentation for increased cnn performance in liver lesion classification," Neurocomputing, vol. 321, pp. 321–331, 2018.
[40] R. L. Draelos, D. Dov, M. A. Mazurowski, J. Y. Lo, R. Henao, G. D. Rubin, and L. Carin, "Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes," Medical Image Analysis, vol. 67, p. 101857, 2021.
[41] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[42] X. Liu, X. Zhang, J. Ma, J. Peng, and Q. Liu, "Instaflow: One step is enough for high-quality diffusion-based text-to-image generation," arXiv preprint arXiv:2309.06380, 2023.
[43] Y. Xu, M. Gong, S. Xie, W. Wei, M. Grundmann, T. Hou et al., "Semi-implicit denoising diffusion models (siddms)," arXiv preprint arXiv:2306.12511, 2023.
[44] X. Chen, Y. Li, L. Yao, E. Adeli, and Y. Zhang, "Generative adversarial u-net for domain-free medical image augmentation," arXiv preprint arXiv:2101.04793, 2021.
[45] H.-C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem, J. L. Gunter, K. P. Andriole, and M. Michalski, "Medical image synthesis for data augmentation and anonymization using generative adversarial networks," in International Workshop on Simulation and Synthesis in Medical Imaging. Springer, 2018, pp. 1–11.
[46] S. Singla, M. Eslami, B. Pollack, S. Wallace, and K. Batmanghelich, "Explaining the black-box smoothly—a counterfactual approach," Medical Image Analysis, vol. 84, p. 102721, 2023.
[47] H. Montenegro, W. Silva, and J. S. Cardoso, "Privacy-preserving generative adversarial network for case-based explainability in medical image analysis," IEEE Access, vol. 9, pp. 148037–148047, 2021.
[48] C. Mauri, S. Cerri, O. Puonti, M. Mühlau, and K. Van Leemput, "Accurate and explainable image-based prediction using a lightweight generative model," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 448–458.