0% found this document useful (0 votes)

34 views14 pages

MedSyn Text-Guided Anatomy-Aware Synthesis

Uploaded by

wxiaolong2

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

34 views14 pages

MedSyn Text-Guided Anatomy-Aware Synthesis

Uploaded by

wxiaolong2

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. XX, NO.

XX, XXXX 2020 1

MedSyn: Text-guided Anatomy-aware Synthesis

of High-Fidelity 3D CT Images
Yanwu Xu, Li Sun, Wei Peng, Shuyue Jia, Katelyn Morrison, Adam Perer, Afrooz Zandifar, Shyam
Visweswaran, Motahhare Eslami, and Kayhan Batmanghelich
arXiv:2310.03559v6 [eess.IV] 15 Oct 2024

Abstract— This paper introduces an innovative method- with Radiology Report, Controllable Synthesis.
ology for producing high-quality 3D lung CT images guided
by textual information. While diffusion-based generative
models are increasingly used in medical imaging, current I. I NTRODUCTION
state-of-the-art approaches are limited to low-resolution
outputs and underutilize radiology reports’ abundant infor- Denoising Diffusion Probabilistic Models (DDPM) [1], also
mation. The radiology reports can enhance the generation known as score-based generative models [2], have emerged as
process by providing additional guidance and offering fine- powerful tools in both computer vision and medical imaging
grained control over the synthesis of images. Nevertheless, due to their stability during training and exceptional gen-
expanding text-guided generation to high-resolution 3D
images poses significant memory and anatomical detail- eration quality. State-of-the-art image generation tools, such
preserving challenges. Addressing the memory issue, we as Imagen [3] and latent Diffusion Models (LDMs) [4], can
introduce a hierarchical scheme that uses a modified UNet employ text prompts to provide fine-grained guidance during
architecture. We start by synthesizing low-resolution im- image generation. This capability is promising for medical
ages conditioned on the text, serving as a foundation for imaging, encompassing applications like privacy-preserving
subsequent generators for complete volumetric data. To en-
sure the anatomical plausibility of the generated samples, data generation, image augmentation, black box uncertainty
we provide further guidance by generating vascular, airway, quantification, and explanations. Although methods such as
and lobular segmentation masks in conjunction with the CT RoentGen [5] have illustrated the potential of 2D cross-
images. The model demonstrates the capability to use tex- modality generative models conditioned on text prompts,
tual input and segmentation tasks to generate synthesized there are no known text-guided volumetric image generation
images. Algorithmic comparative assessments and blind
evaluations conducted by 10 board-certified radiologists techniques for medical imaging. Extending such an approach
indicate that our approach exhibits superior performance to 3D presents challenges, including high memory demands
compared to the most advanced models based on GAN and and preserving crucial anatomical details. This paper aims to
diffusion techniques, especially in accurately retaining cru- address these challenges.
cial anatomical features such as fissure lines and airways. To enhance the image resolution for generative models,
This innovation introduces novel possibilities. This study
focuses on two main objectives: (1) the development of a increasing the voxel count within a fixed field of view is
method for creating images based on textual prompts and memory-intensive [6]. Synthesizing 3D volumetric images at
anatomical components, and (2) the capability to generate high resolutions (e.g., 256 × 256 × 256) demands substantial
new images conditioning on anatomical elements. The ad- memory since the neural network must store intermediate acti-
vancements in image generation can be applied to enhance vations for backpropagation. To preserve essential anatomical
numerous downstream tasks.
details and condition on input text prompts, a high-capacity
Index Terms— Diffusion Model, Text-guided image gen- network is essential, further compounding the issue. Although
eration, 3D image generation, Lung CT, Volume Synthesis GANs [6]–[9] and conditional Denoising Diffusion Proba-
This work is equally contributed by Y. Xu, L. Sun and W. Peng bilistic Models (cDPMs) [10]–[12] have set benchmarks in
and was partially supported by NIH Award Number 1R01HL141813-01, volumetric medical imaging synthesis, each has its limitations.
NSF 1839332 Tripod+X, and SAP SE. We were also grateful for the GANs, despite their inference efficiency, can compromise
computational resources provided by Pittsburgh SuperComputing grant
number TG-ASC170024. sample diversity and sometimes produce anatomically implau-
Y. Xu, L. Sun, S. Jia and K. Batmanghelich are with Department of sible artifacts. Conversely, diffusion models, built to boost
Electrical and Computer Engineering, Boston University, Boston, MA sample diversity through iterative denoising, grapple with
02215 (email: {yanwuxu,lisun,brucejia,batman}@bu.edu)
W. Peng are with Department of Psychiatry and Behavioral Sciences, memory constraints, primarily due to their resource-intensive
Stanford University, Stanford, CA 94305 (email: wepeng@stanford.edu) 3D attention UNet. The iterative denoising coupled with the
K. Morrison, A. Perer and M. Eslami are with Human-Computer sequential sub-volume generation in cDPMs makes them time-
Interaction Institute, Carnegie Mellon University, Pittsburgh, PA
15213 (email: kcmorris@cs.cmu.edu, adamperer@cmu.edu, mes- intensive at the inference stage. As a result, many diffusion
lami@andrew.cmu.edu) models are relegated to 2D or low-resolution volumetric
A. Zandifar is with the University of Pittsburgh Medical Center, applications. Incorporating text prompts, such as radiology
Pittsburgh, PA 15213 (email: zandifara@upmc.edu)
S. Visweswaran is with the Department of Biomedical Informatics, reports, into image generation presents another challenge.
University of Pittsburgh, Pittsburgh, PA 15206 (email: shv3@pitt.edu ) Generative models utilizing text prompts as conditions demand
2 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. XX, NO. XX, XXXX 2020

<latexit sha1_base64="Z/unh1x4cXYQQO0kKcDW8s/tX3Q=">AAACAHicbVDLSsNAFJ34rPUVdeHCzWARXJVEirosuHElFewDmlAm00k7dB5hZiKUkI2/4saFIm79DHf+jZM2C209cOFwzr3ce0+UMKqN5307K6tr6xubla3q9s7u3r57cNjRMlWYtLFkUvUipAmjgrQNNYz0EkUQjxjpRpObwu8+EqWpFA9mmpCQo5GgMcXIWGngHgck0ZRJEWjKA47MGCOW3eUDt+bVvRngMvFLUgMlWgP3KxhKnHIiDGZI677vJSbMkDIUM5JXg1STBOEJGpG+pQJxosNs9kAOz6wyhLFUtoSBM/X3RIa41lMe2c7iRL3oFeJ/Xj818XWYUZGkhgg8XxSnDBoJizTgkCqCDZtagrCi9laIx0ghbGxmVRuCv/jyMulc1P3Lun/fqDUbZRwVcAJOwTnwwRVoglvQAm2AQQ6ewSt4c56cF+fd+Zi3rjjlzBH4A+fzB5WIlwA=</latexit>

✏⇠N CT Scans Lobe

D 50 DDIM Steps 20 DDIM Steps
<latexit sha1_base64="sefQuGDZBMi+Y7CX6QJSzDa6HQs=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0hE1GNBDx5bsB/QhrLZTtq1m03Y3Qil9Bd48aCIV3+SN/+N2zYHbX0w8Hhvhpl5YSq4Np737RTW1jc2t4rbpZ3dvf2D8uFRUyeZYthgiUhUO6QaBZfYMNwIbKcKaRwKbIWj25nfekKleSIfzDjFIKYDySPOqLFS/a5XrniuNwdZJX5OKpCj1it/dfsJy2KUhgmqdcf3UhNMqDKcCZyWupnGlLIRHWDHUklj1MFkfuiUnFmlT6JE2ZKGzNXfExMaaz2OQ9sZUzPUy95M/M/rZCa6CSZcpplByRaLokwQk5DZ16TPFTIjxpZQpri9lbAhVZQZm03JhuAvv7xKmheuf+X69ctK1c3jKMIJnMI5+HANVbiHGjSAAcIzvMKb8+i8OO/Ox6K14OQzx/AHzucPk12Mug==</latexit>

4⇥
<latexit sha1_base64="ylWFu7m+G5ccOJPzTa/Jn/dfJbk=">AAAB7nicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkqMeCF48V7Ae0oWy2m3bpZhN2J0IJ/RFePCji1d/jzX/jNs1BWx8MPN6bYWZekEhh0HW/ndLG5tb2Tnm3srd/cHhUPT7pmDjVjLdZLGPdC6jhUijeRoGS9xLNaRRI3g2mdwu/+8S1EbF6xFnC/YiOlQgFo2ilbmOAIuJmWK25dTcHWSdeQWpQoDWsfg1GMUsjrpBJakzfcxP0M6pRMMnnlUFqeELZlI5531JF7RI/y8+dkwurjEgYa1sKSa7+nshoZMwsCmxnRHFiVr2F+J/XTzG89TOhkhS5YstFYSoJxmTxOxkJzRnKmSWUaWFvJWxCNWVoE6rYELzVl9dJ56ruXde9h0at2SjiKMMZnMMleHADTbiHFrSBwRSe4RXenMR5cd6dj2VrySlmTuEPnM8fKPKPag==</latexit>

4
<latexit sha1_base64="Hz2YAGyfiZe+pJ9WM8c6f41F0vU=">AAAB5HicbVBNS8NAEJ3Urxq/qlcvi0XwFBIR9Vjw4rGC/YA2lM120q7dbMLuRiihv8CLB8Wrv8mb/8Ztm4O2Phh4vDfDzLwoE1wb3/92KhubW9s71V13b//g8KjmHrd1miuGLZaKVHUjqlFwiS3DjcBuppAmkcBONLmb+51nVJqn8tFMMwwTOpI85owaKz1cDWp13/MXIOskKEkdSjQHta/+MGV5gtIwQbXuBX5mwoIqw5nAmdvPNWaUTegIe5ZKmqAOi8WhM3JulSGJU2VLGrJQf08UNNF6mkS2M6FmrFe9ufif18tNfBsWXGa5QcmWi+JcEJOS+ddkyBUyI6aWUKa4vZWwMVWUGZuNa0MIVl9eJ+1LL7j2gnrDK8OowimcwQUEcAMNuIcmtIABwgu8wbvz5Lw6H8vGilNOnMAfOJ8/ENeLgA==</latexit>

H
<latexit sha1_base64="9X8iUiT/mvYJLqfhXp9X67h8vE4=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE1GPBS48t2A9oQ9lsJ+3azSbsboQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ/dzvPKHSPJYPZpqgH9GR5CFn1FipWR+UK27VXYCsEy8nFcjRGJS/+sOYpRFKwwTVuue5ifEzqgxnAmelfqoxoWxCR9izVNIItZ8tDp2RC6sMSRgrW9KQhfp7IqOR1tMosJ0RNWO96s3F/7xeasI7P+MySQ1KtlwUpoKYmMy/JkOukBkxtYQyxe2thI2poszYbEo2BG/15XXSvqp6N1WveV2pVfM4inAG53AJHtxCDerQgBYwQHiGV3hzHp0X5935WLYWnHzmFP7A+fwBmW2Mvg==</latexit>

W Super-Res 3D UNet
<latexit sha1_base64="GsjgkyrHYukikZCL537vzxwbWhY=">AAAB6HicbVBNS8NAEJ34WetX1aOXxSJ4ComIeix48diC/YA2lM120q7dbMLuRiihv8CLB0W8+pO8+W/ctjlo64OBx3szzMwLU8G18bxvZ219Y3Nru7RT3t3bPzisHB23dJIphk2WiER1QqpRcIlNw43ATqqQxqHAdji+m/ntJ1SaJ/LBTFIMYjqUPOKMGis12v1K1XO9Ocgq8QtShQL1fuWrN0hYFqM0TFCtu76XmiCnynAmcFruZRpTysZ0iF1LJY1RB/n80Ck5t8qARImyJQ2Zq78nchprPYlD2xlTM9LL3kz8z+tmJroNci7TzKBki0VRJohJyOxrMuAKmRETSyhT3N5K2IgqyozNpmxD8JdfXiWtS9e/dv3GVbXmFnGU4BTO4AJ8uIEa3EMdmsAA4Rle4c15dF6cd+dj0brmFDMn8AfO5w+wKYzN</latexit>

Low-Res 3D UNet
Vessel Airway
xlT xl0
<latexit sha1_base64="l+WI+Esq/Rh8nbWMdXgkcYIdu2I=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkqMeCF48VmlZoY9lsp+3SzSbsbsQS+hu8eFDEqz/Im//GbZuDtj4YeLw3w8y8MBFcG9f9dgpr6xubW8Xt0s7u3v5B+fCopeNUMfRZLGJ1H1KNgkv0DTcC7xOFNAoFtsPxzcxvP6LSPJZNM0kwiOhQ8gFn1FjJf3oQvWavXHGr7hxklXg5qUCORq/81e3HLI1QGiao1h3PTUyQUWU4EzgtdVONCWVjOsSOpZJGqINsfuyUnFmlTwaxsiUNmau/JzIaaT2JQtsZUTPSy95M/M/rpGZwHWRcJqlByRaLBqkgJiazz0mfK2RGTCyhTHF7K2EjqigzNp+SDcFbfnmVtC6q3mXVu6tV6rU8jiKcwCmcgwdXUIdbaIAPDDg8wyu8OdJ5cd6dj0VrwclnjuEPnM8fvWmOmQ==</latexit> <latexit sha1_base64="6GusKZ+vy8u9Ej9ol3xyE9XRSz4=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkqMeCF48VTFtoY9lsJ+3SzSbsbsRS+hu8eFDEqz/Im//GbZuDtj4YeLw3w8y8MBVcG9f9dgpr6xubW8Xt0s7u3v5B+fCoqZNMMfRZIhLVDqlGwSX6hhuB7VQhjUOBrXB0M/Nbj6g0T+S9GacYxHQgecQZNVbynx5Ez+2VK27VnYOsEi8nFcjR6JW/uv2EZTFKwwTVuuO5qQkmVBnOBE5L3UxjStmIDrBjqaQx6mAyP3ZKzqzSJ1GibElD5urviQmNtR7Hoe2MqRnqZW8m/ud1MhNdBxMu08ygZItFUSaIScjsc9LnCpkRY0soU9zeStiQKsqMzadkQ/CWX14lzYuqd1n17mqVei2PowgncArn4MEV1OEWGuADAw7P8ApvjnRenHfnY9FacPKZY/gD5/MHhtmOdQ==</latexit>

Radiology Report <latexit sha1_base64="ADncWX0d4IiG+A4oOV610xwxggU=">AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkqMeCF48V7Qe0oWy2k3bpZhN2N2IJ/QlePCji1V/kzX/jts1BWx8MPN6bYWZekAiujet+O4W19Y3NreJ2aWd3b/+gfHjU0nGqGDZZLGLVCahGwSU2DTcCO4lCGgUC28H4Zua3H1FpHssHM0nQj+hQ8pAzaqx0/9R3++WKW3XnIKvEy0kFcjT65a/eIGZphNIwQbXuem5i/Iwqw5nAaamXakwoG9Mhdi2VNELtZ/NTp+TMKgMSxsqWNGSu/p7IaKT1JApsZ0TNSC97M/E/r5ua8NrPuExSg5ItFoWpICYms7/JgCtkRkwsoUxxeythI6ooMzadkg3BW355lbQuqt5l1burVeq1PI4inMApnIMHV1CHW2hAExgM4Rle4c0Rzovz7nwsWgtOPnMMf+B8/gAH7o2X</latexit>

x0
Fig. 1. Overview of our generative model, MedSyn. Using a hierarchical approach, we first generate a 64× 64 × 64 low-resolution volume, along
with its anatomical components, conditioning on Gaussian noise ϵ and radiology report. The low-resolution volumes are then seamlessly upscaled
to a detailed 256× 256 × 256 resolution.

a high-capacity denoiser to map the subtle and occasionally of our knowledge, this is the first work to evaluate a text-to-
ambiguous pathologies mentioned in the reports to visual medical image model with radiologists empirically. We also
patterns. Enhancing the capacity of the UNet, further increases delve into the significance of components within our proposed
memory usage. modules. Our code and pre-trained model is publicly available
at https://github.com/batmanlab/MedSyn
In 3D medical image synthesis, limitations often manifest
as “hallucinations,” leading to potential biases and inaccura-
II. R ELATED W ORKS
cies—critically concerning in medical settings. Our paper con-
centrates on generating Computed Tomography (CT) images A. Image Synthesis Based on Text Prompt
of the lung. Hallucinations may manifest as missing fissure Text-conditional image generation enables new applications
lines separating lung lobes or the creation of implausible and improves the diversity of generated images compared to
airway and vessel structures. This problem of hallucinations models that are not conditioned on text. The most relevant
is less pronounced in X-ray image generation (e.g., Roent- studies in the domain of multi-modal generation include Ima-
Gen [5]) since the 2D X-ray projection obscures many fine gen [3], Latent Diffusion (Stable Diffusion) [4], Video Diffu-
details, like the lung’s lobular structure, airway, and vessels. sion [13], and RoentGen [14]. For natural images, both Imagen
However, the issue becomes more pronounced in 3D image and Latent Diffusion pioneered the text-conditional diffusion
generation. Training on hundreds of thousands of 3D medical models, producing high-fidelity 2D natural images. Dall·E
images isn’t feasible, necessitating a strong prior to constrain- 2 [15] uses pre-trained CLIP model to extract features, which
ing the space of generative models. guides text-to-image generation Video Diffusion introduces
To address these challenges, we propose MedSyn, a model a method to generate videos using score-guided sequential
tailored for high-resolution, text-guided, anatomy-aware volu- diffusion models, accepting text prompts as conditional inputs.
metric generation. Leveraging a large dataset of approximately In medical imaging domain, RoentGen enhanced the Stable
9,000 3D chest CT scans paired with radiology reports, we Diffusion prior to achieve an exceptional generative quality for
employ a hierarchical training approach for efficient cross- Chest X-Ray scans. TauPETGen [16] proposes to utilize text
modality generation. Given the input of tokenized radiology description as a condition for generating 2D Tau PET images.
reports, our method initiates with a low-resolution synthesis However, these methodologies cannot be seamlessly integrated
(64 × 64 × 64), which is then fed to a super-resolution into multi-modal 3D medical volume generation because of
module to upscale the image to 256 × 256 × 256. We have their inherent 2D orientation or the challenges associated
modified the UNet to bolster the network’s capacity (i.e., with efficient high-resolution 3D volume creation. In addition,
the number of parameters) without significantly increasing most previous methods adopt language models pre-trained on
memory requirements. MedSyn enhances controllability by generic text, then fine-tune on it using biomedical text. In
harnessing textual priors from radiology reports to guide contrast, we utilize a language encoder model [17] trained
synthesis. We further regularize the generator by creating specifically with biomedical text data that learns domain-
segmentation masks of the lung’s airway, vessels, and lobular specific vocabulary and semantics.
structure as additional output channels alongside synthetic CT
images to conserve detailed anatomical features. The resultant B. Generative Models for 3D Medical Imaging
model can condition not only on the text but also on single or Generative models have emerged as a powerful tool in the
paired anatomical structures. Our algorithmic experiments and field of medical imaging, offering a range of applications
blind evaluations conducted by 10 board-certified radiologists and benefits. Previous work [18], [19] leverages 3D GAN for
highlight the superior generative quality and efficiency of the volume generation. However, the generated images are limited
proposed method compared to baseline techniques. To the best to the small size of 128×128×128 or below, due to insufficient
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS ON MEDICAL IMAGING 3

memory during training. In order to address the challenges be written as follows:

of producing efficient high-resolution medical volumes, HA- X
GAN [6] introduces an end-to-end GAN framework that L=− Eq(x0 )q(xt |x0 ) KL(q(xt−1 |xt , x0 )||pθ (xt−1 |xt )),
operates on multiple-level feature maps. This can synthesize t>0

medical volumes of dimensions 256×256×256 within a single (4)

model forward pass. Recently, diffusion models have evolved, which indirectly maximizes the ELBO of the likelihood
and both cDPM [10] and SADM [11] aim to generate full pθ (x0 ). When x0 is given, the posterior q(xt−1 |xt , x0 ) is
3D volumes. This sequential approach was adopted primarily Gaussian. Thus, the above objective becomes ℓ2 distance
because of the considerable memory demands of the 3D between the sampled xt−1 from the posterior and the predicted
Attention UNet. In order to mitigate the discontinuity between denoised data, conditioning on time step t. The practice of
slices due to sequential generation, MedGen3D [20] introduces predicting x0 is also used in previous work [3]. The final
a diffusion refiner that generates images from three different minimization objective for the diffusion model G can be
views and averages them. In order to address the memory con- simplified as follows:
straint, Medical Diffusion [21] proposes to use VQ-GAN [22]
to compress volume into latent space, then trains a diffusion
L = Eq(x0 )q(xt |x0 ) ∥G(xt ; θ) − x0 ∥22 (5)
model in the latent space. Yet, a compromise arises between
memory efficiency and speed during inference, making it
challenging to scale up for high-resolution and expansive IV. M ETHOD
cross-modality medical volume creation. Pinaya et al. [23]
have previously used LDMs to generate 3D brain MRI. The Our radiology report conditional generated framework
aforementioned limitations of previous works underscore the (MedSyn), as illustrated in Fig. 1, is a hierarchical model built
unique value of our proposed method for efficient high-fidelity upon the conditional diffusion models, which take the inputs
cross-modality medical volume generation, distinguishing it of a random Gaussian noise ϵ and the text token featuresftext
from prior techniques and applications. to generate a 256 × 256 × 256 medical volumes. The proposed
model has three core components: 1. A pre-trained text-
encoder (Medical BERT) for extracting language features from
III. B ACKGROUND radiology reports. 2. A text-guided low-resolution 3D diffusion
model that jointly synthesizes CT volume and its anatomical
Diffusion models contain two processes: a forward diffu- structure volumes. 3. A super-resolution 3D diffusion model
sion process and a reversion diffusion process. The forward for scaling up the low-resolution generated volumes and com-
diffusion gradually generates corrupted data from x0 ∼ q(x0 ) plementing the missing anatomical details. In the following
via interpolating between the sampled data and the Gaussian session, we will introduce the details of our proposed method.
noise as follows:
T
A. Learning Representations from Radiology Reports
Y
q(x1:T |x0 ) := q(xt |xt−1 ),
t=1 To enable the text-guided medical image generation that
p
q(xt |xt−1 ) := N (xt ; 1 − βt xt−1 , βt I), utilize radiology reports as the guidance, we train a text
encoder (Medical BERT [17]) that encodes natural language
where T denotes the maximum time steps and βt ∈ (0, 1] is from radiology reports into high-dimensional vectors (ftext ).
the variance schedule. According to Sohl-Dickstein et al. [24], The encoder schematic is shown in the lower part of Fig. 2.
this can be reformed as a closed-form solution, from which Specifically, we use the CT radiology reports in our col-
we can sample xt at any arbitrary time step t without needing lected data to fine-tune a BERT model [25] pre-trained on
to iterate through the entire Markov chain. datasets of chest X-ray reports. We use the “Impression”, and
√ the “Findings” sections of CT reports as the input for the
q(xt |x0 ) = N (xt ; ᾱt x0 , (1 − ᾱt )I), (1) model since they provide the most characteristic descriptions
Qt related to the disease symptoms and anatomy structures. The
in which ᾱt := s=1 (1 − βs ). The parameterized reversed
model is then fine-tuned using two tasks: (a) Mask language
diffusion can be formulated correspondingly:
modelling; (b) Paired section prediction. In task (a), we
T
Y randomly replace 15% of input tokens with [MASK], and
pθ (x0:T ) := pθ (xT ) pθ (xt−1 |xt ), (2) use the network to predict the masked token. In task (b), we
t=1 randomly swap the “Impression” section of half of the samples
pθ (xt−1 |xt ) := N (xt−1 ; µθ (xt , t), σt2 I), (3) with another subject’s “Impression” section. As a result, half
of the samples in the dataset have unpaired “Impression”
where we can parameterize pθ (xt−1 |xt ) as Gaussian distri- sections and “Findings” sections. In this way, we construct
bution when the noise addition between each adjacent step is different [CLS] embedding for paired and unpaired samples.
sufficiently small, the denoised function µθ produces the mean Therefore, we train a linear classifier that uses the [CLS]
value of the adjacent predictions, while the determination of embedding to predict whether the input “Impression” and the
the variance σt relies on βt . The optimization objective can “Findings” sections are paired or not.
4 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. XX, NO. XX, XXXX 2020

Multi-channel
Efficient Low-Res Base Model (Ours) Multi-channel
Low-Res input
Low-Res output
Parameter size: 700M

Upsample

Downsample
Skip Connection
Pseudo 3D
Timestep conv 1x3x3

3D Conv
3x3x3

2D Spatial
Attn

Temporal
Text Token
Attn

Cross Attn
Medical BERT

Frozen
[CLS] Tok 1 Tok N [SEP] Tok 1 Tok M

Findings Impression

Radiology Report

Fig. 2. This figure shows our efficient low-res generative model with the clinical tokens input. In this process, we train the denoising diffusion UNet
and fix the pre-trained text feature extractor of Medical BERT. To be notified, our low-res base model has a large capacity of 700 million parameters.

B. Text-Conditioned Volume Generation 2) Super Resolution Model: Our model puts most of the
This section discusses how to incorporate the text em- learning capacity in the low-resolution module. We designed
bedding from the Medical BERT. Direct training of high- a lightweight diffusion UNet for the super-resolution module,
resolution diffusion models (such as 256 × 256 × 256 for a which takes the low-resolution input and outputs the full high-
given field of view) is highly memory-demanding and thus resolution volume. For the super-resolution module, we design
not feasible. We propose an efficient hierarchical model with the loss to match the denoising distribution as follows:
a two-phase process. In the first phase, we generate a low- X
Lsup = − Eq(x0 ,xl0 ) q(xt |x0 )[
resolution volume (64 × 64 × 64) conditioned via the radiol- t>0
ogy report. In the second phase, the model outputs a high-
KL(q(xt−1 |xt , xl0 )||pθ (xt−1 |xt , xl0 ))]
resolution volume 256 × 256 × 256 from a 3D super-solution
module, which only takes low-resolution volume as input. The Lapproximate
sup = Eq(x0 ,xl0 ) q(xt |x0 )[
low-resolution image ensures the volumetric consistency of ∥H(xt , xl0 ; ϕ) − x0 ∥22 ], (8)
the final images. Using the radiology reports as a condition
for low-res image generation is advantageous since we do not where H is the super-resolution denoising module with its
need to compromise between resolution and model capacity. parameters ϕ. To be notified, we do not contain the additional
1) Low-Resolution Volume Generation Conditioned on Re- text information in our super-resolution module to save the
ports: To generate data conditioned via specific signals c, e.g., computational cost.
text information, we need to reformulate the unconditional
DDPM objective in Equation 4 with the conditions, C. Joint Generative Modeling of Volume and Anatomical
Eq(x0 )q(xt |x0 ,c) KL(q(xt−1 |xt , x0 , c)||pθ (xt−1 |xt , c)). Structures
X
L=−
t>0 Lung CT scans exhibit more intricate details compared to
(6) other organs. These include the bronchial tree formations, the
We first downsample the CT volume x0 by 4× to produce the delicate vascular network, and fissure lines delineating the lung
low-resolution one xl0 . Then, given the extracted features from lobes. Such complexity heightens the challenge of producing
the radiology report, we can finalize the training loss for the high-fidelity synthetic lung CT volumes. And, those subtle
text conditional diffusion model as follows: anatomical structures are easily ignored during the generative
modeling. Moreover, anatomical structures are usually roughly
Llow = Eq(x0 ,ftext )q(xlt |xl0 ) ∥G(xt , ftext ; θ) − xl0 ∥22 , (7)
described in the clinical report, which enlarges the difficulty
where G is the denoising model with cross-attention modules of CT synthesis purely based on radiology reports. Therefore,
and θ is its parameters. We inject the text conditional infor- we propose to model the volume generation jointly with
mation into the denoising network to reconstruct the original the anatomical structure generation, all conditioned on the
xl0 via a cross-attention mechanism. radiology reports. To this end, we extract the shape information
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS ON MEDICAL IMAGING 5

for the core anatomical structures, i.e., the lung lobes, the almost remove all the attention modules and keep one temporal
airway and the vessels. We choose the commonly used pre- attention at the bottom of the UNet, which makes it feasible
trained segmentation tools to provide stable shape information for 256 × 256 × 256 volume inputs.
for these three structures. We use lungmask [26] to segment
lobes from CT volumes and use TotalSegmentator [27] to F. Implementation Details
segment vessels, and the NaviAirway [28] is utilized for
airway segmentation. We denote the segmentation map as To pre-train our text encoder, we use 209,683 reports
l, a, v for the lung, airway and vessels. To enable the previous without paired images, and 7,728 reports with paired images
model G(·; θ) to jointly synthesize CT volume x along with as training set. We pre-train our text encoder for 5 epochs.
its structures l, a, v, we simply add three more channels to the Our training objective for the diffusion models is a simple
input and output layers of model G, while still conditioning on reconstruction loss: the ℓ2 pixel-level reconstruction between
time step t. We then directly concatenate them in the channel the ground truth and the prediction for different denoising time
dimension and construct a diffusion process on four-channel steps without any other terms, such as perceptual loss [30].
3D volumes. To be notified, all the shape segmentations are Following the common choice of training diffusion model,
paired with the volumes in the concatenation operation. We we use a continuous cosine time scheduler [1]. For the
define the concatenation operation as concat(·) and the newly training of both the low-resolution diffusion model and the
constructed volume as x′ = concat(x, l, a, v). Then, with super-resolution model, the number of time steps is set as
minor modifications to Equations 7 and 8, we can write down 1000. During inference, we use 50 DDIM steps for the low-
our joint generation training objectives for the low-resolution resolution diffusion model, and 20 DDIM steps for the super-
phase and high-resolution phase as follows: resolution model. For the optimizer, we apply AdamW [31]
with the learning rate of 1 × 10−4 and β = {0.9, 0.999} with
clipping the gradient norm bigger than 1. We apply the mixed-
Llow = Eq(x′0 ,text)q(x′t |x′0 ) ∥G(x′t , ftext ; θ) − x′0 ∥22 , (9) precision for optimizing models to make the training more
Lsup = Eq(x′0 ,x′l0 )q(x′t |x′0 ) ∥H(x′t , x′l0 ; ϕ) − x′0 ∥22 . (10) efficient. The gradient accumulation is applied during training
to scale up the training batch size, as diffusion models are
sensitive to the small training batch size [32]. Ultimately, we
D. Anatomically Controllable Synthesis can train our 700M parameters low-resolution base model with
Compared with the conditional modelling of Control- a batch size of 64 and the super-resolution model with a batch
Net [29], our model is significantly different, where we size 32 on four NVIDIA A6000 GPUs. Both models have
model the volume generation with the structure information converged after undergoing 40k iterations of training.
jointly p(x, c) while ControlNet generates data conditioned on
the structure p(x|c). Ours is more flexible as MedSyn can still V. E XPERIMENTS
generate data when the predefined structures are unavailable,
which is not feasible for ControlNet. Furthermore, if we This section presents a comprehensive evaluation of the
marginalize one or multiple components in the joint denoising proposed generative model, MedSyn. We will first describe the
process, such as fixing the l given the predefined lobe structure dataset used in this experiment. Then, the MedSyn is compared
inputs, we can achieve exactly what ControlNet can do. with the state-of-the-art GANs and diffusion model, including
Further, if we fix the x, we can get the structure output from WGAN [33], α-GAN [18], HA-GAN [6] and Medical Diffu-
our model, such as segmentation maps. We will show these sion [34]. Extensive comparisons and analysis are finally given
advantages of our model in the experimental section. to evaluate the effectiveness of our method, qualitatively and
quantitatively. For training of all baseline methods, we use the
author’s implementation. We made minimal modifications to
E. Efficient 3D Attention UNet
the code to adapt to our dataset. All models are trained from
Although we propose an efficient two-phase text2volume scratch, for fair comparison with our method.
generation, like other works in 3D generation [10], we still
face the memory issue when generating such high-resolution
volume. Sequential generation with conditional diffusion mod- A. Dataset
els [10], [11] works as one solution, but new issues will be We conduct experiments on a large-scale 3D dataset, which
easily introduced. Therefore, we design a new base neural contains 3D thorax computerized tomography (CT) images
architecture for much more efficient volume generation. Com- and associated radiology reports from 8,752 subjects. The
pared with the common 3D attention UNet [13] for video gen- dataset also contains 209,683 reports without corresponding
eration, we build an encoder-decoder with pure convolutions images. The images and reports are collected by the University
and move all the attention mechanisms into the bottom of the of Pittsburgh Medical Center and have been de-identified. We
UNet. In this way, we propose a more efficient base model randomly split our dataset of 8,752 subjects into a training
structure that drops all the computational burden from the set consisting of 7,728 subjects (88%), and a validation
attention mechanism while still benefiting it from the latent set of 1,024 subjects (12%). The images have been pre-
space where spatial resolution is much lower. This increases aligned using affine registration and re-sampled to 1mm3 . We
the parameter size via 10× more but largely increases the resize the images to 256 × 256 × 256. We use the nearest-
computational efficiency. For the super-resolution network, we neighbor downsampling to reduce the scans 4× to train the
6 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. XX, NO. XX, XXXX 2020

Ours Medical Diffusion

HA-GAN Real
Fig. 3. Randomly generated images (from HA-GAN and Medical Diffusion) and the real images. The first two columns show axial and coronal
slices, which use the HU range of [-1024, 600]. The last column shows the zoom-in region and uses HU range of [-1024, -250] to highlight the lung
details. Our method is the only one that can preserve delicate anatomical details, including fissures, as indicated by the arrows.

low-resolution base model. The Hounsfield units of the CT the quality of the synthetic images by measuring the distance
images have been calibrated, and air density correction has to the real data, using Fréchet Inception Distance (FID) [35]
been applied. The Hounsfield Units (HU) are mapped to the and Maximum Mean Discrepancy (MMD) [36]. The lower the
intensity window of [−1024, 600] and normalized to [−1, 1]. FID/MMD value is, the more similar the synthetic images are
to the real ones. We use a sample size of 1,024 for computing
B. Evaluation for Synthesis Quality the FID and MMD scores. For our method, we use randomly
selected reports as the condition. To compute FID and MMD
TABLE I
scores for 3D CT scans, like [6], we leverage a pre-trained 3D
Q UANTITATIVE COMPARISON WITH DIFFERENT METHODS . O UR METHOD ResNet model on medical images [37] for feature extraction.
OUTPERFORMS BASELINE METHODS IN TERMS OF DISTANCE METRICS , As shown in Table I, our method achieves lowest FID and
AND PRESERVES AIRWAY BETTER MMD than the baselines, which implies that our diffusion
Method FID↓ MMD↓ Airway (×104 mm3 )↑
model generates more realistic images.
WGAN 0.070 0.094 1.07±0.64 Quantitative Evaluation on Anatomical Details: While
α-GAN 0.028 0.057 1.14±0.68 metrics like FID and MMD are widely used in literature and
HA-GAN 0.023 0.054 2.04±0.73 empirically work well for natural images, they highlight the
Medical Diffusion 0.013 0.022 1.77±0.93
semantic-wise similarly (distance) but may ignore subtle but
Ours 0.009 0.019 3.34±1.19
Ours w/o shape - - 1.99±1.05
important anatomical details in medical images, as implied
by the small (FID/MMD) gap between different methods.
Real - - 4.58±1.45
Their real distances, as later shows in Fig. 3, could be much
1) Quantitative Evaluation: If the synthetic images are realis- bigger when taking account into the anatomical details we are
tic, then their distribution should be indistinguishable from that focused on. Therefore, we evaluate how well the generated
of the real images. Therefore, we can quantitatively evaluate images can preserve the anatomical details. Specifically, we
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS ON MEDICAL IMAGING 7

True Data Generated Data (Diseased) Generated Data (Normal)

Prompt: There are large pleural effusions seen. There is no airspace opacity or Prompt: There is no airspace opacity, effusion or
pneumothorax. There is no evidence of suspicious pulmonary nodule or mass. pneumothorax. There is no evidence of suspicious
pulmonary nodule or mass.

Prompt: There is extensive consolidation seen. No pulmonary nodules are noted. Prompt: No consolidation is identified. No pulmonary
Bone windowed images demonstrate no lytic or blastic lesions. No evidence of nodules are noted. Bone windowed images
pulmonary embolus. demonstrate no lytic or blastic lesions. No evidence of
pulmonary embolus.

Prompt: There is no significant mediastinal lymphadenopathy. There is moderate Prompt: There is no significant mediastinal
cardiomegaly. The visualized upper abdominal organs are unremarkable. There is lymphadenopathy. There is no cardiomegaly. The
minimal perihepatic free fluid. visualized upper abdominal organs are unremarkable.
There is minimal perihepatic free fluid.

Fig. 4. Images conditionally generated with disease-related prompts. We show the real images in the first two columns. Then we extract disease-
related mentions from their associated reports as conditions to generate images, which are shown in the third and fourth columns. We also show
the synthesized samples by conditioning on prompts reversed of the disease in the last two columns. Four slices are shown for each subject. The
generated images are conditioned on text only.
8 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. XX, NO. XX, XXXX 2020

use Total Segmentor [38] to segment vessels and airways from prompts, specifically on pleural effusion, bullous emphysema,
generated images and real images, and measure the volume. and cardiomegaly. First, we build prompt pairs by selecting
The results are shown in Table. I. We also perform statistical prompts from real reports that contain descriptions about
tests (one-tailed two-sample t-test) on the evaluation results. certain pathology (e.g. There are large pleural effusions seen),
At the significance level of p < 0.05, the results are significant and reversing its description (e.g. There is no pleural effusion)
for all three conditions, which further identify the effectiveness to build its prompt pair. The prompts used here can be found
of our model on prompting generation. in Fig. 4. Then, we use our model to generate images con-
2) Qualitative Evaluation: To qualitatively analyze the re- ditioning on the original prompts and the modified prompts,
sults, we show cases of synthetic images from current state- respectively. Conditioned on each prompt, we generate 32
of-the-art GAN [6] and diffusion model [21]. As in Fig. 3, al- CT volumes and perform quantitative analysis to measure the
though synthetic images from different methods are all closed alignment between the synthetic images and the abnormality
to the real ones in overall appearance, only our MedSyn con- condition specified in the prompt. For pleural effusion, we use
sistently produces anatomically plausible CT scans upon closer Total Segmentor [38] to segment the effusion from generated
inspection, showcasing its superiority. images and measure the volume. For bullous emphysema, we
measure the %LAA-950 (percentage of low attenuation areas
C. Evaluation for Conditional Generation less than a threshold of -950 Hounsfield units) for generated
images. For cardiomegaly, we use Total Segmentor [38] to
TABLE II segment the heart and lung region from CT volume, and then
E VALUATION FOR CONDITIONAL GENERATION OF PLEURAL EFFUSION . we compute the cardiothoracic ratio (CTR) by measurement
W E MEASURE THE SEGMENTED VOLUME OF PLEURAL EFFUSION FROM of the maximal cardiac width divided by the maximal thoracic
GENERATED IMAGES CONDITIONED ON DIFFERENT PROMPTS width at the same axial scan level. The evaluation results for
Prompt type Pleural effusion volume (L) pleural effusion, bullous emphysema and cardiomegaly are
shown in Table. VII, III and IV respectively. For pleural
No effusion 0.00±.00
Large effusion 1.73±.22 effusion, we found that when conditioning on the prompt
with “large effusion,” the generated images show a greater
volume of pleural effusion compared to images synthesized
with a prompt containing “no effusion.” For bullous emphy-
TABLE III TABLE IV
sema, we found that the generated images conditioning on
E VALUATION FOR CONDITIONAL E VALUATION FOR CONDITIONAL
GENERATION OF BULLOUS GENERATION OF prompt containing “no bullae” have higher %LAA-950 values,
EMPHYSEMA . T HE RESULTS CARDIOMEGALY. T HE RESULTS which suggests more severe emphysema. For cardiomegaly, we
SHOW THAT THE BULLAE SHOW THAT THE found that when conditioning on the prompt with “There is
MENTIONING CAN INCREASE CARDIOMEGALY MENTIONING cardiomegaly,” the generated images have higher CTR, which
THE EMPHYSEMA VOLUME IN CAN INCREASE THE HEART SIZE
suggests a greater degree of cardiomegaly. We also provide
GENERATED VOLUMES IN GENERATED VOLUMES
the distribution of CTR in Fig. 5.
Prompt type %LAA-950 Prompt type CTR For the qualitative examples generated from our model,
No bullae 0.019±.018 No cardiomegaly 0.48±.06 we chose those three distinct prompts paired with negative
With bullae 1.4±3.5 With cardiomegaly 0.75±.24 prompts to show the prompting effect on synthetic images. In
Fig. 4, we show the volumes from the real and synthetic data
with the text description and the negative prompting synthetic
Distribution of cardiothoracic ratio (CTR) for generated images data. Our model shows the ability to generate unseen data and
3.0 Prompt w/ cardiomegaly control the generative process through prompting.
Prompt w/o cardiomegaly
2.5
D. Controllable Synthesis via Conditioning on
2.0 Anatomical Structures
Frequency

1.5 In this section, we explore the application of conditional

generation. In this study, we aim to generate data when the
1.0 anatomical structures are available, such as we can simulate
0.5 the structures of lobes, airways or vessels. Conditioned on
those priors, we are able to provide volumetric CT scans
0.0 through our model. In Fig 6, we fix the lobes, vessels or the
0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75
CTR airway in the input channel, respectively, which are segmented
from the real data. Then, we generate the lungs with those
Fig. 5. Distribution of cardiothoracic ratio for images generated condi-
tioning on different prompt types. The results show that when feeding
anatomical structures, which shows great consistency with
prompt with cardiomegaly mentioning, the generated images will have those anatomical priors and variance for different seeds.
higher CTR. To be notified, if we marginalize the input channel of the
volumes, we can get the segmentation from the rest of the
In this section, we perform experiments to study the rel- channels, which means our model can perform the segmenta-
evance of generated images in response to specific input tion task as well. However, the generative quality is mostly
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS ON MEDICAL IMAGING 9

Control Segs True Image Seed 1, Generated Seed 2, Generated Seed 3, Generated

Fig. 6. Controlled volume synthesis via the anatomical priors. The first column shows the anatomical mask used as the condition. The second
column shows the corresponding real images. The remaining columns show samples of conditionally generated images. The results show that the
generated images can preserve the conditioning anatomical structures.

dominated by the low-res module, thus, the segmentation TABLE V

cannot be compared with the state of the art. To show the E VALUATION FOR SEGMENTATION . W ITH ADDITIONAL INFORMATION
FROM REPORTS , THE SEGMENTATION IS FURTHER IMPROVED
potential of future multimodal segmentation via prompt, we
compare the segmentation quality of the model with prompt or Dice score↑ Airway Lobe
without the prompt (by setting zero values for the conditional No text prompt 0.70±.14 0.69±.12
text embedding). In our experiments, we use the embedding With text prompt 0.75±.10 0.77±.12
extracted from the associated radiology report as the condition.
The evaluation results are shown in Table. V. We use Dice
score as the evaluation metric for segmentation. We note that E. Data Augmentation for Supervised Learning
the baseline methods can’t use segmentation as conditioning,
In this experiment, we used the synthesized samples from
so we don’t include the results here. We find that our method
MedSyn to augment the training dataset for a supervised
achieves decent segmentation performance. With additional
classification task. Previous work [39] has shown that synthetic
information from reports, our method performs even better. In
samples improve the diversity of the training dataset, result-
this sense, we show that our generative model has the potential
ing in a better discriminative performance of the classifier.
for prompt-guided volumetric segmentation.
Motivated by their results, we conduct new experiments of
using our MedSyn for data augmentation. First, we train a
classifier for predicting lung opacity and pleural effusion from
CT scans in RADChest dataset [40]. Second, we use the
10 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. XX, NO. XX, XXXX 2020

SARLE labeler [40] to parse the reports in the validation set interface for both tasks to simulate how radiologists view CT
of our UPMC dataset to derive the labels for lung opacity scans by showing them the axial, coronal, and sagittal views.
and pleural effusion. Next, we randomly sampled 100 reports Our interface, built on a medical research image viewer called
parsed for positive and negative for the two diseases mentioned Papaya 1 , is embedded into a Qualtrics survey. We chose to
above. Then, we use the 200 reports as the condition, and feed use Papaya due to the limitations of Qualtrics, and professional
them into our MedSyn model to generate additional samples. viewers such as ITKSnap or 3DSlicer cannot be embedded into
Finally, we use the synthesized samples as extra samples an anonymized survey platform like Qualtrics. Hence, we have
to train the classification models. We perform 5-fold cross- limitations for the contrast window that radiologists use when
validation. The results are shown in Table. VI. We found viewing vessels. Therefore, the evaluation of the vessels has
that when using augmented samples from our MedSyn, the limitations in terms of visualization. Our interface did allow
performance of the classifier can be improved. radiologists to adjust the contrast of the images and swap the
main image with one of the other two views.
TABLE VI All radiologists were provided an instruction video to show
E VALUATION FOR DATA AUGMENTATION . B ASELINE MODEL IS TRAINED them how to interact with the interface before they started
ONLY WITH REAL RADC HEST DATA . W E ALSO AUGMENT THE TRAINING
the survey. During the instruction video, all radiologists were
SET WITH 200 M ED S YN - GENERATED SAMPLES , AND REPORT THE
ACCURACY AND F1 SCORE .
told that they would review CT scans “that may belong
to different patients and were acquired based on different
Method Pleural effusion Lung opacity image acquisition devices and image reconstruction methods”.
Metric Accuracy%↑ F1↑ Accuracy%↑ F1↑
Additionally, one board-certified radiologist and one of the
Baseline 90.7±3.2 0.79±.05 61.0±2.2 0.72±.03
Augmented w/ MedSyn 94.0±.2 0.84±.01 62.0±1.5 0.75±.00
authors were available while the participant took the survey to
help address any confusion or technical issues.
We recruited 10 board-certified radiologists with varying
years of experience to participate in our study. All radiologists
F. Evaluation by Radiologists were recruited through our professional network.
To complement the evaluations of our method, we designed
a blind evaluation survey that elicits board-certified radiol- a) Pathology Recognition Task: The pathology recognition
ogists’ opinions on the anatomical feasibility of structures task in the survey asks radiologists to identify the most
generated by our approach in comparison with those generated prominent finding in the CT scan from a selection of options:
by existing methods, including Medical Diffusion [34] and consolidation, pleural effusion, cardiomegaly, no abnormali-
HA-GAN [6], and real CT scans. The survey also exposes ties, and other abnormalities not listed. They could leave a
how accurately radiologists can recognize pathologies in CT note in an optional text field if they feel multiple prominent
scans generated by pathology prompts using our method. To or other findings are present. Radiologists are shown six CT
achieve a blind evaluation of the generated CT scans, we scans in total: five generated CT scans by our method and one
intentionally did not mention to the radiologists that some of real CT scan. We generated five CT scans, including one for
the CT scans they were about to interpret were generated by AI cardiomegaly, one for consolidation, two for pleural effusion,
to avoid potentially biased results due to potentially negative and one for no abnormalities. The real CT scan shown is
perceptions towards AI. randomly selected to present one of those conditions. The six
Overall, our findings from the survey with 10 radiologists CT scans are shown to the radiologist in a random order.
with varying years of experience (4 – 23 years) reveal that
radiologists can correctly recognize the pathologies defined b) Anatomical Feasibility Task: The anatomical feasibility
by the prompt in the CT scans generated by our method task specifically asks radiologists to rank the four CT scans
with high accuracy. Additionally, our findings reveal that based on how well they preserve the given anatomical struc-
our method generates CT scans with fissure lines and lobe ture, considering which most looks like it is of a real image
structures that are significantly more anatomically feasible where 1 is best preserved and 4 is least preserved. This
than in CT scans generated by the Medical Diffusion [34] task evaluates the realism of three categories of anatomical
and HA-GAN [6] methods. Our findings also reveal that our structures: the lobe structure along with the fissure lines, the
methods generate CT scans with airway structures that are vessel structures, and the airway structures. For each category,
significantly indistinguishable from real CT scans and more we show the radiologists four different sets of four CT scans
anatomically feasible than those in CT scans generated by the to rank. The four sets and the four CT scans for a given set
Medical Diffusion method. We expand upon the experiment are randomized in the order they appear. Lastly, the anatomical
design that led to these findings and additional statistical structure to be identified is shown in a randomized order.
analyses below. 2) Analysis: For the pathology recognition task, we calcu-
1) Experiment Design: We designed an online survey to late the total number of radiologists that correctly identify
elicit radiologists’ opinions on how accurately our method the pathology used in the prompt to generate that CT scan.
represents different diseases and how our method compares to We cross-check the radiologists’ open-ended responses for
generated CT images from related works and real CT images. correctness with a radiologist within our professional network.
To explore this, our survey consisted of pathology recognition
and anatomical feasibility evaluation tasks. We designed an 1 https://github.com/rii-mango/Papaya
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS ON MEDICAL IMAGING 11

For the anatomical feasibility task, we collect four data patients can have with the imaging machines. The W from
points from each participant for the rank of each method for Kendall’s Coefficient Test is 0.528, indicating moderate agree-
the three categories (lobe structure along with lung fissures, ment among the rankings provided by the radiologists for this
vessel structure, and airway structure). With this ranked data, structure (p < .001), validating the effect of these findings.
we calculate the frequency of participants that ranked each
method for each rank and calculate the mean rank. We use a 4.00
P< 0.1 *; p<0.05 **; p<0.001 ***
*
non-parametric Friedman test to determine if the mean rank

Mean Rank (1 = best, 4 = worst)

** **

for each method is significantly different from that of the other.

We conduct a pairwise analysis to determine which mean ranks ***

are significantly different from each other and use Bonferonni 3.00
***
Correction to adjust the p-values due to multiple comparisons.
We set our significance threshold as p <= 0.05. We calculate
Kendall’s Coefficient of Concordance to determine how con-
2.00
sistent the rankings are across the radiologists.
3) Findings: The radiologists who completed the survey
included four senior-level radiologists with more than 15 years
of experience in radiology and six junior-level radiologists 1.00
Lobe Structure & Lung Fissures Airway Structure Vessel Structure
with four to eight years of experience in radiology. One of
the radiologists is currently in their residency, and another HA-GAN Medical Diffusion Our Method Real CT

is a faculty member. The survey took the radiologists 37.47 Fig. 8. The mean ranks for each method and each structure. We show
minutes on average to complete (std = 12.06 minutes). Below the p-values for our method compared to the other methods and the real
CT images. P-values for comparisons that do not include our method are
are additional statistics on each task. not included.

a) Pathology Recognition Task: Figure 7 shows that every For the airway structures, CT scans generated by our
radiologist’s interpretation was consistent with the pathology method produce airway structures that are significantly more
of the real CT scan and the prompted pathology in the anatomically feasible than those generated by the Medical
generated CT scan representing consolidation. Nine of ten Diffusion method (p < .05). Interestingly, our method is able
radiologists correctly interpreted the pathology for the CT to generate the airway structures in CT scans to be significantly
scans generated with a pathology prompt for cardiomegaly, indistinguishable from those in real CT scans (p = 0.1). The
pericardial effusion, and normal conditions. Eight out of 10 radiologists, on average, ranked the anatomical feasibility of
radiologists correctly recognized the other pericardial effusion the airway structures in the real CT images as 1.6 and 2.28
case. One radiologist interpreted this CT scan to present for our method (see Figure 8).
consolidation, while another mentioned another abnormality For the vessel structure, the real CT scans are significantly
that was not listed. better than all three methods (p = 0.0), and our method is
10 indistinguishable from the Medical Diffusion and HA-GAN
Number of Radiologists Interpreting the CT Correctly

(p = 1.0); (Kendall’s Coefficient W = 0.413, p =< .001). We

8
hypothesize that we do not see any significance for the vessel
6
structure because we train our MedSyn with masks predicted
by pre-trained segmentation models, and those segmentation
4 methods are more reliable for the airway (NaviAirway [28])
than the vessel structures (TotalSegmentator [38]). Addition-
2
ally, the contrast of the CT scan was not pre-determined for
the radiologists for this structure, and some chose not to adjust
0
Generated CT: Generated CT: Generated CT: Generated CT:
Normal Cardiomegaly Consolidation Effusion 1
Generated CT
Effusion 2
Real CT (any
abnormalities)
the contrast.
Fig. 7. The number of radiologists that correctly interpreted the CT
scans generated by our method for different conditions and real CT
scans for different conditions.
G. Ablation study on the language model
In our MedSyn model, we leverage a language model pre-
trained on biomedical text to extract conditioning features for
b) Anatomical Feasibility Task: We found that the CT scans the diffusion model. In this section, we conducted new ablation
generated by our method present lobe structures and lung experiments to demonstrate the effectiveness of our biomedical
fissures that are significantly more anatomically feasible than language model. Specifically, we adopt BERT-Base [25] as
those generated by the Medical Diffusion (p < .05) and pre-trained language model, which is pre-trained on general
weakly significantly better than CT scans generated by the text corpora, including English Wikipedia and BookCorpus
HA-GAN (p = .056). As seen in Figure 8, the real CT dataset [41]. Next, we fine-tuned the model with our radiology
scans are significantly better than all three of these methods, report dataset. Finally, we re-train our diffusion model with
as expected. Research images achieve higher quality than the standard language model, and perform the evaluation. The
clinical images because of limited data and limited exposure results are shown in Table. VII. It’s evident that our choice of
12 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. XX, NO. XX, XXXX 2020

using a language model specifically pre-trained on biomedical poses a significant bottleneck for generating high-resolution
text improves the sensitivity of the diffusion model to text 3D medical images. To address this challenge, we introduce
prompts. a two-stage stacked diffusion model. Unlike latent diffusion
models (LDM), our hierarchical approach conducts denoising
TABLE VII operations directly on image pixels, offering finer control over
E VALUATION FOR CONDITIONAL GENERATION OF PLEURAL EFFUSION . image generation and preserving higher fidelity details. Our
O UR BIOMEDICAL LANGUAGE MODEL IS MORE SENSITIVE TO THE hierarchical diffusion model incurs higher computational costs
EFFUSION MENTIONING IN THE PROMPT
when compared to LDMs. There exists a natural trade-off
Measured effusion volume (mL) Generic pre-trained LM Biomedical pre-trained LM between fidelity and computational efficiency. However, in
Prompting w/o effusion 0.6±1.4 0.0±0.0 our scenario, the advantages of heightened fidelity surpass
Prompting w/ effusion 2.4±5.9 1725.7±219.5
the disadvantages of slower generation times. Recently, there
have been advancements in methods [42], [43] that markedly
decrease the necessary denoising steps, sometimes even to just
VI. D ISCUSSION one step. We anticipate that integrating these methods could
In this study, we achieve high-fidelity, anatomy-aware syn- narrow the disparity between our approach and LDMs, and
thesis of volumetric lung CT scans using guidance from radi- we leave this as a prospect for our future work.
ology reports. Nonetheless, our model has certain limitations. The proposed model has several fascinating applications
First, the anatomical structures of the lobe, airway, and vessel that could be pursued in the future. Data augmentation is
are derived from pre-trained segmentation networks. This the most obvious application of the conditional generative
method may not always align with the ground truth, particu- models [44], [45]. One can use the generative model as a
larly concerning detailed airway and vessel structures. Second, building block for model explanation, as suggested in [46]–
the radiology reports only provide a condition on the low- [48]. Generated samples can be used to audit the uncer-
resolution images. Therefore, if the text condition mentions tainty of pre-trained Deep Learning models by conditioning
subtle changes that require high-resolution, the model will on the pathology, changing various aspects of the anatomy,
likely be unable to generate that. In other words, our current and assessing the DL model’s output distribution. One can
model is good at generating global and large-scale changes. deploy such an approach to evaluate and improve the out-of-
While our approach demonstrates remarkable adaptability in sample distribution of the DL model for various tasks such as
generating volumetric lung CT scans from radiology reports, classification and segmentation [49]–[52]. Since our model can
evaluating more intricate lung diseases remains challenging condition anatomical segmentation and generate a consistent
due to the complexities presented in the reports. Such evalu- volumetric image, one can use synthetic data to train a data-
ations might best be deferred to radiology experts. Therefore, free and robust segmentation method similar to [53], [54].
we conduct a blind user study by radiologists in Section. V-F,
which verified the quality and fidelity of the generated images. VII. C ONCLUSION
During the inference time, our models support both generation Our research takes a significant step forward by synthesizing
with only one conditioning type, and generation with both high-resolution 3D CT lung scans guided by detailed radio-
text and shape conditioning. If the user thinks there might logical and anatomical information. While GANs and cDPMs
be a conflict between the text prompt and anatomical shape have set benchmarks, they come with inherent limitations,
conditioning, he/she can choose to use one conditioning. particularly when generating intricate chest CT scan details.
The existing conditional generative models utilized in med- Our proposed MedSyn model addresses these challenges using
ical imaging have limited capabilities. They can either ac- a comprehensive dataset and a hierarchical training approach.
commodate discrete conditions (such as the presence and Innovative architectural designs not only overcome previous
absence of a disease) or are limited to only 2D images constraints but also pioneer anatomy-conscious volumetric
(i.e., X-ray images) when conditioned on text. Utilizing free- generation. Future work can leverage our model to enhance
style text as a condition can yield substantial enhancements clinical applications.
in the diversity of the generated samples. One can control
the pathology and the anatomical location, severity, size, and R EFERENCES
many other aspects of the pathology. This technique could
[1] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,”
mitigate the longstanding issue of releasing extensive medical Advances in Neural Information Processing Systems, vol. 33, pp. 6840–
imaging datasets. For example, collecting datasets for rare 6851, 2020.
pathology is challenging. Synthetic samples from a well- [2] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and
B. Poole, “Score-based generative modeling through stochastic differ-
trained generative model with a radiology report can be viewed ential equations,” in International Conference on Learning Representa-
as the second-best replacement in such a scenario. While our tions, 2021.
approach cannot entirely substitute the release of the real data, [3] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S.
Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans,
collaborative efforts within the medical imaging community J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image diffu-
to refine this model on diverse datasets and share it can sion models with deep language understanding,” 2022.
significantly mitigate this issue in the foreseeable future. [4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-
resolution image synthesis with latent diffusion models,” in Proceedings
Our goal is to enhance the memory efficiency of the diffu- of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
sion model without compromising its fidelity. Memory demand tion, 2022, pp. 10 684–10 695.
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS ON MEDICAL IMAGING 13

[5] P. Chambon, C. Bluethgen, J.-B. Delbrouck, R. Van der Sluijs, [27] J. Wasserthal, H.-C. Breit, M. T. Meyer, M. Pradella, D. Hinck, A. W.
M. Połacin, J. M. Z. Chaves, T. M. Abraham, S. Purohit, C. P. Langlotz, Sauter, T. Heye, D. Boll, J. Cyriac, S. Yang et al., “Totalsegmentator:
and A. Chaudhari, “Roentgen: Vision-language foundation model for robust segmentation of 104 anatomical structures in ct images,” arXiv
chest x-ray generation,” arXiv preprint arXiv:2211.12737, 2022. preprint arXiv:2208.05868, 2022.
[6] L. Sun, J. Chen, Y. Xu, M. Gong, K. Yu, and K. Batmanghelich, [28] A. Wang, T. C. C. Tam, H. M. Poon, K.-C. Yu, and W.-N. Lee, “Navi-
“Hierarchical amortized gan for 3d high resolution medical image airway: a bronchiole-sensitive deep learning-based airway segmentation
synthesis,” IEEE Journal of Biomedical and Health Informatics, vol. 26, pipeline,” arXiv preprint arXiv:2203.04294, 2022.
no. 8, pp. 3966–3975, 2022. [29] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to
[7] H. Shan, Y. Zhang, Q. Yang, U. Kruger, M. K. Kalra, L. Sun, W. Cong, text-to-image diffusion models,” 2023.
and G. Wang, “3-d convolutional encoder-decoder network for low-dose [30] S. Lin and X. Yang, “Diffusion model with perceptual loss,” arXiv
ct via transfer learning from a 2-d trained network,” IEEE transactions preprint arXiv:2401.00110, 2023.
on medical imaging, vol. 37, no. 6, pp. 1522–1534, 2018. [31] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,”
[8] W. Jin, M. Fatehi, K. Abhishek, M. Mallya, B. Toyota, and G. Hamarneh, in International Conference on Learning Representations, 2019.
“Applying artificial intelligence to glioma imaging: Advances and chal- [32] S. Ghalebikesabi, L. Berrada, S. Gowal, I. Ktena, R. Stanforth, J. Hayes,
lenges,” arXiv preprint arXiv:1911.12886, 2019. S. De, S. L. Smith, O. Wiles, and B. Balle, “Differentially private
[9] M. D. Cirillo, D. Abramian, and A. Eklund, “Vox2vox: 3d-gan for brain diffusion models generate useful synthetic images,” arXiv preprint
tumour segmentation,” arXiv preprint arXiv:2003.13653, 2020. arXiv:2302.13861, 2023.
[10] W. Peng, E. Adeli, T. Bosschieter, S. H. Park, Q. Zhao, and K. M. Pohl, [33] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville,
“Generating realistic brain mris via a conditional diffusion probabilistic “Improved training of wasserstein gans,” in Advances in neural infor-
model,” 2023. mation processing systems, 2017, pp. 5767–5777.
[11] J. S. Yoon, C. Zhang, H.-I. Suk, J. Guo, and X. Li, “SADM: Sequence- [34] F. Khader, G. Mueller-Franzes, S. T. Arasteh, T. Han, C. Haarburger,
aware diffusion model for longitudinal medical image generation,” in M. Schulze-Hagen, P. Schad, S. Engelhardt, B. Baessler, S. Foersch,
Lecture Notes in Computer Science. Springer Nature Switzerland, 2023, J. Stegmaier, C. Kuhl, S. Nebelung, J. N. Kather, and D. Truhn, “Medical
pp. 388–400. diffusion - denoising diffusion probabilistic models for 3d medical image
[12] W. H. Pinaya, M. S. Graham, E. Kerfoot, P.-D. Tudosiu, J. Dafflon, generation,” 2022. [Online]. Available: https://arxiv.org/abs/2211.03364
V. Fernandez, P. Sanchez, J. Wolleb, P. F. da Costa, A. Patel et al., [35] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter,
“Generative ai for medical imaging: extending the monai framework,” “Gans trained by a two time-scale update rule converge to a local nash
arXiv preprint arXiv:2307.15208, 2023. equilibrium,” in Advances in neural information processing systems,
[13] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, 2017.
“Video diffusion models,” arXiv:2204.03458, 2022. [36] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola,
[14] P. Chambon, C. Bluethgen, J.-B. Delbrouck, R. Van der Sluijs, “A kernel two-sample test,” Journal of Machine Learning Research,
M. Połacin, J. M. Z. Chaves, T. M. Abraham, S. Purohit, C. P. Langlotz, vol. 13, no. Mar, pp. 723–773, 2012.
and A. Chaudhari, “Roentgen: vision-language foundation model for
[37] S. Chen, K. Ma, and Y. Zheng, “Med3d: Transfer learning for 3d medical
chest x-ray generation,” arXiv preprint arXiv:2211.12737, 2022.
image analysis,” arXiv preprint arXiv:1904.00625, 2019.
[15] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical
[38] J. Wasserthal, H.-C. Breit, M. T. Meyer, M. Pradella, D. Hinck,
text-conditional image generation with clip latents,” arXiv preprint
A. W. Sauter, T. Heye, D. T. Boll, J. Cyriac, S. Yang, M. Bach, and
arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022.
M. Segeroth, “Totalsegmentator: Robust segmentation of 104 anatomic
[16] S.-I. Jang, C. Lois, E. Thibault, J. A. Becker, Y. Dong, M. D. Normandin,
structures in ct images,” Radiology: Artificial Intelligence, vol. 5, no. 5,
J. C. Price, K. A. Johnson, G. E. Fakhri, and K. Gong, “Taupetgen: Text-
p. e230024, 2023.
conditional tau pet image synthesis based on latent diffusion models,”
arXiv preprint arXiv:2306.11984, 2023. [39] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and
[17] B. Boecking, N. Usuyama, S. Bannur, D. C. Castro, A. Schwaighofer, H. Greenspan, “Gan-based synthetic medical image augmentation for
S. Hyland, M. Wetscherek, T. Naumann, A. Nori, J. Alvarez-Valle et al., increased cnn performance in liver lesion classification,” Neurocomput-
“Making the most of text semantics to improve biomedical vision– ing, vol. 321, pp. 321–331, 2018.
language processing,” in European conference on computer vision. [40] R. L. Draelos, D. Dov, M. A. Mazurowski, J. Y. Lo, R. Henao, G. D.
Springer, 2022, pp. 1–21. Rubin, and L. Carin, “Machine-learning-based multiple abnormality pre-
[18] G. Kwon, C. Han, and D.-s. Kim, “Generation of 3d brain mri using diction with large-scale chest computed tomography volumes,” Medical
auto-encoding generative adversarial networks,” in International Confer- image analysis, vol. 67, p. 101857, 2021.
ence on Medical Image Computing and Computer-Assisted Intervention. [41] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba,
Springer, 2019, pp. 118–126. and S. Fidler, “Aligning books and movies: Towards story-like visual
[19] S. Xing, H. Sinha, and S. J. Hwang, “Cycle consistent embedding of 3d explanations by watching movies and reading books,” in The IEEE
brains with auto-encoding generative adversarial networks,” in Medical International Conference on Computer Vision (ICCV), December 2015.
Imaging with Deep Learning, 2021. [42] X. Liu, X. Zhang, J. Ma, J. Peng, and Q. Liu, “Instaflow: One step is
[20] K. Han, Y. Xiong, C. You, P. Khosravi, S. Sun, X. Yan, J. Duncan, and enough for high-quality diffusion-based text-to-image generation,” arXiv
X. Xie, “Medgen3d: A deep generative framework for paired 3d image preprint arXiv:2309.06380, 2023.
and mask generation,” 2023. [43] Y. Xu, M. Gong, S. Xie, W. Wei, M. Grundmann, T. Hou et al.,
[21] F. Khader, G. Mueller-Franzes, S. T. Arasteh, T. Han et al., “Denoising “Semi-implicit denoising diffusion models (siddms),” arXiv preprint
diffusion probabilistic models for 3d medical image generation,” Scien- arXiv:2306.12511, 2023.
tific Reports, vol. 13, no. 7303, 2023. [44] X. Chen, Y. Li, L. Yao, E. Adeli, and Y. Zhang, “Generative adversarial
[22] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high- u-net for domain-free medical image augmentation,” arXiv preprint
resolution image synthesis,” in Proceedings of the IEEE/CVF conference arXiv:2101.04793, 2021.
on computer vision and pattern recognition, 2021, pp. 12 873–12 883. [45] H.-C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L.
[23] W. H. Pinaya, P.-D. Tudosiu, J. Dafflon, P. F. Da Costa, V. Fernandez, Senjem, J. L. Gunter, K. P. Andriole, and M. Michalski, “Medical image
P. Nachev, S. Ourselin, and M. J. Cardoso, “Brain imaging generation synthesis for data augmentation and anonymization using generative
with latent diffusion models,” in MICCAI Workshop on Deep Generative adversarial networks,” in International workshop on simulation and
Models. Springer, 2022, pp. 117–126. synthesis in medical imaging. Springer, 2018, pp. 1–11.
[24] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, [46] S. Singla, M. Eslami, B. Pollack, S. Wallace, and K. Batmanghelich,
“Deep unsupervised learning using nonequilibrium thermodynamics,” “Explaining the black-box smoothly—a counterfactual approach,” Med-
in International Conference on Machine Learning. PMLR, 2015, pp. ical Image Analysis, vol. 84, p. 102721, 2023.
2256–2265. [47] H. Montenegro, W. Silva, and J. S. Cardoso, “Privacy-preserving gener-
[25] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training ative adversarial network for case-based explainability in medical image
of deep bidirectional transformers for language understanding,” arXiv analysis,” IEEE Access, vol. 9, pp. 148 037–148 047, 2021.
preprint arXiv:1810.04805, 2018. [48] C. Mauri, S. Cerri, O. Puonti, M. Mühlau, and K. Van Leemput,
[26] J. Hofmanninger, F. Prayer, J. Pan, S. Röhrich, H. Prosch, and G. Langs, “Accurate and explainable image-based prediction using a lightweight
“Automatic lung segmentation in routine imaging is primarily a data generative model,” in International Conference on Medical Image Com-
diversity problem, not a methodology problem,” European Radiology puting and Computer-Assisted Intervention. Springer, 2022, pp. 448–
Experimental, vol. 4, no. 1, pp. 1–13, 2020. 458.
14 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. XX, NO. XX, XXXX 2020

[49] D. Mahapatra, A. Poellinger, L. Shao, and M. Reyes, “Interpretability-

driven sample selection using self supervised learning for disease
classification and segmentation,” IEEE transactions on medical imaging,
vol. 40, no. 10, pp. 2548–2562, 2021.
[50] D. Li, J. Yang, K. Kreis, A. Torralba, and S. Fidler, “Semantic segmen-
tation with generative models: Semi-supervised learning and strong out-
of-domain generalization,” in Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 2021, pp. 8300–8311.
[51] W. Yan, Y. Wang, S. Gu, L. Huang, F. Yan, L. Xia, and Q. Tao,
“The domain shift problem of medical image segmentation and vendor-
adaptation by unet-gan,” in Medical Image Computing and Computer
Assisted Intervention–MICCAI 2019: 22nd International Conference,
Shenzhen, China, October 13–17, 2019, Proceedings, Part II 22.
Springer, 2019, pp. 623–631.
[52] X. Xie, J. Chen, Y. Li, L. Shen, K. Ma, and Y. Zheng, “Mi 2 gan: Gen-
erative adversarial network for medical image domain adaptation using
mutual information constraint,” in International Conference on Medical
Image Computing and Computer-Assisted Intervention. Springer, 2020,
pp. 516–525.
[53] B. Billot, D. N. Greve, O. Puonti, A. Thielscher, K. Van Leemput, B. Fis-
chl, A. V. Dalca, and J. E. Iglesias, “Synthseg: domain randomisation
for segmentation of brain scans of any contrast and resolution,” arXiv
preprint arXiv:2107.09559, 2021.
[54] B. Billot, D. Greve, K. Van Leemput, B. Fischl, J. E. Iglesias, and A. V.
Dalca, “A learning strategy for contrast-agnostic mri segmentation,”
arXiv preprint arXiv:2003.01995, 2020.

2 CV Gearbox
No ratings yet
2 CV Gearbox
1 page
Apertura AD Telescope Manual PDF
No ratings yet
Apertura AD Telescope Manual PDF
19 pages
Med Image Syn
No ratings yet
Med Image Syn
10 pages
Mam e
No ratings yet
Mam e
23 pages
Zeroth Review
No ratings yet
Zeroth Review
9 pages
Synthetic CT Generation From MRI Using 3D Transformer-Based Denoising Diffusion Model
No ratings yet
Synthetic CT Generation From MRI Using 3D Transformer-Based Denoising Diffusion Model
22 pages
MedCoDi-M: A Multi-Prompt Foundation Model For Multimodal Medical Data Generation
No ratings yet
MedCoDi-M: A Multi-Prompt Foundation Model For Multimodal Medical Data Generation
40 pages
Litrature Review
No ratings yet
Litrature Review
4 pages
SYN-LUNGS: Towards Simulating Lung Nodules With Anatomy-Informed Digital Twins For AI Training
No ratings yet
SYN-LUNGS: Towards Simulating Lung Nodules With Anatomy-Informed Digital Twins For AI Training
12 pages
An Efficient Deep Neural Network To Classify Large 3D Images With Small Objects
No ratings yet
An Efficient Deep Neural Network To Classify Large 3D Images With Small Objects
15 pages
Text
No ratings yet
Text
10 pages
Met 3 D
No ratings yet
Met 3 D
12 pages
Final
No ratings yet
Final
21 pages
GAN-based Synthetic Medical Image Augmentation
No ratings yet
GAN-based Synthetic Medical Image Augmentation
10 pages
Chapter MedicalImageGenerationusingGAN
No ratings yet
Chapter MedicalImageGenerationusingGAN
21 pages
Cross-Modality Synthesis From MRI To PET Using Adversarial U-Net With Different Normalization
No ratings yet
Cross-Modality Synthesis From MRI To PET Using Adversarial U-Net With Different Normalization
5 pages
D P M GAN M 2DI: Iffusion Robabilistic Odels Beat ON Edical Mages
No ratings yet
D P M GAN M 2DI: Iffusion Robabilistic Odels Beat ON Edical Mages
13 pages
Chapter 8
No ratings yet
Chapter 8
6 pages
Grove 等 - 2011 - From CT to NURBS Contour Fitting with B-spline Curves
No ratings yet
Grove 等 - 2011 - From CT to NURBS Contour Fitting with B-spline Curves
19 pages
Merlin: A Vision Language Foundation Model For 3D Computed Tomography
No ratings yet
Merlin: A Vision Language Foundation Model For 3D Computed Tomography
28 pages
Hierarchical Amortized GAN For 3D High Resolution Medical Image Synthesis
No ratings yet
Hierarchical Amortized GAN For 3D High Resolution Medical Image Synthesis
10 pages
Medical Image Analysis With Transformers
No ratings yet
Medical Image Analysis With Transformers
66 pages
Adapting Pretrained Vision-Language Foundational Models To Medical Imaging Domains
No ratings yet
Adapting Pretrained Vision-Language Foundational Models To Medical Imaging Domains
17 pages
First Review
No ratings yet
First Review
10 pages
2020 - Singh - 3D Deep Learning On Medical Images
No ratings yet
2020 - Singh - 3D Deep Learning On Medical Images
26 pages
Sciadv Abb7973
No ratings yet
Sciadv Abb7973
12 pages
3D Brain and Heart Volume Generative Models A Surv
No ratings yet
3D Brain and Heart Volume Generative Models A Surv
34 pages
INTRODUCTION
No ratings yet
INTRODUCTION
25 pages
Med3D: Transfer Learning For 3D Medical Image Analysis: April 2019
No ratings yet
Med3D: Transfer Learning For 3D Medical Image Analysis: April 2019
13 pages
Set 6-2 Memory-Efficient GAN-based Domain Translation of High Resolution 3D
No ratings yet
Set 6-2 Memory-Efficient GAN-based Domain Translation of High Resolution 3D
10 pages
First Review
No ratings yet
First Review
11 pages
Conversion Between CT and MRI Images Using Diffusion and Score-Matching Models
No ratings yet
Conversion Between CT and MRI Images Using Diffusion and Score-Matching Models
10 pages
Automatic Report Generation For Chest X-Ray Images: A Multilevel Multi-Attention Approach
No ratings yet
Automatic Report Generation For Chest X-Ray Images: A Multilevel Multi-Attention Approach
10 pages
3D VISUALIZATION Literature Survey
No ratings yet
3D VISUALIZATION Literature Survey
5 pages
1 s2.0 S001048252030086X Main
No ratings yet
1 s2.0 S001048252030086X Main
7 pages
TMRGM A Template-Based Multi-Attention Model For X
No ratings yet
TMRGM A Template-Based Multi-Attention Model For X
12 pages
Serpent
No ratings yet
Serpent
10 pages
Attention Based Cross-Domain Synthesis and Segmentation From Unpaired Medical Images
No ratings yet
Attention Based Cross-Domain Synthesis and Segmentation From Unpaired Medical Images
13 pages
Medical Paper
No ratings yet
Medical Paper
20 pages
Nguyen Et Al 2025 Enhanced Medical Image Generation Through Advanced Latent Space Diffusion
No ratings yet
Nguyen Et Al 2025 Enhanced Medical Image Generation Through Advanced Latent Space Diffusion
13 pages
Project 1
No ratings yet
Project 1
14 pages
Funct Imaging of Lungs
No ratings yet
Funct Imaging of Lungs
14 pages
10 1002@mp 13617
No ratings yet
10 1002@mp 13617
17 pages
Automated Abnormality Classi Fication of Chest Radiographs Using Deep Convolutional Neural Networks
No ratings yet
Automated Abnormality Classi Fication of Chest Radiographs Using Deep Convolutional Neural Networks
8 pages
Deep Learning-Based Synthetic CT Generation From MR Images: Comparison of Generative Adversarial and Residual Neural Networks
No ratings yet
Deep Learning-Based Synthetic CT Generation From MR Images: Comparison of Generative Adversarial and Residual Neural Networks
25 pages
Improving Quality of Medical Scans Using GANs
No ratings yet
Improving Quality of Medical Scans Using GANs
7 pages
Medical Image Segmentation
No ratings yet
Medical Image Segmentation
6 pages
1 s2.0 S1361841523003067 Main
No ratings yet
1 s2.0 S1361841523003067 Main
32 pages
Log With CNN
No ratings yet
Log With CNN
8 pages
Computerized Medical Diagnosis For Human Organ
No ratings yet
Computerized Medical Diagnosis For Human Organ
47 pages
Verma 2021
No ratings yet
Verma 2021
6 pages
LungViT Ensembling Cascade of Texture Sensitive Hierarchical Vision Transformers For Cross-Volume Chest CT Image-to-Image Translation
No ratings yet
LungViT Ensembling Cascade of Texture Sensitive Hierarchical Vision Transformers For Cross-Volume Chest CT Image-to-Image Translation
18 pages
Developing Generalist Foundation Models From A Multimodal Dataset For 3D Computed Tomography
No ratings yet
Developing Generalist Foundation Models From A Multimodal Dataset For 3D Computed Tomography
47 pages
Applsci 15 00343
No ratings yet
Applsci 15 00343
14 pages
A Survey On Automatic Generation of Medical Imaging Reports Based On Deep Learning
No ratings yet
A Survey On Automatic Generation of Medical Imaging Reports Based On Deep Learning
16 pages
Multi-Label Classification of Lung Diseases Using Deep Learning
No ratings yet
Multi-Label Classification of Lung Diseases Using Deep Learning
19 pages
Classification of Pulmonary Lesions Using Gan and Semi Supervised Gan
No ratings yet
Classification of Pulmonary Lesions Using Gan and Semi Supervised Gan
10 pages
Applsci 11 11185
No ratings yet
Applsci 11 11185
19 pages
Artificial Intelligence in Respirato - 2021 - Archivos de Bronconeumolog A Engl
No ratings yet
Artificial Intelligence in Respirato - 2021 - Archivos de Bronconeumolog A Engl
2 pages
Paper 2
No ratings yet
Paper 2
11 pages
Foundational Models and Architectures S1: Generative AI, #1
From Everand
Foundational Models and Architectures S1: Generative AI, #1
Leaster Startx
No ratings yet
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
PDF Veterinary Surgery: Small Animal Expert Consult E BOOK 2nd Edition, (Ebook PDF) Download
100% (1)
PDF Veterinary Surgery: Small Animal Expert Consult E BOOK 2nd Edition, (Ebook PDF) Download
47 pages
7 Steps To Building Lean Vegan Muscle 1
No ratings yet
7 Steps To Building Lean Vegan Muscle 1
18 pages
Instruction Manual Single Method - M100 - Chlorine T - en PDF
No ratings yet
Instruction Manual Single Method - M100 - Chlorine T - en PDF
11 pages
List of Merchat in Phnom Penh As of 31 May 2023
No ratings yet
List of Merchat in Phnom Penh As of 31 May 2023
30 pages
PEX - Installation Simulator
No ratings yet
PEX - Installation Simulator
17 pages
Resume Emily Pomykala
No ratings yet
Resume Emily Pomykala
2 pages
Ahunanya, Destiny Homa
No ratings yet
Ahunanya, Destiny Homa
2 pages
Atlas Copco Breathing Air Purifier (BAP12-142+ Series)
No ratings yet
Atlas Copco Breathing Air Purifier (BAP12-142+ Series)
4 pages
Bechayda, Act. 1
No ratings yet
Bechayda, Act. 1
1 page
Stress Vocabulary Ex-S Student
No ratings yet
Stress Vocabulary Ex-S Student
2 pages
Assessment of Quality of Life in Children With Epi
No ratings yet
Assessment of Quality of Life in Children With Epi
8 pages
Installation, Operation, and Maintenance Manual: 80 Series Pump
No ratings yet
Installation, Operation, and Maintenance Manual: 80 Series Pump
29 pages
Presentation Schedule 5nov 1page
No ratings yet
Presentation Schedule 5nov 1page
6 pages
Distillation, Ponchon Savarit, Shahzad
No ratings yet
Distillation, Ponchon Savarit, Shahzad
30 pages
Rules Sustainability Report
No ratings yet
Rules Sustainability Report
6 pages
GREPA Digest
No ratings yet
GREPA Digest
6 pages
6.8. Corporate Code of Conduct
No ratings yet
6.8. Corporate Code of Conduct
3 pages
MASTER BUDGET TUTORIAL SHEET - Students
No ratings yet
MASTER BUDGET TUTORIAL SHEET - Students
2 pages
Ep5161a SLC
No ratings yet
Ep5161a SLC
39 pages
Wood Frame Prescriptive Provisions One Story Residential Construction Only Ib P bc2011 004
No ratings yet
Wood Frame Prescriptive Provisions One Story Residential Construction Only Ib P bc2011 004
9 pages
SOP of TOC Analyzer
No ratings yet
SOP of TOC Analyzer
28 pages
Surge Arester 2. Wave Trap 3. Reaktor 4. Earthing Switch 5. Pms 6. CT 7. PMT 8. BB 9. Trafo
No ratings yet
Surge Arester 2. Wave Trap 3. Reaktor 4. Earthing Switch 5. Pms 6. CT 7. PMT 8. BB 9. Trafo
1 page
STD-INSP-0123 IGC Practice - A - (ASTM G28)
No ratings yet
STD-INSP-0123 IGC Practice - A - (ASTM G28)
7 pages
Sample Mistakes in Writing
No ratings yet
Sample Mistakes in Writing
1 page
Photoelectric effect-AQA
No ratings yet
Photoelectric effect-AQA
8 pages
Unit III - Naz Foundation Case
No ratings yet
Unit III - Naz Foundation Case
21 pages
Objective 7.01 Key Terms
No ratings yet
Objective 7.01 Key Terms
3 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

MedSyn Text-Guided Anatomy-Aware Synthesis

Uploaded by

MedSyn Text-Guided Anatomy-Aware Synthesis

Uploaded by

IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. XX, NO.

XX, XXXX 2020 1

MedSyn: Text-guided Anatomy-aware Synthesis

✏⇠N CT Scans Lobe

memory during training. In order to address the challenges be written as follows:

medical volumes of dimensions 256×256×256 within a single (4)

Ours Medical Diffusion

True Data Generated Data (Diseased) Generated Data (Normal)

1.5 In this section, we explore the application of conditional

dominated by the low-res module, thus, the segmentation TABLE V

Mean Rank (1 = best, 4 = worst)

for each method is significantly different from that of the other.

(p = 1.0); (Kendall’s Coefficient W = 0.413, p =< .001). We

[49] D. Mahapatra, A. Poellinger, L. Shao, and M. Reyes, “Interpretability-

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.