Harmonizing Flows: Unsupervised MR Harmonization Based On Normalizing Flows
Canadian Neonatal Brain Platform, Montreal
farzad.beizaee.1@ens.etsmtl.ca
Abstract
In this paper, we propose an unsupervised framework based on normalizing flows
that harmonizes MR images to mimic the distribution of the source domain. The
proposed framework consists of three steps. First, a shallow harmonizer network
is trained to recover images of the source domain from their augmented versions.
A normalizing flow network is then trained to learn the distribution of the source
domain. Finally, at test time, a harmonizer network is modified so that the output
images match the source domain’s distribution learned by the normalizing flow
model. Our unsupervised, source-free and task-independent approach is evaluated
on cross-domain brain MRI segmentation using data from four different sites.
Results demonstrate its superior performance compared to existing methods. The
code is available at https://github.com/farzad-bz/Harmonizing-Flows
1 Introduction
Deep learning models have become the de facto solution for most image-based problems, including
those in the medical domain. Despite significant progress, these models still suffer from distribution shifts, and their performance degrades considerably when they are applied to data acquired under different conditions.
Clinical studies using magnetic resonance imaging (MRI) often have to deal with such large domain
shifts. Due to the qualitative nature of the MRI acquisition process, generated images are sensitive
to imaging devices, acquisition protocols, scanner artifacts, as well as to patient populations [28].
For instance, images from the same modality (e.g., T1-w) acquired from two different scanners with
separate configurations will likely present noticeable differences, which can be considered a domain
shift. Consequently, collecting a multi-center MRI dataset to address a particular clinical question
does not guarantee a greater statistical power, as the increase in variance comes from a non-clinical
source. Furthermore, this data heterogeneity can also hamper the generalizability of deep learning
models, preventing their large dissemination. In particular, when trained on a specific site, such
models are typically unable to provide similar performance for other centers.
To alleviate this issue, image harmonization addresses the distributional shift problem from an image-
to-image mapping perspective, where the objective is to transfer image contrasts across different
domains. Nevertheless, most harmonization methods in the literature make strong assumptions that
might hamper their scalability and usability in real-life scenarios. First, some methods must have
access to source images during the adaptation, which may no longer be available. Labels associated
with the downstream task may also be required in other approaches. Finally, most harmonization
techniques need to know the target domains during training, while these domains are often unknown.
In this work, we make the following contributions: (i) we introduce an unsupervised harmonization framework in which a normalizing flow model captures the source-domain distribution and then guides the adaptation of a shallow harmonizer network; (ii) our approach is source-free and task-agnostic, as it requires neither source images nor downstream labels at adaptation time, and it handles target domains that are unknown during training; (iii) we evaluate the method on cross-domain brain MRI segmentation across four sites, where it outperforms existing harmonization strategies.
2 Related work
Image harmonization. Several techniques have been proposed for the harmonization of images
in the medical domain, and particularly for MRI data. Classical post-processing steps, such as
intensity histogram matching [24, 26], reduce the influence of biases across scanners, but may also
remove informative local variations in intensity. Statistical approaches can model image intensity
and dataset bias at the voxel level [11, 12, 2]; however, they must often be adjusted each time images
from new sites are provided. Modern strategies for image harmonization, which are based on deep
learning models, have been shown to be a promising alternative for this problem [5, 35, 21, 36, 4, 8].
Nevertheless, they make unrealistic assumptions that hamper their scalability to large-scale multi-site harmonization tasks. First, images of the same target anatomy across multiple sites, commonly referred to as traveling subjects, are employed to identify intensity transformations between different sites [5]. This requires that a given number of subjects be scanned at every site or scanner required for training, a condition rarely met in practice. Second, another group
of methods is limited to two domains [35] and requires target domains to be known at training
time [35, 21]. In addition, each time a new domain is added, these approaches must be fine-tuned
in order to accommodate the characteristics of each domain. Calamity [21] further requires paired multi-modal MR sequences, which limits its applicability even more in single-modality scenarios. Last,
task-dependent approaches leverage labels associated to each image for a given down-stream task
[4, 8], thus optimizing the harmonization for this specific problem. Nevertheless, having access to
large labeled datasets might be impractical due to the underlying labeling cost.
Test-time Adaptation. Our method also relates to the problem of test-time domain adaptation
(TTA) [30, 27, 3] which aims to quickly adapt a pre-trained deep network to domain shifts during
inference on test examples. One key difference between TTA and the well-known unsupervised
domain adaptation (UDA) problem is that, in TTA, the source examples are no longer available. One
of the earliest TTA approaches, called TENT [30], updates the affine transformation parameters of
normalization layers by minimizing the Shannon entropy of predictions for test examples. In [23], this
strategy is improved by optimizing a log-likelihood ratio instead of entropy, as well as by considering
the normalization statistics of the test batch. The method named SHOT [20] fine-tunes the entire
feature extractor with a mutual information loss and uses pseudo-labels to provide additional test-time
guidance. Instead of updating the network parameters, LAME [3] uses Laplacian regularization to do
a post-hoc adaptation of the softmax predictions.
Normalizing flows. Recently, normalizing flows (NFs) have emerged as a popular approach
for constructing probabilistic and generative models with tractable distributions [19]. NFs aim at
transforming unknown complex distributions into simpler ones, for instance, a standard normal
distribution. This is achieved by applying a sequence of invertible and differentiable transformations.
While most existing literature has leveraged NFs for generative tasks (e.g., image generation [15, 17],
noise modeling [1], graph modeling [33]) and anomaly detection [14, 18], recent evidence also
suggests their usefulness for aligning a given set of source domains [13, 29]. To our knowledge,
a single work has investigated NFs in the context of harmonization [32]. However, it aimed at performing causal inference on pre-extracted features (brain ROI volume measures), and not at image harmonization as in our work. Moreover, since extracting ROIs requires pixel-wise labels, the method in [32] is not task-agnostic.

Figure 1: Pipeline of the proposed Harmonizing Flows method. Our approach consists of two steps. First, we employ normalizing flows (NFs) to capture the distribution of the source domain. During the second stage, the trained NF is leveraged to update the parameters of a harmonizer network, which are adjusted in order to maximize the similarity between the harmonized outputs and the distribution learned by the NF. Note that steps 1 and 2 are not dependent on each other, and can therefore be performed in any order.
3 Methodology
We first define the problem addressed in our work. Let X_S = {x_n}_{n=1}^N be a set of unlabeled images in the source domain S, where a given image x_i ∈ R^{|Ω|} and Ω denotes its spatial domain (i.e., W × H). Similarly, we denote as X_T = {x_m}_{m=1}^M the set of unlabeled images in a potential target domain T.¹ The goal of unsupervised data harmonization is to find a mapping function f_θ : T → S without having access to labeled images for any of the domains. In what follows, we present our NF-based solution to this problem, whose framework is depicted in Figure 1.
We leverage Normalizing Flows (NFs) [7] to model the distribution of the source domain. NFs are a recent family of generative methods that can model a complex probability density p_x(x) (i.e., the source) as a series of transformation functions, denoted as g_φ = g_1 ∘ g_2 ∘ ... ∘ g_T, applied to a simpler and tractable probability density p_u(u) (e.g., a standard multivariate Gaussian distribution). We can express a source image as x = g_φ(u), where u ∼ p_u(u) and p_u(u) is the base distribution of the flow model. An important requirement on the transformation function g_φ is that it must be invertible, and both g_φ and g_φ^{-1} should be differentiable. Under these conditions, the density of the original variable x is well defined and its likelihood can be computed exactly using the change-of-variables rule as:
log p_x(x) = log p_u(g_φ^{-1}(x)) + log |det J_{g_φ^{-1}}(x)|
           = log p_u(g_φ^{-1}(x)) + Σ_{t=1}^{T} log |det J_{g_t^{-1}}(u_{t−1})|        (1)
where the first term on the right-hand side is the log-likelihood under the base distribution, and J_{g_t^{-1}}(u_{t−1}) is the Jacobian matrix of the inverse of the transformation g_t. To train the NF model and learn the source data distribution, the model parameters φ are typically optimized so as to minimize the negative log-likelihood in Eq. 1. This results in the following loss function:
L_NF = − log p_x(x)        (2)
¹ Note that, for simplicity, we assume here that there exists only a single target domain. Nevertheless, our formulation directly applies to T different domains.
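To make the change-of-variables computation concrete, below is a minimal PyTorch sketch of an NF wrapper implementing Eq. 1. It assumes each transform exposes an inverse(x) method returning the transformed tensor together with its log-determinant; this interface is a simplification of the architecture described later, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Flow(nn.Module):
    """Stack of invertible transforms; log_prob follows the change-of-variables rule (Eq. 1)."""

    def __init__(self, transforms):
        super().__init__()
        # Assumption: each transform exposes inverse(x) -> (u, log_det), with log_det of shape (B,).
        self.transforms = nn.ModuleList(transforms)
        self.base_dist = torch.distributions.Normal(0.0, 1.0)  # tractable base density p_u

    def log_prob(self, x):
        u = x
        total_log_det = torch.zeros(x.shape[0], device=x.device)
        for transform in reversed(self.transforms):
            u, log_det = transform.inverse(u)        # map the image toward the base distribution
            total_log_det = total_log_det + log_det  # accumulate the log|det J| terms of Eq. 1
        return self.base_dist.log_prob(u).flatten(1).sum(dim=1) + total_log_det
```

Minimizing −log_prob(x) over source images then corresponds to the loss in Eq. 2 above.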
Building the Normalizing Flow. To build a bijective transformation function for the NF model,
stacking a sequence of affine coupling layers [7, 17] has been demonstrated to be an efficient
strategy. Because flows based on coupling layers are computationally symmetric, i.e., equally fast
to evaluate or invert, they can overcome the usability issues of asymmetric flows such as masked
autoregressive flows, making them a popular choice. Let us consider z ∈ R^D as the input to the coupling layer, which is split into a disjoint partition (z_A, z_B) ∈ R^d × R^{D−d}. The transformation function g(·) : R^D → R^D can then be defined as:

y_A = z_A,    y_B = z_B ⊙ exp(s(z_A)) + t(z_A)        (3)
This design yields a simple Jacobian determinant, which makes it possible to use arbitrarily complex neural networks for the scale s(·) and shift t(·) functions. Note that the transformation in Eq. 3 is invertible and its Jacobian is triangular, allowing the efficient computation of the determinant term in Eq. 1. The work
in [7] presented coupling flows on simpler tasks and datasets, e.g., CIFAR, which required less
enriched representations. In contrast, the problem at hand requires pixel-to-pixel mappings on more
challenging images. Thus, we replace the simple convolutional blocks in [7] with shallow U-shaped
convolutional neural networks to find the shift and scale parameters of the affine transformation, as
they capture more global context and provide higher representation power. Furthermore, as NFs are
based on the change of variables rule, which is defined in continuous space, it is crucial to make the
input continuous. Dequantization of the input can be achieved by adding uniform noise u ∼ U[0, 1] to the discrete values. However, this might result in a hypercube representation of the images with sharp borders, which are hard for a flow to model since it relies on smooth transformations.
Recently, a variational framework was proposed [15] to extend dequantization to more sophisticated
distributions, by replacing the uniform distribution with a learnable distribution.
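As an illustration of Eq. 3, the sketch below shows an affine coupling layer compatible with the wrapper above. The net predicting the log-scale and shift is a placeholder for the shallow U-shaped CNNs described in this section, the checkerboard/channel masking strategies are reduced to a simple channel split, and variational dequantization is omitted.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling layer (Eq. 3): y_A = z_A, y_B = z_B * exp(s(z_A)) + t(z_A)."""

    def __init__(self, net: nn.Module):
        super().__init__()
        self.net = net  # maps z_A to the concatenated (log-scale s, shift t); a shallow U-Net in the paper

    def forward(self, z):
        z_a, z_b = z.chunk(2, dim=1)               # disjoint partition (z_A, z_B) along channels
        s, t = self.net(z_a).chunk(2, dim=1)       # predict log-scale and shift from z_A only
        y_b = z_b * torch.exp(s) + t               # transform z_B; z_A passes through unchanged
        log_det = s.flatten(1).sum(dim=1)          # triangular Jacobian: log|det J| = sum of log-scales
        return torch.cat([z_a, y_b], dim=1), log_det

    def inverse(self, y):
        y_a, y_b = y.chunk(2, dim=1)
        s, t = self.net(y_a).chunk(2, dim=1)
        z_b = (y_b - t) * torch.exp(-s)            # exact inverse of the forward transform
        return torch.cat([y_a, z_b], dim=1), -s.flatten(1).sum(dim=1)
```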
Constraining the source-distribution learning. Optimizing the objective in Eq. 2 with only source
images might bias the model to focus on characteristics of subjects, such as age and gender, rather
than on source-specific features like contrast and brightness. To overcome this issue, we propose a
strategy that facilitates the learning of the source-domain distribution. This technique consists in
randomly selecting N′ images from the original dataset X_S and applying a series of augmentations
faug (·) such that the resulting image has a dissimilarity to the original image (measured by mean
squared distance) higher than a specified threshold. In particular, we employ contrast augmentation,
brightness changes, multiplication, and random monotonically increasing mapping functions to
augment these images. Then, the total learning objective of our model can be defined as:

L_total = − Σ_{n=1}^{N−N′} log p_x(x_n) + Σ_{n=N−N′+1}^{N} max(0, c + log p_x(f_aug(x_n)))        (4)
The first term is the learning objective in Eq. 2 over the original source images, whereas the second
one forces the NF model to decrease the likelihood on the augmented images, which facilitates the
learning of domain-specific characteristics (e.g., contrast or brightness) instead of subject-related
features (e.g., sex or age). Furthermore, we use a constant margin c in the second term to prevent the
negative log-likelihood of an augmented sample from diverging to infinity.
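A hedged sketch of this combined training objective, assuming a flow module exposing log_prob as in the earlier sketch; the hinge formulation with margin c follows our reading of the description in this paragraph.

```python
import torch

def nf_training_loss(flow, x_source, x_augmented, margin=1.2):
    """NLL on source images plus a hinged term that makes augmented images unlikely,
    capped by the margin c so that their NLL does not diverge (cf. Eq. 4)."""
    nll_source = -flow.log_prob(x_source)                  # standard maximum-likelihood term
    nll_augmented = -flow.log_prob(x_augmented)            # NLL of the augmented images
    hinge = torch.clamp(margin - nll_augmented, min=0.0)   # no gradient once the NLL exceeds the margin
    return nll_source.mean() + hinge.mean()
```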
Pre-training the harmonizer network. Following the first step of our framework, a shallow harmonizer network h_θ(·) is pre-trained on the source domain to recover the original images from their augmented versions, generated with the same augmentation function f_aug(·) as above. The initial learning objective of the harmonizer network thus becomes:

θ_init = argmin_θ (1/N) Σ_{n=1}^{N} ‖x_n − h_θ(f_aug(x_n))‖²        (5)
We stress that the performed augmentations are not reliable representations of potential unseen target
domains. Consequently, the direct application of the learned parameters θinit for image-to-image
mapping will result in suboptimal domain transformations. Nevertheless, they can serve as the initial
model for the subsequent step. A simple U-Net is used for the harmonizer network, which predicts two quantities. First, the output of its last layer is employed as a bias map β with the same dimensions as the input image. Second, a scalar coefficient α is derived from the middle layer of the network. In this way, the output of the harmonizer is defined as h_θ(x) = α · x + β.
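The sketch below illustrates this output parameterization; UNetBackbone is a hypothetical module returning a full-resolution map (used as β) and a bottleneck feature map (from which the scalar α is derived), not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class Harmonizer(nn.Module):
    """Predicts a bias map beta and a scalar alpha so that h_theta(x) = alpha * x + beta."""

    def __init__(self, backbone: nn.Module, bottleneck_channels: int):
        super().__init__()
        self.backbone = backbone                             # U-Net-like network (placeholder)
        self.alpha_head = nn.Linear(bottleneck_channels, 1)  # scalar coefficient from the middle layer

    def forward(self, x):
        beta, bottleneck = self.backbone(x)                  # beta has the same shape as the input image
        pooled = bottleneck.mean(dim=(2, 3))                 # global average pooling over space
        alpha = self.alpha_head(pooled).view(-1, 1, 1, 1)    # one coefficient per image
        return alpha * x + beta                              # h_theta(x) = alpha * x + beta
```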
Guiding the harmonizer network with the Normalizing Flow. The final step involves updating
the harmonizer network so that images from the target domain are mapped into the source domain
distribution. To achieve this, we propose to leverage the trained NF, which is stacked at the output
of the harmonizer network. Note that the NF model has already learned the distribution of source
data, and therefore its parameters remain frozen during the adaptation of the harmonizer. Thus, the
learning objective of the adaptation stage consists in increasing the likelihood of the harmonizer
outputs for images from the target domain, based on the NF model’s density estimation. This loss
function can be formally defined as follows:
L_Adap = − Σ_{m=1}^{M} log p_x(h_θ(x_m))        (6)
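A minimal sketch of this adaptation stage, reusing the flow and harmonizer interfaces from the previous sketches and a generic target_loader over target-domain batches; the learning rate and number of steps are placeholders, and the stopping criteria are discussed next.

```python
import torch

def adapt_harmonizer(harmonizer, flow, target_loader, lr=1e-4, steps=100):
    """Update the harmonizer so that its outputs become likely under the frozen source NF (Eq. 6)."""
    for p in flow.parameters():
        p.requires_grad_(False)                      # the NF stays frozen during adaptation
    optimizer = torch.optim.Adam(harmonizer.parameters(), lr=lr)
    data_iter = iter(target_loader)
    for _ in range(steps):
        try:
            x_target = next(data_iter)
        except StopIteration:                        # restart the loader when exhausted
            data_iter = iter(target_loader)
            x_target = next(data_iter)
        harmonized = harmonizer(x_target)            # map target images toward the source domain
        loss = -flow.log_prob(harmonized).mean()     # maximize likelihood under the source NF
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return harmonizer
```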
As a stopping criterion for updating the harmonizer, we evaluate two possible alternatives. First, we
measure the Shannon entropy of the predictions for the target task (e.g., segmentation or classification),
stopping the adaptation when the entropy plateaus. We also consider the bits per dimension (bpd), a
scaled version of the negative log-likelihood widely used for evaluating generative models: bpd = − log p_x(x) · (log 2 · ∏_i Ω_i)^{−1}, where Ω_1, ..., Ω_T denote the spatial dimensions of the input images. More
concretely, we can stop updating the harmonizer parameters when the reached bpd value is the same
as the one observed for the source domain using the NF model. In practice, this value can be obtained
at training time using a validation set.
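For reference, a minimal bits-per-dimension helper consistent with the definition above, again assuming the flow interface of the earlier sketches.

```python
import math
import torch

def bits_per_dimension(flow, x):
    """bpd = -log p_x(x) / (log(2) * number of input dimensions), averaged over the batch."""
    nll = -flow.log_prob(x)               # negative log-likelihood in nats, one value per image
    num_dims = x[0].numel()               # product of the per-image dimensions
    return (nll / (math.log(2) * num_dims)).mean()
```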
4 Experiments
4.1 Experimental setting
We evaluate the proposed method on the task of brain MRI segmentation across multiple sites. This choice is motivated by the fact that segmentation performance is a reliable indicator of whether structural information is well preserved during the mapping.
Datasets. Four sites of the Autism Brain Imaging Data Exchange (ABIDE) [6] dataset are employed:
California Institute of Technology (CALTECH), Kennedy Krieger Institute (KKI), University of
Pittsburgh School of Medicine (PITT) and NYU Langone Medical Center (NYU). The selection
of these sites is based on their cross-site differences, as these datasets present the most distinct intensity histograms from each other, which better highlights the impact of harmonization. These sites are
denoted as D1 , D2 , D3 , and D4 , respectively. From each site, we selected 20 T1-weighted MRIs from
the healthy control population (19 from CALTECH), which were skull-stripped, motion-corrected,
and quantized to 256 levels of intensity. 2D coronal slices of 60% of these images are used for
training, 15% for validation, and the remaining 25% for testing. Furthermore, the segmentation
labels are obtained from FreeSurfer [10], following other large-scale studies [9], and grouped into 15
labels: background, cerebellum gray matter, cerebellum WM, cerebral GM, cerebral WM, thalamus,
hippocampus, amygdala, ventricles, caudate, putamen, pallidum, ventral DC, CSF, and brainstem.
Harmonization baselines. The proposed approach is benchmarked against a set of relevant har-
monization and image-to-image translation methods. We first consider a simple Baseline applying
the segmentation network directly on non-harmonized images, in order to assess the impact of
each harmonization approach. Our comparison also includes: Histogram Matching [24], aleatoric
uncertainty estimation (AUE) [31], Combat [25], BigAug [34] (which uses heavy augmentations for
generalization of the segmentation networks), and two popular generative-based approaches, i.e.,
Cycle-GAN [22] and Style-Transfer [21].
Evaluation protocol. To assess the performance of our harmonization approach, we resort to a
segmentation task as it requires the preservation of fine-grained structural details. First, a segmentation
network SΦ (·) is trained on the images from the source domain, whose parameters remain frozen
thereafter. The harmonized images from each method are then employed to evaluate segmentation
performance, which is measured with the Dice Similarity Coefficient (DSC) and modified Hausdorff
distance (HD). To evaluate the robustness of tested methods, we repeat the experiments four times,
each employing a different source and set of target domains. These different settings are denoted as A :
D1 → {D2 , D3 , D4 }; B : D2 → {D1 , D3 , D4 }; C : D3 → {D1 , D2 , D4 }; D : D4 → {D1 , D2 , D3 }.
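For completeness, a standard per-class Dice computation of the kind used in this evaluation (a generic formulation, not the authors' exact evaluation script).

```python
import torch

def dice_per_class(pred, target, num_classes, eps=1e-6):
    """Dice similarity coefficient for each class, given integer label maps of shape (B, H, W)."""
    scores = []
    for c in range(num_classes):
        p = (pred == c).float()
        t = (target == c).float()
        intersection = (p * t).sum()
        scores.append((2.0 * intersection + eps) / (p.sum() + t.sum() + eps))
    return torch.stack(scores)
```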
Implementation details. The normalizing flow model is trained for 1600 epochs using the Adam optimizer with an initial learning rate of 1 × 10−3, decayed by a factor of 0.5 every 200 epochs, and a batch size of 32. We use a U-shaped network inside the coupling layers, which consists of four
levels of different scales with a scaling factor of 2. Each level includes a modified version of the
ELU activation function, i.e., concat(ELU(x), ELU(−x)), and a convolutional layer followed by a
normalizing layer. To construct the NF model, we first cascade four coupling layers with checkerboard
masking to learn the noise distribution using variational dequantization. After applying four of the
same coupling layers, features are squeezed as explained in [7] to have a lower spatial dimension
and more channels. We then add four coupling layers using a channel-masking strategy, another
feature squeezing function, and a final set of four coupling layers with channel-masking. The overall
architecture of the flow model is shown in Fig. 1. The margin c used for guiding the flow is set
empirically to 1.2. The harmonizer has five levels of different scales with a scaling factor of 2, each
level including two layers of the modified ELU activation function followed by a convolutional layer.
The number of kernels at each level is 16, 32, 48, 64, and 64, respectively. The harmonizer is trained for 200 epochs using the Adam optimizer with a learning rate starting at 1 × 10−3, decayed by a factor of 0.5 every 30 epochs, and a batch size of 32. The segmentation network is trained for 200 epochs using the Adam optimizer with an initial learning rate of 4 × 10−3, decayed by a factor of 0.5 every 30 epochs, and a batch size of 32. All models were implemented in PyTorch and run on NVIDIA RTX
A6000 GPU cards.
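A minimal sketch of the optimization setup described above for the NF model; the interpretation of the 0.5 decay as a step-wise learning-rate schedule is ours.

```python
import torch

def build_nf_optimization(flow):
    """Adam optimizer with an initial learning rate of 1e-3, halved every 200 epochs (step schedule)."""
    optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
    return optimizer, scheduler
```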
4.2 Results
Table 1: Performance overview. Main results for the compared methods across different settings (A, B, C, D). SF: source-free; TA: task-agnostic; UD: unseen (unknown) target domains. The best results are highlighted in bold.

DSC (%)
Method                SF  TA  UD   A           B           C           D           Average
Baseline              –   –   –    54.6 ±7.5   60.8 ±4.6   62.9 ±5.8   72.6 ±4.5   62.7 ±5.6
AUE [31]              ✓   ✗   ✓    54.7 ±7.4   60.7 ±4.7   62.6 ±5.7   72.4 ±4.5   62.6 ±5.6
Hist matching [24]    ✓   ✓   ✓    55.7 ±8.6   58.1 ±5.1   62.2 ±4.8   69.5 ±4.9   61.4 ±5.9
Combat [25]           ✓   ✓   ✓    75.7 ±9.2   79.9 ±6.0   79.5 ±8.1   79.9 ±7.8   78.7 ±7.8
BigAug [34]           ✓   ✗   ✓    54.2 ±7.6   67.9 ±3.6   61.5 ±4.5   78.0 ±3.7   65.4 ±4.8
Cycle-GAN [22]        ✗   ✓   ✗    74.5 ±3.0   78.8 ±2.9   80.1 ±2.2   83.1 ±2.0   79.1 ±2.5
Style-transfer [21]   ✓   ✓   ✗    56.9 ±7.1   80.0 ±1.7   67.8 ±4.9   73.4 ±4.0   69.5 ±4.4
Ours                  ✓   ✓   ✓    80.8 ±3.2   82.3 ±2.2   83.2 ±3.3   85.2 ±1.5   82.9 ±2.6

HD (mm)
Method                SF  TA  UD   A             B            C            D            Average
Baseline              –   –   –    18.20 ±8.27   9.57 ±3.23   9.07 ±2.78   5.73 ±1.81   10.64 ±4.03
AUE [31]              ✓   ✗   ✓    17.57 ±8.18   9.67 ±3.39   9.03 ±2.87   5.57 ±1.85   10.46 ±4.08
Hist matching [24]    ✓   ✓   ✓    17.40 ±8.10   10.47 ±3.77  12.00 ±4.56  6.73 ±2.40   11.65 ±4.71
Combat [25]           ✓   ✓   ✓    5.23 ±3.87    3.67 ±2.47   3.30 ±1.80   3.17 ±2.14   3.84 ±2.57
BigAug [34]           ✓   ✗   ✓    19.53 ±10.51  8.43 ±3.40   18.87 ±7.76  3.70 ±1.07   12.63 ±5.69
Cycle-GAN [22]        ✗   ✓   ✗    4.63 ±2.89    3.63 ±1.93   2.63 ±0.62   2.30 ±0.55   3.30 ±1.50
Style-transfer [21]   ✓   ✓   ✗    14.23 ±7.20   2.93 ±0.78   7.53 ±2.38   4.27 ±1.35   7.24 ±2.92
Ours                  ✓   ✓   ✓    3.10 ±1.63    2.77 ±0.87   2.37 ±0.77   2.30 ±0.50   2.63 ±0.94
As reported in Table 3, although minimum entropy is a better criterion than bpd, both achieve comparable performance. In addition, both stopping criteria are suitable choices, as their results are very close to those of the Oracle.
Qualitative results. Figure 2 depicts several examples of harmonized images produced by the
proposed approach. These results illustrate that, regardless of the target domain, our method produces
reliable image-to-image mappings to the source distribution.
Results when N4 bias correction is applied. In previous sections, we used the original MRIs of
the ABIDE dataset without bias correction to evaluate the proposed harmonization method on more
challenging scenarios, where pre-processing steps to enhance the images might not be applicable.
Compared to bias-corrected MRIs, original MRIs have arguably more complex distributions, which
makes it more difficult for harmonization methods to map MRIs from a target domain to the source
one. To demonstrate that our method also achieves satisfactory performance when the initial domain
shifts are reduced, we repeated the previous steps with N4 bias-corrected MRIs. These results,
shown in Table 4, also showcase the advantage of our method in this different setting. For conciseness,
we report here the average results for the HD metric (in mm) across different methods: Baseline
(5.04±1.41), Hist matching (4.45±1.33), Combat (3.28±0.75), BigAUG (2.64±0.47), Cycle-GAN
(2.55 ± 0.31), Style-transfer (3.77 ± 1.07) and ours (2.36 ± 0.62).
Figure 2: Examples of harmonized images produced by the proposed method.
Experiments on Test-Time Adaptation (TTA). Our model can also be employed in a TTA scenario,
where the model needs to be updated at inference time for a given image or set of images. To illustrate this use case, we compare the performance of our approach to the popular TENT model [30]. Adapting the segmentation network S_Φ(·) with TENT yields a DSC of 65.1 ± 5.0, which represents a
considerable gap compared to our model, i.e., 82.9 ± 2.6. Note that there exist other TTA methods
for segmentation in the medical field, e.g., [16], however, they require segmentation masks for the
adaptation.
5 Conclusion
In this paper, we proposed a novel harmonization method which leverages Normalizing Flows to
guide the adaptation of a harmonizer network. Our approach is source-free, task-agnostic, and works
with unseen domains. These characteristics make our model applicable to real-life problems where the source domain might not be accessible during adaptation, target domains are unknown at training
time, and harmonization is not dependent on a specified target task. The proposed method achieves
state-of-the-art harmonization performance based on the segmentation task, yet relaxes the strong
assumptions made by existing harmonization strategies. Thus, we believe that our model is a powerful
alternative for MRI multi-site harmonization.
References
[1] Abdelhamed, A., Brubaker, M.A., Brown, M.S.: Noise flow: Noise modeling with conditional
normalizing flows. In: ICCV. pp. 3165–3173 (2019)
[2] Beer, J.C., et al.: Longitudinal Combat: A method for harmonizing longitudinal multi-scanner
imaging data. Neuroimage 220, 117129 (2020)
[3] Boudiaf, M., et al.: Parameter-free Online Test-time Adaptation. In: CVPR. pp. 8344–8353
(2022)
[4] Delisle, P.L., et al.: Realistic image normalization for multi-domain segmentation. Medical
Image Analysis 74, 102191 (2021)
[5] Dewey, B.E., et al.: Deepharmony: A deep learning approach to contrast harmonization across
scanner changes. Magnetic resonance imaging 64, 160–170 (2019)
[6] Di Martino, A., et al.: The autism brain imaging data exchange: towards a large-scale evaluation
of the intrinsic brain architecture in autism. Molecular psychiatry 19(6), 659–667 (2014)
[7] Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real NVP. In: ICLR (2017)
[8] Dinsdale, N.K., et al.: Deep learning-based unlearning of dataset bias for MRI harmonisation
and confound removal. NeuroImage 228, 117689 (2021)
[9] Dolz, J., Desrosiers, C., Ayed, I.B.: 3D fully convolutional networks for subcortical segmenta-
tion in MRI: A large-scale study. NeuroImage 170, 456–470 (2018)
[10] Fischl, B.: Freesurfer. Neuroimage 62(2), 774–781 (2012)
[11] Fortin, J.P., et al.: Removing inter-subject technical variability in magnetic resonance imaging
studies. NeuroImage 132, 198–212 (2016)
[12] Fortin, J.P., et al.: Harmonization of multi-site diffusion tensor imaging data. Neuroimage 161,
149–170 (2017)
[13] Grover, A., et al.: Alignflow: Cycle consistent learning from multiple domains via normalizing
flows. In: AAAI. pp. 4028–4035 (2020)
[14] Gudovskiy, D., et al.: Cflow-ad: Real-time unsupervised anomaly detection with localization
via conditional normalizing flows. In: WACV. pp. 98–107 (2022)
[15] Ho, J., et al.: Flow++: Improving flow-based generative models with variational dequantization
and architecture design. In: ICML. pp. 2722–2730 (2019)
[16] Karani, N., et al.: Test-time adaptable neural networks for robust medical image segmentation.
Medical Image Analysis 68, 101907 (2021)
[17] Kingma, D.P., Dhariwal, P.: Glow: Generative flow with invertible 1x1 convolutions. NeurIPS
31 (2018)
[18] Kirichenko, P., Izmailov, P., Wilson, A.G.: Why normalizing flows fail to detect out-of-
distribution data. NeurIPS 33, 20578–20589 (2020)
[19] Kobyzev, I., Prince, S.J., Brubaker, M.A.: Normalizing flows: An introduction and review of
current methods. IEEE PAMI 43(11), 3964–3979 (2020)
[20] Liang, J., et al.: Do we really need to access the source data? Source hypothesis transfer for
unsupervised domain adaptation. In: ICML. pp. 6028–6039 (2020)
[21] Liu, M., et al.: Style transfer using generative adversarial networks for multi-site MRI harmo-
nization. In: MICCAI. pp. 313–322 (2021)
[22] Modanwal, G., et al.: MRI image harmonization using cycle-consistent generative adversarial
network. In: SPIE Medical Imaging 2020. vol. 11314, pp. 259–264 (2020)
[23] Mummadi, C.K., et al.: Test-time adaptation to distribution shift by confidence maximization
and input transformation. ICLR (2022)
[24] Nyúl, L.G., Udupa, J.K., Zhang, X.: New variants of a method of MRI scale standardization.
IEEE transactions on medical imaging 19(2), 143–150 (2000)
[25] Pomponio, R., et al.: Harmonization of large MRI datasets for the analysis of brain imaging
patterns throughout the lifespan. NeuroImage 208, 116450 (2020)
[26] Shinohara, R., et al.: Statistical normalization techniques for magnetic resonance imaging.
NeuroImage: Clinical 6, 9–19 (2014)
[27] Sun, Y., et al.: Test-time training with self-supervision for generalization under distribution
shifts. In: ICML. pp. 9229–9248 (2020)
[28] Takao, H., et al.: Effect of scanner in longitudinal studies of brain volume changes. Journal of
Magnetic Resonance Imaging 34(2), 438–444 (2011)
[29] Usman, B., et al.: Log-likelihood ratio minimizing flows: Towards robust and quantifiable
neural distribution alignment. NeurIPS 33, 21118–21129 (2020)
[30] Wang, D., et al.: TENT: Fully Test-Time Adaptation by Entropy Minimization. In: ICLR (2020)
[31] Wang, G., et al.: Aleatoric uncertainty estimation with test-time augmentation for medical
image segmentation with convolutional neural networks. Neurocomputing 338, 34–45 (2019)
[32] Wang, R., et al.: Harmonization with flow-based causal inference. In: MICCAI. pp. 181–190
(2021)
[33] Zang, C., Wang, F.: Moflow: an invertible flow model for generating molecular graphs. In:
Proceedings of the 26th ACM SIGKDD. pp. 617–626 (2020)
[34] Zhang, L., et al.: Generalizing deep learning for medical image segmentation to unseen domains
via deep stacked transformation. IEEE TMI pp. 2531–2540 (2020)
[35] Zhu, J.Y., et al.: Unpaired image-to-image translation using cycle-consistent adversarial net-
works. In: ICCV. pp. 2223–2232 (2017)
[36] Zuo, L., et al.: Information-based disentangled representation learning for unsupervised MR
harmonization. In: IPMI. pp. 346–359 (2021)