CNN-Based Projected Gradient Descent For Consistent CT Image Reconstruction
Abstract — We present a new image reconstruction method that replaces the projector in a projected gradient descent (PGD) with a convolutional neural network (CNN). Recently, CNNs trained as image-to-image regressors have been successfully used to solve inverse problems in imaging. However, unlike existing iterative image reconstruction algorithms, these CNN-based approaches usually lack a feedback mechanism to enforce that the reconstructed image is consistent with the measurements. We propose a relaxed version of PGD wherein gradient descent enforces measurement consistency, while a CNN recursively projects the solution closer to the space of desired reconstruction images. We show that this algorithm is guaranteed to converge and, under certain conditions, converges to a local minimum of a non-convex inverse problem. Finally, we propose a simple scheme to train the CNN to act like a projector. Our experiments on sparse-view computed-tomography reconstruction show an improvement over total variation-based regularization, dictionary learning, and a state-of-the-art deep learning-based direct reconstruction technique.

Index Terms — Deep learning, inverse problems, biomedical image reconstruction, low-dose computed tomography.

I. INTRODUCTION

The reconstruction task can be formulated as an inverse problem where the image-formation physics are modeled by an operator H : R^N → R^M (called the forward model). The measurement equation is y = Hx + n ∈ R^M, where x ∈ R^N is the space-domain image that we are interested in recovering and n ∈ R^M is the noise intrinsic to the acquisition process.

In the case of extreme imaging, the number of measurements is reduced as much as possible to decrease either the radiation dose in computed tomography (CT) or the scanning time in MRI. Moreover, the measurements are typically very noisy due to short integration times, which calls for some form of denoising. Indeed, there may be significantly fewer measurements than the number of unknowns (M ≪ N). This gives rise to an ill-posed problem in the sense that there may be an infinity of consistent images that map to the same measurements y. Thus, one challenge of the reconstruction algorithm is to select the best solution among a multitude of potential candidates.

The available reconstruction algorithms can be broadly arranged in three categories (or generations), which represent the continued efforts of the research community to address the aforementioned challenges.

1) Classical Algorithms: Here, the reconstruction is per-
Within this cascade the data-fidelity is enforced at multiple steps. However, in all of these approaches the training is performed end-to-end, meaning that the network parameters are dependent on the iterative scheme chosen.

These approaches differ from plug-and-play ADMM [26]–[28], where an independent off-the-shelf denoiser or a trained operator is plugged into the iterative scheme of the alternating-direction method of multipliers (ADMM) [14]. ADMM is an iterative optimization technique that alternates between (i) a linear solver that reinforces consistency with respect to the measurements; and (ii) a nonlinear operation that re-injects the prior. The idea of plug-and-play ADMM is to replace (ii), which resembles denoising, with an off-the-shelf denoiser. Plug-and-play ADMM is more general than the optimization framework (1) but still lacks theoretical justifications. In fact, there is little understanding yet of the connection between the use of a given denoiser and the regularization it imposes (though this link has recently been explored in [29]).

In [30], a generative adversarial network (GAN) trained as a projector onto a set has been used with the plug-and-play ADMM. Similarly, in [31], the inverse problem is solved over a set parameterised by a generative model. However, it requires a precise initialization of the parameters. In [32], similarly to us, the projector in PGD is replaced with a neural network. However, the scheme lacks a convergence guarantee and a rigorous theoretical analysis.

Our scheme is similar in spirit to plug-and-play ADMM, but is simpler to analyze. Although our methodology is generic and can be applied in principle to any inverse problem, our experiments here involve sparse-view X-ray CT reconstruction. For a recent overview of the field, see [33]. Current approaches to sparse-view CT reconstruction follow the formulation (1), e.g., using a penalized weighted least-squares data term and sparsity-promoting regularizer [34], a dictionary-learning-based regularizer [35], or a generalized total-variation regularizer [36]. There are also prior works on the direct application of CNNs to CT reconstruction. These methods generally use the CNN to denoise the sinogram [37] or the reconstruction obtained from a standard technique [16], [38]–[40]; as such, they do not perform the reconstruction directly.

C. Roadmap
The paper is organized as follows: In Section II, we discuss the mathematical framework that motivates our approach and justify the use of a projector onto a set as an effective strategy to solve inverse problems. In Section III, we present our algorithm, which is a relaxed version of PGD. It has been modified so as to converge in practical cases where the projection property is only approximate. We discuss in Section IV a novel technique to train the CNN as a projector onto a set, especially when the training data is small. This is followed by experiments (Section V), results and discussions (Section VI and Section VII), and conclusions (Section VIII).

II. THEORETICAL FRAMEWORK

Our goal is to use a trained CNN iteratively inside PGD to solve an inverse problem. To understand why this scheme will be effective, we first analyze how using a projector onto a set, combined with gradient descent, can be helpful in solving inverse problems. Properties of PGD using an orthogonal projector onto a convex set are known [41]. Here, we extend these results to any projector onto a nonconvex set. This extension is required because there is no guarantee that the set of desirable reconstruction images is convex. Proofs of all the results in this section can be found in the supplementary material.

A. Notation
We consider the finite-dimensional Hilbert space R^N equipped with the scalar product ⟨·, ·⟩ that induces the ℓ₂ norm ‖·‖₂. The spectral norm of the matrix H, denoted by ‖H‖₂, is equal to its largest singular value. For x ∈ R^N and ε > 0, we denote by Bε(x) the ℓ₂-ball centered at x with radius ε, i.e.,

Bε(x) = {z ∈ R^N : ‖z − x‖₂ ≤ ε}.

The operator T : R^N → R^N is Lipschitz-continuous with constant L if

‖T(x) − T(z)‖₂ ≤ L‖x − z‖₂, ∀x, z ∈ R^N.

It is contractive if it is Lipschitz-continuous with constant L < 1 and non-expansive if L = 1. A fixed point x∗ of T (if any) satisfies T(x∗) = x∗.

Given the set S ⊂ R^N, the mapping PS : R^N → S is called a projector if it satisfies the idempotent property PS PS = PS. It is called an orthogonal projector if

PS(x) = arg inf_{z∈S} ‖x − z‖₂, ∀x ∈ R^N.

B. Constrained Least Squares
Consider the problem of the reconstruction of the image x ∈ R^N from its noisy measurements y = Hx + n, where H ∈ R^{M×N} is the linear forward model and n ∈ R^M is additive white Gaussian noise. The framework is also applicable to Poisson noise model-based CT via a suitable transformation, as shown in Appendix B.

Our reconstruction incorporates a strong form of prior knowledge about the original image: We assume that x must lie in a set S ⊂ R^N that contains all objects of interest. The proposed way to make the reconstruction consistent with the measurements as well as with the prior knowledge is to solve the constrained least-squares problem

min_{x∈S} (1/2)‖Hx − y‖₂².   (3)

The condition x ∈ S in (3) plays the role of a regularizer. If no two points in S have the same measurements and in case y is noiseless, then, out of all the points in R^N that are consistent with the measurement y, (3) selects a unique point x∗ ∈ S. In this way, the ill-posedness of the inverse problem is bypassed. When the measurements are noisy, (3) returns a point x∗ ∈ S such that y∗ = Hx∗ is as close as possible to y. Thus, it also denoises the measurement, where the quantity y∗ can be regarded as the denoised version of y. Note that formulation (3) is similar to (2) for the case when E is least-squares, with the difference that the search space is the data manifold S instead of a set S_R defined by the regularizer.
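As a concrete illustration of the data term in (3), the following MATLAB sketch evaluates (1/2)‖Hx − y‖₂² and its gradient H^T(Hx − y) from generic function handles; this gradient is the quantity that drives all of the projected iterations discussed next. The toy matrix H and the handles fwd and adj are illustrative placeholders (in the CT experiments of Section V they would wrap radon and iradon), not part of the released code.

    % Data term of the constrained least-squares problem (3) and its gradient.
    % fwd and adj are placeholder handles for H and H'; a small random matrix
    % keeps the example self-contained.
    rng(0);
    H   = randn(20, 50);               % toy forward model with M = 20 < N = 50
    fwd = @(x) H * x;                  % applies H
    adj = @(r) H' * r;                 % applies H'
    x   = randn(50, 1);                % current estimate
    y   = fwd(randn(50, 1));           % toy measurements
    cost = 0.5 * norm(fwd(x) - y)^2;   % (1/2)||Hx - y||_2^2
    grad = adj(fwd(x) - y);            % H'(Hx - y), used in every gradient step
    fprintf('cost = %.3f, ||grad|| = %.3f\n', cost, norm(grad));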
The point x∗ ∈ S is called a local minimizer of (3) if

∃ε > 0 : ‖Hx∗ − y‖₂ ≤ ‖Hx − y‖₂, ∀x ∈ S ∩ Bε(x∗).

C. Projected Gradient Descent
When S is a closed convex set, it is well known [41] that a solution of (3) can be found by PGD

xk+1 = PS(xk − γH^T Hxk + γH^T y),   (4)

where γ is a step size chosen such that γ < 2/‖H^T H‖₂. This algorithm combines the orthogonal projection onto S with the gradient descent with respect to the quadratic objective function, also called the Landweber update [42]. PGD [43, Sec. 2.3] is a subclass of the forward-backward splitting [44], [45], which is known in the ℓ₁-minimization literature as iterative shrinkage/thresholding algorithms (ISTA) [11], [12], [46].
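As a minimal numerical illustration of the update (4), the sketch below runs PGD with the orthogonal projector onto the non-negative orthant standing in for PS; the toy problem is our own construction (in the paper, PS is eventually replaced by a trained CNN).

    % Projected gradient descent (4) on a toy non-negative least-squares problem,
    % with S = {x : x >= 0} and its orthogonal projector standing in for P_S.
    rng(1);
    H = randn(30, 60);
    y = H * max(randn(60, 1), 0);          % measurements of a non-negative signal
    gamma = 1.9 / norm(H)^2;               % step size gamma < 2 / ||H'H||_2
    PS = @(x) max(x, 0);                   % orthogonal projector onto S
    x  = zeros(60, 1);
    for k = 1:200
        x = PS(x - gamma * (H' * (H * x)) + gamma * (H' * y));   % update (4)
    end
    fprintf('data fit after PGD: %.2e\n', norm(H * x - y));

The same loop, with PS replaced by an averaged or relaxed operator, gives the APGD and RPGD variants analyzed below.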
iterating the averaged operator α Id +(1 − α)G γ , where α ∈
In our problem, S is presumably non-convex, but we
(0, 1) and Id is the identity operator. We call this scheme
propose to still use the update (4) with some projector PS that
averaged PGD (APGD).
may not be orthogonal. In the rest of this section, we provide
Theorem 2: Let λmax be the largest eigenvalue of HT H.
sufficient conditions on the projector PS (not on S itself) under
If PS satisfies (6) and is a non-expansive operator such that
which (4) leads to a local minimizer of (3). Similarly to the
G γ in (5) has a fixed point for some γ < 2/λmax , then the
convex case, we characterize the local minimizers of (3) by
sequence {xk } generated by APGD, with
the fixed points of the combined operator
G γ (x) = PS (x − γ HT Hx + γ HT y) (5) xk+1 = (1 − α)xk + αG γ (xk ) (8)
and then show that some fixed point of that operator must be for any α ∈ (0, 1), converges to a local minimizer of (3),
reached by the iteration xk+1 = G γ (xk ) as k → ∞, regardless regardless of the initialization x0 .
of the initial point x0 . We first state a sufficient condition for
each fixed point of G γ to become a local minimizer of (3). III. R ELAXATION WITH G UARANTEED C ONVERGENCE
Proposition 1: Let γ > 0 and PS be such that, for all Despite their elegance, Theorems 1 and 2 are not directly
x ∈ RN , productive when we construct the projector PS by training
z − PS x , x − PS x ≤ 0, ∀z ∈ S ∩ Bε (PS x), (6) a CNN because it is unclear how to enforce the Lipschitz
continuity of PS on the CNN architecture. Without putting
for some ε > 0. Then, any fixed point of the operator G γ in (5) any constraints on the CNN, however, we can still achieve the
is a local minimizer of (3). Furthermore, if (6) is satisfied convergence of the reconstruction sequence by modifying PGD
globally, in the sense that as described in Algorithm 1; we name it relaxed projected
gradient descent (RPGD). In Algorithm 1, the projector PS
z − PS x , x − PS x ≤ 0, ∀x ∈ R N , z ∈ S, (7) is replaced by the general nonlinear operator F. We also
then any fixed point of G γ is a solution of (3). introduce a sequence {ck } that governs the rate of convergence
Two remarks are in order. First, (7) is a well-known of the algorithm and a sequence {αk } of relaxation parameters
property of orthogonal projections onto closed convex sets. that evolves with the algorithm. The convergence of RPGD is
It actually implies the convexity of S (see Proposition 2). guaranteed by Theorem 3. More importantly, if the nonlinear
Second, (6) is much more relaxed and easily achievable, for operator F is actually a projector and the relaxation parameters
example, as stated in Proposition 3, by orthogonal projections do not go all the way to 0, then RPGD converges to a
onto unions of closed convex sets. (Special cases are unions meaningful point.
of subspaces, which have found some applications in data Theorem 3: Let the input sequence {ck } of Algorithm 1
modeling and clustering [47]). be asymptotically upper-bounded by C < 1. Then,
Proposition 2: If PS is a projector onto S ⊂ R N that the following statements hold true for the reconstruction
satisfies (7), then S must be convex. sequence {xk }:
Proposition 3: If S is a union of a finite number of closed (i) xk → x∗ as k → ∞, for all choices of F;
convex sets in R N , then the orthogonal projector PS onto S (ii) if F is continuous and the relaxation parameters {αk }
satisfies (6). are lower-bounded by ε > 0, then x∗ is a fixed
the residual-net architecture does not increase the capacity or the approximation power of the CNN, it does help in learning functions that are close to an identity operator, as is the case in our setting.

B. Sequential Training Strategy
We train the CNN in three stages. In Stage 1, we train it for T1 epochs with respect to the partial-loss function J2 in (11), which only uses the ensemble {x̃q,2} generated by (13). In Stage 2, we add the ensemble {x̃q,3} according to (14) at every epoch and then train the CNN with respect to the loss function J2 + J3; we repeat this procedure for T2 epochs. Finally, in Stage 3, we train the CNN for T3 epochs with all three ensembles {x̃q,1, x̃q,2, x̃q,3} to minimize the original loss function J = J1 + J2 + J3 from (11).

We shall see in Section VII-B that this sequential procedure speeds up the training without compromising the performance. The parameters of Unet are initialized by a normal distribution with a very low variance. Since CNN = Id + Unet, this function acts close to an identity operator in the initial epochs and makes it redundant to use {x̃q,1} for the initial training stages. Therefore, {x̃q,1} is only added at the last stage, when the CNN is no longer close to an identity operator. After training with only {x̃q,2} in Stage 1, x̃q,3 will be close to x^q since it is the output of the CNN for the input x̃q,2. This eases the training for {x̃q,3} in the second and third stages.

V. EXPERIMENTS

We validate the proposed method on the challenging case of sparse-view CT reconstruction. Conventionally, CT imaging requires many views to obtain a good-quality reconstruction. We call this scenario full-dose reconstruction. Our main aim in these experiments is to reduce the number of views (or dose) for CT imaging while retaining the quality of full-dose reconstructions. We denote a k-times reduction in views by ×k.

The measurement operator H for our experiments is the Radon transform. It maps an image to the values of its integrals along a known set of lines [2]. In 2D, the measurements are indexed by the angle and offset of each line and arranged in a 2D sinogram. We implemented H and H^T with Matlab's radon and iradon (normalized to satisfy the adjoint property), respectively. The Matlab code for RPGD and for the sequential-strategy-based training is made publicly available.¹

¹https://github.com/harshit-gupta-epfl/CNN-RPGD
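The adjoint normalization mentioned above can be verified numerically: for random x and y, the inner product ⟨Hx, y⟩ must equal ⟨x, H^T y⟩. The sketch below performs this test with radon and an unfiltered iradon backprojection and estimates the scalar by which the backprojection has to be divided so that the pair behaves as H and H^T. This is our own sanity check, not code from the released package.

    % Compare <Hx, y> with <x, By>, where B is the unfiltered backprojection,
    % and estimate the scale that turns B into the adjoint H'.
    N     = 128;
    theta = linspace(0, 180, 145); theta(end) = [];     % 144 views over [0, 180)
    x     = rand(N);                                     % random test image
    sino  = radon(x, theta);                             % H*x
    y     = rand(size(sino));                            % random test sinogram
    backp = iradon(y, theta, 'linear', 'none', 1, N);    % unfiltered backprojection
    lhs   = sino(:)' * y(:);                             % <Hx, y>
    rhs   = x(:)' * backp(:);                            % <x, By>
    fprintf('<Hx,y> = %.4e, <x,By> = %.4e, scale = %.4f\n', lhs, rhs, rhs / lhs);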
A. Datasets
We use two datasets for our experiments.
1) Mayo Clinic Dataset: It consists of 500 clinically realistic (512 × 512) CT images from the lower lungs to the lower abdomen of 10 patients. Those were obtained from the Mayo Clinic AAPM Low Dose CT Grand Challenge [52].
2) Rat Brain Dataset: We use a real (1493 px × 720 view × 377 slice) sinogram from a CT scan of a single rat brain. The data acquisition was performed at the Paul Scherrer Institute in Villigen, Switzerland, at the TOMCAT beam line of the Swiss Light Source. During pre-processing, we split this sinogram slice-by-slice and downsampled it to create a dataset of 377 (729 px × 720 view) sinograms. CT images of size (512 × 512) were then generated from these full-dose sinograms (using the FBP; see Section V-C). For the qth z-slice, we denote the corresponding image x_FD^q. For experiments based on this dataset, the first 327 and the last 25 slices are used for training and testing, respectively. This left a gap of 25 slices between the training and testing data.

B. Experimental Setups
We now describe three experimental setups. We use the first dataset for the first experiment and the second for the last two.
1) Experiment 1: We split the Mayo dataset into 475 images from 9 patients for training and 25 images from the remaining patient for testing. We assume these images to be the ground truth. From the qth image x^q, we generated the sparse-view sinogram y^q = Hx^q using several different experimental conditions. Our task is to reconstruct the image from the sinogram.
The sinograms always have 729 offsets per view, but we varied the number of views and the level of measurement noise for different cases. We took 144 views and 45 views, which correspond to ×5 and ×16 dosage reductions (assuming that a full-view sinogram has 720 views). We added Gaussian noise to the sinograms to make the SNR equal to 35, 40, 45, 70, and infinity dB, where we refer to the first three as high measurement noise and the last two as low measurement noise. The SNR of the sinogram y + n is defined as

SNR(y + n, y) = 20 log10 (‖y‖₂/‖n‖₂).   (15)

A small sketch of this noise-generation step is given at the end of this subsection. For testing with the low and high measurement noise, we trained the CNNs without noise and at the 40-dB level of noise, respectively (see Section V-D for details).
To make the experiments more realistic and to reduce the inverse crime, the sinograms were generated by slightly perturbing the angles of the views by a zero-mean additive white Gaussian noise (AWGN) with standard deviation of 0.05 degrees. This creates a deliberate mismatch between the actual measurement process and the forward model.
2) Experiment 2: We used images x_FD^q from the rat-brain dataset to generate Poisson-noise-corrupted sinograms y^q with 144 views. Just as in Experiment 1, the task is to reconstruct x_FD^q back from y^q. Sinograms were generated with 25, 30, and 35 dB SNR with respect to Hx_FD^q. To achieve this, in (26) and (27), we assume the readout noise to be zero and {b1, . . . , bm} = b0 = 1.66 × 10^5, 5.24 × 10^5, and 1.66 × 10^6, respectively. More details about this process are given in Appendix B. The CNNs were trained at only the 30-dB level of noise. Again, our task is to reconstruct the images from the sinograms.
3) Experiment 3: We downsampled the views of the original (729 × 720) rat-brain sinograms by 5 to obtain sparse-view sinograms of size (729 × 144). For the qth z-slice, we denote the corresponding sparse-view sinogram y_Real^q. Note that, unlike in Experiments 1 and 2, the sinogram was not generated from an image but was obtained experimentally.
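As announced in the description of Experiment 1, measurement noise at a prescribed SNR in the sense of (15) is obtained by rescaling white Gaussian noise; a small sketch of this step (our illustration, with phantom used only as a stand-in test image):

    % Add white Gaussian noise n to a sinogram y so that SNR(y+n, y) = snr_dB, per (15).
    snr_dB  = 40;
    y       = radon(phantom(256), 0:1.25:178.75);             % noiseless example sinogram
    n       = randn(size(y));
    n       = n * norm(y(:)) / (norm(n(:)) * 10^(snr_dB/20)); % rescale the noise energy
    y_noisy = y + n;
    fprintf('achieved SNR = %.2f dB\n', 20 * log10(norm(y(:)) / norm(n(:))));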
C. Comparison Methods
Given the ground truth x, our figure of merit for the reconstructed x∗ is the regressed SNR, given by

SNR(x∗, x) = max_{a,b} SNR(ax∗ + b, x),   (16)

where the purpose of a and b is to adjust for contrast and offset. We also evaluate the performance using the structural similarity index (SSIM) [53].
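Because the maximization over a and b in (16) is a linear least-squares fit, the regressed SNR can be computed in closed form. A minimal MATLAB helper (our own, not part of the paper's code):

    % Regressed SNR of a reconstruction xr against the ground truth x, per (16):
    % fit a*xr + b to x in the least-squares sense, then report the resulting SNR.
    function val = regressed_snr(xr, x)
        A   = [xr(:), ones(numel(xr), 1)];
        ab  = A \ x(:);                           % optimal contrast a and offset b
        e   = x(:) - A * ab;                      % residual after the affine fit
        val = 20 * log10(norm(x(:)) / norm(e));
    end

It would be called once per test image, e.g., regressed_snr(x_rec, x_gt), and the values averaged over the test set.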
We compare five reconstruction methods.
1) FBP: FBP is the classical direct inversion of the Radon transform H, here implemented in Matlab by the iradon command with the Ram-Lak filter and linear interpolation as options.
2) Total-Variation Reconstruction: TV solves

xTV = arg min_x { (1/2)‖Hx − y‖₂² + λ‖x‖TV }  s.t.  x ≥ 0,   (17)

where

‖x‖TV = Σ_{i=1}^{N−1} Σ_{j=1}^{N−1} √( (Dh;i,j(x))² + (Dv;i,j(x))² ),

with Dh;i,j(x) = [x]i,j+1 − [x]i,j and Dv;i,j(x) = [x]i+1,j − [x]i,j. The optimization is carried out via ADMM [14]. (A short sketch of this TV seminorm is given after this list.)
3) Dictionary Learning (DL): DL [35] solves

xDL = arg min_{x,α} { ‖Hx − y‖₂² + λ Σ_{j=1}^{J} ( ‖Ej x − Dαj‖₂² + νj‖αj‖₀ ) },   (18)

where Ej : R^{N×N} → R^{L²} extracts and vectorizes the jth patch of size (L × L) from the image x, D ∈ R^{L²×256} is the dictionary, αj is the jth column of α ∈ R^{256×R}, and R = (N − L + 1)². Note that the patches are extracted with a sliding distance of one pixel.
For a given y, the dictionary D is learned from the corresponding ground truth using the procedure described in [54]. The objective (18) is then solved iteratively by first minimizing it with respect to x using gradient descent, as described in [35], and then with respect to α using orthogonal matching pursuit (OMP) [55]. Since D is learned from the testing ground truth itself, the performance that we report here is an upper bound on the one that would be achieved by learning it from the training images.
4) FBPconv: FBPconv [16] is a state-of-the-art deep-learning technique in which a residual CNN with U-net architecture is trained to directly denoise the FBP. It has been shown to outperform other deep-learning-based direct reconstruction methods for sparse-view CT. In our proposed method, we use a CNN with the same architecture as in FBPconv. As a result, in our framework, FBPconv corresponds to training with only the ensemble in (13). In the testing phase, the FBP of the measurements is fed into the trained CNN to output the reconstruction image.
5) RPGD: RPGD is our proposed method. It is described in Algorithm 1. There, the nonlinear operator F is the CNN trained as a projector (as discussed in Section IV). For experiments with Poisson noise, we use the slightly modified RPGD described in Appendix B. For all the experiments, FBP is used for the operator A.
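For reference, the isotropic TV seminorm used in (17) can be evaluated directly with forward finite differences; a minimal sketch (an illustration only, not the ADMM solver used in the experiments):

    % Isotropic TV seminorm ||x||_TV of (17), computed with forward differences.
    function val = tv_norm(x)
        dh  = x(1:end-1, 2:end)   - x(1:end-1, 1:end-1);   % horizontal differences
        dv  = x(2:end,   1:end-1) - x(1:end-1, 1:end-1);   % vertical differences
        val = sum(sum(sqrt(dh.^2 + dv.^2)));
    end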
D. Training and Selection of Parameters
1) Experiment 1: For TV, the regularization parameter λ is selected via a golden-section search over 20 values so as to maximize the SNR of xTV with respect to the ground truth. We set the additional penalty parameter inside ADMM (see [14, eq. (2.6)]) equal to λ. The rationale for this heuristic is that it puts the soft-threshold parameter in the same order of magnitude as the image gradients. We set the number of iterations to 100, which was enough to show good empirical convergence.
For DL, the parameters are selected via a parameter sweep, roughly following the approach described in [35, Table 1]. Specifically, the patch size is L = 8. During dictionary learning, the sparsity level is set to 5 and 10. During reconstruction, the sparsity level for OMP is set to 5, 8, 10, 12, 20, and 25, while the tolerance level is taken to be 10, 100, and 1000. This, in effect, is the same as sweeping over νj in (18). For each of these 2 × 6 × 3 = 36 parameter settings, λ in (18) is chosen by a golden-section search over 7 values.
As discussed earlier, the CNNs for both the ×5 and ×16 cases are trained separately for high and low measurement noise.
a) Training with noiseless measurements: The training of the projector for RPGD follows the sequential procedure described in Section IV, with the configurations
• ×5, no noise: T1 = 80, T2 = 49, T3 = 5;
• ×16, no noise: T1 = 71, T2 = 41, T3 = 11.
We use the CNN obtained right after the first stage for FBPconv since, during this stage, only the training ensemble in (13) is taken into account. We empirically found that the training error J2 converged in the T1 epochs of Stage 1, yielding an optimal performance for FBPconv.
b) Training with 40-dB measurement noise: This includes replacing the ensemble in (13) with {Ay^q}, where y^q = Hx^q + n has a 40-dB SNR with respect to Hx^q. With 20% probability, we also perturb the views of the measurements with an AWGN of 0.05 standard deviation so as to enforce robustness to model mismatch. These CNNs are initialized with the ones obtained after the first stage of the noiseless training and are then trained with the configurations
• ×5, 40-dB noise: T1 = 35, T2 = 49, T3 = 5;
• ×16, 40-dB noise: T1 = 32, T2 = 41, T3 = 11.
Similarly to the previous case, the CNNs obtained after the first and the third training stages are used in FBPconv and RPGD, respectively. For clarity, these variants will be referred to as FBPconv40 and RPGD40.
The learning rate is decreased in a geometric progression from 10^−2 to 10^−3 in Stage 1 and kept at 10^−3 for Stages 2 and 3. Recall that the last two stages contain the ensemble with dynamic perturbation (14), which changes in every epoch. The lower learning rate, therefore, avoids drastic changes in parameters between the epochs. The batch size is fixed to 2.
Fig. 2. Comparison of reconstructions using different methods for the ×16 case in Experiment 1. First column: reconstruction from noiseless measurements of a lung image. Second column: zoomed version of the area marked by the box in the original in the first column. Third and fourth columns: zoomed versions for the 45-dB and 35-dB cases, respectively. Fifth to eighth columns: corresponding results for an abdomen image. Seventh and eighth columns correspond to 45 and 40 dB, respectively. (a) Results (∞ dB). (b) Zoom (∞ dB). (c) Zoom (45 dB). (d) Zoom (35 dB). (e) Results (∞ dB). (f) Zoom (∞ dB). (g) Zoom (45 dB). (h) Zoom (40 dB).
TABLE I
Reconstruction Results for Experiment 1 With Low Measurement Noise (Gaussian). Gray Cells Indicate That the Method Was Tuned/Trained for the Corresponding Noise Level.
The other hyper-parameters follow [16]. For stability, gradients above 10^−2 are clipped and the momentum is set to 0.99. The total training time for the noiseless case is around 21.5 hours on a Titan X GPU (Pascal architecture).
The hyper-parameters for RPGD are chosen as follows: The relaxation parameter α0 is initialized with 1, and the sequence {ck} is set to the constant C = 0.99 for RPGD and C = 0.8 for RPGD40. For each noise level and number of views, the only free parameter, γ, is swept over 20 values geometrically spaced between 10^−2 and 10^−5. We pick the γ that gives the best average SNR over the 25 test images. Note that, for TV and DL, the value of the optimal λ generally increases as the measurement noise increases; however, no such obvious relation exists for γ. This is mainly because it is the step size of the gradient descent in RPGD and not a regularization parameter. In all experiments, the gradient step is skipped during the first iteration.
On the GPU, one iteration of RPGD takes less than 1 second. The algorithm is stopped when the residual ‖xk+1 − xk‖₂ reaches a value less than 1, which is sufficiently small compared to the dynamic range [0, 350] of the image. It takes around 1-2 minutes to reconstruct an image with RPGD.
2) Experiment 2: For this case, the CNNs are trained similarly to the CNN for RPGD40 in Experiment 1. Perturbations (12)-(14) are used, with AHx_FD^q in (13) replaced by Ay^q, where y^q has 30-dB Poisson noise. The x_FD^q and Ay_Real^q are multiplied by a constant so that their maximum pixel value is 480.
The CNN obtained after the first stage is used as FBPconv. While testing, we keep C = 0.4. The other training hyper-parameters and testing parameters of RPGD are kept the same as those of RPGD40 for the ×5 case in Experiment 1.
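Putting the above parameters together, the reconstruction loop can be sketched as follows. The sketch assumes generic handles fwd and adj for H and H^T and cnn_project for the trained operator F; it reflects the settings stated in this subsection (α0 = 1, a constant ck = C, step size γ, gradient step skipped at the first iteration, and stopping when ‖xk+1 − xk‖₂ < 1). The rule used below for shrinking αk (reduce it whenever the update fails to contract by the factor C) is only one simple choice in that spirit; the exact update of αk is specified by Algorithm 1 and the released code.

    % Schematic RPGD loop: gradient step on the data term, trained CNN as F, and a
    % relaxation alpha that is reduced whenever successive residuals fail to
    % contract by the factor C (one simple rule in the spirit of Algorithm 1).
    function x = rpgd_sketch(fwd, adj, y, cnn_project, x0, gamma, C, max_iter)
        x = x0;  alpha = 1;  prev_res = Inf;
        for k = 1:max_iter
            if k == 1
                z = cnn_project(x);                            % gradient step skipped first
            else
                z = cnn_project(x - gamma * adj(fwd(x) - y));  % gradient step, then F
            end
            res = norm(z(:) - x(:));
            if res > C * prev_res                   % residual did not contract enough:
                alpha = alpha * C * prev_res / res; % shrink the relaxation parameter
            end
            x_new = (1 - alpha) * x + alpha * z;    % relaxed update
            if norm(x_new(:) - x(:)) < 1            % stopping criterion from the text
                x = x_new;  return;
            end
            x = x_new;  prev_res = res;
        end
    end

Keeping C < 1 forces αk‖zk − xk‖₂ to decay once the CNN outputs stop contracting, which is what makes the iterates converge for any choice of F, in line with Theorem 3(i).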
TABLE II
Reconstruction Results for Experiment 1 With High Measurement Noise (Gaussian). Gray Cells Indicate That the Method Was Tuned/Trained for the Corresponding Noise Level.

Fig. 3. Profile of the high- and low-contrast regions marked in the first and fifth columns of Figure 2 by solid and dashed line segments, respectively. First and second columns: ×16, 45-dB noise case for the lung image. Third and fourth columns: ×16, 40-dB noise case for the abdomen image. (a) High-contrast profile. (b) Low-contrast profile. (c) High-contrast profile. (d) Low-contrast profile.
Fig. 5. Convergence with iteration k of RPGD for the Experiment 1, ×16, no-noise case when C = 0.99. Results are averaged over 25 test images. (a) SNRs of xk with respect to the ground-truth image. (b) SNRs of Hxk with respect to the ground-truth sinogram. (c) Evolution of the relaxation parameters αk. In (a) and (b), the FBP, FBPconv, and TV results are independent of the RPGD iteration k but have been shown for the sake of comparison.
Note that all the results discussed in Sections II and III apply to Problem (33). As a consequence, we use Algorithm 1 to solve the problem with the following small change in the gradient step:

zk = F(xk − γH^T Hxk + γH^T y).   (34)

ACKNOWLEDGMENT

The authors thank Emmanuel Soubies for his helpful suggestions on training the CNN and Dr. Cynthia McCollough, the Mayo Clinic, the American Association of Physicists in Medicine, and the National Institute of Biomedical Imaging and Bioengineering for the Mayo Clinic dataset. They also thank Dr. Marco Stampanoni, Swiss Light Source, Paul Scherrer Institute, Villigen, Switzerland, for the rat-brain dataset. They also thankfully acknowledge the support of the NVIDIA Corporation in providing the Titan X GPU for this research.

REFERENCES

[1] M. Lustig, D. Donoho, and J. M. Pauly, "Sparse MRI: The application of compressed sensing for rapid MR imaging," Magn. Reson. Med., vol. 58, no. 6, pp. 1182–1195, 2007.
[2] A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging (Classics in Applied Mathematics). New York, NY, USA: SIAM, 2001.
[3] X. C. Pan, E. Y. Sidky, and M. Vannier, "Why do commercial CT scanners still employ traditional, filtered back-projection for image reconstruction?" Inverse Problems, vol. 25, no. 12, p. 123009, 2009.
[4] C. Bouman and K. Sauer, "A generalized Gaussian image model for edge-preserving MAP estimation," IEEE Trans. Image Process., vol. 2, no. 3, pp. 296–310, Jul. 1993.
[5] P. Charbonnier, L. Blanc-Féraud, G. Aubert, and M. Barlaud, "Deterministic edge-preserving regularization in computed imaging," IEEE Trans. Image Process., vol. 6, no. 2, pp. 298–311, Feb. 1997.
[6] E. Candès and J. Romberg, "Sparsity and incoherence in compressive sampling," Inverse Probl., vol. 23, no. 3, pp. 969–985, 2007.
[7] S. Ramani and J. A. Fessler, "Parallel MR image reconstruction using augmented Lagrangian methods," IEEE Trans. Med. Imag., vol. 30, no. 3, pp. 694–706, Mar. 2011.
[8] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.
[9] E. J. Candès, Y. C. Eldar, D. Needell, and P. Randall, "Compressed sensing with coherent and redundant dictionaries," Appl. Comput. Harmon. Anal., vol. 31, no. 1, pp. 59–73, Jul. 2011.
[10] S. Ravishankar, R. R. Nadakuditi, and J. A. Fessler, "Efficient sum of outer products dictionary learning (SOUP-DIL) and its application to inverse problems," IEEE Trans. Comput. Imag., vol. 3, no. 4, pp. 694–709, Dec. 2017.
[11] M. A. T. Figueiredo and R. D. Nowak, "An EM algorithm for wavelet-based image restoration," IEEE Trans. Image Process., vol. 12, no. 8, pp. 906–916, Aug. 2003.
[12] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, Nov. 2004.
[13] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[14] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, Jan. 2011.
[15] M. T. McCann, K. H. Jin, and M. Unser, "Convolutional neural networks for inverse problems in imaging: A review," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 85–95, Nov. 2017.
[16] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, "Deep convolutional neural network for inverse problems in imaging," IEEE Trans. Image Process., vol. 26, no. 9, pp. 4509–4522, Sep. 2017.
[17] Y. S. Han, J. Yoo, and J. C. Ye. (2017). "Deep learning with domain adaptation for accelerated projection-reconstruction MR." [Online]. Available: https://arxiv.org/abs/1703.01135
[18] S. Antholzer, M. Haltmeier, and J. Schwab. (2017). "Deep learning for photoacoustic tomography from sparse data." [Online]. Available: https://arxiv.org/abs/1704.04587
[19] S. Wang et al., "Accelerating magnetic resonance imaging via deep learning," in Proc. IEEE Int. Symp. Biomed. Imag. (ISBI), Apr. 2016, pp. 514–517.
[20] A. Mousavi and R. G. Baraniuk. (2017). "Learning to invert: Signal recovery via deep convolutional networks." [Online]. Available: https://arxiv.org/abs/1701.03891
[21] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 399–406.
[22] Y. Yang, J. Sun, H. Li, and Z. Xu, "Deep ADMM-net for compressive sensing MRI," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2016, pp. 10–18.
[23] J. Adler and O. Öktem. (2017). "Solving ill-posed inverse problems using iterative deep neural networks." [Online]. Available: https://arxiv.org/abs/1704.04058
[24] P. Putzky and M. Welling. (2017). "Recurrent inference machines for solving inverse problems." [Online]. Available: https://arxiv.org/abs/1706.04008
[25] J. Schlemper, J. Caballero, J. V. Hajnal, A. Price, and D. Rueckert, "A deep cascade of convolutional neural networks for MR image reconstruction," in Proc. Int. Conf. Inf. Process. Med. Imag., 2017, pp. 647–658.
[26] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, "Plug-and-play priors for model based reconstruction," in Proc. IEEE Global Conf. Signal Inf. Process. (GlobalSIP), Dec. 2013, pp. 945–948.
[27] S. H. Chan, X. Wang, and O. A. Elgendy, "Plug-and-play ADMM for image restoration: Fixed-point convergence and applications," IEEE Trans. Comput. Imag., vol. 3, no. 1, pp. 84–98, Jan. 2017.
[28] S. Sreehari et al., "Plug-and-play priors for bright field electron tomography and sparse interpolation," IEEE Trans. Comput. Imag., vol. 2, no. 4, pp. 408–423, Dec. 2016.
[29] Y. Romano, M. Elad, and P. Milanfar, "The little engine that could: Regularization by denoising (RED)," SIAM J. Imag. Sci., vol. 10, no. 4, pp. 1804–1844, 2017.
[30] J. H. R. Chang, C.-L. Li, B. Póczos, B. V. K. V. Kumar, and A. C. Sankaranarayanan. (2017). "One network to solve them all—Solving linear inverse problems using deep projection models." [Online]. Available: https://arxiv.org/abs/1703.09912
[31] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. (2017). "Compressed sensing using generative models." [Online]. Available: https://arxiv.org/abs/1703.03208
[32] B. Kelly, T. P. Matthews, and M. A. Anastasio. (2017). "Deep learning-guided image reconstruction from incomplete data." [Online]. Available: https://arxiv.org/abs/1709.00584
[33] J. Z. Liang, P. J. La Riviere, G. El Fakhri, S. J. Glick, and J. Siewerdsen, "Guest editorial low-dose CT: What has been done, and what challenges remain?" IEEE Trans. Med. Imag., vol. 36, no. 12, pp. 2409–2416, Dec. 2017.
[34] S. Ramani and J. A. Fessler, "A splitting-based iterative algorithm for accelerated statistical X-ray CT reconstruction," IEEE Trans. Med. Imag., vol. 31, no. 3, pp. 677–688, Mar. 2012.
[35] Q. Xu, H. Yu, X. Mou, L. Zhang, J. Hsieh, and G. Wang, "Low-dose X-ray CT reconstruction via dictionary learning," IEEE Trans. Med. Imag., vol. 31, no. 9, pp. 1682–1697, Sep. 2012.
[36] S. Niu et al., "Sparse-view X-ray CT reconstruction via total generalized variation regularization," Phys. Med. Biol., vol. 59, no. 12, pp. 2997–3017, 2014.
[37] L. Gjesteby, Q. Yang, Y. Xi, Y. Zhou, J. Zhang, and G. Wang, "Deep learning methods to guide CT image reconstruction and reduce metal artifacts," Proc. SPIE, vol. 10132, p. 101322W, Mar. 2017.
[38] H. Chen et al., "Low-dose CT with a residual encoder-decoder convolutional neural network," IEEE Trans. Image Process., vol. 36, no. 12, pp. 2524–2535, Dec. 2017.
[39] E. Kang, J. Min, and J. C. Ye, "A deep convolutional neural network using directional wavelets for low-dose X-ray CT reconstruction," Med. Phys., vol. 44, no. 10, pp. e360–e375, Oct. 2017.
[40] Y. S. Han, J. Yoo, and J. C. Ye. (2016). "Deep residual learning for compressed sensing CT reconstruction via persistent homology analysis." [Online]. Available: https://arxiv.org/abs/1611.06391
[41] B. Eicke, "Iteration methods for convexly constrained ill-posed problems in Hilbert space," Numer. Funct. Anal. Optim., vol. 13, nos. 5–6, pp. 413–429, 1992.
[42] L. Landweber, "An iteration formula for Fredholm integral equations of the first kind," Amer. J. Math., vol. 73, no. 3, pp. 615–624, Jul. 1951.
[43] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Cambridge, MA, USA: Athena Scientific, 1999.
[44] P. L. Combettes and V. R. Wajs, "Signal recovery by proximal forward-backward splitting," Multiscale Model. Simul., vol. 4, no. 4, pp. 1168–1200, 2005.
[45] P. L. Combettes and J.-C. Pesquet, Proximal Splitting Methods in Signal Processing. New York, NY, USA: Springer, 2011, pp. 185–212.
[46] J. Bect, L. Blanc-Féraud, G. Aubert, and A. Chambolle, "A ℓ1-unified variational framework for image restoration," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2004, pp. 1–13.
[47] A. Aldroubi and R. Tessera, "On the existence of optimal unions of subspaces for data modeling and clustering," Found. Comput. Math., vol. 11, no. 3, pp. 363–379, Jun. 2011.
[48] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[49] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Proc. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), 2015, pp. 234–241.
[50] A. E. Orhan and X. Pitkow. (2017). "Skip connections eliminate singularities." [Online]. Available: https://arxiv.org/abs/1701.09175
[51] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[52] C. McCollough, "TU-FG-207A-04: Overview of the low dose CT grand challenge," Med. Phys., vol. 43, no. 6, pp. 3759–3760, 2016.
[53] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Proc. 37th Asilomar Conf. Signals, Syst. Comput., vol. 2, Nov. 2003, pp. 1398–1402.
[54] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," J. Mach. Learn. Res., vol. 11, pp. 19–60, Mar. 2010.
[55] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. Inf. Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
[56] K. Sauer and C. Bouman, "A local update strategy for iterative reconstruction from projections," IEEE Trans. Signal Process., vol. 41, no. 2, pp. 534–548, Feb. 1993.
[57] I. A. Elbakri and J. A. Fessler, "Statistical image reconstruction for polyenergetic X-ray computed tomography," IEEE Trans. Med. Imag., vol. 21, no. 2, pp. 89–99, Feb. 2002.
[58] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces. New York, NY, USA: Springer, 2011.