Thomas Pock
ICG, Graz University of Technology, AIT, Austria
E-mail: pock@icg.tugraz.at
CONTENTS
1 Introduction 2
2 Typical optimization problems in imaging 5
3 Notation and basic notions of convexity 12
4 Gradient methods 20
5 Saddle-point methods 49
6 Non-convex optimization 75
7 Applications 81
A Abstract convergence theory 128
B Proof of Theorems 4.1, 4.9 and 4.10. 131
C Convergence rates for primal–dual algorithms 136
References 140
2 A. Chambolle and T. Pock
1. Introduction
The purpose of this paper is to describe, and illustrate with numerical ex-
amples, the fundamentals of a branch of continuous optimization dedicated
to problems in imaging science, in particular image reconstruction, inverse
problems in imaging, and some simple classification tasks. Many of these
problems can be modelled by means of an ‘energy’, ‘cost’ or ‘objective’ which
represents how ‘good’ (or bad!) a solution is, and must be minimized.
These problems often share a few characteristic features. One is their size,
which can be very large (typically involving at most around a billion vari-
ables, for problems such as three-dimensional image reconstruction, dense
stereo matching, or video processing) but usually not ‘huge’ like some recent
problems in learning or statistics. Another is the fact that for many prob-
lems, the data are structured in a two- or three-dimensional grid and interact
locally. A final, frequent and fundamental feature is that many useful prob-
lems involve non-smooth (usually convex) terms, for reasons that are now
well understood and concern the concepts of sparsity (DeVore 1998, Candès,
Romberg and Tao 2006b, Donoho 2006, Aharon, Elad and Bruckstein 2006)
and robustness (Ben-Tal and Nemirovski 1998).
These features have strongly influenced the type of numerical algorithms
used and further developed to solve these problems. Due to their size and
lack of smoothness, higher-order methods such as Newton’s method, or
methods relying on precise line-search techniques, are usually ruled out,
although some authors have suggested and successfully implemented quasi-
Newton methods for non-smooth problems of the kind considered here (Ito
and Kunisch 1990, Chan, Golub and Mulet 1999).
Hence these problems will usually be tackled with first-order descent
methods, which are essentially extensions and variants of a plain gradi-
ent descent, appropriately adapted to deal with the lack of smoothness of
the objective function. To tackle non-smoothness, one can either rely on
controlled smoothing of the problem (Nesterov 2005, Becker, Bobin and
Candès 2011) and revert to smooth optimization techniques, or ‘split’ the
problem into smaller subproblems which can be exactly (or almost) solved,
and combine these resolutions in a way that ensures that the initial problem
is eventually solved. This last idea is now commonly referred to as ‘proxi-
mal splitting’ and, although it relies on ideas from as far back as the 1950s
or 1970s (Douglas and Rachford 1956, Glowinski and Marroco 1975), it has
been a very active topic in the past ten years in image and signal processing,
as well as in statistical learning (Combettes and Pesquet 2011, Parikh and
Boyd 2014).
Hence, we will focus mainly on proximal splitting (descent) methods,
and primarily for convex problems (or extensions, such as finding zeros of
maximal-monotone operators). We will introduce several important prob-
Optimization for imaging 3
¹ Of course, what follows is also valid for images/signals defined on a one- or three-dimensional domain.
We will also frequently need the operator norm $\|D\|$, which is estimated as
$$\|D\| \le \sqrt{8} \tag{2.5}$$
(see Chambolle 2004b). The discrete ROF model is then defined by
$$\min_u \; \lambda\|Du\|_{p,1} + \frac12\|u - u^\diamond\|_2^2, \tag{2.6}$$
where $u^\diamond \in \mathbb{R}^{m\times n}$ is the given noisy image, and the discrete total variation is defined by
$$\|Du\|_{p,1} = \sum_{i=1,\,j=1}^{m,n} |(Du)_{i,j}|_p = \sum_{i=1,\,j=1}^{m,n} \bigl(|(Du)_{i,j,1}|^p + |(Du)_{i,j,2}|^p\bigr)^{1/p},$$
that is, the ℓ1-norm of the p-norm of the pixelwise image gradients.² The
parameter p can be used, for example, to realize anisotropic (p = 1) or
isotropic (p = 2) total variation. Some properties of the continuous model,
such as the co-area formula, carry over to the discrete model only if p = 1,
but the isotropic total variation is often preferred in practice since it does
not exhibit a grid bias.
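Both variants are straightforward to compute. The following sketch (our own code; the function name, the use of right differences as in footnote 2, and the Neumann-type boundary handling are our choices, not part of the original text) evaluates $\|Du\|_{p,1}$ for the anisotropic and isotropic cases:

```python
import numpy as np

def discrete_tv(u, p=2):
    """||Du||_{p,1} for a 2-D image u, using right (forward) differences,
    with differences set to zero at the border."""
    dx = np.zeros_like(u)
    dy = np.zeros_like(u)
    dx[:-1, :] = u[1:, :] - u[:-1, :]     # (Du)_{i,j,1}
    dy[:, :-1] = u[:, 1:] - u[:, :-1]     # (Du)_{i,j,2}
    if p == 1:    # anisotropic TV: sum of absolute values
        return np.abs(dx).sum() + np.abs(dy).sum()
    if p == 2:    # isotropic TV: sum of pixelwise Euclidean norms
        return np.sqrt(dx**2 + dy**2).sum()
    return ((np.abs(dx)**p + np.abs(dy)**p) ** (1.0 / p)).sum()
```

On an image with a single corner discontinuity the anisotropic value exceeds the isotropic one, reflecting the grid bias of $p = 1$ mentioned above.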
From a sparsity point of view, the idea of the total variation denoising
model is that the ℓ1 -norm induces sparsity in the gradients of the image,
hence it favours piecewise constant images with sparse edges. On the other
hand, this property – also known as the staircasing effect – might be con-
sidered a drawback for some applications. Some workarounds for this issue
will be suggested in Example 4.7 and Section 7.2. The isotropic case (p = 2)
can also be interpreted as a very simple form of group sparsity, grouping
together the image derivatives in each spatial dimension.
In many practical problems it is necessary to incorporate an additional
linear operator in the data-fitting term. Such a model is usually of the form
$$\min_u \; \lambda\|Du\|_{p,1} + \frac12\|Au - u^\diamond\|_2^2, \tag{2.7}$$
where $A : \mathbb{R}^{m\times n} \to \mathbb{R}^{k\times l}$ is a linear operator, $u^\diamond \in \mathbb{R}^{k\times l}$ is the given data,
and k, l will depend on the particular application. Examples include image
deblurring, where A models the blur kernel, and magnetic resonance imag-
ing (MRI), where the linear operator is usually a combination of a Fourier
transform and the coil sensitivities; see Section 7.4 for details.
The quadratic data-fitting term of the ROF model is specialized for zero-
mean Gaussian noise. In order to apply the model to other types of noise,
different data-fitting terms have been proposed. When the noise is impulsive
or contains gross outliers, a simple yet efficient modification is to replace
² Taking only right differences is of course arbitrary, and may lead to anisotropy issues. However, this is rarely important for applications (Chambolle, Levine and Lucier 2011).
Figure 2.1. Total variation based image denoising. (a) Original input image, and
(b) noisy image containing additive Gaussian noise with standard deviation σ = 0.1.
(c) Denoised image obtained by minimizing the ROF model using λ = 0.1.
the quadratic data-fitting term with an ℓ1 -data term. The resulting model,
called the TV-ℓ1 model, is given by
$$\min_u \; \lambda\|Du\|_{p,1} + \|u - u^\diamond\|_1. \tag{2.8}$$
This model has many nice properties such as noise robustness and contrast
invariance (Nikolova 2004, Chan and Esedoglu 2004). However, this does
not come for free. While the ROF model still contains some regularity in
the data term that can be exploited during optimization, the TV-ℓ1 model
is completely non-smooth and hence significantly more difficult to minimize.
Figure 2.2. An image deblurring problem. (a) Original image, and (b) blurry
and noisy image (Gaussian noise with standard deviation σ = 0.01) together
with the known blur kernel. (c, d) Image deblurring without (λ = 0) and with
(λ = 5 × 10−4 ) total variation regularization. Observe the noise amplification when
there is no regularization.
with $|u_{i,j}|_p = \bigl(\sum_{k=1}^{r} |u_{i,j,k}|^p\bigr)^{1/p}$ denoting the $p$-vector norm acting on the single pixels. Similarly, if the pixels are matrix-valued (or tensor-valued), that is, $U_{i,j} \in \mathbb{R}^{r\times s}$, we have $U = (U_{1,1}, \ldots, U_{m,n}) \in \mathbb{R}^{m\times n\times r\times s}$, and we will consider matrix norms, acting on the single pixels $U_{i,j}$.
³ This definition avoids the risky expression $(+\infty) + (-\infty)$; see for instance Rockafellar (1997, Section 4).
which is convex, l.s.c., and proper when C is convex, closed and non-empty.
The minimization of such functions will allow us to easily model convex
constraints in our problems.
3.2. Subgradient
Given a convex, extended real valued, l.s.c. function $f : \mathcal{X} \to [-\infty, +\infty]$, we recall that its subgradient at a point $x$ is defined as the set
$$\partial f(x) := \{\, p \in \mathcal{X} : f(y) \ge f(x) + \langle p, y - x\rangle \ \text{for all } y \in \mathcal{X} \,\}.$$
An obvious remark which stems from the definition is that this notion al-
lows us to generalize Fermat’s stationary conditions (∇f (x) = 0 if x is a
minimizer of f ) to non-smooth convex functions: we indeed have
x ∈ X is a global minimizer of f if and only if 0 ∈ ∂f (x). (3.1)
The function is strongly convex or 'µ-convex' if in addition, for $x, y \in \mathcal{X}$ and $p \in \partial f(x)$, we have
$$f(y) \ge f(x) + \langle p, y - x\rangle + \frac{\mu}{2}\|y - x\|^2$$
or, equivalently, if $x \mapsto f(x) - \mu\|x\|^2/2$ is also convex. It is then, obviously, strictly convex as it satisfies
$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y) - \mu\,\frac{t(1-t)}{2}\|y - x\|^2 \tag{3.2}$$
for any x, y and any t ∈ [0, 1]. A trivial but important remark is that if f
is strongly convex and x is a minimizer, then we have (since 0 ∈ ∂f (x))
$$f(y) \ge f(x) + \frac{\mu}{2}\|y - x\|^2$$
for all $y \in \mathcal{X}$.
The domain of $f$ is the set $\operatorname{dom} f = \{x \in \mathcal{X} : f(x) < +\infty\}$, while the domain of $\partial f$ is the set $\operatorname{dom} \partial f = \{x \in \mathcal{X} : \partial f(x) \ne \emptyset\}$. Clearly $\operatorname{dom} \partial f \subset \operatorname{dom} f$; in fact if $f$ is convex, l.s.c. and proper, then $\operatorname{dom} \partial f$ is dense in $\operatorname{dom} f$ (Ekeland and Témam 1999). In finite dimensions, one can show that for a proper convex function, $\operatorname{dom} \partial f$ contains at least the relative interior of $\operatorname{dom} f$ (that is, the interior in the vector subspace which is generated by $\operatorname{dom} f$).
We must mention here that subgradients of convex l.s.c. functions are only a particular class of maximal monotone operators, which are multivalued operators $T : \mathcal{X} \to \mathcal{P}(\mathcal{X})$ such that
$$\langle p - q, x - y\rangle \ge 0 \quad \text{for all } (x, y) \in \mathcal{X}^2,\ p \in Tx,\ q \in Ty, \tag{3.5}$$
and whose graph $\{(x, p) : p \in Tx\} \subset \mathcal{X} \times \mathcal{X}$ is maximal (with respect to inclusion) in the class of graphs of operators which satisfy (3.5). Strongly
monotone and co-coercive monotone operators are defined accordingly. It is
also almost obvious from the definition that any maximal monotone operator
T has an inverse T −1 defined by x ∈ T −1 p ⇔ p ∈ T x, which is also maximal
monotone. The operators ∂f and ∂f ∗ are inverse in this sense. Examples
of maximal monotone operators which are not subgradients of a convex
function are given by skew-symmetric operators. See, for instance, Brézis
(1973) for a general study of maximal monotone operators in Hilbert spaces.
$$x = \operatorname{prox}_{\tau f}(x) + \tau \operatorname{prox}_{f^*/\tau}(x/\tau), \tag{3.8}$$
which in fact holds for any maximal monotone operators $T, T^{-1}$. It shows in particular that if we know how to compute $\operatorname{prox}_{\tau f}$, then we also know how to compute $\operatorname{prox}_{f^*/\tau}$. Finally, we will sometimes let $\operatorname{prox}^M_{\tau f}(x)$ denote the proximity operator computed in the metric $M$, that is, the solution of
$$\min_{y \in \mathcal{X}} \; f(y) + \frac{1}{2\tau}\|y - x\|_M^2.$$
where
$$f : \mathcal{Y} \to (-\infty, +\infty], \qquad g : \mathcal{X} \to (-\infty, +\infty]$$
are convex l.s.c. functions and $K : \mathcal{X} \to \mathcal{Y}$ is a bounded linear operator. Then, since $f = f^{**}$, one can write
$$\min_{x \in \mathcal{X}} f(Kx) + g(x) = \min_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} \langle y, Kx\rangle - f^*(y) + g(x).$$
which is always non-negative (even if the min and sup cannot be swapped),
vanishes if and only if (x, y) is a saddle point.
Finally we remark that
$$T\begin{pmatrix} x \\ y \end{pmatrix} := \begin{pmatrix} \partial g(x) \\ \partial f^*(y) \end{pmatrix} + \begin{pmatrix} 0 & K^* \\ -K & 0 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \tag{3.15}$$
is a maximal monotone operator, being the sum of two maximal monotone operators, only one of which is a subgradient, and the conditions above can be written
$$T\begin{pmatrix} x^* \\ y^* \end{pmatrix} \ni 0.$$
Example 3.1 (dual of the ROF model). As an example, consider the
minimization problem (2.6) above. This problem has the general form (3.9),
with $x = u$, $K = D$, $f = \lambda\|\cdot\|_{p,1}$ and $g = \|\cdot{} - u^\diamond\|^2/2$. Hence the dual problem (3.11) reads
$$\max_p \; -f^*(p) - \frac12\|D^*p\|^2 + \langle D^*p, u^\diamond\rangle = -\min_p \Bigl(f^*(p) + \frac12\|D^*p - u^\diamond\|^2\Bigr) + \frac12\|u^\diamond\|^2,$$
where $p \in \mathbb{R}^{m\times n\times 2}$ is the dual variable. Equation (3.13) shows that the
solution u of the primal problem is recovered from the solution p of the
dual by letting $u = u^\diamond - D^*p$. One interesting observation is that the dual
ROF model, with f ∗ being a norm, has almost exactly the same structure
as the Lasso problem (2.2).
In this example, f is a norm, so f ∗ is the indicator function of the polar
ball: in this case the dual variable has the structure p = (p1,1 , . . . , pm,n ),
where pi,j = (pi,j,1 , pi,j,2 ) is the per pixel vector-valued dual variable, and
therefore
$$f^*(p) = \delta_{\{\|\cdot\|_{q,\infty} \le \lambda\}}(p) = \begin{cases} 0 & \text{if } |p_{i,j}|_q \le \lambda \text{ for all } i,j, \\ +\infty & \text{else,} \end{cases} \tag{3.16}$$
where q is the parameter of the polar norm ball which is defined via 1/p +
1/q = 1. The most relevant cases are p = 1 or p = 2. In the first case we
have q = +∞, so the corresponding constraint reads
$$|p_{i,j}|_\infty = \max\{|p_{i,j,1}|,\ |p_{i,j,2}|\} \le \lambda \quad \text{for all } i,j.$$
In the second case we have q = 2, and the corresponding constraint reads
$$|p_{i,j}|_2 = \sqrt{p_{i,j,1}^2 + p_{i,j,2}^2} \le \lambda \quad \text{for all } i,j.$$
Of course, more complex norms can be used, such as the nuclear norm for
colour images. In this case the per pixel dual variable pi,j will be matrix-
valued (or tensor-valued) and should be constrained to have its spectral
(operator) norm less than λ, for all i, j. See Section 7.3 for an example and
further details.
In practice, we will (improperly) use ‘dual problem’ to denote the mini-
mization problem
$$\min_p \Bigl\{ \frac12\|D^*p - u^\diamond\|^2 : |p_{i,j}|_q \le \lambda \ \text{for all } i,j \Bigr\}, \tag{3.17}$$
which is essentially a projection problem. For this problem, it is interesting
to observe that the primal–dual gap
$$\begin{aligned}
\mathcal{G}(u,p) &= f(Du) + \frac12\|u - u^\diamond\|^2 + f^*(p) + \frac12\|D^*p\|^2 - \langle D^*p, u^\diamond\rangle \\
&= \lambda\|Du\|_{p,1} + \delta_{\{\|\cdot\|_{q,\infty}\le\lambda\}}(p) - \langle p, Du\rangle + \frac12\|u^\diamond - D^*p - u\|^2
\end{aligned} \tag{3.18}$$
4. Gradient methods
The first family of methods we are going to describe is that of first-order
gradient descent methods. It might seem a bit strange to introduce such
simple and classical tools, which might be considered outdated. However, as
mentioned in the Introduction, the most efficient way to tackle many sim-
ple problems in imaging is via elaborate versions of plain gradient descent
schemes. In fact, as observed in the 1950s, such methods can be consider-
ably improved by adding inertial terms or performing simple over-relaxation
steps (or less simple steps, such as Chebyshev iterations for matrix inversion:
Varga 1962), line-searches, or more elaborate combinations of these, such
as conjugate gradient descent; see for instance Polyak (1987, Section 3.2) or
Bertsekas (2015, Section 2.1). Also, if second-order information is available,
Newton’s method or quasi-Newton variants such as the (l-)BFGS algorithm
(Byrd, Lu, Nocedal and Zhu 1995) can be used, and are known to converge
very fast. However, for medium/large non-smooth problems such as those
described above, such techniques are not always convenient. It is now ac-
knowledged that, if not too complex to implement, then simpler iterations,
which require fewer operations and can sometimes even be parallelized, will
generally perform better for a wide class of large-dimensional problems, such
as those considered in this paper.
In particular, first-order iterations can be accelerated by many simple
tricks such as over-relaxation or variable metrics – for instance Newton’s
method – but this framework can be transferred to fairly general schemes
(Vũ 2013b, Combettes and Vũ 2014), and since the seminal contribution
and let us first assume that f is differentiable. Then, the most straightfor-
ward approach to solving the problem is to implement a gradient descent
scheme with fixed step size τ > 0: see Algorithm 1. The major issue is
that this will typically not work if f is not sufficiently smooth. The natural
assumption is that ∇f is Lipschitz with some constant L, and 0 < τ L < 2.
If $\tau$ is too large, this method will oscillate: if for instance $f(x) = x^2/2$, then $x^{k+1} = (1-\tau)x^k$, and it is obvious that this recursion converges if and only if $\tau < 2$. On the other hand, a Taylor expansion shows that
$$f(x - \tau\nabla f(x)) \le f(x) - \tau\Bigl(1 - \frac{\tau L}{2}\Bigr)\|\nabla f(x)\|^2,$$
so that if $\tau < 2/L$, then we see both that $f(x^k)$ is a strictly decreasing sequence (unless $\nabla f(x^k) = 0$ at some point) and that $\sum_k \|\nabla f(x^k)\|^2 < +\infty$
if f is bounded from below. If f is, in addition, coercive (with bounded level
sets), it easily follows in the finite dimensional setting that f (xk ) converges
to a critical value and that every converging subsequence of (xk )k≥0 goes
to a critical point. If $f$ is convex, then $x \mapsto x - \tau\nabla f(x)$ is also a (weak) contraction, which shows that $\|x^k - x^*\|$ is also non-increasing, for any minimizer⁴ $x^*$ of $f$. In this case we can deduce the convergence of the whole
⁴ We shall always assume the existence of at least one minimizer, here and elsewhere.
sequence (xk )k to a solution, if 0 < τ < 2/L. In fact, this is a particular case
of the fairly general theory of averaged operators, for which such iterations
converge: see Theorem A.1 in the Appendix for details and references.
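The role of the bound $\tau < 2/L$ is easy to observe on the quadratic example above ($f(x) = x^2/2$, so $L = 1$ and $x^{k+1} = (1-\tau)x^k$). A minimal sketch (our own code, not an algorithm from the text):

```python
def grad_descent(df, x0, tau, n_iter):
    # plain gradient descent with fixed step size tau
    x = x0
    for _ in range(n_iter):
        x = x - tau * df(x)
    return x

df = lambda x: x                           # gradient of f(x) = x^2 / 2, L = 1
x_ok = grad_descent(df, 1.0, 0.5, 100)     # tau < 2/L: x^k = (1 - tau)^k -> 0
x_bad = grad_descent(df, 1.0, 2.5, 100)    # tau > 2/L: |x^k| = 1.5^k blows up
```

With $\tau = 0.5$ the iterates contract geometrically to the minimizer, while with $\tau = 2.5$ they oscillate with exploding amplitude, exactly as the recursion $x^{k+1} = (1-\tau)x^k$ predicts.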
⁵ This is an extension of Theorem A.1; see also the references cited there.
since the 1970s (Martinet 1970, Rockafellar 1976), and in fact many of the
methods we consider later on are special instances. Convergence proofs and
rates of convergence can be found, for instance, in Brézis and Lions (1978) (these require $\sum_k \tau_k^2 = +\infty$, but $\sum_k \tau_k = +\infty$ is sufficient if $T = \partial f$); see
also the work of Güler (1991) when T = ∂f . In fact some of the results
mentioned in Section 4.7 below will apply to this method as a particular
case, when T = ∂f , extending some of the results of Güler (1991).
Fairly general convergence rates for gradient methods are given in the
rich book of Bertsekas (2015, Propositions 5.1.4, 5.1.5), depending on the
behaviour of f near the set of minimizers. In the simplest case of the descent
(4.2) applied to a function f with L-Lipschitz gradient, the convergence rate
is found in many other textbooks (e.g. Nesterov 2004) and reads as follows.
Theorem 4.1. Let $x^0 \in \mathcal{X}$ and $x^k$ be recursively defined by (4.2), with $\tau \le 1/L$. Then not only does $(x^k)_k$ converge to a minimizer, but the value $f(x^k)$ decays with the rate
$$f(x^k) - f(x^*) \le \frac{1}{2\tau k}\|x^* - x^0\|^2,$$
where $x^*$ is any minimizer of $f$. If in addition $f$ is strongly convex with parameter $\mu_f > 0$, we have
$$f(x^k) - f(x^*) + \frac{1}{2\tau}\|x^k - x^*\|^2 \le \frac{\omega^k}{2\tau}\|x^0 - x^*\|^2,$$
where $\omega = (1 - \tau\mu_f) < 1$.
A short (standard) proof is given in Appendix B.
Remark 4.2. This form of the result is slightly suboptimal, allowing a
very elementary proof in Appendix B. However, it can be checked that the
first rate holds for larger steps τ < 2/L, while the second can be improved
by taking larger steps (τ = 2/(L + µf )), yielding linear convergence with a
factor ω = (1 − µf /L)/(1 + µf /L); see for instance Nesterov (2004, Theo-
rem 2.1.15). However, we will see very soon that this too can be improved.
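The $O(1/k)$ bound of Theorem 4.1 can be checked numerically on a simple quadratic. The following sketch is ours (the matrix, sizes and seed are arbitrary choices); the minimizer is $x^* = 0$, so the bound reads $f(x^k) \le \|x^0\|^2/(2\tau k)$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
H = A.T @ A + 0.1 * np.eye(20)       # Hessian of f(x) = 0.5 x^T H x; x* = 0
L = np.linalg.eigvalsh(H).max()      # Lipschitz constant of the gradient
f = lambda z: 0.5 * z @ H @ z
tau = 1.0 / L                        # step size as in Theorem 4.1
x0 = np.ones(20)
x = x0.copy()
for k in range(1, 201):
    x = x - tau * (H @ x)            # gradient step (4.2)
    # Theorem 4.1: f(x^k) - f(x^*) <= ||x^* - x^0||^2 / (2 tau k)
    assert f(x) <= x0 @ x0 / (2 * tau * k) + 1e-9
```

In practice the observed decay on this strongly convex instance is much faster than $O(1/k)$, consistent with the linear rate in the second part of the theorem.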
Of course, the observations above show that similar rates will also hold
for the implicit form (4.7): indeed, recalling that $f_\tau(x^*) = f(x^*)$ for any $\tau > 0$, we have that a bound on $f_\tau(x^k) - f_\tau(x^*)$ is, by definition, also a bound on
$$f(x^{k+1}) - f(x^*) + \frac{\|x^{k+1} - x^k\|^2}{2\tau}.$$
We remark that in this implicit case it would seem that we only have to
choose the largest possible τ to solve the minimization accurately. We will
see further (Example 3.1) that in practice, we are not always free to choose
the step or the metric which makes the algorithm actually implementable. In
other situations the choice of the step might eventually result in a trade-off
between the precision of the computation, the overall rate and the complex-
ity of one single iteration (which should also depend on τ ).
(where a slightly different function is used: see Theorems 2.1.7 and 2.1.13).
If µ = 0, using (possible translates of) the function (4.12), which is very ill
conditioned (and degenerate if defined in dimension n > p), the following
general lower bound for smooth convex optimization can be shown.
Theorem 4.3. For any x0 ∈ Rn , L > 0, and k < n there exists a convex,
continuously differentiable function f with L-Lipschitz-continuous gradient,
such that for any first-order algorithm satisfying (4.11), we have
$$f(x^k) - f(x^*) \ge \frac{L\,\|x^0 - x^*\|^2}{8(k+1)^2}, \tag{4.13}$$
where x∗ denotes a minimizer of f .
This particular bound is reached by considering the function in (4.12)
with p = k + 1, and an appropriate change of variable which moves the
starting point to the origin. Observe that the above lower bound is valid
only if the number of iterates k is less than the problem size. We cannot
improve this with a quadratic function, as the conjugate gradient method
(which is a first-order method) is then known to find the global minimizer
after at most n steps. But the practical problems we encounter in imaging
are often so large that we will never be able to perform as many iterations
as the dimension of the problem.
If choosing µ > 0 so that the function (4.12) becomes µ-strongly convex,
a lower bound for first-order methods is given in Theorem 2.1.13 of Nesterov
(2004), which reads as follows.
Theorem 4.4. For any $x^0 \in \mathbb{R}^\infty \simeq \ell^2(\mathbb{N})$ and $\mu, L > 0$ there exists a
µ-strongly convex, continuously differentiable function f with L-Lipschitz-
continuous gradient, such that, for any algorithm in the class of first-order
algorithms defined by (4.11), we have
$$f(x^k) - f(x^*) \ge \frac{\mu}{2}\Bigl(\frac{\sqrt{q} - 1}{\sqrt{q} + 1}\Bigr)^{2k}\|x^0 - x^*\|^2 \tag{4.14}$$
for all $k$, where $q = L/\mu \ge 1$ is the condition number, and $x^*$ is the minimizer
of f .
In finite dimensions, one can adapt the proof of Nesterov (2004) to show
the same result for sufficiently small k, with respect to n. It is important to
bear in mind that these lower bounds are inevitable for any first-order algo-
rithm (assuming the functions are ‘no better’ than with L-Lipschitz gradient
and µ-strongly convex). Of course, one could ask if these lower bounds are
not too pessimistic, and whether such hard problems will appear in prac-
tice. We will indeed see that these lower bounds are highly relevant to our
algorithms, and are observed when minimizing relatively simple problems
such as the ROF model. Let us mention that many other types of interest-
ing lower bounds can be found in the literature for most of the algorithmic
techniques described in this paper, and a few others; see in particular the
recent and fairly exhaustive study by Davis and Yin (2014a).
and conjugate gradient (CG), together with the lower bound for smooth
optimization provided in (4.13). The results show that AGD is significantly
faster than GD. For comparison we also applied CG, which is known to be
an optimal method for quadratic optimization and provides convergence, in
finitely many steps, to the true solution, in this case after at most k = 100
iterations. Observe that CG exactly touches the lower bound at k = 99
(black cross), which shows that the lower bound is sharp for this problem.
Before and after k = 99, however, the lower bound is fairly pessimistic.
⁶ This point of view is a bit restrictive: it will be seen in Section 4.7 that one can also choose $\tau = 1/\|K\|^2$ – or even $\tau < 2/\|K\|^2$ for simple descent with fixed steps.
(Figure: (a) the entries $x_i$ of the iterates $x^{10000}$ of GD and AGD, compared with the true solution $x^*$, plotted against the index $i$; (b) $f(x^k) - f(x^*)$ versus the iteration $k$ for GD, AGD and CG, together with the theoretical rates of GD and AGD and the lower bound.)
metric)
$$\hat x = \operatorname{prox}_{\tau g}\Bigl(\bar x - \tau \nabla\bigl(\tfrac12\|K\cdot{} - x^\diamond\|^2\bigr)(\bar x)\Bigr),$$
combining a step of implicit ('backward') gradient descent for $g$ and a step of explicit ('forward') gradient descent for the smooth part $\tfrac12\|K\cdot{} - x^\diamond\|^2$ of
(4.16). This is a particular case of a more general gradient descent algorithm
which mixes the two points of view explained so far, and which we describe
in Section 4.7 below.
These first elementary convergence results can already be applied to quite
important problems in imaging and statistics. We first consider plain gra-
dient descent for the primal ROF problem and then show how we can use
implicit descent to minimize the dual of the ROF problem (3.17), which has
the same structure as the Lasso problem (2.2).
Example 4.7 (minimizing the primal ROF model). In this example
we consider gradient descent methods to minimize the primal ROF model,
in (2.6), for p = 2. As mentioned above, this will work only if the gradient
of our energy is Lipschitz-continuous, which is not the case for (2.6). Hence
we consider a smoothed version of the total variation, which is obtained
by replacing the norm kDuk2,1 , which is singular at 0, with a smoothed
approximation; this means in practice that we solve a different problem,
but we could theoretically estimate how far the solution to this problem is
from the solution to the initial problem. A classical choice is
$$\sum_{i,j} \sqrt{\varepsilon^2 + (Du)_{i,j,1}^2 + (Du)_{i,j,2}^2},$$
We implement the gradient descent algorithm (4.2) using a constant step size
τ = 2/(L + µ) and apply the algorithm to Example 2.1. Figure 4.2 shows
the convergence of the primal–dual gap using different values of ε. Since
the objective function is smooth and strongly convex, the gradient descent
converges linearly. However, for smaller values of ε, where the smoothed
ROF model approaches the original ROF model, the convergence of the al-
gorithm becomes very slow. The next example shows that it is actually a
better idea to minimize the dual of the ROF model.
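In outline, the smoothed scheme just described could look as follows (a sketch in our own code: the finite-difference operators `D`, `Dt`, the synthetic data and the sizes are our stand-ins; the Lipschitz constant is bounded using $\|D\|^2 \le 8$ from (2.5), and $\mu = 1$ comes from the quadratic data term):

```python
import numpy as np

def D(u):                                  # right (forward) differences
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def Dt(gx, gy):                            # adjoint of D (negative divergence)
    u = np.zeros_like(gx)
    u[:-1, :] -= gx[:-1, :]; u[1:, :] += gx[:-1, :]
    u[:, :-1] -= gy[:, :-1]; u[:, 1:] += gy[:, :-1]
    return u

def energy(u, u0, lam, eps):               # smoothed ROF objective
    gx, gy = D(u)
    return lam * np.sqrt(eps**2 + gx**2 + gy**2).sum() + 0.5 * ((u - u0)**2).sum()

rng = np.random.default_rng(0)
u0 = rng.standard_normal((64, 64))         # synthetic noisy image
lam, eps = 0.1, 0.05
L = 1.0 + 8.0 * lam / eps                  # Lipschitz bound for the gradient
tau = 2.0 / (L + 1.0)                      # tau = 2/(L + mu), as in the text
u = u0.copy()
for _ in range(200):
    gx, gy = D(u)
    w = np.sqrt(eps**2 + gx**2 + gy**2)
    u = u - tau * (lam * Dt(gx / w, gy / w) + (u - u0))
```

Note how the admissible step size degrades like $\varepsilon/\lambda$ as the smoothing parameter shrinks, which is precisely the slowdown visible in Figure 4.2.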
(Plot: primal–dual gap versus iterations, for $\varepsilon = 0.01$, $\varepsilon = 0.05$ and $\varepsilon = 0.001$.)
Figure 4.2. Minimizing the primal ROF model using smoothed (Huber) total vari-
ation applied to the image in Figure 2.1. The figure shows the convergence of the
primal–dual gap using plain gradient descent for different settings of the smoothing
parameter ε.
Example 4.8 (minimizing the dual ROF model). Let us turn to the
problem of minimizing the dual ROF model using the explicit representation
of the Moreau–Yosida envelope. We consider (4.16) with $K = D$ and $g = \delta_{\{\|\cdot\|_{2,\infty}\le\lambda\}}$. The Moreau–Yosida regularization is given by
$$f_M(\bar p) := \min_p \frac12\|p - \bar p\|_M^2 + \frac12\|D^*p - u^\diamond\|^2 + \delta_{\{\|\cdot\|_{2,\infty}\le\lambda\}}(p), \tag{4.22}$$
with $\tau'$ such that $M = (1/\tau')\,\mathrm{I} - DD^* > 0$, and the minimum of the right-hand side is attained for
$$\hat p = \Pi_{\{\|\cdot\|_{2,\infty}\le\lambda\}}\bigl(\bar p - \tau' D(D^*\bar p - u^\diamond)\bigr),$$
where $\Pi_{\{\|\cdot\|_{2,\infty}\le\lambda\}}$ denotes the (pixelwise) orthogonal projection onto 2-balls with radius $\lambda$, that is, for each pixel $i,j$, the projection is computed by
$$\hat p = \Pi_{\{\|\cdot\|_{2,\infty}\le\lambda\}}(\tilde p) \;\Longleftrightarrow\; \hat p_{i,j} = \frac{\tilde p_{i,j}}{\max\{1,\ \lambda^{-1}|\tilde p_{i,j}|_2\}}. \tag{4.23}$$
As shown before, the gradient in the $M$-metric is given by
$$\nabla f_M(\bar p) = \bar p - \hat p. \tag{4.24}$$
The advantages of minimizing the dual ROF model rather than the pri-
mal ROF model as in Example 4.7 are immediate. First, thanks to the implicit
smoothing of the Moreau–Yosida regularization, we do not need to artifi-
cially smooth the objective function and hence any gradient method will
converge to the exact minimizer. Second, the step size of a gradient method
will just depend on $\|D\|$, whereas the step size of a gradient method applied to the primal ROF model is proportional to the smoothing parameter $\varepsilon$. We implement both a standard gradient descent (GD) with step size $\tau = 1.9$ and the accelerated gradient descent (AGD) with step size $\tau = 1$. The parameter $\tau'$ in the $M$-metric is set to $\tau' = 0.99/\|D\|^2$.
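In outline, the GD variant could be sketched as follows (our own code: `D`, `Dt` and the random stand-in for $u^\diamond$ are ours; the projection implements (4.23), and the update $p \leftarrow p - \tau(p - \hat p)$ is the gradient step (4.24) in the $M$-metric with $\tau = 1.9$):

```python
import numpy as np

def D(u):                                  # forward differences
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def Dt(gx, gy):                            # adjoint D*, maps dual to primal
    u = np.zeros_like(gx)
    u[:-1, :] -= gx[:-1, :]; u[1:, :] += gx[:-1, :]
    u[:, :-1] -= gy[:, :-1]; u[:, 1:] += gy[:, :-1]
    return u

def proj(px, py, lam):                     # pixelwise projection (4.23)
    s = np.maximum(1.0, np.sqrt(px**2 + py**2) / lam)
    return px / s, py / s

rng = np.random.default_rng(0)
u0 = rng.standard_normal((32, 32))         # stand-in for the noisy image
lam, tau, taup = 0.1, 1.9, 0.99 / 8.0      # tau' = 0.99 / ||D||^2, ||D||^2 <= 8
px = np.zeros_like(u0); py = np.zeros_like(u0)
for _ in range(300):
    gx, gy = D(Dt(px, py) - u0)            # gradient of 0.5 ||D*p - u0||^2
    hx, hy = proj(px - taup * gx, py - taup * gy, lam)   # p_hat
    px, py = px - tau * (px - hx), py - tau * (py - hy)  # GD step in M-metric
u = u0 - Dt(px, py)                        # primal solution, u = u0 - D*p
```

No smoothing parameter appears anywhere: the step sizes depend only on $\|D\|$, which is the point of working with the dual.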
Since we are dealing with a smooth, unconstrained optimization in (4.22),
we can also try to apply a black-box algorithm, which only needs information
about the gradients and the function values. A very popular algorithm is the
limited memory BFGS quasi-Newton method (Byrd et al. 1995, Zhu, Byrd,
Lu and Nocedal 1997, Morales and Nocedal 2011). We applied a 1-memory
variant of the l-BFGS algorithm⁷ to the Moreau–Yosida regularization of
the dual ROF model and supplied the algorithm with function values (4.22)
(using the correct values of p̂) and gradients (4.24). The idea of using vari-
able metric approaches to the Moreau–Yosida regularization of the operator
has been investigated in many papers (Bonnans, Gilbert, Lemaréchal and
Sagastizábal 1995, Burke and Qian 1999, Burke and Qian 2000) and can lead
to very fast convergence under simple smoothness assumptions. However,
it is not always suitable or easily implementable for many of the problems
we address in this paper.
The plot in Figure 4.3 represents the decay of the primal–dual gap (which
bounds the energy and the ℓ2 -error) obtained from gradient descent (GD),
accelerated gradient descent (AGD) and the limited memory BFGS quasi-
Newton method (l-BFGS). It appears that the energy actually decreases
faster for the accelerated method and the quasi-Newton method, with no
clear advantage of one over the other (the first being of course simpler to im-
plement). Also observe that both AGD and l-BFGS are only slightly faster
than the lower bound $O(1/k^2)$ for smooth convex optimization. This seems to show that the dual ROF model is already quite a hard optimization
problem. We should mention here that the idea of applying quasi-Newton
methods to a regularized function as in this example has been recently
extended to improve the convergence of some of the methods introduced
later in this paper, namely the forward-backward and Douglas-Rachford
splittings, with very interesting results: see Patrinos, Stella and Bemporad
(2014), Stella, Themelis and Patrinos (2016).
⁷ We used S. Becker's MATLAB wrapper of the implementation at http://users.iems.northwestern.edu/~nocedal/lbfgsb.html.
(Figure 4.3: primal–dual gap versus iterations for GD, AGD and l-BFGS, together with the $O(1/k)$ and $O(1/k^2)$ reference rates.)
the smooth case. Naturally, a rate of convergence for the errors is required
to obtain an improved global rate.
Proofs of both Theorems 4.9 and 4.10 are given in Appendix B, where
more cases are discussed, including more possibilities for the choice of the
parameters. They rely on the following essential but straightforward descent
rule.⁸ Let $\hat x = T_\tau \bar x$. Then, for all $x \in \mathcal{X}$,
$$F(x) + (1 - \tau\mu_f)\frac{\|x - \bar x\|^2}{2\tau} \ge \frac{1 - \tau L}{\tau}\,\frac{\|\hat x - \bar x\|^2}{2} + F(\hat x) + (1 + \tau\mu_g)\frac{\|x - \hat x\|^2}{2\tau}. \tag{4.36}$$
In particular, if $\tau L \le 1$,
$$F(x) + (1 - \tau\mu_f)\frac{\|x - \bar x\|^2}{2\tau} \ge F(\hat x) + (1 + \tau\mu_g)\frac{\|x - \hat x\|^2}{2\tau}. \tag{4.37}$$
The proof is elementary, especially if we follow the lines of the presentation
⁸ This rule – or some variant of it – is of course found in almost all papers on first-order descent methods.
Remark 4.11. One can more precisely deduce from this computation that
$$F(x) + (1 - \tau\mu_f)\frac{\|x - \bar x\|^2}{2\tau} \ge F(\hat x) + (1 + \tau\mu_g)\frac{\|x - \hat x\|^2}{2\tau} + \frac{\|\hat x - \bar x\|^2}{2\tau} - D_f(\hat x, \bar x), \tag{4.38}$$
where $D_f(x, y) := f(x) - f(y) - \langle\nabla f(y), x - y\rangle \le (L/2)\|x - y\|^2$ is the 'Bregman $f$-distance' from $y$ to $x$ (Brègman 1967). In particular, (4.37) holds once
$$D_f(\hat x, \bar x) \le \frac{\|\hat x - \bar x\|^2}{2\tau},$$
which is always true if τ ≤ 1/L but might also occur in other situations, and
in particular, be tested ‘on the fly’ during the iterations. This allows us to
implement efficient backtracking strategies of the type of Armijo (1966) (see
Nesterov 1983, Nesterov 2013, Beck and Teboulle 2009) for the algorithms
described in this section when the Lipschitz constant of f is not a priori
known.
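This 'on the fly' test is simple to implement. A sketch in our own code (the function and variable names are ours, and the Lasso-type instance is an arbitrary illustration): the step is halved whenever the Bregman test $D_f(\hat x, \bar x) \le \|\hat x - \bar x\|^2/(2\tau)$ fails, so no Lipschitz constant is needed in advance.

```python
import numpy as np

def fb_backtracking(f, grad_f, prox_g, x, tau, n_iter, shrink=0.5):
    """Forward-backward descent; tau is decreased whenever the Bregman
    test D_f(x_hat, x_bar) <= ||x_hat - x_bar||^2 / (2 tau) fails."""
    for _ in range(n_iter):
        while True:
            x_hat = prox_g(x - tau * grad_f(x), tau)
            d = x_hat - x
            breg = f(x_hat) - f(x) - grad_f(x) @ d   # D_f(x_hat, x_bar)
            if breg <= d @ d / (2.0 * tau) + 1e-12:
                break                                # (4.37) is guaranteed
            tau *= shrink                            # halve the step, retry
        x = x_hat
    return x

# Lasso-type instance: f(z) = 0.5 ||Az - b||^2 smooth, g = lam ||.||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
lam = 0.5
f = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
grad_f = lambda z: A.T @ (A @ z - b)
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - lam * t, 0.0)
x = fb_backtracking(f, grad_f, soft, np.zeros(10), tau=10.0, n_iter=200)
```

Since the test passes automatically once $\tau \le 1/L$, the inner loop always terminates, while steps larger than $1/L$ are kept whenever the local curvature allows it.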
Discussion
The idea of forward–backward splitting is very natural, and appears in many
papers in optimization for imaging: it would not be possible to mention all
the related literature. Historically, it is a generalization of projected gradi-
ent descent, which dates back at least to Goldstein (1964) (see Passty 1979,
Lions and Mercier 1979, Fukushima and Mine 1981). For minimization
problems, it can be viewed as successive minimizations of a parabolic upper
bound of the smooth part added to the non-smooth part. It has been gener-
alized, and popularized in the imaging community by Combettes and Wajs
(2005), yet a few particular forms were already well known, such as iterative
soft-thresholding for the Lasso problem (Daubechies et al. 2004). It is not
always obvious how to choose parameters correctly when they are unknown.
Several backtracking techniques will work, such as those of Nesterov (2013),
for both the Lipschitz constants and strong convexity parameters; see also
Nesterov (1983), Beck and Teboulle (2009) and Bonettini et al. (2015) for
estimates of the Lipschitz constant.
For simpler problems such as Lasso (2.2), convergence of the iterates
(more precisely of $Ax^k$) yields that after some time (generally unknown), the support $\{i : x^*_i \neq 0\}$ of the solution $x^*$ should be detected by the algorithm (under 'generic' conditions). In that case, the objective which is
solved becomes smoother than during the first iterations, and some authors
have succeeded in exploiting this ‘partial smoothness’ to show better (lin-
ear) convergence of the FB descent (Bredies and Lorenz 2008, Grasmair,
Haltmeier and Scherzer 2011, Liang, Fadili and Peyré 2014, Tao, Boley and
Zhang 2015). Liang, Fadili and Peyré (2015a) have extended this approach
to the abstract setting of Appendix A, so that this remark also holds for
some of the saddle-point-type algorithms introduced in Section 5 below.
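This support-identification behaviour is easy to observe numerically; the following sketch (illustrative, with synthetic data) runs plain forward–backward iterations on a small Lasso instance and records the sparsity pattern of each iterate.

```python
import numpy as np

def ista_with_support(A, b, lam, n_iter=500):
    """Plain forward-backward (iterative soft-thresholding) on
    0.5*||Ax-b||^2 + lam*||x||_1, recording the support of each iterate."""
    tau = 1.0 / np.linalg.norm(A, 2) ** 2    # 1/L, L = Lipschitz constant
    x = np.zeros(A.shape[1])
    supports = []
    for _ in range(n_iter):
        x = x - tau * (A.T @ (A @ x - b))                       # forward step
        x = np.sign(x) * np.maximum(np.abs(x) - tau * lam, 0.0) # backward step
        supports.append(frozenset(np.flatnonzero(x)))
    return x, supports

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 12))
x_true = np.zeros(12)
x_true[[2, 7]] = [1.0, -2.0]
b = A @ x_true + 0.01 * rng.standard_normal(30)
x, supports = ista_with_support(A, b, lam=2.0)
```

After an initial transient of unknown length, the recorded support stops changing, and from then on the iteration effectively minimizes a smooth quadratic restricted to a fixed face.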
Another interesting and alternative approach to convergence rates is to
use the ‘Kurdyka–Lojasiewicz’ (KL) inequality, which in practice will bound
a function of the distance of a point to the critical set by the norm of
the (sub)gradient. As shown by Bolte, Daniilidis and Lewis (2006), such a
property will hold for ‘most’ of the functions optimized in practice, including
non-smooth functions, and this can lead to improved convergence rates for
many algorithms (Attouch, Bolte and Svaiter 2013). It is also possible to
derive accelerated schemes for problems with different types of smoothness
(such as Hölder-continuous gradients); see Nesterov (2015).
Finally, a heuristic technique which often works to improve the conver-
gence rate, when the objective is smoother than actually known, consists
simply in ‘restarting’ the method after a certain number of iterations: in
Algorithm 5 (for µ = 0), we start with a new sequence (tk )k letting tk̄ = 1
for some sufficiently large k̄. Ideally, we should restart when we are sure
that the distance of xk̄ to the optimum x∗ (unique if the objective is strongly
convex) has shrunk by a given, sufficiently small factor (but the correspond-
A non-linear analogue of (4.37) will easily follow from (4.41) and the
Lipschitz property of ∇f, which reads

    ‖∇f(x) − ∇f(y)‖_* ≤ L‖x − y‖  for all x, y ∈ X,

where ‖·‖_* is the norm in X′ induced by the norm ‖·‖ of X, with respect to
which ψ is strongly convex. The simple FB descent method (using x̄ = x^k,
x^{k+1} = x̂) will then converge with essentially the same rate (but constants
which depend on the new distance Dψ ); see Tseng (2008) for details. More
interesting is the fact that, again, Tseng (2008) has also introduced accel-
erated variants which reach a convergence rate in O(1/k²), as before (see
also Allen-Zhu and Orecchia 2014). A different way to introduce barriers
and non-linearities for solving (4.25) by smoothing is proposed in Nesterov
(2005), where another O(1/k²) algorithm is introduced.
⁹ See also the version at http://www2.isye.gatech.edu/~nemirovs.
Then it is known (Teboulle 1992, Beck and Teboulle 2003) that the entropy

    ψ(x) := Σ_{i=1}^{n} x_i ln x_i   (and ∇ψ(x) = (1 + ln x_i)_{i=1}^{n})
(see Polyak 1987). For a quadratic function this problem is easily solved,
and it is known that the descent method obtained minimizes the quadratic
function exactly in rank A iterations, where A = ∇²f. It is the fastest
method in this case (Polyak 1987); see the plot ‘CG’ in Figure 4.1. In prac-
tice, this method should be implemented on a sufficiently smooth problem
when the cost of performing a line-search (which requires evaluations of the
function) is not too large; as for non-quadratic problems, the optimal step
cannot be computed in closed form.
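The finite-termination property is easy to check numerically. The sketch below (synthetic data, standard conjugate gradient, not taken from the text) minimizes a quadratic ½⟨Ax, x⟩ − ⟨b, x⟩ whose Hessian A has rank 5, and reaches the exact minimizer in 5 iterations up to rounding error.

```python
import numpy as np

def conjugate_gradient(A, b, n_steps):
    """Standard CG for minimizing 0.5*<Ax,x> - <b,x>, A symmetric PSD."""
    x = np.zeros_like(b)
    r = b.copy()              # residual = -gradient at x = 0
    p = r.copy()
    for _ in range(n_steps):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)    # exact line search along p
        x += alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p          # new A-conjugate direction
        r = r_new
    return x

rng = np.random.default_rng(2)
B = rng.standard_normal((50, 5))
A = B @ B.T                       # Hessian of rank 5
b = A @ rng.standard_normal(50)   # consistent right-hand side
x5 = conjugate_gradient(A, b, n_steps=5)   # rank(A) = 5 iterations suffice
```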
A generalization of the HB algorithm to a strongly convex function given
by the sum of a smooth, twice continuously differentiable function with
Lipschitz-continuous gradient and a non-smooth function, with easily com-
puted proximal map, was investigated for quadratic functions in Bioucas-
Dias and Figueiredo (2007) and for more general smooth functions in Ochs,
Brox and Pock (2015). It is of the form:
    x^{k+1} = prox_{αg}(x^k − α∇f(x^k) + β(x^k − x^{k−1})).        (4.43)
The proximal HB algorithm offers the same optimal convergence rate as
the HB algorithm, but it can be applied only if the smooth function is twice
continuously differentiable. When applicable, it is very efficient; see Figure 4.4 below
for a comparison of this method with other accelerated methods.
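A minimal sketch of such a proximal heavy ball iteration (our illustration, not the implementation used for Figure 4.4) on a synthetic ℓ₁-regularized least-squares problem, with a gradient descent step and deliberately conservative, non-optimal parameters:

```python
import numpy as np

def prox_heavy_ball(A, b, lam, alpha, beta, n_iter=3000):
    """x_{k+1} = prox_{alpha*g}(x_k - alpha*grad f(x_k) + beta*(x_k - x_{k-1}))
    with f(x) = 0.5*||Ax-b||^2 (twice continuously differentiable) and
    g = lam*||.||_1 (prox = soft-thresholding)."""
    x_prev = x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - alpha * (A.T @ (A @ x - b)) + beta * (x - x_prev)
        x_prev, x = x, np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 15))
b = rng.standard_normal(40)
lam = 0.5
L = np.linalg.norm(A, 2) ** 2
# conservative choice: alpha well below 2*(1-beta)/L, moderate inertia beta
x = prox_heavy_ball(A, b, lam, alpha=0.8 / L, beta=0.5)
```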
algorithms recover optimal rates (in particular when specialized to the one-block
case) and allow for descent steps which are optimal for each block (Fercoq
and Richtárik 2013a, 2013b).
4.9. Examples
We conclude this section by providing two examples. In the first example we
consider minimizing the dual of the Huber-ROF problem, which is strongly
convex and can therefore be minimized using accelerated proximal gradient
descent for strongly convex problems. The second example uses the explicit
representation of Moreau–Yosida regularization to transform the dual of an
anisotropic variant of the ROF model into a form consisting of a smooth
plus a non-smooth function, which can be tackled by accelerated forward–
backward splitting.
Example 4.14 (minimizing the dual of Huber-ROF). Let us revisit
the dual of the Huber-ROF model introduced in (4.21):

    min_p  ½‖D*p − u⋄‖² + (ε/(2λ))‖p‖² + δ_{‖·‖_{2,∞} ≤ λ}(p),
Optimization for imaging 47
where u⋄ is again the noisy image of size m × n from Example 2.1, and D
is the (two-dimensional) finite difference operator. This problem is the sum
of a smooth function with Lipschitz-continuous gradient,

    f(p) = ½‖D*p − u⋄‖²,

plus a non-smooth function with easily computed proximal map,

    g(p) = (ε/(2λ))‖p‖² + δ_{‖·‖_{2,∞} ≤ λ}(p).

The gradient of the smooth function is given by

    ∇f(p) = D(D*p − u⋄),
and the proximal map with respect to g, writing µ = ε/λ, is given pixelwise by

    p̂ = prox_{τg}(p̃)  ⇔  p̂_{i,j} = (1 + τµ)⁻¹ p̃_{i,j} / max{1, λ⁻¹(1 + τµ)⁻¹ |p̃_{i,j}|₂}.
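In code, this proximal map is a pixelwise rescaling followed by a projection onto the ℓ²-ball of radius λ; the following sketch (our illustration, operating on a random field p̃ of per-pixel 2-vectors) implements exactly that:

```python
import numpy as np

def prox_huber_rof_dual(p, tau, lam, eps):
    """Pixelwise prox of g(p) = (eps/(2*lam))*||p||^2 + indicator{|p_ij|_2 <= lam}
    for a field p of shape (m, n, 2): rescale by 1/(1 + tau*mu) with
    mu = eps/lam, then project each 2-vector onto the ball of radius lam."""
    q = p / (1.0 + tau * eps / lam)
    norms = np.sqrt(np.sum(q ** 2, axis=-1, keepdims=True))
    return q / np.maximum(1.0, norms / lam)

rng = np.random.default_rng(4)
p_tilde = rng.standard_normal((4, 4, 2))
lam, eps, tau = 0.1, 0.001, 0.5
p_hat = prox_huber_rof_dual(p_tilde, tau, lam, eps)
```

The output is feasible by construction, and it minimizes the prox objective among feasible candidates.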
Let us now apply the Huber-ROF model to the image in Example 2.1 us-
ing the parameters λ = 0.1 and ε = 0.001. We implemented the FISTA
algorithm (Algorithm 5) using the extrapolation parameters corresponding
to µ = 0 and the correct µ = ε/λ. For comparison, we also implemented
the proximal heavy ball algorithm (4.43) and used the optimal parameter
settings
    α = 4 / ((√µ + √(L+µ))² − 4µ),    β = (√(L+µ) − √µ)² / ((√µ + √(L+µ))² − 4µ).
Figure 4.4 shows that it is generally not a good idea to apply the classi-
cal FISTA algorithm using µ = 0 to a strongly convex problem. On the
other hand, applying the FISTA algorithm with the correct settings for the
strong convexity, that is, µ = ε/λ, largely improves the convergence rate of
the algorithm. Interestingly, it turns out that the proximal HB algorithm
converges almost twice as fast as the FISTA algorithm (a rate of ω^{2k} as
opposed to ω^k, with q = L/µ_g and ω = (√q − 1)/(√q + 1)). In fact the proximal HB
algorithm seems to exactly obey the lower bound of first-order algorithms
for strongly convex problems presented in Theorem 4.14.
[Figure 4.4: primal–dual gap versus iterations for FISTA (µ = 0), FISTA (µ = ε/λ) and the proximal HB algorithm, together with the reference rates O(ω^k) and O(ω^{2k}) (lower bound).]
an equivalence does not hold. Consider again the dual of the ROF model:
    min_p  ½‖D*p − u⋄‖² + δ_{‖·‖_∞ ≤ λ}(p),                        (4.46)
which differs slightly from our previous ROF problems by the choice of
the norm constraining the dual variables. First, application of the adjoint
of the finite difference operator to the dual variables p = (p1 , p2 ) can be
decomposed via
    D*p = Σ_{d=1}^{2} D_d* p_d,
where D_d* is the adjoint finite difference operator in the direction d. Second,
by a change of variables t_d = D_d* p_d and using the property that the
constraint on p is also decomposable, we can rewrite the problem in the
equivalent form
    min_{(t_d)_{d=1}^{2}}  ½‖Σ_{d=1}^{2} t_d − u⋄‖² + Σ_{d=1}^{2} δ_{C_d}(t_d),        (4.47)
where
    C_d = {t_d : t_d = D_d* p_d, ‖p_d‖_∞ ≤ λ},  for d = 1, 2.
Hence, as shown in Section 4.8.3, this problem could be easily solved via ac-
celerated alternating minimization in td if we were able to efficiently compute
the proximal maps with respect to δC d (td ). Moreover, we have shown that
the (accelerated) alternating minimization corresponds to an (accelerated)
forward–backward algorithm on the partial Moreau–Yosida regularization
that is obtained by partially minimizing (4.47) with respect to one variable,
hence corresponding to a non-trivial instance of the forward–backward al-
gorithm.
Observe that the characteristic functions of the sets C d are exactly the
convex conjugates of the total variation in each dimension d, that is,
    δ_{C_d}(t_d) = sup_u ⟨u, t_d⟩ − λ‖D_d u‖₁.
In other words, if we were able to solve the proximal maps for one-dimen-
sional total variation problems along chains, we could – thanks to Moreau’s
identity – also efficiently solve the proximal maps for the functions δC d (td ).
As a matter of fact, there exist several direct algorithms that can solve
one-dimensional ROF problems very efficiently, and hence the proximal
maps for one-dimensional total variation. Some of the algorithms even work
in linear time; see Davies and Kovac (2001), Condat (2013a), Johnson (2013)
and Kolmogorov et al. (2016), and references therein.
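To illustrate the mechanism (without reproducing a linear-time solver), the sketch below computes the one-dimensional total variation prox by projected gradient on its dual, a box-constrained quadratic, and then recovers the prox of δ_{C_d} via Moreau's identity prox_{δ_C}(t) = t − prox_{λ‖D·‖₁}(t); the signal and parameters are synthetic.

```python
import numpy as np

def prox_tv1d(t, lam, n_iter=5000):
    """prox of u -> lam*||Du||_1 on a chain ((Du)_i = u_{i+1} - u_i),
    via projected gradient on the dual box-constrained quadratic
    min_{|p_i| <= lam} 0.5*||D^T p - t||^2; then u = t - D^T p.
    (A slow illustrative stand-in for the direct linear-time solvers.)"""
    p = np.zeros(len(t) - 1)
    for _ in range(n_iter):
        dtp = -np.diff(np.concatenate(([0.0], p, [0.0])))    # D^T p
        p = np.clip(p - 0.25 * np.diff(dtp - t), -lam, lam)  # step 1/||D||^2
    return t + np.diff(np.concatenate(([0.0], p, [0.0])))    # u = t - D^T p

t = np.array([0.0, 0.2, 4.0, 3.9, 4.1])
u = prox_tv1d(t, lam=10.0)   # lam large enough: u collapses to the mean of t
prox_conj = t - u            # Moreau: prox of the conjugate delta_{C} at t
```

Here prox_conj = D^T p is an element of C, exactly as described in the text.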
Figure 4.5 presents a comparison between the convergence rates of ac-
celerated block descent (FISTA-chains) applied to (4.47) and a standard
implementation of FISTA applied to (4.46). To solve the one-dimensional
total variation subproblems on chains we used the linear-time dynamic pro-
gramming approach from Kolmogorov et al. (2016). Figure 4.5(a) shows
that in terms of iterations, the accelerated block descent is about 10–20
times as fast. Clearly, one iteration of the accelerated block descent is
computationally more expensive than one iteration of the standard
implementation; in our C++ implementation, one iteration of standard FISTA
was approximately three times as fast as one iteration of the accelerated block
descent. Yet overall the block splitting technique turns out to be more efficient
for a given precision, as shown in Figure 4.5(b). Later, in Section 7.8, we
will come back to a similar example and show how accelerated block descent
can be used to solve large-scale stereo problems.
5. Saddle-point methods
In this section we will briefly describe the main optimization techniques
for finding saddle points, which are commonly used for imaging problems.
The goal of these approaches is, as before, to split a complex problem into
simpler subproblems which are easy to solve – although depending on the
structure and properties of the functions, one form might be more suitable
[Figure 4.5: primal–dual gap for FISTA-chains and FISTA, with the reference rate O(1/k²): (a) versus iterations, (b) versus time in seconds.]
Figure 4.5. Minimizing the dual ROF model applied to the image in Figure 2.1.
This experiment shows that an accelerated proximal block descent algorithm
(FISTA-chains) that exactly solves the ROF problem on horizontal and vertical
chains significantly outperforms a standard accelerated proximal gradient descent
(FISTA) implementation. (a) Comparison based on iterations, (b) comparison
based on the CPU time.
Pock 2015a). We will mention the simplest useful results. These have been
generalized and improved in many ways; see in particular Davis (2015) and
Davis and Yin (2014a, 2014b) for an extensive study of convergence rates,
Chen, Lan and Ouyang (2014a), Ouyang, Chen, Lan and Pasiliao (2015) and
Valkonen and Pock (2015) for optimal methods exploiting partial regularity
of some objectives, and Fercoq and Bianchi (2015) for efficient stochastic
approaches.
The natural order in which to present these algorithms should be to start
with the Douglas–Rachford splitting (Douglas and Rachford 1956; the mod-
ern form we will describe is found in Lions and Mercier 1979) and the
ADMM, which have been used for a long time in non-smooth optimization.
However, since the convergence results for primal–dual methods are in some
sense much simpler and carry on to the other algorithms, we first start by
describing these methods.
Then (this dates back to Arrow, Hurwicz and Uzawa 1958), we alternate a
(proximal) descent in the variable x and an ascent in the dual variable y:
    x^{k+1} = prox_{τg}(x^k − τK*y^k),                             (5.1)
    y^{k+1} = prox_{σf*}(y^k + σKx^{k+1}).                         (5.2)
It is not clear that such iterations will converge. (We can easily convince
ourselves that a totally explicit iteration, with xk+1 above replaced with
xk , will in general not converge.) However, this scheme was proposed in
Zhu and Chan (2008) for problem (2.6) and observed to be very efficient
for this problem, especially when combined with an acceleration strategy
consisting in decreasing τ and increasing σ at each step (e.g., following
the rules in Algorithm 8 below). Proofs of convergence for the Zhu–Chan
method have been proposed by Esser, Zhang and Chan (2010), Bonettini
and Ruggiero (2012) and He, You and Yuan (2014). For a general problem
Algorithm 6 PDHG.
Input: initial pair of primal and dual points (x0 , y 0 ), steps τ, σ > 0.
for all k ≥ 0 do
find (xk+1 , y k+1 ) by solving
    x^{k+1} = prox_{τg}(x^k − τK*y^k),                             (5.3)
    y^{k+1} = prox_{σf*}(y^k + σK(2x^{k+1} − x^k)).                (5.4)
end for
there exist several strategies to modify these iterations into converging
schemes. Popov (1981) proposed incorporating a type of ‘extragradient’
strategy into these iterations, as introduced by Korpelevich (1976, 1983):
the idea is simply to replace y^k with prox_{σf*}(y^k + σKx^k) in (5.1). This
makes the algorithm convergent; moreover, an O(1/k) (ergodic) convergence
rate is shown in Nemirovski (2004) (for a class of schemes including this one,
using also non-linear ‘mirror’ descent steps: see Section 4.8.1). A variant
with similar properties, but not requiring us to compute an additional step
at each iteration, was proposed at roughly the same time by Esser et al.
(2010) (who gave it the name ‘PDHG’¹¹), and Pock, Cremers, Bischof and
Chambolle (2009). The iterations can be written as in Algorithm 6.
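A minimal sketch of Algorithm 6 (our illustration, not the paper's implementation) on a small one-dimensional ROF-type problem min_u ½‖u − b‖² + λ‖Du‖₁, where prox_{σf*} is a componentwise clipping and prox_{τg} is linear:

```python
import numpy as np

def pdhg_tv1d(b, lam, n_iter=10000):
    """PDHG for min_u 0.5*||u-b||^2 + lam*||Du||_1 on a chain:
    K = D, g(u) = 0.5*||u-b||^2, f*(p) = indicator{ |p_i| <= lam }."""
    tau = sigma = 0.99 / 2.0      # tau*sigma*||D||^2 < 1, since ||D||^2 <= 4
    u, p = b.astype(float), np.zeros(len(b) - 1)
    for _ in range(n_iter):
        dtp = -np.diff(np.concatenate(([0.0], p, [0.0])))    # D^T p
        u_new = (u - tau * dtp + tau * b) / (1.0 + tau)      # prox_{tau*g}
        # dual ascent at the over-relaxed point 2*u_new - u:
        p = np.clip(p + sigma * np.diff(2 * u_new - u), -lam, lam)
        u = u_new
    return u

b = np.array([0.0, 0.1, 3.0, 3.1, 2.9])
u = pdhg_tv1d(b, lam=5.0)   # lam large: the solution is the constant mean(b)
```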
The over-relaxation step 2xk+1 − xk = xk+1 + (xk+1 − xk ) can be inter-
preted as an approximate extragradient, and indeed it is possible to show
convergence of this method with a rate which is the same as in Nemirovski
(2004) (see also Chambolle and Pock 2011, 2015a). On the other hand, this
formula might recall similar relaxations present in other standard splitting
algorithms such as the Douglas–Rachford splitting or the ADMM (see Sec-
tions 5.3 and 5.4 below), and indeed, we then see that this algorithm is
merely a variant of these other methods, in a possibly degenerate metric.
He et al. (2014) observed that, letting z = (x, y), the iterations above can
be written as
    M(z^{k+1} − z^k) + Tz^{k+1} ∋ 0,                               (5.5)

where T is the monotone operator in (3.15) and M is the metric

    M = ( (1/τ)I   −K* )
        (  −K    (1/σ)I ),                                         (5.6)

which is positive definite if τσ‖K‖² < 1. Hence, in this form the primal–dual
¹¹ Primal–dual hybrid gradient. More precisely, the algorithm we describe here would
correspond to ‘PDHGMu’ and ‘PDHGMp’ in Esser et al. (2010), while ‘PDHG’ corresponds
to a plain Arrow–Hurwicz alternating scheme such as in Zhu and Chan (2008).
However, for simplicity we will keep the name ‘PDHG’ for the general converging
primal–dual method.
Acceleration
An interesting feature of these types of primal–dual iteration is the fact that
they can be ‘accelerated’ in cases when the objective function has more
regularity. The first case is when g + h (or f*) is strongly convex: see
Algorithm 8. Observe that if f* is µ_f-strongly convex, then x ↦ f(Kx) has an
(L²/µ_f)-Lipschitz gradient, and it is natural to expect that one will be able
to decrease the objective at rate O(1/k²) as before. Similarly, we expect
the same if g or h is strongly convex. This is the result we now state. We
should assume here that g is µ_g-convex, h is µ_h-convex, and µ = µ_g + µ_h > 0.
However, in this case it is no different from assuming that g is µ-convex, as
one can always replace h with h(x) − µ_h‖x‖²/2 (which is convex with (L_h −
µ_h)-Lipschitz gradient ∇h(x) − µ_h x), and g with g(x) + µ_h‖x‖²/2 (whose
proximity operator is as easy to compute as that of g). For notational simplicity,
we will thus restrict ourselves to this latter case – which is equivalent to the
general case upon replacing τ with τ′ = τ/(1 + τµ_h).
Theorem 5.2. Let (x^k, y^k)_{k≥0} be the iterations of Algorithm 8. For each
k, consider the averaged points¹²

    (X^k, Y^k) = (1/T_k) Σ_{i=1}^{k} θ^{−i+1} (x^i, y^i).

¹² This is called an ‘ergodic’ convergence rate.
Preconditioning
As a quick remark, we mention here that it is not always obvious how to
estimate the norm of the matrix L = ‖K‖ precisely and efficiently, without
which we cannot choose parameters correctly. An interesting use of gen-
eral preconditioners is suggested by Bredies and Sun (2015a, 2015b) for the
variants of the algorithm described in the next few sections. The main dif-
ficulty is that if the metric is changed, f and g might no longer be ‘simple’.
A simpler approach is suggested in Pock and Chambolle (2011), for prob-
lems where a diagonal preconditioning does not alter the property that the
proximal operators of f and g are easy to compute. Let us briefly describe
a variant which is very simple and allows for a large choice of diagonal pre-
conditioners. If we assume h = 0, then the PDHG algorithm of Theorem 5.1
can be written equivalently as a proximal-point iteration such as (5.5) (He
et al. 2014). Changing the metric means replacing M in (5.6) with
    M′ = ( T⁻¹   −K* )
         ( −K    Σ⁻¹ ),
where T and Σ are positive definite symmetric matrices. This means that
the prox operators in iteration (5.8) must be computed in the new met-
rics T⁻¹ and Σ⁻¹: in other words, the points (x̂, ŷ) are replaced with the
solutions of
    min_x ½‖x − x̄‖²_{T⁻¹} + ⟨ỹ, Kx⟩ + g(x),    min_y ½‖y − ȳ‖²_{Σ⁻¹} − ⟨y, Kx̃⟩ + f*(y).
The following strategy, which extends the choice in Pock and Chambolle
(2011), allows us to design matrices Σ and T such that this holds. We
assume here that X = Rn and Y = Rm , m, n ≥ 1, so K is an (m × n)-
matrix.
Lemma 5.5. Let (τ̃_i)_{1≤i≤n} and (σ̃_j)_{1≤j≤m} be arbitrary positive numbers,
and α ∈ [0, 2]. Then let T = diag(τ₁, …, τ_n) and Σ = diag(σ₁, …, σ_m), where

    τ_i = τ̃_i / (Σ_{j=1}^{m} σ̃_j |K_{j,i}|^{2−α}),    σ_j = σ̃_j / (Σ_{i=1}^{n} τ̃_i |K_{j,i}|^α).
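In code these preconditioners are cheap to form, and one can check numerically the property they are designed for (following Pock and Chambolle 2011), namely that ‖Σ^{1/2} K T^{1/2}‖ ≤ 1; the matrix below is random, and we take τ̃ = σ̃ = 1 for simplicity.

```python
import numpy as np

def diag_precond(K, alpha=1.0):
    """Diagonal step sizes in the spirit of Lemma 5.5 (with tilde weights = 1):
    tau_i = 1/sum_j |K_ji|^(2-alpha), sigma_j = 1/sum_i |K_ji|^alpha."""
    tau = 1.0 / np.sum(np.abs(K) ** (2.0 - alpha), axis=0)
    sigma = 1.0 / np.sum(np.abs(K) ** alpha, axis=1)
    return tau, sigma

rng = np.random.default_rng(5)
K = rng.standard_normal((7, 9))
tau, sigma = diag_precond(K, alpha=1.0)
# the scaled operator Sigma^(1/2) K T^(1/2) has norm at most 1 by construction
op_norm = np.linalg.norm(np.sqrt(sigma)[:, None] * K * np.sqrt(tau)[None, :], 2)
```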
[Figure 5.1: primal–dual gap versus iterations for PDHG, aPDHG and FISTA, with the reference rate O(1/k²).]
Figure 5.1. Minimizing the ROF model applied to the image in Figure 2.1. This
experiment shows that the accelerated primal–dual method with optimal dynamic
step sizes (aPDHG) is significantly faster than a primal–dual algorithm that uses
fixed step sizes (PDHG). For comparison we also show the performance of
accelerated proximal gradient descent (FISTA).
onal projection onto ℓ2 -balls with radius λ. This projection can be easily
computed using formula (4.23).
We implemented both the standard PDHG algorithm (Algorithm 6) and
its accelerated variant (Algorithm 8) and applied it to the image from Exam-
ple 2.1. For comparison we also ran the FISTA algorithm (Algorithm 5) on
the dual ROF problem. For the plain PDHG we used a fixed setting of the
step sizes τ = 0.1, σ = 1/(τL²), where L = ‖D‖ ≤ √8. For the accelerated
PDHG (aPDHG), we observe that the function g(u) is (µ_g = 1)-strongly
convex, and we used the proposed settings for dynamically updating the step
size parameters. The initial step size parameters were set to τ₀ = σ₀ = 1/L.
Figure 5.1 shows the decay of the primal–dual gap for PDHG, aPDHG
and FISTA. It can be observed that the dynamic choice of the step sizes
greatly improves the performance of the algorithm. It can also be observed
that the fixed choice of step sizes for the PDHG algorithm seems to be fairly
optimal for a certain accuracy, but for higher accuracy the performance of
the algorithm breaks down. We can also see that in terms of the primal–dual
gap – which in turn bounds the ℓ2 -error to the true solution – the aPDHG
algorithm seems to be superior to the FISTA algorithm.
[Figure 5.2: primal gap versus iterations for PD-explicit and PD-split, with the reference rate O(1/k).]
Figure 5.2. Minimizing the TV-deblurring problem applied to the image in Fig-
ure 2.2. We compare the performance of a primal–dual algorithm with explicit
gradient steps (PD-explicit) and a primal–dual algorithm that uses a full splitting
of the objective function (PD-split). PD-explicit seems to perform slightly better
at the beginning, but PD-split performs better for higher accuracy.
where, letting y = (p, q), K* = (D*, A*) and f*(y) = f_p*(p) + f_q*(q), with

    f_p*(p) = δ_{‖·‖_{2,∞} ≤ λ}(p),    f_q*(q) = ½‖q + u⋄‖²,
we obtain the saddle-point problem
    min_u max_y ⟨Ku, y⟩ − f*(y),
which exactly fits the class of problems that can be optimized by the PDHG
algorithm. To implement the algorithm, we just need to know how to com-
pute the proximal maps with respect to f ∗ . Since f ∗ is separable in p, q,
we can compute the proximal maps independently for both variables. The
formula to compute the proximal map for fp∗ is again given by the projec-
tion formula (4.23). The proximal map for fq∗ requires us to solve pixelwise
quadratic optimization problems. For a given q̃, its solution is given by
    q̂ = prox_{σf_q*}(q̃)  ⇔  q̂_{i,j} = (q̃_{i,j} − σu⋄_{i,j}) / (1 + σ).
We found it beneficial to apply a simple form of 2-block diagonal preconditioning
by observing that the linear operator K is composed of the two
distinct but regular blocks D and A. According to Lemma 5.5, we can
perform the following feasible choice of the step sizes: τ = c/(L + √L_h),
σ_p = 1/(cL), and σ_q = 1/(c√L_h), for some c > 0, where σ_p is used to update
the p variable and σ_q is used to update the q variable.
Note that we cannot rely on the accelerated form of the PDHG algorithm
because the objective function lacks strong convexity in u. However, the
objective function is strongly convex in the variable Au, which can be used
to achieve partial acceleration in the q variable (Valkonen and Pock 2015).
Figure 5.2 shows a comparison between the two different variants of the
PDHG algorithm for minimizing the TV-deblurring problem from Exam-
ple 2.2. In both variants we used c = 10. The true primal objective function
has been computed by running the ‘PD-split’ algorithm for a large number
of iterations. One can see that the ‘PD-split’ variant is significantly faster
for higher accuracy. The reason is that the choice of the primal step size in
‘PD-explicit’ is more restrictive (τ < 1/L_h). On the other hand, ‘PD-explicit’
seems to perform well at the beginning and also has a smaller memory
footprint.
    min_u λ‖Du‖_{2,1} + ‖u − u⋄‖₁,
Having detailed the computation of the proximal maps for the TV-ℓ1 model,
the implementation of the PDHG algorithm (Algorithm 6) is straightfor-
ward. The step size parameters were set to τ = σ = ‖D‖⁻¹. For compar-
ison, we also implemented the FBF algorithm (Tseng 2000) applied to the
primal–dual system (5.5), which for the TV-ℓ1 model and fixed step size is
[Figure 5.3: primal gap versus iterations for PDHG, FBF and SGM, with the reference rates O(1/k) and O(1/√k).]
Figure 5.3. Minimizing the TV-ℓ1 model applied to the image in Figure 2.3.
The plot shows a comparison of the convergence of the primal gap between the
primal–dual (PDHG) algorithm and the forward–backward–forward (FBF) algo-
rithm. PDHG and FBF perform almost equally well, but FBF requires twice as
many evaluations of the linear operator. We also show the performance of a plain
subgradient method (SGM) in order to demonstrate the clear advantage of PDHG
and FBF exploiting the structure of the problem.
given by
    u^{k+1/2} = prox_{τg}(u^k − τD*p^k),
    p^{k+1/2} = prox_{τf*}(p^k + τDu^k),
    u^{k+1} = u^{k+1/2} − τD*(p^{k+1/2} − p^k),
    p^{k+1} = p^{k+1/2} + τD(u^{k+1/2} − u^k).
Observe that the FBF method requires twice as many matrix–vector mul-
tiplications as the PDHG algorithm. For simplicity, we used a fixed step
size τ = ‖D‖⁻¹. We also tested the FBF method with an Armijo-type
line-search procedure, but it did not improve the results in this example.
Moreover, as a baseline, we also implemented a plain subgradient method
(SGM), as presented in (4.10). In order to compute a subgradient of the
total variation we used a Huber-type smoothing, but we set the smoothing
parameter to a very small value, ε = 10⁻³⁰. For the subgradient of the data
term, we just took the sign of the argument of the ℓ₁-norm. We used a
diminishing step size of the form c/√k for some c > 0 since it gave the best
results in our experiments.
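For completeness, here is the kind of plain subgradient iteration used as the SGM baseline, sketched with our own synthetic one-dimensional data, sign subgradients and c/√k steps, on min_u λ‖Du‖₁ + ‖u − u⋄‖₁; only the best iterate is kept, since the objective does not decrease monotonically.

```python
import numpy as np

def subgradient_tvl1(b, lam, c=1.0, n_iter=2000):
    """Plain subgradient method for min_u lam*||Du||_1 + ||u - b||_1 on a
    chain, with sign-based subgradients and diminishing steps c/sqrt(k)."""
    obj = lambda v: lam * np.sum(np.abs(np.diff(v))) + np.sum(np.abs(v - b))
    u = np.zeros_like(b)
    best_u, best_val = u.copy(), obj(u)
    for k in range(1, n_iter + 1):
        # a subgradient: lam * D^T sign(Du) + sign(u - b)
        g = -lam * np.diff(np.concatenate(([0.0], np.sign(np.diff(u)), [0.0])))
        g = g + np.sign(u - b)
        u = u - (c / np.sqrt(k)) * g
        if obj(u) < best_val:
            best_u, best_val = u.copy(), obj(u)
    return best_u, best_val

b = np.full(8, 2.0)       # constant signal: u = b is optimal with objective 0
u, val = subgradient_tvl1(b, lam=0.5)
```

The best objective value decays only slowly with k, which is the behaviour the O(1/√k) baseline curve reflects.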
Figure 5.3 shows the convergence of the primal gap, where we computed
the ‘true’ value of the primal objective function by running the PDHG
algorithm for a large number of iterations. It can be observed that both
5.2. Extensions
Convergence results for more general algorithms of the same form are found in many
papers in the literature: the subgradient can be replaced with general mono-
tone operators (Vũ 2013a, Boţ et al. 2015, Davis and Yin 2015). In particu-
lar, some acceleration techniques carry on to this setting, as observed in Boţ
et al. (2015). Davis and Yin (2015) discuss a slightly different method with
similar convergence properties and rates which mix the cases of subgradients
and monotone operators.
As for the case of forward–backward descent methods, this primal–dual
method (being a variant of a proximal-point method) can be over-relaxed in
some cases, or implemented with inertial terms, yielding better convergence
rates (Chambolle and Pock 2015a).
Another important extension involves the Banach/non-linear setting. The
proximity operators in (5.8) can be computed with non-linear metrics such
as in the mirror prox algorithm (4.40). It dates back at least to Nemirovski
(2004) in an extragradient form. For the form (5.8), it can be found in
Hohage and Homann (2014) and is also implemented in Yanez and Bach
(2014) to solve a matrix factorization problem. For a detailed convergence
analysis see Chambolle and Pock (2015a).
Finally we should mention important developments towards optimal rates:
Valkonen and Pock (2015) show how to exploit partial strong convexity
(with respect to some of the variables) to gain acceleration, and obtain a
rate which is optimal in both smooth and non-smooth situations; see also
Chen et al. (2014a).
A few extensions to non-convex problems have recently been proposed
(Valkonen 2014, Möllenhoff, Strekalovskiy, Moeller and Cremers 2015); see
Section 6 for details.
Algorithm 10 ADMM.
Choose γ > 0, y 0 , z 0 .
for all k ≥ 0 do
  Find x^{k+1} by minimizing x ↦ f(x) − ⟨z^k, Ax⟩ + (γ/2)‖b − Ax − By^k‖²,
  Find y^{k+1} by minimizing y ↦ g(y) − ⟨z^k, By⟩ + (γ/2)‖b − Ax^{k+1} − By‖²,
  Update z^{k+1} = z^k + γ(b − Ax^{k+1} − By^{k+1}).
end for
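A minimal sketch of Algorithm 10 (ours, not from the text) on the Lasso splitting min ½‖Mx − d‖² + λ‖y‖₁ subject to x − y = 0, that is, A = Id, B = −Id, b = 0; for an orthogonal design M = Id the result can be compared with plain soft-thresholding.

```python
import numpy as np

def admm_lasso(M, d, lam, gamma=1.0, n_iter=1000):
    """Algorithm 10 with f(x) = 0.5*||Mx-d||^2, g(y) = lam*||y||_1,
    A = Id, B = -Id, b = 0 (so the constraint is x = y)."""
    n = M.shape[1]
    x = y = z = np.zeros(n)
    H = np.linalg.inv(M.T @ M + gamma * np.eye(n))   # x-step is a linear solve
    for _ in range(n_iter):
        x = H @ (M.T @ d + z + gamma * y)            # minimize over x
        w = x - z / gamma
        y = np.sign(w) * np.maximum(np.abs(w) - lam / gamma, 0.0)  # over y
        z = z + gamma * (y - x)                      # b - Ax - By = y - x
    return x, y

d = np.array([3.0, -0.2, 1.5, 0.05])
x, y = admm_lasso(np.eye(4), d, lam=0.5)
```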
and Osher 2009, Zhang, Burger, Bresson and Osher 2010), which is inspired
by Bregman iterations (Brègman 1967) and whose implementation boils
down to an instance of the ADMM (though with interesting interpretations).
A fairly general description of the relationships between the ADMM and
similar splitting methods can be found in Esser (2009).
In its standard form, the ADMM aims at tackling constrained problems
of the form
    min_{Ax + By = b}  f(x) + g(y),                                (5.15)
If we introduce f̃(ξ) := min{f(x) : Ax = ξ} and g̃(η) := min{g(y) : By = η},
and set these to +∞ when the set of constraints is empty, then these
functions are convex, l.s.c., proper and the convex conjugates of f*(A*·) and
g*(B*·), respectively; see Rockafellar (1997, Corollary 31.2.1).¹³ Then one
can rewrite the iterations of Algorithm 10, letting ξ^k = Ax^k and η^k = By^k,
    ξ^{k+1} = prox_{f̃/γ}(b + z^k/γ − η^k),
    η^{k+1} = prox_{g̃/γ}(b + z^k/γ − ξ^{k+1}),                    (5.16)
    z^{k+1} = z^k + γ(b − ξ^{k+1} − η^{k+1}).
In fact it is generally impossible to express the functions f̃ and g̃ explicitly,
but the fact that the algorithm is computable implicitly assumes that the
operators prox_{τf̃} and prox_{τg̃} are computable. Observe that from the last two
steps, thanks to Moreau’s identity (3.8), we have
    z^{k+1}/γ = b + z^k/γ − ξ^{k+1} − prox_{g̃/γ}(b + z^k/γ − ξ^{k+1})
              = (1/γ) prox_{γg̃*}(z^k + γ(b − ξ^{k+1})).
Hence, letting τ = γ, σ = 1/γ, z̄^k = z^k + γ(b − ξ^k − η^k), we see that the
iterations (5.16) can be rewritten as
    ξ^{k+1} = prox_{σf̃}(ξ^k + σz̄^k),
    z^{k+1} = prox_{τg̃*}(z^k − τ(ξ^{k+1} − b)),                   (5.17)
    z̄^{k+1} = 2z^{k+1} − z^k,
¹³ In infinite dimensions, we must require for instance that f* is continuous at some point
A*ζ; see in particular Bouchitté (2006).
are discussed in Davis and Yin (2014a); see also He and Yuan (2015a). This
form of the ADMM has been generalized to problems involving more than
two blocks (with some structural conditions) (He and Yuan 2015c, Fu, He,
Wang and Yuan 2014) and/or to non-convex problems (see the references
in Section 6.3).
Accelerated ADMM
The relationship that exists between the two previous methods also allows us
to derive accelerated variants of the ADMM method if either the function
g̃ ∗ (z) = g ∗ (B ∗ z) or the function f˜ is strongly convex. The first case will
occur when g has Lg -Lipschitz gradient and B ∗ is injective; then it will
follow that g̃ ∗ is 1/(Lg k(BB ∗ )−1 k)-strongly convex. This should not cover
too many interesting cases, except perhaps the cases where B = Id and g is
smooth so that the problem reduces to
    ξ^{k+1} = prox_{σ_k f̃}(ξ^k + σ_k z̄^k),
    z^{k+1} = prox_{τ_k g̃*}(z^k − τ_k(ξ^{k+1} − b)),
    θ_k = 1/√(1 + τ_k/L_g),    τ_{k+1} = θ_k τ_k,    σ_{k+1} = 1/τ_{k+1},
    z̄^{k+1} = z^{k+1} + θ_k(z^{k+1} − z^k).
This, in turn, can be rewritten in the following ‘ADMM’-like form, letting
¹⁴ However, if we have a fast solver for the prox of g̃, it might still be interesting to
consider the ADMM option.
¹⁵ If both cases occur, then of course one must expect linear convergence, as in the
previous section (Theorem 5.4). A derivation from the convergence of the primal–dual
algorithm is found in Tan (2016), while general linear rates for the ADMM in smooth
cases (including with over-relaxation and/or linearization) are proved by Deng and Yin
(2015).
ξ^k = Ax^k, η^k = y^k, and τ_k = γ_k:

    x^{k+1} = argmin_x f(x) − ⟨z^k, Ax⟩ + (γ_k/2)‖b − Ax − y^k‖²,
    y^{k+1} = argmin_y g(y) − ⟨z^k, y⟩ + (γ_k/2)‖b − Ax^{k+1} − y‖²,
    z^{k+1} = z^k + γ_k(b − Ax^{k+1} − y^{k+1}),                   (5.18)
    γ_{k+1} = γ_k/√(1 + γ_k/L_g).
¹⁶ As L_h = 0 and L = 1.
Linearized ADMM
An important remark of Chambolle and Pock (2011), successfully exploited
by Shefi and Teboulle (2014) to derive new convergence rates, is that the
‘PDHG’ primal–dual algorithm (5.3)–(5.4) is exactly the same as a lin-
earized variant of the ADMM for B = Id, with the first minimization step
replaced by a proximal descent step (following a general approach intro-
duced in Chen and Teboulle 1994),
    x^{k+1} = argmin_x f(x) − ⟨z^k, Ax⟩ + (γ/2)‖b − Ax − y^k‖² + (γ/2)‖x − x^k‖²_M,   (5.23)
[Figure 5.4: primal–dual gap versus iterations for ADMM, aADMM and aPDHG, with the reference rates O(1/k) and O(1/k²).]
Figure 5.4. Comparison of ADMM and accelerated ADMM (aADMM) for solving
the ROF model applied to the image in Figure 2.1. For comparison we also plot
the convergence of the accelerated primal–dual algorithm (aPDHG). The ADMM
methods are fast, especially at the beginning.
with a few iterations of a linear solver, and in many cases the output will
be equivalent to exactly (5.23) in some (not necessarily known) metric M
with M + A*A + (1/γ)K*K ≥ 0. (For example, this occurs in the ‘split
Bregman’ algorithm (Goldstein and Osher 2009), for which it has been ob-
served, and proved by Zhang, Burger and Osher (2011), that one can do
only one inner iteration of a linear solver; see also Yin and Osher (2013),
who study inexact implementations.) For a precise statement we refer to
Bredies and Sun (2015b, Section 2.3). It is shown there and in Bredies and
Sun (2015a, 2015c) that careful choice of a linear preconditioner can lead to
very fast convergence. A generalization of the ADMM in the same flavour
is considered in Deng and Yin (2015), and several convergence rates are
derived in smooth cases.
¹⁷ We will discuss acceleration strategies in the spirit of Theorem 5.2 in a forthcoming
paper.
Figure 5.5. Solving the image deblurring problem from Example 2.2. (a) Problem
(2.7) after 150 iterations of Douglas–Rachford (DR) splitting. (b) Huber variant
after 150 iterations with accelerated DR splitting. The figure shows that after the
same number of iterations, the accelerated algorithm yields a higher PSNR value.
6. Non-convex optimization
In this very incomplete section, we mention some extensions of the meth-
ods described so far to non-convex problems. Of course, many interesting
optimization problems in imaging are not convex. If f is a smooth non-
convex function, many of the optimization methods designed for smooth
convex functions will work and find a critical point of the function. For
instance, a simple gradient method (4.2) always guarantees that, denoting
g^k = ∇f(x^k),

f(x^{k+1}) = f(x^k − τ g^k)
           = f(x^k) − τ ⟨∇f(x^k), g^k⟩ + ∫₀^τ (τ − t) ⟨D²f(x^k − t g^k) g^k, g^k⟩ dt
           ≤ f(x^k) − τ (1 − τL/2) ‖g^k‖²
as long as D2 f ≤ L Id, whether positive or not. Hence, if 0 < τ < 2/L, then
f (xk ) will still be decreasing. If f is coercive and bounded from below, we
deduce that subsequences of (xk )k converge to some critical point. Likewise,
inertial methods can be used and are generally convergent (Zavriev and
Kostyuk 1991) if ∇f is L-Lipschitz and with suitable assumptions which
ensure the boundedness of the trajectories.
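This monotone decrease is easy to observe numerically. The following sketch (a one-dimensional toy example of our own choosing, not from the text) runs the plain gradient method on the non-convex function f(x) = log(1 + x²), for which D²f ≤ L·Id with L = 2, and records that f(xᵏ) is non-increasing for a step 0 < τ < 2/L while the gradient residual vanishes:

```python
import math

def f(x):
    # non-convex test function (our assumption for illustration);
    # f''(x) = (2 - 2*x*x) / (1 + x*x)**2, so D^2 f <= L*Id with L = 2
    return math.log(1.0 + x * x)

def grad_f(x):
    return 2.0 * x / (1.0 + x * x)

L = 2.0
tau = 0.9 * (2.0 / L)          # any 0 < tau < 2/L guarantees descent
x = 3.0
vals = [f(x)]
for _ in range(100):
    x = x - tau * grad_f(x)    # x^{k+1} = x^k - tau * g^k
    vals.append(f(x))

monotone = all(vals[k + 1] <= vals[k] + 1e-12 for k in range(len(vals) - 1))
residual = abs(grad_f(x))      # goes to zero at a critical point
```

Even though f is concave for |x| > 1, the descent estimate only uses the upper bound on D²f, so the energy still decreases.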
Then one will generally look for a critical point (hoping of course that it
might be optimal!) by trying to find x∗ such that
∇f (x∗ ) + ∂g(x∗ ) ∋ 0.
There is a vast literature on optimization techniques for such problems,
which have been tackled in this form at least since Mine and Fukushima
(1981) and Fukushima and Mine (1981). These authors study and prove the
convergence of a proximal FB descent (combined with an approximate line-
search in the direction of the new point) for non-convex f . Recent contribu-
tions in this direction, in particular for imaging problems, include those of
Grasmair (2010), Chouzenoux, Pesquet and Repetti (2014), Bredies, Lorenz
and Reiterer (2015a) and Nesterov (2013). We will describe the inertial ver-
sion of Ochs, Chen, Brox and Pock (2014), which is of the same type but
seems empirically faster, which is natural to expect as it reduces to the
standard heavy ball method (Section 4.8.2) in the smooth case. Let us de-
scribe the simplest version, with constant steps: see Algorithm 11. Here
again, L is the Lipschitz constant of ∇f . Further, subsequences of (xk )k
will still converge to critical points of the energy; see Ochs et al. (2014, The-
orem 4.8). This paper also contains many interesting variants (with varying
steps, monotone algorithms, etc.), as well as convergence rates for the
residual of the method.
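Algorithm 11 is not reproduced in this excerpt, but the constant-step inertial forward–backward update it builds on can be sketched as x^{k+1} = prox_{αg}(x^k − α∇f(x^k) + β(x^k − x^{k−1})). Below is a minimal illustration on a toy problem of our own (the choices f(x) = log(1 + x²), g = indicator of [0.5, ∞), β = 0.7 and the bound α < 2(1 − β)/L are assumptions of this sketch):

```python
def grad_f(x):
    # f(x) = log(1 + x^2): smooth, non-convex, gradient is L-Lipschitz with L = 2
    return 2.0 * x / (1.0 + x * x)

def prox_g(x):
    # g = indicator of [0.5, +inf): the prox is a projection (clamp), for any step
    return max(x, 0.5)

L = 2.0
b = 0.7                          # inertial parameter beta
a = 0.9 * 2.0 * (1.0 - b) / L    # step strictly below 2*(1 - beta)/L
x_prev = x = 3.0
for _ in range(200):
    # inertial proximal gradient step
    x, x_prev = prox_g(x - a * grad_f(x) + b * (x - x_prev)), x
```

The iterates settle at x = 0.5, a critical point: −∇f(0.5) lies in the normal cone of the constraint, so 0 ∈ ∇f(x∗) + ∂g(x∗).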
where f is again smooth but not necessarily convex, while g1 , g2 are non-
smooth and simple functions, possibly non-convex.
The convergence of alternating minimizations or proximal (implicit) descent steps in this setting (which is not necessarily covered by the general approach of Tseng 2001) has been studied by Attouch et al. (2013),
Attouch, Bolte, Redont and Soubeyran (2010) and Beck and Tetruashvili
(2013). However, Bolte, Sabach and Teboulle (2014) have observed that,
in general, these alternating steps will not be computable. These authors
propose instead to alternate linearized proximal descent steps, as shown in
Algorithm 12. Here, L1 (y) is the Lipschitz constant of ∇x f (·, y), while L2 (x)
is the Lipschitz constant of ∇y f (x, ·). These are assumed to be bounded
from below18 and above (in the original paper the assumptions are slightly
weaker). Also, for convergence one must require that a minimizer exists; in
particular, the function must be coercive.
Then it is proved by Bolte et al. (2014, Lemma 5) that the distance of
the iterates to the set of critical points of (6.2) goes to zero. Additional
convergence results are shown if, in addition, the objective function has a
very generic ‘KL’ property. We have presented a simplified version of the
PALM algorithm: in fact, there can be more than two blocks, and the simple
functions gi need not even be convex: as long as they are bounded from
below, l.s.c., and their proximity operator (which is possibly multivalued,
but still well defined by (3.6)) can be computed, then the algorithm will
converge. We use an inertial variant of PALM (Pock and Sabach 2016) in
Section 7.12 to learn a dictionary of patches.
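Algorithm 12 itself is not reproduced in this excerpt; a minimal sketch of the alternating linearized proximal steps it performs is given below, on a toy coupling f(x, y) = ½(xy − 1)² with box indicators for g1, g2 (all choices here are illustrative assumptions; note the boxes also keep the partial Lipschitz constants L1(y) = y² and L2(x) = x² bounded from below and above, exactly as the convergence theory requires):

```python
def clip(t, lo, hi):
    # prox of the box indicator = projection onto [lo, hi]
    return min(max(t, lo), hi)

def palm(x, y, iters=100, gamma=1.1, box=(0.1, 2.0)):
    """Alternating linearized proximal steps (PALM-style) for
    min f(x, y) + g1(x) + g2(y), f(x, y) = 0.5*(x*y - 1)**2,
    with g1 = g2 = indicator of [0.1, 2] (toy example)."""
    for _ in range(iters):
        c = gamma * y * y                              # c_k = gamma * L1(y^k)
        x = clip(x - y * (x * y - 1.0) / c, *box)      # grad_x f = y*(x*y - 1)
        d = gamma * x * x                              # d_k = gamma * L2(x^{k+1})
        y = clip(y - x * (x * y - 1.0) / d, *box)      # grad_y f = x*(x*y - 1)
    return x, y

x, y = palm(2.0, 0.3)
```

The iterates approach the set of critical points, here the hyperbola xy = 1 intersected with the box.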
min_u λ ϕ(Du) + (1/2) ‖Au − u⋄‖².   (6.5)
[Figure 6.1 plot: primal energy versus iterations (log–log scale) for ADMM and iPiano.]
Figure 6.1. Image deblurring using a non-convex variant of the total variation.
The plot shows the convergence of the primal energy for the non-convex TV model
using ADMM and iPiano. In order to improve the presentation in the plot, we
have subtracted a strict lower bound from the primal energy. ADMM is faster at
the beginning but iPiano finds a slightly lower energy.
ϕ(p) = (1/2) Σ_{i,j} ln( 1 + |p_{i,j}|₂² / μ² ),
(for all pixels i, j), which we can compute here using a fixed point (Newton)
iteration, or by solving a third-order polynomial.
The second approach is based on directly minimizing the primal objective
using the iPiano algorithm (Algorithm 11). We perform a forward–backward
splitting by taking explicit steps with respect to the (differentiable) regular-
izer f (u) = λϕ(Du), and perform a backward step with respect to the data
term g(u) = 21 kAu − u⋄ k2 . The gradient with respect to the regularization
(a) NC, TV, ADMM (PSNR ≈ 27.80) (b) NC, TV, iPiano (PSNR ≈ 27.95)
Figure 6.2. Image deblurring using non-convex functions after 150 iterations. (a, b)
Results of the non-convex TV-deblurring energy obtained from ADMM and iPiano.
(c) Result obtained from the non-convex learned energy, and (d) convolution filters
Dk sorted by their corresponding λk value (in descending order) used in the non-
convex learned model. Observe that the learned non-convex model leads to a
significantly better PSNR value.
term is given by
∇f(u) = (λ/μ²) D∗ p̃,

where p̃ is of the form p̃ = (p̃_{1,1}, . . . , p̃_{m,n}), and

p̃_{i,j} = (Du)_{i,j} / ( 1 + |(Du)_{i,j}|₂² / μ² ).
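This gradient can be sanity-checked against finite differences. The sketch below is our own transcription, assuming D is the forward-difference operator of (2.4) with Neumann boundary conditions; it implements ∇f(u) = (λ/μ²) D∗p̃ and compares one component with a central-difference approximation of f(u) = λϕ(Du):

```python
import numpy as np

def D(u):
    # forward differences with Neumann boundary (assumption, as in (2.4))
    du = np.zeros(u.shape + (2,))
    du[:-1, :, 0] = u[1:, :] - u[:-1, :]
    du[:, :-1, 1] = u[:, 1:] - u[:, :-1]
    return du

def D_adj(p):
    # adjoint D^*, i.e. <Du, p> = <u, D^* p> (negative discrete divergence)
    out = np.zeros(p.shape[:2])
    out[:-1, :] -= p[:-1, :, 0]
    out[1:, :] += p[:-1, :, 0]
    out[:, :-1] -= p[:, :-1, 1]
    out[:, 1:] += p[:, :-1, 1]
    return out

def phi(p, mu):
    # phi(p) = (1/2) * sum_{i,j} log(1 + |p_ij|_2^2 / mu^2)
    return 0.5 * np.sum(np.log1p(np.sum(p ** 2, axis=-1) / mu ** 2))

def grad_f(u, lam, mu):
    # grad f(u) = (lam/mu^2) * D^* p~ with p~ = Du / (1 + |Du|^2/mu^2)
    du = D(u)
    p_tilde = du / (1.0 + np.sum(du ** 2, axis=-1, keepdims=True) / mu ** 2)
    return (lam / mu ** 2) * D_adj(p_tilde)

rng = np.random.default_rng(0)
u = rng.standard_normal((5, 4))
lam, mu, eps = 0.3, 0.7, 1e-6
g = grad_f(u, lam, mu)
e = np.zeros_like(u)
e[2, 1] = eps
num = (lam * phi(D(u + e), mu) - lam * phi(D(u - e), mu)) / (2 * eps)
p_rand = rng.standard_normal(u.shape + (2,))
adj_err = abs(float(np.sum(D(u) * p_rand) - np.sum(u * D_adj(p_rand))))
```

The adjoint identity is checked as well, since a wrong D∗ is the most common bug in such implementations.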
mented using the FFT. We used the following parameter settings for the
iPiano algorithm: β = 0.7 and α = 2(1 − β)/L.
Moreover, we implemented a variant of (6.5), where we have replaced the
non-convex TV regularizer with a learned regularizer of the form
Σ_{k=1}^{K} λ_k ϕ(D_k u),
7. Applications
In the rest of the paper we will show how the algorithms presented so far
can be used to solve a number of interesting problems in image process-
ing, computer vision and learning. We start by providing some theoretical
background on the total variation and some extensions.
and read |ϕ(x)|◦ ≤ 1 for all x. The most common choices (at least for grey-
scale images) are (possibly weighted) 2- and 1-norms. The main advantage
of the total variation is that it allows for sharp jumps across hypersurfaces,
for example edges or boundaries in the image, while being a convex func-
tional, in contrast to other Sobolev norms. For smooth images u we easily
check from (7.2) (integrating by parts) that it reduces to the L1 -norm of
the image gradient, but it is also well defined for non-smooth functions.
For characteristic functions of sets it measures the length or surface of the
boundary of the set inside Ω (this again is easy to derive, at least for smooth
sets, from (7.2) and Green’s formula). This also makes the total variation
interesting for geometric problems such as image segmentation.
Concerning the data-fitting term, numerous variations of (7.1) have been
proposed in the literature. A simple modification of the ROF model is to
replace the squared data term with an L1 -norm (Nikolova 2004, Chan and
Esedoḡlu 2005):
min_u λ ∫_Ω |Du| + ∫_Ω |u(x) − u⋄(x)| dx.   (7.3)
The resulting model, called the 'TV-ℓ1 model', turns out to have interesting new properties. It is purely geometric in the sense that the energy decomposes on the level sets of the image. Hence, it can be used to remove
structures of an image of a certain scale, and the regularization parame-
ter λ can be used for scale selection. The TV-ℓ1 model is also effective in
removing impulsive (outlier) noise from images.
In the presence of Poisson noise, a popular data-fitting term (justified by a Bayesian derivation) is the generalized Kullback–Leibler divergence, leading to the 'TV-entropy' model used below.
Figure 7.1. Contrast invariance of the TV-ℓ1 model. (a–d) Result of the TV-ℓ1
model for varying values of the regularization parameter λ. (e–h) Result of the ROF
model for varying values of λ. Observe the morphological property of the TV-ℓ1
model. Structures are removed only with respect to their size, but independent of
their contrast.
This model has applications in synthetic aperture radar (SAR) imaging, for
example.
We have already detailed the discretization of TV models in (2.6) and we
have shown that an efficient algorithm to minimize total variation models
is the PDHG algorithm (Algorithm 6 and its variants). A saddle point
formulation of discrete total variation models that summarizes the different
aforementioned data-fitting terms is as follows:
min_u max_p ⟨Du, p⟩ + g(u) − δ_{{‖·‖_{2,∞} ≤ λ}}(p),
and for g(u) = Σ_{i,j} (u_{i,j} − u⋄_{i,j} log u_{i,j}) + δ_{(0,∞)}(u) we obtain the TV-entropy
model. The implementation of the models using the PDHG algorithm only
differs in the implementation of the proximal operators û = proxτ g (ũ). For
all 1 ≤ i ≤ m, 1 ≤ j ≤ n the respective proximal operators are given by
û_{i,j} = (ũ_{i,j} + τ u⋄_{i,j}) / (1 + τ)   (ROF),

û_{i,j} = u⋄_{i,j} + max{0, |ũ_{i,j} − u⋄_{i,j}| − τ} · sgn(ũ_{i,j} − u⋄_{i,j})   (TV-ℓ1),

û_{i,j} = max{ 0, (ũ_{i,j} − τ + √((ũ_{i,j} − τ)² + 4τ u⋄_{i,j})) / 2 }   (TV-entropy).
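These proximal maps act pixel by pixel, so they can be transcribed and checked directly against their one-dimensional optimality conditions; a sketch (our own scalar versions, with `ut` the argument ũ and `ud` the datum u⋄):

```python
import math

def prox_rof(ut, ud, tau):
    # g(u) = (1/2)*(u - ud)^2: optimality (u - ud) + (u - ut)/tau = 0
    return (ut + tau * ud) / (1.0 + tau)

def prox_tvl1(ut, ud, tau):
    # g(u) = |u - ud|: soft shrinkage around ud
    d = ut - ud
    return ud + max(0.0, abs(d) - tau) * (1.0 if d >= 0 else -1.0)

def prox_tventropy(ut, ud, tau):
    # g(u) = u - ud*log(u) on (0, inf): positive root of
    # u^2 + (tau - ut)*u - tau*ud = 0
    return max(0.0, (ut - tau + math.sqrt((ut - tau) ** 2 + 4.0 * tau * ud)) / 2.0)

# optimality residual of the entropy prox: 1 - ud/u + (u - ut)/tau = 0
uhat = prox_tventropy(1.0, 2.0, 0.5)
resid = 1.0 - 2.0 / uhat + (uhat - 1.0) / 0.5
```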
Figure 7.1 demonstrates the contrast invariance of the TV-ℓ1 model and
compares it to the ROF model. Both models were minimized using Algo-
rithm 6 (PDHG) or Algorithm 8. Gradually increasing the regularization
parameter λ in the TV-ℓ1 model has the effect that increasingly larger struc-
tures are removed from the image. Observe that the structures are removed
only with respect to their size and not with respect to their contrast. In
the ROF model, however, scale and contrast are mixed such that gradually
increasing the regularization parameter results in removing structures with
increased size and contrast.
Figure 7.2 compares the ROF model with the TV-entropy model for image
denoising in the presence of Poisson noise. The noisy image of size 480 × 640 pixels has been generated by degrading an aerial image of Graz, Austria, with Poisson noise, where the Poisson parameter is given by the image values scaled between 0 and 50.
Both models have been minimized using the PDHG algorithm. It can be
observed that the TV-entropy model adapts better to the noise properties of
the Poisson noise and hence leads to better preservation of dark structures
and exhibits better contrast.
where u ∈ BV(Ω), v ∈ BV(Ω; R²), and λ0, λ1 > 0 are tuning parameters. The
idea of TGV2 is to force the gradient Du of the image to deviate only on a
sparse set from a vector field v which itself has sparse gradient. This will get
Figure 7.2. Total variation based image denoising in the presence of Poisson noise.
(a) Aerial view of Graz, Austria, (b) noisy image degraded by Poisson noise. (c) Re-
sult using the ROF model, and (d) result using the TV-entropy model. One can see
that the TV-entropy model leads to improved results, especially in dark regions,
and exhibits better contrast.
rid of the staircasing effect on affine parts of the image, while still preserving
the possibility of having sharp edges. The discrete counterpart of (7.4) can
be obtained by applying the same standard discretization techniques as in
the case of the ROF model.
We introduce the discrete scalar images u, u⋄ ∈ Rm×n and vectorial image
v = (v1 , v2 ) ∈ Rm×n×2 . The discrete version of the TGV2 model is hence
given by
min_{u,v} λ_1 ‖Du − v‖_{2,1} + λ_0 ‖Dv‖_{2,1} + (1/2) ‖u − u⋄‖²,
where D : Rm×n×2 → Rm×n×4 is again a finite difference operator that
computes the Jacobian (matrix) of the vectorial image v, which we treat
as a vector here. It can be decomposed into Dv = (Dv1 , Dv2 ), where D
is again the standard finite difference operator introduced in (2.4). The
discrete versions of the total first- and second-order variations are given by
‖Du − v‖_{2,1} = Σ_{i,j=1}^{m,n} √( ((Du)_{i,j,1} − v_{i,j,1})² + ((Du)_{i,j,2} − v_{i,j,2})² ),

‖Dv‖_{2,1} = Σ_{i,j=1}^{m,n} √( (Dv1)²_{i,j,1} + (Dv1)²_{i,j,2} + (Dv2)²_{i,j,1} + (Dv2)²_{i,j,2} ).
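A direct numerical transcription of these two terms can be sketched as follows (our own code; D is assumed to be the forward-difference operator of (2.4) with Neumann boundary). A useful sanity check is that for an affine image the choice v = Du makes the first-order term vanish, which is exactly the mechanism by which TGV² avoids staircasing:

```python
import numpy as np

def D(u):
    # forward differences with Neumann boundary (assumption, as in (2.4))
    du = np.zeros(u.shape + (2,))
    du[:-1, :, 0] = u[1:, :] - u[:-1, :]
    du[:, :-1, 1] = u[:, 1:] - u[:, :-1]
    return du

def norm21(p):
    # ||p||_{2,1}: sum over pixels of the pointwise 2-norm of the remaining axes
    return float(np.sum(np.sqrt(np.sum(p ** 2, axis=tuple(range(2, p.ndim))))))

def tgv2(u, v, lam1, lam0):
    # lam1*||Du - v||_{2,1} + lam0*||Dv||_{2,1}, with Dv = (Dv1, Dv2)
    dv = np.stack([D(v[..., 0]), D(v[..., 1])], axis=-1)
    return lam1 * norm21(D(u) - v) + lam0 * norm21(dv)

# an affine (noise-free) image: v = Du annihilates the first-order term
i, j = np.meshgrid(np.arange(6.0), np.arange(7.0), indexing="ij")
u_affine = 0.5 * i - 0.25 * j
first_term = tgv2(u_affine, D(u_affine), 1.0, 0.0)
```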
Figure 7.3. Comparison of TV and TGV2 denoising. (a) Original input image,
and (b) noisy image, where we have added Gaussian noise with standard deviation
σ = 0.1. (c) Result obtained from the ROF model, and (d) result obtained by min-
imizing the TGV2 model. The main advantage of the TGV2 model over the ROF
model is that it is better at reconstructing smooth regions while still preserving
sharp discontinuities.
where the σn (J(x)) denote the singular values of the Jacobian J(x) (i.e.,
the square roots of the eigenvalues of J(x)J(x)∗ or J(x)∗ J(x)).
If p = 2, the resulting norm is equivalent to the Frobenius norm, which
corresponds to one of the most classical choices (Bresson and Chan 2008),
though other choices might also be interesting (Sapiro and Ringach 1996,
Figure 7.4. Denoising a colour image using the vectorial ROF model. (a) Original
RGB colour image, and (b) its noisy variant, where Gaussian noise with standard
deviation σ = 0.1 has been added. (c) Solution of the vectorial ROF model using
the Frobenius norm, and (d) solution using the nuclear norm. In smooth regions
the two variants lead to similar results, while in textured regions the nuclear norm
leads to significantly better preservation of small details (see the close-up views
in (c, d)).
variation:
∫_Ω |Du|_{Sp} = sup{ −∫_Ω u(x) · div ϕ(x) dx : ϕ ∈ C^∞(Ω; R^{d×k}), |ϕ(x)|_{Sq} ≤ 1, ∀x ∈ Ω },   (7.5)
where q is the parameter of the polar norm associated with the parameter
p of the Schatten norm and is given by 1/p + 1/q = 1. Based on that we
can define a vectorial ROF model as
min_u λ ∫_Ω |Du|_{Sp} + (1/2) ∫_Ω |u(x) − u⋄(x)|₂² dx.   (7.6)
The discretization of the vectorial ROF model is similar to the discretiza-
tion of the standard ROF model. We consider a discrete colour image
u = (ur , ug , ub ) ∈ Rm×n×3 , where ur , ug , ub ∈ Rm×n denote the red, green,
and blue colour channels, respectively. We also consider a finite difference
operator D : Rm×n×3 → Rm×n×2×3 given by Du = (Dur , Dug , Dub ), where
D is again the finite difference operator defined in (2.4). The discrete colour
ROF model based on the 1-Schatten norm is given by
min_u λ ‖Du‖_{S1,1} + (1/2) ‖u − u⋄‖².
The vectorial ROF model can be minimized either by applying Algorithm 5
to its dual formulation or by applying Algorithm 8 to its saddle-point for-
mulation. Let us consider the saddle-point formulation:
min_u max_P ⟨Du, P⟩ + (1/2) ‖u − u⋄‖² − δ_{{‖·‖_{S∞,∞} ≤ λ}}(P),
where P ∈ Rm×n×2×3 is the tensor-valued dual variable, hence the dual
variable can also be written as P = (P1,1 , . . . , Pm,n ), where Pi,j ∈ R2×3 is
a 2 × 3 matrix. Hence, the polar norm ball {‖P‖_{S∞,∞} ≤ λ} is also given by

{P = (P_{1,1}, . . . , P_{m,n}) : |P_{i,j}|_{S∞} ≤ λ, for all i, j},

that is, the set of variables P whose tensor-valued components P_{i,j} have an operator norm less than or equal to λ. To compute the projection onto the polar norm ball we can use the singular value decomposition (SVD) of the
matrices. Let U ∈ R^{2×2}, S = diag(s_1, s_2) ∈ R^{2×3}, and V ∈ R^{3×3} be an SVD of P̃_{i,j}, that is, P̃_{i,j} = U S V^T. As shown by
Cai, Candès and Shen (2010), the orthogonal projection of P̃i,j to the polar
norm ball {kPkS∞ ,∞ ≤ λ} is
Π_{{‖·‖_{S∞,∞} ≤ λ}}(P̃_{i,j}) = U S_λ V^T,   S_λ = diag(min{s_1, λ}, min{s_2, λ}).
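Clipping the singular values is straightforward with a numerical SVD routine; a sketch using NumPy (our own transcription of the projection formula of Cai, Candès and Shen 2010, written for a single 2 × 3 component):

```python
import numpy as np

def proj_sinf_ball(P, lam):
    """Orthogonal projection of one matrix P_ij onto {|P|_{S_inf} <= lam}:
    compute an SVD, clip the singular values at lam, recompose."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return U @ np.diag(np.minimum(s, lam)) @ Vt

P = np.array([[3.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # singular values (3, 1)
Q = proj_sinf_ball(P, 2.0)        # clipped to (2, 1)
R = proj_sinf_ball(Q, 2.0)        # a point inside the ball is left unchanged
```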
Figure 7.4 shows an example of denoising a colour image of size 384 × 512
with colour values in the range [0, 1]3 . It can be seen that the nuclear norm
where F : C^{m×n} → C^{m×n} denotes the (discrete) fast Fourier transform,
and ◦ denotes the Hadamard product (the element-wise product of the two
matrices). In order to minimize the TV-MRI objective function, we first
transform the problem into a saddle-point problem:
min_u max_p ⟨Du, p⟩ + Σ_{c=1}^{C} (1/2) ‖F(σ_c ◦ u) − g_c‖² − δ_{{‖·‖_{2,∞} ≤ λ}}(p),
where p ∈ Cm×n×2 is the dual variable. Observe that we have just dualized
the total variation term but kept the data-fitting term
h(u) = Σ_{c=1}^{C} (1/2) ‖F(σ_c ◦ u) − g_c‖²
19 Data courtesy of Florian Knoll, Center for Biomedical Imaging and Center for Advanced Imaging Innovation and Research (CAI2R), Department of Radiology, NYU School of Medicine.
where different norms can be considered for both the total variation and
the data-fitting term. The most common choice is p = 2, and q = 1 (Brox,
Bruhn, Papenberg and Weickert 2004, Zach, Pock and Bischof 2007, Cham-
bolle and Pock 2011). For numerical solution we discretize the TV-ℓ1 op-
tical flow model in the same spirit as we did with the previous TV mod-
els. We consider a discrete velocity field v = (v1 , v2 ) ∈ Rm×n×2 , where
v1 corresponds to the horizontal velocity and v2 corresponds to the ver-
tical velocity. It can also be written in the form of v = (v1,1 , . . . , vm,n ),
where vi,j = (vi,j,1 , vi,j,2 ) is the local velocity vector. To discretize the total
variation, we again consider a finite difference approximation of the vecto-
rial gradient D : Rm×n×2 → Rm×n×4 , defined by Dv = (Dv1 , Dv2 ), where
D is defined in (2.4). In order to discretize the data term, we consider a
certain point in time for which we have computed finite difference approx-
imations for the space-time gradient of I(x, t). It is necessary to have at
least two images in time in order to compute the finite differences in time.
Figure 7.6. Optical flow estimation using total variation. (a) A blending of the
two input images. (b) A colour coding of the computed velocity field. The colour
coding of the velocity field is shown in the upper left corner of the image.
where
ri,j · (vi,j , 1) = ri,j,1 vi,j,1 + ri,j,2 vi,j,2 + ri,j,3 .
For the vectorial total variation we consider the standard 2-vector norm,
that is,
‖Dv‖_{2,1} = Σ_{i,j=1}^{m,n} √( (Dv1)²_{i,j,1} + (Dv1)²_{i,j,2} + (Dv2)²_{i,j,1} + (Dv2)²_{i,j,2} ).
A simple computation (Zach et al. 2007) shows that the proximal map is
given by
v̂ = prox_{τg}(ṽ) ⇔

v̂_{i,j} = ṽ_{i,j} + { τ r_{i,j}    if r_{i,j} · (ṽ_{i,j}, 1) < −τ |r_{i,j}|²,
                     −τ r_{i,j}    if r_{i,j} · (ṽ_{i,j}, 1) > τ |r_{i,j}|²,
                     −( r_{i,j} · (ṽ_{i,j}, 1) / |r_{i,j}|² ) r_{i,j}    else }.
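A per-pixel sketch of this thresholding step is below (our own transcription; note that since v is two-dimensional, only the spatial part a = (r_1, r_2) of the 3-vector r can enter the additive terms, and we read |r_{i,j}| as |a| accordingly — an assumption of this sketch). The result is verified against the defining minimization of the prox:

```python
import numpy as np

def prox_flow(vt, r, tau):
    """Prox of tau*|rho(.)| with rho(v) = r[0]*v[0] + r[1]*v[1] + r[2]
    (the TV-l1 flow data term of Zach et al. 2007), for one pixel."""
    a = r[:2]                         # spatial part of r (assumption, see above)
    rho = float(a @ vt + r[2])
    na2 = float(a @ a)
    if rho < -tau * na2:
        return vt + tau * a
    if rho > tau * na2:
        return vt - tau * a
    return vt - (rho / na2) * a       # middle case: sets rho(v_hat) = 0

def objective(v, vt, r, tau):
    # the prox is by definition the minimizer of this convex function
    return tau * abs(r[0] * v[0] + r[1] * v[1] + r[2]) + 0.5 * np.sum((v - vt) ** 2)

r = np.array([1.0, 2.0, 0.5])
vt = np.array([0.3, -0.2])
tau = 0.1
vhat = prox_flow(vt, r, tau)
best = objective(vhat, vt, r, tau)
rng = np.random.default_rng(0)
is_min = all(best <= objective(vhat + 1e-3 * rng.standard_normal(2), vt, r, tau) + 1e-12
             for _ in range(200))
```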
Figure 7.7. Image inpainting using shearlet regularization. (a) Original image,
and (b) input image with a randomly chosen fraction of 10% of the image pix-
els. (c) Reconstruction using TV regularization, and (d) reconstruction using the
shearlet model. Observe that the shearlet-based model leads to significantly better
reconstruction of small-scale and elongated structures.
Easley, Labate and Lim 2008, Kutyniok and Lim 2011), for image inpaint-
ing. For this we consider the following formulation:
min_u ‖Φu‖₁ + Σ_{(i,j)∈I} δ_{{u⋄_{i,j}}}(u_{i,j}),
where
D = {(i, j) : 1 ≤ i ≤ m, 1 ≤ j ≤ n}
is the set of pixel indices of a discrete image of size m × n, and I ⊂ D is the
subset of known pixels of the image u⋄ . After transforming to a saddle-point
problem, the solution of the inpainting problem can be computed using the
PDHG algorithm. It just remains to give the proximal map with respect to
the data term
g(u) = Σ_{(i,j)∈I} δ_{{u⋄_{i,j}}}(u_{i,j}).
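Since this g is a sum of indicator functions of the singletons {u⋄_{i,j}} over the known pixels, its proximal map simply resets the known pixels and leaves the others untouched, independently of the step size τ. A one-line transcription (our own sketch, with `known` a boolean mask of the index set I):

```python
import numpy as np

def prox_inpaint(ut, ud, known):
    # prox_{tau*g} for any tau > 0: enforce u = ud on the known pixels
    return np.where(known, ud, ut)

ud = np.array([[1.0, 2.0], [3.0, 4.0]])
known = np.array([[True, False], [False, True]])
ut = np.full((2, 2), 9.0)
out = prox_inpaint(ut, ud, known)
```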
the so-called jump set, that is, the set of points where the function u is
allowed to jump and Hd−1 is the (d − 1)-dimensional Hausdorff measure
(Ambrosio et al. 2000, Attouch et al. 2014, Evans and Gariepy 1992), which
is, for d = 2, the length of the jump set Su and hence the total length of
edges in u. The main difference between the ROF functional and the MS
functional is as follows. While the ROF functional penalizes discontinuities
proportional to their jump height, the MS functional penalizes disconti-
nuities independently of their jump height and hence allows for better dis-
crimination between smooth and discontinuous parts of the image. We must
stress that the MS functional is very hard to minimize. The reason is that
the jump set Su is not known beforehand and hence the problem becomes
a non-convex optimization problem. Different numerical approaches have
been proposed to find approximate solutions to the Mumford–Shah problem
(Ambrosio and Tortorelli 1992, Chambolle 1999, Chan and Vese 2002, Pock
et al. 2009).
Here we focus on the work by Alberti, Bouchitté and Dal Maso (2003),
who proposed a method called the calibration method to characterize global
minimizers of the MS functional. The approach is based on a convex
representation of the MS functional in a three-dimensional space Ω × R,
where the third dimension is given by the value t = u(x). The idea of
the calibration method is to consider the maximum flux of a vector field
ϕ = (ϕx , ϕt ) ∈ C0 (Ω × R; Rd+1 ) through the interface of the subgraph
1_u(x, t) = { 1 if t < u(x),  0 else },   (7.11)
where the inequalities in the definition of K hold for all x ∈ Ω and for all
min_v sup_{ϕ∈K} ∫_{Ω×R} ϕ · Dv.   (7.15)
With this, the discrete version of (7.15) is given by the saddle-point problem
min maxhDv, pi + δC (v) − δK (p),
v p
which can be solved using Algorithm 6. The critical part of the implemen-
tation of the algorithm is the solution of the projection of p onto K:
p̂ = Π_K(p̃) = arg min_{p∈K} (1/2) ‖p − p̃‖²,
which is non-trivial since the set K contains a quadratic number (in fact
r(r + 1)/2) of coupled constraints. In order to solve the projection prob-
lem, we may adopt Dykstra’s algorithm for computing the projection on
the intersection of convex sets (Dykstra 1983). The algorithm performs a
coordinate descent on the dual of the projection problem, which is defined
in the product space of the constraints. In principle, the algorithm proceeds
by sequentially projecting onto the single constraints. The projections onto
the 2-ball constraints can be computed using projection formula (4.23). The
projection to the parabola constraint can be computed by solving a cubic
stereo problem. After interchanging the left and right images we repeated
the experiment. This allowed us to perform a left–right consistency check
and in turn to identify occluded regions. Those pixels are shown in black.
Although the calibration method is able to compute the globally optimal
solution, it is important to point out that this does not come for free. The
associated optimization problem is huge because the range space of the
solution also has to be discretized. In our stereo example, the disparity
image is of size 1835 × 3637 pixels and the number of disparities was 100.
where Ω is the image domain, Per(S; Ω) denotes the perimeter of the set S
in Ω, and w1,2 : Ω → R+ are given non-negative potential functions. This
problem belongs to a general class of minimal surface problems that have
been studied for a long time (see for instance the monograph by Giusti
1984).
The discrete version of this energy is commonly known as the ‘Ising’
model, which represents the interactions between spins in an atomic lattice
and exhibits phase transitions. In computer science, the same kind of energy
has been used to model many segmentation and classification tasks, and
has received a lot of attention since it was understood that it could be
efficiently minimized if represented as a minimum s − t cut problem (Picard
and Ratliff 1975) in an oriented graph (V, E). Here, V denotes a set of
vertices and E denotes the set of edges connecting some of these vertices.
Given two particular vertices, the ‘source’ s and the ‘sink’ t, the s − t
minimum cut problem consists in finding two disjoint sets S ∋ s and T ∋ t
with S ∪ T = V such that the cost of the 'cut' C(S, T) = {(u, v) ∈ E : u ∈ S, v ∈ T} is minimized. The cost of the cut can be determined by simply counting the
number of edges, or by summing a certain weight wuv associated with each
edge (u, v) ∈ E. By the Ford–Fulkerson min-cut/max-flow duality theorem
(see Ahuja, Magnanti and Orlin 1993 for a fairly complete textbook on
these topics), this minimal s − t cut can be computed by finding a maximal
flow through the oriented graph, which can be solved by a polynomial-
time algorithm. In fact, there is a ‘hidden’ convexity in the problem. We
will describe this briefly in the continuous setting; for discrete approaches
to image segmentation we refer to Boykov and Kolmogorov (2004), and the
vast subsequent literature. The min-cut/max-flow duality in the continuous
setting and the analogy with minimal surfaces type problems were first
investigated by Strang (1983) (see also Strang 2010).
We mentioned in the previous section that the total variation (7.2) is also
well defined for characteristic functions of sets, and measures the length of
the boundary (in the domain). This is, in fact, the ‘correct’ way to define
the perimeter of a measurable set, introduced by R. Caccioppoli in the
early 1950s. Ignoring constants, we can replace (7.21) with the following
equivalent variational problem:
min_{S⊆Ω} ∫_Ω |D1_S| + ∫_Ω 1_S(x) w(x) dx,   (7.22)
where for notational simplicity we have set w = w1 − w2 , and 1S is the
characteristic function associated with the set S, that is,
1_S(x) = { 1 if x ∈ S,  0 else }.
The idea is now to replace the binary function 1S : Ω → {0, 1} with a
continuous function u : Ω → [0, 1] such that the problem becomes convex:
min_u ∫_Ω |Du| + ∫_Ω u(x) w(x) dx, such that u(x) ∈ [0, 1] a.e. in Ω.   (7.23)
It turns out that the relaxed formulation is exact in the sense that any
thresholded solution v = 1{u≥s} of the relaxed problem for any s ∈ (0, 1]
is also a global minimizer of the binary problem (Chan, Esedoḡlu and
Nikolova 2006, Chambolle 2005, Chambolle and Darbon 2009). This is a
consequence of the co-area formula (Federer 1969, Giusti 1984, Ambrosio et
al. 2000), which shows that minimizing the total variation of u decomposes
into independent problems on all level sets of the function u.
Interestingly, there is also a close relationship between the segmentation
model (7.23) and the ROF model (7.1). In fact a minimizer of (7.23) is
obtained by minimizing (7.1), with u⋄ = w being the input image, and then
thresholding the solution u at the 0 level (Chambolle 2004a, 2005). Con-
versely, this relationship has also been successfully used to derive efficient
combinatorial algorithms, based on parametric maximal flow approaches
(Gallo, Grigoriadis and Tarjan 1989), to solve the fully discrete ROF model
exactly in polynomial time (Hochbaum 2001, Darbon and Sigelle 2004, Darbon and Sigelle 2006a, Darbon and Sigelle 2006b, Chambolle and Darbon
2012), where the total variation is approximated by a sum of pairwise interactions |u_i − u_j|.
Exploiting the relation between the ROF model and the two-label segmen-
tation model, we can easily solve the segmentation problem by considering
a discrete version of the ROF model. In our setting here, we consider a
discrete image u ∈ Rm×n and a discrete weighting function w ∈ Rm×n . The
discrete model we need to solve is
min_u ‖Du‖_{2,1} + (1/2) ‖u − w‖².
It can be solved by using either Algorithm 8 or Algorithm 5 (applied to the
dual problem). Let u∗ denote the minimizer of the ROF problem. The final
discrete and binary segmentation 1S is given by thresholding u∗ at zero:
(1_S)_{i,j} = { 0 if u∗_{i,j} < 0,  1 else }.
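Putting the pieces together, the whole two-label pipeline fits in a short sketch (our own minimal implementation, in the spirit of the PDHG iteration: D is assumed to be the forward-difference operator of (2.4) with Neumann boundary and ‖D‖² ≤ 8): solve the discrete ROF problem with input w, then threshold at zero as above.

```python
import numpy as np

def D(u):
    # forward differences with Neumann boundary (assumption, as in (2.4))
    du = np.zeros(u.shape + (2,))
    du[:-1, :, 0] = u[1:, :] - u[:-1, :]
    du[:, :-1, 1] = u[:, 1:] - u[:, :-1]
    return du

def D_adj(p):
    # adjoint of D (negative discrete divergence)
    out = np.zeros(p.shape[:2])
    out[:-1, :] -= p[:-1, :, 0]
    out[1:, :] += p[:-1, :, 0]
    out[:, :-1] -= p[:, :-1, 1]
    out[:, 1:] += p[:, :-1, 1]
    return out

def segment(w, iters=500):
    """min_u ||Du||_{2,1} + (1/2)||u - w||^2 via a PDHG iteration,
    followed by thresholding of u* at zero."""
    u = np.zeros_like(w)
    ubar = u.copy()
    p = np.zeros(w.shape + (2,))
    tau = sigma = 1.0 / np.sqrt(8.0)   # tau * sigma * ||D||^2 <= 1
    for _ in range(iters):
        # dual ascent + projection onto the pointwise 2-ball of radius 1
        p += sigma * D(ubar)
        p /= np.maximum(1.0, np.sqrt(np.sum(p ** 2, axis=-1, keepdims=True)))
        # primal step: prox of (1/2)||u - w||^2, then over-relaxation
        u_old = u
        u = (u - tau * D_adj(p) + tau * w) / (1.0 + tau)
        ubar = 2.0 * u - u_old
    return (u >= 0).astype(int)        # the binary labelling 1_S

w = np.full((8, 8), 5.0)
w[2:6, 2:6] = -5.0                     # strongly negative weights inside a block
seg = segment(w)
```

On this toy weighting the thresholded solution is exactly binary by construction, 0 inside the block and 1 outside.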
w_{i,j} = − log( G_f(u⋄_{i,j}; μ_f, Σ_f, α_f) / G_b(u⋄_{i,j}; μ_b, Σ_b, α_b) ),
for l = 1, . . . , L and each foreground pixel (i, j) ∈ fg, and similarly for
(π b )i,j,l , (i, j) ∈ bg (here fg, bg ⊂ {1, . . . , n} × {1, . . . , m} denote the set
of foreground and background pixels, respectively).
After solving the segmentation problem, the Gaussian mixture models can
be re-computed and the segmentation can be refined.
Figure 7.10. Interactive image segmentation using the continuous two-label image
segmentation model. (a) Input image overlaid with the initial segmentation pro-
vided by the user. (b) The weighting function w, computed using the negative
log-ratio of two Gaussian mixture models fitted to the initial segments. (c) Binary
solution of the segmentation problem, and (d) the result of performing background
removal.
This model can be interpreted as the continuous version of the ‘Potts’ model
that has also been proposed in statistical mechanics to model the interac-
tions of spins on a crystalline lattice. It is also widely used as a smoothness
term in graphical models for computer vision, and can be minimized (ap-
proximately) by specialized combinatorial optimization algorithms such as
those proposed by Boykov et al. (2001).
The continuous Potts model (7.25) is also closely related to the seminal
Mumford–Shah model (Mumford and Shah 1989), where the smooth ap-
denotes the (K − 1)-dimensional unit simplex, and the vectorial total vari-
ation is given by
∫_Ω |Du|_P = sup{ −∫_Ω u(x) · div ϕ(x) dx : ϕ ∈ C^∞(Ω; R^{d×K}), ϕ(x) ∈ C_P, for all x ∈ Ω },
where CP is a convex set, for which various choices can be made. If the
convex set is given by
C_P1 = { ξ = (ξ_1, . . . , ξ_K) ∈ R^{d×K} : |ξ_k|_2 ≤ 1/2, for all k },
the vectorial total variation is simply the sum of the total variations of the
single channels (Zach, Gallup, Frahm and Niethammer 2008). Chambolle,
Cremers and Pock (2012) have shown that a strictly larger convex function
is obtained by means of the so-called paired calibration (Lawlor and Morgan
1994, Brakke 1995). In this case, the convex set is given by
C_P2 = { ξ = (ξ_1, . . . , ξ_K) ∈ R^{d×K} : |ξ_k − ξ_l|_2 ≤ 1, for all k ≠ l },
which has a more complicated structure than CP1 but improves the convex
relaxation. See Figure 7.11 for a comparison. Note that unlike in the
two-phase case, the relaxation is not exact. Thresholding or rounding a
Figure 7.11. Demonstration of the quality using different relaxations. (a) Input
image, where the task is to compute a partition of the grey zone in the middle of the
image using the three colours as boundary constraints. (b) Colour-coded solution
using the simple relaxation CP1 , and (c) result using the stronger relaxation CP2 .
Observe that the stronger relaxation exactly recovers the true solution, which is a
triple junction.
and the vectorial total variation that is intended to measure half the length
of the total boundaries is given by
‖Du‖_{2,P} = sup_P ⟨Du, P⟩, such that P_{i,j} ∈ C_P for all i, j,
particular choice of the set. If we choose the weaker set CP1 the projection
reduces to K independent projections onto the 2-ball with radius 1/2. If
we choose the stronger relaxation CP2 , no closed-form solution is available
to compute the projection. A natural approach is to implement Dykstra’s
iterative projection method (Dykstra 1983), as CP2 is the intersection of sim-
ple convex sets on which a projection is straightforward. Another efficient
possibility would be to introduce Lagrange multipliers for the constraints
defining this set, but in a progressive way as they get violated. Indeed,
in practice, it turns out that few of these constraints are actually active,
in general no more than two or three, and only in a neighbourhood of the
boundary of the segmentation.
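Dykstra's method differs from plain alternating projections by carrying a correction term per set, and this matters: for a point outside the intersection, alternating projections returns *some* point of the intersection, whereas Dykstra returns its projection. A small self-contained sketch on two half-planes (our own example, chosen so the exact projection can be computed by hand for comparison):

```python
import numpy as np

def proj_halfplane(x, a, b):
    # projection onto the half-plane {x : <a, x> <= b}
    viol = max(0.0, float(a @ x - b) / float(a @ a))
    return x - viol * a

def dykstra(z, projs, iters=100):
    """Dykstra's algorithm for projecting z onto an intersection of
    convex sets, given the individual projection operators."""
    x = z.copy()
    corr = [np.zeros_like(z) for _ in projs]   # one correction term per set
    for _ in range(iters):
        for i, proj in enumerate(projs):
            y = proj(x + corr[i])
            corr[i] = x + corr[i] - y
            x = y
    return x

z = np.array([3.0, 0.0])
projs = [lambda x: proj_halfplane(x, np.array([1.0, 0.0]), 1.0),   # x1 <= 1
         lambda x: proj_halfplane(x, np.array([1.0, 1.0]), 0.0)]   # x1 + x2 <= 0
xhat = dykstra(z, projs)
```

By the KKT conditions, the exact projection of (3, 0) onto the intersection is (1, −1), which Dykstra recovers.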
Figure 7.12 shows the application of interactive multilabel image segmentation using four phases. We again use the user input to specify the desired regions, and we fit Gaussian mixture models (7.24), G_k(·; μ_k, Σ_k, α_k), k = 1, . . . , K, with 10 components to those initial regions. The weight functions w_k are computed using the negative log probability of the respective mixture models, that is,
wi,j,k = − log Gk (u⋄i,j ; µk , Σk , αk ), k = 1, . . . , K.
It can be observed that the computed phases uk are almost binary, which
indicates that the computed solution is close to a globally optimal solution.
7.11. Curvature
Using curvature information in imaging is mainly motivated by findings
in psychology that so-called subjective (missing) object boundaries that
are seen by humans are linear or curvilinear (Kanizsa 1979). Hence such
boundaries can be well recovered by minimizing the ‘elastica functional’
$$\int_\gamma (\alpha + \beta \kappa^2)\, \mathrm{d}\gamma, \tag{7.29}$$
Figure 7.12. Interactive image segmentation using the multilabel Potts model.
(a) Input image overlaid with the initial segmentation provided by the user. (b) Fi-
nal segmentation, where the colour values correspond to the average colours of the
segments. (c–f) The corresponding phases uk . Observe that the phases are close
to binary and hence the algorithm was able to find an almost optimal solution.
image:
$$\int_\Omega |\nabla u| \left( \alpha + \beta \left( \operatorname{div} \frac{\nabla u}{|\nabla u|} \right)^2 \right) \mathrm{d}x. \tag{7.30}$$
Here, $\operatorname{div}(\nabla u/|\nabla u|) = \kappa_{\{u=u(x)\}}(x)$ represents the curvature of the level
line/surface of u passing through x, and thanks to the co-area formula this
112 A. Chambolle and T. Pock
In the lifted space a new regularization term can be defined that penalizes
curvature information. Such a regularizer – called total vertex regularization
(TVX) – is given by
$$\sup_{\psi(x,\cdot) \in B_\rho} \int_{\Omega \times S^1} D_x \psi(x, \vartheta) \cdot \vartheta \, \mathrm{d}\mu(x, \vartheta), \tag{7.31}$$
and which generalizes the L1 -norm to measures (Evans and Gariepy 1992).
This enforces sparsity of the lifted measure µ. In practice, it turns out that a
combination of total variation regularization and total vertex regularization
performs best. An image restoration model combining both total variation
and total vertex regularization is given by
$$\min_{(u,\mu)} \ \alpha \sup_{\psi(x,\cdot)\in B_\rho} \int_{\Omega\times S^1} D_x\psi(x,\vartheta)\cdot\vartheta \,\mathrm{d}\mu(x,\vartheta) + \beta\|\mu\|_{\mathcal{M}} + \frac{1}{2}\|u - u^\diamond\|^2,$$
$$\text{such that } (u,\mu) \in \mathcal{L}^\mu_{Du} = \{(u,\mu) \mid \mu \text{ is the lifting of } Du\}, \tag{7.35}$$
where α and β are tuning parameters. Clearly, the constraint that µ is
a lifting of Du represents a non-convex constraint. A convex relaxation
of this constraint is obtained by replacing LµDu with the following convex
constraint:
$$\mathcal{L}^\mu_{Du} = \left\{ (u,\mu) \ \middle|\ \mu \ge 0, \ \int_\Omega \varphi \cdot \mathrm{d}Du^\perp = \int_{\Omega\times S^1} \varphi(x)\cdot\vartheta \,\mathrm{d}\mu(x,\vartheta) \right\}, \tag{7.36}$$
for all smooth test functions ϕ that are compactly supported on Ω. With
Figure 7.13. A 16-neighbourhood system on the grid. The black dots refer to the
grid points xi,j , the shaded squares represent the image pixels Ωi,j , and the line
segments li,j,k connecting the grid points are depicted by thick lines.
this, the problem becomes convex and can be solved. However, it remains
unclear how close minimizers of the relaxed problem are to minimizers of
the original problem.
It turns out that the total vertex regularization functional works best
for inpainting tasks, since it tries to connect level lines in the image with
curves with a small number of corners or small curvature. Tackling the
TVX models numerically is not an easy task because the lifted measure
is expected to concentrate on line-like structures in the roto-translational
space. Let us assume our image is defined on a rectangular domain Ω =
[0, n) × [0, m). On this domain we consider a collection of square pixels
$\{\Omega_{i,j}\}_{i=1,j=1}^{m,n}$ with $\Omega_{i,j} = [j-1, j) \times [i-1, i)$, such that $\Omega = \bigcup_{i=1,j=1}^{m,n} \Omega_{i,j}$.
Furthermore, we consider a collection of grid points $\{x_{i,j}\}_{i=1,j=1}^{m,n}$ with $x_{i,j} = (j, i)$,
such that the grid points $x_{i,j}$ are located on the lower right corners of
the corresponding image pixels $\Omega_{i,j}$. Using the collection of image pixels,
we consider a piecewise constant image
$$u \in \{u : \Omega \to \mathbb{R} : u(x) = U_{i,j} \text{ for all } x \in \Omega_{i,j}\},$$
where $U \in \mathbb{R}^{m\times n}$ is the discrete version of the continuous image u.
Following Bredies et al. (2015b), we use a neighbourhood system based
on a set of o distinct displacement vectors δk = (δk1 , δk2 ) ∈ Z2 . On a regular
grid, it is natural to define a system consisting of 4, 8, 16, 32, etc. neighbours.
Figure 7.13 depicts an example based on a neighbourhood system of 16
neighbours. The displacement vectors naturally imply orientations ϑk ∈ S1 ,
defined by ϑk = δk /|δk |2 . We shall assume that the displacement vectors δk
are ordered such that the corresponding orientations ϑk are ordered on S1 .
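The construction of the displacement vectors and their ordered orientations can be sketched as follows. This is an illustrative simplification: the selection rule via coprime coordinates is an assumption of this sketch that happens to reproduce the 8- and 16-neighbourhood counts:

```python
import math

def neighbourhood(radius=2):
    """Displacement vectors of a grid neighbourhood system, sorted by angle.

    radius=1 gives an 8-neighbourhood, radius=2 the 16-neighbourhood of
    Figure 7.13: all integer displacements with coprime coordinates in the
    (2*radius+1)^2 window, so every orientation appears exactly once.
    """
    deltas = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if (dx, dy) != (0, 0) and math.gcd(abs(dx), abs(dy)) == 1:
                deltas.append((dx, dy))
    # sort displacements by their orientation on S^1
    deltas.sort(key=lambda d: math.atan2(d[1], d[0]))
    return deltas

def orientation(delta):
    """Unit vector theta_k = delta_k / |delta_k|_2 for a displacement."""
    n = math.hypot(delta[0], delta[1])
    return (delta[0] / n, delta[1] / n)
```

Sorting by `atan2` enforces the ordering of the orientations on $S^1$ assumed in the text.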
Next, we consider a collection of line segments $\{l_{i,j,k}\}_{i=1,j=1,k=1}^{m,n,o}$, where
the line segments $l_{i,j,k} = [x_{i,j}, x_{i,j} + \delta_k]$ connect the grid points $x_{i,j}$ to a
collection of neighbouring grid points $x_{\hat\imath,\hat\jmath}$, as defined by the neighbourhood
Optimization for imaging 115
$$= \sum_{\hat\imath,\hat\jmath} \sum_{k=1}^{o} V_{\hat\imath,\hat\jmath,k} \int_{l_{\hat\imath,\hat\jmath,k}} \vartheta_k^\perp \cdot \varphi_{i,j}(x) \,\mathrm{d}x \quad\Longleftrightarrow\quad DU = CV.$$
(b) TVX0, λ = 1/2; (c) TVX0, λ = 1/4; (d) TVX0, λ = 1/8; (e) TVX0, λ = 1/16;
(f) TVX1, λ = 1/2; (g) TVX1, λ = 1/4; (h) TVX1, λ = 1/8; (i) TVX1, λ = 1/16.
Figure 7.14. Comparison of TVX0 (b–e) and TVX1 (f–i) regularization for shape
denoising. One can see that TVX0 leads to a gradually simplified polygonal approximation
of the shape in U, whereas TVX1 leads to an approximation by piecewise
smooth shapes.
discrete orientations. In Figure 7.14 we show the results of TVX1 and TVX0
regularization using different weights λ in the data-fitting term. It can be
seen that TVX0 minimizes the number of corners of the shape in U and
hence leads to a gradually simplified polygonal approximation of the origi-
nal shape. TVX1 minimizes the total curvature of the shape in U and hence
leads to a piecewise smooth approximation of the shape.
In Figure 7.15 we provide a visualization of the measure µ in the roto-
translation space for the image shown in Figure 7.14(e), obtained using
TVX0 regularization. One can observe that in our discrete approximation
the measure µ nicely concentrates on thin lines in the roto-translation space.
In our second experiment we consider image inpainting. For this we
choose
$$g(U) = \sum_{(i,j)\in I} \delta_{\{U^\diamond_{i,j}\}}(U_{i,j}),$$
where U ⋄ ∈ Rm×n is a given image and I defines the set of indices for
which pixel information is available. Figure 7.16 shows the image inpainting
Figure 7.15. Visualization of the measure µ in the roto-translation space for the
image of Figure 7.14(e), obtained using TVX0 regularization. Observe that the
measure µ indeed concentrates on thin lines in this space.
results, where we have used the same test image as in Figure 7.7. The
parameters α, β of the TVX model were set to α = 0.01, β = 1, and we
used o = 32 discrete orientations. The parameter α is used to control the
amount of total variation regularization while the parameter β is used to
control the amount of curvature regularization. We tested two different
kinds of missing pixel information. In the experiment shown on the left we
randomly threw away 90% of the image pixels, whereas in the experiment
shown on the right we skipped 80% of entire rows of the image. From the
results one can see that the TVX1 models can faithfully reconstruct the
missing image information even if there are large gaps.
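Since the inpainting data term is an indicator of the known pixel values, its proximal map simply resets the known pixels and leaves the missing ones to the regularizer; a minimal sketch on a flattened image, with hypothetical names:

```python
def prox_inpainting(U, U_known, mask):
    """Proximal map of the inpainting indicator data term.

    The prox of g(U) = sum over known pixels of delta_{U_known[i]}(U[i])
    resets the known pixels to their given values and leaves the
    remaining (missing) pixels untouched, independently of the step size.
    """
    return [uk if m else u for u, uk, m in zip(U, U_known, mask)]
```

With 90% of the mask set to False, the regularizer alone determines 90% of the result, which is why curvature-aware regularization matters so much here.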
In our third experiment we apply the TVX1 regularizer for image denois-
ing in the presence of salt-and-pepper noise. Following the classical TV-ℓ1
model, we used a data term based on the ℓ1 -norm: g(U ) = λkU − U ⋄ k1 ,
where U ⋄ is the given noisy image. We applied the TVX1 model to the
same test image as used in Figure 2.3. In this experiment the parameters
for the regularizer were set to α = 0.01, β = 1. For the data term we used
λ = 0.25, and we used o = 32 discrete orientations. Figure 7.17 shows the
results obtained by minimizing the TVX1 model. The result shows that the
TVX1 model performs particularly well at preserving thin and elongated
structures (e.g. on the glass pyramid). The main reason why these models
work so well when applied to salt-and-pepper denoising is that the problem
is actually very close to inpainting, for which curvature minimizing models
were originally developed.
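The proximal map of the TV-ℓ1 data term has a simple closed form: pointwise soft-shrinkage of u towards the noisy image. A minimal sketch with illustrative names:

```python
def prox_l1_data(u, u_noisy, tau_lam):
    """Proximal map of u -> lam * ||u - u_noisy||_1 with step tau.

    Each pixel is soft-shrunk towards the corresponding noisy value by
    the amount tau*lam; pixels within tau*lam of the data snap onto it.
    """
    out = []
    for x, x0 in zip(u, u_noisy):
        d = x - x0
        if d > tau_lam:
            d -= tau_lam
        elif d < -tau_lam:
            d += tau_lam
        else:
            d = 0.0
        out.append(x0 + d)
    return out
```

The snapping behaviour is what lets the ℓ1 data term keep uncorrupted pixels exactly while ignoring gross outliers, the mechanism the text alludes to.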
Figure 7.16. Image inpainting using TVX1 regularization. (a,c,e) Input image with
90% missing pixels and recovered solutions. (b,d,f) Input image with 80% missing
lines and recovered solutions.
Figure 7.17. Denoising an image containing salt-and-pepper noise. (a) Noisy image
degraded by 20% salt-and-pepper noise. (b) Denoised image using TVX1 regular-
ization. Note the significant improvement over the result of the TV-ℓ1 model,
shown in Figure 2.3.
$$\min_{X,D} \ \lambda\|X\|_1 + \frac{1}{2}\|DX - P\|_2^2,$$
22
A more reasonable approach would of course be to learn the dictionary on a set of
representative images (excluding the test image). Although we learn the dictionary on
the patches of the original image, observe that we are still far from obtaining a perfect
reconstruction. On one hand the number of dictionary atoms (81) is relatively small
compared to the number of patches (136 500), and on the other hand the regularization
parameter also prevents overfitting.
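For a fixed dictionary, the sparse-coding subproblem of the patch-based Lasso model above is a standard Lasso and can be solved, for instance, by ISTA (forward–backward splitting). This tiny dense sketch is illustrative only; it is not the inertial PALM scheme actually used in the experiments, and all names are hypothetical:

```python
def soft(v, t):
    """Scalar soft-thresholding, the prox of t*|.|"""
    return v - t if v > t else (v + t if v < -t else 0.0)

def ista_lasso(D, p, lam, tau, iters=100):
    """ISTA for the Lasso  min_x  lam*||x||_1 + 0.5*||D x - p||^2.

    D is an m x n matrix given as a list of rows; tau is a step size that
    must satisfy tau <= 1/||D||^2 to guarantee convergence.
    """
    m, n = len(D), len(D[0])
    x = [0.0] * n
    for _ in range(iters):
        # residual r = D x - p and gradient g = D^T r
        r = [sum(D[i][j] * x[j] for j in range(n)) - p[i] for i in range(m)]
        g = [sum(D[i][j] * r[i] for i in range(m)) for j in range(n)]
        # forward-backward step: gradient descent then soft-thresholding
        x = [soft(x[j] - tau * g[j], tau * lam) for j in range(n)]
    return x
```

With the identity as dictionary, the solution is exactly the soft-thresholded data, which gives a quick sanity check.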
Figure 7.18. Image denoising using a patch-based Lasso model. (a) Original image,
and (b) its noisy variant, where additive Gaussian noise with standard deviation
0.1 has been added. (c) Learned dictionary containing 81 atoms with patch size
9 × 9, and (d) final denoised image.
Figure 7.19. Image denoising using the convolutional Lasso model. (a) The 81
convolution filters of size 9 × 9 that have been learned on the original image.
(b) Denoised image obtained by minimizing the convolutional Lasso model.
neural networks (CNNs), which have been shown to perform extremely well
on large-scale image classification tasks (Krizhevsky et al. 2012).
For learning the filters di , we minimize the convolutional Lasso problem
(7.38) with respect to both the filters di and the coefficient images vi . Some
care has to be taken to avoid a trivial solution. Therefore we fix the first filter
kernel to be a Gaussian filter and fix the corresponding coefficient image to
be the input image u⋄ . Hence, the problem is equivalent to learning the
dictionary only for the high-frequency filtered image ũ = u⋄ − g ∗ u⋄ , where
g ∈ Rl×l is a Gaussian filter with standard deviation σ = l.
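The high-frequency prefiltering step ũ = u⋄ − g ∗ u⋄ can be sketched as follows in 1D (the text uses a 2D Gaussian of size l × l with σ = l; the boundary handling and names here are illustrative assumptions):

```python
import math

def gaussian_kernel(l):
    """Normalized 1D Gaussian filter of length l with sigma = l, mirroring
    the text's convention of fixing the first filter to such a Gaussian."""
    c = (l - 1) / 2.0
    g = [math.exp(-0.5 * ((i - c) / l) ** 2) for i in range(l)]
    s = sum(g)
    return [v / s for v in g]

def highpass(u, l=9):
    """Return u - g * u (same-size convolution, edges clamped): the
    high-frequency residual on which the remaining filters are learned."""
    g = gaussian_kernel(l)
    c = (l - 1) // 2
    n = len(u)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(g):
            j = min(max(i + k - c, 0), n - 1)  # clamp indices at the boundary
            acc += w * u[j]
        out.append(u[i] - acc)
    return out
```

Because the kernel is normalized, a constant image is mapped to zero: the residual really contains only the high-frequency content.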
To minimize (7.38) in vi and di , we again use the inertial variant of the
PALM algorithm. We used k = 81 filters of size l = 9 and the first filter was
set to a Gaussian filter of the same size. The regularization parameter λ was
set to λ = 0.2. Figure 7.19(a) shows the filters we have learned on the clean
image shown in Figure 7.18(a). Comparing the learned convolution filters
to the dictionary of the patch-based Lasso problem, one can see that the
learned filters contain Gabor-like structures (Hubel and Wiesel 1959) but
also more complex structures, which is a known effect caused by the induced
shift invariance (Hashimoto and Kurata 2000). We then also applied the
convolutional Lasso model to a noisy variant of the original image, and the
result is shown in Figure 7.19(b). From the PSNR values, one can see that
the convolutional Lasso model leads to a slightly better result.
effect (the bias b is replaced with cyi wd+1 for some constant c of the order
of the norm of the samples). This smoothing makes the problem strongly
convex, hence slightly easier to solve, as one can use Algorithm 8. An
additional acceleration trick consists in starting the optimization with a
small number of samples and periodically adding to the problem a fraction
of the worst classified samples. As it is well known (and desirable) that only
a small proportion of the samples should be really useful for classification
(the ‘support vectors’ which bound the margin), it is expected, and actually
observed, that the size of the problems can remain quite small with this
strategy.
An extension of the SVM to non-linear classifiers can be achieved by
applying the kernel trick (Aı̆zerman, Braverman and Rozonoèr 1964) to
the hyperplane, which lifts the linear classifier to a new feature space of
arbitrary (even infinite) dimension (Vapnik 2000).
To illustrate this method, we have tried to learn a classifier on the 60 000
digits of the MNIST23 database (LeCun, Bottou, Bengio and Haffner 1998a):
see Figure 7.20. Whereas it is known that a kernel SVM can achieve good
performance on this dataset (see the results reported on the web page of
the project) it is computationally quite expensive, and we have tried here to
incorporate non-linearities in a simpler way. To start with, it is well known
that training a linear SVM directly on the MNIST data (which consists of
small 28 × 28 images) does not lead to good results. To improve the per-
formance, we trained the 400-component dictionary shown in Figure 7.20,
using the model in Section 7.12, and then computed the coefficients $(c_i)_{i=1}^{400}$
of each MNIST digit on this dictionary using the Lasso problem. This rep-
resents a fairly large computation and may take several hours on a standard
computer.
Then we trained the SVMs on feature vectors of the form, for each digit,
$(\tilde c, (c_i)_{i=1}^{400}, (c_i^2)_{i=1}^{400})$ (in dimension 801), where $\tilde c$ is the constant which maps
all vectors in a hyperplane ‘far’ from the origin, as explained above, and
the additional $(c_i^2)_{i=1}^{400}$ represent a non-linear lifting which slightly boosts
the separability of the vectors. This mimics a non-linear kernel SVM with
a simple isotropic polynomial kernel.
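The feature construction just described can be sketched as follows (the value of the constant c̃ and all names are illustrative assumptions; the text does not specify them):

```python
def lift_features(c, c_tilde=10.0):
    """Build the lifted feature vector (c_tilde, c, c^2) for a linear SVM.

    The constant c_tilde pushes all samples onto a hyperplane away from
    the origin (absorbing the bias into the weight vector), and the
    squared coefficients add a mild non-linearity mimicking an isotropic
    polynomial kernel.
    """
    return [c_tilde] + list(c) + [v * v for v in c]
```

For 400 Lasso coefficients this yields the 801-dimensional vectors mentioned in the text.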
The technique we have employed here is a standard ‘one-versus-one’ clas-
sification approach, which proved slightly more efficient than, for instance,
training an SVM to separate each digit from the rest. It consists in training
45 vectors wi,j , 0 ≤ i < j ≤ 9, each separating the training subset of digits i
from the digits j (in this case, in particular, each learning problem remains
quite small).
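The one-versus-one voting rule can be sketched as follows (the container layout is a hypothetical choice; each pairwise classifier votes for one of its two classes and the majority wins):

```python
def one_vs_one_predict(x, classifiers, n_classes=10):
    """Majority vote over all pairwise linear classifiers.

    classifiers[(i, j)] is the weight vector w_ij separating class i
    (positive side of the hyperplane) from class j (negative side);
    the class collecting the most votes is returned.
    """
    votes = [0] * n_classes
    for (i, j), w in classifiers.items():
        score = sum(wi * xi for wi, xi in zip(w, x))
        votes[i if score > 0 else j] += 1
    return max(range(n_classes), key=lambda k: votes[k])
```

For 10 digit classes this loops over the 45 trained vectors w_ij with 0 ≤ i < j ≤ 9.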
Then, to classify a new digit, we have counted how often it is classified
as ‘i’ or ‘j’ by wi,j (which is simply testing whether $\langle w_{i,j}, x\rangle$ is positive or
23
http://yann.lecun.com/exdb/mnist
Figure 7.22. Inverting a convolutional neural network. (a) Original image used
to compute the initial feature vector φ⋄ . (b) Image recovered from the non-linear
deconvolution problem. Due to the high degree of invariances of the CNN with
respect to scale and spatial position, the recovered image contains structures from
the same object class, but the image looks very different.
Acknowledgements
The authors benefit from support of the ANR and FWF via the ‘EANOI’
(Efficient Algorithms for Nonsmooth Optimization in Imaging) joint project,
FWF no. I1148 / ANR-12-IS01-0003. Thomas Pock also acknowledges the
support of the Austrian Science Fund (FWF) under the START project
BIVISION, no. Y729, and the European Research Council under the Hori-
zon 2020 program, ERC starting grant ‘HOMOVIS’, no. 640156. Antonin
Chambolle also benefits from support of the ‘Programme Gaspard Monge
pour l’Optimisation et la Recherche Opérationnelle’ (PGMO), through the
‘MAORI’ group, as well as the ‘GdR MIA’ of the CNRS. He also warmly
thanks Churchill College and DAMTP, Centre for Mathematical Sciences,
University of Cambridge, for their hospitality, with the support of the French
Embassy in the UK. Finally, the authors are very grateful to Yunjin Chen,
Jalal Fadili, Yura Malitsky, Peter Ochs and Glennis Starling for their com-
ments and their careful reading of the manuscript.
Theorem A.1. Let $x \in \mathcal{X}$, $0 < \theta < 1$, and assume $\mathcal{F} \neq \emptyset$. Then $(T_\theta^k x)_{k\ge 1}$
weakly converges to some point $x^* \in \mathcal{F}$.
Proof. Throughout this proof let $x^k = T_\theta^k x$ for each $k \ge 0$.
Step 1. The first observation is that since $T_\theta$ is also a weak contraction, the
sequence $(\|x^k - x^*\|)_k$ is non-increasing for any $x^* \in \mathcal{F}$ (which is also the set
of fixed points of $T_\theta$). The sequence $(x^k)_k$ is said to be Fejér-monotone with
respect to $\mathcal{F}$, which yields a lot of interesting consequences; see Bauschke
and Combettes (2011, Chapter 5) for details. It follows that for any $x^* \in \mathcal{F}$,
one can define $m(x^*) := \inf_k \|x^k - x^*\| = \lim_k \|x^k - x^*\|$. If there exists $x^*$
such that $m(x^*) = 0$, then the theorem is proved, as $x^k$ converges strongly
to $x^*$.
Step 2. If not, let us show that we still obtain $T_\theta x^k - x^k = x^{k+1} - x^k \to 0$.
An operator which satisfies this property is said to be asymptotically regular
(Browder and Petryshyn 1966). We will use the following result, which is
standard, and in fact gives a hint that this proof can be extended to more
general spaces with uniformly convex norms.
Lemma A.2. For all $\varepsilon > 0$, $\theta \in (0,1)$, there exists $\delta > 0$ such that, for all
$x, y \in \mathcal{X}$ with $\|x\|, \|y\| \le 1$ and $\|x - y\| \ge \varepsilon$,
$$\|\theta x + (1-\theta)y\| \le (1-\delta)\max\{\|x\|, \|y\|\}.$$
This follows from the strong convexity of x 7→ kxk2 (i.e. the parallelogram
identity), and we leave the proof to the reader.
Now assume that along a subsequence, we have $\|x^{k_l+1} - x^{k_l}\| \ge \varepsilon > 0$.
Observe that
$$x^{k_l+1} - x^* = \theta(x^{k_l} - x^*) + (1-\theta)(T_0 x^{k_l} - x^*)$$
and that
$$(x^{k_l} - x^*) - (T_0 x^{k_l} - x^*) = x^{k_l} - T_0 x^{k_l} = -\frac{1}{1-\theta}(x^{k_l+1} - x^{k_l}),$$
so that
$$\|(x^{k_l} - x^*) - (T_0 x^{k_l} - x^*)\| \ge \varepsilon/(1-\theta) > 0.$$
Hence we can invoke the lemma (remember that $(x^k - x^*)_k$ is globally
bounded since its norm is non-increasing), and we obtain that, for some
$\delta > 0$,
$$m(x^*) \le \|x^{k_l+1} - x^*\| \le (1-\delta)\max\{\|x^{k_l} - x^*\|, \|T_0 x^{k_l} - x^*\|\},$$
but since $\|T_0 x^{k_l} - x^*\| \le \|x^{k_l} - x^*\|$, it follows that
$$m(x^*) \le (1-\delta)\|x^{k_l} - x^*\|.$$
As $k_l \to \infty$, we get a contradiction if $m(x^*) > 0$.
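The iteration analysed here, $x^{k+1} = T_\theta x^k$ with $T_\theta = \theta I + (1-\theta)T_0$, can be illustrated numerically. The rotation map below is a stand-in nonexpansive $T_0$ chosen for this demo (an assumption, not taken from the text): its plain iterates cycle, but the averaged iterates converge to the fixed point 0 and the steps $x^{k+1} - x^k$ vanish, as in Step 2:

```python
import math

def averaged_iteration(T0, x, theta=0.5, iters=200):
    """Krasnoselskii-Mann iteration x^{k+1} = theta*x^k + (1-theta)*T0(x^k)
    for a nonexpansive map T0; returns the final iterate and the step
    norms ||x^{k+1} - x^k||, which exhibit asymptotic regularity."""
    steps = []
    for _ in range(iters):
        y = T0(x)
        x_new = [theta * a + (1 - theta) * b for a, b in zip(x, y)]
        steps.append(math.dist(x_new, x))
        x = x_new
    return x, steps
```

With `T0 = lambda p: [-p[1], p[0]]` (rotation by 90 degrees, an isometry with unique fixed point 0), the averaged iterates contract to the origin even though the plain iterates of T0 never converge.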
Step 3. Assume now that x̄ is the weak limit of some subsequence (xkl )l .
Then we claim it is a fixed point. An easy way to see it is to use Minty’s
trick (Brézis 1973) and the fact that I −Tθ is a monotone operator. Another
is to use Opial’s lemma.
Lemma A.3 (Opial 1967, Lemma 1). If the sequence (xn )n is weakly
convergent to x0 in a Hilbert space X , then, for any x 6= x0 ,
$$\liminf_n \|x_n - x\| > \liminf_n \|x_n - x_0\|.$$
The proof in the Hilbert space setting is easy and we leave it to the reader.
Since Tθ is a weak contraction, we observe that for each k,
$$\|x^k - \bar x\|^2 \ge \|T_\theta x^k - T_\theta \bar x\|^2 = \|x^{k+1} - x^k\|^2 + 2\langle x^{k+1} - x^k, x^k - T_\theta \bar x\rangle + \|x^k - T_\theta \bar x\|^2,$$
and we deduce (thanks to Step 2 above)
$$\liminf_l \|x^{k_l} - \bar x\| \ge \liminf_l \|x^{k_l} - T_\theta \bar x\|.$$
25
The proof above can easily be extended to allow for some variation of the averaging
parameter θ. This would yield convergence, for instance, for gradient descent algo-
rithms with varying steps (within some bounds) and many other similar methods.
results in this direction. Finally, we mention that one can improve such results to obtain convergence rates; in particular, Liang et al. (2015a) and Liang,
Fadili, Peyré and Luke (2015b) have recently shown that for some problems
one can get an eventual linear convergence for algorithms based on this type
of iteration.
We deduce both (4.29) (for µ = µf + µg > 0 so that ω < 1) and (4.28) (for
µ = 0 and ω = 1).
Proof of Theorem 4.10. The idea behind the proof of Beck and Teboulle
(2009) is to improve this inequality (4.37) by trying to obtain strict decay of
the term in F in the inequality. The trick is to use (4.37) at a point which
is a convex combination of the previous iterate and an arbitrary point.
If, in (4.37), we replace $x$ with $((t-1)x^k + x)/t$, $\bar x$ with $y^k$ and
$\hat x$ with $x^{k+1} = T_\tau y^k$, where $t \ge 1$ is arbitrary, we find that for any $x$ (after
multiplication by $t^2$),
$$t(t-1)(F(x^k) - F(x)) - \mu\frac{t-1}{2}\|x - x^k\|^2 + (1-\tau\mu_f)\frac{\|(t-1)x^k + x - t y^k\|^2}{2\tau}$$
$$\ge t^2(F(x^{k+1}) - F(x)) + (1+\tau\mu_g)\frac{\|(t-1)x^k + x - t x^{k+1}\|^2}{2\tau}. \tag{B.1}$$
Then we observe that
$$-\mu\frac{t-1}{2}\|x - x^k\|^2 + (1-\tau\mu_f)\frac{\|x - x^k + t(x^k - y^k)\|^2}{2\tau}$$
$$= (1 - \tau\mu_f - \mu\tau(t-1))\frac{\|x - x^k\|^2}{2\tau} + t\,\frac{1-\tau\mu_f}{\tau}\langle x - x^k, x^k - y^k\rangle + t^2(1-\tau\mu_f)\frac{\|x^k - y^k\|^2}{2\tau}$$
$$= \frac{1+\tau\mu_g - t\mu\tau}{2\tau}\Big\|x - x^k + t\frac{1-\tau\mu_f}{1+\tau\mu_g - t\mu\tau}(x^k - y^k)\Big\|^2 + t^2(1-\tau\mu_f)\Big(1 - \frac{1-\tau\mu_f}{1+\tau\mu_g - t\mu\tau}\Big)\frac{\|x^k - y^k\|^2}{2\tau}$$
$$= \frac{1+\tau\mu_g - t\mu\tau}{2\tau}\Big\|x - x^k + t\frac{1-\tau\mu_f}{1+\tau\mu_g - t\mu\tau}(x^k - y^k)\Big\|^2 - t^2(t-1)\frac{\tau\mu(1-\tau\mu_f)}{1+\tau\mu_g - t\mu\tau}\,\frac{\|x^k - y^k\|^2}{2\tau}.$$
It follows that, for any $x \in \mathcal{X}$,
$$t(t-1)(F(x^k) - F(x)) + (1+\tau\mu_g - t\mu\tau)\frac{\big\|x - x^k - t\frac{1-\tau\mu_f}{1+\tau\mu_g - t\mu\tau}(y^k - x^k)\big\|^2}{2\tau}$$
$$\ge t^2(F(x^{k+1}) - F(x)) + (1+\tau\mu_g)\frac{\|x - x^{k+1} - (t-1)(x^{k+1} - x^k)\|^2}{2\tau} + t^2(t-1)\frac{\tau\mu(1-\tau\mu_f)}{1+\tau\mu_g - t\mu\tau}\,\frac{\|x^k - y^k\|^2}{2\tau}. \tag{B.2}$$
We let $t = t_{k+1}$ above. Then we can get a useful recursion if we let
$$\omega_k = \frac{1+\tau\mu_g - t_{k+1}\mu\tau}{1+\tau\mu_g} = 1 - t_{k+1}\frac{\mu\tau}{1+\tau\mu_g} \in [0,1], \tag{B.3}$$
$$t_{k+1}(t_{k+1} - 1) \le \omega_k t_k^2, \tag{B.4}$$
$$\beta_k = \frac{t_k - 1}{t_{k+1}}\,\frac{1+\tau\mu_g - t_{k+1}\mu\tau}{1-\tau\mu_f} = \frac{t_k - 1}{t_{k+1}}\,\omega_k\frac{1+\tau\mu_g}{1-\tau\mu_f}, \tag{B.5}$$
$$y^k = x^k + \beta_k(x^k - x^{k-1}). \tag{B.6}$$
Denoting $\alpha_k = 1/t_k$ and
$$q = \frac{\tau\mu}{1+\tau\mu_g} = \frac{\tau\mu_f + \tau\mu_g}{1+\tau\mu_g} < 1,$$
we easily check that these rules are precisely the same as in Nesterov (2004,
formula (2.2.9), p. 80), with the minor difference that in our case the choice
t0 = 0, t1 = 1 is admissible26 and there is a shift in the numbering of the
sequences (x^k), (y^k). In this case we find
$$t_{k+1}^2(F(x^{k+1}) - F(x)) + \frac{1+\tau\mu_g}{2\tau}\|x - x^{k+1} - (t_{k+1}-1)(x^{k+1} - x^k)\|^2$$
$$\le \omega_k\Big(t_k^2(F(x^k) - F(x)) + \frac{1+\tau\mu_g}{2\tau}\|x - x^k - (t_k-1)(x^k - x^{k-1})\|^2\Big),$$
so that
$$t_k^2(F(x^k) - F(x)) + \frac{1+\tau\mu_g}{2\tau}\|x - x^k - (t_k-1)(x^k - x^{k-1})\|^2$$
$$\le \Big(\prod_{n=0}^{k-1}\omega_n\Big)\Big(t_0^2(F(x^0) - F(x)) + \frac{1+\tau\mu_g}{2\tau}\|x - x^0\|^2\Big). \tag{B.7}$$
26
Note, however, that this is no different from performing a first step of the forward–
backward descent scheme to the energy before actually implementing Nesterov’s iter-
ations.
will now assume, $\sqrt{q}\,t_k \le 1$ for all $k$. Finally, we also observe that
$$t_{k+1}^2 = (1 - q t_k^2)t_{k+1} + t_k^2,$$
showing that $t_k$ is an increasing sequence. It remains to estimate the factor
$$\theta_k = t_k^{-2}\prod_{n=0}^{k-1}\omega_n \quad \text{for } k \ge 1.$$
From (B.4) (with an equality) we find that
$$1 - \frac{1}{t_{k+1}} = \omega_k\frac{t_k^2}{t_{k+1}^2},$$
so
$$t_0^2\,\theta_k = \frac{t_0^2}{t_k^2}\prod_{n=0}^{k-1}\omega_n = \prod_{n=1}^{k}\Big(1 - \frac{1}{t_n}\Big) \le (1-\sqrt{q})^k$$
since $1/t_k \ge \sqrt{q}$. If $t_0 \ge 1$, then $\theta_k \le (1-\sqrt{q})^k/t_0^2$. If $t_0 \in [0,1)$, we instead write
$$\theta_k = \frac{\omega_0}{t_k^2}\prod_{n=1}^{k-1}\omega_n = \frac{\omega_0}{t_1^2}\prod_{n=2}^{k}\Big(1 - \frac{1}{t_n}\Big)$$
and observe that (B.9) yields (using $2 - q \ge 1 \ge q$)
$$t_1 = \frac{1 - q t_0^2 + \sqrt{1 + 2(2-q)t_0^2 + q^2 t_0^4}}{2} \ge 1.$$
Also, $\omega_0 \le 1 - q$ (from (B.3)), so that
$$\theta_k \le (1+\sqrt{q})(1-\sqrt{q})^k.$$
The next step is to bound $\theta_k$ by $O(1/k^2)$. It also follows from Nesterov
(2004, Lemma 2.2.4). In our notation, we have
$$\frac{1}{\sqrt{\theta_{k+1}}} - \frac{1}{\sqrt{\theta_k}} = \frac{\theta_k - \theta_{k+1}}{\sqrt{\theta_k}\sqrt{\theta_{k+1}}\big(\sqrt{\theta_k} + \sqrt{\theta_{k+1}}\big)} \ge \frac{\theta_k\big(1 - (1 - 1/t_{k+1})\big)}{2\theta_k\sqrt{\theta_{k+1}}}$$
since $\theta_k$ is non-increasing. It follows that
$$\frac{1}{\sqrt{\theta_{k+1}}} - \frac{1}{\sqrt{\theta_k}} \ge \frac{1}{2 t_{k+1}\sqrt{\theta_{k+1}}} = \frac{1}{2\sqrt{\prod_{n=0}^{k}\omega_n}} \ge \frac{1}{2},$$
showing that
$$\frac{1}{\sqrt{\theta_k}} \ge \frac{k-1}{2} + \frac{t_1}{\sqrt{\omega_0}} \ge \frac{k+1}{2}.$$
Hence, provided that $\sqrt{q}\,t_0 \le 1$, we also find
$$\theta_k \le \frac{4}{(k+1)^2}. \tag{B.10}$$
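The sequences $t_k$, $\omega_k$, $\theta_k$ can be generated numerically and checked against the bounds just derived; this is an illustrative sketch taking (B.4) with equality and $t_0 = 0$:

```python
import math

def fista_theta(k_max, q=0.0, t0=0.0):
    """Generate the relaxation sequence t_k from (B.4) with equality,
    t_{k+1}(t_{k+1} - 1) = (1 - q*t_{k+1}) * t_k^2, together with the
    decay factors theta_k = t_k^{-2} * prod(omega_n); for q = 0 this is
    the classical FISTA rule t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2."""
    t, prod_omega = t0, 1.0
    thetas = []
    for _ in range(k_max):
        # positive root of t_new^2 - (1 - q*t^2)*t_new - t^2 = 0
        b = 1.0 - q * t * t
        t_new = (b + math.sqrt(b * b + 4.0 * t * t)) / 2.0
        prod_omega *= 1.0 - q * t_new  # omega_k = 1 - q * t_{k+1}
        t = t_new
        thetas.append(prod_omega / (t * t))  # theta_{k+1}
    return thetas
```

For q = 0 the factors satisfy the O(1/k²) bound (B.10), and for q > 0 they decay geometrically like (1 + √q)(1 − √q)^k, matching the two regimes of the proof.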
Remark B.2 (constant steps). If $\mu > 0$ (which is $q > 0$), then an admissible choice which satisfies (B.3), (B.4) and (B.5) is to take $t = 1/\sqrt{q}$,
$\omega = 1 - \sqrt{q}$, and
$$\beta = \omega^2\frac{1+\tau\mu_g}{1-\tau\mu_f} = \frac{\sqrt{1+\tau\mu_g} - \sqrt{\tau\mu}}{\sqrt{1+\tau\mu_g} + \sqrt{\tau\mu}}.$$
Then (B.11) becomes
$$F(x^k) - F(x^*) \le (1-\sqrt{q})^k\Big(F(x^0) - F(x^*) + \mu\frac{\|x^0 - x^*\|^2}{2}\Big).$$
Remark B.3 (monotone algorithms). The algorithms studied here are
not necessarily ‘monotone’ in the sense that the objective F is not always
non-increasing. A workaround implemented in various papers (Tseng 2008,
Beck and Teboulle 2009) consists in choosing xk+1 to be any point for which
F (xk+1 ) ≤ F (Tτ y k ),27 which will not change (B.1) much except that, in the
last term, xk+1 should be replaced with Tτ y k . Then, the same computations
carry on, and it is enough to replace the update rule (B.6) for y k with
$$y^k = x^k + \beta_k(x^k - x^{k-1}) + \frac{t_k}{t_{k+1}}\,\omega_k\frac{1+\tau\mu_g}{1-\tau\mu_f}(T_\tau y^{k-1} - x^k)$$
$$= x^k + \beta_k\Big((x^k - x^{k-1}) + \frac{t_k}{t_k - 1}(T_\tau y^{k-1} - x^k)\Big) \tag{B.6$'$}$$
to obtain the same rates of convergence. The most sensible choice for xk+1
is to take Tτ y k if F (Tτ y k ) ≤ F (xk ), and xk otherwise (see the monotone
implementation in Beck and Teboulle 2009), in which case one of the two
terms (xk − xk−1 or Tτ y k−1 − xk ) vanishes in (B.6′ ).
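The monotone choice just described can be sketched as follows (illustrative; as footnote 27 notes, this only makes sense when F is cheap to evaluate):

```python
def monotone_step(x_prev, x_cand, F):
    """Monotone variant of the accelerated scheme: accept the
    forward-backward candidate T_tau(y^k) only if it does not increase
    the objective F, otherwise keep the previous iterate x^k."""
    return x_cand if F(x_cand) <= F(x_prev) else x_prev
```

Whichever branch is taken, one of the two difference terms in (B.6′) vanishes, which is what keeps the convergence-rate bookkeeping unchanged.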
27
This makes sense only if the evaluation of F is easy and does not take too much time.
Tao et al. (2015) recently suggested choosing xk+1 to be the point reach-
ing the minimum value between F (Tτ y k ) and F (Tτ xk ) (this requires ad-
ditional computation), hoping to attain the best rate of accelerated and
non-accelerated proximal descents, and thus obtain a linear convergence
rate for the standard ‘FISTA’ (µ = 0) implementation if F turns out to
be strongly convex. This is very reasonable and seems to be supported by
experiment, but we are not sure how to prove it.
$$f^*(y) - \langle K\tilde x, y\rangle + \frac{1}{2\sigma}\|y - \bar y\|^2 \ge f^*(\hat y) - \langle K\tilde x, \hat y\rangle + \frac{1}{2\sigma}\|\bar y - \hat y\|^2 + \frac{1}{2\sigma}\|\hat y - y\|^2,$$
where µg ≥ 0 is a convexity parameter for g, which we will consider in Sec-
tion C.2. Summing these two inequalities and rearranging, we obtain (C.1).
The PDHG algorithm corresponds to the choice (x̃, ỹ) = (2xk+1 − xk , y k ),
(x̂, ŷ) = (xk+1 , y k+1 ), (x̄, ȳ) = (xk , y k ). We deduce (assuming µg = 0) that
Equation (5.10) follows from the convexity of $(\xi, \eta) \mapsto L(\xi, y) - L(x, \eta)$, and
using
$$2\langle K(x - x^0), y - y^0\rangle \le \frac{\|x - x^0\|^2}{\tau} + \frac{\|y - y^0\|^2}{\sigma}.$$
Using (C.1) with (x̃, ỹ) = (xk+1 , y k + θk (y k − y k−1 )), we obtain that for all
(x, y) ∈ X × Y,
$$\frac{1}{2\tau_k}\|x - x^k\|^2 + \frac{1}{2\sigma_k}\|y - y^k\|^2$$
$$\ge L(x^{k+1}, y) - L(x, y^{k+1}) + \frac{1+\tau_k\mu_g}{2\tau_k}\|x - x^{k+1}\|^2 + \frac{1}{2\sigma_k}\|y - y^{k+1}\|^2$$
$$- \langle K(x^{k+1} - x), y^{k+1} - y^k\rangle + \theta_k\langle K(x^{k+1} - x), y^k - y^{k-1}\rangle$$
$$+ \frac{1 - \tau_k L_h}{2\tau_k}\|x^k - x^{k+1}\|^2 + \frac{1}{2\sigma_k}\|y^k - y^{k+1}\|^2.$$
Letting
$$\Delta_k(x, y) := \frac{\|x - x^k\|^2}{2\tau_k} + \frac{\|y - y^k\|^2}{2\sigma_k},$$
$$\frac{1+\tau_k\mu_g}{\tau_k} \ge \frac{1}{\theta_{k+1}\tau_{k+1}}, \tag{C.4}$$
$$\sigma_k = \theta_{k+1}\sigma_{k+1}, \tag{C.5}$$
we obtain
$$\theta_k\langle K(x^{k+1} - x^k), y^k - y^{k-1}\rangle \ge -\theta_k^2 L^2\sigma_k\tau_k\,\frac{\|x^k - x^{k+1}\|^2}{2\tau_k} - \frac{1}{2\sigma_k}\|y^k - y^{k-1}\|^2,$$
$$+ \frac{\sigma_k}{\sigma_0}\,\frac{1 - \theta_k^2 L^2\tau_k\sigma_k}{2\tau_k}\|x^k - x\|^2 + \frac{1}{2\sigma_0}\|y^k - y\|^2.$$
There are several choices of τk , σk , θk that will ensure a good rate of conver-
gence for the ergodic gap or for the distance kxk − xk; see Chambolle and
Pock (2015a) for a discussion. A simple choice, as in Chambolle and Pock
(2011), is to take, for k ≥ 0,
$$\theta_{k+1} = \frac{1}{\sqrt{1 + \mu_g\tau_k}}, \tag{C.8}$$
$$\tau_{k+1} = \theta_{k+1}\tau_k, \qquad \sigma_{k+1} = \frac{\sigma_k}{\theta_{k+1}}. \tag{C.9}$$
One can show that in this case, since
$$\frac{1}{\tau_{k+1}} = \frac{1}{\tau_k} + \frac{\mu_g}{1 + \sqrt{1 + \mu_g\tau_k}},$$
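The update rules (C.8)–(C.9) can be sketched as follows (an illustrative sketch; note that the product τ_kσ_k stays constant while τ_k decreases like O(1/k)):

```python
import math

def pdhg_steps(tau0, sigma0, mu_g, iters):
    """Accelerated primal-dual step sizes from (C.8)-(C.9):
    theta = 1/sqrt(1 + mu_g*tau), then tau <- theta*tau, sigma <- sigma/theta.
    The product tau*sigma is invariant under the update, preserving the
    step-size condition, while tau itself shrinks towards zero."""
    tau, sigma = tau0, sigma0
    for _ in range(iters):
        theta = 1.0 / math.sqrt(1.0 + mu_g * tau)
        tau, sigma = theta * tau, sigma / theta
    return tau, sigma
```

Keeping τσ fixed is the point of pairing (C.9)'s two updates: the convergence condition τσL² ≤ 1 chosen at k = 0 then holds for all k.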
REFERENCES
M. Aharon, M. Elad and A. Bruckstein (2006), ‘K-SVD: An algorithm for designing
overcomplete dictionaries for sparse representation', IEEE Transactions on
Signal Processing 54(11), 4311–4322.
R. K. Ahuja, T. L. Magnanti and J. B. Orlin (1993), Network flows, Prentice Hall
Inc., Englewood Cliffs, NJ. Theory, algorithms, and applications.
M. A. Aı̆zerman, È. M. Braverman and L. I. Rozonoèr (1964), ‘A probabilistic
problem on automata learning by pattern recognition and the method of
potential functions’, Avtomat. i Telemeh. 25, 1307–1323.
G. Alberti, G. Bouchitté and G. Dal Maso (2003), ‘The calibration method for
the Mumford-Shah functional and free-discontinuity problems’, Calc. Var.
Partial Differential Equations 16(3), 299–333.
Z. Allen-Zhu and L. Orecchia (2014), ‘Linear Coupling: An Ultimate Unification
of Gradient and Mirror Descent’, ArXiv e-prints.
M. Almeida and M. A. T. Figueiredo (2013), ‘Deconvolving images with un-
known boundaries using the alternating direction method of multipliers’,
IEEE Trans. on Image Processing 22(8), 3074 – 3086.
F. Alvarez (2003), ‘Weak convergence of a relaxed and inertial hybrid projection-
proximal point algorithm for maximal monotone operators in Hilbert space’,
SIAM J. on Optimization 14(3), 773–782.
F. Alvarez and H. Attouch (2001), ‘An inertial proximal method for maximal mono-
tone operators via discretization of a nonlinear oscillator with damping’, Set-
Valued Anal. 9(1-2), 3–11. Wellposedness in optimization and related topics
(Gargnano, 1999).
L. Ambrosio and S. Masnou (2003), ‘A direct variational approach to a problem
arising in image reconstruction’, Interfaces Free Bound. 5(1), 63–81.
L. Ambrosio and V. M. Tortorelli (1992), ‘On the approximation of free disconti-
nuity problems’, Boll. Un. Mat. Ital. B (7) 6(1), 105–123.
L. Ambrosio, N. Fusco and D. Pallara (2000), Functions of bounded variation and
free discontinuity problems, The Clarendon Press Oxford University Press,
New York.
A. Beck and L. Tetruashvili (2013), ‘On the convergence of block coordinate descent
type methods’, SIAM J. Optim. 23(4), 2037–2060.
S. Becker and J. Fadili (2012), A quasi-Newton proximal splitting method, in Advances in Neural Information Processing Systems 25, pp. 2627–2635.
S. Becker, J. Bobin and E. J. Candès (2011), ‘NESTA: a fast and accurate first-
order method for sparse recovery’, SIAM J. Imaging Sci. 4(1), 1–39.
S. R. Becker and P. L. Combettes (2014), ‘An algorithm for splitting parallel sums
of linearly composed monotone operators, with applications to signal recov-
ery’, J. Nonlinear Convex Anal. 15(1), 137–159.
J. Bect, L. Blanc-Féraud, G. Aubert and A. Chambolle (2004), A l1 -unified frame-
work for image restoration, in Proceedings ECCV 2004 (Prague) (T. Pajdla
and J. Matas, eds), number 3024 in ‘Lecture Notes in Computer Science’,
Springer, pp. 1–13.
A. Ben-Tal and A. Nemirovski (1998), ‘Robust convex optimization’, Math. Oper.
Res. 23(4), 769–805.
A. Ben-Tal and A. Nemirovski (2001), Lectures on modern convex optimiza-
tion, MPS/SIAM Series on Optimization, Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA; Mathematical Programming Society
(MPS), Philadelphia, PA. Analysis, algorithms, and engineering applications.
A. Ben-Tal, L. El Ghaoui and A. Nemirovski (2009), Robust optimization, Princeton
Series in Applied Mathematics, Princeton University Press, Princeton, NJ.
J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna and G. Peyré (2015), ‘Iterative
Bregman projections for regularized transportation problems’, SIAM J. Sci.
Comput. 37(2), A1111–A1138.
A. Benfenati and V. Ruggiero (2013), ‘Inexact Bregman iteration with an applica-
tion to Poisson data reconstruction’, Inverse Problems 29(6), 065016, 31.
D. P. Bertsekas (2015), Convex Optimization Algorithms, Athena Scientific.
D. P. Bertsekas and S. K. Mitter (1973), ‘A descent numerical method for opti-
mization problems with nondifferentiable cost functionals’, SIAM J. Control
11, 637–652.
J. Bioucas-Dias and M. Figueiredo (2007), ‘A new TwIST: two-step iterative
shrinkage/thresholding algorithms for image restoration’, IEEE Trans. on
Image Processing 16, 2992–3004.
A. Blake and A. Zisserman (1987), Visual Reconstruction, MIT Press.
P. Blomgren and T. F. Chan (1998), ‘Color TV: total variation methods for
restoration of vector-valued images’, IEEE Transactions on Image Processing
7(3), 304–309.
J. Bolte, A. Daniilidis and A. Lewis (2006), 'The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems', SIAM J. Optim. 17(4), 1205–1223 (electronic).
J. Bolte, S. Sabach and M. Teboulle (2014), ‘Proximal alternating linearized mini-
mization for nonconvex and nonsmooth problems’, Math. Program. 146(1-2,
Ser. A), 459–494.
S. Bonettini and V. Ruggiero (2012), ‘On the convergence of primal-dual hybrid
gradient algorithms for total variation image restoration’, J. Math. Imaging
Vision 44(3), 236–253.
J. V. Burke and M. Qian (1999), ‘A variable metric proximal point algorithm for
monotone operators’, SIAM J. Control Optim. 37(2), 353–375 (electronic).
J. V. Burke and M. Qian (2000), ‘On the superlinear convergence of the variable
metric proximal point algorithm using Broyden and BFGS matrix secant
updating’, Math. Program. 88(1, Ser. A), 157–181.
R. H. Byrd, P. Lu, J. Nocedal and C. Y. Zhu (1995), ‘A limited memory algorithm
for bound constrained optimization’, SIAM J. Sci. Comput. 16(5), 1190–1208.
J.-F. Cai, E. J. Candès and Z. Shen (2010), ‘A singular value thresholding algorithm
for matrix completion’, SIAM J. on Optimization 20(4), 1956–1982.
E. Candès, L. Demanet, D. Donoho and L. Ying (2006a), ‘Fast discrete curvelet
transforms’, Multiscale Model. Simul. 5(3), 861–899 (electronic).
E. J. Candès, X. Li, Y. Ma and J. Wright (2011), ‘Robust principal component
analysis?’, J. ACM 58(3), Art. 11, 37.
E. J. Candès, J. Romberg and T. Tao (2006b), ‘Robust uncertainty principles:
Exact signal reconstruction from highly incomplete frequency information’,
IEEE Trans. Inform. Theory pp. 489–509.
A. Chambolle (1994), Partial differential equations and image processing, in Pro-
ceedings 1994 International Conference on Image Processing, Austin, Texas,
USA, November 13-16, 1994, pp. 16–20.
A. Chambolle (1999), ‘Finite-differences discretizations of the Mumford-Shah func-
tional’, M2AN Math. Model. Numer. Anal. 33(2), 261–288.
A. Chambolle (2004a), ‘An algorithm for mean curvature motion’, Interfaces Free
Bound. 6(2), 195–218.
A. Chambolle (2004b), ‘An algorithm for total variation minimization and applica-
tions’, J. Math. Imaging Vision 20(1-2), 89–97. Special issue on mathematics
and image analysis.
A. Chambolle (2005), Total variation minimization and a class of binary MRF
models, in Energy Minimization Methods in Computer Vision and Pattern
Recognition, pp. 136–152.
A. Chambolle and J. Darbon (2009), ‘On total variation minimization and surface
evolution using parametric maximum flows’, Int. J. Comput. Vis. 84(3), 288–
307.
A. Chambolle and J. Darbon (2012), A parametric maximum flow approach for
discrete total variation regularization, in Image Processing and Analysis with
Graphs: Theory and Practice, CRC Press.
A. Chambolle and C. Dossal (2015), ‘On the convergence of the iterates of the fast
iterative shrinkage/thresholding algorithm’, Journal of Optimization Theory
and Applications 166(3), 968–982.
A. Chambolle and P.-L. Lions (1995), Image restoration by constrained total vari-
ation minimization and variants, in Investigative and Trial Image Processing,
San Diego, CA (SPIE vol. 2567), pp. 50–59.
A. Chambolle and P.-L. Lions (1997), ‘Image recovery via total variation minimiza-
tion and related problems’, Numer. Math. 76(2), 167–188.
A. Chambolle and T. Pock (2011), ‘A first-order primal-dual algorithm for convex
problems with applications to imaging’, J. Math. Imaging Vision 40(1), 120–
145.
A. Chambolle and T. Pock (2015a), ‘On the ergodic convergence rates of a first-
order primal-dual algorithm’, Mathematical Programming, pp. 1–35 (online
first).
A. Chambolle and T. Pock (2015b), ‘A remark on accelerated block coordinate
descent for computing the proximity operators of a sum of convex functions’,
SMAI-Journal of Computational Mathematics 1, 29–54.
A. Chambolle, D. Cremers and T. Pock (2012), ‘A convex approach to minimal
partitions’, SIAM J. Imaging Sci. 5(4), 1113–1158.
A. Chambolle, R. A. DeVore, N.-y. Lee and B. J. Lucier (1998), ‘Nonlinear
wavelet image processing: variational problems, compression, and noise re-
moval through wavelet shrinkage’, IEEE Trans. Image Process. 7(3), 319–335.
A. Chambolle, S. E. Levine and B. J. Lucier (2011), ‘An upwind finite-difference
method for total variation-based image smoothing’, SIAM J. Imaging Sci.
4(1), 277–299.
T. Chan and L. Vese (2001), ‘Active contours without edges’, IEEE Trans. Image
Processing 10(2), 266–277.
T. F. Chan and S. Esedoḡlu (2005), ‘Aspects of total variation regularized L1
function approximation’, SIAM J. Appl. Math. 65(5), 1817–1837 (electronic).
T. F. Chan and L. A. Vese (2002), Active contour and segmentation models using
geometric PDE’s for medical imaging, in Geometric methods in bio-medical
image processing, Math. Vis., Springer, Berlin, pp. 63–75.
T. F. Chan, S. Esedoḡlu and M. Nikolova (2006), ‘Algorithms for finding global min-
imizers of image segmentation and denoising models’, SIAM J. Appl. Math.
66(5), 1632–1648 (electronic).
T. F. Chan, G. H. Golub and P. Mulet (1999), ‘A nonlinear primal-dual method for
total variation-based image restoration’, SIAM J. Sci. Comput. 20(6), 1964–
1977 (electronic).
R. Chartrand and B. Wohlberg (2013), A nonconvex ADMM algorithm for group
sparsity with sparse groups, in Acoustics, Speech and Signal Processing
(ICASSP), 2013 IEEE International Conference on, IEEE, pp. 6009–6013.
G. Chen and M. Teboulle (1993), ‘Convergence analysis of a proximal-like mini-
mization algorithm using Bregman functions’, SIAM J. Optim. 3(3), 538–543.
G. Chen and M. Teboulle (1994), ‘A proximal-based decomposition method for
convex minimization problems’, Math. Programming 64(1, Ser. A), 81–101.
S. Chen and D. Donoho (1994), Basis pursuit, in 28th Asilomar Conf. on Signals,
Systems, and Computers, pp. 41–44.
S. S. Chen, D. L. Donoho and M. A. Saunders (1998), ‘Atomic decomposition by
basis pursuit’, SIAM J. Sci. Comput. 20(1), 33–61.
Y. Chen, G. Lan and Y. Ouyang (2014a), ‘Optimal primal-dual methods for a class
of saddle point problems’, SIAM J. Optim. 24(4), 1779–1814.
Y. Chen, R. Ranftl and T. Pock (2014b), ‘Insights into analysis operator learning:
From patch-based sparse models to higher order MRFs’, IEEE Transactions
on Image Processing 23(3), 1060–1072.
E. Chouzenoux, J.-C. Pesquet and A. Repetti (2014), ‘Variable metric forward-
backward algorithm for minimizing the sum of a differentiable function and
a convex function’, J. Optim. Theory Appl. 162(1), 107–132.
P. L. Davies and A. Kovac (2001), ‘Local extremes, runs, strings and multiresolu-
tion’, The Annals of Statistics 29(1), 1–65.
D. Davis (2015), ‘Convergence rate analysis of primal-dual splitting schemes’,
SIAM J. Optim. 25(3), 1912–1943.
D. Davis and W. Yin (2014a), ‘Convergence rate analysis of several splitting
schemes’, ArXiv e-prints.
D. Davis and W. Yin (2014b), ‘Faster convergence rates of relaxed Peaceman-
Rachford and ADMM under regularity assumptions’, ArXiv e-prints.
D. Davis and W. Yin (2015), A three-operator splitting scheme and its op-
timization applications, Technical report. CAM Report 15-13 / preprint
arXiv:1504.01032.
W. Deng and W. Yin (2015), ‘On the global and linear convergence of the gen-
eralized alternating direction method of multipliers’, Journal of Scientific
Computing pp. 1–28.
R. A. DeVore (1998), Nonlinear approximation, in Acta numerica, 1998, Vol. 7 of
Acta Numer., Cambridge Univ. Press, Cambridge, pp. 51–150.
D. L. Donoho (1995), ‘De-noising by soft-thresholding’, IEEE Trans. Inform. The-
ory 41(3), 613–627.
D. L. Donoho (2006), ‘Compressed sensing’, IEEE Trans. Inform. Theory
52(4), 1289–1306.
J. Douglas and H. H. Rachford (1956), ‘On the numerical solution of heat conduc-
tion problems in two and three space variables’, Transactions of the American
Mathematical Society 82, 421–439.
Y. Drori, S. Sabach and M. Teboulle (2015), ‘A simple algorithm for a class of non-
smooth convex-concave saddle-point problems’, Oper. Res. Lett. 43(2), 209–
214.
K.-B. Duan and S. S. Keerthi (2005), Which is the best multiclass SVM method?
An empirical study, in Multiple Classifier Systems: 6th International Workshop,
MCS 2005, Seaside, CA, USA, June 13–15, 2005, Proceedings (N. C. Oza,
R. Polikar, J. Kittler and F. Roli, eds), Vol. 3541 of LNCS, Springer,
pp. 278–285.
J. Duchi, S. Shalev-Shwartz, Y. Singer and T. Chandra (2008), Efficient projections
onto the ℓ1-ball for learning in high dimensions, in Proceedings of the 25th
International Conference on Machine Learning, ICML ’08, ACM, New York,
pp. 272–279.
F.-X. Dupé, M. J. Fadili and J.-L. Starck (2012), ‘Deconvolution under Poisson
noise using exact data fidelity and synthesis or analysis sparsity priors’, Stat.
Methodol. 9(1-2), 4–18.
J. Duran, M. Moeller, C. Sbert and D. Cremers (2016a), ‘Collaborative total vari-
ation: a general framework for vectorial TV models’, SIAM J. Imaging Sci.
9(1), 116–151.
J. Duran, M. Moeller, C. Sbert and D. Cremers (2016b), ‘On the implementation
of collaborative TV regularization: application to cartoon+texture decompo-
sition’, Image Processing On Line 6, 27–74.
http://dx.doi.org/10.5201/ipol.2016.141.
R. L. Dykstra (1983), ‘An algorithm for restricted least squares regression’, J.
Amer. Statist. Assoc. 78(384), 837–842.
G. Easley, D. Labate and W.-Q. Lim (2008), ‘Sparse directional image represen-
tations using the discrete shearlet transform’, Appl. Comput. Harmon. Anal.
25(1), 25–46.
J. Eckstein (1989), Splitting methods for monotone operators with applications
to parallel optimization, PhD thesis, Massachusetts Institute of Technology.
J. Eckstein (1993), ‘Nonlinear proximal point algorithms using Bregman functions,
with applications to convex programming’, Mathematics of Operations Re-
search 18(1), 202–226.
J. Eckstein and D. P. Bertsekas (1992), ‘On the Douglas-Rachford splitting method
and the proximal point algorithm for maximal monotone operators’, Math.
Programming 55(3, Ser. A), 293–318.
I. Ekeland and R. Témam (1999), Convex analysis and variational problems, Vol. 28
of Classics in Applied Mathematics, english edn, Society for Industrial and
Applied Mathematics (SIAM), Philadelphia, PA. Translated from the French.
E. Esser (2009), Applications of Lagrangian-based alternating direction methods
and connections to split Bregman, CAM Reports 09-31, UCLA, Center for
Applied Math.
E. Esser, X. Zhang and T. F. Chan (2010), ‘A general framework for a class of
first order primal-dual algorithms for convex optimization in imaging science’,
SIAM J. Imaging Sci. 3(4), 1015–1046.
L. C. Evans and R. F. Gariepy (1992), Measure theory and fine properties of func-
tions, CRC Press, Boca Raton, FL.
H. Federer (1969), Geometric measure theory, Springer-Verlag New York Inc., New
York.
O. Fercoq and P. Bianchi (2015), ‘A coordinate descent primal-dual algorithm
with large step size and possibly non-separable functions’, ArXiv e-prints.
O. Fercoq and P. Richtárik (2013a), Accelerated, parallel and proximal coordinate
descent, Technical report. arXiv:1312.5799.
O. Fercoq and P. Richtárik (2013b), Smooth minimization of nonsmooth functions
with parallel coordinate descent methods, Technical report. arXiv:1309.5885.
S. Ferradans, N. Papadakis, G. Peyré and J.-F. Aujol (2014), ‘Regularized discrete
optimal transport’, SIAM J. Imaging Sci. 7(3), 1853–1882.
M. Fortin and R. Glowinski (1982), Méthodes de lagrangien augmenté, Vol. 9
of Méthodes Mathématiques de l’Informatique [Mathematical Methods of In-
formation Science], Gauthier-Villars, Paris. Applications à la résolution
numérique de problèmes aux limites. [Applications to the numerical solution
of boundary value problems].
X. L. Fu, B. S. He, X. F. Wang and X. M. Yuan (2014), ‘Block-wise alternating
direction method of multipliers with Gaussian back substitution for multiple-
block convex programming’.
M. Fukushima and H. Mine (1981), ‘A generalized proximal point algorithm for cer-
tain nonconvex minimization problems’, Internat. J. Systems Sci. 12(8), 989–
1000.
D. Gabay (1983), Applications of the method of multipliers to variational in-
equalities, in Augmented Lagrangian Methods: Applications to the Solution of
Boundary-Value Problems (M. Fortin and R. Glowinski, eds), North-Holland,
Amsterdam.
L. Grippo and M. Sciandrone (2000), ‘On the convergence of the block nonlinear
Gauss-Seidel method under convex constraints’, Oper. Res. Lett. 26(3), 127–
136.
O. Güler (1991), ‘On the convergence of the proximal point algorithm for convex
minimization’, SIAM Journal on Control and Optimization 29, 403–419.
O. Güler (1992), ‘New proximal point algorithms for convex minimization’, SIAM
J. Optim. 2(4), 649–664.
K. Guo and D. Labate (2007), ‘Optimally sparse multidimensional representation
using shearlets’, SIAM J. Math. Anal. 39(1), 298–318.
K. Guo, G. Kutyniok and D. Labate (2006), Sparse multidimensional repre-
sentations using anisotropic dilation and shear operators, in Wavelets and
splines: Athens 2005, Mod. Methods Math., Nashboro Press, Brentwood,
TN, pp. 189–201.
W. Hashimoto and K. Kurata (2000), ‘Properties of basis functions generated by
shift invariant sparse representations of natural images’, Biological Cybernet-
ics 83(2), 111–118.
S. Hawe, M. Kleinsteuber and K. Diepold (2013), ‘Analysis operator learning and
its application to image reconstruction’, IEEE Transactions on Image Pro-
cessing 22(6), 2138–2150.
B. He and X. Yuan (2015a), ‘On non-ergodic convergence rate of Douglas-Rachford
alternating direction method of multipliers’, Numer. Math. 130(3), 567–577.
B. He and X. Yuan (2015b), ‘On the convergence rate of Douglas–Rachford operator
splitting method’, Math. Program. 153(2, Ser. A), 715–722.
B. S. He and X. M. Yuan (2015c), ‘Block-wise alternating direction method of mul-
tipliers for multiple-block convex programming and beyond’, SMAI-Journal
of computational mathematics 1, 145–174.
B. He, Y. You and X. Yuan (2014), ‘On the convergence of primal-dual hybrid
gradient algorithm’, SIAM J. Imaging Sci. 7(4), 2526–2537.
M. R. Hestenes (1969), ‘Multiplier and gradient methods’, J. Optimization Theory
Appl. 4, 303–320.
D. S. Hochbaum (2001), ‘An efficient algorithm for image segmentation, Markov
random fields and related problems’, J. ACM 48(4), 686–701 (electronic).
T. Hohage and C. Homann (2014), A generalization of the Chambolle-Pock al-
gorithm to Banach spaces with applications to inverse problems, Technical
report. arXiv:1412.0126.
M. Hong, Z.-Q. Luo and M. Razaviyayn (2014), ‘Convergence analysis of alter-
nating direction method of multipliers for a family of nonconvex problems’,
ArXiv e-prints.
B. K. P. Horn and B. G. Schunck (1981), ‘Determining optical flow’, Artif. Intell.
17(1-3), 185–203.
D. Hubel and T. Wiesel (1959), ‘Receptive fields of single neurones in the cat’s
striate cortex’, The Journal of Physiology 148(3), 574–591.
K. Ito and K. Kunisch (1990), ‘The augmented Lagrangian method for equality and
inequality constraints in Hilbert spaces’, Math. Programming 46, 341–360.
N. A. Johnson (2013), ‘A dynamic programming algorithm for the fused lasso and
ℓ0-segmentation’, J. Computational and Graphical Statistics.
G. Kanizsa (1979), Organization in Vision, Praeger, New York.